Space Station 13

by Exadv1
ID:1345084
 
Okay - so I've read through many of the forum posts - and it seems there are a lot of wonderfully intelligent people in these forums.

Now - I know the idea of getting a "Lag Free" station seems to be a bit of a "Myth" among people - but I figure we can put together some hardware specs with some software/code changes that would help all of us towards creating a lag free environment.

So - let's start with hardware - the game, from what I understand, runs on a single core/single thread - I'm assuming, without changing the BYOND code itself, there is absolutely NO way to force this onto a second thread? I mean... ABSOLUTELY NO way??

So basically - you're going to want a computer/server with fast single-core performance.

Secondly - RAM - I've heard people say usage spikes up to 1.5 GB - getting 8-16 GB in a computer/server isn't that difficult - so I think that'll be fine.

Hard drive - I've had some debates over this one - someone suggested to me once that moving the station to a RAM disk would be really good - my question is, how much information does this game transfer/read/write in a single one-hour run of the station? I've seen a lot of stations where, as the round goes on, it gets slower and slower - sometimes becoming nearly unplayable - so there's a bottleneck somewhere (even with servers that cap player counts). Something on these servers is running out of room (so let's figure out what that is together). If it were a pure connection/throughput problem, the lag would happen from the start of the round. My guess is that it's just the CPU running out of room as the game goes on (hmm, any way we can offload this weight from the CPU to something else?)

Software/Bug Fixes - this is pretty much the end of my knowledge here - as I'm just digging into the code now - so any concrete information anyone can provide on exactly what the station does during the process of a round would be great :)

Thank you! Hope everyone has a great day!

No amount of hardware can fix the lag, it's all in the code. Get the fastest CPU possible to help with the single core limitation but there's just too much going on in SS13 for it to run smoothly out of the box.

As for what to change to fix it?

Trade secret :)
I'll answer any general questions you have but it took me a while to figure out how to speed this game up and I'm still working on completing the idea.

I'll give you one hint, BYOND's profiler is not enough to figure it out. You'll need to rig up some debugging tools written in DM to watch what's going on and see the problems.


I don't know how familiar you are with BYOND and DM in general. world/New executes at the beginning of a round after a restart, world/Del when it is going away. client/New executes when a new player connects, client/Del when someone disconnects. That should be enough of a start to walk through and see what things are doing.
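
As a starting point for that kind of walkthrough, here is a minimal sketch of lifecycle logging. It uses only the built-in procs named above; the log messages are made up for illustration, and each override calls ..() so whatever the codebase already does in these procs still runs.

```dm
// Hypothetical logging hooks; the message text is illustrative only.
/world/New()
    ..()
    world.log << "world/New: round starting at world.time [world.time]"

/world/Del()
    world.log << "world/Del: shutting down at world.time [world.time]"
    ..()

/client/New()
    world.log << "client/New: [key] connected"
    return ..() // the default action attaches the player to a mob

/client/Del()
    world.log << "client/Del: [key] disconnected"
    return ..()
```

From here, similar log lines can be dropped into whatever those procs call to trace the order things happen in.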
For one idea, you really should look at what tick rate suits your server best, and like MagicMountain, a decent number of people have found out how to run a smoother environment.
Sounds good - I'll do what I can to take the advice, and our team will work on smoothing things out. I think that's why I started this thread: to sort of help everyone out in some way or another :)
In response to Sleepz
Sleepz wrote:
a decent number of people have found out how to run a smoother environment.

There are lots of people that understand parts of the problem but I haven't seen anyone else put it all together before. I just need more time to finish and I'll be able to prove it. About two months ago work exploded and just before that I'd finally figured out what to do and had just started grinding through it. There is an awful lot of code to fix. I'd say I've fixed somewhere between 5 and 10 percent of the problem code I know about on LLJK and that was a night and day difference. If I ever get done it'll be unreal.

Hell, if things are actually responsive this game might stay alive alongside the remake.

By the way sorry for not just outright saying everything. I won't be vague about it forever. Want to finish and enjoy a performance gap for a little while before giving away the gold, you know?
Okay, here's a little something that's verified and ready. One of the major problems with SS13 is constantly creating new image variables. Even if the icon state or overlay hasn't changed some atoms in ss13 will create new ones and slap them in over and over. This adds up and wastes a lot of CPU.

Here's an example of how to fix it. Take the APC code. (Forgive me if this doesn't match the public /tg/ code, I don't have that because we try to come up with original ideas and I'd rather not be influenced by something and accidentally steal it.) It has a proc updateicon() which changes the graphic based on the state of the APC such as whether the lid is open or closed or whether the power is working.

/obj/machinery/power/apc/proc/updateicon()
    if(opened)
        icon_state = "[ cell ? "apc2" : "apc1" ]" // if opened, show cell if it's inserted
        src.overlays = null // also delete all overlays
    else if(emagged)
        icon_state = "apcemag"
        src.overlays = null
        return
    else if(wiresexposed)
        icon_state = "apcwires"
        src.overlays = null
        return
    else
        icon_state = "apc0"

        // if closed, update overlays for channel status

        src.overlays = null

        overlays += image('power.dmi', "apcox-[locked]") // 0=blue 1=red
        overlays += image('power.dmi', "apco3-[charging]") // 0=red, 1=yellow/black 2=green


        if(operating)
            overlays += image('power.dmi', "apco0-[equipment]") // 0=red, 1=green, 2=blue
            overlays += image('power.dmi', "apco1-[lighting]")
            overlays += image('power.dmi', "apco2-[environ]")


So every time updateicon() is called (which is quite often) several new image objects are being created and the old ones need to be collected. Multiply this by the number of APCs on the map and you see why they are one of the most expensive machines.

Here's my answer:

/obj/machinery/power/apc/var/list/status_overlays

/obj/machinery/power/apc/proc/updateicon()

    if (isnull(status_overlays)) // if no status overlays list, this is first call
        status_overlays = new
        status_overlays.len = 5
        status_overlays[1] = image('power.dmi', "apcox-[locked]") // 0=blue 1=red
        status_overlays[2] = image('power.dmi', "apco3-[charging]") // 0=red, 1=yellow/black 2=green

        status_overlays[3] = image('power.dmi', "apco0-[equipment]") // 0=red, 1=green, 2=blue
        status_overlays[4] = image('power.dmi', "apco1-[lighting]")
        status_overlays[5] = image('power.dmi', "apco2-[environ]")

    if(opened)
        icon_state = "[ cell ? "apc2" : "apc1" ]" // if opened, show cell if it's inserted
        if (overlays.len) overlays.len = 0 // also delete all overlays
        return
    else if(emagged)
        icon_state = "apcemag"
        if (overlays.len) overlays.len = 0
        return
    else if(wiresexposed)
        icon_state = "apcwires"
        if (overlays.len) overlays.len = 0
        return
    else
        icon_state = "apc0"

        // if closed, update overlays for channel status

        if (overlays.len) overlays.len = 0

        var/image/buffer

        buffer = status_overlays[1]
        buffer.icon_state = "apcox-[locked]"

        buffer = status_overlays[2]
        buffer.icon_state = "apco3-[charging]"

        buffer = status_overlays[3]
        buffer.icon_state = "apco0-[equipment]"

        buffer = status_overlays[4]
        buffer.icon_state = "apco1-[lighting]"

        buffer = status_overlays[5]
        buffer.icon_state = "apco2-[environ]"

        overlays += status_overlays[1]
        overlays += status_overlays[2]

        if(operating)
            overlays += status_overlays[3]
            overlays += status_overlays[4]
            overlays += status_overlays[5]


Those icon sets never change; the APC just uses different icon states in different combinations. So why load them over and over? This way they're loaded only once and future calls move them around as needed.

APCs are just one example. This problem is all over the place and it's not just images. One of the biggest ways to save CPU in SS13's code is to find resources that don't change but are being generated over and over, and cache them instead.
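
The same caching idea can be sketched as a shared lookup table, so that identical images are built once for the whole map rather than once per object. Everything named here (image_cache, cached_image) is hypothetical and not taken from any SS13 codebase. It relies on the cached images never being modified after creation, since every caller receives the same object, and it assumes the icon file turns into a distinct name when embedded in text.

```dm
// Hypothetical shared cache: one image per (file, icon state) pair.
var/list/image_cache = list()

/proc/cached_image(file, state)
    var/key = "[file]|[state]" // assumes the file embeds as a distinct name
    var/image/I = image_cache[key]
    if(!I)
        // First request for this combination: build it and remember it.
        I = image(icon = file, icon_state = state)
        image_cache[key] = I
    return I

// Possible use inside an icon update proc:
//     overlays += cached_image('power.dmi', "apcox-[locked]")
```

The trade-off is a little memory held for the life of the round in exchange for not creating and collecting the same images repeatedly.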

Another way to attack the APC problem is to look at why updateicon() is being called so often and change it so it's only called when the icon has definitely changed and needs to be redrawn. That's another problem pattern that's everywhere: stuff being called over and over instead of as needed.
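
One possible way to get the "only when it changed" behavior without hunting down every caller is a cheap guard in front of the expensive proc. This is a sketch only: last_icon_key and updateicon_if_changed are made-up names, and the state variables are the ones that appear in the APC code quoted above.

```dm
// Hypothetical guard: remember the state the icon was last drawn for.
/obj/machinery/power/apc/var/last_icon_key

/obj/machinery/power/apc/proc/updateicon_if_changed()
    // Fold everything updateicon() looks at into one comparable string.
    var/key = "[opened]-[emagged]-[wiresexposed]-[cell ? 1 : 0]-[locked]-[charging]-[operating]-[equipment]-[lighting]-[environ]"
    if(key == last_icon_key)
        return // nothing visible changed, skip the redraw
    last_icon_key = key
    updateicon()
```

Building the string is not free, so this only pays off where the redraw itself is the expensive part. Fixing the callers so they stop asking for pointless updates is still the cleaner fix.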
Here's another pattern that can be applied in many laggy procs to help reduce the impact on players.

One thing I didn't understand about BYOND at first is that multiple procs can be running concurrently, as if there were multiple threads on a single core, but it doesn't happen by default. Unless a proc explicitly sets background = 1 or sleeps, it will never be interrupted until it finishes. Sometimes loops in SS13 take a long time to complete and during that time the server isn't responding to player input at all. That's a lag spike!

set background = 1 is a little dangerous because you have no control over when you get interrupted and no idea whether it happened. That makes it impossible to really trust your variables if the code is looking at a lot of stuff over a long period. But sleep, sleep is under your control.

So here's a general thing that I've stuck all over the place in laggy loops as a sort of bandaid.

// somewhere that the vars can be accessed by the laggy loop
var/tmp/sleep_check = 0 // buffer for checking elapsed time
var/tmp/work_length = 2 // how long to run before yielding cpu, in 1/10 s (one tick at the default tick_lag)
var/tmp/sleep_length = 5 // how long to yield, in 1/10 s

// before the start of the loop
sleep_check = world.timeofday

// inside the loop
if ( ((world.timeofday - sleep_check) > work_length) || ((world.timeofday - sleep_check) < 0) )
    sleep(sleep_length)
    sleep_check = world.timeofday


If world.timeofday rolls over it'll just trigger an extra sleep, not the end of the world. The variables can be adjusted to give the desired balance between working and sleeping. Yes I overuse parentheses, deal with it :colbert:.

The dangerous part is that variables being used in the loop might have changed while the proc was asleep, so it's important to understand the code you're changing or you can introduce runtime errors or unexpected behavior. Just like real multithreading! It's easy enough to deal with: either stick some extra sanity checks after the sleep, or put the sleeping code at the end of the loop so it'll just wake up, start with a fresh thing and run through it completely before checking to sleep again.
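
To make the "sleep at the end of the loop" option concrete, here is a rough sketch of the check placed after each item has been fully handled. The proc name process_all_machines is made up, and it assumes machinery has a process() proc, as SS13 codebases generally do.

```dm
// Hypothetical example: work through a list, yielding between items.
/proc/process_all_machines(list/machines)
    var/sleep_check = world.timeofday
    var/work_length = 2  // 1/10 s of work before yielding
    var/sleep_length = 5 // 1/10 s to yield

    // The typed loop only picks up entries that are still machinery,
    // so anything deleted while this proc slept is passed over.
    for(var/obj/machinery/M in machines)
        M.process() // assumed to exist on machinery

        // Check at the END of the iteration: M is completely handled,
        // so nothing half-finished is held across the sleep.
        var/elapsed = world.timeofday - sleep_check
        if(elapsed > work_length || elapsed < 0)
            sleep(sleep_length)
            sleep_check = world.timeofday
```

With the check at the end, the only state carried across the sleep is the loop itself, which keeps the number of sanity checks needed after waking to a minimum.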

I hope this is helpful, if anyone has any questions I'll be happy to try and answer. More to come, cheers.
That's really quite awesome, thanks.
In response to Mloc
Mloc wrote:
That's really quite awesome, thanks.

And interesting. Great thing I decided to check these forums..
Oh, hey, people ARE reading this! I was getting a little sad. Maybe I'll write up another post sometime.
In response to MagicMountain
MagicMountain wrote:
Oh, hey, people ARE reading this! I was getting a little sad. Maybe I'll write up another post sometime.

I'm reading and learning all at once, I didn't know you could use a list as a kind of "buffer", so I learned something new. But yeah, that'd be cool.
We've made a lot of progress implementing this stuff, one of the remake coders even came back and helped out for a while. I know most people aren't going to leave their home server but check out a round on LLJK sometime.

I haven't had much free time but I've thought of a couple more topics. What sounds most helpful, a thing on using the profiler and writing debug code to track down laggy procs? A more detailed explanation of sleep and spawn? Maybe an explanation of good and bad practices when writing new code for ss13?
Anything that helps find laggy procs sounds great.
Yep, definitely profiling. I've got some laggy procs under the mob controller that I just can't figure out.
Hi, I didn't forget about you all I'm just getting crushed at work. Someday I'll effortpost again I promise.

edit 11/5 - Holy moly we just found and patched a doozy. This'll be a perfect case study in using the profiler effectively. Should have it written up sometime this week.
This is overdue but here goes. The profiler is one of the most powerful tools for finding and solving problems. It's really flexible and depending on how it's used you can learn a lot about what's happening with the code.


When I'm profiling I'll usually do one of two things; I'll either leave it running over a long period of time or take a bunch of short snapshots over a brief period. The reason for the short snapshots is averages don't always give you the whole picture. Sometimes the same bit of code can spike up with a much longer execution time under certain conditions.

The first method is pretty simple to understand, just fire up the profiler and go. I accomplish the second one by refreshing the profiler, copying the text, clearing the profiler and then pasting it into Notepad++ and hitting ctrl+N to open a new blank window. After doing this for a while it's more than fast enough, normally I'm grabbing 10 or 15 second chunks at a time.

Either way, sorting is key. Get totals and averages sorted by all three columns, they're all useful in different ways. It's annoying and you end up juggling 40 text files but they're gold, platinum even. So, that's how I gather the data but what does it mean? What do you do with it? Let's look at some.


All the data in this post is from LLJK, about three weeks ago. This is just one of many snapshots from the session. Using this we were able to focus and make some pretty significant improvements and lately players are consistently reporting a better experience. Profiling!

                   Profile results (total time)

Proc Name                                  Self CPU  Total CPU  Real Time    Calls
----------------------------------------   --------  ---------  ---------   ------
/datum/Del                                   34.005    158.478    158.500    49230
/datum/controller/process_loop/start          0.003     49.469      0.000        3
/datum/controller/process_loop/process        0.044     49.465      0.000        3
/mob/living/carbon/human/update_clothing      3.253     29.668     29.666      538
/proc/sd_Update                               1.841     28.617     28.616      559
/atom/proc/sd_SetOpacity                      0.005     28.610     28.609      472
/obj/machinery/door/proc/toggle_opacity       0.003     28.347     28.346      429
/icon/proc/RscFile                           28.337     28.337     28.336     3091
/proc/AStar                                   1.833     21.260     51.764      162
/obj/machinery/door/proc/close                0.012     19.954    869.875      480
/datum/guardbot_mover/proc/master_move        0.494     19.629    481.601       43
/atom/proc/default_click                      0.069     17.950    146.014     1876
/turf/proc/sd_LumUpdate                      17.509     17.705     17.772   329309
/client/Move                                  0.097     14.589     14.593    17034
/client/Move                                  0.704     14.491     14.496    17034
/atom/proc/sd_ApplyLum                        8.958     13.457     13.459     6776


So this is about 40 minutes of gameplay, set to show total time and sorted by Total CPU. We can see from the data that players' update_clothing is eating up quite a lot of cpu time, thanks mostly to /icon/proc/RscFile. Looks like sd_lighting is another major offender through sheer call volume besides CPU use. Strangely enough /datum/Del is by far the worst user of CPU. Since /datum is the root of all types that's the overhead of all deletions of everything. Pretty out of control!


The 'Self CPU' column means the amount of CPU time used by the code within that proc only, excluding calls to other procs. When the self CPU is high you have the source of your problem, either change the code to make it cheaper or somehow call it less often.

'Total CPU' is the amount of CPU time used by the proc and everything it calls. I find this the most useful, but it's a little deceptive. The same CPU time can be listed many times in the total cpu column because of code calling other code, compare with the Self CPU to find where the actual work is happening.

'Real Time' is how long it actually took to complete. If some code's Total CPU and Real Time are equal that code never sleeps or yields the CPU. If you have some expensive code and it can't be optimized any further you can still improve the player's experience by breaking it up across multiple ticks with sleep().

'Calls' is fairly self explanatory. Code with a very high number of calls benefits more from small optimizations. It's also useful for bug fixing, you might find that something is being called far more often than you intended.
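
One way to combine the Calls column with the CPU columns is to work out the cost of a single call. A tiny helper for that arithmetic follows; per_call_ms is a made-up name, and it assumes the profiler reports times in seconds.

```dm
// Hypothetical helper: average cost of one call, in milliseconds.
/proc/per_call_ms(cpu_seconds, calls)
    if(!calls)
        return 0 // never called, nothing to average
    return cpu_seconds / calls * 1000

// Fed with rows from the total-time snapshot above:
//     per_call_ms(158.478, 49230)   // /datum/Del, Total CPU
//     per_call_ms(28.337, 3091)     // /icon/proc/RscFile, Self CPU
//     per_call_ms(17.509, 329309)   // /turf/proc/sd_LumUpdate, Self CPU
```

Those work out to roughly 3.2 ms per deletion, 9.2 ms per RscFile call and 0.05 ms per sd_LumUpdate call. The first two are expensive every time they run, while sd_LumUpdate is cheap per call and only hurts because it runs hundreds of thousands of times, so the fix there is fewer calls rather than a faster proc.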



Here's another view, one of the shorter snapshots sorted by Total CPU and set to average time.

                 Profile results (average time)

Proc Name                                   Self CPU  Total CPU  Real Time    Calls
-----------------------------------------   --------  ---------  ---------   ------
/proc/explosion                                0.003      1.070      1.111        5
/mob/proc/throw_impacted                       0.000      0.643      0.643       36
/obj/critter/killertomato/ChaseAttack          0.000      0.635      0.635       30
/obj/machinery/bot/guardbot/proc/explode       0.005      0.618      3.512        2
/mob/living/carbon/human/ex_act                0.000      0.562      0.562        6
/obj/item/device/pda2/proc/post_signal         0.000      0.462      1.335        7
/atom/proc/throw_impact                        0.000      0.437      0.437       57
/obj/critter/martian/proc/MartianPsyblast      0.001      0.396      0.395        2


The key here is all those procs with the same Total CPU and Real Time. That code is taking a serious amount of time to complete a single call and while it's using the CPU the server is going to queue or outright drop player input and give you a bad experience.



Consider this part 1 I guess, it's already getting way too long and there's still a lot more I'd like to write. Formatting to come later, too.
Dropping by to say: great work MagicMountain!

Yeah, Del() is enormously expensive. It iterates through the game world looking for references to whatever you're deleting so it can clear those references, and this gets absurdly costly whenever a bunch of things are getting deleted at once.
In response to Dr. Cogwerks
Dr. Cogwerks wrote:
Yeah, Del() is enormously expensive. It iterates through the game world looking for references to whatever you're deleting so it can clear those references, and this gets absurdly costly whenever a bunch of things are getting deleted at once.

I've learned that using src = null seems to work slightly more efficiently. I may post in a few moments with some tests, who knows.

Another option is to delete objects periodically in batches when you have to delete a LOT. That may prove more efficient than deleting them throughout your main code. But this one I won't test..
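
For what it's worth, the periodic deletion idea could look something like the sketch below. It is untested, every name in it (delete_queue, queue_del, delete_queue_loop) is hypothetical, and it does not make any single deletion cheaper; it only spreads the cost out so a burst of deletions doesn't land in a single tick.

```dm
// Hypothetical deferred deletion queue; untested sketch.
var/list/delete_queue = list()

// Call this instead of del() for deletions that aren't urgent.
/proc/queue_del(datum/D)
    if(!D)
        return
    if(istype(D, /atom/movable))
        var/atom/movable/AM = D
        AM.loc = null // take it off the map right away
    delete_queue += D

// Start once, for example with spawn() from world/New.
/proc/delete_queue_loop(batch_size = 20)
    while(1)
        var/count = 0
        while(delete_queue.len && count < batch_size)
            var/datum/D = delete_queue[delete_queue.len]
            delete_queue.len-- // pop from the end of the list
            if(D)
                del(D)
            count++
        sleep(1) // let the server do other work before the next batch
```

The catch is that queued objects still exist until their batch comes up, so any code that relies on something being gone immediately after deletion would need checking first.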