Space Station 13

SS13 Optimization (Creating a Lag Free Environment)

ID:1345084

Aug 5 2013, 6:55 am

Okay - so I've read through many of the forum posts - and it seems there are a lot of wonderfully intelligent people in these forums.

Now - I know the idea of getting a "Lag Free" station seems to be a bit of a "Myth" among people - but I figure we can put together some hardware specs with some software/code changes that would help all of us towards creating a lag free environment.

So - let's start hardware - the game, from what I understand, runs off a single core/single thread - I'm assuming, without changing the byond code itself, there is absolutely NO way to force this into a second thread? I mean... ABSOLUTELY NO way??

So basically - you're going to want a computer/server that is fast on a single processor.

Secondly - RAM - I've heard people say it spikes up to 1.5 GB of RAM - getting 8-16GB in a computer/server isn't that difficult - so I think that'll be fine.

HardDrive - I've had some debates over this one - someone suggested to me once that moving the station to a RAM disk would be really good. My question is how much data this game transfers/reads/writes in a single 1 hour run of the station. I've seen a lot of stations where, as the round goes on, it gets slower and slower - sometimes becoming nearly unplayable - so there's a bottleneck somewhere (even on servers that cap player counts). Something on these servers is running out of room (so let's figure out what that is together). If it was purely a connection/throughput problem, the lag would happen from the start of the round. My guess is that it's just the CPU running out of room as the game goes on (hmm, any way we can offload this weight on the CPU to something else?)

Software/Bug Fixes - this is pretty much the end of my knowledge here - as I'm just digging into the code now - so any concrete information anyone can provide on exactly what the station does during the process of a round would be great :)

Thank you! Hope everyone has a great day!

Pro

Aug 6 2013, 4:12 pm
MagicMountain	No amount of hardware can fix the lag, it's all in the code. Get the fastest CPU possible to help with the single core limitation but there's just too much going on in SS13 for it to run smoothly out of the box. As for what to change to fix it? Trade secret :)

Aug 6 2013, 4:21 pm (Edited on Aug 6 2013, 4:26 pm)

MagicMountain

I'll answer any general questions you have but it took me a while to figure out how to speed this game up and I'm still working on completing the idea.

I'll give you one hint, BYOND's profiler is not enough to figure it out. You'll need to rig up some debugging tools written in DM to watch what's going on and see the problems.

I don't know how familiar you are with BYOND and DM in general. world/New executes at the beginning of a round after a restart, world/Del when it is going away. client/New executes when a new player connects, client/Del when someone disconnects. That should be enough of a start to walk through and see what things are doing.
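A minimal sketch of tracing those four entry points, assuming only BYOND's built-in world/New, world/Del, client/New and client/Del. The log messages are placeholders, and an SS13 codebase will already override some of these, so the log lines would more likely be merged into the existing definitions.

```dm
// Illustrative tracing hooks; message text is made up for this example.
world/New()
    ..()
    world.log << "world/New at [time2text(world.timeofday, "hh:mm:ss")]"

world/Del()
    world.log << "world/Del at [time2text(world.timeofday, "hh:mm:ss")]"
    ..()

client/New()
    world.log << "client/New: [key] connecting"
    return ..()    // pass the default result back so the player still gets a mob

client/Del()
    world.log << "client/Del: [key] leaving"
    return ..()
```

From here you can follow what each hook calls and add more log lines as you go.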

Aug 9 2013, 1:45 am
Sleepz	For one idea, you really should look at what tickrate suits your server best, and as MagicMountain said, a decent number of people have found out how to run a smoother environment.

Aug 11 2013, 6:29 am
ProStasisX	sounds good - I'll do what I can to take the advice, and our team will work on smoothing things out. I think that's why I started this thread was to sort of help everyone out in some way or another :)

Sep 3 2013, 4:43 pm (Edited on Sep 15 2013, 8:16 pm)

In response to Sleepz

MagicMountain

Sleepz wrote:

a decent number of people have found out how to run a smoother environment.

There are lots of people that understand parts of the problem but I haven't seen anyone else put it all together before. I just need more time to finish and I'll be able to prove it. About two months ago work exploded and just before that I'd finally figured out what to do and had just started grinding through it. There is an awful lot of code to fix. I'd say I've fixed somewhere between 5 and 10 percent of the problem code I know about on LLJK and that was a night and day difference. If I ever get done it'll be unreal.

Hell, if things are actually responsive this game might stay alive alongside the remake.

By the way sorry for not just outright saying everything. I won't be vague about it forever. Want to finish and enjoy a performance gap for a little while before giving away the gold, you know?

Sep 8 2013, 10:00 pm (Edited on Sep 8 2013, 10:12 pm)

MagicMountain

Okay, here's a little something that's verified and ready. One of the major problems with SS13 is constantly creating new image objects. Even if the icon state or overlay hasn't changed, some atoms in SS13 will create new ones and slap them in over and over. This adds up and wastes a lot of CPU.

Here's an example of how to fix it. Take the APC code. (Forgive me if this doesn't match the public /tg/ code, I don't have that because we try to come up with original ideas and I'd rather not be influenced by something and accidentally steal it.) It has a proc updateicon() which changes the graphic based on the state of the APC such as whether the lid is open or closed or whether the power is working.

/obj/machinery/power/apc/proc/updateicon()
    if(opened)
        icon_state = "[ cell ? "apc2" : "apc1" ]"       // if opened, show cell if it's inserted
        src.overlays = null                             // also delete all overlays
    else if(emagged)
        icon_state = "apcemag"
        src.overlays = null
        return
    else if(wiresexposed)
        icon_state = "apcwires"
        src.overlays = null
        return
    else
        icon_state = "apc0"

        // if closed, update overlays for channel status

        src.overlays = null

        overlays += image('power.dmi', "apcox-[locked]")    // 0=blue 1=red
        overlays += image('power.dmi', "apco3-[charging]") // 0=red, 1=yellow/black 2=green


        if(operating)
            overlays += image('power.dmi', "apco0-[equipment]") // 0=red, 1=green, 2=blue
            overlays += image('power.dmi', "apco1-[lighting]")
            overlays += image('power.dmi', "apco2-[environ]")

So every time updateicon() is called (which is quite often) several new image objects are being created and the old ones need to be collected. Multiply this by the number of APCs on the map and you see why they are one of the most expensive machines.

Here's my answer:

/obj/machinery/power/apc/var/list/status_overlays

/obj/machinery/power/apc/proc/updateicon()

    if (isnull(status_overlays)) // if no status overlays list, this is first call
        status_overlays = new
        status_overlays.len = 5
        status_overlays[1] = image('power.dmi', "apcox-[locked]")    // 0=blue 1=red
        status_overlays[2] = image('power.dmi', "apco3-[charging]") // 0=red, 1=yellow/black 2=green

        status_overlays[3] = image('power.dmi', "apco0-[equipment]") // 0=red, 1=green, 2=blue
        status_overlays[4] = image('power.dmi', "apco1-[lighting]")
        status_overlays[5] = image('power.dmi', "apco2-[environ]")

    if(opened)
        icon_state = "[ cell ? "apc2" : "apc1" ]"       // if opened, show cell if it's inserted
        if (overlays.len) overlays.len = 0               // also delete all overlays
        return
    else if(emagged)
        icon_state = "apcemag"
        if (overlays.len) overlays.len = 0
        return
    else if(wiresexposed)
        icon_state = "apcwires"
        if (overlays.len) overlays.len = 0
        return
    else
        icon_state = "apc0"

        // if closed, update overlays for channel status

        if (overlays.len) overlays.len = 0

        var/image/buffer

        buffer = status_overlays[1]
        buffer.icon_state = "apcox-[locked]"

        buffer = status_overlays[2]
        buffer.icon_state = "apco3-[charging]"

        buffer = status_overlays[3]
        buffer.icon_state = "apco0-[equipment]"

        buffer = status_overlays[4]
        buffer.icon_state = "apco1-[lighting]"

        buffer = status_overlays[5]
        buffer.icon_state = "apco2-[environ]"

        overlays += status_overlays[1]
        overlays += status_overlays[2]

        if(operating)
            overlays += status_overlays[3]
            overlays += status_overlays[4]
            overlays += status_overlays[5]

Those icon sets never change, the APC just uses different icon states in different combinations. So, why load them over and over? This way they're loaded only once and future calls move them around as needed.

APCs are just one example, this problem is all over the place and it's not just images. One of the biggest ways to save CPU in SS13's code is find resources that don't change but are being generated over and over and cache them instead.
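The same caching idea can be sketched more generally as a shared lookup, so that any atom asking for the same icon file and state reuses one image. This is only an illustration: image_cache and get_cached_image() are hypothetical names, not part of any SS13 codebase, and it assumes callers only add the returned image to overlays and never modify it.

```dm
// Hypothetical global cache: "file:state" -> image
var/list/image_cache = list()

// Return a shared image for this icon file and state, creating it on first use.
/proc/get_cached_image(icon_file, state)
    var/key = "[icon_file]:[state]"
    var/image/I = image_cache[key]      // null if nothing is cached under this key yet
    if(!I)
        I = image(icon_file, icon_state = state)
        image_cache[key] = I
    return I
```

Usage would look like overlays += get_cached_image('power.dmi', "apcox-[locked]"). The cache grows by one entry per distinct file and state combination, which for status lights like these is a small fixed number.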

Another way to attack the APC problem is look at why updateicon() is being called so often and change it so it's only called when the icon definitely has changed and needs to be redrawn. That's another problem pattern that's everywhere, stuff being called over and over instead of as needed.
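One hedged way to sketch "only redraw when something changed" without hunting down every caller is a guard that remembers the last state it drew. The var last_icon_key and the proc update_icon_if_changed() are hypothetical; the state vars are the ones used by updateicon() above.

```dm
// Hypothetical: remembers the state combination that was last drawn.
/obj/machinery/power/apc/var/last_icon_key

// Call this instead of updateicon(); it skips the redraw when nothing relevant changed.
/obj/machinery/power/apc/proc/update_icon_if_changed()
    var/key = "[opened]-[emagged]-[wiresexposed]-[cell ? 1 : 0]-[locked]-[charging]-[operating]-[equipment]-[lighting]-[environ]"
    if(key == last_icon_key)
        return              // same visual state as last time, nothing to redraw
    last_icon_key = key
    updateicon()
```

Building the key string still costs something on every call, so this is a stopgap. Calling updateicon() only from the places that actually change one of those vars is the cleaner fix.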

Sep 15 2013, 8:09 pm (Edited on Oct 6 2013, 9:07 am)

MagicMountain

Here's another pattern that can be applied in many laggy procs to help reduce the impact on players.

One thing I didn't understand about BYOND at first is that multiple procs can be running concurrently as if there were multiple threads on a single core but it doesn't happen by default. Unless a proc explicitly sets itself as background = 1 or sleeps it will never be interrupted until it finishes. Sometimes some loops in SS13 take a long time to complete and during that time the server isn't responding to player input at all. That's a lag spike!

set background = 1 is a little dangerous because you have no control over when you get interrupted and no idea whether it happened or not. That makes it impossible to really trust your variables if the code is looking at a lot of stuff over a long period. But sleep, sleep is under your control.

So here's a general thing that I've stuck all over the place in laggy loops as a sort of bandaid.

// somewhere that the vars can be accessed by the laggy loop
var/tmp/sleep_check = 0 // world.timeofday when the current stretch of work started
var/tmp/work_length = 2 // how long to run before yielding cpu, in 1/10 s (one tick at the default tick_lag)
var/tmp/sleep_length = 5 // how long to yield, in 1/10 s

// before the start of the loop
sleep_check = world.timeofday

// inside the loop
if ( ((world.timeofday - sleep_check) > work_length) || ((world.timeofday - sleep_check) < 0) )
    sleep(sleep_length)
    sleep_check = world.timeofday

If world.timeofday rolls over it'll just trigger an extra sleep, not the end of the world. The variables can be adjusted to give the desired balance between working and sleeping. Yes I overuse parentheses, deal with it :colbert:.

The dangerous part is variables being used in the loop might have changed while the proc was asleep so it's important to understand the code you're changing or you can introduce runtime errors or unexpected behavior. Just like real multithreading! It's easy enough to deal with, either stick some extra sanity checks after the sleep or put the sleeping code at the end of the loop so it'll just wake up and start with a fresh thing and run through it completely before checking to sleep again.
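Putting the pieces together, here is a sketch of the "check at the end of the loop" variant. The /datum/work_item type, its do_work() proc and process_work_items() are hypothetical stand-ins for whatever the laggy loop really iterates over. Only the timing check comes from the snippet above.

```dm
// Hypothetical work item, defined only so the example is self-contained.
/datum/work_item
    var/done = 0

/datum/work_item/proc/do_work()
    done = 1

/proc/process_work_items(list/items)
    var/sleep_check = world.timeofday   // when the current stretch of work started
    var/work_length = 2                 // run this long (1/10 s) before yielding
    var/sleep_length = 5                // yield for this long (1/10 s)

    // The typed loop var skips anything that is not a /datum/work_item,
    // including null entries left behind by items deleted during a sleep.
    for(var/datum/work_item/W in items)
        W.do_work()

        // Yield check goes last, so no item is left half-processed across a sleep.
        if(((world.timeofday - sleep_check) > work_length) || ((world.timeofday - sleep_check) < 0))
            sleep(sleep_length)
            sleep_check = world.timeofday
```

Because the proc sleeps, a caller that shouldn't wait for it would start it with spawn() process_work_items(my_items).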

I hope this is helpful, if anyone has any questions I'll be happy to try and answer. More to come, cheers.

Oct 5 2013, 12:54 am
Mloc	That's really quite awesome, thanks.

Oct 6 2013, 3:21 am In response to Mloc
Laser50

Mloc wrote:

That's really quite awesome, thanks.

And interesting. Great thing I decided to check these forums.

Oct 6 2013, 4:52 am
MagicMountain	Oh, hey, people ARE reading this! I was getting a little sad. Maybe I'll write up another post sometime.

Oct 6 2013, 5:05 am In response to MagicMountain
Laser50

MagicMountain wrote:

Oh, hey, people ARE reading this! I was getting a little sad. Maybe I'll write up another post sometime.

I'm reading and learning all at once. I didn't know you could use a list as a kind of "buffer", so I learned something new. But yeah, that'd be cool.

Oct 17 2013, 8:04 pm

MagicMountain

We've made a lot of progress implementing this stuff, one of the remake coders even came back and helped out for a while. I know most people aren't going to leave their home server but check out a round on LLJK sometime.

I haven't had much free time but I've thought of a couple more topics. What sounds most helpful, a thing on using the profiler and writing debug code to track down laggy procs? A more detailed explanation of sleep and spawn? Maybe an explanation of good and bad practices when writing new code for ss13?

Oct 17 2013, 9:44 pm
Mloc	Anything that helps find laggy procs sounds great.

Nov 1 2013, 12:53 am
ZomgPonies	Yep, definitely profiling. I've got some laggy procs under the mob controller that I just can't figure out.

Nov 1 2013, 6:41 am (Edited on Nov 5 2013, 2:43 am)
MagicMountain	Hi, I didn't forget about you all, I'm just getting crushed at work. Someday I'll effortpost again, I promise. Edit 11/5 - Holy moly, we just found and patched a doozy. This'll be a perfect case study in using the profiler effectively. Should have it written up sometime this week.

Nov 17 2013, 4:48 pm (Edited on Nov 17 2013, 5:10 pm)

MagicMountain

This is overdue but here goes. The profiler is one of the most powerful tools for finding and solving problems. It's really flexible and depending on how it's used you can learn a lot about what's happening with the code.

When I'm profiling I'll usually do one of two things; I'll either leave it running over a long period of time or take a bunch of short snapshots over a brief period. The reason for the short snapshots is averages don't always give you the whole picture. Sometimes the same bit of code can spike up with a much longer execution time under certain conditions.

The first method is pretty simple to understand, just fire up the profiler and go. I accomplish the second one by refreshing the profiler, copying the text, clearing the profiler and then pasting it into Notepad++ and hitting ctrl+N to open a new blank window. After doing this for a while it's more than fast enough, normally I'm grabbing 10 or 15 second chunks at a time.

Either way, sorting is key. Get totals and averages sorted by all three columns, they're all useful in different ways. It's annoying and you end up juggling 40 text files but they're gold, platinum even. So, that's how I gather the data but what does it mean? What do you do with it? Let's look at some.

All the data in this post is from LLJK, about three weeks ago. This is just one of many snapshots from the session. Using this we were able to focus and make some pretty significant improvements and lately players are consistently reporting a better experience. Profiling!

                   Profile results (total time)

Proc Name                                 Self CPU    Total CPU    Real Time        Calls
--------------------------------------    --------    ---------    ---------    ---------
/datum/Del                                  34.005      158.478      158.500        49230
/datum/controller/process_loop/start         0.003       49.469        0.000            3
/datum/controller/process_loop/process       0.044       49.465        0.000            3
/mob/living/carbon/human/update_clothing     3.253       29.668       29.666          538
/proc/sd_Update                              1.841       28.617       28.616          559
/atom/proc/sd_SetOpacity                     0.005       28.610       28.609          472
/obj/machinery/door/proc/toggle_opacity      0.003       28.347       28.346          429
/icon/proc/RscFile                          28.337       28.337       28.336         3091
/proc/AStar                                  1.833       21.260       51.764          162
/obj/machinery/door/proc/close               0.012       19.954      869.875          480
/datum/guardbot_mover/proc/master_move       0.494       19.629      481.601           43
/atom/proc/default_click                     0.069       17.950      146.014         1876
/turf/proc/sd_LumUpdate                     17.509       17.705       17.772       329309
/client/Move                                 0.097       14.589       14.593        17034
/client/Move                                 0.704       14.491       14.496        17034
/atom/proc/sd_ApplyLum                       8.958       13.457       13.459         6776

So this is about 40 minutes of gameplay, set to show total time and sorted by Total CPU. We can see from the data that players' update_clothing is eating up quite a lot of cpu time, thanks mostly to /icon/proc/RscFile. Looks like sd_lighting is another major offender through sheer call volume besides CPU use. Strangely enough /datum/Del is by far the worst user of CPU. Since /datum is the root of all types that's the overhead of all deletions of everything. Pretty out of control!

The 'Self CPU' column means the amount of CPU time used by the code within that proc only, excluding calls to other procs. When the self CPU is high you have the source of your problem, either change the code to make it cheaper or somehow call it less often.

'Total CPU' is the amount of CPU time used by the proc and everything it calls. I find this the most useful, but it's a little deceptive. The same CPU time can be listed many times in the total cpu column because of code calling other code, compare with the Self CPU to find where the actual work is happening.

'Real Time' is how long it actually took to complete. If some code's Total CPU and Real Time are equal that code never sleeps or yields the CPU. If you have some expensive code and it can't be optimized any further you can still improve the player's experience by breaking it up across multiple ticks with sleep().

'Calls' is fairly self explanatory. Code with a very high number of calls benefits more from small optimizations. It's also useful for bug fixing, you might find that something is being called far more often than you intended.
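The profiler's Calls column tells you how often a proc ran but not who called it. A small hand-rolled counter can fill that gap. This is a sketch of the kind of DM debugging helper that could do it, not the tooling used on LLJK. call_counts, count_call() and dump_call_counts() are hypothetical names.

```dm
// Hypothetical debug counter: label -> number of calls since the last dump.
var/list/call_counts = list()

// Drop a call like count_call("updateicon from process") at the call site being measured.
/proc/count_call(label)
    call_counts[label] = call_counts[label] + 1   // a missing key reads as null, which counts as 0 here

// Write the totals to the world log and start counting again.
/proc/dump_call_counts()
    for(var/label in call_counts)
        world.log << "[label]: [call_counts[label]] calls"
    call_counts.Cut()
```

Using a different label per call site shows which caller is responsible for an unexpectedly high count. The counter itself adds a little overhead, so it would come back out once the culprit is found.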

Here's another view, one of the shorter snapshots sorted by Total CPU and set to average time.

                 Profile results (average time)

Proc Name                                Self CPU  Total CPU Real Time  Calls
---------------------------------------  --------   -------   -------   -----
/proc/explosion                             0.003     1.070     1.111       5
/mob/proc/throw_impacted                    0.000     0.643     0.643      36
/obj/critter/killertomato/ChaseAttack       0.000     0.635     0.635      30
/obj/machinery/bot/guardbot/proc/explode    0.005     0.618     3.512       2
/mob/living/carbon/human/ex_act             0.000     0.562     0.562       6
/obj/item/device/pda2/proc/post_signal      0.000     0.462     1.335       7
/atom/proc/throw_impact                     0.000     0.437     0.437      57
/obj/critter/martian/proc/MartianPsyblast   0.001     0.396     0.395       2

The key here is all those procs with the same Total CPU and Real Time. That code is taking a serious amount of time to complete a single call and while it's using the CPU the server is going to queue or outright drop player input and give you a bad experience.

Consider this part 1 I guess, it's already getting way too long and there's still a lot more I'd like to write. Formatting to come later, too.

Nov 19 2013, 11:50 am
Snaipperi	Dropping by to say: great work MagicMountain!

Nov 25 2013, 11:11 pm
Dr. Cogwerks	Yeah, Del() is enormously expensive. It iterates through the game world looking for references to whatever you're deleting so it can clear those references, and this gets absurdly costly whenever a bunch of things are getting deleted at once.

Nov 26 2013, 1:17 am

In response to Dr. Cogwerks

Laser50

Dr. Cogwerks wrote:

Yeah, Del() is enormously expensive. It iterates through the game world looking for references to whatever you're deleting so it can clear those references, and this gets absurdly costly whenever a bunch of things are getting deleted at once.

I've learned that using src = null seems to work slightly more efficiently. I may post in a few moments with some tests, who knows.

Another option is to delete objects periodically in batches when you have to delete a LOT. That may prove more efficient than deleting throughout your main code. But this one I won't test..
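The periodic-deletion idea could be sketched as a queue that is drained a few entries at a time. This is untested, like the suggestion itself. del_queue, queue_del() and run_del_queue() are hypothetical names and the batch size and pause are arbitrary. It does not make each del any cheaper, it only spreads the cost out so one burst of deletions doesn't stall the server.

```dm
// Hypothetical queue of things waiting to be deleted.
var/list/del_queue = list()

// Call this instead of del(D) where a burst of deletions is expected.
/proc/queue_del(datum/D)
    if(D)
        del_queue += D

// Runs forever: deletes up to batch_size queued entries, then sleeps for pause (1/10 s).
/proc/run_del_queue(batch_size = 20, pause = 10)
    while(1)
        var/n = min(batch_size, del_queue.len)
        for(var/i = 1 to n)
            var/datum/D = del_queue[1]
            del_queue.Cut(1, 2)     // remove the first entry from the queue
            if(D)                   // already null if something else deleted it first
                del(D)
        sleep(pause)
```

It would be started once, for example with spawn() run_del_queue() from world/New. Queued objects stay in the game world until their turn comes, so anything that has to vanish right away would need to be moved out of sight first.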
