ID:2021963
 
Resolved
The server ticker skipped a large number of map ticks when overloaded, instead of skipping map ticks gradually based on usage.
BYOND Version:509
Operating System:Windows Server 2012 R2
Web Browser:Chrome 47.0.2526.111
Applies to:Dream Daemon
Status: Resolved (510.1320)

This issue has been resolved.
Descriptive Problem Summary:
When more than 74 people are connected (75 and higher) and any proc runs for 3 or more BYOND ticks (world.tick_lag), BYOND enters a 2 to 5 second window after the proc finishes in which it will not send map/client updates but continues to process all incoming verb commands. This lagged time frame does not show up in the profiler.

CPU usage shows up as abnormally low during these time frames.

Network monitoring shows almost no attempted sending of network data.

Clients get output control updates, but not map or stat panel updates.

The length of the lagged window seems to scale with something, as it's the same within any given round where this happens, but can change from round to round.

One BYOND tick immediately following the end of the proc does fire, sending map and stat panel updates; it's after that tick that the lag starts.
Hopefully world.tick_usage will help with this.
I mean, that seems like a really bad stopgap measure.

The fact that BYOND uses no CPU and no network during this window basically means that BYOND just decides to sit by and do nothing for 5 seconds, for bad and seemingly random reasons.



For all I know, below 75 players or below 3 ticks causes smaller lags that just aren't as noticeable.

At 16 fps we hit this often, but it COULD be controlled. A game running at 30 fps, though, would hit it from something as simple as a view() or range() call (not a loop over the result, the actual view()/range() call itself), and at 60 fps a simple chain of proc calls could do it: it would only take 15 proc calls in one chain of interactions to trigger this, no loop required.

Let's try to debug this.


I know that the current BYOND tick fires; it's the following one that doesn't. I also know that incoming network operations fire, processing procs and triggering world << "test" and mob << "test" outputs.

So let's look at the BYOND 'tick' loop. I know it has to do the following things:


Run any sleep or spawn procs whose time has come

Send map updates

Trigger stat panel updates

Schedule the next BYOND tick



I know that BYOND has to have some outer loop that runs more often (most likely every 1, 2, 5, or 10 ms), deciding when to trigger a game 'tick', taking incoming network data, and deciding whether to queue it for the inner game-tick loop or trigger a proc call immediately.

So let's take a look at all of that.

First, in what order is the inner loop processed? Then I can see if I can tell which parts are triggered on the run just before the ticks stop coming. Is it halting in the middle of a game-tick run without scheduling the next tick, with some failover functionality eventually bringing it back online? Or is it not even running game ticks?

Or maybe game ticks fire, but they just decide to not trigger map/stat updates.

I suppose I could check for sleeping procs resuming during this time frame.

As I get more info, and you get more info, the code that could potentially be responsible starts to shrink.

I know from when we've done this before (think Windows opening a lot of file handles on shell()): at some point I suggest somewhere to look, you look, and bam! You have the eureka moment and the bug gets solved.
What kind of timer does the outer loop use to decide when to run the inner loop? What's its resolution?

Could it be rolling over and not detecting this because of the missed tick and/or client processing taking too long (from 75+ players being connected)?
Here's the gist of the server tick:

1) Process sleeping procs.

2) Process pending client commands that have been queued. (I forget if a server-side queue is currently enforced.)

3) Check for garbage in the unique map cells. (Unless forced by running out, which shouldn't even happen anymore, this doesn't happen more than once every ten minutes. It would also use CPU noticeably if it took any serious amount of time.)

4) If there's a map countdown, tick it down and skip the map step.

5) Send maps to all players.

6) Determine map usage; if world.cpu is running over 120% and the maps account for at least 20%, set a map countdown based on usage.

7) Turn off the timer if no players are connected and no procs/walkers are pending.

8) The timer handler ends here, which means that any procs that may be called as a direct result of network action (e.g. some player commands, new connections, callbacks) can be called during this time.
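
Strung together, that would look roughly like this in pseudo-C (SendMaps and map_countdown are the names discussed in this thread; every other identifier here is just a placeholder for illustration, not the actual internal code):

	void ServerTick()
	{
		ProcessSleepingProcs();           // 1) procs whose sleep/spawn time has come
		ProcessQueuedClientCommands();    // 2) queued player commands
		MaybeCollectMapCellGarbage();     // 3) rare; roughly once every ten minutes

		if(map_countdown)
			map_countdown--;              // 4) countdown active: skip the map step this tick
		else {
			SendMaps();                   // 5) map (and stat panel) updates to all players
			// 6) world.cpu over 120% and maps at least ~20% of it: set a countdown from usage
			if(OverloadedWithMapsShare())
				map_countdown = CountdownFromUsage();
		}

		if(!PlayersConnected() && !ProcsOrWalkersPending())
			StopTimer();                  // 7) nothing left to do: turn the timer off
		// 8) handler returns; network-driven procs may run until the next tick fires
	}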

The map_countdown value would seem like a potential culprit, but I highly doubt this. It caps out at a countdown of 16 on a normal 10 FPS world, and can only go higher in high-FPS situations. Also, the map_countdown is limited to only the map's portion of the tick usage, so the map sending would have to take 16% of a tick to max out here. The only way for a 5s delay would be if your tick_lag was set to something like 0.5, but it'd have to be because the map sending routine blew up and took way more time than it should have.
Well, the only way to rule it out would be to disable it.

Try to see if you can fit a DD command-line option (or cfg option) for that into 510 (or the next 509 release if possible). It would be something we would want to try disabling at /tg/station anyway, as client performance and framerate are generally more important than everything else.

Does anything in the map routine do file reads/writes? The only disk usage I saw during this was to docs/byond/cfg/sauth.txt, but I didn't see any indication it was blocking. The server has enough RAM that it can serve that from cache anyway.
For reference, our world.fps is 16; we've seen it happen at 11 fps, but less often.

The map stuff should never be doing disk writes.

I did however find something very interesting that might be worth exploring. This is the code governing the map_countdown specifically:

//if the cpu is overloaded, punish SendMaps() for its share of the guilt
if(tick_usage > 120 && map_tick_usage > tick_usage - 100) {
	map_countdown = map_tick_usage;
	if(map_countdown > 16) map_countdown = 16;
	map_countdown = (unsigned short)(map_countdown * 100 / tick_lag); //slow things down even more on high frequency tickers
}
else map_countdown = 0;

Per that logic, map_countdown actually goes right to the cap if it appears at all. (Which, granted, isn't ideal.) But so far this looks fine on the surface.

Two items of note here. The first is that tick_usage and map_tick_usage are both rolling averages, calculated over 16 ticks, and both of them are constrained to a 1-byte value. It's conceivable that while running at 300% CPU for an extended time, the value could loop around. If this plays into the logic behind your issue at all, then it's reasonable to conclude that changing these to 2-byte values would solve the problem. The question is whether this is factoring in at all.
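
Just to illustrate the wraparound (a toy example, not the actual storage code): a 1-byte value asked to hold a 300% reading wraps to something well under the 120 threshold.

	#include <stdio.h>

	int main(void)
	{
		unsigned char tick_usage = (unsigned char)300; /* wraps to 300 - 256 = 44 */
		printf("%d\n", tick_usage);                    /* prints 44, far below 120 */
		return 0;
	}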

The other item of note is that from what I'm seeing on review, map_tick_usage is measured entirely incorrectly! In fact map_tick_usage is the non-map portion of the tick.

Based on what I'm seeing, and mind you I'm not functioning on all cylinders right now, I think the 1-byte thing is a red herring and won't impact any of this (indeed the rollover should improve matters!), but if the running averages got into the range where tick_usage is over 120 and map_tick_usage (again, measuring the wrong thing) is anywhere over 20, you'd end up with map_countdown jumping to 16 and then being scaled by FPS/10.

The only problem I have with this theory is that because tick_usage and map_tick_usage are running averages, you'd need quite a spike to make this jump. I'm not sure a 3-tick proc would be quite enough to cause that jump with a running average of 16 ticks. (And indeed a 3-tick proc is going to wreak havoc lag-wise anyway.) My math says you'd need to be running close to 100% CPU or even over, and then have a spike in the 400-450% range or higher for the running average to hit 120.
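
(To put rough numbers on that: with 15 ticks sitting at 100% and a single spike of S, the 16-tick average is (15*100 + S)/16, which only reaches 120 once S is about 420% or higher.)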

That said, I feel like that's the most likely explanation. The flaw in the logic here is probably what's causing this issue, and if you're running at a higher framerate that will have a direct impact on the "timeout". (Note: the tick_lag value is actually in ms, so a standard value would be 100. If you run at 16 fps I assume this would be 62.)
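
(If I'm reading the scaling right, the capped countdown of 16 becomes 16*100/62 ≈ 25 ticks at 16 fps, and at 62 ms per tick that's roughly 1.6 seconds with no map updates; at the default 10 fps it's 16 ticks of 100 ms, about the same 1.6 seconds.)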
That actually makes some sense. When this happens I've noticed world.cpu being around 100, which isn't abnormal, since we scale certain things to avoid idle time while attempting (but sometimes failing) to keep it from actually going over.

Now, talking to somebody who's decoded a lot of BYOND internals (I think you know who), they suggested that BYOND might be doing some compression in the client <-> server traffic.

Wouldn't it make more sense to disable that rather than suspend all map updates in these cases?


Also, I would still love the ability to disable this (and compression) as a cfg or command-line option.

Should I make a feature request for that?
I'm not sure how much compression we're still using, to be honest; I know the threshold for it was turned up at one point. I'll have to look into that.
Here's a quick question: does this short-circuit if SendMaps() never ran (because it's being delayed)? If not, it could keep resetting the countdown to 16*100/tick_lag until enough ticks fire to drop the usage below 120%.
I'm not sure I follow the syntax there, but if you're asking if the map_countdown can be continually reset to the max without ever getting a chance to count down, no. Once the map_countdown value is set, SendMaps() and the calculation to set map_countdown again are both skipped until the countdown expires.
Yeah, the more I look at this bit of code, the more I don't like it.

The pre-scaling countdown will always be 16; it can't be anything else, since map_tick_usage has to be over 20 to trigger the branch in the first place.

So that's a 1.6-second delay of client map updates at minimum; if something else causes ticks to overrun, it gets delayed even more.

Because this then causes an immediate world.cpu drop, most of /tg/'s limiters would stop (world.cpu has to be above 50% for them to kick in), so atmos processing's fire rate would speed up, causing more ticks to overrun. (This would explain why the delay sometimes gets longer and stays longer.)

I think it should be capped at 0.5 seconds of skipped updates (i.e. 5), and be reworked to actually scale (120 for the base, 200 for the max?).

At some point we must remember that this is a game engine; client responsiveness should be high on the list of priorities.
Yeah, I just need to work out new math for that, and make a note that map_tick_usage is actually nonmap. (Maybe I'll just rename the var.)
How does this handle world.tick_lag/world.fps changes? Does it reset, or would that cause it to stay the same number of ticks?

Edit: for the moment I have a workaround up that just avoids getting to 120% CPU at all costs. It's not perfect, but it should make things better during peak-time high pop (80-100 players).

https://github.com/tgstation/-tg-station/pull/14776
I did an experiment with an empty hosted game and Dragon-Ki; I left them running all night and through the morning until the afternoon. When I checked, only the empty game hadn't disconnected me, while on Dragon-Ki I was disconnected. Both are on the same server.

Not sure if this applies to what we're talking about here :o

Note: before, when we had a player base of 150-200 players logged in, there were sometimes random disconnects of up to 20-30 people at the same time.
In response to MrStonedOne
MrStonedOne wrote:
How does this handle world.tick_lag/world.fps changes? Does it reset, or would that cause it to stay the same number of ticks?

It should always calculate based on the current value of tick_lag. Adjusting tick_lag adjusts the tick timer and also tells clients to do the same. (In theory you could escape a timeout early if you played it just right by temporarily raising the FPS. In practice it's a horrible way to go, because you could inadvertently trigger a longer timeout when the first expires.)

Edit: for the moment I have a workaround up that just avoids getting to 120% CPU at all costs. It's not perfect, but it should make things better during peak-time high pop (80-100 players).

https://github.com/tgstation/-tg-station/pull/14776

That's a good idea. Let me know how that works out for you. I'll get a fix for this tick thing into 510.
So I've been thinking about this, and it seems like the better way to go would be to have it skip every n ticks, scaled to load, rather than batch-skip, which would cause oscillation if the map sending accounted for enough of the overloading (skipping a bunch, running a bunch, etc.).

short skip_ticks = 0;
if (tick_usage > 105) {
	skip_ticks = (short)(20 - ((tick_usage - 100) / 5));
	if (skip_ticks < 2)
		skip_ticks = 2;
}
if (!skip_ticks || tick_count % skip_ticks)
	sendmaps();


or something to that effect.
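
To put numbers on that: at 120% usage, skip_ticks = 20 - 4 = 16, so maps are dropped on one tick out of every 16; at 200% it works out to 0 and clamps to 2, dropping maps every other tick.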

(Edit: I still think the true best way to go would be to flat-out remove it, or do the reverse and scale back sleeping-proc processing rather than client SendMaps().)
Aha, so this is the source of our recent lag.

I run a server with an average population in the 60-100 range, and we've been getting MAD LEVELS of lag lately.

Is there a workaround yet?
If you can ensure a proc runs every tick, yes: you detect when world.cpu first goes near 120 (you have to catch it the exact tick it goes over), set world.fps to 10 or lower, and then once world.cpu drops back down (from not sending map updates) you raise it again.

If done exactly right, BYOND will delay for 1.6 seconds at 10 fps, i.e. 16 ticks; then you raise the tick rate back up, lowering the amount of time it takes those 16 ticks to pass, and with it the effective time BYOND delays client-side updates.

The best way would be setting world.fps to 5, then back to something high like 20, causing the delay to last at most half a second rather than 1.6 to 3 seconds.
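
(Rough numbers, assuming the countdown stays fixed in ticks once set, as the earlier answer about escaping a timeout early by raising FPS implies: set at 5 fps, tick_lag is 200 ms, so the countdown is 16*100/200 = 8 ticks; raise world.fps back to 20 and each of those ticks is only 50 ms, so map updates stop for only about 0.4 seconds.)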


world/New()
	spawn tickcheck()

proc/tickcheck()
	var/oldfps
	while (1)
		if (!oldfps && world.cpu > 115)     // cpu closing in on the 120% trigger: drop the tick rate
			oldfps = world.fps
			world.fps = 5
		else if (oldfps && world.cpu < 100) // cpu has recovered: restore the old tick rate
			world.fps = oldfps
			oldfps = null
		sleep(world.tick_lag*0.99)          // sleep just under one tick so this check runs every tick







