ID:1852790
 
Resolved
Reading a movable atom's locs list caused a memory leak.
BYOND Version:507
Operating System:Windows Server 2012 r2 64bit
Web Browser:Chrome 43.0.2357.65
Applies to:Dream Daemon
Status: Resolved (507.1286)

This issue has been resolved.
Descriptive Problem Summary:
Hey there, I am the current host and system admin for tgstation13, one of the more popular SS13 code bases. We have had this issue for a while, since about late last year (2014).

Basically, a random for-loop will runtime with "out of resources", followed by what seems like every global list (global scope, i.e. world-level lists, not global vars in the sense of static vars attached to a class object) triggering a "bad list" runtime on access. Rebooting the world, either via code (admins have a verb to do it) or the world->reboot menu option, will trigger world/Reboot() but otherwise fails to actually restart the world. (Hitting Stop in DD works fine.)

The server has 64 GB of memory, and the hypervisor VM that BYOND runs in has exclusive access to 8 GB of it. At the time this triggers, DD is not using any abnormal amount of memory and is still at its average of around 700 MB. (We have also had this trigger while using only 600 MB on the low-population server.)

We are currently running 507.1284, but this error has been seen on 1282 and, I think, 1279 (I can't remember exactly which version we ran before I took over).

Here are the memory stats after this has happened:

Prototypes:
obj: 1306644 (7982)
mob: 1308788 (134)
proc: 8411592 (17922)
str: 4267297 (85786)
appearance: 6051897 (8923)
id array: 8395740 (34880)
map: 78031920 (255,255,7)
objects:
mobs: 222480 (202)
objs: 31409744 (122566)
datums: 6002816 (49362)
lists: 25153408 (465187)

Here is what they look like directly after world/New() has finished:

Prototypes:
obj: 1306644 (7982)
mob: 1308788 (134)
proc: 8411592 (17922)
str: 4246606 (85786)
appearance: 5908006 (8307)
id array: 8402164 (34880)
map: 76949104 (255,255,7)
objects:
mobs: 60576 (66)
objs: 31151568 (122566)
datums: 5116576 (43725)
lists: 37901992 (962898)

(Note: both of these are from a testing server on my home computer (Windows 7 64-bit, 1284), where I ran the server for a little over a day with me connected but not doing anything, in order to get this to trigger. I have no idea how much time had passed with the world in this state, as I wasn't checking up on it much and had basically almost forgotten about it.)

The following is a list of permalinks to time-frozen code of the for-loops that have triggered this runtime in the past few months (taken from runtime logs on the production servers). The first two are from the last few days, after we upgraded to 1284. This is just a short list, as I got tired of digging through runtime logs and generating time-frozen links.

https://github.com/tgstation/-tg-station/blob/a029a49392b4708b76f9f4f709d8b707475c04c7/code/LINDA/LINDA_system.dm#L85
https://github.com/tgstation/-tg-station/blob/01a8aa662a9f00f5795247087ce4b7e53a0e4035/code/game/communications.dm#L209
https://github.com/tgstation/-tg-station/blob/01a8aa662a9f00f5795247087ce4b7e53a0e4035/code/game/turfs/turf.dm#L83 (happened here 4 times, but this proc gets called a LOT)
https://github.com/tgstation/-tg-station/blob/29609457f57a8cefd244da5b36e43181399ed002/code/controllers/master_controller.dm#L85
https://github.com/tgstation/-tg-station/blob/9038fb15af6ea1e93729f4d6877cf3a7083d1c05/code/modules/mob/mob_movement.dm#L311

(Note: the highlighted lines are the lines given by the runtime error.)

To narrow our search, I'm going to focus in depth on that last one, mob_movement.dm line 311, because it's the simplest proc in terms of how the for-loop uses other variables.



for(var/atom/A in orange(1, get_turf(src)))


get_turf, for reference:

/proc/get_turf(atom/movable/AM)
    if(istype(AM))
        return locate(/turf) in AM.locs
    else if(isturf(AM))
        return AM


So this is a simple loop in a mob proc that only touches a BYOND-generated list (orange()) of the atoms in the tiles within range 1 of the mob.

This runtime triggered on Basil (our lower-population server) back on March 28. Shortly after stopping and starting DD, it triggered again on two separate occasions in turf/Enter() (above) until I actually force-closed DD and the BYOND pager.

We had been waiting to report this because of the bug report about list corruption that was fixed in 1284; it wasn't until it happened twice (the first two links) after upgrading that I started to investigate it.

I had assumed it might be bad RAM until it happened on my home computer.
MrStonedOne wrote:
The server has 64 GB of memory, and the hypervisor VM that BYOND runs in has exclusive access to 8 GB of it. At the time this triggers, DD is not using any abnormal amount of memory and is still at its average of around 700 MB. (We have also had this trigger while using only 600 MB on the low-population server.)

One assumption you're making here is that BYOND can use 8 GB of memory; it can't. It's a 32-bit application. Still, 700 MB should be nowhere near a limit.

Although the list corruption really shouldn't happen when the server runs out of memory (in that sense this is a bug), it's hard to predict what will happen in that case. The out of resources message however is the same as "out of memory"; this particular message only happens when a list is copied.

Aside from running out of memory, the only other possibility I can think of is heap corruption. Inability to allocate more memory seems much more likely, though. The striking thing for me is that while you start off with an unholy number of lists, at the time of the crash you have about half that.
In response to Lummox JR
Lummox JR wrote:
The striking thing for me is that while you start off with an unholy number of lists, at the time of the crash you have about half that.

This is most likely from the lists getting corrupted, and no longer getting counted.

A bit more research showed that a good chunk of lists get corrupted right as it happens, in a seemingly random selection. Then the remaining lists all slowly get corrupted over the next 15 minutes.

This leads to the game slowly breaking, as so many functions use at least one of the list of machines, the list of active living mobs, the player list, or the admin list. So while it starts with just saycode breaking, soon after that the periodic events handler stops because it loses its list of subsystems, then spawn points break because that's a list too. Somewhere along the line new connections stop working, most likely due to a runtime in client/New() relating to lists.

While I'm sure most of those lists are contents lists, we do use lists heavily to avoid having to loop over the whole world, to save time. With 500,000 atom instances being common, even with the optimizations and type filtering that in-world loops get, cutting that down to a search of a few hundred tracked objects is much quicker; the sketch below shows the general pattern.
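(A minimal sketch of that pattern, not code from the tgstation tree; the type path, global name, and process() proc here are illustrative. The idea is to keep a global list of the instances you care about, so periodic code walks a few hundred items instead of scanning every atom in the world.)

var/list/machines = list()

/obj/machinery
    New()
        ..()
        machines += src     // register on creation
    Del()
        machines -= src     // unregister before deletion
        ..()
    proc/process()
        return

// Periodic work walks the short tracked list...
/proc/process_machines()
    for(var/obj/machinery/M in machines)
        M.process()

// ...instead of a type-filtered scan over every atom in the world:
//     for(var/obj/machinery/M in world)
//         M.process()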

I can't actually map out which lists break, because the runtime limit gets hit almost immediately after "the event" due to all the bad-list runtimes, and it stops reporting runtimes. (As an aside: it would be nice if there were a way to disable this check.)

Honestly, from here it is looking like heap corruption of some kind. That explains the spread of the corruption more naturally than out-of-memory errors do. (Unless BYOND has a habit of reallocating already-existing lists upon access of others, in which case it would hit an out-of-memory error again.)

Another note: the out-of-resources runtime only happens once; after that it's bad lists until it hits the runtime limit.
In response to Lummox JR
Lummox JR wrote:
MrStonedOne wrote:
The server has 64 GB of memory, and the hypervisor VM that BYOND runs in has exclusive access to 8 GB of it. At the time this triggers, DD is not using any abnormal amount of memory and is still at its average of around 700 MB. (We have also had this trigger while using only 600 MB on the low-population server.)

One assumption you're making here is that BYOND can use 8 GB of memory; it can't. It's a 32-bit application. Still, 700 MB should be nowhere near a limit.

Although the list corruption really shouldn't happen when the server runs out of memory (in that sense this is a bug), it's hard to predict what will happen in that case. The out of resources message however is the same as "out of memory"; this particular message only happens when a list is copied.

Aside from running out of memory, the only other possibility I can think of is heap corruption. Inability to allocate more memory seems much more likely, though. The striking thing for me is that while you start off with an unholy number of lists, at the time of the crash you have about half that.

I went over the code; this is not caused (directly) by an allocation failure, it's caused by NewUserList returning a bad list ID or a list of the wrong length. The resulting proc crash likely prevents various cleanup code from running and leaves some internal state bad.

Looking at NewUserList, there are some incredibly wacky semantics when reallocating the list of lists when the length is close to 0xFFFF. Every case of OOM in NewUserList should produce an InternalBugPrinter message, which should go straight into the log and be pretty obvious.
In response to Tobba
I can explain those semantics. The "list of lists" tries to grow by at least 1/16 of its current size. If it reaches what it feels is the max ID, it will grow no further. If it grows such that NONE (0xFFFF) is the last ID available, it increases the size by one more, just to make sure there is at least one valid list ID to use.

NewUserList() isn't experiencing the out-of-memory error (or it should print out), and the new list is being allocated, so the problem must come in where the new list's length is set. But that doesn't seem to be printing an out of memory error either. Actually it's really puzzling, because any path that could potentially lead to an error should be printing "out of memory" somewhere.
If you could maybe go into a bit more detail about how NewUserList() gets called, and what triggers the Out Of Resources line to get printed, I can try to make some test projects that brute-force list creation, removal, and traversal to see if I can get it to trigger; something along the lines of the sketch below.
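(For example, a churn test of roughly this shape; purely illustrative, the verb name and numbers are made up and this is not an existing test project:)

/mob/verb/churn_lists()
    set background = 1          // keep the loop checker from killing the proc
    for(var/i = 1, i <= 100000, i++)
        var/list/L = list()
        for(var/j = 1, j <= 10, j++)
            L += j              // creation/growth
        var/total = 0
        for(var/v in L)         // traversal
            total += v
        L.Cut()                 // removal
        if(i % 5000 == 0)
            sleep(1)            // yield occasionally so the world keeps ticking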

In the meantime, if you want to try your hand at running this in a debugger to get it to trigger, here is a time-frozen link to the code base I am currently running in my test case:
https://github.com/tgstation/-tg-station/archive/bc58a8977fd6d0e9fe5b1eaf465512cfec941136.zip

Compile it, edit config/admins.txt to give your ckey a rank of Host, run it in DD (presumably with a debugger attached and breakpoints set at all the lines that trigger the out-of-resources runtime), connect with DS, and wait about 18 hours. (It happens much sooner if people are doing things, but even with somebody idle but connected (so that world.time ticks) and doing nothing, it triggers in under 24 hours every time, usually at around 17 hours.)

You will know it has triggered by checking the MC tab (only shown to admins) in DS. It should have a list of systems: things like air, objects, bots, mobs. If it has nothing, then the master controller's subsystem list got corrupted, a telltale sign that the bug has happened.

https://tgstation13.org/msoshit/mctab.png <--- a screen cap of what it looks like after it's triggered. There is normally a lot more info under that master controller line.

Note: go to the Server tab and hit "toggle random events" so that they are disabled; otherwise you get spammed with the random events that trigger sounds (and runtimes relating to random events that require a dead player, something that's been mostly fixed in later versions). After that, as long as you don't join the round and stay in the new-player lobby, only two sounds should play: the round-start sound, and the code blue sound shortly after it.

If you do this, feel free to message me on pager if you have any questions.
Huh. So if I hosted a server, joined it in DS but did nothing but join a round, and sat for a while, it should still happen? That's pretty interesting. I don't know if that would ever be feasible to test in the debugger because of the ultra-long time frame, though.

When monitoring this as it goes along, are you certain the memory is not creeping up?
Maybe it is; I'll redo it today and have a program log all of that.

Also, this MAY be related: http://www.byond.com/forum/?post=1854520
I FIGURED IT OUT (test case below as a pastebin)

So yes, the memory was rising, only to fall back down once the event happened as lists started getting corrupted. (Hence why I never noticed it; I always looked after it had happened.)

Memory stats showed the number of lists rising.

I also had the normal amount of memory usage wrong. Average usage after round start is 300 MB; for some reason I thought it was 600 MB.

Anywho, since it was rising at a steady rate, I tracked down which periodic loop scheduler was causing it by setting each one to fire at an insane rate, one by one, and watching memory usage, and from there narrowed it down to one line in a helper proc:

return locate(/turf) in AM.locs

Where AM is declared as var/atom/movable/AM.
That line in place:
https://github.com/tgstation/-tg-station/blob/36df27ad53d16318566dd6a3c76ebd50d06cd672/code/__HELPERS/unsorted.dm#L1077


Built a test case:
http://pastebin.com/1kiXQb6s

Running locate() in locs causes lists to hang around after they are no longer in use. (Running it on loc does not.)

It happens every time, in less than a minute.
Once you hit 16 million phantom lists, the next for(thing in list) loop (where the list is not world or another optimized list) will trigger the out-of-resources runtime.
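
(A minimal sketch of the same pattern; this is not the linked pastebin, and the verb and numbers here are illustrative, based only on the behavior described above. Repeatedly read locs on a movable and watch the "lists" line in Dream Daemon's Memory Stats climb on affected builds.)

/mob/verb/leak_locs()
    set background = 1                          // keep the loop checker from killing the proc
    var/atom/movable/AM = new /obj(locate(1, 1, 1))
    for(var/i = 1, i <= 1000000, i++)
        var/turf/T = locate(/turf) in AM.locs   // each read of locs appears to leave a phantom list behind
        if(!T)
            break
        if(i % 10000 == 0)
            sleep(1)                            // yield occasionally so the world keeps ticking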

Ooh, that's very helpful. I'll see what I can dig up there. Thanks for providing a test case.

If pixel movement and big objects are not in play, the locs var is pretty pointless to use anyway. But it definitely shouldn't be leaking.
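
(In that spirit, until a fixed build is out, a codebase that doesn't rely on pixel movement or multi-tile objects can sidestep locs entirely by walking up the loc chain instead. A minimal sketch; the proc name is illustrative and this is not code from the thread or from tgstation:)

/proc/get_turf_safe(atom/A)
    while(A && !isturf(A))
        A = A.loc       // climb out of any containers until we reach the turf (or null)
    return A
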
Lummox JR resolved issue with message:
Reading a movable atom's locs list caused a memory leak.