ID: 2213856

Resolved
Although the limit for unique map cells was increased in a previous version, a bug in the way that was handled caused errors to occur after crossing the 64K boundary. Additionally, map cells did not go through garbage collection often enough when the list grew rapidly, causing large and avoidable memory increases.
BYOND Version: 510
Operating System: Windows 7 Pro
Web Browser: Firefox 51.0
Applies to: Dream Daemon
Status: Resolved (511.1374)

This issue has been resolved.
Descriptive Problem Summary:

Changing the color (or perhaps appearance in general) of an atom leaks a small amount of memory. Rapidly changing the color of numerous atoms makes the leaking memory obvious.

Code Snippet (if applicable) to Reproduce Problem:
// This snippet randomizes the color of every turf every decisecond.
// The turfs don't have icons, but that doesn't seem to matter.
/world
	maxx = 50 // Only relevant for how quickly memory is consumed
	maxy = 50

/client/New()
	. = ..()
	spawn()
		while(src)
			for(var/turf/T)
				T.color = list(rand(), rand(), rand(), rand(), rand(), rand(), rand(), rand(), rand())
				// This also works, but seems to slow down before consuming too much memory:
				// T.color = rgb(rand(255), rand(255), rand(255))
			sleep(1)


Expected Results:
Memory usage would remain fairly consistent.
Actual Results:
  • Usually, memory usage climbs at a modest pace (several MB each second), eventually reaching the limit and crashing Dream Daemon.
  • Stopping the world may free some of the memory, or may crash DD. Attempting to restart the world and reconnect usually causes it to hang or crash.
  • Sometimes, DD will stop responding to the client shortly after connecting. Attempting to stop it will cause it to hang on "closing world", and attempting to exit will cause it to crash.

When does the problem NOT occur?
When the color var is not modified inside the loop.

Did the problem NOT occur in any earlier versions? If so, what was the last version that worked? (Visit http://www.byond.com/download/build to download old versions for testing.)

507.1286 seems to exhibit the problem (with RGB instead of a color matrix, of course). 505.1235 would not let me connect at all, and I'm not sure why. I did not attempt to go back further.

I also tested 511.1373, and the problem still exists.

Workarounds:
Changing colors less frequently, and restarting Dream Daemon more frequently to clear out the leaked memory.
It doesn't make sense that anything would happen in 507, because you can't set a color to a list value there. But I'll check 511 and see what I can find.
That appearance hang I was talking about occurred in my pixelbender project, which was rapidly changing the color of turfs to emulate a graphics buffer.

Coincidentally, I switched to an icon with 32 states and an in-memory LUT profile, using a heavily reduced HSV coordinate algorithm to simulate a limited color count.

The problem went away and the hang stopped happening.

I didn't look into it further. But I think there's some serious credence to this theory.
This is pretty consistent with http://www.byond.com/forum/?post=2188696, isn't it?

It seems like there is some issue with image memory usage in 511.
In response to Clusterfack
Actually no; I suspect that issue is related to var memory. And for some reason it seems highly specific to images, which is odd.
Lummox JR wrote:
It doesn't make sense that anything would happen in 507, because you can't set a color to a list value there.

When testing 507, I used the commented-out line that sets color to a random RGB value, since 507 predates color matrices.

Clusterfack wrote:
This is pretty consistent with http://www.byond.com/forum/?post=2188696, isn't it?

Not sure; I mainly tested this on 510, and that sounds like something that showed up in 511.

The problem with the server refusing connections after reboot is something I noticed with the test code, however. You also mention crashing, which is probably from this color-changing memory leak. I started digging into this problem when we noticed we were getting memory leakage and out-of-memory errors on Paradise; they started after we merged in Goonlights, which makes heavy use of color matrices. You guys use Goonlights on VG, too.
What you're seeing in 507 isn't a leak, because if an appearance is seen by the client it's going to stick around at least until the client disconnects or sends a garbage collection message that it doesn't need that appearance anymore. For every client connected to the server, the server keeps a list of which appearances they have been sent, and that list counts towards the appearances' references.

In fact, any change you make to an appearance will cause this whenever it results in a new unique appearance that is visible to a client. The big question is: do you see a leak when you change turfs that no player can see?

You can test that by adding another z level and only looping through turfs on that level, e.g. via block(locate(1,1,2), locate(world.maxx,world.maxy,2)).
Also a quick question regarding that Goonlights merge: What did your memory profile look like before? Because if you were already pushing some bad limits, something like that could easily have sent you over the edge. A great many of the SS13 branches are really crappy about memory management, and it's possible you need to look into doing some trimming.

If you have a lot of unique appearances that players have seen, even if those appearances aren't in use anymore they won't go away until the clients let go of them, which is often a matter of quite a few minutes. That's going to contribute to memory usage by quite a bit, and Goonlights by its nature creates a lot of appearances.
For what it's worth I host goonstation and haven't noticed any especially egregious memory leaks, from lighting or otherwise. We DO get a whole bunch of rather cryptic runtimes that I still haven't gotten around to making a bug report on, but it's unlikely to be lighting related and I don't really know why I'm bringing it up...
In response to Wirewraith
Wirewraith wrote:
For what it's worth I host goonstation and haven't noticed any especially egregious memory leaks, from lighting or otherwise. We DO get a whole bunch of rather cryptic runtimes that I still haven't gotten around to making a bug report on, but it's unlikely to be lighting related and I don't really know why I'm bringing it up...

But did you do an optimization on other memory usage when Goonlights was implemented? Because it's possible your codebase is functioning quite well as a result of having made said optimizations (or even if they were made earlier), whereas another branch that implements Goonlights could run into a serious problem if they were already running high on memory usage.
Lummox JR wrote:
What you're seeing in 507 isn't a leak, because if an appearance is seen by the client it's going to stick around at least until the client disconnects or sends a garbage collection message that it doesn't need that appearance anymore. For every client connected to the server, the server keeps a list of which appearances they have been sent, and that list counts towards the appearances' references.

Has this changed since 507? I'm not sure how all the internals work, but it seems like the server should be the authority on whether an appearance is needed anymore.

The big question is, do you see a leak when you change turfs that no player can see?

I increased maxz to 2, and am now using this code:
/mob/verb/stress_colors()
	spawn()
		var/list/turfs = block(locate(1,1,2), locate(world.maxx,world.maxy,2))
		usr << "You are at [x],[y],[z]. Stressing [turfs.len] turfs on z2..."
		while(usr && usr.client)
			for(var/turf/T in turfs)
				// T.color = list(rand(), rand(), rand(), rand(), rand(), rand(), rand(), rand(), rand())
				T.color = rgb(rand(255), rand(255), rand(255))
			sleep(1)


The client is at 1,1,1, with 2500 turfs being stressed. On 507, memory usage starts at ~6MB, then after using the verb climbs to ~22MB, and within a second or two I consistently get this:
runtime error: BYOND Error: bad turf
I've never seen that error before, and it stops the loop dead in its tracks.

I then added icons to the turfs and player, and tested again on 507. The first time, the runtime didn't happen, and memory climbed steadily despite the turfs not being visible to the client (I stopped it around 650MB). I figured iconless turfs could be breaking something, but on the next several tries, I got the "bad turf" runtime again each time I used the verb.

I then switched to 510, still using icons. Using the verb, I've experienced the following:
  • Dream Daemon almost immediately freezes. Has to be forcibly terminated.
  • Client responsiveness slows over several seconds, then stops (no movement, simple verb that sends a short message does not work). Upon closing the client, DD might crash. If it does not, after reconnecting...
    • Re-using verb fails due to "bad turf" runtimes; attempting to reboot in this state crashes DD.
    • Re-using verb freezes client again. Third connection fails after Connecting to byond://127.0.0.1:54099...connected. In this state, DD refuses to reboot, and gets stuck on "closing world". Using the "exit" command closes the UI but leaves DD running in the background until forcibly terminated.

I then made the spawn fully independent from the client (nulled out usr and src, used while(1)) and added a line to log the current world time. When the client stops responding, the loop halts with no runtime whatsoever. If the color change is omitted, the loop continues running without the client connected, as expected.
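A sketch of what that modified loop may have looked like, reconstructed from the description above (the verb name and exact logging line are assumptions):

```dm
/mob/verb/stress_colors_detached()
	spawn()
		var/list/turfs = block(locate(1,1,2), locate(world.maxx,world.maxy,2))
		// Detach the loop from the client entirely.
		usr = null
		src = null
		while(1)
			for(var/turf/T in turfs)
				T.color = rgb(rand(255), rand(255), rand(255))
			world.log << "Loop alive at world.time = [world.time]"
			sleep(1)
```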

Next, I moved up to 511.1373, with the modified loop that separates itself from the client and logs the world time. Seems to behave exactly as 510 does.

Shrinking the world size (so fewer turfs are modified per decisecond) increases the time it takes for DD to start acting up, but it still happens eventually.

The good news is, I can't get the memory usage to increase above ~23MB on 510/511. Unfortunately, DD seems to destabilize much more quickly than in my previous testing, so I simply cannot run the loop long enough to know whether it would continue consuming memory.

What did your memory profile look like before?

We weren't graphing the memory usage before, but the server never (or at least, rarely enough to evade notice) hit the memory limit. Now, it looks like this (the blue line is memory usage):

The large drop near the beginning is DD being restarted. The small drops are DD rebooting the world. Near 2GB, DD logs "Out of memory!", and everything basically breaks until we kill and restart DD. Here's a longer-duration version of the previous graph:

So, a few crashes over the course of a day, with several reboots in between, and each reboot only returning some of the memory in use.

I can't say for sure that it's Goonlights' atom color changes causing the problem, but we didn't have a problem before switching to them, and now we have severe memory leakage between rounds. I'm not aware of anything else that has recently changed which would be causing that.

Wirewraith wrote:
For what it's worth I host goonstation and haven't noticed any especially egregious memory leaks, from lighting or otherwise.

Do you terminate and restart DD between rounds? Also, how quickly does Goon's own lighting update? Ours updates as often as 20 times a second (yes, that sounds bad, but it throttles itself as needed!), and the memory leak only seems to get bad enough to kill DD if it's left running and reboots multiple times.
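For reference, the sort of self-throttling update loop described above might look like this sketch (the proc name and the throttle heuristic are illustrative assumptions, not our actual code):

```dm
/proc/lighting_loop()
	set background = 1
	while(1)
		update_queued_lighting() // hypothetical: applies pending color-matrix changes
		// Aim for one update per tick (20/sec at the default tick_lag of 0.5),
		// but back off when the server tick is overloaded.
		sleep(world.cpu > 100 ? 5 : world.tick_lag)
```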
In response to Krausus
Krausus wrote:
Lummox JR wrote:
What you're seeing in 507 isn't a leak, because if an appearance is seen by the client it's going to stick around at least until the client disconnects or sends a garbage collection message that it doesn't need that appearance anymore. For every client connected to the server, the server keeps a list of which appearances they have been sent, and that list counts towards the appearances' references.

Has this changed since 507? I'm not sure how all the internals work, but it seems like the server should be the authority on whether an appearance is needed anymore.

Here's the problem with that: if the server had to keep track of what was in use on the client, it would have to track not just map stuff but also statpanels, grids, etc. in excruciating detail. And it'd have to have some kind of leeway built in for allowing appearances to get "old" but not recycle them just yet on the basis that the client might need them again soon.

All things considered, it makes much more sense to just have the client say when it's safe to let go of an appearance it no longer uses, although this only applies to appearances sent to that client.

The big question is, do you see a leak when you change turfs that no player can see?

I increased maxz to 2, and am now using this code:
...
Client is at 1,1,1, 2500 turfs. On 507, memory usage starts at ~6MB, then after using the verb, climbs to ~22MB, and within a second or two, I consistently got this:
runtime error: BYOND Error: bad turf
I've never seen that error before, and it stops the loop dead in its tracks.

Okay, that's pretty interesting data. I'll run some tests on that myself with the 2-level thing so that client garbage collection is eliminated as a factor, and see what I can find. Maybe I can find the source of the bad turf error while I'm at it; perhaps it relates to this issue.

One thing I should bring up here is that there was an internal limit regarding turfs that was increased after 507, so I'm not sure 507 can be considered very relevant for these tests. But at any rate I'll do all my testing in 511 because that's where I'm gonna have to fix any issues anyway.
I want to add something here:

After porting /vg/lights (more correct name for "goonlights", since no code is taken from goon's lighting engine), Paradise increased their update rate on the lighting process to run once per tick (using world.tick_lag). So the increased update rate could cause this memory leak to become a lot more apparent and actually get into the range of crashing on your average two hour round.

We increased the update rate too when we ported TG's master controller to /vg/ a bit before that, seeing as the better scheduling and priority management could have us take the extra CPU load. We've been seeing crashes amplify too.

I mean I doubt more data is needed to prove that there is an issue here but I'd just like to pitch in either way.
I do have an active item on my todo list to investigate a minor memory leak on /tg/, but I hadn't yet ruled out the possibility that it's caused by us being dumb. We don't (yet) have /vg/oon lighting, but we do enough color fuckery that it's been just barely noticeable.

(Also on my list is a client memory leak that seems to go away the moment I try to test for it, but I'd bet money it's just a game bug with some overlay getting re-added to an atom it already exists on.)
From your code and the issues I had in 511 (the loop would simply stop with no "bad turf" error at all, and DS refused to close properly), I found an issue here with unique map cells that may have been relevant. To explain the term: a map cell is not a turf but a unique identifier that a turf may have, similar to an appearance, and it records the turf's appearance and area IDs. A map cell may be shared by more than one turf, or it may be unused; map cells are eventually recycled on a timer or as needed, but they are not refcounted directly. (That might be a sensible thing to look into changing in the future.)

The number of unique map cells was increased in 508.1292 to allow more than 64K. Before that, any time it hit 64K it forced a recycling of unused cells and then tried again.

There was an error in the new code in which the array wasn't expanded properly, leading to a small amount of heap corruption. This appears to have set off a chain reaction that caused problems elsewhere, stopping the loop or, in your case, producing the bad turf error.

Having looked at this, I also realized that a map cell holds an appearance reference, so whether the appearance was ever seen doesn't end up mattering. Because garbage collection for map cells only triggers after a certain number of minutes, I changed the trigger so collection happens a little more often: any time the number of map cells reaches a multiple of 32K. The upside is that you won't see such severe memory growth in your example case, and when the collection hits, it will be less stressful to the server. Prior to making that change, I saw that the array holding cells remained quite large, so even when a bunch of cells got deleted it didn't roll back the memory very much.

After all that, I strongly suspect there is no leak with appearances; what you were seeing was the slow buildup of map cells without a garbage collection cycle, and that was compounded by a bug in the way map cells were handled when they first crossed the 64K barrier. I'm going to resolve this issue because I did find a bug here, and I think the collection change I made will cover the rest of those problems.
Lummox JR resolved issue with message:
Although the limit for unique map cells was increased in a previous version, a bug in the way that was handled caused errors to occur after crossing the 64K boundary. Additionally, map cells did not go through garbage collection often enough when the list grew rapidly, causing large and avoidable memory increases.
Ah, so it's specific to turfs, which means the objects we're using for lighting won't be affected. Sure enough, if I replace modifying the turfs with modifying an object on each turf, DD runs fine. A client that can see the objects eventually uses too much memory and dies, though, so, I guess I should see if that's been reported yet.
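A minimal sketch of that substitution (the /obj/marker type and setup proc are hypothetical):

```dm
/obj/marker // placeholder object, one per turf

/proc/stress_markers()
	var/list/markers = list()
	// Place one marker on every turf of the hidden z-level.
	for(var/turf/T in block(locate(1,1,2), locate(world.maxx,world.maxy,2)))
		markers += new /obj/marker(T)
	spawn()
		while(1)
			// Color the objs instead of the turfs themselves.
			for(var/obj/marker/M in markers)
				M.color = rgb(rand(255), rand(255), rand(255))
			sleep(1)
```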