ID:2131747
 
BYOND Version:510
Operating System:Linux
Web Browser:Chrome 52.0.2743.85
Applies to:Dream Daemon
Status: Resolved

DreamDaemon version: 510.1346

Descriptive Problem Summary:
I'm going to use lots of images to explain this, so I hope you like graphs.
All graphs represent the same time period, and I'll be talking about correlations between them.

This one is the key graph: https://baystation12.net/bs12perf/cpu.png
The height of the green bars represents how much CPU DD is using. The point where it drops is when I did a cold reboot on the server (killing the process and starting it again). Doing a soft reboot (world.Reboot()) makes no dent on this graph.
So it's clear that something's happening to kick DD into a high CPU state. But what?

Here's another CPU graph: https://baystation12.net/bs12perf/syscpu.png
This one basically shows that when DD starts running high, most of the added CPU time seems to be spent in the kernel, probably in some sort of syscall.
Here's an strace log over 10 seconds during high CPU: https://gist.github.com/Mloc/3ba490fb6ad82808a31ee3502dc8faef
... and one without high CPU: https://gist.github.com/Mloc/93a174d19e3e3a7bc03ba349d802afbf
There's not a huge amount of difference, but times() and gettimeofday() are being called a lot. (That, plus select() being used on a modern system, makes me wince, but that's a problem for another day :p)
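
To make that concrete, here's a minimal C sketch of the pattern those logs suggest. This is my guess at the shape of the bug, not DD's actual code: if the timeout handed to select() degrades to zero somewhere, a main loop that also stamps the time on every pass spins through exactly these syscalls, pinning a core mostly in kernel time.

    /* Speculative illustration only -- reproduces the strace signature,
     * not DreamDaemon's actual code. */
    #include <sys/select.h>
    #include <sys/time.h>
    #include <sys/times.h>
    #include <unistd.h>

    int main(void)
    {
        struct tms t;
        fd_set rfds;

        for (;;) {
            FD_ZERO(&rfds);
            FD_SET(STDIN_FILENO, &rfds);

            /* A zero timeout makes select() return immediately, so the
             * loop spins; a sane non-zero timeout would block and idle. */
            struct timeval tv = { 0, 0 };
            select(STDIN_FILENO + 1, &rfds, NULL, NULL, &tv);

            /* The two calls dominating the high-CPU strace log. */
            struct timeval now;
            gettimeofday(&now, NULL);
            times(&t);
        }
    }

(Whether gettimeofday() shows up in strace or goes through the vDSO depends on the platform, but on this server it's clearly trapping into the kernel.)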

This graph is a bit more obscure; interrupts: https://baystation12.net/bs12perf/irq.png
(note that this graph is log-scale, so even small changes in y can be significant)
I'm not an expert on interrupts and kernel internals, but there's definitely a correlation between DD's high CPU state and a significant jump in Local Timer Interrupts and Performance Monitoring Interrupts.
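
If anyone wants to watch those counters directly, here's a quick hypothetical helper (mine, not from the original report) that prints the Local Timer (LOC) and Performance Monitoring (PMI) rows of /proc/interrupts once a second; the per-CPU counts should visibly accelerate when DD flips into its high-CPU state.

    /* Hypothetical helper: sample the LOC: and PMI: rows of
     * /proc/interrupts once per second. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char line[4096];

        for (;;) {
            FILE *f = fopen("/proc/interrupts", "r");
            if (!f)
                return 1;
            while (fgets(line, sizeof line, f))
                if (strstr(line, "LOC:") || strstr(line, "PMI:"))
                    fputs(line, stdout);
            fclose(f);
            puts("---");
            sleep(1);
        }
    }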

And one final (odd) graph; HDD temperature: https://baystation12.net/bs12perf/hddtemp.png
(sda and sdb are in a RAID1 array)
I don't know what to say about this one, really. There's a correlation between DD's low-CPU state and an increase in HDD temperature, and the temperature drops again once DD goes into its high-CPU state. Maybe the process just has less time for IO operations once it's pegged?

Bonus graph for good luck: https://baystation12.net/bs12perf/rss.png
I took this at the same time as the rest of the graphs, but it doesn't really show anything.

Numbered Steps to Reproduce Problem:
1. Run the server on Linux, running the latest code at https://github.com/Baystation12/Baystation12/tree/dev, possibly under high load.
2. Wait anywhere from 10 minutes to 12 hours, possibly more.
3. Watch as DreamDaemon's CPU usage jumps to 100% and sticks.

Code Snippet (if applicable) to Reproduce Problem:
n/a

Expected Results:
CPU to remain at its low level OR CPU to reset to its low level after world.Reboot()

Actual Results:
CPU jumps to 100% and stays there until the DreamDaemon process is killed and started again.

Does the problem occur:
Every time? Or how often?
Every time, after a randomish amount of time.
In other games?
Unknown
In other user accounts?
n/a
On other computers?
Unknown

When does the problem NOT occur?
On Windows, apparently.

Did the problem NOT occur in any earlier versions? If so, what was the last version that worked? (Visit http://www.byond.com/download/build to download old versions for testing.)
Unknown

Workarounds:
Don't use Linux for hosting.
As an addendum: these graphs were taken a month ago, but nothing has really changed since. I just didn't have a chance to make this post until now.

I would suggest trying to break in with gdb and get a snapshot of what the current stack looks like, and what code is executing. If I can get something like a list of offsets into the libbyond.so file, that should be useful for figuring out the problem.

I can't really get offsets using gdb without debug symbols; the best I could do is give you the memory locations, and you could guess from the relative positions.
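
(For what it's worth, debug symbols aren't needed to get the load base: it can be read straight out of /proc/<pid>/maps, and then every raw address is just base-relative. A hypothetical sketch of that lookup, names and structure mine:)

    /* Hypothetical sketch: find where libbyond.so's code is mapped in a
     * running process, so raw gdb addresses can be converted to file
     * offsets (offset = address - base, since the r-xp mapping starts
     * at file offset 0). */
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        char path[64], line[512], perms[8];
        unsigned long start, end;

        if (argc != 2) {
            fprintf(stderr, "usage: %s <pid>\n", argv[0]);
            return 1;
        }
        snprintf(path, sizeof path, "/proc/%s/maps", argv[1]);

        FILE *f = fopen(path, "r");
        if (!f) {
            perror(path);
            return 1;
        }
        while (fgets(line, sizeof line, f)) {
            if (strstr(line, "libbyond.so") &&
                sscanf(line, "%lx-%lx %7s", &start, &end, perms) == 3 &&
                strchr(perms, 'x')) {
                printf("libbyond.so code loaded at 0x%lx\n", start);
                break;
            }
        }
        fclose(f);
        return 0;
    }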

Backtraces in gdb aren't really showing much anyway. Most of the time it's sitting in WaitForSocketIO (https://gist.github.com/Mloc/fd398ac51d7f50a8daea85accb1c928d), but sometimes we get a longer trace through libbyond (https://gist.github.com/Mloc/dcb5920b62f4cccc0505de0b9181a8b1).

I've got a couple of flamegraphs, but again, libbyond.so is an opaque mass.
https://baystation12.net/bs12perf/perf.svg
https://baystation12.net/bs12perf/perf-hz.svg
Width represents share of CPU time, height represents stack depth. Colour and order on the x-axis are arbitrary.
You can click on entries to zoom in on them, and there's a search bar up top.
And now some flamegraphs from a low CPU state (previous ones were high CPU):
https://baystation12.net/bs12perf/perf-low.svg
https://baystation12.net/bs12perf/perf-hz-low.svg

Looks like a lot of time is being spent in select() on high CPU.

For offsets, I just mean how far into the file you are. Without knowing where libbyond.so was loaded in memory, I can't take those memory locations and turn them into anything useful.

Right. I assumed ASLR would mess that up, but forgot that I can just check where it's loaded.
Bunch of gdb traces: https://gist.github.com/Mloc/3b7a8c613e6a787437050e34f582e923

Executable section: f7220000-f7742000 r-xp 00000000 09:03 111939053 /home/bs12/byond/bin/libbyond.so
(libbyond.so's code is loaded at f7220000)
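
As a worked example with a made-up address (the real ones are in the gdb gists above), converting a backtrace address to a file offset is just a subtraction against that base:

    /* Worked example; the address here is hypothetical. */
    #include <stdio.h>

    int main(void)
    {
        unsigned long base = 0xf7220000UL; /* from the maps line above  */
        unsigned long addr = 0xf73a1b2cUL; /* made-up backtrace address */
        printf("offset into libbyond.so: 0x%lx\n", addr - base); /* 0x181b2c */
        return 0;
    }
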
This seems to be fixed now, at least for us: https://i.imgur.com/YOn9Zdb.png
Lummox JR resolved issue