ID:2528780
 
BYOND Version: 512
Operating System: Windows Server 2012 R2
Web Browser: Firefox 71.0
Applies to: Dream Daemon
Status: Open

Issue hasn't been assigned a status value.
Descriptive Problem Summary:

BYOND core segfault crashes happen from time to time on our hosting server.

Numbered Steps to Reproduce Problem:

I don't know the exact cause, but while hosting, the server crashes from time to time, mostly when a new round starts or ends, though not always.

Code Snippet (if applicable) to Reproduce Problem:
Application: dreamdaemon.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an unhandled exception.
Exception Info: exception code c0000005, exception address 6D3F9675
Stack:

Faulting application name: dreamdaemon.exe, version: 5.0.512.1488, time stamp: 0x5d94fd60
Faulting module name: byondcore.dll, version: 5.0.512.1488, time stamp: 0x5d94fce6
Exception code: 0xc0000005
Fault offset: 0x00129675
Faulting process id: 0x47c
Faulting application start time: 0x01d5afcc913aff15
Faulting application path: C:\TGS\TGMC\BYOND\bin\dreamdaemon.exe
Faulting module path: C:\TGS\TGMC\BYOND\bin\byondcore.dll
Report Id: 16c92235-1bc5-11ea-80eb-00155d081311
Faulting package full name:
Faulting package-relative application ID:


I have several crash dumps that I can attach, each around 13 MB.
Last one, related to this event: (EDIT: old link)
That crash is coming from within savefile code, where it's trying to keep track of the savefile's current state just before reading a value. The dump file gave me some information to trace this back to an input call, and it's on an input where the savefile path specified is not empty. (That is, you're using savefile["something"] >> myvar instead of savefile >> myvar.) However, I don't have any more information than that, since info on the path name was not preserved. It appears that part of the internal structure used for handling savefiles is either bogus or corrupted, which of course shouldn't be possible.
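To illustrate the distinction (a minimal sketch, not the game's actual code):

/proc/example_read()
    var/savefile/S = new("player.sav") // hypothetical file name
    var/myvar
    S["something"] >> myvar // indexed read: a path is specified before the value is read
    S >> myvar // direct read from the savefile's current directory (cd), no path given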

I would recommend trying to switch to 513, because it's possible that some of the fixes made since 512.1488 have rendered this issue moot.

It's also possible that your server was out of memory, which can cause all sorts of weird issues as things fail to allocate, and not everything is sanity-checked. If that's the case, 513 is unlikely to help, but you can look for ways to decrease the server's memory footprint. The big question is whether that's actually what's happening here; it would help to know what typical memory usage looks like for your server. If you have other crash reports that show a different offset for the crash, then I think memory exhaustion is a very strong candidate.
On DreamDaemon memory usage:
https://pastebin.com/3XusHSUr

Last lines there:
[3:19:30 AM]: CPU: 013% Memory: 276868KB
[3:19:35 AM]: CPU: 009% Memory: 276868KB
[3:19:40 AM]: CPU: 007% Memory: 276868KB
[3:19:45 AM]: CPU: 007% Memory: 276868KB
[3:19:50 AM]: Crash detected! Exit code: 255


By comparison, right now:
[6:57:08 PM]: CPU: 011% Memory: 361856KB
[6:57:13 PM]: CPU: 010% Memory: 361856KB
[6:57:18 PM]: CPU: 012% Memory: 361904KB
[6:57:23 PM]: CPU: 012% Memory: 362024KB
[6:57:28 PM]: CPU: 010% Memory: 362344KB


Right now our server is at 49% total assigned memory usage (2 GB out of 4). It's on the same box as /tg/'s Bagil and Sybil, as well as their wiki and forums.

I'll see about monitoring the memory usage, but I don't think that's likely to be an issue.

As for switching to 513: we eventually will, probably along with /tg/, but from what I gather there are some issues that need smoothing out first.
513 is largely smoothed out already. About the only thing I can figure SS13 would need to deal with would be making sure all the HTML interfaces had the right meta tag to use UTF-8 encoding.
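(In practice that just means making sure each page declares its encoding in the head, e.g. something like <meta charset="UTF-8">.)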

From the data in your file I would have to agree memory is likely not the issue. I'm not as sure what you mean about 49% total assigned memory usage, though; if your server ever did get near using 2GB it would be very prone to crash, since 2GB is the most Dream Daemon can use and it will actually fail before that point. But 275 MB or 360 MB are nowhere near that.

My suggestion would be to continue monitoring the situation to see if you can get more data about what's being loaded and from where. You might be able to discover a specific savefile is at fault, and if so maybe it'll be possible to suss out 1) specifically where in your code the crash is occurring, and 2) what's in that savefile. This is of course assuming all your crashes remain somewhat consistent and point to savefiles as a matter of routine.
Sorry, the issues needing some smoothing out are ours, not BYOND's; chiefly, we need to update BSQL before we can move to 513. Again, we are riding /tg/'s train here.

By 50% memory usage I mean all the server's processes, not just Dream Daemon: https://i.imgur.com/gdQ9HXY.png
There should be plenty to spare.

I'll also look into logging data loads and saves, as our host suggested.
There are other crash dumps if you'd like them for comparison, for example: (EDIT: old link)
I cannot really say whether they share the same cause, but the pattern seems similar.
Update: found the exact place where it crashes.

Context: a contributor added a preference allowing players to use custom CSS for their chat. He set the max file size to a million characters, 1 MB of information. I assumed that was the issue, and it was addressed by reducing the limit to a much more sensible number, but that wasn't it. The last crash was triggered by an admin's file, which is empty (it contains only a single whitespace character):

Code (stripped down version):
/datum/chatOutput/proc/loadClientCSS()
    var/last_savefile = owner.Import() // owner is the client
    var/savefile/F = new(last_savefile)
    WRITE_LOG(GLOB.rw_file_log, "PRE-READING CSS from [owner]")
    clientCSS = read_file(F, "CSS")

/proc/read_file(savefile/file, buffer)
    WRITE_LOG(GLOB.rw_file_log, "READING ([usr ? usr : "no usr"]): [file.name] ([buffer])")
    file[buffer] >> . // read the savefile entry into the proc's default return value
    WRITE_LOG(GLOB.rw_file_log, "READ: [isnull(.) ? "null" : .]")


Logging results:
[2019-12-21 19:45:39.459] PRE-READING CSS from CKEY1
[2019-12-21 19:45:39.459] READING (no usr): 32C86913 (CSS)
[2019-12-21 19:45:39.459] READ:
[2019-12-21 19:45:40.030] PRE-READING CSS from CKEY2
[2019-12-21 19:45:40.030] READING (no usr): 47FEF44A (CSS)
[2019-12-21 19:45:40.030] READ:
[2019-12-21 19:45:40.343] PRE-READING CSS from CKEY3
[2019-12-21 19:45:40.343] READING (no usr): 579C3618 (CSS)
[2019-12-21 19:45:40.343] READ:
[2019-12-21 19:45:40.551] PRE-READING CSS from CKEY4
[2019-12-21 19:45:40.551] READING (no usr): 4586707F (CSS)
(log stops here, along with the crash)


Next round, moments later (the server started, crashed, and restarted; players may not even have noticed).
[2019-12-21 19:45:54.912] PRE-READING CSS from CKEY 5
[2019-12-21 19:45:54.912] READING (no usr): 77295ACC (CSS)
[2019-12-21 19:45:54.912] READ:
[2019-12-21 19:45:54.968] PRE-READING CSS from CKEY 6
[2019-12-21 19:45:54.968] READING (no usr): 478B574A (CSS)
[2019-12-21 19:45:54.968] READ:
[2019-12-21 19:45:54.969] PRE-READING CSS from CKEY 7
[2019-12-21 19:45:54.969] READING (no usr): 7F5FFA67 (CSS)
[2019-12-21 19:45:54.969] READ:
[2019-12-21 19:45:55.509] PRE-READING CSS from CKEY 4
[2019-12-21 19:45:55.509] READING (no usr): 676CF0D9 (CSS)
[2019-12-21 19:45:55.509] READ:
[2019-12-21 19:45:55.669] PRE-READING CSS from CKEY 8
[2019-12-21 19:45:55.669] READING (no usr): 262EE6C8 (CSS)
[2019-12-21 19:45:55.669] READ:
[2019-12-21 19:45:56.180] PRE-READING CSS from CKEY 2
[2019-12-21 19:45:56.180] READING (no usr): 6A3E3E6C (CSS)
[2019-12-21 19:45:56.180] READ:
[2019-12-21 19:45:56.233] PRE-READING CSS from CKEY 9
[2019-12-21 19:45:56.233] READING (no usr): 6C55B8C7 (CSS)
[2019-12-21 19:45:56.233] READ:
[2019-12-21 19:45:56.348] PRE-READING CSS from CKEY 1
[2019-12-21 19:45:56.348] READING (no usr): 6F29EED3 (CSS)
[2019-12-21 19:45:56.348] READ:
[2019-12-21 19:45:56.608] PRE-READING CSS from CKEY 10
[2019-12-21 19:45:56.608] READING (no usr): 4BDCF502 (CSS)
[2019-12-21 19:45:56.608] READ:
[2019-12-21 19:45:56.691] PRE-READING CSS from CKEY 3
[2019-12-21 19:45:56.691] READING (no usr): 67B9999B (CSS)
[2019-12-21 19:45:56.691] READ:
... (goes on, no crashes)


I removed the specific ckeys but numbered them to show that they repeat both where there are crashes and where there aren't, and that the prefs file of the ckey that crashed is empty, just like these others.
There are a few users with custom CSS in these logs that don't cause any issues; I can share them in DMs if needed.

Relevant PR: https://github.com/tgstation/TerraGov-Marine-Corps/pull/3603

Thoughts?
Does that specific file always cause the same issue? If so, a copy of it would be a big help. You can always grab a copy by saving it locally after it's imported.
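For example, something along these lines would keep a server-side copy of the imported file (a rough sketch; the debug path and naming are just placeholders):

/datum/chatOutput/proc/loadClientCSS()
    var/last_savefile = owner.Import()
    if(last_savefile)
        fcopy(last_savefile, "data/debug/[owner.ckey]-chat.sav") // keep a copy of the imported savefile for later inspection
    // ... then continue with the normal read ...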
Sent the savefile in private through a discord message. It's likely there will be other crashes, so I'll see what else comes up.

So far, the three crashes that have happened since the logging went up have all occurred while running something functionally like this:
var/last_savefile = client.Import()
var/savefile/F = new(last_savefile)
var/css_info
F["CSS"] >> css_info


On the first two I didn't have a reference to which client, but on the third one I added that to logging.
Let's see what happens in the next ones.
The issue has been resolved in this PR by removing its source:
https://github.com/tgstation/TerraGov-Marine-Corps/pull/3637

The crash happened when running this code:
// Load the CSS from the local client
/datum/chatOutput/proc/loadClientCSS()
    var/last_savefile = owner.Import()
    if(!last_savefile)
        saveClientCSS("")
        return
    var/savefile/F = new(last_savefile)
    READ_FILE(F["CSS"], clientCSS)

More precisely, when reading the file (READ_FILE(file, text) is a macro for file >> text).
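(Presumably the macro is defined roughly as #define READ_FILE(file, text) file >> text; the exact definition lives in the project's code.)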

From what the reference says, client.Export() saves the file on the client's machine in a location reserved for world.hub, which is shared with other SS13 servers. Since most of the crashes were at round start/setup, it might have been something related to lag between client and server, but that's just wild speculation on my part.
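For context, the save side presumably looks roughly like this (a sketch based on the reference, not the project's exact code; saveClientCSS() is the proc called above):

/datum/chatOutput/proc/saveClientCSS(css_text)
    var/savefile/F = new // no file name, so this is a temporary savefile
    F["CSS"] << css_text
    owner.Export(F) // stores the file client-side, in the slot tied to world.hub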

I consider the issue closed, unless you'd like me to investigate further.