ID:2024837
 
BYOND Version:509
Operating System:Windows Server 2012 R2 64-bit
Web Browser:Chrome 47.0.2526.111
Applies to:Dream Daemon
Status: Open

Descriptive Problem Summary:


Round restarted, followed by:

[00:29:32] [+Mrs_Sybil] AUTOMATIC ANNOUNCEMENT : Server | Round just ended.
[00:31:58] [+Mrs_Sybil] AUTOMATIC ANNOUNCEMENT : WATCHDOG | Server exit detected. Restarting server in 60 seconds.
[00:34:23] [+Mrs_Sybil] AUTOMATIC ANNOUNCEMENT : WATCHDOG | Server exit detected. Restarting server in 60 seconds.

So far I have narrowed the crash point down to the part of the code that generates a minimap, an icon-proc-heavy function that also accesses some files (to cache the generated minimap, or to get the minimap from cache).

Here is the code for that: https://github.com/tgstation/-tg-station/blob/47f08f3714d9e8ef36485a11e6175ba619d1362d/code/controllers/subsystem/minimap.dm

Just before one of the crashes, Initialize() in that file was run. I don't know about the other crash, but it's most likely the same.

The two crashes have different offsets:

Faulting application name: dreamdaemon.exe, version: 5.0.509.1319, time stamp: 0x56901beb
Faulting module name: byondcore.dll, version: 5.0.509.1319, time stamp: 0x56901af4
Exception code: 0xc0000005
Fault offset: 0x000d0451
Faulting process id: 0x288c
Faulting application start time: 0x01d15792395968ac
Faulting application path: c:\Program Files (x86)\BYOND\bin\dreamdaemon.exe
Faulting module path: c:\Program Files (x86)\BYOND\bin\byondcore.dll
Report Id: 647d0bb0-c407-11e5-80c6-00155d6ef20a
Faulting package full name:
Faulting package-relative application ID:

Faulting application name: dreamdaemon.exe, version: 5.0.509.1319, time stamp: 0x56901beb
Faulting module name: byondcore.dll, version: 5.0.509.1319, time stamp: 0x56901af4
Exception code: 0xc0000005
Fault offset: 0x000e0c12
Faulting process id: 0x2528
Faulting application start time: 0x01d158144b708a83
Faulting application path: c:\Program Files (x86)\BYOND\bin\dreamdaemon.exe
Faulting module path: c:\Program Files (x86)\BYOND\bin\byondcore.dll
Report Id: bb51719c-c407-11e5-80c6-00155d6ef20a
Faulting package full name:
Faulting package-relative application ID:

Another one: after 2 round restarts, it crashed once and came back up.

Faulting application name: dreamdaemon.exe, version: 5.0.509.1319, time stamp: 0x56901beb
Faulting module name: byondcore.dll, version: 5.0.509.1319, time stamp: 0x56901af4
Exception code: 0xc0000005
Fault offset: 0x000d0451
Faulting process id: 0x2724
Faulting application start time: 0x01d15814a17cdc10
Faulting application path: c:\Program Files (x86)\BYOND\bin\dreamdaemon.exe
Faulting module path: c:\Program Files (x86)\BYOND\bin\byondcore.dll
Report Id: cab5bcd2-c413-11e5-80c6-00155d6ef20a
Faulting package full name:
Faulting package-relative application ID:
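
As a minimal sketch (not part of the original report), the records above can be grouped by module, version and fault offset with a short Python script. The function name `group_by_fault_site` and the trimmed `SAMPLE` text are illustrative assumptions. The version is part of the key because an offset is only comparable within one build of byondcore.dll.

```python
import re
from collections import defaultdict

# Trimmed copies of the three crash records quoted above.
SAMPLE = """\
Faulting module name: byondcore.dll, version: 5.0.509.1319, time stamp: 0x56901af4
Exception code: 0xc0000005
Fault offset: 0x000d0451
Report Id: 647d0bb0-c407-11e5-80c6-00155d6ef20a

Faulting module name: byondcore.dll, version: 5.0.509.1319, time stamp: 0x56901af4
Exception code: 0xc0000005
Fault offset: 0x000e0c12
Report Id: bb51719c-c407-11e5-80c6-00155d6ef20a

Faulting module name: byondcore.dll, version: 5.0.509.1319, time stamp: 0x56901af4
Exception code: 0xc0000005
Fault offset: 0x000d0451
Report Id: cab5bcd2-c413-11e5-80c6-00155d6ef20a
"""


def group_by_fault_site(text):
    """Map (module, version, offset) -> list of Report Ids."""
    groups = defaultdict(list)
    module = version = offset = None
    for line in text.splitlines():
        m = re.match(r"Faulting module name: (\S+), version: (\S+),", line)
        if m:
            module, version = m.group(1), m.group(2)
            continue
        m = re.match(r"Fault offset: 0x([0-9a-fA-F]+)", line)
        if m:
            offset = int(m.group(1), 16)
            continue
        m = re.match(r"Report Id: (\S+)", line)
        if m and module is not None and offset is not None:
            # A Report Id line closes one record.
            groups[(module, version, offset)].append(m.group(1))
            module = version = offset = None
    return dict(groups)


groups = group_by_fault_site(SAMPLE)
for (module, version, offset), reports in sorted(groups.items()):
    print(f"{module} {version} +0x{offset:x}: {len(reports)} crash(es)")
```

On the sample this reports two crashes at +0xd0451 and one at +0xe0c12, which matches the records as posted.
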
My tracing says the error at d0451 is in EraseProcChainSrc(), which says it's being passed an invalid pointer for the proc info. That suggests something got seriously borked. This routine is only called in three places, and the proc reference could refer to either a sleeping proc in the queue, the currently running proc if any, or a proc waiting for a callback. Without a full stack trace I can't narrow down which of these cases it is.

The e0c12 crash is in ScanProcMem(), so this can't be a coincidence. Again I think I need a full stack trace to narrow down which type of proc this is happening to.

Heap corruption is a possibility, but it feels like if that were happening you'd have much more evidence of it. So working with the assumption that some more casual corruption is going on (a mistake in failing to reset a value, perhaps), it seems like the very best candidate would be the "background" procs waiting for a callback--things like winget() or world.Export(), for instance. They tend to be the least well-behaved.

Shooting from the hip here, I wonder if you're using try/catch in proximity to any of these, and if there's maybe a scenario that the catch is screwing up. Or alternatively, maybe there's a runtime error that might be appearing in your logs. Do you have anything along those lines?

[edit]
Doh, I skipped past the part about having narrowed down the part of the code where this occurs. I see no try/catch nor anything backgroundy involved. Very strange, this.
Found it.


https://github.com/tgstation/-tg-station/blob/47f08f3714d9e8ef36485a11e6175ba619d1362d/code/controllers/subsystem/minimap.dm#L67

var/list/obj_icons = list()

When list() overwrites the var inside the loop, the old list isn't gc'ed until the end of the proc, so the old lists hang around and use up memory. This causes the global list of lists to get expanded to use 800mb of memory, then 1.6gb. Finally it attempts to runtime because it can't expand to 3.2gb, and fails at that as well because it can't allocate a new list to hold a copy of the arglist. (That last part is just a guess, I can't see the code.)

This leads to either the proc hanging and everything else hanging, so that byond is basically in a "frozen" state (dd is responsive, but not running procs or ticks or anything), or a crash, depending on factors I haven't pinpointed yet.
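
The arithmetic in that guess can be sketched in a few lines of Python. This only illustrates the 800mb, 1.6gb, 3.2gb sequence. The doubling policy and the 2400 MB ceiling are assumptions (the ceiling is taken from the ~2.4GB crash figure reported later in this thread), not confirmed BYOND internals, and `first_failed_growth` is a hypothetical helper name.

```python
MB = 1024 * 1024


def first_failed_growth(start_bytes, limit_bytes):
    """Return (sizes that fit, first size that does not) when capacity doubles."""
    fits = []
    size = start_bytes
    while size <= limit_bytes:
        fits.append(size)
        size *= 2  # assumed growth policy: double on every expansion
    return fits, size


# Assumed ceiling of 2400 MB for the 32-bit dreamdaemon.exe process.
fits, failed = first_failed_growth(800 * MB, 2400 * MB)
print("fits:", [s // MB for s in fits], "MB; fails at:", failed // MB, "MB")
```

Under these assumptions the 800 MB and 1600 MB steps fit and the 3200 MB step is the first that cannot be satisfied.
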
Relevant to this is that procs have two built-in "scratch" vars: v1 and v2. So if a lot of procs are hanging around, that could potentially be an issue.
scratch vars?

I'm not following.
In response to MrStonedOne
MrStonedOne wrote:
scratch vars?

I'm not following.

Like, buffer or accum variables. They're meant to clear when the proc ends (its allocated stack is gone); if the proc never ends, the proc's stack stays.
In this case, it was just one long running world initialization proc (34 seconds) that used vars in a loop.

In this case, a list var.

The old list on the next iteration would stick around.
ok, that MIGHT not have fixed it...

I'm getting reports it crashed again.
Descriptive Problem Summary:
Faulting application name: dreamdaemon.exe, version: 5.0.510.1322, time stamp: 0x56afd430
Faulting module name: byondcore.dll, version: 5.0.510.1322, time stamp: 0x56afd331
Exception code: 0xc0000005
Fault offset: 0x000d3861
Faulting process id: 0x1b08
Faulting application start time: 0x01d15ff136ce167f
Faulting application path: c:\Program Files (x86)\BYOND\bin\dreamdaemon.exe
Faulting module path: c:\Program Files (x86)\BYOND\bin\byondcore.dll
Report Id: 8f1aabc6-cc5c-11e5-80c7-00155d6ef20f
Faulting package full name:
Faulting package-relative application ID:
In response to MrStonedOne
I reattached your report here because it's the exact same crash.
Ya, something odd is going on with /icon/s. No matter what I do, it seems to randomly either go over memory or use only 100mb more than our idle memory usage.

I confirmed this crash was also with minimap generation shortly after making the bug report, but forgot to comment.
So ya, I can't figure this out.

Minimap generation either raises DD usage an additional ~100mb or it causes an OOM crash by raising it to above 2.4GB (from about 400mb).

It used to use 1.6GB or raise above 2.4 and cause a crash (the cause of the original bug/crash report). I made one change and it was doing so good for a week or so, always using less than 600mb during generation, and now it's crashing repeatedly on that section since the 510 upgrade.

How do icons handle blends? Because the fact that making that change helped seems like a bug, as if blending was increasing those icons' ref counts, and copying to a new icon and deleting the old one worked around that.
I suppose you could add debugging to track /icon datums, and also perhaps the built-in objects (stored in icon.icon; it may be either a special object or a cache reference). That might help to determine if there's some kind of leak.
So it's about time I get back on this issue. It's still happening, but at a different crash point.

It will either take only ~30 extra MB to process (and not crash), or it will bounce between 30mb and an ever-increasing amount above resting usage (if, say, resting usage is 600mb, it would go 630, then 700, then 630, then 900, then 630, then 1,200, then crash). It does not happen on a test project, which leads me to suspect it has something to do with the actual content of an icon tripping stuff up (meaning it might only happen when certain things spawn randomly).
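
For anyone logging memory during a soak test, the bounce pattern described here could be flagged with a sketch like the following. The sample values are the ones from the post. The helper names, the 630 MB baseline and the 10 MB tolerance are illustrative assumptions.

```python
# Memory readings in MB, taken from the example in the post.
SAMPLES_MB = [630, 700, 630, 900, 630, 1200]


def excursions(samples_mb, baseline_mb, tolerance_mb=10):
    """Readings that rise noticeably above the expected baseline."""
    return [s for s in samples_mb if s > baseline_mb + tolerance_mb]


def is_escalating(peaks):
    """True when there are at least two peaks and each is higher than the last."""
    return len(peaks) >= 2 and all(a < b for a, b in zip(peaks, peaks[1:]))


peaks = excursions(SAMPLES_MB, baseline_mb=630)
print("peaks:", peaks, "escalating:", is_escalating(peaks))
```

A run that stays at the baseline produces no peaks, so only the crashing pattern is flagged.
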

We attempted to force the proc's scratch bin to get cleared out by moving the work to a sub proc but that was no help.

Offending code: https://github.com/tgstation/tgstation/blob/0fb0688acc184d2e8bf64d74f0f68a7d30170868/code/controllers/subsystem/minimap.dm#L70 (it's hacky as hell; some of it is speed optimizations, some of it is us attempting to bypass the issue)
Faulting application name: dreamdaemon.exe, version: 5.0.511.1357, time stamp: 0x57dc504e
Faulting module name: MSVCR120.dll, version: 12.0.21005.1, time stamp: 0x524f7ce6
Exception code: 0xc0000005
Fault offset: 0x0000f8c5
Faulting process id: 0x2ffc
Faulting application start time: 0x01d2133ee9d5b89f
Faulting application path: C:\Program Files (x86)\BYOND\bin\dreamdaemon.exe
Faulting module path: C:\Program Files (x86)\BYOND\bin\MSVCR120.dll
Report Id: 6dcb732f-7f32-11e6-ba05-6805ca088e14

eax=00000000 ebx=2d1ceb70 ecx=00000000 edx=00020000 esi=36f70020 edi=00000000
eip=5da6f8c5 esp=001ead04 ebp=001ead30 iopl=0 nv up ei pl nz ac pe nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00010216
msvcr120!wcsicmp+0x85:
5da6f8c5 660f7f07 movdqa xmmword ptr [edi],xmm0 ds:002b:00000000=????????????????????????????????

0:000> k
ChildEBP RetAddr
WARNING: Stack unwind information not available. Following frames may be wrong.
001ead30 5c9847b6 msvcr120!wcsicmp+0x85
001ead44 5c9803ad byondcore!DMFrame::operator=+0x26
001ead8c 5c982ac3 byondcore!cropIcon+0x70d
001eadac 5c9d8d93 byondcore!overlayIconIcon+0x23
001eade8 5c9d7b1c byondcore!DMTextPrinter::HLine+0x813
001eae14 5c9f0ed4 byondcore!DungServer::ThreadNetMsg+0x4dbc
001eb628 5c9f9a31 byondcore!DMTextPrinter::HLine+0x18954
001eb70c 5c9e2543 byondcore!DMTextPrinter::HLine+0x214b1
001eb758 5c9e1796 byondcore!DMTextPrinter::HLine+0x9fc3
001ebf80 5c9f9a31 byondcore!DMTextPrinter::HLine+0x9216
001ec064 5c9e2543 byondcore!DMTextPrinter::HLine+0x214b1
001ec0b0 5c9e1796 byondcore!DMTextPrinter::HLine+0x9fc3
001ec8d8 5c9e957f byondcore!DMTextPrinter::HLine+0x9216
001ed0d8 5c9e957f byondcore!DMTextPrinter::HLine+0x10fff
001ed8d8 5c9e957f byondcore!DMTextPrinter::HLine+0x10fff
001ee0d8 5ca08d01 byondcore!DMTextPrinter::HLine+0x10fff
001ee10c 5cacb96a byondcore!LocalDB::HubToJS+0x4a51
001ee140 00407562 byondcore!TimeLib::SystemAlarm+0x10a
001ee14c 5c71540a dreamdaemon+0x17562


Byond version 511.1357

Minidump: https://tgstation13.org/msoshit/crashdumps/dreamdaemon-mini160920063601.zip
Fulldump available as well, but it's 1.8GB and unlikely to be useful.
I believe the cropIcon+ is referring to a static routine since that's far enough out from the main cropIcon routine to suspect it's not the same one, but DMFrame::operator= and overlayIconIcon both look like they're happening in those actual routines, which is very interesting.

I believe this crash in the icon routines is happening because you ran out of memory, and there just wasn't adequate sanity checking in this spot. In doIconIcon(), the main routine that handles all icon blends (it was probably jumped to by overlayIconIcon()), there's a section that will expand the target icon state if it doesn't have the right number of animation frames or dirs. That expansion is most likely to run out of memory if the target icon is very large.

All that said, I don't think that necessarily connects to the crash originally reported in this thread. Running out of memory on an icon op shouldn't translate to a block proc stack, which is what was happening in those first two cases.
msvcr120.dll!memcpy(unsigned char * dst=0x00000282, unsigned char * src=0x00000002, unsigned long count=0) Line 745 Unknown


This is what Visual Studio shows me. Somehow I doubt there are pointers at the addresses 282 and 2.

edit: most likely an overflow, but the fact that src has also overflowed is odd
Alright, so I've tentatively narrowed it down to a proc we use called getFlatIcon. (Yet another bug that would be mitigated by ID:2039888. Please please give us a byond version of getFlatIcon!)

https://github.com/tgstation/tgstation/blob/634662d6b4c9a3573a503d6f3f832ff62b59dfd3/code/__HELPERS/icons.dm#L648

I replaced the line in Generate() that used getFlatIcon with just pulling the icon without overlays and it prevented the issue.

So far I have it doing a reboot, generate, reboot, generate... loop and it hasn't crashed yet.

So it's been generating minimaps under the no-getFlatIcon version for 2 hours now, no crashes.