ID:2245438
 
Resolved
Crashes could occur in rare (but fairly consistent) cases when making changes to turf appearance, type, or area in large projects.
BYOND Version:511
Operating System:Windows 10 Home 64-bit
Web Browser:N/A
Applies to:Dream Daemon
Status: Resolved (511.1384)

This issue has been resolved.
Descriptive Problem Summary:
DreamDaemon.exe server crashes on startup, but not every startup.

The sequence of events when the crash happens:
1) Start server. DreamDaemon UI becomes unresponsive (normal behavior) while it loads the world etc.
2) Wait until it becomes responsive again and then connect.
3) More game init code starts running once someone has connected.
4) At some point it will stop executing code prematurely, but the dreamdaemon ui will become responsive again. At this point, if the client disconnects it will instantly crash DreamDaemon.exe

I have tried varying what subsystems in our code startup, adding logging statements etc. But I have unfortunately not been able to exactly pinpoint at which point it crashes. Given the dreamdaemon process crashes I can't be sure any logging I do actually gets flushed anyway.

If the game successfully starts up all the way, I don't have any problems after that point, works perfectly.

I can't find any pattern that influences whether or not any given startup will crash or not.

Project depends on 511 features, so compiling on 510 isn't feasible. But to be specific the crash has been observed on 511.1380, 511.1381 and 511.1382.

Dump Summary (version 511.1380)
This dump file has an exception of interest stored in it.
The stored exception information can be accessed via .ecxr.
(fa0.2b8c): Access violation - code c0000005 (first/second chance not available)
*** ERROR: Symbol file could not be found.  Defaulted to export symbols for KERNELBASE.dll -
eax=00000000 ebx=00000000 ecx=00000000 edx=13ab5d00 esi=00000003 edi=00000003
eip=7718718c esp=00ffcf00 ebp=00ffd090 iopl=0         nv up ei pl nz ac pe nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000216
ntdll!NtWaitForMultipleObjects+0xc:
7718718c c21400          ret     14h

*** ERROR: Symbol file could not be found.  Defaulted to export symbols for byondcore.dll -
eax=00000000 ebx=00ffc320 ecx=00000000 edx=13ab5d00 esi=00000000 edi=125acb10
eip=016ec452 esp=00ffdcf0 ebp=00ffdcfc iopl=0         nv up ei pl nz na po nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010202
byondcore!LocalDB::HubToJS+0x7d62:
016ec452 8b7028          mov     esi,dword ptr [eax+28h] ds:002b:00000028=????????
            ^ Extra character error in '.ecxr.'


Call stack:
00ffdcfc 016eb97c byondcore!LocalDB::HubToJS+0x7d62
00ffdd1c 016d8fec byondcore!LocalDB::HubToJS+0x728c
00ffdda0 017334d2 byondcore!DungServer::ThreadNetMsg+0x2313c
00ffdde0 0174a77d byondcore!DMTextPrinter::ValidateResourceFile+0x2762
00ffe240 01726c5b byondcore!DMTextPrinter::ValidateResourceFile+0x19a0d
00ffe25c 017bde28 byondcore!DungPager::InstallProgress_PIO+0x8c2b
00ffe284 017bc895 byondcore!ByondHttpServerLink::WriteBuffer+0x2618
00ffe294 017ce8c1 byondcore!ByondHttpServerLink::WriteBuffer+0x1085
00ffe2bc 00a87445 byondcore!SocketLib::Event_io+0x1f1
00ffe2cc 0f5d540a dreamdaemon+0x17445
00ffe39c 0f5d50ca mfc120!CWnd::OnWndMsg+0x31d
....


Full dump available upon request


So far I've determined that this crash is happening inside of ScanProcMem() for the currently running proc, and your stack trace does point to the fact that a client has been deleted. (What I'm not sure about is if your code is forcing the disconnect or if it's happening naturally.)

Where this is crashing specifically is right in the beginning of the routine, where it's checking the proc data's associated proc info, and that latter value happens to be null. That should not be possible. Other parts of the code have sanity checks for this, although the ScanProcMem() doesn't.

That sanity check is easy to add and it make sense for me to do so--but far more troubling is that this value is becoming null in the first place, which shouldn't be possible. The null proc pointer explains why in the crashy cases, your code simply stops running prematurely: other sanity checks are catching this and execution doesn't happen.

I'll keep looking into this to see if I can uncover the reason for the null pointer. This is a real head-scratcher so I'm not sure if I'll be able to figure it out without running the code.
Also, I'll have to recheck this but I'm pretty sure all the logging auto-flushes, precisely for this kind of reason.
I've been working on this project and issue with Myryan.

In our case, the client just closes the DreamSeeker window to disconnect.

There are other things that will cause the crash when DreamDaemon is put into this unstable state, though. Clicking some in-game UI elements that would present alert() boxes do it, while other verb usage might be completely ignored.
Okay, if you have time to run the code, I have created a fork of our git repository with the master branch already set up to run and reproduce this. The game is an SS13 fork.

If you clone https://github.com/Leshana/VOREStation-1.git everything is set up to go, just compile vorestation.dme and run vorestation.dmb in dreamdaemon, all configs are already set up.

Crash occours about 30%ish of the time.
When the issue happens the chat box will look like: http://imgur.com/a/Du3MZ and DreamDaemon UI will unfreeze within a couple seconds. Closing DreamSeeker or clicking Observe will crash DreamDaemon at this point.

If you see this: http://imgur.com/su5WBkq you've gone too far, no crash today.

Note: Three people on different computers have run into this, so its not completely specific to my environment. Only tested this on windows so far.
Can you run me through what code happens between that output and "Initializing areas"? It seems that's going to be the main place to look, and it might give you insight as well if there's something, say, stressing system resources too far.

In the event of the system running out of memory, all kinds of weird things can happen, but generally a lot of parts of the server are built to try to handle it gracefully. sleep() and spawn() for instance will fetch a pooled structure for proc info, and create one if they can't, but if that creation fails then they should hit you with an "Out of memory" error rather than allow the situation that's happening with the null pointer.

Unfortunately this being an SS13 server complicates the problem, as I suspect I'm not going to be able to run this properly in the debugger (or see the problem if I do). Certain large codebases like /tg have been known to be difficult to manage in that environment, and I suspect based on your issue that your server falls on the heavier side.
When originally testing this, I inserted world.log outputs before every object.initialize() call in that for() loop, and had DD write out to a file, with the thinking that it's one particular object's initialize() proc doing something mysterious to cause this.

The first time I tested, it did not crash, and I was left with a world.log of about 200,000 lines of that output.

The second time, it stopped output to world.log on reaching:
STARTUP OBJECT: the steel decal (/obj/effect/floor_decal/steeldecal/steel_decals7) (72,81,5)

Whereupon it entered the 'still counting up time, responsive UI, but will crash if anything happens' DreamDaemon state.

The third time, it stopped output to world.log on reaching:
STARTUP OBJECT: the steel decal (/obj/effect/floor_decal/steeldecal/steel_decals4) (36,83,5)

Both of those happened after about 39,000 lines, as opposed to the 200,000 from the successful run. So in those cases, it didn't SEEM to actually finish initializing objects. Either that, or the I/O for world.log was not flushed and I don't have the rest of the log.

This led me to believe it might be something about initialize() on that object type, so I stub'd the initialize() proc on that object, yet it still occurred, just in different places each time.

In another test, I inserted a world.log output above and below the sleep(-1) before initializing objects (having removed the individual world.log outputs for each object, as it made the server very time-consuming to debug). In that instance, it logged the 'before sleep' debug message, slept (I assume transferring execution elsewhere), and during that sleep DD UI control returned and it entered this 'broken' state, waited 6-8 seconds, and then printed the 'after sleep' message.

Did you do any monitoring of memory consumption during these cases? I'm curious where it maxed out, especially during a crash.
I did add some logging statements around the area.

datum/controller/game_controller/proc/setup_objects()
        world.log << "[__FILE__]:[__LINE__]"
        admin_notice("Initializing objects", R_DEBUG)
        world.log << "[__FILE__]:[__LINE__]"
        sleep(-1)
        world.log << "[__FILE__]:[__LINE__]"
        for(var/atom/movable/object in world)
                if(isnull(object.gcDestroyed))
                        object.initialize()

        world.log << "[__FILE__]:[__LINE__]"
        admin_notice("Initializing areas", R_DEBUG)


On successful runs I get all four log prints, on crashy runs I get only the first three. However, on one of my test runs I got a different instruction segfaulting:

eax=00ffd0e0 ebx=00000001 ecx=137f61f8 edx=00000000 esi=a5708d4b edi=00ffbd0c
eip=0fd18e41 esp=00ffb264 ebp=00ffb270 iopl=0         nv up ei ng nz na pe nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010286
byondcore!DungServer::ThreadNetMsg+0x22f91:
0fd18e41 8a4610          mov     al,byte ptr [esi+10h]      ds:002b:a5708d5b=??
            ^ Extra character error in '.ecxr.'

Call stack:
byondcore!DungServer::ThreadNetMsg+0x22f91
byondcore!DungServer::ThreadNetMsg+0x237a4
byondcore!DungServer::ThreadNetMsg+0x230ef
byondcore!DMTextPrinter::ValidateResourceFile+0x27673
byondcore!DungServer::ThreadNetMsg+0xfde3
byondcore!DungServer::ThreadNetMsg+0xcf60
byondcore!DungServer::ThreadNetMsg+0xcea3
byondcore!DungServer::ThreadNetMsg+0x1982b
byondcore!DungServer::ThreadNetMsg+0x23f61
byondcore!DungServer::ThreadNetMsg+0xca63
byondcore!DungServer::ThreadNetMsg+0xc845
byondcore!DMTextPrinter::ValidateResourceFile+0x27d09
byondcore!DungServer::ThreadNetMsg+0x1c9ad
byondcore!DungServer::ThreadNetMsg+0x23f61
byondcore!DungServer::ThreadNetMsg+0xca63
byondcore!DungServer::ThreadNetMsg+0xbc86
byondcore!DungServer::ThreadNetMsg+0x23f61
byondcore!DungServer::ThreadNetMsg+0x1a5ae
byondcore!DungServer::ThreadNetMsg+0x23f61
byondcore!DungServer::ThreadNetMsg+0xca63
byondcore!DungServer::ThreadNetMsg+0xc845
byondcore!ServerWorldConnectionTest+0x3cb4
byondcore!DMTextPrinter::ValidateResourceFile+0x1fba
byondcore!DMTextPrinter::ValidateResourceFile+0x2714
byondcore!DMTextPrinter::ValidateResourceFile+0x19a0d
byondcore!DungPager::InstallProgress_PIO+0x8c2b
byondcore!ByondHttpServerLink::WriteBuffer+0x2618
byondcore!ByondHttpServerLink::WriteBuffer+0x1085
byondcore!SocketLib::Event_io+0x1f1
dreamdaemon+0x17445




P.S. that initialization code sucks. Someday will port /tg's proper sleeping system. But for a quick test we tried sleeping after every 500 objects, did still get a crash under that condition.
I watched memory usage this time. Private bytes never rose above 400mb.

Note: In general I find DreamDaemon never uses much memory. I don't see it use more than 1gb even with several people connected.
I was able to reproduce this on linux, where the segv handler prints the backtrace when it crashes. Here is a hastebin of two of the outputs.

https://hastebin.com/raw/qerowesuci
Wow, linux crash dumps give all that info? I should have switched to linux for testing too.


Interesting... flooring_decals.dm:
17      /obj/effect/floor_decal/initialize()
18              if(supplied_dir) set_dir(supplied_dir)
19              var/turf/T = get_turf(src)
20              if(istype(T, /turf/simulated/floor) || istype(T, /turf/unsimulated/floor) || istype(T, /turf/simulated/shuttle/floor))
21                      var/cache_key = "[alpha]-[color]-[dir]-[icon_state]-[layer]"
22                      if(!floor_decals[cache_key])
23                              var/image/I = image(icon = src.icon, icon_state = src.icon_state, dir = src.dir)
24                              I.layer = T.layer
25                              I.color = src.color
26                              I.alpha = src.alpha
27                              floor_decals[cache_key] = I
28                      if(!T.decals) T.decals = list()
29                      T.decals |= floor_decals[cache_key]
30                      T.overlays |= floor_decals[cache_key]   <---- Line 30
31                      qdel(src)
32              return



I've always wondered if you could do that. I have seen people's code using |= on overlays. It sure is useful to be able to do that, and from what I have seen it works (that line of code has been there for 2 years), but the reference do suggest overlays is write-only. (Although I have seen in practice that it is not. As an aside, would be interested to know the full story about that).

That proc has been unchanged since 2016, but we did recently add a huge number of these objects to the map. So is it a coincidence it mentioned that line or are there shenanigans? I do notice this: http://www.byond.com/forum/?post=2229337
That other thread is unrelated. Altering turf.overlays shouldn't hurt anything; what I noted in that thread was that a reference to an overlays list that was bogus was causing a problem, but in your case T.overlays shouldn't be a bogus list.

While I'll have to look at that Linux crash trace separately, I can tell you that the new non-Linux crash details you posted point to a crash in EraseProcChainSrc()--which again is connected to references being erased. (That specific routine looks at waiting procs and ends them if their src value matches the target ref.)

What's interesting about your newer trace is that the proc pointer is not null, but it is invalid. I don't see how that could happen any more than the nulling of the pointer, but at least it's another data point to work with.
Sorry about this but there is one more thing which I think is irrelevant, but I wanted to include just in case.

The code snippet for /obj/effect/floor_decal/initialize() is what I have right now. But the stack traces given earlier actually were using slightly different code. We had replaced the
qdel(src)
with
spawn(rand(1 SECOND, 60 SECONDS))
                qdel(src)

as an experiment earlier. I only mention because that experimental code would have spawned lots of waiting procs. But it crashes without the spawn too, so this probably is not relevant.

We also tried changing
overlays |=
to
overlays +=
as another experiment.
When you had that STARTUP OBJECT debug in place, where was it? I see it's not in your code, so I have to assume it was only put in there temporarily. I'm curious if it was before or after the init call, because after would be a lot less useful.
Yes, it was before the init call, the first thing in the if() statement inside that for() loop.
Some interesting results on Linux when testing.

I altered the initialize() on decals to print many debug statements as it progresses:
/obj/effect/floor_decal/initialize()
        world.log << "[src] ([type]) ([x],[y],[z])";sleep(-1)
        if(supplied_dir) set_dir(supplied_dir)
        var/turf/T = get_turf(src)
        world.log << "T: [T] ([T.type]) ([T.x],[T.y],[T.z]), [T.overlays.len]";sleep(-1)
        if(istype(T, /turf/simulated/floor) || istype(T, /turf/unsimulated/floor) || istype(T, /turf/simulated/shuttle/floor))
                world.log << "istype = true";sleep(-1)
                var/cache_key = "[alpha]-[color]-[dir]-[icon_state]-[layer]"
                world.log << "cache_key = [cache_key]";sleep(-1)
                var/image/I = floor_decals[cache_key]
                if(!I)
                        I = image(icon = src.icon, icon_state = src.icon_state, dir = src.dir)
                        I.layer = T.layer
                        I.color = src.color
                        I.alpha = src.alpha
                        floor_decals[cache_key] = I
                else
                        I = floor_decals[cache_key]
                world.log << "[I ? "image gotten" : "image failed?"]";sleep(-1)
                if(!T.decals) T.decals = list()
                world.log << "T.decals: [T.decals.len], T.overlays: [T.overlays.len]";sleep(-1)
                T.decals += I
                world.log << "About to overlay.";sleep(-1)
                T.overlays += I
                world.log << "qdel()";sleep(-1)
                qdel(src)
        return


Then ran it until it crashed on Linux, and this is the output. You can see the last few lines of world.log at the top, I included the last two, because they show something interesting.

The border floor (/obj/effect/floor_decal/borderfloorwhite) (57,38,7)
T: the floor (/turf/simulated/floor/tiled/white) (57,38,7), 0
istype = true
cache_key = 255--10-borderfloor_white-2.01
image gotten
T.decals: 0, T.overlays: 0
About to overlay.
qdel()
The pale blue corner (/obj/effect/floor_decal/corner/paleblue/border) (57,38,7)
T:  (/turf/simulated/floor/tiled/white) (57,38,7), 0
istype = true
cache_key = 255-#8bbbd5-10-bordercolor-2.01
image gotten
T.decals: 1, T.overlays: 0
About to overlay.
BUG: Crashing due to an illegal operation!
proc name: initialize (/obj/effect/floor_decal/initialize)
  source file: flooring_decals.dm,40
  usr: null
  src: the pale blue corner (/obj/effect/floor_decal/corner/paleblue/border)
  src.loc:  (57,38,7) (/turf/simulated/floor/tiled/white)
  call stack:
the pale blue corner (/obj/effect/floor_decal/corner/paleblue/border): initialize()
/datum/controller/game_control... (/datum/controller/game_controller): setup objects()
/datum/controller/game_control... (/datum/controller/game_controller): setup()
world: New()

Backtrace for BYOND 511.1380 on Linux:
Generated at Wed May 10 15:44:33 2017

DreamDaemon [0x8048000, 0x0], [0x8048000, 0x804bb24]
libbyond.so [0xf7198000, 0x0], 0x1f4f60
 [0xf76f2000, 0xf76f2bd0], [0xf76f2000, 0xf76f2bd0]
libbyond.so [0xf7198000, 0x0], 0x1f4f60
libbyond.so [0xf7198000, 0x0], 0x23923c
libbyond.so [0xf7198000, 0x0], 0x25bfa9
libbyond.so [0xf7198000, 0x0], 0x26a39e
libbyond.so [0xf7198000, 0x0], 0x2729c3
libbyond.so [0xf7198000, 0x0], 0x27362d
libbyond.so [0xf7198000, 0x0], 0x25179c
libbyond.so [0xf7198000, 0x0], 0x25a8b5
libbyond.so [0xf7198000, 0x0], 0x25a8b5
libbyond.so [0xf7198000, 0x0], 0x26a16c
libbyond.so [0xf7198000, 0x0], 0x2764dd
libbyond.so [0xf7198000, 0x0], 0x2269a9
libbyond.so 0x33c7f0, 0x33c90a
libbyond.so 0x303950, 0x303b52
DreamDaemon [0x8048000, 0x0], [0x8048000, 0x804ae34]
libc.so.6 0x18540, 0x18637 (__libc_start_main)
DreamDaemon [0x8048000, 0x0], [0x8048000, 0x804a731]

Recent proc calls:
/obj/effect/floor_decal/initialize
/atom/movable/Del
/obj/proc/unbuckle_mob
/obj/Del
/datum/controller/process/garbage_collector/proc/AddTrash
/proc/IsPooled
/atom/finalize_qdel
/datum/nanomanager/proc/close_uis
/datum/proc/Destroy
/datum/Destroy
/decl/observ/proc/raise_event
/decl/observ/destroyed/raise_event
/datum/Destroy
/atom/proc/get_listener_list_from_event
/atom/Destroy
/atom/Destroy


As you can see, the last two operations were on the same turf, at 57,38,7. But for some reason, the second debug statement lists the [T] as simply null, or an empty string, though is somehow able to retrieve the type, and also appears to be able to get the length of T.overlays, but claims it is 0. It printed normally as "T: the floor" in the previous decal's application, then just prints "T: (type)" when this next decal is being applied, and the 0 length overlays.

It does appear to realize that decals is now of length 1.

I don't know if this indicates that the previous one somehow breaks the turf.
I added a 'predictor' to try and predict when we would crash by examining overlays after trying to apply one, to make sure it was applied.

What I actually discovered is that this happens far more often than it crashes, but only crashes sometimes.

This is what happens when the game does not crash, but has broken a turf. It is somehow replaced with a reference to client, which has the icon of the overlay which is 'gone':
http://i.imgur.com/yl1uuJC.png

87540-The border floor (/obj/effect/floor_decal/borderfloorwhite) (59,38,7)
87541-T: the floor (/turf/simulated/floor/tiled/white) (59,38,7), [0x1016312] 0
87542-istype = true
87543-cache_key = 255--2-borderfloor_white-2.01
87544-image gotten
87545-T.decals: 0, T.overlays: 0
87546-About to overlay.
87547:We are about to crash! The border floor (/obj/effect/floor_decal/borderfloorwhite) on  (/turf/simulated/floor/tiled/white) (59,38,7) [0x1016312]
87548-
87549-
87550-
87551-qdel()
87552-The pale blue corner (/obj/effect/floor_decal/corner/paleblue/border) (59,38,7)
87553-T: the client (/turf/simulated/floor/tiled/white) (59,38,7), [0x1016312] 0
87554-istype = true
87555-cache_key = 255-#8bbbd5-2-bordercolor-2.01
87556-image gotten
87557-T.decals: 1, T.overlays: 0
87558-About to overlay.
87559-qdel()
87560-The border floor (/obj/effect/floor_decal/borderfloorwhite/corner2) (59,38,7)
87561-T: the client (/turf/simulated/floor/tiled/white) (59,38,7), [0x1016312] 1
87562-istype = true
87563-cache_key = 255--2-borderfloorcorner2_white-2.01
87564-image gotten
87565-T.decals: 2, T.overlays: 1
87566-About to overlay.
87567-qdel()
87568-The pale blue corner (/obj/effect/floor_decal/corner/paleblue/bordercorner2) (59,38,7)
87569-T: the client (/turf/simulated/floor/tiled/white) (59,38,7), [0x1016312] 2
87570-istype = true
87571-cache_key = 255-#8bbbd5-2-bordercolorcorner2-2.01
87572-image gotten
87573-T.decals: 3, T.overlays: 2
87574-About to overlay.
87575-qdel()


My 'we are about to crash' predictor (which is more of a 'we broke overlays' predictor) has a sleep 100 in it, so I can see/read it in the stream of world.log.

It seems like once it comes back from this sleep, it's now called "the client" instead of the previous null name.
Curious. This suggests to me that somehow the appearance of the turf has been broken, so that perhaps its refcount has been set too low and it's getting recycled prematurely. Can you print out \ref[T.appearance] before and after the overlay?
Of course the turf appearance being broken doesn't actually explain the fact that a proc pointer also broke. The problems you're having seem to be much more specific than generic heap corruption, however.
Page: 1 2 3