ID:2448877
 
Resolved
Large numbers of turf changes in a short time, especially involving changes to their overlays/underlays, could sometimes cause appearances to be prematurely deleted or rendered invalid, which could cause crashes. This was an intermittent issue brought on by high-stress situations.
BYOND Version:512
Operating System:Windows 10 Pro 64-bit
Web Browser:Firefox 66.0
Applies to:Dream Daemon
Status: Resolved (512.1467)

Descriptive Problem Summary: Upon running certain large maps on our server, DreamDaemon will crash with an access violation (0xc0000005) error without leaving any logs, runtimes, or other events that may indicate a specific cause.

Numbered Steps to Reproduce Problem:
Run one of the two (possibly three) maps where the issue exists.
This is a guess, but the crashes seem to be related to explosions, which damage the map in a localized region. When a crash does happen, the final log entry is usually an explosion, but not always.

Code Snippet (if applicable) to Reproduce Problem:
Not applicable.


Expected Results: No access violation error.

Actual Results:
Faulting application name: dreamdaemon.exe, version: 5.0.512.1466, time stamp: 0x5c9e4325
Faulting module name: byondcore.dll, version: 5.0.512.1466, time stamp: 0x5c9e42ae
Exception code: 0xc0000005
Fault offset: 0x000ff72b
Faulting process id: 0x1168
Faulting application start time: 0x01d4e82581432a52
Faulting application path: C:\byond\bin\dreamdaemon.exe
Faulting module path: C:\byond\bin\byondcore.dll


Faulting application name: dreamdaemon.exe, version: 5.0.512.1466, time stamp: 0x5c9e4325
Faulting module name: byondcore.dll, version: 5.0.512.1466, time stamp: 0x5c9e42ae
Exception code: 0xc0000005
Fault offset: 0x000ff5a8
Faulting process id: 0x1e20
Faulting application start time: 0x01d4e7be48c1a249
Faulting application path: C:\byond\bin\dreamdaemon.exe
Faulting module path: C:\byond\bin\byondcore.dll

0x000ff72b seems to be the most prevalent fault offset.

Does the problem occur:
Every time? Or how often? Very consistently when the map is run on the public server. Not always triggerable on a private instance.
In other games? N/A
In other user accounts? N/A
On other computers? Possibly?

When does the problem NOT occur? When any other map is run.

Did the problem NOT occur in any earlier versions? If so, what was the last version that worked? (Visit http://www.byond.com/download/build to download old versions for testing.)
Appeared spontaneously while we were running 512.1462, though not immediately after we upgraded to it. It has persisted up through 1466. 1464 was also run for a while, and the issue existed there as well.

Workarounds:
Do not run affected maps.

An additional note:
This is a list of our maps with their respective line counts, and note of which ones crash. The offending map files will be sent privately.

129 Z.04.Low_Orbit.dmm
32722 Z.02.Admin_Level.dmm
55819 Z.01.Whiskey_Outpost_v2.dmm
60506 Z.01.LV624.dmm
90146 Z.01.BigRed_v2.dmm
113527 Z.01.Ice_Colony_v2.dmm <- Crashes
128893 Z.03.USS_Almayer.dmm
139082 Z.01.Prison_Station_FOP.dmm <- Crashes
208409 Z.01.Desert_Dam.dmm <- Not in rotation enough to report on.
829233 total

Note that all non-Z.01 maps are always compiled and present. Z.01 maps rotate.
The crash data I'm reading suggests this is not necessarily due to your large maps (although I haven't ruled that out exactly), but rather it looks exactly like the fix in 512.1466 for turf.visual_contents not always resetting properly. Yet you're running 512.1466, and the fix in question applied to a regression bug that appeared in 1465 (not earlier) and showed up only after world reboots.

To be both more and less specific, it appears the crash is happening when reading a turf's appearance vars. The appearance in question does not, it seems, exist. This should never happen.

In the aforementioned bug, this was happening when appearances got destroyed prematurely, due to references hanging around after a world reboot; those bogus refs being decremented caused valid appearances to die early.

What this case reminds me of is when there were issues with unique turf cell IDs. Past 64K of those (no one map file can pass that limit though), it was sometimes running into problems due to some broken logic. However that too was fixed a while back, and I can't find any evidence of a recent server change that would have caused your issues to appear in 1462. (The most recent server change up to and including 1462 was in 1454.)

It would help to have more info on some of this.

1) What happens during explosions? Are turf appearances often changed?
2) How many turfs, roughly, go through any changes? (A ballpark number rather than a percentage would be great.)
3) Does this issue occur after the subsequent reboot, or during/after the explosion?
4) Are you using visual contents for turfs?
5) Are visual contents on any of the exploded turfs?
1. Explosions are recursive and attempt to go around walls and other obstacles when possible. Therefore, the size of the surrounding area not blocked by dense objects influences the spread and travel of the explosion. The explosion calls an ex_act proc on every turf in range, and lets that turf decide what to do with that information. Floors that can sustain damage have their icons changed, and a var marking them as broken is set on them. Many turfs may also delete themselves depending on their damage.

2. With an explosion from a commonly sized source (an RPG round), a test in optimal spread conditions averaged about 500 turfs being accessed. In a maximum-size explosion, over 2,500 turfs were accessed. (Note that in local test environments, we have not been able to induce a crash regardless of explosion size.)

I would ballpark the average size somewhere in the middle, say 800-1,000, as there are sources of larger explosions used, but they normally are not maximum size.

3. The crash is immediate. The only indicator of the cause is that the final entry in our log file before the crash is the standard line our game mode writes when an explosion is about to happen. To my knowledge, DreamDaemon has never crashed upon reboot.

4 & 5. vis_contents is never used in our game-mode.

An additional note:

Here are further fault offsets that have been seen, along with their BYOND versions, since this existed before the vis_contents fix. Contrary to my original post, my page fault logs indicate this actually began on 1454. Relevant info for that has been included below.

0x000fdef2 - 1466
0x000ff72b - 1466 (Most common for 1466)
0x000ff5a8 - 1466
0x000ff7cf - 1466
0x000ff0ab - 1462 (Most common for 1462)
0x000ff14f - 1462
0x000fe8ff - 1454 (Most common for 1454)
0x000fe85b - 1454
Curious. Follow-ups:

1) You mentioned turfs can be deleted by the explosion process. How do they do that? I.e. with what code? In fact it would probably be helpful to see all code related to the explosion; maybe if you sent me the codebase and pointed out which files and which procs were responsible, that would help a lot.

2) It makes sense that 1454 or earlier would begin this crash since 1454 was the last server change prior to 1465. However the only server changes in 1454 are to visual contents, which you said isn't used (are you positive about that?), and to a problem with temp files.

I took special note of the fact that you said visual contents is not used in your game mode. Does that mean it's still in the codebase somewhere? Because if it's in your codebase at all, maybe it is being used in your game mode and you're not necessarily aware of it; like perhaps it's used by a certain kind of object (security monitor maybe?). It would make sense in light of 1454's fix, which impacted hard-deleted objs and mobs used in visual contents, and I expect that explosions are a time for hard deletions.

Additionally, it would be helpful if your logging could indicate where the explosion is about to happen and what kinds of things are in range of it: special turfs, computers, mobs, cameras, anything you think might be important to take note of. That might pry out additional clues.

[edit]
Forgot to mention, but even without looking I can tell those offsets are into the same code. They're very consistent across versions, and some of the differences line up.
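
As a rough illustration of differences lining up, here is a minimal Python sketch (an offline check, not BYOND code and not how the offsets were originally inspected). It takes the fault offsets listed earlier in this thread and looks for a gap between two offsets that recurs in every build. The names offsets_by_build, pairwise_gaps and shared_gaps are made up for this example.

```python
from itertools import combinations

# Fault offsets into byondcore.dll as reported in this thread, keyed by build.
offsets_by_build = {
    1466: [0x000fdef2, 0x000ff72b, 0x000ff5a8, 0x000ff7cf],
    1462: [0x000ff0ab, 0x000ff14f],
    1454: [0x000fe8ff, 0x000fe85b],
}

def pairwise_gaps(offsets):
    """Return the set of absolute distances between every pair of offsets."""
    return {abs(a - b) for a, b in combinations(offsets, 2)}

def shared_gaps(table):
    """Return the gaps that show up in every build's set of offsets."""
    gap_sets = [pairwise_gaps(offsets) for offsets in table.values()]
    return set.intersection(*gap_sets)

if __name__ == "__main__":
    for gap in sorted(shared_gaps(offsets_by_build)):
        print(hex(gap))  # prints 0xa4
```

A gap of 0xa4 bytes separates one pair of crash sites in each of 1454, 1462 and 1466. That is consistent with the same routine simply sitting at a different address in each build, though the arithmetic alone does not prove it.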
Here is an example of one such object that deletes itself if impacted by a sufficiently sized explosion.

//Pipe affected by explosion
/obj/machinery/disposal/ex_act(severity)
	switch(severity)
		if(0 to EXPLOSION_THRESHOLD_LOW)
			if(prob(25))
				qdel(src)
		if(EXPLOSION_THRESHOLD_LOW to EXPLOSION_THRESHOLD_MEDIUM)
			if(prob(60))
				qdel(src)
				return
		if(EXPLOSION_THRESHOLD_MEDIUM to INFINITY)
			qdel(src)
			return


Here is the qdel proc.

/proc/qdel(const/datum/D, ignore_pooling = 0, ignore_destroy = 0)
	if(isnull(D))
		return
	if(!istype(D))
		del(D)
		return

	if(D.being_sent_to_past())
		return

	if(isnull(garbageCollector))
		del(D)
		return

	if(istype(D, /atom) && !istype(D, /atom/movable))
		if(istype(D, /turf/))
			var/turf/ot = D
			ot.Dispose()
			return
		else
			WARNING("qdel() passed object of type [D.type]. qdel() cannot handle unmovable atoms.")
			del(D)
			garbageCollector.hard_dels++
			garbageCollector.dels_count++
			return

	//This is broken. The correct index to use is D.type, not "[D.type]"
	if(("[D.type]" in masterdatumPool) && !ignore_pooling)
		returnToPool(D)
		return

	if(isnull(D.gcDestroyed))
		// Let our friend know they're about to get fucked up.
		if(!ignore_destroy)
			D.Dispose()

		garbageCollector.addTrash(D)


Where vis_contents are concerned, I confirmed this with a simple grep. Note this is my home machine, not our production one.

jamie@jamie-battlestation :~/ColonialMarines$grep -r "vis_content"
.git/packed-refs:f2d8025fae0ee358c77e8418c01ea090a0ea142b refs/remotes/origin/vis_contents
.git/FETCH_HEAD:f2d8025fae0ee358c77e8418c01ea090a0ea142b not-for-merge branch 'vis_contents' of gitlab.com:cmdevs/ColonialMarines
jamie@jamie-battlestation :~/ColonialMarines$


There is a trace of an old, defunct attempt to implement vis_contents, but the results didn't line up with our hopes, and the branch never got past "R&D."

I had sent you a Pager message about the codebase earlier. If you are interested in seeing the repo properly, I would have absolutely no problem giving you access to our Gitlab, and hence all of the resources my developers have. If you would prefer a simpler option like a compressed archive, I can send one of those as well. I had mentioned over Pager that there was an update the day before the crashes began in which some overlays for wall turfs were changed, particularly relating to when they are damaged. Once you pointed us at turfs, this was the primary commit that "stuck out." However, describing its changes in words is easier said than done.

The entire codebase is about 550M in size once unpacked.

Hey Lummox,

I am one of the developers for this codebase.
I had a bit of time to quickly look into a few things.

One odd thing is that both the ice and prison map sometimes have 2 turfs in the same location in the map files.
(On prison 28, 124,1) (On ice nearly all ice cave walls have flooring underneath.)
You mentioned turfs and I noticed this when someone asked why ice walls spawned in during the game looked different. I found out that it adds one of the turfs as an overlay and that was enough to make them look the same when spawned in.


But since I cannot find any documentation on this I wondered if this also uses vis contents.

It is hard for me to check the other maps because I have to do so by hand. I asked another developer to run a query and expect to know if this is being done on other maps as well.

I am not a mapper and cannot tell you why this was done.
I did think it might be worth telling you.
So, the query has run.
Maybe this helps a bit.

Z.01.BigRed_v2.dmm 4 tokens, 26 locs
Z.01.Desert_Dam.dmm 78 tokens, 120 locs <-- Also crashes
Z.01.Ice_Colony_v2.dmm 260 tokens, 2712 locs <-- Crashes
Z.01.LV624.dmm 0 tokens, 0 locs
Z.01.Prison_Station_FOP.dmm 71 tokens, 168 locs <-- Crashes
Z.01.Whiskey_Outpost_v2.dmm not checked
Z.02.Admin_Level.dmm not checked
Z.03.USS_Almayer.dmm 26 tokens, 29 locs
Z.04.Low_Orbit.dmm 0 tokens, 0 locs

Now, this can just be a symptom of having a bigger map. It could also be the issue.
Z-level one changes between rounds; the others stay the same. The explosions that crash the server mostly happen on the 3rd z-level.

Turf-wise, most of what is in range during explosions is floors, walls, and catwalks.
In response to Awan
Awan wrote:
One odd thing is that both the ice and prison map sometimes have 2 turfs in the same location in the map files.
(On prison 28, 124,1) (On ice nearly all ice cave walls have flooring underneath.)
You mentioned turfs and I noticed this when someone asked why ice walls spawned in during the game looked different. I found out that it adds one of the turfs as an overlay and that was enough to make them look the same when spawned in.

Yes, that's correct. Specifically, one turf is on top and the other is an underlay. But a turf having underlays shouldn't really make any difference in terms of destructibility, and there isn't any new-to-1454 behavior that would account for any problems.

Those token and loc counts don't look right at all, unless those are only the counts for token and locs that use turf underlays.
These token and loc counts are locs/tokens with multiple turfs. I wanted to make sure that this is not what is causing the issues and that it is not a hidden way to use vis_contents. It is one of the things we do with turfs I noted down as weird.
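
For anyone wanting to repeat that kind of count, a minimal Python sketch follows. This is not the query that was actually run. It assumes the classic .dmm text layout (one "key" = (...) definition per line, followed by grid blocks of fixed-width keys between {" and "}), and it assumes a "token" is a dictionary key whose definition lists more than one /turf path and a "loc" is a grid cell using such a key. The names split_top_level and count_multi_turf are made up for this example.

```python
import re

# One dictionary entry per line, e.g.  "aa" = (/turf/a,/turf/b,/area/c)
KEY_LINE = re.compile(r'^"([A-Za-z]+)" = \((.*)\)\s*$')

def split_top_level(body):
    """Split a tile definition on commas that sit outside {...} and "..."."""
    parts, current = [], []
    depth, in_string, escaped = 0, False, False
    for ch in body:
        if in_string:
            current.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return [p.strip() for p in parts]

def count_multi_turf(dmm_text):
    """Return (keys with more than one turf, grid cells using those keys)."""
    multi_keys, grid_rows = set(), []
    key_len, in_grid = None, False
    for line in dmm_text.splitlines():
        if in_grid:
            if line.startswith('"}'):
                in_grid = False
            else:
                grid_rows.append(line.strip())
            continue
        match = KEY_LINE.match(line)
        if match:
            key, body = match.groups()
            key_len = len(key)
            turfs = [p for p in split_top_level(body) if p.startswith("/turf")]
            if len(turfs) > 1:
                multi_keys.add(key)
        elif line.rstrip().endswith('{"'):
            in_grid = True
    if not key_len:
        return 0, 0
    locs = sum(
        1
        for row in grid_rows
        for i in range(0, len(row), key_len)
        if row[i:i + key_len] in multi_keys
    )
    return len(multi_keys), locs
```

Calling count_multi_turf(open(path).read()) on a map file would give a (tokens, locs) pair in the sense assumed above. Maps saved in a different layout would need a different parser.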
I'm closing this for release purposes. Although I haven't verified that I've fixed this specific issue, I strongly believe I have. Moreover I've still fixed a bug, so it makes sense to go ahead and close this one.
Lummox JR resolved issue with message:
Large numbers of turf changes in a short time, especially involving changes to their overlays/underlays, could sometimes cause appearances to be prematurely deleted or rendered invalid, which could cause crashes. This was an intermittent issue brought on by high-stress situations.