Okay sure. I also added a preposterous amount of logging this time. Also I switched to linux for testing also.

For reference the code is now:
var/global/floor_decl_call_count = 0;

/obj/effect/floor_decal/initialize()
        var/call_id = ++floor_decl_call_count
        world.log << "[call_id]: [src] ([type]) ([x],[y],[z])"
        if(supplied_dir) set_dir(supplied_dir)
        var/turf/T = get_turf(src)
        world.log << "[call_id]: T: [T] ([T.type]) ([T.x],[T.y],[T.z]), \ref[T] overlays.len=[T.overlays.len]"
        if(istype(T, /turf/simulated/floor) || istype(T, /turf/unsimulated/floor) || istype(T, /turf/simulated/shuttle/floor))
                world.log << "[call_id]: istype = true"
                var/cache_key = "[alpha]-[color]-[dir]-[icon_state]-[layer]"
                world.log << "[call_id]: cache_key = [cache_key]"
                var/image/I = floor_decals[cache_key]
                if(!I)
                        I = image(icon = src.icon, icon_state = src.icon_state, dir = src.dir)
                        I.layer = T.layer
                        I.color = src.color
                        I.alpha = src.alpha
                        floor_decals[cache_key] = I
                world.log << "[call_id]: [I ? "image gotten: \ref[I]" : "image failed?"]"
                if(!T.decals) T.decals = list()
                world.log << "[call_id]: T.decals: [T.decals.len], T.overlays: [T.overlays.len]"
                if(I in T.decals)
                        world.log << "[call_id]: WARNING Adding a duplicate decal image \ref[T]"
                T.decals += I
                var/pre_overlay_count = T.overlays.len
                world.log << "[call_id]: About to overlay img:\ref[I] onto turf:\ref[T] which has appearance:\ref[T.appearance]"
                T.overlays += I
                world.log << "[call_id]: post-overlay T: [T] ([T.type]) ([T.x],[T.y],[T.z]), \ref[T] overlays.len=[T.overlays.len], appearance:\ref[T.appearance]"
                var/post_overlay_count = T.overlays.len
                if(pre_overlay_count + 1 != post_overlay_count)
                        world.log << "[call_id]: Oh dear, we tried to add an overlay and it didn't go up...  This is a sign that the turf may be corrupted."
                world.log << "[call_id]: qdel()"
                qdel(src)
        return


The relevant final lines, bothfloor decals at the same coordinate:
10933: the border floor (/obj/effect/floor_decal/borderfloorwhite) (57,38,7)
10933: T: the floor (/turf/simulated/floor/tiled/white) (57,38,7), [0x1016310] overlays.len=0
10933: istype = true
10933: cache_key = 255--10-borderfloor_white-2.01
10933: image gotten: [0xd000594]
10933: T.decals: 0, T.overlays: 0
10933: About to overlay img:[0xd000594] onto turf:[0x1016310] which has appearance:[0x3a0003cd]
10933: post-overlay T:  (/turf/simulated/floor/tiled/white) (57,38,7), [0x1016310] overlays.len=0, appearance:[0x3a0031a1]
10933: Oh dear, we tried to add an overlay and it didn't go up...  This is a sign that the turf may be corrupted.
10933: qdel()

10934: the pale blue corner (/obj/effect/floor_decal/corner/paleblue/border) (57,38,7)
10934: T:  (/turf/simulated/floor/tiled/white) (57,38,7), [0x1016310] overlays.len=0
10934: istype = true
10934: cache_key = 255-#8bbbd5-10-bordercolor-2.01
10934: image gotten: [0xd000595]
10934: T.decals: 1, T.overlays: 0
10934: About to overlay img:[0xd000595] onto turf:[0x1016310] which has appearance:[0x3a0031a1]
BUG: Crashing due to an illegal operation!


I do have the full core dump if that's useful.


Just for reference, I randomly picked logs from a different turf which didn't break and the appearance refs look like:
10913: T.decals: 0, T.overlays: 0
10913: About to overlay img:[0xd0005b1] onto turf:[0x1016299] which has appearance:[0x3a0003cd]
10913: post-overlay T: the floor (/turf/simulated/floor/tiled/white) (58,37,7), [0x1016299] overlays.len=1, appearance:[0x3a00323b]


10914: T.decals: 1, T.overlays: 1
10914: About to overlay img:[0xd000529] onto turf:[0x1016299] which has appearance:[0x3a00323b]
10914: post-overlay T: the floor (/turf/simulated/floor/tiled/white) (58,37,7), [0x1016299] overlays.len=2, appearance:[0x3a0034aa]

10915: T.decals: 2, T.overlays: 2
10915: About to overlay img:[0xd0005b2] onto turf:[0x1016299] which has appearance:[0x3a0034aa]
10915: post-overlay T: the floor (/turf/simulated/floor/tiled/white) (58,37,7), [0x1016299] overlays.len=3, appearance:[0x3a0034ab]




There is a workaround for this in that if I read the still-readable type of the turf and "new T.type(T)", it replaces the turf with a good, working turf with a proper overlays list, and I can slap overlays on it again.
Those results are very weird. The fact that the changed appearance apparently derives from the wrong object is all kinds of messed up.

Your results confirm that the corruption of the appearance is happening during the overlay list change. Unfortunately it tells me nothing about why, nor does it explain the proc pointer corruption that accounts for the earlier crash reports.

[edit]
I can verify that on the Linux crash, you're seeing the crash within AppearanceAdd_overlays(), and more specifically within AppearanceSet_overlays(). It's telling me that the appearance you're copying as part of that process is invalid. Odd that the name should work without a crash.

My working theories about the appearance corruption are that it's happening in one of these:

- Creation of a new appearance ID within AppearanceSet_overlays() (the first time it's called), possibly returning an invalid value.
- Assigning the result of to turf.appearance, which may create a new unique map cell; perhaps it's possible that the wrong cell ID is being used, and therefore it has a bogus appearance

But to be honest, neither of those seem very likely. And neither explains the proc pointer corruption you saw under Windows, which is a completely different problem--unless the issue is that the bogus appearance is a valid pointer for something else.

I feel like we're narrowing in on this, though, and I can see some room for improvement in the way overlay additions are handled.

Out of raw curiosity, what happens if you replace T.overlays += I with T.overlays = T.overlays:Copy() + I? It should call a very different set of routines, but I suspect it would either have the same problem or shift the problem somewhere else.
I tried making that change:
On linux it doesn't crash anymore, tried 10+ times. Smoke tested walking around once the game started and didn't see any immediate problems.
On windows so far it also hasn't crashed, but I have only had time for 3 attempts so far. I'll update if I can get it to crash later.

For reference, log experts from linux non-crashing attempts with the overlays:Copy() code:
16981: T.decals: 0, T.overlays: 0
16981: About to overlay img:[0xd000645] onto turf:[0x102a226] which has appearance:[0x3a000399]
16981: post-overlay T: the lush grass (/turf/simulated/floor/holofloor/grass) (23,119,12), [0x102a226] overlays.len=1, appearance:[0x3a00293f]
16981: qdel()

...

16982: T.decals: 0, T.overlays: 0
16982: About to overlay img:[0xd000646] onto turf:[0x102a227] which has appearance:[0x3a000399]
16982: post-overlay T: the lush grass (/turf/simulated/floor/holofloor/grass) (24,119,12), [0x102a227] overlays.len=1, appearance:[0x3a002937]
16982: qdel()



So it would seem that either the overlay append process is to blame, or like I thought the problem has been shifted somewhere else and hasn't manifested yet. I'll keep looking deeper to see if I can figure out anything that might explain the problem, but at this point I'm still baffled. Frankly I didn't expect the change you made to work; I just wanted to rule something else out to help narrow things down.

I'm deeply curious to know what the ID numbers of the overlays ID arrays are in your original code, but DM doesn't actually expose those. I'd have to give it a try in the debugger to find out.
Just to follow up on this, I think I've gotten about as far as I can on my own with this. I tried running your code in the debugger but nothing happened--which isn't a shock considering you said the problem is intermittent. Frankly the startup time is way too long for me to throw my time at trying to get an intermittent bug to happen.

Regardless of this situation, you would do well to figure out ways of greatly reducing startup time, which might possibly end up obviating the issue anyway.
I'm not 100% sure this is related, but I just experienced a crash on Linux during a malloc() call, and it complains of a double linked list (could be overlays...?). This crash was not a segfault so it didn't invoke the DD SIGSEGV handler, rather just printed the normal Linux dump to stderr.

https://hastebin.com/raw/ekotunovem

EDIT: I'm going to run DreamDaemon in valgrind and see what happens.

EDIT EDIT: Also, startup time is fairly painless for us. Maybe 10 seconds, then it's joinable? Then maybe another 20-30 seconds until you can start the round. Doesn't seem too terrible.
The results from valgrind are here:
https://hastebin.com/raw/rabudehija

I see three memory accesses that seem to be problematic, though obviously I don't have the symbols for 511.1380 so I have no idea what functions are involved. The third of the three improper memory accesses in the hastebin is the overlays one. I can increase the stack depth, if you need, to whatever depth you require.

Just to note, memory usage never goes above 1.4GB or so (with Valgrind, which amplifies memory usage). Without Valgrind, it stays around 700-900mb.
With that valgrind log, do you know where libbyond.so was located in memory? Those addresses aren't any good to me without knowing what they represent as offsets in the file.
I'll include a memory map on the next one, but what about the malloc() exception that included the memory map up there? Does that show anything unusual?

https://hastebin.com/raw/ekotunovem <- Isolated incident, but probably related somehow. mmap is different from valgrind logs, so don't use it together.
It appears to always have the same virtual address space every start of dreamdaemon, so here's a memory map from the same system during the same testing. Should be the same:

https://hastebin.com/raw/aqagamijed <- Goes with Valgrind logs
I also get these as well. Considering all the problems seem to center around appearances and overlays, posting them since they seem to be related to image manipulation (these use the valgrind mmap posted above):
==2335== Conditional jump or move depends on uninitialised value(s)
==2335==    at 0x45F57E0: ??? (in /opt/vorestation/byond/byond-511.1380/bin/libext.so)
==2335==    by 0x45F3E31: deflate (in /opt/vorestation/byond/byond-511.1380/bin/libext.so)
==2335==    by 0x45EE10F: png_write_finish_row (in /opt/vorestation/byond/byond-511.1380/bin/libext.so)
==2335==    by 0x45EE6A9: png_write_filtered_row (in /opt/vorestation/byond/byond-511.1380/bin/libext.so)
==2335==    by 0x45EEB30: png_write_find_filter (in /opt/vorestation/byond/byond-511.1380/bin/libext.so)
==2335==    by 0x45E8919: png_write_row (in /opt/vorestation/byond/byond-511.1380/bin/libext.so)
==2335==    by 0x45E8B59: png_write_image (in /opt/vorestation/byond/byond-511.1380/bin/libext.so)
==2335==    by 0x43C8781: ??? (in /opt/vorestation/byond/byond-511.1380/bin/libbyond.so)
==2335==    by 0x43B29A0: writeIcon(_IO_FILE*, IconType, DMIcon*) (in /opt/vorestation/byond/byond-511.1380/bin/libbyond.so)
==2335==    by 0x42E2DFF: ??? (in /opt/vorestation/byond/byond-511.1380/bin/libbyond.so)
==2335==    by 0x42E56AC: ??? (in /opt/vorestation/byond/byond-511.1380/bin/libbyond.so)
==2335==    by 0x42AE52A: ??? (in /opt/vorestation/byond/byond-511.1380/bin/libbyond.so)
==2335==
==2335== Conditional jump or move depends on uninitialised value(s)
==2335==    at 0x45F57EC: ??? (in /opt/vorestation/byond/byond-511.1380/bin/libext.so)
==2335==    by 0x45F3E31: deflate (in /opt/vorestation/byond/byond-511.1380/bin/libext.so)
==2335==    by 0x45EE10F: png_write_finish_row (in /opt/vorestation/byond/byond-511.1380/bin/libext.so)
==2335==    by 0x45EE6A9: png_write_filtered_row (in /opt/vorestation/byond/byond-511.1380/bin/libext.so)
==2335==    by 0x45EEB30: png_write_find_filter (in /opt/vorestation/byond/byond-511.1380/bin/libext.so)
==2335==    by 0x45E8919: png_write_row (in /opt/vorestation/byond/byond-511.1380/bin/libext.so)
==2335==    by 0x45E8B59: png_write_image (in /opt/vorestation/byond/byond-511.1380/bin/libext.so)
==2335==    by 0x43C8781: ??? (in /opt/vorestation/byond/byond-511.1380/bin/libbyond.so)
==2335==    by 0x43B29A0: writeIcon(_IO_FILE*, IconType, DMIcon*) (in /opt/vorestation/byond/byond-511.1380/bin/libbyond.so)
==2335==    by 0x42E2DFF: ??? (in /opt/vorestation/byond/byond-511.1380/bin/libbyond.so)
==2335==    by 0x42E56AC: ??? (in /opt/vorestation/byond/byond-511.1380/bin/libbyond.so)
==2335==    by 0x42AE52A: ??? (in /opt/vorestation/byond/byond-511.1380/bin/libbyond.so)
==2335==
Okay, that valgrind log turned out to be extremely helpful. I think I have an explanation for what's happening.

The problem lies with unique map cells. (This is a unique combo of turf type, appearance, and area ID; it may be shared by many actual turfs.) There's a function called AddGetCellId() which tries to find a matching map cell to the one being added; if one is found, it's used and no new unique cell is added. If not, then AddCell() is assigned to the current pointer at the end of a linked list chain.

Here's where things get screwy: I believe what's happening is that AddCell() is triggering a garbage check in this case, and a number of unused map cells are being removed. That's good, but if one of those cells just happens to be the last one in the linked list chain, the pointer that will be given the result of AddCell() is no longer valid.

Basically you've got A -> B -> C -> D, where D is the new cell, but before D can be assigned, C gets thrown out.

This should be quite fixable.
Lummox JR resolved issue with message:
Crashes could occur in rare (but fairly consistent) cases when making changes to turf appearance, type, or area in large projects.
Huzzah! Thanks!
Great news then, thanks
Oh dear, Some bad news. I re-ran our test case using 511.1384 on linux, and got the same crash, at the same memory address even.

Backtrace for BYOND 511.1384 on Linux:
Generated at Wed May 17 21:55:30 2017

DreamDaemon [0x8048000, 0x0], [0x8048000, 0x804bb24]
libbyond.so [0xf7209000, 0x0], 0x204b20
 [0xf7763000, 0xf7763bd0], [0xf7763000, 0xf7763bd0]


Full trace plus some of the preceding log (using same test code as over the weekend):

https://hastebin.com/raw/uzapesaney

I have not yet tried it on windows, or use valgrind yet. I don't suppose you happened to add a secret _dm_get_overlay_array_id_thingy() function in the meantime to get you those IDs you were curious about?
Oh, shoot. I think I see where I went wrong on this fix.

It's not really an easily testable fix, is the problem, but I understand now where it went wrong.
If you want to link to just-compiled test versions or something, we'd be happy to test.
I did a rerelease of 511.1384 to deal with the fact that the fix didn't work. Please retest with the new download.
Page: 1 2 3