ID:2024361
 
Resolved
Using text-to-type lookups for /mob types caused subsequent lookups, often on unrelated types (/sound being a common one), to fail in bizarre ways.
BYOND Version:509
Operating System:Linux
Web Browser:Chrome 47.0.2526.111
Applies to:Dream Daemon
Status: Resolved (511.1355)

This issue has been resolved.
Descriptive Problem Summary:

A round of SS13 on tgstation13 had to be forced to end due to a rather unique bug. During the round the game started to continuously output "Cleanbot (a number)" and a small icon of the cleanbot mob, this is intended behaviour if a:
world << obj

statement is given, however we don't do this usually (except upon examining) and it certainly doesn't happen continuously. At the same time as this occuring, the vast majority of the game's sound effects stopped working, we use the sound() proc to output sounds. looking into the runtime logs showed that cleanbot/New() had been called constantly, but with the arguments for the sound() proc!

Byond had gotten confused and swapped sound() with /mob/living/simple_animal/bot/cleanbot/New()
The round had to be ended due to this bug causing over 6000 cleanbot mobs to be created, which brought the game to a crawl.

This bug has never happened to us before, and only resembles one other bug, the one where invalid procs provided to call()() attempt to call the first available proc, which for that bug is world/New()

(Runtime output and some more information available in the issue report on our github)

Numbered Steps to Reproduce Problem:

N/A, we have no idea how this happened, and it hasn't reproduced yet.
All we know is that sound() calls were replaced with calls to cleanbot/New()

Code Snippet (if applicable) to Reproduce Problem:
N/A, but one of the snippits that caused it were:

Runtime:
Undefined variable /mob/living/simple_animal/bot/cleanbot/var/wait
proc name: playsound local (/atom/proc/playsound_local)
source file: sound.dm,27
usr: (src)
src: Maline Trell (/mob/living/carbon/human)
src.loc: the floor (90,146,1) (/turf/simulated/floor/plasteel)
call stack:
Maline Trell (/mob/living/carbon/human): playsound local(the floor (90,147,1) (/turf/simulated/floor/plasteel), 'sound/items/Screwdriver.ogg', 50, 1, 42685, null, 1)
Maline Trell (/mob/living/carbon/human): playsound local(the floor (90,147,1) (/turf/simulated/floor/plasteel), 'sound/items/Screwdriver.ogg', 50, 1, 42685, null, 1)
playsound(the floor (90,147,1) (/turf/simulated/floor/plasteel), 'sound/items/Screwdriver.ogg', 50, 1, null, null, 1)
the security camera (/obj/machinery/camera): attackby(the screwdriver (/obj/item/weapon/screwdriver), Maline Trell (/mob/living/carbon/human), "icon-x=7;icon-y=27;left=1;scre...")
Maline Trell (/mob/living/carbon/human): ClickOn(the security camera (/obj/machinery/camera), "icon-x=7;icon-y=27;left=1;scre...")
the security camera (/obj/machinery/camera): Click(the floor (90,147,1) (/turf/simulated/floor/plasteel), "mapwindow.map", "icon-x=7;icon-y=27;left=1;scre...")


Where the relevant code in that runtime's case is:
/atom/proc/playsound_local(turf/turf_source, soundin, vol as num, vary, frequency, falloff, surround = 1)
soundin = get_sfx(soundin)

var/sound/S = sound(soundin)
S.wait = 0 //No queue
S.channel = 0 //Any channel
S.volume = vol


Many more similar instances where mob/living/simple_animal/bot/cleanbot/New() was used instead of sound() call.

outputting of the bot's names and icons comes specifically from:
world << sound(...)

with sound obviously being replaced with the result of the cleanbot/New() call, which is of course a cleanbot mob, this bit is technically intended as DM simply outputted the mob as if it had been told to normally.


Expected Results:
sound() calls to function as sound() calls

Actual Results:
sound() calls were replaced with /mob/living/simple_animal/bot/cleanbot/New(), resulting in 6000+ cleanbot mobs, and almost all sound effects not playing.

Does the problem occur:
Every time? Or how often? It has occurred once so far
In other games? It has only occurred in SS13 so far
In other user accounts? Everyone connected to the server noticed the issue
On other computers? It has only occurred on the tgstation13 SS13 server computer so far

When does the problem NOT occur?
Every other round that has ever happened except for this one.

Did the problem NOT occur in any earlier versions? If so, what was the last version that worked? (Visit http://www.byond.com/download/build to download old versions for testing.)
The problem did not occur in any version below 509, but the bug does not always occur on 509 either (with this case being the first instance)

Workarounds:
None.

I should point out that nothing in clean bot code actually messages to world, it all goes out to view()

How ever world << sound happens often enough.
I don't think this is a case of proc calls being swapped, but an internal table getting borked. The results are consistent with either the string tree or the list of prototypes being messed up, although I have no idea how that could happen.

I'm gonna have to mark this Unverified till it can be reproduced reliably.
Lummox JR changed status to 'Unverified'
ITS HAPPENED AGAIN.

I'm saving the VM state (including ram) so we can revisit this and debug it.

edit:
same target proc (cleanbot's new) but a different causing proc, we haven't pinpointed it down yet, it didn't cause any runtimes

From this point on, we will trigger a runtime in a proc that cleanbot's New() calls, so we can always see what caused it to be called.
https://github.com/tgstation/-tg-station/pull/14952
Here's what we used to get the stack trace. We'll keep giving more information as it comes.
Happened again recently.

in fact it keeps happening, sometimes randomly on coder's test machines (even once on linux).

5 bucks say this is related to ID:2100606

lummox, I saved a FULLDUMP but because of the 15k simple mobs in nullspace from this bug triggering, the fulldump is 1gb in size.

I can try to trace thru it in a debugger on my computer with 16gbs of ram if you can help me figure out where to look
This smells like some kind of heap corruption. Tracing this from a dump is probably not going to reveal much, because I won't know where to find the right structures to examine. I'm really not sure where I would begin on something like this. Without a solid lead as to where this sort of this is happening, I'm at a loss.
heres a hint:

99% of the time the bug exists at world init
Hrm. What sorts of things are running at world init?
world init is mostly object creation, and adding things to lists, removing things from list, and other list logic.

There is a minor amount of deleting and gcing things as well.

(this is ss13, you already know world init is full of baggage)

Restarting the world normally doesn't fix it

Restarting DD normally doesn't fix it

Restarting server sometimes fixes it

Recompiling to the same map normally doesn't fix it.

Recompiling to another map almost always fixes it, (and there is no discernible correlation in maps it happens on, or maps we recompile to)
After playing a round on the new map, recompiling back to the old map will be ok, and the bug will be done for a bit before coming back.

Taking the compiled dmb from when the bug happens, saving it, and then trying to run it later on after the bug goes away does not make it re-trigger.
That's helpful info, thanks. That last bit is especially important, because it sounded as if compiling had some bearing on it; but if the very same .dmb has the problem one time and not another time, that suggests that if there is a compiler element, it's merely in how easily the bug triggers.

The mercurial nature of this bug suggests there's definitely some heap corruption happening, but apparently it only happens in weird places and ways.

I can't help but wonder if this is tied to the increased limit for unique map cells that was done in 508 (IIRC). Maybe there's some aspect of the logic there that's not behaving right, although it's not immediately clear what that could be. That old issue was fairly sensitive to the compilation and if the init process alters any turfs at all, I could see the old limit being crossed.
I remember running DD in a uninitialized memory reads checker and not finding much, but i should really check DM too in case that's at play.

Other then that, As for what you said, that could be the case, the first post world init thing to happen is loading the random ruins to place around space and around the mining z-level, this uses our map loader so it does change turfs. (I don't know that anything in init itself changes turfs, I'll have to check that.)

When it happened, I would know if a reboot or recompile fixed it by joining shortly after and clicking to observe. if the game tried to send me a sound, but all i got was the icon and name of some simple mob, i knew the game state was bugged.

This is pretty close to after world init while post-init stuff is happening, so that first thing in post init could be at play

Maybe you can try it with the ruins turned off to see if the problem goes away on a steadier basis. That'd be very interesting. My prediction is that unless something similar is running right after init, the problem would all but vanish--if indeed this is the cause, that is.

The bug as you've described it sounds exactly like a form of corruption, because heap allocations tend to be arbitrary but not entirely random. It's very often the case that an intermittent bug like this will appear steadily for a while, go away, come back again, because memory is being filled in a "predictable" way at one time that changes later as the system state changes. I've seen that happen in the debugger before, where on subsequent runs an allocation may end up with the exact same pointer.
Do you also do any runtime proc creation, and if so what does the code look like and where/when does it run? I know that's limited to simply copying an existing proc and giving it a new name and/or desc, but it'd be good info to know.
No, not unless some old ass code somewhere is doing it, but thats doubtful. I had attempted to code something to automatically add verbs for any hud button added, but behavior in those is defined by a datum, not the obj being clicked, so it never went thru as i'd need the verb on the datum.

Maybe you can try it with the ruins turned off

actually i figured a better way would just be to do this:


/world/New()
map_ready = 1
var/icon/Icon_test = icon('icons/BadAss.dmi')
if(!istype(Icon_test))
world.log << "Bad Icon at world init"
var/sound/Sound_test = sound('sound/misc/null.ogg')
if(!istype(Sound_test))
world.log << "Bad Sound at world init"
...


Edit, I've added this to a initialized global datum (/var/datum/thing = new()) to see world init, to world new, and to before and after post init, as well as to just before the code that officially marks the server as started and fully init'ed.

We shall see where it triggers next trigger.
so sadly, adding in these checks, makes it stop happening, removing them makes it happen again.
Update: Same story.

if i add the checks to detect where in world init it happens (when it happens) it goes away, removing the checks will make the possibility of it happening resume again.

its still rare, like 1 to 5% of startups, but once it happens, it will happen multiple rounds until stopping.

Here is the pr with those checks:

https://github.com/tgstation/tgstation/pull/18992/ files?diff=unified

(uncheck show notes to get a clear uninterrupted view)

(note: it still stops the bug if i do it without adding a proc, copying pasting the contents into each spot)
Well, if nothing else, you found a way to make it not happen. =P
Page: 1 2