ID:2100606
 
BYOND Version: 510
Operating System: Windows Server 2012 rc2 64-bit
Web Browser: Chrome 51.0.2704.84
Applies to: Dream Daemon
Status: Open

Descriptive Problem Summary:
In my doubly linked list, when I loop through it, the proc var storing the current item being worked on randomly changes to something else, causing runtimes, even though only one thing ever accesses that var, and it's on the up and up.

Code:
// Run thru the queue of subsystems to run, running them while balancing out their allocated tick percentage
/datum/controller/master/proc/RunQueue()
	. = 0 //so the caller can detect runtimes by checking the return
	var/datum/subsystem/queue_node //**RELATED VAR DEFINED HERE**
	var/queue_node_flags
	var/queue_node_priority
	var/queue_node_paused

	var/current_tick_budget
	var/tick_precentage
	var/tick_remaining
	var/ran = TRUE //this is right
	var/ran_non_ticker = FALSE
	var/bg_calc //have we switched current_tick_budget to background mode yet?
	var/tick_usage

	//keep running while we have stuff to run and we haven't gone over a tick
	//	this is so subsystems paused earlier can use tick time that later subsystems never used
	while (ran && queue_head && world.tick_usage < TICK_LIMIT_MC)
		ran = FALSE
		bg_calc = FALSE
		current_tick_budget = queue_priority_count
		for (queue_node = queue_head; istype(queue_node); queue_node = queue_node.queue_next) //**RUNTIME HERE**
			if (ran && world.tick_usage > TICK_LIMIT_RUNNING)
				break
			queue_node_flags = queue_node.flags
			queue_node_priority = queue_node.queued_priority

			//super special case, subsystems where we can't make them pause mid way through
			//if we can't run them this tick (without going over a tick)
			//we bump up their priority and attempt to run them next tick
			//(unless we haven't even ran anything this tick, since it's unlikely they will ever be able to run
			//	in those cases, so we just let them run)
			if (queue_node_flags & SS_NO_TICK_CHECK)
				if (queue_node.tick_usage > TICK_LIMIT_RUNNING - world.tick_usage && ran_non_ticker)
					queue_node.queued_priority += queue_priority_count * 0.10
					queue_priority_count -= queue_node_priority
					queue_priority_count += queue_node.queued_priority
					current_tick_budget -= queue_node_priority
					continue

			if ((queue_node_flags & SS_BACKGROUND) && !bg_calc)
				current_tick_budget = queue_priority_count_bg
				bg_calc = TRUE

			tick_remaining = TICK_LIMIT_RUNNING - world.tick_usage

			if (current_tick_budget > 0 && queue_node_priority > 0)
				tick_precentage = tick_remaining / (current_tick_budget / queue_node_priority)
			else
				tick_precentage = tick_remaining

			CURRENT_TICKLIMIT = world.tick_usage + tick_precentage

			if (!(queue_node_flags & SS_TICKER))
				ran_non_ticker = TRUE
			ran = TRUE
			tick_usage = world.tick_usage
			queue_node_paused = queue_node.paused
			queue_node.paused = FALSE
			last_type_processed = queue_node

			queue_node.fire(queue_node_paused)

			current_tick_budget -= queue_node_priority
			tick_usage = world.tick_usage - tick_usage

			if (tick_usage < 0)
				tick_usage = 0

			if (queue_node.paused)
				queue_node.paused_ticks++
				queue_node.paused_tick_usage += tick_usage
				continue

			queue_node.ticks = MC_AVERAGE(queue_node.ticks, queue_node.paused_ticks)
			tick_usage += queue_node.paused_tick_usage

			queue_node.tick_usage = MC_AVERAGE_FAST(queue_node.tick_usage, tick_usage)

			queue_node.cost = MC_AVERAGE_FAST(queue_node.cost, TICK_DELTA_TO_MS(tick_usage))
			queue_node.paused_ticks = 0
			queue_node.paused_tick_usage = 0

			if (queue_node_flags & SS_BACKGROUND) //update our running total
				queue_priority_count_bg -= queue_node_priority
			else
				queue_priority_count -= queue_node_priority

			queue_node.last_fire = world.time
			queue_node.times_fired++

			if (queue_node_flags & SS_TICKER)
				queue_node.next_fire = world.time + (world.tick_lag * queue_node.wait)
			else if (queue_node_flags & SS_POST_FIRE_TIMING)
				queue_node.next_fire = world.time + queue_node.wait
			else if (queue_node_flags & SS_KEEP_TIMING)
				queue_node.next_fire += queue_node.wait
			else
				queue_node.next_fire = queue_node.queued_time + queue_node.wait

			queue_node.queued_time = 0

			//remove from queue
			queue_node.dequeue()

	CURRENT_TICKLIMIT = TICK_LIMIT_RUNNING
	. = 1 //normal finish, let the caller know we didn't runtime


[00:51:45]undefined variable /datum/controller/master/var/queue_next
[00:51:45]proc name: RunQueue (/datum/controller/master/proc/RunQueue)
[00:51:45] source file: master.dm,325
[00:51:45] usr: null
[00:51:45] src: Master (/datum/controller/master)
[00:51:45] call stack:
[00:51:45]Master (/datum/controller/master): RunQueue()
[00:51:45]Master (/datum/controller/master): Loop()
[00:51:45]Master (/datum/controller/master): StartProcessing()


Basically, according to the runtime, queue_node is randomly getting set to something of type /datum/controller/master. I know better than to trust that exact type, and assume it could also be null or some other bad value. But as you can see, this shouldn't be possible; the istype() is right there!

Now, wondering what the hell was going on, I added runtime detection to the thing that calls RunQueue: if RunQueue runtimes, the Master Controller does a soft reset and logs any bad data it finds:

//resets the queue, and all subsystems, while filtering out the subsystem lists
// called if any mc's queue procs runtime or exit improperly.
/datum/controller/master/proc/SoftReset(list/ticker_SS, list/normal_SS, list/lobby_SS)
	. = 0
	world.log << "MC: SoftReset called, resetting MC queue state."
	if (!istype(subsystems) || !istype(ticker_SS) || !istype(normal_SS) || !istype(lobby_SS))
		world.log << "MC: SoftReset: Bad list contents: '[subsystems]' '[ticker_SS]' '[normal_SS]' '[lobby_SS]' Crashing!"
		return
	var/list/subsystemstocheck = subsystems + ticker_SS + normal_SS + lobby_SS

	for (var/thing in subsystemstocheck)
		var/datum/subsystem/SS = thing
		if (!SS || !istype(SS))
			//list(SS) is so if a list makes it in the subsystem list, we remove the list, not the contents
			subsystems -= list(SS)
			ticker_SS -= list(SS)
			normal_SS -= list(SS)
			lobby_SS -= list(SS)
			world.log << "MC: SoftReset: Found bad entry in subsystem list, '[SS]'"
			continue
		if (SS.queue_next && !istype(SS.queue_next))
			world.log << "MC: SoftReset: Found bad data in subsystem queue, queue_next = '[SS.queue_next]'"
			SS.queue_next = null
		if (SS.queue_prev && !istype(SS.queue_prev))
			world.log << "MC: SoftReset: Found bad data in subsystem queue, queue_prev = '[SS.queue_prev]'"
			SS.queue_prev = null
		SS.queued_priority = 0
		SS.queued_time = 0
		SS.paused = 0
	if (queue_head && !istype(queue_head))
		world.log << "MC: SoftReset: Found bad data in subsystem queue, queue_head = '[queue_head]'"
		queue_head = null
	if (queue_tail && !istype(queue_tail))
		world.log << "MC: SoftReset: Found bad data in subsystem queue, queue_tail = '[queue_tail]'"
		queue_tail = null
	queue_priority_count = 0
	queue_priority_count_bg = 0
	world.log << "MC: SoftReset: Finished."
	. = 1
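
(For context, the caller-side detection boils down to something like the sketch below. This is not the actual Loop() code, and the list arguments here are just placeholders standing in for whatever lists Loop() holds.)

//sketch only, not the actual Loop() code: RunQueue() returns 0 if it runtimed or exited early
if (!RunQueue())
	SoftReset(ticker_SS, normal_SS, lobby_SS)	//list arguments here are placeholders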


Result:
runtime error:
[04:37:57]undefined variable /datum/controller/master/var/queue_next
[04:37:57]proc name: RunQueue (/datum/controller/master/proc/RunQueue)
[04:37:57] source file: master.dm,325
[04:37:57] usr: null
[04:37:57] src: Master (/datum/controller/master)
[04:37:57] call stack:
[04:37:57]Master (/datum/controller/master): RunQueue()
[04:37:57]Master (/datum/controller/master): Loop()
[04:37:57]Master (/datum/controller/master): StartProcessing()
MC: SoftReset called, resetting MC queue state.
MC: SoftReset: Finished.

(i.e., no bad linked-list state was detected, since it printed none of the lines that would indicate it)


So one of two things is true: either there is some magical line editing queue_node that my eyes aren't seeing (if it changed within the body of the loop, the middle part of the for would not have run yet to check the new value), or there is some BYOND bug at play.

I have looked and looked and looked; no line outside of the for loop's define edits that var.
Is this queue circular, maybe with /datum/controller/master as its head? It'd help to see your dequeue() code.

The reason I ask is that you're calling queue_node.dequeue() at the end of the loop, just before reading queue_node.queue_next to get the next node. While I don't see how this could directly cause your issue, this kind of thing always spooks me, because if anything much happens to the node during the dequeue, the queue_next reference could be wrong.

Whenever I loop through lists like this internally, when I know that the list may be altered, I immediately grab the next pointer right at the top of the loop interior, and I use that in the loop.

for (item = head, item, item = _next)
	_next = item.next
	...

Again I don't see how your specific bug could manifest from not doing that, but for safety I think I'd make that change anyway since you are altering the queue. I would think if a problem cropped up it'd be a null reference appearing, rather than a complete change.
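
In DM terms, using your var names, that would look roughly like this (untested, just showing the shape of the loop; the new queue_node_next var is mine):

var/datum/subsystem/queue_node_next
for (queue_node = queue_head; istype(queue_node); queue_node = queue_node_next)
	//grab the next pointer before the body (and the eventual dequeue()) can touch the queue
	queue_node_next = queue_node.queue_next
	//... rest of the loop body as before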

The next thing I would probably try in RunQueue() is to figure out at which point queue_node is changing, by sticking either istype() checks in there, or by using a local var that stores queue_node's type at the beginning of the loop and periodically does a comparison (which would obviously be faster).
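
Something along these lines, purely as an illustration (the snapshot var and log text are just examples):

var/node_type_snapshot = queue_node.type	//taken right after the istype() in the for header passes
//... then after each suspicious block in the body:
if (!queue_node || queue_node.type != node_type_snapshot)
	world.log << "MC: queue_node changed mid-body: was [node_type_snapshot], now '[queue_node]'"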

You also mentioned a #define; what does it look like?
I check istype() in the middle part of the for loop and still get the runtime, and queue_node is a proc-local var.

So logically, the only way this should be able to happen is if queue_node changes between the istype() and queue_node.queue_next (i.e., within the body of the for loop). The only ways that could happen are something in the immediate body of the for loop changing it (queue_node is a proc-local var, so it can't be changed by outside forces), or queue_node's contents getting hard deleted. The latter can't happen: deleting a subsystem puts it in the hard-delete queue while also removing it as a subsystem from the master controller, so there is no way the master controller would still see it as a subsystem in the same tick it gets deleted; at least 60 seconds would have to pass before our system allows the subsystem to garbage collect. (And the only thing that should be trying to delete a subsystem is the subsystem itself, in its New(), if it detects that another one of its type already exists.)

It's not circular, and it should never contain a master (and the istype() would prevent a master from getting loaded into it normally).

/datum/subsystem/proc/dequeue()
	if (queue_next)
		queue_next.queue_prev = queue_prev
	if (queue_prev)
		queue_prev.queue_next = queue_next
	if (src == Master.queue_tail)
		Master.queue_tail = queue_prev
	if (src == Master.queue_head)
		Master.queue_head = queue_next
	queued_time = 0


The dequeue keeps the subsystem's current references and basically just joins its neighbors up. When requeueing, it ensures both prev and next are set to something.


As for defines, these should be all of the defines used in this proc:

#define MC_AVERAGE_FAST(average, current) (0.7 * (average) + 0.3 * (current))
#define MC_AVERAGE(average, current) (0.8 * (average) + 0.2 * (current))
#define MC_AVERAGE_SLOW(average, current) (0.9 * (average) + 0.1 * (current))
#define TICK_LIMIT_RUNNING 85
#define TICK_LIMIT_TO_RUN 80
#define TICK_LIMIT_MC 75
#define TICK_DELTA_TO_MS(percent_of_tick_used) ((percent_of_tick_used) * world.tick_lag)
#define TICK_USAGE_TO_MS(starting_tickusage) (TICK_DELTA_TO_MS(world.tick_usage-starting_tickusage))
//subsystem should fire during pre-game lobby.
#define SS_FIRE_IN_LOBBY 1
//subsystem does not initialize.
#define SS_NO_INIT 2
//subsystem does not fire.
// (like can_fire = 0, but keeps it from getting added to the processing subsystems list)
// (Requires a MC restart to change)
#define SS_NO_FIRE 4
//subsystem only runs on spare cpu (after all non-background subsystems have ran that tick)
// SS_BACKGROUND has its own priority bracket
#define SS_BACKGROUND 8
//subsystem does not tick check, and should not run unless there is enough time (or its running behind (unless background))
#define SS_NO_TICK_CHECK 16
//Treat wait as a tick count, not DS, run every wait ticks.
// (also forces it to run first in the tick, above even SS_NO_TICK_CHECK subsystems)
// (implies SS_FIRE_IN_LOBBY because of how it works)
// (overrides SS_BACKGROUND)
// This is designed for basically anything that works as a mini-mc (like SStimer)
#define SS_TICKER 32
//keep the subsystem's timing on point by firing early if it fired late last fire because of lag
// ie: if a 20ds subsystem fires say 5 ds late due to lag or what not, its next fire would be in 15ds, not 20ds.
#define SS_KEEP_TIMING 64
//Calculate its next fire after its fired.
// (IE: if a 5ds wait SS takes 2ds to run, its next fire should be 5ds away, not 3ds like it normally would be)
// This flag overrides SS_KEEP_TIMING
#define SS_POST_FIRE_TIMING 128



You said I mentioned a #define, but the only place I used that word was "no line outside of the for loop's define edits that var", and by that I was merely referring to the for loop header, not any #define: the first line of the for loop, the part between the () where you initialize your variable, check the condition, and do the increment.
Thanks for the clarification.

I think we still need to figure out at what point queue_node is changing during the for loop, to get a handle on how. Obviously it shouldn't be changing, but something is causing the issue so it'd be good to narrow down where.

Because queue_node is accessed so much, my gut says the alteration is happening somewhere in or just after dequeue().

The reason I say "in" dequeue is that there are two possible ways this is manifesting:

1) The datum held by queue_node is having its type changed, most likely due to heap corruption or, and I can't rule this out, some routine that's being called in fire() or otherwise indirectly that's messing up the datum's prototype.

2) The actual reference in queue_node is changing, which could be either because of heap corruption, stack corruption, or something like a hidden #define that wasn't listed above but is having an influence on this code.

Given the way this is shaking out, heap and stack corruption seem unlikely--not impossible, but I don't buy them as an explanation. I think something else is in play that isn't obvious from these snippets of code. But it'll be easier to tell what that might be after narrowing down the place where queue_node changes--and importantly, whether its actual ref has changed or not.
Alright, I'm running a version now that has an istype() check at every line, and I even store the name of the datum held by queue_node at the top of the proc and print that out when it detects bad data, so I can also see whether it happens during a certain subsystem's routine.

Getting it to happen locally is a bit rarer than getting it to happen on the server, but it does happen if I generate enough load. (In fact, it seems like the only time it ever happens locally is when I think I've found out what triggers it, just so it can prove me wrong =P)

I'll edit the code to also save the \ref and print the before and after when it detects this, as that might be helpful.
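
Roughly along these lines (illustrative only; the actual check gets repeated after each chunk of the loop body, and this assumes subsystems have a name var):

var/node_name_snapshot = queue_node.name	//assumes /datum/subsystem has a name var
var/node_ref_snapshot = "\ref[queue_node]"
//... later, when bad data is detected:
world.log << "MC: queue_node corrupted: was '[node_name_snapshot]' (ref [node_ref_snapshot]), now '[queue_node]' (ref \ref[queue_node])"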

Edit: we also have a tool that lets us see all defines in the code and where they are used. I'll check that some define I didn't anticipate isn't coming into play here. They should all be UPPERCASE, but some procs that were turned into macros were never renamed.
So I can't get it to happen locally; I've queued an update job to apply it to the server next round.

I also MIGHT have found a correlation with why it happens more often on the server.

Could you, if you have spare time, dig around the handling of time, maybe focusing on timeofday? I noticed that every time it happened locally was after midnight, and most of the times it happened on the server were shortly after midnight GMT.

Might be nothing, but it might explain quite a few things.
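
(On my end I can also log the clock alongside the corruption report to confirm or rule out the midnight correlation; something as simple as the line below, purely illustrative:)

world.log << "MC: corruption at world.time=[world.time], world.timeofday=[world.timeofday], world.realtime=[world.realtime]"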
The time functions shouldn't be capable of this kind of problem, although I did check on time2text() to see if there was any chance it was responsible. Looks like our string handling there is quite good, though; stack smashing should be well out of the question.

The world.timeofday read is very simple:

struct timeb tb;
ftime(&tb);
return Num2Value((float)(((tb.time % 86400UL) * 10) + (tb.millitm / 100)));

Hmm, what about profiling proc calls? I doubt that as well (because you most likely just generate a timeb struct and look at the millitm number, and timeb/ftime is extremely unlikely to be unsafe in handling that).

Yeah, I guess we just wait for a trigger (and hope that this isn't one of those quantum bugs that goes away when you attempt to observe it) and maybe get more info.
So far no trigger (and it normally would have triggered by now).

I'm 99% sure adding the new proc var to track the name and ref of the subsystem is what made the bug go away; I'll do some debugging here when I can.

Edit: no luck. Now I can't get it to trigger at all. I'll check back here if this ever changes.