ID: 2315078
 
Applies to: Dream Daemon
Status: Open

My current tests indicate 100K scheduler insertions take approximately 50-65 seconds on my 3.6GHz CPU / DDR3-1333 RAM machine.

According to Lum, the current scheduler queue is a linked list to avoid dynamic collection resizing performance costs.

This means that spawn()/sleep() has to walk the scheduler queue element by element from the beginning to find where to insert the new value. The larger your scheduler queue, and the further the new entry is from the current time, the worse performance gets. For large games, the overhead of a scheduler reinsertion is dramatic, and without a full picture of how this impacts developers, the big fish in this community throw their hands up in the air, totally unable to diagnose why their game is shitting the bed.
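To make the cost concrete, here is an illustrative softcode sketch of the linear walk Lum describes. This is NOT the engine's actual code, just the same O(n) shape expressed in DM:

proc/linear_insert(list/queue, sched_time)
    var/pos = 1
    //walk from the head past every entry scheduled sooner than ours
    while(pos <= queue.len && queue[pos] <= sched_time)
        ++pos
    queue.Insert(pos, sched_time) //cost grows with queue length and with delay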

This is especially problematic because the spawn() pattern in particular makes CPU occupancy nearly impossible to profile, effectively hiding problem areas of your code. See my profiling results below: using waitfor in place of spawn makes it much more obvious where the CPU hit is coming from, without inducing much extra overhead.

My tests indicate that we can expect better than a five-fold increase in performance across the board, in both the best and worst cases.

I performed some testing in softcode to prove my suspicions:

//potential optimization #5: allow multiple schedules based on time granularity. A binary split of very large scheduler lists, attempting to segregate times with a progressive decrease in granularity vs list size, should significantly improve scheduling performance and distribute that regained performance over time during Tick().

scheduler //potential optimization #4: create a special scheduler for cyclical events, removing scheduling overhead from repetition.
    var
        time = 0
        tick_lag
        cur_tick
        next_tick = 0

        default_invoke

        list/handles = list() //planned optimization #2: flatten all 5 lists into a single structure. Should result in a dramatic speed boost.
        list/scheduled = list()
        list/invoking = list()
        list/schedule = list()
        list/parameters = list()
    proc
        Schedule(datum/ref, delay, invoke, params, handle)
            if(!handle)
                handle = "\ref[ref]:[time+delay]"

            var/sched = time + delay
            if(delay < 0 || (!cur_tick && sched < next_tick))
                //already due: invoke immediately instead of queueing
                call(ref, invoke || default_invoke)(arglist(params))
            else
                var/pos = binarySeek(schedule, sched)

                handles.Insert(pos, handle)
                scheduled.Insert(pos, ref)
                invoking.Insert(pos, invoke)
                schedule.Insert(pos, sched)
                parameters.Insert(pos, params)

            return handle

        Cancel(datum/handle)
            var/pos, epos
            if(istext(handle))
                pos = handles.Find(handle) //planned optimization #3: circularize scheduler references to speed explicit deletion and patch holes.
            else if(isnum(handle))
                pos = handle
            else if(istype(handle)) //planned optimization #3: should be able to pop the specific index if done properly.
                pos = scheduled.Find(handle)

            if(!pos)
                return 0

            epos = pos + 1
            handles.Cut(pos, epos)
            scheduled.Cut(pos, epos)
            invoking.Cut(pos, epos)
            schedule.Cut(pos, epos)
            parameters.Cut(pos, epos)
            return 1

        Reschedule(datum/handle, delay, params)
            var/pos, epos
            if(istext(handle))
                pos = handles.Find(handle)
            else if(isnum(handle))
                pos = handle
            else if(istype(handle))
                pos = scheduled.Find(handle)

            if(!pos)
                return 0

            epos = pos + 1

            handle = handles[pos]
            var/ref = scheduled[pos]
            var/invoke = invoking[pos]
            if(!params) params = parameters[pos]

            handles.Cut(pos, epos)
            scheduled.Cut(pos, epos)
            invoking.Cut(pos, epos)
            schedule.Cut(pos, epos)
            parameters.Cut(pos, epos)

            var/sched = time + delay
            if(delay < 0 || (!cur_tick && sched < next_tick))
                call(ref, invoke || default_invoke)(arglist(params))
            else
                pos = binarySeek(schedule, sched)

                handles.Insert(pos, handle)
                scheduled.Insert(pos, ref)
                invoking.Insert(pos, invoke)
                schedule.Insert(pos, sched)
                parameters.Insert(pos, params)

            return 1

        Tick(time)
            src.time = time //track the scheduler's clock for later Schedule() calls
            cur_tick = args
            next_tick = time + (tick_lag || world.tick_lag)
            var/datum/ref, invoke, pos, params
            //soonest events live at the END of the lists, so removing them
            //is just a length decrement instead of a Cut() at the front
            while((pos = schedule.len))
                if(schedule[pos] >= next_tick)
                    return
                ref = scheduled[pos]
                invoke = invoking[pos] || default_invoke
                params = parameters[pos]

                --handles.len
                --scheduled.len
                --invoking.len
                --schedule.len
                --parameters.len

                call(ref, invoke)(arglist(params))

    proc
        binarySeek(list/l, value)
            //l is sorted DESCENDING (soonest times at the end); returns the
            //position to Insert() value at so that ordering is preserved
            var/len = length(l)

            if(!len)
                return 1

            if(l[1] <= value)
                return 1
            else if(l[len] >= value)
                return len + 1

            //binary search for the first position holding an element less
            //than or equal to the value being inserted
            var/lo = 2, hi = len
            while(lo < hi)
                var/mid = round((lo + hi) / 2)
                if(l[mid] > value)
                    lo = mid + 1
                else
                    hi = mid

            return lo


The scheduler datum is intended to function similarly to spawn(). It stores all the information that it needs to schedule callbacks. The basic idea is that when an event is scheduled, its parameters are stored sorted by invocation time, with the insertion point found by binary search. Everything is stored backward to increase processing speed, as it's significantly faster to pop items from the end of a list than to cut them from the beginning of a list. This means that the soonest events are at the end of the list, and the latest events are at the beginning of the list.
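For clarity, here is a minimal usage sketch of the datum. The global_scheduler var, the regen() proc, and the world/New() driver loop are hypothetical glue for illustration, not part of the scheduler itself:

var/scheduler/global_scheduler = new()

mob/proc/regen() //heal here, then Schedule() again for the next pulse

mob/proc/start_regen()
    //invoke src.regen() 10 ticks from now; keep the handle so the event
    //can later be passed to Cancel() or Reschedule()
    return global_scheduler.Schedule(src, 10, "regen")

//drive the scheduler once per tick from world startup
world/New()
    ..()
    spawn()
        while(1)
            global_scheduler.Tick(world.time)
            sleep(world.tick_lag)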


The functions below are the tests I used for benchmarking:

mob
    var
        tested = 0
    verb
        test_scheduler()
            var/scheduler/s = new()
            s.default_invoke = "schedtest"
            tested = 0
            for(var/count in 1 to 100000)
                s.Schedule(src, rand()*10)

            var/etime = world.time + 10
            while(world.time <= etime)
                s.Tick(world.time)
                sleep(world.tick_lag)

        test_spawn()
            tested = 0
            for(var/count in 1 to 100000)
                spawn(rand()*10)
                    ++tested

        test_waitfor()
            tested = 0
            for(var/count in 1 to 100000)
                waitfortest()
    proc
        schedtest()
            ++tested

        spawntest()
            ++tested

        waitfortest()
            set waitfor = 0
            sleep(rand()*10)
            ++tested


Results indicate that the scheduler is averaging 11 seconds to schedule and run 100,000 callbacks, while spawn() and waitfor are taking roughly 52.9 seconds each.

The difference between the waitfor and spawn functions in the profiler is negligible; the waitfor test just shows much more clearly where the specific CPU occupancy comes from compared to spawn().
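For anyone wanting to reproduce that profiler visibility, the conversion is mechanical. A hedged example (the proc name and body are made up):

//instead of an anonymous spawn(10) block, which the profiler can't attribute:
mob/proc/delayed_greet()
    set waitfor = 0 //caller resumes at the first sleep(), just like spawn()
    sleep(10)
    src << "hello" //this cost now shows up under delayed_greet in the profiler

Calling delayed_greet() where you once wrote spawn(10) gives the profiler a named proc to pin the CPU time on.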



It should be noted that my scheduler outperforms spawn in every single configuration even though its logic is written in interpreted softcode. I would expect a built-in implementation to absolutely shit on an equivalent softcode solution, so a built-in version of my algorithm should offer even more dramatic performance improvements.

My algorithm is also hugely unoptimized at the moment. I've only done one of the five major optimizations to it that I've outlined since I initially wrote it.


As for benefit, spawn()/sleep() is a massively overused pattern in DM. Spires, every shitty anime game, and SS13 in particular pepper that shit everywhere. This performance increase would be a dramatic global benefit to every single project using the language.
The actual suggestion is to swap the scheduler queue to a reverse-ordered vector and change the insert to use an O(log n) binary search rather than the O(n) iteration technique it currently uses.
There is a better way too.

/tg/ uses our own timer system instead of spawns (we still have to use sleep), and it maintains two queues: a rolling index-based list, and a mostly unsorted list.


The index-based list contains the next 10 seconds' worth of entries; each index is a single tick and holds either null or the head of a doubly linked list of every timer that needs to run on that tick.

Insert is O(1), picking is O(1), and every 10 seconds we timsort the other queue and refill the first queue, an O(n log n) operation (but only over the timers that didn't fit in the first queue; timsort takes advantage of the parts already sorted during the last refill, and since almost all timers fit within 10 seconds, there tends to not be much in this area).
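A condensed softcode sketch of that two-queue shape (names hypothetical; this is not /tg/'s actual code, and it uses plain bucket lists rather than doubly linked lists):

#define WHEEL_TICKS 100 //10 seconds of per-tick buckets at tick_lag = 1

timer_wheel
    var
        list/buckets[WHEEL_TICKS] //buckets[i] = events due i-1 ticks from now
        list/overflow = list()    //mostly unsorted; everything past the horizon
        head = 1                  //rolling index of the current tick's bucket

    proc
        Insert(datum/wheel_event/ev, ticks_from_now)
            if(ticks_from_now < WHEEL_TICKS) //O(1): straight into its bucket
                var/slot = ((head - 1 + ticks_from_now) % WHEEL_TICKS) + 1
                if(!buckets[slot])
                    buckets[slot] = list()
                buckets[slot] += ev
            else
                overflow += ev //O(1) append; only sorted during a refill

        Tick() //O(1) pick: everything due this tick sits in one bucket
            var/list/due = buckets[head]
            buckets[head] = null //this slot becomes the far edge of the wheel
            head = (head % WHEEL_TICKS) + 1
            for(var/datum/wheel_event/ev in due) //no-op if due is null
                ev.Fire()
            //every WHEEL_TICKS ticks, sort overflow and re-Insert the
            //now-in-range entries (the timsort step described above)

datum/wheel_event
    proc/Fire() //subclass or set a callback here

Insert and pick never touch the overflow list except during the periodic refill, which is what keeps the common path O(1).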

Edit: In fact, if I used your binary insert for the timers that have to go to the second queue, I could kill the timsort overhead entirely.

Feel free to lift all, some, or none of this wholesale. Whatever works for you guys. Any code I post here is basically fair game CC0.

Would really recommend you not ignore the optimization notes I've made though. Huge chunks of this are very much experimental. I just wanted to see if I could beat the scheduler.

I'm brewing a second feature request related to this one that I think you guys will like a lot.
Pending some more tests, this is in for 512.1396.
I'm assuming "this" refers to the binary insertion from the original post and not the two-list system /tg/ uses that I detailed in my post.
Binary insertion.
