ID: 2315078
Applies to: Dream Daemon
Status: Resolved (512.1396)

Resolution: The proc scheduler has been updated to improve response times for various cases arising in complex games. Also, a number of experimental changes have been made under the hood to improve server performance.

My current tests indicate that 100K scheduler insertions take approximately 50-65 seconds on my 3.6GHz CPU / DDR3 1333 RAM machine.

According to Lum, the current scheduler queue is a linked list to avoid dynamic collection resizing performance costs.

This means that spawn()/sleep() has to search the scheduler queue element by element from the beginning to find where to insert the new entry. The larger your scheduler queue, and the further the new wake-up time is from the current time, the worse performance gets. For large games the overhead of a scheduler reinsertion is dramatic, and without a full picture of how this impacts developers, the big fish in this community throw their hands up in the air, totally unable to diagnose why their game is shitting the bed.
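
To make that cost concrete, here is a rough softcode sketch of the scan I mean (the actual queue is a native linked list inside the server, so treat this as illustrative only):

proc/linearInsert(list/queue, wake_time)
    //what an O(n) front-to-back insert into a sorted queue looks like;
    //every spawn()/sleep() pays this walk, and it grows with queue length
    var/pos = 1
    while(pos <= queue.len && queue[pos] <= wake_time) //scan until a later entry
        ++pos
    queue.Insert(pos, wake_time)
    return pos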

This is especially problematic because the spawn() pattern in particular makes CPU occupancy nearly impossible to profile, effectively hiding problem areas of your code. See my profiling results below: using waitfor procs in favor of spawn() makes it much more obvious where the CPU hit is coming from, without inducing much extra overhead.
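
For reference, the two patterns I'm comparing look like this (a minimal sketch; do_thing() is a hypothetical stand-in for the real work):

mob/proc/do_thing() //hypothetical placeholder for the actual work
    return

mob/proc/spawn_pattern()
    spawn(10)
        do_thing() //delayed body is anonymous; its cost blends into the calling proc

mob/proc/waitfor_pattern()
    do_thing_later() //returns immediately, just like spawn()

mob/proc/do_thing_later()
    set waitfor = 0 //caller resumes as soon as this proc sleeps
    sleep(10)
    do_thing() //cost shows up under this proc's own profiler entry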

My tests indicate that we can expect better than a five-fold increase in performance across the board, in both the best and worst cases.

I performed some testing in softcode to prove my suspicions:

//potential optimization #5: allow multiple schedules based on time granularity. A binary split of very large scheduler lists, attempting to segregate times with a progressive decrease in granularity vs list size should significantly improve scheduling performance and distribute that regained performance over time during Tick().

scheduler //potential optimization #4: Create special scheduler for cyclical events, removing scheduling overhead from repetition.
    var
        time = 0
        tick_lag
        cur_tick
        next_tick = 0

        default_invoke

        list/handles = list() //planned optimization #2: Flatten all 5 lists into a single structure. Should result in dramatic speed boost.
        list/scheduled = list()
        list/invoking = list()
        list/schedule = list()
        list/parameters = list()

    proc
        Schedule(datum/ref,delay,invoke,params,handle)
            if(!handle)
                handle = "\ref[ref]:[time+delay]"

            var/sched = time + delay
            if(delay<0 || (!cur_tick && sched < next_tick))
                call(ref,invoke || default_invoke)(time)
            else
                var/pos = binarySeek(schedule,sched)

                handles.Insert(pos,handle)
                scheduled.Insert(pos,ref)
                invoking.Insert(pos,invoke)
                schedule.Insert(pos,sched)
                parameters.Insert(pos,params)

            return handle

        Cancel(datum/handle)
            var/pos, epos
            if(istext(handle))
                pos = handles.Find(handle) //planned optimization #3: Circularize scheduler references to speed explicit deletion and patch holes.
            else if(isnum(handle))
                pos = handle
            else if(istype(handle)) //planned optimization #3: Should be able to pop the specific index if done properly.
                pos = scheduled.Find(handle)

            if(!pos)
                return 0

            epos = pos+1
            handles.Cut(pos,epos)
            scheduled.Cut(pos,epos)
            invoking.Cut(pos,epos)
            schedule.Cut(pos,epos)
            parameters.Cut(pos,epos)
            return 1

        Reschedule(datum/handle,delay,params)
            var/pos, epos
            if(istext(handle))
                pos = handles.Find(handle)
            else if(isnum(handle))
                pos = handle
            else if(istype(handle))
                pos = scheduled.Find(handle)

            if(!pos)
                return 0

            epos = pos+1

            handle = handles[pos]
            var/ref = scheduled[pos]
            var/invoke = invoking[pos]
            if(!params) params = parameters[pos]

            handles.Cut(pos,epos)
            scheduled.Cut(pos,epos)
            invoking.Cut(pos,epos)
            schedule.Cut(pos,epos)
            parameters.Cut(pos,epos)

            var/sched = time + delay
            if(delay<0 || (!cur_tick && sched < next_tick))
                call(ref,invoke || default_invoke)(arglist(params))
            else
                pos = binarySeek(schedule,sched)

                handles.Insert(pos,handle)
                scheduled.Insert(pos,ref)
                invoking.Insert(pos,invoke)
                schedule.Insert(pos,sched)
                parameters.Insert(pos,params)

            return 1

        Tick(time)
            cur_tick = args
            next_tick = time + (tick_lag || world.tick_lag)
            var/datum/ref, invoke, pos, params
            while((pos=schedule.len))
                if(schedule[pos] >= next_tick)
                    return
                ref = scheduled[pos]
                invoke = invoking[pos] || default_invoke
                params = parameters[pos]

                --handles.len
                --scheduled.len
                --invoking.len
                --schedule.len
                --parameters.len

                call(ref,invoke)(arglist(params))

proc
    binarySeek(list/l,value)
        //l is sorted in descending order (latest time first); returns the position
        //that value should be Insert()ed at to keep that order
        var/len = length(l)

        if(!len)
            return 1

        if(l[1]<=value)
            return 1
        else if(l[len]>=value)
            return len+1

        //l[1] > value > l[len], so the insertion point lies in 2..len:
        //step seek forward while the entry hlen ahead is still later than value,
        //halving hlen whenever it overshoots
        var/hlen = len, seek = 1
        while(hlen)
            if(seek+hlen <= len && l[seek+hlen] > value)
                seek += hlen
            else
                hlen = round(hlen/2)

        return seek+1


The scheduler datum is intended to function similarly to spawn(). It stores all the information it needs to schedule callbacks. The basic idea is that when an event is scheduled, its parameters are stored sorted by invocation time using a binary-search insertion. Everything is stored backward to increase processing speed, as it's significantly faster to pop items from the end of a list than to cut them from the beginning. This means that the soonest events are at the end of the list and the latest events are at the beginning.
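
A quick illustration of why backward storage matters (a minimal sketch, separate from the scheduler above):

proc/popLast(list/l)
    //shrinking len drops the last element without shifting anything else: O(1)
    if(!l.len)
        return
    . = l[l.len]
    --l.len

proc/popFirst(list/l)
    //cutting from the front forces every remaining element to shift down: O(n)
    if(!l.len)
        return
    . = l[1]
    l.Cut(1, 2)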


The functions below are the tests that I used for benchmarking:

mob
    var
        tested = 0
    verb
        test_scheduler()
            var/scheduler/s = new()
            s.default_invoke = "schedtest"
            tested = 0
            for(var/count in 1 to 100000)
                s.Schedule(src,rand()*10)

            var/etime = world.time+10
            while(world.time<=etime)
                s.Tick(world.time)
                sleep(world.tick_lag)

        test_spawn()
            tested = 0
            for(var/count in 1 to 100000)
                spawn(rand()*10)
                    ++tested

        test_waitfor()
            tested = 0
            for(var/count in 1 to 100000)
                waitfortest()
    proc
        schedtest()
            ++tested

        spawntest()
            ++tested

        waitfortest()
            set waitfor = 0
            sleep(rand()*10)
            ++tested


Results indicate that the scheduler is averaging 11 seconds to schedule and run 100,000 callbacks, while spawn() and waitfor are taking roughly 52.9 seconds each.

The difference between the waitfor and spawn tests in the profiler is negligible; the waitfor test just shows much more clearly than spawn() where the specific CPU occupancy comes from.



It should be noted that my scheduler outperforms spawn() in every single configuration even though its logic is written in interpreted softcode. I would expect a built-in solution to absolutely shit on an equivalent softcode solution, so a built-in implementation of my algorithm should offer even more dramatic performance improvements.

My algorithm is also hugely unoptimized at the moment. I've only done one of the five major optimizations that I've outlined since I initially wrote it.


As for benefit, spawn()/sleep() is a massively overused pattern in DM. Spires, every shitty anime game, and SS13 in particular pepper that shit everywhere. This performance increase would be a dramatic global benefit to every single project using the language.
The actual suggestion is to swap the scheduler queue to a reverse-ordered vector and change the insert to use O(log n) binary search rather than the O(n) iteration technique it currently uses.
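
In softcode terms the suggested insert path is just this (reusing the binarySeek() proc above; the built-in version would of course be native code over its own queue structure):

proc/scheduleInsert(list/schedule, sched)
    //O(log n) search for the slot, then a single Insert() there,
    //instead of walking the queue from the front
    var/pos = binarySeek(schedule, sched)
    schedule.Insert(pos, sched)
    return pos
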
There is a better way too.

/tg/ uses our own timer system instead of spawns (we still have to use sleep), and it maintains two queues: a rolling index-based list and a mostly unsorted list.


The index-based list contains the next 10 seconds' worth of entries; each index is a single tick, and is either null or the head of a doubly linked list of every timer that needs to run that tick.

Insert is O(1), picking is O(1), and every 10 seconds we timsort the other queue and refill the first queue, an O(n log n) operation (but only for timers that didn't fit in the first queue; timsort takes advantage of the already-sorted parts from the last refill, and since almost all timers fit within 10 seconds, there tends to not be much in this area).
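
Roughly, in softcode terms, the shape is this (just a sketch to show the idea, not our actual code; the names here are made up):

var/list/tick_buckets = new(100) //one slot per tick for the next 10 seconds (at default tick_lag)
var/list/second_queue = list()   //everything further out; sorted and drained on the periodic refill

proc/timer_insert(datum/timer, wake_time)
    //ticks from now, assuming wake_time is in world.time units
    var/slot = round((wake_time - world.time) / world.tick_lag) + 1
    if(slot >= 1 && slot <= tick_buckets.len)
        //O(1): append to that tick's bucket (the real thing uses a doubly linked list)
        var/list/bucket = tick_buckets[slot]
        if(!bucket)
            bucket = list()
            tick_buckets[slot] = bucket
        bucket += timer
    else
        //O(1) append here too; this queue only gets sorted every 10 seconds
        second_queue += timer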

Edit: In fact, if I used your binary insert for timers that have to use the second queue, I could kill the timsort overhead entirely.

Feel free to lift all, some, or none of this wholesale. Whatever works for you guys. Any code I post here is basically fair game, CC0.

Would really recommend you not ignore the optimization notes I've made though. Huge chunks of this are very much experimental. I just wanted to see if I could beat the scheduler.

I'm brewing a second feature request related to this one that I think you guys will like a lot.
Pending some more tests, this is in for 512.1396.
I'm assuming "this" refers to the binary insertion, and not the two-list system /tg/ uses that I detailed in my post.
Binary insertion.
In response to Ter13
Ter13 wrote:
Would really recommend you not ignore the optimization notes I've made though. Huge chunks of this are very much experimental. I just wanted to see if I could beat the scheduler.


here, have another:
hlen = round(hlen/2)


replace with
hlen >> 1
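
Spelled out as the full drop-in statement, that is:

hlen = hlen >> 1 //same result as round(hlen/2) while hlen is a non-negative integer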


Lummox JR resolved issue with message:
The proc scheduler has been updated to improve response times for various cases arising in complex games. Also, a number of experimental changes have been made under the hood to improve server performance.


New results from the same test posted above.

The waitfor and spawn patterns are still pretty similar, and the softcode scheduler pattern is now hot garbage by comparison.

spawn: 51.626(old) | 2.154(new)
waitfor: 52.840(old) | 1.172(new)
scheduler: 10.767(old) | 12.826(new)

All my tests indicate a 25% decrease in overhead globally, and the new built-in scheduler is far and away faster than the old one, maybe 30x faster in some cases.
So the new code actually made your home-rolled scheduler worse? Interesting. I wonder why, unless it's tied to the fact that a binary insertion is going to be slower than front insertion for the smallest wait times.

The results for the rest are encouraging. I wonder to what extent proc call overhead may have gone down to contribute to those results, and how much is the scheduler change.
There is some fuzz in my scheduler numbers because the test case I'm using is random insertion and I'm only doing a single pass.

I doubt it got worse in reality; we're just looking at the consequence of a small sample size.
I feel this change is a huge deal
Nice job
In response to Lummox JR
Lummox JR wrote:
I wonder to what extent proc call overhead may have gone down


512.1395:
                                  Profile results (total time)
Proc Name                                      Self CPU Total CPU Real Time     Calls
--------------------------------------------- --------- --------- --------- ---------
/mob/verb/testprocoverhead_enhanced              19.022    59.928    59.927         1
/mob/proc/testprocoverhead_enhanced_chain        24.973    41.413    43.950  22542007
/proc/noop                                        1.219     1.762     4.327  22542007
/mob/proc/testprocoverhead_enhanced_mob_5var      2.168     2.391     2.661   2253896
/proc/testprocoverhead_enhanced_global_5var       2.039     2.211     2.437   2255528
/proc/testprocoverhead_enhanced_global_5arg       1.601     1.774     2.051   2257934
/mob/proc/testprocoverhead_enhanced_mob_5arg      1.614     1.777     2.037   2261937
/mob/proc/testprocoverhead_enhanced_mob_1arg      1.354     1.592     1.835   2249028
/mob/proc/testprocoverhead_enhanced_mob_1var      1.285     1.453     1.708   2242345
/proc/testprocoverhead_enhanced_global_1var       1.309     1.470     1.693   2260724
/proc/testprocoverhead_enhanced_global_1arg       1.183     1.340     1.607   2261171
/mob/proc/testprocoverhead_enhanced_mob_empty     1.117     1.289     1.528   2256311
/proc/testprocoverhead_enhanced_global_empty      1.044     1.209     1.466   2243133


512.1396:
                                  Profile results (total time)
Proc Name                                      Self CPU Total CPU Real Time     Calls
--------------------------------------------- --------- --------- --------- ---------
/mob/verb/testprocoverhead_enhanced              18.805    59.923    59.923         1
/mob/proc/testprocoverhead_enhanced_chain        25.174    41.639    44.433  24050476
/proc/noop                                        1.202     1.727     4.439  24050476
/proc/testprocoverhead_enhanced_global_5var       2.111     2.278     2.557   2405153
/mob/proc/testprocoverhead_enhanced_mob_5var      2.111     2.276     2.522   2405805
/mob/proc/testprocoverhead_enhanced_mob_5arg      1.626     1.819     2.093   2414606
/proc/testprocoverhead_enhanced_global_5arg       1.633     1.805     2.082   2399273
/proc/testprocoverhead_enhanced_global_1var       1.393     1.568     1.829   2397437
/mob/proc/testprocoverhead_enhanced_mob_1var      1.346     1.527     1.826   2411464
/mob/proc/testprocoverhead_enhanced_mob_1arg      1.170     1.359     1.642   2395208
/mob/proc/testprocoverhead_enhanced_mob_empty     1.134     1.315     1.628   2409388
/proc/testprocoverhead_enhanced_global_1arg       1.147     1.315     1.587   2418673
/proc/testprocoverhead_enhanced_global_empty      1.071     1.235     1.516   2393469


(Note: I profile over a fixed length of time, since it's a cheap way to avoid the 24-bit precision bug with incrementing loop counters, so when comparing two results, look at call count, not just time.)

code: https://gist.github.com/MrStonedOne/5f404e0453ee144b58b367a3ff461f70 (feel free to steal this if you find value in it)
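
The shape of that fixed-time approach, as a bare-bones sketch (not the code from the gist; noop() here is just a stand-in):

mob/verb/bench_noop()
    //run for a fixed wall-clock budget and let the built-in profiler count calls,
    //instead of counting up to one huge fixed number (loop counters lose integer
    //precision past 2^24 in DM)
    var/end_time = world.time + 600 //about a minute of server time, in deciseconds
    while(world.time < end_time)
        for(var/i in 1 to 10000) //small batches keep the counter exact
            noop()
        sleep(world.tick_lag) //yield so the tick (and world.time) can advance

proc/noop()
    return
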
I see there's some inconsistency in the results in terms of sometimes being slightly faster or slower in a way that doesn't fit the trend, but it does look like there's a trend on the start and end times. What's your overall opinion, looking at the data?
It's a time-limited system, so the call counts are the most telling.

While it lets me see how various things affect proc call overhead (both pre-proc and post-proc, in the more in-depth version), the more telling numbers are the call counts, which are consistently higher for the same time frame.

I ran this a few more times and got the same results; doing it for longer also gave the same results: slightly more calls for the new build in the same timeframe.

22 million to 24 million total calls is still close to a 10% gain, so that's a win in my book. (But I'm still holding out hope that you find some magical fix that kills off 90% of the overhead =P)

Edit: In terms of the in-depth version, I think the profiling overhead from how I was measuring pre- and post-proc call overhead was creating more noise than we saw in gains, so the noise hid the gains.
22 million to 24 million total calls is still close to a 10% gain

Shaving 10% off proc call overhead is a pretty significant optimization in my book. Especially when you consider it was done on a lark.
In response to Ter13
I've wanted to improve proc call overhead for quite some time; I just felt this was a good time to revisit it, and I spent some time investigating the assembly being produced.