ID: 2315078
Applies to: Dream Daemon
Status: Resolved (512.1396)

Resolution: The proc scheduler has been updated to improve response times for various cases arising in complex games. Also, a number of experimental changes have been made under the hood to improve server performance.

My current tests indicate that 100K scheduler insertions take approximately 50-65 seconds on my 3.6GHz CPU / DDR3 1333 RAM machine.

According to Lum, the current scheduler queue is a linked list to avoid dynamic collection resizing performance costs.

This means that spawn()/sleep() has to search the scheduler queue element by element from the beginning to find where to insert the new entry. The larger your scheduler queue, and the further the new wake-up time is from the current time, the worse performance gets. For large games the overhead of a scheduler reinsertion is dramatic, and without a full picture of how this impacts developers, the big fish in this community throw their hands up in the air, totally unable to diagnose why their game is shitting the bed.
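
To make that cost concrete, here is a rough softcode sketch of the scan I mean (the actual queue is a native linked list inside the server, so treat this as illustrative only):

proc/linearInsert(list/queue, wake_time)
    //what an O(n) front-to-back insert into a sorted queue looks like;
    //every spawn()/sleep() pays this walk, and it grows with queue length
    var/pos = 1
    while(pos <= queue.len && queue[pos] <= wake_time) //scan until a later entry
        ++pos
    queue.Insert(pos, wake_time)
    return pos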

This is especially problematic because the spawn() pattern in particular makes CPU occupancy nearly impossible to profile, effectively hiding problem areas of your code. See my profiling results below: using waitfor procs in favor of spawn() makes it much more obvious where the CPU hit is coming from, without inducing much extra overhead.
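
For reference, the two patterns I'm comparing look like this (a minimal sketch; do_thing() is a hypothetical stand-in for the real work):

mob/proc/do_thing() //hypothetical placeholder for the actual work
    return

mob/proc/spawn_pattern()
    spawn(10)
        do_thing() //delayed body is anonymous; its cost blends into the calling proc

mob/proc/waitfor_pattern()
    do_thing_later() //returns immediately, just like spawn()

mob/proc/do_thing_later()
    set waitfor = 0 //caller resumes as soon as this proc sleeps
    sleep(10)
    do_thing() //cost shows up under this proc's own profiler entry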

My tests indicate that we can expect better than a five-fold increase in performance across the board, in both the best and worst cases.

I performed some testing in softcode to prove my suspicions:

//potential optimization #5: allow multiple schedules based on time granularity. A binary split of very large scheduler lists, attempting to segregate times with a progressive decrease in granularity vs list size should significantly improve scheduling performance and distribute that regained performance over time during Tick().

scheduler //potential optimization #4: Create special scheduler for cyclical events, removing scheduling overhead from repetition.
    var
        time = 0
        tick_lag
        cur_tick
        next_tick = 0

        default_invoke

        list/handles = list() //planned optimization #2: Flatten all 5 lists into a single structure. Should result in dramatic speed boost.
        list/scheduled = list()
        list/invoking = list()
        list/schedule = list()
        list/parameters = list()

    proc
        Schedule(datum/ref,delay,invoke,params,handle)
            if(!handle)
                handle = "\ref[ref]:[time+delay]"

            var/sched = time + delay
            if(delay<0 || (!cur_tick && sched < next_tick))
                call(ref,invoke || default_invoke)(time)
            else
                var/pos = binarySeek(schedule,sched)

                handles.Insert(pos,handle)
                scheduled.Insert(pos,ref)
                invoking.Insert(pos,invoke)
                schedule.Insert(pos,sched)
                parameters.Insert(pos,params)

            return handle

        Cancel(datum/handle)
            var/pos, epos
            if(istext(handle))
                pos = handles.Find(handle) //planned optimization #3: Circularize scheduler references to speed explicit deletion and patch holes.
            else if(isnum(handle))
                pos = handle
            else if(istype(handle)) //planned optimization #3: Should be able to pop the specific index if done properly.
                pos = scheduled.Find(handle)

            if(!pos)
                return 0

            epos = pos+1
            handles.Cut(pos,epos)
            scheduled.Cut(pos,epos)
            invoking.Cut(pos,epos)
            schedule.Cut(pos,epos)
            parameters.Cut(pos,epos)
            return 1

        Reschedule(datum/handle,delay,params)
            var/pos, epos
            if(istext(handle))
                pos = handles.Find(handle)
            else if(isnum(handle))
                pos = handle
            else if(istype(handle))
                pos = scheduled.Find(handle)

            if(!pos)
                return 0

            epos = pos+1

            handle = handles[pos]
            var/ref = scheduled[pos]
            var/invoke = invoking[pos]
            if(!params) params = parameters[pos]

            handles.Cut(pos,epos)
            scheduled.Cut(pos,epos)
            invoking.Cut(pos,epos)
            schedule.Cut(pos,epos)
            parameters.Cut(pos,epos)

            var/sched = time + delay
            if(delay<0 || (!cur_tick && sched < next_tick))
                call(ref,invoke || default_invoke)(arglist(params))
            else
                pos = binarySeek(schedule,sched)

                handles.Insert(pos,handle)
                scheduled.Insert(pos,ref)
                invoking.Insert(pos,invoke)
                schedule.Insert(pos,sched)
                parameters.Insert(pos,params)

            return 1

        Tick(time)
            cur_tick = args
            next_tick = time + (tick_lag || world.tick_lag)
            var/datum/ref, invoke, pos, params
            while((pos=schedule.len))
                if(schedule[pos] >= next_tick)
                    return
                ref = scheduled[pos]
                invoke = invoking[pos] || default_invoke
                params = parameters[pos]

                --handles.len
                --scheduled.len
                --invoking.len
                --schedule.len
                --parameters.len

                call(ref,invoke)(arglist(params))

proc
    binarySeek(list/l,value)
        //l is sorted in descending order (latest time first); returns the position
        //that value should be Insert()ed at to keep that order
        var/len = length(l)

        if(!len)
            return 1

        if(l[1]<=value)
            return 1
        else if(l[len]>=value)
            return len+1

        //l[1] > value > l[len], so the insertion point lies in 2..len:
        //step seek forward while the entry hlen ahead is still later than value,
        //halving hlen whenever it overshoots
        var/hlen = len, seek = 1
        while(hlen)
            if(seek+hlen <= len && l[seek+hlen] > value)
                seek += hlen
            else
                hlen = round(hlen/2)

        return seek+1


The scheduler datum is intended to function similarly to spawn(). It stores all the information it needs to schedule callbacks. The basic idea is that when an event is scheduled, its parameters are stored sorted by invocation time using a binary-search insertion. Everything is stored backward to increase processing speed, as it's significantly faster to pop items from the end of a list than to cut them from the beginning. This means that the soonest events are at the end of the list and the latest events are at the beginning.
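
A quick illustration of why backward storage matters (a minimal sketch, separate from the scheduler above):

proc/popLast(list/l)
    //shrinking len drops the last element without shifting anything else: O(1)
    if(!l.len)
        return
    . = l[l.len]
    --l.len

proc/popFirst(list/l)
    //cutting from the front forces every remaining element to shift down: O(n)
    if(!l.len)
        return
    . = l[1]
    l.Cut(1, 2)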


The functions below are the tests that I used for benchmarking:

mob
    var
        tested = 0
    verb
        test_scheduler()
            var/scheduler/s = new()
            s.default_invoke = "schedtest"
            tested = 0
            for(var/count in 1 to 100000)
                s.Schedule(src,rand()*10)

            var/etime = world.time+10
            while(world.time<=etime)
                s.Tick(world.time)
                sleep(world.tick_lag)

        test_spawn()
            tested = 0
            for(var/count in 1 to 100000)
                spawn(rand()*10)
                    ++tested

        test_waitfor()
            tested = 0
            for(var/count in 1 to 100000)
                waitfortest()
    proc
        schedtest()
            ++tested

        spawntest()
            ++tested

        waitfortest()
            set waitfor = 0
            sleep(rand()*10)
            ++tested


Results indicate that the scheduler is averaging 11 seconds to schedule and run 100,000 callbacks, while spawn() and waitfor are taking roughly 52.9 seconds each.

The difference between the waitfor and spawn tests in the profiler is negligible; the waitfor test just shows much more clearly than spawn() where the specific CPU occupancy comes from.



It should be noted that my scheduler outperforms spawn() in every single configuration even though its logic is written in interpreted softcode. I would expect a built-in solution to absolutely shit on an equivalent softcode solution, so a built-in implementation of my algorithm should offer even more dramatic performance improvements.

My algorithm is also hugely unoptimized at the moment. I've only done one of the five major optimizations that I've outlined since I initially wrote it.


As for benefit, spawn()/sleep() is a massively overused pattern in DM. Spires, every shitty anime game, and SS13 in particular pepper that shit everywhere. This performance increase would be a dramatic global benefit to every single project using the language.
The actual suggestion is to swap the scheduler queue to a reverse-ordered vector and change the insert to use O(log n) binary search rather than the O(n) iteration technique it currently uses.
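
In softcode terms the suggested insert path is just this (reusing the binarySeek() proc above; the built-in version would of course be native code over its own queue structure):

proc/scheduleInsert(list/schedule, sched)
    //O(log n) search for the slot, then a single Insert() there,
    //instead of walking the queue from the front
    var/pos = binarySeek(schedule, sched)
    schedule.Insert(pos, sched)
    return pos
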
There is a better way too.

/tg/ uses our own timer system instead of spawns (we still have to use sleep), and it maintains two queues: a rolling index-based list and a mostly unsorted list.


The index-based list contains the next 10 seconds' worth of entries; each index is a single tick, and is either null or the head of a doubly linked list of every timer that needs to run that tick.

Insert is O(1), picking is O(1), and every 10 seconds we timsort the other queue and refill the first queue, an O(n log n) operation (but only for timers that didn't fit in the first queue; timsort takes advantage of the already-sorted parts from the last refill, and since almost all timers fit within 10 seconds, there tends to not be much in this area).
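
Roughly, in softcode terms, the shape is this (just a sketch to show the idea, not our actual code; the names here are made up):

var/list/tick_buckets = new(100) //one slot per tick for the next 10 seconds (at default tick_lag)
var/list/second_queue = list()   //everything further out; sorted and drained on the periodic refill

proc/timer_insert(datum/timer, wake_time)
    //ticks from now, assuming wake_time is in world.time units
    var/slot = round((wake_time - world.time) / world.tick_lag) + 1
    if(slot >= 1 && slot <= tick_buckets.len)
        //O(1): append to that tick's bucket (the real thing uses a doubly linked list)
        var/list/bucket = tick_buckets[slot]
        if(!bucket)
            bucket = list()
            tick_buckets[slot] = bucket
        bucket += timer
    else
        //O(1) append here too; this queue only gets sorted every 10 seconds
        second_queue += timer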

Edit: In fact, if I used your binary insert for timers that have to use the second queue, I could kill the timsort overhead entirely.

Feel free to lift all, some, or none of this wholesale. Whatever works for you guys. Any code I post here is basically fair game, CC0.

Would really recommend you not ignore the optimization notes I've made though. Huge chunks of this are very much experimental. I just wanted to see if I could beat the scheduler.

I'm brewing a second feature request related to this one that I think you guys will like a lot.
Pending some more tests, this is in for 512.1396.
I'm assuming "this" refers to the binary insertion, and not the two-list system /tg/ uses that I detailed in my post.
Binary insertion.
In response to Ter13
Ter13 wrote:
Would really recommend you not ignore the optimization notes I've made though. Huge chunks of this are very much experimental. I just wanted to see if I could beat the scheduler.


here, have another:
hlen = round(hlen/2)


replace with
hlen >> 1
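
Spelled out as the full drop-in statement, that is:

hlen = hlen >> 1 //same result as round(hlen/2) while hlen is a non-negative integer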


Lummox JR resolved issue with message:
The proc scheduler has been updated to improve response times for various cases arising in complex games. Also, a number of experimental changes have been made under the hood to improve server performance.


New results from the same test posted above.

The waitfor and spawn patterns are still pretty similar, and the softcode scheduler pattern is now hot garbage by comparison.

spawn: 51.626(old) | 2.154(new)
waitfor: 52.840(old) | 1.172(new)
scheduler: 10.767(old) | 12.826(new)

All my tests indicate a 25% decrease in overhead globally, and the new built-in scheduler is far and away faster than the old one, maybe 30x faster in some cases.
So the new code actually made your home-rolled scheduler worse? Interesting. I wonder why, unless it's tied to the fact that a binary insertion is going to be slower than front insertion for the smallest wait times.

The results for the rest are encouraging. I wonder to what extent proc call overhead may have gone down to contribute to those results, and how much is the scheduler change.
There is some fuzz in my scheduler numbers because the test case I'm using is random insertion and I'm only doing a single pass.

I doubt it got worse in reality; we're just looking at the consequence of a small sample size.
I feel this change is a huge deal
Nice job
In response to Lummox JR
Lummox JR wrote:
I wonder to what extent proc call overhead may have gone down


512.1395:
                                  Profile results (total time)
Proc Name                                      Self CPU Total CPU Real Time     Calls
--------------------------------------------- --------- --------- --------- ---------
/mob/verb/testprocoverhead_enhanced              19.022    59.928    59.927         1
/mob/proc/testprocoverhead_enhanced_chain        24.973    41.413    43.950  22542007
/proc/noop                                        1.219     1.762     4.327  22542007
/mob/proc/testprocoverhead_enhanced_mob_5var      2.168     2.391     2.661   2253896
/proc/testprocoverhead_enhanced_global_5var       2.039     2.211     2.437   2255528
/proc/testprocoverhead_enhanced_global_5arg       1.601     1.774     2.051   2257934
/mob/proc/testprocoverhead_enhanced_mob_5arg      1.614     1.777     2.037   2261937
/mob/proc/testprocoverhead_enhanced_mob_1arg      1.354     1.592     1.835   2249028
/mob/proc/testprocoverhead_enhanced_mob_1var      1.285     1.453     1.708   2242345
/proc/testprocoverhead_enhanced_global_1var       1.309     1.470     1.693   2260724
/proc/testprocoverhead_enhanced_global_1arg       1.183     1.340     1.607   2261171
/mob/proc/testprocoverhead_enhanced_mob_empty     1.117     1.289     1.528   2256311
/proc/testprocoverhead_enhanced_global_empty      1.044     1.209     1.466   2243133


512.1396:
                                  Profile results (total time)
Proc Name                                      Self CPU Total CPU Real Time     Calls
--------------------------------------------- --------- --------- --------- ---------
/mob/verb/testprocoverhead_enhanced              18.805    59.923    59.923         1
/mob/proc/testprocoverhead_enhanced_chain        25.174    41.639    44.433  24050476
/proc/noop                                        1.202     1.727     4.439  24050476
/proc/testprocoverhead_enhanced_global_5var       2.111     2.278     2.557   2405153
/mob/proc/testprocoverhead_enhanced_mob_5var      2.111     2.276     2.522   2405805
/mob/proc/testprocoverhead_enhanced_mob_5arg      1.626     1.819     2.093   2414606
/proc/testprocoverhead_enhanced_global_5arg       1.633     1.805     2.082   2399273
/proc/testprocoverhead_enhanced_global_1var       1.393     1.568     1.829   2397437
/mob/proc/testprocoverhead_enhanced_mob_1var      1.346     1.527     1.826   2411464
/mob/proc/testprocoverhead_enhanced_mob_1arg      1.170     1.359     1.642   2395208
/mob/proc/testprocoverhead_enhanced_mob_empty     1.134     1.315     1.628   2409388
/proc/testprocoverhead_enhanced_global_1arg       1.147     1.315     1.587   2418673
/proc/testprocoverhead_enhanced_global_empty      1.071     1.235     1.516   2393469


(Note: I profile over a fixed length of time, since it's a cheap way to avoid the 24-bit precision bug with incrementing loop counters, so when comparing two results, look at call count, not just time.)

code: https://gist.github.com/MrStonedOne/5f404e0453ee144b58b367a3ff461f70 (feel free to steal this if you find value in it)
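
The shape of that fixed-time approach, as a bare-bones sketch (not the code from the gist; noop() here is just a stand-in):

mob/verb/bench_noop()
    //run for a fixed wall-clock budget and let the built-in profiler count calls,
    //instead of counting up to one huge fixed number (loop counters lose integer
    //precision past 2^24 in DM)
    var/end_time = world.time + 600 //about a minute of server time, in deciseconds
    while(world.time < end_time)
        for(var/i in 1 to 10000) //small batches keep the counter exact
            noop()
        sleep(world.tick_lag) //yield so the tick (and world.time) can advance

proc/noop()
    return
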
I see there's some inconsistency in the results in terms of sometimes being slightly faster or slower in a way that doesn't fit the trend, but it does look like there's a trend on the start and end times. What's your overall opinion, looking at the data?
It's a time-limited system, so the call counts are the most telling.

While it lets me see how various things affect proc call overhead (both pre-proc and post-proc, in the more in-depth version), the more telling numbers are the call counts, which are consistently higher for the same time frame.

I ran this a few more times and got the same results; doing it for longer also gave the same results: slightly more calls for the new build in the same timeframe.

22 million to 24 million total calls is still close to a 10% gain, so that's a win in my book. (But I'm still holding out hope that you find some magical fix that kills off 90% of the overhead =P)

Edit: In terms of the in-depth version, I think the profiling overhead from how I was measuring pre- and post-proc call overhead was creating more noise than we saw in gains, so the noise hid the gains.
22 million to 24 million total calls is still close to a 10% gain

Shaving 10% off proc call overhead is a pretty significant optimization in my book. Especially when you consider it was done on a lark.
In response to Ter13
I've wanted to improve proc call overhead for quite some time; I just felt this was a good time to revisit it, and I spent some time investigating the assembly being produced.