ID: 2380458

BYOND Version: 512.1426
Operating System: Windows 7 Ultimate 64-bit
Web Browser: Chrome 67.0.3396.87
Applies to: Dream Daemon
Status: Verified

Descriptive Problem Summary:

I really struggle to explain this one. I discovered a permanent fucked-up state that BYOND can get into when dereferencing a large number of items at once, whether by setting a list's len to zero, cutting the list, or dropping the reference to the list itself. My test cases have been getting weirder and weirder, but I've got a consistent one that shows the problem in action.

The weird part of this problem is that it's outright catastrophic, yet oddly specific as to when it happens. It just doesn't make sense at all, and I have a feeling this one is causing pretty much everyone's games horrible grief. I'm talking about object creation slowing down 10- to 100-fold, and global code execution slowing down by up to 33%.

Numbered Steps to Reproduce Problem:

1) Create a large number of objects on the map, storing them in a list belonging to an object (local lists won't work).

2) Assign the length of the list to 0, cut the list, or throw the list away.

3) Loop over the world and move all the objects you created to a null loc. This causes them to be garbage collected.

From this point on, BYOND permanently runs like garbage when creating new objects, and every proc I've tested runs about 33% slower. There's also a permanent ~20 MB memory footprint that never goes away, though it doesn't grow on subsequent tests.

Now, here's the really strange part. If you do it in this order, the problem doesn't happen:

1) Create a large number of objects on the map, storing them in a list belonging to an object (again, local lists do not work).

2) Loop over the world and move all the objects you created to a null loc. This will not garbage-collect them, because they are still referenced by the list.

3) Assign the list's len to 0. This will trigger garbage collection of all the objects.

Interestingly, THIS process does not cause the bug, yet it is affected by the slowdown once the bug has been triggered.

Even more interesting: when performing the bugged process, the slowdown does not recur within the same running proc, even across sleeps; it only shows up in procs started on a subsequent frame. So once the bug is triggered, iteration 1 of my test always shows the slowdown, but iterations 2+ do not. And when triggering the proc for the very first time, iteration 1 doesn't show the slowdown, iteration 2 does, and iteration 3 onward does not.

Code Snippet (if applicable) to Reproduce Problem:

This code snippet will show the problem in action and act as a compilable test case.

world
    maxx = 1000
    maxy = 1000
    maxz = 1

#define ACCURATE_TIME (world.time + world.tick_lag * world.tick_usage/100)

var
    bugged = 0

mob
    var
        list/l = list()
        test_repetitions = 5
        test_iterations = 100000
    verb
        demonstrate_stable()
            world.log << "Demonstrating stable behavior[global.bugged ? " is affected by unstable behavior" : ""]..."

            var/start_time, duration, r, i, c

            for(r in 1 to test_repetitions)
                world.log << "Creating [test_iterations] objs..."

                sleep(1)

                start_time = ACCURATE_TIME
                c = 0
                for(i in 1 to test_iterations)
                    l += new/obj(locate(rand(1,1000),rand(1,1000),1))
                    ++c
                duration = ACCURATE_TIME - start_time
                world.log << "Created [c] objs in [duration*100]ms"

                world.log << "Deleting [test_iterations] objs..."

                sleep(1)

                start_time = ACCURATE_TIME
                c = 0
                for(var/obj/o in world)
                    o.loc = null
                    ++c
                world.log << "Clearing list (triggering garbage collection)"
                l.len = 0 //this line does not cause the bug
                duration = ACCURATE_TIME - start_time

                world.log << "Deleted [c] objs in [duration*100]ms"

                sleep(1)

            world.log << "Completed demonstrating stable behavior..."

        demonstrate_unstable()
            world.log << "Demonstrating unstable behavior..."

            var/start_time, duration, r, i, c

            for(r in 1 to test_repetitions)
                world.log << "Creating [test_iterations] objs..."

                sleep(1)

                start_time = ACCURATE_TIME
                c = 0
                for(i in 1 to test_iterations)
                    l += new/obj(locate(rand(1,1000),rand(1,1000),1))
                    ++c
                duration = ACCURATE_TIME - start_time
                world.log << "Created [c] objs in [duration*100]ms"

                world.log << "Deleting [test_iterations] objs..."

                sleep(1)

                start_time = ACCURATE_TIME
                c = 0
                world.log << "Clearing list (not triggering garbage collection)"
                l.len = 0 //this line causes the bug
                for(var/obj/o in world)
                    o.loc = null
                    ++c
                duration = ACCURATE_TIME - start_time

                world.log << "Deleted [c] objs in [duration*100]ms"
                global.bugged = 1

                sleep(1)

            world.log << "Completed demonstrating unstable behavior..."

Stonedone did some analysis of this and found that the speed hit seemed to be coming from Astar nodes for some reason.

The really compelling part about that theory is that if you never give these objs a loc, but still induce the deref problem another way, the bug never happens. So this bug has something to do with putting the objects on the map and then messing with the list's length in such a way that the items lose their refcounts without being fully dereferenced.

Here is some profile data from one of the slowed-down rounds:

byondcore.dll is loaded at 0x688F0000 (1427)


created 1e+006 objects in 50.8 ticks
deleted 1e+006 objects in 39.92 ticks
created 1e+006 objects in 5131.94 ticks <--- profile is of this run. same code as the created one above. tick rate is 25fps

I've investigated this issue and found what's going on. This is not an easy thing to address, but I would like to keep it on the list as a target. It will not, however, affect the vast majority of games, or even the vast majority of the minority; creating 100K objects in one tick is a rare thing.

The slowdown is happening when an object ID is added to the dirty map chunk. Map chunks were introduced in 510 to deal with the visual bounds problem of big-icon objects; the dirty chunk keeps track of objects that need to have their visual bounds refreshed. The list of obj IDs is kept sorted, and when a new ID is added it's done via a binary search and then an insertion that shifts higher IDs forward.

When IDs are added in descending order, a worst-case scenario is triggered where the insertion does more work each time: every new ID belongs at the front of the list, so each insertion has to shift every ID already present, making N insertions cost O(N²) overall. This is what's happening here.
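
BYOND's internals aren't public, so the following is only a minimal C++ sketch of the mechanism as described, with invented names: a vector kept sorted, a binary search for the insertion point, and an insert that shifts higher IDs forward. Feeding it descending IDs hits the quadratic worst case.

#include <algorithm>
#include <cstdio>
#include <vector>

// Hypothetical stand-in for a dirty map chunk's sorted obj-ID list.
struct DirtyChunk {
    std::vector<unsigned> ids;  // kept sorted, ascending

    void add(unsigned id) {
        // Binary search for the insertion point...
        auto pos = std::lower_bound(ids.begin(), ids.end(), id);
        if (pos != ids.end() && *pos == id) return;  // already tracked
        // ...then insert, shifting every higher ID forward one slot.
        ids.insert(pos, id);
    }
};

int main() {
    DirtyChunk chunk;
    // Descending IDs: every insert lands at the front, so each one
    // shifts all previously added IDs -- O(N^2) total work.
    for (unsigned id = 100000; id > 0; --id)
        chunk.add(id);
    std::printf("%zu ids tracked\n", chunk.ids.size());
    return 0;
}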

There's a stack of available IDs for various built-in types like obj, mob, etc., so the server doesn't have to do what it used to: scan the entire obj list until it finds a null. If objs are deleted in ascending ID order, the highest ID ends up on top of the stack of available IDs--so when the objs are re-created, they come back in descending ID order. Adding them to a turf loc adds them to the dirty chunk, which is what triggers the worst-case insert.
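
To make the ordering concrete, here's an equally hypothetical sketch of a LIFO free-ID stack: the last ID freed is the first reused, so releasing IDs in ascending order hands them back in descending order.

#include <cstdio>
#include <stack>

// Hypothetical free-ID allocator: freed IDs are reused LIFO.
struct IdAllocator {
    std::stack<unsigned> free_ids;
    unsigned next_fresh = 0;

    unsigned alloc() {
        if (!free_ids.empty()) {
            unsigned id = free_ids.top();
            free_ids.pop();
            return id;
        }
        return next_fresh++;  // no freed ID available; mint a new one
    }
    void release(unsigned id) { free_ids.push(id); }
};

int main() {
    IdAllocator ids;
    for (int i = 0; i < 5; ++i) ids.alloc();          // mint IDs 0..4
    for (unsigned i = 0; i < 5; ++i) ids.release(i);  // delete ascending
    for (int i = 0; i < 5; ++i) std::printf("%u ", ids.alloc());
    std::printf("\n");  // prints: 4 3 2 1 0 -- descending reuse
    return 0;
}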

Stable case (fresh start):

1) New objs are created, in ascending ID order, and added to a list and to the map in that same order.
2) Looping through objs in world (in ascending ID order), objs are moved to null.
3) The list is cleared. Clearing happens from last to first, so the objs are deleted in descending ID order--therefore putting the lowest ID at the top of the available-ID stack.
4) When the objs are re-created, they pull IDs from the stack and therefore are remade in ascending order just like in step 1.

Unstable case:

1) New objs are created, in ascending ID order if this is a fresh start, and added to a list and to the map in that same order.
2) The list is cleared. Clearing happens from last to first, but the objs are not deleted because they're on the map.
3) Looping through objs in world (in ascending ID order), objs are moved to null. This deletes them in ascending ID order, so the highest ID ends up on top of the available-ID stack.
4) When the objs are re-created, they pull IDs from the stack and are therefore remade in descending order. This adds them to the dirty chunk in descending order as well, which is what triggers the worst-case scenario. Because step 3 ensures they are always deleted in ascending ID order, this keeps happening--and it is why the slowdown now persists in the stable case too. (The sketch below simulates both orderings.)
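
Putting the two mechanisms together, this small self-contained C++ simulation (illustrative, not engine code) counts how many elements the sorted insert must shift when IDs arrive ascending versus descending. The counts are kept small here, but the descending case scales quadratically, matching the 100K-object test case.

#include <algorithm>
#include <cstdio>
#include <vector>

// Count how many elements each binary-search insert has to shift.
static long long insert_cost(const std::vector<unsigned>& order) {
    std::vector<unsigned> sorted;
    long long shifts = 0;
    for (unsigned id : order) {
        auto pos = std::lower_bound(sorted.begin(), sorted.end(), id);
        shifts += sorted.end() - pos;  // elements moved by insert()
        sorted.insert(pos, id);
    }
    return shifts;
}

int main() {
    const unsigned n = 10000;
    std::vector<unsigned> ascending, descending;
    for (unsigned i = 0; i < n; ++i) ascending.push_back(i);
    for (unsigned i = n; i-- > 0; ) descending.push_back(i);

    // Stable case: IDs reused ascending -> inserts land at the back.
    std::printf("ascending:  %lld shifts\n", insert_cost(ascending));
    // Unstable case: IDs reused descending -> every insert at the front.
    std::printf("descending: %lld shifts\n", insert_cost(descending));
    return 0;
}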

So it seems the fix would require changing the dirty chunk so that it does not keep IDs sorted at all times, and instead takes a loose list that may contain duplicates, sorting on an as-needed basis. A dual-pivot quicksort of 100K IDs followed by a duplicate purge seems likely to be faster. Either that, or a hashtable solution might be sensible.
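
A sketch of that deferred approach (my reading of the suggestion, not the eventual implementation): appends become O(1), and the list is sorted and purged of duplicates only when it's actually read.

#include <algorithm>
#include <vector>

// Hypothetical dirty chunk that defers sorting until read time.
struct LazyDirtyChunk {
    std::vector<unsigned> ids;
    bool needs_sort = false;

    // O(1) append; duplicates and ordering don't matter yet.
    void add(unsigned id) {
        ids.push_back(id);
        needs_sort = true;
    }

    // One O(N log N) sort plus a duplicate purge, only when consumed.
    const std::vector<unsigned>& sorted_ids() {
        if (needs_sort) {
            std::sort(ids.begin(), ids.end());
            ids.erase(std::unique(ids.begin(), ids.end()), ids.end());
            needs_sort = false;
        }
        return ids;
    }
};

With this scheme, insertion order stops mattering entirely, so descending ID reuse no longer has a pathological case.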

Another option, not really a fix to this issue but one that would ameliorate it considerably, would be to change the ID stacks to min-heaps. It's a little more work than the current stack method but it would force lower IDs to be reused more often.
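
And a sketch of the min-heap alternative, again with invented names: alloc() always returns the lowest free ID, so re-creation order is ascending regardless of deletion order.

#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

// Free IDs kept in a min-heap instead of a stack: alloc() always
// hands back the lowest available ID.
struct MinHeapIdAllocator {
    std::priority_queue<unsigned, std::vector<unsigned>,
                        std::greater<unsigned>> free_ids;
    unsigned next_fresh = 0;

    unsigned alloc() {
        if (!free_ids.empty()) {
            unsigned id = free_ids.top();
            free_ids.pop();
            return id;
        }
        return next_fresh++;
    }
    void release(unsigned id) { free_ids.push(id); }
};

int main() {
    MinHeapIdAllocator ids;
    for (int i = 0; i < 5; ++i) ids.alloc();          // mint IDs 0..4
    for (unsigned i = 0; i < 5; ++i) ids.release(i);  // delete ascending
    for (int i = 0; i < 5; ++i) std::printf("%u ", ids.alloc());
    std::printf("\n");  // prints: 0 1 2 3 4 -- ascending reuse
    return 0;
}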

Lummox JR changed status to 'Verified'

Why not use another sort method that doesn't have insertion sort's worst-case characteristics? Say, timsort: https://en.wikipedia.org/wiki/Timsort

The list is kept in a sorted state, not sorted after the fact. Timsort would be inappropriate.