Yeah, I went with a binary search. The old search was sequential through a linked list, although still in name order.
I'm kinda shocked you didn't see much improvement then...

I can't wait until we get our hands on it; maybe we'll fare better.
Lummox JR resolved issue with message:
Experimental: Var access has been changed internally to use sorted arrays instead of linked lists, in the hopes of improving speed for some intensive calculations.
Alright, looking at the profile: datums with fewer than 10 vars saw a slight slowdown, and datums with more than 10 vars saw an improvement.

so this managed to slow down atmos slightly and speed up the mob controller decently.

Edit: I still have more tests to run. This one was about accessing the last var in the var modification tree, an existing test case I already had lying around.

I'll need to randomize it somehow.
profile2: random var access testing.

I'm not seeing much improvement at all in this case.

A binary search should be yielding massive improvements, but there are none... how the hell.

Any chance of seeing some of the code involved? I'm really curious now.
Lack of improvement was consistent with my tests as well. They seemed to perform just about identically.

If there's a slowdown for datums with less than 10 vars, that's actually a case for reverting the change, because that would be the most common use case.

In the interest of comparison, this is the new AddGetVar():

static Var2 *AddGetVar(Var2List *vlist, DMVarId4 var_id) {
    Var2 *v1,*v;
    // no array yet: grab a recycled list from the pool if one is available
    if(!vlist->varA) {
        if(available_varlist) {
            Var2ListChain *vl = available_varlist;
            *vlist = vl->list;
            vl->list.varA = NULL;
            available_varlist = vl->next;
            vl->next = used_varlist;
            used_varlist = vl;
            --available_varS;
            ++used_varS;
        }
        else vlist->varM = 0;
        vlist->varS = 0;
    }

    v = vlist->varA;
    // grow the array by 4 slots whenever it fills up
    if(vlist->varS >= vlist->varM) {
        vlist->varM += 4;
        vlist->varA = v = (Var2*)realloc(v, sizeof(Var2)*vlist->varM);
        if(!v) {
            vlist->varS = 0;
            OutOfMemory();
            return NULL;
        }
    }

    // binary search by name; if the var already exists, return it
    DMStrId4 vname = varnameA[var_id], vname2;
    u2c first=0,last=vlist->varS,i;
    while(first < last) {
        v1 = &v[i = (first+last) >> 1];
        if(vname < (vname2=v1->name)) last = i;
        else if(vname > vname2) first = i+1;
        else return v1;
    }

    // not found: shift everything at or after the insertion point up one slot
    v1 = &v[first];
    for(i=vlist->varS++; i>first; --i) v[i] = v[i-1];

    v1->var_id = var_id;
    v1->name = vname;
    v1->val = NULL_VALUE;
    return v1;
}

I tried a few variations on the shifting code at the end, including memmove() and a loop that operated strictly on pointers. No difference in performance. This is what the old AddGetVar() looked like:

static Var *AddGetVar(Var **vlist,DMVarId4 var_id) {
    DMStrId4 vname = varnameA[var_id];
    Var *v;

    // walk the list (kept in name order) until we reach or pass vname
    while((v=*vlist)) {
        if(vname <= varnameA[v->var_id]) break;
        vlist = &v->next;
    }
    // in theory, we should never have two vars with the same name and different var_id
    if(v && v->var_id == var_id) {
        return v;
    }

    // not found: take a node from the pool, or allocate a fresh one
    if(available_var) {
        --available_varS;
        v = available_var;
        available_var = available_var->next;
    }
    else {
        v = (Var *)malloc(sizeof(Var));
        if(!v) {
            OutOfMemory();
            return NULL;
        }
    }
    v->var_id = var_id;
    v->val = NULL_VALUE;

    // link the new node in at the insertion point
    v->next = *vlist;
    *vlist = v;

    return v;
}

Both of those have a fetch/unfetch thing designed to keep data structures on hand rather than free and reallocate memory constantly. The var structure did not contain the var name intrinsically, so you see a lookup on varnameA. In theory, switching to a flat array and cutting out the varnameA lookup should've packed a one-two punch.
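
For reference, the structures involved look roughly like this (Value here is just a stand-in for the actual value type, and these are paraphrased rather than pasted declarations):

typedef struct Var {            // old linked-list node
    DMVarId4 var_id;            // index into varnameA[] for the name
    Value val;
    struct Var *next;
} Var;

typedef struct Var2 {           // new sorted-array element
    DMVarId4 var_id;
    DMStrId4 name;              // cached so the search skips the varnameA lookup
    Value val;
} Var2;

typedef struct Var2List {
    Var2 *varA;                 // array of vars, kept sorted by name
    u2c varS;                   // size: vars currently in use
    u2c varM;                   // max: allocated capacity, grown 4 at a time
} Var2List;

typedef struct Var2ListChain {  // pool entry for the fetch/unfetch recycling
    Var2List list;
    struct Var2ListChain *next;
} Var2ListChain;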
If there's a slowdown for datums with less than 10 vars,

That was with accessing the last var in the changed var list. When reading randomly, it went away.

510:
                                      Profile results (total time)
Proc Name                                              Self CPU Total CPU Real Time     Calls
------------------------------------------------------ --------- --------- --------- ---------
/datum/testdatum48vars/proc/testlocaldatumvaroverhead      6.017     6.017     6.017         1
/datum/testdatum24vars/proc/testlocaldatumvaroverhead      4.752     4.752     4.752         1
/datum/testdatum8vars/proc/testlocaldatumvaroverhead       3.828     3.828     3.828         1
/datum/testdatum1var/proc/testlocaldatumvaroverhead        3.226     3.226     3.226         1
/mob/verb/testvaroverheads                                 0.000    17.823    17.823         1

511:
                                      Profile results (total time)
Proc Name                                              Self CPU Total CPU Real Time     Calls
------------------------------------------------------ --------- --------- --------- ---------
/datum/testdatum48vars/proc/testlocaldatumvaroverhead      5.692     5.692     5.691         1
/datum/testdatum24vars/proc/testlocaldatumvaroverhead      4.645     4.645     4.645         1
/datum/testdatum8vars/proc/testlocaldatumvaroverhead       3.808     3.808     3.809         1
/datum/testdatum1var/proc/testlocaldatumvaroverhead        3.393     3.393     3.393         1
/mob/verb/testvaroverheads                                 0.001    17.539    17.539         1


(each test was 5 million varN++ operations)
The fact that the break-even appears to be around 6-7 vars is still troubling. The slowdown with only a few vars is a step in the wrong direction, albeit a tiny one. I'm on the fence about this.
Well, let's confirm that. datum1var is the first one tested, and it's been shown that whichever test runs first always profiles slower than the ones run later. Let me put together a reversed-order test.
510:
                                  Profile results (total time)
Proc Name                                    Self CPU Total CPU Real Time     Calls
--------------------------------------------- --------- --------- --------- ---------
/datum/testdatum48vars/proc/testvaroverhead       6.036     6.036     6.036         1
/datum/testdatum48vars/proc/testvaroverheadR      6.029     6.029     6.029         1
/datum/testdatum24vars/proc/testvaroverheadR      4.887     4.887     4.887         1
/datum/testdatum24vars/proc/testvaroverhead       4.686     4.686     4.686         1
/datum/testdatum8vars/proc/testvaroverhead        3.808     3.808     3.808         1
/datum/testdatum8vars/proc/testvaroverheadR       3.760     3.760     3.761         1
/datum/testdatum1var/proc/testvaroverhead         3.285     3.285     3.285         1
/datum/testdatum1var/proc/testvaroverheadR        3.112     3.112     3.111         1

511:
                                  Profile results (total time)
Proc Name                                    Self CPU Total CPU Real Time     Calls
--------------------------------------------- --------- --------- --------- ---------
/datum/testdatum48vars/proc/testvaroverhead       5.741     5.741     5.740         1
/datum/testdatum48vars/proc/testvaroverheadR      5.593     5.593     5.594         1
/datum/testdatum24vars/proc/testvaroverhead       4.560     4.560     4.560         1
/datum/testdatum24vars/proc/testvaroverheadR      4.554     4.554     4.554         1
/datum/testdatum8vars/proc/testvaroverhead        3.792     3.792     3.792         1
/datum/testdatum8vars/proc/testvaroverheadR       3.765     3.765     3.764         1
/datum/testdatum1var/proc/testvaroverheadR        3.372     3.372     3.372         1
/datum/testdatum1var/proc/testvaroverhead         3.356     3.356     3.357         1


I was hoping it might be because I was running the tests concurrently and would always start 510 first (since it's the first one on the taskbar), meaning the 1-var test would partially run while the CPU was only running one test.

but running them in series didn't seem to change the result.

Running them once, clearing the profile, then running them again (to pre-fill the available_var/available_varlist pools) was also done for this profile.

This is clearly the result of some of the boilerplate code.

I don't know what this is about:

for(i=vlist->varS++; i>first; --i) v[i] = v[i-1];


but it seems like a likely cause, as it's the only new loop I see.

That code is to shift the vars in the array when a new one is added. In the linked list code it wasn't necessary.

I tried memmove() and a pointer-only loop, but neither did any better.
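
The memmove() version amounts to roughly this:

// Same effect as the shifting loop: open a hole at 'first' by moving
// everything at or after it up one slot, then bump the size.
memmove(&v[first+1], &v[first], (vlist->varS - first) * sizeof(Var2));
++vlist->varS;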
This is also a slowdown, because it won't be effectively branch predicted:
if(vname < (vname2=v1->name)) last = i;
else if(vname > vname2) first = i+1;
else return v1;


This is the hard part of doing a good binary search: random branches can negate the benefit. (I've only recently studied up on this.)

if you wanted to get wild, you could try:

if (vname == (vname2=v1->name))
    return v1;
bool b = vname < vname2;
bool bn = !b;
last = i*(b) + last*(bn);
first = (i+1)*(bn) + first*(b);


Edit: or just use ?: for a conditional move (works on everything but clang, which just compiles it back to a branch):
if (vname == (vname2=v1->name))
    return v1;
last = (vname < vname2 ? i : last);
first = (vname > vname2 ? i+1 : first);
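
Folded back into the search loop from AddGetVar(), the whole thing would look roughly like this (same declarations as above, just the window updates made branchless):

// sketch: drop-in replacement for the existing while() search
DMStrId4 vname = varnameA[var_id], vname2;
u2c first = 0, last = vlist->varS, i;
while(first < last) {
    v1 = &v[i = (first+last) >> 1];
    if(vname == (vname2 = v1->name))
        return v1;                          // exact match: done
    last  = (vname < vname2 ? i   : last);  // hopefully a cmov, not a jump
    first = (vname > vname2 ? i+1 : first);
}
// first is now the insertion point, same as before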

Edit 2: yeah, that's the only thing that could be a slowdown compared to the old code.

The first big bit of boilerplate code checks whether the datum has a var list and creates one if not (fetching from a pool if it can); the second bit allocates the array if one couldn't be fetched from a pool, or expands it if needed; and the for loop I mentioned only runs if the var doesn't already exist in the list.

All three are cases that don't apply in my profile, because I ensure the vars are changed and in the list at the object's creation (just a simple bit of code at creation that assigns to all of its vars).
Hrm. I've never even considered boolean multiplication, but that's an interesting idea. However I'm concerned it may not compile right.

http://stackoverflow.com/questions/14034478/boolean-multiplication-in-c

However subtracting 1 and using AND would probably handle that efficiently. I might try that.

u2c bn = vname < vname2;
u2c b = !(bn--) - 1;
last = (i & b) | (last & bn);
first = ((i+1) & bn) | (first & b);
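
Walking through the two cases to be sure it does what I want (assuming u2c is unsigned, so the decrement wraps, and the == case has already returned v1 before this):

// vname < vname2: bn = 1, so b = !(1) - 1 = all ones, and bn-- leaves bn = 0
//   last  = (i & ~0) | (last & 0)      -> i
//   first = ((i+1) & 0) | (first & ~0) -> first (unchanged)
// vname > vname2: bn = 0, so b = !(0) - 1 = 0, and bn-- wraps bn to all ones
//   last  = (i & 0) | (last & ~0)      -> last (unchanged)
//   first = ((i+1) & ~0) | (first & 0) -> i+1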
Per my edit (that you might not have seen), ?: compiles to a conditional move and doesn't incur the branch-prediction overhead of a conditional jump.

https://realm.io/news/how-we-beat-cpp-stl-binary-search/
Hrm, the conditional move might work well. I'll have to test it out and see how this compiles. But I like this idea a lot. If there's any hay to be made there, then binary searches could probably be sped up for bags (most notably, appearances) and strings.
It appears that in VC6 (I know), it does not compile to conditional move instructions. VS2013 probably will, so I'll run some tests on that tomorrow and see what happens. All the releases are built with VS2013.
Out of curiosity, have you profiled simply using the C++ STL? I'm sure any overhead would be entirely trivial.

- std::vector instead of an array will offer the same caching benefits over a linked list due to its contiguous memory layout.

- std::map instead of the hand-rolled sorted lookup. (std::map is a sorted associative container, typically implemented as a red-black tree.)

- VS2013 supports C++11, so there's also std::unordered_map, i.e. a hash table with O(1) average lookup. Do vars need to be sorted at all? (Rough sketch of what I mean below.)
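
Roughly like this (obviously not the real engine types, just the shape of the idea; it keys straight on the var id and assumes DMVarId4 is an integer type so std::hash works on it as-is):

// Hypothetical sketch only -- none of these are the real BYOND declarations.
#include <unordered_map>

typedef unsigned int DMVarId4;          // assumption: some integer id type
struct Value { int type; int data; };   // stand-in for the engine's value struct

struct VarTable {
    std::unordered_map<DMVarId4, Value> vars;   // hash table: O(1) average, unsorted

    Value *AddGetVar(DMVarId4 var_id) {
        // operator[] default-constructs an entry when the id isn't present,
        // which matches the add-or-get behaviour of the existing function
        return &vars[var_id];
    }
};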

I'm curious if this method is even the problem area to focus on.

- How does the var data get passed back to the VM? (Is there a bottleneck in the handshake here?)

- In the case of for(index = i, index < j, index += k), does index get resolved once, three times, or on every iteration? Even though the 3 logical sections of that loop header can reference different variables entirely, the VM could do a hard resolve on each one once, cache it, and then soft-resolve from its cache as needed.

My estimate is that you're zoning in on small optimizations which will yield small gains, but there are likely other big-picture optimizations which might yield bigger gains.

Edit: Please refer to this StackOverflow post, particularly to the "same damn thing" argument in the answers section. The desire to micro-optimize probably produces more developer effort and error than it does any tangible gains.
It's easy to say "oh, just use standard library data structures," but it's highly likely the BYOND source has assumptions about the underlying data structures embedded throughout it, which makes them hard to swap out.

All any of us can really do is uninformed speculation, which honestly doesn't help.
My estimate is that you're zoning in on small optimizations which will yield small gains, but there are likely other big-picture optimizations which might yield bigger gains.

Arguable. The idea here is that this function is used in a lot of code, and has a non-trivial amount of overhead that would be trivial to remove.

Both tobba and I have dug into the BYOND code, and when we did, this change was an immediate catch.

Oranges:
All any of us can really do is uninformed speculation, which honestly doesn't help.

Basically, ya.

Also, a quick word on this:

STL

The thing to remember here is that whenever you go generic over already-existing specialized code, you take massive performance hits. Genericization doesn't come free: it generally involves extra checks, or extra costs from template type resolution, and it can increase binary size non-trivially.

Take C++'s getline(). If you generate a big text file, say 1GB, read from it a few times to load it into the disk cache, then try to getline() through the file, you will be lucky to get 11MB/s. I know, because I had to change our runtime condenser, a C++ app that condenses out duplicate runtimes and sorts them all by number of occurrences. It was much slower than it had any right to be, and a profile showed that all of the cost was in getline().

If you make your own getline() that can make certain assumptions about the format and layout of the file, and that instead buffers multiple lines to reduce I/O calls and reads each line directly into the right vars, you'll get 50 to 100MB/s.
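
Something along these lines (very stripped down: assumes plain '\n'-terminated lines and that no single line is longer than the buffer):

#include <stdio.h>
#include <string.h>

#define CHUNK (1 << 20)   // 1MB read buffer

// stand-in for whatever the runtime condenser actually does with each line
static void process_line(const char *line, size_t len) { (void)line; (void)len; }

static void read_lines(FILE *f) {
    static char buf[CHUNK + 1];
    size_t leftover = 0, n;
    // one big fread() per chunk, then scan for '\n' in memory instead of
    // paying per-line i/o and allocation overhead
    while((n = fread(buf + leftover, 1, CHUNK - leftover, f)) > 0) {
        size_t total = leftover + n;
        char *start = buf, *nl;
        while((nl = (char*)memchr(start, '\n', total - (start - buf))) != NULL) {
            process_line(start, (size_t)(nl - start));
            start = nl + 1;
        }
        // keep the partial last line for the next read
        leftover = total - (start - buf);
        memmove(buf, start, leftover);
    }
    if(leftover) process_line(buf, leftover);   // final unterminated line
}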

Generic == slow. If speed isn't an issue, yeah, go ahead; it will speed up development, and I'm not arguing against that. But I think we can agree that code that runs over every var a datum/atom/obj/etc has, every time one of its vars is accessed, should not carry a ~20 CPU cycle overhead.

To drive my point home, let's do the math on that one. Say we have a proc that, over the course of its run, accesses 10 vars on its object, and the object has 65 vars (this isn't unreasonable for mobs; in SS13 they get up to 300 vars). This is a binary search, so we can take log2 of the var count, meaning 7 checks per var access, so 10 accesses = 70 checks. Because the binary search is jmp-based and not mov-based, each failed branch prediction adds an extra 20 or so CPU cycles. So at worst case, that's 1,400 wasted CPU cycles in that proc. A loop runs this proc on every object of the same type in view(), so say there were 100 such objects in view(), and now we are at 140,000 CPU cycles wasted. And the primary point is that the overhead can be fixed with a two-line change, so it's not like it's a lot of effort.
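
Spelled out:

// worst case from the paragraph above, as plain arithmetic
enum {
    CHECKS_PER_LOOKUP = 7,     // ceil(log2(65 vars))
    LOOKUPS_PER_PROC  = 10,    // var accesses in the proc
    MISS_PENALTY      = 20,    // rough cycles per mispredicted branch
    OBJECTS_IN_VIEW   = 100,
    CYCLES_PER_PROC   = CHECKS_PER_LOOKUP * LOOKUPS_PER_PROC * MISS_PENALTY,  // 1,400
    CYCLES_PER_PASS   = CYCLES_PER_PROC * OBJECTS_IN_VIEW                     // 140,000
};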

The STL map and unordered_map have the exact same overhead; almost anything in the STL that does a binary search does. https://realm.io/news/how-we-beat-cpp-stl-binary-search/

Oh, and this is the top question of all time on Stack Overflow: http://stackoverflow.com/a/11227902

A general rule of thumb is to avoid data-dependent branching in critical loops.

I think we can agree that a loop that has to run for every var access is a critical loop; 90% of game code will involve this loop in some way.


Edit:

In the case of for(index = i, index < j, index += k), does index get resolved once, three times, or on every iteration? Even though the 3 logical sections of that loop header can reference different variables entirely, the VM could do a hard resolve on each one once, cache it, and then soft-resolve from its cache as needed.

To answer this question: three times, on every iteration. But I should point out that none of the code he showed involves proc-level vars, only user-created vars inside objects (datums, atoms, etc.); proc-level vars resolve to what is basically procvars[indexnumber], with procvars being a flat array. You might be onto something here about the optimization, but there already exists syntax that removes a lot of that overhead: for (index in i to j step k). index is still resolved each loop, but the rest would not be, and the assignment would happen outside of userland, shaving off some more overhead.
In response to MrStonedOne
MrStonedOne wrote:
Both tobba and I have dug into the BYOND code, and when we did, this change was an immediate catch.

I'm sure you're both experts in the field, but this change clearly didn't have very tangible results in practice.

To the moderator that edited this out: Saying that the remainder of that post is nonsense violates none of the posting guidelines.
You can criticize the content of a post

So, again, the remainder of your post was nonsense. Have a nice day.