ID:115176
 
Resolved
Persistent world.Export() connections could cause responses to be lost, even when the outgoing message itself was sent, in cases of high server load for the sender. For now, persistent connections have been disabled with the intent of investigating the problem further at a later time.
BYOND Version:486
Operating System:Linux
Web Browser:Firefox 5.0
Applies to:Dream Daemon
Status: Resolved (487)

Descriptive Problem Summary:

When the CPU load on the computer calling Export() is high (~100%), Export() often returns null even though the receiving server got the call and sent back a reply.

Numbered Steps to Reproduce Problem:

The easiest way to verify this is with Terulia Relay Chat.

1) download the TRC library, which doubles as a standalone client when run by itself (http://www.byond.com/developer/Gakumerasara/TRC)
2) compile and run it
3) in the TRC console, type "/subscribe test"
4) in the TRC console, type "/channel test"

5) type any text of your choosing in the TRC console to chat (Keep this TRC client open for the following steps...)

6a) log into FFO during peak hours (~8-11 PM EST) http://www.byond.com/games/Gakumerasara/FinalFantasyOnline or
6b) compile the TRC library into any "game" of your choosing that peaks at 100% CPU use.

7) resize the windows so you can see both TRC chat windows (one in the standalone, one in FFO et al) side-by-side

8) type any text of your choosing in FFO's TRC console to chat

You should notice that 100% of your messages are being received by the standalone client, but many of these same messages are being missed by the high-CPU version.

When you chat, the server queues your message to any users who should hear it. In your case, the message is being queued to "you" in both the standalone and in FFO et al. I have verified that the server is receiving queries from both clients, i.e. from FFO as well as the standalone, and that the same message is being returned to both. Thus the hangup is on the sender's end, where the value returned to Export() is simply being ignored.

I can send you any additional code you may need, but the downloadable TRC client is identical to the one that is being used in FFO.

Expected Results:

100% of returned Export()s are recognized.

Actual Results:

a high proportion of returned Export()s are ignored during peak use (high CPU load)

When does the problem NOT occur?

low CPU load (early mornings, for example)

Did the problem NOT occur in any earlier versions? If so, what was the last version that worked? (Visit http://www.byond.com/download/build to download old versions for testing.)

This wasn't a problem in previous BYOND versions (basically all spring). I'm assuming 484 is the problem, though I may need to verify.
484 didn't change the network code regarding world.Export(), so it's highly unlikely the bug first appeared in that version. That build was largely for fixes related to icon processing. The only thing that touched the networking was the profiler fix, and that was actually a frontend fix.
Ok; then if you'd like me to, I'll run back through some of the old versions and see where the problem is when I have some time this evening.
I've seen this before, but it's happened in offline tests far from a high CPU load. I could never pin down a reliable demo for it.

Mikau mentioned that SuperAntx had this same problem with Decadence, so maybe he can chip in with some information.
I was running some tests earlier and I found that the TRC server received virtually all of the exports. I confirmed that everything was working through the very last line of Topic(), meaning that something was returned by the TRC server. From there, many of the returned messages get lost and don't ever reach var/e = world.Export() in the FFO server's TRC client.
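
The kind of tracing described above can be sketched by tagging each message with a sequence number and comparing what each stage logged. This is only a rough Python illustration, not code from TRC or FFO; the function name `find_losses` and the sample numbers are made up for the example.

```python
def find_losses(sent, handled, returned):
    """Classify each sent sequence number by the last stage that saw it.

    sent:     numbers the caller passed to Export()
    handled:  numbers the remote Topic() logged on its last line
    returned: numbers for which the caller got a non-null result
    """
    sent, handled, returned = set(sent), set(handled), set(returned)
    return {
        # Never reached the remote server at all.
        "lost_before_server": sorted(sent - handled),
        # Remote server replied, but the caller never saw the reply.
        "lost_on_return": sorted((sent & handled) - returned),
        # Full round trip completed.
        "delivered": sorted(sent & handled & returned),
    }


# Illustrative values only: every message reaches the server,
# but two replies never make it back to the caller.
report = find_losses(
    sent=[1, 2, 3, 4, 5],
    handled=[1, 2, 3, 4, 5],
    returned=[1, 3, 4],
)
print(report)
```

If "lost_before_server" stays empty while "lost_on_return" fills up, that matches the behavior reported here: the outgoing call works and the reply goes missing on the way back.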

I need to run some tests to see whether persistence makes any difference.
And yes, I also discovered that CPU load alone is insufficient to cause this bug.

very odd...
You should definitely run some tests with persistent connections on/off. Back when the feature was first added I tried turning it on. It worked for a while, but on populated servers (and only there) it would start misplacing return values or associating them with the wrong Export() call. I could never come up with a good description of the problem, and turning persistent connections off made everything work, so I never got around to reporting it.
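
Return values ending up on the wrong call is what you would expect if replies on a shared connection are matched to calls purely by arrival order. Below is a toy Python model of that failure mode. It is an assumption about one possible cause, not BYOND's actual implementation (which isn't shown anywhere in this thread), and the names `match_by_order` and `match_by_id` are hypothetical.

```python
from collections import deque


def match_by_order(requests, reply_bodies):
    """Pair each arriving reply with the oldest unanswered request (FIFO)."""
    pending = deque(requests)
    matched = {}
    for body in reply_bodies:
        if not pending:
            break
        matched[pending.popleft()] = body
    return matched


def match_by_id(requests, tagged_replies):
    """Pair replies with requests using an ID carried inside each reply."""
    wanted = set(requests)
    return {rid: body for rid, body in tagged_replies if rid in wanted}


requests = ["q1", "q2", "q3"]

# Assume the reply to q1 is lost; only the other two arrive.
arrived = [("q2", "reply-to-q2"), ("q3", "reply-to-q3")]

by_order = match_by_order(requests, [body for _, body in arrived])
by_id = match_by_id(requests, arrived)

print(by_order)  # q1 and q2 get the wrong replies, q3 gets nothing
print(by_id)     # q2 and q3 get their own replies, q1 gets nothing
```

With order-based matching, one lost reply shifts every later reply onto the wrong call. With a fresh connection per call there is only ever one outstanding request per connection, which might explain why turning persistence off hides the problem.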

I'm kind of surprised that no one's reported this until now actually.
Reverting to 482 had no discernible effect.
On the other hand, removing persistence allows TRC to work with ~100% efficiency.
So far we haven't encountered a single "dropped" Export() from what we can tell.

Thanks, that's a good clue. Do you actually notice any efficiency improvements with persistence on? Because maybe we should just lose that; it's not really clear whether the extra packets needed to start up and shut down connections make much difference, and there's clearly a bug with it.

Nevertheless, we'll take a look at that code to see if there's anything obvious.
If there were any efficiency improvements, they were overshadowed by the high frequency of lost packets.

It's odd that the players say this has only been a problem for the past 2 weeks; perhaps my new apartment's network doesn't like persistent connections for some reason. (despite the internal network being all the same hardware...? and software...?) I made the switch to persistent connections back in March as I recall. I didn't receive any complaints at the time.

Personally, I would prefer to use persistent connections since at present I am unable to block syn/ack attacks. (I hadn't started blocking syn/ack as of yet, but this was the reason I switched to persistent connections in the first place.)