ID:2905425
 
BYOND Version:515.1621
Operating System:Windows 11 Home 64-bit
Web Browser:Chrome 120.0.0.0
Applies to:Dream Seeker
Status: Open

Descriptive Problem Summary:
Over at Eternia/Meranthe, which runs BYOND version 515.1621 on a Windows server, we're experiencing an issue where after about 48 hours of uptime all users get 'connection failed' when trying to connect to the game. Players who are already connected stay online, but nobody new can connect, and anyone who logs out is unable to connect again.

It only begins after an amount of uptime, roughly two days (sometimes a bit sooner).

Our solution has been to just reboot the game daily, but obviously something unusual is at play here. We also might be a unique case, since SS13 servers on the latest 515 reboot every round and would never reach this much uptime.

Console errors: https://i.imgur.com/Xesdibh.png
Server specs (Windows 11): https://i.imgur.com/udKU8vT.png



Does the problem occur:
Every time? Or how often? Every time, after sufficient uptime
In other games? Unsure
In other user accounts? All
On other computers? Unsure. Might be Windows specific host wise

When does the problem NOT occur?

Did the problem NOT occur in any earlier versions? If so, what was the last version that worked? The issue wasn't experienced in an earlier 515 build, just after sendmaps multi-threading was stable, but I'm also not sure since we do generally reboot daily anyway. We would need to downgrade to test

Workarounds: Reboot daily or suffer

I've had reports of this but no useful info to go on. I think it's going to require narrowing down to a specific version where the issue started. Beyond that, any log info about the connection and any other error messages on the client end would be helpful.

The earlier reports centered around a problem with unexpected messages from the client, suggesting the server had gotten into a weird state, but I had nothing to reproduce.

I will however add that the one thing I know about the earlier case (from Pomf) is that this was accompanied by "Network connection shutting down due to read error" on the server end (in the logs), and related in this particular case to a massively bloated .dyn.rsc with a huge number of files. So if you have a lot of runtime icon manipulation or uploads, that could be the same issue.
We see that 'network connection' error above, for the key PackedOut (who it always appears for, strangely). Will try to gather more info.
Sorry, maybe not related, but just in case it can help.

Some of our players are sometimes unable to connect to the server, getting a "Connection failed" error. The problem started to appear in recent months, maybe after updating the server from 1613 to 1620, but maybe that's just a coincidence. I should also mention that there are reasons to think the problem may lie with Russian ISPs.

1. This happens randomly, mostly for the same players: they can't join for hours, then it works, then "Connection failed" again.
2. When it happens, it happens only for our server; the clients were still able to connect to other SS13 servers.
3. In a couple of cases we found out that it was because of the ISP (a different one for different players). Connecting through a different ISP or via VPN can help. Changing the client BYOND version or switching to another PC did not help.
4. MTR (WinMTR) from client to server was good: no packet loss, or it was minimal.
5. In every case I could see the server calling IsBanned for the affected clients, so the disconnect happens after that point. I don't remember whether there were any BYOND/network errors in the DD log.
6. One player was still able to connect by simply waiting on the "Connection failed" popup. For another, waiting did not help, but a BYOND Membership did: it disables the 30-second delay before connecting (where ads would normally be shown; we don't see BYOND ads in Russia, so it's just a delay).

I also found that IsBanned can sometimes be called several times for a connecting client within one connection session. This happens randomly too, and is not always related to the "Connection failed" problem.

UPD: Got another player with this problem for tests. No magic like Membership helped. Relay server helped. Another point for ISP.

Also, it was not related to the BYOND version: the player was unable to connect to the server on the stable version too.
As far back as .1609 this still occurs. It has also gotten progressively worse over time, regardless of the BYOND version. Now the server will lock up within 24 hours (often less) and reject all connections. Our dyn.rsc is also tiny in comparison to most SS13 servers (128MB for reference, and likely only has maybe 2,000 files in it).

Also, build 515.1627 doesn't seem to be stable for us. There was a hang within a few hours of updating to it today, with no errors to go off of, including in the event viewer. So that has separate issues. We're still on 1627 but will downgrade to 1620 if it hangs again. Was hoping the recent connection bug you fixed might solve this issue.
A hang can only be diagnosed by breaking in with a debugger and generating a mini-dump. It doesn't generate any event viewer info of use. I can't look into a hang without either a mini-dump or a way to consistently reproduce it.

For your connection issue I'll need to see what your server side says about the failing connections, which should be in your logs.
campbell, our lowest pop server (and also a linux server), gets this same thing. basic version (0x83) world/Topic connections also fail so our server toolkit (tgstation-server) detects this and restarts the server.

happens anywhere from 50 hours to 68 hours in. depending on things i don't have visibility into.
In response to Lummox JR
Lummox JR wrote:
A hang can only be diagnosed by breaking in with a debugger and generating a mini-dump. It doesn't generate any event viewer info of use. I can't look into a hang without either a mini-dump or a way to consistently reproduce it.

For your connection issue I'll need to see what your server side says about the failing connections, which should be in your logs.

You've been given all messages shown to us. Unfortunately it seems BYOND isn't generating any kind of message for what's causing the connection to be rejected in this case. We can only give you what we have, and if the engine isn't providing enough details then it's on you to make it do so.
I can add that when the server is in this state it'll also reject Topic() connections, including ?ping.
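
Since a server in this state rejects Topic() connections too, an external probe can timestamp the moment it starts. Below is a minimal Python sketch, not an official tool. The packet layout (0x00 0x83 header, two-byte big-endian length, five padding bytes, query, trailing NUL) is the one commonly used by community tools for the basic 0x83 world/Topic packet and should be treated as an assumption; the function names, host and port are hypothetical.

```python
import socket
import struct


def build_topic_packet(query: str) -> bytes:
    """Build a basic (0x83) world/Topic packet.

    Assumed layout (from community implementations, not official docs):
    00 83 | uint16 big-endian length | 5 padding NULs | query | NUL
    where length counts the padding, the query and the trailing NUL.
    """
    payload = query.encode("utf-8")
    length = len(payload) + 6  # 5 padding bytes + 1 trailing NUL
    return b"\x00\x83" + struct.pack(">H", length) + b"\x00" * 5 + payload + b"\x00"


def probe(host: str, port: int, query: str = "?ping", timeout: float = 5.0):
    """Send one Topic query and report whether any well-formed reply came back.

    Returns (ok, detail). Connection errors and timeouts are reported
    rather than raised, so the caller can log them with a timestamp.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            sock.sendall(build_topic_packet(query))
            header = sock.recv(4)
    except OSError as exc:  # includes refused connections and timeouts
        return False, f"{type(exc).__name__}: {exc}"
    if len(header) < 4 or header[:2] != b"\x00\x83":
        return False, f"unexpected reply header {header!r}"
    return True, "reply received"
```

Calling `probe("127.0.0.1", 5000)` (hypothetical address) once a minute and logging the first failure alongside the server's uptime would show whether the cutoff lands at a consistent point, and whether the failure is a refused connection, a timeout, or a malformed reply.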
I don't know what I can add to detect the problem or spit out info when it occurs. Really it shouldn't be happening at all, so it's a complete mystery what I can include to improve debugging. Nominally it seems like there should be some info about the rejection on the server end, unless it's being blocked at a lower level like sockets. But a lower-level issue wouldn't be consistent across OSes.
No hang or connectivity issues after roughly 24 hours of uptime on 1627, so that's an improvement compared to the prior week or so testing on past versions. Will see how long it goes.

EDIT: 30 hours before we rebooted to update the game, so this was likely resolved with the recent connectivity fixes.
I'm not sure what connectivity fixes you're referring to. There's nothing in the last several releases related to connectivity, except a client-side fix which was unrelated to any of this.
As mentioned, the server was rejecting connections before 24 hours had passed recently (for the past two weeks or so, every day without fail). After updating to 1627 that didn't happen for the first time, so maybe something changed, or it was just a coincidence. Either way will see if it occurs again.

"For your connection issue I'll need to see what your server side says about the failing connections, which should be in your logs."
Where are these logs located? (Windows 11) If you meant the standard BYOND error logs, then there's nothing in them.
The logs I meant were ordinary world.log.

The change in 1627 was to prevent messages from being sent by the client during the initial handshake, which was really only relevant to reconnections, not to new connections. It wouldn't have any bearing on an issue where connections fail after a certain uptime.
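
For anyone gathering the server-side info requested in this thread, here is a small Python sketch for pulling the "Network connection shutting down due to read error" message out of a saved world.log. It only does a substring match on the text quoted earlier in the thread; the exact format of the surrounding log line is not known here, so nothing else is parsed. The function name and the sample lines in the test are hypothetical.

```python
from typing import Iterable, List, Tuple

# Message text as quoted earlier in this thread.
READ_ERROR = "Network connection shutting down due to read error"


def find_read_errors(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, stripped_line) for each line containing READ_ERROR.

    Line numbers are 1-based. Only a substring match is done, because the
    layout of the rest of the log line is an unknown.
    """
    matches = []
    for number, line in enumerate(lines, start=1):
        if READ_ERROR in line:
            matches.append((number, line.strip()))
    return matches
```

A file object can be passed directly, for example `find_read_errors(open("world.log", errors="replace"))`, assuming the log is saved under that name. Comparing the matching lines against the time connections started failing would show whether the read errors precede the lockup or only accompany it.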