ID:2192383
 
BYOND Version:510
Operating System:Linux
Web Browser:Chrome 55.0.2883.87
Applies to:Dream Daemon
Status: Open

Descriptive Problem Summary:
This is a slight continuation of an old post (http://www.byond.com/forum/?post=2088350), but I'd like to shed some new light.
I'm currently on the latest stable for Linux (510.1346).
After a while, Dream Daemon is still "running" and responding on the open port, but users can no longer log in. Users who are already logged in are not dropped, but they can no longer perform any actions.
World log shows no runtimes or anything announcing a crash.

Numbered Steps to Reproduce Problem:
1. Host with DD.
2. After a while, freeze!

Did the problem NOT occur in any earlier versions? If so, what was the last version that worked? (Visit http://www.byond.com/download/build to download old versions for testing.)
It has occurred both before and after I took over the service.

I can provide any sort of information or run anything necessary on the server(s).
Does it really shed any new light, though? It sounds like the only way to reproduce this is to run a server for a long period with many users joining, and even then maybe only with specific games.
It doesn't take a "long period". I can restart one of the servers today and it can potentially be down tomorrow (or even tonight).

As I stated, I'm willing to send whatever information (whether it is some sort of dump, diagnostic, whatever) to see if we can pinpoint the issue.
Anything over a few minutes is a long period where debugging is concerned.

The problem for me here is I need some way of catching this in the debugger, so it needs to happen reliably, for me.
If the problem lies where we think it is (the shell() running to verify servers running), I can likely get it to reproduce fairly quickly while running gdb.
I can definitely reproduce it, although it does take a little time.

I suppose gdb dumps would help?
I'm not sure a gdb dump would help, really. I can't think of anything I could do with it. Catching the problem in the act, on my end, is the important part.
It is commonly known in SS13 circles that shell() is causing this.

Just shell() in a loop until shit breaks.

Even my web client byond oauth setup breaks and starts using 99% cpu for no reason.

That's what the whole process is. A loop that runs shell() to verify a few things.

If you need a test box, Lummox, I have an empty Linux VPS that you're free to use if it'll help figure this out. It's been a problem for quite some time now, so I'll assist in whatever way needed.
Does the shell() issue happen in Windows servers too, or just Linux?
just linux
Looking at http://www.byond.com/forum/?post=2004620, out of curiosity what flags are you passing through in the sigaction whilst registering the SIGCHLD handler?
Apologies for the double post, but:

[Truncated] (late night stuff = bad)

TL;DR: Not seeing anything wrong with shell() handling at the moment, but coaxed Hikato into running strace on one of his servers so we can see if the listening socket is close()'d or not being select()'d anymore, or if something around accept() is being funky.
In response to Audeuro
Audeuro wrote:
Looking at http://www.byond.com/forum/?post=2004620, out of curiosity what flags are you passing through in the sigaction whilst registering the SIGCHLD handler?

No flags.

struct sigaction sa;
sa.sa_handler = SIG_IGN;
sa.sa_flags = 0;

socklib.GetIOSignals(&sa.sa_mask);

...

sa.sa_handler = ChildHandler;
if(sigaction(SIGCHLD,&sa,NULL) == -1) SigActionError(SIGCHLD);
Ran strace on the PID when it was hung up, log only showed the following:

futex(0xf6e40420, FUTEX_WAIT_PRIVATE, 2, NULL


I also have results of an lsof, if that will be relevant to anything.
In response to Hikato
Hikato wrote:
futex(0xf6e40420, FUTEX_WAIT_PRIVATE, 2, NULL

This is interesting, because that would imply DS is getting blocked in its main thread waiting on another thread to execute -- which wouldn't be the case in the fork()/exec() model -- he issues a non-blocking call to waitpid() in case it exits immediately and lets SIGCHLD pick up everything else, also with non-blocking waitpid() calls.

Lummox, what all ended up getting threaded that may be at play here?

Hikato did indicate that at these times his `shell` commands don't seem to be completing, based on world.log << output before and after. The command he's executing does a write to a file using a standard redirect, so he's going to check if that's actually writing the file when it just stops doing useful things.
Threading isn't in play. It's still disabled.
In response to Lummox JR
That's doubly weird, then. futex(2) indicates that it's generally for blocking threads/forks, unless you're using some kind of mutex in there.
Do you have any plans on attempting to pinpoint this issue? I'm going above and BYOND (hehe) offering you whatever you would possibly need, but you've shrugged off the majority of it.

I'm not trying to get rude, but I'd like to know if I'm wasting my time trying to work with you here.
I definitely don't want to shrug any of this off. The shell() thing has been a pain for a while now and it'd be awesome to finally get to the bottom of it. I've just not put this on the front burner just yet because trying to reproduce it in Linux is not a simple task and not something I'm anxious to get in the weeds on just yet--especially if you guys come up with any ideas as to other possibilities.

The futex() call is definitely a strange thing, since it isn't called directly by any code in Dream Daemon. It's not referenced anywhere in the source. (I wonder if maybe it's done internally by waitpid() or some such.)

I randomly ran across this in some Google searching:

https://meenakshi02.wordpress.com/2011/02/02/strace-hanging-at-futex/

To add some info to this case, this is what the code looks like that launches a child process:

if((*pid = fork()) == 0) {
    //avoiding use of system() by using exec directly makes KillProcess() operate on sh rather than on the child DreamSeeker process
    if(execl("/bin/sh","sh","-c",cmd,NULL)==-1) {
        PutsText_SIO("exec failed\n");
        _exit(127);
    }
}
return (*pid != -1);