Dream Daemon Hang Up

Feb 2 2017, 7:19 pm In response to Lummox JR
Audeuro	Lummox JR wrote:

Hrm. I wonder if the problem is more that the child handler is interrupting the regular timer event or vice-versa.

As I said before, the child handler could theoretically interrupt anything, so you have to be careful not to clobber state that the rest of your code base may be using or making assumptions about.

Feb 7 2017, 5:02 pm
Audeuro	I believe at this point I will bow out of this discussion. I'm still not entirely sure my point has gotten across: the only safe thing to do in your signal handler is to set a flag saying a child has finished, then let your normal timing code handle that flag. Either way, I'm not sensing any progress on this given the sparse communication. Good luck, Hikato.

Feb 7 2017, 6:49 pm
Lummox JR	I think I know what I need to do; I just need to take some time to make the appropriate changes properly.

Mar 5 2017, 3:14 pm
Hikato	Any update?

Mar 18 2017, 6:47 pm

In response to Hikato

Pixel Realms

Hikato wrote:

Do you have an ETA to attempt to get a fix out? This is a year(s) old bug, and pretty serious. I'd hoped it'd be a little higher in priority than some graphical things.

Yeah, this has been happening almost daily in Eternia, with a reboot fixing it. It more or less shuts down the game since nobody can log in, which is a nuisance if an admin isn't around to reboot.

It's the sort of thing that really needs to be addressed because it's been actively damaging a good chunk of (if not all?) BYOND games for months now.

Mar 18 2017, 7:52 pm

In response to Pixel Realms

Audeuro

Pixel Realms wrote:

It's the sort of thing that really needs to be addressed because it's been actively damaging a good chunk of (if not all?) BYOND games for months now.

I'd estimate that "months" to be "many years," actually. It's just that the problem depends on a lot of factors: processor speed, process scheduling (priorities change, etc.), and timing within DD's VM all have to be just right.

Because it's so involved, it's hard to reproduce consistently, and it can manifest in many, many ways because of the nature of the beast. You can increase the odds of problems arising by scattering shell() calls at different timings so that SIGCHLD fires more frequently, but that's about it.

This is problematic, though, since Lummox really can't know that it's actually fixed except by trusting that what I'm telling him is right (he shouldn't be doing so much or messing with external state within his signal handler) and by seeing whether people complain after the fact. =(

It makes me sad, too, that it's not been addressed, but we have now received a response.

Lummox JR wrote:

I think the best way to solve this is likely to be with a global flag that can be checked at some point in the background proc handling loop.

My interpretation of this was that he's going to make SIGCHLD's handler only set a global flag. The background proc handling loop then checks that flag and does all of the associated waitpid(2) calls and cleanup. If this is correct, then that should alleviate these problems.

Mar 18 2017, 9:45 pm

Audeuro

@Lummox: Actually, now that I think about it, you call waitpid(2) frequently in your background proc loop, right? In that case, you could just rip out the SIGCHLD handler altogether and let them get naturally handled.

SIGCHLD is only really useful if you can immediately act on the child dying in a safe manner. In your case, you can't and you really have better things to be doing with the time currently spent in SIGCHLD anyways. You might as well just reap the children as you check on them anyways.

If it helps, we could whip up a simple test case for you to at least observe that it hasn't broken anything (i.e. left zombie processes around for more than a couple of seconds) if you want to just stop handling SIGCHLD.
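
A minimal sketch of the handler-free alternative suggested in this post, assuming the loop already tracks each pending child by pid. The `shell_job` struct and `poll_shell_job()` are hypothetical names, not anything from BYOND. With SIGCHLD left at its default disposition, an exited child simply stays a zombie until the next non-blocking `waitpid()` picks it up.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical record for one pending shell() child. */
struct shell_job {
    pid_t pid;
    int finished;    /* set to 1 once the child has been reaped */
    int exit_status; /* raw status word filled in by waitpid() */
};

/* Poll one job without blocking.
 * Returns 1 if the child has finished, 0 if it is still running,
 * and -1 if waitpid() reported an error. */
int poll_shell_job(struct shell_job *job) {
    int status;
    pid_t r;

    if (job->finished)
        return 1; /* already reaped; never wait on the same pid twice */

    r = waitpid(job->pid, &status, WNOHANG);
    if (r == job->pid) {
        job->finished = 1;
        job->exit_status = status;
        return 1;
    }
    if (r == 0)
        return 0; /* child exists but has not exited yet */
    return -1;
}
```

Called once per tick for each pending job, this keeps zombies around for at most one polling interval, which is the property the proposed test case would check.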

Mar 26 2017, 3:43 pm
Lummox JR	This is just a quick note that I haven't forgotten this issue. A prospective fix will be in the next release.

Apr 7 2017, 12:34 pm
Lummox JR	511.1379 has been released, so please retest with that.

Apr 21 2017, 9:01 am
Hikato	Shell is completely broken in the latest 511 stable. It will run one command, which is successful, but never gets past it.

Apr 21 2017, 9:41 am (Edited on Apr 21 2017, 10:06 am)

In response to Hikato

Lummox JR

Odd. I'm not seeing anything in the code that could explain that. There is a call to RemoveSlaveRec() that existed in the old CommandFinished() call that doesn't get called anymore, so that's something I can easily add to the routine that checks for finished shell() calls, but I don't see any reason why shell() shouldn't complete normally.

There's a routine already in place called TickBgProcs() that does polling on all shell processes. The change I added to 511.1379 was to simply set a var that tells the caller to call TickBgProcs() a second time if a shell() got finished while another proc was waiting.

[edit]
I ran a test and it appears this is a Linux-only issue. I don't know why, but IsCommandFinished_SIO() is apparently not working correctly.

Apr 21 2017, 9:48 am
Hikato	In my case, I see my debug log of the command being run, but nothing else happens afterward. The first shell() is to start a server, and the server does start, but it doesn't continue past that.

Apr 21 2017, 9:56 am

In response to Lummox JR

Audeuro

Lummox JR wrote:

There's a routine already in place called TickBgProcs() that does polling on all shell processes. The change I added to 511.1379 was to simply set a var that tells the caller to call TickBgProcs() a second time if a shell() got finished while another proc was waiting.

So what does your SIGCHLD handler look like now? (assuming you kept it -- not strictly necessary, as previously mentioned, if you're calling waitpid() periodically)

Apr 21 2017, 10:08 am
Lummox JR	The SIGCHLD handler on the frontend is the same; it calls a backend CommandFinished() function. But the backend function only sets a var, nothing more. Oddly, what I'm seeing in my tests is that all proc ticking seems to stop after this, which makes no kind of sense at all. Nothing else has changed on the frontend to explain that, and the backend is literally doing less work.

Apr 21 2017, 10:14 am

In response to Lummox JR

Audeuro

Lummox JR wrote:

The SIGCHLD handler on the frontend is the same; it calls a backend CommandFinished() function. But the backend function only sets a var, nothing more.

So it's calling CommandFinished() within a waitpid() loop, which then signals later bits in normal execution to again call the function that does a waitpid()? The second waitpid() will never meet your expectations if the process has already been reaped by the waitpid() in the SIGCHLD handler.
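
The double-reap problem described here is easy to reproduce in isolation. The sketch below is a hypothetical standalone demo, not the DD code: `second_wait_errno()` waits on the same child twice and reports what the second call sees. The first wait stands in for the one inside the SIGCHLD handler, the second for the later poll from the normal loop.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks a child that exits at once, then waits for it twice.
 * Returns the errno left by the second waitpid(), 0 if the second
 * call unexpectedly did not fail, or -1 if the setup itself failed. */
int second_wait_errno(void) {
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(0); /* child: exit immediately */

    /* First wait: reaps the child, as the handler's waitpid() would. */
    if (waitpid(pid, &status, 0) != pid)
        return -1;

    /* Second wait: the child is already gone, so there is nothing to report. */
    errno = 0;
    if (waitpid(pid, &status, WNOHANG) == -1)
        return errno;
    return 0;
}
```

On a POSIX system the second call fails with `ECHILD`, so code that expects it to return the pid and an exit status never sees the child finish.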

Apr 21 2017, 12:16 pm

Lummox JR

Yep, the problem was twofold. First, I had to basically gut the child handler (I just told it to set the var without ever calling waitpid() at all). But the main reason the procs stopped ticking was that no one was logged in, and the code that keeps the ticker alive while any shell procs are waiting was disabled (enabled only on Windows), so the server was effectively going into sleep mode. Fixing those two things seems to have done the trick.

Apr 21 2017, 12:17 pm
Hikato	So there is another test fix ready?

Apr 21 2017, 12:40 pm
Lummox JR	The new build is out now. I confirmed that shell() completes again, so now the only open question is whether it can cause a hang at all. I suspect that will no longer happen.

Apr 21 2017, 1:10 pm In response to Lummox JR
Audeuro	+1 for gutting the child handler, +10 for making it more useful so that you don't have to waitpid() unnecessarily. =)

Apr 21 2017, 1:17 pm
Hikato	Well, everything works again. I'll run everything and see if the original issue is resolved as well.