In response to Lummox JR
Lummox JR wrote:
Hrm. I wonder if the problem is more that the child handler is interrupting the regular timer event or vice-versa.

As I said before, the child handler could theoretically interrupt *anything*, so you have to be careful not to clobber state you may be otherwise using or making assumptions about *everywhere* in your code base.
I believe at this point I will bow out of this discussion. I'm still not entirely sure my point has gotten across: the only safe thing to do in your signal handler is to set a flag saying a child has finished, then let your normal timing stuff handle that flag. Either way, I'm not sensing any progress on this given the sparse communication.

Good luck, Hikato.
I think I know what I need to do; I just need to take some time to make the appropriate changes properly.
Any update?
In response to Hikato
Hikato wrote:
Do you have an ETA to attempt to get a fix out? This is a year(s) old bug, and pretty serious. I'd hoped it'd be a little higher in priority than some graphical things.

Yeah, this has been happening almost daily in Eternia, with a reboot fixing it. It more or less shuts down the game since nobody can log in, which is a nuisance if an admin isn't around to reboot.

It's the sort of thing that really needs to be addressed because it's been actively damaging a good chunk of (if not all?) BYOND games for months now.
In response to Pixel Realms
Pixel Realms wrote:
It's the sort of thing that really needs to be addressed because it's been actively damaging a good chunk of (if not all?) BYOND games for months now.

I'd estimate that "months" to be "many years," actually; it's just that it's a problem with a lot of factors. Processor speed, process scheduling (priorities change, etc.), and timing within DD's VM all have to be just right.

Because it's so involved, it's hard to reproduce consistently, and it can manifest in many, many ways because of the nature of the beast. You can increase the odds of problems arising by firing off shell() calls scattershot at different timings so that SIGCHLD is raised more frequently, but that's about it.

This is problematic, though, since Lummox really can't know that it's actually fixed except by trusting that what I'm telling him is right (he shouldn't be doing so much or messing with external state within his signal handler) and whether or not people complain after the fact. =(

It makes me sad, too, that it's not been addressed, but we have now received a response.

LummoxJR wrote:
I think the best way to solve this is likely to be with a global flag that can be checked at some point in the background proc handling loop.

My interpretation of this was that he's going to make SIGCHLD's handler only set a global flag, and that the background proc handling loop will check that flag and then do all of the waitpid(2) calls and associated cleanup. If this is correct, then that should alleviate these problems.
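
For anyone following along, here is a minimal sketch of that flag-only pattern. It is written in Python purely for brevity (the os and signal calls map onto the POSIX ones) and is not BYOND's code. The names on_sigchld, reap_finished, and run_children are hypothetical, and run_children only stands in for the background proc handling loop. Note that Python already defers its handlers to a safe point; in C the handler can interrupt at any instruction, which is why it should do nothing beyond setting a volatile sig_atomic_t flag.

```python
import os
import signal
import time

child_exited = False  # the only state the signal handler ever touches


def on_sigchld(signum, frame):
    """Record that some child exited; no reaping or cleanup happens here."""
    global child_exited
    child_exited = True


def reap_finished(pending):
    """Reap every finished child without blocking; return {pid: exit_code}."""
    finished = {}
    for pid in list(pending):
        done, status = os.waitpid(pid, os.WNOHANG)
        if done == pid:
            pending.discard(pid)
            # Children in this sketch exit normally; a killed child would
            # need os.WIFSIGNALED / os.WTERMSIG instead.
            finished[pid] = os.WEXITSTATUS(status)
    return finished


def run_children(exit_codes, tick=0.01, timeout=5.0):
    """Hypothetical stand-in for the background loop: fork, then poll the flag."""
    global child_exited
    signal.signal(signal.SIGCHLD, on_sigchld)
    pending = set()
    for code in exit_codes:
        pid = os.fork()
        if pid == 0:
            os._exit(code)  # child: exit immediately with the given code
        pending.add(pid)
    results = {}
    deadline = time.monotonic() + timeout
    while pending and time.monotonic() < deadline:
        if child_exited:
            # Clear the flag before reaping so a signal that arrives
            # during the reap is picked up on the next pass.
            child_exited = False
            results.update(reap_finished(pending))
        time.sleep(tick)
    signal.signal(signal.SIGCHLD, signal.SIG_DFL)
    return sorted(results.values())
```

Calling run_children([0, 3, 7]) should hand back the three exit codes once the loop has reaped everything. The point is that all of the waitpid work happens in normal execution, never inside the handler.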
@Lummox: Actually, now that I think about it, you call waitpid(2) frequently in your background proc loop, right? In that case, you could just rip out the SIGCHLD handler altogether and let them get naturally handled.

SIGCHLD is only really useful if you can immediately act on the child dying in a safe manner. In your case, you can't and you really have better things to be doing with the time currently spent in SIGCHLD anyways. You might as well just reap the children as you check on them anyways.

If it helps, and you want to just stop handling SIGCHLD, we could whip up a simple test case so you can at least observe that it hasn't broken anything (i.e., left zombie processes around for more than a couple of seconds).
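
Something along these lines is roughly what I have in mind for that test case. It is a sketch in Python rather than anything resembling DD's code: no SIGCHLD handler is installed at all, and a periodic non-blocking waitpid does the reaping. The names spawn, poll_once, and run_check are made up for illustration, and the two-second limit is just my "couple of seconds" from above.

```python
import os
import time


def spawn(n):
    """Fork n children that exit right away; return their pids."""
    pids = []
    for _ in range(n):
        pid = os.fork()
        if pid == 0:
            os._exit(0)  # child: nothing to do, exit immediately
        pids.append(pid)
    return pids


def poll_once(pending):
    """One pass of the periodic check: reap whatever finished, never block."""
    for pid in list(pending):
        done, _status = os.waitpid(pid, os.WNOHANG)
        if done == pid:
            pending.remove(pid)


def run_check(n=5, tick=0.01, limit=2.0):
    """Return seconds until every child was reaped, or None if limit passed."""
    pending = spawn(n)
    start = time.monotonic()
    while pending:
        if time.monotonic() - start > limit:
            return None  # something stayed unreaped for too long
        poll_once(pending)
        time.sleep(tick)
    return time.monotonic() - start
```

If run_check() returns a small number rather than None, every child was reaped by plain polling and no zombie outlived the limit.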
This is just a quick note that I haven't forgotten this issue. A prospective fix will be in the next release.
511.1379 has been released, so please retest with that.
Shell is completely broken in the latest 511 stable.

It will run one command, which is successful, but never gets past it.
In response to Hikato
Odd. I'm not seeing anything in the code that could explain that. There is a call to RemoveSlaveRec() that existed in the old CommandFinished() call that doesn't get called anymore, so that's something I can easily add to the routine that checks for finished shell() calls, but I don't see any reason why shell() shouldn't complete normally.

There's a routine already in place called TickBgProcs() that does polling on all shell processes. The change I added to 511.1379 was to simply set a var that tells the caller to call TickBgProcs() a second time if a shell() got finished while another proc was waiting.

[edit]
I ran a test and it appears this is a Linux-only issue. I don't know why, but IsCommandFinished_SIO() is apparently not working correctly.
In my case, I see my debug log of the command being run, but nothing else happens afterward. The first shell() is to start a server, and the server does start, but it doesn't continue past that.
In response to Lummox JR
Lummox JR wrote:
There's a routine already in place called TickBgProcs() that does polling on all shell processes. The change I added to 511.1379 was to simply set a var that tells the caller to call TickBgProcs() a second time if a shell() got finished while another proc was waiting.

So what does your SIGCHLD handler look like now? (assuming you kept it -- not strictly necessary, as previously mentioned, if you're calling waitpid() periodically)

The SIGCHLD handler on the frontend is the same; it calls a backend CommandFinished() function. But the backend function only sets a var, nothing more.

Oddly, what I'm seeing in my tests is that all proc ticking seems to stop after this, which makes no kind of sense at all. Nothing else has changed on the frontend to explain that, and the backend is literally doing less work.
In response to Lummox JR
Lummox JR wrote:
The SIGCHLD handler on the frontend is the same; it calls a backend CommandFinished() function. But the backend function only sets a var, nothing more.

So it's calling CommandFinished() within a waitpid loop, which then signals later bits in normal execution to again call the function that does a waitpid()? The second waitpid will always fail your expectations if the process has already been reaped by the SIGCHLD waitpid().
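
To illustrate that last point outside of DD, here is a small Python sketch (reap_twice is a hypothetical name). It reaps a child once, then tries again the way a later polling pass would after a handler had already called waitpid().

```python
import errno
import os


def reap_twice():
    """Fork a child, reap it, then try to reap the same pid again."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately
    # First wait blocks until the child exits and reaps it.
    first_pid, _status = os.waitpid(pid, 0)
    try:
        # Second wait on the same pid: the kernel no longer knows this child.
        os.waitpid(pid, os.WNOHANG)
    except ChildProcessError as exc:
        return first_pid == pid, exc.errno == errno.ECHILD
    return first_pid == pid, False
```

The second call fails with ECHILD, so any code that waits for waitpid() to report that pid as finished will never see it happen.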
Yep, the problem was twofold. First, I had to basically gut the child handler: I told it to just set the var without ever calling waitpid() at all. But the main reason the procs stopped ticking was that no one was logged in, and the code that says to keep the ticker alive while any shell procs are waiting was enabled only on Windows, so the server was effectively going into sleep mode. Fixing both of those seems to have done the trick.
So there is another test fix ready?
The new build is out now. I confirmed that shell() completes again, so now the only open question is whether it can cause a hang at all. I suspect that will no longer happen.
In response to Lummox JR
+1 for gutting the child handler, +10 for making it more useful so that you don't have to waitpid() unnecessarily. =)
Well, everything works again. I'll run everything and see if the original issue is resolved as well.