Closed Bug 372582 Opened 17 years ago Closed 17 years ago

RunShellCommand does not reap timed-out children

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ajschult784, Assigned: ajschult784)

References

Details

Attachments

(2 files, 1 obsolete file)

Attached patch patchSplinter Review
RunShellCommand's timeout alarm interrupts the while loop that waits for the child process to exit and so any timed-out process is never reaped.   Ideally, the while loop would ignore the alarm, but I don't know how to pull that off (or if it's possible), so I added some bits after the process is killed to reap it.

I also switched WCOREDUMP to ($? & 128) since both of my systems don't seem to have this and it's not officially part of perl's POSIX http://perldoc.perl.org/POSIX.html
Attachment #257238 - Flags: review?(rhelmer)
This takes care of:

-- reap timed out children; there's no looping/sleeping logic because I'm not convinced such logic should exist; on systems that are well-behaved (mac/linux; win32 with msys?) kill() with -9 does the right thing, and waitpid() should as well. (Arguably, this patch should waitpid() withOUT WNOHANG, but then we'd have to put in a timeout for that, which... gets scary.)

-- if output is true, and STDOUT is still in buffered mode, printOutputImmediately is somewhat misleading, since output won't be printed immediately. I ran into this bug with the new dumpLogs config setting that was added as part of bug 371350. There are some steps that talk a long time (rm -rf mozilla), and you'd get partial printouts while that command took its sweet time (especially on Crapdows). This fixes that, plus restores the buffering setting to its previous state when we're done.
Attachment #258367 - Flags: review?(rhelmer)
> -- reap timed out children; there's no looping/sleeping logic because 
> I'm not convinced such logic should exist;

Actually, in testing, it would take ~2 seconds to reap a simple process like "sleep 300".  More specifically, a process temporarily in uninterruptible sleep (perhaps waiting on simple disk I/O) will never be reaped with your patch.
(In reply to comment #2)
> > -- reap timed out children; there's no looping/sleeping logic because 
> > I'm not convinced such logic should exist;
> 
> Actually, in testing, it would take ~2 seconds to reap a simple process like
> "sleep 300". 

Platform? Was the system loaded?

If this is the case, I suppose we could layer on the timeouts (call waitpid() without WNOHANG, and set another alarm()), but I'm still not convinced about the "retry waitpid() and sleep() some to make sure it works"-logic.

> More specifically, a process temporarily in uninterruptible sleep
> (perhaps waiting on simple disk I/O) will never be reaped with your patch.

I'm not worried about handling processes that are dead in I/O. waitpid() with a timeout should suffice.

This patch at least does the right thing with respect to non-broken and -pathological systems, AND it includes the STDOUT patch, which is being actively annoying, as opposed to this bug, which is almost-never annoying, so I'll let rhelmer review it, and do a follow-up to do the waitpid() timeout thinger.
> Platform? Was the system loaded?

Linux, FC6.  System was mostly idle,  I think.
There's no guarantee that the signal is handled before kill exits (on any platform I'd expect, certainly *nix, including Macs), which would seem to be a requirement for what you're saying to be true.

And further testing (w/idle system) shows "2 seconds" is wrong.  It did take two passes through the loop, but that's 1 second.  If I remove the sleep entirely (so just looping as fast as possible), the process gets reaped on the 3rd pass.
Same as before, except give waitpid() a little breathing room.
Attachment #258367 - Attachment is obsolete: true
Attachment #258478 - Flags: review?(rhelmer)
Attachment #258367 - Flags: review?(rhelmer)
Attachment #257238 - Flags: review?(rhelmer)
Comment on attachment 258478 [details] [diff] [review]
Reap timed out children + other fixes, Take II

>+            # Processes get 10 seconds to obey.
>+            eval {
>+                local $SIG{'ALRM'} = sub { die("alarm\n") };
>+                alarm($EXEC_REAP_TIMEOUT);
>+                my $waitRv = waitpid($childPid, 0);
>+                alarm(0);
>+                # Don't fill in these values if they're bogus.
>+                if ($waitRv > 0) {
>+                    $exitValue = WEXITSTATUS($?);
>+                    $signalNum = WIFSIGNALED($?) && WTERMSIG($?);
>+                    $dumpedCore = WIFSIGNALED($?) && ($? & 128);
>+                }
>+            };

Hmm, comment says "Processes get 10 seconds to obey", shouldn't that be alarm(10) or am I missing something?
(In reply to comment #6)
> (From update of attachment 258478 [details] [diff] [review])
> >+            # Processes get 10 seconds to obey.
> >+            eval {
> >+                local $SIG{'ALRM'} = sub { die("alarm\n") };
> >+                alarm($EXEC_REAP_TIMEOUT);
> >+                my $waitRv = waitpid($childPid, 0);
> >+                alarm(0);
> >+                # Don't fill in these values if they're bogus.
> >+                if ($waitRv > 0) {
> >+                    $exitValue = WEXITSTATUS($?);
> >+                    $signalNum = WIFSIGNALED($?) && WTERMSIG($?);
> >+                    $dumpedCore = WIFSIGNALED($?) && ($? & 128);
> >+                }
> >+            };
> 
> Hmm, comment says "Processes get 10 seconds to obey", shouldn't that be
> alarm(10) or am I missing something?

First diff chunks says:

+my $EXEC_REAP_TIMEOUT = 10;

I could've said "Processes get $EXEC_REAP_TIMEOUT seconds to obey," but then you'd have to go find the definition that variable... which is annoying.

Then again, remembering to change the comment if you ever change the timeout is annoying too.
Blocks: 374568
Attachment #258478 - Flags: review?(rhelmer) → review+
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: