Closed Bug 412507 Opened 17 years ago Closed 6 years ago

tinderbox doesn't figure out what signal killed a process correctly

Categories

(Webtools Graveyard :: Tinderbox, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: mark, Assigned: mark)

Details

Attachments

(2 files)

We had a couple of orange builds today, failing MozillaAliveTest on the Camino tinderbox.  Here's one:

http://tinderbox.mozilla.org/showlog.cgi?log=Camino/1200371880.1200372972.26899.gz

Running MozillaAliveTest test ...
Timeout = 15 seconds.
Begin: Mon Jan 14 20:55:56 2008
cmd = /builds/tinderbox/CmTrunk/Darwin_9.1.0_Depend/mozilla/../build/dist/Camino.app/Contents/MacOS/Camino -P default
End:   Mon Jan 14 20:56:07 2008
----------- Output from MozillaAliveTest ------------- 
  stdout and stderr unbuffered
----------- End Output from MozillaAliveTest --------- 
Error: Camino: received SIGHUP
Error: Camino: exited with status 16777215
Error: Camino: dumped core.
MozillaAliveTest: test failed

Why did the process get a signal AND exit with an exit status?  Furthermore, who sent it a SIGHUP?  Certainly not me or anyone else, and certainly not tinderbox, either.  What gives?

http://bonsai.mozilla.org/cvsblame.cgi?file=/mozilla/tools/tinderbox/build-seamonkey-util.pl&rev=1.378&mark=1740-1742#1730

The calculation of $signal_num is totally bogus.  It should be ($? & 127) to only look at the low-order 7 bits of $?.  ($? >> 127) is a right-shift by a whopping 127 bits, and is very different.  Whoops.  Our process likely didn't die by SIGHUP, but ($? >> 127) = 1 for other reasons.

Also, the waitpid checks seem to be wrong.  We check for WIFEXITED but not WIFSIGNALED.

If $wait_pid is -1, as it may be in our tinderbox' case (we don't know), we should at least log the error, to make it clear that whatever happens next might be totally bogus.

Finally, we should set $? to 0 before we enter the waitpid loop, to avoid running off with some stale last-status value.
Attached patch FixSplinter Review
I'm going to test this on cb-miniosx01.
Attachment #297251 - Flags: review?(nrthomas)
Comment on attachment 297251 [details] [diff] [review]
Fix

Seems ok to me, but seeking additional review from someone with more Perl under their belt.
Attachment #297251 - Flags: review?(nrthomas)
Attachment #297251 - Flags: review?(ccooper)
Attachment #297251 - Flags: review+
Comment on attachment 297251 [details] [diff] [review]
Fix

>+        if ($number <= $#sig_name) {
>+            return $sig_name[$number];
>+        }
>+        else {
>+            return $number;
>+        }

Do we hit instances where the sig number isn't in Config.pm? Should we be?

>         $exit_value = $? >> 8;
>-        $signal_num = $? >> 127;
>-        $dumped_core = $? & 128;
>+        $signal_num = $? & 0x7f;
>+        $dumped_core = $? & 0x80;

Any particular reason to switch to hex here? Bit shifting is opaque enough.
Attachment #297251 - Flags: review?(ccooper) → review+
FWIW, rhelmer pointed me at RunShellCommand in MozBuild that was originally designed to fix this broken piece of tinderbox:

http://mxr.mozilla.org/seamonkey/source/tools/release/MozBuild/Util.pm#49
(In reply to comment #3)
> Do we hit instances where the sig number isn't in Config.pm? Should we be?

It's not a likely case, but if you're using a Perl built on an older OS release, it might not be aware of newly-added signals.

> >         $exit_value = $? >> 8;
> >-        $signal_num = $? >> 127;
> >-        $dumped_core = $? & 128;
> >+        $signal_num = $? & 0x7f;
> >+        $dumped_core = $? & 0x80;
> 
> Any particular reason to switch to hex here? Bit shifting is opaque enough.

At least in my mind, the bit patterns corresponding to 0x7f and 0x80 are much clearer than the patterns corresponding to 127 and 128.  When used as masks, as they should be and as this patch does, the bit patterns are more important to understanding what the assignments actually do.  The shifts here were never right.
Checked in on the trunk.
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
> FWIW, rhelmer pointed me at RunShellCommand in MozBuild that was originally
> designed to fix this broken piece of tinderbox:

==> Bug 374568
Attached patch Additional fixSplinter Review
We still weren't getting proper diagnostics when processes exited with nonzero exit statuses or were killed by signals.  I tracked this down to two problems:

. wait_for_pid was returning "done" instead of the %hash containing the exit
  information.
. print_test_errors checked if the exit status was nonzero before printing
  information about the exit value and signal, but if the process was killed
  by a signal, the exit value as passed back by waitpid in $? would be zero.

I've gotten ride of the "done" thing and have wait_for_pid now always returning the %hash, but if a process exited by timeout, the values in the return %hash are just left at 0.

I've fixed print_test_errors to check the variable it's going to print before printing it, instead of gating everything based on the presence of a nonzero exit value.

I'm currently testing this patch on the three Camino tinderboxen.
Attachment #299307 - Flags: review?
Attachment #299307 - Flags: review? → review?(ccooper)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment on attachment 299307 [details] [diff] [review]
Additional fix

Looks good.
Attachment #299307 - Flags: review?(ccooper) → review+
Product: Webtools → Webtools Graveyard
Tinderbox isn't maintained anymore. Closing.
Status: REOPENED → RESOLVED
Closed: 17 years ago6 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: