tinderbox doesn't figure out what signal killed a process correctly

REOPENED
Assigned to

Status

Webtools Graveyard
Tinderbox
REOPENED
11 years ago
4 years ago

People

(Reporter: Mark Mentovai, Assigned: Mark Mentovai)

Tracking

Details

Attachments

(2 attachments)

(Assignee)

Description

11 years ago
We had a couple of orange builds today, failing MozillaAliveTest on the Camino tinderbox.  Here's one:

http://tinderbox.mozilla.org/showlog.cgi?log=Camino/1200371880.1200372972.26899.gz

Running MozillaAliveTest test ...
Timeout = 15 seconds.
Begin: Mon Jan 14 20:55:56 2008
cmd = /builds/tinderbox/CmTrunk/Darwin_9.1.0_Depend/mozilla/../build/dist/Camino.app/Contents/MacOS/Camino -P default
End:   Mon Jan 14 20:56:07 2008
----------- Output from MozillaAliveTest ------------- 
  stdout and stderr unbuffered
----------- End Output from MozillaAliveTest --------- 
Error: Camino: received SIGHUP
Error: Camino: exited with status 16777215
Error: Camino: dumped core.
MozillaAliveTest: test failed

Why did the process get a signal AND exit with an exit status?  Furthermore, who sent it a SIGHUP?  Certainly not me or anyone else, and certainly not tinderbox, either.  What gives?

http://bonsai.mozilla.org/cvsblame.cgi?file=/mozilla/tools/tinderbox/build-seamonkey-util.pl&rev=1.378&mark=1740-1742#1730

The calculation of $signal_num is totally bogus.  It should be ($? & 127) to only look at the low-order 7 bits of $?.  ($? >> 127) is a right-shift by a whopping 127 bits, and is very different.  Whoops.  Our process likely didn't die by SIGHUP, but ($? >> 127) = 1 for other reasons.

Also, the waitpid checks seem to be wrong.  We check for WIFEXITED but not WIFSIGNALED.

If $wait_pid is -1, as it may be in our tinderbox' case (we don't know), we should at least log the error, to make it clear that whatever happens next might be totally bogus.

Finally, we should set $? to 0 before we enter the waitpid loop, to avoid running off with some stale last-status value.
(Assignee)

Comment 1

11 years ago
Created attachment 297251 [details] [diff] [review]
Fix

I'm going to test this on cb-miniosx01.
Attachment #297251 - Flags: review?(nrthomas)
Comment on attachment 297251 [details] [diff] [review]
Fix

Seems ok to me, but seeking additional review from someone with more Perl under their belt.
Attachment #297251 - Flags: review?(nrthomas)
Attachment #297251 - Flags: review?(ccooper)
Attachment #297251 - Flags: review+

Comment 3

11 years ago
Comment on attachment 297251 [details] [diff] [review]
Fix

>+        if ($number <= $#sig_name) {
>+            return $sig_name[$number];
>+        }
>+        else {
>+            return $number;
>+        }

Do we hit instances where the sig number isn't in Config.pm? Should we be?

>         $exit_value = $? >> 8;
>-        $signal_num = $? >> 127;
>-        $dumped_core = $? & 128;
>+        $signal_num = $? & 0x7f;
>+        $dumped_core = $? & 0x80;

Any particular reason to switch to hex here? Bit shifting is opaque enough.
Attachment #297251 - Flags: review?(ccooper) → review+

Comment 4

11 years ago
FWIW, rhelmer pointed me at RunShellCommand in MozBuild that was originally designed to fix this broken piece of tinderbox:

http://mxr.mozilla.org/seamonkey/source/tools/release/MozBuild/Util.pm#49
(Assignee)

Comment 5

11 years ago
(In reply to comment #3)
> Do we hit instances where the sig number isn't in Config.pm? Should we be?

It's not a likely case, but if you're using a Perl built on an older OS release, it might not be aware of newly-added signals.

> >         $exit_value = $? >> 8;
> >-        $signal_num = $? >> 127;
> >-        $dumped_core = $? & 128;
> >+        $signal_num = $? & 0x7f;
> >+        $dumped_core = $? & 0x80;
> 
> Any particular reason to switch to hex here? Bit shifting is opaque enough.

At least in my mind, the bit patterns corresponding to 0x7f and 0x80 are much clearer than the patterns corresponding to 127 and 128.  When used as masks, as they should be and as this patch does, the bit patterns are more important to understanding what the assignments actually do.  The shifts here were never right.
(Assignee)

Comment 6

11 years ago
Checked in on the trunk.
Status: NEW → RESOLVED
Last Resolved: 11 years ago
Resolution: --- → FIXED

Comment 7

11 years ago
> FWIW, rhelmer pointed me at RunShellCommand in MozBuild that was originally
> designed to fix this broken piece of tinderbox:

==> Bug 374568
(Assignee)

Comment 8

11 years ago
Created attachment 299307 [details] [diff] [review]
Additional fix

We still weren't getting proper diagnostics when processes exited with nonzero exit statuses or were killed by signals.  I tracked this down to two problems:

. wait_for_pid was returning "done" instead of the %hash containing the exit
  information.
. print_test_errors checked if the exit status was nonzero before printing
  information about the exit value and signal, but if the process was killed
  by a signal, the exit value as passed back by waitpid in $? would be zero.

I've gotten ride of the "done" thing and have wait_for_pid now always returning the %hash, but if a process exited by timeout, the values in the return %hash are just left at 0.

I've fixed print_test_errors to check the variable it's going to print before printing it, instead of gating everything based on the presence of a nonzero exit value.

I'm currently testing this patch on the three Camino tinderboxen.
Attachment #299307 - Flags: review?
(Assignee)

Updated

11 years ago
Attachment #299307 - Flags: review? → review?(ccooper)
(Assignee)

Updated

11 years ago
Status: RESOLVED → REOPENED
Resolution: FIXED → ---

Comment 9

11 years ago
Comment on attachment 299307 [details] [diff] [review]
Additional fix

Looks good.
Attachment #299307 - Flags: review?(ccooper) → review+
Product: Webtools → Webtools Graveyard
You need to log in before you can comment on or make changes to this bug.