Closed Bug 329573 Opened 18 years ago Closed 18 years ago

multi-tinderbox.pl doesn't actually kill sub-builds when it's killed

Categories

(Webtools Graveyard :: Tinderbox, defect)

x86
Linux
defect
Not set
minor

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: preed, Assigned: preed)

Details

Attachments

(4 files, 3 obsolete files)

multi-tinderbox.pl captures its PID in multi.pid, but when it's killed, it kills itself, not the build-seamonkey.pl subprocesses that get run.

Thus, those builds keep going, requiring all of our maintenance documentation to say "So, run ./tinderbox multi stop && killall cvs && killall gmake && killall ..."

This would mostly be fixed by build-seamonkey.pl being killed.
A fix for this plus a couple of features:

The fix: have the shell script send an explicit TERM signal; this seems to be the default for /bin/kill, but make sure of it. Then, make a TERM handler that uses the saved PID from a fork()/exec() of build-seamonkey to explicitely send a TERM to to build-seamonkey.pl so it gets properly propogated down to make (eventually).

The features:

1. Add a HUP signal handler to re-read the tinder-config.pl when the current build finishes.

2. Add an INT signal handler, to replace the "fall-over" file; this is cleaner than dumping a file in a directory, and having it renamed with the PID.
Assignee: mcafee → preed
Status: NEW → ASSIGNED
Attachment #214280 - Flags: review?(ccooper)
(In reply to comment #2)
> re your point 1, note also bug 245574

Interesting.

I actually disagree with the premise of that bug, as the user should be able to explicitely control when the config file is re-read... hence, why I implemented it this way, here.

But thanks for pointing out; maybe we can resolve it a dup if this goes in.
yeah, this method would suffice for me.
Comment on attachment 214280 [details] [diff] [review]
Fix + a couple of features

Add exit to HandleSigTerm() and we're good to go.
Attachment #214280 - Flags: review?(ccooper) → review+
Attachment #215074 - Attachment is obsolete: true
Attachment #215346 - Flags: review?(mark)
Comment on attachment 215346 [details] [diff] [review]
Run "cvs update" after each set of multi-tinderboxn runs are complete

TBOX_CLIENT_CVSROOT sounds to me like it should be set to :ext:cltbld@cvs.mozilla.org:/cvsroot, but that's not what it is.  TBOX_SCRIPT_UPDATE_DIR?  (This also makes it clearer that it controls the update feature.)

Why is UpdateTinderboxScripts() in the OUTER loop instead of once per $treeentry?

I'm a little concerned about cvs hanging (nearly) forever and taking a tinderbox effectively out of service when the cvs server goes out to lunch.  I think this "cvs update" should come with a timeout.  build-seamonkey-util.pl uses a timeout for this reason when it runs "make -f client.mk checkout" (but not when it checks out client.mk - we should fix that too.)

I ordinarily hate pushing work off to the shell, but this is one of the few cases where it's really handy, if only because it means we don't need to worry about changing back to the right directory:

system('cd '.$ENV{'TBOX_SCRIPT_UPDATE_DIR'}.' && '.$TBOX_CLIENT_CVSUP_CMD)

While we're at it, we can eliminate the rest of the chdirs in the multi-tinderbox parent process by moving the chdir into the child.  It really doesn't belong in the parent, and it's kind of contrary to the multi principle for the parent to die if a directory is missing.
Bug 330763 for timeouts when checking out client.mk.
(In reply to comment #8)

> Why is UpdateTinderboxScripts() in the OUTER loop instead of once per
> $treeentry?

Because I want the entire loop of builds to use the same set of Tinderbox scripts.

Granted, it likely doesn't matter, but... maybe it does (on builds where there's like... a branch build and then a Moz1.8.0 l10n build).

As for the rest, I'll fix it.
Fixes:

-- Change the environment variable to something clearer
-- Move the chdir into the system() call.
-- Add a timeout for CVS wonkiness (currently 5 minutes).
-- Move the chdir()s out of the loop, and into the child.
Attachment #215346 - Attachment is obsolete: true
Attachment #215361 - Flags: review?(mark)
Attachment #215346 - Flags: review?(mark)
Comment on attachment 215361 [details] [diff] [review]
Run "cvs update," take II

> Because I want the entire loop of builds to use the same set of Tinderbox
> scripts.

That makes sense.  The only problem is that if you're anything like me, you never remember which build is "top" in multi-config, so you don't know when to expect a freshly-committed version to take effect.  But it's your call here, I was just curious about the placement.

>Index: multi-tinderbox.pl
>+sub UpdateTinderboxScripts() {
>+    if (exists($ENV{'TBOX_CLIENT_CVS_DIR'})) {
>+        my $oldAlrmHandler = $SIG{'ALRM'};
>+        $SIG{'ALRM'} = \&HandleSigAlrm;

Cleaner to move this inside the eval and call it local:

>+        print STDERR "Updating tinderbox scripts in $ENV{'TBOX_CLIENT_CVS_DIR'}\n";
>+        eval {

local $SIG{'ALRM'} = \&HandleSigAlrm;

>+            alarm($TBOX_CLIENT_CVS_TIMEOUT);
[...]
>+        if ($@) {
>+            print STDERR 'CVS update of client tinderbox scripts ' . 
>+              ($@ eq 'timeout' ? "timed out" : "failed $@") . "\n";

It would be great if we could kill the hung cvs here too, but that would require doing our own fork/exec, wrapping the alarms around a waitpid, and bringing in something like kill_process from build-seamonkey-util to forcibly kill it.  As it stands here, timed-out children created by system() will become unreaped zombies.  Dammit, I wish I had thought this out more fully ahead of time.  Do you think I'm being too paranoid?

On the other hand, reading through this patch here gave me some inspiration as to why it seems so hard to kill cvs processes - bug 330779 for more on that.
Attachment #215361 - Flags: review?(mark) → review+
(In reply to comment #12)

> That makes sense.  The only problem is that if you're anything like me, you
> never remember which build is "top" in multi-config, so you don't know when to
> expect a freshly-committed version to take effect.  But it's your call here, I
> was just curious about the placement.

Good point; I figure if you make a change like this, and want stuff pulled immediately, you can send it a SIGHUP, which should cause it re-read the configuration and (in the process) re-|cvs up| the scripts.

> Cleaner to move this inside the eval and call it local:

Neat suggestion; I don't use local very often, but this is perfect. Fixed.

> As it stands here, timed-out children created by system() will become
> unreaped zombies.  Dammit, I wish I had thought this out more fully ahead of
> time.  Do you think I'm being too paranoid?

Yes and no. :-)

I like a clean system too, but Tinderbox has far more annoying issues with it that I'd rather see fixed before this.

An interesting tidbit: cvs blocks a number of signals when it's talking to the network, so the 1st patch in this bug will *work*, but if it's doing a CVS checkout, and you kill tinderbox, you'll have the cvs processes lying around for... oh... a couple of minutes, until they unblock the signal, and die.

It mostly works, but it's not an immediate death.
(In reply to comment #12)

> Cleaner to move this inside the eval and call it local:
> 
> >+        print STDERR "Updating tinderbox scripts in $ENV{'TBOX_CLIENT_CVS_DIR'}\n";
> >+        eval {
> 
> local $SIG{'ALRM'} = \&HandleSigAlrm;

So... I thought about this some more, and decided to remove this altogether and out to here:

 $SIG{'INT'} = \&HandleSigInt;
+$SIG{'ALRM'} = \&HandleSigAlrm;
 
I used to go to the trouble of saving/restoring the signal handler for ALRM when I was working on larger systems where I couldn't guarantee that no one else was using ALRM for something.

Having said that, I've never seen ALRM used in perl for anything *but* an eval/alarm() handler to handle the timeout, and the use of alarm within multi-tinderbox is... pretty obvious.

I went ahead and checked it in like this, but LMK if you want me to look at doing something else instead.
> you can send it a SIGHUP, which should cause it re-read the
> configuration and (in the process) re-|cvs up| the scripts.

Yeah, if you're on the tinderbox - but then you could just do the update by hand.  At least until you've got a nice tinderbox-kicking front end sitting behind a password on the web somewhere.

> An interesting tidbit: cvs blocks a number of signals when it's talking to
> the network, so the 1st patch in this bug will *work*, but if it's doing a
> CVS checkout, and you kill tinderbox, you'll have the cvs processes lying
> around for... oh... a couple of minutes, until they unblock the signal, and
> die.

That's weird, although it's surely cvs client-dependent and maybe even system-dependent.  I've never had trouble killing cvs with ^C, which gives it a SIGINT, and at least the 1.11.21 code blocks SIGINT much more than it blocks SIGHUP or SIGTERM.

I'd be more inclined to blame this problem on tinderbox (not multi-tinderbox) not having any signal handlers ready to kill children when build-seamonkey takes a signal - the same reason you need to ps|grep make to make sure it's really all dead once you've asked it to stop.

P.S. while investigating this, I found that you need to zero out $CURRENT_BUILD_PID in Run() after waitpid() returns.  Also, your fork() logic assumes that fork() returns -1 when it fails, but like most other syscalls in perl, it actually returns undef.

> I used to go to the trouble of saving/restoring the signal handler for ALRM
> when I was working on larger systems where I couldn't guarantee that no one
> else was using ALRM for something.

I still don't think that we have this guarantee portably across all systems.  It's possible that something outside of the script (in a syscall) is installing its own SIGALRM handler: are we sure it'll put Perl's handler back when it's done?  Or does Perl waste its own time trying to put its handlers back?  I'm thinking about systems where sleep(n) is implemented by installing a SIGALRM handler and calling alarm(n);pause();.  Since multi-tinderbox calls sleep, this could bite us, unless Perl guarantees that it will put its own ALRM handler back.
(In reply to comment #15)

> Yeah, if you're on the tinderbox - but then you could just do the update by
> hand.  At least until you've got a nice tinderbox-kicking front end sitting
> behind a password on the web somewhere.

I'm OK with that.

> That's weird, although it's surely cvs client-dependent and maybe even
> system-dependent.  I've never had trouble killing cvs with ^C, which gives it a
> SIGINT, and at least the 1.11.21 code blocks SIGINT much more than it blocks
> SIGHUP or SIGTERM.
> 
> I'd be more inclined to blame this problem on tinderbox (not multi-tinderbox)
> not having any signal handlers ready to kill children when build-seamonkey
> takes a signal - the same reason you need to ps|grep make to make sure it's
> really all dead once you've asked it to stop.

*Actually*, I think it's ssh; cvs uses ssh as transport and I think ssh is blocking the signal(s).

> P.S. while investigating this, I found that you need to zero out
> $CURRENT_BUILD_PID in Run() after waitpid() returns.  Also, your fork() logic
> assumes that fork() returns -1 when it fails, but like most other syscalls in
> perl, it actually returns undef.

I <3 Perl.

> I still don't think that we have this guarantee portably across all systems.

Good point.

So I did some Googling, and the only advice they give is that sleep() and alarm() calls shouldn't be *mixed*, but they don't really talk about the guts of the perl implementation.

The examples in the perlipc man page, though, use a local $SIG{'ALRM'} within the eval, as you originally suggested. Good catch.

So I'll hack up a happy-patch to address these two issues.
Corrections that Mark pointed out: handle fork() badness correctly (undef, not -1), reset CURRENT_BUILD_PID, and save/store the SIGALRM handlers.
Attachment #215683 - Flags: review?(mark)
The last part of this bug (which has kinda feature creeped): add a --configdir option to build-seamonkey.pl which will have "cvs up" run in it before starting a build.

This is so things like tinder-config.pl and mozconfig can be versioned in CVS.

I actually went back and forth about whether to put this in multi-config.pl or build-seamonkey.pl, and eventually decided that build-seamonkey.pl was a better choice, since if I did it the other way, I'd have to add another field to multi-config.pl, and multi-tinderbox.pl would have to have more visibility than it needs into how stuff gets built.
Attachment #215684 - Flags: review?(mark)
Comment on attachment 215683 [details] [diff] [review]
Restore ALRM handlers/handle fork() badness correctly

Two simple changes, OK to make on checkin:

>Index: multi-tinderbox.pl
>+            if (0 == $buildPid) {
>                 chdir($treeentry->{tree}) or
>                  die "Tree $treeentry->{tree} does not exist";
>                 exec("./build-seamonkey.pl --once $treeentry->{args}");

die("exec(): $!");

>+            } elsif ($buildPid > 0) {
>+                $CURRENT_BUILD_PID = $buildPid;
>+                my $reapedPid = waitpid($buildPid, 0);
>+                $CURRENT_BUILD_PID = 0;
>+                if ($reapedPid != $buildPid) {
>+                    print STDERR "PID $buildPid died too quickly; " .
>+                     "status was: " . ($? >> 8);

Since the rest of the script uses warn and die, this should warn.  (Even though I think warn and die are ugly.)

Also, don't you find this message inaccurate?  This error condition doesn't really have anything to do with children dying too quickly, unless $SIG{'CHLD'}='IGNORE', which it isn't.  And $? is useless if waitpid failed.
Attachment #215683 - Flags: review?(mark) → review+
Comment on attachment 215684 [details] [diff] [review]
Add a --configdir option to run |cvs up| for config scripts

>Index: build-seamonkey-util.pl

>+        my $cwd = get_system_cwd();
>+
>+        chdir($args->{'configdir'}) or 
>+         die "Couldn't chdir() into $args->{'configdir'}; something's fishy here";
>+
>+        my $status = run_shell_command_with_timeout('cvs update', $co_default_timeout);

Ugh.  I wish we had a run_shell_command_with_timeout that took another argument, the directory to chdir to.  But run_shell_command_with_timeout is broken enough as it is, so this is fine for now.

>+        if ($status->{'exit_value'} != 0) {
>+           print "cvs update in $args->{'configdir'} failed";
>+           exit 1;

warn, die, warn, die...print?  (And to STDOUT?)

At first, I thought all of these hard errors were severe, but I guess they're OK in that they'll draw attention to broken CVS servers, and a broken CVS server would cause a build failure anyway.  But only if the tinderbox configs are checked in at the standard cvsroot.  If they're somewhere else, this will introduce a hard dependency on some other cvs server, which might be OK, but it's worth thinking about.

>+        }
>+        chdir($cwd) or die "Couldn't return to $cwd";
>+    }
>+}
> 
> sub Build {
>     #my () = @_;

>+            --configdir ConfigDirectory

I don't like this name.  It makes it sound like it's a directory where tinderbox will look for its configs to load, and doesn't say anything about cvs update.  Maybe it's a good idea to make tinderbox try to load tinder-config and mozconfig from this dir, but it would still need a separate switch to turn on cvs update.

--config-update-dir?

r+ with that setting renamed and the print/exit turned into a die or warn.
Attachment #215684 - Flags: review?(mark) → review+
(In reply to comment #20)

> Ugh.  I wish we had a run_shell_command_with_timeout that took another
> argument, the directory to chdir to.  But run_shell_command_with_timeout is
> broken enough as it is, so this is fine for now.

Yah... I figured you might say that, based on earlier feedback, but... I wasn't going to futz with that now.

> >+        if ($status->{'exit_value'} != 0) {
> >+           print "cvs update in $args->{'configdir'} failed";
> >+           exit 1;
> 
> warn, die, warn, die...print?  (And to STDOUT?)
> 
> At first, I thought all of these hard errors were severe, but I guess they're
> OK in that they'll draw attention to broken CVS servers, and a broken CVS
> server would cause a build failure anyway.  But only if the tinderbox configs
> are checked in at the standard cvsroot.  If they're somewhere else, this will
> introduce a hard dependency on some other cvs server, which might be OK, but
> it's worth thinking about.

They're (right now) on the same server, AFAIK.

The discrepancy between warn/die/print STDERR/print is not as crisp as it should be; that's mostly because it turns out that Tinderbox is sooooo sloppy with STDERR/STDOUT.

In fact, I remember why I did this now: STDERR gets swallowed in many instances (and thus, completely ignored), whereas STDOUT gets reliably printed... so that's why I printed to STDOUT.

The real solution is to clean up the handling of STDERR, so it, too, gets logged.

> --config-update-dir?

how about --config-cvsup-dir?

> r+ with that setting renamed and the print/exit turned into a die or warn.

I'm loath to turn that into something that I'm not sure where the output will go; I know where STDOUT will go (or more importantly, I know that I'll see this error message in the log).
Mark's suggestions, plus actually handling the args uhh... "correctly."
Attachment #215684 - Attachment is obsolete: true
Attachment #215708 - Flags: review?(mark)
Comment on attachment 215708 [details] [diff] [review]
Cleaned up version of the --configdir patch

mmmkay
Attachment #215708 - Flags: review?(mark) → review+
I think we can finally close this one up, some three patches after solving the initial problem. ;-)
Status: ASSIGNED → RESOLVED
Closed: 18 years ago
Resolution: --- → FIXED
Product: Webtools → Webtools Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: