Closed Bug 884471 Opened 11 years ago Closed 11 years ago

Find and fix the underlying cause of the OSX "code -20" intermittent test crashes

Categories

Product/Component: Core :: IPC
Platform: All / macOS
Type: defect
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED FIXED
Target Milestone: mozilla26

Tracking Status
firefox24 --- fixed
firefox25 --- fixed
firefox26 --- fixed
b2g18 ? ---

People

(Reporter: RyanVM, Assigned: smichaud)

References

Details

(Whiteboard: [qa-])

Attachments

(6 files, 5 obsolete files)

966 bytes, patch
602 bytes, text/plain
4.09 KB, patch (benjamin: review+)
5.43 KB, application/x-tar
14.67 KB, text/plain
10.94 KB, text/plain
We frequently hit "code -20" (SIGCHLD) crashes in the OSX testsuite. The crash is fatal and widespread, causing frequent test failures and retriggers. It is often underrepresented in our Orange Factor statistics because the failures are scattered across many tests and individual bugs aren't filed for every instance.
>  Alex Keybl [:akeybl] 1 minute ago
> tracking-firefox23: --- → ?; 

Can you please explain the impact to users here?
(In reply to Justin Lebar [:jlebar] from comment #1)
> Can you please explain the impact to users here?

It deserves attention, which is why I've nominated it for my team to look at. We wouldn't actually track for a specific release, but we do want this fixed in the short to medium term.
An obvious first step is to find 100% reliable STR for these crashes.

Anyone willing to try? :-)
It's obviously a race condition. I highly doubt 100% STR are possible. Can we just understand why we're crashing with a SIGCHLD?

I don't really know POSIX signals, but after reading through a bunch of code I have a theory:

* When a process starts, the default SIGCHLD handler is a no-op
* When we shut down an IPC child process we end up calling ProcessWatcher::EnsureProcessTerminated which can end up in MessagePumpLibevent::CatchSignal
* This calls signal_set and signal_add
* After the child actually quits (normally), this which triggers the first SIGCHLD handler which calls MessagePumpLibevent::SignalEvent::StopCatching, which calls event_del. I believe at this point that the SIGCHLD handling is left i

It doesn't appear that this happens in the case where the child process shuts down before the parent process GeckoChildProcessHost is destroyed because it hits the IsProcessDead check here: http://hg.mozilla.org/mozilla-central/annotate/d2a7cfa34154/ipc/chromium/src/chrome/common/process_watcher_posix_sigchld.cc#l198

I suspect that multiple SIGCHLD handling in libevent is just completely broken: that when the signal handler is removed it should end up in the default "ignore" state but instead ends up in an aborting default. This may be limited to the kqueue implementation which is used on mac but not Linux.
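One way to check this theory would be to log the SIGCHLD disposition before libevent installs its handler and again after MessagePumpLibevent::SignalEvent::StopCatching runs. A minimal diagnostic sketch in plain C (hypothetical; not one of this bug's attachments):

    /* Hypothetical diagnostic: report the current SIGCHLD disposition
     * without changing it.  Call this before CatchSignal() and again
     * after StopCatching() to see what state the handler is left in. */
    #include <signal.h>
    #include <stdio.h>

    static void log_sigchld_disposition(const char *when) {
        struct sigaction sa;
        if (sigaction(SIGCHLD, NULL, &sa) != 0) {   /* query only */
            perror("sigaction");
            return;
        }
        if (sa.sa_handler == SIG_DFL)
            printf("%s: SIGCHLD is SIG_DFL (default action: ignore)\n", when);
        else if (sa.sa_handler == SIG_IGN)
            printf("%s: SIGCHLD is SIG_IGN\n", when);
        else
            printf("%s: SIGCHLD has a custom handler at %p\n",
                   when, (void *)sa.sa_handler);
    }

    int main(void) {
        log_sigchld_disposition("at startup");
        return 0;
    }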
> NOTE: child process received `Goodbye', closing down
> ...
> TEST-UNEXPECTED-FAIL | Shutdown | Exited with code -20 during test run
> ...

Does the browser actually crash here, or has it merely received a SIGCHLD (which it may be ignoring)?

I don't see anything about minidumps in these logs (which you usually see when the browser has actually crashed in the middle of (say) a mochitest).
The browser is actually exiting (see "exit code -20").

Breakpad catches SIGSEGV but it doesn't attach a signal handler for SIGCHLD because that's normally not a fatal signal.
But the browser appears to be exiting when it normally would (though its exit code is "wrong"), at the end of a suite of tests.

Or am I missing something?
No, it's not exiting normally. e.g. https://tbpl.mozilla.org/php/getParsedLog.php?id=24238247&tree=Mozilla-Inbound from bug 817412 stops in the middle of the indexedDB tests.
> TEST-UNEXPECTED-FAIL | Shutdown | Exited with code -20 during test run

This probably indicates the browser exiting when it normally would.

> TEST-UNEXPECTED-FAIL | /tests/dom/indexedDB/test/test_add_put.html |
>   Exited with code -20 during test run

But you're right that this (from bug 817412) doesn't.
(Following up comment #4)

> * After the child actually quits (normally), this [which] triggers the
> first SIGCHLD handler which calls
> MessagePumpLibevent::SignalEvent::StopCatching, which calls
> event_del. I believe at this point that the SIGCHLD handling is left i

Benjamin, you seem to have dropped part of your last sentence here.

And the first "which" in your first sentence seems to be superfluous ... or you've dropped something there, too.
Is there anything we can do to line up manual/automated testing resources to push the investigation along?
No activity in this bug for nearly 2 weeks. What can we do to move this forward again?
I'll try to find time this week to look further into this.

Without STR, though, it may not be possible to do anything about it.
(In reply to Steven Michaud from comment #13)
> I'll try to find time this week to look further into this.
> 
> Without STR, though, it may not be possible to do anything about it.

Thanks Steven!

Can we catch the issue when it occurs on automation rather than finding STR? Lukas or I can set up a meeting with the A Team if you'd like to move forward with that.
I really do need STR.

For example, someone could find (or write) a test that *always* fails in this way.  This should be possible, as someone's already done it at least once -- see bug 230675 comment #39.
We have fixed lots of bugs without being able to reproduce them on demand.  Indeed, many (I'd venture most) intermittent oranges have been fixed by people who were unable to reproduce them locally.

It would be great if we could reliably reproduce this bug, and I don't want to suggest that you shouldn't try to get STR.  But as a matter of historical fact, I do not think it's true that without STR, a bug like this is insoluble, and therefore I think digging into bsmedberg's train of thought from comment 4 might prove fruitful.
I stand by what I said in comment #15.

Without STR (without at least a test (or combination thereof) that always fails in this way), it may be impossible to fix such an obscure bug as this one.  Or (which is also very important) to know that we've actually fixed it, instead of just hacking around it.

Of course we should follow up all the leads we have, and I've already started doing that.

But right now, the very most useful contribution anyone could make is to find/write a test that always fails in this way -- possibly by starting from bug 230675 comment #39.

Would you care to try? :-)
> Would you care to try? :-)

I may have to, since it's my tests that are causing the majority of these errors (by virtue of creating a lot of child processes)!
For what it's worth, I'm going to start by playing with the plugin shutdown timer (for example by making it much shorter):

https://hg.mozilla.org/mozilla-central/annotate/4ffb23062b3b/dom/plugins/base/nsPluginHost.cpp#l800

Then I'll try (locally) running subsets of tests that have triggered this bug in the past, to see if the failures happen more often.
As often as this happens on our test infrastructure, you should also be able to retrigger yourself to glory on Try. It might be difficult to reproduce locally, but I'm very certain that we can find out in short order whether a proposed fix works or not on the test slaves.
> we can find out in short order whether a proposed fix works or not on the test
> slaves

Using this method, we won't be able to tell a hack from a fix.

We may have to settle for a hack in the end.  But I'd like to spend a few days (at least) looking for a fix.
(Following up comment #19)

This didn't lead anywhere.

I did two tryserver builds -- one that shortened the unload timer (by about 35 times) to 5 seconds, and another that lengthened it (by 10 times) to 30 minutes.  There were no test failures at all :-(
If you pushed those to try, you could just retrigger them thousands of times, but https://tbpl.mozilla.org/?tree=Try&pusher=smichaud@pobox.com suggests you didn't.
Just how uniform is the (OS X) test machines' setup?  For example do they all have the same version of Python?  And if they don't, is there any correlation between this bug and a particular Python version?

The reason I ask is that the "Exited with code -20 during test run" error message is displayed (by the parent process) from Python code here, after "status = proc.wait();":

https://hg.mozilla.org/mozilla-central/annotate/4ffb23062b3b/build/automation.py.in#l989

So I'm wondering if Python libraries might conceivably be involved in this bug.

In any case, please let me know the version(s) of Python that are installed on the test machines.
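For context (an aside, not part of the comment above): SIGCHLD is signal 20 on OS X, and Python's subprocess layer reports a child killed by a signal as the negated signal number, which is where the literal "-20" comes from. At the C level the same wait status would be decoded roughly like this (hypothetical sketch, not code from automation.py):

    /* Hypothetical sketch: how the raw wait status maps to the "-20" the
     * Python harness prints.  Python's subprocess layer reports a
     * signal-terminated child as returncode == -WTERMSIG(status), and
     * SIGCHLD is signal 20 on OS X, hence "code -20". */
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        pid_t pid = fork();
        if (pid == 0) {
            pause();                 /* child just waits to be killed */
            _exit(0);
        }
        kill(pid, SIGTERM);          /* stand-in for whatever kills the browser */
        int status = 0;
        waitpid(pid, &status, 0);
        if (WIFEXITED(status))
            printf("exited normally, status %d\n", WEXITSTATUS(status));
        else if (WIFSIGNALED(status))
            /* In this bug WTERMSIG(status) would be SIGCHLD (20), which is
             * strange: SIGCHLD's default action is to be ignored, not to
             * terminate the receiving process. */
            printf("killed by signal %d\n", WTERMSIG(status));
        return 0;
    }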
(In reply to comment #23)

> you could just retrigger them thousands of times

Not worth the trouble.  I was looking for gross effects, big enough to show up in a single set of tryserver runs.

But I didn't see any, so I'm going to stop playing with the length of the plugin unload timer.
Ben might know the answers to comment 24.
Flags: needinfo?(bhearsum)
(In reply to Steven Michaud from comment #24)
> Just how uniform is the (OS X) test machines' setup?  For example do they
> all have the same version of Python?  And if they don't, is there any
> correlation between this bug and a particular Python version?

They're all the same.

> The reason I ask is that the "Exited with code -20 during test run" error
> message is displayed (by the parent process) from Python code here, after
> "status = proc.wait();":
> 
> https://hg.mozilla.org/mozilla-central/annotate/4ffb23062b3b/build/
> automation.py.in#l989
> 
> So I'm wondering if Python libraries might conceivably be involved in this
> bug.
> 
> In any case, please let me know the version(s) of Python that are installed
> on the test machines.

[cltbld@talos-mtnlion-r5-026.test.releng.scl3.mozilla.com ~]$ python -V
Python 2.7.2
Flags: needinfo?(bhearsum)
I've written an interpose library that logs lots of signal activity in all currently running programs.

It hooks and logs the usual suspects (like signal(), sigaction(), kill() and raise()), plus a bunch of exit() methods.  And I've found that, on OS X, the kernel sends all signals (that have handlers) through an undocumented user-level _sigtramp() method (a "trampoline"), which my library also hooks.  So there's some hope that it will log all relevant signal activity.

I'd like to get it onto one of our build machines.

I hope and assume that its logging will get interleaved with the "regular" test logging, and redirected along with it to the correct destination(s).

First we should test that this is in fact true.

Then I'd like to run tests on this build machine (the one with my interpose library), continuously, until we hit the bug.
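For anyone unfamiliar with the technique: a dyld interpose library is a dylib loaded via DYLD_INSERT_LIBRARIES whose __interpose section tells the dynamic loader to route other images' calls to chosen functions through replacements. A minimal sketch of the idea (hypothetical; the real attachment hooks many more functions and timestamps its output):

    /* interpose_sketch.c -- hypothetical, minimal version of the idea above.
     * Build:  cc -dynamiclib -o interpose.dylib interpose_sketch.c
     * Use:    DYLD_INSERT_LIBRARIES=./interpose.dylib some_program
     * The __DATA,__interpose section tells dyld to route every other image's
     * calls to sigaction() through my_sigaction() instead. */
    #include <signal.h>
    #include <stdio.h>

    static int my_sigaction(int signum, const struct sigaction *act,
                            struct sigaction *oldact) {
        /* Log the call, then forward to the real sigaction().  (Calls made
         * from inside the interposing library itself are not rerouted.) */
        fprintf(stderr, "[interpose] sigaction(%d, handler=%p)\n",
                signum, act ? (void *)act->sa_handler : NULL);
        return sigaction(signum, act, oldact);
    }

    __attribute__((used)) static const struct {
        const void *replacement;
        const void *replacee;
    } interposers[] __attribute__((section("__DATA,__interpose"))) = {
        { (const void *)my_sigaction, (const void *)sigaction },
    };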
Is this (what I talk about in comment #28) something you could help me with, Ben?
Flags: needinfo?(bhearsum)
(In reply to Steven Michaud from comment #29)
> Is this (what I talk about in comment #28) something you could help me with,
> Ben?

Is this something that you could run in a loop on a single machine to repro, provided you had access? This doesn't seem like something we should be deploying to our machines unless absolutely necessary.
Flags: needinfo?(bhearsum) → needinfo?(smichaud)
> Is this something that you could run in a loop on a single machine
> to repro, provided you had access?

I assume so (though I'll probably need your help to figure out how to
do it).

The most important thing is that it should generate the same kind of
logs that the "real" test machines do, and that I should have access
to them (hopefully the same way as with the "real" machines).

Also, it needs to be set up *exactly* like the real test machines.

I think bsmedberg's hypothesis from comment #4 (that this bug is a
problem with the SIGCHLD handling in Mozilla's libevent) is highly
unlikely.  I think it's much more likely that some other program is
messing with the signals -- possibly one that only runs on our test
machines.
Flags: needinfo?(smichaud)
Here's a revised version of my interpose library that hooks a few more signal-related methods (probably not needed, but just in case).
Attachment #770967 - Attachment is obsolete: true
(In reply to Steven Michaud from comment #31)
> > Is this something that you could run in a loop on a single machine
> > to repro, provided you had access?
> 
> I assume so (though I'll probably need your help to figure out how to
> do it).
> 
> The most important thing is that it should generate the same kind of
> logs that the "real" test machines do, and that I should have access
> to them (hopefully the same way as with the "real" machines).
> 
> Also, it needs to be set up *exactly* like the real test machines.
> 
> I think bsmedberg's hypothesis from comment #4 (that this bug is a
> problem with the SIGCHLD handling in Mozilla's libevent) is highly
> unlikely.  I think it's much more likely that some other program is
> messing with the signals -- possibly one that only runs on our test
> machines.

OK, running out of band is a bad option then, because Buildbot won't be running.

Given that, we should think about deploying this to all of the machines temporarily. Would this library affect build results or performance negatively? How does it need to be deployed? Where does it log to?
Before we deploy my interpose library to all the build machines, we should try it out on one of them.

> Would this library affect build results or performance negatively?

That's hard to say.  It'd slow things down, but I think only marginally.  Hard to know for sure without a test, though.

> Where does it log to?

As I said in comment #28, I hope its logging will get interleaved with "regular" test logging.  But this also needs to be tested.

> How does it need to be deployed?

The source tarball I've posted contains some minimal instructions.  But I'll repeat (and elaborate on) them here:

Once you've built the interpose library ('make' should do the trick), you need to set the DYLD_INSERT_LIBRARIES environment variable so that it gets loaded into the appropriate executables (which should be *every* executable that runs as part of a build, including BuildBot).  I don't know the build system that's used on the tryservers, so I can't give detailed instructions.

What is the single, lowest-level command to start a build?  If it's something like 'BuildBot', and that's called from a shell script, I'd add the following line just before 'BuildBot':

export DYLD_INSERT_LIBRARIES=/full/path/to/interpose.dylib

(Here I assume that DYLD_INSERT_LIBRARIES isn't already being used for other purposes.)
> What is the single, lowest-level command to start a build?

I suppose I should have asked:

What is the single, highest-level command to start a build?
Blocks: 795269
Bug 230675 comment #39 *was* worth following up.  V5 of that bug's patch *does* still reliably trigger this bug:

https://tbpl.mozilla.org/?tree=Try&rev=dcc27681f63f

Not surprisingly, I can't reproduce these failures locally.

Once we have my interpose library running (successfully) on at least one tryserver build machine, we need to do a suite of tests with this patch.
(Following up comment #36)

Oops, spoke too soon.  Bug 817247 *is* referred to in the list of failures, but "exited with code -20" isn't one of them :-(
Attachment #772430 - Attachment is obsolete: true
Comment on attachment 771087 [details]
Interpose library for logging signal activity, rev1

I'd like to get this onto a single tryserver machine, to ensure that it works as intended and doesn't interfere with test results (if need be by altering it).  Then I'd like to get it on all the production test machines -- at least until the first occurrence of this bug.

Can you help me with this, Ben?  Or should I ask someone else?
Flags: needinfo?(bhearsum)
Actually, I asked Callek to have a look at this... Sorry for the lack of communication on that!
Flags: needinfo?(bhearsum) → needinfo?(bugspam.Callek)
Steven, we're not the experts here -- the ateam usually does the testing of these sorts of libraries somehow, and we just do the final production deploy.

We think you can:
 - put your zip file on people.mozilla.com
 - modify your local copy of testing/talos/talos_from_code.py to refer to people.m.c
 - submit your job to the tryserver

If that doesn't make sense, it's probably best to ask in #ateam (many of us are there as well) and we can figure out what is needed. :jmaher knows the magic incantation for sure.
Flags: needinfo?(bugspam.Callek)
Steven-

I am unclear on what exactly you are trying to pull off here.  

1) do you need to modify talos in order to gain more information?
2) do you need to modify the browser and just run talos on try while collecting more information?
3) something else I don't know about?
(In reply to comment #40 and comment #41)

If you haven't already done so, please read through the rest of the bug.  Pay particular attention to comment #28 and comment #34.

An "interpose library" gets loaded by the OS's dynamic loader (dyld on OS X) into executables/libraries as they're loaded into memory.  It gets loaded "first", and is used to hook methods in those other executables/libraries.  For a more detailed description see http://books.google.com/books?id=K8vUkpOXhN4C&pg=PA73&lpg=PA73&dq=__interpose&source=bl&ots=OJnnXZYpZC&sig=o7I3lXvoduUi13SrPfOON7o3do4&hl=en&ei=AoehS9brCYGQNrvsmeUM&sa=X&oi=book_result&ct=result&resnum=6&ved=0CBsQ6AEwBQ#v=onepage&q=__interpose&f=false.
in releng they can get you buildvpn support, and give you a loaner machine.  Then I could help you get talos or whatever test running and we could put it in a loop until it fails!
I've said the following already several times, but I'll say it again as clearly as possible.

Here's what I need to do:

As best I can tell, this bug is caused by some part of the build infrastructure doing weird things with signals.  But whether or not this is true, this bug involves weird things happening with signals.  So I need to "hook" methods having to do with signals in all the programs that run when we're doing tests with our current build infrastructure -- this includes the programs that are part of the build infrastructure (including interpreters like Python), plus (of course) whatever's being tested.

For the methods I hook, I log their calling parameters and their results.  For the output to be useful, it needs to be "interleaved" with all the logging that's produced when a test is run -- which (for example) records the success or failure of individual tests.
I have concerns with this as our production talos numbers will most likely be affected.  I assume we could set this up on a machine or two and use a standalone buildbot master to run the tests as we would normally.

releng, am I making sense here?
I need a machine that's set up *exactly* like the production machines that run mochitests and reftests and the like.  Nothing else will do.

I suggest that you guys pull one of the production machines from its pool and give me remote access to that, via VPN.
that is what I said in comment 43, and then it sounded like this needed to run with a 100% production environment.
(In reply to comment #45 and comment #47)

I'd also be happy with access to a small, separate network of testing machines -- once again presuming that it's set up exactly like the production network.
hwine, can you ensure smichaud has buildvpn access and a loaner machine.  I assume osx 10.7 will be sufficient.

:smichaud, after we get this setup on a loaner machine and attempt runs to reproduce, we can talk about multiple machines and buildbot masters if necessary.  a loaner machine is a machine pulled directly out of production such that it won't be scheduling jobs, but setup the same as it would in production.
(In reply to comment #49)

Sounds fine to me.  Thanks!
Depends on: 892107
:jmaher - thanks for helping to clarify

:smichaud - loaner being processed via bug 892107
(In reply to Hal Wine [:hwine] from comment #51)
> :jmaher - thanks for helping to clarify
> 
> :smichaud - loaner being processed via bug 892107

Hal confirmed in IRC that the specific Mac type does not matter; this issue has manifested in https://bugzilla.mozilla.org/show_bug.cgi?id=795256#c45 as well (on a mountain lion machine)

I'm going to loan that instead since we can easily bring this back up after loan right now. (if you do need a snow or lion tester we can do that, I just believe this is easier all around).
No longer blocks: talos-mtnlion-r5-002
Depends on: 896387
Depends on: 897422
Depends on: 897905
Depends on: 899534
Depends on: 900024
Depends on: 902893
I'm slowly making progress on this, but I've a ways to go yet.

Thanks to Callek and Kim Moir, I've been able to connect to my loaner, using both ssh and screen sharing, for about a week.  And by my own digging around, I've found a fairly reliable way to "reproduce" this bug at least once in 100 runs of the mochitest-2 (aka plain2) tests, with a tryserver build altered to lengthen the plugin shutdown timer.  The scare quotes are because this still definitely requires brute force -- 100 full runs of the plain2 tests take about 25 hours on my loaner.  I'll say more about this in a later comment.

I've also got my interpose library working, though I haven't been able to get it to interleave its logging with that of the test output.  But the test output logging uses timestamps, so I've largely gotten around the problem by making my interpose library also timestamp each of its log statements.

My interpose library tells me that it's python (and only python) that calls exit() with a "status" of '-20'.  So I currently suspect this is a bug in one of our python scripts, or in python itself.  But I need to keep digging to find out for sure.
Here's the shell script I've been using to run mochitest-2 tests multiple times on my loaner.

Before I can do that, I need to prepare the directory in which I run the tests by calling:

hg clone http://hg.mozilla.org/build/mozharness scripts

I got the general idea how to do this (how to run desktop_unittest.py without a connection to a buildbot master) here:

https://wiki.mozilla.org/ReleaseEngineering/Mozharness/07-May-2013?title=ReleaseEngineering/Mozharness
Does anyone have any idea how often the "Exited with code -20" problem occurs?  Or in other words, what percentage of test runs fail with this error?

I strongly suspect the percentage is *much* less than 1 in 100 test runs.  I ran about 300 plain2 tests without seeing even a single case, before I ran them another 100 times with the plugin unload timeout lengthened, and saw two of them.

I'd guess the percentage is once in 500 or 1000 test runs.  But do we have a more reliable way of finding out?
Unfortunately, we don't have a good way to provide the exact data you're asking for.

The below link shows the failures that were starred in the last week.
http://brasstacks.mozilla.com/orangefactor/?display=OrangeFactor&includefiltertype=quicksearch&includefilterdetailsexcludeResolved=false&includefilterdetailsexcludeDisabled=false&includefilterdetailsquicksearch=%22code%20-20%22&includefilterdetailsnumbugs=0&includefilterdetailsresolvedIds=&excludefiltertype=quicksearch&excludefilterdetailsquicksearch=&excludefilterdetailsnumbugs=0&excludefilterdetailsresolvedIds=&plat=All&test=All&type=All&startday=2013-08-01&endday=2013-08-08&tree=trunk

The combined number of failures is 43, which would currently rank it as the #9 top orange for the past week.

I'm also providing that with the earlier caveats from comment 0 that this is probably under-representative due to some instances being starred with a generic "code -20" comment rather than filing a new bug for each individual instance.
> The combined number of failures is 43 [last week]

Do we know how many Mac test runs we do per week (on all the slaves)?

This would give us a rough idea -- which is all I'm looking for.

And no, this isn't vital information.  I'm just curious.

I'll keep working on this bug regardless of the results :-)
Not easily AFAIK
Attached patch Effective workaround (obsolete) — Splinter Review
Here's something my tests show is an effective workaround for this bug.

It's a change to libevent, and follows a hint in libevent code here, written by you, Josh:

https://hg.mozilla.org/mozilla-central/annotate/89294cd501d9/ipc/chromium/src/third_party/libevent/kqueue.c#l441

Since it's your hint that I'm following, I've asked you to review this patch.  If you'd rather someone else do the review, please pass it along.

I tested this patch by adding to it my patch from comment #54, doing a tryserver build, then running the mochitest-2 (aka plain2) tests 100 times in that build on my loaner.  Normally I'd have seen 3-4 "Exited with code -20" failures, plus a lot of other failures of different kinds.  This time I saw no failures of any kind.

So besides "fixing" the "Exited with code -20" failures, this patch is likely to reduce other kinds of intermittent failures.

I still haven't managed to find the actual cause of the "Exited with code -20" failures.  But I'm quite sure it isn't in either libevent or Python.  As best I can tell it's an OS bug -- possibly a kernel bug.

Fortunately I didn't hook any of the wait...() calls in early versions of my interpose library, and so was able to find a way to "reproduce" these failures.  But then I found that when I hooked waitpid() (which is used by Python to wait for Firefox to finish a set of tests), I could no longer reproduce them.

I don't know whether doing this eliminates the "Exited with code -20" failures, or just makes them happen much less often.  Either way, it probably makes it impossible to find this bug's true cause.

My current theory is that the "Exited with code -20" failures happen when a plugin child calls exit(0) while Python is waiting (in waitpid()) for Firefox to finish its tests and Firefox has a handler installed for SIGCHLD.  This *should* make the kernel deliver a SIGCHLD signal to Firefox's handler, but the handler never gets called.  Instead (I think) the kernel terminates Firefox and tells Python (via waitpid()) that this happened because Firefox received a SIGCHLD.

I have a couple of ideas for finding out more information, which I'll try over the next few days.  But I'm not optimistic they'll tell me anything more than I already know.
Attachment #794102 - Flags: review?(joshmoz)
You.Freaking.Rock.
> You.Freaking.Rock.

With all deliberate speed :-)
Comment on attachment 794102 [details] [diff] [review]
Effective workaround

Review of attachment 794102 [details] [diff] [review]:
-----------------------------------------------------------------

FYI, I didn't write any of this code, as your comment suggests (maybe I'm misinterpreting). I just updated our libevent in bug 842887. I wonder if this existed before that update, or this is a regression. The code looks fine to me but you might want to run it by someone like bsmedberg as well. Also, I am not a module owner or peer for this.

Either way, if this fixes a problem that is not specific to how Mozilla sends signals the patch should be upstreamed. It's a bit messy to just apply patches to third party libs. The other cleaner option is to put a patch into the tree that our build system can apply.

::: ipc/chromium/src/third_party/libevent/kqueue.c
@@ +432,5 @@
>  	if (kevent(kqop->kq, &kev, 1, NULL, 0, &timeout) == -1)
>  		return (-1);
>  
>  	/* XXXX The manpage suggest we could use SIG_IGN instead of a
>  	 * do-nothing handler */

You can remove this comment now.

@@ +433,5 @@
>  		return (-1);
>  
>  	/* XXXX The manpage suggest we could use SIG_IGN instead of a
>  	 * do-nothing handler */
> +	if (_evsig_set_handler(base, nsignal, SIG_IGN) == -1)

Do you expect this change to actually do something or is SIG_IGN equivalent to the do-nothing function?
Attachment #794102 - Flags: review?(joshmoz) → review+
> FYI, I didn't write any of this code, as your comment suggests
> (maybe I'm misinterpreting). I just updated our libevent in bug
> 842887.

Oops, sorry.  I just saw your name in hg blame and assumed.

> The code looks fine to me but you might want to run it by someone
> like bsmedberg as well. Also, I am not a module owner or peer for
> this.

I'll find an additional reviewer who's closer to the code I changed.

> Do you expect this change to actually do something or is SIG_IGN
> equivalent to the do-nothing function?

As I understand it, SIG_IGN (which tells the kernel to ignore the
signal) is *supposed* to be functionally equivalent to installing an
empty handler.  So also would SIG_DFL be in this case (since the
default behavior for SIGCHLD is to ignore it).

Clearly we've now found a case where they *aren't* equivalent, at
least on OS X.  But I think that's because of an OS bug (probably a
kernel bug).

> Either way, if this fixes a problem that is not specific to how
> Mozilla sends signals the patch should be upstreamed. It's a bit
> messy to just apply patches to third party libs. The other cleaner
> option is to put a patch into the tree that our build system can
> apply.

I'll look for an existing bug on libevent, or open one if I need to.
In it I'll refer to this bug.

I suspect the issue may take a long time to resolve, though.  So I'll
also follow your other suggestion, and imitate what we do for patches
to cairo.
I'd also be surprised if Chromium hasn't had issues with this. It may be worth looking through their bug tracker too.
Attached patch Effective workaround rev1 (obsolete) — Splinter Review
Follow Josh's suggestions and carry forward his r+.
Assignee: nobody → smichaud
Attachment #794102 - Attachment is obsolete: true
Attachment #794720 - Flags: review+
Attachment #794720 - Flags: review?(benjamin)
Actually, something very similar to my patch was landed on libevent trunk about two years ago, by Zack Weinberg of all people:

http://sourceforge.net/p/levent/bugs/244/
https://github.com/nmathewson/Libevent/commit/148458e0a1fd25e167aa2ef229d1c9a70b27c3e9

Next I'll try to find something relevant in Chromium's code and bugbase.
I couldn't find anything relevant in Chromium's bugbase.

Also, Chromium's current version of libevent (1.4.13) is older than ours:
https://codereview.chromium.org/412006
Apparently the first libevent "release" the changes from comment #67 appeared in was 2.1.1-alpha.  2.0.21 (the endpoint of Josh's update from bug 842887) is still the current release on the 2.0.X "stable" branch -- which (of course) doesn't contain these changes.

I don't know how the libevent release train works.  But whenever the patch from comment #67 gets onto the 2.0.X stable branch (or *some* stable branch), we can upgrade to that version and drop whatever's been done here (presuming such an upgrade doesn't raise other issues).

In the meantime we can land my patch, or something like it.
If people want I could change my patch to make it identical to the one from comment #67.  I don't think it'd make any difference.  But I don't know the libevent code at all well, so I might have missed something.
I don't fully understand the problem here but I can comment a little on libevent and the change I made to it:

It looks like Steven Michaud's patch unconditionally sets the signal handler to SIG_IGN, whereas mine uses SIG_DFL for SIGCHLD (and only SIGCHLD).  This is because, on many Unixes (including at least everything with kqueue) if the handler for SIGCHLD is SIG_IGN, child exit statuses are discarded by the kernel.  This means, not only is no signal handler called, kqueue doesn't report the signal either, and *none of the wait() functions will report that the child has exited.*  SIG_DFL for SIGCHLD behaves the same way SIG_IGN does for other signals, i.e. no signal handler is called but the event can still be fielded by kqueue, and wait() continues to work.

For a generic event handling library like libevent, SIG_DFL is therefore the correct choice when setting up child exits to be fielded via kqueue.  And it is *probably* also the correct choice for us - I assume we actually want to know when child processes exit!  However, since this bug is about spurious termination of the main Firefox process with SIGCHLD, I have to wonder if setting SIG_DFL here *wouldn't* fix the problem.  I strongly recommend that we try it and find out what happens.

It is indeed the case that libevent 2.1.x is not yet considered stable.  My patch did not go onto the 2.0.x branch because when I wrote it, I thought it was pure optimization; however, if it can be demonstrated to fix this problem, I'm sure Nick M would be happy to put it into the next 2.0.x release.  I have no real sense of how hard it will be to upgrade from 1.4.x to 2.0.x, but it ought to be done (in conjunction with Chromium upstream), because 1.4.x is EOLed.
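A small stand-alone program (hypothetical, not part of any patch here) illustrates the SIG_IGN / SIG_DFL difference Zack describes, assuming the BSD/OS X semantics outlined above:

    /* Hypothetical demo: with SIGCHLD set to SIG_IGN the kernel discards
     * the child's exit status and waitpid() fails with ECHILD; with
     * SIG_DFL the status is reported normally. */
    #include <sys/wait.h>
    #include <errno.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static void try_disposition(void (*disposition)(int), const char *name) {
        signal(SIGCHLD, disposition);
        pid_t pid = fork();
        if (pid == 0)
            _exit(42);                        /* child exits immediately */
        int status = 0;
        if (waitpid(pid, &status, 0) == pid && WIFEXITED(status))
            printf("%s: got exit status %d\n", name, WEXITSTATUS(status));
        else
            printf("%s: waitpid failed: %s\n", name, strerror(errno));
    }

    int main(void) {
        try_disposition(SIG_DFL, "SIG_DFL");  /* exit status is reported */
        try_disposition(SIG_IGN, "SIG_IGN");  /* status discarded -> ECHILD */
        return 0;
    }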
Oh, also, the Python code that is printing exit statuses really ought to be changed to read something like this:

    # 'status' is the exit status
    if os.name != 'posix':
        # Windows error codes are easier to look up if printed in hexadecimal
        if status < 0: status += 2**32
        print "exit status %x\n" % status
    elif os.WIFEXITED(status):
        print "exit %d\n" % os.WEXITSTATUS(status)
    elif os.WIFSIGNALED(status):
        # the python stdlib doesn't appear to have strsignal(), alas
        print "killed by signal %d\n" % os.WTERMSIG(status)
    else:
        # this is probably a can't-happen condition on Unix, but let's be defensive
        print "undecodable exit status %04x\n" % status
> I have to wonder if setting SIG_DFL here *wouldn't* fix the problem.
> I strongly recommend that we try it and find out what happens.

I did check it, about a week ago.  It *does* work.

In other words, SIG_DFL works around this bug just as well as SIG_IGN.

> My patch did not go onto the 2.0.x branch because when I wrote it, I
> thought it was pure optimization; however, if it can be demonstrated
> to fix this problem, I'm sure Nick M would be happy to put it into
> the next 2.0.x release.

Glad to hear it.  Should I ask at
http://sourceforge.net/p/levent/bugs/244/?

> I have no real sense of how hard it will be to upgrade from 1.4.x to
> 2.0.x, but it ought to be done (in conjunction with Chromium
> upstream), because 1.4.x is EOLed.

Even though Chromium hasn't, we've already upgraded to 2.0.13, at bug
842887.  I'm sure it wouldn't be hard for us to upgrade again to
whatever the next 2.0.X release will be.

In the meantime, though, we need to do something here, since this bug
is pretty urgent.
Right now I'm strongly inclined to just land Zack's patch.

But Zack, first let me ask you what you think of my patch's changes to _evsig_set_handler() in event.c?  Good?  Bad?  Makes no difference?
> we've already upgraded to 2.0.13

We've upgraded to 2.0.21.
(In reply to comment #73)

Sounds like a good idea.

If we did change the Python code, we'd presumably need to do it here:
https://hg.mozilla.org/mozilla-central/annotate/17143a9a0d83/build/automation.py.in#l748
(In reply to Steven Michaud from comment #74)
> > I have to wonder if setting SIG_DFL here *wouldn't* fix the problem.
> > I strongly recommend that we try it and find out what happens.
> 
> I did check it, about a week ago.  It *does* work.
> In other words, SIG_DFL works around this bug just as well as SIG_IGN.

Good to know!

> Glad to hear it.  Should I ask at
> http://sourceforge.net/p/levent/bugs/244/?

Yes, but please also send mail to libevent-users@freehaven.net.  (I don't remember if you have to subscribe or not, sorry.)

(In reply to Steven Michaud from comment #75)
> But Zack, first let me ask you what you think of my patch's changes to
> _evsig_set_handler() in event.c?  Good?  Bad?  Makes no difference?

I believe those are harmless but unnecessary.  The kernel should ignore sa_flags and sa_mask when sa_handler is SIG_IGN or SIG_DFL.
Thanks, Zack, for your input!
Attachment #794720 - Attachment is obsolete: true
Attachment #794720 - Flags: review?(benjamin)
Attachment #794908 - Flags: review?(benjamin)
For my own future reference, here's how to subscribe to (and unsubscribe from) libevent-users@freehaven.net:

http://archives.seul.org/libevent/users/
(In reply to Zack Weinberg (:zwol) from comment #73)
> Oh, also, the Python code that is printing exit statuses really ought to be
> changed to read something like this:

Filed as bug 908945, thanks for the pointer (and the detailed info on your patch).
Attachment #794908 - Flags: review?(benjamin) → review+
Pushed to inbound to get as much bake time as possible for possible uplift to Aurora/Beta. Thanks for all the work on this, Steven! *fingers crossed*
https://hg.mozilla.org/integration/mozilla-inbound/rev/e7b8ca38851d
Thanks, Ryan.  You beat me to it.  And you're quite welcome.

I expect this patch to make the "Exited with code -20" errors disappear completely, and to cause a substantial reduction in the "application timed out" errors (most of which I now suspect are spurious).  Please comment again in a week with whatever results you can find.  I'll set the needinfo flag to remind you.

Working on this bug has taught me a lot about how our build infrastructure works, at least on the Mac.  That may make it easier for me to figure out other Mac build infrastructure bugs (like bug 898519).  Or to be more realistic, it may move some of those bugs from "impossible" to "very difficult" :-)
Flags: needinfo?(ryanvm)
https://hg.mozilla.org/mozilla-central/rev/e7b8ca38851d

\m/
Status: NEW → RESOLVED
Closed: 11 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla26
Comment on attachment 794908 [details] [diff] [review]
Effective workaround rev2 (backport Zack's patch)

[Approval Request Comment]
Bug caused by (feature/regressing bug #): Apple kernel bug
User impact if declined: none
Testing completed (on m-c, etc.): one day on m-c, lots of testing on loaner slave
Risk to taking this patch (and alternatives if risky): low to none
String or IDL/UUID changes made by this patch: none

The people who (like RyanVM) manage the build/test infrastructure would really like to get this patch on all branches as soon as possible.  Its randomness and frequency make it a significant burden.

RyanVM tells me there have been no "Exited with code -20" errors since this patch landed on the trunk last night.  That, by itself, is no guarantee.  But I myself did extensive testing of various versions of my patch (at least 400 runs), without seeing even one "Exited with code -20" error (or any other failures).

This patch will have no positive user impact.  "Ordinary" users (I assume) don't build Firefox or run it from a script.  Likewise it's very unlikely to have any negative impact.  But if it *is* going to cause problems, the best way to find them is to get more testers, which we can do by landing this patch on the branches.
Attachment #794908 - Flags: approval-mozilla-beta?
Attachment #794908 - Flags: approval-mozilla-aurora?
> or run it from a script

or run it repeatedly from a script
Comment on attachment 794908 [details] [diff] [review]
Effective workaround rev2 (backport Zack's patch)

Can definitely bake this on Aurora.
Attachment #794908 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment on attachment 794908 [details] [diff] [review]
Effective workaround rev2 (backport Zack's patch)

Patch is low risk, avoids sheriff pain.

Steven, given this will need to go out sooner rather than later to get more testing: if regressions do exist in the wild, what kind of reports would we be looking at? Would they be easy enough to discover in a couple of beta cycles?
Attachment #794908 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
> if regressions do exist in the wild, what kind of reports would we
> be looking at

It's very difficult to say.

It's extremely unlikely that the original bug affects users.  If it
did, though, I'd be looking for cases of Firefox spontaneously exiting
when lots of plugins are in use -- probably without any Breakpad
report.

So maybe something like that?

> Would they be easy enough to discover in a couple of beta cycles?

Anything major would be.
Actually, if my patch is going to cause problems, they'd most likely show up as new kinds of intermittent failures in our automated tests.
I'm giving up my attempt to find this bug's true cause.  It's terribly
time consuming.  And there's already strong evidence that this is some
kind of Apple bug -- presumably a kernel bug.

I mentioned in comment #60 that hooking waitpid() in my interpose
library stops the bug happening.  I found a workaround for this
(instead of logging calls to waitpid() from my interpose library, I
now do it from Python code).  But I keep running into the same kind of
problem, and it's a royal pain to have to wait several hours (or
overnight) for the results of each new test.

Here, for the record, is my current best guess as to what's going on:

1) Firefox calls ProcessWatcher::EnsureProcessTerminated() on a child
   process, installs a SIGCHLD handler (in a call to
   loop->CatchSignal(SIGCHLD, reaper, reaper)), and the child process
   exits normally.

2) Before the kernel has had a chance to call the handler (or Firefox
   has had a chance to remove it) another child process of the same
   Firefox process exits normally.

   At least some of the time this bug results.

Another weird thing I noticed, which may or may not be related (and
which is only observable when I hook waitpid() in my interpose
library):

Sometimes a Firefox child process will exit normally (will call
_exit(0)), but then when Firefox calls waitpid() on that process, it
says that the process terminated with a SIGHUP.  I never see this when
Python calls waitpid() on a Firefox process.
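To make the hypothesized window concrete, here is a stand-alone sketch of the sequence described above (hypothetical; on a correctly behaving kernel the parent always survives, which is exactly why the failure has been so hard to reproduce):

    /* Hypothetical sketch of the race described above: the parent installs
     * a SIGCHLD handler for one child (as EnsureProcessTerminated() does via
     * libevent's CatchSignal()), and a second child of the same process
     * exits inside the same window.  On a correctly behaving kernel the
     * parent always survives; the theory is that on OS X it occasionally
     * gets terminated "by SIGCHLD" instead. */
    #include <sys/wait.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static void reaper(int signo) {
        (void)signo;                          /* async-signal-safe no-op */
    }

    int main(void) {
        pid_t first = fork();
        if (first == 0) {
            usleep(1000);                     /* child 1 exits "normally" */
            _exit(0);
        }

        /* Step 1: install the SIGCHLD handler, as CatchSignal() would. */
        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = reaper;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGCHLD, &sa, NULL);

        /* Step 2: a second child exits before the handler has run or been
         * removed -- the window described in the comment above. */
        pid_t second = fork();
        if (second == 0)
            _exit(0);

        /* Reap both; the parent should exit 0, not die from SIGCHLD. */
        waitpid(first, NULL, 0);
        waitpid(second, NULL, 0);
        puts("parent survived both SIGCHLDs");
        return 0;
    }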
For the record.
Attachment #771087 - Attachment is obsolete: true
For the record.

I found that this bug still happens with a build made from a Python 2.7.2 source tarball downloaded from http://www.python.org/.  So I played around with adding debug logging.
Once again for the record:

I haven't received any response to the following messages, and no longer expect to:

http://sourceforge.net/p/levent/bugs/244/
http://archives.seul.org/libevent/users/Aug-2013/msg00023.html
(In reply to Steven Michaud from comment #97)
> Once again for the record:
> 
> I haven't received any response to the following messages, and no longer
> expect to:
> 
> http://sourceforge.net/p/levent/bugs/244/
> http://archives.seul.org/libevent/users/Aug-2013/msg00023.html

Nick is chronically overworked.  I'll poke him.
For consideration in a Firefox OS 1.1.1 release, to help sheriffs with intermittent issues.

Would be great to know if there's any risk to B2G here.
tracking-b2g18: --- → ?
The code in this patch only affects the BSDs, so it shouldn't be risky for on-device B2G builds. How does this help the B2G branch? Are we actually doing mac B2G builds there?
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #100)
> How does this help the B2G branch? Are we actually doing mac B2G builds there?

Yep. Builds and tests. And yes, we're hitting code -20 failures there too.
Assuming this does not need QA.
Whiteboard: [qa-]