Open Bug 922943 Opened 11 years ago Updated 2 years ago

Do something about jit-tests failing on OS X with exit code -9

Categories

(Core :: JavaScript Engine, defect)

x86_64
macOS
defect

Tracking


REOPENED
mozilla27

People

(Reporter: philor, Unassigned)

References

Details

Attachments

(1 file)

As I demonstrated with Windows bugs before bug 899697 allowed them to return either 0 or 3, I have a high tolerance for filing pointless intermittent-failure bugs, but I've reached my limits (rather more quickly) with Mac -9 failures. Pls fix, thx.
Blocks: 916648
Blocks: 916872
Blocks: 917008
For extra credit, though it's considerably less frequent than either Windows was or OS X is, allow Linux to return 1.
Blocks: 922947
Attached patch test-return-code (Splinter Review)
Let's just drop the return code checking. We lived without it just fine for years. I don't think anyone wants to spend hours tracking down this problem. I'm guessing it's some weird bug with the test harness (or in the Python libs).
Attachment #813209 - Flags: review?(terrence)
Comment on attachment 813209 [details] [diff] [review]
test-return-code

Review of attachment 813209 [details] [diff] [review]:
-----------------------------------------------------------------

I agree. Multiprocessing seems to just be incompetent at handing back correct error codes. I expect that this will just keep plaguing us until we move to using os.exec directly like jsreftests does.
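[Editor's note: for reference, the "os.exec directly" route mentioned above could look roughly like the following. This is a minimal sketch, not the actual jsreftests code; the function name is hypothetical.]

    import os

    def run_shell(argv):
        # Sketch: fork, exec the JS shell in the child, and read the wait
        # status straight from the kernel with no intermediate Python layers.
        pid = os.fork()
        if pid == 0:
            try:
                os.execvp(argv[0], argv)  # on success this never returns
            except OSError:
                os._exit(127)             # exec itself failed
        _, status = os.waitpid(pid, 0)
        if os.WIFSIGNALED(status):
            return -os.WTERMSIG(status)
        return os.WEXITSTATUS(status)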

Thanks for taking this, Bill! r=me

::: js/src/tests/lib/jittests.py
@@ +323,5 @@
>  
>  def check_output(out, err, rc, test):
>      if test.expect_error:
> +        # We'd like to check the return code here, but doing so seems
> +        # to return incorrect values occasionally. See bug 899697.

Please add bug 922943 to this comment.
Attachment #813209 - Flags: review?(terrence) → review+
I only glanced at the code, but it seems like multiprocessing isn't used to deal with return codes at all. The shell is spawned by subprocess.Popen using some complicated code to handle timeouts. That seems more likely to be the cause here.
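[Editor's note: a paraphrase of that spawn-with-timeout pattern, not the actual th_run_cmd code. It shows how a harness-initiated kill on timeout surfaces later as a -9 return code.]

    import subprocess, threading

    def run_cmd_with_timeout(cmd, timeout):
        # Spawn the shell, arm a watchdog, and SIGKILL the child if it runs
        # too long.  When the watchdog fires, waitpid reports the child as
        # signaled and Popen turns that into returncode == -9 (SIGKILL).
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE)
        killer = threading.Timer(timeout, proc.kill)
        killer.start()
        try:
            out, err = proc.communicate()
        finally:
            killer.cancel()
        return out, err, proc.returncode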
(In reply to Terrence Cole [:terrence] from comment #3)
> I agree. Multiprocessing seems to just be incompetent at handing back
> correct error codes.

I'm skeptical. What's our evidence for this statement?

A bug in subprocess is possible, but I don't see it. The code is pretty direct. Our th_run_cmd calls subprocess.Popen.communicate(), which on OS X calls Popen.wait(), which calls os.waitpid(). There's a little more code in Popen._handle_exitstatus() that extracts the returncode from what waitpid returned.

Also, it seems like a weirdly specific bug to return -9. That indicates the child was killed with SIGKILL.
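[Editor's note: the decoding Popen performs is equivalent to the following sketch using the standard os wait-status macros; this is not the actual _handle_exitstatus source.]

    import os

    def returncode_from_status(status):
        # A child killed by a signal yields a negative return code; SIGKILL
        # is signal 9, so a SIGKILLed shell shows up as rc == -9.
        if os.WIFSIGNALED(status):
            return -os.WTERMSIG(status)
        return os.WEXITSTATUS(status)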

As it happens, jittests.py has two paths where we intentionally kill the child process with SIGKILL. So maybe the call is coming from inside the house? We could try generating explicit log output when we do that, and see if it correlates.
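[Editor's note: that correlation check could be as simple as logging at both kill sites before sending the signal. The helper below is hypothetical, not existing jittests.py code.]

    import os, signal, sys, time

    def kill_and_log(pid, why):
        # Note every intentional SIGKILL so a later -9 exit code can be
        # matched against the harness's own kills.
        sys.stderr.write("[harness %s] sending SIGKILL to %d (%s)\n"
                         % (time.strftime("%H:%M:%S"), pid, why))
        os.kill(pid, signal.SIGKILL)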

I am probably missing something, but with this patch, would we detect crashes on exit?
(In reply to Jason Orendorff [:jorendorff] from comment #5)
> I'm skeptical. What's our evidence for this statement?

All we know is in bug 899697.

Also, keep in mind that we're having the same problem on all three platforms but with different exit codes. On Win32 we get 0, on Mac we get -9, on Linux we get 1.

> I am probably missing something, but with this patch, would we detect crashes on exit?

We won't detect crashes in tests that are annotated to generate errors. Otherwise we will detect crashes (in check_output).
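[Editor's note: a paraphrase of that distinction after the patch; this is not the actual jittests.py source, and expect_status is a hypothetical stand-in for however the harness encodes the expected exit code.]

    def check_output(out, err, rc, test):
        if test.expect_error:
            # rc is deliberately not checked here (bug 899697 / bug 922943),
            # so a crash in an expect_error test would go unnoticed.
            return True
        # Unannotated tests still compare the return code, so crashes there
        # (a negative rc, e.g. -9 for SIGKILL) are still reported.
        return rc == getattr(test, "expect_status", 0)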
(In reply to Jason Orendorff [:jorendorff] from comment #5)
> (In reply to Terrence Cole [:terrence] from comment #3)
> > I agree. Multiprocessing seems to just be incompetent at handing back
> > correct error codes.
> 
> I'm skeptical. What's our evidence for this statement?

Just intuition. |multiprocessing| is not the ugliest code I've ever read, but I would definitely not base any truly important project on it.
 
> A bug in subprocess is possible, but I don't see it. The code is pretty
> direct. Our th_run_cmd calls subprocess.Popen.communicate(), which on OS X
> calls Popen.wait(), which calls os.waitpid(). There's a little more code in
> Popen._handle_exitstatus() that extracts the returncode from what waitpid
> returned.

Willing to bet on that? We've already found one bug with our usage of subprocess[1]. Still, I don't think the bug we're seeing is here. I think it's much more likely something to do with multiprocessing, probably something similar to [1] but on a grander scale.

1 - https://bugzilla.mozilla.org/show_bug.cgi?id=725500#c0

> As it happens, jittests.py has two paths where we intentionally kill the
> child process with SIGKILL. So maybe the call is coming from inside the
> house?

Almost certainly. However, keep in mind that the signal is generated inside a child process controlled by multiprocessing. We are relying on its machinery to not break if this signal gets delivered at a weird time. I think this is the most likely culprit.
> We won't detect crashes in tests that are annotated to generate errors.

OK, let's do it and reopen bug 776043.

Terrence, your comment in bug 725500 is epic.
https://hg.mozilla.org/mozilla-central/rev/1328ae9ac634
Assignee: nobody → wmccloskey
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla27
So does the existence of https://tbpl.mozilla.org/php/getParsedLog.php?id=28713265&tree=Mozilla-Inbound after that landed on inbound mean that the completely opaque failure was not due to returning -9, but was instead something else completely unreported?

Instead of rewriting the harness to actually report what failed on failures, would it be easier to rewrite it like the parallel xpcshell harness, to just swallow failures the first time around and rerun those tests at the end, and only report a failure if they failed again?
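[Editor's note: a minimal sketch of that rerun-before-reporting approach, with hypothetical helper names; the parallel xpcshell harness itself is more elaborate.]

    def run_suite_with_retry(tests, run_one):
        # First pass: run everything, remember what failed, report nothing yet.
        first_failures = [t for t in tests if not run_one(t)]
        # Second pass: rerun only the failures; a test is reported as failed
        # only if it fails both times.
        return [t for t in first_failures if not run_one(t)]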
(In reply to Phil Ringnalda (:philor) from comment #11)
> So does the existence of
> https://tbpl.mozilla.org/php/getParsedLog.php?id=28713265&tree=Mozilla-
> Inbound after that landed on inbound mean that the completely opaque failure
> was not due to returning -9, but was instead something else completely
> unreported?

Oh crud. I guess the tests that were failing weren't actually annotated as failures in the first place, so my patch had no effect.

However, we've been tracking bad return codes from normal tests for quite a while. Why did this just start happening? Perhaps there's an actual regression in the VM that's causing this? It looks like it only started happening around September 10-15 or so.
I backed out my change since it doesn't help at all.
https://hg.mozilla.org/integration/mozilla-inbound/rev/b2b511b83cc0

I have a Mac on my desk that I don't use much. I'll see how hard this is to reproduce there. Maybe there's some hope of finding a regression range or figuring out the problem.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Blocks: 923518
Terrence and I now suspect this is real, not an artifact of Python flakiness. I expect it'll be hard to track down.

Separately: We need to be able to trust the test harness. Terrence thinks multiprocessing is the main culprit, but I doubt ripping that out will solve the actual problem we face, which is intermittent failures being *blamed* on Python. It might be best to rewrite the js-engine-spawning portion of the harness in C (keeping Python for all the rest -- manifest creation, command line construction, output parsing -- things we know it does well). I'm not totally sure, though, and I don't plan to work on it right away.
Unassigning myself from old bugs. Don't worry, not quitting.
Assignee: wmccloskey → nobody
Severity: normal → S3