Open Bug 922943 Opened 11 years ago Updated 2 years ago

Do something about jit-tests failing on OS X with exit code -9

Categories

(Core :: JavaScript Engine, defect)

x86_64
macOS
defect

Tracking


REOPENED
mozilla27

People

(Reporter: philor, Unassigned)

References

Details

Attachments

(1 file)

As I demonstrated with Windows bugs before bug 899697 allowed them to return either 0 or 3, I have a high tolerance for filing pointless intermittent-failure bugs, but I've reached my limits (rather more quickly) with Mac -9 failures. Pls fix, thx.
Blocks: 916648
Blocks: 916872
Blocks: 917008
For extra credit, though it's considerably less frequent than either Windows was or OS X is, allow Linux to return 1.
Blocks: 922947
Attached patch test-return-code (Splinter Review)
Let's just drop the return code checking. We lived without it just fine for years. I don't think anyone wants to spend hours tracking down this problem. I'm guessing it's some weird bug with the test harness (or in the Python libs).
Attachment #813209 - Flags: review?(terrence)
Comment on attachment 813209 [details] [diff] [review]
test-return-code

Review of attachment 813209 [details] [diff] [review]:
-----------------------------------------------------------------

I agree. Multiprocessing seems to just be incompetent at handing back correct error codes. I expect that this will just keep plaguing us until we move to using os.exec directly like jsreftests does.
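[Editor's note: for reference, the "os.exec directly" route mentioned above could look roughly like the following. This is a minimal sketch, not the actual jsreftests code; the function name is hypothetical.]

    import os

    def run_shell(argv):
        # Sketch: fork, exec the JS shell in the child, and read the wait
        # status straight from the kernel with no intermediate Python layers.
        pid = os.fork()
        if pid == 0:
            try:
                os.execvp(argv[0], argv)  # on success this never returns
            except OSError:
                os._exit(127)             # exec itself failed
        _, status = os.waitpid(pid, 0)
        if os.WIFSIGNALED(status):
            return -os.WTERMSIG(status)
        return os.WEXITSTATUS(status)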

Thanks for taking this, Bill! r=me

::: js/src/tests/lib/jittests.py
@@ +323,5 @@
>  
>  def check_output(out, err, rc, test):
>      if test.expect_error:
> +        # We'd like to check the return code here, but doing so seems
> +        # to return incorrect values occasionally. See bug 899697.

Please add bug 922943 to this comment.
Attachment #813209 - Flags: review?(terrence) → review+
I only glanced at the code, but it seems like multiprocessing isn't used to deal with return codes at all. The shell is spawned by subprocess.Popen using some complicated code to handle timeouts. That seems more likely to be the cause here.
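[Editor's note: a paraphrase of that spawn-with-timeout pattern, not the actual th_run_cmd code. It shows how a harness-initiated kill on timeout surfaces later as a -9 return code.]

    import subprocess, threading

    def run_cmd_with_timeout(cmd, timeout):
        # Spawn the shell, arm a watchdog, and SIGKILL the child if it runs
        # too long.  When the watchdog fires, waitpid reports the child as
        # signaled and Popen turns that into returncode == -9 (SIGKILL).
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE)
        killer = threading.Timer(timeout, proc.kill)
        killer.start()
        try:
            out, err = proc.communicate()
        finally:
            killer.cancel()
        return out, err, proc.returncode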
(In reply to Terrence Cole [:terrence] from comment #3)
> I agree. Multiprocessing seems to just be incompetent at handing back
> correct error codes.

I'm skeptical. What's our evidence for this statement?

A bug in subprocess is possible, but I don't see it. The code is pretty direct. Our th_run_cmd calls subprocess.Popen.communicate(), which on OS X calls Popen.wait(), which calls os.waitpid(). There's a little more code in Popen._handle_exitstatus() that extracts the returncode from what waitpid returned.

Also, it seems like a weirdly specific bug to return -9. That indicates the child was killed with SIGKILL.
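[Editor's note: the decoding Popen performs is equivalent to the following sketch using the standard os wait-status macros; this is not the actual _handle_exitstatus source.]

    import os

    def returncode_from_status(status):
        # A child killed by a signal yields a negative return code; SIGKILL
        # is signal 9, so a SIGKILLed shell shows up as rc == -9.
        if os.WIFSIGNALED(status):
            return -os.WTERMSIG(status)
        return os.WEXITSTATUS(status)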

As it happens, jittests.py has two paths where we intentionally kill the child process with SIGKILL. So maybe the call is coming from inside the house? We could try generating explicit log output when we do that, and see if it correlates.
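[Editor's note: that correlation check could be as simple as logging at both kill sites before sending the signal. The helper below is hypothetical, not existing jittests.py code.]

    import os, signal, sys, time

    def kill_and_log(pid, why):
        # Note every intentional SIGKILL so a later -9 exit code can be
        # matched against the harness's own kills.
        sys.stderr.write("[harness %s] sending SIGKILL to %d (%s)\n"
                         % (time.strftime("%H:%M:%S"), pid, why))
        os.kill(pid, signal.SIGKILL)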

I am probably missing something, but with this patch, would we detect crashes on exit?
(In reply to Jason Orendorff [:jorendorff] from comment #5)
> I'm skeptical. What's our evidence for this statement?

All we know is in bug 899697.

Also, keep in mind that we're having the same problem on all three platforms but with different exit codes. On Win32 we get 0, on Mac we get -9, on Linux we get 1.

> I am probably missing something, but with this patch, would we detect crashes on exit?

We won't detect crashes in tests that are annotated to generate errors. Otherwise we will detect crashes (in check_output).
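[Editor's note: a paraphrase of that distinction after the patch; this is not the actual jittests.py source, and expect_status is a hypothetical stand-in for however the harness encodes the expected exit code.]

    def check_output(out, err, rc, test):
        if test.expect_error:
            # rc is deliberately not checked here (bug 899697 / bug 922943),
            # so a crash in an expect_error test would go unnoticed.
            return True
        # Unannotated tests still compare the return code, so crashes there
        # (a negative rc, e.g. -9 for SIGKILL) are still reported.
        return rc == getattr(test, "expect_status", 0)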
(In reply to Jason Orendorff [:jorendorff] from comment #5)
> (In reply to Terrence Cole [:terrence] from comment #3)
> > I agree. Multiprocessing seems to just be incompetent at handing back
> > correct error codes.
> 
> I'm skeptical. What's our evidence for this statement?

Just intuition. |multiprocessing| is not the ugliest code I've ever read, but I would definitely not base any truly important project on it.
 
> A bug in subprocess is possible, but I don't see it. The code is pretty
> direct. Our th_run_cmd calls subprocess.Popen.communicate(), which on OS X
> calls Popen.wait(), which calls os.waitpid(). There's a little more code in
> Popen._handle_exitstatus() that extracts the returncode from what waitpid
> returned.

Willing to bet on that? We've already found one bug with our usage of subprocess[1]. Still, I don't think the bug we're seeing is here. I think it's much more likely something to do with multiprocessing, probably something similar to [1] but on a grander scale.

1 - https://bugzilla.mozilla.org/show_bug.cgi?id=725500#c0

> As it happens, jittests.py has two paths where we intentionally kill the
> child process with SIGKILL. So maybe the call is coming from inside the
> house?

Almost certainly. However, keep in mind that the signal is generated inside a child process controlled by multiprocessing. We are relying on its machinery to not break if this signal gets delivered at a weird time. I think this is the most likely culprit.
> We won't detect crashes in tests that are annotated to generate errors.

OK, let's do it and reopen bug 776043.

Terrence, your comment in bug 725500 is epic.
https://hg.mozilla.org/mozilla-central/rev/1328ae9ac634
Assignee: nobody → wmccloskey
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla27
So does the existence of https://tbpl.mozilla.org/php/getParsedLog.php?id=28713265&tree=Mozilla-Inbound after that landed on inbound mean that the completely opaque failure was not due to returning -9, but was instead something else completely unreported?

Instead of rewriting the harness to actually report what failed on failures, would it be easier to rewrite it like the parallel xpcshell harness, to just swallow failures the first time around and rerun those tests at the end, and only report a failure if they failed again?
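[Editor's note: a minimal sketch of that rerun-before-reporting approach, with hypothetical helper names; the parallel xpcshell harness itself is more elaborate.]

    def run_suite_with_retry(tests, run_one):
        # First pass: run everything, remember what failed, report nothing yet.
        first_failures = [t for t in tests if not run_one(t)]
        # Second pass: rerun only the failures; a test is reported as failed
        # only if it fails both times.
        return [t for t in first_failures if not run_one(t)]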
(In reply to Phil Ringnalda (:philor) from comment #11)
> So does the existence of
> https://tbpl.mozilla.org/php/getParsedLog.php?id=28713265&tree=Mozilla-
> Inbound after that landed on inbound mean that the completely opaque failure
> was not due to returning -9, but was instead something else completely
> unreported?

Oh crud. I guess the tests that were failing weren't actually annotated as failures in the first place, so my patch had no effect.

However, we've been tracking bad return codes from normal tests for quite a while. Why did this just start happening? Perhaps there's an actual regression in the VM that's causing this? It looks like it only started happening around September 10-15 or so.
I backed out my change since it doesn't help at all.
https://hg.mozilla.org/integration/mozilla-inbound/rev/b2b511b83cc0

I have a Mac on my desk that I don't use much. I'll see how hard this is to reproduce there. Maybe there's some hope of finding a regression range or figuring out the problem.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Blocks: 923518
Terrence and I now suspect this is real, not an artifact of Python flakiness. I expect it'll be hard to track down.

Separately: We need to be able to trust the test harness. Terrence thinks multiprocessing is the main culprit, but I doubt ripping that out will solve the actual problem we face, which is intermittent failures being *blamed* on Python. It might be best to rewrite the js-engine-spawning portion of the harness in C (keeping Python for all the rest -- manifest creation, command line construction, output parsing -- things we know it does well). I'm not totally sure, though, and I don't plan to work on it right away.
Unassigning myself from old bugs. Don't worry, not quitting.
Assignee: wmccloskey → nobody
Severity: normal → S3