Closed Bug 924253 Opened 11 years ago Closed 11 years ago

Multiple threads race on output in runtests.py

Categories

(Testing :: Mozbase, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
mozilla29

People

(Reporter: billm, Assigned: wlach)

References

Details

Attachments

(4 files, 1 obsolete file)

First, here are the symptoms of the problem. I'm running mochitests locally on my Linux machine with a debug build. From the objdir, I'll do something like this:

./_virtualenv/bin/python ./_tests/testing/mochitest/runtests.py --browser-chrome --test-path=<blah>

The output looks mostly normal, but occasionally, it will be missing the beginning of a line. Here's an excerpt:

++DOCSHELL 0x7f30aef85c00 == 5 [id = 5]
++DOMWINDOW == 8 (0x7f30aefa28b8) [serial = 8] [outer = (nil)]
WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80040111: file /home/billm/mozilla/in2/content/base/src/nsFrameLoader.cpp, line 401
+DOMWINDOW == 9 (0x7f30a9981cb8) [serial = 9] [outer = 0x7f30aefa28b8]
ARNING: NS_ENSURE_TRUE(NS_SUCCEEDED(rv) && subjPrincipal) failed: file /home/billm/mozilla/in2/docshell/base/nsDocShell.cpp, line 8472
ARNING: Subdocument container has no frame: file /home/billm/mozilla/in2/layout/base/nsDocumentViewer.cpp, line 2339
++DOMWINDOW == 10 (0x7f30aef80cb8) [serial = 10] [outer = 0x7f30ab680cb8]
WARNING: Subdocument container has no frame: file /home/billm/mozilla/in2/layout/base/nsDocumentViewer.cpp, line 2339

Notice that some of the lines are missing the first 'W' in WARNING, and the ++DOMWINDOW == 9 line is missing the first '+'.

Also, I've tried this in mach and it still happens.

As far as I can tell, the problem is that there are two threads waiting on the same file descriptor, and they race to read data out of it. Here are the stacks for when they're started:

STACK 1

  File "./_tests/testing/mochitest/runtests.py", line 1365, in <module>
  File "./_tests/testing/mochitest/runtests.py", line 1362, in main
  File "./_tests/testing/mochitest/runtests.py", line 1005, in runTests
  File "./_tests/testing/mochitest/runtests.py", line 866, in runApp
  File "/home/billm/mozilla/in2/testing/mozbase/mozrunner/mozrunner/local.py", line 190, in start
    self.process_handler.run(timeout, outputTimeout)
  File "/home/billm/mozilla/in2/testing/mozbase/mozprocess/mozprocess/processhandler.py", line 624, in run
    self.processOutput(timeout=timeout, outputTimeout=outputTimeout)
  File "/home/billm/mozilla/in2/testing/mozbase/mozprocess/mozprocess/processhandler.py", line 687, in processOutput
    traceback.print_stack()

STACK 2

  File "./_tests/testing/mochitest/runtests.py", line 1365, in <module>
  File "./_tests/testing/mochitest/runtests.py", line 1362, in main
  File "./_tests/testing/mochitest/runtests.py", line 1005, in runTests
  File "./_tests/testing/mochitest/runtests.py", line 811, in runApp
  File "/home/billm/mozilla/in2/testing/mozbase/mozprocess/mozprocess/processhandler.py", line 624, in run
    self.processOutput(timeout=timeout, outputTimeout=outputTimeout)
  File "/home/billm/mozilla/in2/testing/mozbase/mozprocess/mozprocess/processhandler.py", line 687, in processOutput
    traceback.print_stack()

It looks like we have one thread from the runtests.py process (STACK 1) that's trying to read Firefox's output. In addition, STACK 2 is a Perl script that's started to process Linux stack traces. It reads from the same fd as STACK 1. It seems like we probably only want one of these around at any time.

I changed the stackFixer method in runtests.py to always return (None, None) so that the Perl script isn't started. That fixes the problem. I'm not sure what the preferred way would be to fix this problem in general though.
I'll take a look at this tomorrow. Hopefully get a better idea on what is needed to fix.
Assignee: nobody → jhammel
Blocks: 746243
Attached patch testrunner-fixSplinter Review
Here is the patch I've been using to make mochitests work again in case it helps anyone.
See Also: → 937684
Attached patch bug-924253.diff (obsolete) — Splinter Review
This is a patch against https://github.com/mozilla/mozbase that illustrates a failure:

(mozbase)│python test_plumbing.py 
F
======================================================================
FAIL: test_pipe (__main__.TestPlumbing)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_plumbing.py", line 52, in test_pipe
    self.assertEqual(len(results), len(pipe.output))
AssertionError: 11000 != 2164

----------------------------------------------------------------------
Ran 1 test in 0.087s

FAILED (failures=1)
With additional instrumentation:

(mozbase)│python test_plumbing.py 
Difference at line 859:
Actual:
EIGHT FIVUR SIX THREE SIX
Should be:
EIGHT FIVE NINE
F
======================================================================
FAIL: test_pipe (__main__.TestPlumbing)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_plumbing.py", line 63, in test_pipe
    self.assertEqual(len(results), len(pipe.output))
AssertionError: 11000 != 1475

----------------------------------------------------------------------
Ran 1 test in 0.091s

FAILED (failures=1)
It looks like we skip a bunch of lines:

1096 - ONE ZERO NINE SEVEN
1097 - ONE ZERO NINE EIGHT
1098 - ONE ZERO NINE NINE
1099 - ONE ONE ZERO ZERO
1100 - ONE ONE ZERO ONE
1101 - ONE FOUR
1102 - TWO ONE EIGHT FIVE
1103 - TWO ONE EIGHT SIX
1104 - TWO ONE EIGHT SEVEN
1105 - TWO ONE EIGHT EIGHT
1106 - TWO ONE EIGHT NINE
1107 - TWO ONE NINE ZERO
1108 - TWO ONE NINE ONE
1109 - TWO ONE NINE TWO
1110 - TWO ONE NINE THREE
Attached patch bug-924253.diffSplinter Review
The essential problem is that the mozprocess API and the subprocess API (
http://docs.python.org/2/library/subprocess.html ) do not interact well to make pipes.  Reading the subprocess documents, it should be possible to use a file-like object for stdin (see e.g. http://stackoverflow.com/questions/163542/python-how-do-i-pass-a-string-into-subprocess-popen-using-the-stdin-argument and http://stackoverflow.com/questions/1421311/feeding-a-string-into-popen ).  I've played some with this but haven't gotten it working with mozprocess; I suspect it is do to how we handle stdout, but haven't figured out precisely what.  I suspect there is a way to implement a buffer for mozprocess; when doing so, its probably good to document/make convenient to do.
Attachment #8335732 - Attachment is obsolete: true
One thing that could be done is to use shell=True and piping the results via the good ole pipe : | . Not my favorite solution, but would be quick.  Thoughts?
Flags: needinfo?(jgriffin)
I'm totally fine with using that as a short-term solution if it fixes these problems, although it would probably be good longer-term to find a better fix.
Flags: needinfo?(jgriffin)
I've run into this problem too. It manifested as whole lines occasionally disappearing from the test output. When these lines were from calls to dump() that I was using for debugging, it was extremely confusing. billm's patch from comment 3 fixes it.
I just wasted most of a day (and a very frustrating one!) chasing shadows because NSPR logging messages were randomly being dropped due to this bug :-(. Bill's patch fixed things up for me.

This bug is critically important. Please fix it.
This has been broken for 2 months and caused problems for at least 4 developers. Can we adopt Bill's simple workaround while waiting for a correct fix?
Flags: needinfo?(jhammel)
Flags: needinfo?(jgriffin)
If nothing happens here in the next couple of days I'll escalate as far as necessary.
Apparently Jeff no longer works for Mozilla.
I had a look at this over the evening on the train back from my parent's this evening. A few observations:

1. Bill's hack essentially just disables all stack processing. I don't think that can go in.
2. Fixing mozprocess to work for this case is complicated, as Jeff observed above. Given what else I found (described below), I'm not sure if it's worth the energy.
3. Shelling out and piping the firefox process through fix-linux-stack.pl (as suggested in comment 8), would involve some pretty extensive surgery to mozrunner for the purposes of a hack. It's not obviously less complicated to me than doing (2).
4. Really, I think the ideal solution would be to port fix-linux-stack.pl to Python and just run it as part of the mochitest process. I don't think that would be that hard...
5. ... but the easiest possible thing to do would seem to be to spawn a fix-linux-stack.pl as a subprocess and slowly feed it the output of the Firefox process as it is read by mochitest. This is actually less complicated than what we're doing right now, from what I can tell and should fix the issue. I have a proof of concept test program that does this (works fine), which I am in the process of translated over to runtests.py (unfortunately I can't test now, but will tomorrow).
Flags: needinfo?(jhammel)
Flags: needinfo?(jgriffin)
I've previously ported the OSX stack fixing script (which is written in Python) to work on Linux, though I think it wasn't as fancy as the real Linux one.  I might be able to find it around somewhere.
I *think* this patch should do the trick. I have verified (via strace) it works only insofar as fix-linux-stack.pl does what I think it was supposed to do before: read lines of output from the firefox process, then re-output them (with mangling depending on the line's content) for later consumption by mochitest.

Advice on whether my approach is sane and/or how to test the specific case where fix-linux-stack.pl is actually needed appreciated. I'm not sure who would know enough to give advice here other than :ted.
Assignee: jhammel → wlachance
Attachment #8344453 - Flags: feedback?(ted)
(In reply to William Lachance (:wlach) from comment #18)
> Advice on [...] how to test the specific case
> where fix-linux-stack.pl is actually needed appreciated.

IIRC this is used for assertion stacks, so running a reftest with asserts in the reftest.list file should test this.
Yes, so e.g. "./mach reftest layout/generic/crashtests/crashtests.list"
(In reply to Karl Tomlinson (:karlt) from comment #19)
> (In reply to William Lachance (:wlach) from comment #18)
> > Advice on [...] how to test the specific case
> > where fix-linux-stack.pl is actually needed appreciated.
> 
> IIRC this is used for assertion stacks, so running a reftest with asserts in
> the reftest.list file should test this.

Hmm, I see. reftest doesn't use runtests.py though.
Reftest also doesn't use mozprocess yet, so I don't believe this bug exists there yet (not that I've tested to make sure).
Not sure if it matters but Mozharness uses mozprocess on "runreftest.py" itself:
http://mxr.mozilla.org/build/source/mozharness/scripts/desktop_unittest.py#412
http://mxr.mozilla.org/build/source/mozharness/mozharness/base/script.py#675

But not sure how that would effect things once this process spawns in runApp():
http://mxr.mozilla.org/mozilla-central/source/layout/tools/reftest/runreftest.py#147
I don't think that will be a problem since this is a bug in a specific way we're configuring processes with mozprocess (specifically piping to a separate process running perl).
(In reply to William Lachance (:wlach) from comment #21)
> Hmm, I see. reftest doesn't use runtests.py though.

Oops yeah. Sorry. Running mochitests in layout/generic/test should trigger assertions.
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #25)
> (In reply to William Lachance (:wlach) from comment #21)
> > Hmm, I see. reftest doesn't use runtests.py though.
> 
> Oops yeah. Sorry. Running mochitests in layout/generic/test should trigger
> assertions.

Ok cool, running mochitest that way that I see lines like this:

 0:19.06 UNKNOWN [/home/wlach/src/mozilla-central-linux/obj-x86_64-unknown-linux-gnu/dist/bin/libxul.so +0x020f46ce]
 0:19.06 UNKNOWN [/home/wlach/src/mozilla-central-linux/obj-x86_64-unknown-linux-gnu/dist/bin/libxul.so +0x02077c92]
 0:24.41 XREMain::XRE_main(int, char**, nsXREAppData const*) (/home/wlach/src/mozilla-central-linux/toolkit/xre/nsAppRunner.cpp:4046)
 0:24.41 XRE_main+0x000000cf [/home/wlach/src/mozilla-central-linux/obj-x86_64-unknown-linux-gnu/dist/bin/libxul.so +0x0207836e]

Which seem identical to what we had before.
Comment on attachment 8344453 [details] [diff] [review]
0001-Bug-924253-Make-perl-stack-fixer-process-to-not-read.patch

Feeling confident enough in the solution now to ask for a review.
Attachment #8344453 - Flags: feedback?(ted) → review?(ted)
(In reply to William Lachance (:wlach) from comment #26)
>  0:19.06 UNKNOWN
> [/home/wlach/src/mozilla-central-linux/obj-x86_64-unknown-linux-gnu/dist/bin/
> libxul.so +0x020f46ce]

We should be able to do better than UNKNOWN, but that is a different bug
- bug 939610.
Comment on attachment 8344453 [details] [diff] [review]
0001-Bug-924253-Make-perl-stack-fixer-process-to-not-read.patch

Review of attachment 8344453 [details] [diff] [review]:
-----------------------------------------------------------------

This seems pretty sensible. We should file a followup on getting that ported to Python so we can ditch this ugliness.
Attachment #8344453 - Flags: review?(ted) → review+
Filed bug 948718 to deal with porting fix-linux-stack.pl to python.
https://hg.mozilla.org/mozilla-central/rev/9bbd931da6d9
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla29
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: