Closed Bug 620101 Opened 9 years ago Closed 9 years ago

Intermittent failing in test_nsIProcess.js | TEST-UNEXPECTED-FAIL | Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIProcess.run]

Categories

(Core :: XPCOM, defect, major)

defect
Not set
major

Tracking

()

RESOLVED FIXED
mozilla2.0b10

People

(Reporter: standard8, Assigned: standard8)

References

Details

(Keywords: intermittent-failure)

Attachments

(1 file)

All our new VMs (-07 to -10) are failing in xpcshell-tests:

TEST-UNEXPECTED-FAIL | /buildbot/comm-central-trunk-linux-opt-unittest-xpcshell/build/xpcshell/tests/xpcom/tests/unit/test_nsIProcess.js | test failed (with xpcshell return code: 0), see following log:
>>>>>>>
TEST-INFO | (xpcshell/head.js) | test 1 pending
TEST-PASS | /buildbot/comm-central-trunk-linux-opt-unittest-xpcshell/build/xpcshell/tests/xpcom/tests/unit/test_nsIProcess.js | [test_kill : 81] false == false
TEST-PASS | /buildbot/comm-central-trunk-linux-opt-unittest-xpcshell/build/xpcshell/tests/xpcom/tests/unit/test_nsIProcess.js | [test_kill : 91] true == true
TEST-PASS | /buildbot/comm-central-trunk-linux-opt-unittest-xpcshell/build/xpcshell/tests/xpcom/tests/unit/test_nsIProcess.js | [test_kill : 95] false == false
TEST-PASS | /buildbot/comm-central-trunk-linux-opt-unittest-xpcshell/build/xpcshell/tests/xpcom/tests/unit/test_nsIProcess.js | [test_quick : 117] 42 == 42
TEST-PASS | /buildbot/comm-central-trunk-linux-opt-unittest-xpcshell/build/xpcshell/tests/xpcom/tests/unit/test_nsIProcess.js | [test_args : 131] 0 == 0
argv[1] = Møzîllå; expected = Møzîllå
argv[2] = Мозилла; expected = Мозилла
argv[3] = মোজিলা; expected = মোজিলা
argv[4] = 
Missed some copy/paste:

TEST-PASS | /buildbot/comm-central-trunk-linux-opt-unittest-xpcshell/build/xpcshell/tests/xpcom/tests/unit/test_nsIProcess.js | [test_args : 131] 0 == 0
TEST-INFO | (xpcshell/head.js) | test 2 pending
TEST-UNEXPECTED-FAIL | (xpcshell/head.js) | [Exception... "Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIProcess.run]"  nsresult: "0x80004005 (NS_ERROR_FAILURE)"  location: "JS frame :: /buildbot/comm-central-trunk-linux-opt-unittest-xpcshell/build/xpcshell/tests/xpcom/tests/unit/test_nsIProcess.js :: test_kill_2 :: line 244"  data: no]
<<<<<<<
Summary: All new linux VMs failing in test_nsIProcess.js → All new linux VMs failing in test_nsIProcess.js | TEST-UNEXPECTED-FAIL | Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIProcess.run]
Whiteboard: [orange]
Assignee: nobody → gozer
Looks like the only sources of this could be failing when calling one of these:

- PR_CreateProcess
- PR_CreateThread

And, given that test starts 1,000 processes/threads, something is hitting a wall.

I've checked the obvious system limits and such, and they are all the same on both the old and the new systems.
I've spent some time today debugging this with momo-vm-linux-10 on try server.

So far I've discovered that it happens typically around the 475 ish mark (give or take 20) of creating processes/threads.

I've also just discovered that it looks to be PR_CreateThread that is failing (as opposed to PR_CreateProcess).

Now I know that, I should be able to get a bit more debug in the next try server run.
Ok, more debug. PR_CreateThread is failing with error PR_INSUFFICIENT_RESOURCES_ERROR.

The OS error number is 12 which from looking at asm-generic/errno-base.h implies out of memory.

This is kinda strange unless we've got some limit on the process itself?

What is also interesting, is if I run just the test_nsIProcess.js by itself, then it passes, if I run it alongside the other tests, then it fails. Seems consistent like that.
Compared ulimits and sysctls on an old and a new system, they are the same. Total memory is 3G on both systems, and nothing strange is taking up a lot of it.
Food for thought. All these new linux VMs are running on a new ESXi host, so that could be the common cause. However, momo-vm-linux64-06 is /also/ running there without apparent problems.
momo-vm-linux-11 is now up and running with the same exact configuration as the other new builders, but is running on a different VMWare host. Let's watch it and see.
Getting interesting. momo-vm-linux-11 completed green, and is exactly the same machine image as the others

http://build.mozillamessaging.com/buildbot/production/builders/Linux%20comm-central%20test%20xpcshell/builds/198

Culprit is starting to smell like VMWare itself. Hard to imagine why, but needs investigating further.
For added confusion, when running the test by itself, it passes.
When running the test via direct invocation of xpcshell, it passes too.
However, when running the test via direct invocation of xpcshell from inside a gdb session, it fails.
Just realized there is an assumption we are making here that is incorrect.

buildbot-master $> bzgrep "TEST-UNEXPECTED-FAIL.*test_nsIProcess.js" /buildbot/production/buildbot-configs/thunderbird/comm-central-trunk-linux-opt-unittest-xpcshell/*-log-xpcshell-stdio.bz2 | grep test_nsIProcess.js | cut -d: -f1 | sort | uniq | xargs bzgrep HOSTNAME= | cut -d= -f2 | sort | uniq -c | sort -k1 -rn

     28 momo-vm-linux-08.sj.mozillamessaging.com
     20 momo-vm-linux-09.sj.mozillamessaging.com
     17 momo-vm-linux-07.sj.mozillamessaging.com
     15 momo-vm-07.sj.mozillamessaging.com
     14 momo-vm-linux-11.sj.mozillamessaging.com
     13 momo-vm-linux-05.sj.mozillamessaging.com
     13 momo-vm-linux-04.sj.mozillamessaging.com
     12 momo-vm-12.sj.mozillamessaging.com
     11 momo-vm-linux-02.sj.mozillamessaging.com
     11 momo-vm-02.sj.mozillamessaging.com
     10 momo-vm-linux-06.sj.mozillamessaging.com
      8 momo-vm-linux-03.sj.mozillamessaging.com
      4 momo-vm-fedora12-01.sj.mozillamessaging.com

So this problem is *also* happening on our other VMs, so it's not something specific about our new build VMs. I suspect buildbot is just feeding more xpcshell runs to them, since they came new and pick them up first.
To run the tests on the Fedora host without going through buildbot, use this script: http://pastebin.mozilla.org/910356 .  If you limit the contents of all-test-dirs.list to "xpcom/tests/unit" this is a quick 5 second way to reproduce the failure.

Anyway, the test is based on bug 543441.  The last comment contains a link to two submitted changes.

So if you look at the patch in http://hg.mozilla.org/mozilla-central/rev/46c6105f5429 and compare it with the error we get from the test: "Component returned failure code: 0x80004005 (NS_ERROR_FAILURE)", it seems the code in the patch is doing its job, does it not?

Perhaps we should soak that exception message (NS_ERROR_FAILURE) so the test only fails if there is a different exception or an actual crash.  The crash is what bug 543441 was all about and is what the test should be trying to expose.
Summary: All new linux VMs failing in test_nsIProcess.js | TEST-UNEXPECTED-FAIL | Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIProcess.run] → Intermittent failing in test_nsIProcess.js | TEST-UNEXPECTED-FAIL | Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIProcess.run]
http://tinderbox.mozilla.org/showlog.cgi?log=Thunderbird3.1/1294251562.1294253397.32263.gz
Linux x86-64 comm-1.9.2 check on 2011/01/05 10:19:22
s: momo-vm-linux64-06
{
TEST-PASS | /buildbot/linux64-comm-1.9.2-check/build/objdir/mozilla/_tests/xpcshell/xpcom/unit/test_nsIMutableArray.js | test passed
TEST-UNEXPECTED-FAIL | /buildbot/linux64-comm-1.9.2-check/build/objdir/mozilla/_tests/xpcshell/xpcom/unit/test_nsIProcess.js | test failed (with xpcshell return code: 0), see following log:
  >>>>>>>
  ### XPCOM_MEM_LEAK_LOG defined -- logging leaks to /tmp/tmpGxvSJm/runxpcshelltests_leaks.log
TEST-INFO | (xpcshell/head.js) | test 1 pending
TEST-PASS | /buildbot/linux64-comm-1.9.2-check/build/objdir/mozilla/_tests/xpcshell/xpcom/unit/test_nsIProcess.js | [test_kill : 76] false == false
TEST-PASS | /buildbot/linux64-comm-1.9.2-check/build/objdir/mozilla/_tests/xpcshell/xpcom/unit/test_nsIProcess.js | [test_kill : 86] true == true
TEST-UNEXPECTED-FAIL | (xpcshell/head.js) | [Exception... "Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIProcess.kill]"  nsresult: "0x80004005 (NS_ERROR_FAILURE)"  location: "JS frame :: /buildbot/linux64-comm-1.9.2-check/build/objdir/mozilla/_tests/xpcshell/xpcom/unit/test_nsIProcess.js :: test_kill :: line 88"  data: no]
  <<<<<<<
}
Depends on: 493778
Serge, this is pretty much perma-fail at the moment, please don't add comments about the failures, we're actively working on fixing them.
Tweaked /etc/security/limits.d/90-nproc.conf on momo-vm-fedora12-02 and bumped
nproc to 2048 to see if it helps.
Assignee: gozer → bugzilla
Blocks: 543441
Component: Release Engineering → XPCOM
No longer depends on: 493778
Product: Mozilla Messaging → Core
QA Contact: release → xpcom
Version: other → Trunk
After a bit of debugging we've found the issue.

The changes to the unit test done in bug 543441 add a big for loop that basically does fork and exec. The way nsProcessCommon::Monitor works is to post an event to the main thread to re-join a thread for a process that has finished.

However, it appears that in some instances the big for loop saturates the main thread and doesn't allow events to run, therefore the threads don't get re-joined.

In at least Thunderbird's case this is causing out of memory/resource style issues where new threads can't be created because we're at limits.

The simple solution is to allow the main thread to process events regularly and hence tidy up the threads and not take all the system resources.


This issues is currently causing Thunderbird's Linux 32 boxes to be permanently orange, why it doesn't affect others I'm not too sure, but blocking the main thread from processing events isn't really a good idea anyway.
Attachment #502922 - Flags: review?(mh+mozilla)
Comment on attachment 502922 [details] [diff] [review]
Allow other events to run on main thread
[Checked in: See comment 22]

I'm wondering, why don't you consume pending events after each fork/exec ?
Blocks: 438871
It's more efficient to consume several messages in one call rather than only one message per call.
Attachment #502922 - Flags: review?(mh+mozilla) → review+
(In reply to comment #16)
> Comment on attachment 502922 [details] [diff] [review]
> Allow other events to run on main thread
> 
> I'm wondering, why don't you consume pending events after each fork/exec ?

I don't think Firefox does this either - I did a test on MozillaTry with some debug and I'm pretty sure the it didn't process those pending events.

I believe the main reason that FF doesn't see this failure is that TestQuickReturn is relatively slow and doesn't complete before process.kill() is called, and so the kill succeeds at which stage it does a PR_JoinThread. If TestQuickReturn is very quick (we tried it with a shell script) then kill fails (because the process is already dead) and doesn't do the join at that stage but waits for the gecko event loop.

Whilst a normal app might allow event posting once separating threads, I think there is a separate gecko level event handler. That doesn't get a look in (I see nothing in the PR* code or the nsProcess code relating to allowing threads to run) without explicit help.

For example, see this case for Modal windows, where we keep spinning the event loop within the modal code: http://mxr.mozilla.org/comm-central/source/mozilla/xpfe/appshell/src/nsXULWindow.cpp#415
Note that the sole reason for the loop of fork/exec in this test is to have chances to catch the race in nsProcess that was fixed in bug 543441.
Yes, it should still do that because the new code is after the kill and the run -> kill step is what we are testing.
(In reply to comment #16)
> Comment on attachment 502922 [details] [diff] [review]
> Allow other events to run on main thread
> 
> I'm wondering, why don't you consume pending events after each fork/exec ?

I read this wrong when you first commented. Yes, we should be consuming those events each time round the for loop. I'd transposed the patch wrong from when we tried the hacked up version on the builders.

This explains why the Thunderbird tree didn't go green yesterday when I pushed the fix.

I'll move it into the for loop and push that as a follow-up.
All fixed now :-) :

http://hg.mozilla.org/mozilla-central/rev/bb4bbfbb6074
http://hg.mozilla.org/mozilla-central/rev/625c08180421
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Attachment #502922 - Attachment description: Allow other events to run on main thread → Allow other events to run on main thread [Checked in: See comment 22]
Flags: in-testsuite+
Target Milestone: --- → mozilla2.0b10
(In reply to comment #23)
> mbrubeck%mozilla.com
> http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla-Aurora/1305062068.
> 1305064280.24657.gz

Mis-starred; this was actually bug 582821.
Whiteboard: [orange]
You need to log in before you can comment on or make changes to this bug.