Closed
Bug 620101
Opened 14 years ago
Closed 13 years ago
Intermittent failing in test_nsIProcess.js | TEST-UNEXPECTED-FAIL | Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIProcess.run]
Categories
(Core :: XPCOM, defect)
Core
XPCOM
Tracking
()
RESOLVED
FIXED
mozilla2.0b10
People
(Reporter: standard8, Assigned: standard8)
References
Details
(Keywords: intermittent-failure)
Attachments
(1 file)
880 bytes,
patch
|
glandium
:
review+
|
Details | Diff | Splinter Review |
All our new VMs (-07 to -10) are failing in xpcshell-tests:
TEST-UNEXPECTED-FAIL | /buildbot/comm-central-trunk-linux-opt-unittest-xpcshell/build/xpcshell/tests/xpcom/tests/unit/test_nsIProcess.js | test failed (with xpcshell return code: 0), see following log:
>>>>>>>
TEST-INFO | (xpcshell/head.js) | test 1 pending
TEST-PASS | /buildbot/comm-central-trunk-linux-opt-unittest-xpcshell/build/xpcshell/tests/xpcom/tests/unit/test_nsIProcess.js | [test_kill : 81] false == false
TEST-PASS | /buildbot/comm-central-trunk-linux-opt-unittest-xpcshell/build/xpcshell/tests/xpcom/tests/unit/test_nsIProcess.js | [test_kill : 91] true == true
TEST-PASS | /buildbot/comm-central-trunk-linux-opt-unittest-xpcshell/build/xpcshell/tests/xpcom/tests/unit/test_nsIProcess.js | [test_kill : 95] false == false
TEST-PASS | /buildbot/comm-central-trunk-linux-opt-unittest-xpcshell/build/xpcshell/tests/xpcom/tests/unit/test_nsIProcess.js | [test_quick : 117] 42 == 42
TEST-PASS | /buildbot/comm-central-trunk-linux-opt-unittest-xpcshell/build/xpcshell/tests/xpcom/tests/unit/test_nsIProcess.js | [test_args : 131] 0 == 0
argv[1] = Møzîllå; expected = Møzîllå
argv[2] = Мозилла; expected = Мозилла
argv[3] = মোজিলা; expected = মোজিলা
argv[4] =
Assignee | ||
Comment 1•14 years ago
|
||
Missed some copy/paste: TEST-PASS | /buildbot/comm-central-trunk-linux-opt-unittest-xpcshell/build/xpcshell/tests/xpcom/tests/unit/test_nsIProcess.js | [test_args : 131] 0 == 0 TEST-INFO | (xpcshell/head.js) | test 2 pending TEST-UNEXPECTED-FAIL | (xpcshell/head.js) | [Exception... "Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIProcess.run]" nsresult: "0x80004005 (NS_ERROR_FAILURE)" location: "JS frame :: /buildbot/comm-central-trunk-linux-opt-unittest-xpcshell/build/xpcshell/tests/xpcom/tests/unit/test_nsIProcess.js :: test_kill_2 :: line 244" data: no] <<<<<<<
Summary: All new linux VMs failing in test_nsIProcess.js → All new linux VMs failing in test_nsIProcess.js | TEST-UNEXPECTED-FAIL | Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIProcess.run]
Whiteboard: [orange]
Updated•14 years ago
|
Assignee: nobody → gozer
Comment 2•14 years ago
|
||
Looks like the only sources of this could be failing when calling one of these: - PR_CreateProcess - PR_CreateThread And, given that test starts 1,000 processes/threads, something is hitting a wall. I've checked the obvious system limits and such, and they are all the same on both the old and the new systems.
Assignee | ||
Comment 3•14 years ago
|
||
I've spent some time today debugging this with momo-vm-linux-10 on try server. So far I've discovered that it happens typically around the 475 ish mark (give or take 20) of creating processes/threads. I've also just discovered that it looks to be PR_CreateThread that is failing (as opposed to PR_CreateProcess). Now I know that, I should be able to get a bit more debug in the next try server run.
Assignee | ||
Comment 4•14 years ago
|
||
Ok, more debug. PR_CreateThread is failing with error PR_INSUFFICIENT_RESOURCES_ERROR. The OS error number is 12 which from looking at asm-generic/errno-base.h implies out of memory. This is kinda strange unless we've got some limit on the process itself? What is also interesting, is if I run just the test_nsIProcess.js by itself, then it passes, if I run it alongside the other tests, then it fails. Seems consistent like that.
Comment 5•14 years ago
|
||
Compared ulimits and sysctls on an old and a new system, they are the same. Total memory is 3G on both systems, and nothing strange is taking up a lot of it.
Comment 6•14 years ago
|
||
Food for thought. All these new linux VMs are running on a new ESXi host, so that could be the common cause. However, momo-vm-linux64-06 is /also/ running there without apparent problems.
Comment 7•14 years ago
|
||
momo-vm-linux-11 is now up and running with the same exact configuration as the other new builders, but is running on a different VMWare host. Let's watch it and see.
Comment 8•14 years ago
|
||
Getting interesting. momo-vm-linux-11 completed green, and is exactly the same machine image as the others http://build.mozillamessaging.com/buildbot/production/builders/Linux%20comm-central%20test%20xpcshell/builds/198 Culprit is starting to smell like VMWare itself. Hard to imagine why, but needs investigating further.
Comment 9•14 years ago
|
||
For added confusion, when running the test by itself, it passes. When running the test via direct invocation of xpcshell, it passes too. However, when running the test via direct invocation of xpcshell from inside a gdb session, it fails.
Comment 10•14 years ago
|
||
Just realized there is an assumption we are making here that is incorrect. buildbot-master $> bzgrep "TEST-UNEXPECTED-FAIL.*test_nsIProcess.js" /buildbot/production/buildbot-configs/thunderbird/comm-central-trunk-linux-opt-unittest-xpcshell/*-log-xpcshell-stdio.bz2 | grep test_nsIProcess.js | cut -d: -f1 | sort | uniq | xargs bzgrep HOSTNAME= | cut -d= -f2 | sort | uniq -c | sort -k1 -rn 28 momo-vm-linux-08.sj.mozillamessaging.com 20 momo-vm-linux-09.sj.mozillamessaging.com 17 momo-vm-linux-07.sj.mozillamessaging.com 15 momo-vm-07.sj.mozillamessaging.com 14 momo-vm-linux-11.sj.mozillamessaging.com 13 momo-vm-linux-05.sj.mozillamessaging.com 13 momo-vm-linux-04.sj.mozillamessaging.com 12 momo-vm-12.sj.mozillamessaging.com 11 momo-vm-linux-02.sj.mozillamessaging.com 11 momo-vm-02.sj.mozillamessaging.com 10 momo-vm-linux-06.sj.mozillamessaging.com 8 momo-vm-linux-03.sj.mozillamessaging.com 4 momo-vm-fedora12-01.sj.mozillamessaging.com So this problem is *also* happening on our other VMs, so it's not something specific about our new build VMs. I suspect buildbot is just feeding more xpcshell runs to them, since they came new and pick them up first.
Comment 11•14 years ago
|
||
To run the tests on the Fedora host without going through buildbot, use this script: http://pastebin.mozilla.org/910356 . If you limit the contents of all-test-dirs.list to "xpcom/tests/unit" this is a quick 5 second way to reproduce the failure. Anyway, the test is based on bug 543441. The last comment contains a link to two submitted changes. So if you look at the patch in http://hg.mozilla.org/mozilla-central/rev/46c6105f5429 and compare it with the error we get from the test: "Component returned failure code: 0x80004005 (NS_ERROR_FAILURE)", it seems the code in the patch is doing its job, does it not? Perhaps we should soak that exception message (NS_ERROR_FAILURE) so the test only fails if there is a different exception or an actual crash. The crash is what bug 543441 was all about and is what the test should be trying to expose.
Updated•14 years ago
|
Summary: All new linux VMs failing in test_nsIProcess.js | TEST-UNEXPECTED-FAIL | Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIProcess.run] → Intermittent failing in test_nsIProcess.js | TEST-UNEXPECTED-FAIL | Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIProcess.run]
Comment 12•14 years ago
|
||
http://tinderbox.mozilla.org/showlog.cgi?log=Thunderbird3.1/1294251562.1294253397.32263.gz Linux x86-64 comm-1.9.2 check on 2011/01/05 10:19:22 s: momo-vm-linux64-06 { TEST-PASS | /buildbot/linux64-comm-1.9.2-check/build/objdir/mozilla/_tests/xpcshell/xpcom/unit/test_nsIMutableArray.js | test passed TEST-UNEXPECTED-FAIL | /buildbot/linux64-comm-1.9.2-check/build/objdir/mozilla/_tests/xpcshell/xpcom/unit/test_nsIProcess.js | test failed (with xpcshell return code: 0), see following log: >>>>>>> ### XPCOM_MEM_LEAK_LOG defined -- logging leaks to /tmp/tmpGxvSJm/runxpcshelltests_leaks.log TEST-INFO | (xpcshell/head.js) | test 1 pending TEST-PASS | /buildbot/linux64-comm-1.9.2-check/build/objdir/mozilla/_tests/xpcshell/xpcom/unit/test_nsIProcess.js | [test_kill : 76] false == false TEST-PASS | /buildbot/linux64-comm-1.9.2-check/build/objdir/mozilla/_tests/xpcshell/xpcom/unit/test_nsIProcess.js | [test_kill : 86] true == true TEST-UNEXPECTED-FAIL | (xpcshell/head.js) | [Exception... "Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIProcess.kill]" nsresult: "0x80004005 (NS_ERROR_FAILURE)" location: "JS frame :: /buildbot/linux64-comm-1.9.2-check/build/objdir/mozilla/_tests/xpcshell/xpcom/unit/test_nsIProcess.js :: test_kill :: line 88" data: no] <<<<<<< }
Depends on: 493778
Assignee | ||
Comment 13•14 years ago
|
||
Serge, this is pretty much perma-fail at the moment, please don't add comments about the failures, we're actively working on fixing them.
Comment 14•14 years ago
|
||
Tweaked /etc/security/limits.d/90-nproc.conf on momo-vm-fedora12-02 and bumped nproc to 2048 to see if it helps.
Assignee | ||
Updated•14 years ago
|
Assignee | ||
Comment 15•14 years ago
|
||
After a bit of debugging we've found the issue. The changes to the unit test done in bug 543441 add a big for loop that basically does fork and exec. The way nsProcessCommon::Monitor works is to post an event to the main thread to re-join a thread for a process that has finished. However, it appears that in some instances the big for loop saturates the main thread and doesn't allow events to run, therefore the threads don't get re-joined. In at least Thunderbird's case this is causing out of memory/resource style issues where new threads can't be created because we're at limits. The simple solution is to allow the main thread to process events regularly and hence tidy up the threads and not take all the system resources. This issues is currently causing Thunderbird's Linux 32 boxes to be permanently orange, why it doesn't affect others I'm not too sure, but blocking the main thread from processing events isn't really a good idea anyway.
Attachment #502922 -
Flags: review?(mh+mozilla)
Comment 16•14 years ago
|
||
Comment on attachment 502922 [details] [diff] [review] Allow other events to run on main thread [Checked in: See comment 22] I'm wondering, why don't you consume pending events after each fork/exec ?
Comment 17•14 years ago
|
||
It's more efficient to consume several messages in one call rather than only one message per call.
Updated•14 years ago
|
Attachment #502922 -
Flags: review?(mh+mozilla) → review+
Assignee | ||
Comment 18•14 years ago
|
||
(In reply to comment #16) > Comment on attachment 502922 [details] [diff] [review] > Allow other events to run on main thread > > I'm wondering, why don't you consume pending events after each fork/exec ? I don't think Firefox does this either - I did a test on MozillaTry with some debug and I'm pretty sure the it didn't process those pending events. I believe the main reason that FF doesn't see this failure is that TestQuickReturn is relatively slow and doesn't complete before process.kill() is called, and so the kill succeeds at which stage it does a PR_JoinThread. If TestQuickReturn is very quick (we tried it with a shell script) then kill fails (because the process is already dead) and doesn't do the join at that stage but waits for the gecko event loop. Whilst a normal app might allow event posting once separating threads, I think there is a separate gecko level event handler. That doesn't get a look in (I see nothing in the PR* code or the nsProcess code relating to allowing threads to run) without explicit help. For example, see this case for Modal windows, where we keep spinning the event loop within the modal code: http://mxr.mozilla.org/comm-central/source/mozilla/xpfe/appshell/src/nsXULWindow.cpp#415
Comment 19•14 years ago
|
||
Note that the sole reason for the loop of fork/exec in this test is to have chances to catch the race in nsProcess that was fixed in bug 543441.
Assignee | ||
Comment 20•14 years ago
|
||
Yes, it should still do that because the new code is after the kill and the run -> kill step is what we are testing.
Assignee | ||
Comment 21•13 years ago
|
||
(In reply to comment #16) > Comment on attachment 502922 [details] [diff] [review] > Allow other events to run on main thread > > I'm wondering, why don't you consume pending events after each fork/exec ? I read this wrong when you first commented. Yes, we should be consuming those events each time round the for loop. I'd transposed the patch wrong from when we tried the hacked up version on the builders. This explains why the Thunderbird tree didn't go green yesterday when I pushed the fix. I'll move it into the for loop and push that as a follow-up.
Assignee | ||
Comment 22•13 years ago
|
||
All fixed now :-) : http://hg.mozilla.org/mozilla-central/rev/bb4bbfbb6074 http://hg.mozilla.org/mozilla-central/rev/625c08180421
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Updated•13 years ago
|
Attachment #502922 -
Attachment description: Allow other events to run on main thread → Allow other events to run on main thread
[Checked in: See comment 22]
Updated•13 years ago
|
Flags: in-testsuite+
Target Milestone: --- → mozilla2.0b10
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 24•13 years ago
|
||
(In reply to comment #23) > mbrubeck%mozilla.com > http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla-Aurora/1305062068. > 1305064280.24657.gz Mis-starred; this was actually bug 582821.
Updated•12 years ago
|
Keywords: intermittent-failure
Updated•12 years ago
|
Whiteboard: [orange]
You need to log in
before you can comment on or make changes to this bug.
Description
•