620101 - Intermittent failing in test_nsIProcess.js | TEST-UNEXPECTED-FAIL | Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIProcess.run]

Assignee

Description

•

15 years ago

All our new VMs (-07 to -10) are failing in xpcshell-tests: TEST-UNEXPECTED-FAIL | /buildbot/comm-central-trunk-linux-opt-unittest-xpcshell/build/xpcshell/tests/xpcom/tests/unit/test_nsIProcess.js | test failed (with xpcshell return code: 0), see following log: >>>>>>> TEST-INFO | (xpcshell/head.js) | test 1 pending TEST-PASS | /buildbot/comm-central-trunk-linux-opt-unittest-xpcshell/build/xpcshell/tests/xpcom/tests/unit/test_nsIProcess.js | [test_kill : 81] false == false TEST-PASS | /buildbot/comm-central-trunk-linux-opt-unittest-xpcshell/build/xpcshell/tests/xpcom/tests/unit/test_nsIProcess.js | [test_kill : 91] true == true TEST-PASS | /buildbot/comm-central-trunk-linux-opt-unittest-xpcshell/build/xpcshell/tests/xpcom/tests/unit/test_nsIProcess.js | [test_kill : 95] false == false TEST-PASS | /buildbot/comm-central-trunk-linux-opt-unittest-xpcshell/build/xpcshell/tests/xpcom/tests/unit/test_nsIProcess.js | [test_quick : 117] 42 == 42 TEST-PASS | /buildbot/comm-central-trunk-linux-opt-unittest-xpcshell/build/xpcshell/tests/xpcom/tests/unit/test_nsIProcess.js | [test_args : 131] 0 == 0 argv[1] = Møzîllå; expected = Møzîllå argv[2] = Мозилла; expected = Мозилла argv[3] = মোজিলা; expected = মোজিলা argv[4] =

Mark Banner (:standard8)

Assignee

Comment 1

•

15 years ago

Missed some copy/paste: TEST-PASS | /buildbot/comm-central-trunk-linux-opt-unittest-xpcshell/build/xpcshell/tests/xpcom/tests/unit/test_nsIProcess.js | [test_args : 131] 0 == 0 TEST-INFO | (xpcshell/head.js) | test 2 pending TEST-UNEXPECTED-FAIL | (xpcshell/head.js) | [Exception... "Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIProcess.run]" nsresult: "0x80004005 (NS_ERROR_FAILURE)" location: "JS frame :: /buildbot/comm-central-trunk-linux-opt-unittest-xpcshell/build/xpcshell/tests/xpcom/tests/unit/test_nsIProcess.js :: test_kill_2 :: line 244" data: no] <<<<<<<

Summary: All new linux VMs failing in test_nsIProcess.js → All new linux VMs failing in test_nsIProcess.js | TEST-UNEXPECTED-FAIL | Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIProcess.run]

Whiteboard: [orange]

Philippe M. Chiasson (:gozer)

Updated

•

15 years ago

Assignee: nobody → gozer

Philippe M. Chiasson (:gozer)

Comment 2

•

15 years ago

Looks like the only sources of this could be failing when calling one of these: - PR_CreateProcess - PR_CreateThread And, given that test starts 1,000 processes/threads, something is hitting a wall. I've checked the obvious system limits and such, and they are all the same on both the old and the new systems.

Mark Banner (:standard8)

Assignee

Comment 3

•

15 years ago

I've spent some time today debugging this with momo-vm-linux-10 on try server. So far I've discovered that it happens typically around the 475 ish mark (give or take 20) of creating processes/threads. I've also just discovered that it looks to be PR_CreateThread that is failing (as opposed to PR_CreateProcess). Now I know that, I should be able to get a bit more debug in the next try server run.

Mark Banner (:standard8)

Assignee

Comment 4

•

15 years ago

Ok, more debug. PR_CreateThread is failing with error PR_INSUFFICIENT_RESOURCES_ERROR. The OS error number is 12 which from looking at asm-generic/errno-base.h implies out of memory. This is kinda strange unless we've got some limit on the process itself? What is also interesting, is if I run just the test_nsIProcess.js by itself, then it passes, if I run it alongside the other tests, then it fails. Seems consistent like that.

Philippe M. Chiasson (:gozer)

Comment 5

•

15 years ago

Compared ulimits and sysctls on an old and a new system, they are the same. Total memory is 3G on both systems, and nothing strange is taking up a lot of it.

Philippe M. Chiasson (:gozer)

Comment 6

•

15 years ago

Food for thought. All these new linux VMs are running on a new ESXi host, so that could be the common cause. However, momo-vm-linux64-06 is /also/ running there without apparent problems.

Philippe M. Chiasson (:gozer)

Comment 7

•

15 years ago

momo-vm-linux-11 is now up and running with the same exact configuration as the other new builders, but is running on a different VMWare host. Let's watch it and see.

Philippe M. Chiasson (:gozer)

Comment 8

•

15 years ago

Getting interesting. momo-vm-linux-11 completed green, and is exactly the same machine image as the others http://build.mozillamessaging.com/buildbot/production/builders/Linux%20comm-central%20test%20xpcshell/builds/198 Culprit is starting to smell like VMWare itself. Hard to imagine why, but needs investigating further.

Philippe M. Chiasson (:gozer)

Comment 9

•

15 years ago

For added confusion, when running the test by itself, it passes. When running the test via direct invocation of xpcshell, it passes too. However, when running the test via direct invocation of xpcshell from inside a gdb session, it fails.

Philippe M. Chiasson (:gozer)

Comment 10

•

15 years ago

Just realized there is an assumption we are making here that is incorrect. buildbot-master $> bzgrep "TEST-UNEXPECTED-FAIL.*test_nsIProcess.js" /buildbot/production/buildbot-configs/thunderbird/comm-central-trunk-linux-opt-unittest-xpcshell/*-log-xpcshell-stdio.bz2 | grep test_nsIProcess.js | cut -d: -f1 | sort | uniq | xargs bzgrep HOSTNAME= | cut -d= -f2 | sort | uniq -c | sort -k1 -rn 28 momo-vm-linux-08.sj.mozillamessaging.com 20 momo-vm-linux-09.sj.mozillamessaging.com 17 momo-vm-linux-07.sj.mozillamessaging.com 15 momo-vm-07.sj.mozillamessaging.com 14 momo-vm-linux-11.sj.mozillamessaging.com 13 momo-vm-linux-05.sj.mozillamessaging.com 13 momo-vm-linux-04.sj.mozillamessaging.com 12 momo-vm-12.sj.mozillamessaging.com 11 momo-vm-linux-02.sj.mozillamessaging.com 11 momo-vm-02.sj.mozillamessaging.com 10 momo-vm-linux-06.sj.mozillamessaging.com 8 momo-vm-linux-03.sj.mozillamessaging.com 4 momo-vm-fedora12-01.sj.mozillamessaging.com So this problem is *also* happening on our other VMs, so it's not something specific about our new build VMs. I suspect buildbot is just feeding more xpcshell runs to them, since they came new and pick them up first.

John Hopkins (:jhopkins)

Comment 11

•

15 years ago

To run the tests on the Fedora host without going through buildbot, use this script: http://pastebin.mozilla.org/910356 . If you limit the contents of all-test-dirs.list to "xpcom/tests/unit" this is a quick 5 second way to reproduce the failure. Anyway, the test is based on bug 543441. The last comment contains a link to two submitted changes. So if you look at the patch in http://hg.mozilla.org/mozilla-central/rev/46c6105f5429 and compare it with the error we get from the test: "Component returned failure code: 0x80004005 (NS_ERROR_FAILURE)", it seems the code in the patch is doing its job, does it not? Perhaps we should soak that exception message (NS_ERROR_FAILURE) so the test only fails if there is a different exception or an actual crash. The crash is what bug 543441 was all about and is what the test should be trying to expose.

Philippe M. Chiasson (:gozer)

Updated

•

15 years ago

Summary: All new linux VMs failing in test_nsIProcess.js | TEST-UNEXPECTED-FAIL | Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIProcess.run] → Intermittent failing in test_nsIProcess.js | TEST-UNEXPECTED-FAIL | Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIProcess.run]

Serge Gautherie (:sgautherie)

Comment 12

•

15 years ago

http://tinderbox.mozilla.org/showlog.cgi?log=Thunderbird3.1/1294251562.1294253397.32263.gz Linux x86-64 comm-1.9.2 check on 2011/01/05 10:19:22 s: momo-vm-linux64-06 { TEST-PASS | /buildbot/linux64-comm-1.9.2-check/build/objdir/mozilla/_tests/xpcshell/xpcom/unit/test_nsIMutableArray.js | test passed TEST-UNEXPECTED-FAIL | /buildbot/linux64-comm-1.9.2-check/build/objdir/mozilla/_tests/xpcshell/xpcom/unit/test_nsIProcess.js | test failed (with xpcshell return code: 0), see following log: >>>>>>> ### XPCOM_MEM_LEAK_LOG defined -- logging leaks to /tmp/tmpGxvSJm/runxpcshelltests_leaks.log TEST-INFO | (xpcshell/head.js) | test 1 pending TEST-PASS | /buildbot/linux64-comm-1.9.2-check/build/objdir/mozilla/_tests/xpcshell/xpcom/unit/test_nsIProcess.js | [test_kill : 76] false == false TEST-PASS | /buildbot/linux64-comm-1.9.2-check/build/objdir/mozilla/_tests/xpcshell/xpcom/unit/test_nsIProcess.js | [test_kill : 86] true == true TEST-UNEXPECTED-FAIL | (xpcshell/head.js) | [Exception... "Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIProcess.kill]" nsresult: "0x80004005 (NS_ERROR_FAILURE)" location: "JS frame :: /buildbot/linux64-comm-1.9.2-check/build/objdir/mozilla/_tests/xpcshell/xpcom/unit/test_nsIProcess.js :: test_kill :: line 88" data: no] <<<<<<< }

Depends on: 493778

Mark Banner (:standard8)

Assignee

Comment 13

•

15 years ago

Serge, this is pretty much perma-fail at the moment, please don't add comments about the failures, we're actively working on fixing them.

Philippe M. Chiasson (:gozer)

Comment 14

•

15 years ago

Tweaked /etc/security/limits.d/90-nproc.conf on momo-vm-fedora12-02 and bumped nproc to 2048 to see if it helps.

Mark Banner (:standard8)

Assignee

Updated

•

15 years ago

Assignee: gozer → bugzilla

Blocks: 543441

Component: Release Engineering → XPCOM

No longer depends on: 493778

Product: Mozilla Messaging → Core

QA Contact: release → xpcom

Version: other → Trunk

Mark Banner (:standard8)

Assignee

Comment 15

•

15 years ago

Attached patch Allow other events to run on main thread [Checked in: See comment 22] — Details — Splinter Review

After a bit of debugging we've found the issue. The changes to the unit test done in bug 543441 add a big for loop that basically does fork and exec. The way nsProcessCommon::Monitor works is to post an event to the main thread to re-join a thread for a process that has finished. However, it appears that in some instances the big for loop saturates the main thread and doesn't allow events to run, therefore the threads don't get re-joined. In at least Thunderbird's case this is causing out of memory/resource style issues where new threads can't be created because we're at limits. The simple solution is to allow the main thread to process events regularly and hence tidy up the threads and not take all the system resources. This issues is currently causing Thunderbird's Linux 32 boxes to be permanently orange, why it doesn't affect others I'm not too sure, but blocking the main thread from processing events isn't really a good idea anyway.

Attachment #502922 - Flags: review?(mh+mozilla)

Serge Gautherie (:sgautherie)

Updated

•

15 years ago

Blocks: 493778

Mike Hommey [:glandium]

Comment 16

•

15 years ago

Comment on attachment 502922 [details] [diff] [review] Allow other events to run on main thread [Checked in: See comment 22] I'm wondering, why don't you consume pending events after each fork/exec ?

(no longer active)

Updated

•

15 years ago

Blocks: 438871

John Hopkins (:jhopkins)

Comment 17

•

15 years ago

It's more efficient to consume several messages in one call rather than only one message per call.

Mike Hommey [:glandium]

Updated

•

15 years ago

Attachment #502922 - Flags: review?(mh+mozilla) → review+

Mark Banner (:standard8)

Assignee

Comment 18

•

15 years ago

(In reply to comment #16) > Comment on attachment 502922 [details] [diff] [review] > Allow other events to run on main thread > > I'm wondering, why don't you consume pending events after each fork/exec ? I don't think Firefox does this either - I did a test on MozillaTry with some debug and I'm pretty sure the it didn't process those pending events. I believe the main reason that FF doesn't see this failure is that TestQuickReturn is relatively slow and doesn't complete before process.kill() is called, and so the kill succeeds at which stage it does a PR_JoinThread. If TestQuickReturn is very quick (we tried it with a shell script) then kill fails (because the process is already dead) and doesn't do the join at that stage but waits for the gecko event loop. Whilst a normal app might allow event posting once separating threads, I think there is a separate gecko level event handler. That doesn't get a look in (I see nothing in the PR* code or the nsProcess code relating to allowing threads to run) without explicit help. For example, see this case for Modal windows, where we keep spinning the event loop within the modal code: http://mxr.mozilla.org/comm-central/source/mozilla/xpfe/appshell/src/nsXULWindow.cpp#415

Mike Hommey [:glandium]

Comment 19

•

15 years ago

Note that the sole reason for the loop of fork/exec in this test is to have chances to catch the race in nsProcess that was fixed in bug 543441.

Mark Banner (:standard8)

Assignee

Comment 20

•

15 years ago

Yes, it should still do that because the new code is after the kill and the run -> kill step is what we are testing.

Mark Banner (:standard8)

Assignee

Comment 21

•

15 years ago

(In reply to comment #16) > Comment on attachment 502922 [details] [diff] [review] > Allow other events to run on main thread > > I'm wondering, why don't you consume pending events after each fork/exec ? I read this wrong when you first commented. Yes, we should be consuming those events each time round the for loop. I'd transposed the patch wrong from when we tried the hacked up version on the builders. This explains why the Thunderbird tree didn't go green yesterday when I pushed the fix. I'll move it into the for loop and push that as a follow-up.

Mark Banner (:standard8)

Assignee

Comment 22

•

15 years ago

All fixed now :-) : http://hg.mozilla.org/mozilla-central/rev/bb4bbfbb6074 http://hg.mozilla.org/mozilla-central/rev/625c08180421

Status: NEW → RESOLVED

Closed: 15 years ago

Resolution: --- → FIXED

Serge Gautherie (:sgautherie)

Updated

•

15 years ago

Attachment #502922 - Attachment description: Allow other events to run on main thread → Allow other events to run on main thread [Checked in: See comment 22]

Serge Gautherie (:sgautherie)

Updated

•

15 years ago

Flags: in-testsuite+

Target Milestone: --- → mozilla2.0b10

Comment hidden (Legacy TBPL/Treeherder Robot)

Matt Brubeck (:mbrubeck)

Comment 24

•

14 years ago

(In reply to comment #23) > mbrubeck%mozilla.com > http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla-Aurora/1305062068. > 1305064280.24657.gz Mis-starred; this was actually bug 582821.

Nobody; OK to take it and work on it

Updated

•

13 years ago

Keywords: intermittent-failure

Nobody; OK to take it and work on it

Updated

•

13 years ago

Whiteboard: [orange]