Closed Bug 847558 Opened 7 years ago Closed 5 years ago

on amazon ec2 vm's browser_dbg_createChrome.js hangs (timed out) on opt builds in _MD_WaitUnixProcess or nsProcess::Kill

Categories

(DevTools :: Debugger, defect, P3)

All
Linux
defect

Tracking

(Not tracked)

RESOLVED FIXED
Firefox 31

People

(Reporter: jmaher, Unassigned)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

Attachments

(2 files, 1 obsolete file)

we had fixed browser_dbg_createChrome.js to work on our VM images, now we are experiencing a crash while running the test.  This doesn't happen in debug builds, only in opt builds.

Here is a link to the failing testlog:
https://tbpl.mozilla.org/php/getParsedLog.php?id=20299323&tree=Cedar

You can always look for browser-chrome test failures on Ubuntu machines on the Cedar branch via tbpl:
https://tbpl.mozilla.org/?tree=Cedar

If there are suggestions for debugging this, working around this, or fixing it, I can reproduce this while working on a VM.
Depends on: 860349
Summary: on amazon ec2 vm's browser_dbg_createChrome.js crashes on opt builds → on amazon ec2 vm's browser_dbg_createChrome.js hangs (timed out) on opt builds
landed a potential fix on m-c for bug 860349. It needs a merge / landing.
leave this open for a bit, but first run doesn't have this error in it.
Assignee: nobody → past
Status: NEW → ASSIGNED
Priority: -- → P1
Attached patch More loggingSplinter Review
Added some more logging around the critical part of the code and I expect to get a better understanding of the failure when it reaches Cedar:

https://hg.mozilla.org/integration/fx-team/rev/b63a15eddd36
Whiteboard: [leave open]
Can we please land a change on mozilla-central to disable the test while we fix it on Cedar?
This will help me move load from Fedora machines to Ubuntu faster and re-purpose the rev3 minis as Win7 to help with wait times.

Thanks in advance!
Attached patch disable test by default (1.0) (obsolete) — Splinter Review
if we disable this test, we can get the browser chrome tests running (verified 5 runs in a row on my loaner ubuntu test slave).  Once there, we can actually test the fixes on try server which is much faster turnaround time than m-c -> cedar.

:past, please speak up if this doesn't sound good.
Attachment #745440 - Flags: superreview?(past)
Attachment #745440 - Flags: review?(armenzg)
Attachment #745440 - Attachment is obsolete: true
Attachment #745440 - Flags: superreview?(past)
Attachment #745440 - Flags: review?(armenzg)
Attachment #745517 - Flags: superreview?(past)
Attachment #745517 - Flags: review?(armenzg)
Attachment #745517 - Flags: superreview?(past) → feedback?(past)
Comment on attachment 745517 [details] [diff] [review]
disable test for linux only (1.1)

Review of attachment 745517 [details] [diff] [review]:
-----------------------------------------------------------------

I'm OK with it, but would it be too much to do a merge with the current patch first, since it's already in (or even a transplant)? I'm on vacation currently, otherwise I would have merged it myself.
Attachment #745517 - Flags: feedback?(past) → feedback+
Comment on attachment 745517 [details] [diff] [review]
disable test for linux only (1.1)

Review of attachment 745517 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks!
Attachment #745517 - Flags: review?(armenzg) → review+
It looks like trying to kill the browser debugger process blocks for some reason:

http://mxr.mozilla.org/mozilla-central/source/browser/devtools/debugger/DebuggerUI.jsm#498

nsIProcess::Kill is here:

http://mxr.mozilla.org/mozilla-central/source/xpcom/threads/nsProcessCommon.cpp#557

Can anyone see what is special about our Ubuntu VMs regarding this code?
Can we view a stack trace?
here is what happens:
https://tbpl.mozilla.org/php/getParsedLog.php?id=22945243&tree=Cedar&full=1#error0

We see it timeout, then we kill the process and get a stack trace.  I am not sure if that stack trace is very useful though.
_MD_WaitUnixProcess is waiting [1] for WaitPidDaemonThread() which is waiting for notification that the child has terminated [2].

[1] http://hg.mozilla.org/mozilla-central/annotate/eeb89381c7a3/nsprpub/pr/src/md/unix/uxproces.c#l824
[2] http://hg.mozilla.org/mozilla-central/annotate/eeb89381c7a3/nsprpub/pr/src/md/unix/uxproces.c#l612

I wondered about bug 705543, but I don't see evidence of the IcedTea java plugin being loaded.

Perhaps something else is handling SIGCHLD (bug 705543 comment 8).

There are a number of warning messages in the log like
WARNING: waitpid failed pid:2149 errno:10: file ../../../ipc/chromium/src/base/process_util_posix.cc, line 260
That may be due to bug 227246 issue 2, or perhaps something else is calling wait() or waitpid(-1,,).
Depends on: 678369
I believe the log from comment 14 is wrong, it belongs to a failure in a different test. Here is one of the few with the additional logging for this bug before the test was disabled:

https://tbpl.mozilla.org/php/getParsedLog.php?id=22707338&full=1&branch=cedar#error0
That looks like a similar problem, with one extra level of indirection.
The main thread is waiting for the nsProcess::Monitor() thread

 2  libnspr4.so!PR_JoinThread [ptthread.c:95e8754a36dc : 580 + 0xd]
    rip = 0x00007fca3dbeb90e   rsp = 0x00007fffd81a87d0
    rbp = 0x00007fffd81a87d8
    Found by: stack scanning
 3  libxul.so!nsProcess::Kill() [nsProcessCommon.cpp:95e8754a36dc : 581 + 0x8]
    rip = 0x00007fca3acd118d   rsp = 0x00007fffd81a87f0
    Found by: stack scanning

which is in _MD_WaitUnixProcess:

Thread 37

 1  libnspr4.so!PR_WaitCondVar [ptsynch.c:95e8754a36dc : 385 + 0xb]
    rip = 0x00007fca3dbe5b2e   rsp = 0x00007fc9ff5fedd0
    Found by: stack scanning
 2  libnspr4.so!_MD_WaitUnixProcess [uxproces.c:95e8754a36dc : 824 + 0xd]
    rip = 0x00007fca3dbee304   rsp = 0x00007fc9ff5fedf0
    Found by: stack scanning
 3  libxul.so!nsProcess::Monitor(void*) [nsProcessCommon.cpp:95e8754a36dc : 264 + 0x8]
    rip = 0x00007fca3acd13ff   rsp = 0x00007fc9ff5fee20
    Found by: stack scanning

Thread 32

 1  libnspr4.so!_pr_poll_with_poll [ptio.c:95e8754a36dc : 3920 + 0xe]
    rip = 0x00007fca3dbe6a7f   rsp = 0x00007fca0cbfdb20
    Found by: stack scanning

 8  libnspr4.so!WaitPidDaemonThread [uxproces.c:95e8754a36dc : 612 + 0x11]
    rip = 0x00007fca3dbede22   rsp = 0x00007fca0cbfddb0
    Found by: stack scanning
Summary: on amazon ec2 vm's browser_dbg_createChrome.js hangs (timed out) on opt builds → on amazon ec2 vm's browser_dbg_createChrome.js hangs (timed out) on opt builds in _MD_WaitUnixProcess or nsProcess::Kill
we haven't been running this for 6+ months, should we work on fixing this?
I wouldn't bother starting the tests running again if no one is working on fixing this.

The priority for fixing this depends on whether the r3 slaves are wanted for something else, though they also seem a bit slower.

If I were permitted time for this, I'd fix bug 678369, which would either make this problem go away or turn into a different problem.

Running strace -e trace=process -e trace=signal (or -e trace=process,signal if that is the right usage) might be helpful to identify which tests are potentially involved.  That could detect when and if something else stealing the SIGCHLD handler, but, if that is not happening, we'd really need stacks to know whether something else is using wait or waitpid incorrectly.

It would also be interesting to know whether this tests hangs when run on its own, or narrow down which previous test has caused the problem.
FTR the r3 minis are not wanted. We're just waiting to move tests off them so we can decommission them.
I'm not actually working on this.
Assignee: past → nobody
Status: ASSIGNED → NEW
Priority: P1 → P3
should we close this?
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Resolution: FIXED → INCOMPLETE
Whiteboard: [leave open]
Sounds like bug 850101 is a priority again.
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
The test was enabled again in all platforms here (bug 918507):

https://hg.mozilla.org/mozilla-central/rev/54e8f6d1410a

It seems to work so far, so I'm closing this bug.
Status: REOPENED → RESOLVED
Closed: 6 years ago5 years ago
Resolution: --- → FIXED
Probably fixed by bug 943174.
Depends on: bad-waitpid
Target Milestone: --- → Firefox 31
QA Whiteboard: [qa-]
Product: Firefox → DevTools
You need to log in before you can comment on or make changes to this bug.