Closed Bug 847558 Opened 8 years ago Closed 7 years ago
on amazon ec2 vm's browser
_dbg _create Chrome .js hangs (timed out) on opt builds in _MD _Wait Unix Process or ns Process::Kill
we had fixed browser_dbg_createChrome.js to work on our VM images, now we are experiencing a crash while running the test. This doesn't happen in debug builds, only in opt builds. Here is a link to the failing testlog: https://tbpl.mozilla.org/php/getParsedLog.php?id=20299323&tree=Cedar You can always look for browser-chrome test failures on Ubuntu machines on the Cedar branch via tbpl: https://tbpl.mozilla.org/?tree=Cedar If there are suggestions for debugging this, working around this, or fixing it, I can reproduce this while working on a VM.
Depends on: 860349
Summary: on amazon ec2 vm's browser_dbg_createChrome.js crashes on opt builds → on amazon ec2 vm's browser_dbg_createChrome.js hangs (timed out) on opt builds
landed a potential fix on m-c for bug 860349. It needs a merge / landing.
leave this open for a bit, but first run doesn't have this error in it.
Assignee: nobody → past
Status: NEW → ASSIGNED
Priority: -- → P1
Added some more logging around the critical part of the code and I expect to get a better understanding of the failure when it reaches Cedar: https://hg.mozilla.org/integration/fx-team/rev/b63a15eddd36
Can we please land a change on mozilla-central to disable the test while we fix it on Cedar? This will help me move load from Fedora machines to Ubuntu faster and re-purpose the rev3 minis as Win7 to help with wait times. Thanks in advance!
if we disable this test, we can get the browser chrome tests running (verified 5 runs in a row on my loaner ubuntu test slave). Once there, we can actually test the fixes on try server which is much faster turnaround time than m-c -> cedar. :past, please speak up if this doesn't sound good.
Attachment #745517 - Flags: superreview?(past) → feedback?(past)
Comment on attachment 745517 [details] [diff] [review] disable test for linux only (1.1) Review of attachment 745517 [details] [diff] [review]: ----------------------------------------------------------------- I'm OK with it, but would it be too much to do a merge with the current patch first, since it's already in (or even a transplant)? I'm on vacation currently, otherwise I would have merged it myself.
Attachment #745517 - Flags: feedback?(past) → feedback+
Comment on attachment 745517 [details] [diff] [review] disable test for linux only (1.1) Review of attachment 745517 [details] [diff] [review]: ----------------------------------------------------------------- Thanks!
Attachment #745517 - Flags: review?(armenzg) → review+
It looks like trying to kill the browser debugger process blocks for some reason: http://mxr.mozilla.org/mozilla-central/source/browser/devtools/debugger/DebuggerUI.jsm#498 nsIProcess::Kill is here: http://mxr.mozilla.org/mozilla-central/source/xpcom/threads/nsProcessCommon.cpp#557 Can anyone see what is special about our Ubuntu VMs regarding this code?
ted: any ideas on Comment 11 ?
Can we view a stack trace?
here is what happens: https://tbpl.mozilla.org/php/getParsedLog.php?id=22945243&tree=Cedar&full=1#error0 We see it timeout, then we kill the process and get a stack trace. I am not sure if that stack trace is very useful though.
_MD_WaitUnixProcess is waiting  for WaitPidDaemonThread() which is waiting for notification that the child has terminated .  http://hg.mozilla.org/mozilla-central/annotate/eeb89381c7a3/nsprpub/pr/src/md/unix/uxproces.c#l824  http://hg.mozilla.org/mozilla-central/annotate/eeb89381c7a3/nsprpub/pr/src/md/unix/uxproces.c#l612 I wondered about bug 705543, but I don't see evidence of the IcedTea java plugin being loaded. Perhaps something else is handling SIGCHLD (bug 705543 comment 8). There are a number of warning messages in the log like WARNING: waitpid failed pid:2149 errno:10: file ../../../ipc/chromium/src/base/process_util_posix.cc, line 260 That may be due to bug 227246 issue 2, or perhaps something else is calling wait() or waitpid(-1,,).
Depends on: 678369
I believe the log from comment 14 is wrong, it belongs to a failure in a different test. Here is one of the few with the additional logging for this bug before the test was disabled: https://tbpl.mozilla.org/php/getParsedLog.php?id=22707338&full=1&branch=cedar#error0
That looks like a similar problem, with one extra level of indirection. The main thread is waiting for the nsProcess::Monitor() thread 2 libnspr4.so!PR_JoinThread [ptthread.c:95e8754a36dc : 580 + 0xd] rip = 0x00007fca3dbeb90e rsp = 0x00007fffd81a87d0 rbp = 0x00007fffd81a87d8 Found by: stack scanning 3 libxul.so!nsProcess::Kill() [nsProcessCommon.cpp:95e8754a36dc : 581 + 0x8] rip = 0x00007fca3acd118d rsp = 0x00007fffd81a87f0 Found by: stack scanning which is in _MD_WaitUnixProcess: Thread 37 1 libnspr4.so!PR_WaitCondVar [ptsynch.c:95e8754a36dc : 385 + 0xb] rip = 0x00007fca3dbe5b2e rsp = 0x00007fc9ff5fedd0 Found by: stack scanning 2 libnspr4.so!_MD_WaitUnixProcess [uxproces.c:95e8754a36dc : 824 + 0xd] rip = 0x00007fca3dbee304 rsp = 0x00007fc9ff5fedf0 Found by: stack scanning 3 libxul.so!nsProcess::Monitor(void*) [nsProcessCommon.cpp:95e8754a36dc : 264 + 0x8] rip = 0x00007fca3acd13ff rsp = 0x00007fc9ff5fee20 Found by: stack scanning Thread 32 1 libnspr4.so!_pr_poll_with_poll [ptio.c:95e8754a36dc : 3920 + 0xe] rip = 0x00007fca3dbe6a7f rsp = 0x00007fca0cbfdb20 Found by: stack scanning 8 libnspr4.so!WaitPidDaemonThread [uxproces.c:95e8754a36dc : 612 + 0x11] rip = 0x00007fca3dbede22 rsp = 0x00007fca0cbfddb0 Found by: stack scanning
Summary: on amazon ec2 vm's browser_dbg_createChrome.js hangs (timed out) on opt builds → on amazon ec2 vm's browser_dbg_createChrome.js hangs (timed out) on opt builds in _MD_WaitUnixProcess or nsProcess::Kill
we haven't been running this for 6+ months, should we work on fixing this?
I wouldn't bother starting the tests running again if no one is working on fixing this. The priority for fixing this depends on whether the r3 slaves are wanted for something else, though they also seem a bit slower. If I were permitted time for this, I'd fix bug 678369, which would either make this problem go away or turn into a different problem. Running strace -e trace=process -e trace=signal (or -e trace=process,signal if that is the right usage) might be helpful to identify which tests are potentially involved. That could detect when and if something else stealing the SIGCHLD handler, but, if that is not happening, we'd really need stacks to know whether something else is using wait or waitpid incorrectly. It would also be interesting to know whether this tests hangs when run on its own, or narrow down which previous test has caused the problem.
FTR the r3 minis are not wanted. We're just waiting to move tests off them so we can decommission them.
I'm not actually working on this.
Assignee: past → nobody
Status: ASSIGNED → NEW
Priority: P1 → P3
should we close this?
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Resolution: FIXED → INCOMPLETE
Whiteboard: [leave open]
Sounds like bug 850101 is a priority again.
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
The test was enabled again in all platforms here (bug 918507): https://hg.mozilla.org/mozilla-central/rev/54e8f6d1410a It seems to work so far, so I'm closing this bug.
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.