847558 - on amazon ec2 vm's browser_dbg_createChrome.js hangs (timed out) on opt builds in _MD_WaitUnixProcess or nsProcess::Kill

Reporter

Description

•

12 years ago

we had fixed browser_dbg_createChrome.js to work on our VM images, now we are experiencing a crash while running the test. This doesn't happen in debug builds, only in opt builds. Here is a link to the failing testlog: https://tbpl.mozilla.org/php/getParsedLog.php?id=20299323&tree=Cedar You can always look for browser-chrome test failures on Ubuntu machines on the Cedar branch via tbpl: https://tbpl.mozilla.org/?tree=Cedar If there are suggestions for debugging this, working around this, or fixing it, I can reproduce this while working on a VM.

Karl Tomlinson (:karlt)

Updated

•

12 years ago

Depends on: 860349

Summary: on amazon ec2 vm's browser_dbg_createChrome.js crashes on opt builds → on amazon ec2 vm's browser_dbg_createChrome.js hangs (timed out) on opt builds

Rob Campbell [:rc] (:robcee)

Comment 1

•

12 years ago

landed a potential fix on m-c for bug 860349. It needs a merge / landing.

Rob Campbell [:rc] (:robcee)

Comment 2

•

12 years ago

leave this open for a bit, but first run doesn't have this error in it.

Panos Astithas (he/him) [:past] (please ni?)

Updated

•

12 years ago

Assignee: nobody → past

Status: NEW → ASSIGNED

Priority: -- → P1

Panos Astithas (he/him) [:past] (please ni?)

Comment 3

•

12 years ago

Attached patch More logging — Details — Splinter Review

Added some more logging around the critical part of the code and I expect to get a better understanding of the failure when it reaches Cedar: https://hg.mozilla.org/integration/fx-team/rev/b63a15eddd36

Panos Astithas (he/him) [:past] (please ni?)

Updated

•

12 years ago

Whiteboard: [leave open]

Armen [:armenzg]

Comment 4

•

12 years ago

Can we please land a change on mozilla-central to disable the test while we fix it on Cedar? This will help me move load from Fedora machines to Ubuntu faster and re-purpose the rev3 minis as Win7 to help with wait times. Thanks in advance!

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 5

•

12 years ago

Attached patch disable test by default (1.0) (obsolete) — Details — Splinter Review

if we disable this test, we can get the browser chrome tests running (verified 5 runs in a row on my loaner ubuntu test slave). Once there, we can actually test the fixes on try server which is much faster turnaround time than m-c -> cedar. :past, please speak up if this doesn't sound good.

Attachment #745440 - Flags: superreview?(past)

Attachment #745440 - Flags: review?(armenzg)

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 6

•

12 years ago

Attached patch disable test for linux only (1.1) — Details — Splinter Review

Attachment #745440 - Attachment is obsolete: true

Attachment #745440 - Flags: superreview?(past)

Attachment #745440 - Flags: review?(armenzg)

Attachment #745517 - Flags: superreview?(past)

Attachment #745517 - Flags: review?(armenzg)

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Updated

•

12 years ago

Attachment #745517 - Flags: superreview?(past) → feedback?(past)

Panos Astithas (he/him) [:past] (please ni?)

Comment 7

•

12 years ago

Comment on attachment 745517 [details] [diff] [review] disable test for linux only (1.1) Review of attachment 745517 [details] [diff] [review]: ----------------------------------------------------------------- I'm OK with it, but would it be too much to do a merge with the current patch first, since it's already in (or even a transplant)? I'm on vacation currently, otherwise I would have merged it myself.

Attachment #745517 - Flags: feedback?(past) → feedback+

Tim Taubert [:ttaubert] (inactive)

Comment 8

•

12 years ago

https://hg.mozilla.org/mozilla-central/rev/b63a15eddd36

Armen [:armenzg]

Comment 9

•

12 years ago

Comment on attachment 745517 [details] [diff] [review] disable test for linux only (1.1) Review of attachment 745517 [details] [diff] [review]: ----------------------------------------------------------------- Thanks!

Attachment #745517 - Flags: review?(armenzg) → review+

Ryan VanderMeulen [:RyanVM]

Comment 10

•

12 years ago

https://hg.mozilla.org/mozilla-central/rev/e886f7097c42

Armen [:armenzg]

Updated

•

12 years ago

Blocks: 850101

Panos Astithas (he/him) [:past] (please ni?)

Comment 11

•

12 years ago

It looks like trying to kill the browser debugger process blocks for some reason: http://mxr.mozilla.org/mozilla-central/source/browser/devtools/debugger/DebuggerUI.jsm#498 nsIProcess::Kill is here: http://mxr.mozilla.org/mozilla-central/source/xpcom/threads/nsProcessCommon.cpp#557 Can anyone see what is special about our Ubuntu VMs regarding this code?

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 12

•

12 years ago

ted: any ideas on Comment 11 ?

Karl Tomlinson (:karlt)

Comment 13

•

12 years ago

Can we view a stack trace?

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 14

•

12 years ago

here is what happens: https://tbpl.mozilla.org/php/getParsedLog.php?id=22945243&tree=Cedar&full=1#error0 We see it timeout, then we kill the process and get a stack trace. I am not sure if that stack trace is very useful though.

Karl Tomlinson (:karlt)

Comment 15

•

12 years ago

_MD_WaitUnixProcess is waiting [1] for WaitPidDaemonThread() which is waiting for notification that the child has terminated [2]. [1] http://hg.mozilla.org/mozilla-central/annotate/eeb89381c7a3/nsprpub/pr/src/md/unix/uxproces.c#l824 [2] http://hg.mozilla.org/mozilla-central/annotate/eeb89381c7a3/nsprpub/pr/src/md/unix/uxproces.c#l612 I wondered about bug 705543, but I don't see evidence of the IcedTea java plugin being loaded. Perhaps something else is handling SIGCHLD (bug 705543 comment 8). There are a number of warning messages in the log like WARNING: waitpid failed pid:2149 errno:10: file ../../../ipc/chromium/src/base/process_util_posix.cc, line 260 That may be due to bug 227246 issue 2, or perhaps something else is calling wait() or waitpid(-1,,).

Depends on: 678369

Panos Astithas (he/him) [:past] (please ni?)

Comment 16

•

12 years ago

I believe the log from comment 14 is wrong, it belongs to a failure in a different test. Here is one of the few with the additional logging for this bug before the test was disabled: https://tbpl.mozilla.org/php/getParsedLog.php?id=22707338&full=1&branch=cedar#error0

Karl Tomlinson (:karlt)

Comment 17

•

12 years ago

That looks like a similar problem, with one extra level of indirection. The main thread is waiting for the nsProcess::Monitor() thread 2 libnspr4.so!PR_JoinThread [ptthread.c:95e8754a36dc : 580 + 0xd] rip = 0x00007fca3dbeb90e rsp = 0x00007fffd81a87d0 rbp = 0x00007fffd81a87d8 Found by: stack scanning 3 libxul.so!nsProcess::Kill() [nsProcessCommon.cpp:95e8754a36dc : 581 + 0x8] rip = 0x00007fca3acd118d rsp = 0x00007fffd81a87f0 Found by: stack scanning which is in _MD_WaitUnixProcess: Thread 37 1 libnspr4.so!PR_WaitCondVar [ptsynch.c:95e8754a36dc : 385 + 0xb] rip = 0x00007fca3dbe5b2e rsp = 0x00007fc9ff5fedd0 Found by: stack scanning 2 libnspr4.so!_MD_WaitUnixProcess [uxproces.c:95e8754a36dc : 824 + 0xd] rip = 0x00007fca3dbee304 rsp = 0x00007fc9ff5fedf0 Found by: stack scanning 3 libxul.so!nsProcess::Monitor(void*) [nsProcessCommon.cpp:95e8754a36dc : 264 + 0x8] rip = 0x00007fca3acd13ff rsp = 0x00007fc9ff5fee20 Found by: stack scanning Thread 32 1 libnspr4.so!_pr_poll_with_poll [ptio.c:95e8754a36dc : 3920 + 0xe] rip = 0x00007fca3dbe6a7f rsp = 0x00007fca0cbfdb20 Found by: stack scanning 8 libnspr4.so!WaitPidDaemonThread [uxproces.c:95e8754a36dc : 612 + 0x11] rip = 0x00007fca3dbede22 rsp = 0x00007fca0cbfddb0 Found by: stack scanning

Karl Tomlinson (:karlt)

Updated

•

12 years ago

Summary: on amazon ec2 vm's browser_dbg_createChrome.js hangs (timed out) on opt builds → on amazon ec2 vm's browser_dbg_createChrome.js hangs (timed out) on opt builds in _MD_WaitUnixProcess or nsProcess::Kill

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 18

•

11 years ago

we haven't been running this for 6+ months, should we work on fixing this?

Karl Tomlinson (:karlt)

Comment 19

•

11 years ago

I wouldn't bother starting the tests running again if no one is working on fixing this. The priority for fixing this depends on whether the r3 slaves are wanted for something else, though they also seem a bit slower. If I were permitted time for this, I'd fix bug 678369, which would either make this problem go away or turn into a different problem. Running strace -e trace=process -e trace=signal (or -e trace=process,signal if that is the right usage) might be helpful to identify which tests are potentially involved. That could detect when and if something else stealing the SIGCHLD handler, but, if that is not happening, we'd really need stacks to know whether something else is using wait or waitpid incorrectly. It would also be interesting to know whether this tests hangs when run on its own, or narrow down which previous test has caused the problem.

Armen [:armenzg]

Comment 20

•

11 years ago

FTR the r3 minis are not wanted. We're just waiting to move tests off them so we can decommission them.

Panos Astithas (he/him) [:past] (please ni?)

Comment 21

•

11 years ago

I'm not actually working on this.

Assignee: past → nobody

Status: ASSIGNED → NEW

Priority: P1 → P3

Armen [:armenzg]

Comment 22

•

11 years ago

should we close this?

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Updated

•

11 years ago

Status: NEW → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

Ryan VanderMeulen [:RyanVM]

Updated

•

11 years ago

Resolution: FIXED → INCOMPLETE

Whiteboard: [leave open]

Karl Tomlinson (:karlt)

Comment 23

•

11 years ago

Sounds like bug 850101 is a priority again.

Status: RESOLVED → REOPENED

Resolution: INCOMPLETE → ---

Panos Astithas (he/him) [:past] (please ni?)

Comment 24

•

11 years ago

The test was enabled again in all platforms here (bug 918507): https://hg.mozilla.org/mozilla-central/rev/54e8f6d1410a It seems to work so far, so I'm closing this bug.

Status: REOPENED → RESOLVED

Closed: 11 years ago → 11 years ago

Resolution: --- → FIXED

Ryan VanderMeulen [:RyanVM]

Comment 25

•

11 years ago

Probably fixed by bug 943174.

Depends on: bad-waitpid

Target Milestone: --- → Firefox 31

u279076

Updated

•

11 years ago

QA Whiteboard: [qa-]

BMO Automation

Updated

•

7 years ago

Product: Firefox → DevTools

More logging 12 years ago Panos Astithas (he/him) [:past] (please ni?) 3.17 KB, patch		Details \| Diff \| Splinter Review
disable test by default (1.0) 12 years ago Joel Maher ( :jmaher ) (UTC -8) 796 bytes, patch		Details \| Diff \| Splinter Review
disable test for linux only (1.1) 12 years ago Joel Maher ( :jmaher ) (UTC -8) 1.06 KB, patch	armenzg : review+ past : feedback+	Details \| Diff \| Splinter Review