Closed Bug 924771 Opened 11 years ago Closed 8 years ago

Some Mac builds always hit test_browserElement_oop_CloseFromOpener.html,test_browserElement_oop_CloseApp.html,test_browserElement_oop_ExposableURI.html | application timed out after 330 seconds with no output

Categories

(Core :: IPC, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED WORKSFORME
Tracking Status
e10s - ---

People

(Reporter: cbook, Unassigned)

References

()

Details

(Keywords: intermittent-failure)

Rev4 MacOSX Lion 10.7 mozilla-inbound debug test mochitest-2 on 2013-10-08 14:57:10 PDT for push 7cd3fde42755

slave: talos-r4-lion-067

https://tbpl.mozilla.org/php/getParsedLog.php?id=28844636&tree=Mozilla-Inbound

TEST-UNEXPECTED-FAIL | /tests/dom/browser-element/mochitest/test_browserElement_oop_CloseFromOpener.html | application timed out after 330 seconds with no output
Markus, this sounds familiar to issues you've run into lately.
Flags: needinfo?(mstange)
Yes, it does sound very familiar! Thanks for letting me know.
Flags: needinfo?(mstange)
Josh, can you find somebody who can look into this?

I think the important part in the logs is this one:

> WARNING: parent WaitForMessage() failed: 0x10004003 (ipc/rcv) timed out: file ../../../ipc/glue/GeckoChildProcessHost.cpp, line 677

and shortly after that:

> WARNING: child WaitForMessage() failed: file ../../../toolkit/xre/nsEmbedFunctions.cpp, line 361

The relevant code was added by you in bug 592951.

This issue seems extremely timing-dependent. It's reproducible 100% of the time on 10.7 debug mochitest-2 when pushing inbound revision http://hg.mozilla.org/integration/mozilla-inbound/rev/17411d6192e4 to try. But whenever I add any instrumentation logging around it, it no longer occurs. (I've made about 10 pushes to try now, trying to debug this...)

This bug is keeping me from landing the patches in bug 886999 which would fix OMTC crashes and would allow us to enable OMTC by default on 10.6.

Unfortunately, I haven't been able to reproduce it locally. I've tried on multiply machines, both with 10.6 and 10.7, with the exact builds that were produced by the try builders.

Maybe you have an idea where to look and or what could be going wrong?
Blocks: 886999
Flags: needinfo?(joshmoz)
Blocks: 899758
(In reply to Markus Stange [:mstange] from comment #34)

> Maybe you have an idea where to look and or what could be going wrong?

I don't recall anything about that code really, I'd have to start all over again to get into it.
Flags: needinfo?(joshmoz)
Now it's holding one or both of https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=f45b84411aae hostage too, and inbound is closed while we wait for someone to decide whether to back out even more patches because this test is far too fragile, or to just disable the test on OS X.
Severity: normal → blocker
Gah, and it wasn't actually that push that started it up, either.
In the log, we have:

08:19:38     INFO -  WARNING: child WaitForMessage() failed: file ../../../toolkit/xre/nsEmbedFunctions.cpp, line 361
08:19:38     INFO -  WARNING: NS_ENSURE_SUCCESS(rv, 1) failed with result 0x80004005: file ../../../ipc/app/MozillaRuntimeMain.cpp, line 117

Which means this function is failing somehow <http://mxr.mozilla.org/mozilla-central/source/ipc/chromium/src/chrome/common/mach_ipc_mac.mm#237>.

This debugging patch should help figure out what is causing the failures: https://hg.mozilla.org/integration/mozilla-inbound/rev/5373095214d9
Component: DOM → IPC
Fascinating, this episode was brought to you by https://hg.mozilla.org/integration/mozilla-inbound/rev/dc2b71e57211 and needs-clobber - failed 60-80% while that was in and unclobbered, and zero of twenty runs when it relanded with a clobber.
Severity: blocker → normal
16:49:50     INFO -  3020 INFO TEST-START | /tests/dom/browser-element/mochitest/test_browserElement_oop_CloseFromOpener.html
16:49:50     INFO -  ++DOMWINDOW == 61 (0x13f6e0c08) [serial = 2029] [outer = 0x115ce0528]
16:49:50     INFO -  [Child 1280] WARNING: NS_ENSURE_TRUE(mDocShell) failed: file ../../../../embedding/browser/webBrowser/nsWebBrowser.cpp, line 366
16:49:50     INFO -  [Child 1280] WARNING: NS_ENSURE_TRUE(domWindow) failed: file ../../../../embedding/browser/webBrowser/nsDocShellTreeOwner.cpp, line 86
16:49:50     INFO -  WaitForMessage returned 268451843
16:49:50     INFO -  [Child 1281] WARNING: NS_ENSURE_TRUE(mDocShell) failed: file ../../../../embedding/browser/webBrowser/nsWebBrowser.cpp, line 366
16:49:50     INFO -  [Parent 1262] WARNING: parent WaitForMessage() failed: 0x10004003 (ipc/rcv) timed out: file ../../../ipc/glue/GeckoChildProcessHost.cpp, line 677
16:49:50     INFO -  [Child 1281] WARNING: NS_ENSURE_TRUE(domWindow) failed: file ../../../../embedding/browser/webBrowser/nsDocShellTreeOwner.cpp, line 86
16:49:50     INFO -  --DOMWINDOW == 2 (0x1062f4e98) [serial = 2] [outer = 0x1057d8378] [url = about:blank]
16:49:50     INFO -  WARNING: XPCOM objects created/destroyed from static ctor/dtor: file ../../../xpcom/base/nsTraceRefcntImpl.cpp, line 138
16:49:50     INFO -  WARNING: XPCOM objects created/destroyed from static ctor/dtor: file ../../../xpcom/base/nsTraceRefcntImpl.cpp, line 138
16:49:50     INFO -  WARNING: XPCOM objects created/destroyed from static ctor/dtor: file ../../../xpcom/base/nsTraceRefcntImpl.cpp, line 138
16:49:50     INFO -  WARNING: XPCOM objects created/destroyed from static ctor/dtor: file ../../../xpcom/base/nsTraceRefcntImpl.cpp, line 138
16:49:50     INFO -  --DOMWINDOW == 2 (0x10d10d108) [serial = 2] [outer = 0x1068e3a98] [url = about:blank]
16:49:51     INFO -  WARNING: child WaitForMessage() failed: file ../../../toolkit/xre/nsEmbedFunctions.cpp, line 361
16:49:51     INFO -  WARNING: NS_ENSURE_SUCCESS(rv, 1) failed with result 0x80004005: file ../../../ipc/app/MozillaRuntimeMain.cpp, line 117
16:49:51     INFO -  ### XPCOM_MEM_BLOAT_LOG defined -- logging bloat/leaks to /var/folders/nq/80zh_y2524l6058rrskcpmyw00000w/T/tmpdhJRG1/runtests_leaks.log
16:49:51     INFO -  WaitForMessage returned 268451843
16:49:51     INFO -  nsStringStats
16:49:51     INFO -   => mAllocCount:              1
16:49:51     INFO -   => mReallocCount:            0
16:49:51     INFO -   => mFreeCount:               1
16:49:51     INFO -   => mShareCount:              0
16:49:51     INFO -   => mAdoptCount:              0
16:49:51     INFO -   => mAdoptFreeCount:          0
16:49:55     INFO -  [Child 1280] WARNING: NS_ENSURE_TRUE(inBrowser) failed: file ../../../../embedding/browser/webBrowser/nsDocShellTreeOwner.cpp, line 82
16:49:55     INFO -  [Child 1280] WARNING: NS_ENSURE_TRUE(mDocShell) failed: file ../../../../embedding/browser/webBrowser/nsWebBrowser.cpp, line 366
16:49:55     INFO -  [Child 1280] WARNING: NS_ENSURE_TRUE(mDocShell) failed: file ../../../../embedding/browser/webBrowser/nsWebBrowser.cpp, line 366
16:49:55     INFO -  [Child 1280] WARNING: NS_ENSURE_TRUE(domWindow) failed: file ../../../../embedding/browser/webBrowser/nsDocShellTreeOwner.cpp, line 86
16:49:55     INFO -  [Child 1280] WARNING: NS_ENSURE_TRUE(inBrowser) failed: file ../../../../embedding/browser/webBrowser/nsDocShellTreeOwner.cpp, line 82
16:49:56     INFO -  [Child 1281] WARNING: NS_ENSURE_TRUE(inBrowser) failed: file ../../../../embedding/browser/webBrowser/nsDocShellTreeOwner.cpp, line 82
16:49:56     INFO -  [Child 1281] WARNING: NS_ENSURE_TRUE(mDocShell) failed: file ../../../../embedding/browser/webBrowser/nsWebBrowser.cpp, line 366
16:49:56     INFO -  [Child 1281] WARNING: NS_ENSURE_TRUE(mDocShell) failed: file ../../../../embedding/browser/webBrowser/nsWebBrowser.cpp, line 366
16:49:56     INFO -  [Child 1281] WARNING: NS_ENSURE_TRUE(domWindow) failed: file ../../../../embedding/browser/webBrowser/nsDocShellTreeOwner.cpp, line 86
16:49:56     INFO -  [Child 1281] WARNING: NS_ENSURE_TRUE(inBrowser) failed: file ../../../../embedding/browser/webBrowser/nsDocShellTreeOwner.cpp, line 82
16:55:26  WARNING -  TEST-UNEXPECTED-FAIL | /tests/dom/browser-element/mochitest/test_browserElement_oop_CloseFromOpener.html | application timed out after 330 seconds with no output
Markus: I think it's permaorange again, and I'm certainly not going to do anything about that other than disable the test, so this would be an awesome time to land the OpenGL crash fix.
Too late, just another run of five in a row, not actually permaorange.
Test disabled on all platforms in https://hg.mozilla.org/integration/mozilla-inbound/rev/dbf01bda0a79 - if someone develops an interest in seeing it running on not-Mac again before someone develops an interest in fixing it on Mac, they can test the nested ifdefs and reenable it for not-Mac.
Whiteboard: [test disabled]
Whiteboard: [test disabled] → [test disabled][leave open]
FWIW, in this log <https://tbpl.mozilla.org/php/getParsedLog.php?id=29384876&tree=Mozilla-Inbound> this error was triggered: <https://hg.mozilla.org/integration/mozilla-inbound/rev/5373095214d9#l1.37>

18:59:45     INFO -  WaitForMessage returned 268451843

Which is mach_msg failing with MACH_RCV_TIMED_OUT (0x10004003).  I'm not sure what that means.

I backed out the debugging patch: https://hg.mozilla.org/integration/mozilla-inbound/rev/eb4a385bd96c
Whoops, I totally missed the activity here over the last days!

In the logs, before WaitForMessage fails in the child, it fails in the parent:

[Parent 1265] WARNING: parent WaitForMessage() failed: 0x10004003 (ipc/rcv) timed out: file ../../../ipc/glue/GeckoChildProcessHost.cpp, line 677

When I talked to jdm about this yesterday, our hypothesis was that we may be failing to launch the new child process, time out in the parent process waiting for it, and hang the parent process (no idea how). Then the remaining child processes from the preceding tests time out waiting for the parent process.
(The child error messages come from child processes with PIDs that were already present before the test_browserElement_oop_CloseFromOpener.html test started.)
Interesting theory.  Do you think there is anything special about test_browserElement_oop_CloseFromOpener.html which would trigger that kind of bug?
No idea, sorry. It was pretty much a guess in the dark. We just assumed that the problem is not in WaitForMessage itself or in its direct caller because that code has been around for so long.
Summary: Intermittent TEST-UNEXPECTED-FAIL | /tests/dom/browser-element/mochitest/test_browserElement_oop_CloseFromOpener.html | application timed out after 330 seconds with no output → Intermittent test_browserElement_oop_CloseFromOpener.html | application timed out after 330 seconds with no output
We had just enough of a gap (one of the OOM closures?) before the next one started that I didn't realize bug 925200 was just the next test inheriting the exact same thing. Now we're on to a third test.
Summary: Intermittent test_browserElement_oop_CloseFromOpener.html | application timed out after 330 seconds with no output → Some Mac builds always hit test_browserElement_oop_CloseFromOpener.html,test_browserElement_oop_CloseApp.html,test_browserElement_oop_ExposableURI.html | application timed out after 330 seconds with no output
Disabled test_browserElement_oop_ExposableURI.html in https://hg.mozilla.org/integration/mozilla-inbound/rev/d1689276f118
Mass tracking-e10s flag change. Filter bugmail on "2be0fcce-e36a-4e2c-aa80-0e3d33eb5406".
tracking-e10s: --- → +
Try confirms these work without timing out now.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WORKSFORME
Whiteboard: [test disabled][leave open]
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/e0516ed5cd30
bug 930449 - Re-enable some OOP browser-element tests that no longer timeout.
You need to log in before you can comment on or make changes to this bug.