Closed Bug 1267106 Opened 3 years ago Closed 3 years ago

Intermittent Windows/Mac e10s startup hang leading to REFTEST PROCESS-CRASH | reftest | application crashed [@ CrashingThread(void *)], | application timed out after 330 seconds with no output

Categories

(Core :: IPC, defect, P3)

Unspecified
Windows
defect

Tracking

()

RESOLVED FIXED
mozilla52
Tracking Status
e10s ? ---
firefox46 --- unaffected
firefox47 --- wontfix
firefox48 --- wontfix
firefox49 --- wontfix
firefox50 --- fixed
firefox51 --- fixed
firefox52 --- fixed

People

(Reporter: RyanVM, Assigned: jimm)

References

(Blocks 1 open bug)

Details

(Keywords: crash, intermittent-failure, Whiteboard: [e10s-orangeblockers])

Attachments

(2 files, 1 obsolete file)

Attached image screenshot
I'd noticed this on Ash but haven't had time to narrow it down. Then I noticed today that it was affecting Aurora as well. After a giant pile of retriggering, I can say with a high level of certainty that it's caused by bug 1235633. It also fits the regression range on Ash (and I assume it would on other trunk branches as well, but I'll leave that to the full-time sheriffs to confirm).

Screenshot always shows a blank window. Affects Win7/Win8 reftests & crashtests with e10s enabled.

https://treeherder.mozilla.org/logviewer.html#?job_id=2460175&repo=mozilla-aurora
Flags: needinfo?(wmccloskey)
(In reply to Ryan VanderMeulen [:RyanVM] from comment #0)
> but I'll leave that to the full-time sheriffs to confirm).

Going back through the history, I see failures matching this starred as "intermittent" without a bug being filed going back for about two weeks, so there's that...
Can you provide more information about the priority here Ryan? How often does it happen? Does it happen on inbound/central? I don't understand why it wasn't caught when my patch originally landed.
Flags: needinfo?(wmccloskey) → needinfo?(ryanvm)
It's intermittent. Yes, happens on production as well:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267106&endday=2016-04-28&startday=2016-04-01&tree=trunk

Mostly PGO-only, for whatever that's worth. My main concern is whether this'll manifest in a user-facing way or not.
Flags: needinfo?(ryanvm)
Intermittent test crash
Priority: -- → P3
 09:30:07 INFO - JavaScript error: , line 0: uncaught exception: undefined
This is extremely frequent across the board on Windows these days. Would be great if we could get someone to look into it. Affects all Windows e10s suites that use the reftest harness.
Tracy, can you look into this? ping me if you have questions.
Flags: needinfo?(jmathies) → needinfo?(twalker)
per IRC re-NIing Jim
Flags: needinfo?(twalker) → needinfo?(jmathies)
Bug 1265229 is probably the same issue.
See Also: → 1265229
I'm willing to believe that the OS X failures which started a few days ago are somehow different, but until someone explains how by filing a separate bug, I'm going to just dump them in here.
Summary: Intermittent Windows e10s startup hang leading to REFTEST PROCESS-CRASH | reftest | application crashed [@ CrashingThread(void *)] → Intermittent Windows/Mac e10s startup hang leading to REFTEST PROCESS-CRASH | reftest | application crashed [@ CrashingThread(void *)], | application timed out after 330 seconds with no output
Flags: needinfo?(jmathies)
I looked at this and logs say that we are failing to launch the child process completely: which triggers some assertions, but we don't have stacks/reasons for failing to launch.
I dragged a handy mozilla-central push off into a corner and retriggered the Win7 opt e10s reftest-noaccel 25 times, and got this 13 times on it, so that suite is unacceptable and needs to be hidden. Not sure how many other flavors also need to be hidden.
Only 7 of 25 on PGO, but then, 28% is also completely unacceptable.
Hiding it is temporarily blocked by bug 1301260.

Fun story: the Mac failures completely stopped September 1st, and since that time there have been 4 failures on regular accelerated reftest, and 245 on reftest-noaccel.
Flags: needinfo?(jmathies)
Win7 opt/PGO e10s Ru hidden on all trunk trees (and try, though it won't actually be running there until next week), click the "Excluded Jobs" link in the gray bar at the top of treeherder to see it.
Blocks: 1301905
Interesting. I was about to update the exclusion to hide it on aurora, since we merged whatever is busted there this morning, but 10 retriggers there gave me 10 green runs.
An update here - on Windows the child process *is* launching, but for some reason it fails to load up content and operate correctly. Will continue digging.

07:24:46     INFO - REFTEST INFO | Checking for orphan ssltunnel processes...
07:24:46     INFO - REFTEST INFO | Checking for orphan xpcshell processes...
07:24:47     INFO - REFTEST INFO | Running with e10s: True
07:24:47     INFO - REFTEST INFO | Application command: C:\slave\test\build\application\firefox\firefox.exe -marionette -profile c:\users\cltbld\appdata\local\temp\tmpwydbjz.mozrunner
07:24:51     INFO - 1475763891578	Marionette	INFO	Listening on port 2828
07:24:52     INFO - GeckoChildProcessHost::LaunchAndWaitForProcessHandle()
07:24:52     INFO - GeckoChildProcessHost::PrepareLaunch()
07:24:52     INFO - GeckoChildProcessHost::RunPerformAsyncLaunch()
07:24:52     INFO - GeckoChildProcessHost::PerformAsyncLaunch()
07:24:52     INFO - GeckoChildProcessHost::PerformAsyncLaunchInternal()
07:24:52     INFO - PerformAsyncLaunchInternal: GeckoProcessType_Content
07:24:52     INFO - call mSandboxBroker.LaunchApp...
07:24:52     INFO - ==> process 2160 launched child process 3124 ("C:\slave\test\build\application\firefox\firefox.exe" -contentproc --channel="2160.0.121296096\921340549" -greomni "C:\slave\test\build\application\firefox\omni.ja" -appomni "C:\slave\test\build\application\firefox\browser\omni.ja" -appdir "C:\slave\test\build\application\firefox\browser"  2160 "\\.\pipe\gecko-crash-server-pipe.2160" tab)
(timeout)
Flags: needinfo?(jmathies)
This is caused by a mysteriously missing 'load' event reftest-content.js registers for when it loads [1]. I couldn't find any existing bugs on frame scripts failing to fire 'load', so maybe this is something new. I need to dig a bit further.

try log [2]:
05:48:05     INFO - REFTEST INFO | Checking for orphan ssltunnel processes...
05:48:05     INFO - REFTEST INFO | Checking for orphan xpcshell processes...
05:48:06     INFO - REFTEST INFO | Running with e10s: True
05:48:06     INFO - REFTEST INFO | Application command: C:\slave\test\build\application\firefox\firefox.exe -marionette -profile c:\users\cltbld\appdata\local\temp\tmpffuyrr.mozrunner
05:48:06     INFO - **** firefox.exe child process launched
05:48:10     INFO - 1475930890470	Marionette	INFO	Listening on port 2828
05:48:11     INFO - GeckoChildProcessHost::LaunchAndWaitForProcessHandle()
05:48:11     INFO - GeckoChildProcessHost::PrepareLaunch()
05:48:11     INFO - GeckoChildProcessHost::RunPerformAsyncLaunch()
05:48:11     INFO - GeckoChildProcessHost::PerformAsyncLaunch()
05:48:11     INFO - GeckoChildProcessHost::PerformAsyncLaunchInternal()
05:48:11     INFO - PerformAsyncLaunchInternal: GeckoProcessType_Content
05:48:11     INFO - call mSandboxBroker.LaunchApp...
05:48:11     INFO - ==> process 3420 launched child process 1840 ("C:\slave\test\build\application\firefox\firefox.exe" -contentproc --channel="3420.0.1033849503\316985519" -greomni "C:\slave\test\build\application\firefox\omni.ja" -appomni "C:\slave\test\build\application\firefox\browser\omni.ja" -appdir "C:\slave\test\build\application\firefox\browser"  3420 "\\.\pipe\gecko-crash-server-pipe.3420" tab)
05:48:11     INFO - TabParent::SendLoadRemoteScript url='chrome://global/content/browser-content.js'
05:48:11     INFO - TabParent::SendLoadRemoteScript url='resource://gre/modules/addons/Content.js'
05:48:11     INFO - TabParent::SendLoadRemoteScript url='chrome://satchel/content/formSubmitListener.js'
05:48:11     INFO - **** firefox.exe child process launched
05:48:11     INFO - TabParent::SendLoadRemoteScript url='chrome://gfxsanity/content/gfxFrameScript.js'
05:48:11     INFO - dropping into loop->Run()
05:48:11     INFO - TabChild::RecvLoadRemoteScript url='chrome://global/content/browser-content.js'
05:48:11     INFO - TabChild::RecvLoadRemoteScript url='resource://gre/modules/addons/Content.js'
05:48:11     INFO - TabChild::RecvLoadRemoteScript url='chrome://satchel/content/formSubmitListener.js'
05:48:11     INFO - TabChild::RecvLoadRemoteScript url='chrome://gfxsanity/content/gfxFrameScript.js'
05:48:12     INFO - TabParent::SendLoadRemoteScript url='chrome://global/content/browser-content.js'
05:48:12     INFO - TabParent::SendLoadRemoteScript url='resource://gre/modules/addons/Content.js'
05:48:12     INFO - TabParent::SendLoadRemoteScript url='chrome://satchel/content/formSubmitListener.js'
05:48:12     INFO - TabParent::SendLoadRemoteScript url='chrome://global/content/browser-child.js'
05:48:12     INFO - TabParent::SendLoadRemoteScript url='chrome://global/content/select-child.js'
05:48:12     INFO - TabParent::SendLoadRemoteScript url='chrome://browser/content/tab-content.js'
05:48:12     INFO - TabParent::SendLoadRemoteScript url='chrome://browser/content/content.js'
05:48:12     INFO - TabParent::SendLoadRemoteScript url='chrome://browser/content/content-UITour.js'
05:48:12     INFO - TabParent::SendLoadRemoteScript url='chrome://global/content/manifestMessages.js'
05:48:12     INFO - TabParent::SendLoadRemoteScript url='chrome://browser/content/content-sessionStore.js'
05:48:12     INFO - TabParent::SendLoadRemoteScript url='chrome://marionette/content/listener.js'
05:48:12     INFO - TabChild::RecvLoadRemoteScript url='chrome://global/content/browser-content.js'
05:48:12     INFO - TabChild::RecvLoadRemoteScript url='resource://gre/modules/addons/Content.js'
05:48:12     INFO - TabChild::RecvLoadRemoteScript url='chrome://satchel/content/formSubmitListener.js'
05:48:12     INFO - TabChild::RecvLoadRemoteScript url='chrome://global/content/browser-child.js'
05:48:12     INFO - TabChild::RecvLoadRemoteScript url='chrome://global/content/select-child.js'
05:48:12     INFO - TabChild::RecvLoadRemoteScript url='chrome://browser/content/tab-content.js'
05:48:12     INFO - TabChild::RecvLoadRemoteScript url='chrome://browser/content/content.js'
05:48:12     INFO - TabChild::RecvLoadRemoteScript url='chrome://browser/content/content-UITour.js'
05:48:12     INFO - TabChild::RecvLoadRemoteScript url='chrome://global/content/manifestMessages.js'
05:48:12     INFO - TabChild::RecvLoadRemoteScript url='chrome://browser/content/content-sessionStore.js'
05:48:12     INFO - TabChild::RecvLoadRemoteScript url='chrome://marionette/content/listener.js'
05:48:14     INFO - REFTEST INFO | function OnRefTestLoad(win)
05:48:14     INFO - TabParent::SendLoadRemoteScript url='chrome://global/content/browser-content.js'
05:48:14     INFO - TabParent::SendLoadRemoteScript url='resource://gre/modules/addons/Content.js'
05:48:14     INFO - TabParent::SendLoadRemoteScript url='chrome://satchel/content/formSubmitListener.js'
05:48:14     INFO - TabChild::RecvLoadRemoteScript url='chrome://global/content/browser-content.js'
05:48:14     INFO - TabChild::RecvLoadRemoteScript url='resource://gre/modules/addons/Content.js'
05:48:14     INFO - TabChild::RecvLoadRemoteScript url='chrome://satchel/content/formSubmitListener.js'
05:48:14     INFO - REFTEST INFO | function RegisterMessageListenersAndLoadContentScript()
05:48:14     INFO - REFTEST INFO | Loading reftest-content.js frame script...
05:48:14     INFO - TabParent::SendLoadRemoteScript url='chrome://reftest/content/reftest-content.js'
05:48:14     INFO - TabChild::RecvLoadRemoteScript url='chrome://reftest/content/reftest-content.js'
05:48:14     INFO - REFTEST INFO | [CONTENT] reftest-content.js: i'm loaded!
05:53:44    ERROR - REFTEST ERROR | reftest | application timed out after 330 seconds with no output

[1] http://searchfox.org/mozilla-central/rev/c635b8c61d648bb8a0317c19f8905b3be8132a8a/layout/tools/reftest/reftest-content.js#1157
[2] https://treeherder.mozilla.org/#/jobs?repo=try&revision=29f88c9234fbbeee8a19d1d06489b6b205e29d53
Attached patch wip (obsolete) — Splinter Review
Assignee: nobody → jmathies
Attached patch fixSplinter Review
This isn't a browser bug, it's a harness bug. The load of reftest-content.js frame script sometimes happens after the load event for content fires so reftest-content.js never gets started and we hang on test startup. The fix is to check to see if the load event has fired, and if so, go ahead and get the test suite running. Otherwise, keep the old load event as a starting trigger.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=e36bf38727ce3bdadb91ae29525c2f5fbd0f5b5e
Attachment #8799781 - Attachment is obsolete: true
Attachment #8799889 - Flags: review?(jmaher)
Comment on attachment 8799889 [details] [diff] [review]
fix

Review of attachment 8799889 [details] [diff] [review]:
-----------------------------------------------------------------

good find!
Attachment #8799889 - Flags: review?(jmaher) → review+
Keywords: checkin-needed
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/bb296adbce78
Reftest harness should check readyState and continue if the document state is 'complete'. r=jmaher
Keywords: checkin-needed
(In reply to Phil Ringnalda (:philor) from comment #52)
> Win7 opt/PGO e10s Ru hidden on all trunk trees (and try, though it won't
> actually be running there until next week), click the "Excluded Jobs" link
> in the gray bar at the top of treeherder to see it.

Should I file a bug on undoing hiding of tests now that we (hopefully) have this fixed?
Flags: needinfo?(philringnalda)
Nope, that would be bug 1301905, you just have to wait until it has at least made it to mozilla-central and thus around to other branches, if not given it a day or so to bake after that, and then poke at me there.
Flags: needinfo?(philringnalda)
https://hg.mozilla.org/mozilla-central/rev/bb296adbce78
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
Thanks for tracking this down, Jim!
Whiteboard: [e10s-orangeblockers][checkin-needed-aurora][checkin-needed-beta]
https://hg.mozilla.org/releases/mozilla-aurora/rev/5f981bfee893
Whiteboard: [e10s-orangeblockers][checkin-needed-aurora][checkin-needed-beta] → [e10s-orangeblockers][checkin-needed-beta]
https://hg.mozilla.org/releases/mozilla-beta/rev/9646fff5e1c7
Whiteboard: [e10s-orangeblockers][checkin-needed-beta] → [e10s-orangeblockers]
You need to log in before you can comment on or make changes to this bug.