Sometimes no 'toplevel-window-ready' observer notification is fired during startup
Categories
(Toolkit :: Startup and Profile System, defect)
Tracking
()
Tracking | Status | |
---|---|---|
firefox57 | --- | fix-optional |
People
(Reporter: whimboo, Unassigned)
References
(Blocks 1 open bug)
Details
Reporter | ||
Comment 1•8 years ago
|
||
Reporter | ||
Comment 2•8 years ago
|
||
Comment 3•8 years ago
|
||
![]() |
||
Comment 4•8 years ago
|
||
Comment 5•8 years ago
|
||
Reporter | ||
Comment 6•8 years ago
|
||
![]() |
||
Comment 7•8 years ago
|
||
![]() |
||
Updated•8 years ago
|
Reporter | ||
Comment 8•8 years ago
|
||
![]() |
||
Comment 9•8 years ago
|
||
![]() |
||
Comment 10•8 years ago
|
||
Updated•7 years ago
|
Reporter | ||
Comment 12•6 years ago
|
||
Reporter | ||
Updated•6 years ago
|
Reporter | ||
Updated•6 years ago
|
Reporter | ||
Comment 13•6 years ago
|
||
Several bugs which track intermittent failures including bug 1414495 show that the failures are gone on mozilla-central since my patch on bug 1530979 (add a marionette-startup-requested notification) has been landed.
Could that mean that in cases when it failed in Firefox we may have a broken startup path for sessionstore? So that something fails while executing it, and no sessionstore-windows-restored message has been sent?
Florian, you mentioned that my patch shouldn't cause a difference, maybe you have an idea or know whom to forward this to?
Comment 14•6 years ago
|
||
(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #13)
Florian, you mentioned that my patch shouldn't cause a difference, maybe you have an idea or know whom to forward this to?
Afaik, Dao and Mike care about session restore.
Reporter | ||
Comment 15•6 years ago
|
||
On bug 1530979 we switched to a custom marionette-restart-requested
observer notification. So updating the bug's summary.
As noticed like for bug 1540367 we still sometimes experience problems that the notification is not getting send out, and as such doesn't cause Marionette to initialize, which causes hangs in the test harnesses using Marionette.
Reporter | ||
Comment 17•6 years ago
|
||
So Kriti has seen this locally on bug 1483559 comment 11. So how often do actually you see this problem with Marionette not correctly initializing? Does it always happen for you? If that is the case it would be great if you could run some special steps for us.
Comment 18•6 years ago
|
||
(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #17)
So Kriti has seen this locally on bug 1483559 comment 11. So how often do actually you see this problem with Marionette not correctly initializing? Does it always happen for you? If that is the case it would be great if you could run some special steps for us.
We would like to see a profile of what's happening. Here is how you could capture it:
MOZ_PROFILER_STARTUP=1 MOZ_PROFILER_SHUTDOWN="shutdown.json" ./mach <whatever command you were running to start the test that triggered the problem>
You should see a browser window on screen but the test won't start. Quit the browser (eg. press ctrl+Q, or click File->Quit in the menu, or something equivalent that closes the browser without killing it). At the end of shutdown, you should find that a shutdown.json file has been written on the disk, in the current directory.
Next, start the same browser (eg. ./mach run), load https://profiler.firefox.com/, install the add-on that's offered there.
Drop the shutdown.json file into the profiler.firefox.com webpage (or use the "Load a profile from file" button). Wait for it to load, and the various messages at the top ("Waiting for symbol tables for ...", "Symbolicating call stacks...", ...) to disappear, and click the "Share…" button. This should let you upload the profile, and provide a short url that can be pasted here.
Comment 19•6 years ago
|
||
:florian profile's URL is here : https://perfht.ml/2uLUnvl
:whimboo this occurs only for the screenshot.py test. Other wdspec tests which are .html files are running fine. I have not tested for other .py files, will let you know shortly.
Comment 20•6 years ago
|
||
(In reply to Kriti Singh from comment #19)
:florian profile's URL is here : https://perfht.ml/2uLUnvl
Thanks! In this profile, marionette begins to start about 8s after you pressed ctrl+q. Any chance you could do it again, but wait about 30s after the browser window appears before quitting?
Also, please add this environment variable:
MOZ_PROFILER_STARTUP_ENTRIES=10000000
Comment 21•6 years ago
|
||
:florian, for some reason, the browser window is not starting up now. I reran the command after removing existing shutdown.json file and doing ./mach build
but nothing changes. Also last time, I had to quit at 8s because the browser window quitted itself within 10-15 seconds. Here are the logs https://pastebin.com/bEnTLCNY
Comment 22•6 years ago
|
||
If the browser quits itself for some reason without crashing, a shutdown.json file should still be written to the disk. It may show us the reason why the browser quit.
Comment 23•6 years ago
|
||
For some strange reason, the tests are working as expected now. Although I'm not sure of what's going on internally, since this happens randomly(atleast in my experience), this might be a possible race condition.
Reporter | ||
Comment 24•6 years ago
|
||
(In reply to Kriti Singh from comment #23)
For some strange reason, the tests are working as expected now. Although I'm not sure of what's going on internally, since this happens randomly(atleast in my experience), this might be a possible race condition.
When it happens how does Firefox look like? Is the UI fully loaded, or is something missing? Do you see something else in the browser console, or in the terminal window? Does it happen for opt and debug builds for you?
Reporter | ||
Comment 25•6 years ago
|
||
The following job log gives some more details:
[task 2019-06-06T23:21:55.363Z] 23:21:55 INFO - 1559863315359 Marionette TRACE Received observer notification toplevel-window-ready
[task 2019-06-06T23:22:09.404Z] 23:22:09 INFO - [GFX1-]: Killing GPU process due to IPC reply timeout
[task 2019-06-06T23:22:09.405Z] 23:22:09 INFO - [GFX1-]: Receive IPC close with reason=AbnormalShutdown
[task 2019-06-06T23:22:09.406Z] 23:22:09 INFO - [GFX1-]: Receive IPC close with reason=AbnormalShutdown
[task 2019-06-06T23:22:09.407Z] 23:22:09 INFO - ###!!! [Parent][RunMessage] Error: Channel error: cannot send/recv
[task 2019-06-06T23:22:09.408Z] 23:22:09 INFO - [GFX1-]: Receive IPC close with reason=AbnormalShutdown
[task 2019-06-06T23:22:09.409Z] 23:22:09 INFO - ###!!! [Parent][MessageChannel] Error: (msgtype=0x4C001A,name=PGPU::Msg_ShutdownVR) Channel error: cannot send/recv
We recently turned on the GPU process on Linux, so maybe this made it actually appear, and gives more details why it fails? Looks like there are issues with IPC. Matt, any idea what this could mean? Maybe it might be worth filing this as its own bug.
Comment 26•6 years ago
|
||
The log doesn't give much useful information unfortunately.
The GPU process took too long to respond to an IPC message, so we killed it
It's possible that something in graphics code actually hung, but there's not much to on here.
Has anyone tried to get a pernosco capture of this happening?
Reporter | ||
Comment 27•6 years ago
|
||
(In reply to Matt Woodrow (:mattwoodrow) from comment #26)
Has anyone tried to get a pernosco capture of this happening?
No, it's all in CI only for now, and I never have seen it myself locally yet. :/ So we don't any recording with rr available.
Reporter | ||
Updated•6 years ago
|
Comment 28•6 years ago
|
||
(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #27)
(In reply to Matt Woodrow (:mattwoodrow) from comment #26)
Has anyone tried to get a pernosco capture of this happening?
No, it's all in CI only for now, and I never have seen it myself locally yet. :/ So we don't any recording with rr available.
I'm pretty sure Pernosco is a service, which generates rr style traces for failures on try, if they're reproducible. My understanding is that this is intermittent, I don't know how frequent. Worth seeing with Matt if that's something pernosco can help with or not.
Reporter | ||
Comment 29•6 years ago
|
||
The intermittent rate is very low all the last weeks. So I doubt that it would make sense to have it right now. But still be interested to know how this would work.
Comment 30•6 years ago
|
||
Pernosco attempts to reproduce failures that it detects from the try pushes of enrolled users.
I asked roc, and it appears that this failure hasn't yet been caught unfortunately.
How important is this to get fixed? Sounds like it's fairly infrequent.
Reporter | ||
Comment 31•6 years ago
|
||
Matt, due to the not sent observer notification Marionette doesn't fully start-up, which means that test jobs depending on Marionette will hang, and Firefox will eventually be killed after 1000s of inactivity.
As mentioned above the failure rate is kinda low at the moment. So it might be hard to catch it. The number of closed bug on the blocking list indicates that. So I would suggest that I keep an eye on it for now, and will report back once the failure rate increased again.
Comment hidden (Intermittent Failures Robot) |
Reporter | ||
Comment 36•6 years ago
|
||
Recently this mainly seem to happen on Linux but still very infrequently to actually figure out the reason.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Reporter | ||
Comment 46•5 years ago
|
||
Note that we now have a user of geckodriver who can reliable reproduce this problem on his machine. See https://github.com/mozilla/geckodriver/issues/1660#issuecomment-566255446.
Basically it is a very long startup time, and Marionette just gives up to connect before Firefox has been started. Hopefully we will have a startup profile soon.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Reporter | ||
Comment 49•5 years ago
|
||
This is most likely related to bug 1577713. Recently we have seen that specifically webdriver tests always timed out on Linux due to a very delayed marionette-startup-requested
notification. Also users of geckodriver reported that this notification arrived more than 60s later. And because Marionette has a 60s startup timeout we kill the browser.
Over on bug 1577713 I will do some measurements soon for ccov builds to see which part of the startup process of Firefox takes that long.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Mass-removing myself from cc; search for 12b9dfe4-ece3-40dc-8d23-60e179f64ac1 or any reasonable part thereof, to mass-delete these notifications (and sorry!)
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Reporter | ||
Comment 67•4 years ago
|
||
The underlying problem here is actually that no toplevel-window-ready
notification is fired. It means for the failing jobs the initial Firefox window never appears, or is that much delayed that we run into the timeout, and finally kill the process.
Comment hidden (Intermittent Failures Robot) |
Reporter | ||
Comment 71•4 years ago
|
||
Florian, given that this specific failure causes a lot of trouble for us in Marionette tests I wonder what we could do to nail down the problem and figure out why toplevel-window-ready
isn't fired within a minute in a lot of cases. Are there any other helpful observer notifications that we could register for to get an idea in which step the startup of Firefox hangs/fails? Given that it's hard to trigger we could land such a patch temporary and use logger.trace()
for the output.
Comment 72•4 years ago
|
||
(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #71)
Florian, given that this specific failure causes a lot of trouble for us in Marionette tests I wonder what we could do to nail down the problem and figure out why
toplevel-window-ready
isn't fired within a minute in a lot of cases.
The best way I can think of is to put MOZ_PROFILER_STARTUP in the environment. And when we give up because we would timeout, find a way to send a signal that would dump the profiler to disk (like for shutdown profiling) instead of killing the process immediately. If sending a signal is not a practical solution, we could put temporary code for this specific bug to dump a profile after startup as taken an excessive amount of time (but less than the time that would cause the test to kill the process).
If we are not interested in profiling the tests themselves, we could stop the profiler as soon as startup is complete.
Maybe not relevant if you've encountered this on all platforms, but I've noticed that for tests running on Linux 64 opt (especially with the profiler running, but maybe not exclusively), we block 100+s during startup waiting for dbus to let us know if we need to enable accessibility features or not. Example profile: https://share.firefox.dev/3vNbMSy
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Reporter | ||
Comment 89•3 years ago
|
||
We could get rid of this bug if we would allow Marionette to start earlier, and not having to wait for the very first browser window. There is a request for that over on bug 1726465. It won't be that easy, but on the other side would allow some other interesting test scenarios.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Reporter | ||
Comment 100•3 years ago
|
||
Since bug 1726465 has been fixed (window less support for Marionette) this observer notification is no longer in use. As such this bug should be marked as invalid.
Reporter | ||
Comment 101•3 years ago
|
||
Actually this bug is not in the Marionette component. As such I would suggest incomplete instead given that no other consumer triggered this particular situation yet.
Comment 102•2 years ago
|
||
treeherder |
New failure instance: https://treeherder.mozilla.org/logviewer?job_id=390482094&repo=autoland
Reporter | ||
Comment 103•2 years ago
|
||
(In reply to Treeherder Bug Filer from comment #102)
New failure instance: https://treeherder.mozilla.org/logviewer?job_id=390482094&repo=autoland
This was mis-classified and should have been for bug 1790325. Closing the bug again.
Comment 105•2 years ago
|
||
treeherder |
New failure instance: https://treeherder.mozilla.org/logviewer?job_id=392481633&repo=mozilla-central
Reporter | ||
Comment 106•2 years ago
|
||
(In reply to Treeherder Bug Filer from comment #105)
New failure instance: https://treeherder.mozilla.org/logviewer?job_id=392481633&repo=mozilla-central
This failure is for bug 1573448, which I've now reopened to avoid the inappropriate classification of this bug and all the sending of hundreds of emails to each recipient. I hope that this bug won't be reopened again.
Given that the former code doesn't exist anymore lets even close this bug as invalid now.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Description
•