Closed Bug 1382162 Opened 8 years ago Closed 2 years ago

Sometimes no 'toplevel-window-ready' observer notification is fired during startup

Categories

(Toolkit :: Startup and Profile System, defect)

defect
Not set
normal

Tracking

()

RESOLVED INVALID
Tracking Status
firefox57 --- fix-optional

People

(Reporter: whimboo, Unassigned)

References

(Blocks 1 open bug)

Details

With several Marionette restart tests we detected that in some cases Marionette is not getting initialized. The underlying issue is that it waits for the 'sessionstore-windows-restored' observer notification. And as it looks like it never gets fired. Here an excerpt of a failing test: 02:52:18 INFO - 1500432738433 Marionette DEBUG Received observer notification "profile-after-change" 02:52:18 INFO - 1500432738481 Marionette DEBUG Received observer notification "command-line-startup" 02:52:18 INFO - 1500432738481 Marionette INFO Enabled via --marionette 02:55:24 ERROR - TEST-UNEXPECTED-ERROR | test_quit_restart.py TestQuitRestart.test_force_restart | IOError: Process killed
(In reply to Geoff Brown [:gbrown] (pto July 21, July 24) from bug 1261598 comment #137) > (In reply to Henrik Skupin (:whimboo) from comment #135) > > active. Geoff, have you had a chance to see it on a one click loaner? If we > > can see it with VNC connected that would be great. > > I have not seen it on a one click loaner. I have only reproduced it with try > pushes like > https://treeherder.mozilla.org/#/ > jobs?repo=try&revision=8e781e0c5b0bf42723b7a8f2cd14062916f3b309. > > The latest result indicates sessionstore-windows-restored is not received > because browser.js's onLoad() is not called: > > https://dxr.mozilla.org/mozilla-central/rev/ > 92eb911c35da48907d326604c4c92cf55e551895/browser/base/content/browser.js#1255 > > and therefore doesn't notify sessionstore-windows-restored: > > https://dxr.mozilla.org/mozilla-central/rev/ > 92eb911c35da48907d326604c4c92cf55e551895/browser/base/content/browser.js#1313
Blocks: 1261598
Blocks: 1384770
Blocks: 1384859
Blocks: 1386131
Blocks: 1380037
Blocks: 1386141
Blocks: 1386590
> (In reply to Geoff Brown [:gbrown] (pto July 21, July 24) from bug 1261598 comment #137) > > I have not seen it on a one click loaner. I have only reproduced it with try > > pushes like > > https://treeherder.mozilla.org/#/jobs?repo=try&revision=8e781e0c5b0bf42723b7a8f2cd14062916f3b309. > > > > The latest result indicates sessionstore-windows-restored is not received > > because browser.js's onLoad() is not called: > > > > https://dxr.mozilla.org/mozilla-central/rev/92eb911c35da48907d326604c4c92cf55e551895/browser/base/content/browser.js#1255 > > > > and therefore doesn't notify sessionstore-windows-restored: > > > > https://dxr.mozilla.org/mozilla-central/rev/92eb911c35da48907d326604c4c92cf55e551895/browser/base/content/browser.js#1313 Mike or Gijs, does one of you have an idea what's going wrong here? It's strange that we sometimes not send out the "sessionstore-windows-restored" notification.
Flags: needinfo?(mdeboer)
Flags: needinfo?(gijskruitbosch+bugs)
(In reply to Henrik Skupin (:whimboo) from comment #2) > > (In reply to Geoff Brown [:gbrown] (pto July 21, July 24) from bug 1261598 comment #137) > > > I have not seen it on a one click loaner. I have only reproduced it with try > > > pushes like > > > https://treeherder.mozilla.org/#/jobs?repo=try&revision=8e781e0c5b0bf42723b7a8f2cd14062916f3b309. > > > > > > The latest result indicates sessionstore-windows-restored is not received > > > because browser.js's onLoad() is not called: > > > > > > https://dxr.mozilla.org/mozilla-central/rev/92eb911c35da48907d326604c4c92cf55e551895/browser/base/content/browser.js#1255 > > > > > > and therefore doesn't notify sessionstore-windows-restored: > > > > > > https://dxr.mozilla.org/mozilla-central/rev/92eb911c35da48907d326604c4c92cf55e551895/browser/base/content/browser.js#1313 > > Mike or Gijs, does one of you have an idea what's going wrong here? It's > strange that we sometimes not send out the "sessionstore-windows-restored" > notification. Where can I see a log file that actually shows this? The trypush doesn't really make sense to me - there's no marionette jobs running on it at all, as far as I can see. In any case, the comment you're quoting says that onLoad doesn't fire, which seems more fundamental than anything to do with session restore. It's hooked up in XUL: https://dxr.mozilla.org/mozilla-central/source/browser/base/content/browser.xul#30 . So if that doesn't fire, apparently the load event doesn't fire on the browser window, so something is seriously broken. Without more context (which this bug doesn't have) I don't know what else to tell you. (Mike's on PTO and this isn't a sessionstore issue anyway, from the looks of it, so I'm clearing both our needinfos.)
Flags: needinfo?(mdeboer)
Flags: needinfo?(hskupin)
Flags: needinfo?(gijskruitbosch+bugs)
Blocks: 1386834
(In reply to :Gijs from comment #3) > Where can I see a log file that actually shows this? The trypush doesn't > really make sense to me - there's no marionette jobs running on it at all, > as far as I can see. That comment and try push were copied from bug 1261598, where we have seen similar symptoms in a variety of browser tests, like mochitests: Firefox is started, the marionette client times out waiting to connect, and logging indicates that sessionstore-windows-restored was not received; in fact, it looked to me like onLoad() in browser.js was never called. I suppose there is an intermittent hang in firefox startup...or maybe a crash that goes undetected? That try push is quite old now and all the logs have expired. I can try to reproduce again next week, if there's still interest.
Flags: needinfo?(gbrown)
(In reply to Geoff Brown [:gbrown] from comment #4) > That try push is quite old now and all the logs have expired. I can try to > reproduce again next week, if there's still interest. Yes please! :-)
Blocks: 1387741
Clearing my ni? given that Geoff is taking care of it.
Flags: needinfo?(hskupin)
(In reply to Geoff Brown [:gbrown] from comment #4) > That try push is quite old now and all the logs have expired. I can try to > reproduce again next week, if there's still interest. Sigh. I'm not having any luck. There are no such failures in my recent try pushes: https://treeherder.mozilla.org/#/jobs?repo=try&revision=9123bc9821c353d9da13d15c9e298e2683e4b7ec https://treeherder.mozilla.org/#/jobs?repo=try&revision=d0df94d3e4aa4366c95d9e89acd3e2c3ce290556 The failure rate for bug 1261598 is currently very low -- perhaps a timing change has made this problem less likely? I could push an older rev to try, but I imagine reproducing on old code wouldn't be very helpful. I'll wait a few days and try again...
Flags: needinfo?(gbrown)
Flags: needinfo?(gbrown)
Geoff, please note that we see this problem in Marionette tests mostly on Windows 10 only those days. Maybe you should trigger those jobs there which you haven't done yet.
Blocks: 1390373
Blocks: 1391104
In https://treeherder.mozilla.org/#/jobs?repo=try&revision=a64c2a37a60b7fab8fb66f3ae43130a4507493fa I hit just one instance of this type of failure, and in this case sessionstore-windows-restored was received: https://archive.mozilla.org/pub/firefox/try-builds/gbrown@mozilla.com-a64c2a37a60b7fab8fb66f3ae43130a4507493fa/try-win64-debug/try_win8_64-debug_test-mochitest-e10s-devtools-chrome-5-bm126-tests1-windows-build22.txt.gz 13:36:59 INFO - Application command: C:\slave\test\build\application\firefox\firefox.exe -marionette -foreground -profile c:\users\cltbld~1.001\appdata\local\temp\tmpvxfxby.mozrunner 13:36:59 INFO - runtests.py | Application pid: 3040 13:36:59 INFO - TEST-INFO | started process GECKO(3040) 13:37:00 INFO - GECKO(3040) | 1503002220480 Marionette INFO Received observer notification "profile-after-change" 13:37:01 INFO - GECKO(3040) | 1503002221960 Marionette INFO Received observer notification "command-line-startup" 13:37:01 INFO - GECKO(3040) | 1503002221960 Marionette INFO marionette cmdline handler called 13:37:01 INFO - GECKO(3040) | 1503002221960 Marionette INFO Enabled via --marionette 13:37:01 INFO - GECKO(3040) | SessionStore now observing browser-window-before-show 13:37:01 INFO - GECKO(3040) | SessionStore now observing domwindowclosed 13:37:01 INFO - GECKO(3040) | SessionStore now observing quit-application-granted 13:37:01 INFO - GECKO(3040) | SessionStore now observing browser-lastwindow-close-granted 13:37:01 INFO - GECKO(3040) | SessionStore now observing quit-application 13:37:01 INFO - GECKO(3040) | SessionStore now observing browser:purge-session-history 13:37:01 INFO - GECKO(3040) | SessionStore now observing browser:purge-domain-data 13:37:01 INFO - GECKO(3040) | SessionStore now observing idle-daily 13:37:02 INFO - GECKO(3040) | 1503002221995 Marionette INFO marionette cmdline handler called 13:37:04 INFO - GECKO(3040) | ###!!! [Parent][MessageChannel] Error: (msgtype=0x240058,name=PContent::Msg_SetPluginList) Channel error: cannot send/recv 13:37:04 INFO - GECKO(3040) | ###!!! [Parent][MessageChannel] Error: (msgtype=0x24004C,name=PContent::Msg_GMPsChanged) Channel error: cannot send/recv 13:37:04 INFO - GECKO(3040) | ###!!! [Parent][RunMessage] Error: Channel error: cannot send/recv 13:37:04 INFO - GECKO(3040) | ###!!! [Parent][RunMessage] Error: Channel error: cannot send/recv 13:37:04 INFO - GECKO(3040) | ###!!! [Parent][RunMessage] Error: Channel error: cannot send/recv 13:37:04 INFO - GECKO(3040) | ###!!! [Parent][RunMessage] Error: Channel error: cannot send/recv 13:37:04 INFO - GECKO(3040) | ###!!! [Parent][RunMessage] Error: Channel error: cannot send/recv 13:37:04 INFO - GECKO(3040) | ###!!! [Parent][RunMessage] Error: Channel error: cannot send/recv 13:37:04 INFO - GECKO(3040) | ###!!! [Parent][RunMessage] Error: Channel error: cannot send/recv 13:37:04 INFO - GECKO(3040) | ###!!! [Parent][RunMessage] Error: Channel error: cannot send/recv 13:37:04 INFO - GECKO(3040) | ###!!! [Parent][RunMessage] Error: Channel error: cannot send/recv 13:37:04 INFO - GECKO(3040) | ###!!! [Parent][RunMessage] Error: Channel error: cannot send/recv 13:37:04 INFO - GECKO(3040) | ###!!! [Parent][RunMessage] Error: Channel error: cannot send/recv 13:37:04 INFO - GECKO(3040) | ###!!! [Parent][RunMessage] Error: Channel error: cannot send/recv 13:37:04 INFO - GECKO(3040) | ###!!! [Parent][RunMessage] Error: Channel error: cannot send/recv 13:37:04 INFO - GECKO(3040) | ###!!! [Parent][RunMessage] Error: Channel error: cannot send/recv 13:37:04 INFO - GECKO(3040) | ###!!! [Parent][RunMessage] Error: Channel error: cannot send/recv 13:37:04 INFO - GECKO(3040) | ###!!! [Parent][RunMessage] Error: Channel error: cannot send/recv 13:37:04 INFO - GECKO(3040) | ###!!! [Parent][RunMessage] Error: Channel error: cannot send/recv 13:37:04 INFO - GECKO(3040) | ###!!! [Parent][RunMessage] Error: Channel error: cannot send/recv 13:37:04 INFO - GECKO(3040) | ###!!! [Parent][RunMessage] Error: Channel error: cannot send/recv 13:37:04 INFO - GECKO(3040) | ###!!! [Parent][RunMessage] Error: Channel error: cannot send/recv 13:37:04 INFO - GECKO(3040) | ###!!! [Parent][RunMessage] Error: Channel error: cannot send/recv 13:37:04 INFO - GECKO(3040) | ###!!! [Parent][MessageChannel] Error: (msgtype=0x150085,name=PBrowser::Msg_UpdateNativeWindowHandle) Channel error: cannot send/recv 13:37:04 INFO - GECKO(3040) | ###!!! [Parent][MessageChannel] Error: (msgtype=0x150079,name=PBrowser::Msg_Destroy) Channel error: cannot send/recv 13:37:04 INFO - GECKO(3040) | browser.js - onLoad 13:37:04 INFO - GECKO(3040) | onBeforeBrowserWindowShown 13:37:04 INFO - GECKO(3040) | browser.js - onLoad notified browser-window-before-show 13:37:05 INFO - GECKO(3040) | initializeWindow 13:37:05 INFO - GECKO(3040) | notify sessionstore-windows-restored 13:37:05 INFO - GECKO(3040) | 1503002225825 Marionette INFO Received observer notification "sessionstore-windows-restored" 13:39:00 INFO - Traceback (most recent call last): 13:39:00 INFO - File "C:\slave\test\build\tests\mochitest\runtests.py", line 2658, in doTests 13:39:00 INFO - marionette_args=marionette_args, 13:39:00 INFO - File "C:\slave\test\build\tests\mochitest\runtests.py", line 2160, in runApp 13:39:00 INFO - self.marionette.start_session(timeout=port_timeout) 13:39:00 INFO - File "C:\slave\test\build\venv\lib\site-packages\marionette_driver\decorators.py", line 28, in _ 13:39:00 INFO - m._handle_socket_failure() 13:39:00 INFO - File "C:\slave\test\build\venv\lib\site-packages\marionette_driver\decorators.py", line 23, in _ 13:39:00 INFO - return func(*args, **kwargs) 13:39:00 INFO - File "C:\slave\test\build\venv\lib\site-packages\marionette_driver\marionette.py", line 1203, in start_session 13:39:00 INFO - self.protocol, _ = self.client.connect() 13:39:00 INFO - File "C:\slave\test\build\venv\lib\site-packages\marionette_driver\transport.py", line 223, in connect 13:39:00 INFO - self.sock.connect((self.addr, self.port)) 13:39:00 INFO - File "c:\mozilla-build\python27\Lib\socket.py", line 224, in meth 13:39:00 INFO - return getattr(self._sock,name)(*args) 13:39:00 INFO - error: [Errno 10061] No connection could be made because the target machine actively refused it 13:39:00 ERROR - Automation Error: Received unexpected exception while running application This is a Windows test, where marionette waits for the sanity test window after sessionstore-windows-restored -- I think that's where this got hung up. We are still seeing a few failures in bug 1261598 each day. That is a concern, but the low rate makes it hard to reproduce.
Flags: needinfo?(gbrown)
Blocks: 1352671
Blocks: 1405736
Blocks: 1410331
Blocks: 1411189
Blocks: 1407156
Blocks: 1402138
Blocks: 1414819
Blocks: 1300079
Blocks: 1418691
Blocks: 1419699
Blocks: 1423500
This might depend on bug 1414495.
Depends on: 1414495
Blocks: 1430717
Blocks: 1437170
Blocks: 1437312
Blocks: 1439335
Blocks: 1439549
Blocks: 1466790
Blocks: 1467292
Blocks: 1467740
Blocks: 1468208
Blocks: 1438572
Blocks: 1467891
Blocks: 1468109
Blocks: 1469138
Blocks: 1479014
Blocks: 1492244
Blocks: 1492135
Blocks: 1493042
Depends on: 1487243
Blocks: 1494826
Blocks: 1494964
Blocks: 1495668
Blocks: 1444600
Blocks: 1498944
Blocks: 1497094
Blocks: 1497743
Please note that there is a user on https://github.com/mozilla/geckodriver/issues/1408 who also seems to have this problem. I hope that I will be able to get more details.
Blocks: 1439869
Blocks: 1440208
Depends on: 1510271
Blocks: 1510271
No longer depends on: 1510271
Blocks: 1528287
Depends on: 1530979
Blocks: 1414495
No longer depends on: 1414495

Several bugs which track intermittent failures including bug 1414495 show that the failures are gone on mozilla-central since my patch on bug 1530979 (add a marionette-startup-requested notification) has been landed.

Could that mean that in cases when it failed in Firefox we may have a broken startup path for sessionstore? So that something fails while executing it, and no sessionstore-windows-restored message has been sent?

Florian, you mentioned that my patch shouldn't cause a difference, maybe you have an idea or know whom to forward this to?

Flags: needinfo?(florian)

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #13)

Florian, you mentioned that my patch shouldn't cause a difference, maybe you have an idea or know whom to forward this to?

Afaik, Dao and Mike care about session restore.

Flags: needinfo?(mdeboer)
Flags: needinfo?(florian)
Flags: needinfo?(dao+bmo)

On bug 1530979 we switched to a custom marionette-restart-requested observer notification. So updating the bug's summary.

As noticed like for bug 1540367 we still sometimes experience problems that the notification is not getting send out, and as such doesn't cause Marionette to initialize, which causes hangs in the test harnesses using Marionette.

Summary: Sometimes no 'sessionstore-windows-restored' observer notification is fired during startup → Sometimes no 'marionette-startup-requested' observer notification is fired during startup
Blocks: 1540367
Blocks: 1540329

So Kriti has seen this locally on bug 1483559 comment 11. So how often do actually you see this problem with Marionette not correctly initializing? Does it always happen for you? If that is the case it would be great if you could run some special steps for us.

Flags: needinfo?(mdeboer)
Flags: needinfo?(kritisingh1.ks)
Flags: needinfo?(dao+bmo)

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #17)

So Kriti has seen this locally on bug 1483559 comment 11. So how often do actually you see this problem with Marionette not correctly initializing? Does it always happen for you? If that is the case it would be great if you could run some special steps for us.

We would like to see a profile of what's happening. Here is how you could capture it:

MOZ_PROFILER_STARTUP=1 MOZ_PROFILER_SHUTDOWN="shutdown.json" ./mach <whatever command you were running to start the test that triggered the problem>

You should see a browser window on screen but the test won't start. Quit the browser (eg. press ctrl+Q, or click File->Quit in the menu, or something equivalent that closes the browser without killing it). At the end of shutdown, you should find that a shutdown.json file has been written on the disk, in the current directory.

Next, start the same browser (eg. ./mach run), load https://profiler.firefox.com/, install the add-on that's offered there.
Drop the shutdown.json file into the profiler.firefox.com webpage (or use the "Load a profile from file" button). Wait for it to load, and the various messages at the top ("Waiting for symbol tables for ...", "Symbolicating call stacks...", ...) to disappear, and click the "Share…" button. This should let you upload the profile, and provide a short url that can be pasted here.

:florian profile's URL is here : https://perfht.ml/2uLUnvl
:whimboo this occurs only for the screenshot.py test. Other wdspec tests which are .html files are running fine. I have not tested for other .py files, will let you know shortly.

Flags: needinfo?(kritisingh1.ks)

(In reply to Kriti Singh from comment #19)

:florian profile's URL is here : https://perfht.ml/2uLUnvl

Thanks! In this profile, marionette begins to start about 8s after you pressed ctrl+q. Any chance you could do it again, but wait about 30s after the browser window appears before quitting?

Also, please add this environment variable:
MOZ_PROFILER_STARTUP_ENTRIES=10000000

:florian, for some reason, the browser window is not starting up now. I reran the command after removing existing shutdown.json file and doing ./mach build but nothing changes. Also last time, I had to quit at 8s because the browser window quitted itself within 10-15 seconds. Here are the logs https://pastebin.com/bEnTLCNY

If the browser quits itself for some reason without crashing, a shutdown.json file should still be written to the disk. It may show us the reason why the browser quit.

Blocks: 1543111
Blocks: 1543654

For some strange reason, the tests are working as expected now. Although I'm not sure of what's going on internally, since this happens randomly(atleast in my experience), this might be a possible race condition.

Blocks: 1543671

(In reply to Kriti Singh from comment #23)

For some strange reason, the tests are working as expected now. Although I'm not sure of what's going on internally, since this happens randomly(atleast in my experience), this might be a possible race condition.

When it happens how does Firefox look like? Is the UI fully loaded, or is something missing? Do you see something else in the browser console, or in the terminal window? Does it happen for opt and debug builds for you?

Flags: needinfo?(kritisingh1.ks)
Blocks: 1547529
Blocks: 1557580

The following job log gives some more details:

https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=250487954&repo=mozilla-central&lineNumber=1322-1329

[task 2019-06-06T23:21:55.363Z] 23:21:55 INFO - 1559863315359 Marionette TRACE Received observer notification toplevel-window-ready
[task 2019-06-06T23:22:09.404Z] 23:22:09 INFO - [GFX1-]: Killing GPU process due to IPC reply timeout
[task 2019-06-06T23:22:09.405Z] 23:22:09 INFO - [GFX1-]: Receive IPC close with reason=AbnormalShutdown
[task 2019-06-06T23:22:09.406Z] 23:22:09 INFO - [GFX1-]: Receive IPC close with reason=AbnormalShutdown
[task 2019-06-06T23:22:09.407Z] 23:22:09 INFO - ###!!! [Parent][RunMessage] Error: Channel error: cannot send/recv
[task 2019-06-06T23:22:09.408Z] 23:22:09 INFO - [GFX1-]: Receive IPC close with reason=AbnormalShutdown
[task 2019-06-06T23:22:09.409Z] 23:22:09 INFO - ###!!! [Parent][MessageChannel] Error: (msgtype=0x4C001A,name=PGPU::Msg_ShutdownVR) Channel error: cannot send/recv

We recently turned on the GPU process on Linux, so maybe this made it actually appear, and gives more details why it fails? Looks like there are issues with IPC. Matt, any idea what this could mean? Maybe it might be worth filing this as its own bug.

Flags: needinfo?(kritisingh1.ks) → needinfo?(matt.woodrow)

The log doesn't give much useful information unfortunately.

The GPU process took too long to respond to an IPC message, so we killed it

It's possible that something in graphics code actually hung, but there's not much to on here.

Has anyone tried to get a pernosco capture of this happening?

Flags: needinfo?(matt.woodrow)

(In reply to Matt Woodrow (:mattwoodrow) from comment #26)

Has anyone tried to get a pernosco capture of this happening?

No, it's all in CI only for now, and I never have seen it myself locally yet. :/ So we don't any recording with rr available.

Flags: needinfo?(matt.woodrow)
Flags: needinfo?(matt.woodrow)

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #27)

(In reply to Matt Woodrow (:mattwoodrow) from comment #26)

Has anyone tried to get a pernosco capture of this happening?

No, it's all in CI only for now, and I never have seen it myself locally yet. :/ So we don't any recording with rr available.

I'm pretty sure Pernosco is a service, which generates rr style traces for failures on try, if they're reproducible. My understanding is that this is intermittent, I don't know how frequent. Worth seeing with Matt if that's something pernosco can help with or not.

The intermittent rate is very low all the last weeks. So I doubt that it would make sense to have it right now. But still be interested to know how this would work.

Pernosco attempts to reproduce failures that it detects from the try pushes of enrolled users.

I asked roc, and it appears that this failure hasn't yet been caught unfortunately.

How important is this to get fixed? Sounds like it's fairly infrequent.

Matt, due to the not sent observer notification Marionette doesn't fully start-up, which means that test jobs depending on Marionette will hang, and Firefox will eventually be killed after 1000s of inactivity.

As mentioned above the failure rate is kinda low at the moment. So it might be hard to catch it. The number of closed bug on the blocking list indicates that. So I would suggest that I keep an eye on it for now, and will report back once the failure rate increased again.

Blocks: 1557592
Blocks: 1507121

Recently this mainly seem to happen on Linux but still very infrequently to actually figure out the reason.

Blocks: 1595173
Blocks: 1596616
Blocks: 1597532

Note that we now have a user of geckodriver who can reliable reproduce this problem on his machine. See https://github.com/mozilla/geckodriver/issues/1660#issuecomment-566255446.

Basically it is a very long startup time, and Marionette just gives up to connect before Firefox has been started. Hopefully we will have a startup profile soon.

This is most likely related to bug 1577713. Recently we have seen that specifically webdriver tests always timed out on Linux due to a very delayed marionette-startup-requested notification. Also users of geckodriver reported that this notification arrived more than 60s later. And because Marionette has a 60s startup timeout we kill the browser.

Over on bug 1577713 I will do some measurements soon for ccov builds to see which part of the startup process of Firefox takes that long.

Depends on: 1577713
Blocks: 1610266
Blocks: 1612026
Blocks: 1524813
No longer blocks: 1625413
Blocks: 1627293

Mass-removing myself from cc; search for 12b9dfe4-ece3-40dc-8d23-60e179f64ac1 or any reasonable part thereof, to mass-delete these notifications (and sorry!)

Blocks: 1644392
Blocks: 1645354
Blocks: 1656838
Blocks: 1655294
Blocks: 1659772
Blocks: 1659944
Blocks: 1664915
Blocks: 1676848
Blocks: 1527674

The underlying problem here is actually that no toplevel-window-ready notification is fired. It means for the failing jobs the initial Firefox window never appears, or is that much delayed that we run into the timeout, and finally kill the process.

Summary: Sometimes no 'marionette-startup-requested' observer notification is fired during startup → Sometimes no 'toplevel-window-ready' observer notification is fired during startup

Florian, given that this specific failure causes a lot of trouble for us in Marionette tests I wonder what we could do to nail down the problem and figure out why toplevel-window-ready isn't fired within a minute in a lot of cases. Are there any other helpful observer notifications that we could register for to get an idea in which step the startup of Firefox hangs/fails? Given that it's hard to trigger we could land such a patch temporary and use logger.trace() for the output.

Flags: needinfo?(florian)

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #71)

Florian, given that this specific failure causes a lot of trouble for us in Marionette tests I wonder what we could do to nail down the problem and figure out why toplevel-window-ready isn't fired within a minute in a lot of cases.

The best way I can think of is to put MOZ_PROFILER_STARTUP in the environment. And when we give up because we would timeout, find a way to send a signal that would dump the profiler to disk (like for shutdown profiling) instead of killing the process immediately. If sending a signal is not a practical solution, we could put temporary code for this specific bug to dump a profile after startup as taken an excessive amount of time (but less than the time that would cause the test to kill the process).
If we are not interested in profiling the tests themselves, we could stop the profiler as soon as startup is complete.

Maybe not relevant if you've encountered this on all platforms, but I've noticed that for tests running on Linux 64 opt (especially with the profiler running, but maybe not exclusively), we block 100+s during startup waiting for dbus to let us know if we need to enable accessibility features or not. Example profile: https://share.firefox.dev/3vNbMSy

Flags: needinfo?(florian)
Blocks: 1721732

We could get rid of this bug if we would allow Marionette to start earlier, and not having to wait for the very first browser window. There is a request for that over on bug 1726465. It won't be that easy, but on the other side would allow some other interesting test scenarios.

See Also: → 1726465
Blocks: 1456641
Blocks: 1743284
Depends on: 1726465

Since bug 1726465 has been fixed (window less support for Marionette) this observer notification is no longer in use. As such this bug should be marked as invalid.

Actually this bug is not in the Marionette component. As such I would suggest incomplete instead given that no other consumer triggered this particular situation yet.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → INCOMPLETE
No longer blocks: 1664915
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---

(In reply to Treeherder Bug Filer from comment #102)

New failure instance: https://treeherder.mozilla.org/logviewer?job_id=390482094&repo=autoland

This was mis-classified and should have been for bug 1790325. Closing the bug again.

Status: REOPENED → RESOLVED
Closed: 3 years ago2 years ago
Resolution: --- → INCOMPLETE
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---

(In reply to Treeherder Bug Filer from comment #105)

New failure instance: https://treeherder.mozilla.org/logviewer?job_id=392481633&repo=mozilla-central

This failure is for bug 1573448, which I've now reopened to avoid the inappropriate classification of this bug and all the sending of hundreds of emails to each recipient. I hope that this bug won't be reopened again.

Given that the former code doesn't exist anymore lets even close this bug as invalid now.

Status: REOPENED → RESOLVED
Closed: 2 years ago2 years ago
Resolution: --- → INVALID
You need to log in before you can comment on or make changes to this bug.