Closed Bug 1548515 Opened 6 years ago Closed 5 years ago

profileserver.py's command to trigger Quitter doesn't wait for all Firefox subsystems to initialize [was: Firefox exited with code 11 during profile initialization]

Categories

(Firefox Build System :: General, defect)

defect
Not set
normal

Tracking

(firefox69 fixed)

RESOLVED FIXED
mozilla69
Tracking Status
firefox69 --- fixed

People

(Reporter: NarcisB, Assigned: Gijs)

References

Details

Attachments

(1 file)

Log link: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=244159166&repo=autoland&lineNumber=268

Log snippet :

[task 2019-05-02T12:21:49.370Z] Firefox exited with code 11 during profile initialization
[task 2019-05-02T12:21:49.370Z] Firefox output (/builds/worker/artifacts/profile-run-1.log):
[task 2019-05-02T12:21:49.370Z] 1556799640565 addons.webextension.screenshots@mozilla.org WARN Loading extension 'screenshots@mozilla.org': Reading manifest: Invalid extension permission: mozillaAddons
[task 2019-05-02T12:21:49.370Z] 1556799640566 addons.webextension.screenshots@mozilla.org WARN Loading extension 'screenshots@mozilla.org': Reading manifest: Invalid extension permission: resource://pdf.js/
[task 2019-05-02T12:21:49.370Z] 1556799640566 addons.webextension.screenshots@mozilla.org WARN Loading extension 'screenshots@mozilla.org': Reading manifest: Invalid extension permission: about:reader*
[task 2019-05-02T12:21:49.370Z] 1556799644065 addons.xpi-utils WARN Error: XPI database modified after shutdown began(resource://gre/modules/addons/XPIDatabase.jsm:1304:17) JS Stack trace: saveChanges@XPIDatabase.jsm:1304:17


[task 2019-05-02T12:21:49.533Z] JavaScript error: resource://gre/modules/Sqlite.jsm, line 834: Error: Connection is not open.
[task 2019-05-02T12:21:49.534Z] JavaScript error: resource://gre/modules/IndexedDB.jsm, line 349: UnknownError: The operation failed for reasons unrelated to the database itself and not covered by any other error code.
[task 2019-05-02T12:21:49.534Z]
[task 2019-05-02T12:21:49.534Z] ###!!! [Parent][DispatchAsyncMessage] Error: PBackground::Msg_PBackgroundIDBFactoryConstructor Value error: message was deserialized, but contained an illegal value
[task 2019-05-02T12:21:49.534Z]
[task 2019-05-02T12:21:49.534Z]
[task 2019-05-02T12:21:49.534Z] ###!!! [Parent][DispatchAsyncMessage] Error: PBackgroundIDBFactory::Msg_PBackgroundIDBFactoryRequestConstructor Route error: message sent to unknown actor ID
[task 2019-05-02T12:21:49.534Z]
[task 2019-05-02T12:21:49.534Z] JavaScript error: resource:///modules/sessionstore/SessionStore.jsm, line 4606: NS_ERROR_UNEXPECTED: Component returned failure code: 0x8000ffff (NS_ERROR_UNEXPECTED) [nsIWindowMediator.getEnumerator]
[task 2019-05-02T12:21:49.534Z] JavaScript error: resource://gre/modules/UrlClassifierListManager.jsm, line 396: NS_ERROR_NOT_INITIALIZED: Component returned failure code: 0xc1f30001 (NS_ERROR_NOT_INITIALIZED) [nsIUrlClassifierDBService.getTables]
[task 2019-05-02T12:21:49.534Z] JavaScript error: resource://gre/modules/osfile/osfile_async_front.jsm, line 400: Error: OS.File has been shut down. Rejecting post to stat
[task 2019-05-02T12:21:49.534Z] ExceptionHandler::GenerateDump cloned child 212
[task 2019-05-02T12:21:49.534Z] ExceptionHandler::SendContinueSignalToChild sent continue signal to child
[task 2019-05-02T12:21:49.534Z] ExceptionHandler::WaitForContinueSignal waiting for continue signal...
[task 2019-05-02T12:21:49.534Z]
[task 2019-05-02T12:21:49.569Z] cleanup
[task 2019-05-02T12:21:49.569Z] + cleanup
[task 2019-05-02T12:21:49.569Z] + local rv=11
[task 2019-05-02T12:21:49.569Z] + cleanup_xvfb
[task 2019-05-02T12:21:49.569Z] ++ pidof Xvfb
[task 2019-05-02T12:21:49.570Z] + local xvfb_pid=37
[task 2019-05-02T12:21:49.570Z] + local vnc=false
[task 2019-05-02T12:21:49.570Z] + local interactive=false
[task 2019-05-02T12:21:49.570Z] + '[' -n 37 ']'
[task 2019-05-02T12:21:49.570Z] + [[ false == false ]]
[task 2019-05-02T12:21:49.570Z] + [[ false == false ]]
[task 2019-05-02T12:21:49.570Z] + kill 37
[task 2019-05-02T12:21:49.570Z] + screen -XS xvfb quit
[task 2019-05-02T12:21:49.712Z] No screen session found.
[task 2019-05-02T12:21:49.712Z] + true
[task 2019-05-02T12:21:49.712Z] + exit 11
[fetches 2019-05-02T12:21:49.713Z] removing /builds/worker/fetches
[fetches 2019-05-02T12:21:49.713Z] finished
[taskcluster 2019-05-02 12:21:50.023Z] === Task Finished ===
[taskcluster 2019-05-02 12:21:50.154Z] Artifact "public/build/profdata.tar.xz" not found at "/builds/worker/artifacts/profdata.tar.xz"
[taskcluster 2019-05-02 12:21:50.438Z] Artifact "public/build/profile-run-2.log" not found at "/builds/worker/artifacts/profile-run-2.log"
[taskcluster 2019-05-02 12:21:51.194Z] Unsuccessful task run with exit code: 11 completed in 174.362 seconds

It sounds like the generate-profile-linux64-shippable/opt job tried to shutdown the browser while it was still starting up? I don't think this would be specific to storage...

:mshal, your name is on a lot of the changes to taskcluster/ci/generate-profile/kind.yml (at least as far as hg.mozilla.org shows by default), do you have any idea who might be the owner of whatever's going on in this task?

Flags: needinfo?(mshal)

The code to generate the PGO profile has been around for a while, though we've been changing how it happens during the build by splitting it into several Taskcluster tasks. The actual running of Firefox during profile generation is mostly unchanged, though a possible source of new issues would be bug 1252556 where we now properly detect that Firefox failed to run during profile generation and report it as an error. Previously if Firefox crashed in this stage, we simply wouldn't have useful profile data and generate an unoptimized build. So if this is an intermittent issue, it could be something that was just unnoticed before. However, that bug landed in November, so it would have to be extremely intermittent.

This failure looks like it's happening in the first run of the browser: https://dxr.mozilla.org/mozilla-central/rev/083106d8fc7407c880a3a044c83d4e15e5961063/build/pgo/profileserver.py#73

That was added in bug 1229731 - during PGO generation, we load Firefox and then immediately shut it down, and then load Firefox a second time to generate the profile data. I think the interesting parts of the log are in #c0, but can also be seen by comparing profile-run-1.log in this build from a good build:

JavaScript error: resource://gre/modules/Sqlite.jsm, line 834: Error: Connection is not open.
JavaScript error: resource://gre/modules/IndexedDB.jsm, line 349: UnknownError: The operation failed for reasons unrelated to the database itself and not covered by any other error code.

###!!! [Parent][DispatchAsyncMessage] Error: PBackground::Msg_PBackgroundIDBFactoryConstructor Value error: message was deserialized, but contained an illegal value

###!!! [Parent][DispatchAsyncMessage] Error: PBackgroundIDBFactory::Msg_PBackgroundIDBFactoryRequestConstructor Route error: message sent to unknown actor ID

JavaScript error: resource:///modules/sessionstore/SessionStore.jsm, line 4606: NS_ERROR_UNEXPECTED: Component returned failure code: 0x8000ffff (NS_ERROR_UNEXPECTED) [nsIWindowMediator.getEnumerator]
JavaScript error: resource://gre/modules/UrlClassifierListManager.jsm, line 396: NS_ERROR_NOT_INITIALIZED: Component returned failure code: 0xc1f30001 (NS_ERROR_NOT_INITIALIZED) [nsIUrlClassifierDBService.getTables]
JavaScript error: resource://gre/modules/osfile/osfile_async_front.jsm, line 400: Error: OS.File has been shut down. Rejecting post to stat
ExceptionHandler::GenerateDump cloned child 212
ExceptionHandler::SendContinueSignalToChild sent continue signal to child
ExceptionHandler::WaitForContinueSignal waiting for continue signal...

Does that offer any clues as to what's happening?

Flags: needinfo?(mshal)

Yes, that's an incredibly useful explanation of what's going on, thank you!

It also makes it clear that "profile" means too many things in the source tree! There's what I arbitrarily dub "user settings disk profile", "profiler.firefox.com performance profile", and "PGO trace profile", I think. So in this case the task is initializing the "user settings disk profile" so that the one-off initialization logic it doesn't contaminate the "PGO trace profile" that we want to generate for PGO linking with stuff that's not part of the user's steady state usage.

So it sounds like there may be 2 problems going on here:

  1. The code at https://searchfox.org/mozilla-central/rev/b4e790d05f5a146d186c238bac5601a553581d23/build/pgo/profileserver.py#106 does cmdargs=['data:text/html,<script>Quitter.quit()</script>'], seems like it shuts down the browser earlier than it should if it wants the browser to have completed first-run initialization. As we've tried to avoid slowing down page-load by doing more things asynchronously, I think this may be too aggressive a way to conclude that the system is initialized. Also, as the browser now involves a lot of content-space things like about:newtab that don't block page load on their own but may perform initialization, it may behoove us to wait for them. It might also just be the case that this is a reasonable shutdown time and some code is too chatty with warnings.

  2. Something is crashing and we're not getting a backtrace / minidump. The ExceptionHandler::GenerateDump cloned child 212 suggests something fatal happened, but all I see in the log is JS spam and warnings. Whatever is crashing the process is concealed. It's possible this is actually related to LocalStorage NextGen as we have a suspicious crash in bug 1494117 that regression investigation suggest is LSNG but there's also no crash report there, which I filed at bug 1545856.

I'm going to speculatively re-purpose this bug to address the 1st case because we can't do anything about the 2nd case without a backtrace of the crash. (We will of course continue to monitor crash-stats and ASAN and otherwise attempt to reproduce whatever the underlying problem might be. Any assistance on bug 1545856 anyone can provide would be greatly appreciated!)

And so in pursuit of the 1st case, let's ask :mconley if he has any ideas about how to help the quitter logic happen at a potentially more representative time. From my perspective as someone involved with QuotaManager, waiting for IndexedDB to definitely be usable is probably a good way to make sure that QuotaManager has fully spun up and created its PROFILE/storage/storage.sqlite. But I think the about:newtab/etc. stuff is probably a big deal.

Note that it appears that bug 1393449 wants to replace quitter with Marionette and bug 1442741 wants someone to own quitter. However, this bug is probably still relevant as marionette needs to know when a good time is to shutdown, and that's a separate issue of how to trigger shutdown. (Although marionette may have a hard time triggering shutdown as aggressively as the quitter extension.)

Component: Storage → General
Flags: needinfo?(mconley)
Product: Toolkit → Firefox Build System
See Also: → 1393449
Summary: Firefox exited with code 11 during profile initialization → profileserver.py's command to trigger Quitter doesn't wait for all Firefox subsystems to initialize [was: Firefox exited with code 11 during profile initialization]

FYI snorp had replaced the use of quitter there with marionette in bug 1524992 (https://phabricator.services.mozilla.com/D21609), but it got backed out along with some other changes and wasn't part of the final set of patches. I think that part is probably land-able on its own? We already use marionette for the Android PGO script (testing/mozharness/scripts/android_emulator_pgo.py), so if it helps here I don't see any reason why we shouldn't use it.

Hm, so that patch seems to do the "user settings disk profile" generation step starting on line 102 via:

        driver.start_session()
        driver.restart(in_app=True)

It's not clear that this is doing any better of a job of waiting for the browser to fully initialize. It's hard to tell though because when I search for the Marionette API docs, I get https://marionette-client.readthedocs.io/en/master/reference.html and similar URLs, where every page on that domain results in a "Sorry" "This page does not exist yet." error page.

Sorry for the delay.

I'm not 100% sure how PGO works - do we collect different trace profiles for startup vs page loading? I thought we ran something like the tp5 set through the browser when collecting the PGO profile. Is that not the case?

At any rate, if you're looking for a better time to shut down the browser, I have some observer notifications we could wait for:

  1. browser-delayed-startup-finished fires after a browser window has reported that they've painted and done their "just after paint" JS. In this case, that'd fire not long after presenting the first browser window.

  2. browser-idle-startup-tasks-finished fires after a browser window reports that it has finished running its idle callback functions, which only run when the browser has enough budget to process low priority events.

Are either of those useful?

Flags: needinfo?(mconley) → needinfo?(bugmail)

Stealing this.

Assignee: nobody → gijskruitbosch+bugs
Status: NEW → ASSIGNED
Pushed by gijskruitbosch@gmail.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/6323cc613f96 Make quitter wait until we've run idle tasks, r=kmag
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla69

(it seems like :Gijs' fix for this went with "browser-idle-startup-tasks-finished" mentioned in comment 7 but using an explicit promise that operates in parallel with the observer notification, so this was addressed. Very late thanks, :Gijs! ;)

Flags: needinfo?(bugmail)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: