Open Bug 1741022 Opened 4 years ago Updated 1 month ago

Interrupted BrowserGlue startup tasks may leave the browser in a very bad state

Categories

(Firefox :: General, task, P3)

task

Tracking

()

People

(Reporter: mak, Unassigned)

References

(Depends on 2 open bugs)

Details

As a retrospective to Bug 1740987, I must admit our startup path is a bit weak, the browser is in a non-working state for a lot of features, because startup was interrupted.

This time it was _beforeUIStartup throwing due to _migrateUI(), but other parts of the BG startup suffer of similar issues. For example _onWindowsRestored(), as well as _onFirstWindowLoaded() are all methods that execute a bunch of unrelated code from different modules, and failing one will skip all the next ones.

For shutdown and idle tasks we solved the problem defining each single step as a task, and catching exceptions so that the process can continue even if one step fails.
Even if we decide to catch single cases, we should still cause evident test failures if something in the startup path throws, it should be possible to emit a test failure text that can be catched by the test harness.

Fwiw, this wouldn't have helped detecting the migrateUI case because most of those migrations run only if there's something to migrate, and tests unlikely have stuff to migrate. Moreover it's unclear if failing ui migration will leave the browser in an unusable state or not, because that depends on the specific migration we are running.

We've previously run into this problem at least in bug 1065998 and bug 1243549.

It'd be good to fix this thoroughly. I think that would involve something like:

  • chunking the work in the startup-critical methods into named arrow functions
  • running those sequentially within try...catch
  • in the catch statement, send out a telemetry event with the name of the arrow function (so we can see where failures are in the wild), and in automation, crash the process so the test will fail.

We could do this both for the browserglue startup work and for onLoad and onDOMContentLoaded and similar in browser.js .

Does that sound workable? It'll be a bit verbose but it'll be a lot more robust than what we have now.

The main downside I can think of is that if the process is in a bad state and continues trying to do more stuff, it could do more bad things and worsen the state of things... but we're already in a pretty bad place if this starts happening, so I'm not sure that should stop us.

Flags: needinfo?(mak)

(In reply to :Gijs (he/him) from comment #2)

  • chunking the work in the startup-critical methods into named arrow functions
  • running those sequentially within try...catch
  • in the catch statement, send out a telemetry event with the name of the arrow function (so we can see where failures are in the wild), and in automation, crash the process so the test will fail.

This is what we are doing in other methods (idle, shutdown) and we should probably start from this. I dont know if crashing is necessary, I did it on shutdown because at that time the test harness is already off, maybe on startup we could just trigger a TEST-UNEXPECTED-FAIL handler. But I don't know if crashing has huge disadvantages for the test harness (may it be more expensive to handle due to stack walking?).
The telemetry point is interesting, we can use a keyed scalar to report the name. Probably it should also be done for the other points where we already use the same approach, and unify them in behavior. A component could be written to manage it and reused.

The main downside I can think of is that if the process is in a bad state and continues trying to do more stuff, it could do more bad things and worsen the state of things... but we're already in a pretty bad place if this starts happening, so I'm not sure that should stop us.

Yes, there's no way to tell if Component A failing will cause Component B to fail in a worse manner, or long term harm. For example migrationUI will just not bump up the migration version if one migration throws, will do once fixed, but in the meanwhile new code that made us add the migration may end up using the old state.
I don't think there's an easy solution to that.

Flags: needinfo?(mak)

I can't think of a better component for this, so let's keep it in Firefox :: General.

So, when this fails, it's bad. It's really really bad, and that's what this bug is about - making recovery from things going wrong less bad. The worst-case scenario is a fully broken browser when things go bad and BrowserGlue doesn't catch it.

In terms of Severity, I don't know where that puts us. Obviously if there's something broken for everybody in BrowserGlue during startup, that's a S1. But this is more about building in some safety nets to make it harder to hit those bad scenarios. So I actually think this is a N/A severity.

Severity: -- → N/A
Priority: -- → P3

The Bugbug bot thinks this bug should belong to the 'Core::XPCOM' component, and is moving the bug to that component. Please revert this change in case you think the bot is wrong.

Component: General → XPCOM
Product: Firefox → Core

The product::component has been changed since the backlog priority was decided, so we're resetting it.
For more information, please visit auto_nag documentation.

Priority: P3 → --
Component: XPCOM → General
Product: Core → Firefox
Type: defect → task
Priority: -- → P3

The situation for BrowserGlue is now a lot better due to the use of the category manager to invoke lots of dependencies, with try...catches around each of those invocations. There's still some work left to do so I'll leave this bug open for that.

(In reply to :Gijs (he/him) from comment #2)

We could do this both for the browserglue startup work and for onLoad and onDOMContentLoaded and similar in browser.js .

I filed bug 2000833 for the browser-init.js (used to be browser.js) portion of this.

Depends on: 1981454, 1960308
See Also: → 2000833
You need to log in before you can comment on or make changes to this bug.