Open Bug 1832089 Opened 1 year ago Updated 11 months ago

[meta] Mitigate background task shutdown hangs

Categories

(Toolkit :: Background Tasks, defect)

defect

Tracking

()

People

(Reporter: jstutte, Unassigned)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Keywords: meta)

It seems from many shutdown hangs that background tasks finish their work before some components are initialized, resulting in weird states and hangs.

To mitigate this, we could add a grace period after the background task finished its payload before shutting down. I assume here in runBackgroundTaskNamed could be a good place to add a waiting promise.

Blocks: 1798904

Note also that there are things that potentially do something after the RunBackgroundTask returned (thus after any timeout we add there this is not true, as any async/await will actually make us return and dispatch a new delayed runnable that will then continue the execution. So we might return even in the middle of the execution and then enter our main loop):

  1. AddScreenWakeLockListener(); (I wonder if that can ever be useful for background tasks?)
  2. GetTopSEHFilter(); (not sure how to follow the code path here, but setting exception handlers might have OS level side effects hard to control?)

I wonder if we should better wrap the background task into a runnable we dispatch to the main thread, letting it be executed later on top of our normal message loop run? Apart from the above, if ever something was able to dispatch events asynchronously to MT before we come here, we would run them before our payload runs.

Moving the discussion from bug 1798904:

(In reply to Nick Alexander :nalexander [he/him] from comment #12)

(In reply to Jens Stutte [:jstutte] from comment #9)

Interestingly we made a quite good job in avoiding AsyncShutdownTimeout for background tasks, see for example bug 1820517, bug 1814104, bug 1799421. The remaining hangs are probably "things that start and were never even designed to interact with shutdown" such that the terminator catches them.

I think we have several paths here (and might want all):

  1. Mitigate the impact by having an (arbitrary) timeout at the end of the background task before starting the shutdown sequence of the process. This is an easy hack but obviously not really nice. And we should check the Shutdown Reason annotation if we see many cases of OS shutdown (my gut feeling is: no, but it is not yet aggregatable, so we should take a closer look) - delaying process end could delay OS shutdown. But I assume the backgroundupdate is something we run anytime during normal operations, IIUC.
    Edit regarding 1.: I assume here in runBackgroundTaskNamed could be a good place to add a waiting promise.

backgroundupdate has an arbitrary short wait as it tries to let Glean uploads proceed: https://searchfox.org/mozilla-central/rev/373d05f4eabdb90a6480d5d36d983b8bff36c9d8/toolkit/mozapps/update/BackgroundTask_backgroundupdate.sys.mjs#414-416. Glean has improved greatly so this is no longer neccessary.

I'd be happy to lift this delay to a per-task-configured delay (like the per-task-configured timeout) in the tasks manager, and to extend it, if it might help. But I doubt that it would help in a non-linear fashion: we have racing startup and shutdown and racers gonna race. I'd like to see more labeling of activities at startup and shutdown to help us understand ones that are failing in the wild.

I think we might want to be able to try around a bit with different values for that delay. 500ms reads pretty short, at least on low end machines. And we might want to consider also having a "launch delay" for the above runnable, to give more things the occasion to dispatch their events to MT before we block it with our payload.

Keywords: meta
Summary: Mitigate background task shutdown hangs by adding a grace period before shutting down → [meta] Mitigate background task shutdown hangs

Assuming we want to try different things here.

Depends on: 1832252
Depends on: 1832254
Depends on: 1832288
Depends on: 1832773
Severity: -- → S3
You need to log in before you can comment on or make changes to this bug.