Open Bug 1741675 Opened 3 years ago Updated 12 days ago

Examine shutdown hangs caused by the updater (both background task and browser parent)

Categories

(Toolkit :: Application Update, task, P2)

People

(Reporter: jstutte, Unassigned)

References

(Blocks 1 open bug)

Details

(Whiteboard: [fidedi-ope])

Attachments

(1 obsolete file)

There is a decent number of cases where we end up hanging in nsThreadManager::Shutdown() while we wait on the BitsCommander thread to shut down.

It seems from this comment that the shutdown is supposed to empty the entire queue.

It might be worth checking if we should start the shutdown of this thread earlier and explicitly behind some AsyncShutdownBlocker rather than let it just happen at the latest possible moment.

Component: DOM: Content Processes → Application Update
Product: Core → Toolkit

Kirk, do you have any thoughts here? It looks like you were the original author of most/all of the code in question.

Flags: needinfo?(ksteuber)

I'm a bit surprised to hear about problems shutting this down.

(In reply to Jens Stutte [:jstutte] from comment #0)

It seems from this comment that the shutdown is supposed to empty the entire queue.

Yes, I would really like for it to finish whatever it's doing before it shuts down. It's less than ideal if we dispatch, say, a "cancel update download" task and it isn't executed before we shut down. But each task in that queue essentially executes a couple of system calls and returns. And it should be extremely unusual for there to be multiple items in the queue. So I would have expected it to shut down quickly.


I wanted to know what the BitsCommander threads were doing during the shutdown hang, so I surveyed the 100 most recent crash reports. This is what I found:

53 StartDownloadTask
8  MonitorDownloadTask

In the other 39 crash reports, I couldn't really make much sense of the BitsCommander stack. There was no event loop call. Every stack frame seemed to be in a system DLL. However, it was quite common for reports to share the same deepest frame. These are the frames that I saw at the bottom of the stacks:

30 CCache::AddElement
3  <unknown in combase.dll>
2  CClientContextActivator::CreateInstance
1  RtlAppendUnicodeStringToString
1  CComClassInfo::QueryInterface
1  IUnknown_QueryInterface_Proxy
1  CreateLookasidePath

I'm not sure what is going on there. I would be very interested to learn if anyone could enlighten me. (Here is an example of one like this.)

It might be worth checking if we should start the shutdown of this thread earlier and explicitly behind some AsyncShutdownBlocker rather than let it just happen at the latest possible moment.

The tasks that I saw executing, StartDownloadTask and MonitorDownloadTask, are the tasks that the update system runs when it starts up. The former initiates an update download, the latter reconnects to an existing download. Thus, I think that in most cases, starting the thread shutdown earlier is unlikely to result in the thread actually shutting down any earlier.

The crash reports that I surveyed mostly seemed to have come from sessions that hadn't been running very long. And the update system startup does not necessarily run right at Firefox startup. So it seems reasonably likely that the system's IBackgroundCopyManager interface is just a bit slow to initialize and, in a short Firefox session, is sometimes still initializing when we shut down.

It's also possible that sometimes attempting to initialize the system's IBackgroundCopyManager interface hangs. Personally, I would have expected that more of the crashes would come from longer running sessions if this were the case. But I don't really know; maybe what we are seeing here is a common distribution of session lengths.

I'm not really sure what the right thing to do here is. It seems reasonable to abort if we are still trying to connect to IBackgroundCopyManager when we start shutting down. But I'm not really sure how to accomplish that. And I have no idea what to say about the 39% of the cases I surveyed with the weird stacks.

Flags: needinfo?(ksteuber)
Severity: -- → S3
Priority: -- → P3

This is showing up in crash data for bug 1505660, which is a top nightly crasher. I think it's responsible for over half of those crashes. Some of the hangs BitsCommander is part of are not annotated correctly.

See Also: → 1505660

I am not good at reading Rust code, but it seems to me that we rely on request_count going down to 0 before we start the shutdown of the thread. There seems to be the possibility to cancel transfers, but I was not able to follow through whether we have some shutdown notification observer (or async shutdown blocker) that cancels all transfers on shutdown. Do we?

Flags: needinfo?(bytesized)

(In reply to Jens Stutte [:jstutte] from comment #4)

There seems to be the possibility to cancel transfers, but I was not able to follow through if we have some shutdown notification observer (or async shutdown blocker) that cancels all transfers on shutdown. Do we?

We very much do not want to cancel transfers on shutdown. The entire point of using BITS rather than an internal download mechanism is that the transfer can continue while Firefox is not running.

It's been a while since I've looked at this, but I believe that the intended shutdown flow looks like this:

Of those steps, the only one that stands out to me as a possible problem is whether dropping nsIBitsRequest in JS is properly causing it to be deconstructed in Rust. It might be nice to make that a bit more explicit, but I don't actually know of a way to explicitly tell JS to deconstruct an XPCOM instance.

Flags: needinfo?(bytesized)

(In reply to Kirk Steuber (he/him) [:bytesized] from comment #5)

Of those steps, the only one that stands out to me as a possible problem is whether dropping nsIBitsRequest in JS is properly causing it to be deconstructed in Rust. It might be nice to make that a bit more explicit, but I don't actually know of a way to explicitly tell JS to deconstruct an XPCOM instance.

Thanks, that's an interesting question, indeed. I assume there might be some GC/CC involved in order to make it happen? Nika, can you answer that question?

Flags: needinfo?(nika)

(In reply to Kirk Steuber (he/him) [:bytesized] from comment #5)

It's been a while since I've looked at this, but I believe that the intended shutdown flow looks like this:

Related to that drop question, there might also be a time window between the quit-application observer (assuming that everything up to step 3 happens synchronously from the shutdown()) and the final dec_request_count(), during which we could still receive and queue up a new request. I was only able to find that we check whether dispatching to the command thread succeeded, but again my Rust-mixed-with-JS reading capabilities are limited. You could consider using an isInOrBeyondShutdownPhase(SHUTDOWN_PHASE_APPSHUTDOWNCONFIRMED) check somewhere before accepting new requests?
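
To make that last suggestion concrete, here is a minimal sketch of such a guard, assuming the request-creation path is reachable from privileged JS; the function name and its caller are hypothetical, only isInOrBeyondShutdownPhase() is the real nsIAppStartup API:

  // Hypothetical guard around BITS request creation; only the
  // Services.startup.isInOrBeyondShutdownPhase() call is a real API.
  function maybeCreateBitsRequest(createRequest) {
    if (
      Services.startup.isInOrBeyondShutdownPhase(
        Ci.nsIAppStartup.SHUTDOWN_PHASE_APPSHUTDOWNCONFIRMED
      )
    ) {
      // Too late in shutdown: do not queue new work on the BitsCommander thread.
      return Promise.reject(
        new Error("Refusing to start a BITS request during shutdown")
      );
    }
    return createRequest();
  }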

Flags: needinfo?(bytesized)

Yeah, that might be a good idea.

Flags: needinfo?(bytesized)

(In reply to Jens Stutte [:jstutte] from comment #6)

(In reply to Kirk Steuber (he/him) [:bytesized] from comment #5)

Of those steps, the only one that stands out to me as a possible problem is whether dropping nsIBitsRequest in JS is properly causing it to be deconstructed in Rust. It might be nice to make that a bit more explicit, but I don't actually know of a way to explicitly tell JS to deconstruct an XPCOM instance.

Thanks, that's an interesting question, indeed. I assume there might be some GC/CC involved in order to make it happen? Nika, can you answer that question?

Assuming nsIBitsRequest is exposed into JS code, yes, there will be a strong reference to the object from the JS code which keeps it alive until it is GC'd. You might want to decouple the shutting down of the thread from the nsIBitsRequest object in JS being destroyed (in general it's best to avoid JS-owned objects doing meaningful work in destructors).
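
A tiny sketch of that decoupling idea, assuming an explicit teardown entry point exists; shutdownCommandThread() and bitsService are made-up names, only the observer service API is real:

  // Hypothetical illustration: tear down the BITS command thread at a
  // well-defined point instead of relying on the nsIBitsRequest wrapper
  // being GC'd; bitsService.shutdownCommandThread() is a made-up method.
  const quitObserver = {
    observe(subject, topic) {
      if (topic === "quit-application") {
        Services.obs.removeObserver(quitObserver, "quit-application");
        bitsService.shutdownCommandThread();
      }
    },
  };
  Services.obs.addObserver(quitObserver, "quit-application");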

Flags: needinfo?(nika)
See Also: → 1772908

If I read crash-stats correctly, this accounts for ca. 20% of all our shutdown hangs.

What strikes me is that only ~25% of them have a background task name set, apparently.

:bhearsum, would you consider increasing priority/severity here?

Flags: needinfo?(bhearsum)

(In reply to Jens Stutte [:jstutte] from comment #10)

If I read crash-stats correctly, this accounts for ca. 20% of all our shutdown hangs.

What strikes me is that only ~25% of them have a background task name set, apparently.

:bhearsum, would you consider increasing priority/severity here?

It certainly sounds like we ought to.

(In reply to Jens Stutte [:jstutte] from comment #10)

If I read crash-stats correctly, this accounts for ca. 20% of all our shutdown hangs.

What strikes me is that only ~25% of them have a background task name set, apparently.

:bhearsum, would you consider increasing priority/severity here?

I agree this probably ought to be higher severity, actually. As we discussed on Slack, this seems to have begun sometime in the 105 cycle, which is when we landed a bunch of background task work related to reengagement notifications.

However, if I'm reading crash stats correctly, it looks like the first occurrence of this on Nightly is with 20220813092239. This seems to correspond to this range of commits, which includes https://bugzilla.mozilla.org/show_bug.cgi?id=1700158.

Kirk, Nick - since you both worked on background task things in the 105 cycle I'm needinfo'ing you both (although the data seems to point at the background update work...)

Severity: S3 → S2
Flags: needinfo?(nalexander)
Flags: needinfo?(bytesized)
Flags: needinfo?(bhearsum)
Whiteboard: [fidedi-ope]

However, if I'm reading crash stats correctly, it looks like the first occurrence of this on Nightly is with 20220813092239. This seems to correspond to this range of commits, which includes https://bugzilla.mozilla.org/show_bug.cgi?id=1700158.

Kirk, Nick - since you both worked on background task things in the 105 cycle I'm needinfo'ing you both (although the data seems to point at the background update work...)

Bug 1700158 made it so that a failed BITS download did not start a Necko download; not downloading in the background update task means that the task exits. Apparently, it exits rather quickly, provoking this shutdown crash. So I think there's little specific impact from that ticket: it just prompts the existing shutdown race more frequently.

I think we want to pursue https://bugzilla.mozilla.org/show_bug.cgi?id=1741675#c7, and perhaps better information in the shutdown blocker metadata to get more information on the size of the BITS queue.

Flags: needinfo?(nalexander)

(In reply to bhearsum@mozilla.com (:bhearsum) from comment #11)

However, if I'm reading crash stats correctly, it looks like the first occurrence of this on Nightly is with 20220813092239. This seems to correspond to this range of commits, which includes https://bugzilla.mozilla.org/show_bug.cgi?id=1700158.

Wait, this bug was filed a year ago. I have very, very little experience with crashstats, so I don't really know what to look for here, but it seems wrong that a bug filed a year ago could be caused by patches merged 2 and 4 months ago. Am I misunderstanding something?

Flags: needinfo?(bytesized)

(In reply to Kirk Steuber (he/him) [:bytesized] from comment #13)

(In reply to bhearsum@mozilla.com (:bhearsum) from comment #11)

However, if I'm reading crash stats correctly, it looks like the first occurrence of this on Nightly is with 20220813092239. This seems to correspond to this range of commits, which includes https://bugzilla.mozilla.org/show_bug.cgi?id=1700158.

Wait, this bug was filed a year ago. I have very, very little experience with crashstats, so I don't really know what to look for here, but it seems wrong that a bug filed a year ago could be caused by patches merged 2 and 4 months ago. Am I misunderstanding something?

What it looks like is that that patch caused a major uptick in the number of these that we hit (again, if I'm interpreting the data correctly).

(In reply to Nick Alexander :nalexander [he/him] from comment #12)

Bug 1700158 made it so that a failed BITS download did not start a Necko download; not downloading in the background update task means that the task exits. Apparently, it exits rather quickly, provoking this shutdown crash. So I think there's little specific impact from that ticket: it just prompts the existing shutdown race more frequently.

I think we want to pursue https://bugzilla.mozilla.org/show_bug.cgi?id=1741675#c7, and perhaps better information in the shutdown blocker metadata to get more information on the size of the BITS queue.

So IIUC we finish the background task before all our startup ceremony has finished and then kick off the shutdown immediately, which leaves room for all kinds of races between things that are still initializing on other threads (with post-backs to the main thread, probably) and the advancement of shutdown phases on the main thread. And the more we optimize the background task's payload to run faster, the more hangs we'll see.

I assume the severity for users is effectively relatively low, as we just have a hidden process without a window sitting around until it crashes. But the noise we get from this probably justifies giving this some priority. I'd also not want to just ignore shutdown hangs from background tasks, as there might be other, more relevant cases hidden among them.

I wonder if we could identify an event that signals that our startup ceremony finished and block shutdown until that happens.

(In reply to Jens Stutte [:jstutte] from comment #15)

(In reply to Nick Alexander :nalexander [he/him] from comment #12)

Bug 1700158 made it so that a failed BITS download did not start a Necko download; not downloading in the background update task means that the task exits. Apparently, it exits rather quickly, provoking this shutdown crash. So I think there's little specific impact from that ticket: it just prompts the existing shutdown race more frequently.

I think we want to pursue https://bugzilla.mozilla.org/show_bug.cgi?id=1741675#c7, and perhaps better information in the shutdown blocker metadata to get more information on the size of the BITS queue.

So IIUC we finish the background task before all our startup ceremony has finished and then kick off the shutdown immediately, which leaves room for all kinds of races between things that are still initializing on other threads (with post-backs to the main thread, probably) and the advancement of shutdown phases on the main thread. And the more we optimize the background task's payload to run faster, the more hangs we'll see.

I assume the severity for users is effectively relatively low, as we just have a hidden process without a window sitting around until it crashes. But the noise we get from this probably justifies giving this some priority. I'd also not want to just ignore shutdown hangs from background tasks, as there might be other, more relevant cases hidden among them.

I wonder if we could identify an event that signals that our startup ceremony finished and block shutdown until that happens.

In regular browsing, this is probably something like final-ui-startup, but that event should be fired before we invoke the background task code.

I'm not aware of any other event that might play that role, leaving us with the status quo: components should (but fail to) accommodate shutdown even while they start up :(

(In reply to Nick Alexander :nalexander [he/him] from comment #16)

In regular browsing, this is probably something like final-ui-startup, but that event should be fired before we invoke the background task code.

I'm not aware of any other event that might play that role, leaving us with the status quo: components should (but fail to) accommodate shutdown even while they start up :(

Is _onWindowsRestored ever called in background tasks? I see that browser-startup-idle-tasks-finished is triggered by it, which would probably be sufficiently late. But the startup idle tasks contain things that we really do not want to run in case of an early shutdown, so it would probably be the wrong event anyway. Actually, I'd hope we do not try to run them at all in background task mode, but I did not try to follow the code path here.

A hacky way of mitigating this a bit could be to do a delayed dispatch of Services.startup.quit(Ci.nsIAppStartup.eForceQuit, exitCode); instead of invoking it directly. We could even measure the time from starting the background task to finishing it and calculate a delay based on that and a pref for a minimum background task lifetime or such. That pref could have different values for Nightly and Release, such that we capture more hangs on Nightly and can slowly work towards "components should (but fail to) accommodate shutdown even while they start up :(" ?
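
A rough sketch of that delayed-quit idea, assuming it would live in the background task framework; the pref name and helper function are hypothetical, the Timer and nsIAppStartup APIs are real:

  // Hypothetical sketch of delaying Services.startup.quit(); the pref name
  // and helper are made up for illustration.
  const { setTimeout } = ChromeUtils.importESModule(
    "resource://gre/modules/Timer.sys.mjs"
  );

  function quitAfterMinimumRuntime(taskStartMs, exitCode) {
    const minRuntimeMs = Services.prefs.getIntPref(
      "toolkit.backgroundtasks.minRuntimeMs", // hypothetical pref
      0
    );
    const delayMs = Math.max(0, minRuntimeMs - (Cu.now() - taskStartMs));
    // Give in-flight BitsCommander work a chance to finish before we quit.
    setTimeout(() => {
      Services.startup.quit(Ci.nsIAppStartup.eForceQuit, exitCode);
    }, delayMs);
  }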

Connected with this, at the opposite end of the lifecycle there might be an additional timing question, at least for background tasks spawned during browser shutdown: browser shutdown can be very expensive in terms of CPU cycles/threads, and it seems not unlikely that the background task just makes this worse and won't make progress on its own work immediately. We might want to think about an initial delay here, maybe as a command line parameter. If we evaluate this parameter early enough, we could even avoid the startup ceremony adding load here, too. (I did not check if we already have something like this.)

From recent nightly numbers (1 week) I see:

  • 15 non-background task instances. All I looked at have to do with BitsCommander or UpdateWatcher.
  • 172 background task ones, all of which have "backgroundupdate" as task name.

so this accounts for 100% of the nightly cases now.

We should probably rename this bug and analyze this further.

Summary: Check if the BitsCommander thread shutdown should be started earlier than during nsThreadManager::Shutdown → Examine shutdown hangs caused by the updater (both background task and browser parent)

(In reply to Jens Stutte [:jstutte] from comment #18)

From recent nightly numbers (1 week) I see:

  • 15 non-background task instances. All I looked at have to do with BitsCommander or UpdateWatcher.
  • 172 background task ones, all of which have "backgroundupdate" as task name.

so this accounts for 100% of the nightly cases now.

We should probably rename this bug and analyze this further.

This seems pretty much still the case.

The first few instances I looked at all showed the BitsCommander thread stuck waiting for a download command to be issued (at least that is what I understand from being inside bits_client::in_process::InProcessClient::start_job). I think we can have two different cases here:

  • the request has been issued after we already entered shutdown. That is something we should probably better avoid and I filed bug 1820517 for this.
  • the request has been dispatched to the command thread early enough, but nothing told us to wait for it to finish before we try to shut down the thread itself.

The thread is shut down either when the request count reaches 0 or if the entire service instance is dropped.

I think we should have an explicit shutdown blocker (probably for "quit-application-granted") that waits until our request queue is empty, together with the check from bug 1820517.
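
For concreteness, a minimal sketch of such a blocker, assuming the request bookkeeping is reachable from JS; the counter and drain promise are hypothetical stand-ins for the real BITS request tracking:

  // Hypothetical sketch of a quit-application-granted blocker; AsyncShutdown
  // and addBlocker() are real, the request bookkeeping is made up.
  const { AsyncShutdown } = ChromeUtils.importESModule(
    "resource://gre/modules/AsyncShutdown.sys.mjs"
  );

  let pendingBitsRequests = 0;
  let drainResolvers = [];

  // Called from wherever a BITS request completes, fails, or is cancelled.
  function onBitsRequestFinished() {
    if (--pendingBitsRequests === 0) {
      drainResolvers.forEach(resolve => resolve());
      drainResolvers = [];
    }
  }

  AsyncShutdown.quitApplicationGranted.addBlocker(
    "Bits: wait for pending BITS requests before thread shutdown",
    () =>
      pendingBitsRequests === 0
        ? Promise.resolve()
        : new Promise(resolve => drainResolvers.push(resolve)),
    // fetchState ends up in the shutdown hang annotations, giving us the
    // queue-size diagnostics mentioned earlier.
    { fetchState: () => ({ pendingBitsRequests }) }
  );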

Depends on: 1820517

(In reply to Nika Layzell [:nika] (ni? for response) from comment #9)

Assuming nsIBitsRequest is exposed into JS code, yes there will be a strong reference to the object from the JS code which is keeping it alive until it is GC'd. You might want to decouple the shutting down of the thread from the nsIBitsRequest object in JS being destroyed (in general it's best to avoid JS owned objects doing meaningful work in destructors)

And I assume we need to address this, too.

(In reply to Jens Stutte [:jstutte] from comment #19)

The first few instances I looked at were all showing the BitsCommander thread stuck waiting for a download command to be issued (at least that is what I understand from being inside bits_client::in_process::InProcessClient::start_job). I think we can have two different cases here:

  • the request has been issued after we already entered shutdown. That is something we should probably better avoid and I filed bug 1820517 for this.

We can rule this out as the main cause, as that bug is fixed now and we still see the hangs at a high rate.

  • the request has been dispatched to the command thread early enough, but nothing told us to wait for it to finish before we try to shut down the thread itself.

It is now more likely that this is the case we are seeing here.

The thread is shut down either when the request count reaches 0 or if the entire service instance is dropped.

I think we should have an explicit shutdown blocker (probably for "quit-application-granted") that waits until our request queue is empty.

Note that, if the hang is "inevitable" at the OS level, this might just move it to whatever shutdown phase we choose. But we can at least react explicitly and maybe gather some more diagnostic information in that case.

(In reply to Jens Stutte [:jstutte] from comment #20)

(In reply to Nika Layzell [:nika] (ni? for response) from comment #9)

Assuming nsIBitsRequest is exposed into JS code, yes there will be a strong reference to the object from the JS code which is keeping it alive until it is GC'd. You might want to decouple the shutting down of the thread from the nsIBitsRequest object in JS being destroyed (in general it's best to avoid JS owned objects doing meaningful work in destructors)

And I assume we need to address this, too.

Actually, looking at the definition of CompleteTask and the fact that shutdown is invoked only after CompleteTask finishes, I am inclined to say that on_finished is always called before the object is nulled out (the same goes for CancelTask).

Assignee: nobody → jstutte
Status: NEW → ASSIGNED
Attachment #9326039 - Attachment is obsolete: true

Recap of my understanding (thanks to :bytesized for patience and input):

  1. We already listen to quit-application in the updater and we shut down the request synchronously there. As this should be the only request ever alive, this should be functionally equivalent to having the proposed shutdown blocker. We might want to transform this listener into an async shutdown blocker itself, but that is not expected to change anything (other than having more things running in parallel during shutdown).

  2. Since we introduced the check to not create new requests during shutdown, we should thus be safe from seeing any active requests after quit-application has happened.

  3. In most of the instances I looked at recently we see the BitsCommander thread hanging while creating a COM object, such that it will not go away until we block on XPCOM thread shutdown. So we might want to examine which situations can lead to COM in general, or BITS in particular, not responding. One potentially dangerous situation could be OS shutdown; bug 1825917 would help to diagnose if we are seeing this kind of situation.

Unassigning as I have no plan for how to move forward here; this probably needs someone familiar with COM/BITS.

Assignee: jstutte → nobody
Status: ASSIGNED → NEW

Looking at crash-stats, more than half of these happen in the backgroundupdate task.

:max, do you agree we should raise priority here?

Flags: needinfo?(mpohle)

(In reply to Jens Stutte [:jstutte] from comment #24)

  1. In most of the instances I looked at recently we see the BitsCommander thread hanging while creating a COM object, such that it will not go away until we block on XPCOM thread shutdown. So we might want to examine which situations can lead to COM in general, or BITS in particular, not responding.

This seems to be true for all instances I randomly clicked on today.

One potentially dangerous situation could be OS shutdown; bug 1825917 would help to diagnose if we are seeing this kind of situation.

At least for the instances I clicked on today, the recently added annotation ShutdownReason showed AppClose, which means no system shutdown had been detected. Or more precisely: it means we did not receive a WM_ENDSESSION but a normal WM_QUIT, apparently.

I have discussed this bug, and as it stands it deserves a higher priority, especially in comparison to other, similar bugs. But the severity for users is so low that they barely even notice it (as you said in bug 1741675, comment 15), we still need to gather more information about its root causes, and it seems reasonable that the ongoing good work in bug 1505660 could positively influence this bug. So we hope that, once the next change has landed, it will help to identify possibly obscured root causes for this one. For now I hope that this will grant us the insights to fix it entirely, and the higher priority will remind us to get back to this bug. If it turns out that most of the problem got fixed by the latest changes, we will still be able to lower the priority again.

Meanwhile, allow me to ask whether I interpret the uptime in crash-stats correctly: do more than 95% of the shutdown hangs occur when the uptime was shorter than 5 minutes? Because if so, that is somewhat to be expected, and it also tells me that there is still something sane running even when it seems to hang.

Flags: needinfo?(mpohle) → needinfo?(jstutte)
Priority: P3 → P2

Looking at two recent random instances like this, I see:

  1. The runtime is only ~70 sec. Given that the terminator waits 60 + 3 sec and that there is probably some general overhead, the updater task apparently resolves its promise well before the background task framework times out, and the background task framework then initiates the normal shutdown.
  2. The BitsCommander thread is stuck somewhere doing RPC. This probably means that nothing in the updater task was waiting (long enough) for whatever processing is happening on that thread, but that the task just returned too early?

I did not follow the code through again, but rethinking this after understanding a bit better how the framework works, I would expect that there is some hole in the chain of promises that should ensure we do not return from the updater background task while work is still happening on the BitsCommander thread (see the sketch after this list). Once we fix that hole, we would either:

  • see successful runs that were just slower than normal, hence avoiding this race with shutdown
  • see the background task timeout of 10 minutes in our runtime if we are really stuck forever
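
To illustrate the suspected hole, a purely hypothetical sketch; startBitsDownload() stands in for whatever kicks off work on the BitsCommander thread:

  // Hypothetical illustration only: if the task's promise chain does not
  // await the BITS work, the task can resolve (and shutdown can start)
  // while the BitsCommander thread is still busy.
  async function runBackgroundUpdateTask() {
    // Suspected pattern: fire-and-forget, so the task resolves immediately
    // even though the download start is still in flight on another thread.
    startBitsDownload(); // hypothetical helper, not awaited

    // What comment 28 suggests we want instead: keep the task promise
    // pending until the BITS side reports back (or a bounded timeout):
    //   await startBitsDownload();
  }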

BTW, I do not expect that these cases will decrease now that bug 1832254 or the other improvements from bug 1832089 have landed.

Flags: needinfo?(jstutte)

Renewing the ni?, as a P2 probably deserves an assignee (and I surely cannot take this; too much Rust for me).

Flags: needinfo?(mpohle)

I have now taken this ticket in order to further monitor the numbers and to collect information about what could potentially cause it.

Thanks for comment #28. It's a really good summary of what is potentially going on and certainly a good starting point for adding debugging code.

Assignee: nobody → mpohle
Flags: needinfo?(mpohle)

(In reply to Max from comment #30)

I have now taken this ticket in order to further monitor the numbers and to collect information about what could potentially cause it.

In only 1 week I see 21982 WaitForAllAsynchronousShutdowns hangs, and 20645 (that is 93.92 %) are caused by the background updater.

I'd say it would be worth doing something about these.

Thanks for comment #28. It's a really good summary of what is potentially going on and certainly a good starting point for adding debugging code.

Are you planning to look into this?

Flags: needinfo?(mpohle)

An ugly but easy option to mitigate this a bit could be to set backgroundTaskMinRuntimeMS in the background updater task to a very high value, like 3 minutes or so. If that decreases the numbers significantly, we probably

see successful runs that were just slower than normal

as per comment 28, and we would know whether it is worth finding the assumed hole in the promise chain. Otherwise there might be something blocking going on with the underlying Windows APIs; I am not sure if we can have timeouts for those calls.

Edit: this way we could also indirectly detect if something else causes the early shutdown (like a WM_ENDSESSION) and not our own Services.startup.quit(Ci.nsIAppStartup.eForceQuit, exitCode); after 3 min + X.

See Also: → 1892062
Assignee: mpohle → nobody
Flags: needinfo?(mpohle)

(In reply to Jens Stutte [:jstutte] from comment #32)

Edit: this way we could also indirectly detect if something else causes the early shutdown (like a WM_ENDSESSION) and not our own Services.startup.quit(Ci.nsIAppStartup.eForceQuit, exitCode); after 3 min + X.

Looking at recent crashes after bug 1892062, it seems to me that we still see uptimes below 100 seconds most of the time. Assuming the backgroundTaskMinRuntimeMS change works as expected, this would hint at something else shutting us down? To confirm this we could add a shutdown reason for the programmatic self-close case, though it might not be completely trivial to pass that through the layers.

My gut feeling would be that we see a situation where Windows wants to gently close us (AppClose) while we are stuck waiting for CoCreateInstance, and we thus block the shutdown until the terminator steps in. It is less clear to me what the scenario would be when this happens, though OS shutdown seems most likely.

Flags: needinfo?(nshukla)

Ah sorry, this change was reverted almost immediately in bug 1893147 because it caused the installer (which creates/uses background tasks) to take at least 3 minutes to install.

Flags: needinfo?(nshukla)

Makes sense that this is unwanted behavior when things go well.

(In reply to Nipun Shukla from comment #34)

Ah sorry, this change was reverted almost immediately in bug 1893147 because it caused the installer (which creates/uses background tasks) to take at least 3 minutes to install.

Updating the query to look only at builds with that change shows significantly longer times there as well.

That would instead support:

I did not follow the code through again, but rethinking this after understanding a bit better how the framework works, I would expect that there is some hole in the chain of promises that should ensure we do not return from the updater background task while work is still happening on the BitsCommander thread.

so we are back to where we were in comment 28, and

To confirm this we could add a shutdown reason for the programmatic self-close case, though it might not be completely trivial to pass that through the layers.

is probably not helpful here.

My Rust is not good enough to be of much help, but from the stacks in the crashes I could imagine that one could debug this locally by adding a very long sleep before instantiating the BackgroundCopyManager. I'd expect to see very similar hangs then.

Flags: needinfo?(nshukla)

Unfortunately my Rust also isn’t good enough to be very useful for the investigation. I’m a bit unclear on whether we would like to have the 3 minute min task runtime change reintroduced (with some changes to prevent installer breakage) or if we’ve collected enough data in that regard. Otherwise I’ll try to introduce the sleep you recommended above once I have some time.

Flags: needinfo?(nshukla) → needinfo?(jstutte)

We collected enough data to know that we do not seem to wait for whatever is going on in the BitsCommander thread while it is creating a COM object.

To reproduce the failure locally, you could add a very long sleep before instantiating the BackgroundCopyManager and somehow trigger that code path (I do not know under what conditions this COM instantiation is activated). The stack from the BitsCommander thread in one of those crashes might give you a hint about what to activate, too. IIUC it all starts with StartDownloadTask, but I do not know how to force it to run (some of the tests might know?).

Flags: needinfo?(jstutte)