Closed Bug 1928288 Opened 1 year ago Closed 11 months ago

Process remains for a long time (approx. 30 seconds) after closing Firefox

Categories

(Data Platform and Tools :: Glean: SDK, defect, P3)

Desktop
Windows 11
defect

Tracking

(firefox132 wontfix, firefox133 wontfix, firefox134 wontfix, firefox135 fixed)

RESOLVED FIXED
Tracking Status
firefox132 --- wontfix
firefox133 --- wontfix
firefox134 --- wontfix
firefox135 --- fixed

People

(Reporter: alice0775, Assigned: chutten|PTO)

References

(Regression)

Details

(Keywords: nightly-community, regression)

Attachments

(2 files)

Steps to reproduce:

  1. Start Nightly (firefox.exe) with a newly created clean profile
  2. Click the new tab button
  3. Close the tab
  4. Quit the browser
  5. Start Nightly again
  6. Repeat steps 2 to 5

Actual results:
An alert dialog pops up as follows:
Firefox is already running, but is not responding. The old Firefox process must be closed to open a new window.

Expected results:
No alert; Nightly starts normally.

Workaround:
Disabling telemetry with policies.json seems to fix the issue:

{
 "policies": {
   "DisableTelemetry": true
 }
}
Attached file about:support
Summary: Process remains for a long time after closing Firefox → Process remains for a long time (approx. 30 seconds) after closing Firefox
OS: Unspecified → Windows 11
Hardware: Unspecified → Desktop

Set release status flags based on info from the regressing bug 1838856

:timhuang, since you are the author of the regressor, bug 1838856, could you take a look? Also, could you set the severity field?

For more information, please visit BugBot documentation.

Flags: needinfo?(tihuang)

I don't think Bug 1838856 is the root cause here, because it only records a Glean probe when canvas randomization is triggered; it doesn't block shutdown at all. Maybe this is an issue with the telemetry implementation?

Flags: needinfo?(tihuang)

Hi Chris, any thoughts on this?

Flags: needinfo?(chutten)

There are a lot of systems and behaviours governed by the enterprise policy DisableTelemetry: data collection systems Legacy Telemetry and Glean, experimentation system Nimbus, broken site reporter, ...

Alas the only ones I know much about are the first two: Legacy Telemetry and Glean.

Glean's shutdown behaviour is pretty simple: first, if we're in the middle of init, finish that. Then, flush any pending data to the db on disk and wait at most 30s for the uploader to finish uploading what it's doing. Usually the uploader isn't doing anything at all, and if it is uploading something it takes far less than 30s to do it, so unless you have a very slow disk or very slow network, this isn't where I'd go looking for our culprit (though if you could gather a profile of this behaviour, that'd really help; I can't reproduce it myself).

Legacy Telemetry is more complicated. It doesn't have a db: its only persistence is based on saving submitted pings to disk. This involves doing however much work is necessary to gather all the information for those pings together, and then writing them down. Once for every kind of ping. On the first run of Firefox, I'd expect there to be quite a few pings: possibly an "update", "main", "first-shutdown", probably an "event" in there, "new-profile", maybe "health"... a lot of code to run and a lot of data to save all at once. This is the system that feels more likely to be the culprit. Again, a profile would very much help. If that's not working out, then setting toolkit.telemetry.log.level to Trace and getting the results of that from the browser console (or, by setting toolkit.telemetry.log.dump to true, from stdout) would be almost as good (docs).

In both cases, if DisableTelemetry is set to true, very little of this machinery will do a darned thing.

I have good news and bad news. The good news is that we're fixing this. The bad news is that it'll take a long time. This is because the way we're fixing this is by decommissioning Legacy Telemetry (which will take its reliance on at-shutdown activities along with it), which we can only do once everything important has been moved over.

I don't think this bug can move forward without a profile or at least logs of an affected run. And if it identifies Legacy Telemetry it's unlikely that we'd be able to prioritize fixing a slow async shutdown over the work of migrating data collection away.

Flags: needinfo?(chutten)

The issue is reproduced, and the profiler log was captured with MOZ_PROFILER_SHUTDOWN: https://share.firefox.dev/3Z2NWmo

FWIW, when the policies.json ("DisableTelemetry": true) is set, the log is https://share.firefox.dev/3UOvPhx

Thank you for such a prompt profile!

Wouldjalookitthat, it's the full 30s wait in glean_core::uploader_shutdown() after all. This is very illuminating, thank you. I'll move this bug into the Glean SDK component here.

Rather irritatingly, we're not getting all the threads in this profile. I'd like to see what the glean.shutdown and (more importantly) glean.upload threads are doing. Alice0775 White, if you are willing to reproduce this in the profiler again could you select the Debug preset in the Profiler UI? That should get those threads for us.

Without knowing where exactly they may be stuck, let's do some code inspection... well, the main thread is in uploader_shutdown. It records some data about how long it waits, so we can look at that... Okay, so on release this affects over 1% of shutdowns which both is and isn't a lot. It's good confirmation that, usually, everything is okay at shutdown. Which is nice on the whole, but doesn't help us in the specific. Back to the code.

So it records how long it takes, then spawns a thread (glean.shutdown) whose job it is to synchronously call state.callbacks.shutdown() while the thread calling uploader_shutdown (the main thread) waits at most 30s for that to happen. If state.callbacks.shutdown() takes longer than 30s to complete, we log a warning and complete shutdown.
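
To make that shape concrete, here is a minimal Rust sketch of the spawn-and-bounded-wait pattern. This is illustrative only, not the actual glean_core code; the names ending in _sketch are mine.

use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Placeholder for whatever state.callbacks.shutdown() actually does.
fn shutdown_callbacks_sketch() {
    // ... flush pending uploads, join the upload thread, etc.
}

fn uploader_shutdown_sketch() {
    let (done_tx, done_rx) = mpsc::channel();

    // A "glean.shutdown"-style thread runs the callbacks synchronously,
    // then reports completion over the channel.
    thread::Builder::new()
        .name("glean.shutdown".into())
        .spawn(move || {
            shutdown_callbacks_sketch();
            let _ = done_tx.send(());
        })
        .expect("failed to spawn shutdown thread");

    // The calling (main) thread waits at most 30 seconds for that signal.
    match done_rx.recv_timeout(Duration::from_secs(30)) {
        Ok(()) => eprintln!("shutdown callbacks finished in time"),
        Err(_) => eprintln!("shutdown callbacks took longer than 30s; proceeding anyway"),
    }
}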

Okay, so we're obviously stuck inside state.callbacks.shutdown(). What does that do? Nothing much, it just calls upload_manager.shutdown().

What does upload_manager.shutdown() do? Well, it signals via the thread_running atomic for the glean.upload thread to begin shutting down. Then it waits for the glean.upload thread to complete. We could theoretically be stuck in either of two places here: in storing State::ShuttingDown to the thread_running atomic (only if lock-free atomic u8 is unavailable on your platform; this being Win64, it's almost certainly available), or in waiting for the glean.upload thread to react to State::ShuttingDown and exit.
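
As a hedged sketch of that signal-and-join shape (the field names and the state constant here are placeholders, not the real glean_core types):

use std::sync::atomic::{AtomicU8, Ordering};
use std::sync::Arc;
use std::thread::JoinHandle;

// Illustrative stand-in for State::ShuttingDown; the real enum lives in glean_core.
const SHUTTING_DOWN: u8 = 1;

struct UploadManagerSketch {
    thread_running: Arc<AtomicU8>,
    upload_thread: Option<JoinHandle<()>>,
}

impl UploadManagerSketch {
    fn shutdown(&mut self) {
        // Signal the glean.upload thread that it should wind down.
        self.thread_running.store(SHUTTING_DOWN, Ordering::SeqCst);

        // Then block until that thread actually exits. If the upload thread
        // is asleep inside a long Wait, this join is where we sit.
        if let Some(handle) = self.upload_thread.take() {
            let _ = handle.join();
        }
    }
}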

I'm going to assume it's the second one. Let's see what the glean.upload thread could be doing. First and foremost, it might not yet exist. But if it doesn't, then we don't actually wait for it to stop running, so let's ignore that. If it exists, that means it's gotten at least this far and is in the following loop (a sketch of that loop follows this list). It could be:

  • Getting the next upload task via glean_get_upload_task. This is a function that should quickly return an instruction to upload a ping, wait for a bit because we've reached our limit of 15 pings sent within 60s (which is configurable), or finish the thread.
    • I don't think this is where we're stuck as this is a quick return. An all-thread profile would tell us for certain.
  • Uploading a ping. This takes however long it takes network to happen.
    • You haven't mentioned that your network is slow, so I imagine this isn't where we're stuck. If we are, we'd be in inner.uploader.upload.
  • Being asked to wait because, though there are pings to send, we've sent too many. The longest wait it can return is the max of WAIT_TIME_FOR_PING_PROCESSING (which is 1s) and one throttle interval: the same 60s interval that dictates how many pings can be sent.
    • Well, this has potential, doesn't it. I wonder how many pings we've been asked to send and over what interval in this case. If we're throttled, the glean.upload thread will be in this sleep call for the full duration of the 30s shutdown wait.
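
For orientation, here is a minimal Rust sketch of that loop shape. The names ending in _sketch are mine; this is not the real glean.upload thread code, just an illustration of where a plain sleep would block the shutdown signal.

use std::sync::atomic::{AtomicU8, Ordering};
use std::thread;
use std::time::Duration;

// Illustrative stand-ins for the tasks glean_get_upload_task can hand back.
enum UploadTaskSketch {
    Upload,    // a ping request to send over the network
    Wait(u64), // milliseconds to back off because we're throttled
    Done,      // nothing left to do (or we're shutting down)
}

fn next_upload_task_sketch() -> UploadTaskSketch {
    // Placeholder; the real decision logic lives in glean_core.
    UploadTaskSketch::Done
}

const SHUTTING_DOWN: u8 = 1;

fn upload_loop_sketch(thread_running: &AtomicU8) {
    loop {
        match next_upload_task_sketch() {
            // Send the ping (inner.uploader.upload in the real code);
            // this takes however long the network takes.
            UploadTaskSketch::Upload => { /* network happens here */ }

            // Throttled: a plain sleep cannot observe the shutdown signal
            // until the whole wait elapses -- the suspected 30s stall.
            UploadTaskSketch::Wait(ms) => thread::sleep(Duration::from_millis(ms)),

            // Nothing more to do: exit the thread.
            UploadTaskSketch::Done => break,
        }

        // In this sketch the shutdown signal is only observable between tasks.
        if thread_running.load(Ordering::SeqCst) == SHUTTING_DOWN {
            break;
        }
    }
}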

If that's indeed where we're sleeping, we have a couple of choices. We can change the ping upload interval to be shorter (decreasing the pings we can send per interval accordingly so we don't DDoS ourselves) so that the maximum Wait time is shorter (15 pings per 60s could be factored down as low as 1 ping per 4s, but we have to figure out what to do if Nimbus configures us to send a different number of pings per minute). We can increase the number of pings we can send per interval (which we can do via Nimbus, off the trains) so we're less likely to get a Wait in the first place. Or we could make the upload manager check for shutdown during its sleep (by either turning shutdown into a channel that it waits on, or simply subdividing the sleep interval into smaller units and checking for shutdown on each subinterval; see the sketch below).
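
To make that last option concrete, a hedged sketch of a subdivided sleep. The flag and interval values are placeholders; the channel-based alternative is sketched further down.

use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;
use std::time::Duration;

// Sleep for `total`, but wake every `step` to check whether shutdown has been
// requested. Purely illustrative of the "subdivide the sleep" idea; the fix
// that was eventually merged interrupts Wait tasks via a channel instead.
fn subdivided_sleep_sketch(total: Duration, step: Duration, shutting_down: &AtomicBool) {
    let mut remaining = total;
    while remaining > Duration::ZERO && !shutting_down.load(Ordering::SeqCst) {
        let chunk = remaining.min(step);
        thread::sleep(chunk);
        remaining -= chunk;
    }
}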

Alice0775 White, would you be willing to get a Debug profile of this behaviour so that I can use the view on all threads to confirm that this is what we're waiting on?

Component: Telemetry → Glean: SDK
Flags: needinfo?(alice0775)
Product: Toolkit → Data Platform and Tools

(( Oh, before I forget, there are performance reasons to suggest that waiting less often and for longer is a better use of resources (power, CPU), so we shouldn't jump straight for switching from one thread wakeup per 60s to 15 thread wakeups per 60s without giving things a little think. ))

Profiler log w/ Debug preset : https://share.firefox.dev/3YMRsQr

Flags: needinfo?(alice0775)

Thank you!

Aw, dang, that doesn't have any more Glean SDK threads in it. And I should've known that'd be the case because I helped write bug 1899618.

We may have to approach this speculatively, then. I'm a fan of the approach of making Wait tasks time out on a channel that shutdown would transmit over, otherwise leaving the surrounding behaviour of throttling and waiting the same. If this works, we should see in the data a decrease in 30s waits in the glean.validation.shutdown_wait metric.
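
A minimal sketch of that shape, assuming an mpsc shutdown channel (the names here are mine; the actual patch in the Glean repo may differ in detail):

use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

// Handle a Wait task by blocking on a shutdown channel with a timeout.
// Returns true if shutdown was requested before the wait elapsed.
fn wait_or_shutdown_sketch(wait: Duration, shutdown_rx: &mpsc::Receiver<()>) -> bool {
    match shutdown_rx.recv_timeout(wait) {
        // A shutdown signal arrived (or the sender went away): stop waiting now.
        Ok(()) | Err(RecvTimeoutError::Disconnected) => true,
        // The full throttle wait elapsed with no shutdown: carry on as before.
        Err(RecvTimeoutError::Timeout) => false,
    }
}

fn main() {
    let (shutdown_tx, shutdown_rx) = mpsc::channel::<()>();

    // Stand-in for the glean.upload thread sitting in a 60s throttle Wait.
    let uploader = thread::spawn(move || {
        let interrupted = wait_or_shutdown_sketch(Duration::from_secs(60), &shutdown_rx);
        eprintln!("wait interrupted by shutdown: {interrupted}");
    });

    // At shutdown, the main thread sends on the channel and the Wait returns
    // immediately instead of running out the clock.
    let _ = shutdown_tx.send(());
    let _ = uploader.join();
}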

Alas, I don't have the time to pick this up right now or possibly in the near future. But this bug's in the right component now and will get picked up in triage on Monday.

Thank you again for all your help in both filing and supplying profiles for this issue.

See Also: → 1899618

(In reply to Alice0775 White from comment #10)

Profiler log w/ Debug preset : https://share.firefox.dev/3YMRsQr

Have you noticed the 2 RunProcess threads that last 30s? https://share.firefox.dev/3YMzAFh

Flags: needinfo?(chutten)

I had not thought to look for thread markers, no; though I've had a long acquaintance with the Gecko profiler, I haven't actually used it that often. Thank you for pointing them out!

Not sure what RunProcess threads do... I'm seeing threads getting named that in nsProcess::Monitor but only if coming in via nsProcess::CreateMonitorThread() (which simply means it's monitoring a process in a non-blocking manner). That seems to happen... let's see: screenshots (when screenshotting), crashes (minidump analyzer), remote debugging (adb), and telemetry (pingsender).

Most likely these are pingsender invocations for at-shutdown Telemetry pings. (( Could be nice to annotate the RunProcess thread marker with the executable name and args. )) Which is fine: pingsender outlives the firefox process normally. If we shorten the glean.upload thread exit, those monitoring threads'll just go away with the parent process later.

Flags: needinfo?(chutten)

Florian pointed out that those RunProcess threads actually disappear before the end of the profile, after exactly 30s of runtime. Not sure what's going on there. If it's a background-task-style pingsender, the task could cap its execution time... but it doesn't, so it uses the default 10min timeout.

So maybe those monitor threads are related and either are causing or will cause the delayed shutdown even if we change how glean.upload responds at shutdown.

Assignee: nobody → chutten
Priority: -- → P3

badboy merged PR [mozilla/glean]: Bug 1928288 Support interrupting Wait upload tasks during shutdown (#3005) in 818546c.

Glean patch merged, keeping it open until we get that into a release (hopefully next week).

Depends on: 1933939

Just updated to Nightly 135.0a1 (build ID 20241214094501) and confirmed that this problem no longer occurs.

Status: NEW → RESOLVED
Closed: 11 months ago
Resolution: --- → FIXED
Blocks: 1929970
