Closed Bug 1928288 Opened 1 year ago Closed 11 months ago

Process remains for a long time (approx. 30 seconds) after closing Firefox

Categories

(Data Platform and Tools :: Glean: SDK, defect, P3)

Desktop
Windows 11
defect

Tracking

(firefox132 wontfix, firefox133 wontfix, firefox134 wontfix, firefox135 fixed)

RESOLVED FIXED
Tracking Status
firefox132 --- wontfix
firefox133 --- wontfix
firefox134 --- wontfix
firefox135 --- fixed

People

(Reporter: alice0775, Assigned: chutten|PTO)

References

(Regression)

Details

(Keywords: nightly-community, regression)

Attachments

(2 files)

Steps to reproduce:

  1. Start Nightly (firefox.exe) with a newly created clean profile
  2. Click the new tab button
  3. Close the tab
  4. Quit the browser
  5. Start Nightly again
  6. Repeat steps 2 to 5

Actual results:
An alert dialog pops up as follows:
Firefox is already running, but is not responding. The old Firefox process must be closed to open a new window.

Expected results:
No alert; Nightly starts normally.

Workaround:
Disabling telemetry with policies.json seems to fix the issue:

{
 "policies": {
   "DisableTelemetry": true
 }
}
Attached file about:support
Summary: Process remains for a long time after closing Firefox → Process remains for a long time (approx. 30 seconds) after closing Firefox
OS: Unspecified → Windows 11
Hardware: Unspecified → Desktop

Set release status flags based on info from the regressing bug 1838856

:timhuang, since you are the author of the regressor, bug 1838856, could you take a look? Also, could you set the severity field?

For more information, please visit BugBot documentation.

Flags: needinfo?(tihuang)

I don't think Bug 1838856 is the root cause here, because it only records a Glean probe when canvas randomization is triggered; it doesn't block shutdown at all. Maybe this is an issue with the telemetry implementation?

Flags: needinfo?(tihuang)

Hi Chris, any thoughts on this?

Flags: needinfo?(chutten)

There are a lot of systems and behaviours governed by the enterprise policy DisableTelemetry: data collection systems Legacy Telemetry and Glean, experimentation system Nimbus, broken site reporter, ...

Alas the only ones I know much about are the first two: Legacy Telemetry and Glean.

Glean's shutdown behaviour is pretty simple: first, if we're in the middle of init, finish that. Then, flush any pending data to the db on disk and wait at most 30s for the uploader to finish uploading what it's doing. Usually the uploader isn't doing anything at all, and if it is uploading something it takes far less than 30s to do it, so unless you have a very slow disk or very slow network, this isn't where I'd go looking for our culprit (though if you could gather a profile of this behaviour, that'd really help; I can't reproduce it myself).

Legacy Telemetry is more complicated. It doesn't have a db: its only persistence is based on saving submitted pings to disk. This involves doing however much work is necessary to gather all the information for those pings together, and then writing them down. Once for every kind of ping. On the first run of Firefox, I'd expect there to be quite a few pings: possibly an "update", "main", "first-shutdown", probably an "event" in there, "new-profile", maybe "health"... a lot of code to run and a lot of data to save all at once. This is the system that feels more likely to be the culprit. Again, a profile would very much help. If that's not working out, then setting toolkit.telemetry.log.level to Trace and getting the results of that from the browser console (or, by setting toolkit.telemetry.log.dump to true, from stdout) would be almost as good (docs).

In both cases, if DisableTelemetry is set to true, very little of this machinery will do a darned thing.

I have good news and bad news. The good news is that we're fixing this. The bad news is that it'll take a long time. This is because the way we're fixing this is by decommissioning Legacy Telemetry (which will take its reliance on at-shutdown activities along with it), which we can only do once everything important has been moved over.

I don't think this bug can move forward without a profile or at least logs of an affected run. And if it identifies Legacy Telemetry it's unlikely that we'd be able to prioritize fixing a slow async shutdown over the work of migrating data collection away.

Flags: needinfo?(chutten)

The issue is reproduced, and the profiler log was captured with MOZ_PROFILER_SHUTDOWN: https://share.firefox.dev/3Z2NWmo

FWIW, when the policies.json ("DisableTelemetry": true) is set, the log is https://share.firefox.dev/3UOvPhx

Thank you for such a prompt profile!

Wouldjalookitthat, it's the full 30s wait in glean_core::uploader_shutdown() after all. This is very illuminating, thank you. I'll move this bug into the Glean SDK component here.

Rather irritatingly, we're not getting all the threads in this profile. I'd like to see what the glean.shutdown and (more importantly) glean.upload threads are doing. Alice0775 White, if you are willing to reproduce this in the profiler again could you select the Debug preset in the Profiler UI? That should get those threads for us.

Without knowing where exactly they may be stuck, let's do some code inspection... well, the main thread is in uploader_shutdown. It records some data about how long it waits, so we can look at that... Okay, so on release this affects over 1% of shutdowns which both is and isn't a lot. It's good confirmation that, usually, everything is okay at shutdown. Which is nice on the whole, but doesn't help us in the specific. Back to the code.

So it records how long it takes, then spawns a thread (glean.shutdown) whose job it is to synchronously call state.callbacks.shutdown() while the thread calling uploader_shutdown (the main thread) waits at most 30s for that to happen. If state.callbacks.shutdown() takes longer than 30s to complete, we log a warning and complete shutdown.
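
To make that shape concrete, here is a minimal Rust sketch of the spawn-and-bounded-wait pattern. This is illustrative only, not the actual glean_core code; the names ending in _sketch are mine.

use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Placeholder for whatever state.callbacks.shutdown() actually does.
fn shutdown_callbacks_sketch() {
    // ... flush pending uploads, join the upload thread, etc.
}

fn uploader_shutdown_sketch() {
    let (done_tx, done_rx) = mpsc::channel();

    // A "glean.shutdown"-style thread runs the callbacks synchronously,
    // then reports completion over the channel.
    thread::Builder::new()
        .name("glean.shutdown".into())
        .spawn(move || {
            shutdown_callbacks_sketch();
            let _ = done_tx.send(());
        })
        .expect("failed to spawn shutdown thread");

    // The calling (main) thread waits at most 30 seconds for that signal.
    match done_rx.recv_timeout(Duration::from_secs(30)) {
        Ok(()) => eprintln!("shutdown callbacks finished in time"),
        Err(_) => eprintln!("shutdown callbacks took longer than 30s; proceeding anyway"),
    }
}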

Okay, so we're obviously stuck inside state.callbacks.shutdown(). What does that do? Nothing much, it just calls upload_manager.shutdown().

What does upload_manager.shutdown() do? Well, it signals via the thread_running atomic for the glean.upload thread to begin shutting down. Then it waits for the glean.upload thread to complete. We could theoretically be stuck in either of two places here: in storing State::ShuttingDown to the thread_running atomic (only if lock-free atomic u8 is unavailable on your platform; this being Win64, it's almost certainly available), or in waiting for the glean.upload thread to react to State::ShuttingDown and exit.
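
As a hedged sketch of that signal-and-join shape (the field names and the state constant here are placeholders, not the real glean_core types):

use std::sync::atomic::{AtomicU8, Ordering};
use std::sync::Arc;
use std::thread::JoinHandle;

// Illustrative stand-in for State::ShuttingDown; the real enum lives in glean_core.
const SHUTTING_DOWN: u8 = 1;

struct UploadManagerSketch {
    thread_running: Arc<AtomicU8>,
    upload_thread: Option<JoinHandle<()>>,
}

impl UploadManagerSketch {
    fn shutdown(&mut self) {
        // Signal the glean.upload thread that it should wind down.
        self.thread_running.store(SHUTTING_DOWN, Ordering::SeqCst);

        // Then block until that thread actually exits. If the upload thread
        // is asleep inside a long Wait, this join is where we sit.
        if let Some(handle) = self.upload_thread.take() {
            let _ = handle.join();
        }
    }
}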

I'm going to assume it's the second one. Let's see what the glean.upload thread could be doing. First and foremost, it might not yet exist. But if it doesn't, then we don't actually wait for it to stop running, so let's ignore that. If it exists, that means it's gotten at least this far and is in the following loop (a sketch of that loop follows this list). It could be:

  • Getting the next upload task via glean_get_upload_task. This is a function that should quickly return an instruction to upload a ping, wait for a bit because we've reached our limit of 15 pings sent within 60s (which is configurable), or finish the thread.
    • I don't think this is where we're stuck as this is a quick return. An all-thread profile would tell us for certain.
  • Uploading a ping. This takes however long it takes network to happen.
    • You haven't mentioned that your network is slow, so I imagine this isn't where we're stuck. If we are, we'd be in inner.uploader.upload.
  • Being asked to wait because, though there are pings to send, we've sent too many. The longest wait it can return is the max of WAIT_TIME_FOR_PING_PROCESSING (which is 1s) and one throttle interval: the same 60s interval that dictates how many pings can be sent.
    • Well, this has potential, doesn't it. I wonder how many pings we've been asked to send and over what interval in this case. If we're throttled, the glean.upload thread will be in this sleep call for the full duration of the 30s shutdown wait.
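
For orientation, here is a minimal Rust sketch of that loop shape. The names ending in _sketch are mine; this is not the real glean.upload thread code, just an illustration of where a plain sleep would block the shutdown signal.

use std::sync::atomic::{AtomicU8, Ordering};
use std::thread;
use std::time::Duration;

// Illustrative stand-ins for the tasks glean_get_upload_task can hand back.
enum UploadTaskSketch {
    Upload,    // a ping request to send over the network
    Wait(u64), // milliseconds to back off because we're throttled
    Done,      // nothing left to do (or we're shutting down)
}

fn next_upload_task_sketch() -> UploadTaskSketch {
    // Placeholder; the real decision logic lives in glean_core.
    UploadTaskSketch::Done
}

const SHUTTING_DOWN: u8 = 1;

fn upload_loop_sketch(thread_running: &AtomicU8) {
    loop {
        match next_upload_task_sketch() {
            // Send the ping (inner.uploader.upload in the real code);
            // this takes however long the network takes.
            UploadTaskSketch::Upload => { /* network happens here */ }

            // Throttled: a plain sleep cannot observe the shutdown signal
            // until the whole wait elapses -- the suspected 30s stall.
            UploadTaskSketch::Wait(ms) => thread::sleep(Duration::from_millis(ms)),

            // Nothing more to do: exit the thread.
            UploadTaskSketch::Done => break,
        }

        // In this sketch the shutdown signal is only observable between tasks.
        if thread_running.load(Ordering::SeqCst) == SHUTTING_DOWN {
            break;
        }
    }
}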

If that's indeed where we're sleeping, we have a couple of choices. We can change the ping upload interval to be shorter (decreasing the pings we can send per interval accordingly so we don't DDoS ourselves) so that the maximum Wait time is shorter (15 pings per 60s could be factored down as low as 1 ping per 4s, but we have to figure out what to do if Nimbus configures us to send a different number of pings per minute). We can increase the number of pings we can send per interval (which we can do via Nimbus, off the trains) so we're less likely to get a Wait in the first place. Or we could make the upload manager check for shutdown during its sleep (by either turning shutdown into a channel that it waits on, or simply subdividing the sleep interval into smaller units and checking for shutdown on each subinterval; see the sketch below).
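
To make that last option concrete, a hedged sketch of a subdivided sleep. The flag and interval values are placeholders; the channel-based alternative is sketched further down.

use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;
use std::time::Duration;

// Sleep for `total`, but wake every `step` to check whether shutdown has been
// requested. Purely illustrative of the "subdivide the sleep" idea; the fix
// that was eventually merged interrupts Wait tasks via a channel instead.
fn subdivided_sleep_sketch(total: Duration, step: Duration, shutting_down: &AtomicBool) {
    let mut remaining = total;
    while remaining > Duration::ZERO && !shutting_down.load(Ordering::SeqCst) {
        let chunk = remaining.min(step);
        thread::sleep(chunk);
        remaining -= chunk;
    }
}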

Alice0775 White, would you be willing to get a Debug profile of this behaviour so that I can use the view on all threads to confirm that this is what we're waiting on?

Component: Telemetry → Glean: SDK
Flags: needinfo?(alice0775)
Product: Toolkit → Data Platform and Tools

(( Oh, before I forget, there are performance reasons to suggest that waiting less often and for longer is a better use of resources (power, CPU), so we shouldn't jump straight for switching from one thread wakeup per 60s to 15 thread wakeups per 60s without giving things a little think. ))

Profiler log w/ Debug preset : https://share.firefox.dev/3YMRsQr

Flags: needinfo?(alice0775)

Thank you!

Aw, dang, that doesn't have any more Glean SDK threads in it. And I should've known that'd be the case because I helped write bug 1899618.

We may have to approach this speculatively, then. I'm a fan of the approach of making Wait tasks time out on a channel that shutdown would transmit over, otherwise leaving the surrounding behaviour of throttling and waiting the same. If this works, we should see in the data a decrease in 30s waits in the glean.validation.shutdown_wait metric.
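
A minimal sketch of that shape, assuming an mpsc shutdown channel (the names here are mine; the actual patch in the Glean repo may differ in detail):

use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

// Handle a Wait task by blocking on a shutdown channel with a timeout.
// Returns true if shutdown was requested before the wait elapsed.
fn wait_or_shutdown_sketch(wait: Duration, shutdown_rx: &mpsc::Receiver<()>) -> bool {
    match shutdown_rx.recv_timeout(wait) {
        // A shutdown signal arrived (or the sender went away): stop waiting now.
        Ok(()) | Err(RecvTimeoutError::Disconnected) => true,
        // The full throttle wait elapsed with no shutdown: carry on as before.
        Err(RecvTimeoutError::Timeout) => false,
    }
}

fn main() {
    let (shutdown_tx, shutdown_rx) = mpsc::channel::<()>();

    // Stand-in for the glean.upload thread sitting in a 60s throttle Wait.
    let uploader = thread::spawn(move || {
        let interrupted = wait_or_shutdown_sketch(Duration::from_secs(60), &shutdown_rx);
        eprintln!("wait interrupted by shutdown: {interrupted}");
    });

    // At shutdown, the main thread sends on the channel and the Wait returns
    // immediately instead of running out the clock.
    let _ = shutdown_tx.send(());
    let _ = uploader.join();
}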

Alas, I don't have the time to pick this up right now or possibly in the near future. But this bug's in the right component now and will get picked up in triage on Monday.

Thank you again for all your help in both filing and supplying profiles for this issue.

See Also: → 1899618

(In reply to Alice0775 White from comment #10)

Profiler log w/ Debug preset : https://share.firefox.dev/3YMRsQr

Have you noticed the 2 RunProcess threads that last 30s? https://share.firefox.dev/3YMzAFh

Flags: needinfo?(chutten)

I had not thought to look for thread markers, no; though I've had a long acquaintance with the Gecko profiler, I haven't actually used it that often. Thank you for pointing them out!

Not sure what RunProcess threads do... I'm seeing threads getting named that in nsProcess::Monitor but only if coming in via nsProcess::CreateMonitorThread() (which simply means it's monitoring a process in a non-blocking manner). That seems to happen... let's see: screenshots (when screenshotting), crashes (minidump analyzer), remote debugging (adb), and telemetry (pingsender).

Most likely these are pingsender invocations for at-shutdown Telemetry pings. (( Could be nice to annotate the RunProcess thread marker with the executable name and args. )) Which is fine: pingsender outlives the firefox process normally. If we shorten the glean.upload thread exit, those monitoring threads'll just go away with the parent process later.

Flags: needinfo?(chutten)

Florian pointed out that those RunProcess threads actually disappear before the end of the profile, after exactly 30s of runtime. Not sure what's going on there. If it's a background-task-style pingsender, the task could cap its execution time... but it doesn't, so it uses the default 10min timeout.

So maybe those monitor threads are related and either are causing or will cause the delayed shutdown even if we change how glean.upload responds at shutdown.

Assignee: nobody → chutten
Priority: -- → P3

badboy merged PR [mozilla/glean]: Bug 1928288 Support interrupting Wait upload tasks during shutdown (#3005) in 818546c.

Glean patch merged, keeping it open until we get that into a release (hopefully next week).

Depends on: 1933939

Just updated to Nightly 135.0a1 (build ID 20241214094501) and confirmed that this problem no longer occurs.

Status: NEW → RESOLVED
Closed: 11 months ago
Resolution: --- → FIXED
Blocks: 1929970
