Crash in [@ shutdownhang | RtlpWaitOnAddressWithTimeout | RtlpWaitOnAddress | RtlWaitOnAddress | WaitOnAddress] in Glean
Categories
(Data Platform and Tools :: Glean: SDK, defect, P1)
Tracking
(firefox112 wontfix, firefox113 wontfix, firefox114 fixed, firefox115 fixed)
People
(Reporter: aryx, Assigned: janerik)
References
(Blocks 2 open bugs)
Details
(Keywords: crash, topcrash)
Crash Data
Attachments
(4 files)
- 42 bytes, text/x-github-pull-request
- 2.28 KB, text/plain (chutten: data-review+)
- 48 bytes, text/x-phabricator-request (pascalc: approval-mozilla-beta+)
- 48 bytes, text/x-phabricator-request (pascalc: approval-mozilla-beta+)
Shutdown hang reported for Windows 10 and 11, which became more frequent with Firefox 111.0.x (6000 crash reports vs. 450 for 110.0.x).
Crash report: https://crash-stats.mozilla.org/report/index/47e42dbc-f99c-474f-8154-bbf470230413
MOZ_CRASH Reason: Shutdown hanging at step XPCOMShutdown. Something is blocking the main-thread.
Top 10 frames of crashing thread:
0 ntdll.dll ZwWaitForAlertByThreadId
1 ntdll.dll RtlpWaitOnAddressWithTimeout
2 ntdll.dll RtlpWaitOnAddress
3 ntdll.dll RtlWaitOnAddress
4 KERNELBASE.dll WaitOnAddress
5 xul.dll std::sys::windows::thread_parker::Parker::park library/std/src/sys/windows/thread_parker.rs:117
5 xul.dll std::thread::park library/std/src/thread/mod.rs:999
6 xul.dll crossbeam_channel::context::Context::wait_until third_party/rust/crossbeam-channel/src/context.rs:177
7 xul.dll crossbeam_channel::flavors::zero::impl$3::recv::closure$1 third_party/rust/crossbeam-channel/src/flavors/zero.rs:323
7 xul.dll crossbeam_channel::context::impl$0::with::closure$0<crossbeam_channel::flavors::zero::impl$3::recv::closure_env$1<tuple$<> >, enum2$<core::result::Result<tuple$<>, crossbeam_channel::err::RecvTimeoutError> > > third_party/rust/crossbeam-channel/src/context.rs:52
Comment 1•2 years ago
The bug is linked to a topcrash signature, which matches the following criteria:
- Top 20 desktop browser crashes on release
- Top 20 desktop browser crashes on beta
:nika, could you consider increasing the severity of this top-crash bug?
For more information, please visit auto_nag documentation.
Those mostly look like junk frames in the signature that should probably be added to the skip list or whatever else we use to clean up shutdown hangs.
Splitting by proto signature, it looks like this shutdown hang is in Glean code, like the crash in comment 0.
In the crash in comment 0, if you do show all threads and search for "name: glean" you can see that there are 4 separate threads related to Glean.
Maybe one of those is doing something that is causing the hang.
glean.init is also in thread::park. I'm not sure what that means.
glean.dispatcher seems to be sitting in a system call, std::sys::windows::fs::rename(), which involves the disk so maybe that could take an extremely long time under some circumstances.
The other two are waiting on a condvar.
Making a nice signature out of this stack could be a bit messy. By my count, the first 17 frames are basically useless and have lots of Rust goo like crossbeam_channel::flavors::zero::impl$3::recv::closure$1. glean_core::shutdown() is the first interesting frame.
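To illustrate the skip-list idea, here is a minimal sketch that walks a stack from the top and returns the first frame that does not match a list of uninteresting prefixes. This is not the actual crash-stats signature generator; the `SKIP_PREFIXES` list and the `first_interesting_frame` function are hypothetical names chosen for illustration, and the prefixes are only the ones visible in this report.

```rust
/// Hypothetical prefixes for frames that carry no signal in a shutdown hang.
/// A real skip list would be maintained elsewhere; this one is an assumption.
const SKIP_PREFIXES: &[&str] = &[
    "ZwWaitForAlertByThreadId",
    "RtlpWaitOnAddress",
    "RtlWaitOnAddress",
    "WaitOnAddress",
    "std::",
    "crossbeam_channel::",
];

/// Returns the first frame (top of stack first) that matches none of the
/// skip prefixes, or `None` if every frame is skipped.
pub fn first_interesting_frame<'a>(frames: &[&'a str]) -> Option<&'a str> {
    frames
        .iter()
        .copied()
        .find(|frame| !SKIP_PREFIXES.iter().any(|prefix| frame.starts_with(*prefix)))
}
```

Given the frames from comment 0 followed by `glean_core::shutdown`, this would pick `glean_core::shutdown` as the frame to build the signature from.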
Assignee | ||
Comment 5•2 years ago
(In reply to Andrew McCreight [:mccr8] from comment #3)
In the crash in comment 0, if you do show all threads and search for "name: glean" you can see that there are 4 separate threads related to Glean.
Maybe one of those is doing something that is causing the hang.
glean.init is also in thread::park. I'm not sure what that means.
glean.dispatcher seems to be sitting in a system call, std::sys::windows::fs::rename(), which involves the disk so maybe that could take an extremely long time under some circumstances.
The other two are waiting on a condvar.
Yeah, looks like Glean is trying to submit data (which involves writing it to disk). Then, when the browser shuts down, Glean is asked to shut down, which usually waits for pending tasks before cleaning up and exiting.
It has no timeouts and so waits indefinitely. For whatever reason the rename/IO operation seems to take far longer than it should, causing the hang.
The right fix will be to have a short timeout on those operations, and bail out if the timeout triggers.
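As a rough sketch of what "block with a timeout and bail out" could look like, the following uses only the Rust standard library: pending work runs on a worker thread, and the caller waits on a channel with `recv_timeout` instead of parking indefinitely. This is not Glean's actual implementation; `shutdown_with_timeout` and `ShutdownOutcome` are hypothetical names, and the timeout value is left to the caller.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Outcome of a bounded wait for pending work at shutdown.
#[derive(Debug, PartialEq)]
pub enum ShutdownOutcome {
    /// All pending work finished before the deadline.
    Completed,
    /// The deadline passed or the worker died; the caller stops waiting.
    BailedOut,
}

/// Runs `pending_work` on a worker thread and waits at most `timeout` for it.
/// Hypothetical helper for illustration, not part of the Glean SDK.
pub fn shutdown_with_timeout<F>(pending_work: F, timeout: Duration) -> ShutdownOutcome
where
    F: FnOnce() + Send + 'static,
{
    let (done_tx, done_rx) = mpsc::channel::<()>();

    // The worker signals completion over the channel once the work is done.
    thread::spawn(move || {
        pending_work();
        // The receiver may already be gone if the caller bailed out; ignore it.
        let _ = done_tx.send(());
    });

    // Block for at most `timeout` rather than waiting forever.
    match done_rx.recv_timeout(timeout) {
        Ok(()) => ShutdownOutcome::Completed,
        // Timeout, or the sender was dropped without signalling: give up.
        Err(_) => ShutdownOutcome::BailedOut,
    }
}
```

Note that bailing out does not cancel the slow operation; the worker thread is simply left behind while shutdown continues, so any data it was writing may not be persisted.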
Comment 6•2 years ago
Assignee | ||
Comment 7•2 years ago
Comment 8•2 years ago
Comment on attachment 9331776 [details]
data-review-request.txt
DATA COLLECTION REVIEW RESPONSE:
Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?
Yes.
Is there a control mechanism that allows the user to turn the data collection on and off?
Yes. This collection can be controlled through the product's preferences.
If the request is for permanent data collection, is there someone who will monitor the data over time?
jrediger@mozilla.com will be responsible for the permanent collections.
Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?
Category 1, Technical.
Is the data collection request for default-on or default-off?
Default on for all channels.
Does the instrumentation include the addition of any new identifiers?
No.
Is the data collection covered by the existing Firefox privacy notice?
Yes.
Does the data collection use a third-party collection tool?
No.
Result: datareview+
Assignee | ||
Comment 9•2 years ago
badboy merged PR #2461: "Bug 1828066 - At shutdown block with a timeout and bail out if that fails." in ce37542.
Keeping this open to use as the bug to land a new Glean release.
Assignee | ||
Comment 10•2 years ago
Assignee | ||
Comment 11•2 years ago
Depends on D177617
Comment 12•2 years ago
Comment 13•2 years ago
bugherder |
Comment 14•2 years ago
Is this something we should consider for Beta backport?
Assignee | ||
Comment 15•2 years ago
afaik beta has Glean 52.6.0, so it should be easy to cherry-pick onto that. The changes are relatively small, which keeps the risk low. IMO we can do it.
Assignee | ||
Comment 16•2 years ago
Comment on attachment 9332597 [details]
Bug 1828066 - Update to Glean v52.7.0 r?chutten!
Beta/Release Uplift Approval Request
- User impact if declined: Continued shutdown hangs on Windows due to Glean blocking.
- Is this code covered by automated tests?: Yes
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): Tested in Glean, already running on Nightly without issues. We have telemetry instrumentation to monitor the effects.
- String changes made/needed:
- Is Android affected?: No
Comment 17•2 years ago
Comment on attachment 9332597 [details]
Bug 1828066 - Update to Glean v52.7.0 r?chutten!
Approved for 114 beta 8, thanks.
Comment 18•2 years ago
bugherder uplift |