Closed Bug 1828066 Opened 2 years ago Closed 2 years ago

Crash in [@ shutdownhang | RtlpWaitOnAddressWithTimeout | RtlpWaitOnAddress | RtlWaitOnAddress | WaitOnAddress] in Glean

Categories

(Data Platform and Tools :: Glean: SDK, defect, P1)

Unspecified
Windows
defect

Tracking

(firefox112 wontfix, firefox113 wontfix, firefox114 fixed, firefox115 fixed)

RESOLVED FIXED

People

(Reporter: aryx, Assigned: janerik)

References

(Blocks 2 open bugs)

Details

(Keywords: crash, topcrash)

Crash Data

Attachments

(4 files)

Shutdown hang reported on Windows 10 and 11 that became more frequent with Firefox 111.0.x (6000 crash reports vs. 450 for 110.0.x).

Crash report: https://crash-stats.mozilla.org/report/index/47e42dbc-f99c-474f-8154-bbf470230413

MOZ_CRASH Reason: Shutdown hanging at step XPCOMShutdown. Something is blocking the main-thread.

Top 10 frames of crashing thread:

0  ntdll.dll  ZwWaitForAlertByThreadId  
1  ntdll.dll  RtlpWaitOnAddressWithTimeout  
2  ntdll.dll  RtlpWaitOnAddress  
3  ntdll.dll  RtlWaitOnAddress  
4  KERNELBASE.dll  WaitOnAddress  
5  xul.dll  std::sys::windows::thread_parker::Parker::park  library/std/src/sys/windows/thread_parker.rs:117
5  xul.dll  std::thread::park  library/std/src/thread/mod.rs:999
6  xul.dll  crossbeam_channel::context::Context::wait_until  third_party/rust/crossbeam-channel/src/context.rs:177
7  xul.dll  crossbeam_channel::flavors::zero::impl$3::recv::closure$1  third_party/rust/crossbeam-channel/src/flavors/zero.rs:323
7  xul.dll  crossbeam_channel::context::impl$0::with::closure$0<crossbeam_channel::flavors::zero::impl$3::recv::closure_env$1<tuple$<> >, enum2$<core::result::Result<tuple$<>, crossbeam_channel::err::RecvTimeoutError> > >  third_party/rust/crossbeam-channel/src/context.rs:52

The bug is linked to a topcrash signature, which matches the following criteria:

  • Top 20 desktop browser crashes on release
  • Top 20 desktop browser crashes on beta

:nika, could you consider increasing the severity of this top-crash bug?

For more information, please visit auto_nag documentation.

Flags: needinfo?(nika)
Keywords: topcrash

Those mostly look like junk frames in the signature that should probably be added to the skip list or whatever else we use to clean up shutdown hangs.

Splitting by proto signature, it looks like this shutdown hang is in Glean code, like the crash in comment 0.

Component: XPCOM → Glean: SDK
Flags: needinfo?(nika)
Product: Core → Data Platform and Tools
Summary: Crash in [@ shutdownhang | RtlpWaitOnAddressWithTimeout | RtlpWaitOnAddress | RtlWaitOnAddress | WaitOnAddress] → Crash in [@ shutdownhang | RtlpWaitOnAddressWithTimeout | RtlpWaitOnAddress | RtlWaitOnAddress | WaitOnAddress] in Glean

In the crash in comment 0, if you do show all threads and search for "name: glean" you can see that there are 4 separate threads related to Glean.

Maybe one of those is doing something that is causing the hang.

glean.init is also in thread::park. I'm not sure what that means.

glean.dispatcher seems to be sitting in a system call, std::sys::windows::fs::rename(), which involves the disk so maybe that could take an extremely long time under some circumstances.

The other two are waiting on a condvar.

Making a nice signature out of this stack could be a bit messy. By my count, the first 17 frames are basically useless and have lots of Rust goo like crossbeam_channel::flavors::zero::impl$3::recv::closure$1. glean_core::shutdown() is the first interesting frame.
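As a rough illustration of the skip-list idea, here is a minimal sketch that drops leading frames matching a set of noise prefixes and returns the first remaining frame. The prefix list and the function name are hypothetical, chosen from the frames quoted in comment 0; this is not the actual crash-stats signature generator or its skip list.

```rust
/// Hypothetical noise prefixes for illustration only; the real skip list
/// used for crash signatures is maintained elsewhere and will differ.
const NOISE_PREFIXES: &[&str] = &[
    "ZwWaitForAlertByThreadId",
    "RtlpWaitOnAddress", // also covers RtlpWaitOnAddressWithTimeout
    "RtlWaitOnAddress",
    "WaitOnAddress",
    "std::",
    "crossbeam_channel::",
];

/// Returns the first frame that does not start with any noise prefix,
/// or `None` if every frame is considered noise.
pub fn first_interesting_frame<'a>(frames: &[&'a str]) -> Option<&'a str> {
    frames
        .iter()
        .copied()
        .find(|frame| !NOISE_PREFIXES.iter().any(|p| frame.starts_with(*p)))
}
```

Given the frames from this report followed by glean_core::shutdown, the function would skip the wait and channel frames and return glean_core::shutdown.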

(In reply to Andrew McCreight [:mccr8] from comment #3)

> In the crash in comment 0, if you do show all threads and search for "name: glean" you can see that there are 4 separate threads related to Glean.
>
> Maybe one of those is doing something that is causing the hang.
>
> glean.init is also in thread::park. I'm not sure what that means.
>
> glean.dispatcher seems to be sitting in a system call, std::sys::windows::fs::rename(), which involves the disk so maybe that could take an extremely long time under some circumstances.
>
> The other two are waiting on a condvar.

Yeah, it looks like Glean is trying to submit data (which involves writing it to disk). Then, during shutdown, Glean is asked to shut down, which usually waits for pending tasks before cleaning up and exiting.
That wait has no timeout, so it blocks indefinitely. For whatever reason the rename/IO operation seems to take far longer than it should, causing the hang.
The right fix is to put a short timeout on those operations and bail out if the timeout triggers.
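The "wait with a timeout, then bail out" approach can be sketched roughly as follows. This is only an illustration under stated assumptions, not Glean's actual code: the names ShutdownOutcome and shutdown_with_timeout are hypothetical, and it uses the standard library's std::sync::mpsc channel rather than the crossbeam-channel seen in the stack above.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical result type for this sketch; not part of Glean's API.
#[derive(Debug, PartialEq)]
pub enum ShutdownOutcome {
    /// The pending work finished before the deadline.
    Flushed,
    /// The deadline passed (or the worker died) before completion was signalled.
    GaveUp,
}

/// Runs `pending` on a worker thread and waits at most `timeout` for it.
/// If the wait expires, the caller stops waiting and the worker is left
/// detached, so a slow disk operation can no longer block shutdown.
pub fn shutdown_with_timeout<F>(pending: F, timeout: Duration) -> ShutdownOutcome
where
    F: FnOnce() + Send + 'static,
{
    let (done_tx, done_rx) = mpsc::channel::<()>();

    thread::spawn(move || {
        pending();
        // The receiver may already have given up; ignore a failed send.
        let _ = done_tx.send(());
    });

    match done_rx.recv_timeout(timeout) {
        Ok(()) => ShutdownOutcome::Flushed,
        // Covers both a timeout and a worker that exited without signalling.
        Err(_) => ShutdownOutcome::GaveUp,
    }
}
```

The trade-off in a design like this is that work still pending when the timeout fires may be lost, which is presumably acceptable at shutdown compared to hanging the main thread.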

Assignee: nobody → jrediger
Priority: -- → P3
Priority: P3 → P1
Attachment #9331776 - Flags: data-review?(chutten)

Comment on attachment 9331776 [details]
data-review-request.txt

DATA COLLECTION REVIEW RESPONSE:

Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?

Yes.

Is there a control mechanism that allows the user to turn the data collection on and off?

Yes. This collection can be controlled through the product's preferences.

If the request is for permanent data collection, is there someone who will monitor the data over time?

jrediger@mozilla.com will be responsible for the permanent collections.

Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 1, Technical.

Is the data collection request for default-on or default-off?

Default on for all channels.

Does the instrumentation include the addition of any new identifiers?

No.

Is the data collection covered by the existing Firefox privacy notice?

Yes.

Does the data collection use a third-party collection tool?

No.


Result: datareview+

Attachment #9331776 - Flags: data-review?(chutten) → data-review+

badboy merged PR #2461: "Bug 1828066 - At shutdown block with a timeout and bail out if that fails." in ce37542.


Keeping this open to use as the bug to land a new Glean release.

Pushed by jrediger@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/93c7e111de59
Update to Glean v52.7.0 r=chutten,supply-chain-reviewers
https://hg.mozilla.org/integration/autoland/rev/503594461247
Upgrade glean_parser to v7.2.1 r=chutten
Blocks: 1833362

Is this something we should consider for Beta backport?

Flags: needinfo?(jrediger)

afaik beta has Glean 52.6.0, so it should be easy to cherry-pick onto there. The changes are relatively small, which keeps the risk low. IMO we can do it.

Flags: needinfo?(jrediger)

Comment on attachment 9332597 [details]
Bug 1828066 - Update to Glean v52.7.0 r?chutten!

Beta/Release Uplift Approval Request

  • User impact if declined: Continued shutdown hangs on Windows due to Glean blocking.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Tested in Glean, already running on Nightly without issues.
    We have telemetry instrumentation to monitor the effects.
  • String changes made/needed:
  • Is Android affected?: No
Attachment #9332597 - Flags: approval-mozilla-beta?
Attachment #9332598 - Flags: approval-mozilla-beta?

Comment on attachment 9332597 [details]
Bug 1828066 - Update to Glean v52.7.0 r?chutten!

Approved for 114 beta 8, thanks.

Attachment #9332597 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #9332598 - Flags: approval-mozilla-beta? → approval-mozilla-beta+