Crash in [@ std::sys::windows::fs::impl$13::next]
Categories
(Toolkit :: Telemetry, defect, P5)
Tracking
()
Tracking | Status | |
---|---|---|
firefox-esr102 | --- | unaffected |
firefox107 | --- | unaffected |
firefox108 | + | fixed |
firefox109 | --- | fixed |
People
(Reporter: gsvelto, Assigned: chutten)
References
Details
(Keywords: crash, topcrash, topcrash-startup)
Crash Data
Attachments
(1 file)
48 bytes,
text/x-phabricator-request
|
dmeehan
:
approval-mozilla-beta+
|
Details | Review |
Crash report: https://crash-stats.mozilla.org/report/index/02651884-5b3a-488e-9a43-994b80221106
MOZ_CRASH Reason: assertion failed: info.is_aligned()
Top 10 frames of crashing thread:
0 xul.dll MOZ_Crash mfbt/Assertions.h:261
0 xul.dll RustMozCrash mozglue/static/rust/wrappers.cpp:17
1 xul.dll mozglue_static::panic_hook mozglue/static/rust/lib.rs:91
2 xul.dll core::ops::function::Fn::call<void ../897e37553bba8b42751c67658967889d11ecd120/library/core/src/ops/function.rs:164
3 xul.dll std::panicking::rust_panic_with_hook library/std/src/panicking.rs:702
4 xul.dll std::panicking::begin_panic_handler::closure$0 library/std/src/panicking.rs:586
5 xul.dll std::sys_common::backtrace::__rust_end_short_backtrace<std::panicking::begin_panic_handler::closure_env$0, never$> library/std/src/sys_common/backtrace.rs:138
6 xul.dll std::panicking::begin_panic_handler library/std/src/panicking.rs:584
7 xul.dll core::panicking::panic_fmt library/core/src/panicking.rs:142
8 xul.dll core::panicking::panic library/core/src/panicking.rs:48
This crash is happening via Glean but the actual reason is deep down into Rust's standard library. In particular we're tripping over this assertion.
This might be a Windows bug and for the time being it seems to happen only on three versions of Windows 10: 10.0.18362, 10.0.18363 and 10.0.19044. All of which are more than one year old.
Assignee | ||
Comment 1•3 years ago
|
||
From the stack this is under Glean::clear_metrics
which in these crashing cases are being called when we start up and find that upload is disabled. We're asking for std::fs::remove_dir_all
to take care of the pending pings, of which there are most likely few or none.
This code's been around and unchanged for about two years (since Glean 33.7.0) so I'm not sure what's so exciting about 108.0a1 that'd make it crop up all of a sudden. Four crashes on 20221104160427
and the other four on 20221105215350
... no real pattern except they're both the second nightly build of the weekday.
Not sure what we could do even if we figured out the problem. I suppose we could try to build a "delete all the files" routine that avoided DirBuffIter
?
Gonna make this P5 until/unless we get more crash volume out of it.
Jan-Erik, do you think we should report this to rust-lang/rust
as a courtesy?
Assignee | ||
Updated•3 years ago
|
Comment 2•3 years ago
|
||
(In reply to Gabriele Svelto [:gsvelto] from comment #0)
This might be a Windows bug and for the time being it seems to happen only on three versions of Windows 10: 10.0.18362, 10.0.18363 and 10.0.19044. All of which are more than one year old.
With so few actual crashes, and then limited to a small range of versions of Windows, I'm not sure we can report anything really actionable upstream.
Reporter | ||
Comment 3•3 years ago
|
||
I'd say let's wait to see if the volume increases, and if it doesn't we can just ignore this.
![]() |
||
Comment 4•3 years ago
|
||
There had been 6 affected installations with yesterday's 10am UTC build. This puts it into the top crasher list for Nightly.
Assignee | ||
Comment 5•3 years ago
|
||
And now it's escaped to Beta (still 108) with two crashes from Windows 7. Suggests this is indeed a code change.
Let's change tactics. Earliest build_id that showed crash reports (and quite a lot of them!) is 20221108094235
(rev a62fef9a10c5
). Changelog between it and the previous build (20221107212933
rev df541ee6f622
) is https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=df541ee6f622&tochange=a62fef9a10c5
Nothing really stands out to me? Maybe bug 1791476? It talks about clang and rust and llvm. I'm really grasping at straws here, but maybe :glandium would have an inkling?
Anyone else on the bug see something interesting in the pushlog? There's not a lot in there that wasn't backed out. Maybe the code change came in on the previous build and just didn't crash, so we should widen our search by a build or two?
Comment 6•2 years ago
|
||
Has the assembly of DirBuffIter::next changed would be my immediate question.
Comment 7•2 years ago
|
||
It's worth noting that the assert in question was added in rust 1.65 and we recently upgraded to that version. Before that, there was UB in that code...
Assignee | ||
Comment 8•2 years ago
|
||
Lessee, upgrade to Rust 1.65 happened in bug 1789508 which hit on 20221104095303
. It's near enough that we could've just gotten lucky from the 5th to the 9th... Oh no, I'm wrong.
I was only looking at crashes from the past week to get my changelog range! If I expand to a month the earliest crashing buildid is 20221104160427
which is the second build we built with Rust 1.65. Given that the affected clients only upgraded to the second nightly each day (until the 8th), it makes sense we didn't see any on the 0416 build.
Dang. Evidence is consistent with :glandium's theory. It's probable we were in UB land before, and now we're in the land of defined behaviour... where the behaviour is defined as a crash.
So what do we do... hrm.
Well, we could try some speculation. This code is only crashingly run when on some previous run the profile's opted out of uploading data. This puts us in one of two possible states: We've removed the pending pings already, or we haven't yet. If we can detect that we've already removed the pending pings, we can cut down on the number of times we call the crashing code. If this reduces the number of crashes, we'll know that the code crashes irrespective of whether we've already emptied the data dir. If the crashes go to zero, we'll know that it had to do with removing files from an empty dir (which rust-lang might be interested to learn). If the crash volume remains undisturbed, we'll learn that it has to do with iterating over a dir with contents.
A quick scan over read_dir
and the ReadDir
iterator suggest they use different code than the crashing DirBufIter
(ReadDir
uses fileapi.h
's FindNextFileW
which returns WIN32_FIND_DATA
. DirBufIter
used the supposed-to-be-aligned FILE_ID_BOTH_DIR_INFO
from winbase.h
). Seems like we might be able to write a "dir is empty?" checker from there.
But also, maybe we should consider rewriting the code unconditionally in terms of read_dir
?
Jan-Erik, you look like someone who I can fob this decision off on... I mean, you look like the Tech Lead for the Glean SDK. What do you think?
Updated•2 years ago
|
Comment 9•2 years ago
|
||
An issue should probably be opened on the rust side with as much information as we can.
Comment 10•2 years ago
|
||
(In reply to Chris H-C :chutten from comment #8)
But also, maybe we should consider rewriting the code unconditionally in terms of
read_dir
?Jan-Erik, you look like someone who I can fob this decision off on... I mean, you look like the Tech Lead for the Glean SDK. What do you think?
Looks like we already transitively depend on the remove_dir_all
crate, which brings its own "reliable" implementation for Windows. It uses read_dir
, so we can just use that instead of writing it ourselves.
(In reply to Mike Hommey [:glandium] from comment #9)
An issue should probably be opened on the rust side with as much information as we can.
I'll file one.
Comment 11•2 years ago
|
||
Reported as https://github.com/rust-lang/rust/issues/104530
Assignee | ||
Comment 12•2 years ago
|
||
Updated•2 years ago
|
Comment 13•2 years ago
|
||
Comment 14•2 years ago
|
||
The bug is linked to a topcrash signature, which matches the following criterion:
- Top 10 desktop browser crashes on nightly (startup)
:chutten, could you consider increasing the severity of this top-crash bug?
For more information, please visit auto_nag documentation.
Comment 15•2 years ago
|
||
The bug is marked as tracked for firefox108 (beta). However, the bug still has low priority and has low severity.
:Dexter, could you please increase the priority and increase the severity for this tracked bug? If you disagree with the tracking decision, please talk with the release managers.
For more information, please visit auto_nag documentation.
Comment 16•2 years ago
|
||
Backed out changeset 75acc8e81d81 (Bug 1799442) for causing WR tidy bustage.
Backout link
Push with failures <--> tidy
Failure Log
Comment 17•2 years ago
|
||
(In reply to Release mgmt bot [:suhaib / :marco/ :calixte] from comment #15)
The bug is marked as tracked for firefox108 (beta). However, the bug still has low priority and has low severity.
:Dexter, could you please increase the priority and increase the severity for this tracked bug? If you disagree with the tracking decision, please talk with the release managers.
For more information, please visit auto_nag documentation.
I guess :chutten is the best person for this.
Updated•2 years ago
|
Comment 18•2 years ago
|
||
(In reply to Marian-Vasile Laza from comment #16)
Backed out changeset 75acc8e81d81 (Bug 1799442) for causing WR tidy bustage.
Backout link
Push with failures <--> tidy
Failure Log
Gnah. I'll check that.
Comment 19•2 years ago
|
||
Comment 20•2 years ago
|
||
bugherder |
Updated•2 years ago
|
Comment 21•2 years ago
|
||
The patch landed in nightly and beta is affected.
:chutten, is this bug important enough to require an uplift?
- If yes, please nominate the patch for beta approval.
- If no, please set
status-firefox108
towontfix
.
For more information, please visit auto_nag documentation.
Assignee | ||
Comment 22•2 years ago
|
||
I would if this fixed the crash, but we're still getting reports.
Welp, it's not us any more. It's WebRender:
9 xul.dll std::sys::windows::fs::remove_dir_all_iterative() library/std/src/sys/windows/fs.rs:1108 cfi
10 xul.dll std::sys::windows::fs::remove_dir_all() library/std/src/sys/windows/fs.rs:1068 cfi
11 xul.dll std::fs::remove_dir_all(std::path::PathBuf*) ../897e37553bba8b42751c67658967889d11ecd120/library/std/src/fs.rs:2216 inlined
11 xul.dll webrender_bindings::program_cache::remove_disk_cache(nsstring::nsAString*) gfx/webrender_bindings/src/program_cache.rs:351 inlined
11 xul.dll webrender_bindings::bindings::remove_program_binary_disk_cache(nsstring::nsAString*) gfx/webrender_bindings/src/bindings.rs:1140 cfi
12 xul.dll gfxUtils::RemoveShaderCacheFromDiskIfNecessary() gfx/thebes/gfxUtils.cpp:1676 cfi
We might have to whack-a-mole this through the entire codebase.
Assignee | ||
Comment 24•2 years ago
|
||
Good news, it appears as though webrender may be the last user of std::fs::remove_dir_all
in-tree: https://searchfox.org/mozilla-central/search?q=remove_dir_all&path=
Comment 25•2 years ago
•
|
||
fwiw the Rust issue has been fixed and the code should be more robust now. That's only landing in Rust 1.67 though.
It's also been determined that it's most likely due to a sandboxing tool (see comment, upstream issue).
Updated•2 years ago
|
Comment 26•2 years ago
|
||
What's the plan for Beta here? Given that 108 already has Glean 51.8.0, can we just bump directly to 51.8.2?
Assignee | ||
Comment 27•2 years ago
|
||
The plan for Beta is:
- Wait for bug 1801767 to land (done)
- Check to see if that fixes the crash (looking for 0 crashes for nightlies
20221122214324
and higher. So far the latest crashing nightly is20221121093640
in the list of reports) by waiting a day or two. - If that fixes it, request uplift for this bug and bug 1801767
Though, if we wanna make some broad assumption that we've fixed it and the risk of taking these patches at this stage in the release is low enough to do it speculatively... we can start that process now? What would you prefer, RyanVM?
Assignee | ||
Comment 28•2 years ago
|
||
Latest nightly build with crash reports is still 20221121093640
suggesting that this patch and the patch to bug 1801767 should be uplifted. Let's get the party started.
Assignee | ||
Comment 29•2 years ago
•
|
||
Comment on attachment 9303987 [details]
Bug 1799442 - Update Glean to v51.8.2, rkv to 0.18 r?janerik!
Beta/Release Uplift Approval Request
- User impact if declined: Startup Crash when starting Firefox within a Sandboxie-using Windows environment
- Is this code covered by automated tests?: Yes
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: Bug 1801767, bug 1800741
- Risk to taking this patch: Medium
- Why is the change risky/not risky? (and alternatives if risky): Not that risky because we are using the public workaround for
remove_dir_all
unreliability. A little risky because the Glean SDK release has some ride-alongs that are unrelated to fixing this issue. - String changes made/needed:
- Is Android affected?: No
(edited to add uplift ridealong bug 1800741 which vendored Glean to 51.8.1. Should help ease merge conflicts)
Comment 30•2 years ago
|
||
Comment on attachment 9303987 [details]
Bug 1799442 - Update Glean to v51.8.2, rkv to 0.18 r?janerik!
Approved for 108.0b6.
Comment 31•2 years ago
|
||
bugherder uplift |
Description
•