Closed Bug 1799442 Opened 3 years ago Closed 2 years ago

Crash in [@ std::sys::windows::fs::impl$13::next]

Categories

(Toolkit :: Telemetry, defect, P5)

Unspecified
Windows 10
defect

Tracking

()

RESOLVED FIXED
109 Branch
Tracking Status
firefox-esr102 --- unaffected
firefox107 --- unaffected
firefox108 + fixed
firefox109 --- fixed

People

(Reporter: gsvelto, Assigned: chutten)

References

Details

(Keywords: crash, topcrash, topcrash-startup)

Crash Data

Attachments

(1 file)

Crash report: https://crash-stats.mozilla.org/report/index/02651884-5b3a-488e-9a43-994b80221106

MOZ_CRASH Reason: assertion failed: info.is_aligned()

Top 10 frames of crashing thread:

0  xul.dll  MOZ_Crash  mfbt/Assertions.h:261
0  xul.dll  RustMozCrash  mozglue/static/rust/wrappers.cpp:17
1  xul.dll  mozglue_static::panic_hook  mozglue/static/rust/lib.rs:91
2  xul.dll  core::ops::function::Fn::call<void   ../897e37553bba8b42751c67658967889d11ecd120/library/core/src/ops/function.rs:164
3  xul.dll  std::panicking::rust_panic_with_hook  library/std/src/panicking.rs:702
4  xul.dll  std::panicking::begin_panic_handler::closure$0  library/std/src/panicking.rs:586
5  xul.dll  std::sys_common::backtrace::__rust_end_short_backtrace<std::panicking::begin_panic_handler::closure_env$0, never$>  library/std/src/sys_common/backtrace.rs:138
6  xul.dll  std::panicking::begin_panic_handler  library/std/src/panicking.rs:584
7  xul.dll  core::panicking::panic_fmt  library/core/src/panicking.rs:142
8  xul.dll  core::panicking::panic  library/core/src/panicking.rs:48

This crash is happening via Glean but the actual reason is deep down into Rust's standard library. In particular we're tripping over this assertion.

This might be a Windows bug and for the time being it seems to happen only on three versions of Windows 10: 10.0.18362, 10.0.18363 and 10.0.19044. All of which are more than one year old.

From the stack this is under Glean::clear_metrics which in these crashing cases are being called when we start up and find that upload is disabled. We're asking for std::fs::remove_dir_all to take care of the pending pings, of which there are most likely few or none.

This code's been around and unchanged for about two years (since Glean 33.7.0) so I'm not sure what's so exciting about 108.0a1 that'd make it crop up all of a sudden. Four crashes on 20221104160427 and the other four on 20221105215350... no real pattern except they're both the second nightly build of the weekday.

Not sure what we could do even if we figured out the problem. I suppose we could try to build a "delete all the files" routine that avoided DirBuffIter?

Gonna make this P5 until/unless we get more crash volume out of it.

Jan-Erik, do you think we should report this to rust-lang/rust as a courtesy?

Flags: needinfo?(jrediger)
Severity: -- → S4
Priority: -- → P5

(In reply to Gabriele Svelto [:gsvelto] from comment #0)

This might be a Windows bug and for the time being it seems to happen only on three versions of Windows 10: 10.0.18362, 10.0.18363 and 10.0.19044. All of which are more than one year old.

With so few actual crashes, and then limited to a small range of versions of Windows, I'm not sure we can report anything really actionable upstream.

Flags: needinfo?(jrediger)

I'd say let's wait to see if the volume increases, and if it doesn't we can just ignore this.

There had been 6 affected installations with yesterday's 10am UTC build. This puts it into the top crasher list for Nightly.

And now it's escaped to Beta (still 108) with two crashes from Windows 7. Suggests this is indeed a code change.

Let's change tactics. Earliest build_id that showed crash reports (and quite a lot of them!) is 20221108094235 (rev a62fef9a10c5). Changelog between it and the previous build (20221107212933 rev df541ee6f622) is https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=df541ee6f622&tochange=a62fef9a10c5

Nothing really stands out to me? Maybe bug 1791476? It talks about clang and rust and llvm. I'm really grasping at straws here, but maybe :glandium would have an inkling?

Anyone else on the bug see something interesting in the pushlog? There's not a lot in there that wasn't backed out. Maybe the code change came in on the previous build and just didn't crash, so we should widen our search by a build or two?

Flags: needinfo?(mh+mozilla)

Has the assembly of DirBuffIter::next changed would be my immediate question.

Flags: needinfo?(mh+mozilla)

It's worth noting that the assert in question was added in rust 1.65 and we recently upgraded to that version. Before that, there was UB in that code...

Lessee, upgrade to Rust 1.65 happened in bug 1789508 which hit on 20221104095303. It's near enough that we could've just gotten lucky from the 5th to the 9th... Oh no, I'm wrong.

I was only looking at crashes from the past week to get my changelog range! If I expand to a month the earliest crashing buildid is 20221104160427 which is the second build we built with Rust 1.65. Given that the affected clients only upgraded to the second nightly each day (until the 8th), it makes sense we didn't see any on the 0416 build.

Dang. Evidence is consistent with :glandium's theory. It's probable we were in UB land before, and now we're in the land of defined behaviour... where the behaviour is defined as a crash.

So what do we do... hrm.

Well, we could try some speculation. This code is only crashingly run when on some previous run the profile's opted out of uploading data. This puts us in one of two possible states: We've removed the pending pings already, or we haven't yet. If we can detect that we've already removed the pending pings, we can cut down on the number of times we call the crashing code. If this reduces the number of crashes, we'll know that the code crashes irrespective of whether we've already emptied the data dir. If the crashes go to zero, we'll know that it had to do with removing files from an empty dir (which rust-lang might be interested to learn). If the crash volume remains undisturbed, we'll learn that it has to do with iterating over a dir with contents.

A quick scan over read_dir and the ReadDir iterator suggest they use different code than the crashing DirBufIter (ReadDir uses fileapi.h's FindNextFileW which returns WIN32_FIND_DATA. DirBufIter used the supposed-to-be-aligned FILE_ID_BOTH_DIR_INFO from winbase.h). Seems like we might be able to write a "dir is empty?" checker from there.

But also, maybe we should consider rewriting the code unconditionally in terms of read_dir?

Jan-Erik, you look like someone who I can fob this decision off on... I mean, you look like the Tech Lead for the Glean SDK. What do you think?

Flags: needinfo?(jrediger)

An issue should probably be opened on the rust side with as much information as we can.

(In reply to Chris H-C :chutten from comment #8)

But also, maybe we should consider rewriting the code unconditionally in terms of read_dir?

Jan-Erik, you look like someone who I can fob this decision off on... I mean, you look like the Tech Lead for the Glean SDK. What do you think?

Looks like we already transitively depend on the remove_dir_all crate, which brings its own "reliable" implementation for Windows. It uses read_dir, so we can just use that instead of writing it ourselves.

(In reply to Mike Hommey [:glandium] from comment #9)

An issue should probably be opened on the rust side with as much information as we can.

I'll file one.

Flags: needinfo?(jrediger)
Depends on: 1801128
Assignee: nobody → chutten
Status: NEW → ASSIGNED
Pushed by jrediger@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/75acc8e81d81 Update Glean to v51.8.2, rkv to 0.18 r=janerik,supply-chain-reviewers

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 10 desktop browser crashes on nightly (startup)

:chutten, could you consider increasing the severity of this top-crash bug?

For more information, please visit auto_nag documentation.

Flags: needinfo?(chutten)

The bug is marked as tracked for firefox108 (beta). However, the bug still has low priority and has low severity.

:Dexter, could you please increase the priority and increase the severity for this tracked bug? If you disagree with the tracking decision, please talk with the release managers.

For more information, please visit auto_nag documentation.

Flags: needinfo?(alessio.placitelli)

Backed out changeset 75acc8e81d81 (Bug 1799442) for causing WR tidy bustage.
Backout link
Push with failures <--> tidy
Failure Log

(In reply to Release mgmt bot [:suhaib / :marco/ :calixte] from comment #15)

The bug is marked as tracked for firefox108 (beta). However, the bug still has low priority and has low severity.

:Dexter, could you please increase the priority and increase the severity for this tracked bug? If you disagree with the tracking decision, please talk with the release managers.

For more information, please visit auto_nag documentation.

I guess :chutten is the best person for this.

Flags: needinfo?(alessio.placitelli)

(In reply to Marian-Vasile Laza from comment #16)

Backed out changeset 75acc8e81d81 (Bug 1799442) for causing WR tidy bustage.
Backout link
Push with failures <--> tidy
Failure Log

Gnah. I'll check that.

Pushed by jrediger@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/754a711f3ac6 Update Glean to v51.8.2, rkv to 0.18 r=janerik,supply-chain-reviewers
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 109 Branch
Flags: needinfo?(chutten)

The patch landed in nightly and beta is affected.
:chutten, is this bug important enough to require an uplift?

  • If yes, please nominate the patch for beta approval.
  • If no, please set status-firefox108 to wontfix.

For more information, please visit auto_nag documentation.

Flags: needinfo?(chutten)

I would if this fixed the crash, but we're still getting reports.

Welp, it's not us any more. It's WebRender:

9 	xul.dll 	std::sys::windows::fs::remove_dir_all_iterative() 	library/std/src/sys/windows/fs.rs:1108 	cfi
10 	xul.dll 	std::sys::windows::fs::remove_dir_all() 	library/std/src/sys/windows/fs.rs:1068 	cfi
11 	xul.dll 	std::fs::remove_dir_all(std::path::PathBuf*) 	../897e37553bba8b42751c67658967889d11ecd120/library/std/src/fs.rs:2216 	inlined
11 	xul.dll 	webrender_bindings::program_cache::remove_disk_cache(nsstring::nsAString*) 	gfx/webrender_bindings/src/program_cache.rs:351 	inlined
11 	xul.dll 	webrender_bindings::bindings::remove_program_binary_disk_cache(nsstring::nsAString*) 	gfx/webrender_bindings/src/bindings.rs:1140 	cfi
12 	xul.dll 	gfxUtils::RemoveShaderCacheFromDiskIfNecessary() 	gfx/thebes/gfxUtils.cpp:1676 	cfi

We might have to whack-a-mole this through the entire codebase.

Flags: needinfo?(chutten) → needinfo?(gwatson)

Redirecting

Flags: needinfo?(gwatson) → needinfo?(sotaro.ikeda.g)

Good news, it appears as though webrender may be the last user of std::fs::remove_dir_all in-tree: https://searchfox.org/mozilla-central/search?q=remove_dir_all&path=

fwiw the Rust issue has been fixed and the code should be more robust now. That's only landing in Rust 1.67 though.
It's also been determined that it's most likely due to a sandboxing tool (see comment, upstream issue).

See Also: → 1801683
Depends on: 1801767
Flags: needinfo?(sotaro.ikeda.g)

What's the plan for Beta here? Given that 108 already has Glean 51.8.0, can we just bump directly to 51.8.2?

Flags: needinfo?(chutten)

The plan for Beta is:

  1. Wait for bug 1801767 to land (done)
  2. Check to see if that fixes the crash (looking for 0 crashes for nightlies 20221122214324 and higher. So far the latest crashing nightly is 20221121093640 in the list of reports) by waiting a day or two.
  3. If that fixes it, request uplift for this bug and bug 1801767

Though, if we wanna make some broad assumption that we've fixed it and the risk of taking these patches at this stage in the release is low enough to do it speculatively... we can start that process now? What would you prefer, RyanVM?

Flags: needinfo?(chutten) → needinfo?(ryanvm)

Latest nightly build with crash reports is still 20221121093640 suggesting that this patch and the patch to bug 1801767 should be uplifted. Let's get the party started.

Flags: needinfo?(ryanvm)

Comment on attachment 9303987 [details]
Bug 1799442 - Update Glean to v51.8.2, rkv to 0.18 r?janerik!

Beta/Release Uplift Approval Request

  • User impact if declined: Startup Crash when starting Firefox within a Sandboxie-using Windows environment
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: Bug 1801767, bug 1800741
  • Risk to taking this patch: Medium
  • Why is the change risky/not risky? (and alternatives if risky): Not that risky because we are using the public workaround for remove_dir_all unreliability. A little risky because the Glean SDK release has some ride-alongs that are unrelated to fixing this issue.
  • String changes made/needed:
  • Is Android affected?: No

(edited to add uplift ridealong bug 1800741 which vendored Glean to 51.8.1. Should help ease merge conflicts)

Attachment #9303987 - Flags: approval-mozilla-beta?

Comment on attachment 9303987 [details]
Bug 1799442 - Update Glean to v51.8.2, rkv to 0.18 r?janerik!

Approved for 108.0b6.

Attachment #9303987 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: