Closed Bug 1385925 Opened 7 years ago Closed 10 months ago

stylo: Crash in core::ptr::drop_in_place<T>

Categories

(Core :: CSS Parsing and Computation, defect, P3)

x86
Windows 7
defect

Tracking

()

RESOLVED INCOMPLETE
Tracking Status
firefox-esr52 --- unaffected
firefox-esr60 --- wontfix
firefox55 --- unaffected
firefox56 --- disabled
firefox57 --- wontfix
firefox58 --- wontfix
firefox59 --- wontfix
firefox60 --- wontfix
firefox61 --- wontfix
firefox62 --- wontfix
firefox63 --- wontfix
firefox64 --- wontfix
firefox65 --- wontfix

People

(Reporter: marcia, Unassigned)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(5 keywords, Whiteboard: [sec-triage-backlog])

Crash Data

This bug was filed from the Socorro interface and is report bp-9f68eebb-ae7d-4ce8-93ab-bef1c0170731.
=============================================================
Seen while looking at crash stats: http://bit.ly/2tR0pIL. First crash seen in build 20170723030206. Small-volume Windows crash with some stylo references in the code, so adding the "stylo:" prefix.
Priority: -- → P2
Priority: P2 → --
Priority: -- → P3
p3 -> fix optional for tracking
This seems to happen a lot these days (163 crashes across 80 installs on 57). Probably worth checking what's happening. Having a look at several reports, it seems we always crash in Servo_StyleSet_FlushStyleSheets when called from FlushUserFontSet via different paths.
Priority: P3 → P2
(In reply to Xidorn Quan [:xidorn] UTC+10 from comment #2)
> Having a look at several reports, it seems we always crash in
> Servo_StyleSet_FlushStyleSheets when called from FlushUserFontSet via
> different paths.

There are also some reports in stylist teardown:

https://crash-stats.mozilla.com/report/index/a2d262c3-1fe7-4732-a1a3-0b8470170825

Always seems to happen when dropping the hash tables.

It would be nice to know if the crash comes from dropping the tables, or from their contents. We're about to land our own fork of the HashMap stuff, so will be interesting to see if this persists with that. If it does, we could potentially route the "dropping the contents" part through a separate #[inline(never)] call from the "dropping the hashtable" part, which would give us better visibility into what's going on.
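For illustration, a minimal sketch of the diagnostic described above (illustrative names, not the actual Servo code): route the entry-dropping work through a separate non-inlined function so crash backtraces can distinguish "dropping the contents" from "freeing the table".

use std::collections::HashMap;
use std::hash::Hash;

// Runs the destructors of the keys/values; keeping this out-of-line means it
// shows up as its own frame in crash stacks.
#[inline(never)]
fn drop_contents<K: Eq + Hash, V>(map: &mut HashMap<K, V>) {
    map.clear();
}

fn teardown<K: Eq + Hash, V>(mut map: HashMap<K, V>) {
    drop_contents(&mut map);
    drop(map); // only frees the now-empty table allocation
}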
Depends on: 1393656
Oh, and I guess the fact that this happens for both the invalidation maps as well as the rule hashes means it's less likely to be an issue with the contents and more likely to be an issue with the hash tables themselves. Emilio, we're not doing anything unsafe when building those tables, right? I certainly hope this isn't a bug in HashMap...
CCing Jack, given his interest in HashMap stability issues.
I poked around for a bit and can't find anything unsafe we're doing here. There's a good chance it's a hashmap bug -- the stdlib hashmap is pretty well vetted but also super unsafe, so I wouldn't put it past there being a bug, especially on 32 bit.

SmallVec is turning up there -- might it be a capacity overflow on 32 bit? Vec is hardened against that, but SmallVec doesn't seem to be.
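To illustrate the 32-bit capacity-overflow hazard mentioned here (this is not SmallVec's actual code): an allocation size computed with an unchecked multiply can wrap on 32-bit targets, yielding a too-small buffer that later writes run past. Vec's growth path does the checked version.

use std::mem::size_of;

fn alloc_size_unchecked<T>(cap: usize) -> usize {
    cap * size_of::<T>() // wraps silently in release builds if cap is huge
}

fn alloc_size_checked<T>(cap: usize) -> usize {
    cap.checked_mul(size_of::<T>()).expect("capacity overflow")
}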
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #3)
> (In reply to Xidorn Quan [:xidorn] UTC+10 from comment #2)
> > Having a look at several reports, it seems we always crash in
> > Servo_StyleSet_FlushStyleSheets when called from FlushUserFontSet via
> > different paths.
>
> There are also some reports in stylist teardown:
>
> https://crash-stats.mozilla.com/report/index/a2d262c3-1fe7-4732-a1a3-0b8470170825
>
> Always seems to happen when dropping the hash tables.
>
> It would be nice to know if the crash comes from dropping the tables, or
> from their contents. We're about to land our own fork of the HashMap stuff,
> so will be interesting to see if this persists with that. If it does, we
> could potentially route the "dropping the contents" part through a separate
> #[inline(never)] call from the "dropping the hashtable" part, which would
> give us better visibility into what's going on.

Disassembling the relevant offset in xul.dll from build 20170823100553 shows me:

xul!style::gecko_string_cache::{{impl}}::deref+0x4 [C:\projects\rust\src\libcore\ptr.rs @ 60]
[inlined in xul!core::ptr::drop_in_place<std::collections::hash::map::HashMap<style::gecko_string_cache::Atom, smallvec::SmallVec<[style::invalidation::element::invalidation_map::Dependency; 1]>, core::hash::BuildHasherDefault<style::selector_map::PrecomputedHasher>>>+0x77 [C:\projects\rust\src\libcore\ptr.rs @ 60]]:

00000001`80188197 8b4108     mov eax,dword ptr [rcx+8]
00000001`8018819a 4421e8     and eax,r13d
00000001`8018819d 3d00000040 cmp eax,40000000h
00000001`801881a2 7405       je  xul!core::ptr::drop_in_place<std::collections::hash::map::HashMap<style::gecko_string_cache::Atom, smallvec::SmallVec<[style::invalidation::element::invalidation_map::Dependency; 1]>, core::hash::BuildHasherDefault<style::selector_map::PrecomputedHasher>>>+0x89 (00000001`801881a9)
00000001`801881a4 e89b9ef901 call xul!Gecko_ReleaseAtom (00000001`82122044)

This looks a lot like https://dxr.mozilla.org/mozilla-central/source/servo/components/style/gecko_string_cache/mod.rs#302, and rcx is 0, so I suppose it's a null Atom? (Or perhaps _something_ in that `self.(do stuff)` call tree is null -- I don't know how many layers of indirection there are here.)

In any case, it looks like the contents rather than the hashtable itself. At least in that one crash report.
It sounds plausible that a null Atom can cause this. We probably should harden the assertions when converting from a raw ptr to Atom in gecko_string_cache/mod.rs.
(In reply to Xidorn Quan [:xidorn] UTC+10 from comment #8)
> It sounds plausible that a null Atom can cause this. We probably should
> harden the assertions when converting from a raw ptr to Atom in
> gecko_string_cache/mod.rs.

You're talking about making the debug_assertions in [1] and [2] release-mode? I think that would probably be acceptable performance-wise, at least temporarily as a diagnostic. r=me on doing that, ideally with some measurements and/or a bug filed to consider removing it.

[1] http://searchfox.org/mozilla-central/rev/51b3d67a5ec1758bd2fe7d7b6e75ad6b6b5da223/servo/components/style/gecko_string_cache/mod.rs#268
[2] http://searchfox.org/mozilla-central/rev/51b3d67a5ec1758bd2fe7d7b6e75ad6b6b5da223/servo/components/style/gecko_string_cache/mod.rs#381
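A minimal sketch of the hardening being discussed, with made-up stand-in types (this is not the actual gecko_string_cache code): promote the debug-only null check to a release assertion so a null pointer crashes immediately with a recognizable signature instead of surfacing later inside drop_in_place.

// Stand-ins for the FFI atom type and its Rust wrapper; illustrative only.
struct RawAtom;
struct Atom(*mut RawAtom);

unsafe fn atom_from_addrefed(ptr: *mut RawAtom) -> Atom {
    // Was debug_assert!(...) before the hardening; now checked in release too.
    assert!(!ptr.is_null(), "null atom pointer");
    Atom(ptr)
}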
Servo PR for hardening the assertion: https://github.com/servo/servo/pull/18302
(In reply to Xidorn Quan [:xidorn] UTC+10 from comment #10)
> Servo PR for hardening the assertion:
> https://github.com/servo/servo/pull/18302

This should have been in today's nightly. We should stop seeing new crashes with this signature, and instead see some from Atom::from or Atom::from_addrefed.
So the atom hardening didn't fix the crashes. I dug into it a bit today, and now it's clear to me that it wouldn't have. Here's a writeup of what I've learned so far. Big thanks to dmajor and ted for helping answer some questions.

Broadly speaking, these are all crashes that occur when freeing the hashtables that hang off the stylist (the rule hashes and the invalidation maps). However, there are a few different sub-pathologies:

* EXCEPTION_ACCESS_VIOLATION_READ on 0x4/0x8 for 32-bit and 64-bit respectively: This occurs when checking the mIsStatic member of a null nsIAtom pointer (the first word is the refcount, so this bit lives in the second word).
* EXCEPTION_ACCESS_VIOLATION_READ at a high address: I haven't looked into these yet.
* EXCEPTION_ACCESS_VIOLATION_WRITE on 0x0: At least some of them happen when trying to drop a ThinArc. This suggests we're trying to drop a Selector, which appears in both Dependency and Rule, which are the two Value types in these HashMaps (the Key types are pretty much always Atom).
* EXCEPTION_BREAKPOINT: These always happen when the content process is being shut down by the parent, and according to ted, indicate that the parent terminated the content process for some reason (potentially because it was taking a long time).
* EXCEPTION_ILLEGAL_INSTRUCTION: This is a rust panic, and the panic message is "assertion failed: !(ptr as *mut u8).is_null()". This condition is asserted in NonZeroPtrMut::new, which is in the call path of ThinArc::drop (sketched below).

So the evidence here suggests that we're dying on both null atoms as well as null ThinArcs, which constitute the keys and the values of the hashmaps that are being torn down. This strongly suggests that it isn't a type-specific error, but rather a data corruption error with the HashMaps. That leaves two uncomfortable possibilities: a bug in the std::HashMap impl, or something else clobbering the memory within the HashMap impl.

This would be a great "I Told You So" moment for some people if it started happening after bug 1393656 landed, but that's only landing today. It would be interesting to see if that change has any impact on the crash rate.

I traced the crashes back in hopes of finding a regression window. There is one crash in late June, another on July 23rd, and then they pick up at the end of July. This is right when we were rolling out our nightly experiment (and thus getting a real user population for the first time), and I didn't see anything suspicious landing that week. So I think we should assume this bug was pre-existing.

It's also worth noting that we converted the HashMap values from Vecs to SmallVecs in bug 1363789, which landed around August 7th. There are no EXCEPTION_ILLEGAL_INSTRUCTION or EXCEPTION_ACCESS_VIOLATION_WRITE crashes before then. This makes sense, since clobbering a Vec with 0 will yield something that either leaks or segfaults at 0 (depending on whether we clobber |capacity| or |buff|), but in either case we won't try to run any destructors. This means that SmallVec is very unlikely to be the culprit.

Given the heat wave and the 6-pm-on-a-Friday thing, I'm presently out of ideas on where to go next. I may have some over the weekend, but suggestions are certainly welcome.

[1] I focused only on the Windows crashes. You can follow along with the "Crash Signature" link at the top of the bug.
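For reference, a rough sketch of the assertion behind that EXCEPTION_ILLEGAL_INSTRUCTION panic (illustrative only; the real wrapper lives alongside Servo's Arc code): the pointer wrapper asserts non-null on construction, so a ThinArc field that has been zeroed or clobbered trips the assertion when the drop path reconstructs the wrapper.

struct NonZeroPtrMut<T>(*mut T);

impl<T> NonZeroPtrMut<T> {
    fn new(ptr: *mut T) -> Self {
        // This is the check that produces the message
        // "assertion failed: !(ptr as *mut u8).is_null()" in the crash reports.
        assert!(!(ptr as *mut u8).is_null());
        NonZeroPtrMut(ptr)
    }
}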
Manish - given that you just forked std::HashMap, and thus have it more paged in than anybody else in the world right now, can you do some auditing to see if anything sketchy strikes your eye?
Flags: needinfo?(manishearth)
Marcia, can you or somebody on your team have a look through the crash report URLs and see if you can find any patterns at all? I had cpeterson look at a few of them, and there wasn't anything too interesting. But a more systematic analysis might be useful.
Flags: needinfo?(mozillamarcia.knous)
Bob, do you run ASAN or valgrind builds for stylo on bughunter? If not, can you try that?
Flags: needinfo?(bob)
I do run with opt, debug, opt-asan, debug-asan, opt-stylo, debug-stylo, opt-stylo-asan and debug-stylo-asan. I'll look to see if we have this already or if not find some urls and test them.
If no one has a better idea, one way to differentiate a bug in HashMap vs. something else would be to replace HashMap with OrderMap temporarily and see if the problem remains. If something is stomping on HashMap, it will likely also stomp on OrderMap. If unsafe code in HashMap is the culprit, the problem should disappear.

Considering that ThinArc is implicated by association here, and that it contains a partially unsafe reimplementation of Arc, I would also probably look over that.

It would be nice if we could find this in a fuzzer or somewhere else we could record a trace. Is there anything in the crash reports that would help us narrow down some STR?
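One minimal way to wire up that experiment, assuming the style code can funnel its map type through a single alias (the feature name below is made up, and the real code may construct HashMap directly in several places):

// Hypothetical switch between the two implementations for a diagnostic build.
#[cfg(feature = "diagnose_hashmap")]
pub use ordermap::OrderMap as StyleMap;

#[cfg(not(feature = "diagnose_hashmap"))]
pub use std::collections::HashMap as StyleMap;

The point of the swap is that OrderMap stores its entries in a different layout than std's open-addressed table, so corruption from a stray external writer should still reproduce, while a latent bug in std's unsafe table code would not.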
I wrote a little fuzzing testcase [1], which I've been running under rr chaos mode all day today on various sites from the crash reports. Nothing yet.

Chris, I noticed that the crash volume is significantly lower on Beta. How does the size of our beta experiment population compare to Nightly?

[1]
function fuzzSheets() {
  function flushStyle() {
    getComputedStyle(document.body).color;
  }
  let sheets = document.styleSheets;
  let idx = Math.floor(Math.random() * sheets.length);
  let sheet = document.styleSheets[idx];
  if (!sheet.ownerNode) {
    console.log("No ownernode for sheet " + sheet + ", skipping");
  } else {
    let n = sheet.ownerNode;
    let nParent = n.parentElement;
    let nextSibling = n.nextSibling;
    n.remove();
    flushStyle();
    setTimeout(function() {
      nParent.insertBefore(n, nextSibling);
      flushStyle();
      setTimeout(fuzzSheets, 50);
    }, 50);
  }
}
Flags: needinfo?(cpeterson)
(In reply to Jack Moffitt [:jack] from comment #17)
> Considering that ThinArc is implicated by association here, and that
> contains a partially unsafe reimplementation of Arc, I would also probably
> look over that.

Reports involving dropping CascadeData (e.g. bp-020d6409-42b8-42fa-84f3-b74760170903, bp-b872b38f-324b-4414-8ee1-12dea0170903) don't seem to have any Arc in the storage path. It is stored as:
- per_origin: PerOrigin<CascadeData> in DocumentCascadeData
- cascade_data: DocumentCascadeData in Stylist
- stylist: Stylist in PerDocumentStyleDataImpl
- AtomicRefCell<PerDocumentStyleDataImpl> in PerDocumentStyleData

It is stored inline in the majority of the path, and the only unsafe thing seems to be AtomicRefCell. So I would suggest that ThinArc is probably innocent, but there is a chance that AtomicRefCell could be the culprit.

Meanwhile, reports involving dropping ElementData (e.g. bp-5213a826-cf7e-4517-9399-a82f60170903) don't seem to have AtomicRefCell.

(Looking at the crash reports, it seems that the former, which involves AtomicRefCell, is far more frequent than the latter, while the latter is supposed to be on a more frequent path, I guess.)
(In reply to Xidorn Quan [:xidorn] UTC+10 from comment #19)
> Reports involving dropping CascadeData (e.g.
> bp-020d6409-42b8-42fa-84f3-b74760170903,
> bp-b872b38f-324b-4414-8ee1-12dea0170903) don't seem to have any Arc in the
> storage path.

Yes, but the CascadeData itself contains the HashMaps, which contain a lot of Atoms and Selectors. Selectors have a ThinArc inside them. The analysis in comment 12 showed these crashes when dropping the Atoms and ThinArcs stored within the HashTables, in both cases because their contents had become null.

I do generally think ThinArc is an unlikely culprit. These crashes were happening when the ThinArc was in a separate location on the heap (before we used the SmallVec), so it would be surprising if it somehow managed to clobber the hashtable buffer.
And to set context here, this crash is pretty low-volume, and njn thinks it's not big enough to warrant an undue amount of attention. It just has me concerned because it appears to be random memory corruption (possibly due to unrelated code in Gecko).
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #21)
> And to set context here, this crash is pretty low-volume, and njn thinks
> it's not big enough to warrant an undue amount of attention. It just has me
> concerned because it appears to be random memory corruption (possibly due to
> unrelated code in Gecko).

More specifically, looking at the past 7 days (https://crash-stats.mozilla.com/topcrashers/?product=Firefox&version=57.0a1&days=7) it's been the #18 crash on Nightly, accounting for 0.71% of browser crashes. And if you switch to "any" crashes, it drops to #24 and 0.18%.

So: it's worth fixing, but it's not a hair-on-fire situation, IMO.
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #18)
> Chris, I noticed that the crash volume is significantly lower on Beta. How
> does the size of our beta experiment population compare to Nightly?

Our Nightly experiment has 96,531 Stylo users and our Beta experiment has 22,469 Stylo users. Over the last seven days, there were 125 crash reports from Nightly users and just 5 from Beta users. So Nightly is affected *much* more than Beta.

There is nothing unusual about the distribution of operating systems.

https://crash-stats.mozilla.com/signature/?signature=core%3A%3Aptr%3A%3Adrop_in_place%3CT%3E
Flags: needinfo?(cpeterson)
Flags: needinfo?(manishearth)
Flags: needinfo?(bob)
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #14)
> Marcia, can you or somebody on your team have a look through the crash
> report URLs and see if you can find any patterns at all? I had cpeterson
> look at a few of them, and there wasn't anything too interesting. But a more
> systematic analysis might be useful.

ni on Marco or Calixte to see if they can help, as I am out until Friday.
Flags: needinfo?(mozillamarcia.knous)
Flags: needinfo?(mcastelluccio)
Flags: needinfo?(cdenizet)
bholley: I don't see any particular patterns in the URLs.
Flags: needinfo?(mcastelluccio)
Flags: needinfo?(cdenizet)
I ran my fuzzer over the weekend on a bunch of URLs from the crash reports, to no avail.

To summarize the current state of things:

* This is a low-volume crash on Nightly, ~10 reports per day. According to Nick, this is low enough volume that it is not a significant stability concern. The only reason any attention is being paid is that it's a crash in stylo code, which warrants extra scrutiny.
* The crash is concerning because it appears to be the result of heap corruption. The two likeliest causes are (a) a bug in the Rust standard library's HashMap implementation, or (b) some other actor in Gecko erroneously clobbering heap memory it doesn't own.
* The crash signature was observed once in June, but only started to appear consistently in late July, when we enabled the Stylo experiment on Nightly and thus increased our test population by several orders of magnitude. This suggests the cause is longstanding, and gives us no useful regression window.
* We currently have no correlations or leads on reproducing this locally.

Barring any epiphanies, the next steps to investigate this would be to ship diagnostic code to our Nightly audience and monitor the impact on crash reports. The two approaches would be:

(1) Switch to a different, slower, third-party HashMap implementation. If this eliminates the crashes, that increases the probability of theory (a).
(2) Add code to page-align + pad the HashMap buffers, and then mprotect those pages outside of the function that rebuilds them (see the sketch after this comment). If this eliminates the crashes, that increases the probability of theory (b). Additionally, if we can identify a corresponding increase in access violation crashes, that could give us a backtrace pinpointing which part of Gecko clobbered the memory, which would be very valuable.

That said, both diagnostic approaches listed above have drawbacks in terms of product quality (performance, memory usage, and possibly stability) for the duration that the diagnostics are in place. Given where we are in the cycle, that doesn't seem advisable for a low-volume crash.

As such, I propose pausing investigation on this bug, and evaluating the crash rate when 57 goes to beta. If it's high enough to cause concern, we would then do the diagnostic investigation on Nightly 58, and only touch Beta 57 if we have an actual stability fix in hand.
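A sketch of approach (2), assuming a Unix-like target with the libc crate available (the crashing population is on Windows, where VirtualProtect would play the same role). The idea: give the table buffer its own page-aligned, page-padded allocation and keep it read-only except while the rebuild code runs, so a stray writer faults at the moment of corruption instead of at teardown.

use std::alloc::{alloc, Layout};

const PAGE: usize = 4096;

fn padded(len: usize) -> usize {
    (len + PAGE - 1) / PAGE * PAGE
}

// Page-aligned, page-padded backing allocation for the table buffer.
unsafe fn alloc_guarded(len: usize) -> *mut u8 {
    alloc(Layout::from_size_align(padded(len), PAGE).unwrap())
}

// Flip protection around the code that legitimately rebuilds the table.
unsafe fn set_readonly(buf: *mut u8, len: usize, readonly: bool) {
    let prot = if readonly {
        libc::PROT_READ
    } else {
        libc::PROT_READ | libc::PROT_WRITE
    };
    // A failed mprotect only costs us the diagnostic, so ignore the result here.
    libc::mprotect(buf as *mut libc::c_void, padded(len), prot);
}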
Joe+Jeff: Can you ack comment 26?
Flags: needinfo?(joe-bugzilla)
Flags: needinfo?(jgriffiths)
Ack. I'm also interested in correlations with some of the populations we've previously talked about, in case we want to disable stylo for entire classes of user that might see this.
Flags: needinfo?(joe-bugzilla)
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #27)
> Joe+Jeff: Can you ack comment 26?

These approaches to investigating the root cause seem reasonable to me, as does pausing until Beta.

What I need is:

1. document the plan to investigate so that spinning up the investigation process once we're in beta is simpler
2. add a meeting to our calendars (at least myself, JoeH, cpeterson, elan + your data analyst) to review beta crash rates and make a decision on how to proceed next.
Flags: needinfo?(jgriffiths)
(In reply to Jeff Griffiths (:canuckistani) (:⚡︎) from comment #29)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #27)
> > Joe+Jeff: Can you ack comment 26?
>
> These approaches to investigating the root cause seem reasonable to me, as
> does pausing until Beta.
>
> What I need is:
>
> 1. document the plan to investigate so that spinning up the investigation
> process once we're in beta is simpler

Ok. I'm not sure what there is to write beyond what's in comment 26, let me know if there are any details missing.

> 2. add a meeting to our calendars (at least myself, JoeH, cpeterson, elan +
> your data analyst) to review beta crash rates and make a decision on how to
> proceed next.

Sounds great. Chris, can you book this?
Flags: needinfo?(cpeterson)
Priority: P2 → P3
Bobby: One wrinkle here is, if you want distant early warning of this bug in beta, look first at the period when we're shipping Dev Edition 57 based on Beta 57 for a whole week before the real beta is released. DE is a much larger cohort than nightly, although also quite skewed towards newer hardware.
Flags: needinfo?(bobbyholley)
(In reply to Jeff Griffiths (:canuckistani) (:⚡︎) from comment #31)
> Bobby: One wrinkle here is, if you want distant early warning of this bug in
> beta, look first at the period when we're shipping Dev Edition 57 based on
> Beta 57 for a whole week before the real beta is released. DE is a much
> larger cohort than nightly, although also quite skewed towards newer
> hardware.

Acknowledged. I think in practice this means we should just book the checkin for very shortly after the 21st, since we'll already have a week's worth of data, which is probably the minimum before we start making any conclusions. Monday the 25th could work well.
Flags: needinfo?(bobbyholley)
Crash Signature: [@ core::ptr::drop_in_place<T>] → [@ core::ptr::drop_in_place<T>] [@ servo_arc::Arc<T>::drop_slow<T>]
See Also: → 1399668
There are a handful of non-hashmap-related crashes now, like [1]... I'd say this should either be our Arc stuff, or mozjemalloc's arenas being messed up, per bug 1399668 assertions. This seems to happen frequently on shutdown, which may point at the second thing...

[1]: https://crash-stats.mozilla.com/report/index/59d598b6-5b79-44ac-badd-1c0570170912
So I looked at the ServoArc::drop stuff. How is it supposed to be safe?

It just does Box::from_raw(self.ptr()). But self.ptr() is a pointer to ArcInner<HeaderSliceWithLength<H, [T]>>, so we're effectively dropping one of those as a Box. However, we allocated that using Vec::with_capacity<>.

That HeaderSliceWithLength was allocated using the alignment of T, but nothing prevents the header + length thing from having more alignment than that, right? I don't think it's the case, but that code would benefit from some more eyes I think... I find reasoning about it pretty hard.
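A cheap way to sanity-check that alignment worry (a generic sketch, not the actual Servo code; `Inner` stands in for the combined header + length + slice type and `T` for the element the allocation was sized and aligned for):

use std::mem::align_of;

// Illustrative helper: if the buffer is carved out with T's alignment, the
// combined inner type must not demand anything stricter.
fn check_alignment<Inner, T>() {
    assert!(
        align_of::<Inner>() <= align_of::<T>(),
        "header + length needs stricter alignment than the T-aligned allocation"
    );
}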
Also, can the compiler do something funny with drop order because of variance / contravariance stuff? In particular, for the compiler an Arc is just a &'static, per our use of NonZeroPtrMut...
> Also, can the compiler do something funny with drop order because of variance / contravariance stuff?

No. Drop order and variance are only relevant when T can contain references. T: 'static, so nope.
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #30)
> > 2. add a meeting to our calendars (at least myself, JoeH, cpeterson, elan +
> > your data analyst) to review beta crash rates and make a decision on how to
> > proceed next.
>
> Sounds great. Chris, can you book this?

Yes. I scheduled this meeting.
Flags: needinfo?(cpeterson)
There are a handful of crashes (the only ones with crash reason) referencing the NonZeroPtrMut::new assertion, in case it rings any bells: https://crash-stats.mozilla.com/report/index/265684d0-4af0-411d-b527-8ee5a0170911#tab-bugzilla
The other option is of course that we really got wrong the atomic ordering, and we could consider reverting https://github.com/rust-lang/rust/pull/41714...
(In reply to Emilio Cobos Álvarez [:emilio] from comment #34)
> There are a handful of non-hashmap-related crashes now, like [1]... I'd say
> this should either be our Arc stuff, or mozjemalloc's arenas being messed
> up, per bug 1399668 assertions.
>
> This seems to happen frequently on shutdown, which may point at the second
> thing...
>
> [1]: https://crash-stats.mozilla.com/report/index/59d598b6-5b79-44ac-badd-1c0570170912

Per comment 20, I think it's unlikely to be Arc-related. The fact that this occurs in a non-HashMap context does support the heap-corruption scenario. It's also worth noting that the only reason the crash report above is grouped in with the others is that every destructor in rust is named "drop_slow", whereas in C++ this would just be an unrelated crash report in the style context destructor. This too is compatible with the theory of heap corruption that may have predated stylo.

(In reply to Emilio Cobos Álvarez [:emilio] from comment #35)
> So I looked at the ServoArc::drop stuff. How is it supposed to be safe?

I got excited when I first read this, but I don't see any smoking gun in your analysis below. Did you find anything concerning besides the alignment issue?

> It just does Box::from_raw(self.ptr()). But self.ptr() is a pointer to
> ArcInner<HeaderSliceWithLength<H, [T]>>, so we're effectively dropping one
> of those as a Box. However, we allocated that using Vec::with_capacity<>.
>
> That HeaderSliceWithLength was allocated using the alignment of T, but
> nothing prevents the header + length thing from having more alignment than
> that, right?

If this were an alignment issue, we'd be seeing unaligned write crashes when initializing the arcs. It wouldn't cause memory corruption.

> I don't think it's the case, but that code would benefit from some
> more eyes I think... I find reasoning about it pretty hard.

More eyes is always good, but per the above I'm skeptical that this is a bug in the Arc code.

(In reply to Emilio Cobos Álvarez [:emilio] from comment #39)
> There are a handful of crashes (the only ones with crash reason) referencing
> the NonZeroPtrMut::new assertion, in case it rings any bells:
>
> https://crash-stats.mozilla.com/report/index/265684d0-4af0-411d-b527-8ee5a0170911#tab-bugzilla

This is presumably crashing in ThinArc::drop, which invokes from_thin on the pointer, which does NonZeroPtrMut::new(). So that fits what we've seen, where we're crashing on a null pointer to the Arced data, which suggests that the things holding the Arc pointers are getting corrupted (mostly the hashtables, occasionally other things).

(In reply to Emilio Cobos Álvarez [:emilio] from comment #40)
> The other option is of course that we really got wrong the atomic ordering,
> and we could consider reverting
> https://github.com/rust-lang/rust/pull/41714...

That didn't land, but I assume you mean the equivalent code in servo_arc. I would be extremely surprised if that were the issue, because:

* An acquire load is equivalent to an acquire fence here per spec.
* An acquire load just compiles down to an acquire fence + a load, as pointed out in that PR.
* The analysis above suggests Arc is an unlikely culprit.
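For concreteness, here are the two orderings being compared, as a minimal sketch of a generic refcount drop (not the actual servo_arc code): the classic pattern does a Release decrement followed by an Acquire fence before freeing, while the variant discussed in that PR uses an Acquire load of the count instead. Both order the deallocation after all prior writes made through other handles.

use std::sync::atomic::{fence, AtomicUsize, Ordering};

// Returns true when the caller should free the allocation.
fn drop_with_fence(count: &AtomicUsize) -> bool {
    if count.fetch_sub(1, Ordering::Release) == 1 {
        fence(Ordering::Acquire);
        return true;
    }
    false
}

fn drop_with_acquire_load(count: &AtomicUsize) -> bool {
    if count.fetch_sub(1, Ordering::Release) == 1 {
        count.load(Ordering::Acquire);
        return true;
    }
    false
}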
Depends on: 1403397
See Also: → 1404791
Group: core-security
The crash signatures appear to show UAF poisoned markers as well as wild pointers (which could be UAFs that got reallocated before use). I assume that's why Bobby hid this bug.
Group: core-security → layout-core-security
Looking at the past 7 days [1], the crash has been the #34 top crash on 57.0b, accounting for 0.36% of browser crashes. Do we have more pointers to progress the investigation? I've been walking through dozens of bug reports but still have had no luck reproducing...

[1] https://crash-stats.mozilla.com/topcrashers/?product=Firefox&version=57.0b&days=7
Adding some variants of drop_in_place crash signatures reported during Servo_StyleContext_Release:

[@ HeapFree | core::ptr::drop_in_place<T>]
[@ RtlEnterCriticalSection | HeapFree | core::ptr::drop_in_place<T>]
[@ arena_dalloc_small | HeapFree | core::ptr::drop_in_place<T>]
Crash Signature: [@ core::ptr::drop_in_place<T>] [@ servo_arc::Arc<T>::drop_slow<T>] → [@ HeapFree | core::ptr::drop_in_place<T>] [@ RtlEnterCriticalSection | HeapFree | core::ptr::drop_in_place<T>] [@ arena_dalloc_small | HeapFree | core::ptr::drop_in_place<T>] [@ core::ptr::drop_in_place<T>] [@ servo_arc::Arc<T>::drop_slow<T>]
See Also: 1399668
Crash Signature: [@ HeapFree | core::ptr::drop_in_place<T>] [@ RtlEnterCriticalSection | HeapFree | core::ptr::drop_in_place<T>] [@ arena_dalloc_small | HeapFree | core::ptr::drop_in_place<T>] [@ core::ptr::drop_in_place<T>] [@ servo_arc::Arc<T>::drop_slow<T>] → [@ HeapFree | core::ptr::drop_in_place<T>] [@ RtlEnterCriticalSection | HeapFree | core::ptr::drop_in_place<T>] [@ arena_dalloc_small | Allocator<T>::free | HeapFree | core::ptr::drop_in_place<T>] [@ arena_dalloc_small | HeapFree | core::ptr::drop_in_p…
Hi Jet: I have assigned these security bugs to you to reassign them to appropriate developers in your team to investigate and fix them. Thanks! Wennie
Assignee: nobody → bugs
status-firefox57=wontfix unless someone thinks this bug should block 57
manishearth -- is this bug the same RAM issue you and bholley believe is the source of the Rust HashMap issues?
Flags: needinfo?(manishearth)
It *could* be but I don't think we're sure of that. It's frequent enough to probably not be that issue. emilio, wdyt?
Flags: needinfo?(manishearth) → needinfo?(emilio)
Yeah, 100's per day doesn't sound like RAM issues
Yeah, I cannot make any sense of the crashes though; there are a bunch which are hashmap-related, but others aren't (like https://crash-stats.mozilla.com/report/index/f4749cb6-82e4-42a1-a947-130ce0180103). I honestly cannot make much sense of those crashes... The only explanation I can think of is external memory corruption, but... :(
Flags: needinfo?(emilio)
Are there any assertions/guards/diagnostics we could put in place to try to determine what's happening? I wouldn't be shocked if there is more than one cause.
(In reply to Randell Jesup [:jesup] from comment #53)
> Are there any assertions/guards/diagnostics we could put in place to try to
> determine what's happening? I wouldn't be shocked if there is more than one
> cause.

We've tried a lot of approaches to debug this, including mprotecting the hashtable memory when it is read-only (bug 1403397), but we haven't found a smoking gun yet.

We discussed temporarily replacing HashMap with OrderMap in Nightly, but haven't tried yet. Should we try this?
Crash Signature: core::ptr::drop_in_place<T>] [@ core::ptr::drop_in_place<T>] [@ servo_arc::Arc<T>::drop_slow<T>] → core::ptr::drop_in_place<T>] [@ core::ptr::drop_in_place<T>] [@ servo_arc::Arc<T>::drop_slow<T>] [@ arena_t::DallocSmall | HeapFree | servo_arc::Arc<T>::drop_slow<T>]
cpeterson: I think we should try something
Flags: needinfo?(cpeterson)
(In reply to Randell Jesup [:jesup] from comment #55)
> cpeterson: I think we should try something

Quoting Chris from Slack:

> Emilio temporarily replaced Stylo's HashMap with OrderMap in Nightly 60 to see if the random HashMap crashes "went away".
> Unfortunately, we still see similar random crashes in the OrderMap code and Emilio says they look a lot like the HashMap
> crashes when clearing the style data. So these crashes are probably caused by a random memory corruption bug in other code
> and not a bug in Rust's HashMap.
> The crash rate is pretty low, we don't have any new leads, and we used to see similar crashes in Gecko's layout hash
> tables, so AFAIK no further investigation is planned.

Just FYI to bring this discussion to a close.
Flags: needinfo?(cpeterson)
Crash Signature: core::ptr::drop_in_place<T>] [@ core::ptr::drop_in_place<T>] [@ servo_arc::Arc<T>::drop_slow<T>] [@ arena_t::DallocSmall | HeapFree | servo_arc::Arc<T>::drop_slow<T>] → core::ptr::drop_in_place<T>] [@ core::ptr::drop_in_place<T>] [@ servo_arc::Arc<T>::drop_slow<T>] [@ arena_t::DallocSmall | HeapFree | servo_arc::Arc<T>::drop_slow<T>] [@ arena_t::DallocSmall | arena_dalloc | HeapFree | servo_arc::Arc<T>::drop_slow<T>]
Whiteboard: [sec-triage-backlog]
Keywords: stalled
Assignee: bugs → svoisen
This is a fairly high-volume crash on 62 release, with well over 2000 crashes in the last week, and high volume on beta as well. Sean - are you actively working on this, or should we still consider it "stalled"?
Flags: needinfo?(svoisen)
Adding [@ static void core::ptr::drop_in_place<T> ] to the signature field since that appears to be the dominant signature for Firefox 61, and I think it's the same thing (see, e.g., bp-651813fe-6f79-4f64-b278-7ddfa0180826). I suspect that adding that signature will make the graph look smooth again.
Crash Signature: servo_arc::Arc<T>::drop_slow<T>] → servo_arc::Arc<T>::drop_slow<T>] [@ static void core::ptr::drop_in_place<T> ]
dbaron and I just spoke about this. We're still not actively working on it, as it remains unclear what, if anything, we can do. Given David's update to the signature field (there hasn't been any spike since the issue was opened), it remains stalled. In bug 1489297, David added reporting metadata about whether or not the system has ECC RAM, so hopefully after that goes to release we can at least better determine whether memory corruption is the culprit.
Flags: needinfo?(svoisen)
(In reply to David Baron :dbaron: 🇫🇷 ⌚UTC+2 from comment #58)
> Adding [@ static void core::ptr::drop_in_place<T> ] to the signature field
> since that appears to be the dominant signature for Firefox 61, and I think
> it's the same thing (see, e.g., bp-651813fe-6f79-4f64-b278-7ddfa0180826). I
> suspect that adding that signature will make the graph look smooth again.

Note that I added ptr::drop_in_place to the signature prefix list in bug 1495966 since by itself it was causing some unrelated crashes to be classified with the same signature.
Assignee: sean → nobody
Severity: critical → S2

Reclassifying to S3 given the low crash volume, given that we've annotated this as stalled, and given that our working theory here is that this is external memory corruption (which is hard to make actionable).

Severity: S2 → S3

The severity field for this bug is set to S3. However, the bug is flagged with the sec-high keyword.
:emilio, could you consider increasing the severity of this security bug?

For more information, please visit auto_nag documentation.

Flags: needinfo?(emilio)

See above.

Flags: needinfo?(emilio)

So core::ptr::drop_in_place<T> had a huge spike on 120.0a1, with only a few reports otherwise. We're trying to close out stalled bugs we can't move forward, so I'm resolving this as incomplete unless we get something actionable.

Status: NEW → RESOLVED
Closed: 10 months ago
Resolution: --- → INCOMPLETE

Since the bug is closed, the stalled keyword is now meaningless.
For more information, please visit BugBot documentation.

Keywords: stalled
Group: layout-core-security