Open Bug 1406996 (stylo-hashmap-crashes) Opened 7 years ago Updated 1 year ago

stylo: Investigate crash reports around Rust HashMaps

Categories

(Core :: CSS Parsing and Computation, defect, P2)


Tracking


Tracking Status
firefox57 + wontfix
firefox58 --- wontfix
firefox59 --- wontfix
firefox60 --- wontfix

People

(Reporter: bholley, Unassigned)

References

(Blocks 3 open bugs)

Details

(Keywords: sec-audit)

Crash Data

The investigation here has been split across various bugs, and I wanted a single location for discussion and investigation. I considered duping the various "Crash in X" bugs together and combining the signatures, but it's probably less disruptive to just file a new bug. This bug will block any crashes that we suspect are related to this issue, and depend on any diagnostics and counter-measures we've landed (some of which will eventually need to be backed out).
I made a slack channel to discuss this, so that we can have a private group chat with full history. Ping me if you want an add.
Group: core-security → layout-core-security
Keywords: sec-audit
Depends on: 1407080
Priority: -- → P2
I'll be on leave starting tomorrow, so Gankro is going to take this over, with help from dmajor.
Assignee: bobbyholley → a.beingessner
Discussed during triage, adding tracking 57+ and affected.
https://crash-stats.mozilla.com/report/index/ff8093e4-6f23-4646-a8bb-1835d0171012

HashMap Corruption (sz=8234, cap=16384, pairsz=64, cnry=0xe7e7e7e7e7e7e7e7, pos=8231, base_addr=0x23af1200000, cnry_addr=0x23af13075c8, jrnl_len=8272) 


It looks like a bucket near the end of a hashmap's buffer is thought to be occupied when it isn't.

(The pairsz alone isn't enough to determine whether it really is at the end of the buffer; some manual size calculations are still needed.)
The code in https://hg.mozilla.org/mozilla-central/annotate/bc7a5be76b72/servo/components/hashglobe/src/diagnostic.rs#l55 seems to report only the last corrupted instance (instead of the first), and its pos calculation may not be exact when there are earlier ones.

Probably worth also recording the number of instances.
Also, it seems that the canary field is not cleared when an item is removed from the hashmap. Maybe it should be cleared in that case so that we catch more corruption?
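For illustration, a minimal sketch of what that clearing might look like (the real code lives in hashglobe's diagnostic.rs; the types and names below are made up):

    use std::ptr;

    const CANARY: u64 = 0x42cafe9942cafe99;

    // Hypothetical per-entry wrapper: each stored value carries a canary that is
    // stamped on insert.
    struct CanaryEntry<V> {
        canary: u64,
        value: V,
    }

    fn stamp_canary_on_insert<V>(entry: &mut CanaryEntry<V>) {
        entry.canary = CANARY;
    }

    // The suggestion above: overwrite the canary with a volatile write when an
    // entry is removed, so a later consistency scan can tell a properly removed
    // bucket apart from one that still claims to be full, catching more kinds of
    // corruption.
    fn clear_canary_on_remove<V>(entry: &mut CanaryEntry<V>) {
        unsafe {
            ptr::write_volatile(&mut entry.canary, 0);
        }
    }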
Nick, you have crashdump access, could you get us access to the hashmap journal and contents?

(self.journal in https://hg.mozilla.org/mozilla-central/file/tip/servo/components/hashglobe/src/diagnostic.rs#l148 )

Thanks.
Ok, actually, given the bugs in the code mentioned by xidorn, there's some additional information in those numbers and it's pretty curious.

So firstly, 0xe7 is our fill pattern for the hashmap. Seeing it means we have reached a pristine empty bucket that was never full. (In fact, as xidorn mentioned, we don't clear canaries with write_volatile, so if a bucket used to be full it's likely to still have the canary.)


As xidorn noted the position value is totally wrong. We increment the position each time we see a valid full bucket, so the final value of position should be the number of valid full buckets seen.

The final value of position is 8231. The size of the hashmap is 8234. So we saw 3 broken buckets.

Now, what's interesting is that if position had equaled size, it would mean that we saw some broken empty buckets but that there were also extra full buckets, i.e. things had been misplaced.

This is not the case. There are buckets for which we've recorded a hash, *and for which we've incremented the size* (it's not just an errant write into the hashes segment), which are nowhere to be seen in the KV segment.


Which feels like we decided to insert something into the hashmap, did all the steps for it, and then ... neglected to record the KV. Again, these are pristine buckets; this isn't the result of a botched remove.

An alternate explanation is that we decided to insert something into the hashmap, did all the steps for it, and then recorded the KV in the wrong bucket -- a bucket which was already full or was later filled. This happened thrice. Given that the hashmap is 50% empty it's slightly unlikely that this would happen thrice by chance, which is also weird.

(Also, given that we don't explicitly clear canaries, I would expect the position value to be *more* than the size since it would also count now-vacant but previously-full entries)

Our diagnostic code is buggy, but the bugginess reveals some really interesting facts about this crash, though admittedly I'm unable to see where this corruption might have happened given the reduced search space from these diagnostics.

(I'll think more about this tomorrow)
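For reference, a minimal sketch of the scan logic described above (illustrative only; the real code is hashglobe's diagnostic.rs, and the names here are made up):

    const CANARY: u64 = 0x42cafe9942cafe99;
    const EMPTY_HASH: u64 = 0;

    // Walk the hash segment: for every bucket the hashes claim is full, check the
    // canary stored alongside its key/value pair. `position` counts the valid full
    // buckets, so a final value smaller than the recorded size means some buckets
    // are claimed-full in the hash segment but were never actually written in the
    // KV segment (sz=8234 and pos=8231 above, i.e. 3 broken buckets).
    fn scan(hashes: &[u64], canaries: &[u64]) -> (usize, usize) {
        let mut position = 0;
        let mut broken = 0;
        for (i, &h) in hashes.iter().enumerate() {
            if h != EMPTY_HASH {
                if canaries[i] == CANARY {
                    position += 1;
                } else {
                    broken += 1;
                }
            }
        }
        (position, broken)
    }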
https://github.com/servo/servo/pull/18861 fixes those bugs. We should be removing these diagnostics soon, but I don't think that will be for a couple of days, so we might as well have this stuff fixed in the meantime in case we need more information.
(In reply to Manish Goregaokar [:manishearth] from comment #8)
> Nick, you have crashdump access, could you get us access to the hashmap
> journal and contents?
> 
> (self.journal in
> https://hg.mozilla.org/mozilla-central/file/tip/servo/components/hashglobe/
> src/diagnostic.rs#l148 )

I'm having trouble finding this field in the crash report. There are several different things in the "Raw Dump" tab of a crash report and I've never managed to understand exactly what each one represents. The only one where I can see any sign of "HashMapJournal" is the .dmp file but that's a binary file and I don't know how to extract the data for that field.

Ted, can you assist? This is for https://crash-stats.mozilla.com/report/index/ff8093e4-6f23-4646-a8bb-1835d0171012
Flags: needinfo?(ted)
dmajor might also be able to help.
Flags: needinfo?(dmajor)
Bug 1408292 didn't help, but even accounting for that I still can't find this field.
So I went poking around. Nowadays we send annotations from child processes via shared memory, and that code has a hard 16KB limit on the total size of annotations:
https://dxr.mozilla.org/mozilla-central/rev/196dadb2fe500e75c6fbddcac78106648676cf10/ipc/glue/CrashReporterClient.h#50

The crash reason for this crash says:
jrnl_len = 8272

...so I am quite sure that the size of the annotation generated would be >16KB. (Each serialized journal entry is roughly 25-30 characters, so an 8272-entry journal comes to well over 200KB.)

Sorry, this is clearly all terrible.
Flags: needinfo?(ted)
Flags: needinfo?(dmajor)
bp-1fb7ea04-7d24-492a-9815-d03660171015 seems like a good one to investigate. It has
> HashMap Corruption (sz=160, cap=256, pairsz=24, cnry=0x42cafe9942c2fe99, count=1, last_pos=5, base_addr=0x19ac1000, cnry_addr=0x19ac18b0, jrnl_len=192)

A 192-item journal should be smaller than 16KB, I suppose.
Flags: needinfo?(ted)
bp-661bbc85-c6dd-4924-9682-f4f7f0171015 only contains 187.

Looking more closely at those crash reports with short journals, this really seems more like some kind of bit rot...

0x42cafe9942cafe99 ^ 0x42cafe9942c2fe99 = 0x0000_0000_0008_0000
0x42cafe9942cafe99 ^ 0x42caf69942cafe99 = 0x0000_0800_0000_0000
0x42cafe9942cafe99 ^ 0x42cafa9942cafe99 = 0x0000_0400_0000_0000
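A quick way to check those values (a sketch; the corrupted canaries are the ones from the reports above):

    // XOR each corrupted canary with the expected value; a popcount of 1 means the
    // difference is a single flipped bit.
    fn main() {
        const CANARY: u64 = 0x42cafe9942cafe99;
        for &bad in &[0x42cafe9942c2fe99u64, 0x42caf69942cafe99, 0x42cafa9942cafe99] {
            let diff = CANARY ^ bad;
            assert_eq!(diff.count_ones(), 1);
            println!("{:#018x} differs by one bit: {:#018x}", bad, diff);
        }
    }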
bp-a3758377-6333-4a9f-ac50-2357e0171015 is another long journal corruption with canary being 0xe7e7e7e7e7e7e7e7. It has
> HashMap Corruption (sz=9192, cap=16384, pairsz=64, cnry=0xe7e7e7e7e7e7e7e7, count=1, last_pos=3528, base_addr=0x26edfb00000, cnry_addr=0x26edfb81bc8, jrnl_len=9276)

I guess we have two problems, then. One is that bits get randomly set somehow; the other is that we read entries from positions that were never inserted.
(In reply to Xidorn Quan [:xidorn] UTC+10 from comment #15)
> bp-1fb7ea04-7d24-492a-9815-d03660171015 seems like a good one to
> investigate. It has
> > HashMap Corruption (sz=160, cap=256, pairsz=24, cnry=0x42cafe9942c2fe99, count=1, last_pos=5, base_addr=0x19ac1000, cnry_addr=0x19ac18b0, jrnl_len=192)
> 
> A 192 items journal should be smaller than 16KB I suppose.

This one doesn't have a journal.

(In reply to Xidorn Quan [:xidorn] UTC+10 from comment #16)
> bp-661bbc85-c6dd-4924-9682-f4f7f0171015 only contains 187.

This one does:

[Insert(9223372040800430640), Insert(9223372037583450023), Insert(9223372037331258603), Insert(9223372037363030801), Insert(9223372039430138675), Insert(9223372040348347031), Insert(9223372040481347453), Insert(9223372039448211261), Insert(9223372039254182491), Insert(9223372038421190002), Insert(9223372039917800308), Insert(9223372037988661486), Insert(9223372039026795514), Insert(9223372040381933642), Insert(9223372038485486634), Insert(9223372037179874635), Insert(9223372037115169326), Insert(9223372038230214386), Insert(9223372038617724901), Insert(9223372039244466658), Insert(9223372037699397048), Insert(9223372040512701811), Insert(9223372038179067507), Insert(9223372037327718718), Insert(9223372040249027737), Insert(9223372038431253912), Insert(9223372040782608028), Insert(9223372039045589349), Insert(9223372040787469832), Insert(9223372041108375103), Insert(9223372037683846781), Insert(9223372040752179429), Insert(9223372037660816815), Insert(9223372038597259377), Insert(9223372040732592612), Insert(9223372039972984854), Insert(9223372038628606895), Insert(9223372039484361949), Insert(9223372037842923432), Insert(9223372037811276715), Insert(9223372040369610561), Insert(9223372037036415018), Insert(9223372040467028517), Insert(9223372038679530105), Insert(9223372040748744789), Insert(9223372038558818406), Insert(9223372040160249925), Insert(9223372038410520761), Insert(9223372040049239808), Insert(9223372037691020926), Insert(9223372038446144062), Insert(9223372039746359810), Insert(9223372036866689104), Insert(9223372040054217968), Insert(9223372039025581704), Insert(9223372039092085797), Insert(9223372039085554332), Insert(9223372038676785515), Insert(9223372038379887956), Insert(9223372040896862216), Insert(9223372040890396317), Insert(9223372039540879453), Insert(9223372037001352116), Insert(9223372037606201184), Insert(9223372038108338082), Insert(9223372039503594112), Insert(9223372038010538333), Insert(9223372037117965481), Insert(9223372039095246646), Insert(9223372039888935450), Insert(9223372038865488586), Insert(9223372040487938133), Insert(9223372040853080038), Insert(9223372039188623770), Insert(9223372037882431777), Insert(9223372037532955795), Insert(9223372039352260868), Insert(9223372040498929429), Insert(9223372037922721777), Insert(9223372041048803252), Insert(9223372037045106544), Insert(9223372037809608083), Insert(9223372036885826728), Insert(9223372037975487061), Insert(9223372039578413606), Insert(9223372039028516597), Insert(9223372039638085596), Insert(9223372039042726425), Insert(9223372038634330380), Insert(9223372039578606854), Insert(9223372040598030232), Insert(9223372040655232645), Insert(9223372040178087295), Insert(9223372038949753120), Insert(9223372038498770849), Insert(9223372038023184547), Insert(9223372040937882656), Insert(9223372039915704403), Insert(9223372038359325460), Insert(9223372038865354670), Insert(9223372037834900915), Insert(9223372040559428669), Insert(9223372039204107924), Insert(9223372038040175412), Insert(9223372038378653066), Insert(9223372039514863269), Insert(9223372040846253689), Insert(9223372038041699682), Insert(9223372040169881236), Insert(9223372040614914203), Insert(9223372041032242639), Insert(9223372037732994521), Insert(9223372037437193805), Insert(9223372039953608731), Insert(9223372039507312984), Insert(9223372039219619240), Insert(9223372040485013175), Insert(9223372038399485790), Insert(9223372040916113567), Insert(9223372040410067134), Insert(9223372039215482378), Insert(9223372038615139304), 
Insert(9223372037237889627), Insert(9223372036919918032), Insert(9223372038511803821), Insert(9223372039887814695), Insert(9223372038635861748), Insert(9223372040302565961), Insert(9223372039176808310), Insert(9223372037582161228), Insert(9223372038736424646), Insert(9223372038444272149), Insert(9223372038958762995), Insert(9223372040255352284), Insert(9223372037940704922), Insert(9223372038015972846), Insert(9223372039009185764), Insert(9223372038432151448), Insert(9223372039244466658), Insert(9223372037683846781), Insert(9223372037811276715), Insert(9223372037036415018), Insert(9223372038558818406), Insert(9223372040916113567), Remove(9223372038617724901), Insert(9223372040916113567), Insert(9223372040916113567), Insert(9223372039162663787), Insert(9223372039204107924), Insert(9223372038399485790), Insert(9223372038617724901), Insert(9223372039244466658), Insert(9223372037683846781), Insert(9223372037811276715), Insert(9223372040916113567), Insert(9223372040776354999), Insert(9223372040570174603), Insert(9223372039244466658), Insert(9223372037683846781), Insert(9223372037811276715), Insert(9223372040916113567), Insert(9223372040776354999), Remove(9223372038617724901), Insert(9223372038205317300), Insert(9223372038205317300), Insert(9223372037485976815), Insert(9223372038108918288), Insert(9223372040855021246), Insert(9223372038250087590), Insert(9223372037687809427), Insert(9223372037766414881), Insert(9223372038740877830), Insert(9223372038668605532), Insert(9223372039675084955), Insert(9223372038796126079), Insert(9223372039535142435), Insert(9223372038459810812), Insert(9223372039939177951), Insert(9223372038755424106), Insert(9223372038804570148), Insert(9223372040354952311), Insert(9223372039105907127), Insert(9223372038298388083), Insert(9223372038027391260), Insert(9223372039120322629), Insert(9223372037450597366)]
Another observation: it seems that the majority of the corruption reports in the last week happen in either begin_mutation or drop; there are two cases in end_mutation. The two cases are both on old version without the incorrect pos data, but they have identical bad canary value, which seems to be bit rot.

This makes me think that it is probably unrelated to the hashmap itself; rather, something outside is overflowing into the hashmap's contents and changing some bits.
(In reply to Mike Hommey [:glandium] from comment #18)
> (In reply to Xidorn Quan [:xidorn] UTC+10 from comment #16)
> > bp-661bbc85-c6dd-4924-9682-f4f7f0171015 only contains 187.
> 
> This one does:

Thanks.
Flags: needinfo?(ted)
(In reply to Xidorn Quan [:xidorn] UTC+10 from comment #19)
> The two cases are both on old version without the incorrect pos data,
> but they have identical bad canary value, which seems to be bit rot.

I mean, without the correct pos data.
(In reply to Xidorn Quan [:xidorn] UTC+10 (less responsive Nov 5 ~ Dec 16) from comment #19)
> This makes me think that it is probably unrelated to hashmap, but something
> outside is overflowing into hashmap content and changing some bits inside.

I was able to reproduce bug 1385925 once. I can't repro again but here are a few unique details:

1. I was running a Mac Release Build built on this changeset

changeset:   383923:15f221f491f7
tag:         tip
fxtree:      central
user:        ffxbld
date:        Mon Oct 02 10:46:46 2017 -0700

2. The build crashed seconds after startup, while running only a single test file.

3. I had local changes that I know to corrupt memory (I was doing some non-CSS ImgLib hackery to diagnose a bug)

4. Running the same steps again on the same local build wouldn't crash.

The bug I was chasing down is also memory corruption, but it shows up as rendering glitches (bug 1404366). The fact that I was running only that test file when I crashed in the Stylo hashmap is another indicator that external forces may be at play here.

We've had frequent crashes in RuleHash (e.g., bug 719116 and bug 792626) in the C++ style system, but I'm not sure if the correlations are similar to the Stylo crashes we're seeing here.
> I guess we have two problems, then. One is the bits are randomly set somehow, the other is that we read things from uninserted position.

The hashmap contains an array of hashes, and non-zero hashes indicate a valid entry in the key-value array. If a hash bitrotted, then iteration would identify it as a valid entry and access it.
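A minimal sketch of that layout and the failure mode (illustrative only; not the actual std/hashglobe code):

    use std::mem::MaybeUninit;

    const EMPTY_HASH: u64 = 0;

    // One allocation, conceptually split into a hash segment and a key-value
    // segment. A hash of 0 means "empty"; anything non-zero marks the matching
    // pair slot as live.
    struct RawTableSketch<K, V> {
        hashes: Vec<u64>,
        pairs: Vec<MaybeUninit<(K, V)>>,
    }

    impl<K, V> RawTableSketch<K, V> {
        // Iteration (and Drop) trusts the hash segment: if a bit flip turns an
        // EMPTY_HASH into a non-zero value, this hands out a reference to an
        // uninitialized (or stale) pair, which is the kind of access the crash
        // reports show.
        fn iter(&self) -> impl Iterator<Item = &(K, V)> + '_ {
            self.hashes
                .iter()
                .zip(self.pairs.iter())
                .filter(|(h, _)| **h != EMPTY_HASH)
                .map(|(_, p)| unsafe { p.assume_init_ref() })
        }
    }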
Yeah, we have a bunch of crashes with a canary one bit off from cafe99. I feel like those are bitrot coming in from elsewhere (unsure if related to stylo code).

It's the crashes with canary e7e7 that are troubling; but those are all on huge hashmaps.
(In reply to Xidorn Quan [:xidorn] UTC+10 (less responsive Nov 5 ~ Dec 16) from comment #19)
> This makes me think that it is probably unrelated to hashmap, but something
> outside is overflowing into hashmap content and changing some bits inside.

Since we're already doing a bunch of custom allocation stuff for hashmaps (or we were?  I may have lost track of what the current state is here), can we ensure that we allocate and page-protect the pages before and after the hashmap to try and catch whatever's doing this?
> can we ensure that we allocate and page-protect the pages before and after the hashmap to try and catch whatever's doing this?

I was under the impression that we already did that and this only revealed a bug in spidermonkey
(In reply to Manish Goregaokar [:manishearth] from comment #26)
> > can we ensure that we allocate and page-protect the pages before and after the hashmap to try and catch whatever's doing this?
> 
> I was under the impression that we already did that and this only revealed a
> bug in spidermonkey

Yeah, this was tried before indeed.
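For context, a rough sketch of the guard-page approach being discussed, using the libc crate on Unix (an illustration of the idea, not the diagnostic code that was actually landed and later backed out):

    // Allocate `len` usable bytes with an inaccessible page on each side, so any
    // out-of-bounds write from neighbouring memory faults immediately instead of
    // silently corrupting the buffer.
    unsafe fn alloc_with_guard_pages(len: usize) -> *mut u8 {
        let page = 4096usize;
        let padded = (len + page - 1) / page * page;
        let base = libc::mmap(
            std::ptr::null_mut(),
            padded + 2 * page,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE | libc::MAP_ANON,
            -1,
            0,
        );
        assert!(base != libc::MAP_FAILED);
        // Protect the first and last page; the caller gets the region in between.
        libc::mprotect(base, page, libc::PROT_NONE);
        libc::mprotect((base as *mut u8).add(page + padded) as *mut _, page, libc::PROT_NONE);
        (base as *mut u8).add(page)
    }

One caveat noted later in this bug: every allocation gets rounded up to a page, which changes allocation patterns and can perturb the crashes being chased.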
(In reply to Manish Goregaokar [:manishearth] from comment #24)
> It's the crashes with canary e7e7 that are troubling; but those are all on
> huge hashmaps.

Can we include the hash of the object which we're adding to the hashmap, so we can see if it's just an EMPTY_HASH which has bitrotted to appear full?

I'm inclined to think it isn't, as the information manish was showing me suggested that the length was also wrong.
I spent the morning running the input in https://bugzilla.mozilla.org/show_bug.cgi?id=1406996#c18 a bunch of times, messing around with retaining/freeing multiple instances, with/without asan/optimizations, and so far it's holding rock solid.
Pasting this comment from @manishearth from earlier today re these crashes, as it's significant:

https://crash-stats.mozilla.com/report/index/066fe396-818a-4d41-8179-3eb2e0171023
HashMap Corruption (sz=8132, buffer_hash_sz=8133, cap=16384, pairsz=64, hash=0x80000000, cnry=0xe7e7e7e7e7e7e7e7, count=1, last_pos=7592, base_addr=0x22e4dd00000, cnry_addr=0x22e4de0e908, jrnl_len=8144)

HashMap Corruption (sz=2026, buffer_hash_sz=2056, cap=4096, pairsz=64, hash=0x31, cnry=0xe7e7e7e7e7e7e7e7, count=30, last_pos=1046, base_addr=0x18a32000, cnry_addr=0x18a59e48, jrnl_len=2547)

HashMap Corruption (sz=7703, buffer_hash_sz=7703, cap=16384, pairsz=48, hash=0x80000000aa1bd9a8, cnry=0x42cafe9942cafe22, count=4, last_pos=3106, base_addr=0x1ab02000, cnry_addr=0x1ab6ef88, jrnl_len=25284)

for the e7 canaries, buffer_hash_sz > sz, so it's pretty clear that it's the hashes getting written to

In the original one we had a case where we actually had sz = buffer_hash_sz (we didn't report that number, but it was inferrable because of the bug in the reporting), which was concerning; that might have been a fluke, though. For the cafe canaries we get sz = buffer_hash_sz, so this is pretty clearly random writes to the buffer, I think.

Also note that the hash is 0x80000000, i.e. one bit flip. The other one is 0x31, so not one bit flip; however, that buffer has 30 broken hashes, so it might just be a case where we had multiple flips.
I normally think of cosmic-ray-type incidents when someone says "bit flip" - some sort of HW error, which happens on non-ECC memory (occasionally).  However, it tends not to happen in the same spots.

Here we have repeated crashes in code related to the hashmaps, it appears, and perhaps by "bit flips" you mean write-after-free or wild writes.  It's possible, though (very) unusual, for such writes to be single bit modifications (true bit flips).  And why would such writes (flips or not) keep targeting these hashmaps?

Since I don't know the specific code here, I may be missing something.  And 30 "flips" in a buffer doesn't sound like HW either.  Again, not impossible, but darn unlikely.
(In reply to Jet Villegas (:jet) from comment #30)
> Pasting this comment from @manishearth from earlier today re these crashes,
> as it's significant:
> 
> https://crash-stats.mozilla.com/report/index/066fe396-818a-4d41-8179-
> 3eb2e0171023
> HashMap Corruption (sz=8132, buffer_hash_sz=8133, cap=16384, pairsz=64,
> hash=0x80000000, cnry=0xe7e7e7e7e7e7e7e7, count=1, last_pos=7592,
> base_addr=0x22e4dd00000, cnry_addr=0x22e4de0e908, jrnl_len=8144)

This is likely a bit flip...

> HashMap Corruption (sz=2026, buffer_hash_sz=2056, cap=4096, pairsz=64,
> hash=0x31, cnry=0xe7e7e7e7e7e7e7e7, count=30, last_pos=1046,
> base_addr=0x18a32000, cnry_addr=0x18a59e48, jrnl_len=2547)

but this is clearly not. There must be something wrong with this.

And the crash report for this is bp-e6608850-2880-4fcb-a594-aaa300171023.

Also, bp-942da53e-2de3-4cf2-9f94-253620171023 is interesting as well. It has
HashMap Corruption (sz=12074, buffer_hash_sz=12074, cap=16384, pairsz=24, hash=0xa2e52d93, cnry=0x42c2fe99, count=1, last_pos=8649, base_addr=0xd759000, cnry_addr=0xd7ad614, jrnl_len=12344)

So half of the canary is cleared, while the other half contains a bit flip.
Over at https://bugzilla.mozilla.org/show_bug.cgi?id=1404860#c23, nbp
speculates that there might be a problem with mozjemalloc meta-data
management.  And I wonder if that's what we're seeing here.  For sure
mozjemalloc does twiddle individual bits in metadata management:

memory/build/mozjemalloc.cpp, arena_run_reg_alloc():

	mask = run->regs_mask[i];
	[..]
		/* Clear bit. */
		mask ^= (1U << bit);
		run->regs_mask[i] = mask;

arena_run_reg_dalloc():

	/* Set bit (mark the region as available again). */
	run->regs_mask[elm] |= (1U << bit);
But "external" modifications of the buffer have been eliminated with mprotect, right?
(In reply to Mike Hommey [:glandium] from comment #34)
> But "external" modifications of the buffer have been eliminated with
> mprotect, right?

Right.  But that doesn't rule out "internal" modifications.

Given (1) the apparent randomness of failures and the absence of any
reproducible test case, (2) the failure of ASan and the mprotect scheme to
find any smoking gun, and (3) the relative doubtfulness of the
hardware-problem hypothesis, I have been wondering for some weeks now
whether this is a data race.  It bears the signs of a race.

Has that possibility been considered in detail?  Do we have any evidence
for/against that?

In particular, some time back in the summer, mozjemalloc was modified so as
to have one-arena-per-stylo thread (I can't find the bug number now), so the
level of concurrency there has gone up a bit.  And so I am wondering if
there might be a race or other concurrency bug inside mozjemalloc itself.

I am half inclined to saw it off and write a multithreaded stress test case
for it, to see what happens.  Since that would make it into a simple
standalone program, testing it with a race detector (Helgrind?) is also a
possibility.
(In reply to Julian Seward [:jseward] from comment #35)
> (In reply to Mike Hommey [:glandium] from comment #34)
> > But "external" modifications of the buffer have been eliminated with
> > mprotect, right?
> 
> Right.  But that doesn't rule out "internal" modifications.

The only internal modifications that are not ruled out are those that happen when the pages are writable. Mozjemalloc doesn't make them writable on its own. So the only way it could be writing to mprotected pages is if it was somehow racing with another thread that made the pages writable. Which means there are also plenty of cases where it would lose the race and write to the pages while mprotected, and we should have caught that.
(In reply to Mike Hommey [:glandium] from comment #34)
> But "external" modifications of the buffer have been eliminated with
> mprotect, right?

To be clear, the mprotect diagnostics were only on nightly for a few days, and backed out some time ago. They caused some intermittent failures on CI and so there was pressure to back them out quickly, but I kind of wish we'd let them ride for longer. But the key point here is that they are _not_ active now.

The tricky thing about them was that all allocations get rounded up to page size, which could impact crashes, especially if we believe it's mozjemalloc doing the bit flipping. We could try again, but before doing that I think we should do the diagnostics in bug 1405525 comment 6, since we _did_ get one suspicious stack from those diagnostics.

(In reply to Julian Seward [:jseward] from comment #35)
> (In reply to Mike Hommey [:glandium] from comment #34)
> > But "external" modifications of the buffer have been eliminated with
> > mprotect, right?
> 
> Right.  But that doesn't rule out "internal" modifications.
> 
> Given (1) the apparent randomness of failures and the absence of any
> reproducible test case, (2) the failure of ASan and the mprotect scheme to
> find any smoking gun, and (3) the relative doubtfulness of the
> hardware-problem hypothesis, I have been wondering for some weeks now
> whether this is a data race.  It bears the signs of a race.
> 
> Has that possibility been considered in detail?  Do we have any evidence
> for/against that?
> 
> In particular, some time back in the summer, mozjemalloc was modified so as
> to have one-arena-per-stylo thread (I can't find the bug number now), so the
> level of concurrency there has gone up a bit.  And so I am wondering if
> there might be a race or other concurrency bug inside mozjemalloc itself.

These data structures are only mutated when the rayon pool is idle. It's not clear to me where the race would come from.
And to emphasize, I really think the crash in bug 1405525 is worth following up on. If it is the case that the off-main-thread spidermonkey GC is double-freeing strings, that would explain quite a few of the puzzle pieces here (especially the part where, IIUC, the corruption happens both inside and outside of {begin,end}_mutation pairs).
(In reply to Bobby Holley (parental leave - send mail for anything urgent) from comment #37)
> These data structures are only mutated when the rayon pool is idle. It's not
> clear to me where the race would come from.

As I understood it, this would be a race on mozjemalloc while allocating other stuff in parallel, like the stylo rule nodes, structs, and such. We do perform allocations off-main-thread, just not the hashmap bucket ones.
(In reply to Emilio Cobos Álvarez [:emilio] from comment #39)
> (In reply to Bobby Holley (parental leave - send mail for anything urgent)
> from comment #37)
> > These data structures are only mutated when the rayon pool is idle. It's not
> > clear to me where the race would come from.
> 
> As I understood it, this would be a race on mozjemalloc while allocating
> other stuff in parallel, like the stylo rule nodes, structs, and such. We do
> perform allocations off-main-thread, just not the hashmap bucket ones.

If that were the case, we wouldn't see crashes in end_mutation (which per comment 19, we may be seeing, though it's hard to say).

I'm really starting to suspect the GC here. It would also explain why the largest hashmap crash volume is in hashmap teardown, since that's exactly when the GC would be going on a freeing spree.
(In reply to Bobby Holley (parental leave - send mail for anything urgent) from comment #38)
> If it is the case that the off-main-thread spidermonkey GC is
> double-freeing strings, that would explain quite a few of the puzzle pieces

That would seem to imply that mozjemalloc doesn't detect and reject double
frees, instead silently corrupting the heap (or its metadata).  Is that 
indeed the case?
Flags: needinfo?(mh+mozilla)
It actively rejects double frees, via MOZ_DIAGNOSTIC_ASSERTs.

For allocations < ~1MB and >= 4KB:
https://dxr.mozilla.org/mozilla-central/source/memory/build/mozjemalloc.cpp#3690

For allocations < 4KB:
https://dxr.mozilla.org/mozilla-central/source/memory/build/mozjemalloc.cpp#2453

For allocations >= ~1MB:
https://dxr.mozilla.org/mozilla-central/source/memory/build/mozjemalloc.cpp#4196
Note that while the assert is a MOZ_ASSERT, and thus debug-only, the huge.Remove below it will actively crash with a null deref if we're in the presence of a double free.
Flags: needinfo?(mh+mozilla)
(In reply to Mike Hommey [:glandium] from comment #42)
> It actively rejects double frees, via MOZ_DIAGNOSTIC_ASSERTs.

Given that (and, confirmed via irc chat with Mike that mozjemalloc
should detect *all* double frees of mozjemalloc-issued pointers),
I don't see how this can be consistent with the theory that the
corruption is caused by spidermonkey double-frees.
(In reply to Julian Seward [:jseward] from comment #35)
[regarding mozjemalloc]
> I am half inclined to saw it off and write a multithreaded stress test case
> for it, to see what happens.  Since that would make it into a simple
> standalone program, testing it with a race detector (Helgrind?) is also a
> possibility.

I did that.  Stress test case (20 threads, per-thread arenas, several
million allocations of varying sizes) didn't find anything.  Helgrind
reported three races, one of which Mike already fixed, and the other two of
which are (per discussion with Mike) harmless initialisation races.

So, nothing to see here, it seems.
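For the record, a rough sketch of the shape of such a stress test (the real one was a standalone program hammering mozjemalloc with per-thread arenas; this Rust version only illustrates the thread/allocation pattern, not the actual harness):

    use std::thread;

    fn main() {
        let handles: Vec<_> = (0..20u64)
            .map(|t| {
                thread::spawn(move || {
                    let mut keep: Vec<Vec<u8>> = Vec::new();
                    let mut x: u64 = 0x9e3779b97f4a7c15 ^ t;
                    for i in 0..1_000_000u64 {
                        // xorshift64 for cheap pseudo-random allocation sizes
                        x ^= x << 13;
                        x ^= x >> 7;
                        x ^= x << 17;
                        keep.push(vec![0u8; (x % 4096 + 1) as usize]);
                        if i % 64 == 0 {
                            keep.clear(); // free in batches to exercise the dealloc paths
                        }
                    }
                })
            })
            .collect();
        for h in handles {
            h.join().unwrap();
        }
    }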
(In reply to Julian Seward [:jseward] from comment #44)
> (In reply to Julian Seward [:jseward] from comment #35)
> [regarding mozjemalloc]
> > I am half inclined to saw it off and write a multithreaded stress test case
> > for it, to see what happens.  Since that would make it into a simple
> > standalone program, testing it with a race detector (Helgrind?) is also a
> > possibility.
> 
> I did that.  Stress test case (20 threads, per-thread arenas, several
> million allocations of varying sizes) didn't find anything.  Helgrind
> reported three races, one of which Mike already fixed, and the other two of
> which are (per discussion with Mike) harmless initialisation races.

Not that it matters much, but they should be fixed too, now.
Blocks: 1417057
Can we capture any results from the investigations here? I heard considerable detail in person that I don't see in Bugzilla. Thanks
Flags: needinfo?(manishearth)
Rust hashmaps are stored as a single long buffer, with one segment for the hashes (which record key occupancy) and one for the key-value pairs.

Our investigation basically pointed to there being corruption of this buffer. Sometimes a key occupancy cell would be corrupted, leading to the hashmap thinking there's an entry where there isn't (which we could verify via canaries); and sometimes a value would be corrupted. These would generally be single bit flips (IIRC).

We had at one point mprotected things; so it didn't seem to be corruption coming in from elsewhere (however we could have just been unlucky for the duration of the mprotect).

:dbaron pointed out that we've had such corruption of Gecko hashmaps in the past, though less frequently. Our current hypothesis is that this is just random memory corruption happening on certain systems (e.g. old RAM, or RAM with mismatched timings), which happens to hit hashmaps in particular because they are big buffers that are easy to corrupt into visible crashes by flipping one bit (if you flip any bit in the key occupancy area, the hashmap will probably crash in its destructor).
Flags: needinfo?(manishearth)
Assignee: a.beingessner → manishearth
Note that I haven't analyzed the relative frequency of these crashes.
(In reply to Emilio Cobos Álvarez [:emilio] from comment #48)
> I switched nightly to use ordermap, and there are still crashes:

Very useful to know! This is basically the only outcome of this experiment that we would have been able to draw conclusions from. :-)

(In reply to Emilio Cobos Álvarez [:emilio] from comment #49)
> Note that I haven't analyzed the relative frequency of these crashes.

I don't think that would necessarily tell us much, since OrderMap may well be less sensitive to this kind of bit clobbering than std::HashMap.
(In reply to Manish Goregaokar [:manishearth] from comment #47)
> :dbaron pointed out that we've had such corruption of Gecko hashmaps in the
> past; though less frequent. Our current hypothesis is that this is just
> random memory corruption happening on certain systems (e.g. if you have old

If we're still getting a steady stream of these crashes then we might be able to test this hypothesis now. bug 1489297 added an annotation to crash reports to indicate whether the system uses ECC memory.

If you look at the Metadata tab on a crash report from my Thinkstation:
https://crash-stats.mozilla.com/report/index/9f557855-08e6-4353-82b6-03cf40181024#tab-metadata

You can see " MemoryErrorCorrection 	Multi-bit ECC".
A single crash with something other than None or blank in the last 6 months:
https://crash-stats.mozilla.com/report/index/cfb5f727-1535-41a1-96ad-ebe340181106
Interesting! I grabbed that dump and ran it through get-minidump-instructions and the faulting instruction is the lock sub here:
   7ffd243a528c:	48 8b 07             	mov rax, qword ptr [rdi]
   7ffd243a528f:	f0 48 81 28 01 00 00 	lock sub qword ptr [rax], 1
   7ffd243a5296:	00                   	

From the "raw dump tab" we have: "rax": "0x000201ba483e6e40", which is a non-canonical pointer (hence why the crash address comes out as 0xffffffffffffffff).

More interesting: looking at some of the other registers, it does look like a single bit flip in a pointer value:
"r9":  "0x000001ba3f001100",
"rax": "0x000201ba483e6e40",
"rbp": "0x000001ba3bfb0700",

The pointer in rax came from rdi, which points to the stack: "rdi": "0x0000000ab9ff6d58",

so we actually have that stack memory in the minidump and it has the bogus pointer value:
0000000ab9ff6d58 000201ba483e6e40
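Assuming the intended value shared the 0x000001ba prefix of the neighbouring pointers (r9/rbp), the corruption is exactly one flipped bit (bit 49):

    fn main() {
        let bad: u64 = 0x000201ba483e6e40;  // rax from the raw dump
        let good: u64 = 0x000001ba483e6e40; // assumed intended value
        let diff = bad ^ good;
        assert_eq!(diff, 1u64 << 49);
        assert_eq!(diff.count_ones(), 1);
    }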

Hi, is this still something you are looking into? Are any of the recent duplicates useful?

Flags: needinfo?(manishearth)

Our conclusion here was basically "some people have bad RAM and the Rust hashmap implementation is better at crashing when some bits get flipped" -- there were crashes like this before Stylo on the C++ hashmaps, at a lower frequency. The mprotect stuff ensured that it wasn't other code stomping on our buffers, and other things verified that it wasn't our code either.

Flags: needinfo?(manishearth)

We should probably unhide this bug at this point.

Crash Signature: [@ hashglobe::hash_map::HashMap<T>::clear] [@ hashglobe::hash_map::HashMap<T>::clear<T>]
Crash Signature: [@ hashglobe::hash_map::HashMap<T>::clear] [@ hashglobe::hash_map::HashMap<T>::clear<T>] → [@ hashglobe::hash_map::HashMap<T>::clear] [@ hashglobe::hash_map::HashMap<T>::clear<T>] [@ core::ptr::drop_in_place<T> | style::stylist::GenericElementAndPseudoRules<T>::clear]

The bug assignee didn't log in to Bugzilla in the last 7 months.
:emilio, could you have a look please?
For more information, please visit auto_nag documentation.

Assignee: manishearth → nobody
Flags: needinfo?(emilio)
Crash Signature: [@ hashglobe::hash_map::HashMap<T>::clear] [@ hashglobe::hash_map::HashMap<T>::clear<T>] [@ core::ptr::drop_in_place<T> | style::stylist::GenericElementAndPseudoRules<T>::clear] → [@ hashglobe::hash_map::HashMap<T>::clear] [@ hashglobe::hash_map::HashMap<T>::clear<T>] [@ core::ptr::drop_in_place<T> | style::stylist::GenericElementAndPseudoRules<T>::clear]
Flags: needinfo?(emilio)

Copying crash signatures from duplicate bugs.

Crash Signature: [@ hashglobe::hash_map::HashMap<T>::clear] [@ hashglobe::hash_map::HashMap<T>::clear<T>] [@ core::ptr::drop_in_place<T> | style::stylist::GenericElementAndPseudoRules<T>::clear] → [@ hashglobe::hash_map::HashMap<T>::clear] [@ hashglobe::hash_map::HashMap<T>::clear<T>] [@ core::ptr::drop_in_place<T> | style::stylist::GenericElementAndPseudoRules<T>::clear] [@ core::ptr::drop_in_place<T> | style::stylist::CascadeData::clear]
Crash Signature: [@ hashglobe::hash_map::HashMap<T>::clear] [@ hashglobe::hash_map::HashMap<T>::clear<T>] [@ core::ptr::drop_in_place<T> | style::stylist::GenericElementAndPseudoRules<T>::clear] [@ core::ptr::drop_in_place<T> | style::stylist::CascadeData::clear] → [@ hashglobe::hash_map::HashMap<T>::clear] [@ hashglobe::hash_map::HashMap<T>::clear<T>] [@ core::ptr::drop_in_place<T> | style::stylist::GenericElementAndPseudoRules<T>::clear] [@ core::ptr::drop_in_place<T> | style::stylist::CascadeData::clear]
Group: layout-core-security

Copying crash signatures from duplicate bugs.

Crash Signature: [@ hashglobe::hash_map::HashMap<T>::clear] [@ hashglobe::hash_map::HashMap<T>::clear<T>] [@ core::ptr::drop_in_place<T> | style::stylist::GenericElementAndPseudoRules<T>::clear] [@ core::ptr::drop_in_place<T> | style::stylist::CascadeData::clear] → [@ hashglobe::hash_map::HashMap<T>::clear] [@ hashglobe::hash_map::HashMap<T>::clear<T>] [@ core::ptr::drop_in_place<T> | style::stylist::GenericElementAndPseudoRules<T>::clear] [@ core::ptr::drop_in_place<T> | style::stylist::CascadeData::clear] [@ Ge…
Severity: normal → S3

The severity field for this bug is relatively low, S3. However, the bug has 9 duplicates.
:emilio, could you consider increasing the bug severity?

For more information, please visit auto_nag documentation.

Flags: needinfo?(emilio)

Isn't particularly actionable.

Crash Signature: [@ hashglobe::hash_map::HashMap<T>::clear] [@ hashglobe::hash_map::HashMap<T>::clear<T>] [@ core::ptr::drop_in_place<T> | style::stylist::GenericElementAndPseudoRules<T>::clear] [@ core::ptr::drop_in_place<T> | style::stylist::CascadeData::clear] [@ Ge… → [@ hashglobe::hash_map::HashMap<T>::clear] [@ hashglobe::hash_map::HashMap<T>::clear<T>] [@ core::ptr::drop_in_place<T> | style::stylist::GenericElementAndPseudoRules<T>::clear] [@ core::ptr::drop_in_place<T> | style::stylist::CascadeData::clear] [@ G…
Flags: needinfo?(emilio)
Duplicate of this bug: 1797655

Copying crash signatures from duplicate bugs.

Crash Signature: Gecko_ReleaseAtom] [@ IPCError-browser | ShutDownKill | Gecko_ReleaseAtom] → Gecko_ReleaseAtom] [@ IPCError-browser | ShutDownKill | Gecko_ReleaseAtom] [@ servo_arc::thin_to_thick]

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 10 AArch64 and ARM crashes on release

:emilio, could you consider increasing the severity of this top-crash bug?

For more information, please visit auto_nag documentation.

Flags: needinfo?(emilio)
Keywords: topcrash

Yeah, the crashes on Android seem a bit concerning; it seems there was a massive spike earlier this month... I don't think we've had many relevant changes there on our end that could explain that. Maybe a toolchain change?

Severity: S3 → S2
Crash Signature: Gecko_ReleaseAtom] [@ IPCError-browser | ShutDownKill | Gecko_ReleaseAtom] [@ servo_arc::thin_to_thick] → Gecko_ReleaseAtom] [@ IPCError-browser | ShutDownKill | Gecko_ReleaseAtom] [@ servo_arc::thin_to_thick]
Flags: needinfo?(emilio)

Manifestation of bug 1801006 maybe?

No longer duplicate of this bug: 1797655

Per triage meeting: un-duped the [@ servo_arc::thin_to_thick] bug (bug 1797655) since it's high-volume & probably merits an investigation on its own bug.

Crash Signature: Gecko_ReleaseAtom] [@ IPCError-browser | ShutDownKill | Gecko_ReleaseAtom] [@ servo_arc::thin_to_thick] → Gecko_ReleaseAtom] [@ IPCError-browser | ShutDownKill | Gecko_ReleaseAtom]
Severity: S2 → S4
Severity: S4 → S3

Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.

For more information, please visit auto_nag documentation.

Keywords: topcrash
Duplicate of this bug: 1825180

Copying crash signatures from duplicate bugs.

Crash Signature: Gecko_ReleaseAtom] [@ IPCError-browser | ShutDownKill | Gecko_ReleaseAtom] → Gecko_ReleaseAtom] [@ IPCError-browser | ShutDownKill | Gecko_ReleaseAtom] [@ hashbrown::raw::RawTable<T>::reserve_rehash<T>]