Open Bug 1406996 (stylo-hashmap-crashes) Opened 7 years ago Updated 1 year ago

stylo: Investigate crash reports around Rust HashMaps

Categories

(Core :: CSS Parsing and Computation, defect, P2)


Tracking


Tracking Status
firefox57 + wontfix
firefox58 --- wontfix
firefox59 --- wontfix
firefox60 --- wontfix

People

(Reporter: bholley, Unassigned)

References

(Blocks 3 open bugs)

Details

(Keywords: sec-audit)

Crash Data

The investigation here has been split across various bugs, and I wanted a single location for discussion and investigation. I considered duping the various "Crash in X" bugs together and combining the signatures, but it's probably less disruptive to just file a new bug. This bug will block any crashes that we suspect are related to this issue, and depend on any diagnostics and counter-measures we've landed (some of which will eventually need to be backed out).
I made a slack channel to discuss this, so that we can have a private group chat with full history. Ping me if you want an add.
Group: core-security → layout-core-security
Keywords: sec-audit
Depends on: 1407080
Priority: -- → P2
I'll be on leave starting tomorrow, so Gankro is going to take this over, with help from dmajor.
Assignee: bobbyholley → a.beingessner
Discussed during triage, adding tracking 57+ and affected.
https://crash-stats.mozilla.com/report/index/ff8093e4-6f23-4646-a8bb-1835d0171012

HashMap Corruption (sz=8234, cap=16384, pairsz=64, cnry=0xe7e7e7e7e7e7e7e7, pos=8231, base_addr=0x23af1200000, cnry_addr=0x23af13075c8, jrnl_len=8272) 


It looks like a bucket near the end of a hashmap's buffer is thought to be occupied when it isn't.

(The pairsz alone isn't enough to determine whether it really is at the end of the buffer; some manual size calculations are still needed.)
The code in https://hg.mozilla.org/mozilla-central/annotate/bc7a5be76b72/servo/components/hashglobe/src/diagnostic.rs#l55 seems to report only the last corrupted instance (instead of the first), and its pos calculation may not be exact when there are earlier ones.

Probably worth also recording the number of instances.
Also, it seems that the canary field is not cleared when an item is removed from the hashmap. Maybe it should be cleared in that case so that we catch more corruption?
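For illustration, a minimal sketch of what that clearing might look like (the real code lives in hashglobe's diagnostic.rs; the types and names below are made up):

    use std::ptr;

    const CANARY: u64 = 0x42cafe9942cafe99;

    // Hypothetical per-entry wrapper: each stored value carries a canary that is
    // stamped on insert.
    struct CanaryEntry<V> {
        canary: u64,
        value: V,
    }

    fn stamp_canary_on_insert<V>(entry: &mut CanaryEntry<V>) {
        entry.canary = CANARY;
    }

    // The suggestion above: overwrite the canary with a volatile write when an
    // entry is removed, so a later consistency scan can tell a properly removed
    // bucket apart from one that still claims to be full, catching more kinds of
    // corruption.
    fn clear_canary_on_remove<V>(entry: &mut CanaryEntry<V>) {
        unsafe {
            ptr::write_volatile(&mut entry.canary, 0);
        }
    }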
Nick, you have crashdump access, could you get us access to the hashmap journal and contents?

(self.journal in https://hg.mozilla.org/mozilla-central/file/tip/servo/components/hashglobe/src/diagnostic.rs#l148 )

Thanks.
Ok, actually, given the bugs in the code mentioned by xidorn, there's some additional information in those numbers and it's pretty curious.

So firstly, 0xe7 is our fill pattern for the hashmap. Seeing it means we have reached a pristine empty bucket that was never full. (In fact, as xidorn mentioned, we don't clear canaries with write_volatile, so if a bucket used to be full it's likely to still have the canary.)


As xidorn noted the position value is totally wrong. We increment the position each time we see a valid full bucket, so the final value of position should be the number of valid full buckets seen.

The final value of position is 8231. The size of the hashmap is 8234. So we saw 3 broken buckets.

Now, what's interesting is that if position had equaled size, it would mean that we saw some broken empty buckets but that there were also extra full buckets, i.e. things had been misplaced.

This is not the case. There are buckets for which we've recorded a hash, *and for which we've incremented the size* (it's not just an errant write into the hashes segment), which are nowhere to be seen in the KV segment.


Which feels like we decided to insert something into the hashmap, did all the steps for it, and then ... neglected to record the KV. Again, these are pristine buckets; this isn't the result of a botched remove.

An alternate explanation is that we decided to insert something into the hashmap, did all the steps for it, and then recorded the KV in the wrong bucket -- a bucket which was already full or was later filled. This happened thrice. Given that the hashmap is 50% empty it's slightly unlikely that this would happen thrice by chance, which is also weird.

(Also, given that we don't explicitly clear canaries, I would expect the position value to be *more* than the size since it would also count now-vacant but previously-full entries)

Our diagnostic code is buggy, but the bugginess reveals some really interesting facts about this crash, though admittedly I'm unable to see where this corruption might have happened given the reduced search space from these diagnostics.

(I'll think more about this tomorrow)
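For reference, a minimal sketch of the scan logic described above (illustrative only; the real code is hashglobe's diagnostic.rs, and the names here are made up):

    const CANARY: u64 = 0x42cafe9942cafe99;
    const EMPTY_HASH: u64 = 0;

    // Walk the hash segment: for every bucket the hashes claim is full, check the
    // canary stored alongside its key/value pair. `position` counts the valid full
    // buckets, so a final value smaller than the recorded size means some buckets
    // are claimed-full in the hash segment but were never actually written in the
    // KV segment (sz=8234 and pos=8231 above, i.e. 3 broken buckets).
    fn scan(hashes: &[u64], canaries: &[u64]) -> (usize, usize) {
        let mut position = 0;
        let mut broken = 0;
        for (i, &h) in hashes.iter().enumerate() {
            if h != EMPTY_HASH {
                if canaries[i] == CANARY {
                    position += 1;
                } else {
                    broken += 1;
                }
            }
        }
        (position, broken)
    }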
https://github.com/servo/servo/pull/18861 fixes those bugs. We should be removing these diagnostics soon, but I don't think that will be for a couple of days, so we might as well have this stuff fixed in the meantime in case we need more information.
(In reply to Manish Goregaokar [:manishearth] from comment #8)
> Nick, you have crashdump access, could you get us access to the hashmap
> journal and contents?
> 
> (self.journal in
> https://hg.mozilla.org/mozilla-central/file/tip/servo/components/hashglobe/
> src/diagnostic.rs#l148 )

I'm having trouble finding this field in the crash report. There are several different things in the "Raw Dump" tab of a crash report and I've never managed to understand exactly what each one represents. The only one where I can see any sign of "HashMapJournal" is the .dmp file but that's a binary file and I don't know how to extract the data for that field.

Ted, can you assist? This is for https://crash-stats.mozilla.com/report/index/ff8093e4-6f23-4646-a8bb-1835d0171012
Flags: needinfo?(ted)
dmajor might also be able to help.
Flags: needinfo?(dmajor)
Bug 1408292 didn't help, but even accounting for that I still can't find this field.
So I went poking around. Nowadays we send annotations from child processes via shared memory, and that code has a hard 16KB limit on the total size of annotations:
https://dxr.mozilla.org/mozilla-central/rev/196dadb2fe500e75c6fbddcac78106648676cf10/ipc/glue/CrashReporterClient.h#50

The crash reason for this crash says:
jrnl_len = 8272

...so I am quite sure that the size of the annotation generated would be >16KB. (Each serialized journal entry is roughly 25-30 characters, so an 8272-entry journal comes to well over 200KB.)

Sorry, this is clearly all terrible.
Flags: needinfo?(ted)
Flags: needinfo?(dmajor)
bp-1fb7ea04-7d24-492a-9815-d03660171015 seems like a good one to investigate. It has
> HashMap Corruption (sz=160, cap=256, pairsz=24, cnry=0x42cafe9942c2fe99, count=1, last_pos=5, base_addr=0x19ac1000, cnry_addr=0x19ac18b0, jrnl_len=192)

A 192-item journal should be smaller than 16KB, I suppose.
Flags: needinfo?(ted)
bp-661bbc85-c6dd-4924-9682-f4f7f0171015 only contains 187.

Looking more closely at those crash reports with short journals, this really seems more like some kind of bit rot...

0x42cafe9942cafe99 ^ 0x42cafe9942c2fe99 = 0x0000_0000_0008_0000
0x42cafe9942cafe99 ^ 0x42caf69942cafe99 = 0x0000_0800_0000_0000
0x42cafe9942cafe99 ^ 0x42cafa9942cafe99 = 0x0000_0400_0000_0000
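A quick way to check those values (a sketch; the corrupted canaries are the ones from the reports above):

    // XOR each corrupted canary with the expected value; a popcount of 1 means the
    // difference is a single flipped bit.
    fn main() {
        const CANARY: u64 = 0x42cafe9942cafe99;
        for &bad in &[0x42cafe9942c2fe99u64, 0x42caf69942cafe99, 0x42cafa9942cafe99] {
            let diff = CANARY ^ bad;
            assert_eq!(diff.count_ones(), 1);
            println!("{:#018x} differs by one bit: {:#018x}", bad, diff);
        }
    }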
bp-a3758377-6333-4a9f-ac50-2357e0171015 is another long journal corruption with canary being 0xe7e7e7e7e7e7e7e7. It has
> HashMap Corruption (sz=9192, cap=16384, pairsz=64, cnry=0xe7e7e7e7e7e7e7e7, count=1, last_pos=3528, base_addr=0x26edfb00000, cnry_addr=0x26edfb81bc8, jrnl_len=9276)

I guess we have two problems, then. One is that bits get randomly set somehow; the other is that we read entries from positions that were never inserted.
(In reply to Xidorn Quan [:xidorn] UTC+10 from comment #15)
> bp-1fb7ea04-7d24-492a-9815-d03660171015 seems like a good one to
> investigate. It has
> > HashMap Corruption (sz=160, cap=256, pairsz=24, cnry=0x42cafe9942c2fe99, count=1, last_pos=5, base_addr=0x19ac1000, cnry_addr=0x19ac18b0, jrnl_len=192)
> 
> A 192 items journal should be smaller than 16KB I suppose.

This one doesn't have a journal.

(In reply to Xidorn Quan [:xidorn] UTC+10 from comment #16)
> bp-661bbc85-c6dd-4924-9682-f4f7f0171015 only contains 187.

This one does:

[Insert(9223372040800430640), Insert(9223372037583450023), Insert(9223372037331258603), Insert(9223372037363030801), Insert(9223372039430138675), Insert(9223372040348347031), Insert(9223372040481347453), Insert(9223372039448211261), Insert(9223372039254182491), Insert(9223372038421190002), Insert(9223372039917800308), Insert(9223372037988661486), Insert(9223372039026795514), Insert(9223372040381933642), Insert(9223372038485486634), Insert(9223372037179874635), Insert(9223372037115169326), Insert(9223372038230214386), Insert(9223372038617724901), Insert(9223372039244466658), Insert(9223372037699397048), Insert(9223372040512701811), Insert(9223372038179067507), Insert(9223372037327718718), Insert(9223372040249027737), Insert(9223372038431253912), Insert(9223372040782608028), Insert(9223372039045589349), Insert(9223372040787469832), Insert(9223372041108375103), Insert(9223372037683846781), Insert(9223372040752179429), Insert(9223372037660816815), Insert(9223372038597259377), Insert(9223372040732592612), Insert(9223372039972984854), Insert(9223372038628606895), Insert(9223372039484361949), Insert(9223372037842923432), Insert(9223372037811276715), Insert(9223372040369610561), Insert(9223372037036415018), Insert(9223372040467028517), Insert(9223372038679530105), Insert(9223372040748744789), Insert(9223372038558818406), Insert(9223372040160249925), Insert(9223372038410520761), Insert(9223372040049239808), Insert(9223372037691020926), Insert(9223372038446144062), Insert(9223372039746359810), Insert(9223372036866689104), Insert(9223372040054217968), Insert(9223372039025581704), Insert(9223372039092085797), Insert(9223372039085554332), Insert(9223372038676785515), Insert(9223372038379887956), Insert(9223372040896862216), Insert(9223372040890396317), Insert(9223372039540879453), Insert(9223372037001352116), Insert(9223372037606201184), Insert(9223372038108338082), Insert(9223372039503594112), Insert(9223372038010538333), Insert(9223372037117965481), Insert(9223372039095246646), Insert(9223372039888935450), Insert(9223372038865488586), Insert(9223372040487938133), Insert(9223372040853080038), Insert(9223372039188623770), Insert(9223372037882431777), Insert(9223372037532955795), Insert(9223372039352260868), Insert(9223372040498929429), Insert(9223372037922721777), Insert(9223372041048803252), Insert(9223372037045106544), Insert(9223372037809608083), Insert(9223372036885826728), Insert(9223372037975487061), Insert(9223372039578413606), Insert(9223372039028516597), Insert(9223372039638085596), Insert(9223372039042726425), Insert(9223372038634330380), Insert(9223372039578606854), Insert(9223372040598030232), Insert(9223372040655232645), Insert(9223372040178087295), Insert(9223372038949753120), Insert(9223372038498770849), Insert(9223372038023184547), Insert(9223372040937882656), Insert(9223372039915704403), Insert(9223372038359325460), Insert(9223372038865354670), Insert(9223372037834900915), Insert(9223372040559428669), Insert(9223372039204107924), Insert(9223372038040175412), Insert(9223372038378653066), Insert(9223372039514863269), Insert(9223372040846253689), Insert(9223372038041699682), Insert(9223372040169881236), Insert(9223372040614914203), Insert(9223372041032242639), Insert(9223372037732994521), Insert(9223372037437193805), Insert(9223372039953608731), Insert(9223372039507312984), Insert(9223372039219619240), Insert(9223372040485013175), Insert(9223372038399485790), Insert(9223372040916113567), Insert(9223372040410067134), Insert(9223372039215482378), Insert(9223372038615139304), 
Insert(9223372037237889627), Insert(9223372036919918032), Insert(9223372038511803821), Insert(9223372039887814695), Insert(9223372038635861748), Insert(9223372040302565961), Insert(9223372039176808310), Insert(9223372037582161228), Insert(9223372038736424646), Insert(9223372038444272149), Insert(9223372038958762995), Insert(9223372040255352284), Insert(9223372037940704922), Insert(9223372038015972846), Insert(9223372039009185764), Insert(9223372038432151448), Insert(9223372039244466658), Insert(9223372037683846781), Insert(9223372037811276715), Insert(9223372037036415018), Insert(9223372038558818406), Insert(9223372040916113567), Remove(9223372038617724901), Insert(9223372040916113567), Insert(9223372040916113567), Insert(9223372039162663787), Insert(9223372039204107924), Insert(9223372038399485790), Insert(9223372038617724901), Insert(9223372039244466658), Insert(9223372037683846781), Insert(9223372037811276715), Insert(9223372040916113567), Insert(9223372040776354999), Insert(9223372040570174603), Insert(9223372039244466658), Insert(9223372037683846781), Insert(9223372037811276715), Insert(9223372040916113567), Insert(9223372040776354999), Remove(9223372038617724901), Insert(9223372038205317300), Insert(9223372038205317300), Insert(9223372037485976815), Insert(9223372038108918288), Insert(9223372040855021246), Insert(9223372038250087590), Insert(9223372037687809427), Insert(9223372037766414881), Insert(9223372038740877830), Insert(9223372038668605532), Insert(9223372039675084955), Insert(9223372038796126079), Insert(9223372039535142435), Insert(9223372038459810812), Insert(9223372039939177951), Insert(9223372038755424106), Insert(9223372038804570148), Insert(9223372040354952311), Insert(9223372039105907127), Insert(9223372038298388083), Insert(9223372038027391260), Insert(9223372039120322629), Insert(9223372037450597366)]
Another observation: it seems that the majority of the corruption reports in the last week happen in either begin_mutation or drop; there are two cases in end_mutation. The two cases are both on old version without the incorrect pos data, but they have identical bad canary value, which seems to be bit rot.

This makes me think that it is probably unrelated to the hashmap itself; rather, something outside is overflowing into the hashmap's contents and changing some bits.
(In reply to Mike Hommey [:glandium] from comment #18)
> (In reply to Xidorn Quan [:xidorn] UTC+10 from comment #16)
> > bp-661bbc85-c6dd-4924-9682-f4f7f0171015 only contains 187.
> 
> This one does:

Thanks.
Flags: needinfo?(ted)
(In reply to Xidorn Quan [:xidorn] UTC+10 from comment #19)
> The two cases are both on old version without the incorrect pos data,
> but they have identical bad canary value, which seems to be bit rot.

I mean, without the correct pos data.
(In reply to Xidorn Quan [:xidorn] UTC+10 (less responsive Nov 5 ~ Dec 16) from comment #19)
> This makes me think that it is probably unrelated to hashmap, but something
> outside is overflowing into hashmap content and changing some bits inside.

I was able to reproduce bug 1385925 once. I can't repro again but here are a few unique details:

1. I was running a Mac Release Build built on this changeset

changeset:   383923:15f221f491f7
tag:         tip
fxtree:      central
user:        ffxbld
date:        Mon Oct 02 10:46:46 2017 -0700

2. The build crashed seconds after startup, while running only a single test file.

3. I had local changes that I know to corrupt memory (I was doing some non-CSS ImgLib hackery to diagnose a bug)

4. Running the same steps again on the same local build wouldn't crash.

The bug I was chasing down is also memory corruption, but it shows up as rendering glitches (bug 1404366). The fact that I was running only that test file when I crashed in the Stylo hashmap is another indicator that external forces may be at play here.

We've had frequent crashes in RuleHash (e.g., bug 719116 and bug 792626) in the C++ style system, but I'm not sure if the correlations are similar to the Stylo crashes we're seeing here.
> I guess we have two problems, then. One is the bits are randomly set somehow, the other is that we read things from uninserted position.

The hashmap contains an array of hashes, and non-zero hashes indicate a valid entry in the key-value array. If a hash bitrotted, then iteration would identify it as a valid entry and access it.
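A minimal sketch of that layout and the failure mode (illustrative only; not the actual std/hashglobe code):

    use std::mem::MaybeUninit;

    const EMPTY_HASH: u64 = 0;

    // One allocation, conceptually split into a hash segment and a key-value
    // segment. A hash of 0 means "empty"; anything non-zero marks the matching
    // pair slot as live.
    struct RawTableSketch<K, V> {
        hashes: Vec<u64>,
        pairs: Vec<MaybeUninit<(K, V)>>,
    }

    impl<K, V> RawTableSketch<K, V> {
        // Iteration (and Drop) trusts the hash segment: if a bit flip turns an
        // EMPTY_HASH into a non-zero value, this hands out a reference to an
        // uninitialized (or stale) pair, which is the kind of access the crash
        // reports show.
        fn iter(&self) -> impl Iterator<Item = &(K, V)> + '_ {
            self.hashes
                .iter()
                .zip(self.pairs.iter())
                .filter(|(h, _)| **h != EMPTY_HASH)
                .map(|(_, p)| unsafe { p.assume_init_ref() })
        }
    }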
Yeah, we have a bunch of crashes with a canary one bit off from cafe99. I feel like those are bitrot coming in from elsewhere (unsure if related to stylo code).

It's the crashes with canary e7e7 that are troubling; but those are all on huge hashmaps.
(In reply to Xidorn Quan [:xidorn] UTC+10 (less responsive Nov 5 ~ Dec 16) from comment #19)
> This makes me think that it is probably unrelated to hashmap, but something
> outside is overflowing into hashmap content and changing some bits inside.

Since we're already doing a bunch of custom allocation stuff for hashmaps (or we were?  I may have lost track of what the current state is here), can we ensure that we allocate and page-protect the pages before and after the hashmap to try and catch whatever's doing this?
> can we ensure that we allocate and page-protect the pages before and after the hashmap to try and catch whatever's doing this?

I was under the impression that we already did that and this only revealed a bug in spidermonkey
(In reply to Manish Goregaokar [:manishearth] from comment #26)
> > can we ensure that we allocate and page-protect the pages before and after the hashmap to try and catch whatever's doing this?
> 
> I was under the impression that we already did that and this only revealed a
> bug in spidermonkey

Yeah, this was tried before indeed.
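For context, a rough sketch of the guard-page approach being discussed, using the libc crate on Unix (an illustration of the idea, not the diagnostic code that was actually landed and later backed out):

    // Allocate `len` usable bytes with an inaccessible page on each side, so any
    // out-of-bounds write from neighbouring memory faults immediately instead of
    // silently corrupting the buffer.
    unsafe fn alloc_with_guard_pages(len: usize) -> *mut u8 {
        let page = 4096usize;
        let padded = (len + page - 1) / page * page;
        let base = libc::mmap(
            std::ptr::null_mut(),
            padded + 2 * page,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE | libc::MAP_ANON,
            -1,
            0,
        );
        assert!(base != libc::MAP_FAILED);
        // Protect the first and last page; the caller gets the region in between.
        libc::mprotect(base, page, libc::PROT_NONE);
        libc::mprotect((base as *mut u8).add(page + padded) as *mut _, page, libc::PROT_NONE);
        (base as *mut u8).add(page)
    }

One caveat noted later in this bug: every allocation gets rounded up to a page, which changes allocation patterns and can perturb the crashes being chased.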
(In reply to Manish Goregaokar [:manishearth] from comment #24)
> It's the crashes with canary e7e7 that are troubling; but those are all on
> huge hashmaps.

Can we include the hash of the object which we're adding to the hashmap, so we can see if it's just an EMPTY_HASH which has bitrotted to appear full?

I'm inclined to think it isn't, as the information manish was showing me suggested that the length was also wrong.
I spent the morning running the input in https://bugzilla.mozilla.org/show_bug.cgi?id=1406996#c18 a bunch of times, messing around with retaining/freeing multiple instances, with/without asan/optimizations, and so far it's holding rock solid.
Pasting this comment from @manishearth from earlier today re these crashes, as it's significant:

https://crash-stats.mozilla.com/report/index/066fe396-818a-4d41-8179-3eb2e0171023
HashMap Corruption (sz=8132, buffer_hash_sz=8133, cap=16384, pairsz=64, hash=0x80000000, cnry=0xe7e7e7e7e7e7e7e7, count=1, last_pos=7592, base_addr=0x22e4dd00000, cnry_addr=0x22e4de0e908, jrnl_len=8144)

HashMap Corruption (sz=2026, buffer_hash_sz=2056, cap=4096, pairsz=64, hash=0x31, cnry=0xe7e7e7e7e7e7e7e7, count=30, last_pos=1046, base_addr=0x18a32000, cnry_addr=0x18a59e48, jrnl_len=2547)

HashMap Corruption (sz=7703, buffer_hash_sz=7703, cap=16384, pairsz=48, hash=0x80000000aa1bd9a8, cnry=0x42cafe9942cafe22, count=4, last_pos=3106, base_addr=0x1ab02000, cnry_addr=0x1ab6ef88, jrnl_len=25284)

for the e7 canaries, buffer_hash_sz > sz, so it's pretty clear that it's the hashes getting written to

In the original one we had a case where we actually had sz = buffer_hash_sz (we didn't report that number, but it was inferrable because of the bug in the reporting), which was concerning; that might have been a fluke, though. For the cafe canaries we get sz = buffer_hash_sz, so this is pretty clearly random writes to the buffer, I think.

Also note that the hash is 0x80000000, i.e. one bit flip. The other one is 0x31, so not one bit flip; however, that buffer has 30 broken hashes, so it might just be a case where we had multiple flips.
I normally think of cosmic-ray-type incidents when someone says "bit flip" - some sort of HW error, which happens on non-ECC memory (occasionally).  However, it tends not to happen in the same spots.

Here we have repeated crashes in code related to the hashmaps, it appears, and perhaps by "bit flips" you mean write-after-free or wild writes.  It's possible, though (very) unusual, for such writes to be single bit modifications (true bit flips).  And why would such writes (flips or not) keep targeting these hashmaps?

Since I don't know the specific code here, I may be missing something.  And 30 "flips" in a buffer doesn't sound like HW either.  Again, not impossible, but darn unlikely.
(In reply to Jet Villegas (:jet) from comment #30)
> Pasting this comment from @manishearth from earlier today re these crashes,
> as it's significant:
> 
> https://crash-stats.mozilla.com/report/index/066fe396-818a-4d41-8179-
> 3eb2e0171023
> HashMap Corruption (sz=8132, buffer_hash_sz=8133, cap=16384, pairsz=64,
> hash=0x80000000, cnry=0xe7e7e7e7e7e7e7e7, count=1, last_pos=7592,
> base_addr=0x22e4dd00000, cnry_addr=0x22e4de0e908, jrnl_len=8144)

This is likely a bit flip...

> HashMap Corruption (sz=2026, buffer_hash_sz=2056, cap=4096, pairsz=64,
> hash=0x31, cnry=0xe7e7e7e7e7e7e7e7, count=30, last_pos=1046,
> base_addr=0x18a32000, cnry_addr=0x18a59e48, jrnl_len=2547)

but this is clearly not. There must be something wrong with this.

And the crash report for this is bp-e6608850-2880-4fcb-a594-aaa300171023.

Also, bp-942da53e-2de3-4cf2-9f94-253620171023 is interesting as well. It has
HashMap Corruption (sz=12074, buffer_hash_sz=12074, cap=16384, pairsz=24, hash=0xa2e52d93, cnry=0x42c2fe99, count=1, last_pos=8649, base_addr=0xd759000, cnry_addr=0xd7ad614, jrnl_len=12344)

So half of the canary is cleared, while the other half contains a bit flip.
Over at https://bugzilla.mozilla.org/show_bug.cgi?id=1404860#c23, nbp
speculates that there might be a problem with mozjemalloc meta-data
management.  And I wonder if that's what we're seeing here.  For sure
mozjemalloc does twiddle individual bits in metadata management:

memory/build/mozjemalloc.cpp, arena_run_reg_alloc():

	mask = run->regs_mask[i];
	[..]
		/* Clear bit. */
		mask ^= (1U << bit);
		run->regs_mask[i] = mask;

arena_run_reg_dalloc():

	/* Set bit (mark the region as available again). */
	run->regs_mask[elm] |= (1U << bit);
But "external" modifications of the buffer have been eliminated with mprotect, right?
(In reply to Mike Hommey [:glandium] from comment #34)
> But "external" modifications of the buffer have been eliminated with
> mprotect, right?

Right.  But that doesn't rule out "internal" modifications.

Given (1) the apparent randomness of failures and the absence of any
reproducible test case, (2) the failure of ASan and the mprotect scheme to
find any smoking gun, and (3) the relative doubtfulness of the
hardware-problem hypothesis, I have been wondering for some weeks now
whether this is a data race.  It bears the signs of a race.

Has that possibility been considered in detail?  Do we have any evidence
for/against that?

In particular, some time back in the summer, mozjemalloc was modified so as
to have one-arena-per-stylo thread (I can't find the bug number now), so the
level of concurrency there has gone up a bit.  And so I am wondering if
there might be a race or other concurrency bug inside mozjemalloc itself.

I am half inclined to saw it off and write a multithreaded stress test case
for it, to see what happens.  Since that would make it into a simple
standalone program, testing it with a race detector (Helgrind?) is also a
possibility.
(In reply to Julian Seward [:jseward] from comment #35)
> (In reply to Mike Hommey [:glandium] from comment #34)
> > But "external" modifications of the buffer have been eliminated with
> > mprotect, right?
> 
> Right.  But that doesn't rule out "internal" modifications.

The only internal modifications that are not ruled out are those that happen when the pages are writable. Mozjemalloc doesn't make them writable on its own. So the only way it could be writing to mprotected pages is if it was somehow racing with another thread that made the pages writable. Which means there are also plenty of cases where it would lose the race and write to the pages while mprotected, and we should have caught that.
(In reply to Mike Hommey [:glandium] from comment #34)
> But "external" modifications of the buffer have been eliminated with
> mprotect, right?

To be clear, the mprotect diagnostics were only on nightly for a few days, and backed out some time ago. They caused some intermittent failures on CI and so there was pressure to back them out quickly, but I kind of wish we'd let them ride for longer. But the key point here is that they are _not_ active now.

The tricky thing about them was that all allocations get rounded up to page size, which could impact crashes, especially if we believe it's mozjemalloc doing the bit flipping. We could try again, but before doing that I think we should do the diagnostics in bug 1405525 comment 6, since we _did_ get one suspicious stack from those diagnostics.

(In reply to Julian Seward [:jseward] from comment #35)
> (In reply to Mike Hommey [:glandium] from comment #34)
> > But "external" modifications of the buffer have been eliminated with
> > mprotect, right?
> 
> Right.  But that doesn't rule out "internal" modifications.
> 
> Given (1) the apparent randomness of failures and the absence of any
> reproducible test case, (2) the failure of ASan and the mprotect scheme to
> find any smoking gun, and (3) the relative doubtfulness of the
> hardware-problem hypothesis, I have been wondering for some weeks now
> whether this is a data race.  It bears the signs of a race.
> 
> Has that possibility been considered in detail?  Do we have any evidence
> for/against that?
> 
> In particular, some time back in the summer, mozjemalloc was modified so as
> to have one-arena-per-stylo thread (I can't find the bug number now), so the
> level of concurrency there has gone up a bit.  And so I am wondering if
> there might be a race or other concurrency bug inside mozjemalloc itself.

These data structures are only mutated when the rayon pool is idle. It's not clear to me where the race would come from.
And to emphasize, I really think the crash in bug 1405525 is worth following up on. If it is the case that the off-main-thread spidermonkey GC is double-freeing strings, that would explain quite a few of the puzzle pieces here (especially the part where, IIUC, the corruption happens both inside and outside of {begin,end}_mutation pairs).
(In reply to Bobby Holley (parental leave - send mail for anything urgent) from comment #37)
> These data structures are only mutated when the rayon pool is idle. It's not
> clear to me where the race would come from.

As I understood it, this would be a race on mozjemalloc while allocating other stuff in parallel, like the stylo rule nodes, structs, and such. We do perform allocations off-main-thread, just not the hashmap bucket ones.
(In reply to Emilio Cobos Álvarez [:emilio] from comment #39)
> (In reply to Bobby Holley (parental leave - send mail for anything urgent)
> from comment #37)
> > These data structures are only mutated when the rayon pool is idle. It's not
> > clear to me where the race would come from.
> 
> As I understood it, this would be a race on mozjemalloc while allocating
> other stuff in parallel, like the stylo rule nodes, structs, and such. We do
> perform allocations off-main-thread, just not the hashmap bucket ones.

If that were the case, we wouldn't see crashes in end_mutation (which per comment 19, we may be seeing, though it's hard to say).

I'm really starting to suspect the GC here. It would also explain why the largest hashmap crash volume is in hashmap teardown, since that's exactly when the GC would be going on a freeing spree.
(In reply to Bobby Holley (parental leave - send mail for anything urgent) from comment #38)
> If it is the case that the off-main-thread spidermonkey GC is
> double-freeing strings, that would explain quite a few of the puzzle pieces

That would seem to imply that mozjemalloc doesn't detect and reject double
frees, instead silently corrupting the heap (or its metadata).  Is that 
indeed the case?
Flags: needinfo?(mh+mozilla)
It actively rejects double frees, via MOZ_DIAGNOSTIC_ASSERTs.

For allocations < ~1MB and >= 4KB:
https://dxr.mozilla.org/mozilla-central/source/memory/build/mozjemalloc.cpp#3690

For allocations < 4KB:
https://dxr.mozilla.org/mozilla-central/source/memory/build/mozjemalloc.cpp#2453

For allocations >= ~1MB:
https://dxr.mozilla.org/mozilla-central/source/memory/build/mozjemalloc.cpp#4196
Note that while the assert is a MOZ_ASSERT, and thus debug-only, the huge.Remove below it will actively crash with a null deref if we're in the presence of a double free.
Flags: needinfo?(mh+mozilla)
(In reply to Mike Hommey [:glandium] from comment #42)
> It actively rejects double frees, via MOZ_DIAGNOSTIC_ASSERTs.

Given that (and, confirmed via irc chat with Mike that mozjemalloc
should detect *all* double frees of mozjemalloc-issued pointers),
I don't see how this can be consistent with the theory that the
corruption is caused by spidermonkey double-frees.
(In reply to Julian Seward [:jseward] from comment #35)
[regarding mozjemalloc]
> I am half inclined to saw it off and write a multithreaded stress test case
> for it, to see what happens.  Since that would make it into a simple
> standalone program, testing it with a race detector (Helgrind?) is also a
> possibility.

I did that.  Stress test case (20 threads, per-thread arenas, several
million allocations of varying sizes) didn't find anything.  Helgrind
reported three races, one of which Mike already fixed, and the other two of
which are (per discussion with Mike) harmless initialisation races.

So, nothing to see here, it seems.
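For the record, a rough sketch of the shape of such a stress test (the real one was a standalone program hammering mozjemalloc with per-thread arenas; this Rust version only illustrates the thread/allocation pattern, not the actual harness):

    use std::thread;

    fn main() {
        let handles: Vec<_> = (0..20u64)
            .map(|t| {
                thread::spawn(move || {
                    let mut keep: Vec<Vec<u8>> = Vec::new();
                    let mut x: u64 = 0x9e3779b97f4a7c15 ^ t;
                    for i in 0..1_000_000u64 {
                        // xorshift64 for cheap pseudo-random allocation sizes
                        x ^= x << 13;
                        x ^= x >> 7;
                        x ^= x << 17;
                        keep.push(vec![0u8; (x % 4096 + 1) as usize]);
                        if i % 64 == 0 {
                            keep.clear(); // free in batches to exercise the dealloc paths
                        }
                    }
                })
            })
            .collect();
        for h in handles {
            h.join().unwrap();
        }
    }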
(In reply to Julian Seward [:jseward] from comment #44)
> (In reply to Julian Seward [:jseward] from comment #35)
> [regarding mozjemalloc]
> > I am half inclined to saw it off and write a multithreaded stress test case
> > for it, to see what happens.  Since that would make it into a simple
> > standalone program, testing it with a race detector (Helgrind?) is also a
> > possibility.
> 
> I did that.  Stress test case (20 threads, per-thread arenas, several
> million allocations of varying sizes) didn't find anything.  Helgrind
> reported three races, one of which Mike already fixed, and the other two of
> which are (per discussion with Mike) harmless initialisation races.

Not that it matters much, but they should be fixed too, now.
Blocks: 1417057
Can we capture any results from the investigations here? I heard considerable detail in person that I don't see in Bugzilla. Thanks
Flags: needinfo?(manishearth)
Rust hashmaps are stored as a single long buffer, with one segment for the hashes (which record key occupancy) and one for the key-value pairs.

Our investigation basically pointed to there being corruption of this buffer. Sometimes a key occupancy cell would be corrupted, leading to the hashmap thinking there's an entry where there isn't (which we could verify via canaries); and sometimes a value would be corrupted. These would generally be single bit flips (IIRC).

We had at one point mprotected things; so it didn't seem to be corruption coming in from elsewhere (however we could have just been unlucky for the duration of the mprotect).

:dbaron pointed out that we've had such corruption of Gecko hashmaps in the past, though less frequently. Our current hypothesis is that this is just random memory corruption happening on certain systems (e.g. old RAM, or RAM with mismatched timings), which happens to hit hashmaps in particular because they are big buffers that are easy to corrupt into visible crashes by flipping one bit (if you flip any bit in the key occupancy area, the hashmap will probably crash in its destructor).
Flags: needinfo?(manishearth)
Assignee: a.beingessner → manishearth
Note that I haven't analyzed the relative frequency of these crashes.
(In reply to Emilio Cobos Álvarez [:emilio] from comment #48)
> I switched nightly to use ordermap, and there are still crashes:

Very useful to know! This is basically the only outcome of this experiment that we would have been able to draw conclusions from. :-)

(In reply to Emilio Cobos Álvarez [:emilio] from comment #49)
> Note that I haven't analyzed the relative frequency of these crashes.

I don't think that would necessarily tell us much, since OrderMap may well be less sensitive to this kind of bit clobbering than std::HashMap.
(In reply to Manish Goregaokar [:manishearth] from comment #47)
> :dbaron pointed out that we've had such corruption of Gecko hashmaps in the
> past; though less frequent. Our current hypothesis is that this is just
> random memory corruption happening on certain systems (e.g. if you have old

If we're still getting a steady stream of these crashes then we might be able to test this hypothesis now. bug 1489297 added an annotation to crash reports to indicate whether the system uses ECC memory.

If you look at the Metadata tab on a crash report from my Thinkstation:
https://crash-stats.mozilla.com/report/index/9f557855-08e6-4353-82b6-03cf40181024#tab-metadata

You can see " MemoryErrorCorrection 	Multi-bit ECC".
A single crash with something other than None or blank in the last 6 months:
https://crash-stats.mozilla.com/report/index/cfb5f727-1535-41a1-96ad-ebe340181106
Interesting! I grabbed that dump and ran it through get-minidump-instructions and the faulting instruction is the lock sub here:
   7ffd243a528c:	48 8b 07             	mov rax, qword ptr [rdi]
   7ffd243a528f:	f0 48 81 28 01 00 00 	lock sub qword ptr [rax], 1
   7ffd243a5296:	00                   	

From the "raw dump tab" we have: "rax": "0x000201ba483e6e40", which is a non-canonical pointer (hence why the crash address comes out as 0xffffffffffffffff).

More interesting: looking at some of the other registers, it does look like a single bit flip in a pointer value:
"r9":  "0x000001ba3f001100",
"rax": "0x000201ba483e6e40",
"rbp": "0x000001ba3bfb0700",

The pointer in rax came from rdi, which points to the stack: "rdi": "0x0000000ab9ff6d58",

so we actually have that stack memory in the minidump and it has the bogus pointer value:
0000000ab9ff6d58 000201ba483e6e40
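Assuming the intended value shared the 0x000001ba prefix of the neighbouring pointers (r9/rbp), the corruption is exactly one flipped bit (bit 49):

    fn main() {
        let bad: u64 = 0x000201ba483e6e40;  // rax from the raw dump
        let good: u64 = 0x000001ba483e6e40; // assumed intended value
        let diff = bad ^ good;
        assert_eq!(diff, 1u64 << 49);
        assert_eq!(diff.count_ones(), 1);
    }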

Hi, is this still something you are looking into? Are any of the recent duplicates useful?

Flags: needinfo?(manishearth)

Our conclusion here was basically "some people have bad RAM and the Rust hashmap implementation is better at crashing when some bits get flipped" -- there were crashes like this before Stylo on the C++ hashmaps, at a lower frequency. The mprotect stuff ensured that it wasn't other code stomping on our buffers, and other things verified that it wasn't our code either.

Flags: needinfo?(manishearth)

We should probably unhide this bug at this point.

Crash Signature: [@ hashglobe::hash_map::HashMap<T>::clear] [@ hashglobe::hash_map::HashMap<T>::clear<T>]
Crash Signature: [@ hashglobe::hash_map::HashMap<T>::clear] [@ hashglobe::hash_map::HashMap<T>::clear<T>] → [@ hashglobe::hash_map::HashMap<T>::clear] [@ hashglobe::hash_map::HashMap<T>::clear<T>] [@ core::ptr::drop_in_place<T> | style::stylist::GenericElementAndPseudoRules<T>::clear]

The bug assignee didn't log in to Bugzilla in the last 7 months.
:emilio, could you have a look please?
For more information, please visit auto_nag documentation.

Assignee: manishearth → nobody
Flags: needinfo?(emilio)
Crash Signature: [@ hashglobe::hash_map::HashMap<T>::clear] [@ hashglobe::hash_map::HashMap<T>::clear<T>] [@ core::ptr::drop_in_place<T> | style::stylist::GenericElementAndPseudoRules<T>::clear] → [@ hashglobe::hash_map::HashMap<T>::clear] [@ hashglobe::hash_map::HashMap<T>::clear<T>] [@ core::ptr::drop_in_place<T> | style::stylist::GenericElementAndPseudoRules<T>::clear]
Flags: needinfo?(emilio)

Copying crash signatures from duplicate bugs.

Crash Signature: [@ hashglobe::hash_map::HashMap<T>::clear] [@ hashglobe::hash_map::HashMap<T>::clear<T>] [@ core::ptr::drop_in_place<T> | style::stylist::GenericElementAndPseudoRules<T>::clear] → [@ hashglobe::hash_map::HashMap<T>::clear] [@ hashglobe::hash_map::HashMap<T>::clear<T>] [@ core::ptr::drop_in_place<T> | style::stylist::GenericElementAndPseudoRules<T>::clear] [@ core::ptr::drop_in_place<T> | style::stylist::CascadeData::clear]
Crash Signature: [@ hashglobe::hash_map::HashMap<T>::clear] [@ hashglobe::hash_map::HashMap<T>::clear<T>] [@ core::ptr::drop_in_place<T> | style::stylist::GenericElementAndPseudoRules<T>::clear] [@ core::ptr::drop_in_place<T> | style::stylist::CascadeData::clear] → [@ hashglobe::hash_map::HashMap<T>::clear] [@ hashglobe::hash_map::HashMap<T>::clear<T>] [@ core::ptr::drop_in_place<T> | style::stylist::GenericElementAndPseudoRules<T>::clear] [@ core::ptr::drop_in_place<T> | style::stylist::CascadeData::clear]
Group: layout-core-security

Copying crash signatures from duplicate bugs.

Crash Signature: [@ hashglobe::hash_map::HashMap<T>::clear] [@ hashglobe::hash_map::HashMap<T>::clear<T>] [@ core::ptr::drop_in_place<T> | style::stylist::GenericElementAndPseudoRules<T>::clear] [@ core::ptr::drop_in_place<T> | style::stylist::CascadeData::clear] → [@ hashglobe::hash_map::HashMap<T>::clear] [@ hashglobe::hash_map::HashMap<T>::clear<T>] [@ core::ptr::drop_in_place<T> | style::stylist::GenericElementAndPseudoRules<T>::clear] [@ core::ptr::drop_in_place<T> | style::stylist::CascadeData::clear] [@ Ge…
Severity: normal → S3

The severity field for this bug is relatively low, S3. However, the bug has 9 duplicates.
:emilio, could you consider increasing the bug severity?

For more information, please visit auto_nag documentation.

Flags: needinfo?(emilio)

Isn't particularly actionable.

Crash Signature: [@ hashglobe::hash_map::HashMap<T>::clear] [@ hashglobe::hash_map::HashMap<T>::clear<T>] [@ core::ptr::drop_in_place<T> | style::stylist::GenericElementAndPseudoRules<T>::clear] [@ core::ptr::drop_in_place<T> | style::stylist::CascadeData::clear] [@ Ge… → [@ hashglobe::hash_map::HashMap<T>::clear] [@ hashglobe::hash_map::HashMap<T>::clear<T>] [@ core::ptr::drop_in_place<T> | style::stylist::GenericElementAndPseudoRules<T>::clear] [@ core::ptr::drop_in_place<T> | style::stylist::CascadeData::clear] [@ G…
Flags: needinfo?(emilio)
Duplicate of this bug: 1797655

Copying crash signatures from duplicate bugs.

Crash Signature: Gecko_ReleaseAtom] [@ IPCError-browser | ShutDownKill | Gecko_ReleaseAtom] → Gecko_ReleaseAtom] [@ IPCError-browser | ShutDownKill | Gecko_ReleaseAtom] [@ servo_arc::thin_to_thick]

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 10 AArch64 and ARM crashes on release

:emilio, could you consider increasing the severity of this top-crash bug?

For more information, please visit auto_nag documentation.

Flags: needinfo?(emilio)
Keywords: topcrash

Yeah, the crashes on Android seem a bit concerning; it seems there was a massive spike earlier this month... I don't think we've had many relevant changes there on our end that could explain that. Maybe a toolchain change?

Severity: S3 → S2
Crash Signature: Gecko_ReleaseAtom] [@ IPCError-browser | ShutDownKill | Gecko_ReleaseAtom] [@ servo_arc::thin_to_thick] → Gecko_ReleaseAtom] [@ IPCError-browser | ShutDownKill | Gecko_ReleaseAtom] [@ servo_arc::thin_to_thick]
Flags: needinfo?(emilio)

Manifestation of bug 1801006 maybe?

No longer duplicate of this bug: 1797655

Per triage meeting: un-duped the [@ servo_arc::thin_to_thick] bug (bug 1797655) since it's high-volume & probably merits an investigation on its own bug.

Crash Signature: Gecko_ReleaseAtom] [@ IPCError-browser | ShutDownKill | Gecko_ReleaseAtom] [@ servo_arc::thin_to_thick] → Gecko_ReleaseAtom] [@ IPCError-browser | ShutDownKill | Gecko_ReleaseAtom]
Severity: S2 → S4
Severity: S4 → S3

Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.

For more information, please visit auto_nag documentation.

Keywords: topcrash
Duplicate of this bug: 1825180

Copying crash signatures from duplicate bugs.

Crash Signature: Gecko_ReleaseAtom] [@ IPCError-browser | ShutDownKill | Gecko_ReleaseAtom] → Gecko_ReleaseAtom] [@ IPCError-browser | ShutDownKill | Gecko_ReleaseAtom] [@ hashbrown::raw::RawTable<T>::reserve_rehash<T>]