Closed Bug 1333000 Opened 7 years ago Closed 6 years ago

Add more release asserts to type sweeping code

Categories

(Core :: JavaScript Engine, defect, P3)

Tracking

RESOLVED FIXED

People

(Reporter: jandem, Assigned: jandem)

Details

(Keywords: triage-deferred, Whiteboard: [tbird crash])

Attachments

(6 files)

The release asserts I added in bug 1332601 didn't catch anything and there are still crashes on Nightly, so it's time for a more nuclear option.
This adds a magic value to each TypeConstraint and ConstraintTypeSet. Then when we use these constraints and type sets, we MOZ_RELEASE_ASSERT the magic word is in place.

I'm pretty sure we will get crashes from this, if it's really memory corruption. As a first step I just want to see *where* we crash exactly.

I think this should have no measurable overhead, and we only do it #ifdef JS_CRASH_DIAGNOSTICS.
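
A minimal sketch of the idea, with an illustrative class name and magic constant (the real patch does this inside js::ConstraintTypeSet and js::TypeConstraint):

#include <stdint.h>
#include "mozilla/Assertions.h"

// Illustrative only: stamp a known word into the object at construction and
// release-assert it is still intact whenever the object is used, so corruption
// of the object (or a bogus pointer to it) crashes at a known place.
#ifdef JS_CRASH_DIAGNOSTICS
static const uint64_t ConstraintTypeSetMagicSketch = 0xa1a2b3b4c5c6d7daULL;
#endif

class ConstraintTypeSetSketch {
#ifdef JS_CRASH_DIAGNOSTICS
    uint64_t magic_;
#endif

  public:
    ConstraintTypeSetSketch() {
#ifdef JS_CRASH_DIAGNOSTICS
        magic_ = ConstraintTypeSetMagicSketch;
#endif
    }

    void checkMagic() const {
#ifdef JS_CRASH_DIAGNOSTICS
        MOZ_RELEASE_ASSERT(magic_ == ConstraintTypeSetMagicSketch);
#endif
    }

    void sweep() {
        checkMagic();  // crash here if this set was corrupted or the pointer is garbage
        // ... actual sweeping of constraints and type sets ...
    }
};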
Attachment #8829345 - Flags: review?(jcoppeard)
There are a bunch of other things to try after this. For instance, we could (temporarily) split the typeLifoAlloc into multiple separate LifoAllocs, one per allocation kind (TypeConstraint, TypeSet, TypeHashSet, etc.), and then see if/how that affects these crashes.

I'm just going to work on this incrementally. There are enough crashes on Nightly that we should get useful results within a few days after landing a patch.
Comment on attachment 8829345 [details] [diff] [review]
Part 1 - Add magic values

Review of attachment 8829345 [details] [diff] [review]:
-----------------------------------------------------------------

Good idea.  Hopefully this will catch something.
Attachment #8829345 - Flags: review?(jcoppeard) → review+
Pushed by jandemooij@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/6005e9d7efd0
part 1 - Add more release asserts to type sweeping code. r=jonco
Keywords: leave-open
We have some results from part 1 and it's kind of interesting:

* 4 out of 8 crashes are now hitting the checkMagic() I added to ConstraintTypeSet::sweep. This means the ConstraintTypeSet pointer is bogus or the TypeSet got corrupted.

* 3 crashes are still in the TypeConstraint sweeping code, but these did *not* hit any of the checkMagic calls there. It suggests the pointers are not entirely bogus/garbage. It's possible part of these constraints got corrupted though.

I'll take a closer look in a bit.
Flags: needinfo?(jdemooij)
One way things in the TypeLifoAlloc could get corrupted like this is via TypeHashSet.

This patch adds an extra word storing the capacity right before the data we allocate for the set, and then we MOZ_RELEASE_ASSERT that it matches the computed capacity value when we do a TypeHashSet lookup or insert.

I've no idea how this could happen, but it can't hurt to try it.
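
Roughly the shape of the check, as a standalone sketch (malloc stands in for the LifoAlloc the real code uses; the names here are made up):

#include <stdint.h>
#include <stdlib.h>
#include "mozilla/Assertions.h"

// Sketch: reserve one extra word in front of the hash set's data, store the
// capacity there, and verify it against the capacity recomputed from the
// element count on every lookup/insert.
static void** AllocHashSetDataSketch(uint32_t capacity) {
    // One hidden word followed by the actual table.
    uintptr_t* p = static_cast<uintptr_t*>(
        malloc(sizeof(uintptr_t) + capacity * sizeof(void*)));
    if (!p)
        return nullptr;
    *p = capacity;                            // hidden capacity word
    return reinterpret_cast<void**>(p + 1);   // callers only see the table
}

static void CheckHashSetCapacitySketch(void** data, uint32_t computedCapacity) {
    uintptr_t stored = *(reinterpret_cast<uintptr_t*>(data) - 1);
    MOZ_RELEASE_ASSERT(stored == computedCapacity);  // crash if this word was scribbled on
}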
Flags: needinfo?(jdemooij)
Attachment #8831076 - Flags: review?(jcoppeard)
Attachment #8831076 - Flags: review?(jcoppeard) → review+
Pushed by jandemooij@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/28ffe691c195
part 2 - Add (release) asserts to verify TypeHashSet capacity. r=jonco
Part 2 didn't stop the crashes and we didn't hit the new asserts, so it's likely not TypeHashSet misbehaving. It's weird though because the TypeLifoAlloc is only used for a few things. I'll add more asserts.
This patch adds a bunch more checkMagic() calls, checks a few more invariants, and uses a different magic value for TypeConstraint and ConstraintTypeSet.

We can probably remove (some of) these asserts at some point, but for now let's assert what we can to see if it will affect crash signatures.
Attachment #8831624 - Flags: review?(jcoppeard)
Attachment #8831624 - Flags: review?(jcoppeard) → review+
Pushed by jandemooij@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/12d653d3ccce
part 3 - Add more release asserts to TI code. r=jonco
Based on the latest crashes, I wonder if we're accessing the typeLifoAlloc from multiple threads.

This patch makes TypeZone::typeLifoAlloc private, adds an accessor that asserts we can access the Zone, and also adds a thread assert to AutoEnterAnalysis.
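
The shape of that pattern, as a generic sketch (the ownership check here uses a stored thread id; the real patch asserts against the Zone's actual thread-access machinery):

#include <thread>
#include "mozilla/Assertions.h"

// Sketch: the allocator becomes private, and every access funnels through an
// accessor that release-asserts the calling thread is allowed to touch it.
class TypeZoneSketch {
    std::thread::id owner_ = std::this_thread::get_id();
    // LifoAlloc typeLifoAlloc_;  // the real (now private) member, omitted here

  public:
    void assertCanAccess() const {
        // Hypothetical check; the real assert is "this thread has (exclusive)
        // access to the Zone that owns this TypeZone".
        MOZ_RELEASE_ASSERT(std::this_thread::get_id() == owner_);
    }

    // LifoAlloc& typeLifoAlloc() {
    //     assertCanAccess();
    //     return typeLifoAlloc_;
    // }
};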
Attachment #8833282 - Flags: review?(jcoppeard)
Attachment #8833282 - Flags: review?(jcoppeard) → review+
Pushed by jandemooij@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/8f8290156d0b
part 4 - Add more release asserts to TI code. r=jonco
Here's what we know at this point:

(1) I added magic words to ConstraintTypeSet and TypeConstraint, and we check that these words are what we expect when we work with these types. A lot of the crashes now hit these asserts, but we still see crashes where the magic words are all fine and we crash right after that. This suggests it's not completely random memory corruption; only some words are getting corrupted.

(2) We're *not* accessing the typeLifoAlloc, sweeping types, or using AutoEnterAnalysis from a thread that doesn't have (exclusive) access to the Zone. We now have release asserts enforcing this.

(3) Likely *not* OOM crashes.

(4) The crashes seem to happen randomly. Some of them have an uptime of a few seconds/minutes.

(5) What's interesting is that the bogus pointers often have a single bit or a small number of bits set, for instance:

0x10
0x4000
0x40000000
0x4f4
0xffff0

I think the crash frequency is too high for these to be hardware bit flips. It's possible we're trying to set flags and corrupting memory as a result. Then we have the following possibilities:

(a) The corruption comes from things allocated in the TypeLifoAlloc. The only things we allocate there are ConstraintTypeSets, TypeConstraints, and TypeHashSet data. If the corruption comes from one of these, it's most likely from TypeSets as the other things don't have flags (some type constraints store flags but we don't change these flags and I think our magic word checks should catch weirdness here).

(b) The corruption comes from unrelated code. I don't know why it would affect the typeLifoAlloc so often, though: this LifoAlloc uses the default chunk size of 8 KB, and allocations like these should happen in plenty of other places as well.

So, some options now are:

(A) We can check for (a) by moving the magic word from ConstraintTypeSet to its base class (TypeSet) and then checking it whenever we set TypeSet::flags_ etc. I'll post a patch for that soon (a rough sketch follows after this list).

(B) Use different LifoAllocs for TypeConstraint, ConstraintTypeSet, and TypeHashSet data. Then we can check if the corruption affects all of them or only one. We will use a bit more memory if we do this, but we could back it out after a few days.

(C) Use memory protection tricks. It might be easier to do this on top of (B).
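
A rough sketch of the option (A) check (illustrative names and magic value; the real TypeSet flag mutation paths are more involved):

#include <stdint.h>
#include "mozilla/Assertions.h"

// Sketch of option (A): keep the magic word in the TypeSet base class and
// verify it around every flag mutation, so a corrupted set crashes at the
// write instead of much later during sweeping.
static const uint64_t TypeSetMagicSketch = 0xa1a2b3b4c5c6d7daULL;

class TypeSetSketch {
    uint64_t magic_ = TypeSetMagicSketch;
    uint32_t flags_ = 0;

  public:
    void checkMagic() const {
        MOZ_RELEASE_ASSERT(magic_ == TypeSetMagicSketch);
    }

    void addFlags(uint32_t flags) {
        checkMagic();   // catch corruption before we write anything
        flags_ |= flags;
        checkMagic();   // and after, in case this write itself went wrong
    }
};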
I just realized there's another option, (D): in checkMagic, when the check fails, we should annotate the crash report with the data that's stored there.

This could be super interesting because if it *is* some code setting a flag or something we should get the magic word we expected with one or more bits flipped.
Report the value we got instead of the magic word. I want to know whether the value is completely bogus or only some bytes/bits got corrupted.

I made the report function MOZ_NEVER_INLINE so we get different crash signatures; the MOZ_RELEASE_ASSERT is currently inlined, which is pretty annoying.
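
A sketch of the reporting helper (the function name matches the later crash signature; the dynamic-message crash macro and exact format here are assumptions, not the actual patch):

#include <stdio.h>
#include "mozilla/Assertions.h"
#include "mozilla/Attributes.h"
#include "mozilla/IntegerPrintfMacros.h"

// Keeping the report function out of line (MOZ_NEVER_INLINE) gives it its own
// stack frame, so crash-stats can group these failures under one signature
// instead of whatever function the MOZ_RELEASE_ASSERT happened to inline into.
MOZ_NEVER_INLINE static void ReportMagicWordFailureSketch(uintptr_t actual,
                                                          uintptr_t expected) {
    char msg[128];
    snprintf(msg, sizeof(msg),
             "Got 0x%" PRIxPTR " expected magic word 0x%" PRIxPTR,
             actual, expected);
    // Assumption: a dynamic-message crash macro; at the time the patch rolled
    // its own crash-report annotation instead.
    MOZ_CRASH_UNSAFE(msg);
}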
Attachment #8833783 - Flags: review?(emanuel.hoogeveen)
Comment on attachment 8833783 [details] [diff] [review]
Part 5 - Report actual value

Review of attachment 8833783 [details] [diff] [review]:
-----------------------------------------------------------------

This looks fine. We should really have a MOZ_CRASH_DYNAMIC or something so we don't have to keep making custom implementations for this.
Attachment #8833783 - Flags: review?(emanuel.hoogeveen) → review+
Pushed by jandemooij@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/3153a3e522f7
part 5 - Annotate crash reports with the value we found instead of the TI magic word. r=ehoogeveen
Thanks for the quick review! I also added a missing mozilla/IntegerPrintfMacros.h #include.

I agree we should have a more generic way to annotate these crashes.
We got our first crash with the new instrumentation: https://crash-stats.mozilla.com/report/index/5364b5ec-55f4-4ac2-8e06-d06be2170208

Very interesting: MOZ_CRASH(Got 0xa1a2b35bc5c6d7da expected magic word 0xa1a2b3b4c5c6d7da)

Note that byte 0xb4 was changed to 0x5b:

0xa1a2b3b4c5c6d7da
0xa1a2b35bc5c6d7da
--------^^--------

If we look at the bit pattern:

0xb4: 10110100
0x5b: 01011011

It looks like (byte >> 1) | 1. Let's see if other crashes show a similar pattern...
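
A standalone snippet to sanity-check that observation (not part of any patch):

#include <stdint.h>
#include <stdio.h>

// Verify the "(byte >> 1) | 1" observation for the corrupted byte:
// 0xb4 = 10110100 -> shift right by one -> 01011010 -> set low bit -> 01011011 = 0x5b.
int main() {
    uint8_t expected = 0xb4;
    uint8_t observed = 0x5b;
    uint8_t transformed = static_cast<uint8_t>((expected >> 1) | 1);
    printf("transformed=0x%02x observed=0x%02x match=%d\n",
           (unsigned)transformed, (unsigned)observed, transformed == observed);
    return 0;
}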
Oh it looks like this one is a StackTypeSet, which is a little unexpected because these are not allocated in the typeLifoAlloc (but as part of the TypeScript). Almost all other crashes are for HeapTypeSets and TypeConstraints we allocate in the LifoAlloc.

Let's wait for more data in case this is a one-off.
More reports now:

* 2 of them are MOZ_CRASH(Got 0x0 expected magic word 0xa1a2b3b4c5c6d7da)
* 4 of them are MOZ_CRASH(Got 0x0 expected magic word 0xa1a2b3b4c5c6d7d9)

*But* 3 of these are duplicates (same uptime, same install age, same extensions - it's unfortunate that crash-stats doesn't filter these better):

https://crash-stats.mozilla.com/report/index/a85b4a9e-f5fa-4b9d-9ef8-648e92170208
https://crash-stats.mozilla.com/report/index/0f97ebea-3d75-4d57-998d-cbb9a2170208
https://crash-stats.mozilla.com/report/index/8795e752-7786-4f2d-8378-a5b8f2170208

And one of them seems to be from this user too (they have a ton of extensions installed so it's pretty easy to identify):

https://crash-stats.mozilla.com/report/index/cc5a2c0b-2e45-4a36-b668-23a062170208

The remaining two of these 0x0 crashes are:

https://crash-stats.mozilla.com/report/index/552c9dfc-3a77-466b-af42-0613e2170208
https://crash-stats.mozilla.com/report/index/ed53028f-8564-4867-b8ea-152eb2170208

So, there are 3 different users.

--------------------

Then we have 5 that look like we flipped/corrupted some bits/bytes:

* The one in comment 24, 0xb4 -> 0x5b: MOZ_CRASH(Got 0xa1a2b35bc5c6d7da expected magic word 0xa1a2b3b4c5c6d7da)
* One that's similar but 0xb4 -> 0x4b: MOZ_CRASH(Got 0xa1a2b34bc5c6d7da expected magic word 0xa1a2b3b4c5c6d7da)
* One with a bitflip     0xa2 -> 0xa3: MOZ_CRASH(Got 0xa1a3b3b4c5c6d7da expected magic word 0xa1a2b3b4c5c6d7da) 
* One with a bitflip     0xa2 -> 0xa6: MOZ_CRASH(Got 0xa1a6b3b4c5c6d7da expected magic word 0xa1a2b3b4c5c6d7da) 
* One with a bitflip     0xb4 -> 0xb0: MOZ_CRASH(Got 0xa1a2b3b0c5c6d7da expected magic word 0xa1a2b3b4c5c6d7da) 

I wonder if at least these two "single byte corrupted" crashes were from the same user:

https://crash-stats.mozilla.com/report/index/18534e44-961e-4ee3-90e0-f33b62170209
https://crash-stats.mozilla.com/report/index/5364b5ec-55f4-4ac2-8e06-d06be2170208

Same CPU, same hardware, and both have the same extensions: avg@toolbar and MyStartab-the-extension1@mozilla.com (which seems to be malware?).

2 other byte corruption crashes are also with CPU "family 6 model 42 stepping 7" but probably different users with no extensions:

https://crash-stats.mozilla.com/report/index/c25eb796-0e42-4a41-b2dd-be7202170209
https://crash-stats.mozilla.com/report/index/937cae5a-d97b-4167-b9bf-429042170209

And the remaining bit/byte corruption one:

https://crash-stats.mozilla.com/report/index/de2c779e-0bfa-4fc4-80c8-3462d2170208

All 11 reports were on x64 with Windows 7 SP1 but that's not true in general for these TI crashes.

--------------------

So what to make of this? Some users seem to hit it twice per day, *most* users (including us) never hit it, there are no clear STR or URLs triggering it, and it's often a single bit/byte that's flipped/corrupted. That makes me wonder if at least some of these crashes are random memory corruption. The 0x0 crashes are weirder, though.
Crash Signature: [@ js::ReportMagicWordFailure ]
I've just got one on https://web.telegram.org/: https://crash-stats.mozilla.com/report/index/6989523e-0727-4176-8a76-ed5a82170222.

"Got 0xa1a2b3f4c5c6d7d9 expected magic word 0xa1a2b3b4c5c6d7d9"
6 Windows crashes in Nightly 20170317111607:

https://crash-stats.mozilla.com/signature/?build_id=20170317111607&product=Firefox&release_channel=nightly&platform=Windows&signature=js%3A%3AReportMagicWordFailure&date=%3E%3D2017-03-17T00%3A00%3A00.000Z&date=%3C2017-03-20T02%3A49%3A00.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_sort=-date&page=1#reports

> Got 0x21a2b3b5 expected magic word 0xa1a2b3b5
> Got 0x21a2b3b4c5c6d7da expected magic word 0xa1a2b3b4c5c6d7da
> Got 0xa1a2b3b4c5c2d7da expected magic word 0xa1a2b3b4c5c6d7da
> Got 0xe5e5e5e5e5e5e5e5 expected magic word 0xa1a2b3b4c5c6d7da
> Got 0x2f2f2f2f expected magic word 0xa1a2b3b6
> Got 0x92fa000 expected magic word 0xa1a2b3b6

Quite a mix, there.
Whiteboard: [tbird crash]
(In reply to Nicholas Nethercote [:njn] from comment #28)
> 6 Windows crashes in Nightly 20170317111607:
>
> > Got 0x21a2b3b5 expected magic word 0xa1a2b3b5
> > Got 0x21a2b3b4c5c6d7da expected magic word 0xa1a2b3b4c5c6d7da
> > Got 0xa1a2b3b4c5c2d7da expected magic word 0xa1a2b3b4c5c6d7da
> > Got 0xe5e5e5e5e5e5e5e5 expected magic word 0xa1a2b3b4c5c6d7da
> > Got 0x2f2f2f2f expected magic word 0xa1a2b3b6
> > Got 0x92fa000 expected magic word 0xa1a2b3b6
> 
> Quite a mix, there.

Breaking these down:

> > Got 0x21a2b3b5 expected magic word 0xa1a2b3b5

flip of 0x80000000

> > Got 0x21a2b3b4c5c6d7da expected magic word 0xa1a2b3b4c5c6d7da

flip of 0x8000000000000000

> > Got 0xa1a2b3b4c5c2d7da expected magic word 0xa1a2b3b4c5c6d7da

flip of 0x40000

> > Got 0xe5e5e5e5e5e5e5e5 expected magic word 0xa1a2b3b4c5c6d7da

jemalloc freed memory

> > Got 0x2f2f2f2f expected magic word 0xa1a2b3b6

initial nursery poison value?

> > Got 0x92fa000 expected magic word 0xa1a2b3b6

completely unrelated values. Reused memory?
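
For reference, the breakdowns above are just the xor of the reported value and the expected magic word; a standalone snippet to compute them:

#include <stdint.h>
#include <stdio.h>

// Diff each reported value against its expected magic word: a single set bit
// in the xor means a bit flip, a full byte means byte-level corruption.
int main() {
    struct { uint64_t got, expected; } reports[] = {
        { 0x21a2b3b5,         0xa1a2b3b5 },
        { 0x21a2b3b4c5c6d7da, 0xa1a2b3b4c5c6d7da },
        { 0xa1a2b3b4c5c2d7da, 0xa1a2b3b4c5c6d7da },
    };
    for (const auto& r : reports) {
        uint64_t diff = r.got ^ r.expected;
        printf("got=0x%llx expected=0x%llx xor=0x%llx\n",
               (unsigned long long)r.got,
               (unsigned long long)r.expected,
               (unsigned long long)diff);
    }
    return 0;
}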
Do you think it's worth adding this in? It might remove some guesswork from bug 1353242, though really it should be replaced with a more general "capture this memory and display it in crash-stats" type of thing.
Attachment #8855090 - Flags: review?(jdemooij)
Assignee: jdemooij → sphink
(In reply to Jan de Mooij [:jandem] from comment #26)
> * The one in comment 24, 0xb4 -> 0x5b: MOZ_CRASH(Got 0xa1a2b35bc5c6d7da
> expected magic word 0xa1a2b3b4c5c6d7da)
> * One that's similar but 0xb4 -> 0x4b: MOZ_CRASH(Got 0xa1a2b34bc5c6d7da
> expected magic word 0xa1a2b3b4c5c6d7da)
> * One with a bitflip     0xa2 -> 0xa3: MOZ_CRASH(Got 0xa1a3b3b4c5c6d7da
> expected magic word 0xa1a2b3b4c5c6d7da) 
> * One with a bitflip     0xa2 -> 0xa6: MOZ_CRASH(Got 0xa1a6b3b4c5c6d7da
> expected magic word 0xa1a2b3b4c5c6d7da) 
> * One with a bitflip     0xb4 -> 0xb0: MOZ_CRASH(Got 0xa1a2b3b0c5c6d7da
> expected magic word 0xa1a2b3b4c5c6d7da) 

Dumping out the xors of those:

0xef00000000 (as jandem noticed, that looks more like a shift)
0xff00000000 (does not look like a shift. A complete inversion of 1 byte??)
0x1000000000000
0x4000000000000
0x400000000

That first one could also be seen as an inversion of a byte, plus a bit flip.

These bits don't make sense as TypeFlags being set or cleared. TypeFlags is 32 bits, with 7 bits near the low end and 14 bits at the high end used as counts. 0x80000000 and 0x8000000000000000 from comment 29 both land in that top region, so setting or clearing a flag would not toggle that bit. (I don't know where the 32 flag bits sit within the 64-bit word, but given that we saw both 0x8... values it doesn't matter; we're flipping a bit at the top end of both 32-bit halves.)
I see 6 crashes in js::ConstraintTypeSet::checkMagic: http://bit.ly/2p3BG0Z
Crash Signature: [@ js::ReportMagicWordFailure ] → [@ js::ReportMagicWordFailure ] [@ js::ConstraintTypeSet::checkMagic]
Comment on attachment 8855090 [details] [diff] [review]
Display some additional diagnostic information for ConstraintTypeSet corruption

Review of attachment 8855090 [details] [diff] [review]:
-----------------------------------------------------------------

I'm not sure how useful it is, but it probably can't hurt.
Attachment #8855090 - Flags: review?(jdemooij) → review+
Assignee: sphink → jdemooij
These crashes are still happening. 8 in the 5-17 Nightly, from 8 different installations.
Pushed by sfink@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/8b3ea20f546c
Display some additional diagnostic information for ConstraintTypeSet corruption, r=jandem
Keywords: triage-deferred
Priority: -- → P3
Adding another signature showing up in nightly crash stats. These crashes go back to Build ID 20171108110838: http://bit.ly/2BKgC7w. 38 crashes/16 installs in the last 7 days. Moz Crash Reason for some of them is: Got 0xa122b3b4c5c6d7d9 expected magic word 0xa1a2b3b4c5c6d7d9
Crash Signature: [@ js::ReportMagicWordFailure ] [@ js::ConstraintTypeSet::checkMagic] → [@ js::ReportMagicWordFailure ] [@ js::ConstraintTypeSet::checkMagic] [@ js::TypeConstraint::next]
I think we should close this bug. We added more release asserts; we can file new bugs if we want to add more.
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED