stylo: Land Nightly diagnostic code to mprotect stylist hashtables outside of rebuilds in order to determine what's corrupting them

RESOLVED FIXED in Firefox 58

Status

()

defect
P3
normal
RESOLVED FIXED
2 years ago
2 years ago

People

(Reporter: bholley, Assigned: bholley)

Tracking

(Blocks 2 bugs)

unspecified
mozilla58
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox57 wontfix, firefox58 fixed)

Details

Attachments

(4 attachments, 1 obsolete attachment)

This is the approach discussed in bug 1385925 comment 26. The hope is that this will convert those crash reports into new reports whose callstack indicates the source of corruption.

My basic plan is as follows:
(1) Introduce a HashMap wrapper that must be explicitly unlocked before mutation, and add lock/unlock calls around stylist rebuilds. 
(2) Round all hashglobe allocations up to the nearest page. This should regress memory somewhat, but it should be acceptable for a few days of diagnostics on Nightly 58.
(3) In {begin,end}_mutation, do an FFI call with the buffer to invoke MemoryProtectionExceptionHandler::{add,remove}Region and {MakePagesReadOnly,UnprotectPages}.

If anyone has thoughts/suggestions here, please speak up.

This try push is for step (1):
https://treeherder.mozilla.org/#/jobs?repo=try&revision=d46d80220c328017d60d4d3df269d35e8d39a95e
Priority: -- → P3
So, there were a bunch of stacks in that bug that doesn't necessarily involve HashMap.

Is the idea here just that HashMaps are just a bit more likely to hit this because they use a ton of contiguous memory?
I'd be really happy if MemoryProtectionExceptionHandler helped track down an issue like this :)

One implementation detail of the class that might be worth keeping an eye on is its use of js::SplayTree [1]. IIUC splay trees work best when a small subset of elements are accessed frequently, or if it's used as a LIFO buffer, and I don't know if that's likely for this. If it ends up acting like a linked list and insertion or removal start showing up in profiles, it might be worth switching it over to another data structure.

[1] https://dxr.mozilla.org/mozilla-central/source/js/src/ds/SplayTree.h
(In reply to Emilio Cobos Álvarez [:emilio] from comment #1)
> So, there were a bunch of stacks in that bug that doesn't necessarily
> involve HashMap.
> 
> Is the idea here just that HashMaps are just a bit more likely to hit this
> because they use a ton of contiguous memory?

Potentially, or they're distinct crashes. Either way these diagnostics are intended to diagnose the hashmap crashes described in bug 1385925 comment 12.
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #2)
> I'd be really happy if MemoryProtectionExceptionHandler helped track down an
> issue like this :)
> 
> One implementation detail of the class that might be worth keeping an eye on
> is its use of js::SplayTree [1]. IIUC splay trees work best when a small
> subset of elements are accessed frequently, or if it's used as a LIFO
> buffer, and I don't know if that's likely for this. If it ends up acting
> like a linked list and insertion or removal start showing up in profiles, it
> might be worth switching it over to another data structure.
> 
> [1] https://dxr.mozilla.org/mozilla-central/source/js/src/ds/SplayTree.h

The plan is to only leave this code active for a few days, so as long as the performance isn't terrible it's probably ok. Will be on the lookout though.
MozReview-Commit-ID: K0m65uZi7iw
Attachment #8913016 - Flags: review?(manishearth)
MozReview-Commit-ID: 34KFtcwCkBB
Attachment #8913017 - Flags: review?(manishearth)
Attachment #8913017 - Flags: review?(dmajor)
MozReview-Commit-ID: KACmfw4pZY2
Attachment #8913018 - Flags: review?(manishearth)
Attachment #8913018 - Flags: review?(emanuel.hoogeveen)
Attachment #8913018 - Flags: review?(dmajor)
Did some local measurements. Memory impact is surprisingly small, which I guess indicates that we don't have lots and lots of barely-populated hashmaps. There's some performance overhead during stylist rebuilds, but it should be acceptable for a few days of diagnostics on Nightly.
Updated to handle asan.

MozReview-Commit-ID: KACmfw4pZY2
Attachment #8913071 - Flags: review?(manishearth)
Attachment #8913071 - Flags: review?(emanuel.hoogeveen)
Attachment #8913071 - Flags: review?(dmajor)
Attachment #8913018 - Attachment is obsolete: true
Attachment #8913018 - Flags: review?(manishearth)
Attachment #8913018 - Flags: review?(emanuel.hoogeveen)
Attachment #8913018 - Flags: review?(dmajor)
This will allow us to verify the entire detection pipeline in real nightly
builds, which will give us confidence that real heap corruption will be
detected and reported properly.

MozReview-Commit-ID: 43Fp2HT8RYy
Attachment #8913072 - Flags: review?(manishearth)
Attachment #8913072 - Flags: review?(bzbarsky)
Attachment #8913071 - Attachment description: Protect the hashmaps outside of rebuilds. v2 → Part 3 - Protect the hashmaps outside of rebuilds. v2
And to be clear: The plan is to land these patches on nightly, and back them out a few days later once we have sufficient data from crash-stats.
Attachment #8913071 - Flags: review?(emanuel.hoogeveen) → review+
Comment on attachment 8913017 [details] [diff] [review]
Part 2 - Round hashglobe allocations up to the nearest page size. v1

Review of attachment 8913017 [details] [diff] [review]:
-----------------------------------------------------------------

::: layout/style/ServoBindings.cpp
@@ +2634,5 @@
>  
> +size_t
> +Gecko_GetSystemPageSize()
> +{
> +  MOZ_ASSERT(NS_IsMainThread());

This operation is probably safe on any thread, but I guess there's no harm in being more restrictive.

::: servo/components/hashglobe/src/table.rs
@@ +1212,5 @@
> +    let remainder = size % page_size;
> +    if remainder != 0 {
> +        result += page_size - remainder;
> +    }
> +    debug_assert!(result % page_size == 0);

In case this function survives past your backout and the math above gets refactored, could you please add:
debug_assert!(result - size < page_size);
We've had some off-by-ones in the past where e.g. 4096 got rounded up to 8192. :-/
Attachment #8913017 - Flags: review?(dmajor) → review+
Comment on attachment 8913071 [details] [diff] [review]
Part 3 - Protect the hashmaps outside of rebuilds. v2

Review of attachment 8913071 [details] [diff] [review]:
-----------------------------------------------------------------

I assume this handles cases of the buffer changing on resize? (I'm not familiar with the hashmap code)
Attachment #8913071 - Flags: review?(dmajor) → review+
I want to test this out by introducing a bad write in my debugger and checking that the report shows up correctly in crash-stats. (Don't block on me, and clear my flag if you've already done this)
Flags: needinfo?(dmajor)
Looks good:
MOZ_CRASH Reason 	MOZ_CRASH(Tried to access a protected region!)
Flags: needinfo?(dmajor)
(In reply to David Major [:dmajor] from comment #17)
> Looks good:
> MOZ_CRASH Reason 	MOZ_CRASH(Tried to access a protected region!)

I assume this was you? https://crash-stats.mozilla.com/report/index/d1354951-339d-4be4-b725-819da0170928

Note that I was also planning to land the diagnostics in Part 4, which would allow us to trigger a crash with a known stack from a privileged scratchpad in a nightly build.
Attachment #8913016 - Flags: review?(manishearth) → review+
Comment on attachment 8913017 [details] [diff] [review]
Part 2 - Round hashglobe allocations up to the nearest page size. v1

Review of attachment 8913017 [details] [diff] [review]:
-----------------------------------------------------------------

::: servo/components/hashglobe/src/table.rs
@@ +776,5 @@
>  
>  
>  
>          // FORK NOTE: Uses alloc shim instead of Heap.alloc
> +        let buffer = alloc(round_up_to_page_size(size), alignment);

kinda feel like we could round up cap as well but not necessary since this is just an experiment
Attachment #8913017 - Flags: review?(manishearth) → review+
Comment on attachment 8913071 [details] [diff] [review]
Part 3 - Protect the hashmaps outside of rebuilds. v2

Review of attachment 8913071 [details] [diff] [review]:
-----------------------------------------------------------------

::: servo/components/hashglobe/src/protected.rs
@@ +169,5 @@
>  {
>      pub fn new() -> Self {
>          Self {
>              map: HashMap::new(),
>              readonly: true,

it's a bit non-obvious why we don't protect() here too, might be worth mentioning in a comment, but no big deal if ignored since this is temporary
Attachment #8913071 - Flags: review?(manishearth) → review+
Attachment #8913072 - Flags: review?(manishearth) → review+
Comment on attachment 8913072 [details] [diff] [review]
Part 4 - Add a testing API. v1

r=me
Attachment #8913072 - Flags: review?(bzbarsky) → review+
Pushed by bholley@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/820174e66b18
Round hashglobe allocations up to the nearest page size. r=Manishearth,r=dmajor
https://hg.mozilla.org/integration/autoland/rev/17044f9cf2a3
Protect the hashmaps outside of rebuilds. r=Manishearth,r=dmajor,r=ehoogeveen
https://hg.mozilla.org/integration/autoland/rev/600792598d42
Add a testing API. r=bz,r=Manishearth
Depends on: 1404258
So, we got some mixed and inconclusive results here.

First, the mprotect stuff didn't turn up anything all that interesting. We got exactly one interesting crash when trying to write to the protected region: https://crash-stats.mozilla.com/report/index/39b2cd4a-7bed-40ec-8b63-598080171002 . That seems to indicate the JS heap was corrupted and GCing it is causing us to clobber memory elsewhere in the system. That's pretty bad, but I'm not sure if there's anything we can do about it. NI some GC people to make sure they're aware, but one instances of extern corruption doesn't account for the crash volume we were seeing.

Additionally, the crashes continued. The data is a bit hard to read, because the crash volume (in 58) decreased substantially a few weeks ago. Furthermore, the EXCEPTION_BREAKPOINT crashes (where we get killed by the parent process) are less explainable, so I've filtered those out as well in my query [1].

Anyway, there are definitely some HashMap crashes in the range where this stuff was landed.
* https://crash-stats.mozilla.com/report/index/15be700d-7b16-4880-9b1f-3da270171001
* https://crash-stats.mozilla.com/report/index/e8ab82ce-adc0-4f14-adf7-4fd5f0171003
* https://crash-stats.mozilla.com/report/index/7d56ffe7-006f-4a0a-94f9-ced150171003


Removing the diagnostics in https://github.com/servo/servo/pull/18732 . Gankro and I are doing some audits of HashMap.

[1] https://crash-stats.mozilla.com/signature/?version=58.0a1&reason=EXCEPTION_ACCESS_VIOLATION_WRITE&reason=EXCEPTION_ACCESS_VIOLATION_READ&signature=core%3A%3Aptr%3A%3Adrop_in_place%3CT%3E&date=%3E%3D2017-09-19T19%3A33%3A24.000Z&date=%3C2017-10-03T19%3A33%3A24.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_sort=-date&page=1
Flags: needinfo?(pbone)
Flags: needinfo?(jcoppeard)
Pushed by bholley@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/bf6cec51bf0d
Revert: Add a testing API. r=bholley
https://hg.mozilla.org/integration/autoland/rev/2a4030b05c35
Revert: Protect the hashmaps outside of rebuilds. r=bholley
https://hg.mozilla.org/integration/autoland/rev/93de74a2302a
Revert: Round hashglobe allocations up to the nearest page size. r=bholley
Thanks, I have filed bug 1405525 for the GC team.
Flags: needinfo?(pbone)
Flags: needinfo?(jcoppeard)
Blocks: 1406996
You need to log in before you can comment on or make changes to this bug.