Closed
Bug 1403397
Opened 7 years ago
Closed 7 years ago
stylo: Land Nightly diagnostic code to mprotect stylist hashtables outside of rebuilds in order to determine what's corrupting them
Categories
(Core :: CSS Parsing and Computation, defect, P3)
Core
CSS Parsing and Computation
Tracking
()
RESOLVED
FIXED
mozilla58
People
(Reporter: bholley, Assigned: bholley)
References
(Blocks 1 open bug)
Details
Attachments
(4 files, 1 obsolete file)
21.38 KB,
patch
|
manishearth
:
review+
|
Details | Diff | Splinter Review |
9.34 KB,
patch
|
manishearth
:
review+
away
:
review+
|
Details | Diff | Splinter Review |
10.84 KB,
patch
|
manishearth
:
review+
away
:
review+
ehoogeveen
:
review+
|
Details | Diff | Splinter Review |
8.22 KB,
patch
|
manishearth
:
review+
bzbarsky
:
review+
|
Details | Diff | Splinter Review |
This is the approach discussed in bug 1385925 comment 26. The hope is that this will convert those crash reports into new reports whose callstack indicates the source of corruption.
My basic plan is as follows:
(1) Introduce a HashMap wrapper that must be explicitly unlocked before mutation, and add lock/unlock calls around stylist rebuilds.
(2) Round all hashglobe allocations up to the nearest page. This should regress memory somewhat, but it should be acceptable for a few days of diagnostics on Nightly 58.
(3) In {begin,end}_mutation, do an FFI call with the buffer to invoke MemoryProtectionExceptionHandler::{add,remove}Region and {MakePagesReadOnly,UnprotectPages}.
If anyone has thoughts/suggestions here, please speak up.
This try push is for step (1):
https://treeherder.mozilla.org/#/jobs?repo=try&revision=d46d80220c328017d60d4d3df269d35e8d39a95e
Updated•7 years ago
|
Comment 1•7 years ago
|
||
So, there were a bunch of stacks in that bug that doesn't necessarily involve HashMap.
Is the idea here just that HashMaps are just a bit more likely to hit this because they use a ton of contiguous memory?
Comment 2•7 years ago
|
||
I'd be really happy if MemoryProtectionExceptionHandler helped track down an issue like this :)
One implementation detail of the class that might be worth keeping an eye on is its use of js::SplayTree [1]. IIUC splay trees work best when a small subset of elements are accessed frequently, or if it's used as a LIFO buffer, and I don't know if that's likely for this. If it ends up acting like a linked list and insertion or removal start showing up in profiles, it might be worth switching it over to another data structure.
[1] https://dxr.mozilla.org/mozilla-central/source/js/src/ds/SplayTree.h
Assignee | ||
Comment 3•7 years ago
|
||
(In reply to Emilio Cobos Álvarez [:emilio] from comment #1)
> So, there were a bunch of stacks in that bug that doesn't necessarily
> involve HashMap.
>
> Is the idea here just that HashMaps are just a bit more likely to hit this
> because they use a ton of contiguous memory?
Potentially, or they're distinct crashes. Either way these diagnostics are intended to diagnose the hashmap crashes described in bug 1385925 comment 12.
Assignee | ||
Comment 4•7 years ago
|
||
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #2)
> I'd be really happy if MemoryProtectionExceptionHandler helped track down an
> issue like this :)
>
> One implementation detail of the class that might be worth keeping an eye on
> is its use of js::SplayTree [1]. IIUC splay trees work best when a small
> subset of elements are accessed frequently, or if it's used as a LIFO
> buffer, and I don't know if that's likely for this. If it ends up acting
> like a linked list and insertion or removal start showing up in profiles, it
> might be worth switching it over to another data structure.
>
> [1] https://dxr.mozilla.org/mozilla-central/source/js/src/ds/SplayTree.h
The plan is to only leave this code active for a few days, so as long as the performance isn't terrible it's probably ok. Will be on the lookout though.
Assignee | ||
Comment 5•7 years ago
|
||
Assignee | ||
Comment 6•7 years ago
|
||
MozReview-Commit-ID: K0m65uZi7iw
Attachment #8913016 -
Flags: review?(manishearth)
Assignee | ||
Comment 7•7 years ago
|
||
MozReview-Commit-ID: 34KFtcwCkBB
Attachment #8913017 -
Flags: review?(manishearth)
Attachment #8913017 -
Flags: review?(dmajor)
Assignee | ||
Comment 8•7 years ago
|
||
MozReview-Commit-ID: KACmfw4pZY2
Attachment #8913018 -
Flags: review?(manishearth)
Attachment #8913018 -
Flags: review?(emanuel.hoogeveen)
Attachment #8913018 -
Flags: review?(dmajor)
Assignee | ||
Comment 9•7 years ago
|
||
Did some local measurements. Memory impact is surprisingly small, which I guess indicates that we don't have lots and lots of barely-populated hashmaps. There's some performance overhead during stylist rebuilds, but it should be acceptable for a few days of diagnostics on Nightly.
Assignee | ||
Comment 10•7 years ago
|
||
Updated to handle asan.
MozReview-Commit-ID: KACmfw4pZY2
Attachment #8913071 -
Flags: review?(manishearth)
Attachment #8913071 -
Flags: review?(emanuel.hoogeveen)
Attachment #8913071 -
Flags: review?(dmajor)
Assignee | ||
Updated•7 years ago
|
Attachment #8913018 -
Attachment is obsolete: true
Attachment #8913018 -
Flags: review?(manishearth)
Attachment #8913018 -
Flags: review?(emanuel.hoogeveen)
Attachment #8913018 -
Flags: review?(dmajor)
Assignee | ||
Comment 11•7 years ago
|
||
This will allow us to verify the entire detection pipeline in real nightly
builds, which will give us confidence that real heap corruption will be
detected and reported properly.
MozReview-Commit-ID: 43Fp2HT8RYy
Attachment #8913072 -
Flags: review?(manishearth)
Attachment #8913072 -
Flags: review?(bzbarsky)
Assignee | ||
Updated•7 years ago
|
Attachment #8913071 -
Attachment description: Protect the hashmaps outside of rebuilds. v2 → Part 3 - Protect the hashmaps outside of rebuilds. v2
Assignee | ||
Comment 12•7 years ago
|
||
Assignee | ||
Comment 13•7 years ago
|
||
And to be clear: The plan is to land these patches on nightly, and back them out a few days later once we have sufficient data from crash-stats.
Updated•7 years ago
|
Attachment #8913071 -
Flags: review?(emanuel.hoogeveen) → review+
Comment 14•7 years ago
|
||
Comment on attachment 8913017 [details] [diff] [review]
Part 2 - Round hashglobe allocations up to the nearest page size. v1
Review of attachment 8913017 [details] [diff] [review]:
-----------------------------------------------------------------
::: layout/style/ServoBindings.cpp
@@ +2634,5 @@
>
> +size_t
> +Gecko_GetSystemPageSize()
> +{
> + MOZ_ASSERT(NS_IsMainThread());
This operation is probably safe on any thread, but I guess there's no harm in being more restrictive.
::: servo/components/hashglobe/src/table.rs
@@ +1212,5 @@
> + let remainder = size % page_size;
> + if remainder != 0 {
> + result += page_size - remainder;
> + }
> + debug_assert!(result % page_size == 0);
In case this function survives past your backout and the math above gets refactored, could you please add:
debug_assert!(result - size < page_size);
We've had some off-by-ones in the past where e.g. 4096 got rounded up to 8192. :-/
Attachment #8913017 -
Flags: review?(dmajor) → review+
Comment 15•7 years ago
|
||
Comment on attachment 8913071 [details] [diff] [review]
Part 3 - Protect the hashmaps outside of rebuilds. v2
Review of attachment 8913071 [details] [diff] [review]:
-----------------------------------------------------------------
I assume this handles cases of the buffer changing on resize? (I'm not familiar with the hashmap code)
Attachment #8913071 -
Flags: review?(dmajor) → review+
Comment 16•7 years ago
|
||
I want to test this out by introducing a bad write in my debugger and checking that the report shows up correctly in crash-stats. (Don't block on me, and clear my flag if you've already done this)
Flags: needinfo?(dmajor)
Comment 17•7 years ago
|
||
Looks good:
MOZ_CRASH Reason MOZ_CRASH(Tried to access a protected region!)
Flags: needinfo?(dmajor)
Assignee | ||
Comment 18•7 years ago
|
||
(In reply to David Major [:dmajor] from comment #17)
> Looks good:
> MOZ_CRASH Reason MOZ_CRASH(Tried to access a protected region!)
I assume this was you? https://crash-stats.mozilla.com/report/index/d1354951-339d-4be4-b725-819da0170928
Note that I was also planning to land the diagnostics in Part 4, which would allow us to trigger a crash with a known stack from a privileged scratchpad in a nightly build.
Updated•7 years ago
|
Attachment #8913016 -
Flags: review?(manishearth) → review+
Comment 19•7 years ago
|
||
Comment on attachment 8913017 [details] [diff] [review]
Part 2 - Round hashglobe allocations up to the nearest page size. v1
Review of attachment 8913017 [details] [diff] [review]:
-----------------------------------------------------------------
::: servo/components/hashglobe/src/table.rs
@@ +776,5 @@
>
>
>
> // FORK NOTE: Uses alloc shim instead of Heap.alloc
> + let buffer = alloc(round_up_to_page_size(size), alignment);
kinda feel like we could round up cap as well but not necessary since this is just an experiment
Attachment #8913017 -
Flags: review?(manishearth) → review+
Comment 20•7 years ago
|
||
Comment on attachment 8913071 [details] [diff] [review]
Part 3 - Protect the hashmaps outside of rebuilds. v2
Review of attachment 8913071 [details] [diff] [review]:
-----------------------------------------------------------------
::: servo/components/hashglobe/src/protected.rs
@@ +169,5 @@
> {
> pub fn new() -> Self {
> Self {
> map: HashMap::new(),
> readonly: true,
it's a bit non-obvious why we don't protect() here too, might be worth mentioning in a comment, but no big deal if ignored since this is temporary
Attachment #8913071 -
Flags: review?(manishearth) → review+
Updated•7 years ago
|
Attachment #8913072 -
Flags: review?(manishearth) → review+
Comment 21•7 years ago
|
||
Comment on attachment 8913072 [details] [diff] [review]
Part 4 - Add a testing API. v1
r=me
Attachment #8913072 -
Flags: review?(bzbarsky) → review+
Assignee | ||
Comment 22•7 years ago
|
||
Comment 23•7 years ago
|
||
Pushed by bholley@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/820174e66b18
Round hashglobe allocations up to the nearest page size. r=Manishearth,r=dmajor
https://hg.mozilla.org/integration/autoland/rev/17044f9cf2a3
Protect the hashmaps outside of rebuilds. r=Manishearth,r=dmajor,r=ehoogeveen
https://hg.mozilla.org/integration/autoland/rev/600792598d42
Add a testing API. r=bz,r=Manishearth
Assignee | ||
Comment 24•7 years ago
|
||
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #22)
> https://github.com/servo/servo/pull/18668
Landed as:
https://hg.mozilla.org/integration/autoland/rev/9793aaf42699
https://hg.mozilla.org/integration/autoland/rev/9d688ebb802b
Comment 25•7 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/820174e66b18
https://hg.mozilla.org/mozilla-central/rev/17044f9cf2a3
https://hg.mozilla.org/mozilla-central/rev/600792598d42
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla58
Assignee | ||
Comment 26•7 years ago
|
||
So, we got some mixed and inconclusive results here.
First, the mprotect stuff didn't turn up anything all that interesting. We got exactly one interesting crash when trying to write to the protected region: https://crash-stats.mozilla.com/report/index/39b2cd4a-7bed-40ec-8b63-598080171002 . That seems to indicate the JS heap was corrupted and GCing it is causing us to clobber memory elsewhere in the system. That's pretty bad, but I'm not sure if there's anything we can do about it. NI some GC people to make sure they're aware, but one instances of extern corruption doesn't account for the crash volume we were seeing.
Additionally, the crashes continued. The data is a bit hard to read, because the crash volume (in 58) decreased substantially a few weeks ago. Furthermore, the EXCEPTION_BREAKPOINT crashes (where we get killed by the parent process) are less explainable, so I've filtered those out as well in my query [1].
Anyway, there are definitely some HashMap crashes in the range where this stuff was landed.
* https://crash-stats.mozilla.com/report/index/15be700d-7b16-4880-9b1f-3da270171001
* https://crash-stats.mozilla.com/report/index/e8ab82ce-adc0-4f14-adf7-4fd5f0171003
* https://crash-stats.mozilla.com/report/index/7d56ffe7-006f-4a0a-94f9-ced150171003
Removing the diagnostics in https://github.com/servo/servo/pull/18732 . Gankro and I are doing some audits of HashMap.
[1] https://crash-stats.mozilla.com/signature/?version=58.0a1&reason=EXCEPTION_ACCESS_VIOLATION_WRITE&reason=EXCEPTION_ACCESS_VIOLATION_READ&signature=core%3A%3Aptr%3A%3Adrop_in_place%3CT%3E&date=%3E%3D2017-09-19T19%3A33%3A24.000Z&date=%3C2017-10-03T19%3A33%3A24.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_sort=-date&page=1
Assignee | ||
Updated•7 years ago
|
Flags: needinfo?(pbone)
Flags: needinfo?(jcoppeard)
Comment 27•7 years ago
|
||
Pushed by bholley@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/bf6cec51bf0d
Revert: Add a testing API. r=bholley
https://hg.mozilla.org/integration/autoland/rev/2a4030b05c35
Revert: Protect the hashmaps outside of rebuilds. r=bholley
https://hg.mozilla.org/integration/autoland/rev/93de74a2302a
Revert: Round hashglobe allocations up to the nearest page size. r=bholley
Comment 28•7 years ago
|
||
Thanks, I have filed bug 1405525 for the GC team.
Flags: needinfo?(pbone)
Flags: needinfo?(jcoppeard)
Comment 29•7 years ago
|
||
bugherder |
Assignee | ||
Updated•7 years ago
|
Blocks: stylo-hashmap-crashes
You need to log in
before you can comment on or make changes to this bug.
Description
•