Closed Bug 1405879 Opened 7 years ago Closed 7 years ago

stylo: poison the hashtable buffers

Categories

(Core :: CSS Parsing and Computation, defect, P1)

defect

Tracking

()

RESOLVED FIXED
mozilla58
Tracking Status
firefox57 --- wontfix
firefox58 --- fixed

People

(Reporter: bholley, Assigned: bholley)

References

(Blocks 1 open bug)

Details

(Keywords: sec-audit, sec-want, Whiteboard: [keep hidden while bug 1406996 is open][adv-main58-][post-critsmash-triage])

Attachments

(3 files)

Per investigation in bug 1385925, bug 1404791, and bug 1403845, I have a  reasonably strong suspicion that there's an uninitialized memory read in rust's std::HashMap.

HashMap stores the table as follows:

[h,h,h...][(k,v),(k,v),(k,v)...]

If a given hash is nonzero, it's assumed that the (k,v) at the same index is valid. As such, only the first portion of the table is zeroed on construction, and the rest is left uninitialized.

We're seeing a lot of crashes trying to access keys/values at addresses near 0x0 and 0xe5e5, which suggests that this invariant isn't holding. As both a diagnostic tool and a safety improvement, I want to poison the entire buffer, so that any uninitialized memory has a safe (and recognizable) pattern. This should hurt hashtable construction performance a little but, but given that we already poison memory on free, we're basically just making these allocations as expensive as the corresponding deallocations, which is probably acceptable.
MozReview-Commit-ID: 8uLGtFv6X4P
Attachment #8915371 - Flags: review?(manishearth)
Attachment #8915371 - Flags: review?(manishearth) → review+
Just a note, we use 0xe4 for junk fill in mozjemalloc. It might make sense to align with that.
(In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment #3)
> Just a note, we use 0xe4 for junk fill in mozjemalloc. It might make sense
> to align with that.

I specifically wanted to use something else that wouldn't come from elsewhere, which is why I went with 0xe7 (which, according to 1044077#c4, should have the same properties).
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #0)
> I have a reasonably strong suspicion that there's an uninitialized memory
> read in rust's std::HashMap.

I am (somewhat) surprised to hear that, given that I've been running
Stylo-ised Fx builds on Valgrind on a fairly regular basis these past few
months, and have never seen any such complaints.

Maybe it happens on a very obscure path?  Or is there some other allocator
in between hash table creation and the underlying malloc, that recycles
memory in such a way that it does not appear to be uninitialised?

If we had a way, in your patch, to call MOZ_MAKE_MEM_{UNDEFINED,NOACCESS}
(in mfbt/MemoryChecking.h), or the direct Valgrind equivalents thereof, then
we could verify this a bit more directly, at least for local testing.

In your investigations, have you seen anything that suggests these crashes
could be caused by a data race?
https://hg.mozilla.org/integration/autoland/rev/2cd9b59aef83

(In reply to Julian Seward [:jseward] from comment #5)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #0)
> > I have a reasonably strong suspicion that there's an uninitialized memory
> > read in rust's std::HashMap.
> 
> I am (somewhat) surprised to hear that, given that I've been running
> Stylo-ised Fx builds on Valgrind on a fairly regular basis these past few
> months, and have never seen any such complaints.
> 
> Maybe it happens on a very obscure path?

Well, hashtables are probabilistic data structures. My hypothesis is that we're hitting an edge cases in the robin-hood algorithm, which would only arise in a very particular initial state of occupied/unoccupied buckets when the problematic insertion occurs. With that, it seems plausible that the uninitialized reads could be vanishingly rare.

See bug 1404791 comment 2 for an analysis of some of the effects we're seeing.

> Or is there some other allocator
> in between hash table creation and the underlying malloc, that recycles
> memory in such a way that it does not appear to be uninitialised?
> 
> If we had a way, in your patch, to call MOZ_MAKE_MEM_{UNDEFINED,NOACCESS}
> (in mfbt/MemoryChecking.h), or the direct Valgrind equivalents thereof, then
> we could verify this a bit more directly, at least for local testing.

Yeah. It's possible that the problem actually occurs more frequently, but that most of the time the contents of the associated (k,v) memory are simply leftovers from a previous (k,v), which would basically work fine for lookups (we'd compare the key, it wouldn't match, and we'd move on). Given that we're using Atoms as keys though, this would result in later (harder-to-detect) refcounting errors and prematures frees when the hashtable was dropped (since we'd free the same atom twice).

Anyway, instrumenting hashglobe in local builds could certainly be worthwhile.

> In your investigations, have you seen anything that suggests these crashes
> could be caused by a data race?

I think that is very unlikely. We have several hashmaps being used in entirely different contexts. A data race in either would be pretty unlikely, and in both even less so.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Need to wait a day or two to verify on crash stats that the poisoning works.
did this land?  Perhaps you forgot to mark it here; automatic marking doesn't work on sec bugs.
Flags: needinfo?(bobbyholley)
(In reply to Randell Jesup [:jesup] from comment #8)
> did this land?  Perhaps you forgot to mark it here; automatic marking
> doesn't work on sec bugs.

It landed in https://hg.mozilla.org/integration/autoland/rev/2cd9b59aef83 per comment 6. For stuff that gets synced over by the VCS sync service, we tend to resolve fixed when it hits autoland, since we don't get automatic marking.

I will be watching the crash reports to see if this does the right thing, and flag for uplift if so.
Flags: needinfo?(bobbyholley)
> It landed in https://hg.mozilla.org/integration/autoland/rev/2cd9b59aef83
> per comment 6. 

Aha... I missed that since it was just sitting at the head of large comment
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #6)
> Anyway, instrumenting hashglobe in local builds could certainly be worthwhile.

I instrumented components/hashglobe/src/table.rs using the attached patch,
which is derived from yours.  Note, the use of zutil.c is entirely arbitrary
-- I just needed a C source file in which to park the helper function.

Running startup, + Obama, + techcrunch.com, + the W3 single-page spec, I
found no evidence of any unitialised value uses arising from that
annotation.  The fprintf verified that it was actually getting used.  So I
have no evidence in support of your hypothesis, alas.
(In reply to Julian Seward [:jseward] from comment #11)
> I have no evidence in support of your hypothesis

I did however end up looking at the attached Valgrind report, which is
unrelated to your hypothesis.  It's something I've seen several times
before, but discounted as a false positive.  However, on looking at it
again, and with some help from Michael Woerister, I think my initial
judgement was wrong.

It appears to pertain to RawVec deallocation.  Do you think it could be
relevant?
Flags: needinfo?(bobbyholley)
Target Milestone: --- → mozilla58
Group: core-security → core-security-release
(In reply to Julian Seward [:jseward] from comment #13)
> Created attachment 8915978 [details]
> A Valgrind complaint relating to RawVec deallocation
> 
> (In reply to Julian Seward [:jseward] from comment #11)
> > I have no evidence in support of your hypothesis
> 
> I did however end up looking at the attached Valgrind report, which is
> unrelated to your hypothesis.  It's something I've seen several times
> before, but discounted as a false positive.  However, on looking at it
> again, and with some help from Michael Woerister, I think my initial
> judgement was wrong.
> 
> It appears to pertain to RawVec deallocation.  Do you think it could be
> relevant?

At first glance, this appears to be related to the OS string we create in the one of the env::var calls in global_style_data.rs. It doesn't really seem related to hashmap, aside from the fact that they go through the same allocator. If you have reason to believe there's something fishy there it may be worth investigating, but I don't see an on-face reason why it would be related to the hashmap corruption.
Flags: needinfo?(bobbyholley)
I am seeing 2 crashes/1 install in [@ hashglobe::diagnostic::DiagnosticHashMap<T>::report_corruption<T>] which seem to point back to this landing.
(In reply to Marcia Knous [:marcia - use ni] from comment #15)
> I am seeing 2 crashes/1 install in [@
> hashglobe::diagnostic::DiagnosticHashMap<T>::report_corruption<T>] which
> seem to point back to this landing.

Sorry, I neglected to include the link to the crash report: https://crash-stats.mozilla.com/report/index/8d69088a-a876-4b40-beb4-979f40171009
Depends on: 1406955
Is this 57:wontfix?
Flags: needinfo?(manishearth)
Yes; this is pure diagnostic code that should not make it into a release.
Flags: needinfo?(manishearth)
(In reply to Manish Goregaokar [:manishearth] from comment #18)
> Yes; this is pure diagnostic code that should not make it into a release.

Ok, so how are we making sure it doesn't get to release?  It's not inside whatever-Rust-thing is needed to be nightly-only or nightly-and-early-beta/etc....
Flags: needinfo?(manishearth)
(In reply to Randell Jesup [:jesup] from comment #19)
> (In reply to Manish Goregaokar [:manishearth] from comment #18)
> > Yes; this is pure diagnostic code that should not make it into a release.
> 
> Ok, so how are we making sure it doesn't get to release?  It's not inside
> whatever-Rust-thing is needed to be nightly-only or
> nightly-and-early-beta/etc....

What we've done before is just backing out the diagnostics once we got enough info.
> Ok, so how are we making sure it doesn't get to release?

What emilio said, basically. We'll hopefully be backing these out next week.
Flags: needinfo?(manishearth)
Whiteboard: [adv-main57-]
Whiteboard: [adv-main57-] → [adv-main58-]
Flags: qe-verify-
Whiteboard: [adv-main58-] → [adv-main58-][post-critsmash-triage]
Group: core-security-release
Group: core-security-release
Whiteboard: [adv-main58-][post-critsmash-triage] → [keep hidden while bug 1406996 is open][adv-main58-][post-critsmash-triage]
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: