Closed Bug 1405879 Opened 7 years ago Closed 7 years ago

stylo: poison the hashtable buffers

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla58

Tracking Flags:

Tracking

Status

firefox57

---

wontfix

firefox58

---

fixed

People

(Reporter: bholley, Assigned: bholley)

References

(Blocks 1 open bug)

Details

(Keywords: sec-audit, sec-want, Whiteboard: [keep hidden while bug 1406996 is open][adv-main58-][post-critsmash-triage])

Attachments

(3 files)

Poison hashtable buffers. v1 7 years ago Bobby Holley (:bholley) 1.29 KB, patch	manishearth : review+	Details \| Diff \| Splinter Review
bug1405879-valgrind-annotations.diff [DO NOT LAND] 7 years ago Julian Seward [:jseward] 2.33 KB, patch		Details \| Diff \| Splinter Review
A Valgrind complaint relating to RawVec deallocation 7 years ago Julian Seward [:jseward] 3.51 KB, text/plain		Details

Bobby Holley (:bholley)

Assignee

Description

•

7 years ago

Per investigation in bug 1385925, bug 1404791, and bug 1403845, I have a  reasonably strong suspicion that there's an uninitialized memory read in rust's std::HashMap.

HashMap stores the table as follows:

[h,h,h...][(k,v),(k,v),(k,v)...]

If a given hash is nonzero, it's assumed that the (k,v) at the same index is valid. As such, only the first portion of the table is zeroed on construction, and the rest is left uninitialized.

We're seeing a lot of crashes trying to access keys/values at addresses near 0x0 and 0xe5e5, which suggests that this invariant isn't holding. As both a diagnostic tool and a safety improvement, I want to poison the entire buffer, so that any uninitialized memory has a safe (and recognizable) pattern. This should hurt hashtable construction performance a little but, but given that we already poison memory on free, we're basically just making these allocations as expensive as the corresponding deallocations, which is probably acceptable.

Bobby Holley (:bholley)

Assignee

Comment 1

•

7 years ago

Attached patch Poison hashtable buffers. v1 — Details — Splinter Review

MozReview-Commit-ID: 8uLGtFv6X4P

Attachment #8915371 - Flags: review?(manishearth)

Manish Goregaokar [:manishearth]

Updated

•

7 years ago

Attachment #8915371 - Flags: review?(manishearth) → review+

Bobby Holley (:bholley)

Assignee

Comment 2

•

7 years ago

https://github.com/servo/servo/pull/18751

Eric Rahm [:erahm]

Comment 3

•

7 years ago

Just a note, we use 0xe4 for junk fill in mozjemalloc. It might make sense to align with that.

Bobby Holley (:bholley)

Assignee

Comment 4

•

7 years ago

(In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment #3)
> Just a note, we use 0xe4 for junk fill in mozjemalloc. It might make sense
> to align with that.

I specifically wanted to use something else that wouldn't come from elsewhere, which is why I went with 0xe7 (which, according to 1044077#c4, should have the same properties).

Julian Seward [:jseward]

Comment 5

•

7 years ago

(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #0)
> I have a reasonably strong suspicion that there's an uninitialized memory
> read in rust's std::HashMap.

I am (somewhat) surprised to hear that, given that I've been running
Stylo-ised Fx builds on Valgrind on a fairly regular basis these past few
months, and have never seen any such complaints.

Maybe it happens on a very obscure path?  Or is there some other allocator
in between hash table creation and the underlying malloc, that recycles
memory in such a way that it does not appear to be uninitialised?

If we had a way, in your patch, to call MOZ_MAKE_MEM_{UNDEFINED,NOACCESS}
(in mfbt/MemoryChecking.h), or the direct Valgrind equivalents thereof, then
we could verify this a bit more directly, at least for local testing.

In your investigations, have you seen anything that suggests these crashes
could be caused by a data race?

Daniel Veditz [:dveditz]

Updated

•

7 years ago

Keywords: sec-audit, sec-want

Bobby Holley (:bholley)

Assignee

Comment 6

•

7 years ago

https://hg.mozilla.org/integration/autoland/rev/2cd9b59aef83

(In reply to Julian Seward [:jseward] from comment #5)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #0)
> > I have a reasonably strong suspicion that there's an uninitialized memory
> > read in rust's std::HashMap.
> 
> I am (somewhat) surprised to hear that, given that I've been running
> Stylo-ised Fx builds on Valgrind on a fairly regular basis these past few
> months, and have never seen any such complaints.
> 
> Maybe it happens on a very obscure path?

Well, hashtables are probabilistic data structures. My hypothesis is that we're hitting an edge cases in the robin-hood algorithm, which would only arise in a very particular initial state of occupied/unoccupied buckets when the problematic insertion occurs. With that, it seems plausible that the uninitialized reads could be vanishingly rare.

See bug 1404791 comment 2 for an analysis of some of the effects we're seeing.

> Or is there some other allocator
> in between hash table creation and the underlying malloc, that recycles
> memory in such a way that it does not appear to be uninitialised?
> 
> If we had a way, in your patch, to call MOZ_MAKE_MEM_{UNDEFINED,NOACCESS}
> (in mfbt/MemoryChecking.h), or the direct Valgrind equivalents thereof, then
> we could verify this a bit more directly, at least for local testing.

Yeah. It's possible that the problem actually occurs more frequently, but that most of the time the contents of the associated (k,v) memory are simply leftovers from a previous (k,v), which would basically work fine for lookups (we'd compare the key, it wouldn't match, and we'd move on). Given that we're using Atoms as keys though, this would result in later (harder-to-detect) refcounting errors and prematures frees when the hashtable was dropped (since we'd free the same atom twice).

Anyway, instrumenting hashglobe in local builds could certainly be worthwhile.

> In your investigations, have you seen anything that suggests these crashes
> could be caused by a data race?

I think that is very unlikely. We have several hashmaps being used in entirely different contexts. A data race in either would be pretty unlikely, and in both even less so.

Randell Jesup [:jesup] (needinfo me)

Updated

•

7 years ago

status-firefox57: --- → affected

status-firefox58: --- → affected

Priority: -- → P1

Bobby Holley (:bholley)

Assignee

Updated

•

7 years ago

Status: NEW → RESOLVED

Closed: 7 years ago

Resolution: --- → FIXED

Bobby Holley (:bholley)

Assignee

Comment 7

•

7 years ago

Need to wait a day or two to verify on crash stats that the poisoning works.

Randell Jesup [:jesup] (needinfo me)

Comment 8

•

7 years ago

did this land?  Perhaps you forgot to mark it here; automatic marking doesn't work on sec bugs.

Flags: needinfo?(bobbyholley)

Bobby Holley (:bholley)

Assignee

Comment 9

•

7 years ago

(In reply to Randell Jesup [:jesup] from comment #8)
> did this land?  Perhaps you forgot to mark it here; automatic marking
> doesn't work on sec bugs.

It landed in https://hg.mozilla.org/integration/autoland/rev/2cd9b59aef83 per comment 6. For stuff that gets synced over by the VCS sync service, we tend to resolve fixed when it hits autoland, since we don't get automatic marking.

I will be watching the crash reports to see if this does the right thing, and flag for uplift if so.

Flags: needinfo?(bobbyholley)

Randell Jesup [:jesup] (needinfo me)

Comment 10

•

7 years ago

> It landed in https://hg.mozilla.org/integration/autoland/rev/2cd9b59aef83
> per comment 6. 

Aha... I missed that since it was just sitting at the head of large comment

Julian Seward [:jseward]

Comment 11

•

7 years ago

(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #6)
> Anyway, instrumenting hashglobe in local builds could certainly be worthwhile.

I instrumented components/hashglobe/src/table.rs using the attached patch,
which is derived from yours.  Note, the use of zutil.c is entirely arbitrary
-- I just needed a C source file in which to park the helper function.

Running startup, + Obama, + techcrunch.com, + the W3 single-page spec, I
found no evidence of any unitialised value uses arising from that
annotation.  The fprintf verified that it was actually getting used.  So I
have no evidence in support of your hypothesis, alas.

Julian Seward [:jseward]

Comment 12

•

7 years ago

Attached patch bug1405879-valgrind-annotations.diff [DO NOT LAND] — Details — Splinter Review

Julian Seward [:jseward]

Comment 13

•

7 years ago

Attached file A Valgrind complaint relating to RawVec deallocation — Details

(In reply to Julian Seward [:jseward] from comment #11)
> I have no evidence in support of your hypothesis

I did however end up looking at the attached Valgrind report, which is
unrelated to your hypothesis.  It's something I've seen several times
before, but discounted as a false positive.  However, on looking at it
again, and with some help from Michael Woerister, I think my initial
judgement was wrong.

It appears to pertain to RawVec deallocation.  Do you think it could be
relevant?

Julian Seward [:jseward]

Updated

•

7 years ago

Flags: needinfo?(bobbyholley)

Ryan VanderMeulen [:RyanVM]

Updated

•

7 years ago

status-firefox58: affected → fixed

Target Milestone: --- → mozilla58

Daniel Veditz [:dveditz]

Updated

•

7 years ago

Group: core-security → core-security-release

Bobby Holley (:bholley)

Assignee

Comment 14

•

7 years ago

(In reply to Julian Seward [:jseward] from comment #13)
> Created attachment 8915978 [details]
> A Valgrind complaint relating to RawVec deallocation
> 
> (In reply to Julian Seward [:jseward] from comment #11)
> > I have no evidence in support of your hypothesis
> 
> I did however end up looking at the attached Valgrind report, which is
> unrelated to your hypothesis.  It's something I've seen several times
> before, but discounted as a false positive.  However, on looking at it
> again, and with some help from Michael Woerister, I think my initial
> judgement was wrong.
> 
> It appears to pertain to RawVec deallocation.  Do you think it could be
> relevant?

At first glance, this appears to be related to the OS string we create in the one of the env::var calls in global_style_data.rs. It doesn't really seem related to hashmap, aside from the fact that they go through the same allocator. If you have reason to believe there's something fishy there it may be worth investigating, but I don't see an on-face reason why it would be related to the hashmap corruption.

Flags: needinfo?(bobbyholley)

Marcia Knous [:marcia]

Comment 15

•

7 years ago

I am seeing 2 crashes/1 install in [@ hashglobe::diagnostic::DiagnosticHashMap<T>::report_corruption<T>] which seem to point back to this landing.

Marcia Knous [:marcia]

Comment 16

•

7 years ago

(In reply to Marcia Knous [:marcia - use ni] from comment #15)
> I am seeing 2 crashes/1 install in [@
> hashglobe::diagnostic::DiagnosticHashMap<T>::report_corruption<T>] which
> seem to point back to this landing.

Sorry, I neglected to include the link to the crash report: https://crash-stats.mozilla.com/report/index/8d69088a-a876-4b40-beb4-979f40171009

Calixte Denizet (:calixte)

Updated

•

7 years ago

Depends on: 1406955

Bobby Holley (:bholley)

Assignee

Updated

•

7 years ago

Blocks: stylo-hashmap-crashes

Ryan VanderMeulen [:RyanVM]

Comment 17

•

7 years ago

Is this 57:wontfix?

Flags: needinfo?(manishearth)

Manish Goregaokar [:manishearth]

Comment 18

•

7 years ago

Yes; this is pure diagnostic code that should not make it into a release.

Flags: needinfo?(manishearth)

Ryan VanderMeulen [:RyanVM]

Updated

•

7 years ago

status-firefox57: affected → wontfix

Randell Jesup [:jesup] (needinfo me)

Comment 19

•

7 years ago

(In reply to Manish Goregaokar [:manishearth] from comment #18)
> Yes; this is pure diagnostic code that should not make it into a release.

Ok, so how are we making sure it doesn't get to release?  It's not inside whatever-Rust-thing is needed to be nightly-only or nightly-and-early-beta/etc....

Flags: needinfo?(manishearth)

Emilio Cobos Álvarez (:emilio)

Comment 20

•

7 years ago

(In reply to Randell Jesup [:jesup] from comment #19)
> (In reply to Manish Goregaokar [:manishearth] from comment #18)
> > Yes; this is pure diagnostic code that should not make it into a release.
> 
> Ok, so how are we making sure it doesn't get to release?  It's not inside
> whatever-Rust-thing is needed to be nightly-only or
> nightly-and-early-beta/etc....

What we've done before is just backing out the diagnostics once we got enough info.

Manish Goregaokar [:manishearth]

Comment 21

•

7 years ago

> Ok, so how are we making sure it doesn't get to release?

What emilio said, basically. We'll hopefully be backing these out next week.

Flags: needinfo?(manishearth)

Al Billings [:abillings - ex-MoCo]

Updated

•

7 years ago

Whiteboard: [adv-main57-]

Al Billings [:abillings - ex-MoCo]

Updated

•

7 years ago

Whiteboard: [adv-main57-] → [adv-main58-]

Matt Wobensmith [:mwobensmith][:matt:]

Updated

•

7 years ago

Flags: qe-verify-

Whiteboard: [adv-main58-] → [adv-main58-][post-critsmash-triage]

Daniel Veditz [:dveditz]

Updated

•

6 years ago

Group: core-security-release

Daniel Veditz [:dveditz]

Updated

•

6 years ago

Group: core-security-release

Whiteboard: [adv-main58-][post-critsmash-triage] → [keep hidden while bug 1406996 is open][adv-main58-][post-critsmash-triage]

Daniel Veditz [:dveditz]

Updated

•

4 years ago

Group: core-security-release

You need to log in before you can comment on or make changes to this bug.