Closed Bug 1607762 Opened 4 years ago Closed 4 years ago

ThreadSanitizer: data race [@ operator--] vs. [@ operator--] through [@ nsHtml5OwningUTF16Buffer::Release]

Tracking

()

Status:

RESOLVED FIXED

Milestone:

83 Branch

Tracking Flags:

Tracking

Status

firefox-esr78

83+

fixed

firefox73

---

wontfix

firefox81

---

wontfix

firefox82

---

wontfix

firefox83

fixed

People

(Reporter: decoder, Assigned: hsivonen)

References

(Blocks 1 open bug)

Details

(Keywords: csectype-race, sec-moderate, Whiteboard: [post-critsmash-triage][adv-main83+r][adv-esr78.5+r])

Attachments

(3 files)

Detailed Crash Information 4 years ago Christian Holler (:decoder) 44.45 KB, text/plain		Details
Bug 1607762 - Make the refcount of nsHtml5OwningUTF16Buffer atomic. 4 years ago Henri Sivonen (:hsivonen) 47 bytes, text/x-phabricator-request	jcristau : approval-mozilla-beta- RyanVM : approval-mozilla-esr78+	Details \| Review
Bug 1607762 - Remove supression for seemingly fixed issue. r?decoder 4 years ago Aria Beingessner [:Gankra] 47 bytes, text/x-phabricator-request		Details \| Review

Christian Holler (:decoder)

Reporter

Description

•

4 years ago

The attached crash information was detected while running CI tests with ThreadSanitizer on mozilla-central revision 12fb5e522dd3.

It looks like we are giving up a reference to a nsHtml5OwningUTF16Buffer from to threads at the same time. Since this is a non-threadsafe RefPtr this can lead to various side-effects. For this particular race, where both threads are decreasing the non-atomic reference count, one of the updates might be trashed, so we end up with the reference count only being decremented once and have a memory leak. However, If this RefPtr is touched from two different threads at the same time, then there is likely some underlying ownership problem and if we increased the reference count on one thread while releasing on the other, it would cause a use-after-free.

I'm filing this as a security bug as a start until this has been fully investigated.

General information about TSan reports

Why fix races?

Data races are undefined behavior and can cause crashes as well as correctness issues. Compiler optimizations can cause racy code to have unpredictable and hard-to-reproduce behavior.

Rating

If you think this race can cause crashes or correctness issues, it would be great to rate the bug appropriately as P1/P2 and/or indicating this in the bug. This makes it a lot easier for us to assess the actual impact that these reports make and if they are helpful to you.

False Positives / Benign Races

Typically, races reported by TSan are not false positives [1], but it is possible that the race is benign. Even in this case it would be nice to come up with a fix if it is easily doable and does not regress performance. Every race that we cannot fix will have to remain on the suppression list and slows down the overall TSan performance. Also note that seemingly benign races can possibly be harmful (also depending on the compiler, optimizations and the architecture) [2][3].

[1] One major exception is the involvement of uninstrumented code from third-party libraries.
[2] http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
[3] How to miscompile programs with "benign" data races: https://www.usenix.org/legacy/events/hotpar11/tech/final_files/Boehm.pdf

Suppressing unfixable races

If the bug cannot be fixed, then a runtime suppression needs to be added in mozglue/build/TsanOptions.cpp. The suppressions match on the full stack, so it should be picked such that it is unique to this particular race. The bug number of this bug should also be included so we have some documentation on why this suppression was added.

Christian Holler (:decoder)

Reporter

Comment 1

•

4 years ago

Attached file Detailed Crash Information — Details

Ryan VanderMeulen [:RyanVM]

Updated

•

4 years ago

Group: core-security → dom-core-security

Andrew McCreight [:mccr8]

Updated

•

4 years ago

Keywords: csectype-race

Nathan Froyd [:froydnj]

Comment 2

•

4 years ago

It's not obvious, but the mutexes held in both cases (different mutexes, to be clear!) will provide memory barriers to ensure the writes are visible to all threads. So I guess the question is whether those pieces of code, with different mutexes locked, could ever run concurrently and therefore race, or whether there's some higher-level synchronization in the HTML parser that ensures concurrent execution won't happen.

Christian Holler (:decoder)

Reporter

Comment 3

•

4 years ago

(In reply to Nathan Froyd [:froydnj] from comment #2)

It's not obvious, but the mutexes held in both cases (different mutexes, to be clear!) will provide memory barriers to ensure the writes are visible to all threads.

They are visible, but not atomic, right? One write can still overwrite the other.

So I guess the question is whether those pieces of code, with different mutexes locked, could ever run concurrently and therefore race, or whether there's some higher-level synchronization in the HTML parser that ensures concurrent execution won't happen.

According to TSan, they are running concurrently with no synchronization. The only synchronization that TSan does not recognize for this purpose is memory fences or synchronization primitives in uninstrumented code. TSan also checks that the two mutexes have no higher-level synchronization in common.

Nathan Froyd [:froydnj]

Comment 4

•

4 years ago

(In reply to Christian Holler (:decoder) from comment #3)

(In reply to Nathan Froyd [:froydnj] from comment #2)

It's not obvious, but the mutexes held in both cases (different mutexes, to be clear!) will provide memory barriers to ensure the writes are visible to all threads.

They are visible, but not atomic, right? One write can still overwrite the other.

Well, yes, but "One write can still overwrite the other" is still true even if the accesses were atomic. It's not clear to me what that claim proves.

So I guess the question is whether those pieces of code, with different mutexes locked, could ever run concurrently and therefore race, or whether there's some higher-level synchronization in the HTML parser that ensures concurrent execution won't happen.

According to TSan, they are running concurrently with no synchronization. The only synchronization that TSan does not recognize for this purpose is memory fences or synchronization primitives in uninstrumented code. TSan also checks that the two mutexes have no higher-level synchronization in common.

I don't think that's what TSan claims. TSan just checks whether there's synchronization between threads for accesses to the same memory location; it doesn't say anything about whether it's possible for the accesses to run truly concurrently (in many cases, it is possible, which is what makes the memory accesses racy, but it doesn't have to be racy).

"TSan also checks that the two mutexes have no higher-level synchronization in common." sounds akin to solving the halting problem. :)

Christian Holler (:decoder)

Reporter

Comment 5

•

4 years ago

(In reply to Nathan Froyd [:froydnj] from comment #4)

Well, yes, but "One write can still overwrite the other" is still true even if the accesses were atomic. It's not clear to me what that claim proves.

What I meant to say is that the decrement here is not atomic, hence you can have Load, Load, Decrement, Decrement, Store, Store, canceling out one decrement. If this was an atomic, the decrement could be done atomically, negating this problem.

"TSan also checks that the two mutexes have no higher-level synchronization in common." sounds akin to solving the halting problem. :)

I know that the algorithm takes mutexes into account and even detects cycles in mutexes. I don't see how this is related to the halting problem.

Daniel Veditz [:dveditz]

Updated

•

4 years ago

Keywords: sec-moderate

Henri Sivonen (:hsivonen)

Assignee

Comment 6

•

4 years ago

mTokenizerMutex protects the current speculation. mSpeculationMutex protects the older speculations. There's a linked list of buffers. In addition to a buffer being able to point to the next buffer on the list, the parser and speculations can point into the list. Hence, RefPtr instead of UniquePtr.

So indeed, if the previous-to-current speculation is released, it decrements the refcount of the buffer it points to, which chains via the linked list to the buffer the current speculation points to. ...Without synchronization.

Now, I think that at that point there isn't anything that could actually increase or decrease the refcount from the parser thread, which explains why this has survived for over a decade without things having been totally awful. We should probably still do something about this.

I need to think a bit how bad it would be to get rid of mSpeculationMutex and hold mTokenizerMutex when mSpeculationMutex is currently held. Alternatively: Whether having an atomic refcount would solve this.

Henri Sivonen (:hsivonen)

Assignee

Comment 7

•

4 years ago

(In reply to Henri Sivonen (:hsivonen) from comment #6)

I need to think a bit how bad it would be to get rid of mSpeculationMutex and hold mTokenizerMutex when mSpeculationMutex is currently held. Alternatively: Whether having an atomic refcount would solve this.

At least the third option, acquiring mTokenizerMutex at https://searchfox.org/mozilla-central/rev/e076e40ab1290f4e5e67ebd21dc8af753fc05be6/parser/html/nsHtml5StreamParser.cpp#1821 is a bad idea, since it would deadlock.

The whole purpose of mSpeculationMutex is that this section: https://searchfox.org/mozilla-central/rev/e076e40ab1290f4e5e67ebd21dc8af753fc05be6/parser/html/nsHtml5StreamParser.cpp#1790-1830 can run concurrently with the parser thread. That section doesn't do much work, and we always acquire mTokenizerMutex anyway after that section, but that section makes an effort to get the parser thread to release mTokenizerMutex sooner, so that makes sense.

I don't know how much that actually matters for perf, but it seems like an arrangement we should want to keep. Since the refcount increments/decrements on nsHtml5OwningUTF16Buffer are infrequent, I'm guessing that getting rid of the detected condition here would be perf-wise rather harmless.

So I suggest the way forward here is making the refcount of nsHtml5OwningUTF16Buffer atomic. If we want to micro-optimize it, I guess we could keep the current nsHtml5OwningUTF16Buffer for document.write() and introduce an nsHtml5AtomicOwningUTF16Buffer only for content that comes from network.

Henri Sivonen (:hsivonen)

Assignee

Comment 8

•

4 years ago

Hah. No. I take that back.

Since on the main thread every time we release mSpeculationMutex, we always acquire mTokenizerMutex after, let's set a flag needToDequeueOneSpeculation while holding mSpeculationMutex and then remove the item while holding mTokenizerMutex if the flag was set. No need for new locks or atomics.

Henri Sivonen (:hsivonen)

Assignee

Comment 9

•

4 years ago

Except I misread the code, and we return early precisely not to have to acquire mTokenizerMutex. Back to atomics.

BugBot [:suhaib / :marco/ :calixte]

Comment 10

•

4 years ago

The priority flag is not set for this bug.
:hsinyi, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(htsai)

Hsin-Yi Tsai (she/her) [:hsinyi]

Updated

•

4 years ago

Flags: needinfo?(htsai)

Priority: -- → P2

Andrew McCreight [:mccr8]

Comment 11

•

4 years ago

I wonder if this could be related to the leaks in bug 1400567.

Jens Stutte [:jstutte]

Comment 12

•

4 years ago

Hi Henri, as you seemed to get far with the analysis: are you going to work on this? Thanks!

Flags: needinfo?(hsivonen)

Henri Sivonen (:hsivonen)

Assignee

Comment 13

•

4 years ago

Attached file Bug 1607762 - Make the refcount of nsHtml5OwningUTF16Buffer atomic. — Details

Phabricator Automation

Updated

•

4 years ago

Assignee: nobody → hsivonen

Status: NEW → ASSIGNED

Henri Sivonen (:hsivonen)

Assignee

Comment 14

•

4 years ago

Thanks for the reminder. I posted the simple fix without the micro-optimization.

Flags: needinfo?(hsivonen)

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 15

•

4 years ago

https://hg.mozilla.org/integration/autoland/rev/6f476d99cefd36ad91f08b5cd071583aaabd5bbb
https://hg.mozilla.org/mozilla-central/rev/6f476d99cefd

Group: dom-core-security → core-security-release

Status: ASSIGNED → RESOLVED

Closed: 4 years ago

status-firefox83: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → 83 Branch

Henri Sivonen (:hsivonen)

Assignee

Comment 16

•

4 years ago

Comment on attachment 9180420 [details]
Bug 1607762 - Make the refcount of nsHtml5OwningUTF16Buffer atomic.

Beta/Release Uplift Approval Request

User impact if declined: Unclear. We might have memory-safety problems in the wild arising from the refcount of an object becoming wrong.
Is this code covered by automated tests?: No
Has the fix been verified in Nightly?: No
Needs manual test from QE?: No
If yes, steps to reproduce: As a very specific race condition, this bug isn't testable.
List of other uplifts needed: None
Risk to taking this patch: Low
Why is the change risky/not risky? (and alternatives if risky): This patch changes a refcount from non-atomic to atomic, which can make performance a tiny, tiny bit worse but should never make correctness worse.
String changes made/needed: None

ESR Uplift Approval Request

If this is not a sec:{high,crit} bug, please state case for ESR consideration: Resolves a memory management-related race condition, which might be causing memory-safety issues that we are currently unable to attribute to this bug. Extremely low-risk fix.
User impact if declined: Unclear. We might have memory-safety problems in the wild arising from the refcount of an object becoming wrong.
Fix Landed on Version:
Risk to taking this patch: Low
Why is the change risky/not risky? (and alternatives if risky): This patch changes a refcount from non-atomic to atomic, which can make performance a tiny, tiny bit worse but should never make correctness worse.
String or UUID changes made by this patch: None

Attachment #9180420 - Flags: approval-mozilla-esr78?

Attachment #9180420 - Flags: approval-mozilla-beta?

Julien Cristau [:jcristau] (back April 22)

Comment 17

•

4 years ago

Too late for 82.

status-firefox73: affected → wontfix

status-firefox81: --- → wontfix

status-firefox82: --- → wontfix

status-firefox-esr78: --- → affected

Julien Cristau [:jcristau] (back April 22)

Updated

•

4 years ago

Attachment #9180420 - Flags: approval-mozilla-beta? → approval-mozilla-beta-

Brindusa Tot, Desktop QA

Updated

•

4 years ago

Flags: qe-verify-

Whiteboard: [post-critsmash-triage]

Ryan VanderMeulen [:RyanVM]

Comment 18

•

4 years ago

Comment on attachment 9180420 [details]
Bug 1607762 - Make the refcount of nsHtml5OwningUTF16Buffer atomic.

Approved for 78.5esr.

Attachment #9180420 - Flags: approval-mozilla-esr78? → approval-mozilla-esr78+

Ryan VanderMeulen [:RyanVM]

Comment 19

•

4 years ago

uplift

https://hg.mozilla.org/releases/mozilla-esr78/rev/5aa243a2fe9d

status-firefox-esr78: affected → fixed

tracking-firefox83: --- → +

tracking-firefox-esr78: --- → 83+

Aria Beingessner [:Gankra]

Comment 20

•

4 years ago

Attached file Bug 1607762 - Remove supression for seemingly fixed issue. r?decoder — Details

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 21

•

4 years ago

Remove supression for seemingly fixed issue. r=decoder
https://hg.mozilla.org/mozilla-central/rev/66e5e37d6f33

Tom Ritter [:tjr] (OOTO until 4/30 at least)

Updated

•

4 years ago

Whiteboard: [post-critsmash-triage] → [post-critsmash-triage][adv-main83+r]

Tom Ritter [:tjr] (OOTO until 4/30 at least)

Updated

•

4 years ago

Whiteboard: [post-critsmash-triage][adv-main83+r] → [post-critsmash-triage][adv-main83+r][adv-esr78.5+r]

Daniel Veditz [:dveditz]

Updated

•

3 years ago

Group: core-security-release

You need to log in before you can comment on or make changes to this bug.