Closed Bug 1607762 Opened 2 years ago Closed 1 year ago

ThreadSanitizer: data race [@ operator--] vs. [@ operator--] through [@ nsHtml5OwningUTF16Buffer::Release]

Categories

(Core :: DOM: HTML Parser, defect, P2)

x86_64
Linux
defect

Tracking

()

RESOLVED FIXED
83 Branch
Tracking Status
firefox-esr78 83+ fixed
firefox73 --- wontfix
firefox81 --- wontfix
firefox82 --- wontfix
firefox83 + fixed

People

(Reporter: decoder, Assigned: hsivonen)

References

(Blocks 1 open bug)

Details

(Keywords: csectype-race, sec-moderate, Whiteboard: [post-critsmash-triage][adv-main83+r][adv-esr78.5+r])

Attachments

(3 files)

The attached crash information was detected while running CI tests with ThreadSanitizer on mozilla-central revision 12fb5e522dd3.

It looks like we are giving up a reference to a nsHtml5OwningUTF16Buffer from to threads at the same time. Since this is a non-threadsafe RefPtr this can lead to various side-effects. For this particular race, where both threads are decreasing the non-atomic reference count, one of the updates might be trashed, so we end up with the reference count only being decremented once and have a memory leak. However, If this RefPtr is touched from two different threads at the same time, then there is likely some underlying ownership problem and if we increased the reference count on one thread while releasing on the other, it would cause a use-after-free.

I'm filing this as a security bug as a start until this has been fully investigated.

General information about TSan reports

Why fix races?

Data races are undefined behavior and can cause crashes as well as correctness issues. Compiler optimizations can cause racy code to have unpredictable and hard-to-reproduce behavior.

Rating

If you think this race can cause crashes or correctness issues, it would be great to rate the bug appropriately as P1/P2 and/or indicating this in the bug. This makes it a lot easier for us to assess the actual impact that these reports make and if they are helpful to you.

False Positives / Benign Races

Typically, races reported by TSan are not false positives [1], but it is possible that the race is benign. Even in this case it would be nice to come up with a fix if it is easily doable and does not regress performance. Every race that we cannot fix will have to remain on the suppression list and slows down the overall TSan performance. Also note that seemingly benign races can possibly be harmful (also depending on the compiler, optimizations and the architecture) [2][3].

[1] One major exception is the involvement of uninstrumented code from third-party libraries.
[2] http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
[3] How to miscompile programs with "benign" data races: https://www.usenix.org/legacy/events/hotpar11/tech/final_files/Boehm.pdf

Suppressing unfixable races

If the bug cannot be fixed, then a runtime suppression needs to be added in mozglue/build/TsanOptions.cpp. The suppressions match on the full stack, so it should be picked such that it is unique to this particular race. The bug number of this bug should also be included so we have some documentation on why this suppression was added.

Group: core-security → dom-core-security

It's not obvious, but the mutexes held in both cases (different mutexes, to be clear!) will provide memory barriers to ensure the writes are visible to all threads. So I guess the question is whether those pieces of code, with different mutexes locked, could ever run concurrently and therefore race, or whether there's some higher-level synchronization in the HTML parser that ensures concurrent execution won't happen.

(In reply to Nathan Froyd [:froydnj] from comment #2)

It's not obvious, but the mutexes held in both cases (different mutexes, to be clear!) will provide memory barriers to ensure the writes are visible to all threads.

They are visible, but not atomic, right? One write can still overwrite the other.

So I guess the question is whether those pieces of code, with different mutexes locked, could ever run concurrently and therefore race, or whether there's some higher-level synchronization in the HTML parser that ensures concurrent execution won't happen.

According to TSan, they are running concurrently with no synchronization. The only synchronization that TSan does not recognize for this purpose is memory fences or synchronization primitives in uninstrumented code. TSan also checks that the two mutexes have no higher-level synchronization in common.

(In reply to Christian Holler (:decoder) from comment #3)

(In reply to Nathan Froyd [:froydnj] from comment #2)

It's not obvious, but the mutexes held in both cases (different mutexes, to be clear!) will provide memory barriers to ensure the writes are visible to all threads.

They are visible, but not atomic, right? One write can still overwrite the other.

Well, yes, but "One write can still overwrite the other" is still true even if the accesses were atomic. It's not clear to me what that claim proves.

So I guess the question is whether those pieces of code, with different mutexes locked, could ever run concurrently and therefore race, or whether there's some higher-level synchronization in the HTML parser that ensures concurrent execution won't happen.

According to TSan, they are running concurrently with no synchronization. The only synchronization that TSan does not recognize for this purpose is memory fences or synchronization primitives in uninstrumented code. TSan also checks that the two mutexes have no higher-level synchronization in common.

I don't think that's what TSan claims. TSan just checks whether there's synchronization between threads for accesses to the same memory location; it doesn't say anything about whether it's possible for the accesses to run truly concurrently (in many cases, it is possible, which is what makes the memory accesses racy, but it doesn't have to be racy).

"TSan also checks that the two mutexes have no higher-level synchronization in common." sounds akin to solving the halting problem. :)

(In reply to Nathan Froyd [:froydnj] from comment #4)

Well, yes, but "One write can still overwrite the other" is still true even if the accesses were atomic. It's not clear to me what that claim proves.

What I meant to say is that the decrement here is not atomic, hence you can have Load, Load, Decrement, Decrement, Store, Store, canceling out one decrement. If this was an atomic, the decrement could be done atomically, negating this problem.

"TSan also checks that the two mutexes have no higher-level synchronization in common." sounds akin to solving the halting problem. :)

I know that the algorithm takes mutexes into account and even detects cycles in mutexes. I don't see how this is related to the halting problem.

mTokenizerMutex protects the current speculation. mSpeculationMutex protects the older speculations. There's a linked list of buffers. In addition to a buffer being able to point to the next buffer on the list, the parser and speculations can point into the list. Hence, RefPtr instead of UniquePtr.

So indeed, if the previous-to-current speculation is released, it decrements the refcount of the buffer it points to, which chains via the linked list to the buffer the current speculation points to. ...Without synchronization.

Now, I think that at that point there isn't anything that could actually increase or decrease the refcount from the parser thread, which explains why this has survived for over a decade without things having been totally awful. We should probably still do something about this.

I need to think a bit how bad it would be to get rid of mSpeculationMutex and hold mTokenizerMutex when mSpeculationMutex is currently held. Alternatively: Whether having an atomic refcount would solve this.

(In reply to Henri Sivonen (:hsivonen) from comment #6)

I need to think a bit how bad it would be to get rid of mSpeculationMutex and hold mTokenizerMutex when mSpeculationMutex is currently held. Alternatively: Whether having an atomic refcount would solve this.

At least the third option, acquiring mTokenizerMutex at https://searchfox.org/mozilla-central/rev/e076e40ab1290f4e5e67ebd21dc8af753fc05be6/parser/html/nsHtml5StreamParser.cpp#1821 is a bad idea, since it would deadlock.

The whole purpose of mSpeculationMutex is that this section: https://searchfox.org/mozilla-central/rev/e076e40ab1290f4e5e67ebd21dc8af753fc05be6/parser/html/nsHtml5StreamParser.cpp#1790-1830 can run concurrently with the parser thread. That section doesn't do much work, and we always acquire mTokenizerMutex anyway after that section, but that section makes an effort to get the parser thread to release mTokenizerMutex sooner, so that makes sense.

I don't know how much that actually matters for perf, but it seems like an arrangement we should want to keep. Since the refcount increments/decrements on nsHtml5OwningUTF16Buffer are infrequent, I'm guessing that getting rid of the detected condition here would be perf-wise rather harmless.

So I suggest the way forward here is making the refcount of nsHtml5OwningUTF16Buffer atomic. If we want to micro-optimize it, I guess we could keep the current nsHtml5OwningUTF16Buffer for document.write() and introduce an nsHtml5AtomicOwningUTF16Buffer only for content that comes from network.

Hah. No. I take that back.

Since on the main thread every time we release mSpeculationMutex, we always acquire mTokenizerMutex after, let's set a flag needToDequeueOneSpeculation while holding mSpeculationMutex and then remove the item while holding mTokenizerMutex if the flag was set. No need for new locks or atomics.

Except I misread the code, and we return early precisely not to have to acquire mTokenizerMutex. Back to atomics.

The priority flag is not set for this bug.
:hsinyi, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(htsai)
Flags: needinfo?(htsai)
Priority: -- → P2

I wonder if this could be related to the leaks in bug 1400567.

Hi Henri, as you seemed to get far with the analysis: are you going to work on this? Thanks!

Flags: needinfo?(hsivonen)
Assignee: nobody → hsivonen
Status: NEW → ASSIGNED

Thanks for the reminder. I posted the simple fix without the micro-optimization.

Flags: needinfo?(hsivonen)
Group: dom-core-security → core-security-release
Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 83 Branch

Comment on attachment 9180420 [details]
Bug 1607762 - Make the refcount of nsHtml5OwningUTF16Buffer atomic.

Beta/Release Uplift Approval Request

  • User impact if declined: Unclear. We might have memory-safety problems in the wild arising from the refcount of an object becoming wrong.
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce: As a very specific race condition, this bug isn't testable.
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): This patch changes a refcount from non-atomic to atomic, which can make performance a tiny, tiny bit worse but should never make correctness worse.
  • String changes made/needed: None

ESR Uplift Approval Request

  • If this is not a sec:{high,crit} bug, please state case for ESR consideration: Resolves a memory management-related race condition, which might be causing memory-safety issues that we are currently unable to attribute to this bug. Extremely low-risk fix.
  • User impact if declined: Unclear. We might have memory-safety problems in the wild arising from the refcount of an object becoming wrong.
  • Fix Landed on Version:
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): This patch changes a refcount from non-atomic to atomic, which can make performance a tiny, tiny bit worse but should never make correctness worse.
  • String or UUID changes made by this patch: None
Attachment #9180420 - Flags: approval-mozilla-esr78?
Attachment #9180420 - Flags: approval-mozilla-beta?
Attachment #9180420 - Flags: approval-mozilla-beta? → approval-mozilla-beta-
Flags: qe-verify-
Whiteboard: [post-critsmash-triage]

Comment on attachment 9180420 [details]
Bug 1607762 - Make the refcount of nsHtml5OwningUTF16Buffer atomic.

Approved for 78.5esr.

Attachment #9180420 - Flags: approval-mozilla-esr78? → approval-mozilla-esr78+
Whiteboard: [post-critsmash-triage] → [post-critsmash-triage][adv-main83+r]
Whiteboard: [post-critsmash-triage][adv-main83+r] → [post-critsmash-triage][adv-main83+r][adv-esr78.5+r]
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.