ThreadSanitizer: data race [@ operator--] vs. [@ operator--] through [@ nsHtml5OwningUTF16Buffer::Release]
Categories
(Core :: DOM: HTML Parser, defect, P2)
Tracking
()
People
(Reporter: decoder, Assigned: hsivonen)
References
(Blocks 1 open bug)
Details
(Keywords: csectype-race, sec-moderate, Whiteboard: [post-critsmash-triage][adv-main83+r][adv-esr78.5+r])
Attachments
(3 files)
44.45 KB,
text/plain
|
Details | |
47 bytes,
text/x-phabricator-request
|
jcristau
:
approval-mozilla-beta-
RyanVM
:
approval-mozilla-esr78+
|
Details | Review |
47 bytes,
text/x-phabricator-request
|
Details | Review |
The attached crash information was detected while running CI tests with ThreadSanitizer on mozilla-central revision 12fb5e522dd3.
It looks like we are giving up a reference to a nsHtml5OwningUTF16Buffer
from to threads at the same time. Since this is a non-threadsafe RefPtr
this can lead to various side-effects. For this particular race, where both threads are decreasing the non-atomic reference count, one of the updates might be trashed, so we end up with the reference count only being decremented once and have a memory leak. However, If this RefPtr
is touched from two different threads at the same time, then there is likely some underlying ownership problem and if we increased the reference count on one thread while releasing on the other, it would cause a use-after-free.
I'm filing this as a security bug as a start until this has been fully investigated.
General information about TSan reports
Why fix races?
Data races are undefined behavior and can cause crashes as well as correctness issues. Compiler optimizations can cause racy code to have unpredictable and hard-to-reproduce behavior.
Rating
If you think this race can cause crashes or correctness issues, it would be great to rate the bug appropriately as P1/P2 and/or indicating this in the bug. This makes it a lot easier for us to assess the actual impact that these reports make and if they are helpful to you.
False Positives / Benign Races
Typically, races reported by TSan are not false positives [1], but it is possible that the race is benign. Even in this case it would be nice to come up with a fix if it is easily doable and does not regress performance. Every race that we cannot fix will have to remain on the suppression list and slows down the overall TSan performance. Also note that seemingly benign races can possibly be harmful (also depending on the compiler, optimizations and the architecture) [2][3].
[1] One major exception is the involvement of uninstrumented code from third-party libraries.
[2] http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
[3] How to miscompile programs with "benign" data races: https://www.usenix.org/legacy/events/hotpar11/tech/final_files/Boehm.pdf
Suppressing unfixable races
If the bug cannot be fixed, then a runtime suppression needs to be added in mozglue/build/TsanOptions.cpp
. The suppressions match on the full stack, so it should be picked such that it is unique to this particular race. The bug number of this bug should also be included so we have some documentation on why this suppression was added.
Reporter | ||
Comment 1•4 years ago
|
||
Updated•4 years ago
|
Updated•4 years ago
|
Comment 2•4 years ago
|
||
It's not obvious, but the mutexes held in both cases (different mutexes, to be clear!) will provide memory barriers to ensure the writes are visible to all threads. So I guess the question is whether those pieces of code, with different mutexes locked, could ever run concurrently and therefore race, or whether there's some higher-level synchronization in the HTML parser that ensures concurrent execution won't happen.
Reporter | ||
Comment 3•4 years ago
|
||
(In reply to Nathan Froyd [:froydnj] from comment #2)
It's not obvious, but the mutexes held in both cases (different mutexes, to be clear!) will provide memory barriers to ensure the writes are visible to all threads.
They are visible, but not atomic, right? One write can still overwrite the other.
So I guess the question is whether those pieces of code, with different mutexes locked, could ever run concurrently and therefore race, or whether there's some higher-level synchronization in the HTML parser that ensures concurrent execution won't happen.
According to TSan, they are running concurrently with no synchronization. The only synchronization that TSan does not recognize for this purpose is memory fences or synchronization primitives in uninstrumented code. TSan also checks that the two mutexes have no higher-level synchronization in common.
Comment 4•4 years ago
|
||
(In reply to Christian Holler (:decoder) from comment #3)
(In reply to Nathan Froyd [:froydnj] from comment #2)
It's not obvious, but the mutexes held in both cases (different mutexes, to be clear!) will provide memory barriers to ensure the writes are visible to all threads.
They are visible, but not atomic, right? One write can still overwrite the other.
Well, yes, but "One write can still overwrite the other" is still true even if the accesses were atomic. It's not clear to me what that claim proves.
So I guess the question is whether those pieces of code, with different mutexes locked, could ever run concurrently and therefore race, or whether there's some higher-level synchronization in the HTML parser that ensures concurrent execution won't happen.
According to TSan, they are running concurrently with no synchronization. The only synchronization that TSan does not recognize for this purpose is memory fences or synchronization primitives in uninstrumented code. TSan also checks that the two mutexes have no higher-level synchronization in common.
I don't think that's what TSan claims. TSan just checks whether there's synchronization between threads for accesses to the same memory location; it doesn't say anything about whether it's possible for the accesses to run truly concurrently (in many cases, it is possible, which is what makes the memory accesses racy, but it doesn't have to be racy).
"TSan also checks that the two mutexes have no higher-level synchronization in common." sounds akin to solving the halting problem. :)
Reporter | ||
Comment 5•4 years ago
|
||
(In reply to Nathan Froyd [:froydnj] from comment #4)
Well, yes, but "One write can still overwrite the other" is still true even if the accesses were atomic. It's not clear to me what that claim proves.
What I meant to say is that the decrement here is not atomic, hence you can have Load, Load, Decrement, Decrement, Store, Store, canceling out one decrement. If this was an atomic, the decrement could be done atomically, negating this problem.
"TSan also checks that the two mutexes have no higher-level synchronization in common." sounds akin to solving the halting problem. :)
I know that the algorithm takes mutexes into account and even detects cycles in mutexes. I don't see how this is related to the halting problem.
Updated•4 years ago
|
Assignee | ||
Comment 6•4 years ago
|
||
mTokenizerMutex
protects the current speculation. mSpeculationMutex
protects the older speculations. There's a linked list of buffers. In addition to a buffer being able to point to the next buffer on the list, the parser and speculations can point into the list. Hence, RefPtr
instead of UniquePtr
.
So indeed, if the previous-to-current speculation is released, it decrements the refcount of the buffer it points to, which chains via the linked list to the buffer the current speculation points to. ...Without synchronization.
Now, I think that at that point there isn't anything that could actually increase or decrease the refcount from the parser thread, which explains why this has survived for over a decade without things having been totally awful. We should probably still do something about this.
I need to think a bit how bad it would be to get rid of mSpeculationMutex
and hold mTokenizerMutex
when mSpeculationMutex
is currently held. Alternatively: Whether having an atomic refcount would solve this.
Assignee | ||
Comment 7•4 years ago
|
||
(In reply to Henri Sivonen (:hsivonen) from comment #6)
I need to think a bit how bad it would be to get rid of
mSpeculationMutex
and holdmTokenizerMutex
whenmSpeculationMutex
is currently held. Alternatively: Whether having an atomic refcount would solve this.
At least the third option, acquiring mTokenizerMutex
at https://searchfox.org/mozilla-central/rev/e076e40ab1290f4e5e67ebd21dc8af753fc05be6/parser/html/nsHtml5StreamParser.cpp#1821 is a bad idea, since it would deadlock.
The whole purpose of mSpeculationMutex
is that this section: https://searchfox.org/mozilla-central/rev/e076e40ab1290f4e5e67ebd21dc8af753fc05be6/parser/html/nsHtml5StreamParser.cpp#1790-1830 can run concurrently with the parser thread. That section doesn't do much work, and we always acquire mTokenizerMutex
anyway after that section, but that section makes an effort to get the parser thread to release mTokenizerMutex
sooner, so that makes sense.
I don't know how much that actually matters for perf, but it seems like an arrangement we should want to keep. Since the refcount increments/decrements on nsHtml5OwningUTF16Buffer
are infrequent, I'm guessing that getting rid of the detected condition here would be perf-wise rather harmless.
So I suggest the way forward here is making the refcount of nsHtml5OwningUTF16Buffer
atomic. If we want to micro-optimize it, I guess we could keep the current nsHtml5OwningUTF16Buffer
for document.write()
and introduce an nsHtml5AtomicOwningUTF16Buffer
only for content that comes from network.
Assignee | ||
Comment 8•4 years ago
|
||
Hah. No. I take that back.
Since on the main thread every time we release mSpeculationMutex
, we always acquire mTokenizerMutex
after, let's set a flag needToDequeueOneSpeculation
while holding mSpeculationMutex
and then remove the item while holding mTokenizerMutex
if the flag was set. No need for new locks or atomics.
Assignee | ||
Comment 9•4 years ago
|
||
Except I misread the code, and we return early precisely not to have to acquire mTokenizerMutex
. Back to atomics.
Comment 10•4 years ago
|
||
The priority flag is not set for this bug.
:hsinyi, could you have a look please?
For more information, please visit auto_nag documentation.
Updated•4 years ago
|
Comment 11•4 years ago
|
||
I wonder if this could be related to the leaks in bug 1400567.
Comment 12•4 years ago
|
||
Hi Henri, as you seemed to get far with the analysis: are you going to work on this? Thanks!
Assignee | ||
Comment 13•4 years ago
|
||
Updated•4 years ago
|
Assignee | ||
Comment 14•4 years ago
|
||
Thanks for the reminder. I posted the simple fix without the micro-optimization.
Comment 15•4 years ago
|
||
https://hg.mozilla.org/integration/autoland/rev/6f476d99cefd36ad91f08b5cd071583aaabd5bbb
https://hg.mozilla.org/mozilla-central/rev/6f476d99cefd
Assignee | ||
Comment 16•4 years ago
|
||
Comment on attachment 9180420 [details]
Bug 1607762 - Make the refcount of nsHtml5OwningUTF16Buffer atomic.
Beta/Release Uplift Approval Request
- User impact if declined: Unclear. We might have memory-safety problems in the wild arising from the refcount of an object becoming wrong.
- Is this code covered by automated tests?: No
- Has the fix been verified in Nightly?: No
- Needs manual test from QE?: No
- If yes, steps to reproduce: As a very specific race condition, this bug isn't testable.
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): This patch changes a refcount from non-atomic to atomic, which can make performance a tiny, tiny bit worse but should never make correctness worse.
- String changes made/needed: None
ESR Uplift Approval Request
- If this is not a sec:{high,crit} bug, please state case for ESR consideration: Resolves a memory management-related race condition, which might be causing memory-safety issues that we are currently unable to attribute to this bug. Extremely low-risk fix.
- User impact if declined: Unclear. We might have memory-safety problems in the wild arising from the refcount of an object becoming wrong.
- Fix Landed on Version:
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): This patch changes a refcount from non-atomic to atomic, which can make performance a tiny, tiny bit worse but should never make correctness worse.
- String or UUID changes made by this patch: None
Comment 17•4 years ago
|
||
Too late for 82.
Updated•4 years ago
|
Updated•4 years ago
|
Comment 18•4 years ago
|
||
Comment on attachment 9180420 [details]
Bug 1607762 - Make the refcount of nsHtml5OwningUTF16Buffer atomic.
Approved for 78.5esr.
Comment 19•4 years ago
|
||
uplift |
Comment 20•4 years ago
|
||
Comment 21•4 years ago
|
||
Remove supression for seemingly fixed issue. r=decoder
https://hg.mozilla.org/mozilla-central/rev/66e5e37d6f33
Updated•4 years ago
|
Updated•4 years ago
|
Updated•3 years ago
|
Description
•