ThreadSanitizer: data race [@ mozilla::dom::WorkerFetchResolver::NeedOnDataAvailable] vs. [@ Shutdown]
Categories
(Core :: DOM: Networking, defect, P2)
Tracking
()
People
(Reporter: tsmith, Assigned: edenchuang)
References
(Blocks 2 open bugs)
Details
(Keywords: csectype-race, sec-moderate, Whiteboard: [post-critsmash-triage][adv-main89+r])
Attachments
(2 files)
The attached crash information was detected by ThreadSanitizer while fuzzing with build mozilla-central 20210111-7643eaee62a3. The fuzzers have hit this issue multiple times but no reliable test case is available at this time.
General information about TSan reports
Why fix races?
Data races are undefined behavior and can cause crashes as well as correctness issues. Compiler optimizations can cause racy code to have unpredictable and hard-to-reproduce behavior.
Rating
If you think this race can cause crashes or correctness issues, it would be great to rate the bug appropriately as P1/P2 and/or indicating this in the bug. This makes it a lot easier for us to assess the actual impact that these reports make and if they are helpful to you.
False Positives / Benign Races
Typically, races reported by TSan are not false positives [1], but it is possible that the race is benign. Even in this case it would be nice to come up with a fix if it is easily doable and does not regress performance. Every race that we cannot fix will have to remain on the suppression list and slows down the overall TSan performance. Also note that seemingly benign races can possibly be harmful (also depending on the compiler, optimizations and the architecture) [2][3].
[1] One major exception is the involvement of uninstrumented code from third-party libraries.
[2] http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
[3] How to miscompile programs with "benign" data races: https://www.usenix.org/legacy/events/hotpar11/tech/final_files/Boehm.pdf
Suppressing unfixable races
If the bug cannot be fixed, then a runtime suppression needs to be added in mozglue/build/TsanOptions.cpp
. The suppressions match on the full stack, so it should be picked such that it is unique to this particular race. The bug number of this bug should also be included so we have some documentation on why this suppression was added.
Reporter | ||
Comment 1•4 years ago
|
||
Comment 2•4 years ago
|
||
So it seems like what's happening is that someone is shutting down the WorkerFetchResolver while the main thread is still using it, leading to a race on the value of mFetchObserver
. The main thread acquires mPromiseProxy->Lock()
, but the other thread only sort of does. The previous line of the shutdown code runs mPromiseProxy->CleanUp()
which does acquire the lock internally... but then Releases mPromiseProxy!
That suggests to me that the main thread is potentially use-after-freeing mPromiseProxy?
It seems like there's some key missing piece of synchronization to stop the main thread from using WorkerFetchResolver once shutdown starts, but I don't know this code well enough to suggest a solution.
Permalinks to problematic lines:
WorkerFetchResolver::NeedOnDataAvailable
vs
Updated•4 years ago
|
Comment 3•4 years ago
|
||
:baku, can you give some input on comment 2 and the related race, in particular how this is supposed to work normally? If you know someone who can work on this, please let us know :) Thanks!
Updated•4 years ago
|
Comment 4•4 years ago
|
||
The mFetchObserver
member is documented as being "Touched only on the worker thread" (https://searchfox.org/mozilla-central/rev/d7e344e956d9da2808ea33e1fe0f963ed10c142d/dom/fetch/Fetch.cpp#264), however it is explicitly read with the lock held from the main thread here: https://searchfox.org/mozilla-central/rev/d7e344e956d9da2808ea33e1fe0f963ed10c142d/dom/fetch/Fetch.cpp#875
A separate boolean which is actually managed by the lock, or an atomic boolean of some sort, should probably be used instead.
Updated•4 years ago
|
Comment 6•4 years ago
|
||
Hi Eden, could you find some time for this? Thank you!
Assignee | ||
Updated•4 years ago
|
Assignee | ||
Comment 7•4 years ago
|
||
Transfer ni from Asuth to me. Will answer the question later.
Assignee | ||
Comment 8•4 years ago
|
||
Assignee | ||
Comment 9•4 years ago
|
||
I think the root cause of the race is because WorkerFetchResolver::mFetchObserver is not really protected by PromiseWorkerProxy::mCleanUpLock in WorkerFetchResolver.
As Alexis mentioned in the comment 2, in fact, WorkerFetchResolver::Shutdown() doesn't acquire the lock before assigning WorkerFetchResolver::mFetchObserver. The lock is only acquired during PromiseWorkerProxy::CleanUp(). I think this is based on the assumption WorkerFetchResolver::mFetchObserver could be only accessed on worker thread. However, WorkerFetchResolver::NeedOnDataAvailable() violates the assumption.
I think we don't need to worry about UAF on PromiseWorkerProxy::mCleanUpLock, even PromiseWorkerProxy::CleanUp() call Release().
PromiseWorkerProxy::Create() call AddRef() to make sure PromiseWorkerProxy::CleanUp() must be called before releasing it.
I think Nika suggested a correct solution. We should keep WorkerFetchResolver::mFetchObserver be only accessed in the worker thread.
Updated•4 years ago
|
Assignee | ||
Updated•4 years ago
|
Comment 10•4 years ago
|
||
Landed:
https://hg.mozilla.org/integration/autoland/rev/6483ed81fc507c2b2fb0261f7ced0e0a562422f2
Backed out for build bustage:
https://hg.mozilla.org/integration/autoland/rev/18501918510ea8a988a44ec0b68458bf41dece36
Push with bustage: https://treeherder.mozilla.org/jobs?repo=autoland&group_state=expanded&selectedTaskRun=UwsdHzXuQRadW7s6J7P_iQ.0&resultStatus=retry%2Cusercancel%2Ctestfailed%2Cbusted%2Cexception&revision=6483ed81fc507c2b2fb0261f7ced0e0a562422f2
Build log: https://treeherder.mozilla.org/logviewer?job_id=336353909&repo=autoland
[task 2021-04-13T12:14:16.968Z] 12:14:16 ERROR - /builds/worker/checkouts/gecko/dom/fetch/Fetch.cpp:359:5: error: use of undeclared identifier 'mIsNeedOnDataAvailable'; did you mean 'mNeedOnDataAvailable'?
[task 2021-04-13T12:14:16.968Z] 12:14:16 INFO - mIsNeedOnDataAvailable = false;
[task 2021-04-13T12:14:16.968Z] 12:14:16 INFO - ^~~~~~~~~~~~~~~~~~~~~~
[task 2021-04-13T12:14:16.968Z] 12:14:16 INFO - mNeedOnDataAvailable
[task 2021-04-13T12:14:16.968Z] 12:14:16 INFO - /builds/worker/checkouts/gecko/dom/fetch/Fetch.cpp:269:16: note: 'mNeedOnDataAvailable' declared here
[task 2021-04-13T12:14:16.968Z] 12:14:16 INFO - Atomic<bool> mNeedOnDataAvailable;
[task 2021-04-13T12:14:16.969Z] 12:14:16 INFO - ^
[task 2021-04-13T12:14:16.969Z] 12:14:16 ERROR - /builds/worker/checkouts/gecko/dom/fetch/Fetch.cpp:383:9: error: initializer 'mIsNeedOnDataAvailable' does not name a non-static data member or base class; did you mean the member 'mNeedOnDataAvailable'?
[task 2021-04-13T12:14:16.969Z] 12:14:16 INFO - mIsNeedOnDataAvailable(!!aObserver) {
[task 2021-04-13T12:14:16.969Z] 12:14:16 INFO - ^~~~~~~~~~~~~~~~~~~~~~
[task 2021-04-13T12:14:16.969Z] 12:14:16 INFO - mNeedOnDataAvailable
[task 2021-04-13T12:14:16.970Z] 12:14:16 INFO - /builds/worker/checkouts/gecko/dom/fetch/Fetch.cpp:269:16: note: 'mNeedOnDataAvailable' declared here
[task 2021-04-13T12:14:16.970Z] 12:14:16 INFO - Atomic<bool> mNeedOnDataAvailable;
[task 2021-04-13T12:14:16.970Z] 12:14:16 INFO - ^
[task 2021-04-13T12:14:16.971Z] 12:14:16 ERROR - /builds/worker/checkouts/gecko/dom/fetch/Fetch.cpp:878:10: error: use of undeclared identifier 'mIsNeedOnDataAvailable'; did you mean 'mNeedOnDataAvailable'?
[task 2021-04-13T12:14:16.971Z] 12:14:16 INFO - return mIsNeedOnDataAvailable;
[task 2021-04-13T12:14:16.971Z] 12:14:16 INFO - ^~~~~~~~~~~~~~~~~~~~~~
[task 2021-04-13T12:14:16.971Z] 12:14:16 INFO - mNeedOnDataAvailable
[task 2021-04-13T12:14:16.971Z] 12:14:16 INFO - /builds/worker/checkouts/gecko/dom/fetch/Fetch.cpp:269:16: note: 'mNeedOnDataAvailable' declared here
[task 2021-04-13T12:14:16.971Z] 12:14:16 INFO - Atomic<bool> mNeedOnDataAvailable;
[task 2021-04-13T12:14:16.971Z] 12:14:16 INFO - ^
Assignee | ||
Updated•4 years ago
|
Comment 11•4 years ago
|
||
Comment 12•4 years ago
|
||
Updated•4 years ago
|
Assignee | ||
Comment 13•4 years ago
|
||
Hello Ryan, do you think we should uplift the patch to esr78?
In fact, I think it doesn't make any impact on users even race happens. So I am thinking of don't uplift this to ESR78.
Comment 14•4 years ago
|
||
Given the moderate severity and lack of user impact, riding the trains sounds fine. Thanks for asking!
Updated•4 years ago
|
Updated•3 years ago
|
Updated•3 years ago
|
Description
•