Closed Bug 1686154 Opened 4 years ago Closed 4 years ago

ThreadSanitizer: data race [@ mozilla::dom::WorkerFetchResolver::NeedOnDataAvailable] vs. [@ Shutdown]

Categories

(Core :: DOM: Networking, defect, P2)

defect

Tracking

()

RESOLVED FIXED
89 Branch
Tracking Status
firefox-esr78 --- wontfix
firefox86 --- wontfix
firefox87 --- wontfix
firefox88 --- wontfix
firefox89 --- fixed

People

(Reporter: tsmith, Assigned: edenchuang)

References

(Blocks 2 open bugs)

Details

(Keywords: csectype-race, sec-moderate, Whiteboard: [post-critsmash-triage][adv-main89+r])

Attachments

(2 files)

The attached crash information was detected by ThreadSanitizer while fuzzing with build mozilla-central 20210111-7643eaee62a3. The fuzzers have hit this issue multiple times but no reliable test case is available at this time.

General information about TSan reports

Why fix races?

Data races are undefined behavior and can cause crashes as well as correctness issues. Compiler optimizations can cause racy code to have unpredictable and hard-to-reproduce behavior.

Rating

If you think this race can cause crashes or correctness issues, it would be great to rate the bug appropriately as P1/P2 and/or indicating this in the bug. This makes it a lot easier for us to assess the actual impact that these reports make and if they are helpful to you.

False Positives / Benign Races

Typically, races reported by TSan are not false positives [1], but it is possible that the race is benign. Even in this case it would be nice to come up with a fix if it is easily doable and does not regress performance. Every race that we cannot fix will have to remain on the suppression list and slows down the overall TSan performance. Also note that seemingly benign races can possibly be harmful (also depending on the compiler, optimizations and the architecture) [2][3].

[1] One major exception is the involvement of uninstrumented code from third-party libraries.
[2] http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
[3] How to miscompile programs with "benign" data races: https://www.usenix.org/legacy/events/hotpar11/tech/final_files/Boehm.pdf

Suppressing unfixable races

If the bug cannot be fixed, then a runtime suppression needs to be added in mozglue/build/TsanOptions.cpp. The suppressions match on the full stack, so it should be picked such that it is unique to this particular race. The bug number of this bug should also be included so we have some documentation on why this suppression was added.

So it seems like what's happening is that someone is shutting down the WorkerFetchResolver while the main thread is still using it, leading to a race on the value of mFetchObserver. The main thread acquires mPromiseProxy->Lock(), but the other thread only sort of does. The previous line of the shutdown code runs mPromiseProxy->CleanUp() which does acquire the lock internally... but then Releases mPromiseProxy!

That suggests to me that the main thread is potentially use-after-freeing mPromiseProxy?

It seems like there's some key missing piece of synchronization to stop the main thread from using WorkerFetchResolver once shutdown starts, but I don't know this code well enough to suggest a solution.


Permalinks to problematic lines:

WorkerFetchResolver::NeedOnDataAvailable

vs

WorkerFetchResolver::Shutdown

:baku, can you give some input on comment 2 and the related race, in particular how this is supposed to work normally? If you know someone who can work on this, please let us know :) Thanks!

Flags: needinfo?(amarchesini)
Flags: needinfo?(amarchesini) → needinfo?(bugmail)

The mFetchObserver member is documented as being "Touched only on the worker thread" (https://searchfox.org/mozilla-central/rev/d7e344e956d9da2808ea33e1fe0f963ed10c142d/dom/fetch/Fetch.cpp#264), however it is explicitly read with the lock held from the main thread here: https://searchfox.org/mozilla-central/rev/d7e344e956d9da2808ea33e1fe0f963ed10c142d/dom/fetch/Fetch.cpp#875

A separate boolean which is actually managed by the lock, or an atomic boolean of some sort, should probably be used instead.

Jens, can you pls have asuth look into this?

Flags: needinfo?(jstutte)
Flags: needinfo?(jstutte)

Hi Eden, could you find some time for this? Thank you!

Flags: needinfo?(echuang)
Assignee: nobody → echuang
Status: NEW → ASSIGNED
Flags: needinfo?(echuang)

Transfer ni from Asuth to me. Will answer the question later.

Flags: needinfo?(bugmail) → needinfo?(echuang)

I think the root cause of the race is because WorkerFetchResolver::mFetchObserver is not really protected by PromiseWorkerProxy::mCleanUpLock in WorkerFetchResolver.

https://searchfox.org/mozilla-central/rev/d9f6cded535d202a9ade4a530e653e659bcb5bbd/dom/fetch/Fetch.cpp#350-365

As Alexis mentioned in the comment 2, in fact, WorkerFetchResolver::Shutdown() doesn't acquire the lock before assigning WorkerFetchResolver::mFetchObserver. The lock is only acquired during PromiseWorkerProxy::CleanUp(). I think this is based on the assumption WorkerFetchResolver::mFetchObserver could be only accessed on worker thread. However, WorkerFetchResolver::NeedOnDataAvailable() violates the assumption.

I think we don't need to worry about UAF on PromiseWorkerProxy::mCleanUpLock, even PromiseWorkerProxy::CleanUp() call Release().
PromiseWorkerProxy::Create() call AddRef() to make sure PromiseWorkerProxy::CleanUp() must be called before releasing it.

I think Nika suggested a correct solution. We should keep WorkerFetchResolver::mFetchObserver be only accessed in the worker thread.

Flags: needinfo?(echuang)
Priority: -- → P2
Attachment #9214854 - Attachment description: WIP: Bug 1686154 - Fix the race condition on WorkerFetchResolver::mFetchObserver. → Bug 1686154 - Fix the race condition on WorkerFetchResolver::mFetchObserver.

Landed:

https://hg.mozilla.org/integration/autoland/rev/6483ed81fc507c2b2fb0261f7ced0e0a562422f2

Backed out for build bustage:

https://hg.mozilla.org/integration/autoland/rev/18501918510ea8a988a44ec0b68458bf41dece36

Push with bustage: https://treeherder.mozilla.org/jobs?repo=autoland&group_state=expanded&selectedTaskRun=UwsdHzXuQRadW7s6J7P_iQ.0&resultStatus=retry%2Cusercancel%2Ctestfailed%2Cbusted%2Cexception&revision=6483ed81fc507c2b2fb0261f7ced0e0a562422f2
Build log: https://treeherder.mozilla.org/logviewer?job_id=336353909&repo=autoland

[task 2021-04-13T12:14:16.968Z] 12:14:16    ERROR -  /builds/worker/checkouts/gecko/dom/fetch/Fetch.cpp:359:5: error: use of undeclared identifier 'mIsNeedOnDataAvailable'; did you mean 'mNeedOnDataAvailable'?
[task 2021-04-13T12:14:16.968Z] 12:14:16     INFO -      mIsNeedOnDataAvailable = false;
[task 2021-04-13T12:14:16.968Z] 12:14:16     INFO -      ^~~~~~~~~~~~~~~~~~~~~~
[task 2021-04-13T12:14:16.968Z] 12:14:16     INFO -      mNeedOnDataAvailable
[task 2021-04-13T12:14:16.968Z] 12:14:16     INFO -  /builds/worker/checkouts/gecko/dom/fetch/Fetch.cpp:269:16: note: 'mNeedOnDataAvailable' declared here
[task 2021-04-13T12:14:16.968Z] 12:14:16     INFO -    Atomic<bool> mNeedOnDataAvailable;
[task 2021-04-13T12:14:16.969Z] 12:14:16     INFO -                 ^
[task 2021-04-13T12:14:16.969Z] 12:14:16    ERROR -  /builds/worker/checkouts/gecko/dom/fetch/Fetch.cpp:383:9: error: initializer 'mIsNeedOnDataAvailable' does not name a non-static data member or base class; did you mean the member 'mNeedOnDataAvailable'?
[task 2021-04-13T12:14:16.969Z] 12:14:16     INFO -          mIsNeedOnDataAvailable(!!aObserver) {
[task 2021-04-13T12:14:16.969Z] 12:14:16     INFO -          ^~~~~~~~~~~~~~~~~~~~~~
[task 2021-04-13T12:14:16.969Z] 12:14:16     INFO -          mNeedOnDataAvailable
[task 2021-04-13T12:14:16.970Z] 12:14:16     INFO -  /builds/worker/checkouts/gecko/dom/fetch/Fetch.cpp:269:16: note: 'mNeedOnDataAvailable' declared here
[task 2021-04-13T12:14:16.970Z] 12:14:16     INFO -    Atomic<bool> mNeedOnDataAvailable;
[task 2021-04-13T12:14:16.970Z] 12:14:16     INFO -                 ^
[task 2021-04-13T12:14:16.971Z] 12:14:16    ERROR -  /builds/worker/checkouts/gecko/dom/fetch/Fetch.cpp:878:10: error: use of undeclared identifier 'mIsNeedOnDataAvailable'; did you mean 'mNeedOnDataAvailable'?
[task 2021-04-13T12:14:16.971Z] 12:14:16     INFO -    return mIsNeedOnDataAvailable;
[task 2021-04-13T12:14:16.971Z] 12:14:16     INFO -           ^~~~~~~~~~~~~~~~~~~~~~
[task 2021-04-13T12:14:16.971Z] 12:14:16     INFO -           mNeedOnDataAvailable
[task 2021-04-13T12:14:16.971Z] 12:14:16     INFO -  /builds/worker/checkouts/gecko/dom/fetch/Fetch.cpp:269:16: note: 'mNeedOnDataAvailable' declared here
[task 2021-04-13T12:14:16.971Z] 12:14:16     INFO -    Atomic<bool> mNeedOnDataAvailable;
[task 2021-04-13T12:14:16.971Z] 12:14:16     INFO -                 ^
Flags: needinfo?(echuang)
Flags: needinfo?(echuang)
Group: dom-core-security → core-security-release
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 89 Branch

Hello Ryan, do you think we should uplift the patch to esr78?
In fact, I think it doesn't make any impact on users even race happens. So I am thinking of don't uplift this to ESR78.

Flags: needinfo?(ryanvm)

Given the moderate severity and lack of user impact, riding the trains sounds fine. Thanks for asking!

Flags: needinfo?(ryanvm)
Flags: qe-verify-
Whiteboard: [post-critsmash-triage]
Whiteboard: [post-critsmash-triage] → [post-critsmash-triage][adv-main89+r]
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: