ThreadSanitizer: data race [@ EventSource::UpdateDontKeepAlive] vs. [@ EventSource::ReadyState]
Categories
(Core :: DOM: Workers, defect, P1)
Tracking
()
People
(Reporter: Gankra, Assigned: ytausky)
References
(Blocks 2 open bugs)
Details
(Keywords: csectype-race, sec-high, Whiteboard: [sec-survey][adv-main86+r][adv-esr78.8+r])
Attachments
(8 files)
|
50.97 KB,
text/plain
|
Details | |
|
48 bytes,
text/x-phabricator-request
|
tjr
:
approval-mozilla-beta+
tjr
:
approval-mozilla-esr78+
tjr
:
sec-approval+
|
Details | Review |
|
48 bytes,
text/x-phabricator-request
|
tjr
:
approval-mozilla-beta+
tjr
:
approval-mozilla-esr78+
tjr
:
sec-approval+
|
Details | Review |
|
48 bytes,
text/x-phabricator-request
|
tjr
:
approval-mozilla-beta+
tjr
:
approval-mozilla-esr78+
tjr
:
sec-approval+
|
Details | Review |
|
48 bytes,
text/x-phabricator-request
|
tjr
:
approval-mozilla-beta+
tjr
:
approval-mozilla-esr78+
tjr
:
sec-approval+
|
Details | Review |
|
48 bytes,
text/x-phabricator-request
|
Details | Review | |
|
48 bytes,
text/x-phabricator-request
|
Details | Review | |
|
48 bytes,
text/x-phabricator-request
|
Details | Review |
Found while bringing up wpt under tsan. Tentatively filing as security since this is a race on the value of a RefPtr.
It looks like EventSource::UpdateDontKeepAlive needs to acquire mMutex before nulling out mEventSource to properly synchronize this change with EventSource::ReadyState.
Permalinks to problematic lines:
EventSource::UpdateDontKeepAlive
General information about TSan reports
Why fix races?
Data races are undefined behavior and can cause crashes as well as correctness issues. Compiler optimizations can cause racy code to have unpredictable and hard-to-reproduce behavior.
Rating
If you think this race can cause crashes or correctness issues, it would be great to rate the bug appropriately as P1/P2 and/or indicating this in the bug. This makes it a lot easier for us to assess the actual impact that these reports make and if they are helpful to you.
False Positives / Benign Races
Typically, races reported by TSan are not false positives [1], but it is possible that the race is benign. Even in this case it would be nice to come up with a fix if it is easily doable and does not regress performance. Every race that we cannot fix will have to remain on the suppression list and slows down the overall TSan performance. Also note that seemingly benign races can possibly be harmful (also depending on the compiler, optimizations and the architecture) [2][3].
[1] One major exception is the involvement of uninstrumented code from third-party libraries.
[2] http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
[3] How to miscompile programs with "benign" data races: https://www.usenix.org/legacy/events/hotpar11/tech/final_files/Boehm.pdf
Suppressing unfixable races
If the bug cannot be fixed, then a runtime suppression needs to be added in mozglue/build/TsanOptions.cpp. The suppressions match on the full stack, so it should be picked such that it is unique to this particular race. The bug number of this bug should also be included so we have some documentation on why this suppression was added.
Comment 1•5 years ago
|
||
Nice find. This looks like an issue with EventSourceImpl on DOM workers, so I'll move it to the worker component.
Comment 2•5 years ago
|
||
Leaning toward sec-high because content scripts are clearly involved
Comment 3•5 years ago
|
||
Updated•5 years ago
|
Comment 4•5 years ago
|
||
While I was looking at the code, I was also wondering if there is a good reason for this raw pointer. Yes, it is probably fine like this, but are we sure there is no interference with the keepalive logic?
Updated•5 years ago
|
Updated•5 years ago
|
Comment 5•5 years ago
|
||
Comment on attachment 9193722 [details]
Bug 1682928: P1 Reduce the use of EventSourceImpl raw pointers r=#dom-workers-and-storage-reviewers,ytausky
Revision D100010 was moved to bug 1683620. Setting attachment 9193722 [details] to obsolete.
Comment 6•5 years ago
•
|
||
I moved the work over to bug 1683620. Rather than trying to fix the specific race I am trying to overhaul the keepalive logic in general.
Here's a try push.
Comment 7•5 years ago
|
||
(In reply to Jens Stutte [:jstutte] from comment #4)
While I was looking at the code, I was also wondering if there is a good reason for this raw pointer. Yes, it is probably fine like this, but are we sure there is no interference with the keepalive logic?
Just to answer this for posterity (and reduce my "restating" elsewhere), it seems like the raw pointer used in conjunction with mKeepingAlive and EventSource::UpdateMustKeepAlive and UpdateDontKeepAlive and EventSourceImpl::AddRefObject/EventSourceImpl::ReleaseObject are just an obfuscated version of a normal RefPtr.
Updated•5 years ago
|
| Assignee | ||
Comment 8•5 years ago
|
||
This is a manifestation of bug 1658776. The main hurdle in fixing this is that the main thread is accessing memory locations in DOMEventTargetHelper (a base class of EventSource), e.g. here, and this whole class wasn't written with multi-thread access in mind, so there's no synchronization.
Comment 9•5 years ago
|
||
(In reply to Yaron Tausky [:ytausky] from comment #8)
This is a manifestation of bug 1658776. The main hurdle in fixing this is that the main thread is accessing memory locations in
DOMEventTargetHelper(a base class ofEventSource), e.g. here, and this whole class wasn't written with multi-thread access in mind, so there's no synchronization.
In other words, the work in bug 1683620 will not help here, right (which does not invalidate the work there as a useful cleanup, I hope) ?
| Assignee | ||
Comment 10•5 years ago
|
||
It would make this particular report go away (because the underlying data members changed), but that just means that we will get a slightly different TSan report. There are also the other data races like the one I mentioned above. The cleanup is still useful, of course.
| Assignee | ||
Updated•5 years ago
|
Updated•5 years ago
|
Comment 11•5 years ago
|
||
Hi Daniel,
we have the following situation here for patches:
- bug 1683620 features a cleanup, which was initially meant to solve the problem but is believed to not solve this (but to reduce other edge cases).
- bug 1658776 (which was independently filed) contains the patch-set that is supposed now to help here.
We assume, that we should move the patches from bug 1658776 here and ask for sec-approval on those. Should we ask also for sec-approval for the patch on bug 1683620 (or even move its patch here, too) ?
Any advice that helps also to reduce your work to gather pieces is appreciated. Thanks!
Updated•5 years ago
|
Comment 12•5 years ago
•
|
||
Talking with Yaron, we just decided that all the patches should land together in order to be clear about dependencies and landing order in case we want to uplift this.
| Assignee | ||
Comment 14•5 years ago
|
||
| Assignee | ||
Comment 15•5 years ago
|
||
With this commit a few of EventSource's and EventSourceImpl's data
members are now atomic, since a mutex isn't really necessary for
their use case. Also, several data members are now marked const.
| Assignee | ||
Comment 16•5 years ago
|
||
Depends on D101210
| Reporter | ||
Comment 17•5 years ago
|
||
Depends on D102317
Comment 18•5 years ago
|
||
https://hg.mozilla.org/integration/autoland/rev/18466ad12efce0c74060d3df0d25aac9ae712620
https://hg.mozilla.org/mozilla-central/rev/18466ad12efc
Updated•5 years ago
|
Updated•5 years ago
|
Comment 19•5 years ago
|
||
(In reply to Jens Stutte [:jstutte] from comment #12)
Talking with Yaron, we just decided that all the patches should land together in order to be clear about dependencies and landing order in case we want to uplift this.
Clearing Daniel's needinfo request.
| Assignee | ||
Comment 20•5 years ago
|
||
Comment on attachment 9193722 [details]
Bug 1682928: P1 Reduce the use of EventSourceImpl raw pointers r=#dom-workers-and-storage-reviewers,ytausky
Security Approval Request
- How easily could an exploit be constructed based on the patch?: I don't think the patches make it any easier other than indicating that there are data races in this class.
- Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?: No
- Which older supported branches are affected by this flaw?: All
- If not all supported branches, which bug introduced the flaw?: None
- Do you have backports for the affected branches?: No
- If not, how different, hard to create, and risky will they be?: It shouldn't be hard to backport the patches since they're self-contained and this class wasn't touched in a while.
- How likely is this patch to cause regressions; how much testing does it need?: It's a bigger patch so regressions are possible, but it should be covered by automated tests.
| Assignee | ||
Updated•5 years ago
|
| Reporter | ||
Comment 21•5 years ago
|
||
Hey sorry we raced past eachother a bit (needed to supress this to turn on the test suite) -- could you add a patch that reverts https://phabricator.services.mozilla.com/D102318 so this change is properly tested when it lands?
| Assignee | ||
Comment 22•5 years ago
|
||
Depends on D101582
| Assignee | ||
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Comment 23•5 years ago
|
||
Comment on attachment 9193722 [details]
Bug 1682928: P1 Reduce the use of EventSourceImpl raw pointers r=#dom-workers-and-storage-reviewers,ytausky
approved to land and uplift
Updated•5 years ago
|
Comment 24•5 years ago
|
||
Can you clarify what the landing procedure should be for these fixes? https://firefox-source-docs.mozilla.org/bug-mgmt/processes/security-approval.html says "At that point, land it according to instructions provided.." and I'm not sure if the patch should land on m-c first and then later on the branches, or everywhere all at once.
Comment 25•5 years ago
|
||
(In reply to Andrew Sutherland [:asuth] (he/him) from comment #24)
Can you clarify what the landing procedure should be for these fixes? https://firefox-source-docs.mozilla.org/bug-mgmt/processes/security-approval.html says "At that point, land it according to instructions provided.." and I'm not sure if the patch should land on m-c first and then later on the branches, or everywhere all at once.
You can land it on m-c now, and relman will come along and uplift it in a few days.
Comment 26•5 years ago
|
||
Backed out for crashes on [@ mozilla::detail::MutexImpl::mutexLock()]:
https://hg.mozilla.org/integration/autoland/rev/ac1a3b6bec7a911886290fa4bedc7bfa2a20d871
Push with failures: https://treeherder.mozilla.org/jobs?repo=autoland&group_state=expanded&selectedTaskRun=U2TefNZGTY22mYqxtPDrdw.0&resultStatus=testfailed%2Cbusted%2Cexception%2Cretry%2Cusercancel%2Crunnable&revision=ef19a227311211433bbafe3cbcbbf4e97e883625
Failure log: https://treeherder.mozilla.org/logviewer?job_id=328123744&repo=autoland
Comment 27•5 years ago
•
|
||
The crash situation is weird because it looks like we're experiencing a null deref on what should be a mutex reentrancy hang. Building a local opt build I see the reentrancy deadlock hang, and building a local debug build I see the following report on the reentrancy:
0:14.01 PASS What is this object?
0:14.01 GECKO(675190) ###!!! ERROR: Potential deadlock detected:
0:14.01 GECKO(675190) === Cyclical dependency starts at
0:14.01 GECKO(675190) --- Mutex : EventSourceImpl::mSharedData (currently acquired)
0:14.01 GECKO(675190) calling context
0:14.01 GECKO(675190) [stack trace unavailable]
0:14.01 GECKO(675190) === Cycle completed at
0:14.01 GECKO(675190) --- Mutex : EventSourceImpl::mSharedData (currently acquired)
0:14.01 GECKO(675190) calling context
0:14.01 GECKO(675190) [stack trace unavailable]
0:14.01 GECKO(675190) ###!!! Deadlock may happen NOW!
0:14.01 GECKO(675190) [Parent 675204, Main Thread] ###!!! ASSERTION: Potential deadlock detected:
0:14.01 GECKO(675190) Cyclical dependency starts at
0:14.01 GECKO(675190) Mutex : EventSourceImpl::mSharedData (currently acquired)
0:14.01 GECKO(675190) Cycle completed at
0:14.01 GECKO(675190) Mutex : EventSourceImpl::mSharedData (currently acquired)
0:14.01 GECKO(675190) ###!!! Deadlock may happen NOW!
0:14.02 GECKO(675190) : 'Error', file /home/visbrero/rev_control/hg/mozilla-unified3/xpcom/threads/BlockingResourceBase.cpp:248
This finds us hanging in the call at https://hg.mozilla.org/integration/autoland/file/ef19a227311211433bbafe3cbcbbf4e97e883625/dom/base/EventSource.cpp#l1061 where we are already holding the lock but we don't drop the lock before calling DispatchFailConnection which calls IsClosed() at https://hg.mozilla.org/integration/autoland/file/ef19a227311211433bbafe3cbcbbf4e97e883625/dom/base/EventSource.cpp#l1340 which is really just a ReadyState check which grabs the lock.
Note that from discussion with :khuey after trying to reproduce from the self-serve API, it sounds like currently the necessary symbols won't exist (at least by default, I have an outstanding question on that). Edit: khuey identified this line as the logic that makes it so release and try builds include full symbols, but that autoland does not/can't get those symbols, which means self-serve won't work on autoland backouts. I've tentatively raised the question of whether there's a cost reason to not provide the symbols on https://chat.mozilla.org/#/room/#firefox-ci:mozilla.org
Comment 28•5 years ago
|
||
A pernosco trace of the comment 27 debug build reproduction is: https://pernos.co/debug/qA54dGU3ukBf6_xjuztthQ/index.html
It confirms what one might expect from the test log output, which is that we are executing the line var es = Cu.evalInSandbox('var es = new EventSource("about:blank"); es', sandbox); and that we fail because we only support schemes "http" and "https". But we don't check for that case until we've checked the global (and are still holding the lock).
Updated•5 years ago
|
Updated•5 years ago
|
Comment 29•5 years ago
|
||
This landed and got backed out:
https://hg.mozilla.org/integration/autoland/rev/d818af41cda711705e39402dece0b0de96492273
Failure log: https://treeherder.mozilla.org/logviewer?job_id=328709620&repo=autoland
[task 2021-02-03T14:58:51.668Z] 14:58:51 INFO - TEST-OK | dom/base/test/test_eventsourceservice_status_error.html | took 72ms
[task 2021-02-03T14:58:51.669Z] 14:58:51 INFO - TEST-START | dom/base/test/test_eventsourceservice_worker.html
[task 2021-02-03T14:59:12.003Z] 14:59:12 INFO - wait for org.mozilla.geckoview.test complete; top activity=com.android.launcher3
[task 2021-02-03T14:59:12.003Z] 14:59:12 INFO - runtestsremote.py | Application ran for: 0:02:06.127552
[task 2021-02-03T14:59:12.124Z] 14:59:12 INFO - mozcrash Downloading symbols from: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/LKTiNGxZRquNRxSb-D-rKw/artifacts/public/build/target.crashreporter-symbols.zip
[task 2021-02-03T14:59:14.001Z] 14:59:14 INFO - mozcrash Copy/paste: /builds/worker/fetches/minidump_stackwalk/minidump_stackwalk /tmp/tmpfGDD_b/0917ee3f-096a-fd7e-166d-2a02ba4c0d50.dmp /tmp/tmpv87zw5
[task 2021-02-03T14:59:18.017Z] 14:59:18 INFO - mozcrash Saved minidump as /builds/worker/workspace/build/blobber_upload_dir/0917ee3f-096a-fd7e-166d-2a02ba4c0d50.dmp
[task 2021-02-03T14:59:18.017Z] 14:59:18 INFO - mozcrash Saved app info as /builds/worker/workspace/build/blobber_upload_dir/0917ee3f-096a-fd7e-166d-2a02ba4c0d50.extra
[task 2021-02-03T14:59:18.027Z] 14:59:18 WARNING - PROCESS-CRASH | dom/base/test/test_eventsourceservice_worker.html | application crashed [@ mozilla::dom::EventSourceImpl::AssertIsOnTargetThread() const]
[task 2021-02-03T14:59:18.027Z] 14:59:18 INFO - Mozilla crash reason: MOZ_DIAGNOSTIC_ASSERT(IsTargetThread())
[task 2021-02-03T14:59:18.028Z] 14:59:18 INFO - Crash dump filename: /tmp/tmpfGDD_b/0917ee3f-096a-fd7e-166d-2a02ba4c0d50.dmp
[task 2021-02-03T14:59:18.028Z] 14:59:18 INFO - Operating system: Android
[task 2021-02-03T14:59:18.028Z] 14:59:18 INFO - 0.0.0 Linux 3.10.0+ #260 SMP PREEMPT Fri May 19 12:48:14 PDT 2017 x86_64
[task 2021-02-03T14:59:18.028Z] 14:59:18 INFO - CPU: amd64
[task 2021-02-03T14:59:18.028Z] 14:59:18 INFO - family 6 model 6 stepping 3
[task 2021-02-03T14:59:18.028Z] 14:59:18 INFO - 4 CPUs
[task 2021-02-03T14:59:18.028Z] 14:59:18 INFO - GPU: UNKNOWN
[task 2021-02-03T14:59:18.028Z] 14:59:18 INFO - Crash reason: SIGSEGV /SEGV_MAPERR
[task 2021-02-03T14:59:18.029Z] 14:59:18 INFO - Crash address: 0x0
[task 2021-02-03T14:59:18.029Z] 14:59:18 INFO - Process uptime: not available
[task 2021-02-03T14:59:18.029Z] 14:59:18 INFO - Thread 11 (crashed)
[task 2021-02-03T14:59:18.029Z] 14:59:18 INFO - 0 libxul.so!mozilla::dom::EventSourceImpl::AssertIsOnTargetThread() const [EventSource.cpp:d07913a1ae00f252009bd9d353fc74c6f54d62e3 : 145 + 0x11]
[task 2021-02-03T14:59:18.029Z] 14:59:18 INFO - rax = 0x000076e8c97ab271 rdx = 0x0000000000000001
[task 2021-02-03T14:59:18.029Z] 14:59:18 INFO - rcx = 0x000076e8cc073dd8 rbx = 0x000076e8bc4fec80
[task 2021-02-03T14:59:18.029Z] 14:59:18 INFO - rsi = 0x000076e8bc4fec80 rdi = 0x0000000000000016
[task 2021-02-03T14:59:18.030Z] 14:59:18 INFO - rbp = 0x000076e8cc979a20 rsp = 0x000076e8cc979a20
[task 2021-02-03T14:59:18.030Z] 14:59:18 INFO - r8 = 0x0000000000000000 r9 = 0x0000000000000009
[task 2021-02-03T14:59:18.030Z] 14:59:18 INFO - r10 = 0xfffffffffffff099 r11 = 0x0000000000000000
[task 2021-02-03T14:59:18.030Z] 14:59:18 INFO - r12 = 0x000076e8cc979b40 r13 = 0x000076e8cc979aa8
[task 2021-02-03T14:59:18.030Z] 14:59:18 INFO - r14 = 0x000076e8cc979aa8 r15 = 0x000076e8bc4fec80
[task 2021-02-03T14:59:18.030Z] 14:59:18 INFO - rip = 0x000076e8c502531c
[task 2021-02-03T14:59:18.030Z] 14:59:18 INFO - Found by: given as instruction pointer in context
Updated•5 years ago
|
Updated•5 years ago
|
Comment 30•5 years ago
|
||
https://hg.mozilla.org/integration/autoland/rev/25dc633cbc3e871b254813f6baa7195a9de3e065
https://hg.mozilla.org/integration/autoland/rev/7b2fd9526db98ba1ceb1f532f03495834fc9de6d
https://hg.mozilla.org/integration/autoland/rev/f358e59e6011e4162d1dd4b9d22df98d5acbc04b
https://hg.mozilla.org/integration/autoland/rev/a560ab66c675ee7f15cdf59632ee067a88a34aa2
https://hg.mozilla.org/integration/autoland/rev/0d2fba5b4ee0f9ab719c079e65c84dcc9dc82ed1
https://hg.mozilla.org/mozilla-central/rev/25dc633cbc3e
https://hg.mozilla.org/mozilla-central/rev/7b2fd9526db9
https://hg.mozilla.org/mozilla-central/rev/f358e59e6011
https://hg.mozilla.org/mozilla-central/rev/a560ab66c675
https://hg.mozilla.org/mozilla-central/rev/0d2fba5b4ee0
Yaron, is anything else left to be done is this bug or can it get set as resolved (currently has leave-open keyword)?
| Assignee | ||
Comment 31•5 years ago
|
||
It can be closed now, thanks for reminding me.
Comment 32•5 years ago
|
||
Do you want to uplift for 86?
Comment 34•5 years ago
|
||
| uplift | ||
https://hg.mozilla.org/releases/mozilla-beta/rev/68715e2f23c3
https://hg.mozilla.org/releases/mozilla-beta/rev/1a21d1f62b4b
https://hg.mozilla.org/releases/mozilla-beta/rev/99809a2fe9e5
https://hg.mozilla.org/releases/mozilla-beta/rev/f99529146dea
https://hg.mozilla.org/releases/mozilla-beta/rev/899e10fec606
Comment 36•5 years ago
|
||
As part of a security bug pattern analysis, we are requesting your help with a high level analysis of this bug. It is our hope to develop static analysis (or potentially runtime/dynamic analysis) in the future to identify classes of bugs.
Please visit this google form to reply.
| Assignee | ||
Comment 37•5 years ago
|
||
| Assignee | ||
Comment 38•5 years ago
|
||
I squashed all the relevant patches into this one.
Comment 40•5 years ago
|
||
| uplift | ||
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Updated•4 years ago
|
Description
•