Closed Bug 1682928 Opened 5 years ago Closed 5 years ago

ThreadSanitizer: data race [@ EventSource::UpdateDontKeepAlive] vs. [@ EventSource::ReadyState]

Categories

(Core :: DOM: Workers, defect, P1)

defect

Tracking

()

RESOLVED FIXED
87 Branch
Tracking Status
firefox-esr78 86+ fixed
firefox84 - wontfix
firefox85 - wontfix
firefox86 + fixed
firefox87 + fixed

People

(Reporter: Gankra, Assigned: ytausky)

References

(Blocks 2 open bugs)

Details

(Keywords: csectype-race, sec-high, Whiteboard: [sec-survey][adv-main86+r][adv-esr78.8+r])

Attachments

(8 files)

50.97 KB, text/plain
Details
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review

Found while bringing up wpt under tsan. Tentatively filing as security since this is a race on the value of a RefPtr.

It looks like EventSource::UpdateDontKeepAlive needs to acquire mMutex before nulling out mEventSource to properly synchronize this change with EventSource::ReadyState.

Permalinks to problematic lines:

EventSource::UpdateDontKeepAlive

EventSource::ReadyState

General information about TSan reports

Why fix races?

Data races are undefined behavior and can cause crashes as well as correctness issues. Compiler optimizations can cause racy code to have unpredictable and hard-to-reproduce behavior.

Rating

If you think this race can cause crashes or correctness issues, it would be great to rate the bug appropriately as P1/P2 and/or indicating this in the bug. This makes it a lot easier for us to assess the actual impact that these reports make and if they are helpful to you.

False Positives / Benign Races

Typically, races reported by TSan are not false positives [1], but it is possible that the race is benign. Even in this case it would be nice to come up with a fix if it is easily doable and does not regress performance. Every race that we cannot fix will have to remain on the suppression list and slows down the overall TSan performance. Also note that seemingly benign races can possibly be harmful (also depending on the compiler, optimizations and the architecture) [2][3].

[1] One major exception is the involvement of uninstrumented code from third-party libraries.
[2] http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
[3] How to miscompile programs with "benign" data races: https://www.usenix.org/legacy/events/hotpar11/tech/final_files/Boehm.pdf
Suppressing unfixable races

If the bug cannot be fixed, then a runtime suppression needs to be added in mozglue/build/TsanOptions.cpp. The suppressions match on the full stack, so it should be picked such that it is unique to this particular race. The bug number of this bug should also be included so we have some documentation on why this suppression was added.

Nice find. This looks like an issue with EventSourceImpl on DOM workers, so I'll move it to the worker component.

Group: core-security → dom-core-security
Component: DOM: Events → DOM: Workers
Keywords: csectype-race

Leaning toward sec-high because content scripts are clearly involved

Keywords: sec-high
Assignee: nobody → jstutte
Status: NEW → ASSIGNED

While I was looking at the code, I was also wondering if there is a good reason for this raw pointer. Yes, it is probably fine like this, but are we sure there is no interference with the keepalive logic?

Flags: needinfo?(bugmail)
Flags: needinfo?(bugs)
Attachment #9193722 - Attachment description: Bug 1682928: Move cleanup logic to EventSourceImpl r?#dom-workers-and-storage → Bug 1682928: Reduce use of EventSourceImpl raw pointers r?#dom-workers-and-storage

Comment on attachment 9193722 [details]
Bug 1682928: P1 Reduce the use of EventSourceImpl raw pointers r=#dom-workers-and-storage-reviewers,ytausky

Revision D100010 was moved to bug 1683620. Setting attachment 9193722 [details] to obsolete.

Attachment #9193722 - Attachment is obsolete: true

I moved the work over to bug 1683620. Rather than trying to fix the specific race I am trying to overhaul the keepalive logic in general.

Here's a try push.

(In reply to Jens Stutte [:jstutte] from comment #4)

While I was looking at the code, I was also wondering if there is a good reason for this raw pointer. Yes, it is probably fine like this, but are we sure there is no interference with the keepalive logic?

Just to answer this for posterity (and reduce my "restating" elsewhere), it seems like the raw pointer used in conjunction with mKeepingAlive and EventSource::UpdateMustKeepAlive and UpdateDontKeepAlive and EventSourceImpl::AddRefObject/EventSourceImpl::ReleaseObject are just an obfuscated version of a normal RefPtr.

Flags: needinfo?(bugmail)
Severity: -- → S2
Priority: -- → P1

This is a manifestation of bug 1658776. The main hurdle in fixing this is that the main thread is accessing memory locations in DOMEventTargetHelper (a base class of EventSource), e.g. here, and this whole class wasn't written with multi-thread access in mind, so there's no synchronization.

(In reply to Yaron Tausky [:ytausky] from comment #8)

This is a manifestation of bug 1658776. The main hurdle in fixing this is that the main thread is accessing memory locations in DOMEventTargetHelper (a base class of EventSource), e.g. here, and this whole class wasn't written with multi-thread access in mind, so there's no synchronization.

In other words, the work in bug 1683620 will not help here, right (which does not invalidate the work there as a useful cleanup, I hope) ?

It would make this particular report go away (because the underlying data members changed), but that just means that we will get a slightly different TSan report. There are also the other data races like the one I mentioned above. The cleanup is still useful, of course.

Assignee: jstutte → ytausky

Hi Daniel,

we have the following situation here for patches:

  • bug 1683620 features a cleanup, which was initially meant to solve the problem but is believed to not solve this (but to reduce other edge cases).
  • bug 1658776 (which was independently filed) contains the patch-set that is supposed now to help here.

We assume, that we should move the patches from bug 1658776 here and ask for sec-approval on those. Should we ask also for sec-approval for the patch on bug 1683620 (or even move its patch here, too) ?

Any advice that helps also to reduce your work to gather pieces is appreciated. Thanks!

Flags: needinfo?(dveditz)
Attachment #9193722 - Attachment description: Bug 1682928: Reduce use of EventSourceImpl raw pointers r?#dom-workers-and-storage → Bug 1682928: P1 Reduce the use of EventSourceImpl raw pointers r?#dom-workers-and-storage
Attachment #9193722 - Attachment is obsolete: false

Talking with Yaron, we just decided that all the patches should land together in order to be clear about dependencies and landing order in case we want to uplift this.

With this commit a few of EventSource's and EventSourceImpl's data
members are now atomic, since a mutex isn't really necessary for
their use case. Also, several data members are now marked const.

Group: dom-core-security → core-security-release
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → 86 Branch
Status: RESOLVED → REOPENED
Keywords: leave-open
Resolution: FIXED → ---

(In reply to Jens Stutte [:jstutte] from comment #12)

Talking with Yaron, we just decided that all the patches should land together in order to be clear about dependencies and landing order in case we want to uplift this.

Clearing Daniel's needinfo request.

Flags: needinfo?(dveditz)

Comment on attachment 9193722 [details]
Bug 1682928: P1 Reduce the use of EventSourceImpl raw pointers r=#dom-workers-and-storage-reviewers,ytausky

Security Approval Request

  • How easily could an exploit be constructed based on the patch?: I don't think the patches make it any easier other than indicating that there are data races in this class.
  • Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?: No
  • Which older supported branches are affected by this flaw?: All
  • If not all supported branches, which bug introduced the flaw?: None
  • Do you have backports for the affected branches?: No
  • If not, how different, hard to create, and risky will they be?: It shouldn't be hard to backport the patches since they're self-contained and this class wasn't touched in a while.
  • How likely is this patch to cause regressions; how much testing does it need?: It's a bigger patch so regressions are possible, but it should be covered by automated tests.
Attachment #9193722 - Flags: sec-approval?
Attachment #9196784 - Flags: sec-approval?
Attachment #9196785 - Flags: sec-approval?
Attachment #9196786 - Flags: sec-approval?

Hey sorry we raced past eachother a bit (needed to supress this to turn on the test suite) -- could you add a patch that reverts https://phabricator.services.mozilla.com/D102318 so this change is properly tested when it lands?

Flags: needinfo?(ytausky)

Depends on D101582

Flags: needinfo?(ytausky)
Attachment #9196786 - Flags: sec-approval?
Attachment #9196786 - Flags: sec-approval+
Attachment #9196786 - Flags: approval-mozilla-esr78+
Attachment #9196786 - Flags: approval-mozilla-beta+
Attachment #9196785 - Flags: sec-approval?
Attachment #9196785 - Flags: sec-approval+
Attachment #9196785 - Flags: approval-mozilla-esr78+
Attachment #9196785 - Flags: approval-mozilla-beta+
Attachment #9196784 - Flags: sec-approval?
Attachment #9196784 - Flags: sec-approval+
Attachment #9196784 - Flags: approval-mozilla-esr78+
Attachment #9196784 - Flags: approval-mozilla-beta+

Comment on attachment 9193722 [details]
Bug 1682928: P1 Reduce the use of EventSourceImpl raw pointers r=#dom-workers-and-storage-reviewers,ytausky

approved to land and uplift

Attachment #9193722 - Flags: sec-approval?
Attachment #9193722 - Flags: sec-approval+
Attachment #9193722 - Flags: approval-mozilla-esr78+
Attachment #9193722 - Flags: approval-mozilla-beta+

Can you clarify what the landing procedure should be for these fixes? https://firefox-source-docs.mozilla.org/bug-mgmt/processes/security-approval.html says "At that point, land it according to instructions provided.." and I'm not sure if the patch should land on m-c first and then later on the branches, or everywhere all at once.

Flags: needinfo?(tom)

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #24)

Can you clarify what the landing procedure should be for these fixes? https://firefox-source-docs.mozilla.org/bug-mgmt/processes/security-approval.html says "At that point, land it according to instructions provided.." and I'm not sure if the patch should land on m-c first and then later on the branches, or everywhere all at once.

You can land it on m-c now, and relman will come along and uplift it in a few days.

Flags: needinfo?(tom)

The crash situation is weird because it looks like we're experiencing a null deref on what should be a mutex reentrancy hang. Building a local opt build I see the reentrancy deadlock hang, and building a local debug build I see the following report on the reentrancy:

 0:14.01 PASS What is this object?
 0:14.01 GECKO(675190) ###!!! ERROR: Potential deadlock detected:
 0:14.01 GECKO(675190) === Cyclical dependency starts at
 0:14.01 GECKO(675190) --- Mutex : EventSourceImpl::mSharedData (currently acquired)
 0:14.01 GECKO(675190)  calling context
 0:14.01 GECKO(675190)   [stack trace unavailable]
 0:14.01 GECKO(675190) === Cycle completed at
 0:14.01 GECKO(675190) --- Mutex : EventSourceImpl::mSharedData (currently acquired)
 0:14.01 GECKO(675190)  calling context
 0:14.01 GECKO(675190)   [stack trace unavailable]
 0:14.01 GECKO(675190) ###!!! Deadlock may happen NOW!
 0:14.01 GECKO(675190) [Parent 675204, Main Thread] ###!!! ASSERTION: Potential deadlock detected:
 0:14.01 GECKO(675190) Cyclical dependency starts at
 0:14.01 GECKO(675190) Mutex : EventSourceImpl::mSharedData (currently acquired)
 0:14.01 GECKO(675190) Cycle completed at
 0:14.01 GECKO(675190) Mutex : EventSourceImpl::mSharedData (currently acquired)
 0:14.01 GECKO(675190) ###!!! Deadlock may happen NOW!
 0:14.02 GECKO(675190) : 'Error', file /home/visbrero/rev_control/hg/mozilla-unified3/xpcom/threads/BlockingResourceBase.cpp:248

This finds us hanging in the call at https://hg.mozilla.org/integration/autoland/file/ef19a227311211433bbafe3cbcbbf4e97e883625/dom/base/EventSource.cpp#l1061 where we are already holding the lock but we don't drop the lock before calling DispatchFailConnection which calls IsClosed() at https://hg.mozilla.org/integration/autoland/file/ef19a227311211433bbafe3cbcbbf4e97e883625/dom/base/EventSource.cpp#l1340 which is really just a ReadyState check which grabs the lock.

Note that from discussion with :khuey after trying to reproduce from the self-serve API, it sounds like currently the necessary symbols won't exist (at least by default, I have an outstanding question on that). Edit: khuey identified this line as the logic that makes it so release and try builds include full symbols, but that autoland does not/can't get those symbols, which means self-serve won't work on autoland backouts. I've tentatively raised the question of whether there's a cost reason to not provide the symbols on https://chat.mozilla.org/#/room/#firefox-ci:mozilla.org

A pernosco trace of the comment 27 debug build reproduction is: https://pernos.co/debug/qA54dGU3ukBf6_xjuztthQ/index.html

It confirms what one might expect from the test log output, which is that we are executing the line var es = Cu.evalInSandbox('var es = new EventSource("about:blank"); es', sandbox); and that we fail because we only support schemes "http" and "https". But we don't check for that case until we've checked the global (and are still holding the lock).

Attachment #9193722 - Attachment description: Bug 1682928: P1 Reduce the use of EventSourceImpl raw pointers r?#dom-workers-and-storage → Bug 1682928: P1 Reduce the use of EventSourceImpl raw pointers r=#dom-workers-and-storage-reviewers,ytausky
Attachment #9196785 - Attachment description: Bug 1682928 - Make some data members atomic or const r=asuth,#dom-workers-and-storage-reviewers → Bug 1682928 - Make some data members atomic or const r=asuth,#dom-workers-and-storage-reviewers,sg

This landed and got backed out:

https://hg.mozilla.org/integration/autoland/rev/d818af41cda711705e39402dece0b0de96492273

Push with failures: https://treeherder.mozilla.org/jobs?repo=autoland&group_state=expanded&resultStatus=testfailed%2Cbusted%2Cexception%2Cretry%2Cusercancel%2Crunnable&revision=d07913a1ae00f252009bd9d353fc74c6f54d62e3

Failure log: https://treeherder.mozilla.org/logviewer?job_id=328709620&repo=autoland
[task 2021-02-03T14:58:51.668Z] 14:58:51 INFO - TEST-OK | dom/base/test/test_eventsourceservice_status_error.html | took 72ms
[task 2021-02-03T14:58:51.669Z] 14:58:51 INFO - TEST-START | dom/base/test/test_eventsourceservice_worker.html
[task 2021-02-03T14:59:12.003Z] 14:59:12 INFO - wait for org.mozilla.geckoview.test complete; top activity=com.android.launcher3
[task 2021-02-03T14:59:12.003Z] 14:59:12 INFO - runtestsremote.py | Application ran for: 0:02:06.127552
[task 2021-02-03T14:59:12.124Z] 14:59:12 INFO - mozcrash Downloading symbols from: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/LKTiNGxZRquNRxSb-D-rKw/artifacts/public/build/target.crashreporter-symbols.zip
[task 2021-02-03T14:59:14.001Z] 14:59:14 INFO - mozcrash Copy/paste: /builds/worker/fetches/minidump_stackwalk/minidump_stackwalk /tmp/tmpfGDD_b/0917ee3f-096a-fd7e-166d-2a02ba4c0d50.dmp /tmp/tmpv87zw5
[task 2021-02-03T14:59:18.017Z] 14:59:18 INFO - mozcrash Saved minidump as /builds/worker/workspace/build/blobber_upload_dir/0917ee3f-096a-fd7e-166d-2a02ba4c0d50.dmp
[task 2021-02-03T14:59:18.017Z] 14:59:18 INFO - mozcrash Saved app info as /builds/worker/workspace/build/blobber_upload_dir/0917ee3f-096a-fd7e-166d-2a02ba4c0d50.extra
[task 2021-02-03T14:59:18.027Z] 14:59:18 WARNING - PROCESS-CRASH | dom/base/test/test_eventsourceservice_worker.html | application crashed [@ mozilla::dom::EventSourceImpl::AssertIsOnTargetThread() const]
[task 2021-02-03T14:59:18.027Z] 14:59:18 INFO - Mozilla crash reason: MOZ_DIAGNOSTIC_ASSERT(IsTargetThread())
[task 2021-02-03T14:59:18.028Z] 14:59:18 INFO - Crash dump filename: /tmp/tmpfGDD_b/0917ee3f-096a-fd7e-166d-2a02ba4c0d50.dmp
[task 2021-02-03T14:59:18.028Z] 14:59:18 INFO - Operating system: Android
[task 2021-02-03T14:59:18.028Z] 14:59:18 INFO - 0.0.0 Linux 3.10.0+ #260 SMP PREEMPT Fri May 19 12:48:14 PDT 2017 x86_64
[task 2021-02-03T14:59:18.028Z] 14:59:18 INFO - CPU: amd64
[task 2021-02-03T14:59:18.028Z] 14:59:18 INFO - family 6 model 6 stepping 3
[task 2021-02-03T14:59:18.028Z] 14:59:18 INFO - 4 CPUs
[task 2021-02-03T14:59:18.028Z] 14:59:18 INFO - GPU: UNKNOWN
[task 2021-02-03T14:59:18.028Z] 14:59:18 INFO - Crash reason: SIGSEGV /SEGV_MAPERR
[task 2021-02-03T14:59:18.029Z] 14:59:18 INFO - Crash address: 0x0
[task 2021-02-03T14:59:18.029Z] 14:59:18 INFO - Process uptime: not available
[task 2021-02-03T14:59:18.029Z] 14:59:18 INFO - Thread 11 (crashed)
[task 2021-02-03T14:59:18.029Z] 14:59:18 INFO - 0 libxul.so!mozilla::dom::EventSourceImpl::AssertIsOnTargetThread() const [EventSource.cpp:d07913a1ae00f252009bd9d353fc74c6f54d62e3 : 145 + 0x11]
[task 2021-02-03T14:59:18.029Z] 14:59:18 INFO - rax = 0x000076e8c97ab271 rdx = 0x0000000000000001
[task 2021-02-03T14:59:18.029Z] 14:59:18 INFO - rcx = 0x000076e8cc073dd8 rbx = 0x000076e8bc4fec80
[task 2021-02-03T14:59:18.029Z] 14:59:18 INFO - rsi = 0x000076e8bc4fec80 rdi = 0x0000000000000016
[task 2021-02-03T14:59:18.030Z] 14:59:18 INFO - rbp = 0x000076e8cc979a20 rsp = 0x000076e8cc979a20
[task 2021-02-03T14:59:18.030Z] 14:59:18 INFO - r8 = 0x0000000000000000 r9 = 0x0000000000000009
[task 2021-02-03T14:59:18.030Z] 14:59:18 INFO - r10 = 0xfffffffffffff099 r11 = 0x0000000000000000
[task 2021-02-03T14:59:18.030Z] 14:59:18 INFO - r12 = 0x000076e8cc979b40 r13 = 0x000076e8cc979aa8
[task 2021-02-03T14:59:18.030Z] 14:59:18 INFO - r14 = 0x000076e8cc979aa8 r15 = 0x000076e8bc4fec80
[task 2021-02-03T14:59:18.030Z] 14:59:18 INFO - rip = 0x000076e8c502531c
[task 2021-02-03T14:59:18.030Z] 14:59:18 INFO - Found by: given as instruction pointer in context

Target Milestone: 86 Branch → ---
Flags: needinfo?(bugs)

It can be closed now, thanks for reminding me.

Status: REOPENED → RESOLVED
Closed: 5 years ago5 years ago
Flags: needinfo?(ytausky)
Keywords: leave-open
Resolution: --- → FIXED

Do you want to uplift for 86?

Flags: needinfo?(ytausky)
Target Milestone: --- → 87 Branch

Yes, I'll request an uplift.

Flags: needinfo?(ytausky)

This'll need some rebasing for ESR78.

Flags: needinfo?(ytausky)

As part of a security bug pattern analysis, we are requesting your help with a high level analysis of this bug. It is our hope to develop static analysis (or potentially runtime/dynamic analysis) in the future to identify classes of bugs.

Please visit this google form to reply.

Flags: needinfo?(ytausky)
Whiteboard: [sec-survey]

I squashed all the relevant patches into this one.

Flags: needinfo?(ytausky)

I answered the survey.

Flags: needinfo?(ytausky)
QA Whiteboard: [post-critsmash-triage]
Flags: qe-verify-
Whiteboard: [sec-survey] → [sec-survey][adv-main86+r]
Whiteboard: [sec-survey][adv-main86+r] → [sec-survey][adv-main86+r][adv-esr78.8+r]
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: