Closed Bug 1493591 Opened 6 years ago Closed 5 years ago

Crash in mozilla::dom::WorkerPrivate::~WorkerPrivate

Categories

(Core :: DOM: Service Workers, defect, P1)

Unspecified
Windows 10
defect

Tracking

()

RESOLVED WORKSFORME
Tracking Status
firefox-esr60 --- unaffected
firefox62 --- unaffected
firefox63 + wontfix
firefox64 + wontfix

People

(Reporter: gsvelto, Assigned: ytausky)

References

Details

(4 keywords, Whiteboard: DWS_NEXT)

Crash Data

Attachments

(1 obsolete file)

This bug was filed from the Socorro interface and is
report bp-b4df91d0-9d5c-401a-9eaf-253cf0180923.
=============================================================

Top 10 frames of crashing thread:

0 xul.dll mozilla::dom::WorkerPrivate::~WorkerPrivate dom/workers/WorkerPrivate.cpp:2757
1 xul.dll void mozilla::dom::WorkerPrivate::ClearSelfAndParentEventTargetRef dom/workers/WorkerPrivate.h:154
2 xul.dll ?Run@TopLevelWorkerFinishedRunnable@?A0x30C6BC27@dom@mozilla@@EEAA?AW4nsresult@@XZ$37675862552aedd537fb67d4bee880bd dom/workers/WorkerPrivate.cpp:301
3 xul.dll mozilla::ThrottledEventQueue::Inner::Executor::Run xpcom/threads/ThrottledEventQueue.cpp:72
4 xul.dll mozilla::ThrottledEventQueue::Inner::Executor::Run xpcom/threads/ThrottledEventQueue.cpp:72
5 xul.dll mozilla::SchedulerGroup::Runnable::Run xpcom/threads/SchedulerGroup.cpp:337
6 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1166
7 xul.dll NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:519
8 xul.dll mozilla::ipc::MessagePump::Run ipc/glue/MessagePump.cpp:97
9 xul.dll MessageLoop::RunHandler ipc/chromium/src/base/message_loop.cc:318

=============================================================

This looks like a new crash. The first instance seems to have appeared with build 20180918075510.
This signature just appeared in 63.0b13 with 12 crashes.
The pushlog for this build is:
https://hg.mozilla.org/releases/mozilla-beta/pushloghtml?fromchange=0082062b759f&tochange=1e0b0e805d0a

:baku, could you investigate please ?
Flags: needinfo?(amarchesini)
Keywords: regression
Crash Signature: [@ mozilla::dom::WorkerPrivate::~WorkerPrivate] → [@ mozilla::dom::WorkerPrivate::~WorkerPrivate] [@ @0x0 | mozilla::dom::WorkerPrivate::~WorkerPrivate]
Whiteboard: DWS_NEXT
Group: dom-core-security
Marking as security sensitive since during Triage today we noticed some possible UAF signatures.
Sec high, readjusting priorities. 

Yaron, can you give this a look? Make it a top priority.
Flags: needinfo?(ytausky)
Assignee: nobody → ytausky
Priority: P2 → P1
I'm assuming that "e5 UAF" refers to the address of the crashing use-after-free access?

Anyway, it looks like it happens in an inlined destructor of one of the members of WorkerPrivate. Unfortunately, that class is gigantic, so it's hard to guess which one it is. I'll see if I can figure it out from the executable.
Flags: needinfo?(ytausky)
This report suggests that the problem is a mismanaged refcount: https://crash-stats.mozilla.com/report/index/534b24dc-47d8-4728-892f-fefdc0181025
Status: NEW → ASSIGNED
The crash occurs at the end of this snippet (macOS build):
0x1106884de <+526>:  cmpq   $0x0, 0x488(%r14)
0x1106884e6 <+534>:  je     0x21134f8                 ; <+552>
0x1106884e8 <+536>:  leaq   0x488(%r14), %rax
0x1106884ef <+543>:  movq   (%rax), %rdi
0x1106884f2 <+546>:  movq   (%rdi), %rax
0x1106884f5 <+549>:  callq  *0x8(%rax)

We have a virtual function call through an address stored 0x488 bytes into a WorkerPrivate object. The next step would be to figure out which one that is.
I couldn't find out how to get the debug symbols for this executable, so I built the the same revision locally, hoping for the best. In debug mode I'm not getting the same offsets as in the released executable, so I'm guessing a bit here, but if the offsets are biased uniformly (0x18 bytes), the offending member is WorkerPrivate::mPerformanceStorage.

I'll try to see what I can get from a local release build, but in the meantime I noticed something strange about that member: it's accessed both from the main thread [1] and from the worker's parent thread [2]. I don't see any synchronization around those accesses, so it's possible that it's a race condition. Such a race condition could cause the crash seen here.

[1] https://searchfox.org/mozilla-central/source/dom/workers/WorkerPrivate.cpp#5399
[2] https://searchfox.org/mozilla-central/source/dom/workers/WorkerPrivate.h#154
Another thing that seems racy: mPerformanceCounter is accessed from the main thread without synchronization [1], yet it's also accessible from any other thread with synchronization [2]. This might cause a similar crash in WorkerPrivate::~WorkerPrivate.

I think we need a more systematic approach to fixing this problem, especially since WorkerPrivate is so big. I would suggest wrapping data with an object that only grants access to them if a certain condition holds, e.g. a lock is held or the current thread is the expected one.
As for testing, I don't know of any reliable way to cause a race, so testing that it doesn't happen any more is similarly problematic.

[1] https://searchfox.org/mozilla-central/source/dom/workers/WorkerDebugger.cpp#519
[2] https://searchfox.org/mozilla-central/source/dom/workers/WorkerThread.cpp#165
I dug a little deeper while trying to implement a solution and found out that I was previously wrong: while accesses to mPerformanceStorage are not synchronized by the caller, it only gets assigned objects of type PerformanceStorageWorker [1], which have a thread-safe refcount [2].

I'm going to take a step back and look again at all the places this member is used in.

[1] https://searchfox.org/mozilla-central/source/dom/workers/WorkerPrivate.cpp#3487
[2] https://searchfox.org/mozilla-central/source/dom/performance/PerformanceStorageWorker.h#23
Just a note: mPerformanceStorage is *still* unsafe because there are raw pointers to it all over the place, so the assumption that refcount = 0 means the object is safe to deallocate does not hold. I just don't see how it could cause this particular crash, because the refcount itself seems to be protected, and WorkerPrivate stores its pointer in a RefPtr. The crash happens when trying to call a virtual function, so the memory must have been deallocated before WorkerPrivate::~WorkerPrivate touches the refcount itself.
See also bug 1502419 which I just filed
By sharing a raw pointer to a refcounted object it's possible for it to get
used after its destruction. It's not clear if that is really the case here,
but it doesn't seem like it would hurt performance.
Yaron: Status on this?
Flags: needinfo?(ytausky)
After the initial findings I couldn't find any smoking gun. To recap, this looks like a double destruction (which makes it a use-after-free), so the real problem is likely just before the first destruction, which is no longer visible in a stack trace.

To mitigate the problem, in addition to the patch on this bug, I also opened bug 1504638. (That's just a step in a journey towards stricter enforcement of threading restrictions, so it won't solve the problem either.) I'm pretty sure that improper refcounting somewhere along the chain is the root cause here, but unfortunately we currently don't have tools to enforce it automatically.
My next step here would be to improve the refcounting situation and see if it helps.
Flags: needinfo?(ytausky)
I am trying to see if I can get some lead from TSan. So far I'm just trying to weed out what looks like false positives.
:mrbkap, could you please share your opinion about this crash? So far I wasn't able to find a definitive cause, so unless there's some new trail, I would mark this as stalled. (There are general safety improvement that can be pursued and would hopefully eventually catch this crash, but I wouldn't do them as part of this bug since they're indirect.)
Flags: needinfo?(mrbkap)
This is clearly some type of regression, and also is fairly high volume.

Pulling the dump tells me that this->mPerformanceStorage.mRawPtr = 0xe5e5e5e5
Looking at the reports, I don't see any crashes in Nightly (that is 65.0a) or anything after 64b3. Am I misinterpreting crash-stats? I find it interesting that in the regression range reported in comment 1 contains be85aa71510b (bug 1493629, which I can't access) and the range of fixes between 64b3 and b4 includes 3cf4f972200c (bug 1498510, also inaccessible). Both patches are touching the same code and the CSP object is just before mPerformanceStorage in WorkerPrivate.
Flags: needinfo?(mrbkap)
Crashes stopped with 63.0.3 so maybe the signature morphed but I couldn't find a similar signature with any volume of crashes in 63.03. Given the drop, the fact that the bug is stalled, that we don't have drivers for a new dot release and that we ship 64 in 3 weeks, this is wontfix for 63.
Keywords: stalled
Flags: needinfo?(amarchesini)

I suggest we close this; we've had one 67 crash, but clearly something changed after 63. The remaining crash may not be related to the original reason this bug was filed, or may be a more random trashing or memory error.

Flags: needinfo?(ytausky)

I agree, let's close this. If someone finds this bug because this function starts crashing again, please open a new bug.

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Flags: needinfo?(ytausky)
Resolution: --- → WORKSFORME
See Also: → 1539508
Group: dom-core-security
Attachment #9021150 - Attachment is obsolete: true
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: