Closed Bug 1440166 Opened 4 years ago Closed 4 years ago

AddressSanitizer: heap-use-after-free xpcom/threads/nsThreadUtils.cpp:50:1 in mozilla::Runnable::Release()

Categories

(Core :: XPCOM, defect)

58 Branch
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla61
Tracking Status
firefox61 --- fixed

People

(Reporter: bc, Assigned: froydnj)

References

(Blocks 1 open bug, )

Details

(Keywords: csectype-uaf, Whiteboard: [debug only])

Attachments

(7 files)

Attached file debug asan report
1. https://moonwars.drakemoon.com/
   Linux x86_64 debug asan build on Fedora 27 4.15.3
   may need to try multiple times or reload

2. ==5599==ERROR: AddressSanitizer: heap-use-after-free on address 0x6070005942e0 at pc 0x7f4f91eff4eb bp 0x7f4f7c56bf30 sp 0x7f4f7c56bf28
READ of size 8 at 0x6070005942e0 thread T26 (Softwar~cThread)
    #0 0x7f4f91eff4ea in mozilla::Runnable::Release() /builds/worker/workspace/build/src/xpcom/threads/nsThreadUtils.cpp:50:1
    #1 0x7f4f987054fa in mozilla::layout::VsyncParent::NotifyVsync(mozilla::TimeStamp) /builds/worker/workspace/build/src/layout/ipc/VsyncParent.cpp:60:1
    #2 0x7f4f987059bc in non-virtual thunk to mozilla::layout::VsyncParent::NotifyVsync(mozilla::TimeStamp) /builds/worker/workspace/build/src/layout/ipc/VsyncParent.cpp
    #3 0x7f4f97b65105 in mozilla::RefreshTimerVsyncDispatcher::NotifyVsync(mozilla::TimeStamp) /builds/worker/workspace/build/src/widget/VsyncDispatcher.cpp:117:29
    #4 0x7f4f941538b6 in mozilla::gfx::VsyncSource::Display::NotifyVsync(mozilla::TimeStamp) /builds/worker/workspace/build/src/gfx/thebes/VsyncSource.cpp:68:33
    #5 0x7f4f941532a2 in SoftwareDisplay::NotifyVsync(mozilla::TimeStamp) /builds/worker/workspace/build/src/gfx/thebes/SoftwareVsyncSource.cpp:108:12
This seems like more of a layout bug.
Group: core-security → layout-core-security
Component: XPCOM → Layout
Well, the actual code involved is pretty bog standard IPDL runnable stuff, AFAICT.

Any ideas, Nathan? From the stacks, it looks like this is a runnable because created on the vsync thread, which gets destroyed on the IPDL background thread, then we try to release it on the vsync thread. The only way I can imagine that happening is if the refcounting is messed up somehow. But this looks like a super common pattern, so I don't know what could be going wrong.
Flags: needinfo?(nfroyd)
pseudo-stacks for the top 5 frames and counts for all of the instances of this error that I have seen in total.

They are mostly in 

mozilla::Runnable::Release() nsThreadPool::Run() non-virtual thunk to nsThreadPool::Run() nsThread::ProcessNextEvent(bool, bool*) NS_ProcessNextEvent(nsIThread*, bool)

while only a few have

mozilla::Runnable::Release() mozilla::layout::VsyncParent::NotifyVsync(mozilla::TimeStamp) non-virtual thunk to mozilla::layout::VsyncParent::NotifyVsync(mozilla::TimeStamp) 

See bug 1416529 comment 7 for an example where I found mozilla::net::CacheIOThread::ThreadFunc() on the stack at https://ngodung.tk/.

Even though I have retested these urls multiple times, Bughunter has for the most part only been able to hit the error once per url though I have a handful of urls where it was reproduced 2-3 times. I'll look for better reproducible instances and update here when I find one.
I have no idea what's going on here:

  // Called on hardware vsync thread. We should post to current ipc thread.
  MOZ_ASSERT(!IsOnBackgroundThread());
  nsCOMPtr<nsIRunnable> vsyncEvent =
    NewRunnableMethod<TimeStamp>("layout::VsyncParent::DispatchVsyncEvent",
                                 this,
                                 &VsyncParent::DispatchVsyncEvent,
                                 aTimeStamp);
  MOZ_ALWAYS_SUCCEEDS(mBackgroundThread->Dispatch(vsyncEvent, NS_DISPATCH_NORMAL));

1. We create a runnable method, taking a strong reference to VsyncParent.
2. We addref the runnable.
3. We dispatch the runnable, which means that the thread's event queue takes (another) strong reference to the runnable.
4. We release the runnable.

According to the stack in comment 0, step 4 means that we're touching freed memory?  The only plausible way that could happen is if somebody dropped the strong reference the event queue takes (fine, that happens when the runnable actually runs, as one can see from the ASan log) *and* dropped another (non-existent) strong reference as well.  But soooo many things would break elsewhere if that happened that I have a hard time imagining a plausible scenario for that.

We can't have both threads being in Runnable::Release() and attempting to delete the runnable simultaneously, because the runnable's reference count is atomic...if we did, all sorts of things would be broken.

Maybe the compiler deferred the AddRef() of `vsyncEvent` until *after* the Dispatch()?  But I don't see how the compiler could be permitted to do that...

Comment 0 claims this is a *debug* build, which suggests that perhaps the Release() in debug builds is doing bad things?  Debug-only logging, or something like that?  But then I'd expect that normal ASan runs (even our test suite) would show badness like this in dozens of other places.

I'm pretty mystified.
Group: layout-core-security → dom-core-security
Component: Layout → XPCOM
Keywords: sec-high
I can provide full logs of instances where this reproduced if that would help. Just let me know.
(In reply to Bob Clary [:bc:] from comment #5)
> I can provide full logs of instances where this reproduced if that would
> help. Just let me know.

What do full logs include that the logs you have already posted do not?
Flags: needinfo?(nfroyd) → needinfo?(bob)
Normal Gecko debug spew. Example mozilla::VsyncRefreshDriverTimer::RefreshDriverVsyncObserver::NotifyVsync related stack.

https://www.huffingtonpost.com/entry/parkland-florida-school-shooting_us_5a849439e4b0774f31d19627
Flags: needinfo?(bob)
Attached file youtube debug log
https://www.youtube.com/watch?v=au54r1157T0 is an example of cases where I see a number of INFO:audioipc_client::stream messages. This does not have the vsync frames but does have Error: PLayerTransaction::Msg_ReleaseLayer Processing error: message was deserialized, but the handler returned false (indicating failure)
Attached file moonwars debug log
moonwars debug log from a local run. Note I am using marionette to launch the browser. Maybe that is part of the issue?
These all seem to follow the form of:

> Dispatch(nsIRunnable*)

rather than

> Dispatch(already_AddRefed<nsIRunnable>)


So usually we hold a ref while dispatching rather than letting it go. It feels like there's some sort of double-release going on in the Dispatch machinery, perhaps in a failure path. Converting these *might* shift that failures to the problem spot.

[1] https://searchfox.org/mozilla-central/rev/9a8d3f8191b15cdca89a7ce044c7bea2dd0462dc/netwerk/cache2/CacheFileIOManager.cpp#1521
[2] https://searchfox.org/mozilla-central/rev/9a8d3f8191b15cdca89a7ce044c7bea2dd0462dc/layout/ipc/VsyncParent.cpp#58
[3] https://searchfox.org/mozilla-central/rev/9a8d3f8191b15cdca89a7ce044c7bea2dd0462dc/netwerk/protocol/http/nsHttpConnectionMgr.cpp#306
[4] https://searchfox.org/mozilla-central/rev/9a8d3f8191b15cdca89a7ce044c7bea2dd0462dc/storage/StorageBaseStatementInternal.cpp#130
[5] https://searchfox.org/mozilla-central/rev/bd05e3853c6e982e2a35c1cc404b987b2bc914d6/netwerk/base/nsTransportUtils.cpp#121
[6] https://searchfox.org/mozilla-central/rev/bd05e3853c6e982e2a35c1cc404b987b2bc914d6/security/manager/ssl/SSLServerCertVerification.cpp#1538
[7] https://searchfox.org/mozilla-central/rev/9a8d3f8191b15cdca89a7ce044c7bea2dd0462dc/netwerk/cache2/CacheFileIOManager.cpp#1521
[8] https://searchfox.org/mozilla-central/rev/9a8d3f8191b15cdca89a7ce044c7bea2dd0462dc/netwerk/cache2/CacheFileIOManager.cpp#3704
[9] https://searchfox.org/mozilla-central/rev/bd05e3853c6e982e2a35c1cc404b987b2bc914d6/dom/quota/QuotaManagerService.cpp#276
[10] https://searchfox.org/mozilla-central/rev/bd05e3853c6e982e2a35c1cc404b987b2bc914d6/xpcom/threads/nsThread.cpp#462
[11] https://searchfox.org/mozilla-central/rev/bd05e3853c6e982e2a35c1cc404b987b2bc914d6/netwerk/base/nsSocketTransport2.cpp#1021
[12] https://searchfox.org/mozilla-central/rev/bd05e3853c6e982e2a35c1cc404b987b2bc914d6/xpcom/io/nsStreamUtils.cpp#486
[13] https://searchfox.org/mozilla-central/rev/bd05e3853c6e982e2a35c1cc404b987b2bc914d6/storage/mozStorageConnection.cpp#1485
[14] https://searchfox.org/mozilla-central/rev/9a8d3f8191b15cdca89a7ce044c7bea2dd0462dc/netwerk/protocol/http/nsHttpConnectionMgr.cpp#306
[15] https://searchfox.org/mozilla-central/rev/bd05e3853c6e982e2a35c1cc404b987b2bc914d6/toolkit/components/url-classifier/nsUrlClassifierProxies.cpp#33
I like that idea, but Dispatch(nsIRunnable*) is literally just:

    nsresult Dispatch(nsIRunnable* aEvent, uint32_t aFlags) {
      return Dispatch(nsCOMPtr<nsIRunnable>(aEvent).forget(), aFlags);
    }

Hardly anything exciting there.

I guess maybe we could just pass in an already_AddRefed thing to Dispatch and then the release-after-Run() inside nsThread would take care of the lone reference, and we wouldn't have to worry about underflow.

I'd *really* like to understand how this error is coming about.  In limited testing with a Valgrind-enabled build, I wasn't able to trigger the use-after-free from comment 0.  It could be linked to marionette, I guess (I wasn't using marionette), but again, I have trouble seeing how that happens.
Attached file moonwars.txt
Note I've found that the follow reproduces the one of the failure modes reliably:

1. https://moonwars.drakemoon.com/ in debug asan build
2. press Ctrl-Shift-K to open Web Console.
3. enter window.open() in the input to open a new tab.
4. press Ctrl-Shift-K in the new tab to open a Web Console.
5. enter setInterval('opener.document.location.reload()', 30000);
6. wait for the crash.

Doesn't seem to affect opt asan builds at all.
I can confirm that bc's instructions reproduce for me on Ubuntu 16.04; I had to wait a little longer (~20 mins?) than bc did on the original Fedora machine (5-10 mins).  I get the same stacks as comment 0, which completely mystifies me for all the reasons in comment 4.

I'm going to try to catch this in rr, and then see if poking at the runnable yields interesting information.
OK, I spent some quality time with rr today and I *believe* this is a race induced by ASan instrumentation, at least for the stack demonstrated by comment 0.  For code that looks like:

    nsCOMPtr<nsIRunnable> r = ...;
    mOtherThread->Dispatch(r, ...);

the sequence of problematic steps on the main thread (MT) and the off-main thread (OT) is:

1. MT: AddRef the runnable after its creation (refcnt 1)
2. MT: Dispatch the runnable (refcnt 2)
3. MT: Release the runnable (refcnt 2)
4. MT: Enter ThreadSafeAutoRefCnt::operator-- (refcnt 2)
5. MT: Atomically decrement the refcnt (refcnt 1)
6. MT: *Do not* return from operator--.  Get interrupted, and...
7. OT: Pull runnable out of the event queue (refcnt 1)
8. OT: Run runnable (refcnt 1)
9. OT: Release runnable (refcnt 0)
10. OT: Destroy runnable, running ~Runnable, and freeing the memory (refcnt 0).
11. OT: ASan will scribble appropriate bytes in its shadow memory denoting freed memory for the runnable.  However...
12. MT: Come back to Release() from ThreadSafeAutoRefCnt::operator--.
13. MT: We know that the refcnt was 1, so we don't have to destroy the runnable.  But...
14. MT: Check that the memory for the runnable is still valid (I'm not entirely sure why it does this, given that there are no *actual* memory references in Release() at this point.  But the fact is that it does, and then...)
15. MT: Crash the program because it looks like we were touching freed memory.

This sequence of steps neatly explains why we had to run the testcase in comment 12 so long--we're racing between the Release() on the main thread and the Release() on the off-main thread.  Also explains why things seem to be intermittent in comment 0.

It would be interesting to see what the instrumented code in non-debug ASan builds looks like and whether there's still the check at step 14 after the refcount has been atomically decremented, or whether this is just an artifact of debug builds (and whether this is potentially fixed in newer clangs?).
(In reply to Nathan Froyd [:froydnj] from comment #14)

> 14. MT: Check that the memory for the runnable is still valid (I'm not
> entirely sure why it does this, given that there are no *actual* memory
> references in Release() at this point.  But the fact is that it does, and
> then...)

I think this is the right question to ask.. Why is this check happening here? ASan does not check if memory is valid or not without an actual access being there. If that access is from us, then it is a bug in our codebase. So we need to find out why exactly there is something that causes this check to be present.

Maybe you could look at the decompilation of Release() in this case (maybe with and without ASan), to see if you can see the access itself that is happening here?
Oh, that's a good point, I should have looked a little farther and thought about this being a debug build.  The disassembly looks like:

   0x00007fea28b660a8 <+72>:	mov    %r14,%rdi
   0x00007fea28b660ab <+75>:	callq  0x7fea28948140 <mozilla::ThreadSafeAutoRefCnt::operator--()>
=> 0x00007fea28b660b0 <+80>:	mov    %rax,%r13
   0x00007fea28b660b3 <+83>:	cmpb   $0x0,0x7fff8000(%rbx)
   0x00007fea28b660ba <+90>:	jne    0x7fea28b661a3 <mozilla::Runnable::Release()+323>
   0x00007fea28b660c0 <+96>:	mov    0x20(%r15),%rdx
   0x00007fea28b660c4 <+100>:	mov    %r15,%rdi
   0x00007fea28b660c7 <+103>:	mov    %r13,%rsi
   0x00007fea28b660ca <+106>:	callq  0x7fea28a04520 <NS_LogRelease(void*, nsrefcnt, char const*)>

and the cmpb check at +83 is verifying that the access to mName for the call to NS_LogRelease is still valid.  Which it's not, necessarily, because that might be freed memory at this point.

Andrew, do you want to have a go at fixing that up?

Relatedly, if all the things bc have seen in this bug are debug-only, then I don't think there's any reason for this to be a security bug.  bc, is that a correct assumption?  It certainly seems to be so from the logs...
Flags: needinfo?(continuation)
Flags: needinfo?(bob)
It certainly seems that this is not security sensitive. I would really like a landed patch that I could use to re-test all of the urls where I have seen this, but that is probably not necessary. Clear the security group if you like.
Flags: needinfo?(bob)
Actually, the patch for this is not that hard.  I will fix things up.  ni? for mccr8 stays, but we'll repurpose it to making the bug not a security bug?
Assignee: nobody → nfroyd
Fun! Thanks for investigating and fixing this, Nathan.
Group: dom-core-security
Flags: needinfo?(continuation)
Keywords: sec-high
Whiteboard: [debug only]
Comment on attachment 8958804 [details] [diff] [review]
avoid races in refcount logging with named classes

Review of attachment 8958804 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/base/nsISupportsImpl.h
@@ +691,5 @@
> +// Use this macro to ensure that we have a local copy of the name prior to
> +// the refcnt decrement.  (We use a macro to make absolutely sure the name
> +// isn't loaded in builds where it wouldn't be used.)
> +#ifdef NS_BUILD_REFCNT_LOGGING
> +#define NS_LOAD_NAME_BEFORE_RELEASE(localname, _name) \

Having a macro feels like overkill to me, but I suppose it is fine.
Attachment #8958804 - Flags: review?(continuation) → review+
(In reply to Andrew McCreight [:mccr8] from comment #21)
> ::: xpcom/base/nsISupportsImpl.h
> @@ +691,5 @@
> > +// Use this macro to ensure that we have a local copy of the name prior to
> > +// the refcnt decrement.  (We use a macro to make absolutely sure the name
> > +// isn't loaded in builds where it wouldn't be used.)
> > +#ifdef NS_BUILD_REFCNT_LOGGING
> > +#define NS_LOAD_NAME_BEFORE_RELEASE(localname, _name) \
> 
> Having a macro feels like overkill to me, but I suppose it is fine.

I also wasn't sure if there would be unused variable warnings if the code was inserted directly into the named release macro...
Pushed by nfroyd@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/a3c51e3bb6d9
avoid races in refcount logging with named classes; r=mccr8
Pushed by nfroyd@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/4fcc2c3fa993
actually avoid races in refcount logging with named classes; rs=erahm
https://hg.mozilla.org/mozilla-central/rev/a3c51e3bb6d9
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla61
You need to log in before you can comment on or make changes to this bug.