Closed Bug 1371203 Opened 5 years ago Closed 5 years ago

AddressSanitizer: heap-use-after-free nsCOMPtr.h:834:7 in nsCOMPtr

Categories

(Core :: Networking: HTTP, defect)

defect
Not set
normal

Tracking

()

VERIFIED FIXED
mozilla55
Tracking Status
firefox-esr52 --- unaffected
firefox53 --- unaffected
firefox54 --- unaffected
firefox55 + fixed

People

(Reporter: cbook, Assigned: schien)

References

(Blocks 1 open bug, )

Details

(4 keywords, Whiteboard: [post-critsmash-triage])

Attachments

(2 files, 1 obsolete file)

Attached file asan stack
Found via bughunter and seems not reproducible on fedora but seems on ubuntu

Steps to reproduce:

(Warning NSFW URL!!!)
-Load-> 
http://www.xvideos.com/video18880039/japinhas_putas_trocando_esperma_de_boca_-_cumswap_japanese_asian_chicks_60fps?pl=16226303&plname=asian_bang 

i guess the crash might have to do with this video playback maybe


SUMMARY: AddressSanitizer: heap-use-after-free /home/worker/workspace/build/src/obj-firefox/dist/include/nsCOMPtr.h:834:7 in nsCOMPtr
Shadow bytes around the buggy address:
  0x0c388000afc0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c388000afd0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c388000afe0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c388000aff0: fd fd fd fd fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c388000b000: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x0c388000b010: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd[fd]
  0x0c388000b020: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c388000b030: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c388000b040: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c388000b050: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c388000b060: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Heap right redzone:      fb
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack partial redzone:   f4
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==13782==ABORTING
Stack involves HTTP.
Group: core-security → network-core-security
Component: General → Networking: HTTP
:sc, could this be caused by the PBg patches?
Flags: needinfo?(schien)
@tomcat, which version does this bug report on?
Flags: needinfo?(cbook)
bc: do you know the version number of the ubuntu maschines in production ?
Flags: needinfo?(cbook) → needinfo?(bob)
Additional question, is this bug filed on Firefox 55?
Oh, I see PBackgroundHttp related class in the call stack so it must be Firefox 55.
Flags: needinfo?(schien)
From the free call stack

>    #0 0x4bb62b in __interceptor_free /builds/slave/moz-toolchain/src/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:47:3
>    #1 0x7f9c14223db1 in mozilla::net::HttpChannelChild::Release() /home/worker/workspace/build/src/netwerk/protocol/http/HttpChannelChild.cpp:232:5
>    #2 0x7f9c147f8a2e in DestroySubtree /home/worker/workspace/build/src/obj-firefox/ipc/ipdl/PHttpBackgroundChannelChild.cpp:417:5
>    #3 0x7f9c147f8a2e in mozilla::net::PHttpBackgroundChannelChild::OnMessageReceived(IPC::Message const&) /home/worker/workspace/build/src/obj-firefox/ipc/ipdl/PHttpBackgroundChannelChild.cpp:370
>    #4 0x7f9c1470043f in mozilla::ipc::PBackgroundChild::OnMessageReceived(IPC::Message const&) /home/worker/workspace/build/src/obj-firefox/ipc/ipdl/PBackgroundChild.cpp:1608:28
>    #5 0x7f9c14651614 in mozilla::ipc::MessageChannel::DispatchAsyncMessage(IPC::Message const&) /home/worker/workspace/build/src/ipc/glue/MessageChannel.cpp:2075:25
>    #6 0x7f9c1464e257 in mozilla::ipc::MessageChannel::DispatchMessage(IPC::Message&&) /home/worker/workspace/build/src/ipc/glue/MessageChannel.cpp:2001:17
>    #7 0x7f9c1464ff64 in mozilla::ipc::MessageChannel::RunMessage(mozilla::ipc::MessageChannel::MessageTask&) /home/worker/workspace/build/src/ipc/glue/MessageChannel.cpp:1870:5

https://searchfox.org/mozilla-central/rev/a798ee4fc323f9387b7576dbed177859d29d09b7/netwerk/protocol/http/HttpBackgroundChannelChild.cpp#436
Should be the code cause the release of HttpChannelChild, which means it enters normal channel release procedure. So, the problem is not on the free part.

So the root cause is that there is a ChannelEventQueue::CompleteResume dispatched and channel release procedure kicks in before the CompleteResuem is executed. ChannelEventQueue is kept alive because the RunnableMethod holds the last strong reference.

This bug should be introduced by bug 1320744.
If the HttpChannelChild::Release happens on the same thread as the ChannelEventQueue::CompleteResume, we can simply null the mOwner and check the nullity before use.

However, if the HttpChannelChild::Release happens on different thread, there is no way you can guarantee that mOwner is not half way of the destruction. The only way is to hold a strong reference along with the CompleteResume runnable.

@mayhemer, do you think the solution above is the correct way to solve this bug?
Flags: needinfo?(honzab.moz)
Attached patch test.patch (obsolete) — Splinter Review
here is the corresponding patch for comment #8.
Assignee: nobody → schien
Status: NEW → ASSIGNED
Attachment #8876095 - Flags: feedback?(honzab.moz)
Very good catch.

I just hope this is the only place where this in ChannelEventQueue:

  // Keep ptr to avoid refcount cycle: only grab ref during flushing.
  nsISupports *mOwner;

pays off like that...

Should we do an audit?  I'd like to make sure this can't happen in the future.  Some redesign would be nice (in a followup).
Flags: needinfo?(honzab.moz)
Comment on attachment 8876095 [details] [diff] [review]
test.patch

Review of attachment 8876095 [details] [diff] [review]:
-----------------------------------------------------------------

If your analyzes is correct, then this should solve this crash.

::: netwerk/ipc/ChannelEventQueue.cpp
@@ +142,5 @@
>        mSuspended = false;
>        return;
>      }
>  
> +    // Hold a strong reference of mOwner to avoid channel released

'to avoid the channel release'

@@ +143,5 @@
>        return;
>      }
>  
> +    // Hold a strong reference of mOwner to avoid channel released
> +    // before CompleteResume is executed.

'was executed'
Attachment #8876095 - Flags: feedback?(honzab.moz) → feedback+
(In reply to Honza Bambas (:mayhemer) from comment #10)
> Some redesign would be nice (in a followup).

or at least make sure that _ANY_ addref(release) on the ChannelEventQueue object also does a corresponding addref(release) on its mOwner
(In reply to Honza Bambas (:mayhemer) from comment #12)
> (In reply to Honza Bambas (:mayhemer) from comment #10)
> > Some redesign would be nice (in a followup).
> 
> or at least make sure that _ANY_ addref(release) on the ChannelEventQueue
> object also does a corresponding addref(release) on its mOwner

Three places that will touch mOwner:

1) ChannelEventQueue constructor:
  ChannelEventQueue is created as a member of the owner/channel object. No life cycle issue at this point.

2) ChannelEventQueue::RunOrEnqueue:
  This function is all called directly by IPC message handler of the owner/channel object. No life cycle issue at this point.

3) ChannelEventQueue::FlushQueue:
  Maybe reached by two path:
    a) ChannelEventQueue::CompleteResume, which will be asynchronously called by dispatching a runnable method. Will be fixed by this patch.
    b) AutoEventEnqueuer destructor, an RAII helper which used by owner/channel object. If owner/channel object is released on the stack, we might hit the same issue. Need to hold a strong reference of mOwner as well.
Attached patch fix.patchSplinter Review
hold strong reference in AutoEventEnqueuer as well.
Attachment #8876153 - Flags: review?(honzab.moz)
Attachment #8876095 - Attachment is obsolete: true
@tomcat, is it possible to ask bug reporter to help verify my patch?
My ubuntu machine uses corporation network which is not allowed to access the test URL.
Flags: needinfo?(cbook)
(In reply to Shih-Chiang Chien [:schien] (UTC+8) (use ni? plz) from comment #13)
> (In reply to Honza Bambas (:mayhemer) from comment #12)
> > (In reply to Honza Bambas (:mayhemer) from comment #10)
> > > Some redesign would be nice (in a followup).
> > 
> > or at least make sure that _ANY_ addref(release) on the ChannelEventQueue
> > object also does a corresponding addref(release) on its mOwner
> 
> Three places that will touch mOwner:
> 
> 1) ChannelEventQueue constructor:
>   ChannelEventQueue is created as a member of the owner/channel object. No
> life cycle issue at this point.
> 
> 2) ChannelEventQueue::RunOrEnqueue:
>   This function is all called directly by IPC message handler of the
> owner/channel object. No life cycle issue at this point.
> 
> 3) ChannelEventQueue::FlushQueue:
>   Maybe reached by two path:
>     a) ChannelEventQueue::CompleteResume, which will be asynchronously
> called by dispatching a runnable method. Will be fixed by this patch.
>     b) AutoEventEnqueuer destructor, an RAII helper which used by
> owner/channel object. If owner/channel object is released on the stack, we
> might hit the same issue. Need to hold a strong reference of mOwner as well.

Those are good, but it's not exactly what I wanted.  You check on access to mOwner.  I wanted to check access to |this| - when |this| is somehow addreffed then mOwner must be as well, for the same period.

It's a coincidence only that AutoEventEnqueuer touches mOwner.  It also refers the queue itself.  So the fix is correct, but from a different reason.
Comment on attachment 8876153 [details] [diff] [review]
fix.patch

Review of attachment 8876153 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks!  This LGTM.
Attachment #8876153 - Flags: review?(honzab.moz) → review+
Ubuntu 16.04.2 LTS
Flags: needinfo?(bob)
(In reply to Shih-Chiang Chien [:schien] (UTC+8) (use ni? plz) from comment #7)
> So the root cause is that there is a ChannelEventQueue::CompleteResume
> dispatched and channel release procedure kicks in before the CompleteResuem
> is executed. 

Could ~nsHttpChannel (last Release()) be called on thread 1 just before thread 2 executes mOwner->AddRef in ChannelEventQueue::ResumeInternal?

If so, this is still not fixed and we have "reference count atomicity illusion problem" as described at https://www.janbambas.cz/illusion-atomic-reference-counting/, with some abstraction to map on what I describe above.

The situation I describe would not be possible if we were 100% sure that all calls to AddRef() on a ChannelEventQueue object were always preceded by a call to mOwner->AddRef() on the same thread.  Are we sure?
Flags: needinfo?(schien)
https://hg.mozilla.org/mozilla-central/rev/230564c469b2
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
(In reply to Honza Bambas (:mayhemer) from comment #20)
> (In reply to Shih-Chiang Chien [:schien] (UTC+8) (use ni? plz) from comment
> #7)
> > So the root cause is that there is a ChannelEventQueue::CompleteResume
> > dispatched and channel release procedure kicks in before the CompleteResuem
> > is executed. 
> 
> Could ~nsHttpChannel (last Release()) be called on thread 1 just before
> thread 2 executes mOwner->AddRef in ChannelEventQueue::ResumeInternal?
> 
> If so, this is still not fixed and we have "reference count atomicity
> illusion problem" as described at
> https://www.janbambas.cz/illusion-atomic-reference-counting/, with some
> abstraction to map on what I describe above.
> 
> The situation I describe would not be possible if we were 100% sure that all
> calls to AddRef() on a ChannelEventQueue object were always preceded by a
> call to mOwner->AddRef() on the same thread.  Are we sure?

This is guaranteed in current code base since
1) RunOrEnqueue will hold additional reference of mOwner before ChannelEventQueue::ResumeInternal
2) FlushQueue will hold additional reference mOwner (which is now guaranteed to be lived) before next ResumeInternal cycle.
   The first additional reference is added on the call stack of Resume, which is either by RunOrEnqueue or directly mEventQueue->Resume by the HttpChannelChild.
3) AutoEventEnqueuer is called by mOwner, I don't see any "this" pointer use-after-free issue reported by HttpChannelChild.
Flags: needinfo?(schien)
(In reply to Shih-Chiang Chien [:schien] (UTC+8) (use ni? plz) from comment #15)
> @tomcat, is it possible to ask bug reporter to help verify my patch?
> My ubuntu machine uses corporation network which is not allowed to access
> the test URL.

Bob: can we let the production bughunter that detected this orginally verify this issue ? thanks!
Flags: needinfo?(cbook) → needinfo?(bob)
I ran this url 30 times through bughunter and could not reproduce though we only had the original crash to begin with. As far as I can tell with the intermittent nature of this bug, it seems resolved.
Flags: needinfo?(bob)
I also ran this on a vm from the command line loading the url 100 times and did not reproduce.
Group: network-core-security → core-security-release
Marking verified as per comment 24 and comment 25.
Status: RESOLVED → VERIFIED
Whiteboard: [post-critsmash-triage]
Group: core-security-release

Bugbug thinks this bug is a regression, but please revert this change in case of error.

Keywords: regression
You need to log in before you can comment on or make changes to this bug.