Closed Bug 1124206 Opened 5 years ago Closed 5 years ago

crash in nsCOMPtr_base::assign_with_AddRef(nsISupports*) | nsCORSListenerProxy::OnStopRequest(nsIRequest*, nsISupports*, tag_nsresult)

Categories

(Core :: DOM: Core & HTML, defect, critical)

38 Branch
x86
Windows NT
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla39
Tracking Status
firefox37 --- unaffected
firefox38 + fixed
firefox39 + fixed

People

(Reporter: epinal99-bugzilla2, Assigned: smaug)

References

Details

(Keywords: crash, regression, reproducible)

Crash Data

Attachments

(2 files)

This bug was filed from the Socorro interface and is 
report bp-4be9d5aa-0f95-45d8-bb8b-7919d2150121.
=============================================================


Crash found when testing bug 1123375.

STR:
1) Open the Web Console in a new tab
2) Load http://84.200.10.32/BugzillaReport/halfsite/
3) Press fastly F5 and click each time on the OK button of the prompt "error loading video" until Nightly crashes (it takes 4 or 5 sec)

Result: crash in [@ nsCOMPtr_base::assign_with_AddRef(nsISupports*) | nsCORSListenerProxy::OnStopRequest(nsIRequest*, nsISupports*, tag_nsresult) ]

Frame 	Module 	Signature 	Source
0 	xul.dll 	nsCOMPtr_base::assign_with_AddRef(nsISupports*) 	xpcom/glue/nsCOMPtr.cpp
1 	xul.dll 	nsCORSListenerProxy::OnStopRequest(nsIRequest*, nsISupports*, tag_nsresult) 	dom/security/nsCORSListenerProxy.cpp
2 	xul.dll 	mozilla::net::HttpBaseChannel::DoNotifyListener() 	netwerk/protocol/http/HttpBaseChannel.cpp
3 	xul.dll 	mozilla::net::HttpAsyncAborter<mozilla::net::nsHttpChannel>::HandleAsyncAbort() 	netwerk/protocol/http/HttpBaseChannel.h
4 	xul.dll 	nsRunnableMethodImpl<void ( mozilla::dom::XULDocument::*)(void), void, 1>::Run() 	xpcom/glue/nsThreadUtils.h
5 	xul.dll 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp
6 	xul.dll 	mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) 	ipc/glue/MessagePump.cpp
7 	xul.dll 	MessageLoop::RunHandler() 	ipc/chromium/src/base/message_loop.cc
8 	xul.dll 	MessageLoop::Run() 	ipc/chromium/src/base/message_loop.cc
9 	xul.dll 	nsBaseAppShell::Run() 	widget/nsBaseAppShell.cpp
10 	xul.dll 	nsAppShell::Run() 	widget/windows/nsAppShell.cpp
11 	xul.dll 	nsAppStartup::Run() 	toolkit/components/startup/nsAppStartup.cpp
12 	xul.dll 	XREMain::XRE_mainRun() 	toolkit/xre/nsAppRunner.cpp
13 	xul.dll 	XREMain::XRE_main(int, char** const, nsXREAppData const*) 	toolkit/xre/nsAppRunner.cpp
14 	xul.dll 	XRE_main 	toolkit/xre/nsAppRunner.cpp
15 	firefox.exe 	do_main 	browser/app/nsBrowserApp.cpp
[Tracking Requested - why for this release]: recent crash

Regression range:
good=2015-01-15
bad=2015-01-16
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=c1f6345f2803&tochange=cac6192956ab

Suspected bug: 
Nikhil Marathe — Bug 1119021 - CORS support. r=baku,bkelly
Use nsCrossSiteListenerProxy.h helpers to implement CORS support.
Blocks: 1119021
Flags: needinfo?(nsm.nikhil)
Version: Trunk → 38 Branch
I believe this problem isn't due to my fetch changes, since they never changed the CORSListenerProxy.
I'm unable to reproduce this, but the problem is likely a double-free due to incorrect QI setup or not managing the original notification callbacks correctly when using the CORSListenerProxy.
Flags: needinfo?(nsm.nikhil)
(In reply to Nikhil Marathe [:nsm] (needinfo? please) from comment #2)
> I believe this problem isn't due to my fetch changes, since they never
> changed the CORSListenerProxy.
> I'm unable to reproduce this, but the problem is likely a double-free due to
> incorrect QI setup or not managing the original notification callbacks
> correctly when using the CORSListenerProxy.

My build: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0 ID:20150126030202 CSet: fa91879c8428
(In reply to Nikhil Marathe [:nsm] (needinfo? please) from comment #2)
> I'm unable to reproduce this, but the problem is likely a double-free due to
> incorrect QI setup or not managing the original notification callbacks
> correctly when using the CORSListenerProxy.

It seems like you are still the best person to investigate this very recent, reproducible crash - could you create a try build with a backout of your patch for others to try and reproduce on?  Alternately if you have ideas on what the problem is that can be forward-fixed instead and put those up for attempted reproducing?
Flags: needinfo?(nsm.nikhil)
Signature changed due to 'nsresult' rename but this is still a top crash in recent nightlies.
Crash Signature: [@ nsCOMPtr_base::assign_with_AddRef(nsISupports*) | nsCORSListenerProxy::OnStopRequest(nsIRequest*, nsISupports*, tag_nsresult)] → [@ nsCOMPtr_base::assign_with_AddRef(nsISupports*) | nsCORSListenerProxy::OnStopRequest(nsIRequest*, nsISupports*, tag_nsresult)] [@ nsCOMPtr_base::assign_with_AddRef(nsISupports*) | nsCORSListenerProxy::OnStopRequest(nsIRequest*, nsISupports*, nsresult)…
This is now the top crash on Dev Edition 38.
Assignee: nobody → nsm.nikhil
Flags: needinfo?(nsm.nikhil)
I don't know how Dev Edition outreach works. Can we get some other STR, or is the original STR (or some other media related page) the only way to reproduce this? It seems like a use-after-free of the mOuterListener in nsCORSListenerProxy, my hunch being some listener somewhere is freeing itself when it shouldn't. But then this is Windows only and I'm not sure why that is happening. More evidence would be nice.
(In reply to Nikhil Marathe [:nsm] (needinfo? please) from comment #7)
> I don't know how Dev Edition outreach works.

Not sure what you mean with this. Dev Edition is just the aurora channel version of Firefox. What "outreach" are you even talking of?
Can we reach out to people having this problem? As far as I know Socorro reporting is highly anonymized, and obviously we've no idea who downloads Aurora. But if some Mozillian can reproduce this problem with some other test case or is willing to debug it, do we just have to wait for them to stumble upon this bug or is there a better way?
(In reply to Nikhil Marathe [:nsm] (needinfo? please) from comment #9)
> Can we reach out to people having this problem?

Generally, we can't reach out to people at all from what we see in crash-stats. That said, we do have some email addresses on some reports that User Advocacy used at times to try and contact users with mixed to no helpful response - but that might have been on release, whereas Dev Edition users might be potentially more helpful. That said, I know of no tooling to do any outreach, so someone (not me, I don't have free time on my plate and enough higher priority stuff to do) would need to find a way to gather addresses manually from the reports and send out friendly emails to those and deal with all replies in a helpful way.
mayhemer, ckerschb: Looks like the most recent changes to nsCORSListenerProxy.cpp were yours and it looks like Nikhil isn't sure what he can do here. Can you help to get this moving?

A number of those crashes (click on the second entry in the Crash Signatures field here) seem to have 0x5a5a5a5a addresses which is our poison-on-free value, so that concerns me somewhat as well.
Flags: needinfo?(mozilla)
Flags: needinfo?(honzab.moz)
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #11)
> mayhemer, ckerschb: Looks like the most recent changes to
> nsCORSListenerProxy.cpp were yours and it looks like Nikhil isn't sure what
> he can do here. Can you help to get this moving?
> 
> A number of those crashes (click on the second entry in the Crash Signatures
> field here) seem to have 0x5a5a5a5a addresses which is our poison-on-free
> value, so that concerns me somewhat as well.

As a first note, I can definitely take a look, but not at the moment - is it only reproduceable on Windows or also on Linux/Mac? Don't have a Windows machine available, so setting one up might take additional time.

Leaving the needinfo so I remember to re-visit that bug.
It is windows only and nothing obviously unusual strikes me at any of the nsCORSListenerProxy users. From what I can see they all use refptrs and set notification callbacks pointers correctly.
(In reply to Christoph Kerschbaumer [:ckerschb] from comment #12)
> As a first note, I can definitely take a look, but not at the moment - is it
> only reproduceable on Windows or also on Linux/Mac? Don't have a Windows
> machine available, so setting one up might take additional time.

I don't know of any steps to reproduce, but the crash reports we have received are all on Windows, with the exception of 3 on OS X. It may actually affect all OSes.
Kamil offered to find the regression range - once we know, I can have a closer look!
Flags: needinfo?(mozilla) → needinfo?(kjozwiak)
(In reply to Christoph Kerschbaumer [:ckerschb] from comment #15)
> Kamil offered to find the regression range - once we know, I can have a
> closer look!

In fact the range is already defined in Comment 1 - sorry haven't seen that.
Flags: needinfo?(kjozwiak)
Testcase in comment #0 is still working with Nightly on Windows.
This is easily the top crash on recent aurora builds at 15-20% and it doesn't look like the investigation is going anywhere, despite a regression range and reproducible testcase.

Frankly I think it's time to back out or disable the feature before it hits beta. Lawrence what do you think?
Flags: needinfo?(lmandel)
(In reply to David Major [:dmajor] (UTC+13) from comment #18)
> This is easily the top crash on recent aurora builds at 15-20% and it
> doesn't look like the investigation is going anywhere, despite a regression
> range and reproducible testcase.
> 
> Frankly I think it's time to back out or disable the feature before it hits
> beta. Lawrence what do you think?

I haven't had time to look at it and potentially I will not get to it before early next week. If it's causing soo many crashes it's probably best to back it out for now - investigate, fix and reland.
I agree. Let's back this out and reland when we've had time to investigate and fix this crash.

Christoph - Can you find time to handle the backout?
Flags: needinfo?(lmandel) → needinfo?(mozilla)
Before we backout, I'd like to investigate a possible CC issue. If I put up links to try builds, could David or other contributors who have managed to reproduce this on windows locally, try them out?
Ok, I remain convinced this is not a fetch issue. Not only does the page not use the fetch API, but I didn't change the cors files either. We should not back out the 'suspected bug' in the comment 1. Fetch is a prerequisite for ServiceWorkers which we ship by the end of the month.

The best thing at this point would be for someone with a debug build who can reproduce this to figure out where the crashing channel comes from. The easiest way to do that would be to:
1) Load firefox in a debugger
2) break on nsCORSListenerProxy::nsCORSListenerProxy
3) Check what mOuterListener is. Use some debugger pointer mashing to figure out it's exact type (possibilities include ChannelMediaResource::Listener and so on)

Once we know the exact type of mOuterListener it will be much easier to find the faulty code.
Assignee: nsm.nikhil → nobody
(In reply to Nikhil Marathe [:nsm] (needinfo? please) from comment #22)
> Ok, I remain convinced this is not a fetch issue. Not only does the page not
> use the fetch API, but I didn't change the cors files either. We should not
> back out the 'suspected bug' in the comment 1. Fetch is a prerequisite for
> ServiceWorkers which we ship by the end of the month.
> 
> The best thing at this point would be for someone with a debug build who can
> reproduce this to figure out where the crashing channel comes from.

The loadinfo [1] should also be very helpful to figure out where the crashing channel is coming from (all gecko created channels should have a loadInfo by now). You could query the LoadingPrincipal, which is the principal that started the load and also you could query the contentPolicyType which tells you what kind of channel we are about to load (e.g. img, script, etc).

Also, not going to back this out for now until we figured what to do here!

[1] http://mxr.mozilla.org/mozilla-central/source/netwerk/base/LoadInfo.cpp#70
Flags: needinfo?(mozilla)
OK. Keep this in for now but we really shouldn't ship this crash to Beta (next merge on Mar 30) if we have a way to avoid it.
Attached patch speculative fixSplinter Review
I'd say this is a variant of Bug 1128219.

This is a guess fix: make HttpBaseChannel::DoNotifyListener() to follow the normal com-rules by keeping the callee alive during the method call.
I can't reproduce the crash, but crash-stats hints that this might be the issue.

(Whoever r-/r+ this first, feel free to clear to other request. We just might want to fix this somehow rather soon.)
Attachment #8578760 - Flags: review?(mcmanus)
Attachment #8578760 - Flags: review?(honzab.moz)
Comment on attachment 8578760 [details] [diff] [review]
speculative fix

Review of attachment 8578760 [details] [diff] [review]:
-----------------------------------------------------------------

thanks smaug - thats a logical conclusion from the last change.. I sure would like to know what code is managing to clear mListener from the listener Onstart/OnStop stack
Attachment #8578760 - Flags: review?(mcmanus)
Attachment #8578760 - Flags: review?(honzab.moz)
Attachment #8578760 - Flags: review+
Attached patch +commit messageSplinter Review
Keywords: checkin-needed
Crash Signature: , nsresult) ] → , nsresult) ] [@ nsRefPtr<imgRequestProxy>::assign_with_AddRef(imgRequestProxy*) | nsCORSListenerProxy::OnStopRequest(nsIRequest*, nsISupports*, nsresult) ] [@ nsRefPtr<mozilla::dom::Element>::assign_with_AddRef(mozilla::dom::Element*) | nsCORSListenerP…
https://hg.mozilla.org/mozilla-central/rev/9a044dcf264b
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla39
Olli, can we have an uplift request to aurora(38)? thanks
Flags: needinfo?(bugs)
Loic, can you still reproduce the issue on Nightlies?


crash-stat UI has changed to be less obvious, so I don't know how to check if this crash is gone from
Nightlies. Kairo, perhaps you know?
Flags: needinfo?(kairo)
Flags: needinfo?(epinal99-bugzilla2)
Working range:
bad=2015-03-18
good=2015-03-19
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=1174ffb1696d&tochange=cf1060d8ce9f

So I guess it's fixed. :)
Flags: needinfo?(epinal99-bugzilla2)
Comment on attachment 8578986 [details] [diff] [review]
+commit message

Thanks!

Approval Request Comment
[Feature/regressing bug #]: Not sure
[User impact if declined]: Crashes
[Describe test coverage new/current, TreeHerder]: landed to trunk
[Risks and why]: Just keeping objects alive longer, so shouldn't be risky
[String/UUID change made/needed]: NA
Flags: needinfo?(bugs)
Attachment #8578986 - Flags: approval-mozilla-aurora?
Attachment #8578986 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Crash Signature: , nsresult) ] [@ nsRefPtr<mozilla::dom::Element>::assign_with_AddRef(mozilla::dom::Element*) | nsCORSListenerProxy::OnStopRequest(nsIRequest*, nsISupports*, nsresult) ] → , nsresult) ] [@ nsRefPtr<mozilla::dom::Element>::assign_with_AddRef(mozilla::dom::Element*) | nsCORSListenerProxy::OnStopRequest(nsIRequest*, nsISupports*, nsresult) ] [@ nsRefPtr<mozilla::dom::Promise>::assign_with_AddRef(mozilla::dom::Promise*) | nsC…
Flags: needinfo?(kairo)
(In reply to Olli Pettay [:smaug] from comment #31)
> crash-stat UI has changed to be less obvious, so I don't know how to check
> if this crash is gone from
> Nightlies. Kairo, perhaps you know?

Not sure what you mean with changed crash-stats UI. That said, from both the graph on the signature and a search, I can confirm that this signature is not seen in Nightly builds after 2015-03-18.
Thanks Smaug, I had this in my mind too (when fixing bug 1128219), but we never did that before, so I rather left it.  I was a bit against taking care of listener's lifetime on our side, anyway, there are clearly bad implementations..
Flags: needinfo?(honzab.moz)
Assignee: nobody → bugs
Component: DOM → DOM: Core & HTML
You need to log in before you can comment on or make changes to this bug.