Closed Bug 800347 Opened 7 years ago Closed 6 years ago

crash in mozilla::ipc::RPCChannel::OnMaybeDequeueOne

Categories

(Core :: IPC, defect, critical)

Version: 25 Branch
Hardware: x86
OS: Windows 7
Type: defect
Priority: Not set
Severity: critical

Tracking


Status: VERIFIED FIXED
Target Milestone: mozilla26
Tracking Status:
firefox24 --- unaffected
firefox25 + verified
firefox26 --- verified

People

(Reporter: scoobidiver, Assigned: benjamin)

Details

(Keywords: crash, regression, topcrash)

Crash Data

Attachments

(2 files)

It first appeared in 19.0a1/20121010. The regression range is:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=aa5e3b445810&tochange=ec10630b1a54

There are two comments:
"Crash during profil window display"
"Segfault when starting firefox in testing mode"

Signature	mozilla::ipc::RPCChannel::OnMaybeDequeueOne()
UUID	6323e61d-60e1-45fb-ac86-b28aa2121011
Date Processed	2012-10-11 12:29:31
Uptime	6
Last Crash	6.4 weeks before submission
Install Age	3.5 minutes since version was first installed.
Install Time	2012-10-11 12:25:19
Product	Firefox
Version	19.0a1
Build ID	20121011030552
Release Channel	nightly
OS	Linux
OS Version	0.0.0 Linux 3.2.0-31-generic #50-Ubuntu SMP Fri Sep 7 16:16:45 UTC 2012 x86_64
Build Architecture	amd64
Build Architecture Info	family 6 model 42 stepping 7
Crash Reason	SIGSEGV
Crash Address	0x0
User Comments	Crash during profil window display
App Notes 	
OpenGL: Tungsten Graphics, Inc -- Mesa DRI Intel(R) Sandybridge Mobile  -- 3.0 Mesa 8.0.4 -- texture_from_pixmap
Processor Notes 	WARNING: JSON file missing Add-ons
EMCheckCompatibility	False

Frame 	Module 	Signature 	Source
0 	libxul.so 	mozilla::ipc::RPCChannel::OnMaybeDequeueOne 	Mutex.h:74
1 	libxul.so 	MessageLoop::DeferOrRunPendingTask 	ipc/chromium/src/base/message_loop.cc:333
2 	libxul.so 	MessageLoop::DoWork 	ipc/chromium/src/base/message_loop.cc:441
3 	libxul.so 	mozilla::ipc::DoWorkRunnable::Run 	ipc/glue/MessagePump.cpp:42
4 	libxul.so 	nsThread::ProcessNextEvent 	xpcom/threads/nsThread.cpp:612
5 	libxul.so 	NS_ProcessPendingEvents_P 	obj-firefox/xpcom/build/nsThreadUtils.cpp:170
6 	libxul.so 	mozilla::ShutdownXPCOM 	xpcom/build/nsXPComInit.cpp:553
7 	libxul.so 	ScopedXPCOMStartup::~ScopedXPCOMStartup 	toolkit/xre/nsAppRunner.cpp:1111
8 	libxul.so 	ShowProfileManager 	toolkit/xre/nsAppRunner.cpp:1864
9 	libxul.so 	XREMain::XRE_mainStartup 	toolkit/xre/nsAppRunner.cpp:2308
10 	libxul.so 	XREMain::XRE_main 	toolkit/xre/nsAppRunner.cpp:3843
11 	libxul.so 	XRE_main 	toolkit/xre/nsAppRunner.cpp:3933
12 	firefox 	main 	browser/app/nsBrowserApp.cpp:174
13 	libc-2.15.so 	libc-2.15.so@0x2176c 	
14 	firefox 	firefox@0x6f5f

More reports at:
https://crash-stats.mozilla.com/report/list?signature=mozilla%3A%3Aipc%3A%3ARPCChannel%3A%3AOnMaybeDequeueOne%28%29
It started earlier with another signature.

More reports also at:
https://crash-stats.mozilla.com/report/list?signature=mozilla%3A%3Aipc%3A%3ARPCChannel%3A%3AOnMaybeDequeueOne
Crash Signature: [@ mozilla::ipc::RPCChannel::OnMaybeDequeueOne()] → [@ mozilla::ipc::RPCChannel::OnMaybeDequeueOne()] [@ mozilla::ipc::RPCChannel::OnMaybeDequeueOne]
Keywords: regression
Whiteboard: [native-crash]
Version: 19 Branch → Trunk
The stack traces vary:
Frame 	Module 	Signature 	Source
0 	libxul.so 	mozilla::ipc::RPCChannel::OnMaybeDequeueOne 	Mutex.h:74
1 	libxul.so 	RunnableMethod<mozilla::ipc::RPCChannel, bool 	ipc/chromium/src/base/tuple.h:383
2 	libxul.so 	mozilla::ipc::RPCChannel::DequeueTask::Run 	RPCChannel.h:425
3 	libxul.so 	MessageLoop::RunTask 	ipc/chromium/src/base/message_loop.cc:333
4 	libxul.so 	MessageLoop::DeferOrRunPendingTask 	ipc/chromium/src/base/message_loop.cc:341
5 	libxul.so 	MessageLoop::DoWork 	ipc/chromium/src/base/message_loop.cc:441
6 	libxul.so 	base::MessagePumpDefault::Run 	ipc/chromium/src/base/message_pump_default.cc:23
7 	libxul.so 	MessageLoop::RunInternal 	ipc/chromium/src/base/message_loop.cc:215
8 	libxul.so 	MessageLoop::Run 	ipc/chromium/src/base/message_loop.cc:208
9 	libxul.so 	base::Thread::ThreadMain 	ipc/chromium/src/base/thread.cc:156
10 	libxul.so 	ThreadFunc 	ipc/chromium/src/base/platform_thread_posix.cc:39
11 	libc.so 	libc.so@0xe3da 	
12 	libc.so 	libc.so@0xdac6 	

Frame 	Module 	Signature 	Source
0 	libxul.so 	mozilla::ipc::RPCChannel::OnMaybeDequeueOne 	Mutex.h:74
1 	libxul.so 	RunnableMethod<IPC::ChannelProxy::Context, void 	tuple.h:383
2 	libxul.so 	IPC::Message& std::map<unsigned int, IPC::Message, std::less<unsigned int>, std: 	_map.h:181
3 	libxul.so 	mozilla::ipc::RPCChannel::DequeueTask::Run 	RPCChannel.h:425
4 	libxul.so 	mozilla::ipc::RPCChannel::ShouldDeferNotifyMaybeError const 	RPCChannel.h:160
5 	libxul.so 	MessageLoop::RunTask 	message_loop.cc:333
6 	libxul.so 	MessageLoop::DeferOrRunPendingTask 	message_loop.cc:341
7 	libxul.so 	MessageLoop::DoWork 	message_loop.cc:441
8 	libmozglue.so 	arena_malloc 	jemalloc.c:4159
9 	libxul.so 	base::MessagePumpDefault::Run 	message_pump_default.cc:23
10 	libxul.so 	MessageLoop::RunInternal 	message_loop.cc:215
11 	libxul.so 	MessageLoop::Run 	message_loop.cc:208
12 	libxul.so 	base::Thread::ThreadMain 	thread.cc:156
It's a very low volume crash in 22.0, 23.0 Beta and Aurora 24.
It started spiking in 25.0a1/20130720. The regression range for the spike is:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=af4e3ce8c487&tochange=bf73e10f5e54
Keywords: regression
OS: All → Windows 7
Hardware: All → x86
Whiteboard: [native-crash]
Version: Trunk → 25 Branch
Crash Signature: [@ mozilla::ipc::RPCChannel::OnMaybeDequeueOne()] [@ mozilla::ipc::RPCChannel::OnMaybeDequeueOne] → [@ mozilla::ipc::RPCChannel::OnMaybeDequeueOne()] [@ mozilla::ipc::RPCChannel::OnMaybeDequeueOne] [@ RtlEnterCriticalSection | PR_Lock | mozilla::ipc::RPCChannel::OnMaybeDequeueOne() ]
Benjamin, as Scoobidiver states, this has significantly flared up on trunk. Who could look into that?
With combined signatures, it's #7 top browser crasher in 25.0a1.
Keywords: topcrash
Tracking this topcrash, which is new to trunk; ni? on Benjamin for help getting an investigation started.
Flags: needinfo?(benjamin)
It's interesting that this spiked before thumbnails got turned on. That's what I would originally have blamed this on.

There are some CPOW changes in this range, but they shouldn't be affecting anything: I'm pretty sure CPOWs are preffed off by default, and there shouldn't be content processes in Firefox until the 24th anyway.

Tom, CPOWs are off by default for B2G and desktop and require a pref flip to enable, right?

I believe that the signatures here are also possible with non-IPC IPDL usage such as the compositor: bug 889515 and bug 885580 landed in this window (mattwoodrow).

It's possible but pretty unlikely that this is somehow related to plugins that normally fail to initialize and the landing of bug 889480.
Flags: needinfo?(benjamin) → needinfo?(evilpies)
CPOWs are "enabled" by default in that it is possible to use them from chrome code. But they should have absolutely no effect if they're not being used, and no code on mozilla-central currently uses them.
Hrm, it looks like I just lied. A lot! We've added some code that assumes a remote browser is the same as having e10s pref'd on, and can send CPOWs to the parent process.
The code that generates the CPOW gets run because it's part of an XBL binding that's applied to remote <browser> elements. We didn't consider until now that this code will also get used by the thumbnail code. In order to actually send an urgent message, someone would have to touch the .contentWindow property of one of these thumbnail <browser> elements. We don't know how that could happen, but maybe it could somehow.

I think the best course would be to avoid using the XBL binding for background thumbnail <browser> elements. It sounds like we don't need the bindings for these elements, and it should be pretty easy to use CSS to avoid applying the binding.
I have nothing to add. Maybe we should add a pref for CPOWs; I think it might be useful for testing what breaks without them...
Flags: needinfo?(evilpies)
We should add a pref for CPOWs so that there is no chance of B2G ever accidentally gaining a dependency on them. I've filed bug 902013.

Given the regression range, it's hard to see how CPOWs could be involved, and I'm still betting on this being OMTC/compositor not content processes.
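
Purely as an illustration of the kind of gate being proposed (the pref name here is invented and the real patch in bug 902013 may be structured differently), the check would be a one-liner at the point where a CPOW would be handed out:

    #include "mozilla/Preferences.h"

    // Hypothetical guard at the site that would wrap an object as a CPOW.
    // "dom.ipc.cpows.enabled" is a made-up pref name for this sketch.
    static bool CpowsAllowed()
    {
      return mozilla::Preferences::GetBool("dom.ipc.cpows.enabled", false);
    }

    // Caller side (sketch):
    //   if (!CpowsAllowed()) {
    //     return false;  // refuse to create the wrapper; no urgent message is sent
    //   }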
(In reply to Bill McCloskey (:billm) from comment #10)
> The code that generates the CPOW gets run because it's part of an XBL
> binding that's applied to remote <browser> elements.

See bug 902041 comment 4 - that binding didn't get applied to background thumbnail service browsers.
(In reply to :Gavin Sharp (use gavin@gavinsharp.com for email) from comment #13)
> See bug 902041 comment 4 - that binding didn't get applied to background
> thumbnail service browsers.

Gavin, is that something that should be applied here?
No, I was just highlighting that the previous theory about the cause of this bug was wrong.
CPOWs are disabled for sure now by bug 902013. I'm not sure how to proceed here.
This happened to me: bp-0857775b-0a8b-4707-9670-d028c2130902 (so many of my crashes are "EMPTY" that I hardly believe this one).

I also got bp-df174600-12f7-4ce7-8974-44b082130902 ntdll.dll@0x65d14 at the same time, for the same URL - http://news.google.com/news?vanilla=1 - as bp-0857775b-0a8b-4707-9670-d028c2130902. I don't know how this is possible.
This signature continues to be #2 on Aurora 25.0a2
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #18)
> This signature continues to be #2 on Aurora 25.0a2

Following up in email to try to find a path forward
Given the regression dates from comment 3, plus the additional CPOW safeguards that were added later, I think we can clearly rule out content IPDL. That leaves OMTC usage as the most likely candidate, and comment 7 is still valid. I don't know whether it's possible to back out the graphics changes listed there. ni?mattwoodrow for analysis.
Flags: needinfo?(matt.woodrow)
Roc, while we're waiting for Matt to come back from PTO - is this something you can look at?
Flags: needinfo?(roc)
OMTC is only enabled by default on Mac. The (vast) majority of the crashes are not on Mac. So I suspect this is not OMTC.
Flags: needinfo?(roc)
And the versions go back to 13, so the "regression" is really about frequency? Looking for a change that increased the frequency could be useful, but what if the answer is "this is now faster, so some race condition is more likely to happen"? IMO we should just look for the fix rather than spend a lot of time on whatever change may have increased the frequency. For all we know, web pages could simply be using something different than before. Anyway, who's the right person to look at this bug if we ignore the "what caused the regression" part?
(In reply to Milan Sreckovic [:milan] from comment #23)
> IMO we should just try to look for the fix, rather than
> spend a lot of time on the change that may have increased the frequency. 

Forgive me if I misunderstood but are you implying this is already fixed in Nightly and we should find a fix range? If this is the case then I disagree with that assumption.

Looking at pure volume this is certainly much higher on Aurora:
 Firefox 25 @ 68.83 % vs Firefox 26 @ 26.78 %

However, when you look at crashes per install it tells a different story:
 Firefox 25 @ 1.089 vs Firefox 26 @ 1.089

Again, my apologies if you're not saying "this is fixed in Nightly and we should find the fix so we can uplift it".
FWIW, this seems to have spiked in Kairo's explosiveness reports as of September 7th:
https://crash-analysis.mozilla.com/rkaiser/2013-09-08/2013-09-08.firefox.25.explosiveness.html

Crashes per 1M ADU seems to have doubled since then.
ashughes, is that associated with a particular buildid?
Re comment 24: ashughes, no - Milan is suggesting that instead of looking for things to back out, we should simply fix the bug.

Re comment 23: Milan, unfortunately, with race conditions like this it is often very difficult to catch the problem in a debugger.

I'm going to attempt to summarize what we currently know about this bug. Please correct me if anything here sounds wrong:

* We are calling OnMaybeDequeueOne on a dead RPCChannel.
* The task which makes this call is created here: http://hg.mozilla.org/releases/mozilla-aurora/annotate/4e15ddba58dc/ipc/glue/RPCChannel.cpp#l45 and is cancelled in RPCChannel::Clear (line 59)
* Because Tasks don't get "un-cancelled", it doesn't seem to matter whether you post tasks after calling Clear(), so we don't have a race there. Thus it seems likely that somebody is destroying an RPCChannel without Clear()ing it (a simplified sketch of this lifecycle follows this comment). Is there RPCChannel state we can assert on in the destructor to confirm the channel was properly cleared? ni?dvander
* We don't know who the client is, therefore anyone who uses IPDL is "suspect". Before content processes got turned on for desktop (after the regression range in question), that basically meant either plugins or something in graphics (I understand that graphics was using IPDL, I believe for OMTC, but perhaps for other things like decoding?). roc indicates that OMTC is probably not a candidate because it's currently mac-only
* Hand inspection of a couple recent crashes indicates that many of the new crashes occur when there is a nested event loop. Evidence:
** https://crash-stats.mozilla.com/report/index/676ca25f-624d-47ee-a033-e35092130908 (thread pool shutdown)
** https://crash-stats.mozilla.com/report/index/676ca25f-624d-47ee-a033-e35092130908 (thread shutdown, perhaps from a thread pool)
** Therefore I recommend that we go back through the nightly regression range and look for changes related to threading, thread pools, or nested event loops.
* The existing assertion in RPCChannel::~RPCChannel is triggered a lot (tracked in bug 864112). Is it possible that we're merely substituting this crash for that one, and the volumes are actually the same? They may also merely share a common cause.
Flags: needinfo?(dvander)
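
To make the suspected failure mode concrete, here is a minimal self-contained sketch. The classes are simplified stand-ins, not the real RPCChannel/MessageLoop code: the channel posts a revocable "maybe dequeue one" task bound to `this`, and Clear() is the only thing that revokes it.

    #include <functional>
    #include <queue>

    struct Task {
      std::function<void()> fn;
      bool revoked = false;
    };

    struct Loop {
      std::queue<Task*> q;
      void Post(Task* t) { q.push(t); }
      void DrainOne() {
        Task* t = q.front();
        q.pop();
        if (!t->revoked) {
          t->fn();  // calls into a dead object if the owner was destroyed un-Cleared
        }
        delete t;
      }
    };

    struct ChannelLike {
      explicit ChannelLike(Loop* aLoop) : mLoop(aLoop) {}
      ~ChannelLike() {}  // no Clear() here -- the suspected bug

      void PostDequeue() {
        mPending = new Task{[this] { OnMaybeDequeueOne(); }};
        mLoop->Post(mPending);
      }
      void OnMaybeDequeueOne() { /* touches members of *this */ }
      void Clear() {
        if (mPending) {
          mPending->revoked = true;
          mPending = nullptr;
        }
      }

      Loop* mLoop;
      Task* mPending = nullptr;
    };

    int main() {
      Loop loop;
      ChannelLike* c = new ChannelLike(&loop);
      c->PostDequeue();
      delete c;         // forgot c->Clear(): the queued task still binds the dead channel
      loop.DrainOne();  // use-after-free -- the OnMaybeDequeueOne crash in this bug
    }

In a debug variant one could also assert in ~ChannelLike that mPending is already null, which is essentially the question to dvander above.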
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #26)
> ashughes, is that associated with a particular buildid?

Assuming you are referring to comment 25. Kairo would need to comment on his methodology here but I believe each day reports all buildIDs in use that day. So the number on 2013-09-07 includes but is not exclusive to people running Firefox 25.0a2 2013-09-07 (ie. it also includes people running Firefox 25.0a2 from an earlier day that have not updated yet).

Looking at the numbers, it's a bit concerning that we're seeing well over 300 crashes/1M ADI since Sept 7th, when during the previous 9 days we averaged 157 crashes/1M ADI and never went over 220 crashes/1M ADI for this signature.

Maybe KaiRo can provide more expert analysis of the data (I'm still pretty new at this).
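
For reference, the rate being compared here is simply the per-day count of this signature normalized by active daily installs; roughly (my phrasing, the report may differ in details):

    \text{crashes per 1M ADI on day } d \approx \frac{\text{crashes with this signature on day } d}{\text{ADI on day } d} \times 10^{6}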
Kairo, can you look whether the most recent spike on Aurora happened on a particular buildid or whether it was by date? That may help determine whether an uplift made this worse (again) or whether we're dealing with spikes caused by external code.

For the original regression, I ruled out external causes because of the specific buildids, but I'll go back and verify that using the by-buildid graphs.
https://crash-analysis.mozilla.com/bsmedberg/OnMaybeDequeueOne.csv confirms the regression range from comment 3 and that it's not a date-based regression on nightly.
Attached image test.svg
This graph shows that this signature showed up clearly on Aurora with the FF25 uplift and has not spiked since then. Anthony or Kairo, can you please figure out why this would have shown up on the explosiveness report for that day and fix the report to suppress the false positive?
Flags: needinfo?(kairo)
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #27)
> * We are calling OnMaybeDequeueOne on a dead RPCChannel.

I think you're right that the most likely cause is not having called Clear(). It looks like this could happen if we don't wait for a channel-closed notification before deleting the channel. Clear() should probably be called in the destructor. I was wondering if it'd also be possible to revive the task by posting a new one, but it looks like that's not possible.

I don't see any existing state to assert on this but I could throw a quick patch together if you'd like.
Flags: needinfo?(dvander)
Is it sufficient to assert that mWorkerLoop is null? I think what I'd like to do is call Clear in the destructor on Aurora. On Nightly I'd like to runtime-abort if Clear wasn't already called, so we can figure out what the offending stack is. Does that sound reasonable?
Flags: needinfo?(dvander)
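
A rough sketch of that plan, with made-up names and a placeholder build flag (the actual member, macros, and abort mechanism in the tree may differ; mWorkerLoop is borrowed from the comment above as the thing Clear() nulls out):

    #include <cstdlib>

    class RPCChannelLike {
     public:
      ~RPCChannelLike() {
    #ifdef NIGHTLY_BUILD
        // Nightly: abort loudly if nobody called Clear(), so the offending
        // shutdown path shows up in crash reports with a useful stack.
        if (mWorkerLoop) {
          std::abort();
        }
    #else
        // Aurora/release: be forgiving -- run Clear() ourselves so the pending
        // dequeue task is revoked instead of later running against freed memory.
        if (mWorkerLoop) {
          Clear();
        }
    #endif
      }

      void Clear() {
        // cancel/revoke any pending DequeueTask here ...
        mWorkerLoop = nullptr;
      }

     private:
      void* mWorkerLoop = nullptr;  // stand-in for the real MessageLoop* member
    };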
Aha! ~AsyncChannel does call Clear. And Clear is virtual, but RPCChannel::Clear won't be called because we've already finished destructing it. Oops.
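
For anyone who hasn't hit this C++ pitfall before, a tiny standalone illustration (class names are stand-ins, not the real AsyncChannel/RPCChannel): by the time a base-class destructor runs, the derived part of the object is already gone, so a virtual call made from it resolves to the base implementation.

    #include <iostream>

    struct AsyncLike {                   // stands in for AsyncChannel
      virtual ~AsyncLike() { Clear(); }  // dispatches to AsyncLike::Clear, not the override
      virtual void Clear() { std::cout << "base Clear\n"; }
    };

    struct RpcLike : AsyncLike {         // stands in for RPCChannel
      void Clear() override { std::cout << "derived Clear (would cancel the dequeue task)\n"; }
    };

    int main() {
      AsyncLike* c = new RpcLike;
      delete c;  // prints "base Clear" -- RpcLike::Clear never runs
    }

So the dequeue task set up at the RPCChannel level is never revoked on that path unless RPCChannel's own destructor does it.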
Attachment #803275 - Flags: review?(dvander) → review+
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #31)
> Created attachment 803044 [details]
> test.svg
> 
> This graph shows that this signature showed up clearly on Aurora with the
> FF25 uplift and has not spiked since then. Anthony or Kairo can you please
> figure out why this would have showed up on the explosiveness report for
> that day and fix the report to suppress the false-positive?

I'm not sure which of the signatures you looked at; there are two or three on this bug. My explosiveness report in https://crash-analysis.mozilla.com/rkaiser/2013-09-11/2013-09-11.firefox.25.explosiveness.html takes all crashes in 25.* versions happening on that day (keyed on crash date, not build date; I never got around to doing the more difficult work of breaking it down by build date). The secondary "RtlEnterCriticalSection | PR_Lock | mozilla::ipc::RPCChannel::OnMaybeDequeueOne()" signature doubled starting 2013-09-07, but 13-20 vs. 30-40 crashes per day may or may not be significant. I would blame the weekend, but it stayed higher afterwards. The main "mozilla::ipc::RPCChannel::OnMaybeDequeueOne()" signature doesn't show that jump.
Flags: needinfo?(kairo)
Did a full try run on this last night. https://tbpl.mozilla.org/?tree=Try&rev=05a4fa0fafa1
Looked good except that all of Mac was orange because of the new assertion. OMTC apparently doesn't shut down channels the same way, because it uses thread links instead of IPC links. Since that might be OK, I landed without the assertion (got r=dvander on IRC for that change).

https://hg.mozilla.org/integration/mozilla-inbound/rev/a904ad651bff

Let's see what this does to volume tomorrow before actually declaring victory.
Flags: needinfo?(matt.woodrow)
Flags: needinfo?(dvander)
Whiteboard: [leave open]
This patch completely got rid of crashes with these signatures on nightly. Kairo or tracy, did you see any new signatures that this may have morphed into?
Flags: needinfo?(twalker)
Comment on attachment 803275 [details] [diff] [review]
Try clearing RPCChannel in the destructor so that the dequeue task is cancelled properly

Requesting approval for FF25 uplift, whatever channel that ends up as since today is uplift-day.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): unknown :-(
User impact if declined: #3 topcrash
Testing completed (on m-c, etc.): verified that it fixes the crashes on nightly
Risk to taking this patch (and alternatives if risky): This doesn't seem risky to me, and there are no known alternatives.
String or IDL/UUID changes made by this patch: None
Attachment #803275 - Flags: approval-mozilla-beta?
Attachment #803275 - Flags: approval-mozilla-aurora?
Right, there haven't been any crashes with this signature since the fix was checked in.

I don't see any suspects that would fit as a morphed crash. Nothing that this might have morphed into went explosive on the 13th, and I don't see any similarly-signed crashes either.
Flags: needinfo?(twalker)
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Whiteboard: [leave open]
Target Milestone: --- → mozilla26
Comment on attachment 803275 [details] [diff] [review]
Try clearing RPCChannel in the destructor so that the dequeue task is cancelled properly

Aurora isn't open yet, but this is a really bad crash and we should resolve it in Beta 1 of FF25.
Attachment #803275 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Assignee: nobody → benjamin
Attachment #803275 - Flags: approval-mozilla-aurora?
Keywords: checkin-needed
Flagging for verification based on crash stats once we get some data with 25b1.
Keywords: verifyme
(In reply to Manuela Muntean [:Manuela] [QA] from comment #45)

I see no instances of these signatures in either Firefox 25 Beta or Firefox 26 Aurora.
Status: RESOLVED → VERIFIED
Keywords: verifyme
Unfortunately, I think this morphed into bug 924121: crash in mozilla::ipc::MessageChannel::OnMaybeDequeueOne().