Closed Bug 800347 Opened 7 years ago Closed 6 years ago
crash in mozilla::ipc::RPCChannel::On
Maybe Dequeue One
It first appeared in 19.0a1/20121010. The regression range is: http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=aa5e3b445810&tochange=ec10630b1a54 There are two comments: "Crash during profil window display" "Segfault when starting firefox in testing mode" Signature mozilla::ipc::RPCChannel::OnMaybeDequeueOne() More Reports Search UUID 6323e61d-60e1-45fb-ac86-b28aa2121011 Date Processed 2012-10-11 12:29:31 Uptime 6 Last Crash 6.4 weeks before submission Install Age 3.5 minutes since version was first installed. Install Time 2012-10-11 12:25:19 Product Firefox Version 19.0a1 Build ID 20121011030552 Release Channel nightly OS Linux OS Version 0.0.0 Linux 3.2.0-31-generic #50-Ubuntu SMP Fri Sep 7 16:16:45 UTC 2012 x86_64 Build Architecture amd64 Build Architecture Info family 6 model 42 stepping 7 Crash Reason SIGSEGV Crash Address 0x0 User Comments Crash during profil window display App Notes OpenGL: Tungsten Graphics, Inc -- Mesa DRI Intel(R) Sandybridge Mobile -- 3.0 Mesa 8.0.4 -- texture_from_pixmap Processor Notes WARNING: JSON file missing Add-ons EMCheckCompatibility False Frame Module Signature Source 0 libxul.so mozilla::ipc::RPCChannel::OnMaybeDequeueOne Mutex.h:74 1 libxul.so MessageLoop::DeferOrRunPendingTask ipc/chromium/src/base/message_loop.cc:333 2 libxul.so MessageLoop::DoWork ipc/chromium/src/base/message_loop.cc:441 3 libxul.so mozilla::ipc::DoWorkRunnable::Run ipc/glue/MessagePump.cpp:42 4 libxul.so nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:612 5 libxul.so NS_ProcessPendingEvents_P obj-firefox/xpcom/build/nsThreadUtils.cpp:170 6 libxul.so mozilla::ShutdownXPCOM xpcom/build/nsXPComInit.cpp:553 7 libxul.so ScopedXPCOMStartup::~ScopedXPCOMStartup toolkit/xre/nsAppRunner.cpp:1111 8 libxul.so ShowProfileManager toolkit/xre/nsAppRunner.cpp:1864 9 libxul.so XREMain::XRE_mainStartup toolkit/xre/nsAppRunner.cpp:2308 10 libxul.so XREMain::XRE_main toolkit/xre/nsAppRunner.cpp:3843 11 libxul.so XRE_main toolkit/xre/nsAppRunner.cpp:3933 12 firefox main browser/app/nsBrowserApp.cpp:174 13 libc-2.15.so libc-2.15.so@0x2176c 14 firefox firefox@0x6f5f More reports at: https://crash-stats.mozilla.com/report/list?signature=mozilla%3A%3Aipc%3A%3ARPCChannel%3A%3AOnMaybeDequeueOne%28%29
It started earlier with another signature. More reports also at: https://crash-stats.mozilla.com/report/list?signature=mozilla%3A%3Aipc%3A%3ARPCChannel%3A%3AOnMaybeDequeueOne
Crash Signature: [@ mozilla::ipc::RPCChannel::OnMaybeDequeueOne()] → [@ mozilla::ipc::RPCChannel::OnMaybeDequeueOne()] [@ mozilla::ipc::RPCChannel::OnMaybeDequeueOne]
Version: 19 Branch → Trunk
Stack traces are various: Frame Module Signature Source 0 libxul.so mozilla::ipc::RPCChannel::OnMaybeDequeueOne Mutex.h:74 1 libxul.so RunnableMethod<mozilla::ipc::RPCChannel, bool ipc/chromium/src/base/tuple.h:383 2 libxul.so mozilla::ipc::RPCChannel::DequeueTask::Run RPCChannel.h:425 3 libxul.so MessageLoop::RunTask ipc/chromium/src/base/message_loop.cc:333 4 libxul.so MessageLoop::DeferOrRunPendingTask ipc/chromium/src/base/message_loop.cc:341 5 libxul.so MessageLoop::DoWork ipc/chromium/src/base/message_loop.cc:441 6 libxul.so base::MessagePumpDefault::Run ipc/chromium/src/base/message_pump_default.cc:23 7 libxul.so MessageLoop::RunInternal ipc/chromium/src/base/message_loop.cc:215 8 libxul.so MessageLoop::Run ipc/chromium/src/base/message_loop.cc:208 9 libxul.so base::Thread::ThreadMain ipc/chromium/src/base/thread.cc:156 10 libxul.so ThreadFunc ipc/chromium/src/base/platform_thread_posix.cc:39 11 libc.so libc.so@0xe3da 12 libc.so libc.so@0xdac6 Frame Module Signature Source 0 libxul.so mozilla::ipc::RPCChannel::OnMaybeDequeueOne Mutex.h:74 1 libxul.so RunnableMethod<IPC::ChannelProxy::Context, void tuple.h:383 2 libxul.so IPC::Message& std::map<unsigned int, IPC::Message, std::less<unsigned int>, std: _map.h:181 3 libxul.so mozilla::ipc::RPCChannel::DequeueTask::Run RPCChannel.h:425 4 libxul.so mozilla::ipc::RPCChannel::ShouldDeferNotifyMaybeError const RPCChannel.h:160 5 libxul.so MessageLoop::RunTask message_loop.cc:333 6 libxul.so MessageLoop::DeferOrRunPendingTask message_loop.cc:341 7 libxul.so MessageLoop::DoWork message_loop.cc:441 8 libmozglue.so arena_malloc jemalloc.c:4159 9 libxul.so base::MessagePumpDefault::Run message_pump_default.cc:23 10 libxul.so MessageLoop::RunInternal message_loop.cc:215 11 libxul.so MessageLoop::Run message_loop.cc:208 12 libxul.so base::Thread::ThreadMain thread.cc:156
It's a very low volume crash in 22.0, 23.0 Beta and Aurora 24. It started spiking in 25.0a1/20130720. The regression range for the spike is: http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=af4e3ce8c487&tochange=bf73e10f5e54
OS: All → Windows 7
Hardware: All → x86
Version: Trunk → 25 Branch
Crash Signature: [@ mozilla::ipc::RPCChannel::OnMaybeDequeueOne()] [@ mozilla::ipc::RPCChannel::OnMaybeDequeueOne] → [@ mozilla::ipc::RPCChannel::OnMaybeDequeueOne()] [@ mozilla::ipc::RPCChannel::OnMaybeDequeueOne] [@ RtlEnterCriticalSection | PR_Lock | mozilla::ipc::RPCChannel::OnMaybeDequeueOne() ]
Benjamin, as Scoobidiver states, this has significantly flared up on trunk, who could look into that?
With combined signatures, it's #7 top browser crasher in 25.0a1.
Tracking topcrash new to trunk, ni? on Benjamin for help getting investigation.
It's interesting that this spiked before thumbnails got turned on. That's what I would originally have blamed this on. There's some CPOW changes in this range, but they shouldn't be affecting anything because I'm pretty sure that CPOWs would be preffed off by default and there shouldn't be content processes in Firefox until the 24th anyway. Tom, CPOWs are off by default for B2G and desktop and require a pref flip to enable, right? I believe that the signatures here are also possible with non-IPC IPDL usage such as the compositor: bug 889515 and bug 885580 landed in this window (mattwoodrow). It's possible but pretty unlikely that this is somehow related to plugins that normally fail to initialize and the landing of bug 889480.
Flags: needinfo?(benjamin) → needinfo?(evilpies)
CPOWs are "enabled" by default in that it is possible to use them chrome code. But they should have absolutely no effect if they're not being used, and no code on mozilla-central currently uses them.
Hrm, it looks like I just lied. A lot! We've added some code that assumes a remote browser is the same as having e10s pref'd on, and can send CPOWs to the parent process.
The code that generates the CPOW gets run because it's part of an XBL binding that's applied to remote <browser> elements. We didn't consider until now that this code will also get used by the thumbnail code. In order to actually send an urgent message, someone would have to touch the .contentWindow property of one of these thumbnail <browser> elements. We don't know how that could happen, but maybe it could somehow. I think the best course would be to avoid using the XBL binding for background thumbnail <browser> elements. It sounds like we don't need the bindings for these elements, and it should be pretty easy to use CSS to avoid applying the binding.
I have nothing to add. Maybe we should add a pref for CPOWs, I think this might be useful to test what breaks without them...
We should add a pref for CPOWs so that there is no chance of B2G ever accidentally gaining a dependency on them. I've filed bug 902013. Given the regression range, it's hard to see how CPOWs could be involved, and I'm still betting on this being OMTC/compositor not content processes.
(In reply to Bill McCloskey (:billm) from comment #10) > The code that generates the CPOW gets run because it's part of an XBL > binding that's applied to remote <browser> elements. See bug 902041 comment 4 - that binding didn't get applied to background thumbnail service browsers.
(In reply to :Gavin Sharp (use email@example.com for email) from comment #13) > See bug 902041 comment 4 - that binding didn't get applied to background > thumbnail service browsers. Gavin, is that something that should be applied here?
No, I was just highlighting that the previous theory about the cause of this bug was wrong.
CPOWs are disabled for sure now by bug 902013. I'm not sure how to proceed here.
me bp-0857775b-0a8b-4707-9670-d028c2130902 (so many of my crashes are "EMPTY" I hardly believe this one) I also got bp-df174600-12f7-4ce7-8974-44b082130902 ntdll.dll@0x65d14 at the same time for the same URL - http://news.google.com/news?vanilla=1 - as bp-0857775b-0a8b-4707-9670-d028c2130902. I don't know how this is possible
This signature continues to be #2 on Aurora 25.0a2
(In reply to Robert Kaiser (:firstname.lastname@example.org) from comment #18) > This signature continues to be #2 on Aurora 25.0a2 Following up in email to try to find a path forward
Given the regression dates from comment 3, plus the additional CPOW safeguards that were added later, I think we can clearly rule out content IPDL. That leaves OMTC usage as the most likely candidate, and comment 7 is still valid. I don't know whether it's possible to back out the graphics changes listed there. ni?mattwoodrow for analysis.
Roc, while we're waiting for Matt to come back from PTO - is this something you can look at?
OMTC is only enabled by default on Mac. The (vast) majority of the crashes are not on Mac. So I suspect this is not OMTC.
And the versions go back to 13, so "regression", is because of the frequency? Looking for a change that increased this frequency could be useful, but what if it's "this is now faster, so some race condition is more likely to happen". IMO we should just try to look for the fix, rather than spend a lot of time on the change that may have increased the frequency. For all we know, it could be that the web pages are starting to use something different than before. Anyway, who's the right person to look at this bug if we were to ignore the what caused the "regression" part?
(In reply to Milan Sreckovic [:milan] from comment #23) > IMO we should just try to look for the fix, rather than > spend a lot of time on the change that may have increased the frequency. Forgive me if I misunderstood but are you implying this is already fixed in Nightly and we should find a fix range? If this is the case then I disagree with that assumption. Looking at pure volume this is certainly much higher on Aurora: Firefox 25 @ 68.83 % vs Firefox 26 @ 26.78 % However, when you look at crashes per install it tells a different story: Firefox 25 @ 1.089 vs Firefox 26 @ 1.089 Again, my apologies if you're not saying "this is fixed in Nightly and we should find the fix so we can uplift it".
FWIW, this seems to have spiked in Kairo's explosiveness reports as of September 7th: https://crash-analysis.mozilla.com/rkaiser/2013-09-08/2013-09-08.firefox.25.explosiveness.html Crashes per 1M ADU seems to have doubled since then.
ashughes, is that associated with a particular buildid?
re: comment 24 ashughes no, milan is suggesting that instead of looking for things to back out, we should merely fix the bug. re: comment 23, Milan, unfortunately with race conditions like this, it is often very difficult to catch this in a debugger. I'm going to attempt to summarize what we currently know about the bug. Please correct if anything here doesn't sound correct: * We are calling OnMaybeDequeueOne on a dead RPCChannel. * The task which makes this call is created here: http://hg.mozilla.org/releases/mozilla-aurora/annotate/4e15ddba58dc/ipc/glue/RPCChannel.cpp#l45 and is cancelled in RPCChannel::Clear (line 59) * Because Tasks don't get "un-cancelled", it doesn't seem to matter whether you post tasks after calling Clear(), so we don't have a race there. Thus it seems likely that somebody is destroying an RPCChannel without Clear()ing it. Is there RPCChannel state that we can assert on in the destructor that the channel was properly cleared? ni?dvander * We don't know who the client is, therefore anyone who uses IPDL is "suspect". Before content processes got turned on for desktop (after the regression range in question), that basically meant either plugins or something in graphics (I understand that graphics was using IPDL, I believe for OMTC, but perhaps for other things like decoding?). roc indicates that OMTC is probably not a candidate because it's currently mac-only * Hand inspection of a couple recent crashes indicates that many of the new crashes occur when there is a nested event loop. Evidence: ** https://crash-stats.mozilla.com/report/index/676ca25f-624d-47ee-a033-e35092130908 (thread pool shutdown) ** https://crash-stats.mozilla.com/report/index/676ca25f-624d-47ee-a033-e35092130908 (thread shutdown, perhaps from a thread pool) ** Therefore I recommmend that we go back through the nightly regression range and look for changes related to threading, thread poools, or nested event loops. * The existing assertion in RPCChannel::~RPCChannel is triggered a lot (tracked in bug 864112). Is it possible that we're merely substituting this crash for that one, and the volumes are actually the same? They may also merely share a common cause.
(In reply to Benjamin Smedberg [:bsmedberg] from comment #26) > ashughes, is that associated with a particular buildid? Assuming you are referring to comment 25. Kairo would need to comment on his methodology here but I believe each day reports all buildIDs in use that day. So the number on 2013-09-07 includes but is not exclusive to people running Firefox 25.0a2 2013-09-07 (ie. it also includes people running Firefox 25.0a2 from an earlier day that have not updated yet). Looking at the numbers it's a bit concerning to me that we're seeing well over 300 crashes/1M ADI since Sept 7th when during the 9 days previous we averaged 157 crashes/1M ADI never going over 220 crashes/1M ADI for this signature. Maybe KaiRo can provide more expert analysis of the data (I'm still pretty new at this).
Kairo, can you look whether the most recent spike on Aurora happened on a particular buildid or whether it was by date? That may help determine whether an uplift made this worse (again) or whether we're dealing with spikes caused by external code. For the original regression, I ruled out external causes because of the specific buildids, but I'll go back and verify that using the by-buildid graphs.
https://crash-analysis.mozilla.com/bsmedberg/OnMaybeDequeueOne.csv confirms the regression range from comment 3 and that it's not a date-based regression on nightly.
This graph shows that this signature showed up clearly on Aurora with the FF25 uplift and has not spiked since then. Anthony or Kairo can you please figure out why this would have showed up on the explosiveness report for that day and fix the report to suppress the false-positive?
(In reply to Benjamin Smedberg [:bsmedberg] from comment #27) > * We are calling OnMaybeDequeueOne on a dead RPCChannel. I think you're right that the most likely cause is not having called Clear(). It looks like this could happen if we don't wait for a channel-closed notification before deleting the channel. Clear() should probably be called in the destructor. I was wondering if, it'd also be possible to revive the task by posting a new one, but it looks like that's not possible. I don't see any existing state to assert on this but I could throw a quick patch together if you'd like.
Is it sufficient to assert that mWorkerLoop is null? I think what I'd like to do is on Aurora call Clear in the destructor. On Nightly I'd like to runtime-abort that Clear was already called and figure out what the offending stack is. Does that sound reasonable?
Aha! ~AsyncChannel does call Clear. And Clear is virtual, but RPCChannel::Clear won't be called because we've already finished destructing it. Oops.
Attachment #803275 - Flags: review?(dvander) → review+
(In reply to Benjamin Smedberg [:bsmedberg] from comment #31) > Created attachment 803044 [details] > test.svg > > This graph shows that this signature showed up clearly on Aurora with the > FF25 uplift and has not spiked since then. Anthony or Kairo can you please > figure out why this would have showed up on the explosiveness report for > that day and fix the report to suppress the false-positive? I'm not sure which one of the signatures you looked at, there are two or three on this bug. My explosiveness report in https://crash-analysis.mozilla.com/rkaiser/2013-09-11/2013-09-11.firefox.25.explosiveness.html takes all crashes in 25.* versions happening on that day (not looking at build date, but crash date, I never came around to do the more difficult work to do build date there). The secondary "RtlEnterCriticalSection | PR_Lock | mozilla::ipc::RPCChannel::OnMaybeDequeueOne()" signature doubled starting 2013-09-07 but 13-20 vs. 30-40 crashes per day may or may not be significant. I would blame the weekend but it stayed higher afterwards. The main "mozilla::ipc::RPCChannel::OnMaybeDequeueOne()" signature doesn't show that jump.
Did a full try run on this last night. https://tbpl.mozilla.org/?tree=Try&rev=05a4fa0fafa1 Looked good except that all of mac was orange because of the new assertion. OMTC apparently doesn't shut down channels the same way, because it's using thread links instead of IPC links. Since that might be ok, landed without the assertion (got r=dvander on IRC for that change). https://hg.mozilla.org/integration/mozilla-inbound/rev/a904ad651bff Let's see what this does to volume tomorrow before actually declaring victory.
Whiteboard: [leave open]
This patch completely got rid of crashes with these signatures on nightly. Kairo or tracy, did you see any new signatures that this may have morphed into?
Comment on attachment 803275 [details] [diff] [review] Try clearing RPCChannel in the destructor so that the dequeue task is cancelled properly Requesting approval for FF25 uplift, whatever channel that ends up as since today is uplift-day. [Approval Request Comment] Bug caused by (feature/regressing bug #): unknown :-( User impact if declined: #3 topcrash Testing completed (on m-c, etc.): verified that it fixes the crashes on nightly Risk to taking this patch (and alternatives if risky): This doesn't risky to me, and there are no known alternatives. String or IDL/UUID changes made by this patch: None
Right, there hasn't been any crashes with this signature since the fix was checked in. I don't see any suspects that would fit as a morphed crash. Nothing, that this might be, went explosive on the 13th. I also don't see any similar signed crashes.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Whiteboard: [leave open]
Target Milestone: --- → mozilla26
Comment on attachment 803275 [details] [diff] [review] Try clearing RPCChannel in the destructor so that the dequeue task is cancelled properly Aurora's not open yet, but this is a really bad crash and we should resolve in Beta 1 of FF25.
Attachment #803275 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Flagging for verification based on crash stats once we get some data with 25b1.
While looking through Socorro crash reports, within last month, this is what I found: 1) the 1st signature has many crashes on 25.0a2 and 26.0a1 https://crash-stats.mozilla.com/report/list?signature=mozilla%3A%3Aipc%3A%3ARPCChannel%3A%3AOnMaybeDequeueOne%28%29 2) the 2nd signature doesn't have any crashes on 25.0a2 and 26.0a1 https://crash-stats.mozilla.com/report/list?signature=mozilla%3A%3Aipc%3A%3ARPCChannel%3A%3AOnMaybeDequeueOne&product=Firefox&query_type=contains&range_unit=weeks&process_type=any&hang_type=any&date=2013-09-18+14%3A00%3A00&range_value=4 3) the 3rd signature has also many crashes on 25.0a2 and 26.0a1 https://crash-stats.mozilla.com/report/list?signature=RtlEnterCriticalSection+%7C+PR_Lock+%7C+mozilla%3A%3Aipc%3A%3ARPCChannel%3A%3AOnMaybeDequeueOne%28%29&product=Firefox&query_type=contains&range_unit=weeks&process_type=any&hang_type=any&date=2013-09-18+14%3A00%3A00&range_value=4 I'll keep an eye on this, to see if crashes still occur with Firefox 25 beta 1.
(In reply to Manuela Muntean [:Manuela] [QA] from comment #45) I see no instances of these signatures in Firefox 25 Beta nor in Firefox 26 Aurora.
Unfortunately, I think this morphed into bug 924121. crash in mozilla::ipc::MessageChannel::OnMaybeDequeueOne()
You need to log in before you can comment on or make changes to this bug.