Closed Bug 737437 Opened 8 years ago Closed 8 years ago

crash in mozilla::ipc::RPCChannel::OnMaybeDequeueOne when quitting

Categories

(Firefox for Android :: General, defect, critical)

Version: 14 Branch
Hardware: ARM
OS: Android
Type: defect
Priority: Not set
Severity: critical

Tracking

()

VERIFIED FIXED
Firefox 14
Tracking Status
blocking-fennec1.0 --- beta+

People

(Reporter: scoobidiver, Assigned: ajuma)

Details

(4 keywords, Whiteboard: [native-crash])

Attachments

(2 files, 2 obsolete files)

It's the #3 top crasher over the last day.

It first appeared in 14.0a1/20120320043530. The regression range is:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=58a2cd0203ee&tochange=ee554888d071
It might be a regression from bug 731603.

Signature 	mozilla::ipc::RPCChannel::OnMaybeDequeueOne
UUID	fd96ad9a-7263-429b-ac5f-b180b2120320
Date Processed	2012-03-20 14:54:13
Uptime	31
Last Crash	1.9 minutes before submission
Install Age	4.3 minutes since version was first installed.
Install Time	2012-03-20 14:48:53
Product	FennecAndroid
Version	14.0a1
Build ID	20120320043530
Release Channel	nightly
OS	Linux
OS Version	0.0.0 Linux 2.6.35.10-g7b95729 #1 PREEMPT Mon Jun 13 10:34:37 CST 2011 armv7l
Build Architecture	arm
Build Architecture Info	
Crash Reason	SIGSEGV
Crash Address	0x0
App Notes 	
EGL? EGL+ AdapterVendorID: vision, AdapterDeviceID: HTC Vision.
AdapterDescription: 'Android, Model: 'HTC Vision', Product: 'htc_vision', Manufacturer: 'HTC', Hardware: 'vision''.
GL Context? GL Context+ GL Layers? GL Layers+ 
HTC HTC Vision
htc_wwe/htc_vision/vision:2.3.3/GRI40/84109:user/release-keys
EMCheckCompatibility	True

Frame 	Module 	Signature 	Source
0 	libxul.so 	mozilla::ipc::RPCChannel::OnMaybeDequeueOne 	Mutex.h:106
1 	libxul.so 	RunnableMethod<mozilla::ipc::RPCChannel, bool , Tuple0>::Run 	ipc/chromium/src/base/tuple.h:383
2 	libxul.so 	mozilla::ipc::RPCChannel::DequeueTask::Run 	RPCChannel.h:462
3 	libxul.so 	MessageLoop::RunTask 	ipc/chromium/src/base/message_loop.cc:318
4 	libxul.so 	MessageLoop::DeferOrRunPendingTask 	ipc/chromium/src/base/message_loop.cc:326
5 	libxul.so 	MessageLoop::DoWork 	ipc/chromium/src/base/message_loop.cc:426
6 	libxul.so 	mozilla::ipc::MessagePump::Run 	ipc/glue/MessagePump.cpp:114
7 	libxul.so 	MessageLoop::RunInternal 	ipc/chromium/src/base/message_loop.cc:208
8 	libxul.so 	MessageLoop::Run 	ipc/chromium/src/base/message_loop.cc:201
9 	libxul.so 	nsBaseAppShell::Run 	widget/xpwidgets/nsBaseAppShell.cpp:189
10 	libxul.so 	nsAppStartup::Run 	toolkit/components/startup/nsAppStartup.cpp:295
11 	libxul.so 	XRE_main 	toolkit/xre/nsAppRunner.cpp:3703
12 	libxul.so 	GeckoStart 	toolkit/xre/nsAndroidStartup.cpp:109
13 	libmozglue.so 	__res_nsend 	other-licenses/android/res_send.c:1086
...

More reports at:
https://crash-stats.mozilla.com/report/list?signature=mozilla%3A%3Aipc%3A%3ARPCChannel%3A%3AOnMaybeDequeueOne
Duplicate of this bug: 737452
blocking-fennec1.0: --- → ?
This is suspected of being a regression from bug 731603. I see that some more bits of it landed today; can you comment on how it's related?
(In reply to Brad Lassey [:blassey] from comment #2)
> can you comment on how it's related?
gfx/layers/ipc has been modified in this bug.
This crash is only happening on devices with Adreno GPUs (and the vast majority of the crashes are on the HTC Vision).
Component: IPC → General
Product: Core → Fennec Native
QA Contact: ipc → general
Summary: crash in mozilla::ipc::RPCChannel::OnMaybeDequeueOne → crash in mozilla::ipc::RPCChannel::OnMaybeDequeueOne on devices with Adreno GPUs
Version: 14 Branch → unspecified
Version: unspecified → Firefox 14
HTC Desire Z (A 205), and EVO 3D (A 220) reported in bug 737477 (could be a symptom of this crash) too.
(In reply to Aaron Train [:aaronmt] from comment #5)
> HTC Desire Z (A 205), and EVO 3D (A 220) reported in bug 737477 (could be a
> symptom of this crash) too.

Yes, these sound like the same problem (particularly in light of Bug 737477 Comment 4).
Duplicate of this bug: 737477
I'm running into a crash with this same signature on my Galaxy Nexus, on a latest-inbound build from today (03/22), when quitting Fennec (Menu -> Quit).

bp-697a989b-7c19-4595-a961-63ec32120322
bp-4efc4f0b-a83e-48f8-99bc-c978e2120322
bp-f176c5d2-4858-48e5-95f4-f04752120322

That would indicate that this is not strictly related to Adreno ...
(In reply to Aaron Train [:aaronmt] from comment #8)
> That would indicate that this is not strictly related to Adreno ...

True, although it might be that the crash you're hitting has a different cause (the stacks look very different from the rest).
So far, these crashes occur on:
* HTC Glacier, Vision, Desire, Desire HD, Desire HD A9191, ADR6400L, ThunderBolt, Incredible S, Nexus One
* Samsung SCH-I510, Galaxy Nexus
* Sony Ericsson SO-02C, R800x, MT15i
Assignee: nobody → ajuma
blocking-fennec1.0: ? → beta+
(In reply to Scoobidiver from comment #10)
> So far, these crashes occur on:
> * HTC Glacier, Vision, Desire, Desire HD, Desire HD A9191, ADR6400L,
> ThunderBolt, Incredible S, Nexus One
> * Samsung SCH-I510, Galaxy Nexus
> * Sony Ericsson SO-02C, R800x, MT15i

Updating the list with: Samsung SAMSUNG-SGH-I897 (Samsung Captivate).

This crash occurred on the latest Nightly 03/23:
https://crash-stats.mozilla.com/report/index/bp-e5d6483a-9070-4973-bbc5-61c7e2120323
Perhaps the title should be changed since Samsung Captivate has a PowerVR SGX540 GPU.
Summary: crash in mozilla::ipc::RPCChannel::OnMaybeDequeueOne on devices with Adreno GPUs → crash in mozilla::ipc::RPCChannel::OnMaybeDequeueOne
Reproducible on Nightly/14.0a1 2012-03-23 on Motorola Droid 2 running Android 2.3.3 using the following scenario:

1. Open the Addons page.
2. Install an addon that requires a restart (Cute Buttons - Crystal SVG or Copy as Plain Text).
3. When asked tap on restart.

Actual result:
Fennec crashes when Nightly is reopened.
Keywords: reproducible
This bug causes graphic defects on restart on the HTC Desire HD.
1. go to menu-> quit
2. crash; select restart
3. go to about:crashes

Expected: list of crashes
Actual: graphic defect: http://www.youtube.com/watch?v=b5edb_Ljm64&feature=youtube_gdata_player
To be clear:
1. Start Nightly
2. Quit Nightly

*BOOM*

I have no add-ons installed and no Sync setup. The crash happens 100% of the time.
(In reply to Mark Finkle (:mfinkle) from comment #15)
> To be clear:
> 1. Start Nightly
> 2. Quit Nightly
> 
> *BOOM*
> 
> I have no add-ons installed and no Sync setup. The crash happens 100% of the
> time.

Using a Galaxy Nexus
Update: I uninstalled Flash from my phone and rebooted. Nightly still crashes on exit.
(In reply to Mark Finkle (:mfinkle) from comment #17)
> Update: I uninstalled Flash from my phone and rebooted. Nightly still
> crashes on exit.

Here is my crash without Flash:
https://crash-stats.mozilla.com/report/index/b1f63fed-2282-4512-a59d-7ecea2120324
I get the following stack when quitting on an HTC desire:
#0  mozilla::ipc::RPCChannel::OnMaybeDequeueOne (this=0x4a010828)
    at /home/ajuma/mozilla-central/ipc/glue/RPCChannel.cpp:403
#1  0x728b39b6 in DispatchToMethod<mozilla::plugins::PluginInstanceChild, void (mozilla::plugins::PluginInstanceChild::*)()> (arg=<optimized out>, method=<optimized out>, 
    obj=<optimized out>) at /home/ajuma/mozilla-central/ipc/chromium/src/base/tuple.h:383
#2  RunnableMethod<mozilla::plugins::PluginInstanceChild, void (mozilla::plugins::PluginInstanceChild::*)(), Tuple0>::Run (this=<optimized out>)
    at /home/ajuma/mozilla-central/ipc/chromium/src/base/task.h:307
#3  0x728c0f28 in Run (this=<optimized out>)
    at ../../dist/include/mozilla/ipc/RPCChannel.h:462
#4  mozilla::ipc::RPCChannel::DequeueTask::Run (this=<optimized out>)
    at ../../dist/include/mozilla/ipc/RPCChannel.h:485
#5  0x72964adc in MessageLoop::RunTask (this=0x46e4c0e0, task=0x45be8040)
    at /home/ajuma/mozilla-central/ipc/chromium/src/base/message_loop.cc:318
#6  0x7296590a in MessageLoop::DeferOrRunPendingTask (this=0x45bd753c, 
    pending_task=<optimized out>)
    at /home/ajuma/mozilla-central/ipc/chromium/src/base/message_loop.cc:326
#7  0x729664b8 in MessageLoop::DoWork (this=0x46e4c0e0)
    at /home/ajuma/mozilla-central/ipc/chromium/src/base/message_loop.cc:426
#8  0x728c09e0 in mozilla::ipc::MessagePump::Run (this=0x46e271c0, aDelegate=0x46e4c0e0)
    at /home/ajuma/mozilla-central/ipc/glue/MessagePump.cpp:114
#9  0x72964a8c in MessageLoop::RunInternal (this=0x72d896c9)
    at /home/ajuma/mozilla-central/ipc/chromium/src/base/message_loop.cc:208
#10 0x72964b42 in RunHandler (this=<optimized out>)
    at /home/ajuma/mozilla-central/ipc/chromium/src/base/message_loop.cc:201
#11 MessageLoop::Run (this=0x46e4c0e0)
    at /home/ajuma/mozilla-central/ipc/chromium/src/base/message_loop.cc:175
#12 0x72855c5c in nsBaseAppShell::Run (this=0x46e28620)
    at /home/ajuma/mozilla-central/widget/xpwidgets/nsBaseAppShell.cpp:189
#13 0x7279e840 in nsAppStartup::Run (this=0x46277670)

Not 100% sure this is the same crash, since this stack has PluginInstanceChild in frame #1 but the stacks on crash-stats don't. This initially made me suspect Flash, but Comment 17 disproves that theory.
Summary: crash in mozilla::ipc::RPCChannel::OnMaybeDequeueOne → crash in mozilla::ipc::RPCChannel::OnMaybeDequeueOne when quiting
Testing on a Nexus S using mozilla-inbound tinderbox builds, I found a different regression range:
https://hg.mozilla.org/integration/mozilla-inbound/rev/a5ac2a7b72c6 doesn't crash
but
https://hg.mozilla.org/integration/mozilla-inbound/rev/80a7d26b02ec
does crash.

This means that on PowerVR devices, the regression is caused by Bug 737686 (which landed on inbound on March 21). That bug made our texture upload behaviour on PowerVR consistent with our behaviour on Adreno (that is, we avoid using glTexSubImage2D). This explains why we initially only saw this crash on Adreno devices.

We still need to find what caused the regression on Adreno; I don't have an Adreno device with me today, but if someone who does can bisect the regression range from Comment 0 using inbound tinderbox builds, that would be very helpful!
Blocks: 737686
Summary: crash in mozilla::ipc::RPCChannel::OnMaybeDequeueOne when quiting → crash in mozilla::ipc::RPCChannel::OnMaybeDequeueOne when quitting
This bug should be a high priority because it is our #1 topcrash. This crash is about 5x more common than the #2 topcrash!
This points to RPCChannel::mMonitor being NULL. I don't know how that can happen, though.
It means that AsyncChannel::Clear() has been called, which likely means that something is trying to IPC after ActorDestroy().  That's not allowed.
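The failure mode described in the two comments above can be sketched in isolation. This is a toy model, not Mozilla's code: `ToyChannel`, `Clear`, `OnMaybeDequeueOne` as written here, and `DrainAndCountLateTasks` are hypothetical stand-ins, and a `std::mutex` plays the role of the monitor. It assumes only what the thread states, namely that clearing the channel leaves the monitor null and that a task queued earlier can still run afterwards. Unlike the crashing code, the toy checks for the null monitor so the late call can be counted instead of faulting.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <mutex>
#include <queue>

// Hypothetical stand-in for an IPC channel; not the real RPCChannel.
class ToyChannel {
public:
    ToyChannel() : monitor_(new std::mutex) {}

    // Plays the role the thread attributes to AsyncChannel::Clear():
    // after this call the monitor no longer exists.
    void Clear() { monitor_.reset(); }

    // Returns false when called after Clear(). In the crash discussed in
    // this bug, the equivalent lock is reportedly reached with a null monitor.
    bool OnMaybeDequeueOne() {
        if (!monitor_) {
            return false;  // late call: the channel is already torn down
        }
        std::lock_guard<std::mutex> lock(*monitor_);
        ++dequeued_;
        return true;
    }

    int dequeued() const { return dequeued_; }

private:
    std::unique_ptr<std::mutex> monitor_;
    int dequeued_ = 0;
};

// Runs every queued task and returns how many found the channel cleared.
inline int DrainAndCountLateTasks(std::queue<std::function<bool()>>& tasks) {
    int late = 0;
    while (!tasks.empty()) {
        assert(tasks.front() != nullptr);  // a queued task must be callable
        if (!tasks.front()()) {
            ++late;
        }
        tasks.pop();
    }
    return late;
}
```

A task that runs before `Clear()` is dequeued normally; one that was queued before `Clear()` but runs after it is reported as late. The thread treats such a late call as a contract violation, so the null check here is for illustration only and is not a suggested fix.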
Bisecting the regression range from Comment 0 points to Bug 736850 as the cause.
Blocks: 736850
Attached patch bandaid
this makes the crash go away (and confirms what joe said)
Duplicate of this bug: 738847
Comment on attachment 609463
bandaid

Nothing about https://hg.mozilla.org/mozilla-central/rev/734c1ef36151 looks obviously wrong to me.
Attachment #609463 - Flags: review?(jones.chris.g)
The problem seems to be that the CompositorParent is deallocating shared memory after the CompositorChild has already been destroyed. This patch prevents that.
Attachment #609540 - Flags: review?(jones.chris.g)
Comment on attachment 609540
Don't deallocate shared memory after destruction

This isn't quite right. We need to check if the layer manager is destroyed, not if the layer itself is destroyed.
Attachment #609540 - Flags: review?(jones.chris.g)
(In reply to Ali Juma [:ajuma] from comment #29)
> This isn't quite right. We need to check if the layer manager is destroyed,
> not if the layer itself is destroyed.

Probably an even better approach is to destroy the CompositorParent's layer manager before the CompositorChild is destroyed.
To recap, the problem is that when the CompositorParent's layer manager is destroyed, this triggers the destruction of shared memory held by ShadowThebesLayers, which in turn triggers IPC. Since we currently destroy the CompositorChild before destroying the CompositorParent's layer manager, this IPC is arriving too late.

This makes us destroy the CompositorParent's layer manager first, then process any resulting IPC, and then destroy the CompositorChild.
Attachment #609540 - Attachment is obsolete: true
Attachment #609742 - Flags: review?(jones.chris.g)
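The ordering problem recapped above can be modelled with a small sketch. Everything here is hypothetical (`ToyChild`, `ToyCompositor`, the `"dealloc-shmem"` message, and the segment count of 2); it only assumes what the comment says: destroying the layer manager releases shared memory, which produces IPC, and that IPC must be processed while the child still exists.

```cpp
#include <cassert>
#include <deque>
#include <string>

// Hypothetical receiver of IPC; not the real CompositorChild.
struct ToyChild {
    bool alive = true;
    int handled = 0;
    int dropped = 0;

    void Receive(const std::string&) {
        if (alive) {
            ++handled;
        } else {
            ++dropped;  // message arrived after destruction: "too late"
        }
    }
};

// Hypothetical owner of the layer manager and its shared memory.
struct ToyCompositor {
    std::deque<std::string> pending;  // messages in flight to the child
    ToyChild child;
    int shmem_segments = 2;           // arbitrary count for illustration

    // Releasing each segment queues one message for the child.
    void DestroyLayerManager() {
        for (int i = 0; i < shmem_segments; ++i) {
            pending.push_back("dealloc-shmem");
        }
        shmem_segments = 0;
    }

    void ProcessPending() {
        while (!pending.empty()) {
            child.Receive(pending.front());
            pending.pop_front();
        }
    }

    void DestroyChild() {
        assert(child.alive);  // destroy at most once
        child.alive = false;
    }
};

// Order described as the bug: child first, layer manager second.
inline int TeardownChildFirst() {
    ToyCompositor c;
    c.DestroyChild();
    c.DestroyLayerManager();
    c.ProcessPending();
    return c.child.dropped;
}

// Order described for the patch: layer manager, pending IPC, then child.
inline int TeardownLayerManagerFirst() {
    ToyCompositor c;
    c.DestroyLayerManager();
    c.ProcessPending();
    c.DestroyChild();
    return c.child.dropped;
}
```

In this model `TeardownChildFirst()` reports every deallocation message as dropped, while `TeardownLayerManagerFirst()` reports none, which is the property the patch is described as restoring.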
(In reply to Ali Juma [:ajuma] from comment #31)
> This makes us destroy the CompositorParent's layer manager first, then
> process any resulting IPC, and then destroy the CompositorChild.

"This *patch* makes us..."
Comment on attachment 609463
bandaid

If this happens someone has violated the IPC contract.
Attachment #609463 - Flags: review?(jones.chris.g) → review-
(In reply to Ali Juma [:ajuma] from comment #31)
> Created attachment 609742
> Destroy the compositor's layer manager before the CompositorChild gets
> destroyed.
> 
> To recap, the problem is that when the CompositorParent's layer manager is
> destroyed, this triggers the destruction of shared memory held by
> ShadowThebesLayers, which in turn triggers IPC. Since we currently destroy
> the CompositorChild before destroying the CompositorParent's layer manager,
> this IPC is arriving too late.
> 
> This makes us destroy the CompositorParent's layer manager first, then
> process any resulting IPC, and then destroy the CompositorChild.

Will take a look at this tonight.  Need to page in a lot of stuff.

Sorry for the delays here.
Attachment #609742 - Flags: review?(jones.chris.g) → review+
(In reply to Phil Ringnalda (:philor) from comment #36)
> Backed out in
> https://hg.mozilla.org/integration/mozilla-inbound/rev/5016d3f2b36d for
> native Talos bustage

These failures seem to be caused by calling MessageLoop::current()->RunAllPending() in nsBaseWidget's destructor (the purpose of this call was to ensure that any pending IPC got processed before the CompositorChild got destroyed). I'm working on a patch that, instead of making this call, adds an event to the MessageLoop to handle compositor destruction.
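The alternative described above (queue the teardown as an event rather than draining the loop from inside a destructor) can be sketched with a toy message loop. `ToyLoop`, `PostTask`, and `DeferredTeardownOrder` are hypothetical names for this illustration and are not the real `MessageLoop` API; the sketch assumes a plain FIFO queue, so a task posted later runs after everything already pending.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical FIFO task loop; not the real MessageLoop.
class ToyLoop {
public:
    void PostTask(std::function<void()> task) {
        tasks_.push(std::move(task));
    }

    // Runs tasks in the order they were posted.
    void RunAllPending() {
        while (!tasks_.empty()) {
            std::function<void()> task = std::move(tasks_.front());
            tasks_.pop();
            assert(task != nullptr);  // a queued task must be callable
            task();
        }
    }

private:
    std::queue<std::function<void()>> tasks_;
};

// Returns the order in which the three steps happen when teardown is
// posted as a task instead of being forced from the destructor.
inline std::vector<std::string> DeferredTeardownOrder() {
    ToyLoop loop;
    std::vector<std::string> log;

    // IPC that is already queued when the widget goes away.
    loop.PostTask([&log] { log.push_back("pending-ipc"); });

    // The "destructor": rather than draining the loop re-entrantly,
    // it queues compositor destruction behind the pending work.
    loop.PostTask([&log] { log.push_back("destroy-compositor"); });
    log.push_back("widget-destructor-returned");

    // The loop later drains on its own schedule.
    loop.RunAllPending();
    return log;
}
```

Because the queue is FIFO, the pending IPC is still processed before the compositor is destroyed, but the destructor itself returns without spinning the loop.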
Attachment #609742 - Attachment is obsolete: true
Attachment #610934 - Flags: review?(bgirard)
Attachment #610934 - Flags: review?(bgirard) → review+
https://hg.mozilla.org/mozilla-central/rev/db7260efc9a7
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Depends on: 741166
No longer depends on: 741166
I'm definitely not seeing this anymore!
Status: RESOLVED → VERIFIED