878366 - Crash on abort in mozilla::dom::ContentChild::ProcessingError

Gary Kwong [:gkw] [:nth10sd] (NOT official MoCo now)

Reporter

Description

•

12 years ago

Attached file adb output — Details

+++ This bug was initially created as a clone of Bug #867025 +++

I hit this too, using orangfuzz: bp-2b5abeb3-7e9f-45f8-afa2-e75332130601

Unfortunately I don't have a reliable testcase. James also mentioned in bug 867025 comment 35 that he could make the phone crash again with the patch in that bug.

===

I'm on 2013-05-24 v1.1.0 git revision cc2fd02fd461aa12c96e02229a78293365d65264

Scoobidiver (away)

Updated

•

12 years ago

Crash Signature: [@ mozalloc_abort | NS_DebugBreak_P | mozilla::dom::ContentChild::ProcessingError] [@ mozilla::dom::ContentChild::ProcessingError] → [@ mozalloc_abort | NS_DebugBreak_P | mozilla::dom::ContentChild::ProcessingError ] [@ mozilla::dom::ContentChild::ProcessingError ]

Gary Kwong [:gkw] [:nth10sd] (NOT official MoCo now)

Reporter

Comment 1

•

12 years ago

Interesting output:

I/Gecko ( 9860):
I/Gecko ( 9860): ###!!! [Child][AsyncChannel] Error: Route error: message sent to unknown actor ID
I/Gecko ( 9860):
I/Gecko ( 9860): [Child 9860] ###!!! ABORT: aborting because of fatal error: file ../../../gecko/dom/ipc/ContentChild.cpp, line 1009
E/Gecko ( 9860): mozalloc_abort: [Child 9860] ###!!! ABORT: aborting because of fatal error: file ../../../gecko/dom/ipc/ContentChild.cpp, line 1009

[:fabrice] Fabrice Desré

Comment 2

•

12 years ago

Ben, can you take a look?

blocking-b2g: leo? → -

Flags: needinfo?(bent.mozilla)

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 3

•

12 years ago

We don't have any data here that is actionable. We need to know the message id (combination of protocol type and message type) in order to get anything here.

Flags: needinfo?(bent.mozilla)

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 4

•

12 years ago

Oh, sorry, the only way to get that presently is to break in the debugger.

Gary Kwong [:gkw] [:nth10sd] (NOT official MoCo now)

Reporter

Comment 5

•

12 years ago

The unagi testdriver phones are usually not connected to a debugger. Can we somehow improve data collection here so that we can get information out of similar crashes in the future? (I fixed bug 879092 but I'm not sure if that will help in any way..)

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 6

•

12 years ago

You can try enabling IPC message logging (set IPC_MESSAGE_LOG_ENABLED=1 in environment) but that may be too noisy and/or slow. Other than that we'd need to file an IPC bug to improve the logging when we crash.

Gary Kwong [:gkw] [:nth10sd] (NOT official MoCo now)

Reporter

Comment 7

•

12 years ago

> Other than that we'd need to file an IPC bug to improve the logging when we
> crash.

Filed bug 879580.

Scoobidiver (away)

Updated

•

11 years ago

Crash Signature: [@ mozalloc_abort | NS_DebugBreak_P | mozilla::dom::ContentChild::ProcessingError ] [@ mozilla::dom::ContentChild::ProcessingError ] → [@ mozalloc_abort | NS_DebugBreak_P | mozilla::dom::ContentChild::ProcessingError ] [@ mozalloc_abort(char const*) | NS_DebugBreak | mozilla::dom::ContentChild::ProcessingError(mozilla::ipc::HasResultCodes::Result) ] [@ mozilla::dom::ContentChild::Proce…

Scoobidiver (away)

Comment 8

•

11 years ago

It's #6 crasher for all B2G versions.

Severity: blocker → critical

Depends on: 879580

Keywords: topcrash

leo.bugzilla.gecko

Comment 9

•

11 years ago

This occurs on leo too.

Scoobidiver (away)

Comment 10

•

11 years ago

It seems that I wrongly added a signature for a normal crash to previous signatures for a crash on abort. It's #6 top crasher in FxOS 1.0.1.

blocking-b2g: - → leo?

Crash Signature: [@ mozalloc_abort | NS_DebugBreak_P | mozilla::dom::ContentChild::ProcessingError ] [@ mozalloc_abort(char const*) | NS_DebugBreak | mozilla::dom::ContentChild::ProcessingError(mozilla::ipc::HasResultCodes::Result) ] [@ mozilla::dom::ContentChild::Proce… → [@ mozalloc_abort | NS_DebugBreak_P | mozilla::dom::ContentChild::ProcessingError ] [@ mozalloc_abort(char const*) | NS_DebugBreak | mozilla::dom::ContentChild::ProcessingError(mozilla::ipc::HasResultCodes::Result) ]

Summary: Crash [@ mozilla::dom::ContentChild::ProcessingError] → crash on abort in mozilla::dom::ContentChild::ProcessingError

Lukas Blakk [:lsblakk] use ?needinfo

Comment 11

•

11 years ago

(In reply to leo.bugzilla.gecko from comment #9)
> This occurs on leo too.

Can you elaborate? Are you seeing the signature? Are you reproducing a crash? We can't block this without actionable information to move on - being a topcrash isn't enough to block.

blocking-b2g: leo? → -

Flags: needinfo?(jaeohkim83)

Changbin Park

Comment 12

•

11 years ago

Attached file leo-callstack.log — Details

I'm attaching crash dump information from leo device.

Flags: needinfo?(jaeohkim83)

Jason Smith [:jsmith]

Comment 13

•

11 years ago

(In reply to Changbin Park from comment #12)
> Created attachment 789252 [details]
> leo-callstack.log
>
> I'm attaching crash dump information from leo device.

How did you reproduce the crash that caused this dump?

Changbin Park

Comment 14

•

11 years ago

It was found by our QA team. The tester said he had been testing a bunch of message-related things. He didn't notice when the crash occurred, but afterwards a different kind of problem showed up on the target, so it was passed to me, and I found that the crash had occurred while he was testing. That's all the detail I have, sorry.

jongsoo.oh

Comment 15

•

11 years ago

This issue was reproduced one more time. The phone was left in the camera view (Message - file attach - Camera). The crash occurred after a few minutes. It has the same call stack as this bug. Unfortunately, it could not be reproduced again.

bhavana bajaj [:bajaj]

Comment 16

•

11 years ago

:Gary, are you still able to reproduce this issue on 1.1 or master ?

bhavana bajaj [:bajaj]

Comment 17

•

11 years ago

(In reply to ben turner [:bent] (needinfo? encouraged) from comment #6)
> You can try enabling IPC message logging (set IPC_MESSAGE_LOG_ENABLED=1 in
> environment) but that may be too noisy and/or slow.
>
> Other than that we'd need to file an IPC bug to improve the logging when we
> crash.

Hey Ben, this has been filed, who is the right owner here https://bugzilla.mozilla.org/show_bug.cgi?id=879580 ?

Robert Kaiser

Comment 18

•

11 years ago

Note that this is the top 1.0.1 crash that seems to be in our code and is not fixed for newer versions. It's #3 over the last week of data, see https://crash-analysis.mozilla.com/rkaiser/2013-09-05/2013-09-05.b2g.topcrashes.weekly.html#b2g-1.0.1.0-prerelease and the ones above or near it are either not our code or are fixed in 1.1.

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 19

•

11 years ago

(In reply to bhavana bajaj [:bajaj] from comment #17)
> Hey Ben, this has been filed, who is the right owner here

Not sure, anyone who has done IPC work I guess. It's not prioritized at the moment so I don't think anyone has even looked at it.

Gary Kwong [:gkw] [:nth10sd] (NOT official MoCo now)

Reporter

Comment 20

•

11 years ago

(In reply to ben turner [:bent] (needinfo? encouraged) from comment #19)
> (In reply to bhavana bajaj [:bajaj] from comment #17)
> > Hey Ben, this has been filed, who is the right owner here
>
> Not sure, anyone who has done IPC work I guess. It's not prioritized at the
> moment so I don't think anyone has even looked at it.

I'll nominate bug 879580 to get it prioritized, hopefully.

Gary Kwong [:gkw] [:nth10sd] (NOT official MoCo now)

Reporter

Comment 21

•

11 years ago

(In reply to bhavana bajaj [:bajaj] from comment #16)
> :Gary, are you still able to reproduce this issue on 1.1 or master ?

I've not tested on 1.1 / master, and in any case won't be able to anytime soon, unfortunately.

Jason Smith [:jsmith]

Comment 22

•

11 years ago

mvines - I see this has been added to the 1.2 CS blocker list, but we don't have STR to action this bug (which has been the main reason we haven't actioned this bug from past releases). Did your team come across a way to reproduce this top crash?

Flags: needinfo?(mvines)

Michael Vines [:m1] [:evilmachines]

Comment 23

•

11 years ago

Yep, run orangutan for ~8 hours. We caught crash twice overnight.

Flags: needinfo?(mvines)

Jason Smith [:jsmith]

Comment 24

•

11 years ago

Noming cause this blocks the CS 1.2 blocker. Andrew - Can you find someone to look into this?

blocking-b2g: - → koi?

Flags: needinfo?(overholt)

Andrew Overholt [:overholt]

Comment 25

•

11 years ago

I spoke about this today with bsmedberg. He's going to try to work on (or find someone to work on) bug 879580. Once we've got that improved logging it will be easier to find this race.

Flags: needinfo?(overholt)

Michael Vines [:m1] [:evilmachines]

Comment 26

•

11 years ago

Great. We can run an orangutan or two on a build with some logging patches/etc pretty easily too. LMK

Benjamin Smedberg

Comment 27

•

11 years ago

I spent a while looking at the debugging possibilities here. The only good option is to run the tests with MOZ_IPC_MESSAGE_LOG on, so that we can see what message the parent process thinks it's sending (probably to a dead actor, perhaps because of a race sending a message in both directions). Does the stdout of b2g go to an attached debug device, or does it always stay on the phone? IPC logs can get big. I guess we're running these tests with non-debug builds, right?

Michael Vines [:m1] [:evilmachines]

Comment 28

•

11 years ago

(In reply to Benjamin Smedberg [:bsmedberg] from comment #27)
> The only good
> option is to run the tests with MOZ_IPC_MESSAGE_LOG on

We can do that here, LMK exactly what you need.

> Does the stdout of b2g go to an attached debug device, or does it always
> stay on the phone?

/dev/null right now, but I've been meaning to redirect stdout to logcat, as that's more helpful. We can do that.

> I guess we're running these tests with non-debug builds, right?

-eng builds right now, but without any additional Gecko debug settings enabled.

Benjamin Smedberg

Comment 29

•

11 years ago

A normal -eng build run with MOZ_IPC_MESSAGE_LOG=1 in the environment should dump a bunch of spew to stdout/stderr about messages being sent and received. If we can get that through logcat for one of these test runs, that would help tremendously.

Michael Vines [:m1] [:evilmachines]

Updated

•

11 years ago

Flags: needinfo?(mvines)

Benjamin Smedberg

Comment 30

•

11 years ago

Attached patch [Test] bug879580.patch — Details — Splinter Review

Debugging patch which should produce *some* output on stderr, even when MOZ_IPC_MESSAGE_LOG is not set. A log with MOZ_IPC_MESSAGE_LOG would still be better.

Michael Vines [:m1] [:evilmachines]

Comment 31

•

11 years ago

Looks like --enable-debug is needed for MOZ_IPC_MESSAGE_LOG=1 to work, which we don't enable by default with -eng. Looks like s/false/true/ at http://dxr.mozilla.org/mozilla-central/source/ipc/glue/ProtocolUtils.h?from=LoggingEnabled#l112 will do the trick instead? With that change I see a ton of stderr output like:

[time:316253345887993][721][PContentChild] Sending Msg_AsyncMessage([TODO])
[time:316253345888908][653][PContentParent] Received Msg_AsyncMessage([TODO])
[time:316253346227936][721][PLayerTransactionChild] Sending Msg_Update([TODO])
[time:316253346228803][653][PLayerTransactionParent] Received Msg_Update([TODO])
[time:316253346229286][653][PLayerTransactionParent] Sending reply Reply_Update([TODO])
[time:316253346232066][721][PLayerTransactionChild] Received reply Reply_Update([TODO])
[time:316253346232966][721][PLayerChild] Sending Msg___delete__([TODO])
[time:316253346233616][721][PLayerChild] Sending Msg___delete__([TODO])

Is this the kind of logging that'll help?
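For anyone post-processing these logs, here is a minimal Python sketch of a line parser. The pattern is inferred only from the sample lines in this comment and is not an official format description; the `IpcEvent` and `parse_ipc_line` names are made up for illustration.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the sample output above (an assumption, not a spec):
# [time:<n>][<pid>][<actor>] <direction> <message>([TODO])
LINE_RE = re.compile(
    r"\[time:(?P<time>\d+)\]\[(?P<pid>\d+)\]\[(?P<actor>\w+)\] "
    r"(?P<direction>Sending reply|Received reply|Sending|Received) "
    r"(?P<message>\w+)\("
)


class IpcEvent(NamedTuple):
    """One parsed IPC log line (hypothetical helper type)."""
    time: int
    pid: int
    actor: str
    direction: str
    message: str


def parse_ipc_line(line: str) -> Optional[IpcEvent]:
    """Return the IPC event found in a log line, or None if there is none."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return IpcEvent(int(m["time"]), int(m["pid"]), m["actor"],
                    m["direction"], m["message"])


if __name__ == "__main__":
    sample = "[time:316253345887993][721][PContentChild] Sending Msg_AsyncMessage([TODO])"
    print(parse_ipc_line(sample))
```

Because it uses `search`, the same function also accepts lines that carry a logcat prefix such as `I/Gecko ( 852): ` in front of the bracketed fields.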

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 32

•

11 years ago

That's the stuff you're looking for!

Gary Kwong [:gkw] [:nth10sd] (NOT official MoCo now)

Reporter

Updated

•

11 years ago

Summary: crash on abort in mozilla::dom::ContentChild::ProcessingError → Crash on abort in mozilla::dom::ContentChild::ProcessingError

Preeti Raghunath(:Preeti)

Comment 33

•

11 years ago

+ed it for crash issue

blocking-b2g: koi? → koi+

Michael Vines [:m1] [:evilmachines]

Comment 34

•

11 years ago

(The monkeys are still trying to catch this w/ IPC logging. Hopefully they'll get lucky over the weekend.)

Flags: needinfo?(mvines)

Michael Vines [:m1] [:evilmachines]

Updated

•

11 years ago

Flags: needinfo?(mvines)

Michael Vines [:m1] [:evilmachines]

Comment 35

•

11 years ago

Attached file Gallery crash — Details

Here's one example of this crash with IPC logging enabled. The crashing app was gallery. I've attached all Gecko logcat output for the Gallery and b2g processes leading up to the crash.

APPLICATION OOM_ADJ OOM_SCORE OOM_SCORE_ADJ USER     PID   PPID  VSIZE  RSS   WCHAN    PC         NAME
Gallery     1       138       67            app_2532 2532  29689 91184  29248 ffffffff 407dfabc S /system/b2g/plugin-container
b2g         0       178       0             root     29689 1     277292 85316 ffffffff 400b0604 S /system/b2g/b2g
(background apps omitted)

Flags: needinfo?(mvines)

Michael Vines [:m1] [:evilmachines]

Comment 36

•

11 years ago

Attached file Gallery crash #2 — Details

Another log of this crash, again in the Gallery. b2g pid is 843, Gallery pid is 20847.

Michael Vines [:m1] [:evilmachines]

Comment 37

•

11 years ago

Crash was observed again overnight, and again in the Gallery app

Josh Matthews [:jdm]

Comment 38

•

11 years ago

Here's the important part from the first log:

>I/Gecko (29689): [time:1379917243883118][29689][PContentParent] Sending Msg_AsyncMessage([TODO])
>I/Gecko (2532):
>I/Gecko ( 2532): ###!!! [Child][AsyncChannel] Error: Route error: message sent to unknown actor ID
>I/Gecko (2532):
>I/Gecko ( 2532): [Child 2532] ###!!! ABORT: aborting because of MsgRouteError: file /local/mnt/buildbot/v1.2_msm7627a/build/gecko/dom/ipc/ContentChild.cpp, line 1064

Josh Matthews [:jdm]

Comment 39

•

11 years ago

This log doesn't look all that useful without actor IDs, unfortunately :(

Michael Vines [:m1] [:evilmachines]

Comment 40

•

11 years ago

(In reply to Josh Matthews [:jdm] from comment #39)
> This log doesn't look all that useful without actor IDs, unfortunately :(

If you have a patch that outputs what you need to see I can easily add that temporarily here so we can get that data the next time the crash occurs.

Michael Vines [:m1] [:evilmachines]

Comment 41

•

11 years ago

Crash occurred last night with attachment 807464 [details] [diff] [review] applied ("DumpMessage RouteError for message: .type=1310722 .size=40"). This time it was a camera app crash instead of the gallery app.

I/Gecko ( 852): [time:1380104599666268][852][PGrallocBufferChild] Received Msg___delete__([TODO])
I/Gecko ( 852): [time:1380104599666447][852][PGrallocBufferChild] Received Msg___delete__([TODO])
I/Gecko ( 852): [time:1380104599670447][852][PBrowserParent] Received Msg_PContentPermissionRequestConstructor([TODO])
I/Gecko ( 852): [time:1380104599671093][852][PContentPermissionRequestParent] Received Msg_prompt([TODO])
I/Gecko ( 852): [time:1380104599692020][852][PContentPermissionRequestParent] Sending Msg___delete__([TODO])
I/Gecko ( 9942): DumpMessage RouteError for message: .type=1310722 .size=40
I/Gecko ( 9942):
I/Gecko ( 9942): ###!!! [Child][AsyncChannel] Error: Route error: message sent to unknown actor ID
I/Gecko ( 9942):
I/Gecko ( 9942): [Child 9942] ###!!! ABORT: aborting because of MsgRouteError: file /local/mnt/buildbot/v1.2/build/gecko/dom/ipc/ContentChild.cpp, line 1064
I/Gecko ( 9942): [time:1380104599692787][9942][PCrashReporterChild] Sending Msg_AppendAppNotes([TODO])
E/Gecko ( 9942): mozalloc_abort: [Child 9942] ###!!! ABORT: aborting because of MsgRouteError: file /local/mnt/buildbot/v1.2/build/gecko/dom/ipc/ContentChild.cpp, line 1064

Milan Sreckovic [:milan] (needinfo for best results)

Comment 42

•

11 years ago

Benoit, related to the crash you're looking at? I know it's hard to tell from the info here, but just wondering.

Flags: needinfo?(bjacob)

Benoit Jacob [:bjacob] (mostly away)

Comment 43

•

11 years ago

Depends on what you mean by 'related'. The bug I'm looking at, bug 914823, has different symptoms (see call stack in bug 914823 comment 28) and a specific cause (we are referencing an ISurfaceAllocator that already died, we couldn't know it because it's not reference-counted, it can't be reference-counted because IPDL actors aren't reference-counted) that seems to be different from what is being discussed above here. But, on a more meta level, the present bug, bug 914823, and generally half of the b2g crashes I've been dealing with, are 'related' in that they share a common aspect: we crash as we have dangling raw pointers to already-dead IPDL actors, and we can't easily fix that because IPDL actors aren't refcounted, and as long as they're not refcounted, the only way we can write non-crashing code is by having a mental model of the entire B2G IPC-facing codebase, and well, I don't know that anyone anymore has such a mental model.

Flags: needinfo?(bjacob)

Andrew Overholt [:overholt]

Comment 44

•

11 years ago

(In reply to Benoit Jacob [:bjacob] from comment #43)
> But, on a more meta level, the present bug, bug 914823, and generally half
> of the b2g crashes I've been dealing with, are 'related' in that they share
> a common aspect: we crash as we have dangling raw pointers to already-dead
> IPDL actors, and we can't easily fix that because IPDL actors aren't
> refcounted, and as long as they're not refcounted, the only way we can write
> non-crashing code is by having a mental model of the entire B2G IPC-facing
> codebase, and well, I don't know that anyone anymore has such a mental model.

This piqued my interest so I asked Ben and he said IPDL actors can be refcounted. He opined that perhaps you were thinking of gralloc-associated actors?

Milan Sreckovic [:milan] (needinfo for best results)

Comment 45

•

11 years ago

This was a new discovery. From IRC:

16:17 bjacob: newsflash: we _can_ have refcounted IPDL actors, in fact netwerk/ already has a few, and it's even discussed here: https://developer.mozilla.org/en-US/docs/IPDL/Best_Practices !!
16:18 bjacob: (thanks to jdm) !
16:44 bjacob: milan: yes, jdm pointed me to netwerk/ipc/Necko{Child,Parent}.*
16:45 bjacob: milan: the other problem is the SurfaceDescriptor IPDL union, which compiles in C++ to raw pointers. There too, jdm suggested a realistic fix: mirror it in C++ by a RefcountingSurfaceDescriptor class that would do the refcounting (and have a constructor taking an IPDL SurfaceDescriptor)

Benoit Jacob [:bjacob] (mostly away)

Comment 46

•

11 years ago

Yep, IPDL actors can be refcounted, but none of the graphics-facing ones are, and until two hours ago, none of us gfx people around here in Toronto knew about that possibility! Many thanks to :jdm for educating us. Switching our IPDL actors to refcounting is a fairly nontrivial code change though, so it might be too big a change for B2G 1.2 at this point. Let's do it on mozilla-central for B2G v1.3, but for B2G 1.2 bugs, for now, let's still try to fix them without switching non-refcounted classes to refcounting...

Andreas Gal :gal

Comment 47

•

11 years ago

If we try to fix this without ref counting, even assuming that we properly identify all the places where we crash, how will we be confident that the fix will actually work? We have been dealing with this class of bugs for about 9 months now, and every time you whack a mole, a new one pops up.

Milan Sreckovic [:milan] (needinfo for best results)

Updated

•

11 years ago

Assignee: nobody → bjacob

Milan Sreckovic [:milan] (needinfo for best results)

Comment 48

•

11 years ago

Benoit and Sotaro are connecting with Ben Turner to see how large and scary this change would be.

Andreas Gal :gal

Comment 49

•

11 years ago

This was flagged as a high priority bug by the chipset vendor. How are we doing here?

Benoit Jacob [:bjacob] (mostly away)

Comment 50

•

11 years ago

I've been busy until yesterday with 1) bug 914823, which is another occurrence of this kind of issue, and 2) trying to understand what would generally be the right approach to fixing this kind of issue.

Now I think that the approach we're taking in bug 914823, which is to make the actor SupportsWeakPtr, is a good one:
- it is unintrusive, conservative enough to land on aurora;
- it actually makes sense as 'the right solution' as it matches the reality that from the point of view of DOM elements, the whole IPC system can disappear at any time. We can have a given actor outlive the IPC system by holding a strong reference to it, and that can be useful to avoid crashing, but we can't prevent the IPC system from going down. For example, if the IPC system runs out of file descriptors, there is nothing we can do at the moment to do IPC again, IIUC.

I filed bug 923530 as a tracking bug for this kind of issue; blocking it. I'll now try to fix the present bug...

Blocks: 923530

Benoit Jacob [:bjacob] (mostly away)

Comment 51

•

11 years ago

So to be clear, I do think that reference-counting actors is going to be a part of the solution, if only because there is no way to do thread-safe weak references without being able to convert a weak reference to a strong reference before dereferencing it. But reference-counting alone, which I previously thought would be the solution, isn't a complete solution by itself, because we can't always prevent the IPC system from going down earlier than expected (IPC errors can happen at any time). So we still have to react to unexpected actor death, and using weak references seems like our best way to do so at the moment.
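The "promote a weak reference to a strong one before dereferencing" pattern can be sketched outside Gecko. Below is a small Python illustration using the standard `weakref` module. The `Actor` class, its `alive` flag, and `send_if_alive` are hypothetical names for this sketch and do not correspond to Gecko's WeakPtr or IPDL classes.

```python
import weakref


class Actor:
    """Toy stand-in for an IPC actor (hypothetical, not a Gecko class)."""

    def __init__(self, name: str) -> None:
        self.name = name
        # Assumed flag: the object can outlive its IPC channel, so liveness
        # of the object and liveness of IPC are tracked separately.
        self.alive = True

    def send(self, message: str) -> bool:
        """Pretend to send; refuse if the IPC side has already gone away."""
        return self.alive


def send_if_alive(actor_ref: "weakref.ref[Actor]", message: str) -> bool:
    """Promote the weak reference to a strong one, then use it."""
    actor = actor_ref()       # strong reference, or None if already freed
    if actor is None:
        return False          # actor object is gone: react instead of crashing
    return actor.send(message)  # the strong reference keeps it alive for the call


if __name__ == "__main__":
    actor = Actor("PContentPermissionRequestChild")
    ref = weakref.ref(actor)
    print(send_if_alive(ref, "Msg_prompt"))
    actor.alive = False       # IPC went down while the object still exists
    print(send_if_alive(ref, "Msg_prompt"))
```

The sketch covers both cases raised in this comment: the object has been freed (the weak reference resolves to None), and the object still exists but IPC has gone down underneath it.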

Benoit Jacob [:bjacob] (mostly away)

Comment 52

•

11 years ago

Does anyone have STR that I can use while having USB debugging? The only definite STR that I can see here is:

(In reply to jongsoo.oh from comment #15)
> This issue was reproduced one more time.
> The phone was left in the camera view (Message - file attach - Camera).
> The crash occurred after a few minutes. It has the same call stack as this bug.
> Unfortunately, it could not be reproduced again.

However, when I try this with USB debugging plugged, I get a modal message:

"Camera can not be used while plugged in - Unplug the phone to view pictures"

Can I work around it?

Flags: needinfo?(zzongsoo)

Flags: needinfo?(mvines)

Flags: needinfo?(gary)

Benoit Jacob [:bjacob] (mostly away)

Comment 53

•

11 years ago

(In reply to Michael Vines [:m1] [:evilmachines] from comment #36)
> Created attachment 808744 [details]
> Gallery crash #2
>
> Another log of this crash, again in the Gallery. b2g pid is 843, Gallery
> pid is 20847.

From this log:

[20847][PBrowserChild] Sending Msg_PContentPermissionRequestConstructor([TODO])
[20847][PContentPermissionRequestChild] Sending Msg_prompt([TODO])
[20847][PBrowserChild] Received Msg_RealTouchEvent([TODO])
[20847][PLayerTransactionChild] Sending Msg_Update([TODO])
[843][PLayerTransactionParent] Received Msg_Update([TODO])
[843][PLayerTransactionParent] Sending reply Reply_Update([TODO])
[20847][PLayerTransactionChild] Received reply Reply_Update([TODO])
[843][PContentParent] Received Msg_AsyncMessage([TODO])
[843][PContentParent] Sending Msg_AsyncMessage([TODO])
[843][PBrowserParent] Sending Msg_Deactivate([TODO])
[20847][PBrowserChild] Received Msg_Deactivate([TODO])
[843][PBrowserParent] Sending Msg_Activate([TODO])
[843][PBrowserParent] Sending Msg_AsyncMessage([TODO])
[843][PBrowserParent] Sending Msg_Destroy([TODO])
[20847][PBrowserChild] Received Msg_Destroy([TODO])
[20847][PContentChild] Sending Msg_AudioChannelChangeDefVolChannel([TODO])
[20847][PIndexedDBDatabaseChild] Sending Msg_Close([TODO])
[20847][PCompositableChild] Sending Msg___delete__([TODO])
[843][PCompositableParent] Received Msg___delete__([TODO])
[20847][PLayerChild] Sending Msg___delete__([TODO])
[843][PLayerParent] Received Msg___delete__([TODO])
[843][PGrallocBufferParent] Sending Msg___delete__([TODO])
[843][PGrallocBufferParent] Sending Msg___delete__([TODO])
[20847][PLayerChild] Sending Msg___delete__([TODO])
[843][PLayerParent] Received Msg___delete__([TODO])
[20847][PLayerChild] Sending Msg___delete__([TODO])
[843][PLayerParent] Received Msg___delete__([TODO])
[20847][PRenderFrameChild] Sending Msg___delete__([TODO])
[20847][PBrowserChild] Sending Msg___delete__([TODO])
[20847][PContentChild] Sending Msg_AsyncMessage([TODO])
[20847][PContentChild] Sending Msg_AsyncMessage([TODO])
[20847][PGrallocBufferChild] Received Msg___delete__([TODO])
[20847][PGrallocBufferChild] Received Msg___delete__([TODO])
[843][PLayerTransactionChild] Sending Msg_PLayerConstructor([TODO])
[843][PLayerTransactionParent] Received Msg_PLayerConstructor([TODO])
[843][PLayerTransactionChild] Sending Msg_Update([TODO])
[843][PLayerTransactionParent] Received Msg_Update([TODO])
[843][PLayerTransactionParent] Sending reply Reply_Update([TODO])
[843][PLayerTransactionChild] Received reply Reply_Update([TODO])
[843][PLayerChild] Sending Msg___delete__([TODO])
[843][PLayerParent] Received Msg___delete__([TODO])
[843][PBrowserParent] Received Msg_PContentPermissionRequestConstructor([TODO])
[843][PContentPermissionRequestParent] Received Msg_prompt([TODO])
[843][PContentPermissionRequestParent] Sending Msg___delete__([TODO])
I/Gecko (20847):
I/Gecko (20847): ###!!! [Child][AsyncChannel] Error: Route error: message sent to unknown actor ID
I/Gecko (20847):

Here is one interpretation of this log, please tell me whether it makes sense: the Layers / Gfx stuff here is actually a red herring, the problem is we're apparently doing a Send___delete__ on a bad PContentPermissionRequestParent. Does that make sense?
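The reading of the log above (the last message sent by the other process before the route error is the prime suspect) can be automated as a rough heuristic. The Python sketch below assumes the line shape shown in this comment. The function name is hypothetical, and since log lines from two processes are not guaranteed to be strictly ordered, the result is a hint rather than proof.

```python
import re
from typing import Iterable, Optional, Tuple

# Line shape assumed from the excerpt above: [<pid>][<actor>] Sending <msg>(
# "Sending reply ..." lines do not match this pattern and are skipped.
EVENT_RE = re.compile(
    r"\[(?P<pid>\d+)\]\[(?P<actor>\w+)\] (?P<direction>Sending|Received) (?P<message>\w+)\("
)


def last_send_before_error(lines: Iterable[str],
                           crashing_pid: int) -> Optional[Tuple[str, str]]:
    """Return (actor, message) of the last non-reply send logged by another
    process before the first 'Route error' line, or None if there is none."""
    suspect = None
    for line in lines:
        if "Route error" in line:
            return suspect
        m = EVENT_RE.search(line)
        if m and m["direction"] == "Sending" and int(m["pid"]) != crashing_pid:
            suspect = (m["actor"], m["message"])
    return None  # no route error found in the log


if __name__ == "__main__":
    tail = [
        "[843][PBrowserParent] Received Msg_PContentPermissionRequestConstructor([TODO])",
        "[843][PContentPermissionRequestParent] Received Msg_prompt([TODO])",
        "[843][PContentPermissionRequestParent] Sending Msg___delete__([TODO])",
        "I/Gecko (20847): ###!!! [Child][AsyncChannel] Error: Route error: message sent to unknown actor ID",
    ]
    print(last_send_before_error(tail, crashing_pid=20847))
```

On the tail of this log it points at the `Msg___delete__` sent by PContentPermissionRequestParent, which matches the interpretation proposed here.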

Flags: needinfo?(bent.mozilla)

Benoit Jacob [:bjacob] (mostly away)

Comment 54

•

11 years ago

...rather, the PContentPermissionRequestParent is fine (it just received a message) but the matching PContentPermissionRequest***Child*** is already dead. Does _that_ make sense?

Benoit Jacob [:bjacob] (mostly away)

Comment 55

•

11 years ago

Asking Doug who seems to have been maintaining code around PContentPermissionRequestChild.

Michael Vines [:m1] [:evilmachines]

Comment 56

•

11 years ago

(In reply to Benoit Jacob [:bjacob] from comment #52)
> However, when I try this with USB debugging plugged, I get a modal message:
>
> "Camera can not be used while plugged in - Unplug the phone to view pictures"
>
> Can I work around it?

Sounds like you may have Settings -> Enable USB storage enabled. Try disabling.

Flags: needinfo?(mvines)

Sotaro Ikeda [:sotaro]

Comment 57

•

11 years ago

I also think attachment 808694 [details] and attachment 808744 [details] seem unrelated to gfx. From the logs: the crash in attachment 808694 [details] seems to be triggered by calling PContentParent::SendAsyncMessage(). The crash in attachment 808744 seems to be triggered by calling PContentPermissionRequestParent::Send___delete__(). Both seem to try to send to an already disconnected/deleted IPC object.

Sotaro Ikeda [:sotaro]

Comment 58

•

11 years ago

Comment 57 reminds me of Bug 867025. Incorrect IPC handling is not only a gfx problem; it could also happen in any other IPC in Gecko. With Gecko's current IPC architecture/implementation, it is very difficult to use it correctly in all situations. This kind of rare crash could easily happen in b2g. - Bug 867025 - [unagi][tara][weekly build 13.04.17]monkey test crash in mozilla::dom::ContentChild::ProcessingError

Benoit Jacob [:bjacob] (mostly away)

Comment 59

•

11 years ago

(In reply to Michael Vines [:m1] [:evilmachines] from comment #56)
> (In reply to Benoit Jacob [:bjacob] from comment #52)
> > However, when I try this with USB debugging plugged, I get a modal message:
> >
> > "Camera can not be used while plugged in - Unplug the phone to view pictures"
> >
> > Can I work around it?
>
> Sounds like you may have Settings -> Enable USB storage enabled. Try
> disabling.

Thanks, that fixed this issue.

Now I am running into another problem: as I follow the STR, which involves switching the Communications app to the background, the Communications app quickly gets killed by the background-process killer before I get a chance to crash it.

Can I disable the background-process killer or have Communications be spared by it?

Benoit Jacob [:bjacob] (mostly away)

Comment 60

•

11 years ago

I tried writing 0 to /proc/PID/oom_score_adj and oom_adj, and the value was apparently remembered, but Communications still got killed.

Benoit Jacob [:bjacob] (mostly away)

Comment 61

•

11 years ago

Michael, the OOM killer issue reported in comment 59 -- 60 is preventing me from reproducing. What would be current steps-to-reproduce on Hamachi?

Flags: needinfo?(mvines)

jongsoo.oh

Comment 62

•

11 years ago

(In reply to Benoit Jacob [:bjacob] from comment #52)
> Does anyone have STR that I can use while having USB debugging? The only
> definite STR that I can see here is:
>
> (In reply to jongsoo.oh from comment #15)
> > This issue was reproduced one more time.
> > The phone was left in the camera view (Message - file attach - Camera).
> > The crash occurred after a few minutes. It has the same call stack as this bug.
> > Unfortunately, it could not be reproduced again.
>
> However, when I try this with USB debugging plugged, I get a modal message:
>
> "Camera can not be used while plugged in - Unplug the phone to view pictures"
>
> Can I work around it?

After disabling USB storage, you can use the camera with Remote debugging.

Flags: needinfo?(zzongsoo)

Michael Vines [:m1] [:evilmachines]

Comment 63

•

11 years ago

(In reply to Benoit Jacob [:bjacob] from comment #61)
> Michael, the OOM killer issue reported in comment 59 -- 60 is preventing me
> from reproducing. What would be current steps-to-reproduce on Hamachi?

You could probably tweak the LMK parameters at http://dxr.mozilla.org/mozilla-central/source/b2g/app/b2g.js?from=b2g.js#l611 to avoid getting killed.

However, looking at the crash database here I see that we were reliably reproducing this issue in overnight orangutan testing from Sept 22nd through Oct 2nd, but have not observed this crash since. Mystery fix from some other patch? But until we begin to see this crash again with the same frequency it's not a blocker for us anymore.

Flags: needinfo?(mvines)

Gary Kwong [:gkw] [:nth10sd] (NOT official MoCo now)

Reporter

Updated

•

11 years ago

Flags: needinfo?(gary)

Benoit Jacob [:bjacob] (mostly away)

Comment 64

•

11 years ago

Milan, given that this seems 1) already resolved (comment 63) and 2) not graphics-related (comments 53, 57, 58), I suggest un-assigning me.

Flags: needinfo?(bent.mozilla) → needinfo?(milan)

Michael Vines [:m1] [:evilmachines]

Comment 65

•

11 years ago

So we did see this again last night, but just once. I'll update here if this starts to become a regular occurrence again.

Milan Sreckovic [:milan] (needinfo for best results)

Updated

•

11 years ago

Assignee: bjacob → nobody

Flags: needinfo?(milan)

Michael Vines [:m1] [:evilmachines]

Comment 66

•

11 years ago

(Sadly this crash has returned on v1.2, we've been seeing it pretty regularly since Oct 09. LMK how I can help)

Andreas Gal :gal

Comment 67

•

11 years ago

Milan we need to get on top of this. Until we have a new owner, you remain the owner. Feel free to ask for help from the content team.

Assignee: nobody → milan

Milan Sreckovic [:milan] (needinfo for best results)

Comment 68

•

11 years ago

Overholt has asked Cervantes to take this on.

Assignee: milan → cyu

Doug Turner (:dougt)

Comment 69

•

11 years ago

bent and I reviewed the prompt proxy and couldn't see anything obvious from this protocol and implementation. cyu, please reach out if you have any questions or get stuck.

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 70

•

11 years ago

Attached patch Clean up PContentPermissionRequest (obsolete) — Details — Splinter Review

I think I may have found the cause here. Can someone please apply this patch and throw some monkeys at it?

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 71

•

11 years ago

Attached patch Clean up PContentPermissionRequest (obsolete) — Details — Splinter Review

Oops, the right patch now.

Attachment #817549 - Attachment is obsolete: true

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 72

•

11 years ago

Attached patch Clean up PContentPermissionRequest (obsolete) — Details — Splinter Review

Really final patch.

Attachment #817554 - Attachment is obsolete: true

Cervantes Yu [:cyu] [:cervantes]

Assignee

Comment 73

•

11 years ago

I just caught the same crash in the camera app. The message in question is PContentPermissionRequestMsgStart << 16 | Msg___delete____ID. I'll test the patch to see if the crash is fixed.
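For anyone following along, the message id mentioned in comment 3 and spelled out above can be sketched roughly as follows. This is only an illustration of the "protocol type in the high 16 bits, message type in the low 16 bits" layout described in this bug. The numeric values and the helper names (MakeMessageId, ProtocolOf, MessageOf) are made up for the example; the real constants are generated from the IPDL protocol definitions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical values for illustration only. They stand in for
// PContentPermissionRequestMsgStart and Msg___delete____ID, whose real
// values come from generated code and are not given in this bug.
constexpr uint32_t kExampleProtocolStart = 0x2A;
constexpr uint32_t kExampleDeleteMsg = 0x0001;

// Combine protocol type (high 16 bits) and message type (low 16 bits).
constexpr uint32_t MakeMessageId(uint32_t aProtocolStart, uint32_t aMsg) {
  return (aProtocolStart << 16) | (aMsg & 0xFFFF);
}

// Recover the two halves from a combined id, e.g. one read in a debugger.
constexpr uint32_t ProtocolOf(uint32_t aId) { return aId >> 16; }
constexpr uint32_t MessageOf(uint32_t aId) { return aId & 0xFFFF; }

// Small self-check showing the round trip.
inline void ExampleRoundTrip() {
  const uint32_t id = MakeMessageId(kExampleProtocolStart, kExampleDeleteMsg);
  assert(ProtocolOf(id) == kExampleProtocolStart);
  assert(MessageOf(id) == kExampleDeleteMsg);
}
```

Splitting an id seen in the debugger with the two accessor helpers gives the protocol and the message separately, which is the information comment 3 asked for.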

Michael Vines [:m1] [:evilmachines]

Comment 74

•

11 years ago

Our monkeys are v1.2 only at the moment. :bent, if you can rebase the patch on aurora then I'll put them to work overnight on it.

Michael Vines [:m1] [:evilmachines]

Updated

•

11 years ago

Flags: needinfo?(bent.mozilla)

Cervantes Yu [:cyu] [:cervantes]

Assignee

Comment 75

•

11 years ago

I applied the patch and made some tests. The camera app still has the same crash.

Cervantes Yu [:cyu] [:cervantes]

Assignee

Comment 76

•

11 years ago

Attached patch Don't call ContentPermissionRequestParent::Send__delete__() if the managing TabParent is already Destroy()'d (obsolete) — Details — Splinter Review

This should fix the crash. In my tests this fixes the crash. :m1, could you please set up the test to verify it?

Attachment #818102 - Flags: review?(bent.mozilla)

Flags: needinfo?(mvines)

Cervantes Yu [:cyu] [:cervantes]

Assignee

Comment 77

•

11 years ago

Comment #53 already tells us the story. Here is the analysis:

[20847][PBrowserChild] Sending Msg_PContentPermissionRequestConstructor([TODO])
[20847][PContentPermissionRequestChild] Sending Msg_prompt([TODO])
* Content permission prompt requested.
[843][PBrowserParent] Sending Msg_Destroy([TODO])
* TabParent::Destroy() is called. From this point on we should not send messages of PBrowser or protocols managed by it.
[20847][PBrowserChild] Received Msg_Destroy([TODO])
* The destroy message is received. From this point on the protocol tree on the child side is destroyed.
* Many managed protocols being deleted.
[20847][PBrowserChild] Sending Msg___delete__([TODO])
* At the end of TabChild::RecvDestroy(), PBrowserChild sends out Msg___delete__.
[843][PBrowserParent] Received Msg_PContentPermissionRequestConstructor([TODO])
[843][PContentPermissionRequestParent] Received Msg_prompt([TODO])
* The prompt request is finally received. But since PBrowser is already Destroy()'d, we should not send back any messages.
[843][PContentPermissionRequestParent] Sending Msg___delete__([TODO])
* This is the message that cannot be routed and causes the crash!!
* Since we haven't received Msg___delete__ from PBrowserChild, the actors on the parent side are still alive. That's why PContentPermissionRequestParent can still send out messages.
I/Gecko (20847):
I/Gecko (20847): ###!!! [Child][AsyncChannel] Error: Route error: message sent to unknown actor ID
I/Gecko (20847):
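The ordering in the analysis above can be reproduced with a toy model. This is not Gecko code: Message, ChildSide, ParentSide and RunScenario are hypothetical names invented for the sketch, and the "guard" flag stands in for the idea of not sending once the managing TabParent has been Destroy()'d. It only shows why a late reply lands on an actor id the child no longer knows.

```cpp
#include <cassert>
#include <deque>
#include <set>
#include <string>

// Toy model only; every type here is a made-up stand-in.
struct Message {
  int actorId;
  std::string name;
};

struct ChildSide {
  std::set<int> liveActors;
  bool routeError = false;

  void Receive(const Message& aMsg) {
    if (aMsg.name == "Destroy") {
      // Receiving Destroy tears down the whole protocol tree on the child.
      liveActors.clear();
      return;
    }
    if (liveActors.count(aMsg.actorId) == 0) {
      // Models "Route error: message sent to unknown actor ID".
      routeError = true;
    }
  }
};

struct ParentSide {
  bool destroyed = false;
  std::deque<Message> outbox;

  void Destroy(int aBrowserId) {
    assert(!destroyed);
    destroyed = true;
    outbox.push_back({aBrowserId, "Destroy"});
  }

  // The late prompt is handled after Destroy() was already called.
  void OnPrompt(int aRequestId, bool aGuard) {
    if (aGuard && destroyed) {
      return;  // Do not send anything for a tab that is going away.
    }
    outbox.push_back({aRequestId, "__delete__"});
  }
};

// Returns true if the child hit a route error.
bool RunScenario(bool aGuard) {
  const int kBrowser = 1;
  const int kRequest = 2;
  ChildSide child;
  child.liveActors = {kBrowser, kRequest};
  ParentSide parent;

  parent.Destroy(kBrowser);           // Parent sends Msg_Destroy.
  parent.OnPrompt(kRequest, aGuard);  // Queued prompt is processed late.

  for (const Message& msg : parent.outbox) {
    child.Receive(msg);
  }
  return child.routeError;
}
```

In this model RunScenario(false) reports the route error and RunScenario(true) does not, which matches the behaviour described in the analysis under these simplifying assumptions.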

Cervantes Yu [:cyu] [:cervantes]

Assignee

Comment 78

•

11 years ago

And my STR:
* Open the camera app (or the camera process is killed when we leave the inline activity).
* Open the message app and then add attachment from the camera.
* Since this is a monkey test, let's act like a monkey! Start quickly tapping even before the camera preview is shown (not sure if this is related to the crash, but this will bring up the camera app UI, not the inline activity UI).
* Take pictures (also quickly tapping the camera button).
* And quickly switch between camera and gallery.
* If this doesn't crash, go back to the message app, or change homescreen background from the camera.

Doug Turner (:dougt)

Comment 79

•

11 years ago

cyu, do you know why ContentPermissionRequestParent::ActorDestroy isn't being called?

Cervantes Yu [:cyu] [:cervantes]

Assignee

Comment 80

•

11 years ago

(In reply to Doug Turner (:dougt) from comment #79)
> cyu, do you know why ContentPermissionRequestParent::ActorDestroy isn't
> being called?

It hasn't been called yet at the point where we crash. It should be called after PBrowserParent receives Msg___delete__. PBrowserChild sends out the message, but it's still in the parent's queue.

Michael Vines [:m1] [:evilmachines]

Comment 81

•

11 years ago

(In reply to Cervantes Yu from comment #76)
> :m1, could you please set up the test to verify it?

The patch doesn't apply to v1.2. Can you please rebase it?

Flags: needinfo?(mvines)

Cervantes Yu [:cyu] [:cervantes]

Assignee

Comment 82

•

11 years ago

Attached patch Patch rebased for v1.2 (obsolete) — Details — Splinter Review

Rebased to v1.2. This should apply. Please test using this one.

Flags: needinfo?(mvines)

Michael Vines [:m1] [:evilmachines]

Comment 83

•

11 years ago

Thanks, I've added this patch internally now and I'll report back in a couple days once we have some test time on it.

Flags: needinfo?(mvines)

Michael Vines [:m1] [:evilmachines]

Updated

•

11 years ago

Flags: needinfo?(bent.mozilla) → needinfo?(mvines)

Tracy Walker [:tracy]

Comment 84

•

11 years ago

topcrash is being replaced by more precise keywords per https://bugzilla.mozilla.org/show_bug.cgi?id=927557#c3

Keywords: topcrash → topcrash-b2g

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 85

•

11 years ago

Comment on attachment 818102 [details] [diff] [review]
Don't call ContentPermissionRequestParent::Send__delete__() if the managing TabParent is already Destroy()'d

Review of attachment 818102 [details] [diff] [review]:
-----------------------------------------------------------------

This looks great, thanks for digging in here!

::: dom/base/nsContentPermissionHelper.cpp
@@ +105,5 @@
>    if (mParent == nullptr) {
>      return NS_ERROR_FAILURE;
>    }
> 
> +  TabParent *tabParent = static_cast<TabParent*>(mParent->Manager());

Nit: * on the left here, and below.

Attachment #818102 - Flags: review?(bent.mozilla) → review+

Gary Kwong [:gkw] [:nth10sd] (NOT official MoCo now)

Reporter

Comment 86

•

11 years ago

Does this need to be backported to 1.1, or is it unnecessary?

Michael Vines [:m1] [:evilmachines]

Comment 87

•

11 years ago

(In reply to Michael Vines [:m1] [:evilmachines] from comment #83)
> Thanks, I've added this patch internally now and I'll report back in a
> couple days once we have some test time on it.

We haven't seen this crash reoccur yet with attachment 818560 [details] [diff] [review], so maybe fixed. Although we've seen this bug go quiet in the past for many days as well. :-/

Flags: needinfo?(mvines)

Cervantes Yu [:cyu] [:cervantes]

Assignee

Comment 88

•

11 years ago

(In reply to Gary Kwong [:gkw] [:nth10sd] (still catching up on bugmail) (PTO Oct 21-25) from comment #86)
> Does this need to be backported to 1.1, or is it unnecessary?

My guess is yes, but I am not 100% sure and need to have tests against it.

Cervantes Yu [:cyu] [:cervantes]

Assignee

Comment 89

•

11 years ago

Attached patch Don't call ContentPermissionRequestParent::Send__delete__() if the managing TabParent is already Destroy()'d. r=bent (obsolete) — Details — Splinter Review

Attachment #818102 - Attachment is obsolete: true

Attachment #820435 - Flags: review+

Cervantes Yu [:cyu] [:cervantes]

Assignee

Updated

•

11 years ago

Keywords: checkin-needed

Cervantes Yu [:cyu] [:cervantes]

Assignee

Comment 90

•

11 years ago

(In reply to Cervantes Yu from comment #88)
> (In reply to Gary Kwong [:gkw] [:nth10sd] (still catching up on bugmail)
> (PTO Oct 21-25) from comment #86)
> > Does this need to be backported to 1.1, or is it unnecessary?
> 
> My guess is yes, but I am not 100% sure and need to have tests against it.

I failed to test this on 1.1 on unagi because of camera issues. We might need QA's help if we want to figure this out.

Milan Sreckovic [:milan] (needinfo for best results)

Comment 91

•

11 years ago

(In reply to Cervantes Yu from comment #89)
> Created attachment 820435 [details] [diff] [review]
> Don't call ContentPermissionRequestParent::Send__delete__() if the managing
> TabParent is already Destroy()'d. r=bent

This is a bit obscure; perhaps a comment in both places, or even better, a short method that encapsulates this call and is named something like "in the process of being destroyed" to make things more readable? For somebody looking at the code for the first time (or three years from now), it would help. Right now, the most helpful comment is in TabParent.h.
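To make the suggestion concrete, here is a rough, standalone sketch of what such a named helper could look like. The classes TabParentSketch and PermissionRequestParentSketch, and the methods IsBeingDestroyed() and MaybeSendDelete(), are hypothetical stand-ins written for this comment; they are not the contents of any attachment on this bug.

```cpp
#include <cassert>

// Hypothetical stand-in for the managing tab actor.
class TabParentSketch {
 public:
  void Destroy() { mIsDestroyed = true; }
  bool IsDestroyed() const { return mIsDestroyed; }

 private:
  bool mIsDestroyed = false;
};

// Hypothetical stand-in for the permission request actor.
class PermissionRequestParentSketch {
 public:
  explicit PermissionRequestParentSketch(TabParentSketch* aManager)
      : mManager(aManager) {
    assert(aManager);
  }

  // Once the managing tab has been Destroy()'d, the child side may already
  // be gone, so nothing should be sent for this actor any more.
  bool IsBeingDestroyed() const { return mManager->IsDestroyed(); }

  // Returns true if the (simulated) message was sent.
  bool MaybeSendDelete() {
    if (IsBeingDestroyed()) {
      return false;
    }
    ++mSentCount;
    return true;
  }

  int SentCount() const { return mSentCount; }

 private:
  TabParentSketch* mManager;
  int mSentCount = 0;
};
```

With a predicate like this, each call site reads as "skip the send if the actor is being destroyed", and the explanation of why lives in one place instead of being repeated or omitted.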

Ed Morley [:emorley]

Comment 92

•

11 years ago

Please can you obsolete the old patches/rename to make it clearer what needs checking in and in what order? Thank you :-)

Keywords: checkin-needed

Cervantes Yu [:cyu] [:cervantes]

Assignee

Comment 93

•

11 years ago

Attached patch [Final] Don't send out messages of PContentPermissionRequest when the TabParent is being destroyed. r=bent — Details — Splinter Review

Refactored and added comments.

Attachment #817560 - Attachment is obsolete: true

Attachment #818560 - Attachment is obsolete: true

Attachment #820435 - Attachment is obsolete: true

Attachment #821144 - Flags: review+

Cervantes Yu [:cyu] [:cervantes]

Assignee

Updated

•

11 years ago

Attachment #807464 - Attachment description: bug879580.patch → [Test] bug879580.patch

Cervantes Yu [:cyu] [:cervantes]

Assignee

Updated

•

11 years ago

Keywords: checkin-needed

Ed Morley [:emorley]

Comment 94

•

11 years ago

I presume I'm not checking in "[Test] bug879580.patch" ? (There are 30+ checkin-neededs in the queue, there sadly isn't much time to read scrollback in the bug etc)

Cervantes Yu [:cyu] [:cervantes]

Assignee

Comment 95

•

11 years ago

(In reply to Ed Morley [:edmorley UTC+1] from comment #94)
> I presume I'm not checking in "[Test] bug879580.patch" ?
> 

No. It's test-only, and the bug number is different.

Ed Morley [:emorley]

Comment 96

•

11 years ago

(In reply to Cervantes Yu from comment #95)
> (In reply to Ed Morley [:edmorley UTC+1] from comment #94)
> > I presume I'm not checking in "[Test] bug879580.patch" ?
> > 
> No. It's test-only, and bug number is different.

Cool, thank you. (If there are multiple patches attached, best bet is to mark the others obsolete, change the attachment descriptions or else use the per-patch "checkin?" flag to avoid ambiguity :-))

Ed Morley [:emorley]

Comment 97

•

11 years ago

https://hg.mozilla.org/integration/b2g-inbound/rev/810cb9568dbf

Keywords: checkin-needed

Ed Morley [:emorley]

Comment 98

•

11 years ago

https://hg.mozilla.org/mozilla-central/rev/810cb9568dbf

Status: NEW → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

Ryan VanderMeulen [:RyanVM]

Comment 99

•

11 years ago

https://hg.mozilla.org/releases/mozilla-aurora/rev/f9931a182295

status-b2g-v1.2: --- → fixed

status-firefox25: --- → wontfix

status-firefox26: --- → fixed

status-firefox27: --- → fixed

Tapas[:tkundu on #b2g/gaia/memshrink/gfx] (always NI me)

Comment 100

•

11 years ago

Attached file ADBlogs.zip — Details

This issue happened again during monkey testing. Here is the adb logcat, dmesg etc

Tapas[:tkundu on #b2g/gaia/memshrink/gfx] (always NI me)

Updated

•

11 years ago

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Tapas[:tkundu on #b2g/gaia/memshrink/gfx] (always NI me)

Updated

•

11 years ago

blocking-b2g: koi+ → 1.3?

Tapas[:tkundu on #b2g/gaia/memshrink/gfx] (always NI me)

Comment 101

•

11 years ago

Attached file stack trace when it happened agian — Details

Contents of ".extra" file when it happened again in FFOS 1.3:

B2G_OS_Version=1.3.0.0-prerelease
Android_Device=msm8610
Android_Manufacturer=unknown
ProductName=B2G
Android_Board=MSM8610
Android_CPU_ABI=armeabi-v7a
Vendor=Mozilla
InstallTime=75973
Notes=GL Layers! EGL? EGL+ GL Context? GL Context+ GL Layers+
ReleaseChannel=default
Android_CPU_ABI2=armeabi
Version=28.0a2
Android_Brand=qcom
ServerURL=https://crash-reports.mozilla.com/submit?id={3c2e2abc-06d4-11e1-ac3b-374f68613e61}&version=28.0a2&buildid=20131225214657
Android_Hardware=qcom
useragent_locale=en-US
BuildID=20131225214657
ProductID={3c2e2abc-06d4-11e1-ac3b-374f68613e61}
Android_Version=18(REL)
Android_Model=msm8610
CrashTime=1388475440
StartupTime=1388474558
ProcessType=content
Notes=EGL? EGL+ GL Context? GL Context+ xpcom_runtime_abort([Child 1262] ###!!! ABORT: aborting because of MsgProcessingError: file /local/mnt/workspace/lnxbuild/project/release_dev_msm8610_2199240/checkout/gecko/dom/ipc/ContentChild.cpp, line 1140)
URL=app://gallery.gaiamobile.org/manifest.webapp

Jason Smith [:jsmith]

Comment 102

•

11 years ago

We've already got a different bug open on this - see bug 956325.

Status: REOPENED → RESOLVED

blocking-b2g: 1.3? → koi+

Closed: 11 years ago → 11 years ago

Resolution: --- → FIXED

adb output 12 years ago Gary Kwong [:gkw] [:nth10sd] (NOT official MoCo now) 3.83 KB, text/plain		Details
leo-callstack.log 11 years ago Changbin Park 47.20 KB, text/plain		Details
[Test] bug879580.patch 11 years ago Benjamin Smedberg 1.94 KB, patch		Details \| Diff \| Splinter Review
Gallery crash 11 years ago Michael Vines [:m1] [:evilmachines] 965.24 KB, application/x-gzip		Details
Gallery crash #2 11 years ago Michael Vines [:m1] [:evilmachines] 633.90 KB, application/x-gzip		Details
Clean up PContentPermissionRequest 11 years ago Ben Turner (not reading bugmail, use the needinfo flag!) 7.01 KB, patch		Details \| Diff \| Splinter Review
Clean up PContentPermissionRequest 11 years ago Ben Turner (not reading bugmail, use the needinfo flag!) 5.55 KB, patch		Details \| Diff \| Splinter Review
Clean up PContentPermissionRequest 11 years ago Ben Turner (not reading bugmail, use the needinfo flag!) 5.53 KB, patch		Details \| Diff \| Splinter Review
Don't call ContentPermissionRequestParent::Send__delete__() if the managing TabParent is already Destroy()'d 11 years ago Cervantes Yu [:cyu] [:cervantes] 2.83 KB, patch	bent.mozilla : review+	Details \| Diff \| Splinter Review
Patch rebased for v1.2 11 years ago Cervantes Yu [:cyu] [:cervantes] 2.61 KB, patch		Details \| Diff \| Splinter Review
Don't call ContentPermissionRequestParent::Send__delete__() if the managing TabParent is already Destroy()'d. r=bent 11 years ago Cervantes Yu [:cyu] [:cervantes] 2.83 KB, patch	cyu : review+	Details \| Diff \| Splinter Review
[Final] Don't send out messages of PContentPermissionRequest when the TabParent is being destroyed. r=bent 11 years ago Cervantes Yu [:cyu] [:cervantes] 4.25 KB, patch	cyu : review+	Details \| Diff \| Splinter Review
ADBlogs.zip 11 years ago Tapas[:tkundu on #b2g/gaia/memshrink/gfx] (always NI me) 576.65 KB, application/zip		Details
stack trace when it happened agian 11 years ago Tapas[:tkundu on #b2g/gaia/memshrink/gfx] (always NI me) 97.16 KB, text/plain		Details