Closed Bug 878366 Opened 7 years ago Closed 6 years ago

Crash on abort in mozilla::dom::ContentChild::ProcessingError

Categories

(Firefox OS Graveyard :: General, defect, critical)

ARM
Gonk (Firefox OS)
defect
Not set
critical

Tracking

(blocking-b2g:koi+, firefox25 wontfix, firefox26 fixed, firefox27 fixed, b2g18 affected, b2g-v1.2 fixed)

RESOLVED FIXED
blocking-b2g koi+
Tracking Status
firefox25 --- wontfix
firefox26 --- fixed
firefox27 --- fixed
b2g18 --- affected
b2g-v1.2 --- fixed

People

(Reporter: gkw, Assigned: cyu)

References

(Blocks 1 open bug)

Details

(Keywords: crash, topcrash-b2g, Whiteboard: [b2g-crash])

Crash Data

Attachments

(8 files, 6 obsolete files)

3.83 KB, text/plain
Details
47.20 KB, text/plain
Details
1.94 KB, patch
Details | Diff | Splinter Review
965.24 KB, application/x-gzip
Details
633.90 KB, application/x-gzip
Details
4.25 KB, patch
cyu
: review+
Details | Diff | Splinter Review
576.65 KB, application/zip
Details
97.16 KB, text/plain
Details
Attached file adb output
+++ This bug was initially created as a clone of Bug #867025 +++

I hit this too, using orangfuzz:

bp-2b5abeb3-7e9f-45f8-afa2-e75332130601

Unfortunately I don't have a reliable testcase.

James also mentioned that he could also make the phone crash again in bug 867025 comment 35 with the patch in that bug.

===

I'm on 2013-05-24 v1.1.0 git revision cc2fd02fd461aa12c96e02229a78293365d65264
Crash Signature: [@ mozalloc_abort | NS_DebugBreak_P | mozilla::dom::ContentChild::ProcessingError] [@ mozilla::dom::ContentChild::ProcessingError] → [@ mozalloc_abort | NS_DebugBreak_P | mozilla::dom::ContentChild::ProcessingError ] [@ mozilla::dom::ContentChild::ProcessingError ]
Interesting output:

I/Gecko   ( 9860):
I/Gecko   ( 9860): ###!!! [Child][AsyncChannel] Error: Route error: message sent to unknown actor ID
I/Gecko   ( 9860):
I/Gecko   ( 9860): [Child 9860] ###!!! ABORT: aborting because of fatal error: file ../../../gecko/dom/ipc/ContentChild.cpp, line 1009
E/Gecko   ( 9860): mozalloc_abort: [Child 9860] ###!!! ABORT: aborting because of fatal error: file ../../../gecko/dom/ipc/ContentChild.cpp, line 1009
Ben, can you take a look?
blocking-b2g: leo? → -
Flags: needinfo?(bent.mozilla)
We don't have any data here that is actionable. We need to know the message id (combination of protocol type and message type) in order to get anything here.
Flags: needinfo?(bent.mozilla)
Oh, sorry, the only way to get that presently is to break in the debugger.
For the case of the unagi testdriver phones, they are usually not connected to the debugger. Can we somehow improve data collection here some way such that we can get information out of similar crashes in the future?

(I fixed bug 879092 but I'm not sure if that will help in any way..)
You can try enabling IPC message logging (set IPC_MESSAGE_LOG_ENABLED=1 in environment) but that may be too noisy and/or slow.

Other than that we'd need to file an IPC bug to improve the logging when we crash.
> Other than that we'd need to file an IPC bug to improve the logging when we
> crash.

Filed bug 879580.
Crash Signature: [@ mozalloc_abort | NS_DebugBreak_P | mozilla::dom::ContentChild::ProcessingError ] [@ mozilla::dom::ContentChild::ProcessingError ] → [@ mozalloc_abort | NS_DebugBreak_P | mozilla::dom::ContentChild::ProcessingError ] [@ mozalloc_abort(char const*) | NS_DebugBreak | mozilla::dom::ContentChild::ProcessingError(mozilla::ipc::HasResultCodes::Result) ] [@ mozilla::dom::ContentChild::Proce…
It's #6 crasher for all B2G versions.
Severity: blocker → critical
Depends on: 879580
Keywords: topcrash
This occurs on leo too.
It seems that I wrongly added a signature for a normal crash to previous signatures for a crash on abort.
It's #6 top crasher in FxOS 1.0.1.
blocking-b2g: - → leo?
Crash Signature: [@ mozalloc_abort | NS_DebugBreak_P | mozilla::dom::ContentChild::ProcessingError ] [@ mozalloc_abort(char const*) | NS_DebugBreak | mozilla::dom::ContentChild::ProcessingError(mozilla::ipc::HasResultCodes::Result) ] [@ mozilla::dom::ContentChild::Proce… → [@ mozalloc_abort | NS_DebugBreak_P | mozilla::dom::ContentChild::ProcessingError ] [@ mozalloc_abort(char const*) | NS_DebugBreak | mozilla::dom::ContentChild::ProcessingError(mozilla::ipc::HasResultCodes::Result) ]
Summary: Crash [@ mozilla::dom::ContentChild::ProcessingError] → crash on abort in mozilla::dom::ContentChild::ProcessingError
(In reply to leo.bugzilla.gecko from comment #9)
> This occurs on leo too.

Can you elaborate?  Are you seeing the signature?  Are you reproducing a crash?  We can't block this without actionable information to move on - being a topcrash isn't enough to block.
blocking-b2g: leo? → -
Flags: needinfo?(jaeohkim83)
Attached file leo-callstack.log
I'm attaching crash dump information from leo device.
Flags: needinfo?(jaeohkim83)
(In reply to Changbin Park from comment #12)
> Created attachment 789252 [details]
> leo-callstack.log
> 
> I'm attaching crash dump information from leo device.

How did you reproduce the crash that caused this dump?
It occured by our QA team.
He said, he tested bunch of message things.
Indeed, he didn't realize when the crash occured, but after that a different kind of problem occurs on the taget. So, it passed to me and I found the crash occured on it while he was testing.
It's not that detail I know, sorry..
This issue is reporduced one more time.
The mobile is left in camera view(Message - file attach - Camera)
The crash is occured after few minute. It has same call stack with this bug.
But it is not reproduced unfortunately.
:Gary, are you still able to reproduce this issue on 1.1 or master ?
(In reply to ben turner [:bent] (needinfo? encouraged) from comment #6)
> You can try enabling IPC message logging (set IPC_MESSAGE_LOG_ENABLED=1 in
> environment) but that may be too noisy and/or slow.
> 
> Other than that we'd need to file an IPC bug to improve the logging when we
> crash.

Hey Ben, this has been filed, who is the right owner here https://bugzilla.mozilla.org/show_bug.cgi?id=879580 ?
Note that this is the top 1.0.1 crash that seems to be in our code and is not fixed for newer versions.

It's #3 over the last week of data, see https://crash-analysis.mozilla.com/rkaiser/2013-09-05/2013-09-05.b2g.topcrashes.weekly.html#b2g-1.0.1.0-prerelease and the ones above or near it are either not our code or are fixed in 1.1.
(In reply to bhavana bajaj [:bajaj] from comment #17)
> Hey Ben, this has been filed, who is the right owner here

Not sure, anyone who has done IPC work I guess. It's not prioritized at the moment so I don't think anyone has even looked at it.
(In reply to ben turner [:bent] (needinfo? encouraged) from comment #19)
> (In reply to bhavana bajaj [:bajaj] from comment #17)
> > Hey Ben, this has been filed, who is the right owner here
> 
> Not sure, anyone who has done IPC work I guess. It's not prioritized at the
> moment so I don't think anyone has even looked at it.

I'll nominate bug 879580 to get it prioritized, hopefully.
(In reply to bhavana bajaj [:bajaj] from comment #16)
> :Gary, are you still able to reproduce this issue on 1.1 or master ?

I've not tested on 1.1 / master, in any case won't be able to anytime soon, unfortunately.
mvines - I see this has been added to the 1.2 CS blocker list, but we don't have STR to action this bug (which has been the main reason we haven't actioned this bug from past releases). Did your team come across a way to reproduce this top crash?
Flags: needinfo?(mvines)
Yep, run orangutan for ~8 hours.  We caught crash twice overnight.
Flags: needinfo?(mvines)
Noming cause this blocks the CS 1.2 blocker.

Andrew - Can you find someone to look into this?
blocking-b2g: - → koi?
Flags: needinfo?(overholt)
I spoke about this today with bsmedberg.  He's going to try to work on (or find someone to work on) bug 879580.  Once we've got that improved logging it will be easier to find this race.
Flags: needinfo?(overholt)
Great.  We can run an orangutan or two on a build with some logging patches/etc pretty easily too. LMK
I spent a while looking at the debugging possibilities here. The only good option is to run the tests with MOZ_IPC_MESSAGE_LOG on, so that we can see what message the parent process thinks its sending (probably to a dead actor, perhaps because of a race sending a message in both directions).

Does the stdout of b2g go to an attached debug device, or does it always stay on the phone? IPC logs can get big. I guess we're running these tests with non-debug builds, right?
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #27)
> The only good
> option is to run the tests with MOZ_IPC_MESSAGE_LOG on

We can do that here, LMK exactly what you need.

> Does the stdout of b2g go to an attached debug device, or does it always
> stay on the phone?

/dev/null right now, but I've been meaning to redirect stdout to logcat though as that's more helpful.  We can do that. 

> I guess we're running these tests with non-debug builds, right?

-eng builds right now, but without any additional Gecko debug settings enabled.
A normal -eng build run with MOZ_IPC_MESSAGE_LOG=1 in the environment should dump a bunch of spew to stdout/stderr about messages being sent and received. If we can get that through logcat for one of these test runs, that would help tremendously.
Flags: needinfo?(mvines)
Debugging patch which should produce *some* output on stderr, even when MOZ_IPC_MESSAGE_LOG is not set. A log with MOZ_IPC_MESSAGE_LOG would still be better.
Looks like --enable-debug is needed for MOZ_IPC_MESSAGE_LOG=1 to work, which we don't enabled by default with -eng.  Looks like s/false/true/ at http://dxr.mozilla.org/mozilla-central/source/ipc/glue/ProtocolUtils.h?from=LoggingEnabled#l112 will do the trick instead?   With that change I see a ton of stderr output like:

[time:316253345887993][721][PContentChild] Sending Msg_AsyncMessage([TODO])
[time:316253345888908][653][PContentParent] Received Msg_AsyncMessage([TODO])
[time:316253346227936][721][PLayerTransactionChild] Sending Msg_Update([TODO])
[time:316253346228803][653][PLayerTransactionParent] Received Msg_Update([TODO])
[time:316253346229286][653][PLayerTransactionParent] Sending reply Reply_Update([TODO])
[time:316253346232066][721][PLayerTransactionChild] Received reply Reply_Update([TODO])
[time:316253346232966][721][PLayerChild] Sending Msg___delete__([TODO])
[time:316253346233616][721][PLayerChild] Sending Msg___delete__([TODO])

Is this the kind of logging that'll help?
That's the stuff you're looking for!
Summary: crash on abort in mozilla::dom::ContentChild::ProcessingError → Crash on abort in mozilla::dom::ContentChild::ProcessingError
+ed it for crash issue
blocking-b2g: koi? → koi+
(The monkeys are still trying to catch this w/ IPC logging.  Hopefully they'll get lucky over the weekend.)
Flags: needinfo?(mvines)
Flags: needinfo?(mvines)
Attached file Gallery crash
Here's one example of this crash with IPC logging enabled.  The crashing app was gallery.  I've attached all Gecko logcat output for the Gallery and b2g processes leading up to the crash.

APPLICATION       OOM_ADJ  OOM_SCORE  OOM_SCORE_ADJ  USER     PID   PPID  VSIZE  RSS     WCHAN    PC         NAME
Gallery              1        138         67         app_2532  2532  29689 91184  29248 ffffffff 407dfabc S /system/b2g/plugin-container
b2g                  0        178         0          root      29689 1     277292 85316 ffffffff 400b0604 S /system/b2g/b2g
(background apps omitted)
Flags: needinfo?(mvines)
Attached file Gallery crash #2
Another log of this crash, again in the Gallery.  b2g pid is 843, Gallery pid is 20847.
Crash was observed again overnight, and again in the Gallery app
Here's the important part from the first log:
>I/Gecko   (29689): [time:1379917243883118][29689][PContentParent] Sending Msg_AsyncMessage([TODO])                                                                              >I/Gecko   (2532):
>I/Gecko   ( 2532): ###!!! [Child][AsyncChannel] Error: Route error: message sent to unknown actor ID                                                                                                    >I/Gecko   (2532):
>I/Gecko   ( 2532): [Child 2532] ###!!! ABORT: aborting because of MsgRouteError: file /local/mnt/buildbot/v1.2_msm7627a/build/gecko/dom/ipc/ContentChild.cpp, line 1064
This log doesn't look all that useful without actor IDs, unfortunately :(
(In reply to Josh Matthews [:jdm] from comment #39)
> This log doesn't look all that useful without actor IDs, unfortunately :(

If you have a patch that outputs what you need to see I can easily add that temporarily here so we can get that data the next time the crash occurs.
Crash occurred last night with attachment 807464 [details] [diff] [review] applied ("DumpMessage RouteError for message: .type=1310722 .size=40").  This time it was a camera app crash instead of the gallery app.

I/Gecko   (  852): [time:1380104599666268][852][PGrallocBufferChild] Received Msg___delete__([TODO])
I/Gecko   (  852): [time:1380104599666447][852][PGrallocBufferChild] Received Msg___delete__([TODO])
I/Gecko   (  852): [time:1380104599670447][852][PBrowserParent] Received Msg_PContentPermissionRequestConstructor([TODO])
I/Gecko   (  852): [time:1380104599671093][852][PContentPermissionRequestParent] Received Msg_prompt([TODO])
I/Gecko   (  852): [time:1380104599692020][852][PContentPermissionRequestParent] Sending Msg___delete__([TODO])
I/Gecko   ( 9942): DumpMessage RouteError for message: .type=1310722 .size=40
I/Gecko   ( 9942): 
I/Gecko   ( 9942): ###!!! [Child][AsyncChannel] Error: Route error: message sent to unknown actor ID
I/Gecko   ( 9942): 
I/Gecko   ( 9942): [Child 9942] ###!!! ABORT: aborting because of MsgRouteError: file /local/mnt/buildbot/v1.2/build/gecko/dom/ipc/ContentChild.cpp, line 1064
I/Gecko   ( 9942): [time:1380104599692787][9942][PCrashReporterChild] Sending Msg_AppendAppNotes([TODO])
E/Gecko   ( 9942): mozalloc_abort: [Child 9942] ###!!! ABORT: aborting because of MsgRouteError: file /local/mnt/buildbot/v1.2/build/gecko/dom/ipc/ContentChild.cpp, line 1064
Benoit, related to the crash you're looking at?  I know it's hard to tell from the info here, but just wondering.
Flags: needinfo?(bjacob)
Depends on what you mean by 'related'. The bug I'm looking at, bug 914823, has different symptoms (see call stack in bug 914823 comment 28) and a specific cause (we are referencing an ISurfaceAllocator that already died, we couldn't know it because it's not reference-counted, it can't be reference-counted because IPDL actors aren't reference-counted) that seems to be different from what is being discussed above here.

But, on a more meta level, the present bug, bug 914823, and generally half of the b2g crashes I've been dealing with, are 'related' in that they share a common aspect: we crash as we have dangling raw pointers to already-dead IPDL actors, and we can't easily fix that because IPDL actors aren't refcounted, and as long as they're not refcounted, the only way we can write non-crashing code is by having a mental model of the entire B2G IPC-facing codebase, and well, I don't know that anyone anymore has such a mental model.
Flags: needinfo?(bjacob)
(In reply to Benoit Jacob [:bjacob] from comment #43)
> But, on a more meta level, the present bug, bug 914823, and generally half
> of the b2g crashes I've been dealing with, are 'related' in that they share
> a common aspect: we crash as we have dangling raw pointers to already-dead
> IPDL actors, and we can't easily fix that because IPDL actors aren't
> refcounted, and as long as they're not refcounted, the only way we can write
> non-crashing code is by having a mental model of the entire B2G IPC-facing
> codebase, and well, I don't know that anyone anymore has such a mental model.

This piqued my interest so I asked Ben and he said IPDL actors can be refcounted.  He opined that perhaps you were thinking of gralloc-associated actors?
This was a new discovery.  From IRC:
16:17 bjacob: newsflash: we _can_ have refcounted IPDL actors, in fact netwerk/ already has a few, and it's even discussed here: https://developer.mozilla.org/en-US/docs/IPDL/Best_Practices  !!
16:18 bjacob: (thanks to jdm) !
16:44 bjacob: milan: yes, jdm pointed me to netwerk/ipc/Necko{Child,Parent}.*
16:45 bjacob: milan: the other problem is the SurfaceDescriptor IPDL union, which compiles in C++ to raw pointers. There too, jdm suggested a realistic fix: mirror it in C++ by a RefcountingSurfaceDescriptor class that would do the refcounting (and have a constructor taking an IPDL SurfaceDescriptor)
Yep, IPDL actors can be refcounted, but none of the graphics-facing ones are, and until two hours ago, none of use gfx people around here in Toronto knew about that possibility! Many thanks to :jdm for educating us.

Switching our IPDL actors to refcounting is a fairly nontrivial code change though, so it might be too big a change for B2G 1.2 at this point. Let's do it on mozilla-central for B2G v1.3, but for B2G 1.2 bugs, for now, let's still try to fix them without switching non-refcounted classes to refcounting...
If we try to fix this without ref counting, even assuming that we properly identify all the places where we crash, how will we be confident that the fix will actually work? We have been dealing with this class of bugs for about 9 month now, and every time you whack a mole, a new one pops up.
Assignee: nobody → bjacob
Benoit and Sotaro are connecting with Ben Turner to see how large and scary this change would be.
This was flagged as a high priority bug by the chipset vendor. How are we doing here?
I've been busy until yesterday with 1) bug 914823, which is another occurence of this kind of issue, and 2) trying to understand what would generally be the right approach to fixing this kind of issue.

Now I think that the approach we're taking in bug 914823, which is to make the actor SupportsWeakPtr, is a good one:
 - it is unintrusive, conservative enough to land on aurora;
 - it actually makes sense as 'the right solution' as it matches the reality that from the point of view of DOM elements, the whole IPC system can disappear at any time. We can have a given actor outlive the IPC system by holding a strong reference to it, and that can be useful to avoid crashing, but we can't prevent the IPC system from going down. For example, if the IPC system runs out of file descriptors, there is nothing we can do at the moment to do IPC again, IIUC.

I filed bug 923530 as a tracking bug for this kind of issues; blocking it.

I'll now try to fix the present bug...
Blocks: 923530
So to be clear, I do think that reference-counting actors is going to be a part of the solution. If only because there is no way to do thread-safe weak references without being able to convert a weak reference to a strong reference before dereferencing it. But just reference-counting along, as I thought before would be the solution, isn't a complete solution by itself, because we can't always prevent the IPC system from going down earlier than expected (IPC errors can happen at any time). So we still have to react to unexpected actor death, and using weak references seem like our best way to do so at the moment.
Does anyone have STR that I can use while having USB debugging? The only definite STR that I can see here is:

(In reply to jongsoo.oh from comment #15)
> This issue is reporduced one more time.
> The mobile is left in camera view(Message - file attach - Camera)
> The crash is occured after few minute. It has same call stack with this bug.
> But it is not reproduced unfortunately.

However, when I try this with USB debugging plugged, I get a modal message:

"Camera can not be used while plugged in - Unplug the phone to view pictures"

Can I work around it?
Flags: needinfo?(zzongsoo)
Flags: needinfo?(mvines)
Flags: needinfo?(gary)
(In reply to Michael Vines [:m1] [:evilmachines] from comment #36)
> Created attachment 808744 [details]
> Gallery crash #2
> 
> Another log of this crash, again in the Gallery.  b2g pid is 843, Gallery
> pid is 20847.

From this log:

[20847][PBrowserChild] Sending Msg_PContentPermissionRequestConstructor([TODO])
[20847][PContentPermissionRequestChild] Sending Msg_prompt([TODO])
[20847][PBrowserChild] Received Msg_RealTouchEvent([TODO])
[20847][PLayerTransactionChild] Sending Msg_Update([TODO])
[843][PLayerTransactionParent] Received Msg_Update([TODO])
[843][PLayerTransactionParent] Sending reply Reply_Update([TODO])
[20847][PLayerTransactionChild] Received reply Reply_Update([TODO])
[843][PContentParent] Received Msg_AsyncMessage([TODO])
[843][PContentParent] Sending Msg_AsyncMessage([TODO])
[843][PBrowserParent] Sending Msg_Deactivate([TODO])
[20847][PBrowserChild] Received Msg_Deactivate([TODO])
[843][PBrowserParent] Sending Msg_Activate([TODO])
[843][PBrowserParent] Sending Msg_AsyncMessage([TODO])
[843][PBrowserParent] Sending Msg_Destroy([TODO])
[20847][PBrowserChild] Received Msg_Destroy([TODO])
[20847][PContentChild] Sending Msg_AudioChannelChangeDefVolChannel([TODO])
[20847][PIndexedDBDatabaseChild] Sending Msg_Close([TODO])
[20847][PCompositableChild] Sending Msg___delete__([TODO])
[843][PCompositableParent] Received Msg___delete__([TODO])
[20847][PLayerChild] Sending Msg___delete__([TODO])
[843][PLayerParent] Received Msg___delete__([TODO])
[843][PGrallocBufferParent] Sending Msg___delete__([TODO])
[843][PGrallocBufferParent] Sending Msg___delete__([TODO])
[20847][PLayerChild] Sending Msg___delete__([TODO])
[843][PLayerParent] Received Msg___delete__([TODO])
[20847][PLayerChild] Sending Msg___delete__([TODO])
[843][PLayerParent] Received Msg___delete__([TODO])
[20847][PRenderFrameChild] Sending Msg___delete__([TODO])
[20847][PBrowserChild] Sending Msg___delete__([TODO])
[20847][PContentChild] Sending Msg_AsyncMessage([TODO])
[20847][PContentChild] Sending Msg_AsyncMessage([TODO])
[20847][PGrallocBufferChild] Received Msg___delete__([TODO])
[20847][PGrallocBufferChild] Received Msg___delete__([TODO])
[843][PLayerTransactionChild] Sending Msg_PLayerConstructor([TODO])
[843][PLayerTransactionParent] Received Msg_PLayerConstructor([TODO])
[843][PLayerTransactionChild] Sending Msg_Update([TODO])
[843][PLayerTransactionParent] Received Msg_Update([TODO])
[843][PLayerTransactionParent] Sending reply Reply_Update([TODO])
[843][PLayerTransactionChild] Received reply Reply_Update([TODO])
[843][PLayerChild] Sending Msg___delete__([TODO])
[843][PLayerParent] Received Msg___delete__([TODO])
[843][PBrowserParent] Received Msg_PContentPermissionRequestConstructor([TODO])
[843][PContentPermissionRequestParent] Received Msg_prompt([TODO])
[843][PContentPermissionRequestParent] Sending Msg___delete__([TODO])
I/Gecko   (20847): 
I/Gecko   (20847): ###!!! [Child][AsyncChannel] Error: Route error: message sent to unknown actor ID
I/Gecko   (20847): 


Here is one interpretation of this log, please tell me whether it makes sense: the Layers / Gfx stuff here is actually a red herring, the problem is we're apparently doing a Send___delete__ on a bad PContentPermissionRequestParent.

Does that make sense?
Flags: needinfo?(bent.mozilla)
...rather, the PContentPermissionRequestParent is fine (it just received a message) but the matching PContentPermissionRequest***Child*** is already dead. Does _that_ make sense?
Asking Doug who seems to have been maintaining code around PContentPermissionRequestChild.
(In reply to Benoit Jacob [:bjacob] from comment #52)
 > However, when I try this with USB debugging plugged, I get a modal message:
> 
> "Camera can not be used while plugged in - Unplug the phone to view pictures"
> 
> Can I work around it?

Sounds like you may have Settings -> Enable USB storage enabled.  Try disabling.
Flags: needinfo?(mvines)
I also think attachment 808694 [details] and attachment 808744 [details] seems not related to gfx. From the log. The crash of attachment 808694 [details] seems triggered by calling PContentParent::SendAsyncMessage(). The crash of attachmeet 808744 seems triggered by cakkubg PContentPermissionRequestParent::Send___delete__().

Both seem to try to send to already disconnected/deleted ipc object.
Comment 57 reminds me Bug 867025. Incorrect ipc handling problems are not the only problem around gfx, but also could happen in any other ipc in gecko. In current gecko's ipc architecutre/implementation, it is very very difficult to correctly use it in all situation. Such kind of rare crash could happen easily in b2g.

- Bug 867025 - [unagi][tara][weekly build 13.04.17]monkey test crash in mozilla::dom::ContentChild::ProcessingError
(In reply to Michael Vines [:m1] [:evilmachines] from comment #56)
> (In reply to Benoit Jacob [:bjacob] from comment #52)
>  > However, when I try this with USB debugging plugged, I get a modal
> message:
> > 
> > "Camera can not be used while plugged in - Unplug the phone to view pictures"
> > 
> > Can I work around it?
> 
> Sounds like you may have Settings -> Enable USB storage enabled.  Try
> disabling.

Thanks, that fixed this issue.

Now I am running into another problem: as I follow the STR, which involves switching the the Communications app to the background, I quickly get the Communications killed by the background-process-killer, before I get a chance to crash it. Can I disable the background-processes killer or have Communications be spared by it?
I tried writing 0 to /proc/PID/oom_score_adj and oom_adj, and the value was apparently remembered, but Communications still got killed.
Michael, the OOM killer issue reported in comment 59 -- 60 is preventing me from reproducing. What would be current steps-to-reproduce on Hamachi?
Flags: needinfo?(mvines)
(In reply to Benoit Jacob [:bjacob] from comment #52)
> Does anyone have STR that I can use while having USB debugging? The only
> definite STR that I can see here is:
> 
> (In reply to jongsoo.oh from comment #15)
> > This issue is reporduced one more time.
> > The mobile is left in camera view(Message - file attach - Camera)
> > The crash is occured after few minute. It has same call stack with this bug.
> > But it is not reproduced unfortunately.
> 
> However, when I try this with USB debugging plugged, I get a modal message:
> 
> "Camera can not be used while plugged in - Unplug the phone to view pictures"
> 
> Can I work around it?

After disable USB storage, you can use the camera with Remote debugging.
Flags: needinfo?(zzongsoo)
(In reply to Benoit Jacob [:bjacob] from comment #61)
> Michael, the OOM killer issue reported in comment 59 -- 60 is preventing me
> from reproducing. What would be current steps-to-reproduce on Hamachi?

you could probably tweak the LMK parameters at http://dxr.mozilla.org/mozilla-central/source/b2g/app/b2g.js?from=b2g.js#l611 to avoid getting killed.

However, looking at the crash database here I see that we were reliably reproducing this issue in overnight orangutan testing from Sept 22nd though Oct 2nd, but have not observed this crash since.  Mystery fix from some other patch?  But until we begin to see this crash again with the same frequency it's not a blocker for us anymore.
No longer blocks: 916996
Flags: needinfo?(mvines)
Flags: needinfo?(gary)
Milan, given that this seems 1) already resolved (comment 63) and 2) not graphics-related (comments 53, 57, 58), I suggest un-assigning me.
Flags: needinfo?(bent.mozilla) → needinfo?(milan)
So we did see this again last night but just once, I'll update here if this start to become a regular occurrence again
Assignee: bjacob → nobody
Flags: needinfo?(milan)
(Sadly this crash has returned on v1.2, we've been seeing it pretty regularly since Oct 09.  LMK how I can help)
Blocks: 916996
Milan we need to get on top of this. Until we have a new owner, you remain the owner. Feel free to ask for help from the content team.
Assignee: nobody → milan
Overholt has asked Cervantes to take this on.
Assignee: milan → cyu
bent and I reviewed the prompt proxy and couldn't see anything obvious from this protocol and implementation.

cyu, please reach out if you have any questions or get stuck.
I think I may have found the cause here. Can someone please apply this patch and throw some monkeys at it?
Oops, the right patch now.
Attachment #817549 - Attachment is obsolete: true
Really final patch.
Attachment #817554 - Attachment is obsolete: true
I just caught the same crash in the camera app. The message in question is PContentPermissionRequestMsgStart << 16 | Msg___delete____ID. I'll test the patch to see if the crash is fixed.
Our monkeys are v1.2 only at the moment.  :bent, if you can rebase the patch on aurora then I'll put them to work overnight on it.
Flags: needinfo?(bent.mozilla)
I applied the patch and made some tests. The camera app still has the same crash.
This should fix the crash. In my tests this fixes the crash. :m1, could you please set up the test to verify it?
Attachment #818102 - Flags: review?(bent.mozilla)
Flags: needinfo?(mvines)
Comment #53 already tells us the story. Here is the analysis:

[20847][PBrowserChild] Sending Msg_PContentPermissionRequestConstructor([TODO])
[20847][PContentPermissionRequestChild] Sending Msg_prompt([TODO])
* Content permission prompt requested

[843][PBrowserParent] Sending Msg_Destroy([TODO])
* TabParent::Destroy() is called, from this point on we should not send messages of PBrowser or protocols managed by it

[20847][PBrowserChild] Received Msg_Destroy([TODO])
* The destroy message is received. From this point on the protocol tree on the child side is destroyed

* Many managed protocols being deleted

[20847][PBrowserChild] Sending Msg___delete__([TODO])
* At the end of TabChild::RecvDestroy() PBrowserChild sends out Msg___delete__

[843][PBrowserParent] Received Msg_PContentPermissionRequestConstructor([TODO])
[843][PContentPermissionRequestParent] Received Msg_prompt([TODO])
* The prompt is request is finally received. But since PBrowser is already Destroy()'d we should not send back any messages

[843][PContentPermissionRequestParent] Sending Msg___delete__([TODO])
* This is the message that cannot be routed and causes crash!!
* Since we haven't received Msg___delete__ from PBrowserChild, the actors on the parent side are still alive. That's why PContentPermissionRequestParent still can send out messages.

I/Gecko   (20847): 
I/Gecko   (20847): ###!!! [Child][AsyncChannel] Error: Route error: message sent to unknown actor ID
I/Gecko   (20847):
And my STR:
* Open the camera app (or the camera process is killed when we leave the inline activity).
* Open the message app and then add attachment from the camera.
* Since this is a monkey test, let's act like a monkey! Start quickly tapping even before the camera preview is shown.
(not sure if this is related to the crash, but this will bring up the camera app UI, not the inline activity UI).
* Take pictures (also quickly tapping the camera button).
* And quickly switch between camera and gallery
* If this doesn't crash go back to the message app, or change homescreen background from the camera.
cyu, do you know why ContentPermissionRequestParent::ActorDestroy isn't being called?
(In reply to Doug Turner (:dougt) from comment #79)
> cyu, do you know why ContentPermissionRequestParent::ActorDestroy isn't
> being called?

It's not yet and we already crashed. It should be called after PBrowserParent receives Msg___delete__. PBrowserChild sends out the message, but it's still in parent's queue.
(In reply to Cervantes Yu from comment #76)
> :m1, could you please set up the test to verify it?

The patch doesn't apply to v1.2, can you please rebase it.
Flags: needinfo?(mvines)
Attached patch Patch rebased for v1.2 (obsolete) — Splinter Review
Rebased to v1.2. This should apply. Please test using this one.
Flags: needinfo?(mvines)
Thanks, I've added this patch internally now and I'll report back in a couple days once we have some test time on it.
Flags: needinfo?(mvines)
Flags: needinfo?(bent.mozilla) → needinfo?(mvines)
topcrash is being replaced by more precise keywords per https://bugzilla.mozilla.org/show_bug.cgi?id=927557#c3
Keywords: topcrashtopcrash-b2g
Comment on attachment 818102 [details] [diff] [review]
Don't call ContentPermissionRequestParent::Send__delete__() if the managing TabParent is already Destroy()'d

Review of attachment 818102 [details] [diff] [review]:
-----------------------------------------------------------------

This looks great, thanks for digging in here!

::: dom/base/nsContentPermissionHelper.cpp
@@ +105,5 @@
>    if (mParent == nullptr) {
>      return NS_ERROR_FAILURE;
>    }
>  
> +  TabParent *tabParent = static_cast<TabParent*>(mParent->Manager());

Nit: * on the left here, and below.
Attachment #818102 - Flags: review?(bent.mozilla) → review+
Does this need to be backported to 1.1, or is it unnecessary?
(In reply to Michael Vines [:m1] [:evilmachines] from comment #83)
> Thanks, I've added this patch internally now and I'll report back in a
> couple days once we have some test time on it.

We haven't seen this crash reoccur yet with attachment 818560 [details] [diff] [review], so maybe fixed.  Although we've seen this bug go quiet in the past for many days as well. :-/
Flags: needinfo?(mvines)
(In reply to Gary Kwong [:gkw] [:nth10sd] (still catching up on bugmail) (PTO Oct 21-25) from comment #86)
> Does this need to be backported to 1.1, or is it unnecessary?

My guess is yes, but I am not 100% sure and need to have tests against it.
Keywords: checkin-needed
(In reply to Cervantes Yu from comment #88)
> (In reply to Gary Kwong [:gkw] [:nth10sd] (still catching up on bugmail)
> (PTO Oct 21-25) from comment #86)
> > Does this need to be backported to 1.1, or is it unnecessary?
> 
> My guess is yes, but I am not 100% sure and need to have tests against it.

I failed to test this on 1.1 on unagi because of camera issues. We might need QA's help if we want to figure this out.
(In reply to Cervantes Yu from comment #89)
> Created attachment 820435 [details] [diff] [review]
> Don't call ContentPermissionRequestParent::Send__delete__() if the managing
> TabParent is already Destroy()'d. r=bent

This is a bit obscure; perhaps a comment in both places, or even better, a short method that encapsulates this call and is named something like "in the process of being destroyed" to make things more readable?  For somebody looking at the code for the first time (or three years from now), it would help.  Right now, the most helpful comment is in TabParent.h.
Please can you obsolete the old patches/rename to make it clearer what needs checking in and in what order? Thank you :-)
Keywords: checkin-needed
Refactoring and add comments.
Attachment #817560 - Attachment is obsolete: true
Attachment #818560 - Attachment is obsolete: true
Attachment #820435 - Attachment is obsolete: true
Attachment #821144 - Flags: review+
Attachment #807464 - Attachment description: bug879580.patch → [Test] bug879580.patch
Keywords: checkin-needed
I presume I'm not checking in "[Test] bug879580.patch" ?

(There are 30+ checkin-neededs in the queue, there sadly isn't much time to read scrollback in the bug etc)
(In reply to Ed Morley [:edmorley UTC+1] from comment #94)
> I presume I'm not checking in "[Test] bug879580.patch" ?
> 
No. It's test-only, and bug number is different.
(In reply to Cervantes Yu from comment #95)
> (In reply to Ed Morley [:edmorley UTC+1] from comment #94)
> > I presume I'm not checking in "[Test] bug879580.patch" ?
> > 
> No. It's test-only, and bug number is different.

Cool, thank you. (If there are multiple patches attached, best bet is to mark the others obsolete, change the attachment descriptions or else use the per-patch "checkin?" flag to avoid ambiguity :-))
https://hg.mozilla.org/mozilla-central/rev/810cb9568dbf
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Attached file ADBlogs.zip
This issue happened again during monkey testing. Here is the adb logcat, dmesg etc
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Blocks: 942267
blocking-b2g: koi+ → 1.3?
Contents of ".extra " file when it happened again in FFOS 1.3:

B2G_OS_Version=1.3.0.0-prerelease
Android_Device=msm8610
Android_Manufacturer=unknown
ProductName=B2G
Android_Board=MSM8610
Android_CPU_ABI=armeabi-v7a
Vendor=Mozilla
InstallTime=75973
Notes=GL Layers! EGL? EGL+ GL Context? GL Context+ GL Layers+ 
ReleaseChannel=default
Android_CPU_ABI2=armeabi
Version=28.0a2
Android_Brand=qcom
ServerURL=https://crash-reports.mozilla.com/submit?id={3c2e2abc-06d4-11e1-ac3b-374f68613e61}&version=28.0a2&buildid=20131225214657
Android_Hardware=qcom
useragent_locale=en-US
BuildID=20131225214657
ProductID={3c2e2abc-06d4-11e1-ac3b-374f68613e61}
Android_Version=18(REL)
Android_Model=msm8610
CrashTime=1388475440
StartupTime=1388474558
ProcessType=content
Notes=EGL? EGL+ GL Context? GL Context+ xpcom_runtime_abort([Child 1262] ###!!! ABORT: aborting because of MsgProcessingError: file /local/mnt/workspace/lnxbuild/project/release_dev_msm8610_2199240/checkout/gecko/dom/ipc/ContentChild.cpp, line 1140)
URL=app://gallery.gaiamobile.org/manifest.webapp
We've already got a different bug open on this - see bug 956325.
Status: REOPENED → RESOLVED
blocking-b2g: 1.3? → koi+
Closed: 6 years ago6 years ago
Resolution: --- → FIXED
No longer blocks: 942267
You need to log in before you can comment on or make changes to this bug.