Closed
Bug 878366
Opened 12 years ago
Closed 11 years ago
Crash on abort in mozilla::dom::ContentChild::ProcessingError
Categories
(Firefox OS Graveyard :: General, defect)
Tracking
(blocking-b2g:koi+, firefox25 wontfix, firefox26 fixed, firefox27 fixed, b2g18 affected, b2g-v1.2 fixed)
RESOLVED
FIXED
blocking-b2g | koi+ |
People
(Reporter: gkw, Assigned: cyu)
References
(Blocks 1 open bug)
Details
(Keywords: crash, topcrash-b2g, Whiteboard: [b2g-crash])
Crash Data
Attachments
(8 files, 6 obsolete files)
3.83 KB,
text/plain
|
Details | |
47.20 KB,
text/plain
|
Details | |
1.94 KB,
patch
|
Details | Diff | Splinter Review | |
965.24 KB,
application/x-gzip
|
Details | |
633.90 KB,
application/x-gzip
|
Details | |
4.25 KB,
patch
|
cyu
:
review+
|
Details | Diff | Splinter Review |
576.65 KB,
application/zip
|
Details | |
97.16 KB,
text/plain
|
Details |
+++ This bug was initially created as a clone of Bug #867025 +++
I hit this too, using orangfuzz:
bp-2b5abeb3-7e9f-45f8-afa2-e75332130601
Unfortunately I don't have a reliable testcase.
James also mentioned that he could also make the phone crash again in bug 867025 comment 35 with the patch in that bug.
===
I'm on 2013-05-24 v1.1.0 git revision cc2fd02fd461aa12c96e02229a78293365d65264
Updated•12 years ago
|
Crash Signature: [@ mozalloc_abort | NS_DebugBreak_P | mozilla::dom::ContentChild::ProcessingError]
[@ mozilla::dom::ContentChild::ProcessingError] → [@ mozalloc_abort | NS_DebugBreak_P | mozilla::dom::ContentChild::ProcessingError ]
[@ mozilla::dom::ContentChild::ProcessingError ]
![]() |
Reporter | |
Comment 1•12 years ago
|
||
Interesting output:
I/Gecko ( 9860):
I/Gecko ( 9860): ###!!! [Child][AsyncChannel] Error: Route error: message sent to unknown actor ID
I/Gecko ( 9860):
I/Gecko ( 9860): [Child 9860] ###!!! ABORT: aborting because of fatal error: file ../../../gecko/dom/ipc/ContentChild.cpp, line 1009
E/Gecko ( 9860): mozalloc_abort: [Child 9860] ###!!! ABORT: aborting because of fatal error: file ../../../gecko/dom/ipc/ContentChild.cpp, line 1009
Comment 2•12 years ago
|
||
Ben, can you take a look?
blocking-b2g: leo? → -
Flags: needinfo?(bent.mozilla)
We don't have any data here that is actionable. We need to know the message id (combination of protocol type and message type) in order to get anything here.
Flags: needinfo?(bent.mozilla)
Oh, sorry, the only way to get that presently is to break in the debugger.
![]() |
Reporter | |
Comment 5•12 years ago
|
||
For the case of the unagi testdriver phones, they are usually not connected to the debugger. Can we somehow improve data collection here some way such that we can get information out of similar crashes in the future?
(I fixed bug 879092 but I'm not sure if that will help in any way..)
You can try enabling IPC message logging (set IPC_MESSAGE_LOG_ENABLED=1 in environment) but that may be too noisy and/or slow.
Other than that we'd need to file an IPC bug to improve the logging when we crash.
![]() |
Reporter | |
Comment 7•12 years ago
|
||
> Other than that we'd need to file an IPC bug to improve the logging when we
> crash.
Filed bug 879580.
Updated•12 years ago
|
Crash Signature: [@ mozalloc_abort | NS_DebugBreak_P | mozilla::dom::ContentChild::ProcessingError ]
[@ mozilla::dom::ContentChild::ProcessingError ] → [@ mozalloc_abort | NS_DebugBreak_P | mozilla::dom::ContentChild::ProcessingError ]
[@ mozalloc_abort(char const*) | NS_DebugBreak | mozilla::dom::ContentChild::ProcessingError(mozilla::ipc::HasResultCodes::Result) ]
[@ mozilla::dom::ContentChild::Proce…
Comment 8•12 years ago
|
||
It's #6 crasher for all B2G versions.
Comment 9•12 years ago
|
||
This occurs on leo too.
Comment 10•12 years ago
|
||
It seems that I wrongly added a signature for a normal crash to previous signatures for a crash on abort.
It's #6 top crasher in FxOS 1.0.1.
blocking-b2g: - → leo?
Crash Signature: [@ mozalloc_abort | NS_DebugBreak_P | mozilla::dom::ContentChild::ProcessingError ]
[@ mozalloc_abort(char const*) | NS_DebugBreak | mozilla::dom::ContentChild::ProcessingError(mozilla::ipc::HasResultCodes::Result) ]
[@ mozilla::dom::ContentChild::Proce… → [@ mozalloc_abort | NS_DebugBreak_P | mozilla::dom::ContentChild::ProcessingError ]
[@ mozalloc_abort(char const*) | NS_DebugBreak | mozilla::dom::ContentChild::ProcessingError(mozilla::ipc::HasResultCodes::Result) ]
Summary: Crash [@ mozilla::dom::ContentChild::ProcessingError] → crash on abort in mozilla::dom::ContentChild::ProcessingError
Comment 11•12 years ago
|
||
(In reply to leo.bugzilla.gecko from comment #9)
> This occurs on leo too.
Can you elaborate? Are you seeing the signature? Are you reproducing a crash? We can't block this without actionable information to move on - being a topcrash isn't enough to block.
blocking-b2g: leo? → -
Flags: needinfo?(jaeohkim83)
Comment 12•12 years ago
|
||
I'm attaching crash dump information from leo device.
Flags: needinfo?(jaeohkim83)
Comment 13•12 years ago
|
||
(In reply to Changbin Park from comment #12)
> Created attachment 789252 [details]
> leo-callstack.log
>
> I'm attaching crash dump information from leo device.
How did you reproduce the crash that caused this dump?
Comment 14•12 years ago
|
||
It occured by our QA team.
He said, he tested bunch of message things.
Indeed, he didn't realize when the crash occured, but after that a different kind of problem occurs on the taget. So, it passed to me and I found the crash occured on it while he was testing.
It's not that detail I know, sorry..
Comment 15•12 years ago
|
||
This issue is reporduced one more time.
The mobile is left in camera view(Message - file attach - Camera)
The crash is occured after few minute. It has same call stack with this bug.
But it is not reproduced unfortunately.
Comment 16•12 years ago
|
||
:Gary, are you still able to reproduce this issue on 1.1 or master ?
Comment 17•12 years ago
|
||
(In reply to ben turner [:bent] (needinfo? encouraged) from comment #6)
> You can try enabling IPC message logging (set IPC_MESSAGE_LOG_ENABLED=1 in
> environment) but that may be too noisy and/or slow.
>
> Other than that we'd need to file an IPC bug to improve the logging when we
> crash.
Hey Ben, this has been filed, who is the right owner here https://bugzilla.mozilla.org/show_bug.cgi?id=879580 ?
![]() |
||
Comment 18•12 years ago
|
||
Note that this is the top 1.0.1 crash that seems to be in our code and is not fixed for newer versions.
It's #3 over the last week of data, see https://crash-analysis.mozilla.com/rkaiser/2013-09-05/2013-09-05.b2g.topcrashes.weekly.html#b2g-1.0.1.0-prerelease and the ones above or near it are either not our code or are fixed in 1.1.
(In reply to bhavana bajaj [:bajaj] from comment #17)
> Hey Ben, this has been filed, who is the right owner here
Not sure, anyone who has done IPC work I guess. It's not prioritized at the moment so I don't think anyone has even looked at it.
![]() |
Reporter | |
Comment 20•12 years ago
|
||
(In reply to ben turner [:bent] (needinfo? encouraged) from comment #19)
> (In reply to bhavana bajaj [:bajaj] from comment #17)
> > Hey Ben, this has been filed, who is the right owner here
>
> Not sure, anyone who has done IPC work I guess. It's not prioritized at the
> moment so I don't think anyone has even looked at it.
I'll nominate bug 879580 to get it prioritized, hopefully.
![]() |
Reporter | |
Comment 21•12 years ago
|
||
(In reply to bhavana bajaj [:bajaj] from comment #16)
> :Gary, are you still able to reproduce this issue on 1.1 or master ?
I've not tested on 1.1 / master, in any case won't be able to anytime soon, unfortunately.
Comment 22•12 years ago
|
||
mvines - I see this has been added to the 1.2 CS blocker list, but we don't have STR to action this bug (which has been the main reason we haven't actioned this bug from past releases). Did your team come across a way to reproduce this top crash?
Flags: needinfo?(mvines)
Comment 23•12 years ago
|
||
Yep, run orangutan for ~8 hours. We caught crash twice overnight.
Flags: needinfo?(mvines)
Comment 24•12 years ago
|
||
Noming cause this blocks the CS 1.2 blocker.
Andrew - Can you find someone to look into this?
blocking-b2g: - → koi?
Flags: needinfo?(overholt)
Comment 25•12 years ago
|
||
I spoke about this today with bsmedberg. He's going to try to work on (or find someone to work on) bug 879580. Once we've got that improved logging it will be easier to find this race.
Flags: needinfo?(overholt)
Comment 26•12 years ago
|
||
Great. We can run an orangutan or two on a build with some logging patches/etc pretty easily too. LMK
Comment 27•12 years ago
|
||
I spent a while looking at the debugging possibilities here. The only good option is to run the tests with MOZ_IPC_MESSAGE_LOG on, so that we can see what message the parent process thinks its sending (probably to a dead actor, perhaps because of a race sending a message in both directions).
Does the stdout of b2g go to an attached debug device, or does it always stay on the phone? IPC logs can get big. I guess we're running these tests with non-debug builds, right?
Comment 28•12 years ago
|
||
(In reply to Benjamin Smedberg [:bsmedberg] from comment #27)
> The only good
> option is to run the tests with MOZ_IPC_MESSAGE_LOG on
We can do that here, LMK exactly what you need.
> Does the stdout of b2g go to an attached debug device, or does it always
> stay on the phone?
/dev/null right now, but I've been meaning to redirect stdout to logcat though as that's more helpful. We can do that.
> I guess we're running these tests with non-debug builds, right?
-eng builds right now, but without any additional Gecko debug settings enabled.
Comment 29•12 years ago
|
||
A normal -eng build run with MOZ_IPC_MESSAGE_LOG=1 in the environment should dump a bunch of spew to stdout/stderr about messages being sent and received. If we can get that through logcat for one of these test runs, that would help tremendously.
Updated•12 years ago
|
Flags: needinfo?(mvines)
Comment 30•12 years ago
|
||
Debugging patch which should produce *some* output on stderr, even when MOZ_IPC_MESSAGE_LOG is not set. A log with MOZ_IPC_MESSAGE_LOG would still be better.
Comment 31•12 years ago
|
||
Looks like --enable-debug is needed for MOZ_IPC_MESSAGE_LOG=1 to work, which we don't enabled by default with -eng. Looks like s/false/true/ at http://dxr.mozilla.org/mozilla-central/source/ipc/glue/ProtocolUtils.h?from=LoggingEnabled#l112 will do the trick instead? With that change I see a ton of stderr output like:
[time:316253345887993][721][PContentChild] Sending Msg_AsyncMessage([TODO])
[time:316253345888908][653][PContentParent] Received Msg_AsyncMessage([TODO])
[time:316253346227936][721][PLayerTransactionChild] Sending Msg_Update([TODO])
[time:316253346228803][653][PLayerTransactionParent] Received Msg_Update([TODO])
[time:316253346229286][653][PLayerTransactionParent] Sending reply Reply_Update([TODO])
[time:316253346232066][721][PLayerTransactionChild] Received reply Reply_Update([TODO])
[time:316253346232966][721][PLayerChild] Sending Msg___delete__([TODO])
[time:316253346233616][721][PLayerChild] Sending Msg___delete__([TODO])
Is this the kind of logging that'll help?
That's the stuff you're looking for!
![]() |
Reporter | |
Updated•12 years ago
|
Summary: crash on abort in mozilla::dom::ContentChild::ProcessingError → Crash on abort in mozilla::dom::ContentChild::ProcessingError
Comment 34•12 years ago
|
||
(The monkeys are still trying to catch this w/ IPC logging. Hopefully they'll get lucky over the weekend.)
Flags: needinfo?(mvines)
Updated•12 years ago
|
Flags: needinfo?(mvines)
Comment 35•12 years ago
|
||
Here's one example of this crash with IPC logging enabled. The crashing app was gallery. I've attached all Gecko logcat output for the Gallery and b2g processes leading up to the crash.
APPLICATION OOM_ADJ OOM_SCORE OOM_SCORE_ADJ USER PID PPID VSIZE RSS WCHAN PC NAME
Gallery 1 138 67 app_2532 2532 29689 91184 29248 ffffffff 407dfabc S /system/b2g/plugin-container
b2g 0 178 0 root 29689 1 277292 85316 ffffffff 400b0604 S /system/b2g/b2g
(background apps omitted)
Flags: needinfo?(mvines)
Comment 36•12 years ago
|
||
Another log of this crash, again in the Gallery. b2g pid is 843, Gallery pid is 20847.
Comment 37•12 years ago
|
||
Crash was observed again overnight, and again in the Gallery app
Comment 38•12 years ago
|
||
Here's the important part from the first log:
>I/Gecko (29689): [time:1379917243883118][29689][PContentParent] Sending Msg_AsyncMessage([TODO]) >I/Gecko (2532):
>I/Gecko ( 2532): ###!!! [Child][AsyncChannel] Error: Route error: message sent to unknown actor ID >I/Gecko (2532):
>I/Gecko ( 2532): [Child 2532] ###!!! ABORT: aborting because of MsgRouteError: file /local/mnt/buildbot/v1.2_msm7627a/build/gecko/dom/ipc/ContentChild.cpp, line 1064
Comment 39•12 years ago
|
||
This log doesn't look all that useful without actor IDs, unfortunately :(
Comment 40•12 years ago
|
||
(In reply to Josh Matthews [:jdm] from comment #39)
> This log doesn't look all that useful without actor IDs, unfortunately :(
If you have a patch that outputs what you need to see I can easily add that temporarily here so we can get that data the next time the crash occurs.
Comment 41•12 years ago
|
||
Crash occurred last night with attachment 807464 [details] [diff] [review] applied ("DumpMessage RouteError for message: .type=1310722 .size=40"). This time it was a camera app crash instead of the gallery app.
I/Gecko ( 852): [time:1380104599666268][852][PGrallocBufferChild] Received Msg___delete__([TODO])
I/Gecko ( 852): [time:1380104599666447][852][PGrallocBufferChild] Received Msg___delete__([TODO])
I/Gecko ( 852): [time:1380104599670447][852][PBrowserParent] Received Msg_PContentPermissionRequestConstructor([TODO])
I/Gecko ( 852): [time:1380104599671093][852][PContentPermissionRequestParent] Received Msg_prompt([TODO])
I/Gecko ( 852): [time:1380104599692020][852][PContentPermissionRequestParent] Sending Msg___delete__([TODO])
I/Gecko ( 9942): DumpMessage RouteError for message: .type=1310722 .size=40
I/Gecko ( 9942):
I/Gecko ( 9942): ###!!! [Child][AsyncChannel] Error: Route error: message sent to unknown actor ID
I/Gecko ( 9942):
I/Gecko ( 9942): [Child 9942] ###!!! ABORT: aborting because of MsgRouteError: file /local/mnt/buildbot/v1.2/build/gecko/dom/ipc/ContentChild.cpp, line 1064
I/Gecko ( 9942): [time:1380104599692787][9942][PCrashReporterChild] Sending Msg_AppendAppNotes([TODO])
E/Gecko ( 9942): mozalloc_abort: [Child 9942] ###!!! ABORT: aborting because of MsgRouteError: file /local/mnt/buildbot/v1.2/build/gecko/dom/ipc/ContentChild.cpp, line 1064
Comment 42•12 years ago
|
||
Benoit, related to the crash you're looking at? I know it's hard to tell from the info here, but just wondering.
Flags: needinfo?(bjacob)
Comment 43•12 years ago
|
||
Depends on what you mean by 'related'. The bug I'm looking at, bug 914823, has different symptoms (see call stack in bug 914823 comment 28) and a specific cause (we are referencing an ISurfaceAllocator that already died, we couldn't know it because it's not reference-counted, it can't be reference-counted because IPDL actors aren't reference-counted) that seems to be different from what is being discussed above here.
But, on a more meta level, the present bug, bug 914823, and generally half of the b2g crashes I've been dealing with, are 'related' in that they share a common aspect: we crash as we have dangling raw pointers to already-dead IPDL actors, and we can't easily fix that because IPDL actors aren't refcounted, and as long as they're not refcounted, the only way we can write non-crashing code is by having a mental model of the entire B2G IPC-facing codebase, and well, I don't know that anyone anymore has such a mental model.
Flags: needinfo?(bjacob)
Comment 44•12 years ago
|
||
(In reply to Benoit Jacob [:bjacob] from comment #43)
> But, on a more meta level, the present bug, bug 914823, and generally half
> of the b2g crashes I've been dealing with, are 'related' in that they share
> a common aspect: we crash as we have dangling raw pointers to already-dead
> IPDL actors, and we can't easily fix that because IPDL actors aren't
> refcounted, and as long as they're not refcounted, the only way we can write
> non-crashing code is by having a mental model of the entire B2G IPC-facing
> codebase, and well, I don't know that anyone anymore has such a mental model.
This piqued my interest so I asked Ben and he said IPDL actors can be refcounted. He opined that perhaps you were thinking of gralloc-associated actors?
Comment 45•12 years ago
|
||
This was a new discovery. From IRC:
16:17 bjacob: newsflash: we _can_ have refcounted IPDL actors, in fact netwerk/ already has a few, and it's even discussed here: https://developer.mozilla.org/en-US/docs/IPDL/Best_Practices !!
16:18 bjacob: (thanks to jdm) !
16:44 bjacob: milan: yes, jdm pointed me to netwerk/ipc/Necko{Child,Parent}.*
16:45 bjacob: milan: the other problem is the SurfaceDescriptor IPDL union, which compiles in C++ to raw pointers. There too, jdm suggested a realistic fix: mirror it in C++ by a RefcountingSurfaceDescriptor class that would do the refcounting (and have a constructor taking an IPDL SurfaceDescriptor)
Comment 46•12 years ago
|
||
Yep, IPDL actors can be refcounted, but none of the graphics-facing ones are, and until two hours ago, none of use gfx people around here in Toronto knew about that possibility! Many thanks to :jdm for educating us.
Switching our IPDL actors to refcounting is a fairly nontrivial code change though, so it might be too big a change for B2G 1.2 at this point. Let's do it on mozilla-central for B2G v1.3, but for B2G 1.2 bugs, for now, let's still try to fix them without switching non-refcounted classes to refcounting...
Comment 47•12 years ago
|
||
If we try to fix this without ref counting, even assuming that we properly identify all the places where we crash, how will we be confident that the fix will actually work? We have been dealing with this class of bugs for about 9 month now, and every time you whack a mole, a new one pops up.
Updated•12 years ago
|
Assignee: nobody → bjacob
Comment 48•12 years ago
|
||
Benoit and Sotaro are connecting with Ben Turner to see how large and scary this change would be.
Comment 49•12 years ago
|
||
This was flagged as a high priority bug by the chipset vendor. How are we doing here?
Comment 50•12 years ago
|
||
I've been busy until yesterday with 1) bug 914823, which is another occurence of this kind of issue, and 2) trying to understand what would generally be the right approach to fixing this kind of issue.
Now I think that the approach we're taking in bug 914823, which is to make the actor SupportsWeakPtr, is a good one:
- it is unintrusive, conservative enough to land on aurora;
- it actually makes sense as 'the right solution' as it matches the reality that from the point of view of DOM elements, the whole IPC system can disappear at any time. We can have a given actor outlive the IPC system by holding a strong reference to it, and that can be useful to avoid crashing, but we can't prevent the IPC system from going down. For example, if the IPC system runs out of file descriptors, there is nothing we can do at the moment to do IPC again, IIUC.
I filed bug 923530 as a tracking bug for this kind of issues; blocking it.
I'll now try to fix the present bug...
Blocks: 923530
Comment 51•12 years ago
|
||
So to be clear, I do think that reference-counting actors is going to be a part of the solution. If only because there is no way to do thread-safe weak references without being able to convert a weak reference to a strong reference before dereferencing it. But just reference-counting along, as I thought before would be the solution, isn't a complete solution by itself, because we can't always prevent the IPC system from going down earlier than expected (IPC errors can happen at any time). So we still have to react to unexpected actor death, and using weak references seem like our best way to do so at the moment.
Comment 52•12 years ago
|
||
Does anyone have STR that I can use while having USB debugging? The only definite STR that I can see here is:
(In reply to jongsoo.oh from comment #15)
> This issue is reporduced one more time.
> The mobile is left in camera view(Message - file attach - Camera)
> The crash is occured after few minute. It has same call stack with this bug.
> But it is not reproduced unfortunately.
However, when I try this with USB debugging plugged, I get a modal message:
"Camera can not be used while plugged in - Unplug the phone to view pictures"
Can I work around it?
Flags: needinfo?(zzongsoo)
Flags: needinfo?(mvines)
Flags: needinfo?(gary)
Comment 53•12 years ago
|
||
(In reply to Michael Vines [:m1] [:evilmachines] from comment #36)
> Created attachment 808744 [details]
> Gallery crash #2
>
> Another log of this crash, again in the Gallery. b2g pid is 843, Gallery
> pid is 20847.
From this log:
[20847][PBrowserChild] Sending Msg_PContentPermissionRequestConstructor([TODO])
[20847][PContentPermissionRequestChild] Sending Msg_prompt([TODO])
[20847][PBrowserChild] Received Msg_RealTouchEvent([TODO])
[20847][PLayerTransactionChild] Sending Msg_Update([TODO])
[843][PLayerTransactionParent] Received Msg_Update([TODO])
[843][PLayerTransactionParent] Sending reply Reply_Update([TODO])
[20847][PLayerTransactionChild] Received reply Reply_Update([TODO])
[843][PContentParent] Received Msg_AsyncMessage([TODO])
[843][PContentParent] Sending Msg_AsyncMessage([TODO])
[843][PBrowserParent] Sending Msg_Deactivate([TODO])
[20847][PBrowserChild] Received Msg_Deactivate([TODO])
[843][PBrowserParent] Sending Msg_Activate([TODO])
[843][PBrowserParent] Sending Msg_AsyncMessage([TODO])
[843][PBrowserParent] Sending Msg_Destroy([TODO])
[20847][PBrowserChild] Received Msg_Destroy([TODO])
[20847][PContentChild] Sending Msg_AudioChannelChangeDefVolChannel([TODO])
[20847][PIndexedDBDatabaseChild] Sending Msg_Close([TODO])
[20847][PCompositableChild] Sending Msg___delete__([TODO])
[843][PCompositableParent] Received Msg___delete__([TODO])
[20847][PLayerChild] Sending Msg___delete__([TODO])
[843][PLayerParent] Received Msg___delete__([TODO])
[843][PGrallocBufferParent] Sending Msg___delete__([TODO])
[843][PGrallocBufferParent] Sending Msg___delete__([TODO])
[20847][PLayerChild] Sending Msg___delete__([TODO])
[843][PLayerParent] Received Msg___delete__([TODO])
[20847][PLayerChild] Sending Msg___delete__([TODO])
[843][PLayerParent] Received Msg___delete__([TODO])
[20847][PRenderFrameChild] Sending Msg___delete__([TODO])
[20847][PBrowserChild] Sending Msg___delete__([TODO])
[20847][PContentChild] Sending Msg_AsyncMessage([TODO])
[20847][PContentChild] Sending Msg_AsyncMessage([TODO])
[20847][PGrallocBufferChild] Received Msg___delete__([TODO])
[20847][PGrallocBufferChild] Received Msg___delete__([TODO])
[843][PLayerTransactionChild] Sending Msg_PLayerConstructor([TODO])
[843][PLayerTransactionParent] Received Msg_PLayerConstructor([TODO])
[843][PLayerTransactionChild] Sending Msg_Update([TODO])
[843][PLayerTransactionParent] Received Msg_Update([TODO])
[843][PLayerTransactionParent] Sending reply Reply_Update([TODO])
[843][PLayerTransactionChild] Received reply Reply_Update([TODO])
[843][PLayerChild] Sending Msg___delete__([TODO])
[843][PLayerParent] Received Msg___delete__([TODO])
[843][PBrowserParent] Received Msg_PContentPermissionRequestConstructor([TODO])
[843][PContentPermissionRequestParent] Received Msg_prompt([TODO])
[843][PContentPermissionRequestParent] Sending Msg___delete__([TODO])
I/Gecko (20847):
I/Gecko (20847): ###!!! [Child][AsyncChannel] Error: Route error: message sent to unknown actor ID
I/Gecko (20847):
Here is one interpretation of this log, please tell me whether it makes sense: the Layers / Gfx stuff here is actually a red herring, the problem is we're apparently doing a Send___delete__ on a bad PContentPermissionRequestParent.
Does that make sense?
Flags: needinfo?(bent.mozilla)
Comment 54•12 years ago
|
||
...rather, the PContentPermissionRequestParent is fine (it just received a message) but the matching PContentPermissionRequest***Child*** is already dead. Does _that_ make sense?
Comment 55•12 years ago
|
||
Asking Doug who seems to have been maintaining code around PContentPermissionRequestChild.
Comment 56•12 years ago
|
||
(In reply to Benoit Jacob [:bjacob] from comment #52)
> However, when I try this with USB debugging plugged, I get a modal message:
>
> "Camera can not be used while plugged in - Unplug the phone to view pictures"
>
> Can I work around it?
Sounds like you may have Settings -> Enable USB storage enabled. Try disabling.
Flags: needinfo?(mvines)
Comment 57•12 years ago
|
||
I also think attachment 808694 [details] and attachment 808744 [details] seems not related to gfx. From the log. The crash of attachment 808694 [details] seems triggered by calling PContentParent::SendAsyncMessage(). The crash of attachmeet 808744 seems triggered by cakkubg PContentPermissionRequestParent::Send___delete__().
Both seem to try to send to already disconnected/deleted ipc object.
Comment 58•12 years ago
|
||
Comment 57 reminds me Bug 867025. Incorrect ipc handling problems are not the only problem around gfx, but also could happen in any other ipc in gecko. In current gecko's ipc architecutre/implementation, it is very very difficult to correctly use it in all situation. Such kind of rare crash could happen easily in b2g.
- Bug 867025 - [unagi][tara][weekly build 13.04.17]monkey test crash in mozilla::dom::ContentChild::ProcessingError
Comment 59•12 years ago
|
||
(In reply to Michael Vines [:m1] [:evilmachines] from comment #56)
> (In reply to Benoit Jacob [:bjacob] from comment #52)
> > However, when I try this with USB debugging plugged, I get a modal
> message:
> >
> > "Camera can not be used while plugged in - Unplug the phone to view pictures"
> >
> > Can I work around it?
>
> Sounds like you may have Settings -> Enable USB storage enabled. Try
> disabling.
Thanks, that fixed this issue.
Now I am running into another problem: as I follow the STR, which involves switching the the Communications app to the background, I quickly get the Communications killed by the background-process-killer, before I get a chance to crash it. Can I disable the background-processes killer or have Communications be spared by it?
Comment 60•12 years ago
|
||
I tried writing 0 to /proc/PID/oom_score_adj and oom_adj, and the value was apparently remembered, but Communications still got killed.
Comment 61•12 years ago
|
||
Michael, the OOM killer issue reported in comment 59 -- 60 is preventing me from reproducing. What would be current steps-to-reproduce on Hamachi?
Flags: needinfo?(mvines)
Comment 62•12 years ago
|
||
(In reply to Benoit Jacob [:bjacob] from comment #52)
> Does anyone have STR that I can use while having USB debugging? The only
> definite STR that I can see here is:
>
> (In reply to jongsoo.oh from comment #15)
> > This issue is reporduced one more time.
> > The mobile is left in camera view(Message - file attach - Camera)
> > The crash is occured after few minute. It has same call stack with this bug.
> > But it is not reproduced unfortunately.
>
> However, when I try this with USB debugging plugged, I get a modal message:
>
> "Camera can not be used while plugged in - Unplug the phone to view pictures"
>
> Can I work around it?
After disable USB storage, you can use the camera with Remote debugging.
Flags: needinfo?(zzongsoo)
Comment 63•12 years ago
|
||
(In reply to Benoit Jacob [:bjacob] from comment #61)
> Michael, the OOM killer issue reported in comment 59 -- 60 is preventing me
> from reproducing. What would be current steps-to-reproduce on Hamachi?
you could probably tweak the LMK parameters at http://dxr.mozilla.org/mozilla-central/source/b2g/app/b2g.js?from=b2g.js#l611 to avoid getting killed.
However, looking at the crash database here I see that we were reliably reproducing this issue in overnight orangutan testing from Sept 22nd though Oct 2nd, but have not observed this crash since. Mystery fix from some other patch? But until we begin to see this crash again with the same frequency it's not a blocker for us anymore.
Flags: needinfo?(mvines)
![]() |
Reporter | |
Updated•12 years ago
|
Flags: needinfo?(gary)
Comment 64•12 years ago
|
||
Milan, given that this seems 1) already resolved (comment 63) and 2) not graphics-related (comments 53, 57, 58), I suggest un-assigning me.
Flags: needinfo?(bent.mozilla) → needinfo?(milan)
Comment 65•12 years ago
|
||
So we did see this again last night but just once, I'll update here if this start to become a regular occurrence again
Updated•12 years ago
|
Assignee: bjacob → nobody
Flags: needinfo?(milan)
Comment 66•12 years ago
|
||
(Sadly this crash has returned on v1.2, we've been seeing it pretty regularly since Oct 09. LMK how I can help)
Comment 67•12 years ago
|
||
Milan we need to get on top of this. Until we have a new owner, you remain the owner. Feel free to ask for help from the content team.
Assignee: nobody → milan
Comment 69•12 years ago
|
||
bent and I reviewed the prompt proxy and couldn't see anything obvious from this protocol and implementation.
cyu, please reach out if you have any questions or get stuck.
I think I may have found the cause here. Can someone please apply this patch and throw some monkeys at it?
Oops, the right patch now.
Attachment #817549 -
Attachment is obsolete: true
Really final patch.
Attachment #817554 -
Attachment is obsolete: true
Assignee | ||
Comment 73•12 years ago
|
||
I just caught the same crash in the camera app. The message in question is PContentPermissionRequestMsgStart << 16 | Msg___delete____ID. I'll test the patch to see if the crash is fixed.
Comment 74•12 years ago
|
||
Our monkeys are v1.2 only at the moment. :bent, if you can rebase the patch on aurora then I'll put them to work overnight on it.
Updated•12 years ago
|
Flags: needinfo?(bent.mozilla)
Assignee | ||
Comment 75•12 years ago
|
||
I applied the patch and made some tests. The camera app still has the same crash.
Assignee | ||
Comment 76•12 years ago
|
||
This should fix the crash. In my tests this fixes the crash. :m1, could you please set up the test to verify it?
Attachment #818102 -
Flags: review?(bent.mozilla)
Flags: needinfo?(mvines)
Assignee | ||
Comment 77•12 years ago
|
||
Comment #53 already tells us the story. Here is the analysis:
[20847][PBrowserChild] Sending Msg_PContentPermissionRequestConstructor([TODO])
[20847][PContentPermissionRequestChild] Sending Msg_prompt([TODO])
* Content permission prompt requested
[843][PBrowserParent] Sending Msg_Destroy([TODO])
* TabParent::Destroy() is called, from this point on we should not send messages of PBrowser or protocols managed by it
[20847][PBrowserChild] Received Msg_Destroy([TODO])
* The destroy message is received. From this point on the protocol tree on the child side is destroyed
* Many managed protocols being deleted
[20847][PBrowserChild] Sending Msg___delete__([TODO])
* At the end of TabChild::RecvDestroy() PBrowserChild sends out Msg___delete__
[843][PBrowserParent] Received Msg_PContentPermissionRequestConstructor([TODO])
[843][PContentPermissionRequestParent] Received Msg_prompt([TODO])
* The prompt is request is finally received. But since PBrowser is already Destroy()'d we should not send back any messages
[843][PContentPermissionRequestParent] Sending Msg___delete__([TODO])
* This is the message that cannot be routed and causes crash!!
* Since we haven't received Msg___delete__ from PBrowserChild, the actors on the parent side are still alive. That's why PContentPermissionRequestParent still can send out messages.
I/Gecko (20847):
I/Gecko (20847): ###!!! [Child][AsyncChannel] Error: Route error: message sent to unknown actor ID
I/Gecko (20847):
Assignee | ||
Comment 78•12 years ago
|
||
And my STR:
* Open the camera app (or the camera process is killed when we leave the inline activity).
* Open the message app and then add attachment from the camera.
* Since this is a monkey test, let's act like a monkey! Start quickly tapping even before the camera preview is shown.
(not sure if this is related to the crash, but this will bring up the camera app UI, not the inline activity UI).
* Take pictures (also quickly tapping the camera button).
* And quickly switch between camera and gallery
* If this doesn't crash go back to the message app, or change homescreen background from the camera.
Comment 79•12 years ago
|
||
cyu, do you know why ContentPermissionRequestParent::ActorDestroy isn't being called?
Assignee | ||
Comment 80•12 years ago
|
||
(In reply to Doug Turner (:dougt) from comment #79)
> cyu, do you know why ContentPermissionRequestParent::ActorDestroy isn't
> being called?
It's not yet and we already crashed. It should be called after PBrowserParent receives Msg___delete__. PBrowserChild sends out the message, but it's still in parent's queue.
Comment 81•12 years ago
|
||
(In reply to Cervantes Yu from comment #76)
> :m1, could you please set up the test to verify it?
The patch doesn't apply to v1.2, can you please rebase it.
Flags: needinfo?(mvines)
Assignee | ||
Comment 82•12 years ago
|
||
Rebased to v1.2. This should apply. Please test using this one.
Flags: needinfo?(mvines)
Comment 83•12 years ago
|
||
Thanks, I've added this patch internally now and I'll report back in a couple days once we have some test time on it.
Flags: needinfo?(mvines)
Updated•12 years ago
|
Flags: needinfo?(bent.mozilla) → needinfo?(mvines)
Comment 84•12 years ago
|
||
topcrash is being replaced by more precise keywords per https://bugzilla.mozilla.org/show_bug.cgi?id=927557#c3
Keywords: topcrash → topcrash-b2g
Comment on attachment 818102 [details] [diff] [review]
Don't call ContentPermissionRequestParent::Send__delete__() if the managing TabParent is already Destroy()'d
Review of attachment 818102 [details] [diff] [review]:
-----------------------------------------------------------------
This looks great, thanks for digging in here!
::: dom/base/nsContentPermissionHelper.cpp
@@ +105,5 @@
> if (mParent == nullptr) {
> return NS_ERROR_FAILURE;
> }
>
> + TabParent *tabParent = static_cast<TabParent*>(mParent->Manager());
Nit: * on the left here, and below.
Attachment #818102 -
Flags: review?(bent.mozilla) → review+
![]() |
Reporter | |
Comment 86•12 years ago
|
||
Does this need to be backported to 1.1, or is it unnecessary?
Comment 87•12 years ago
|
||
(In reply to Michael Vines [:m1] [:evilmachines] from comment #83)
> Thanks, I've added this patch internally now and I'll report back in a
> couple days once we have some test time on it.
We haven't seen this crash reoccur yet with attachment 818560 [details] [diff] [review], so maybe fixed. Although we've seen this bug go quiet in the past for many days as well. :-/
Flags: needinfo?(mvines)
Assignee | ||
Comment 88•12 years ago
|
||
(In reply to Gary Kwong [:gkw] [:nth10sd] (still catching up on bugmail) (PTO Oct 21-25) from comment #86)
> Does this need to be backported to 1.1, or is it unnecessary?
My guess is yes, but I am not 100% sure and need to have tests against it.
Assignee | ||
Comment 89•12 years ago
|
||
Attachment #818102 -
Attachment is obsolete: true
Attachment #820435 -
Flags: review+
Assignee | ||
Updated•12 years ago
|
Keywords: checkin-needed
Assignee | ||
Comment 90•12 years ago
|
||
(In reply to Cervantes Yu from comment #88)
> (In reply to Gary Kwong [:gkw] [:nth10sd] (still catching up on bugmail)
> (PTO Oct 21-25) from comment #86)
> > Does this need to be backported to 1.1, or is it unnecessary?
>
> My guess is yes, but I am not 100% sure and need to have tests against it.
I failed to test this on 1.1 on unagi because of camera issues. We might need QA's help if we want to figure this out.
Comment 91•12 years ago
|
||
(In reply to Cervantes Yu from comment #89)
> Created attachment 820435 [details] [diff] [review]
> Don't call ContentPermissionRequestParent::Send__delete__() if the managing
> TabParent is already Destroy()'d. r=bent
This is a bit obscure; perhaps a comment in both places, or even better, a short method that encapsulates this call and is named something like "in the process of being destroyed" to make things more readable? For somebody looking at the code for the first time (or three years from now), it would help. Right now, the most helpful comment is in TabParent.h.
Comment 92•12 years ago
|
||
Please can you obsolete the old patches/rename to make it clearer what needs checking in and in what order? Thank you :-)
Keywords: checkin-needed
Assignee | ||
Comment 93•12 years ago
|
||
Refactoring and add comments.
Attachment #817560 -
Attachment is obsolete: true
Attachment #818560 -
Attachment is obsolete: true
Attachment #820435 -
Attachment is obsolete: true
Attachment #821144 -
Flags: review+
Assignee | ||
Updated•12 years ago
|
Attachment #807464 -
Attachment description: bug879580.patch → [Test] bug879580.patch
Assignee | ||
Updated•12 years ago
|
Keywords: checkin-needed
Comment 94•12 years ago
|
||
I presume I'm not checking in "[Test] bug879580.patch" ?
(There are 30+ checkin-neededs in the queue, there sadly isn't much time to read scrollback in the bug etc)
Assignee | ||
Comment 95•12 years ago
|
||
(In reply to Ed Morley [:edmorley UTC+1] from comment #94)
> I presume I'm not checking in "[Test] bug879580.patch" ?
>
No. It's test-only, and bug number is different.
Comment 96•12 years ago
|
||
(In reply to Cervantes Yu from comment #95)
> (In reply to Ed Morley [:edmorley UTC+1] from comment #94)
> > I presume I'm not checking in "[Test] bug879580.patch" ?
> >
> No. It's test-only, and bug number is different.
Cool, thank you. (If there are multiple patches attached, best bet is to mark the others obsolete, change the attachment descriptions or else use the per-patch "checkin?" flag to avoid ambiguity :-))
Comment 97•12 years ago
|
||
Keywords: checkin-needed
Comment 98•12 years ago
|
||
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Comment 99•12 years ago
|
||
status-b2g-v1.2:
--- → fixed
status-firefox25:
--- → wontfix
status-firefox26:
--- → fixed
status-firefox27:
--- → fixed
This issue happened again during monkey testing. Here is the adb logcat, dmesg etc
Updated•11 years ago
|
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Updated•11 years ago
|
blocking-b2g: koi+ → 1.3?
Contents of ".extra " file when it happened again in FFOS 1.3:
B2G_OS_Version=1.3.0.0-prerelease
Android_Device=msm8610
Android_Manufacturer=unknown
ProductName=B2G
Android_Board=MSM8610
Android_CPU_ABI=armeabi-v7a
Vendor=Mozilla
InstallTime=75973
Notes=GL Layers! EGL? EGL+ GL Context? GL Context+ GL Layers+
ReleaseChannel=default
Android_CPU_ABI2=armeabi
Version=28.0a2
Android_Brand=qcom
ServerURL=https://crash-reports.mozilla.com/submit?id={3c2e2abc-06d4-11e1-ac3b-374f68613e61}&version=28.0a2&buildid=20131225214657
Android_Hardware=qcom
useragent_locale=en-US
BuildID=20131225214657
ProductID={3c2e2abc-06d4-11e1-ac3b-374f68613e61}
Android_Version=18(REL)
Android_Model=msm8610
CrashTime=1388475440
StartupTime=1388474558
ProcessType=content
Notes=EGL? EGL+ GL Context? GL Context+ xpcom_runtime_abort([Child 1262] ###!!! ABORT: aborting because of MsgProcessingError: file /local/mnt/workspace/lnxbuild/project/release_dev_msm8610_2199240/checkout/gecko/dom/ipc/ContentChild.cpp, line 1140)
URL=app://gallery.gaiamobile.org/manifest.webapp
Comment 102•11 years ago
|
||
We've already got a different bug open on this - see bug 956325.
Status: REOPENED → RESOLVED
blocking-b2g: 1.3? → koi+
Closed: 12 years ago → 11 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•