Closed Bug 956325 Opened 6 years ago Closed 6 years ago

crash in mozalloc_abort(char const*) | NS_DebugBreak | mozilla::dom::ContentChild::ProcessingError(mozilla::ipc::HasResultCodes::Result)

Categories

(Core :: IPC, defect, critical)

Version: 29 Branch
Hardware: ARM
OS: Gonk (Firefox OS)
Type: defect
Priority: Not set
Severity: critical

Tracking

VERIFIED FIXED
1.3 C2/1.4 S2(17jan)
blocking-b2g 1.3+
Tracking Status
firefox27 --- wontfix
firefox28 --- verified
firefox29 --- verified
b2g-v1.3 --- fixed
b2g-v1.4 --- fixed

People

(Reporter: nkot, Assigned: gwagner)

Details

(Keywords: crash, regression, Whiteboard: [b2g-crash], [systemsfe], [CR 596211])

Attachments

(5 files, 2 obsolete files)

This bug was filed from the Socorro interface and is report bp-a1f9b222-b5aa-4941-a285-240342140103.
=============================================================

Hit this crash today while going through the FTE after manually flashing my Buri to the 20140103040201 build.

I am not sure if it can be reproduced using these STR:
1) Updated Buri to BuildID: 20140103040201
2) Reset device from Settings
3) Go through FTU (I also downloaded Facebook and Outlook contacts)
4) Tap on Privacy policy link
5) Tap Everything.me link

Actual:
Crash occurs

Expected:
NO crashes occur during FTE

Environmental Variables:
Device: Buri v1.4 (Master M-C) Mozilla RIL
BuildID: 20140103040201
Gaia: 83cc63f728489a24256731adf558354bb2012a59
Gecko: 49d2fce9a86c
Version: 29.0a1
Firmware Version: v1.2_20131115
blocking-b2g: --- → 1.4?
Component: General → IPC
Keywords: regression
Product: Firefox OS → Core
Version: unspecified → 29 Branch
I hit this crash as well during first run but I got the same stack as in Bug 952170. Will try to reproduce.
I hit this particular crash twice using the same STR, though I cannot reproduce it 100% of the time. Bug 952170 is also happening to me.
This crash also reproduces on the 1.3 build.

Device: Buri v1.3 Mozilla RIL
BuildID: 20140106004001
Gaia: 35a60b82f8cf2d759939a350e2dadbb9d8b2f5dc
Gecko: a43cb4b322d3
Version: 28.0a2
Firmware Version: v1.2_20131115

same STR:
1) Updated Buri to BuildID: 20140103040201
2) Reset device from Settings
3) Go through FTU: Sign in to WiFi, Set Time Zone to America/LA, Download Facebook and Outlook contacts
4) Tap on Your Privacy link
5) Tap FirefoxOS and then Marketplace links
6) Tap Everything.me link
   ==> device crashes
blocking-b2g: 1.4? → 1.3?
Whiteboard: [b2g-crash]
Andrew - Can you find someone to look into this? We're getting hit by this crash daily in 1.3 testing.
Flags: needinfo?(overholt)
Blocks: 942267
This is ###!!! ABORT: aborting because of MsgRouteError

The most likely explanation for the error here is that we're racing:

* parent is sending a message to an IPDL actor
* child is destroying the actor

The crash stack itself isn't going to be much use. We're going to need to catch this in a debugger or run a debug build with MOZ_IPC_MESSAGE_LOG and capture the log from both processes.
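The suspected race can be replayed in a single thread. The following is a minimal sketch, not Gecko code: `ChildRouter`, `Actor`, `ReplayRace`, and the routing id are hypothetical stand-ins, and only the name `MsgRouteError` comes from the abort message above. It assumes the child looks up the target actor by routing id and reports a routing error when the actor has already been removed.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical stand-ins for IPDL concepts; these are not the real Gecko classes.
enum class Result { MsgProcessed, MsgRouteError };

struct Actor {
  int handled = 0;
  void OnMessage() { ++handled; }
};

// Child-side routing table: routing id -> live actor.
class ChildRouter {
 public:
  void Register(int32_t id, Actor* actor) { mActors[id] = actor; }
  void Unregister(int32_t id) { mActors.erase(id); }

  // Deliver a message that the parent addressed to `id`.
  Result Dispatch(int32_t id) {
    auto it = mActors.find(id);
    if (it == mActors.end()) {
      // The actor is already gone, so the message cannot be routed.
      return Result::MsgRouteError;
    }
    it->second->OnMessage();
    return Result::MsgProcessed;
  }

 private:
  std::unordered_map<int32_t, Actor*> mActors;
};

// Replays the suspected interleaving deterministically:
// the child destroys the actor while a parent message is still in flight.
Result ReplayRace() {
  ChildRouter router;
  Actor tab;
  const int32_t kTabId = 7;  // arbitrary id chosen for the example
  router.Register(kTabId, &tab);

  Result first = router.Dispatch(kTabId);  // actor alive: message is handled
  assert(first == Result::MsgProcessed);
  (void)first;

  router.Unregister(kTabId);       // child tears the actor down
  return router.Dispatch(kTabId);  // late message from the parent arrives
}
```

In this model the second dispatch returns `Result::MsgRouteError`, which is the condition the child process aborts on in the report. The sketch says nothing about which actor or message is involved in the real crash; that is what the requested message log would show.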
Jason, can whoever runs into this during smoketesting run with a debug build and MOZ_IPC_MESSAGE_LOG=1?
Flags: needinfo?(overholt) → needinfo?(jsmith)
(In reply to Andrew Overholt [:overholt] from comment #6)
> Jason, can whoever runs into this during smoketesting run with a debug build
> and MOZ_IPC_MESSAGE_LOG=1?

On the QA side, we don't have debug device builds, so I don't think we would be able to investigate this unless someone can spin a build for us.
Flags: needinfo?(jsmith)
Andrew,

How do we plan to proceed with this?
Flags: needinfo?(overholt)
Jason told me he was working with releng to get debug device builds.  If that's not happening soon, I suggest we ask Gregor or someone on the Systems FE team to get bsmedberg/bent the requested logs.
Flags: needinfo?(overholt)
Flags: needinfo?(jsmith)
Flags: needinfo?(anygregor)
(In reply to Andrew Overholt [:overholt] from comment #9)
> Jason told me he was working with releng to get debug device builds.  If
> that's not happening soon, I suggest we ask Gregor or someone on the Systems
> FE team to get bsmedberg/bent the requested logs.

It's in progress, but I don't expect this to happen soon.
Flags: needinfo?(jsmith)
blocking-b2g: 1.3? → 1.3+
Regression window for v1.3:

~does not reproduce~
BuildID: 20140102004001
Gaia: 01e9da49be2cc4bc134eeefc434740d572ec2246
Gecko: 61f553e5db49
Version: 28.0a2

~reproduces~
BuildID: 20140103004001
Gaia: ae7d05689b6b9ac4ec6182217dfdef06be28e886
Gecko: d9226a660d52
Version: 28.0a2

This occurred earlier on the master (1.4) build; I can find the regression window there if needed. So far it reproduces on the 01/02 master build but does not reproduce on the 12/23 master build.

Used the STR from comment 3 to get the regression range.
I tried with a debug build and logging enabled, but I can't reproduce this bug :(
Flags: needinfo?(anygregor)
I'm going to record a video; maybe it will help.
Okay, following these STR after resetting device from Settings I can reproduce this crash 100%. I've tried it on 3 different devices.

Video : http://youtu.be/esl9cdN51EQ
Thanks.
bent's and my guess is that we are running into an OOM situation.
I also noticed that the keyboard app got killed while I was entering the password for the Gmail contacts.
Gregor,

Can you please find someone to work on this blocker?
Flags: needinfo?(anygregor)
(In reply to Gregor Wagner [:gwagner] from comment #16)
> Thanks.
> bent and my guess is that we run into an OOM situation.
> I also noticed that during entering the password for the gmail contacts the
> keyboard app got killed.

We already have some similar report on Buri (but for v1.1 as far as I can tell), in bug 945043.
Well not similar, but OOM issues.
(In reply to Preeti Raghunath(:Preeti) from comment #17)
> Gregor,
> 
> Can you please find someone to work on this blocker?

Alex will take a look.
Flags: needinfo?(anygregor)
Right now I can't take a look because bug 958732 is kicking in before I can do anything in FTU.
Depends on: 958732
Depends on: 958780
Attached file buri.log (obsolete) —
This is the adb logcat of the device with a debug build. It looks like I'm running into another crash :(
I'm testing with Inari; my Buri is not able to get WiFi working, and I've already spent too much time fighting with it :(
(In reply to Natalya Kot [:nkot] from comment #3)
> this crash also reproduced on 1.3 build
> 
> Device: Buri v1.3 Mozilla RIL
> BuildID: 20140106004001
> Gaia: 35a60b82f8cf2d759939a350e2dadbb9d8b2f5dc
> Gecko: a43cb4b322d3
> Version: 28.0a2
> Firmware Version: v1.2_20131115
> 
> same STR:
> 1) Updated Buri to BuildID: 20140103040201
> 2) Reset device from Settings
> 3) Go through FTU: Sign in to WiFi, Set Time Zone to America/LA, Download
> Facebook and Outlook contacts
> 4) Tap on Your Privacy link
> 5) Tap FirefoxOS and then Marketplace links
> 6) Tap Everything.me link
>    ==> device crashes

Are the time zone and contacts download mandatory?
\o/ reproduced on Inari:
> 1) Reset device from Settings
> 2) Go through FTU: Sign in to WiFi
> 3) Tap on Your Privacy link
> 4) Tap FirefoxOS and then Marketplace links
Attached file Debug build: adb logcat
Attachment #8359100 - Attachment is obsolete: true
And now hitting bug 959126 while trying to reproduce.
It seems we have a 'Browser' process that is stuck. Killing it brings my homescreen back.
FYI Browser status was 't'.
Attachment #8359123 - Attachment mime type: text/x-log → text/plain
Attachment #8359124 - Attachment mime type: text/x-log → text/plain
Attachment #8359125 - Attachment mime type: text/x-log → text/plain
(In reply to Alexandre LISSY :gerard-majax from comment #24)
> Are the time zone and contacts download mandatory ?

It was a sure way to repro this crash. I tried going straight to the Privacy link and the crash didn't reproduce 100% of the time, though I could still get it about 3 times out of 5. I didn't mean to overcomplicate things; thank you for working on this!
Attached patch 956325.diff (obsolete) — Splinter Review
bent's patch.
Assignee: nobody → anygregor
Attachment #8359485 - Flags: review?(bugs)
Attachment #8359485 - Flags: review?(bugs) → review+
Comment on attachment 8359485 [details] [diff] [review]
956325.diff

Er, no, we have mIsDestroyed checks in TabParent.cpp
Attachment #8359485 - Flags: review+ → review-
(In reply to Olli Pettay [:smaug] from comment #34)
> Er, no, we have mIsDestroyed checks in TabParent.cpp

Yikes, that is really fragile.

http://mxr.mozilla.org/mozilla-central/source/dom/ipc/TabParent.h#218 no longer overrides http://mxr.mozilla.org/mozilla-central/source/dom/ipc/PBrowser.ipdl#387

:(
(In reply to ben turner [:bent] (use the needinfo? flag!) from comment #35)
> http://mxr.mozilla.org/mozilla-central/source/dom/ipc/TabParent.h#218 no
> longer overrides
> http://mxr.mozilla.org/mozilla-central/source/dom/ipc/PBrowser.ipdl#387

Is it supposed to? Nobody should be calling [2] except for [1], right? I think we might need another mIsDestroyed check at [3], maybe. But yeah, this is super fragile. Adding some MOZ_OVERRIDE annotations on things would help robustify stuff but probably not completely.

[1] http://mxr.mozilla.org/mozilla-central/source/dom/ipc/TabParent.cpp#765
[2] http://mxr.mozilla.org/mozilla-central/source/dom/ipc/PBrowser.ipdl#387
[3] http://mxr.mozilla.org/mozilla-central/source/dom/ipc/TabParent.cpp#807
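
The fragility discussed here can be illustrated with the C++11 `override` specifier, which `MOZ_OVERRIDE` is understood to map to on compilers that support it. The following is a minimal sketch under loud assumptions: `BrowserParentSketch`, `TabParentSketch`, `SendTouch`, and `SendAroundDestroy` are hypothetical names, and the sketch assumes the guarded send method is virtual in the base class, which may not hold for the real generated IPDL code. Only the `mIsDestroyed` flag name is taken from the discussion above.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a generated parent-side protocol class.
class BrowserParentSketch {
 public:
  virtual ~BrowserParentSketch() = default;

  // Pretend this puts a message on the wire.
  virtual bool SendTouch(const std::string& event) {
    (void)event;
    ++mWireSends;
    return true;
  }

  int WireSends() const { return mWireSends; }

 private:
  int mWireSends = 0;
};

// Hypothetical stand-in for the class that adds the destroyed check.
class TabParentSketch : public BrowserParentSketch {
 public:
  // With `override`, a change to the base signature turns this declaration
  // into a compile error instead of silently bypassing the guard below.
  bool SendTouch(const std::string& event) override {
    if (mIsDestroyed) {
      return false;  // never send to an actor that is being torn down
    }
    return BrowserParentSketch::SendTouch(event);
  }

  void Destroy() { mIsDestroyed = true; }

 private:
  bool mIsDestroyed = false;
};

// Sends one event before and one after Destroy() through a base-class
// reference and returns how many messages reached the wire.
int SendAroundDestroy() {
  TabParentSketch tab;
  BrowserParentSketch& base = tab;

  bool sentLive = base.SendTouch("touchstart");
  assert(sentLive);
  (void)sentLive;

  tab.Destroy();
  base.SendTouch("touchmove");  // dropped by the guard
  return base.WireSends();
}
```

Because the guard lives in an overriding method, callers holding only a base-class reference still go through it. Without `override`, a signature mismatch would declare a new, unrelated method, and such callers would reach the unguarded base version with no compiler diagnostic.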
Hrm, I thought so (the other Send[*] messages in nsEventStateManager::DispatchCrossProcessEvent do override the IPDL method), but now I'm not so sure about this. I'll poke around some more tomorrow.
Attached image screenshot
I was unable to repro the crash on today's master, but while scrolling in the E.me Privacy link I hit another issue: lots of overlapping text (see the attached screenshot).

Could this be fallout from the recent work done here, or is it a different issue?
Filed new bug 959781 for the issue in comment 38.
Attached patch 956325.diff
Attachment #8359485 - Attachment is obsolete: true
I still see the crash with the patch attached:

Program received signal SIGSEGV, Segmentation fault.
0xb630419a in mozalloc_abort (msg=<optimized out>) at ../../../memory/mozalloc/mozalloc_abort.cpp:30
30	    MOZ_CRASH();
(gdb) bt
#0  0xb630419a in mozalloc_abort (msg=<optimized out>) at ../../../memory/mozalloc/mozalloc_abort.cpp:30
#1  0xb4d170bc in Abort (aMsg=0xbedeb7e4 "[Child 3685] ###!!! ABORT: aborting because of MsgRouteError: file ../../../dom/ipc/ContentChild.cpp, line 1136")
    at ../../../xpcom/base/nsDebugImpl.cpp:427
#2  NS_DebugBreak (aSeverity=<optimized out>, aStr=0xb6601d59 "aborting because of MsgRouteError", aExpr=0x0, aFile=0xb66019ed "../../../dom/ipc/ContentChild.cpp", 
    aLine=1136) at ../../../xpcom/base/nsDebugImpl.cpp:414
#3  0xb53ff702 in mozilla::dom::ContentChild::ProcessingError (this=<optimized out>, what=<optimized out>) at ../../../dom/ipc/ContentChild.cpp:1136
#4  0xb4f0ac98 in mozilla::dom::PContentChild::OnProcessingError (this=<optimized out>, code=<optimized out>) at PContentChild.cpp:4491
#5  0xb4ee40de in mozilla::ipc::MessageChannel::MaybeHandleError (this=0xb3e44c48, code=mozilla::ipc::HasResultCodes::MsgRouteError, channelName=<optimized out>)
    at ../../../ipc/glue/MessageChannel.cpp:1493
#6  0xb4ee7060 in mozilla::ipc::MessageChannel::OnMaybeDequeueOne (this=0xb3e44c48) at ../../../ipc/glue/MessageChannel.cpp:1029
#7  0xb4ee3b60 in DispatchToMethod<mozilla::ipc::MessageChannel, void (mozilla::ipc::MessageChannel::*)()> (method=
    (void (mozilla::ipc::MessageChannel::*)(mozilla::ipc::MessageChannel * const)) 0xb4ee6fcd <mozilla::ipc::MessageChannel::OnMaybeDequeueOne()>, 
    obj=<optimized out>, arg=<optimized out>) at ../../../ipc/chromium/src/base/tuple.h:383
#8  RunnableMethod<mozilla::ipc::MessageChannel, void (mozilla::ipc::MessageChannel::*)(), Tuple0>::Run (this=<optimized out>)
    at ../../../ipc/chromium/src/base/task.h:307
#9  0xb4ee45c8 in Run (this=<optimized out>) at ../../dist/include/mozilla/ipc/MessageChannel.h:376
#10 mozilla::ipc::MessageChannel::DequeueTask::Run (this=<optimized out>) at ../../dist/include/mozilla/ipc/MessageChannel.h:393
(In reply to Gregor Wagner [:gwagner] from comment #41)
> I still see the crash with the patch attached:

That is bug 959886.
Depends on: 959886
The patch in bug 959886 + this patch fix the crash for me!
Gregor, is this patch ready for review?
Flags: needinfo?(anygregor)
Attachment #8360050 - Flags: review?(bugs)
Flags: needinfo?(anygregor)
Comment on attachment 8360050 [details] [diff] [review]
956325.diff

I don't see how MapEventCoordinatesForChildProcess could 
cause anything bad, but MaybeForwardEventToRenderFrame might.
So move the if to be under MaybeForwardEventToRenderFrame.
Attachment #8360050 - Flags: review?(bugs) → review+
Whiteboard: [b2g-crash] → [b2g-crash], [systemsfe]
Target Milestone: --- → 1.3 C2/1.4 S2(17jan)
https://hg.mozilla.org/mozilla-central/rev/1737bda111ef
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Whiteboard: [b2g-crash], [systemsfe] → [b2g-crash], [systemsfe], [CR596211]
Whiteboard: [b2g-crash], [systemsfe], [CR596211] → [b2g-crash], [systemsfe], [CR 596211]
This crash still consistently reproduces on v1.3 (bp-4aa8f907-68a2-4458-9df7-dca512140117); so far I am unable to repro it on master.
I will test it next week, or someone else can try it too. We will probably have to reopen the bug.

Buri v1.3 
BuildID: 20140117004005
Gaia: a81ccdc53e45a6adeaae423e104e91bcc1e12b0e
Gecko: 2c033140eff4
Version: 28.0a2
Firmware Version: v1.2-device.cfg
(In reply to Natalya Kot [:nkot] from comment #49)
> this crash still consistently reproduces on v1.3
> (bp-4aa8f907-68a2-4458-9df7-dca512140117, so far unable to repro on master..
> will test it next week or if someone else can try it too, will probably have
> to reopen the bug
> [...]
> Gecko: 2c033140eff4

This Gecko revision is a descendant of the one for Gregor's patch on Aurora, which means the patch probably didn't fix this bug.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Did it include the fix for bug 959886? Both were needed to pass local testing in SF.
Gregor,

ni to get this on your radar.
Flags: needinfo?(anygregor)
(In reply to ben turner [:bent] (use the needinfo? flag!) from comment #51)
> Did it include the fix for bug 959886? Both were needed to pass local
> testing in SF.

Don't think so. That patch landed at 8:46 am PST on Friday, which our daily nightly 1.3 builds wouldn't have included. Looks like we need to retest this next week.

Going to reclose on that basis & flagging verifyme to verify the crash no longer reproduces in a build from next week.
Status: REOPENED → RESOLVED
Closed: 6 years ago
Flags: needinfo?(anygregor)
Resolution: --- → FIXED
Keywords: verifyme
Verified fixed. 
The crash does not reproduce anymore on 01/21 master and v1.3.

BuildID: 20140121040201
Gaia: e218d17ae7d01a81d48f833cd6fafb4e11b26cd8
Gecko: cdc0ab2c0cba
Version: 29.0a1

BuildID: 20140121004137
Gaia: 47049555282a9a01fb60d1e1421b57e2810c96f5
Gecko: 6f7dfe36ab6c
Version: 28.0a2

Firmware Version: v1.2-device.cfg
Status: RESOLVED → VERIFIED
Keywords: verifyme