Closed Bug 1146987 Opened 9 years ago Closed 9 years ago

Crash in Messaging app while stability testing

Categories

(Core Graveyard :: Widget: Gonk, defect)

37 Branch
ARM
Gonk (Firefox OS)
defect
Not set
critical

Tracking

(blocking-b2g:2.2+, firefox38 wontfix, firefox39 wontfix, firefox40 fixed, b2g-v2.1 unaffected, b2g-v2.2 fixed, b2g-master fixed)

RESOLVED FIXED
mozilla40
blocking-b2g 2.2+
Tracking Status
firefox38 --- wontfix
firefox39 --- wontfix
firefox40 --- fixed
b2g-v2.1 --- unaffected
b2g-v2.2 --- fixed
b2g-master --- fixed

People

(Reporter: ggrisco, Assigned: kats)

References

Details

(Keywords: crash, Whiteboard: [b2g-crash][caf-crash 588][caf priority: p1][CR 812377])

Crash Data

Attachments

(18 files, 1 obsolete file)

143.65 KB, text/plain
Details
449.71 KB, text/plain
Details
140.91 KB, text/plain
Details
463.53 KB, text/plain
Details
143.62 KB, text/plain
Details
483.84 KB, text/plain
Details
138.29 KB, text/plain
Details
485.62 KB, text/plain
Details
3.25 KB, patch
Details | Diff | Splinter Review
103.19 KB, text/plain
Details
376.01 KB, text/plain
Details
102.57 KB, text/plain
Details
393.48 KB, text/plain
Details
1.68 KB, patch
mchang
: review+
Details | Diff | Splinter Review
127.83 KB, text/plain
Details
500.60 KB, text/plain
Details
99.52 KB, text/plain
Details
484.91 KB, text/plain
Details
Test steps:
1. Goto Messages app
2. Type the message and the number to which it has to be sent.
3. Long press the send message icon
4. You will find message just crashed pop-up.

I'm not sure how repeatable this is.  We've seen it twice so far on latest build.
Cafbot will upload logs.
We especially need the stack trace...
Whiteboard: [CR 812377] → [caf priority: p1][CR 812377]
Whiteboard: [caf priority: p1][CR 812377] → [b2g-crash][caf-crash 588][caf priority: p1][CR 812377]
Keywords: crash
Hey Kats, does this make some sense to you?
Flags: needinfo?(bugmail.mozilla)
Comms Triage: We would like to know if this crash can be reproduced on a Flame before making a call on the blocker status.
From the stack trace and crash address it looks like a vsync is happening and we are trying to dispatch move events after the GeckTouchDispatcher has been freed, maybe. Definitely a platform bug anyway, so moving components. I can look into it but off the top of my head I'm not how we could run into this.
Assignee: nobody → bugmail.mozilla
Component: Gaia::SMS → Widget: Gonk
Flags: needinfo?(bugmail.mozilla)
Product: Firefox OS → Core
Version: unspecified → 37 Branch
What device is this on? It seems odd from the stack trace that this is using the SoftwareDisplay rather than the GonkDisplay, which means we fell back to software vsync somehow.  Is this on an L device?
(In reply to Mason Chang [:mchang] from comment #8)

> software vsync somehow.  Is this on an L device?

This is happening on L-based 8909.
I couldn't crash on the Nexus L; it has a different chipset.  I think it depends on the chipset?  I can let mason borrow my Nexus L; though if we can't crash with it, we'll need a device that has that chipset.  Otherwise we (Mozilla Dev/QA)'re blocked completely from looking at this issue.
Test team found the same crash by downloading "Sprint Club Nitro" from Marketplace and playing the game.  I just had cafbot attach the logs for it.  Hope that helps.
Does this still occur without the custom patches in comment 2?
So the line at [1] is the one that's supposedly crashing. That this line is running at all means mResamplingEnabled is false, because of the check on line 174. But if mResamplingEnabled is false and we survived the early return on line 95, that means gfxPrefs::HardwareVsyncEnabled() must be false. So how are we getting vsync notifications? Something here doesn't add up....

http://git.mozilla.org/?p=releases/gecko.git;a=blob;f=widget/gonk/GeckoTouchDispatcher.cpp;h=2146c41ad52bd31490af9050b632896726c8cd65;hb=c36e800ae5e5e7e790129e18edaef202d8d3dbff#l193
(In reply to Mason Chang [:mchang] from comment #14)
> Does this still occur without the custom patches in comment 2?

It was observed on the same exact build, so yes.
What's also confusing is that the stack trace says we're calling DispatchTouchMoves from the display object. We should be calling it from the Compositor thread - http://git.mozilla.org/?p=releases/gecko.git;a=blob;f=gfx/layers/ipc/CompositorParent.cpp;h=f5f664f660a5653debb12994606dc62c67922609;hb=c36e800ae5e5e7e790129e18edaef202d8d3dbff#l340
I think the SoftwareDisplay symbol is bogus. I've seen issues before where the symbol names for tasks aren't correct. I think what's happening is that is resampling is actually enabled, and we're going through the path at http://mxr.mozilla.org/mozilla-central/source/widget/gonk/GeckoTouchDispatcher.cpp?rev=49bbfe7887d5#132
err I meant resampling is actually disabled.
Attached patch Logging patch (obsolete) — Splinter Review
Can you guys run with this patch and post logs if you hit it again? Also you should be able to unapply the "MONKEY" logging patch from bug 1139469 since we already fixed that. I see output from that patch in both of the logs posted to this bug.
For comment 20.
Flags: needinfo?(ntroast)
The logging patch was added to our builds. cafbot should comment when monkey crashes again.
Flags: needinfo?(ntroast)
This latest dump doesn't have any of the new logging I added, and it still has the old compositor loop logging which I asked to be removed.
Flags: needinfo?(ggrisco)
The change didn't make it in time to be built in AU113.  It should come in next AU though.
Flags: needinfo?(ggrisco)
Sorry about that it should be included in AU 114+
This log does have my new logging output, and it indicates that resampling is disabled and confirms what I said in comment 18-19. However it refutes my hypothesis that the GeckoTouchDispatcher was getting destroyed. So I'm not sure what the problem here is, will need to come up with a new hypothesis.
My new hypothesis is that we're getting bad input into the GeckoTouchDispatcher. I think this can happen if we get a end/cancel event with one touch point, followed immediately by a move event (which is bad input). In this scenario, the end/cancel event will go through [1] on the libui thread, and then await processing on the controller thread. Let's say that while the task is awaiting processing, we get the move event, which puts the input into the mTouchMoveEvents at [2] and also dispatches a task to the controller thread a few lines down. Now the first task (end/cancel event) runs, which clears out the mTouchMoveEvents at [3]. Then the move task runs which tries to grab an item from mTouchMoveEvents (which is empty) at [4].

I'll add more logging to see if I can confirm this hypothesis.

[1] http://mxr.mozilla.org/mozilla-central/source/widget/gonk/GeckoTouchDispatcher.cpp?rev=49bbfe7887d5#139
[2] http://mxr.mozilla.org/mozilla-central/source/widget/gonk/GeckoTouchDispatcher.cpp?rev=49bbfe7887d5#126
[3] http://mxr.mozilla.org/mozilla-central/source/widget/gonk/GeckoTouchDispatcher.cpp?rev=49bbfe7887d5#329
[4] http://mxr.mozilla.org/mozilla-central/source/widget/gonk/GeckoTouchDispatcher.cpp?rev=49bbfe7887d5#195
Attached patch Logging patch v2Splinter Review
Updated logging patch to test hypothesis in previous comment. Please apply this in place of the previous one. Thanks!
Attachment #8583265 - Attachment is obsolete: true
Flags: needinfo?(ntroast)
The new patch was uploaded by ggrisco. It may not make it into the next AU, but it should be coming.
Flags: needinfo?(ntroast)
It's probably a dup but it might not be. Making a dependency for now.
Blocks: 1148636
(In reply to Nicholas Troast [:ntroast] from comment #35)
> The new patch was uploaded by ggrisco. It may not make it into the next AU,
> but it should be coming.

AU 115 still does not have the latest patch.  Should come in next AU.
Ok, so what's happening here is that this sequence of inputs:

1) touchdown
2) touchup
3) touchdown
4) touchmove (possibly multiple)

arrives on the libui thread all at once, and then the stuff posted to the controller thread gets to run. When the DispatchTouchNonMoveEvent for (2) gets run, it clears out the move event queue, which includes the events from (4). Therefore when DispatchTouchMoveEvents for (4) runs the queue is empty and it craps out.

We already have code to deal with interleaved move and non-move events for the resampling case, we just need to use it for the non-resampling case as well.
Attached patch FixSplinter Review
This should fix it. Can you apply this patch and see if you still have the same crash?
Flags: needinfo?(ntroast)
Attachment #8585440 - Flags: review?(mchang)
Comment on attachment 8585440 [details] [diff] [review]
Fix

Review of attachment 8585440 [details] [diff] [review]:
-----------------------------------------------------------------

nit: Can we please add an assert that mTouchMoveEvents.IsEmpty() here - https://dxr.mozilla.org/mozilla-central/source/widget/gonk/GeckoTouchDispatcher.cpp?from=GeckoTouchDispatcher.cpp&case=true#172
Attachment #8585440 - Flags: review?(mchang) → review+
https://hg.mozilla.org/mozilla-central/rev/8d9394e77702
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla39
blocking-b2g: 2.2? → 2.2+
Depends on: 1149412
Thanks for the patch !
Just a reminder that we'll need an approval, Kats :)
Flags: needinfo?(bugmail.mozilla)
Yeah I'll do that once we have verification it fixes the issue and the regressions are taken care of.
Flags: needinfo?(bugmail.mozilla)
Wes, shouldn't the TM be mozilla40?
Flags: needinfo?(wkocher)
Yes it should. Fixing, but leaving ni to wes as an FYI.
Target Milestone: mozilla39 → mozilla40
I was going to wait for this patch to land, but since you are waiting for our test results to approve then I will go ahead and add this patch to our patch tree.
Flags: needinfo?(ntroast)
Thanks. Please also add the patch from bug 1149412 on top (or just take out the assert from this patch).
No longer depends on: 1149412
Ugh, mcMerge hadn't picked up the version bump when I ran that, and I forgot to manually change them.
Flags: needinfo?(wkocher)
The above AU does not include the fix. It may not come in until AU 119.
Also, when you are ready to approve the patch then please also approve the patch from bug 1149412 for v2.2
Attachment #8585440 - Flags: approval-mozilla-b2g37?
Flags: needinfo?(bbajaj)
Comment on attachment 8585440 [details] [diff] [review]
Fix

Generally the patch author requests uplift approval as a risk assessment of the uplift needs to be made. I am still waiting for some reasonable confirmation that this fixes the issue reported (in this case that a build with the patch has been running for a few days without crashing).
Flags: needinfo?(bbajaj)
Attachment #8585440 - Flags: approval-mozilla-b2g37?
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #64)
> Comment on attachment 8585440 [details] [diff] [review]
> Fix
> 
> Generally the patch author requests uplift approval as a risk assessment of
> the uplift needs to be made. I am still waiting for some reasonable
> confirmation that this fixes the issue reported (in this case that a build
> with the patch has been running for a few days without crashing).

Thanks for the explanation :)  Should we ni? the patch author with the information when it's available?  I'm afraid we delay the uplift since being aware of non-crashes is a little more hard to track from our side.

For this bug we're fairly certain that it is fixed since we have a considerable amount of time testing on AU119 without having seen it again.
(In reply to Greg Grisco from comment #65)
> For this bug we're fairly certain that it is fixed since we have a
> considerable amount of time testing on AU119 without having seen it again.

That's all I needed to hear, thanks! :)
Comment on attachment 8585440 [details] [diff] [review]
Fix

NOTE: Please see https://wiki.mozilla.org/Release_Management/B2G_Landing to better understand the B2G approval process and landings.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): touch input resampling
User impact if declined: possible crashes when touch resampling is disabled (requires flipping a pref to non-default value)
Testing completed: see previous comment
Risk to taking this patch (and alternatives if risky): low risk, but we should take bug 1149412 also.
String or UUID changes made by this patch: none
Attachment #8585440 - Flags: approval-mozilla-b2g37?
Attachment #8585440 - Flags: approval-mozilla-b2g37? → approval-mozilla-b2g37+
See Also: → 1150146
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: