Crash in Messaging app while stability testing

RESOLVED FIXED in Firefox 40, Firefox OS v2.2

Status

()

--
critical
RESOLVED FIXED
4 years ago
4 years ago

People

(Reporter: ggrisco, Assigned: kats)

Tracking

({crash})

37 Branch
mozilla40
ARM
Gonk (Firefox OS)
crash
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(blocking-b2g:2.2+, firefox38 wontfix, firefox39 wontfix, firefox40 fixed, b2g-v2.1 unaffected, b2g-v2.2 fixed, b2g-master fixed)

Details

(Whiteboard: [b2g-crash][caf-crash 588][caf priority: p1][CR 812377], crash signature)

Attachments

(18 attachments, 1 obsolete attachment)

143.65 KB, text/plain
Details
449.71 KB, text/plain
Details
140.91 KB, text/plain
Details
463.53 KB, text/plain
Details
143.62 KB, text/plain
Details
483.84 KB, text/plain
Details
138.29 KB, text/plain
Details
485.62 KB, text/plain
Details
3.25 KB, patch
Details | Diff | Splinter Review
103.19 KB, text/plain
Details
376.01 KB, text/plain
Details
102.57 KB, text/plain
Details
393.48 KB, text/plain
Details
Fix
1.68 KB, patch
mchang
: review+
Details | Diff | Splinter Review
127.83 KB, text/plain
Details
500.60 KB, text/plain
Details
99.52 KB, text/plain
Details
484.91 KB, text/plain
Details
(Reporter)

Description

4 years ago
Test steps:
1. Goto Messages app
2. Type the message and the number to which it has to be sent.
3. Long press the send message icon
4. You will find message just crashed pop-up.

I'm not sure how repeatable this is.  We've seen it twice so far on latest build.
Cafbot will upload logs.
We especially need the stack trace...

Updated

4 years ago
Whiteboard: [CR 812377] → [caf priority: p1][CR 812377]

Updated

4 years ago
Whiteboard: [caf priority: p1][CR 812377] → [b2g-crash][caf-crash 588][caf priority: p1][CR 812377]

Updated

4 years ago
Keywords: crash
Created attachment 8582684 [details]
EXTRA file attachment - AU_LINUX_GECKO_LF.BR.1.2.3.00.00.00.000.110
Created attachment 8582685 [details]
decoded minidump - AU_LINUX_GECKO_LF.BR.1.2.3.00.00.00.000.110
Hey Kats, does this make some sense to you?
Flags: needinfo?(bugmail.mozilla)
Comms Triage: We would like to know if this crash can be reproduced on a Flame before making a call on the blocker status.
From the stack trace and crash address it looks like a vsync is happening and we are trying to dispatch move events after the GeckTouchDispatcher has been freed, maybe. Definitely a platform bug anyway, so moving components. I can look into it but off the top of my head I'm not how we could run into this.
Assignee: nobody → bugmail.mozilla
Component: Gaia::SMS → Widget: Gonk
Flags: needinfo?(bugmail.mozilla)
Product: Firefox OS → Core
Version: unspecified → 37 Branch
What device is this on? It seems odd from the stack trace that this is using the SoftwareDisplay rather than the GonkDisplay, which means we fell back to software vsync somehow.  Is this on an L device?
(Reporter)

Comment 9

4 years ago
(In reply to Mason Chang [:mchang] from comment #8)

> software vsync somehow.  Is this on an L device?

This is happening on L-based 8909.
I couldn't crash on the Nexus L; it has a different chipset.  I think it depends on the chipset?  I can let mason borrow my Nexus L; though if we can't crash with it, we'll need a device that has that chipset.  Otherwise we (Mozilla Dev/QA)'re blocked completely from looking at this issue.
Created attachment 8583212 [details]
EXTRA file attachment -
Created attachment 8583213 [details]
decoded minidump -
(Reporter)

Comment 13

4 years ago
Test team found the same crash by downloading "Sprint Club Nitro" from Marketplace and playing the game.  I just had cafbot attach the logs for it.  Hope that helps.
Does this still occur without the custom patches in comment 2?
So the line at [1] is the one that's supposedly crashing. That this line is running at all means mResamplingEnabled is false, because of the check on line 174. But if mResamplingEnabled is false and we survived the early return on line 95, that means gfxPrefs::HardwareVsyncEnabled() must be false. So how are we getting vsync notifications? Something here doesn't add up....

http://git.mozilla.org/?p=releases/gecko.git;a=blob;f=widget/gonk/GeckoTouchDispatcher.cpp;h=2146c41ad52bd31490af9050b632896726c8cd65;hb=c36e800ae5e5e7e790129e18edaef202d8d3dbff#l193
(Reporter)

Comment 16

4 years ago
(In reply to Mason Chang [:mchang] from comment #14)
> Does this still occur without the custom patches in comment 2?

It was observed on the same exact build, so yes.
What's also confusing is that the stack trace says we're calling DispatchTouchMoves from the display object. We should be calling it from the Compositor thread - http://git.mozilla.org/?p=releases/gecko.git;a=blob;f=gfx/layers/ipc/CompositorParent.cpp;h=f5f664f660a5653debb12994606dc62c67922609;hb=c36e800ae5e5e7e790129e18edaef202d8d3dbff#l340
I think the SoftwareDisplay symbol is bogus. I've seen issues before where the symbol names for tasks aren't correct. I think what's happening is that is resampling is actually enabled, and we're going through the path at http://mxr.mozilla.org/mozilla-central/source/widget/gonk/GeckoTouchDispatcher.cpp?rev=49bbfe7887d5#132
err I meant resampling is actually disabled.
Created attachment 8583265 [details] [diff] [review]
Logging patch

Can you guys run with this patch and post logs if you hit it again? Also you should be able to unapply the "MONKEY" logging patch from bug 1139469 since we already fixed that. I see output from that patch in both of the logs posted to this bug.
For comment 20.
Flags: needinfo?(ntroast)
The logging patch was added to our builds. cafbot should comment when monkey crashes again.
Flags: needinfo?(ntroast)
Created attachment 8583657 [details]
EXTRA file attachment - AU_LINUX_GECKO_LF.BR.1.2.3.00.00.00.000.113
Created attachment 8583658 [details]
decoded minidump - AU_LINUX_GECKO_LF.BR.1.2.3.00.00.00.000.113
This latest dump doesn't have any of the new logging I added, and it still has the old compositor loop logging which I asked to be removed.

Updated

4 years ago
Flags: needinfo?(ggrisco)
(Reporter)

Comment 27

4 years ago
The change didn't make it in time to be built in AU113.  It should come in next AU though.
Flags: needinfo?(ggrisco)
Sorry about that it should be included in AU 114+
Created attachment 8584480 [details]
EXTRA file attachment - AU_LINUX_GECKO_LF.BR.1.2.3.00.00.00.000.114
Created attachment 8584481 [details]
decoded minidump - AU_LINUX_GECKO_LF.BR.1.2.3.00.00.00.000.114
This log does have my new logging output, and it indicates that resampling is disabled and confirms what I said in comment 18-19. However it refutes my hypothesis that the GeckoTouchDispatcher was getting destroyed. So I'm not sure what the problem here is, will need to come up with a new hypothesis.
My new hypothesis is that we're getting bad input into the GeckoTouchDispatcher. I think this can happen if we get a end/cancel event with one touch point, followed immediately by a move event (which is bad input). In this scenario, the end/cancel event will go through [1] on the libui thread, and then await processing on the controller thread. Let's say that while the task is awaiting processing, we get the move event, which puts the input into the mTouchMoveEvents at [2] and also dispatches a task to the controller thread a few lines down. Now the first task (end/cancel event) runs, which clears out the mTouchMoveEvents at [3]. Then the move task runs which tries to grab an item from mTouchMoveEvents (which is empty) at [4].

I'll add more logging to see if I can confirm this hypothesis.

[1] http://mxr.mozilla.org/mozilla-central/source/widget/gonk/GeckoTouchDispatcher.cpp?rev=49bbfe7887d5#139
[2] http://mxr.mozilla.org/mozilla-central/source/widget/gonk/GeckoTouchDispatcher.cpp?rev=49bbfe7887d5#126
[3] http://mxr.mozilla.org/mozilla-central/source/widget/gonk/GeckoTouchDispatcher.cpp?rev=49bbfe7887d5#329
[4] http://mxr.mozilla.org/mozilla-central/source/widget/gonk/GeckoTouchDispatcher.cpp?rev=49bbfe7887d5#195
Created attachment 8584512 [details] [diff] [review]
Logging patch v2

Updated logging patch to test hypothesis in previous comment. Please apply this in place of the previous one. Thanks!
Attachment #8583265 - Attachment is obsolete: true
(Reporter)

Updated

4 years ago
Flags: needinfo?(ntroast)
The new patch was uploaded by ggrisco. It may not make it into the next AU, but it should be coming.
Flags: needinfo?(ntroast)
It's probably a dup but it might not be. Making a dependency for now.
Blocks: 1148636
Created attachment 8585025 [details]
EXTRA file attachment - AU_LINUX_GECKO_LF.BR.1.2.3.00.00.00.000.115
Created attachment 8585026 [details]
decoded minidump - AU_LINUX_GECKO_LF.BR.1.2.3.00.00.00.000.115
(Reporter)

Comment 41

4 years ago
(In reply to Nicholas Troast [:ntroast] from comment #35)
> The new patch was uploaded by ggrisco. It may not make it into the next AU,
> but it should be coming.

AU 115 still does not have the latest patch.  Should come in next AU.
Created attachment 8585161 [details]
EXTRA file attachment - AU_LINUX_GECKO_LF.BR.1.2.3.00.00.00.000.116
Created attachment 8585162 [details]
decoded minidump - AU_LINUX_GECKO_LF.BR.1.2.3.00.00.00.000.116
Ok, so what's happening here is that this sequence of inputs:

1) touchdown
2) touchup
3) touchdown
4) touchmove (possibly multiple)

arrives on the libui thread all at once, and then the stuff posted to the controller thread gets to run. When the DispatchTouchNonMoveEvent for (2) gets run, it clears out the move event queue, which includes the events from (4). Therefore when DispatchTouchMoveEvents for (4) runs the queue is empty and it craps out.

We already have code to deal with interleaved move and non-move events for the resampling case, we just need to use it for the non-resampling case as well.
Created attachment 8585440 [details] [diff] [review]
Fix

This should fix it. Can you apply this patch and see if you still have the same crash?
Flags: needinfo?(ntroast)
Attachment #8585440 - Flags: review?(mchang)
Comment on attachment 8585440 [details] [diff] [review]
Fix

Review of attachment 8585440 [details] [diff] [review]:
-----------------------------------------------------------------

nit: Can we please add an assert that mTouchMoveEvents.IsEmpty() here - https://dxr.mozilla.org/mozilla-central/source/widget/gonk/GeckoTouchDispatcher.cpp?from=GeckoTouchDispatcher.cpp&case=true#172
Attachment #8585440 - Flags: review?(mchang) → review+
https://hg.mozilla.org/mozilla-central/rev/8d9394e77702
Status: NEW → RESOLVED
Last Resolved: 4 years ago
status-firefox40: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla39

Updated

4 years ago
blocking-b2g: 2.2? → 2.2+
Depends on: 1149412
Thanks for the patch !
Just a reminder that we'll need an approval, Kats :)
Flags: needinfo?(bugmail.mozilla)
Yeah I'll do that once we have verification it fixes the issue and the regressions are taken care of.
Flags: needinfo?(bugmail.mozilla)
Wes, shouldn't the TM be mozilla40?
Flags: needinfo?(wkocher)
Yes it should. Fixing, but leaving ni to wes as an FYI.
Target Milestone: mozilla39 → mozilla40
I was going to wait for this patch to land, but since you are waiting for our test results to approve then I will go ahead and add this patch to our patch tree.
Flags: needinfo?(ntroast)
Thanks. Please also add the patch from bug 1149412 on top (or just take out the assert from this patch).
(Reporter)

Updated

4 years ago
No longer depends on: 1149412
Ugh, mcMerge hadn't picked up the version bump when I ran that, and I forgot to manually change them.
Flags: needinfo?(wkocher)

Comment 58

4 years ago
ignore
Created attachment 8586297 [details]
EXTRA file attachment - AU_LINUX_GECKO_LF.BR.1.2.3.00.00.00.000.117

Comment 59

4 years ago
ignore
Created attachment 8586299 [details]
decoded minidump - AU_LINUX_GECKO_LF.BR.1.2.3.00.00.00.000.117
The above AU does not include the fix. It may not come in until AU 119.
Also, when you are ready to approve the patch then please also approve the patch from bug 1149412 for v2.2

Comment 62

4 years ago
ignore
Created attachment 8586733 [details]
EXTRA file attachment - AU_LINUX_GECKO_LF.BR.1.2.3.00.00.00.000.118
Created attachment 8586734 [details]
decoded minidump - AU_LINUX_GECKO_LF.BR.1.2.3.00.00.00.000.118

Updated

4 years ago
Attachment #8585440 - Flags: approval-mozilla-b2g37?

Updated

4 years ago
Flags: needinfo?(bbajaj)
Comment on attachment 8585440 [details] [diff] [review]
Fix

Generally the patch author requests uplift approval as a risk assessment of the uplift needs to be made. I am still waiting for some reasonable confirmation that this fixes the issue reported (in this case that a build with the patch has been running for a few days without crashing).
Flags: needinfo?(bbajaj)
Attachment #8585440 - Flags: approval-mozilla-b2g37?
(Reporter)

Comment 65

4 years ago
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #64)
> Comment on attachment 8585440 [details] [diff] [review]
> Fix
> 
> Generally the patch author requests uplift approval as a risk assessment of
> the uplift needs to be made. I am still waiting for some reasonable
> confirmation that this fixes the issue reported (in this case that a build
> with the patch has been running for a few days without crashing).

Thanks for the explanation :)  Should we ni? the patch author with the information when it's available?  I'm afraid we delay the uplift since being aware of non-crashes is a little more hard to track from our side.

For this bug we're fairly certain that it is fixed since we have a considerable amount of time testing on AU119 without having seen it again.
(In reply to Greg Grisco from comment #65)
> For this bug we're fairly certain that it is fixed since we have a
> considerable amount of time testing on AU119 without having seen it again.

That's all I needed to hear, thanks! :)
Comment on attachment 8585440 [details] [diff] [review]
Fix

NOTE: Please see https://wiki.mozilla.org/Release_Management/B2G_Landing to better understand the B2G approval process and landings.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): touch input resampling
User impact if declined: possible crashes when touch resampling is disabled (requires flipping a pref to non-default value)
Testing completed: see previous comment
Risk to taking this patch (and alternatives if risky): low risk, but we should take bug 1149412 also.
String or UUID changes made by this patch: none
Attachment #8585440 - Flags: approval-mozilla-b2g37?
status-b2g-v2.1: --- → unaffected
status-b2g-v2.2: --- → affected
status-b2g-master: --- → fixed

Updated

4 years ago
Attachment #8585440 - Flags: approval-mozilla-b2g37? → approval-mozilla-b2g37+

Updated

4 years ago
See Also: → bug 1150146

Updated

4 years ago
Duplicate of this bug: 1148636
https://hg.mozilla.org/releases/mozilla-b2g37_v2_2/rev/9b2ba3dfc006
status-b2g-v2.2: affected → fixed
status-firefox38: --- → wontfix
status-firefox39: --- → wontfix
You need to log in before you can comment on or make changes to this bug.