Closed Bug 1004266 Opened 9 years ago Closed 9 years ago

[B2G][FTU] When signing into Gmail repeatedly tapping on the screen can force close the FTU

Categories

(Core :: IPC, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla32
blocking-b2g 1.3T+
Tracking Status
b2g-v1.3 --- affected
b2g-v1.3T --- fixed
b2g-v1.4 --- unaffected
b2g-v2.0 --- fixed

People

(Reporter: jschmitt, Assigned: kats)

References

Details

(Keywords: crash, Whiteboard: [b2g-crash][sprd316039])

Crash Data

Attachments

(4 files)

Attached file log.txt
Description:
On the authentication page of Gmail tapping repeatedly can force close FTU and bring the user to the Homescreen.

Repro Steps:
1) Update a Tarako to BuildID: 20140430014008
2) Reset phone to enable FTU
3) Proceed to the 'Select a Network' and connect to a Wi-Fi network
4) Proceed to 'Import Contacts' and select 'Gmail'
5) Input user name and password then select 'Sign in'
6) Once loaded, repeatedly tap on the screen around the words 'Have offline access'

Actual:
The FTU closes and user is taken to the Homescreen

Expected:
The FTU does not close

1.3t Environmental Variables:
Device: Tarako 1.3t
BuildID: 20140430014008
Gaia: f9b62bd1135f4edf8317fff1c2947b9d99438d79
Gecko: ba50f734b815
Version: 28.1
Firmware Version: sp6821a_gonk4.0_user.pac

Notes:
Repro frequency: 100%
See attached: logcat
Tested against 1.3 Buri build which the bug DOES repro and I was able to get a crash log for Buri, I was unable to receive a crash log for Tarako.

https://crash-stats.mozilla.com/report/index/5e97dabe-c1ea-478d-b021-005192140430

1.3 Environmental Variables:
Device: Buri 1.3 MOZ
BuildID: 20140429024001
Gaia: caab7a78f0c04913f48a3051259663ee85625906
Gecko: f84a8ffbc552
Version: 28.0
Firmware Version: v1.2-device.cfg
Looks like an Dom/IPC crash.  Can you check to see if you can repro in 2.0 on the flame or open c please?
Flags: needinfo?(jschmitt)
Keywords: crash
Whiteboard: [b2g-crash]
Crash Signature: [@ mozilla::dom::TabChild::RecvRealTouchEvent(mozilla::WidgetTouchEvent const&, mozilla::layers::ScrollableLayerGuid const&) ]
Is there a crash report message showing up on tarako when this bug reproduces?
(In reply to Jason Smith [:jsmith] from comment #3)
> Is there a crash report message showing up on tarako when this bug
> reproduces?

Taking ni for jschmitt, the crash report message is NOT showing up on Tarako, the FTE just force closes. 

The crash report msg can only be seen on Buri 1.3 when using the same STR.
Flags: needinfo?(jschmitt)
(In reply to Naoki Hirata :nhirata (please use needinfo instead of cc) from comment #2)
> Looks like an Dom/IPC crash.  Can you check to see if you can repro in 2.0
> on the flame or open c please?

I tested on 2.0 Open_C and the crash does not occur.

When using the same repro steps the Open_C froze on the FTU 'Import Contacts' page and I had to restart the device, but no crash or force close of the FTU when following the same repro steps.
We met this crash after land bug 1014272.
blocking-b2g: --- → 1.3T?
Whiteboard: [b2g-crash] → [b2g-crash][sprd316039]
Blocks: 1014272
We met the same crash backtrace after land bug 1014272.
Component: Gaia::First Time Experience → IPC
Product: Firefox OS → Core
Blocking as we have STR, :smaug, fabrice recommend that you may be able to help with some investigation here as this is in IPC area, any thoughts here ? (crash logs in comment #1)
blocking-b2g: 1.3T? → 1.3T+
Flags: needinfo?(bugs)
Any chance for a better stack trace?
https://crash-stats.mozilla.com/report/index/5e97dabe-c1ea-478d-b021-005192140430 is missing all the
interesting bits.

This looks like a null+offset crash, but hard to say what is actually null.
Flags: needinfo?(bugs)
Oh, there is some nested event loop handling. What is causing that. Some js spinning the loop?

This might be a regression from bug 950300. That certainly doesn't handle nested event loop spinning.
outerWindow for example can be null here http://hg.mozilla.org/releases/mozilla-b2g28_v1_3/rev/ce9035ea1b83#l2.42 when returning to the outermost loop.
Hi Olli, wonder if you are the right person to take this bug? Thanks
Flags: needinfo?(bugs)
Olli's at least somewhat on PTO this week.

kats and Vivien were also involved in bug 950300 which smaug thinks may have caused a regression here so flagging them for help.
Flags: needinfo?(bugs)
Flags: needinfo?(bugmail.mozilla)
Flags: needinfo?(21)
I agree with Olli's analysis in comment 10. I'm just not sure what the right fix. It seems wrong for the JS to be spinning the event loop and I feel like we should prevent that from happening. If we can't prevent that then maybe we need some IPC changes to not deliver new events until the previous one is done being processed? I feel like this might be a problem that affects more than just RecvRealTouchEvent.
Flags: needinfo?(bugmail.mozilla)
We have to assume that firing any DOM event could spin a nested event loop, right? And that any other events in the queue will run when we do unless they're blocked explicitly.
Any JS can spin event loop currently. Just use main thread sync XHR.
Ok. Well it doesn't make sense to dispatch a touch event while the JS code is still in the middle of processing the previous touch event, right? And I don't think we should be processing touch events out-of-order with other types of messages. For example the APZ code assumes that touch events and UpdateFrame messages (i.e. TabChild::RecvUpdateFrame) are processed in the order that they are dispatched from the parent process. What are our options here?
Assignee: nobody → overholt
Attached patch Blind patchSplinter Review
I can't test this easily since I don't have an active tarako tree. But based on the crash stack (a nsCOMPtr null deref) I'm pretty confident this is the problem.
Attachment #8431677 - Flags: review?(fabrice)
Assignee: overholt → bugmail.mozilla
(In reply to Olli Pettay [:smaug] (vacation-ish May 26-30) from comment #15)
> Any JS can spin event loop currently. Just use main thread sync XHR.

bent and I discussed this a bit. It looks like the sync XHR code explicitly blocks events from going through at [1]. The modal dialogs (alert/confirm/prompt) do something similar except they only block animations [2] but as far as I can tell all events go to the modal window anyway so it's not a problem.

The point is that I can't find any existing way that content JS will be delivered events in a re-entrant manner. I imagine that a bunch of web content would be busted if that happened. What makes the most sense to me is to install this event suppression in this case as well, and just discard events that are coming in while content is in the middle of handling something. In the case of touch events we would probably have to ensure event stream consistency (e.g. don't send touchmoves without a corresponding touchstart) but maybe nsEventStateManager does this already.

We'd also have to make RecvRealTouchEvent itself re-entrancy aware, so that it doesn't make a mistake when pulling the prevent-default status from the "first" touchmove event.

bent also suggested maybe all of this should live inside the event state manager. smaug, you know all this code best, any thoughts?

[1] http://mxr.mozilla.org/mozilla-central/source/content/base/src/nsXMLHttpRequest.cpp?rev=57b0932e2f06#3004
[2] http://mxr.mozilla.org/mozilla-central/source/dom/base/nsGlobalWindow.cpp#8558
Flags: needinfo?(21) → needinfo?(bugs)
Attached patch bug1004266.patchSplinter Review
rebased to v1.3t I think?  Kats, can you verify that I made the change to the right location please?
Flags: needinfo?(bugmail.mozilla)
This rebased patch looks correct to me, Naoki.
Flags: needinfo?(bugmail.mozilla)
sync XHR blocks event in presshell level. TabChild is way lower level thing, and event suppression
shouldn't affect to it.

Note, the outerWindow issue was just a guess. We need a real stack trace here.
Flags: needinfo?(bugs)
Or someone who can reproduce the issue needs to try the patch.
Yeah Naoki is trying it. I don't understand why event suppression wouldn't affect this - the TabChild dispatches the window via the widget, which should be going through the presshell lower down in the stack, no? So if the suppression happens in the presshell it should still have an effect there.
While suppression is on we still process more runnables in the event loop.
Right, as I understand it that's fine, kats just doesn't want the page to be getting more dom events... I think?
why does that matter here? If event loop spins, anything can happen (unless something is explicitly
disabled).
Attached image 2014-05-30-17-30-29.png
I applied the patch and I tried the steps.  Instead of crashing, I'm stuck at this screen.  Hitting X won't cancel out.  Since it's FTU you can't long tap on Home button to get out.  You have to reboot, which does put you back to the FTU.
I discussed this a bit with smaug on IRC today. According to him there are already cases where web content needs to deal with re-entrancy with respect to events, and I think this is probably one of those cases as well. The patch I provided fixes a gecko bug, but the stuff naoki described in comment 27 might need to be fixed in the web content.

I don't know exactly what the web content in this case is doing on touchmove events, but it's something that spins the event loop. Can we find out what DOM APIs it's invoking to make this happen? If it's something we expect to spin the event loop, then the rest of this bug needs to be fixed in the content (by handling re-entrancy). If it's some unexpected codepath that spins the event loop then there might be additional gecko bugs.
Andrew, do you know who can get a hold of the web content that is spinning the event loop on touch events? I think we need that to move ahead with this bug.
Flags: needinfo?(overholt)
The APIs that would spin the event loop include synchronous XHR, alert() and confirm() AFAIK.
Maybe Francisco, Fernando, Mike, or Gregor?
Flags: needinfo?(overholt)
Flags: needinfo?(mhenretty)
Flags: needinfo?(francisco)
Flags: needinfo?(fernando.campo)
Flags: needinfo?(anygregor)
I think Jose knows the importing parts.
Flags: needinfo?(anygregor) → needinfo?(jmcf)
showModalDialog() is the canonical spec'ed "spin-the-event-loop" case, but I don't know if
b2g supports that.
sync XHR should work too.
I didn't see any sync XHR, alert, confirm, or showModalDialog calls in the FTU or 'shared/pages/import/*'. Is there anything else we should look for?

Since this import is opening an overlay to google's oauth dialog, my guess would be that it's google's content that's causing the nested spin.
Flags: needinfo?(mhenretty)
(In reply to Michael Henretty [:mhenretty] from comment #34)
> I didn't see any sync XHR, alert, confirm, or showModalDialog calls in the
> FTU or 'shared/pages/import/*'. Is there anything else we should look for?

Are there any touch event listeners of any kind?

> Since this import is opening an overlay to google's oauth dialog, my guess
> would be that it's google's content that's causing the nested spin.

That's quite likely. Is it possible to get a hold of the source to the google dialog?
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #35)
> (In reply to Michael Henretty [:mhenretty] from comment #34)
> > I didn't see any sync XHR, alert, confirm, or showModalDialog calls in the
> > FTU or 'shared/pages/import/*'. Is there anything else we should look for?
> 
> Are there any touch event listeners of any kind?

I didn't see touchend, touchstart or touchmove in either FTU or the import page. We do use click though, but I'm not sure if this maps to a touch event in the platform or not.

The way the whole flow works is, we register a click handler on the "Gmail" import button. In this handler we post a message to an iframe which contains our oauth logic (still in gaia). This iframe (which has no visible DOM), then calls window.open with the url https://accounts.google.com/o/oauth2/auth?... for oauth. Once the flow is done on google's side, our iframe postMessages back the FTU.


> > Since this import is opening an overlay to google's oauth dialog, my guess
> > would be that it's google's content that's causing the nested spin.
> 
> That's quite likely. Is it possible to get a hold of the source to the
> google dialog?

Hrm, that's quite hard. I used FF desktop to try and view the source of one of an oauth flow [1], and the code is minified and obfuscated. I also tried re-constructing the url we use for the gmail import, but I can't figure our FxOS appID [2]. I think we will need someone who is familiar with this flow. Jose may know.



1). https://accounts.google.com/o/oauth2/auth?client_id=533832195920.apps.googleusercontent.com&redirect_uri=http%3A%2F%2Fbobyouku.ap01.aws.af.cm%2Ftestyoutube.php&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fyoutube&response_type=code&access_type=offline

2.) https://accounts.google.com/o/oauth2/auth?client_id=123456&response_type=token&approval_prompt=force&redirect_uri=http://www.mozilla.com&scope=wl.basic
 build/config/communications_services.json file contains the application id for Gmail.
Flags: needinfo?(jmcf)
Hi, I think that Michael and Jose provided all the info, nothing else that I can add, sorry folks :(
Flags: needinfo?(francisco)
Do you have the info you need here, kats?
Flags: needinfo?(fernando.campo) → needinfo?(bugmail.mozilla)
If the page in comment 38 is the one that I was looking for, then something else must be going on, because that page doesn't have any touch listeners at all, as far as I can tell. Either the stack in the crash report is wrong or the touch listener is somewhere else, and I have no idea where it might be.
Flags: needinfo?(bugmail.mozilla)
I'm going to do a local build of 1.3 on Hamachi to see if I can repro and debug this. If anybody has any other ideas please feel free to mention and/or try them.
I'm unable to reproduce on a local 1.3 build on Hamachi using the STR in comment 0. Also just to note the STR indicate that the crash happens on the screen *after* signing in, which is not the same page as the one in comment 38.

Also, if I hit "Decline" on that permissions screen, I end up in a state like what Naoki described in comment 27, so I think that is most likely a separate bug in the FTU.

I'm inclined to land the patch as-is to fix the crash and spin off another bug to deal with this "Stuck in the FTU" scenario (which may or may not need to be blocking, I don't know).
Comment on attachment 8431677 [details] [diff] [review]
Blind patch

Moving review since fabrice is away.
Attachment #8431677 - Flags: review?(fabrice) → review?(21)
Comment on attachment 8431677 [details] [diff] [review]
Blind patch

No need for (), but rs+
Attachment #8431677 - Flags: review?(21) → review+
Depends on: 1019957
https://hg.mozilla.org/mozilla-central/rev/02dc36c183a1
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla32
somebody help to uplift it to 1.3t?
Because we don' have the rights now.
Adding qawanted to see if this is reproducible on 1.4.
Keywords: qawanted
Unable to reproduce this issue on the latest 1.4 for Flame, Open C, and Buri.

Device: Flame 1.4
BuildID: 20140618000203
Gaia: 3bdd037ec1a11abebe16a5d7f6ff0d863e80bc07
Gecko: 523491fa3339
Version: 30.0 (1.4) 
Firmware Version: v121-2

Device: Open_C 1.4
BuildID: 20140618000203
Gaia: 3bdd037ec1a11abebe16a5d7f6ff0d863e80bc07
Gecko: 523491fa3339
Version: 30.0 (1.4) 
Firmware Version: P821A10V1.0.0B06_LOG_DL

Device: Buri 1.4
Build ID: 20140618000203
Gaia: 3bdd037ec1a11abebe16a5d7f6ff0d863e80bc07
Gecko: 523491fa3339
Version: 30.0 (1.4) 
Firmware Version: v1.2device.cfg
QA Whiteboard: [QAnalyst-Triage?]
Flags: needinfo?(jmitchell)
Keywords: qawanted
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #17)
> Created attachment 8431677 [details] [diff] [review]
> Blind patch
> 
> I can't test this easily since I don't have an active tarako tree. But based
> on the crash stack (a nsCOMPtr null deref) I'm pretty confident this is the
> problem.

Hi, would you please uplift this to 1.3t?
Flags: needinfo?(bugmail.mozilla)
thx
This fix is not in the mozilla-b2g30_v1_4 branch. Bhavana says QA is testing to determine whether the crash is actually reproducible on 1.4.
status-b2g-v1.4: --- → ?
Based on comment 49 and comment 50 that already happened.
QA Whiteboard: [QAnalyst-Triage?] → [QAnalyst-Triage+]
Flags: needinfo?(jmitchell)
You need to log in before you can comment on or make changes to this bug.