Closed Bug 1151082 Opened 9 years ago Closed 9 years ago

[Flame] Request-sync API goes into an infinite loop in the main process if pending messages gets backlogged

Categories

(Core :: DOM: Core & HTML, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

()

VERIFIED FIXED
mozilla40
blocking-b2g 2.2+
Tracking Status
firefox38 --- wontfix
firefox39 --- wontfix
firefox40 --- fixed
b2g-v2.1 --- unaffected
b2g-v2.2 --- verified
b2g-master --- verified

People

(Reporter: onelson, Assigned: baku)

References

()

Details

(Keywords: regression, Whiteboard: [3.0-Daily-Testing])

Attachments

(3 files)

+++ This bug was initially created as a clone of Bug #1122276 +++

Description:
If the user attempts to add a Gmail account to the email app after another email account has previously been added, the device will entirely lock up after signing in, through the setup UI.
   
Repro Steps:
1) Update a Flame device to BuildID: 20150403010203
2) Connect to a WiFi or Data network
3) Launch the Email app for the first time
4) Sign into a new account, gmail, and progress to the inbox
5) Tap the menu icon to open the Drawer and select the Gear icon
6) Select 'Add account'
7) Sign into another gmail account and select 'Next'
8) Change the time to update inbox multiple times.
  
Actual:
The touch screen, and hardware buttons are unresponsive once the Gmail log-in page loads
  
Expected: 
User still has full device functionality after the Gmail log-in page loads

  
Repro frequency: 5/5
See attached: 
  video:
      previous- http://youtu.be/fZfFfGv_hH8
      today- https://youtu.be/r5lnCF65ck8
  logcat


Environmental Variables:
--------------------------------------------------

Device: Flame 3.0
BuildID: 20150403010203
Gaia: 7969b367a7da62877c3a24a26d3cb5fda89d766c
Gecko: 70a113676b21
Gonk: b83fc73de7b64594cd74b33e498bf08332b5d87b
Version: 40.0a1 (3.0) 
Firmware Version: v18D-1
User Agent: Mozilla/5.0 (Mobile; rv:40.0) Gecko/40.0 Firefox/40.0
--------------------------------------------------

This issue DOES occur on Flame 2.2
The touch screen, and hardware buttons are unresponsive once the Gmail log-in page loads

Device: Flame 2.2
BuildID: 20150403002503
Gaia: 022eeb91197ba4a9adfd67bd6db5aa03cc69eb31
Gecko: 77fdc6fbcae9
Gonk: ebad7da532429a6f5efadc00bf6ad8a41288a429
Version: 37.0 (2.2) 
Firmware Version: v18D-1
User Agent: Mozilla/5.0 (Mobile; rv:37.0) Gecko/37.0 Firefox/37.0
--------------------------------------------------

This issue does NOT occur on Flame 2.1
User still has full device functionality after the Gmail log-in page loads

Device: Flame 2.1
BuildID: 20150403001204
Gaia: 4e7d880defb83c25b49462479c02b3e64d21f421
Gecko: e76f5e7a264d
Gonk: a814b2e2dfdda7140cb3a357617dc4fbb1435e76
Version: 34.0 (2.1)
Firmware: V18D-1
User Agent: Mozilla/5.0 (Mobile; rv:34.0) Gecko/34.0 Firefox/34.0
QA Whiteboard: [QAnalyst-Triage?]
Flags: needinfo?(pbylenga)
Severe functional regression that hardlocks phone.

Requesting a window.
blocking-b2g: --- → 2.2?
QA Whiteboard: [QAnalyst-Triage?]
Flags: needinfo?(pbylenga)
QA Contact: jmercado
The log indicates that the email app was still responding to the presses even after the UI failed to update.

If I had to guess, it would be that the graphics subsystem got wedged.  The only suspicious thing in the log that jumps out at me are some "[Parent][MessageChannel] Error: Channel error: cannot send/recv" messages.  I suppose a queue somewhere could have filled up and become blocking?  But that's just a wild guess.

I would expect that this is probably impacting the phone across the board, so I'm moving this to a more generic component where there might be other similar bugs.  I'm not familiar with what the stability component is actually used for, however, so I could be very wrong.  Feel free to move somewhere else again.  (Just not back to email ;)  (I am staying on the cc list in case email is a factor somehow.)
Component: Gaia::E-Mail → Stability
Summary: [Email] Signing in two consecutive Gmail accounts will freeze, lock up the device. → [Flame] Screen stops updating when rapidly switching <select> options after doing various email stuff
Er, and specifically, for those going log-diving, the log entry that seems to correspond to the UI no longer responding is:
04-03 14:41:04.311  1734  1734 I GeckoDump: LOG: sync interval changed to 1800000
I can no longer reproduce this issue on the latest central tinderbox build.  Oliver can you check the next Nightly build and verify if this is still happening?

Issue DOES occur

Environmental Variables:
Device: Flame 3.0
BuildID: 20150402033937
Gaia: f37be8b44cb7c3a147b9615ab76743b760f08eeb
Gecko: cf8864126c58
Gonk: b83fc73de7b64594cd74b33e498bf08332b5d87b
Version: 40.0a1 (3.0) 
Firmware Version: v18D-1
User Agent: Mozilla/5.0 (Mobile; rv:40.0) Gecko/40.0 Firefox/40.0

Issue does NOT occur

Environmental Variables:
Device: Flame 3.0
BuildID: 20150405145459
Gaia: ef61ebbe5de8c2c9fc2a8f74a12455044c3b82e9
Gecko: 4fe763cbe196
Gonk: b83fc73de7b64594cd74b33e498bf08332b5d87b
Version: 40.0a1 (3.0) 
Firmware Version: v18D-1
User Agent: Mozilla/5.0 (Mobile; rv:40.0) Gecko/40.0 Firefox/40.0
QA Whiteboard: [QAnalyst-Triage?]
Flags: needinfo?(onelson)
Flags: needinfo?(ktucker)
QA Whiteboard: [QAnalyst-Triage?] → [QAnalyst-Triage-]
Flags: needinfo?(ktucker)
Please re-nom if this happens on 2.2 after further testing.
blocking-b2g: 2.2? → ---
Issue REPROES in latest nightly for 3.0 and 2.2 flame devices:

Device: Flame 3.0
BuildID: 20150407010204
Gaia: c710bac533b76635161315bf907d004e000549cb
Gecko: ab0490972e1e
Gonk: b83fc73de7b64594cd74b33e498bf08332b5d87b
Version: 40.0a1 (3.0) 
Firmware Version: v18D-1
User Agent: Mozilla/5.0 (Mobile; rv:40.0) Gecko/40.0 Firefox/40.0

Device: Flame 2.2
BuildID: 20150407002501
Gaia: 5e09637414269728f6f1bc0152d0160f3b6b380e
Gecko: 245f37f44017
Gonk: ebad7da532429a6f5efadc00bf6ad8a41288a429
Version: 37.0 (2.2) 
Firmware Version: v18D-1
User Agent: Mozilla/5.0 (Mobile; rv:37.0) Gecko/37.0 Firefox/37.0
QA Whiteboard: [QAnalyst-Triage-] → [QAnalyst-Triage?]
Flags: needinfo?(onelson) → needinfo?(pbylenga)
Adding qawanted to check this again on 2.2 per request.
I was also able to reproduce this issue on the latest 2.2 Flame build.  Of note, I had to add a third email account before I could get this issue to reproduce.  So if it does not seem to be reproducing with the second try completing it and adding more.

Environmental Variables:
Device: Flame 2.2
BuildID: 20150407002501
Gaia: 5e09637414269728f6f1bc0152d0160f3b6b380e
Gecko: 245f37f44017
Gonk: ebad7da532429a6f5efadc00bf6ad8a41288a429
Version: 37.0 (2.2) 
Firmware Version: v18D-1
User Agent: Mozilla/5.0 (Mobile; rv:37.0) Gecko/37.0 Firefox/37.0
Flags: needinfo?(ktucker)
Keywords: qawanted
Nominating 2.2? based on Comment 5 since we could reproduce this on 2.2.
blocking-b2g: --- → 2.2?
QA Whiteboard: [QAnalyst-Triage?] → [QAnalyst-Triage+]
Flags: needinfo?(ktucker)
Flags: needinfo?(pbylenga)
Bug 1098289 seems to have caused this issue

Central Regression Window:

Last Working 
Environmental Variables:
Device: Flame 3.0
BuildID: 20150211132949
Gaia: e280a660955bbdab265d50f8d9e009de34082332
Gecko: 650863f6987b
Version: 38.0a1 (3.0) 
Firmware Version: v18D-1
User Agent: Mozilla/5.0 (Mobile; rv:38.0) Gecko/38.0 Firefox/38.0

First Broken 
Environmental Variables:
Device: Flame 3.0
BuildID: 20150211174239
Gaia: d5a71cedb37dd45f439f672489db3994b349ac43
Gecko: 3778ff309988
Version: 38.0a1 (3.0) 
Firmware Version: v18D-1
User Agent: Mozilla/5.0 (Mobile; rv:38.0) Gecko/38.0 Firefox/38.0

Last Working gaia / First Broken gecko - Issue does NOT occur
Gaia: e280a660955bbdab265d50f8d9e009de34082332
Gecko: 3778ff309988

First Broken gaia / Last Working gecko - Issue DOES occur
Gaia: d5a71cedb37dd45f439f672489db3994b349ac43
Gecko: 650863f6987b

Gaia Pushlog: https://github.com/mozilla-b2g/gaia/compare/e280a660955bbdab265d50f8d9e009de34082332...d5a71cedb37dd45f439f672489db3994b349ac43
QA Whiteboard: [QAnalyst-Triage+] → [QAnalyst-Triage?]
Flags: needinfo?(ktucker)
James, can you take a look at this please? This might have been caused by the work done for bug 1098289.
Blocks: 1098289
QA Whiteboard: [QAnalyst-Triage?] → [QAnalyst-Triage+]
Flags: needinfo?(ktucker) → needinfo?(jrburke)
triage: regression breaking basic function
blocking-b2g: 2.2? → 2.2+
Component: Stability → Gaia::E-Mail
I reproduced and attached gdb.  We're ending up in a tight-loop in some JITted code in the main thread.  Unfortunately gdb can't figure out how to unwind the stack so I'm not sure if it's email or the request-sync logic, but I'd guess email since any loops should happen in the main process.  I'll take a quick dive into the code and maybe instrument a little if it's not obvious.
Oh, whoops, but it was the main process that freaked out and that I attached gdb to, not the email process.  So I'm guessing request-sync after all.  Which makes sense from what we saw from the provided logs and how email was still able to timeout its IMAP connection.
Okay, yeah, this is a requestsync bug.  pendingOperationDone has the following loop inside it:

    while (this._pendingMessages.length) {
      this.receiveMessage(this._pendingMessages.shift());
    }

This assumes that the receiveMessage call always consumes at least one entry from the list.  But it doesn't because:

    // We cannot process this request now.
    if (this._pendingOperation) {
      this._pendingMessages.push(aMessage);
      return;
    }

And you'd think that's safe because pendingOperationDone does the following right before its loop:

    this._pendingOperation = false;

But of course, if receiveMessage ever does anything that sets that flag, infinite loops happen.  I turned on the debug in the file and indeed we see a pattern like this:

04-09 06:50:40.799 I/Gecko   ( 5921): DEBUG RequestSyncService: pendingOperationDone
04-09 06:50:40.799 I/Gecko   ( 5921): DEBUG RequestSyncService: receiveMessage
04-09 06:50:40.809 I/Gecko   ( 5921): DEBUG RequestSyncService: unregister
04-09 06:50:40.809 I/Gecko   ( 5921): DEBUG RequestSyncService: removeRegistration
04-09 06:50:40.809 I/Gecko   ( 5921): DEBUG RequestSyncService: removeRegistrationInternal
04-09 06:50:40.809 I/Gecko   ( 5921): DEBUG RequestSyncService: dbTxn
04-09 06:50:40.809 I/Gecko   ( 5921): DEBUG RequestSyncService: pendingOperationStarted
04-09 06:50:40.809 I/Gecko   ( 5921): DEBUG RequestSyncService: receiveMessage
04-09 06:50:40.809 I/Gecko   ( 5921): DEBUG RequestSyncService: receiveMessage * infinity

I'm going to attach a larger log excerpt now too in case a larger debug trace proves useful for better understanding the shenanigans email is getting up to in its use of the API.

Note that I'm just moving this bug to Core::DOM because the bug is in requestsync, the steps to reproduce hold, and I don't have the ability to mark a new bug in Core::DOM as a blocker.  (And last time we left a non-actionable blocker in email dependent on a platform blocker, everyone got confused.)
Component: Gaia::E-Mail → DOM
Flags: needinfo?(jrburke) → needinfo?(amarchesini)
Product: Firefox OS → Core
Summary: [Flame] Screen stops updating when rapidly switching <select> options after doing various email stuff → [Flame] Request-sync API goes into an infinite loop in the main process if pending messages gets backlogged
Note that I think all that's required for email is to have 2 mail accounts configured for periodic sync and then to just keep changing the select value aggressively until the RequestSyncService falls behind enough to backlog and hit the infinite loop case.  No need to ever add a 3rd account, etc.
Andrew, thanks for the debugging!
Assignee: nobody → amarchesini
Flags: needinfo?(amarchesini)
Attachment #8590226 - Flags: review?(ehsan)
Attachment #8590226 - Flags: review?(ehsan) → review+
Thanks for the fast turnaround on the patch and the test!  And major thanks to :oliverthor, :JMercado, :PBylenga, :KTucker and friends for uncovering this and doing the invaluable bisection.
https://hg.mozilla.org/mozilla-central/rev/0097fbcadabc
Status: NEW → RESOLVED
Closed: 9 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla40
Since the email app in b2g 2.2 uses requestsync, this also seems to need approval-mozilla-b2g37? set on it to get it landed for that branch?
Flags: needinfo?(amarchesini)
Comment on attachment 8590226 [details] [diff] [review]
requestSync.patch

NOTE: Please see https://wiki.mozilla.org/Release_Management/B2G_Landing to better understand the B2G approval process and landings.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): 1018320
User impact if declined: a infinite loop
Testing completed: mochitest is included
Risk to taking this patch (and alternatives if risky): no risks
String or UUID changes made by this patch: none
Flags: needinfo?(amarchesini)
Attachment #8590226 - Flags: approval-mozilla-b2g37?
Attachment #8590226 - Flags: approval-mozilla-b2g37? → approval-mozilla-b2g37+
This issue seems to be fixed in the latest 3.0 and 2.2 builds.

Actual Results: The phone does not freeze up when rapidly switching the update time.

Environmental Variables:
Device: Flame 3.0
BuildID: 20150417010203
Gaia: 3cd0a9facce26c2acc7be3755a17131a6358e33f
Gecko: 51e3cb11a258
Gonk: b83fc73de7b64594cd74b33e498bf08332b5d87b
Version: 40.0a1 (3.0) 
Firmware Version: v18D-1
User Agent: Mozilla/5.0 (Mobile; rv:40.0) Gecko/40.0 Firefox/40.0

Environmental Variables:
Device: Flame 2.2
BuildID: 20150417002504
Gaia: d50b8a3919a7b4d8d289f150d3b9bed704ebafa9
Gecko: 5ebf32030512
Gonk: ebad7da532429a6f5efadc00bf6ad8a41288a429
Version: 37.0 (2.2) 
Firmware Version: v18D-1
User Agent: Mozilla/5.0 (Mobile; rv:37.0) Gecko/37.0 Firefox/37.0
Status: RESOLVED → VERIFIED
QA Whiteboard: [QAnalyst-Triage+] → [QAnalyst-Triage?]
Flags: needinfo?(ktucker)
QA Whiteboard: [QAnalyst-Triage?] → [QAnalyst-Triage+]
Flags: needinfo?(ktucker)
Component: DOM → DOM: Core & HTML
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: