Email app fails to fully launch due to worker JS runtime problem (reported as "InternalError: too much recursion") breaking startup process and resulting in unusable UI or black screen

RESOLVED FIXED in 2.2 S8 (20mar)

Status

defect
RESOLVED FIXED
4 years ago
4 years ago

People

(Reporter: viorela, Assigned: asuth)

Tracking

({qablocker})

unspecified
2.2 S8 (20mar)
ARM
Gonk (Firefox OS)

Firefox Tracking Flags

(blocking-b2g:2.2+, b2g-v2.0 ?, b2g-v2.1 fixed, b2g-v2.2 fixed, b2g-master fixed)

Details

(Whiteboard: [fromAutomation],[2.2-nexus-5-l])

Attachments

(3 attachments)

(Reporter)

Description

4 years ago
Posted file logcat.txt
Email tests are failing intermittently because the Email app doesn't load after launching: http://jenkins1.qa.scl3.mozilla.com/view/UI/job/flame-kk-319.mozilla-central.ui.functional.smoke/242/HTML_Report/
There is no email account set up, so the new account page should be displayed after launching the Email app.

I was able to reproduce the failure locally, by running test_IMAP_email_notification.py several times. 
Reproduction rate with automation: 2/10
I couldn't reproduce the issue manually.

Prerequisites:
No email account is set up

#STR from the test:
1. Connect to network
2. Launch Email app

#Expected results:
Email app is loaded, new account page is displayed

#Actual results:
Email app is not loaded, white screen is displayed

I'll attach a video of the issue
(Reporter)

Updated

4 years ago
Summary: Email app fails to load after launch → Email app fails to launch
(Reporter)

Comment 2

4 years ago
Two email tests failed in the latest v2.2 build because of this issue:
http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.mozilla-central.ui.functional.smoke/243/HTML_Report/

Build info: 
Device firmware (base) 	L1TC10011880
Device firmware (date) 	07 Jan 2015 01:44:12
Device firmware (incremental) 	eng.cltbld.20150107.044401
Device firmware (release) 	4.4.2
Device identifier 	flame
Gaia date 	06 Jan 2015 03:02:13
Gaia revision 	69ac77cfa938
Gecko build 	20150107010216
Gecko revision 	33781a3a5201
Gecko version 	37.0a1

:asuth, can you please take a look? I also attached a logcat of the issue. Thanks!
Flags: needinfo?(bugmail)
(Reporter)

Updated

4 years ago
Keywords: smoketest
This is the mysterious stack depth problem observed in bug 1115039.  Investigating at top priority now.
Assignee: nobody → bugmail
Status: NEW → ASSIGNED
Flags: needinfo?(bugmail)
Summary: Email app fails to launch → Email app fails to launch due to apparent platform stack depth problem ("InternalError: too much recursion")
:asuth, do you consider this to be a smoketest blocker?
blocking-b2g: --- → 2.2?
Component: Gaia::UI Tests → Gaia::E-Mail
Flags: needinfo?(bugmail)
(In reply to Parul Mathur [:pragmatic] from comment #4)
> :asuth, do you consider this to be a smoketest blocker?

Maybe?  In this specific case, I think marionette and other automated testing infrastructure may be on the stack, which is not something a user would experience.  If a human user using a nightly build can't get the email app to launch, that 100% sounds like a smoketest blocker.

There is clearly a serious problem going on here which is at least a 2.2+ blocker.  But it's also worth noting, as far as I can tell, there's no single obvious thing to back out at this stage.  I expect the fix may involve some combination of adjusting platform stacks, providing feedback to the JS team, finding smoking guns about nested event loops, and maybe changing the behaviour of the JS module loader or microtask-queue operation.
Flags: needinfo?(bugmail)
(In reply to Andrew Sutherland [:asuth] from comment #5)
> (In reply to Parul Mathur [:pragmatic] from comment #4)
> > :asuth, do you consider this to be a smoketest blocker?
> 
> Maybe?  In this specific case, I think marionette and other automated
> testing infrastructure may be on the stack, which is not something a user
> would experience.  If a human user using a nightly build can't get the email
> app to launch, that 100% sounds like a smoketest blocker.

Worth noting that there are two criteria for a priority blocker as concerns smoketests: the bug causes an error for the user, or it blocks execution of the test. If the automation was executing successfully and now is not, it'd hit the second criterion.
:tchung, would you like to comment on whether this is a smoke test blocker?
Flags: needinfo?(tchung)
I talked to Parul in person to get more context. 

Since this didn't reproduce manually, I don't think it's a smoketest blocker in the sense of getting the smoketest tag.

However, we have the "qablocker" concept too, which is a bug that blocks execution of an existing automated (or manual, if it's a severe enough bug) suite. 

This is documented on the Firefox OS bug-filing page on MDN as "fix by next beta or RC", but that's old wording based on the Bugzilla keyword list, not the agreed-upon SLA from when we started using it in FxOS.

The actual SLA is to fix by the next scheduled iteration of that test, or ASAP otherwise. Failing that, we have to start testing that area manually instead.

https://developer.mozilla.org/en-US/Firefox_OS/Developing_Firefox_OS/Filing_bugs_against_Firefox_OS

So, this would be a qablocker for someone (either Gaia or Marionette, depending on where the problem lies), assuming that the problem isn't with the test itself. 

Since it's a qablocker on a smoketest, and in fact fouls all tests that launch email (a whole area of testing), it's roughly the same priority as a smoketest blocker.
Flagging the bug as 'qablocker' to indicate that it's causing test automation failures.
Keywords: smoketest → qablocker
blocking-b2g: 2.2? → 2.2+
(In reply to Parul Mathur [:pragmatic] from comment #7)
> :tchung, would you like to comment on whether this is a smoke test blocker?

Commented in IRC; I agree with the assessment that this is a 2.2+ bug to be fixed, but not a smoketest blocker.
Flags: needinfo?(tchung)
gecko git hash is 99d746b5e45ca1aed8a290fd126594183e717a86

No analysis yet; want to make sure this gets saved since it took several tries to get this to happen.
(Assignee)

Updated

4 years ago
Depends on: 1119157
Right, so investigation/analysis suggests there is no stack depth problem and this is a worker runtime initialization problem.  Best guess right now is that failure is probabilistic based on randomly issued memory addresses and perhaps some race factors and that the nuwa process could factor in heavily.  Rebooting devices or entirely restarting b2g may change address layout and help or not help.

We'll see what the DOM team says!
Summary: Email app fails to launch due to apparent platform stack depth problem ("InternalError: too much recursion") → Email app fails to fully launch due to apparent worker JS runtime problem (reported as "InternalError: too much recursion") breaking startup process
(Assignee)

Updated

4 years ago
Duplicate of this bug: 1115039
(In reply to Andrew Sutherland [:asuth] from comment #12)
> Right, so investigation/analysis suggests there is no stack depth problem
> and this is a worker runtime initialization problem.  Best guess right now
> is that failure is probabilistic based on randomly issued memory addresses
> and perhaps some race factors and that the nuwa process could factor in
> heavily.  Rebooting devices or entirely restarting b2g may change address
> layout and help or not help.
> 
> We'll see what the DOM team says!

Thanks for chasing this! I was kind of afraid it might be a layout/race issue based on the sparse repro rate and the fact that it popped up out of nowhere even though the stack issue was apparently pre-existing.

In the meantime, we'll keep an eye on the automation. Hopefully this'll remain uncommon enough that we can just deal with very occasional glitches while it gets looked at.
Whiteboard: [from automation]
Whiteboard: [from automation] → [fromAutomation]
Posted file email.log
Experienced this a couple of times today
(Assignee)

Updated

4 years ago
Duplicate of this bug: 1126204
https://bugzilla.mozilla.org/show_bug.cgi?id=1126204#c4 indicated this reproduced on v2.1 and v2.2.  v2.0 may also be affected since https://bugzilla.mozilla.org/show_bug.cgi?id=1126204#c0 indicated it was.  However, there wasn't log confirmation and v2.0 email is so different from v2.1 email that it's possible that's just a different problem.

Also note that I've made an update on bug 1119157 and this is definitely a platform bug.  It's not clear if mitigation would be possible by repeatedly spawning workers until we get one that doesn't die.  I'd prefer to just have the platform fix.
status-b2g-v2.0: --- → ?
Summary: Email app fails to fully launch due to apparent worker JS runtime problem (reported as "InternalError: too much recursion") breaking startup process → Email app fails to fully launch due to worker JS runtime problem (reported as "InternalError: too much recursion") breaking startup process
I'm unassigning to indicate that no more work/investigation is going on here in the email component.  We're fully depending on the platform issue bug 1119157 being fixed (by someone who is not me).

We're leaving this open for tracking/duping-to purposes.
Status: ASSIGNED → NEW
Andrew, thanks for investigating!
(Assignee)

Updated

4 years ago
Duplicate of this bug: 1138852
Note the possibility of a black screen if the fast-cache is not able to reuse the existing cached HTML.

Also note that on bug 1138852 it seems this bug is now reproducing at an extremely high rate.  We may need to implement our proposed mitigation where we keep creating workers until one of them doesn't die.
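
For illustration only, a minimal sketch of what the "keep creating workers until one of them doesn't die" mitigation could look like. This is not code from the email app. `ReadyWorker`, `SpawnFn`, `spawnUntilAlive`, and the attempt limit are all hypothetical names, and the sketch assumes the caller can supply a `spawn` function that resolves once the worker runtime has initialized and rejects if the worker dies during startup.

```typescript
// Hypothetical handle for a worker that survived runtime initialization.
interface ReadyWorker {
  id: number;
}

// Assumed caller-supplied factory: resolves once the worker reports it is
// alive, rejects if the worker dies during startup (for example with
// "InternalError: too much recursion").
type SpawnFn = (attempt: number) => Promise<ReadyWorker>;

// Keep spawning fresh workers until one survives or the attempts run out.
async function spawnUntilAlive(
  spawn: SpawnFn,
  maxAttempts: number = 5
): Promise<ReadyWorker> {
  let lastError: unknown = new Error("maxAttempts must be at least 1");
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await spawn(attempt);
    } catch (err) {
      // This worker died during startup; remember why and try a fresh one.
      lastError = err;
    }
  }
  // Every attempt failed: surface the most recent failure to the caller.
  throw lastError;
}
```

The attempt cap is a design choice so that a device where every worker dies fails visibly instead of looping forever. A mitigation like this would only mask the underlying platform bug tracked in bug 1119157.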
Summary: Email app fails to fully launch due to worker JS runtime problem (reported as "InternalError: too much recursion") breaking startup process → Email app fails to fully launch due to worker JS runtime problem (reported as "InternalError: too much recursion") breaking startup process and resulting in unusable UI or black screen
Whiteboard: [fromAutomation] → [fromAutomation],[2.2-nexus-5-l]
(Assignee)

Updated

4 years ago
Duplicate of this bug: 1138852
The fix for bug 1119157 is now on trunk.  Marking this as fixed by the fix for 1119157.  Any verification assistance that can be provided on bug 1119157 will probably be appreciated by releng when approving the uplift.
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
(It might also make sense to just dupe this bug at this point.  The argument for not duping is to avoid cluttering up that bug with email specific discussion if it appears that the fix did not address the problem for email.)
Target Milestone: --- → 2.2 S8 (20mar)