Closed Bug 950673 Opened 6 years ago Closed 5 years ago

150+ ms launch regression in gallery, video, camera, email_ftu (Dec 10-13)

Categories

(Firefox OS Graveyard :: Gaia::System::Window Mgmt, defect, P1)

ARM
Gonk (Firefox OS)
defect

Tracking

(blocking-b2g:2.0+, b2g-v1.4 affected, b2g-v2.0 fixed)

RESOLVED FIXED
2.0 S2 (23may)
blocking-b2g 2.0+
Tracking Status
b2g-v1.4 --- affected
b2g-v2.0 --- fixed

People

(Reporter: bkelly, Assigned: alive)

References

Details

(Keywords: perf, regression, Whiteboard: [c=progress p= s= u=2.0] QARegressExclude)

Attachments

(12 files, 3 obsolete files)

239 bytes, text/plain
Details
46 bytes, text/x-github-pull-request
Details | Review
3.68 MB, text/plain
Details
2.33 MB, text/plain
Details
612.89 KB, application/json
Details
665.71 KB, application/json
Details
49.73 KB, image/png
Details
80.77 KB, image/png
Details
46 bytes, text/x-github-pull-request
etienne
: review+
Details | Review
46 bytes, text/x-github-pull-request
etienne
: feedback+
Details | Review
46 bytes, text/x-github-pull-request
alive
: review+
Details | Review
202.88 KB, text/plain
Details
Between Dec 10 and Dec 13 we had a launch regression in multiple apps.  Cold launch times increased 150+ms for gallery, video, camera, and email_ftu:

  http://mzl.la/1fybZMm

Unfortunately b2gperf was broken for a couple days during this window, so the regression range is larger than usual.

  https://github.com/mozilla-b2g/gaia/compare/a98cf9fcd29a...1d574755c04196
  http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=df82be9d89a5&tochange=8b5875dc7e31
Whiteboard: [c=automation p= s= u-] → [c=automation p= s= u=]
Related to the NUWA landing? Bug 930282?
(In reply to Gregor Wagner [:gwagner] from comment #1)
> Related to the NUWA landing? Bug 930282?

That was backed out.
Whiteboard: [c=automation p= s= u=] → [c=progress p= s= u=]
I think Ben already clarified in his first comment that getting a deeper regression window here isn't possible, as b2gperf was broken during part of the timeframe of that window.
(In reply to Jason Smith [:jsmith] from comment #3)
> I think Ben already clarified in his first comment that getting a deeper
> regression window here isn't possible, as b2gperf was broken during part of
> the timeframe of that window.

The automation config was broken, but it should be possible to bisect with local builds and b2gperf.  Just wanted to clarify.
(In reply to Ben Kelly [:bkelly] from comment #4)
> (In reply to Jason Smith [:jsmith] from comment #3)
> > I think Ben already clarified in his first comment that getting a deeper
> > regression window here isn't possible, as b2gperf was broken during part of
> > the timeframe of that window.
> 
> The automation config was broken, but it should be possible to bisect with
> local builds and b2gperf.  Just wanted to clarify.

Okay - readding keyword then
Whiteboard: [c=progress p= s= u=] → [c=progress p= s= u=] QARegressExclude
Assignee: nobody → jhylands
Whiteboard: [c=progress p= s= u=] QARegressExclude → [c=progress p=4 s= u=] QARegressExclude
Mike,

Do you see this as a priority in 1.4?
Flags: needinfo?(mlee)
Assignee: jhylands → dhuseby
Status: NEW → ASSIGNED
Yes. It's an immediate priority that we're addressing this sprint.
blocking-b2g: 1.4? → 1.4+
Flags: needinfo?(mlee)
Whiteboard: [c=progress p=4 s= u=] QARegressExclude → [c=progress p=4 s= u=1.4] QARegressExclude
This has a wide regression range and will take a while to nail down.  I'm punting to the next sprint because I won't get to it this week.
Depends on: 972075
No longer depends on: 972075
How are we doing here?

Moving this over to Runtime to get it out of the General component. This should probably be in a new "Performance" component.
Component: General → Runtime
Component: Runtime → Performance
PM triage: 1.4+
Let me see if I can narrow down the regression range today.
these are old builds and b2gperf has been updated since.  I'm struggling with getting all of the correct versions of python packages installed on my system so that I can run b2gperf v0.1.3 instead of the latest version.  this has me block for now.  I'm working to resolve it.
Attached file df82be9d89a5-requirements.txt (obsolete) —
pip requirements file for b2gperf to work with the oldest build in the regression range.
I figured out the sekrit recipe for making b2gperf work with gecko revision df82be9d89a5.  Attachment 8391573 [details] is a "requirements file" for the python module install tool pip.  You use it like so:

pip -Ir requirements.txt

The contents of the file tells pip which python modules and which version of each to install.  Once you do that, the b2gperf tool will work.
Here are my initial numbers for the two gecko revisions and the a98cf9f version of gaia.

df82be9d89a5  Results for Gallery, cold_load_time: median:494, mean:508, std: 64, max:851, min:482
8b5875dc7e31  Results for Gallery, cold_load_time: median:490, mean:501, std: 53, max:783, min:467

Since the numbers are nearly identical, I am concluding that this is a gaia regression rather than a gecko regression.
updated the requirements file.  gaiatest==0.20 and b2gperf==0.13 fail to test the earliest revision of gaia in the regression range.
Attachment #8391573 - Attachment is obsolete: true
Trying both revisions of Gecko and both revisions of Gaia to decide if the regression was in Gecko or Gaia:

Gecko: df82be9d89a5
Gaia: a98cf9f
Results for Gallery, cold_load_time: median:497, mean:508, std: 58, max:818, min:468

Gecko: 8b5875dc7e31
Gaia: a98cf9f
Results for Gallery, cold_load_time: median:486, mean:499, std: 54, max:786, min:470

Gecko: df82be9d89a5
Gaia: 1d57475
Results for Gallery, cold_load_time: median:654, mean:657, std: 26, max:785, min:624

Gecko: 8b5875dc7e31
Gaia: 1d57475
Results for Gallery, cold_load_time: median:642, mean:668, std: 112, max:1261, min:622

From these results, the regression is in Gaia.


Bisecting Gaia:

Gecko: df82be9d89a5
git bisect start 1d57475 a98cf9f

50af7ba
Results for Gallery, cold_load_time: median:614, mean:614, std: 17, max:667, min:586
git bisect bad

3b59454
Results for Gallery, cold_load_time: median:604, mean:717, std: 244, max:1270, min:583
git bisect bad

4a1663c
Results for Gallery, cold_load_time: median:606, mean:736, std: 265, max:1278, min:581
git bisect bad

0985280
Results for Gallery, cold_load_time: median:614, mean:785, std: 291, max:1283, min:588
git bisect bad

51857b6
Results for Gallery, cold_load_time: median:607, mean:761, std: 278, max:1293, min:582
git bisect bad

552c666
Results for Gallery, cold_load_time: median:493, mean:506, std: 55, max:799, min:483
git bisect good

a6c2af8
Results for Gallery, cold_load_time: median:618, mean:787, std: 285, max:1281, min:589
git bisect bad

f574915
Results for Gallery, cold_load_time: median:604, mean:755, std: 275, max:1281, min:582
git bisect bad

f574915022e3c2edcafb2a2be8230b95b9f2c7cb is the first bad commit
commit f574915022e3c2edcafb2a2be8230b95b9f2c7cb
Author: Alive Kuo <alegnadise@gmail.com>
Date:   Tue Oct 29 16:25:06 2013 +0800

    Bug 907013 - Implement AppWindowManager.
    
    * Disable 3 marionette js tests because marionette seems having trouble on finding out alerts. Will file follow up bugs to re-enable them.
    * Remove WindowManager and introduce AppWindowManager
    * Introduce LayoutManager who knows and gethers the sizing info well.
    * Fix OrientationManager issues.
    * Abandon TransitionMixin and "Re"write an AppTransitionController for the use case of edge gesture -- let appWindow support multiple transitions.
    * Abandon AuthenticationDialog and "Re"implement AppAuthenticationDialog
    * AppWindow: Introduce subcomponents under appWindow
    * AppWindow: Introduce broadcast for internal events, publish for external events.
    * AppWindow: Introduce event handling mechanism by registeredEvents.
    * Fix VisibilityManager issues.
    * Change the mechanism in python based gaia-apps.
    * More shared code between AppWindow and HomescreenWindow.
    * More shared code between AppWindow and ActivityWindow.
    * Add 100+ unit test functions.
    * Isolate window.css from system.css
    * Add a new event in KeyboardManager for continuos transition control.
    * Add a developer settings for enabling simultaneous animation control.
    * Add jsdoc support.
      http://alivedise.github.io/gaia-system-jsdoc/

:100644 100644 593cabcae22430f0d91da9f2ebe8aedd02e6c0fe f02862647e6cb58881f3cb57cfc08123881e91e7 M      Makefile
:040000 040000 f90c24487c896fb0a4ca2f8a492d0a0265acae52 c3baa383d6663cc6d6465e8be5a0b5fd36c61af9 M      apps
:040000 040000 7d1a8af027aecdbf0f7af40c0fcce6d5061d7e0f 18a505c78782980da5bc33457430e1aa5c2c78b7 M      build
:040000 040000 03cbc58b2102909f884f0006edceea0e6c1cfeb1 4c9f677c37fb3c9398d1384d5b28a08f22224573 M      shared
:040000 040000 908dfc651281f2ff23a888cd612130f4a673e3f9 9e3d53b3249a7f081bad31243b7b06510718dc1a M      tests
I'm digging into the revision to determine what exactly caused the regression.
I wish this patch was split into smaller patches instead of one monster commit.
Dave, if you find a regression please mark it as a dependency.
Component: Performance → Gaia::System::Window Mgmt
(In reply to Dave Huseby [:huseby] from comment #18)
> I'm digging into the revision to determine what exactly caused the
> regression.

And update about this?
Flags: needinfo?(dhuseby)
This regression is proving to be very difficult to pinpoint due to the immensity of the offending revision.  If nothing specific can be identified as the cause, this may be one of those cases where we accept the regression as part of a necessary improvement in features/design.

I'm not ready to concede that point yet.  I'm currently picking this apart using the profiler.  Nothing is obviously the problem...yet.
Flags: needinfo?(dhuseby)
I've identified a patch you made as causing a 150ms regression in launch times for several apps.  However your patch is rather large and touches lots of things.  Can you take make any recommendations on which parts of your patch might have negative performance impacts?  Or better yet, can help me break up your patch into smaller, discrete pieces that I can test individually?
Flags: needinfo?(alive)
Hi!

I have to say it's nearly impossible to break down the patch of bug 907013.
The purpose of bug 907013 is to split the so called monster module: WindowManager into self-managed components. It does everything in the same scope without proper and clean plan:
create mozbrowser iframe, render the iframe, transitioning the window, change the default background, kill the app, launch the inline activity, link the window disposition activity, deal with app visibility management, deal with app orientation management, deal with app sizing management.

The most tricky part is dealing with transitioning here, it's complex to describe shortly, but we are having trouble to do something correctly in the v1.3 window manager to control the visibility/orientation/sizing in the correct timing: before open, after open, before close, after close.

Please see https://docs.google.com/presentation/d/1mgrMsVPIyWUR0evcLy_iZ4oKgDtgkrYBGUrJPX3ZF1A/edit?usp=sharing for reference.

It took me a long time to figure out these rules to not cause problems, for instance, resizing timing should be 'before opening transitioning' + 'had been resized' + 'not being fit into current layout' or 'after opened'. App controlling is n-ary concept: Render, transition, orientation, layout, visibility. And the old window manager delt these badly in one main function and without a clear plan.

Another reference:
http://alivedise.github.io/gaia-system-jsdoc/AppWindow.html

The main thing would affect the socalled launch time might be:
The invention of app transition controller, a state machine to control the app transition and the other 3 dimension: orientation/layout/visibility. (https://github.com/mozilla-b2g/gaia/blob/master/apps/system/js/app_transition_controller.js) The state machine is using setTimeout to control the state changed event and DOM events to broadcast the event to other modules. I'd defined some inner events to be manipulated to do the transitioning stuff before the outer events so I am not sure it really matters.

I don't think anything is regressed here. The perf measurement itself is problematic because it relies on the window manager's event: 'apploadtime' and the timing is controlled just by it.

Apologize for I cannot help.
Flags: needinfo?(alive)
Depends on: 972075
Hub, Kevin suggested that I look at the b2gperf test and verify if this patch changed the timing of system app event that we use to measure app launch times.  It is possible that the app launch isn't actually taking more time but the system app firing happens later in the process, thus giving us a false positive for a launch time regression.  Can you help me figure out if the event firing moved?  First we need to know what event we hook into to stop the timer.  We could use some console output or eideticker to measure actual app launch times to see if they are the same before and after the bad revision.  That would confirm if the event firing moved instead of the app launch getting slower.
Flags: needinfo?(hub)
I'm trying to figure out if the rewrite of the window manager changed the timing between the app launch start and app launch end without actually slowing down app launches.  See comment 25 for more details.  I think the only thing that will give us an accurate determination is eideticker.  Can you help me get some eideticker data on the launch time of several of these apps before and after the window manager patch is added?

If the app launch times don't change in the eideticker data, then I'll know for sure that the launch time regression is a false positive.
Flags: needinfo?(wlachance)
This is a blocker. Can we back out the change and investigate?
I have an eideticker setup up here. If :huseby or alive can give me a set of builds to flash on the device, we should be able to measure whether the patch makes a user-perceivable difference.
Flags: needinfo?(wlachance)
Over lunch I tried backing out this change and the git revert caused 70 conflicts that required manual resolution.  This patch is HUGE and old with lots of code built on top of it.  It will be very difficult to back this out.  The plan is to do black box validation to verify if the app launches are really slower or if the launch and loadend events have moved relative to each other.  I've got a builds ready for flashing.  If the app launches really are slower, then this bug goes to :alive for fixing.
I tried to re-read v1.3 window_manager code and maybe there's one difference:
https://github.com/mozilla-b2g/gaia/blob/v1.3/apps/system/js/window_manager.js#L212

Since apploadtime depends on mozbrowserloadend event firing, and our device is single-core, the v1.3 and the older window manager did one hack here: do its best to not do anything before the iframe is loaded.

But this causes us much trouble on race condition because the launch time may or may not beyond of after the transition is ended and if the user opens/closes app quickly we will have a dead device.

I will try to remove some event handler before the app is loaded to see if this helps.

But again, it's because we are not real 'multi-task'. The performance measurement way is problematic to me.
(In reply to Alive Kuo [:alive][NEEDINFO!][God bless Taiwan.] from comment #30)
> I tried to re-read v1.3 window_manager code and maybe there's one difference:
> https://github.com/mozilla-b2g/gaia/blob/v1.3/apps/system/js/window_manager.
> js#L212
> 
> Since apploadtime depends on mozbrowserloadend event firing, and our device
> is single-core, the v1.3 and the older window manager did one hack here: do
> its best to not do anything before the iframe is loaded.
> 
> But this causes us much trouble on race condition because the launch time
> may or may not beyond of after the transition is ended and if the user
> opens/closes app quickly we will have a dead device.
> 
> I will try to remove some event handler before the app is loaded to see if
> this helps.

But if this raises another race condition regression (fire events based on mozbrowser iframe loaded event), I am curious who should be backout?
Confirmed that this commit caused a regression of hundreds of ms for app launches. Seems like we must do absolutely *nothing* during app launch. Reverting this seems insanely difficult, one option may be rolling back the entire system app to right before this patch, then reland patches on top of it.

In the following measurements, I measured from the time you tap an icon on the homescreen, till the same point inside of an app launch path.

Launch times in master: 879, 875
First bad revision: 847, 807
Last good revision: 587, 579

In this case gallery had close to a 300ms launch regression.
Assignee: dhuseby → alive
Flags: needinfo?(hub)
Blocks: 987994
One of possible way is try to remove all event handlers before any app is loaded.
Includes:
visibility/orientation/layout for homescreen app.

This may cause another regressions for example if the app is too late to launch homescreen will stay foreground always.
Hack to delay open event handler before the app is loaded
Comment on attachment 8400523 [details] [review]
https://github.com/mozilla-b2g/gaia/pull/17884

Let me know if this patch helps to reduce launch time.
Attachment #8400523 - Flags: feedback?(dhuseby)
Also I need some real profiling data to know what's really slowing. put ni.
If you are not the one to acquire profiling, please let me know who I should contact. Thanks.
Flags: needinfo?(dhuseby)
Attached file profile_captured.sym (obsolete) —
profiler capture of the latest v1.4 tip WITHOUT your patch.  load this into cleopatra to analyze it or go here: https://people.mozilla.org/~bgirard/cleopatra/#report=bbd8c70b63ba94750eefcbf0aedc1f69e8684526
profile capture of gaia commit abdabcc.  this is the commit before your patch.
Attachment #8404336 - Attachment is obsolete: true
I will be doing b2gperf timing tests of before and after your patch later this evening PST.
Thanks for help.
Unfortunately from the log I don't see useful information right now...
It's obvious before bug 907013, mozbrowserloadstart begins right away after the page is rendered.
But after bug 907013, mozbrowserloadstart is delayed.
Here are the b2gperf numbers for commit abdabcc (without patch):
Results for Settings, cold_load_time: median:1745, mean:1753, std: 41, max:1946, min:1698, all:1832
Flags: needinfo?(dhuseby)
Shoot, those numbers were for Settings.  I'll grab them for Gallery.
Here's Gallery times with the patch:
Results for Gallery, cold_load_time: median:1008, mean:1001, std: 43, max:1104, min:909

So even with the patch the numbers are actually worse.  Let me try the revision before the patch.
Here's Gallery times for the revision just before your patch:
Results for Gallery, cold_load_time: median:1117, mean:1121, std: 30, max:1200, min:1073

your patch makes things a little better but it's still takes nearly twice as long to launch now than when it did before the re-write.
Flags: needinfo?(alive)
Thanks for your profiling and help, my next step is using TaskTracer to see why main process is busy during the app launching, in this week.

Vivien and Etienne, I am not sure if you have some idea to improve or any clue here. To summarize, bug 907013 - AppWindowManager rewrite slows 150ms and recent bug 961800 - ChildWindowFactory slows another 300ms. I am still on the 150ms part but currently I have no idea other than removing the innerHeight/innerWidth check as the patch. I'd appreciate any help and hope we won't introduce bug 842627 which causes bug 844990 again. I wish.
Flags: needinfo?(etienne)
Flags: needinfo?(alive)
Flags: needinfo?(21)
I am using Isis TaskTracer to see task states.
Will append recorded two profile before and after bug 907013.

http://alivedise.github.io/isis/
Launch email: Task data before bug 907013 with logging.
Launch email: task data after bug 907013 with logging
(In reply to Alive Kuo [:alive][NEEDINFO!][God bless Taiwan.] from comment #47)
> Thanks for your profiling and help, my next step is using TaskTracer to see
> why main process is busy during the app launching, in this week.
> 
> Vivien and Etienne, I am not sure if you have some idea to improve or any
> clue here. 

If your patch yields good results I guess we could:

* listen to the _loaded broadcast in the AppTransitionController
* changeTransitionState to complete in animationend *only* if |app.loaded|
* changeTransitionState to complete in _loaded *only* if the animation ended
* make the openingtimeout longer (800ms?) for cold openings (when |!this.app.loaded| in _do_opening)

It would formalize that the app gets priority to load for *some* amount of time, but then things move forward in the system app.

What do you think? Don't have a strong opinion yet, posting here to keep the conversation going.
Flags: needinfo?(etienne)
Whiteboard: [c=progress p=4 s= u=1.4] QARegressExclude → [c=progress p= s= u=1.4] QARegressExclude
I made a patch on v1.4 to skip animationend event before the active app loads ended and enlarge the timeout to 800ms.
https://github.com/mozilla-b2g/gaia/pull/18368

This remove a long execution task in b2g main process, but the launch time of e-mail app is still about 17xx ms.
The launch time of it before bug 907013 is about 10xx ms.
I found there is a blur then focus event before loadend. The focus event in pre-907013 occur after loaded. My next try is remove this focus event.
Hmmm after testing the pull request has some improvement but I have trouble running test-perf...
Thanks Fred Lin's help I am able to run b2gperf on my device. Running..
I have trouble running b2gperf for commit before 907013...

Dave do you have time to test https://bugzilla.mozilla.org/attachment.cgi?id=8407467 for me?
It's against v1.4 branch.

I will try to fight with b2gperf again at the same time.
Flags: needinfo?(dhuseby)
Finally make it run on unagi.

[Before 907013]
2014-04-17 19:02:24,406 B2GPerfRunner INFO    | Results for Email, cold_load_time: median:471, mean:489, std: 37, max:611, min:458, all:611,554,494,493,469,466,497,458,479,459,477,465,464,552,542,459,464,486,538,466,469,468,563,471,462,470,471,480,481,469
[With PR]
2014-04-17 19:12:24,281 B2GPerfRunner INFO    | Results for Email, cold_load_time: median:444, mean:457, std: 30, max:557, min:428, all:557,444,512,481,443,438,444,443,443,501,444,440,463,451,460,508,436,452,434,444,521,454,440,452,442,437,446,434,429,428
[Latest v1.4]
2014-04-17 19:23:44,771 B2GPerfRunner INFO    | Results for Email, cold_load_time: median:619, mean:624, std: 29, max:728, min:577, all:728,638,604,627,588,608,610,618,636,604,628,598,648,672,620,620,607,603,620,678,616,601,644,655,616,577,623,624,610,604

Sounds a success :)
Flags: needinfo?(dhuseby)
Flags: needinfo?(21)
[Master with cherry-picked patch]
2014-04-17 19:56:25,183 B2GPerfRunner INFO    | Results for Email, cold_load_time: median:450, mean:459, std: 26, max:560, min:435, all:560,464,469,474,452,445,436,449,513,468,455,450,448,456,443,435,437,458,452,440,446,437,447,507,448,446,494,451,451,450

[Master]
2014-04-17 20:21:33,938 B2GPerfRunner INFO    | Results for Email, cold_load_time: median:922, mean:938, std: 51, max:1079, min:872, all:1079,931,936,948,887,904,925,1050,896,928,913,922,926,1049,932,919,911,872,911,922,908,944,1037,1019,905,921,909,932,911,910
(In reply to Alive Kuo [:alive][NEEDINFO!][God bless Taiwan.] from comment #60)
> [Master with cherry-picked patch]
> 2014-04-17 19:56:25,183 B2GPerfRunner INFO    | Results for Email,
> cold_load_time: median:450, mean:459, std: 26, max:560, min:435,
> all:560,464,469,474,452,445,436,449,513,468,455,450,448,456,443,435,437,458,
> 452,440,446,437,447,507,448,446,494,451,451,450
> 
> [Master]
> 2014-04-17 20:21:33,938 B2GPerfRunner INFO    | Results for Email,
> cold_load_time: median:922, mean:938, std: 51, max:1079, min:872,
> all:1079,931,936,948,887,904,925,1050,896,928,913,922,926,1049,932,919,911,
> 872,911,922,908,944,1037,1019,905,921,909,932,911,910

This means the patch fixes bug 987994 as well.
It also means we have even a ~30ms win, comparing to the build before bug 907013 landing.
Can you please send an email to dev-gaia explaining what happened and how you fixed it so others can learn from it? Thanks!
(In reply to Andreas Gal :gal from comment #64)
> Can you please send an email to dev-gaia explaining what happened and how
> you fixed it so others can learn from it? Thanks!

Sure.
Unfortunately there's one obvious regression: if we do opening/closing app quickly, homescreen will lose its opening transition and have no response at the moment the app is closed. Looking..
(In reply to Alive Kuo [:alive][NEEDINFO!][God bless Taiwan.] from comment #66)
> Unfortunately there's one obvious regression: if we do opening/closing app
> quickly, homescreen will lose its opening transition and have no response at
> the moment the app is closed. Looking..

Fixed. But found cardview is broken :/
Target Milestone: --- → 1.4 S6 (25apr)
Attached file patch for 1.4
Now I don't see any big problems. but to-cardview/from-cardview transition seems doesn't work now..

Some integration tests are failing only on travis but works on my side. Re-running..
Attachment #8408882 - Flags: review?(etienne)
(In reply to Alive Kuo [:alive][NEEDINFO!][God bless Taiwan.] from comment #68)
> Created attachment 8408882 [details] [review]
> patch for 1.4
> 
> Now I don't see any big problems. but to-cardview/from-cardview transition
> seems doesn't work now..
> 
> Some integration tests are failing only on travis but works on my side.
> Re-running..

Update: fix the failen tests and cardview transition.
Comment on attachment 8408882 [details] [review]
patch for 1.4

That's a lot of milliseconds gained per line of diff :) Kudos!
Tiny comments on github, r=me
Attachment #8408882 - Flags: review?(etienne) → review+
Attachment #8400523 - Flags: feedback?(dhuseby)
Attached file Patch for master (obsolete) —
Waiting travis to be green to be merged.
Some tests are failing, investigating...I guess it's timing issue.
No longer depends on: 972075
I am getting strange fail with marionette-js tests in v1.4 and master.
For v1.4, the output stops always in calendar app test.
https://travis-ci.org/mozilla-b2g/gaia/jobs/23643208
For master some tests are failing intermittently.
https://travis-ci.org/mozilla-b2g/gaia/jobs/23725053

Flag ni=evan to help.
Flags: needinfo?(evanxd)
Hi alive,
I'm investigating that.
Flags: needinfo?(evanxd)
Target Milestone: 1.4 S6 (25apr) → 2.0 S1 (9may)
Whiteboard: [c=progress p= s= u=1.4] QARegressExclude → [c=progress p= s= u=1.4] QARegressExclude [ETA: 4/29]
Status update: we may hit some unknown marionette-js (marionette-apps) bugs. It's having a wrong way to identify an app is already active to switchTo from long time ago but nobody found that. Evan is on it.
Hi all,

We are fixing the issues of marionette-apps in Bug 1003788. Even we fix the marionette-apps, we also need to fix intermittent failing tests listed on Bug 1003795. Because we will change the logic of swithToApp method in marionette-apps module, and all tests depend on it. Case by case, the change just causes some tests to be failed.

Here is one thing we could do to fix this blocking bug more quickly. We could fix the marionette-apps first, and disable all failing tests listed on Bug 1003795. And then fix the 11 tests listed on Bug 1003795.

How do you guys think?
The travis light is green(https://travis-ci.org/evanxd/gaia/builds/24085504) after we disable the tests listed on Bug 1003795.
It is also with the marionette-apps changes.
Let's move forward with commend 76. Evan what's the gaia change I need to have here for marionette-js?
We could fix the marionette-apps issue and disable the failing tests to resolve this bug. We could change `travis-manifest` and `tbpl-manifest` files to disable tests.
Depends on: 1003788
dependent bug 1003788 is fixed, are we able to close this bug?
Flags: needinfo?(evanxd)
Flags: needinfo?(alive)
Hi Howie,

Yes, we could use the updated marionette-apps module and disable the intermittent failing tests list with the https://github.com/evanxd/gaia/commit/55d1c80e76bfe276c38cf0f36d768898dd72318d commit. Then we could fix and close this bug. We will re-enable the disabled tests in Bug 1003795.

See the travis page https://travis-ci.org/evanxd/gaia/builds/25302358. The marionette_js job is happy now.
Flags: needinfo?(evanxd)
Blocks: 1003795
\o/ Let's move forward with this one, it will make fixing bug 1010604 *much easier*.
We need rebasing and retry the tests, coming soon!
Flags: needinfo?(alive)
Good/bad new is: ONLY 1 js fail and 2 py fails this time. Investigating..
Attachment #8410942 - Attachment is obsolete: true
(In reply to Alive Kuo [:alive][NEEDINFO!][5/11-5/15@MV] from comment #86)
> Created attachment 8424769 [details] [review]
> patch for master updated

Quick note, looks like we _always_ get a [w] launchtime event, due to the openingtimeout not being cleared (and at this time app.loaded is true). We may need to resetTransition on _loaded.
(In reply to Etienne Segonzac (:etienne) from comment #87)
> (In reply to Alive Kuo [:alive][NEEDINFO!][5/11-5/15@MV] from comment #86)
> > Created attachment 8424769 [details] [review]
> > patch for master updated
> 
> Quick note, looks like we _always_ get a [w] launchtime event, due to the
> openingtimeout not being cleared (and at this time app.loaded is true). We
> may need to resetTransition on _loaded.

Fixing. The travis is green for master and after this is addressed we are good to go.
For v1.4 still needs some tweaks.
Comment on attachment 8424769 [details] [review]
patch for master updated

Proposed change: check loaded state in opening handler to register opened handler.
Attachment #8424769 - Flags: feedback?(etienne)
Comment on attachment 8424769 [details] [review]
patch for master updated

yep this is the patch we need :)
Attachment #8424769 - Flags: feedback?(etienne) → feedback+
I just merge the master patch since it's all green. Mark as fixed and await Evan's investigation on v1.4.
master
https://github.com/mozilla-b2g/gaia/commit/62e4c86931593fcd913b91de19f76edb75078412
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Duplicate of this bug: 1010604
Target Milestone: 2.0 S1 (9may) → 2.0 S2 (23may)
Why did we just double the number of disabled tests?!
Hi Gareth,

We could refer to https://bugzilla.mozilla.org/show_bug.cgi?id=1003795#c8 to know the reasons. And we already a patch for re-enable the calendar tests at Bug 1011415.
https://github.com/mozilla-b2g/gaia/commit/652ef38a45ebb2e20411032fe8be28eb8954dcc5 backed out following extensive conversation here https://groups.google.com/forum/#!topic/mozilla.dev.gaia/VTJd_cI2kEw. My prescription for a path forward here is to back off of requiring the additional iframe "render" class for identifying page load and pursuing page load identification strategy specification here https://bugzilla.mozilla.org/show_bug.cgi?id=937659
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Looks like the marionette incompatibility comes from delaying the focus, not delaying the appopen.
I tested it [1] and ran a quick b2gperf:

Results for phone, cold_load_time: median:988, mean:997, std: 38, max:1089, min:959, all:1089,959,972,1008,970,1045,982,997,994,963 without the patch

Results for phone, cold_load_time: median:901, mean:916, std: 42, max:1043, min:890, all:1043,895,911,900,897,902,917,890,910,899 with the patch

This is still a significant gain, it will prevent future launch time regressions when we do stuff on appopen and Travis is happy [2].

Given the 1.4+ness of this patch I think it's our best bet, what do you think?

[1] https://github.com/mozilla-b2g/gaia/pull/19631
[2] https://travis-ci.org/mozilla-b2g/gaia/builds/26046023
Flags: needinfo?(alive)
(In reply to Etienne Segonzac (:etienne) from comment #96)
> Looks like the marionette incompatibility comes from delaying the focus, not
> delaying the appopen.
> I tested it [1] and ran a quick b2gperf:
> 
> Results for phone, cold_load_time: median:988, mean:997, std: 38, max:1089,
> min:959, all:1089,959,972,1008,970,1045,982,997,994,963 without the patch
> 
> Results for phone, cold_load_time: median:901, mean:916, std: 42, max:1043,
> min:890, all:1043,895,911,900,897,902,917,890,910,899 with the patch
> 
> This is still a significant gain, it will prevent future launch time
> regressions when we do stuff on appopen and Travis is happy [2].
> 
> Given the 1.4+ness of this patch I think it's our best bet, what do you
> think?
> 
> [1] https://github.com/mozilla-b2g/gaia/pull/19631
> [2] https://travis-ci.org/mozilla-b2g/gaia/builds/26046023

Thanks for doing that! Surely this works for me for 1.4 branch, but I think we need someone to take decision for 'only 50ms gain' but the bug summary is '150ms lost'.. I am not sure who could make this decision.
Flags: needinfo?(alive)
Tim,
Etienne made a patch to recover all v1.4 failures but with that we have only 50ms gain and it's a workaround. Could we take this or should we fix 150ms delay? Who could make decision?

Evan,
Could you link v1.4 failure bug here?

Thanks.
Flags: needinfo?(timdream)
Whiteboard: [c=progress p= s= u=1.4] QARegressExclude [ETA: 4/29] → [c=progress p= s= u=1.4] QARegressExclude
It's a release driver decision at this point. Kevin, who we should talk to?
Flags: needinfo?(timdream) → needinfo?(khu)
It seems like there were some motions to capture (via eideticker) whether the 150ms gain here was a "user-perceptible" 150ms so to speak or whether the regression has more to do with the way the measurements are being made. Did we do that? What were the results?
What's the main reason why we can't fix the full 150ms delay? Risk?
(In reply to Jason Smith [:jsmith] from comment #101)
> What's the main reason why we can't fix the full 150ms delay? Risk?

Much tests of v1.4 based on marionette-apps needs to be rework or reevaluated. Unknown ETA, no owner for now, and no approval for this kind of work.
Can we fix this on trunk?
(In reply to Andreas Gal :gal from comment #103)
> Can we fix this on trunk?

Yes, Evan is working on fixing the last failure on master. Will land on master soon.
Hi Alive,

The Bug 1015848 is for the test failures in 1.4 branch.
We have 37 intermittent failures[1] in v1.4 branch with the patch.

The statistics of failing test case for each app.
Browser: 4
Calendar: 4
Clock: 4
Contacts: 5
Dialer: 1
Email: 2
FTU: 2
Gallery: 3
Settings: 12

[1] https://travis-ci.org/evanxd/gaia/jobs/26028723.
No longer blocks: 1003795
(In reply to Alive Kuo [:alive][NEEDINFO!] from comment #104)
> (In reply to Andreas Gal :gal from comment #103)
> > Can we fix this on trunk?
> 
> Yes, Evan is working on fixing the last failure on master. Will land on
> master soon.

Okay - my view on this is that we should land the 150ms fix on 1.4 & turn the tests off on 1.4 only. For 2.0, we should make the real fix that resolves the 150ms regression & keeps the tests on.

Major reasons for this include:

* We're measured by partners heavily on meeting perf goals for app launch, so we really need the perf regression fixed
* 1.4 has a sufficient amount of stability at this point, so in the worst case, we could take a hit losing automated coverage here if we had to (even if we don't like it)
* 2.0 needs to keep the tests on, as there's a lot of instability right now on that branch (especially with smoketest regressions recently)

Tim - Does that work for you?
Flags: needinfo?(timdream)
(In reply to Jason Smith [:jsmith] from comment #107)
> Tim - Does that work for you?

It looks like my opinion does not matter a lot judging by the conversations in mailing list right? :)

Anyhow, I am a bit surprised this is coming from QA. If disabling the tests on 1.4 works for QA then it works for me.
Flags: needinfo?(timdream)
(In reply to Tim Guan-tin Chien [:timdream] (MoCo-TPE) (please ni?) from comment #99)
> It's a release driver decision at this point. Kevin, who we should talk to?

I don't see any major concern if we land this to 1.4. To have 50ms improvement is also very good. This bug was marked as 1.4+ which means we agreed that this bug has priority, and based on the importance of this issue, we would like to have it in 1.4. The only concern from my side is, I don't know if there has any side effect from this workaround. If we have confidence that it does not have any side effect, or cause other regressions, I am very happy to see this workaround landed in 1.4. In 2.0, I suggest to try to find a better fix for this issue. 

Sounds good?
Flags: needinfo?(khu)
Etienne, mind finishing and landing the v1.4 patch? Thank you.
Flags: needinfo?(etienne)
(In reply to Alive Kuo [:alive][NEEDINFO!] from comment #110)
> Etienne, mind finishing and landing the v1.4 patch? Thank you.

Yep.
I confirmed the 1.4 version of the patch still works well, b2gperf looks promising:
Results for settings, cold_load_time: median:1323, mean:1323, std: 106, max:1477, min:1183, all:1200,1266,1459,1406,1381,1183,1230,1236,1477,1399 without

Results for settings, cold_load_time: median:1130, mean:1184, std: 105, max:1439, min:1102, all:1133,1109,1122,1118,1324,1165,1204,1439,1127,1102 with

But I'm trying to disable/tweak as little tests as possible and travis is making this process pretty slow.
Flags: needinfo?(etienne)
We could land the patch once the travis job[1] is green.
Waiting for the result. :D

[1] https://travis-ci.org/mozilla-b2g/gaia/builds/26279453
Hi all,

Good new is `marionette_js` job[1] of the PR is passed. But the gaia ui test `test_settings_change_keyboard_language.py[2] is failed. And alive is working on the issue.

[1] https://travis-ci.org/mozilla-b2g/gaia/jobs/26279456
[2] https://travis-ci.org/mozilla-b2g/gaia/jobs/26279457
Blocks: 1013883
Comment on attachment 8430617 [details] [review]
Link to Github pull-request: https://github.com/mozilla-b2g/gaia/pull/19751

Hi Alive,

Could you help to review the patch?

Thanks.
Attachment #8430617 - Flags: review?(alive)
Attachment #8430617 - Flags: review?(alive) → review+
master
https://github.com/mozilla-b2g/gaia/commit/0862324900aef8d4f129ae6439c3681dc90ac88a
Status: REOPENED → RESOLVED
Closed: 5 years ago5 years ago
Resolution: --- → FIXED
Thanks for Evan's great help.
I'm not really sure what the uplift story is here for v1.4. Alive, can you please take care of it? Thanks :)
Flags: needinfo?(alive)
It seems as if there were some open questions here which I presume were answered but perhaps the answers didn't find their way into the public bug?

+ Was the perf regression a user-perceptible one?
+ Are we rolling with the augmented page load semantics? (ie app iframes are expected to display a "render" class if the iframe is in the DOM but the system app wants to communicate that the app is not ready to be interacted with)
+ If we are pushing ahead with this, are we going to work with the DOM, apps, etc. working groups to get feedback (from other groups such as Chrome OS), experimental documentation, and understand better whether this needs standardizing? Or perhaps if there is another, equivalent DOM attribute with more momentum that we can use? If so, can we mirror that information here as well? And who is the contact point for that effort?

Thanks!
Flags: needinfo?(evanxd)
Flags: needinfo?(etienne)
Depends on: 1018836
Depends on: 1018487
Given the regressions we've already seen this patch, I think we might need to re-evaluate the blocking decision here. I don't like shipping with regressions, but I also know that taking a high risk patch can be problematic.
blocking-b2g: 1.4+ → 1.4?
No longer depends on: 1018836
Depends on: 1018836
Mike - This is a risky patch to take onto 1.4. We can either live with the 150ms hit on 1.4 & not take the regression risk or do the opposite here. What's your take on this from a perf perspective?
Flags: needinfo?(mlee)
Jason I agree with your concerns re: risk to 1.4. Let's not destabilize or delay 1.4 any longer with uplifting this patch. Targeting 2.0 gives us a better chance to resolve unexpected side effects.
Flags: needinfo?(mlee)
Sounds good - I'll move this to 2.0+ then.
blocking-b2g: 1.4? → 2.0+
Clearing needinfo since we're not going to do the uplift anymore due to risk.
Flags: needinfo?(alive)
Whiteboard: [c=progress p= s= u=1.4] QARegressExclude → [c=progress p= s= u=2.0] QARegressExclude
(In reply to Gareth Aye [:gaye] from comment #120)
> It seems as if there were some open questions here which I presume were
> answered but perhaps the answers didn't find their way into the public bug?

Wasn't involved in the test fixing phase so clearing ni.
Pretty sure it wasn't the new load policy stuff though since it's not implemented in marionette.
Flags: needinfo?(etienne)
Alive, could you help to answer gaye's questions listed at Comment 120?
Flags: needinfo?(evanxd) → needinfo?(alive)
(In reply to Gareth Aye [:gaye] from comment #120)
> It seems as if there were some open questions here which I presume were
> answered but perhaps the answers didn't find their way into the public bug?
> 
> + Was the perf regression a user-perceptible one?

You could check out the master branch without the patch to try out. It's very slower 450+ms.

> + Are we rolling with the augmented page load semantics? (ie app iframes are
> expected to display a "render" class if the iframe is in the DOM but the
> system app wants to communicate that the app is not ready to be interacted
> with)

render is just a class to put some style / let the other module in system know the content is rendered (mozbrowserloadend). The "app" doesn't know this. window.load should be enough for web contents.

> + If we are pushing ahead with this, are we going to work with the DOM,
> apps, etc. working groups to get feedback (from other groups such as Chrome
> OS), experimental documentation, and understand better whether this needs
> standardizing? Or perhaps if there is another, equivalent DOM attribute with
> more momentum that we can use? If so, can we mirror that information here as
> well? And who is the contact point for that effort?
> 
> Thanks!

I guess you are misunderstanding something because the information about rendring/opening/closing is just needed inside system app, and, for some test library.

For app being rendered: window.load, this is automatically triggered by gecko
For app being opened: visibilitychange, this is triggered by mozbrowser API's setVisible.
For app being closed: visibilitychange, this is triggered by mozbrowser API's setVisible.

For app being opening/closing: no API for app to know and no use cases for now. But we definitely needs to info inside system app to better control its states.
Flags: needinfo?(alive)
(In reply to Alive Kuo [:alive][NEEDINFO!] from comment #128)
> (In reply to Gareth Aye [:gaye] from comment #120)
> > It seems as if there were some open questions here which I presume were
> > answered but perhaps the answers didn't find their way into the public bug?
> > 
> > + Was the perf regression a user-perceptible one?
> 
> You could check out the master branch without the patch to try out. It's
> very slower 450+ms.
> 
> > + Are we rolling with the augmented page load semantics? (ie app iframes are
> > expected to display a "render" class if the iframe is in the DOM but the
> > system app wants to communicate that the app is not ready to be interacted
> > with)
> 
> render is just a class to put some style / let the other module in system
> know the content is rendered (mozbrowserloadend). The "app" doesn't know
> this. window.load should be enough for web contents.
> 
> > + If we are pushing ahead with this, are we going to work with the DOM,
> > apps, etc. working groups to get feedback (from other groups such as Chrome
> > OS), experimental documentation, and understand better whether this needs
> > standardizing? Or perhaps if there is another, equivalent DOM attribute with
> > more momentum that we can use? If so, can we mirror that information here as
> > well? And who is the contact point for that effort?
> > 
> > Thanks!
> 
> I guess you are misunderstanding something because the information about
> rendring/opening/closing is just needed inside system app, and, for some
> test library.
> 
> For app being rendered: window.load, this is automatically triggered by gecko
> For app being opened: visibilitychange, this is triggered by mozbrowser
> API's setVisible.
> For app being closed: visibilitychange, this is triggered by mozbrowser
> API's setVisible.
> 
> For app being opening/closing: no API for app to know and no use cases for
> now. But we definitely needs to info inside system app to better control its
> states.

So then I take it that if the render class is insignficant and we don't need to work with anyone to standardize for interoperability, then we should backout https://github.com/mozilla-b2g/marionette-apps/commit/6d7e702d748261939c64dfec103138706dbff7b3?
Flags: needinfo?(alive)
My concern here is in regards to interoperability for web automation fwiw
(In reply to Gareth Aye [:gaye] from comment #129)> 
> So then I take it that if the render class is insignficant and we don't need
> to work with anyone to standardize for interoperability, then we should
> backout
> https://github.com/mozilla-b2g/marionette-apps/commit/
> 6d7e702d748261939c64dfec103138706dbff7b3?

Well, you could IF
* you implement your own system app in your test framework to "launch the app without any delay" to test the UI operation.
* you hack the gecko to know the loaded + opened state of the app to test the UI operation.

Otherwise you are just guessing it's 'ready to test' but not really know when you could test the UI operations.

If there's no comment 130 I tend to think it's because the backout-ed patch of this bug ever blocked your work so you are here to block my work to make you happy...

I guess I understand what does standardize mean, you are worried about the automation could not be applied to another platform.....is it? But why? It's testing FxOS so depending FxOS' implementation detail doesn't harm. What my concern is this kind of implementation detail is not known and communicated between system app developers and automation developers. Now it's a chance to know but you are just shooting from my point of view. If you do the suggestions above and make travis happy I don't have any other opinion for any change you made to marionette-apps -- unfortunately you still need either gecko or system app to tell you the app is ready to go or just guess using a timer or just do not throw any error in your switchToApp functions even the app is not loaded or opened.

I don't think this kind of discussion should keep annoying the 25 people here.
Flags: needinfo?(alive)
(In reply to Alive Kuo [:alive][NEEDINFO!] from comment #131)
> (In reply to Gareth Aye [:gaye] from comment #129)> 
> > So then I take it that if the render class is insignficant and we don't need
> > to work with anyone to standardize for interoperability, then we should
> > backout
> > https://github.com/mozilla-b2g/marionette-apps/commit/
> > 6d7e702d748261939c64dfec103138706dbff7b3?
> 
> Well, you could IF
> * you implement your own system app in your test framework to "launch the
> app without any delay" to test the UI operation.
> * you hack the gecko to know the loaded + opened state of the app to test
> the UI operation.
> 
> Otherwise you are just guessing it's 'ready to test' but not really know
> when you could test the UI operations.
> 
> If there's no comment 130 I tend to think it's because the backout-ed patch
> of this bug ever blocked your work so you are here to block my work to make
> you happy...

I can't parse it, but I think you're trying to say something passive aggressive. We're on the same team. I'm commenting here because I would like for our system app to be built with the same values for openness and interoperability that we strive to incorporate into all of our work at Mozilla. I'm not trying to "block your work". Sometimes doing things openly takes longer and requires more collaborative effort than doing comparable things behind closed doors, but it's the way we do it at Mozilla.
 
> I guess I understand what does standardize mean, you are worried about the
> automation could not be applied to another platform.....is it? But why? It's
> testing FxOS so depending FxOS' implementation detail doesn't harm.

Exactly! I would love, one day, to bring my web automation to any web operating system and for them to implement specs like https://developer.mozilla.org/en-US/docs/Web/API/Apps.mgmt that allow things like our marionette tests to work across system application platforms. In fact, there's a group dedicated to this effort in the w3c[1].

> What my concern is this kind of implementation detail is not known and communicated
> between system app developers and automation developers. Now it's a chance
> to know but you are just shooting from my point of view. If you do the
> suggestions above and make travis happy I don't have any other opinion for
> any change you made to marionette-apps -- unfortunately you still need
> either gecko or system app to tell you the app is ready to go or just guess
> using a timer or just do not throw any error in your switchToApp functions
> even the app is not loaded or opened.

I don't think this is just an implementation detail. It's evident that automation that wasn't built with these semantics in mind could not work alongside our system app with the changes made here. Notice that this patch broke a lot of existing tests before changes were made to the marionette-apps plugin. Perhaps one day, all of the mobile apis and manifest details will be standard between chrome os, firefox os, web os, etc. It would be great if (since webdriver is an open standard which mozilla's marionette implements) you could take your choice of automation libraries (like our app management plugin) and it would just work across operating systems that implemented the related system app web apis.

> I don't think this kind of discussion should keep annoying the 25 people
> here.

Again, there are costs and benefits to doing things in the open. I don't mean to annoy, but bugmail subscriptions are optional and doing things like this the right way is important.

1. http://www.w3.org/2012/sysapps/
This is the kind of thing I'm talking about http://www.w3.org/2012/sysapps/app-lifecycle/#application-events! We should be a part of that conversation. Go forth and collaborate!
Depends on: 1028002
You need to log in before you can comment on or make changes to this bug.