Closed Bug 1024928 Opened 10 years ago Closed 6 years ago

Video Cold Launch Time 200ms Regression

Categories

(Core :: DOM: Content Processes, defect, P3)

28 Branch
ARM
Gonk (Firefox OS)
defect

Tracking

()

RESOLVED WONTFIX

People

(Reporter: jhylands, Assigned: cyu)

References

Details

(Keywords: perf, regression, Whiteboard: [c=regression p= s= u=])

On or about Feb 5, 2014, the Video app's cold load launch time regressed by 200ms due to this patch:

5a5fa6d369ab2e3c430cb9e1a5e909e9922c4a3f is the first bad commit
commit 5a5fa6d369ab2e3c430cb9e1a5e909e9922c4a3f
Author: Fabrice Desré <fabrice@mozilla.com>
Date:   Tue Jan 14 16:00:25 2014 -0800

    Bug 950266 - Re-enable the Nuwa process on B2G by default r=me

(Please see bug 1002842 for the bisection leading to this)

Clearly, this can't be backed out at this time, but we need to figure out specifically what caused the issue, and see if it can be fixed.
Fabrice, can you look at this and comment?
blocking-b2g: --- → 1.4?
Flags: needinfo?(fabrice)
Keywords: perf
Whiteboard: [c=regression p= s= u=]
Severity: normal → critical
OS: Windows 7 → Gonk (Firefox OS)
Priority: -- → P1
Hardware: x86_64 → ARM
Blocks: 950266
Component: Gaia::Video → IPC
Keywords: regression
Product: Firefox OS → Core
Version: unspecified → 28 Branch
Component: IPC → DOM: Content Processes
ha ha... Nuwa changes a zillion things. Is the video app the only one impacted?
Flags: needinfo?(fabrice)
Honestly, I don't know at this point. I can certainly pull the data for a few other apps, and check.
What's the user impact of the delay? Did this regress since 1.3?
Flags: needinfo?(jhylands)
Yes - the history is in bug 1002842.
Flags: needinfo?(jhylands)
It looks like all the apps regressed on the same date. There appears to be a corresponding drop around Feb 7 or 8 - was the Nuwa patch backed out then, or was it something else?
QA,

Can we please verify if there is a drop significant to 1.3?
Keywords: qawanted
(In reply to Preeti Raghunath(:Preeti) from comment #7)
> QA,
> 
> Can we please verify if there is a drop significant to 1.3?

This isn't a request QA does - we don't do perf studies of this kind.
Keywords: qawanted
Mike

Is this regression on par for your acceptance criteria? If not we'll need this fixed so please help make a call here
Flags: needinfo?(mlee)
(In reply to Preeti Raghunath(:Preeti) from comment #9)
> ...
> Is this regression on par for your acceptance criteria? If not we'll need
> this fixed so please help make a call here

Launch speed regressions of 200ms meet our FxOS Performance Criteria for blocking [1]. That said the apparent cause of this regression, Nuwa [2] is a core feature we're unlikely to backout especially so late in the 1.4 release cycle.

The effort needed to correct or improve this regression is probably best targeted for 2.0 or later so noming as a 2.0 blocker.

[1] https://wiki.mozilla.org/FirefoxOS/Performance/Triage#Blockers
[2] https://bugzilla.mozilla.org/show_bug.cgi?id=771765#c0


(In reply to Jon Hylands [:jhylands] from comment #6)
> It looks like all the apps regressed on the same date. There appears to be a
> corresponding drop around Feb 7 or 8 - was the Nuwa patch backed out then,
> or was it something else?

Fabrice, can you respond to Jon's question re: Nuwa being backed out?
blocking-b2g: 1.4? → 2.0?
Flags: needinfo?(mlee) → needinfo?(fabrice)
(In reply to Mike Lee [:mlee] from comment #10)

> Launch speed regressions of 200ms meet our FxOS Performance Criteria for
> blocking [1]. That said the apparent cause of this regression, Nuwa [2] is a
> core feature we're unlikely to backout especially so late in the 1.4 release
> cycle.

Backing out nuwa (or even disabling it) is just not an option at all at this point, on any branch.


> (In reply to Jon Hylands [:jhylands] from comment #6)
> > It looks like all the apps regressed on the same date. There appears to be a
> > corresponding drop around Feb 7 or 8 - was the Nuwa patch backed out then,
> > or was it something else?
> 
> Fabrice, can you respond to Jon's question re: Nuwa being backed out?

All the fun happened in bug 950266, in which:
- it initially landed on 01/31 (https://bugzilla.mozilla.org/show_bug.cgi?id=950266#c78)
- was backed out on 02/01 (https://bugzilla.mozilla.org/show_bug.cgi?id=950266#c85)
- relanded on m-c on 02/04 (https://bugzilla.mozilla.org/show_bug.cgi?id=950266#c89)
- merged on 1.3t on 02/07 (https://bugzilla.mozilla.org/show_bug.cgi?id=950266#101)
Flags: needinfo?(fabrice)
Okay, interesting, so the drop on 2/7 in m-c must have been something else.
And while I also don't think disabling Nuwa is an option, it might be helpful if we can figure out why it causes a ~200ms cold load time regression, and see if we can fix that in 2.0.
Given we shipped 1.3 and 1.4 with this, I am not sure why we would specifically block a 2.0 release for this issue. Can you help here?
Flags: needinfo?(mlee)
Nuwa was a critical feature committed for 1.3 [1], that's why we shipped with these results. 1.4, as far as I know, hasn't yet shipped and is still being locked down, blocking that release now is obviously not an option.

A 200ms regression in launch time is worth blocking. It represents more than 15% of our app launch goal [2] and is 4 times longer than the maximum amount of time it takes a person to perceive successive visual change [3]. We should make an effort to fix or reduce this regression.

With performance these degradations in speed accumulate over time. It's best to address them early when there are fewer contributing causes.

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=771765#c217
[2] https://wiki.mozilla.org/FirefoxOS/Performance/Release_Acceptance#Criteria
[3] https://mozilla.app.box.com/s/aww17rx74k7fjds5vada
Flags: needinfo?(mlee) → needinfo?(bbajaj)
blocking-b2g: 2.0? → 2.0+
Fabrice, given that we're looking at this as a 2.0 blocker now, can you point me at the right person to look at this code, to see if we can figure out why re-enabling Nuwa caused this 200ms launch regression?
Flags: needinfo?(fabrice)
Severity: critical → blocker
Whiteboard: [c=regression p= s= u=] → [c=regression p= s= u=2.0]
We lived with this on 1.3 and 1.4, this really shouldn't block.
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #17)
> We lived with this on 1.3 and 1.4, this really shouldn't block.

I agree.

Cervantes can help there. Did you try to build with Nuwa disabled on 2.0/2.1 to see where we are?
Flags: needinfo?(fabrice) → needinfo?(cyu)
I don't usually build with Nuwa disabled. 200 ms of code launch time doesn't look normal: the app should be launched from the same starting point: an app process is preallocated with preload.js loaded. The difference between Nuwa and non-Nuwa preallocated process should only be COW-pages and wrapper functions, but the difference should be minimal in theory. I can take a look to see if something is wrong.
Assignee: nobody → cyu
Flags: needinfo?(cyu)
(In reply to Mike Lee [:mlee] from comment #15)
> Nuwa was a critical feature committed for 1.3 [1], that's why we shipped
> with these results. 1.4, as far as I know, hasn't yet shipped and is still
> being locked down, blocking that release now is obviously not an option.
> 
> A 200ms regression in launch time is worth blocking. It represents more than
> 15% of our app launch goal [2] and is 4 times longer than the maximum amount
> of time it takes a person to perceive successive visual change [3]. We
> should make an effort to fix or reduce this regression.
> 
> With performance these degradations in speed accumulate over time. It's best
> to address them early when there are fewer contributing causes.
> 
> [1] https://bugzilla.mozilla.org/show_bug.cgi?id=771765#c217
> [2]
> https://wiki.mozilla.org/FirefoxOS/Performance/Release_Acceptance#Criteria
> [3] https://mozilla.app.box.com/s/aww17rx74k7fjds5vada

thanks, that make sense..and jsmith also pointed me to the blocking criteria established here: https://wiki.mozilla.org/FirefoxOS/Performance/Triage#Blockers, so I'll leave it here to understand why this regressed.
Flags: needinfo?(bbajaj)
If the performance team thinks this should block 2.0 after living with it for two releases the performance team really should put up the manpower to investigate it.
I ran the test locally against unagi using m-c- revision 83c09fe3a658. I tested the cold launch time of the the Video and the Template app. The following tests are done using b2gperf --itertion=10 --delay=10 <App Name>:

Without Nuwa:
Results for Video, cold_load_time: median:1367, mean:1375, std: 40, max:1486, min:1336
Results for Video, cold_load_time: median:1412, mean:1418, std: 58, max:1572, min:1331
Results for Video, cold_load_time: median:1377, mean:1397, std: 67, max:1553, min:1327
Results for Template, cold_load_time: median:574, mean:557, std: 59, max:664, min:470
Results for Template, cold_load_time: median:580, mean:559, std: 64, max:666, min:466
Results for Template, cold_load_time: median:575, mean:571, std: 72, max:674, min:468

With Nuwa:
Results for Video, cold_load_time: median:1461, mean:1470, std: 38, max:1549, min:1425
Results for Video, cold_load_time: median:1444, mean:1456, std: 32, max:1552, min:1433
Results for Video, cold_load_time: median:1445, mean:1481, std: 66, max:1629, min:1420
Results for Template, cold_load_time: median:562, mean:557, std: 61, max:654, min:458
Results for Template, cold_load_time: median:543, mean:537, std: 51, max:611, min:451
Results for Template, cold_load_time: median:590, mean:575, std: 58, max:694, min:479

It can be seen that the launch time of an empty app is the same, w/wo Nuwa, but the Video app shows some difference (several dozens to 100 ms). Because ranges of launch time overlaps w/wo Nuwa, taking a closer look at the constitution of the app launch time might produce a misleading result. I think we might need to run the tests with SPS enabled.

Mike, is there any way we can run b2gperf with SPS enabled?
Flags: needinfo?(mlee)
Jonathan, can someone on the ATeam answer Cervantes question:

> Is there any way we can run b2gperf with SPS enabled?
Status: NEW → ASSIGNED
Flags: needinfo?(mlee) → needinfo?(jgriffin)
You could certainly run b2gperf with a local SPS-enabled build; we don't have the capacity to run b2gperf with custom builds in our infrastructure at this time.

Other than a custom build, would b2gperf have to work differently with SPS?
Flags: needinfo?(jgriffin)
SPS is enabled by starting b2g with MOZ_PROFILER_STARTUP=1 environment variable. But I just realized that running b2gperf with SPS doesn't help in diagnosing app launch time problems because the app process is started and then killed. The launch time is so short that SPS can't get statistically meaningful data.

Event ordering might be relevant to the increased app launch time. With Nuwa, in app launch we always get 'loadend' DOM event before sending the 'loadend' browserelementapi:call message to the chrome process to notify the system app. In contrast, without Nuwa, we seldom see the DOM event received before notifying the system ap. Why the 2 events appear in this unfavourable order with Nuwa enabled is still unknown. I need to take a deeper look.
Any update? Thanks.
Flags: needinfo?(cyu)
I am tracing the events that happen during app launch to see what makes the events happen in this unfavourable order when Nuwa is enabled.
Flags: needinfo?(cyu)
We have far more important things to focus on at this point.  This should be 2.0-.
Flags: needinfo?(jsmith)
Flags: needinfo?(jsmith)
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #28)
> We have far more important things to focus on at this point.  This should be
> 2.0-.

Mike - What do you think?
Flags: needinfo?(mlee)
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #28)
> We have far more important things to focus on at this point.  This should be
> 2.0-.

Actually, when I discussed this bug with all engineering managers and EPMs in Taipei yesterday, Thinker and I agreed with what Kyle said here. But, seems like Mike Lee provided a firmed feedback in comment #10.
Okay. Agree to minus this from 2.0.
Severity: blocker → normal
blocking-b2g: 2.0+ → ---
Flags: needinfo?(mlee)
Priority: P1 → P3
Whiteboard: [c=regression p= s= u=2.0] → [c=regression p= s= u=]
In my test using unagi, there is ~100ms difference in app launch times with or without Nuwa. It appears that the timing of the XHR sent in gaia/shared/js/l10n.js to load locales-obj/en-US.json contributes most of the added startup time of the Video app.
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.