Open Bug 1786466 Opened 3 years ago Updated 18 days ago

Profile Android Fission's impact on page load performance

Categories

(GeckoView :: General, defect, P2)

Firefox 106
All
Android
defect

Tracking

(Not tracked)

People

(Reporter: alexandrui, Assigned: sinker)

References

(Depends on 2 open bugs, Blocks 1 open bug, Regression)

Details

(Keywords: perf, perf-alert, regression, Whiteboard: [fission:android:m2][group1][fxdroid])

Perfherder has detected a browsertime performance regression from push 8dbbc6a6b494858ae0f76375e2a69b0e1df03a73. As you are the author of one of the patches included in that push, we need your help to address this regression.

Regressions:

Ratio | Test | Platform | Options | Absolute values (old vs new)
11% | wikipedia ContentfulSpeedIndex | android-hw-p2-8-0-android-aarch64-shippable-qr | fission warm webrender | 222.88 -> 248.08
11% | wikipedia ContentfulSpeedIndex | android-hw-p2-8-0-android-aarch64-shippable-qr | fission warm webrender | 224.75 -> 248.42
8% | wikipedia PerceptualSpeedIndex | android-hw-p2-8-0-android-aarch64-shippable-qr | fission warm webrender | 236.71 -> 255.58
5% | nytimes LastVisualChange | linux1804-64-shippable-qr | fission warm webrender | 1,425.71 -> 1,503.33
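The Ratio column can be recomputed from the absolute values. A minimal sketch, assuming Perfherder's ratio is simply (new - old) / old; this assumption reproduces the 11.31% and 5.44% figures in the bug's original summary. For these metrics lower is better, so a positive ratio is a regression.

```python
"""Recompute the Perfherder "Ratio" column from old/new absolute values.

Assumption (not confirmed by Perfherder docs here): ratio = (new - old) / old.
Lower is better for these metrics, so a positive value means a regression.
"""

# (test, old, new) rows copied from the alert table.
ALERT_ROWS = [
    ("wikipedia ContentfulSpeedIndex (android)", 222.88, 248.08),
    ("wikipedia ContentfulSpeedIndex (android)", 224.75, 248.42),
    ("wikipedia PerceptualSpeedIndex (android)", 236.71, 255.58),
    ("nytimes LastVisualChange (linux)", 1425.71, 1503.33),
]


def regression_pct(old: float, new: float) -> float:
    """Percent change from old to new; positive means slower (worse)."""
    return (new - old) / old * 100.0


if __name__ == "__main__":
    for test, old, new in ALERT_ROWS:
        print(f"{regression_pct(old, new):6.2f}%  {test}  {old} -> {new}")
```

Rounded to whole percents this gives 11%, 11%, 8% and 5%, matching the table.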

Details of the alert can be found in the alert summary, including links to graphs and comparisons for each of the affected tests. Please follow our guide to handling regression bugs and let us know your plans within 3 business days, or the offending patch(es) may be backed out in accordance with our regression policy.

If you need the profiling jobs you can trigger them yourself from treeherder job view or ask a sheriff to do that for you.

For more information on performance sheriffing please see our FAQ.

Flags: needinfo?(calu)

Set release status flags based on info from the regressing bug 1648158

If this Android regression only affects Fission, we don't need to back out the regressing changeset because we don't enable Fission for any Android users yet.

Whiteboard: [fission:android:m2]

The last regression in the list is on Linux Fission, but judging from the patch, I don't see how it could be to blame.

Assignee: nobody → calu
Severity: -- → S3
Priority: -- → P1
Whiteboard: [fission:android:m2] → [fission:android:m2] [geckoview:m106]
Flags: needinfo?(calu)

P3

Assignee: calu → nobody
Priority: P1 → P3
Whiteboard: [fission:android:m2] [geckoview:m106] → [fission:android:m2]

Some more information on how these page load performance tests work:

To rerun this on Try, the job name is test-android-hw-p2-8-0-android-aarch64-shippable-qr/opt-browsertime-tp6m-geckoview-wikipedia. It uses the browsertime-tp6m config file (https://searchfox.org/mozilla-central/source/testing/raptor/raptor/tests/tp6/mobile/browsertime-tp6m.ini), which specifies settings such as browser_cycles = 15 and page_cycles = 25. On Try, browsertime uses chimera, which specifies page_cycles = 2: it loads the webpage cold and measures, then reloads it and measures again to get a warm page load. The log (https://firefoxci.taskcluster-artifacts.net/ey8RSsimSGS73pYVCdlDLA/0/public/logs/live_backing.log) shows cycle 0 and cycle 1 for iteration 1. Since browser_cycles is 15, it runs 15 iterations of this cold-plus-warm page load combination.

Navigating to about:blank, count: 0
Navigating to url about:blank iteration 1
Navigating to primary url:https://en.m.wikipedia.org/wiki/Main_Page
Cycle 0, waiting for 1000 ms
Cycle 0, starting the measure
Testing url https://en.m.wikipedia.org/wiki/Main_Page iteration 1
[runs command calls to screen record]
Navigating to about:blank, count: 1
Navigating to url about:blank iteration 1
Navigating to primary url:https://en.m.wikipedia.org/wiki/Main_Page
Cycle 1, waiting for 1000 ms
Cycle 1, starting the measure
Testing url https://en.m.wikipedia.org/wiki/Main_Page iteration 1
[runs command calls to screen record]
Browsertime pageload ended.
https://en.m.wikipedia.org/wiki/Main_Page TTFB: 67ms DOMContentLoaded: 1.50s firstPaint: 1.49s FCP: 1.30s Load: 1.94s
VisualMetrics: FirstVisualChange: 1.67s SpeedIndex: 1.67s PerceptualSpeedIndex: 1.68s ContentfulSpeedIndex: 1.67s VisualComplete85: 1.67s LastVisualChange: 2.55s
https://en.m.wikipedia.org/wiki/Main_Page?browsertime_run=2 TTFB: 150ms DOMContentLoaded: 385ms firstPaint: 288ms FCP: 272ms Load: 581ms
VisualMetrics: FirstVisualChange: 312ms SpeedIndex: 411ms PerceptualSpeedIndex: 379ms ContentfulSpeedIndex: 342ms VisualComplete85: 414ms
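To compare Fission and non-Fission runs, it helps to separate cold and warm loads from a log like the one above. A minimal sketch, assuming the VisualMetrics line format shown in this log and that loads appear in order (cycle 0 cold, then the remaining cycles warm) within each browser cycle; `parse_visual_metrics` and `split_cold_warm` are hypothetical helpers, not part of raptor or browsertime.

```python
import re

# Two VisualMetrics lines copied from the wikipedia log:
# the first is cycle 0 (cold load), the second is cycle 1 (warm reload).
LOG_LINES = [
    "VisualMetrics: FirstVisualChange: 1.67s SpeedIndex: 1.67s "
    "PerceptualSpeedIndex: 1.68s ContentfulSpeedIndex: 1.67s "
    "VisualComplete85: 1.67s LastVisualChange: 2.55s",
    "VisualMetrics: FirstVisualChange: 312ms SpeedIndex: 411ms "
    "PerceptualSpeedIndex: 379ms ContentfulSpeedIndex: 342ms "
    "VisualComplete85: 414ms",
]

# Matches "Name: <number><unit>" where unit is "ms" or "s".
METRIC_RE = re.compile(r"(\w+): ([\d.]+)(ms|s)\b")


def parse_visual_metrics(line: str) -> dict:
    """Return {metric_name: value_in_ms} for one VisualMetrics log line."""
    metrics = {}
    for name, value, unit in METRIC_RE.findall(line):
        scale = 1000.0 if unit == "s" else 1.0
        metrics[name] = round(float(value) * scale, 3)
    return metrics


def split_cold_warm(loads: list, page_cycles: int = 2) -> tuple:
    """Split per-load results into (cold, warm) lists.

    Hypothetical helper: assumes cycle 0 of each browser cycle is the cold
    load and every later cycle in that browser cycle is a warm reload.
    """
    cold = [r for i, r in enumerate(loads) if i % page_cycles == 0]
    warm = [r for i, r in enumerate(loads) if i % page_cycles != 0]
    return cold, warm


if __name__ == "__main__":
    cold, warm = split_cold_warm([parse_visual_metrics(l) for l in LOG_LINES])
    print("cold ContentfulSpeedIndex:", cold[0]["ContentfulSpeedIndex"], "ms")
    print("warm ContentfulSpeedIndex:", warm[0]["ContentfulSpeedIndex"], "ms")
```

With 15 browser cycles and page_cycles = 2, the same split yields 15 cold and 15 warm samples per metric.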

The page-load source code is at https://searchfox.org/mozilla-central/source/testing/raptor/browsertime/browsertime_pageload.js#336-412.

Component: Sandboxing → General

The details in this bug from 2022 are no longer relevant, but we can use this bug as a placeholder task to investigate Fission's perf impact.

Here are Perfherder comparisons of Android Fission vs. nofis for PerceptualSpeedIndex on an arbitrary heavyweight website (espn.com) among those we test. We should check other sites, too.

GVE fis vs nofis (IsolateEverything)

  • Fission's warm page load looks roughly 10% slower.
  • Fission's cold page load looks roughly 25% slower.

Fenix fis vs nofis (IsolateHighValue, which effectively means “isolate nothing” for our automated tests)

  • No regression (as expected).

Severity: S3 → N/A
Type: task → defect
Priority: P3 → P2
Summary: 11.31 - 5.44% wikipedia ContentfulSpeedIndex / nytimes LastVisualChange + 2 more (Android, Linux) regression on Wed August 17 2022 → Profile Android Fission's impact on page load performance
Whiteboard: [fission:android:m2] → [fission:android:m2][group1][fxdroid]

I collected some profiles to compare last week (Pixel 9, GeckoView, raptor wikipedia). Here is what I found:

  1. GeckoViewUtils._log() takes a lot of time during page load with Fission: 10+ ms with Fission vs. 4 ms without. When I disabled it (returning immediately), it took only 25 µs. (Could it be made async or moved off the main thread?)
  2. Off-thread compilation takes too long, and the main thread waits for it. I tried disabling off-thread compilation by replacing Kind::OffMainThreadOnly with Kind::MainThreadOnly for CompileOrDecodeTask, and the time dropped from ~100 ms to ~50 ms. (Check the ScriptCompileOffThread markers for https://en.m.wikipedia.org/w/load.php?. Off thread: https://share.firefox.dev/42VXjGI , not off thread: https://share.firefox.dev/3EG8rO8 )
  3. Content processes receive PContent::Msg_ConstructBrowser to construct a BrowserChild before loading the content. Handling the related messages takes an additional ~28 ms. This only happens with Fission, and it changes the timing of the refresh driver, postponing rendering and composition. It causes the ContentfulSpeedIndex and PerceptualSpeedIndex regressions mentioned earlier. CanonicalBrowsingContext::ChangeRemoteness() calls ContentParent::CreateBrowser() to create a BrowserChild; if that ran earlier, the content process would be ready earlier. (Could we do it before receiving the nsHTTPChannel response, rather than waiting for it?)
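The idea in item 3 is to take BrowserChild setup off the critical path by overlapping it with the network request. A toy model under stated assumptions: it treats the response wait and the ~28 ms of ConstructBrowser handling as two independent durations, and assumes (as item 3 implies) that setup currently starts only after the response arrives. `LoadTimeline` is hypothetical; it measures nothing in Gecko.

```python
from dataclasses import dataclass


@dataclass
class LoadTimeline:
    """Toy model of when a content process is ready to start rendering.

    Hypothetical: both durations are inputs, not measurements of Gecko.
    """

    response_ms: float       # wait for the HTTP response that triggers the process switch
    browser_setup_ms: float  # handling ConstructBrowser and related messages

    def ready_sequential(self) -> float:
        # Assumed current behavior: BrowserChild setup starts after the response.
        return self.response_ms + self.browser_setup_ms

    def ready_overlapped(self) -> float:
        # Proposed: start BrowserChild setup while the request is in flight.
        return max(self.response_ms, self.browser_setup_ms)


if __name__ == "__main__":
    # Illustrative inputs: 67 ms response (the cold TTFB in the log), 28 ms setup.
    t = LoadTimeline(response_ms=67.0, browser_setup_ms=28.0)
    print(t.ready_sequential(), t.ready_overlapped())
```

In this model the saving is at most min(response_ms, browser_setup_ms), so overlapping could hide the full ~28 ms only when the response takes at least that long.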

Profiles:

Assignee: nobody → thinker.li

Smaug, could you take a look at the off-thread compilation issue? Taking 50 ms longer doesn't make sense to me.

Flags: needinfo?(smaug)

As I mentioned on Matrix, the profiles show a non-trivial amount of time being spent loading the SessionStateAggregator.js frame script (https://searchfox.org/mozilla-central/rev/155d514d72473453492a822e97dc1c68cf49d110/mobile/shared/chrome/geckoview/SessionStateAggregator.js). This frame script is unused when SHIP is enabled (see the bottom of the file, where it is mocked out in that case: https://searchfox.org/mozilla-central/rev/155d514d72473453492a822e97dc1c68cf49d110/mobile/shared/chrome/geckoview/SessionStateAggregator.js#672-676).

We should probably remove framescript support from the module manager and only load this module if SHIP is disabled. The easiest way to do this is probably to drop the generic framescript module handling from the module manager, and add special case handling for this framescript until SHIP can be shipped everywhere.
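The proposed change can be sketched as a module manager with no generic frame-script handling and a single special case gated on SHIP (session history in the parent process). This is a hypothetical Python model of the control flow only; the real GeckoView module manager is JavaScript, and `ModuleManager`, `on_browser_init`, and `load_frame_script` are illustrative names, not its API.

```python
from dataclasses import dataclass, field

# The one frame script that still needs special-case handling.
SESSION_STATE_FRAME_SCRIPT = "SessionStateAggregator.js"


@dataclass
class ModuleManager:
    """Hypothetical stand-in for the GeckoView module manager."""

    ship_enabled: bool
    loaded_frame_scripts: list = field(default_factory=list)

    def load_frame_script(self, name: str) -> None:
        # Stand-in for actually loading a frame script into the content process.
        self.loaded_frame_scripts.append(name)

    def on_browser_init(self) -> None:
        # No generic per-module frame-script loading anymore; only the
        # session-state script remains, and only when SHIP is disabled.
        if not self.ship_enabled:
            self.load_frame_script(SESSION_STATE_FRAME_SCRIPT)


if __name__ == "__main__":
    for ship in (True, False):
        manager = ModuleManager(ship_enabled=ship)
        manager.on_browser_init()
        print(f"SHIP={ship}: {manager.loaded_frame_scripts}")
```

Once SHIP ships everywhere, the special case and the frame script can be deleted together.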

Depends on: 1950477
Depends on: 1950480
Depends on: 1950491

(Looks like a separate bug was filed for the off-thread compilation issue, with the right folks CCed :) )

Flags: needinfo?(smaug)
Type: task → defect