GeckoView Speedometer regression on Google Pixel 2 (ARM32 and ARM64) from clang 7 bug 1492663

RESOLVED FIXED in Firefox 65

Status

defect
P1
normal
RESOLVED FIXED
7 months ago
5 months ago

People

(Reporter: cpeterson, Assigned: glandium)

Tracking

unspecified
mozilla65
ARM
Android
Dependency tree / graph

Firefox Tracking Flags

(geckoview64 unaffected, firefox63 unaffected, firefox64+ unaffected, firefox65+ fixed)

Details

(Whiteboard: [qf])

Attachments

(1 attachment)

(Reporter)

Description

7 months ago
Speedometer scores started to regress on Android around October 15 and then dipped again on October 27:

https://treeherder.mozilla.org/perf.html#/graphs?timerange=5184000&series=mozilla-central,1776166,1,10&series=mozilla-central,1775979,1,10

Here is the hg pushlog for the second dip:

https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=6e96c7ec0d1187c1b488dd4ba645df9cfd68ec16&tochange=7007206a3cd46da7c478e5a9992b608b743a1ff9

Android LTO (bug 1480006) landed in that regression range, but it was also backed out.

PLT stubs (bug 1500154) also landed. That change improved page load performance, but perhaps it regressed Speedometer?
(Reporter)

Updated

7 months ago
Summary: Speedometer regression → GeckoView Speedometer regression on Google Pixel 2 (ARM32 and ARM64)
keep in mind a lot has changed to make the tests run again on bitbar (second dip), will see if tooling is the culprit or the browser.
(Reporter)

Comment 2

7 months ago
Bob, do you know why Speedometer results from Moto G5 stopped after October 23? I don't know if that was related to the regression on Pixel 2, but they happened around the same time:

https://treeherder.mozilla.org/perf.html#/graphs?timerange=5184000&series=mozilla-central,1776166,1,10&series=mozilla-central,1775979,1,10
Flags: needinfo?(bob)
Priority: -- → P1
:cpeterson, all bitbar testing was broken after a ssl cert was renewed and we had too many redirects accessing artifacts (i.e. bug 1499246).  We fixed the issue for the most part, but we turned off all testing for a few days to understand the issue and ended up updating our docker image that runs the tests and serves the webpages to a new version of ubuntu that has new and updated libraries.
My gut feeling here is that it's related to a11y stuff, since Speedometer does a bunch of form actions. Maybe Eitan has insight?
Flags: needinfo?(eitan)
Does speedometer have any accessibility service active?
pushing 3 different pushes, I see similar results to the graph- 2 regressions- since this just ran today in the last 1.5 hours, there are no changes going into the bitbar environment, only changes in the browser code.  I think this is code changes.

We can bisect this if needed- it takes a few hours to deal with tree updates, etc. but we could bisect on try if there are no culprits identified
I am not sure if we have made progress on this- as it would be some work to bisect, I haven't done that, but I would be happy to if we are not sure anything will fix the problem.
The logs don't suggest accessibility being enabled. So I'm not sure how the a11y changes may have done this. Can't rule it out, though..
Flags: needinfo?(eitan)
I have done a lot of bisecting on this and cannot come to any conclusions, it is obvious that we have a sustained change, but I see it show up in a smaller range, then I bisect more and it yields no results, as if I am unable to reproduce this.
(Reporter)

Comment 11

6 months ago
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #10)
> I have done a lot of bisecting on this and cannot come to any conclusions,
> it is obvious that we have a sustained change, but I see it show up in a
> smaller range, then I bisect more and it yields no results, as if I am
> unable to reproduce this.

@ Joel, what is your smaller regression range? Is your range for the first (October 16) or second (October 27) regression? The regressions look more pronounced on android-hw-p2-8-0-android-aarch64 (ARM64), where we don't have Ion JIT yet (bug 1380129).

What do you recommend we do next? With the smaller range, perhaps a Gecko developer can try to bisect by running Speedometer locally.

@ Kannan, the range for the first regression (October 16 around [1]) might be [2] or [3]. Range [2] includes a JIT fix for bug 1496847. Could that cause a Speedometer regression on ARM?

Range [3] includes a raptor test change for Sunspider Chrome bug 1487374.

[1] https://treeherder.mozilla.org/perf.html#/graphs?timerange=5184000&series=mozilla-central,1776166,1,10&series=mozilla-central,1775979,1,10&zoom=1539315355586.678,1540316452294.6199,15.998876335915556,18.392134762881845

[2] https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=f2e35ed6a6928b35e085462dc4268fd86f00fdb9&tochange=31724aea10cae55f30b825ade226c4d25e11a899

[3] https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=9079bbe837184ed183b133a374753865b6768bc4&tochange=f2e35ed6a6928b35e085462dc4268fd86f00fdb9
Flags: needinfo?(kvijayan)
Flags: needinfo?(jmaher)
See Also: 1500154
I had looked at the oct 27th regression, it seemed that I started narrowing down a range and then I wasn't able to reproduce it on a smaller range and I played whack a mole for a 100 try pushes.

If you want I can look into the oct 16th regression tomorrow, let me know if that is desired.
Flags: needinfo?(jmaher)

Comment 13

6 months ago
Remember that when using the bitbar devices we are getting a possibly different device each time which may affect the ability to bisect. It might be possible to manually bisect with a single local device.
Chris: No, that patch should behave strictly as speedup on all platforms.
Flags: needinfo?(kvijayan)
(Reporter)

Comment 15

6 months ago
[Tracking Requested - why for this release]:

This is a performance regression on Speedometer, one of our  benchmarks.

GV 64 Nightly merged to Beta on 2018-10-22, so GV 64 would be affected by the apparent October 15 regression. GV 65 would be affected by both the October 15 and 27 regressions.
tracking as gv perf regression
Eitan - speedometer is known for doing rapid-fire changing of form element states (creating checkboxes, checking 100 checkboxes at a time, deleting them).  If any code is in that processing path in order to support accessibility (maintaining state, perhaps), perhaps it could have a perf impact even if a11y isn't in use at that time.

Obviously trying to narrow regression ranges is the primary tool to attack this for now
Flags: needinfo?(eitan)
*Possible* regression range for the first regression:
https://hg.mozilla.org/mozilla-central/pushloghtml?startdate=Oct+15+2018&enddate=Oct+16+2018
Note: I'm eyeballing the distribution and saying it regressed late-15th-ish perhaps or *maybe* early 16th

Roughly 17.75 to 17.0 from eyeballing the noise
Looking at that range: 
Bug 1498699  was an android-only bug in that range... and  Bug 1498715 (unlikely though).  Bug 1479757  was for GV, but may be WebRender-only.   Bug 1498329 seems very unlikely.  There was a big non-android landing for tables (Bug 1484126 ).  a JIT piece landed at the end of that which we've discussed:  Bug 1496847

I presume this didn't impact desktop.

jmaher: thoughts?  How does this mesh with regressionrange tests you did, or did you only focus on the second?
Flags: needinfo?(jmaher)
:jesup, I only focused on the second.

basically someone could do this process:
foreach revision in range:
   hg update revision
   <apply any patches to fix build or scheduling>
   ./mach try fuzzy -q 'android aarch64 speedometer' --rebuild 5
   <remove any patches to fix build or scheduling>

sadly this takes a long time for me locally and the cpu load on my machine is high so doing this while in a vidyo meeting creates a lot of interference.  I did this on the second range and found that it was inconclusive which happens maybe 10% of the time when I bisect- usually more frequent the longer back in history I go and the more custom patches to get stuff to build or run.
Flags: needinfo?(jmaher)
I'm about to try to bisect this locally by just loading the speedometer page on a dedicated device. That should be good enough, right? My goal is to observe that there *is* in fact a regression, and see if and how it is related to a11y.
loading speedometer on a device while iterating through builds should show a regression; I would at least run it twice for each build; I did this one day to compare against google chrome and different versions of the browser- overall I got a strong signal.  As a note, this is on a pixel 2 device.
Eitan - yes, dedicated device should be good. You can bisect on m-c merges in that time range (from archive) once you verify there's a difference from (say) 10/14 vs 10/17.
Ok, the Oct 16th regression in speedometer is understood:  glandium backed out Android LTO:
"Back out changeset 9481b9ecaf2f (bug 1480006) as being a suspect in the rise of crashes such as bug 1489953. "

See https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&searchStr=pixel2%2Cspeedometer&selectedJob=211353831&revision=9079bbe837184ed183b133a374753865b6768bc4

I identified the regression looking at m-c pushes with https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&searchStr=pixel2%2Cspeedometer&startdate=2018-10-15&enddate=2018-10-17&selectedJob=211353831, then retriggered the speedometer runs around the regression before noticing this backout (above).

So it's just the Oct 27th regression to understand
>    ./mach try fuzzy -q 'android aarch64 speedometer' --rebuild 5

So long as I reland the bitbar stuff for each push (and be careful of glandium landing and backing out LTO in this range, it works.

So the range is down to 8 changesets: 524c546c7ad5 (good) to ab584824a073 (bad) -- 4 bugs, 8 changesets (5 for 1 bug).  One changeset is webgl only (unlikely...) 

Two are by glandium and involve possible perf issues, though one looks to be desktop-only, and the other is in graphite(!?) and so seems unlikely:
	434f70360933	Mike Hommey — Bug 1492663 - Upgrade most CI builds to clang 7 r=froydnj
	02ac5668f849	Mike Hommey — Bug 1498072 - Prevent inlining of the direct_run function. r=jfkthame

The other patches are for 2 bugs:
	42961627bc9a	Ryan Hunt — Bug 1453425 - Remove mScrollPosAtLastPaint from nsGfxScrollFrame. r=botond (and 4 more patches)
	5ef21255897e	Jeff Gilbert — Bug 1478216 - Don't init tex images in FBAttachment::IsComplete. r=kvark

Jeff's bug appears to be all WebGL.  So if glandium's patches don't pan out as the cause, it's bug 1453425
Try run with 434f70360933	Mike Hommey — Bug 1492663 is slow, so it has to be that or bug 1498072.

Fast: https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=211484789&revision=d49d773fe06dad3b08cbbef6c904d2d0fe9c90c5
Slow: https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=211484789&revision=4d30905771e75c7a3882343c59874c81ce53f40f
(includes both patches for the bugs above)

Still running: https://treeherder.mozilla.org/#/jobs?repo=try&revision=0bb91544975ad5911033f09dfca051c3bf3f29af
(which has Bug 1498072 - Prevent inlining of the direct_run function, but not the other patch).
Flags: needinfo?(eitan) → needinfo?(mh+mozilla)
> Still running:
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=0bb91544975ad5911033f09dfca051c3bf3f29af
> (which has Bug 1498072 - Prevent inlining of the direct_run function, but not the other patch).

That was fast, so it's bug 1492663
Blocks: 1492663
Summary: GeckoView Speedometer regression on Google Pixel 2 (ARM32 and ARM64) → GeckoView Speedometer regression on Google Pixel 2 (ARM32 and ARM64) from bug 1492663
(Reporter)

Comment 28

6 months ago
(In reply to Randell Jesup [:jesup] from comment #27)
> That was fast, so it's bug 1492663

Thanks for pinpointing the regression, Randell. That's a bummer. There's not a lot we can do to recover from a performance regression caused by upgrading to a new clang version. Did macOS or Linux regress on Speedometer after upgrading to clang 7?

Skimming the clang 7 release notes [1] for any interesting changes:

* -fmerge-all-constants is no longer applied by default because it is a non-conforming optimization and miscompiles the Linux kernel [2]. Maybe we should try re-enabling?

* The new -fforce-emit-vtables option forces the emission of vtables even in modules where it isn't necessary, which increases opportunities for devirtualisation. Forcing emitting of vtable adds a reference of these inline virtual functions. Note that GCC was always doing it. [3]

[1] http://releases.llvm.org/7.0.0/tools/clang/docs/ReleaseNotes.html
[2] https://patchwork.ozlabs.org/patch/888487/
[3] https://reviews.llvm.org/rL334600
(Assignee)

Comment 29

6 months ago
(In reply to Chris Peterson [:cpeterson] from comment #28)
> (In reply to Randell Jesup [:jesup] from comment #27)
> > That was fast, so it's bug 1492663
> 
> Thanks for pinpointing the regression, Randell. That's a bummer. There's not
> a lot we can do to recover from a performance regression caused by upgrading
> to a new clang version. Did macOS or Linux regress on Speedometer after
> upgrading to clang 7?

Slightly improved on linux
https://treeherder.mozilla.org/perf.html#/graphs?timerange=2592000&series=mozilla-inbound,1640394,1,1&series=mozilla-inbound,1641346,1,1&highlightedRevisions=434f70360933

Unchanged on Mac
https://treeherder.mozilla.org/perf.html#/graphs?timerange=2592000&series=mozilla-central,1647998,1,1&highlightedRevisions=434f70360933

Or is it one of those benchmarks where the lower score is better?
Flags: needinfo?(mh+mozilla)
(Reporter)

Comment 30

6 months ago
(In reply to Mike Hommey [:glandium] from comment #29)
> Or is it one of those benchmarks where the lower score is better?

No. A higher Speedometer score is better, so Linux did improve slightly.
(Reporter)

Comment 31

6 months ago
[qf] to put this regression on the Quantum Flow team's radar. We don't have an action plan at this time.
Summary: GeckoView Speedometer regression on Google Pixel 2 (ARM32 and ARM64) from bug 1492663 → GeckoView Speedometer regression on Google Pixel 2 (ARM32 and ARM64) from clang 7 bug 1492663
Whiteboard: [qf]
Has anyone tried bisecting LLVM trunk to see if this is due to a specific change? It may give us a clue on how to work around the regression or at least tell us who to poke upstream.

Comment 34

6 months ago
Pushed by mh@glandium.org:
https://hg.mozilla.org/integration/autoland/rev/386ec4aee534
Switch android arm/aarch64 opt/nightly builds back to clang 6. r=dmajor

Comment 35

6 months ago
Pushed by apavel@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/3a0b29de3cfb
fix clang failures on a CLOSED TREE

Comment 36

6 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/386ec4aee534
https://hg.mozilla.org/mozilla-central/rev/3a0b29de3cfb
Status: NEW → RESOLVED
Last Resolved: 6 months ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 65
Assignee: nobody → mh+mozilla
Thanks Mike! I see FF64 is marked tracking and affected so I think that means we'll need to do the same thing for beta? (See comment 15)
Flags: needinfo?(mh+mozilla)
(Reporter)

Comment 38

6 months ago
We can see Speedometer performance restored on Perfherder already:

https://treeherder.mozilla.org/perf.html#/graphs?series=mozilla-central,1775979,1,10&series=mozilla-central,1776360,1,10&series=mozilla-central,1776166,1,10

android-hw-p2-8-0-android-aarch64 was impacted the most. aarch64 currently only has Baseline JIT support, so apparently clang 7 was hurting Baseline JIT code more than Ion JIT code. Once we have ARM64 support for Ion (bug 1380129), we can probably switch back to clang 7 (unless we find a clang 7 fix before then).
(Reporter)

Comment 39

6 months ago
(In reply to David Bolter [:davidb] (NeedInfo me for attention) from comment #37)
> Thanks Mike! I see FF64 is marked tracking and affected so I think that
> means we'll need to do the same thing for beta? (See comment 15)

This bug covers two Speedometer regressions: one in 64 Nightly on October ~16 and one in 65 Nightly on October 27.

The 65 regression was clang 7 bug 1492663, which has been fixed by reverting to clang 6. As such, I am unneedinfo'ing Mike because I don't think any action is needed from him.

According to Randell's comment 24, the 64 regression appears to be the backout [1] of Android LTO bug 1480006 from the first time Mike tried to land it. So 64 did not actually regress; it was a return to pre-LTO performance. I'll mark 64=unaffected. The good news is that we now know that LTO measurably improved our performance on Android!

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1480006#c12
Flags: needinfo?(mh+mozilla)
See Also: → android-lto
(Assignee)

Comment 40

6 months ago
(In reply to Chris Peterson [:cpeterson] from comment #38)
> We can see Speedometer performance restored on Perfherder already:
> 
> https://treeherder.mozilla.org/perf.html#/graphs?series=mozilla-central,
> 1775979,1,10&series=mozilla-central,1776360,1,10&series=mozilla-central,
> 1776166,1,10

Wait, we have talos for android?!
(Assignee)

Comment 41

6 months ago
Taking a step back here, it seems the bug summary is wrong, in that arm32 was not affected. I'm bisecting, and I see no differences there along the way. Only on arm64. What this tells me is that we could leave arm32 alone. And since we're not shipping aarch64... do we really need to keep using clang 6 for it (when it would be only platform using it)?
Flags: needinfo?(cpeterson)
(Reporter)

Comment 42

6 months ago
(In reply to Mike Hommey [:glandium] from comment #40)
> Wait, we have talos for android?!

Not all of Talos. We're running Speedometer, Unity WebGL, and soon the Talos tp6 page load benchmarks using a new test runner (for all Firefox platforms) called Raptor.

(In reply to Mike Hommey [:glandium] from comment #41)
> Taking a step back here, it seems the bug summary is wrong, in that arm32
> was not affected. I'm bisecting, and I see no differences there along the
> way. Only on arm64. What this tells me is that we could leave arm32 alone.
> And since we're not shipping aarch64... do we really need to keep using
> clang 6 for it (when it would be only platform using it)?

sstangl is adding Ion support for arm64 in bug 1380129, which should land in 2019 Q1. In the meantime, we could go back to clang 7 on all platforms, accepting the arm64 regression with the hope that Ion will recover the clang 7 regression. We can't stay on clang 6 forever, but perhaps arm64 would still be faster with clang 6 than clang 7 even after we have Ion support? As a Build peer, the compiler decisions are yours.

If I squint at this Perfherder graph, I think both arm32 and arm64 regressed on October 27, but arm64 was definitely hit harder:

https://treeherder.mozilla.org/perf.html#/graphs?timerange=5184000&series=mozilla-central,1775979,1,10&series=mozilla-central,1776360,1,10&series=mozilla-central,1776166,1,10

We don't have Ion JIT support for arm64 yet, which is why arm64 (android-hw-p2-8-0-android-aarch64) is so much slower than arm32 (  android-hw-p2-8-0-arm7-api-16) on the same device (Google Pixel 2).
Flags: needinfo?(cpeterson)
(Assignee)

Updated

6 months ago
Blocks: 1508547
(Reporter)

Comment 43

6 months ago
status-geckoview64=unaffected

Updated

5 months ago
Product: Firefox for Android → GeckoView
Target Milestone: Firefox 65 → mozilla65
You need to log in before you can comment on or make changes to this bug.