1503330 - GeckoView Speedometer regression on Google Pixel 2 (ARM32 and ARM64) from clang 7 bug 1492663

Reporter

Description

•

7 years ago

Speedometer scores started to regress on Android around October 15 and then dipped again on October 27: https://treeherder.mozilla.org/perf.html#/graphs?timerange=5184000&series=mozilla-central,1776166,1,10&series=mozilla-central,1775979,1,10 Here is the hg pushlog for the second dip: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=6e96c7ec0d1187c1b488dd4ba645df9cfd68ec16&tochange=7007206a3cd46da7c478e5a9992b608b743a1ff9 Android LTO (bug 1480006) landed in that regression range, but it was also backed out. PLT stubs (bug 1500154) also landed. That change improved page load performance, but perhaps it regressed Speedometer?

Chris Peterson [:cpeterson]

Reporter

Updated

•

7 years ago

Summary: Speedometer regression → GeckoView Speedometer regression on Google Pixel 2 (ARM32 and ARM64)

Joel Maher ( :jmaher ) (UTC -8)

Comment 1

•

7 years ago

keep in mind a lot has changed to make the tests run again on bitbar (second dip), will see if tooling is the culprit or the browser.

Chris Peterson [:cpeterson]

Reporter

Comment 2

•

7 years ago

Bob, do you know why Speedometer results from Moto G5 stopped after October 23? I don't know if that was related to the regression on Pixel 2, but they happened around the same time: https://treeherder.mozilla.org/perf.html#/graphs?timerange=5184000&series=mozilla-central,1776166,1,10&series=mozilla-central,1775979,1,10

Flags: needinfo?(bob)

Priority: -- → P1

Joel Maher ( :jmaher ) (UTC -8)

Comment 3

•

7 years ago

:cpeterson, all bitbar testing was broken after a ssl cert was renewed and we had too many redirects accessing artifacts (i.e. bug 1499246). We fixed the issue for the most part, but we turned off all testing for a few days to understand the issue and ended up updating our docker image that runs the tests and serves the webpages to a new version of ubuntu that has new and updated libraries.

James Willcox (:snorp) (jwillcox@mozilla.com) (he/him)

Comment 4

•

7 years ago

My gut feeling here is that it's related to a11y stuff, since Speedometer does a bunch of form actions. Maybe Eitan has insight?

Flags: needinfo?(eitan)

Eitan Isaacson [:eeejay]

Comment 5

•

7 years ago

Does speedometer have any accessibility service active?

Joel Maher ( :jmaher ) (UTC -8)

Comment 6

•

7 years ago

pushing 3 different pushes, I see similar results to the graph- 2 regressions- since this just ran today in the last 1.5 hours, there are no changes going into the bitbar environment, only changes in the browser code. I think this is code changes. We can bisect this if needed- it takes a few hours to deal with tree updates, etc. but we could bisect on try if there are no culprits identified

Joel Maher ( :jmaher ) (UTC -8)

Comment 7

•

7 years ago

I don't think there is anything specific to a11y, we set preferences: https://searchfox.org/mozilla-central/source/testing/profiles/common/user.js https://searchfox.org/mozilla-central/source/testing/profiles/perf/user.js and the speedometer source is here: https://searchfox.org/mozilla-central/source/third_party/webkit/PerformanceTests/Speedometer

Flags: needinfo?(bob)

Joel Maher ( :jmaher ) (UTC -8)

Comment 8

•

7 years ago

I am not sure if we have made progress on this- as it would be some work to bisect, I haven't done that, but I would be happy to if we are not sure anything will fix the problem.

Eitan Isaacson [:eeejay]

Comment 9

•

7 years ago

The logs don't suggest accessibility being enabled. So I'm not sure how the a11y changes may have done this. Can't rule it out, though..

Flags: needinfo?(eitan)

Joel Maher ( :jmaher ) (UTC -8)

Comment 10

•

7 years ago

I have done a lot of bisecting on this and cannot come to any conclusions, it is obvious that we have a sustained change, but I see it show up in a smaller range, then I bisect more and it yields no results, as if I am unable to reproduce this.

Chris Peterson [:cpeterson]

Reporter

Comment 11

•

7 years ago

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #10) > I have done a lot of bisecting on this and cannot come to any conclusions, > it is obvious that we have a sustained change, but I see it show up in a > smaller range, then I bisect more and it yields no results, as if I am > unable to reproduce this. @ Joel, what is your smaller regression range? Is your range for the first (October 16) or second (October 27) regression? The regressions look more pronounced on android-hw-p2-8-0-android-aarch64 (ARM64), where we don't have Ion JIT yet (bug 1380129). What do you recommend we do next? With the smaller range, perhaps a Gecko developer can try to bisect by running Speedometer locally. @ Kannan, the range for the first regression (October 16 around [1]) might be [2] or [3]. Range [2] includes a JIT fix for bug 1496847. Could that cause a Speedometer regression on ARM? Range [3] includes a raptor test change for Sunspider Chrome bug 1487374. [1] https://treeherder.mozilla.org/perf.html#/graphs?timerange=5184000&series=mozilla-central,1776166,1,10&series=mozilla-central,1775979,1,10&zoom=1539315355586.678,1540316452294.6199,15.998876335915556,18.392134762881845 [2] https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=f2e35ed6a6928b35e085462dc4268fd86f00fdb9&tochange=31724aea10cae55f30b825ade226c4d25e11a899 [3] https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=9079bbe837184ed183b133a374753865b6768bc4&tochange=f2e35ed6a6928b35e085462dc4268fd86f00fdb9

Flags: needinfo?(kvijayan)

Flags: needinfo?(jmaher)

Comment 12

•

7 years ago

I had looked at the oct 27th regression, it seemed that I started narrowing down a range and then I wasn't able to reproduce it on a smaller range and I played whack a mole for a 100 try pushes. If you want I can look into the oct 16th regression tomorrow, let me know if that is desired.

Flags: needinfo?(jmaher)

Bob Clary [:bc] (inactive)

Comment 13

•

7 years ago

Remember that when using the bitbar devices we are getting a possibly different device each time which may affect the ability to bisect. It might be possible to manually bisect with a single local device.

Kannan Vijayan [:djvj]

Comment 14

•

7 years ago

Chris: No, that patch should behave strictly as speedup on all platforms.

Flags: needinfo?(kvijayan)

Chris Peterson [:cpeterson]

Reporter

Comment 15

•

7 years ago

[Tracking Requested - why for this release]: This is a performance regression on Speedometer, one of our benchmarks. GV 64 Nightly merged to Beta on 2018-10-22, so GV 64 would be affected by the apparent October 15 regression. GV 65 would be affected by both the October 15 and 27 regressions.

status-firefox64: unaffected → affected

tracking-firefox64: --- → ?

tracking-firefox65: --- → ?

Julien Cristau [:jcristau] (back August 18)

Comment 16

•

7 years ago

tracking as gv perf regression

tracking-firefox64: ? → +

tracking-firefox65: ? → +

Randell Jesup [:jesup] (needinfo me)

Comment 17

•

7 years ago

Eitan - speedometer is known for doing rapid-fire changing of form element states (creating checkboxes, checking 100 checkboxes at a time, deleting them). If any code is in that processing path in order to support accessibility (maintaining state, perhaps), perhaps it could have a perf impact even if a11y isn't in use at that time. Obviously trying to narrow regression ranges is the primary tool to attack this for now

Flags: needinfo?(eitan)

Randell Jesup [:jesup] (needinfo me)

Comment 18

•

7 years ago

*Possible* regression range for the first regression: https://hg.mozilla.org/mozilla-central/pushloghtml?startdate=Oct+15+2018&enddate=Oct+16+2018 Note: I'm eyeballing the distribution and saying it regressed late-15th-ish perhaps or *maybe* early 16th Roughly 17.75 to 17.0 from eyeballing the noise

Randell Jesup [:jesup] (needinfo me)

Comment 19

•

7 years ago

Looking at that range: Bug 1498699 was an android-only bug in that range... and Bug 1498715 (unlikely though). Bug 1479757 was for GV, but may be WebRender-only. Bug 1498329 seems very unlikely. There was a big non-android landing for tables (Bug 1484126 ). a JIT piece landed at the end of that which we've discussed: Bug 1496847 I presume this didn't impact desktop. jmaher: thoughts? How does this mesh with regressionrange tests you did, or did you only focus on the second?

Flags: needinfo?(jmaher)

Joel Maher ( :jmaher ) (UTC -8)

Comment 20

•

7 years ago

:jesup, I only focused on the second. basically someone could do this process: foreach revision in range: hg update revision <apply any patches to fix build or scheduling> ./mach try fuzzy -q 'android aarch64 speedometer' --rebuild 5 <remove any patches to fix build or scheduling> sadly this takes a long time for me locally and the cpu load on my machine is high so doing this while in a vidyo meeting creates a lot of interference. I did this on the second range and found that it was inconclusive which happens maybe 10% of the time when I bisect- usually more frequent the longer back in history I go and the more custom patches to get stuff to build or run.

Flags: needinfo?(jmaher)

Eitan Isaacson [:eeejay]

Comment 21

•

7 years ago

I'm about to try to bisect this locally by just loading the speedometer page on a dedicated device. That should be good enough, right? My goal is to observe that there *is* in fact a regression, and see if and how it is related to a11y.

Joel Maher ( :jmaher ) (UTC -8)

Comment 22

•

7 years ago

loading speedometer on a device while iterating through builds should show a regression; I would at least run it twice for each build; I did this one day to compare against google chrome and different versions of the browser- overall I got a strong signal. As a note, this is on a pixel 2 device.

Randell Jesup [:jesup] (needinfo me)

Comment 23

•

7 years ago

Eitan - yes, dedicated device should be good. You can bisect on m-c merges in that time range (from archive) once you verify there's a difference from (say) 10/14 vs 10/17.

Randell Jesup [:jesup] (needinfo me)

Comment 24

•

7 years ago

Ok, the Oct 16th regression in speedometer is understood: glandium backed out Android LTO: "Back out changeset 9481b9ecaf2f (bug 1480006) as being a suspect in the rise of crashes such as bug 1489953. " See https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&searchStr=pixel2%2Cspeedometer&selectedJob=211353831&revision=9079bbe837184ed183b133a374753865b6768bc4 I identified the regression looking at m-c pushes with https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&searchStr=pixel2%2Cspeedometer&startdate=2018-10-15&enddate=2018-10-17&selectedJob=211353831, then retriggered the speedometer runs around the regression before noticing this backout (above). So it's just the Oct 27th regression to understand

Randell Jesup [:jesup] (needinfo me)

Comment 25

•

7 years ago

> ./mach try fuzzy -q 'android aarch64 speedometer' --rebuild 5 So long as I reland the bitbar stuff for each push (and be careful of glandium landing and backing out LTO in this range, it works. So the range is down to 8 changesets: 524c546c7ad5 (good) to ab584824a073 (bad) -- 4 bugs, 8 changesets (5 for 1 bug). One changeset is webgl only (unlikely...) Two are by glandium and involve possible perf issues, though one looks to be desktop-only, and the other is in graphite(!?) and so seems unlikely: 434f70360933 Mike Hommey — Bug 1492663 - Upgrade most CI builds to clang 7 r=froydnj 02ac5668f849 Mike Hommey — Bug 1498072 - Prevent inlining of the direct_run function. r=jfkthame The other patches are for 2 bugs: 42961627bc9a Ryan Hunt — Bug 1453425 - Remove mScrollPosAtLastPaint from nsGfxScrollFrame. r=botond (and 4 more patches) 5ef21255897e Jeff Gilbert — Bug 1478216 - Don't init tex images in FBAttachment::IsComplete. r=kvark Jeff's bug appears to be all WebGL. So if glandium's patches don't pan out as the cause, it's bug 1453425

Randell Jesup [:jesup] (needinfo me)

Comment 26

•

7 years ago

Try run with 434f70360933 Mike Hommey — Bug 1492663 is slow, so it has to be that or bug 1498072. Fast: https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=211484789&revision=d49d773fe06dad3b08cbbef6c904d2d0fe9c90c5 Slow: https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=211484789&revision=4d30905771e75c7a3882343c59874c81ce53f40f (includes both patches for the bugs above) Still running: https://treeherder.mozilla.org/#/jobs?repo=try&revision=0bb91544975ad5911033f09dfca051c3bf3f29af (which has Bug 1498072 - Prevent inlining of the direct_run function, but not the other patch).

Flags: needinfo?(eitan) → needinfo?(mh+mozilla)

Randell Jesup [:jesup] (needinfo me)

Comment 27

•

7 years ago

> Still running: > https://treeherder.mozilla.org/#/ > jobs?repo=try&revision=0bb91544975ad5911033f09dfca051c3bf3f29af > (which has Bug 1498072 - Prevent inlining of the direct_run function, but not the other patch). That was fast, so it's bug 1492663

Blocks: 1492663

Summary: GeckoView Speedometer regression on Google Pixel 2 (ARM32 and ARM64) → GeckoView Speedometer regression on Google Pixel 2 (ARM32 and ARM64) from bug 1492663

Chris Peterson [:cpeterson]

Reporter

Comment 28

•

7 years ago

(In reply to Randell Jesup [:jesup] from comment #27) > That was fast, so it's bug 1492663 Thanks for pinpointing the regression, Randell. That's a bummer. There's not a lot we can do to recover from a performance regression caused by upgrading to a new clang version. Did macOS or Linux regress on Speedometer after upgrading to clang 7? Skimming the clang 7 release notes [1] for any interesting changes: * -fmerge-all-constants is no longer applied by default because it is a non-conforming optimization and miscompiles the Linux kernel [2]. Maybe we should try re-enabling? * The new -fforce-emit-vtables option forces the emission of vtables even in modules where it isn't necessary, which increases opportunities for devirtualisation. Forcing emitting of vtable adds a reference of these inline virtual functions. Note that GCC was always doing it. [3] [1] http://releases.llvm.org/7.0.0/tools/clang/docs/ReleaseNotes.html [2] https://patchwork.ozlabs.org/patch/888487/ [3] https://reviews.llvm.org/rL334600

Mike Hommey [:glandium]

Assignee

Comment 29

•

7 years ago

(In reply to Chris Peterson [:cpeterson] from comment #28) > (In reply to Randell Jesup [:jesup] from comment #27) > > That was fast, so it's bug 1492663 > > Thanks for pinpointing the regression, Randell. That's a bummer. There's not > a lot we can do to recover from a performance regression caused by upgrading > to a new clang version. Did macOS or Linux regress on Speedometer after > upgrading to clang 7? Slightly improved on linux https://treeherder.mozilla.org/perf.html#/graphs?timerange=2592000&series=mozilla-inbound,1640394,1,1&series=mozilla-inbound,1641346,1,1&highlightedRevisions=434f70360933 Unchanged on Mac https://treeherder.mozilla.org/perf.html#/graphs?timerange=2592000&series=mozilla-central,1647998,1,1&highlightedRevisions=434f70360933 Or is it one of those benchmarks where the lower score is better?

Flags: needinfo?(mh+mozilla)

Chris Peterson [:cpeterson]

Reporter

Comment 30

•

7 years ago

(In reply to Mike Hommey [:glandium] from comment #29) > Or is it one of those benchmarks where the lower score is better? No. A higher Speedometer score is better, so Linux did improve slightly.

Chris Peterson [:cpeterson]

Reporter

Comment 31

•

7 years ago

[qf] to put this regression on the Quantum Flow team's radar. We don't have an action plan at this time.

Summary: GeckoView Speedometer regression on Google Pixel 2 (ARM32 and ARM64) from bug 1492663 → GeckoView Speedometer regression on Google Pixel 2 (ARM32 and ARM64) from clang 7 bug 1492663

Whiteboard: [qf]

Mike Hommey [:glandium]

Assignee

Comment 32

•

7 years ago

Attached file Bug 1503330 - Switch android arm/aarch64 opt/nightly builds back to clang 6. — Details

(Away)

Comment 33

•

7 years ago

Has anyone tried bisecting LLVM trunk to see if this is due to a specific change? It may give us a clue on how to work around the regression or at least tell us who to poke upstream.

Pulsebot

Comment 34

•

7 years ago

Pushed by mh@glandium.org: https://hg.mozilla.org/integration/autoland/rev/386ec4aee534 Switch android arm/aarch64 opt/nightly builds back to clang 6. r=dmajor

Pulsebot

Comment 35

•

7 years ago

Pushed by apavel@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/3a0b29de3cfb fix clang failures on a CLOSED TREE

Arthur Iakab [arthur_iakab]

Comment 36

•

7 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/386ec4aee534 https://hg.mozilla.org/mozilla-central/rev/3a0b29de3cfb

Status: NEW → RESOLVED

Closed: 7 years ago

status-firefox65: affected → fixed

Resolution: --- → FIXED

Target Milestone: --- → Firefox 65

BugBot [:suhaib / :marco/ :calixte]

Updated

•

7 years ago

Assignee: nobody → mh+mozilla

David Bolter [:davidb] (NeedInfo me for attention)

Comment 37

•

7 years ago

Thanks Mike! I see FF64 is marked tracking and affected so I think that means we'll need to do the same thing for beta? (See comment 15)

Flags: needinfo?(mh+mozilla)

Chris Peterson [:cpeterson]

Reporter

Comment 38

•

7 years ago

We can see Speedometer performance restored on Perfherder already: https://treeherder.mozilla.org/perf.html#/graphs?series=mozilla-central,1775979,1,10&series=mozilla-central,1776360,1,10&series=mozilla-central,1776166,1,10 android-hw-p2-8-0-android-aarch64 was impacted the most. aarch64 currently only has Baseline JIT support, so apparently clang 7 was hurting Baseline JIT code more than Ion JIT code. Once we have ARM64 support for Ion (bug 1380129), we can probably switch back to clang 7 (unless we find a clang 7 fix before then).

Chris Peterson [:cpeterson]

Reporter

Comment 39

•

7 years ago

(In reply to David Bolter [:davidb] (NeedInfo me for attention) from comment #37) > Thanks Mike! I see FF64 is marked tracking and affected so I think that > means we'll need to do the same thing for beta? (See comment 15) This bug covers two Speedometer regressions: one in 64 Nightly on October ~16 and one in 65 Nightly on October 27. The 65 regression was clang 7 bug 1492663, which has been fixed by reverting to clang 6. As such, I am unneedinfo'ing Mike because I don't think any action is needed from him. According to Randell's comment 24, the 64 regression appears to be the backout [1] of Android LTO bug 1480006 from the first time Mike tried to land it. So 64 did not actually regress; it was a return to pre-LTO performance. I'll mark 64=unaffected. The good news is that we now know that LTO measurably improved our performance on Android! [1] https://bugzilla.mozilla.org/show_bug.cgi?id=1480006#c12

status-firefox64: affected → unaffected

Flags: needinfo?(mh+mozilla)

Comment 40

•

7 years ago

(In reply to Chris Peterson [:cpeterson] from comment #38) > We can see Speedometer performance restored on Perfherder already: > > https://treeherder.mozilla.org/perf.html#/graphs?series=mozilla-central, > 1775979,1,10&series=mozilla-central,1776360,1,10&series=mozilla-central, > 1776166,1,10 Wait, we have talos for android?!

Mike Hommey [:glandium]

Assignee

Comment 41

•

7 years ago

Taking a step back here, it seems the bug summary is wrong, in that arm32 was not affected. I'm bisecting, and I see no differences there along the way. Only on arm64. What this tells me is that we could leave arm32 alone. And since we're not shipping aarch64... do we really need to keep using clang 6 for it (when it would be only platform using it)?

Flags: needinfo?(cpeterson)

Chris Peterson [:cpeterson]

Reporter

Comment 42

•

7 years ago

(In reply to Mike Hommey [:glandium] from comment #40) > Wait, we have talos for android?! Not all of Talos. We're running Speedometer, Unity WebGL, and soon the Talos tp6 page load benchmarks using a new test runner (for all Firefox platforms) called Raptor. (In reply to Mike Hommey [:glandium] from comment #41) > Taking a step back here, it seems the bug summary is wrong, in that arm32 > was not affected. I'm bisecting, and I see no differences there along the > way. Only on arm64. What this tells me is that we could leave arm32 alone. > And since we're not shipping aarch64... do we really need to keep using > clang 6 for it (when it would be only platform using it)? sstangl is adding Ion support for arm64 in bug 1380129, which should land in 2019 Q1. In the meantime, we could go back to clang 7 on all platforms, accepting the arm64 regression with the hope that Ion will recover the clang 7 regression. We can't stay on clang 6 forever, but perhaps arm64 would still be faster with clang 6 than clang 7 even after we have Ion support? As a Build peer, the compiler decisions are yours. If I squint at this Perfherder graph, I think both arm32 and arm64 regressed on October 27, but arm64 was definitely hit harder: https://treeherder.mozilla.org/perf.html#/graphs?timerange=5184000&series=mozilla-central,1775979,1,10&series=mozilla-central,1776360,1,10&series=mozilla-central,1776166,1,10 We don't have Ion JIT support for arm64 yet, which is why arm64 (android-hw-p2-8-0-android-aarch64) is so much slower than arm32 ( android-hw-p2-8-0-arm7-api-16) on the same device (Google Pixel 2).

Flags: needinfo?(cpeterson)

Mike Hommey [:glandium]

Assignee

Updated

•

7 years ago

Blocks: 1508547

Chris Peterson [:cpeterson]

Reporter

Comment 43

•

7 years ago

status-geckoview64=unaffected

status-geckoview64: --- → unaffected

BMO Automation

Updated

•

7 years ago

Product: Firefox for Android → GeckoView

Dylan Hardison [:dylan] (he/him)

Updated

•

7 years ago

Target Milestone: Firefox 65 → mozilla65

Dave Hunt [:davehunt] [he/him] ⌚BST

Updated

•

3 years ago

Performance Impact: --- → ?

Whiteboard: [qf]