Bug 1581529 (Closed): Opened 5 years ago, Closed 5 years ago

Alert #23063 - autoland - 19.71 - 1584.43% raptor-tp6m (android-hw-p2-8-0-android-aarch64, android-hw-p2-8-0-arm7-api-16) regression on push 72d9ad70e1ba7d56e9e3732cfd912f53c348a9a8 (Sat Sep 14 2019)

Categories

(Testing :: Performance, defect)

Version: 3
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: alexandrui, Unassigned)

References

Details

(Keywords: perf, perf-alert)

Raptor has detected a Firefox performance regression from push:

https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=6a36994d2e14869283ea43331eeeaedac5be9e7c&tochange=72d9ad70e1ba7d56e9e3732cfd912f53c348a9a8

As you are the author of one of the patches included in that push, we need your help to address this regression.

Regressions:

1584% raptor-tp6m-cnn-geckoview-cold loadtime android-hw-p2-8-0-arm7-api-16 pgo 3,690.83 -> 62,169.54
1549% raptor-tp6m-cnn-geckoview-cold loadtime android-hw-p2-8-0-android-aarch64 pgo 3,894.92 -> 64,225.08
449% raptor-tp6m-espn-geckoview-cold loadtime android-hw-p2-8-0-android-aarch64 pgo 2,411.25 -> 13,244.33
309% raptor-tp6m-ebay-kleinanzeigen-geckoview-cold loadtime android-hw-p2-8-0-arm7-api-16 pgo 1,863.83 -> 7,614.17
279% raptor-tp6m-cnn-geckoview-cold android-hw-p2-8-0-android-aarch64 pgo 2,161.57 -> 8,185.85
276% raptor-tp6m-cnn-geckoview-cold android-hw-p2-8-0-arm7-api-16 pgo 1,896.39 -> 7,121.63
273% raptor-tp6m-allrecipes-geckoview-cold loadtime android-hw-p2-8-0-android-aarch64 pgo 3,073.17 -> 11,461.42
253% raptor-tp6m-microsoft-support-geckoview-cold loadtime android-hw-p2-8-0-arm7-api-16 pgo 2,287.83 -> 8,082.33
252% raptor-tp6m-amazon-search-geckoview-cold loadtime android-hw-p2-8-0-android-aarch64 pgo 1,639.25 -> 5,764.92
248% raptor-tp6m-ebay-kleinanzeigen-search-geckoview-cold loadtime android-hw-p2-8-0-arm7-api-16 pgo 1,900.75 -> 6,623.08
248% raptor-tp6m-ebay-kleinanzeigen-geckoview-cold loadtime android-hw-p2-8-0-android-aarch64 pgo 1,904.29 -> 6,630.00
241% raptor-tp6m-ebay-kleinanzeigen-search-geckoview-cold loadtime android-hw-p2-8-0-android-aarch64 pgo 1,937.92 -> 6,606.75
235% raptor-tp6m-amazon-search-geckoview-cold loadtime android-hw-p2-8-0-arm7-api-16 pgo 1,554.33 -> 5,210.33
228% raptor-tp6m-booking-geckoview-cold loadtime android-hw-p2-8-0-arm7-api-16 pgo 1,309.62 -> 4,300.75
225% raptor-tp6m-stackoverflow-geckoview-cold android-hw-p2-8-0-android-aarch64 pgo 959.03 -> 3,121.32
221% raptor-tp6m-facebook-cristiano-geckoview-cold loadtime android-hw-p2-8-0-arm7-api-16 pgo 1,558.08 -> 5,003.67
217% raptor-tp6m-stackoverflow-geckoview-cold loadtime android-hw-p2-8-0-android-aarch64 pgo 1,333.21 -> 4,227.92
212% raptor-tp6m-stackoverflow-geckoview-cold fcp android-hw-p2-8-0-android-aarch64 pgo 922.29 -> 2,881.83
210% raptor-tp6m-booking-geckoview-cold loadtime android-hw-p2-8-0-android-aarch64 pgo 1,287.58 -> 3,990.92
207% raptor-tp6m-facebook-cristiano-geckoview-cold loadtime android-hw-p2-8-0-android-aarch64 pgo 1,618.25 -> 4,967.67
203% raptor-tp6m-stackoverflow-geckoview-cold android-hw-p2-8-0-arm7-api-16 pgo 977.65 -> 2,962.25
201% raptor-tp6m-imdb-geckoview-cold fcp android-hw-p2-8-0-android-aarch64 pgo 965.88 -> 2,909.58
196% raptor-tp6m-stackoverflow-geckoview-cold loadtime android-hw-p2-8-0-arm7-api-16 pgo 1,352.79 -> 4,010.25
194% raptor-tp6m-imdb-geckoview-cold android-hw-p2-8-0-arm7-api-16 pgo 1,086.66 -> 3,192.44
192% raptor-tp6m-stackoverflow-geckoview-cold fcp android-hw-p2-8-0-arm7-api-16 pgo 937.88 -> 2,735.75
191% raptor-tp6m-espn-geckoview-cold android-hw-p2-8-0-android-aarch64 pgo 1,988.10 -> 5,794.99
188% raptor-tp6m-imdb-geckoview-cold android-hw-p2-8-0-android-aarch64 pgo 1,147.75 -> 3,305.06
174% raptor-tp6m-jianshu-geckoview-cold loadtime android-hw-p2-8-0-android-aarch64 pgo 1,325.96 -> 3,627.92
171% raptor-tp6m-bbc-geckoview-cold loadtime android-hw-p2-8-0-android-aarch64 pgo 2,990.29 -> 8,098.08
169% raptor-tp6m-cnn-ampstories-geckoview-cold loadtime android-hw-p2-8-0-arm7-api-16 pgo 1,351.79 -> 3,630.58
166% raptor-tp6m-facebook-geckoview-cold android-hw-p2-8-0-arm7-api-16 pgo 1,013.27 -> 2,697.92
164% raptor-tp6m-facebook-geckoview-cold android-hw-p2-8-0-android-aarch64 pgo 1,053.57 -> 2,785.34
164% raptor-tp6m-bbc-geckoview-cold loadtime android-hw-p2-8-0-arm7-api-16 pgo 2,931.71 -> 7,750.25
164% raptor-tp6m-facebook-geckoview-cold loadtime android-hw-p2-8-0-arm7-api-16 pgo 1,568.88 -> 4,145.33
163% raptor-tp6m-google-maps-geckoview-cold loadtime android-hw-p2-8-0-arm7-api-16 pgo 811.00 -> 2,131.42
160% raptor-tp6m-espn-geckoview-cold android-hw-p2-8-0-arm7-api-16 pgo 2,017.98 -> 5,245.44
159% raptor-tp6m-facebook-geckoview-cold fcp android-hw-p2-8-0-arm7-api-16 pgo 916.17 -> 2,375.58
159% raptor-tp6m-microsoft-support-geckoview-cold android-hw-p2-8-0-arm7-api-16 pgo 1,184.58 -> 3,063.69
158% raptor-tp6m-facebook-geckoview-cold fcp android-hw-p2-8-0-android-aarch64 pgo 952.08 -> 2,456.00
157% raptor-tp6m-facebook-geckoview-cold loadtime android-hw-p2-8-0-android-aarch64 pgo 1,648.25 -> 4,235.83
150% raptor-tp6m-amazon-search-geckoview-cold android-hw-p2-8-0-android-aarch64 pgo 1,012.03 -> 2,531.29
146% raptor-tp6m-imdb-geckoview-cold loadtime android-hw-p2-8-0-arm7-api-16 pgo 2,406.88 -> 5,913.50
145% raptor-tp6m-jianshu-geckoview-cold loadtime android-hw-p2-8-0-arm7-api-16 pgo 1,295.00 -> 3,178.50
138% raptor-tp6m-imdb-geckoview-cold loadtime android-hw-p2-8-0-android-aarch64 pgo 2,526.96 -> 6,024.75
133% raptor-tp6m-facebook-cristiano-geckoview-cold android-hw-p2-8-0-arm7-api-16 pgo 979.66 -> 2,280.46
132% raptor-tp6m-jianshu-geckoview-cold android-hw-p2-8-0-android-aarch64 pgo 928.61 -> 2,158.39
128% raptor-tp6m-youtube-geckoview-cold loadtime android-hw-p2-8-0-android-aarch64 pgo 922.79 -> 2,105.33
120% raptor-tp6m-amazon-search-geckoview-cold android-hw-p2-8-0-arm7-api-16 pgo 1,037.87 -> 2,280.37
118% raptor-tp6m-jianshu-geckoview-cold fcp android-hw-p2-8-0-android-aarch64 pgo 737.33 -> 1,609.92
116% raptor-tp6m-facebook-cristiano-geckoview-cold android-hw-p2-8-0-android-aarch64 pgo 988.87 -> 2,137.04
115% raptor-tp6m-aframeio-animation-geckoview-cold android-hw-p2-8-0-android-aarch64 pgo 837.25 -> 1,803.52
115% raptor-tp6m-aframeio-animation-geckoview-cold fcp android-hw-p2-8-0-android-aarch64 pgo 612.17 -> 1,317.08
115% raptor-tp6m-amazon-search-geckoview-cold fcp android-hw-p2-8-0-android-aarch64 pgo 919.21 -> 1,975.25
113% raptor-tp6m-bbc-geckoview-cold android-hw-p2-8-0-android-aarch64 pgo 1,794.84 -> 3,828.49
113% raptor-tp6m-aframeio-animation-geckoview-cold android-hw-p2-8-0-arm7-api-16 pgo 868.04 -> 1,849.84
111% raptor-tp6m-aframeio-animation-geckoview-cold fcp android-hw-p2-8-0-arm7-api-16 pgo 644.58 -> 1,362.50
110% raptor-tp6m-bbc-geckoview-cold android-hw-p2-8-0-arm7-api-16 pgo 1,750.54 -> 3,676.55
108% raptor-tp6m-aframeio-animation-geckoview-cold loadtime android-hw-p2-8-0-arm7-api-16 pgo 2,237.25 -> 4,656.50
108% raptor-tp6m-amazon-geckoview-cold loadtime android-hw-p2-8-0-android-aarch64 pgo 1,017.50 -> 2,112.67
104% raptor-tp6m-facebook-cristiano-geckoview-cold fcp android-hw-p2-8-0-arm7-api-16 pgo 857.21 -> 1,751.33
104% raptor-tp6m-jianshu-geckoview-cold android-hw-p2-8-0-arm7-api-16 pgo 901.28 -> 1,839.15
104% raptor-tp6m-aframeio-animation-geckoview-cold loadtime android-hw-p2-8-0-android-aarch64 pgo 2,227.50 -> 4,536.50
104% raptor-tp6m-ebay-kleinanzeigen-geckoview-cold android-hw-p2-8-0-arm7-api-16 pgo 842.10 -> 1,714.19
102% raptor-tp6m-espn-geckoview-cold fcp android-hw-p2-8-0-android-aarch64 pgo 3,835.21 -> 7,743.33
102% raptor-tp6m-web-de-geckoview-cold loadtime android-hw-p2-8-0-android-aarch64 pgo 2,068.67 -> 4,175.42
101% raptor-tp6m-microsoft-support-geckoview-cold fcp android-hw-p2-8-0-arm7-api-16 pgo 1,301.79 -> 2,612.08
99% raptor-tp6m-youtube-watch-geckoview-cold loadtime android-hw-p2-8-0-android-aarch64 pgo 1,206.71 -> 2,405.33
97% raptor-tp6m-web-de-geckoview-cold loadtime android-hw-p2-8-0-arm7-api-16 pgo 2,074.50 -> 4,087.00
92% raptor-tp6m-cnn-geckoview-cold fcp android-hw-p2-8-0-android-aarch64 pgo 4,067.29 -> 7,824.92
91% raptor-tp6m-wikipedia-geckoview-cold loadtime android-hw-p2-8-0-arm7-api-16 pgo 695.62 -> 1,326.67
91% raptor-tp6m-microsoft-support-geckoview-cold fcp android-hw-p2-8-0-android-aarch64 pgo 1,357.54 -> 2,586.75
89% raptor-tp6m-espn-geckoview-cold fcp android-hw-p2-8-0-arm7-api-16 pgo 3,781.92 -> 7,155.25
86% raptor-tp6m-wikipedia-geckoview-cold loadtime android-hw-p2-8-0-android-aarch64 pgo 717.54 -> 1,336.58
86% raptor-tp6m-facebook-cristiano-geckoview-cold fcp android-hw-p2-8-0-android-aarch64 pgo 884.29 -> 1,646.00
85% raptor-tp6m-bbc-geckoview-cold fcp android-hw-p2-8-0-android-aarch64 pgo 1,405.88 -> 2,605.08
84% raptor-tp6m-booking-geckoview-cold android-hw-p2-8-0-arm7-api-16 pgo 506.75 -> 933.04
83% raptor-tp6m-ebay-kleinanzeigen-search-geckoview-cold android-hw-p2-8-0-android-aarch64 pgo 851.53 -> 1,560.87
82% raptor-tp6m-bbc-geckoview-cold fcp android-hw-p2-8-0-arm7-api-16 pgo 1,376.25 -> 2,508.25
82% raptor-tp6m-ebay-kleinanzeigen-search-geckoview-cold android-hw-p2-8-0-arm7-api-16 pgo 855.44 -> 1,558.08
79% raptor-tp6m-booking-geckoview-cold android-hw-p2-8-0-android-aarch64 pgo 495.29 -> 888.02
76% raptor-tp6m-cnn-ampstories-geckoview-cold android-hw-p2-8-0-arm7-api-16 pgo 661.34 -> 1,164.56
72% raptor-tp6m-amazon-geckoview-cold android-hw-p2-8-0-android-aarch64 pgo 741.32 -> 1,276.99
65% raptor-tp6m-ebay-kleinanzeigen-geckoview-cold android-hw-p2-8-0-android-aarch64 pgo 865.21 -> 1,430.53
64% raptor-tp6m-web-de-geckoview-cold android-hw-p2-8-0-android-aarch64 pgo 751.74 -> 1,233.08
54% raptor-tp6m-cnn-ampstories-geckoview-cold fcp android-hw-p2-8-0-arm7-api-16 pgo 1,031.98 -> 1,593.00
50% raptor-tp6m-web-de-geckoview-cold fcp android-hw-p2-8-0-android-aarch64 pgo 494.92 -> 741.25
39% raptor-tp6m-youtube-geckoview-cold android-hw-p2-8-0-android-aarch64 pgo 572.99 -> 799.00
39% raptor-tp6m-wikipedia-geckoview-cold android-hw-p2-8-0-arm7-api-16 pgo 554.66 -> 772.25
31% raptor-tp6m-google-maps-geckoview-cold android-hw-p2-8-0-arm7-api-16 pgo 581.50 -> 760.56
26% raptor-tp6m-youtube-geckoview-cold fcp android-hw-p2-8-0-android-aarch64 pgo 679.67 -> 855.08
23% raptor-tp6m-wikipedia-geckoview-cold fcp android-hw-p2-8-0-arm7-api-16 pgo 548.25 -> 674.67
20% raptor-tp6m-youtube-geckoview-cold fcp android-hw-p2-8-0-arm7-api-16 pgo 667.25 -> 798.75

You can find links to graphs and comparison views for each of the above tests at:
autoland: https://treeherder.mozilla.org/perf.html#/alerts?id=23063
beta: https://treeherder.mozilla.org/perf.html#/alerts?id=23104
inbound: https://treeherder.mozilla.org/perf.html#/alerts?id=23118

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a Treeherder page showing the Raptor jobs in a pushlog format.

To learn more about the regressing test(s) or reproducing them, please see: https://wiki.mozilla.org/TestEngineering/Performance/Raptor

android-hw/bitbar started using a new python2 binary in the Docker image that is unoptimized but newer than what ships with Ubuntu 16.04. I discussed this regression with :bc, and he didn't think that the performance of Python on the host would significantly affect device timing.
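For anyone who wants to verify the "unoptimized" claim on a given host, one quick check is the interpreter's recorded configure flags. This is a minimal sketch assuming a stock CPython build; the flags checked are the standard configure options, not anything specific to the Bitbar image:

# Minimal sketch: check whether this CPython interpreter was configured
# with the usual speed optimizations (PGO via --enable-optimizations,
# link-time optimization via --with-lto). Works on Python 2.7 and 3.x.
import sysconfig

config_args = sysconfig.get_config_var("CONFIG_ARGS") or ""
print("CONFIG_ARGS:", config_args)
print("PGO enabled:", "--enable-optimizations" in config_args)
print("LTO enabled:", "--with-lto" in config_args)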

Bitbar also fully implemented CPU and memory limiting of containers at around 10:45 AM (PDT) on Friday.
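For readers unfamiliar with what such limiting looks like: we don't know Bitbar's exact mechanism or values, but as a purely illustrative sketch, container caps like these can be applied through the docker Python SDK (the image name and limit values below are hypothetical):

# Illustrative only -- not Bitbar's actual configuration. Start a container
# with hard CPU and memory caps using the docker Python SDK (docker-py).
import docker

client = docker.from_env()
container = client.containers.run(
    "test-worker:latest",    # hypothetical image name
    detach=True,
    mem_limit="4g",          # cap memory at 4 GiB
    memswap_limit="4g",      # equal to mem_limit, so no extra swap
    nano_cpus=2 * 10**9,     # 2 CPUs, expressed in units of 1e-9 CPUs
)
print(container.name, "started with CPU/memory limits")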

Please let me know if the bisect doesn't reveal anything.

I was thinking about it again this morning when Bebe asked about Friday's image changes, and I realized that if Python's performance in running mitmproxy and/or the web servers were impacted, that could explain a change in the performance measurements.

aerickson: Could you put together a timeline of the images we used on Thursday and Friday?

Flags: needinfo?(aerickson)

Wed 12PM Pacific: Use 2019/9/10 image with compiled python2/3.
Thu 2PM Pacific: Use 2019/8/13 image (known good).
Fri 5PM Pacific: Use 2019/9/13 image (python 2/3 with g-w16).

We're going to go back to the old Ubuntu-provided py2. I'll post here when it goes live.

Flags: needinfo?(aerickson)

We have this alert for inbound also: https://treeherder.mozilla.org/perf.html#/alerts?id=23104
I won't mark it as downstream because of its size. It's more readable this way.

The latest image with the older Ubuntu-provided pythons is now live.

Initial results suggest that we're still seeing the regression. :aerickson were there any other infrastructure changes on Friday that may contribute to this regression? :alexandrui could you confirm that the regression is still present?

Flags: needinfo?(alexandru.ionescu)
Flags: needinfo?(aerickson)

Retriggered some tasks; waiting for the results.

Flags: needinfo?(alexandru.ionescu)

I came across some Fenix graphs, such as this one and this one.

If you look back at revision 106458c2c93ae, from around Aug 30, and head over to Treeherder (e.g.), you'll see its jobs are very recent. They're from today, Sep 19, and exhibit this very regression.

Are Fenix revisions hardcoded to use a very specific GeckoView revision? If so, that's one more reason to conclude this is an infra problem.

First results of the retrigger: raptor-tp6m-youtube-geckoview-cold loadtime pgo. The regression is not fixed.

(In reply to Andrew Erickson [:aerickson] from comment #2)

Bitbar also fully implemented CPU and memory limiting of containers at around 10:45 AM (PDT) on Friday.

This might also be the cause of the regression.
I'll ask Bitbar on Slack whether we can disable the limits temporarily to see if this is the cause.

I did some retriggers myself here, around Sep 6.

Now my own results have arrived: the infra regression is still there. We see similarly high noise and regressions even on fairly old revisions, dating from well before the suspect patch.

(In reply to Ionuț Goldan [:igoldan], Performance Sheriff from comment #10)

Are Fenix revisions hardcoded to use a very specific GeckoView revision? If so, that's one more reason to conclude this is an infra problem.

According to Sebastian Kaspari:

No, not really. They get their GV version indirectly from "Android Components". Both versions (Nightly/Beta) are updated automatically as long as there's no API breakage.

Where can we get notifications when Bitbar infra changes are planned for deployment or have been deployed?

Flags: needinfo?(dave.hunt)

:davehunt, no other changes were made. The memory and CPU limits were added because the entire cluster was crashing due to perf tests using all available memory.

Flags: needinfo?(aerickson)

(In reply to Dave Hunt [:davehunt] [he/him] ⌚BST from comment #7)

Initial results suggest that we're still seeing the regression. :aerickson were there any other infrastructure changes on Friday that may contribute to this regression? :alexandrui could you confirm that the regression is still present?

It looks like the baseline has returned to its original value from before the infra-caused regressions were noticed; it just took a while to show up.
The Android pool seems to produce these results more slowly than the desktop infras (Windows/Linux/macOS).

I suggest we revisit this after the weekend, so Perfherder can collect enough data. Then we can quickly skim all our Android perf graphs.
If all of their baselines have returned to normal, we can close this bug.

:bc What changes were made to the infrastructure to fix this regression?

Flags: needinfo?(bob)

We removed the CPU and memory limits imposed on the Docker containers, which had been put into place last Friday (9/13) to prevent containers from consuming all of the resources on the cluster host.

We had attempted to do this last week as part of our recovery from the host outages by reverting to a known-good image from August, but when that failed, we applied the limits.

Yesterday, while investigating whether the CPU and memory limits were the cause of this regression, we removed the memory limits from the containers and performed limited testing while tracking memory and CPU use on the hosts. For some reason we were no longer able to reproduce the high memory usage we had seen before, and once we were convinced we were okay, we kept the containers running without CPU and memory limits.
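As a rough sketch of the kind of tracking described above (we don't know the exact tooling Bitbar used), per-container memory and CPU usage can be sampled with the docker Python SDK:

# Rough sketch, assuming the docker Python SDK (docker-py) is available:
# take a one-shot memory/CPU snapshot of every running container.
import docker

client = docker.from_env()
for container in client.containers.list():
    stats = container.stats(stream=False)        # single stats snapshot
    mem_bytes = stats["memory_stats"].get("usage", 0)
    cpu_ns = stats["cpu_stats"]["cpu_usage"]["total_usage"]
    print("%s: mem=%.1f MiB, cumulative cpu=%d ns"
          % (container.name, mem_bytes / (1024.0 * 1024.0), cpu_ns))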

The current image contains the updated generic-worker and some other changes, but it does not contain the compiled python2 and python3 that were introduced to support the condprofile work, which we believe were the root cause of the memory-induced outage last week.

I believe that the attempt to revert to the August image failed to actually use the correct image, which caused us to erroneously believe the memory issue was related to a changed test.

Going forward, we will be more careful about making certain that the intended image is being used and validating that new images do not introduce regressions. As for supporting the condprofile work, aerickson is working on standing up an Ubuntu 18.04 image which natively supports newer python2 and python3 versions. We will test this extensively for memory use and regressions before switching over.

Flags: needinfo?(bob)

The regression has been fixed since Thursday and has stayed that way. I haven't seen any intermittents yet, either.

(In reply to Bob Clary [:bc:] from comment #19)

Going forward, we will be more careful about making certain that the intended image is being used and validating that new images do not introduce regressions. As for supporting the condprofile work, aerickson is working on standing up an Ubuntu 18.04 image which natively supports newer python2 and python3 versions. We will test this extensively for memory use and regressions before switching over.

So should we consider this regression fixed and the cause 100% identified? Could we wrap this up?

Flags: needinfo?(bob)

(In reply to Alexandru Ionescu :alexandrui from comment #21)

So should we consider this regression fixed and the cause 100% identified? Could we wrap this up?

Yes, this is resolved, and the alert should be tagged as infrastructure.

Status: NEW → RESOLVED
Closed: 5 years ago
Flags: needinfo?(dave.hunt)
Flags: needinfo?(bob)
Resolution: --- → FIXED
Blocks: 1592626
No longer blocks: 1592626