Open Bug 1548837 Opened 5 years ago Updated 2 years ago

14.69 - 30.82% raptor-tp6-docs-firefox / raptor-tp6-docs-firefox fcp (osx-10-10-shippable) regression on push f5f8345a5c873fc99459f53d2eb14247cbc5473a (Thu May 2 2019)

Categories

(Core :: Graphics: ImageLib, defect, P3)

Hardware: Unspecified
OS: macOS
Type: defect

People

(Reporter: ccoroiu, Unassigned)

References

(Regression)

Raptor has detected a Firefox performance regression from push:

https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=d9a454afe9bf4e6299021d96948f135ecbc827ca&tochange=c477a96a488995459baab0948a10cb7231d7cbe3

You are the author of one of the patches included in that push, and we need your help to address this regression.

Regressions:

31% raptor-tp6-docs-firefox fcp osx-10-10-shippable opt 1,373.96 -> 1,797.38
15% raptor-tp6-docs-firefox osx-10-10-shippable opt 1,387.57 -> 1,591.39

Improvements:

27% raptor-tp6-microsoft-firefox fcp linux64-shippable opt 481.75 -> 353.62
26% raptor-tp6-microsoft-firefox fcp linux64-shippable-qr opt 513.71 -> 381.62
26% raptor-tp6-microsoft-firefox fcp windows7-32-shippable opt 542.71 -> 404.25
24% raptor-tp6-microsoft-firefox fcp windows10-64-shippable opt 526.62 -> 402.88
23% raptor-tp6-microsoft-firefox fcp windows10-64-shippable-qr opt 546.04 -> 421.88
21% raptor-tp6-microsoft-firefox fcp osx-10-10-shippable opt 1,247.02 -> 982.75
20% raptor-tp6-yandex-firefox loadtime linux64-shippable-qr opt 287.08 -> 229.62
19% raptor-tp6-microsoft-firefox linux64-shippable-qr opt 1,024.67 -> 826.61
19% raptor-tp6-microsoft-firefox linux64-shippable opt 975.46 -> 792.91
19% raptor-tp6-yandex-firefox loadtime linux64-shippable opt 280.33 -> 228.33
18% raptor-tp6-yandex-firefox fcp windows10-64-shippable-qr opt 90.08 -> 74.25
17% raptor-tp6-imdb-firefox osx-10-10-shippable opt 326.45 -> 271.86
17% raptor-tp6-yandex-firefox windows7-32-shippable opt 122.80 -> 102.44
16% raptor-tp6-yandex-firefox linux64-shippable-qr opt 129.49 -> 108.66
14% raptor-tp6-yandex-firefox windows10-64-shippable opt 122.18 -> 104.63
13% raptor-tp6-yandex-firefox linux64-shippable opt 130.66 -> 113.22
13% raptor-tp6-yandex-firefox osx-10-10-shippable opt 252.07 -> 219.32
10% raptor-tp6-imdb-firefox fcp linux64-shippable-qr opt 167.67 -> 150.75
9% raptor-tp6-imdb-firefox fcp linux64-shippable opt 165.62 -> 151.08
5% raptor-tp6-sheets-firefox loadtime linux64-shippable-qr opt 753.46 -> 714.83
4% raptor-tp6-sheets-firefox loadtime linux64-shippable opt 729.25 -> 699.62

You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=20773

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a Treeherder page showing the Raptor jobs in a pushlog format.

To learn more about the regressing test(s) or reproducing them, please see: https://wiki.mozilla.org/Performance_sheriffing/Raptor

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Performance_sheriffing/Talos/RegressionBugsHandling

Component: General → ImageLib
Flags: needinfo?(igoldan)
Flags: needinfo?(ehsan)
Product: Testing → Core

The only reason I can imagine OS X regressing while the other platforms don't is that the patch makes us do more work on that platform. This is because the raptor-tp6-docs-firefox test just measures the load time of this page: https://docs.google.com/document/d/1US-07msg12slQtI_xchzYxcKlTs6Fp7WqIc6W5GK5M8/edit (see https://searchfox.org/mozilla-central/rev/b2015fdd464f598d645342614593d4ebda922d95/testing/raptor/raptor/tests/raptor-tp6-2.ini#21). If you load that page and watch the network connections it makes under DevTools, you'll see that no third-party tracking resources are loaded there.

mTopLevelBaseDomain should be google.com for the image subresources of the main doc and empty for the rest. I doubt the cost of copying the string objects would be the cause...

This OSX-only caller of imgILoader::LoadImage() looks interesting: https://searchfox.org/mozilla-central/rev/e7d9a8749303b39dadcc0e18ea0d60a570a68145/widget/cocoa/nsMenuItemIconX.mm#235. I doubt we have a ton of menu items with icons, but maybe the cost of hashing is the root cause? I think I may need to do some try pushes.
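
For intuition, here is a minimal, hypothetical C++ sketch of the kind of cache key under discussion: one that folds mTopLevelBaseDomain into the hash and the equality check, so every lookup (including the OSX-only menu-icon loads above) pays for hashing and comparing a second string. Everything other than the mTopLevelBaseDomain name is illustrative and not the actual ImageLib code:

    #include <cstdint>
    #include <functional>
    #include <iostream>
    #include <string>

    // Hypothetical sketch only; the real key in ImageLib differs in detail.
    // It shows how carrying mTopLevelBaseDomain adds a second string to hash
    // and compare on every cache lookup.
    struct ImageCacheKeySketch {
      std::string mSpec;                // image URI spec
      std::string mTopLevelBaseDomain;  // "google.com" for the doc's image
                                        // subresources, empty for chrome callers

      uint64_t Hash() const {
        uint64_t h = std::hash<std::string>{}(mSpec);
        return h ^ (std::hash<std::string>{}(mTopLevelBaseDomain) << 1);
      }

      bool operator==(const ImageCacheKeySketch& aOther) const {
        return mSpec == aOther.mSpec &&
               mTopLevelBaseDomain == aOther.mTopLevelBaseDomain;
      }
    };

    int main() {
      // A chrome caller like nsMenuItemIconX would hash a key roughly like
      // this once per menu item icon it loads.
      ImageCacheKeySketch key{"chrome://browser/skin/icon.png", ""};
      std::cout << key.Hash() << "\n";
    }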

Does anyone know how I can get a talos profile for these regressions? (I don't have access to a Mac machine myself.) Thanks!

Flags: needinfo?(ehsan)

(In reply to Ehsan Akhgari (:ehsan) from comment #1)
> Does anyone know how I can get a talos profile for these regressions? (I don't have access to a Mac machine myself.) Thanks!

I suppose you mean Gecko profiles? If yes, I can share that.

Flags: needinfo?(igoldan) → needinfo?(ehsan)

(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #2)
> (In reply to Ehsan Akhgari (:ehsan) from comment #1)
> > Does anyone know how I can get a talos profile for these regressions? (I don't have access to a Mac machine myself.) Thanks!
>
> I suppose you mean Gecko profiles? If yes, I can share that.

Yes, that would be great. Thank you!

Flags: needinfo?(ehsan)

Having some Gecko profiles from Linux and Windows raptor runs which did not show a regression when this landed would be extremely helpful for comparison too!

(In reply to Ehsan Akhgari (:ehsan) from comment #4)
> Having some Gecko profiles from Linux and Windows raptor runs which did not show a regression when this landed would be extremely helpful for comparison too!

These are the Gecko profiles for the raptor-tp6-docs-firefox regressions from OSX:

before: https://perf-html.io/from-url/https%3A%2F%2Fqueue.taskcluster.net%2Fv1%2Ftask%2FFpEGIqppTneDxAjN3xAS9w%2Fruns%2F0%2Fartifacts%2Fpublic%2Ftest_info%2Fprofile_raptor-tp6-docs-firefox.zip

after: https://perf-html.io/from-url/https%3A%2F%2Fqueue.taskcluster.net%2Fv1%2Ftask%2FXLwteq9GSVCRpUSicJNtWw%2Fruns%2F0%2Fartifacts%2Fpublic%2Ftest_info%2Fprofile_raptor-tp6-docs-firefox.zip

Thanks, Ionuț.

Discoveries so far: this patch seems to win back the regression that I caused: https://hg.mozilla.org/try/rev/2dc297791558bb4b83de7b07dfae80a78a489cce. Reverting any of the other changes from the original patch has no performance impact; specifically, that means the changes to cache key computation and cache key comparison were not at fault. So the only possible explanation for why the above change makes any difference is that this code has some positive impact on OSX: https://searchfox.org/mozilla-central/rev/99a2a5a955960b0e58ceade1db1f7652d9db4ba1/image/imgLoader.cpp#1461, which would only make sense if clearing the image cache between page reloads more often makes things faster on OSX...

I'm going to push a couple of patches to the try server to test these hypotheses.

Breaking the image cache intentionally (https://hg.mozilla.org/try/rev/1ec3ab54dd5633a5814ac474ca57d8f8792800b3) wins back the performance regression: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=fff2f680db9f1f98c5232153c2e0a8d6db52de96&newProject=try&newRevision=2b0bcf9c9b24931aa81691aa56f83df8c3d0299a&framework=10.
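
For context, a minimal sketch of what "breaking the image cache intentionally" amounts to; the class and names below are my own illustration, not the actual try patch linked above. Lookups always report a miss, so each reload behaves as if nothing survived in the cache; if that alone recovers the score, the benchmark is effectively rewarding cache misses:

    #include <iostream>
    #include <string>
    #include <unordered_map>

    // Illustrative only, not the real imgLoader cache or the try patch above.
    class BrokenImageCacheSketch {
      std::unordered_map<std::string, std::string> mEntries;

     public:
      void Put(const std::string& aKey, const std::string& aEntry) {
        mEntries[aKey] = aEntry;
      }

      bool Lookup(const std::string& aKey, std::string* aOut) {
        (void)aKey;
        (void)aOut;
        return false;  // diagnostic change: force a miss even when the entry exists
      }
    };

    int main() {
      BrokenImageCacheSketch cache;
      cache.Put("https://docs.google.com/image.png", "decoded image");
      std::string out;
      std::cout << (cache.Lookup("https://docs.google.com/image.png", &out)
                        ? "hit" : "miss")
                << "\n";  // always prints "miss"
    }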

So I think this confirms that the reason behind this regression on OSX is that my patch in bug 1548349 made Firefox better at using its image cache, and as a result it now performs worse on this test.

Andrew, what's your recommendation on how to deal with this regression?

Flags: needinfo?(aosmond)

I believe whatever is causing the regression was already present and is unrelated to the patches in bug 1548349. Since we are hitting the cache more frequently, we will have changed the ordering of the imgRequestProxy notifications (e.g. they arrive sooner than they would have otherwise). That is my best guess as to why the benchmark changed. It would mean this is a pre-existing bug that is sensitive to timing, which makes it challenging to investigate. However, it also means that real-world performance is unlikely to be affected, so barring reports to the contrary from users, I would recommend we leave this for now and keep an eye on any related future regressions.
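
To make the timing point concrete, here is a rough, hypothetical model (not the real imgRequestProxy code) of how more cache hits can reorder notifications: on a hit the listener can be notified before the load call even returns, while a miss defers the notification to a later event-loop turn, so a page whose subresources suddenly start hitting the cache sees its load events arrive in a different order:

    #include <functional>
    #include <iostream>
    #include <queue>
    #include <string>
    #include <unordered_set>

    // Hypothetical model only: a cache hit notifies immediately, a miss queues
    // the notification for a later event-loop turn.
    std::queue<std::function<void()>> gMainThreadQueue;

    void LoadImageSketch(const std::string& aUri,
                         const std::unordered_set<std::string>& aCache,
                         std::function<void()> aOnLoad) {
      if (aCache.count(aUri)) {
        aOnLoad();  // hit: fires "sooner than it would have otherwise"
      } else {
        gMainThreadQueue.push(std::move(aOnLoad));  // miss: fires later
      }
    }

    int main() {
      std::unordered_set<std::string> cache{"https://docs.google.com/a.png"};
      LoadImageSketch("https://docs.google.com/a.png", cache,
                      [] { std::cout << "a.png notified during LoadImage\n"; });
      LoadImageSketch("https://docs.google.com/b.png", cache,
                      [] { std::cout << "b.png notified from the event loop\n"; });
      while (!gMainThreadQueue.empty()) {
        gMainThreadQueue.front()();
        gMainThreadQueue.pop();
      }
    }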

Flags: needinfo?(aosmond)
Priority: -- → P3

Thanks!

Not sure if we should leave the bug open or not, but I'm not planning on doing more here.

Keywords: perf-alert
Has Regression Range: --- → yes
Severity: normal → S3