Bug 1212412 (Closed) - Opened 9 years ago, Closed 9 years ago

2% Win8 tp5o_scroll/tart regression on Mozilla-Inbound on October 06, 2015 from push dbe8d3254ccd054a3508d4d46c970b1ec4d03bc2

Categories

(Core :: Audio/Video: Playback, defect, P1)


Tracking


RESOLVED WORKSFORME
Tracking Status
firefox44 --- affected

People

(Reporter: jmaher, Assigned: jya)

References

Details

(Keywords: perf, regression, Whiteboard: [talos_regression])

Talos has detected a Firefox performance regression from your commit dbe8d3254ccd054a3508d4d46c970b1ec4d03bc2 in bug 1211339. We need you to address this regression.

This is a list of all known regressions and improvements related to your bug:
http://alertmanager.allizom.org:8080/alerts.html?rev=dbe8d3254ccd054a3508d4d46c970b1ec4d03bc2&showAll=1

On the page above you can see a Talos alert for each affected platform, as well as a link to a graph showing the history of scores for this test. There is also a link to a Treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test, please see:
https://wiki.mozilla.org/Buildbot/Talos/Tests#TART.2FCART

Reproducing and debugging the regression:

If you would like to re-run this Talos test on a potential fix, use try with the following syntax:

  try: -b o -p win64 -u none -t svgr  # add "mozharness: --spsProfile" to generate profile data

To run the test locally and do a more in-depth investigation, first set up a local Talos environment:
https://wiki.mozilla.org/Buildbot/Talos/Running#Running_locally_-_Source_Code

Then run the following command from the directory where you set up Talos:

  talos --develop -e <path>/firefox -a tart

Making a decision:

As the patch author, we need your feedback to help us handle this regression.

*** Please let us know your plans by Monday, or the offending patch will be backed out! ***

Our wiki page outlines the common responses and expectations:
https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
I did some retriggers here to compare this push to the previous one:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&fromchange=9400b5b4f6da&tochange=eca3009f91f6&filter-searchStr=Windows%208%2064-bit%20mozilla-inbound%20talos%20svgr

Looking at the comparison view in Perfherder, these two tests (tp5o_scroll and tart) show regressions:
https://treeherder.allizom.org/perf.html#/compare?originalProject=mozilla-inbound&originalRevision=5d303b961af1&newProject=mozilla-inbound&newRevision=dbe8d3254ccd

Keep in mind that the other data points above are noisy; I did look through all of them prior to filing this bug.

:cpearce, can you take a look at this to help us understand how your change affects these tests and what work, if any, should/could be done to reduce the regression?
Flags: needinfo?(cpearce)
My patch changed our HTMLMediaElement.canPlayType() behaviour. Why are scrolling tests calling media functions?

My change is basically similar to bug 1164925; in a synchronous JS API (HTMLMediaElement.canPlayType) we check that the Windows system media decoders are usable before reporting to JavaScript that we can play video/audio. This is the only way we can check reliably. I cache the results on the first call to minimize overhead. If we prepopulated the results on startup, we'd get a startup perf regression, and people who don't use media would suffer. So if this patch is causing a regression, I don't think there's anything we can do, other than remove whatever is calling HTMLMediaElement.canPlayType().

Also, does this test take longer than 90 seconds to run? We call HTMLMediaElement.canPlayType() 90 seconds after startup here:
https://dxr.mozilla.org/mozilla-central/source/browser/base/content/browser.js#1442

This is so that we can report via telemetry how many users have non-functioning system decoders, which is a major problem we've had on Windows. If the test takes longer than 90 seconds, that could explain why we're hitting this code. You could try commenting out that code to see if your numbers stabilize.
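[Editor's sketch] For readers following along, here is a minimal JavaScript sketch of the kind of delayed probe described above. It is not the actual browser.js code (see the dxr link); the structure and variable names are illustrative, and only the 90-second delay and the synchronous canPlayType() call come from the comment.

  // Illustrative sketch, not the real browser.js code: schedule a one-off
  // decoder-capability probe well after startup so the check never runs on
  // the startup path.
  const DECODER_PROBE_DELAY_MS = 90 * 1000;

  setTimeout(() => {
    // canPlayType() is synchronous; with the patch it also verifies that the
    // Windows system decoders actually work, and caches the result.
    let result = document.createElement("video")
                         .canPlayType('video/mp4; codecs="avc1.42E01E"');
    // The real code reports this via telemetry; here we just log it.
    console.log("H.264 support reported as:", result || "(empty string)");
  }, DECODER_PROBE_DELAY_MS);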
Flags: needinfo?(cpearce) → needinfo?(jmaher)
There are two tests that fail:

tp5o_scroll - load a page, scroll, load the next page in the same tab, scroll, repeat until 50 pages are hit
tart - open a new tab, take a bunch of measurements, repeat with different numbers of tabs, etc.

Each test runs with a fresh profile, and they do keep the browser open >90 seconds. Although tp5o, tps, tsvgx, damp, and other tests also keep the browser open >90 seconds and they didn't regress. I suspect they regressed, just not enough to warrant a regression report.

I pushed to try to see what happens:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=6731ad7226ab

Given the 90-second call, I am not sure there is much to do here without changing how the patch works. Do weigh in if you see potential fixes or other ways to achieve the goal of the bug.
Flags: needinfo?(jmaher)
How long is the longest test? If your push indicates this is the issue, we could increase the timeout duration. Does the test build run with telemetry enabled? We could #ifdef TELEMETRY out that block inside browser.js?
Flags: needinfo?(jmaher)
We run with telemetry hacked off via prefs:
https://dxr.mozilla.org/mozilla-central/source/testing/talos/talos/config.py#166
https://dxr.mozilla.org/mozilla-central/source/testing/talos/talos/config.py#92

I am not sure if that is the best/proper way to do it. I would say the longest test runs for 8 minutes!
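[Editor's sketch] Since Talos disables telemetry through prefs, one alternative to an #ifdef would be to guard the delayed probe on the standard telemetry pref. A rough sketch only, assuming a privileged context where Services is already defined (as it is in browser.js); the helper name is hypothetical and this is not the actual browser.js change.

  // Sketch only: skip the delayed probe entirely when telemetry is turned
  // off (as Talos does via prefs).
  function probeDecoderCapabilities() {
    // Hypothetical helper standing in for the real probe.
    return document.createElement("video")
                   .canPlayType('video/mp4; codecs="avc1.42E01E"');
  }

  if (Services.prefs.getBoolPref("toolkit.telemetry.enabled")) {
    setTimeout(() => {
      let result = probeDecoderCapabilities();
      // In the real code the result would feed a telemetry histogram.
      console.log("decoder probe result:", result);
    }, 90 * 1000);
  }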
Flags: needinfo?(jmaher)
(In reply to Chris Pearce (:cpearce) from comment #4)
> How long is the longest test? If your push indicates this is the issue, we
> could increase the timeout duration. Does the test build run with telemetry
> enabled? We could #ifdef TELEMETRY out that block inside browser.js?

Chris, the point here is not to make the issue invisible to the test, but rather to let the test alert us that the patch possibly affected performance to some degree. Now that we know that (assuming it's not a false positive), let's consider our options. Is this something you expected? Do you think it could/should be addressed? Etc.
Avi, you get what you measure. I would not expect a change to HTMLMediaElement.canPlayType() to affect scrolling performance. The graphics stack should not be calling HTMLMediaElement.canPlayType().
Chris, thanks. So you're saying you don't think this patch is at fault. Good info.

Joel, how confident are we that it's this?
(In reply to Avi Halachmi (:avih) from comment #8)
> Chris, thanks. So you're saying you don't think this patch is at fault.
> Good info.

Not directly. As I said above, my patch changes a synchronous JS API (HTMLMediaElement.canPlayType) to check that system decoders are usable. This could cause some DLLs to load in a function that runs on the main thread. We cache the result, so my patch should only cause a small blip for I/O loading the DLLs the first time it's called. I wouldn't expect that to be significant in an 8 minute long test!
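[Editor's sketch] To make the one-time cost described above concrete, here is a schematic cache-on-first-call sketch in JavaScript. The real caching happens in Gecko's C++ media code; everything below (names and the stand-in check) is illustrative only.

  // Illustrative only: the first call pays the expensive one-time cost
  // (loading and testing the system decoder DLLs); later calls return the
  // cached answer immediately.
  let cachedDecoderSupport = null;

  function canPlaySystemVideo() {
    if (cachedDecoderSupport === null) {
      cachedDecoderSupport = expensiveSystemDecoderCheck();
    }
    return cachedDecoderSupport;
  }

  // Hypothetical stand-in for the real check, which lives in C++.
  function expensiveSystemDecoderCheck() {
    return true;
  }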
According to your explanation, I agree: it doesn't sound like it should end up as a 2% regression. And while there's some possibility that tp5o_scroll has some pages which trigger this call, TART really shouldn't be triggering it (I don't think newtab should have media elements other than images).

Furthermore, when I examine a specific regressed test in detail (tart on Win8 64 *), I see that all the subtests regressed (including tests which do nothing other than animating the "tab"), and that all subtests regressed similarly. The values across the 6 complete test runs (retriggers) are quite stable too. So this does not look to me at all like a single "event" which caused this. To me this smells like something which affects most or all frames consistently, possibly in the compositor, layout code, etc.

Joel, how can we validate your conclusion at comment 0 that the sole reason for this regression is bug 1211339?

* https://treeherder.allizom.org/perf.html#/comparesubtest?originalProject=mozilla-inbound&originalRevision=5d303b961af1&newProject=mozilla-inbound&newRevision=dbe8d3254ccd&originalSignature=b43983bbf31fc4b74b4a65775bc8eb416f6aa82e&newSignature=b43983bbf31fc4b74b4a65775bc8eb416f6aa82e
Flags: needinfo?(jmaher)
It seems that an attempt to hack this out didn't make a difference:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=6731ad7226ab&newProject=mozilla-inbound&newRevision=dbe8d3254ccd

Not sure if I did it correctly: https://hg.mozilla.org/try/rev/6731ad7226ab

Either way, there is enough data before/after the change to show that something happened.
Flags: needinfo?(jmaher)
I'm ambivalent. On one hand, three people have already spent more time on this than such a relatively small regression typically warrants. OTOH, I'd really like to understand where this regression comes from, since:

1. Despite its low magnitude, it seems to regress consistently.
2. Chris doesn't think bug 1211339 is responsible, which leaves us without any kind of explanation.

My suggestion is that Chris gives this another thought: could the consistent regression during different kinds of animation be explained by this patch? If he can think of something but doesn't think there's an obvious fix, let's keep it for the record and close WONTFIX. If he can't think of anything, still close as WONTFIX, knowing that we could be missing something important here; it's probably not worth spending more time to understand what it is that we're missing.

Chris, you up to it? Give it one final consideration whether bug 1211339 could somehow explain this?
Flags: needinfo?(avihpit) → needinfo?(cpearce)
Marking as P1 because we need to resolve this to make sure media code doesn't get removed due to scrolling performance issues.
Priority: -- → P1
I don't see any action items on this bug, nor do I see alerts upon the uplift to Aurora. Shall we mark this as resolved/worksforme?
Assignee: nobody → jyavenard
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WORKSFORME
Flags: needinfo?(cpearce)