Bug 1212412 (Closed) - Opened 9 years ago, Closed 9 years ago

2% Win8 tp5o_scroll/tart regression on Mozilla-Inbound on October 06, 2015 from push dbe8d3254ccd054a3508d4d46c970b1ec4d03bc2

Categories

(Core :: Audio/Video: Playback, defect, P1)


Tracking


RESOLVED WORKSFORME
Tracking Status
firefox44 --- affected

People

(Reporter: jmaher, Assigned: jya)

References

Details

(Keywords: perf, regression, Whiteboard: [talos_regression])

Talos has detected a Firefox performance regression from your commit dbe8d3254ccd054a3508d4d46c970b1ec4d03bc2 in bug 1211339. We need you to address this regression.

This is a list of all known regressions and improvements related to your bug:
http://alertmanager.allizom.org:8080/alerts.html?rev=dbe8d3254ccd054a3508d4d46c970b1ec4d03bc2&showAll=1

On the page above you can see a Talos alert for each affected platform, as well as a link to a graph showing the history of scores for this test. There is also a link to a Treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test, please see:
https://wiki.mozilla.org/Buildbot/Talos/Tests#TART.2FCART

Reproducing and debugging the regression:

If you would like to re-run this Talos test on a potential fix, use try with the following syntax:

  try: -b o -p win64 -u none -t svgr  # add "mozharness: --spsProfile" to generate profile data

To run the test locally and do a more in-depth investigation, first set up a local Talos environment:
https://wiki.mozilla.org/Buildbot/Talos/Running#Running_locally_-_Source_Code

Then run the following command from the directory where you set up Talos:

  talos --develop -e <path>/firefox -a tart

Making a decision:

As the patch author, we need your feedback to help us handle this regression.

*** Please let us know your plans by Monday, or the offending patch will be backed out! ***

Our wiki page outlines the common responses and expectations:
https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
I did some retriggers here to compare this push to the previous one:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&fromchange=9400b5b4f6da&tochange=eca3009f91f6&filter-searchStr=Windows%208%2064-bit%20mozilla-inbound%20talos%20svgr

Looking at the comparison view in Perfherder, these two tests (tp5o_scroll and tart) show regressions:
https://treeherder.allizom.org/perf.html#/compare?originalProject=mozilla-inbound&originalRevision=5d303b961af1&newProject=mozilla-inbound&newRevision=dbe8d3254ccd

Keep in mind that the other data points above are noisy; I did look through all of them prior to filing this bug.

:cpearce, can you take a look at this to help us understand how your change affects these tests and what work, if any, should/could be done to reduce the regression?
Flags: needinfo?(cpearce)
My patch changed our HTMLMediaElement.canPlayType() behaviour. Why are scrolling tests calling media functions?

My change is basically similar to bug 1164925; in a synchronous JS API (HTMLMediaElement.canPlayType) we check that the Windows system media decoders are usable before reporting to JavaScript that we can play video/audio. This is the only way we can check reliably. I cache the results on the first call to minimize overhead. If we prepopulated the results on startup, we'd get a startup perf regression, and people who don't use media would suffer. So if this patch is causing a regression, I don't think there's anything we can do, other than remove whatever is calling HTMLMediaElement.canPlayType().

Also, does this test take longer than 90 seconds to run? We call HTMLMediaElement.canPlayType() 90 seconds after startup here:
https://dxr.mozilla.org/mozilla-central/source/browser/base/content/browser.js#1442

This is so that we can report via telemetry how many users have non-functioning system decoders, which is a major problem we've had on Windows. If the test takes longer than 90 seconds, that could explain why we're hitting this code. You could try commenting out that code to see if your numbers stabilize.
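[Editor's sketch] For readers following along, here is a minimal JavaScript sketch of the kind of delayed probe described above. It is not the actual browser.js code (see the dxr link); the structure and variable names are illustrative, and only the 90-second delay and the synchronous canPlayType() call come from the comment.

  // Illustrative sketch, not the real browser.js code: schedule a one-off
  // decoder-capability probe well after startup so the check never runs on
  // the startup path.
  const DECODER_PROBE_DELAY_MS = 90 * 1000;

  setTimeout(() => {
    // canPlayType() is synchronous; with the patch it also verifies that the
    // Windows system decoders actually work, and caches the result.
    let result = document.createElement("video")
                         .canPlayType('video/mp4; codecs="avc1.42E01E"');
    // The real code reports this via telemetry; here we just log it.
    console.log("H.264 support reported as:", result || "(empty string)");
  }, DECODER_PROBE_DELAY_MS);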
Flags: needinfo?(cpearce) → needinfo?(jmaher)
There are two tests that fail:

tp5o_scroll - load a page, scroll, load the next page in the same tab, scroll, repeat until 50 pages are hit
tart - open a new tab, take a bunch of measurements, repeat with different numbers of tabs, etc.

Each test runs with a fresh profile, and they do keep the browser open >90 seconds. Although tp5o, tps, tsvgx, damp, and other tests also keep the browser open >90 seconds and they didn't regress. I suspect they regressed, just not enough to warrant a regression report.

I pushed to try to see what happens:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=6731ad7226ab

Given the 90-second call, I am not sure there is much to do here without changing how the patch works. Do weigh in if you see potential fixes or other ways to achieve the goal of the bug.
Flags: needinfo?(jmaher)
How long is the longest test? If your push indicates this is the issue, we could increase the timeout duration. Does the test build run with telemetry enabled? We could #ifdef TELEMETRY out that block inside browser.js?
Flags: needinfo?(jmaher)
We run with telemetry hacked off via prefs:
https://dxr.mozilla.org/mozilla-central/source/testing/talos/talos/config.py#166
https://dxr.mozilla.org/mozilla-central/source/testing/talos/talos/config.py#92

I am not sure if that is the best/proper way to do it. I would say the longest test runs for 8 minutes!
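[Editor's sketch] Since Talos disables telemetry through prefs, one alternative to an #ifdef would be to guard the delayed probe on the standard telemetry pref. A rough sketch only, assuming a privileged context where Services is already defined (as it is in browser.js); the helper name is hypothetical and this is not the actual browser.js change.

  // Sketch only: skip the delayed probe entirely when telemetry is turned
  // off (as Talos does via prefs).
  function probeDecoderCapabilities() {
    // Hypothetical helper standing in for the real probe.
    return document.createElement("video")
                   .canPlayType('video/mp4; codecs="avc1.42E01E"');
  }

  if (Services.prefs.getBoolPref("toolkit.telemetry.enabled")) {
    setTimeout(() => {
      let result = probeDecoderCapabilities();
      // In the real code the result would feed a telemetry histogram.
      console.log("decoder probe result:", result);
    }, 90 * 1000);
  }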
Flags: needinfo?(jmaher)
(In reply to Chris Pearce (:cpearce) from comment #4)
> How long is the longest test? If your push indicates this is the issue, we
> could increase the timeout duration. Does the test build run with telemetry
> enabled? We could #ifdef TELEMETRY out that block inside browser.js?

Chris, the point here is not to make the issue invisible to the test, but rather to let the test alert us that the patch possibly affected performance to some degree. Now that we know that (assuming it's not a false positive), let's consider our options. Is this something you expected? Do you think it could/should be addressed? Etc.
Avi, you get what you measure. I would not expect a change to HTMLMediaElement.canPlayType() to affect scrolling performance. The graphics stack should not be calling HTMLMediaElement.canPlayType().
Chris, thanks. So you're saying you don't think this patch is at fault. Good info.

Joel, how confident are we that it's this?
(In reply to Avi Halachmi (:avih) from comment #8)
> Chris, thanks. So you're saying you don't think this patch is at fault.
> Good info.

Not directly. As I said above, my patch changes a synchronous JS API (HTMLMediaElement.canPlayType) to check that system decoders are usable. This could cause some DLLs to load in a function that runs on the main thread. We cache the result, so my patch should only cause a small blip for I/O loading the DLLs the first time it's called. I wouldn't expect that to be significant in an 8 minute long test!
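[Editor's sketch] To make the one-time cost described above concrete, here is a schematic cache-on-first-call sketch in JavaScript. The real caching happens in Gecko's C++ media code; everything below (names and the stand-in check) is illustrative only.

  // Illustrative only: the first call pays the expensive one-time cost
  // (loading and testing the system decoder DLLs); later calls return the
  // cached answer immediately.
  let cachedDecoderSupport = null;

  function canPlaySystemVideo() {
    if (cachedDecoderSupport === null) {
      cachedDecoderSupport = expensiveSystemDecoderCheck();
    }
    return cachedDecoderSupport;
  }

  // Hypothetical stand-in for the real check, which lives in C++.
  function expensiveSystemDecoderCheck() {
    return true;
  }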
According to your explanation, I agree: it doesn't sound like it should end up as a 2% regression. And while there's some possibility that tp5o_scroll has some pages which trigger this call, TART really shouldn't be triggering it (I don't think newtab should have media elements other than images).

Furthermore, when I examine a specific regressed test in detail (tart on Win8 64 *), I see that all the subtests regressed (including tests which do nothing other than animating the "tab"), and that all subtests regressed similarly. The values across the 6 complete test runs (retriggers) are quite stable too. So this does not look to me at all like a single "event" which caused this. To me this smells like something which affects most or all frames consistently, possibly in the compositor, layout code, etc.

Joel, how can we validate your conclusion at comment 0 that the sole reason for this regression is bug 1211339?

* https://treeherder.allizom.org/perf.html#/comparesubtest?originalProject=mozilla-inbound&originalRevision=5d303b961af1&newProject=mozilla-inbound&newRevision=dbe8d3254ccd&originalSignature=b43983bbf31fc4b74b4a65775bc8eb416f6aa82e&newSignature=b43983bbf31fc4b74b4a65775bc8eb416f6aa82e
Flags: needinfo?(jmaher)
It seems that an attempt to hack this out didn't make a difference:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=6731ad7226ab&newProject=mozilla-inbound&newRevision=dbe8d3254ccd

Not sure if I did it correctly: https://hg.mozilla.org/try/rev/6731ad7226ab

Either way, there is enough data before/after the change to show that something happened.
Flags: needinfo?(jmaher)
I'm ambivalent. On one hand, three people have already spent more time on this than such a relatively small regression typically warrants. OTOH, I'd really like to understand where this regression comes from, since:

1. Despite its low magnitude, it seems to regress consistently.
2. Chris doesn't think bug 1211339 is responsible, which leaves us without any kind of explanation.

My suggestion is that Chris gives this another thought: could the consistent regression during different kinds of animation be explained by this patch? If he can think of something but doesn't think there's an obvious fix, let's keep it for the record and close WONTFIX. If he can't think of anything, still close as WONTFIX, knowing that we could be missing something important here; it's probably not worth spending more time to understand what it is that we're missing.

Chris, you up to it? Give it one final consideration whether bug 1211339 could somehow explain this?
Flags: needinfo?(avihpit) → needinfo?(cpearce)
Marking as P1 because we need to resolve this to make sure media code doesn't get removed due to scrolling performance issues.
Priority: -- → P1
I don't see any action items on this bug, nor do I see alerts upon the uplift to Aurora. Shall we mark this as resolved/worksforme?
Assignee: nobody → jyavenard
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WORKSFORME
Flags: needinfo?(cpearce)