Closed Bug 1672704 Opened 4 years ago Closed 3 years ago

4.83 - 28.14% displaylist_mutate / glvideo Mean / perf_reftest_singleton (linux64, macosx1014, windows10, windows7) regression on push 4c3e6fb95aa6184651f215d466e702ff7d1660d3 (Wed October 21 2020)

Categories

(Core :: Audio/Video: Playback, defect)

Firefox 84
defect

Tracking

()

RESOLVED FIXED
86 Branch
Tracking Status
firefox-esr78 --- unaffected
firefox82 --- unaffected
firefox83 --- unaffected
firefox84 --- disabled
firefox85 --- disabled
firefox86 --- fixed

People

(Reporter: alexandrui, Unassigned)

References

(Regression)

Details

(4 keywords, Whiteboard: [perf:alert:0])

Attachments

(2 obsolete files)

Perfherder has detected a talos performance regression from push 4c3e6fb95aa6184651f215d466e702ff7d1660d3. As author of one of the patches included in that push, we need your help to address this regression.

Regressions:

Ratio Suite Test Platform Options Absolute values (old vs new)
26% glvideo Mean tick time across 100 ticks: macosx1014-64-shippable-qr e10s stylo 27.31 -> 34.49
26% glvideo Mean tick time across 100 ticks: macosx1014-64-shippable e10s stylo 27.85 -> 35.17
26% glvideo Mean tick time across 100 ticks: macosx1014-64-shippable e10s stylo 27.71 -> 34.79
5% displaylist_mutate linux64-shippable-qr e10s stylo 2,942.05 -> 3,084.10

Improvements:

Ratio Suite Test Platform Options Absolute values (old vs new)
5% perf_reftest_singletons window-named-property-get.html linux64-shippable e10s stylo 595.81 -> 565.39

Details of the alert can be found in the alert summary, including links to graphs and comparisons for each of the affected tests. Please follow our guide to handling regression bugs and let us know your plans within 3 business days, or the offending patch(es) will be backed out in accordance with our regression policy.

For more information on performance sheriffing please see our FAQ.

Flags: needinfo?(jyavenard)
Component: Performance → Audio/Video: Playback
Product: Testing → Core

Set release status flags based on info from the regressing bug 1595994

Investigating.

None of those tests include a media element to be played.
We would never encounter the new code path introduced in bug 1595994.

However, when doing a first run, there a sanity test being run that checks if hardware acceleration can be used. This is done by decoding a single frame video and ensuring it displays properly. This will happen once only.

So it's likely what we are seeing is the additional time required to spawn the RDD process.

For glvideo on mac, this is expected if webrender isn't running. Once webrender is enabled by default it should improved things.

I would argue that if the time to spawned the RDD process is included and is like running a new configuration from zero, it doesn't really reflect a typical use.

We need to determine if the hardware is not doing crazy things.

Depends on: 1572687

I tried all day to understand what was going on with some of those tests as I had no explanation for the regression, even with the new code path disabled by pref I was still getting the same range of values.

Here is a push with all of bug 1595994 reverted, even more so both the GPU remote decoder and the RDD process have been turned off.

https://treeherder.mozilla.org/#/jobs?repo=try&selectedTaskRun=FHgVQ9wfSqOYbctIyNeRaA.0&revision=8138419746a1d0a56dbb9a278e275ea5eb3daff2

Here we have:
perf_reftest_singletons bloom-basic-2.html opt windows7-32-shippable e10s stylo: 64.98
perf_reftest_singletons bloom-basic-2.html opt windows10-64-shippable e10s stylo: 52.41
perf_reftest_singletons bloom-basic.html opt windows10-64-shippable e10s stylo: 50.8

Here is the windows7-32 graph: https://treeherder.mozilla.org/perf.html#/graphs?series=try,1908965,1,1&selected=1908965,1245748561 you can see no changes.

Those are the 4 biggest regression showing in the report, even with the revert we are still at the same range of values.

So I have no reasonable explanation for those regressions, it's not due to the RDD not being started, in the push above,

Flags: needinfo?(jyavenard) → needinfo?(aionescu)

In addition, I can't also explain any of the performance improvement, the code modified by bug 1595994 isn't called.
With all code reverted, the biggest improvement shown in the list
perf_reftest_singletons parent-basic-singleton.html opt macosx1014-64-shippable e10s stylo: 168.91

still give the same figures.

Assignee: nobody → jyavenard
Status: NEW → ASSIGNED
Attachment #9184139 - Attachment is obsolete: true

Thanks for the input, Jean. Did some retriggers. Bug 1595994 seems more reasonable as the culprit. I will update the bug as the graph reveals another culprit.

Flags: needinfo?(aionescu)
Flags: needinfo?(aionescu)

(In reply to Alexandru Ionescu (needinfo me) [:alexandrui] from comment #7)

Thanks for the input, Jean. Did some retriggers. Bug 1595994 seems more reasonable as the culprit. I will update the bug as the graph reveals another culprit.

It may be worth also looking at the regression that shows around October 21st there https://arewefastyet.com/win10/overview?numDays=60

You can see the regression in most of those tests around 1 day before bug 1595994 landed.

Flags: needinfo?(aionescu)
Whiteboard: [perftest:alert:?]
Flags: needinfo?(aionescu)

(In reply to Jean-Yves Avenard [:jya] from comment #3)

I tried all day to understand what was going on with some of those tests as I had no explanation for the regression, even with the new code path disabled by pref I was still getting the same range of values.

Here is a push with all of bug 1595994 reverted, even more so both the GPU remote decoder and the RDD process have been turned off.

https://treeherder.mozilla.org/#/jobs?repo=try&selectedTaskRun=FHgVQ9wfSqOYbctIyNeRaA.0&revision=8138419746a1d0a56dbb9a278e275ea5eb3daff2

Here we have:
perf_reftest_singletons bloom-basic-2.html opt windows7-32-shippable e10s stylo: 64.98
perf_reftest_singletons bloom-basic-2.html opt windows10-64-shippable e10s stylo: 52.41
perf_reftest_singletons bloom-basic.html opt windows10-64-shippable e10s stylo: 50.8

Here is the windows7-32 graph: https://treeherder.mozilla.org/perf.html#/graphs?series=try,1908965,1,1&selected=1908965,1245748561 you can see no changes.

Those are the 4 biggest regression showing in the report, even with the revert we are still at the same range of values.

So I have no reasonable explanation for those regressions, it's not due to the RDD not being started, in the push above,

Jean, when I filed the bug I based on glvideo regressions from this alert. If you look at this graph retriggers you'll be ablt to see pretty clear that Bug 1595994 is the culprit. I'll do some retriggers on perf_reftest_singletons to make sure that this test's regression is also caused by the same patch. I'll keep you posted!

Flags: needinfo?(aionescu) → needinfo?(jyavenard)

Regression of glvideo on Mac without webrender enabled is expected.

Not any of the others.

Flags: needinfo?(jyavenard)

Indeed, perf_reftest_singleons were caused by Bug 1644624. I reassigned them. Thanks for the heads-up!

Whiteboard: [perftest:alert:?] → [perftest:alert:0]
Whiteboard: [perftest:alert:0] → [perf:alert:0]

Jean-Yves what's the status here, will the patch on this bug land?

Flags: needinfo?(jyavenard)

Oops didn't mean to change 84 status

(In reply to Julien Cristau [:jcristau] from comment #12)

Jean-Yves what's the status here, will the patch on this bug land?

we can, but I don't believe it will affect the outcome of some regressions here.
The majority of it isn't even related to the bug first mentioned.

Flags: needinfo?(jyavenard)
Pushed by jyavenard@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/684580bcc087
Disable sanity test with talos. r=mattwoodrow
Attachment #9184140 - Attachment is obsolete: true
Assignee: jyavenard → nobody
Status: ASSIGNED → NEW
Flags: needinfo?(jyavenard)

(In reply to Jean-Yves Avenard [:jya] from comment #14)

(In reply to Julien Cristau [:jcristau] from comment #12)

Jean-Yves what's the status here, will the patch on this bug land?

we can, but I don't believe it will affect the outcome of some regressions here.
The majority of it isn't even related to the bug first mentioned.

AIUI there's still the displaylist_mutate and glvideo-with-webrender cases?

Flags: needinfo?(jyavenard)

Alexandru could you update the bug title and comment 0 to reflect comment 11?

Flags: needinfo?(aionescu)

(In reply to Julien Cristau [:jcristau] from comment #17)

AIUI there's still the displaylist_mutate and glvideo-with-webrender cases?

displaylist_mutate is nonsensical, as we won't hit the new codepath with that test. So I have no explanation on what's going on.

For webgl, as mentioned earlier, there will be a temporary regression occurring until webrender is enabled and webgl in the GPU process becomes active, and this one may now be.

Flags: needinfo?(jyavenard)

Short note that displaylist_mutate was apparently only added for WR when this landed (https://treeherder.mozilla.org/perfherder/graphs?series=autoland,1925725,1,1&timerange=5184000&series=autoland,2812899,1,1&series=mozilla-central,2820195,1,1) - and that, so far, performs massively worse that basic, i.e. should be investigated either way :)

Flags: needinfo?(aionescu)

(In reply to Jean-Yves Avenard [:jya] from comment #3)

I tried all day to understand what was going on with some of those tests as I had no explanation for the regression, even with the new code path disabled by pref I was still getting the same range of values.
https://treeherder.mozilla.org/#/jobs?repo=try&selectedTaskRun=FHgVQ9wfSqOYbctIyNeRaA.0&revision=8138419746a1d0a56dbb9a278e275ea5eb3daff2

Re-using that same build with the revert we have:
without bug 1595994:
displaylist_mutate flattened_mutate.html opt e10s stylo webrender: 2460.49
displaylist_mutate inactive_mutate.html opt e10s stylo webrender: 3142.56
displaylist_mutate mutate.html opt e10s stylo webrender: 4018.05
displaylist_mutate opt e10s stylo webrender: 3143.71

compare to supposedly the regressed range (https://treeherder.mozilla.org/jobs?repo=mozilla-central&revision=1980f87855fc557e7ef6bb24b66c4d2df5606afa&group_state=expanded&selectedTaskRun=efYRjeh6RMqAnUOVnTDxlQ.0)
with bug 1595994:
displaylist_mutate flattened_mutate.html opt e10s stylo webrender: 2317.37
displaylist_mutate inactive_mutate.html opt e10s stylo webrender: 3055.52
displaylist_mutate mutate.html opt e10s stylo webrender: 4004.04
displaylist_mutate opt e10s stylo webrender: 3049.27

We can see here that with none of the RDD work we get even worse results

This is conclusive proof that bug 1595994 has nothing to do with any regressions on displaylist_mutate

The main problem appears to be in webgl itself ; even if the image should be readable directly, it still uses a readback to the RDD process to read it.

GLBlitHelper::BlitImageToFramebuffer isn't called.

This seems to be the same problem as bug 1670542 and bug 1640607.

So moving the decoder from the content process to the GPU or RDD process only exposed the issue.

Setting webgl.out-of-process to false; and we move from 35 to 4.5

Flags: needinfo?(jgilbert)

This is being addressed in bug 1640607, thanks!

Flags: needinfo?(jgilbert)

(In reply to Jeff Gilbert [:jgilbert] from comment #23)

This is being addressed in bug 1640607, thanks!

:jgilbert could you confirm that bug 1640607 covers the remaining regressions?

Flags: needinfo?(jgilbert)

Yes.

Flags: needinfo?(jgilbert)

webgl.out-of-process is off for beta and release.

Improvement shown in bug 1640607 comment 23.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 86 Branch
Has Regression Range: --- → yes
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: