Closed Bug 1783674 Opened 2 years ago Closed 11 days ago

40.34 - 12.49% tsvg_static / tsvgr_opacity (OSX) regression on Tue August 2 2022

Categories

(Core :: DOM: UI Events & Focus Handling, defect)

defect

Tracking

()

RESOLVED WONTFIX
Tracking Status
firefox-esr91 --- unaffected
firefox-esr102 --- unaffected
firefox103 --- unaffected
firefox104 --- unaffected
firefox105 --- wontfix
firefox106 --- fix-optional

People

(Reporter: aglavic, Unassigned)

References

(Regression)

Details

(4 keywords)

Perfherder has detected a talos performance regression from push 5771675980ea9f1e63bcbf905d251f735a7f7bb5. As author of one of the patches included in that push, we need your help to address this regression.

Regressions:

Ratio Test Platform Options Absolute values (old vs new)
40% tsvg_static macosx1015-64-shippable-qr e10s fission stylo webrender-sw 43.57 -> 61.15
39% tsvg_static macosx1015-64-shippable-qr e10s fission stylo webrender-sw 43.45 -> 60.57
12% tsvgr_opacity macosx1015-64-shippable-qr e10s fission stylo webrender-sw 54.46 -> 61.26

For up to date results, see: https://treeherder.mozilla.org/perfherder/alerts?id=35045

Flags: needinfo?(florian)

Set release status flags based on info from the regressing bug 1782693

So it looks like bug 1782693 regressed the performance of loading svg documents on Mac... I wonder:

  1. If there's a specific check we should add to the code to keep ticking in SVG documents.
  2. If there might be something bogus in how the test detects the load is finished.

Olli, any thought on this?

Andej, is it easily possible to see a profile of these tests?

Flags: needinfo?(smaug)
Flags: needinfo?(florian)
Flags: needinfo?(aglavic)

(In reply to Andrej (:andrej) from comment #3)

Profile job of the tests is currently queued and has not completed you can find it here: https://treeherder.mozilla.org/jobs?repo=autoland&revision=628149ef8b2c734ce965f509b11e387e26196092&group_state=expanded&selectedTaskRun=cXGvfQCySEC9iX3KZK1lhQ.0&searchStr=OS%2CX%2C10.15%2CWebRender%2CShippable%2Copt%2CTalos%2Cperformance%2Ctests%2Cwith%2Csoftware%2Cwebrender%2Cenabled%2C%28profiling%29%2Ctest-macosx1015-64-shippable-qr%2Fopt-talos-svgr-swr%2Cs-p

Thanks for the profiling jobs. Unfortunately they seem to contain profiles for the svgr subtask (profile_svgr.zip), but not for the tsvg_static tests.

I looked at before/after profiles for big-optimizable-group-opacity-2500.svg, and I see no difference between them. In both cases there's a long composite taking 3.0s. It's also not obvious what is being measured, the test could do with more profiler makers.

When looking at https://treeherder.mozilla.org/perfherder/comparesubtest?framework=1&originalProject=autoland&originalSignature=4086806&newProject=autoland&newSignature=4086806&originalRevision=f0d33d0689087a6255e544526f6846085e6ec460&newRevision=628149ef8b2c734ce965f509b11e387e26196092&page=1 it seems there's mostly a reduction of the noise.

The more significant changes are in tsvg_static, https://treeherder.mozilla.org/perfherder/comparesubtest?framework=1&originalProject=autoland&originalSignature=4086822&newProject=autoland&newSignature=4086822&originalRevision=f0d33d0689087a6255e544526f6846085e6ec460&newRevision=628149ef8b2c734ce965f509b11e387e26196092&page=1&showOnlyConfident=1 but I won't go very far investigating there without a profile and without knowing what the test is trying to do.

Daniel, do you know what these tests are measuring, and if this regression looks like something we should worry about?

Flags: needinfo?(dholbert)

It's been a long time since I looked at those tests.

I think the regression does look like something we should worry about and address though I would bet it's not a user-visible issue. It's likely more of a situation where the "guilty" commit here has influenced the signal that we're waiting for, in such a way that it reduces our ability to measure regressions into the future. EDIT: it may in fact be user-visible, as indicated by profiles in my next comment.

I think (?) the test is measuring time-to-first-paint after pageload, where "paint" is defined using the MozAfterPaint event, per:
https://searchfox.org/mozilla-central/rev/c0bed29d643393af6ebe77aa31455f283f169202/testing/talos/talos/test.py#956,965,975,983

So: maybe the guilty change here is inadvertently causing a delay to when we dispatch MozAfterPaint, for some reason?

Flags: needinfo?(dholbert)

(In reply to Florian Quèze [:florian] from comment #4)

The more significant changes are in tsvg_static [...] I won't go very far investigating there without a profile and without knowing what the test is trying to do.

Agreed. I think we do have profiles, though? I found some tsvg_static profiles there, looking at the s-p jobs here ("bad"):
https://treeherder.mozilla.org/jobs?repo=autoland&searchStr=talos%2Cmac&revision=628149ef8b2c734ce965f509b11e387e26196092
vs. here ("good"):
https://treeherder.mozilla.org/jobs?repo=autoland&searchStr=talos%2Cmac&revision=f0d33d0689087a6255e544526f6846085e6ec460

On the s-p tasks there, under "Artifacts & debugging tools", I clicked profile_svgr.zip - open in Firefox Profiler which gives me a listing of the various tests/subtests & lets me pick which one to view.

I looked at the subtest composite-scale-rotate-opacity.svg_pagecycle_1 (since composite-scale-rotate.svg seems to be the worst tsvg_static subtest regression as shown by perfherder), and I get these "bad"/"good" profiles (from after vs. before the regression):
bad: https://share.firefox.dev/3zMVdZE
good: https://share.firefox.dev/3JLrSmX
In both cases, the marker chart shows DocumentLoad taking about 21-22ms, but there's a much longer delay to completing FirstContentfulComposite in the bad profile vs. the good profile (58.5ms vs 43.6ms).

One perhaps-relevant observation: In the "good" profile, it looks like FirstNonBlankPaint immediately produces a FirstNonBlankComposite 1ms later as part of the same refresh-driver tick; whereas in the "bad" profile, they are separated by 22ms (maybe happening in separate refresh driver ticks?)

Here's a pair of profiles for composite-scale-rotate.svg_pagecycle_2:

bad: https://share.firefox.dev/3A8Iz8B
good: https://share.firefox.dev/3Q5XI09

As above, DocumentLoad is the same duration (about 15ms) and pageshow fires at around the same timestamp (at around t=0.018ms), and for some reason the bad profile then has quite a long delay before FirstContentfulComposite (around t=0.065ms vs. t=0.039ms).

(And MozAfterPaint does seem to fire immediately after FirstContentfulComposite, so that seems to be the relevant [and unfortunately user-impacting] measurement here.)

I'm not super familiar with the intricacies of paint scheduling, so I'll stop digging at this point. I think/hope the profiles in comment 8 - comment 9 should have enough info for someone to dig further, though. CC'ing tnikkel, mstange, and emilio who might have thoughts/insights here.

tl;dr answers to comment 5:
"what's the profile measuring": first paint (via MozAfterPaint which corresponds to FirstContentfulComposite)
"do we need to worry": yes. the profiles seem to legitimately show substantially-delayed paint operations here.

FWIW, the regressing bug did change some heuristics which were added in bug 1708121, and there especially https://bugzilla.mozilla.org/show_bug.cgi?id=1708121#c11 is interesting.

Flags: needinfo?(smaug)

In the "bad" profile, there's a PVsync Notify runnable which doesn't do anything, and then the refresh driver tick happens in the next vsync runnable. I'm not sure why the first one has no effect. In the tick, on the ViewManagerFlush marker, it says that the paint was triggered 37ms ago, which was before the no-op vsync runnable.

Florian, can you get some before/after profiles with IPC info?

Flags: needinfo?(florian)

Before: https://share.firefox.dev/3BZN0UQ
After: https://share.firefox.dev/3AbFjJL

The "After" profile clearly shows a Vsync event that happened while vsync was being observed, but this event did not trigger a refresh tick. Why!?

Severity: -- → S3

Set release status flags based on info from the regressing bug 1782693

Markus, is that a regression that we should fix for our upcoming releases?

Flags: needinfo?(mstange.moz)

I think this regression falls under the "we should understand it, but it probably doesn't have much impact on our users" bucket. So there's not a big rush to fix it. It really seems to be just one skipped vsync event, and this performance test happened to create the right conditions for it to occur.

Flags: needinfo?(mstange.moz)

Hi :mstange, it's been 2 years with no activity here, could this performance alert be closed as WONTFIX? If some investigation could still be done, it would be better to file a more specific follow-up bug. Something to note is that the osx machines will be migrated from 10.15 to 14.70 in the next month so it may not be reproducible in CI after that point.

Flags: needinfo?(mstange.moz)

Yeah let's close it as WONTFIX.

I've started an talos-svgr-profiling job here just in case we've added enough markers to make more sense of this based on the profiles: https://treeherder.mozilla.org/jobs?repo=mozilla-central&searchStr=talos-svgr&revision=60b9a1ba2d2e318e3800cfecc5cb0191563e0b91&selectedTaskRun=VfLW4k-BQo-f1woYQfAbKQ.0

Status: NEW → RESOLVED
Closed: 11 days ago
Flags: needinfo?(mstange.moz)
Resolution: --- → WONTFIX

That profile doesn't have the vsync-but-no-refresh-tick issue: https://share.firefox.dev/3CGGxR7

Anyway, this is not worth spending much time on. Speeding up this test would be a matter of making it so that the refresh driver fires sooner (not sure why it doesn't use a catch-up tick here) and so that the compositor composites sooner. That's something we want to do generally, but it's not motivated by page load, and doesn't need to be motivated by this SVG test.

You need to log in before you can comment on or make changes to this bug.