Open Bug 1519718 Opened 2 years ago Updated 29 days ago

Slow Frames detected on singlesprout home page

Categories

(Core :: Graphics: WebRender, defect, P3)

defect

Tracking

()

REOPENED
Tracking Status
firefox66 --- disabled
firefox67 --- wontfix
firefox68 --- affected

People

(Reporter: yoasif, Assigned: kvark)

References

(Blocks 5 open bugs, )

Details

(Keywords: nightly-community)

Attachments

(4 files)

Seeing the red animation after enabling gfx.webrender.debug.slow-frame-indicator on https://www.singlesprout.com/

Profile: https://perfht.ml/2VPrlHn

Slow Frame #01 Frame 2729(https://www.singlesprout.com/) CONTENT_FRAME_TIME 711 - Transaction start 0.250244, main-thread time 1.360601, full paint time 8.513312, Skipped composites 0, Composite start 16.124971, Resource upload time 0.000434, GPU cache upload time 0.253842, Render time 101.557626, Composite time 102.260539
Slow Frame #02 Frame 2732(https://www.singlesprout.com/) CONTENT_FRAME_TIME 662 - Transaction start 0.199985, main-thread time 1.040204, full paint time 1.750260, Skipped composites 0, Composite start 60.498300, Resource upload time 2.838174, GPU cache upload time 1.248226, Render time 47.848765, Composite time 49.632026
Slow Frame #03 Frame 2730(https://www.singlesprout.com/) CONTENT_FRAME_TIME 645 - Transaction start 0.291483, main-thread time 1.364013, full paint time 2.188527, Skipped composites 0, Composite start 16.129933, Resource upload time 0.001761, GPU cache upload time 0.058995, Render time 5.150866, Composite time 91.181365
Slow Frame #04 Frame 315(https://www.singlesprout.com/) CONTENT_FRAME_TIME 616 - Transaction start 0.336957, main-thread time 2.412339, full paint time 8.948310, Skipped composites 0, Composite start 16.040572, Resource upload time 1.621313, GPU cache upload time 0.114735, Render time 84.312407, Composite time 86.308967
Slow Frame #05 Frame 316(https://www.singlesprout.com/) CONTENT_FRAME_TIME 559 - Transaction start 0.614784, main-thread time 3.238029, full paint time 5.423443, Skipped composites 0, Composite start 12.906398, Resource upload time 0.001895, GPU cache upload time 0.324688, Render time 6.633799, Composite time 76.681015
Slow Frame #06 Frame 2772(https://www.singlesprout.com/) CONTENT_FRAME_TIME 506 - Transaction start 0.243645, main-thread time 1.299267, full paint time 1.991710, Skipped composites 0, Composite start 16.103180, Resource upload time 0.663972, GPU cache upload time 0.268590, Render time 67.415635, Composite time 68.084906
Slow Frame #07 Frame 2731(https://www.singlesprout.com/) CONTENT_FRAME_TIME 471 - Transaction start 0.381023, main-thread time 2.520607, full paint time 4.443342, Skipped composites 0, Composite start 16.042296, Resource upload time 0.000620, GPU cache upload time 0.102412, Render time 1.800384, Composite time 62.339946
Slow Frame #08 Frame 2934(https://www.singlesprout.com/) CONTENT_FRAME_TIME 466 - Transaction start 0.465604, main-thread time 7.433910, full paint time 11.099232, Skipped composites 0, Composite start 15.917207, Resource upload time 2.745110, GPU cache upload time 0.385973, Render time 4.642055, Composite time 61.459521
Slow Frame #09 Frame 2774(https://www.singlesprout.com/) CONTENT_FRAME_TIME 437 - Transaction start 0.962038, main-thread time 2.663309, full paint time 4.660299, Skipped composites 0, Composite start 16.068381, Resource upload time 0.000477, GPU cache upload time 0.060649, Render time 55.797323, Composite time 56.514158
Slow Frame #10 Frame 2773(https://www.singlesprout.com/) CONTENT_FRAME_TIME 430 - Transaction start 0.190110, main-thread time 0.870163, full paint time 1.613620, Skipped composites 0, Composite start 16.065874, Resource upload time 0.002077, GPU cache upload time 0.273269, Render time 3.445983, Composite time 55.526719

I can reproduce this by scrolling down the list of clients, and mousing over the client names (such that they animate their size).

Lots of time spent on the Renderer thread, just blocking waiting for the previous frame to complete.

Enabling picture caching helps a lot, though it seems like it's possible to temporary break it sometimes and get ~1second of slowness, and then it goes back to being fast (stands out in the gecko profiler).

WR profiler shows a lot of time in B_Image (and also target init).

Any ideas Glenn?

Flags: needinfo?(gwatson)

I tried with the latest picture caching pages - it's still very janky when scrolling. There seems to be a very large number of color targets (7+) that get allocated and used, and a lot of mix-blend-mode, which is currently quite slow. This will need further investigation, both to see why it's invalidating with picture caching, and why it's so slow to rasterize.

Flags: needinfo?(gwatson)
Assignee: nobody → gwatson

Notes for myself:

  • The mix-blend code that isolates stacking contexts doesn't take into account iframes / content rects, so is doing an extra blit and confusing the invalidation logic.

  • There is a world clip on one of the background images that is tripping up the invalidation logic - it appears to be 1px larger in the y-axis than I would expect. Might be a Gecko bug, might be something that can be handled in WR.

  • The mix-blend-mode is slow here because there are two large, chained mix-blend-modes. We might need to consider implementing the options previously discussed to make mix-blend-mode faster than the current implementation.

  • Although there are 7 reported color targets, at least 2 of them are only storage for the mix-blend-mode readback regions. However, we still clear these targets on the GPU, which is redundant and costs ~2-3 ms GPU time.

There are a number of invalidation issues here, which will need separate patches.

The first patch is https://bugzilla.mozilla.org/show_bug.cgi?id=1521689.

A second patch is up - this fixes interning text runs on this page.

https://bugzilla.mozilla.org/show_bug.cgi?id=1522028

There are still pending patches to resolve all the invalidations on this page.

WIP patch for Matt to take a look at.

No longer blocks: 1524117
Depends on: 1524117

Matt, do you want to take this bug? Once your mix-blend-mode patch lands, we'll probably want to either close this or unassign and drop priority for mvp?

Flags: needinfo?(matt.woodrow)

This is not great on integrated GPUs and still needs some work, but it seems to run well on discrete GPUs that I've tested.

Jeff, could you test / verify if it's OK to drop the priority on this one?

Flags: needinfo?(matt.woodrow) → needinfo?(jmuizelaar)

Bug 1524117 never landed. Presumably it helps with perf on this. Do you want to fix the wrench test and reland that and then we can drop the priority of this one?

Flags: needinfo?(jmuizelaar)

Now that bug 1524117 has landed this seems pretty good.

Assignee: gwatson → nobody

Fixing our mix-blend performance is fairly important.

Assignee: nobody → dmalyshau
Status: NEW → ASSIGNED
Attached file WR mix-blend rewrite

This is a new implementation of mix-blend compositing that is meant to be more idiomatic to WR and efficient.

Previously, mix-blend mode was composed in the following way:

  1. parent stacking context was forced to isolate
  2. source picture is also isolated
  3. when rendering the isolated context, the framebuffer is read upon reaching the source. Both the readback and the source are placed in the RT cache.
  4. a mix-blend draw call is issued to read from those cache segments and blend on top of the backdrop

The new implementation works by using the picture cutting (intruduced for preserve-3D contexts earlier) and some bits of magic:

  1. backdrop stacking context is isolated with a special composition mode that prevents it from actually rendeing unless the suorce stacking context is invisible.
  2. source stacking context is isolated with mix-blend composition mode that has a pointer to the backdrop picture
  3. the instance of the backdrop picture is placed as a peer of the source picture (not a child)
  4. if the backdrop is invisible, the source is drawn as a simple blit
  5. otherwise, it's a draw call that reads from the isolated backdrop and source textures

Note the differences:

  • parent stacking context is not isolated, but backdrop is
  • no framebuffer readback is involved
  • the source and backdrop pictures are rendered in parallel in a pass, improving the batching
  • we don't blend onto the backdrop while reading from the backdrop copy at the same time
  • the depth of the render pass tree is reduced: previously the parent and the source were isolated, now the source and the backdrop, which are siblings

Checked a GPU capture of the new implementation on Singesprout.com versus the old one: slightly reduced draw call count (36 versus 48), missing 3 framebuffer blits, otherwise seems identical. Therefore, I don't expect a major perf win here, but doing this re-implementation has a value anyway: it fits the picture/surface/task structure much better.

Figured out the issues on Windows and updated before landing. The new try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=14d3ae3115c7614a3711ed1529f9ab559436c4cc
It shows a few unexpected passes:

  • blend-gradient-background-color.html
  • background-blending-gradient-color.html
Flags: needinfo?(dmalyshau)
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla67
Status: RESOLVED → REOPENED
Resolution: FIXED → ---

Razvan, these are unexpected passes. I suppose all I need is to mark them as passing with this PR?

Flags: needinfo?(dmalyshau) → needinfo?(rmaries)
Flags: needinfo?(rmaries)
Status: REOPENED → RESOLVED
Closed: 2 years ago2 years ago
Resolution: --- → FIXED
Depends on: 1530333
Pushed by kgupta@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/640e731ed52e
Follow-up to bump the fuzz number slightly, so the test passes on AppVeyor too. rs=kvark
Depends on: 1530611
Backout by rgurzau@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/6878f6294d68
Backed out 2 changesets for performance regression (bug 1530611) on a CLOSED TREE.
Backout by rgurzau@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/c1bd622a4093
Backed out changeset 6878f6294d68 for bustages on a CLOSED TREE.
Pushed by dmalyshau@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/84bc38dc1fde
Revert the mix-blend rewrite r=jrmuizel
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Status: REOPENED → RESOLVED
Closed: 2 years ago2 years ago
Resolution: --- → FIXED

The original change was reverted.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: mozilla67 → ---
Blocks: wr-68
Priority: P4 → P3

What do we want to do here long term?

No longer blocks: wr-70
Flags: needinfo?(dmalyshau)

Well, the fix was rolled back, and there isn't a good plan B yet. I took a few captures today and I'm no longer seeing mix-blend mode used at all. The pages have a lot of transparency, but that's it.

Some notes from GPU captures:

  • yuv shader is heavy
    • use textureLevel() instead of texture()?
    • is it needed for grayscale images? (two of the planes are solid)
  • can we detect/make it opaque? - many images are drawn with blending enabled unnecessary.

Will look more...

WR times seem heavy-ish but not extraordinary: 3.5ms CPU rendering, 3.7ms CPU backend, 3.7ms GPU time, 1.5ms DL.
I'd say the plan is to investigate the items I posted above.

Flags: needinfo?(dmalyshau)

On average the frames are not too bad but some frames take very long. I see some with 20ms frame builds, 30ms renderer 40ms GPU.

The content main thread is 100% busy while scrolling, half of the time spent in JS inside scroll handlers, a lot of time inside a function called Gecko_SnapshotAttrHasSubstring (7.5%), and 30% in display list building.

When the render thread is busy, it spends a bit more than half of its time blobking during swap buffers, the rest in wr_renderer_render.

The render backend time spends an unusually high amount of time in webrender::picture::TileNode::update_dirty_rects (50%!) half of which is in pthread_mutex code according to the profiler, maybe some lock contention caused by memory allocations?

On the GPU side things don't look terribly suspicious. Rendering 10 picture cache tiles one after another just amounts to a lot of GPU time, in addition to the 3 passes necessary to render text shadows.
There's a fairly large semi-transparent png in the brackground (happily chomping on memory bandwidth), which we can't do much about.

Are we invalidating most of the tiles every frame? This is the case where bug 1585716 would help.

Blocks: wr-caching
No longer blocks: wr-perf
You need to log in before you can comment on or make changes to this bug.