Painting is more than 2x slower than Chrome on the Editor-CodeMirror subtest of speedometer3
Categories
(Core :: Web Painting, defect, P2)
Tracking
()
People
(Reporter: mstange, Assigned: tnikkel)
References
(Blocks 1 open bug, )
Details
(Whiteboard: [sp3])
https://speedometer-preview.netlify.app/?suite=Editor-CodeMirror
According to the comparison report, on Windows, Firefox spends 2.17x as much time painting during the measured time compared to Chrome. Eliminating the difference would improve our total time on this subtest by 2.6%.
Firefox: https://share.firefox.dev/3KmYW6z
Chrome: https://share.firefox.dev/3JZTvde
I noticed a few things in the profiles:
- Retained Display List partial update appears to fail, because we spend just as much time doing the full rebuild as we spend in the partial update rebuild.
- We hit fallback painting (
WebRenderCommandBuilder::PushItemAsImage
), probably for the Windows scrollbar arrows. - There's a lot of time spent in
RetainedDisplayListBuilder::ComputeRebuildRegion
. I think this page would probably be faster to paint with RDL turned off entirely.
Updated•2 years ago
|
Assignee | ||
Updated•2 years ago
|
Assignee | ||
Comment 1•2 years ago
|
||
Hmm, I don't get PushItemAsImage in my profiles at all. I tested on macos and on Windows 10 (there are scrollbars with arrows visible). The only thing that shows up as a blob with gfx.webrender.debug.blob.paint-flashing on is the svg inside the buttons at the top.
Maybe you can share some more info about your setup so I can try to emulate that to reproduce?
Assignee | ||
Comment 2•2 years ago
|
||
I think the profile is hitting this
but I don't see how I would not be hitting that. Maybe it's just so fast on my machine there are no samples in it?
Reporter | ||
Comment 3•2 years ago
|
||
That's possible. It is a rather small part of the profile. The profile was taken with ?iterationCount=100 and a profiling interval of 0.25ms.
Assignee | ||
Comment 4•2 years ago
|
||
Seems like the major thing that makes ComputeRebuildRegion take so long is calling nsLayoutUtils::GetTransformToAncestor. It looks like we call it 3 separate times for the same frame. Two of them in the displayport getter, and a third when we transform the rect in ProcessFrameInternal. Calling it once and sharing it would be a good optimization.
Assignee | ||
Comment 5•2 years ago
|
||
The long partial display list updates look to be when the syntax highlighting it turned on. We get ~300 modified frames, which is just below the limit of a partial update (500). They end up making the dirty rect be everything but the toolbar of buttons at the top, so we are rebuilding almost the whole thing anyway, so all the other retained stuff (merging, preprocessing etc) is just pure overhead.
It makes me think that a good heuristic to add might be to bail out of a partial update if the root dirty rect becomes most of the whole page. The root dirty rect doesn't tell the whole story because we can store an override dirty rect on every frame with a display port and every stacking context. So the root dirty rect could be large, but we immediately enter a stacking context or displayport with a very small rect. But I'm not sure if that consideration is important.
Assignee | ||
Comment 6•2 years ago
|
||
And the scrollbar parts that we use fallback for, they aren't blobs, they are images because nsDisplayThemedBackground::MustPaintOnContentSide returns true. But I think since we are using the non-native theme we are just drawing using simple DrawTarget commands and not asking the OS for bitmaps, at least in the common cases, that we can just use blobs, which should be a small win.
Updated•2 years ago
|
Assignee | ||
Updated•2 years ago
|
Assignee | ||
Comment 7•1 year ago
|
||
(In reply to Timothy Nikkel (:tnikkel) from comment #4)
Seems like the major thing that makes ComputeRebuildRegion take so long is calling nsLayoutUtils::GetTransformToAncestor. It looks like we call it 3 separate times for the same frame. Two of them in the displayport getter, and a third when we transform the rect in ProcessFrameInternal. Calling it once and sharing it would be a good optimization.
Bug 1566826, comment 3 also notes a case where getting transform matrices is significant and caching would save us time.
Description
•