Open Bug 1594768 Opened 5 years ago Updated 1 year ago

Too many blob layers on https://www.goodfour.us/

Categories

(Core :: Graphics: WebRender, defect, P3)

defect

Tracking

()

People

(Reporter: nical, Unassigned)

References

(Depends on 1 open bug, Blocks 2 open bugs)

Details

Attachments

(4 files, 3 obsolete files)

Just above halfway on the page, "Why Four Hours?" animation).

Blocks: wr-blob-perf

This is a reduced version of the animation on the original page, including only a single layer. This is a Lottie animation that decodes to an SVG consisting of a large square of solid color, masked to a small circle. The two seem to move together.

The full animation includes many layers, all using a similar technique. Firefox doesn't recognize that although the path in the layer is large, the visible portion of it is much smaller, so it ends up allocating a much larger backing pixel buffer than necessary, and drawing more area than necessary.

(Masking is expensive because it ostensibly requires us to fully render the underlying image (the large square), render the mask, compute a luminance value for each pixel, and then trim the underlying image according to the luminance values, a per-pixel multiplication.)

Firefox could render this faster by computing a bounding box for the portion of the mask with non-zero luminance, intersecting that with the bounding box of the underlying image, and using the result to decide what portion to render, cache, and draw.

Attachment #9121163 - Attachment is obsolete: true
Attachment #9121164 - Attachment is obsolete: true

So I believe we could get a significant improvement here if the computed geometry for the nsDisplayItemMasksAndClip included the bounds of the mask as is computed by ComputeMaskGeometry (https://searchfox.org/mozilla-central/source/layout/painting/nsDisplayList.cpp#9803).

Miko do you think intersecting the bounds of the mask in nsDisplayItemMasksAndClip is practical?

Flags: needinfo?(mikokm)

It would also be interesting to try disabling active layers to see how much worse they're making performance here.

(In reply to Jeff Muizelaar [:jrmuizel] from comment #6)

Miko do you think intersecting the bounds of the mask in nsDisplayItemMasksAndClip is practical?

Unfortunately I'm not familiar enough with masking to answer this. Markus probably knows this code better.

Flags: needinfo?(mikokm) → needinfo?(mstange)
See Also: → 1609705

My laptop has a high-DPI screen and integrated Intel graphics. We observed the WRWorkerLP threads spending most of their time waiting for malloc locks, so we doubled the size of the tiles (to quarter the number of allocations) and lowered the number of worker threads to match physical cpus, since hyperthreading is unlikely to benefit blitting tasks, and they're spending their time waiting for the malloc lock anyway.

This lowers the time a WRWorkerLP spends blocked on malloc to 20%, but still gives poor frame rates, with 28% of the time spent on mask operations. The way to improve that time is to shrink the area of the mask, so we don't need to create the mask, compute its luminance, and so on. If we can use the mask region to choose smaller bounding rectangles into the blob image recordings for the SVG, that would reduce these costs.

If we force jemalloc to use a thread-local arena even for large allocations, the malloc lock overhead disappears from the profiles.

Perhaps we should change mozjemalloc.cpp's choose_arena to let each thread specify its own size threshold for redirecting requests to the default arena. Then WRWorkerLP threads could specify a large threshold, to remove contention over allocations like this, whereas other threads could continue to operate under the same parameters they do now.

Interesting. How large is 'large'? One would expect large allocations to be infrequent unless we're consuming enormous amount of memory, and therefor not conflict a lot, assuming all the small cruft goes to a per-thread SBA. Is the problem perhaps that we're constantly allocating and freeing these large chunks? If so, a per-thread re-use / delayed-free strategy could also be an idea?

Depends on: 1610409

(In reply to Jeff Muizelaar [:jrmuizel] from comment #6)

So I believe we could get a significant improvement here if the computed geometry for the nsDisplayMasksAndClip included the bounds of the mask as is computed by ComputeMaskGeometry (https://searchfox.org/mozilla-central/source/layout/painting/nsDisplayList.cpp#9803).

So the way this is supposed to work is the following:

We should verify that the call to ComputeClipForMaskItem in BuildDisplayListForStackingContext is hit and computes an appropriate rectangle.

Flags: needinfo?(mstange)

The bounds and clips of the Gecko display items are printed as part of a Gecko display list dump. You can get one from your testcase by setting the pref layout.display-list.dump-content to true and running the testcase; this pref works in any Firefox build.

Jeff traced the rect computation to a call to GetBBox. GetBBox takes into account clip-paths but not masks, as per spec (see the "If clipped is true" pieces). So if we want to take masks into account we have to additionally intersect the result of GetBBox with the mask bounds, and we may need a way to compute those mask bounds if we don't already have one.

Okay, this makes sense. I'd observed GetBBox behaving that way but didn't realize it was per spec.

ComputeClipForMaskItem, however, is calling GetAndObserveMasks, suggesting that it already intends to take masks into account; however, it only consults the masks of the top item; in the example, the top item has a clip, but its sole child has a mask.

No longer blocks: wr-perf
Severity: normal → S3
Blocks: lottie
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: