Closed Bug 1595708 Opened 1 year ago Closed 1 year ago

Cargo build timings SVG output Freezes the entire browser with WebRender

Categories

(Core :: Graphics: WebRender, defect, P2)

defect

Tracking

VERIFIED FIXED
mozilla73
Tracking Status
firefox73 --- wontfix
firefox74 --- fixed

People

(Reporter: nical, Assigned: bpeers)

References

(Blocks 1 open bug)

Details

Attachments

(3 files)

Testcase: https://gistpreview.github.io/?dcccceb93a87316fcf455f7786ddb333

Profile: https://perfht.ml/2qJ9lDz

The profile was taken while simply hovering over the tab a bunch, which causes background pre-loading and freezes everything, including the rest of the chrome.

A lot of different things go very wrong here, combining into multi-second freezes of the entire browser:

  • We spend a lot of CPU time in rayon's glue. Even though worker threads are saturated, they only spend about half of their CPU time doing useful work.
  • Somehow the hit testing code gets serialized behind the blob rasterization work, causing synchronous hit testing queries to block for seconds.
  • It looks like even frame building is spending 90% of its time blocking under FrameBuilder::build_layer_screen_rects_and_cull_layers which I suspect is because we install a job on the rayon thread pool for glyph rasterization and that gets delayed by the blob rasterization.
  • We rasterize a large number of pixels at once (26 965 440 px, which is equivalent to something larger than 5k x 5k pixels).

I suspect that this is a situation where replaying large paths entirely into many small tiles that only partially cover them has substantial per-tile overhead.

There is a large 5660x4538 pixels blob which more or less matches the main <svg width="5680" height="4549" class="graph"> element in the document (and another one 5674x389 pixels).

I confirm that frame building is indeed blocked inside of resolve_glyphs. Interestingly we tend to have glyphs to rasterize most frames even though it doesn't look like we should. For example while watching a youtube video we consistently rasterize 4 glyphs every frame. We shouldn't be using rayon for glyph rasterization unless we have a large number of glyphs to rasterize. I filed bug 1595767 to fix that and bug 1595768 about investigating why we rasterize glyphs as often as we do.

So it would be worth trying to understand why the stroking doesn't show up as a problem in Chrome. In theory they should have the same problem with their tiled rendering.

Ideas from discussion with Jeff:

We want to reduce the per-tile overhead of stroking by turning strokes on regular paths into Skia's "stroked paths" (turning the stroke into a fill) before replaying into each tile.
This could be done on the recording side, but there is a risk of blowing up the recording size, for example for dashed paths like we have here.
That could be fixed by having path resources in the resource cache, just like we do for images and fonts. It's something we want to do long term anyway, and it would let us slim down updated blob recordings, which is also good.
We already have cached paths at the SVG element level: https://searchfox.org/mozilla-central/rev/0678172d5b5c681061b904c776b668489e3355b0/dom/svg/SVGGeometryElement.h#231; we could use it to keep the path key.

Another thing worth considering (in addition) is to allow the blob rasterizer to rasterize adjacent tiles as one larger buffer. The content would still be uploaded and rendered as smaller tiles.

On glyph/SVG specifically:

My understanding so far is that we allow up to 8 worker threads in Rayon, and the same threadpool is used by the blob rasterizer and the glyph rasterizer. When we're busy doing long-running blobs (SVG), the high priority UI thread might ask for a glyph and stall on the result, with no way to help out and do the glyphs rather than block.

I can't find a clear yes/no answer on whether Rayon will allow the asking-and-blocking thread to enter the threadpool and help out, building the result that it's blocking on. But even if it does, it wouldn't help us, because request_glyphs_from_backend blocks on a glyph_rx.recv(), and nothing gets glyph_tx.send() until the entire par_iter+map+collect has finished. So it's invisible to Rayon that the blocking recv could help out with the work inside par_iter.

A quick fix would be to use try_recv instead and, if the result isn't ready, add the current thread to the worker pool. I'm not sure if Rayon supports that (install() is the exact opposite), and even so, there's a risk that the extra "worker" will pick up even more SVG work (likely if Rayon uses a task FIFO).

An alternative would be to manually implement work stealing: don't use par_iter, but put all glyphs in a shared vector and have the threads insert()'ed into the threadpool claim items with an atomic add. If resolve_glyphs sees outstanding work, it can also grab some and help out.

But what we really might want is priorities on the outstanding Rayon work. When we schedule UI glyphs, they should take precedence over SVG work. If we try to do it at the Rayon level, then we have two problems: 1/ the SVG work must be yieldable, i.e. implemented as a chain of short-lived tasks, and 2/ Rayon doesn't seem to have priorities :)
So an option might be to use the OS instead: have 2 worker threadpools of 8 workers each, a low priority pool and a high priority pool. In start_handler we can (hopefully) set the priorities, and also use thread affinity to make sure that each core only runs one or the other -- we don't want 16 threads total active. Then it becomes the OS's problem to context switch from SVG to glyph rendering, which it actually can do (unlike our atomic handling of entire tasks).

Skipping all this machinery for a small number of glyphs is probably still worthwhile to reduce overhead, but it seems like we need something more general.

I hope I'm not analyzing and speculating widely off the mark here. Thanks!

I'm prototyping the extra-worker-threadpool idea. Initially I added a high priority pool and tried to put all the important things on them. But then I realized we already have a WRSceneBuilderLP and it's probably cleaner to make a lower priority threadpool, and try to get all LP blob rasterization to use those -- leaving everything else as de-facto elevated priority.

Assignee: nobody → bpeers
Pushed by bpeers@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/053275374435
Cargo build timings SVG output Freezes the entire browser with WebRender r=nical,jrmuizel
Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla73

Reopened as my patch is only the first step.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Depends on: 1602905
Depends on: 1602907
Depends on: 1604570

(nevermind, deleted)

After trying a bunch of combinations (on Windows+Intel+tsvgx) it seems like explicit affinity and priority are a net negative for perf, and are unnecessary. Simply having two pools of workers (one per SceneBuilder thread) maintains perf in tsvgx and still fixes the UI lock up.

The original point of that complexity was to make sure we never have more than 8 worker threads active. But between the perf drop, our many processes (some with more worker pools), the lack of cross-platform APIs, and ongoing scheduler R&D, maybe it's not necessary to worry too much about how many threads are runnable. So I'll try the simplest solution of two unconstrained pools, which also happens to be the fastest and most portable:

try-fuzzy-Talos results

Linux sees 10 tests up to 6% faster, 5 tests up to 6% slower. A reasonable trade-off to improve UI latency?

(The Windows gains are legit but that's just repairing the damage I did earlier. Not sure if I missed the alert or we don't alert for Windows, or what...)

Removing the priority and affinity code;
enable pool on Windows, Linux after a promising fuzzy talos run

Pushed by bpeers@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ec6587c0da30
Cargo build timings SVG output Freezes the entire browser with WebRender r=jrmuizel
Status: REOPENED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Flags: qe-verify+

I am attempting to verify this fix, but I do not understand how to reproduce it in the first place.
@Nicolas Silva: Can you explain the process in detail?
@Bert Peers: Can you provide any useful information about how to properly reproduce it? Maybe related to the test environment?

Thank you for your contributions!

Flags: needinfo?(nical.bugzilla)
Flags: needinfo?(bpeers)

Hi, the easiest way to reproduce the problem is to run Firefox on a high-DPI screen laptop, and maximize the browser.
Then, open this link:
https://gistpreview.github.io/?dcccceb93a87316fcf455f7786ddb333
and while it's loading, try to open a new tab by clicking the '+' button in the Tab Bar.
The bug is that the UI will not respond; in fact the '+' won't even highlight as you mouse over it, for several seconds, until the page stops loading. The duration of the lock-up is not always consistent, but it makes for a poor user experience most of the time.
Likewise if you go back and forth between the tabs and hit refresh on the above page, the UI is slow to respond.
Even with the fix, there can still be some sluggishness under load but it should be orders of magnitude better.

The key part is the high number of pixels to draw due to high-resolution monitor and large browser window on a not-super-powerful platform.

You should be able to reproduce the original problem somewhat by going to about:config and setting gfx.webrender.enable-low-priority-pool to false and restarting the browser.

Flags: needinfo?(bpeers)

This likely doesn't need verification. Do you know why the qe verify flag was added?

Flags: needinfo?(daniel.bodea)

I suppose it's because it's a user-facing issue, but the person who set the flag should weigh in with their opinion.
I'll verify it if needed, afterward.

Flags: needinfo?(daniel.bodea) → needinfo?(alexandru.trif)

(In reply to Jessie [:jbonisteel] plz needinfo from comment #19)

This likely doesn't need verification. Do you know why the qe verify flag was added?

We are triaging bugs that enter beta at every new cycle. As Daniel said, this was a P2 user-facing issue without automated tests, so the conclusion was that it should be manually verified. If manual testing is not needed, we can remove the qe+ flag. Thanks!

Flags: needinfo?(alexandru.trif)
Flags: needinfo?(nical.bugzilla)

Considering Alexandru's response, should we still attempt verification?
If no, why do you consider this does not need verification?
If yes, can you help with steps to verify?
Thank you!

Flags: needinfo?(jbonisteel)

If you want to try and verify, follow the steps in comment 18.

Flags: needinfo?(jbonisteel)

I cannot manage to reproduce this issue, so I cannot verify it.
Nicolas, can you please verify this bug in the fixed channels?

Flags: needinfo?(nical.bugzilla)

Verified.

Status: RESOLVED → VERIFIED
Flags: needinfo?(nical.bugzilla)
Flags: qe-verify+