Closed Bug 1595708 Opened 1 year ago Closed 1 year ago

Cargo build timings SVG output Freezes the entire browser with WebRender

Categories

(Core :: Graphics: WebRender, defect, P2)

defect

Tracking

VERIFIED FIXED
mozilla73
Tracking Status
firefox73 --- wontfix
firefox74 --- fixed

People

(Reporter: nical, Assigned: bpeers)

References

(Blocks 1 open bug)

Details

Attachments

(3 files)

Testcase: https://gistpreview.github.io/?dcccceb93a87316fcf455f7786ddb333

Profile: https://perfht.ml/2qJ9lDz

The profile was taken while simply hovering over the tab a bunch, which causes background pre-loading and freezes everything, including the rest of the chrome.

A lot of different things go very wrong here, combining into multi-second freezes of the entire browser:

  • We spend a lot of CPU time in rayon's glue. Even though worker threads are saturated, they only spend about half of their CPU time doing useful work.
  • Somehow the hit testing code gets serialized behind the blob rasterization work, causing synchronous hit testing queries to block for seconds.
  • It looks like even frame building is spending 90% of its time blocking under FrameBuilder::build_layer_screen_rects_and_cull_layers which I suspect is because we install a job on the rayon thread pool for glyph rasterization and that gets delayed by the blob rasterization.
  • We rasterize a large number of pixels at once (26 965 440 px, which is equivalent to something larger than 5k x 5k pixels).

I suspect that this is a situation where replaying large paths entirely into many small tiles that only partially cover them has substantial per-tile overhead.

There is a large 5660x4538 pixels blob which more or less matches the main <svg width="5680" height="4549" class="graph"> element in the document (and another one 5674x389 pixels).

I confirm that frame building is indeed blocked inside of resolve_glyphs. Interestingly we tend to have glyphs to rasterize most frames even though it doesn't look like we should. For example while watching a youtube video we consistently rasterize 4 glyphs every frame. We shouldn't be using rayon for glyph rasterization unless we have a large number of glyphs to rasterize. I filed bug 1595767 to fix that and bug 1595768 about investigating why we rasterize glyphs as often as we do.

So it would be worth trying to understand why the stroking doesn't show up as a problem in Chrome. In theory they should have the same problem with their tiled rendering.

Ideas from discussion with Jeff:

We want to reduce the per-tile overhead of stroking by turning strokes on regular paths into Skia's "stroked paths" (turning the stroke into a fill) before replaying into each tile.
This could be done on the recording side, but there is a risk of blowing up the recording size, for example for dashed paths like we have here.
That could be fixed by having path resources in the resource cache, just like we do for images and fonts. It's something we want to do long term anyway, and it would let us slim down updated blob recordings, which is also good.
We already have cached paths at the SVG element level: https://searchfox.org/mozilla-central/rev/0678172d5b5c681061b904c776b668489e3355b0/dom/svg/SVGGeometryElement.h#231; we could use it to keep the path key.

Another thing worth considering (in addition) is to allow the blob rasterizer to rasterize adjacent tiles as one larger buffer. The content would still be uploaded and rendered as smaller tiles.

On glyph/SVG specifically:

My understanding so far is that we allow up to 8 worker threads in Rayon, and the same threadpool is used by the blob rasterizer and the glyph rasterizer. When we're busy doing long-running blobs (SVG), the high priority UI thread might ask for a glyph and stall on the result, with no way to help out and do the glyphs rather than block.

I can't find a clear yes/no answer on whether Rayon will allow the asking-and-blocking thread to enter the threadpool and help out, building the result that it's blocking on. But even if it does, it wouldn't help us, because request_glyphs_from_backend blocks on a glyph_rx.recv(), and nothing gets glyph_tx.send() until the entire par_iter+map+collect has finished. So it's invisible to Rayon that the blocking recv could help out with the work inside par_iter.

A quick fix would be to use try_recv instead and, if the result isn't ready, add the current thread to the worker pool. I'm not sure if Rayon supports that (install() is the exact opposite), and even so, there's a risk that the extra "worker" will pick up even more SVG work (likely if Rayon uses a task FIFO).

An alternative would be to manually implement work stealing: don't use par_iter, but put all glyphs in a shared vector and have the threads insert()'ed into the threadpool claim items with an atomic add. If resolve_glyphs sees outstanding work, it can also grab some and help out.

But what we really might want is priorities on the outstanding Rayon work. When we schedule UI glyphs, they should take precedence over SVG work. If we try to do it at the Rayon level, then we have two problems: 1/ the SVG work must be yieldable, i.e. implemented as a chain of short-lived tasks, and 2/ Rayon doesn't seem to have priorities :)
So an option might be to use the OS instead: have 2 worker threadpools of 8 workers each, a low priority pool and a high priority pool. In start_handler we can (hopefully) set the priorities, and also use thread affinity to make sure that each core only runs one or the other -- we don't want 16 threads total active. Then it becomes the OS's problem to context switch from SVG to glyph rendering, which it actually can do (unlike our atomic handling of entire tasks).

Skipping all this machinery for a small number of glyphs is probably still worthwhile to reduce overhead, but it seems like we need something more general.

I hope I'm not analyzing and speculating widely off the mark here. Thanks!

I'm prototyping the extra-worker-threadpool idea. Initially I added a high priority pool and tried to put all the important things on them. But then I realized we already have a WRSceneBuilderLP and it's probably cleaner to make a lower priority threadpool, and try to get all LP blob rasterization to use those -- leaving everything else as de-facto elevated priority.

Assignee: nobody → bpeers
Pushed by bpeers@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/053275374435
Cargo build timings SVG output Freezes the entire browser with WebRender r=nical,jrmuizel
Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla73

Reopened as my patch is only the first step.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Depends on: 1602905
Depends on: 1602907
Depends on: 1604570

(nevermind, deleted)

After trying a bunch of combinations (on Windows+Intel+tsvgx) it seems like explicit affinity and priority are a net negative for perf, and are unnecessary. Simply having two pools of workers (one per SceneBuilder thread) maintains perf in tsvgx and still fixes the UI lock up.

The original point of that complexity was to make sure we never have more than 8 worker threads active. But between the perf drop, our many processes (some with more worker pools), the lack of cross-platform APIs, and ongoing scheduler R&D, maybe it's not necessary to worry too much about how many threads are runnable. So I'll try the simplest solution of two unconstrained pools, which also happens to be the fastest and most portable:

try-fuzzy-Talos results

Linux sees 10 tests up to 6% faster, 5 tests up to 6% slower. A reasonable trade-off to improve UI latency?

(The Windows gains are legit but that's just repairing the damage I did earlier. Not sure if I missed the alert or we don't alert for Windows, or what...)

Removing the priority and affinity code;
enable pool on Windows, Linux after a promising fuzzy talos run

Pushed by bpeers@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ec6587c0da30
Cargo build timings SVG output Freezes the entire browser with WebRender r=jrmuizel
Status: REOPENED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Flags: qe-verify+

I am attempting to verify this fix, but I do not understand how to reproduce it in the first place.
@Nicolas Silva: Can you explain the process in detail?
@Bert Peers: Can you provide any useful information about how to properly reproduce it? Maybe related to the test environment?

Thank you for your contributions!

Flags: needinfo?(nical.bugzilla)
Flags: needinfo?(bpeers)

Hi, the easiest way to reproduce the problem is to run Firefox on a high-DPI screen laptop, and maximize the browser.
Then, open this link:
https://gistpreview.github.io/?dcccceb93a87316fcf455f7786ddb333
and while it's loading, try to open a new tab by clicking the '+' button in the Tab Bar.
The bug is that the UI will not respond; in fact the '+' won't even highlight as you mouse over it, for several seconds, until the page stops loading. The duration of the lock-up is not always consistent, but it makes for a poor user experience most of the time.
Likewise if you go back and forth between the tabs and hit refresh on the above page, the UI is slow to respond.
Even with the fix, there can still be some sluggishness under load but it should be orders of magnitude better.

The key part is the high number of pixels to draw due to high-resolution monitor and large browser window on a not-super-powerful platform.

You should be able to reproduce the original problem somewhat by going to about:config and setting gfx.webrender.enable-low-priority-pool to false and restarting the browser.

Flags: needinfo?(bpeers)

This likely doesn't need verification. Do you know why the qe verify flag was added?

Flags: needinfo?(daniel.bodea)

I suppose it's because it's a user-facing issue, but the person who set the flag should weigh in with their opinion.
I'll verify it if needed, afterward.

Flags: needinfo?(daniel.bodea) → needinfo?(alexandru.trif)

(In reply to Jessie [:jbonisteel] plz needinfo from comment #19)

This likely doesn't need verification. Do you know why the qe verify flag was added?

We are triaging bugs that enter beta at every new cycle. As Daniel said, this was a P2 user-facing issue without automated tests, so the conclusion was that it should be manually verified. If manual testing is not needed, we can remove the qe+ flag. Thanks!

Flags: needinfo?(alexandru.trif)
Flags: needinfo?(nical.bugzilla)

Considering Alexandru's response, should we still attempt verification?
If no, why do you consider this does not need verification?
If yes, can you help with steps to verify?
Thank you!

Flags: needinfo?(jbonisteel)

If you want to try and verify, follow the steps in comment 18.

Flags: needinfo?(jbonisteel)

I cannot manage to reproduce this issue, so I cannot verify it.
Nicolas, can you please verify this bug in the fixed channels?

Flags: needinfo?(nical.bugzilla)

Verified.

Status: RESOLVED → VERIFIED
Flags: needinfo?(nical.bugzilla)
Flags: qe-verify+