Bug 1471309 (Open) - Investigate eager OMT font shaping and/or text run construction
Opened 6 years ago · Updated 2 years ago
Component: Core :: Layout: Text and Fonts (enhancement, P3)
Status: NEW
People: Reporter: bholley; Assignee: jfkthame
References: Blocks 1 open bug

Description (comment 0):
From jfkthame over email:
> text shaping (or more generally textrun construction) can be a significant
> part of intrinsic-size computation. Currently, we create textruns "on demand"
> when a textframe needs to measure itself, and this can block things like
> Get{Min,Pref}ISize during reflow. I've wondered if it would be beneficial to
> trigger text shaping on a separate thread during frame construction, so that
> by the time reflow wants to know the frame's inline-size, we've already shaped
> the text.
Next step here is to do some investigation, including:
* Measuring the overall percentage of reflow spent constructing text frames on top sites.
* Measuring the benefits of the existing lazy infrastructure - how much extra work would we incur by doing it eagerly?
* Measuring the cost of compiling HarfBuzz with thread safety enabled
* Brainstorming and discussing appropriate algorithms and machinery if it looks promising.
Jonathan is going to lead the investigation here.
Updated 6 years ago (Reporter)
Blocks: layout-perf

Updated 6 years ago (Assignee)
Priority: -- → P3

Comment 1 • jfkthame (Assignee) • 6 years ago
As a first step, I tried profiling the loading of a fairly text-heavy site (https://en.wikipedia.org/wiki/History_of_Western_civilization, which contains 200K or so of English text, without a lot of complex styling). These profiles are from a local macOS (optimized, nodebug) build.
(1) Initial page load in a newly-launched browser: https://perfht.ml/2z2fb6D
The long first reflow here is 264ms. Focusing on just the samples in that reflow, we have 154ms in EnsureTextRun (which is what nsTextFrames use to create their text run when needed); and of that, 110ms is specifically in ShapeText (getting the array of positioned glyphs for a given text string and font style).
(2) Reload the page (so the shaped-word caches are populated): https://perfht.ml/2z4hc2c
The first reflow is down to 114ms, of which EnsureTextRun is now just 24ms, and ShapeText doesn't show up at all.
This illustrates the benefit we get from the shaped-word caches: for the initial page load, shaping was the dominant part of textrun construction, and textrun construction in turn was the largest part of reflow. On a reload, shaping is no longer significant, textrun construction is much faster, and it accounts for only about 20% of the complete reflow.
(The exact figures vary quite a bit from one profile to another, of course, but the above are pretty typical.)
But there's something else: it turns out these profiles (and probably many profiles recorded on Macs) are rather misleading in terms of how we'd perform on other platforms. The Wikipedia page is using the browser's default sans-serif font, which on macOS is Helvetica, and is an AAT font. These are specific to macOS, and we route them to the Core Text shaper in order to support the Core Text-specific layout tables that may be present. But Core Text tends to be substantially slower than HarfBuzz for similar content.
So I changed the default fonts for Latin text in my profile, to use Times New Roman (OpenType) in place of Times (AAT); Arial (OpenType) in place of Helvetica (AAT); and Courier New (OpenType) in place of Courier (AAT). These fonts will be shaped via HarfBuzz, and therefore more representative of the behavior we'd see across all other platforms.
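For reference, the switch was made via the per-script default-font prefs (pref names quoted from memory, so verify them in about:config before relying on this):

user_pref("font.name.serif.x-western", "Times New Roman");
user_pref("font.name.sans-serif.x-western", "Arial");
user_pref("font.name.monospace.x-western", "Courier New");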
(3) Initial page load in a newly-launched browser: https://perfht.ml/2yYrNvz
First reflow is 175ms; within that, we have EnsureTextRun at 77ms, of which ShapeText is 33ms.
(4) Reload: https://perfht.ml/2tYgQ7k
First reflow is 112ms; of that, EnsureTextRun is just 20ms, and ShapeText is gone.
So reloading the page (which makes use of the word caches) is the same for either Core Text or HarfBuzz shaping -- as expected, given that the text didn't need to be re-shaped -- but the initial reflow (with empty caches) is significantly faster when using HarfBuzz, and the contribution of text shaping to that reflow dropped from 42% to 19% when I switched the default font.
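For reference, those percentages come straight from the sample figures above:

110ms ShapeText / 264ms first reflow ≈ 42% (Core Text, cold load)
 33ms ShapeText / 175ms first reflow ≈ 19% (HarfBuzz, cold load)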
(In reply to Bobby Holley (:bholley) from comment #0)
> * Measuring the overall percentage of reflow spent constructing text frames
> on top sites.
:bholley, have we compiled a list of the specific top sites that we'd like to investigate more deeply?
Flags: needinfo?(bobbyholley)
Comment 2 • bholley (Reporter) • 6 years ago
This is great analysis, thanks!
One question from comment 0 that isn't addressed here is the lazy-vs-eager question. Is there a meaningful delta between the text runs we'd create if we did it eagerly versus what we currently create on-demand? Just trying to figure out if there is a cost to be considered there (this was the case with cascading in stylo).
(In reply to Jonathan Kew (:jfkthame) from comment #1)
> As a first step, I tried profiling the loading of a fairly text-heavy site
> (https://en.wikipedia.org/wiki/History_of_Western_civilization, which
> contains 200K or so of English text, without a lot of complex styling).
> These profiles are from a local macOS (optimized, nodebug) build.
Note that you'll also need --enable-release to get realistic performance results on anything that involves Rust code (which is certainly the case for layout). Those builds are slow to compile, though, so profiling Nightlies is a decent alternative. I also recommend a profiling interval of 0.5ms for coarse-grained profiling and 0.2ms for finer-grained profiling.
> (1) Initial page load in a newly-launched browser: https://perfht.ml/2z2fb6D
> The long first reflow here is 264ms. Focusing on just the samples in that
> reflow, we have 154ms in EnsureTextRun (which is what nsTextFrames use to
> create their text run when needed); and of that, 110ms is specifically in
> ShapeText (getting the array of positioned glyphs for a given text string
> and font style).
>
> (2) Reload the page (so the shaped-word caches are populated):
> https://perfht.ml/2z4hc2c
> The first reflow is down to 114ms, of which EnsureTextRun is now just 24ms,
> and ShapeText doesn't show up at all.
>
> This illustrates the benefit we get from the shaped-word caches: for the
> initial page load, shaping was the dominant part of textrun construction,
> and textrun construction in turn was the largest part of reflow, but on a
> reload, shaping is no longer significant, textrun construction is much
> faster, and is now only 20% of the complete reflow.
That cache seems pretty powerful. How much does it tend to help across domains? I'm specifically wondering in terms of Fission.
> (The exact figures vary quite a bit from one profile to another, of course,
> but the above are pretty typical.)
>
> But there's something else: it turns out these profiles (and probably many
> profiles recorded on Macs) are rather misleading in terms of how we'd
> perform on other platforms. The Wikipedia page is using the browser's
> default sans-serif font, which on macOS is Helvetica, and is an AAT font.
> These are specific to macOS, and we route them to the Core Text shaper in
> order to support the Core Text-specific layout tables that may be present.
> But Core Text tends to be substantially slower than HarfBuzz for similar
> content.
Yeah, I think we care most about HarfBuzz assuming that's what's used on Windows.
>
> So I changed the default fonts for Latin text in my profile, to use Times
> New Roman (OpenType) in place of Times (AAT); Arial (OpenType) in place of
> Helvetica (AAT); and Courier New (OpenType) in place of Courier (AAT). These
> fonts will be shaped via HarfBuzz, and therefore more representative of the
> behavior we'd see across all other platforms.
>
> (3) Initial page load in a newly-launched browser: https://perfht.ml/2yYrNvz
> First reflow is 175ms; within that, we have EnsureTextRun at 77ms, of which
> ShapeText is 33ms.
>
> (4) Reload: https://perfht.ml/2tYgQ7k
> First reflow is 112ms; of that, EnsureTextRun is just 20ms, and ShapeText is
> gone.
>
> So reloading the page (which makes use of the word caches) is the same for
> either Core Text or HarfBuzz shaping -- as expected, given that the text
> didn't need to be re-shaped -- but the initial reflow (with empty caches) is
> significantly faster when using HarfBuzz, and the contribution of text
> shaping to that reflow dropped from 42% to 19% when I switched the default
> font.
>
>
> (In reply to Bobby Holley (:bholley) from comment #0)
>
> > * Measuring the overall percentage of reflow spent constructing text frames
> > on top sites.
>
> :bholley, have we compiled a list of the specific top sites that we'd like
> to investigate more deeply?
Yeah, this should get you started: https://docs.google.com/document/d/1I5MlrMgNTMjicHgauWa0zltS9tMBxcgN-hCPgNBxZA8
Obviously no need to exhaustively test everything there, but should be a decent starting point.
Flags: needinfo?(bobbyholley)
Comment hidden (typo)

Comment 4 • jfkthame (Assignee) • 6 years ago
So the other point to keep in mind here is that there are two potential "levels" at which we could try to do this -- if it looks worth pursuing at all. There's the font shaping process, where harfbuzz (or coretext) implements the layout of a run of characters in a specific font/style, with specific features applied, etc., and returns an array of glyphs and positions. This is what happens within the ShapeText method. It's (usually) done on a per-word basis, and the results cached so that a given word (in a given style) doesn't need to have shaping re-done every time it occurs; we just pull the resulting positioned glyphs from the word cache.
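In simplified pseudocode (invented names; the real entry point is roughly gfxFont::GetShapedWord, and the real key also includes script, features, appunits-per-dev-pixel, etc.), the per-word caching looks like this:

#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Simplified sketch of the shaped-word cache: a word shaped once in a given
// font/style is stored and reused on every later occurrence instead of being
// re-shaped. All names here are invented for illustration.
struct PositionedGlyph {
  uint32_t glyphId;
  float advance;  // real code also stores offsets, flags, etc.
};

struct WordKey {
  std::u16string text;   // the word's characters
  uint64_t styleHash;    // font, size, features, script, ... (simplified)
  bool operator==(const WordKey& o) const {
    return text == o.text && styleHash == o.styleHash;
  }
};

struct WordKeyHash {
  size_t operator()(const WordKey& k) const {
    return std::hash<std::u16string>{}(k.text) ^ size_t(k.styleHash * 31);
  }
};

// Stand-in for the actual shaper call (HarfBuzz, or Core Text for AAT fonts).
std::vector<PositionedGlyph> ShapeWord(const WordKey& key) {
  return {};  // real implementation runs the shaper and returns glyphs
}

class FontWordCache {
 public:
  const std::vector<PositionedGlyph>& GetOrShape(const WordKey& key) {
    auto it = mWords.find(key);
    if (it == mWords.end()) {
      it = mWords.emplace(key, ShapeWord(key)).first;  // slow path: shape once
    }
    return it->second;  // fast path: reuse the cached positioned glyphs
  }
 private:
  std::unordered_map<WordKey, std::vector<PositionedGlyph>, WordKeyHash> mWords;
};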
What I've considered from time to time is the possibility of implementing eager font shaping, because as soon as we have a text frame, associated with a given piece of content and a computed style, we could shape all the words that occur in the text and store them in the relevant fonts' word caches. Then, when we need a text run, the cache would be pre-populated and text run construction would be correspondingly faster (very much like the difference between an initial cold page-load and a reload, in the profiles in comment 1). This seems, at least in principle, like it would be pretty easy to implement (modulo ensuring the relevant objects/methods are made thread-safe).
The second level would be eager construction of the actual textruns that text frames use for measurement and drawing. I think this is significantly harder than eager shaping, because there isn't a simple 1:1 mapping of text frames to textruns; we often create a single textrun that backs a whole sequence of frames. So while in principle we could parallelize the shaping of each of the individual words in the text, it may be harder to parallelize textrun creation (though maybe we could parallelize the actual construction, driven by a non-parallel walk over the frame tree to identify the ranges for each run needed).
(These two levels correspond roughly to ShapeText vs EnsureTextRun in the profiles, where ShapeText is a subset of the overall EnsureTextRun time.)
Comment 5 • jfkthame (Assignee) • 6 years ago
One more profile of loading https://en.wikipedia.org/wiki/History_of_Western_civilization (initial load in a freshly-launched browser, so the word caches are empty):
https://perfht.ml/2lRbOWT
First reflow is down to 127ms; within that, we have EnsureTextRun: 31ms, ShapeText: 4ms.
This is the result of adding a quick hack to implement "eager font shaping": as soon as a textframe is created during frame construction, we go ahead and shape each of the words found. Therefore, by the time reflow happens, the shaped words we need are found in cache.
Now, because font selection and shaping aren't at all thread-safe, the hack here does all this on the main thread. So there's really no overall win: filtering the profile for ShapeText, we see that most of the ShapeText time (41ms overall) has simply moved earlier in the profile, so it happens before the long reflow; but that just means we're spending correspondingly longer (just over 50ms extra) in nsCSSFrameConstructor instead, and starting reflow later.
But in principle, if shaping were thread-safe (and we had a spare core) that shaping could be taken off the main thread and allowed to happen on a secondary thread without blocking frame construction.
Comment 6 • bholley (Reporter) • 6 years ago
That's great context and measurement, thanks! Exactly the sort of "prototype it before you do it" stuff we should be doing. :-)
The cache-priming approach definitely seems promising, assuming we can make the thread safety work without a bunch of added contention / overhead.
I think there are a few more measurements we need to take before proceeding. Specifically:
(1) Rerunning the existing measurements with an --enable-release build and better resolution, per comment 2. We want to be sure that the proportions of time here are realistic, and that the overhead of the Servo_* calls (nontrivial in your profile) is not inflated as a result of missing optimizations.
(2) Repeating the measurements on some of the sites in the doc. I think we mostly want to know which sites have enough text to get significant benefit from this, and what the benefit might look like in those cases.
Comment 7 • jfkthame (Assignee) • 6 years ago
I've run profiles of a few of the sites from the optimization testcases doc; here are some notes. (These were run with a standard Nightly build.) In each case, there's a profile from loading the site in a newly-launched browser, and one from reloading.
tp6 pages:
(a) Amazon
initial: https://perfht.ml/2u1RZzJ long reflows 72ms, 43ms, 2324ms(!); EnsureTextRun 2366ms (23ms + 15ms within the first two reflows); ShapeText 11ms
reload: https://perfht.ml/2z6jga1 reflows 43ms + 8.7ms; EnsureTextRun 18ms; ShapeText 0
(I'm not sure what the 2-plus second reflow there is about; it seems to be something the page triggers via script after it has loaded, but it doesn't block display of the page to the user. The vast majority of this time seems to be doing system font fallback.)
(b) Facebook
initial: https://perfht.ml/2lS9tLf reflow 64ms; EnsureTextRun 42ms; ShapeText 8ms
reload: https://perfht.ml/2lPR6GI reflow 7.1ms; EnsureTextRun 0; ShapeText 0
(c) Google
initial: https://perfht.ml/2lOYp1h reflow 21ms; EnsureTextRun 5ms; ShapeText 0
reload: https://perfht.ml/2lRAzC2 reflow 7.7ms; EnsureTextRun 0; ShapeText 0
(d) YouTube
initial: https://perfht.ml/2lOYT7B reflows 11ms + 70ms + 31ms; EnsureTextRun 40ms; ShapeText 4ms
reload: https://perfht.ml/2lPXgGX reflows 3.6ms + 42ms + 31ms; EnsureTextRun 20ms; ShapeText 5ms
Other sites (live web):
(e) NYTimes home page:
initial: https://perfht.ml/2zd0f5U long reflows 63ms, 28ms, 22ms, 22ms, 38ms; EnsureTextRun 80ms; ShapeText 14ms
reload: https://perfht.ml/2z1YW9A long reflows 17ms, 17ms, 20ms, 22ms; EnsureTextRun 24ms; ShapeText 0
(f) The Guardian home page:
initial: https://perfht.ml/2lVmRyl first reflow 81ms; EnsureTextRun 38ms; ShapeText 11ms
reload: https://perfht.ml/2lOHP1z first reflow 38ms; EnsureTextRun 14ms; ShapeText 8ms
(g) Long-ish Medium post (https://medium.com/s/story/life-after-aziz-9ded0e53c184)
initial: https://perfht.ml/2lN6WBU first reflow 155ms; EnsureTextRun 141ms; ShapeText 13ms
reload: https://perfht.ml/2lQ7iI7 first reflow 24ms; EnsureTextRun 17ms; ShapeText 10ms
(h) A reddit thread (https://www.reddit.com/r/britishproblems/comments/8rmt52/meta_on_the_eu_copyright_reform/)
initial: https://perfht.ml/2z6HCjX reflow 73ms; EnsureTextRun 14ms; ShapeText 6ms
reload: https://perfht.ml/2z6G0GZ reflow 17ms; EnsureTextRun 1ms; ShapeText 0
So in some cases, ShapeText makes a significant contribution to the initial reflow (e.g. 14ms on the NYTimes home page), and could benefit from being done off-main-thread, but in others it is relatively minor compared to the rest of EnsureTextRun.
Eventually, triggering an OMT version of EnsureTextRun from frame construction could be a bigger win, but would also be more complex; and the more limited project of making per-word font shaping thread-safe would be a prerequisite anyhow.
I'll run some higher-resolution profiles to confirm how they look; I also want to try instrumenting a build to see how often we create a textframe that we end up discarding without ever actually measuring or rendering it. If that happens with any frequency, then eager textrun construction for such frames would simply be wasted work.
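The instrumentation could be as simple as a pair of counters along these lines (a hypothetical sketch, not the actual patch):

#include <atomic>
#include <cstdint>
#include <cstdio>

// Count how many text frames are destroyed without ever having had a textrun
// assigned, i.e. frames for which eager shaping would have been wasted work.
// Names are invented; the real hooks would live in nsTextFrame's init/destroy.
static std::atomic<uint64_t> sTextFramesCreated{0};
static std::atomic<uint64_t> sTextFramesNeverGotTextRun{0};

void OnTextFrameCreated() { sTextFramesCreated++; }

void OnTextFrameDestroyed(bool aEverHadTextRun) {
  if (!aEverHadTextRun) {
    sTextFramesNeverGotTextRun++;
  }
}

void ReportTextFrameWaste() {
  uint64_t created = sTextFramesCreated.load();
  uint64_t wasted = sTextFramesNeverGotTextRun.load();
  std::printf("text frames: %llu created, %llu (%.1f%%) never got a textrun\n",
              (unsigned long long)created, (unsigned long long)wasted,
              created ? 100.0 * wasted / created : 0.0);
}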
Comment 8 • bholley (Reporter) • 6 years ago
Yeah. From these measurements it seems like it mostly hinges on the code complexity, overhead, and percentage of wasted work. If things look acceptable on all three fronts, the cache priming could be a modest win.
It does seem that EnsureTextRun is the bigger fish though. What's the breakdown of work that function does? At first glance, it looks like there's some very expensive stuff we do, but only once per font/process?
Comment 9 • jfkthame (Assignee) • 6 years ago
I instrumented a browser to report how many text frames are created and destroyed without ever getting a textrun assigned, indicating cases where eager work would be wasted work.
Over the course of a session visiting a bunch of sites, particularly pages from the "optimization testcases" list, I was seeing a range of around 4-7% of text frames that never get textruns. When I focused on relatively simple, static pages (e.g. Wikipedia articles) the percentage would be lower, and when visiting typical media sites that use a more complex layout and pull in a lot of different resources, potentially building the page more dynamically, the percentage tends to rise (as one would expect).
I suspect - though instrumenting to prove it would be a bit more complex - that in the cases where we construct a text frame but then discard it without using its textrun, we'll often end up reconstructing a new frame that will still need the same shaped words. So word-cache priming would still be a win, not just wasted work.
> It does seem that EnsureTextRun is the bigger fish though. What's the breakdown of work that function does? At first glance, it looks like there's some very expensive stuff we do, but only once per font/process?
Yes; in particular, during EnsureTextRun we do the work of matching the characters in the text against the available fonts (the font-family list, followed by fonts from preferences and global fallback, if necessary). The first time we want to use a given font family, we'll have to load a whole bunch of information about its faces, which can be a bit expensive - which is why we do it on-demand rather than preloading all the font info during startup.
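In pseudocode, the per-character selection order is roughly as follows (all names here are simplified stand-ins rather than the real Gecko API; the real code also caches the previously used font, works on ranges, handles synthetic styles, and so on):

#include <cstdint>
#include <vector>

struct Font;  // stand-in for gfxFont

struct Family {
  Font* FindFontSupporting(uint32_t aCh);  // null if no face has a glyph for aCh
};

std::vector<Family*> PrefFamiliesFor(uint32_t aCh);  // per-language pref fonts
Font* GlobalSystemFallback(uint32_t aCh);            // expensive last resort

Font* SelectFontForChar(uint32_t aCh, const std::vector<Family*>& aCssFamilies) {
  // 1. Families named in the CSS font-family list, in order.
  for (Family* family : aCssFamilies) {
    if (Font* f = family->FindFontSupporting(aCh)) {
      return f;
    }
  }
  // 2. Fonts from preferences for the relevant language/script.
  for (Family* family : PrefFamiliesFor(aCh)) {
    if (Font* f = family->FindFontSupporting(aCh)) {
      return f;
    }
  }
  // 3. Global system font fallback (the expensive step that dominated the
  //    long Amazon reflow in comment 7).
  return GlobalSystemFallback(aCh);
}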
(One thing that I've noticed in some of these profiles is that initializing the macOS system UI font (font-family:-apple-system) seems to be particularly expensive, because it is a more complex font family (with optical sizes, variations, etc) than most common families. It may be worth pre-initializing this specific family in a background thread that we fire off as early as possible during startup, so that it doesn't block the initial reflow of sites like Facebook.)
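If we did pre-initialize it, the trigger could be as small as this (a sketch assuming the NS_DispatchBackgroundTask / NS_NewRunnableFunction helpers; WarmUpSystemFontFamily is a hypothetical placeholder for whatever loads the family's faces and character maps):

#include "nsThreadUtils.h"

void WarmUpSystemFontFamily();  // hypothetical: loads -apple-system faces, cmaps, etc.

void MaybePreinitSystemFontFamily() {
  // Fire off early in startup so the work doesn't land in the first reflow.
  NS_DispatchBackgroundTask(NS_NewRunnableFunction(
      "font::PreinitSystemFamily", [] { WarmUpSystemFontFamily(); }));
}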
Doing word-cache priming should also pull the potentially-expensive font initialization out of EnsureTextRun, though. Matching the text to available fonts would end up getting re-done when we actually construct the textrun, but it should be faster as the font families involved will have already been initialized.
EnsureTextRun also scans the frames to determine what content goes into the textrun, because we may create a single "unified" textrun that allows shaping to cross the boundaries of inline elements that share the same font style. This is work that would end up being wasted if we destroy the frame without using its textrun, as we'd have to re-scan the new frame tree.
Comment 10 • jfkthame (Assignee) • 6 years ago
So I think the first thing to do here is to look into how much overhead and complexity will be involved in supporting off-main-thread font shaping, which will have to include the font-selection process (including fallback and all that) as well as shaping itself. If we go further and take all of EnsureTextRun onto another thread, the font selection and shaping work will be an essential part of that anyway.
Comment 11 • bholley (Reporter) • 6 years ago
(In reply to Jonathan Kew (:jfkthame) [away July 9-12] from comment #9)
> I instrumented a browser to report how many text frames are created and
> destroyed without ever getting a textrun assigned, indicating cases where
> eager work would be wasted work.
>
> Over the course of a session visiting a bunch of sites, particularly pages from
> the "optimization testcases" list, I was seeing a range of around 4-7% of
> text frames that never get textruns. When I focused on relatively simple,
> static pages (e.g. Wikipedia articles) the percentage would be lower, and
> when visiting typical media sites that use a more complex layout and pull in
> a lot of different resources, potentially building the page more
> dynamically, the percentage tends to rise (as one would expect).
4-7% is certainly in the ballpark of what might be ok, IMO.
> I suspect - though instrumenting to prove it would be a bit more complex -
> that in the cases where we construct a text frame but then discard it
> without using its textrun, we'll often end up reconstructing a new frame
> that will still need the same shaped words. So word-cache priming would
> still be a win, not just wasted work.
Yeah, that's a fair point. We could potentially measure this more precisely with your aforementioned prototype by counting which of the primed words get cache hits during actual reflow.
>
> > It does seem that EnsureTextRun is the bigger fish though. What's the breakdown of work that function does? At first glance, it looks like there's some very expensive stuff we do, but only once per font/process?
>
> Yes; in particular, during EnsureTextRun we do the work of matching the
> characters in the text against the available fonts (the font-family list,
> followed by fonts from preferences and global fallback, if necessary). The
> first time we want to use a given font family, we'll have to load a whole
> bunch of information about its faces, which can be a bit expensive - which
> is why we do it on-demand rather than preloading all the font info during
> startup.
>
> (One thing that I've noticed in some of these profiles is that initializing
> the macOS system UI font (font-family:-apple-system) seems to be
> particularly expensive, because it is a more complex font family (with
> optical sizes, variations, etc) than most common families. It may be worth
> pre-initializing this specific family in a background thread that we fire
> off as early as possible during startup, so that it doesn't block the
> initial reflow of sites like Facebook.)
Yeah, this seems really expensive. I'm also concerned about fission - process lifetime will get shorter (so the per-process amortization will get less powerful), and the memory impact of these caches is greater. Have we given any thought to sharing the font family cache across processes? That would, incidentally, reduce the overhead we see during the first reflow, especially if we taught the parent process to preload certain font families in the background while idle.
> Doing word-cache priming should also pull the potentially-expensive font
> initialization out of EnsureTextRun, though. Matching the text to available
> fonts would end up getting re-done when we actually construct the textrun,
> but it should be faster as the font families involved will have already been
> initialized.
>
> EnsureTextRun also scans the frames to determine what content goes into the
> textrun, because we may create a single "unified" textrun that allows
> shaping to cross the boundaries of inline elements that share the same font
> style. This is work that would end up being wasted if we destroy the frame
> without using its textrun, as we'd have to re-scan the new frame tree.
Comment 12 • jfkthame (Assignee) • 6 years ago
> Yeah, this seems really expensive. I'm also concerned about fission - process lifetime will get shorter (so the per-process amortization will get less powerful), and the memory impact of these caches is greater. Have we given any thought to sharing the font family cache across processes? That would, incidentally, reduce the overhead we see during the first reflow, especially if we taught the parent process to preload certain font families in the background while idle.
Yes, this question has come up from time to time. In principle it seems like there could be some significant benefits, but we'll have to be careful not to introduce excessive IPC overhead to things that are very inner-loopish for text processing. In particular, we access the font list - potentially querying a number of the available font families - for every character of text, to determine what font should be used to render that character; and we query the individual font's word-cache for every word in the text when we're building text runs.
I'd be surprised if we could remote those low-level operations to another process without the overhead being a serious drag on performance (though I'd be happy to be proved wrong!). Another thing I've wondered, though, is whether we could come up with a higher-performance implementation by doing something like maintaining all the font data and caches in a "font server" process, and giving content processes read-only access via shared memory to that data. With the use of lock-free data structures, and knowing that only one process ever modifies the data - the others only have read access - we might be able to minimize the overhead of locking and contention, as well as IPC.
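Very roughly, the single-writer shared-memory shape I have in mind would look something like this (illustrative only; names and layout are invented, and a real design needs versioning, a string table, relocation-safe offsets for faces, and so on):

#include <atomic>
#include <cstdint>

// The "font server" process appends records; content processes map the block
// read-only. Offsets are used instead of pointers so the block is valid at a
// different base address in every process.
struct ShmFamilyRecord {
  uint32_t nameOffset;  // offset into a string table elsewhere in the block
  uint32_t firstFace;   // index of this family's first face record
  uint32_t faceCount;
};

struct ShmFontList {
  std::atomic<uint32_t> familyCount;  // writer publishes with a release store
  ShmFamilyRecord families[1];        // familyCount records actually follow

  // Readers acquire-load the count and only touch records below it, so no
  // locks are needed for lookups.
  const ShmFamilyRecord* FamilyAt(uint32_t aIndex) const {
    uint32_t count = familyCount.load(std::memory_order_acquire);
    return aIndex < count ? &families[aIndex] : nullptr;
  }
};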
Comment 13 • bholley (Reporter) • 6 years ago
(In reply to Jonathan Kew (:jfkthame) [away July 9-12] from comment #12)
> > Yeah, this seems really expensive. I'm also concerned about fission - process lifetime will get shorter (so the per-process amortization will get less powerful), and the memory impact of these caches is greater. Have we given any thought to sharing the font family cache across processes? That would, incidentally, reduce the overhead we see during the first reflow, especially if we taught the parent process to preload certain font families in the background while idle.
>
> Yes, this question has come up from time to time. In principle it seems like
> there could be some significant benefits, but we'll have to be careful not
> to introduce excessive IPC overhead to things that are very inner-loopish
> for text processing.
Assuming these caches aren't tiny (which seems unlikely), this is probably a blocker for fission, and as such needs to be sorted out one way or another before doing any of the performance work we're talking about here. Though ideally, we'd design the architecture so that it's also amenable to the stuff we're talking about here. :-)
> In particular, we access the font list - potentially
> querying a number of the available font families - for every character of
> text, to determine what font should be used to render that character; and we
> query the individual font's word-cache for every word in the text when we're
> building text runs.
>
> I'd be surprised if we could remote those low-level operations to another
> process without the overhead being a serious drag on performance (though I'd
> be happy to be proved wrong!).
Yeah, I think standard IPDL remoting is probably a non-starter here.
> Another thing I've wondered, though, is
> whether we could come up with a higher-performance implementation by doing
> something like maintaining all the font data and caches in a "font server"
> process, and giving content processes read-only access via shared memory to
> that data. With the use of lock-free data structures, and knowing that only
> one process ever modifies the data - the others only have read access - we
> might be able to minimize the overhead of locking and contention, as well as
> IPC.
Yes, a cache in shared memory seems like the way to go here. The devil's in the details though. I'll get a bug on file.
Comment 14 • bholley (Reporter) • 6 years ago
I found bug 648417 for this.
(In reply to Bobby Holley (:bholley) from comment #13)
> Yeah, I think standard IPDL remoting is probably a non-starter here.
...
> Yes, a cache in shared memory seems like the way to go here. The devil's in
> the details though. I'll get a bug on file.
And to be clear, I was thinking more about the shaped word caches here, but per my comment in that bug I think we should investigate whether we can keep those per-process. IPDL remoting for fonts themselves seems potentially more doable (though we'd still want to use shmem rather than serializing them).
A process just for fonts seems kinda heavyweight, though there has been talk of a utility process rather than hoisting this kind of stuff into the main process. I don't think much progress has been made on that.
Comment 15 • jfkthame (Assignee) • 6 years ago
Here's a profile of loading the Wikipedia page from comment 1 in a browser where I've created a separate thread to prime the word cache with the content found at textframe creation time: https://perfht.ml/2LanodV. Filtering for ShapeText here, we can see that most of the shaping is now happening on the FontShaping thread, and only a small residue needs to be done by the main thread during reflow. So that seems promising -- that was the goal of the word-cache priming.
The earliest I think we can reasonably start to do "background" shaping is when the frame constructor encounters a text frame and knows that a certain piece of text content is (probably) going to be rendered with a certain computed style. So that's what is happening here: at the end of nsTextFrame::Init, we get the fontGroup corresponding to the computed style, and hand this and the text off to the shaping thread. (Several of the font-related objects needed to have locking added to make this possible, obviously. And my prototype is not strictly complete, though it seems to work most of the time!) The shaping thread then populates the word caches of the relevant font(s), so that when reflow needs textruns, they can be built more quickly.
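In outline, the hook looks something like this (simplified, with hypothetical names for the new pieces; not the actual patch):

// At text-frame creation we already know the content and the computed style,
// so we can start shaping its words before reflow asks for a textrun.
// FontShapingThread and PrimeWordCachesFor are hypothetical; signatures are
// simplified relative to the real Gecko code.
void nsTextFrame::Init(nsIContent* aContent, nsContainerFrame* aParent,
                       nsIFrame* aPrevInFlow) {
  nsFrame::Init(aContent, aParent, aPrevInFlow);  // the normal init work

  // Main-thread work: resolve the font group for this frame's style.
  // (As noted below, this step is itself not free.)
  RefPtr<nsFontMetrics> fm = nsLayoutUtils::GetFontMetricsForComputedStyle(
      Style(), PresContext());

  // Hand the font group plus this frame's text content to the background
  // shaping thread, which shapes each word and fills the fonts' shaped-word
  // caches so reflow finds them already populated.
  FontShapingThread::Get()->PrimeWordCachesFor(fm->GetThebesFontGroup(),
                                               aContent);
}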
Not all is so rosy, however. Note that the activity on the shaping thread starts about 60ms before the reflow we're trying to optimize, but it isn't all completed before reflow begins; it takes around 120ms to finish. So during the first 60ms of reflow, while the shaping thread is still working on all the text we've fired at it, the main thread also wants to start actual textrun construction. This means that both threads want access to the same font objects: the shaping thread is still busy populating their word caches, while the main thread now wants to query those caches and use the words from them. As we don't have lock-free hashtables, this means we get significant contention; filtering the profile for RWLock shows this pretty clearly during the first 50-60ms of the reflow.
The other thing that needs to be taken into account is that while priming the word caches can reduce the time spent creating textruns during the initial reflow, it is costing us significant time during frame construction even with this background-thread implementation. In particular, my current prototype calls nsLayoutUtils::GetFontMetricsForComputedStyle from the main thread during textframe initialization, in order to get the fontgroup to pass to the shaping thread. But it turns out this can be quite expensive, as it involves looking up the font(s) found in the style and instantiating the relevant platform font references. So a part (sometimes quite a big part) of the "win" we get during reflow is lost earlier during frame construction. To address that, we'd need to push more of the font management over to the shaping thread -- which we can probably do, but it'll be quite a bit of work, and it may increase the risk that the shaping thread hasn't finished its work before reflow wants to use the caches, resulting in more contention.
Incidentally, another brief experiment I tried was to create multiple font-shaping threads, with the idea that perhaps this would help us to get shaping completed before reflow needs the results. In this Wikipedia example, at least, where there is a lot of text all using the same font, this did not look good at all; the shaping threads end up contending with each other for access to the same font objects. It's possible, though, that a strategy with one shaping thread per font (but never multiple threads trying to work on the same font) would help, particularly with content where there are several heavily-used fonts.
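One way to get that per-font affinity would be to route work by hashing the font, e.g. (WorkQueue here is a stand-in for a real single-threaded task queue; not actual Gecko code):

#include <cstddef>
#include <functional>
#include <vector>

struct Font;  // stand-in for gfxFont

struct WorkQueue {
  void Post(std::function<void()> aJob);  // runs jobs on this queue's own thread
};

class ShapingPool {
 public:
  explicit ShapingPool(size_t aThreads) : mQueues(aThreads) {}

  // All work for a given font lands on the same queue, so different fonts can
  // shape in parallel but two threads never touch the same font's caches.
  void Dispatch(Font* aFont, std::function<void()> aShapeJob) {
    size_t which = std::hash<Font*>{}(aFont) % mQueues.size();
    mQueues[which].Post(std::move(aShapeJob));
  }

 private:
  std::vector<WorkQueue> mQueues;
};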
Comment 16 • bholley (Reporter) • 6 years ago
Nice prototype! A few thoughts:
* The hashtable locking issue is quite solvable, using a two-tiered mechanism whereby the locks are per-subtable. See the analysis and code at [1], which worked very well to eliminate locking contention for parallel CSS parsing; a minimal sketch of the pattern is at the end of this comment.
* It looks like only 70% of the non-idle time in the worker is in GetShapedWord, with the rest being in other stuff under InitTextRun. Is that overhead that can be eliminated?
* Is the GetFontMetricsForComputedStyle stuff actually new work, or does it just move work from reflow to frame construction?
* A lot of the work under GetFontMetricsForComputedStyle seems to be in OS libraries. How much of this is pulling data that we can cache for later? If we move the OS font stuff to another process (for fission), is this overhead something we could amortize?
* Is there a fundamental reason why we can't shape the same font on multiple threads?
[1] https://searchfox.org/mozilla-central/rev/88199de427d3c5762b7f3c2a4860c10734abd867/xpcom/ds/nsAtomTable.cpp#283
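For concreteness, the per-subtable idea looks roughly like this (illustrative only; the nsAtomTable code linked above is the real pattern being referenced):

#include <array>
#include <cstddef>
#include <functional>
#include <mutex>
#include <unordered_map>

// Split the cache into N independently-locked shards chosen by key hash;
// threads only contend when they happen to hit the same shard.
template <typename Key, typename Value, size_t N = 16>
class ShardedCache {
 public:
  template <typename MakeValue>
  Value GetOrInsert(const Key& aKey, MakeValue aMake) {
    Shard& shard = mShards[std::hash<Key>{}(aKey) % N];
    std::lock_guard<std::mutex> lock(shard.mutex);  // locks one shard only
    auto it = shard.map.find(aKey);
    if (it == shard.map.end()) {
      it = shard.map.emplace(aKey, aMake()).first;  // e.g. shape the word here
    }
    return it->second;
  }

 private:
  struct Shard {
    std::mutex mutex;
    std::unordered_map<Key, Value> map;
  };
  std::array<Shard, N> mShards;
};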
Updated 2 years ago
Severity: normal → S3