[meta] Text shaping word cache affects performance on several SP3 tests (particularly TipTap)
Categories
(Core :: Graphics: Text, enhancement)
Tracking
()
| Performance Impact | low |
People
(Reporter: denispal, Assigned: jlink)
References
(Blocks 2 open bugs)
Details
(Keywords: meta)
Attachments
(1 file)
The word cache splits text into individual words and shapes each one separately, which creates a lot of overhead in the Editor-TipTap test during Speedometer3, even with good cache hit rates. When I run 100 iterations of TipTap on my MacBook Pro, we spend about 2240 ms in text shaping when the word cache is enabled and 1193 ms when I skip it. Skipping the word cache leads to a 10% improvement in Editor-TipTap's performance.
| Reporter | ||
Comment 1•4 months ago
Here is a profile: https://share.firefox.dev/4bqWI42
| Reporter | ||
Comment 2•4 months ago
Since a lot of the overhead is in locking and realloc, maybe we can get away with a smaller fixed size MRU cache instead of the current implementation with mozilla::HashMap.
| Reporter | ||
Comment 3•4 months ago
(In reply to Denis Palmeiro [:denispal] from comment #2)
Since a lot of the overhead is in locking and realloc, maybe we can get away with a smaller fixed size MRU cache instead of the current implementation with mozilla::HashMap.
I prototyped this but it doesn't seem to help much.
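For readers unfamiliar with the idea from comment 2, a small fixed-size MRU cache could look roughly like the sketch below. It is not the prototype mentioned above and not Gecko code: `MruWordCache`, `Glyphs`, and the linear move-to-front strategy are assumptions chosen for illustration. The table itself never grows, although each entry's string and glyph vector still allocate.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for shaped output; real shaping produces glyph runs.
using Glyphs = std::vector<int>;

// Fixed-capacity cache kept in most-recently-used order (index 0 is newest).
// When full, inserting a new word drops the least recently used entry.
template <std::size_t N>
class MruWordCache {
  static_assert(N > 0, "cache needs at least one slot");

  struct Entry {
    std::string word;
    Glyphs glyphs;
  };

  std::array<Entry, N> mEntries;
  std::size_t mCount = 0;  // number of slots currently in use

 public:
  // Returns the cached glyphs (and moves the hit to the front), or nullptr.
  const Glyphs* Lookup(const std::string& word) {
    for (std::size_t i = 0; i < mCount; ++i) {
      if (mEntries[i].word == word) {
        // Bring entry i to index 0, shifting entries 0..i-1 right by one.
        std::rotate(mEntries.begin(), mEntries.begin() + i,
                    mEntries.begin() + i + 1);
        return &mEntries[0].glyphs;
      }
    }
    return nullptr;
  }

  // Inserts a word that is not yet cached (callers are expected to try
  // Lookup first, otherwise duplicates are possible).
  void Insert(std::string word, Glyphs glyphs) {
    if (mCount < N) {
      ++mCount;
    }
    // Move the last used slot to the front; it is either empty or the
    // least recently used entry, and is overwritten below.
    std::rotate(mEntries.begin(), mEntries.begin() + (mCount - 1),
                mEntries.begin() + mCount);
    mEntries[0] = Entry{std::move(word), std::move(glyphs)};
  }
};
```

A linear scan is only reasonable for very small `N`; whether such a cache has an acceptable hit rate depends on the content being shaped.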
| Reporter | ||
Comment 4•4 months ago
For text runs longer than 1024 characters, skip the word cache and shape the entire run directly. This avoids per-word overhead (e.g. locking, reallocs) that becomes inefficient for large text blocks. Improves Editor-TipTap by about 10%.
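The shape of this change can be illustrated with a toy model. This is a sketch under stated assumptions, not the attached patch: `ToyShaper`, `ShapeDirect`, `ShapeRun`, and `kWordCacheMaxRunLength` are hypothetical names, "shaping" here just maps each byte to a number, and words are split on ASCII spaces only.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for shaped output.
using Glyphs = std::vector<int>;

// Threshold from the comment above; the name is made up for this sketch.
constexpr std::size_t kWordCacheMaxRunLength = 1024;

struct ToyShaper {
  std::unordered_map<std::string, Glyphs> wordCache;
  std::size_t shapeCalls = 0;  // how often the "shaper" was invoked

  // Toy shaping: one "glyph" per input byte.
  Glyphs ShapeDirect(const std::string& text) {
    ++shapeCalls;
    return Glyphs(text.begin(), text.end());
  }

  Glyphs ShapeRun(const std::string& run) {
    // Long run: a single shaping call and no cache traffic at all.
    if (run.size() > kWordCacheMaxRunLength) {
      return ShapeDirect(run);
    }
    // Short run: split on spaces and go through the word cache.
    Glyphs out;
    std::size_t start = 0;
    while (start <= run.size()) {
      std::size_t end = run.find(' ', start);
      if (end == std::string::npos) {
        end = run.size();
      }
      const std::string word = run.substr(start, end - start);
      if (!word.empty()) {
        auto it = wordCache.find(word);
        if (it == wordCache.end()) {
          it = wordCache.emplace(word, ShapeDirect(word)).first;
        }
        out.insert(out.end(), it->second.begin(), it->second.end());
      }
      if (end < run.size()) {
        out.push_back(' ');  // keep the separator in the output
      }
      start = end + 1;
    }
    return out;
  }
};
```

In this model a long run costs one shaping call regardless of how many words it contains, while a short run pays one hash lookup per word and one shaping call per cache miss.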
Comment 5•4 months ago
The performance observation here is interesting. It used to be that the word cache gave us a (smallish but significant) perf boost on most text-heavy content, but I wonder if that's still true. In general, harfbuzz shaping performance has been steadily improving thanks to lots of good upstream work by Behdad; it may be that it's reached a point where the cache isn't really gaining us much.
So I guess I'm suggesting that before we do a threshold as per the patch here (which may be fine -- I'm not opposed to the idea), I'd be interested to know what happens (for perf tests in general, not just Editor-TipTap) if we disable the word cache altogether. It should be pretty simple to do that, by hacking gfxFont::SplitAndInitTextRun to just call ShapeTextWithoutWordCache directly. Would you be up for giving that a try?
Comment 6•4 months ago
PERF key word?
| Reporter | ||
Comment 7•4 months ago
(In reply to Jonathan Kew [:jfkthame] from comment #5)
The performance observation here is interesting. It used to be that the word cache gave us a (smallish but significant) perf boost on most text-heavy content, but I wonder if that's still true. In general, harfbuzz shaping performance has been steadily improving thanks to lots of good upstream work by Behdad; it may be that it's reached a point where the cache isn't really gaining us much.
So I guess I'm suggesting that before we do a threshold as per the patch here (which may be fine -- I'm not opposed to the idea), I'd be interested to know what happens (for perf tests in general, not just Editor-TipTap) if we disable the word cache altogether. It should be pretty simple to do that, by hacking gfxFont::SplitAndInitTextRun to just call ShapeTextWithoutWordCache directly. Would you be up for giving that a try?
Justin was interested in trying out different caching implementations so I transferred this bug to him. Justin, is this something you can also try out? Thanks!
| Assignee | ||
Comment 8•4 months ago
Sorry, I was out sick yesterday so I haven't done anything here yet. I plan to start taking a look today and will try out Jonathan's suggestion.
Comment 9•3 months ago
The severity field is not set for this bug.
:lsalzman, could you have a look please?
For more information, please visit BugBot documentation.
| Assignee | ||
Comment 10•3 months ago
I have tried four different ways of disabling the word cache (comparisons here, here, here, and here).
Except on Windows non-ref hardware where anything that causes us to use the word cache less is a win, simply disabling the word cache results in a clear regression. In all of the cases, there are huge regressions in Editor-TipTap. On Windows, those huge Editor-TipTap regressions are sometimes offset by wins in other tests, although those seem less significant (even though the effect is that the overall SP3 score gets pulled upward).
I was surprised to see such a strong regression after Denis' earlier results so I re-implemented his change and tried a few different thresholds here and still see the improvements that he saw. Using a lower threshold than what he picked (1024) might also be better.
There are a few take-aways that I see from this:
- The word cache is useful and valuable but we are definitely over-using it.
- Non-ref Windows hardware seems to have a distinctly different experience with the word cache. Perhaps the shaping itself is less expensive there for some reason, so there is less benefit in caching the results? Maybe some aspect of the word cache executes less efficiently on that hardware, which affects when it should be used?
I'm taking a few action items out of this:
- Study how the tests that benefit from the word cache are shaping words vs the tests that don't and use that to determine the appropriate conditions for when we should try to skip it.
- Look more closely into what's happening on non-ref Windows hardware. I triggered some performance profiles on these machines in CI and have started looking at the profiles. These are the profiles from the slowest of these runs. So far it's surprising because we aren't actually spending much time in the main code related to shaping and caching.
- Try using a simpler data structure (without need for allocations and possibly without need for locking?) like the cache that I previously implemented for JS atoms.
| Assignee | ||
Comment 11•1 month ago
It turns out that the word cache is still very helpful when it comes to performance. Denis' patch to restrict when we use the word cache made things faster because it avoided doing something that sabotages the word cache. See bug 2030147 for more details.
I'm turning this into a meta-bug because there are still a few more avenues to follow up on:
- Pre-allocate the word cache with a larger size to avoid some initial re-allocs. A quick test of this seemed to indicate that this was a significant win on Linux and neutral on other desktop platforms.
- Consider using a different data structure for the cache. If the "working set" is actually quite large, then the current data structure is probably well-suited. If not, an MRU cache or hash-based cache (similar to a CPU cache) might be a better choice.
- Take a look at the locking that is used. In my tests, it seems like we're always doing the shaping from the main thread so maybe the locking isn't really necessary but, then again, maybe it's also not really harming us if there's no contention.
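To make the "hash-based cache (similar to a CPU cache)" option concrete, here is a minimal direct-mapped sketch. It is an illustration only, not existing Gecko code: `DirectMappedWordCache` and `Glyphs` are hypothetical names, and `std::hash` stands in for whatever hash the real cache would use. Each word maps to exactly one slot, so a colliding word simply overwrites the previous occupant and the table itself never rehashes or grows (the per-entry string and vector still allocate).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for shaped output.
using Glyphs = std::vector<int>;

// Direct-mapped cache: hash(word) selects a single slot, like a CPU cache
// line index. No probing and no eviction bookkeeping.
template <std::size_t N>
class DirectMappedWordCache {
  static_assert(N > 0 && (N & (N - 1)) == 0, "N must be a power of two");

  struct Slot {
    bool used = false;
    std::string word;
    Glyphs glyphs;
  };

  std::array<Slot, N> mSlots;

  static std::size_t IndexFor(const std::string& word) {
    return std::hash<std::string>{}(word) & (N - 1);
  }

 public:
  // Returns the cached glyphs, or nullptr on a miss (empty slot or a
  // different word occupying the slot).
  const Glyphs* Lookup(const std::string& word) const {
    const Slot& slot = mSlots[IndexFor(word)];
    return (slot.used && slot.word == word) ? &slot.glyphs : nullptr;
  }

  // Stores the word, replacing whatever was in its slot before.
  void Insert(const std::string& word, Glyphs glyphs) {
    Slot& slot = mSlots[IndexFor(word)];
    slot.used = true;
    slot.word = word;
    slot.glyphs = std::move(glyphs);
  }
};
```

As the bullet above notes, this only pays off if the working set is small; with a large working set, collisions would evict useful entries and a growable hash map is likely the better fit.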
Comment 12•1 month ago
The cache size/strategy is a tricky one, because it is so dependent on the nature of the content we're dealing with. Not to say it can't be improved, but we should be wary of focusing too much on just a few examples.
Regarding locking, for HTML content we always shape on the main thread, but since we implemented text rendering for offscreen canvas, it's also possible for us to do shaping from DOM worker threads. So that's why we had to add locking.
An alternative would be to avoid ever sharing font instances between threads, but that potentially has other downsides (increased memory usage, and losing the possibility of the word cache being usefully shared across threads -- e.g. if several workers are using the same font, they don't all have to shape the same words independently).
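The trade-off described here can be sketched with a toy shared cache. This is only a model with hypothetical names (`SharedWordCache`, `GetOrShape`, `ToyShape`), using `std::mutex` rather than whatever lock type the real font code uses: every lookup takes the lock, but a word shaped by one caller is reused by all others that share the instance.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for shaped output.
using Glyphs = std::vector<int>;

// One cache instance shared by every thread that uses the same font.
class SharedWordCache {
  std::mutex mLock;
  std::unordered_map<std::string, Glyphs> mWords;
  std::size_t mShapeCalls = 0;

  // Toy shaping: one "glyph" per input byte.
  static Glyphs ToyShape(const std::string& word) {
    return Glyphs(word.begin(), word.end());
  }

 public:
  // Returns a copy of the shaped word, shaping it at most once overall.
  Glyphs GetOrShape(const std::string& word) {
    std::lock_guard<std::mutex> guard(mLock);  // taken even when uncontended
    auto it = mWords.find(word);
    if (it == mWords.end()) {
      ++mShapeCalls;
      it = mWords.emplace(word, ToyShape(word)).first;
    }
    return it->second;  // copied while the lock is still held
  }

  std::size_t ShapeCalls() {
    std::lock_guard<std::mutex> guard(mLock);
    return mShapeCalls;
  }
};
```

The per-thread alternative would drop the mutex and give each thread its own map, at the cost of each thread shaping and storing the same words independently.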