Bug 834609 - need an explicit way to test text rendering performance

Component: Core :: Layout: Text and Fonts (defect)
Reporter: John Daggett (:jtd), Unassigned
Depends on: 1 open bug
Attachments: 2 files

There are lots of caches in the text subsystem of Gecko: font caches, word caches, table caches, and so on.  We often decide yes or no on a given patch based on whether it affects (tp) numbers.  But because the (tp) methodology does multiple loads of the same page and then averages the results, all of this caching means we end up with results that *don't* necessarily correlate with user-experienced page load times.

I did an experiment to see how word caching affects (tp) numbers: I
added an explicit delay whenever a word isn't found in the word
cache, using nanosleep within gfxFont::GetShapedWord, and tested on OS X.
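
The delay itself was just a nanosleep on the cache-miss path. A minimal sketch (not the exact try patch; the surrounding GetShapedWord logic is omitted):

  // In gfxFont::GetShapedWord (sketch): on a word-cache miss, stall briefly
  // before shaping to simulate slower shaping code.  Requires <time.h>.
  timespec delay = { 0, 1000 };  // tv_sec, tv_nsec: 1000 ns = 1us per miss
  nanosleep(&delay, nullptr);    // varied from 1us to 100us across try pushes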

Ex:
https://tbpl.mozilla.org/?tree=Try&rev=459b4e95269a

Changeset:
https://hg.mozilla.org/try/rev/459b4e95269a

OSX 10.8 opt (tp):

  1us  182.53
  2us  181.11
  5us  181.21
 10us  181.19
 20us  180.67
 50us  180.10
100us  180.97

Eeek! No change in (tp) values even when adding lots of explicit delays
to low-level text rendering code!!

I then tested page load time of a build log file on OSX 10.8, MacBook
Pro Retina. Used the builds from the 1us, 10us, and 100us try
server builds above. For a given build, start up and load a log
file, then quit and run again. Run 5 times, using the best 3
results.

Page load times (ms):

  1us  [ 2643  2794  2675]
 10us  [ 4948  3528  4443]
100us  [10874 13335 13319]

Conclusion: changes that affect the performance of harfbuzz
shaping code are not reflected in tp test results but *will*
affect actual user-experienced performance.

I think we need a set of tests that we can use to benchmark the text system, with an eye towards catching regressions that affect user-perceived page load times.

Test load time of a build log file.  Timed from the start of loading until the end of layout by including a call to body.clientHeight, so that we can make a fair comparison with WebKit browsers, which fire onload before layout is complete.

(In reply to John Daggett (:jtd) from comment #0)
> I then tested page load time of a build log file on OSX 10.8, MacBook
> Pro Retina. Used the builds from the 1us, 10us, and 100us try
> server builds above. For a given build, start up and load a log
> file, then quit and run again. Run 5 times, using the best 3
> results.

Note: the attached testfile was served via a locally running Apache instance (i.e. not via a file load).

I did an analysis of the components of a (tp) test run and pulled the page load results into a Google Docs spreadsheet:

https://docs.google.com/spreadsheet/ccc?key=0ArCKGq7OfNmMdEFNSk5SYmZMbjItRmVfNC0wTXFWT1E

Note the third column, the ratio of the first run's time to the average of all runs after that.  On average, the first load of a page takes 2.8 times longer than subsequent loads.

I think any text performance metric needs to include a way of
measuring the raw number of times text is shaped as part of a given
set of operations, since the easiest way to optimize text shaping
performance is to avoid doing it where possible.

Word cache hits and misses are included in telemetry data, so I did a
quick analysis of these [1]:

length	cumulative total	miss ratio
1	26.45%	0.39%
2	36.88%	1.66%
3	49.04%	2.12%
4	61.19%	2.92%
5	70.97%	3.79%
6	78.29%	4.40%
7	85.73%	3.95%
8	89.83%	6.58%
9	94.60%	5.93%
11	97.14%	5.82%
13	98.01%	7.56%
15	98.61%	7.00%
18	98.96%	7.83%
21	99.21%	10.34%
25	99.43%	10.18%
30	99.55%	14.37%
35	99.63%	15.80%
41	99.69%	16.51%
48	99.76%	17.77%
57	99.80%	18.40%
67	99.84%	19.19%
79	99.87%	18.96%
93	99.90%	17.38%
110	99.93%	17.85%
130	99.95%	19.34%
154	99.96%	20.35%
182	99.97%	22.56%
216	99.98%	25.96%
256	100.00%	28.33%

The above numbers mean that strings of length less than 21 characters
represent 99% of all word cache lookups.  Strings of length 67 or
greater but less than 79 characters are found in the cache roughly 81%
of the time.  We need to be careful about generalizing telemetry
stats, accumulated from a user set that is predominantly
English-based, but this seems to indicate that, for whatever reason,
we are repeatedly rendering the same strings multiple times (i.e. the
probability that a string of length 67 or greater occurs within
content twice is fairly low).

I think this data raises several distinct questions that need more
data to determine:

1) Are we redrawing text runs repeatedly and if so, why?

2) What's the breakdown of UI-related vs. content related word cache
lookups?  Would it be better to cache UI-related strings differently,
so that we don't even hit the word cache for these cases?  e.g. cache
the shaped string data with the UI element for elements that seem to
show up repeatedly.

3) Why are there so many single character strings (they represent a
quarter of all word cache lookups)?

4) If repeated redraws are eliminated, would it make more sense *not*
to cache shaped text runs of a given length?

I think for these it's important to come up with ways of measuring
text performance so that we can answer these sorts of questions based
on data rather than on intuition.

[1] https://docs.google.com/spreadsheet/ccc?key=0ArCKGq7OfNmMdDZrWWFCYVYzRDZMQXdxaVZ0amFoS0E
(In reply to John Daggett (:jtd) from comment #4)

> Word cache hits and misses are included in telemetry data, so I did a
> quick analysis of these [1]:
...
> The above numbers mean that strings of length less than 21 characters
> represent 99% of all word cache lookups.  Strings of length 67 or
> greater but less than 79 characters are found in the cache roughly 81%
> of the time.

This figure surprised me, as I saw -much- lower hit rates for long "words" when I was looking at bug 825875. 

Unfortunately, I don't think I've kept a copy of the telemetry numbers that I looked at there. (Maybe I did something wrong in the metrics.m.o UI and so didn't end up with the right figures? Though they looked plausible at the time.)

>  We need to
> be careful about generalizing telemetry stats, accumulated from a user
> set that is predominantly English-based, but this seems to
> indicate that, for whatever reason, we are repeatedly rendering the
> same strings multiple times

This may well be true.

> (i.e. the probability that a string of
> length 67 or greater occurs within content twice is fairly low).

Except in cases such as tinderbox logs, where (for example) test pathnames typically show up several times each, and source pathnames for common warnings may appear many times. As such, I think loading a TBPL log is a highly atypical case.

> I think this data raises several distinct questions that need more
> data to determine:

> 4) If repeated redraws are eliminated, would it make more sense *not*
> to cache shaped text runs of a given length?

That's what bug 825875 implemented, though we may want to adjust the cut-off length if we can get more useful measurements to use as a basis.
(In reply to John Daggett (:jtd) from comment #4)
> this seems to
> indicate that, for whatever reason, we are repeatedly rendering the
> same strings multiple times (i.e. the probability that a string of
> length 67 or greater occurs within content twice is fairly low).

Which part of your data indicates that?

If a page gets reloaded many times, or a user browses many pages from the same site, I think it's entirely plausible that even very long strings could show up many times. Unless you have data showing otherwise.

> I think this data raises several distinct questions that need more
> data to determine:
> 
> 1) Are we redrawing text runs repeatedly and if so, why?

Do you mean "reshaping" here? It's expected that we'd redraw text runs repeatedly, that's why we store textruns :-). And none of your data relates to drawing.

> 3) Why are there so many single character strings (they represent a
> quarter of all word cache lookups)?

Could that be shaping of the space character? That shows up a lot :-).
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #6)
> (In reply to John Daggett (:jtd) from comment #4)
> > this seems to
> > indicate that, for whatever reason, we are repeatedly rendering the
> > same strings multiple times (i.e. the probability that a string of
> > length 67 or greater occurs within content twice is fairly low).
> 
> Which part of your data indicates that?
> 
> If a page gets reloaded many times, or a user browses many pages from the
> same site, I think it's entirely plausible that even very long strings could
> show up many times. Unless you have data showing otherwise.

These aren't just "very long strings"; these are very long strings not
containing spaces.  What are those strings and why are we shaping them
repeatedly?  I think we clearly need more data to determine what's
actually happening.  While your scenario is a possible explanation,
I'm very skeptical that it's actually the underlying cause of this.

> > I think this data raises several distinct questions that need more
> > data to determine:
> > 
> > 1) Are we redrawing text runs repeatedly and if so, why?
> 
> Do you mean "reshaping" here? It's expected that we'd redraw text runs
> repeatedly, that's why we store textruns :-). And none of your data relates
> to drawing.

Yes, reshaping: why are we repeatedly shaping the same strings?

> > 3) Why are there so many single character strings (they represent a
> > quarter of all word cache lookups)?
> 
> Could that be shaping of the space character? That shows up a lot :-).

Entirely plausible but I think we should have data to back that up.
(In reply to John Daggett (:jtd) from comment #7)
> (In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #6)
> > (In reply to John Daggett (:jtd) from comment #4)

> > > 3) Why are there so many single character strings (they represent a
> > > quarter of all word cache lookups)?
> > 
> > Could that be shaping of the space character? That shows up a lot :-).
> 
> Entirely plausible but I think we should have data to back that up.

It seems unlikely those are spaces, I'd guess. The textrun construction code normally handles space characters using a short-circuit code path (see SetSpaceGlyphIfSimple) that bypasses the shaping process and word cache altogether - it just stuffs the glyph ID and advance straight into the run. It'll only fall back to looking up a space "word" in the cache if the font's space glyph doesn't fit the criteria for a "simple" CompressedGlyph record - which would generally mean very large font sizes.

As for the source of the long "words" that are getting shaped multiple times: from a little browsing with a build that had extra instrumentation, it looks like these are predominantly URLs. (Which is not surprising.)

In particular, we re-create a textrun for the current text in the URL bar whenever the browser window is activated or deactivated (even though the text doesn't change), and when the URL bar gains or loses focus.

Much worse, though, is that when I hover over a link on the page, so that the destination URL appears in the status overlay in the bottom corner, we create a textrun for that URL text about 8 times in succession - and then when I un-hover the link, so that the status overlay disappears, we create the textrun a further 6 times. That seems rather excessive... is it because we're doing a fade-in/out transition (or animation) for the overlay, but completely recreating the textrun at each stage of the fade?

(Hmm, a similar thing happens with the names of downloaded items in the Downloads panel, for example: as the panel is animated into view, the textruns are re-created about 3 or 4 times. These aren't usually as long as many URLs, but they again illustrate that we're apparently recreating textruns during animation even though the content isn't changing.)

Given the fact that we do (currently) often recreate textruns that may contain long URLs, maybe we should bump the max length for which we cache "words" to something like 256, so that more of these will be cached. Though much better, of course, would be to avoid rebuilding textruns so often in the first place.
(In reply to Jonathan Kew (:jfkthame) from comment #9)
> Much worse, though, is that when I hover over a link on the page, so that
> the destination URL appears in the status overlay in the bottom corner, we
> create a textrun for that URL text about 8 times in succession - and then
> when I un-hover the link, so that the status overlay disappears, we create
> the textrun a further 6 times. That seems rather excessive... is it because
> we're doing a fade-in/out transition (or animation) for the overlay, but
> completely recreating the textrun at each stage of the fade?

It doesn't seem to be (primarily, at least) due to the animation, but due to the fact that nsTextBoxFrame doesn't store a textrun, but calls nsRenderingContext methods to measure and draw text, passing a string each time. This causes a new temporary textrun to be created for each measurement operation the textbox needs to do.

So within nsTextBoxFrame::DoLayout(), we first call CalcDrawRect() and then GetInkBoundsForVisualOverflow(), each of which separately constructs the "same" textrun to do its work. Then higher-level layout wants the element's min-width, which calls nsTextBoxFrame::GetMinSize, causing the textrun to be created again. Then we repeat the DoLayout dance (having determined the min-size, our parent repeats its layout process, I guess); more textrun reconstruction. And when it's finally time to draw, we call nsTextBoxFrame::PaintTitle, which constructs the textrun afresh yet again.

Maybe we don't care enough about XUL performance to do anything about this, but it surely accounts for a large number of long-word cache hits. (Or a lot of repeated shaping, given the word-cache length cutoff.) It seems like we could greatly reduce the amount of work we're doing here if we could cache the textrun in the nsTextBoxFrame, and pass it down to the measurement and drawing routines as needed.
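
A rough sketch of that suggestion (the member and helper names here are hypothetical, and the real textrun-construction interfaces take more parameters than shown):

  // Sketch only: cache the shaped run on the frame and reuse it for
  // measurement and painting instead of rebuilding it on every call.
  gfxTextRun* nsTextBoxFrame::GetTitleTextRun()   // hypothetical helper
  {
    if (!mCachedTextRun) {                        // hypothetical member
      mCachedTextRun = BuildTextRunForTitle();    // hypothetical builder
    }
    return mCachedTextRun;
  }
  // DoLayout(), GetMinSize() and PaintTitle() would all call this, and the
  // cached run would be dropped whenever the title text or style changes.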

This wouldn't affect web -content-, though; it would only be relevant to text elements in the browser chrome. I haven't seen large numbers of long "words" in content getting re-shaped, except in unusual cases like viewing TBPL logs that are full of long (and repeated) pathnames.

Good catch. The question is then whether it's worth fixing in nsTextBoxFrame.
Sounds like it also might make sense to distinguish UI/content within the Telemetry metrics.

Another source of ridiculously-long strings being shaped into textruns appears to be the SocialAPI/Facebook Messenger code. For some reason, it is creating textruns that contain large chunks - tens or hundreds of Kchars - of (often minified) JS source code.

I've filed bug 837729 about this, as I think it's completely unnecessary and it must be hurting performance for anyone with that stuff enabled.

And filed bug 837765 about the XUL textbox, though I can imagine we might end up deciding it's not worth the complication, as I suspect XUL text is rarely perf-critical in the way HTML text would be.

Regarding text performance, there are several things that may be worth measuring (perhaps separately, so we can distinguish them): shaping, bidi processing, text-frame construction, glyph rendering, managing the text-run to content association (gfxSkipChars), ...

I'd certainly agree that the current Tp is not a good test of shaping performance, because of the way it reloads the test pages - this means that in practice, the runs that get counted probably do virtually no shaping, they just use cached glyph data. OTOH, this means that they do represent a test of the cache, which is also a worthwhile thing to be testing - and this is why a change that makes our caching less effective does regress Tp.

One simple way to have a Tp-like test that *does* test shaping performance would be to run a similar page-loading test, but ensure that before each page-load, the test harness triggers a memory-pressure notification. (I assume that would be easy to do from chrome-privileged JS.) This would cause the shaped-word caches to be flushed, so that all the text would have to be re-shaped during each page load. I guess it would also flush various other caches (images, maybe jitted JS, etc?), but it seems like all these are issues where testing the non-cached performance is also meaningful.
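
For reference, firing that notification is a one-liner against the observer service; a sketch in C++ (the test harness would more likely do the equivalent from chrome-privileged JS, as suggested above):

  // Sketch: ask Gecko to act as if memory is low, which flushes the
  // shaped-word caches (among others) before the next page load.
  #include "mozilla/Services.h"
  #include "nsIObserverService.h"

  nsCOMPtr<nsIObserverService> obs = mozilla::services::GetObserverService();
  if (obs) {
    obs->NotifyObservers(nullptr, "memory-pressure", u"heap-minimize");
  }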

(OS-level caches would of course not be flushed by this approach, but I think that's OK; it would still give us a valuable test of Gecko's performance, without the artificial skewing that comes from discarding the initial load of each page.)

We could also look into using a page set that is designed to be more text-heavy, such as a series of lengthy Wikipedia articles, online literature, etc., so that "text performance" (in its various parts) is not swamped by as much other layout work, images, etc. But I suspect simply firing memory-pressure to flush the word caches before each page-load in Tp would give us something useful to start with.

Another option would be to build separate test tools/harnesses that focus on exercising a more limited area of the code, such as -just- shaping or -just- bidi resolution. But that seems like it'd be quite a bit more work, and could lead us to focus too narrowly on specific things that may not be the most important in the overall picture of how real-world browser activity makes use of the text code.
(In reply to Jonathan Kew (:jfkthame) from comment #15)

> I'd certainly agree that the current Tp is not a good test of
> shaping performance, because of the way it reloads the test pages -
> this means that in practice, the runs that get counted probably do
> virtually no shaping, they just use cached glyph data. OTOH, this
> means that they do represent a test of the cache, which is also a
> worthwhile thing to be testing - and this is why a change that makes
> our caching less effective does regress Tp.

I think the key problem here is that I don't see Tp correlating with
actual page load times given the methodology used.  And when
regressions do occur we need to have more detailed metrics to
understand where problems are occurring.  We don't want to regress
page load performance for the end user, and that matters more than
regressing performance for reloading a page many times consecutively.
Right now, though, we're making decisions based on not regressing the latter.

> One simple way to have a Tp-like test that *does* test shaping
> performance would be to run a similar page-loading test, but ensure
> that before each page-load, the test harness triggers a
> memory-pressure notification. (I assume that would be easy to do
> from chrome-privileged JS.) This would cause the shaped-word caches
> to be flushed, so that all the text would have to be re-shaped
> during each page load. I guess it would also flush various other
> caches (images, maybe jitted JS, etc?), but it seems like all these
> are issues where testing the non-cached performance is also
> meaningful.

I think flushing caches to some degree definitely makes sense.  I'm not
sure the low-memory flush is the right one, though; it seems like you'll
end up not mimicking end-user page load performance that way.  But the
key problem here is the methodology of the test: repeating loads of the
same page gives a distorted picture of page-load performance, one very
sensitive to cache hits/misses where actual page load times would
probably be less so.

I think it would be helpful to have better aggregate text metrics when
doing this sort of testing, so that we can correlate perf regressions
with the characteristics of the pages, for example pages with lots of
bidi on them or pages that use fonts with lots of features or whatever.

One more quick experiment: I loaded a bunch of pages from the
Guardian, as a user might do when clicking through the links on a given
site.  I looked at the word cache hit/miss ratios for the aggregate of
these pages and for single pages with an explicit word cache flush
before loading each page.  Word caching is done separately for
chrome/content so I also separated the hit/miss numbers accordingly.
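
Instrumentation for this kind of measurement can be as simple as a pair of counters at the word-cache lookup site; a sketch with made-up names (the real lookup code and the chrome/content distinction are more involved):

  // Sketch: count word-cache probes, split by chrome vs. content.
  static uint32_t sHits[2], sMisses[2];      // [0] = content, [1] = chrome

  // ...at the cache lookup, once 'found' and 'isChrome' are known:
  (found ? sHits : sMisses)[isChrome ? 1 : 0]++;

  // ...dumped on demand (e.g. when the cache is flushed):
  printf("content-word-cache hits: %u misses: %u hit-ratio: %f\n",
         sHits[0], sMisses[0],
         100.0 * sHits[0] / (sHits[0] + sMisses[0]));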

Aggregate run - load all five pages in sequence, then dump word cache stats

http://www.guardian.co.uk/business/2013/feb/28/european-union-cap-bankers-bonuses
http://www.guardian.co.uk/world/2013/feb/27/greece-blames-drug-companies-shortages
http://www.guardian.co.uk/commentisfree/2013/feb/25/cyprus-president-economic-freefall
http://www.guardian.co.uk/commentisfree/2013/feb/25/emigrate-australia-be-warned
http://www.guardian.co.uk/global-development/2012/nov/06/british-brain-drain-skills-shortages

content-word-cache hits: 62587 misses: 14077 hit-ratio: 81.638054
chrome-word-cache hits: 7426 misses: 321 hit-ratio: 95.856461

Individual pages - load each page, dump word cache stats and clear
                   cache after each load

http://www.guardian.co.uk/business/2013/feb/28/european-union-cap-bankers-bonuses
content-word-cache hits: 2215 misses: 1360 hit-ratio: 61.958046
chrome-word-cache hits: 1948 misses: 226 hit-ratio: 89.604416

http://www.guardian.co.uk/world/2013/feb/27/greece-blames-drug-companies-shortages
content-word-cache hits: 18709 misses: 6161 hit-ratio: 75.227180
chrome-word-cache hits: 1402 misses: 144 hit-ratio: 90.685638

http://www.guardian.co.uk/commentisfree/2013/feb/25/cyprus-president-economic-freefall
content-word-cache hits: 5976 misses: 2906 hit-ratio: 67.282143
chrome-word-cache hits: 1432 misses: 157 hit-ratio: 90.119568

http://www.guardian.co.uk/commentisfree/2013/feb/25/emigrate-australia-be-warned
content-word-cache hits: 24876 misses: 7827 hit-ratio: 76.066414
chrome-word-cache hits: 1291 misses: 156 hit-ratio: 89.219078

http://www.guardian.co.uk/global-development/2012/nov/06/british-brain-drain-skills-shortages
content-word-cache hits: 2332 misses: 1309 hit-ratio: 64.048340
chrome-word-cache hits: 1510 misses: 149 hit-ratio: 91.018692

Total across individual pages:
content-word-cache hits: 54108 misses: 19563 hit-ratio: 73.45

The interesting numbers to compare are the two content hit-ratio
numbers. With inter-page word caching, the hit-ratio is 82% but with
only per-page word caching the hit-ratio is 73%.  So having cached
words around across page loads only improves the hit-ratio 
by 9 percentage points.

I suspect, but need more data to prove, a number of things here.  One
is that we're making and reconstructing chrome text runs *way* too
often (hence the 90% hit ratio for chrome).  I also think the reason
the word cache has such a high hit-ratio for individual pages is *not*
that the same words show up in many places on a given page but that
multiple reflows performed while loading a page cause text runs to be
reconstructed multiple times.  Whether this reconstruction needs to
occur or not is the question (i.e. if we keep text runs around a bit
longer at page load time there would be less need to be hitting the
word cache so much).  Again, just thoughts that need to be backed up
with hard data.

Very interesting results. Please follow up on why chrome code is looking to reshape so often. What we find out there may well speed up the rest of the UI.

On the content side, I don't think the 9% cache hit increase is negligible across pages. 

I'm not sure I understand what it means to "keep text runs around longer" when a reflow that invalidates the runs can occur for reasons beyond our control. In those cases, the word cache seems to be doing the job.
(In reply to Jet Villegas (:jet) from comment #18)
> On the content side, I don't think the 9% cache hit increase is
> negligible across pages. 

The point was that most of the word cache hits happen within the same
page rather than across pages.  Which means the layout code is
reshaping text with the same words several times.

> I'm not sure I understand what it means to "keep text runs around
> longer" when a reflow that invalidates the runs can occur for
> reasons beyond out control. In those cases, the word cache seems to
> be doing the job.

The question is whether invalidation means that text runs need to be
tossed or not, I guess.  The more the code can retain and reuse text
runs the better, obviously. How do we come up with metrics to measure
that? For example, how many times is a text run recreated for the same
string/font on a given page.
(In reply to John Daggett (:jtd) from comment #17)
> I suspect, but need more data to prove, a number of things here.  One
> is that we're making and reconstructing chrome text runs *way* too
> often (hence the 90% hit ratio for chrome).  I also think the reason
> the word cache has such a high hit-ratio for individual pages is *not*
> that the same words show up in many places on a given page but that
> multiple reflows performed while loading a page cause text runs to be
> reconstructed multiple times.

Textruns are supposed to be cached in the textframes for 20-30 seconds, which should cover all the loading of a specific page. I suppose it's possible that incremental loading, or style changes caused by scripts, are triggering frame reconstruction, or that something else is causing textruns to be refreshed.

You could instrument FrameTextRunCache to see if it's flushing textruns during page load.

> Whether this reconstruction needs to
> occur or not is the question (i.e. if we keep text runs around a bit
> longer at page load time there would be less need to be hitting the
> word cache so much).

I doubt it's a timing issue unless your page loads are very slow.
Also you could instrument nsTextFrame::DestroyFrom to see if we're destroying text frames and dropping their textruns during page load.
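
A minimal sketch of that kind of instrumentation (illustrative only; deciding what counts as "during page load" is left out, and requires <cstdio>):

  // Sketch: log when a text frame is destroyed while still holding a textrun.
  void nsTextFrame::DestroyFrom(nsIFrame* aDestructRoot)
  {
    if (mTextRun) {
      fprintf(stderr, "nsTextFrame %p destroyed, dropping its textrun\n",
              (void*)this);
    }
    // ...existing destruction logic follows...
  }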

I'm currently working on a couple of approaches here:

1. modify the word cache code so that the upper limit on the length of words in the cache uses a pref rather than a hard-coded limit (currently 32).  This will also allow us to effectively disable the word cache by setting it to 0 (see the sketch after this list).

2. trace the call stacks for text run construction when rendering a set of pages, and try to determine which text runs are cached and which are rebuilt repeatedly.

3. look at how to instrument the FrameTextRunCache as per comment 20.
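
For approach 1, a minimal sketch of the pref-controlled limit (the pref name here is hypothetical, and a real patch would use a cached pref variable rather than a live lookup on every call):

  // Sketch: read the word-length limit from a pref instead of a constant.
  #include "mozilla/Preferences.h"

  static uint32_t WordCacheMaxLength()
  {
    // 0 effectively disables the word cache; 32 matches the current limit
    return mozilla::Preferences::GetUint("gfx.font_rendering.wordcache.maxlength", 32);
  }

  // ...then in the lookup path, words longer than WordCacheMaxLength()
  // would be shaped directly without being cached.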

Using the debugger I dumped calls to gfxFontGroup::MakeTextRun, then massaged the data to come up with unique counts for each stack.  The page set I used had a few news sites, a set of Wikipedia articles in different languages/scripts, and some BBC pages in different languages.  Roughly 25% of text run construction calls were from nsBulletFrame::GetDesiredSize (!?!) for this set of pages.  Not sure that's important yet; I need to correlate this with profiling data.

Depends on: 901845
Depends on: 911884

Textrun creation per call to PresShell::DoReflow

Test: load HTML5 spec, single-page version

REFLOWEND20   textruns: 6 
REFLOWEND38   textruns: 9 
REFLOWEND41   textruns: 9 
REFLOWEND46   textruns: 204 
REFLOWEND48   textruns: 216 
REFLOWEND51   textruns: 79851 
REFLOWEND52   textruns: 7 
REFLOWEND57   textruns: 25750 
REFLOWEND59   textruns: 97359 
REFLOWEND61   textruns: 7 
REFLOWEND63   textruns: 97359 
REFLOWEND64   textruns: 12 
REFLOWEND69   textruns: 1005 

We're basically creating all textruns three times in this case.  Sorting the logged textruns by reflow pass confirms this.

I'm curious how these findings are affected by the fixes for bug 703100. That is, were we more efficient before we replaced gfxTextRunCache with the current word cache?

I'm also assuming that the expensive reflows (59 & 63) are happening for valid reasons. Is that the case?
Depends on: 934710
Depends on: 935867
Assignee: jd.bugzilla → nobody
Severity: normal → S3