Closed Bug 372631 Opened 16 years ago Closed 15 years ago

Tp regressions from bug 370588 landing


(Core :: Layout, defect)

Not set





(Reporter: roc, Assigned: roc)




(3 files, 1 obsolete file)

MacOSX Darwin 8.8.4 bm-xserve08 Dep: Tp up about 12ms (7%).

WINNT 5.1 bl-bldxp01 Dep: Tp up about 70ms (12%).

I'll profile Mac. Not sure about Windows, I thought Stuart had checked and not found that kind of regression.

To some extent we're bound to see some regression from bug 370588 because we're deliberately recording all glyph advances now. Still, this is surprisingly large.
I haven't tried with the latest patch but I can reprofile tomorrow.
I profiled an opt build in Shark, reflowing a text-heavy document.

Reflow is about 49% of the time in firefox-bin.
30% of time is under nsTextFrame::MeasureText.
20% of time is under gfxAtsuiFontGroup::InitTextRun.
15.4% of time is under ATSUGetGlyphBounds. (The rest is font matching etc.)
2.7% of time is in PostLayoutOperationCallback.
1.1% of time is in gfxTextRun::GetAdvanceWidth.

5.1% of time is in gfxTextRun::Draw, entirely under _moz_cairo_show_glyphs.

I guess I need to try profiling a build before the landing as well.
One thing I need to try is that Mac's subpixel glyph positioning means the CompressedGlyph optimization doesn't work very well. We can easily test the impact of this by rounding glyph advances up to the nearest pixel. Probably we should store all the glyph advances in appunits since we need to round to appunits anyway.
We should fix that, but it doesn't seem to help Tp. The good news is that I can reproduce the Tp regression locally.
Hmm. Comparing profiles of Tp before and after 370588:

Time spend in ATSUGetGlyphBounds rose from 4.3% of the profile to 7.3%.
ApplyMorph (some ATS function under ASTULayoutGlyphs) jumped from 0.0% to 0.7%.
GetMetricsForGlyphs (another ATS function) jumped from 0.0% to 0.6%.
PositionDeviceGlyphs rose from 0.0% to 1.5%.
TTextLineLayout::ConstructGlyphRecordArray up from 0.5% to 1.0%.
OTOH ApplyKerning did not rise significantly (0.5% to 0.6%) (still a bit odd since we've disabled kerning).
And SetupGlyphArrays went down from 0.4% to 0.1%.

PostLayoutCallback is new code that accounts for 0.9% of the profile. That seems fair. But why did the other ATSU functions slow down?The only hypothesis that springs to mind immediately is "cache effects imposed by the code in PostLayoutCallback that sets up the textrun". That still spends about half its time allocating memory in gfxTextRun::SetDetailedGlyphs which is a bit disconcerting. I need to check to see why we're doing that, the detailed-glyph path should be mostly untouched in non-intl Tp (now that I'm rounding glyph advances).

These profiles have about 70K samples each so I think these results are statistically significant.
Okay. The calls to SetDetailedGlyphs are happening because my test build has the experimental patch for bug 372629 which stores "missing glyph" glyph info in the detailed glyph records, *and* we're seeing bogus missing glyphs because it turns out ATSUI puts a zero-width 0xFFFF glyph for each second and subsequent character of a ligature. The current code totally misinterprets those glyphs. If I ignore zero-width 0xFFFF glyphs, then most of Tp slowdown disappears.

It's possible, therefore, that fixing bug 372732 (which seems to be something similar on Windows) would help Tp on Windows.
I'll try a patch that converts the glyph advances in gfxTextRun to always be rounded to the nearest appunit. That alone should be enough to fix the Tp regression on Mac.We'll just have to be careful when we update the Mac "missing glyph" code.
Well, I've fixed ATSUI code up so we never call SetDetailedGlyphs in the Tp page set, but performance still isn't back where it should be. My latest theory is that having the 8-bit code path insert bidi override characters is triggering slow paths in ATSUI. I'll test that next.
That was worth doing (saved at least one internal malloc call in ATSUI's RunBidiAlgorithm), but it seems that the *real* issue was actually just that I changed the parameter to ATSUGetGlyphBounds from kATSUseFractionalOrigins to kATSUseDeviceOrigins. Changing that back seems to fix most of the performance regression!

Anyway, I'll attach a complete patch.
This is a *trivial* patch that provides a significant speedup. By my benchmarks the other changes I've made help too, but this is trivial so we can land it first. (Note that the result of the ATSUGetGlyphBounds call is not used, all we want to do here is force the layout to happen.)
Attachment #257642 - Flags: review?(vladimir)
This fixes the rest of the issues I identified that could have regressed performance.
1) Don't use LRO/RLO/PDF characters for LTR 8-bit text, they're not needed and trigger bidi processing in ATSUI.
2) Extract ligatures properly from the ATSUI glyphrecords. In fact I've reworked the glyph extraction code thoroughly. It should be a lot easier to understand now, and more robust too.
3) On all platforms, store glyph advances in appunits and always round them to the nearest appunit. This lets us use the optimized storage *much* more frequently on Mac (from < 50% of the time to 100% of the time during Tp). This eliminates a very large number of mallocs. This may help Pango a bit too. Windows doesn't seem to do subpixel glyph placement.

This also includes the ATSUGetGlyphBounds flag fix from the previous patch.
Attachment #257740 - Flags: review?(pavlov)
Comment on attachment 257740 [details] [diff] [review]
Rework Mac glyph extraction code, store glyph advances in appunits

I want vlad to look at the ATSUI changes. If the glyph extraction algorithm is too complicated then let me know so I can add more comments where required or simplify the code further.
Attachment #257740 - Flags: review?(vladimir)
(In reply to comment #11)
> This fixes the rest of the issues I identified that could have regressed
> performance.

On Mac anyway. I don't have a good way to measure Windows performance.
Comment on attachment 257642 [details] [diff] [review]
change the ATSUGetGlyphBounds parameter only

I wonder if we can force layout even more cheaply, without grabbing the bounds...
Attachment #257642 - Flags: review?(vladimir) → review+
This patch seems to cause problems with ligatures (esp f+l, l+l) when text-align:justify is used.
Sample URL:
OS X 10.4.8, ppc
Minefield, checkout: Wed Mar 14 09:39:47 JST 2007
Attached patch better patchSplinter Review
Fix the issue in previous patch with justification with old-textframe and ligatures. The problem was that we are zeroing out spacing inside ligatures before we convert spacing from absolute (spacing includes character widths) to relative (spacing excludes character widths) by subtracting character widths. This is bad because it means we end up with negative relative spacing inside the ligature ... solution is of course to zero out intra-ligature spacing after we've done the conversion to relative. Also instead of just dropping the spacing I now just move the spacing to after the ligature.

There is a reftest issue with this patch on Mac that I think is actually a problem with the nquartz cairo backend. I'll file a bug for that separately.
Attachment #257740 - Attachment is obsolete: true
Attachment #258618 - Flags: review?(pavlov)
Attachment #257740 - Flags: review?(vladimir)
Attachment #257740 - Flags: review?(pavlov)
Comment on attachment 258618 [details] [diff] [review]
better patch

vlad, I thought you'd reviewed the new ATSUI code here, but I notice that you didn't mark the previous patch as r+.
Attachment #258618 - Flags: review?(vladimir)
Blocks: 178513
Comment on attachment 258618 [details] [diff] [review]
better patch

Can you post a followup patch that makes gfxTextRun::GetAppUnitsPerDevUnit()  return a PRUint32 and in places where you're doing x = tr->GetAppUnitsPerDevUnit()
can you make those be const?  Also make GetAppUnitsPerDEvUnits() be  a const method?

related you have this:
+    double appUnitsPerDevUnit = aTextRun->GetAppUnitsPerDevUnit();
+    double devUnitsPerAppUnit = 1/appUnitsPerDevUnit;

which should probably be:
+    const PRUint32 appUnitsPerDevUnit = aTextRun->GetAppUnitsPerDevUnit();
+    double devUnitsPerAppUnit = 1.0 / appUnitsPerDevUnit;

vlad should take a closer look at the mac changes
Attachment #258618 - Flags: review?(pavlov) → review+
Attachment #258618 - Flags: review+ → superreview+
Comment on attachment 258618 [details] [diff] [review]
better patch

Mac changes look good, and I can actually follow that code now :)
Attachment #258618 - Flags: review?(vladimir) → review+
OK, that patch is checked in. Just need to fix Windows now. I can reproduce on a Windows machine here.
Filed bug 374567 with a patch for the request in comment #19.
Mac Tp seems to have unregressed nicely.
Took a look at some VTune profile data for Windows. Some of it makes no sense, but it appears there are two major issues compared to before the patch, at least on Chris's laptop:

1) MeasureAndDrawReallyFast called gfxWindowsFont::GetHFONT which spends most of its time in MakeHFONT. InitTextRunGDI calls gfxWindowsFont::CairoScaledFont which calls MakeCairoScaledFont which calls MakeHFONT *and* then does cairo_scaled_font_create. The latter is enormously expensive, about 7 times more expensive than MakeHFONT and about 28% of the after-profile. We need to avoid messing with cairo fonts on the fast path.

2) GetCharacterPlacementA is about 30% of the after-profile, and its total time is about double the cost of the near-equivalent GetGlyphIndicesA + GetTextExtentPoint32A in the before-profile. We should try avoiding GetCharacterPlacementA, probably by calling GetGlyphIndicesA and GetTextExtentExPointI.

I don't really trust these numbers, and it seems they're not consistent across different machines, but it gives us something to work on.
Attached patch Windows fixSplinter Review
This seems to fix things on Windows. We call GetGlyphIndices(A/W) and then call GetTextExtentExPointI to get the partial string widths. As far as we can tell it gives us equal or better performance to before the landing of 370588. We don't get ligatures, but I think we do get kerning. Fallback is there.

I also restored a Truetype check that is required now that we're using GetGlyphIndices again.
Attachment #259289 - Flags: review?(pavlov)
hey, roc, when you create a patch which includes the code for another bug, please add a comment to the bug...
Blocks: 372732
Attachment #259289 - Flags: review?(pavlov) → review+
checked in that Windows patch. We'll see how that goes...
Tp on bl-bldxp01 has dropped quite a bit. Need to wait for a while to see just how well we're doing.
We seem to be still slower than before the patch by maybe 20ms on Tp2 and 30-40ms, so that patch undid about half of the regression on bl-bldxp01. Not sure how much more we can squeeze out here.
I'm going to have to mark this fixed.
Closed: 15 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.