Bug 1843951 (Open). Opened 1 year ago; updated 7 months ago.

The indirect call in PLDHashTable::ComputeKeyHash to mOps->hashKey shows up on speedometer3

Categories: Core :: XPCOM (enhancement)

People: Reporter: mstange; Assignee: Unassigned

References: Blocks 2 open bugs

Whiteboard: [sp3]

Attachments: 1 file

Auditing indirect calls on Speedometer 3 (by looking at the callers of LdrpDispatchUserCallTarget on Windows) turned up the call to mOps->hashKey as one of the most common indirect calls.

I think it's worth doing the hashing inside nsTHashtable, in templated code where the hashing function is known at compile time.

nsTHashtable is templated on the EntryType, so it can direct-call (or even inline)
EntryType::HashKey. PLDHashTable, by contrast, has to go through mOps->hashKey,
i.e. call through a function pointer.
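To illustrate the difference, here is a minimal standalone sketch (not the actual Mozilla code; the ops struct, entry type, and hash function below are invented for illustration):

```cpp
#include <cassert>
#include <cstdint>

// PLDHashTable-style dispatch: the hash function lives behind a function
// pointer in an ops table, so every ComputeKeyHash is an indirect call the
// compiler cannot inline (and which goes through CFG dispatch on Windows).
using HashKeyFn = uint32_t (*)(const void* aKey);

struct HashTableOps {
  HashKeyFn hashKey;  // indirect call target
};

static uint32_t HashString(const void* aKey) {
  uint32_t h = 0;
  for (const char* p = static_cast<const char*>(aKey); *p; ++p) {
    h = h * 31 + static_cast<uint8_t>(*p);
  }
  return h;
}

static uint32_t ComputeKeyHashIndirect(const HashTableOps* aOps,
                                       const void* aKey) {
  return aOps->hashKey(aKey);  // indirect call through mOps-style table
}

// nsTHashtable-style dispatch: the entry type is a template parameter, so
// the call to EntryType::HashKey is a direct call the compiler can inline.
struct StringEntry {
  using KeyType = const char*;
  static uint32_t HashKey(KeyType aKey) { return HashString(aKey); }
};

template <typename EntryType>
uint32_t ComputeKeyHashDirect(typename EntryType::KeyType aKey) {
  return EntryType::HashKey(aKey);  // direct, inlinable call
}
```

Both paths compute the same hash; the difference is purely in how the call is dispatched.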

Thanks for working on this.

What's the impact on the binary size from this change? IIRC the reason nsTHashtable is implemented this way is to avoid binary bloat.

Are there measurable performance improvements from this change?

If the binary bloat is really bad, only using inlining for specific hash tables (e.g. RestyleManager::mSnapshots, gfxDWriteFont::mGlyphWidths) might get us the performance benefit without as much bloat. If I'm reading the profile correctly, inlining just the top 5 most common hash tables would eliminate about 38% of the indirect calls (we'd probably want to do more, but I was feeling lazy, so I only added up the top 5). I'm not sure off the top of my head how to implement that easily; maybe an optional boolean template parameter to specify whether to inline.

Flags: needinfo?(mstange.moz)

(In reply to Andrew McCreight [:mccr8] from comment #2)

What's the impact on the binary size from this change?

It's not clear to me:

https://treeherder.mozilla.org/perfherder/compare?originalProject=try&newProject=try&newRevision=8ebd1045baf3447513696d233dda1e489060ed78&framework=2&originalRevision=c7b6669f2fdde9337f235e262d3f35eb1bfef865&page=1&filter=XUL+section+nightly

This shows a 300KB increase on Windows, and a 100KB decrease on Linux. I'm not sure how high the PGO-induced variability in these metrics is.

Are there measurable performance improvements from this change?

I'm going to collect a profile later today with the PGO builds from try, and count the number of samples in hash-table related functions. The impact on speedometer3 score overall appears to be too low to be detectable; based on the numbers in the profile I would expect a 0.1% improvement. (LdrpDispatchUserCallTarget is 2% of overall speedometer3, and this particular call is 5% of LdrpDispatchUserCallTarget, and 2% * 5% = 0.1%.)

In the perfherder speedometer3 subtest comparison, none of the subtests have a change that's detected with "high" confidence. (Except for one on Linux but it's garbage - this patch is definitely not causing a 6% improvement.)

(Per-platform subtest comparisons: Windows, Linux, macOS.)

Overall I'd say the case for this patch is not very strong. A 0.1% improvement is something we'd want to take if it's cheap to make and doesn't cause adverse effects in other ways; I'm not sure if this one qualifies.

Flags: needinfo?(mstange.moz)

So the profiles confirm that the indirect call is gone, at least.
Before: https://share.firefox.dev/3NXPkQD
After: https://share.firefox.dev/3Q45PgI

Also, the number of samples containing "nsTHashtable" functions is reduced:
Before: https://share.firefox.dev/44OuxFN (3398 of 199950 = 1.69%)
After: https://share.firefox.dev/3NWrFjw (3046 of 193884 = 1.57%)
This suggests that the patch makes hash tables 7.7% faster.
(compute_part_relative_fraction(3398 / 199950, 3046 / 193884) == 0.923 == 1 - 0.077, see this document for justification)
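The arithmetic behind that number can be sketched as follows (this is my reading of compute_part_relative_fraction, not its actual implementation: assuming the non-hashtable work is unchanged between the two profiles, each hashtable fraction is normalized against the remaining work, and the ratio of the two gives the relative speed of the hashtable part):

```cpp
#include <cassert>
#include <cmath>

// Assumed semantics of compute_part_relative_fraction: treat the
// non-hashtable samples as a constant baseline, express hashtable time
// relative to that baseline in each profile, and return the after/before
// ratio. A result of 0.923 means the hashtable part got ~7.7% faster.
double PartRelativeFraction(double beforeFrac, double afterFrac) {
  double beforeRelative = beforeFrac / (1.0 - beforeFrac);
  double afterRelative = afterFrac / (1.0 - afterFrac);
  return afterRelative / beforeRelative;
}
```

Plugging in the sample counts above (3398/199950 before, 3046/193884 after) reproduces the 0.923 figure.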

I don't know much about CPU branch prediction, but I wonder whether indirect calls made this frequently are predicted well enough that there's not much overhead.

(In reply to Andrew McCreight [:mccr8] from comment #2)

If the binary bloat is really bad, it looks like only using inlining for specific hash tables (eg RestyleManager::mSnapshots, gfxDWriteFont::mGlyphWidths) might get us the performance benefit without as much bloat.

For the specific case of gfxDWriteFont::mGlyphWidths, bug 1844464 should improve things (by putting a fast MruCache in front, so we hit the hashtable much less frequently).

Good point. Any place we're hitting the hashtable often enough to show up in the profiles here, maybe we want to add an MruCache in front of it anyways.
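As a rough sketch of the idea (a hand-rolled direct-mapped cache for illustration, not mozilla::MruCache's actual API; the class and names below are invented):

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

// A small direct-mapped cache sits in front of the hash table, so hot keys
// are answered with one array index and one key comparison, skipping the
// full hash-table probe (and its indirect calls) on repeated lookups.
template <typename Key, typename Value, size_t Size = 31>
class SmallMruCache {
 public:
  // Returns the cached value for aKey, falling back to the backing table
  // on a miss and caching the result.
  Value GetOrLookup(const Key& aKey,
                    const std::unordered_map<Key, Value>& aTable) {
    Entry& slot = mEntries[std::hash<Key>{}(aKey) % Size];
    if (slot.valid && slot.key == aKey) {
      ++mHits;
      return slot.value;
    }
    ++mMisses;
    Value v = aTable.at(aKey);  // full hash-table lookup on a cache miss
    slot.key = aKey;
    slot.value = v;
    slot.valid = true;
    return v;
  }

  size_t Hits() const { return mHits; }
  size_t Misses() const { return mMisses; }

 private:
  struct Entry {
    Key key{};
    Value value{};
    bool valid = false;
  };
  Entry mEntries[Size];
  size_t mHits = 0;
  size_t mMisses = 0;
};
```

The win depends entirely on the key distribution: a skewed, repetitive access pattern (like glyph-width lookups) hits the cache almost every time, while a uniform one just pays for an extra comparison.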

The severity field is not set for this bug.
:nika, could you have a look please?


Flags: needinfo?(nika)
Type: defect → enhancement
Flags: needinfo?(nika)
Blocks: 1870279
No longer blocks: 1870279

:mstange, any update on this patch? Asking based on https://bugzilla.mozilla.org/show_bug.cgi?id=1870279#c15

Flags: needinfo?(mstange.moz)

No update and I haven't been working on it recently.

The next step here would be to fix the gtest failures that you see when you run the job test-linux1804-64-shippable-qr/opt-gtest-1proc. Unfortunately, the log has expired, so I can no longer say which test was failing.

I've proposed a different approach in bug 1870279 to fix that particular regression - we can try mozilla::HashSet instead of nsTHashtable there.

Assignee: mstange.moz → nobody
Status: ASSIGNED → NEW
Flags: needinfo?(mstange.moz)
