The indirect call in PLDHashTable::ComputeKeyHash to mOps->hashKey shows up on speedometer3
Categories: Core :: XPCOM, enhancement
People: Reporter: mstange; Assignee: Unassigned
References: Blocks 2 open bugs
Whiteboard: [sp3]
Attachments: 1 file
Description

Auditing indirect calls on Speedometer 3 (by looking at the callers of LdrpDispatchUserCallTarget on Windows) turned up the call to mOps->hashKey as one of the most common indirect calls.

I think it's worth doing the hashing inside nsTHashtable, in templated code where we know the hashing function.
Comment 1 • 1 year ago (Reporter)
nsTHashtable is templated on the EntryType, so it can direct-call (or even inline) EntryType::HashKey. PLDHashTable would need to go through mOps->hashKey, i.e. call through a function pointer.
Comment 2 • 1 year ago (Andrew McCreight [:mccr8])
Thanks for working on this.
What's the impact on the binary size from this change? IIRC, the reason nsTHashtable is implemented this way is to avoid binary bloat.
Are there measurable performance improvements from this change?
If the binary bloat is really bad, only using inlining for specific hash tables (e.g. RestyleManager::mSnapshots, gfxDWriteFont::mGlyphWidths) might get us the performance benefit without as much bloat. If I'm reading the profile correctly, inlining just the top 5 most common hash tables would eliminate about 38% of the indirect calls (we'd probably want to do more, but I was feeling lazy, so I only added up the top 5). I'm not sure off the top of my head how to implement that easily, though. Maybe an optional boolean template parameter to specify whether to inline.
Comment 3 • 1 year ago (Reporter)
(In reply to Andrew McCreight [:mccr8] from comment #2)
> What's the impact on the binary size from this change?

It's not clear to me: this shows a 300KB increase on Windows and a 100KB decrease on Linux. I'm not sure how high the PGO-induced variability in these metrics is.
> Are there measurable performance improvements from this change?
I'm going to collect a profile later today with the PGO builds from try and count the number of samples in hash-table-related functions. The impact on the Speedometer 3 score overall appears to be too low to be detectable; based on the numbers in the profile I would expect a 0.1% improvement. (LdrpDispatchUserCallTarget is 2% of overall Speedometer 3, this particular call is 5% of LdrpDispatchUserCallTarget, and 2% * 5% = 0.1%.)
In the perfherder speedometer3 subtest comparison, none of the subtests have a change that's detected with "high" confidence. (Except for one on Linux but it's garbage - this patch is definitely not causing a 6% improvement.)
Overall I'd say the case for this patch is not very strong. A 0.1% improvement is something we'd want to take if it's cheap to make and doesn't cause adverse effects in other ways; I'm not sure if this one qualifies.
Comment 4 • 1 year ago (Reporter)
So the profiles confirm that the indirect call is gone, at least.
Before: https://share.firefox.dev/3NXPkQD
After: https://share.firefox.dev/3Q45PgI
Also, the number of samples containing "nsTHashtable" functions is reduced:
Before: https://share.firefox.dev/44OuxFN (3398 of 199950 = 1.69%)
After: https://share.firefox.dev/3NWrFjw (3046 of 193884 = 1.57%)
This suggests that the patch makes hash tables 7.7% faster. (compute_part_relative_fraction(3398 / 199950, 3046 / 193884) == 0.923 == 1 - 0.077; see this document for justification.)
Comment 5 • 1 year ago
I don't know anything about CPU branch prediction, but I wonder whether these indirect calls, when made frequently, are predicted well enough that there's not much overhead.
Comment 6 • 1 year ago
(In reply to Andrew McCreight [:mccr8] from comment #2)
> If the binary bloat is really bad, it looks like only using inlining for specific hash tables (eg RestyleManager::mSnapshots, gfxDWriteFont::mGlyphWidths) might get us the performance benefit without as much bloat.
For the specific case of gfxDWriteFont::mGlyphWidths, bug 1844464 should improve things (by putting a fast MruCache in front, so we hit the hashtable much less frequently).
Comment 7 • 1 year ago
Good point. Any place where we're hitting the hashtable often enough to show up in the profiles here, maybe we want to add an MruCache in front of it anyway.
Comment 8 • 1 year ago
The severity field is not set for this bug.
:nika, could you have a look please?
For more information, please visit BugBot documentation.
Comment 9 • 7 months ago
:mstange any update on this patch? Wondering based on https://bugzilla.mozilla.org/show_bug.cgi?id=1870279#c15
Comment 10 • 7 months ago (Reporter)
No update and I haven't been working on it recently.
The next step here would be to fix the gtest failures that you see when you run the job test-linux1804-64-shippable-qr/opt-gtest-1proc. Unfortunately the log has expired, so I can no longer say which test was failing.

I've proposed a different approach in bug 1870279 to fix that particular regression: we can try mozilla::HashSet instead of nsTHashtable there.