Closed Bug 1508873 Opened 6 years ago Closed 6 years ago

decrease hash table item storage for 64-bit platforms

Categories

(Core :: MFBT, enhancement)

x86_64
All
enhancement
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla66
Tracking Status
firefox65 --- wontfix
firefox66 --- fixed

People

(Reporter: froydnj, Assigned: froydnj)

References

(Blocks 1 open bug)

Details

Attachments

(4 files)

+++ This bug was initially created as a clone of Bug #1460674 +++ Bug 1460674 comment 0 lays out some space wastage with how PLDHashTable's entries work. One would like to think that mozilla::HashTable, being all fancy C++ templates, is not susceptible to the same issue but: https://searchfox.org/mozilla-central/source/mfbt/HashTable.h#1051-1052 has the exact same problem as described in the aforementioned comment. If mValueData is aligned to an 8-byte boundary, as is quite common, then there's wasted space between mKeyHash and mValueData. (HashMaps with pointers keys are used all throughout JS, for instance.) We can fix this issue in exactly the same way as bug 1460674 is being fixed; it might be a little more complicated, but not *too* much.
We do this to make our lives easier in later patches; this check guarantees that we don't need padding between the block of cached hash values and the block of entries immediately after it.
Attachment #9028081 - Flags: review?(luke)
HashTableEntry's data layout currently wastes a fair amount of space due to ABI-mandated padding. For instance, HashTableEntry<T*> on a 64-bit platform looks like: class HashTableEntry { HashNumber mKeyHash; // Four bytes of wasted space here to pad mValueData to the correct place. unsigned char mValueData[sizeof(T*)]; }; This wasted space means that sets of pointers backed by mozilla::HashTable waste a quarter of their entry storage space. Maps of pointers to pointers waste a sixth of their entry storage space. We'd like to fix this by packing all the cached hashes together, followed by all the hash table entries. As a first step to laying out the hash table storage differently, we have to make HashTable not access entries directly, but go through an abstraction that represents the key and the entry. We call this abstraction "slots". This commit is similar to the change done for PLDHashTable previously. Parts of HashTable still work in terms of Entry; the creation and destruction of tables was not worth changing here. We'll address that in the next commit.
Attachment #9028082 - Flags: review?(luke)
As discussed in the previous commit message, HashTableEntry wastes space for common entry types. This commit reorganizes the internal storage to store all the hashes packed together, followed by all the entries, which eliminates the aforementioned waste.
Attachment #9028083 - Flags: review?(luke)
Attachment #9028081 - Flags: review?(luke) → review+
Overall the strategy looks pretty cool. I was looking at bug 1460674 to see what sort of performance testing had already been done to validate that there's no performance regression, but I couldn't find much, nor anything for JS. Two possible issues that come to mind are: 1. extra register pressure in table loops due to two pointers vs. one 2. ~2 cache hits instead of 1 per logical probe #1 seems unlikely anywhere except register-starved x86 (where it would be good to measure). #2 could be mitigated by the smaller tables causing less misses overall and if #2 was an issue, it seems like bug 1402910 could make it less so. In addition to testing on the synthetic benchmarks of large tables, like I see in the other bug, it would be good to see if any of the JS benchmarks are affected since JS heavily pounds on mozilla::HashTable. (To be clear, I wouldn't want mozilla::HashTable's and PLDHashTable's representations to diverge; this just seems like necessary due diligence for changing such a central data structure of JS.) One other question: it seems like the changes here aren't 64-bit-specific while, for 32-bit, this patch seems like a strict regression (unless I'm missing an angle?); are you thinking it's not worth the effort to have differing representations? (JS perf measurements would be useful here too.)
FWIW, reasonable JS benchmarks to test on Try are speedometer (sp), dromaeo/kraken (d) and the JS shell benchmarks (you can get them with "jsshell-bench" in |mach try fuzzy| IIRC). Probably also good to look at "JS memory usage" under the ab/sy AWSY jobs just in case.
(In reply to Luke Wagner [:luke] from comment #4) > Overall the strategy looks pretty cool. I was looking at bug 1460674 to see > what sort of performance testing had already been done to validate that > there's no performance regression, but I couldn't find much, nor anything > for JS. The BenchCollections.PLDHash gtest on my machine showed significant improvements from bug 1460674 (20% overall improvement), but the numbers might not have made it into the bug. On infra, however, the results are decidedly more mixed, with Windows (64-bit) opt and x86-64 Linux opt showing ~5-6% improvement, but OS X and x86-64 Linux showing significant regressions (20% or more): https://treeherder.mozilla.org/perf.html#/graphs?series=mozilla-inbound,1730081,1,6&series=mozilla-inbound,1730087,1,6&series=mozilla-inbound,1730093,1,6&series=mozilla-inbound,1730069,1,6&zoom=1543211122507.732,1543312701355.6702,106343.27219493352,213805.9587620977&selected=mozilla-inbound,1730081,407823,653247665,6 I don't really understand why OS X would be so terrible here, nor why opt and pgo on Linux would be so radically different, unless the changes somehow screwed up the inlining decisions made. OTOH, the graphs also show the benchmarks in question improving by about the same amount on Windows and Linux with unrelated changes, so I'm inclined to take them with a grain of salt. I did notice that (opt, not --enable-release) libxul size increased by ~140k with these changes, which seems not great. It's entirely possible that code folding would bring that down substantially. > One other question: it seems like the changes here aren't 64-bit-specific > while, for 32-bit, this patch seems like a strict regression (unless I'm > missing an angle?); are you thinking it's not worth the effort to have > differing representations? (JS perf measurements would be useful here too.) I think different representations could be done without too much trouble if need be. EntrySlot encapsulates a lot of the differences, and the remaining differences could be handled by moving allocation of the entry store itself out into EntrySlot or similar. (In reply to Jan de Mooij [:jandem] from comment #5) > FWIW, reasonable JS benchmarks to test on Try are speedometer (sp), > dromaeo/kraken (d) and the JS shell benchmarks (you can get them with > "jsshell-bench" in |mach try fuzzy| IIRC). Probably also good to look at "JS > memory usage" under the ab/sy AWSY jobs just in case. Thank you for the suggestions!
Benchmark results from the patch (Linux x86-64 only, because apparently Windows had compile errors (?!): dromaeo_css opt: 4.34% improvement kraken opt: 2.72% improvement speedometer: 0.17% worse ares6: 0.36% improvement octane: 0.33% worse six-speed: 0.47% worse sunspider: 5.78% better (!) web-tooling: 0.13% worse I don't know how to interpret these. Do we have a strict "no slowdown" policy, or do we eyeball things and make judgement calls about how bad the slowdown(s) are? OS X results are on the way. Windows will be too once I figure out the compile error.
These results look fine to me: the regressions are all < 0.5%, some significant improvements, and uses less memory.
Windows x86-64 (PGO, the opt numbers are slightly worse, but since they're not what we ship...): dromaeo_css pgo: 0.27% worse kraken pgo: 1.12% improvement speedometer pgo: 0.52% worse Linux x86-64 (PGO): dromaeo_css pgo: 0.01% better kraken pgo: 0.11% worse speedometer: 0.13% worse I don't have PGO numbers for the shell benchmarks, because apparently they don't run on PGO builds. OS X results are still waiting on test machines, so it will probably be a while before they come in. The size hit is still real, though, even with the --enable-release builds that infra is running. ~130k on Linux, ~190k (!) on Windows. I haven't tried AWSY yet, but I suspect that for AWSY's measurements, any size wins from the smaller hash tables would be drowned out by the extra code size.
Thanks for all the data so far; so far looking pretty good. Are there any 32-bit numbers coming as well? It's only ~28% of our users atm, but it would be good to make sure there's not a significant regression (esp. given that there's no win, iiuc). On the code size issue, two thoughts: - Code memory is shared between processes, while hash table memory isn't, so the patch could still be a win from a Fission pov. - If we are concerned about code size, I looked into hash table code bloat ~5 years ago and I think there are quite a few opportunities to reduce code bloat with some template-fu (moving cold paths (esp. resize, iirc) into less- or non-specialized functions, basically midpoints on the spectrum between current HashTable and PLDHashTable)
OS X (opt, not PGO): dromaeo_css opt: 1.04% improvement kraken opt: 0.72% improvement speedometer opt: 0.73% worse Kind of a wash there as well. 32-bit numbers will be forthcoming once I figure out why the static_assert I added in part 1 is triggering. I think it's because {u,}int64_t keys/values/members in either require over-alignment, which I *think* is OK. I can't tell what sort of alignment mozjemalloc guarantees, but I guess the default allocator on 32-bit platforms has to guarantee 8-byte alignment to ensure that {u,}int64_t are aligned correctly? Or maybe it does on Android/ARM, and x86 Windows just returns 4-byte alignment because, hey, x86 is OK with unaligned accesses everywhere.
Android arm7 (7.0 Moto G6): speedometer: 0.23% improvement Android arm7 (8.0 Pixel 2): speedometer: 1.30% worse Win32 pgo: dromaeo_css pgo: 1.42% worse kraken: 0.21% improvement speedometer: 0.25% worse Not sure whether that warrants doing a two-sided storage solution for 32 and 64 bit. Interesting note: I think this bit: https://searchfox.org/mozilla-central/source/mfbt/HashTable.h#1052-1053 actually means we're currently wasting some space on 32-bit for certain types of keys and values, just like we are on 64-bit. I haven't verified it by looking at debug info or anything like that, but this would be a problem in the XDR encoding/decoding JS engine code, and possibly other places.
Also, in discussions with erahm today, we weren't quite sure if the size decrease would actually be measured effectively by AWSY, because it would be entirely possible the new smaller entry store would simply take up (previously unoccupied) free space in a different jemalloc size class and freeing (previously occupied) space in a larger jemalloc size class. So yes, your hash tables would be taking less memory, but you wouldn't necessarily see that result as jemalloc would still allocate the same-ish number of memory pages. I did get some size difference numbers between builds for Windows 64-bit, but those numbers are on my windows machine and therefore not in front of me at the moment; I'll post about those tomorrow.
Comment on attachment 9028082 [details] [diff] [review] part 2 - convert HashTable to work primarily in terms of slots Review of attachment 9028082 [details] [diff] [review]: ----------------------------------------------------------------- Just looking at the patch, not asking the final "go / no-go" decision, looks great. ::: mfbt/HashTable.h @@ +1224,5 @@ > + { > + mEntry->swap(aOther.mEntry); > + } > + > + // XXX on the constness here. I usually read "XXX" as "TODO". Could you either expand on your meaning in this comment "removing the XXX" or remove the comment? @@ +1242,5 @@ > + > + bool isLive() const { return mEntry->isLive(); } > + > + void setCollision() { mEntry->setCollision(); } > + void unsetCollision() { mEntry->unsetCollision(); } nit: why \n between some of these one-line methods but not between others? @@ +1302,5 @@ > , mGeneration(aTable.generation()) > #endif > { > } > + nit: trailing space. (I guess this kind of nit is about to be history...)
Attachment #9028082 - Flags: review?(luke) → review+
Comment on attachment 9028083 [details] [diff] [review] part 3 - reorganize HashTable's internal storage Review of attachment 9028083 [details] [diff] [review]: ----------------------------------------------------------------- Ditto above comment: r+ on the raw patch, assuming positive conclusion to the measurements underway. ::: mfbt/HashTable.h @@ +1732,5 @@ > + }; > + > + FakeSlot* fake = aReportFailure > + ? aAllocPolicy.template pod_malloc<FakeSlot>(aCapacity) > + : aAllocPolicy.template maybe_pod_malloc<FakeSlot>(aCapacity); Could you add a comment explaining why aCapacity*sizeof(FakeSlot) is equivalent to two adjacent arrays of HashNumber and Entry, resp: arrays have no padding and we have a static assert there is no padding needed between the two arrays.
Attachment #9028083 - Flags: review?(luke) → review+
We do not run talos on our Android test machines (so says `mach try fuzzy`, anyway), so there are no numbers pending for dromaeo and kraken on there. And we don't run performance tests at all on Android x86 or Linux x86, so there are no pending numbers there. To summarize from above comments and to report on the Android aarch64 numbers that came in last night: 64-bit platforms ---------------- Android aarch64 opt: speedometer: 0.79% worse OS X (opt, not PGO): dromaeo_css opt: 1.04% improvement kraken opt: 0.72% improvement speedometer opt: 0.73% worse Windows x86-64 (PGO): dromaeo_css pgo: 0.27% worse kraken pgo: 1.12% improvement speedometer pgo: 0.52% worse Linux x86-64 opt (shell benchmarks only): ares6: 0.36% improvement octane: 0.33% worse six-speed: 0.47% worse sunspider: 5.78% better (!) web-tooling: 0.13% worse Linux x86-64 pgo: dromaeo_css pgo: 0.01% better kraken pgo: 0.11% worse speedometer: 0.13% worse 32-bit platforms --------------- Android arm7 (7.0 Moto G6): speedometer: 0.23% improvement Android arm7 (8.0 Pixel 2): speedometer: 1.30% worse Win32 pgo: dromaeo_css pgo: 1.42% worse kraken: 0.21% improvement speedometer: 0.25% worse
Thanks for the summary! Seems like it's almost entirely in the noise (except for that surprise shell sunspider improvement), and no big reason to add a separate 32-bit representation. Jan?
Flags: needinfo?(jdemooij)
(In reply to Nathan Froyd [:froydnj] from comment #11) I can't tell what sort of alignment mozjemalloc guarantees, but I guess the > default allocator on 32-bit platforms has to guarantee 8-byte alignment to > ensure that {u,}int64_t are aligned correctly? Or maybe it does on > Android/ARM, and x86 Windows just returns 4-byte alignment because, hey, x86 > is OK with unaligned accesses everywhere. FWIW, I *think* mozjemalloc's minimum alignment is determined at [1]. If so, that's 16 bytes on x64 Windows (8 bytes on x86 Windows), 4 bytes on 32-bit ARM and x86 Linux, and 8 bytes everywhere else (for officially supported platforms). [1] https://dxr.mozilla.org/mozilla-central/source/memory/build/mozjemalloc.cpp#386
mozjemalloc guarantees different alignment for different size classes. Tiny size classes (4 or 8, when supported) are guaranteed to be aligned for their size (4-bytes alignment for 4-byte allocations, 8-bytes alignment for 8-bytes allocations). Quantum-spaced size classes (between 16 and 512) are guaranteed to be 16-bytes aligned, with some classes having a larger alignment, still. Sub-page size classes (512, 1024, 2048) are guaranteed to be aligned for their size (respectively, 512, 1024, and 2048). Large and huge size classes are page-aligned.
(In reply to Luke Wagner [:luke] from comment #17) > Thanks for the summary! Seems like it's almost entirely in the noise > (except for that surprise shell sunspider improvement), and no big reason to > add a separate 32-bit representation. Jan? Agreed, this all looks fine. (Sorry for the delay.)
Flags: needinfo?(jdemooij)
This says part 4, but will really be rolled into part 1, because it belongs there. Due to some hash tables containing 8-byte aligned things (e.g. uint64_t) on 32-bit platforms, our previous static_assert for entry alignment <= alignof(void*) would fail. Happily, the required alignment is actually guaranteed by the memory allocator on 32-bit platforms, so we can relax the assertion somewhat on such platforms.
Attachment #9030832 - Flags: review?(luke)
Comment on attachment 9030832 [details] [diff] [review] part 4 - relax the HashTableEntry assertion for 32-bit platforms Review of attachment 9030832 [details] [diff] [review]: ----------------------------------------------------------------- Makes sense. ::: mfbt/HashTable.h @@ +960,5 @@ > + // bytes is 8 bytes, and the actual data for entries in our entry store is > + // guaranteed to have that alignment as well, thanks to the power-of-two > + // number of cached hash values stored prior to the entry data. > + static_assert(alignof(NonConstT) <= 2 * alignof(void*), > + "cannot use over-aligned entries in mozilla::HashTable"); The one remaining dot to connect to understand both these static_asserts (which could go before the #ifdef is to explain the contiguous-two-array layout of the table (hash numbers followed by entries) and what alignment property that requires.
Attachment #9030832 - Flags: review?(luke) → review+
(In reply to Luke Wagner [:luke] from comment #22) > The one remaining dot to connect to understand both these static_asserts > (which could go before the #ifdef is to explain the contiguous-two-array > layout of the table (hash numbers followed by entries) and what alignment > property that requires. Right, you requested that in comment 15 when reviewing part 3. Would you like to see and approve the comment prior to commit?
Flags: needinfo?(luke)
Oh hah right; I may be forgetful, but at least I'm consistent ;) No need to for another review.
Flags: needinfo?(luke)
Pushed by nfroyd@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/a6657732fdbe part 1 - statically assert alignment for hashtable entries; r=luke https://hg.mozilla.org/integration/mozilla-inbound/rev/4bb5dd072865 part 2 - convert HashTable to work primarily in terms of slots; r=luke https://hg.mozilla.org/integration/mozilla-inbound/rev/a356b2566ae2 part 3 - reorganize HashTable's internal storage; r=luke
Backed out for spidermonkey build bustages, Push with failures: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&resultStatus=testfailed%2Cbusted&fromchange=a356b2566ae2e8f6d5ac676b39fd74aea6604664&tochange=fe92d7b4a83ff084a1ebd3a7bae85b2c7aa8aeeb&selectedJob=216651222 Failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=216651222&repo=mozilla-inbound&lineNumber=656 Backout link: https://hg.mozilla.org/integration/mozilla-inbound/rev/fe92d7b4a83ff084a1ebd3a7bae85b2c7aa8aeeb [task 2018-12-12T20:05:01.890Z] /builds/worker/workspace/build/src/obj-spider/dist/include/mozilla/HashTable.h:1439:44: error: base operand of '->' has non-pointer type 'mozilla::detail::HashTable<mozilla::HashMapEntry<js::CrossCompartmentKey, js::detail::UnsafeBareReadBarriered<JS::Value> >, mozilla::HashMap<js::CrossCompartmentKey, js::detail::UnsafeBareReadBarriered<JS::Value>, js::CrossCompartmentKey::Hasher, js::SystemAllocPolicy>::MapHashPolicy, js::SystemAllocPolicy>::Slot {aka mozilla::detail::EntrySlot<mozilla::HashMapEntry<js::CrossCompartmentKey, js::detail::UnsafeBareReadBarriered<JS::Value> > >}' [task 2018-12-12T20:05:01.890Z] MOZ_ASSERT(&k != &HashPolicy::getKey(this->mCur->get())); [task 2018-12-12T20:05:01.890Z] ^ [task 2018-12-12T20:05:01.890Z] /builds/worker/workspace/build/src/obj-spider/dist/include/mozilla/Assertions.h:436:66: note: in definition of macro 'MOZ_VALIDATE_ASSERT_CONDITION_TYPE' [task 2018-12-12T20:05:01.890Z] static_assert(mozilla::detail::AssertionConditionType<decltype(x)>::isValid, \ [task 2018-12-12T20:05:01.890Z] ^ [task 2018-12-12T20:05:01.890Z] /builds/worker/workspace/build/src/obj-spider/dist/include/mozilla/Assertions.h:473:39: note: in expansion of macro 'MOZ_ASSERT_HELPER1' [task 2018-12-12T20:05:01.890Z] #define MOZ_RELEASE_ASSERT_GLUE(a, b) a b [task 2018-12-12T20:05:01.890Z] ^ [task 2018-12-12T20:05:01.890Z] /builds/worker/workspace/build/src/obj-spider/dist/include/mozilla/Assertions.h:475:3: note: in expansion of macro 'MOZ_RELEASE_ASSERT_GLUE' [task 2018-12-12T20:05:01.890Z] MOZ_RELEASE_ASSERT_GLUE( \ [task 2018-12-12T20:05:01.890Z] ^~~~~~~~~~~~~~~~~~~~~~~ [task 2018-12-12T20:05:01.890Z] /builds/worker/workspace/build/src/obj-spider/dist/include/mozilla/Assertions.h:480:25: note: in expansion of macro 'MOZ_RELEASE_ASSERT' [task 2018-12-12T20:05:01.890Z] #define MOZ_ASSERT(...) MOZ_RELEASE_ASSERT(__VA_ARGS__) [task 2018-12-12T20:05:01.890Z] ^~~~~~~~~~~~~~~~~~ [task 2018-12-12T20:05:01.890Z] /builds/worker/workspace/build/src/obj-spider/dist/include/mozilla/HashTable.h:1439:7: note: in expansion of macro 'MOZ_ASSERT' [task 2018-12-12T20:05:01.890Z] MOZ_ASSERT(&k != &HashPolicy::getKey(this->mCur->get())); [task 2018-12-12T20:05:01.890Z] ^~~~~~~~~~ [task 2018-12-12T20:05:01.890Z] In file included from /builds/worker/workspace/build/src/obj-spider/dist/include/mozilla/Assertions.h:18:0, [task 2018-12-12T20:05:01.890Z] from /builds/worker/workspace/build/src/obj-spider/dist/include/mozilla/MathAlgorithms.h:12, [task 2018-12-12T20:05:01.890Z] from /builds/worker/workspace/build/src/js/src/ds/LifoAlloc.h:11, [task 2018-12-12T20:05:01.890Z] from /builds/worker/workspace/build/src/js/src/frontend/BinASTParserBase.h:12, [task 2018-12-12T20:05:01.890Z] from /builds/worker/workspace/build/src/js/src/frontend/BinASTParserBase.cpp:7:
Flags: needinfo?(nfroyd)
Bah, apparently I failed to notice in all my try pushes that I never built debug. Looks like a simple s/->/./ will work just fine, patch in testing.
Flags: needinfo?(nfroyd)
Tests say the patch is hitting a release assert in Compartment::checkWrapperMapAfterMovingGC...only on SM arm debug builds (!). I will look more closely tomorrow.
Pushed by nfroyd@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/ff944ef6b8a4 part 1 - statically assert alignment for hashtable entries; r=luke https://hg.mozilla.org/integration/mozilla-inbound/rev/20d92a98e3a3 part 2 - convert HashTable to work primarily in terms of slots; r=luke https://hg.mozilla.org/integration/mozilla-inbound/rev/95e3de1d6643 part 3 - reorganize HashTable's internal storage; r=luke
Assignee: nobody → nfroyd
Depends on: 1514280
No longer regressions: 1557617
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: