Closed
Bug 1440824
Opened 6 years ago
Closed 6 years ago
Split the Gecko atom table into multiple subtables to reduce locking contention
Categories
(Core :: XPCOM, enhancement)
Tracking
RESOLVED
FIXED
mozilla60
| | Tracking | Status |
|---|---|---|
| firefox60 | --- | fixed |
People
(Reporter: bholley, Assigned: bholley)
References
Details
Attachments
(3 files, 1 obsolete file)
3.47 KB, patch | froydnj: review+ | Details | Diff | Splinter Review
1.32 KB, patch | froydnj: review+ | Details | Diff | Splinter Review
26.38 KB, patch | froydnj: review+ | Details | Diff | Splinter Review
In bug 1346988, I'm working on moving CSS parsing to a background thread pool to speed up page load. One problem I've encountered is locking contention in the atom table. CSS parsing is very heavy on atomization, and atomizing requires a mutex to protect operations on the hashtable. In my initial measurements, moving ~50ms of parsing work to 6 background threads causes _each_ thread to spend ~150ms acquiring the mutex. There are lockless hashtable implementations, but they're extremely complex. A simpler approach is to use the hash key to segment the table into N subtables, each of which is protected by a separate mutex. I've implemented this and it works quite well. I can bring the locking overhead down to a couple of milliseconds per thread with 16 sub-tables, and under a millisecond with 128 or 256. Will post patches shortly.
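As a rough illustration of the approach, here is a standalone sketch using std::unordered_set and std::mutex as stand-ins for PLDHashTable and mozilla::Mutex; the class and method names are invented for illustration and are not the actual patch:

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_set>

// Hash-sharded string table: the key's hash picks one of kNumSubTables
// sub-tables, each guarded by its own mutex, so concurrent inserters only
// contend when they hash to the same shard.
class ShardedAtomTable {
public:
  static constexpr size_t kNumSubTables = 16;  // power of two, as in the patch
  static_assert((kNumSubTables & (kNumSubTables - 1)) == 0,
                "shard count must be a power of two");

  // Insert (or find) a string; returns true if it was newly inserted.
  bool Atomize(const std::string& aKey) {
    size_t hash = std::hash<std::string>{}(aKey);
    SubTable& shard = mSubTables[hash & (kNumSubTables - 1)];
    std::lock_guard<std::mutex> lock(shard.mLock);
    return shard.mSet.insert(aKey).second;
  }

  // Racy total count: sums per-shard sizes one lock at a time, with no global
  // lock, so the result can be stale if other threads mutate concurrently.
  size_t RacySlowCount() {
    size_t count = 0;
    for (SubTable& shard : mSubTables) {
      std::lock_guard<std::mutex> lock(shard.mLock);
      count += shard.mSet.size();
    }
    return count;
  }

private:
  struct SubTable {
    std::mutex mLock;
    std::unordered_set<std::string> mSet;
  };
  std::array<SubTable, kNumSubTables> mSubTables;
};
```

The key property is that two threads only contend when their keys hash to the same subtable, so with N subtables and well-distributed hashes the expected contention drops by roughly a factor of N.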
Assignee
Comment 1•6 years ago
MozReview-Commit-ID: 4uMktcaYwWW
Attachment #8953699 - Flags: review?(nfroyd)
Assignee
Comment 2•6 years ago
MozReview-Commit-ID: E1bcchzuMOu
Attachment #8953700 - Flags: review?(nfroyd)
Assignee
Comment 3•6 years ago
MozReview-Commit-ID: Hj8gKPap0cR
Attachment #8953701 - Flags: review?(nfroyd)
Assignee
Comment 4•6 years ago
https://treeherder.mozilla.org/#/jobs?repo=try&revision=1b5c288647685a28e523ac54fe2e01f57e410910
Comment 5•6 years ago
Comment on attachment 8953700 [details] [diff] [review]
Part 2 - Overhaul the atom infrastructure to support multiple subtables. v1

Review of attachment 8953700 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/ds/nsAtomTable.cpp
@@ +234,5 @@
> +  Mutex* mLock;
> +  void GCLocked(GCKind aKind);
> +  void SizeOfIncludingThisLocked(MallocSizeOf aMallocSizeOf, size_t* aSize);
> +
> +  AtomTableEntry* Add(AtomTableKey& key)

Nit: aKey

@@ +256,5 @@
> +  already_AddRefed<nsAtom> Atomize(const nsAString& aUTF16String);
> +  already_AddRefed<nsAtom> Atomize(const nsACString& aUTF8String);
> +  already_AddRefed<nsAtom> AtomizeMainThread(const nsAString& aUTF16String);
> +  void RegisterStaticAtoms(const nsStaticAtomSetup* aSetup, uint32_t aCount);
> +  size_t RaceySlowCount();

Nit: Racy

@@ +265,5 @@
> +
> +private:
> +  // XXXbholley: We enable multiple subtables in the next patch.
> +  const static size_t kNumSubTables = 1;
> +  nsAtomSubTable mSubTables[kNumSubTables];

Please comment (and/or statically assert) that this number should be/is a power of two.

@@ +331,3 @@
>  //
> +// So we first make the simplifying assumption that the atoms are more or less
> +// evenly-distributed across the subtables (which is the case emperically).

Nit: empirically.

@@ +369,5 @@
> +  // There are a few considerations around how we select subtables.
> +  //
> +  // First, we want entries to be evenly distributed across the subtables. This
> +  // can be achieved by using any bits in the hash key, assuming the key itself
> +  // is evenly-distributed. Emperical measurements indicate that this method

Nit: Empirical

@@ +379,5 @@
> +  // every entry it observes, leading to pessimal performance. In this case,
> +  // we're using PLDHashTable, whose primary hash function uses the N leftmost
> +  // bits of the hash value (where N is the log2 capacity of the table). This
> +  // means we should prefer the rightmost bits here.
> +  return mSubTables[aKey.mHash % kNumSubTables];

It's very likely that compilers will implement this using a mask instead of a modulo. But is it worth doing that explicitly -- `aKey.mHash & (kNumSubTables - 1)` -- to avoid any doubt?

@@ +385,3 @@
>  
>  void
> +nsAtomTable::SizeOfIncludingThis(MallocSizeOf aMallocSizeOf, size_t* aSize)

SizeOf functions that increment counters are usually called "AddSizeOf{In,Ex}cludingThis" because that makes it clearer. But since this is only taking a single measurement, it would be better to remove the outparam and use the return value.

@@ +433,5 @@
> +  MOZ_ASSERT_IF(aKind == GCKind::Shutdown, gUnusedAtomCount == 0);
> +}
> +
> +size_t
> +nsAtomTable::RaceySlowCount()

A function with a provocative name like this deserves a comment, either here or on the declaration. In what way is it racy? What are the possible consequences? (I can tell by reading the code, but it would be nice to have extra explanation.)

Also, NS_GetNumberOfAtoms() relies on this function. It appears to only be used in tests. Its declaration also deserves a comment about its reliability.

@@ +655,5 @@
>      : 0;
>  }
>  
> +void
> +nsAtomSubTable::SizeOfIncludingThisLocked(MallocSizeOf aMallocSizeOf, size_t* aSize)

Again, please use the return value instead of an outparam.

@@ +660,2 @@
>  {
> +  mLock->AssertCurrentThreadOwns();

A nice idiom we use in the Gecko Profiler: in functions like these that require a lock to be held, we have a "proof of lock" argument -- a reference to the AutoLock:

https://searchfox.org/mozilla-central/rev/b469db5c90d618f4b202d5ef280e1a78224eb43b/tools/profiler/core/platform.cpp#178-180

as used here, for example:

https://searchfox.org/mozilla-central/rev/b469db5c90d618f4b202d5ef280e1a78224eb43b/tools/profiler/core/platform.cpp#1621

Might be worth doing that here, and for GCLocked()? Types are generally better than assertions.
Attachment #8953700 - Flags: feedback+
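For what it's worth, the mask-vs-modulo suggestion above can be made concrete in a small standalone sketch (`SelectSubTableIndex` is an invented name; the real code indexes the subtable array directly):

```cpp
#include <cstddef>
#include <cstdint>

// When the subtable count is a power of two, hash % N and hash & (N - 1)
// select the same index; writing the mask explicitly leaves no doubt about
// the generated code.
constexpr size_t kNumSubTables = 128;
static_assert((kNumSubTables & (kNumSubTables - 1)) == 0,
              "kNumSubTables must be a power of two");

constexpr size_t SelectSubTableIndex(uint32_t aHash) {
  return aHash & (kNumSubTables - 1);
}
```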
Comment 6•6 years ago
Comment on attachment 8953701 [details] [diff] [review]
Part 3 - Enable multiple hashtables for atoms. v1

Review of attachment 8953701 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/ds/nsAtomTable.cpp
@@ +268,5 @@
> +  // parsing by increasing the number of subtables up to 128. This has been
> +  // measured to have neglible impact on the performance of initialization, GC,
> +  // and shutdown, and only costs about 20k of memory relative to a single
> +  // subtable.
> +  const static size_t kNumSubTables = 128;

With bug 1436250 in mind I wonder if a smaller value would be better. What does the perf vs. memory usage curve look like?
Assignee
Comment 7•6 years ago
(In reply to Nicholas Nethercote [:njn] from comment #5)
> @@ +660,2 @@
> >  {
> > +  mLock->AssertCurrentThreadOwns();
>
> A nice idiom we use in the Gecko Profiler: in functions like these that
> require a lock to be held, we have a "proof of lock" argument -- a reference
> to the AutoLock:

It was actually in the old code, but I think it's overkill in the new setup, and makes things unnecessarily verbose. In the new setup, the *Locked methods are only called in one place, by the outer table (they're only separate methods for readability). The proof-of-lock stuff is most useful when you have multiple callers and nested *Locked functions.

(In reply to Nicholas Nethercote [:njn] from comment #6)
> > +  // parsing by increasing the number of subtables up to 128. This has been
> > +  // measured to have neglible impact on the performance of initialization, GC,
> > +  // and shutdown, and only costs about 20k of memory relative to a single
> > +  // subtable.
> > +  const static size_t kNumSubTables = 128;
>
> With bug 1436250 in mind I wonder if a smaller value would be better. What
> does the perf vs. memory usage curve look like?

Memory overhead seems roughly linear with the number of tables, at least in the ranges I measured (up to 256). IIRC the locking overhead was ~3-6ms with 64 subtables, and ~1-2ms with 128 subtables. I don't think it's worth taking a measurable slowdown in the milliseconds to save 10k per process. There's lower-hanging fruit, and if we're going to chase wins at that level, I think we'll want some kind of "unimportant process" heuristic so that we don't compromise performance on stuff that matters.
Assignee
Comment 8•6 years ago
Actually, that memory measurement was for tp6-facebook, not an empty page. Let me remeasure.
Assignee
Comment 9•6 years ago
Remeasuring with a more-or-less empty page, I get 280,784 bytes with 64 tables vs 282,576 bytes with 128 tables. So even if 10k isn't negligible, 2k certainly is. :-)
Comment 10•6 years ago
> > "proof of lock"
>
> It was actually in the old code, but I think it's overkill in the new setup,
> and make things unnecessarily verbose.
It's an extra argument, but it saves the need for an assertion. *shrug*
Assignee
Comment 11•6 years ago
(In reply to Nicholas Nethercote [:njn] from comment #10)
> > > "proof of lock"
> >
> > It was actually in the old code, but I think it's overkill in the new setup,
> > and make things unnecessarily verbose.
>
> It's an extra argument, but it saves the need for an assertion. *shrug*

I guess another point is that I still want the assertion, since there are multiple locks involved, and we want to make sure we have the right one.
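For context, the "proof of lock" idiom being discussed looks roughly like this standalone sketch (`Shard`, `Add`, and `AddLocked` are invented names; std::lock_guard stands in for MutexAutoLock):

```cpp
#include <mutex>

// A *Locked helper takes a reference to the guard, so it cannot be called
// without some lock held. As noted in the thread, with multiple locks this
// only proves that *a* lock is held, not the right one, which is why the
// actual patch keeps an AssertCurrentThreadOwns-style assertion instead.
class Shard {
public:
  void Add(int aValue) {
    std::lock_guard<std::mutex> lock(mLock);
    AddLocked(aValue, lock);
  }
  int Count() {
    std::lock_guard<std::mutex> lock(mLock);
    return mCount;
  }

private:
  // The guard reference is unused at runtime; its presence is the proof.
  void AddLocked(int aValue, const std::lock_guard<std::mutex>&) {
    mCount += aValue;
  }
  std::mutex mLock;
  int mCount = 0;
};
```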
Updated•6 years ago
Attachment #8953699 - Flags: review?(nfroyd) → review+
Comment 12•6 years ago
Comment on attachment 8953700 [details] [diff] [review]
Part 2 - Overhaul the atom infrastructure to support multiple subtables. v1

Review of attachment 8953700 [details] [diff] [review]:
-----------------------------------------------------------------

I generally agree with all of njn's comments. I like the style of aProofOfLock arguments, but it's true that with multiple locks, that style breaks down, because you would like to be sure the correct lock is being held (and we don't have a MutexAutoLock::AssertIsHoldingLock(const Mutex&) or similar).

::: xpcom/ds/nsAtomTable.cpp
@@ +225,5 @@
>  
> +// In order to reduce locking contention for concurrent atomization, we segment
> +// the atom table into N subtables, each with a separate lock. If the hash
> +// values we use to select the subtable are evenly distributed, this reduces the
> +// probability of contention by a factor of N. See bug 1440824.

Do we want to cite any prior work here, e.g. Java's ConcurrentHashMap?

@@ +230,5 @@
> +class nsAtomSubTable
> +{
> +  friend class nsAtomTable;
> +  PLDHashTable* mTable;
> +  Mutex* mLock;

Why are these pointers rather than ordinary values? Is this just so detecting accesses after atom table shutdown is easier? If we do want to keep these as pointers, could they be made UniquePtrs instead?

@@ +269,5 @@
> +  nsAtomSubTable mSubTables[kNumSubTables];
> +};
> +
> +// Static singleton instance for the atom table.
> +static nsAtomTable gAtomTable;

Ah, this would be a reasonable point for having the pointers in nsAtomSubTable above: we don't have to worry about static constructors for this instance if nsAtomSubTable has trivial constructors only.

@@ +343,5 @@
> +void nsAtomTable::Init()
> +{
> +  MOZ_ASSERT(!IsInitialized());
> +  for (size_t i = 0; i < kNumSubTables; ++i) {
> +    nsAtomSubTable& entry = mSubTables[i];

Nit: we could do:

for (auto& table : mSubTables) {
  ...
}

here and everywhere else in this file.

Also nit: I think we should call the temporary `table`, rather than `entry`, to avoid confusion between it and hash table entries.

@@ +356,5 @@
> +  MOZ_ASSERT(IsInitialized());
> +  for (size_t i = 0; i < kNumSubTables; ++i) {
> +    nsAtomSubTable& entry = mSubTables[i];
> +    delete entry.mTable;
> +    entry.mTable = nullptr;

Do we want to be super-paranoid and lock before deleting this?

@@ +421,5 @@
> +  // we might see a gUnusedAtomCount value in between, say, AddRef()
> +  // incrementing mRefCnt and it decrementing gUnusedAtomCount.
> +  //
> +  // So, we don't bother asserting that there are no unused atoms at the end of
> +  // a regular GC. But we can (and do) assert thist just after the last GC at

Nit: "assert this"

@@ +660,3 @@
>  {
> +  mLock->AssertCurrentThreadOwns();
> +  *aSize += mTable->ShallowSizeOfIncludingThis(aMallocSizeOf);

It's worth noting that mLock can take up a reasonable amount of space (the platform-specific mutexes take 64 bytes on Mac, I think, 40 bytes on Linux, and something similarly-sized on Windows, not to mention whatever bits we include on top of that). So measuring mLock here somehow would be worthwhile.

It's a little peculiar to have a SizeOfIncludingThis method that doesn't actually measure the memory occupied by `this`...though I guess the memory isn't allocated. Perhaps this method should be SizeOfExcludingThisLocked for that reason? (And similarly for the nsAtomTable method?)

@@ +689,5 @@
>    uint32_t stringLen = NS_strlen(string);
>  
>    uint32_t hash;
> +  AtomTableKey key(string, stringLen, &hash);
> +  nsAtomSubTable& entry = SelectSubTable(key);

Naming nit here.

@@ +741,3 @@
>    uint32_t hash;
> +  AtomTableKey key(aUTF8String.Data(), aUTF8String.Length(), &hash);
> +  nsAtomSubTable& entry = SelectSubTable(key);

Naming nit here.

@@ +782,3 @@
>    uint32_t hash;
> +  AtomTableKey key(aUTF16String.Data(), aUTF16String.Length(), &hash);
> +  nsAtomSubTable& entry = SelectSubTable(key);

Naming nit here.

@@ +823,5 @@
>      return retVal.forget();
>    }
>  }
>  
> +  nsAtomSubTable& entry = SelectSubTable(key);

Naming nit here.
Attachment #8953700 - Flags: review?(nfroyd) → review+
Comment 13•6 years ago
Comment on attachment 8953701 [details] [diff] [review]
Part 3 - Enable multiple hashtables for atoms. v1

Review of attachment 8953701 [details] [diff] [review]:
-----------------------------------------------------------------

Given that we weren't measuring mLock in the previous patch, I am less sure about the size numbers. Let's go with the 20k number quoted in the comments, saying that's what we measured for an increase vs. a single subtable. That didn't include:

* 1 lock * 64 bytes/lock * 128 sub-tables = 8k (plus a little more for BlockingResourceBase, sadly, another 1-2k)
* 8 bytes/subtable * 128 sub-tables = 1k (for the gAtomTable itself--not malloc'd memory, but implicitly taking up space in memory via the binary)

which raises that 20k to nearly 30k. Is that worth it? 30k * 10-ish processes (content, file://, EME, addon, etc.) starts to sound fairly large...

Where did the overhead numbers mentioned in comment 7 come from?
Comment 14•6 years ago
OTOH, my current session has 2.1MB for the main atom table in the main process and between 2.9 and 5.8MB for the main atom table in content processes. So maybe a 0.5-1%-ish increase in the size of the atom table is nothing to get too worried about...
Assignee
Comment 15•6 years ago
(In reply to Nicholas Nethercote [:njn] from comment #5)
> Nit: aKey
> Nit: Racy

Fixed.

> Please comment (and/or statically assert) that this number should be/is a
> power of two.

Fixed.

> Nit: empirically.
> Nit: Empirical

Fixed.

> It's very likely that compilers will implement this using a mask instead of
> a modulo. But is it worth doing that explicitly -- `aKey.mHash &
> (kNumSubTables - 1)` -- to avoid any doubt?

Ok.

> But since this is only taking a single measurement, it would be better to
> remove the outparam and use the return value.

Fixed.

> A function with a provocative name like this deserves a comment, either here
> or on the declaration. In what way is it racy? What are the possible
> consequences? (I can tell by reading the code, but it would be nice to have
> extra explanation.)
>
> Also, NS_GetNumberOfAtoms() relies on this function. It appears to only be
> used in tests. Its declaration also deserves a comment about its reliability.

Fixed.

> Again, please use the return value instead of an outparam.

Fixed.

> Might be worth doing that here, and for GCLocked()? Types are generally
> better than assertions.

Per previous discussion, I think it's better to skip that here.

(In reply to Nathan Froyd [:froydnj] from comment #12)
> I generally agree with all of njn's comments.

Addressed, per above.

> I like the style of aProofOfLock arguments, but it's true that with multiple
> locks, that style breaks down, because you would like to be sure the correct
> lock is being held (and we don't have a
> MutexAutoLock::AssertIsHoldingLock(const Mutex&) or similar).

Yep.

> Do we want to cite any prior work here, e.g. Java's ConcurrentHashMap?

Ok.

> @@ +230,5 @@
> > +class nsAtomSubTable
> > +{
> > +  friend class nsAtomTable;
> > +  PLDHashTable* mTable;
> > +  Mutex* mLock;
>
> Why are these pointers rather than ordinary values? Is this just so
> detecting accesses after atom table shutdown is easier?

It was. But upon reflection, I think it's preferable to put everything inline in nsAtomTable, and then make gAtomTable a pointer. I've gone ahead and done this.

> @@ +343,5 @@
> > +void nsAtomTable::Init()
> > +{
> > +  MOZ_ASSERT(!IsInitialized());
> > +  for (size_t i = 0; i < kNumSubTables; ++i) {
> > +    nsAtomSubTable& entry = mSubTables[i];
>
> Nit: we could do:
>
> for (auto& table : mSubTables) {
>   ...
> }
>
> here and everywhere else in this file. Also nit: I think we should call the
> temporary `table`, rather than `entry`, to avoid confusion between it and
> hash table entries.

Done and done.

> @@ +356,5 @@
> > +  MOZ_ASSERT(IsInitialized());
> > +  for (size_t i = 0; i < kNumSubTables; ++i) {
> > +    nsAtomSubTable& entry = mSubTables[i];
> > +    delete entry.mTable;
> > +    entry.mTable = nullptr;
>
> Do we want to be super-paranoid and lock before deleting this?

I don't think so, since we'd be deleting the lock straight after. In any case, it's a bit of a moot point now because the shutdown function is gone, replaced with a default destructor.

> Nit: "assert this"

Fixed.

> It's worth noting that mLock can take up a reasonable amount of space (the
> platform-specific mutexes take 64 bytes on Mac, I think, 40 bytes on Linux,
> and something similarly-sized on Windows, not to mention whatever bits we
> include on top of that). So measuring mLock here somehow would be
> worthwhile.
>
> It's a little peculiar to have a SizeOfIncludingThis method that doesn't
> actually measure the memory occupied by `this`...though I guess the memory
> isn't allocated. Perhaps this method should be SizeOfExcludingThisLocked
> for that reason? (And similarly for the nsAtomTable method?)

This is all fixed under the new setup.

> Naming nit here.
> Naming nit here.
> Naming nit here.
> Naming nit here.

Fixed.
Assignee
Comment 16•6 years ago
Flagging for re-review mostly for the new inline array setup, plus the memory reporter changes. Everything else is a pretty straightforward addressing of review comments.

MozReview-Commit-ID: E1bcchzuMOu
Attachment #8954124 - Flags: review?(nfroyd)
Assignee
Updated•6 years ago
Attachment #8953700 - Attachment is obsolete: true
Updated•6 years ago
Attachment #8954124 - Flags: review?(nfroyd) → review+
Comment 17•6 years ago
Comment on attachment 8953701 [details] [diff] [review]
Part 3 - Enable multiple hashtables for atoms. v1

Review of attachment 8953701 [details] [diff] [review]:
-----------------------------------------------------------------

On reflection, I think the size concerns here are valid, but probably not the lowest-hanging fruit on the tree.
Attachment #8953701 - Flags: review?(nfroyd) → review+
Assignee
Comment 18•6 years ago
(In reply to Nathan Froyd [:froydnj] from comment #17)
> On reflection, I think the size concerns here are valid, but probably not
> the lowest-hanging fruit on the tree.

Agreed. That said, I wanted to clarify the numbers a bit. Comment 13 is based on the numbers of a non-empty content process, which doesn't so much matter. What _does_ matter is the overhead in an empty content process, because that's the fixed overhead that hurts us when we scale up the number of content processes. I did a more thorough writeup in a comment for part 3. Here it is:

// Another important consideration is memory, since we're adding fixed overhead
// per content process, which we try to avoid. Measuring a mostly-empty page [1]
// with various numbers of subtables, we get the following deep sizes for the
// atom table:
//     1 subtable:   278K
//     8 subtables:  279K
//    16 subtables:  282K
//    64 subtables:  286K
//   128 subtables:  290K
//
// So 128 subtables costs us 12K relative to a single table, and 4K relative
// to 64 subtables. Conversely, measuring parallel (6 thread) CSS parsing on
// tp6-facebook, a single table provides ~150ms of locking overhead per thread,
// 64 subtables provides ~2-3ms of overhead, and 128 subtables provides <1ms.
// And so while either 64 or 128 subtables would probably be acceptable,
// achieving a measurable redution in contention for 4k of fixed memory
// overhead is probably worth it.
//
// [1] The numbers will look different for content processes with complex
// pages loaded, but in those cases the actual atoms will dominate memory
// usage and the overhead of extra tables will be negligible. We're mostly
// interested in the fixed cost for nearly-empty content processes.
Assignee
Comment 19•6 years ago
https://treeherder.mozilla.org/#/jobs?repo=try&revision=c117b3fc1dfb59c5869d7084475e3af326da3c56
Comment 20•6 years ago
> // achieving a measurable redution in contention for 4k of fixed memory
Nit: reduction
Comment 21•6 years ago
Pushed by bholley@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/973afb5d4909
Move some code around. r=froydnj
https://hg.mozilla.org/integration/autoland/rev/5d83e440bbb0
Overhaul the atom infrastructure to support multiple subtables. r=froydnj
https://hg.mozilla.org/integration/autoland/rev/ef3ac3531192
Enable multiple hashtables for atoms. r=froydnj
Comment 22•6 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/973afb5d4909
https://hg.mozilla.org/mozilla-central/rev/5d83e440bbb0
https://hg.mozilla.org/mozilla-central/rev/ef3ac3531192
Status: NEW → RESOLVED
Closed: 6 years ago
status-firefox60:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla60