Bug 1384798 - NS_TableDrivenQI is too slow
Status: CLOSED, RESOLVED INCOMPLETE
Opened 4 years ago; Closed 4 years ago
Categories: Core :: XPCOM, enhancement
People: Reporter: ehsan, Unassigned
Attachments (2 files)
- 197.51 KB, image/png
- 6.68 KB, patch
See this profile of NS_TableDrivenQI while running Speedometer (sorry about the image; here is the same thing in text form, but I figured the colors are helpful for readability as well.)

        │ nsresult NS_FASTCALL
        │ NS_TableDrivenQI(void* aThis, REFNSIID aIID, void** aInstancePtr,
        │                  const QITableEntry* aEntries)
        │ {
   6.02 │   push %rbp
   3.11 │   mov %rsp,%rbp
   2.49 │   push %r15
   0.83 │   push %r14
   1.24 │   push %rbx
   0.41 │   push %rax
   2.90 │   mov %rdx,%r14
   1.04 │   mov (%rsi),%eax
   6.85 │   mov 0x4(%rsi),%edx
   1.24 │   mov 0x8(%rsi),%ebx
   2.28 │   mov 0xc(%rsi),%r8d
   0.83 │   mov (%rcx),%rsi
        │ do {
  14.11 │   add $0x10,%rcx
   0.83 │   xor %r15d,%r15d
   2.49 │   data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)
        │ _ZNK4nsID6EqualsERKS_():
        │
        │ inline bool Equals(const nsID& aOther) const
        │ {
        │   // Unfortunately memcmp isn't faster than this.
        │   return
        │     (((uint32_t*)&m0)[0] == ((uint32_t*)&aOther.m0)[0]) &&
   3.53 │30: cmp (%rsi),%eax
        │  ↓ jne a8f790 <NS_TableDrivenQI(void*, nsID const&, void**, QITableEntry const*)+0x50>
        │     (((uint32_t*)&m0)[1] == ((uint32_t*)&aOther.m0)[1]) &&
   0.41 │   cmp 0x4(%rsi),%edx
        │  ↓ jne a8f790 <NS_TableDrivenQI(void*, nsID const&, void**, QITableEntry const*)+0x50>
        │     (((uint32_t*)&m0)[2] == ((uint32_t*)&aOther.m0)[2]) &&
   0.41 │   cmp 0x8(%rsi),%ebx
        │  ↓ jne a8f790 <NS_TableDrivenQI(void*, nsID const&, void**, QITableEntry const*)+0x50>
        │ _Z16NS_TableDrivenQIPvRK4nsIDPS_PK12QITableEntry():
        │ if (aIID.Equals(*aEntries->iid)) {
   0.21 │   cmp 0xc(%rsi),%r8d
        │  ↓ je a8f7af <NS_TableDrivenQI(void*, nsID const&, void**, QITableEntry const*)+0x6f>
        │   data16 data16 nopw %cs:0x0(%rax,%rax,1)
        │   *aInstancePtr = r;
        │   return NS_OK;
        │ }
        │
        │ ++aEntries;
        │ } while (aEntries->iid);
  32.99 │50: mov (%rcx),%rsi
   1.87 │   add $0x10,%rcx
   1.24 │   test %rsi,%rsi
        │  ↑ jne a8f770 <NS_TableDrivenQI(void*, nsID const&, void**, QITableEntry const*)+0x30>
   1.04 │   mov $0x80004002,%eax
        │ *aInstancePtr = r;
   2.49 │61: mov %r15,(%r14)
        │
        │ *aInstancePtr = nullptr;
        │ return NS_ERROR_NO_INTERFACE;
        │ }
   0.83 │   add $0x8,%rsp
   1.24 │   pop %rbx
   2.90 │   pop %r14
        │   pop %r15
   1.45 │   pop %rbp
        │  ← retq
        │ reinterpret_cast<char*>(aThis) + aEntries->offset);
        │6f: movslq -0x8(%rcx),%rax
   0.41 │   lea (%rdi,%rax,1),%r15
        │ NS_ADDREF(r);
        │   mov (%rdi,%rax,1),%rax
   1.66 │   mov %r15,%rdi
   0.21 │  → callq *8
   0.41 │   xor %eax,%eax
        │  ↑ jmp a8f7a1 <NS_TableDrivenQI(void*, nsID const&, void**, QITableEntry const*)+0x61>

Inlining this is an obvious first step for speeding this up. Other ideas are welcome. I thought about putting the IIDs and offsets into separate arrays, so that as we scan the IID array we have better cacheline utilization, but I can't think of any C++ trick which would allow us to do that without requiring a change to the NS_INTERFACE_TABLE_ENTRY macro (by basically repeating the list of IIDs twice) without runtime overhead. :-/
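The structure-of-arrays idea above can be sketched roughly as follows. This is a minimal illustration, not the Gecko API: `FakeID`, `SoATable`, and `SoAQI` are hypothetical stand-ins for `nsID`, the QI table, and `NS_TableDrivenQI`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical 16-byte stand-in for nsID.
struct FakeID { uint32_t w[4]; };

// Structure-of-arrays table: the IIDs are packed contiguously, so the scan
// loop touches only IID-holding cache lines; offsets live in a parallel array.
struct SoATable {
  const FakeID* iids;     // iids[i] pairs with offsets[i]
  const int32_t* offsets;
  size_t count;
};

// Scan the packed IID array; on a match, apply the paired offset to aThis.
void* SoAQI(void* aThis, const FakeID& aIID, const SoATable& aTable) {
  for (size_t i = 0; i < aTable.count; ++i) {
    if (std::memcmp(&aTable.iids[i], &aIID, sizeof(FakeID)) == 0) {
      return static_cast<char*>(aThis) + aTable.offsets[i];
    }
  }
  return nullptr;
}
```

The difficulty noted above is purely a code-generation one: producing the two parallel arrays from a single NS_INTERFACE_TABLE_ENTRY-style macro list without writing the IID list twice.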
Comment 1•4 years ago
Could it be dereferencing the pointer aEntries->iid that takes time, not iterating the array?
Reporter
Comment 2•4 years ago
(In reply to Ting-Yu Chou [:ting] from comment #1)
> Could it be dereferencing the pointer aEntries->iid that takes time, not
> iterating the array?

Yes, that is certainly one issue. And once we make the IIDs members of QITableEntry, we can separate the search part of the algorithm to facilitate loop unrolling and vectorization by the optimizer (I chose to use std::find_if for that part.)

The other is that we dereference aEntries in the loop condition, whereas the caller knows the size of the table at compile time and could just pass it in, and we could avoid this cost every iteration too.

I have a patch that does all of these, and the cost of NS_TableDrivenQI on ~150 subtests of Speedometer drops down from about 0.11% of total wall clock time to 0.05%.
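A rough sketch of the shape described here, with the IID embedded in the entry by value and the count passed in by the caller. These are hypothetical simplified types, not the actual patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for nsID and QITableEntry, with the IID embedded
// directly in the entry instead of reached through a pointer.
struct FakeID {
  uint32_t w[4];
  bool operator==(const FakeID& aOther) const {
    return w[0] == aOther.w[0] && w[1] == aOther.w[1] &&
           w[2] == aOther.w[2] && w[3] == aOther.w[3];
  }
};
struct Entry { FakeID iid; int32_t offset; };

// The caller knows the table size at compile time and passes it in, so the
// loop bound is a constant and nothing is dereferenced in the loop condition.
// std::find_if isolates the search so the optimizer can unroll/vectorize it.
void* TableQI(void* aThis, const FakeID& aIID,
              const Entry* aEntries, size_t aCount) {
  const Entry* end = aEntries + aCount;
  const Entry* it = std::find_if(aEntries, end, [&](const Entry& aEntry) {
    return aEntry.iid == aIID;
  });
  return it != end ? static_cast<char*>(aThis) + it->offset : nullptr;
}
```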
Reporter
Comment 3•4 years ago
This patch performs the following optimizations:

* It inlines NS_TableDrivenQI().
* It rewrites the loop over the table to use the size of the table as known in the caller instead of dereferencing memory inside the loop condition.
* It removes one level of pointer chasing to access the IID from the QITableEntry object by embedding the IID directly inside it.
Attachment #8890687 - Flags: review?(nfroyd)
Comment 4•4 years ago
I'm interested to see if there are codesize+general speed costs of this. Table-driven QI was intentionally trading non-inline to reduce codesize and for non-hot paths that had definite speed wins.
Comment 5•4 years ago
(In reply to Benjamin Smedberg [:bsmedberg] from comment #4)
> I'm interested to see if there are codesize+general speed costs of this.
> Table-driven QI was intentionally trading non-inline to reduce codesize and
> for non-hot paths that had definite speed wins.

Yes, I'd be curious, too. I did some quick checking:

* Nightly x86-64 Linux libxul has 83,936 bytes of QI tables across 1,497 tables
* At 16 bytes/entry, that's 5,246 entries
* Note that ~1/3 of those entries are sentinels (!), so we have 3,749 actual entries

So if we were to embed the IID and pass the table size as a constant, entry size would increase to 24 bytes, which means the new tables would take up ~89K of space, which is a surprisingly small increase.

What does the codesize hit from inlining NS_TableDrivenQI some 1500 times look like? I think I'd rather avoid inlining NS_TableDrivenQI if possible...can you rerun the numbers without that change?

A tenth of a half of a percent is a pretty small win in any event.
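The arithmetic in this comment checks out; here is a quick compile-time consistency check (a sketch: it assumes exactly one sentinel entry per table, and that sentinels disappear once the caller passes the table size).

```cpp
#include <cstddef>

// Figures from this comment, verified at compile time.
constexpr size_t kTableBytes   = 83936;  // total QI table bytes in libxul
constexpr size_t kTables       = 1497;   // number of QI tables
constexpr size_t kOldEntrySize = 16;     // bytes per entry today
constexpr size_t kNewEntrySize = 24;     // bytes with the IID embedded

constexpr size_t kEntries = kTableBytes / kOldEntrySize;

// One sentinel per table; sentinels go away once the size is passed in
// instead of scanning for a null IID.
constexpr size_t kRealEntries = kEntries - kTables;
constexpr size_t kNewBytes    = kRealEntries * kNewEntrySize;

static_assert(kEntries == 5246, "entry count");
static_assert(kRealEntries == 3749, "non-sentinel entries");
static_assert(kNewBytes == 89976, "about 89K after embedding IIDs");
```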
Flags: needinfo?(ehsan)
Reporter
Comment 6•4 years ago
(In reply to Nathan Froyd [:froydnj] from comment #5)
> (In reply to Benjamin Smedberg [:bsmedberg] from comment #4)
> > I'm interested to see if there are codesize+general speed costs of this.
> > Table-driven QI was intentionally trading non-inline to reduce codesize and
> > for non-hot paths that had definite speed wins.
>
> Yes, I'd be curious, too. I did some quick checking:
>
> * Nightly x86-64 Linux libxul has 83,936 bytes of QI tables across 1,497
>   tables
> * At 16 bytes/entry, that's 5,246 entries
> * Note that ~1/3 of those entries are sentinels (!), so we have 3,749 actual
>   entries

Wow, a third! Good grief.

> So if we were to embed the IID and pass the table size as a constant, entry
> size would increase to 24 bytes, which means the new tables would take up
> ~89K of space, which is a surprisingly small increase.

Yes. Also note that soon we will be able to shrink this even further with bug 1384727, so let's not panic about code size too much here. I'll get you numbers from try though.

> What does the codesize hit from inlining NS_TableDrivenQI some 1500 times
> look like? I think I'd rather avoid inlining NS_TableDrivenQI if
> possible...can you rerun the numbers without that change?
>
> A tenth of a half of a percent is a pretty small win in any event.

I'll be pushing back on the idea of not inlining pretty hard unless the code size numbers come back with several hundreds of K's or megs of difference, I think. I'm rather surprised you're reasoning about this in terms of a half a percent. QueryInterface(), especially for classes such as CCed objects, is *extremely* hot code, probably among the hottest general-purpose code we have in XPCOM. We should be striving for any amount of performance we can squeeze out of it, and taking some code size hit for it is absolutely the right trade-off.

I just cited numbers from one specific profiling situation that motivated the work being done here, but this is far from the first time NS_TableDrivenQI has shown up pretty hot in profiles. :-)
Flags: needinfo?(ehsan)
Comment 7•4 years ago
Can we just change CCed classes to not use table driven QI? I see about 50 uses of the CC table QI macros, though they may have subclasses that don't implement their own CC participant that also use the table QI.
Reporter
Comment 8•4 years ago
(In reply to Andrew McCreight [:mccr8] from comment #7)
> Can we just change CCed classes to not use table driven QI? I see about 50
> uses of the CC table QI macros, though they may have subclasses that don't
> implement their own CC participant that also use the table QI.

We could, though it's not obvious to me if the map-based QI would result in smaller code size. Also note that the QI cost isn't only coming from the cycle collector call sites; there are also many places in DOM, for example, which do QIs that can easily turn hot, especially when pages call things in a loop.
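For reference, the map-based QI mentioned here would amount to a per-class lookup along these lines. This is a purely hypothetical sketch: `FakeID`, the hash, and the map layout are illustrative, and as noted above it is not obvious this wins on code size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Hypothetical 16-byte IID stand-in.
struct FakeID {
  uint32_t w[4];
  bool operator==(const FakeID& aOther) const {
    return w[0] == aOther.w[0] && w[1] == aOther.w[1] &&
           w[2] == aOther.w[2] && w[3] == aOther.w[3];
  }
};
struct FakeIDHash {
  size_t operator()(const FakeID& aID) const {
    // Cheap mix of the four words; a real hash would do better.
    return (size_t(aID.w[0]) * 31 + aID.w[1]) * 31 + aID.w[2] + aID.w[3];
  }
};

// One map per class: IID -> byte offset of the corresponding subobject.
using QIMap = std::unordered_map<FakeID, int32_t, FakeIDHash>;

void* MapQI(void* aThis, const FakeID& aIID, const QIMap& aMap) {
  auto it = aMap.find(aIID);
  return it != aMap.end() ? static_cast<char*>(aThis) + it->second : nullptr;
}
```

The trade against a linear table scan is the usual one: O(1) lookup versus per-class hash-table storage and initialization code.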
Reporter
Comment 9•4 years ago
Here is some code size data about libxul (APK on Android) size changes as a result of this patch from try:

Platform     Before      After       Delta
Win32 PGO    65286144    65462272    176128
Win64 PGO    77153280    77317120    163840  (how is this smaller than Win32?!)
Linux32 PGO  105665564   106193244   527680
Linux64 PGO  115220400   115745848   525448
MacOSX       155031860   155677204   645344
Android      36348127    36403714    55587

Thoughts?
Flags: needinfo?(nfroyd)
Reporter
Comment 10•4 years ago
(Try pushes, before: https://treeherder.mozilla.org/#/jobs?repo=try&revision=77724711a6cc9cf63d14f09da596e0d525688ca2, after: https://treeherder.mozilla.org/#/jobs?repo=try&revision=83bfa691a5b08d0ccb521f99e7296ed751cbb95a)
Comment 11•4 years ago
(In reply to :Ehsan Akhgari (needinfo please, extremely long backlog) from comment #6)
> (In reply to Nathan Froyd [:froydnj] from comment #5)
> > What does the codesize hit from inlining NS_TableDrivenQI some 1500 times
> > look like? I think I'd rather avoid inlining NS_TableDrivenQI if
> > possible...can you rerun the numbers without that change?
> >
> > A tenth of a half of a percent is a pretty small win in any event.
>
> I'll be pushing back on the idea of not inlining pretty hard unless the
> code size numbers come back with several hundreds of K's or megs of
> difference, I think. I'm rather surprised you're reasoning about this in
> terms of a half a percent. QueryInterface(), especially for classes such as
> CCed objects, is *extremely* hot code, probably among the hottest general
> purpose code we have in XPCOM. We should be striving for any amount of
> performance we can squeeze out of it, and taking some code size hit for it
> is absolutely the right trade-off. I just cited numbers from one specific
> profiling situation that motivated the work being done here, but this is far
> from the first time NS_TableDrivenQI is showing up pretty hot in profiles.
> :-)

I can agree that speeding up QI is a worthy goal! It is not clear to me that the performance difference you're observing is entirely due to inlining, as you have three distinct ideas going on in your patch. I'm asking whether we can get the same performance benefits *without* the inlining.

(In reply to :Ehsan Akhgari (needinfo please, extremely long backlog) from comment #9)
> Here is some code size data about libxul (APK on Android) size changes as a
> result of this patch from try:
>
> Platform     Before      After       Delta
>
> Win32 PGO    65286144    65462272    176128
> Win64 PGO    77153280    77317120    163840  (how is this smaller than
> Win32?!)

The relative differences in entry sizes, I bet. We had ~5000 entries before, so that's 40K on win32 and 80K on win64. We have ~3200 entries now, but our entries are now 20 bytes on win32 and 24 (I think) on win64, so that's 64K on win32 and ~78K on win64. That's a swing of 24K on win32, which is at least in the ballpark for the above.

> Linux32 PGO  105665564   106193244   527680
> Linux64 PGO  115220400   115745848   525448
> MacOSX       155031860   155677204   645344

These changes are wild! Half a megabyte of code?
Flags: needinfo?(nfroyd)
Reporter
Comment 12•4 years ago
(In reply to Nathan Froyd [:froydnj] from comment #11)
> (In reply to :Ehsan Akhgari (needinfo please, extremely long backlog) from
> comment #6)
> > (In reply to Nathan Froyd [:froydnj] from comment #5)
> > > What does the codesize hit from inlining NS_TableDrivenQI some 1500 times
> > > look like? I think I'd rather avoid inlining NS_TableDrivenQI if
> > > possible...can you rerun the numbers without that change?
> > >
> > > A tenth of a half of a percent is a pretty small win in any event.
> >
> > I'll be pushing back on the idea of not inlining pretty hard unless the
> > code size numbers come back with several hundreds of K's or megs of
> > difference, I think. I'm rather surprised you're reasoning about this in
> > terms of a half a percent. QueryInterface(), especially for classes such as
> > CCed objects, is *extremely* hot code, probably among the hottest general
> > purpose code we have in XPCOM. We should be striving for any amount of
> > performance we can squeeze out of it, and taking some code size hit for it
> > is absolutely the right trade-off. I just cited numbers from one specific
> > profiling situation that motivated the work being done here, but this is far
> > from the first time NS_TableDrivenQI is showing up pretty hot in profiles.
> > :-)
>
> I can agree that speeding up QI is a worthy goal! It is not clear to me
> that the performance difference you're observing is entirely due to
> inlining, as you have three distinct ideas going on in your patch. I'm
> asking whether we can get the same performance benefits *without* the
> inlining.

Right, so the issue is that most of the gains here are achieved with inlining; the other achievements are quite minimal. :-(

> (In reply to :Ehsan Akhgari (needinfo please, extremely long backlog) from
> comment #9)
> > Here is some code size data about libxul (APK on Android) size changes as a
> > result of this patch from try:
> >
> > Platform     Before      After       Delta
> >
> > Win32 PGO    65286144    65462272    176128
> > Win64 PGO    77153280    77317120    163840  (how is this smaller than
> > Win32?!)
>
> The relative differences in entry sizes, I bet. We had ~5000 entries
> before, so that's 40K on win32 and 80K on win64. We have ~3200 entries now,
> but our entries are now 20 bytes on win32 and 24 (I think) on win64, so
> that's 64K on win32 and ~78K on win64. That's a swing of 24K on win32,
> which is at least in the ballpark for the above.

OK, that makes sense, thanks for the analysis.

> > Linux32 PGO  105665564   106193244   527680
> > Linux64 PGO  115220400   115745848   525448
> > MacOSX       155031860   155677204   645344
>
> These changes are wild! Half a megabyte of code?

Yeah. :-/ I'm not too happy about this either. I suppose the issue here is that once we start inlining this code, the compiler can start to go wild, start unlooping etc. and increase code size massively. I don't really know of a way to "control" this behavior, sadly.

But that being said, I mostly care about Windows here. Are you OK with the code size hit on Win32/Win64? What about Android? If yes, perhaps we can think of a compromise, that is, inline on Windows and Android and leave the code non-inlined on Linux and OSX? (Please wash your eyes after reading this!)
Reporter
Comment 13•4 years ago
(unlooping is Ehsanese for unrolling!)
Reporter
Updated•4 years ago
Flags: needinfo?(nfroyd)
Comment 14•4 years ago
OK, so I took a look at this. Below is some analysis.

TL;DR: I *really* don't think this is worth the codesize hit, the improvement is very small, and I'm even skeptical that the improvement is really there.

# This is the disassembly of nsMappedAttributes::QueryInterface.
# nsMappedAttributes::QueryInterface is the simplest possible table-driven
# QueryInterface implementation:
#
#   NS_IMPL_QUERY_INTERFACE(nsMappedAttributes,
#                           nsIStyleRule)
#
# So this is the simplest possible testcase for how our optimization is doing.
#
# Function prologue stuff for Win64.
181786b48: 48 89 5c 24 08          mov %rbx,0x8(%rsp)
181786b4d: 48 89 74 24 10          mov %rsi,0x10(%rsp)
181786b52: 57                      push %rdi
181786b53: 48 83 ec 30             sub $0x30,%rsp

# Uh, OK, where's the inlined code?
#
# Why are we accessing a thread-local variable (%gs is the thread information
# block pointer on Win64) and poking around in it?
181786b57: 65 48 8b 04 25 58 00    mov %gs:0x58,%rax
181786b5e: 00 00
181786b60: 48 8b f1                mov %rcx,%rsi
181786b63: 44 8b 0d 7a 4d 4c 02    mov 0x24c4d7a(%rip),%r9d  # 0x183c4b8e4
181786b6a: 49 8b d8                mov %r8,%rbx
181786b6d: b9 80 00 00 00          mov $0x80,%ecx
181786b72: 48 8b fa                mov %rdx,%rdi
181786b75: 4e 8b 0c c8             mov (%rax,%r9,8),%r9
181786b79: 42 8b 04 09             mov (%rcx,%r9,1),%eax

# OK, so some value in thread-local storage is interesting, and we're going
# to skip some work if so.
181786b7d: 39 05 55 34 58 02       cmp %eax,0x2583455(%rip)  # 0x183d09fd8
181786b83: 7e 4d                   jle 0x181786bd2

# Aha!
#
# This is a call to Init_thread_header. Googling indicates that calls to
# Init_thread_header are involved in thread-safe initialization of
# function-level statics. See:
#
# https://github.com/Microsoft/clang/commit/4ade6cab7cfdc694ae50668a4e7ba2169d593c29#diff-19e2b34bb87a61d6bbad7f9e2c3f3cf0R2200
#
# for details. Going from the above comment, we must have been testing
# _Init_thread_epoch and determining what to do next.
#
# 0x180e4d88c is _Init_thread_header; 0x180e4d82c is _Init_thread_footer.
181786b85: 48 8d 0d 4c 34 58 02    lea 0x258344c(%rip),%rcx  # 0x183d09fd8
181786b8c: e8 fb 6c 6c ff          callq 0x180e4d88c

# Ah, here's the call to see if TSS == -1. If it's not, we can skip through.
181786b91: 83 3d 40 34 58 02 ff    cmpl $0xffffffff,0x2583440(%rip)  # 0x183d09fd8
181786b98: 75 38                   jne 0x181786bd2

# We are initializing the static table for NS_TableDrivenQI. Before we tried
# inlining NS_TableDrivenQI, all of this would have been done by the compiler.
# Instead, we now have a bunch of SSE code to initialize the table for us.
181786b9a: 0f 10 05 9f 35 ef 01    movups 0x1ef359f(%rip),%xmm0  # 0x18367a140
181786ba1: 83 25 40 9c 57 02 00    andl $0x0,0x2579c40(%rip)  # 0x183d007e8
181786ba8: 48 8d 0d 29 34 58 02    lea 0x2583429(%rip),%rcx  # 0x183d09fd8
181786baf: 83 25 46 9c 57 02 00    andl $0x0,0x2579c46(%rip)  # 0x183d007fc
181786bb6: f3 0f 7f 05 1a 9c 57    movdqu %xmm0,0x2579c1a(%rip)  # 0x183d007d8
181786bbd: 02
181786bbe: 0f 10 05 73 ab 85 01    movups 0x185ab73(%rip),%xmm0  # 0x182fe1738
181786bc5: f3 0f 7f 05 1f 9c 57    movdqu %xmm0,0x2579c1f(%rip)  # 0x183d007ec
181786bcc: 02

# This is a call to _Init_thread_footer, indicating our table initialization
# is complete.
181786bcd: e8 5a 6c 6c ff          callq 0x180e4d82c

# After all that work, we can now call a *non-inlined* NS_TableDrivenQI.
181786bd2: 4c 8d 0d ff 9b 57 02    lea 0x2579bff(%rip),%r9  # 0x183d007d8
181786bd9: 48 c7 44 24 20 02 00    movq $0x2,0x20(%rsp)
181786be0: 00 00
181786be2: 4c 8b c3                mov %rbx,%r8
181786be5: 48 8b d7                mov %rdi,%rdx
181786be8: 48 8b ce                mov %rsi,%rcx

# 0x1807e7c1c is NS_TableDrivenQI. Disassembling libxul, we can find that there
# are 1483 calls to NS_TableDrivenQI. Given that we have ~1500 calls to said
# function (see comment 5), we have inlined it at a grand total of...15
# callsites. And we have a whole bunch of useless code (see above). This
# optimization does not look like it pays off.
181786beb: e8 2c 10 06 ff          callq 0x1807e7c1c
181786bf0: 48 8b 5c 24 40          mov 0x40(%rsp),%rbx
181786bf5: 48 8b 74 24 48          mov 0x48(%rsp),%rsi
181786bfa: 48 83 c4 30             add $0x30,%rsp
181786bfe: 5f                      pop %rdi
181786bff: c3                      retq
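The guarded initialization visible in the disassembly above is what C++11 "magic statics" produce for a function-local static. A minimal illustration (the table contents are hypothetical, and this is not the actual generated code):

```cpp
#include <cassert>
#include <cstdint>

struct Entry { const void* iid; int32_t offset; };

// A function-local static table is initialized lazily and thread-safely:
// the compiler emits a guard check on every call, and on MSVC first-time
// initialization is routed through _Init_thread_header/_Init_thread_footer,
// which is the pattern seen in the disassembly.
const Entry* GetTable() {
  static const Entry sTable[] = {  // guarded lazy initialization
    { nullptr, 0 },
    { nullptr, 8 },
  };
  return sTable;
}
```

By contrast, a table that the compiler can constant-initialize (for example a constexpr array at namespace scope) is emitted as read-only data with no runtime guard at all, which is what the non-inlined version effectively got for free.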
Flags: needinfo?(nfroyd)
Comment 15•4 years ago
Comment on attachment 8890687 [details] [diff] [review]
Optimize NS_TableDrivenQI()

If comment 14 doesn't convince you, I'm not sure what else to tell you. :)
Attachment #8890687 - Flags: review?(nfroyd)
Comment 16•4 years ago
Also note that comment 14's disassembly was from a PGO build, if that makes any difference.
Reporter
Comment 17•4 years ago
Comment 14 is extremely convincing, thanks for digging into this! Are you still interested in the rest of the patch?
Comment 18•4 years ago
Comment 12 suggests that the inlining was the significant change, which we've seen doesn't actually do much...and embedding the IID is what gave the huge codesize increase. So that leaves us with the known-bounds loop rather than looking for zero, right? Or were you thinking about doing the known-bounds loop + the inlining change and seeing if that improved any (because presumably the codesize wouldn't go up very much now that we know the compiler isn't really doing inlining?). If you want to reprofile that and provide some numbers, we can talk!
Reporter
Comment 19•4 years ago
(In reply to Nathan Froyd [:froydnj] from comment #18)
> Comment 12 suggests that the inlining was the significant change, which
> we've seen doesn't actually do much...and embedding the IID is what gave the
> huge codesize increase. So that leaves us with the known-bounds loop rather
> than looking for zero, right?

That's what I meant, yes.

> Or were you thinking about doing the
> known-bounds loop + the inlining change and seeing if that improved any
> (because presumably the codesize wouldn't go up very much now that we know
> the compiler isn't really doing inlining?). If you want to reprofile that
> and provide some numbers, we can talk!

Hmm, not quite sure what you mean. I've given up squeezing any performance out of this bug, really. At this point I think the only thing left on the table for us to pick up is shrinking the size of the IID tables by removing the sentinel entries, unless I'm missing something. Not sure what kind of profiling you had in mind... something different to what I had done before? Profiling this code without creating a synthesized microbenchmark is hard.
No longer blocks: Speedometer_V2
Reporter
Comment 20•4 years ago
(And FWIW, since it no longer will help with Speedometer, I'd also be happy to leave it as is and look for performance gains elsewhere.)
Comment 21•4 years ago
(In reply to Out sick from comment #20)
> (And FWIW, since it no longer will help with Speedometer, I'd also be happy
> to leave it as is and look for performance gains elsewhere.)

OK, let's drop this, especially since we'll ideally be replacing the IID-driven QI with something else in the 58/59 timeframe.
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → INCOMPLETE