Closed Bug 1384798 Opened 2 years ago Closed 2 years ago

NS_TableDrivenQI is too slow

Categories

(Core :: XPCOM, enhancement)

enhancement
Not set

Tracking

()

RESOLVED INCOMPLETE

People

(Reporter: ehsan, Unassigned)

Details

Attachments

(2 files)

See this profile of NS_TableDrivenQI while running Speedometer (sorry about the image, here is the same thing in text form, but I figured the colors are helpful in readability as well.)

       │    nsresult NS_FASTCALL                                                                                                                                  ▒
       │    NS_TableDrivenQI(void* aThis, REFNSIID aIID, void** aInstancePtr,                                                                                     ▒
       │                     const QITableEntry* aEntries)                                                                                                        ▒
       │    {                                                                                                                                                     ▒
  6.02 │      push   %rbp                                                                                                                                         ▒
  3.11 │      mov    %rsp,%rbp                                                                                                                                    ▒
  2.49 │      push   %r15                                                                                                                                         ◆
  0.83 │      push   %r14                                                                                                                                         ▒
  1.24 │      push   %rbx                                                                                                                                         ▒
  0.41 │      push   %rax                                                                                                                                         ▒
  2.90 │      mov    %rdx,%r14                                                                                                                                    ▒
  1.04 │      mov    (%rsi),%eax                                                                                                                                  ▒
  6.85 │      mov    0x4(%rsi),%edx                                                                                                                               ▒
  1.24 │      mov    0x8(%rsi),%ebx                                                                                                                               ▒
  2.28 │      mov    0xc(%rsi),%r8d                                                                                                                               ▒
  0.83 │      mov    (%rcx),%rsi                                                                                                                                  ▒
       │      do {                                                                                                                                                ▒
 14.11 │      add    $0x10,%rcx                                                                                                                                   ▒
  0.83 │      xor    %r15d,%r15d                                                                                                                                  ▒
  2.49 │      data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)                                                                                                      ▒
       │    _ZNK4nsID6EqualsERKS_():                                                                                                                              ▒
       │                                                                                                                                                          ▒
       │      inline bool Equals(const nsID& aOther) const                                                                                                        ▒
       │      {                                                                                                                                                   ▒
       │        // Unfortunately memcmp isn't faster than this.                                                                                                   ▒
       │        return                                                                                                                                            ▒
       │          (((uint32_t*)&m0)[0] == ((uint32_t*)&aOther.m0)[0]) &&                                                                                          ▒
  3.53 │30:   cmp    (%rsi),%eax                                                                                                                                  ▒
       │    ↓ jne    a8f790 <NS_TableDrivenQI(void*, nsID const&, void**, QITableEntry const*)+0x50>                                                              ▒
       │          (((uint32_t*)&m0)[1] == ((uint32_t*)&aOther.m0)[1]) &&                                                                                          ▒
  0.41 │      cmp    0x4(%rsi),%edx                                                                                                                               ▒
       │    ↓ jne    a8f790 <NS_TableDrivenQI(void*, nsID const&, void**, QITableEntry const*)+0x50>                                                              ▒
       │          (((uint32_t*)&m0)[2] == ((uint32_t*)&aOther.m0)[2]) &&                                                                                          ▒
  0.41 │      cmp    0x8(%rsi),%ebx                                                                                                                               ▒
       │    ↓ jne    a8f790 <NS_TableDrivenQI(void*, nsID const&, void**, QITableEntry const*)+0x50>                                                              ▒
       │    _Z16NS_TableDrivenQIPvRK4nsIDPS_PK12QITableEntry():                                                                                                   ▒
       │        if (aIID.Equals(*aEntries->iid)) {                                                                                                                ▒
  0.21 │      cmp    0xc(%rsi),%r8d                                                                                                                               ▒
       │    ↓ je     a8f7af <NS_TableDrivenQI(void*, nsID const&, void**, QITableEntry const*)+0x6f>                                                              ▒
       │      data16 data16 nopw %cs:0x0(%rax,%rax,1)                                                                                                             ▒
       │          *aInstancePtr = r;                                                                                                                              ▒
       │          return NS_OK;                                                                                                                                   ▒
       │        }                                                                                                                                                 ▒
       │                                                                                                                                                          ▒
       │        ++aEntries;                                                                                                                                       ▒
       │      } while (aEntries->iid);                                                                                                                            ▒
 32.99 │50:   mov    (%rcx),%rsi                                                                                                                                  ▒
  1.87 │      add    $0x10,%rcx                                                                                                                                   ▒
  1.24 │      test   %rsi,%rsi                                                                                                                                    ▒
       │    ↑ jne    a8f770 <NS_TableDrivenQI(void*, nsID const&, void**, QITableEntry const*)+0x30>                                                              ▒
  1.04 │      mov    $0x80004002,%eax                                                                                                                             ▒
       │          *aInstancePtr = r;                                                                                                                              ▒
  2.49 │61:   mov    %r15,(%r14)                                                                                                                                  ▒
       │                                                                                                                                                          ▒
       │      *aInstancePtr = nullptr;                                                                                                                            ▒
       │      return NS_ERROR_NO_INTERFACE;                                                                                                                       ▒
       │    }                                                                                                                                                     ▒
  0.83 │      add    $0x8,%rsp                                                                                                                                    ▒
  1.24 │      pop    %rbx                                                                                                                                         ▒
  2.90 │      pop    %r14                                                                                                                                         ▒
       │      pop    %r15                                                                                                                                         ▒
  1.45 │      pop    %rbp                                                                                                                                         ▒
       │    ← retq                                                                                                                                                ▒
       │            reinterpret_cast<char*>(aThis) + aEntries->offset);                                                                                           ▒
       │6f:   movslq -0x8(%rcx),%rax                                                                                                                              ▒
  0.41 │      lea    (%rdi,%rax,1),%r15                                                                                                                           ▒
       │          NS_ADDREF(r);                                                                                                                                   ▒
       │      mov    (%rdi,%rax,1),%rax                                                                                                                           ▒
  1.66 │      mov    %r15,%rdi                                                                                                                                    ▒
  0.21 │    → callq  *8                                                                                                                                           ▒
  0.41 │      xor    %eax,%eax                                                                                                                                    ▒
       │    ↑ jmp    a8f7a1 <NS_TableDrivenQI(void*, nsID const&, void**, QITableEntry const*)+0x61>                                                              ▒

Inlining this is an obvious first step for speeding this up.  Other ideas are welcome.

I thought about putting the IIDs and offsets into separate arrays, so that as we scan the IID arrays we have better cacheline utilization, but I can't think of any C++ trick which would allow us to do that without requiring a change to the NS_INTERFACE_TABLE_ENTRY macro (by basically repeating the list of IIDs twice) without runtime overhead.  :-/
Could it be dereferencing the pointer aEntries->iid that takes time, not iterating the array?
(In reply to Ting-Yu Chou [:ting] from comment #1)
> Could it be dereferencing the pointer aEntries->iid that takes time, not
> iterating the array?

Yes, that is certainly one issue.  And once we make the IIDs members of QITableEntry, we can separate the search part of the algorithm to facilitate loop unrolling and vectorization by the optimizer (I chose to use std::find_if for that part.)  The other is that we dereference aEntries in the loop condition, whereas the caller knows the size of the table at compile time and could just pass it in and we could avoid this cost every iteration too.

I have a patch that does all of these, and the cost of NS_TableDrivenQI on ~150 subtests of Speedometer drops down from about 0.11% of total wall clock time to 0.05%.
This patch performs the following optimizations:

 * It inlines NS_TableDrivenQI().
 * It rewrites the loop over the table to use the size of the table as known
   in the caller instead of dereferencing memory inside the loop condition.
 * It removes one level of pointer chasing to access the IID from the
   QITableEntry object by embedding the IID directly inside it.
Attachment #8890687 - Flags: review?(nfroyd)
I'm interested to see if there are codesize+general speed costs of this. Table-driven QI was intentionally trading non-inline to reduce codesize and for non-hot paths that had definite speed wins.
(In reply to Benjamin Smedberg [:bsmedberg] from comment #4)
> I'm interested to see if there are codesize+general speed costs of this.
> Table-driven QI was intentionally trading non-inline to reduce codesize and
> for non-hot paths that had definite speed wins.

Yes, I'd be curious, too.  I did some quick checking:

* Nightly x86-64 Linux libxul has 83,936 bytes of QI tables across 1,497 tables
* At 16 bytes/entry, that's 5,246 entries
* Note that ~1/3 of those entries are sentinels (!), so we have 3,749 actual entries

So if we were to embed the IID and pass the table size as a constant, entry size would increase to 24 bytes, which means the new tables would take up ~89K of space, which is a surprisingly small increase.

What does the codesize hit from inlining NS_TableDrivenQI some 1500 times look like?  I think I'd rather avoid inlining NS_TableDrivenQI if possible...can you rerun the numbers without that change?

A tenth of a half of a percent is a pretty small win in any event.
Flags: needinfo?(ehsan)
(In reply to Nathan Froyd [:froydnj] from comment #5)
> (In reply to Benjamin Smedberg [:bsmedberg] from comment #4)
> > I'm interested to see if there are codesize+general speed costs of this.
> > Table-driven QI was intentionally trading non-inline to reduce codesize and
> > for non-hot paths that had definite speed wins.
> 
> Yes, I'd be curious, too.  I did some quick checking:
> 
> * Nightly x86-64 Linux libxul has 83,936 bytes of QI tables across 1,497
> tables
> * At 16 bytes/entry, that's 5,246 entries
> * Note that ~1/3 of those entries are sentinels (!), so we have 3,749 actual
> entries

Wow, a third!  Good grief.

> So if we were to embed the IID and pass the table size as a constant, entry
> size would increase to 24 bytes, which means the new tables would take up
> ~89K of space, which is a surprisingly small increase.

Yes.  Also note that soon we will be able to shrink this even further with bug 1384727, so let's not panic about code size too much here.  I'll get you numbers from try though.

> What does the codesize hit from inlining NS_TableDrivenQI some 1500 times
> look like?  I think I'd rather avoid inlining NS_TableDrivenQI if
> possible...can you rerun the numbers without that change?
> 
> A tenth of a half of a percent is a pretty small win in any event.

I'll be pushing back on the idea of not inlining pretty hard unless if the code size numbers come back with several hundreds of K's or megs of difference, I think.  I'm rather surprised you're reasoning about this in terms of a half a percent.  QueryInterface(), especially for classes such as CCed objects, is *extremely* hot code, probably among the hottest general purpose code we have in XPCOM.  We should be striving for any amount of performance we can squeeze out of it, and taking some code size hit for it is absolutely the right trade-off.  I just cited numbers from one specific profiling situation that motivated the work being done here, but this is far from the first time NS_TableDrivenQI is showing up pretty hot in profiles.  :-)
Flags: needinfo?(ehsan)
Can we just change CCed classes to not use table driven QI? I see about 50 uses of the CC table QI macros, though they may have subclasses that don't implement their own CC participant that also use the table QI.
(In reply to Andrew McCreight [:mccr8] from comment #7)
> Can we just change CCed classes to not use table driven QI? I see about 50
> uses of the CC table QI macros, though they may have subclasses that don't
> implement their own CC participant that also use the table QI.

We could, thought it's not obvious to me if the map based QI would result in smaller code size.  Also note that the QI cost isn't only coming from the cycle collector call sites, there are also many places in DOM for example which do QIs that can easily turn hot, especially when pages call things in a loop.
Here is some code size data about libxul (APK on Android) size changes as a result of this patch from try:

Platform      Before       After       Delta

Win32 PGO   65286144      65462272     176128
Win64 PGO   77153280      77317120     163840  (how is this smaller than Win32?!)
Linux32 PGO 105665564     106193244    527680
Linux64 PGO 115220400     115745848    525448
MacOSX      155031860     155677204    645344
Android     36348127      36403714     55587

Thoughts?
Flags: needinfo?(nfroyd)
(In reply to :Ehsan Akhgari (needinfo please, extremely long backlog) from comment #6)
> (In reply to Nathan Froyd [:froydnj] from comment #5)
> > What does the codesize hit from inlining NS_TableDrivenQI some 1500 times
> > look like?  I think I'd rather avoid inlining NS_TableDrivenQI if
> > possible...can you rerun the numbers without that change?
> > 
> > A tenth of a half of a percent is a pretty small win in any event.
> 
> I'll be pushing back on the idea of not inlining pretty hard unless if the
> code size numbers come back with several hundreds of K's or megs of
> difference, I think.  I'm rather surprised you're reasoning about this in
> terms of a half a percent.  QueryInterface(), especially for classes such as
> CCed objects, is *extremely* hot code, probably among the hottest general
> purpose code we have in XPCOM.  We should be striving for any amount of
> performance we can squeeze out of it, and taking some code size hit for it
> is absolutely the right trade-off.  I just cited numbers from one specific
> profiling situation that motivated the work being done here, but this is far
> from the first time NS_TableDrivenQI is showing up pretty hot in profiles. 
> :-)

I can agree that speeding up QI is a worthy goal!  It is not clear to me that the performance difference you're observing is entirely due to inlining, as you have three distinct ideas going on in your patch.  I'm asking whether we can get the same performance benefits *without* the inlining.

(In reply to :Ehsan Akhgari (needinfo please, extremely long backlog) from comment #9)
> Here is some code size data about libxul (APK on Android) size changes as a
> result of this patch from try:
> 
> Platform      Before       After       Delta
> 
> Win32 PGO   65286144      65462272     176128
> Win64 PGO   77153280      77317120     163840  (how is this smaller than
> Win32?!)

The relative differences in entry sizes, I bet.  We had ~5000 entries before, so that's 40K on win32 and 80K on win64.  We have ~3200 entries now, but our entries are now 20 bytes on win32 and 24 (I think) on win64, so that's 64K on win32 and ~78K on win64.  That's a swing of 24K on win32, which is at least in the ballpark for the above.

> Linux32 PGO 105665564     106193244    527680
> Linux64 PGO 115220400     115745848    525448
> MacOSX      155031860     155677204    645344

These changes are wild!  Half a megabyte of code?
Flags: needinfo?(nfroyd)
(In reply to Nathan Froyd [:froydnj] from comment #11)
> (In reply to :Ehsan Akhgari (needinfo please, extremely long backlog) from
> comment #6)
> > (In reply to Nathan Froyd [:froydnj] from comment #5)
> > > What does the codesize hit from inlining NS_TableDrivenQI some 1500 times
> > > look like?  I think I'd rather avoid inlining NS_TableDrivenQI if
> > > possible...can you rerun the numbers without that change?
> > > 
> > > A tenth of a half of a percent is a pretty small win in any event.
> > 
> > I'll be pushing back on the idea of not inlining pretty hard unless if the
> > code size numbers come back with several hundreds of K's or megs of
> > difference, I think.  I'm rather surprised you're reasoning about this in
> > terms of a half a percent.  QueryInterface(), especially for classes such as
> > CCed objects, is *extremely* hot code, probably among the hottest general
> > purpose code we have in XPCOM.  We should be striving for any amount of
> > performance we can squeeze out of it, and taking some code size hit for it
> > is absolutely the right trade-off.  I just cited numbers from one specific
> > profiling situation that motivated the work being done here, but this is far
> > from the first time NS_TableDrivenQI is showing up pretty hot in profiles. 
> > :-)
> 
> I can agree that speeding up QI is a worthy goal!  It is not clear to me
> that the performance difference you're observing is entirely due to
> inlining, as you have three distinct ideas going on in your patch.  I'm
> asking whether we can get the same performance benefits *without* the
> inlining.

Right, so the issue is that most of the gains here are achieved with inlining, the other achievements are quite minimal.  :-(

> (In reply to :Ehsan Akhgari (needinfo please, extremely long backlog) from
> comment #9)
> > Here is some code size data about libxul (APK on Android) size changes as a
> > result of this patch from try:
> > 
> > Platform      Before       After       Delta
> > 
> > Win32 PGO   65286144      65462272     176128
> > Win64 PGO   77153280      77317120     163840  (how is this smaller than
> > Win32?!)
> 
> The relative differences in entry sizes, I bet.  We had ~5000 entries
> before, so that's 40K on win32 and 80K on win64.  We have ~3200 entries now,
> but our entries are now 20 bytes on win32 and 24 (I think) on win64, so
> that's 64K on win32 and ~78K on win64.  That's a swing of 24K on win32,
> which is at least in the ballpark for the above.

OK that makes sense, thanks for the analysis.

> > Linux32 PGO 105665564     106193244    527680
> > Linux64 PGO 115220400     115745848    525448
> > MacOSX      155031860     155677204    645344
> 
> These changes are wild!  Half a megabyte of code?

Yeah.  :-/  I'm not too happy about this either.  I suppose the issue here is that once we start inlining this code, the compiler can start to go wild, start unlooping etc and start to increase code size massively.  I don't really know of a way to "control" this behavior sadly.

But that being said, I mostly care about Windows here.  Are you OK with the code size hit on Win32/Win64?  What about Android?  If yes, perhaps we can think of a compromise, that is, inline on Windows and Android and leave the code non-inlined on Linux and OSX? (Please wash your eyes after reading this!)
(unlooping is Ehsanese for unrolling!)
Flags: needinfo?(nfroyd)
OK, so I took a look at this.  Below is some analysis.  TL; DR: I *really* don't think this is worth the codesize hit, the improvement is very small, and I'm even skeptical that the improvement is really there.

# This is the dissassembly of nsMappedAttributes::QueryInterface.
# nsMappedAttributes::QueryInterface is the simplest possible table-driven
# QueryInterface implementation:
#
# NS_IMPL_QUERY_INTERFACE(nsMappedAttributes,
#                         nsIStyleRule)
#
# So this is the simplest possible testcase for how our optimization is doing.
#
# Function prologue stuff for Win64.
   181786b48:	48 89 5c 24 08       	mov    %rbx,0x8(%rsp)
   181786b4d:	48 89 74 24 10       	mov    %rsi,0x10(%rsp)
   181786b52:	57                   	push   %rdi
   181786b53:	48 83 ec 30          	sub    $0x30,%rsp

# Uh, OK, where's the inlined code?
#
# Why are we accessing a thread-local variable (%gs is the thread information block
# pointer on Win64) and poking around in it?
   181786b57:	65 48 8b 04 25 58 00 	mov    %gs:0x58,%rax
   181786b5e:	00 00 
   181786b60:	48 8b f1             	mov    %rcx,%rsi
   181786b63:	44 8b 0d 7a 4d 4c 02 	mov    0x24c4d7a(%rip),%r9d        # 0x183c4b8e4
   181786b6a:	49 8b d8             	mov    %r8,%rbx
   181786b6d:	b9 80 00 00 00       	mov    $0x80,%ecx
   181786b72:	48 8b fa             	mov    %rdx,%rdi
   181786b75:	4e 8b 0c c8          	mov    (%rax,%r9,8),%r9
   181786b79:	42 8b 04 09          	mov    (%rcx,%r9,1),%eax

# OK, so some value in thread-local storage is interesting, and we're going
# to skip some work if so.
   181786b7d:	39 05 55 34 58 02    	cmp    %eax,0x2583455(%rip)        # 0x183d09fd8
   181786b83:	7e 4d                	jle    0x181786bd2

# Aha!
#
# This is a call to Init_thread_header.  Googling indicates that calls to
# Init_thread_header are involved in thread-safe initialization of
# function-level statics.  See:
#
# https://github.com/Microsoft/clang/commit/4ade6cab7cfdc694ae50668a4e7ba2169d593c29#diff-19e2b34bb87a61d6bbad7f9e2c3f3cf0R2200
#
# for details.  Going from the above comment, we must have been testing
# _Init_thread_epoch and determining what to do next.
#
# 0x180e4d88c is _Init_thread_header; 0x180e4d82c is _Init_thread_footer.
   181786b85:	48 8d 0d 4c 34 58 02 	lea    0x258344c(%rip),%rcx        # 0x183d09fd8
   181786b8c:	e8 fb 6c 6c ff       	callq  0x180e4d88c

# Ah, here's the call to see if TSS == -1.  If it's not, we can skip through.
   181786b91:	83 3d 40 34 58 02 ff 	cmpl   $0xffffffff,0x2583440(%rip)        # 0x183d09fd8
   181786b98:	75 38                	jne    0x181786bd2

# We are intializating the static table for NS_TableDrivenQI.  Before we tried
# inlining NS_TableDrivenQI, all of this would have been done by the compiler.
# Instead, we now have a bunch of SSE code to initialize the table for us.
   181786b9a:	0f 10 05 9f 35 ef 01 	movups 0x1ef359f(%rip),%xmm0        # 0x18367a140
   181786ba1:	83 25 40 9c 57 02 00 	andl   $0x0,0x2579c40(%rip)        # 0x183d007e8
   181786ba8:	48 8d 0d 29 34 58 02 	lea    0x2583429(%rip),%rcx        # 0x183d09fd8
   181786baf:	83 25 46 9c 57 02 00 	andl   $0x0,0x2579c46(%rip)        # 0x183d007fc
   181786bb6:	f3 0f 7f 05 1a 9c 57 	movdqu %xmm0,0x2579c1a(%rip)        # 0x183d007d8
   181786bbd:	02 
   181786bbe:	0f 10 05 73 ab 85 01 	movups 0x185ab73(%rip),%xmm0        # 0x182fe1738
   181786bc5:	f3 0f 7f 05 1f 9c 57 	movdqu %xmm0,0x2579c1f(%rip)        # 0x183d007ec
   181786bcc:	02 

# This is a call to _Init_thread_footer, indicating our table initialization
# is complete.
   181786bcd:	e8 5a 6c 6c ff       	callq  0x180e4d82c

# After all that work, we can now call a *non-inlined* NS_TableDrivenQI.
   181786bd2:	4c 8d 0d ff 9b 57 02 	lea    0x2579bff(%rip),%r9        # 0x183d007d8
   181786bd9:	48 c7 44 24 20 02 00 	movq   $0x2,0x20(%rsp)
   181786be0:	00 00 
   181786be2:	4c 8b c3             	mov    %rbx,%r8
   181786be5:	48 8b d7             	mov    %rdi,%rdx
   181786be8:	48 8b ce             	mov    %rsi,%rcx

# 0x1807e7c1c is NS_TableDrivenQI.  Disassembling libxul, we can find that there
# are 1483 calls to NS_TableDrivenQI.  Given that we have ~1500 calls to said
# function (see comment 5), we have inlined in a grand total of...15 callsites.
# And we have a whole bunch of useless code (see above).  This optimization does
# not look like it pays off.
   181786beb:	e8 2c 10 06 ff       	callq  0x1807e7c1c
   181786bf0:	48 8b 5c 24 40       	mov    0x40(%rsp),%rbx
   181786bf5:	48 8b 74 24 48       	mov    0x48(%rsp),%rsi
   181786bfa:	48 83 c4 30          	add    $0x30,%rsp
   181786bfe:	5f                   	pop    %rdi
   181786bff:	c3                   	retq
Flags: needinfo?(nfroyd)
Comment on attachment 8890687 [details] [diff] [review]
Optimize NS_TableDrivenQI()

If comment 14 doesn't convince you, I'm not sure what else to tell you. :)
Attachment #8890687 - Flags: review?(nfroyd)
Also note that comment 14's disassembly was from a PGO build, if that makes any difference.
Comment 14 is extremely convincing, thanks for digging into this!

Are you still interested in the rest of the patch?
Comment 12 suggests that the inlining was the significant change, which we've seen doesn't actually do much...and embedding the IID is what gave the huge codesize increase.  So that leaves us with the known-bounds loop rather than looking for zero, right?  Or were you thinking about doing the known-bounds loop + the inlining change and seeing if that improved any (because presumably the codesize wouldn't go up very much now that we know the compiler isn't really doing inlining?).  If you want to reprofile that and provide some numbers, we can talk!
(In reply to Nathan Froyd [:froydnj] from comment #18)
> Comment 12 suggests that the inlining was the significant change, which
> we've seen doesn't actually do much...and embedding the IID is what gave the
> huge codesize increase.  So that leaves us with the known-bounds loop rather
> than looking for zero, right?

That's what I meant, yes.

> Or were you thinking about doing the
> known-bounds loop + the inlining change and seeing if that improved any
> (because presumably the codesize wouldn't go up very much now that we know
> the compiler isn't really doing inlining?).  If you want to reprofile that
> and provide some numbers, we can talk!

Hmm, not quite sure what you mean.  I've given up squeezing any performance out of this bug really.  At this point I think the only thing left on the table for us to pick up is shrinking the size of the IID tables by removing the sentinel entries, unless I'm missing something.  Not sure what kind of profiling you had in mind... something different to what I had done before?  Profiling this code without creating a synthesized microbenchmark is hard.
No longer blocks: Speedometer_V2
(And FWIW, since it no longer will help with Speedometer, I'd also be happy to leave it as is and look for performance gains elsewhere.)
(In reply to Out sick from comment #20)
> (And FWIW, since it no longer will help with Speedometer, I'd also be happy
> to leave it as is and look for performance gains elsewhere.)

OK, let's drop this, especially since we'll ideally be replacing the IID-driven QI with something else in the 58/59 timeframe.
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → INCOMPLETE
You need to log in before you can comment on or make changes to this bug.