Bug 1525303 Comment 3 Edit History

Note: The actual edited comment in the bug view page will always show the original commenter’s name and original timestamp.

Original comment by

(no longer active)

on 2019-02-05 16:44:15 PST

(In reply to Eric Rahm [:erahm] (ni? for phab reviews) from comment #1)

> We also added the canary more recently, I wonder if that affects things?

Could very well be.

(In reply to Nathan Froyd [:froydnj] from comment #2)

> (In reply to :Ehsan Akhgari from comment #0)
> > Bug 1386918 uninlined this function based on the assumption that doing so wouldn't cause any performance issues and if it did people would complain.
> > 
> > I complained here: https://bugzilla.mozilla.org/show_bug.cgi?id=1386918#c7
> 
> That comment says that "When looking at profiles of our AntiTracking code which logs heavily (for bug 1516540), I see this non-inline function show up quite a lot. :-( In some functions it's around 10% of self-time."
> 
> I looked at profiles and I'm not seeing 10% of self-time.  The last two profiles, from bug 1516540 comment 21 and bug 1516540 comment 22 have ~1.5% time for logging-specific code.  Is this just because it's focused on a single stack under anti-tracking, and not some sort of aggregation across the profile?
> 
> It doesn't even look to be showing up in the anti-tracking paths in the profile from bug 1516540 comment 16--this might just be an artifact of sampling, though.
> 
> Am I misunderstanding the profiles?

I was looking at the instruction level profile in perf.  All that Gecko Profiler profiles can show you is that this function shows up, that's all.  See https://perfht.ml/2GaL8wh (from a local build with some more optimizations after I made the comment above) for an example.

But if you're interested to dig in more deeply, I suggest profiling in perf.  I use https://pastoral-salesman-1.glitch.me/ as my test case.  The function you want to look at is MaybeIsFirstPartyStorageAccessGrantedFor.

> > I'm not sure where the code size initially came from.  Perhaps a compromise would be inlining everything except this non-common path? <https://searchfox.org/mozilla-central/rev/152993fa346c8fd9296e4cd6622234a664f53341/xpcom/base/Logging.cpp#58>
> 
> I don't know that inlining:
> 
>   LogModule* tmp = mLog;
>   if (MOZ_UNLIKELY(!tmp)) {
>     GetLogModule(&tmp);
>   }
> 
>   mCanary.Check();
> 
>   return tmp;
> 
> would be significantly less code size than if you inlined the current code.  We have some 5000+ callsites for logging, so all inlining has to do is add ~10 instructions to add up to 200K of code size, which I think the above snippet could easily do.  (Bug 1386918 also didn't have to deal with the `mCanary.Check()` call, which depending on what the compiler decides to do would add its own chunk of code size if you inlined things.)

Fair point.

> It's also not clear to me that inlining it would necessarily increase performance; you still have to execute the instructions, but now they'd just be self-attributed rather than attributed to `LogModule::operator*`.  You'd also have increased register pressure, worse icache performance, etc. etc.

Wouldn't you be executing strictly fewer instructions because you'd be saving the overhead of calling the out of line function in that case?

FWIW this is what the compiled function on my system looks like (I'm hoping I won't make a mistake transcribing the instructions here):

# prolog
push %rbp
mov %rsp,%rbp
push %rbx
push %rax
mov %rdi,%rbx

# atomic load
mov    0x10(%rdi),%rax

# if MOZ_UNLIKELY
test %rax,%rax
je 23

# canary check
cmpq $0xf0b0f0b,0x8(%rbx)
jne 40

# epilog
1c: add $0x8,%rsp
pop %rbx
pop %rbp
retq

# body of if MOZ_UNLIKELY
23: mov (%rbx),%rsi
mov    mozilla::sLogModuleManager,%rdi
callq  mozilla::LogModuleManager::CreateOrGetModule(char const*)
mov %rax,0x10(%rbx)
cmpq $0xf0b0f0b,0x8(%rbx)
je 1c

# MOZ_CRASH from canary check
40: lea mp_gf2m_sqr_tb+0x716,%rax
mov gMozCrashReason@Base,%rcx
mov %rax,(%rcx)
movl $0x4b,0x0
callq abort@plt


In the calling frame we have:

mov    %rsi,%r15
mov    %rdi,%r13
mov    %fs:0x28,%rax
mov    %rax,-0x30(%rbp)
lea    gAntiTrackingLog,%rdi
callq  mozilla::LazyLogModule::operator mozilla::LogModule*()
test   %rax,%rax
je     7c57745 <mozilla::AntiTrackingCommon::MaybeIsFirstPartyStorageAccessGrantedFor(nsPIDOMWindowInner*, nsIURI*)+0x165>
# this is the inlined log_test() function

Some of these are probably unfair to count as "instructions" (e.g. the moves that are just renaming registers) but I wanted to show that with the out-of-line version there is also some register usage in the calling frame.  The trade-off is hard to reason about for me tbh, it's not completely obvious that in one case e.g. the register pressure is more/less than the other.  What seems quite clear to me is that we do run strictly more code in the out-of-lined version.

I'm also curious to know why you're saying the icache performance will be worse in the inlined version.  When looking at the instruction level profile of the operator LogModule*() function itself, I see that around 20% of the samples are attributed to the first instruction (push %rbp) which is usually an indication of icache fetch stalls when a call has been made AFAIK.  That seems to be contrary to your suggestion above?

Revision 1 by

(no longer active)

on 2019-02-05 16:44:50 PST

(In reply to Eric Rahm [:erahm] (ni? for phab reviews) from comment #1)

> We also added the canary more recently, I wonder if that affects things?

Could very well be.

(In reply to Nathan Froyd [:froydnj] from comment #2)

> (In reply to :Ehsan Akhgari from comment #0)
> > Bug 1386918 uninlined this function based on the assumption that doing so wouldn't cause any performance issues and if it did people would complain.
> > 
> > I complained here: https://bugzilla.mozilla.org/show_bug.cgi?id=1386918#c7
> 
> That comment says that "When looking at profiles of our AntiTracking code which logs heavily (for bug 1516540), I see this non-inline function show up quite a lot. :-( In some functions it's around 10% of self-time."
> 
> I looked at profiles and I'm not seeing 10% of self-time.  The last two profiles, from bug 1516540 comment 21 and bug 1516540 comment 22 have ~1.5% time for logging-specific code.  Is this just because it's focused on a single stack under anti-tracking, and not some sort of aggregation across the profile?
> 
> It doesn't even look to be showing up in the anti-tracking paths in the profile from bug 1516540 comment 16--this might just be an artifact of sampling, though.
> 
> Am I misunderstanding the profiles?

I was looking at the instruction level profile in perf.  All that Gecko Profiler profiles can show you is that this function shows up, that's all.  See https://perfht.ml/2GaL8wh (from a local build with some more optimizations after I made the comment above) for an example.

But if you're interested to dig in more deeply, I suggest profiling in perf.  I use https://pastoral-salesman-1.glitch.me/ as my test case.  The function you want to look at is MaybeIsFirstPartyStorageAccessGrantedFor.

> > I'm not sure where the code size initially came from.  Perhaps a compromise would be inlining everything except this non-common path? <https://searchfox.org/mozilla-central/rev/152993fa346c8fd9296e4cd6622234a664f53341/xpcom/base/Logging.cpp#58>
> 
> I don't know that inlining:
> 
>   LogModule* tmp = mLog;
>   if (MOZ_UNLIKELY(!tmp)) {
>     GetLogModule(&tmp);
>   }
> 
>   mCanary.Check();
> 
>   return tmp;
> 
> would be significantly less code size than if you inlined the current code.  We have some 5000+ callsites for logging, so all inlining has to do is add ~10 instructions to add up to 200K of code size, which I think the above snippet could easily do.  (Bug 1386918 also didn't have to deal with the `mCanary.Check()` call, which depending on what the compiler decides to do would add its own chunk of code size if you inlined things.)

Fair point.

> It's also not clear to me that inlining it would necessarily increase performance; you still have to execute the instructions, but now they'd just be self-attributed rather than attributed to `LogModule::operator*`.  You'd also have increased register pressure, worse icache performance, etc. etc.

Wouldn't you be executing strictly fewer instructions because you'd be saving the overhead of calling the out of line function in that case?

FWIW this is what the compiled function on my system looks like (I'm hoping I won't make a mistake transcribing the instructions here):

```
# prolog
push %rbp
mov %rsp,%rbp
push %rbx
push %rax
mov %rdi,%rbx

# atomic load
mov    0x10(%rdi),%rax

# if MOZ_UNLIKELY
test %rax,%rax
je 23

# canary check
cmpq $0xf0b0f0b,0x8(%rbx)
jne 40

# epilog
1c: add $0x8,%rsp
pop %rbx
pop %rbp
retq

# body of if MOZ_UNLIKELY
23: mov (%rbx),%rsi
mov    mozilla::sLogModuleManager,%rdi
callq  mozilla::LogModuleManager::CreateOrGetModule(char const*)
mov %rax,0x10(%rbx)
cmpq $0xf0b0f0b,0x8(%rbx)
je 1c

# MOZ_CRASH from canary check
40: lea mp_gf2m_sqr_tb+0x716,%rax
mov gMozCrashReason@Base,%rcx
mov %rax,(%rcx)
movl $0x4b,0x0
callq abort@plt
```

In the calling frame we have:

```
mov    %rsi,%r15
mov    %rdi,%r13
mov    %fs:0x28,%rax
mov    %rax,-0x30(%rbp)
lea    gAntiTrackingLog,%rdi
callq  mozilla::LazyLogModule::operator mozilla::LogModule*()
test   %rax,%rax
je     7c57745 <mozilla::AntiTrackingCommon::MaybeIsFirstPartyStorageAccessGrantedFor(nsPIDOMWindowInner*, nsIURI*)+0x165>
# this is the inlined log_test() function
```

Some of these are probably unfair to count as "instructions" (e.g. the moves that are just renaming registers) but I wanted to show that with the out-of-line version there is also some register usage in the calling frame.  The trade-off is hard to reason about for me tbh, it's not completely obvious that in one case e.g. the register pressure is more/less than the other.  What seems quite clear to me is that we do run strictly more code in the out-of-lined version.

I'm also curious to know why you're saying the icache performance will be worse in the inlined version.  When looking at the instruction level profile of the operator LogModule*() function itself, I see that around 20% of the samples are attributed to the first instruction (push %rbp) which is usually an indication of icache fetch stalls when a call has been made AFAIK.  That seems to be contrary to your suggestion above?

Back to Bug 1525303 Comment 3