(In reply to Eric Rahm [:erahm] (ni? for phab reviews) from comment #1) > We also added the canary more recently, I wonder if that affects things? Could very well be. (In reply to Nathan Froyd [:froydnj] from comment #2) > (In reply to :Ehsan Akhgari from comment #0) > > Bug 1386918 uninlined this function based on the assumption that doing so wouldn't cause any performance issues and if it did people would complain. > > > > I complained here: https://bugzilla.mozilla.org/show_bug.cgi?id=1386918#c7 > > That comment says that "When looking at profiles of our AntiTracking code which logs heavily (for bug 1516540), I see this non-inline function show up quite a lot. :-( In some functions it's around 10% of self-time." > > I looked at profiles and I'm not seeing 10% of self-time. The last two profiles, from bug 1516540 comment 21 and bug 1516540 comment 22 have ~1.5% time for logging-specific code. Is this just because it's focused on a single stack under anti-tracking, and not some sort of aggregation across the profile? > > It doesn't even look to be showing up in the anti-tracking paths in the profile from bug 1516540 comment 16--this might just be an artifact of sampling, though. > > Am I misunderstanding the profiles? I was looking at the instruction level profile in perf. All that Gecko Profiler profiles can show you is that this function shows up, that's all. See https://perfht.ml/2GaL8wh (from a local build with some more optimizations after I made the comment above) for an example. But if you're interested to dig in more deeply, I suggest profiling in perf. I use https://pastoral-salesman-1.glitch.me/ as my test case. The function you want to look at is MaybeIsFirstPartyStorageAccessGrantedFor. > > I'm not sure where the code size initially came from. Perhaps a compromise would be inlining everything except this non-common path? <https://searchfox.org/mozilla-central/rev/152993fa346c8fd9296e4cd6622234a664f53341/xpcom/base/Logging.cpp#58> > > I don't know that inlining: > > LogModule* tmp = mLog; > if (MOZ_UNLIKELY(!tmp)) { > GetLogModule(&tmp); > } > > mCanary.Check(); > > return tmp; > > would be significantly less code size than if you inlined the current code. We have some 5000+ callsites for logging, so all inlining has to do is add ~10 instructions to add up to 200K of code size, which I think the above snippet could easily do. (Bug 1386918 also didn't have to deal with the `mCanary.Check()` call, which depending on what the compiler decides to do would add its own chunk of code size if you inlined things.) Fair point. > It's also not clear to me that inlining it would necessarily increase performance; you still have to execute the instructions, but now they'd just be self-attributed rather than attributed to `LogModule::operator*`. You'd also have increased register pressure, worse icache performance, etc. etc. Wouldn't you be executing strictly fewer instructions because you'd be saving the overhead of calling the out of line function in that case? FWIW this is what the compiled function on my system looks like (I'm hoping I won't make a mistake transcribing the instructions here): # prolog push %rbp mov %rsp,%rbp push %rbx push %rax mov %rdi,%rbx # atomic load mov 0x10(%rdi),%rax # if MOZ_UNLIKELY test %rax,%rax je 23 # canary check cmpq $0xf0b0f0b,0x8(%rbx) jne 40 # epilog 1c: add $0x8,%rsp pop %rbx pop %rbp retq # body of if MOZ_UNLIKELY 23: mov (%rbx),%rsi mov mozilla::sLogModuleManager,%rdi callq mozilla::LogModuleManager::CreateOrGetModule(char const*) mov %rax,0x10(%rbx) cmpq $0xf0b0f0b,0x8(%rbx) je 1c # MOZ_CRASH from canary check 40: lea mp_gf2m_sqr_tb+0x716,%rax mov gMozCrashReason@Base,%rcx mov %rax,(%rcx) movl $0x4b,0x0 callq abort@plt In the calling frame we have: mov %rsi,%r15 mov %rdi,%r13 mov %fs:0x28,%rax mov %rax,-0x30(%rbp) lea gAntiTrackingLog,%rdi callq mozilla::LazyLogModule::operator mozilla::LogModule*() test %rax,%rax je 7c57745 <mozilla::AntiTrackingCommon::MaybeIsFirstPartyStorageAccessGrantedFor(nsPIDOMWindowInner*, nsIURI*)+0x165> # this is the inlined log_test() function Some of these are probably unfair to count as "instructions" (e.g. the moves that are just renaming registers) but I wanted to show that with the out-of-line version there is also some register usage in the calling frame. The trade-off is hard to reason about for me tbh, it's not completely obvious that in one case e.g. the register pressure is more/less than the other. What seems quite clear to me is that we do run strictly more code in the out-of-lined version. I'm also curious to know why you're saying the icache performance will be worse in the inlined version. When looking at the instruction level profile of the operator LogModule*() function itself, I see that around 20% of the samples are attributed to the first instruction (push %rbp) which is usually an indication of icache fetch stalls when a call has been made AFAIK. That seems to be contrary to your suggestion above?
Bug 1525303 Comment 3 Edit History
Note: The actual edited comment in the bug view page will always show the original commenter’s name and original timestamp.
(In reply to Eric Rahm [:erahm] (ni? for phab reviews) from comment #1) > We also added the canary more recently, I wonder if that affects things? Could very well be. (In reply to Nathan Froyd [:froydnj] from comment #2) > (In reply to :Ehsan Akhgari from comment #0) > > Bug 1386918 uninlined this function based on the assumption that doing so wouldn't cause any performance issues and if it did people would complain. > > > > I complained here: https://bugzilla.mozilla.org/show_bug.cgi?id=1386918#c7 > > That comment says that "When looking at profiles of our AntiTracking code which logs heavily (for bug 1516540), I see this non-inline function show up quite a lot. :-( In some functions it's around 10% of self-time." > > I looked at profiles and I'm not seeing 10% of self-time. The last two profiles, from bug 1516540 comment 21 and bug 1516540 comment 22 have ~1.5% time for logging-specific code. Is this just because it's focused on a single stack under anti-tracking, and not some sort of aggregation across the profile? > > It doesn't even look to be showing up in the anti-tracking paths in the profile from bug 1516540 comment 16--this might just be an artifact of sampling, though. > > Am I misunderstanding the profiles? I was looking at the instruction level profile in perf. All that Gecko Profiler profiles can show you is that this function shows up, that's all. See https://perfht.ml/2GaL8wh (from a local build with some more optimizations after I made the comment above) for an example. But if you're interested to dig in more deeply, I suggest profiling in perf. I use https://pastoral-salesman-1.glitch.me/ as my test case. The function you want to look at is MaybeIsFirstPartyStorageAccessGrantedFor. > > I'm not sure where the code size initially came from. Perhaps a compromise would be inlining everything except this non-common path? <https://searchfox.org/mozilla-central/rev/152993fa346c8fd9296e4cd6622234a664f53341/xpcom/base/Logging.cpp#58> > > I don't know that inlining: > > LogModule* tmp = mLog; > if (MOZ_UNLIKELY(!tmp)) { > GetLogModule(&tmp); > } > > mCanary.Check(); > > return tmp; > > would be significantly less code size than if you inlined the current code. We have some 5000+ callsites for logging, so all inlining has to do is add ~10 instructions to add up to 200K of code size, which I think the above snippet could easily do. (Bug 1386918 also didn't have to deal with the `mCanary.Check()` call, which depending on what the compiler decides to do would add its own chunk of code size if you inlined things.) Fair point. > It's also not clear to me that inlining it would necessarily increase performance; you still have to execute the instructions, but now they'd just be self-attributed rather than attributed to `LogModule::operator*`. You'd also have increased register pressure, worse icache performance, etc. etc. Wouldn't you be executing strictly fewer instructions because you'd be saving the overhead of calling the out of line function in that case? FWIW this is what the compiled function on my system looks like (I'm hoping I won't make a mistake transcribing the instructions here): ``` # prolog push %rbp mov %rsp,%rbp push %rbx push %rax mov %rdi,%rbx # atomic load mov 0x10(%rdi),%rax # if MOZ_UNLIKELY test %rax,%rax je 23 # canary check cmpq $0xf0b0f0b,0x8(%rbx) jne 40 # epilog 1c: add $0x8,%rsp pop %rbx pop %rbp retq # body of if MOZ_UNLIKELY 23: mov (%rbx),%rsi mov mozilla::sLogModuleManager,%rdi callq mozilla::LogModuleManager::CreateOrGetModule(char const*) mov %rax,0x10(%rbx) cmpq $0xf0b0f0b,0x8(%rbx) je 1c # MOZ_CRASH from canary check 40: lea mp_gf2m_sqr_tb+0x716,%rax mov gMozCrashReason@Base,%rcx mov %rax,(%rcx) movl $0x4b,0x0 callq abort@plt ``` In the calling frame we have: ``` mov %rsi,%r15 mov %rdi,%r13 mov %fs:0x28,%rax mov %rax,-0x30(%rbp) lea gAntiTrackingLog,%rdi callq mozilla::LazyLogModule::operator mozilla::LogModule*() test %rax,%rax je 7c57745 <mozilla::AntiTrackingCommon::MaybeIsFirstPartyStorageAccessGrantedFor(nsPIDOMWindowInner*, nsIURI*)+0x165> # this is the inlined log_test() function ``` Some of these are probably unfair to count as "instructions" (e.g. the moves that are just renaming registers) but I wanted to show that with the out-of-line version there is also some register usage in the calling frame. The trade-off is hard to reason about for me tbh, it's not completely obvious that in one case e.g. the register pressure is more/less than the other. What seems quite clear to me is that we do run strictly more code in the out-of-lined version. I'm also curious to know why you're saying the icache performance will be worse in the inlined version. When looking at the instruction level profile of the operator LogModule*() function itself, I see that around 20% of the samples are attributed to the first instruction (push %rbp) which is usually an indication of icache fetch stalls when a call has been made AFAIK. That seems to be contrary to your suggestion above?