Open Bug 1772353 Opened 3 years ago Updated 2 years ago

Investigate instrumenting auto-var-initializations with an out-of-line function call to measure performance impact

Categories

(Firefox Build System :: Toolchains, task)

Tracking

(Not tracked)

People

(Reporter: bholley, Unassigned)

References

(Blocks 1 open bug)

Details

Forking out a separate bug to investigate the idea from bug 1771223 comment 5:

Another way to assess the performance impact would be to instrument the generated variable initialization to include an out-of-line call to a singleton function that does a tunable amount of busywork. We'd then profile a workload, filter for that function, and look for the hottest callers.

Whether clang gives us the hooks to instrument things in this way, I'm not sure.

Tom looked at some preliminary places to do the instrumentation in bug 1771223 comment 9.
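
For concreteness, the out-of-line function from the quoted idea could look roughly like the sketch below. All names here (`AutoInitProfilingHook`, `gAutoInitBusyIterations`, `gAutoInitHookCalls`) are hypothetical, and the `noinline` attribute is GCC/Clang syntax. How the compiler would be made to emit calls to it is exactly the open question in this bug, so that part is not shown.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical tuning knob: spin iterations performed per call.
std::atomic<std::uint32_t> gAutoInitBusyIterations{64};

// Hypothetical call counter, handy for sanity-checking the instrumentation.
std::atomic<std::uint64_t> gAutoInitHookCalls{0};

// Single out-of-line function that every instrumented initialization would
// call. noinline keeps it a distinct frame, so a profiler can filter on it
// and attribute the time to its callers.
extern "C" __attribute__((noinline)) void AutoInitProfilingHook() {
  gAutoInitHookCalls.fetch_add(1, std::memory_order_relaxed);
  const std::uint32_t n =
      gAutoInitBusyIterations.load(std::memory_order_relaxed);
  // The volatile sink stops the compiler from deleting the busy loop.
  volatile std::uint32_t sink = 0;
  for (std::uint32_t i = 0; i < n; ++i) {
    sink = sink + i;
  }
}
```

Raising `gAutoInitBusyIterations` amplifies the signal from each surviving initialization, at the cost of distorting the workload more.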

I believe -ftrivial-auto-var-init adds initializations in all places and then relies on an optimization pass to remove the redundant ones. If we were to add code alongside each initialization, I don't believe the optimizer would know to remove it in the redundant cases. If so, this would add overhead everywhere stack variables are initialized.

Here is an example without optimization and with.

So, what I found out so far is that in the IR, instructions added because of the auto initialization are actually marked as such:

I->addAnnotationMetadata("auto-init");

Maybe after all optimization passes are through, we can scan for any remaining instructions with this metadata and inject the artificial call. This would avoid the problem of breaking optimizations.

Decoder and I just chatted a bit about this.

My mental model of how this flag works (which may or may not be accurate) is that, when enabled, LLVM generates additional IR to zero the uninitialized variables, and then proceeds with optimizing and lowering the IR per usual. Decoder makes the very good point that many (most?) of these zeroing instructions will get optimized out when processing the IR, since the common case for uninitialized variables is for them to be subsequently initialized in a compiler-verifiable way. So if we generated the OOL function calls when minting the IR, we'd end up invoking the function even in the cases where the zeroing got optimized out, skewing the results.

So I think what we'd need to do would be to generate a different IR instruction, which would have the optimization semantics of zeroing, but which would be lowered to zeroing plus the OOL call.

Ok mid-aired with Tyson and decoder who are already ahead of me. :-)

I would be most interested in finding all places where we initialize large buffers. This is just a stab in the dark, but my guess would be that the extra time is dominated by initializations of buffers larger than a cache line in size.

(In reply to Christian Holler (:decoder) from comment #3)

So, what I found out so far is that in the IR, instructions added because of the auto initialization are actually marked as such:

I->addAnnotationMetadata("auto-init");

Maybe after all optimization passes are through, we can scan for any remaining instructions with this metadata and inject the artificial call. This would avoid the problem of breaking optimizations.

I'm fairly sure this could be done in a separate pass, without even messing with LLVM internals at all. What I don't know is how to ensure that our pass is indeed the last one (after all optimizing passes), esp. with the new pass manager in LLVM. I'll try to find some answers for that.

This may be scope creep for this bug, but if we could output a warning (including source location) whenever an auto init is performed in the resulting binary, that would provide an actionable signal. An audit of the location could then be performed. It may reduce the need for more involved debugging to find problematic regressions. I may be wrong, but I believe this would be a high-quality signal, since in correct code most auto inits should be optimized away and what is left would be a bug in the compiler or the code. Of course this would also likely require a way to suppress and throttle the warnings.

(In reply to Tyson Smith [:tsmith] from comment #8)

This may be scope creep for this bug, but if we could output a warning (including source location) whenever an auto init is performed in the resulting binary, that would provide an actionable signal.

I think this is already doable without modifying clang. I have a wrapper that can emit LLVM IR side-by-side when compiling the tree. Inside that IR you can see instructions based on the "auto-init" annotation and debug information should also make it possible to trace it back to the respective source line of code. This requires some tools to automate, but the info should be there.
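
As a rough illustration of the automation this would need, here is a sketch that scans textual LLVM IR for instructions carrying the "auto-init" annotation. It assumes the annotation appears as an `!annotation !N` attachment whose node is defined as `!N = !{!"auto-init"}`; the exact textual form may differ between LLVM versions. `FindAutoInitLines` is a made-up name, and mapping the hits back to source lines via `!dbg` metadata is not attempted here.

```cpp
#include <regex>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Returns the IR lines whose !annotation attachment points at a metadata
// node containing the string "auto-init".
std::vector<std::string> FindAutoInitLines(const std::string& irText) {
  // Metadata node definitions, e.g.  !5 = !{!"auto-init"}
  static const std::regex nodeDef(
      R"re(\s*(!\d+)\s*=\s*!\{.*!"auto-init".*\}\s*)re");
  // Instruction attachments, e.g.  ..., !annotation !5
  static const std::regex attachment(R"re(!annotation\s+(!\d+))re");

  // Pass 1: collect the ids of all nodes that mention "auto-init".
  std::set<std::string> autoInitNodes;
  std::istringstream first(irText);
  std::string line;
  while (std::getline(first, line)) {
    std::smatch m;
    if (std::regex_match(line, m, nodeDef)) {
      autoInitNodes.insert(m[1].str());
    }
  }

  // Pass 2: keep the instructions that reference one of those nodes.
  std::vector<std::string> hits;
  std::istringstream second(irText);
  while (std::getline(second, line)) {
    std::smatch m;
    if (std::regex_search(line, m, attachment) &&
        autoInitNodes.count(m[1].str()) > 0) {
      hits.push_back(line);
    }
  }
  return hits;
}
```

Run over IR emitted after optimization, the hits would be the initializations that survived. A real tool would more likely use the LLVM APIs than regexes.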

It may reduce the need for more involved debugging to find problematic regressions. I may be wrong but I believe this would be a high quality signal since in correct code most auto inits should be optimized away and what is left would be a bug in the compiler or the code.

I'm not sure that's the case. As soon as uninitialized memory is handed off to another compilation unit that will initialize it, the compiler can't reason about this anymore and will force-initialize it with the flag. This doesn't mean though that the code is buggy. I don't know how common this is.

(In reply to Doug Thayer [:dthayer] (he/him) from comment #6)

I would be most interested in finding all places where we initialize large buffers.

We could perhaps account for this by passing the size of the auto-initialized region as an argument to the out-of-line function, and then fiddling with that function to get the results we want (including doing proportional memory-bound busywork, or even calling out to different sub-functions to distinguish the cases in the profiler).
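
A minimal sketch of that size-aware variant, assuming a 64-byte cache line and two buckets. The function names, the bucket boundary, and the counters are all invented for illustration, the busywork here is CPU-bound rather than memory-bound for simplicity, and the attributes are GCC/Clang syntax.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLineBytes = 64;  // assumed line size

// Number of cache lines needed to cover `bytes` (rounded up).
constexpr std::size_t CacheLinesTouched(std::size_t bytes) {
  return (bytes + kCacheLineBytes - 1) / kCacheLineBytes;
}

std::uint64_t gSmallCalls = 0;  // hypothetical counters for sanity checks
std::uint64_t gLargeCalls = 0;

// Inlined into each bucket function so the time is charged to that frame.
static inline __attribute__((always_inline)) void Spin(std::size_t iterations) {
  volatile std::size_t sink = 0;  // volatile keeps the loop alive
  for (std::size_t i = 0; i < iterations; ++i) {
    sink = sink + i;
  }
}

// One noinline sub-function per bucket, so the profiler can tell them apart.
__attribute__((noinline)) void AutoInitHookSmall() {
  ++gSmallCalls;
  Spin(1);
}

__attribute__((noinline)) void AutoInitHookLarge(std::size_t bytes) {
  ++gLargeCalls;
  Spin(CacheLinesTouched(bytes));  // work proportional to lines touched
}

// Entry point the instrumented code would call with the size of the
// auto-initialized region.
__attribute__((noinline)) void AutoInitHook(std::size_t bytes) {
  if (bytes <= kCacheLineBytes) {
    AutoInitHookSmall();
  } else {
    AutoInitHookLarge(bytes);
  }
}
```

Filtering the profile for `AutoInitHookLarge` alone would then surface the callers that initialize large buffers.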

This is just a stab in the dark, but my guess would be that the extra time is dominated by initializations of buffers larger than a cache line in size.

My intuition has always been that the entire stack is usually in L1, so I'd be a little surprised if the cache line boundary were relevant. But I could be wrong.

(In reply to Bobby Holley (:bholley) from comment #10)

My intuition has always been that the entire stack is usually in L1, so I'd be a little surprised if the cache line boundary were relevant. But I could be wrong.

I won't pretend to be a cache expert, so please take what I say with a grain of salt, but my understanding was that we're typically going to be on a write-back or write-back-ish cache, so any time we write at all to a cache line we're going to pay for having to ship that off to main memory when that cache line is evicted, so most of the perf cost of writes is going to be on a per-cache-line basis, and we could generally assume that we're going to be writing to the first cache line anyway.

(In reply to Doug Thayer [:dthayer] (he/him) from comment #11)

(In reply to Bobby Holley (:bholley) from comment #10)

My intuition has always been that the entire stack is usually in L1, so I'd be a little surprised if the cache line boundary were relevant. But I could be wrong.

I won't pretend to be a cache expert, so please take what I say with a grain of salt, but my understanding was that we're typically going to be on a write-back or write-back-ish cache, so any time we write at all to a cache line we're going to pay for having to ship that off to main memory when that cache line is evicted, so most of the perf cost of writes is going to be on a per-cache-line basis, and we could generally assume that we're going to be writing to the first cache line anyway.

Whoops, I see what you were saying. It's plausible we don't typically have to evict stack cache lines because they might near-permanently reside in L1, so we never have to pay the write-back cost.

(In reply to Doug Thayer [:dthayer] (he/him) from comment #6)

I would be most interested in finding all places where we initialize large buffers.

It doesn't matter if we initialise a large buffer if that cost is amortised
over a suitably large amount of work done by the function. So it seems to me that
what you're looking for is functions for which the value
(frame_size / average_dynamic_cost_per_call) is high, where the cost is measured
in number of insns.

I can possibly offer you tooling (callgrind) to find functions where the average_dynamic_cost
is low. Identifying functions where frame_size is high, I'm not sure about.
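
To make the metric concrete, here is a small sketch that ranks functions by frame_size / average_dynamic_cost_per_call. The `FunctionStats` struct and both function names are hypothetical. In practice the instruction counts might come from callgrind and the frame sizes from some other tool.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical per-function record combining both measurements.
struct FunctionStats {
  std::string name;
  std::uint64_t frameSizeBytes;  // stack frame size
  double avgInsnsPerCall;        // average dynamic cost per call, in insns
};

// Bytes of frame per executed instruction. A high value means the
// initialization is poorly amortised over the function's own work.
double InitCostRatio(const FunctionStats& f) {
  if (f.avgInsnsPerCall <= 0.0) {
    return 0.0;  // never called, or no data: nothing to rank
  }
  return static_cast<double>(f.frameSizeBytes) / f.avgInsnsPerCall;
}

// Returns the functions sorted with the worst offenders first.
std::vector<FunctionStats> RankByInitCostRatio(std::vector<FunctionStats> fns) {
  std::sort(fns.begin(), fns.end(),
            [](const FunctionStats& a, const FunctionStats& b) {
              return InitCostRatio(a) > InitCostRatio(b);
            });
  return fns;
}
```

With this ordering, a function with a 4 KiB frame that runs 50 insns per call ranks far above one with the same frame that runs 100,000.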

(more, apologies for the noise)

I can possibly offer you tooling (callgrind) to find functions where the average_dynamic_cost
is low. Identifying functions where frame_size is high, I'm not sure about.

Maybe an LLVM plugin could crudely try to guesstimate both factors -- frame size and
expected amount of work? The latter might be done by counting the number of LLVM
insns in the innermost loop and scaling by the loop depth? FWIW register allocators
routinely use such gross hacks in order to guesstimate relative costs of spilling values.
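
A very crude sketch of that guesstimate, assuming we already have a per-basic-block instruction count and loop depth (which an LLVM plugin could presumably derive from its loop analysis). `BlockInfo`, `EstimateWork`, and the assumed trip count of 10 per loop level are all invented for illustration.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical summary of one basic block.
struct BlockInfo {
  unsigned insnCount;  // instructions in the block
  unsigned loopDepth;  // 0 = not in any loop
};

constexpr std::uint64_t kAssumedTripCount = 10;  // pure guess per loop level

// Counts the instructions at the deepest loop nesting level and scales them
// by kAssumedTripCount raised to that depth.
double EstimateWork(const std::vector<BlockInfo>& blocks) {
  unsigned maxDepth = 0;
  for (const BlockInfo& b : blocks) {
    maxDepth = std::max(maxDepth, b.loopDepth);
  }
  std::uint64_t innermostInsns = 0;
  for (const BlockInfo& b : blocks) {
    if (b.loopDepth == maxDepth) {
      innermostInsns += b.insnCount;
    }
  }
  std::uint64_t weight = 1;
  for (unsigned d = 0; d < maxDepth; ++d) {
    weight *= kAssumedTripCount;
  }
  return static_cast<double>(innermostInsns * weight);
}
```

Dividing a function's frame size by this estimate would give a static stand-in for the frame_size / cost ratio, with all the imprecision that implies.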
