Open Bug 1808891 Opened 2 years ago Updated 1 year ago

Avoid costly initialisation under -ftrivial-auto-var-init=pattern

Categories

(Firefox Build System :: General, task, P3)

task

Tracking

(Not tracked)

People

(Reporter: sergesanspaille, Unassigned)

References

(Depends on 2 open bugs)

Details

As investigated in https://bugzilla.mozilla.org/show_bug.cgi?id=1515213, -ftrivial-auto-var-init has a non-negligible impact on performance, and it is possible to instrument the compilation to dump the inserted auto-init.

As it is possible to flag a variable declaration as not being initialized by this flag, to use heap allocation in place of stack allocation, or to change the code to reduce the allocated size, it would be great to clean up the codebase in that respect.
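For instance, clang's [[clang::uninitialized]] attribute can opt a single, audited variable out of the forced initialization. A minimal sketch (wipe_and_fill is a hypothetical helper assumed to write every byte of the buffer before anything reads it):

#include <cstddef>

void wipe_and_fill(char*, std::size_t);   // hypothetical helper that writes the whole buffer

void hot_path() {
    // Opted out of -ftrivial-auto-var-init after auditing that wipe_and_fill
    // writes the full buffer before any read.
    [[clang::uninitialized]] char buffer[4096];
    wipe_and_fill(buffer, sizeof(buffer));
}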

Severity: -- → S3
Priority: -- → P3

I've landed https://reviews.llvm.org/D148507 in LLVM upstream; this should help, and we will get it alongside clang 17.

Type: enhancement → task
Depends on: 1844513
Depends on: 1844520
Depends on: 1844528

Potential source of optimization, but I'm unsure of the security implication: In the following case

void foo(char*);
void bar(int n) {
    for(int i = 0; i < n; ++i) {
        char buffer[512];
        foo(buffer);
    }
}

-ftrivial-auto-var-init generates a memset in the loop, something equivalent to

void foo(char*);
void bar(int n) {
    char buffer[512];
    for(int i = 0; i < n; ++i) {
        memset(buffer, 0, sizeof(buffer));
        foo(buffer);
    }
}

The compiler cannot hoist the memset without breaking a loop-carried dependency, which leads to extra overhead.
Yet if we only want to wipe stack memory, it would be fine to hoist the memset, with the inconvenience of potentially leaking the values from the previous iteration:

void foo(char*);
void bar(int n) {
    char buffer[512];
    memset(buffer, 0, sizeof(buffer));
    for(int i = 0; i < n; ++i) {
        // on second iteration, buffer is filled with values from first iteration
        foo(buffer);
    }
}

So hey, security people, what do you think?

Flags: needinfo?(twsmith)
Flags: needinfo?(tom)
Depends on: 1850315

Performing the memset once before the loop when buffer is declared (but not initialized) in the loop seems reasonable to me.

Doing anything else might have unintended consequences if buffer is assumed to retain its contents between iterations.

Flags: needinfo?(twsmith)

Now I'm curious, is it correct (not UB) to use values in buffer from a previous iteration? If so, using -ftrivial-auto-var-init= is actually breaking programs that do so.

What is the expected output of this program?

#include <stdio.h>

int main() {
    int max = 0;
    for(int i = 0; i < 10; i++) {
        int buffer[10];
        if(i == 0) {
            buffer[0] = 0;
        }
        else {
            buffer[i] = buffer[i-1] + 1;
        }
        if (buffer[i] > max) {
            max = buffer[i];
        }
    }
    printf("%d\n", max);
    return 0;
}

https://godbolt.org/z/rfGdbPzos
https://godbolt.org/z/3h4ncTx69 (with -ftrivial-auto-var-init=zero)

This program has undefined behavior without -ftrivial-auto-var-init=zero (the second iteration reads an uninitialized value). The compiler is allowed to do anything at that point. Adding -ftrivial-auto-var-init=zero indeed makes the behavior deterministic, but you cannot rely on it.

This is one of the advantages of using a different init pattern for release and debug modes: developers cannot rely on a specific behavior, so we keep writing standard code.
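For what it's worth, the standard-conforming way to carry values across iterations is to give the buffer a lifetime that spans the loop; a minimal rewrite of the program above:

#include <stdio.h>

int main() {
    int max = 0;
    int buffer[10];                        /* lifetime now spans the loop */
    for(int i = 0; i < 10; i++) {
        if(i == 0) {
            buffer[0] = 0;
        }
        else {
            buffer[i] = buffer[i-1] + 1;   /* reads a value stored on the previous iteration: well-defined */
        }
        if (buffer[i] > max) {
            max = buffer[i];
        }
    }
    printf("%d\n", max);                   /* prints 9 */
    return 0;
}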

Flags: needinfo?(tom)
Depends on: 1850505
Depends on: 1850569
Depends on: 1850948
Depends on: 1850951

Serge, here's a regex code search for other char buffer declarations that are indented 8+ spaces and thus might be defined inside a loop:

https://searchfox.org/mozilla-central/search?q=%5E%5Cs%7B8%2C%7Dchar%5Cs%2B%5Cw%2B%5C%5B.%2B%5C%5D%5B%3B%2C%5D&path=&case=true&regexp=true
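For reference, the decoded regex is

    ^\s{8,}char\s+\w+\[.+\][;,]

i.e. a char array declaration indented by at least eight spaces and terminated by ';' or ','.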

Flags: needinfo?(sguelton)

Thanks! I have a compiler plugin that should fix the issue globally by doing a specific optimization; I'll test that approach first, but I appreciate the trick ;-)

Flags: needinfo?(sguelton)

(In reply to [:sergesanspaille] from comment #2)

Potential source of optimization, but I'm unsure of the security implication: In the following case
[..]
(stuff about re-zeroing buffer[512] every time around a loop)

Having used Valgrind/Memcheck to chase many uninit-values-on-the-stack bugs in
Gecko some years ago, I am inclined to think that the discussed scheme (rezero
every time around a loop, etc) is overkill. I don't remember exactly, but I
have the vague memory that most of those bugs didn't involve stack reuse.

I would think that just zeroing out at frame-creation time, and also any
sub-creation within the function (is that even allowed? isn't "stack moves
once" pretty much standard nowadays?) would get us the majority of the
benefit. That is,
sub $N, %rsp --> sub $N, %rsp ; inlined-memset(%rsp, 0, $N).

Much as it pains me to roll out the "best is the enemy of the good" cliche .. even
just doing this simple initialisation would protect release builds much better
from uninitialised-stack problems than at present (wouldn't it)?

Hey Julian,

First some feedback on the loop hoisting approach: it unfortunately didn't significantly reduce the speedometer3 slowdown, so I gave up on that approach.

Then considering the approach you propose: the performance impact is not to be neglected. Let me describe how trivial-auto-var-init works:

  1. clang forces initialization of all stack variables, before any initialization that results from the actual code
  2. llvm optimizes out some of these initializations, as part of its regular optimisation process
  3. llvm delays some stack initialisations so that they can be guarded by conditions, and possibly not run depending on the execution path.

This scheme is quite good at getting rid of most of the extra initialisation and keeps the overhead under control. But sometimes it fails because the init is genuinely needed, and sometimes because the compiler lacks information.
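Roughly how those three steps could play out on a toy example (a sketch; use is an arbitrary opaque callee made up for illustration):

void use(char*);           // opaque callee the compiler knows nothing about

int fully_written() {
    int x;                 // (1) clang inserts a forced store to x here
    x = 42;                // (2) the forced store is dead, so llvm removes it
    return x;
}

void guarded(bool flag) {
    char buffer[512];      // (1) clang inserts a forced memset here
    if (flag) {
        use(buffer);       // (3) llvm can sink the memset into this branch,
                           //     so the !flag path does not pay for it
    }
}

void opaque() {
    char buffer[512];      // (1) forced memset; buffer then escapes to an
    use(buffer);           //     opaque callee, so llvm has to keep it
}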

Your approach is basically an optimization of (1) but it cancels (2) and (3), which is very likely to be worse.

That being said, we could think of aggregating the stack memsets into a single one, possibly touching more memory than needed, as an optimization. I've not seen that scenario in the worst profiles though...

Re comment 8, I spoke too soon, for which I apologize. The problem is more
complex than I realised at first.

A couple of observations:

(re Tyson, comment 3)

Performing the memset once before the loop when buffer is declared (but not
initialized) in the loop seems reasonable to me.

If we do that [kind of optimisation], will we lose the ability to have a
simple story about behaviour at the C++ level? Because the effect would be
"your stack allocated variables may or may not be automatically zeroed out,
depending on compiler optimisation decisions that you have no way to know".
It would be dangerous if folks came to rely on the zeroing behaviour in that
situation, and then some change in optimisation level, or compiler version, or
whatever, led to some critical buffer not being initialised.

A second observation is that perhaps this is to some extent a problem of
messaging. Prior to this, allocation on the stack had cost O(1) regardless of
the size; now it is O(N). So we have to give up the luxury of being able to
allocate a lot of stack space but only use a small amount of it. That may be
an acceptable tradeoff.
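For example, a pattern that used to cost nothing extra but now pays for zeroing the whole buffer on every call, however little of it is actually used (a made-up illustration):

#include <cstdio>

void print_object_path(int id) {
    char path[4096];    // only a few dozen bytes ever get written, but with
                        // -ftrivial-auto-var-init all 4096 are initialised first
    std::snprintf(path, sizeof(path), "/tmp/object-%d", id);
    std::puts(path);
}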

This also suggests that what we need is some kind of toolery that can detect
such inefficient-stack-use situations. I mention this because there came into
existence a tool [1] that detects that situation for heap-allocated blocks, so
I wonder if we could find and mitigate the worst cases of performance lossage
by constructing some similar toolery for the stack.

FWIW I am all in favour of being able to use stack-zeroing in production for
Gecko, provided we have a reasonable story about what is going on at the C++
source level and that doesn't lead to the danger mentioned in the first
paragraph. So I hope these performance problems can be resolved.

Speaking at least for Valgrind/Memcheck, I view that method of detecting
uninitialized value errors as unsustainable. Over the years myself and others
have spent a great deal of effort trying to reduce its false-positive rate on
optimised code, but that is a losing battle, and I believe we are close to the
limits of what can feasibly be done by machine-code analysis.

[1] https://valgrind.org/docs/manual/dh-manual.html#dh-manual.overview
