Open Bug 1808891 Opened 2 years ago Updated 1 year ago

Avoid costly initialisation under -ftrivial-auto-var-init=pattern

Categories

(Firefox Build System :: General, task, P3)

task

Tracking

(Not tracked)

People

(Reporter: sergesanspaille, Unassigned)

References

(Depends on 2 open bugs)

Details

As investigated in https://bugzilla.mozilla.org/show_bug.cgi?id=1515213, -ftrivial-auto-var-init has a non-negligible impact on performance, and it is possible to instrument the compilation to dump the inserted auto-init.

As it is possible to flag a variable declaration as not being initialized by this flag, to use heap allocation in place of stack allocation, or to change the code to reduce the allocated size, it would be great to clean up the codebase in that respect.
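For instance, clang's [[clang::uninitialized]] attribute can opt a single, audited variable out of the forced initialization. A minimal sketch (wipe_and_fill is a hypothetical helper assumed to write every byte of the buffer before anything reads it):

#include <cstddef>

void wipe_and_fill(char*, std::size_t);   // hypothetical helper that writes the whole buffer

void hot_path() {
    // Opted out of -ftrivial-auto-var-init after auditing that wipe_and_fill
    // writes the full buffer before any read.
    [[clang::uninitialized]] char buffer[4096];
    wipe_and_fill(buffer, sizeof(buffer));
}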

Severity: -- → S3
Priority: -- → P3

I've landed https://reviews.llvm.org/D148507 in LLVM upstream; this should help, and we will get it alongside clang 17.

Type: enhancement → task
Depends on: 1844513
Depends on: 1844520
Depends on: 1844528

Potential source of optimization, but I'm unsure of the security implication: In the following case

void foo(char*);
void bar(int n) {
    for(int i = 0; i < n; ++i) {
        char buffer[512];
        foo(buffer);
    }
}

-ftrivial-auto-var-init generates a memset in the loop, something equivalent to

void foo(char*);
void bar(int n) {
    char buffer[512];
    for(int i = 0; i < n; ++i) {
        memset(buffer, 0, sizeof(buffer));
        foo(buffer);
    }
}

The compiler cannot hoist the memset without breaking a loop-carried dependency, which leads to extra overhead.
Yet if we only want to wipe stack memory, it would be fine to hoist the memset, with the inconvenience of potentially leaking the values from the previous iteration:

void foo(char*);
void bar(int n) {
    char buffer[512];
    memset(buffer, 0, sizeof(buffer));
    for(int i = 0; i < n; ++i) {
        // on second iteration, buffer is filled with values from first iteration
        foo(buffer);
    }
}

So hey, security people, what do you think?

Flags: needinfo?(twsmith)
Flags: needinfo?(tom)
Depends on: 1850315

Performing the memset once before the loop when buffer is declared (but not initialized) in the loop seems reasonable to me.

Doing anything else might have unintended consequences if buffer is assumed to retain its contents between iterations.

Flags: needinfo?(twsmith)

Now I'm curious, is it correct (not UB) to use values in buffer from a previous iteration? If so, using -ftrivial-auto-var-init= is actually breaking programs that do so.

What is the expected output of this program?

#include <stdio.h>

int main() {
    int max = 0;
    for(int i = 0; i < 10; i++) {
        int buffer[10];
        if(i == 0) {
            buffer[0] = 0;
        }
        else {
            buffer[i] = buffer[i-1] + 1;
        }
        if (buffer[i] > max) {
            max = buffer[i];
        }
    }
    printf("%d\n", max);
    return 0;
}

https://godbolt.org/z/rfGdbPzos
https://godbolt.org/z/3h4ncTx69 (with -ftrivial-auto-var-init=zero)

This program has undefined behavior without -ftrivial-auto-var-init=zero (the second iteration reads an uninitialized value). The compiler is allowed to do anything at that point. Adding -ftrivial-auto-var-init=zero indeed makes the behavior deterministic, but you cannot rely on it.

This is one of the advantages of using a different init pattern for release and debug modes: developers cannot rely on a specific behavior, so we keep writing standard code.
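For what it's worth, the standard-conforming way to carry values across iterations is to give the buffer a lifetime that spans the loop; a minimal rewrite of the program above:

#include <stdio.h>

int main() {
    int max = 0;
    int buffer[10];                        /* lifetime now spans the loop */
    for(int i = 0; i < 10; i++) {
        if(i == 0) {
            buffer[0] = 0;
        }
        else {
            buffer[i] = buffer[i-1] + 1;   /* reads a value stored on the previous iteration: well-defined */
        }
        if (buffer[i] > max) {
            max = buffer[i];
        }
    }
    printf("%d\n", max);                   /* prints 9 */
    return 0;
}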

Flags: needinfo?(tom)
Depends on: 1850505
Depends on: 1850569
Depends on: 1850948
Depends on: 1850951

Serge, here's a regex code search for other char buffer declarations that are indented 8+ spaces and thus might be defined inside a loop:

https://searchfox.org/mozilla-central/search?q=%5E%5Cs%7B8%2C%7Dchar%5Cs%2B%5Cw%2B%5C%5B.%2B%5C%5D%5B%3B%2C%5D&path=&case=true&regexp=true
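For reference, the decoded regex is

    ^\s{8,}char\s+\w+\[.+\][;,]

i.e. a char array declaration indented by at least eight spaces and terminated by ';' or ','.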

Flags: needinfo?(sguelton)

Thanks! I have a compiler plugin that should fix the issue globally by doing a specific optimization; I'll test that approach first, but I appreciate the trick ;-)

Flags: needinfo?(sguelton)

(In reply to [:sergesanspaille] from comment #2)

Potential source of optimization, but I'm unsure of the security implication: In the following case
[..]
(stuff about re-zeroing buffer[512] every time around a loop)

Having used Valgrind/Memcheck to chase many uninit-values-on-the-stack bugs in
Gecko some years ago, I am inclined to think that the discussed scheme (rezero
every time around a loop, etc) is overkill. I don't remember exactly, but I
have the vague memory that most of those bugs didn't involve stack reuse.

I would think that just zeroing out at frame-creation time, and also any
sub-creation within the function (is that even allowed? isn't "stack moves
once" pretty much standard nowadays?) would get us the majority of the
benefit. That is,
sub $N, %rsp --> sub $N, %rsp ; inlined-memset(%rsp, 0, $N).

Much as it pains me to roll out the "best is the enemy of the good" cliche .. even
just doing this simple initialisation would protect release builds much better
from uninitialised-stack problems than at present (wouldn't it)?

Hey Julian,

First some feedback on the loop hoisting approach: it unfortunately didn't significantly reduce the speedometer3 slowdown, so I gave up on that approach.

Then considering the approach you propose: the performance impact is not to be neglected. Let me describe how trivial-auto-var-init works:

  1. clang forces initialization of all stack variables, before any initialization that results from the actual code
  2. llvm optimizes out some of these initializations, as part of its regular optimisation process
  3. llvm delays some stack initialisations so that they can be guarded by conditions, and possibly not run depending on the execution path.

This scheme is quite good at getting rid of most of the extra initialisation and keeps the overhead under control. But sometimes it fails because the init is genuinely needed, and sometimes because the compiler lacks information.
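Roughly how those three steps could play out on a toy example (a sketch; use is an arbitrary opaque callee made up for illustration):

void use(char*);           // opaque callee the compiler knows nothing about

int fully_written() {
    int x;                 // (1) clang inserts a forced store to x here
    x = 42;                // (2) the forced store is dead, so llvm removes it
    return x;
}

void guarded(bool flag) {
    char buffer[512];      // (1) clang inserts a forced memset here
    if (flag) {
        use(buffer);       // (3) llvm can sink the memset into this branch,
                           //     so the !flag path does not pay for it
    }
}

void opaque() {
    char buffer[512];      // (1) forced memset; buffer then escapes to an
    use(buffer);           //     opaque callee, so llvm has to keep it
}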

Your approach is basically an optimization of (1) but it cancels (2) and (3), which is very likely to be worse.

That being said, we could think of aggregating the stack memsets into a single one, possibly touching more memory than needed, as an optimization. I've not seen that scenario in the worst profiles though...

Re comment 8, I spoke too soon, for which I apologize. The problem is more
complex than I realised at first.

A couple of observations:

(re Tyson, comment 3)

Performing the memset once before the loop when buffer is declared (but not
initialized) in the loop seems reasonable to me.

If we do that [kind of optimisation], will we lose the ability to have a
simple story about behaviour at the C++ level? Because the effect would be
"your stack allocated variables may or may not be automatically zeroed out,
depending on compiler optimisation decisions that you have no way to know".
It would be dangerous if folks came to rely on the zeroing behaviour in that
situation, and then some change in optimisation level, or compiler version, or
whatever, led to some critical buffer not being initialised.

A second observation is that perhaps this is to some extent a problem of
messaging. Prior to this, allocation on the stack had cost O(1) regardless of
the size; now it is O(N). So we have to give up the luxury of being able to
allocate a lot of stack space but only use a small amount of it. That may be
an acceptable tradeoff.
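For example, a pattern that used to cost nothing extra but now pays for zeroing the whole buffer on every call, however little of it is actually used (a made-up illustration):

#include <cstdio>

void print_object_path(int id) {
    char path[4096];    // only a few dozen bytes ever get written, but with
                        // -ftrivial-auto-var-init all 4096 are initialised first
    std::snprintf(path, sizeof(path), "/tmp/object-%d", id);
    std::puts(path);
}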

This also suggests that what we need is some kind of toolery that can detect
such inefficient-stack-use situations. I mention this because there came into
existence a tool [1] that detects that situation for heap-allocated blocks, so
I wonder if we could find and mitigate the worst cases of performance lossage
by constructing some similar toolery for the stack.

FWIW I am all in favour of being able to use stack-zeroing in production for
Gecko, provided we have a reasonable story about what is going on at the C++
source level and that doesn't lead to the danger mentioned in the first
paragraph. So I hope these performance problems can be resolved.

Speaking at least for Valgrind/Memcheck, I view that method of detecting
uninitialized value errors as unsustainable. Over the years myself and others
have spent a great deal of effort trying to reduce its false-positive rate on
optimised code, but that is a losing battle, and I believe we are close to the
limits of what can feasibly be done by machine-code analysis.

[1] https://valgrind.org/docs/manual/dh-manual.html#dh-manual.overview
