Avoid costly initialisation under -ftrivial-auto-var-init=pattern
Categories
(Firefox Build System :: General, task, P3)
Tracking
(Not tracked)
People
(Reporter: sergesanspaille, Unassigned)
References
(Depends on 2 open bugs)
Details
As investigated in https://bugzilla.mozilla.org/show_bug.cgi?id=1515213, -ftrivial-auto-var-init
has a non-negligible impact on performance, and it is possible to instrument the compilation to dump the inserted auto-init.
Since it is possible to flag a variable declaration as not being initialized by this flag, to use heap allocation in place of stack allocation, or to change code to reduce the allocated size, it would be great to clean up the codebase in that respect.
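For illustration, here is a minimal sketch of those three mitigations (the helper names and buffer sizes are made up; Clang's uninitialized variable attribute is assumed to be available in our toolchain, and each change would need case-by-case review):
#include <stdlib.h>
#include <string.h>

void consume(char *buf, size_t len);   // hypothetical helper

void example(size_t needed) {
    // (a) Opt a hot variable out of the forced auto-init; only safe when the
    //     code provably initializes the buffer itself before any read.
    __attribute__((uninitialized)) char scratch[4096];
    memset(scratch, 'x', sizeof(scratch));
    consume(scratch, sizeof(scratch));

    // (b) Move a large buffer to the heap so the compiler does not have to
    //     pattern-fill 64 KiB of stack on every call.
    char *big = malloc(65536);
    if (big) {
        consume(big, needed < 65536 ? needed : 65536);
        free(big);
    }

    // (c) Shrink the declaration so the inserted initialization touches less memory.
    char small[64];
    consume(small, sizeof(small));
}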
Updated•2 years ago
Reporter
Comment 1•2 years ago
I've landed https://reviews.llvm.org/D148507 in LLVM upstream; this should help, and we will get it alongside clang 17.
Reporter
Comment 2•1 year ago
Potential source of optimization, but I'm unsure of the security implications: in the following case,
void foo(char*);
void bar(int n) {
    for(int i = 0; i < n; ++i) {
        char buffer[512];
        foo(buffer);
    }
}
-ftrivial-auto-var-init generates a memset in the loop, something equivalent to
void foo(char*);
void bar(int n) {
    char buffer[512];
    for(int i = 0; i < n; ++i) {
        memset(buffer, 0, sizeof(buffer));
        foo(buffer);
    }
}
The compiler cannot hoist the memset out of the loop without breaking the loop-carried dependency, which leads to extra overhead.
Yet if we only want to wipe stack memory, it would be fine to hoist the memset, with the inconvenience of potentially leaking the values from the previous iteration:
void foo(char*);
void bar(int n) {
    char buffer[512];
    memset(buffer, 0, sizeof(buffer));
    for(int i = 0; i < n; ++i) {
        // on second iteration, buffer is filled with values from first iteration
        foo(buffer);
    }
}
So hey, security people, what do you think?
Comment 3•1 year ago
Performing the memset once before the loop when buffer is declared (but not initialized) in the loop seems reasonable to me. Doing anything else might have unintended consequences if buffer is assumed to retain its contents between iterations.
Comment 4•1 year ago
Now I'm curious: is it correct (not UB) to use values in buffer from a previous iteration? If so, using -ftrivial-auto-var-init= is actually breaking programs that do so.
What is the expected output of this program?
#include <stdio.h>
int main() {
    int max = 0;
    for(int i = 0; i < 10; i++) {
        int buffer[10];
        if(i == 0) {
            buffer[0] = 0;
        }
        else {
            buffer[i] = buffer[i-1] + 1;
        }
        if (buffer[i] > max) {
            max = buffer[i];
        }
    }
    printf("%d\n", max);
    return 0;
}
https://godbolt.org/z/rfGdbPzos
https://godbolt.org/z/3h4ncTx69 (with -ftrivial-auto-var-init=zero)
Reporter
Comment 5•1 year ago
This program has undefined behavior without -ftrivial-auto-var-init=zero (the second iteration reads an uninitialized value), so the compiler is allowed to do anything at that point. Adding -ftrivial-auto-var-init=zero indeed makes the behavior deterministic, but you cannot rely on it.
This is one of the advantages of using a different init pattern for release and debug mode: developers cannot rely on a specific behavior, so we're still writing standard-conforming code.
Comment 6•1 year ago
Serge, here's a regex code search for other char buffer declarations that are indented 8+ spaces and thus might be defined inside a loop:
Reporter
Comment 7•1 year ago
Thanks! I have a compiler plugin that should fix the issue globally by doing a specific optimization; I'll test that approach first, but I appreciate the trick ;-)
Comment 8•1 year ago
(In reply to [:sergesanspaille] from comment #2)
Potential source of optimization, but I'm unsure of the security implications: in the following case
[..]
(stuff about re-zeroing buffer[512] every time around a loop)
Having used Valgrind/Memcheck to chase many uninit-values-on-the-stack bugs in
Gecko some years ago, I am inclined to think that the discussed scheme (rezero
every time around a loop, etc) is overkill. I don't remember exactly, but I
have the vague memory that most of those bugs didn't involve stack reuse.
I would think that just zeroing out at frame-creation time, and also any
sub-creation within the function (is that even allowed? isn't "stack moves
once" pretty much standard nowadays?) would get us the majority of the
benefit. That is,
sub $N, %rsp
--> sub $N, %rsp ; inlined-memset(%rsp, 0, $N)
Much as it pains me to roll out the "best is enemy of the good" cliche... even
just doing this simple initialisation would protect release builds much better
from uninitialised-stack problems than at present (wouldn't it)?
Reporter
Comment 9•1 year ago
Hey Julian,
First, some feedback on the loop-hoisting approach: it unfortunately didn't significantly reduce the speedometer3 slowdown, so I gave up on that approach.
Then, considering the approach you propose: the performance impact is not negligible. Let me describe how trivial-auto-var-init works:
1. clang forces the initialization of all stack variables, before any initialization that results from the actual code;
2. llvm optimizes out some of these initializations as part of its regular optimisation process;
3. llvm delays some stack initialisations so that they can be guarded by conditions and possibly not run, depending on the execution path.
This scheme is quite good at getting rid of most of the extra initialisation and keeps the overhead under control. But sometimes it fails because the init is actually needed, and sometimes it fails because the compiler lacks information.
Your approach is basically an optimization of (1) but it cancels (2) and (3), which is very likely to be worse.
That being said, we could think of aggregating the stack memsets into a single one, possibly touching more memory than needed, as an optimization. I've not seen that scenario in the worst profiles though...
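For what it's worth, here is a small sketch of what point 3 buys us (the function names are invented): when a large buffer is only touched on one branch, the inserted fill can be guarded by the same condition, so the common path doesn't pay for it.
#include <string.h>

void process(const char *buf, size_t len);   // hypothetical consumer

void handle(int verbose, const char *msg, size_t len) {
    char big[8192];   // auto-init inserts an 8 KiB pattern/zero fill here
    if (verbose) {
        // Since big is only used on this path, the inserted fill can be
        // delayed and guarded by the same condition, so the !verbose path
        // skips it entirely.
        size_t n = len < sizeof(big) ? len : sizeof(big);
        memcpy(big, msg, n);
        process(big, n);
    }
}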
Comment 10•1 year ago
Re comment 8, I spoke too soon, for which I apologize. The problem is more
complex than I realised at first.
A couple of observations:
(re Tyson, comment 3)
Performing the memset once before the loop when buffer is declared (but not
initialized) in the loop seems reasonable to me.
If we do that [kind of optimisation], will we lose the ability to have a
simple story about behaviour at the C++ level? Because the effect would be
"your stack allocated variables may or may not be automatically zeroed out,
depending on compiler optimisation decisions that you have no way to know".
It would be dangerous if folks came to rely on the zeroing behaviour in that
situation, and then some change in optimisation level, or compiler version, or
whatever, led to some critical buffer not being initialised.
A second observation is that perhaps this is to some extent a problem of
messaging. Prior to this, allocation on the stack had cost O(1) regardless of
the size; now it is O(N). So we have to give up the luxury of being able to
allocate a lot of stack space but only use a small amount of it. That may be
an acceptable tradeoff.
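To make that concrete, here is a hypothetical before/after of the pattern that the O(N) cost punishes (names and sizes invented): a conservatively oversized buffer that is normally only a few bytes full now gets fully filled on every call, so right-sizing it recovers most of the cost.
#include <stdio.h>

// Before: 4 KiB reserved "just in case"; under auto-init every call now
// zero/pattern-fills the whole 4 KiB even for short names.
void log_name_old(const char *name) {
    char line[4096];
    snprintf(line, sizeof(line), "name=%s\n", name);
    fputs(line, stderr);
}

// After: size for the common case and keep a cheap fallback for the rare
// oversized input (sketch only; the fallback could also go to the heap).
void log_name_new(const char *name) {
    char line[128];
    int n = snprintf(line, sizeof(line), "name=%s\n", name);
    if (n < 0 || n >= (int)sizeof(line)) {
        fputs("name=<too long>\n", stderr);
        return;
    }
    fputs(line, stderr);
}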
This also suggests that what we need is some kind of toolery that can detect
such inefficient-stack-use situations. I mention this because there came into
existence a tool [1] that detects that situation for heap-allocated blocks, so
I wonder if we could find and mitigate the worst cases of performance lossage
by constructing some similar toolery for the stack.
FWIW I am all in favour of being able to use stack-zeroing in production for
Gecko, provided we have a reasonable story about what is going on at the C++
source level and that doesn't lead to the danger mentioned in the first
paragraph. So I hope these performance problems can be resolved.
Speaking at least for Valgrind/Memcheck, I view that method of detecting
uninitialized value errors as unsustainable. Over the years myself and others
have spent a great deal of effort trying to reduce its false-positive rate on
optimised code, but that is a losing battle, and I believe we are close to the
limits of what can feasibly be done by machine-code analysis.
[1] https://valgrind.org/docs/manual/dh-manual.html#dh-manual.overview