Investigation usage of clang -ftrivial-auto-var-init flag
Categories
(Developer Infrastructure :: Source Code Analysis, task, P2)
Tracking
(Not tracked)
People
(Reporter: pauljt, Unassigned)
References
(Depends on 3 open bugs, Blocks 1 open bug, )
Details
(Keywords: sec-want, Whiteboard: investigation)
Attachments
(1 file)
9.44 MB,
text/x-log
|
Details |
Comment 1•6 years ago
|
||
Updated•6 years ago
|
Reporter | ||
Updated•6 years ago
|
Updated•5 years ago
|
Comment 6•3 years ago
|
||
Here is the Chrome bug to enable -ftrivial-auto-var-init=pattern in official release builds. It's currently enabled in only non-official builds for desktop and ChromeOS, but not Android.
https://bugs.chromium.org/p/chromium/issues/detail?id=977230
Comment 7•3 years ago
•
|
||
Here is a perfherder comparison with -ftrivial-auto-var-init=pattern
enabled on an opt build.
Additional notes:
-
In a recent fuzzing collaboration meeting it was mentioned that efforts to run MSan instrumented builds of Chromium will likely not continue. Comment #6 has a link to the Chromium bug to enable
-ftrivial-auto-var-init=pattern
-
MSan and Valgrind are not reasonable methods of identifying issues. MSan would required a huge initial investment to get Firefox and all of it's dependencies built and instrumented and would require ongoing maintenance (and as mentioned above is being moved away from by other vendors). Valgrind requires a special build (which is no longer maintained on TC) and requires constant suppression list maintenance. It also adds an unreasonable runtime performance overhead that makes fuzzing much less effective. Both methods depend on test suites and fuzzing to identify issues.
-
Enabling
-ftrivial-auto-var-init=pattern
should not have ongoing development costs and does not rely on test suites and fuzzing to identify issues to keep users safe. It mitigates risks without treating uninitialized memory usage as defined behavior (preventing developer from actively relying on it). -
Forced initialization should be optimized out when using optimization levels >
0
if memory is properly initialized before use. -
no-opt
builds will suffer a higher performance regression (as mentioned previously) and should be excluded from the list of builds enabling this flag.
Questions that I think need to be answered before moving forward are:
-
What is the threshold or upper limit of the acceptable performance regression?
-
How can the performance regressions identified by the above perfherder link be investigated further?
-
Are there areas of the code that we know should be excluded?
-
Would it be reasonable to enable for Nightly and early Beta builds? This is only meant as an alternative if enabling for all channels is for some reason not an option. This would help shake out any existing latent or intermittent crashes dues to uninitialized memory usage.
-
- Would this give us any additional performance metrics?
-
- Should we start on a single platform?
Updated•3 years ago
|
Comment 9•3 years ago
|
||
Would it be reasonable to enable for Nightly and early Beta builds? This would help shake out any existing latent or intermittent crashes dues to uninitialized memory usage.
Would it? Wouldn't it paper over issues in Nightly / Early beta, but crash and/or be potentially exploitable in release? (I might be misunderstanding something though)
Comment 10•3 years ago
|
||
Quick drive-by question: I'm wondering what this does with AutoTArray
(and similar things like nsAutoString
) - my guess would be that initialization of its stack buffer is something that doesn't tend to get optimized out and actually affects us?
Comment 11•3 years ago
|
||
(In reply to Emilio Cobos Álvarez (:emilio) from comment #9)
Would it be reasonable to enable for Nightly and early Beta builds? This would help shake out any existing latent or intermittent crashes dues to uninitialized memory usage.
Would it? Wouldn't it paper over issues in Nightly / Early beta, but crash and/or be potentially exploitable in release? (I might be misunderstanding something though)
At the moment there is no mitigation in place, so all channels are vulnerable. The pattern used may or my not trigger crashes but it is not zero and it's consistent so if it does trigger a crash then it should be much easier to track down the cause. The pattern should also be much easier to eye-ball in crash stats. The intention is not to find the bugs that just a bonus :)
Of course the intention is to enable in all channels.
Comment 12•3 years ago
|
||
Ah i see. As long as it makes stuff more likely to crash on nightly then it'd be an improvement i guess :)
Comment 13•3 years ago
•
|
||
From the Chrome bug:
Binary size increase: 1.5%
linux-perf/speedometer2 Total Score: 2.5% slower
linux-perf/blink_perf.dom and linux-perf/dromaeo regression vary from 0 to 5%
FYI. Zero initialization:
Binary size increase: 1.05%
linux-perf/speedometer2 Total Score: 0.5% faster
linux-perf/blink_perf.dom and linux-perf/dromaeo from up 12% speedup
to 4.5% slowdown
Also, I see a bunch of followup bugs to fix 5-15% regressions in things like VP9 encode, CSS, etc (disabling the flag for those I think).
The perfherder results need more runs; most are "medium" confidence - small differences need more runs for statistical confidence. I see small regressions is most speedindex numbers, and other performance numbers, typically in the 1-6% range.
Some focused benchmarks show large regressions, like WebAudio - up to 33%: https://treeherder.mozilla.org/perfherder/comparesubtest?originalProject=try&newProject=try&newRevision=7f330cc417d914f6fd6a0107307884712a5bca7b&originalSignature=3442639&newSignature=3442639&framework=13&originalRevision=ef2d1eaa5a80d405df665726cc3e6497289db79d&page=1
Taking this sort of hit on pageload would be concerning, and would require a lot more looking and thought about the impact (and pointwise mitigations/disabling).
Comment 14•3 years ago
•
|
||
(In reply to Doug Thayer [:dthayer] (he/him) from comment #10)
Quick drive-by question: I'm wondering what this does with
AutoTArray
(and similar things likensAutoString
) - my guess would be that initialization of its stack buffer is something that doesn't tend to get optimized out and actually affects us?
Yeah - here's an example: https://searchfox.org/mozilla-central/rev/20ebe9f39e8790e2db123f89fab3293dc0a796ac/dom/base/nsFrameLoader.cpp#1946
Here's the important bit of disassembly around this line in the shippable build with -ftrivial-auto-var-init
:
...
60056d3: ba 00 01 00 00 mov $0x100,%edx
60056d8: 48 89 df mov %rbx,%rdi
60056db: be aa 00 00 00 mov $0xaa,%esi
60056e0: ff 15 12 15 eb 03 callq *0x3eb1512(%rip) # 9eb6bf8 <memset@GLIBC_2.2.5>
60056e6: 41 0f 10 84 24 68 02 movups 0x268(%r12),%xmm0
...
We memset the 256-byte buffer (16 nsID
s which are each 16 bytes) to 0xaa
before passing it into nsISHistory::RemoveEntries
. The joke here is that we're only filling it with one item to begin with and nsISHistory::RemoveEntries
doesn't append to the ids
array and then we don't do anything with the array afterwards.
Can we decorate AutoTArray
's mAutoBuf
with __attribute((uninitialized))
to get the compiler to explicitly leave it uninitialized? In these types of cases of stack-storage container types I think that's relatively safe.
EDIT: it appears the uninitialized
attribute only works for local variables. Rats.
Comment 15•3 years ago
|
||
Assuming we can't mitigate pattern initialization performance enough to make it feasible everywhere, another option would be to use pattern-init in nightly/beta and zero-init in release -- assuming the latter is better optimized based on chromium's measurements in comment 13. That way we can't unintentionally rely on zero-init semantics but still get the security benefits of initialization.
Comment 16•3 years ago
|
||
Tyson, it's interesting that the perfherder comparison does flag some reproducible regressions.
I would like to see how this comparison does on a Windows shippable (aka PGO) build, where the majority of users are.
Also, can we include the Talos performance tests? If we ask in #perftest:mozilla.org we should be able to get some help on the useful tests.
Comment 17•3 years ago
|
||
(In reply to Tyson Smith [:tsmith] from comment #7)
- What is the threshold or upper limit of the acceptable performance regression?
There is no such threshold defined. Perfherder alerts on regressions exceeding 2%. Regressions of over 100ms are more likely to be perceived by users, however we risk accumulation of smaller regressions contributing to a perceivable degradation over time.
- How can the performance regressions identified by the above perfherder link be investigated further?
The jobs can be retriggered with the profiler enabled, which can help with investigation.
Updated•2 years ago
|
Updated•2 years ago
|
Comment 18•2 years ago
|
||
While doing some extra bibliography on the subject, I realized that clang supports the attribute __attribute__((uninitialized))
which means that if the source of regression can be identified, and the initialization here is not needed, there is a way out, which is great.
Comment 19•2 years ago
|
||
The following methodology makes it possible to detect individual stack allocation that would require some care (i.e. either code rewrite or manual annotation) to not be automatically initialized:
MOZCONFIG=./local-llvm.mozconfig ./mach build
where local-llvm.mozconfig
contains
ac_add_options --enable-optimize="-O2 -ftrivial-auto-var-init=pattern -g -Rpass-missed=annotation-remarks"`
which outputs, for instance
0:16.45 /home/ssp/sources/mozilla-unified/security/nss/lib/certhigh/ocspsig.c:563:22: remark: Call to memset inserted by -ftrivial-auto-var-init. Memory operation size: 48 bytes.
0:16.45 Written Variables: response (48 bytes). [-Rpass-missed=annotation-remarks]
0:16.45 CERTOCSPResponse response;
Would that make sense to go through these?
Comment 20•2 years ago
|
||
(In reply to [:sergesanspaille] from comment #19)
The following methodology makes it possible to detect individual stack allocation that would require some care (i.e. either code rewrite or manual annotation) to not be automatically initialized:
MOZCONFIG=./local-llvm.mozconfig ./mach build
where
local-llvm.mozconfig
containsac_add_options --enable-optimize="-O2 -ftrivial-auto-var-init=pattern -g -Rpass-missed=annotation-remarks"`
which outputs, for instance
0:16.45 /home/ssp/sources/mozilla-unified/security/nss/lib/certhigh/ocspsig.c:563:22: remark: Call to memset inserted by -ftrivial-auto-var-init. Memory operation size: 48 bytes. 0:16.45 Written Variables: response (48 bytes). [-Rpass-missed=annotation-remarks] 0:16.45 CERTOCSPResponse response;
Would that make sense to go through these?
Comment 21•2 years ago
|
||
(In reply to [:sergesanspaille] from comment #20)
Would that make sense to go through these?
I think so, if it is not an unreasonable amount of effort. We will likely come across some potential security issues. I wonder what the ratio of code rewrite vs manual annotation will be.
Let's track this work in another bug. Can you please create a tracking bug and link issues to it when you have time?
Comment 22•2 years ago
•
|
||
So Serge's technique lists all places where we are adding a memset call (and potentially introducing a slowdown) - we don't have to manually edit the code or add an annotation (the flag is adding the memset call for us) - we only need to do that if it provides us a benefit (i.e. fixes a perf regression).
If we can cross-reference the list with code coverage we might be able to identify which of those are really hot paths and potentially be the source of measurable slowdown...
Comment 23•2 years ago
|
||
Comment 24•2 years ago
|
||
Attached to this comment a sorted list of places were auto-init have been added. I don't have coverage info, but some values are frightening. I'll give a try at removing the most unneeded ones. I've opened https://bugzilla.mozilla.org/show_bug.cgi?id=1808891 to track those.
Comment 25•2 years ago
|
||
That way we can't unintentionally rely on zero-init semantics but still get the security benefits of initialization.
Why shouldn't we rely on zero-init semantics? We're using clang only these days, but MSVC&GCC also supporting zero-init suggests that it's plausible enough as a long-term multi-compiler feature that it doesn't paint us in a corner. (Personally, I like to have faith that the committee will eventually do the reasonable thing and accept the paper to hoist zero init to the standard.)
A productivity anecdote to balance against performance issues:
For reasons, bug 543435 went on hold for a while. When I resumed work and rebased, all tests were orange even though reftests and crashtests as well as most WPTs had been green previously. I couldn't bisect, because the mach bootstrap
toolchain hosting wasn't valid far enough back. I saw that the failure was in Marionette code that was waiting for the initial browser window to open, so that explained why all test suites were affected. I tried undoing the most recent changes to my patch. Didn't help. I tried investigating recent changes to Marionette. Nope. In Pernosco, I saw that browser.xhtml
never fired its load event. This was a plausible failure mode that could arise from the changes in my patch interacting badly with some change done on trunk meanwhile. While failing to find anything obvious, I found another issue and got a bit ahead of myself trying to fix that other thing before finding the root cause. That other attempt caused an assertion to fire in a plausible way. However, when I tried to record it, a different assertion fired. I debugged that in Pernosco. By analyzing where a weird value came from, I eventually realized that I had failed to initialize one bit in a field of adjacent 1-bit bool
s. clang had made the effort to let the bit show through from whatever was there previously instead of making use of the UB to just do a wide zero write across the uninitialized bit and adjacent zero-initialized bits. Either this had changed between mach bootstrap
-provided clang versions or the surrounding code had changed what showed through the uninitialized bit.
The failure mode involved plausible incorrect rabbit holes, I bothered people in meetings about the issue to get ideas, the problem was resolved partly by luck (that I tried to advance something different before debugging the problem at had to completion), and the whole productivity-harming thing could have been avoided by zero-initializing by default.
Comment 26•2 years ago
|
||
It's been pointed out to me that the clang flag here and the paper only cover automatic storage duration and the bitfield from comment 25 was a member of a heap-allocated object. Anyway, the general point of what kind of grief uninitialized values can cause should still be relevant.
Comment 27•1 years ago
|
||
Extra data point: major clang users have switched to use -ftrivial-auto-var-init=pattern
in debug build and -ftrivial-auto-var-init=zero
in release builds, leading clang to support the later while it was being discouraged at some point. There's a very nice thread on the associated RFC
https://discourse.llvm.org/t/making-ftrivial-auto-var-init-zero-a-first-class-option/55143
I''l compare the two approach soon on firefox builds.
Comment 28•1 years ago
|
||
Current speedometer3 results with -ftrivial-auto-var-init=pattern
:
Current speedometer3 results with -ftrivial-auto-var-init=zero
:
Basically, and as expected, it's fatser to initialize to zero rather than a pattern. Maybe surprisingly, we even get a speedup on Linux by doing so, probably because it fills the cache in a more linear pattern (educated guess).
Having a lot at build metrics, the generated binary is also smaller, so that really looks good. And we're left with a -1.5% speedometer3 regression on Windows.
Comment 29•1 year ago
|
||
Chrome is removing -ftrivial-auto-var-init from (AV1 decoder) libaom's compilation flags because it added over 13% overhead in local tests compared with standalone libaom:
https://bugs.chromium.org/p/chromium/issues/detail?id=977230#c141
Description
•