Closed Bug 1588710 Opened 5 years ago Closed 3 years ago

Enable Stack clash protection on supported OS / arch

Categories

(Firefox Build System :: General, task, P2)

task

Tracking

(relnote-firefox -, firefox86 fixed)

RESOLVED FIXED
86 Branch
Tracking Status
relnote-firefox --- -
firefox86 --- fixed

People

(Reporter: Sylvestre, Assigned: Sylvestre)

References

Details

(Keywords: sec-want, Whiteboard: [adv-main86-])

Attachments

(6 files, 1 obsolete file)

This is a feature provided by gcc for a few years now.
clang doesn't have it and it has been marked as a blocker for redhat/fedora to use clang to build Firefox.

Serge started the work on the clang side to implement it:
http://lists.llvm.org/pipermail/llvm-dev/2019-October/135786.html
https://reviews.llvm.org/D68720

He reached out to me to evaluate the change for Firefox performances.

Opening a bug for transparency and sharing results

If there isn't performance regression, we might want to add this to our official builds.

Is it possible for him to make a backport of the patch to the clang 8 branch? That would be easiest for us to apply and test.

With the perf results:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central&newProject=try&newRevision=3850d59d0d33e2ca0c746c54f059304306368972&framework=1&selectedTimeRange=172800

I do see an improvement from 3 to 4% on Linux.
I am a bit confused to see Windows improvement as I didn't touch it (the stack clash patch is only applied on clang Linux) but the confidence is smaller.
Anyway, it matches what Serge told me. He saw improvements on some projects (and no regression yet).
I would however appreciate a confirmation by an actual perf expert.

So, if this is the case, once landed, we could even take the patch to clang 9 to avoid having to wait a few more months (clang 10 release)

:jmaher is who I usually go to for such requests, but he is on leave.

Would Serge be open to improving his patch to include Windows? Or at least documenting what would be needed to do so?

Keywords: sec-want

Comparing try with try isn't giving the same kind of results (slower)
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&newProject=try&newRevision=3850d59d0d33e2ca0c746c54f059304306368972&framework=1&selectedTimeRange=172800

I am a bit lost on how to understand these results.

On dmajor suggestion, I reused 1db6f0e4ceca73e401ca60d0e93ad29cbe1c3949 as base line (the clang patch but not enabled in our builds) and retriggered the talos jobs

Here is the comparison:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=1db6f0e4ceca73e401ca60d0e93ad29cbe1c3949&newProject=try&newRevision=3850d59d0d33e2ca0c746c54f059304306368972&framework=1

Seems that there isn't any significant improvement or regression after all. Except maybe a regression in "about_preferences_basic opt e10s stylo" on Linux opt (not pgo)

Please also run raptor tests, at least speedometer. Should we also test this on Windows builds?

Serge told me that it should not have impact on Windows (and not needed there)

@dmajor, @tjr : The windows ABI already requires a touch on each stack page, so Windows shouldn't be affected, (not the case of the patch you're testing with, but should be the case in the near future). Basically it means that clang-cl will support -fstack-clash-protection, but it should be a no-op for Windows.

@sylvestre: thanks for the feedback. Having firefox passing its validation with this patch is very good news. I've made several significant changes to the original patch, so an extra round may be needed, keep in touch.

@all: once the patch lands in llvm's master, I'll provide a backport if you want.

Seems that there isn't any performance change (but I might be wrong as I am no expert in perfherder)

Assignee: nobody → sledru

Dave Hunt told me the following:

If so, this suggests that your patch has caused a high confidence regression to the kraken results across all platforms. The kraken benchmark measures JavaScript performance, details can be found at https://wiki.mozilla.org/TestEngineering/Performance/Talos/Tests#kraken. There also appears to be a high confidence improvement to tabpaint across all platforms, which is documented here: https://wiki.mozilla.org/TestEngineering/Performance/Talos/Tests#tabpaint. There are a few other medium to high confidence improvements and regressions.

I see references to linux64, macos64, and win64 in your patch here https://hg.mozilla.org/try/rev/408e7270af623961e8cb466aa31b18248aa3e512 could these explain the cross-platform impact? Were the tasks executed around the same time? The reason I ask is that if the base tasks were triggered first, and the new triggered much later, perhaps an infrastructure change could have been introduced in the meantime.

Serge told me that he will have a look why Windows performances are impacted

After discussing with dmajor, he told me that I have been probably impacted by another unrelated issue.

So, comparing with and without the option activated, I saw that:

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=11d4b305e91184937dc2a2f3db7974653faabf87&newProject=try&newRevision=90d5778298672bc8e1b7f961606225ca9a867330&framework=1

Seems that we are only regressing on Windows 7:
tp5n main_startup_fileio opt e10s stylo
tp5n time_to_session_store_window_restored_ms opt e10s stylo

Attached image image.png

for legacy

Seems that we are only regressing on Windows 7:
tp5n main_startup_fileio opt e10s stylo
tp5n time_to_session_store_window_restored_ms opt e10s stylo

I'm not worried about these two, historically they haven't been super reliable for me.

@sylvestre: I've explicitly disabled the flag for non-Linux targets, users receive a warning if they try to use it for non-Linux, non-x86:x64 platform. I've also setup an extra validation step for the review, it should be fine : https://github.com/serge-sans-paille/llvm-project/pull/3/checks

Serge provided me with a new patch to skip the execution of this code on Windows & Mac.

I just started some talos job;

  1. without the option enabled:
    https://treeherder.mozilla.org/#/jobs?repo=try&revision=b67d19f2d016cfd19e2854fdd2334c7540a96b31
    (just clang + a normal fx build)
    https://hg.mozilla.org/try/rev/41b41afc9bc18557a86db05becdb0f6b6f154a10

  2. with the option activated:
    https://treeherder.mozilla.org/#/jobs?repo=try&revision=a405edf572eb19ba33b829bb691a3a3095e68cc6
    https://hg.mozilla.org/try/rev/cc533026241361425a0991c01e4bff3bb4239605

I kept the Windows & Mac part to make sure that it behaves correctly

Looking at the results ( https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=a405edf572eb19ba33b829bb691a3a3095e68cc6&newProject=try&newRevision=b67d19f2d016cfd19e2854fdd2334c7540a96b31&framework=1 ) . I don't seen any regression on Windows or Mac anymore

Just a small regression with displaylist_mutate opt e10s stylo on Linux 64 (-1.2%)

(In reply to Sylvestre Ledru [:Sylvestre] from comment #20)

Looking at the results ( https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=a405edf572eb19ba33b829bb691a3a3095e68cc6&newProject=try&newRevision=b67d19f2d016cfd19e2854fdd2334c7540a96b31&framework=1 ) . I don't seen any regression on Windows or Mac anymore

Just a small regression with displaylist_mutate opt e10s stylo on Linux 64 (-1.2%)

It looks like there are only five runs before and after. At that level, we can't draw a conclusion because the tests have too much variance. Also there aren't any raptor runs in there. Depending on how intense you want to be about this experiment, I would say we need more data. But if I understand correctly this is sort of just a casual run, so maybe this combined with the previous testing is sufficient.

Not sure we want to enable it as it might have perf impact

(In reply to Sylvestre Ledru [:Sylvestre] from comment #23)

Serge told me that we would need:
https://github.com/llvm/llvm-project/commit/0f60bcc36c34522618bd1425a45f8c6006568fb6

For the record: that patch fixes probing for dynamic alloca. If firefox is not using VLA or explicit call to alloca (or any other construct that generates dynamic alloca generation (?)), this patch should have no impact.

caused:
[task 2020-11-05T10:14:26.012Z] 10:14:26 INFO - In file included from Unified_cpp_sandbox_linux2.cpp:137:
[task 2020-11-05T10:14:26.012Z] 10:14:26 ERROR - /builds/worker/checkouts/gecko/security/sandbox/chromium/sandbox/linux/seccomp-bpf/syscall.cc:369:3: error: Unable to protect inline asm that clobbers stack pointer against stack clash [-Werror,-Wstack-protector]
[task 2020-11-05T10:14:26.012Z] 10:14:26 INFO - asm volatile(
[task 2020-11-05T10:14:26.013Z] 10:14:26 INFO - ^
[task 2020-11-05T10:14:26.013Z] 10:14:26 INFO - 1 error generated.

Depends on D97566

Attachment #9185941 - Attachment description: Bug 1588710 - Linux: enable -fstack-clash-protection → Bug 1588710 - Linux: enable -fstack-clash-protection r?#build,tjr
Attachment #9188818 - Attachment description: Bug 1588710 - clang stack clash - Cherry-pick an upstream fix #build,tjr,sguelton → Bug 1588710 - clang stack clash - Cherry-pick an upstream fix
Attachment #9188818 - Attachment description: Bug 1588710 - clang stack clash - Cherry-pick an upstream fix → Bug 1588710 - clang stack clash - Cherry-pick an upstream fix r?#build,tjr

AFAIK, no perf regression:
https://treeherder.mozilla.org/perfherder/compare?originalProject=try&originalRevision=a47c98b909b61035dae2e1e00883f2ade0fef129&newProject=try&newRevision=62108fa48bd15fe01f1a0f1ffab133af9b4207cc&framework=13

For now, enabling it when built with clang >= 11.0.0 on x86, x86_64, s390x & ppc64 on non Windows.
According to the clang author (Serge), it is built-in on Windows

Summary: Evaluate the performance impact of the Stack clash protection → Enable Stack clash protection on support OS / arch
Summary: Enable Stack clash protection on support OS / arch → Enable Stack clash protection on supported OS / arch
Attachment #9188820 - Attachment description: Bug 1588710 - Disable stack protector on some asm chromium & breakpad sandboxing code r?#build → Bug 1588710 - Do not fail on stack protector on some asm chromium & breakpad sandboxing code r?#build
Pushed by sledru@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/aaa9af4e5eba
Do not fail on stack protector on some asm chromium & breakpad sandboxing code r=mhentges
Keywords: leave-open
Pushed by sledru@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/31f2b92312ab
Try to use clang 11 for stackwalk r=firefox-build-system-reviewers,andi,mhentges
Attachment #9188818 - Attachment description: Bug 1588710 - clang stack clash - Cherry-pick an upstream fix r?#build,tjr → Bug 1588710 - clang stack clash - Cherry-pick an upstream fixes r?#build,tjr
Attachment #9185941 - Attachment description: Bug 1588710 - Linux: enable -fstack-clash-protection r?#build,tjr → Bug 1588710 - Linux/Android: enable -fstack-clash-protection r?#build,tjr
Attachment #9188818 - Attachment description: Bug 1588710 - clang stack clash - Cherry-pick an upstream fixes r?#build,tjr → Bug 1588710 - clang stack clash - Cherry-pick an upstream fix r?#build,tjr
Pushed by sledru@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/5f196172bb22
clang stack clash - Cherry-pick an upstream fix r=firefox-build-system-reviewers,mhentges,dmajor
Pushed by sledru@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/abafe6c923eb
Linux/Android: enable -fstack-clash-protection r=tjr,firefox-build-system-reviewers,glandium

The currently nightly has it \o/
Tom, As I was saying, we are going to write a blog post with Serge about it.

Do you think it should be mentioned in the release notes? I know it isn't much but some press like this kind security improvements.

Status: NEW → RESOLVED
Closed: 3 years ago
Flags: needinfo?(tom)
Resolution: --- → FIXED
Depends on: 1679985

I'm not sure about the release notes; I suppose add it and see what the people who manage that say.

Flags: needinfo?(tom)

We've seen some test_jsctypes.js test failures on Try and locally. I tracked it down to alloca(0) with -fstack-clash-protection. It generates buggy machine code, a fix for that landed in LLVM:

https://github.com/llvm/llvm-project/commit/9573c9f2a363da71b2c07a3add4e52721e6028a0
https://www.mail-archive.com/llvm-branch-commits@lists.llvm.org/msg05187.html

I can reproduce this locally on central tip (hg revision f4001dfef5bc). I don't know why this is not perma-orange though.

Depends on: 1680276

Jandem, I opened bug 1680276 for this

ok
I will rebase upstream fixes or wait for 11.0.1

Flags: needinfo?(sledru)

From upstream repo:
a1e0363c7402f7aa58e24e0e6dfa447ebabc1910 to bbe6cbbed8c7460a7e8477373b9250543362e771

Pushed by sledru@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/e45001cddaf4
Add missing upstream llvm patch to fix alloc(0) r=firefox-build-system-reviewers,dmajor
https://hg.mozilla.org/integration/autoland/rev/f0b02db01894
Linux/Android: enable -fstack-clash-protection r=tjr,firefox-build-system-reviewers,glandium

For now, it needs a patch version of clang 11.0.0 to work correctly.
I will remove it once 11.0.1 is released.

Attachment #9193198 - Attachment description: Bug 1588710 - stack-clash-protection: Only enable in automation for now r?#build → Bug 1588710 - stack-clash-protection: enable when clang 11.0.1 is used r?#build
Status: REOPENED → RESOLVED
Closed: 3 years ago3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 86 Branch

This might be too technical but it shows some security improvement on Linux and I often see the press mentioning that.
Feel free to ignore it!

Release Note Request (optional, but appreciated)
[Why is this notable]: Prevent some classes of security issues
[Affects Firefox for Android]: Yes
[Suggested wording]: On Linux and Android, the protection to mitigate the stack clash attack has been activated.
[Links (documentation, blog post, etc)]: I have been writing a draft with Red Hat & Rust for the llvm blog.

A few more links for the curious:
https://blog.qualys.com/vulnerabilities-research/2017/06/19/the-stack-clash
https://pagure.io/fesco/issue/2020
https://reviews.llvm.org/D68720
https://www.phoronix.com/scan.php?page=news_item&px=LLVM-Stack-Clash-Protection-20

relnote-firefox: --- → ?
Status: RESOLVED → REOPENED
Flags: needinfo?(sledru)
Resolution: FIXED → ---
Target Milestone: 86 Branch → ---

(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #49)

This had to be backed out for frequent crashes on Linux x64 debug affecting Try pushes (Bug 1679994):

The issue in bug 1679994 should have been fixed by the earlier backout here in comment 42, and we should not be seeing those issues anymore on the relanding in comment 47. Was there a different set of failures that prompted the second backout? Are there any links to the logs available?

Flags: needinfo?(aryx.bugmail)

Focus on the Linux x64 debug tasks

Flags: needinfo?(aryx.bugmail)

Thanks Aryx.

The minidump from this GTest task has

00007f63`d80b1fab 4883c00f        add     rax,0Fh
00007f63`d80b1faf 4883e0f0        and     rax,0FFFFFFFFFFFFFFF0h
00007f63`d80b1fb3 4989e5          mov     r13,rsp
00007f63`d80b1fb6 4929c5          sub     r13,rax
00007f63`d80b1fb9 4c89a5c8feffff  mov     qword ptr [rbp-138h],r12
00007f63`d80b1fc0 4939e5          cmp     r13,rsp
00007f63`d80b1fc3 7c11            jl      libxul+0x870ffd6 (00007f63`d80b1fd6)
00007f63`d80b1fc5 48c7042400000000 mov     qword ptr [rsp],0
00007f63`d80b1fcd 4881ec00100000  sub     rsp,1000h
00007f63`d80b1fd4 ebea            jmp     libxul+0x870ffc0 (00007f63`d80b1fc0)
00007f63`d80b1fd6 4c89ec          mov     rsp,r13

I don't think this does the right thing when rax is initially 0.

The compiler for this build had this squashed patch on top of llvmorg-11.0.0.

@sguelton: Does the above code look as expected to you?

Flags: needinfo?(sguelton)

Well, the story gets worse: I pulled the preprocessed quant_bands.c from CI, and downloaded the same compiler artifact that the builds are using, and on my machine I get different output (jge).

Could this be a caching problem? It looks very strange that https://treeherder.mozilla.org/jobs?repo=autoland&revision=f0b02db0189473f12beb4b483dab17708ffde615&selectedTaskRun=eSFOksz8Rd67FeVxcky7Vw.0 built a brand-new clang, yet the build task claims sccache hit rate debug was 0.6!

Flags: needinfo?(sguelton)

It's not strange, actually. That's the power of dynamic libraries. sccache uses the compiler binary (clang) for hashing, but the patch only touches libllvm... so from sccache's perspective, there was no change to the compiler. I don't think this is the first time this happens.

So, clearing the cache before landing this patch would fix the issue?
If so, how can we clear the cache ? :)
merci!

Flags: needinfo?(sledru)

Clearing the cache would mean that builds using the older clang may get objects from a build with the new clang. It would be better to find a way to make the clang binary different.

Would it be possible to include libllvm in the hashing? (and/or all llvm/clang key files)
This kind of issue is pretty tricky to debug

That would need sccache to be aware of such details of the compiler...

Depends on: 1685335
Attachment #9193040 - Attachment is obsolete: true

Let's try to land it again now that we are using clang 11.0.1 (which includes the fixes), so that we don't have the cache issue one more time :)

Pushed by sledru@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/cb69fe3e4c46
Linux/Android: enable -fstack-clash-protection r=tjr,firefox-build-system-reviewers,glandium
Attachment #9193198 - Attachment description: Bug 1588710 - stack-clash-protection: enable when clang 11.0.1 is used r?#build → Bug 1588710 - stack-clash-protection: enable when clang 11.0.1 is used
Pushed by mh@glandium.org:
https://hg.mozilla.org/integration/autoland/rev/04b7cd1b0d65
stack-clash-protection: enable when clang 11.0.1 is used r=firefox-build-system-reviewers,glandium
Status: REOPENED → RESOLVED
Closed: 3 years ago3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 86 Branch

If release managers want to relnote it, we (redhat, rust and firefox) wrote:
https://blog.llvm.org/posts/2021-01-05-stack-clash-protection/

I think it's too niche for our beta/release note, thanks.

Whiteboard: [adv-main86-]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: