Closed Bug 1546544 (CVE-2019-9815) Opened 2 years ago Closed 2 years ago

macOS: disable hyperthreading on threads that run content JS

Categories

(Core :: DOM: Workers, defect)

Unspecified
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla68
Tracking Status
firefox-esr60 67+ fixed
firefox66 --- wontfix
firefox67 + fixed
firefox68 + fixed

People

(Reporter: luke, Assigned: haik)

References

(Regressed 1 open bug, )

Details

(Keywords: csectype-other, sec-high, Whiteboard: [disclosure date May 14][adv-main67+][adv-esr60.7+])

Attachments

(6 files, 2 obsolete files)

Hyperthreading enables a number of timing attacks (e.g. PortSmash), including some new Spectre variants. macOS now recommends that applications running untrusted code on a thread should call a new syctl that disables hyperthreading for that thread (preventing that thread from sharing a physical core with any other thread). (The story for Linux/Windows is different, but requires no action from us, and if you're interested, ping me to discuss separately.)

The macOS security update containing the new sysctl is 10.14.5. The WebKit patch which we want to basically copy that uses this sysctl is:
https://trac.webkit.org/changeset/244233/webkit
The sysctls are used in CPU.cpp (which doesn't appear in the above diff view):
https://trac.webkit.org/browser/webkit/trunk/Source/JavaScriptCore/assembler/CPU.cpp?rev=244233

So we want to basically call kern.tcsm_enable on the main thread and any worker thread. The JSC team reports a low (<=1%) slowdown from this in practice.

The only interesting bit is that this mitigation messes with the number-of-logical-cores count, which we use in various places and report to content via navigator.hardwareConcurrency. In the above diff, WebKit seems to simply subtract 1 when the mitigation is active. That seems off, at least for the purpose of navigator.hardwareConcurrency w.r.t workers. I'd be inclined to just use the physical core count, when on macOS. (Logical threads are only worth ~1.2 physical threads anyway.)

Ideally, we'd uplift this patch to FF67, for release in time for any upcoming public disclosures (to minimize risk, perhaps just the part that calls kern.tcsm_enable, the logical-core-count part could ride the trains).

Haik, are you the right person to work on this? It looks like at least part of this is going to require updating our sandbox policy (see https://trac.webkit.org/changeset/244233/webkit#file10).

It looks like the navigator.hardwareConcurrency binding lives in dom/workers, as far as I can tell we don't actually use that for any internal logic, so presumably we should be making JS modifications instead.

Luke, is the intention to just set this for threads that are going to run untrusted JS? In that case does it make sense to move this bug to the JS engine -- note most of the WebKit changes are in JSC -- and loop in the appropriate people there?

Flags: needinfo?(luke)
Flags: needinfo?(haftandilian)

so presumably we should be making JS modifications instead.

What does "JS modifications" mean? I would expect we'd need to change the Gecko (C++) impl of hardwareConcurrency.

Luke, is the intention to just set this for threads that are going to run untrusted JS? In that case does it make sense to move this bug to the JS engine -- note most of the WebKit changes are in JSC -- and loop in the appropriate people there?

I'm not sure why JSC did the sysctl() in their JS engine, maybe it was an easy pinch point for them, but it seems a bit strange for the JS engine to do this because it lacks embedding context. E.g., we wouldn't necessarily have to set this for any parent-process threads (iiuc, we don't run untrusted JS in the parent process) and the JS engine has no concept of this distinction.

I think the right place to do this is in the shared initialization we do when setting up a window/worker thread, perhaps: https://searchfox.org/mozilla-central/source/xpcom/base/CycleCollectedJSRuntime.cpp#469. mccr8?

For updating all the "how many threads are there?" queries, we should've probably factored out a function that does this into MFBT a long time ago, mirroring WebKit's WTF::numberOfProcessorCores(), and that could be a good place to make that change so that Gecko and SM can share the logic.

Flags: needinfo?(luke) → needinfo?(continuation)

(In reply to Eric Rahm [:erahm] from comment #1)

Haik, are you the right person to work on this? It looks like at least part of this is going to require updating our sandbox policy (see https://trac.webkit.org/changeset/244233/webkit#file10).

Yes, I'll work on this. Our content sandbox whitelists sysctls so (at first glance) we'll have to add kern.tcsm_enable and kern.tcsm_available to the list.

And I assume we'll want to do this for the parent process (which isn't affected by sandboxing) as well as content processes and any other processes that run JS.

I'm wondering if making navigator.hardwareConcurrency return just the physical cores (half its current value) might causes sites to do throttling in some way and affect performance itself.

Flags: needinfo?(haftandilian)

CycleCollectedJSRuntime is the right place to go if you want to have Gecko stuff set things in the JS engine for every JS runtime.

Does SpiderMonkey need to know about this, too? There's the various JS helper threads, but I suppose they hopefully do not actually run JS...

Flags: needinfo?(continuation)

(In reply to Haik Aftandilian [:haik] from comment #3)

And I assume we'll want to do this for the parent process (which isn't affected by sandboxing) as well as content processes and any other processes that run JS.

Maybe I'm misunderstanding you, but this should only be needed for threads and processes that are running content JS. If we're running hostile JS code in the parent process we've probably already lost. :) I'm not sure how addon JS fits into this.

Group: mozilla-employee-confidential
Type: task → defect

(In reply to Andrew McCreight [:mccr8] from comment #4)

CycleCollectedJSRuntime is the right place to go if you want to have Gecko stuff set things in the JS engine for every JS runtime.

Does SpiderMonkey need to know about this, too? There's the various JS helper threads, but I suppose they hopefully do not actually run JS...

WebKit tweaked the number of wasm compilation threads, which indeed are JS helper threads for us. I'm not sure if that was necessary or not.

I think this should be marked with both the MoCo confidential and sec flags due to the NDA'ness.

DOM:Workers isn't quite right, but I'm not sure what is. XPCOM? If this is going to be handled Gecko-side.

OS: Unspecified → macOS

(In reply to Eric Rahm [:erahm] from comment #6)

WebKit tweaked the number of wasm compilation threads, which indeed are JS helper threads for us. I'm not sure if that was necessary or not.

In light of this, do you think SpiderMonkey needs to be able to deal with this, too? Thought I don't know how helper threads are spawned nowadays.

Flags: needinfo?(luke)

(In reply to Haik Aftandilian [:haik] from comment #3)

I'm wondering if making navigator.hardwareConcurrency return just the physical cores (half its current value) might causes sites to do throttling in some way and affect performance itself.

If sites are using navigator.hardwareConcurrency to control the number of workers to create (which I think is the primary motivation), then I think the physical core count is the right number to return given that each worker thread will have HT disabled. Or are you talking about some other kind of throttling?

(In reply to Andrew McCreight [:mccr8] from comment #8)

In light of this, do you think SpiderMonkey needs to be able to deal with this, too? Thought I don't know how helper threads are spawned nowadays.

Yes, in comment 2, I was suggesting we have a "number of threads" function in MFBT that then SM and Gecko could easily share it. It looks like there are a bunch of callers of sysconf(_SC_NPROCESSORS_CONF) and sysconf(_SC_NPROCESSORS_ONLN) and probably most of these want to use the shared MFBT cpu-count function.

Flags: needinfo?(luke)

Oh sorry, I might've answered the wrong question: if you're talking about which threads need to have the "disable-HT" mitigation, the SM helper threads do not run JS.

Luke, what kind of timeline do we have for this? Should Haik drop everything ASAP, are we concerned about uplifts (18 days) or is there some padding before disclosure?

Flags: needinfo?(luke)

(In reply to Eric Rahm [:erahm] from comment #11)

Luke, what kind of timeline do we have for this? Should Haik drop everything ASAP, are we concerned about uplifts (18 days) or is there some padding before disclosure?

This is my top priority right now and I'll be working on it today.

The last I heard, the disclosure date conveniently coincides with the release of FF67 (May 14th). It's hard to predict how much attention this disclosure will garner; it's not as scary as the last ones and maybe people are becoming desensitized to Spectre and this whole macOS mitigation will go unnoticed... but if it did generate significant attention and if, in response, Apple makes a public recommendation that apps running untrusted code call this new sysctl and Safari/Chrome then message "we already do", Firefox might be asked by, say, tech press, at which point we'll be in a much better position if we already do as well.

Flags: needinfo?(luke)

(In reply to Andrew McCreight [:mccr8] from comment #5)

(In reply to Haik Aftandilian [:haik] from comment #3)

And I assume we'll want to do this for the parent process (which isn't affected by sandboxing) as well as content processes and any other processes that run JS.

Maybe I'm misunderstanding you, but this should only be needed for threads and processes that are running content JS. If we're running hostile JS code in the parent process we've probably already lost. :) I'm not sure how addon JS fits into this.

Kris, can you confirm whether or not any untrusted JS from extensions still runs in the parent process?

Flags: needinfo?(kmaglione+bmo)

(In reply to Haik Aftandilian [:haik] from comment #14)

Kris, can you confirm whether or not any untrusted JS from extensions still runs in the parent process?

At the moment, we only run JS in the parent process for the deprecated proxy PAC script API. In general, we don't even allow extension content scripts to run in the parent process

Flags: needinfo?(kmaglione+bmo)

[Tracking Requested - why for this release]:
Any supported release running on macOS 10.14.5 is affected including ESR 60.

I've posted a patch that allows the new sysctl in the content process policy which also applies to the extension process. The patch includes enabling TCSM on the main thread in content/extension processes and the parent process as well as on DOM workers. We need review from someone knowledgeable about DOM workers to tell me if the changes cover all the cases we need to cover.

I haven't yet looked at the changes required for navigator.hardwareConcurrency or our own internal CPU counts.

Luke, it looks like the Safari folks have small bug in their CPU code:

55	int kernTCSMAwareNumberOfProcessorCores()
56	{
57	    static std::once_flag onceFlag;
58	    static int result;
59	    std::call_once(onceFlag, [] {
60	        result = WTF::numberOfProcessorCores();
61	        if (result <= 1)
62	            return;

Empty return is UB I think and even if not it's certainly not intended.

63	        if (isKernTCSMAvailable())
64	            --result;
65	    });
66	    return result;
67	}

Can you forward this info to them?

You bet, thanks for pointing it out.

Actually, the empty return is in a C++ lambda called by std::call_one, not the containing function, so this isn't UB, iiuc.

Luke, do you have any more information on the kern.tcsm_available sysctl? I'm wondering if we can read it once and cache the result. Specifically, will the value be the same for all threads in an application? If we read it as non-existent or non-zero for a thread at some point while the the application is running, will it be the same for all other threads that read it?

Flags: needinfo?(luke)
Assignee: nobody → haftandilian

I expect kern.tcsm_available is invariant over time and across threads in a process, so I think you should proceed under that assumption, but I'll ask the engineer who forwarded me the WebKit cset and report back if that assumption is not correct. Unfortunately, there is no documentation atm.

Flags: needinfo?(luke)

The answer is "yes": kern.tcsm_available is invariant over time and threads within a process.

Attachment #9061069 - Attachment description: Bug 1546544 - Part 1 - Enable TCSM → Bug 1546544 - Enable TCSM

I think the first patch is ready to land. The result of "kern.tcsm_available" is being cached so that will only be called once per process. "kern.tcsm_enable" needs to be called a minimum of once per JS thread that will run untrusted code, but it can be called more than once per thread when CycleCollectedJSRuntime() is called multiple times on the same thread. I think this is OK for now. On my 2018 MacBook Pro, the sysctl to set "kern.tcsm_enable" took around 10 microseconds and was only called ~60 times after starting the browser and browsing with ~15 tabs for a few minutes. "kern.tcsm_enable" will only be called on systems with the tcsm_available sysctl property.

The second patch adds a Mac-specific routine to get the number physical CPU's. I don't want to try and write common code to do this and have to write implementations for Windows and Linux too given this change doesn't require it and getting physical CPU count probably has platform-specific details to consider.

Attached image tcsm_enable Histogram

Needinfo for review (in case the phabricator mails weren't noticed)

Flags: needinfo?(luke)
Flags: needinfo?(continuation)

lgtm (with small nits), thanks for all the work!

Flags: needinfo?(luke)

Yeah, thanks for doing this!

Flags: needinfo?(continuation)

Comment on attachment 9061069 [details]
Bug 1546544 - Enable TCSM

Security Approval Request

  • How easily could an exploit be constructed based on the patch?: Unknown. A similar patch has already landed in WebKit (see comment 0). Even with the fix in Firefox, users will need to upgrade to the macOS 10.14.5 security update to avoid the problem. Apple may release updates for earlier OS versions (such as 10.13 for example.)
  • Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?: Unknown
  • Which older supported branches are affected by this flaw?: ESR, 66, 67 (All supported releases that run on 10.14.5)
  • If not all supported branches, which bug introduced the flaw?: OS/hardware Bug
  • Do you have backports for the affected branches?: No
  • If not, how different, hard to create, and risky will they be?: The backports should be very similar.
  • How likely is this patch to cause regressions; how much testing does it need?: Low risk of regressions. Needs testing on Mac only. On macOS 10.14.5 Beta and at least one or more earlier macOS version.
Attachment #9061069 - Flags: sec-approval?
Attachment #9061802 - Flags: sec-approval?

For reference, here's the Chromium patch:
https://chromium-review.googlesource.com/c/v8/v8/+/1594562/

Looping in release management. We're about to make release candidates so normally this would be too late to go in but it appears important enough to consider.

Flags: needinfo?(pascalc)
Flags: needinfo?(lhenry)

Can we get an uplift request for beta please? Thanks

Flags: needinfo?(pascalc) → needinfo?(haftandilian)

Comment on attachment 9061069 [details]
Bug 1546544 - Enable TCSM

sec-approval+ for trunk then.

Attachment #9061069 - Flags: sec-approval? → sec-approval+

(In reply to Pascal Chevrel:pascalc from comment #34)

Can we get an uplift request for beta please? Thanks

Coming shortly. The patch for Beta is slightly different due to some refactoring in security/sandbox/mac.

Two patches are posted to land on central.

For uplifts, the plan is to just land the first patch "Enable TCSM" that enables TCSM when running untrusted JS. The second patch which adjusts navigator.hardwareConcurrency when TCSM is enabled can ride the trains.

Can you create a patch and request uplift for ESR as well? Thanks!

Flags: needinfo?(lhenry)

Comment on attachment 9063010 [details] [diff] [review]
Enable TCSM Patch for Beta/67 Uplift

Beta/Release Uplift Approval Request

  • User impact if declined: Exposure to a new timing attack that is dependent on hyperthreading being enabled.
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): The change is limited to Mac and the scope of the fix is quite small.
  • String changes made/needed:
Flags: needinfo?(haftandilian)
Attachment #9063010 - Flags: approval-mozilla-beta?
Comment on attachment 9063010 [details] [diff] [review]
Enable TCSM Patch for Beta/67 Uplift

Approved for our beta branch.
Attachment #9063010 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

I'm working on the ESR patch and uplift request.

Flags: needinfo?(haftandilian)

OK. We should probably hold off on landing today given there's still some uncertainty what may need to be done for the expired cert issue. (i.e. will we need another dot release)

Is this embargoed? If it isn't, I have to write an advisory and I'll need to know how we want to credit this. If we're waiting for others to fix, we shouldn't disclose until they do.

Flags: needinfo?(luke)
Whiteboard: [NDA'd, please leave closed and in core-security] → [NDA'd, please leave closed and in core-security][adv-main67+]

Yes, it's embargoed, until ~May 14th.

Flags: needinfo?(luke)

(In reply to Ryan VanderMeulen [:RyanVM] from comment #45)

This was autolanded:
https://hg.mozilla.org/integration/autoland/rev/fda45db5f8e981524e7538b7b2e259fa7828424d
https://hg.mozilla.org/integration/autoland/rev/d4b67960c0f921396353762c413b0bd2ef234cb0

Then backed out for timeouts on startup in mochitest-chrome:
https://hg.mozilla.org/integration/autoland/rev/18ae393f54167ce4305a7c1192fbc90b594b6361

Log link:
https://treeherder.mozilla.org/logviewer.html#?job_id=245011756&repo=autoland

Also backed out from Beta for the same issues:
https://hg.mozilla.org/releases/mozilla-beta/rev/21ec21b79d77

I don't know exactly what's causing the issue for the non-e10s tests, but it appears that calling XRE_IsE10sParentProcess() to check if (e10s is enabled && we're in the parent process) is causing the hang. XRE_IsE10sParentProcess() changes a few globals. Digging into this and should have a patch later.

Is it OK to do a try run of this fix at this point?

Flags: needinfo?(haftandilian)

(In reply to Haik Aftandilian [:haik] from comment #47)

Is it OK to do a try run of this fix at this point?

Yes. The patch is already public.

So as an update re comment 13:

  • the disclosure date is indeed May 14th
  • Apple just pinged to say they will probably have a communication on May 14th that mentions the mitigation and which browser versions (not just Safari, but all browsers) have it.

So it's good that we've prioritized this.

The first attempt at the fix used XRE_IsE10sParentProcess() to enable the mitigation in child processes and in the main process only when e10s is disabled.

Now I realize using XRE_IsE10sParentProcess() is problematic because 1) it reads prefs and is being called of the main thread, but prefs should only be read on the main thread and 2) it is being called very early in startup (see stack below.).

I've just posted a new fix that doesn't have this problem, but doesn't enable the mitigation for the non-e10s case. Not supporting e10s seems OK at this time because, on Mac, non-e10s is not supported. We did have some ESR60 issues that disabled e10s for screen reader compatibility, but that did not include Mac. I will look into a follow-up fix for the non-e10s case. One solution is to enable the mitigation all the time in the parent process, but I think we should make sure the perf difference is negligible first.

  * thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.2
    * frame #0 XUL`mozilla::BrowserTabsRemoteAutostart() at nsAppRunner.cpp:4929
      frame #1 XUL`XRE_IsE10sParentProcess() at nsAppRunner.cpp:4865
      frame #2 XUL`mozilla::CycleCollectedJSRuntime::CycleCollectedJSRuntime() at CycleCollectedJSRuntime.cpp:493
      frame #3 XUL`XPCJSRuntime::XPCJSRuntime() at XPCJSRuntime.cpp:2948
      frame #4 XUL`XPCJSRuntime::XPCJSRuntime() at XPCJSRuntime.cpp:2963
      frame #5 XUL`XPCJSContext::CreateRuntime() at XPCJSContext.cpp:1067
      frame #6 XUL`mozilla::CycleCollectedJSContext::Initialize() at CycleCollectedJSContext.cpp:156
      frame #7 XUL`XPCJSContext::Initialize() at XPCJSContext.cpp:1075
      frame #8 XUL`XPCJSContext::NewXPCJSContext() at XPCJSContext.cpp:1269
      frame #9 XUL`nsXPConnect::nsXPConnect() at nsXPConnect.cpp:75
      frame #10 XUL`nsXPConnect::nsXPConnect() at nsXPConnect.cpp:67
      frame #11 XUL`nsXPConnect::InitStatics() at nsXPConnect.cpp:127
      frame #12 XUL`xpcModuleCtor() at XPCModule.cpp:11
      frame #13 XUL`nsLayoutModuleInitialize() at nsLayoutModule.cpp:108
      frame #14 XUL`nsComponentManagerImpl::Init() at nsComponentManager.cpp:493
      frame #15 XUL`::NS_InitXPCOM(nsIServiceManager **, nsIFile *, nsIDirectoryServiceProvider *)() at XPCOMInit.cpp:446
      frame #16 XUL`ScopedXPCOMStartup::Initialize() at nsAppRunner.cpp:1282
      frame #17 XUL`XREMain::XRE_main() at nsAppRunner.cpp:4682
      frame #18 XUL`XRE_main() at nsAppRunner.cpp:4767
      frame #19 XUL`mozilla::BootstrapImpl::XRE_main() at Bootstrap.cpp:45
      frame #20 firefox`do_main() at nsBrowserApp.cpp:212
      frame #21 firefox`main() at nsBrowserApp.cpp:291
      frame #22 libdyld.dylib`start()

I discussed the new fix with :luke and :mccr8. Try results look good so I will re-land on central now.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=f81754ae7d90a0437b5278df70d0f85ea2d34be2
The bc7 failure appears to be due to bug 1547489.

Attachment #9063010 - Attachment is obsolete: true
Attachment #9061802 - Flags: sec-approval? → sec-approval+

With the 67 launch date pushed back to May 21st from May 14th, we will need this fix as an update to 66 in order to have it available on the May 14th disclosure date. I'm working on a Release patch.

I was worried about the same thing, but I don't think we need to issue a dot release just for this issue: Chrome won't have their patch in release until well after May 21st and, with the "optics" aside, the practical threat level of waiting a week is very low. Yesterday I raised this with the release manager and other relevant release stakeholders and everyone is in agreement. So we can just land on FF67 and let the release folks have a break from a strenuous week.

Please nominate the ESR60 patch for uplift ASAP.

Flags: needinfo?(haftandilian)

Comment on attachment 9063606 [details] [diff] [review]
Enable TCSM Patch for ESR60 Uplift

ESR Uplift Approval Request

  • If this is not a sec:{high,crit} bug, please state case for ESR consideration:
  • User impact if declined: Exposure to timing attack security issue.
  • Fix Landed on Version: 68
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Limited to macOS. Small scope of patch.
  • String or UUID changes made by this patch: None
Flags: needinfo?(haftandilian)
Attachment #9063606 - Flags: approval-mozilla-esr60?
Comment on attachment 9063606 [details] [diff] [review]
Enable TCSM Patch for ESR60 Uplift

Fix for sec-high issue, ok for ESR 60.7 uplift.
Attachment #9063606 - Flags: approval-mozilla-esr60? → approval-mozilla-esr60+
Whiteboard: [NDA'd, please leave closed and in core-security][adv-main67+] → [NDA'd, please leave closed and in core-security][adv-main67+][adv-esr60.7+]
Alias: CVE-2019-9815

Looks like public disclosure has happened, as expected.

Group: mozilla-employee-confidential
Group: core-security → core-security-release
Whiteboard: [NDA'd, please leave closed and in core-security][adv-main67+][adv-esr60.7+] → [disclosure date May 14][adv-main67+][adv-esr60.7+]

For the advisory, who is the official "reporter" of this issue for credit?

Flags: needinfo?(luke)

A whole bunch of people, apparently; for starters, all the people listed under "People" on https://mdsattacks.com/.

Flags: needinfo?(luke)

(In reply to Luke Wagner [:luke] from comment #66)

A whole bunch of people, apparently; for starters, all the people listed under "People" on https://mdsattacks.com/.

All of them reached out to Mozilla and reported it to us?

The advisory usually has a single line where we list one to three names (usually just one). I can pull every single person on the list but that seems possibly silly. How did we find out about it so we made our code change?

Dan suggests that we just link to https://mdsattacks.com/#People

sgtm

Group: core-security-release
Regressions: 1630089
You need to log in before you can comment on or make changes to this bug.