Closed Bug 1716028 Opened 3 years ago Closed 3 years ago

A race condition in crossbeam leads to a security vulnerability

Categories

(Core :: Security, defect)

defect

Tracking

()

RESOLVED FIXED
92 Branch
Tracking Status
firefox-esr78 --- wontfix
firefox-esr91 92+ fixed
firefox90 --- wontfix
firefox91 --- wontfix
firefox92 + fixed

People

(Reporter: kmaork, Assigned: emilio)

References

Details

(Keywords: csectype-race, reporter-external, sec-moderate, Whiteboard: [post-critsmash-triage][adv-main92-][adv-esr91.1-] )

Attachments

(3 files)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36

Steps to reproduce:

I have encountered a bug (a very slippery race condition) in the latest version of the crossbeam crate, and after some research I verified it exists in older versions too, specifically one that is used inside firefox as well. The bug involves the combined use of three functions from the deque.rs module - Worker::push, Worker::pop, Stealer::steal_batch. Later I discovered that similar race conditions exist with the other steal functions - Stealer::steal and Stealer::steal_batch_and_pop. I attached a script that reproduces the bug.

Actual results:

The bug had caused random memory corruption issues in my purely safe rust program. Specifically, it caused both a logical bug in my program, followed by a double-free.
I spent two weeks now researching and fixing this bug in my free time.
I intend to open a pull request with my fix, but wanted to let you guys know before I do that, so you'll have time to evaluate the security repercussions.
I looked into the usage of crossbeam inside firefox, and it seems to be used quite widely.
More specifically, I see it is used in Gecko. I dug into the usages a little bit, and it seems that at least one potential place for the bug to occur is in the rayon crate. The rayon crate is used widely throughout Gecko, specifically in gfx, servo and additional third-party libraries.
The good news is that the steal_batch and steal_batch_and_pop functions, when used on lifo worker queues, are much more susceptible to the race condition than the other scenarios, and I didn't find examples for such usage in firefox.
The result of the race condition is that one or more tasks in the worker queue can be stolen twice instead of other tasks that are forgotten and never stolen. If tasks are allocated on the heap, this can cause double free and a memory leak. If not, this still can cause a logical bug.

Expected results:

I have prepared a fix which I'm ready to open a pull-request for. Please let me know when I can progress with the PR without causing any harm. This is the diff if you're interested:
https://github.com/crossbeam-rs/crossbeam/compare/master...kmaork:deque_race_fix?expand=1

Flags: sec-bounty?

Emilio: I see you've updated crossbeam and rayon in the past few months. Do you know who would own this issue?

Group: firefox-core-security → dom-core-security
Component: Untriaged → Security
Flags: needinfo?(emilio)
Keywords: csectype-race
Product: Firefox → Core

I'm not super-familiar with crossbeam's internals.

The fix looks sensible (harmless at best), but I'm not sure if it's 100% correct... Why can the buffer be swapped after reading it in line 635, but not after the check that the reporter is introducing?

In any case, this seems like the kind of thing that would be pretty hard to reliably exploit (though please correct me if I'm wrong). Not sure if we'd want to rush and cherry-pick such thing without even going for review to upstream? I took a look, and it doesn't seem like crossbeam has a dedicated security contact / etc. If we consider this a sec-high/crit, we might want to contact their maintainers privately before sending the PR and such. But otherwise I'd say that the reporter should send the patch upstream and see what the relevant maintainers say about the correctness of the fix.

In any case I'm happy to cherry-pick the fix either once it lands, or asap if you think it's worth it? This affects both stylo and webrender, afaict.

Flags: needinfo?(emilio) → needinfo?(dveditz)

Regarding correctness of the fix - the problem is that we first load the pointer to the buffer, and then read a task from the buffer. Between loading the buffer and reading the task from the buffer, the buffer may have been swapped and we find ourselves reading stale data. But, if we verify that the buffer hasn't been swapped, the task is valid and we can increment/decrement the relevant atomic counter to steal the task. From this point onward we don't care if the buffer is swapped, because we already read a task from it and we're not planning to touch it again.

Regarding probability of exploit, I'm not sure as well. It will probably be hard to exploit at the very least, and will probably be a statistical exploit and not the 100% reliable kind. I'm willing to open a PR if needed.

Flags: needinfo?(kmaork)
Assignee: nobody → emilio

(In reply to Daniel Veditz [:dveditz] from comment #4)

This is the diff if you're interested: https://github.com/crossbeam-rs/crossbeam/compare/master...kmaork:deque_race_fix?expand=1

I guess your repo is private -- can't see that. But we'd love to see the fix if you wouldn't mind attaching the diff here.

Not sure why you can't see it (I can even in priv. mode), but anyhow, I attached the commit from the reporter

Assignee: emilio → nobody

What would a test case that reproduces the problem using rayon look like?

Regarding exploitation using rayon - to exploit the bug, two things must happen:

  1. The buffer must be swapped during a steal.
  2. The data read from the old buffer must be different from the data at the same index in the new buffer.

Note that the buffer is both growable and shrinkable (shrinks when size <= capacity / 4). Therefore lots of pushing, popping and stealing will satisfy condition #1. For me it worked best with one pusher/popper thread and one stealer thread (as can be seen in the attached script).

To satisfy condition #2 we need to empty the queue after the buffer had been swapped and then push (at least) one task. If that happens before the stealer finished reading the task, the stealer will "declare" it has stolen the newly pushed task, while it actually read an old task. Now the two threads had popped the same task (the newly pushed task is lost) and the program has entered an undefined state.

Note that both conditions are things that must happen between loading the buffer and reading from it in the stealer thread. For that reason the bug is very likely to occur when using the steal_batch and steal_batch_and_pop functions with a lifo queue, as these functions load the buffer once and repetitively read tasks from it. In the other cases, the likelihood might depend more heavily on the environment. For example, if many threads are running on the target system, the chances the OS will preempt a thread at the "right" time for an exploit grows.

Anyway, assuming we figured out how to prepare the environment for the exploit, I think the most likely scenario for an exploit is:

  1. Thread A creates a new lifo queue (initial capacity is the minimum which is 32)
  2. Thread A pushes 33 items so it grows to capacity of 64
  3. Thread A pops 16 items, so we are left with 17 items and capacity of 64
  4. Thread B starts a steal operation, and loads the buffer pointer
  5. Thread A pops 17 items and pushes one item. The first pop shrinks the queue, so the buffer is replaced.
  6. Thread B reads a task from the old buffer and increments the front index. The task it read has already been popped by thread A.

To make that happen with rayon we'll need to loop over a workload that consists of 33 tasks enough times to thread B is preempted at the right time.

Flags: needinfo?(kmaork)

Hey, so what happens now? I'd like to open a PR to crossbeam and open a CVE at some point, should I wait for you guys to do something first? And by the way, is this eligible for some kind of bounty/credit?

Flags: needinfo?(dveditz)
See Also: → 1715285

Hey, what's up? I've been waiting for a month now...

(In reply to kmaork from comment #10)

Hey, so what happens now? I'd like to open a PR to crossbeam and open a CVE at some point, should I wait for you guys to do something first? And by the way, is this eligible for some kind of bounty/credit?

As Emilio said in comment 2, we'd like this patch to go through the crossbeam-rs folks before we incorporate it. And since it's not originally our code we can't issue a CVE for it. The crossbeam folks can (through GitHub) and have in the past: https://github.com/crossbeam-rs/crossbeam/security/advisories/GHSA-v5m7-53cv-f3hx. I believe only repo owners can open security advisories (I don't see a way to do it on crossbeam-rs, anyway, and I can on my own repo) so you should contact them first because issues and PRs are public. The contributor who pushed that advisory, @taiki-e, is still actively committing to crossbeam-rs as of yesterday.

This would be a tough condition to exploit in Firefox, but the "See Also" bug Emilio linked to shows it may be causing memory corruption crashes in Firefox so it's not impossible. Not speaking for the bounty committee, but IMHO this is would likely be awarded a bounty consistent with its sec-moderate rating.

Flags: needinfo?(dveditz)

The patch was merged upstream, thanks kmaork!

Flags: needinfo?(emilio)
Flags: needinfo?(emilio)
Assignee: nobody → emilio

Comment on attachment 9234297 [details]
Bug 1716028 - Update crossbeam-deque. r=jrmuizel,nical

Beta/Release Uplift Approval Request

  • User impact if declined: Potential memory corruption crashes and co.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: none
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): It's a pretty harmless patch, at best it retries a couple times more.
  • String changes made/needed: none
Attachment #9234297 - Flags: approval-mozilla-beta?

This is too late for 91 since we're already in RC week, but we can take it for the 91.1esr release shipping alongside Fx92 next cycle.

Attachment #9234297 - Flags: approval-mozilla-esr91?
Attachment #9234297 - Flags: approval-mozilla-beta?
Attachment #9234297 - Flags: approval-mozilla-beta-
Group: dom-core-security → core-security-release
Status: UNCONFIRMED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 92 Branch
Flags: qe-verify-
Whiteboard: [post-critsmash-triage]
Flags: sec-bounty? → sec-bounty+

Comment on attachment 9234297 [details]
Bug 1716028 - Update crossbeam-deque. r=jrmuizel,nical

Approved for 91.1esr.

Attachment #9234297 - Flags: approval-mozilla-esr91? → approval-mozilla-esr91+

Thank you for the bounty! How do I receive it?

Flags: needinfo?(emilio)
Flags: needinfo?(emilio) → needinfo?(dveditz)
Flags: needinfo?(dveditz) → needinfo?(tom)

(In reply to kmaork from comment #20)

Thank you for the bounty! How do I receive it?

I will be contacting you within a few days to begin the bounty payment process; please be on the lookout for an email from me.

Flags: needinfo?(tom)
Whiteboard: [post-critsmash-triage] → [post-critsmash-triage][adv-main92+]
Whiteboard: [post-critsmash-triage][adv-main92+] → [post-critsmash-triage][adv-main92+][adv-esr91.0.1+]
Whiteboard: [post-critsmash-triage][adv-main92+][adv-esr91.0.1+] → [post-critsmash-triage][adv-main92+][adv-esr91.1+]

This doesn't need an advisory after all. Advisory and CVE exist in https://github.com/crossbeam-rs/crossbeam/security/advisories/GHSA-pqqp-xmhj-wgcw. My bad.

Whiteboard: [post-critsmash-triage][adv-main92+][adv-esr91.1+] → [post-critsmash-triage][adv-main92-][adv-esr91.1-]
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: