Closed Bug 1376408 Opened 3 years ago Closed 1 year ago

Randomize small allocations in mozjemalloc

Categories

(Core :: Memory Allocator, enhancement, P2)

enhancement

Tracking

()

RESOLVED FIXED
mozilla69
Tracking Status
firefox69 --- fixed

People

(Reporter: stephen, Assigned: tjr)

References

(Blocks 1 open bug)

Details

(Keywords: parity-edge, sec-want, Whiteboard: [adv-main69-])

Attachments

(3 files, 2 obsolete files)

Attached patch random_small_allocations.patch (obsolete) — Splinter Review
As part of a security review into the mozjemalloc heap, I have been investigating the feasibility of adding non deterministic allocations. This would prevent an attacker leveraging the deterministic nature of how the heap tracks free regions and then allocates them; so an attacker who frees a region cannot rely on the order that the allocator will reuse freed regions. This ultimately can prevent an attacker laying out the contents of the heap in a deterministic pattern during the exploitation of a heap based buffer overflow, specifically to prevent a region where an overflow occurs being adjacent to a target region that the attacker wishes to corrupt, at an expected offset.

Obviously non determinism goes against performance in terms of caches, locality and so on so I am conscious of what is actually feasible.

I have attached a patch which focuses on adding a certain amount of entropy to small allocations from a run. When searching for a free region via a region mask, we can start the search from a random bit offset within the current mask so as to find a free region in a non deterministic way - as opposed to always finding the first free region via the lowest set bit in the bitmap.

The search order of bitmap masks was not altered (just the search of the mask itself) to preserve how regs_minelm is used.

The pseudo random number generator is a standard xorshift which compiles to a handful of instructions as does the rotate helper function, all of which are inlined.

Ideally the bitmap masks would be 64 bit on native 64 bit builds instead of 32 bit, in order to increase the potential entropy. This would be a useful addition.

Adding entropy to large and huge allocations is not covered as there seems no easy way to do so while preserving the perf benefits of how the size/address trees are used to cache previously freed large/huge allocations.
Hey Stephen, I'm not quite sure why, but the patch doesn't apply on mozilla-central for me. Do you think you could try rebasing it on an up to date checkout?
Flags: needinfo?(stephen)
Flags: needinfo?(mh+mozilla)
Attached patch random_small_allocations.patch (obsolete) — Splinter Review
Tom, this imported and built for me on the latest from mozilla-central.
Attachment #8881364 - Attachment is obsolete: true
Flags: needinfo?(stephen)
Seems a warning is being treated as an error breaking the build, this should resolve it.
Attachment #8881468 - Attachment is obsolete: true
Group: core-security → dom-core-security
One question that was raised was how this behaves if an attacker puts memory pressure on the system, so that there may only be one or limited options to select from for an allocation...
Flags: needinfo?(stephen)
It's a fair point. It is not intended to effect the exploitation of use after free vulnerabilities but rather limit an attackers ability to layout memory in a predictable fashion before exploiting a heap based buffer overflow. The ability to do this is going to be vulnerability dependant - largely dictated by whether the attacker can control all allocations being serviced from a run while the attacker is performing their heap massaging or whether some allocations which the attacker does not control are also being serviced from a run, while the attacker is performing their heap massaging (if some of the code paths the attacker triggers also go on to make allocations from the same run for example).

This blog post (https://blog.lse.epita.fr/articles/74-getting-back-determinism-in-the-lfh.html) covers the effects of memory pressure (via many attacker controlled allocations) to circumvent the non determinism in the Windows LFH in the 'First attack' section.
Flags: needinfo?(stephen)
Mitigations to problems that are well known (predictability of jemalloc) don't need to be hidden.
Group: dom-core-security
Keywords: sec-want
Keywords: parity-edge
Summary: mozjemalloc security hardening → Randomize small allocations in mozjemalloc
Flags: needinfo?(mh+mozilla)
Priority: -- → P2
Updated original patch, rebased against esr60.
Assignee: nobody → tom
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Comment on attachment 9030047 [details] [diff] [review]
Randomized free region selection for small allocations in a run. (esr60)

Review of attachment 9030047 [details] [diff] [review]:
-----------------------------------------------------------------

::: memory/build/mozjemalloc.cpp
@@ +2216,5 @@
> + return (x << n) | (x >> (32 - n));
> +}
> +
> +/* An xorshift based PRNG */
> +static inline uint32_t rand(arena_t *arena, bool lock)

Maybe provide a link to the source for these xorshift magic numbers? There are multiple variations of xorshift PRNGs.

@@ +2290,5 @@
> +
> +     /* Usable allocation found. */
> +     bit = 32 - pos + ffs((int)rotl32(mask, pos)) - 1;
> +     if (bit >= 32)
> +       bit -= 32;

Consider moving these duplicated "Usable allocation found" rand/rotl32 code snippets into one inline function or macro.
Why is the patch against esr60?
(In reply to Mike Hommey [:glandium] from comment #11)
> Why is the patch against esr60?

Tor is interested in taking this patch so I'm working on it there. If you think we could take this on central if memory or performance doesn't regress too much i can rebase it to there too!
I was thinking about a per-arena opt-in for central. Also, it seems like this should be using mozilla/RandomNum.h

I'm going to pick this back up again. I'm making it per-arena opt-in; but I'm trying to get performance good enough that it's not infeasible to enable it on all arenas by default. Using mozilla/RandomNum.h unsurprisingly was abysmal from a performance standpoint so I will be trying some tricks to get it better...

Performance update for those following at home: https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central&newProject=try&newRevision=02a5e3b5b4fb03fc774298814eaf8c7dc023057d&selectedTimeRange=172800

This run is with a fake RNG (I chose 12!) so it shows some of the impact of affecting the memory layout without the RNG cost. The real memory layout would be a bit more random than skipping to the 12th slow in the freelist.

The cost is generally pretty low: around 1% sometimes 2% with the exception of the nth-index tests in perf-reftest-singletons which tests stylo's nth-index-cache where we obliterate performance with a regression of like 40%. The other outliers (4% regression on two dromaeo_css tests) might (probably?) feed from that general regression.

That's probably a sufficient argument for either not enabling this on all arenas or perhaps taking rust code into account or something - but I think using overall performance cost is still a good metric to refine a performant RNG design even if it's only used on a few high-value arenas.

Okay; more performance data (focused on linux64):

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=1ed7e00e1d7dcbcf23bff6bc2b7af451df1245cd&newProject=try&newRevision=b858a4e2d1c512f86719d8b06bb09f96d3371a36
This is the comparison between using a 64-bit state xorshift PRNG and a 128-bit state. 128-bit is preferable and we see it adds a max of about 2% all with low confidence results.

https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central&newProject=try&newRevision=b858a4e2d1c512f86719d8b06bb09f96d3371a36&selectedTimeRange=172800 is the comparison of -central vs a 128-bit state. On a few specific CSS tests we see significant regressions. (Which I believe is caused by the aforementioned n-th index cache.) On responsiveness we see 2.5-4% regressions. Other than those we see a lot of low-percentage increases, a few decreases, and a few 2-4% increases with low confidence.

I'm going to see what I can do about this one specific cache...

So I spent some time switching rust's allocator to use a separate arena that did not have the randomization applied. This did not fix the tests. After consultation, it seems like the culprit isn't specifically the nth-index cache but rather those tests have a lot of DOM Traversal that happens. So the Dromaeo isn't an indicator of the nth-index, it's an indication of hurting dom traversal, and the nth-index tests show the same thing.

I'm going to prepare patches to turn this on by default in private arenas (which will hit the segregated String and Arraybuffer) but off by default in the non-private arenas.

More updates: Turning it on on all private arenas (which included the default JS Runtime) was still too expensive **. I'm going to try and turn it on for just the arraybuffer and string arenas and see what that does to performance.

The gains from this are very slight though. I'm going to try and land something in this patch, and then - assuming the performance isn't too bad - prepare a more complicated follow-up patch that will enable the protection by default in non-content processes. (The parent, and utility processes.) This will provide greater protection against sandbox escapes as the attacker will be going over IPC and have less ability to craft the heap as they desire (like they can when they have a scripting engine.)

** https://treeherder.mozilla.org/#/jobs?repo=try&revision=7b4646d0aba90eaa6eb6b5b9321d188c01df4c7c&selectedJob=247429116 and https://treeherder.mozilla.org/#/jobs?repo=try&revision=381431222c7f002d79fcd29929ef1418d8d88ca2

This allows freelist randomization on a per-arena basis, by supplying parameters to
arena creation. It opts in the Arraybuffer and StringBuffer arenas.

It uses an xorshift PRNG with a 128-bit state. It is not cryptographically secure. An
attacker who can observe outputs of the RNG, or read its state, is already in a position
to bypass the randomization applied. At the same time we make its state 128 bit to prevent
a trivial bypass if one or two outputs are observed.

The way a run selects masks to check has not been modified, so the randomization is limited
to at most 32 bits in the current mask being tested. It should be noted that while allocations
from the same run may now be non deterministic (up to the maximum entropy as previously
stated), an attacker who can perform multiple allocations will still be able to allocate
a targeted free region (for example while exploiting a use after free vulnerability in the
DOM). Non deterministic allocations will only impede an attacker who has less control over
how they allocate a targeted free region, and may provide some benefit during exploitation
of a heap based buffer overflow vulnerability where the attacker wishes to construct a
precise layout of regions pre overflow.

FWIW, I don't know how much of a drop-in replacement it would be but there's a version of the same or a very similar PRNG in MFBT: https://searchfox.org/mozilla-central/source/mfbt/XorShift128PlusRNG.h

(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #20)

FWIW, I don't know how much of a drop-in replacement it would be but there's a version of the same or a very similar PRNG in MFBT: https://searchfox.org/mozilla-central/source/mfbt/XorShift128PlusRNG.h

Thanks! I should be able to switch over to that; I'll do that soon.

The patch I attached adds randomization and enables it for the arraybuffer and stringbuffer arenas only. Here is a try run:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=3ca00141959506ff8305356334a655da6a7b6c15
https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central&newProject=try&newRevision=3ca00141959506ff8305356334a655da6a7b6c15&selectedTimeRange=172800

There are small regressions, most of them low confidence. Joel, wondering what you think of this performance-wise.

Flags: needinfo?(jmaher)

perf wise this is looking good- the small changes seem to be similar to your baseline which was 4 days ago, thanks for double checking!

Flags: needinfo?(jmaher)

Tom, can you please run the awsy tests against shippable builds for all tier-1 platforms as well? I'm concerned about possible memory regressions. I can help triage the results when they come in.

Flags: needinfo?(tom)

(In reply to Eric Rahm [:erahm] from comment #23)

Tom, can you please run the awsy tests against shippable builds for all tier-1 platforms as well? I'm concerned about possible memory regressions. I can help triage the results when they come in.

They're run in https://treeherder.mozilla.org/#/jobs?repo=try&revision=3ca00141959506ff8305356334a655da6a7b6c15&searchStr=SY - but I don't know where to find the results :)

(In reply to Tom Ritter [:tjr] from comment #24)

(In reply to Eric Rahm [:erahm] from comment #23)

Tom, can you please run the awsy tests against shippable builds for all tier-1 platforms as well? I'm concerned about possible memory regressions. I can help triage the results when they come in.

They're run in https://treeherder.mozilla.org/#/jobs?repo=try&revision=3ca00141959506ff8305356334a655da6a7b6c15&searchStr=SY - but I don't know where to find the results :)

Similar to talos we need a bunch of re-triggers. I ran some overnight and have a few more triggered now. You can see the results here:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central&newProject=try&newRevision=3ca00141959506ff8305356334a655da6a7b6c15&framework=4&selectedTimeRange=172800

For the "Base" numbers you can just look at the top-level summary that is shown. For the measurements w/o "Base" in the title those are actually a summary of sub-tests, you'll need to drill down to see if any of the sub-tests regressed. A cursory look says yes there are rather large regressions, and by a fair amount although talos seems uncertain.

It would help to have a baseline push (ie pop your current patch) that runs the same tests to get a more accurate comparison.

(In reply to Eric Rahm [:erahm] from comment #25)

It would help to have a baseline push (ie pop your current patch) that runs the same tests to get a more accurate comparison.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=7a5ddff304ec6427f74c3816b6dab26af376f9bd is a run without my patches apply but using the same base.

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=7a5ddff304ec6427f74c3816b6dab26af376f9bd&newProject=try&newRevision=3ca00141959506ff8305356334a655da6a7b6c15&framework=4 doesn't seem to show the same regressions.

Flags: needinfo?(erahm)

(In reply to Tom Ritter [:tjr] from comment #26)

(In reply to Eric Rahm [:erahm] from comment #25)

It would help to have a baseline push (ie pop your current patch) that runs the same tests to get a more accurate comparison.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=7a5ddff304ec6427f74c3816b6dab26af376f9bd is a run without my patches apply but using the same base.

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=7a5ddff304ec6427f74c3816b6dab26af376f9bd&newProject=try&newRevision=3ca00141959506ff8305356334a655da6a7b6c15&framework=4 doesn't seem to show the same regressions.

That looks like a net improvement across the board! r=me and we can monitor the situation in central. Ignore the windows7-32-shippable base content resident unique regression, that measurement is known to be problematic.

Flags: needinfo?(erahm)

That looks like a net improvement across the board!

The patch can not really have any impact on memory usage, either good or bad.

I worked on the review comments. As I did so, I examined the freelist selection algorithm in more detail. It was bad.

Start with 10 slots, 2 of them available for choosing. Intuitively, we'd want to choose between the two equally with a 50/50 chance - assuming our RNG is good.

Instead, even with a good RNG, the choice was weighted by the number of unavailable slots to the right (lesser significance) of an available slot. Consider:
A0000B0000 - A has a 50% chance; B has a 50% chance
000AB00000 - A has a 10% chance; B has a 90% chance.

I improved the algorithm so each slot has an equal chance of being chosen. Doing so is slower. Significantly, if you perform a micro-benchmark.

Here's a Talos run that enables randomization on the arraybuffer and string arenas, which was what we reviewed earlier:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central&newProject=try&newRevision=556792925854b34a9042b3ce882f11a6931050e7&selectedTimeRange=172800

Even with a slower bit in the hot path of allocation (well, hot path if you've enabled randomization) it doesn't seem to have really moved the needle much. It's really just bit twiddling in a tight loop, so even though it's 'slower', I think it's dwarfed by everything else.

Instead, even with a good RNG, the choice was weighted by the number of unavailable slots to the right (lesser significance) of an available slot.

You mean, it's weighted by the number of unavailable slots between available slots.

Tom, this is Tongping who was supported by Mozilla to work on an randomized allocator, thank you so much for the support. I wonder whether you could use the array (originated from Guarder paper --- USENIX Security'18) to solve this issue. When only these two available objects in the allocation-buffer, then both of them will get 50% of being chosen. Please let me know if you would like to discuss this with me over the phone, and I can also consider to fly there to talk about my projects on the randomized allocator.

Oh waw, that is very much more expensive than the previous iteration. A modulo operation with a non-constant compiles to an integer division instruction, which is super slow. And you follow that with a loop that depends on the result of the division. I'm really not convinced your concern in https://bugzilla.mozilla.org/show_bug.cgi?id=1376408#c29 justifies this cost.
Performance for this should really be measured on Android. Also note that the less performant you make this the more difficult it will be to enable on more arenas.

I compared the old and new algorithms here:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=309f656d81f2d7045588438f85cce4f43a5b3339&newProject=try&newRevision=ec78dd86fa30ab971dcf960f8ba86349aca29d03

It does hurt quite a bit. I'm reverted to the old version and will think about a better approach.

Pushed by archaeopteryx@coole-files.de:
https://hg.mozilla.org/integration/autoland/rev/c277d0025188
Randomize free region selection for small allocations in a run r=glandium

Keywords: checkin-needed
Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla69
Whiteboard: [adv-main69-]
You need to log in before you can comment on or make changes to this bug.