Closed Bug 1404860 Opened 7 years ago Closed 3 years ago

[meta] At least ~33000 crashes/week due to use-after-free

Categories

(Core :: Memory Allocator, defect, P3)

defect

Tracking

()

RESOLVED WORKSFORME
Tracking Status
firefox-esr52 - wontfix
firefox56 --- wontfix
firefox57 + wontfix
firefox58 + wontfix
firefox59 --- wontfix
firefox60 --- wontfix

People

(Reporter: nbp, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: meta, sec-audit)

The following link highlights all crashes which are reading/writing/executing memory with a jemalloc poisoned pointer, by looking at addresses containing the pattern "e5e5".  Thus ~33000 crashes/week is a lower bound approximation.

https://crash-stats.mozilla.com/search/?address=~e5e5&product=Firefox&date=%3E%3D2017-09-25T09%3A41%3A05.000Z&date=%3C2017-10-02T09%3A41%3A05.000Z&_sort=-date&_facets=signature&_facets=address&_facets=reason&_facets=platform&_facets=version&_facets=platform_version&_facets=cpu_arch&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#crash-reports

I came to look at this issue because many of these crashes appeared in regions which are not supposed to free anything, such as code using LifoAlloc, and the presence of the "e5e5" pattern highlight that some other part of Gecko freed the region. (Bug 1263794 comment 10)

When we crash with an "e5e5" the source of the problem already disappeared. The source of the problem is the free which happened before we went reading this poisoned memory.

One option, would be to look at all the memory regions reported in the previous search and instrument jemalloc to return a random value on allocations & reallocations, which must be provided when the region is reallocated & freed.  If this random value is not provided, then we should crash immediately.
This reminds me of our conversation on IRC the other day. Perhaps we could start by adding an assertion around [1] to ensure the incoming pointer is page-aligned. I'm not as clear on how smaller allocations work, but perhaps we could similarly assert at [2] that the pointer is aligned to the associated bin's region size.

(In reply to Nicolas B. Pierron [:nbp] from comment #0)
> One option, would be to look at all the memory regions reported in the
> previous search and instrument jemalloc to return a random value on
> allocations & reallocations, which must be provided when the region is
> reallocated & freed.  If this random value is not provided, then we should
> crash immediately.

I did this once in bug 1271165. The jemalloc part of that bug was landed reluctantly and without review (and was later backed out), but it did work. Now that we've forked mozjemalloc and compile it as C++, perhaps that patch could be cleaned up a little to use some more Gecko stuff like [3].

[1] https://dxr.mozilla.org/mozilla-central/source/memory/build/mozjemalloc.cpp#3743
[2] https://dxr.mozilla.org/mozilla-central/source/memory/build/mozjemalloc.cpp#3680
[3] https://dxr.mozilla.org/mozilla-central/source/mfbt/XorShift128PlusRNG.h
I would be very interested to know what would happen if we shipped Nightly
for a week using an alternative allocator that recycles pages as slowly as
possible.  That is, when a page is free, it is marked no-access and kept
that way for as long as possible.  That would force at least some of the
UAFs to result in a crash at the UAF-point.

We should do this anyway.  It's an easy thing to do, which means that we
should assume our adversaries are already doing it.
Bug 1386297 (shipping ASan Nightly builds) might also help here.
That's also true.  ASan gives us complete UAF detection but only for
accesses within our own code.  A debugging allocator would give us partial
UAF detection for all code in the process.  So the approaches are
complementary.

With our use of ASan, we've (presumably) driven down the ASan-detectable UAF
rate to approximately zero.  I'm interested to know what we can harvest
using a debugging allocator, since -- AFAIK -- we've never tried that.
(But correct me if I'm wrong.)
(In reply to Julian Seward [:jseward] from comment #4)
>
> With our use of ASan, we've (presumably) driven down the ASan-detectable UAF
> rate to approximately zero.

While I'm all for implementing a debugging allocator and think that every additional testing method we can get is great, I doubt that the ASan-detectable UAFs are down to zero. We only run ASan in our CI and with fuzzing (mostly on Linux). There are still types of configuration and use-cases not covered by this and I believe that in the wild, we will see some crashes that we haven't seen before in our environments. It is hard to say though how many those would be.
(kinda OT, but /me wonders ..)
If we had a debugging allocator which could catch SEGV, it could even
speculatively remove access from pages which are almost, but not quite,
empty.  Provided we can reliably restart after SEGV and are prepared
to tolerate some perf overhead in dealing with these signals.
(In reply to Christian Holler (:decoder) from comment #5)
> I doubt that the ASan-detectable UAFs are down to zero.  [..] There are
> still types of configuration and use-cases not covered by this

Yes.  I should have been more precise.  I meant, down to zero for the
tests/fuzzings that we are actually able to run.
(In reply to Julian Seward [:jseward] from comment #6)
> (kinda OT, but /me wonders ..)
> If we had a debugging allocator which could catch SEGV, it could even
> speculatively remove access from pages which are almost, but not quite,
> empty.  Provided we can reliably restart after SEGV and are prepared
> to tolerate some perf overhead in dealing with these signals.

I don't think this can be done safely. First of all, some SEGVs are intended and skipping them opens up security vulnerabilities and other unexpected behavior and secondly, resuming safely after a SEGV isn't possible because at the point when you SEGV, most likely there have already been program/data constrains violated. Not terminating the problem can then have all sorts of undesired behavior, e.g. data corruption.
(In reply to Christian Holler (:decoder) from comment #8)
I didn't mean we should ignore SEGVs.  I meant, the allocator catches them.
If a fault is for a page it has made no-access then it makes the page
accessible (assuming the access is legit) and restarts the faulting insn.
Otherwise, it has to re-throw the fault to the next handler in the chain.
That's the tricky bit.

This is really just doing in user space what the kernel has to do anyway in
order to implement paging.
(In reply to Christian Holler (:decoder) from comment #8)
> I don't think this can be done safely. First of all, some SEGVs are intended
> and skipping them opens up security vulnerabilities and other unexpected
> behavior

This is mostly a question of cascading the exception handlers in the correct order, right? Of course, inserting extra logic before a crash makes it more likely that something might go wrong, but I don't think inserting an exception handler between the JITs and Breakpad is necessarily an issue. This is basically what MemoryProtectionExceptionHandler [1] does (including forwarding the exception to Breakpad), though it's different from the suggestion here in that it never prevents a crash.

> secondly, resuming safely after a SEGV isn't possible because
> at the point when you SEGV, most likely there have already been program/data
> constrains violated. Not terminating the problem can then have all sorts of
> undesired behavior, e.g. data corruption.

I think the main concern here is that a hijacked process might be able to trick the exception handler into letting it stay alive, correct?

[1] https://dxr.mozilla.org/mozilla-central/source/js/src/ds/MemoryProtectionExceptionHandler.cpp
(In reply to Julian Seward [:jseward] from comment #2)
> […]  That would force at least some of the
> UAFs to result in a crash at the UAF-point.

This almost what the poison value is doing, except that the poison is crashing one memory access later than expected.

The problem mentioned in comment 0 is how can we convert crashes with the stack of the "use" into crashes with the stack of the "free".
(In reply to Nicolas B. Pierron [:nbp] from comment #11)
> The problem mentioned in comment 0 is how can we convert crashes with the
> stack of the "use" into crashes with the stack of the "free".

If we had a debugging allocator, then we could record the stack and
bounds of every freed block.  Then when the program faults because of
an access to mprotected-no-access freed memory, we will know

* the stack of the invalid access

* the faulting address (from the signal), and hence

* the stack of the free() call for the faulting address.

This unfortunately won't work with the poison-value scheme because
we don't know the address of freed memory that was read.  So we
can't associate it with any specific freed block.
(In reply to Julian Seward [:jseward] from comment #12)
> (In reply to Nicolas B. Pierron [:nbp] from comment #11)
> > The problem mentioned in comment 0 is how can we convert crashes with the
> > stack of the "use" into crashes with the stack of the "free".
> 
> If we had a debugging allocator, then we could record the stack and
> bounds of every freed block.  Then when the program faults because of
> an access to mprotected-no-access freed memory, we will know

Ok, this sounds good to catch logical issues in the code which is manipulating memory incorrectly.

Could we catch a jemalloc issue?

One of the hypothesis I had was that jemalloc give the same pointer, or a sub-region pointer, to another part of the code, which thus free it, and poison part of our buffers.
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #1)
> This reminds me of our conversation on IRC the other day. Perhaps we could
> start by adding an assertion around [1] to ensure the incoming pointer is
> page-aligned. I'm not as clear on how smaller allocations work, but perhaps
> we could similarly assert at [2] that the pointer is aligned to the
> associated bin's region size.

I have a patch that does this, seems to work fine locally. I'm not sure about the policy regarding security bugs - if I file a separate bug, should it too be marked security sensitive? Should it block this one? Or should I file a regular bug that doesn't reference this one?
(In reply to Nicolas B. Pierron [:nbp] from comment #0)
> The following link highlights all crashes which are reading/writing/executing memory
> with a jemalloc poisoned pointer, by looking at addresses containing the pattern "e5e5".
> Thus ~33000 crashes/week is a lower bound approximation.

Note that many individual signatures have been filed as separate UAF bugs already, either by specifically looking for them (thanks Randell!) or by flagging top crash bugs when the symptoms are seen. A brainstorm on how to make these safer (e.g. the "framepoisoning" used in the frame tree) is a good idea.

(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #14)
> I have a patch that does this, seems to work fine locally. I'm not sure about the policy
> regarding security bugs - if I file a separate bug, should it too be marked security
> sensitive? Should it block this one?

If the bug is a mitigation for a class of security problems we can work on that in the open. Attackers are well-aware that browsers have exploitable UAFs! If your assertions find bugs that need to be fixed you can file those specific ones as security-sensitive bugs.
Group: core-security → core-security-release
(In reply to Daniel Veditz [:dveditz] from comment #15)
> If the bug is a mitigation for a class of security problems we can work on
> that in the open. Attackers are well-aware that browsers have exploitable
> UAFs! If your assertions find bugs that need to be fixed you can file those
> specific ones as security-sensitive bugs.

Thanks! I filed bug 1405159 about adding the assertions, which might help test Nicolas' theory and could catch some bugs if it pans out.
(In reply to Julian Seward [:jseward] from comment #2)
> I would be very interested to know what would happen if we shipped Nightly
> for a week using an alternative allocator that recycles pages as slowly as
> possible.  That is, when a page is free, it is marked no-access and kept
> that way for as long as possible.  That would force at least some of the
> UAFs to result in a crash at the UAF-point.
> 
> We should do this anyway.  It's an easy thing to do, which means that we
> should assume our adversaries are already doing it.

ABSOLUTELY.  I used to (back in the Old Days) build mozilla with dmalloc (on freebsd), with custom mods that stored freed memory in a fifo queue before actually freeing it, and examined it before actually freeing it for corruption (and periodically), along with some additional mods.  (IIRC, it was a long time ago).  We can do a lot better than that, of course, but the slow-recycling is a key point.  You can also (at major memory and perf cost) bump smaller allocations up to a page each, which means you can protect-on-free.  (That may limit your recycle time unless you have a LOT of ram; I have one machine with 64GB which may help there - but VM may also save you, since these pages don't actually need RAM -- do the real limiting factor may be perf).  

These perf factors may limit shipping nightly with the harshest version of this (page-per-allocation), but being able to "easily" run such a configuration locally (or on Try) would be a Big Win, I think.

What is needed to actually make this happen?  First let's get it working, then see if we feel comfortable shipping it to Nightly...
Flags: needinfo?(mh+mozilla)
Flags: needinfo?(jseward)
(In reply to Randell Jesup [:jesup] from comment #17)
> What is needed to actually make this happen?  First let's get it working,
> then see if we feel comfortable shipping it to Nightly...

FWIW, I would love to do this.  It's a significant chunk of work though,
so I'd need buy-in from my manager.
Flags: needinfo?(jseward) → needinfo?(n.nethercote)
All you're describing here is essentially done by asan, and we're going to ship those builds to some users, aren't we?
Flags: needinfo?(mh+mozilla)
(In reply to Mike Hommey [:glandium] from comment #19)
> All you're describing here is essentially done by asan, and we're going to
> ship those builds to some users, aren't we?

Julian said "ASan gives us complete UAF detection but only for accesses within our own code.  A debugging allocator would give us partial UAF detection for all code in the process.  So the approaches are complementary."

I presume Julian means that ASAN doesn't cover code we didn't compile, but uses memory allocation.  To be honest, I doubt that case is a major part of our UAF problems; I think almost all of it is in our code (somewhere), but isn't being hit in automation *and* isn't being hit by random engineers who run ASAN for "normal" browsing.

Also, ASAN is quite expensive, execution-time wise; some of these alternatives might be much less so.  Also, is ASAN available now on Windows?  It didn't use to be, but perhaps it is now.

I'm not the expert here, of course.
(In reply to Randell Jesup [:jesup] from comment #20)
> (In reply to Mike Hommey [:glandium] from comment #19)

Right.  ASan will find all UAF accesses in C/C++ code written by us.  It
will miss accesses made by any JIT generated code, by any Rust code, by any
of the libraries in the process, and by any other code that winds up in the
process by whatever means.

A debugging allocator provides some (perhaps very little, but we don't know)
level of checking for the entire process.  It also has potential to perform
better than ASan-ified code, since apart from malloc/free calls we're not
incurring extra expense.

So the two approaches are partially overlapping and partly complementary.

My real point is that since a debugging allocator is easy to do, we should
assume that others have already done this, and hence may have gained extra
insight into our UAF behaviour that we don't have.

Also, I would like to have something that can run at close to "full speed"
so as to minimise distortion to thread scheduling.  The reason is that I
have often wondered how much data races contribute to heap corruption, but
so far we don't have any feasible way to find out.
Keywords: meta, sec-audit
> FWIW, I would love to do this.  It's a significant chunk of work though,
> so I'd need buy-in from my manager.

Julian and I talked about this and we agreed that he will spend time on this in Q4. UAFs are bad and it's worth putting effort into finding new ways to identify and fix them.
Flags: needinfo?(n.nethercote)
I was discussing Bug 1407651 with Ted Campbell, and now I am starting to wonder if some of the use-after-free we are seeing are  instances of a jemalloc meta-data corruption, which ends-up reusing the same pages, thinking they are already cleared, which would manifest it-self as a double-allocated-buffers, or as double-allocated-and-free buffers.

I guess this could explain some of the middle-buffer jemalloc poisons we saw in the Assembler buffers.
Assignee: nobody → jseward
Hope it was not too forward to assign Julian based on comment 22, but since this project fixes sec-critical bugs I'd like to have an owner.
(In reply to Daniel Veditz [:dveditz] from comment #24)
> Hope it was not too forward to assign Julian based on comment 22, but since
> this project fixes sec-critical bugs I'd like to have an owner.

Is it realistic/sensible to assign to a single person a bug that covers all UAFs in Firefox?
(In reply to Nicolas B. Pierron [:nbp] from comment #23)
> I was discussing Bug 1407651 with Ted Campbell, and now I am starting to
> wonder if some of the use-after-free we are seeing are  instances of a
> jemalloc meta-data corruption, which ends-up reusing the same pages,
> thinking they are already cleared, which would manifest it-self as a
> double-allocated-buffers, or as double-allocated-and-free buffers.

If you suspect this to be a jemalloc problem, we could try to get AddressSanitizer running with jemalloc. It should be possible do to that if we add the right poison/unpoison calls to jemalloc code. With that, we could run CI and Fuzzing with jemalloc+asan and maybe find the problem?
Actually, we should probably keep this on the 57 radar in the event that a low-risk dot release ride-along candidate comes along.
Tracking 58+ for this with the hope we can make some progress on this before the release.
Note also that uninitialized allocated memory likely would still have e5e5 in it, and we have LOTS of constructors-that-don't-initialize-all-members (see Coverity reports).

Note that some e5e5 crashes (like most EXEC failures) wouldn't be uninitialized accesses, unless there's an explicit function pointer that wasn't initialized.  Most EXEC failures would be something like a UAF method call through a (freed) vtable.   If we made (or turn on) poison-on-alloc we could distinguish between them.  perhaps worth doing for Nightly builds (with a different poison value)
(In reply to Randell Jesup [:jesup] from comment #30)
> If we made (or turn on) poison-on-alloc we
> could distinguish between them.  perhaps worth doing for Nightly builds
> (with a different poison value)

mozjemalloc does have the ability to poison on alloc (with 0xe4); currently it's only enabled on debug builds [1]. I imagine enabling it for opt builds has some performance overhead, but indeed it might be worth taking the hit on Nightly if it would give us more information.

[1] https://dxr.mozilla.org/mozilla-central/source/memory/build/mozjemalloc.cpp#1172
Note bug 525063 is taking care about the constructors-that-don't-initialize-all-members.
Tracking 57+ based on Comment 28.
We should keep an eye on bug 1410132 and see what that does to JS crash signatures.
(In reply to Randell Jesup [:jesup] from comment #17)
> ABSOLUTELY.  I used to (back in the Old Days) build mozilla with dmalloc (on
> freebsd), with custom mods that stored freed memory in a fifo queue before
> actually freeing it, and examined it before actually freeing it for
> corruption (and periodically), along with some additional mods.  (IIRC, it
> was a long time ago).  We can do a lot better than that, of course, but the
> slow-recycling is a key point.  You can also (at major memory and perf cost)
> bump smaller allocations up to a page each, which means you can
> protect-on-free.  (That may limit your recycle time unless you have a LOT of
> ram; I have one machine with 64GB which may help there - but VM may also
> save you, since these pages don't actually need RAM -- do the real limiting
> factor may be perf).  

It might be interesting to offer nightly users with huge amounts of RAM the ability to temporarily opt into this. I also have a machine with 64GB of RAM that's mostly unused unless I'm compiling, so I'd at least consider it.

It also might make sense to just turn on this behavior during startup and for a short time afterwards, and then revert to normal allocation patterns, since that's when a lot of the data corruption crashes I've been looking into seem to occur. Or possibly enable during startup only if the last session crashed.

The problem with both of these options, though, is that the crash rates for all of the data corruption bugs I'm following are extremely low on nightly. We might not get any useful data for weeks, even if we enabled for all users during that time. So I suspect that enabling only during startup, and only after a crash, may be the only viable option.
If all goes well, in Q1, we will also probably have the ASan Nightly program started. Which means special opt-in builds for people to browse with on a special nightly channel.

If this is however a problem related to jemalloc, then we would have to get ASan+jemalloc on the way first.
I doubt that this, or most of this, is a jemalloc problem.
(In reply to Randell Jesup [:jesup] from comment #37)
> I doubt that this, or most of this, is a jemalloc problem.

If it is not a jemalloc problem, then we should also spend some time thinking about why our existing tools (ASan) do not catch this, instead of thinking about new tools to do the job. If jemalloc is not required to reproduce the problem, then fuzzing should be able to hit the problem at least intermittently (and with a bit of luck, CI as well).
> If it is not a jemalloc problem, then we should also spend some time
> thinking about why our existing tools (ASan) do not catch this, instead of
> thinking about new tools to do the job.

Comment 21 talks about this: JIT code, library code, and Rust code are not covered by ASan.
(In reply to Nicholas Nethercote [:njn] from comment #39)
> > If it is not a jemalloc problem, then we should also spend some time
> > thinking about why our existing tools (ASan) do not catch this, instead of
> > thinking about new tools to do the job.
> 
> Comment 21 talks about this: JIT code, library code, and Rust code are not
> covered by ASan.

Though they should, at least in theory, be covered by Valgrind.

The main problem with ASan and Valgrind, though, is that they only catch issues that show up in automation. And it will be much easier figure out why our existing fuzzers don't catch these issues when we know what's causing them. Diagnostic tools that we can actually enable in production would make that much easier.
(In reply to Nicholas Nethercote [:njn] from comment #39)
> 
> Comment 21 talks about this: JIT code, library code, and Rust code are not
> covered by ASan.

And we think that the source of the uafs we are seeing are these sources? Some comments here seem to think otherwise. It seems everything else than clear to me what we think is the potential source of the problem right now. I agree with Randell here in saying that these sources are probably not the major source of the issues we're seeing here but without any additional data, it is pure guessing.

If additional tooling can help us answer that question, then I'm all for it.

In parallel, it might also make sense though to get some developers and other people to surf around with ASan Nightly and see if that brings up stuff. We now have an automated reporter in a special ASan Nightly build that can auto-submit ASan reports back to our infrastructure. I am currently writing the proxy to receive that data and deploy it in AWS (which is not a hard task). The only task left is updates for these builds. We currently don't have those, and as long as that is the case, we can't make the program public at least. But we could still use this internally already in Q4 and doing so is a low resource investment if we can get some people to use this internally for a few hours a day.
Sharing here what we (:jseward and I) found out today about delaying allocator and ASan:

In https://reviews.llvm.org/D30101 (Feb 2017) , the ASan devs implemented two new options:

> max_free_fill_size
>    "ASan allocator flag. max_free_fill_size is the maximal amount of "
>    "bytes that will be filled with free_fill_byte during free."


> free_fill_byte
>    "Value used to fill deallocated memory."


Since ASan itself already uses a delaying allocator, we could use regular ASan and turn on these two options to detect use-after-free more reliably if they memory accesses come from uninstrumented code as suggested earlier (JIT, Rust, external libs). We wouldn't need a separate delaying allocator anymore then.

I haven't found out yet which Clang version we would need to get these options, but I assume the Clang we currently use for our ASan builds is too old to have it.

Mike, could you help us to figure out which version we need for these options and how to best get our ASan builds on that version? Thanks!
Flags: needinfo?(mh+mozilla)
It's in clang 5.0. Upgrading all builds to that version was attempted in bug 1409265 but failed because of bug 1409622. It's still possible to change the clang version for ASAN builds only, but the patch in bug 1409265 didn't actually enable it for ASAN builds because of bug 1409267. They would probably hit bug 1409622 anyways.
Flags: needinfo?(mh+mozilla)
It seems like bug 1409622 (the OSX failure) is already being worked on. In parallel, I'm looking at bug 1409267. Once both are resolved, we can upgrade ASan to Clang 5.0 and use the feature that we need for this bug.
Blocks: 1414380
Blocks: 1420582
Is this still on your radar, Julian?
Flags: needinfo?(jseward)
Not sure this bug is really worth tracking since it seems more meta in nature. Certainly any actionable UAF bugs we find should :)
(In reply to Frederik Braun [:freddyb] from comment #45)
> Is this still on your radar, Julian?

No, because anything that this can do, ASan can do better.  So we'd
be better off just working with ASan builds.
Flags: needinfo?(jseward)
We have ASan builds for surfing around and reporting failures back to us (as a proof-of-concept, they report issues to a dedicated infrastructure separate from regular crash reporting) but we don't have automatic updates for them yet (see bug 1403548). Once we have that, we can do more opt-in testing with these builds to determine how much potential this has.
This was assigned to Julian because of the debug allocator idea, but now that we've determined that's not worth pursuing I am de-assigning him.

I also wonder if this bug is so broad that it's not useful.
Assignee: jseward → nobody
(In reply to Nicholas Nethercote [:njn] from comment #49)
> I also wonder if this bug is so broad that it's not useful.

This bug is not about fixing all the use-after-free.
The problem is how do we convert non-actionable security bugs in actionable bugs.

ASan sounds like a good solution. The question is when would we be able to ship it to all our Nightly and Beta users, and even to our Windows users?
(In reply to Nicolas B. Pierron [:nbp] from comment #50)
> (In reply to Nicholas Nethercote [:njn] from comment #49)
> > I also wonder if this bug is so broad that it's not useful.
> 
> This bug is not about fixing all the use-after-free.
> The problem is how do we convert non-actionable security bugs in actionable
> bugs.
> 
> ASan sounds like a good solution. The question is when would we be able to
> ship it to all our Nightly and Beta users, and even to our Windows users?

I think that's already one step too far. The ASan Nightly program is supposed to answer the question if this has any value and if so, how much [*]. Without that kind of data, we shouldn't make a decision about shipping ASan on any regular channels because we don't know if it is going to help at all.


[*] Assuming the opt-in program reaches a suitable number of users.
Priority: -- → P3
Summary: At least ~33000 crashes/week due to use-after-free → [meta] At least ~33000 crashes/week due to use-after-free

Running the same query as in comment 0, over the past 3 months, our UAF rate is down to ~1700 uaf/week.
This looks similar to the order of magnitude of bit flips.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → WORKSFORME
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.