Make PageProtectingVector more useful

RESOLVED FIXED in Firefox 52

Status

()

defect
RESOLVED FIXED
3 years ago
Last year

People

(Reporter: ehoogeveen, Assigned: ehoogeveen)

Tracking

(Blocks 1 bug)

Trunk
mozilla52
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox52 fixed)

Details

Attachments

(3 attachments, 6 obsolete attachments)

3.30 KB, patch
Details | Diff | Splinter Review
7.90 KB, patch
jandem
: review+
Details | Diff | Splinter Review
33.22 KB, patch
ehoogeveen
: review+
Details | Diff | Splinter Review
I added PageProtectingVector in bug 1273462 to let us catch unauthorized writes into protected memory (heap corruption) for vectors that primarily grow. However, we never enabled it, for two reasons:

1) It's too slow. Enabling protection on all AssemblerBuffers causes a 2% regression on Kraken (and possibly other benchmarks not included in Talos).
2) Generic memory protection crashes are hard to spot in crash stats, especially for a relatively rare crash like 1124397 (it may be a top crash, but it isn't very frequent in Nightly).

This bug intends to fix both of those things. In addition, I'd like to extend PageProtectingVector to also protect *unused* pages, to minimize the time during which pages are unprotected and maximize our chances of spotting corruption.
This extends PageProtectingVector to also protect unused pages. This sets their permissions to PROT_NONE (and the Windows equivalent), so even reading from them will trigger an exception. I haven't actually *benchmarked* this yet, and I didn't give it a kill switch, but I do think it's ready for review.
Attachment #8794728 - Flags: review?(jdemooij)
This makes protection only kick in for AssemblerBuffers that are big enough. I'm hopeful that this will close the performance gap, but I haven't actually benchmarked this yet either. I think we'll need something like this regardless if we *ever* want to enable protection, though the constant that AssemblerBuffer uses could be adjusted.
Attachment #8794730 - Flags: review?(jdemooij)
Finally, this adds an exception handler that annotates crashes from access to protected regions, and makes PageProtectingVector use it.

Adding an exception handler on Windows is trivial. On Linux it's slightly more involved, since you have to remember and restore the previous handler, but still not a big deal.

On OSX, adding an exception handler is an absolute nightmare. We have to use the Mach exception handler mechanism because Breakpad uses it, and the Unix signal handler is called *after* the Mach exception handler. It's *particularly* bad here because we have to forward to the previous handler, but the previous handler may be a different *kind* of exception handler.

Luke, you implemented the asm.js signal handler, which probably makes you the most qualified person to look at this. I initially relied heavily on that pre-existing code to make this, but I ended up basing the exception forwarding code on someone else's prototype (linked in a comment).

I don't have a Mac to test this on locally, so I ended up relying heavily on Try to get this working. It seems to be working now (I can see the annotations in the build log), but I would appreciate if either of you could test this locally, especially with the crash reporter, to see if it's working correctly.

Here's a try build to ensure that this doesn't blow everything up normally: https://treeherder.mozilla.org/#/jobs?repo=try&revision=8ac5aa1f8289
Attachment #8794734 - Flags: review?(luke)
Attachment #8794734 - Flags: review?(jdemooij)
And finally, here's a small patch to make testing this easy. This simply comments out the selective unprotect calls used when patching jumps. When you get one of these crashes, it should print "Hit MOZ_CRASH(Tried to access a protected region!)" to stderr in debug builds. If the crash reporter is enabled, it should set the annotation to "Tried to access a protected region!" (I'm not sure how to check that this is working).

Here's a nicely red try build to show that it's working: https://treeherder.mozilla.org/#/jobs?repo=try&revision=d370a924de3f
Comment on attachment 8794734 [details] [diff] [review]
Part 3: Add an exception handler to annotate memory protection crashes in regions of interest.

Review of attachment 8794734 [details] [diff] [review]:
-----------------------------------------------------------------

While this is really impressive (especially mach exception forwarding), this feels like a pretty large technical cost for the benefit of annotating crash reports and I think there's a distinctly non-zero risk that this could add its own source of crashes.  Are we sure we need this?
(In reply to Luke Wagner [:luke] from comment #5)
> While this is really impressive (especially mach exception forwarding), this
> feels like a pretty large technical cost for the benefit of annotating crash
> reports and I think there's a distinctly non-zero risk that this could add
> its own source of crashes.  Are we sure we need this?

Yes, topcrash bug 1124397 has been baffling us for a long time now, and this is pretty much the only way forward.

We don't need to keep this forever, though. We can make it Nightly-only and back it out after we know more (I think, I didn't look at the patches yet).
Yeah, I wasn't expecting this to be such a large undertaking when I started, so I understand your concern. I'm obviously biased in favor of the code I wrote, but here's my perspective:

1) The complicated platform-specific parts here are only used when we're *already* crashing, so I wouldn't expect it to increase the number of crashes; we would just move a kind of crash we never look at (memory protection crashes) to a MOZ_CRASH() in the exception handler.
2) This seems to me like the kind of code that will either always work or always fail - say if I handled forwarding incorrectly for a particular type of exception handler. If we see any of these MOZ_CRASH()es show up in the wild, we can probably figure out where it's going off the rails.
3) This isn't necessarily an argument for taking the code *now*, but I suspect bhackett will have to do something similar to this for his web replay work. From what I understand the initial landing will simply disable all exception handlers during recording and playback, but maybe he can build off this code later (though that would mean either duplicating or generalizing it).

Those are arguments against *not* taking it - the argument *for* taking it is more vague IMO, but maybe Jan can chime in. He told me during MozLondon that the primary reason he never followed up on enabling PageProtectingVector is that he was worried the crashes would just get lost in the noise. This exception handler *should* make these crashes all stand out in the same way across platforms (searching for memory protection crashes on Socorro is a chore, since each OS denotes them differently). But I don't know how big the benefit of that will be in practice - parts 1 and 2 in this bug won't help *us* spot the crashes, but it'll at least move them into another component.
Some benchmark results: I can see no difference on Windows for Octane or Kraken, though the latter is very noisy. Using Talos, linux64 seems to show a small but consistent 0.75% or 0.56% (e10s) regression for Kraken on part 1, and another 0.65% or 0.84% regression on part 2. This leads to a combined regression of 1.40% or 1.41% (e10s). linux64 is the only platform with a noise level low enough to draw any conclusions, and even then the regression is small, but I think there is one.

I could try to gate the protection in part 1 on the buffer size as well and see if that reduces the regression from that part, or we could even just drop part 1 altogether (since it doesn't protect data we're actually using). What do you think Jan? This is near the limit of what Talos can even measure, but AWFY might show the same thing more clearly.
Flags: needinfo?(jdemooij)
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #8)
> I could try to gate the protection in part 1 on the buffer size as well and
> see if that reduces the regression from that part, or we could even just
> drop part 1 altogether (since it doesn't protect data we're actually using).
> What do you think Jan?

Dropping part 1 makes sense IMO. It might catch more issues, but it's also more code, more overhead, etc.

Just to be clear, the plan is to remove this code (or at least to stop using PageProtectingVector for the AssemblerBuffer), once we know more, right?
Flags: needinfo?(jdemooij)
(In reply to Jan de Mooij [:jandem] from comment #9)
> Just to be clear, the plan is to remove this code (or at least to stop using
> PageProtectingVector for the AssemblerBuffer), once we know more, right?

If we don't think it's providing any value to us, sure. I'd like to keep the mechanism in the tree for sure (along with the exception handler, if possible), and protecting the buffer like this might be useful in general so the JIT component become less of a dumping ground for heap corruption crashes - but we could certainly make it Nightly only, or just raise the minimum buffer size that gets protected even more.

Right now the minimum size is 32k, which may already be a bit conservative - I believe we're mostly seeing these crashes with 128k buffers, right? The corruption may be happening when the buffer is still smaller, so I didn't want to raise the limit too much.

In any case, I'll remove part 1 and rebase the other patches around that. It's a pity we can't fully minimize the vulnerable region, but probably not worth fighting too hard for.
Ok, thanks for the explanations, both of you.  Making this NIGHTLY-only would help address some of my main concerns which are:
 - the asm.js/wasm out-of-bounds path is not on an "already going to crash" path; they resume execution after some filtering to ensure this is a proper wasm out-of-bounds
 - some issues only show up late in release (e.g., bug 1137050)
 - crashes from within the crash handler can lead to breakpad never being called at all (hence the rt->handlingSegFault check in WasmSignalHandlers.cpp).

However, iiuc, we're just trying to inject a call to MOZ_CRASH_ANNOTATE() before we end up crashing the normal way, in breakpad, right?  For that, could we instead add this logic into the existing wasm signal handlers?  This could be part of the filtering done where the handlers decide "not something I care about" and let the handling continue.  This would have the advantage of not requiring any forwarding logic and also not needing to reason about how the two handlers interact.
(In reply to Luke Wagner [:luke] from comment #11)
>  - the asm.js/wasm out-of-bounds path is not on an "already going to crash"
> path; they resume execution after some filtering to ensure this is a proper
> wasm out-of-bounds

That's true, but the wasm signal handlers are installed at JSRuntime creation, whereas this is installed in JS::detail::InitWithFailureDiagnostic, so the wasm signal handler should always be called *first*; we should never hit this exception handler unless we're actually going to crash. In addition, the Mach exception handler in this bug operates on the *task* level, which is always called after the thread level, so it can't conflict with the wasm signal handlers.

>  - crashes from within the crash handler can lead to breakpad never being
> called at all (hence the rt->handlingSegFault check in
> WasmSignalHandlers.cpp).

Hmm. One thing I could do is restore the previous handler *the moment* we enter the exception handler in the first place. We don't actually handle the crash, so we're always forwarding to the next handler; if we crash while trying to do that we should end up in that handler anyway. I'm not confident that this will work on Windows, where it might mess up the handler search, but it should work for Linux and OSX.

Adding a reentrancy check to the Mach exception handler doesn't make much sense, because the thread can only process one message at a time - if we crash while handling a message, we obviously aren't listening for the next one. But I can add an atomic reentrancy check on Linux and Windows no problem.

> However, iiuc, we're just trying to inject a call to MOZ_CRASH_ANNOTATE()
> before we end up crashing the normal way, in breakpad, right?  For that,
> could we instead add this logic into the existing wasm signal handlers? 

Unfortunately not, at least not for the Mach exception handler (which I assume is the main sticking point here). The wasm signal handlers operate on a per-thread level, and we want to be able to catch crashes from *any* thread.

Does the above address your concerns? I'll implement those changes now.
Oh, and we can make this Nightly only if that's what it comes down to. I'm a bit worried that we'll run into a crash on release at some point that we just don't see in the Nightly population, but right now this is all motivated by one crash, so maybe worrying about that is premature.
In fact, let me also remove the loop in MachExceptionHandler() - it's just misleading. We will handle at most one message, so there's no need to pretend otherwise.
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #12)
Yes, thanks for explaining.
Alright, here's the new part 1, which is just part 2 rebased and tidied up a little after removing part 1. The regression on Kraken with this should be down to ~0.75%.
Attachment #8794728 - Attachment is obsolete: true
Attachment #8794730 - Attachment is obsolete: true
Attachment #8794728 - Flags: review?(jdemooij)
Attachment #8794730 - Flags: review?(jdemooij)
Attachment #8795532 - Flags: review?(jdemooij)
This implements the changes discussed:

1) All exception handlers now remove themselves from the chain before attempting to handle the exception, so they should crash "gracefully" if anything goes badly off the rails.

2) On Windows and Linux, we guard against reentrancy and races using an atomic variable. It makes no sense to do this for Mach exceptions.

3) The Mach exception handler no longer uses a loop. This is a bit conservative - in theory, an exception handler from a debugger could fix up the thread state, after which we would no longer annotate crashes, and wouldn't be able to receive the quit message on shutdown.

But I think that's enough of a niche case not to worry about it; no normal exception handler that we forward to should be letting the process continue to run, and a shutdown hang is probably the least of your worries if you're already fixing up thread states manually.

I didn't make this Nightly only yet; let me know if we still want to do that, and I can do it in another patch.
Attachment #8794734 - Attachment is obsolete: true
Attachment #8794734 - Flags: review?(luke)
Attachment #8794734 - Flags: review?(jdemooij)
Attachment #8795535 - Flags: review?(luke)
Attachment #8795535 - Flags: review?(jdemooij)
I have no idea what happened there, but that clearly wasn't the patch I tried to upload!
Attachment #8795535 - Attachment is obsolete: true
Attachment #8795535 - Flags: review?(luke)
Attachment #8795535 - Flags: review?(jdemooij)
Attachment #8795669 - Flags: review?(luke)
Attachment #8795669 - Flags: review?(jdemooij)
Comment on attachment 8795532 [details] [diff] [review]
Part 1 v2: Add a mechanism to allow users to opt out of protection for small buffers.

Review of attachment 8795532 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks. Not sure how useful Kraken is for measuring compilation overhead (Sunspider might be a better candidate?), but the 32 KB threshold should help a lot.
Attachment #8795532 - Flags: review?(jdemooij) → review+
Comment on attachment 8795669 [details] [diff] [review]
Part 2 v2.1: Add an exception handler to annotate memory protection crashes in regions of interest.

Review of attachment 8795669 [details] [diff] [review]:
-----------------------------------------------------------------

The Mach code is definitely scary. Tracking down bug 1124397 is important though, so IMO this is worth trying. Please make sure unrelated segfaults are still reported the same way on all platforms.

::: js/src/ds/MemoryProtectionExceptionHandler.cpp
@@ +61,5 @@
> +    ProtectedRegionTree() : alloc(4096), tree(&alloc) {}
> +
> +    ~ProtectedRegionTree() {
> +        MOZ_ASSERT(tree.empty());
> +        alloc.freeAll();

Nit: can remove this, the LifoAlloc destructor will do this too.

@@ +121,5 @@
> +static MOZ_COLD MOZ_ALWAYS_INLINE void
> +ReportCrashIfDebug(const char* aStr)
> +    MOZ_PRETEND_NORETURN_FOR_STATIC_ANALYSIS
> +{
> +#ifdef DEBUG

Since this code is DEBUG-only, is it really worth it? Personally I wouldn't bother with the DEBUG case.

@@ +612,5 @@
> +        // Forward the generated message to the old port. The local and remote
> +        // port fields *and their rights* are swapped on arrival, so we need to
> +        // swap them back first.
> +        forward.header.msgh_bits = (request.header.msgh_bits & ~MACH_MSGH_BITS_PORTS_MASK) |
> +            MACH_MSGH_BITS(MACH_MSGH_BITS_LOCAL(request.header.msgh_bits), 

Nit: trailing whitespace.
Attachment #8795669 - Flags: review?(jdemooij) → review+
(In reply to Jan de Mooij [:jandem] from comment #20)
> > +static MOZ_COLD MOZ_ALWAYS_INLINE void
> > +ReportCrashIfDebug(const char* aStr)
> > +    MOZ_PRETEND_NORETURN_FOR_STATIC_ANALYSIS
> > +{
> > +#ifdef DEBUG
> 
> Since this code is DEBUG-only, is it really worth it? Personally I wouldn't
> bother with the DEBUG case.

We discussed this on IRC, but for posterity: this was based on a misunderstanding. This code and MOZ_CRASH_ANNOTATE() don't actually induce a crash, they just print an error and set the crash reporter annotation respectively. MOZ_CRASH() does this in debug builds:

    MOZ_ReportCrash("" __VA_ARGS__, __FILE__, __LINE__);
    MOZ_CRASH_ANNOTATE("MOZ_CRASH(" __VA_ARGS__ ")");
    MOZ_REALLY_CRASH();

My patch does something a bit more async-signal safe (I haven't decided whether this is really a concern here, but I figured it's better to be safe) and doesn't actually crash (since that would shift the crashing thread to the handler thread, at least on OSX).
Attachment #8795669 - Flags: review?(luke) → review+
Thanks for the reviews! This addresses the remaining comments and simplifies the Mach exception handler a tiny bit more (I realized I could just use the names of the exception IDs directly). Carrying forward r+.

One final try build to be safe, looking green: https://treeherder.mozilla.org/#/jobs?repo=try&revision=5e5dfeb73e9e
Attachment #8795669 - Attachment is obsolete: true
Attachment #8795983 - Flags: review+
Keywords: checkin-needed
As a heads-up, this is expected to cause at least a small (<1%) Kraken regression. I'm hopeful it won't meaningfully regress anything else, but we might have to take some regressions temporarily to diagnose this crash. If this does show up strongly somewhere on AWFY, we can try raising the lower limit on what size buffers get protected to 64k (I think raising it to 128k risks missing the corruption though, going by some of the offsets we've seen).
Pushed by cbook@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/eab215763463
Part 1: Add a mechanism to allow users to opt out of protection for small buffers. r=jandem
https://hg.mozilla.org/integration/mozilla-inbound/rev/a70dda2a48b4
Part 2: Add an exception handler to annotate memory protection crashes in regions of interest. r=jandem r=luke
Keywords: checkin-needed
Depends on: 1306379
On AWFY, looks like this caused roughly a 1.1% regression on Kraken. The visibly affected tests are kraken-beat-detection (4% worse), kraken-crypto-ccm (7.5% worse) and possibly kraken-crypto-sha256-iterative (2.6% worse).

On the assorted tests page, misc-bugs-636096-model2d sees a small bump (~3%), misc-bugs-652377-jslint-on-jslint sees a more noticeable 5.3% increase, and misc-react-shell has a ~3.5% increase. Oddly, misc-bugs-654410-nezulator actually got 2.4% *faster*.
Ah, I wasn't sure what the backout policy was on that so I opened a new bug for the fix. Let me just roll it into part 2 instead (this fixes the non-unified build for me locally). Carrying forward r+.
Attachment #8795983 - Attachment is obsolete: true
Flags: needinfo?(emanuel.hoogeveen)
Attachment #8796313 - Flags: review+
Keywords: checkin-needed
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/b19e10ea4577
Part 1: Add a mechanism to allow users to opt out of protection for small buffers. r=jandem
https://hg.mozilla.org/integration/mozilla-inbound/rev/8468a31dcb94
Part 2: Add an exception handler to annotate memory protection crashes in regions of interest. r=jandem, r=luke
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/b19e10ea4577
https://hg.mozilla.org/mozilla-central/rev/8468a31dcb94
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
Depends on: 1306719
Depends on: 1306828
Blocks: 1306972
Depends on: 1309230
Depends on: 1308125
Depends on: 1328222
You need to log in before you can comment on or make changes to this bug.