Closed Bug 1271165 Opened 4 years ago Closed 4 years ago

Extend mozjemalloc with a way to protect allocations against realloc-after-frees

Categories

(Core :: Memory Allocator, defect)


Tracking


RESOLVED FIXED
Tracking Status
firefox49 --- affected

People

(Reporter: ehoogeveen, Assigned: ehoogeveen)

References

(Blocks 1 open bug)

Details

Attachments

(7 files, 16 obsolete files)

4.20 KB, patch
ehoogeveen
: checkin+
Details | Diff | Splinter Review
12.37 KB, patch
ehoogeveen
: checkin+
Details | Diff | Splinter Review
11.89 KB, patch
ehoogeveen
: checkin+
Details | Diff | Splinter Review
8.49 KB, patch
ehoogeveen
: review+
ehoogeveen
: checkin+
Details | Diff | Splinter Review
1004 bytes, patch
ehoogeveen
: checkin+
Details | Diff | Splinter Review
922 bytes, patch
ehoogeveen
: review+
ehoogeveen
: checkin+
Details | Diff | Splinter Review
7.71 KB, patch
Details | Diff | Splinter Review
In bug 1124397, we're seeing mysterious regions of poisoned memory showing up in the middle of AssemblerBuffers, leading to crashes. Our best guess as to the cause is that another thread is freeing a region of memory, then reallocating it down to a smaller size after jemalloc has already handed it over to us - but we don't have any real way to track down the offending party.

In this bug I'd like to propose extending mozjemalloc with several new functions (malloc_protected, calloc_protected, realloc_protected and free_protected) that associate an allocation with a unique ID. Any subsequent changes to the protected region then require the correct ID be passed in; otherwise the program will crash. Assuming our theory is correct, this should allow us to pinpoint exactly what code is messing with our memory.
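To make the proposal concrete, here is a minimal single-threaded model of the API. This is illustrative only: the real implementation lives inside mozjemalloc behind its locks, IDs come from its PRNG, and `checked_free` is an invented name modeling the assertion that plain free/realloc would perform; a std::map stands in for the allocator's bookkeeping.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <map>

static std::map<void*, uint32_t> gProtected; // region -> owning ID
static uint32_t gNextId = 1;                 // stand-in for mozjemalloc's PRNG

void* malloc_protected(size_t aSize, uint32_t* aId) {
  void* p = std::malloc(aSize);
  if (p) {
    *aId = gNextId++;
    gProtected[p] = *aId; // associate the allocation with a unique ID
  }
  return p;
}

void free_protected(void* aPtr, uint32_t aId) {
  auto it = gProtected.find(aPtr);
  assert(it != gProtected.end() && it->second == aId); // wrong ID: crash
  gProtected.erase(it);
  std::free(aPtr);
}

void checked_free(void* aPtr) {
  // Models ordinary free(): it must refuse to touch a protected region.
  assert(gProtected.find(aPtr) == gProtected.end());
  std::free(aPtr);
}
```

A use-after-free style bug that calls plain free() or realloc() on a protected region would then trip the assertion at the moment of the offending call, not later.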

I realize that we're still planning to move to jemalloc4 at some point, so modifying mozjemalloc may not be the most palatable option. Unfortunately I don't think targeting jemalloc4 would get us results quickly enough right now, though porting the code over shouldn't be difficult if it comes to that.

Ideally I'd like these functions to be a permanent addition to diagnose problems like these. Performance may be an issue however, so I've added a define that should let us toggle it easily.
These trailing blanks get in my way every time I try to patch mozjemalloc, since I like to strip trailing blanks at file level.
Attachment #8750134 - Flags: review?(mh+mozilla)
The actual implementation in mozjemalloc is fairly simple.

All reallocs and frees assert that the region they're touching isn't protected. Protected regions are tracked by adding them as nodes in a red-black tree, and IDs are generated using an existing PRNG that happened to be part of mozjemalloc. Protected frees remove the protection before doing the actual free, and protected reallocs briefly remove the protection, re-establishing it after the realloc is complete.

It would have been slightly nicer to keep regions protected all throughout realloc, but also more invasive: the assertion would have to take the ID into account, and the ID would need to be passed much more deeply into the arena alloc functions until they take their locks.
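The unprotect/realloc/reprotect dance described above looks roughly like this sketch, where a std::set stands in for the red-black tree of protected nodes and the real code does all of this under mozjemalloc's locking:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <set>

static std::set<void*> gProtectedRegions; // stand-in for the red-black tree

void* realloc_protected(void* aPtr, size_t aSize) {
  // Briefly remove the protection; an unprotected pointer here means the
  // caller never owned this region, so crash.
  size_t removed = gProtectedRegions.erase(aPtr);
  assert(removed == 1);
  (void)removed;
  void* p = std::realloc(aPtr, aSize);
  if (p)
    gProtectedRegions.insert(p); // re-establish on the (possibly new) address
  return p;
}
```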
Attachment #8750138 - Flags: review?(mh+mozilla)
I wanted to make these functions available everywhere, even for standalone SpiderMonkey builds when jemalloc is not used, so this includes a couple of dummy implementations.

As a personal note, getting this to compile and link in all configurations took a lot of time and effort. OSX in particular stumped me for many hours (and many try runs), though looking back now it seems simple.
Attachment #8750140 - Flags: review?(mh+mozilla)
Nicholas, this adds support for several new functions to DMD. They're basically identical to malloc, calloc, realloc and free as far as DMD is concerned, but I didn't see a way to avoid duplicating code (at least without macros and creating a new file). I also qualified a few function calls - I'm not sure why this was needed, but somehow my changes to DMD's InfallibleAllocPolicy made the calls ambiguous.
Attachment #8750141 - Flags: review?(n.nethercote)
This basically just duplicates the code from replace_malloc and friends; like with DMD, I didn't see a way to avoid that without some macro and include cleverness, and I didn't really want to go down that road.
Attachment #8750142 - Flags: review?(mh+mozilla)
Jan, this adds the new functions to SM and uses them in a new alloc policy that works as a drop-in replacement for SystemAllocPolicy. It uses InlineMap to track the allocation IDs, using 2 inline elements, which should be enough for Vector and the like. That meant I couldn't place it in jsalloc.h (because InlineMap uses SystemAllocPolicy), so I put it in JitAllocPolicy.h for now.
Attachment #8750145 - Flags: review?(jdemooij)
This just flips the switch in mozjemalloc, as a separate patch in case we need a quick backout. Of course that still leaves the dummy implementations, but those should be essentially free in terms of overhead.

I gave the patch stack a full try run here:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=901e12a71c1f

I also did some performance comparisons:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=3c3f878ba8e9&newProject=try&newRevision=0e27e29a8aa2&framework=1&showOnlyImportant=1&showOnlyConfident=1
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=0e27e29a8aa2&newProject=try&newRevision=901e12a71c1f&framework=1&showOnlyImportant=1&showOnlyConfident=1

The first compares a baseline to everything up to and including part 4. I'm not sure why a11yr and *only* a11yr shows a regression (and only on Windows for that matter), but it appears to be consistent. My guess is that this comes from the change to AssemblerBuffer in part 4 and using a hashmap to track allocation IDs, since everything else should be pretty much free.

The second looks at the effect of actually enabling the new functions in mozjemalloc. The differences here are pretty minimal, thankfully. If we look at the effect of the whole stack compared to baseline:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=3c3f878ba8e9&newProject=try&newRevision=901e12a71c1f&framework=1&showOnlyConfident=1

... the regressions add up to something somewhat more significant. I hope they won't be bad enough to prevent keeping the functionality enabled, but we could also enable it only on Nightly, or just turn it on for a little while to gather data if it comes to that.
Attachment #8750149 - Flags: review?(mh+mozilla)
Comment on attachment 8750141 [details] [diff] [review]
Part 3a: Add support for the new functions to DMD.

Review of attachment 8750141 [details] [diff] [review]:
-----------------------------------------------------------------

It would be good to have at least one call to each of the new functions in SmokeDMD.cpp.

::: memory/replace/dmd/DMD.cpp
@@ +220,5 @@
>      strcpy(s, aStr);
>      return s;
>    }
>  
> +  static void* malloc_protected_(size_t aSize, uint32_t* id)

s/id/aId/ through this patch, please.
Attachment #8750141 - Flags: review?(n.nethercote) → review+
Wow, that was fast, thank you :)

Taking a step back though, now that I've got all the kinks worked out, I wonder if it would be better to simply add two functions - malloc_protect_memory and malloc_unprotect_memory, say - and use those instead. When I started working on this, it seemed like the changes to mozjemalloc would be more invasive, but now it would make no difference. That would also remove (most of) the changes to DMD, as any additional malloc functions would be defined at a higher level.
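The simpler two-function shape would look something like this sketch; the names follow the comment above, while the bodies are invented stand-ins for the mozjemalloc bookkeeping:

```cpp
#include <cassert>
#include <cstdint>
#include <map>

static std::map<void*, uint32_t> gProtected;
static uint32_t gNextId = 1;

void malloc_protect_memory(void* aPtr, uint32_t* aId) {
  if (!aPtr) return; // tolerate failed allocations
  *aId = gNextId++;
  gProtected[aPtr] = *aId;
}

void malloc_unprotect_memory(void* aPtr, uint32_t aId) {
  if (!aPtr) return;
  auto it = gProtected.find(aPtr);
  assert(it != gProtected.end() && it->second == aId); // wrong ID: crash
  gProtected.erase(it);
}
```

Any existing allocation can then be wrapped: protect right after malloc, unprotect right before free or realloc, with no new entry points needed in DMD or the replace-malloc layers.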
> Taking a step back though, now that I've got all the kinks worked out, I
> wonder if it would be better to simply add two functions -
> malloc_protect_memory and malloc_unprotect_memory, say - and use those
> instead.

Sounds reasonable.
Attachment #8750138 - Attachment is obsolete: true
Attachment #8750138 - Flags: review?(mh+mozilla)
Attachment #8750140 - Attachment is obsolete: true
Attachment #8750140 - Flags: review?(mh+mozilla)
Attachment #8750141 - Attachment is obsolete: true
Attachment #8750142 - Attachment is obsolete: true
Attachment #8750142 - Flags: review?(mh+mozilla)
Attachment #8750145 - Attachment is obsolete: true
Attachment #8750145 - Flags: review?(jdemooij)
Attachment #8750149 - Attachment is obsolete: true
Attachment #8750149 - Flags: review?(mh+mozilla)
Yes, this is much nicer. I've already made the changes locally, so I figured I'd obsolete the old patches; just waiting on Try results to see if it still builds everywhere.
The actual implementation in mozjemalloc is fairly simple.

All reallocs and frees assert that the region they're touching isn't protected. Protected regions are tracked by adding them as nodes in a red-black tree, and IDs are generated using an existing PRNG that happened to be part of mozjemalloc.
Attachment #8750220 - Flags: review?(mh+mozilla)
I rolled the original part 3b into this, and the DMD changes are no longer needed, so this should be all the glue code taken care of (aside from the declarations in SM).
Attachment #8750224 - Flags: review?(mh+mozilla)
(this used to be part 4; going to quote myself below for convenience)

Jan, this adds the new functions to SM and uses them in a new alloc policy that works as a drop-in replacement for SystemAllocPolicy. It uses InlineMap to track the allocation IDs, using 2 inline elements, which should be enough for Vector and the like. That meant I couldn't place it in jsalloc.h (because InlineMap uses SystemAllocPolicy), so I put it in JitAllocPolicy.h for now.
Attachment #8750226 - Flags: review?(jdemooij)
Attachment #8750227 - Flags: review?(mh+mozilla)
This just flips the switch, in case we need to back out for performance reasons.
Attachment #8750228 - Flags: review?(jdemooij)
Comment on attachment 8750226 [details] [diff] [review]
Part 3 v2: Create and hook up an AllocPolicy for JSAPI.

Review of attachment 8750226 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/public/Utility.h
@@ +254,5 @@
>  {
>      free(p);
>  }
>  
> +static inline void js_malloc_protect(void* ptr, uint32_t* id)

Please add a brief comment above these functions to explain what 'protect' means. (Readers may think it refers to mprotect.)

@@ +493,5 @@
> +{
> +    size_t bytes;
> +    if (MOZ_UNLIKELY(!js::CalculateAllocSize<T>(numElems, &bytes)))
> +        return nullptr;
> +    return static_cast<T*>(js_malloc_protected(bytes, id));

Wouldn't it be simpler to first call js_pod_malloc<T>(numElements), and then if that returns non-nullptr, call js_malloc_protect on the result? That way we don't have to duplicate the CalculateAllocSize call, the most scary part here :)

Same for the functions below.
Attachment #8750226 - Flags: review?(jdemooij) → review+
Comment on attachment 8750228 [details] [diff] [review]
Part 5: Make AssemblerBuffers use the new alloc policy.

Review of attachment 8750228 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks a lot for doing this!
Attachment #8750228 - Flags: review?(jdemooij) → review+
Thanks for the quick reviews! Carrying forward r+.

(In reply to Jan de Mooij [:jandem] from comment #17)
> Please add a brief comment above these functions to explain what 'protect'
> means. (Readers may think it refers to mprotect.)

Good point, done. Hmm, part 1 could probably use a few more comments as well.

> @@ +493,5 @@
> > +{
> > +    size_t bytes;
> > +    if (MOZ_UNLIKELY(!js::CalculateAllocSize<T>(numElems, &bytes)))
> > +        return nullptr;
> > +    return static_cast<T*>(js_malloc_protected(bytes, id));
> 
> Wouldn't it be simpler to first call js_pod_malloc<T>(numElements), and then
> if that returns non-nullptr, call js_malloc_protect on the result? That way
> we don't have to duplicate the CalculateAllocSize call, the most scary part
> here :)

It's a bit of a trade-off between the CalculateAllocSize call and duplicating the protect-unprotect dance, but on balance I agree that the latter is nicer. Done! malloc_protect and malloc_unprotect just do nothing when passed nullptr by the way, so there's no need to check first.
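The agreed-on shape, sketched below with simplified stand-in bodies (the real code uses js::CalculateAllocSize and the mozjemalloc-backed protect function): do the overflow-checked pod allocation first, then protect the result, relying on protect being a no-op for nullptr.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <limits>
#include <map>

static std::map<void*, uint32_t> gProtected;
static uint32_t gNextId = 1;

static inline void js_malloc_protect(void* aPtr, uint32_t* aId) {
  if (!aPtr) return; // nullptr is a no-op, so callers need no extra check
  *aId = gNextId++;
  gProtected[aPtr] = *aId;
}

template <typename T>
static T* js_pod_malloc(size_t aNumElems) {
  // Stand-in for CalculateAllocSize: reject aNumElems * sizeof(T) overflow.
  if (aNumElems > std::numeric_limits<size_t>::max() / sizeof(T))
    return nullptr;
  return static_cast<T*>(std::malloc(aNumElems * sizeof(T)));
}

template <typename T>
static T* js_pod_malloc_protected(size_t aNumElems, uint32_t* aId) {
  T* p = js_pod_malloc<T>(aNumElems); // single overflow check lives here
  js_malloc_protect(p, aId);          // no-op if the allocation failed
  return p;
}
```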
Attachment #8750226 - Attachment is obsolete: true
Attachment #8750264 - Flags: review+
Here's a try run using the new functions: https://treeherder.mozilla.org/#/jobs?repo=try&revision=b6eb00ddbc7d

Performance comparison using that run:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=3c3f878ba8e9&newProject=try&newRevision=b6eb00ddbc7d&framework=1&showOnlyConfident=1

Oddly, jobs don't seem to be triggering or completing right now, but the try run was mostly done anyway.
I'm not happy with those performance regressions, small as they are, so I'm trying out an optimization that shouldn't weaken the protection too much: skipping checks for tiny (<= 2KiB) allocations. There's no fundamental reason why small allocations couldn't exhibit double-frees or realloc-after-frees, but the lengths of the poisoned regions we're seeing in bug 1124397 are all integer multiples of the page size. In addition, these tiny allocations take mostly uncontested per-arena locks, so adding a contested global lock is likely a larger regression for them than for other allocations.

Additionally, we can skip protecting small allocations in ProtectedSystemAllocPolicy. I've made both these changes locally, and on Try they do seem to eliminate the performance regressions. However, I like the 2nd change less, since it complicates ProtectedSystemAllocPolicy (and there's no lock to explain the overhead), so I'm testing just the first change now to see if it's enough on its own.
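A hypothetical version of the size cutoff described above (kSmallAllocLimit and ShouldProtect are invented names for the sketch): only allocations large enough for jemalloc to round up to a page multiple get the global-lock bookkeeping.

```cpp
#include <cstddef>

constexpr size_t kSmallAllocLimit = 2 * 1024; // <= 2KiB: per-arena fast path

constexpr bool ShouldProtect(size_t aSize) {
  // The poisoned regions observed in bug 1124397 are whole pages, and only
  // requests above 2KiB are rounded up to a page multiple by jemalloc.
  return aSize > kSmallAllocLimit;
}
```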
Carrying forward r+. Just a tiny simplification to the realloc functions, and added MOZ_MUST_USE to checkSimulatedOOM() in emulation of bug 1267551.
Attachment #8750264 - Attachment is obsolete: true
Attachment #8750949 - Flags: review+
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #0)
> In bug 1124397, we're seeing mysterious regions of poisoned memory showing
> up in the middle of AssemblyBuffers, leading to crashes. Our best guess as
> to the cause is that another thread is freeing a region of memory, then
> reallocating it down to a smaller size after jemalloc has already handed it
> over to us - but we don't have any real way to track down the offending
> party.

I don't get your scenario. Can you be more specific (with e.g. a list of function calls you'd think are happening)
I also don't get why this needs to be at the mozjemalloc level, nor why it needs to use a unique ID.
Flags: needinfo?(emanuel.hoogeveen)
(In reply to Mike Hommey [:glandium] from comment #24)
> I don't get your scenario. Can you be more specific (with e.g. a list of
> function calls you'd think are happening)

Okay, consider the following:
1) Thread 1 has a pointer to a region of memory.
2) Thread 1 decides to free this region, but forgets to null out its pointer.
3.1) Thread 2 (our thread) reallocs its buffer, obtaining a pointer to a 16KiB region starting at the same address.
3.2) Thread 1 calls realloc, requesting a 12KiB buffer, but reusing the pointer it previously freed.
4) jemalloc shrinks the 16KiB region to a 12KiB one *in place*, poisoning the last 4KiB.
5) Thread 2 continues to use its buffer as though it's still 16KiB in size and eventually obtains a larger region, copying over the poisoned memory.

I'm not sure about the last step - I guess a realloc call would simply copy 12KiB of the buffer, so it doesn't matter what the last 4KiB of the original buffer contains. Maybe a better step 5:

5) Thread 2 continues to use the buffer as though it is 16KiB and eventually reallocs into a larger region, but only 12KiB are copied over (and the next page in the new region happens to be poisoned).

Either way, *something* is going on outside of thread 2's control that results in a buffer with jemalloc poisoned memory in it. Barring bugs in jemalloc itself, that something pretty much has to be a realloc from an unrelated location.
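The interleaving above can be simulated in a single thread with a toy allocator that always hands out the same address (as jemalloc can after a free) and poisons shrunk tails with 0xe5 the way mozjemalloc does. ToyHeap and its methods are invented purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

struct ToyHeap {
  std::vector<uint8_t> mem = std::vector<uint8_t>(64 * 1024);
  size_t liveSize = 0; // the size the allocator believes the block has

  uint8_t* alloc(size_t n) { liveSize = n; return mem.data(); }
  void release(uint8_t*) { liveSize = 0; } // free: block becomes reusable
  uint8_t* shrinkRealloc(uint8_t*, size_t n) {
    if (n < liveSize) // in-place shrink: poison the tail like jemalloc
      std::memset(mem.data() + n, 0xe5, liveSize - n);
    liveSize = n;
    return mem.data();
  }
};
```

Running the numbered steps against this model leaves thread 2's live buffer with a poisoned tail even though thread 2 did nothing wrong.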

> I also don't get why this needs to be at the mozjemalloc level, nor why it
> needs to use a unique ID.

The reason this can only be done at the mozjemalloc (or jemalloc4) level is that whatever is causing the offending realloc isn't going to be aware of whatever contract we enter into. The allocator *itself* must check on every realloc that it's allowed to modify the region it's touching. We're trying to catch the spooky action at a distance that must be triggering this, and that means we have to tell the allocator to watch for changes to a particular address. Also, while we could theoretically mprotect a part of our buffer that we're already done with, that won't protect against a realloc that happens to return the same address.

Using a unique ID is less necessary. I was actually using the ID in an earlier version that worked differently (didn't unprotect/reprotect around calls, but instead passed the ID into jemalloc functions for the assertion), but now it just serves to protect against trying to unprotect a region you don't actually own.

I think that's a nice bit of extra safety, especially if we start using this more (since in theory, something causing a problem like the current one could use protected memory as well, and then we'd be stuck wondering what unprotected, changed and reprotected our memory behind our backs).
Flags: needinfo?(emanuel.hoogeveen)
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #25)
> Either way, *something* is going on outside of thread 2's control that
> results in a buffer with jemalloc poisoned memory in it. Barring bugs in
> jemalloc itself, that something pretty much has to be a realloc from an
> unrelated location.

Oh, and I should mention - this could all be happening on the main thread as well (though we might be building the AssemblerBuffer on a helper thread, I'm not sure). This is just the simplest scenario we managed to come up with, but more complicated scenarios (say thread 1 does a second free, and then an unrelated place gets the same address through malloc, all before thread 2 touches things again) are presumably possible.
There doesn't need to be a realloc involved for similar problems to happen. That's essentially use-after-free business. For example, replace step 3.1 with "thread 2 mallocs a buffer, obtaining a pointer to a 16KiB region starting at the same address" and 3.2 with "thread 1 calls free"; that would make step 4 become "jemalloc poisons the whole 16KiB", and would have the same result as step 5.1.

Not having looked at the patches, would that be caught?

I'm also not sure whether you intend this to debug your specific problem, or this to be enabled on all builds forever, or only debug builds or ... ? But I'm also not sure if the allocator is the right place to do this. Static analysis and/or ASAN would seem better candidates, to me. In fact, shouldn't ASAN catch those cases? Considering how C/C++ is foot-gunning, if ASAN doesn't catch them, they surely look like something ASAN should be improved to catch.
(In reply to Mike Hommey [:glandium] from comment #27)
> There doesn't need to be a realloc involved for similar problems to happen.
> That's essentially use-after-free business. For example, replace step 3.1
> with "thread 2 mallocs a buffer, obtaining a pointer to a 16KiB region
> starting at the same address", 3.2 with "thread 1 calls free" ; that would
> make step 4 become "jemalloc poisons the whole 16KiB". and would have the
> same result as step 5.1.

Yeah, that's a simpler scenario. In this case we're seeing poisoned regions that *don't* start at the beginning of the region (we know because Jan added some runtime checks to find the size and offset of the poisoned region), so the scenarios become more complicated.

> Not having looked at the patches, would that be caught?

Yes, the free would crash (assuming the memory is protected, which isn't quite a drop-in replacement), so we could see where things went off the rails instead of crashing sometime later.

> I'm also not sure whether you intend this to debug your specific problem, or
> this to be enabled on all builds forever, or only debug builds or ... ? But

Ideally I'd like this to be enabled on all builds, if it doesn't cause performance regressions. This is a crash we've only seen in the wild, that the JITs seem to be particularly sensitive to for some reason (it's the #1 topcrash on OSX), so just enabling it in debug builds wouldn't cut it (I haven't seen any failures from this on Try). If it does cause performance regressions, we could disable it again after gathering the data we need.

> I'm also not sure if the allocator is the right place to do this. Static
> analysis and/or ASAN would seem better candidates, to me. In fact, shouldn't
> ASAN catch those cases? Considering how C/C++ is foot-gunning, if ASAN
> doesn't catch them, they surely look like something ASAN should be improved
> to catch.

This mostly seems to happen on OSX (with some crashes on Windows), so we suspect that it's something to do with hardware acceleration that we aren't testing in automation. If static analysis or ASAN could catch issues like this, I agree that would be preferable, but so far we don't know what even triggers this issue. I'm not sure if anyone has tried running ASAN though, so it's possible that it would show up in our testing suite.
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #28)
> I'm not sure if anyone has tried running ASAN though,
> so it's possible that it would show up in our testing suite.

I asked Jan on IRC, and he said he's tried Mac ASAN builds, but no luck reproducing thus far. Short of getting luckier than we have been so far, I think this is the least invasive way forward.
Has anyone actually reproduced with a released build? (apart for the people whose crash reports we get)
Can we reach out to (some of) them to have them try ASAN builds?
(In reply to Mike Hommey [:glandium] from comment #30)
> Has anyone actually reproduced with a released build? (apart for the people
> whose crash reports we get)

Not that I know of.

> Can we reach out to (some of) them to have them try ASAN builds?

I'm not sure. Jan, do you know if we have e-mail addresses for any of these crash reports?
Flags: needinfo?(jdemooij)
(In reply to Mike Hommey [:glandium] from comment #30)
> Has anyone actually reproduced with a released build? (apart for the people
> whose crash reports we get)
> Can we reach out to (some of) them to have them try ASAN builds?

I've tried for days and I've been able to reproduce a crash with this signature once. No luck after that. I've also tried a Mac ASan build.

After spending a lot of time on these crash reports, I'm convinced there are no reliable STR and instrumentation like this is the most promising option.
Flags: needinfo?(jdemooij)
Rebased slightly.
Attachment #8750949 - Attachment is obsolete: true
Attachment #8751882 - Flags: review+
Mike, have the responses here convinced you that we should take these patches (pending review, of course)?
Flags: needinfo?(mh+mozilla)
There are things to be said about the implementation (I actually have a draft review in splinter) but let's put that aside for now; things have not changed much, and I'd rather not modify mozjemalloc for this purpose, even if it's "temporary".

Now, looking at the hypothetical scenario you're describing, and looking at bug 1124397 and to several recent crash reports with a signature related to the code added in bug 1124397, here is what I have to say:

One piece of information that was missing is that the crash seems to be a one-time occurrence; that is, it doesn't seem to hit a small set of users repeatedly, so it's not possible to contact one of them and have them run an ASan build or similar. There also seems to be no correlation between the crashes, and they happen on all platforms, which rules out compiler bugs, since that would mean a similar bug in *three* different compilers. (My own search on crash-stats doesn't agree with the note in bug 1124397 about it affecting mostly Macs; it seems https://hg.mozilla.org/mozilla-central/annotate/afd82f887093/js/src/jit/x86-shared/BaseAssembler-x86-shared.h#l3429 is being hit on all platforms, with more than 60% on Windows.)

What seems to characterize the issue is that something is shrinking an AssemblerBuffer vector out from under it, with the specific property that the new size, while smaller than what the AssemblerBuffer is expecting, is exactly page-aligned. This makes mozjemalloc poison the last pages of the AssemblerBuffer with 0xe5.

This actually is very interesting information, because it narrows down what kind of allocator calls are happening. Specifically, what is happening is a realloc, of something that was already a large allocation, that is shrunk *in-place* (important fact) to a smaller size that is exactly page-aligned.

On the other hand, when looking to a few crashes, what seems to happen is that we crash well after the above happened, and even worse, we do so after the buffer has been reallocated *another time*: In all the crashes I've looked at, endOffset (which indicates where the 0xe5's end) is well below the size() of the AssemblerBuffer vector. I even stumbled upon a crash where the size() is above 1MB, making it a huge allocation (larger than large), which makes it very likely to have been moved, too. (FWIW, that's the crash: https://crash-stats.mozilla.com/report/index/12d59878-3ef0-4a55-a6d1-d943c2160806 )

Now, that particular crash is actually interesting, because it shows:
- an AssemblerBuffer vector with a size > 1MB (0x1259d7)
- startOffset at 0x68fff
- endOffset at 0x6c000
- from.offset() at 0x6bf80

The interesting question to answer here is: why do we have an AssemblerBuffer that seemingly jumped from a size of 0x6c000 (the presumed size it had before it was shrunk and poisoned) to 0x1259d7 (the size it has now)? That seems like a very large jump.

Another interesting data point: I implemented a replace-malloc library that hooks realloc() and prints out sizes and callers for all reallocs meeting the criterion mentioned above, and went to youtube, playing videos. So far, the only caller that met them was /builds/slave/m-cen-l64-ntly-000000000000000/build/src/js/src/gc/Marking.cpp:1756 (in the last nightly I downloaded, which was based on 6cf0089510fad8deb866136f5b92bbced9498447), and there aren't many such calls.

Now, having written all that, it just struck me that those e5's don't actually mean the buffer was ever *actually* shrunk. Which means it could just as much be a red herring.

Let's step back a little:

These are all the places where allocations may be poisoned with 0xe5:
  https://dxr.mozilla.org/mozilla-central/source/memory/mozjemalloc/jemalloc.c#4609
  https://dxr.mozilla.org/mozilla-central/source/memory/mozjemalloc/jemalloc.c#4702
  https://dxr.mozilla.org/mozilla-central/source/memory/mozjemalloc/jemalloc.c#4841
  https://dxr.mozilla.org/mozilla-central/source/memory/mozjemalloc/jemalloc.c#4858
  https://dxr.mozilla.org/mozilla-central/source/memory/mozjemalloc/jemalloc.c#4928
  https://dxr.mozilla.org/mozilla-central/source/memory/mozjemalloc/jemalloc.c#5177

The first two are for free() and fill the whole buffer with 0xe5.
All the other ones are for realloc(), and, in essence, fill between size and oldsize, if oldsize is larger than size or PAGE_CEILING(size) (i.e. size, aligned up to the next page).

But here is the important thing to note: what is oldsize? Old size, basically, is the result of malloc_usable_size(ptr). That is, the mozjemalloc size class for the previous size given to malloc or realloc that gave that pointer, not the actual size that was passed back then.

Now, what this means is that:
  void *ptr = malloc(5 * 1024);
  ptr = realloc(ptr, 6 * 1024);
will actually fill from 6k into the buffer to 8k (its end) with 0xe5.

Does it actually match what happens in our case? No. Because we've always had a full number of pages filled, and at most, we can only have slightly less than a page poisoned that way.
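The arithmetic behind the malloc(5 * 1024)/realloc(6 * 1024) example above can be checked directly: for these size classes, malloc_usable_size is the requested size rounded up to a whole page, and the realloc path poisons from the new size up to that rounded old size. (kPageSize assumes 4KiB pages.)

```cpp
#include <cassert>
#include <cstddef>

constexpr size_t kPageSize = 4096; // assuming 4KiB pages

// PAGE_CEILING from jemalloc.c: round up to the next page boundary.
constexpr size_t PageCeiling(size_t aSize) {
  return (aSize + kPageSize - 1) & ~(kPageSize - 1);
}
```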

But. Large allocations are adjacent, and can be mixed. That is, you can have a 8K allocation following two 4K allocations. And vice versa. And if there's enough space, mozjemalloc will avoid moving data.

In practice, depending on memory usage, when you do:
  void *ptr = malloc(4 * 1024);
  void *ptr2 = malloc(4 * 1024);
  free(ptr2);
  ptr = realloc(ptr, 8 * 1024);

the result can be that ptr and ptr2 are adjacent at the beginning, ptr2 is freed and *poisoned*, then ptr is reallocated in place and has the poisoned bytes in its second half. You can imagine plenty of scenarios where poison bytes are inherited in a similar way.

All that to say, there are very normal cases that can lead the poison bytes to be in large buffers.
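The adjacency scenario above can be modeled the same way: two page-sized blocks sit back to back, freeing the second poisons it, and an in-place grow of the first then inherits those poison bytes with no misbehaving caller involved. All names here are invented for the sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr size_t kPage = 4096;

// Returns the first byte of the grown block's second half.
inline uint8_t InheritedByte() {
  std::vector<uint8_t> heap(2 * kPage, 0);
  uint8_t* ptr = heap.data();          // malloc(4 * 1024)
  uint8_t* ptr2 = heap.data() + kPage; // malloc(4 * 1024), adjacent
  std::memset(ptr2, 0xe5, kPage);      // free(ptr2): jemalloc poisons it
  // realloc(ptr, 8 * 1024) extends in place; the old ptr2 bytes are neither
  // copied over nor cleared, so they become the tail of ptr.
  return ptr[kPage];
}
```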

All in all, not only am I not thrilled about the patch(es), but I'm also very dubious it will find anything interesting, especially considering that the hypothetical case that is meant to be caught by the patch would be just as well caught by ASan, and I have a hard time believing it wouldn't have been caught by now.

So before going any further on this bug, I'd rather see at least the following questions answered in the context of js::jit::X86Encoding::BaseAssembler::nextJump:
- what is malloc_usable_size(code)?
- what address does code point to?
- how much data is supposed to have been written to that buffer?

if possible:
- what was the last address of m_formatter.data() before it was reallocated last?
- what was malloc_usable_size of that address back then?

And an additional question, that can't be answered by simply adding code like in bug 1124397: can multiple threads touch the same AssemblerBuffer?
Flags: needinfo?(mh+mozilla)
Hi Mike, thanks for looking into this in such detail, but I still think the patches here are likely to yield results; details below.

(In reply to Mike Hommey [:glandium] from comment #35)
> The interesting question to answer here is: why do we have an
> AssemblerBuffer that seemingly jumped from a size of 0x6c000 (the presumed
> size it had before it was shrunk and poisoned) to 0x1259d7 (the size it has
> now)? That seems like a very large jump.

I don't think this is the correct interpretation. IIUC startOffset is off by one, and should actually be 0x69000. Since the poison pattern starts at 0x69000 (420KiB), that's probably the size the buffer was shrunk down to (or at least, the rounded up size that jemalloc tracks). It was then resized again to a larger size by mozilla::Vector's doubling logic, though I think this could have happened multiple times before we actually crash (so it could have been 'doubled' from 420KiB to 1MiB, then from 1MiB to 2MiB, which is what 0x1259d7 would round up to in jemalloc).

In this scenario, all the data from 0x69000 to 0x80000 would *not* be copied over during 'doubling' (from 420KiB to 1MiB), which raises the question of why the poison pattern ends at 0x6c000 (432KiB) and not 0x80000. I think that's because this is a huge allocation, and the chunk we obtained happened not to contain poisoned data past 0x6c000. Perhaps it was simply never fully used before, or had been partially decommitted at some point in the past (IIRC this also means it must have been a *recycled* chunk rather than a newly mapped one).
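The size arithmetic in this interpretation checks out, assuming mozjemalloc rounds huge allocations up to a multiple of its 1MiB chunk size (kChunkSize here is an assumption about the build's configuration):

```cpp
#include <cstddef>

constexpr size_t kChunkSize = 1 << 20; // assumed 1MiB chunks

// Round a huge allocation request up to a whole number of chunks.
constexpr size_t ChunkCeiling(size_t aSize) {
  return (aSize + kChunkSize - 1) & ~(kChunkSize - 1);
}
```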

> Now, having written all that, it just struck me that those e5's don't
> actually mean the buffer was ever *actually* shrunk. Which means it could
> just as much be a red herring.

I don't really understand how you came to that conclusion, but maybe it has to do with the above.

> Does it actually match what happens in our case? no. Because we've always
> had a full number of pages filled, and at most, we can only have slightly
> less than a page poisoned that way.

I don't understand this point. As far as I can tell, all allocations here are larger than 2KiB, which means jemalloc rounds them up to the next multiple of a page. Poisoning uses the new size directly, so the poison pattern might *start* at an offset that isn't page-aligned, but 420KiB (the start of the poison pattern), 432KiB (the end of the poison pattern) and 512KiB (the presumed size before the buffer was shrunk) are all page-aligned.

I don't think anything else you said contradicts the analysis above; the poisoned bytes we see probably aren't the result of in-place reallocation, but rather of not copying over the full buffer (because as far as jemalloc is concerned, it only needs to copy 420KiB worth of data). There are probably instances of this bug that the runtime analysis fails to catch, simply because the chunk being allocated into is newly mapped or otherwise contains only zeroes past the length of our shrunken buffer.

> All in all, not only am I not thrilled about the patch(es), but I'm also
> very dubious it will find anything interesting, especially considering the
> hypothetical case that is meant to be caught by the patch would be just as
> much caught by ASan, and I have a hard time believing it wouldn't have been
> caught by now.

I do agree that ASan should catch this if it happens, but it seems to be a situation we simply lack the hardware or software to reproduce. I still think we now understand the problem correctly, and that the patches in this bug will point out the culprit.

> So before going any further on this bug, I'd rather see at least the
> following questions answered in the context of
> js::jit::X86Encoding::BaseAssembler::nextJump:
> - what is malloc_usable_size(code)?
> - what address does code point to?
> - how much data is supposed to have been written to that buffer?
> 
> if possible:
> - what was the last address of m_formatter.data() before it was reallocated
> last?
> - what was malloc_usable_size of that address back then?
> 
> And an additional question, that can't be answered by simply adding code
> like in bug 1124397: can multiple threads touch the same AssemblerBuffer?

Given the above, I'm not convinced this information will tell us anything new, but I'll needinfo Jan since he can probably get that data most easily.

> There are things to be said about the implementation (I actually have a
> draft review in splinter) but let's put that aside for now because things
> have not changed much, I'd rather not modify mozjemalloc for this purpose,
> even if it's "temporary".

I have to say, I don't understand why you're *this* averse to modifying mozjemalloc. We literally could have had this in the tree 3 months ago (modulo review comments), seen whether it found anything, and backed it out again once it proved useless or had done its job. The amount of work you put into writing the above comment must also have been considerable, especially since you said you already have a partial review done. Frankly, while I fully understand *some* reluctance to modify mozjemalloc given its status, this level of resistance makes it almost impossible to land useful changes in this part of the tree. I really don't mean to antagonize you - I know you have your reasons and a million other things to work on - but I hope you understand how frustrating this is.
Flags: needinfo?(jdemooij)
Before I dig into your answer, I'll add this to the list of things that would be useful to know: the capacity() of the vector.
Oh, whatever. Land it, leave it in for a few days, but the most important part: back it all out. If you don't, I will.
Attachment #8750134 - Flags: review?(mh+mozilla)
Attachment #8750224 - Flags: review?(mh+mozilla)
Attachment #8750227 - Flags: review?(mh+mozilla)
Attachment #8750839 - Flags: review?(mh+mozilla)
I'm really unhappy about this situation, but I think I've done everything in my power to lay out my case in reasonable terms. I'll rebase my patches and make sure they still work.
Try looks green except for intermittents: https://treeherder.mozilla.org/#/jobs?repo=try&revision=500f8a7ddcf1

To the sheriffs: this is a bit of an unorthodox situation, but these patches have been approved for checkin despite not being reviewed.

I'll file a new bug to back these out again once the instrumentation has served its purpose.
Flags: needinfo?(jdemooij)
Keywords: checkin-needed
Blocks: 1294732
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/08f723458cdd
Part 0: Strip trailing blanks from mozjemalloc. r=ehoogeveen
https://hg.mozilla.org/integration/mozilla-inbound/rev/712d30196d3a
Part 1: Add protected allocation functions and supporting infrastructure to mozjemalloc. r=ehoogeveen
https://hg.mozilla.org/integration/mozilla-inbound/rev/f9456fc72969
Part 2: Hook the new functions up and provide dummy implementations where needed. r=ehoogeveen
https://hg.mozilla.org/integration/mozilla-inbound/rev/65e1fc2ebec7
Part 3: Create and hook up an AllocPolicy for JSAPI. r=jandem
https://hg.mozilla.org/integration/mozilla-inbound/rev/ac3664eabd37
Part 4: Enable the new functionality in mozjemalloc. r=ehoogeveen
https://hg.mozilla.org/integration/mozilla-inbound/rev/26e9dcde8345
Part 5: Make AssemblerBuffers use the new alloc policy. r=jandem
Keywords: checkin-needed
Jan told me that OSX shell builds aren't working. That seems to be a configuration that doesn't have try coverage, and I can't test it locally.

This patch fixes shell builds for him; I think my patches just weren't exporting the symbols on OSX before - which raises the question of why browser builds seem to work just fine. Either way, this seems to work; just waiting for try results.
Sheriffs, please land part 6 as well (preferably before the next merge to m-c, but I understand if that's not feasible). This part shouldn't affect automation at all, but fixes shell builds on OSX.
Attachment #8780538 - Flags: checkin+
Attachment #8780539 - Flags: checkin+
Attachment #8780540 - Flags: checkin+
Attachment #8780541 - Flags: checkin+
Attachment #8780542 - Flags: checkin+
Attachment #8780543 - Flags: checkin+
Hmm, looks like the merge may have automatically unset the checkin-needed keyword. Part 6 still needs to land (IIRC this is still a somewhat manual process so it might not matter, but I've marked the other patches as checkin+ to reflect the fact that they have landed).
Keywords: checkin-needed
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/12d066b15637
Part 6: Change the new functions to MFBT_API to export them on OSX. r=jandem
Keywords: checkin-needed
Status: NEW → RESOLVED
Closed: 4 years ago
Keywords: leave-open
Resolution: --- → FIXED
Already two crashes with the nextJump signature with a build-id of a nightly that contains the landing from comment 51:
https://crash-stats.mozilla.com/report/index/77a80a3f-1b23-40a8-87fd-bce472160819
https://crash-stats.mozilla.com/report/index/4b688e46-8bbd-4cd1-9c41-114ab2160820

And AFAICT, none with a signature involving what was added in this bug...
Interesting, thanks for checking that. We do have one lead:

https://crash-stats.mozilla.com/report/index/8ef33114-a248-4beb-bbe1-4c0a02160819

This is crashing trying to deallocate a protected region from a realloc call made by a third party library. But if there are still crashes *not* being caught by this mechanism, then clearly we aren't getting the whole story.
Hmm, though to be fair, even if that is a problem it doesn't match the "in-place realloc" theory in this bug. That hypothesis is in pretty bad shape with those crash reports.
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #56)
> Interesting, thanks for checking that. We do have one lead:
> 
> https://crash-stats.mozilla.com/report/index/8ef33114-a248-4beb-bbe1-
> 4c0a02160819

Nothing in Firefox is using "libDANEcore-Darwin.dylib". Googling says it comes from a dnssec-validator addon (and there are a bunch of crashes involving "libDANEcore-Darwin.dylib"). The crashes in nextJump don't mention that addon being installed.
You're right. While the crash in comment #56 proves that the mechanism implemented here works and that crashes like it *can* impact us in the wild, that information combined with the crashes in comment #55 proves that whatever is causing this corruption can't be using the allocation functions to do it. I've posted an updated analysis in bug 1124397 comment #91, and I'll start preparing a backout for the patches in this bug pending Jan's approval.
> I'll start preparing a backout for the patches in this bug pending Jan's approval.

To close the loop for future readers (since the "Blocks:" change is buried a bit further up), the backout happened on 2016-08-25 in bug 1294732.