Open Bug 1124397 Opened 9 years ago Updated 1 year ago

crash in js::jit::AssemblerX86Shared::bind(js::jit::Label*)

Categories

(Core :: JavaScript Engine: JIT, defect, P3)

Version: 39 Branch
Hardware: All
OS: macOS
Type: defect

Tracking


Tracking Status
firefox38.0.5 --- wontfix
firefox39 --- wontfix
firefox40 --- wontfix
firefox41 --- wontfix
firefox47 --- wontfix
firefox48 + wontfix
firefox49 + wontfix
firefox-esr45 --- wontfix
firefox50 --- wontfix
firefox51 --- wontfix
firefox52 --- wontfix

People

(Reporter: automatedtester, Unassigned, NeedInfo)

References

(Depends on 1 open bug)

Details

(4 keywords, Whiteboard: [#jsapi:crashes-retriage])

Attachments

(3 files, 7 obsolete files)

This bug was filed from the Socorro interface and is report bp-89f2ccb8-28f9-4b5c-8db2-e34a82150121.
=============================================================

I was using Hello for a 1:1 and updating Workday with comments when the browser crashed.
Looks like a crash in irregexp?
Flags: needinfo?(bhackett1024)
Version: 36 Branch → 37 Branch
Is this reproducible?
Flags: needinfo?(bhackett1024)
I just tried the Workday part and that didn't cause the crash. If I knew exactly which tab caused the issue (of my 100+ open) I could help narrow it down for you.
Also experienced this crash.

I was scrolling downwards on the page http://en.wikipedia.org/wiki/User:Hosmich/Twin_flags with about a dozen other tabs in my space open, and the browser crashed.

Crash ID 685d07a6-4a4e-494e-b546-82e182150404
https://crash-stats.mozilla.com/report/index/685d07a6-4a4e-494e-b546-82e182150404

Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:40.0) Gecko/20100101 Firefox/40.0 ID:20150403030204 CSet: 70a113676b21
Hit this searching on a Google map on Nightly: it panned across to the location and then blew up.
Have a user with this crash at https://support.mozilla.org/en-US/questions/1066578
Version: 37 Branch → 39 Branch
I hit this on dev edition a couple of days ago: https://crash-stats.mozilla.com/report/index/6551fdca-d3ac-4c18-9b25-96d0f2150626.  Landed on a page, about 10 seconds later, without user activity, crash.
Early 2013 Retina Mac, 10.9.5.
This is easily the #1 Mac topcrasher on the 38, 39 and 40 branches, and in the top 10 in the 41 and 42 branches.  It also happens on Windows, but at lower volume.

By looking at the assembly code where these crashes happen, and at the Mac and Windows crash stacks on Socorro, I've figured out the top few lines of the "true" crash stack (in current trunk code):

https://hg.mozilla.org/mozilla-central/annotate/eee2d49d055c/js/src/jit/x86-shared/Patching-x86-shared.h#l36
https://hg.mozilla.org/mozilla-central/annotate/eee2d49d055c/js/src/jit/x86-shared/BaseAssembler-x86-shared.h#l3835
https://hg.mozilla.org/mozilla-central/annotate/eee2d49d055c/js/src/jit/x86-shared/Assembler-x86-shared.h#l909
https://hg.mozilla.org/mozilla-central/annotate/eee2d49d055c/js/src/irregexp/NativeRegExpMacroAssembler.cpp#l388

In the raw dumps of these crashes on OS X, r14 is set to 0x5a5a5a5a.  I *think* this means that, at line 2 above, from.offset_ is set to 0x5a5a5a5a, as (ultimately) is NativeRegExpMacroAssembler::stack_overflow_label_::offset_ (at line 4).  0x5a5a5a5a and 0x5a5a5a5a5a5a5a5a are values used by jemalloc to poison freed memory.  I *think* the only way NativeRegExpMacroAssembler::stack_overflow_label_::offset_ can end up set to 0x5a5a5a5a at line 4 is if the NativeRegExpMacroAssembler object (from which masm.bind(&stack_overflow_label_) is called) has itself become invalid, probably only a few lines above the call to masm.bind().

I'm totally unfamiliar with this code, and it's horribly complex.  I won't be able to continue my analysis on my own.  If someone else can, please feel free to do so.  I'll also try to find one or more likely candidates to needinfo.
Keywords: topcrash-mac
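To illustrate the poisoning described in the analysis above (a hedged sketch, not jemalloc's actual code): freeing memory fills it with a recognizable byte pattern, so a stale pointer into that memory reads back 0x5a5a5a5a in its integer fields.

  #include <cstdint>
  #include <cstdlib>
  #include <cstring>

  struct Label {
      int32_t offset_;   // stand-in for js::jit::Label::offset_
  };

  // Simplified poison-on-free, assuming a 0x5a poison byte as in the builds
  // discussed above.
  static void PoisoningFree(void* ptr, size_t size) {
      memset(ptr, 0x5a, size);  // poison before releasing, like jemalloc does
      free(ptr);
  }

  int main() {
      Label* label = static_cast<Label*>(malloc(sizeof(Label)));
      label->offset_ = 42;
      PoisoningFree(label, sizeof(Label));
      // A caller that still held `label` and called masm.bind(label) would now
      // read offset_ == 0x5a5a5a5a -- the value observed in r14 in these dumps.
      return 0;
  }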
Brian, do you have any idea what's going on here?  Might my analysis from comment #9 be correct?  If so, do you have any idea how the NativeRegExpMacroAssembler object might have deleted itself during a call to NativeRegExpMacroAssembler::GenerateCode()?
Flags: needinfo?(bhackett1024)
Steven, thank you for the analysis.

I looked at some crash stacks and it's not just NativeRegExpMacroAssembler; other random masm.bind() calls as part of Baseline and Ion compilation are affected as well. Especially some of the bind() calls in Baseline code are extremely hot and well tested.

It happens mostly on Mac but also on Windows, so a compiler bug is unlikely. My best guess is the assembler backend somewhere misbehaves when we have a certain memory allocation or address(es). That'd explain the mostly-Mac part...

I'll try to dig deeper in a bit.
Flags: needinfo?(jdemooij)
Depends on: 1187323
Looking at the crash addresses, I noticed that we have a lot of addresses ending with "0xa56".  The fact that a lot of addresses end the same way /often/ means that we have a structure which is large enough to be aligned, and that the bad code is looking at a field inside this structure.

So far, the only one I can think of might be the BaselineCompiler structure.  This would imply that the BaselineCompiler functions are being called with a bad "this", for failures within calls like:

  masm.bind(&postBarrierSlot_);
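To illustrate the alignment argument with made-up numbers (nothing below is from the actual crash data beyond the 0xa56 suffix): if a large structure is always allocated at a fixed alignment and the bad access is always to the same field, the faulting address always has the same low bits.

  #include <cinttypes>
  #include <cstdint>
  #include <cstdio>

  int main() {
      const uintptr_t alignment = 0x1000;   // hypothetical allocation alignment
      const uintptr_t fieldOffset = 0xa56;  // hypothetical offset of the field
      for (uintptr_t base = 0x100000; base < 0x104000; base += alignment) {
          uintptr_t addr = base + fieldOffset;
          printf("0x%03" PRIxPTR "\n", addr & 0xfff);  // prints 0xa56 every time
      }
      return 0;
  }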
I now see that the variety of crash stacks is much larger than I realized.  But (from looking at a few more) it may still be true that all of them on the Mac have this in common:

They're all calls to js::jit::AssemblerX86Shared::bind(js::jit::Label*) on a Label that's been deleted -- whose offset_ variable == 0x5a5a5a5a.
(In reply to Steven Michaud [:smichaud] from comment #13)
> They're all calls to js::jit::AssemblerX86Shared::bind(js::jit::Label*) on a
> Label that's been deleted -- whose offset_ variable == 0x5a5a5a5a.

Ok, the code pointer is aligned, and the offset is poisoned, but how did we manage to get a label allocated with the SystemAllocPolicy in the first place?
> Ok, the code pointer is aligned, and the offset is poisoned, but how
> did we manage to get a label allocated with the SystemAllocPolicy
> in the first place?

I have no idea.  I don't know this code at all :-(

Another possibility, I suppose, is that the Label object is still valid
but was bound() to an invalid offset (== 0x5a5a5a5a).  But it seems
like AssemblerX86Shared::bind() expects objects that haven't yet been
bound.
I just noticed that some of the Windows raw dumps have esi == 0x5a5a5a5a.  Someone who has a good Windows disassembler might want to look at the assembly code for __ZN2js3jit18AssemblerX86Shared4bindEPNS0_5LabelE.
I can't get this out of my head, so I've continued to dig away at it.

Above I was pretty sure that js::jit::AssemblerX86Shared::bind(js::jit::Label*) is being called with a Label whose offset_ variable == 0x5a5a5a5a.  Now I've proved to myself this can't be true.  (I used an interpose library and further analysis of the assembly code for AssemblerX86Shared::bind().  I'll say more if people feel the need.)

Now I'm pretty sure GetInt32() is returning an offset == 0x5a5a5a5a, here:

https://hg.mozilla.org/mozilla-central/annotate/eee2d49d055c/js/src/jit/x86-shared/BaseAssembler-x86-shared.h#l3835

As an experiment, we might want to add debugging code here that will log some kind of error message if offset == 0x5a5a5a5a.
> Now I'm pretty sure GetInt32() is returning an offset == 0x5a5a5a5a

GetInt32() is called repeatedly in a loop from js::jit::AssemblerX86Shared::bind(js::jit::Label*), via the call to masm.nextJump().  The crashes happen, at GetInt32(), the next time through the loop after GetInt32() returns 0x5a5a5a5a.
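For readers unfamiliar with this code, here is a hedged, stripped-down model of the pending-jump chain (not the actual SpiderMonkey assembler). While a Label is unbound, the rel32 slot of each jump targeting it stores the buffer offset of the previous pending jump, with -1 terminating the chain; bind() walks that chain via the equivalent of nextJump()/GetInt32() and patches each slot, and a poisoned value read out of the buffer derails the walk.

  #include <cassert>
  #include <cstdint>
  #include <cstring>
  #include <vector>

  static int32_t GetInt32(const uint8_t* where) {
      int32_t value;
      memcpy(&value, where, sizeof(value));
      return value;
  }

  static void SetInt32(uint8_t* where, int32_t value) {
      memcpy(where, &value, sizeof(value));
  }

  struct Label {
      int32_t lastJumpEnd = -1;  // offset just past the last pending jump, or -1
  };

  // Emit "JMP rel32" to an unbound label: the rel32 slot temporarily stores
  // the link to the previous pending jump instead of a real displacement.
  static void EmitJumpTo(std::vector<uint8_t>& code, Label& label) {
      code.push_back(0xE9);                      // JMP rel32 opcode
      size_t slot = code.size();
      code.resize(code.size() + 4);
      SetInt32(&code[slot], label.lastJumpEnd);  // thread onto the chain
      label.lastJumpEnd = int32_t(code.size());
  }

  // Bind the label to the current position: walk the chain and patch each slot.
  static void Bind(std::vector<uint8_t>& code, Label& label) {
      int32_t target = int32_t(code.size());
      int32_t jumpEnd = label.lastJumpEnd;
      while (jumpEnd != -1) {
          // The crashes discussed here fire in this loop: if the buffer has
          // been overwritten with poison, the link read below is 0x5a5a5a5a /
          // 0xe5e5e5e5, which the release assert from bug 1187323 now catches.
          assert(jumpEnd >= 5 && size_t(jumpEnd) <= code.size());
          int32_t previous = GetInt32(&code[jumpEnd - 4]);
          SetInt32(&code[jumpEnd - 4], target - jumpEnd);  // real displacement
          jumpEnd = previous;
      }
      label.lastJumpEnd = target;  // label is now bound
  }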
(In reply to Steven Michaud [:smichaud] from comment #17)
> Now I'm pretty sure GetInt32() is returning an offset == 0x5a5a5a5a, here:
> 
> https://hg.mozilla.org/mozilla-central/annotate/eee2d49d055c/js/src/jit/x86-
> shared/BaseAssembler-x86-shared.h#l3835
> 
> As an experiment, we might want to add debugging code here that will log
> some kind of error message if offset == 0x5a5a5a5a.

If that's right, the asserts I added in bug 1187323 should catch this... Maybe the fuzzers will find something once it's on m-c. I'll get back to this soon.
(In reply to Steven Michaud [:smichaud] from comment #18)
> > Now I'm pretty sure GetInt32() is returning an offset == 0x5a5a5a5a
> 
> GetInt32() is called repeatedly in a loop from
> js::jit::AssemblerX86Shared::bind(js::jit::Label*), via the call to
> masm.nextJump().  The crashes happen, at GetInt32(), the next time through
> the loop after GetInt32() returns 0x5a5a5a5a.

I see only 2 options then: either we have an offset which targets something outside the code (which bug 1187323 asserts against), or we have a compiler error which incorrectly aliases the code pointer to a reallocated memory area.
I'll make the asserts I added in bug 1187323 release asserts, to verify comment 18. Let's see what that tells us.
Flags: needinfo?(jdemooij)
Best of all would be to get that information into Socorro crash reports.  I know how to do that (I've done it before).  Give me a day or two to write a patch.
(In reply to Steven Michaud [:smichaud] from comment #10)
> Brian, do you have any idea what's going on here?  Might my analysis from
> comment #9 be correct?  If so, do you have any idea how the
> NativeRegExpMacroAssembler object might have deleted itself during a call to
> NativeRegExpMacroAssembler::GenerateCode()?

Canceling needinfo since comment 11 indicates this isn't irregexp-specific.  NativeRegExpMacroAssembler is stack allocated and won't delete itself.
Flags: needinfo?(bhackett1024)
(Following up comment #22)

This is *much* harder than I anticipated -- as it currently stands, none of the JS code supports adding annotations to crash logs.  I *think* I've solved all the major problems, but now I'm stuck at what seems to be a bug in one of our Python build scripts (code with which I'm completely unfamiliar).

I suspect we're not going to be able to make progress until I do what I described in comment #22 ... but I'm not sure how long it will take.
I got around the seeming bug in one of our Python build scripts -- I was taking the wrong approach anyway.  But this is still horribly complicated, and I haven't yet finished the work.

Right now I have a patch that builds locally and works just fine (on OS X).  But it won't build on the tryservers on either OS X or Windows, and I haven't yet sacrificed enough chickens to make the problem(s) go away.

For now I've run out of time to spend on this bug, and need to put it aside for awhile.  I hope I can come back to it in the not too distant future.  I suspect the only way we'll be able to learn enough to fix this bug is via crash log annotations.
I sacrificed another barnful of chickens and got this building on the tryservers.  I also tested that it works as expected on OS X and Windows.  But when I ran an all-platform set of tryserver builds last night, there were an unusual number of test failures (though all seem unrelated):

https://treeherder.mozilla.org/#/jobs?repo=try&revision=3ef9af95205c

So I've decided to do another all-platform run, and see what happens:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=0d33ce1a9f00
Attachment #8644390 - Attachment is obsolete: true
Comment on attachment 8651068 [details] [diff] [review]
Crash log annotation patch, WIP: now builds on tryservers

This patch has three logical parts:

1) What's needed for non-xul components to annotate crash reports.
2) What's needed for libjs to annotate crash reports.
3) What's needed to figure out bug 1124397 (this bug).

The last part is temporary -- it can come out once we've fixed this bug.  But I think it'd be very convenient if the other two parts could be made permanent.  I'll open a bug for that and report its number here.
I've opened bug 1197259.
See Also: → 1197259
(Following up comment #26)

There are a lot of failures in both sets of tests.  Most of them are frankly inexplicable, but both sets do include some "jit" test failures.  I suspect I need to do a better job of dealing with the case of XUL *not* being loaded (for example when running in the 'js' and 'jsapi-tests' binaries).

New patch coming up.
Hm is it possible this signature disappeared on Nightly?

I wonder if the release asserts I added in bug 1187323 have something to do with that, it landed last week and there haven't been any crashes on Mac Nightly since then.
(Following up comment #31)

One thing *has* changed, though (since the patch for bug 1187323 landed):  These crashes now happen here, at the MOZ_RELEASE_ASSERT() you added to BaseAssembler::nextJump():

https://hg.mozilla.org/mozilla-central/annotate/23a04f9a321c/js/src/jit/x86-shared/BaseAssembler-x86-shared.h#l4156

I don't yet understand why (I haven't yet looked at the assembly code for these new builds).
(Following up comment #32)

OK, now I get it ... or at least I think I do:

MOZ_RELEASE_ASSERT() crashes if the condition is false.

Presumably that's because offset == 0x5a5a5a5a5a5a5a5a.  But we already knew this.  My crashlog annotation patch provides much more information, and would presumably be more useful.

I assume MOZ_RELEASE_ASSERT() just writes its output to stdout -- so nobody will see it except the people who crash (and know where to look).
My new patch's stub library only checks once for the XUL symbols it needs (using dlsym() or its equivalent).  This substantially speeds up its performance when XUL isn't (and isn't going to be) loaded -- for example in the js and jsapi-test binaries.

It also gets rid of the jit test failures.

But there are still a bunch of other failures that I haven't yet managed to explain or dismiss as irrelevant:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=30c4647a7364
https://treeherder.mozilla.org/#/jobs?repo=try&revision=e7a2ea1a1960
Attachment #8651068 - Attachment is obsolete: true
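For context, the one-time symbol lookup described above could look roughly like this (a hedged sketch; the symbol name and signature are hypothetical, not taken from the actual patch):

  #include <dlfcn.h>
  #include <mutex>

  // Hypothetical annotation entry point exported by XUL.
  typedef void (*AnnotateFn)(const char* key, const char* value);

  static AnnotateFn GetAnnotateFn() {
      static std::once_flag sOnce;
      static AnnotateFn sFn = nullptr;
      std::call_once(sOnce, [] {
          // Resolved exactly once. If XUL isn't loaded (js shell, jsapi-tests),
          // the null result is cached and later calls stay cheap.
          if (void* self = dlopen(nullptr, RTLD_LAZY))
              sFn = reinterpret_cast<AnnotateFn>(
                  dlsym(self, "MozCrashReporterAnnotate"));
      });
      return sFn;
  }

  static void MaybeAnnotate(const char* key, const char* value) {
      if (AnnotateFn fn = GetAnnotateFn())
          fn(key, value);
  }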
(Following up comment #33)

For what it's worth (and after looking at the new builds' assembly code), the exact location of the current crashes in our code is here:

https://hg.mozilla.org/mozilla-central/annotate/c46370eea81a/mfbt/Assertions.h#l218
(Following up comment #34)

I think I now understand the remaining test failures, which are mainly on e10s:  AnnotateCrashReport() and friends may only be called on the main thread in a content process.

I need to come up with another patch that deals with this ... one way or another.
(Following up comment #37)

That change didn't get rid of the test failures, and neither does my present revision.  We may have to disable certain tests while my debug logging patch is in the tree, or just avoid the failures by only compiling the code on certain platforms (like OS X and Windows).

As I mentioned above in comment #23, the part of my patch specific to this bug can come out when we've fixed it.

My present revision simplifies the loading of XUL pointers in the stub library, and (I hope) guarantees that this will happen before we've imposed a sandbox on the relevant process (content or plugin).
Attachment #8652508 - Attachment is obsolete: true
Attached patch Crash log annotation patch v1.0 (obsolete) — Splinter Review
This patch avoids the test failures by not compiling the bug 1124397-specific annotations on DEBUG builds.  These annotations also only compile on OS X and Windows -- the two platforms where we've seen this bug's crashes.

The test failures themselves offer no clue as to why they're happening.  The calls to AnnotateCrashReport() and RemoveCrashReportAnnotation() are rather expensive, and I suspect this has something to do with it.  AssemblerX86Shared::bind() seems to be called rather often (though not nearly as often as BaseAssembler::nextJump()).

There's little we can do about this.  In any case, this code (and its possible future evolutions) only has to stay in the tree long enough to allow us to decipher this bug's crashes.
Attachment #8653121 - Attachment is obsolete: true
(Following up comment #27)

My current plan is to first get my patches for bug 1197259 (parts 1 and 2) into the tree, then post another patch here (for review) that just contains part 3.
Blocks: 1197220
Crash Signature: [@ js::jit::AssemblerX86Shared::bind(js::jit::Label*)] → [@ js::jit::AssemblerX86Shared::bind(js::jit::Label*)] [@ js::jit::AssemblerX86Shared::bind]
This one still baffles me. The data we have:

(1) We fail the MOZ_RELEASE_ASSERT(size_t(offset) < size()) in nextJump. In other words, we're reading some offset out of the code buffer and that offset happens to be bogus.

(2) It mostly affects OS X users (94% of the crashes), but there are also some Windows and Linux crashes. It's the #1 crash on OS X.

(3) I looked at one of the Windows minidumps and (if the debugger is not lying), it's not an OOM and also not an unusually large buffer.

(4) The fuzzers never hit this.

Especially (2) is really weird; the x86/x64 assembler buffer code should behave exactly the same on all platforms. Maybe some other function or thread is corrupting our code buffer.

I'll poke at some other Windows minidumps. After that we can try to get some more data into the crash reports...
Flags: needinfo?(jdemooij)
Depends on: 1260660
Attached patch Diagnostic patch (obsolete) — Splinter Review
I've been staring at crash dumps for a while but nothing stands out - it's super mysterious.

Here's a patch to stash some data on the stack before we crash, so we can retrieve it from the minidumps (stack memory is included in crash reports).

The |volatile| is there to make sure the compiler doesn't do anything fancy with our data.
Attachment #8654189 - Attachment is obsolete: true
Attachment #8736827 - Flags: review?(efaustbmo)
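The essence of the diagnostic patch above, as a hedged sketch (not the patch itself; the exact values it records are visible in the review quote below): stash a marker and a few values in a volatile array on the stack just before crashing, so they land in the minidump's stack memory and can be read back out of the raw dump on Socorro.

  #include <cstdint>
  #include <cstdlib>

  static void CrashWithBlackbox(uintptr_t offset, uintptr_t bufferSize,
                                uintptr_t fromOffset) {
      volatile uintptr_t blackbox[4];
      blackbox[0] = uintptr_t(0xABCD1234);  // marker to search for in the dump
      blackbox[1] = offset;
      blackbox[2] = bufferSize;
      blackbox[3] = fromOffset;
      abort();  // MOZ_CRASH in the actual tree
  }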
Comment on attachment 8736827 [details] [diff] [review]
Diagnostic patch

Review of attachment 8736827 [details] [diff] [review]:
-----------------------------------------------------------------

r=me

::: js/src/jit/x86-shared/BaseAssembler-x86-shared.h
@@ +3412,5 @@
> +            blackbox[0] = uintptr_t(0xABCD1234);
> +            blackbox[1] = uintptr_t(offset);
> +            blackbox[2] = uintptr_t(size());
> +            blackbox[3] = uintptr_t(from.offset());
> +            blackbox[4] = uintptr_t(code[from.offset() - 1]);

These are wrong, as discussed on IRC.
Attachment #8736827 - Flags: review?(efaustbmo) → review+
Attached patch Diagnostic patchSplinter Review
Assignee: nobody → jdemooij
Attachment #8736827 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Flags: needinfo?(jdemooij)
Attachment #8736856 - Flags: review+
I'm currently waiting for Nightly crashes with the extra crash instrumentation. On Nightly there have been ~1-2 crashes a day for the past week so I'm expecting a crash report today or tomorrow... That's assuming the patch did not somehow make the problem disappear.

These crashes are extremely weird. It's not an OOM situation. Most (not all) of these crashes are after we emit quite a lot of code though (say 40-200 KB).

One possibility is that some other thread ends up corrupting our memory. That would explain both why it's so hard to reproduce and why the platform distribution is so unusual, but I'm not really convinced because it leaves some other questions unanswered.

I've analyzed and fixed a lot of top crashes but this is the weirdest and most challenging one so far.
Keywords: steps-wanted
FWIW, this is the #1 top crash on Mac 45.0.1 with 11% of all Mac crashes.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #52)
> FWIW, this is the #1 top crash on Mac 45.0.1 with 11% of all Mac crashes.

Looking at the build IDs [1], and sorting by build id, highlights that most of the issues we got are coming from build 20160315153207 (72%).  Does this non-uniformity correspond to the way updates are shipped to our users?  If not, can we dig into this version and 20160316065941 (0.29%), and do a binary diff?

If build-ids have a uniform user base, then such spikes could be explained by an intermittent compiler error.

If build-ids have a uniform user base, then this might be caused by a popular website on 2016-03-15: they might have pushed some new code which triggered the error and later fixed it.  Maybe we can figure that out based on the comments, and ask these website developers to tell us what the difference was.
(In reply to Nicolas B. Pierron [:nbp] from comment #53)
> (In reply to Robert Kaiser (:kairo@mozilla.com) from comment #52)
> > FWIW, this is the #1 top crash on Mac 45.0.1 with 11% of all Mac crashes.
> 
> Looking at the build ID [1]

[1] https://crash-stats.mozilla.com/signature/?signature=js%3A%3Ajit%3A%3AAssemblerX86Shared%3A%3Abind#aggregations
Maybe we can figure that out with these versions which are from the same channel:

20160316065941   46.0b2    25   0.29%   ( 3.5 / day)
?                46.0b3     0      0%
20160322075646   46.0b4    75   0.86%   (37.5 / day)
20160324011246   46.0b5   321   3.67%   (45.8 / day)
?                46.0b6     0      0%
20160401021843   46.0b7    50   0.57%
KaiRo,

Can you help us normalize the number of users per build-id, such that we can make sure that comment 53 and comment 55 are not biased by a different population size?
Flags: needinfo?(kairo)
(In reply to Nicolas B. Pierron [:nbp] from comment #55)
> Maybe we can figure that out with these versions which are from the same
> channel:

We had some betas this cycle that were never released due to various issues, e.g. the security-related stopping of all updates last week, which made us not ship b6 at all and ship b7 only a day before b8.

(In reply to Nicolas B. Pierron [:nbp] from comment #56)
> Can you help us normalize the number of users per build-id, such that we can
> make sure that comment 53 and comment 55 are not biased by a different
> population size?

Unfortunately, while the data to do this exists, it's a whole lot of work to do it. What is the answer you want to get out of that? (Is it worth it for me to pour multiple hours into potentially getting something there, especially if the outcome may not be too reliable by itself?)
Flags: needinfo?(kairo)
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #57)
> Unfortunately, while the data to do this exists, it's a whole lot of work to
> do it. What is the answer you want to get out of that? (Is is worth for me
> to pour multiple hours into potentially getting something there, esp. if the
> outcome may not be too reliable by itself probably?)

I want to know if the number of crashes is correlated with the number of reports.  If not, this would highlight an issue which could be an intermittent error.

While looking more into crash-stats, I found the following link, which pretty much answers this with the number of crashes/ADU (Active Daily Users), and highlights that they are correlated.  Thus, this is not an intermittent error.

https://crash-stats.mozilla.com/report/list?product=Firefox&range_unit=days&range_value=28&signature=js%3A%3Ajit%3A%3AAssemblerX86Shared%3A%3Abind#tab-graph
(In reply to Nicolas B. Pierron [:nbp] from comment #58)
> While looking more into crash-stat, I found the following link which
> pretty-much answer with the number of crashes/ADU (Active Daily User)

Yes, right, this one is helpful for Nightly and Aurora channels at least.
I finally got some Mac crashes with the diagnostics patch.
---
2 crashes in BaselineCompiler::emitOutOfLinePostBarrierSlot -> nextJump:

* db714a92-1884-4138-bf78-5b2332160409
  Buffer length 0x46e09. At offset 0x1fe38, we find 0xe5e5e5e5 (the instruction byte before that is also 0xe5).

* ffb0014a-e5fe-4bbc-93c7-06e772160410
  Buffer length 0x2d4f4. At offset 0x1ff95 we get 0xe5e5e5e5 again.

And 1 crash in NativeRegExpMacroAssembler::GenerateCode -> nextJump:

* 6b0f1202-7cdc-448a-83d1-f2de22160409
  Buffer length 0x2741fb. At offset 0x1afeb we get 0xe5e5e5e5.
---
0xe5e5e5e5 is jemalloc's poison pattern. It's mysterious why our assembler buffer would suddenly contain those bytes. There are a few possibilities:

(1) Memory bug elsewhere: maybe some other thread tried to free some invalid pointer and ended up poisoning our memory.

(2) The Vector underlying the AssemblerBuffer somehow points to invalid memory, or something went wrong growing the Vector. This seems a bit unlikely.

(3) Somehow writing the jump/call instruction + its jump offset went wrong. This is also unlikely.
---
There are some patterns in these crashes (I mentioned some of this earlier in this bug):

* The assembler buffers are unusually big (181 KB and 283 KB for the Baseline compilations, a whopping 2.5 MB for the regex compilation!). I want to find out how common that is for regular expressions.

* Note that the buffer offsets above are very similar: 110571, 130616, 130965. This could be a coincidence though.
I also wonder how often we get our assembler buffer corrupted with the 0xe5e5e5e5 pattern but *don't* crash because we're not binding a jump/call there.

We could scan the buffer for, say, 5 0xe5 bytes, but we have to guarantee that can't happen for real. For local testing it should be sufficient though.

A potentially interesting next step here is figuring out where the 0xe5 region starts and ends. I wonder if it's small (like 8 corrupted bytes) or most of the buffer.
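A hedged sketch of the kind of scan suggested above (illustrative only, not landed code): find where a run of poison bytes starts and ends, ignoring short runs that could be legitimate code bytes.

  #include <cstddef>
  #include <cstdint>

  // Returns true and reports [start, end) if the buffer contains a run of at
  // least minRun consecutive poison bytes.
  static bool FindPoisonRun(const uint8_t* buffer, size_t length, uint8_t poison,
                            size_t minRun, size_t* start, size_t* end) {
      size_t runStart = 0;
      size_t runLength = 0;
      for (size_t i = 0; i < length; i++) {
          if (buffer[i] == poison) {
              if (runLength == 0)
                  runStart = i;
              runLength++;
              continue;
          }
          if (runLength >= minRun) {
              *start = runStart;
              *end = i;
              return true;
          }
          runLength = 0;
      }
      if (runLength >= minRun) {
          *start = runStart;
          *end = length;
          return true;
      }
      return false;
  }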
FWIW there are a number of OS X crashes on the following page: http://slither.io/

That online game seems pretty popular and people likely spend a lot of time on that page, but I think even if we account for that, it's still crashier than you'd expect.

Unfortunately it's not possible to run that game unattended; maybe I can write a script to reload and start the game automatically.
Attached patch Diagnostic patch 2 (obsolete) — Splinter Review
Adds some code to track where the 0xe5 bytes start and end.

Once we know how big the region is, we're in a better position to decide what to try next.
Attachment #8740504 - Flags: review?(efaustbmo)
Attachment #8740504 - Attachment is obsolete: true
Attachment #8740504 - Flags: review?(efaustbmo)
Attachment #8740507 - Flags: review?(efaustbmo)
Attachment #8740507 - Flags: review?(efaustbmo) → review+
It's getting interesting. We have a Nightly crash report (on OS X) with the updated diagnostics: https://crash-stats.mozilla.com/report/index/0aa388b7-1de5-4d5f-a95c-65f1f2160414

Here's what it says:

* The AssemblerBuffer's size is 131101 bytes, a bit more than 128 KB. That's (again) large.
* The jump/call we want to patch is at offset 7741.
* We crash because *all* bytes in range 4096-16383 are 0xE5!

So exactly 3 pages (12 KB) are filled with the poison value.

It *could* be a bug when we resize the AssemblerBuffer's Vector, either Vector or jemalloc code. This seems somewhat unlikely, but some other stack/heap corruption bug could confuse the resize process.

Once we have a few more crash reports, we can try to narrow it down a bit as follows: after we write a new instruction to the buffer, we check if the AssemblerBuffer's length > some pretty large value. If it is, we MOZ_CRASH if we have >= ~16 0xE5 bytes at some offset we expect to be affected by this bug.

That will tell us (1) Does the buffer get poisoned when the current thread is in some particular part of the code, or does it happen at random? (2) If it's some other thread corrupting our memory, maybe we can still see its stack in the crash dumps, if we're lucky and it's still active.

First let's wait for a few more crash reports to see if it's always the same range.
It sounds like a buffer that was previously 16KB, then shrunk to 4KB (poisoning the remaining 12KB), and then grown to 128KB (or whatever the size class after that is). IIRC jemalloc doesn't zero buffers when it increases their size (unless opt_zero is set to true, which I don't believe it is in production [1]), so is it possible that we're looking at uninitialized data?

[1] https://dxr.mozilla.org/mozilla-central/source/memory/mozjemalloc/jemalloc.c#1312
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #68)
> so is it possible that we're looking at uninitialized data?

The AssemblerBuffer uses a Vector and simply appends to it for the most part. I could understand it containing *some* uninitialized data (when we have a bogus offset or forget to patch it), but exactly 3 pages seems very unlikely...
Attached patch More diagnosticsSplinter Review
Let's check in AssemblerX86Shared::executableCopy that the buffer wasn't poisoned with 16 or more 0xE5 bytes.

Basically I wonder how often we have this corruption but don't catch it because we're not binding a Label.
Attachment #8742286 - Flags: review?(efaustbmo)
Attachment #8742286 - Flags: review?(efaustbmo) → review+
Another Nightly crash: https://crash-stats.mozilla.com/report/index/195cf126-1203-45e1-9e4b-0bfa32160418

Buffer size: 132414
Instruction offset: 32770

0xE5 range: 32768 - 36864 (exactly 1 page)

Interestingly, the instruction we're patching is on the page boundary: the second half of the offset is corrupt, but the first part is still intact. The instruction is 0x86, it was most likely 0x0f 0x86 (JBE with 32-bit offset), and the offset is 0xe5e5ffff. That probably was 0xffffffff (-1), it means this was the first jump in this chain.

So we learned a few new things:

* The number of corrupted bytes varies, but with the 2 crashes so far it's always a multiple of the page size.
* The offset is also different (but 32768 is also a power of 2), but the buffer size is again a bit larger than 128 KB (= quite huge).
* I think this crash proves the page was filled correctly, initially, but it was overwritten later.
* It also confirms the code and offsets were correct before it got corrupted.
I think the best move now is to modify mozjemalloc: when resizing the AssemblerBuffer, we tell mozjemalloc the range it shouldn't poison. Then when mozjemalloc poisons at least 1 page, it can check this range and MOZ_CRASH if needed.

This will require some (temporary) hacks but if we're really racing with a realloc/free on another thread, it's the best way to find out where that happens.

We'll also need a lock/mutex, hopefully the perf impact from that won't be too bad.
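A hedged sketch of that suggestion (illustrative only; as the next comment explains, the instrumentation that actually landed took a different approach): the assembler registers its buffer's range, and the allocator checks that range under a lock before poisoning anything.

  #include <cstdint>
  #include <cstdlib>
  #include <mutex>

  static std::mutex sGuardLock;
  static uintptr_t sGuardStart = 0;
  static uintptr_t sGuardEnd = 0;

  // Called by the assembler whenever it (re)allocates its buffer.
  static void SetNoPoisonRange(void* start, size_t length) {
      std::lock_guard<std::mutex> lock(sGuardLock);
      sGuardStart = reinterpret_cast<uintptr_t>(start);
      sGuardEnd = sGuardStart + length;
  }

  // Called by the allocator just before it poisons [start, start + length).
  static void CheckBeforePoison(void* start, size_t length) {
      std::lock_guard<std::mutex> lock(sGuardLock);
      uintptr_t begin = reinterpret_cast<uintptr_t>(start);
      uintptr_t end = begin + length;
      if (begin < sGuardEnd && end > sGuardStart)
          abort();  // someone is about to poison the live assembler buffer
  }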
I've started on a patch that modifies mozjemalloc, but not in the way Jan suggested above. Instead, I've added new 'protected' allocation functions that give out or take an allocation ID. reallocs and frees (protected and unprotected) are then checked to make sure the right key was supplied, and crash if not. Using these functions for the AssemblerBuffer should protect against the kind of realloc-after-free that's suspected here, and point straight at the offending party.

I'm hoping that this functionality will be fast enough to stick around as a general mechanism for problems like these, but that also means I'm trying to make them available everywhere (even if they only do anything in mozjemalloc). Their actual implementation is pretty simple, but I'm having some difficulty with all our allocation glue code, simply because there's so much of it; might take a few more days.
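A hedged sketch of the ownership idea (illustrative only; the real bug 1273462 work lives in mozjemalloc and the allocation glue): each protected allocation gets an ID, and a protected free or realloc must present the same ID or we crash, pointing straight at the offending caller.

  #include <cstdint>
  #include <cstdlib>
  #include <mutex>
  #include <unordered_map>

  static std::mutex sLock;
  static std::unordered_map<void*, uint64_t> sOwners;
  static uint64_t sNextId = 1;

  void* MallocProtected(size_t size, uint64_t* idOut) {
      void* ptr = malloc(size);
      if (ptr) {
          std::lock_guard<std::mutex> lock(sLock);
          *idOut = sNextId++;
          sOwners[ptr] = *idOut;
      }
      return ptr;
  }

  void FreeProtected(void* ptr, uint64_t id) {
      {
          std::lock_guard<std::mutex> lock(sLock);
          auto it = sOwners.find(ptr);
          if (it == sOwners.end() || it->second != id)
              abort();  // wrong ID: this caller does not own the allocation
          sOwners.erase(it);
      }
      free(ptr);
  }

(The real patch also checks unprotected frees and reallocs against the table, so an outside party touching the buffer crashes at the point of the bad call.)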
Bug 1273462 has landed, which means we can now theoretically protect AssemblerBuffers that we think are likely to be hit by this bug. Unfortunately protecting every buffer all the time is too slow - it regresses Kraken by something like 3% - so this is currently disabled. Jan suggested on IRC protecting buffers for 1) large baseline scripts and 2) regexp scripts.

This shouldn't be hard to do now, for someone who knows where to look: we just need to call BaseAssembler::enableBufferProtection() from the right places (at the start of script compilation). The only wrinkle is that we'll probably need a dummy implementation for ARM, since that doesn't use the same kind of AssemblerBuffers.

Thing is, I'm kind of out of my depth here - I didn't have much luck tracing uses of BaseAssembler up to a higher level - so I'd like to leave this part up to an expert. Pinging Jan, but anyone who knows their way around the JITs feel free to jump in.
Flags: needinfo?(jdemooij)
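For readers following along, "protecting" a buffer here means marking its already-written, page-aligned prefix read-only so any stray write faults immediately. A hedged POSIX sketch of the idea (the real mechanism is the PageProtectingVector machinery from bug 1273462; Windows would use VirtualProtect instead of mprotect):

  #include <sys/mman.h>
  #include <cstddef>
  #include <cstdint>

  static const size_t kPageSize = 4096;

  // buffer must be page-aligned (e.g. obtained from mmap or posix_memalign).
  static void ProtectWrittenPages(uint8_t* buffer, size_t bytesWritten) {
      size_t protectable = bytesWritten & ~(kPageSize - 1);
      if (protectable)
          mprotect(buffer, protectable, PROT_READ);  // reads ok, writes fault
  }

  // Called before legitimately patching earlier code (e.g. binding a label).
  static void UnprotectWrittenPages(uint8_t* buffer, size_t bytesWritten) {
      size_t protectable = bytesWritten & ~(kPageSize - 1);
      if (protectable)
          mprotect(buffer, protectable, PROT_READ | PROT_WRITE);
  }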
[Tracking Requested - why for this release]:
Nominating this issue for tracking. It is still the #1 crasher for Mac OS X on release (12.3% currently). Maybe it's possible to come up with a solution for this in time for 48...
Emanuel, sorry for the delay.

We discussed this a bit yesterday. Next week I'll look at a bunch of Mac and Windows crash dumps to figure out the right conditions to use for the protect calls here.
Track this as 48/49 still have the crash.
Calixte classified the number of hits of bugs per domain, such as Yahoo, Gmail, YouTube, and Facebook.

*youtube.com/* represents ~24% of all crashes with this signature.
This signature represents ~8.5% of all *youtube.com/* crashes reported.

Here is the list of domains found (not ordered):
  www.listentoyoutube.com
  www.getlinkyoutube.com
  www.youtube.com
  consent.youtube.com
  dfromyoutube.com
  gaming.youtube.com

This corresponds to our hypothesis that some part of Gecko which is manipulating large buffers is resizing our assembler buffer, which suggests that this might be related to the audio/video handling code.
Crash volume for signature 'js::jit::AssemblerX86Shared::bind':
  - nightly (50): 2
  - esr (45): 2154

Affected platforms: Windows, Mac OS X, Linux
I think bug 1271165 is the best way forward here. Unfortunately that has been blocked on reviews for a while now, despite pings on IRC/email.
Flags: needinfo?(jdemooij)
Seems it is now moving in bug 1271165
but it is too late for 48
We've got two crash reports on Windows from the new instrumentation (!) so far:

https://crash-stats.mozilla.com/report/index/575318c2-2434-413a-a01b-71a852160818
https://crash-stats.mozilla.com/report/index/acbed5a5-5a7a-4cdb-a2bc-9bb4e2160818

There's a lot of inlining, but logically they can only be crashing here:
https://hg.mozilla.org/mozilla-unified/annotate/fe895421dfbe/memory/mozjemalloc/jemalloc.c#l6305

In other words, we must be resizing the vector from multiple threads at roughly the same time without taking a lock! Jan, can you make anything of the stacks?
Flags: needinfo?(jdemooij)
From discussion on IRC with Nicolas, sounds like these are startup crashes and thus probably not related to this bug (uptime in the crash reports seems to confirm this). Still sounds like a race that we should fix though!
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #84)
> In other words, we must be resizing the vector from multiple threads at
> roughly the same time without taking a lock! Jan, can you make anything of
> the stacks?

You can see the stacks of the other threads if you click "Show other threads", there are no other threads doing related stuff AFAICS.

Also, these two crash reports are from a single user. Combined with the uptime of 0-1 seconds, I think we should wait for more reports for now.
Flags: needinfo?(jdemooij)
(In reply to Jan de Mooij [:jandem] from comment #86)
> You can see the stacks of the other threads if you click "Show other
> threads", there are no other threads doing related stuff AFAICS.

Hmm, it's very odd that we're crashing by trying to protect a region that's already protected (especially considering that the new alloc policy used by AssemblerBuffer is the only consumer of this API). I'm not even sure how that could happen, since jemalloc shouldn't be able to hand out the same address twice without a free in between (which would have to unprotect first). This particular allocation seems to have gone through Vector::convertToHeapStorage(), so it's a malloc [1], which should be the simplest case.

I don't understand :( Hopefully some other crashes will paint a clearer picture.

[1] https://dxr.mozilla.org/mozilla-central/source/mfbt/Vector.h#863
Another crash from the same user. I noticed something interesting: they seem to have a module with the filename "jemalloc_yg.dll" loaded. So maybe calls are being intercepted in some way and this module doesn't know how to deal with the new functions. Dunno if that means it's malware, a search turned up nothing.
This one might be interesting:

https://crash-stats.mozilla.com/report/index/8ef33114-a248-4beb-bbe1-4c0a02160819

A crash on OSX 10.11, trying to realloc a protected region.
From bug 1271165 comment #55, it looks like crashes are still slipping through:

https://crash-stats.mozilla.com/report/index/77a80a3f-1b23-40a8-87fd-bce472160819
https://crash-stats.mozilla.com/report/index/4b688e46-8bbd-4cd1-9c41-114ab2160820

Unless this is the runtime analysis catching some unrelated problem, it looks like our jemalloc hypothesis isn't the whole story.
Let's recap in view of the latest information:

1) We're seeing the jemalloc poison pattern turning up in large AssemblerBuffers, page-aligned but at a non-zero offset into the buffer. AssemblerBuffer uses a mozilla::Vector that never shrinks, and grows using mozilla::Vector's doubling logic. 

2) Because jemalloc is the only source of this particular poison pattern, we suspected a complicated realloc-after-free or realloc-after-realloc scenario, wherein a broken outside actor (possibly a third-party library) was shrinking our buffer *in-place*, causing only part of it to be copied over during the next doubling.

3) We instrumented jemalloc to detect unauthorized reallocs. This turned up one crash [1] on OSX, showing that the instrumentation works as intended, but this crash does not involve *in-place* reallocation, which is needed by our hypothesis. Meanwhile, our runtime analysis continued to detect other corruption [2][3], showing that no jemalloc activity is needed.

4) Since in-place realloc can't be causing the corruption, jemalloc can't be poisoning our buffer directly. Whatever is causing this must be *copying* the poison pattern into our buffer from another buffer, perhaps one that jemalloc already freed (or one that is simply uninitialized).

5) I can see only one way to catch something like that: mprotecting our buffer. Let's revisit the mechanism added by bug 1273462, and either temporarily enable it for everything (taking a small hit to benchmarks), or add some extra infrastructure to only enable it for big baseline scripts and regular expressions.

Jan, do you agree with that analysis? It might be too early to draw conclusions, but I think the crash reports paint a pretty compelling picture.

[1] https://crash-stats.mozilla.com/report/index/8ef33114-a248-4beb-bbe1-4c0a02160819
[2] https://crash-stats.mozilla.com/report/index/77a80a3f-1b23-40a8-87fd-bce472160819
[3] https://crash-stats.mozilla.com/report/index/4b688e46-8bbd-4cd1-9c41-114ab2160820
Flags: needinfo?(jdemooij)
(For anyone else reading this: the downside - aside from performance - of using this mechanism is that memory protection faults happen for various reasons, and it might be difficult to spot a spike in crashes from something so rare)
As a follow-up to that last comment: I'm not sure how to look for memory protection crashes on all operating systems.

For Windows we can just look for reason=EXCEPTION_ACCESS_VIOLATION_WRITE: https://crash-stats.mozilla.com/search/?version=51.0a1&reason=%3DEXCEPTION_ACCESS_VIOLATION_WRITE&date=>%3D2016-01-01

For Linux I think we can look for reason=SIGSEGV with a non-zero address: https://crash-stats.mozilla.com/search/?version=51.0a1&date=>%3D2016-01-01&reason=%3DSIGSEGV&address=!%3D0x0

For OSX I *think* we can use reason=EXC_BAD_ACCESS / KERN_PROTECTION_FAILURE: https://crash-stats.mozilla.com/search/?version=51.0a1&date=>%3D2016-01-01&reason=%3DEXC_BAD_ACCESS %2F KERN_PROTECTION_FAILURE

But I'm really not sure about Linux, and I'm especially unsure about OSX.
I'd rather you answered the questions from bug 1271165 comment 35, which can be summarized as: are we absolutely sure that the vector is used correctly?
Specifically:
- are we sure it's being filled as it's supposed to be filled (iow, weren't the poison bytes there in the first place and never overwritten)
- are we sure we're actually accessing the right buffer? (for instance, does the pointer we have actually match where the vector really is, aka, has the vector been reallocated away from under us at some point?)
- are we sure we're looking at the right offset and the right vector size?

I wouldn't make pages read-only without answering those questions before, because if it's the js code that's doing something funky to itself, making the pages read-only will have bad consequences for other code.
(In reply to Mike Hommey [:glandium] from comment #94)
> - are we sure it's being filled as it's supposed to be filled (iow, weren't
> the poison bytes there in the first place and never overwritten)

It's a byte Vector, the AssemblerBuffer appends x86/x64 instructions to it. We append 5 bytes for a jump instruction + its offset, and sometimes the first X of these bytes are in the crash dump (and look perfectly sane/expected) but the bytes after that are part of the 0xe5 region (the region starts there). That suggests some of the 5 bytes we wrote got overwritten somewhere.

Furthermore, the 0xe5 region is page-aligned, if the AssemblerBuffer skipped a region in the buffer, I don't see why it would always be page aligned.

> - are we sure we're actually accessing the right buffer? (for instance, does
> the pointer we have actually match where the vector really is, aka, has the
> vector been reallocated away from under us at some point?)

We get the address from the Vector itself; we're not holding onto a stale pointer. IIRC crash dumps show the Vector (on the stack) has this address as its data pointer.

> - are we sure we're looking at the right offset and the right vector size?

Yeah, the Vector's capacity + size + offset are all sane and reasonable (also when comparing them to other things on the stack).

The reports that have the "partially corrupted jumps" I mentioned earlier, also show the offset is correct: we indeed have a jump instruction at that offset. (Also: the offset < the length, and we should never have pages filled with 0xE5 bytes in the buffer, no matter what the offset is.)

> I wouldn't make pages read-only without answering those questions before,
> because if it's the js code that's doing something funky to itself, making
> the pages read-only will have bad consequences for other code.

Our fuzzers (super effective at finding bugs here) never hitting this + the crashes being completely unreproducible, suggest it's something timing/signal/threading related.
Flags: needinfo?(jdemooij)
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #92)
> (For anyone else reading this: the downside - aside from performance - of
> using this mechanism is that memory protection faults happen for various
> reasons, and it might be difficult to spot a spike in crashes from something
> so rare)

I think we can use the SEGV handler added for AsmJS to check whether the faulting address is inside one of our AssemblerBuffers and produce a specific error message if this is the case.

(In reply to Mike Hommey [:glandium] from comment #94)
> - are we sure it's being filled as it's supposed to be filled (iow, weren't
> the poison bytes there in the first place and never overwritten)

On x86/x64, we use a Vector in which we only append instructions (no more than 15 bytes at a time).  Then later we can patch back some already-written offsets within this vector.  These offsets are payloads of instructions, so we should expect to have at least one byte for the opcode and 4-8 bytes of payload.

The span of 0xe5 patterns (comment 67, comment 72) is larger than any single action performed on our AssemblerBuffer, and we have no opcode with 0xe5.

> - are we sure we're actually accessing the right buffer?

The AssemblerBuffer as well as its underlying vector have a single instance per compilation; we do not move nor share them between compilations.  The Vector instance is usually part of the heap-allocated CodeGenerator structure, or on the stack for RegExp / Baseline.
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #91)
> 2) Because jemalloc is the only source of this particular poison pattern, […]

Directly or indirectly, jemalloc is indeed the source of this pattern: the original pattern used to be 0x5a (comment 9) and changed to 0xe5 when we changed the jemalloc poison value (bug 1044077).
Crash volume for signature 'js::jit::AssemblerX86Shared::bind':
 - nightly (version 51): 7 crashes from 2016-08-01.
 - aurora  (version 50): 71 crashes from 2016-08-01.
 - beta    (version 49): 622 crashes from 2016-08-02.
 - release (version 48): 3148 crashes from 2016-07-25.
 - esr     (version 45): 2874 crashes from 2016-05-02.

Crash volume on the last weeks (Week N is from 08-22 to 08-28):
            W. N-1  W. N-2  W. N-3
 - nightly       2       0       0
 - aurora       21      21       5
 - beta        201     214      68
 - release     967     812     404
 - esr         191     210     207

Affected platforms: Windows, Mac OS X, Linux

Crash rank on the last 7 days:
           Browser   Content     Plugin
 - nightly           #83
 - aurora  #204      #23
 - beta    #84       #72
 - release #16       #29
 - esr     #42
Too late for 49 as next week is the RC build.
I don't know if this is interesting, but it looks like 20% of the crashes with this signature have "cpu_info = family 6 model 58 stepping 9 | 4 ∧ adapter_vendor_id = 0x8086 ∧ adapter_device_id = 0x0166".
Crash volume for signature 'js::jit::AssemblerX86Shared::bind':
 - nightly (version 52): 2 crashes from 2016-09-19.
 - aurora  (version 51): 8 crashes from 2016-09-19.
 - beta    (version 50): 112 crashes from 2016-09-20.
 - release (version 49): 1486 crashes from 2016-09-05.
 - esr     (version 45): 3513 crashes from 2016-06-01.

Crash volume on the last weeks (Week N is from 10-03 to 10-09):
            W. N-1  W. N-2
 - nightly       2       0
 - aurora        8       0
 - beta         94      18
 - release    1254     232
 - esr         284     291

Affected platforms: Windows, Mac OS X, Linux

Crash rank on the last 7 days:
           Browser   Content     Plugin
 - nightly           #369
 - aurora  #1508     #143
 - beta    #229      #108
 - release #61       #17
 - esr     #38
Priority: -- → P3
For anyone following along, ehoogeveen added some signal handling magic to hopefully catch the source of these crashes (bug 1305360, bug 1306972).
See Also: → 1310932
The fix for bug will be in tomorrow's Nightly, so hopefully the annotations will start to show up in crash stats!

By the way, I realized that I had my date range wrong for my crash stats searches. These crashes are definitely still happening, though only a few are slipping through the protection that PageProtectingVector offers (14 in 52.0a1 so far):
https://crash-stats.mozilla.com/search/?signature=~js%3A%3Ajit%3A%3AX86Encoding%3A%3ABaseAssembler%3A%3AnextJump&signature=~js%3A%3Ajit%3A%3AAssemblerX86Shared%3A%3Abind&build_id=%3E%3D20161001030430&product=Firefox&version=52.0a1&date=%3E%3D2016-10-01T00%3A00%3A00.000Z&date=%3C2017-09-30T09%3A15%3A00.000Z
The fix for bug 1309573, even.
The page protection finally caused some crashes, but they look pretty weird:
https://crash-stats.mozilla.com/report/index/25426d91-e3e3-452c-8df2-4e0482161201
https://crash-stats.mozilla.com/report/index/c3a15282-8276-4786-9d67-25f392161130
https://crash-stats.mozilla.com/report/index/a5a0a983-8d86-4c1d-9887-aa6cb2161201

The first two jump from je_malloc back into XUL and end up compiling a script - I don't know if that's actually possible via some interruption mechanism or if heap corruption created a path where none exists. The third is in the backtracking register allocator - is that an extremely rarely taken branch that's missing an AutoUnprotect, more heap corruption, or a possible bug? I don't know if any of them would end up generating a poison pattern, but it seems possible.

Hopefully bug 1322445 will give us more to work with (most of the crashes we're seeing seem to be from corruption in the last page of the buffer), but it'd be great if these 3 tell us something.
Flags: needinfo?(jdemooij)
Look at the raw dumps. For the first, for example, you'll see that the js::DebugEnvironments::takeFrameSnapshot frame comes from stack scanning instead of cfi, so anything below it can be wrong.
When I'm home I'll load these dumps in Visual Studio and see what it tells us.

The 3rd one, the backtracking allocator crash, had an uptime of 8 seconds, so that's a bit suspicious. Could be real though. The other 2 are in the assembler buffer code itself. Maybe we're missing an unprotect call somewhere? Fuzzing should probably have found that by now, though, and BaselineCompiler::emitOutOfLinePostBarrierSlot is used a lot.

More on the crash dumps in a few days.
Jan had a look at those crashes and we discussed them on IRC. From the dumps we couldn't see anything wrong: the existing AutoUnprotect calls were unprotecting the right page, and if the unprotect call itself had failed we would have crashed there instead. The only thing I can think of is hardware or OS-level failure, and given the low number of reports there doesn't seem much point in looking deeper.

In addition, bug 1322445 has landed now. That eliminates the AutoUnprotect calls and should tell us if something tries to write to the last page of our buffer.
Flags: needinfo?(jdemooij)
A small update on this as we get ready to branch 53: We've seen no crashes on Nightly since bug 1329499 landed, on any of the signatures.

https://crash-stats.mozilla.com/search/?moz_crash_reason=%3DMOZ_CRASH%28nextJump%20bogus%20offset%29&moz_crash_reason=%3DMOZ_CRASH%28Corrupt%20code%20buffer%29&build_id=%3E%3D20170110075905&product=Firefox&version=53.0a1&date=%3E%3D2017-01-10T00%3A00%3A00.000Z&date=%3C2018-01-09T00%3A00%3A00.000Z
https://crash-stats.mozilla.com/search/?moz_crash_reason=~About%20to%20overflow%20our%20AssemblerBuffer%20using%20infallibleAppend%21&build_id=%3E%3D20170110075905&product=Firefox&version=53.0a1&date=%3E%3D2017-01-10T00%3A00%3A00.000Z&date=%3C2018-01-09T00%3A00%3A00.000Z
https://crash-stats.mozilla.com/search/?moz_crash_reason=~Cannot%20access%20PageProtectingVector%20from%20more%20than%20one%20thread%20at%20a%20time%21&build_id=%3E%3D20170110075905&product=Firefox&version=53.0a1&date=%3E%3D2017-01-10T00%3A00%3A00.000Z&date=%3C2018-01-09T00%3A00%3A00.000Z
https://crash-stats.mozilla.com/search/?moz_crash_reason=~Caller%20is%20writing%20the%20poison%20pattern%20into%20this%20buffer%21&build_id=%3E%3D20170110075905&product=Firefox&version=53.0a1&date=%3E%3D2017-01-10T00%3A00%3A00.000Z&date=%3C2018-01-09T00%3A00%3A00.000Z
https://crash-stats.mozilla.com/search/?moz_crash_reason=~Tried%20to%20access%20a%20protected%20region%21&build_id=%3E%3D20170110075905&product=Firefox&version=53.0a1&date=%3E%3D2017-01-10T00%3A00%3A00.000Z&date=%3C2018-01-09T00%3A00%3A00.000Z

Meanwhile we *have* still been seeing crashes from Nightlies up to and including the 2017-01-09 one:

https://crash-stats.mozilla.com/search/?moz_crash_reason=%3DMOZ_CRASH%28nextJump%20bogus%20offset%29&moz_crash_reason=%3DMOZ_CRASH%28Corrupt%20code%20buffer%29&product=Firefox&version=53.0a1&date=%3E%3D2017-01-10T00%3A00%3A00.000Z&date=%3C2018-01-09T00%3A00%3A00.000Z

This suggests to me that some change in the 2017-01-10 Nightly, probably bug 1329499, worked around the crash. Either the timing changed due to the poison detection - but that seems unlikely to me since we changed the timing far more with previous experiments - or the cause has something to do with realloc, since we switched to malloc+free in the new allocation policy.

We could check this by switching back to using realloc, and make absolutely sure by copying the buffer into a new buffer and comparing the result of that with the result of calling realloc. As for what could be the cause, though.. Maybe we can spot some error in the mozjemalloc realloc logic, but that seems unlikely. I wonder if the administrative data for the allocation is getting corrupted somehow.
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #110)
> We could check this by switching back to using realloc, and make absolutely
> sure by copying the buffer into a new buffer and comparing the result of
> that with the result of calling realloc.

Yeah this is worth a try I think, to make sure the crashes didn't disappear in the meantime. Another option is to use the code we have in a few places to scan the buffer for 0xe5 poisoning and call it before and after each realloc. Then if it fails consistently after the realloc but not before, the badness (at least the part that affect us) definitely happens during realloc.
Good idea; filed bug 1332594 about doing both.

To recap, here's what we've ruled out so far:
1) An outside actor reallocating our buffer out from under us (bug 1271165)
2) An outside actor memcpying the poison pattern into our used pages (bug 1273462, bug 1305360 and bug 1309573)
3) An outside actor memcpying the poison pattern into our unused pages (bug 1322445)
4) Buffer overflows from infallibleAppend (bug 1326302)
5) Unprotected concurrent access to the AssemblerBuffer (bug 1326302)
6) One of the legitimate users of AssemblerBuffer writing the poison pattern into it (bug 1329499)
7) An outside actor writing the poison pattern into our buffer *during* realloc (bug 1329499)

Crash dumps also don't seem to show anything odd going on with the stored address, length or capacity of the vector's buffer.
Only 2 crashes so far, but it looks like this is indeed coming 'from' jemalloc: https://crash-stats.mozilla.com/search/?moz_crash_reason=~New%20buffer%20doesn%27t%20match%20the%20old%20buffer&build_id=%3E%3D20170201030207&product=Firefox&version=54.0a1&date=%3E%3D2017-02-01T00%3A00%3A00.000Z&date=%3C2018-01-31T00%3A00%3A00.000Z

Unfortunately that puts us in a tough position. jemalloc almost certainly isn't *causing* the corruption of its own data structures, so we'd have to do the same thing for jemalloc that we did for AssemblerBuffer: make chunk headers read-only and see what shows up in crash stats.

But jemalloc uses a complicated scheme to track 'runs' of free data in chunks, using some space in each chunk header as nodes for a red-black tree, which link chunks together. That means when updating the available runs, you either have to reach into the guts of the rbtree implementation and unprotect chunks whose links get updated, or unprotect all chunks in advance of any update. Even that doesn't really work, because some nodes come from 'base' allocations, which aren't part of chunks at all (and thus wouldn't be protected).

Secondly, I made a cross-platform exception handler to surface these rare memory protection crashes in crash stats, but it currently lives in SpiderMonkey. I'd either have to move it to MFBT or risk the new memory exception crashes getting lost in the noise.

I think we can work around these crashes by avoiding realloc (since we *do* know the right allocation size), but given what we know now, this issue probably affects every user of mozilla::Vector that requires larger buffers. It's possible that switching to jemalloc4 will fix these crashes, but I think it still uses a similar underlying structure.
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #113)
> That means when updating the available runs, you
> either have to reach into the guts of the rbtree implementation and
> unprotect chunks whose links get updated, or unprotect all chunks in advance
> of any update. Even that doesn't really work, because some nodes come from
> 'base' allocations, which aren't part of chunks at all (and thus wouldn't be
> protected).

Er, here I meant that reaching into the guts of the rbtree won't work; unprotecting all chunks before making any change would probably work, but also be slow. I'm not sure whether protecting chunk headers is even a good idea though, as their size might be variable and nodes might also come from runs inside each chunk. The jemalloc code is very hard to understand unfortunately.

Oh, and switching to jemalloc4 obviously wouldn't address the underlying cause here, either. The corruption might resurface in a more tractable way though.
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #113)
> Only 2 crashes so far, but it looks like this is indeed coming 'from'
> jemalloc:
> https://crash-stats.mozilla.com/search/
> ?moz_crash_reason=~New%20buffer%20doesn%27t%20match%20the%20old%20buffer&buil
> d_id=%3E%3D20170201030207&product=Firefox&version=54.0a1&date=%3E%3D2017-02-
> 01T00%3A00%3A00.000Z&date=%3C2018-01-31T00%3A00%3A00.000Z

Can you detail how you come to that conclusion from those crashes?
(In reply to Mike Hommey [:glandium] from comment #115)
> Can you detail how you come to that conclusion from those crashes?

Sure. In bug 1329499 we replaced the alloc policy used by the underlying mozilla::Vector with one that uses malloc+free instead of realloc, and the crashes went away. Then in bug 1332594, we made an alloc policy that
1) mallocs a temporary buffer and copies the old bytes into it
2) reallocs the old buffer
3) compares the contents of the temporary buffer with the realloced buffer, and crashes if they don't match.

The crashes from comment #113 are instances where these buffers don't match, which indicates that jemalloc didn't copy all the bytes from the old buffer into the new buffer. The only way that could happen is if the size stored by jemalloc doesn't match the size of the original allocation (as stored by mCapacity in the vector).
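
In pseudo-code, the checking policy from bug 1332594 boils down to something like this (a simplified sketch with made-up names; the crash message is the one that shows up in the crash reports discussed later):

#include <algorithm>  // std::min
#include <cstdlib>    // malloc, realloc, free
#include <cstring>    // memcpy, memcmp
#include "mozilla/Assertions.h"  // MOZ_CRASH

template <typename T>
T* checkedReallocSketch(T* oldBuf, size_t oldSize, size_t newSize) {
  // 1) Copy the old contents into a temporary buffer.
  T* tmp = static_cast<T*>(malloc(oldSize * sizeof(T)));
  if (!tmp)
    return nullptr;
  memcpy(tmp, oldBuf, oldSize * sizeof(T));

  // 2) Let jemalloc perform the realloc as usual.
  T* newBuf = static_cast<T*>(realloc(oldBuf, newSize * sizeof(T)));
  if (!newBuf) {
    free(tmp);
    return nullptr;
  }

  // 3) realloc must have preserved the old contents (up to the smaller of
  //    the two sizes); if it didn't, jemalloc's idea of the allocation size
  //    doesn't match what the vector asked for.
  size_t bytesToCheck = std::min(oldSize, newSize) * sizeof(T);
  if (memcmp(tmp, newBuf, bytesToCheck) != 0)
    MOZ_CRASH("maybe_pod_realloc: tmp buffer doesn't match old buffer!");

  free(tmp);
  return newBuf;
}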

In all the recent crashes we see the poisoned region starting at page 2^n - 1, the last page of the buffer before we resize it. What I suspect is happening is that something is decreasing the map bits by enough (≥4 if I'm not mistaken) to reduce the stored size by 1 page when rounding down. When grabbing the size, jemalloc masks out the flag bits [1], so it'll copy exactly a page less than the actual size of the allocation, leaving the last page uninitialized (and likely filled with poison from a previous deallocation). It then frees up the old run, leaving the last page in limbo (permanently fragmenting the chunk but otherwise doing no damage).
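
To make the rounding-down arithmetic concrete, here's a standalone illustration (the layout and flag values are assumptions for illustration, not copied from jemalloc):

#include <cstdio>

int main() {
  // Assume 4 KiB pages, and that a large run's size lives in the high bits
  // of its chunk map element with flag bits in the low bits. The exact flag
  // values below are made up.
  const size_t pagesize_mask = 0xFFF;
  const size_t mapbits = 0x10000 | 0x3;  // 64 KiB run with two low flags set

  const size_t size_good = mapbits & ~pagesize_mask;   // 0x10000: real size
  const size_t corrupted = mapbits - 4;                // stray decrement >= 4
  const size_t size_bad  = corrupted & ~pagesize_mask; // 0xf000: a page short

  printf("before: 0x%zx  after corruption: 0x%zx\n", size_good, size_bad);
  return 0;
}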

Note that per comment #112, we've ruled out pretty much everything else I can think of already. I don't know if my hypothesis is correct, but I'm pretty sure something is going wrong inside jemalloc itself.

[1] https://dxr.mozilla.org/mozilla-central/source/memory/mozjemalloc/jemalloc.c#4490
One thing we can probably do to confirm this and see how often it happens in the wild is to add a (release) assertion to arena_salloc() that checks whether the unused flag bits [1] are 0. If my hypothesis is correct and a small decrement is causing all of this havoc, some of the more significant bits should still be set.

[1] https://dxr.mozilla.org/mozilla-central/source/memory/mozjemalloc/jemalloc.c#870
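
Something along these lines (a sketch only; kUnusedMapBits is a made-up name for whatever mask of bits no CHUNK_MAP_* flag actually uses, and the real check would live inside arena_salloc()):

#include <cstddef>
#include "mozilla/Assertions.h"  // MOZ_RELEASE_ASSERT

// Sketch of the proposed check, pulled out of context: in the real patch it
// would run right after reading the map element for the allocation. If a
// small decrement corrupted the word, some of these bits should be set.
static void checkLargeMapBits(size_t mapbits) {
  const size_t kUnusedMapBits = 0x0f0;  // assumption, for illustration only
  MOZ_RELEASE_ASSERT((mapbits & kUnusedMapBits) == 0,
                     "unused chunk map bits are set; metadata corruption?");
}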
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #117)
> One thing we can probably do to confirm this and see how often it happens in
> the wild is to add a (release) assertion to arena_salloc() that checks
> whether the unused flag bits [1] are 0.

I filed bug 1336662 about doing this. The check is simple and fast, so let's try this before worrying about anything more invasive.
Has anyone tried to hit this MOZ_CRASH under rr chaos mode?
Looking at the 51.0.1 crashes with this signature [0], the platform distribution is now:

* Windows: 61.02%
* Mac: 38.81%
* Linux: 0.18%

Also, it's no longer true that most of these crashes are on YouTube. It's possible we fixed a media bug or something. There were 16 crashes on http://www.orange.fr/portail (all on Windows); stress testing that site could be useful.
If my hunch about something going wrong with jemalloc's metadata is correct, the assertion I'm proposing in bug 1336662 might make this more reproducible by making us crash earlier and in places that might not have crashed before. Patching jumps is very sensitive to corruption, but I imagine that other places use large buffers to store simple data (like images or computational results).
If it makes things more reproducible, it's good, but otoh, it's not going to do much good on its own. That is, you're just going to have another set of crashes at a different place, and not much more information wrt the actual root cause... until you can actually debug it locally. Which you don't need anything landed to actually attempt to do...
Sure, but blindly trying things hasn't yielded anything so far. Aside from the website Jan mentioned (I tried it locally with the jemalloc patch applied, but didn't hit any crashes), landing this might give us more of an idea on what to try, if only from comments in crash reports.

I'm perfectly happy to work on tracking down the root cause, but any instrumentation to actually protect these data structures is going to be invasive, probably pretty slow, and ideally we would want the exception handler that currently lives in SpiderMonkey to pick them out from the noise of other memory protection crashes.

rr doesn't work on Windows, we don't really see these crashes on Linux, and I don't have access to a Mac (does rr even work on OSX?). I believe Address Sanitizer is supposed to work with clang-cl on Windows now, but the last time I tried to get it working (for that Facebook crash) I got linker errors and had to give up. That doesn't leave me with a lot of options.
I filed bug 1341889 about adding some more checks to realloc and using malloc_usable_size to narrow down the problem a little more. We got a lot of crashes from the same person on Windows after bug 1339441 landed; it might be worth trying to reproduce using their extensions.
Do you have links to those crashes?
Isn't that a lot more crashes than before? This would suggest bug 1339441 made it slightly more likely to happen?
8 crashes in a week isn't a huge spike - we got 19 (13 + 6) crashes between February 1 and February 15, after bug 1332594 landed. The 29 crashes for that one unfortunate user are a different story though.
If only they had left an email :(
Mass wontfix for bugs affecting firefox 52.
[@ js::ProtectedReallocPolicy::maybe_pod_realloc<T>] is the #10 top signature for Nightly 20170507030205 on Windows, with 15 reports from 2 installations. The moz crash reasons are:

1 MOZ_CRASH(maybe_pod_realloc: tmp buffer doesn't match old buffer!)          10  66.67 %
2 maybe_pod_realloc: buffers don't match (1048576 >= 1048576, 524288, false)!  2  13.33 %
3 maybe_pod_realloc: buffers don't match (4096 >= 4096, 2048, false)!          2  13.33 %
4 maybe_pod_realloc: buffers don't match (2097152 >= 2097152, 1048576, false)! 1   6.67 %
Crash Signature: [@ js::jit::AssemblerX86Shared::bind(js::jit::Label*)] [@ js::jit::AssemblerX86Shared::bind] → [@ js::jit::AssemblerX86Shared::bind(js::jit::Label*)] [@ js::jit::AssemblerX86Shared::bind] [@ js::ProtectedReallocPolicy::maybe_pod_realloc<T>]
This spiked recently for some reason; perhaps some user switched from Dev Edition to Nightly, or upgraded their RAM. Either way, the only conclusion we can draw from these crashes is that their hardware is bad: we're seeing a ~50/50 split between the temporary buffer being corrupted and the reallocation itself being corrupted.

As a result I think we can safely ignore these crashes on Windows. There are more of them than on other platforms, but that's probably just a consequence of the larger audience. This is one case where some sort of runtime memtest that sets a 'user probably has bad hardware' flag in the crash report would be useful, to let us ignore these reports.

For OSX [1] there's still no indication of bad hardware, but the crash rate is so low that it's hard to say anything conclusive. It might be worth looking at a disassembly of jemalloc's realloc function on OSX to see if there's a miscompilation somewhere.

[1] https://crash-stats.mozilla.com/search/?build_id=%3E%3D20170413030227&moz_crash_reason=~maybe_pod_realloc&moz_crash_reason=~free_&moz_crash_reason=~uintptr_t%28p%29%20%3D%3D%20currAddr&moz_crash_reason=~%21currSize%20%26%26%20%21currAddr&moz_crash_reason=~Could%20not%20confirm%20the%20presence%20of%20poison%21&product=Firefox&version=55.0a1&platform=Mac%20OS%20X&date=%3E%3D2017-04-13T00%3A00%3A00.000Z&date=%3C2018-04-12T00%3A00%3A00.000Z&_sort=-date&_facets=signature&_facets=moz_crash_reason&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=moz_crash_reason#facet-signature
So, for what it's worth, my wife just got this crash on her mac. Report: https://crash-stats.mozilla.com/report/index/a5a6a43d-0e88-4104-8429-c3e571170510

I'm not entirely sure what happened, but the version that caused it is gone from her machine. As a matter of fact, after submitting the crash, the browser restarted into 53.0.2 while the crash happened on 52.0.2. And the crash report says she was upgraded to 52.0.2 an hour prior to the crash.

But anyways, the point is I have direct physical access to a machine that had the crash, so I can run hardware diagnostics on the machine (but I don't know what to use on OSX for these things).
Flags: needinfo?(emanuel.hoogeveen)
Sorry about the delay, I've been in the middle of a big move so things have been kind of crazy. That sounds very odd; has she experienced any crashes since? Unfortunately I don't have an OSX machine either, so I don't know what diagnostics you could run there.

We still don't have much in the way of statistics for OSX on Nightly, but on Windows we're getting something like a 50/50 split between corruption in the temporary buffer and corruption in the new buffer, which means we're probably looking at hardware failure.
Flags: needinfo?(emanuel.hoogeveen)
One thing I notice in our jemalloc is https://searchfox.org/mozilla-central/rev/15ce5cb2db0c85abbabe39a962b0e697c9ef098f/memory/build/mozjemalloc.cpp#1646-1659 .

Under OSX, when dealing with chunks larger than 128kB, we use |vm_copy| instead of the |memcpy| we use in every other scenario. While there is no obvious reason to doubt that syscall, it does introduce different performance behavior, since it sets up copy-on-write pages instead of copying immediately. This could vary the timing enough to make an existing problem appear more often only on OSX.
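
For context, the copy path in question looks roughly like this (a paraphrase, not a verbatim copy of the linked code; VM_COPY_MIN is the ~128kB threshold mentioned above):

#include <cstddef>
#include <cstdint>
#include <cstring>      // memcpy
#ifdef __APPLE__
#include <mach/mach.h>  // vm_copy, mach_task_self
#endif

// On Darwin, copies of at least vmCopyMin bytes go through vm_copy, which
// maps the source pages copy-on-write; everything else uses plain memcpy.
static void copyForRealloc(void* dest, const void* src, size_t size,
                           size_t vmCopyMin) {
#ifdef __APPLE__
  if (size >= vmCopyMin) {
    vm_copy(mach_task_self(), (vm_address_t)(uintptr_t)src, (vm_size_t)size,
            (vm_address_t)(uintptr_t)dest);
    return;
  }
#else
  (void)vmCopyMin;  // the threshold only matters on Darwin
#endif
  memcpy(dest, src, size);
}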
:O

It turns out that realloc can cause the |vm_copy| code to be called with a 'size' parameter that isn't a multiple of the page size. This is clearly not acceptable for page map operations. We don't check vm_copy's return code either. I'm not sure what happens when you violate that, but it can't be good.

Now, in practice it seems we don't hit this ugly size parameter case, but all it needs is one realloc user to do it once before things go off the rails.

> printf("!!! DEBUGGING REALLOC !!!\n");
> void* x = malloc(1024*1024); 
> void* y = malloc(1024*1024);
> printf("%p %p\n", x, y);
> x = realloc(x, 128*1024 + 16);
> printf("%p %p\n", x, y);
> free(x);
> free(y);
> printf("!!! DONE DEBUGGING REALLOC !!!!\n");

This example trips my assertion that the size passed to vm_copy is a multiple of the page size. I don't currently have an OSX build environment, so I can't determine how exactly the real syscall fails.
Great find!
So I did a test on the latest Xcode and OSX 10.12, and |vm_copy| actually handles the size correctly. This could be an updated usermode library that just uses memcpy in that case. It's possible our release build sees something different, or this could just be chasing shadows.
Doing some digging through older XNU sources, it appears that vm_copy has supported this non-page-aligned case since before 10.0. We should probably still clean up and check the return value, but it now seems less likely to be the source of the problem. Ah, well.
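
The cleanup being suggested would just stop ignoring vm_copy's result, something like this (a sketch, assuming we'd rather crash loudly than continue after a failed or partial copy):

#include <cstddef>
#include <cstdint>
#include <cstring>      // memcpy
#ifdef __APPLE__
#include <mach/mach.h>  // vm_copy, mach_task_self, KERN_SUCCESS
#endif
#include "mozilla/Assertions.h"  // MOZ_CRASH

static void checkedCopy(void* dest, const void* src, size_t size) {
#ifdef __APPLE__
  kern_return_t kr = vm_copy(mach_task_self(), (vm_address_t)(uintptr_t)src,
                             (vm_size_t)size, (vm_address_t)(uintptr_t)dest);
  if (kr != KERN_SUCCESS)
    MOZ_CRASH("vm_copy failed");
#else
  memcpy(dest, src, size);
#endif
}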
Assignee: jdemooij → nobody
Status: ASSIGNED → NEW
Whiteboard: [#jsapi:crashes-retriage]
Looking at the state of this problem today: half the current crashes are from FF48, which is apparently the last supported Firefox for ancient versions of OSX [1].

If I filter to look at BuildIDs from 2018, I still see OSX having more than its share of crashes.

[1] https://support.mozilla.org/en-US/questions/1200697
Major drop in crash rate ~August 26, about 1/4 what it was 4 months ago. I don't see why the big drop - still ~50% of crashes are version 45
https://crash-stats.mozilla.com/signature/?signature=js%3A%3Ajit%3A%3AAssemblerX86Shared%3A%3Abind&date=%3E%3D2018-08-06T05%3A08%3A49.000Z&date=%3C2018-09-06T05%3A08%3A49.000Z#summary
Flags: needinfo?(jdemooij)

(In reply to Wayne Mery (:wsmwk) from comment #145)

> Major drop in crash rate ~August 26, about 1/4 what it was 4 months ago. I
> don't see why the big drop - still ~50% of crashes are version 45
> https://crash-stats.mozilla.com/signature/?signature=js%3A%3Ajit%3A%3AAssemblerX86Shared%3A%3Abind&date=%3E%3D2018-08-06T05%3A08%3A49.000Z&date=%3C2018-09-06T05%3A08%3A49.000Z#summary

and also for https://crash-stats.mozilla.com/signature/?signature=js%3A%3Ajit%3A%3AAssemblerX86Shared%3A%3AexecutableCopy&product=Firefox&platform=Mac%20OS%20X&date=%3E%3D2018-08-02T22%3A36%3A44.000Z&date=%3C2019-02-02T21%3A36%3A44.000Z&_sort=-date#graphs

Keywords: topcrash-mac

The leave-open keyword is there and there is no activity for 6 months.
:sdetar, maybe it's time to close this bug?

Flags: needinfo?(sdetar)

We should still leave this open longer; there are still crashes occurring.

Flags: needinfo?(sdetar)

The leave-open keyword is there and there is no activity for 6 months.
:sdetar, maybe it's time to close this bug?

Flags: needinfo?(sdetar)

From what I understand, we tried to fix this issue but did not find any angle from which we could make more progress.
Therefore, I am going to mark it as stalled.

(In reply to Release mgmt bot [:sylvestre / :calixte / :marco for bugbug] from comment #149)

> The leave-open keyword is there and there is no activity for 6 months.
> :sdetar, maybe it's time to close this bug?

The leave-open keyword makes sense, as the patches that landed with this bug number were dedicated only to the investigation work, which did not lead to any actionable items.

Ted, do you confirm? Or is there more to be done on this issue?

Flags: needinfo?(sdetar) → needinfo?(tcampbell)
Keywords: stalled
Severity: critical → S2

Since the crash volume is low (less than 5 per week), the severity is downgraded to S3. Feel free to change it back if you think the bug is still critical.

For more information, please visit auto_nag documentation.

Severity: S2 → S3
Crash Signature: [@ js::jit::AssemblerX86Shared::bind(js::jit::Label*)] [@ js::jit::AssemblerX86Shared::bind] [@ js::ProtectedReallocPolicy::maybe_pod_realloc<T>] → [@ js::jit::AssemblerX86Shared::bind] [@ js::jit::AssemblerX86Shared::bind] [@ js::ProtectedReallocPolicy::maybe_pod_realloc<T>]