Open Bug 1124397 Opened 9 years ago Updated 1 year ago

crash in js::jit::AssemblerX86Shared::bind(js::jit::Label*)

Categories

(Core :: JavaScript Engine: JIT, defect, P3)

Version: 39 Branch
Hardware: All
OS: macOS
Type: defect

Tracking


Tracking Status
firefox38.0.5 --- wontfix
firefox39 --- wontfix
firefox40 --- wontfix
firefox41 --- wontfix
firefox47 --- wontfix
firefox48 + wontfix
firefox49 + wontfix
firefox-esr45 --- wontfix
firefox50 --- wontfix
firefox51 --- wontfix
firefox52 --- wontfix

People

(Reporter: automatedtester, Unassigned, NeedInfo)

References

(Depends on 1 open bug)

Details

(4 keywords, Whiteboard: [#jsapi:crashes-retriage])

Attachments

(3 files, 7 obsolete files)

This bug was filed from the Socorro interface and is report bp-89f2ccb8-28f9-4b5c-8db2-e34a82150121.
=============================================================

I was using Hello for a 1:1 and updating Workday with comments when the browser crashed.
Looks like a crash in irregexp?
Flags: needinfo?(bhackett1024)
Version: 36 Branch → 37 Branch
Is this reproducible?
Flags: needinfo?(bhackett1024)
I just tried the Workday part and that didn't cause the crash. If I knew exactly which tab caused the issue (of my 100+ open) I could help narrow it down for you.
Also experienced this crash.

I was scrolling downwards on the page http://en.wikipedia.org/wiki/User:Hosmich/Twin_flags with about a dozen other tabs in my space open, and the browser crashed.

Crash ID 685d07a6-4a4e-494e-b546-82e182150404
https://crash-stats.mozilla.com/report/index/685d07a6-4a4e-494e-b546-82e182150404

Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:40.0) Gecko/20100101 Firefox/40.0 ID:20150403030204 CSet: 70a113676b21
Hit this searching on a Google map on Nightly: it panned across to the location and then blew up.
Have a user with this crash at https://support.mozilla.org/en-US/questions/1066578
Version: 37 Branch → 39 Branch
I hit this on dev edition a couple of days ago: https://crash-stats.mozilla.com/report/index/6551fdca-d3ac-4c18-9b25-96d0f2150626.  Landed on a page, about 10 seconds later, without user activity, crash.
Early 2013 Retina Mac, 10.9.5.
This is easily the #1 Mac topcrasher on the 38, 39 and 40 branches, and in the top 10 in the 41 and 42 branches.  It also happens on Windows, but at lower volume.

By looking at the assembly code where these crashes happen, and at the Mac and Windows crash stacks on Socorro, I've figured out the top few lines of the "true" crash stack (in current trunk code):

https://hg.mozilla.org/mozilla-central/annotate/eee2d49d055c/js/src/jit/x86-shared/Patching-x86-shared.h#l36
https://hg.mozilla.org/mozilla-central/annotate/eee2d49d055c/js/src/jit/x86-shared/BaseAssembler-x86-shared.h#l3835
https://hg.mozilla.org/mozilla-central/annotate/eee2d49d055c/js/src/jit/x86-shared/Assembler-x86-shared.h#l909
https://hg.mozilla.org/mozilla-central/annotate/eee2d49d055c/js/src/irregexp/NativeRegExpMacroAssembler.cpp#l388

In the raw dumps of these crashes on OS X, r14 is set to 0x5a5a5a5a.  I *think* this means that, at line 2 above, from.offset_ is set to 0x5a5a5a5a, as (ultimately) is NativeRegExpMacroAssembler::stack_overflow_label_::offset_ (at line 4).  0x5a5a5a5a and 0x5a5a5a5a5a5a5a5a are values used by jemalloc to poison freed memory.  I *think* the only way NativeRegExpMacroAssembler::stack_overflow_label_::offset_ can end up set to 0x5a5a5a5a at line 4 is if the NativeRegExpMacroAssembler object (from which masm.bind(&stack_overflow_label_) is called) has itself become invalid, probably only a few lines above the call to masm.bind().

I'm totally unfamiliar with this code, and it's horribly complex.  I won't be able to continue my analysis on my own.  If someone else can, please feel free to do so.  I'll also try to find one or more likely candidates to needinfo.
Keywords: topcrash-mac
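To illustrate the poisoning described in the analysis above (a hedged sketch, not jemalloc's actual code): freeing memory fills it with a recognizable byte pattern, so a stale pointer into that memory reads back 0x5a5a5a5a in its integer fields.

  #include <cstdint>
  #include <cstdlib>
  #include <cstring>

  struct Label {
      int32_t offset_;   // stand-in for js::jit::Label::offset_
  };

  // Simplified poison-on-free, assuming a 0x5a poison byte as in the builds
  // discussed above.
  static void PoisoningFree(void* ptr, size_t size) {
      memset(ptr, 0x5a, size);  // poison before releasing, like jemalloc does
      free(ptr);
  }

  int main() {
      Label* label = static_cast<Label*>(malloc(sizeof(Label)));
      label->offset_ = 42;
      PoisoningFree(label, sizeof(Label));
      // A caller that still held `label` and called masm.bind(label) would now
      // read offset_ == 0x5a5a5a5a -- the value observed in r14 in these dumps.
      return 0;
  }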
Brian, do you have any idea what's going on here?  Might my analysis from comment #9 be correct?  If so, do you have any idea how the NativeRegExpMacroAssembler object might have deleted itself during a call to NativeRegExpMacroAssembler::GenerateCode()?
Flags: needinfo?(bhackett1024)
Steven, thank you for the analysis.

I looked at some crash stacks and it's not just NativeRegExpMacroAssembler; other random masm.bind() calls as part of Baseline and Ion compilation are affected as well. Especially some of the bind() calls in Baseline code are extremely hot and well tested.

It happens mostly on Mac but also on Windows, so a compiler bug is unlikely. My best guess is the assembler backend somewhere misbehaves when we have a certain memory allocation or address(es). That'd explain the mostly-Mac part...

I'll try to dig deeper in a bit.
Flags: needinfo?(jdemooij)
Depends on: 1187323
Looking at the crash addresses, I noticed that we have a lot of addresses ending with "0xa56".  The fact that a lot of addresses end the same way /often/ means that we have a structure which is large enough to be aligned, and that the bad code is looking at a field inside this structure.

So far, the only one I can think of might be the BaselineCompiler structure.  This would imply that the BaselineCompiler functions are being called with a bad "this", for failures within calls like:

  masm.bind(&postBarrierSlot_);
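To illustrate the alignment argument with made-up numbers (nothing below is from the actual crash data beyond the 0xa56 suffix): if a large structure is always allocated at a fixed alignment and the bad access is always to the same field, the faulting address always has the same low bits.

  #include <cinttypes>
  #include <cstdint>
  #include <cstdio>

  int main() {
      const uintptr_t alignment = 0x1000;   // hypothetical allocation alignment
      const uintptr_t fieldOffset = 0xa56;  // hypothetical offset of the field
      for (uintptr_t base = 0x100000; base < 0x104000; base += alignment) {
          uintptr_t addr = base + fieldOffset;
          printf("0x%03" PRIxPTR "\n", addr & 0xfff);  // prints 0xa56 every time
      }
      return 0;
  }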
I now see that the variety of crash stacks is much larger than I realized.  But (from looking at a few more) it may still be true that all of them on the Mac have this in common:

They're all calls to js::jit::AssemblerX86Shared::bind(js::jit::Label*) on a Label that's been deleted -- whose offset_ variable == 0x5a5a5a5a.
(In reply to Steven Michaud [:smichaud] from comment #13)
> They're all calls to js::jit::AssemblerX86Shared::bind(js::jit::Label*) on a
> Label that's been deleted -- whose offset_ variable == 0x5a5a5a5a.

Ok, the code pointer is aligned, and the offset is poisoned, but how did we manage to get a label allocated with the SystemAllocPolicy in the first place?
> Ok, the code pointer is aligned, and the offset is poisoned, but how
> did we manage to get a label allocated with the SystemAllocPolicy
> in the first place?

I have no idea.  I don't know this code at all :-(

Another possibility, I suppose, is that the Label object is still valid
but was bound() to an invalid offset (== 0x5a5a5a5a).  But it seems
like AssemblerX86Shared::bind() expects objects that haven't yet been
bound.
I just noticed that some of the Windows raw dumps have esi == 0x5a5a5a5a.  Someone who has a good Windows disassembler might want to look at the assembly code for __ZN2js3jit18AssemblerX86Shared4bindEPNS0_5LabelE.
I can't get this out of my head, so I've continued to dig away at it.

Above I was pretty sure that js::jit::AssemblerX86Shared::bind(js::jit::Label*) is being called with a Label whose offset_ variable == 0x5a5a5a5a.  Now I've proved to myself this can't be true.  (I used an interpose library and further analysis of the assembly code for AssemblerX86Shared::bind().  I'll say more if people feel the need.)

Now I'm pretty sure GetInt32() is returning an offset == 0x5a5a5a5a, here:

https://hg.mozilla.org/mozilla-central/annotate/eee2d49d055c/js/src/jit/x86-shared/BaseAssembler-x86-shared.h#l3835

As an experiment, we might want to add debugging code here that will log some kind of error message if offset == 0x5a5a5a5a.
> Now I'm pretty sure GetInt32() is returning an offset == 0x5a5a5a5a

GetInt32() is called repeatedly in a loop from js::jit::AssemblerX86Shared::bind(js::jit::Label*), via the call to masm.nextJump().  The crashes happen, at GetInt32(), the next time through the loop after GetInt32() returns 0x5a5a5a5a.
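For readers unfamiliar with this code, here is a hedged, stripped-down model of the pending-jump chain (not the actual SpiderMonkey assembler). While a Label is unbound, the rel32 slot of each jump targeting it stores the buffer offset of the previous pending jump, with -1 terminating the chain; bind() walks that chain via the equivalent of nextJump()/GetInt32() and patches each slot, and a poisoned value read out of the buffer derails the walk.

  #include <cassert>
  #include <cstdint>
  #include <cstring>
  #include <vector>

  static int32_t GetInt32(const uint8_t* where) {
      int32_t value;
      memcpy(&value, where, sizeof(value));
      return value;
  }

  static void SetInt32(uint8_t* where, int32_t value) {
      memcpy(where, &value, sizeof(value));
  }

  struct Label {
      int32_t lastJumpEnd = -1;  // offset just past the last pending jump, or -1
  };

  // Emit "JMP rel32" to an unbound label: the rel32 slot temporarily stores
  // the link to the previous pending jump instead of a real displacement.
  static void EmitJumpTo(std::vector<uint8_t>& code, Label& label) {
      code.push_back(0xE9);                      // JMP rel32 opcode
      size_t slot = code.size();
      code.resize(code.size() + 4);
      SetInt32(&code[slot], label.lastJumpEnd);  // thread onto the chain
      label.lastJumpEnd = int32_t(code.size());
  }

  // Bind the label to the current position: walk the chain and patch each slot.
  static void Bind(std::vector<uint8_t>& code, Label& label) {
      int32_t target = int32_t(code.size());
      int32_t jumpEnd = label.lastJumpEnd;
      while (jumpEnd != -1) {
          // The crashes discussed here fire in this loop: if the buffer has
          // been overwritten with poison, the link read below is 0x5a5a5a5a /
          // 0xe5e5e5e5, which the release assert from bug 1187323 now catches.
          assert(jumpEnd >= 5 && size_t(jumpEnd) <= code.size());
          int32_t previous = GetInt32(&code[jumpEnd - 4]);
          SetInt32(&code[jumpEnd - 4], target - jumpEnd);  // real displacement
          jumpEnd = previous;
      }
      label.lastJumpEnd = target;  // label is now bound
  }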
(In reply to Steven Michaud [:smichaud] from comment #17)
> Now I'm pretty sure GetInt32() is returning an offset == 0x5a5a5a5a, here:
> 
> https://hg.mozilla.org/mozilla-central/annotate/eee2d49d055c/js/src/jit/x86-
> shared/BaseAssembler-x86-shared.h#l3835
> 
> As an experiment, we might want to add debugging code here that will log
> some kind of error message if offset == 0x5a5a5a5a.

If that's right, the asserts I added in bug 1187323 should catch this... Maybe the fuzzers will find something once it's on m-c. I'll get back to this soon.
(In reply to Steven Michaud [:smichaud] from comment #18)
> > Now I'm pretty sure GetInt32() is returning an offset == 0x5a5a5a5a
> 
> GetInt32() is called repeatedly in a loop from
> js::jit::AssemblerX86Shared::bind(js::jit::Label*), via the call to
> masm.nextJump().  The crashes happen, at GetInt32(), the next time through
> the loop after GetInt32() returns 0x5a5a5a5a.

I see only 2 options then: either we have an offset which targets something outside the code (which bug 1187323 asserts against), or we have a compiler error which incorrectly aliases the code pointer to a reallocated memory area.
I'll make the asserts I added in bug 1187323 release asserts, to verify comment 18. Let's see what that tells us.
Flags: needinfo?(jdemooij)
Best of all would be to get that information into Socorro crash reports.  I know how to do that (I've done it before).  Give me a day or two to write a patch.
(In reply to Steven Michaud [:smichaud] from comment #10)
> Brian, do you have any idea what's going on here?  Might my analysis from
> comment #9 be correct?  If so, do you have any idea how the
> NativeRegExpMacroAssembler object might have deleted itself during a call to
> NativeRegExpMacroAssembler::GenerateCode()?

Canceling needinfo since comment 11 indicates this isn't irregexp-specific.  NativeRegExpMacroAssembler is stack allocated and won't delete itself.
Flags: needinfo?(bhackett1024)
(Following up comment #22)

This is *much* harder than I anticipated -- as it currently stands, none of the JS code supports adding annotations to crash logs.  I *think* I've solved all the major problems, but now I'm stuck at what seems to be a bug in one of our Python build scripts (code with which I'm completely unfamiliar).

I suspect we're not going to be able to make progress until I do what I described in comment #22 ... but I'm not sure how long it will take.
I got around the seeming bug in one of our Python build scripts -- I was taking the wrong approach anyway.  But this is still horribly complicated, and I haven't yet finished the work.

Right now I have a patch that builds locally and works just fine (on OS X).  But it won't build on the tryservers on either OS X or Windows, and I haven't yet sacrificed enough chickens to make the problem(s) go away.

For now I've run out of time to spend on this bug, and need to put it aside for awhile.  I hope I can come back to it in the not too distant future.  I suspect the only way we'll be able to learn enough to fix this bug is via crash log annotations.
I sacrificed another barnful of chickens and got this building on the tryservers.  I also tested that it works as expected on OS X and Windows.  But when I ran an all-platform set of tryserver builds last night, there were an unusual number of test failures (though all seem unrelated):

https://treeherder.mozilla.org/#/jobs?repo=try&revision=3ef9af95205c

So I've decided to do another all-platform run, and see what happens:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=0d33ce1a9f00
Attachment #8644390 - Attachment is obsolete: true
Comment on attachment 8651068 [details] [diff] [review]
Crash log annotation patch, WIP: now builds on tryservers

This patch has three logical parts:

1) What's needed for non-xul components to annotate crash reports.
2) What's needed for libjs to annotate crash reports.
3) What's needed to figure out bug 1124397 (this bug).

The last part is temporary -- it can come out once we've fixed this bug.  But I think it'd be very convenient if the other two parts could be made permanent.  I'll open a bug for that and report its number here.
I've opened bug 1197259.
See Also: → 1197259
(Following up comment #26)

There are a lot of failures in both sets of tests.  Most of them are frankly inexplicable, but both sets do include some "jit" test failures.  I suspect I need to do a better job of dealing with the case of XUL *not* being loaded (for example when running in the 'js' and 'jsapi-tests' binaries).

New patch coming up.
Hm is it possible this signature disappeared on Nightly?

I wonder if the release asserts I added in bug 1187323 have something to do with that, it landed last week and there haven't been any crashes on Mac Nightly since then.
(Following up comment #31)

One thing *has* changed, though (since the patch for bug 1187323 landed):  These crashes now happen here, at the MOZ_RELEASE_ASSERT() you added to BaseAssembler::nextJump():

https://hg.mozilla.org/mozilla-central/annotate/23a04f9a321c/js/src/jit/x86-shared/BaseAssembler-x86-shared.h#l4156

I don't yet understand why (I haven't yet looked at the assembly code for these new builds).
(Following up comment #32)

OK, now I get it ... or at least I think I do:

MOZ_RELEASE_ASSERT() crashes if the condition is false.

Presumably that's because offset == 0x5a5a5a5a5a5a5a5a.  But we already knew this.  My crashlog annotation patch provides much more information, and would presumably be more useful.

I assume MOZ_RELEASE_ASSERT() just writes its output to stdout -- so nobody will see it except the people who crash (and know where to look).
My new patch's stub library only checks once for the XUL symbols it needs (using dlsym() or its equivalent).  This substantially speeds up its performance when XUL isn't (and isn't going to be) loaded -- for example in the js and jsapi-test binaries.

It also gets rid of the jit test failures.

But there are still a bunch of other failures that I haven't yet managed to explain or dismiss as irrelevant:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=30c4647a7364
https://treeherder.mozilla.org/#/jobs?repo=try&revision=e7a2ea1a1960
Attachment #8651068 - Attachment is obsolete: true
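For context, the one-time symbol lookup described above could look roughly like this (a hedged sketch; the symbol name and signature are hypothetical, not taken from the actual patch):

  #include <dlfcn.h>
  #include <mutex>

  // Hypothetical annotation entry point exported by XUL.
  typedef void (*AnnotateFn)(const char* key, const char* value);

  static AnnotateFn GetAnnotateFn() {
      static std::once_flag sOnce;
      static AnnotateFn sFn = nullptr;
      std::call_once(sOnce, [] {
          // Resolved exactly once. If XUL isn't loaded (js shell, jsapi-tests),
          // the null result is cached and later calls stay cheap.
          if (void* self = dlopen(nullptr, RTLD_LAZY))
              sFn = reinterpret_cast<AnnotateFn>(
                  dlsym(self, "MozCrashReporterAnnotate"));
      });
      return sFn;
  }

  static void MaybeAnnotate(const char* key, const char* value) {
      if (AnnotateFn fn = GetAnnotateFn())
          fn(key, value);
  }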
(Following up comment #33)

For what it's worth (and after looking at the new builds' assembly code), the exact location of the current crashes in our code is here:

https://hg.mozilla.org/mozilla-central/annotate/c46370eea81a/mfbt/Assertions.h#l218
(Following up comment #34)

I think I now understand the remaining test failures, which are mainly on e10s:  AnnotateCrashReport() and friends may only be called on the main thread in a content process.

I need to come up with another patch that deals with this ... one way or another.
(Following up comment #37)

That change didn't get rid of the test failures, and neither does my present revision.  We may have to disable certain tests while my debug logging patch is in the tree, or just avoid the failures by only compiling the code on certain platforms (like OS X and Windows).

As I mentioned above in comment #23, the part of my patch specific to this bug can come out when we've fixed it.

My present revision simplifies the loading of XUL pointers in the stub library, and (I hope) guarantees that this will happen before we've imposed a sandbox on the relevant process (content or plugin).
Attachment #8652508 - Attachment is obsolete: true
Attached patch Crash log annotation patch v1.0 (obsolete) — Splinter Review
This patch avoids the test failures by not compiling the bug 1124397-specific annotations on DEBUG builds.  These annotations also only compile on OS X and Windows -- the two platforms where we've seen this bug's crashes.

The test failures themselves offer no clue as to why they're happening.  The calls to AnnotateCrashReport() and RemoveCrashReportAnnotation() are rather expensive, and I suspect this has something to do with it.  AssemblerX86Shared::bind() seems to be called rather often (though not nearly as often as BaseAssembler::nextJump()).

There's little we can do about this.  In any case, this code (and its possible future evolutions) only has to stay in the tree long enough to allow us to decipher this bug's crashes.
Attachment #8653121 - Attachment is obsolete: true
(Following up comment #27)

My current plan is to first get my patches for bug 1197259 (parts 1 and 2) into the tree, then post another patch here (for review) that just contains part 3.
Blocks: 1197220
Crash Signature: [@ js::jit::AssemblerX86Shared::bind(js::jit::Label*)] → [@ js::jit::AssemblerX86Shared::bind(js::jit::Label*)] [@ js::jit::AssemblerX86Shared::bind]
This one still baffles me. The data we have:

(1) We fail the MOZ_RELEASE_ASSERT(size_t(offset) < size()) in nextJump. In other words, we're reading some offset out of the code buffer and that offset happens to be bogus.

(2) It mostly affects OS X users (94% of the crashes), but there are also some Windows and Linux crashes. It's the #1 crash on OS X.

(3) I looked at one of the Windows minidumps and (if the debugger is not lying), it's not an OOM and also not an unusually large buffer.

(4) The fuzzers never hit this.

Especially (2) is really weird; the x86/x64 assembler buffer code should behave exactly the same on all platforms. Maybe some other function or thread is corrupting our code buffer.

I'll poke at some other Windows minidumps. After that we can try to get some more data into the crash reports...
Flags: needinfo?(jdemooij)
Depends on: 1260660
Attached patch Diagnostic patch (obsolete) — Splinter Review
I've been staring at crash dumps for a while but nothing stands out - it's super mysterious.

Here's a patch to stash some data on the stack before we crash, so we can retrieve it from the minidumps (stack memory is included in crash reports).

The |volatile| is there to make sure the compiler doesn't do anything fancy with our data.
Attachment #8654189 - Attachment is obsolete: true
Attachment #8736827 - Flags: review?(efaustbmo)
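The essence of the diagnostic patch above, as a hedged sketch (not the patch itself; the exact values it records are visible in the review quote below): stash a marker and a few values in a volatile array on the stack just before crashing, so they land in the minidump's stack memory and can be read back out of the raw dump on Socorro.

  #include <cstdint>
  #include <cstdlib>

  static void CrashWithBlackbox(uintptr_t offset, uintptr_t bufferSize,
                                uintptr_t fromOffset) {
      volatile uintptr_t blackbox[4];
      blackbox[0] = uintptr_t(0xABCD1234);  // marker to search for in the dump
      blackbox[1] = offset;
      blackbox[2] = bufferSize;
      blackbox[3] = fromOffset;
      abort();  // MOZ_CRASH in the actual tree
  }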
Comment on attachment 8736827 [details] [diff] [review]
Diagnostic patch

Review of attachment 8736827 [details] [diff] [review]:
-----------------------------------------------------------------

r=me

::: js/src/jit/x86-shared/BaseAssembler-x86-shared.h
@@ +3412,5 @@
> +            blackbox[0] = uintptr_t(0xABCD1234);
> +            blackbox[1] = uintptr_t(offset);
> +            blackbox[2] = uintptr_t(size());
> +            blackbox[3] = uintptr_t(from.offset());
> +            blackbox[4] = uintptr_t(code[from.offset() - 1]);

These are wrong, as discussed on IRC.
Attachment #8736827 - Flags: review?(efaustbmo) → review+
Attached patch Diagnostic patchSplinter Review
Assignee: nobody → jdemooij
Attachment #8736827 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Flags: needinfo?(jdemooij)
Attachment #8736856 - Flags: review+
I'm currently waiting for Nightly crashes with the extra crash instrumentation. On Nightly there have been ~1-2 crashes a day for the past week so I'm expecting a crash report today or tomorrow... That's assuming the patch did not somehow make the problem disappear.

These crashes are extremely weird. It's not an OOM situation. Most (not all) of these crashes are after we emit quite a lot of code though (say 40-200 KB).

One possibility is that some other thread ends up corrupting our memory. That would explain both why it's so hard to reproduce and why the platform distribution is so unusual, but I'm not really convinced because it leaves some other questions unanswered.

I've analyzed and fixed a lot of top crashes but this is the weirdest and most challenging one so far.
Keywords: steps-wanted
FWIW, this is the #1 top crash on Mac 45.0.1 with 11% of all Mac crashes.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #52)
> FWIW, this is the #1 top crash on Mac 45.0.1 with 11% of all Mac crashes.

Looking at the build IDs [1], and sorting by build id, highlights that most of the issues we got are coming from build 20160315153207 (72%).  Does this non-uniformity correspond to the way updates are shipped to our users?  If not, can we dig into this version and 20160316065941 (0.29%), and do a binary diff?

If build-ids have a uniform user base, then such spikes could be explained by an intermittent compiler error.

If build-ids have a uniform user base, then this might be caused by a popular website on 2016-03-15: they might have pushed some new code which triggered the error and later fixed it.  Maybe we can figure that out based on the comments, and ask these website developers to tell us what the difference was.
(In reply to Nicolas B. Pierron [:nbp] from comment #53)
> (In reply to Robert Kaiser (:kairo@mozilla.com) from comment #52)
> > FWIW, this is the #1 top crash on Mac 45.0.1 with 11% of all Mac crashes.
> 
> Looking at the build ID [1]

[1] https://crash-stats.mozilla.com/signature/?signature=js%3A%3Ajit%3A%3AAssemblerX86Shared%3A%3Abind#aggregations
Maybe we can figure that out with these versions which are from the same channel:

20160316065941   46.0b2    25   0.29%   ( 3.5 / day)
?                46.0b3     0      0%
20160322075646   46.0b4    75   0.86%   (37.5 / day)
20160324011246   46.0b5   321   3.67%   (45.8 / day)
?                46.0b6     0      0%
20160401021843   46.0b7    50   0.57%
KaiRo,

Can you help us normalize the number of users per build-id, such that we can make sure that comment 53 and comment 55 are not biased by a different population size?
Flags: needinfo?(kairo)
(In reply to Nicolas B. Pierron [:nbp] from comment #55)
> Maybe we can figure that out with these versions which are from the same
> channel:

We had some betas this cycle that were never released due to various issues, e.g. the security-related stopping of all updates last week, which made us not ship b6 at all and ship b7 only a day before b8.

(In reply to Nicolas B. Pierron [:nbp] from comment #56)
> Can you help us normalize the number of users per build-id, such that we can
> make sure that comment 53 and comment 55 are not biased by a different
> population size?

Unfortunately, while the data to do this exists, it's a whole lot of work to do it. What is the answer you want to get out of that? (Is it worth it for me to pour multiple hours into potentially getting something there, especially if the outcome may not be too reliable by itself?)
Flags: needinfo?(kairo)
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #57)
> Unfortunately, while the data to do this exists, it's a whole lot of work to
> do it. What is the answer you want to get out of that? (Is is worth for me
> to pour multiple hours into potentially getting something there, esp. if the
> outcome may not be too reliable by itself probably?)

I want to know if the number of crashes is correlated with the number of reports.  If not, this would highlight an issue which could be an intermittent error.

While looking more into crash-stats, I found the following link, which pretty much answers this with the number of crashes/ADU (Active Daily Users), and highlights that they are correlated.  Thus, this is not an intermittent error.

https://crash-stats.mozilla.com/report/list?product=Firefox&range_unit=days&range_value=28&signature=js%3A%3Ajit%3A%3AAssemblerX86Shared%3A%3Abind#tab-graph
(In reply to Nicolas B. Pierron [:nbp] from comment #58)
> While looking more into crash-stat, I found the following link which
> pretty-much answer with the number of crashes/ADU (Active Daily User)

Yes, right, this one is helpful for Nightly and Aurora channels at least.
I finally got some Mac crashes with the diagnostics patch.
---
2 crashes in BaselineCompiler::emitOutOfLinePostBarrierSlot -> nextJump:

* db714a92-1884-4138-bf78-5b2332160409
  Buffer length 0x46e09. At offset 0x1fe38, we find 0xe5e5e5e5 (the instruction byte before that is also 0xe5).

* ffb0014a-e5fe-4bbc-93c7-06e772160410
  Buffer length 0x2d4f4. At offset 0x1ff95 we get 0xe5e5e5e5 again.

And 1 crash in NativeRegExpMacroAssembler::GenerateCode -> nextJump:

* 6b0f1202-7cdc-448a-83d1-f2de22160409
  Buffer length 0x2741fb. At offset 0x1afeb we get 0xe5e5e5e5.
---
0xe5e5e5e5 is jemalloc's poison pattern. It's mysterious why our assembler buffer would suddenly contain those bytes. There are a few possibilities:

(1) Memory bug elsewhere: maybe some other thread tried to free some invalid pointer and ended up poisoning our memory.

(2) The Vector underlying the AssemblerBuffer somehow points to invalid memory, or something went wrong growing the Vector. This seems a bit unlikely.

(3) Somehow writing the jump/call instruction + its jump offset went wrong. This is also unlikely.
---
There are some patterns in these crashes (I mentioned some of this earlier in this bug):

* The assembler buffers are unusually big (181 KB and 283 KB for the Baseline compilations, a whopping 2.5 MB for the regex compilation!). I want to find out how common that is for regular expressions.

* Note that the buffer offsets above are very similar: 110571, 130616, 130965. This could be a coincidence though.
I also wonder how often we get our assembler buffer corrupted with the 0xe5e5e5e5 pattern but *don't* crash because we're not binding a jump/call there.

We could scan the buffer for, say, 5 0xe5 bytes, but we have to guarantee that can't happen for real. For local testing it should be sufficient though.

A potentially interesting next step here is figuring out where the 0xe5 region starts and ends. I wonder if it's small (like 8 corrupted bytes) or most of the buffer.
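A hedged sketch of the kind of scan suggested above (illustrative only, not landed code): find where a run of poison bytes starts and ends, ignoring short runs that could be legitimate code bytes.

  #include <cstddef>
  #include <cstdint>

  // Returns true and reports [start, end) if the buffer contains a run of at
  // least minRun consecutive poison bytes.
  static bool FindPoisonRun(const uint8_t* buffer, size_t length, uint8_t poison,
                            size_t minRun, size_t* start, size_t* end) {
      size_t runStart = 0;
      size_t runLength = 0;
      for (size_t i = 0; i < length; i++) {
          if (buffer[i] == poison) {
              if (runLength == 0)
                  runStart = i;
              runLength++;
              continue;
          }
          if (runLength >= minRun) {
              *start = runStart;
              *end = i;
              return true;
          }
          runLength = 0;
      }
      if (runLength >= minRun) {
          *start = runStart;
          *end = length;
          return true;
      }
      return false;
  }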
FWIW there are a number of OS X crashes on the following page: http://slither.io/

That online game seems pretty popular and people likely spend a lot of time on that page, but I think even if we account for that, it's still crashier than you'd expect.

Unfortunately it's not possible to run that game unattended; maybe I can write a script to reload and start the game automatically.
Attached patch Diagnostic patch 2 (obsolete) — Splinter Review
Adds some code to track where the 0xe5 bytes start and end.

Once we know how big the region is, we're in a better position to decide what to try next.
Attachment #8740504 - Flags: review?(efaustbmo)
Attachment #8740504 - Attachment is obsolete: true
Attachment #8740504 - Flags: review?(efaustbmo)
Attachment #8740507 - Flags: review?(efaustbmo)
Attachment #8740507 - Flags: review?(efaustbmo) → review+
It's getting interesting. We have a Nightly crash report (on OS X) with the updated diagnostics: https://crash-stats.mozilla.com/report/index/0aa388b7-1de5-4d5f-a95c-65f1f2160414

Here's what it says:

* The AssemblerBuffer's size is 131101 bytes, a bit more than 128 KB. That's (again) large.
* The jump/call we want to patch is at offset 7741.
* We crash because *all* bytes in range 4096-16383 are 0xE5!

So exactly 3 pages (12 KB) are filled with the poison value.

It *could* be a bug when we resize the AssemblerBuffer's Vector, either Vector or jemalloc code. This seems somewhat unlikely, but some other stack/heap corruption bug could confuse the resize process.

Once we have a few more crash reports, we can try to narrow it down a bit as follows: after we write a new instruction to the buffer, we check if the AssemblerBuffer's length > some pretty large value. If it is, we MOZ_CRASH if we have >= ~16 0xE5 bytes at some offset we expect to be affected by this bug.

That will tell us (1) Does the buffer get poisoned when the current thread is in some particular part of the code, or does it happen at random? (2) If it's some other thread corrupting our memory, maybe we can still see its stack in the crash dumps, if we're lucky and it's still active.

First let's wait for a few more crash reports to see if it's always the same range.
It sounds like a buffer that was previously 16KB, then shrunk to 4KB (poisoning the remaining 12KB), and then grown to 128KB (or whatever the size class after that is). IIRC jemalloc doesn't zero buffers when it increases their size (unless opt_zero is set to true, which I don't believe it is in production [1]), so is it possible that we're looking at uninitialized data?

[1] https://dxr.mozilla.org/mozilla-central/source/memory/mozjemalloc/jemalloc.c#1312
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #68)
> so is it possible that we're looking at uninitialized data?

The AssemblerBuffer uses a Vector and simply appends to it for the most part. I could understand it containing *some* uninitialized data (when we have a bogus offset or forget to patch it), but exactly 3 pages seems very unlikely...
Attached patch More diagnosticsSplinter Review
Let's check in AssemblerX86Shared::executableCopy that the buffer wasn't poisoned with 16 or more 0xE5 bytes.

Basically I wonder how often we have this corruption but don't catch it because we're not binding a Label.
Attachment #8742286 - Flags: review?(efaustbmo)
Attachment #8742286 - Flags: review?(efaustbmo) → review+
Another Nightly crash: https://crash-stats.mozilla.com/report/index/195cf126-1203-45e1-9e4b-0bfa32160418

Buffer size: 132414
Instruction offset: 32770

0xE5 range: 32768 - 36864 (exactly 1 page)

Interestingly, the instruction we're patching is on the page boundary: the second half of the offset is corrupt, but the first part is still intact. The instruction is 0x86, it was most likely 0x0f 0x86 (JBE with 32-bit offset), and the offset is 0xe5e5ffff. That probably was 0xffffffff (-1), it means this was the first jump in this chain.

So we learned a few new things:

* The number of corrupted bytes varies, but with the 2 crashes so far it's always a multiple of the page size.
* The offset is also different (but 32768 is also a power of 2), but the buffer size is again a bit larger than 128 KB (= quite huge).
* I think this crash proves the page was filled correctly, initially, but it was overwritten later.
* It also confirms the code and offsets were correct before it got corrupted.
I think the best move now is to modify mozjemalloc: when resizing the AssemblerBuffer, we tell mozjemalloc the range it shouldn't poison. Then when mozjemalloc poisons at least 1 page, it can check this range and MOZ_CRASH if needed.

This will require some (temporary) hacks but if we're really racing with a realloc/free on another thread, it's the best way to find out where that happens.

We'll also need a lock/mutex, hopefully the perf impact from that won't be too bad.
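A hedged sketch of that suggestion (illustrative only; as the next comment explains, the instrumentation that actually landed took a different approach): the assembler registers its buffer's range, and the allocator checks that range under a lock before poisoning anything.

  #include <cstdint>
  #include <cstdlib>
  #include <mutex>

  static std::mutex sGuardLock;
  static uintptr_t sGuardStart = 0;
  static uintptr_t sGuardEnd = 0;

  // Called by the assembler whenever it (re)allocates its buffer.
  static void SetNoPoisonRange(void* start, size_t length) {
      std::lock_guard<std::mutex> lock(sGuardLock);
      sGuardStart = reinterpret_cast<uintptr_t>(start);
      sGuardEnd = sGuardStart + length;
  }

  // Called by the allocator just before it poisons [start, start + length).
  static void CheckBeforePoison(void* start, size_t length) {
      std::lock_guard<std::mutex> lock(sGuardLock);
      uintptr_t begin = reinterpret_cast<uintptr_t>(start);
      uintptr_t end = begin + length;
      if (begin < sGuardEnd && end > sGuardStart)
          abort();  // someone is about to poison the live assembler buffer
  }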
I've started on a patch that modifies mozjemalloc, but not in the way Jan suggested above. Instead, I've added new 'protected' allocation functions that give out or take an allocation ID. reallocs and frees (protected and unprotected) are then checked to make sure the right key was supplied, and crash if not. Using these functions for the AssemblerBuffer should protect against the kind of realloc-after-free that's suspected here, and point straight at the offending party.

I'm hoping that this functionality will be fast enough to stick around as a general mechanism for problems like these, but that also means I'm trying to make them available everywhere (even if they only do anything in mozjemalloc). Their actual implementation is pretty simple, but I'm having some difficulty with all our allocation glue code, simply because there's so much of it; might take a few more days.
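A hedged sketch of the ownership idea (illustrative only; the real bug 1273462 work lives in mozjemalloc and the allocation glue): each protected allocation gets an ID, and a protected free or realloc must present the same ID or we crash, pointing straight at the offending caller.

  #include <cstdint>
  #include <cstdlib>
  #include <mutex>
  #include <unordered_map>

  static std::mutex sLock;
  static std::unordered_map<void*, uint64_t> sOwners;
  static uint64_t sNextId = 1;

  void* MallocProtected(size_t size, uint64_t* idOut) {
      void* ptr = malloc(size);
      if (ptr) {
          std::lock_guard<std::mutex> lock(sLock);
          *idOut = sNextId++;
          sOwners[ptr] = *idOut;
      }
      return ptr;
  }

  void FreeProtected(void* ptr, uint64_t id) {
      {
          std::lock_guard<std::mutex> lock(sLock);
          auto it = sOwners.find(ptr);
          if (it == sOwners.end() || it->second != id)
              abort();  // wrong ID: this caller does not own the allocation
          sOwners.erase(it);
      }
      free(ptr);
  }

(The real patch also checks unprotected frees and reallocs against the table, so an outside party touching the buffer crashes at the point of the bad call.)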
Bug 1273462 has landed, which means we can now theoretically protect AssemblerBuffers that we think are likely to be hit by this bug. Unfortunately protecting every buffer all the time is too slow - it regresses Kraken by something like 3% - so this is currently disabled. Jan suggested on IRC protecting buffers for 1) large baseline scripts and 2) regexp scripts.

This shouldn't be hard to do now, for someone who knows where to look: we just need to call BaseAssembler::enableBufferProtection() from the right places (at the start of script compilation). The only wrinkle is that we'll probably need a dummy implementation for ARM, since that doesn't use the same kind of AssemblerBuffers.

Thing is, I'm kind of out of my depth here - I didn't have much luck tracing uses of BaseAssembler up to a higher level - so I'd like to leave this part up to an expert. Pinging Jan, but anyone who knows their way around the JITs feel free to jump in.
Flags: needinfo?(jdemooij)
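For readers following along, "protecting" a buffer here means marking its already-written, page-aligned prefix read-only so any stray write faults immediately. A hedged POSIX sketch of the idea (the real mechanism is the PageProtectingVector machinery from bug 1273462; Windows would use VirtualProtect instead of mprotect):

  #include <sys/mman.h>
  #include <cstddef>
  #include <cstdint>

  static const size_t kPageSize = 4096;

  // buffer must be page-aligned (e.g. obtained from mmap or posix_memalign).
  static void ProtectWrittenPages(uint8_t* buffer, size_t bytesWritten) {
      size_t protectable = bytesWritten & ~(kPageSize - 1);
      if (protectable)
          mprotect(buffer, protectable, PROT_READ);  // reads ok, writes fault
  }

  // Called before legitimately patching earlier code (e.g. binding a label).
  static void UnprotectWrittenPages(uint8_t* buffer, size_t bytesWritten) {
      size_t protectable = bytesWritten & ~(kPageSize - 1);
      if (protectable)
          mprotect(buffer, protectable, PROT_READ | PROT_WRITE);
  }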
[Tracking Requested - why for this release]:
Nominating this issue for tracking. It is still the #1 crasher for Mac OS X on release (12.3% currently). Maybe it's possible to come up with a solution for this in time for 48...
Emanuel, sorry for the delay.

We discussed this a bit yesterday. Next week I'll look at a bunch of Mac and Windows crash dumps to figure out the right conditions to use for the protect calls here.
Track this as 48/49 still have the crash.
Calixte classified the number of hits of bugs per domain, such as Yahoo, Gmail, YouTube, and Facebook.

*youtube.com/* represents ~24% of all crashes with this signature.
This signature represents ~8.5% of all *youtube.com/* crashes reported.

Here is the list of domains found (not ordered):
  www.listentoyoutube.com
  www.getlinkyoutube.com
  www.youtube.com
  consent.youtube.com
  dfromyoutube.com
  gaming.youtube.com

This corresponds to our hypothesis that some part of Gecko which is manipulating large buffers is resizing our assembler buffer, which suggests that this might be related to the audio/video handling code.
Crash volume for signature 'js::jit::AssemblerX86Shared::bind':
  - nightly (50): 2
  - esr (45): 2154

Affected platforms: Windows, Mac OS X, Linux
I think bug 1271165 is the best way forward here. Unfortunately that has been blocked on reviews for a while now, despite pings on IRC/email.
Flags: needinfo?(jdemooij)
Seems it is now moving in bug 1271165
but it is too late for 48
We've got two crash reports on Windows from the new instrumentation (!) so far:

https://crash-stats.mozilla.com/report/index/575318c2-2434-413a-a01b-71a852160818
https://crash-stats.mozilla.com/report/index/acbed5a5-5a7a-4cdb-a2bc-9bb4e2160818

There's a lot of inlining, but logically they can only be crashing here:
https://hg.mozilla.org/mozilla-unified/annotate/fe895421dfbe/memory/mozjemalloc/jemalloc.c#l6305

In other words, we must be resizing the vector from multiple threads at roughly the same time without taking a lock! Jan, can you make anything of the stacks?
Flags: needinfo?(jdemooij)
From discussion on IRC with Nicolas, sounds like these are startup crashes and thus probably not related to this bug (uptime in the crash reports seems to confirm this). Still sounds like a race that we should fix though!
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #84)
> In other words, we must be resizing the vector from multiple threads at
> roughly the same time without taking a lock! Jan, can you make anything of
> the stacks?

You can see the stacks of the other threads if you click "Show other threads", there are no other threads doing related stuff AFAICS.

Also, these two crash reports are from a single user. Combined with the uptime of 0-1 seconds, I think we should wait for more reports for now.
Flags: needinfo?(jdemooij)
(In reply to Jan de Mooij [:jandem] from comment #86)
> You can see the stacks of the other threads if you click "Show other
> threads", there are no other threads doing related stuff AFAICS.

Hmm, it's very odd that we're crashing by trying to protect a region that's already protected (especially considering that the new alloc policy used by AssemblerBuffer is the only consumer of this API). I'm not even sure how that could happen, since jemalloc shouldn't be able to hand out the same address twice without a free in between (which would have to unprotect first). This particular allocation seems to have gone through Vector::convertToHeapStorage(), so it's a malloc [1], which should be the simplest case.

I don't understand :( Hopefully some other crashes will paint a clearer picture.

[1] https://dxr.mozilla.org/mozilla-central/source/mfbt/Vector.h#863
Another crash from the same user. I noticed something interesting: they seem to have a module with the filename "jemalloc_yg.dll" loaded. So maybe calls are being intercepted in some way and this module doesn't know how to deal with the new functions. Dunno if that means it's malware, a search turned up nothing.
This one might be interesting:

https://crash-stats.mozilla.com/report/index/8ef33114-a248-4beb-bbe1-4c0a02160819

A crash on OSX 10.11, trying to realloc a protected region.
From bug 1271165 comment #55, it looks like crashes are still slipping through:

https://crash-stats.mozilla.com/report/index/77a80a3f-1b23-40a8-87fd-bce472160819
https://crash-stats.mozilla.com/report/index/4b688e46-8bbd-4cd1-9c41-114ab2160820

Unless this is the runtime analysis catching some unrelated problem, it looks like our jemalloc hypothesis isn't the whole story.
Let's recap in view of the latest information:

1) We're seeing the jemalloc poison pattern turning up in large AssemblerBuffers, page-aligned but at a non-zero offset into the buffer. AssemblerBuffer uses a mozilla::Vector that never shrinks, and grows using mozilla::Vector's doubling logic. 

2) Because jemalloc is the only source of this particular poison pattern, we suspected a complicated realloc-after-free or realloc-after-realloc scenario, wherein a broken outside actor (possibly a third-party library) was shrinking our buffer *in-place*, causing only part of it to be copied over during the next doubling.

3) We instrumented jemalloc to detect unauthorized reallocs. This turned up one crash [1] on OSX, showing that the instrumentation works as intended, but this crash does not involve *in-place* reallocation, which is needed by our hypothesis. Meanwhile, our runtime analysis continued to detect other corruption [2][3], showing that no jemalloc activity is needed.

4) Since in-place realloc can't be causing the corruption, jemalloc can't be poisoning our buffer directly. Whatever is causing this must be *copying* the poison pattern into our buffer from another buffer, perhaps one that jemalloc already freed (or one that is simply uninitialized).

5) I can see only one way to catch something like that: mprotecting our buffer. Let's revisit the mechanism added by bug 1273462, and either temporarily enable it for everything (taking a small hit to benchmarks), or add some extra infrastructure to only enable it for big baseline scripts and regular expressions.

Jan, do you agree with that analysis? It might be too early to draw conclusions, but I think the crash reports paint a pretty compelling picture.

[1] https://crash-stats.mozilla.com/report/index/8ef33114-a248-4beb-bbe1-4c0a02160819
[2] https://crash-stats.mozilla.com/report/index/77a80a3f-1b23-40a8-87fd-bce472160819
[3] https://crash-stats.mozilla.com/report/index/4b688e46-8bbd-4cd1-9c41-114ab2160820
Flags: needinfo?(jdemooij)
(For anyone else reading this: the downside - aside from performance - of using this mechanism is that memory protection faults happen for various reasons, and it might be difficult to spot a spike in crashes from something so rare)
As a follow-up to that last comment: I'm not sure how to look for memory protection crashes on all operating systems.

For Windows we can just look for reason=EXCEPTION_ACCESS_VIOLATION_WRITE: https://crash-stats.mozilla.com/search/?version=51.0a1&reason=%3DEXCEPTION_ACCESS_VIOLATION_WRITE&date=>%3D2016-01-01

For Linux I think we can look for reason=SIGSEGV with a non-zero address: https://crash-stats.mozilla.com/search/?version=51.0a1&date=>%3D2016-01-01&reason=%3DSIGSEGV&address=!%3D0x0

For OSX I *think* we can use reason=EXC_BAD_ACCESS / KERN_PROTECTION_FAILURE: https://crash-stats.mozilla.com/search/?version=51.0a1&date=>%3D2016-01-01&reason=%3DEXC_BAD_ACCESS %2F KERN_PROTECTION_FAILURE

But I'm really not sure about Linux, and I'm especially unsure about OSX.
I'd rather you answered the questions from bug 1271165 comment 35, which can be summarized as: are we absolutely sure that the vector is used correctly?
Specifically:
- are we sure it's being filled as it's supposed to be filled (iow, weren't the poison bytes there in the first place and never overwritten)
- are we sure we're actually accessing the right buffer? (for instance, does the pointer we have actually match where the vector really is, aka, has the vector been reallocated away from under us at some point?)
- are we sure we're looking at the right offset and the right vector size?

I wouldn't make pages read-only without answering those questions before, because if it's the js code that's doing something funky to itself, making the pages read-only will have bad consequences for other code.
(In reply to Mike Hommey [:glandium] from comment #94)
> - are we sure it's being filled as it's supposed to be filled (iow, weren't
> the poison bytes there in the first place and never overwritten)

It's a byte Vector, the AssemblerBuffer appends x86/x64 instructions to it. We append 5 bytes for a jump instruction + its offset, and sometimes the first X of these bytes are in the crash dump (and look perfectly sane/expected) but the bytes after that are part of the 0xe5 region (the region starts there). That suggests some of the 5 bytes we wrote got overwritten somewhere.

Furthermore, the 0xe5 region is page-aligned, if the AssemblerBuffer skipped a region in the buffer, I don't see why it would always be page aligned.

> - are we sure we're actually accessing the right buffer? (for instance, does
> the pointer we have actually match where the vector really is, aka, has the
> vector been reallocated away from under us at some point?)

We get the address from the Vector itself; we're not holding onto a stale pointer. IIRC crash dumps show the Vector (on the stack) has this address as its data pointer.

> - are we sure we're looking at the right offset and the right vector size?

Yeah, the Vector's capacity + size + offset are all sane and reasonable (also when comparing them to other things on the stack).

The reports that have the "partially corrupted jumps" I mentioned earlier, also show the offset is correct: we indeed have a jump instruction at that offset. (Also: the offset < the length, and we should never have pages filled with 0xE5 bytes in the buffer, no matter what the offset is.)

> I wouldn't make pages read-only without answering those questions before,
> because if it's the js code that's doing something funky to itself, making
> the pages read-only will have bad consequences for other code.

Our fuzzers (super effective at finding bugs here) never hitting this + the crashes being completely unreproducible, suggest it's something timing/signal/threading related.
Flags: needinfo?(jdemooij)
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #92)
> (For anyone else reading this: the downside - aside from performance - of
> using this mechanism is that memory protection faults happen for various
> reasons, and it might be difficult to spot a spike in crashes from something
> so rare)

I think we can use the SEGV handler added for AsmJS to check whether the faulting address is inside one of our AssemblerBuffers and produce a specific error message if this is the case.

(In reply to Mike Hommey [:glandium] from comment #94)
> - are we sure it's being filled as it's supposed to be filled (iow, weren't
> the poison bytes there in the first place and never overwritten)

On x86/x64, we use a Vector in which we only append instructions (no more than 15 bytes at a time).  Then later we can patch back some already-written offsets within this vector.  These offsets are payloads of instructions, so we should expect to have at least one byte for the opcode and 4-8 bytes of payload.

The span of 0xe5 patterns (comment 67, comment 72) is larger than any single action performed on our AssemblerBuffer, and we have no opcode with 0xe5.

> - are we sure we're actually accessing the right buffer?

The AssemblerBuffer as well as its underlying vector have a single instance per compilation; we do not move nor share them between compilations.  The Vector instance is usually part of the heap-allocated CodeGenerator structure, or on the stack for RegExp / Baseline.
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #91)
> 2) Because jemalloc is the only source of this particular poison pattern, […]

Directly or indirectly, jemalloc is indeed the source of this pattern: the original pattern used to be 0x5a (comment 9) and changed to 0xe5 when we changed the jemalloc poison value (bug 1044077).
Crash volume for signature 'js::jit::AssemblerX86Shared::bind':
 - nightly (version 51): 7 crashes from 2016-08-01.
 - aurora  (version 50): 71 crashes from 2016-08-01.
 - beta    (version 49): 622 crashes from 2016-08-02.
 - release (version 48): 3148 crashes from 2016-07-25.
 - esr     (version 45): 2874 crashes from 2016-05-02.

Crash volume on the last weeks (Week N is from 08-22 to 08-28):
            W. N-1  W. N-2  W. N-3
 - nightly       2       0       0
 - aurora       21      21       5
 - beta        201     214      68
 - release     967     812     404
 - esr         191     210     207

Affected platforms: Windows, Mac OS X, Linux

Crash rank on the last 7 days:
           Browser   Content     Plugin
 - nightly           #83
 - aurora  #204      #23
 - beta    #84       #72
 - release #16       #29
 - esr     #42
Too late for 49 as next week is the RC build.
I don't know if this is interesting, but it looks like 20% of the crashes with this signature have "cpu_info = family 6 model 58 stepping 9 | 4 ∧ adapter_vendor_id = 0x8086 ∧ adapter_device_id = 0x0166".
Crash volume for signature 'js::jit::AssemblerX86Shared::bind':
 - nightly (version 52): 2 crashes from 2016-09-19.
 - aurora  (version 51): 8 crashes from 2016-09-19.
 - beta    (version 50): 112 crashes from 2016-09-20.
 - release (version 49): 1486 crashes from 2016-09-05.
 - esr     (version 45): 3513 crashes from 2016-06-01.

Crash volume on the last weeks (Week N is from 10-03 to 10-09):
            W. N-1  W. N-2
 - nightly       2       0
 - aurora        8       0
 - beta         94      18
 - release    1254     232
 - esr         284     291

Affected platforms: Windows, Mac OS X, Linux

Crash rank on the last 7 days:
           Browser   Content     Plugin
 - nightly           #369
 - aurora  #1508     #143
 - beta    #229      #108
 - release #61       #17
 - esr     #38
Priority: -- → P3
For anyone following along, ehoogeveen added some signal handling magic to hopefully catch the source of these crashes (bug 1305360, bug 1306972).
See Also: → 1310932
The fix for bug will be in tomorrow's Nightly, so hopefully the annotations will start to show up in crash stats!

By the way, I realized that I had my date range wrong for my crash stats searches. These crashes are definitely still happening, though only a few are slipping through the protection that PageProtectingVector offers (14 in 52.0a1 so far):
https://crash-stats.mozilla.com/search/?signature=~js%3A%3Ajit%3A%3AX86Encoding%3A%3ABaseAssembler%3A%3AnextJump&signature=~js%3A%3Ajit%3A%3AAssemblerX86Shared%3A%3Abind&build_id=%3E%3D20161001030430&product=Firefox&version=52.0a1&date=%3E%3D2016-10-01T00%3A00%3A00.000Z&date=%3C2017-09-30T09%3A15%3A00.000Z
The fix for bug 1309573, even.
The page protection finally caused some crashes, but they look pretty weird:
https://crash-stats.mozilla.com/report/index/25426d91-e3e3-452c-8df2-4e0482161201
https://crash-stats.mozilla.com/report/index/c3a15282-8276-4786-9d67-25f392161130
https://crash-stats.mozilla.com/report/index/a5a0a983-8d86-4c1d-9887-aa6cb2161201

The first two jump from je_malloc back into XUL and end up compiling a script - I don't know if that's actually possible via some interruption mechanism or if heap corruption created a path where none exists. The third is in the backtracking register allocator - is that an extremely rarely taken branch that's missing an AutoUnprotect, more heap corruption, or a possible bug? I don't know if any of them would end up generating a poison pattern, but it seems possible.

Hopefully bug 1322445 will give us more to work with (most of the crashes we're seeing seem to be from corruption in the last page of the buffer), but it'd be great if these 3 tell us something.
Flags: needinfo?(jdemooij)
Look at the raw dumps. For the first, for example, you'll see that the js::DebugEnvironments::takeFrameSnapshot frame comes from stack scanning instead of cfi, so anything below it can be wrong.
When I'm home I'll load these dumps in Visual Studio and see what it tells us.

The 3rd one, the backtracking allocator crash, had an uptime of 8 seconds, so that's a bit suspicious. Could be real though. The other 2 are in the assembler buffer code itself. Maybe we're missing an unprotect call somewhere? Fuzzing should probably have found that by now, though, and BaselineCompiler::emitOutOfLinePostBarrierSlot is used a lot.

More on the crash dumps in a few days.
Jan had a look at those crashes and we discussed them on IRC. From the dumps we couldn't see anything wrong: the existing AutoUnprotect calls were unprotecting the right page, and if the unprotect call itself had failed we would have crashed there instead. The only thing I can think of is hardware or OS-level failure, and given the low number of reports there doesn't seem much point in looking deeper.

In addition, bug 1322445 has landed now. That eliminates the AutoUnprotect calls and should tell us if something tries to write to the last page of our buffer.
Flags: needinfo?(jdemooij)
A small update on this as we get ready to branch 53: We've seen no crashes on Nightly since bug 1329499 landed, on any of the signatures.

https://crash-stats.mozilla.com/search/?moz_crash_reason=%3DMOZ_CRASH%28nextJump%20bogus%20offset%29&moz_crash_reason=%3DMOZ_CRASH%28Corrupt%20code%20buffer%29&build_id=%3E%3D20170110075905&product=Firefox&version=53.0a1&date=%3E%3D2017-01-10T00%3A00%3A00.000Z&date=%3C2018-01-09T00%3A00%3A00.000Z
https://crash-stats.mozilla.com/search/?moz_crash_reason=~About%20to%20overflow%20our%20AssemblerBuffer%20using%20infallibleAppend%21&build_id=%3E%3D20170110075905&product=Firefox&version=53.0a1&date=%3E%3D2017-01-10T00%3A00%3A00.000Z&date=%3C2018-01-09T00%3A00%3A00.000Z
https://crash-stats.mozilla.com/search/?moz_crash_reason=~Cannot%20access%20PageProtectingVector%20from%20more%20than%20one%20thread%20at%20a%20time%21&build_id=%3E%3D20170110075905&product=Firefox&version=53.0a1&date=%3E%3D2017-01-10T00%3A00%3A00.000Z&date=%3C2018-01-09T00%3A00%3A00.000Z
https://crash-stats.mozilla.com/search/?moz_crash_reason=~Caller%20is%20writing%20the%20poison%20pattern%20into%20this%20buffer%21&build_id=%3E%3D20170110075905&product=Firefox&version=53.0a1&date=%3E%3D2017-01-10T00%3A00%3A00.000Z&date=%3C2018-01-09T00%3A00%3A00.000Z
https://crash-stats.mozilla.com/search/?moz_crash_reason=~Tried%20to%20access%20a%20protected%20region%21&build_id=%3E%3D20170110075905&product=Firefox&version=53.0a1&date=%3E%3D2017-01-10T00%3A00%3A00.000Z&date=%3C2018-01-09T00%3A00%3A00.000Z

Meanwhile we *have* still been seeing crashes from Nightlies up to and including the 2017-01-09 one:

https://crash-stats.mozilla.com/search/?moz_crash_reason=%3DMOZ_CRASH%28nextJump%20bogus%20offset%29&moz_crash_reason=%3DMOZ_CRASH%28Corrupt%20code%20buffer%29&product=Firefox&version=53.0a1&date=%3E%3D2017-01-10T00%3A00%3A00.000Z&date=%3C2018-01-09T00%3A00%3A00.000Z

This suggests to me that some change in the 2017-01-10 Nightly, probably bug 1329499, worked around the crash. Either the timing changed due to the poison detection - but that seems unlikely to me since we changed the timing far more with previous experiments - or the cause has something to do with realloc, since we switched to malloc+free in the new allocation policy.

We could check this by switching back to using realloc, and make absolutely sure by copying the buffer into a new buffer and comparing the result of that with the result of calling realloc. As for what could be the cause, though.. Maybe we can spot some error in the mozjemalloc realloc logic, but that seems unlikely. I wonder if the administrative data for the allocation is getting corrupted somehow.
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #110)
> We could check this by switching back to using realloc, and make absolutely
> sure by copying the buffer into a new buffer and comparing the result of
> that with the result of calling realloc.

Yeah this is worth a try I think, to make sure the crashes didn't disappear in the meantime. Another option is to use the code we have in a few places to scan the buffer for 0xe5 poisoning and call it before and after each realloc. Then if it fails consistently after the realloc but not before, the badness (at least the part that affect us) definitely happens during realloc.
Good idea; filed bug 1332594 about doing both.

To recap, here's what we've ruled out so far:
1) An outside actor reallocating our buffer out from under us (bug 1271165)
2) An outside actor memcpying the poison pattern into our used pages (bug 1273462, bug 1305360 and bug 1309573)
3) An outside actor memcpying the poison pattern into our unused pages (bug 1322445)
4) Buffer overflows from infallibleAppend (bug 1326302)
5) Unprotected concurrent access to the AssemblerBuffer (bug 1326302)
6) One of the legitimate users of AssemblerBuffer writing the poison pattern into it (bug 1329499)
7) An outside actor writing the poison pattern into our buffer *during* realloc (bug 1329499)

Crash dumps also don't seem to show anything odd going on with the stored address, length or capacity of the vector's buffer.
Only 2 crashes so far, but it looks like this is indeed coming 'from' jemalloc: https://crash-stats.mozilla.com/search/?moz_crash_reason=~New%20buffer%20doesn%27t%20match%20the%20old%20buffer&build_id=%3E%3D20170201030207&product=Firefox&version=54.0a1&date=%3E%3D2017-02-01T00%3A00%3A00.000Z&date=%3C2018-01-31T00%3A00%3A00.000Z

Unfortunately that puts us in a tough position. jemalloc almost certainly isn't *causing* the corruption of its own data structures, so we'd have to do the same thing for jemalloc that we did for AssemblerBuffer: make chunk headers read-only and see what shows up in crash stats.

But jemalloc uses a complicated scheme to track 'runs' of free data in chunks, using some space in each chunk header as nodes for a red-black tree, which link chunks together. That means when updating the available runs, you either have to reach into the guts of the rbtree implementation and unprotect chunks whose links get updated, or unprotect all chunks in advance of any update. Even that doesn't really work, because some nodes come from 'base' allocations, which aren't part of chunks at all (and thus wouldn't be protected).

Secondly, I made a cross-platform exception handler to surface these rare memory protection crashes in crash stats, but it currently lives in SpiderMonkey. I'd either have to move it to MFBT or risk the new memory exception crashes getting lost in the noise.

I think we can work around these crashes by avoiding realloc (since we *do* know the right allocation size), but given what we know now, this issue probably affects every user of mozilla::Vector that requires larger buffers. It's possible that switching to jemalloc4 will fix these crashes, but I think it still uses a similar underlying structure.
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #113)
> That means when updating the available runs, you
> either have to reach into the guts of the rbtree implementation and
> unprotect chunks whose links get updated, or unprotect all chunks in advance
> of any update. Even that doesn't really work, because some nodes come from
> 'base' allocations, which aren't part of chunks at all (and thus wouldn't be
> protected).

Er, here I meant that reaching into the guts of the rbtree won't work; unprotecting all chunks before making any change would probably work, but also be slow. I'm not sure whether protecting chunk headers is even a good idea though, as their size might be variable and nodes might also come from runs inside each chunk. The jemalloc code is very hard to understand unfortunately.

Oh, and switching to jemalloc4 obviously wouldn't address the underlying cause here, either. The corruption might resurface in a more tractable way though.
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #113)
> Only 2 crashes so far, but it looks like this is indeed coming 'from'
> jemalloc:
> https://crash-stats.mozilla.com/search/
> ?moz_crash_reason=~New%20buffer%20doesn%27t%20match%20the%20old%20buffer&buil
> d_id=%3E%3D20170201030207&product=Firefox&version=54.0a1&date=%3E%3D2017-02-
> 01T00%3A00%3A00.000Z&date=%3C2018-01-31T00%3A00%3A00.000Z

Can you detail how you come to that conclusion from those crashes?
(In reply to Mike Hommey [:glandium] from comment #115)
> Can you detail how you come to that conclusion from those crashes?

Sure. In bug 1329499 we replaced the alloc policy used by the underlying mozilla::Vector with one that uses malloc+free instead of realloc, and the crashes went away. Then in bug 1332594, we made an alloc policy that
1) mallocs a temporary buffer and copies the old bytes into it
2) reallocs the old buffer
3) compares the contents of the temporary buffer with the realloced buffer, and crashes if they don't match.

The crashes from comment #113 are instances where these buffers don't match, which indicates that jemalloc didn't copy all the bytes from the old buffer into the new buffer. The only way that could happen is if the size stored by jemalloc doesn't match the size of the original allocation (as stored by mCapacity in the vector).
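
In pseudo-code, the checking policy from bug 1332594 boils down to something like this (a simplified sketch with made-up names; the crash message is the one that shows up in the crash reports discussed later):

#include <algorithm>  // std::min
#include <cstdlib>    // malloc, realloc, free
#include <cstring>    // memcpy, memcmp
#include "mozilla/Assertions.h"  // MOZ_CRASH

template <typename T>
T* checkedReallocSketch(T* oldBuf, size_t oldSize, size_t newSize) {
  // 1) Copy the old contents into a temporary buffer.
  T* tmp = static_cast<T*>(malloc(oldSize * sizeof(T)));
  if (!tmp)
    return nullptr;
  memcpy(tmp, oldBuf, oldSize * sizeof(T));

  // 2) Let jemalloc perform the realloc as usual.
  T* newBuf = static_cast<T*>(realloc(oldBuf, newSize * sizeof(T)));
  if (!newBuf) {
    free(tmp);
    return nullptr;
  }

  // 3) realloc must have preserved the old contents (up to the smaller of
  //    the two sizes); if it didn't, jemalloc's idea of the allocation size
  //    doesn't match what the vector asked for.
  size_t bytesToCheck = std::min(oldSize, newSize) * sizeof(T);
  if (memcmp(tmp, newBuf, bytesToCheck) != 0)
    MOZ_CRASH("maybe_pod_realloc: tmp buffer doesn't match old buffer!");

  free(tmp);
  return newBuf;
}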

In all the recent crashes we see the poisoned region starting at page 2^n - 1, the last page of the buffer before we resize it. What I suspect is happening is that something is decreasing the map bits by enough (≥4 if I'm not mistaken) to reduce the stored size by 1 page when rounding down. When grabbing the size, jemalloc masks out the flag bits [1], so it'll copy exactly a page less than the actual size of the allocation, leaving the last page uninitialized (and likely filled with poison from a previous deallocation). It then frees up the old run, leaving the last page in limbo (permanently fragmenting the chunk but otherwise doing no damage).
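
To make the rounding-down arithmetic concrete, here's a standalone illustration (the layout and flag values are assumptions for illustration, not copied from jemalloc):

#include <cstdio>

int main() {
  // Assume 4 KiB pages, and that a large run's size lives in the high bits
  // of its chunk map element with flag bits in the low bits. The exact flag
  // values below are made up.
  const size_t pagesize_mask = 0xFFF;
  const size_t mapbits = 0x10000 | 0x3;  // 64 KiB run with two low flags set

  const size_t size_good = mapbits & ~pagesize_mask;   // 0x10000: real size
  const size_t corrupted = mapbits - 4;                // stray decrement >= 4
  const size_t size_bad  = corrupted & ~pagesize_mask; // 0xf000: a page short

  printf("before: 0x%zx  after corruption: 0x%zx\n", size_good, size_bad);
  return 0;
}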

Note that per comment #112, we've ruled out pretty much everything else I can think of already. I don't know if my hypothesis is correct, but I'm pretty sure something is going wrong inside jemalloc itself.

[1] https://dxr.mozilla.org/mozilla-central/source/memory/mozjemalloc/jemalloc.c#4490
One thing we can probably do to confirm this and see how often it happens in the wild is to add a (release) assertion to arena_salloc() that checks whether the unused flag bits [1] are 0. If my hypothesis is correct and a small decrement is causing all of this havoc, some of the more significant bits should still be set.

[1] https://dxr.mozilla.org/mozilla-central/source/memory/mozjemalloc/jemalloc.c#870
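
Something along these lines (a sketch only; kUnusedMapBits is a made-up name for whatever mask of bits no CHUNK_MAP_* flag actually uses, and the real check would live inside arena_salloc()):

#include <cstddef>
#include "mozilla/Assertions.h"  // MOZ_RELEASE_ASSERT

// Sketch of the proposed check, pulled out of context: in the real patch it
// would run right after reading the map element for the allocation. If a
// small decrement corrupted the word, some of these bits should be set.
static void checkLargeMapBits(size_t mapbits) {
  const size_t kUnusedMapBits = 0x0f0;  // assumption, for illustration only
  MOZ_RELEASE_ASSERT((mapbits & kUnusedMapBits) == 0,
                     "unused chunk map bits are set; metadata corruption?");
}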
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #117)
> One thing we can probably do to confirm this and see how often it happens in
> the wild is to add a (release) assertion to arena_salloc() that checks
> whether the unused flag bits [1] are 0.

I filed bug 1336662 about doing this. The check is simple and fast, so let's try this before worrying about anything more invasive.
Has anyone tried to hit this MOZ_CRASH under rr chaos mode?
Looking at the 51.0.1 crashes with this signature [0], the platform distribution is now:

* Windows: 61.02%
* Mac: 38.81%
* Linux: 0.18%

Also, it's no longer true that most of these crashes are on YouTube. It's possible we fixed a media bug or something. There were 16 crashes on http://www.orange.fr/portail (all on Windows); stress testing that site could be useful.
If my hunch about something going wrong with jemalloc's metadata is correct, the assertion I'm proposing in bug 1336662 might make this more reproducible by making us crash earlier and in places that might not have crashed before. Patching jumps is very sensitive to corruption, but I imagine that other places use large buffers to store simple data (like images or computational results).
If it makes things more reproducible, it's good, but otoh, it's not going to do much good on its own. That is, you're just going to have another set of crashes at a different place, and not much more information wrt the actual root cause... until you can actually debug it locally. Which you don't need anything landed to actually attempt to do...
Sure, but blindly trying things hasn't yielded anything so far. Aside from the website Jan mentioned (I tried it locally with the jemalloc patch applied, but didn't hit any crashes), landing this might give us more of an idea on what to try, if only from comments in crash reports.

I'm perfectly happy to work on tracking down the root cause, but any instrumentation to actually protect these data structures is going to be invasive, probably pretty slow, and ideally we would want the exception handler that currently lives in SpiderMonkey to pick them out from the noise of other memory protection crashes.

rr doesn't work on Windows, we don't really see these crashes on Linux, and I don't have access to a Mac (does rr even work on OSX?). I believe Address Sanitizer is supposed to work with clang-cl on Windows now, but the last time I tried to get it working (for that Facebook crash) I got linker errors and had to give up. That doesn't leave me with a lot of options.
I filed bug 1341889 about adding some more checks to realloc and using malloc_usable_size to narrow down the problem a little more. We got a lot of crashes from the same person on Windows after bug 1339441 landed; it might be worth trying to reproduce using their extensions.
Do you have links to those crashes?
Isn't that a lot more crashes than before? This would suggest bug 1339441 made it slightly more likely to happen?
8 crashes in a week isn't a huge spike - we got 19 (13 + 6) crashes between February 1 and February 15, after bug 1332594 landed. The 29 crashes for that one unfortunate user are a different story though.
If only they had left an email :(
Mass wontfix for bugs affecting firefox 52.
[@ js::ProtectedReallocPolicy::maybe_pod_realloc<T>] is the #10 top signature for Nightly 20170507030205 on Windows, with 15 reports from 2 installations. The moz crash reasons are:

1 MOZ_CRASH(maybe_pod_realloc: tmp buffer doesn't match old buffer!)          10  66.67 %
2 maybe_pod_realloc: buffers don't match (1048576 >= 1048576, 524288, false)!  2  13.33 %
3 maybe_pod_realloc: buffers don't match (4096 >= 4096, 2048, false)!          2  13.33 %
4 maybe_pod_realloc: buffers don't match (2097152 >= 2097152, 1048576, false)! 1   6.67 %
Crash Signature: [@ js::jit::AssemblerX86Shared::bind(js::jit::Label*)] [@ js::jit::AssemblerX86Shared::bind] → [@ js::jit::AssemblerX86Shared::bind(js::jit::Label*)] [@ js::jit::AssemblerX86Shared::bind] [@ js::ProtectedReallocPolicy::maybe_pod_realloc<T>]
This spiked recently for some reason; perhaps some user switched from Dev Edition to Nightly, or upgraded their RAM. Either way, the only conclusion we can draw from these crashes is that their hardware is bad: we're seeing a ~50/50 split between the temporary buffer being corrupted and the reallocation itself being corrupted.

As a result I think we can safely ignore these crashes on Windows. There are more of them than on other platforms, but that's probably just a consequence of the larger audience. This is one case where some sort of runtime memtest that sets a 'user probably has bad hardware' flag in the crash report would be useful, to let us ignore these reports.

For OSX [1] there's still no indication of bad hardware, but the crash rate is so low that it's hard to say anything conclusive. It might be worth looking at a disassembly of jemalloc's realloc function on OSX to see if there's a miscompilation somewhere.

[1] https://crash-stats.mozilla.com/search/?build_id=%3E%3D20170413030227&moz_crash_reason=~maybe_pod_realloc&moz_crash_reason=~free_&moz_crash_reason=~uintptr_t%28p%29%20%3D%3D%20currAddr&moz_crash_reason=~%21currSize%20%26%26%20%21currAddr&moz_crash_reason=~Could%20not%20confirm%20the%20presence%20of%20poison%21&product=Firefox&version=55.0a1&platform=Mac%20OS%20X&date=%3E%3D2017-04-13T00%3A00%3A00.000Z&date=%3C2018-04-12T00%3A00%3A00.000Z&_sort=-date&_facets=signature&_facets=moz_crash_reason&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=moz_crash_reason#facet-signature
So, for what it's worth, my wife just got this crash on her mac. Report: https://crash-stats.mozilla.com/report/index/a5a6a43d-0e88-4104-8429-c3e571170510

I'm not entirely sure what happened, but the version that caused it is gone from her machine. As a matter of fact, after submitting the crash, the browser restarted into 53.0.2 while the crash happened on 52.0.2. And the crash report says she was upgraded to 52.0.2 an hour prior to the crash.

But anyways, the point is I have direct physical access to a machine that had the crash, so I can run hardware diagnostics on the machine (but I don't know what to use on OSX for these things).
Flags: needinfo?(emanuel.hoogeveen)
Sorry about the delay, I've been in the middle of a big move so things have been kind of crazy. That sounds very odd; has she experienced any crashes since? Unfortunately I don't have an OSX machine either, so I don't know what diagnostics you could run there.

We still don't have much in the way of statistics for OSX on Nightly, but on Windows we're getting something like a 50/50 split between corruption in the temporary buffer and corruption in the new buffer, which means we're probably looking at hardware failure.
Flags: needinfo?(emanuel.hoogeveen)
One thing I notice in our jemalloc is https://searchfox.org/mozilla-central/rev/15ce5cb2db0c85abbabe39a962b0e697c9ef098f/memory/build/mozjemalloc.cpp#1646-1659 .

Under OSX, when dealing with chunks larger than 128kB, we use |vm_copy| instead of the |memcpy| we use in every other scenario. While there is no obvious reason to doubt that syscall, it does introduce different performance behavior, since it sets up copy-on-write pages instead of copying immediately. This could vary the timing enough to make an existing problem appear more often only on OSX.
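
For context, the copy path in question looks roughly like this (a paraphrase, not a verbatim copy of the linked code; VM_COPY_MIN is the ~128kB threshold mentioned above):

#include <cstddef>
#include <cstdint>
#include <cstring>      // memcpy
#ifdef __APPLE__
#include <mach/mach.h>  // vm_copy, mach_task_self
#endif

// On Darwin, copies of at least vmCopyMin bytes go through vm_copy, which
// maps the source pages copy-on-write; everything else uses plain memcpy.
static void copyForRealloc(void* dest, const void* src, size_t size,
                           size_t vmCopyMin) {
#ifdef __APPLE__
  if (size >= vmCopyMin) {
    vm_copy(mach_task_self(), (vm_address_t)(uintptr_t)src, (vm_size_t)size,
            (vm_address_t)(uintptr_t)dest);
    return;
  }
#else
  (void)vmCopyMin;  // the threshold only matters on Darwin
#endif
  memcpy(dest, src, size);
}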
:O

It turns out that realloc can cause the |vm_copy| code to be called with a 'size' parameter that isn't a multiple of the page size. This is clearly not acceptable for page map operations. We don't check vm_copy's return code either. I'm not sure what happens when you violate that, but it can't be good.

Now, in practice it seems we don't hit this ugly size parameter case, but all it needs is one realloc user to do it once before things go off the rails.

> printf("!!! DEBUGGING REALLOC !!!\n");
> void* x = malloc(1024*1024); 
> void* y = malloc(1024*1024);
> printf("%p %p\n", x, y);
> x = realloc(x, 128*1024 + 16);
> printf("%p %p\n", x, y);
> free(x);
> free(y);
> printf("!!! DONE DEBUGGING REALLOC !!!!\n");

This example trips my assertion that the size passed to vm_copy is a multiple of the page size. I don't currently have an OSX build environment, so I can't determine how exactly the real syscall fails.
Great find!
So I did a test on the latest Xcode and OSX 10.12, and |vm_copy| actually handles the size correctly. This could be an updated usermode library that just uses memcpy in that case. It's possible our release build sees something different, or this could just be chasing shadows.
Doing some digging through older XNU sources, it appears that vm_copy has supported this non-page-aligned case since before 10.0. We should probably still clean up and check the return value, but it now seems less likely to be the source of the problem. Ah, well.
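
The cleanup being suggested would just stop ignoring vm_copy's result, something like this (a sketch, assuming we'd rather crash loudly than continue after a failed or partial copy):

#include <cstddef>
#include <cstdint>
#include <cstring>      // memcpy
#ifdef __APPLE__
#include <mach/mach.h>  // vm_copy, mach_task_self, KERN_SUCCESS
#endif
#include "mozilla/Assertions.h"  // MOZ_CRASH

static void checkedCopy(void* dest, const void* src, size_t size) {
#ifdef __APPLE__
  kern_return_t kr = vm_copy(mach_task_self(), (vm_address_t)(uintptr_t)src,
                             (vm_size_t)size, (vm_address_t)(uintptr_t)dest);
  if (kr != KERN_SUCCESS)
    MOZ_CRASH("vm_copy failed");
#else
  memcpy(dest, src, size);
#endif
}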
Assignee: jdemooij → nobody
Status: ASSIGNED → NEW
Whiteboard: [#jsapi:crashes-retriage]
Looking at the state of this problem today: half the current crashes are from FF48, which is apparently the last supported Firefox for ancient versions of OSX [1].

If I filter to look at BuildIDs from 2018, I still see OSX having more than its share of crashes.

[1] https://support.mozilla.org/en-US/questions/1200697
Major drop in crash rate ~August 26, about 1/4 what it was 4 months ago. I don't see why the big drop - still ~50% of crashes are version 45
https://crash-stats.mozilla.com/signature/?signature=js%3A%3Ajit%3A%3AAssemblerX86Shared%3A%3Abind&date=%3E%3D2018-08-06T05%3A08%3A49.000Z&date=%3C2018-09-06T05%3A08%3A49.000Z#summary
Flags: needinfo?(jdemooij)

(In reply to Wayne Mery (:wsmwk) from comment #145)

> Major drop in crash rate ~August 26, about 1/4 what it was 4 months ago. I
> don't see why the big drop - still ~50% of crashes are version 45
> https://crash-stats.mozilla.com/signature/?signature=js%3A%3Ajit%3A%3AAssemblerX86Shared%3A%3Abind&date=%3E%3D2018-08-06T05%3A08%3A49.000Z&date=%3C2018-09-06T05%3A08%3A49.000Z#summary

and also for https://crash-stats.mozilla.com/signature/?signature=js%3A%3Ajit%3A%3AAssemblerX86Shared%3A%3AexecutableCopy&product=Firefox&platform=Mac%20OS%20X&date=%3E%3D2018-08-02T22%3A36%3A44.000Z&date=%3C2019-02-02T21%3A36%3A44.000Z&_sort=-date#graphs

Keywords: topcrash-mac

The leave-open keyword is there and there is no activity for 6 months.
:sdetar, maybe it's time to close this bug?

Flags: needinfo?(sdetar)

We should still leave this open longer; there are still crashes occurring.

Flags: needinfo?(sdetar)

The leave-open keyword is there and there is no activity for 6 months.
:sdetar, maybe it's time to close this bug?

Flags: needinfo?(sdetar)

From what I understand, we tried to fix this issue but did not find any angle from which we could make more progress.
Therefore, I am going to mark it as stalled.

(In reply to Release mgmt bot [:sylvestre / :calixte / :marco for bugbug] from comment #149)

> The leave-open keyword is there and there is no activity for 6 months.
> :sdetar, maybe it's time to close this bug?

The leave-open keyword makes sense, as the patches that landed with this bug number were dedicated only to the investigation work, which did not lead to any actionable items.

Ted, do you confirm? Or is there more to be done on this issue?

Flags: needinfo?(sdetar) → needinfo?(tcampbell)
Keywords: stalled
Severity: critical → S2

Since the crash volume is low (less than 5 per week), the severity is downgraded to S3. Feel free to change it back if you think the bug is still critical.

For more information, please visit auto_nag documentation.

Severity: S2 → S3
Crash Signature: [@ js::jit::AssemblerX86Shared::bind(js::jit::Label*)] [@ js::jit::AssemblerX86Shared::bind] [@ js::ProtectedReallocPolicy::maybe_pod_realloc<T>] → [@ js::jit::AssemblerX86Shared::bind] [@ js::jit::AssemblerX86Shared::bind] [@ js::ProtectedReallocPolicy::maybe_pod_realloc<T>]