Closed Bug 1100485 Opened 10 years ago Closed 10 years ago

Intermittent OSX 10.6 tp5o | application crashed [@ 0x7fffffe0131f]

Categories

(Core :: Graphics, defect)

(Hardware: x86_64, OS: macOS, Priority: Not set, Severity: normal)

Tracking

RESOLVED FIXED
mozilla36
Tracking Status
firefox34 --- unaffected
firefox35 --- unaffected
firefox36 --- fixed
firefox-esr31 --- unaffected

People

(Reporter: RyanVM, Assigned: ehoogeveen)

Details

(Keywords: crash, intermittent-failure)

Attachments

(1 file)

Attempting to bisect this down, but it appears to *NOT* be machine-related.

10:46:08     INFO -  PROCESS-CRASH | tp5o | application crashed [@ 0x7fffffe0131f]
10:46:08     INFO -  Crash dump filename: /var/folders/uJ/uJR2ld1oHOCIfMB5ZMrr8++++-k/-Tmp-/tmpvdRhSz/profile/minidumps/A8BEE220-DEDA-4E57-ACB1-647EF52A20E9.dmp
10:46:08     INFO -  Operating system: Mac OS X
10:46:08     INFO -                    10.6.8 10K549
10:46:08     INFO -  CPU: amd64
10:46:08     INFO -       family 6 model 23 stepping 10
10:46:08     INFO -       2 CPUs
10:46:08     INFO -  Crash reason:  EXC_BAD_ACCESS / KERN_PROTECTION_FAILURE
10:46:08     INFO -  Crash address: 0x4ee00000
10:46:08     INFO -  Thread 22 (crashed)
10:46:08     INFO -   0  0x7fffffe0131f
10:46:08     INFO -      rbx = 0x0000000000000002   r12 = 0x000000012327fb80
10:46:08     INFO -      r13 = 0x00007fff89801834   r14 = 0x0000000000000fc4
10:46:08     INFO -      r15 = 0x000000011f3bc938   rip = 0x00007fffffe0131f
10:46:08     INFO -      rsp = 0x000000012327f9b8   rbp = 0x000000012327f9b8
10:46:08     INFO -      Found by: given as instruction pointer in context
10:46:08     INFO -   1  libGLImage.dylib + 0x111e
10:46:08     INFO -      rip = 0x00007fff8980011f   rsp = 0x000000012327f9e0
10:46:08     INFO -      Found by: stack scanning
10:46:08     INFO -   2  XUL!mozilla::layers::ContainerLayerProperties::ComputeChangeInternal(void (*)(mozilla::layers::ContainerLayer*, nsIntRegion const&), bool&) [nsTHashtable.h:9e7225138b7d : 402 + 0x4]
10:46:08     INFO -      rip = 0x0000000101a5fc66   rsp = 0x000000012327fa20
10:46:08     INFO -      Found by: stack scanning
10:46:08     INFO -   3  XUL!mozilla::layers::AsyncCompositionManager::ApplyAsyncContentTransformToTree(mozilla::layers::Layer*) [AsyncCompositionManager.cpp:9e7225138b7d : 657 + 0x7]
10:46:08     INFO -      rip = 0x0000000101ab2936   rsp = 0x000000012327fa30
10:46:08     INFO -      Found by: stack scanning
10:46:08     INFO -   4  GLEngine + 0x13efa
10:46:08     INFO -      rip = 0x0000000127abfefb   rsp = 0x000000012327fa90
10:46:08     INFO -      Found by: stack scanning
10:46:08     INFO -   5  GLEngine + 0x15464
10:46:08     INFO -      rip = 0x0000000127ac1465   rsp = 0x000000012327faf0
10:46:08     INFO -      Found by: stack scanning
10:46:08     INFO -   6  libmozglue.dylib!huge_palloc [jemalloc.c:9e7225138b7d : 1683 + 0x7]
10:46:08     INFO -      rip = 0x00000001000188bd   rsp = 0x000000012327fb80
10:46:08     INFO -      Found by: stack scanning
10:46:08     INFO -   7  libSystem.B.dylib + 0x4989
10:46:08     INFO -      rip = 0x00007fff83e2d98a   rsp = 0x000000012327fbf0
10:46:08     INFO -      Found by: stack scanning
10:46:08     INFO -   8  GLEngine + 0x12c1d
10:46:08     INFO -      rip = 0x0000000127abec1e   rsp = 0x000000012327fc70
10:46:08     INFO -      Found by: stack scanning
Looks like it's from this merge:
https://hg.mozilla.org/mozilla-central/pushloghtml?changeset=acbd7b68fa8c

Which is fun, because mchang's patch in that push got backed out for tsvgr crashes with the same signature (among other things).

Unfortunately, that's probably as far as I'm going to be able to take this bug as I'm out for the rest of the week.
I downloaded a dump from one of the many retriggers on a push from before my commit (https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=2caae1e33648). It has the same crash stack as bug 552020.
The stacks in this bug make no sense. jemalloc shows up in them, but the frames in between could never be called from jemalloc; in particular, there are a lot of OpenGL-related frames.

As best I can figure from the stacks, we're crashing in the call at [1], unlocking the huge chunk lock. But that makes no sense - we're unconditionally taking the lock a few lines up, so releasing it should be fine.

I'm going to need help figuring this out - I don't have a Mac and I'm ashamed to say that debugging using an actual debugger is a mystery to me. I could try some printf debugging on try, but I have no idea what to even check in this case.

[1] http://dxr.mozilla.org/mozilla-central/source/memory/mozjemalloc/jemalloc.c#5117
If I had to take a guess, I'd say this smells very much like an existing use of uninitialized memory in system libraries that's now triggered more often than before because of the jemalloc behavior change. One way to "validate" this theory is to try reproducing those crashes with opt_zero enabled, or with pages_purge actively memsetting the pages to 0 on Mac (or remapping them, or whatever).

It wouldn't be the first time that system libraries on OS X fail to initialize memory.
Ah, good thought. There are a few calls to huge_malloc and huge_palloc that pass false for the zero parameter. Since mmap *always* zeroes on allocation, this didn't use to matter. However, with MALLOC_RECYCLE we can allocate out of previously used chunks that have been MADV_FREEd, and MADV_FREE *doesn't* guarantee zeroed pages (MADV_DONTNEED *does*).

So it's entirely possible that either a call to huge_malloc or a call to huge_palloc isn't getting zeroed where it was implicitly being zeroed before. I'll push some changes to try to see if explicitly zeroing in calls to huge_malloc and/or huge_palloc helps.
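
A minimal sketch of the zeroing concern described above, not the mozjemalloc code. Fresh anonymous mmap memory is zero-filled by the kernel, but a chunk handed back out of a recycle cache may still hold its old bytes after MADV_FREE, so a caller that asked for zeroed memory needs an explicit memset. The names chunk_alloc_sketch and chunk_recycle_sketch, the 1 MiB chunk size and the single-slot cache are all hypothetical and chosen only for illustration.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1 /* exposes madvise/MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MAP_ANONYMOUS
#define MAP_ANONYMOUS MAP_ANON /* older BSD / OS X spelling */
#endif

#define SKETCH_CHUNK_SIZE ((size_t)1 << 20) /* 1 MiB, illustrative only */

/* Hypothetical single-slot recycle cache (assumption, not the real layout). */
static void *recycled_chunk = NULL;

/* Hand a chunk back to the cache instead of unmapping it. */
static void chunk_recycle_sketch(void *chunk)
{
    assert(chunk != NULL);
#ifdef MADV_FREE
    /* The kernel may reclaim these pages lazily; until it does, the old
     * contents can still be visible. The return value is ignored because
     * the hint is optional for this sketch. */
    (void)madvise(chunk, SKETCH_CHUNK_SIZE, MADV_FREE);
#endif
    recycled_chunk = chunk;
}

/* Allocate a chunk. If `zero` is nonzero the result is zero-filled. */
static void *chunk_alloc_sketch(int zero)
{
    if (recycled_chunk != NULL) {
        void *chunk = recycled_chunk;
        recycled_chunk = NULL;
        if (zero) {
            /* Recycled memory is not implicitly zeroed, so do it here. */
            memset(chunk, 0, SKETCH_CHUNK_SIZE);
        }
        return chunk;
    }
    /* Fresh anonymous mappings are always zero-filled by the kernel. */
    void *chunk = mmap(NULL, SKETCH_CHUNK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return chunk == MAP_FAILED ? NULL : chunk;
}
```

Calling chunk_alloc_sketch(0) on a recycled chunk returns whatever bytes survived, which is the case the zero parameter has to cover once recycling is turned on.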
Unfortunately that turned out not to help: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=ef770efdb514

I'm trying partial backouts now to see which part is at fault.
Simply backing out part 7 (which turns recycling on) appears to fix it: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=bc78c6a860e6

That doesn't really tell us anything about *why* recycling causes the problem though. I'm trying a few more things to see if I can narrow it down.
Turning off munmap (to see if it's caused by the capped recycling):
https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=b3367b098f71

Not using MADV_FREE (decommitting instead):
https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=e3a399dde8d6

Locking around the commit/decommit calls in the explicit double purge:
https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=607801615239

That last one is probably a good idea anyway (should have thought of it before), but I don't know if the function even gets called. Though with Talos getting RSS from the browser, maybe that's the only place in our testing that it *would* be called.
So not using MADV_FREE doesn't show the failure, and neither does locking around the commit/decommit calls. Good point about the possibility of RSS collected from the browser causing the problem. If we want to change Talos at all, that is an option.

I am glad to see we are coming closer to a fix!
Yep, that was totally it. Don't know why I didn't consider this possibility when I originally wrote that logic.
Assignee: nobody → emanuel.hoogeveen
Status: NEW → ASSIGNED
Attachment #8525421 - Flags: review?(mh+mozilla)
(In reply to Joel Maher (:jmaher) from comment #28)
> So not using MADV_FREE doesn't show the failure, and neither does locking
> around the commit/decommit calls. Good point about the possibility of RSS
> collected from the browser causing the problem. If we want to change Talos
> at all, that is an option.

We shouldn't - if anything, it's great that Talos showed this failure, since there was definitely a problem. It's just unfortunate that it showed up in such a weird way.

I'm not sure why always decommitting also worked, but I don't think it's the right fix - as implemented it was even *broken* in the presence of jemalloc_purge_freed_pages_impl, since that recommits the pages.

Full OS X-only try run with the patch: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=38f8816bd5da
Comment on attachment 8525421
Lock chunks during double purging to avoid racing with allocation.

Review of attachment 8525421:
-----------------------------------------------------------------

Indeed, the extent_tree functions need the lock to be held.
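
The locking requirement can be sketched as follows. This is a simplified illustration, not the attached patch: the chunk_cache struct and the cache_init, cache_put, cache_take and cache_purge_all functions are hypothetical, and a small array stands in for the extent trees. The point is only that the purge pass walks the same shared record of recycled chunks that the allocation path modifies, so it has to hold the same mutex for the whole walk.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define CACHE_SLOTS 4

/* Hypothetical stand-in for the trees of recycled chunks. */
struct chunk_cache {
    pthread_mutex_t lock;     /* guards every field below */
    void *slots[CACHE_SLOTS]; /* recycled chunk addresses, NULL if empty */
    int purged[CACHE_SLOTS];  /* 1 once the purge pass has visited the slot */
};

static void cache_init(struct chunk_cache *c)
{
    assert(c != NULL);
    pthread_mutex_init(&c->lock, NULL);
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        c->slots[i] = NULL;
        c->purged[i] = 0;
    }
}

/* Deallocation path: remember a chunk for reuse. Returns 0 if the cache is full. */
static int cache_put(struct chunk_cache *c, void *chunk)
{
    int stored = 0;
    pthread_mutex_lock(&c->lock);
    for (size_t i = 0; i < CACHE_SLOTS && !stored; i++) {
        if (c->slots[i] == NULL) {
            c->slots[i] = chunk;
            c->purged[i] = 0;
            stored = 1;
        }
    }
    pthread_mutex_unlock(&c->lock);
    return stored;
}

/* Allocation path: take one recycled chunk out of the cache, or NULL. */
static void *cache_take(struct chunk_cache *c)
{
    void *chunk = NULL;
    pthread_mutex_lock(&c->lock);
    for (size_t i = 0; i < CACHE_SLOTS && chunk == NULL; i++) {
        if (c->slots[i] != NULL) {
            chunk = c->slots[i];
            c->slots[i] = NULL;
        }
    }
    pthread_mutex_unlock(&c->lock);
    return chunk;
}

/* Purge pass: visit every cached chunk while holding the lock, so that no
 * allocation can hand out a chunk in the middle of its purge. */
static size_t cache_purge_all(struct chunk_cache *c)
{
    size_t visited = 0;
    pthread_mutex_lock(&c->lock);
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        if (c->slots[i] != NULL) {
            /* A real purge would decommit and recommit the pages here. */
            c->purged[i] = 1;
            visited++;
        }
    }
    pthread_mutex_unlock(&c->lock);
    return visited;
}
```

Without the lock in cache_purge_all, another thread could take a chunk between the check and the purge and then have its pages dropped while in use, which would be consistent with a crash that appears far away from the allocator.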
Attachment #8525421 - Flags: review?(mh+mozilla) → review+
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/5de035fe199a
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla36