Closed Bug 1100485 Opened 5 years ago Closed 5 years ago
Intermittent OSX 10.6 tp5o | application crashed [@ 0x7fffffe0131f]
Attempting to bisect this down, but it appears to *NOT* be machine-related.

10:46:08 INFO - PROCESS-CRASH | tp5o | application crashed [@ 0x7fffffe0131f]
10:46:08 INFO - Crash dump filename: /var/folders/uJ/uJR2ld1oHOCIfMB5ZMrr8++++-k/-Tmp-/tmpvdRhSz/profile/minidumps/A8BEE220-DEDA-4E57-ACB1-647EF52A20E9.dmp
10:46:08 INFO - Operating system: Mac OS X
10:46:08 INFO - 10.6.8 10K549
10:46:08 INFO - CPU: amd64
10:46:08 INFO - family 6 model 23 stepping 10
10:46:08 INFO - 2 CPUs
10:46:08 INFO - Crash reason: EXC_BAD_ACCESS / KERN_PROTECTION_FAILURE
10:46:08 INFO - Crash address: 0x4ee00000
10:46:08 INFO - Thread 22 (crashed)
10:46:08 INFO - 0 0x7fffffe0131f
10:46:08 INFO - rbx = 0x0000000000000002 r12 = 0x000000012327fb80
10:46:08 INFO - r13 = 0x00007fff89801834 r14 = 0x0000000000000fc4
10:46:08 INFO - r15 = 0x000000011f3bc938 rip = 0x00007fffffe0131f
10:46:08 INFO - rsp = 0x000000012327f9b8 rbp = 0x000000012327f9b8
10:46:08 INFO - Found by: given as instruction pointer in context
10:46:08 INFO - 1 libGLImage.dylib + 0x111e
10:46:08 INFO - rip = 0x00007fff8980011f rsp = 0x000000012327f9e0
10:46:08 INFO - Found by: stack scanning
10:46:08 INFO - 2 XUL!mozilla::layers::ContainerLayerProperties::ComputeChangeInternal(void (*)(mozilla::layers::ContainerLayer*, nsIntRegion const&), bool&) [nsTHashtable.h:9e7225138b7d : 402 + 0x4]
10:46:08 INFO - rip = 0x0000000101a5fc66 rsp = 0x000000012327fa20
10:46:08 INFO - Found by: stack scanning
10:46:08 INFO - 3 XUL!mozilla::layers::AsyncCompositionManager::ApplyAsyncContentTransformToTree(mozilla::layers::Layer*) [AsyncCompositionManager.cpp:9e7225138b7d : 657 + 0x7]
10:46:08 INFO - rip = 0x0000000101ab2936 rsp = 0x000000012327fa30
10:46:08 INFO - Found by: stack scanning
10:46:08 INFO - 4 GLEngine + 0x13efa
10:46:08 INFO - rip = 0x0000000127abfefb rsp = 0x000000012327fa90
10:46:08 INFO - Found by: stack scanning
10:46:08 INFO - 5 GLEngine + 0x15464
10:46:08 INFO - rip = 0x0000000127ac1465 rsp = 0x000000012327faf0
10:46:08 INFO - Found by: stack scanning
10:46:08 INFO - 6 libmozglue.dylib!huge_palloc [jemalloc.c:9e7225138b7d : 1683 + 0x7]
10:46:08 INFO - rip = 0x00000001000188bd rsp = 0x000000012327fb80
10:46:08 INFO - Found by: stack scanning
10:46:08 INFO - 7 libSystem.B.dylib + 0x4989
10:46:08 INFO - rip = 0x00007fff83e2d98a rsp = 0x000000012327fbf0
10:46:08 INFO - Found by: stack scanning
10:46:08 INFO - 8 GLEngine + 0x12c1d
10:46:08 INFO - rip = 0x0000000127abec1e rsp = 0x000000012327fc70
10:46:08 INFO - Found by: stack scanning
Looks like it's from this merge: https://hg.mozilla.org/mozilla-central/pushloghtml?changeset=acbd7b68fa8c Which is fun because mchang in that push got backed out for tsvgr crashes with the same signature (amongst other things). Unfortunately, that's probably as far as I'm going to be able to take this bug as I'm out for the rest of the week.
I downloaded a dump from one of the many retriggers but before my commit - https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=2caae1e33648 - It has the same crash stack as bug 552020.
This is caused by bug 1073662. We did a bunch of retriggers on tbpl and the pattern is easy to see: https://tbpl.mozilla.org/?tree=Mozilla-Inbound&fromchange=ea818fdbd81c&tochange=9b77a97a378b&jobname=Rev4%20MacOSX%20Snow%20Leopard%2010.6%20mozilla-inbound%20talos%20tp5o
Depends on: 1073662
The stacks in this bug make no sense. While jemalloc shows up in there, there's other stuff in between that could never be called from jemalloc. Specifically, there's lots of stuff related to OpenGL. As best I can figure from the stacks, we're crashing in the call that unlocks the huge chunk lock (http://dxr.mozilla.org/mozilla-central/source/memory/mozjemalloc/jemalloc.c#5117). But that makes no sense either - we're unconditionally taking the lock a few lines up, so releasing it should be fine. I'm going to need help figuring this out - I don't have a Mac and I'm ashamed to say that debugging using an actual debugger is a mystery to me. I could try some printf debugging on try, but I have no idea what to even check in this case.
If I had to take a guess, I'd say this smells very much like an existing use of uninitialized memory in system libraries that's now triggered more often than before because of the jemalloc behavior change. One way to "validate" this theory is to try reproducing those crashes with opt_zero enabled, or with pages_purge actively memsetting the pages to 0 on mac (or remapping them, or whatever). It wouldn't be the first time that system libraries on osx fail to initialize memory.
Ah, good thought. There are a few calls to huge_malloc and huge_palloc that pass false for the zero parameter. Since mmap *always* zeroes on allocation, this didn't used to matter. However with MALLOC_RECYCLE, we can allocate out of previously used chunks that have been MADV_FREEd - and MADV_FREE *doesn't* zero pages (MADV_DONTNEED *does*). So it's entirely possible that either a call to huge_malloc or a call to huge_palloc isn't getting zeroed where it was implicitly being zeroed before. I'll push some changes to try to see if explicitly zeroing in calls to huge_malloc and/or huge_palloc helps.
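A minimal C sketch of the hypothesis above, not the actual mozjemalloc code: a recycled chunk still holds whatever its previous user wrote, so the allocator has to clear it explicitly when the caller asks for zeroed memory. The names `chunk_acquire`, `chunk_release`, `CHUNK_SIZE` and the one-slot cache are hypothetical, and `calloc` merely stands in for a fresh, zero-filled mmap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical chunk size; the real allocator uses much larger chunks. */
#define CHUNK_SIZE 4096

/* Hypothetical one-slot cache standing in for the chunk recycling logic. */
static void *recycled_chunk = NULL;

/* Return a chunk to the cache without clearing it, which is roughly the
 * state MADV_FREE can leave pages in: old contents may still be there. */
static void chunk_release(void *chunk)
{
    assert(chunk != NULL);
    if (recycled_chunk == NULL) {
        recycled_chunk = chunk;
    } else {
        free(chunk);
    }
}

/* Hand out a chunk. When `zero` is true the caller is promised zeroed
 * memory, so a recycled chunk must be cleared explicitly. */
static void *chunk_acquire(bool zero)
{
    void *chunk = recycled_chunk;
    if (chunk != NULL) {
        recycled_chunk = NULL;
        if (zero) {
            memset(chunk, 0, CHUNK_SIZE);
        }
        return chunk;
    }
    /* Fresh memory: calloc stands in for mmap, which returns zeroed pages. */
    return calloc(1, CHUNK_SIZE);
}
```

With `zero` set to false, a caller that silently relied on fresh mappings being zeroed would start seeing stale bytes as soon as recycling kicks in.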
Unfortunately that turned out not to help: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=ef770efdb514 I'm trying partial backouts now to see which part is at fault.
Simply backing out part 7 (which turns recycling on) appears to fix it: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=bc78c6a860e6 That doesn't really tell us anything about *why* recycling causes the problem though. I'm trying a few more things to see if I can narrow it down.
Turning off munmap (to see if it's caused by the capped recycling): https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=b3367b098f71
Not using MADV_FREE (decommitting instead): https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=e3a399dde8d6
Locking around the commit/decommit calls in the explicit double purge: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=607801615239

That last one is probably a good idea anyway (should have thought of it before), but I don't know if the function even gets called. Though with Talos getting RSS from the browser, maybe that's the only place in our testing that it *would* be called.
So not using MADV_FREE doesn't show the failure, and neither does locking around the commit/decommit calls. Good point about the possibility that collecting RSS from the browser causes the problem. If we want to change talos at all, that is an option; I am glad to see we are coming closer to a fix!
Yep, that was totally it. Don't know why I didn't consider this possibility when I originally wrote that logic.
Assignee: nobody → emanuel.hoogeveen
Status: NEW → ASSIGNED
Attachment #8525421 - Flags: review?(mh+mozilla)
(In reply to Joel Maher (:jmaher) from comment #28)
> so not using MADV_FREE doesn't show failure, nor does the locking
> commit/decommit calls. A good point about the possibility of RSS collected
> from the browser causing the problem. If we want to change talos at all,
> that is an option;

We shouldn't - if anything, it's great that Talos showed this failure, since there was definitely a problem. It's just unfortunate that it showed up in such a weird way. I'm not sure why always decommitting also worked, but I don't think it's the right fix - as implemented it was even *broken* in the presence of jemalloc_purge_freed_pages_impl, since that recommits the pages.

Full OS X-only try run with the patch: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=38f8816bd5da
Comment on attachment 8525421: Lock chunks during double purging to avoid racing with allocation.

Review of attachment 8525421:
-----------------------------------------------------------------
Indeed, the extent_tree functions need the lock to be held.
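For readers following along, a rough C sketch of the locking pattern the patch title describes. It is an illustration under assumptions, not the reviewed patch: `chunk_node`, `chunks_mtx`, `recycle_push`, `recycle_pop` and `purge_recycled` are hypothetical names, a linked list stands in for the extent trees, and the decommit/recommit step is reduced to flipping a flag. The point is only that the purge walk and the allocation path take the same mutex.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical record for a recycled chunk; not the mozjemalloc type. */
typedef struct chunk_node {
    struct chunk_node *next;
    int committed; /* 1 while the pages are committed */
    int purges;    /* how many times this chunk was double purged */
} chunk_node;

/* One mutex guards the shared structure for allocation and purging alike. */
static pthread_mutex_t chunks_mtx = PTHREAD_MUTEX_INITIALIZER;
static chunk_node *recycled_chunks = NULL;

/* Deallocation path: make a chunk available for recycling. */
static void recycle_push(chunk_node *node)
{
    assert(node != NULL);
    pthread_mutex_lock(&chunks_mtx);
    node->committed = 1;
    node->next = recycled_chunks;
    recycled_chunks = node;
    pthread_mutex_unlock(&chunks_mtx);
}

/* Allocation path: take a chunk out of the recycled set, or NULL if empty. */
static chunk_node *recycle_pop(void)
{
    pthread_mutex_lock(&chunks_mtx);
    chunk_node *node = recycled_chunks;
    if (node != NULL) {
        recycled_chunks = node->next;
        node->next = NULL;
    }
    pthread_mutex_unlock(&chunks_mtx);
    return node;
}

/* Double purge: decommit and recommit every recycled chunk. Holding the
 * lock for the whole walk keeps recycle_pop from handing out a chunk
 * while its pages are being decommitted. */
static size_t purge_recycled(void)
{
    size_t visited = 0;
    pthread_mutex_lock(&chunks_mtx);
    for (chunk_node *node = recycled_chunks; node != NULL; node = node->next) {
        node->committed = 0; /* stand-in for the decommit call */
        node->committed = 1; /* stand-in for the recommit call */
        node->purges++;
        visited++;
    }
    pthread_mutex_unlock(&chunks_mtx);
    return visited;
}
```

Without the lock in `purge_recycled`, another thread could pop a chunk mid-walk and touch pages that are momentarily decommitted, which would fit the protection fault seen in the crash reports.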
Attachment #8525421 - Flags: review?(mh+mozilla) → review+
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla36