1100485 - Intermittent OSX 10.6 tp5o | application crashed [@ 0x7fffffe0131f]

Reporter

Description

11 years ago

Attempting to bisect this down, but it appears to *NOT* be machine-related.

10:46:08 INFO - PROCESS-CRASH | tp5o | application crashed [@ 0x7fffffe0131f]
10:46:08 INFO - Crash dump filename: /var/folders/uJ/uJR2ld1oHOCIfMB5ZMrr8++++-k/-Tmp-/tmpvdRhSz/profile/minidumps/A8BEE220-DEDA-4E57-ACB1-647EF52A20E9.dmp
10:46:08 INFO - Operating system: Mac OS X
10:46:08 INFO - 10.6.8 10K549
10:46:08 INFO - CPU: amd64
10:46:08 INFO - family 6 model 23 stepping 10
10:46:08 INFO - 2 CPUs
10:46:08 INFO - Crash reason: EXC_BAD_ACCESS / KERN_PROTECTION_FAILURE
10:46:08 INFO - Crash address: 0x4ee00000
10:46:08 INFO - Thread 22 (crashed)
10:46:08 INFO - 0 0x7fffffe0131f
10:46:08 INFO - rbx = 0x0000000000000002 r12 = 0x000000012327fb80
10:46:08 INFO - r13 = 0x00007fff89801834 r14 = 0x0000000000000fc4
10:46:08 INFO - r15 = 0x000000011f3bc938 rip = 0x00007fffffe0131f
10:46:08 INFO - rsp = 0x000000012327f9b8 rbp = 0x000000012327f9b8
10:46:08 INFO - Found by: given as instruction pointer in context
10:46:08 INFO - 1 libGLImage.dylib + 0x111e
10:46:08 INFO - rip = 0x00007fff8980011f rsp = 0x000000012327f9e0
10:46:08 INFO - Found by: stack scanning
10:46:08 INFO - 2 XUL!mozilla::layers::ContainerLayerProperties::ComputeChangeInternal(void (*)(mozilla::layers::ContainerLayer*, nsIntRegion const&), bool&) [nsTHashtable.h:9e7225138b7d : 402 + 0x4]
10:46:08 INFO - rip = 0x0000000101a5fc66 rsp = 0x000000012327fa20
10:46:08 INFO - Found by: stack scanning
10:46:08 INFO - 3 XUL!mozilla::layers::AsyncCompositionManager::ApplyAsyncContentTransformToTree(mozilla::layers::Layer*) [AsyncCompositionManager.cpp:9e7225138b7d : 657 + 0x7]
10:46:08 INFO - rip = 0x0000000101ab2936 rsp = 0x000000012327fa30
10:46:08 INFO - Found by: stack scanning
10:46:08 INFO - 4 GLEngine + 0x13efa
10:46:08 INFO - rip = 0x0000000127abfefb rsp = 0x000000012327fa90
10:46:08 INFO - Found by: stack scanning
10:46:08 INFO - 5 GLEngine + 0x15464
10:46:08 INFO - rip = 0x0000000127ac1465 rsp = 0x000000012327faf0
10:46:08 INFO - Found by: stack scanning
10:46:08 INFO - 6 libmozglue.dylib!huge_palloc [jemalloc.c:9e7225138b7d : 1683 + 0x7]
10:46:08 INFO - rip = 0x00000001000188bd rsp = 0x000000012327fb80
10:46:08 INFO - Found by: stack scanning
10:46:08 INFO - 7 libSystem.B.dylib + 0x4989
10:46:08 INFO - rip = 0x00007fff83e2d98a rsp = 0x000000012327fbf0
10:46:08 INFO - Found by: stack scanning
10:46:08 INFO - 8 GLEngine + 0x12c1d
10:46:08 INFO - rip = 0x0000000127abec1e rsp = 0x000000012327fc70
10:46:08 INFO - Found by: stack scanning

Ryan VanderMeulen [:RyanVM]

Reporter

Comment 9

11 years ago

Looks like it's from this merge: https://hg.mozilla.org/mozilla-central/pushloghtml?changeset=acbd7b68fa8c Which is fun because mchang's patch in that push got backed out for tsvgr crashes with the same signature (amongst other things). Unfortunately, that's probably as far as I'm going to be able to take this bug, as I'm out for the rest of the week.

Mason Chang [Inactive] [:mchang]

Comment 10

11 years ago

I downloaded a dump from one of the many retriggers, taken on a revision before my commit (https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=2caae1e33648). It has the same crash stack as bug 552020.

Joel Maher ( :jmaher ) (UTC -8)

Comment 13

11 years ago

This is caused by bug 1073662. We did a bunch of retriggers on tbpl and the pattern is easy to see: https://tbpl.mozilla.org/?tree=Mozilla-Inbound&fromchange=ea818fdbd81c&tochange=9b77a97a378b&jobname=Rev4%20MacOSX%20Snow%20Leopard%2010.6%20mozilla-inbound%20talos%20tp5o

Depends on: 1073662

Emanuel Hoogeveen [:ehoogeveen]

Assignee

Comment 18

11 years ago

The stacks in this bug make no sense. While jemalloc shows up in there, there's other stuff in between that makes no sense (could never be called from jemalloc). Specifically, there's lots of stuff related to OpenGL. As best I can figure from the stacks, we're crashing in the call at [1], unlocking the huge chunk lock. But that makes no sense - we're unconditionally taking the lock a few lines up, so releasing it should be fine. I'm going to need help figuring this out - I don't have a Mac and I'm ashamed to say that debugging using an actual debugger is a mystery to me. I could try some printf debugging on try, but I have no idea what to even check in this case.
[1] http://dxr.mozilla.org/mozilla-central/source/memory/mozjemalloc/jemalloc.c#5117

Mike Hommey [:glandium]

Comment 19

11 years ago

If I had to take a guess, I'd say this smells very much like an existing use of uninitialized memory in system libraries that's now triggered more often than before because of the jemalloc behavior change. One way to "validate" this theory is to try reproducing those crashes with opt_zero enabled, or with pages_purge actively memsetting the pages to 0 on mac (or remapping them, or whatever). It wouldn't be the first time that system libraries on osx fail to initialize memory.

Emanuel Hoogeveen [:ehoogeveen]

Assignee

Comment 20

11 years ago

Ah, good thought. There are a few calls to huge_malloc and huge_palloc that pass false for the zero parameter. Since mmap *always* zeroes on allocation, this didn't use to matter. However with MALLOC_RECYCLE, we can allocate out of previously used chunks that have been MADV_FREEd - and MADV_FREE *doesn't* zero pages (MADV_DONTNEED *does*). So it's entirely possible that either a call to huge_malloc or a call to huge_palloc isn't getting zeroed where it was implicitly being zeroed before. I'll push some changes to try to see if explicitly zeroing in calls to huge_malloc and/or huge_palloc helps.
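
To illustrate the hypothesis in this comment (it is not the actual jemalloc.c code), here is a minimal sketch of an allocator that hands back a recycled chunk and must zero it explicitly when the caller asks for zeroed memory. Every name ending in _sketch is hypothetical, the one-slot cache is a stand-in for the real chunk recycler, and calloc stands in for mmap returning fresh zeroed pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical one-slot recycle cache; the real recycler tracks many chunks. */
static void  *recycled_chunk_sketch = NULL;
static size_t recycled_size_sketch  = 0;

/* Keep a released chunk for reuse. As with MADV_FREE in the scenario
 * described above, the old contents are assumed to stay in place. */
static void chunk_recycle_sketch(void *chunk, size_t size)
{
    assert(chunk != NULL);
    free(recycled_chunk_sketch);          /* drop any older cached chunk */
    recycled_chunk_sketch = chunk;
    recycled_size_sketch  = size;
}

/* Allocate a chunk, preferring the recycled one. A recycled chunk may hold
 * stale data, so it is cleared explicitly when `zero` is requested. */
static void *chunk_alloc_sketch(size_t size, bool zero)
{
    if (recycled_chunk_sketch != NULL && recycled_size_sketch >= size) {
        void *ret = recycled_chunk_sketch;
        recycled_chunk_sketch = NULL;
        recycled_size_sketch  = 0;
        if (zero)
            memset(ret, 0, size);         /* no implicit zeroing on this path */
        return ret;
    }
    /* Fresh memory: calloc stands in for mmap, which returns zeroed pages. */
    return calloc(1, size);
}
```

With zero set to false, a caller that silently relied on fresh mappings being zeroed would see stale bytes on the recycled path, which is the failure mode this comment suspected.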

Emanuel Hoogeveen [:ehoogeveen]

Assignee

Comment 22

11 years ago

Unfortunately that turned out not to help: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=ef770efdb514 I'm trying partial backouts now to see which part is at fault.

Emanuel Hoogeveen [:ehoogeveen]

Assignee

Comment 26

11 years ago

Simply backing out part 7 (which turns recycling on) appears to fix it: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=bc78c6a860e6 That doesn't really tell us anything about *why* recycling causes the problem though. I'm trying a few more things to see if I can narrow it down.

Emanuel Hoogeveen [:ehoogeveen]

Assignee

Comment 27

11 years ago

Turning off munmap (to see if it's caused by the capped recycling): https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=b3367b098f71
Not using MADV_FREE (decommitting instead): https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=e3a399dde8d6
Locking around the commit/decommit calls in the explicit double purge: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=607801615239
That last one is probably a good idea anyway (should have thought of it before), but I don't know if the function even gets called. Though with Talos getting RSS from the browser, maybe that's the only place in our testing that it *would* be called.

Joel Maher ( :jmaher ) (UTC -8)

Comment 28

11 years ago

So not using MADV_FREE doesn't show the failure, nor does locking around the commit/decommit calls. A good point about the possibility of RSS collected from the browser causing the problem. If we want to change talos at all, that is an option; I am glad to see we are coming closer to a fix!

Emanuel Hoogeveen [:ehoogeveen]

Assignee

Comment 29

11 years ago

Attached patch: Lock chunks during double purging to avoid racing with allocation.

Yep, that was totally it. Don't know why I didn't consider this possibility when I originally wrote that logic.
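
The idea behind the patch can be sketched as follows. This is an illustration under stated assumptions, not the attached patch: all names ending in _sketch are hypothetical, a small array stands in for the allocator's chunk tree, and setting a flag stands in for the real decommit/recommit of pages. The point is only that the purge walk and the allocation path take the same mutex, so the purge never touches a chunk that another thread is removing at the same time.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_CHUNKS_SKETCH 8

/* Hypothetical record for one recycled chunk. */
typedef struct {
    void  *addr;
    size_t size;
    int    purged;   /* set once the purge pass has visited this chunk */
} chunk_node_sketch;

/* One lock guards the shared chunk bookkeeping for both code paths. */
static pthread_mutex_t   chunks_mtx_sketch = PTHREAD_MUTEX_INITIALIZER;
static chunk_node_sketch recycled_sketch[MAX_CHUNKS_SKETCH];
static size_t            recycled_count_sketch = 0;

/* Deallocation path: record a chunk for reuse. Returns 1 on success. */
static int recycle_push_sketch(void *addr, size_t size)
{
    int ok = 0;
    pthread_mutex_lock(&chunks_mtx_sketch);
    if (recycled_count_sketch < MAX_CHUNKS_SKETCH) {
        chunk_node_sketch node = { addr, size, 0 };
        recycled_sketch[recycled_count_sketch++] = node;
        ok = 1;
    }
    pthread_mutex_unlock(&chunks_mtx_sketch);
    return ok;
}

/* Allocation path: take a chunk back out. Returns 1 if one was available. */
static int recycle_pop_sketch(chunk_node_sketch *out)
{
    int found = 0;
    assert(out != NULL);
    pthread_mutex_lock(&chunks_mtx_sketch);
    if (recycled_count_sketch > 0) {
        *out = recycled_sketch[--recycled_count_sketch];
        found = 1;
    }
    pthread_mutex_unlock(&chunks_mtx_sketch);
    return found;
}

/* Purge path: walk every recycled chunk while holding the same lock, so the
 * walk cannot race with recycle_pop_sketch or recycle_push_sketch. */
static size_t purge_recycled_sketch(void)
{
    size_t visited = 0;
    pthread_mutex_lock(&chunks_mtx_sketch);
    for (size_t i = 0; i < recycled_count_sketch; i++) {
        /* Real code would decommit and recommit the chunk's pages here. */
        recycled_sketch[i].purged = 1;
        visited++;
    }
    pthread_mutex_unlock(&chunks_mtx_sketch);
    return visited;
}
```

Holding the lock for the whole walk trades some allocation latency during a purge for correctness; since the purge is reportedly triggered only rarely (for example when RSS is measured), that seems a reasonable trade.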

Assignee: nobody → emanuel.hoogeveen

Status: NEW → ASSIGNED

Attachment #8525421 - Flags: review?(mh+mozilla)

Emanuel Hoogeveen [:ehoogeveen]

Assignee

Comment 30

11 years ago

(In reply to Joel Maher (:jmaher) from comment #28)
> so not using MADV_FREE doesn't show failure, nor does the locking
> commit/decommit calls. A good point about the possibility of RSS collected
> from the browser causing the problem. If we want to change talos at all,
> that is an option;

We shouldn't - if anything, it's great that Talos showed this failure, since there was definitely a problem. It's just unfortunate that it showed up in such a weird way. I'm not sure why always decommitting also worked, but I don't think it's the right fix - as implemented it was even *broken* in the presence of jemalloc_purge_freed_pages_impl, since that recommits the pages.

Full OS X-only try run with the patch: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=38f8816bd5da

Mike Hommey [:glandium]

Comment 33

11 years ago

Comment on attachment 8525421: Lock chunks during double purging to avoid racing with allocation.

Review of attachment 8525421:

Indeed, the extent_tree functions need the lock to be held.

Attachment #8525421 - Flags: review?(mh+mozilla) → review+

Emanuel Hoogeveen [:ehoogeveen]

Assignee

Updated

11 years ago

Keywords: checkin-needed

Carsten Book [:Tomcat]

Comment 34

11 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/5de035fe199a

Keywords: checkin-needed

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 35

11 years ago

https://hg.mozilla.org/mozilla-central/rev/5de035fe199a

Status: ASSIGNED → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

Target Milestone: --- → mozilla36

Ryan VanderMeulen [:RyanVM]

Reporter

Updated

11 years ago

status-firefox34: --- → unaffected

status-firefox35: --- → unaffected

status-firefox36: --- → fixed

status-firefox-esr31: --- → unaffected