Closed
Bug 1100485
Opened 10 years ago
Closed 10 years ago
Intermittent OSX 10.6 tp5o | application crashed [@ 0x7fffffe0131f]
Categories
(Core :: Graphics, defect)
RESOLVED
FIXED
mozilla36
Version | Tracking | Status |
---|---|---|
firefox34 | --- | unaffected |
firefox35 | --- | unaffected |
firefox36 | --- | fixed |
firefox-esr31 | --- | unaffected |
People
(Reporter: RyanVM, Assigned: ehoogeveen)
References
Details
(Keywords: crash, intermittent-failure)
Attachments
(1 file)
1.01 KB, patch | glandium: review+
Attempting to bisect this down, but it appears to *NOT* be machine-related.
10:46:08 INFO - PROCESS-CRASH | tp5o | application crashed [@ 0x7fffffe0131f]
10:46:08 INFO - Crash dump filename: /var/folders/uJ/uJR2ld1oHOCIfMB5ZMrr8++++-k/-Tmp-/tmpvdRhSz/profile/minidumps/A8BEE220-DEDA-4E57-ACB1-647EF52A20E9.dmp
10:46:08 INFO - Operating system: Mac OS X
10:46:08 INFO - 10.6.8 10K549
10:46:08 INFO - CPU: amd64
10:46:08 INFO - family 6 model 23 stepping 10
10:46:08 INFO - 2 CPUs
10:46:08 INFO - Crash reason: EXC_BAD_ACCESS / KERN_PROTECTION_FAILURE
10:46:08 INFO - Crash address: 0x4ee00000
10:46:08 INFO - Thread 22 (crashed)
10:46:08 INFO - 0 0x7fffffe0131f
10:46:08 INFO - rbx = 0x0000000000000002 r12 = 0x000000012327fb80
10:46:08 INFO - r13 = 0x00007fff89801834 r14 = 0x0000000000000fc4
10:46:08 INFO - r15 = 0x000000011f3bc938 rip = 0x00007fffffe0131f
10:46:08 INFO - rsp = 0x000000012327f9b8 rbp = 0x000000012327f9b8
10:46:08 INFO - Found by: given as instruction pointer in context
10:46:08 INFO - 1 libGLImage.dylib + 0x111e
10:46:08 INFO - rip = 0x00007fff8980011f rsp = 0x000000012327f9e0
10:46:08 INFO - Found by: stack scanning
10:46:08 INFO - 2 XUL!mozilla::layers::ContainerLayerProperties::ComputeChangeInternal(void (*)(mozilla::layers::ContainerLayer*, nsIntRegion const&), bool&) [nsTHashtable.h:9e7225138b7d : 402 + 0x4]
10:46:08 INFO - rip = 0x0000000101a5fc66 rsp = 0x000000012327fa20
10:46:08 INFO - Found by: stack scanning
10:46:08 INFO - 3 XUL!mozilla::layers::AsyncCompositionManager::ApplyAsyncContentTransformToTree(mozilla::layers::Layer*) [AsyncCompositionManager.cpp:9e7225138b7d : 657 + 0x7]
10:46:08 INFO - rip = 0x0000000101ab2936 rsp = 0x000000012327fa30
10:46:08 INFO - Found by: stack scanning
10:46:08 INFO - 4 GLEngine + 0x13efa
10:46:08 INFO - rip = 0x0000000127abfefb rsp = 0x000000012327fa90
10:46:08 INFO - Found by: stack scanning
10:46:08 INFO - 5 GLEngine + 0x15464
10:46:08 INFO - rip = 0x0000000127ac1465 rsp = 0x000000012327faf0
10:46:08 INFO - Found by: stack scanning
10:46:08 INFO - 6 libmozglue.dylib!huge_palloc [jemalloc.c:9e7225138b7d : 1683 + 0x7]
10:46:08 INFO - rip = 0x00000001000188bd rsp = 0x000000012327fb80
10:46:08 INFO - Found by: stack scanning
10:46:08 INFO - 7 libSystem.B.dylib + 0x4989
10:46:08 INFO - rip = 0x00007fff83e2d98a rsp = 0x000000012327fbf0
10:46:08 INFO - Found by: stack scanning
10:46:08 INFO - 8 GLEngine + 0x12c1d
10:46:08 INFO - rip = 0x0000000127abec1e rsp = 0x000000012327fc70
10:46:08 INFO - Found by: stack scanning
Comment 9 (Reporter) • 10 years ago
Looks like it's from this merge:
https://hg.mozilla.org/mozilla-central/pushloghtml?changeset=acbd7b68fa8c
Which is fun, because mchang's patch in that push was backed out for tsvgr crashes with the same signature (among other things).
Unfortunately, that's probably as far as I'm going to be able to take this bug as I'm out for the rest of the week.
Comment 10 • 10 years ago
I downloaded a dump from one of the many retriggers, on a revision before my commit (https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=2caae1e33648). It has the same crash stack as bug 552020.
Comment 13 • 10 years ago
This is caused by bug 1073662. We did a bunch of retriggers on tbpl and the pattern is easy to see:
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&fromchange=ea818fdbd81c&tochange=9b77a97a378b&jobname=Rev4%20MacOSX%20Snow%20Leopard%2010.6%20mozilla-inbound%20talos%20tp5o
Depends on: 1073662
Comment 18 (Assignee) • 10 years ago
The stacks in this bug make no sense. While jemalloc shows up in there, there's other stuff in between that makes no sense (could never be called from jemalloc). Specifically, there's lots of stuff related to OpenGL.
As best I can figure from the stacks, we're crashing in the call at [1], unlocking the huge chunk lock. But that makes no sense - we're unconditionally taking the lock a few lines up, so releasing it should be fine.
I'm going to need help figuring this out - I don't have a Mac and I'm ashamed to say that debugging using an actual debugger is a mystery to me. I could try some printf debugging on try, but I have no idea what to even check in this case.
[1] http://dxr.mozilla.org/mozilla-central/source/memory/mozjemalloc/jemalloc.c#5117
Comment 19 • 10 years ago
If I had to take a guess, I'd say this smells very much like an existing use of uninitialized memory in system libraries that's now triggered more often than before because of the jemalloc behavior change. One way to "validate" this theory is to try reproducing those crashes with opt_zero enabled, or with pages_purge actively memsetting the pages to 0 on mac (or remapping them, or whatever).
It wouldn't be the first time that system libraries on osx fail to initialize memory.
Comment 20 (Assignee) • 10 years ago
Ah, good thought. There are a few calls to huge_malloc and huge_palloc that pass false for the zero parameter. Since mmap *always* zeroes on allocation, this didn't use to matter. However, with MALLOC_RECYCLE we can allocate out of previously used chunks that have been MADV_FREEd, and MADV_FREE *doesn't* zero pages (MADV_DONTNEED *does*).
So it's entirely possible that a call to huge_malloc or huge_palloc isn't getting zeroed memory where it was implicitly getting it before. I'll push some changes to try to see if explicitly zeroing in calls to huge_malloc and/or huge_palloc helps.
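A minimal sketch of the hazard described in this comment, not the actual jemalloc code: once chunks are recycled, the allocator has to zero reused memory itself whenever the caller asks for zeroed memory, because a recycled chunk can still hold its previous contents. All names here (`chunk_alloc`, `chunk_free`, `recycled_chunk`, `CHUNK_SIZE`) are hypothetical, and `calloc` merely stands in for a fresh, zero-filled mapping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK_SIZE 4096 /* hypothetical chunk size for the sketch */

/* One-slot recycling cache; a stand-in for a real chunk recycler. */
static void *recycled_chunk = NULL;

/* Return a chunk. If zero is true, the caller is promised zero-filled memory. */
static void *chunk_alloc(bool zero)
{
    void *chunk = recycled_chunk;
    if (chunk != NULL) {
        recycled_chunk = NULL;
        /* A recycled chunk may still hold stale data, so zero it explicitly
         * when the caller asked for zeroed memory. */
        if (zero)
            memset(chunk, 0, CHUNK_SIZE);
        return chunk;
    }
    /* Fresh path: calloc models a new mapping that arrives zero-filled. */
    return calloc(1, CHUNK_SIZE);
}

/* Give a chunk back. Its contents are left untouched if it is kept for reuse. */
static void chunk_free(void *chunk)
{
    if (recycled_chunk == NULL)
        recycled_chunk = chunk;
    else
        free(chunk);
}
```

With this shape, a caller that passes `true` gets zero-filled memory whether the chunk is fresh or recycled, while a caller that passes `false` may see bytes left behind by the previous owner.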
Comment 22 (Assignee) • 10 years ago
Unfortunately that turned out not to help: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=ef770efdb514
I'm trying partial backouts now to see which part is at fault.
Comment 26 (Assignee) • 10 years ago
Simply backing out part 7 (which turns recycling on) appears to fix it: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=bc78c6a860e6
That doesn't really tell us anything about *why* recycling causes the problem though. I'm trying a few more things to see if I can narrow it down.
Comment 27 (Assignee) • 10 years ago
Turning off munmap (to see if it's caused by the capped recycling):
https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=b3367b098f71
Not using MADV_FREE (decommitting instead):
https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=e3a399dde8d6
Locking around the commit/decommit calls in the explicit double purge:
https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=607801615239
That last one is probably a good idea anyway (should have thought of it before), but I don't know if the function even gets called. Though with Talos getting RSS from the browser, maybe that's the only place in our testing that it *would* be called.
Comment 28 • 10 years ago
So not using MADV_FREE doesn't show the failure, and neither does locking around the commit/decommit calls. Good point about the possibility that collecting RSS from the browser is what triggers the problem. If we want to change Talos at all, that is an option.
I am glad to see we are coming closer to a fix!
Comment 29 (Assignee) • 10 years ago
Yep, that was totally it. Don't know why I didn't consider this possibility when I originally wrote that logic.
Assignee: nobody → emanuel.hoogeveen
Status: NEW → ASSIGNED
Attachment #8525421 - Flags: review?(mh+mozilla)
Comment 30 (Assignee) • 10 years ago
(In reply to Joel Maher (:jmaher) from comment #28)
> so not using MADV_FREE doesn't show failure, nor does the locking
> commit/decommit calls. A good point about the possibility of RSS collected
> from the browser causing the problem. If we want to change talos at all,
> that is an option;
We shouldn't - if anything, it's great that Talos showed this failure, since there was definitely a problem. It's just unfortunate that it showed up in such a weird way.
I'm not sure why always decommitting also worked, but I don't think it's the right fix - as implemented it was even *broken* in the presence of jemalloc_purge_freed_pages_impl, since that recommits the pages.
Full OS X-only try run with the patch: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=38f8816bd5da
Comment 33 • 10 years ago
Comment on attachment 8525421
Lock chunks during double purging to avoid racing with allocation.
Review of attachment 8525421:
-----------------------------------------------------------------
Indeed, the extent_tree functions need the lock to be held.
Attachment #8525421 - Flags: review?(mh+mozilla) → review+
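A minimal sketch of the idea behind the reviewed patch, not the patch itself: any walk over the shared set of recycled chunks has to hold the same lock that allocation takes, otherwise a chunk can be handed out while it is still being purged. The array, the lock name `recycle_lock`, and the helper functions are hypothetical stand-ins for jemalloc's extent trees and chunk lock, and `memset` stands in for the purge step.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define CHUNK_SIZE 4096   /* hypothetical chunk size for the sketch */
#define MAX_RECYCLED 4    /* hypothetical cap on recycled chunks */

/* Shared state: the recycled-chunk set and the lock that guards it. */
static pthread_mutex_t recycle_lock = PTHREAD_MUTEX_INITIALIZER;
static void *recycled[MAX_RECYCLED];
static size_t recycled_count = 0;

/* Add a chunk to the recycled set. Returns 0 on success, -1 if the set is full. */
static int chunk_recycle_put(void *chunk)
{
    int rc = -1;
    pthread_mutex_lock(&recycle_lock);
    if (recycled_count < MAX_RECYCLED) {
        recycled[recycled_count++] = chunk;
        rc = 0;
    }
    pthread_mutex_unlock(&recycle_lock);
    return rc;
}

/* Allocation path: take the most recently recycled chunk, or NULL if none. */
static void *chunk_recycle_take(void)
{
    void *chunk = NULL;
    pthread_mutex_lock(&recycle_lock);
    if (recycled_count > 0)
        chunk = recycled[--recycled_count];
    pthread_mutex_unlock(&recycle_lock);
    return chunk;
}

/* Purge path: the lock is held for the whole walk, so chunk_recycle_take
 * cannot hand out a chunk that is in the middle of being purged.
 * Returns the number of chunks purged. */
static size_t purge_recycled_chunks(void)
{
    size_t purged;
    pthread_mutex_lock(&recycle_lock);
    for (size_t i = 0; i < recycled_count; i++)
        memset(recycled[i], 0, CHUNK_SIZE); /* stand-in for the real purge */
    purged = recycled_count;
    pthread_mutex_unlock(&recycle_lock);
    return purged;
}
```

Dropping the lock from `purge_recycled_chunks` would let another thread take and start writing to a chunk between the loop reading its pointer and the purge touching its pages, which is the kind of race the patch title describes.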
Updated (Assignee) • 10 years ago
Keywords: checkin-needed
Comment 34 • 10 years ago
Keywords: checkin-needed
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla36
Updated (Reporter) • 10 years ago
status-firefox34: --- → unaffected
status-firefox35: --- → unaffected
status-firefox36: --- → fixed
status-firefox-esr31: --- → unaffected