Closed
Bug 694896
Opened 13 years ago
Closed 13 years ago
[10.7] Fix jemalloc hang, crash on Mac 10.7 and enable jemalloc on 10.7
Categories
(Core :: Memory Allocator, defect)
Tracking
()
RESOLVED
FIXED
mozilla10
People
(Reporter: justin.lebar+bug, Assigned: justin.lebar+bug)
References
(Blocks 1 open bug)
Details
(Whiteboard: [MemShrink:P1][qa-])
Attachments
(6 files, 3 obsolete files)
jemalloc is currently disabled on MacOS 10.7 for reasons that I don't understand at the moment. But as more Mac users transition to the new version, enabling jemalloc there becomes increasingly important.
Assignee | ||
Updated•13 years ago
|
Whiteboard: [MemShrink]
Comment 1•13 years ago
|
||
bug 670175 is the problem.
Assignee | ||
Comment 2•13 years ago
|
||
Thanks. I resolved that bug so we can track turning it on (and fixing that hang) here.
Depends on: 670175
Assignee | ||
Updated•13 years ago
|
Blocks: lion-compatibility
Assignee | ||
Updated•13 years ago
|
Whiteboard: [MemShrink] → [MemShrink:P1]
Assignee | ||
Comment 3•13 years ago
|
||
Steven, we need some mac-fu here to understand the hangs in bug 670175. Can you help?
Comment 4•13 years ago
|
||
I'm completely swamped, so the earliest I'll have time for this is sometime next week. Furthermore I know very little about jemalloc, so I'll probably have to use brute force on this bug -- keep turning off bits of jemalloc until it stops happening, then see if the guilty "bit" can somehow be implemented differently. That's likely to be very time consuming, and may be something you guys can do as well as I can :-)
Assignee | ||
Comment 5•13 years ago
|
||
Is it possible to dual-boot with the final version of 10.7? If so, I might be able to poke this and at least get to the point that I need your help. I sort of understand jemalloc, although jemalloc on mac is a complete mystery to me. :)
Comment 6•13 years ago
|
||
> Is it possible to dual-boot with the final version of 10.7?
Sure. You need to create extra partitions on your hard drive (or add another hard drive on which you can create additional partitions). Then do a conventional install into one of those partitions. Then you can use Startup Disk in System Preferences to choose which partition to boot from.
No, I don't know how to non-destructively add new partitions to an existing hard drive. But ask Marcia Knous -- she may know how, or know who to ask.
Assignee | ||
Comment 7•13 years ago
|
||
> No, I don't know how to non-destructively add new partitions to an existing hard drive.
I think bootcamp can do this? I'll see what I can figure out.
Comment 8•13 years ago
|
||
Note that your Lion partition should be at least 30GB in order for XCode to fit into it. 40GB would probably be better. (I use 30GB, and it can be a tight fit.)
Comment 9•13 years ago
|
||
(In reply to Steven Michaud from comment #6) > No, I don't know how to non-destructively add new partitions to an existing > hard drive. But ask Marcia Knous -- she may know how, or know who to ask. I believe the stock version of Disk Utility does this just fine (unless your drive is very fragmented) — give it a try. It's very explicit about which partitions will lose data and which will remain intact, so just read the description before confirming the operation.
Assignee | ||
Comment 10•13 years ago
|
||
I had to run disk utilities repair on my partition (perhaps because I'd previously had a bootcamp partition), but then it worked, and I got Lion installed. Thanks!
Assignee | ||
Comment 11•13 years ago
|
||
I can reproduce using the steps in bug 670175 comment 17, about half the time. Firefox beachballs for about 5s. From the hang stack in attachment 544925 [details], it looks like jemalloc is taking a long time in mmap. (See the stacktrace for thread 1, at the bottom.) Maybe 10.7 doesn't like the hack we use to force mmap'ed pages down towards the bottom of the address space.
Comment 12•13 years ago
|
||
I thought that trace indicated thread contention -- between the main thread and threads 10, 12, 13, and 23. But that's more or less a guess. Hence my hunch that we're going to need to brute-force this.
Comment 13•13 years ago
|
||
> Maybe 10.7 doesn't like the hack we use to force mmap'ed pages down
> towards the bottom of the address space.
Can we stop doing this, and see if that makes the problem go away?
Assignee | ||
Comment 14•13 years ago
|
||
I seem to be able to reproduce bug 670175 comment 17 consistently if I issue the |purge| command immediately before starting Firefox.
Assignee | ||
Comment 15•13 years ago
|
||
(In reply to Steven Michaud from comment #13) > Can we stop doing this, and see if that makes the problem go away? Yep, trying...
Assignee | ||
Comment 16•13 years ago
|
||
Interestingly, when I kill FF while it's hung, the whole system is unresponsive for a few seconds. Typing and switching tabs in the terminal is very slow, for example.
> I thought that trace indicated thread contention -- between the main thread and threads 10, 12, 13,
> and 23. But that's more or less a guess.
Yes. The fact that four threads are all blocked on the main thread's arena lock indicates that the main thread has probably been sitting there for a while. So that leads me to suspect that mmap may be hanging...
Comment 17•13 years ago
|
||
> Interestingly, when I kill FF while it's hung, the whole system is > unresponsive for a few seconds. Typing and switching tabs in the > terminal is very slow, for example. This is an old problem with event taps -- a buggy system resource that we're forced to use. See bug 393664 and bug 611068.
Assignee | ||
Comment 18•13 years ago
|
||
Yikes, I think I understand what's happening. Jemalloc uses a loop when it tries to do an aligned mmap. The claim is that you only loop in the event of a race condition, where two threads call mmap simultaneously. But we go around and around this loop for a long time. The code assumes that if I mmap a 2mb chunk, unmap it, and then immediately try to map somewhere inside where that chunk used to be (so I get an aligned mapping), the final mmap operation will succeed in the absence of races. It appears that this is not the case on 10.7.
Assignee | ||
Comment 19•13 years ago
|
||
The more sane (and loop-free) thing to do would be to map 2mb and then unmap any excess. That's what jemalloc tip does. We can't do that on Windows, because apparently Windows demands a 1:1 mapping between maps and unmaps. But we ought to be able to backport this change for Linux and Mac.
Assignee | ||
Comment 20•13 years ago
|
||
Awesome, jsgcchunk is copy/pasted from jemalloc (or is it the other way around?). So whatever problem jemalloc has here, presumably the js gc has too.
Assignee | ||
Comment 21•13 years ago
|
||
(In reply to Justin Lebar [:jlebar] from comment #20) > Awesome, jsgcchunk is copy/pasted from jemalloc (or is it the other way > around?). So whatever problem jemalloc has here, presumably the js gc has > too. ...although maybe not, because js uses vm_allocate instead of mmap. Anyone know why js doesn't use mmap?
Comment 22•13 years ago
|
||
The code in jsgcchunk is super ugly and I spent quite of a bit of time looking at it today for 670596. (In reply to Justin Lebar [:jlebar] from comment #20) > Awesome, jsgcchunk is copy/pasted from jemalloc (or is it the other way > around?). So whatever problem jemalloc has here, presumably the js gc has > too. Igor created jsgcchunk as a copy from jemalloc about a year and a half ago in 553812. (In reply to Justin Lebar [:jlebar] from comment #19) > The more sane (and loop-free) thing to do would be to map 2mb and then unmap > any excess. That's what jemalloc tip does. Agreed. This should work well on mac and linux, assuming that we can't find a way to do aligned allocations directly. > We can't do that on Windows, because apparently Windows demands a 1:1 > mapping between maps and unmaps. But we ought to be able to backport this > change for Linux and Mac. In Windows, we can reserve address space independently of committing. We should be able to make things better on Windows by grabbing the 2MiB of address space as a reservation and only committing when we grab the 1MiB slice afterwards. I think what we need to do in jsgcchunk is push the high level AllocGCChunk interface into each OS's #define and just hyper-optimize for each OS individually. We also need to add new per-OS functions for madvise/MEM_RESET functionality. It would be nice if we could share this work with jemalloc.
Assignee | ||
Comment 23•13 years ago
|
||
> assuming that we can't find a way to do aligned allocations directly. I've got to hope that if there were a way, upstream jemalloc would be using it. > In Windows, we can reserve address space independently of committing. We should be able to make > things better on Windows by grabbing the 2MiB of address space as a reservation and only committing > when we grab the 1MiB slice afterwards. Yes, but two problems: * We frequently run out of virtual address space on Windows-32, so we probably can't waste it like this. * The chunk would now have to remember where its mapping starts, so that when you unmap the chunk, you do so from the right place. This would be kind of an invasive jemalloc change. Windows ain't broke, so I think we should leave it be for the moment and revisit separately whether we can improve it.
Comment 24•13 years ago
|
||
(In reply to Justin Lebar [:jlebar] from comment #23) > > assuming that we can't find a way to do aligned allocations directly. > > I've got to hope that if there were a way, upstream jemalloc would be using > it. Probably, but the linux kernel moves fast and doesn't communicate userland changes well. > > In Windows, we can reserve address space independently of committing. We should be able to make > > things better on Windows by grabbing the 2MiB of address space as a reservation and only committing > > when we grab the 1MiB slice afterwards. > > Yes, but two problems: > > * We frequently run out of virtual address space on Windows-32, so we > probably can't waste it like this. > > * The chunk would now have to remember where its mapping starts, so that > when you unmap the chunk, you do so from the right place. This would be > kind of an invasive jemalloc change. > > Windows ain't broke, so I think we should leave it be for the moment and > revisit separately whether we can improve it. I think I must have explained very poorly what I meant to do. I am not suggesting changing the algorithm or numbers we use when allocating memory at all, just unsetting the MEM_COMMIT bit until we actually make our 1MiB allocation.
Assignee | ||
Comment 25•13 years ago
|
||
> I think I must have explained very poorly what I meant to do. I am not suggesting changing the
> algorithm or numbers we use when allocating memory at all, just unsetting the MEM_COMMIT bit until
> we actually make our 1MiB allocation.
Ah, that might work! Can we do this in another bug, please?
Comment 26•13 years ago
|
||
(In reply to Justin Lebar [:jlebar] from comment #25) > Ah, that might work! Can we do this in another bug, please? Filed under 696119.
Assignee | ||
Comment 27•13 years ago
|
||
Seems to work on my 10.7 and Linux-64 machines. Pushed to try.
Assignee | ||
Updated•13 years ago
|
Summary: Enable jemalloc on MacOS 10.7 → Fix jemalloc chunk_alloc_mmap() function so it works properly on Mac 10.7
Assignee | ||
Comment 28•13 years ago
|
||
Aside from a mysterious linux64 make check error, which is probably nothing, this looks good on try. Steven, would you mind trying this build out on your 10.7 machine and seeing if you can get it to hang? It doesn't hang for me on my 10.7 machine. https://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/jlebar@mozilla.com-21ae01a7b22a/
Assignee | ||
Updated•13 years ago
|
Attachment #568432 -
Flags: feedback?(smichaud)
Comment 29•13 years ago
|
||
If you feel like a slightly larger population of guinea pigs to test it on, feel free to land it on the UX branch. We have about 2K ADUs, a large percentage of which I would guess are on OS X 10.7. :)
Assignee | ||
Comment 30•13 years ago
|
||
I pushed this to UX: https://hg.mozilla.org/projects/ux/rev/5e823e7f5e5d
Comment 31•13 years ago
|
||
Comment on attachment 568432 [details] [diff] [review] Patch v1 This patch doesn't seem to have fixed the problem. I still crash, though now the STR is now different. Here are the crashes I've had. They seem to be the same crash, but I include them all for the sake of completeness. I'll try to get some gdb crash stacks, and to translate the symbols in the Breakpad crash stacks: bp-faaba078-ca18-4b50-a52e-ea4432111021 bp-e9738d44-e0b9-4dd0-b60f-fd1ac2111021 bp-b8a8cae1-4a37-43c3-94df-4b76a2111021 bp-a8476906-3a59-4d59-a8f8-b1e8f2111021 bp-6845a0ff-f298-41b9-8b65-e9b532111021 Here's my STR. It's still rather difficult and pretty finicky. I may be able to improve it over time. 1) Restart your computer. 2) Run the test build, and then a few seconds (5-15?) after it's started, move the mouse up to "Nightly" in the main menu and click on it to open the menu. I'll usually crash here, or sometimes even before the mouse reaches the main menu. Sometimes subsequent attempts to run the test build will crash even before the browser window has finished opening. But as soon as you fail to crash even once, the crashes become almost impossibly difficult to reproduce. Running "purge" beforehand doesn't seem to help (to make the test build more crashy). My STR above doesn't crash on recent nightlies.
Attachment #568432 -
Flags: feedback?(smichaud) → feedback-
Comment 32•13 years ago
|
||
Even when I just turn jemalloc on for OS X 10.7, without making any of the rest of your patch's changes, my STR from bug 670175 comment #17 no longer "works" for me -- the I-beam cursor in the Google search box never stops flashing. For this test, I just applied the following part of your patch: - osx_use_jemalloc = (default_zone->version == SNOW_LEOPARD_MALLOC_ZONE_T_VERSION); + osx_use_jemalloc = (default_zone->version == SNOW_LEOPARD_MALLOC_ZONE_T_VERSION || + default_zone->version == LION_MALLOC_ZONE_T_VERSION);
Comment 33•13 years ago
|
||
You'll notice that none of my Breakpad crash reports from comment #31 have symbols even for Mozilla code. This has already come up in bug 693451, and a Breakpad bug on the issue has been opened upstream at http://code.google.com/p/google-breakpad/issues/detail?id=447.
Assignee | ||
Comment 34•13 years ago
|
||
Yeah, symbols of any kind would be very helpful...
Assignee | ||
Comment 35•13 years ago
|
||
Backed out from UX: https://hg.mozilla.org/projects/ux/rev/dc08bf63b3b1
Comment 36•13 years ago
|
||
It occurs to me that jemalloc might have (at least) two different problems on OS X 10.7 -- a hang bug (which your patch may have fixed) and a crash bug.
Assignee | ||
Comment 37•13 years ago
|
||
I think the patch did fix a hang I was observing. At least, there was a loop which I observed spinning, and now the loop is gone!
Comment 38•13 years ago
|
||
Here are translations for (some of) the symbols for my crash stacks from comment #31. I used atos to do the translations. There were a few symbols it (apparently) couldn't translate. I think this pins the blame pretty well on jemalloc. Though it doesn't come near to providing enough information to fix the bug that causes the crashes :-( Note that one of the symbols is for malloc_printf, which indicates there may have been something interesting in the system log. I'll reproduce a few more crashes, and see what I find.
Comment 39•13 years ago
|
||
[0x0-0x13013].org.mozilla.nightly: firefox(193,0x7fff79a83960) malloc: *** malloc_default_scalable_zone() failed to find 'DefaultMallocZone'
Assignee | ||
Comment 40•13 years ago
|
||
That's weird...it looks like this is the default allocator, not jemalloc.
Comment 41•13 years ago
|
||
Then I think jemalloc must be doing something to break the default allocator. I'm still working on getting a gdb stack trace of the crash, with symbols and all threads.
Comment 42•13 years ago
|
||
Here it is! This is from a non-debug custom build (with this bug's patch applied) with optimization turned off and no symbols stripped. It's based on trunk code pulled at the beginning of this week (Monday).
Comment 43•13 years ago
|
||
Note that it looks a lot like my gdb crash traces from bug 670175.
Assignee | ||
Comment 44•13 years ago
|
||
This is a shot in the dark, but what if you change jemalloc_zone->zone_name = "jemalloc_zone"; to jemalloc_zone->zone_name = "DefaultMallocZone"; ?
Comment 45•13 years ago
|
||
Comment 46•13 years ago
|
||
(In reply to comment #44) I'm trying it. I'll let you know my results when the build finishes.
Assignee | ||
Comment 47•13 years ago
|
||
btw, I find that to rebuild jemalloc, I need to do rm -f browser/app/nsBrowserApp.o && make -C memory && make -C browser && make -C toolkit/library if you don't remove nsBrowserApp.o, the jemalloc changes might not be reflected, even if you do a top-level rebuild.
Comment 48•13 years ago
|
||
(In reply to Justin Lebar [:jlebar] from comment #47) > if you don't remove nsBrowserApp.o, the jemalloc changes might not be > reflected, even if you do a top-level rebuild. That doesn't make sense. jemalloc is a shared library on mac. Rebuilding libmozutils.dylib should be enough.
Comment 49•13 years ago
|
||
Following your suggestion from comment #44 didn't help, though it did change the stack trace slightly, and got rid of the malloc_printf error message. I think the problem reported by malloc_printf must be a symptom, and not the cause. By the way, I didn't need to delete browser/app/nsBrowserApp.o. I just changed the source and did a top-level rebuild. (I always do that -- doing otherwise just wastes all the time I might have saved doing non-top-level builds, or more than the time, on fixing the problems it causes.)
Assignee | ||
Comment 50•13 years ago
|
||
> Following your suggestion from comment #44 didn't help, though it did
> change the stack trace slightly, and got rid of the malloc_printf
> error message.
Okay, that's pretty good!
Maybe it's then trying to call one of the zone functions which is null.
You could replace the NULLs in create_zone and szone2ozone with 0x1, 0x2, 0x3, etc. Maybe that would tell us which function it's trying to call.
Comment 51•13 years ago
|
||
> You could replace the NULLs in create_zone and szone2ozone with 0x1, > 0x2, 0x3, etc. Maybe that would tell us which function it's trying > to call. I did that, without backing out your suggestion from comment #44, and it made no difference -- I still get exactly the same null dereference, and no output from malloc_printf. Next I'll try backing out your suggestion from comment #44, to see if that makes any difference.
Comment 52•13 years ago
|
||
> Next I'll try backing out your suggestion from comment #44, to see
> if that makes any difference.
This just restored the original error message from malloc_printf and
the original crash at 0x10f0.
Comment 53•13 years ago
|
||
I've got a few minutes left at the end of the day, that I can't really spend doing anything else, so I'm going to try a few of my own random shots in the dark. I just tried putting #define NO_TLS at the top of jemalloc.c -- that doesn't help.
Assignee | ||
Comment 54•13 years ago
|
||
I'm pretty sure mac doesn't build without NO_TLS. I think we've made some progress here. I'll try to get things set up so I can debug this locally.
Comment 55•13 years ago
|
||
Now I've found something that *does* help: The crashes disappear if you never call szone2ozone()! I tested twice (with my STR from comment #31) just to be sure. > I'll try to get things set up so I can debug this locally. That's probably a good idea :-)
Assignee | ||
Comment 56•13 years ago
|
||
Ah, so AIUI, szone2ozone sets us up so that certain system-level allocations go through jemalloc rather than the default allocator. It might not like the zone name...
Comment 57•13 years ago
|
||
> It might not like the zone name...
I suspect the name isn't the problem ... though I've no idea what the problem actually is.
In any case, though, szone2ozone is a lot smaller target to brute-force than the whole of jemalloc.c.
Comment 58•13 years ago
|
||
(Following up comment #57) >> It might not like the zone name... > > I suspect the name isn't the problem ... though I've no idea what > the problem actually is. Actually you were right (and I was wrong). Making the following change in szone2ozone is enough to make my crashes go away: - default_zone->zone_name = "jemalloc_ozone"; + if (default_zone->version < LION_MALLOC_ZONE_T_VERSION) { + default_zone->zone_name = "jemalloc_ozone"; + } else { + default_zone->zone_name = "DefaultMallocZone"; + } I still don't know *why* this change is necessary on Lion. As best I can tell, "DefaultMallocZone" has been this zone's name for years. But maybe it's only with Lion that the OS actually starts checking the name. It's also puzzling why we don't *always* crash on Lion, if the default zone's name is "wrong". But I think it's worthwhile landing this change on the trunk, to see what happens. Once jemalloc is re-enabled on Lion, we should (of course) check that it really does give us performance gains and/or a reduction in memory usage.
Comment 59•13 years ago
|
||
And note that the default zone's name is wrong (on Lion) even when we don't call szone2ozone, and we still don't crash.
Comment 60•13 years ago
|
||
So (sigh) I've changed my mind. I no longer think we should land my change from comment #58 without understanding better what's going on (why that change "works", or appears to "work").
Comment 61•13 years ago
|
||
I'll continue to brute-force szone2ozone for a while longer, to see if I find anything interesting.
Assignee | ||
Comment 62•13 years ago
|
||
You may want to disassemble the calling stackframes (or use gdb watchpoints) to try to see where default_zone->zone_name is being read. I've been reading machine code, but I haven't seen it yet...
Assignee | ||
Comment 63•13 years ago
|
||
So I did + default_zone->zone_name = (char*)0xdeadbeef; and then waited for it to crash. It crashes in strcmp() malloc_default_purgable_zone() ImageIO_Malloc() malloc_default_purgable_zone appears to be in a loop looking for something. Perhaps the string "DefaultMallocZone". :)
Comment 64•13 years ago
|
||
> malloc_default_purgable_zone appears to be in a loop looking for > something. Perhaps the string "DefaultMallocZone". :) Nope, it's the string "DefaultPurgeableMallocZone" :-) See http://opensource.apple.com/source/Libc/Libc-594.1.4/gen/malloc.c. Which shows that you're debugging in libc code, and that Apple's libc's source is available at http://opensource.apple.com. That may make your life easier :-)
Comment 65•13 years ago
|
||
OS X 10.7.2's malloc.c is at: http://opensource.apple.com/source/Libc/Libc-763.12/gen/malloc.c You can also download the source. See http://opensource.apple.com/ and start browsing under "Mac OS X".
Assignee | ||
Comment 66•13 years ago
|
||
Ah, that's much nicer than reading asm. There's this, which explains comment 38: static inline malloc_zone_t * inline_malloc_default_scalable_zone(void) { unsigned index; if (malloc_def_zone_state < 2) _malloc_initialize(); // _malloc_printf(ASL_LEVEL_INFO, "In inline_malloc_default_scalable_zone with %d %d\n", malloc_num_zones, malloc_has_debug_zone); MALLOC_LOCK(); for (index = 0; index < malloc_num_zones; ++index) { malloc_zone_t *z = malloc_zones[index]; if(z->zone_name && strcmp(z->zone_name, "DefaultMallocZone") == 0) { MALLOC_UNLOCK(); return z; } } MALLOC_UNLOCK(); malloc_printf("*** malloc_default_scalable_zone() failed to find 'DefaultMallocZone'\n"); return NULL; // FIXME: abort() instead? }
Comment 67•13 years ago
|
||
Another small piece of the puzzle: My crashes stop happening if you don't change the zone_name in szone2ozone -- if you comment out the following line: default_zone->zone_name = "jemalloc_ozone"; So the zone name doesn't have to be "DefaultMallocZone" (without the above change it's "jemalloc_zone").
Assignee | ||
Comment 68•13 years ago
|
||
> So the zone name doesn't have to be "DefaultMallocZone" (without the above change it's
> "jemalloc_zone").
But in create_zone, don't we need to change the zone name from "jemalloc_zone" to "DefaultMallocZone"?
Comment 69•13 years ago
|
||
> But in create_zone, don't we need to change the zone name from > "jemalloc_zone" to "DefaultMallocZone"? What I reported in comment #67 tells me "no". But I'm not yet sure that's the end of the story ...
Comment 70•13 years ago
|
||
I'm closing in on the real problem, I think: I can also fix my crashes by adding the following (an extra call to malloc_zone_register()) just after the single call to szone2ozone(): malloc_zone_register(default_zone); Apparently Lion is fussy about changes made to the default zone after it's been "registered". But I suspect we should only call malloc_zone_register() once, and that the single call should be the one after szone2ozone(). Just a sec and I'll try that out.
Assignee | ||
Comment 71•13 years ago
|
||
Why do we have to register the jemalloc zone to begin with? Registering the zone doesn't make it the default one, afaict. It looks like only the szone overlay actually does something.
> But I suspect we should only call malloc_zone_register() once, and
> that the single call should be the one after szone2ozone().
If registering the zone *is* important, then this could be dangerous, because after szone2ozone, the system might be able to start using the jemalloc zone (effectively, not the zone itself, but its functions) before we've registered it. (?)
Comment 72•13 years ago
|
||
(Following up comment #70) Oops, I got it wrong. Adding the following after the call to szone2ozone() doesn't help: malloc_zone_register(default_zone); Likewise with making it the only call to malloc_zone_register(). I was confused about default_zone (returned by a call to malloc_default_zone()) and the "custom zone" created by create_zone(). default_zone is the default malloc zone created (and registered) by the OS. It's not a copy of the "custom zone", and doesn't become one. In fact I don't know what the "custom zone" is for, and why (or even if) it's needed. We get the default zone from the OS. And though we modify it, we intend the OS to continue using it as if it had been unchanged (we in fact hook it). We also never check for any zone by name. So we have no reason to change the default zone's zone_name in szone2ozone(), on any version of OS X. In fact we're lucky we got away with this on versions of OS X prior to Lion. So the fix for the crash bug is very simple. Just get rid of the following line in szone2ozone(): default_zone->zone_name = "jemalloc_ozone";
Comment 73•13 years ago
|
||
> Why do we have to register the jemalloc zone to begin with?
> Registering the zone doesn't make it the default one, afaict. It
> looks like only the szone overlay actually does something.
I assume that by "the jemalloc zone" you mean what I've been calling
the "custom zone". If so, then I agree that we (probably) don't need
to either create it or register it -- as far as I can tell it's not
used for anything. But I don't believe that creating and registering
the "custom zone" does any harm.
It'd be interesting to know the history of this code, and whether or
not the "custom zone" once served some purpose. But I'll save that
for later, because right now I'm too tired to pursue it :-)
Comment 74•13 years ago
|
||
By the way, "registering" a zone doesn't make it the default.
Assignee | ||
Comment 75•13 years ago
|
||
How about we get rid of this zone registration altogether? This seems to work for me, but knowing you, you'll find some way to crash it. :) As far as I can tell, allocations are still going through jemalloc with this patch. Or at least, the jemalloc heap size isn't significantly smaller with this patch than without.
Attachment #568432 -
Attachment is obsolete: true
Attachment #569277 -
Flags: feedback?
Assignee | ||
Updated•13 years ago
|
Attachment #569277 -
Flags: feedback? → feedback?(smichaud)
Comment 76•13 years ago
|
||
Comment on attachment 569277 [details] [diff] [review] Patch v2 - jemalloc_zone->zone_name = "jemalloc_zone"; + jemalloc_zone->zone_name = "DefaultMallocZone"; We don't want this. We don't want anything besides the real default zone to have that name. I'll do the rest of my review tomorrow.
Assignee | ||
Comment 77•13 years ago
|
||
Assignee | ||
Updated•13 years ago
|
Attachment #569277 -
Attachment is obsolete: true
Attachment #569277 -
Flags: feedback?(smichaud)
Assignee | ||
Comment 78•13 years ago
|
||
Comment on attachment 569278 [details] [diff] [review] Patch v3 Really get rid of create_zone().
Attachment #569278 -
Attachment description: Fix jemalloc hangs on Mac 10.7. → Patch v3
Attachment #569278 -
Flags: feedback?(smichaud)
Assignee | ||
Comment 79•13 years ago
|
||
> In fact I don't know what the "custom zone" is for, and why (or even
> if) it's needed.
We midaired each other and independently arrived at the same conclusion. Cool. :)
Assignee | ||
Comment 80•13 years ago
|
||
We should check whether we need to somehow lock around the modifications to default_zone. If we're not the only thread alive at that point, we might be better off reordering the modifications so that free is set before malloc and friends.
Assignee | ||
Comment 81•13 years ago
|
||
On my 10.7 machine, szone2ozone is called when only one thread is alive.
Comment 82•13 years ago
|
||
Comment on attachment 569278 [details] [diff] [review] Patch v3 This looks fine to me. And I no longer crash or hang :-) Though I do have a couple of nits: + /* Don't modify default_zone->zone_name; Mac libc may rely on the name + * being uncahnged. */ It's "unchanged". And you might want to reference this bug (bug 694896). /* Likewise for l_zone_introspect. */ static lion_malloc_introspection l_zone_introspect, l_ozone_introspect; static malloc_introspection_t * const zone_introspect = (malloc_introspection_t*)(&l_zone_introspect); static malloc_introspection_t * const ozone_introspect = (malloc_introspection_t*)(&l_ozone_introspect); We can get rid of l_zone_introspect and zone_introspect. We can also get rid of the following, which are all now dead code: zone_size() zone_free() zone_realloc() zone_force_lock() zone_force_unlock() zone_free_definite_size() > On my 10.7 machine, szone2ozone is called when only one thread is > alive. On the Mac szone2ozone is ultimately called from the following, which is called by the OS when the module containing jemalloc.c is loaded: __attribute__((constructor)) void jemalloc_darwin_init(void); For good measure I double-checked that the structures in osx_zone_types.h still match their version-specific counterparts in malloc.h on SnowLeopard and Lion -- they do. I also checked that our use of the malloc_zone_t pointers in szone2ozone() is sane -- it is. Specifically, we don't NULL out any pointers that aren't documented to sometimes be NULL, or are in malloc_introspection_t (which it seems can all safely be NULL). I also found when create_zone(), jemalloc_zone and friends were "orphaned". More on that in my next comment.
Attachment #569278 -
Flags: feedback?(smichaud) → feedback+
Assignee | ||
Comment 83•13 years ago
|
||
Attachment #569477 -
Flags: review?(khuey)
Assignee | ||
Updated•13 years ago
|
Attachment #569278 -
Attachment is obsolete: true
Comment 84•13 years ago
|
||
Paul, I've got a question for you concerning your "part 1" patch for bug 414946 (http://hg.mozilla.org/mozilla-central/rev/2b2f584dc5fd). As best we can tell, as of that patch the create_zone() function no longer does anything useful, and can safely be removed. But your patch didn't get rid of it. Did you have a reason for that? Thanks in advance!
Comment 85•13 years ago
|
||
The history of the jemalloc_zone is that the original implementation by jasone added a zone to the linked list of zones. After that, jasone started overriding the default zone, which actually worked and stuck. Looking at the code now, it looks like the jemalloc_zone doesn't do anything. I really don't remember why this is here - apart from the fact that it was in the original (it's still upstream by the look of it). Certainly, if it works to remove it, remove it.
Assignee | ||
Updated•13 years ago
|
Summary: Fix jemalloc chunk_alloc_mmap() function so it works properly on Mac 10.7 → Fix jemalloc hang, crash on Mac 10.7 and enable jemalloc on 10.7
Not going to get to this review until next week.
Comment on attachment 569477 [details] [diff] [review] Patch v4 Review of attachment 569477 [details] [diff] [review]: ----------------------------------------------------------------- ::: memory/jemalloc/jemalloc.c @@ +178,5 @@ > + * MALLOC_PAGEFILE causes all mmap()ed memory to be backed by temporary > + * files, so that if a chunk is mapped, it is guaranteed to be swappable. > + * This avoids asynchronous OOM failures that are due to VM over-commit. > + */ > +/* #define MALLOC_PAGEFILE */ Why are you making this change? @@ +2452,5 @@ > return (false); > } > #endif > > +#if defined(MOZ_MEMORY_WINDOWS) || defined(JEMALLOC_USES_MAP_ALIGN) || defined(MALLOC_PAGEFILE) Especially since you then add a conditional for it here.
Assignee | ||
Comment 88•13 years ago
|
||
>> +/* #define MALLOC_PAGEFILE */ > > Why are you making this change? AFAICT, Linux defines MALLOC_PAGEFILE but doesn't actually enable it. I want Linux to use the (presumably faster) chunk map routines.
Assignee | ||
Comment 89•13 years ago
|
||
Kyle, are you going to have a chance to look at this again before we branch? It would be cool to get this in this week.
Yeah, I should be able to review it tomorrow.
Comment on attachment 569477 [details] [diff] [review] Patch v4 Review of attachment 569477 [details] [diff] [review]: ----------------------------------------------------------------- Ok, this wasn't as scary as I expected. The code looks fine. Whether or not it manages to not explode on some odd configuration, I can't really predict. Let's try it and see! ::: memory/jemalloc/jemalloc.c @@ +2452,5 @@ > return (false); > } > #endif > > +#if defined(MOZ_MEMORY_WINDOWS) || defined(JEMALLOC_USES_MAP_ALIGN) || defined(MALLOC_PAGEFILE) Ok, now that I understand what's going on the conditional for the pagefile is ok. @@ +5963,5 @@ > default_zone = malloc_default_zone(); > > /* > + * We only use jemalloc on Mac 10.6 and 10.7. On Mac 10.5, our tests show > + * a memory regression, but this may not be real. See Mozilla bug 694335. Yeah, that comment is wrong now ;-) @@ +6662,5 @@ > * Actually create an object of the appropriate size, then find out > * how large it could have been without moving up to the next size > * class. > + * > + * I sure hope this doesn't get called often. lol
Attachment #569477 -
Flags: review?(khuey) → review+
Assignee | ||
Comment 92•13 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/62ba86c7c50c Fingers crossed.
status-firefox10:
--- → fixed
Assignee | ||
Comment 93•13 years ago
|
||
This is a big change we should track for FF10. We'll need to keep an eye that this doesn't cause hangs or crashes for users on 10.7.
tracking-firefox10:
--- → ?
Updated•13 years ago
|
Summary: Fix jemalloc hang, crash on Mac 10.7 and enable jemalloc on 10.7 → [10.7] Fix jemalloc hang, crash on Mac 10.7 and enable jemalloc on 10.7
Comment 94•13 years ago
|
||
https://hg.mozilla.org/mozilla-central/rev/62ba86c7c50c
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla10
Comment 95•13 years ago
|
||
We're not out of the woods yet. I just got this crash starting up a custom build made from current mozilla-central code. The patch that got landed should probably be backed out again.
Updated•13 years ago
|
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 96•13 years ago
|
||
Or at least jemalloc should be turned off for OS X 10.7 -- which is where I saw the crash.
Comment 97•13 years ago
|
||
> https://hg.mozilla.org/integration/mozilla-inbound/rev/62ba86c7c50c > https://hg.mozilla.org/mozilla-central/rev/62ba86c7c50c The wrong patch got landed! This isn't "Patch v4" (attachment 569477 [details] [diff] [review]).
Assignee | ||
Comment 98•13 years ago
|
||
Well, that's bad!
Assignee | ||
Comment 99•13 years ago
|
||
Backed out and re-pushed to inbound. https://hg.mozilla.org/integration/mozilla-inbound/rev/d7ccee4bd7f7 https://hg.mozilla.org/integration/mozilla-inbound/rev/242bbbe83dda Thanks for catching this, Steven! I'm not sure how I screwed this up...
Comment 100•13 years ago
|
||
No problem. If I hadn't caught it, someone else would have very soon :-)
Comment 101•13 years ago
|
||
By the way, you should probably also at least back the patch out from mozilla-central -- otherwise the shit will hit the fan in tomorrow's mozilla-central nightly.
Comment 102•13 years ago
|
||
> https://hg.mozilla.org/integration/mozilla-inbound/rev/242bbbe83dda
This is still wrong, and won't finish building.
A quick glance turned up that it still calls create_zone(). There may also be other problems.
Comment 103•13 years ago
|
||
The comment 99 relanding backed out from inbound due to failure to build on OS X: https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=242bbbe83dda { Undefined symbols: "_create_zone", referenced from: _jemalloc_darwin_init in jemalloc.o ld: symbol(s) not found collect2: ld returned 1 exit status make[7]: *** [libmozutils.dylib] Error 1 } https://hg.mozilla.org/integration/mozilla-inbound/rev/546c39b13be7
Assignee: nobody → justin.lebar+bug
Target Milestone: mozilla10 → ---
Assignee | ||
Comment 104•13 years ago
|
||
Okay, gonna do this right. Compiling on my 10.7 machine now...
Assignee | ||
Comment 105•13 years ago
|
||
Okay, I see what's going on. Problem lies between keyboard and chair. I missed a hunk in my .rej file. :-/ I have a patch tested on my 10.7 machine. This one should work.
Assignee | ||
Comment 106•13 years ago
|
||
> By the way, you should probably also at least back the patch out from mozilla-central -- otherwise
> the shit will hit the fan in tomorrow's mozilla-central nightly.
I presume we're going to merge m-i again today before the nightly?
Comment 107•13 years ago
|
||
> I presume we're going to merge m-i again today before the nightly?
I don't think so. Lately it's been happening just once a day, 5-7AM PST. The nightlies normally get built starting around 3-4AM.
Assignee | ||
Comment 108•13 years ago
|
||
<jlebar> Are we going to merge m-i to m-c again before the next nightly? <edmorley> jlebar: yeah I imagine so, since that's ~13-14 hours out and a fair bit has landed on inbound already :-) <edmorley> jlebar: particular reason? <jlebar> edmorley, If we don't merge, we need to back out the broken jemalloc 10.7 patch I landed, which was merged earlier. <edmorley> jlebar: ah <edmorley> jlebar: ok so at the very least I'll merge from https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=546c39b13be7 to m-c if not more
Comment 109•13 years ago
|
||
First backout: https://hg.mozilla.org/mozilla-central/rev/d7ccee4bd7f7 Relanding: https://hg.mozilla.org/mozilla-central/rev/242bbbe83dda Second backout: https://hg.mozilla.org/mozilla-central/rev/242bbbe83dda The 3rd (and hopefully final) landing is on just inbound at present, and will come across on the next merge (presuming it's green hehe).
Target Milestone: --- → mozilla10
Comment 110•13 years ago
|
||
Sorry for the bugspam, the third URL in comment 109 should have been https://hg.mozilla.org/mozilla-central/rev/546c39b13be7
Comment 111•13 years ago
|
||
I just built with current trunk (rev f8d66a792ddc) and running mochitests locally on Mac OS X 10.7 doesn't work any more. Lots of malloc errors to the console and the browser doesn't start.
Comment 112•13 years ago
|
||
https://hg.mozilla.org/mozilla-central/rev/57d804eae9a0 Fingers crossed! :-)
Status: REOPENED → RESOLVED
Closed: 13 years ago → 13 years ago
Resolution: --- → FIXED
Assignee | ||
Comment 113•13 years ago
|
||
(In reply to Josh Aas (Mozilla Corporation) from comment #111) > I just built with current trunk (rev f8d66a792ddc) and running mochitests > locally on Mac OS X 10.7 doesn't work any more. Lots of malloc errors to the > console and the browser doesn't start. That's worrying and weird. f8d66a792ddc didn't have jemalloc turned on for 10.7. Also, mochitest wfm with that rev on my 10.7 box. Can you try tip again?
Comment 114•13 years ago
|
||
(In reply to Justin Lebar [:jlebar] from comment #113) > (In reply to Josh Aas (Mozilla Corporation) from comment #111) > > I just built with current trunk (rev f8d66a792ddc) and running mochitests > > locally on Mac OS X 10.7 doesn't work any more. Lots of malloc errors to the > > console and the browser doesn't start. > > That's worrying and weird. f8d66a792ddc didn't have jemalloc turned on for > 10.7. Also, mochitest wfm with that rev on my 10.7 box. Can you try tip > again? Updating to tip fixed the problem, mochitests work again.
Comment 115•12 years ago
|
||
Is there a test case to verify this fix?
Whiteboard: [MemShrink:P1] → [MemShrink:P1][qa?]
Assignee | ||
Comment 116•12 years ago
|
||
(In reply to Anthony Hughes, Mozilla QA (irc: ashughes) from comment #115) > Is there a test case to verify this fix? Unfortunately not.
Assignee | ||
Comment 117•12 years ago
|
||
You could verify that jemalloc is enabled on 10.7 by putting a breakpoint or malloc_printf() in jemalloc.c's malloc() definition.
Comment 118•12 years ago
|
||
I don't think this is something which QA will be able to verify easily (in the face of prioritizing resources). Marking qa- in the hopes that someone else will be able to verify the fix.
Whiteboard: [MemShrink:P1][qa?] → [MemShrink:P1][qa-]
Updated•12 years ago
|
You need to log in
before you can comment on or make changes to this bug.
Description
•