Closed Bug 694896 Opened 13 years ago Closed 13 years ago

[10.7] Fix jemalloc hang, crash on Mac 10.7 and enable jemalloc on 10.7

Categories

(Core :: Memory Allocator, defect)

x86_64
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla10
Tracking Status
firefox10 + fixed

People

(Reporter: justin.lebar+bug, Assigned: justin.lebar+bug)

References

(Blocks 1 open bug)

Details

(Whiteboard: [MemShrink:P1][qa-])

Attachments

(6 files, 3 obsolete files)

jemalloc is currently disabled on MacOS 10.7 for reasons that I don't understand at the moment.  But as more Mac users transition to the new version, enabling jemalloc there becomes increasingly important.
Depends on: 414946
Whiteboard: [MemShrink]
bug 670175 is the problem.
Thanks.  I resolved that bug so we can track turning it on (and fixing that hang) here.
Depends on: 670175
Whiteboard: [MemShrink] → [MemShrink:P1]
Steven, we need some mac-fu here to understand the hangs in bug 670175.  Can you help?
I'm completely swamped, so the earliest I'll have time for this is sometime next week.

Furthermore I know very little about jemalloc, so I'll probably have to use brute force on this bug -- keep turning off bits of jemalloc until it stops happening, then see if the guilty "bit" can somehow be implemented differently.

That's likely to be very time consuming, and may be something you guys can do as well as I can :-)
Is it possible to dual-boot with the final version of 10.7?  If so, I might be able to poke this and at least get to the point that I need your help.  I sort of understand jemalloc, although jemalloc on mac is a complete mystery to me.  :)
> Is it possible to dual-boot with the final version of 10.7?

Sure.  You need to create extra partitions on your hard drive (or add another hard drive on which you can create additional partitions).  Then do a conventional install into one of those partitions.  Then you can use Startup Disk in System Preferences to choose which partition to boot from.

No, I don't know how to non-destructively add new partitions to an existing hard drive.  But ask Marcia Knous -- she may know how, or know who to ask.
> No, I don't know how to non-destructively add new partitions to an existing hard drive.

I think bootcamp can do this?  I'll see what I can figure out.
Note that your Lion partition should be at least 30GB in order for XCode to fit into it.  40GB would probably be better.  (I use 30GB, and it can be a tight fit.)
(In reply to Steven Michaud from comment #6)
> No, I don't know how to non-destructively add new partitions to an existing
> hard drive.  But ask Marcia Knous -- she may know how, or know who to ask.

I believe the stock version of Disk Utility does this just fine (unless your drive is very fragmented) — give it a try. It's very explicit about which partitions will lose data and which will remain intact, so just read the description before confirming the operation.
I had to run disk utilities repair on my partition (perhaps because I'd previously had a bootcamp partition), but then it worked, and I got Lion installed.  Thanks!
I can reproduce using the steps in bug 670175 comment 17, about half the time.  Firefox beachballs for about 5s.

From the hang stack in attachment 544925 [details], it looks like jemalloc is taking a long time in mmap.  (See the stacktrace for thread 1, at the bottom.)  Maybe 10.7 doesn't like the hack we use to force mmap'ed pages down towards the bottom of the address space.
I thought that trace indicated thread contention -- between the main thread and threads 10, 12, 13, and 23.  But that's more or less a guess.

Hence my hunch that we're going to need to brute-force this.
> Maybe 10.7 doesn't like the hack we use to force mmap'ed pages down
> towards the bottom of the address space.

Can we stop doing this, and see if that makes the problem go away?
I seem to be able to reproduce bug 670175 comment 17 consistently if I issue the |purge| command immediately before starting Firefox.
(In reply to Steven Michaud from comment #13)
> Can we stop doing this, and see if that makes the problem go away?

Yep, trying...
Interestingly, when I kill FF while it's hung, the whole system is unresponsive for a few seconds.  Typing and switching tabs in the terminal is very slow, for example.

> I thought that trace indicated thread contention -- between the main thread and threads 10, 12, 13, 
> and 23.  But that's more or less a guess.

Yes.  The fact that four threads are all blocked on the main thread's arena lock indicates that the main thread has probably been sitting there for a while.  So that leads me to suspect that mmap may be hanging...
> Interestingly, when I kill FF while it's hung, the whole system is
> unresponsive for a few seconds.  Typing and switching tabs in the
> terminal is very slow, for example.

This is an old problem with event taps -- a buggy system resource that
we're forced to use.  See bug 393664 and bug 611068.
Yikes, I think I understand what's happening.  Jemalloc uses a loop when it tries to do an aligned mmap.  The claim is that you only loop in the event of a race condition, where two threads call mmap simultaneously.  But we go around and around this loop for a long time.

The code assumes that if I mmap a 2mb chunk, unmap it, and then immediately try to map somewhere inside where that chunk used to be (so I get an aligned mapping), the final mmap operation will succeed in the absence of races.  It appears that this is not the case on 10.7.
The more sane (and loop-free) thing to do would be to map 2mb and then unmap any excess.  That's what jemalloc tip does.

We can't do that on Windows, because apparently Windows demands a 1:1 mapping between maps and unmaps.  But we ought to be able to backport this change for Linux and Mac.
Awesome, jsgcchunk is copy/pasted from jemalloc (or is it the other way around?).  So whatever problem jemalloc has here, presumably the js gc has too.
(In reply to Justin Lebar [:jlebar] from comment #20)
> Awesome, jsgcchunk is copy/pasted from jemalloc (or is it the other way
> around?).  So whatever problem jemalloc has here, presumably the js gc has
> too.

...although maybe not, because js uses vm_allocate instead of mmap.

Anyone know why js doesn't use mmap?
The code in jsgcchunk is super ugly and I spent quite of a bit of time looking at it today for 670596.

(In reply to Justin Lebar [:jlebar] from comment #20)
> Awesome, jsgcchunk is copy/pasted from jemalloc (or is it the other way
> around?).  So whatever problem jemalloc has here, presumably the js gc has
> too.

Igor created jsgcchunk as a copy from jemalloc about a year and a half ago in 553812.

(In reply to Justin Lebar [:jlebar] from comment #19)
> The more sane (and loop-free) thing to do would be to map 2mb and then unmap
> any excess.  That's what jemalloc tip does.

Agreed.  This should work well on mac and linux, assuming that we can't find a way to do aligned allocations directly.

> We can't do that on Windows, because apparently Windows demands a 1:1
> mapping between maps and unmaps.  But we ought to be able to backport this
> change for Linux and Mac.

In Windows, we can reserve address space independently of committing.  We should be able to make things better on Windows by grabbing the 2MiB of address space as a reservation and only committing when we grab the 1MiB slice afterwards.

I think what we need to do in jsgcchunk is push the high level AllocGCChunk interface into each OS's #define and just hyper-optimize for each OS individually.  We also need to add new per-OS functions for madvise/MEM_RESET functionality.  It would be nice if we could share this work with jemalloc.
> assuming that we can't find a way to do aligned allocations directly.

I've got to hope that if there were a way, upstream jemalloc would be using it.

> In Windows, we can reserve address space independently of committing.  We should be able to make 
> things better on Windows by grabbing the 2MiB of address space as a reservation and only committing 
> when we grab the 1MiB slice afterwards.

Yes, but two problems:

* We frequently run out of virtual address space on Windows-32, so we probably can't waste it like this.

* The chunk would now have to remember where its mapping starts, so that when you unmap the chunk, you do so from the right place.  This would be kind of an invasive jemalloc change.

Windows ain't broke, so I think we should leave it be for the moment and revisit separately whether we can improve it.
(In reply to Justin Lebar [:jlebar] from comment #23)
> > assuming that we can't find a way to do aligned allocations directly.
> 
> I've got to hope that if there were a way, upstream jemalloc would be using
> it.

Probably, but the linux kernel moves fast and doesn't communicate userland changes well.

> > In Windows, we can reserve address space independently of committing.  We should be able to make 
> > things better on Windows by grabbing the 2MiB of address space as a reservation and only committing 
> > when we grab the 1MiB slice afterwards.
> 
> Yes, but two problems:
> 
> * We frequently run out of virtual address space on Windows-32, so we
> probably can't waste it like this.
> 
> * The chunk would now have to remember where its mapping starts, so that
> when you unmap the chunk, you do so from the right place.  This would be
> kind of an invasive jemalloc change.
> 
> Windows ain't broke, so I think we should leave it be for the moment and
> revisit separately whether we can improve it.

I think I must have explained very poorly what I meant to do.  I am not suggesting changing the algorithm or numbers we use when allocating memory at all, just unsetting the MEM_COMMIT bit until we actually make our 1MiB allocation.
> I think I must have explained very poorly what I meant to do.  I am not suggesting changing the 
> algorithm or numbers we use when allocating memory at all, just unsetting the MEM_COMMIT bit until 
> we actually make our 1MiB allocation.

Ah, that might work!  Can we do this in another bug, please?
(In reply to Justin Lebar [:jlebar] from comment #25)
> Ah, that might work!  Can we do this in another bug, please?

Filed under 696119.
Attached patch Patch v1 (obsolete) — Splinter Review
Seems to work on my 10.7 and Linux-64 machines.  Pushed to try.
Summary: Enable jemalloc on MacOS 10.7 → Fix jemalloc chunk_alloc_mmap() function so it works properly on Mac 10.7
Blocks: 696162
Aside from a mysterious linux64 make check error, which is probably nothing, this looks good on try.

Steven, would you mind trying this build out on your 10.7 machine and seeing if you can get it to hang?  It doesn't hang for me on my 10.7 machine.

https://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/jlebar@mozilla.com-21ae01a7b22a/
Attachment #568432 - Flags: feedback?(smichaud)
If you feel like a slightly larger population of guinea pigs to test it on, feel free to land it on the UX branch. We have about 2K ADUs, a large percentage of which I would guess are on OS X 10.7. :)
Comment on attachment 568432 [details] [diff] [review]
Patch v1

This patch doesn't seem to have fixed the problem.  I still crash,
though now the STR is now different.  Here are the crashes I've had.
They seem to be the same crash, but I include them all for the sake of
completeness.  I'll try to get some gdb crash stacks, and to translate
the symbols in the Breakpad crash stacks:

bp-faaba078-ca18-4b50-a52e-ea4432111021
bp-e9738d44-e0b9-4dd0-b60f-fd1ac2111021
bp-b8a8cae1-4a37-43c3-94df-4b76a2111021
bp-a8476906-3a59-4d59-a8f8-b1e8f2111021
bp-6845a0ff-f298-41b9-8b65-e9b532111021

Here's my STR.  It's still rather difficult and pretty finicky.  I may
be able to improve it over time.

1) Restart your computer.

2) Run the test build, and then	a few seconds (5-15?) after it's
   started, move the mouse up to "Nightly" in the main menu and click
   on it to open the menu.

   I'll	usually	crash here, or sometimes even before the mouse reaches
   the main menu.

Sometimes subsequent attempts to run the test build will crash even
before the browser window has finished opening.  But as soon as you
fail to crash even once, the crashes become almost impossibly
difficult to reproduce.  Running "purge" beforehand doesn't seem to
help (to make the test build more crashy).

My STR above doesn't crash on recent nightlies.
Attachment #568432 - Flags: feedback?(smichaud) → feedback-
Even when I just turn jemalloc on for OS X 10.7, without making any of
the rest of your patch's changes, my STR from bug 670175 comment #17
no longer "works" for me -- the I-beam cursor in the Google search box
never stops flashing.

For this test, I just applied the following part of your patch:

-    osx_use_jemalloc = (default_zone->version == SNOW_LEOPARD_MALLOC_ZONE_T_VERSION);
+    osx_use_jemalloc = (default_zone->version == SNOW_LEOPARD_MALLOC_ZONE_T_VERSION ||
+                        default_zone->version == LION_MALLOC_ZONE_T_VERSION);
You'll notice that none of my Breakpad crash reports from comment #31 have symbols even for Mozilla code.  This has already come up in bug 693451, and a Breakpad bug on the issue has been opened upstream at http://code.google.com/p/google-breakpad/issues/detail?id=447.
Yeah, symbols of any kind would be very helpful...
It occurs to me that jemalloc might have (at least) two different problems on OS X 10.7 -- a hang bug (which your patch may have fixed) and a crash bug.
I think the patch did fix a hang I was observing.  At least, there was a loop which I observed spinning, and now the loop is gone!
Here are translations for (some of) the symbols for my crash stacks
from comment #31.  I used atos to do the translations.  There were a
few symbols it (apparently) couldn't translate.

I think	this pins the blame pretty well on jemalloc.  Though it
doesn't come near to providing enough information to fix the bug that
causes the crashes :-(

Note that one of the symbols is for malloc_printf, which indicates
there may have been something interesting in the system log.  I'll
reproduce a few more crashes, and see what I find.
[0x0-0x13013].org.mozilla.nightly: firefox(193,0x7fff79a83960)
  malloc: *** malloc_default_scalable_zone() failed to find 'DefaultMallocZone'
That's weird...it looks like this is the default allocator, not jemalloc.
Then I think jemalloc must be doing something to break the default allocator.

I'm still working on getting a gdb stack trace of the crash, with symbols and all threads.
Here it is!

This is from a non-debug custom build (with this bug's patch applied) with optimization turned off and no symbols stripped.  It's based on trunk code pulled at the beginning of this week (Monday).
Note that it looks a lot like my gdb crash traces from bug 670175.
This is a shot in the dark, but what if you change

  jemalloc_zone->zone_name = "jemalloc_zone";

to

  jemalloc_zone->zone_name = "DefaultMallocZone";

?
(In reply to comment #44)

I'm trying it.  I'll let you know my results when the build finishes.
btw, I find that to rebuild jemalloc, I need to do

  rm -f browser/app/nsBrowserApp.o && make -C memory && make -C browser && make -C toolkit/library

if you don't remove nsBrowserApp.o, the jemalloc changes might not be reflected, even if you do a top-level rebuild.
(In reply to Justin Lebar [:jlebar] from comment #47)
> if you don't remove nsBrowserApp.o, the jemalloc changes might not be
> reflected, even if you do a top-level rebuild.

That doesn't make sense. jemalloc is a shared library on mac. Rebuilding libmozutils.dylib should be enough.
Following your suggestion from comment #44 didn't help, though it did
change the stack trace slightly, and got rid of the malloc_printf
error message.  I think the problem reported by malloc_printf must be
a symptom, and not the cause.

By the way, I didn't need to delete browser/app/nsBrowserApp.o.  I
just changed the source and did a top-level rebuild.  (I always do
that --	doing otherwise just wastes all the time I might have saved
doing non-top-level builds, or more than the time, on fixing the
problems it causes.)
> Following your suggestion from comment #44 didn't help, though it did
> change the stack trace slightly, and got rid of the malloc_printf
> error message. 

Okay, that's pretty good!

Maybe it's then trying to call one of the zone functions which is null.

You could replace the NULLs in create_zone and szone2ozone with 0x1, 0x2, 0x3, etc.  Maybe that would tell us which function it's trying to call.
> You could replace the NULLs in create_zone and szone2ozone with 0x1,
> 0x2, 0x3, etc.  Maybe that would tell us which function it's trying
> to call.

I did that, without backing out your suggestion from comment #44, and
it made no difference -- I still get exactly the same null
dereference, and no output from malloc_printf.

Next I'll try backing out your suggestion from comment #44, to see if
that makes any difference.
> Next I'll try backing out your suggestion from comment #44, to see
> if that makes any difference.

This just restored the original error message from malloc_printf and
the original crash at 0x10f0.
I've got a few minutes left at the end of the day, that I can't really spend doing anything else, so I'm going to try a few of my own random shots in the dark.

I just tried putting #define NO_TLS at the top of jemalloc.c -- that doesn't help.
I'm pretty sure mac doesn't build without NO_TLS.

I think we've made some progress here.  I'll try to get things set up so I can debug this locally.
Now I've found something that *does* help:  The crashes disappear if you never call szone2ozone()!  I tested twice (with my STR from comment #31) just to be sure.

> I'll try to get things set up so I can debug this locally.

That's probably a good idea :-)
Ah, so AIUI, szone2ozone sets us up so that certain system-level allocations go through jemalloc rather than the default allocator.

It  might not like the zone name...
> It  might not like the zone name...

I suspect the name isn't the problem ... though I've no idea what the problem actually is.

In any case, though, szone2ozone is a lot smaller target to brute-force than the whole of jemalloc.c.
(Following up comment #57)

>> It might not like the zone name...
>
> I suspect the name isn't the problem ... though I've no idea what
> the problem actually is.

Actually you were right (and I was wrong).  Making the following
change in szone2ozone is enough to make my crashes go away:

-       default_zone->zone_name = "jemalloc_ozone";
+       if (default_zone->version < LION_MALLOC_ZONE_T_VERSION) {
+               default_zone->zone_name = "jemalloc_ozone";
+       } else {
+               default_zone->zone_name = "DefaultMallocZone";
+       }

I still don't know *why* this change is necessary on Lion.  As best I
can tell, "DefaultMallocZone" has been this zone's name for years.
But maybe it's only with Lion that the OS actually starts checking the
name.

It's also puzzling why we don't *always* crash on Lion, if the default
zone's name is "wrong".

But I think it's worthwhile landing this change on the trunk, to see
what happens.

Once jemalloc is re-enabled on Lion, we should (of course) check that
it really does give us performance gains and/or a reduction in memory
usage.
And note that the default zone's name is wrong (on Lion) even when we don't call szone2ozone, and we still don't crash.
So (sigh) I've changed my mind.

I no longer think we should land my change from comment #58 without understanding better what's going on (why that change "works", or appears to "work").
I'll continue to brute-force szone2ozone for a while longer, to see if I find anything interesting.
You may want to disassemble the calling stackframes (or use gdb watchpoints) to try to see where default_zone->zone_name is being read.

I've been reading machine code, but I haven't seen it yet...
So I did

+               default_zone->zone_name = (char*)0xdeadbeef;

and then waited for it to crash.

It crashes in

  strcmp()
  malloc_default_purgable_zone()
  ImageIO_Malloc()

malloc_default_purgable_zone appears to be in a loop looking for something.  Perhaps the string "DefaultMallocZone".  :)
> malloc_default_purgable_zone appears to be in a loop looking for
> something.  Perhaps the string "DefaultMallocZone".  :)

Nope, it's the string "DefaultPurgeableMallocZone" :-)

See http://opensource.apple.com/source/Libc/Libc-594.1.4/gen/malloc.c.
Which shows that you're debugging in libc code, and that Apple's
libc's source is available at http://opensource.apple.com.

That may make your life easier :-)
OS X 10.7.2's malloc.c is at:
http://opensource.apple.com/source/Libc/Libc-763.12/gen/malloc.c

You can also download the source.

See http://opensource.apple.com/ and start browsing under "Mac OS X".
Ah, that's much nicer than reading asm.  There's this, which explains comment 38:

static inline malloc_zone_t *
inline_malloc_default_scalable_zone(void) {
    unsigned index;

    if (malloc_def_zone_state < 2) _malloc_initialize();
    // _malloc_printf(ASL_LEVEL_INFO, "In inline_malloc_default_scalable_zone with %d %d\n", malloc_num_zones, malloc_has_debug_zone);

    MALLOC_LOCK();
    for (index = 0; index < malloc_num_zones; ++index) {
	malloc_zone_t *z = malloc_zones[index];

	if(z->zone_name && strcmp(z->zone_name, "DefaultMallocZone") == 0) {
	    MALLOC_UNLOCK();
	    return z;
	}
    }
    MALLOC_UNLOCK();

    malloc_printf("*** malloc_default_scalable_zone() failed to find 'DefaultMallocZone'\n");
    return NULL; // FIXME: abort() instead?
}
Another small piece of the puzzle:

My crashes stop happening if you don't change the zone_name in szone2ozone -- if you comment out the following line:

default_zone->zone_name = "jemalloc_ozone";

So the zone name doesn't have to be "DefaultMallocZone" (without the above change it's "jemalloc_zone").
> So the zone name doesn't have to be "DefaultMallocZone" (without the above change it's 
> "jemalloc_zone").

But in create_zone, don't we need to change the zone name from "jemalloc_zone" to "DefaultMallocZone"?
> But in create_zone, don't we need to change the zone name from
> "jemalloc_zone" to "DefaultMallocZone"?

What I reported in comment #67 tells me "no".

But I'm not yet sure that's the end of the story ...
I'm closing in on the real problem, I think:

I can also fix my crashes by adding the following (an extra call to
malloc_zone_register()) just after the single call to szone2ozone():

malloc_zone_register(default_zone);

Apparently Lion is fussy about changes made to the default zone after
it's been "registered".

But I suspect we should only call malloc_zone_register() once, and
that the single call should be the one after szone2ozone().  Just a
sec and I'll try that out.
Why do we have to register the jemalloc zone to begin with?  Registering the zone doesn't make it the default one, afaict.  It looks like only the szone overlay actually does something.

> But I suspect we should only call malloc_zone_register() once, and
> that the single call should be the one after szone2ozone().

If registering the zone *is* important, then this could be dangerous, because after szone2ozone, the system might be able to start using the jemalloc zone (effectively, not the zone itself, but its functions) before we've registered it. (?)
(Following up comment #70)

Oops, I got it wrong.  Adding the following after the call to
szone2ozone() doesn't help:

malloc_zone_register(default_zone);

Likewise with making it the only call to malloc_zone_register().

I was confused about default_zone (returned by a call to
malloc_default_zone()) and the "custom zone" created by create_zone().
default_zone is the default malloc zone created (and registered) by
the OS.  It's not a copy of the "custom zone", and doesn't become one.
In fact I don't know what the "custom zone" is for, and why (or even
if) it's needed.

We get the default zone from the OS.  And though we modify it, we
intend the OS to continue using it as if it had been unchanged (we in
fact hook it).  We also never check for any zone by name.

So we have no reason to change the default zone's zone_name in
szone2ozone(), on any version of OS X.  In fact we're lucky we got
away with this on versions of OS X prior to Lion.

So the fix for the crash bug is very simple.  Just get rid of the
following line in szone2ozone():

default_zone->zone_name = "jemalloc_ozone";
> Why do we have to register the jemalloc zone to begin with?
> Registering the zone doesn't make it the default one, afaict.  It
> looks like only the szone overlay actually does something.

I assume that by "the jemalloc zone" you mean what I've been calling
the "custom zone".  If so, then I agree that we (probably) don't need
to either create it or register it -- as far as I can tell it's not
used for anything.  But I don't believe that creating and registering
the "custom zone" does any harm.

It'd be interesting to know the history of this code, and whether or
not the "custom zone" once served some purpose.	 But I'll save that
for later, because right now I'm too tired to pursue it :-)
By the way, "registering" a zone doesn't make it the default.
Attached patch Patch v2 (obsolete) — Splinter Review
How about we get rid of this zone registration altogether?

This seems to work for me, but knowing you, you'll find some way to crash it.  :)

As far as I can tell, allocations are still going through jemalloc with this patch.  Or at least, the jemalloc heap size isn't significantly smaller with this patch than without.
Attachment #568432 - Attachment is obsolete: true
Attachment #569277 - Flags: feedback?
Attachment #569277 - Flags: feedback? → feedback?(smichaud)
Comment on attachment 569277 [details] [diff] [review]
Patch v2

-	jemalloc_zone->zone_name = "jemalloc_zone";
+	jemalloc_zone->zone_name = "DefaultMallocZone";

We don't want this.  We don't want anything besides the real default zone to have that name.

I'll do the rest of my review tomorrow.
Attached patch Patch v3 (obsolete) — Splinter Review
Attachment #569277 - Attachment is obsolete: true
Attachment #569277 - Flags: feedback?(smichaud)
Comment on attachment 569278 [details] [diff] [review]
Patch v3

Really get rid of create_zone().
Attachment #569278 - Attachment description: Fix jemalloc hangs on Mac 10.7. → Patch v3
Attachment #569278 - Flags: feedback?(smichaud)
> In fact I don't know what the "custom zone" is for, and why (or even
> if) it's needed.

We midaired each other and independently arrived at the same conclusion.  Cool.  :)
We should check whether we need to somehow lock around the modifications to default_zone.  If we're not the only thread alive at that point, we might be better off reordering the modifications so that free is set before malloc and friends.
On my 10.7 machine, szone2ozone is called when only one thread is alive.
Comment on attachment 569278 [details] [diff] [review]
Patch v3

This looks fine to me.  And I no longer crash or hang :-)

Though I do have a couple of nits:

+        /* Don't modify default_zone->zone_name; Mac libc may rely on the name
+         * being uncahnged. */

It's "unchanged".  And you might want to reference this bug (bug
694896).

 /* Likewise for l_zone_introspect. */
 static lion_malloc_introspection l_zone_introspect, l_ozone_introspect;
 static malloc_introspection_t * const zone_introspect =
        (malloc_introspection_t*)(&l_zone_introspect);
 static malloc_introspection_t * const ozone_introspect =
        (malloc_introspection_t*)(&l_ozone_introspect);

We can get rid of l_zone_introspect and zone_introspect.

We can also get	rid of the following, which are	all now	dead code:

zone_size()
zone_free()
zone_realloc()
zone_force_lock()
zone_force_unlock()
zone_free_definite_size()

> On my 10.7 machine, szone2ozone is called when only one thread is
> alive.

On the Mac szone2ozone is ultimately called from the following,	which
is called by the OS when the module containing jemalloc.c is loaded:

__attribute__((constructor))
void
jemalloc_darwin_init(void);

For good measure I double-checked that the structures in
osx_zone_types.h still match their version-specific counterparts in
malloc.h on SnowLeopard and Lion -- they do.

I also checked that our use of the malloc_zone_t pointers in
szone2ozone() is sane -- it is.  Specifically, we don't NULL out any
pointers that aren't documented to sometimes be NULL, or are in
malloc_introspection_t (which it seems can all safely be NULL).

I also found when create_zone(), jemalloc_zone and friends were
"orphaned".  More on that in my next comment.
Attachment #569278 - Flags: feedback?(smichaud) → feedback+
Attached patch Patch v4Splinter Review
Attachment #569477 - Flags: review?(khuey)
Attachment #569278 - Attachment is obsolete: true
Paul, I've got a question for you concerning your "part 1" patch for
bug 414946 (http://hg.mozilla.org/mozilla-central/rev/2b2f584dc5fd).

As best we can tell, as of that patch the create_zone() function no
longer does anything useful, and can safely be removed.  But your
patch didn't get rid of it.  Did you have a reason for that?

Thanks in advance!
The history of the jemalloc_zone is that the original implementation by jasone added a zone to the linked list of zones. After that, jasone started overriding the default zone, which actually worked and stuck.

Looking at the code now, it looks like the jemalloc_zone doesn't do anything. I really don't remember why this is here - apart from the fact that it was in the original (it's still upstream by the look of it). Certainly, if it works to remove it, remove it.
Summary: Fix jemalloc chunk_alloc_mmap() function so it works properly on Mac 10.7 → Fix jemalloc hang, crash on Mac 10.7 and enable jemalloc on 10.7
Not going to get to this review until next week.
Comment on attachment 569477 [details] [diff] [review]
Patch v4

Review of attachment 569477 [details] [diff] [review]:
-----------------------------------------------------------------

::: memory/jemalloc/jemalloc.c
@@ +178,5 @@
> + * MALLOC_PAGEFILE causes all mmap()ed memory to be backed by temporary
> + * files, so that if a chunk is mapped, it is guaranteed to be swappable.
> + * This avoids asynchronous OOM failures that are due to VM over-commit.
> + */
> +/* #define MALLOC_PAGEFILE */

Why are you making this change?

@@ +2452,5 @@
>  	return (false);
>  }
>  #endif
>  
> +#if defined(MOZ_MEMORY_WINDOWS) || defined(JEMALLOC_USES_MAP_ALIGN) || defined(MALLOC_PAGEFILE)

Especially since you then add a conditional for it here.
>> +/* #define MALLOC_PAGEFILE */
>
> Why are you making this change?

AFAICT, Linux defines MALLOC_PAGEFILE but doesn't actually enable it.  I want Linux to use the (presumably faster) chunk map routines.
Kyle, are you going to have a chance to look at this again before we branch?  It would be cool to get this in this week.
Yeah, I should be able to review it tomorrow.
Comment on attachment 569477 [details] [diff] [review]
Patch v4

Review of attachment 569477 [details] [diff] [review]:
-----------------------------------------------------------------

Ok, this wasn't as scary as I expected.

The code looks fine.  Whether or not it manages to not explode on some odd configuration, I can't really predict.  Let's try it and see!

::: memory/jemalloc/jemalloc.c
@@ +2452,5 @@
>  	return (false);
>  }
>  #endif
>  
> +#if defined(MOZ_MEMORY_WINDOWS) || defined(JEMALLOC_USES_MAP_ALIGN) || defined(MALLOC_PAGEFILE)

Ok, now that I understand what's going on the conditional for the pagefile is ok.

@@ +5963,5 @@
>      default_zone = malloc_default_zone();
>  
>      /*
> +     * We only use jemalloc on Mac 10.6 and 10.7.  On Mac 10.5, our tests show
> +     * a memory regression, but this may not be real.  See Mozilla bug 694335.

Yeah, that comment is wrong now ;-)

@@ +6662,5 @@
>  	 * Actually create an object of the appropriate size, then find out
>  	 * how large it could have been without moving up to the next size
>  	 * class.
> +	 *
> +	 * I sure hope this doesn't get called often.

lol
Attachment #569477 - Flags: review?(khuey) → review+
This is a big change we should track for FF10.  We'll need to keep an eye that this doesn't cause hangs or crashes for users on 10.7.
Summary: Fix jemalloc hang, crash on Mac 10.7 and enable jemalloc on 10.7 → [10.7] Fix jemalloc hang, crash on Mac 10.7 and enable jemalloc on 10.7
https://hg.mozilla.org/mozilla-central/rev/62ba86c7c50c
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla10
We're not out of the woods yet.  I just got this crash starting up a custom build made from current mozilla-central code.

The patch that got landed should probably be backed out again.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Or at least jemalloc should be turned off for OS X 10.7 -- which is where I saw the crash.
Well, that's bad!
Backed out and re-pushed to inbound.

https://hg.mozilla.org/integration/mozilla-inbound/rev/d7ccee4bd7f7
https://hg.mozilla.org/integration/mozilla-inbound/rev/242bbbe83dda

Thanks for catching this, Steven!  I'm not sure how I screwed this up...
No problem.

If I hadn't caught it, someone else would have very soon :-)
By the way, you should probably also at least back the patch out from mozilla-central -- otherwise the shit will hit the fan in tomorrow's mozilla-central nightly.
> https://hg.mozilla.org/integration/mozilla-inbound/rev/242bbbe83dda

This is still wrong, and won't finish building.

A quick glance turned up that it still calls create_zone().  There may also be other problems.
The comment 99 relanding backed out from inbound due to failure to build on OS X:
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=242bbbe83dda
{
Undefined symbols:
  "_create_zone", referenced from:
      _jemalloc_darwin_init in jemalloc.o
ld: symbol(s) not found
collect2: ld returned 1 exit status
make[7]: *** [libmozutils.dylib] Error 1
}

https://hg.mozilla.org/integration/mozilla-inbound/rev/546c39b13be7
Assignee: nobody → justin.lebar+bug
Target Milestone: mozilla10 → ---
Okay, gonna do this right.  Compiling on my 10.7 machine now...
Okay, I see what's going on.  Problem lies between keyboard and chair.  I missed a hunk in my .rej file.  :-/

I have a patch tested on my 10.7 machine.  This one should work.
> By the way, you should probably also at least back the patch out from mozilla-central -- otherwise 
> the shit will hit the fan in tomorrow's mozilla-central nightly.

I presume we're going to merge m-i again today before the nightly?
> I presume we're going to merge m-i again today before the nightly?

I don't think so.  Lately it's been happening just once a day, 5-7AM PST.  The nightlies normally get built starting around 3-4AM.
<jlebar> Are we going to merge m-i to m-c again before the next nightly?
<edmorley> jlebar: yeah I imagine so, since that's ~13-14 hours out and a fair bit has landed on inbound already :-)
<edmorley> jlebar: particular reason?
<jlebar> edmorley, If we don't merge, we need to back out the broken jemalloc 10.7 patch I landed, which was merged earlier.
<edmorley> jlebar: ah
<edmorley> jlebar: ok so at the very least I'll merge from https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=546c39b13be7 to m-c if not more
First backout: https://hg.mozilla.org/mozilla-central/rev/d7ccee4bd7f7
Relanding: https://hg.mozilla.org/mozilla-central/rev/242bbbe83dda
Second backout: https://hg.mozilla.org/mozilla-central/rev/242bbbe83dda

The 3rd (and hopefully final) landing is on just inbound at present, and will come across on the next merge (presuming it's green hehe).
Target Milestone: --- → mozilla10
Sorry for the bugspam, the third URL in comment 109 should have been https://hg.mozilla.org/mozilla-central/rev/546c39b13be7
I just built with current trunk (rev f8d66a792ddc) and running mochitests locally on Mac OS X 10.7 doesn't work any more. Lots of malloc errors to the console and the browser doesn't start.
https://hg.mozilla.org/mozilla-central/rev/57d804eae9a0

Fingers crossed! :-)
Status: REOPENED → RESOLVED
Closed: 13 years ago13 years ago
Resolution: --- → FIXED
(In reply to Josh Aas (Mozilla Corporation) from comment #111)
> I just built with current trunk (rev f8d66a792ddc) and running mochitests
> locally on Mac OS X 10.7 doesn't work any more. Lots of malloc errors to the
> console and the browser doesn't start.

That's worrying and weird.  f8d66a792ddc didn't have jemalloc turned on for 10.7.  Also, mochitest wfm with that rev on my 10.7 box.  Can you try tip again?
(In reply to Justin Lebar [:jlebar] from comment #113)
> (In reply to Josh Aas (Mozilla Corporation) from comment #111)
> > I just built with current trunk (rev f8d66a792ddc) and running mochitests
> > locally on Mac OS X 10.7 doesn't work any more. Lots of malloc errors to the
> > console and the browser doesn't start.
> 
> That's worrying and weird.  f8d66a792ddc didn't have jemalloc turned on for
> 10.7.  Also, mochitest wfm with that rev on my 10.7 box.  Can you try tip
> again?

Updating to tip fixed the problem, mochitests work again.
Depends on: 703087
Is there a test case to verify this fix?
Whiteboard: [MemShrink:P1] → [MemShrink:P1][qa?]
(In reply to Anthony Hughes, Mozilla QA (irc: ashughes) from comment #115)
> Is there a test case to verify this fix?

Unfortunately not.
You could verify that jemalloc is enabled on 10.7 by putting a breakpoint or malloc_printf() in jemalloc.c's malloc() definition.
I don't think this is something which QA will be able to verify easily (in the face of prioritizing resources). Marking qa- in the hopes that someone else will be able to verify the fix.
Whiteboard: [MemShrink:P1][qa?] → [MemShrink:P1][qa-]
You need to log in before you can comment on or make changes to this bug.