694896 - [10.7] Fix jemalloc hang, crash on Mac 10.7 and enable jemalloc on 10.7

Assignee

Description

•

13 years ago

jemalloc is currently disabled on MacOS 10.7 for reasons that I don't understand at the moment.  But as more Mac users transition to the new version, enabling jemalloc there becomes increasingly important.

Justin Lebar (not reading bugmail)

Assignee

Updated

•

13 years ago

Depends on: 414946

Justin Lebar (not reading bugmail)

Assignee

Updated

•

13 years ago

Whiteboard: [MemShrink]

Mike Hommey [:glandium]

Comment 1

•

13 years ago

bug 670175 is the problem.

Justin Lebar (not reading bugmail)

Assignee

Comment 2

•

13 years ago

Thanks.  I resolved that bug so we can track turning it on (and fixing that hang) here.

Depends on: 670175

Justin Lebar (not reading bugmail)

Assignee

Updated

•

13 years ago

Blocks: lion-compatibility

Justin Lebar (not reading bugmail)

Assignee

Updated

•

13 years ago

Whiteboard: [MemShrink] → [MemShrink:P1]

Justin Lebar (not reading bugmail)

Assignee

Comment 3

•

13 years ago

Steven, we need some mac-fu here to understand the hangs in bug 670175.  Can you help?

Steven Michaud [:smichaud] (Retired)

Comment 4

•

13 years ago

I'm completely swamped, so the earliest I'll have time for this is sometime next week.

Furthermore I know very little about jemalloc, so I'll probably have to use brute force on this bug -- keep turning off bits of jemalloc until it stops happening, then see if the guilty "bit" can somehow be implemented differently.

That's likely to be very time consuming, and may be something you guys can do as well as I can :-)

Justin Lebar (not reading bugmail)

Assignee

Comment 5

•

13 years ago

Is it possible to dual-boot with the final version of 10.7?  If so, I might be able to poke this and at least get to the point that I need your help.  I sort of understand jemalloc, although jemalloc on mac is a complete mystery to me.  :)

Steven Michaud [:smichaud] (Retired)

Comment 6

•

13 years ago

> Is it possible to dual-boot with the final version of 10.7?

Sure.  You need to create extra partitions on your hard drive (or add another hard drive on which you can create additional partitions).  Then do a conventional install into one of those partitions.  Then you can use Startup Disk in System Preferences to choose which partition to boot from.

No, I don't know how to non-destructively add new partitions to an existing hard drive.  But ask Marcia Knous -- she may know how, or know who to ask.

Justin Lebar (not reading bugmail)

Assignee

Comment 7

•

13 years ago

> No, I don't know how to non-destructively add new partitions to an existing hard drive.

I think bootcamp can do this?  I'll see what I can figure out.

Steven Michaud [:smichaud] (Retired)

Comment 8

•

13 years ago

Note that your Lion partition should be at least 30GB in order for XCode to fit into it.  40GB would probably be better.  (I use 30GB, and it can be a tight fit.)

Alex Limi (:limi) — Firefox UX Team

Comment 9

•

13 years ago

(In reply to Steven Michaud from comment #6)
> No, I don't know how to non-destructively add new partitions to an existing
> hard drive.  But ask Marcia Knous -- she may know how, or know who to ask.

I believe the stock version of Disk Utility does this just fine (unless your drive is very fragmented) — give it a try. It's very explicit about which partitions will lose data and which will remain intact, so just read the description before confirming the operation.

Justin Lebar (not reading bugmail)

Assignee

Comment 10

•

13 years ago

I had to run disk utilities repair on my partition (perhaps because I'd previously had a bootcamp partition), but then it worked, and I got Lion installed.  Thanks!

Justin Lebar (not reading bugmail)

Assignee

Comment 11

•

13 years ago

I can reproduce using the steps in bug 670175 comment 17, about half the time.  Firefox beachballs for about 5s.

From the hang stack in attachment 544925 [details], it looks like jemalloc is taking a long time in mmap.  (See the stacktrace for thread 1, at the bottom.)  Maybe 10.7 doesn't like the hack we use to force mmap'ed pages down towards the bottom of the address space.

Steven Michaud [:smichaud] (Retired)

Comment 12

•

13 years ago

I thought that trace indicated thread contention -- between the main thread and threads 10, 12, 13, and 23.  But that's more or less a guess.

Hence my hunch that we're going to need to brute-force this.

Steven Michaud [:smichaud] (Retired)

Comment 13

•

13 years ago

> Maybe 10.7 doesn't like the hack we use to force mmap'ed pages down
> towards the bottom of the address space.

Can we stop doing this, and see if that makes the problem go away?

Justin Lebar (not reading bugmail)

Assignee

Comment 14

•

13 years ago

I seem to be able to reproduce bug 670175 comment 17 consistently if I issue the |purge| command immediately before starting Firefox.

Justin Lebar (not reading bugmail)

Assignee

Comment 15

•

13 years ago

(In reply to Steven Michaud from comment #13)
> Can we stop doing this, and see if that makes the problem go away?

Yep, trying...

Justin Lebar (not reading bugmail)

Assignee

Comment 16

•

13 years ago

Interestingly, when I kill FF while it's hung, the whole system is unresponsive for a few seconds.  Typing and switching tabs in the terminal is very slow, for example.

> I thought that trace indicated thread contention -- between the main thread and threads 10, 12, 13, 
> and 23.  But that's more or less a guess.

Yes.  The fact that four threads are all blocked on the main thread's arena lock indicates that the main thread has probably been sitting there for a while.  So that leads me to suspect that mmap may be hanging...

Steven Michaud [:smichaud] (Retired)

Comment 17

•

13 years ago

> Interestingly, when I kill FF while it's hung, the whole system is
> unresponsive for a few seconds.  Typing and switching tabs in the
> terminal is very slow, for example.

This is an old problem with event taps -- a buggy system resource that
we're forced to use.  See bug 393664 and bug 611068.

Justin Lebar (not reading bugmail)

Assignee

Comment 18

•

13 years ago

Yikes, I think I understand what's happening.  Jemalloc uses a loop when it tries to do an aligned mmap.  The claim is that you only loop in the event of a race condition, where two threads call mmap simultaneously.  But we go around and around this loop for a long time.

The code assumes that if I mmap a 2mb chunk, unmap it, and then immediately try to map somewhere inside where that chunk used to be (so I get an aligned mapping), the final mmap operation will succeed in the absence of races.  It appears that this is not the case on 10.7.

Justin Lebar (not reading bugmail)

Assignee

Comment 19

•

13 years ago

The more sane (and loop-free) thing to do would be to map 2mb and then unmap any excess.  That's what jemalloc tip does.

We can't do that on Windows, because apparently Windows demands a 1:1 mapping between maps and unmaps.  But we ought to be able to backport this change for Linux and Mac.

Justin Lebar (not reading bugmail)

Assignee

Comment 20

•

13 years ago

Awesome, jsgcchunk is copy/pasted from jemalloc (or is it the other way around?).  So whatever problem jemalloc has here, presumably the js gc has too.

Justin Lebar (not reading bugmail)

Assignee

Comment 21

•

13 years ago

(In reply to Justin Lebar [:jlebar] from comment #20)
> Awesome, jsgcchunk is copy/pasted from jemalloc (or is it the other way
> around?).  So whatever problem jemalloc has here, presumably the js gc has
> too.

...although maybe not, because js uses vm_allocate instead of mmap.

Anyone know why js doesn't use mmap?

Terrence Cole [:terrence]

Comment 22

•

13 years ago

The code in jsgcchunk is super ugly and I spent quite of a bit of time looking at it today for 670596.

(In reply to Justin Lebar [:jlebar] from comment #20)
> Awesome, jsgcchunk is copy/pasted from jemalloc (or is it the other way
> around?).  So whatever problem jemalloc has here, presumably the js gc has
> too.

Igor created jsgcchunk as a copy from jemalloc about a year and a half ago in 553812.

(In reply to Justin Lebar [:jlebar] from comment #19)
> The more sane (and loop-free) thing to do would be to map 2mb and then unmap
> any excess.  That's what jemalloc tip does.

Agreed.  This should work well on mac and linux, assuming that we can't find a way to do aligned allocations directly.

> We can't do that on Windows, because apparently Windows demands a 1:1
> mapping between maps and unmaps.  But we ought to be able to backport this
> change for Linux and Mac.

In Windows, we can reserve address space independently of committing.  We should be able to make things better on Windows by grabbing the 2MiB of address space as a reservation and only committing when we grab the 1MiB slice afterwards.

I think what we need to do in jsgcchunk is push the high level AllocGCChunk interface into each OS's #define and just hyper-optimize for each OS individually.  We also need to add new per-OS functions for madvise/MEM_RESET functionality.  It would be nice if we could share this work with jemalloc.

Justin Lebar (not reading bugmail)

Assignee

Comment 23

•

13 years ago

> assuming that we can't find a way to do aligned allocations directly.

I've got to hope that if there were a way, upstream jemalloc would be using it.

> In Windows, we can reserve address space independently of committing.  We should be able to make 
> things better on Windows by grabbing the 2MiB of address space as a reservation and only committing 
> when we grab the 1MiB slice afterwards.

Yes, but two problems:

* We frequently run out of virtual address space on Windows-32, so we probably can't waste it like this.

* The chunk would now have to remember where its mapping starts, so that when you unmap the chunk, you do so from the right place.  This would be kind of an invasive jemalloc change.

Windows ain't broke, so I think we should leave it be for the moment and revisit separately whether we can improve it.

Terrence Cole [:terrence]

Comment 24

•

13 years ago

(In reply to Justin Lebar [:jlebar] from comment #23)
> > assuming that we can't find a way to do aligned allocations directly.
> 
> I've got to hope that if there were a way, upstream jemalloc would be using
> it.

Probably, but the linux kernel moves fast and doesn't communicate userland changes well.

> > In Windows, we can reserve address space independently of committing.  We should be able to make 
> > things better on Windows by grabbing the 2MiB of address space as a reservation and only committing 
> > when we grab the 1MiB slice afterwards.
> 
> Yes, but two problems:
> 
> * We frequently run out of virtual address space on Windows-32, so we
> probably can't waste it like this.
> 
> * The chunk would now have to remember where its mapping starts, so that
> when you unmap the chunk, you do so from the right place.  This would be
> kind of an invasive jemalloc change.
> 
> Windows ain't broke, so I think we should leave it be for the moment and
> revisit separately whether we can improve it.

I think I must have explained very poorly what I meant to do.  I am not suggesting changing the algorithm or numbers we use when allocating memory at all, just unsetting the MEM_COMMIT bit until we actually make our 1MiB allocation.

Justin Lebar (not reading bugmail)

Assignee

Comment 25

•

13 years ago

> I think I must have explained very poorly what I meant to do.  I am not suggesting changing the 
> algorithm or numbers we use when allocating memory at all, just unsetting the MEM_COMMIT bit until 
> we actually make our 1MiB allocation.

Ah, that might work!  Can we do this in another bug, please?

Terrence Cole [:terrence]

Comment 26

•

13 years ago

(In reply to Justin Lebar [:jlebar] from comment #25)
> Ah, that might work!  Can we do this in another bug, please?

Filed under 696119.

Justin Lebar (not reading bugmail)

Assignee

Comment 27

•

13 years ago

Attached patch Patch v1 (obsolete) — Details — Splinter Review

Seems to work on my 10.7 and Linux-64 machines.  Pushed to try.

Justin Lebar (not reading bugmail)

Assignee

Updated

•

13 years ago

Summary: Enable jemalloc on MacOS 10.7 → Fix jemalloc chunk_alloc_mmap() function so it works properly on Mac 10.7

Justin Lebar (not reading bugmail)

Assignee

Updated

•

13 years ago

Blocks: 696162

Justin Lebar (not reading bugmail)

Assignee

Comment 28

•

13 years ago

Aside from a mysterious linux64 make check error, which is probably nothing, this looks good on try.

Steven, would you mind trying this build out on your 10.7 machine and seeing if you can get it to hang?  It doesn't hang for me on my 10.7 machine.

https://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/jlebar@mozilla.com-21ae01a7b22a/

Justin Lebar (not reading bugmail)

Assignee

Updated

•

13 years ago

Attachment #568432 - Flags: feedback?(smichaud)

Alex Limi (:limi) — Firefox UX Team

Comment 29

•

13 years ago

If you feel like a slightly larger population of guinea pigs to test it on, feel free to land it on the UX branch. We have about 2K ADUs, a large percentage of which I would guess are on OS X 10.7. :)

Justin Lebar (not reading bugmail)

Assignee

Comment 30

•

13 years ago

I pushed this to UX: https://hg.mozilla.org/projects/ux/rev/5e823e7f5e5d

Steven Michaud [:smichaud] (Retired)

Comment 31

•

13 years ago

Comment on attachment 568432 [details] [diff] [review]
Patch v1

This patch doesn't seem to have fixed the problem.  I still crash,
though now the STR is now different.  Here are the crashes I've had.
They seem to be the same crash, but I include them all for the sake of
completeness.  I'll try to get some gdb crash stacks, and to translate
the symbols in the Breakpad crash stacks:

bp-faaba078-ca18-4b50-a52e-ea4432111021
bp-e9738d44-e0b9-4dd0-b60f-fd1ac2111021
bp-b8a8cae1-4a37-43c3-94df-4b76a2111021
bp-a8476906-3a59-4d59-a8f8-b1e8f2111021
bp-6845a0ff-f298-41b9-8b65-e9b532111021

Here's my STR.  It's still rather difficult and pretty finicky.  I may
be able to improve it over time.

1) Restart your computer.

2) Run the test build, and then	a few seconds (5-15?) after it's
   started, move the mouse up to "Nightly" in the main menu and click
   on it to open the menu.

   I'll	usually	crash here, or sometimes even before the mouse reaches
   the main menu.

Sometimes subsequent attempts to run the test build will crash even
before the browser window has finished opening.  But as soon as you
fail to crash even once, the crashes become almost impossibly
difficult to reproduce.  Running "purge" beforehand doesn't seem to
help (to make the test build more crashy).

My STR above doesn't crash on recent nightlies.

Attachment #568432 - Flags: feedback?(smichaud) → feedback-

Steven Michaud [:smichaud] (Retired)

Comment 32

•

13 years ago

Even when I just turn jemalloc on for OS X 10.7, without making any of
the rest of your patch's changes, my STR from bug 670175 comment #17
no longer "works" for me -- the I-beam cursor in the Google search box
never stops flashing.

For this test, I just applied the following part of your patch:

-    osx_use_jemalloc = (default_zone->version == SNOW_LEOPARD_MALLOC_ZONE_T_VERSION);
+    osx_use_jemalloc = (default_zone->version == SNOW_LEOPARD_MALLOC_ZONE_T_VERSION ||
+                        default_zone->version == LION_MALLOC_ZONE_T_VERSION);

Steven Michaud [:smichaud] (Retired)

Comment 33

•

13 years ago

You'll notice that none of my Breakpad crash reports from comment #31 have symbols even for Mozilla code.  This has already come up in bug 693451, and a Breakpad bug on the issue has been opened upstream at http://code.google.com/p/google-breakpad/issues/detail?id=447.

Justin Lebar (not reading bugmail)

Assignee

Comment 34

•

13 years ago

Yeah, symbols of any kind would be very helpful...

Justin Lebar (not reading bugmail)

Assignee

Comment 35

•

13 years ago

Backed out from UX: https://hg.mozilla.org/projects/ux/rev/dc08bf63b3b1

Steven Michaud [:smichaud] (Retired)

Comment 36

•

13 years ago

It occurs to me that jemalloc might have (at least) two different problems on OS X 10.7 -- a hang bug (which your patch may have fixed) and a crash bug.

Justin Lebar (not reading bugmail)

Assignee

Comment 37

•

13 years ago

I think the patch did fix a hang I was observing.  At least, there was a loop which I observed spinning, and now the loop is gone!

Steven Michaud [:smichaud] (Retired)

Comment 38

•

13 years ago

Attached file Translation of crash-stack symbols from comment #31 — Details

Here are translations for (some of) the symbols for my crash stacks
from comment #31.  I used atos to do the translations.  There were a
few symbols it (apparently) couldn't translate.

I think	this pins the blame pretty well on jemalloc.  Though it
doesn't come near to providing enough information to fix the bug that
causes the crashes :-(

Note that one of the symbols is for malloc_printf, which indicates
there may have been something interesting in the system log.  I'll
reproduce a few more crashes, and see what I find.

Steven Michaud [:smichaud] (Retired)

Comment 39

•

13 years ago

[0x0-0x13013].org.mozilla.nightly: firefox(193,0x7fff79a83960)
  malloc: *** malloc_default_scalable_zone() failed to find 'DefaultMallocZone'

Justin Lebar (not reading bugmail)

Assignee

Comment 40

•

13 years ago

That's weird...it looks like this is the default allocator, not jemalloc.

Steven Michaud [:smichaud] (Retired)

Comment 41

•

13 years ago

Then I think jemalloc must be doing something to break the default allocator.

I'm still working on getting a gdb stack trace of the crash, with symbols and all threads.

Steven Michaud [:smichaud] (Retired)

Comment 42

•

13 years ago

Attached file Gdb stack trace of crash, with symbols and all threads — Details

Here it is!

This is from a non-debug custom build (with this bug's patch applied) with optimization turned off and no symbols stripped.  It's based on trunk code pulled at the beginning of this week (Monday).

Steven Michaud [:smichaud] (Retired)

Comment 43

•

13 years ago

Note that it looks a lot like my gdb crash traces from bug 670175.

Justin Lebar (not reading bugmail)

Assignee

Comment 44

•

13 years ago

This is a shot in the dark, but what if you change

  jemalloc_zone->zone_name = "jemalloc_zone";

to

  jemalloc_zone->zone_name = "DefaultMallocZone";

?

Steven Michaud [:smichaud] (Retired)

Comment 45

•

13 years ago

Attached file Another gdb stack trace of crash — Details

Steven Michaud [:smichaud] (Retired)

Comment 46

•

13 years ago

(In reply to comment #44)

I'm trying it.  I'll let you know my results when the build finishes.

Justin Lebar (not reading bugmail)

Assignee

Comment 47

•

13 years ago

btw, I find that to rebuild jemalloc, I need to do

  rm -f browser/app/nsBrowserApp.o && make -C memory && make -C browser && make -C toolkit/library

if you don't remove nsBrowserApp.o, the jemalloc changes might not be reflected, even if you do a top-level rebuild.

Mike Hommey [:glandium]

Comment 48

•

13 years ago

(In reply to Justin Lebar [:jlebar] from comment #47)
> if you don't remove nsBrowserApp.o, the jemalloc changes might not be
> reflected, even if you do a top-level rebuild.

That doesn't make sense. jemalloc is a shared library on mac. Rebuilding libmozutils.dylib should be enough.

Steven Michaud [:smichaud] (Retired)

Comment 49

•

13 years ago

Attached file Gdb trace of crash after following suggestion from comment #44 — Details

Following your suggestion from comment #44 didn't help, though it did
change the stack trace slightly, and got rid of the malloc_printf
error message.  I think the problem reported by malloc_printf must be
a symptom, and not the cause.

By the way, I didn't need to delete browser/app/nsBrowserApp.o.  I
just changed the source and did a top-level rebuild.  (I always do
that --	doing otherwise just wastes all the time I might have saved
doing non-top-level builds, or more than the time, on fixing the
problems it causes.)

Justin Lebar (not reading bugmail)

Assignee

Comment 50

•

13 years ago

> Following your suggestion from comment #44 didn't help, though it did
> change the stack trace slightly, and got rid of the malloc_printf
> error message. 

Okay, that's pretty good!

Maybe it's then trying to call one of the zone functions which is null.

You could replace the NULLs in create_zone and szone2ozone with 0x1, 0x2, 0x3, etc.  Maybe that would tell us which function it's trying to call.

Steven Michaud [:smichaud] (Retired)

Comment 51

•

13 years ago

> You could replace the NULLs in create_zone and szone2ozone with 0x1,
> 0x2, 0x3, etc.  Maybe that would tell us which function it's trying
> to call.

I did that, without backing out your suggestion from comment #44, and
it made no difference -- I still get exactly the same null
dereference, and no output from malloc_printf.

Next I'll try backing out your suggestion from comment #44, to see if
that makes any difference.

Steven Michaud [:smichaud] (Retired)

Comment 52

•

13 years ago

> Next I'll try backing out your suggestion from comment #44, to see
> if that makes any difference.

This just restored the original error message from malloc_printf and
the original crash at 0x10f0.

Steven Michaud [:smichaud] (Retired)

Comment 53

•

13 years ago

I've got a few minutes left at the end of the day, that I can't really spend doing anything else, so I'm going to try a few of my own random shots in the dark.

I just tried putting #define NO_TLS at the top of jemalloc.c -- that doesn't help.

Justin Lebar (not reading bugmail)

Assignee

Comment 54

•

13 years ago

I'm pretty sure mac doesn't build without NO_TLS.

I think we've made some progress here.  I'll try to get things set up so I can debug this locally.

Steven Michaud [:smichaud] (Retired)

Comment 55

•

13 years ago

Now I've found something that *does* help:  The crashes disappear if you never call szone2ozone()!  I tested twice (with my STR from comment #31) just to be sure.

> I'll try to get things set up so I can debug this locally.

That's probably a good idea :-)

Justin Lebar (not reading bugmail)

Assignee

Comment 56

•

13 years ago

Ah, so AIUI, szone2ozone sets us up so that certain system-level allocations go through jemalloc rather than the default allocator.

It  might not like the zone name...

Steven Michaud [:smichaud] (Retired)

Comment 57

•

13 years ago

> It  might not like the zone name...

I suspect the name isn't the problem ... though I've no idea what the problem actually is.

In any case, though, szone2ozone is a lot smaller target to brute-force than the whole of jemalloc.c.

Steven Michaud [:smichaud] (Retired)

Comment 58

•

13 years ago

(Following up comment #57)

>> It might not like the zone name...
>
> I suspect the name isn't the problem ... though I've no idea what
> the problem actually is.

Actually you were right (and I was wrong).  Making the following
change in szone2ozone is enough to make my crashes go away:

-       default_zone->zone_name = "jemalloc_ozone";
+       if (default_zone->version < LION_MALLOC_ZONE_T_VERSION) {
+               default_zone->zone_name = "jemalloc_ozone";
+       } else {
+               default_zone->zone_name = "DefaultMallocZone";
+       }

I still don't know *why* this change is necessary on Lion.  As best I
can tell, "DefaultMallocZone" has been this zone's name for years.
But maybe it's only with Lion that the OS actually starts checking the
name.

It's also puzzling why we don't *always* crash on Lion, if the default
zone's name is "wrong".

But I think it's worthwhile landing this change on the trunk, to see
what happens.

Once jemalloc is re-enabled on Lion, we should (of course) check that
it really does give us performance gains and/or a reduction in memory
usage.

Steven Michaud [:smichaud] (Retired)

Comment 59

•

13 years ago

And note that the default zone's name is wrong (on Lion) even when we don't call szone2ozone, and we still don't crash.

Steven Michaud [:smichaud] (Retired)

Comment 60

•

13 years ago

So (sigh) I've changed my mind.

I no longer think we should land my change from comment #58 without understanding better what's going on (why that change "works", or appears to "work").

Steven Michaud [:smichaud] (Retired)

Comment 61

•

13 years ago

I'll continue to brute-force szone2ozone for a while longer, to see if I find anything interesting.

Justin Lebar (not reading bugmail)

Assignee

Comment 62

•

13 years ago

You may want to disassemble the calling stackframes (or use gdb watchpoints) to try to see where default_zone->zone_name is being read.

I've been reading machine code, but I haven't seen it yet...

Justin Lebar (not reading bugmail)

Assignee

Comment 63

•

13 years ago

So I did

+               default_zone->zone_name = (char*)0xdeadbeef;

and then waited for it to crash.

It crashes in

  strcmp()
  malloc_default_purgable_zone()
  ImageIO_Malloc()

malloc_default_purgable_zone appears to be in a loop looking for something.  Perhaps the string "DefaultMallocZone".  :)

Steven Michaud [:smichaud] (Retired)

Comment 64

•

13 years ago

> malloc_default_purgable_zone appears to be in a loop looking for
> something.  Perhaps the string "DefaultMallocZone".  :)

Nope, it's the string "DefaultPurgeableMallocZone" :-)

See http://opensource.apple.com/source/Libc/Libc-594.1.4/gen/malloc.c.
Which shows that you're debugging in libc code, and that Apple's
libc's source is available at http://opensource.apple.com.

That may make your life easier :-)

Steven Michaud [:smichaud] (Retired)

Comment 65

•

13 years ago

OS X 10.7.2's malloc.c is at:
http://opensource.apple.com/source/Libc/Libc-763.12/gen/malloc.c

You can also download the source.

See http://opensource.apple.com/ and start browsing under "Mac OS X".

Justin Lebar (not reading bugmail)

Assignee

Comment 66

•

13 years ago

Ah, that's much nicer than reading asm.  There's this, which explains comment 38:

static inline malloc_zone_t *
inline_malloc_default_scalable_zone(void) {
    unsigned index;

    if (malloc_def_zone_state < 2) _malloc_initialize();
    // _malloc_printf(ASL_LEVEL_INFO, "In inline_malloc_default_scalable_zone with %d %d\n", malloc_num_zones, malloc_has_debug_zone);

    MALLOC_LOCK();
    for (index = 0; index < malloc_num_zones; ++index) {
	malloc_zone_t *z = malloc_zones[index];

	if(z->zone_name && strcmp(z->zone_name, "DefaultMallocZone") == 0) {
	    MALLOC_UNLOCK();
	    return z;
	}
    }
    MALLOC_UNLOCK();

    malloc_printf("*** malloc_default_scalable_zone() failed to find 'DefaultMallocZone'\n");
    return NULL; // FIXME: abort() instead?
}

Steven Michaud [:smichaud] (Retired)

Comment 67

•

13 years ago

Another small piece of the puzzle:

My crashes stop happening if you don't change the zone_name in szone2ozone -- if you comment out the following line:

default_zone->zone_name = "jemalloc_ozone";

So the zone name doesn't have to be "DefaultMallocZone" (without the above change it's "jemalloc_zone").

Justin Lebar (not reading bugmail)

Assignee

Comment 68

•

13 years ago

> So the zone name doesn't have to be "DefaultMallocZone" (without the above change it's 
> "jemalloc_zone").

But in create_zone, don't we need to change the zone name from "jemalloc_zone" to "DefaultMallocZone"?

Steven Michaud [:smichaud] (Retired)

Comment 69

•

13 years ago

> But in create_zone, don't we need to change the zone name from
> "jemalloc_zone" to "DefaultMallocZone"?

What I reported in comment #67 tells me "no".

But I'm not yet sure that's the end of the story ...

Steven Michaud [:smichaud] (Retired)

Comment 70

•

13 years ago

I'm closing in on the real problem, I think:

I can also fix my crashes by adding the following (an extra call to
malloc_zone_register()) just after the single call to szone2ozone():

malloc_zone_register(default_zone);

Apparently Lion is fussy about changes made to the default zone after
it's been "registered".

But I suspect we should only call malloc_zone_register() once, and
that the single call should be the one after szone2ozone().  Just a
sec and I'll try that out.

Justin Lebar (not reading bugmail)

Assignee

Comment 71

•

13 years ago

Why do we have to register the jemalloc zone to begin with?  Registering the zone doesn't make it the default one, afaict.  It looks like only the szone overlay actually does something.

> But I suspect we should only call malloc_zone_register() once, and
> that the single call should be the one after szone2ozone().

If registering the zone *is* important, then this could be dangerous, because after szone2ozone, the system might be able to start using the jemalloc zone (effectively, not the zone itself, but its functions) before we've registered it. (?)

Steven Michaud [:smichaud] (Retired)

Comment 72

•

13 years ago

(Following up comment #70)

Oops, I got it wrong.  Adding the following after the call to
szone2ozone() doesn't help:

malloc_zone_register(default_zone);

Likewise with making it the only call to malloc_zone_register().

I was confused about default_zone (returned by a call to
malloc_default_zone()) and the "custom zone" created by create_zone().
default_zone is the default malloc zone created (and registered) by
the OS.  It's not a copy of the "custom zone", and doesn't become one.
In fact I don't know what the "custom zone" is for, and why (or even
if) it's needed.

We get the default zone from the OS.  And though we modify it, we
intend the OS to continue using it as if it had been unchanged (we in
fact hook it).  We also never check for any zone by name.

So we have no reason to change the default zone's zone_name in
szone2ozone(), on any version of OS X.  In fact we're lucky we got
away with this on versions of OS X prior to Lion.

So the fix for the crash bug is very simple.  Just get rid of the
following line in szone2ozone():

default_zone->zone_name = "jemalloc_ozone";

Steven Michaud [:smichaud] (Retired)

Comment 73

•

13 years ago

> Why do we have to register the jemalloc zone to begin with?
> Registering the zone doesn't make it the default one, afaict.  It
> looks like only the szone overlay actually does something.

I assume that by "the jemalloc zone" you mean what I've been calling
the "custom zone".  If so, then I agree that we (probably) don't need
to either create it or register it -- as far as I can tell it's not
used for anything.  But I don't believe that creating and registering
the "custom zone" does any harm.

It'd be interesting to know the history of this code, and whether or
not the "custom zone" once served some purpose.	 But I'll save that
for later, because right now I'm too tired to pursue it :-)

Steven Michaud [:smichaud] (Retired)

Comment 74

•

13 years ago

By the way, "registering" a zone doesn't make it the default.

Justin Lebar (not reading bugmail)

Assignee

Comment 75

•

13 years ago

Attached patch Patch v2 (obsolete) — Details — Splinter Review

How about we get rid of this zone registration altogether?

This seems to work for me, but knowing you, you'll find some way to crash it.  :)

As far as I can tell, allocations are still going through jemalloc with this patch.  Or at least, the jemalloc heap size isn't significantly smaller with this patch than without.

Attachment #568432 - Attachment is obsolete: true

Attachment #569277 - Flags: feedback?

Justin Lebar (not reading bugmail)

Assignee

Updated

•

13 years ago

Attachment #569277 - Flags: feedback? → feedback?(smichaud)

Steven Michaud [:smichaud] (Retired)

Comment 76

•

13 years ago

Comment on attachment 569277 [details] [diff] [review]
Patch v2

-	jemalloc_zone->zone_name = "jemalloc_zone";
+	jemalloc_zone->zone_name = "DefaultMallocZone";

We don't want this.  We don't want anything besides the real default zone to have that name.

I'll do the rest of my review tomorrow.

Justin Lebar (not reading bugmail)

Assignee

Comment 77

•

13 years ago

Attached patch Patch v3 (obsolete) — Details — Splinter Review

Justin Lebar (not reading bugmail)

Assignee

Updated

•

13 years ago

Attachment #569277 - Attachment is obsolete: true

Attachment #569277 - Flags: feedback?(smichaud)

Justin Lebar (not reading bugmail)

Assignee

Comment 78

•

13 years ago

Comment on attachment 569278 [details] [diff] [review]
Patch v3

Really get rid of create_zone().

Attachment #569278 - Attachment description: Fix jemalloc hangs on Mac 10.7. → Patch v3

Attachment #569278 - Flags: feedback?(smichaud)

Justin Lebar (not reading bugmail)

Assignee

Comment 79

•

13 years ago

> In fact I don't know what the "custom zone" is for, and why (or even
> if) it's needed.

We midaired each other and independently arrived at the same conclusion.  Cool.  :)

Justin Lebar (not reading bugmail)

Assignee

Comment 80

•

13 years ago

We should check whether we need to somehow lock around the modifications to default_zone.  If we're not the only thread alive at that point, we might be better off reordering the modifications so that free is set before malloc and friends.

Justin Lebar (not reading bugmail)

Assignee

Comment 81

•

13 years ago

On my 10.7 machine, szone2ozone is called when only one thread is alive.

Steven Michaud [:smichaud] (Retired)

Comment 82

•

13 years ago

Comment on attachment 569278 [details] [diff] [review]
Patch v3

This looks fine to me.  And I no longer crash or hang :-)

Though I do have a couple of nits:

+        /* Don't modify default_zone->zone_name; Mac libc may rely on the name
+         * being uncahnged. */

It's "unchanged".  And you might want to reference this bug (bug
694896).

 /* Likewise for l_zone_introspect. */
 static lion_malloc_introspection l_zone_introspect, l_ozone_introspect;
 static malloc_introspection_t * const zone_introspect =
        (malloc_introspection_t*)(&l_zone_introspect);
 static malloc_introspection_t * const ozone_introspect =
        (malloc_introspection_t*)(&l_ozone_introspect);

We can get rid of l_zone_introspect and zone_introspect.

We can also get	rid of the following, which are	all now	dead code:

zone_size()
zone_free()
zone_realloc()
zone_force_lock()
zone_force_unlock()
zone_free_definite_size()

> On my 10.7 machine, szone2ozone is called when only one thread is
> alive.

On the Mac szone2ozone is ultimately called from the following,	which
is called by the OS when the module containing jemalloc.c is loaded:

__attribute__((constructor))
void
jemalloc_darwin_init(void);

For good measure I double-checked that the structures in
osx_zone_types.h still match their version-specific counterparts in
malloc.h on SnowLeopard and Lion -- they do.

I also checked that our use of the malloc_zone_t pointers in
szone2ozone() is sane -- it is.  Specifically, we don't NULL out any
pointers that aren't documented to sometimes be NULL, or are in
malloc_introspection_t (which it seems can all safely be NULL).

I also found when create_zone(), jemalloc_zone and friends were
"orphaned".  More on that in my next comment.

Attachment #569278 - Flags: feedback?(smichaud) → feedback+

Justin Lebar (not reading bugmail)

Assignee

Comment 83

•

13 years ago

Attached patch Patch v4 — Details — Splinter Review

Attachment #569477 - Flags: review?(khuey)

Justin Lebar (not reading bugmail)

Assignee

Updated

•

13 years ago

Attachment #569278 - Attachment is obsolete: true

Steven Michaud [:smichaud] (Retired)

Comment 84

•

13 years ago

Paul, I've got a question for you concerning your "part 1" patch for
bug 414946 (http://hg.mozilla.org/mozilla-central/rev/2b2f584dc5fd).

As best we can tell, as of that patch the create_zone() function no
longer does anything useful, and can safely be removed.  But your
patch didn't get rid of it.  Did you have a reason for that?

Thanks in advance!

Paul Biggar

Comment 85

•

13 years ago

The history of the jemalloc_zone is that the original implementation by jasone added a zone to the linked list of zones. After that, jasone started overriding the default zone, which actually worked and stuck.

Looking at the code now, it looks like the jemalloc_zone doesn't do anything. I really don't remember why this is here - apart from the fact that it was in the original (it's still upstream by the look of it). Certainly, if it works to remove it, remove it.

Justin Lebar (not reading bugmail)

Assignee

Updated

•

13 years ago

Summary: Fix jemalloc chunk_alloc_mmap() function so it works properly on Mac 10.7 → Fix jemalloc hang, crash on Mac 10.7 and enable jemalloc on 10.7

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 86

•

13 years ago

Not going to get to this review until next week.

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 87

•

13 years ago

Comment on attachment 569477 [details] [diff] [review]
Patch v4

Review of attachment 569477 [details] [diff] [review]:
-----------------------------------------------------------------

::: memory/jemalloc/jemalloc.c
@@ +178,5 @@
> + * MALLOC_PAGEFILE causes all mmap()ed memory to be backed by temporary
> + * files, so that if a chunk is mapped, it is guaranteed to be swappable.
> + * This avoids asynchronous OOM failures that are due to VM over-commit.
> + */
> +/* #define MALLOC_PAGEFILE */

Why are you making this change?

@@ +2452,5 @@
>  	return (false);
>  }
>  #endif
>  
> +#if defined(MOZ_MEMORY_WINDOWS) || defined(JEMALLOC_USES_MAP_ALIGN) || defined(MALLOC_PAGEFILE)

Especially since you then add a conditional for it here.

Justin Lebar (not reading bugmail)

Assignee

Comment 88

•

13 years ago

>> +/* #define MALLOC_PAGEFILE */
>
> Why are you making this change?

AFAICT, Linux defines MALLOC_PAGEFILE but doesn't actually enable it.  I want Linux to use the (presumably faster) chunk map routines.

Justin Lebar (not reading bugmail)

Assignee

Comment 89

•

13 years ago

Kyle, are you going to have a chance to look at this again before we branch?  It would be cool to get this in this week.

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 90

•

13 years ago

Yeah, I should be able to review it tomorrow.

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 91

•

13 years ago

Comment on attachment 569477 [details] [diff] [review]
Patch v4

Review of attachment 569477 [details] [diff] [review]:
-----------------------------------------------------------------

Ok, this wasn't as scary as I expected.

The code looks fine.  Whether or not it manages to not explode on some odd configuration, I can't really predict.  Let's try it and see!

::: memory/jemalloc/jemalloc.c
@@ +2452,5 @@
>  	return (false);
>  }
>  #endif
>  
> +#if defined(MOZ_MEMORY_WINDOWS) || defined(JEMALLOC_USES_MAP_ALIGN) || defined(MALLOC_PAGEFILE)

Ok, now that I understand what's going on the conditional for the pagefile is ok.

@@ +5963,5 @@
>      default_zone = malloc_default_zone();
>  
>      /*
> +     * We only use jemalloc on Mac 10.6 and 10.7.  On Mac 10.5, our tests show
> +     * a memory regression, but this may not be real.  See Mozilla bug 694335.

Yeah, that comment is wrong now ;-)

@@ +6662,5 @@
>  	 * Actually create an object of the appropriate size, then find out
>  	 * how large it could have been without moving up to the next size
>  	 * class.
> +	 *
> +	 * I sure hope this doesn't get called often.

lol

Attachment #569477 - Flags: review?(khuey) → review+

Justin Lebar (not reading bugmail)

Assignee

Comment 92

•

13 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/62ba86c7c50c

Fingers crossed.

status-firefox10: --- → fixed

Justin Lebar (not reading bugmail)

Assignee

Comment 93

•

13 years ago

This is a big change we should track for FF10.  We'll need to keep an eye that this doesn't cause hangs or crashes for users on 10.7.

tracking-firefox10: --- → ?

Scoobidiver (away)

Updated

•

13 years ago

Summary: Fix jemalloc hang, crash on Mac 10.7 and enable jemalloc on 10.7 → [10.7] Fix jemalloc hang, crash on Mac 10.7 and enable jemalloc on 10.7

Marco Bonardo [:mak] (Away Apr 25 - May 5)

Comment 94

•

13 years ago

https://hg.mozilla.org/mozilla-central/rev/62ba86c7c50c

Status: NEW → RESOLVED

Closed: 13 years ago

Resolution: --- → FIXED

Target Milestone: --- → mozilla10

Steven Michaud [:smichaud] (Retired)

Comment 95

•

13 years ago

Attached file Gdb stack trace of crash in current code — Details

We're not out of the woods yet.  I just got this crash starting up a custom build made from current mozilla-central code.

The patch that got landed should probably be backed out again.

Steven Michaud [:smichaud] (Retired)

Updated

•

13 years ago

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Steven Michaud [:smichaud] (Retired)

Comment 96

•

13 years ago

Or at least jemalloc should be turned off for OS X 10.7 -- which is where I saw the crash.

Steven Michaud [:smichaud] (Retired)

Comment 97

•

13 years ago

> https://hg.mozilla.org/integration/mozilla-inbound/rev/62ba86c7c50c
> https://hg.mozilla.org/mozilla-central/rev/62ba86c7c50c

The wrong patch got landed!  This isn't "Patch v4" (attachment 569477 [details] [diff] [review]).

Justin Lebar (not reading bugmail)

Assignee

Comment 98

•

13 years ago

Well, that's bad!

Justin Lebar (not reading bugmail)

Assignee

Comment 99

•

13 years ago

Backed out and re-pushed to inbound.

https://hg.mozilla.org/integration/mozilla-inbound/rev/d7ccee4bd7f7
https://hg.mozilla.org/integration/mozilla-inbound/rev/242bbbe83dda

Thanks for catching this, Steven!  I'm not sure how I screwed this up...

Steven Michaud [:smichaud] (Retired)

Comment 100

•

13 years ago

No problem.

If I hadn't caught it, someone else would have very soon :-)

Steven Michaud [:smichaud] (Retired)

Comment 101

•

13 years ago

By the way, you should probably also at least back the patch out from mozilla-central -- otherwise the shit will hit the fan in tomorrow's mozilla-central nightly.

Steven Michaud [:smichaud] (Retired)

Comment 102

•

13 years ago

> https://hg.mozilla.org/integration/mozilla-inbound/rev/242bbbe83dda

This is still wrong, and won't finish building.

A quick glance turned up that it still calls create_zone().  There may also be other problems.

Ed Morley [:emorley]

Comment 103

•

13 years ago

The comment 99 relanding backed out from inbound due to failure to build on OS X:
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=242bbbe83dda
{
Undefined symbols:
  "_create_zone", referenced from:
      _jemalloc_darwin_init in jemalloc.o
ld: symbol(s) not found
collect2: ld returned 1 exit status
make[7]: *** [libmozutils.dylib] Error 1
}

https://hg.mozilla.org/integration/mozilla-inbound/rev/546c39b13be7

Assignee: nobody → justin.lebar+bug

Target Milestone: mozilla10 → ---

Justin Lebar (not reading bugmail)

Assignee

Comment 104

•

13 years ago

Okay, gonna do this right.  Compiling on my 10.7 machine now...

Justin Lebar (not reading bugmail)

Assignee

Comment 105

•

13 years ago

Okay, I see what's going on.  Problem lies between keyboard and chair.  I missed a hunk in my .rej file.  :-/

I have a patch tested on my 10.7 machine.  This one should work.

Justin Lebar (not reading bugmail)

Assignee

Comment 106

•

13 years ago

> By the way, you should probably also at least back the patch out from mozilla-central -- otherwise 
> the shit will hit the fan in tomorrow's mozilla-central nightly.

I presume we're going to merge m-i again today before the nightly?

Steven Michaud [:smichaud] (Retired)

Comment 107

•

13 years ago

> I presume we're going to merge m-i again today before the nightly?

I don't think so.  Lately it's been happening just once a day, 5-7AM PST.  The nightlies normally get built starting around 3-4AM.

Justin Lebar (not reading bugmail)

Assignee

Comment 108

•

13 years ago

<jlebar> Are we going to merge m-i to m-c again before the next nightly?
<edmorley> jlebar: yeah I imagine so, since that's ~13-14 hours out and a fair bit has landed on inbound already :-)
<edmorley> jlebar: particular reason?
<jlebar> edmorley, If we don't merge, we need to back out the broken jemalloc 10.7 patch I landed, which was merged earlier.
<edmorley> jlebar: ah
<edmorley> jlebar: ok so at the very least I'll merge from https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=546c39b13be7 to m-c if not more

Ed Morley [:emorley]

Comment 109

•

13 years ago

First backout: https://hg.mozilla.org/mozilla-central/rev/d7ccee4bd7f7
Relanding: https://hg.mozilla.org/mozilla-central/rev/242bbbe83dda
Second backout: https://hg.mozilla.org/mozilla-central/rev/242bbbe83dda

The 3rd (and hopefully final) landing is on just inbound at present, and will come across on the next merge (presuming it's green hehe).

Target Milestone: --- → mozilla10

Ed Morley [:emorley]

Comment 110

•

13 years ago

Sorry for the bugspam, the third URL in comment 109 should have been https://hg.mozilla.org/mozilla-central/rev/546c39b13be7

Josh Aas

Comment 111

•

13 years ago

I just built with current trunk (rev f8d66a792ddc) and running mochitests locally on Mac OS X 10.7 doesn't work any more. Lots of malloc errors to the console and the browser doesn't start.

Ed Morley [:emorley]

Comment 112

•

13 years ago

https://hg.mozilla.org/mozilla-central/rev/57d804eae9a0

Fingers crossed! :-)

Status: REOPENED → RESOLVED

Closed: 13 years ago → 13 years ago

Resolution: --- → FIXED

Justin Lebar (not reading bugmail)

Assignee

Comment 113

•

13 years ago

(In reply to Josh Aas (Mozilla Corporation) from comment #111)
> I just built with current trunk (rev f8d66a792ddc) and running mochitests
> locally on Mac OS X 10.7 doesn't work any more. Lots of malloc errors to the
> console and the browser doesn't start.

That's worrying and weird.  f8d66a792ddc didn't have jemalloc turned on for 10.7.  Also, mochitest wfm with that rev on my 10.7 box.  Can you try tip again?

Josh Aas

Comment 114

•

13 years ago

(In reply to Justin Lebar [:jlebar] from comment #113)
> (In reply to Josh Aas (Mozilla Corporation) from comment #111)
> > I just built with current trunk (rev f8d66a792ddc) and running mochitests
> > locally on Mac OS X 10.7 doesn't work any more. Lots of malloc errors to the
> > console and the browser doesn't start.
> 
> That's worrying and weird.  f8d66a792ddc didn't have jemalloc turned on for
> 10.7.  Also, mochitest wfm with that rev on my 10.7 box.  Can you try tip
> again?

Updating to tip fixed the problem, mochitests work again.

Justin Lebar (not reading bugmail)

Assignee

Updated

•

13 years ago

Depends on: 703087

u279076

Comment 115

•

13 years ago

Is there a test case to verify this fix?

Whiteboard: [MemShrink:P1] → [MemShrink:P1][qa?]

Justin Lebar (not reading bugmail)

Assignee

Comment 116

•

13 years ago

(In reply to Anthony Hughes, Mozilla QA (irc: ashughes) from comment #115)
> Is there a test case to verify this fix?

Unfortunately not.

Justin Lebar (not reading bugmail)

Assignee

Comment 117

•

13 years ago

You could verify that jemalloc is enabled on 10.7 by putting a breakpoint or malloc_printf() in jemalloc.c's malloc() definition.

u279076

Comment 118

•

13 years ago

I don't think this is something which QA will be able to verify easily (in the face of prioritizing resources). Marking qa- in the hopes that someone else will be able to verify the fix.

Whiteboard: [MemShrink:P1][qa?] → [MemShrink:P1][qa-]

Alex Keybl [:akeybl]

Updated

•

13 years ago

tracking-firefox10: ? → +

Patch v1 13 years ago Justin Lebar (not reading bugmail) 9.78 KB, patch	smichaud : feedback-	Details \| Diff \| Splinter Review
Translation of crash-stack symbols from comment #31 13 years ago Steven Michaud [:smichaud] (Retired) 1.07 KB, text/plain		Details
Gdb stack trace of crash, with symbols and all threads 13 years ago Steven Michaud [:smichaud] (Retired) 42.56 KB, text/plain		Details
Another gdb stack trace of crash 13 years ago Steven Michaud [:smichaud] (Retired) 25.81 KB, text/plain		Details
Gdb trace of crash after following suggestion from comment #44 13 years ago Steven Michaud [:smichaud] (Retired) 26.73 KB, text/plain		Details
Patch v2 13 years ago Justin Lebar (not reading bugmail) 12.60 KB, patch		Details \| Diff \| Splinter Review
Patch v3 13 years ago Justin Lebar (not reading bugmail) 13.98 KB, patch	smichaud : feedback+	Details \| Diff \| Splinter Review
Patch v4 13 years ago Justin Lebar (not reading bugmail) 17.31 KB, patch	khuey : review+	Details \| Diff \| Splinter Review
Gdb stack trace of crash in current code 13 years ago Steven Michaud [:smichaud] (Retired) 16.74 KB, text/plain		Details