Last Comment Bug 694896 - [10.7] Fix jemalloc hang, crash on Mac 10.7 and enable jemalloc on 10.7
: [10.7] Fix jemalloc hang, crash on Mac 10.7 and enable jemalloc on 10.7
Status: RESOLVED FIXED
[MemShrink:P1][qa-]
:
Product: Core
Classification: Components
Component: Memory Allocator (show other bugs)
: unspecified
: x86_64 Mac OS X
: -- normal with 2 votes (vote)
: mozilla10
Assigned To: Justin Lebar (not reading bugmail)
:
: Mike Hommey [:glandium]
Mentors:
Depends on: 414946 670175 703087
Blocks: lion-compatibility 696162
  Show dependency treegraph
 
Reported: 2011-10-16 20:19 PDT by Justin Lebar (not reading bugmail)
Modified: 2012-01-03 13:11 PST (History)
24 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
+
fixed


Attachments
Patch v1 (9.78 KB, patch)
2011-10-20 10:13 PDT, Justin Lebar (not reading bugmail)
smichaud: feedback-
Details | Diff | Splinter Review
Translation of crash-stack symbols from comment #31 (1.07 KB, text/plain)
2011-10-21 11:49 PDT, Steven Michaud [:smichaud] (Retired)
no flags Details
Gdb stack trace of crash, with symbols and all threads (42.56 KB, text/plain)
2011-10-21 12:55 PDT, Steven Michaud [:smichaud] (Retired)
no flags Details
Another gdb stack trace of crash (25.81 KB, text/plain)
2011-10-21 13:08 PDT, Steven Michaud [:smichaud] (Retired)
no flags Details
Gdb trace of crash after following suggestion from comment #44 (26.73 KB, text/plain)
2011-10-21 13:34 PDT, Steven Michaud [:smichaud] (Retired)
no flags Details
Patch v2 (12.60 KB, patch)
2011-10-24 21:00 PDT, Justin Lebar (not reading bugmail)
no flags Details | Diff | Splinter Review
Patch v3 (13.98 KB, patch)
2011-10-24 21:04 PDT, Justin Lebar (not reading bugmail)
smichaud: feedback+
Details | Diff | Splinter Review
Patch v4 (17.31 KB, patch)
2011-10-25 13:26 PDT, Justin Lebar (not reading bugmail)
khuey: review+
Details | Diff | Splinter Review
Gdb stack trace of crash in current code (16.74 KB, text/plain)
2011-11-03 09:38 PDT, Steven Michaud [:smichaud] (Retired)
no flags Details

Description Justin Lebar (not reading bugmail) 2011-10-16 20:19:01 PDT
jemalloc is currently disabled on MacOS 10.7 for reasons that I don't understand at the moment.  But as more Mac users transition to the new version, enabling jemalloc there becomes increasingly important.
Comment 1 Mike Hommey [:glandium] 2011-10-17 00:13:29 PDT
bug 670175 is the problem.
Comment 2 Justin Lebar (not reading bugmail) 2011-10-17 07:05:46 PDT
Thanks.  I resolved that bug so we can track turning it on (and fixing that hang) here.
Comment 3 Justin Lebar (not reading bugmail) 2011-10-18 12:38:02 PDT
Steven, we need some mac-fu here to understand the hangs in bug 670175.  Can you help?
Comment 4 Steven Michaud [:smichaud] (Retired) 2011-10-18 13:26:03 PDT
I'm completely swamped, so the earliest I'll have time for this is sometime next week.

Furthermore I know very little about jemalloc, so I'll probably have to use brute force on this bug -- keep turning off bits of jemalloc until it stops happening, then see if the guilty "bit" can somehow be implemented differently.

That's likely to be very time consuming, and may be something you guys can do as well as I can :-)
Comment 5 Justin Lebar (not reading bugmail) 2011-10-18 13:41:58 PDT
Is it possible to dual-boot with the final version of 10.7?  If so, I might be able to poke this and at least get to the point that I need your help.  I sort of understand jemalloc, although jemalloc on mac is a complete mystery to me.  :)
Comment 6 Steven Michaud [:smichaud] (Retired) 2011-10-18 13:48:14 PDT
> Is it possible to dual-boot with the final version of 10.7?

Sure.  You need to create extra partitions on your hard drive (or add another hard drive on which you can create additional partitions).  Then do a conventional install into one of those partitions.  Then you can use Startup Disk in System Preferences to choose which partition to boot from.

No, I don't know how to non-destructively add new partitions to an existing hard drive.  But ask Marcia Knous -- she may know how, or know who to ask.
Comment 7 Justin Lebar (not reading bugmail) 2011-10-18 13:51:36 PDT
> No, I don't know how to non-destructively add new partitions to an existing hard drive.

I think bootcamp can do this?  I'll see what I can figure out.
Comment 8 Steven Michaud [:smichaud] (Retired) 2011-10-18 14:20:45 PDT
Note that your Lion partition should be at least 30GB in order for XCode to fit into it.  40GB would probably be better.  (I use 30GB, and it can be a tight fit.)
Comment 9 Alex Limi (:limi) — Firefox UX Team 2011-10-19 01:29:43 PDT
(In reply to Steven Michaud from comment #6)
> No, I don't know how to non-destructively add new partitions to an existing
> hard drive.  But ask Marcia Knous -- she may know how, or know who to ask.

I believe the stock version of Disk Utility does this just fine (unless your drive is very fragmented) — give it a try. It's very explicit about which partitions will lose data and which will remain intact, so just read the description before confirming the operation.
Comment 10 Justin Lebar (not reading bugmail) 2011-10-19 05:30:31 PDT
I had to run disk utilities repair on my partition (perhaps because I'd previously had a bootcamp partition), but then it worked, and I got Lion installed.  Thanks!
Comment 11 Justin Lebar (not reading bugmail) 2011-10-19 13:15:00 PDT
I can reproduce using the steps in bug 670175 comment 17, about half the time.  Firefox beachballs for about 5s.

From the hang stack in attachment 544925 [details], it looks like jemalloc is taking a long time in mmap.  (See the stacktrace for thread 1, at the bottom.)  Maybe 10.7 doesn't like the hack we use to force mmap'ed pages down towards the bottom of the address space.
Comment 12 Steven Michaud [:smichaud] (Retired) 2011-10-19 13:26:13 PDT
I thought that trace indicated thread contention -- between the main thread and threads 10, 12, 13, and 23.  But that's more or less a guess.

Hence my hunch that we're going to need to brute-force this.
Comment 13 Steven Michaud [:smichaud] (Retired) 2011-10-19 13:28:47 PDT
> Maybe 10.7 doesn't like the hack we use to force mmap'ed pages down
> towards the bottom of the address space.

Can we stop doing this, and see if that makes the problem go away?
Comment 14 Justin Lebar (not reading bugmail) 2011-10-19 13:29:24 PDT
I seem to be able to reproduce bug 670175 comment 17 consistently if I issue the |purge| command immediately before starting Firefox.
Comment 15 Justin Lebar (not reading bugmail) 2011-10-19 13:29:53 PDT
(In reply to Steven Michaud from comment #13)
> Can we stop doing this, and see if that makes the problem go away?

Yep, trying...
Comment 16 Justin Lebar (not reading bugmail) 2011-10-19 13:32:51 PDT
Interestingly, when I kill FF while it's hung, the whole system is unresponsive for a few seconds.  Typing and switching tabs in the terminal is very slow, for example.

> I thought that trace indicated thread contention -- between the main thread and threads 10, 12, 13, 
> and 23.  But that's more or less a guess.

Yes.  The fact that four threads are all blocked on the main thread's arena lock indicates that the main thread has probably been sitting there for a while.  So that leads me to suspect that mmap may be hanging...
Comment 17 Steven Michaud [:smichaud] (Retired) 2011-10-19 13:51:03 PDT
> Interestingly, when I kill FF while it's hung, the whole system is
> unresponsive for a few seconds.  Typing and switching tabs in the
> terminal is very slow, for example.

This is an old problem with event taps -- a buggy system resource that
we're forced to use.  See bug 393664 and bug 611068.
Comment 18 Justin Lebar (not reading bugmail) 2011-10-19 13:56:08 PDT
Yikes, I think I understand what's happening.  Jemalloc uses a loop when it tries to do an aligned mmap.  The claim is that you only loop in the event of a race condition, where two threads call mmap simultaneously.  But we go around and around this loop for a long time.

The code assumes that if I mmap a 2mb chunk, unmap it, and then immediately try to map somewhere inside where that chunk used to be (so I get an aligned mapping), the final mmap operation will succeed in the absence of races.  It appears that this is not the case on 10.7.
Comment 19 Justin Lebar (not reading bugmail) 2011-10-19 14:37:20 PDT
The more sane (and loop-free) thing to do would be to map 2mb and then unmap any excess.  That's what jemalloc tip does.

We can't do that on Windows, because apparently Windows demands a 1:1 mapping between maps and unmaps.  But we ought to be able to backport this change for Linux and Mac.
Comment 20 Justin Lebar (not reading bugmail) 2011-10-19 16:10:04 PDT
Awesome, jsgcchunk is copy/pasted from jemalloc (or is it the other way around?).  So whatever problem jemalloc has here, presumably the js gc has too.
Comment 21 Justin Lebar (not reading bugmail) 2011-10-19 16:16:19 PDT
(In reply to Justin Lebar [:jlebar] from comment #20)
> Awesome, jsgcchunk is copy/pasted from jemalloc (or is it the other way
> around?).  So whatever problem jemalloc has here, presumably the js gc has
> too.

...although maybe not, because js uses vm_allocate instead of mmap.

Anyone know why js doesn't use mmap?
Comment 22 Terrence Cole [:terrence] 2011-10-19 17:33:39 PDT
The code in jsgcchunk is super ugly and I spent quite of a bit of time looking at it today for 670596.

(In reply to Justin Lebar [:jlebar] from comment #20)
> Awesome, jsgcchunk is copy/pasted from jemalloc (or is it the other way
> around?).  So whatever problem jemalloc has here, presumably the js gc has
> too.

Igor created jsgcchunk as a copy from jemalloc about a year and a half ago in 553812.

(In reply to Justin Lebar [:jlebar] from comment #19)
> The more sane (and loop-free) thing to do would be to map 2mb and then unmap
> any excess.  That's what jemalloc tip does.

Agreed.  This should work well on mac and linux, assuming that we can't find a way to do aligned allocations directly.

> We can't do that on Windows, because apparently Windows demands a 1:1
> mapping between maps and unmaps.  But we ought to be able to backport this
> change for Linux and Mac.

In Windows, we can reserve address space independently of committing.  We should be able to make things better on Windows by grabbing the 2MiB of address space as a reservation and only committing when we grab the 1MiB slice afterwards.

I think what we need to do in jsgcchunk is push the high level AllocGCChunk interface into each OS's #define and just hyper-optimize for each OS individually.  We also need to add new per-OS functions for madvise/MEM_RESET functionality.  It would be nice if we could share this work with jemalloc.
Comment 23 Justin Lebar (not reading bugmail) 2011-10-20 08:29:17 PDT
> assuming that we can't find a way to do aligned allocations directly.

I've got to hope that if there were a way, upstream jemalloc would be using it.

> In Windows, we can reserve address space independently of committing.  We should be able to make 
> things better on Windows by grabbing the 2MiB of address space as a reservation and only committing 
> when we grab the 1MiB slice afterwards.

Yes, but two problems:

* We frequently run out of virtual address space on Windows-32, so we probably can't waste it like this.

* The chunk would now have to remember where its mapping starts, so that when you unmap the chunk, you do so from the right place.  This would be kind of an invasive jemalloc change.

Windows ain't broke, so I think we should leave it be for the moment and revisit separately whether we can improve it.
Comment 24 Terrence Cole [:terrence] 2011-10-20 08:47:14 PDT
(In reply to Justin Lebar [:jlebar] from comment #23)
> > assuming that we can't find a way to do aligned allocations directly.
> 
> I've got to hope that if there were a way, upstream jemalloc would be using
> it.

Probably, but the linux kernel moves fast and doesn't communicate userland changes well.

> > In Windows, we can reserve address space independently of committing.  We should be able to make 
> > things better on Windows by grabbing the 2MiB of address space as a reservation and only committing 
> > when we grab the 1MiB slice afterwards.
> 
> Yes, but two problems:
> 
> * We frequently run out of virtual address space on Windows-32, so we
> probably can't waste it like this.
> 
> * The chunk would now have to remember where its mapping starts, so that
> when you unmap the chunk, you do so from the right place.  This would be
> kind of an invasive jemalloc change.
> 
> Windows ain't broke, so I think we should leave it be for the moment and
> revisit separately whether we can improve it.

I think I must have explained very poorly what I meant to do.  I am not suggesting changing the algorithm or numbers we use when allocating memory at all, just unsetting the MEM_COMMIT bit until we actually make our 1MiB allocation.
Comment 25 Justin Lebar (not reading bugmail) 2011-10-20 08:50:53 PDT
> I think I must have explained very poorly what I meant to do.  I am not suggesting changing the 
> algorithm or numbers we use when allocating memory at all, just unsetting the MEM_COMMIT bit until 
> we actually make our 1MiB allocation.

Ah, that might work!  Can we do this in another bug, please?
Comment 26 Terrence Cole [:terrence] 2011-10-20 09:50:00 PDT
(In reply to Justin Lebar [:jlebar] from comment #25)
> Ah, that might work!  Can we do this in another bug, please?

Filed under 696119.
Comment 27 Justin Lebar (not reading bugmail) 2011-10-20 10:13:50 PDT
Created attachment 568432 [details] [diff] [review]
Patch v1

Seems to work on my 10.7 and Linux-64 machines.  Pushed to try.
Comment 28 Justin Lebar (not reading bugmail) 2011-10-20 14:40:55 PDT
Aside from a mysterious linux64 make check error, which is probably nothing, this looks good on try.

Steven, would you mind trying this build out on your 10.7 machine and seeing if you can get it to hang?  It doesn't hang for me on my 10.7 machine.

https://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/jlebar@mozilla.com-21ae01a7b22a/
Comment 29 Alex Limi (:limi) — Firefox UX Team 2011-10-20 23:34:17 PDT
If you feel like a slightly larger population of guinea pigs to test it on, feel free to land it on the UX branch. We have about 2K ADUs, a large percentage of which I would guess are on OS X 10.7. :)
Comment 30 Justin Lebar (not reading bugmail) 2011-10-21 07:43:51 PDT
I pushed this to UX: https://hg.mozilla.org/projects/ux/rev/5e823e7f5e5d
Comment 31 Steven Michaud [:smichaud] (Retired) 2011-10-21 09:53:50 PDT
Comment on attachment 568432 [details] [diff] [review]
Patch v1

This patch doesn't seem to have fixed the problem.  I still crash,
though now the STR is now different.  Here are the crashes I've had.
They seem to be the same crash, but I include them all for the sake of
completeness.  I'll try to get some gdb crash stacks, and to translate
the symbols in the Breakpad crash stacks:

bp-faaba078-ca18-4b50-a52e-ea4432111021
bp-e9738d44-e0b9-4dd0-b60f-fd1ac2111021
bp-b8a8cae1-4a37-43c3-94df-4b76a2111021
bp-a8476906-3a59-4d59-a8f8-b1e8f2111021
bp-6845a0ff-f298-41b9-8b65-e9b532111021

Here's my STR.  It's still rather difficult and pretty finicky.  I may
be able to improve it over time.

1) Restart your computer.

2) Run the test build, and then	a few seconds (5-15?) after it's
   started, move the mouse up to "Nightly" in the main menu and click
   on it to open the menu.

   I'll	usually	crash here, or sometimes even before the mouse reaches
   the main menu.

Sometimes subsequent attempts to run the test build will crash even
before the browser window has finished opening.  But as soon as you
fail to crash even once, the crashes become almost impossibly
difficult to reproduce.  Running "purge" beforehand doesn't seem to
help (to make the test build more crashy).

My STR above doesn't crash on recent nightlies.
Comment 32 Steven Michaud [:smichaud] (Retired) 2011-10-21 10:00:01 PDT
Even when I just turn jemalloc on for OS X 10.7, without making any of
the rest of your patch's changes, my STR from bug 670175 comment #17
no longer "works" for me -- the I-beam cursor in the Google search box
never stops flashing.

For this test, I just applied the following part of your patch:

-    osx_use_jemalloc = (default_zone->version == SNOW_LEOPARD_MALLOC_ZONE_T_VERSION);
+    osx_use_jemalloc = (default_zone->version == SNOW_LEOPARD_MALLOC_ZONE_T_VERSION ||
+                        default_zone->version == LION_MALLOC_ZONE_T_VERSION);
Comment 33 Steven Michaud [:smichaud] (Retired) 2011-10-21 10:27:25 PDT
You'll notice that none of my Breakpad crash reports from comment #31 have symbols even for Mozilla code.  This has already come up in bug 693451, and a Breakpad bug on the issue has been opened upstream at http://code.google.com/p/google-breakpad/issues/detail?id=447.
Comment 34 Justin Lebar (not reading bugmail) 2011-10-21 10:36:43 PDT
Yeah, symbols of any kind would be very helpful...
Comment 35 Justin Lebar (not reading bugmail) 2011-10-21 10:37:55 PDT
Backed out from UX: https://hg.mozilla.org/projects/ux/rev/dc08bf63b3b1
Comment 36 Steven Michaud [:smichaud] (Retired) 2011-10-21 10:58:01 PDT
It occurs to me that jemalloc might have (at least) two different problems on OS X 10.7 -- a hang bug (which your patch may have fixed) and a crash bug.
Comment 37 Justin Lebar (not reading bugmail) 2011-10-21 11:20:42 PDT
I think the patch did fix a hang I was observing.  At least, there was a loop which I observed spinning, and now the loop is gone!
Comment 38 Steven Michaud [:smichaud] (Retired) 2011-10-21 11:49:19 PDT
Created attachment 568724 [details]
Translation of crash-stack symbols from comment #31

Here are translations for (some of) the symbols for my crash stacks
from comment #31.  I used atos to do the translations.  There were a
few symbols it (apparently) couldn't translate.

I think	this pins the blame pretty well on jemalloc.  Though it
doesn't come near to providing enough information to fix the bug that
causes the crashes :-(

Note that one of the symbols is for malloc_printf, which indicates
there may have been something interesting in the system log.  I'll
reproduce a few more crashes, and see what I find.
Comment 39 Steven Michaud [:smichaud] (Retired) 2011-10-21 11:57:43 PDT
[0x0-0x13013].org.mozilla.nightly: firefox(193,0x7fff79a83960)
  malloc: *** malloc_default_scalable_zone() failed to find 'DefaultMallocZone'
Comment 40 Justin Lebar (not reading bugmail) 2011-10-21 12:12:47 PDT
That's weird...it looks like this is the default allocator, not jemalloc.
Comment 41 Steven Michaud [:smichaud] (Retired) 2011-10-21 12:45:20 PDT
Then I think jemalloc must be doing something to break the default allocator.

I'm still working on getting a gdb stack trace of the crash, with symbols and all threads.
Comment 42 Steven Michaud [:smichaud] (Retired) 2011-10-21 12:55:24 PDT
Created attachment 568746 [details]
Gdb stack trace of crash, with symbols and all threads

Here it is!

This is from a non-debug custom build (with this bug's patch applied) with optimization turned off and no symbols stripped.  It's based on trunk code pulled at the beginning of this week (Monday).
Comment 43 Steven Michaud [:smichaud] (Retired) 2011-10-21 12:57:53 PDT
Note that it looks a lot like my gdb crash traces from bug 670175.
Comment 44 Justin Lebar (not reading bugmail) 2011-10-21 13:02:39 PDT
This is a shot in the dark, but what if you change

  jemalloc_zone->zone_name = "jemalloc_zone";

to

  jemalloc_zone->zone_name = "DefaultMallocZone";

?
Comment 45 Steven Michaud [:smichaud] (Retired) 2011-10-21 13:08:36 PDT
Created attachment 568750 [details]
Another gdb stack trace of crash
Comment 46 Steven Michaud [:smichaud] (Retired) 2011-10-21 13:13:44 PDT
(In reply to comment #44)

I'm trying it.  I'll let you know my results when the build finishes.
Comment 47 Justin Lebar (not reading bugmail) 2011-10-21 13:17:20 PDT
btw, I find that to rebuild jemalloc, I need to do

  rm -f browser/app/nsBrowserApp.o && make -C memory && make -C browser && make -C toolkit/library

if you don't remove nsBrowserApp.o, the jemalloc changes might not be reflected, even if you do a top-level rebuild.
Comment 48 Mike Hommey [:glandium] 2011-10-21 13:27:01 PDT
(In reply to Justin Lebar [:jlebar] from comment #47)
> if you don't remove nsBrowserApp.o, the jemalloc changes might not be
> reflected, even if you do a top-level rebuild.

That doesn't make sense. jemalloc is a shared library on mac. Rebuilding libmozutils.dylib should be enough.
Comment 49 Steven Michaud [:smichaud] (Retired) 2011-10-21 13:34:00 PDT
Created attachment 568760 [details]
Gdb trace of crash after following suggestion from comment #44

Following your suggestion from comment #44 didn't help, though it did
change the stack trace slightly, and got rid of the malloc_printf
error message.  I think the problem reported by malloc_printf must be
a symptom, and not the cause.

By the way, I didn't need to delete browser/app/nsBrowserApp.o.  I
just changed the source and did a top-level rebuild.  (I always do
that --	doing otherwise just wastes all the time I might have saved
doing non-top-level builds, or more than the time, on fixing the
problems it causes.)
Comment 50 Justin Lebar (not reading bugmail) 2011-10-21 13:51:47 PDT
> Following your suggestion from comment #44 didn't help, though it did
> change the stack trace slightly, and got rid of the malloc_printf
> error message. 

Okay, that's pretty good!

Maybe it's then trying to call one of the zone functions which is null.

You could replace the NULLs in create_zone and szone2ozone with 0x1, 0x2, 0x3, etc.  Maybe that would tell us which function it's trying to call.
Comment 51 Steven Michaud [:smichaud] (Retired) 2011-10-21 14:21:45 PDT
> You could replace the NULLs in create_zone and szone2ozone with 0x1,
> 0x2, 0x3, etc.  Maybe that would tell us which function it's trying
> to call.

I did that, without backing out your suggestion from comment #44, and
it made no difference -- I still get exactly the same null
dereference, and no output from malloc_printf.

Next I'll try backing out your suggestion from comment #44, to see if
that makes any difference.
Comment 52 Steven Michaud [:smichaud] (Retired) 2011-10-21 14:33:41 PDT
> Next I'll try backing out your suggestion from comment #44, to see
> if that makes any difference.

This just restored the original error message from malloc_printf and
the original crash at 0x10f0.
Comment 53 Steven Michaud [:smichaud] (Retired) 2011-10-21 14:48:20 PDT
I've got a few minutes left at the end of the day, that I can't really spend doing anything else, so I'm going to try a few of my own random shots in the dark.

I just tried putting #define NO_TLS at the top of jemalloc.c -- that doesn't help.
Comment 54 Justin Lebar (not reading bugmail) 2011-10-21 14:55:18 PDT
I'm pretty sure mac doesn't build without NO_TLS.

I think we've made some progress here.  I'll try to get things set up so I can debug this locally.
Comment 55 Steven Michaud [:smichaud] (Retired) 2011-10-21 15:31:30 PDT
Now I've found something that *does* help:  The crashes disappear if you never call szone2ozone()!  I tested twice (with my STR from comment #31) just to be sure.

> I'll try to get things set up so I can debug this locally.

That's probably a good idea :-)
Comment 56 Justin Lebar (not reading bugmail) 2011-10-21 15:52:49 PDT
Ah, so AIUI, szone2ozone sets us up so that certain system-level allocations go through jemalloc rather than the default allocator.

It  might not like the zone name...
Comment 57 Steven Michaud [:smichaud] (Retired) 2011-10-21 16:37:35 PDT
> It  might not like the zone name...

I suspect the name isn't the problem ... though I've no idea what the problem actually is.

In any case, though, szone2ozone is a lot smaller target to brute-force than the whole of jemalloc.c.
Comment 58 Steven Michaud [:smichaud] (Retired) 2011-10-24 13:34:33 PDT
(Following up comment #57)

>> It might not like the zone name...
>
> I suspect the name isn't the problem ... though I've no idea what
> the problem actually is.

Actually you were right (and I was wrong).  Making the following
change in szone2ozone is enough to make my crashes go away:

-       default_zone->zone_name = "jemalloc_ozone";
+       if (default_zone->version < LION_MALLOC_ZONE_T_VERSION) {
+               default_zone->zone_name = "jemalloc_ozone";
+       } else {
+               default_zone->zone_name = "DefaultMallocZone";
+       }

I still don't know *why* this change is necessary on Lion.  As best I
can tell, "DefaultMallocZone" has been this zone's name for years.
But maybe it's only with Lion that the OS actually starts checking the
name.

It's also puzzling why we don't *always* crash on Lion, if the default
zone's name is "wrong".

But I think it's worthwhile landing this change on the trunk, to see
what happens.

Once jemalloc is re-enabled on Lion, we should (of course) check that
it really does give us performance gains and/or a reduction in memory
usage.
Comment 59 Steven Michaud [:smichaud] (Retired) 2011-10-24 13:45:36 PDT
And note that the default zone's name is wrong (on Lion) even when we don't call szone2ozone, and we still don't crash.
Comment 60 Steven Michaud [:smichaud] (Retired) 2011-10-24 13:48:00 PDT
So (sigh) I've changed my mind.

I no longer think we should land my change from comment #58 without understanding better what's going on (why that change "works", or appears to "work").
Comment 61 Steven Michaud [:smichaud] (Retired) 2011-10-24 13:57:03 PDT
I'll continue to brute-force szone2ozone for a while longer, to see if I find anything interesting.
Comment 62 Justin Lebar (not reading bugmail) 2011-10-24 14:13:38 PDT
You may want to disassemble the calling stackframes (or use gdb watchpoints) to try to see where default_zone->zone_name is being read.

I've been reading machine code, but I haven't seen it yet...
Comment 63 Justin Lebar (not reading bugmail) 2011-10-24 14:51:22 PDT
So I did

+               default_zone->zone_name = (char*)0xdeadbeef;

and then waited for it to crash.

It crashes in

  strcmp()
  malloc_default_purgable_zone()
  ImageIO_Malloc()

malloc_default_purgable_zone appears to be in a loop looking for something.  Perhaps the string "DefaultMallocZone".  :)
Comment 64 Steven Michaud [:smichaud] (Retired) 2011-10-24 15:04:04 PDT
> malloc_default_purgable_zone appears to be in a loop looking for
> something.  Perhaps the string "DefaultMallocZone".  :)

Nope, it's the string "DefaultPurgeableMallocZone" :-)

See http://opensource.apple.com/source/Libc/Libc-594.1.4/gen/malloc.c.
Which shows that you're debugging in libc code, and that Apple's
libc's source is available at http://opensource.apple.com.

That may make your life easier :-)
Comment 65 Steven Michaud [:smichaud] (Retired) 2011-10-24 15:12:13 PDT
OS X 10.7.2's malloc.c is at:
http://opensource.apple.com/source/Libc/Libc-763.12/gen/malloc.c

You can also download the source.

See http://opensource.apple.com/ and start browsing under "Mac OS X".
Comment 66 Justin Lebar (not reading bugmail) 2011-10-24 15:15:59 PDT
Ah, that's much nicer than reading asm.  There's this, which explains comment 38:

static inline malloc_zone_t *
inline_malloc_default_scalable_zone(void) {
    unsigned index;

    if (malloc_def_zone_state < 2) _malloc_initialize();
    // _malloc_printf(ASL_LEVEL_INFO, "In inline_malloc_default_scalable_zone with %d %d\n", malloc_num_zones, malloc_has_debug_zone);

    MALLOC_LOCK();
    for (index = 0; index < malloc_num_zones; ++index) {
	malloc_zone_t *z = malloc_zones[index];

	if(z->zone_name && strcmp(z->zone_name, "DefaultMallocZone") == 0) {
	    MALLOC_UNLOCK();
	    return z;
	}
    }
    MALLOC_UNLOCK();

    malloc_printf("*** malloc_default_scalable_zone() failed to find 'DefaultMallocZone'\n");
    return NULL; // FIXME: abort() instead?
}
Comment 67 Steven Michaud [:smichaud] (Retired) 2011-10-24 15:19:21 PDT
Another small piece of the puzzle:

My crashes stop happening if you don't change the zone_name in szone2ozone -- if you comment out the following line:

default_zone->zone_name = "jemalloc_ozone";

So the zone name doesn't have to be "DefaultMallocZone" (without the above change it's "jemalloc_zone").
Comment 68 Justin Lebar (not reading bugmail) 2011-10-24 15:24:25 PDT
> So the zone name doesn't have to be "DefaultMallocZone" (without the above change it's 
> "jemalloc_zone").

But in create_zone, don't we need to change the zone name from "jemalloc_zone" to "DefaultMallocZone"?
Comment 69 Steven Michaud [:smichaud] (Retired) 2011-10-24 15:27:30 PDT
> But in create_zone, don't we need to change the zone name from
> "jemalloc_zone" to "DefaultMallocZone"?

What I reported in comment #67 tells me "no".

But I'm not yet sure that's the end of the story ...
Comment 70 Steven Michaud [:smichaud] (Retired) 2011-10-24 15:38:56 PDT
I'm closing in on the real problem, I think:

I can also fix my crashes by adding the following (an extra call to
malloc_zone_register()) just after the single call to szone2ozone():

malloc_zone_register(default_zone);

Apparently Lion is fussy about changes made to the default zone after
it's been "registered".

But I suspect we should only call malloc_zone_register() once, and
that the single call should be the one after szone2ozone().  Just a
sec and I'll try that out.
Comment 71 Justin Lebar (not reading bugmail) 2011-10-24 20:18:47 PDT
Why do we have to register the jemalloc zone to begin with?  Registering the zone doesn't make it the default one, afaict.  It looks like only the szone overlay actually does something.

> But I suspect we should only call malloc_zone_register() once, and
> that the single call should be the one after szone2ozone().

If registering the zone *is* important, then this could be dangerous, because after szone2ozone, the system might be able to start using the jemalloc zone (effectively, not the zone itself, but its functions) before we've registered it. (?)
Comment 72 Steven Michaud [:smichaud] (Retired) 2011-10-24 20:46:27 PDT
(Following up comment #70)

Oops, I got it wrong.  Adding the following after the call to
szone2ozone() doesn't help:

malloc_zone_register(default_zone);

Likewise with making it the only call to malloc_zone_register().

I was confused about default_zone (returned by a call to
malloc_default_zone()) and the "custom zone" created by create_zone().
default_zone is the default malloc zone created (and registered) by
the OS.  It's not a copy of the "custom zone", and doesn't become one.
In fact I don't know what the "custom zone" is for, and why (or even
if) it's needed.

We get the default zone from the OS.  And though we modify it, we
intend the OS to continue using it as if it had been unchanged (we in
fact hook it).  We also never check for any zone by name.

So we have no reason to change the default zone's zone_name in
szone2ozone(), on any version of OS X.  In fact we're lucky we got
away with this on versions of OS X prior to Lion.

So the fix for the crash bug is very simple.  Just get rid of the
following line in szone2ozone():

default_zone->zone_name = "jemalloc_ozone";
Comment 73 Steven Michaud [:smichaud] (Retired) 2011-10-24 20:52:34 PDT
> Why do we have to register the jemalloc zone to begin with?
> Registering the zone doesn't make it the default one, afaict.  It
> looks like only the szone overlay actually does something.

I assume that by "the jemalloc zone" you mean what I've been calling
the "custom zone".  If so, then I agree that we (probably) don't need
to either create it or register it -- as far as I can tell it's not
used for anything.  But I don't believe that creating and registering
the "custom zone" does any harm.

It'd be interesting to know the history of this code, and whether or
not the "custom zone" once served some purpose.	 But I'll save that
for later, because right now I'm too tired to pursue it :-)
Comment 74 Steven Michaud [:smichaud] (Retired) 2011-10-24 20:56:36 PDT
By the way, "registering" a zone doesn't make it the default.
Comment 75 Justin Lebar (not reading bugmail) 2011-10-24 21:00:31 PDT
Created attachment 569277 [details] [diff] [review]
Patch v2

How about we get rid of this zone registration altogether?

This seems to work for me, but knowing you, you'll find some way to crash it.  :)

As far as I can tell, allocations are still going through jemalloc with this patch.  Or at least, the jemalloc heap size isn't significantly smaller with this patch than without.
Comment 76 Steven Michaud [:smichaud] (Retired) 2011-10-24 21:04:25 PDT
Comment on attachment 569277 [details] [diff] [review]
Patch v2

-	jemalloc_zone->zone_name = "jemalloc_zone";
+	jemalloc_zone->zone_name = "DefaultMallocZone";

We don't want this.  We don't want anything besides the real default zone to have that name.

I'll do the rest of my review tomorrow.
Comment 77 Justin Lebar (not reading bugmail) 2011-10-24 21:04:42 PDT
Created attachment 569278 [details] [diff] [review]
Patch v3
Comment 78 Justin Lebar (not reading bugmail) 2011-10-24 21:05:32 PDT
Comment on attachment 569278 [details] [diff] [review]
Patch v3

Really get rid of create_zone().
Comment 79 Justin Lebar (not reading bugmail) 2011-10-24 21:23:31 PDT
> In fact I don't know what the "custom zone" is for, and why (or even
> if) it's needed.

We midaired each other and independently arrived at the same conclusion.  Cool.  :)
Comment 80 Justin Lebar (not reading bugmail) 2011-10-24 21:25:51 PDT
We should check whether we need to somehow lock around the modifications to default_zone.  If we're not the only thread alive at that point, we might be better off reordering the modifications so that free is set before malloc and friends.
Comment 81 Justin Lebar (not reading bugmail) 2011-10-25 07:10:19 PDT
On my 10.7 machine, szone2ozone is called when only one thread is alive.
Comment 82 Steven Michaud [:smichaud] (Retired) 2011-10-25 13:10:21 PDT
Comment on attachment 569278 [details] [diff] [review]
Patch v3

This looks fine to me.  And I no longer crash or hang :-)

Though I do have a couple of nits:

+        /* Don't modify default_zone->zone_name; Mac libc may rely on the name
+         * being uncahnged. */

It's "unchanged".  And you might want to reference this bug (bug
694896).

 /* Likewise for l_zone_introspect. */
 static lion_malloc_introspection l_zone_introspect, l_ozone_introspect;
 static malloc_introspection_t * const zone_introspect =
        (malloc_introspection_t*)(&l_zone_introspect);
 static malloc_introspection_t * const ozone_introspect =
        (malloc_introspection_t*)(&l_ozone_introspect);

We can get rid of l_zone_introspect and zone_introspect.

We can also get	rid of the following, which are	all now	dead code:

zone_size()
zone_free()
zone_realloc()
zone_force_lock()
zone_force_unlock()
zone_free_definite_size()

> On my 10.7 machine, szone2ozone is called when only one thread is
> alive.

On the Mac szone2ozone is ultimately called from the following,	which
is called by the OS when the module containing jemalloc.c is loaded:

__attribute__((constructor))
void
jemalloc_darwin_init(void);

For good measure I double-checked that the structures in
osx_zone_types.h still match their version-specific counterparts in
malloc.h on SnowLeopard and Lion -- they do.

I also checked that our use of the malloc_zone_t pointers in
szone2ozone() is sane -- it is.  Specifically, we don't NULL out any
pointers that aren't documented to sometimes be NULL, or are in
malloc_introspection_t (which it seems can all safely be NULL).

I also found when create_zone(), jemalloc_zone and friends were
"orphaned".  More on that in my next comment.
Comment 83 Justin Lebar (not reading bugmail) 2011-10-25 13:26:41 PDT
Created attachment 569477 [details] [diff] [review]
Patch v4
Comment 84 Steven Michaud [:smichaud] (Retired) 2011-10-25 13:32:05 PDT
Paul, I've got a question for you concerning your "part 1" patch for
bug 414946 (http://hg.mozilla.org/mozilla-central/rev/2b2f584dc5fd).

As best we can tell, as of that patch the create_zone() function no
longer does anything useful, and can safely be removed.  But your
patch didn't get rid of it.  Did you have a reason for that?

Thanks in advance!
Comment 85 Paul Biggar 2011-10-25 14:16:55 PDT
The history of the jemalloc_zone is that the original implementation by jasone added a zone to the linked list of zones. After that, jasone started overriding the default zone, which actually worked and stuck.

Looking at the code now, it looks like the jemalloc_zone doesn't do anything. I really don't remember why this is here - apart from the fact that it was in the original (it's still upstream by the look of it). Certainly, if it works to remove it, remove it.
Comment 86 Kyle Huey [:khuey] (Exited; not receiving bugmail, email if necessary) 2011-10-27 05:48:26 PDT
Not going to get to this review until next week.
Comment 87 Kyle Huey [:khuey] (Exited; not receiving bugmail, email if necessary) 2011-10-31 06:00:39 PDT
Comment on attachment 569477 [details] [diff] [review]
Patch v4

Review of attachment 569477 [details] [diff] [review]:
-----------------------------------------------------------------

::: memory/jemalloc/jemalloc.c
@@ +178,5 @@
> + * MALLOC_PAGEFILE causes all mmap()ed memory to be backed by temporary
> + * files, so that if a chunk is mapped, it is guaranteed to be swappable.
> + * This avoids asynchronous OOM failures that are due to VM over-commit.
> + */
> +/* #define MALLOC_PAGEFILE */

Why are you making this change?

@@ +2452,5 @@
>  	return (false);
>  }
>  #endif
>  
> +#if defined(MOZ_MEMORY_WINDOWS) || defined(JEMALLOC_USES_MAP_ALIGN) || defined(MALLOC_PAGEFILE)

Especially since you then add a conditional for it here.
Comment 88 Justin Lebar (not reading bugmail) 2011-10-31 07:12:06 PDT
>> +/* #define MALLOC_PAGEFILE */
>
> Why are you making this change?

AFAICT, Linux defines MALLOC_PAGEFILE but doesn't actually enable it.  I want Linux to use the (presumably faster) chunk map routines.
Comment 89 Justin Lebar (not reading bugmail) 2011-11-01 15:10:48 PDT
Kyle, are you going to have a chance to look at this again before we branch?  It would be cool to get this in this week.
Comment 90 Kyle Huey [:khuey] (Exited; not receiving bugmail, email if necessary) 2011-11-01 15:11:55 PDT
Yeah, I should be able to review it tomorrow.
Comment 91 Kyle Huey [:khuey] (Exited; not receiving bugmail, email if necessary) 2011-11-02 08:42:53 PDT
Comment on attachment 569477 [details] [diff] [review]
Patch v4

Review of attachment 569477 [details] [diff] [review]:
-----------------------------------------------------------------

Ok, this wasn't as scary as I expected.

The code looks fine.  Whether or not it manages to not explode on some odd configuration, I can't really predict.  Let's try it and see!

::: memory/jemalloc/jemalloc.c
@@ +2452,5 @@
>  	return (false);
>  }
>  #endif
>  
> +#if defined(MOZ_MEMORY_WINDOWS) || defined(JEMALLOC_USES_MAP_ALIGN) || defined(MALLOC_PAGEFILE)

Ok, now that I understand what's going on the conditional for the pagefile is ok.

@@ +5963,5 @@
>      default_zone = malloc_default_zone();
>  
>      /*
> +     * We only use jemalloc on Mac 10.6 and 10.7.  On Mac 10.5, our tests show
> +     * a memory regression, but this may not be real.  See Mozilla bug 694335.

Yeah, that comment is wrong now ;-)

@@ +6662,5 @@
>  	 * Actually create an object of the appropriate size, then find out
>  	 * how large it could have been without moving up to the next size
>  	 * class.
> +	 *
> +	 * I sure hope this doesn't get called often.

lol
Comment 92 Justin Lebar (not reading bugmail) 2011-11-02 09:10:02 PDT
https://hg.mozilla.org/integration/mozilla-inbound/rev/62ba86c7c50c

Fingers crossed.
Comment 93 Justin Lebar (not reading bugmail) 2011-11-02 22:29:14 PDT
This is a big change we should track for FF10.  We'll need to keep an eye that this doesn't cause hangs or crashes for users on 10.7.
Comment 94 Marco Bonardo [::mak] 2011-11-03 08:31:29 PDT
https://hg.mozilla.org/mozilla-central/rev/62ba86c7c50c
Comment 95 Steven Michaud [:smichaud] (Retired) 2011-11-03 09:38:20 PDT
Created attachment 571676 [details]
Gdb stack trace of crash in current code

We're not out of the woods yet.  I just got this crash starting up a custom build made from current mozilla-central code.

The patch that got landed should probably be backed out again.
Comment 96 Steven Michaud [:smichaud] (Retired) 2011-11-03 09:44:13 PDT
Or at least jemalloc should be turned off for OS X 10.7 -- which is where I saw the crash.
Comment 97 Steven Michaud [:smichaud] (Retired) 2011-11-03 09:49:19 PDT
> https://hg.mozilla.org/integration/mozilla-inbound/rev/62ba86c7c50c
> https://hg.mozilla.org/mozilla-central/rev/62ba86c7c50c

The wrong patch got landed!  This isn't "Patch v4" (attachment 569477 [details] [diff] [review]).
Comment 98 Justin Lebar (not reading bugmail) 2011-11-03 09:53:13 PDT
Well, that's bad!
Comment 99 Justin Lebar (not reading bugmail) 2011-11-03 10:06:07 PDT
Backed out and re-pushed to inbound.

https://hg.mozilla.org/integration/mozilla-inbound/rev/d7ccee4bd7f7
https://hg.mozilla.org/integration/mozilla-inbound/rev/242bbbe83dda

Thanks for catching this, Steven!  I'm not sure how I screwed this up...
Comment 100 Steven Michaud [:smichaud] (Retired) 2011-11-03 10:11:22 PDT
No problem.

If I hadn't caught it, someone else would have very soon :-)
Comment 101 Steven Michaud [:smichaud] (Retired) 2011-11-03 10:12:39 PDT
By the way, you should probably also at least back the patch out from mozilla-central -- otherwise the shit will hit the fan in tomorrow's mozilla-central nightly.
Comment 102 Steven Michaud [:smichaud] (Retired) 2011-11-03 10:30:33 PDT
> https://hg.mozilla.org/integration/mozilla-inbound/rev/242bbbe83dda

This is still wrong, and won't finish building.

A quick glance turned up that it still calls create_zone().  There may also be other problems.
Comment 103 Ed Morley [:emorley] 2011-11-03 10:47:06 PDT
The comment 99 relanding backed out from inbound due to failure to build on OS X:
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=242bbbe83dda
{
Undefined symbols:
  "_create_zone", referenced from:
      _jemalloc_darwin_init in jemalloc.o
ld: symbol(s) not found
collect2: ld returned 1 exit status
make[7]: *** [libmozutils.dylib] Error 1
}

https://hg.mozilla.org/integration/mozilla-inbound/rev/546c39b13be7
Comment 104 Justin Lebar (not reading bugmail) 2011-11-03 11:40:35 PDT
Okay, gonna do this right.  Compiling on my 10.7 machine now...
Comment 105 Justin Lebar (not reading bugmail) 2011-11-03 12:04:26 PDT
Okay, I see what's going on.  Problem lies between keyboard and chair.  I missed a hunk in my .rej file.  :-/

I have a patch tested on my 10.7 machine.  This one should work.
Comment 106 Justin Lebar (not reading bugmail) 2011-11-03 12:09:07 PDT
> By the way, you should probably also at least back the patch out from mozilla-central -- otherwise 
> the shit will hit the fan in tomorrow's mozilla-central nightly.

I presume we're going to merge m-i again today before the nightly?
Comment 107 Steven Michaud [:smichaud] (Retired) 2011-11-03 12:16:33 PDT
> I presume we're going to merge m-i again today before the nightly?

I don't think so.  Lately it's been happening just once a day, 5-7AM PST.  The nightlies normally get built starting around 3-4AM.
Comment 108 Justin Lebar (not reading bugmail) 2011-11-03 12:30:03 PDT
<jlebar> Are we going to merge m-i to m-c again before the next nightly?
<edmorley> jlebar: yeah I imagine so, since that's ~13-14 hours out and a fair bit has landed on inbound already :-)
<edmorley> jlebar: particular reason?
<jlebar> edmorley, If we don't merge, we need to back out the broken jemalloc 10.7 patch I landed, which was merged earlier.
<edmorley> jlebar: ah
<edmorley> jlebar: ok so at the very least I'll merge from https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=546c39b13be7 to m-c if not more
Comment 109 Ed Morley [:emorley] 2011-11-03 13:03:07 PDT
First backout: https://hg.mozilla.org/mozilla-central/rev/d7ccee4bd7f7
Relanding: https://hg.mozilla.org/mozilla-central/rev/242bbbe83dda
Second backout: https://hg.mozilla.org/mozilla-central/rev/242bbbe83dda

The 3rd (and hopefully final) landing is on just inbound at present, and will come across on the next merge (presuming it's green hehe).
Comment 110 Ed Morley [:emorley] 2011-11-03 13:04:01 PDT
Sorry for the bugspam, the third URL in comment 109 should have been https://hg.mozilla.org/mozilla-central/rev/546c39b13be7
Comment 111 Josh Aas 2011-11-03 18:52:25 PDT
I just built with current trunk (rev f8d66a792ddc) and running mochitests locally on Mac OS X 10.7 doesn't work any more. Lots of malloc errors to the console and the browser doesn't start.
Comment 112 Ed Morley [:emorley] 2011-11-04 03:16:17 PDT
https://hg.mozilla.org/mozilla-central/rev/57d804eae9a0

Fingers crossed! :-)
Comment 113 Justin Lebar (not reading bugmail) 2011-11-04 07:46:27 PDT
(In reply to Josh Aas (Mozilla Corporation) from comment #111)
> I just built with current trunk (rev f8d66a792ddc) and running mochitests
> locally on Mac OS X 10.7 doesn't work any more. Lots of malloc errors to the
> console and the browser doesn't start.

That's worrying and weird.  f8d66a792ddc didn't have jemalloc turned on for 10.7.  Also, mochitest wfm with that rev on my 10.7 box.  Can you try tip again?
Comment 114 Josh Aas 2011-11-04 08:35:48 PDT
(In reply to Justin Lebar [:jlebar] from comment #113)
> (In reply to Josh Aas (Mozilla Corporation) from comment #111)
> > I just built with current trunk (rev f8d66a792ddc) and running mochitests
> > locally on Mac OS X 10.7 doesn't work any more. Lots of malloc errors to the
> > console and the browser doesn't start.
> 
> That's worrying and weird.  f8d66a792ddc didn't have jemalloc turned on for
> 10.7.  Also, mochitest wfm with that rev on my 10.7 box.  Can you try tip
> again?

Updating to tip fixed the problem, mochitests work again.
Comment 115 Anthony Hughes (:ashughes) [GFX][QA][Mentor] 2011-12-28 14:31:27 PST
Is there a test case to verify this fix?
Comment 116 Justin Lebar (not reading bugmail) 2012-01-03 07:21:18 PST
(In reply to Anthony Hughes, Mozilla QA (irc: ashughes) from comment #115)
> Is there a test case to verify this fix?

Unfortunately not.
Comment 117 Justin Lebar (not reading bugmail) 2012-01-03 07:39:17 PST
You could verify that jemalloc is enabled on 10.7 by putting a breakpoint or malloc_printf() in jemalloc.c's malloc() definition.
Comment 118 Anthony Hughes (:ashughes) [GFX][QA][Mentor] 2012-01-03 08:13:25 PST
I don't think this is something which QA will be able to verify easily (in the face of prioritizing resources). Marking qa- in the hopes that someone else will be able to verify the fix.

Note You need to log in before you can comment on or make changes to this bug.