Bug 702250 - Disable jemalloc on mac 10.5 due to crash in ozone_size
Status: RESOLVED FIXED
Whiteboard: [qa-]
Keywords: crash
Product: Core
Classification: Components
Component: Memory Allocator
Version: Trunk
Platform: x86 Mac OS X
Importance: -- critical
Target Milestone: mozilla11
Assigned To: Justin Lebar (not reading bugmail)
URL: http://www.alpine-electronics.co.uk/f...
Depends on: 738176
Blocks: 694335
Reported: 2011-11-14 05:49 PST by Logan Rosen [:Logan]
Modified: 2012-03-22 14:48 PDT (History)

Attachments
Patch v1 (1.65 KB, patch)
2011-11-16 13:57 PST, Justin Lebar (not reading bugmail)
justin.lebar+bug: review-
Patch v2 (disable properly, in configure.in) (2.50 KB, patch)
2011-11-18 08:05 PST, Justin Lebar (not reading bugmail)
khuey: review+
asa: approval‑mozilla‑aurora+

Description Logan Rosen [:Logan] 2011-11-14 05:49:34 PST
This bug was filed from the Socorro interface and is 
report bp-ab845691-eeb9-4f5a-bab9-a10112111114 .
============================================================= 
I am receiving this crash signature every time I drag a high-resolution picture from Firefox to another application in Mac OS X 10.5.8, using Nightly 20111113.

Steps to reproduce:
1. Find a high-resolution photo, like <http://images.kaneva.com/filestore9/5119771/6318375/speakerUgrilleUtexture.jpg>.
2. Drag it from Firefox to another application on the dock.
3. Firefox crashes.
Comment 1 Matthias Versen [:Matti] 2011-11-14 07:47:23 PST
Looks like a jemalloc bug (a regression from bug 414946) or, less likely, an Apple system bug.
Comment 2 Sheila Mooney 2011-11-14 10:39:59 PST
Logan, can you find a regression window for this? When did you start seeing it on the trunk? Any chance you can try to reproduce this with builds from Oct 5th and Oct 6th? This would help us understand whether it's a regression from the jemalloc checkin.
Comment 3 Justin Lebar (not reading bugmail) 2011-11-14 10:41:07 PST
This is very likely a regression from enabling jemalloc on 10.5, since before that, ozone_size didn't exist.
Comment 4 Steven Michaud [:smichaud] (Retired) 2011-11-14 11:02:54 PST
For what it's worth, I can't reproduce this.

I tested with today's mozilla-central nightly on OS X 10.5.8.  I tried dragging the URL from comment #0 to Safari in the Dock.  I also tried loading the URL from comment #0, then dragging the graphic to Safari in the Dock.  In neither case did I have any trouble.  Safari ran and loaded the graphic, and FF didn't crash.

So if anyone's going to find a regression range for this bug, it'll probably need to be you, Logan.
Comment 5 Steven Michaud [:smichaud] (Retired) 2011-11-14 11:29:41 PST
https://crash-stats.mozilla.com/ does provide some data:

https://crash-stats.mozilla.com/report/list?product=Firefox&query_search=signature&query_type=contains&query=ozone_size&reason_type=contains&date=11%2F14%2F2011%2011%3A12%3A02&range_value=4&range_unit=weeks&hang_type=any&process_type=any&do_query=1&signature=ozone_size

There were a (relatively) small number of these crashes over the last four weeks, almost all of them on OS X 10.5.8.

The oldest build with one of these crashes in that data is the 2011-11-02 mozilla-central nightly.  The patch that turned on jemalloc for OS X 10.5 (the patch for bug 694335) landed on mozilla-central on 2011-11-01 -- which makes the 2011-11-02 nightly the first one that contained it.
Comment 6 Steven Michaud [:smichaud] (Retired) 2011-11-14 11:30:34 PST
If I could reproduce this bug I'd take it.  But I can't.
Comment 7 Steven Michaud [:smichaud] (Retired) 2011-11-14 11:33:33 PST
We might just want to turn off jemalloc on OS X 10.5.

But the number of these crashes is (still) quite small.  And this is the only jemalloc bug (10.5-specific or not) that I'm aware of.

Justin, what do you think about turning off jemalloc on OS X 10.5?
Comment 8 Justin Lebar (not reading bugmail) 2011-11-14 11:37:06 PST
> Justin, what do you think about turning off jemalloc on OS X 10.5?

If we can't figure this bug out, then we may have to, but I don't think we've reached that point yet.  The memory benefits from jemalloc are particularly meaningful on old hardware, which is exactly what's running 10.5, so it'd be a bummer if we had to turn it off for an issue we don't understand.

Can we ask QA to try to help us reproduce this issue?
Comment 9 Steven Michaud [:smichaud] (Retired) 2011-11-14 11:47:15 PST
Another thing to consider:

Though the number of these crashes is still quite small, so is the number of people using anything on the 10 or 11 branches.  So this bug's crash is actually the #6 topcrasher on those branches.

Unless we can fix this bug soon (which will require finding out how to reproduce it), we will seriously need to consider disabling jemalloc on OS X 10.5.

https://crash-stats.mozilla.com/query/query?product=Firefox&version=Firefox%3A11.0a1&version=Firefox%3A10.0a2&version=Firefox%3A10.0a1&platform=mac&range_value=1&range_unit=weeks&date=11%2F14%2F2011+11%3A42%3A29&query_search=signature&query_type=contains&query=&reason=&build_id=&process_type=any&hang_type=any&do_query=1
Comment 10 Steven Michaud [:smichaud] (Retired) 2011-11-14 11:49:27 PST
> Can we ask QA to try to help us reproduce this issue?

Marcia's good at that sort of thing ... though I'm not sure she'd want me to say so :-)

For that matter so am I.

But we both have *lots* of other things to deal with.

For the time being, let's wait and see if someone can provide us better STR for these crashes.
Comment 11 Logan Rosen [:Logan] 2011-11-14 11:51:47 PST
Thanks for the quick responses, everyone. Unfortunately, the computer running 10.5.8 that I have access to is not at home, so I'll have to try reproducing the bugs again in the latest build tomorrow, in different situations to make sure that there isn't another variable that I'm not mentioning.
Comment 12 Steven Michaud [:smichaud] (Retired) 2011-11-14 12:07:30 PST
Thanks, Logan.  If you can find steps to reproduce that also work for me, that would be extremely helpful!
Comment 13 Steven Michaud [:smichaud] (Retired) 2011-11-14 12:08:54 PST
> If you can find steps to reproduce that also work for me, that would
> be extremely helpful!

For example, that I first have to install such-and-such an extension.
Comment 14 Logan Rosen [:Logan] 2011-11-15 05:36:08 PST
Okay - I made a dumb mistake in my original report, as I didn't test the example image for the reproduction steps. This one <http://images.kaneva.com/filestore9/5119771/6318375/speakerUgrilleUtexture.jpg> crashes the latest build of Nightly (20111114) every time I drag it to another application on the dock.
Comment 15 Logan Rosen [:Logan] 2011-11-15 05:39:33 PST
Ugh, I pasted the wrong link again. <http://www.alpine-electronics.co.uk/fileadmin/images/MainNavigation/Products/Product_pics/14_Speakers/07_Center_Speaker/DLB100R/productpic_DLB100R_01.jpg> <-- this one crashes every time. Ignore the one in Comment 14.
Comment 16 Justin Lebar (not reading bugmail) 2011-11-15 07:35:55 PST
Boom.  I can reproduce this!
Comment 17 Justin Lebar (not reading bugmail) 2011-11-15 07:58:34 PST
In a debug build, we fail this assertion:

http://hg.mozilla.org/mozilla-central/annotate/3f0b94325b80/memory/jemalloc/jemalloc.c#l4335

...that looks pretty bad.
Comment 18 Steven Michaud [:smichaud] (Retired) 2011-11-15 08:04:42 PST
> Boom.  I can reproduce this!

I still can't :-(

But maybe I'm doing it wrong.  Someone needs to give me precise,
detailed steps to reproduce.

Here are the various things I've been doing.  None of them crash for
me on OS X 10.5.8 with yesterday's mozilla-central nightly.

1) Drag the URL from comment #14 to Safari in the Dock.

2) Drag the URL from comment #14 to the Desktop.

3) Visit the URL from comment #14, then drag the image to Safari in
   the Dock.

4) Visit the URL from comment #14, then drag the image to the Desktop.

Aside from getting the steps exactly right, I may also need to have
one or more extensions installed.  I currently have none.
Comment 19 Steven Michaud [:smichaud] (Retired) 2011-11-15 08:06:30 PST
> ...that looks pretty bad.

Yup.  Looks like memory corruption.

Since you're able to reproduce this, are you willing to take it?

And just to be sure, you're seeing crashes on OS X 10.5.8?
Comment 20 Kyle Huey [:khuey] (Exited; not receiving bugmail, email if necessary) 2011-11-15 08:07:09 PST
If it's only occurring with drag and drop, is it possible we have an allocator mismatch somewhere?
Comment 21 Logan Rosen [:Logan] 2011-11-15 08:08:01 PST
(In reply to Steven Michaud from comment #18)
> > Boom.  I can reproduce this!
> 
> I still can't :-(
> 
> But maybe I'm doing it wrong.  Someone needs to give me precise,
> detailed steps to reproduce.
> 
> Here are the various things I've been doing.  None of them crash for
> me on OS X 10.5.8 with yesterday's mozilla-central nightly.
> 
> 1) Drag the URL from comment #14 to Safari in the Dock.
> 
> 2) Drag the URL from comment #14 to the Desktop.
> 
> 3) Visit the URL from comment #14, then drag the image to Safari in
>    the Dock.
> 
> 4) Visit the URL from comment #14, then drag the image to the Desktop.
> 
> Aside from getting the steps exactly right, I may also need to have
> one or more extensions installed.  I currently have none.

I had a pastefail in Comment 14 - you should use the image linked to in Comment 15 in order to reproduce the crash. Sorry. :P
Comment 22 Steven Michaud [:smichaud] (Retired) 2011-11-15 08:12:02 PST
> from comment #14

My comment was wrong.  I was actually using the URL from comment #15.  Please answer my other questions!
Comment 23 Logan Rosen [:Logan] 2011-11-15 08:14:30 PST
(In reply to Steven Michaud from comment #22)
> > from comment #14
> 
> My comment was wrong.  I was actually using the URL from comment #15. 
> Please answer my other questions!

Honestly, step 3 with the URL in Comment 15 should be crashing it for you in 10.5.8. I'm not sure why it is not happening for you.
Comment 24 Justin Lebar (not reading bugmail) 2011-11-15 08:14:42 PST
I thought we were failing the assertion, but actually, we're segfaulting inside the assertion.  On my debug build, we segfault trying to access address 0xa5a5a5a5.

I wonder if the system allocator is using that junk mask or if it's coming from jemalloc...
Comment 25 Justin Lebar (not reading bugmail) 2011-11-15 08:17:10 PST
Steven, the testcase I've successfully used is to drag the image from the URL in comment #15 onto the Finder in the Dock.  I'm on 10.5.8.

I can send you credentials to the machine I've been using over IRC, if you ping me.
Comment 26 Steven Michaud [:smichaud] (Retired) 2011-11-15 09:23:53 PST
Ok, I've now been playing around on Justin's machine (connected
remotely via VNC).  On that machine the following STR works just fine
(I always crash):

Visit the URL from comment #15, then drag the image to the Desktop.

FF on Justin's machine has no extensions, and loads nothing but
standard plugins.

At first I thought it was easier to crash because of how slow
drag-and-drop is over VNC.  And that may yet be the case (neither of
us has yet been able to test on that machine except over a VNC
connection).

But now I suspect the crucial difference is that my machine has 4GB of
RAM, but Justin's has only 1GB.

Logan, how much RAM does your machine have?
Comment 27 Justin Lebar (not reading bugmail) 2011-11-15 10:05:35 PST
It doesn't crash for me when I set the resolution to 1024x768.

It does crash at 1280x1024 and 1680x1050.  It crashes with both available color depths ("thousands" and "millions").
Comment 28 Steven Michaud [:smichaud] (Retired) 2011-11-15 10:12:45 PST
My 10.5.8 machine is already at the maximum screen resolution and color depth (and doesn't crash).  I'm working on writing an eatram utility that would use low-level APIs to allocate up to a certain amount of unswappable RAM -- to see if it makes a difference to lower the effectively available amount of RAM on my machine.
Comment 29 Logan Rosen [:Logan] 2011-11-15 10:45:08 PST
My 10.5.8 machine has 4 GB of RAM. I'd have to check on the resolution (I think it's 1680 x 1050).
Comment 30 Justin Lebar (not reading bugmail) 2011-11-15 13:59:40 PST
Marcia, this crash signature is seen only on 10.5, correct?

Here's what I think is happening.  The short story is: It appears to be an OSX bug.

We malloc space for something bigger than 1MB (perhaps the drag/drop image).  jemalloc handles this by rounding the size up to the next MB and placing the allocation on a 1MB boundary.

Later, the OS frees the image in [NSBitmapImageRep _freeData].  But it doesn't pass us back the pointer we gave it!  It passes back a pointer 0x20 bytes past what we gave.

Now we're hosed.  isalloc_validate looks at this pointer and first checks whether it's inside a chunk jemalloc owns.  It is.  Then isalloc_validate assumes that the corresponding chunk holds a bunch of small allocations, rather than one big one, because the pointer it was given isn't 1MB-aligned.  So isalloc_validate accesses one of this chunk's fields, and we crash.

But even if we didn't have to go through isalloc_validate, we'd still be hosed when we actually tried to free that pointer.


I've been trying to confirm this interpretation by dumping out the addresses jemalloc gives out, but malloc_printf appears to be deadlocking.  I'll report back once I know for sure.  But it looks like we'll have to disable jemalloc on 10.5 to get around this issue.
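
To make the failure mode in this comment concrete, here is a minimal C sketch of the pointer classification it describes. This is not the jemalloc source: the 1MB chunk size comes from the comment, while the names `sketch_chunk_base` and `sketch_looks_huge` are hypothetical, and the real isalloc_validate does more than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed chunk size of 1MB, as described in this bug. */
#define SKETCH_CHUNK_SIZE ((uintptr_t)1 << 20)
#define SKETCH_CHUNK_MASK (SKETCH_CHUNK_SIZE - 1)

/* Round an address down to the start of the 1MB-aligned chunk containing it. */
static uintptr_t sketch_chunk_base(uintptr_t addr)
{
    return addr & ~SKETCH_CHUNK_MASK;
}

/*
 * A huge allocation is handed out on a chunk boundary, so only a
 * chunk-aligned pointer is classified as huge. Anything else is assumed
 * to be a small allocation living inside an arena chunk.
 */
static bool sketch_looks_huge(uintptr_t addr)
{
    return sketch_chunk_base(addr) == addr;
}
```

With a huge allocation at 0xd400000 (the address reported later in comment 36), `sketch_looks_huge(0xd400000)` is true, but the pointer 0x20 bytes in, 0xd400020, fails the alignment test and would be treated as a small allocation inside an arena chunk.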
Comment 31 Marcia Knous [:marcia - use ni] 2011-11-15 14:08:56 PST
https://crash-stats.mozilla.com/report/index/6c2e594f-82f0-4925-b9d7-13c6e2111110 confirms there is one crash on the trunk on 10.6.8. The query also shows SeaMonkey crashes on 10.7.2, but the rest of the Firefox crashes are on 10.5.

https://crash-stats.mozilla.com/report/list?signature=ozone_size
Comment 32 Steven Michaud [:smichaud] (Retired) 2011-11-15 14:13:12 PST
The SeaMonkey crashes on 10.7.2 are bad news -- there are *lots* of them.
Comment 33 Steven Michaud [:smichaud] (Retired) 2011-11-15 14:19:03 PST
(Following up comment #32)

And they're all in one nightly (dated 2011-11-13), and they're all crashes on startup.
Comment 34 Steven Michaud [:smichaud] (Retired) 2011-11-15 14:26:18 PST
(Following up comment #33)

However, most of the crashes are reported as happening in the span of a couple minutes between 13:58 and 14:00 on 2011-11-13.  So this may be an error in Socorro.
Comment 35 Justin Lebar (not reading bugmail) 2011-11-15 22:53:17 PST
We need to keep a close eye on this.  If it's actually happening regularly outside 10.5, then my hypothesis from comment 30 may be wrong, and we may need to press the panic button.
Comment 36 Justin Lebar (not reading bugmail) 2011-11-16 12:32:07 PST
Finally confirmed that, indeed, we allocated a chunk at one address and then we're being asked to get the size of an address *inside that chunk*.

* chunk_alloc_mmap 0xd400000, size 2MB.  (Because it's a 2MB chunk, we know it's in response to a huge malloc.  If it were a chunk holding arenas, it would have size 1MB.)

* isalloc_validate(0xd400020)

Bizarre.

I'm not sure if we can safely work around this without disabling jemalloc on 10.5.  I have one idea: It appears that macos is calling ozone_size not because it cares about the actual size of the allocation, but because it's trying to figure out whether the allocation lives in this zone.

We can safely report yes or no to this question even if the pointer given isn't to the beginning of the chunk.  (It wouldn't be correct in all cases -- if you gave a pointer 1MB past the end of the chunk in question, we'd say that it didn't live in jemalloc's memory, even though it does in this case, since the chunk has size 2MB.)

So I'm going to try hacking ozone_size to return 0 or 1.  But I doubt this is actually going to solve our problem, because I suspect that macos is then going to try to free() the pointer at 0xd400020, which is totally not allowed.
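
A rough sketch of the yes/no ownership idea, assuming (as the parenthetical above suggests) that only the 1MB-aligned starting address of each chunk is recorded. The registry array and the names `zone_register_chunk` and `zone_owns` are hypothetical stand-ins for illustration, not jemalloc's actual bookkeeping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ZONE_CHUNK_SIZE ((uintptr_t)1 << 20) /* assumed 1MB granularity */
#define ZONE_MAX_CHUNKS 16                   /* arbitrary limit for this sketch */

static uintptr_t zone_bases[ZONE_MAX_CHUNKS];
static size_t zone_nbases;

/* Record only the starting address of a chunk, whatever its size. */
static bool zone_register_chunk(uintptr_t base)
{
    if (zone_nbases == ZONE_MAX_CHUNKS)
        return false;
    zone_bases[zone_nbases++] = base;
    return true;
}

/* Answer "does this pointer live in our zone?" with 1 or 0, never a real size. */
static size_t zone_owns(uintptr_t addr)
{
    uintptr_t base = addr & ~(ZONE_CHUNK_SIZE - 1);
    for (size_t i = 0; i < zone_nbases; i++) {
        if (zone_bases[i] == base)
            return 1;
    }
    return 0;
}
```

With the 2MB chunk at 0xd400000 registered, `zone_owns(0xd400020)` returns 1, but an address in the second megabyte of that chunk rounds down to 0xd500000, which was never recorded, so it returns 0. That appears to be the incorrect case the comment warns about.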
Comment 37 Steven Michaud [:smichaud] (Retired) 2011-11-16 12:38:25 PST
Awkward question:

Are you sure something like this doesn't also happen on other versions of OS X?  In such a way that it doesn't cause crashes, or causes many fewer crashes?
Comment 38 Justin Lebar (not reading bugmail) 2011-11-16 12:41:28 PST
My debug build doesn't blow up on the assertion when I drag and drop on 10.7, so...I think so?  It should crash just as hard in a debug build because we're filling the chunk with 0xa5a5a5a5.
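
For illustration, a small sketch of why junk filling turns this bug into an immediate crash: once a region is filled with the byte 0xa5, any 32-bit field read out of it comes back as 0xa5a5a5a5, the address seen in comment 24. The helper names are hypothetical; only the fill byte is taken from this bug.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define JUNK_BYTE 0xa5 /* junk fill byte mentioned in this bug */

/* Fill a freshly allocated region with the junk byte. */
static void junk_fill(void *region, size_t len)
{
    memset(region, JUNK_BYTE, len);
}

/*
 * Read a 32-bit field at a byte offset, the way code would if it
 * mistook the region for a header containing a pointer.
 */
static uint32_t junk_read_u32(const void *region, size_t offset)
{
    uint32_t value;
    memcpy(&value, (const unsigned char *)region + offset, sizeof value);
    return value;
}
```

On a 32-bit build, dereferencing a "pointer" obtained this way means touching address 0xa5a5a5a5, which is almost certainly unmapped.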
Comment 39 Steven Michaud [:smichaud] (Retired) 2011-11-16 12:45:06 PST
By the way, I spent a few hours yesterday writing a nice little app that allocates large chunks of RAM and then locks them in place by calling mlock() -- which I hope means that they can't be swapped out.  It (apparently) manages to eat about 3GB of my 4GB.  But even with all that RAM "gone", I still can't reproduce these crashes.
Comment 40 Steven Michaud [:smichaud] (Retired) 2011-11-16 12:48:57 PST
(In reply to comment #38)

> It should crash just as hard in a debug build because we're filling
> the chunk with 0xa5a5a5a5.

Point well taken.

But could we turn on this filling-with-0xa5a5a5a5 in mozilla-central
nightlies, just to see what happens?
Comment 41 Justin Lebar (not reading bugmail) 2011-11-16 13:03:44 PST
> So I'm going to try hacking ozone_size to return 0 or 1.

This doesn't work; we crash in startup.  It looks like macos (not unreasonably) expects the returned size to be correct.

> But could we turn on this filling-with-0xa5a5a5a5 in mozilla-central
> nightlies, just to see what happens?

I guess...  It would likely be a big perf regression, so it'd be a pain for anyone landing perf-sensitive code.  There's a sanity check in isalloc_validate that we could definitely turn on for release builds:

  assert(chunk->arena->magic == ARENA_MAGIC)

That might be enough to catch problems.
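
One way such a check could stay active in release builds is sketched below: unlike assert(), it is not compiled out when NDEBUG is defined. The struct layouts, the `check_` names, and the magic constant are placeholders for illustration, not the definitions in jemalloc.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define CHECK_ARENA_MAGIC 0x947d3d24u /* placeholder value for this sketch */

/* Simplified stand-ins for the arena and chunk headers. */
typedef struct { uint32_t magic; } check_arena_t;
typedef struct { check_arena_t *arena; } check_chunk_t;

/* True if the chunk header points at an arena carrying the expected magic. */
static bool check_chunk_magic_ok(const check_chunk_t *chunk)
{
    return chunk != NULL && chunk->arena != NULL &&
           chunk->arena->magic == CHECK_ARENA_MAGIC;
}

/* Abort on mismatch, regardless of whether NDEBUG is defined. */
static void check_release_assert_magic(const check_chunk_t *chunk)
{
    if (!check_chunk_magic_ok(chunk)) {
        fprintf(stderr, "arena magic mismatch\n");
        abort();
    }
}
```

Note that the check still dereferences `chunk->arena`; if that field itself holds garbage, as in the 0xa5a5a5a5 case from comment 24, the process can segfault before the comparison is made.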

> https://crash-stats.mozilla.com/report/index/6c2e594f-82f0-4925-b9d7-13c6e2111110 confirms there 
> is one crash on the trunk on 10.6.8.

This appears to be a crash in plugin-container, so...who knows what crazy stuff flash is doing.

Anyway, I'll spin a patch for 10.5 and file a separate bug for testing in the nightlies.
Comment 42 Justin Lebar (not reading bugmail) 2011-11-16 13:57:16 PST
Created attachment 574992
Patch v1
Comment 43 Justin Lebar (not reading bugmail) 2011-11-16 14:02:03 PST
Kyle, note that this isn't a full backout of bug 694335; that bug also enabled jemalloc on 32-bit 10.6, and I'm leaving that in here.
Comment 44 Justin Lebar (not reading bugmail) 2011-11-16 14:24:15 PST
Filed bug 703087 for making the assertion fire on release builds, so we can get an idea if this is a problem on other versions.
Comment 45 Justin Lebar (not reading bugmail) 2011-11-17 15:15:10 PST
Inbound: https://hg.mozilla.org/integration/mozilla-inbound/rev/a3fcfb7d6647

It's a real shame we had to do this.  Let's just hope jemalloc sticks on 10.6 and 10.7!
Comment 46 Phil Ringnalda (:philor) 2011-11-17 19:46:35 PST
Backed out in https://hg.mozilla.org/integration/mozilla-inbound/rev/ad76b1669e56 - dunno whether it requires a clobber (seems quite unlikely) or something else, but the 10.5 opt tests were a solid bar of orange, whining about "Non-aligned pointer being freed" and the like until an oom blew up httpd.js startup.
Comment 47 Justin Lebar (not reading bugmail) 2011-11-17 21:24:01 PST
Oh, right.  Dynamically turning off jemalloc doesn't work on 32-bit.

Oops.
Comment 48 Justin Lebar (not reading bugmail) 2011-11-18 08:05:42 PST
Created attachment 575464
Patch v2 (disable properly, in configure.in)
Comment 49 Ed Morley [:emorley] 2011-11-21 19:12:20 PST
https://hg.mozilla.org/mozilla-central/rev/291faf6a656a
Comment 50 Sheila Mooney 2011-11-22 14:51:51 PST
Regression introduced in 10. Not sure it's high volume, but it would be nice to fix.
Comment 51 Justin Lebar (not reading bugmail) 2011-11-22 17:26:46 PST
Fixed in Aurora for FF 10: https://hg.mozilla.org/releases/mozilla-aurora/rev/a63ec4940565
Comment 52 Justin Lebar (not reading bugmail) 2011-11-22 20:00:57 PST
Steven, jst suggested we file a bug with Apple on this issue.  They're not supporting 10.5 anymore, but they may be able to advise whether they view this issue as a bug on their end, or whether they reserve the right to call zone_size using internal pointers.

I looked around, but it's not clear to me how to file such a bug against Apple.  Do you know how?
Comment 53 Logan Rosen [:Logan] 2011-11-22 21:16:20 PST
I believe that you can report bugs to Apple at http://developer.apple.com/bugreporter/ - you just need to be registered as an Apple Developer.
Comment 54 Steven Michaud [:smichaud] (Retired) 2011-11-22 21:26:58 PST
(Following up comment #53)

Yes, that's what I do when I report a bug to Apple.

But Apple's bug database is non-public (only you will be able to read any responses or followups that Apple may make).  So whenever I open a Mozilla-related bug with Apple, I refer to the Mozilla bug that my report has (in effect) been spun off from, and make sure to say that the Mozilla bug will be the best place to look for current information.

I almost never get any response other than "this bug has been fixed in such-and-such a release", or "this bug is a dup of some other bug" (which I can't see).
Comment 55 Steven Michaud [:smichaud] (Retired) 2011-11-22 21:31:27 PST
It used to be the case that you could buy "support incidents" from Apple on a per-case basis.  Perhaps you still can.  I don't know whether Mozilla has (or ever had) official support for using this service, but it might be worth looking into.
Comment 56 Armen Zambrano [:armenzg] (EDT/UTC-4) 2011-11-24 07:04:36 PST
Just for record purposes, there is an associated regression with this backout:
http://graphs-new.mozilla.org/graph.html#tests=[[115,52,13],[72,52,13],[77,52,13]]&sel=1321820586832.4902,1322146945719&displayrange=7&datatype=running

Talos Regression :( Tp5 MozAfterPaint increase 7.66% on MacOSX 10.5.8 Mozilla-Aurora
Talos Regression :( Dromaeo (CSS) decrease 3.49% on MacOSX 10.5.8 Mozilla-Aurora
Talos Regression :( Dromaeo (String/Array/Eval/Regex) decrease 4.11% on MacOSX 10.5.8 Mozilla-Aurora

There is also an improvement:
Talos Improvement! Tp5 MozAfterPaint (RSS) decrease 15.9% on MacOSX 10.5.8 Mozilla-Aurora
Comment 57 Justin Lebar (not reading bugmail) 2011-11-24 09:49:24 PST
And just to repeat what's been said, this is all expected.  The RSS decrease is due to a measurement error in Talos (bug 693404), and the perf regressions are real.
Comment 58 Anthony Hughes (:ashughes) [GFX][QA][Mentor] 2011-12-28 14:04:52 PST
Marking qa- as I don't think anyone in QA was able to reproduce this crash locally. Logan, would you be able to verify this is fixed in Firefox 10 and 11?
Comment 59 Logan Rosen [:Logan] 2011-12-28 17:22:46 PST
The issue was fixed by disabling jemalloc in the nightly builds; I can tell you that much right now.
Comment 61 Justin Lebar (not reading bugmail) 2012-01-05 13:35:22 PST
I think basically any memory corruption bug could trigger this crash signature.  So those crashes don't necessarily mean something is wrong.
Comment 62 Steven Michaud [:smichaud] (Retired) 2012-01-05 14:31:27 PST
(In reply to comment #61)

It's odd, though, that your patch from bug 703087 doesn't catch the corruption before the crashes in arena_salloc().
Comment 63 Justin Lebar (not reading bugmail) 2012-01-05 14:40:24 PST
Well, that assertion doesn't catch every way you can mess up ozone_size.  The assertion will fire if you pass a pointer to something that's not a jemalloc arena, but it won't fire if you pass a pointer inside an arena but not at the start of an allocation.
