Closed Bug 702250 Opened 13 years ago Closed 13 years ago

Disable jemalloc on mac 10.5 due to crash in ozone_size

Categories

(Core :: Memory Allocator, defect)

x86
macOS
defect
Not set
critical

Tracking

RESOLVED FIXED
mozilla11
Tracking Status
firefox9 --- unaffected
firefox10 - fixed
firefox11 - fixed

People

(Reporter: Logan, Assigned: justin.lebar+bug)

Details

(Keywords: crash, Whiteboard: [qa-])

Attachments

(1 file, 1 obsolete file)

This bug was filed from the Socorro interface and is 
report bp-ab845691-eeb9-4f5a-bab9-a10112111114 .
============================================================= 
I am receiving this crash signature every time I drag a high-resolution picture from Firefox to another application in Mac OS X 10.5.8, using Nightly 20111113.

Steps to reproduce:
1. Find a high-resolution photo, like <http://images.kaneva.com/filestore9/5119771/6318375/speakerUgrilleUtexture.jpg>.
2. Drag it from Firefox to another application on the dock.
3. Firefox crashes.
Looks like a jemalloc bug (a regression from bug 414946) or, less likely, an Apple system bug.
Component: Shell Integration → jemalloc
Product: Firefox → Core
QA Contact: shell.integration → jemalloc
Logan, can you find a regression window for this? When did you start seeing it on trunk? Any chance you can try to reproduce this with builds from Oct 5th and Oct 6th? That would help us understand whether it's a regression from the jemalloc checkin.
This is very likely a regression from enabling jemalloc on 10.5, since before that, ozone_size didn't exist.
Assignee: nobody → justin.lebar+bug
For what it's worth, I can't reproduce this.

I tested with today's mozilla-central nightly on OS X 10.5.8.  I tried dragging the URL from comment #0 to Safari in the Dock.  I also tried loading the URL from comment #0, then dragging the graphic to Safari in the Dock.  In neither case did I have any trouble.  Safari ran and loaded the graphic, and FF didn't crash.

So if anyone's going to find a regression range for this bug, it'll probably need to be you, Logan.
Assignee: justin.lebar+bug → nobody
https://crash-stats.mozilla.com/ does provide some data:

https://crash-stats.mozilla.com/report/list?product=Firefox&query_search=signature&query_type=contains&query=ozone_size&reason_type=contains&date=11%2F14%2F2011%2011%3A12%3A02&range_value=4&range_unit=weeks&hang_type=any&process_type=any&do_query=1&signature=ozone_size

There were a (relatively) small number of these crashes over the last four weeks, almost all of them on OS X 10.5.8.

The oldest build with one of these crashes in that data is the 2011-11-02 mozilla-central nightly.  The patch that turned on jemalloc for OS X 10.5 (the patch for bug 694335) landed on mozilla-central on 2011-11-01 -- which makes the 2011-11-02 nightly the first one that contained it.
If I could reproduce this bug I'd take it.  But I can't.
We might just want to turn off jemalloc on OS X 10.5.

But the number of these crashes is (still) quite small.  And this is the only jemalloc bug (10.5-specific or not) that I'm aware of.

Justin, what do you think about turning off jemalloc on OS X 10.5?
> Justin, what do you think about turning off jemalloc on OS X 10.5?

If we can't figure this bug out, then we may have to, but I don't think we've reached that point yet.  The memory benefits from jemalloc are particularly meaningful on old hardware, which is exactly what's running 10.5, so it'd be a bummer if we had to turn it off for an issue we don't understand.

Can we ask QA to try to help us reproduce this issue?
Another thing to consider:

Though the number of these crashes is still quite small, so is the number of people using anything on the 10 or 11 branches.  So this bug's crash is actually the #6 topcrasher on those branches.

Unless we can fix this bug soon (which will require finding out how to reproduce it), we will seriously need to consider disabling jemalloc on OS X 10.5.

https://crash-stats.mozilla.com/query/query?product=Firefox&version=Firefox%3A11.0a1&version=Firefox%3A10.0a2&version=Firefox%3A10.0a1&platform=mac&range_value=1&range_unit=weeks&date=11%2F14%2F2011+11%3A42%3A29&query_search=signature&query_type=contains&query=&reason=&build_id=&process_type=any&hang_type=any&do_query=1
> Can we ask QA to try to help us reproduce this issue?

Marcia's good at that sort of thing ... though I'm not sure she'd want me to say so :-)

For that matter so am I.

But we both have *lots* of other things to deal with.

For the time being, let's wait and see if someone can provide us better STR for these crashes.
Thanks for the quick responses, everyone. Unfortunately, the computer running 10.5.8 that I have access to is not at home, so I'll have to try reproducing the bugs again in the latest build tomorrow, in different situations to make sure that there isn't another variable that I'm not mentioning.
Thanks, Logan.  If you can find steps to reproduce that also work for me, that would be extremely helpful!
> If you can find steps to reproduce that also work for me, that would
> be extremely helpful!

For example that I first have to install such-and-such an extension.
Okay - I made a dumb mistake in my original report, as I didn't test the example image for the reproduction steps. This one <http://images.kaneva.com/filestore9/5119771/6318375/speakerUgrilleUtexture.jpg> crashes the latest build of Nightly (20111114) every time I drag it to another application on the dock.
Boom.  I can reproduce this!
In a debug build, we fail this assertion:

http://hg.mozilla.org/mozilla-central/annotate/3f0b94325b80/memory/jemalloc/jemalloc.c#l4335

...that looks pretty bad.
> Boom.  I can reproduce this!

I still can't :-(

But maybe I'm doing it wrong.  Someone needs to give me precise,
detailed steps to reproduce.

Here are the various things I've been doing.  None of them crash for
me on OS X 10.5.8 with yesterday's mozilla-central nightly.

1) Drag the URL from comment #14 to Safari in the Dock.

2) Drag the URL from comment #14 to the Desktop.

3) Visit the URL from comment #14, then drag the image to Safari in
   the Dock.

4) Visit the URL from comment #14, then drag the image to the Desktop.

Aside from getting the steps exactly right, I may also need to have
one or more extensions installed.  I currently have none.
> ...that looks pretty bad.

Yup.  Looks like memory corruption.

Since you're able to reproduce this, are you willing to take it?

And just to be sure, you're seeing crashes on OS X 10.5.8?
If it's only occurring with drag and drop, is it possible we have an allocator mismatch somewhere?
(In reply to Steven Michaud from comment #18)
> > Boom.  I can reproduce this!
> 
> I still can't :-(
> 
> But maybe I'm doing it wrong.  Someone needs to give me precise,
> detailed steps to reproduce.
> 
> Here are the various things I've been doing.  None of them crash for
> me on OS X 10.5.8 with yesterday's mozilla-central nightly.
> 
> 1) Drag the URL from comment #14 to Safari in the Dock.
> 
> 2) Drag the URL from comment #14 to the Desktop.
> 
> 3) Visit the URL from comment #14, then drag the image to Safari in
>    the Dock.
> 
> 4) Visit the URL from comment #14, then drag the image to the Desktop.
> 
> Aside from getting the steps exactly right, I may also need to have
> one or more extensions installed.  I currently have none.

I had a pastefail in Comment 14 - you should use the image linked to in Comment 15 in order to reproduce the crash. Sorry. :P
> from comment #14

My comment was wrong.  I was actually using the URL from comment #15.  Please answer my other questions!
(In reply to Steven Michaud from comment #22)
> > from comment #14
> 
> My comment was wrong.  I was actually using the URL from comment #15. 
> Please answer my other questions!

Honestly, step 3 with the URL in Comment 15 should be crashing it for you in 10.5.8. I'm not sure why it is not happening for you.
I thought we were failing the assertion, but actually, we're segfaulting inside the assertion.  On my debug build, we segfault trying to access address 0xa5a5a5a5.

I wonder if the system allocator is using that junk mask or if it's coming from jemalloc...
Steven, the testcase I've successfully used is to drag the image from the URL in comment #15 onto the Finder in the Dock.  I'm on 10.5.8.

I can send you credentials to the machine I've been using over IRC, if you ping me.
Ok, I've now been playing around on Justin's machine (connected
remotely via VNC).  On that machine the following STR works just fine
(I always crash):

Visit the URL from comment #15, then drag the image to the Desktop.

FF on Justin's machine has no extensions, and loads nothing but
standard plugins.

At first I thought it was easier to crash because of how slow
drag-and-drop is over VNC.  And that may yet be the case (neither of
us has yet been able to test on that machine except over a VNC
connection).

But now I suspect the crucial difference is that my machine has 4GB of
RAM, but Justin's has only 1GB.

Logan, how much RAM does your machine have?
It doesn't crash for me when I set the resolution to 1024x768.

It does crash at 1280x1024 and 1680x1050.  It crashes with both available color depths ("thousands" and "millions").
My 10.5.8 machine is already at the maximum screen resolution and color depth (and doesn't crash).  I'm working on writing an eatram utility that would use low-level APIs to allocate up to a certain amount of unswappable RAM -- to see if it makes a difference to lower the effectively available amount of RAM on my machine.
My 10.5.8 machine has 4 GB of RAM. I'd have to check on the resolution (I think it's 1680 x 1050).
Marcia, this crash signature is seen only on 10.5, correct?

Here's what I think is happening.  The short story is: It appears to be an OSX bug.

We malloc space for something bigger than 1MB (perhaps the drag/drop image).  jemalloc handles this by rounding the size up to the next MB and placing the allocation on a 1MB boundary.

Later, the OS frees the image in [NSBitmapImageRep _freeData].  But it doesn't pass us back the pointer we gave it!  It passes back a pointer 0x20 bytes past what we gave.

Now we're hosed.  isalloc_validate looks at this pointer and first checks whether it's inside a chunk jemalloc owns.  It is.  Then isalloc_validate assumes that the corresponding chunk holds a bunch of small allocations, rather than one big one, because the pointer we gave it isn't 1MB-aligned.  So isalloc_validate accesses one of this chunk's fields, and we crash.

But even if we didn't have to go through isalloc_validate, we'd still be hosed when we actually tried to free that pointer.
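
To make the failure mode concrete, here is a minimal sketch of the alignment test described above. It is not the jemalloc source: `CHUNK_SIZE`, `chunk_base`, `classify` and `ptr_kind` are hypothetical names, the 1MB chunk size is taken from the description above, and the addresses are only illustrative.

```c
#include <stdint.h>

/* Assumption: chunks are 1MB and 1MB-aligned, as described above. */
#define CHUNK_SIZE ((uintptr_t)1 << 20)

/* Hypothetical labels for how a pointer gets interpreted. */
typedef enum { PTR_HUGE, PTR_ARENA } ptr_kind;

/* Round a pointer down to the start of the chunk containing it. */
static uintptr_t chunk_base(uintptr_t ptr) {
    return ptr & ~(CHUNK_SIZE - 1);
}

/*
 * A chunk-aligned pointer is taken to be a huge allocation; anything
 * else is assumed to live inside an arena chunk whose header sits at
 * chunk_base(ptr).
 */
static ptr_kind classify(uintptr_t ptr) {
    return chunk_base(ptr) == ptr ? PTR_HUGE : PTR_ARENA;
}
```

Under these assumptions, `classify(0xd400000)` gives `PTR_HUGE`, while `classify(0xd400020)` gives `PTR_ARENA` even though both addresses lie in the same huge allocation. The second pointer is then treated as if an arena header lived at the start of its chunk, which is where the bad read would come from.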


I've been trying to confirm this interpretation by dumping out the addresses jemalloc gives out, but malloc_printf appears to be deadlocking.  I'll report back once I know for sure.  But it looks like we'll have to disable jemalloc on 10.5 to get around this issue.
https://crash-stats.mozilla.com/report/index/6c2e594f-82f0-4925-b9d7-13c6e2111110 confirms there is one crash on the trunk on 10.6.8. The query also shows SeaMonkey crashes on 10.7.2, but the rest of the Firefox crashes are on 10.5.

https://crash-stats.mozilla.com/report/list?signature=ozone_size
The SeaMonkey crashes on 10.7.2 are bad news -- there are *lots* of them.
(Following up comment #32)

And they're all in one nightly (dated 2011-11-13), and they're all crashes on startup.
(Following up comment #33)

However, most of the crashes are reported as happening in the span of a couple minutes between 13:58 and 14:00 on 2011-11-13.  So this may be an error in Socorro.
We need to keep a close eye on this.  If it's actually happening regularly outside 10.5, then my hypothesis from comment 30 may be wrong, and we may need to press the panic button.
Finally confirmed that, indeed, we allocated a chunk at one address and then we're being asked to get the size of an address *inside that chunk*.

* chunk_alloc_mmap 0xd400000, size 2MB.  (Because it's a 2MB chunk, we know it's in response to a huge malloc.  If it were a chunk holding arenas, it would have size 1MB.)

* isalloc_validate(0xd400020)

Bizarre.

I'm not sure if we can safely work around this without disabling jemalloc on 10.5.  I have one idea: It appears that macos is calling ozone_size not because it cares about the actual size of the allocation, but because it's trying to figure out whether the allocation lives in this zone.

We can safely report yes or no to this question even if the pointer given isn't to the beginning of the chunk.  (It wouldn't be correct in all cases -- if you gave a pointer 1MB past the end of the chunk in question, we'd say that it didn't live in jemalloc's memory, even though it does in this case, since the chunk has size 2MB.)

So I'm going to try hacking ozone_size to return 0 or 1.  But I doubt this is actually going to solve our problem, because I suspect that macos is then going to try to free() the pointer at 0xd400020, which is totally not allowed.
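
A rough sketch of that yes-or-no check, assuming the allocator keeps a list of chunk start addresses. `known_chunks` and `zone_owns` are hypothetical, the single registered address is illustrative, and jemalloc's real bookkeeping may well differ.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 1MB chunk granularity, as discussed in this bug. */
#define CHUNK_SIZE ((uintptr_t)1 << 20)

/* Hypothetical registry of chunk start addresses we handed out. */
static const uintptr_t known_chunks[] = { 0xd400000 };

/*
 * Answer "does this pointer live in our zone?" by rounding down to a
 * 1MB boundary and looking that address up. Only the first 1MB of a
 * larger chunk is recognised this way.
 */
static bool zone_owns(uintptr_t ptr) {
    uintptr_t base = ptr & ~(CHUNK_SIZE - 1);
    size_t n = sizeof known_chunks / sizeof known_chunks[0];
    for (size_t i = 0; i < n; i++) {
        if (known_chunks[i] == base)
            return true;
    }
    return false;
}
```

With a 2MB chunk registered at 0xd400000, `zone_owns(0xd400020)` is true, but a pointer in the second megabyte (for example 0xd500020) is reported as foreign. That is the kind of incorrect case mentioned above.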
Awkward question:

Are you sure something like this doesn't also happen on other versions of OS X?  In such a way that it doesn't cause crashes, or causes many fewer crashes?
My debug build doesn't blow up on the assertion when I drag and drop on 10.7, so...I think so?  It should crash just as hard in a debug build because we're filling the chunk with 0xa5a5a5a5.
By the way, I spent a few hours yesterday writing a nice little app that allocates large chunks of RAM and then locks it in place by calling mlock() -- which I hope means that it can't be swapped out.  It (apparently) manages to eat about 3GB of my 4GB.  But even with all that RAM "gone", I still can't reproduce these crashes.
(In reply to comment #38)

> It should crash just as hard in a debug build because we're filling
> the chunk with 0xa5a5a5a5.

Point well taken.

But could we turn on this filling-with-0xa5a5a5a5 in mozilla-central
nightlies, just to see what happens?
> So I'm going to try hacking ozone_size to return 0 or 1.

This doesn't work; we crash in startup.  It looks like macos (not unreasonably) expects the returned size to be correct.

> But could we turn on this filling-with-0xa5a5a5a5 in mozilla-central
> nightlies, just to see what happens?

I guess...  It would likely be a big perf regression, so it'd be a pain for anyone landing perf-sensitive code.  There's a sanity check in isalloc_validate that we could definitely turn on for release builds:

  assert(chunk->arena->magic == ARENA_MAGIC)

That might be enough to catch problems.
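
As a sketch only, a check of this shape that stays active whether or not assertions are compiled in might look like the following. The struct layouts are simplified stand-ins, the `ARENA_MAGIC` value is illustrative, and `chunk_has_valid_arena` is a hypothetical helper, not a function in jemalloc.c.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative value; the real constant is defined in jemalloc.c. */
#define ARENA_MAGIC 0x947d3d24u

/* Simplified stand-ins for the real structures (assumption). */
typedef struct {
    uint32_t magic;
} arena_t;

typedef struct {
    arena_t *arena;
} arena_chunk_t;

/*
 * Return true only if the chunk header points at an arena carrying
 * the expected magic number. Note that this still reads through
 * chunk->arena, so it can itself fault if that field holds junk.
 */
static bool chunk_has_valid_arena(const arena_chunk_t *chunk) {
    return chunk != NULL
        && chunk->arena != NULL
        && chunk->arena->magic == ARENA_MAGIC;
}
```

A caller could bail out or deliberately crash when this returns false. The caveat in the comment matters here: in the debug build described earlier, the segfault happened inside the assertion itself, while trying to access 0xa5a5a5a5.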

> https://crash-stats.mozilla.com/report/index/6c2e594f-82f0-4925-b9d7-13c6e2111110 confirms there 
> is one crash on the trunk on 10.6.8.

This appears to be a crash in plugin-container, so...who knows what crazy stuff flash is doing.

Anyway, I'll spin a patch for 10.5 and file a separate bug for testing in the nightlies.
Summary: crash ozone_size → Disable jemalloc on mac 10.5 due to crash in ozone_size
Blocks: 694335
Attached patch Patch v1 (obsolete) — Splinter Review
Attachment #574992 - Flags: review?(khuey)
Kyle, note that this isn't a full backout of bug 694335; that bug also enabled jemalloc on 32-bit 10.6, and I'm leaving that in here.
Filed bug 703087 for making the assertion fire on release builds, so we can get an idea if this is a problem on other versions.
Attachment #574992 - Flags: approval-mozilla-aurora?
Inbound: https://hg.mozilla.org/integration/mozilla-inbound/rev/a3fcfb7d6647

It's a real shame we had to do this.  Let's just hope jemalloc sticks on 10.6 and 10.7!
Backed out in https://hg.mozilla.org/integration/mozilla-inbound/rev/ad76b1669e56 - dunno whether it requires a clobber (seems quite unlikely) or something else, but the 10.5 opt tests were a solid bar of orange, whining about "Non-aligned pointer being freed" and the like until an oom blew up httpd.js startup.
Oh, right.  Dynamically turning off jemalloc doesn't work on 32-bit.

Oops.
Attachment #574992 - Flags: review-
Attachment #574992 - Flags: review+
Attachment #574992 - Flags: approval-mozilla-aurora?
Attachment #575464 - Flags: approval-mozilla-aurora?
https://hg.mozilla.org/mozilla-central/rev/291faf6a656a
Assignee: nobody → justin.lebar+bug
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla11
Regression introduced in 10. Not sure it's high volume, but it would be nice to fix.
Attachment #575464 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #574992 - Attachment is obsolete: true
Steven, jst suggested we file a bug with Apple on this issue.  They're not supporting 10.5 anymore, but they may be able to advise whether they view this issue as a bug on their end, or whether they reserve the right to call zone_size using internal pointers.

I looked around, but it's not clear to me how to file such a bug against Apple.  Do you know how?
I believe that you can report bugs to Apple at http://developer.apple.com/bugreporter/ - you just need to be registered as an Apple Developer.
(Following up comment #53)

Yes, that's what I do when I report a bug to Apple.

But Apple's bug database is non-public (only you will be able to read any responses or followups that Apple may make).  So whenever I open a Mozilla-related bug with Apple, I refer to the Mozilla bug that my report has (in effect) been spun off from, and make sure to say that the Mozilla bug will be the best place to look for current information.

I almost never get any response other than "this bug has been fixed in such-and-such a release", or "this bug is a dup of some other bug" (which I can't see).
It used to be the case that you could buy "support incidents" from Apple on a per-case basis.  Perhaps you still can.  I don't know whether Mozilla has (or ever had) official support for using this service, but it might be worth looking into.
Just for the record, there is an associated regression with this backout:
http://graphs-new.mozilla.org/graph.html#tests=[[115,52,13],[72,52,13],[77,52,13]]&sel=1321820586832.4902,1322146945719&displayrange=7&datatype=running

Talos Regression :( Tp5 MozAfterPaint increase 7.66% on MacOSX 10.5.8 Mozilla-Aurora
Talos Regression :( Dromaeo (CSS) decrease 3.49% on MacOSX 10.5.8 Mozilla-Aurora
Talos Regression :( Dromaeo (String/Array/Eval/Regex) decrease 4.11% on MacOSX 10.5.8 Mozilla-Aurora

There is also an improvement:
Talos Improvement! Tp5 MozAfterPaint (RSS) decrease 15.9% on MacOSX 10.5.8 Mozilla-Aurora
And just to repeat what's been said, this is all expected.  The RSS decrease is due to a measurement error in Talos (bug 693404), and the perf regressions are real.
Marking qa- as I don't think anyone in QA was able to reproduce this crash locally. Logan, would you be able to verify this is fixed in Firefox 10 and 11?
Whiteboard: [qa-]
The issue was fixed by disabling jemalloc in the nightly builds; I can tell you that much right now.
I think basically any memory corruption bug could trigger this crash signature.  So those crashes don't necessarily mean something is wrong.
(In reply to comment #61)

It's odd, though, that your patch from bug 703087 doesn't catch the corruption before the crashes in arena_salloc().
Well, that assertion doesn't catch every way you can mess up ozone_size.  The assertion will fire if you pass a pointer to something that's not a jemalloc arena, but it won't fire if you pass a pointer inside an arena but not at the start of an allocation.
Depends on: 738176