Last Comment Bug 764192 - (jemalloc-assertions) Temporarily enable some jemalloc assertions in the hopes of catching heap corruption (bug 709860)
(jemalloc-assertions)
: Temporarily enable some jemalloc assertions in the hopes of catching heap cor...
Status: RESOLVED FIXED
: qawanted
Product: Core
Classification: Components
Component: Memory Allocator (show other bugs)
: Trunk
: x86 Windows XP
: -- normal (vote)
: mozilla16
Assigned To: Justin Lebar (not reading bugmail)
: Marcia Knous [:marcia - use ni]
Mentors:
Depends on: 721710 770238
Blocks: 675260 751545 709860 766173 766250 781049 787675
  Show dependency treegraph
 
Reported: 2012-06-12 15:38 PDT by Justin Lebar (not reading bugmail)
Modified: 2012-10-03 13:01 PDT (History)
9 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
fixed


Attachments
Patch v1 (11.68 KB, patch)
2012-06-12 17:41 PDT, Justin Lebar (not reading bugmail)
mh+mozilla: review+
Details | Diff | Splinter Review
Patch v2, as landed (13.57 KB, patch)
2012-06-13 06:13 PDT, Justin Lebar (not reading bugmail)
akeybl: approval‑mozilla‑aurora+
akeybl: approval‑mozilla‑beta-
Details | Diff | Splinter Review

Description Justin Lebar (not reading bugmail) 2012-06-12 15:38:59 PDT
See bug 709860 comment 30: We want to (temporarily) turn on some jemalloc assertions with the hope of catching heap corruption in that bug before Firefox crashes and burns.
Comment 1 Justin Lebar (not reading bugmail) 2012-06-12 17:41:06 PDT
Created attachment 632475 [details] [diff] [review]
Patch v1
Comment 2 Justin Lebar (not reading bugmail) 2012-06-12 17:43:58 PDT
The assertions I enabled here are for all size classes.  I think it would be a premature optimization to turn on the assertions just for size 72; they should be pretty cheap.  We'll see if I'm wrong.

https://tbpl.mozilla.org/?tree=Try&rev=0d05c82b7fb4
Comment 3 Mike Hommey [:glandium] 2012-06-12 22:11:18 PDT
Something I wondered when working on jemalloc3 is if it wouldn't make sense to enable jemalloc (with assertions on) on debug builds. It may require some tweaks to work properly with e.g. tracemalloc, but i think it's totally worth it.
Comment 4 Justin Lebar (not reading bugmail) 2012-06-13 04:59:23 PDT
(In reply to Mike Hommey [:glandium] from comment #3)
> Something I wondered when working on jemalloc3 is if it wouldn't make sense
> to enable jemalloc (with assertions on) on debug builds. It may require some
> tweaks to work properly with e.g. tracemalloc, but i think it's totally
> worth it.

I think that would be great, if we could get it to work.
Comment 5 Justin Lebar (not reading bugmail) 2012-06-13 05:08:43 PDT
Comment on attachment 632475 [details] [diff] [review]
Patch v1

This doesn't appear to be a large performance regression.  There may be a 1-2% drop in a few benchmarks; it's hard to say for sure.  (Reading performance test numbers is always a black art.)

I think we should go ahead with this; it's temporary code anyway.
Comment 6 Mike Hommey [:glandium] 2012-06-13 05:17:45 PDT
Comment on attachment 632475 [details] [diff] [review]
Patch v1

Review of attachment 632475 [details] [diff] [review]:
-----------------------------------------------------------------

I wonder if it would make sense to put that stuff under a #ifdef MOZ_TEMP_INVESTIGATION or something.
Comment 7 Justin Lebar (not reading bugmail) 2012-06-13 05:38:51 PDT
Our plan is to do that and to set MOZ_TEMP_INVESTIGATION only if we're not on release/esr.  (For my reference, that's

  ifeq (,$(filter release esr,$(MOZ_UPDATE_CHANNEL)))

.)
Comment 8 Justin Lebar (not reading bugmail) 2012-06-13 06:11:35 PDT
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=7e553f328d3a
Comment 9 Justin Lebar (not reading bugmail) 2012-06-13 06:13:58 PDT
Created attachment 632660 [details] [diff] [review]
Patch v2, as landed
Comment 10 Justin Lebar (not reading bugmail) 2012-06-13 06:16:01 PDT
Comment on attachment 632660 [details] [diff] [review]
Patch v2, as landed

[Approval Request Comment]
This is wanted on beta for investigating bug 709860.  I figure we might as well take it on Aurora as well.

No string changes.  This may cause an increase in crashes, as we're now stricter about detecting heap corruption.  The code disables itself on release and esr channels.
Comment 11 Justin Lebar (not reading bugmail) 2012-06-13 06:28:51 PDT
And a follow-up to fix red: https://hg.mozilla.org/integration/mozilla-inbound/rev/2b43c6ad74ea
Comment 12 Benjamin Smedberg [:bsmedberg] 2012-06-13 10:06:45 PDT
Marcia can you watch for crash signatures related to this when this lands in nightly (probably for tomorrow's nightly)? I *think* they will show up as crashes in "arena_"-something.
Comment 13 Justin Lebar (not reading bugmail) 2012-06-13 10:21:05 PDT
> I *think* they will show up as crashes in "arena_"-something.

Probably?  There's heavy inlining here; particularly so, I would expect, on Windows PGO builds.

I could make the crash happen in a not-inlined function, which would give this a unique crash signature, if we wanted.
Comment 15 Alex Keybl [:akeybl] 2012-06-15 15:40:52 PDT
Comment on attachment 632660 [details] [diff] [review]
Patch v2, as landed

[Triage Comment]
Thanks for ensuring this never hits release users. Approving in support of our crash investigation.
Comment 16 Benjamin Smedberg [:bsmedberg] 2012-06-18 12:55:06 PDT
We did a search on crashes with address 123 (address 0x7b) and did not come up with any on the nightly channel in the past week. Unless MOZ_CRASH isn't giving me the data I expect...
Comment 17 Justin Lebar (not reading bugmail) 2012-06-18 12:56:38 PDT
Would you like to land this on branches, so we can get more exposure, or do you think we should take a different approach?
Comment 18 Benjamin Smedberg [:bsmedberg] 2012-06-18 13:20:56 PDT
Yes, I think this should land on branches, if Talos didn't show any noticeable slowdown from it. Perhaps permanently...
Comment 19 Benjamin Smedberg [:bsmedberg] 2012-06-18 13:22:28 PDT
Note, the custom query was in bug 765626, and can be repeated for larger testing populations.
Comment 20 Marcia Knous [:marcia - use ni] 2012-06-18 15:46:37 PDT
bsmedberg: I missed the earlier ping and just had a chance to more deeply at crash stats: http://tinyurl.com/css2jky shows a handful of crashes with arena_dalloc - some of them map to dates after the checkin. Not sure if those crashes are what you were looking for or not.

(In reply to Benjamin Smedberg  [:bsmedberg] from comment #12)
> Marcia can you watch for crash signatures related to this when this lands in
> nightly (probably for tomorrow's nightly)? I *think* they will show up as
> crashes in "arena_"-something.
Comment 21 Justin Lebar (not reading bugmail) 2012-06-18 15:50:56 PDT
It's hard to understand what this crash report means; it says it's crashing with a null-pointer exception at a function call.

http://hg.mozilla.org/mozilla-central/annotate/983b91e5aa17/memory/mozjemalloc/jemalloc.c#l4512

Is there any way to tell whether it crashed due to MOZ_CRASH?
Comment 22 Mike Hommey [:glandium] 2012-06-18 22:50:44 PDT
It so happens that I'm currently tracking what would appear to be a compiler bug in MSVC 2010 that, in some instances, show up as crashes in arena_dalloc_small when I'm debugging in MSVC. I wonder if that could be related.
Comment 23 Benjamin Smedberg [:bsmedberg] 2012-06-19 08:12:59 PDT
Before today, MOZ_CRASH on Windows *should* show up as a EXCEPTION_ACCESS_VIOLATION_WRITE with address at 0x7b (write-dereferencing 123). But since the followup patch for bug 761859 landed, it should today start showing up as EXCEPTION_BREAKPOINT, which is searchable in the query form and will make this easier.

I'm going to download the minidump for https://crash-stats.mozilla.com/report/index/42fe6a14-6c93-45b9-b28d-462fb2120616 specifically and provide a better diagnosis: I think aggressive inlining may be the actual cause there.
Comment 24 Benjamin Smedberg [:bsmedberg] 2012-06-19 09:13:54 PDT
ok, I'm actually completely wrong about MOZ_CRASH. It was doing reinterpret_cast<int*>(0) = 123 and not what I thought, which was reinterpret_cast<int*>(123) = 0. Hrmph. So the crash https://crash-stats.mozilla.com/report/index/42fe6a14-6c93-45b9-b28d-462fb2120616 is in fact a MOZ_CRASH abort. It appears looking through the assembly that the actual assert we're hitting is http://hg.mozilla.org/mozilla-central/annotate/da8c6039c25e/memory/mozjemalloc/jemalloc.c#l3310
Comment 25 Justin Lebar (not reading bugmail) 2012-06-19 09:33:51 PDT
> It appears looking through the assembly that the actual assert we're hitting is 
> http://hg.mozilla.org/mozilla-central/annotate/da8c6039c25e/memory/mozjemalloc/jemalloc.c#l3310

I think that's a double-free.
Comment 26 Justin Lebar (not reading bugmail) 2012-06-19 09:36:15 PDT
Or at least, it's freeing something that is not in use (maybe free'd, maybe never malloc'ed).
Comment 27 Justin Lebar (not reading bugmail) 2012-06-19 12:27:58 PDT
Do we still want to push this to branches, or do we want to see if bug 766173 is the right fix?  (Trying to be conservative here...)
Comment 28 Benjamin Smedberg [:bsmedberg] 2012-06-19 13:05:29 PDT
That's a question for the release-driver du jour: it seems likely that we've fixed the specific issue here, though we probably won't know until bug 766173 hits aurora and perhaps beta audiences. I definitely think that we should try turning on release checking in general, not just for this size class, but perhaps that should be a separate bug and can ride the trains...
Comment 29 Justin Lebar (not reading bugmail) 2012-06-19 13:17:27 PDT
> I definitely think that we should try turning on release checking in general, not just for this size 
> class.

I totally agree; these assertions turned out to be really useful.

Only the memset() filling is specific to the 72B size class, although most of the asserts here are for the "small" size classes (less than or equal to 2kb).  I filed bug 766250 on enabling more asserts.
Comment 30 Alex Keybl [:akeybl] 2012-06-19 15:44:50 PDT
Comment on attachment 632660 [details] [diff] [review]
Patch v2, as landed

Let's stay on the conservative side and wait to see what happens on Aurora with bug 766173 before uplifting to beta, since we now think we have a fix and are worried about the possibility of increased crashiness.
Comment 31 Alex Keybl [:akeybl] 2012-06-24 12:19:05 PDT
Given https://bugzilla.mozilla.org/show_bug.cgi?id=709860#c39 it sounds like it makes sense to land on Aurora, if that helps us determine the cause.
Comment 32 Justin Lebar (not reading bugmail) 2012-06-24 13:12:53 PDT
Blocks bug 675260, because that is apparently heap corruption being detected by this patch.
Comment 33 Justin Lebar (not reading bugmail) 2012-06-24 13:39:12 PDT
Aurora 15: https://hg.mozilla.org/releases/mozilla-aurora/rev/41730cca9ae1
Comment 34 Alex Keybl [:akeybl] 2012-07-02 16:12:00 PDT
Comment on attachment 632660 [details] [diff] [review]
Patch v2, as landed

[Triage Comment]
Given how close we are to release, there isn't really the opportunity to land this on Beta 14. Minusing given that.

Note You need to log in before you can comment on or make changes to this bug.