Last Comment Bug 583520 - (CVE-2010-2770) Mac crash with fuzzed font in data: URL
(CVE-2010-2770)
: Mac crash with fuzzed font in data: URL
Status: RESOLVED FIXED
[sg:critical?][critsmash:investigating]
: crash, testcase, verified1.9.1, verified1.9.2
Product: Core
Classification: Components
Component: Graphics (show other bugs)
: 1.9.2 Branch
: x86 Mac OS X
: -- critical (vote)
: ---
Assigned To: John Daggett (:jtd)
:
Mentors:
Depends on:
Blocks: 594536
  Show dependency treegraph
 
Reported: 2010-07-31 16:13 PDT by Brandon Sterne (:bsterne)
Modified: 2011-01-27 11:28 PST (History)
22 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
unaffected
.9+
.9-fixed
.12+
.12-fixed


Attachments
testcase (crash) (39.83 KB, text/html)
2010-07-31 16:13 PDT, Brandon Sterne (:bsterne)
no flags Details
Reduced testcase (37.78 KB, text/html)
2010-08-02 16:20 PDT, Steven Michaud [:smichaud] (Retired)
no flags Details
Reduced testcase rev1 (fix META tag) (37.74 KB, text/html)
2010-08-02 17:34 PDT, Steven Michaud [:smichaud] (Retired)
no flags Details
Reduced testcase rev2 (add text-rendering for Safari) (37.77 KB, text/html)
2010-08-02 19:19 PDT, John Daggett (:jtd)
no flags Details
safari stack on 10.5.8 (33.49 KB, text/plain)
2010-08-02 19:21 PDT, John Daggett (:jtd)
no flags Details
Stack trace of "faulty glyph" exceptions (36.61 KB, text/plain)
2010-08-03 15:03 PDT, Steven Michaud [:smichaud] (Retired)
no flags Details
patch, check alloc fails within cocoa native theme code (4.83 KB, patch)
2010-08-12 00:31 PDT, John Daggett (:jtd)
jaas: review-
Details | Diff | Review
patch, v2, check alloc fails within cocoa native theme code (4.79 KB, patch)
2010-08-12 20:27 PDT, John Daggett (:jtd)
jaas: review+
christian: approval1.9.2.9+
christian: approval1.9.1.12+
Details | Diff | Review
patch, v2a, port alloc checks to trunk (4.68 KB, patch)
2010-08-13 00:39 PDT, John Daggett (:jtd)
joe: approval2.0+
Details | Diff | Review

Description Brandon Sterne (:bsterne) 2010-07-31 16:13:08 PDT
Created attachment 461832 [details]
testcase (crash)

Marc Schoenefeld reported the following to security@mozilla.org.  I confirmed that it crashed for me on Firefox 3.6.8, but NOT on trunk.  I died before Crash Reporter came up, but I saved the relevant portion of the Apple crash report, which I'll attach.

Crash Wrangler told Marc that it was a double free issue, but I can't tell if that's consistent with what I got.  sg:crit? until we know for sure.

----------

Hi,

my font fuzzer triggered the following on Ffx 3.6.8/OSX10.6
The testcase is attached as html file embedding the
malformed font in a data: url tag. The Crashwrangler
wrapped said it is exploitable, so I thought I better forward
this.

Summary from CW:

Faulty glyph (id:38) outline detected - replacing with a space/null
glyph - in memory font kind
Fri Jul 30 12:20:24 maeckes2.local firefox-bin[10483] <Error>:
CGBitmapContextInfoCreate: unable to allocate 10584 bytes for bitmap data
objc[10483]: FREED(id): message autorelease sent to freed object=0x1f2de010

[..]

---
exception=EXC_BAD_INSTRUCTION:signal=4:is_exploitable=yes:instruction_disassembly=:instruction_address=0x0000000097db24b4:access_type=:access_address=0x0000000000000000:
Illegal instruction at 0x0000000097db24b4, probably a exploitable issue.


Cheers
Marc
Comment 1 John Daggett (:jtd) 2010-08-01 18:03:39 PDT
Reproduced on Mac OS X 10.6.4 (10F569) with 3.6.4:
http://crash-stats.mozilla.com/report/index/27f03011-e64e-40e9-8a84-ca5842100801

I'm guessing this is a memory-stress bug that exposes bugs in various components, 

0 libobjc.A.dylib _objc_error 
1 libobjc.A.dylib __objc_error 
2 libobjc.A.dylib _freedHandler 
3 XUL DrawCellWithScaling widget/src/cocoa/nsNativeThemeCocoa.mm:398
4 XUL DrawCellWithSnapping widget/src/cocoa/nsNativeThemeCocoa.mm:529
5 XUL nsNativeThemeCocoa::DrawPushButton widget/src/cocoa/nsNativeThemeCocoa.mm:745
6 XUL nsNativeThemeCocoa::DrawWidgetBackground widget/src/cocoa/nsNativeThemeCocoa.mm:1634
7 XUL nsCSSRendering::PaintBackgroundWithSC layout/base/nsCSSRendering.cpp:2053
8 XUL nsCSSRendering::PaintBackground layout/base/nsCSSRendering.cpp:1367
9 XUL nsButtonFrameRenderer::PaintBorderAndBackground layout/forms/nsButtonFrameRenderer.cpp:269
10 XUL nsDisplayButtonBorderBackground::Paint layout/forms/nsButtonFrameRenderer.cpp:181
11 XUL nsDisplayClip::Paint layout/base/nsDisplayList.cpp:405
12 XUL XUL@0x222a12 
13 XUL nsLayoutUtils::PaintFrame layout/base/nsLayoutUtils.cpp:1145

Safari 5.0 (6533.16) doesn't crash after running for 5 minutes but the system is sluggish and not very responsive.
Comment 2 John Daggett (:jtd) 2010-08-01 18:29:09 PDT
Does not occur with trunk code, with harfbuzz enabled (default) or disabled.  Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0b3pre) Gecko/20100801 Minefield/4.0b3pre

However, the console log show a number of errors that don't look great:

8/2/10 10:25:09 AM	firefox-bin[70963]	*** __NSAutoreleaseNoPool(): Object 0x2cc2070 of class __NSFastEnumerationEnumerator autoreleased with no pool in place - just leaking
8/2/10 10:25:09 AM	firefox-bin[70963]	*** __NSAutoreleaseNoPool(): Object 0x2b24160 of class TopLevelWindowData autoreleased with no pool in place - just leaking
8/2/10 10:25:09 AM	firefox-bin[70963]	*** __NSAutoreleaseNoPool(): Object 0x2b4dd10 of class NSCFString autoreleased with no pool in place - just leaking
8/2/10 10:25:09 AM	firefox-bin[70963]	*** __NSAutoreleaseNoPool(): Object 0x2b1f6b0 of class ToolbarWindow autoreleased with no pool in place - just leaking
8/2/10 10:25:09 AM	firefox-bin[70963]	*** __NSAutoreleaseNoPool(): Object 0x2b1fca0 of class WindowDelegate autoreleased with no pool in place - just leaking
8/2/10 10:25:10 AM	firefox-bin[70965]	*** __NSAutoreleaseNoPool(): Object 0x1950dbd0 of class NSEvent autoreleased with no pool in place - just leaking
8/2/10 10:25:10 AM	[0x0-0x621621].org.mozilla.minefield	2010-08-02 10:25:10.070 firefox-bin[70965:903] *** __NSAutoreleaseNoPool(): Object 0x1950dbd0 of class NSEvent autoreleased with no pool in place - just leaking

So we may be just leaking something that 3.6.x does not, thus cloaking the underlying problem.
Comment 3 John Daggett (:jtd) 2010-08-01 22:57:30 PDT
Running with 1.9.2 latest, crash occurs in debugger here:

#0  0x954e631d in CGPathAddQuadCurveToPoint ()
#1  0x954e701f in path_create_path ()
#2  0x954e63a6 in CGPathCreateByNormalizingGlyphPath ()
#3  0x954e61b7 in CGFontCreateGlyphPath ()
#4  0x954e51fd in CGFontCreateGlyphBitmap8 ()
#5  0x954e4e19 in create_missing_bitmaps ()
#6  0x9549db77 in CGGlyphLockLockGlyphBitmaps ()
#7  0x9720d52b in ripc_RenderGlyphs ()
#8  0x9720cf6f in ripc_DrawGlyphs ()
#9  0x9549bd22 in draw_glyphs ()
#10 0x9549b677 in CGContextShowGlyphsWithAdvances ()
#11 0x01339acd in _cairo_quartz_surface_show_glyphs (abstract_surface=0x1f6696c0, op=CAIRO_OPERATOR_OVER, source=0x1ec92b70, glyphs=0xbfff8df8, num_glyphs=15, scaled_font=0x1b1948e0, remaining_glyphs=0xbfff8574, extents=0x0) at /builds/moz192/gfx/cairo/cairo/src/cairo-quartz-surface.c:2414
#12 0x01328f1a in _cairo_surface_show_text_glyphs (surface=0x1f6696c0, op=CAIRO_OPERATOR_OVER, source=0x1ec92b70, utf8=0x0, utf8_len=0, glyphs=0xbfff8df8, num_glyphs=15, clusters=0xbfff85f8, num_clusters=0, cluster_flags=0, scaled_font=0x1b1948e0, extents=0x0) at /builds/moz192/gfx/cairo/cairo/src/cairo-surface.c:2664
#13 0x01304ae8 in _cairo_gstate_show_text_glyphs (gstate=0x1f62fe30, utf8=0x0, utf8_len=0, glyphs=0xbfff9a04, num_glyphs=15, clusters=0x0, num_clusters=0, cluster_flags=0) at /builds/moz192/gfx/cairo/cairo/src/cairo-gstate.c:1699
#14 0x012f7359 in _moz_cairo_show_glyphs (cr=0x3c31c00, glyphs=0xbfff9a04, num_glyphs=15) at /builds/moz192/gfx/cairo/cairo/src/cairo.c:3237
#15 0x0128dfab in GlyphBuffer::Flush (this=0xbfff9a04, aCR=0x3c31c00, aDrawToPath=0, aReverse=0, aFinish=1) at /builds/moz192/gfx/thebes/src/gfxFont.cpp:761
#16 0x0128775a in gfxFont::Draw (this=0x1b193190, aTextRun=0x3fad600, aStart=421, aEnd=436, aContext=0x1f66a320, aDrawToPath=0, aPt=0xbfffb0b0, aSpacing=0x0) at /builds/moz192/gfx/thebes/src/gfxFont.cpp:899
#17 0x012821dc in gfxTextRun::DrawGlyphs (this=0x3fad600, aFont=0x1b193190, aContext=0x1f66a320, aDrawToPath=0, aPt=0xbfffb0b0, aStart=421, aEnd=436, aProvider=0xbfffb254, aSpacingStart=421, aSpacingEnd=436) at /builds/moz192/gfx/thebes/src/gfxFont.cpp:2063
#18 0x01286ff8 in gfxTextRun::Draw (this=0x3fad600, aContext=0x1f66a320, aPt=@0xbfffb150, aStart=416, aLength=20, aDirtyRect=0xbfffb304, aProvider=0xbfffb254, aAdvanceWidth=0xbfffb360) at /builds/moz192/gfx/thebes/src/gfxFont.cpp:2274
#19 0x004ac3e7 in nsTextFrame::DrawText (this=0x4706890, aCtx=0x1f66a320, aTextBaselinePt=@0xbfffb340, aOffset=416, aLength=20, aDirtyRect=0xbfffb304, aProvider=0xbfffb254, aAdvanceWidth=@0xbfffb360, aDrawSoftHyphen=0) at /builds/moz192/layout/generic/nsTextFrameThebes.cpp:4768
#20 0x004b5a7c in nsTextFrame::PaintText (this=0x4706890, aRenderingContext=0x1f6e2ca0, aPt=@0xbfffb3b4, aDirtyRect=@0xbfffb3a4) at /builds/moz192/layout/generic/nsTextFrameThebes.cpp:4755
#21 0x004b5b69 in nsDisplayText::Paint (this=0x3cbeabc, aBuilder=0xbfffb570, aCtx=0x1f6e2ca0) at /builds/moz192/layout/generic/nsTextFrameThebes.cpp:3856
#22 0x0039481e in nsDisplayList::Paint (this=0x387e42c, aBuilder=0xbfffb570, aCtx=0x1f6e2ca0) at /builds/moz192/layout/base/nsDisplayList.cpp:405
#23 0x00394860 in nsDisplayWrapList::Paint (this=0x387e410, aBuilder=0xbfffb570, aCtx=0x1f6e2ca0) at /builds/moz192/layout/base/nsDisplayList.cpp:1007
#24 0x003949d3 in nsDisplayClip::Paint (this=0x387e410, aBuilder=0xbfffb570, aCtx=0x1f6e2ca0) at /builds/moz192/layout/base/nsDisplayList.cpp:1203
#25 0x0039481e in nsDisplayList::Paint (this=0xbfffb528, aBuilder=0xbfffb570, aCtx=0x1f6e2ca0) at /builds/moz192/layout/base/nsDisplayList.cpp:405
#26 0x003c2050 in nsLayoutUtils::PaintFrame (aRenderingContext=0x1f6e2ca0, aFrame=0x4412228, aDirtyRegion=@0xbfffb848, aBackstop=4294967295, aFlags=0) at /builds/moz192/layout/base/nsLayoutUtils.cpp:1161
#27 0x003da404 in PresShell::Paint (this=0x1c23d410, aView=0x1b170690, aRenderingContext=0x1f6e2ca0, aDirtyRegion=@0xbfffb848) at /builds/moz192/layout/base/nsPresShell.cpp:5844
Comment 4 John Daggett (:jtd) 2010-08-02 00:52:46 PDT
When running this in 1.9.2 debug builds on 10.6 I get lots of malloc errors, after which the app fails in a variety of places - DrawCellWithScaling, the cycle collector, _cairo_quartz_surface_show_glyphs.  On 10.5 this is really hard to track down as it completely locks up my machine (!) and I have to do a hard reboot.
Comment 5 John Daggett (:jtd) 2010-08-02 01:43:50 PDT
When run with libgmalloc enabled on 10.6, exception occurs in sqlite code:

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_PROTECTION_FAILURE at address: 0x7af7e000
0x03206925 in sqlite3VXPrintf (pAccum=0xbfff332c, useExtended=1, fmt=0x32ba0ba "q'", ap=0xbfff32d0 "?\017\034E") at /builds/moz192/db/sqlite3/src/sqlite3.c:16946
16946	        for(i=n=0; (ch=escarg[i])!=0 && k!=0; i++, k--){
(gdb) where
#0  0x03206925 in sqlite3VXPrintf (pAccum=0xbfff332c, useExtended=1, fmt=0x32ba0ba "q'", ap=0xbfff32d0 "?\017\034E") at /builds/moz192/db/sqlite3/src/sqlite3.c:16946
#1  0x032075fe in sqlite3XPrintf (p=0xbfff332c, zFormat=0x32ba0b6 "'%.*q'") at /builds/moz192/db/sqlite3/src/sqlite3.c:17258
#2  0x03242364 in sqlite3VdbeExpandSql (p=0x47561f18, zRawSql=0x47571f98 " UNION ALL SELECT id, url, title, rev_host, visit_count FROM moz_places WHERE url = ?1 LIMIT 1") at /builds/moz192/db/sqlite3/src/sqlite3.c:51609
#3  0x03252bb6 in sqlite3VdbeExec (p=0x47561f18) at /builds/moz192/db/sqlite3/src/sqlite3.c:57883
#4  0x03240329 in sqlite3Step (p=0x47561f18) at /builds/moz192/db/sqlite3/src/sqlite3.c:50603
#5  0x03240626 in sqlite3_step (pStmt=0x47561f18) at /builds/moz192/db/sqlite3/src/sqlite3.c:50662
#6  0x0100047e in mozilla::storage::Statement::ExecuteStep (this=0x4755bfc0, _moreResults=0xbfff3b0c) at /builds/moz192/storage/src/mozStorageStatement.cpp:729
#7  0x01048cc3 in nsNavHistory::GetUrlIdFor (this=0x45153e20, aURI=0xcfa01f40, aEntryID=0xbfff3c20, aAutoCreate=1) at /builds/moz192/toolkit/components/places/src/nsNavHistory.cpp:1849
#8  0x0109673f in nsNavBookmarks::InsertBookmark (this=0x484f2f20, aFolder=12, aURI=0xcfa01f40, aIndex=-1, aTitle=@0x7af6dff0, aNewBookmarkId=0xbfff3ec4) at /builds/moz192/toolkit/components/places/src/nsNavBookmarks.cpp:1140
Comment 6 christian 2010-08-02 10:32:57 PDT
Does this affect 1.9.1 as well? I would imagine it does. If so, please nominate or set as unaffected.
Comment 7 Steven Michaud [:smichaud] (Retired) 2010-08-02 13:45:40 PDT
> Does this affect 1.9.1 as well?

I can't make today's 1.9.1-branch nightly crash on OS X 10.5.8.  But
it does crash (the same crash) on OS X 10.6.4.
Comment 8 Steven Michaud [:smichaud] (Retired) 2010-08-02 16:20:40 PDT
Created attachment 462241 [details]
Reduced testcase

Here's a reduced testcase.

There are a couple of things that are very odd about it (particularly
#1 in my list).  This makes me think we might be dealing with more
than one bug here.

1) The crashes only happen if the testcase has at least one button,
   and if its label has at least two characters.  They happen drawing
   the button.

2) Without the META tag, the crashes only happen on the 1.9.1 branch
   (testing with today's nightlies on OS X 10.6.4) -- not also on the
   1.9.2 branch.
Comment 9 Steven Michaud [:smichaud] (Retired) 2010-08-02 16:55:55 PDT
To crash on the 1.9.2 branch (on OS X 10.6.4) with my reduced testcase
from comment #8, you may need to download it and open it locally (or
download it to a local server and load it from there).
Comment 10 Steven Michaud [:smichaud] (Retired) 2010-08-02 17:34:41 PDT
Created attachment 462271 [details]
Reduced testcase rev1 (fix META tag)
Comment 11 John Daggett (:jtd) 2010-08-02 18:59:26 PDT
With Steven's second reduced testcase and a debug build of 1.9.2, I hang here:

#0  0x90495266 in mach_msg_trap ()
#1  0x9049cae2 in mach_msg ()
#2  0x94cf105a in SendFontManagementMessageWithMessageStatus ()
#3  0x94cfb60b in SendOFAStrikeMessage ()
#4  0x94cfdfd2 in _eOFAGetGlyphData ()
#5  0x94cfd63b in GenerateGlyphData ()
#6  0x94cfce33 in _eGetGlyphAddresses ()
#7  0x94d2a3e1 in _eATSGlyphGetScreenMetrics ()
#8  0x94d2a2e3 in ATSGlyphGetScreenMetrics ()
#9  0x93c1d5a0 in ATSUGlyphGetScreenMetrics ()
#10 0x04108c8d in gfxAtsuiFont::SetupGlyphExtents (this=0x198a2660, aContext=0x1988ea30, aGlyphID=38, aNeedTight=0, aExtents=0x198b3a30) at /builds/moz192/gfx/thebes/src/gfxAtsuiFonts.cpp:465
#11 0x040ea4fe in gfxTextRun::FetchGlyphExtents (this=0x1bc3e000, aRefContext=0x1988ea30) at /builds/moz192/gfx/thebes/src/gfxFont.cpp:2956
#12 0x0410be09 in gfxAtsuiFontGroup::MakeTextRun (this=0x194e6e90, aString=0x1bc07008, aLength=56072, aParams=0xbfff7db0, aFlags=17039617) at /builds/moz192/gfx/thebes/src/gfxAtsuiFonts.cpp:747
#13 0x04101945 in TextRunWordCache::MakeTextRun (this=0x18bbc010, aText=0x19c52008, aLength=56072, aFontGroup=0x194e6e90, aParams=0xbfff91d0, aFlags=17039616) at /builds/moz192/gfx/thebes/src/gfxTextRunWordCache.cpp:683
#14 0x04101a63 in gfxTextRunWordCache::MakeTextRun (aText=0x19c52008, aLength=56072, aFontGroup=0x194e6e90, aParams=0xbfff91d0, aFlags=17039616) at /builds/moz192/gfx/thebes/src/gfxTextRunWordCache.cpp:992
#15 0x0332bab4 in MakeTextRun (aText=0x19c52008, aLength=56072, aFontGroup=0x194e6e90, aParams=0xbfff91d0, aFlags=17039616) at /builds/moz192/layout/generic/nsTextFrameThebes.cpp:436
#16 0x0332ec74 in BuildTextRunsScanner::BuildTextRunForFrames (this=0xbfffa410, aTextBuffer=0x19c6d618) at /builds/moz192/layout/generic/nsTextFrameThebes.cpp:1783
#17 0x0332f0bb in BuildTextRunsScanner::FlushFrames (this=0xbfffa410, aFlushLineBreaks=0, aSuppressTrailingBreak=0) at /builds/moz192/layout/generic/nsTextFrameThebes.cpp:1214
#18 0x0332f2e1 in BuildTextRunsScanner::ScanFrame (this=0xbfffa410, aFrame=0x64cbc88) at /builds/moz192/layout/generic/nsTextFrameThebes.cpp:1365
#19 0x0332fc3d in BuildTextRuns (aContext=0x1988ea30, aForFrame=0x66cca98, aLineContainer=0x66cc988, aForFrameLine=0xbfffaf9c) at /builds/moz192/layout/generic/nsTextFrameThebes.cpp:1123

The ATSServer is pegged at 100% CPU which is the cause of the system hang.  Killing that process (it restarts automatically) eliminates the hang.
Comment 12 John Daggett (:jtd) 2010-08-02 19:19:53 PDT
Created attachment 462301 [details]
Reduced testcase rev2 (add text-rendering for Safari)

Add in 'text-rendering: optimizeLegibility;' and Safari crashes (ATSServer doesn't hang)
Comment 13 John Daggett (:jtd) 2010-08-02 19:21:08 PDT
Created attachment 462303 [details]
safari stack on 10.5.8

Safari stacktrace with Safari 5.0.1 running on 10.5.8, crashes within Webkit code.
Comment 14 John Daggett (:jtd) 2010-08-02 19:37:59 PDT
No crash with either Safari 5.0.1 or Webkit trunk on 10.6.4.
Comment 15 Joe Drew (not getting mail) 2010-08-03 13:34:27 PDT
John, feel free to reassign as necessary.
Comment 16 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2010-08-03 13:42:16 PDT
Jonathan, can you analyze the font, maybe minimize the testcase and figure out what it is about the font that's causing trouble here?
Comment 17 Steven Michaud [:smichaud] (Retired) 2010-08-03 13:54:26 PDT
I'm playing a hunch ... which if it works out might lead to a simple fix.

No hooking, I promise :-)
Comment 18 Steven Michaud [:smichaud] (Retired) 2010-08-03 15:03:11 PDT
Created attachment 462536 [details]
Stack trace of "faulty glyph" exceptions

(Following up comment #17)

> I'm playing a hunch ... which if it works out might lead to a simple
> fix.
>
> No hooking, I promise :-)

It didn't work out ... but I might yet be able to salvage something.

Remember (from comment #0) that the following error shows up (hundreds
of times in my tests) in the seconds before this bug's crash (on OS X
10.6.X):

Faulty glyph (id:38) outline detected -
  replacing with a space/null glyph - in memory font kind

Here's a stack trace of the code from which these errors get
displayed.  (They're displayed by a C++ exception handler in
libTrueTypeScaler.dylib (in the ATS framework, which is in turn inside
the ApplicationServices framework).  I broke on calls to __cxa_throw
(from libstdc++.dylib).)

Note that all these C++ exceptions occur during the call to
ATSUGetGlyphBounds() from gfxAtsuiFontGroup::InitTextRun().
ATSUGetGlyphBounds() has an error return, which we discard in current
code.  I'd hoped that the "faulty glyph" error message corresponded to
an uncaught C++ exception, and that it would result in
ATSUGetGlyphBounds() returning an error.  No such luck :-(

I'll look for another way to "expose" these exceptions to
gfxAtsuiFontGroup::InitTextRun() -- if only as a proof of concept (to
see if we can get around the crashes by somehow handling these
exceptions in gfxAtsuiFontGroup::InitTextRun()).
Comment 19 christian 2010-08-03 15:10:16 PDT
I'm in contact with Apple about this.
Comment 20 Steven Michaud [:smichaud] (Retired) 2010-08-03 15:54:21 PDT
When I run a current 1.9.2 branch build inside gdb (an opt build with
debug symbols) and set NSZombieEnabled to "Yes" ("set env
NSZombieEnabled Yes"), this bug's crashes no longer happen.  But I end
up crashing anyway, shortly after a memory allocation fails.

When I look at the crashed firefox-bin process with 'top', I find that
it's VSIZE (virtual memory) is 4 gigs (it's normally about 1 gig) and
the number of memory regions (MREGS) is enormously larger than it
should be (about 25000 as opposed to about 450).

Something is clearly eating huge amounts of RAM in many small bites.
We don't yet know whether it's the OS or the browser.
Comment 21 Steven Michaud [:smichaud] (Retired) 2010-08-03 15:59:45 PDT
(Following up comment #20)

> When I look at the crashed firefox-bin process with 'top', I find
> that it's VSIZE (virtual memory) is 4 gigs (it's normally about 1
> gig) and the number of memory regions (MREGS) is enormously larger
> than it should be (about 25000 as opposed to about 450).

This is with or without NSZombieEnabled.  The crashed process is
running in gdb.

This devouring of memory doesn't happen on the trunk.  So it's pretty
clear the problem is ATSUI -- either in itself or in our use of it.
Comment 22 Steven Michaud [:smichaud] (Retired) 2010-08-03 16:08:59 PDT
(In reply to comment #1)

> I'm guessing this is a memory-stress bug that exposes bugs in
> various components.

So it seems to be.
Comment 23 Steven Michaud [:smichaud] (Retired) 2010-08-03 17:25:13 PDT
(Following up comment #20)

> Something is clearly eating huge amounts of RAM in many small bites.
> We don't yet know whether it's the OS or the browser.

malloc_history can't find this RAM.  I'm pretty sure this means that
it's been allocated by the OS (using something else than malloc) --
which tends to place the blame for this problem on the OS (and
specifically on ATSUI).

Here's what I did:

1) gdb /path/to/firefox-bin

2) set env MallocStackLogging 1

3) run

4) Load a local copy of my reduced testcase from comment #10.

5) Wait for the browser to crash.

6) Find firefox-bin's pid (from outside of gdb).

7) Run the following (also outside of gdb):

   a) malloc_history [firefox-bin's pid] -all_by_size

   b) malloc_history [firefox-bin's pid] -all_by_count

The top allocation by size (56016 calls for 1549932 bytes) wasn't
nearly big enough.  Neither were the top five allocations by count
(69485 calls for 277940 bytes, 56016 calls for 1549932 bytes, 53328
calls for 853248 bytes, 43436 calls for 161080 bytes).  (Note that I
assume the byte figures are totals -- that "56016 calls for 1549932
bytes" means "56016 calls for a total of 1549932 bytes".  Someone
please tell me if I'm wrong.)  The "56016 calls for 1549932 bytes" was
from gfxTextRun::AllocateDetailedGlyphs().

I got pretty much the same numbers even when I skipped steps 4 and 5
-- though gfxTextRun::AllocateDetailedGlyphs() wasn't in the list.

I also got pretty much the same numbers on the trunk without skipping
steps 4 and 5 -- and gfxTextRun::AllocateDetailedGlyphs() was top of
the list.

My understanding is that, when using MallocStackLogging,
malloc_history doesn't show statistics for allocations that were made
and then freed.
Comment 24 Jonathan Kew (:jfkthame) 2010-08-04 02:55:36 PDT
(In reply to comment #23)
> (Following up comment #20)
> 
> > Something is clearly eating huge amounts of RAM in many small bites.
> > We don't yet know whether it's the OS or the browser.
> 
> malloc_history can't find this RAM.  I'm pretty sure this means that
> it's been allocated by the OS (using something else than malloc) --
> which tends to place the blame for this problem on the OS (and
> specifically on ATSUI).

ATS is encountering failures due to corrupt glyph data; it seems plausible that each time it tries to rasterize such a glyph it allocates memory for the glyph bitmap, and then fails to clean up properly when it aborts the rasterization. That would rapidly lead to out-of-memory failure and bring down the whole app, just like we're seeing here.
Comment 25 Steven Michaud [:smichaud] (Retired) 2010-08-04 10:37:02 PDT
(Following up comment #23)

>> Something is clearly eating huge amounts of RAM in many small
>> bites.  We don't yet know whether it's the OS or the browser.
>
> malloc_history can't find this RAM.  I'm pretty sure this means that
> it's been allocated by the OS (using something else than malloc) --
> which tends to place the blame for this problem on the OS (and
> specifically on ATSUI).

libTrueTypeScaler.dylib doesn't import any strange, low-level methods
for allocating RAM (only calloc, malloc and realloc).  So I think the
debugging facilities in Apple's malloc implementation (those activated
by setting MallocStackLogging to 1) must not record calls to malloc
from system libraries.
Comment 26 Steven Michaud [:smichaud] (Retired) 2010-08-04 11:00:50 PDT
(Following up comment #18)

>> I'm playing a hunch ... which if it works out might lead to a
>> simple fix.
>>
>> No hooking, I promise :-)
>
> It didn't work out ... but I might yet be able to salvage something.

> I'll look for another way to "expose" these exceptions to
> gfxAtsuiFontGroup::InitTextRun() -- if only as a proof of concept
> (to see if we can get around the crashes by somehow handling these
> exceptions in gfxAtsuiFontGroup::InitTextRun()).

I've given up on this, at least for the time being.  Whatever I come
up with is likely to be very hackish, and not narrowly focused.

We should give some thought, though, to whether or not it's possible
to get rid of the call to ATSUGetGlyphBounds() in
gfxAtsuiFontGroup::InitTextRun().  I don't know enough about the font
code to answer this question myself.
Comment 27 John Daggett (:jtd) 2010-08-12 00:31:28 PDT
Created attachment 465147 [details] [diff] [review]
patch, check alloc fails within cocoa native theme code

With this patch I don't get the SIGILL crashes in DrawCellWithScaling within Cocoa native theme code.  Eventually mem alloc's fail and the browser aborts.  I did see a couple crashes in the cycle collector after mem alloc failures occurred but I can't reproduce those now.

Josh, feel free to reassign if you're not the right person for this.
Comment 28 Josh Aas 2010-08-12 10:04:37 PDT
Comment on attachment 465147 [details] [diff] [review]
patch, check alloc fails within cocoa native theme code

>+    if (!bitmap) {
>+      NS_WARNING("CGBitmapContextCreateImage failed");
>+      CGContextRelease(bitmapctx);
>+      CGColorSpaceRelease(colorSpace);
>+      CGContextSetCTM(aCGContext, savedCTM);
>+      return;
>+    }

In the last block of the patch here you're releasing "colorSpace" for a second time. It was already released after the call to "CGBitmapContextCreate".
Comment 29 John Daggett (:jtd) 2010-08-12 20:27:24 PDT
Created attachment 465538 [details] [diff] [review]
patch, v2, check alloc fails within cocoa native theme code

Fixed the double delete of colorSpace.

Also, confirmed that this doesn't fix the problem on 10.5, on 10.5 the ATS server goes into an infinite loop and effectively hangs the system.
Comment 31 John Daggett (:jtd) 2010-08-13 00:39:13 PDT
Created attachment 465585 [details] [diff] [review]
patch, v2a, port alloc checks to trunk
Comment 32 Joe Drew (not getting mail) 2010-08-13 08:28:51 PDT
Comment on attachment 465585 [details] [diff] [review]
patch, v2a, port alloc checks to trunk

approved SO HARD.
Comment 33 Jesse Ruderman 2010-08-13 08:30:47 PDT
> Also, confirmed that this doesn't fix the problem on 10.5, on 10.5 the ATS
> server goes into an infinite loop and effectively hangs the system.

This was split out into bug 586895.
Comment 34 Al Billings [:abillings] 2010-08-19 23:28:54 PDT
I'm not sure that this is fully fixed. I'm crashing when running through the manual tests.

Running through the attached testcases in a nightly build, I was able to get crashes, though not every time. This was using Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.9pre) Gecko/20100819 Namoroka/3.6.9pre on 10.6.4.

With the reduced testcase, rev 1, I got the follow crash which our crashreporter didn't catch (the data is from Apple's crash reporter): http://mozilla.pastebin.com/P3TuxeWP

I got this crash with the original attached testcase after I loaded it a few times: http://crash-stats.mozilla.com/report/index/bp-2145062c-49e3-45ae-a76b-4691e2100819

If I load each of the testcases a few times individually in series with the others without starting a new browser session, I will get a hang followed by either a crashreporter caught crash or an Apple caught crash.
Comment 35 Al Billings [:abillings] 2010-08-20 14:00:04 PDT
This looks like it is having the same problems in 1.9.1. Running the first testcase, I'm getting a system level crash in OS X 10.6.4 with the nightly 1.9.1 build. 

With the rev 1 of the reduced case, I get this crash: http://crash-stats.mozilla.com/report/index/deec2acc-7eb6-4215-a381-6b5f72100820

With the rev 2 testcase, I got the following crash caught by Apple, not us: http://mozilla.pastebin.com/80cqsgpi
Comment 36 John Daggett (:jtd) 2010-08-21 07:25:02 PDT
(In reply to comment #34)
> I'm not sure that this is fully fixed. I'm crashing when running through the
> manual tests.

Yup, the testcase stresses memory so all sorts of out-of-memory related bugs will happen.  But the original crash was a SIGILL crash which is a bad thing in security terms, it means that random data was executed as code.  So *that* crash is fixed but other crashes will still occur.

> Running through the attached testcases in a nightly build, I was able to get
> crashes, though not every time. This was using Mozilla/5.0 (Macintosh; U; Intel
> Mac OS X 10.6; en-US; rv:1.9.2.9pre) Gecko/20100819 Namoroka/3.6.9pre on
> 10.6.4.
> 
> With the reduced testcase, rev 1, I got the follow crash which our
> crashreporter didn't catch (the data is from Apple's crash reporter):
> http://mozilla.pastebin.com/P3TuxeWP

Out-of-memory abort, not a crash.

> I got this crash with the original attached testcase after I loaded it a few
> times:
> http://crash-stats.mozilla.com/report/index/bp-2145062c-49e3-45ae-a76b-4691e2100819

Yeah, this looks a lot like the crash in comment 3.

So I think this should stay "fixed", since the original sg:critical crash was fixed.  We should log other crashes as separate bugs.
Comment 37 Al Billings [:abillings] 2010-09-02 18:27:11 PDT
Marking this one as verified for 1.9.2 and 1.9.1.

Note You need to log in before you can comment on or make changes to this bug.