Closed Bug 732330 Opened 8 years ago Closed 8 years ago

crash while doing lots of system font fallback on 10.6/10.7

Categories

(Core :: General, defect, P1, critical)

13 Branch
x86
macOS
defect

Tracking

()

RESOLVED FIXED
mozilla13

People

(Reporter: jtd, Assigned: jfkthame)

References

()

Details

(Keywords: crash, regression)

Crash Data

Attachments

(1 file)

Running with latest nightly, I'm seeing crashes in various places when running a font fallback test.

Steps:

1. Download latest nightly
2. Click on URL

Result: test iterates over all Unicode 6.1 codepoints by script and attempts to measure single-character text spans containing each character.  The test iterates over the full set of characters, displays the time, then starts over again.  In a debug build this crashes before reaching 20%, in nightly builds seems to take a few minutes:

https://crash-stats.mozilla.com/report/index/bp-f9dd8912-c80c-4e75-9d0b-c02112120302

Looks like it's crashing in cycle collection but the underlying cause might be something else causing corruption.

Assertion failure: (ptrBits & 0x7) == 0, at ../../../dist/include/jsval.h:766

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0x0000000000000000
0x000000010003935e in MOZ_Crash () at Assertions.cpp:76
76	  *((volatile int *) NULL) = 123;  /* To continue from here in GDB: "return" then "continue". */
(gdb) where 40
#0  0x000000010003935e in MOZ_Crash () at Assertions.cpp:76
#1  0x00000001000393ee in MOZ_Assert (s=0x104563838 "(ptrBits & 0x7) == 0", file=0x1045b9b40 "../../../dist/include/jsval.h", ln=766) at Assertions.cpp:88
#2  0x000000010237a3e6 in JSVAL_TO_OBJECT_IMPL (l={asBits = 18445477440609320963, debugView = {payload47 = 4294967299, tag = JSVAL_TAG_OBJECT}, s = {payload = {i32 = 3, u32 = 3, why = JS_NO_ITER_VALUE}}, asDouble = -nan(0xb800100000003), asPtr = 0xfffb800100000003, asWord = 18445477440609320963}) at jsval.h:766
#3  0x000000010237c778 in JS::Value::toObject (this=0x10916a500) at jsapi.h:544
#4  0x0000000103ee596d in js::GCMarker::processMarkStackTop (this=0x10a342228, budget=@0x7fff5fbf81b8) at jsgcmark.cpp:1090
#5  0x0000000103ee25a2 in js::GCMarker::drainMarkStack (this=0x10a342228, budget=@0x7fff5fbf81b8) at jsgcmark.cpp:1172
#6  0x0000000103eb2a41 in MarkGrayAndWeak (rt=0x10a342000) at jsgc.cpp:3030
#7  0x0000000103eb664f in EndMarkPhase (cx=0x1091b8400) at jsgc.cpp:3048
#8  0x0000000103eb9248 in IncrementalGCSlice (cx=0x1091b8400, budget=10000, gckind=js::GC_NORMAL) at jsgc.cpp:3531
#9  0x0000000103eb96ae in GCCycle (cx=0x1091b8400, comp=0x0, budget=10000, gckind=js::GC_NORMAL) at jsgc.cpp:3644
#10 0x0000000103ebc81e in Collect () at jsgc.cpp:3707
#11 0x0000000103ebcbd6 in js::GCSlice (cx=0x1091b8400, comp=0x0, gckind=js::GC_NORMAL, reason=js::gcreason::REFRESH_FRAME) at jsgc.cpp:3733
#12 0x0000000103e97208 in js::NotifyDidPaint (cx=0x1091b8400) at jsfriendapi.cpp:726
#13 0x0000000102a9b177 in nsXPConnect::NotifyDidPaint (this=0x1003671f0) at nsXPConnect.cpp:2836
#14 0x000000010179cea4 in nsAutoNotifyDidPaint::~nsAutoNotifyDidPaint (this=0x7fff5fbf8908) at nsPresShell.cpp:5446
#15 0x0000000101787b6c in PresShell::Paint (this=0x10da0b800, aViewToPaint=0x10d9be660, aWidgetToPaint=0x10d92d640, aDirtyRegion=@0x7fff5fbf8a78, aIntDirtyRegion=@0x7fff5fbf9230, aWillSendDidPaint=false) at nsPresShell.cpp:5542
#16 0x000000010223a1b9 in nsViewManager::Refresh (this=0x10d9fe880, aView=0x10d9be660, aWidget=0x10d92d640, aRegion=@0x7fff5fbf9230, aWillSendDidPaint=false) at nsViewManager.cpp:377
#17 0x000000010223acbb in nsViewManager::DispatchEvent (this=0x10d9fe880, aEvent=0x7fff5fbf91e0, aView=0x10d9be660, aStatus=0x7fff5fbf8e14) at nsViewManager.cpp:813
#18 0x0000000102234129 in HandleEvent (aEvent=0x7fff5fbf91e0) at nsView.cpp:158
#19 0x00000001032364e4 in nsChildView::DispatchEvent (this=0x10d92d640, event=0x7fff5fbf91e0, aStatus=@0x7fff5fbf8fa4) at nsChildView.mm:1512
#20 0x000000010322ef2c in nsChildView::DispatchWindowEvent (this=0x10d92d640, event=@0x7fff5fbf91e0) at nsChildView.mm:1522
#21 0x000000010322ae60 in -[ChildView drawRect:inContext:] (self=0x10da0f9d0, _cmd=0x10460cf14, aContext=0x114565c40, aRect={origin = {x = 0, y = 136}, size = {width = 1309, height = 209}}) at nsChildView.mm:2588
#22 0x000000010322bd26 in -[ChildView drawRect:] (self=0x10da0f9d0, _cmd=0x7fff87d6f8b8, aRect={origin = {x = 0, y = 136}, size = {width = 1309, height = 209}}) at nsChildView.mm:2487
#23 0x00007fff8750dbab in -[NSView _drawRect:clip:] ()
#24 0x00007fff8750b85d in -[NSView _recursiveDisplayRectIfNeededIgnoringOpacity:isVisibleRect:rectIsVisibleRectForView:topView:] ()
#25 0x00007fff8750c34e in -[NSView _recursiveDisplayRectIfNeededIgnoringOpacity:isVisibleRect:rectIsVisibleRectForView:topView:] ()
#26 0x00007fff8750c34e in -[NSView _recursiveDisplayRectIfNeededIgnoringOpacity:isVisibleRect:rectIsVisibleRectForView:topView:] ()
#27 0x00007fff8750a593 in -[NSThemeFrame _recursiveDisplayRectIfNeededIgnoringOpacity:isVisibleRect:rectIsVisibleRectForView:topView:] ()
#28 0x00007fff875059af in -[NSView _displayRectIgnoringOpacity:isVisibleRect:rectIsVisibleRectForView:] ()
#29 0x00007fff874fe429 in -[NSView displayIfNeeded] ()

Note: this does *not* occur on 10.5, only on 10.6/10.7
Building with individual changesets within the regression range, this appears to be the cause:

Jonathan Kew — bug 721821 - make nsBidiPresUtils::WriteReverse aware of clusters, so that diacritics in RTL text display correctly in SVG text. r=smontagu

http://hg.mozilla.org/mozilla-central/rev/7658ef219e8b
Blocks: 721821
Summary: crash while doing lots of system font fallback on 10.7.3 → crash while doing lots of system font fallback on 10.6/10.7
Crash Signature: [@ libmozglue.dylib@0x5dd8]
Keywords: crash, regression
Version: Trunk → 13 Branch
Running the URL test on a different 10.6 machine runs fine but crashes when quitting:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:13.0) Gecko/20120302 Firefox/13.0a1

https://crash-stats.mozilla.com/report/index/bp-945898ef-c3c6-42b2-95a1-1f8192120302
I'm optimistic, based on my local testing so far, that the patch in bug 732443 will resolve this.
Depends on: 732443
(In reply to Jonathan Kew (:jfkthame) from comment #4)
> I'm optimistic, based on my local testing so far, that the patch in bug
> 732443 will resolve this.

The patch for bug 732443 does appear to resolve the crashing problem.  But I'm somewhat mystified as to why exactly.  That only changes the code in ClusterIterator::Next() which is only called within nsBidiPresUtils::WriteReverse.  I set up a simple assert to bounds check the writes in that routine and it doesn't fire without the patch.

I think we should figure out why the fix for bug 721821 caused crashes so that we're sure we're not just papering over a problem that needs to be fixed.
(In reply to John Daggett (:jtd) from comment #5)
> The patch for bug 732443 does appear to resolve the crashing problem.  But
> I'm somewhat mystified as to why exactly.  That only changes the code in
> ClusterIterator::Next() which is only called within
> nsBidiPresUtils::WriteReverse.

That's not the only caller: it's also used by gfxShapedWord::SetupClusterBoundaries(), which is where the cluster-recognition algorithm used to live; it was factored out into the ClusterIterator utility so that it could be shared with WriteReverse.

It looks to me like it the error in ClusterIterator::Next() meant that in the case where the last character of the string (shapedWord being created, or missing-glyph textRun being filled in) is a surrogate pair, it would advance the iterator one code unit too far, beyond the correct "limit" position. And this means that SetupClusterBoundaries() would set the "cluster extender" flag on an out-of-bounds glyph record. That could easily explain unpredictable corruption and random crashiness, and your test brings it up because it handles lots of non-BMP characters (in isolation, so that they're at the end of the relevant shapedWord or textRun).
This adds an assertion at the end of ClusterIterator::Next(), to verify that it's not overshooting the end of the string. Without the patch in bug 732443, this fires a number of times during your testcase, confirming that the iterator is at fault and can result in SetupClusterBoundaries writing to an invalid glyph record.

If this assertion had been present before bug 732443 was fixed, it would no doubt have been triggered by various of Jesse's crash/fuzz-tests, too.

The patch also eliminates the mText member of ClusterIterator from non-debug builds, as it is not actually needed except to sanity-check the position. (We'd need it if we ever want ClusterIterator to have additional methods such as Prev() or Restart(), but currently there's no use case for these.)
Assignee: nobody → jfkthame
Attachment #602588 - Flags: review?(jdaggett)
Duplicate of this bug: 732696
(In reply to Jonathan Kew (:jfkthame) from comment #6)
> That's not the only caller: it's also used by
> gfxShapedWord::SetupClusterBoundaries(), which is where the
> cluster-recognition algorithm used to live; it was factored out into the
> ClusterIterator utility so that it could be shared with WriteReverse.

You're right.  We seem to now have three classes called "ClusterIterator", the one you wrote in nsUnicodeProperties.cpp, along with gfxTextRun::ClusterIterator and another in nsTextFrameThebes.  Right now, gfxTextRun::ClusterIterator seems like dead code and the nsTextFrameThebes one looks like it's a segment iterator, although I haven't looked over it closely.
Comment on attachment 602588 [details] [diff] [review]
patch, add an assertion that would have caught the ClusterIterator surrogate-handling error

Looks good.
Attachment #602588 - Flags: review?(jdaggett) → review+
(In reply to John Daggett (:jtd) from comment #9)

> You're right.  We seem to now have three classes called "ClusterIterator",
> the one you wrote in nsUnicodeProperties.cpp, along with
> gfxTextRun::ClusterIterator and another in nsTextFrameThebes.  Right now,
> gfxTextRun::ClusterIterator seems like dead code 

As it happens, I've got a patch in my WIP queue for bug 715473 that removes gfxTextRun::ClusterIterator, as nobody is using it.
The actual crash here was fixed by the patch (bfa4df617fca) in bug 732443. Pushed extra assertion to guard against such an error in any future reworking of this code.
https://hg.mozilla.org/integration/mozilla-inbound/rev/e0f3f797e0fd
Target Milestone: --- → mozilla13
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.