Closed Bug 1471892 Opened Last year Closed Last year

Crash in CoreGraphics@0x26fc8 stack overflow in crossing_count

Categories

(Core :: Graphics, defect, critical)

61 Branch
x86_64
macOS
defect
Not set
critical

Tracking

()

VERIFIED FIXED
mozilla63
Tracking Status
firefox-esr52 --- unaffected
firefox-esr60 --- unaffected
firefox61 + verified
firefox62 + verified
firefox63 + verified

People

(Reporter: philipp, Assigned: rhunt)

References

(Blocks 1 open bug)

Details

(Keywords: crash, regression)

Crash Data

Attachments

(3 files)

This bug was filed from the Socorro interface and is
report bp-c051d71b-199d-4129-9cbb-630120180627.
=============================================================

Top 10 frames of crashing thread:

0 CoreGraphics CoreGraphics@0x26fc8 
1 CoreGraphics CoreGraphics@0x272fe 
2 CoreGraphics CoreGraphics@0x27350 
3 CoreGraphics CoreGraphics@0x27350 
4 CoreGraphics CoreGraphics@0x27350 
5 CoreGraphics CoreGraphics@0x272fe 
6 CoreGraphics CoreGraphics@0x272fe 
7 CoreGraphics CoreGraphics@0x272fe 
8 CoreGraphics CoreGraphics@0x27350 
9 CoreGraphics CoreGraphics@0x272fe 

=============================================================

these content crashes on osx 10.9 are spiking up in volume in firefox 61. right now, they account for around a third of all collected tab crashes in 61 from macos users.

some user comments indicate the crashes are repeatedly occurring on particular sites, including adobe stock so perhaps this would be a good first way to try to reproduce and troubleshoot the issue.
one crashing url mentioned in a comment was https://stock.adobe.com/search?load_type=search&native_visual_search=&similar_content_id=&is_recent_search=&k=Garage+and+towing
Stephen, are you able to take a look at this?
Flags: needinfo?(spohl.mozilla.bugs)
Main thread here is usually in layout doing something with text - 

nsTextFrame::EnsureTextRun(nsTextFrame::TextRunType, mozilla::gfx::DrawTarget*, nsIFrame*, nsLineList_iterator const*, unsigned int*)

The crashing thread is background, and is doing an allocation related to CoreText and CoreGraphics. Not sure what the deal is with that background thread. I think this bug probably belongs over layout.

Note we've seen this crash in older releases as well.
Jonathan -- Can you take a look at this and assess where the problem is?  Please needinfo me back when you know more.  (I'm looking to find the best owner.). Thanks!
Flags: needinfo?(jkew)
The reports seem to be solely on 10.9, so it may well be a macOS bug that has been fixed in later releases, but something we're doing tickles it on that version.

It looks to me like this belongs in GFX, it's very much a Graphics stack. The fact that we're crashing on a non-main thread suggests perhaps something like OMTP? (Or could this be an image-decoding thread? Do they use CoreGraphics to get stuff done?)

Hmm, OMTP on macOS got enabled for Fx59 in bug 1422392, but these crashes don't appear to start until Fx60. Ryan, were there any significant OMTP-related changes between 59 and 60?
Flags: needinfo?(rhunt)
Flags: needinfo?(mreavy)
Flags: needinfo?(jkew)
Removing my n-i for now as this seems to be in the right hands.
Flags: needinfo?(spohl.mozilla.bugs)
Indeed there was a major change, we enabled parallel-OMTP by default on OSX in 60.

This worker thread is immediately after two paint worker threads but before a image worker thread. We usually queue up 3 threads on a quad core machine (which this crash report seems to be), so I believe this is a crash on a paint thread.

One of the other paint threads is rasterizing glyphs in CoreGraphics as well, so my best guess is this is a race condition in CoreGraphics. I'm not sure what we can do about it yet.
Assignee: nobody → rhunt
Flags: needinfo?(rhunt)
Flags: needinfo?(mreavy)
(In reply to Ryan Hunt [:rhunt] from comment #6)
> Indeed there was a major change, we enabled parallel-OMTP by default on OSX
> in 60.
> 
> [...]
> 
> I'm not sure what we can do about it yet.

As a stop-gap measure, could parallel-OMTP be disabled on 10.9 for now?
I'd prefer not to, but if it's the only way then we could. The pref that would need to be changed for 10.9 would be `layers.omtp.paint-workers` to `1`.

Lee, do you have any idea what's going on here or what we could do about it? Would this affect our webrender blob rasterization?
Flags: needinfo?(lsalzman)
We could pretty easily flip that pref for only 10.9 users via Normandy and see if it does indeed make the crashes go away. The advantage to that is that the pref flip is tied to the duration of the experiment, so users will auto-revert once we turn it off. I'll go ahead and file that request per IRC discussion with rhunt.
Component: General → Graphics
Depends on: 1472308
The disabling via Normandy should now be live for OSX 10.9 users. If that's indeed the culprit, we should see this crash volume go down as it rolls out to users.
Adding 62 as affected since the signature I added are all 10.9 users crashing in CoreGraphics.
Crash Signature: [@ CoreGraphics@0x26fc8] [@ CoreGraphics@0x26794] [@ CoreGraphics@0x26fcc] → [@ CoreGraphics@0x26fc8] [@ CoreGraphics@0x26794] [@ CoreGraphics@0x26fcc] [@ CoreGraphics@0x27117]
changing the layers.omtp.paint-workers pref via bug 1472308 hasn't made any dent in the crash rates here so far.

iulia, if you have a system with OSX 10.9 available for testing, could you see if the crash with the url in comment #0 is reproducible there and perhaps find out a regression window?
Flags: needinfo?(iulia.cristescu)
It would be good to confirm the new crash reports do indeed have P-OMTP disabled. We add the amount of paint workers into the crash notes. So for P-OMTP enabled generally the crash notes will contain ‘OMTP+3’. If it’s been disabled then we should see ‘OMTP+1’.

If we are getting crash reports with it disabled, then I’m curious what other threads are doing when the paint thread crashes.
yes, there are plenty of reports with p-omtp disabled since we rolled out the pref flip via normandy:
https://crash-stats.mozilla.com/signature/?app_notes=~OMTP%2B1&release_channel=release&signature=CoreGraphics%400x26fc8#reports
(In reply to [:philipp] from comment #12)
> changing the layers.omtp.paint-workers pref via bug 1472308 hasn't made any
> dent in the crash rates here so far.
> 
> iulia, if you have a system with OSX 10.9 available for testing, could you
> see if the crash with the url in comment #0 is reproducible there and
> perhaps find out a regression window?

I also see a lot of crashes on rei.com - here are a few more in case you have trouble reproducing:

*https://www.rei.com/product/111539/kuhl-renegade-shorts-mens-12-inseam 
*https://www.mirror.co.uk/news/ 
*https://www.football.london/arsenal-fc/news/granit-xhaka-latest-arsenal-news-14862158 
*https://www.epicgames.com/fortnite/en-US/success 
*https://www.beaconlighting.com.au/beacon-lighting-catalogue.html
(In reply to [:philipp] from comment #14)
> yes, there are plenty of reports with p-omtp disabled since we rolled out
> the pref flip via normandy:
> https://crash-stats.mozilla.com/signature/
> ?app_notes=~OMTP%2B1&release_channel=release&signature=CoreGraphics%400x26fc8
> #reports

That’s disappointing. At least from the one crash report I looked at, it wasn’t seeing any obvious culprit.
(In reply to Ryan Hunt [:rhunt] (PTO July 4th - July 6th) from comment #16)
> (In reply to [:philipp] from comment #14)
> > yes, there are plenty of reports with p-omtp disabled since we rolled out
> > the pref flip via normandy:
> > https://crash-stats.mozilla.com/signature/
> > ?app_notes=~OMTP%2B1&release_channel=release&signature=CoreGraphics%400x26fc8
> > #reports
> 
> That’s disappointing. At least from the one crash report I looked at, it
> wasn’t seeing any obvious culprit.

Given that this only seems to affect 10.9 and no later, it seems like it might be better to just disable OMTP entirely on the affected systems and call it good?

It looks like we do have some bits of widget theme code gated on Yosemite-or-later, but nothing that jumps out as looking like it would cause this mess...
Flags: needinfo?(lsalzman)
(In reply to [:philipp] from comment #12)
> changing the layers.omtp.paint-workers pref via bug 1472308 hasn't made any
> dent in the crash rates here so far.
> 
> iulia, if you have a system with OSX 10.9 available for testing, could you
> see if the crash with the url in comment #0 is reproducible there and
> perhaps find out a regression window?

Hello, Philipp! I tried to reproduce the crash on Mac OS X 10.9.5, using the info provided in comment 0 and comment 15. I only managed to reproduce a @ CoreGraphics@0x27117 crash (https://crash-stats.mozilla.com/report/index/e1fdd97b-dfa9-4686-bdb2-ad1970180709) using these specs. Also I manually checked for a regression and it seems that the problem started to be triggered since 59.0a1 (2018-01-11). Interesting fact to be mentioned: the "layers.omtp.paint-workers" pref doesn't exist until the above mentioned build. But no difference is made if changing this pref from "-1" to "1" on the affected builds.
Flags: needinfo?(iulia.cristescu)
(In reply to Iulia Cristescu, QA [:JuliaC] from comment #18)
> (In reply to [:philipp] from comment #12)
> > changing the layers.omtp.paint-workers pref via bug 1472308 hasn't made any
> > dent in the crash rates here so far.
> > 
> > iulia, if you have a system with OSX 10.9 available for testing, could you
> > see if the crash with the url in comment #0 is reproducible there and
> > perhaps find out a regression window?
> 
> Hello, Philipp! I tried to reproduce the crash on Mac OS X 10.9.5, using the
> info provided in comment 0 and comment 15. I only managed to reproduce a @
> CoreGraphics@0x27117 crash
> (https://crash-stats.mozilla.com/report/index/e1fdd97b-dfa9-4686-bdb2-
> ad1970180709) using these specs. Also I manually checked for a regression
> and it seems that the problem started to be triggered since 59.0a1
> (2018-01-11). Interesting fact to be mentioned: the
> "layers.omtp.paint-workers" pref doesn't exist until the above mentioned
> build. But no difference is made if changing this pref from "-1" to "1" on
> the affected builds.

Iulia, can you confirm if changing "layers.omtp.enabled" to false in the affected builds resolves this issue? A restart is required for the pref change to take affect.
Flags: needinfo?(iulia.cristescu)
(In reply to Ryan Hunt [:rhunt] from comment #19)
> Iulia, can you confirm if changing "layers.omtp.enabled" to false in the
> affected builds resolves this issue? A restart is required for the pref
> change to take affect.

Yes, I can confirm that changing the "layers.omtp.enabled" pref to false prevent the crash to be triggered. The investigated builds were 63.0a1 (2018-07-09), 62.0b6 build1 (20180705151535) and 61.0.1 build1 (20180704003137).
Flags: needinfo?(iulia.cristescu)
It seems to me then that disabling OMTP entirely on the affected systems is the best way of reducing this crash right now.

According to Iulia, this crash is reproducible from the beginning of OMTP on OSX with 59.
I've reopened bug 1472308 for doing that. Should we also do this for Beta62?
Flags: needinfo?(rhunt)
(In reply to Ryan VanderMeulen [:RyanVM] from comment #22)
> I've reopened bug 1472308 for doing that. Should we also do this for Beta62?

Good question. How many users do we have on OSX 10.9? Is it worth trying to get a workaround fix developed for this?

If not, then yes we should blacklist this permanently.
Flags: needinfo?(rhunt)
I have a hard time believing it's worth spending any real time coming up with a workaround for 10.9 users at this point. From what I can tell, it's around 10% of our total OSX user base, which is in turn only ~7% of our total desktop ADIs. So under 1% of our total users, give or take.

How easily could we write a patch for disabling this on 10.9 so we don't have to maintain it indefinitely in Normandy?
(In reply to Ryan VanderMeulen [:RyanVM] from comment #24)
> I have a hard time believing it's worth spending any real time coming up
> with a workaround for 10.9 users at this point. From what I can tell, it's
> around 10% of our total OSX user base, which is in turn only ~7% of our
> total desktop ADIs. So under 1% of our total users, give or take.
> 
> How easily could we write a patch for disabling this on 10.9 so we don't
> have to maintain it indefinitely in Normandy?

Pretty easy. I'm writing one now.
Comment on attachment 8991148 [details]
Bug 1471892 - Block OMTP on OSX 10.9 to work around CoreGraphics crash.

https://reviewboard.mozilla.org/r/256116/#review262998

Please just use !OnYosemiteOrLater() instead of adding OnMavericksExactly().
Attachment #8991148 - Flags: review-
Do we not have symbols to see where the crash is actually happening? I'm not in a huge rush to have a different code path for 10.9...
[Tracking Requested - why for this release]:
Some insights from talking with jrmuizel about this.

This appears to be a stack overflow when drawing a glyph. We use a thread pool for 'paint workers' with OMTP and tiling, which means the worker threads get smaller stack sizes than the main thread. This may be the difference causing us to crash now.

If this is the case, one workaround would be to disable P-OMTP on 10.9 and modify the code in that case to use the paint thread instead of a single paint worker and make sure that has enough stack space.
Comment on attachment 8991148 [details]
Bug 1471892 - Block OMTP on OSX 10.9 to work around CoreGraphics crash.

https://reviewboard.mozilla.org/r/256116/#review263220

I'm ok with taking this for now. We can reproduce the issue and have some idea what's going on. We can track it down further without people needing to experience crashes.
Attachment #8991148 - Flags: review?(jmuizelaar) → review+
Open to taking this in beta 62, though the crash volume doesn't seem all that high perhaps it would be on release.
(In reply to Ryan Hunt [:rhunt] from comment #30)
> Some insights from talking with jrmuizel about this.
> 
> This appears to be a stack overflow when drawing a glyph. We use a thread
> pool for 'paint workers' with OMTP and tiling, which means the worker
> threads get smaller stack sizes than the main thread. This may be the
> difference causing us to crash now.
> 
> If this is the case, one workaround would be to disable P-OMTP on 10.9 and
> modify the code in that case to use the paint thread instead of a single
> paint worker and make sure that has enough stack space.

The stack in a report like https://crash-stats.mozilla.com/report/index/1ab89eca-d9f1-4b0b-bf23-4b6930180713 looks to me like CoreGraphics may be wandering into infinite-recursion territory; if so, a thread with a larger stack might not actually help.
(In reply to Jonathan Kew (:jfkthame) from comment #33)
> (In reply to Ryan Hunt [:rhunt] from comment #30)
> > Some insights from talking with jrmuizel about this.
> > 
> > This appears to be a stack overflow when drawing a glyph. We use a thread
> > pool for 'paint workers' with OMTP and tiling, which means the worker
> > threads get smaller stack sizes than the main thread. This may be the
> > difference causing us to crash now.
> > 
> > If this is the case, one workaround would be to disable P-OMTP on 10.9 and
> > modify the code in that case to use the paint thread instead of a single
> > paint worker and make sure that has enough stack space.
> 
> The stack in a report like
> https://crash-stats.mozilla.com/report/index/1ab89eca-d9f1-4b0b-bf23-
> 4b6930180713 looks to me like CoreGraphics may be wandering into
> infinite-recursion territory; if so, a thread with a larger stack might not
> actually help.

That was my gut feeling as well, but a stack size change is the best theory I have so far.

At the least we can reproduce it now, so we can see if that's the case.
Pushed by rhunt@eqrion.net:
https://hg.mozilla.org/integration/mozilla-inbound/rev/df00870327ff
Block OMTP before OSX 10.10 to work around CoreGraphics crash. r=jrmuizel
https://hg.mozilla.org/mozilla-central/rev/df00870327ff
Status: NEW → RESOLVED
Closed: Last year
Resolution: --- → FIXED
Target Milestone: --- → mozilla63
I confirmed that increasing the stack size of the paint threads avoids the crash. It's not exactly clear how we want to solve this though.
I investigated Chrome's behaviour a little. It looks like it's getting the path for the glyph earlier on the main thread (for as of yet unknown reasons) and that avoids the crossing_count deep recursion from happening on the drawing threads. The drawing threads have a stack size of 512K which is more than our 256K default. I'm not sure if 512K is enough.
Here's the stack that causes glyph path creation:

    frame #7: 0x00007fff8bdcf0de CoreGraphics`CGFontCreateGlyphPath + 46
    frame #8: 0x00007fff8d304be5 CoreText`TFont::CreatePathForGlyph(unsigned short, CGAffineTransform const*) const + 365
    frame #9: 0x00007fff8d30096d CoreText`CTFontCreatePathForGlyph + 107
    frame #10: 0x0000000108de3de1 Chromium Framework`SkScalerContext_Mac::generatePath(this=0x00007ffa56048c00, glyph=0x00007ffa53ffbdb0, path=0x00007fff589eed10) + 465 at SkFontHost_mac.cpp:1432
    frame #11: 0x0000000108d13680 Chromium Framework`SkScalerContext::internalGetPath(this=0x00007ffa56048c00, glyph=0x00007ffa53ffbdb0, fillPath=0x0000000000000000, devPath=0x00007ffa53ff2210, fillToDevMatrix=0x0000000000000000) + 64 at SkScalerContext.cpp:583
    frame #12: 0x0000000108c9d37a Chromium Framework`SkGlyphCache::findPath(this=0x00007ffa53ffbc70, glyph=0x00007ffa53ffbdb0) + 106 at SkGlyphCache.cpp:234
    frame #13: 0x0000000108ce524a Chromium Framework`SkPaint::getTextPath(void const*, unsigned long, float, float, SkPath*) const + 138 at SkPaint.cpp:2322
    frame #14: 0x0000000108ce51c0 Chromium Framework`SkPaint::getTextPath(this=<unavailable>, textData=<unavailable>, length=<unavailable>, x=<unavailable>, y=<unavailable>, path=<unavailable>) const + 128 at SkPaint.cpp:1124
    frame #15: 0x000000010ae979df Chromium Framework`blink::SkiaTextMetrics::getSkiaBoundsForGlyph(this=0x00007fff589eef48, glyph=<unavailable>, bounds=0x00007fff589eef38) + 63 at SkiaTextMetrics.cpp:71
    frame #16: 0x000000010ae86475 Chromium Framework`blink::SimpleFontData::platformBoundsForGlyph(this=<unavailable>, glyph=<unavailable>) const + 69 at SimpleFontData.cpp:382
Summary: Crash in CoreGraphics@0x26fc8 → Crash in CoreGraphics@0x26fc8 stack overflow in crossing_count
Please nominate this for Beta approval when you get a chance.
Flags: needinfo?(rhunt)
Approval Request Comment
[Feature/Bug causing the regression]: Stack size change causes stack overflow in CoreGraphics
[User impact if declined]: Crashes
[Is this code covered by automated tests?]: No
[Has the fix been verified in Nightly?]: Yes
[Needs manual test from QE? If yes, steps to reproduce]: No
[List of other uplifts needed for the feature/fix]: None
[Is the change risky?]: Not really
[Why is the change risky/not risky?]: It is a feature flip back to a older configuration for older OSes
[String changes made/needed]: None
Flags: needinfo?(rhunt)
Attachment #8993409 - Flags: approval-mozilla-beta?
Here's a work in progress patch I wrote that makes it so we don't use the paint workers when we have paint-workers = 1. This might give us a better stack size for core graphics and could let us have some OMTP on these systems.
Comment on attachment 8993426 [details] [diff] [review]
single-paint-worker.patch

Jeff could you see if this resolves the issue on your machine? You need to have layers.omtp.paint-workers = 1 in your prefs.
Attachment #8993426 - Flags: feedback?(jmuizelaar)
Comment on attachment 8993409 [details] [diff] [review]
omtp-osx-cg.patch

Let's see how this affects crashes on beta. OK to uplift for beta 11.
Attachment #8993409 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Adding qe-verify+, as a reminder to check for these signatures in crash-stats after beta 11 is released next week.
Flags: qe-verify+
(In reply to Ryan Hunt [:rhunt] from comment #43)
> Comment on attachment 8993426 [details] [diff] [review]
> single-paint-worker.patch
> 
> Jeff could you see if this resolves the issue on your machine? You need to
> have layers.omtp.paint-workers = 1 in your prefs.

Yes. This patch and layers.omtp.paint-workers=1 avoids the crash.
Attachment #8993409 - Attachment is patch: true
Attachment #8993409 - Attachment mime type: text/x-patch → text/plain
Hi everyone, I managed to reproduce this issue in Nightly version 61 and the tab crashed instantly, after which I downloaded the latest version and no crashes occurred even after I tried stressing different pages and reloads. I will mark this issue Verified as Fixed in Nightly 63.
Attachment #8993426 - Flags: feedback?(jmuizelaar) → feedback+
Blocks: 1477740
I managed to reproduce the issue in Firefox 61.0.1, Mac OSx 10.9 using the URLs from Comment 15.
On Mac OSx 10.12 and 10.13, and Windows, Firefox 61.0.1 and Firefox 62.0b12 the issue is not reproducing.
Also on Mac OSx 10.9, Firefox 62.0b12, the issue is not reproducible anymore so I will mark this as Verified on 62.
Please don't mess with tracking flags.
could you request uplift approval for mozilla-release as well in case we're going for another dot-release?
for some reason the pref flip through normandy doesn't appear to make a big difference, but landing the changes in-tree did according to the crash data...
Flags: needinfo?(rhunt)
(In reply to [:philipp] from comment #51)
> could you request uplift approval for mozilla-release as well in case we're
> going for another dot-release?
> for some reason the pref flip through normandy doesn't appear to make a big
> difference, but landing the changes in-tree did according to the crash
> data...

This sounds like bug 1477380. 'layers.omtp.enabled' is a restart required/startup pref, and normandy had a bug causing user pref changes to not persist across restarts.
Flags: needinfo?(rhunt)
Attachment #8993409 - Flags: approval-mozilla-release?
Not sure why I wasn't given a uplift request to fill out when I set the flag. The notes in comment 41 are still accurate with the additional point that we've seen this help the crash rate on beta.
Comment on attachment 8993409 [details] [diff] [review]
omtp-osx-cg.patch

Approved for 61.0.2.
Attachment #8993409 - Flags: approval-mozilla-release? → approval-mozilla-release+
Hi, This issue is fixed in 61.0.2 (64-bit) version as well, I will mark this issue accordingly.
Status: RESOLVED → VERIFIED
Flags: qe-verify+
See Also: → 1578075
You need to log in before you can comment on or make changes to this bug.