Closed Bug 1324908 Opened 3 years ago Closed 3 years ago

[e10s] OS X printing related crashes in CoreGraphics@0x regressing in Firefox 51

Categories

(Core :: Printing: Output, defect, P1, critical)

51 Branch
Unspecified
macOS
defect

Tracking

()

RESOLVED FIXED
mozilla53
Tracking Status
firefox50 --- unaffected
firefox51 blocking fixed
firefox52 - fixed
firefox53 - unaffected

People

(Reporter: philipp, Assigned: nical)

References

Details

(Keywords: crash, regression, topcrash-mac)

Crash Data

Attachments

(1 file)

This bug was filed from the Socorro interface and is 
report bp-28277d6d-7c81-41a7-828c-ad6482161220.
=============================================================

[Tracking Requested - why for this release]:
this is a major mac os x regression in firefox 51 and later. these signatures are accounting for a third of all crashes from mac os x in 51.0b during the last week.

these crashes seem to be spread out across multiple different signatures (attaching the most frequent ones to this report) and nearly always happen in the content process, so they seem to be related to e10s. 

many user comments indicate that people where trying to print a pdf file at the time of the crash: http://bit.ly/2h9BTk5
curiously there aren't any crashes for this on nightly at all, but they started in volume on 51.0a2 20161026004014. these are the couple of patches that landed in this particular build: https://hg.mozilla.org/releases/mozilla-aurora/pushloghtml?fromchange=96b6a1a83bd74da807a67bb9449bc7158f6ee2cb&tochange=a16ab27e533088e1711a82ff380d3ebd262ca837

so bug 1310804 looks related...
See Also: → 1310804
Tracking for 51 and 52, as this being 1/3 of all mac crashes in beta sound pretty bad. 

Haik, bobowen, can you take a look? We may need to back out the work in bug 1310804.
Flags: needinfo?(haftandilian)
Flags: needinfo?(bobowencode)
On further thought let's call this a blocker for 51.  I don't think this big of an increase in crashes by OS is acceptable.  It basically means Mac users can't print boarding passes, tickets, etc.
Bug 1310804 is unlikely to be related to anything on build 51. That bug made sure that sandboxed/print_via_parent printing was not turned on for 51. Sandboxed/print_via_parent printing is the enabled for 52 and Nightly at the moment. Still looking at the crash reports for clues. Seeing the same crash reports on 51 and 52 is an indication this isn't related to the remote printing and sandboxing that is enabled in 52+.
why is it related to component Gecko Profiler ? It seems to be Printing: Output.
Component: Gecko Profiler → Printing: Output
Flags: needinfo?(haftandilian)
Priority: -- → P1
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #2)
> Tracking for 51 and 52, as this being 1/3 of all mac crashes in beta sound
> pretty bad. 
> 
> Haik, bobowen, can you take a look? We may need to back out the work in bug
> 1310804.

It seems more likely that something else introduced a problem with not printing via the parent and bug 1310804 uncovered this.
It stopped again in Fx52, when printing via the parent was turned on again (bug 1317801).

So we need to find out what introduced the crash into printing in the child.
The only thing I landed that was Mac specific (that I can think of) is bug 1308259.
But that was uplifted to 50 and we haven't seen the crash there.

jwatt - is there anything you can think of that landed in 51 that might be causing this?
Flags: needinfo?(bobowencode) → needinfo?(jwatt)
I can reproduce this on my new laptop, trying to get a regression range at the moment.
So, it looks like it is my change for bug 1308259:
https://hg.mozilla.org/releases/mozilla-aurora/pushloghtml?fromchange=4c20c80bf18a8ececa61d9e0440e6f4aa916d455&tochange=44754fcbce505bb509c290657231d65918f819af

However the same change is in Fx50 and it doesn't crash there. :-/
Flags: needinfo?(jwatt)
I wonder if saving as a PDF and then printing would be a viable workaround to call out on SUMO. If we can reproduce the crash and if that helps, we might write up a SUMO page to that effect.
Looks like the crash I'm getting is actually the one from bug 880140, so may be different from this one.
Although it occurs after the nsDeviceContextSpecX::EndPage() call as well.

Also, annoyingly my crash doesn't happen when I build beta locally or on try.
Flags: needinfo?(twalker)
Following the link[1] in comment 0, an user had provided a comment saying Firefox crashed when loading the URL[2].

Here is the STR to reproduce the crash which is 100% reproducible on Firefox 51.0b10(e10s force enabled).
1. Load url in [2].
2. In pdf.js preview, click print button and then print dialog appears.
3. Print
4. Crash happens.

[1] http://bit.ly/2h9BTk5
[2] http://www.ivis.org/proceedings/voorjaarsdagen/2016/3.pdf
Update more testing result.
1. On nightly and aurora, doing same STR will cause content process hung a long while(>15s) before print dialog pops out. After print and back to pdf preview, preview performance significantly dropped. 
2. On beta, print function works fine when e10s is off.
Logs of crash.

[Parent 39068] WARNING: NS_ENSURE_SUCCESS(*result, true) failed with result 0x80004001: file /Users/astley/mozilla/mozilla-beta/embedding/components/printingui/ipc/PrintingParent.cpp, line 66
[GFX3-]: Surface size too large (exceeds extent limit)!
[GFX2-]: Allowing surface with invalid size (Cairo) Size(47600,67360)
[GFX3-]: Surface size too large (exceeds extent limit)!
[GFX2-]: Allowing surface with invalid size (Cairo) Size(47600,67360)
[Parent 39068] WARNING: Caller should supply a printer name.: file /Users/astley/mozilla/mozilla-beta/widget/nsPrintOptionsImpl.cpp, line 1171
[Parent 39068] WARNING: Caller should supply a printer name.: file /Users/astley/mozilla/mozilla-beta/widget/nsPrintOptionsImpl.cpp, line 1171
[GFX3-]: Surface size too large (exceeds extent limit)!
[GFX2-]: Allowing surface with invalid size (Cairo) Size(47600,67360)
[GFX3-]: Surface size too large (exceeds extent limit)!
[GFX2-]: Allowing surface with invalid size (Cairo) Size(47600,67360)

###!!! [Parent][MessageChannel] Error: (msgtype=0x2E007F,name=PBrowser::Msg_Destroy) Channel error: cannot send/recv
Crash Signature: [@ CoreGraphics@0x36a97] [@ CoreGraphics@0x36be7] [@ CoreGraphics@0x4bb0c] [@ CoreGraphics@0x4f3eb] [@ CoreGraphics@0x37377] [@ CoreGraphics@0x374ff] [@ CoreGraphics@0x499e4] [@ CoreGraphics@0x413b2] [@ CoreGraphics@0x3722f] [@ CoreGraphics@0x4bb… → [@ decode_swap] [@ CoreGraphics@0x36a97] [@ CoreGraphics@0x36be7] [@ CoreGraphics@0x4bb0c] [@ CoreGraphics@0x4f3eb] [@ CoreGraphics@0x37377] [@ CoreGraphics@0x374ff] [@ CoreGraphics@0x499e4] [@ CoreGraphics@0x413b2] [@ CoreGraphics@0x3722f] [@ C…
Bob, by comment 12, do you have any ideas or thoughts for the root cause ? I think the problem seriously impact users using Firefox with e10s on by default.
Flags: needinfo?(bobowencode)
(In reply to Bob Owen (:bobowen) (PTO until 3rd Jan) from comment #10)
> Looks like the crash I'm getting is actually the one from bug 880140, so may
> be different from this one.
> Although it occurs after the nsDeviceContextSpecX::EndPage() call as well.
> 
> Also, annoyingly my crash doesn't happen when I build beta locally or on try.

It is 100% reproducible in my local build, here is my STR.
1. Load url [1].
2. Scroll to end of page and wait it to show on screen.
3. Issue "Command + p" to print.
4. Crash.

I also found this regression comes from bug 1308259.
This PDF[1] contain portrait and landscape.
If you print pages 2~3(landscape), it works well.
If you print page 1 (portrait), it will crash.

[1] http://www.ivis.org/proceedings/voorjaarsdagen/2016/3.pdf
Thanks Farmer to identify the regression patch, let's wait for Bob's feedback.
Unfortunately I still can't reproduce with my local build.

Farmer, can you try something with your build please.
Could you try removing the line at [1], I added this as part of my patch as it seemed the correct thing to do, but maybe it isn't in some circumstances.

jwatt - I think you're more familiar with the Mac printing bits, can you spot anything in the patch for Bug 1308259 that might be causing this?

[1] https://hg.mozilla.org/mozilla-central/file/tip/gfx/2d/DrawTargetCairo.cpp#l1876
Flags: needinfo?(jwatt)
Flags: needinfo?(fatseng)
Flags: needinfo?(bobowencode)
Unfortunately, it still crash after removing DrawTargetCairo.cpp#l1876.
Flags: needinfo?(fatseng)
(In reply to Farmer Tseng[:fatseng] from comment #18)
> Unfortunately, it still crash after removing DrawTargetCairo.cpp#l1876.

It was a bit of a long shot, as I'm fairly sure that line should be correct.

It looks like cairo_quartz_surface_create_cg_layer might not check the status of the surface passed in, so it could be that it is creating a new surface which is actually invalid.

Could you try adding this just above the switch in DrawTargetCairo::CreateSimilarDrawTarget

  if (cairo_surface_status(mSurface)) {
    return nullptr;
  }
Flags: needinfo?(fatseng)
(In reply to Bob Owen (:bobowen) (PTO until 3rd Jan) from comment #17)
> jwatt - I think you're more familiar with the Mac printing bits, can you
> spot anything in the patch for Bug 1308259 that might be causing this?

No, those changes look correct.
Flags: needinfo?(jwatt)
(In reply to Bob Owen (:bobowen) from comment #19)
> It looks like cairo_quartz_surface_create_cg_layer might not check the
> status of the surface passed in, so it could be that it is creating a new
> surface which is actually invalid.

From my debugging on beta with e10s enabled the code from in these areas all looks fine.
(s/from in/flow in/)

To fill in the missing macOS platform stack frames, the crash stack looks like this:

  decode_swap ()
  decode_data ()
  img_decode_read ()
  img_alphamerge_read ()
  CGSImageStreamRead ()
  PDFImageEmitDefinition ()
  __CFSetApplyFunction_block_invoke ()
  CFBasicHashApply ()
  CFSetApplyFunction ()
  PDFImageSetEmitDefinitions ()
  emit_page_resources(PDFDocument*) ()
  PDFDocumentEndPage ()
  pdf_EndPage ()
  pdfSpoolingEndPage(void*, void*) ()
  PJCEndPage(OpaquePMPrintSession*) ()
  PMSessionEndPageNoDialog ()
  nsDeviceContextSpecX::EndPage()
  nsDeviceContext::EndPage()
  nsSimplePageSequenceFrame::DoPageEnd()
  nsPrintEngine::PrintPage(nsPrintObject*, bool&)
  nsPagePrintTimer::Run()
Under the CreateSimilarDrawTarget call in nsSimplePageSequenceFrame::PrePrintNextPage the wrapped CGContext is of type kCGContextTypePDF, but for some reason the CGLayerCreateWithContext call in cairo_quartz_surface_create_cg_layer results in a CGContext of type kCGContextTypeGeneric (raster). :/
(In reply to Bob Owen (:bobowen) from comment #19)
> (In reply to Farmer Tseng[:fatseng] from comment #18)
> > Unfortunately, it still crash after removing DrawTargetCairo.cpp#l1876.
> 
> It was a bit of a long shot, as I'm fairly sure that line should be correct.
> 
> It looks like cairo_quartz_surface_create_cg_layer might not check the
> status of the surface passed in, so it could be that it is creating a new
> surface which is actually invalid.
> 
> Could you try adding this just above the switch in
> DrawTargetCairo::CreateSimilarDrawTarget
> 
>   if (cairo_surface_status(mSurface)) {
>     return nullptr;
>   }

still crash
Flags: needinfo?(fatseng)
[Tracking Requested - why for this release]:

(In reply to Bob Owen (:bobowen) from comment #6)
> It stopped again in Fx52, when printing via the parent was turned on again
> (bug 1317801).

Given that Fx52 is unaffected now that print.print_via_parent defaults to true, I don't think we necessarily need to track for Fx52. (Same goes for Fx53.)  (Of course if we were to need to flip print.print_via_parent to false for some reason these versions would be affected again.)
(In reply to Astley Chen [:astley] UTC+8 from comment #12)
> Update more testing result.
> 1. On nightly and aurora, doing same STR will cause content process hung a
> long while(>15s) before print dialog pops out. After print and back to pdf
> preview, preview performance significantly dropped. 
On nightly and aurora, I'm seeing a different issue with same scenario. Probably we should create another bug to track it.
Assignee: nobody → bobowencode
Status: NEW → ASSIGNED
So I don't want to de-rail... but not sure if this is related.

Since Dec 20th, we're seeing an increase in App Crashes with addons/e10s experiment group (3-4 more per hour).  There's 6-7 LESS Plug-in crashes per hour (than e10s only).

It looks like specifically Content crashes (because Main crash graph is the same between groups).

I've asked Marco Castelluccio and Ben M to see if it correlates to an OS (to see if it is Mac only) and  add-on correlation.

https://sql.telemetry.mozilla.org/dashboard/stability-metrics-for-e10s-add-ons-experiment-beta-49-50-51-
Is this in any way related to bug 1324610 (which seems to be OK in 51, but broken otherwise and has a patch on the bug)?  Or does that bug help us figure this one out.
Flags: needinfo?(haftandilian)
(In reply to Milan Sreckovic [:milan] from comment #28)
> Is this in any way related to bug 1324610 (which seems to be OK in 51, but
> broken otherwise and has a patch on the bug)?  Or does that bug help us
> figure this one out.

I don't think they're related. Bug 1324610 is caused by a content sandbox restriction interfering with printing. And the content sandbox is only enabled on build 52+. But bug 1324610 could be the cause of the issue mentioned in comment 26.
Flags: needinfo?(haftandilian)
(In reply to :shell escalante from comment #27)
> So I don't want to de-rail... but not sure if this is related.
> 
> Since Dec 20th, we're seeing an increase in App Crashes with addons/e10s
> experiment group (3-4 more per hour).  There's 6-7 LESS Plug-in crashes per
> hour (than e10s only).
> 
> It looks like specifically Content crashes (because Main crash graph is the
> same between groups).
> 
> I've asked Marco Castelluccio and Ben M to see if it correlates to an OS (to
> see if it is Mac only) and  add-on correlation.
> 
> https://sql.telemetry.mozilla.org/dashboard/stability-metrics-for-e10s-add-
> ons-experiment-beta-49-50-51-

A lot of users crashing with the "CoreGraphics@0x3722f" signature have addons installed, ~55% have "AdBlock Plus", ~18.75% have "ADB Helper", ~18.75% have "Valence".
Maybe the fact that we're enabling e10s for more people is increasing the volume of this crash, but I think the crash is likely unrelated to the addons themselves.
Crash Signature: CoreGraphics@0x4bb5c] [@ CoreGraphics@0x4bbca] [@ CoreGraphics@0x4b8bc] [@ CoreGraphics@0x40280] [@ CoreGraphics@0x498f4] [@ CoreGraphics@0x4b068] → CoreGraphics@0x4bb5c] [@ CoreGraphics@0x4bbca] [@ CoreGraphics@0x4b8bc] [@ CoreGraphics@0x40280] [@ CoreGraphics@0x498f4] [@ CoreGraphics@0x4b068] [@ CoreGraphics@0x4f13b]
I finally remembered from before Christmas that the patch for bug 1308259 was uplifted to 50, but we don't see the crash there (comment 6).

I also managed to finally reproduce on my local build, for some reason it takes several seconds for it to crash, I was obviously just too impatient because on normal beta it was crashing immediately.

So, there were two possibilities:

1) Something broke this on Fx51 before bug 1308259 was uplifted. Applying the patch to the base of Fx51 proved that wasn't the case, as it still crashed.

2) Something fixed it during Fx50, before bug 1308259 was uplifted.
So, I went to the common ancestor of the base of Fx51 and the uplift to Fx50 and applied the patch for bug 1308259 and sure enough it crashes.
Then I bisected the fix for the crash by applying the patch each time and got this:

https://hg.mozilla.org/releases/mozilla-aurora/rev/312abfcd4ee3

If I disable that pref on beta, I don't get the crash.
So it looks like this might actually be a regression introduced by shared persistent buffer provider.

nical - can you take a look at this please?
Assignee: bobowencode → nobody
Status: ASSIGNED → NEW
Flags: needinfo?(twalker) → needinfo?(nical.bugzilla)
update on comment 27 - looks like unrelated content crashes, since the add-ons related ones are mostly seen on Windows.  sorry.
See Also: → 1328766
Assignee: nobody → nical.bugzilla
Flags: needinfo?(nical.bugzilla)
Since the crash isn't trivial to investigate and has a worrying volume on beta, the most sensible thing to do is disable the pref and uplift asap.
Attachment #8824022 - Flags: review?(ethlin)
Shouldn't affect 52/53 as long as we use print_via_parent there.
Attachment #8824022 - Flags: review?(ethlin) → review+
nical, could you help to create a BETA uplift request so that RM could proceed to approve it ? Thanks.
Flags: needinfo?(nical.bugzilla)
Pushed by nsilva@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/9edfd2e0fb78
Turn off BufferProviderShared on mac due to printing issues. r=ethlin
Comment on attachment 8824022 [details] [diff] [review]
Disable PersistentBufferProviderShared on mac

Approval Request Comment
[Feature/Bug causing the regression]:
[User impact if declined]: top crash on mac
[Is this code covered by automated tests?]: apparently not.
[Has the fix been verified in Nightly?]: It has been verified on beta.
[Needs manual test from QE? If yes, steps to reproduce]: looking at crash numbers should be enough.
[List of other uplifts needed for the feature/fix]:
[Is the change risky?]: Not risky.
[Why is the change risky/not risky?]: This puts mac users in a configuration closer to windows users for (so very well tested code paths).
[String changes made/needed]: None.
Flags: needinfo?(nical.bugzilla)
Attachment #8824022 - Flags: approval-mozilla-beta?
Attachment #8824022 - Flags: approval-mozilla-aurora?
Comment on attachment 8824022 [details] [diff] [review]
Disable PersistentBufferProviderShared on mac

Crash fix, blocking 51 release, let's uplift this for the beta 13 build today.
Attachment #8824022 - Flags: approval-mozilla-beta?
Attachment #8824022 - Flags: approval-mozilla-beta+
Attachment #8824022 - Flags: approval-mozilla-aurora?
Attachment #8824022 - Flags: approval-mozilla-aurora+
https://hg.mozilla.org/mozilla-central/rev/9edfd2e0fb78
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla53
See Also: → 1531417
You need to log in before you can comment on or make changes to this bug.