Last Comment Bug 746896 - Crash in _cairo_surface_snapshot_copy_on_write during CC after printing a large canvas
: Crash in _cairo_surface_snapshot_copy_on_write during CC after printing a lar...
Status: VERIFIED FIXED
[sg:critical][advisory-tracking+]
: crash, sec-critical, testcase
Product: Core
Classification: Components
Component: Graphics (show other bugs)
: Trunk
: x86_64 Linux
: -- critical (vote)
: mozilla16
Assigned To: Nick Cameron [:nrc]
:
Mentors:
Depends on:
Blocks: 379903 740325 746883
  Show dependency treegraph
 
Reported: 2012-04-18 23:47 PDT by Jesse Ruderman
Modified: 2012-07-24 16:19 PDT (History)
12 users (show)
See Also:
Crash Signature:
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
wontfix
+
verified
+
verified
+
verified
14+
verified


Attachments
testcase (50 bytes, text/html)
2012-04-18 23:47 PDT, Jesse Ruderman
no flags Details
stack trace (4.25 KB, text/plain)
2012-04-18 23:47 PDT, Jesse Ruderman
no flags Details
patch (6.04 KB, patch)
2012-05-31 22:10 PDT, Nick Cameron [:nrc]
no flags Details | Diff | Splinter Review
patch (6.08 KB, patch)
2012-06-01 02:32 PDT, Nick Cameron [:nrc]
joe: review-
Details | Diff | Splinter Review
updated patch (8.43 KB, patch)
2012-06-04 16:19 PDT, Nick Cameron [:nrc]
joe: review+
Details | Diff | Splinter Review
updated patch with review comments (8.52 KB, patch)
2012-06-05 11:44 PDT, Nick Cameron [:nrc]
ncameron: review+
akeybl: approval‑mozilla‑aurora+
lukasblakk+bugs: approval‑mozilla‑beta+
lukasblakk+bugs: approval‑mozilla‑esr10+
Details | Diff | Splinter Review

Description Jesse Ruderman 2012-04-18 23:47:21 PDT
Created attachment 616478 [details]
testcase

1. Set
  user_pref("gfx.canvas.azure.enabled", true);
2. Load the testcase.
3. Print the page (to PDF).
4. Quit Firefox.

Result: crash during a shutdown CC.

This might be the cause of bug 740325. I'm filing this separately because I'm not sure, and because the testcase demonstrates that it's a security bug.
Comment 1 Jesse Ruderman 2012-04-18 23:47:59 PDT
Created attachment 616479 [details]
stack trace

Opt: bp-f44efcb1-d956-4e1f-8b83-42e382120419
Comment 2 Daniel Veditz [:dveditz] 2012-04-19 13:17:02 PDT
Assigning to Joe as a placeholder to find the right developer to work on this.
Comment 3 David Bolter [:davidb] 2012-04-26 13:48:19 PDT
From Joe: Azure is enabled by default since FF7 on windows.
Comment 4 Bas Schouten (:bas.schouten) 2012-04-26 17:10:30 PDT
I believe I saw a patch for some similar bug? Jeff? Do you remember anything?
Comment 5 Jeff Muizelaar [:jrmuizel] 2012-05-03 12:38:24 PDT
(In reply to Bas Schouten (:bas) from comment #4)
> I believe I saw a patch for some similar bug? Jeff? Do you remember anything?

Nothing jumps to mind.
Comment 6 Joe Drew (not getting mail) 2012-05-17 13:48:37 PDT
Bas wins!
Comment 7 Bas Schouten (:bas.schouten) 2012-05-17 18:48:44 PDT
I'm not seeing this on windows. Am I correct in understanding this is on Linux with Azure canvas preffed on manually?
Comment 8 Jesse Ruderman 2012-05-19 10:18:07 PDT
Right. I haven't tested other platforms.
Comment 9 Daniel Veditz [:dveditz] 2012-05-24 13:23:48 PDT
Are ever intending to ship Azure on linux? If not this is a less severe bug.

Does this problem repro on the Mac build? If so we need to fix it
Comment 10 Daniel Veditz [:dveditz] 2012-05-24 13:27:25 PDT
If it's related to bug 740325 then it may be a regression (and happens on Windows, too). Or maybe some more recent code change is just making an old crash more likely.
Comment 11 Joe Drew (not getting mail) 2012-05-24 13:53:19 PDT
We are planning on shipping Azure everywhere, but that doesn't mean that the same bugs will appear everywhere. In particular, the Cairo backend for Azure is quite unfinished.

However, Nick is working on the Cairo backend, so maybe he can fix this.
Comment 12 David Bolter [:davidb] 2012-05-31 13:42:28 PDT
Hi Nick, any traction here?
Comment 13 Nick Cameron [:nrc] 2012-05-31 14:06:49 PDT
Hi working on it (In reply to David Bolter [:davidb] from comment #12)
> Hi Nick, any traction here?

Hi, working on it right now, I can reproduce with no problem, and have lots of info on how the crash is happening, but no fix yet (or even an idea of how to fix).
Comment 14 Nick Cameron [:nrc] 2012-05-31 20:44:05 PDT
OK, figured it out now and I have a hack-ey fix, will post a proper fix soon.

See also bug 760349, which has similar steps to reproduce, but I think it totally different. My fix does not fix that bug.
Comment 15 Nick Cameron [:nrc] 2012-05-31 22:10:06 PDT
Created attachment 629055 [details] [diff] [review]
patch

This fixes the crash, but causes a copy when we often don't need one. We should replace this with a copy-on-write link to data in GetThebesSurfaceForDrawTarget, but that is a fair amount of work (we would need a new class, gfxSourceSurfaceSurface or something), so I suggest we make that a separate bug, especially as this is a security issue, but Azure canvas is not properly supported yet.
Comment 16 Nick Cameron [:nrc] 2012-06-01 02:32:56 PDT
Created attachment 629112 [details] [diff] [review]
patch
Comment 17 Joe Drew (not getting mail) 2012-06-01 11:12:24 PDT
Comment on attachment 629112 [details] [diff] [review]
patch

Review of attachment 629112 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/thebes/gfxImageSurface.cpp
@@ +199,5 @@
> +    }
> +
> +    gfxIntSize size(data->GetSize().width, data->GetSize().height);
> +    if (size != mSize)
> +    {

{ on previous line

@@ +203,5 @@
> +    {
> +        return false;
> +    }
> +
> +    CopyForStride(mData, data->GetData(), size, mStride, data->Stride());

You'll need to have a check to see if we're A1 or A8; I don't think you'll be able to copy a DataSourceSurface to ourselves in that case.

(Aside: Is DataSourceSurface always 32-bit?)

::: gfx/thebes/gfxImageSurface.h
@@ +75,5 @@
>      /* Fast copy from another image surface; returns TRUE if successful, FALSE otherwise */
>      bool CopyFrom (gfxImageSurface *other);
>  
> +    /**
> +     * Fast copy from an Azure source surface; returns TRUE if successful, FALSE otherwise

s/ Azure//

Azure is the project, not the API :)

::: gfx/thebes/gfxPlatform.cpp
@@ +596,5 @@
>  
>      IntSize size = data->GetSize();
>      gfxASurface::gfxImageFormat format = OptimalFormatForContent(ContentForFormat(data->GetFormat()));
>  
> +    // We need to make a copy here because data might change it's data under us

its

@@ +597,5 @@
>      IntSize size = data->GetSize();
>      gfxASurface::gfxImageFormat format = OptimalFormatForContent(ContentForFormat(data->GetFormat()));
>  
> +    // We need to make a copy here because data might change it's data under us
> +    nsRefPtr<gfxImageSurface> imageSurf = new gfxImageSurface(gfxIntSize(size.width, size.height), format, false);

This is going to introduce a ton of copies :( Does this always get used with the Cairo backend?
Comment 18 Nick Cameron [:nrc] 2012-06-01 13:17:34 PDT
Try push: https://tbpl.mozilla.org/?tree=Try&rev=d07787580fac
Comment 19 Nick Cameron [:nrc] 2012-06-01 13:24:25 PDT
(In reply to Joe Drew (:JOEDREW!) from comment #17)
> Comment on attachment 629112 [details] [diff] [review]
> @@ +203,5 @@
> > +    {
> > +        return false;
> > +    }
> > +
> > +    CopyForStride(mData, data->GetData(), size, mStride, data->Stride());
> 
> You'll need to have a check to see if we're A1 or A8; I don't think you'll
> be able to copy a DataSourceSurface to ourselves in that case.
> 
> (Aside: Is DataSourceSurface always 32-bit?)
>

DataSourceSurface doesn't expose any API about the format of its data, which is why there is no check. I don't know if they are always 32bit, I'll check both things with Bas next week.

> @@ +597,5 @@
> >      IntSize size = data->GetSize();
> >      gfxASurface::gfxImageFormat format = OptimalFormatForContent(ContentForFormat(data->GetFormat()));
> >  
> > +    // We need to make a copy here because data might change it's data under us
> > +    nsRefPtr<gfxImageSurface> imageSurf = new gfxImageSurface(gfxIntSize(size.width, size.height), format, false);
> 
> This is going to introduce a ton of copies :( Does this always get used with
> the Cairo backend?

Yes, unfortunately so. I'm not sure what you mean by always. Any time we hit this path, there will be a copy. The bug is with the Skia backend (Cairo is only being used because of the printing), which is probably why this bug only manifests when doing a print and on Linux (although I think we should take the same path with Direct2D, so not sure why there is no crash on Windows). With the Cairo backend, we take the if path in this method (the change is in the else) path, so we would avoid a copy.

We should replace the copy with a copy on write class, but it might be a fair bit of work and isn't necessary for a fix here (and this (Azure/Skia canvas) is not even properly supported yet), so I propose doing that work later, I can file a bug if you agree.
Comment 20 Nick Cameron [:nrc] 2012-06-04 15:15:06 PDT
> DataSourceSurface doesn't expose any API about the format of its data, which
> is why there is no check. I don't know if they are always 32bit, I'll check
> both things with Bas next week.

WRONG! Fixing this now...
Comment 21 Nick Cameron [:nrc] 2012-06-04 16:16:42 PDT
> (Aside: Is DataSourceSurface always 32-bit?)

No, I think they can be 565 or A8 also
Comment 22 Nick Cameron [:nrc] 2012-06-04 16:19:17 PDT
Created attachment 629988 [details] [diff] [review]
updated patch
Comment 23 Joe Drew (not getting mail) 2012-06-05 10:51:53 PDT
Comment on attachment 629988 [details] [diff] [review]
updated patch

Review of attachment 629988 [details] [diff] [review]:
-----------------------------------------------------------------

One very important fix is necessary, but I don't need to review it again.

::: gfx/thebes/gfx2DGlue.h
@@ +2,5 @@
>   * License, v. 2.0. If a copy of the MPL was not distributed with this
>   * file, You can obtain one at http://mozilla.org/MPL/2.0/. */
>  
> +#ifndef GFX_2D_GLUE_H
> +#define GFX_2D_GLUE_H

*shakes fist at Bas*

::: gfx/thebes/gfxImageSurface.cpp
@@ +202,5 @@
> +        !(a1 == gfxASurface::ImageFormatRGB24 &&
> +          a2 == gfxASurface::ImageFormatARGB32)) {
> +        return false;
> +    }
> +}

missing "return true"
Comment 24 Nick Cameron [:nrc] 2012-06-05 11:44:13 PDT
Created attachment 630245 [details] [diff] [review]
updated patch with review comments

Made the change from the review and added an assertion, carrying r=joe

changed

imageSurf->CopyFrom(source);

to

bool resultOfCopy = imageSurf->CopyFrom(source);
NS_ASSERTION(resultOfCopy, "Failed to copy surface.");

in gfxPlatform::GetThebesSurfaceForDrawTarget because the review comment made me realise I never actually checked the returned result. It should always be true because the memory is allocated earlier and the format is picked to be correct.
Comment 26 Nick Cameron [:nrc] 2012-06-12 20:09:44 PDT
https://hg.mozilla.org/mozilla-central/rev/52b9c50593c3
Comment 27 Nick Cameron [:nrc] 2012-06-12 20:16:46 PDT
(In reply to Nick Cameron [:nrc] from comment #25)

I don't know how I managed to change tracking-firefox-esr10, but I don't seem to be able to change it back.

dveditz: do you want to change this back, and should we land this on beta/aurora?
Comment 28 Nick Cameron [:nrc] 2012-06-12 22:11:52 PDT
(In reply to Nick Cameron [:nrc] from comment #26)
> https://hg.mozilla.org/mozilla-central/rev/52b9c50593c3

False alarm.
Comment 29 Nick Cameron [:nrc] 2012-06-14 15:02:09 PDT
Another go: https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=28add393ae55
Comment 30 Ed Morley [:emorley] 2012-06-15 06:11:58 PDT
https://hg.mozilla.org/mozilla-central/rev/28add393ae55
Comment 31 Daniel Veditz [:dveditz] 2012-06-21 17:34:38 PDT
(In reply to Nick Cameron [:nrc] from comment #27)
> dveditz: do you want to change this back, and should we land this on
> beta/aurora?

Please request approval to land this patch on Aurora/Beta/ESR, thanks.
Comment 32 Nick Cameron [:nrc] 2012-06-21 19:21:03 PDT
Comment on attachment 630245 [details] [diff] [review]
updated patch with review comments

[Approval Request Comment]
Bug caused by (feature/regressing bug #): unknown
User impact if declined: crashes on printing, security vulnerability
Testing completed (on m-c, etc.): yes, see above comments
Risk to taking this patch (and alternatives if risky): can increase copying and thus impact performance
String or UUID changes made by this patch: none
Comment 33 Nick Cameron [:nrc] 2012-06-22 14:18:31 PDT
Comment on attachment 630245 [details] [diff] [review]
updated patch with review comments

Apparently this breaks Azure/Skia canvas. Mattwoodrow is working on a patch, I'll post more details and a bug number on Monday. I don't think we need to backout, but we should probably hold off on pushing to Beta etc.
Comment 34 Nick Cameron [:nrc] 2012-06-24 20:37:56 PDT
Comment on attachment 630245 [details] [diff] [review]
updated patch with review comments

[Approval Request Comment]
Bug caused by (feature/regressing bug #): unknown
User impact if declined: crashes on printing, security vulnerability
Testing completed (on m-c, etc.): yes, see above comments
Risk to taking this patch (and alternatives if risky): can increase copying and thus impact performance; causes incorrect rendering with Azure/Skia canvas, but this is not enabled by default on any platform
String or UUID changes made by this patch: none
Comment 35 Alex Keybl [:akeybl] 2012-06-26 09:48:18 PDT
Before landing on beta/esr, have we checked all applicable perf dashboards to ensure that we have no new regressions there?
Comment 36 Robert O'Callahan (:roc) (email my personal email if necessary) 2012-06-26 20:37:44 PDT
https://hg.mozilla.org/releases/mozilla-aurora/rev/d88144263365
Comment 37 Lukas Blakk [:lsblakk] use ?needinfo 2012-07-02 16:12:20 PDT
Comment on attachment 630245 [details] [diff] [review]
updated patch with review comments

From Nick in email:
 I checked the dashboards and there are no performance regressions around the time the patch landed on inbound or Aurora (the code changed is pref'ed off, so this is expected)

Approving for beta and esr10, please update status flags when landed.
Comment 40 Anthony Hughes (:ashughes) [GFX][QA][Mentor] 2012-07-11 12:35:54 PDT
Firefox 13.0.1: crash
Firefox 14.0b11: no crash
Firefox 15.0a2 2012-07-11: no crash
Firefox 16.0a1 2012-07-11: no crash
Firefox 10.0.6esrpre 2012-07-11: no crash
Comment 41 Matt Woodrow (:mattwoodrow) 2012-07-20 18:53:25 PDT
As suggested earlier, this patch prevented the crash by breaking azure/skia canvas. It hasn't shown up anywhere since this isn't a supported configuration and requires a perf to be activated.

I'm not really sure why this was classified as sg:crit and uplifted to all the branches.

The crash is caused by the created gfxImageSurface relying on the pixel data owned by the draw target without taking a ref to the draw target to ensure that the data lives long enough.

In nsCanvasRenderingContext2DAzure::Reset(), we delete the draw target before the thebes surface, leaving it with a dangling pointer to the pixel memory.

This really isn't specific to skia, except that the gfxImageSurface destructor apparently touches this deleted memory, whereas the others (e.g. gfxQuartzSurface) isn't.

The obvious answer of just storing a ref to the DrawTarget as user data on the gfxASurface won't work since that will create a cycle and prevent them from ever being collected.

I think we do want to use this ref though, and change the DrawTarget -> gfxASurface lookup code to use something other than a reference stored on the DrawTarget's UserData.

Note You need to log in before you can comment on or make changes to this bug.