Open Bug 987010 Opened 10 years ago Updated 2 years ago

[Skia] ConvertBGRAToBGRX is sitting on a hotpath and we don't want it there

Categories

(Core :: Graphics, defect)

x86
macOS
defect

Tracking

()

People

(Reporter: gw280, Unassigned)

References

Details

(Whiteboard: [leave open])

Attachments

(2 files)

This is killing our performance for Skia content. We use BGRX extensively in Gecko but Skia can't handle it so we have to iterate over each pixel in a buffer and ensure the A channel is opaque.

The main reason this is there is because (if I recall correctly) one of the image decoders likes to generate BGRX with non-opaque alpha values. This is perfectly reasonable for a BGRX buffer, but plays hell with Skia which can only understand BGRA (and has a flag you can set on a BGRA bitmap to say "I'm opaque - optimise away"). 

Two options here as I currently see it:

a) Find all the places we generate BGRX buffers in Gecko and ensure they generate valid alpha channels for BGRA use.
b) Add codepaths in Skia to support BGRX.

My personal opinion is that a) is probably the better choice here, despite it feeling a little "unclean". I think adding BGRX support in Skia may end up being slower because it can't just do a straight memcpy anymore to draw a BGRX bitmap to a BGRA surface.
Do you have profiles for this? That would help finding the right places to fix this.
Wait, why can't we just set the "I'm opaque" flag for BGRX bitmaps?
Specifically, it looks like SkBitmap::setAlphaType(kIgnore_AlphaType) should be what we want for this sort of data.
/**
 *  Describes how to interpret the alpha compoent of a pixel.
 */
enum SkAlphaType {
    /**
     *  All pixels should be treated as opaque, regardless of the value stored
     *  in their alpha field. Used for legacy images that wrote 0 or garbarge
     *  in their alpha field, but intended the RGB to be treated as opaque.
     */
    kIgnore_SkAlphaType,
This will behave differently when reading back, so we might need to convert in that path.
(In reply to George Wright (:gw280) from comment #0)
> This is killing our performance for Skia content. We use BGRX extensively in
> Gecko but Skia can't handle it so we have to iterate over each pixel in a
> buffer and ensure the A channel is opaque.
> 
> The main reason this is there is because (if I recall correctly) one of the
> image decoders likes to generate BGRX with non-opaque alpha values. This is
> perfectly reasonable for a BGRX buffer, but plays hell with Skia which can
> only understand BGRA (and has a flag you can set on a BGRA bitmap to say
> "I'm opaque - optimise away"). 


Can you recall which image decoder this was? I'm happy to make the image decoders output buffers with A set to 0xff
(In reply to Andreas Gal :gal from comment #6)
> This will behave differently when reading back, so we might need to convert
> in that path.

Good point. That's at least a better place to take the perf hit.
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #8)
> (In reply to Andreas Gal :gal from comment #6)
> > This will behave differently when reading back, so we might need to convert
> > in that path.
> 
> Good point. That's at least a better place to take the perf hit.

Actually, this should only be a problem for other places where we can't handle BGRX, right?
Oh nice! They implemented this whilst we weren't looking. Awesome. Let's do a try run and see what happens.
Comment on attachment 8395655 [details] [diff] [review]
Ignore alpha in Skia BGRX source surfaces

Review of attachment 8395655 [details] [diff] [review]:
-----------------------------------------------------------------

DrawTargetSkia::Init also needs updating.
Attachment #8395655 - Flags: review?(gwright) → review+
(In reply to Andreas Gal :gal from comment #1)
> Do you have profiles for this? That would help finding the right places to
> fix this.

I believe Matt Woodrow does.

(In reply to Andreas Gal :gal from comment #6)
> This will behave differently when reading back, so we might need to convert
> in that path.

Should be fine so long as we maintain the correct alphatype in the SourceSurface that's read back, I think.
Try run looks good, will push this today
We're crashing: 

12:39:14  WARNING -  PROCESS-CRASH | /tests/image/test/mochitest/test_bug399925.html | application crashed [@ S32A_Opaque_BlitRow32_SSE2(unsigned int *,unsigned int const *,int,unsigned int)]

Looks like we've opened up using a new blitter that crashes on WinXP.
Seems to crash on Android too
(In reply to George Wright (:gw280) from comment #18)
> We're crashing: 
> 
> 12:39:14  WARNING -  PROCESS-CRASH |
> /tests/image/test/mochitest/test_bug399925.html | application crashed [@
> S32A_Opaque_BlitRow32_SSE2(unsigned int *,unsigned int const *,int,unsigned
> int)]
> 
> Looks like we've opened up using a new blitter that crashes on WinXP.

The crash is not reproducible locally on Win7 or on Linux, and on Linux there is not even a Valgrind error.

We'll have to do some slow iterations on tryserver here.

The crash is strange when you look at it:

12:39:14     INFO -  Crash reason:  EXCEPTION_ACCESS_VIOLATION_READ
12:39:14     INFO -  Crash address: 0xf900000
12:39:14     INFO -  Thread 0 (crashed)
12:39:14     INFO -   0  gkmedias.dll!S32A_Opaque_BlitRow32_SSE2(unsigned int *,unsigned int const *,int,unsigned int) [SkBlitRow_opts_SSE2.cpp:731a5c2dc6dd : 187 + 0x0]
12:39:14     INFO -      eip = 0x0145f46d   esp = 0x0012e138   ebp = 0x0012e144   ebx = 0x13923000
12:39:14     INFO -      esi = 0x13923000   edi = 0x00000004   eax = 0x0f900000   ecx = 0x00000019
12:39:14     INFO -      edx = 0xffffff9c   efl = 0x00010257
12:39:14     INFO -      Found by: given as instruction pointer in context
12:39:14     INFO -   1  gkmedias.dll!Sprite_D32_S32::blitRect(int,int,int,int) [SkSpriteBlitter_ARGB32.cpp:731a5c2dc6dd : 48 + 0xa]
12:39:14     INFO -      eip = 0x013d140e   esp = 0x0012e14c   ebp = 0x0012e16c
12:39:14     INFO -      Found by: call frame info


The crash line, SkBlitRow_opts_SSE2.cpp:187, is:

http://hg.mozilla.org/mozilla-central/file/690c810c8e3e/gfx/skia/trunk/src/opts/SkBlitRow_opts_SSE2.cpp#l187

The crash address being 0xf900000 means that the 's' pointer is 0xf900000 here. Since it's obtained by small increments from 'src', that means that 'src' is roughly 0xf900000. But that doesn't make sense: 0xf900000 does not look like a pointer, if anything it looks like a RGBA value. 'src' was computed in frame #1 by some pointer arithmetic from fSource->fPixels. Maybe 'src' or 'fPixels' was accidentally scribbled with a RGBA value?

The only way that I have to find out is add printfs and push to try...
Assignee: snorp → bjacob
Here is a try push with a crazy approach to logging:

https://tbpl.mozilla.org/?tree=Try&rev=62d6a81af68c

The first problem is that the only way to print to mochitest log is window.dump() / nsGlobalWindow::Dump(). The other problem is that there is no trivial way to print very verbose logging inside skia during the running of one precise mochitest. So this hijacks dump() to have a pair of special strings interpreted as enabling/disabling verbose logging, and this lets skia write directly to the FILE* used by dump().
Previous try had a linking error; new try:

https://tbpl.mozilla.org/?tree=Try&rev=a78ee9c5689d

And... ?!?!?! this is green! No crash! And the log shows that this particular test was run, hitting this particular code path!

Let's try checking if this still reproduces without my logging patch...
Try push without my logging patch:

https://tbpl.mozilla.org/?tree=Try&rev=88651c3aa430

Also retriggered M4 a few times on the try push with my logging patch.
All green! So far it doesn't look like anything prevents us from re-landing. Let's do a full try push...
(In reply to Benoit Jacob [:bjacob] from comment #24)
> All green! So far it doesn't look like anything prevents us from re-landing.
> Let's do a full try push...

You'll want to change the code in DrawTargetSkia::Init, too, as in https://hg.mozilla.org/integration/mozilla-inbound/rev/08c58214f01f

I think that may be the spot that actually crashes, too.
Ah... wish I had known earlier... thanks
(In reply to Benoit Jacob [:bjacob] from comment #27)
> Ah... wish I had known earlier... thanks

Sorry. George commented about that above, and I fixed it without attaching a new patch. Just now noticed that you weren't testing that.
Try push with the right patch, and the logging patch:

https://tbpl.mozilla.org/?tree=Try&rev=74ae7757fad9
Green again! Still not clear that I'm being of any use here.
(In reply to Benoit Jacob [:bjacob] from comment #30)
> Green again! Still not clear that I'm being of any use here.

Maybe you're magic?
No, I am more reminded of the following dialogue in a French TV series:

 - You are like the H of Hawaii.
 - Ooh, that is so romantic...
 - You are of no use.

(Referring to the fact that H's are rather ignored in French pronunciation).
Solid green after retriggering; full try push:

https://tbpl.mozilla.org/?tree=Try&rev=c46e8fcefe05
M4 is definitely green, so at this point we can probably forget about it. There is one strange reftest failure in a clipping-related test, and one M1 crash on OSX, that are worrying; the rest looks like I pulled a bad revision. Another try to confirm... https://tbpl.mozilla.org/?tree=Try&rev=baf34f9edd6a
OK. So there is exactly one thing to worry about here: the Android reftest failures. Looking.

The Mac M1 crash is happening on central and being starred there; the rest is gone, so the previous try push indeed was using a bad revision.
An update. I've been trying to reproduce locally. This does not reproduce on my Nexus S when loaded as a Web page. I tried to properly run reftest on my Nexus S. I hit two bugs, bug 996862 and bug 997144, and at the moment, I still can't.

At the moment I'm making non-DEBUG builds in case this problem might be non-DEBUG-only, and I'm charging an Asus Transformer device in case this problem might be Tegra-only (all our test slaves reproducing this are Tegra-based).
Update: I am able to run reftest locally now, and I still don't reproduce any reftest failure in this directory caused by this patch. I also tried non-DEBUG builds, and builds with NEON disabled.
Wait! New working hypothesis. The failure I'm getting on tbpl is I get RGB values of 90 instead of the expected 255.  Now, 90 == 0x5a.  That's the value that we poison free'd bytes with, right?  So if that's what we're seeing, then we have a use-after-free here, which explains why it's so hard to reproduce, and which unfortunately means that we can't just ignore it.
Try push with Milan's patch to fix imagedecoder refcounting on the off change that it might be greener.... https://tbpl.mozilla.org/?tree=Try&rev=1c6a6dfd06c7
Try pushes with the two hunks separate...

https://tbpl.mozilla.org/?tree=Try&rev=1d4ff086ee05

and

https://tbpl.mozilla.org/?tree=Try&rev=7234c714df1e

also, it seems that what's happening (explaining comment 38) is likely that we're asking skia to leave the alpha channel uninitialized but then we're reading from it, i.e. now suspecting a bug on our side, in the compositor, e.g. if we're using the alpha channel from sampled texels.
Try results are in:
 - the problem is caused by the DrawTargetSkia::Init bit, the other bit is green
 - the problem is not fixed by the fix for bug 994907.

Also, I did some debugging with :jrmuizel, we are probably closing in on the cause for this bug. Even though the canvas has -moz-opaque, the SharedSurface's in the SurfaceStream have alpha; the reason why they have alpha is that this is determined from the fact that the GLContext used by SkiaGL has alpha (that is, its faked framebuffer 0 has alpha). I suppose that since we have multiple-SkiaGL-canvases-on-one-GLContext, this method of propagating the information of whether the canvas has alpha is no longer suitable.
(In reply to Benoit Jacob [:bjacob] from comment #41)
> Try results are in:
>  - the problem is caused by the DrawTargetSkia::Init bit, the other bit is
> green
>  - the problem is not fixed by the fix for bug 994907.
> 
> Also, I did some debugging with :jrmuizel, we are probably closing in on the
> cause for this bug. Even though the canvas has -moz-opaque, the
> SharedSurface's in the SurfaceStream have alpha; the reason why they have
> alpha is that this is determined from the fact that the GLContext used by
> SkiaGL has alpha (that is, its faked framebuffer 0 has alpha). I suppose
> that since we have multiple-SkiaGL-canvases-on-one-GLContext, this method of
> propagating the information of whether the canvas has alpha is no longer
> suitable.

We should be setting the SharedSurface opaque stuff correctly, according to this: http://dxr.mozilla.org/mozilla-central/source/gfx/layers/client/ClientCanvasLayer.cpp#61
Actually, I was just looking at that in a debugger. We do hit the line that you link to,

  caps = GetContentFlags() & CONTENT_OPAQUE ? SurfaceCaps::ForRGB() : SurfaceCaps::ForRGBA();

Which means that my earlier comment was incorrect in this respect.

But GetContentFlags() returns 0, i.e. we don't have the CONTENT_OPAQUE flag. That is the bug.

Anyway; it looks like the path through which we're trying to propagate this bit of information goes through the Layer  (GetContentFlags() here is called on the ClientCanvasLayer). That seems strange: what does the layer, which should be just a placeholder in the layer tree, have to do with the specifics of the canvas' surface? So at the moment I'm trying to understand the current flow of information, and trying to understand what it should be instead. Any help with that is appreciated.
(In reply to Benoit Jacob [:bjacob] from comment #43)
> Actually, I was just looking at that in a debugger. We do hit the line that
> you link to,
> 
>   caps = GetContentFlags() & CONTENT_OPAQUE ? SurfaceCaps::ForRGB() :
> SurfaceCaps::ForRGBA();
> 
> Which means that my earlier comment was incorrect in this respect.
> 
> But GetContentFlags() returns 0, i.e. we don't have the CONTENT_OPAQUE flag.
> That is the bug.
> 
> Anyway; it looks like the path through which we're trying to propagate this
> bit of information goes through the Layer  (GetContentFlags() here is called
> on the ClientCanvasLayer). That seems strange: what does the layer, which
> should be just a placeholder in the layer tree, have to do with the
> specifics of the canvas' surface? So at the moment I'm trying to understand
> the current flow of information, and trying to understand what it should be
> instead. Any help with that is appreciated.

We set the content flags when we create the canvas in CRC2D::GetCanvasLayer. The opaque flag is set in response to mOpaque, which is set when we have the opaque attribute (moz-opaque=true?) or the 'alpha: false' getContext() option.

I still don't see how this can cause a crash, though.
It doesn't cause a crash :) just incorrect rendering, as we end up compositing a RGBA surface with uninitialized alpha.

Thanks for the pointers, btw.
(In reply to Benoit Jacob [:bjacob] from comment #45)
> It doesn't cause a crash :) just incorrect rendering, as we end up
> compositing a RGBA surface with uninitialized alpha.
> 
> Thanks for the pointers, btw.

Oh, I see. So the crash just magically went away? wtf?
The Windows crash? Yes, for all we know, it just magically disappeared. I'm only looking at a reftest failure anymmore, and it only reproduces on Android slaves.
So actually, the bug is right there in CanvasRenderingContext2D::GetCanvasLayer:

4324	  canvasLayer->Initialize(data);
4325	  uint32_t flags = mOpaque ? Layer::CONTENT_OPAQUE : 0;
4326	  canvasLayer->SetContentFlags(flags);

At line 4324 we call ClientCanvasLayer::Initialize, which reads the mContentFlags to decide the SharedSurface format (the line of code that I pasted in comment 43).

But it's only just below, at line 4326, that we actually call SetContentFlags to give the mContentFlags the correct value.

So by the time we create the SharedSurface in ClientCanvasLayer::Initialize, we don't have the correct mContentFlags yet.
This isn't a "nice" patch, but rather a "minimal" patch.

There needs to be a follow-up bug for getting us to the "nice" place.

The problem is that the bit of information "this canvas does not have alpha" is transported through a long chain of places:

HTMLCanvasElement -> CanvasRenderingContext2D -> ClientCanvasLayer -> CanvasClient -> SharedSurface

To get a "minimal" patch here, I'm not changing what this chain looks like. The step where this bit of information was lost, was the CanvasRenderingContext2D -> ClientCanvasLayer step. That is CanvasRenderingContext2D::GetCanvasLayer. The way that it talks to the ClientCanvasLayer is by ClientCanvasLayer::Initialize(CanvasLayer::Data). So to do my minimal fix, I'm adding a mHasAlpha field to CanvasLayer::Data. You'll notice that CanvasLayer::Data already has a bunch of fields that make mHasAlpha not seem out of place there. So hopefully that's OK for a minimal patch.

The nicer patch should remove the Layer altogether from this chain and make the CanvasRenderingContext2D talk directly to the CanvasClient. But that's a much bigger project, I'm afraid.
Attachment #8408518 - Flags: review?(jmuizelaar)
Landed the first part i.e. Snorp's original patch:
https://hg.mozilla.org/integration/mozilla-inbound/rev/05d2713df1fd
Whiteboard: [leave open]
Comment on attachment 8408518 [details] [diff] [review]
unfuck-canvas-alpha

Review of attachment 8408518 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/Layers.h
@@ +1836,2 @@
>  public:
>    struct Data {

Add a comment about how this struct probably isn't a good idea.

::: gfx/layers/client/ClientCanvasLayer.cpp
@@ +64,2 @@
>      }
> +    MOZ_ASSERT(caps.alpha == aData.mHasAlpha);

Add a comment about the redundant information
Attachment #8408518 - Flags: review?(jmuizelaar) → review+
Still orange /o\

At least now we know where to add fprintf's to the reftest log... will do that next.
Landed the canvas alpha propagation patch, as it still fixes a bug we know we had...

https://hg.mozilla.org/integration/mozilla-inbound/rev/927185688195

The bug assignee didn't login in Bugzilla in the last 7 months, so the assignee is being reset.

Assignee: jacob.benoit.1 → nobody
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.