Open Bug 637828 Opened 11 years ago Updated 3 years ago

mozilla.org project transitions show graphics corruption

Categories

(Core :: Graphics, defect)

defect
Not set
normal

Tracking

()

People

(Reporter: stechz, Assigned: jrmuizel)

References

Details

Attachments

(4 files, 3 obsolete files)

Attached image Screenshot
Seems to only reproduce on Android.

Steps:
1. Navigate to http://mozilla.org
2. Wait for transition from Firefox to Thunderbird

There are graphics corruption issues after the fade is finished. See attachment.
tracking-fennec: --- → ?
This only seems to happen for smaller scales? After a certain amount of zooming in, this problem doesn't occur. I'm assuming the magic resolution range for this corruption is <=1.0.
I see this on desktop in portrait orientation.
Assignee: nobody → jones.chris.g
STR in test-ipcbrowser
 (1) Load http://mozilla.org
 (2) Set resolution to 0.5/0.5

Any resolution <1.0 seems to tickle the bug.  Curiously, the corruption is
 - ABSENT in a vanilla desktop-firefox build
 - ABSENT in a vanilla desktop-firefox build with MOZ_LAYERS_FORCE_SHMEM_SURFACES=1
 - ABSENT in a vanilla desktop-fennec build (guessing from comment 1)
 - present in desktop-fennec build with --enable-mobile-optimize

So it looks like the secret sauce needed to trigger this is --enable-mobile-optimize.  The only significant difference between that and MOZ_LAYERS_FORCE_SHMEM_SURFACES=1 in a vanilla build wrt shadow layers is the image resampling algorithms used.  (The only other difference is that slightly different code is used to composite layers to the Gtk2 window, but since this bug appears on android too I think we can rule that out.)

This smells like a pixman bug :/.
This makes the bug go away

diff --git a/gfx/layers/ipc/ShadowLayers.cpp b/gfx/layers/ipc/ShadowLayers.cpp
--- a/gfx/layers/ipc/ShadowLayers.cpp
+++ b/gfx/layers/ipc/ShadowLayers.cpp
@@ -393,21 +393,17 @@ ShadowLayerForwarder::GetParentBackendTy
   return mParentBackend;
 }
 
 static gfxASurface::gfxImageFormat
 OptimalFormatFor(gfxASurface::gfxContentType aContent)
 {
   switch (aContent) {
   case gfxASurface::CONTENT_COLOR:
-#ifdef MOZ_GFX_OPTIMIZE_MOBILE
-    return gfxASurface::ImageFormatRGB16_565;
-#else
     return gfxASurface::ImageFormatRGB24;
-#endif
   case gfxASurface::CONTENT_ALPHA:

Looks like pixman is choking somewhere trying to composite to/from 565 surfaces.  I'd guess a problem with rgba32->565.  Digging.
OS: Android → All
This bug is gone when I --enable-system-cairo for a local cairo 1.10 build with my distro's pixman (0.18.4).

Halp!  Do we cherry-pick upstream patches in this situation?
Judging by the date of the git commit we've got in-tree, it looks we're on pixman ~0.18.2.  Is that right?
The pixman in tree should be from between 0.19.4 and 0.19.6. Does the patch on bug 604168 fix it for you?
Yeah, I got confused by an out-of-date README and then sidetracked for a bit.  pixman-version.h says 0.19.1, FWIW.

I will give that a shot.
Bad news: I tried a local build of cairo 1.10 with a local build of pixman 0.21.2 (being careful to check LD_DEBUG), and the bug was gone.  That's sort of good.

But, with a local build of pixman 0.18.4 running with local cairo 1.10, the bug was also gone.  0.18.4 looks to be older than our in-tree pixman.  That means the remaining possible culprits are
 - cairo 1.9 vs 1.10 (sigh)
 - some patch we have applied to pixman in-tree
 - some patch we have applied to cairo in-tree

(The bug also doesn't appear with in-tree cairo+pixman when I force the Xlib backend to use rgb565 surfaces instead of rgb24, but that's not too surprising.)
In-tree cairo claims to be 1.9.5 (cairo-features.h.in), not sure how accurate that is.  I tried some cairo versions around there

1.9.2 - no tee backend
1.9.4 - crash
1.9.6 - crash
1.9.8 - OK

So, maybe something was fixed around 1.9.6-1.9.8.
I spent some time looking over the cairo log to see if there was anything that looked related to this bug.  Nothing /really/ stood out, though there were a few patches that looked somewhat relevant.  I tried applying them and deps, but turns out that some quite substantial cairo changes, many of them to cairo-image-surface, landed the day after the rev we imported last.  (Not coincidentally, I imagine! :) .)  So attempting to cherry-pick looks to be a lost cause, at least with my cairo-fu.

One of the substantial changes was http://cgit.freedesktop.org/cairo/commit/?id=b9407af6a4bc792c1bcb52c90aa8a618627bb618, referencing the tantalizing goodies

image-rgba firefox-talos-gfx-0 342050.17 (342155.88 0.02%) -> 69412.44 (69702.90 0.21%): 4.93x speedup
███▉
[snip]
image-rgba firefox-planet-gnome-0 217512.61 (217636.80 0.06%) -> 123341.02 (123641.94 0.12%): 1.76x speedup
▊

It's of course unclear how those numbers would translate to m-c tip.  This probably wasn't all that appealing in the past because we haven't used the image backend for anything "serious".  We are now, on mobile.  I don't think this is news to anyone here.

So, I would be remiss to not pursue this: What would it take to land cairo 1.10 for fennec 4, if *only* for fennec 4?  We have a patch in bug 562746.  We can land cairo 1.10 on a branch if need be.  Some folks who can hack on regression fixes are clear of their FF4 blockers (like me).

The alternative, possibly leaving this bug and maybe others unfixed and leaving possibly significant perf wins on the table, because we end up shipping a year-out-of-date cairo, doesn't sound very good.

Let the flames begin!
Tentatively setting dep; a cairo expert could probably directly patch this bug or cherry-pick a fix, but that would take some time for me and I'd rather settle comment 11 first.
Depends on: 562746
Another option which I believe has been bounced around before is shipping a libcairo.so built from scratch and building --enable-system-cairo.  This shouldn't be a problem on android since we don't have to worry about conflicting symbols.
That sounds like a reasonable plan
(In reply to comment #11)
> So, I would be remiss to not pursue this: What would it take to land cairo 1.10
> for fennec 4, if *only* for fennec 4?  We have a patch in bug 562746.  We can
> land cairo 1.10 on a branch if need be.  Some folks who can hack on regression
> fixes are clear of their FF4 blockers (like me).

That's a great idea in general. I think that cairo 1.10.x should be now in a *much* better shape than 1.9 snapshots. Once the release schedule for cairo 1.10 had been announced, a lot of people joined testing efforts and found numerous bugs which are now fixed. Also cairo 1.10 now got into some linux distros where additional testing and bugfixing happened.

Regarding fennec, special care needs to be taken not to forget doing cairo_surface_flush/cairo_surface_mark_dirty calls in places where necessary:
http://cairographics.org/manual-1.0.2/cairo-cairo-surface-t.html#cairo-surface-mark-dirty
http://cairographics.org/manual-1.0.2/cairo-cairo-surface-t.html#cairo-surface-flush

New version of cairo is less forgiving in the cases when this is not done properly (see bug 616700 as an example). Also these flush/dirty calls are only relevant for image backend and are for example noops for xlib, so the bugs may show up in fennec only.

Also in the linux world, the in-tree pixman/cairo copies are irrelevant because everyone is using system libraries anyway.
(In reply to comment #15)
> Regarding fennec, special care needs to be taken not to forget doing
> cairo_surface_flush/cairo_surface_mark_dirty calls in places where necessary:
> http://cairographics.org/manual-1.0.2/cairo-cairo-surface-t.html#cairo-surface-mark-dirty
> http://cairographics.org/manual-1.0.2/cairo-cairo-surface-t.html#cairo-surface-flush

And I think that in order to make this missing cairo_surface_flush/cairo_surface_mark_dirty calls hunt more efficient, the use of valgrind API could be used. We need to track any accesses to surface memory buffers performed by anything other than cairo/pixman and happening at a wrong time (not between flush and dirty calls). And this is exactly what valgrind can do easily.
I'm not confident about upgrading cairo. We would be adding a lot of unknowns for Fennec's release candidate. Updating cairo is the kind of thing we want to do at the beginning of the release cycle.

Are there any other solutions that can circumvent this bug?
Siarhei: I've found several places where we are not flushing and marking dirty. I could use some help figuring out how to set up valgrind to find the instances we care about.
It should be possible to patch cairo to make use of VALGRIND_MAKE_MEM_DEFINED and VALGRIND_MAKE_MEM_NOACCESS or VALGRIND_MAKE_MEM_UNDEFINED client requests:
    http://valgrind.org/docs/manual/mc-manual.html

So that 'cairo_image_surface_create_for_data' and 'cairo_surface_mark_dirty' would use VALGRIND_MAKE_MEM_NOACCESS on the surface memory buffer and 'cairo_surface_flush' would use VALGRIND_MAKE_MEM_DEFINED. After this is done, valgrind would complain about any attempts to access this buffer when it is not marked as 'defined'. Additionally either every cairo function also should mark buffer as defined on entry and restore its original state on exit in order not to catch itself, or just the valgrind log needs to be filtered to ignore any memory accesses which have cairo functions showing up in a backtrace.
Is there any easy function to call that gives me the number of bytes for an image, given its format, width, height, and stride?
No, need to compute from GetSize() and BytePerPixelFromFormat(). But total size isn't typically useful information on its own for an arbitrary image surface, except for accounting; what are you trying to do?
VALGRIND_MAKE_MEM_DEFINED needs to know the size of the memory region.
Yeah, you can't do that unless the stride is the width*bpp.  Need to mark images row-by-row otherwise.
(Probably simplest just to use a single row-by-row impl, since this isn't terribly perf critical.)
tracking-fennec: ? → 2.0+
If someone can come up with a reduced test case, that might help here.
I tried for a bit, but this page is animating using jquery junk and I couldn't reproduce the problem with tests that just animated the opacity of <img>'s and <div>'s and text.  In fact, it was after I tried to make a small test case that I asked you about cairo-trace last night ;).

I'm sure it's doable with some work.
I think the key features that tickle this bug are:
1) Something with opacity and gradients
2) Some sort of timed invalidation involving these gradients
Whiteboard: [needs-cairo-update]
Removing the use of slow operation 'SRC with a mask' helps this problem.
Attached patch Remove unneeded optimization (obsolete) — Splinter Review
I don't see any reason to use SOURCE here. Pixman will already turn OVER with RGB24 source into COPY when it can. Plus if we use SOURCE with a MASK it will turn into a very slow path, including hitting pixman's general code.
Attachment #516979 - Flags: review?(roc)
Attachment #516979 - Attachment is obsolete: true
Attachment #516979 - Flags: review?(roc)
Attachment #516983 - Flags: review?(roc)
This patch sounds good on its own, but is the underlying bug something we're likely to trip on elsewhere?
Jeff tells me that the ADD composite operation seems to be wrong, but it might be the input to that operation in clip_and_composite_source.
Comment on attachment 516983 [details] [diff] [review]
Version that builds

This is removing an optimization that was added in bug 418494 - an optimization that Jeff actually questioned in bug 418494 comment 4. Further, Jeff tells me that pixman is already smart enough to optimize to source when it makes sense, like when we have no alpha channel. Finally, Jeff also tells me that using source requires us to draw through a mask, while over can be a lot faster.

So, as long as this doesn't regress us on Try, I'm happy with this patch!
Attachment #516983 - Flags: review?(roc) → review+
Whiteboard: [needs-cairo-update] → [can-land]
(In reply to comment #32)
> Jeff tells me that the ADD composite operation seems to be wrong, but it might
> be the input to that operation in clip_and_composite_source.

I've confirmed that inputs to the composite operation
at http://mxr.mozilla.org/mozilla-central/source/gfx/cairo/cairo/src/cairo-surface-fallback.c#365 are fine and the result is wrong.

This suggests that compositing a scaled RGB24 source with a A8 mask, the ADD operation and a RGB565 destination surface is broken.

Knowing this should make this much easier to fix. Unfortunately, I may not be able to debug this soon (before Monday) So I'd be happy if someone else took a look.
This is exciting news! I'm no expert, so if anyone else wants to look please do. In the mean time, I think I'll poke around...
BTW, there is still another corruption bug, even when all paths to cairo-surface-fallback.c:365 are gone. You can see it on the download button. If these are the same bug, perhaps there's an underlying draw path to blame.
No luck debugging this. We'll need someone else to take a look.
I tried reproducing the problem with a standalone test case. So either, there's something else special going on here or I missed something.
Note that I haven't seen any evidence of try Talos results, which is what I want before landing this change.
Whiteboard: [can-land] → [can-land][can land only if it passed try]
The problem is:
the destination pixman image has a transform.

Cairo does not expect this.
(In reply to comment #40)
> The problem is:
> the destination pixman image has a transform.
> 
> Cairo does not expect this.

It seems like this happens when we use the destination as a transformed source. The solution to this problem, is not obvious. Upstream has completely rewritten this code.
This fixes the problem for me. Hopefully, it fixes all the other problem too.
Assignee: jones.chris.g → jmuizelaar
Attachment #517529 - Flags: review?
Comment on attachment 517529 [details] [diff] [review]
Reset the transform on the destination

too much patch
Attachment #517529 - Flags: review? → review-
This time without the debugging code.
Attachment #517529 - Attachment is obsolete: true
Attachment #517535 - Flags: review?
Comment on attachment 517535 [details] [diff] [review]
Reset the transform on the destination v2

just enough patch
Attachment #517535 - Flags: review? → review+
It seems like the code at fault here is really pixman. I think that pixman should probably not be using a transformed fetch on the destination image under any circumstances.

For example, in this case we're fetching destination pixels from a different part of the image than we're storing them to. I can't see any reason for wanting this behaviour.
This fixes everything that updating cairo fixed for me! Let's get both patches in!
(In reply to comment #47)
> This fixes everything that updating cairo fixed for me! Let's get both patches
> in!

I'm not in as much of a rush to take the first patch because of any potential risk. If someone can show a measurable difference, it would make it more attractive.
No longer depends on: 638594
Blocks: 616638
Here's a final version ready for pushing.
Attachment #517535 - Attachment is obsolete: true
Attachment #517578 - Flags: review+
Whiteboard: [can-land][can land only if it passed try] → [can-land]
Attachment #516983 - Flags: approval2.0-
Duplicate of this bug: 616638
Duplicate of this bug: 610344
Keywords: checkin-needed
Whiteboard: [can-land]
Pushed http://hg.mozilla.org/mozilla-central/rev/6dab3f1dfa0e

Leaving open for other patch.
tracking-fennec: 2.0+ → ---
Keywords: checkin-needed
(In reply to comment #46)
> It seems like the code at fault here is really pixman. I think that pixman
> should probably not be using a transformed fetch on the destination image under
> any circumstances.

I wonder if this is actually fixed in pixman already: http://cgit.freedesktop.org/pixman/commit/?id=7f4eabbeec92e55fd8f812c0e5d8568eacbb633d
Should we also take this on the relbranch?
blocking2.0: --- → ?
Whiteboard: [asking for .x]
Removing - jrmuizel tells me I'm seeing a different bug!
blocking2.0: ? → ---
Whiteboard: [asking for .x]
(In reply to comment #54)
> Another use case: "ghost pixels" when panning regular text

I've filed bug 641506 for that issue, which seems to be newer than this bug.
You need to log in before you can comment on or make changes to this bug.