Closed Bug 995216 Opened 7 years ago Closed 7 years ago

Reftest failures with BasicCompositor enabled on Linux

Categories

(Core :: Graphics: Layers, defect)

All
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla32

People

(Reporter: cwiiis, Assigned: cwiiis)

References

()

Details

Attachments

(2 files)

Some reftests fail when enabling the BasicCompositor on Linux. Some were due to mask drawing errors, which are fixed in bug 993475, the others are small pixel difference that I think we should just fuzz.
There are 6 remaining tests that fail with the BasicCompositor.

They all have 100 pixels that differ by 1, so I assume this is a layer edge and the difference is down to blending and a tiny rounding error, or something along these lines. I don't think this difference indicates the test failing, so let's fuzz it.
Attachment #8405474 - Flags: review?(ehsan)
I'm not going to be here to deal with this for 2 weeks, so I hope it's not too much of an imposition to reassign this to you, nical :)
Assignee: chrislord.net → nical.bugzilla
Comment on attachment 8405474 [details] [diff] [review]
Fuzz the remaining tests that fail

Fuzzing 100px seems like a lot.  I'm not sure why this is necessary but I'm not comfortable r+ing this without at least understanding why we're doing this.
Attachment #8405474 - Flags: review?(ehsan)
Attaching a link to the reftest failure.

To answer comment #3, I think this failure isn't indicating that the thing being testing is failing, so much as there's a small inaccuracy in compositing. The error runs along the edge of something (I assume that there's a layer border there), and has a difference that isn't visible to the naked eye.

Given the amount of these small inaccuracies we've fuzzed elsewhere, I didn't think it would be worth trying to track down the exact cause, as it may well be something we can't easily fix - on the other hand, others may disagree.

needinfo mattwoodrow for comment.
Flags: needinfo?(matt.woodrow)
Comment on attachment 8405474 [details] [diff] [review]
Fuzz the remaining tests that fail

Actually, looking at the try log here <https://tbpl.mozilla.org/?tree=Try&rev=48eeeaa9357d>, this seems like an actual bug.  These tests animate the green/blue box from the far left edge of the screen towards the right, and it seems like one row of pixels on the far left edge get corrupted after the bug moves away.
Attachment #8405474 - Flags: review-
Talking to Ehsan on IRC, these tests animate a coloured box from left to right. So this test failing indicates that part of the screen that should update, hasn't, I suppose.

Taking this into consideration, I think we should fix this... I don't like the idea of animations leaving behind transient artifacts, even if Chrome does it all the time :p
s/row/column/!
Debugging this, I can see nothing wrong with any of the visible or invalidated regions during this test (which I've modified to be a bit simpler, but still reproduces the error).

While I haven't exhausted this avenue yet, it would indicate that the problem is somehow in composition rather than painting (which makes sense, given the error doesn't occur until we use the basic compositor).

Tomorrow, I'll confirm that it's the moving of the content and not just it being on a separate layer that triggers the error, then I'll start stepping through the compositor code.

Any ideas that will save me some debugging time are welcome of course.
The error does seem to be in composition. I've confirmed this stuff gets drawn with the third DrawQuad call after a BeginFrame call with an mCopyTarget comes in for this test. However I can't locally reproduce a problem with that. :)
I altered the test to not actually move the block, but just set the left css attribute, which causes the layerisation and triggers the problem.

First draw:

 0:01.26 XXX DrawQuad (0.00, 0.00, 800.00x108.00)
 0:01.26 XXX DrawQuad (0.00, 108.00, 108.00x100.00)
 0:01.26 XXX DrawQuad (208.00, 108.00, 592.00x100.00)
 0:01.26 XXX DrawQuad (0.00, 208.00, 800.00x792.00)
 0:01.26 XXX DrawQuad (0.00, 0.00, 100.00x100.00)
 0:01.26 XXX DrawQuad (0.00, 0.00, 800.00x108.00)
 0:01.26 XXX DrawQuad (0.00, 108.00, 108.00x100.00)
 0:01.26 XXX DrawQuad (208.00, 108.00, 592.00x100.00)
 0:01.26 XXX DrawQuad (0.00, 208.00, 800.00x792.00)
 0:01.26 XXX DrawQuad (0.00, 0.00, 100.00x100.00)

Second draw:

 0:01.46 XXX DrawQuad (0.00, 0.00, 800.00x108.00)
 0:01.46 XXX DrawQuad (0.00, 108.00, 108.00x100.00)
 0:01.46 XXX DrawQuad (208.00, 108.00, 592.00x100.00)
 0:01.46 XXX DrawQuad (0.00, 208.00, 800.00x792.00)
 0:01.46 XXX DrawQuad (0.00, 0.00, 100.00x100.00)

Third draw:

 0:01.56 XXX DrawQuad (0.00, 0.00, 800.00x108.00)
 0:01.56 XXX DrawQuad (0.00, 108.00, 108.00x100.00)
 0:01.56 XXX DrawQuad (208.00, 108.00, 592.00x100.00)
 0:01.56 XXX DrawQuad (0.00, 208.00, 800.00x792.00)
 0:01.56 XXX DrawQuad (0.00, 0.00, 100.00x100.00)

At no point does it explicitly draw down that edge, or do any series of draws that would leave a sliver like that, which would indicate some kind of rounding error in the transform matrix causing this... But that doesn't explain why the colour is fe0000, and not, say, ff0101 (the two adjacent colours are pure white and pure red). I guess some blending with the blue or green, which don't contain any red component of course, could explain it getting darker, but the blue/green rectangles are never near that part of the screen.

A second possibility could be an inaccurate clip rect causing drawing to be clipped at the sub-pixel level, causing blending with the empty buffer (perhaps black?) and the colour - this would explain the pixel getting darker and it being along the edge of a draw (though an inaccurate transform matrix could be the cause for the same reason too).

Still digging.
So, the transform used during DrawQuad at no point in this test has a non-integer translation, and removing the clipping calls entirely makes no difference... Which kind of blows all my hypotheses out of the water.

This is, however, looking more and more likely to be a cairo bug. While getting to the bottom of this and fixing it would be ideal, I think it'd be safe to fuzz it and open a bug to deal with this later, rather than block the basic compositor entirely.
After some pain, I can confirm that the errant pixels (which are now coming up as ff0101 for me for some reason... I can't remember changing anything :/) do not exist in the source surface, where they are the correct ff0000 - so the DrawSurfaceWithTextureCoords, somehow when compositing the texture source, ends up modifying the colour, likely due to some kind of misalignment from some rounding error somewhere.

I'm going to dig through that now and see if I can find exactly where/why that's happening. Everything is still pointing to some kind of rounding error atm.

I still think it's reasonable to fuzz this, but I'll see what I can do before tomorrow afternoon.
This is the simplest fix for the error introduced, but I'm going to experiment a bit and see if I can just get the calculations not to introduce so much error as to result in sampling neighbouring pixels.
Comment on attachment 8415984 [details] [diff] [review]
Simplest fix for rounding error

Review of attachment 8415984 [details] [diff] [review]:
-----------------------------------------------------------------

r+ for this patch fwiw.
Attachment #8415984 - Flags: review+
Unfortunately there doesn't seem to be a simple way of increasing the accuracy here - it's lost too far up the stack... It may be we just have to go for the simple fix (I'll add a comment about it before committing).

Random thought: I wonder if we want to fix this by reducing cairo's sample resolution, if that's possible - we can't possibly need the kind of detail that 5 significant digits is affecting during composition... But I suppose long-term we'll be moving to Skia anyway and I doubt it has the same issue.

I'll have a look at this again tomorrow and more than likely just commit the simple fix.
(In reply to Chris Lord [:cwiiis] from comment #15)
> Unfortunately there doesn't seem to be a simple way of increasing the
> accuracy here - it's lost too far up the stack... It may be we just have to
> go for the simple fix (I'll add a comment about it before committing).
> 
> Random thought: I wonder if we want to fix this by reducing cairo's sample
> resolution, if that's possible - we can't possibly need the kind of detail
> that 5 significant digits is affecting during composition... But I suppose
> long-term we'll be moving to Skia anyway and I doubt it has the same issue.
> 
> I'll have a look at this again tomorrow and more than likely just commit the
> simple fix.

We shouldn't need to reduce cairo's sampling resolution. I'd say they probably have an incorrect bias that causes the error in the first place. 1e-5 is unlikely to actually cause a pixel to be shifted by more than 1/512th when sampling.
ugh, ok, so 1e-5 was enough for the one test that I was checking, but 1e-4 is required to make them all pass... I'll follow up with Bas on IRC.
Assignee: nical.bugzilla → chrislord.net
Forgot about this, pushed to inbound:

https://hg.mozilla.org/integration/mozilla-inbound/rev/b422e91ad2c9
Flags: needinfo?(matt.woodrow)
https://hg.mozilla.org/mozilla-central/rev/b422e91ad2c9
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla32
You need to log in before you can comment on or make changes to this bug.