Closed Bug 907463 Opened 11 years ago Closed 11 years ago

Deal properly with IPDL abnormal shutdowns and fix some other OMTC d3d11 reftest bugs

Categories

(Core :: Graphics: Layers, defect)

x86_64
Windows 7
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla28

People

(Reporter: nrc, Assigned: nrc)

References

Details

Attachments

(3 files, 6 obsolete files)

We have a whole bunch of reftest failures for d3d11 OMTC (between 18 and 24 depending on the platform). See https://tbpl.mozilla.org/?tree=Try&rev=8ce25d14a7a2 (but ignore the Win XP line).

There are (at first glance) three categories:

Large canvases, which usually pass but occasionally fail. Mark as random.
Gradients which are hard alternations. Mark as fail, see bug 582236.
Anti-aliasing errors for text. Not sure what to do here. I would like to understand the cause and why it does not manifest using MTC.
Attached patch patch 1: large canvases (obsolete) — Splinter Review
This makes me sad because I put a lot of work in to make this work, but it only works like 80% of the time.
Attachment #793191 - Flags: review?(bas)
Attached patch patch 2: gradients (obsolete) — Splinter Review
Attachment #793198 - Flags: review?(bas)
https://tbpl.mozilla.org/?tree=Try&rev=8ce25d14a7a2

So the remaining reftests are all text. They all seem to be off by fuzz in anti-aliasing. I don't see that we are going down different rendering paths for the different cases so that suggests that maybe DirectWrite is non-deterministic. But we don't have these problems with the current code. Also it seems like a very similar set of tests fail on each run and with very, very similar numbers of wrong pixels, so things are a little predictable.

Any ideas for what is causing this? And for what we should do about it?
Flags: needinfo?(jdaggett)
Flags: needinfo?(bas)
Comment on attachment 793191 [details] [diff] [review]
patch 1: large canvases

Review of attachment 793191 [details] [diff] [review]:
-----------------------------------------------------------------

Wait... why does it only work 80% of the time, large canvases not always working is bad no?
(In reply to Bas Schouten (:bas.schouten) from comment #4)
> Comment on attachment 793191 [details] [diff] [review]
> patch 1: large canvases
> 
> Review of attachment 793191 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Wait... why does it only work 80% of the time, large canvases not always
> working is bad no?

As far as I can tell, it works 100% of the time in real life (subject to OOM, but there is not much we can do about that). It only passes tests 80% of the time, I presume also due to memory, but I'm not really sure why.
(In reply to Nick Cameron [:nrc] from comment #2)
> Created attachment 793198 [details] [diff] [review]
> patch 2: gradients

I need to understand why this happens, does this happen on current trunk D3D11 OMTC?

I'll have a look at the text as well, I suspect maybe we're not setting the ALLOW_SUBPIXEL_AA (or whatever the flag's called) correctly.
(In reply to Bas Schouten (:bas.schouten) from comment #6)
> (In reply to Nick Cameron [:nrc] from comment #2)
> > Created attachment 793198 [details] [diff] [review]
> > patch 2: gradients
> 
> I need to understand why this happens, does this happen on current trunk
> D3D11 OMTC?
> 

Yes, all these are for current trunk.

> I'll have a look at the text as well, I suspect maybe we're not setting the
> ALLOW_SUBPIXEL_AA (or whatever the flag's called) correctly.

Thanks!
(In reply to Bas Schouten (:bas.schouten) from comment #6)
> I'll have a look at the text as well, I suspect maybe we're not setting the
> ALLOW_SUBPIXEL_AA (or whatever the flag's called) correctly.

So, the only place we set the Moz2D aa flag from client layers is http://mxr.mozilla.org/mozilla-central/source/gfx/layers/client/ClientThebesLayer.cpp#31. That looks like it is doing the right thing to me. And whenever we use it, it seems to be doing the right thing. But, a lot of the test failures don't have any transparency so we don't set the flag in either test or ref cases. In fact most of the failing tests don't use subpixel AA, so I suspect this flag is not the issue. (Looking at the failures in the reftest analyser, one or two have errors in colour fringing, but most do not.)
I just noticed that we are failing a box shadow test on Win8 only. We seem to be one rgb value out on every pixel on two lines at the edge of the shadow. Could it be a d2d 1.1 issue?

(Try push is here https://tbpl.mozilla.org/?tree=Try&rev=8dc6f5c64683).
The font errors are due to Azure vs Thebes rendering. When we render the first (test) page we have no texture clients to test, so we have to assume we can't use Azure. The next time around we figure out that we can use Azure so do. This leads to small errors when rendering lines at non-integer offsets.

Matt Woodrow is very close to getting Azure/Cairo done. So I think the best fix for this is to just use Azure for everything and get rid of the Thebes paths (which we want to do anyway). Just waiting for some Try runs to confirm that is bug free. Then we can rip out Thebes from ThebesLayerBuffer and these bugs should all just disappear.
Flags: needinfo?(jdaggett)
Flags: needinfo?(bas)
Bug 914505 fixes most of these failures. That will land sooner than Azure everywhere :-)
Depends on: 914505
Attachment #793191 - Attachment is obsolete: true
Attachment #793191 - Flags: review?(bas)
Attachment #793198 - Attachment is obsolete: true
Attachment #793198 - Flags: review?(bas)
Depends on: 907926
Attached patch revert OP_SOURCE optimisatin (obsolete) — Splinter Review
I confirmed that the patch with these changes regressed the OMTC win7 reftests. The only way I can see why is these changes. I'm not really sure why. It looks like we are blending in a tiny amount from outside our rect, but I checked all the sizes, positions, and transforms, and everything is an integer, so I am not sure how that could happen. I prefer to lose this optimisation and get OMTC turned on. We can always come back to this if we have more ideas.
Attachment #805858 - Flags: review?(matt.woodrow)
Comment on attachment 805858 [details] [diff] [review]
revert OP_SOURCE optimisatin

Review of attachment 805858 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry, but this isn't an optimization. It's required for correct rendering with BasicLayers.
Attachment #805858 - Flags: review?(matt.woodrow) → review-
(In reply to Matt Woodrow (:mattwoodrow) from comment #13)
> Comment on attachment 805858 [details] [diff] [review]
> revert OP_SOURCE optimisatin
> 
> Review of attachment 805858 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Sorry, but this isn't an optimization. It's required for correct rendering
> with BasicLayers.

Why?

Could we keep OP_SOURCE and still clear the surface?
Clearing the surface means that we'll draw the window pixels in two passes. It's possible for the pixels to get presented to the screen in the middle of that, resulting in flashing colors.

We really need things to be 'atomic'.
We are failing this test for OMTC Windows due to OOMing. Running this test individually (locally) is fine. However, running it multiple times causes us to fail with the same logs and result as on Try. This is because we can't allocate more memory. I believe because the previously created canvases are referenced by JS and need to be GCed - waiting for a short time then refreshing brings success. This is only an issue on OMTC because we need to allocate memory to transport the canvas output from main thread to compositor and that is where we fail - we have enough memory for the canvas itself.

We don't crash when runnning this test locally or on Try (a fix in another bug), and using random-if rather than skip-if means we continue to test for crashes.
Attachment #806422 - Flags: review?(roc)
Comment on attachment 805858 [details] [diff] [review]
revert OP_SOURCE optimisatin

Review of attachment 805858 [details] [diff] [review]:
-----------------------------------------------------------------

Trailing whitespace!
Attachment #805858 - Flags: review- → review+
Blocks: 917658
(In reply to Nick Cameron [:nrc] from comment #16)
> Created attachment 806422 [details] [diff] [review]
> ignore result of very big canvas test
> 
> We are failing this test for OMTC Windows due to OOMing. Running this test
> individually (locally) is fine. However, running it multiple times causes us
> to fail with the same logs and result as on Try. This is because we can't
> allocate more memory. I believe because the previously created canvases are
> referenced by JS and need to be GCed - waiting for a short time then
> refreshing brings success. This is only an issue on OMTC because we need to
> allocate memory to transport the canvas output from main thread to
> compositor and that is where we fail - we have enough memory for the canvas
> itself.
> 
> We don't crash when runnning this test locally or on Try (a fix in another
> bug), and using random-if rather than skip-if means we continue to test for
> crashes.

Some of our tests use SpecialPowers.forceGC();
Would that be enough to fix the failing test here?
(In reply to Nicolas Silva [:nical] from comment #18)
> (In reply to Nick Cameron [:nrc] from comment #16)
> > Created attachment 806422 [details] [diff] [review]
> > ignore result of very big canvas test
> > 
> > We are failing this test for OMTC Windows due to OOMing. Running this test
> > individually (locally) is fine. However, running it multiple times causes us
> > to fail with the same logs and result as on Try. This is because we can't
> > allocate more memory. I believe because the previously created canvases are
> > referenced by JS and need to be GCed - waiting for a short time then
> > refreshing brings success. This is only an issue on OMTC because we need to
> > allocate memory to transport the canvas output from main thread to
> > compositor and that is where we fail - we have enough memory for the canvas
> > itself.
> > 
> > We don't crash when runnning this test locally or on Try (a fix in another
> > bug), and using random-if rather than skip-if means we continue to test for
> > crashes.
> 
> Some of our tests use SpecialPowers.forceGC();
> Would that be enough to fix the failing test here?

Probably, but that is only available in mochitests (afaik) and this is a reftest.
See https://bugzilla.mozilla.org/show_bug.cgi?id=910962#c5 for why we should do this.

Doing this prevents an assertion in bc mochitests where one of our processes (child) crashes and we try to release shmem in the parent, but IPDL has already taken care of it.
Attachment #808094 - Flags: review?(nical.bugzilla)
Comment on attachment 808094 [details] [diff] [review]
Do not release shmems if our IPDL tree is going to be destroyed

Review of attachment 808094 [details] [diff] [review]:
-----------------------------------------------------------------

I would much much prefer to propagate an "OnActorDestroy()" hook (or something of the like) on all compositable/texture clients, rather than having using the notion of Shmem in all of these classes.

::: gfx/layers/client/CanvasClient.h
@@ +88,5 @@
>    virtual void OnDetach() MOZ_OVERRIDE
>    {
>      mBuffer = nullptr;
>    }
> +  

nit: trailing space
I used OnActorDestroy and then changed to ForgetShmem, to make it clearer what was going on, but I don't feel strongly either way. This version uses OnActorDestroy (I do want to use that rather than just ActorDestroy to make clear that we don't think we are overriding the IPDL method).
Attachment #808094 - Attachment is obsolete: true
Attachment #808094 - Flags: review?(nical.bugzilla)
Attachment #808333 - Flags: review?(nical.bugzilla)
Comment on attachment 808333 [details] [diff] [review]
Do not release shmems if our IPDL tree is going to be destroyed

Review of attachment 808333 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks!
Attachment #808333 - Flags: review?(nical.bugzilla) → review+
(In reply to Ed Morley [:edmorley UTC+1] from comment #26)
> Also failures on crashtest-ipc:
> https://tbpl.mozilla.org/php/getParsedLog.php?id=28461128&tree=Mozilla-
> Inbound

Annoyingly Cipc was not working properly when I did my Try push a week or so ago. Sigh.
Attached patch patch 3 v2Splinter Review
Asking for re-review because I changed quite a few things - now destroys client side as well as host, only forgets on abnormal shutdown, etc.
Attachment #808333 - Attachment is obsolete: true
Attachment #820861 - Flags: review?(nical.bugzilla)
Attachment #820861 - Flags: review?(nical.bugzilla) → review+
https://hg.mozilla.org/mozilla-central/rev/8e1b8db80906
https://hg.mozilla.org/mozilla-central/rev/b89258451b74
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla27
This might have caused bug 930575. If it did, then this needs to be backed out.
Bug 930575 still occurs after the backout, so that did not address the graphical scrolling issue which is still being seen on Buri using:

Gaia   afbf45f26a73b7cd5e0a831bea48087331975286
SourceStamp 2f2a45f04e7c
BuildID 20131025100746
Version 27.0a1
(In reply to Marcia Knous [:marcia] from comment #34)
> Bug 930575 still occurs after the backout, so that did not address the
> graphical scrolling issue which is still being seen on Buri using:
> 
> Gaia   afbf45f26a73b7cd5e0a831bea48087331975286
> SourceStamp 2f2a45f04e7c
> BuildID 20131025100746
> Version 27.0a1

The build that Ed spun does not contain the backout of this patch, as it hasn't landed on central.
Please see Bug 930575#c12 - Backing this patch did address the issue seen in that bug.
We should probably change the summary of this bug to better reflect what we're actually doing in the fix?
Summary: Fix OMTC d3d11 reftest bugs → Deal properly with IPDL abnormal shutdowns and fix some other OMTC d3d11 reftest bugs
Attached patch patch 3 v2.1 (obsolete) — Splinter Review
Slightly tweaked - speculative fix for smoke test issues.
Nick, any chance this is related to bug 933867?
(In reply to Milan Sreckovic [:milan] from comment #39)
> Nick, any chance this is related to bug 933867?

Sort of, commenting there.
Blocks: 930575
Attached patch patch 3 v2.1Splinter Review
rebased
Attachment #826593 - Attachment is obsolete: true
Attachment #806422 - Flags: checkin+
Attachment #805858 - Attachment is obsolete: true
We no longer need that patch, as demonstrated by https://tbpl.mozilla.org/?tree=Try&rev=2436e9f4b2e5. I bet it was fixed by bug 923434. That is nice, since that patch was the source of bug 930575 which got this lot backed out the last time.
https://hg.mozilla.org/mozilla-central/rev/372dddb6113d
Status: REOPENED → RESOLVED
Closed: 11 years ago11 years ago
Resolution: --- → FIXED
Target Milestone: mozilla27 → mozilla28
Whoops, forget my [leave open] whiteboard thingy.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
There is this issue which is stopping me landing the last patch (from bug 930575):

> > > 5. 3d test from the marketplace, seems to work.
> > > 6. played with http://bit.ly/css3d-mol (from bug 862097).
> > > 
> > > - I had one experience where if the device goes to sleep some graphics
> > > doesn't render correctly when coming back on the http://bit.ly/css3d-mol. 
> > > Specifically, the molecules didn't render.
> > > 
> > 
> > This is something that could be due to my patch and is a bit worrying. I
> > would like to investigate this before landing.
> > 
> > Is this reproducible? Does it happen every time? And have you seen this
> > behaviour before?
> 
> It happened one time and I can not reproduce it. I am currently unsure how to
> reproduce it; I tried multiple times to just have the device sleep and then wake
> it.  I think there's some other state I have to get into in order to reproduce
> it... I hadn't figured it out.  I have not seen the behavior before in the main
> build.

This sounds like something this patch could be responsible for, not sure how to proceed though
There are some B2G bugs pending on this to land (where we try to deallocate shmems twice). Nick, do you have a rough ETA on this? Is it possible to take the shmem part into a separate patch which would be maybe easier to land without causing the regression in bug 930575?
Blocks: 907715, 926558
(In reply to Nicolas Silva [:nical] from comment #47)
> There are some B2G bugs pending on this to land (where we try to deallocate
> shmems twice). Nick, do you have a rough ETA on this? Is it possible to take
> the shmem part into a separate patch which would be maybe easier to land
> without causing the regression in bug 930575?

I think we can just land this. There is only the shmem part to land really. There is one issue which happened which I am puzzled by and would like to investigate (see the recent comments in 930575), but given that it is rare and hard to repro, I'm not sure what to do about it. If you want to spin up this patch and have a play and see if you can repro, that would be appreciated. Otherwise I think we can land this once inbound opens again.
(In reply to Nicolas Silva [:nical] from comment #47)
> There are some B2G bugs pending on this to land (where we try to deallocate
> shmems twice). Nick, do you have a rough ETA on this? Is it possible to take
> the shmem part into a separate patch which would be maybe easier to land
> without causing the regression in bug 930575?

Given that we're not sure this will stick past smoke testing, please don't land stuff on top for a few days so we can easily backout if necessary. Thanks!
https://hg.mozilla.org/mozilla-central/rev/1f0b615d64a5
Status: REOPENED → RESOLVED
Closed: 11 years ago11 years ago
Resolution: --- → FIXED
Oops.  Does this bug have to do with bug 940745?
(In reply to Naoki Hirata :nhirata (please use needinfo instead of cc) from comment #52)
> Oops.  Does this bug have to do with bug 940745?

I don't think so. The patch doesn't change any IPDL protocols, it only changes what happens when IPDL trees are shutdown. Therefore, I wouldn't suspect it except in situations where we see something like 'destroy' or 'shutdown' in the stack or which happen when a process crashes or something is shutdown or changes state dramatically (like when a phone goes to sleep or wakes up).
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: