Closed Bug 1023882 Opened 6 years ago Closed 6 years ago

Refactor ClientTiledThebesLayer::Render()'s logic to be more easily readable

Categories

(Core :: Graphics: Layers, defect)

All
Gonk (Firefox OS)
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla33
Tracking Status
firefox31 --- wontfix
firefox32 --- fixed
firefox33 --- fixed
b2g-v1.4 --- wontfix
b2g-v2.0 --- fixed
b2g-v2.1 --- fixed

People

(Reporter: kats, Assigned: kats)

References

Details

Attachments

(8 files, 3 obsolete files)

4.15 KB, patch
cwiiis
: review+
Details | Diff | Splinter Review
3.76 KB, patch
cwiiis
: review+
Details | Diff | Splinter Review
8.22 KB, patch
cwiiis
: review+
Details | Diff | Splinter Review
7.95 KB, patch
cwiiis
: review+
Details | Diff | Splinter Review
2.95 KB, patch
cwiiis
: review+
Details | Diff | Splinter Review
3.85 KB, patch
cwiiis
: review+
Details | Diff | Splinter Review
6.27 KB, patch
cwiiis
: review+
Details | Diff | Splinter Review
4.72 KB, patch
cwiiis
: review+
Details | Diff | Splinter Review
The Render function in ClientTiledThebesLayer is weird because it gets called multiple times with different state bits set, and so you have to mentally keep track of a lot of things to figure out what actually gets run on any given execution of the function. I refactored it a bunch - not sure if I made it better but I think I did a little bit, and at least I understand it now. Would like to land these patches assuming I didn't break anything.
Attachment #8438442 - Attachment description: Split the invalid region calculation from the low-precision invalid region calculation → Part 5 - Split the invalid region calculation from the low-precision invalid region calculation
Comment on attachment 8438437 [details] [diff] [review]
Part 1 - Small changes

Review of attachment 8438437 [details] [diff] [review]:
-----------------------------------------------------------------

Much clearer, nice.

::: gfx/layers/client/ClientTiledThebesLayer.cpp
@@ +187,5 @@
> +ClientTiledThebesLayer::UseFastPath()
> +{
> +  const FrameMetrics& parentMetrics = GetParent()->GetFrameMetrics();
> +  bool multipleTransactionsNeeded = gfxPrefs::UseProgressiveTilePainting()
> +                                 || gfxPrefs::UseLowPrecisionBuffer()

I think currently if you disable progressive tile painting, low precision painting won't happen(?) - I'd be in favour of fixing that though, so this can slide.

::: gfx/layers/client/ClientTiledThebesLayer.h
@@ +80,5 @@
>     */
>    void BeginPaint();
>  
>    /**
> +   * Determine if we can use a fast path to just do a single high-precision

nit, comma after 'high-precision'.
Attachment #8438437 - Flags: review?(chrislord.net) → review+
Comment on attachment 8438438 [details] [diff] [review]
Part 2 - Create a unified first-transaction block of code

Review of attachment 8438438 [details] [diff] [review]:
-----------------------------------------------------------------

Excellent.
Attachment #8438438 - Flags: review?(chrislord.net) → review+
Comment on attachment 8438439 [details] [diff] [review]
Part 3 - Extract a high-precision render function

Review of attachment 8438439 [details] [diff] [review]:
-----------------------------------------------------------------

Again, nice.

::: gfx/layers/client/ClientTiledThebesLayer.cpp
@@ +200,5 @@
> +  if (aInvalidRegion.IsEmpty() || mPaintData.mLowPrecisionPaintCount != 0) {
> +    return false;
> +  }
> +
> +  // Only draw progressively when the resolution is unchanged.

I wonder if we can factor out the resolution change into the UseFastPath function? The HasShadowTarget bit here is a bit suspect too, but I'd put money on a whole bunch of ref-tests failing if we remove it :(
Attachment #8438439 - Flags: review?(chrislord.net) → review+
Comment on attachment 8438440 [details] [diff] [review]
Part 4 - Extract a low-precision render function

Review of attachment 8438440 [details] [diff] [review]:
-----------------------------------------------------------------

Good, with the first comment addressed.

::: gfx/layers/client/ClientTiledThebesLayer.cpp
@@ +417,5 @@
>    TILING_PRLOG_OBJ(("TILING 0x%p: Low-precision valid region is %s\n", this, tmpstr.get()), mLowPrecisionValidRegion);
>    TILING_PRLOG_OBJ(("TILING 0x%p: Low-precision invalid region is %s\n", this, tmpstr.get()), lowPrecisionInvalidRegion);
>  
>    // Render the low precision buffer, if there's area to invalidate and the
>    // visible region is larger than the critical display port.

The latter part of this comment should be moved into the RenderLowPrecision method.

::: gfx/layers/client/ClientTiledThebesLayer.h
@@ +86,5 @@
>     */
>    bool UseFastPath();
>  
> +  // Helper functions to do the high- and low-precision paints, respectively.
> +  // These functions return true if they updated the paint buffers.

Any reason to go from C-style to C++-style comments here? I get that it's commenting two functions, but it seems out of place - might as well comment both separately I think? I'm not fussy, so your choice.
Attachment #8438440 - Flags: review?(chrislord.net) → review+
Comment on attachment 8438443 [details] [diff] [review]
Part 6 - Move the SetRepeatTransaction out of TiledContentClient

This part seems to have a bug based on testing on Fennec. Unflagging for now.
Attachment #8438443 - Flags: review?(chrislord.net)
Added a missing hunk
Attachment #8438443 - Attachment is obsolete: true
Attachment #8438486 - Flags: review?(chrislord.net)
Rebased
Attachment #8438445 - Attachment is obsolete: true
Attachment #8438445 - Flags: review?(chrislord.net)
Attachment #8438489 - Flags: review?(chrislord.net)
Comment on attachment 8438442 [details] [diff] [review]
Part 5 - Split the invalid region calculation from the low-precision invalid region calculation

Review of attachment 8438442 [details] [diff] [review]:
-----------------------------------------------------------------

Ok.
Attachment #8438442 - Flags: review?(chrislord.net) → review+
Comment on attachment 8438486 [details] [diff] [review]
Part 6 - Move the SetRepeatTransaction out of TiledContentClient

Review of attachment 8438486 [details] [diff] [review]:
-----------------------------------------------------------------

r+ with comment addressed.

::: gfx/layers/client/ClientTiledThebesLayer.cpp
@@ +406,5 @@
> +    if (!mPaintData.mPaintFinished) {
> +      // There is still more high-res stuff to paint, so we're not
> +      // done yet. A subsequent transaction will take care of this.
> +      ClientManager()->SetRepeatTransaction();
> +      return;

Previously EndPaint(false) would be called before returning here I think. Am I correct in thinking this, and if so, is it intended to skip it?
Attachment #8438486 - Flags: review?(chrislord.net) → review+
Comment on attachment 8438487 [details] [diff] [review]
Part 7 - Move down the low-precision invalid region calculation to where it's needed

Review of attachment 8438487 [details] [diff] [review]:
-----------------------------------------------------------------

Seems sound.
Attachment #8438487 - Flags: review?(chrislord.net) → review+
Comment on attachment 8438489 [details] [diff] [review]
Part 8 - Remove the argument to EndPaint

Review of attachment 8438489 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, nice work :)

::: gfx/layers/client/ClientTiledThebesLayer.cpp
@@ +430,5 @@
>      }
>    }
>  
> +  // If we get here, we've done all the high- and low-precision
> +  // paints we wanted to do, so we can finish the paint and chill.

https://www.youtube.com/watch?feature=player_detailpage&v=eQZc53dpPrQ#t=34
Attachment #8438489 - Flags: review?(chrislord.net) → review+
(In reply to Chris Lord [:cwiiis] from comment #9)
> Part 1 - Small changes
> 
> I think currently if you disable progressive tile painting, low precision
> painting won't happen(?) - I'd be in favour of fixing that though, so this
> can slide.

I don't see any code that would disable low-precision painting if the progressive tile painting is turned off. However it is kind of odd that low-res tiles are always drawn using the ProgressiveUpdate call, even if progressive tile painting is turned off (I think). There's definitely some entanglement between the two prefs that we should sort out.

> 
> nit, comma after 'high-precision'.

Fixed.

(In reply to Chris Lord [:cwiiis] from comment #11)
> Part 3 - Extract a high-precision render function
>
> > +  // Only draw progressively when the resolution is unchanged.
> 
> I wonder if we can factor out the resolution change into the UseFastPath
> function? The HasShadowTarget bit here is a bit suspect too, but I'd put
> money on a whole bunch of ref-tests failing if we remove it :(

Maybe. I didn't understand the purpose of HasShadowTarget so I was unsure if we could lump that into the fast-path code or not.

(In reply to Chris Lord [:cwiiis] from comment #12)
> Part 4 - Extract a low-precision render function
> >    // Render the low precision buffer, if there's area to invalidate and the
> >    // visible region is larger than the critical display port.
> 
> The latter part of this comment should be moved into the RenderLowPrecision
> method.

Done.

> > +  // Helper functions to do the high- and low-precision paints, respectively.
> > +  // These functions return true if they updated the paint buffers.
> 
> Any reason to go from C-style to C++-style comments here? I get that it's
> commenting two functions, but it seems out of place - might as well comment
> both separately I think? I'm not fussy, so your choice.

Not really. I changed it back to /** comments and added one for each function.

(In reply to Chris Lord [:cwiiis] from comment #18)
> Part 6 - Move the SetRepeatTransaction out of TiledContentClient
> > +    if (!mPaintData.mPaintFinished) {
> > +      // There is still more high-res stuff to paint, so we're not
> > +      // done yet. A subsequent transaction will take care of this.
> > +      ClientManager()->SetRepeatTransaction();
> > +      return;
> 
> Previously EndPaint(false) would be called before returning here I think. Am
> I correct in thinking this, and if so, is it intended to skip it?

Previously EndPaint(false) would get called, but it would be a no-op when mPaintData.mPaintFinished is false, because of if the (!aFinish && !mPaintData.mPaintFinished) check at the top. Now it doesn't get called at all in this case, which is equivalent, and allows the cleanup in part 8.

(In reply to Chris Lord [:cwiiis] from comment #20)
> https://www.youtube.com/watch?feature=player_detailpage&v=eQZc53dpPrQ#t=34

:)
I backed these out in https://hg.mozilla.org/integration/mozilla-inbound/rev/44d144664d36 for breaking b2g mochitest-3 (and also b2g mochitest-4 when the test got moved to a different chunk: https://tbpl.mozilla.org/php/getParsedLog.php?id=41541251&tree=Mozilla-Inbound
Flags: needinfo?(bugmail.mozilla)
I'll look into this failure. Does seem to correlate 100% with the time my patches were in the tree. I also got a talos regression notification for Android rck2 from these changes, so I must have introduced some sort of functional change that I didn't intend to.
Flags: needinfo?(bugmail.mozilla)
So I can reproduce a problem on that test when running the B2G emulator mochitest locally. Not sure if it's the exact same problem but it certainly seems plausible. The problem is that the client side gets stuck at [1], because it tries to access the mutex after the compositor side has deleted it (but before the client side receives the delete notification).

There's probably also a problem in these patches that causes this problem to be exposed in the first place, because in theory there should be no functional changes here. Not sure what that problem is but I have try pushes pending to at least narrow down which patches are to blame.

[1] http://mxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/CompositorChild.cpp?rev=9a72db713446#239
Randall, any thoughts on comment 25?
Flags: needinfo?(rbarker)
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #26)
> Randall, any thoughts on comment 25?

It isn't clear why the lock would hang. The CrossProcessMutex should continue to work even if the compositor side has been deleted. I did not see any place where the mutex could be locked and then deleted leaving the mutex in a locked state. I wonder if shared memory used by the mutex is getting garbage written to it some how? From what I remember when I looked at the pthread mutex implementation was the mutex handle was really just an int. If the handle was non-zero when trying to obtain the lock, it would suspend until the handle returned to zero. Worst case would be to send a synchronous message to the Child process to ensure the mutex gets deleted on the child side before the compositor deletes it? Maybe cast the handle to an int and print out its value after the delete and before the hanging lock to see if it is changing between delete and the attempted lock? Here is the pthread_mutex_t struct:

typedef struct
{
    int volatile value;
} pthread_mutex_t;

To dig into the mutex source:
http://androidxref.com/4.3_r2.1/xref/bionic/libc/include/pthread.h
http://androidxref.com/4.3_r2.1/xref/bionic/libc/bionic/pthread.c
Flags: needinfo?(rbarker)
Ah, what's happening is that the mutex creation is quickly followed by the mutex deletion on the compositor side. The mutex is deleted before the client side receives the creation message, and so by the time it tries to create its side of the mutex it has already been torn down. For some reason when the client side uses the stale handle to create the client-side CrossProcessMutex it doesn't fail, but instead points to some other shared memory buffer or something, and so the mutex there is "locked".
Maybe add a check [1] that mCount > 1, if not, call pthread_mutex_init again? When pthread_mutex_destroy is called, I'm pretty sure the handle is set to a non-zero value and needs to be initialized again. 

[1] http://mxr.mozilla.org/mozilla-central/source/ipc/glue/CrossProcessMutex_posix.cpp#88
I filed bug 1024728 to spin off this CrossProcessMutex thing. Based on my try pushes part 7 of this bug seems to have introduced the functional change.
Comment on attachment 8438487 [details] [diff] [review]
Part 7 - Move down the low-precision invalid region calculation to where it's needed

Review of attachment 8438487 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/client/ClientTiledThebesLayer.cpp
@@ -239,5 @@
>  ClientTiledThebesLayer::RenderLowPrecision(nsIntRegion& aInvalidRegion,
>                                             LayerManager::DrawThebesLayerCallback aCallback,
>                                             void* aCallbackData)
>  {
> -  if (!aInvalidRegion.IsEmpty() &&

I don't see this !aInvalidRegion.IsEmpty() check coming back at any point - perhaps this is the cause of the functional difference?
(In reply to Chris Lord [:cwiiis] from comment #31)
> I don't see this !aInvalidRegion.IsEmpty() check coming back at any point -
> perhaps this is the cause of the functional difference?

There's a lowPrecisionInvalidRegion.IsEmpty() check before the call to RenderLowPrecision ever happens so I don't think that's the problem. I think the problem might be that before lowPrecisionInvalidRegion was getting before the RenderHighPrecision code and now it's getting computed after. mValidRegion is used in the computation of lowPrecisionInvalidRegion, and it's modified by RenderHighPrecision, so the new value computed will be different. I'll do a try push to see if I can confirm this.
Oh it also might be that I removed the critical displayport isEmpty check from computing the lowPrecisionInvalidRegion. Will test that too.
Comment 32 seems to be right, based on this: https://tbpl.mozilla.org/?tree=Try&rev=97a9bed13bde. I'm also doing a full run at https://tbpl.mozilla.org/?tree=Try&rev=908cc9577c76 just to be safe because I've been backed out way too many times lately.
Yeah I got that email too. It's likely this change. I'd rather not back this out again because that test is pretty finicky and I'm going to be making other changes to this code as well which will probably move the numbers around a bunch.
Depends on: 1026742
Would like to uplift this to 2.0 / Gecko32 because a bunch of the other low-res fixes landed after it and need uplifting. Uplifting this will make those fixes easier to uplift, and will also make it easier to debug issues on 2.0 if we have further bugs filed against it.
blocking-b2g: --- → 2.0?
Comment on attachment 8438437 [details] [diff] [review]
Part 1 - Small changes

[Approval Request Comment]
Bug caused by (feature/regressing bug #): none
User impact if declined: No user-visible impact here. The changes are purely code reorganization. However not uplifting this will make it much harder to fix user-visible bug fixes (see list of dep bugs). This is feature that is new to B2G 2.0 and so there will probably be a few more bugfixes needed as things stabilize; having these patches uplifted will speed up development quit a bit.
Testing completed (on m-c, etc.): locally, on m-c for a few days now. I've been working in this code almost exclusively since it landed and haven't found any further issues with it. It did cause a Fennec talos regression that is being tracked separately. I'm actively looking into it and should have a fix for it soon.
Risk to taking this patch (and alternatives if risky): low risk. Code is only used on Fennec and B2G.
String or IDL/UUID changes made by this patch: none
Attachment #8438437 - Flags: approval-mozilla-aurora?
^ approval request applies to all patches in this bug
Attachment #8438437 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Would not block on this, but considered approval given the user impact and bake time in comment #40.
blocking-b2g: 2.0? → ---
You need to log in before you can comment on or make changes to this bug.