Closed Bug 983035 Opened 10 years ago Closed 3 years ago

Fix performance and memory usage regressions with tiling on OSX

Categories: Core :: Graphics: Layers (defect)
Platform: x86 macOS
Priority: Not set
Severity: normal

Status: RESOLVED WONTFIX

People: Reporter: mattwoodrow; Assigned: mattwoodrow

Keywords: perf

https://tbpl.mozilla.org/?tree=Try&rev=fc2193cf91e3

With the single paint buffer mode (the default), we regress tscrollx, cart and tart, as well as the memory usage tests.

From profiling this I can see a few major issues:

1) We copy from the single paint buffer into the individual tiles, which is an extra main-thread cost we didn't have previously. This almost doubles our drawing time on the testcase I'm looking at. We also spend a fair amount of time creating and destroying DrawTargetCGs, since we create a new one for every operation. Allocating the single paint buffer isn't free either, but we have an equivalent cost without tiling.

2) We always retain the in-memory copy of each tile even after it's been uploaded to the GPU. This is pretty wasteful for tiles that aren't changing frequently, and probably explains all the memory regressions.

3) In the case where the previous paint hasn't finished we have to copy the old tile into a new one. This also hurts us on the following composite since we now have to upload a full tile. One of the talos tscrollx subtests consistently hits this case for me when testing locally.

Possible solutions:

a) Switch to painting each tile separately. This should get rid of the copy cost from (1), but will instead add overhead from processing display items multiple times. This will be hard to measure, since the extra overhead is very workload-dependent. A lot of the particularly bad cases definitely won't be covered by our existing talos tests.

b) Add a recording/replay API to Moz2D so instead of allocating and painting into the single paint buffer we just record the command stream. Then we can replay this (potentially in parallel) to each of our tiles. Should fix (1) without the downside of (a), but we don't have this code at the moment.

c) Do something similar to ContentClientIncremental to tackle problems (2) and (3). For this we'd only retain the GPU texture for each tile, no CPU buffer. When we want to draw, we grab a new buffer from a pool (never the front buffer, since it doesn't have a CPU surface) and draw the invalid region into that. The pixels outside the invalid region would remain untouched (and are garbage). The receiving end of the tile transaction would upload the invalid region to the existing GPU texture, and then return the buffer to the pool. (A rough sketch of this flow follows the list of options below.)

d) Play around with thread scheduling and upload timing to reduce the chances of (3) occurring. In particular, we could upload the tiles directly when we receive the transaction (since it's async) rather than returning to the event loop and waiting for the composite event. We could also call ReadUnlock on each tile as we upload it, rather than waiting to finish the whole set before unlocking.
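
To make (c) concrete, here's a minimal sketch of the buffer-pool flow. The names are made up for illustration and are nothing like the actual ContentClient/TextureClient code:

#include <cstdint>
#include <memory>
#include <vector>

struct TileBuffer { std::vector<uint8_t> pixels; };   // one tile's worth of CPU memory

class TileBufferPool {
public:
  std::unique_ptr<TileBuffer> Take() {
    if (mFree.empty()) {
      return std::make_unique<TileBuffer>();          // grow the pool on demand
    }
    std::unique_ptr<TileBuffer> buf = std::move(mFree.back());
    mFree.pop_back();
    return buf;   // contents are whatever the last user left; only the invalid region gets drawn
  }
  void Return(std::unique_ptr<TileBuffer> aBuffer) {
    mFree.push_back(std::move(aBuffer));               // recycled once the host has uploaded the invalid region
  }
private:
  std::vector<std::unique_ptr<TileBuffer>> mFree;      // never holds a "front buffer"; there is no CPU front buffer
};

The content side would draw only the invalid region into the borrowed buffer; the host uploads just that region into the tile's existing GPU texture and hands the buffer back.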

From what I can see, it's going to be hard to match our current talos numbers without doing (c), and one of (a) or (b).

Any thoughts or other ideas?
Flags: needinfo?(bas)
Flags: needinfo?(bgirard)
b) is awesome. It will allow us to do a) and we can still do c) on top if we really have to. I don't believe d) will make a big difference. b) will let us paint tiles in parallel!
I might try a quick prototype of (b) using Skia's SkPicture.

If I switch content rendering over to Skia, I can get a baseline number and then see how much difference recording and playback makes to that number.

I'll probably play with that tomorrow or while travelling over the weekend.
Even for a single tile we can record the paint commands and replay them on a different thread to make the main thread more responsive.
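For reference, a rough sketch of what the SkPicture record/replay idea amounts to. This uses the current public SkPictureRecorder names, so the exact calls are an assumption relative to the Skia in tree at the time, and the per-tile translation is illustrative:

#include "SkCanvas.h"
#include "SkPicture.h"
#include "SkPictureRecorder.h"
#include "SkRect.h"
#include "SkSurface.h"

sk_sp<SkPicture> RecordLayer(int aLayerWidth, int aLayerHeight) {
  SkPictureRecorder recorder;
  SkCanvas* canvas = recorder.beginRecording(SkRect::MakeWH(aLayerWidth, aLayerHeight));
  // ... issue the layer's paint commands against |canvas|; nothing is rasterized yet ...
  return recorder.finishRecordingAsPicture();
}

void ReplayIntoTile(const sk_sp<SkPicture>& aPicture,
                    SkSurface* aTileSurface,     // CPU or GPU surface backing one tile
                    int aTileX, int aTileY) {    // tile origin in layer space
  SkCanvas* canvas = aTileSurface->getCanvas();
  canvas->save();
  // Shift layer space so this tile's area lands at the surface origin, then replay.
  canvas->translate(SkIntToScalar(-aTileX), SkIntToScalar(-aTileY));
  aPicture->playback(canvas);                    // could run on a worker thread, one tile per task
  canvas->restore();
}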
One other thing I want to try is using display ports (without APZC) set to the size of the scroll-port rounded up to the nearest tile boundary.

That should mean we only repaint whole tiles when scrolling, and might skip some of the issues we have.
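For what it's worth, the rounding itself is trivial. A minimal sketch, assuming a 256px tile size and non-negative layer coordinates (not the actual layout/gfx code):

struct IntRect { int x, y, width, height; };

// Round a scroll-port rect out to tile boundaries so scrolling only ever
// invalidates whole tiles.
IntRect RoundOutToTileBoundaries(const IntRect& aScrollPort, int aTileSize = 256) {
  int x0 = (aScrollPort.x / aTileSize) * aTileSize;                                          // left edge down
  int y0 = (aScrollPort.y / aTileSize) * aTileSize;                                          // top edge down
  int x1 = ((aScrollPort.x + aScrollPort.width + aTileSize - 1) / aTileSize) * aTileSize;    // right edge up
  int y1 = ((aScrollPort.y + aScrollPort.height + aTileSize - 1) / aTileSize) * aTileSize;   // bottom edge up
  return IntRect{ x0, y0, x1 - x0, y1 - y0 };
}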
I've done a bit of profiling and experimenting with this and found some interesting things:

* We're currently uploading the entirety of every tile regardless of what changed. To fix this we need the content side to add whole tiles to the invalid area whenever we can't re-use the existing buffer. Fixing this doesn't have much effect on tscrollx results since the content side is our bottleneck.

* Doing the tile upload directly inside the transaction (since it's async), unlocking tiles individually as we upload them, plus the above, gives us a marginal tscrollx win. This is because it reduces the chances of us trying to draw into a tile before it's been uploaded and needing a copy.

* Switching to per-tile drawing is a mixed bag. Some of the tscrollx tests do better, and others do worse. The net result is basically nothing. tsvgx also regresses hugely.

* Using a display port the size of the scrollport rounded out to the nearest tile boundaries gives us a big win on the slowest tscrollx subtest (but basically nothing for the others). In this mode we only paint whole tiles at a time, and most scroll operations become a no-op (except the scrollbar). Tscrollx only measures average frame interval, not variance/std dev, though, and I'm concerned this will look very janky on slow-scrolling pages.

* I prototyped the record/replay (single-threaded) using SkPicture. It's a considerable reduction in paint time compared to painting into the single buffer and then copying to tiles. Unfortunately all of this time is saved before we try to grab the tiles back from the compositor, meaning we're more likely to do so before the upload has completed. The resulting copies eat up most of our performance win. A few of the tscrollx tests end up better off, but it's not overly convincing.

Local tscrollx results for SkPicture - https://pastebin.mozilla.org/4632896


Given that, it sounds like we still want something similar to ContentClientIncremental, so that we never race with the compositor for access to the tile buffer. That's idea (c) from my original post, and the only solution I haven't tried yet (since it's not clear how it could fit into new-textures without being awful).

Nical, do you have any thoughts on that, or any of the above?
Flags: needinfo?(nical.bugzilla)
(In reply to Matt Woodrow (:mattwoodrow) from comment #5)

> * I prototyped the record/reply (single-threaded) using SkPicture. It's a
> considerable reduction in paint time compared to painting to the single
> buffer and then copying to tiles. Unfortunately all of this time is taken
> away from before we try grab the tiles from the compositor, meaning that we
> are more likely to do so before upload is completed. The resulting copies
> required eats up most of our performance win. A few of the tscrollx tests
> end up better off, but it's not overly convincing.
> 
> Local tscrollx results for SkPicture - https://pastebin.mozilla.org/4632896

For reference, on the first two sub-tests of tscrollx we're spending about 2.5x as long in GetBackBuffer (which is checking the lock, recycling a tile and copying the old tile contents into the new) as we are replaying the SkPicture onto the tile afterwards.
The problem with uploading immediately rather than waiting for the composite is that there's the possibility you'll get two UseTiledLayerBuffer messages before the composite and have wasted upload(s). I guess you could force handling of a composite before dealing with any subsequent uploading, but if the content side is so fast in this situation, I suppose this isn't going to alleviate the problem particularly - you'll just end up hitting it on the next draw instead.

Bas had suggested not allowing content to get ahead like this and using a cross-process condvar to make sure the last transaction has been dealt with before starting a new one. I kind of like this idea because it'd very much simplify a lot of the code and make the gralloc buffer-locking situation easier to deal with... But of course it does mean client side could just be wasting time waiting for the compositor when it could be drawing, which isn't so great.

ReadUnlocking immediately after upload seems like a reasonable thing to do - I don't think it adds anything complexity-wise and if it's a measurable benefit, then no reason why not.

(c) requires efficient sub-image texture upload, which might be a problem on mobile. We currently upload the whole tile because lots of Android drivers/GPUs in the past had slow and/or buggy sub-image texture upload, so I think we'd have to retain an 'always upload the whole tile' path. I'd still like to see some results for the sub-image path though; I could imagine it being a pretty decent win.

I think a combination of (b) and (c) would be nice - (b) would massively benefit b2g, where we could render directly to tiles and avoid the memory hit of the single-draw buffer (it already has partial upload via gralloc).

I think if the overhead of copying tiles is so large, we should consider blocking on the compositor too, though I imagine all these numbers look quite different on mobile.

As a side-note for (a), more recent Cairo has efficient (apparently) recording surfaces, and the current in-tree version doesn't (it has recording surfaces, but they're awful). If Skia content rendering is near ready, then I suppose this doesn't matter, but worth knowing.
Another aside, while it's a noble goal, I hope we're not trying to make tiles perform better than a single rotating buffer in every way *without* using displayports - I'm not saying it can't be done, but I'd say that's more an indictment of our rotated buffer code than anything... We should always be able to draw faster if we can just upload only what's changed into a single buffer, it's the wins elsewhere that make tiling worthwhile (layer resizing, scrolling (with a displayport and tile-alignment), progressive update, parallelisation, etc.)
(In reply to Chris Lord [:cwiiis] from comment #7)
> The problem with uploading immediately rather than waiting for the composite
> is that there's the possibility you'll get two UseTiledLayerBuffer messages
> before the composite and have wasted upload(s). I guess you could force
> handling of a composite before dealing with any subsequent uploading, but if
> the content side is so fast in this situation, I suppose this isn't going to
> alleviate the problem particularly - you'll just end up hitting it on the
> next draw instead.

That's not ideal because we wouldn't be operating in parallel at all. There's not much reason to have OMTC at that point.

> 
> Bas had suggested not allowing content to get ahead like this and using a
> cross-process condvar to make sure the last transaction has been dealt with
> before starting a new one. I kind of like this idea because it'd very much
> simplify a lot of the code and make the gralloc buffer-locking situation
> easier to deal with... But of course it does mean client side could just be
> wasting time waiting for the compositor when it could be drawing, which
> isn't so great.

I think Bas wanted this to block only if content got more than a frame ahead. So content can be producing a frame while the compositor is compositing the previous frame, but never while it's still compositing two frames back.


> 
> ReadUnlocking immediately after upload seems like a reasonable thing to do -
> I don't think it adds anything complexity-wise and if it's a measurable
> benefit, then no reason why not.
> 
> (c) requires efficient sub-image texture upload, which might be a problem on
> mobile. I'd like to see the results though. We upload the whole tile
> currently as lots of Android drivers/GPUs in the past had slow and/or buggy
> sub-image texture upload. I think we'd have to retain an 'always upload the
> whole tile' path, but I'd like to see some results for this path, I could
> imagine it being a pretty decent win.

Yeah, I think we'll have to keep doing that for mobile unfortunately. I'm primarily concerned with optimizing for desktop (but any wins for mobile in the process will be awesome).

> 
> I think a combination of (b) and (c) would be nice - (b) would massively
> benefit b2g, where we could render directly to tiles and avoid the memory
> hit of the single-draw buffer (it already has partial upload via gralloc).
> 
> I think if the overhead of copying tiles is so large, we should consider
> blocking on the compositor too, though I imagine all these numbers look
> quite different on mobile.
> 
> As a side-note for (a), more recent Cairo has efficient (apparently)
> recording surfaces, and the current in-tree version doesn't (it has
> recording surfaces, but they're awful). If Skia content rendering is near
> ready, then I suppose this doesn't matter, but worth knowing.

Bas is working on a native Moz2D version of this currently. I was just playing with skia content since it was the quickest path to a prototype that I could benchmark.


(In reply to Chris Lord [:cwiiis] from comment #8)
> Another aside, while it's a noble goal, I hope we're not trying to make
> tiles perform better than a single rotating buffer in every way *without*
> using displayports - I'm not saying it can't be done, but I'd say that's
> more an indictment of our rotated buffer code than anything... We should
> always be able to draw faster if we can just upload only what's changed into
> a single buffer, it's the wins elsewhere that make tiling worthwhile (layer
> resizing, scrolling (with a displayport and tile-alignment), progressive
> update, parallelisation, etc.)

Yes indeed, that is a very good point. I think tiling is a fairly high priority though, since it lets us remove deprecated textures, and getting APZC support for desktop (which gives us access to most of the other wins) won't be such a high priority (sadface).

So we need to optimize tiling for desktop sufficiently that we don't take a big hit in the interim.
(In reply to Matt Woodrow (:mattwoodrow) from comment #0)
> c) Do something similar to ContentClientIncremental to tackle problems (2)
> and (3). For this we'd only retain the GPU texture for each tile, no CPU
> buffer. When we want to draw a buffer we grab a new one from a pool (never
> try to grab the front buffer, since it doesn't have a CPU surface) and draw
> the invalid region into that. The pixels outside the invalid region would
> remain untouched (and are garbage). The receiving end of the tile
> transaction would upload the invalid region to the existing GPU texture, and
> then return the buffer to the pool.

So we talked about it on IRC with Matt and this should be implementable as a "write-only" TextureClient that doesn't keep a buffer on the client side, but instead just sends updated regions to the host side. The Host would pass those regions to a DataTextureSource so that the latter uploads the regions with whatever makes sense (glTexSubImage2D for OpenGL, etc.). In any case I would really like to avoid adding another Compositable implementation for this. Using TextureClient/Host/Source also has the advantage that, by keeping the backend-specific parts in DataTextureSource, we don't need to write any backend-specific code for this (except that some DataTextureSource implementations don't support partial updates yet, so a tiny bit of glue needs to be added there for non-GL backends).
Flags: needinfo?(nical.bugzilla)
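As a rough illustration of the host-side partial upload described above (plain desktop GL; the function and parameter names here are assumptions for the sketch, not the DataTextureSource API, and the pixel format is just an example):

#include <OpenGL/gl.h>

// Upload only the invalid rect of a tile into its existing texture.
void UploadInvalidRect(GLuint aTexture,
                       const unsigned char* aTilePixels,   // client-side tile buffer, 4 bytes/pixel
                       int aTileStrideBytes,               // bytes per row of the tile buffer
                       int aX, int aY,                     // invalid rect origin within the tile
                       int aWidth, int aHeight) {          // invalid rect size
  glBindTexture(GL_TEXTURE_2D, aTexture);
  // Let GL read a sub-rect straight out of the larger tile buffer.
  glPixelStorei(GL_UNPACK_ROW_LENGTH, aTileStrideBytes / 4);
  const unsigned char* src = aTilePixels + aY * aTileStrideBytes + aX * 4;
  glTexSubImage2D(GL_TEXTURE_2D, 0, aX, aY, aWidth, aHeight,
                  GL_RGBA, GL_UNSIGNED_BYTE, src);
  glPixelStorei(GL_UNPACK_ROW_LENGTH, 0);
}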
Most of what I'd say has already been covered above, so just count these as my votes:

c)
On mobile glTexSubImage2D is slower. I doubt it's faster on desktop, and it's likely to be slower in some rare cases, but I don't know for sure. The problem is that it's hard to get convincing numbers, since performance can be fast if the previous composite has completed and the texture is no longer locked by the command buffer.

We need to focus on direct texturing/client storage for this where possible. In cases where we don't have that, I think the best thing to do is glTexImage2D. That is a regression from the previous buffer-rotation case. However, if we're trading slightly worse average-case performance for better worst-case performance, I think it's the right tradeoff to make, and we shouldn't shy away from making it if it comes down to that.

d)
If we do mostly async transactions then it's probably OK, but with sync transactions this hurts the main thread.

b)
I want this, but trying to do it as part of this bug and gating tiling on desktop on that project is going to be a total disaster. That will bring its own performance impacts. I vote that we keep these projects separate for now.
Flags: needinfo?(bgirard)
(In reply to Benoit Girard (:BenWa) from comment #11)
> Most of what I wrote is written above. So just count these as my votes:
> 
> c)
> On mobile glTexSubImage2D is slower. I doubt on desktop it's faster, and
> it's likely to be slower in some rare case, but I don't know for sure. The
> problem is it's hard to get convincing numbers since the performance can be
> fast is the previous composite has completed and the texture is no longer
> locked by the command buffer.
> 
> We need to focus on direct texturing/client storage for this where possible.
> In case where we don't have that I think the best thing to do is
> glTexImage2D. But this is a regression from the previous buffer rotation
> case. However if we're trading a slightly worse average case performance for
> a better worse case performance I think it's the right tradeoff to make and
> we shouldn't shy away from making that tradeoff if it comes down to it.


How does client storage handle the case where we want to modify the pixel buffer before the upload is done? I don't see how it could. That means we now have to wait until after SwapBuffers() before calling ReadUnlock, instead of as soon as the glTexImage2D call completes. We're already losing performance because content wants the buffers before uploads are complete and has to make copies; this sounds like it would make things worse. It's probably a performance win for the compositor thread, but we don't seem to be limited by that currently (for tscroll at least).

I haven't looked at direct texturing and double buffering the tiles, but that would be a considerable memory usage regression over trunk.

> d)
> If we do mostly async transaction then it is probably ok, but with sync
> transaction this hurts the main thread.

We should never have sync transactions with tiling.

> 
> b)
> I want this but trying to do it part of this and gate tiling on desktop on
> that project is going to be a total disaster. That itself will bring it's
> own impacts to performance. I vote that we keep these projects separate for
> now.

If we can't get acceptable performance without it, then what choice do we have? I don't think landing tiling with a ~50% performance regression will be acceptable.
(In reply to Matt Woodrow (:mattwoodrow) from comment #12)
> How does client storage handle the case where we want to modify the pixel
> buffer before the upload is done?

Either we (1) fence or (2) we delete the texture, which will block if we do it too early. After that we're free to do an incremental upload on that tile (if it's not too old) and we know we won't block on the command buffer.

> That means we
> now have to wait until after SwapBuffers() before calling ReadUnlock,
> instead of once the glTexImage2D call.

Yes, we will have to do our own buffer management. But that means everything is in our control, and it also means we can avoid an extra copy or two that is otherwise outside of our control. I already have an implementation of client storage with old tiling which was very easy to code. I'd hope it's still easy to do with new tiling.
 
> We're already losing performance
> because content wants the buffers before uploads are complete and has to
> make copies, this sounds like it would make things worse.

Client storage means that we can have GL use our bits directly instead of copying them. This is strictly an advantage, all other things being equal. I don't see how you think this will make performance worse. Double buffering plus a texture is like triple buffering, so we shouldn't compare a triple-buffered approach to a double-buffered client storage approach and say that client storage is worse.
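
For concreteness, here's roughly what client storage plus fencing looks like on OSX with the GL_APPLE_client_storage and GL_APPLE_fence extensions. This is a hedged sketch only; buffer ownership, pooling and error handling are omitted, and none of these names are the actual Gecko code:

#include <OpenGL/gl.h>
#include <OpenGL/glext.h>

// Create a tile texture that samples the client buffer directly (no upload copy).
// The buffer must stay alive and unmodified until the GPU has finished reading it.
GLuint CreateClientStorageTile(const void* aClientPixels, int aTileSize) {
  GLuint tex = 0;
  glGenTextures(1, &tex);
  glBindTexture(GL_TEXTURE_2D, tex);
  glPixelStorei(GL_UNPACK_CLIENT_STORAGE_APPLE, GL_TRUE);
  glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, aTileSize, aTileSize, 0,
               GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, aClientPixels);
  glPixelStorei(GL_UNPACK_CLIENT_STORAGE_APPLE, GL_FALSE);
  return tex;
}

// Compositor: after drawing with the tile, drop a fence into the command stream.
void MarkTileInUse(GLuint aFence) {
  glSetFenceAPPLE(aFence);           // aFence comes from glGenFencesAPPLE
}

// Content: only reuse/modify the tile's client buffer once the fence has passed.
bool TileBufferReusable(GLuint aFence) {
  return glTestFenceAPPLE(aFence) == GL_TRUE;
}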

> It's probably a
> performance win for the compositor thread, but we don't seem to be limited
> by that currently (for tscroll at least).
> 
> I haven't looked at direct texturing and double buffering the tiles, but
> that would be a considerable memory usage regression over trunk.
> 
> > d)
> > If we do mostly async transaction then it is probably ok, but with sync
> > transaction this hurts the main thread.
> 
> We should never have sync transactions with tiling.

If we have anything else in the transaction that requires a swap then the entire transaction falls back to a sync transaction. We're seeing this now on b2g where we are still using a non tiling layer for the scroll bar.

> 
> > 
> > b)
> > I want this but trying to do it part of this and gate tiling on desktop on
> > that project is going to be a total disaster. That itself will bring it's
> > own impacts to performance. I vote that we keep these projects separate for
> > now.
> 
> If we can't get acceptable performance without it, then what choice do we
> have? I don't think landing tiling with a ~50% performance regression will
> be acceptable.

There's no way that a properly tuned tiling implementation is a 50% performance regression.
(In reply to Benoit Girard (:BenWa) from comment #13)
> (In reply to Matt Woodrow (:mattwoodrow) from comment #12)
> > How does client storage handle the case where we want to modify the pixel
> > buffer before the upload is done?
> 
> Either we (1) fence or (2) we delete the texture which will block if we do
> it too early. After that we're free to do an incremental upload on that tile
> (if it's not too old) and we known we wont block on the command buffer.

Fencing sounds ok, but it still has the same issue below.
  
> Client storage means that we can have GL use our bit directly instead of
> copying them. This is strictly an advantage with all other things being
> equal. I don't see how you think this will make performance worse. Double
> buffer + a texture is like tripple buffer so we shouldn't compare a tripple
> buffer approach to a double buffered client storage approach and say that
> client storage is worse.
> 

On trunk we have single buffered (only a texture), with tiling enabled it's double buffered (one memory copy, one texture).

I wasn't referring to copies within GL, but to the one our tiling code does. We try to grab the memory buffer for the tile, and if it's still being uploaded we allocate a new tile (probably from the pool) and copy the old tile contents into the new one. If we switch the upload from synchronous to asynchronous, then we'll be much less likely to be able to re-use the existing buffer and will have to take the fallback path that copies the tile contents. That copy kills performance, and shows up massively in profiles.


> 
> If we have anything else in the transaction that requires a swap then the
> entire transaction falls back to a sync transaction. We're seeing this now
> on b2g where we are still using a non tiling layer for the scroll bar.
> 

Ok, but we should just not do that for desktop. Or we can make uploading from within the transaction conditional on the transaction being async. Not a big issue either way.


> 
> There's no way that a properly tuned tiling implementation is a 50%
> performance regression.

36% on 10.8 and 56% on 10.6 currently to be precise.

That's not hard to believe: for simple content we spend about the same amount of time copying from the single-paint buffer into the individual tiles as we did rasterizing the content in the first place. That effectively doubles our paint time (but works out to a 30-50% regression because other overhead like display list and layer building is unchanged).
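
To put illustrative, made-up numbers on that: if a frame used to be ~10 ms of rasterization plus ~10 ms of display-list/layer building, adding ~10 ms of tile copying takes it from 20 ms to 30 ms, a 50% regression even though only the rasterization half doubled. The more the fixed overhead dominates, the smaller the measured regression, which would also explain why the number varies between configurations.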

Switching to painting directly into each tile avoids this overhead, but instead incurs new overhead processing display items multiple times. It's not a net win for the performance testing we have.
(In reply to Matt Woodrow (:mattwoodrow) from comment #14)
> On trunk we have single buffered (only a texture), with tiling enabled it's
> double buffered (one memory copy, one texture).
> 
> I wasn't referring to copies within GL, but to the one our tiling code does.
> We try grab the memory buffer for the tile, and if it's still being uploaded
> we allocate a new tile (probably from the pool) and copy the old tile
> contents into the new one. If we switch the upload from a synchronous one to
> asynchronous, then we'll be much less likely to be able to re-use the
> existing buffer and will have to take the fallback path that requires us to
> copy the tile contents. That copy kills performance, and shows up massively
> in profiles.

Right now we do a glTexImage2D whenever the tile changes, so that means we have at least one full copy every frame. With client storage we avoid this copy. In the current case we sometimes do a copy due to contention; with client storage we would do this always.

Current: 1 upload copy + sometimes 1 contention copy
Client: 0 upload copy + always a contention copy*

* we only need to copy the bits that changed in the previous frame but haven't in the current frame. This may be fewer bits.

When you take everything into consideration, client storage/direct texturing gives us more flexibility and is a net win.


> 
> 
> > 
> > If we have anything else in the transaction that requires a swap then the
> > entire transaction falls back to a sync transaction. We're seeing this now
> > on b2g where we are still using a non tiling layer for the scroll bar.
> > 
> 
> Ok, but we should just not do that for desktop. Or we can make uploading
> from within the transaction conditional on the transaction being async. Not
> a big issue either way.
> 
> 
> > 
> > There's no way that a properly tuned tiling implementation is a 50%
> > performance regression.
> 
> 36% on 10.8 and 56% on 10.6 currently to be precise.
> 
> That's not hard to believe, for simple content we spend about the same
> amount of time copying from the single-paint buffer into the individual
> tiles as we did rasterizing the content in the first place. That effectively
> double our paint time (but works out to 30-50% regression because other
> overhead like display list and layer building is unchanged).
> 
> Switching to painting directly into each tile avoids this overhead, but
> instead incurs new overhead processing display items multiple times. It's
> not a net win for the performance testing we have.
(In reply to Benoit Girard (:BenWa) from comment #15)
> 
> Right now we do a glTexImage whenever the tile changes so that means we have
> at least one full copy every frame. With client storage we avoid this copy.
> In the current case we sometimes do a copy due to contention, with client
> storage we would do this always.
> 
> Current: 1 upload copy + sometimes 1 contention copy
> Client: 0 upload copy + always a contention copy*
> 
> * we only need to copy the bit that changed in the previous frame but
> haven't in the current frame. This may be less bits.
> 
> When you take everything in consideration client storage/direct texturing
> gives us more flexibly and are a net win.

Switching to glTexSubImage2D on desktop isn't an issue though, and I haven't seen any evidence of it being slower. It should be faster since the upload copy is of a smaller area.

We're also removing a copy from the compositor thread and adding one to the content thread. That's not apples to apples when we're already limited by our content-thread performance.

I guess we could look at tscroll and see if it's actually measuring things we care about (currently it's only measuring content thread timings, and doesn't notice if we over-produce), but that seems out of scope for this bug.
Let's discuss this during the workweek.
Flags: needinfo?(bas)
Depends on: 1053078
The combination of the tiling work for b2g, DrawTargetTiled and refresh driver throttling has fixed a lot of the performance issues here.

I'm working on this again to try to get it completed.
Assignee: nobody → matt.woodrow
Depends on: 1057212
Depends on: 1057222
Depends on: 1059033
Depends on: 1060114
Depends on: 982433
Keywords: perf
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → WONTFIX