Closed Bug 1464032 (canvas-ipc) Opened Last year Closed 3 months ago

Remote D2D canvas drawing to the GPU process

Categories

(Core :: Graphics, enhancement, P1)

All
Windows
enhancement

Tracking

()

RESOLVED FIXED
mozilla69
Tracking Status
firefox69 --- fixed

People

(Reporter: bobowen, Assigned: bobowen)

References

(Blocks 3 open bugs, Regressed 1 open bug)

Details

Attachments

(14 files, 29 obsolete files)

4.18 KB, text/plain
Details
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
This is part of the work to remove GPU access from the content process, so that we can improve the sandbox policy.
Comment on attachment 8981459 [details] [diff] [review]
WIP - Using Moz2D recording for canvas when D2D.

This is very rough at this stage, but I thought it was worth trying to get some feedback.
I'll split this into separate patches when it comes to review.

The general approach is to use a PersistentBufferProviderShared and new RecordedTextureData/TranslatedTextureData, that records and plays back the DrawTarget commands in the GPU process ready for when the texture is required.
The recordings are just copied into shared memory and sent in bulk at the moment on Unlock and when we have a snapshot that actually requires image data being sent back to the JS.

There are a few hacks in here and the lifetime management is very dodgy, but generally things seem to be working OK.
(I'm sure I'm committing several crimes against our gfx code.)

I've been testing on the following because they've appeared in other canvas bugs I saw:
http://slither.io/
https://www.w3schools.com/html/html5_canvas.asp
http://randomibis.com/coolclock/
Google sheets
(Any other testing suggestions welcome.)

I've added a SnapshotProvider to the DrawTargetRecording, so that I could use it instead of a DrawTargetWrapAndRecord. Having a full DrawTargetSkia running in the content process behind that to provide snapshots was really poor performance wise.

Here are some try pushes.
Full try push with a recent version of this patch:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=6791aeaef615f2d6fb1bdd394f93eeef930aa517
(This was before I made some of the IPC async, the GetSnapshot will always have to be sync I think. Failures are mainly down to those lifetime issues I think.)

Talos try push on the parent revision for comparison:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b5c8aedd39c4345fa3f8cd630ba55753d7b19360

Talos try push again on the parent revision, but with skia canvases:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=d7f16baee9788b3d0cf7eb698e22cc225de6638a

Regressions on the animometer tests are pretty heavy at the moment, but still better than with just skia in content. (Apart from the Images test, which is *massively* faster in skia anyway for some reason.)

I'm going to work on the crash regressions and then I have a couple of ideas to improve the performance, but I probably need to do some profiling first.
Attachment #8981459 - Flags: feedback?(jmuizelaar)
FYI Jeff is on leave for the next month or so, you might want to try Bas or somebody else for feedback instead.
Comment on attachment 8981459 [details] [diff] [review]
WIP - Using Moz2D recording for canvas when D2D.

(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #3)
> FYI Jeff is on leave for the next month or so, you might want to try Bas or
> somebody else for feedback instead.

Thanks!
Attachment #8981459 - Flags: feedback?(jmuizelaar) → feedback?(bas)
I realised a load of those crashes were due to the fact I'd left SnapshotProvider as RefCounted, when I decided I actually couldn't due to circular dependencies. I still want to revisit that along with the general lifetime management of these new TextureData IPC actors.

Here's a new try push:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=94ede1735936ff1943dbac96bde7c15c727afb52
Moz2D recording isn't cheap, I'd mostly be interested in seeing more perf numbers. For the time this isn't exactly high priority since we're doing D2D for content drawing in the content process anyway.
Moving to p3 because no activity for at least 24 weeks.
See https://github.com/mozilla/bug-handling/blob/master/policy/triage-bugzilla.md#how-do-you-triage for more information
Priority: P1 → P3
Priority: P3 → P1
Attachment #8981459 - Attachment is obsolete: true
Attachment #8981459 - Flags: feedback?(bas)
Otherwise, we crash in the content process when we try to record this.
Duplicate of this bug: 1395315
This is ground work for when we will be returning a recording TextureData for
certain types in subsequent patches.
This is so we don't need to lock the previous back buffer when it might also be
locked by the compositor. These locks are generally for copying to the next
back buffer or when getting image data from the previous back buffer.
These are to be used as part of recording canvas drawing in the content
processes and playing it back in the GPU process through shared memory.
These are to be used as part of recording canvas drawing in the content
processes and playing it back in the GPU process through shared memory.
Attachment #9027968 - Attachment is obsolete: true
These are extensions to the Moz2D RecordedEvents to record and play back canvas
texture related functions in the GPU process.
The CanvasTranslator handles the playback of these and the Moz2D ones.
RecordedTextureData records TextureData calls for play back in the GPU process.
CanvasChild and CanvasParent set up the recorder and translator.
They also help to manage the starting of translation and co-ordinating the
translation with the frame transactions.
This patch also includes other changes to wire up recording and playback.
This also improves the recording and translation speeds.
Comment on attachment 9027907 [details] [diff] [review]
Part 1: Fix unified build issues

While I'm sorting out a shutdown problem and waiting for try runs, I'll kick off reviews on some of the preliminary patches, which are hopefully fairly straightforward.
Attachment #9027907 - Flags: review?(jmuizelaar)
Attachment #9027910 - Flags: review?(jmuizelaar)
Attachment #9027911 - Flags: review?(jmuizelaar)
Attachment #9027913 - Flags: review?(jmuizelaar)
Attachment #9027914 - Flags: review?(jmuizelaar)
Have you benchmarked this to see what impact it has on performance?
Flags: needinfo?(bobowencode)
(In reply to Jeff Muizelaar [:jrmuizel] from comment #26)
> Have you benchmarked this to see what impact it has on performance?

Yes, it generally seems pretty good, certainly for the canvas tests.
Here's a recent run (only 1 iteration), I'm going to have to update a few of the patches to fix some things on try, so I'll do a fuller talos and raptor run with those:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=6805736ae89b&newProject=try&newRevision=4ed0797c45a75d371c8328da90758ea6b6f9e754&framework=10
Flags: needinfo?(bobowencode)
Attached patch Part 1: Fix unified build issues (obsolete) — Splinter Review
Attachment #9028443 - Flags: review?(jmuizelaar)
Attachment #9027907 - Attachment is obsolete: true
Attachment #9027907 - Flags: review?(jmuizelaar)
Attachment #9027910 - Attachment is obsolete: true
Attachment #9027910 - Flags: review?(jmuizelaar)
This is ground work for when we will be returning a recording TextureData for
certain types in subsequent patches.
Attachment #9027954 - Attachment is obsolete: true
These are to be used as part of recording canvas drawing in the content
processes and playing it back in the GPU process through shared memory.
Attachment #9027969 - Attachment is obsolete: true
RecordedTextureData records TextureData calls for play back in the GPU process.
CanvasChild and CanvasParent set up the recorder and translator.
They also help to manage the starting of translation and co-ordinating the
translation with the frame transactions.
This patch also includes other changes to wire up recording and playback.
Attachment #9028017 - Attachment is obsolete: true
Attachment #9028047 - Attachment is obsolete: true
Some more try runs...

Full (non-talos/raptor) run with remote canvas disabled:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b09475f2b16de2a2168f801e0cc002cc13e57e2e

Full talos/raptor run with remote canvas disabled:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=318d25ab6b8e40cb01ffddf6ec1b501cf04be5cf

Windows talos/raptor run with remote canvas enabled:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=5f4e17303ecf07f7be3a8a95dc01a2bc50bc1031

For comparison a talos and raptor run from a while ago with the same base changeset:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=6805736ae89b534072f4f1f6222fe6e07923e587

(This hasn't got everything repeated)
(In reply to Bob Owen (:bobowen) from comment #27)
> (In reply to Jeff Muizelaar [:jrmuizel] from comment #26)
> > Have you benchmarked this to see what impact it has on performance?
> 
> Yes, it generally seems pretty good, certainly for the canvas tests.
> Here's a recent run (only 1 iteration), I'm going to have to update a few of
> the patches to fix some things on try, so I'll do a fuller talos and raptor
> run with those:
> https://treeherder.mozilla.org/perf.html#/
> compare?originalProject=try&originalRevision=6805736ae89b&newProject=try&newR
> evision=4ed0797c45a75d371c8328da90758ea6b6f9e754&framework=10

To be a bit more specific here the canvas performance looks very good, with performance gains in the canvas related tests as far as I can tell:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=6805736ae89b&newProject=try&newRevision=5f4e17303ecf07f7be3a8a95dc01a2bc50bc1031&framework=10

I'm sure there are issues that need to be resolved, but I also think there could well be further performance gains to be had in the future.
Attachment #9028443 - Flags: review?(jmuizelaar) → review+
Comment on attachment 9028444 [details] [diff] [review]
Part 2: Remove LoadEvent and replace with DoWithEvent

Review of attachment 9028444 [details] [diff] [review]:
-----------------------------------------------------------------

Can you include the intent of this patch in the commit message?
Attachment #9027911 - Flags: review?(jmuizelaar) → review+
Attachment #9027913 - Flags: review?(jmuizelaar) → review+
Comment on attachment 9027914 [details] [diff] [review]
Part 5: Make sure the DrawTarget can create a similar DrawTarget when falling back to empty surface

Review of attachment 9027914 [details] [diff] [review]:
-----------------------------------------------------------------

Redirecting to mstange
Attachment #9027914 - Flags: review?(jmuizelaar) → review?(mstange)
This patch modifies DoWithEvent so that we can more easily have a
DoWithEventFromStream callable from outside of Moz2D similar to
LoadEventFromStream. We will add that in a later patch for the new
EventRingBuffer. It also changes the only user of LoadEventFromStream
over to it, so we can can get rid of it and LoadEvent entirely.
Attachment #9028864 - Flags: review?(jmuizelaar)
Attachment #9028444 - Attachment is obsolete: true
Attachment #9028444 - Flags: review?(jmuizelaar)
Comment on attachment 9028445 [details] [diff] [review]
Part 6: Add remote canvas pref and refactor TextuteData creation to use it

This spans graphics and layout, so I'm not sure if you'll both want to look at this.
I had quite a job trying to re-arrange the existing logic here, so that I could slot in the decision to remote fairly cleanly.
Attachment #9028445 - Flags: review?(matt.woodrow)
Attachment #9028445 - Flags: review?(jmuizelaar)
Comment on attachment 9027966 [details] [diff] [review]
Part 7: Take snapshot before return for TextureClients with synchronization

(I've added the extra context below to the comment in the patch locally.)

This also makes it easier to asynchronously cache the DataSourceSurface in the
GPU process, when a page is using getImageData. This is done in a later patch.
Attachment #9027966 - Flags: review?(matt.woodrow)
Attachment #9028446 - Flags: review?(jmuizelaar)
Attachment #9027981 - Flags: review?(jmuizelaar)
Attachment #9027992 - Flags: review?(jmuizelaar)
Attachment #9028001 - Flags: review?(jmuizelaar)
Comment on attachment 9028447 [details] [diff] [review]
Part 12: Add CanvasParent, CanvasChild and RecordedTextureData

This ties everything together, so there's quite a bit of layout stuff in here as well.

Also, including jld for the IPC parts.
Attachment #9028447 - Flags: review?(matt.woodrow)
Attachment #9028447 - Flags: review?(jmuizelaar)
Attachment #9028447 - Flags: review?(jld)
Attachment #9028022 - Flags: review?(jmuizelaar)
Attachment #9028027 - Flags: review?(jmuizelaar)
Attachment #9028448 - Flags: review?(jmuizelaar)
See Also: → 1513308
Comment on attachment 9027914 [details] [diff] [review]
Part 5: Make sure the DrawTarget can create a similar DrawTarget when falling back to empty surface

Review of attachment 9027914 [details] [diff] [review]:
-----------------------------------------------------------------

Seems reasonable.

As an aside, I find it a bit odd that this code thinks that an empty B8G8R8A8 surface is a valid opaque surface (it uses gfxAlphaType::Opaque), but that's already an existing problem in the code...
Attachment #9027914 - Flags: review?(mstange) → review+
Comment on attachment 9028445 [details] [diff] [review]
Part 6: Add remote canvas pref and refactor TextuteData creation to use it

Review of attachment 9028445 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/client/ClientLayerManager.cpp
@@ +906,5 @@
>    // because the canvas will most likely be flattened into a thebes layer instead
>    // of being sent to the compositor, in which case rendering into shared memory
>    // is wasteful.
>    if (IsCompositingCheap() &&
> +      (gfxPrefs::PersistentBufferProviderSharedEnabled() ||

So it looks like this returns false on Windows (and OSX), since it got disabled in bug 1285271 due to regressions (that aren't really explained).

I realize that canvas remoting isn't going to be enabled yet either, but since enabling it will also enable the shared buffer provider, do we have any idea of what the fallout from that might be?
Attachment #9028445 - Flags: review?(matt.woodrow) → review+
Comment on attachment 9027966 [details] [diff] [review]
Part 7: Take snapshot before return for TextureClients with synchronization

Review of attachment 9027966 [details] [diff] [review]:
-----------------------------------------------------------------

Snapshots are usually just a copy-on-write reference to the same underlying data as the DrawTarget, which wouldn't be ok to hold on to outside of the TextureClient lock. It looks like TextureClient::Unlock calls DetachAllSnapshots though, which should break this dependency by actually allocating and copying into the snapshot.

It seems like finding a way to support multiple-readers, single-writer would be better as it wouldn't require an extra copy, but maybe I need to look at more of the patch queue first.

I think this is ok, assuming that it is what you intended, but can you please add more comments to PersistentBufferProvider explaining why it's safe, and why copying is the right thing to do.
Attachment #9027966 - Flags: review?(matt.woodrow) → review+
Bob, can you give an overview of resulting architecture from these patches? I know we spoke a bit about it, but it's worth getting it down in writing.
Flags: needinfo?(bobowencode)
(In reply to Matt Woodrow (:mattwoodrow) from comment #44)
...
> > +      (gfxPrefs::PersistentBufferProviderSharedEnabled() ||
> 
> So it looks like this returns false on Windows (and OSX), since it got
> disabled in bug 1285271 due to regressions (that aren't really explained).
> 
> I realize that canvas remoting isn't going to be enabled yet either, but
> since enabling it will also enable the shared buffer provider, do we have
> any idea of what the fallout from that might be?

Yeah, that is a bit of a concern, but my understanding is that the issues were down to D2D interaction and all of the D2D interaction has changed with these because it's over in the GPU process.
Of course we may see similar issues, so perhaps running an experiment in Beta will be a good idea.

It's also possible that taking the snapshot before returning the textures, the use of keyed mutexes and the D2D factory synchronisation in the final patch might help with whatever the issues were.
Flags: needinfo?(bobowencode)
(In reply to Matt Woodrow (:mattwoodrow) from comment #45)
> Comment on attachment 9027966 [details] [diff] [review]
> Part 7: Take snapshot before return for TextureClients with synchronization
> 
> Review of attachment 9027966 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Snapshots are usually just a copy-on-write reference to the same underlying
> data as the DrawTarget, which wouldn't be ok to hold on to outside of the
> TextureClient lock. It looks like TextureClient::Unlock calls
> DetachAllSnapshots though, which should break this dependency by actually
> allocating and copying into the snapshot.
> 
> It seems like finding a way to support multiple-readers, single-writer would
> be better as it wouldn't require an extra copy, but maybe I need to look at
> more of the patch queue first.
> 
> I think this is ok, assuming that it is what you intended, but can you
> please add more comments to PersistentBufferProvider explaining why it's
> safe, and why copying is the right thing to do.

You're right this is safe because of the DetachAllSnapshots call, I'll change the comments to make this clear.
I'm using textures protected by keyed mutexes at the moment, because this was the easiest thing to get working and performing well.
This is because I only have to wait in the content main thread for the locks to have happened and then it can wait in the compositor thread (in the GPU process) for the rest of the drawing and the unlock.
However using keyed mutexes did cause some performance problems when the previous texture was being accessed to get image data or copy to the next texture client. So doing it this way means we never have that conflict.

As far as I can tell using keyed mutexes doesn't really support multiple-readers (or even single), single-writer because you still have to take the same lock to read.
I'm pretty sure that we'll be able to go back to using textures without the mutex and SyncObjects instead, where we would have to wait in the compositor thread for the SyncObject synchronize, but that seemed trickier to get right, so I went for what seemed like the simpler solution to start with.
I'm reasonably confident that that approach will have even better performance once it was working.
(In reply to Jeff Muizelaar [:jrmuizel] from comment #46)
> Bob, can you give an overview of resulting architecture from these patches?
> I know we spoke a bit about it, but it's worth getting it down in writing.

Yeah, I'll work on a comment to this bug to give an overall summary, it will be Monday now I think before I can do that.
That can then form the basis of adding the explanation into the tree, which would be good.
Meant to leave that needinfo, so I don't forget.
Flags: needinfo?(bobowencode)
Attached file Remote 2D Canvas design notes.txt (obsolete) —
I've created some high level design notes for these changes.
Hopefully these give you a reasonable idea about how things hang together.
I've done it as a text attachment, so I can correct or add to it as required.

I'm kind of on PTO from Wednesday, but I'm more than happy to answer any questions during that (assuming you're not on PTO as well :-) ) and I'll be doing bits of work from time to time anyway.
Flags: needinfo?(bobowencode) → needinfo?(jmuizelaar)
Attachment #9031865 - Attachment is obsolete: true
Comment on attachment 9028447 [details] [diff] [review]
Part 12: Add CanvasParent, CanvasChild and RecordedTextureData

Review of attachment 9028447 [details] [diff] [review]:
-----------------------------------------------------------------

The IPC parts look reasonable.

(I was going to point out CrossProcessSemaphore::CloseHandle, to save resources after the semaphore has been shared, but it's a no-op on Windows and this is Windows-only, so it looks like you won't need it.)
Attachment #9028447 - Flags: review?(jld) → review+
Attachment #9027981 - Flags: review?(jmuizelaar) → review?(rhunt)
Attachment #9027992 - Flags: review?(jmuizelaar) → review?(rhunt)
Attachment #9028001 - Flags: review?(jmuizelaar) → review?(rhunt)
Attachment #9028022 - Flags: review?(jmuizelaar) → review?(rhunt)
Attachment #9028027 - Flags: review?(jmuizelaar) → review?(rhunt)
Attachment #9028445 - Flags: review?(jmuizelaar) → review?(rhunt)
Attachment #9028446 - Flags: review?(jmuizelaar) → review?(rhunt)
Attachment #9028447 - Flags: review?(jmuizelaar) → review?(rhunt)
Attachment #9028448 - Flags: review?(jmuizelaar) → review?(rhunt)
Attachment #9028864 - Flags: review?(jmuizelaar) → review?(rhunt)

Hey Ryan, we're nearly two months in here with little movement. What's the ETA on first run through?

Flags: needinfo?(rhunt)

Apologies, I've been preoccupied trying to ship scroll anchoring in 66. I've glanced through these patches and hope to do a first pass this week.

Flags: needinfo?(rhunt)
Comment on attachment 9028022 [details] [diff] [review]
Part 13: Make the recording of surface data more efficient

Review of attachment 9028022 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/2d/RecordedEventImpl.h
@@ +2727,2 @@
>    }
>  }

Was the difference from this change actually noticeable? I'd expect at least some of this transformation to be performed by the compiler.
Comment on attachment 9027981 [details] [diff] [review]
Part 9: Add a D3D11 device to be used on canvas threads in the GPU process

Review of attachment 9027981 [details] [diff] [review]:
-----------------------------------------------------------------

Does it work to share the D3D11 device with the compositor? That would make some of the texture synchronization easier and would help avoid the problems that we've run into with texture sharing between devices. It would also be a memory usage improvement.
Flags: needinfo?(jmuizelaar) → needinfo?(bobowencode)

(In reply to Jeff Muizelaar [:jrmuizel] from comment #56)

Comment on attachment 9028022 [details] [diff] [review]
Part 13: Make the recording of surface data more efficient

Review of attachment 9028022 [details] [diff] [review]:

::: gfx/2d/RecordedEventImpl.h
@@ +2727,2 @@

}
}

Was the difference from this change actually noticeable? I'd expect at least
some of this transformation to be performed by the compiler.

Looking at the disassembly (in Nightly) it appears to be doing the call to BytesPerPixel and the two multiplications on every iteration. Also for the SizeCollector the call to BytesPerPixel and a single multiplication is done per iteration (it drops the other multiplication because the pointer position isn't used).
I doubt it makes a big difference in the grand scheme of things, I think I did this before I found more important performance wins. I thought it was worth keeping anyway.

(In reply to Jeff Muizelaar [:jrmuizel] from comment #57)

Comment on attachment 9027981 [details] [diff] [review]
Part 9: Add a D3D11 device to be used on canvas threads in the GPU process

Review of attachment 9027981 [details] [diff] [review]:

Does it work to share the D3D11 device with the compositor? That would make
some of the texture synchronization easier and would help avoid the problems
that we've run into with texture sharing between devices. It would also be a
memory usage improvement.

When I tried this originally it crashed pretty quickly, but if we used the Direct2D synchronization (from part 15) around all of the Direct 3D and DXGI calls on the compositor thread as well, then that might well solve those problems.

I guess we could add a flag to the Direct3D texture code to set when we call Factory::CreateDrawTargetForD3D11Texture (and anywhere else) and then always use the synchronization.

Flags: needinfo?(bobowencode)
Attachment #9028864 - Flags: review?(rhunt) → review+
Comment on attachment 9028027 [details] [diff] [review]
Part 14: Refactor Path recording to record Arc properly

Review of attachment 9028027 [details] [diff] [review]:
-----------------------------------------------------------------

I'm not sure if this will conflict with this commit [1] by Bas that landed a bit back.

Otherwise, I just have a few questions/comments.

[1] https://hg.mozilla.org/mozilla-central/rev/7eac43ea765ebb657f9749a6a8fb2c5c006fae8d

::: gfx/2d/PathRecording.cpp
@@ +10,5 @@
>  
>  namespace mozilla {
>  namespace gfx {
>  
> +#define NEXT_PARAMS(_type) \

Can you undefine this macro when it's done being used? Just to ensure it plays nice with unified builds.

@@ +60,5 @@
> +        aPathSink.Close();
> +        break;
> +      default:
> +        MOZ_MAKE_COMPILER_ASSUME_IS_UNREACHABLE(
> +          "We control mOpTypes, so this should never happen.");

This is only true if we validate the incoming bytes of the PathOps constructor that takes a stream. Do we have plans to do this, and if not what will happen here?

@@ +120,5 @@
> +  return newPathOps;
> +}
> +
> +size_t
> +PathOps::Size() const

What is this used for? Does it make sense to compute this when building paths and transmit it along with recordings?

::: gfx/2d/PathRecording.h
@@ +43,5 @@
> +  bool StreamToSink(PathSink &aPathSink) const;
> +
> +  PathOps TransformedCopy(const Matrix &aTransform) const;
> +
> +  size_t Size() const;

It might be good to document whether this is size in bytes or amount of items.

::: gfx/2d/RecordedEventImpl.h
@@ +726,5 @@
>  
>    ReferencePtr mRefPtr;
>    FillRule mFillRule;
> +  PathOps* mPathOps;
> +  bool mOwnPathOps = false;

My understanding is that this is new optimization to prevent a copy when recording a path.

In this case, RecordedPathCreation is being initialized as a temporary in  DrawTargetRecording::EnsurePathStored where it's guaranteed that the path will outlive this event, so it's okay that we take a pointer to the PathOps and don't free it or worry about a dangling pointer.

I'm a bit worried about this being broken unintentionally in the future. It took me a bit to figure it out.

Would it make more sense to instead use a RefPtr<PathRecording> to the source path? Or as an additional pointer to ensure the lifetime for the direct pointer works out. This would ensure it stays alive no matter how the caller uses this class.

Alternatively, if there's no good solution to this then please add a comment explaining what's going on.
Attachment #9028027 - Flags: review?(rhunt)
Attachment #9028022 - Flags: review?(rhunt) → review+

Thanks rhunt (and jrmuizel) for starting to look at these, I know how busy you all are.

(In reply to Ryan Hunt [:rhunt] from comment #59)

Comment on attachment 9028027 [details] [diff] [review]
Part 14: Refactor Path recording to record Arc properly

Review of attachment 9028027 [details] [diff] [review]:

I'm not sure if this will conflict with this commit [1] by Bas that landed a
bit back.

Yeah, thanks, doesn't look like the rebase will be too bad.

+#define NEXT_PARAMS(_type) \

Can you undefine this macro when it's done being used? Just to ensure it
plays nice with unified builds.

Will do thanks.

@@ +60,5 @@

  •    aPathSink.Close();
    
  •    break;
    
  •  default:
    
  •    MOZ_MAKE_COMPILER_ASSUME_IS_UNREACHABLE(
    
  •      "We control mOpTypes, so this should never happen.");
    

This is only true if we validate the incoming bytes of the PathOps
constructor that takes a stream. Do we have plans to do this, and if not
what will happen here?

I think I was originally using a similar case statement in the recording before I hit on the idea of having it all in one vector<uint8_t>, where this did hold. I obviously didn't change that when I copied it, thanks I'll fix it.
I think here I will return false and I need to check for that in the caller.
In the other cases I think I should probably just crash.

@@ +120,5 @@

  • return newPathOps;
    +}

+size_t
+PathOps::Size() const

What is this used for? Does it make sense to compute this when building
paths and transmit it along with recordings?

It's only used in RecordedPathCreation::OutputSimpleEventInfo, which isn't called normally, only if you add in some logging.
So, I didn't think it was worth computing up front as it isn't normally needed.

::: gfx/2d/PathRecording.h
@@ +43,5 @@

  • bool StreamToSink(PathSink &aPathSink) const;
  • PathOps TransformedCopy(const Matrix &aTransform) const;
  • size_t Size() const;

It might be good to document whether this is size in bytes or amount of
items.

Will do.

::: gfx/2d/RecordedEventImpl.h
@@ +726,5 @@

ReferencePtr mRefPtr;
FillRule mFillRule;

  • PathOps* mPathOps;
  • bool mOwnPathOps = false;

My understanding is that this is new optimization to prevent a copy when
recording a path.

In this case, RecordedPathCreation is being initialized as a temporary in
DrawTargetRecording::EnsurePathStored where it's guaranteed that the path
will outlive this event, so it's okay that we take a pointer to the PathOps
and don't free it or worry about a dangling pointer.

I'm a bit worried about this being broken unintentionally in the future. It
took me a bit to figure it out.

Would it make more sense to instead use a RefPtr<PathRecording> to the
source path? Or as an additional pointer to ensure the lifetime for the
direct pointer works out. This would ensure it stays alive no matter how the
caller uses this class.

Alternatively, if there's no good solution to this then please add a comment
explaining what's going on.

I think I was trying to use the same member for the recording and playback (because we only need the PathOps for playback).
But I think you're right, it'd be better if I just held a RefPtr<PathRecording> as well.

Comment on attachment 9028447 [details] [diff] [review]
Part 12: Add CanvasParent, CanvasChild and RecordedTextureData

Review of attachment 9028447 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good!

We have some gfx documentation in gfx/docs/, do you think you could put your explainer doc into that folder as part of this work?

::: gfx/layers/client/TextureRecorded.h
@@ +4,5 @@
> + * License, v. 2.0. If a copy of the MPL was not distributed with this
> + * file, You can obtain one at https://mozilla.org/MPL/2.0/. */
> +
> +#ifndef mozilla_layers_TextureRecorded_h
> +#define mozilla_layers_TextureRecorded_h

Any reason for the naming to be reversed for the filename?

::: gfx/layers/ipc/CanvasChild.cpp
@@ +139,5 @@
> +
> +already_AddRefed<gfx::DataSourceSurface>
> +CanvasChild::GetDataSurface(const gfx::SourceSurface* aSurface)
> +{
> +  MOZ_ASSERT(aSurface);

You mentioned in the notes that you want to make this work from other threads. What other threads are consumers of this?

Maybe add a main thread assert/warning for the meantime?

::: gfx/layers/ipc/CompositorBridgeChild.cpp
@@ +258,5 @@
>  
>    mCanSend = true;
>    mIdNamespace = aNamespace;
> +
> +  if (gfx::gfxVars::RemoteCanvasEnabled()) {

Should we do this lazily? Maybe a TODO even if you don't want to do so now.

::: gfx/layers/ipc/PCanvas.ipdl
@@ +19,5 @@
> +  async CreateTranslator(TextureType aTextureType, Handle aReadHandle,
> +                         CrossProcessSemaphoreHandle aReaderSem,
> +                         CrossProcessSemaphoreHandle aWriterSem);
> +
> +  async StartTranslation();

My understanding is that this gets called when we detect that the reader has transitioned into the Stopped state (which we can check from the shmem ring buffer), and we need this explicit API to tell it to wake up again?

Calling this ResumeTranslation, or NotifyNewDataAvailable or something along those lines.

Documenting both methods here would be really useful.

(In reply to Bob Owen (:bobowen) from comment #58)

Does it work to share the D3D11 device with the compositor? That would make
some of the texture synchronization easier and would help avoid the problems
that we've run into with texture sharing between devices. It would also be a
memory usage improvement.

When I tried this originally it crashed pretty quickly, but if we used the Direct2D synchronization (from part 15) around all of the Direct 3D and DXGI calls on the compositor thread as well, then that might well solve those problems.

I guess we could add a flag to the Direct3D texture code to set when we call Factory::CreateDrawTargetForD3D11Texture (and anywhere else) and then always use the synchronization.

You might have seen this already, but this is documented at https://docs.microsoft.com/en-us/windows/desktop/direct2d/multi-threaded-direct2d-apps

Ultimately, only one thread can be using the D3DDevice's immediate context at a time, so we need to lock around all the compositor threads usage of the device.

This does mean that <canvas> work would block the compositor, whereas with separate devices, we can have some amount of preemption.

The lower memory usage, and less cross-device sharing is appealing though, would make a good follow-up investigation.

Comment on attachment 9027981 [details] [diff] [review]
Part 9: Add a D3D11 device to be used on canvas threads in the GPU process

Review of attachment 9027981 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/ipc/GPUParent.cpp
@@ +229,2 @@
>    }
> +  if (gfxVars::UseWebRender() || gfxVars::RemoteCanvasEnabled()) {

Why does remote canvas mean we need to create a DirectComposition device?
Comment on attachment 9027992 [details] [diff] [review]
Part 10: Add a CanvasTranslator and canvas recorded events

Review of attachment 9027992 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/CanvasTranslator.cpp
@@ +31,5 @@
> +#ifdef XP_WIN
> +    case TextureType::DXGI:
> +    {
> +      RefPtr<ID3D11Device> device =
> +        gfx::DeviceManagerDx::Get()->GetCanvasDevice();

Can we also assert on device being non-null, since D3D11TextureData::Create will go looking for a different one if it's null, and we probably don't want that.

@@ +206,5 @@
> +
> +TextureData*
> +CanvasTranslator::LookupTextureData(gfx::ReferencePtr aDrawTarget)
> +{
> +  TextureMap::const_iterator result = mTextureDatas.find(aDrawTarget);

Should we be holding the mTextureDatasMonitor when accessing mTextureDatas?

We have the DataMutex class to make the equivalent pattern impossible when using Mutexes, would be nice if we could have a Monitor equivalent.

::: gfx/layers/RecordedCanvasEventImpl.h
@@ +218,5 @@
> +
> +inline bool
> +RecordedTextureUnlock::PlayCanvasEvent(CanvasTranslator* aTranslator) const
> +{
> +  TextureData* textureData = aTranslator->LookupTextureData(mDT);

The translator's mTextureDatas owns these TextureData* (using UniquePtr), and is also protected by a Monitor (suggesting that it's accessed from multiple threads).

Are we guaranteed that this won't be deleted from under us between the LookupTextureData call, and the Unlock?

Maybe we could have assertions around which threads can read/write the mTextureDatas map, and only allow accessing the raw pointers from the (hopefully single) write thread?
Comment on attachment 9028448 [details] [diff] [review]
Part 15: Spread the playback of canvas recordings across multiple threads in the GPU process

Review of attachment 9028448 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/2d/2D.h
@@ +1916,5 @@
>  private:
>    static DrawEventRecorder *mRecorder;
>  };
>  
> +class MOZ_RAII AutoMoz2DSerialize final

AutoSerializeWithMoz2D?

::: gfx/layers/RecordedCanvasEventImpl.h
@@ +175,5 @@
>      return false;
>    }
>  
> +  gfx::AutoMoz2DSerialize moz2dSerialize(
> +      aTranslator->GetReferenceDrawTarget()->GetBackendType());

It would be nice if we could somehow verify that we're not missing any callers that access our d3d device (current or future).

Have you tried running with D2D and the D3D device's debugging modes enabled?
Attachment #9028448 - Flags: review?(rhunt) → review+
Comment on attachment 9028446 [details] [diff] [review]
Part 8: Add a CanvasEventRingBuffer and CanvasDrawEventRecorder

Review of attachment 9028446 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/2d/RecordedEvent.h
@@ +194,5 @@
> +   *
> +   * @param aRecordedEvent the event to record
> +   */
> +  template<class RE>
> +  void RecordEvent(const RE* aRecordedEvent)

Do you know how much of a performance difference it makes having 3 different implementations of this code (RecordEvent, write and WriteInParts)? Why is MemWriter faster? Is it just because we only have to call UpdateWriteCount once per-event?

Are there any callers to write() that happen when RecordEvent isn't on the stack?

It seems like we could just implement write as the contents of WriteInParts, and call UpdateWriteCount once, at the end of RecordEvent.
Comment on attachment 9028446 [details] [diff] [review]
Part 8: Add a CanvasEventRingBuffer and CanvasDrawEventRecorder

Review of attachment 9028446 [details] [diff] [review]:
-----------------------------------------------------------------

Would it be possible to make the RingBuffer a standalone class (ideally in ipc/glue/)?

I'm finding it hard to reason about the RecordedEvent logic interleaved the strict ring buffer code, and I think that's making this feel more complicated than it needs to.

If we did that then we could get another reviewer (without requiring gfx/moz2d experience) to look over the thread safety of the actual ring buffer, which I think would be valuable.

(In reply to Matt Woodrow (:mattwoodrow) from comment #61)

Comment on attachment 9028447 [details] [diff] [review]
Part 12: Add CanvasParent, CanvasChild and RecordedTextureData

Thanks for looking at all of this and all of the feedback.
I'm tied up with other work at the moment, but I'll try and answer questions and then update the patches when I get a chance.

We have some gfx documentation in gfx/docs/, do you think you could put your
explainer doc into that folder as part of this work?

Yes that's the plan after it's had some more work and formatting, although I wasn't sure where to put it, so that answers that. :-)

::: gfx/layers/client/TextureRecorded.h
...
Any reason for the naming to be reversed for the filename?

I think I named the class before I realised that the other files and classes were the other way around, so I'll change it to be consistent.

::: gfx/layers/ipc/CanvasChild.cpp
@@ +139,5 @@

+already_AddRefed<gfx::DataSourceSurface>
+CanvasChild::GetDataSurface(const gfx::SourceSurface* aSurface)
+{

  • MOZ_ASSERT(aSurface);

You mentioned in the notes that you want to make this work from other
threads. What other threads are consumers of this?

Maybe add a main thread assert/warning for the meantime?

I think I saw a situation where a canvas texture ended up on the paint thread, but yeah I'll add an assertion to make this obvious until I can do the follow-up.

::: gfx/layers/ipc/CompositorBridgeChild.cpp
@@ +258,5 @@
...

  • if (gfx::gfxVars::RemoteCanvasEnabled()) {

Should we do this lazily? Maybe a TODO even if you don't want to do so now.

Yeah, I was more concerned about making the creation of the ring buffer lazy, but the whole thing probably could be as well.
I'll add as another follow-up.

::: gfx/layers/ipc/PCanvas.ipdl
@@ +19,5 @@

  • async CreateTranslator(TextureType aTextureType, Handle aReadHandle,
  •                     CrossProcessSemaphoreHandle aReaderSem,
    
  •                     CrossProcessSemaphoreHandle aWriterSem);
    
  • async StartTranslation();

My understanding is that this gets called when we detect that the reader has
transitioned into the Stopped state (which we can check from the shmem ring
buffer), and we need this explicit API to tell it to wake up again?

Calling this ResumeTranslation, or NotifyNewDataAvailable or something along
those lines.

Documenting both methods here would be really useful.

I think ResumeTranslation is much better thanks, I'll rename.
Sorry I'd missed these as I was going through adding more comments (there are no doubt other places as well).

(In reply to Matt Woodrow (:mattwoodrow) from comment #62)

(In reply to Bob Owen (:bobowen) from comment #58)

Does it work to share the D3D11 device with the compositor? That would make
some of the texture synchronization easier and would help avoid the problems
that we've run into with texture sharing between devices. It would also be a
memory usage improvement.

When I tried this originally it crashed pretty quickly, but if we used the Direct2D synchronization (from part 15) around all of the Direct 3D and DXGI calls on the compositor thread as well, then that might well solve those problems.

I guess we could add a flag to the Direct3D texture code to set when we call Factory::CreateDrawTargetForD3D11Texture (and anywhere else) and then always use the synchronization.

You might have seen this already, but this is documented at https://docs.microsoft.com/en-us/windows/desktop/direct2d/multi-threaded-direct2d-apps

Yeah, that is where I found this originally to make the sharing across the canvas threads work.
I leave the possibility of sharing devices to a follow-up as you suggest.

(In reply to Matt Woodrow (:mattwoodrow) from comment #63)
...

  • if (gfxVars::UseWebRender() || gfxVars::RemoteCanvasEnabled()) {

Why does remote canvas mean we need to create a DirectComposition device?

Hmm, I think this might be a hangover from when I was trying to force the creation of the device that the DrawTargetD2D1 depended on.
I'll try and get rid of it.

(In reply to Matt Woodrow (:mattwoodrow) from comment #64)

Comment on attachment 9027992 [details] [diff] [review]
Part 10: Add a CanvasTranslator and canvas recorded events
...

  •  RefPtr<ID3D11Device> device =
    
  •    gfx::DeviceManagerDx::Get()->GetCanvasDevice();
    

Can we also assert on device being non-null, since D3D11TextureData::Create
will go looking for a different one if it's null, and we probably don't want
that.

Ah yes, I'll add a check for that.

  • TextureMap::const_iterator result = mTextureDatas.find(aDrawTarget);

Should we be holding the mTextureDatasMonitor when accessing mTextureDatas?

This should be on the same thread that might modify it, so I think we're OK here.
Only WaitForTextureData gets call from the compositor thread, which is why that and the modifications hold the Monitor.

We have the DataMutex class to make the equivalent pattern impossible when
using Mutexes, would be nice if we could have a Monitor equivalent.

Yeah, it would be good to have this checked with assertions, the class would have to hold the thread that it can be modified on, as it could be any one of the canvas threads for each translation run, but that should be doable.
Does DataMutex allow for this sort of case, where we only modify on one thread and don't want to lock for read access on that thread?

::: gfx/layers/RecordedCanvasEventImpl.h
@@ +218,5 @@

+inline bool
+RecordedTextureUnlock::PlayCanvasEvent(CanvasTranslator* aTranslator) const
+{

  • TextureData* textureData = aTranslator->LookupTextureData(mDT);

The translator's mTextureDatas owns these TextureData* (using UniquePtr),
and is also protected by a Monitor (suggesting that it's accessed from
multiple threads).

Are we guaranteed that this won't be deleted from under us between the
LookupTextureData call, and the Unlock?

Maybe we could have assertions around which threads can read/write the
mTextureDatas map, and only allow accessing the raw pointers from the
(hopefully single) write thread?

Hmm, the compositor thread is the only other thread that accesses that is not the (current) write thread.
With a call to WaitForTextureData through LookupTextureDataForClientDrawTarget.
I don't think the TextureData can normally get deleted, because the compositor thread holds a read lock on the associated TextureClient in the content process.
That might break down in some failure scenarios though and I agree it's a non-obvious guarantee either way.

Actually, the compositor thread only uses this to call Serialize to get a SurfaceDescriptor to create its TextureHost.
I should probably just call Serialize on the canvas thread and store the SurfaceDescriptors in a separate container and change this Monitor to protect that.
We would add on the canvas thread and remove on the compositor thread (think this set-up only happens once per texture).
mTextureDatas would then only be accessed on a single thread at a time.

(In reply to Matt Woodrow (:mattwoodrow) from comment #65)

Comment on attachment 9028448 [details] [diff] [review]
Part 15: Spread the playback of canvas recordings across multiple threads in
the GPU process
...

+class MOZ_RAII AutoMoz2DSerialize final

AutoSerializeWithMoz2D?

That's a much better name, thanks.

::: gfx/layers/RecordedCanvasEventImpl.h
@@ +175,5 @@

 return false;

}

  • gfx::AutoMoz2DSerialize moz2dSerialize(
  •  aTranslator->GetReferenceDrawTarget()->GetBackendType());
    

It would be nice if we could somehow verify that we're not missing any
callers that access our d3d device (current or future).

Have you tried running with D2D and the D3D device's debugging modes enabled?

I used one of these to track down a couple of problems I had with all of this, but I'm not sure I had both of these turned on.
I'll look into it.

(In reply to Matt Woodrow (:mattwoodrow) from comment #66)

Comment on attachment 9028446 [details] [diff] [review]
Part 8: Add a CanvasEventRingBuffer and CanvasDrawEventRecorder

Review of attachment 9028446 [details] [diff] [review]:

::: gfx/2d/RecordedEvent.h
@@ +194,5 @@

    • @param aRecordedEvent the event to record
  • */
  • template<class RE>
  • void RecordEvent(const RE* aRecordedEvent)

Do you know how much of a performance difference it makes having 3 different
implementations of this code (RecordEvent, write and WriteInParts)? Why is
MemWriter faster? Is it just because we only have to call UpdateWriteCount
once per-event?

MemWriter is much faster because there are no checks for space and the compiler can just merge any continuous fixed length WriteElements in the specific RecordedEventDerived<Derived>::Record(MemWriter) into direct memory movements in assembly.

The compiler also does a similar thing instead of calling memcpy for write, but only for individual WriteElements because of the need to check for space each time.
If I make write more complicated, then the compiler stops doing this, although I don't think it's as important as using MemWriter. Maybe I could convince the compiler to still do this by letting it know which paths are more likely.

It would be even nicer to change the RecordedEvents to do one fixed length write and optionally one (or more) variable length writes, then the compiler would probably always optimize. That would be a large separate refactor though, so I decided not to go down that route. It would probably help on the performance of the reading side as well though.

Are there any callers to write() that happen when RecordEvent isn't on the
stack?

I don't think so.

It seems like we could just implement write as the contents of WriteInParts,
and call UpdateWriteCount once, at the end of RecordEvent.

Quite possibly, but from what I remember this was more about getting the compiler to do multiple simple memory movements in assembly, instead of multiple calls to memcpy.

(In reply to Matt Woodrow (:mattwoodrow) from comment #67)

Comment on attachment 9028446 [details] [diff] [review]
Part 8: Add a CanvasEventRingBuffer and CanvasDrawEventRecorder

Review of attachment 9028446 [details] [diff] [review]:

Would it be possible to make the RingBuffer a standalone class (ideally in
ipc/glue/)?

I'm finding it hard to reason about the RecordedEvent logic interleaved the
strict ring buffer code, and I think that's making this feel more
complicated than it needs to.

If we did that then we could get another reviewer (without requiring
gfx/moz2d experience) to look over the thread safety of the actual ring
buffer, which I think would be valuable.

I think this is a bit difficult with the way the recording is currently set up in Moz2D.
The recording is done through virtual calls (through RecordToStream), with the actual recording (in RecordedEventImpl.h) not usable directly from outside of Moz2D.
In order to get the performance benefits described above the writes need to be in Moz2D not the other side of a virtual call and Moz2D can't include code from elsewhere (and certainly as it stands the ring buffer isn't generic enough to put into mfbt).
I did make some attempts to untangle this, but I don't think it's a small job.
If the RecordedEvent definitions and some of the templated code in RecordedEventImpl.h could be moved back into RecordedEvent.h then I think using a Stream defined outside of Moz2D would be easier.

Attachment #9027981 - Flags: review?(rhunt) → review?(jmuizelaar)
Attachment #9027992 - Flags: review?(rhunt) → review?(jmuizelaar)
Attachment #9028001 - Flags: review?(rhunt) → review?(jmuizelaar)
Attachment #9028445 - Flags: review?(rhunt) → review?(jmuizelaar)
Attachment #9028446 - Flags: review?(rhunt) → review?(jmuizelaar)
Attachment #9028447 - Flags: review?(rhunt) → review?(jmuizelaar)

Stealing the review back from rhunt, because I probably have more time then he does now.

(In reply to Bob Owen (:bobowen) from comment #68)

You mentioned in the notes that you want to make this work from other
threads. What other threads are consumers of this?

Maybe add a main thread assert/warning for the meantime?

I think I saw a situation where a canvas texture ended up on the paint thread, but yeah I'll add an assertion to make this obvious until I can do the follow-up.

This does seem likely, if you wrap the <canvas> in some other effect that needs to be drawn on the content side (any filter without WR, complex filter chains with WR), then we'd end up trying to draw the <canvas> DT with our paint thread recordings.

I don't think sending a request to the main thread and waiting will help here, since it's very possible for the main thread to already be waiting on the paint threads to complete.

Attached patch Part 1: Fix unified build issues (obsolete) — Splinter Review

r=jrmuizel - rebased version of part 1 in case you want to apply the roll-up and compile.

Attachment #9028443 - Attachment is obsolete: true
Attachment #9044222 - Flags: review+

(In reply to Bob Owen (:bobowen) from comment #72)

(In reply to Matt Woodrow (:mattwoodrow) from comment #66)

Comment on attachment 9028446 [details] [diff] [review]
Part 8: Add a CanvasEventRingBuffer and CanvasDrawEventRecorder

Review of attachment 9028446 [details] [diff] [review]:

::: gfx/2d/RecordedEvent.h
@@ +194,5 @@

    • @param aRecordedEvent the event to record
  • */
  • template<class RE>
  • void RecordEvent(const RE* aRecordedEvent)

Do you know how much of a performance difference it makes having 3 different
implementations of this code (RecordEvent, write and WriteInParts)? Why is
MemWriter faster? Is it just because we only have to call UpdateWriteCount
once per-event?

MemWriter is much faster because there are no checks for space and the compiler can just merge any continuous fixed length WriteElements in the specific RecordedEventDerived<Derived>::Record(MemWriter) into direct memory movements in assembly.

The compiler also does a similar thing instead of calling memcpy for write, but only for individual WriteElements because of the need to check for space each time.
If I make write more complicated, then the compiler stops doing this, although I don't think it's as important as using MemWriter. Maybe I could convince the compiler to still do this by letting it know which paths are more likely.

It would be even nicer to change the RecordedEvents to do one fixed length write and optionally one (or more) variable length writes, then the compiler would probably always optimize. That would be a large separate refactor though, so I decided not to go down that route. It would probably help on the performance of the reading side as well though.

Are there any callers to write() that happen when RecordEvent isn't on the
stack?

I don't think so.

It seems like we could just implement write as the contents of WriteInParts,
and call UpdateWriteCount once, at the end of RecordEvent.

Quite possibly, but from what I remember this was more about getting the compiler to do multiple simple memory movements in assembly, instead of multiple calls to memcpy.

(In reply to Matt Woodrow (:mattwoodrow) from comment #67)

Comment on attachment 9028446 [details] [diff] [review]
Part 8: Add a CanvasEventRingBuffer and CanvasDrawEventRecorder

Review of attachment 9028446 [details] [diff] [review]:

Would it be possible to make the RingBuffer a standalone class (ideally in
ipc/glue/)?

I'm finding it hard to reason about the RecordedEvent logic interleaved the
strict ring buffer code, and I think that's making this feel more
complicated than it needs to.

If we did that then we could get another reviewer (without requiring
gfx/moz2d experience) to look over the thread safety of the actual ring
buffer, which I think would be valuable.

I think this is a bit difficult with the way the recording is currently set up in Moz2D.
The recording is done through virtual calls (through RecordToStream), with the actual recording (in RecordedEventImpl.h) not usable directly from outside of Moz2D.
In order to get the performance benefits described above the writes need to be in Moz2D not the other side of a virtual call and Moz2D can't include code from elsewhere (and certainly as it stands the ring buffer isn't generic enough to put into mfbt).
I did make some attempts to untangle this, but I don't think it's a small job.
If the RecordedEvent definitions and some of the templated code in RecordedEventImpl.h could be moved back into RecordedEvent.h then I think using a Stream defined outside of Moz2D would be easier.

I'm confused by this. It seems like the EventRingBuffer abstraction already allows CanvasEventRingBuffer to be outside of Moz2D. Can't CanvasEventRingBuffer be the RingBuffer standalone class?

Flags: needinfo?(bobowencode)

Also, it seems like this kind of thing would be useful for the WebGL remoting work. Do you know what's happening there? Can we share infrastructure?

(In reply to Jeff Muizelaar [:jrmuizel] from comment #77)

(In reply to Bob Owen (:bobowen) from comment #72)
...

I'm finding it hard to reason about the RecordedEvent logic interleaved the
strict ring buffer code, and I think that's making this feel more
complicated than it needs to.

If we did that then we could get another reviewer (without requiring
gfx/moz2d experience) to look over the thread safety of the actual ring
buffer, which I think would be valuable.

I think this is a bit difficult with the way the recording is currently set up in Moz2D.
The recording is done through virtual calls (through RecordToStream), with the actual recording (in RecordedEventImpl.h) not usable directly from outside of Moz2D.
In order to get the performance benefits described above the writes need to be in Moz2D not the other side of a virtual call and Moz2D can't include code from elsewhere (and certainly as it stands the ring buffer isn't generic enough to put into mfbt).
I did make some attempts to untangle this, but I don't think it's a small job.
If the RecordedEvent definitions and some of the templated code in RecordedEventImpl.h could be moved back into RecordedEvent.h then I think using a Stream defined outside of Moz2D would be easier.

I'm confused by this. It seems like the EventRingBuffer abstraction already allows CanvasEventRingBuffer to be outside of Moz2D. Can't CanvasEventRingBuffer be the RingBuffer standalone class?

Sorry I wasn't very clear there, I was talking about the known length (at compile time) memcpys.
This means that the compiler can convert multiple contiguous (in the case of MemWriter in EventRingBuffer::RecordEvent) or single WriteElements (in the case of EventRingBuffer::write) straight into assembly, instead of lots of calls to memcpy for small lengths.
If that part of EventRingBuffer is outside of Moz2D the other side of a virtual call then I don't think the compiler can do that.

The rest can and is outside of Moz2D. Obviously some of it needs to be because it uses things that Moz2D can't.
I guess that could be moved out of layers and somewhere more generic, but it would still be very specific for use with Moz2D recording, because of the above.

(In reply to Jeff Muizelaar [:jrmuizel] from comment #78)

Also, it seems like this kind of thing would be useful for the WebGL remoting work. Do you know what's happening there? Can we share infrastructure?

It has its own more generic producer-consumer queue (at least last time I saw it). It seemed quite complicated for the simple char* writes and reads that Moz2D recording needs.
It's possible that these problems can be solved and we could eventually use more common ring buffer code, but in order to get these optimizations I think Moz2D recording would need some rework.

Flags: needinfo?(bobowencode)

(In reply to Bob Owen (:bobowen) from comment #79)

Sorry I wasn't very clear there, I was talking about the known length (at compile time) memcpys.
This means that the compiler can convert multiple contiguous (in the case of MemWriter in EventRingBuffer::RecordEvent) or single WriteElements (in the case of EventRingBuffer::write) straight into assembly, instead of lots of calls to memcpy for small lengths.
If that part of EventRingBuffer is outside of Moz2D the other side of a virtual call then I don't think the compiler can do that.

The rest can and is outside of Moz2D. Obviously some of it needs to be because it uses things that Moz2D can't.
I guess that could be moved out of layers and somewhere more generic, but it would still be very specific for use with Moz2D recording, because of the above.

I've filed bug 1529677 to move the writing of the event type into RecordToStream. If we take that change and move some of the reading complexity so that we only adjust the counts once per event it seems like it should be possible consolidate more of the ring buffer code in CanvasEventRingBuffer.

i.e. the code in UpdateWriteCount and UpdateReadCount would be moved to virtual functions, with these functions ideally (when not reading/writing in parts) only being called once per event.

Does this seem workable?

Flags: needinfo?(bobowencode)

(In reply to Jeff Muizelaar [:jrmuizel] from comment #80)

(In reply to Bob Owen (:bobowen) from comment #79)
...
I've filed bug 1529677 to move the writing of the event type into RecordToStream. If we take that change and move some of the reading complexity so that we only adjust the counts once per event it seems like it should be possible consolidate more of the ring buffer code in CanvasEventRingBuffer.

i.e. the code in UpdateWriteCount and UpdateReadCount would be moved to virtual functions, with these functions ideally (when not reading/writing in parts) only being called once per event.

Does this seem workable?

Moving the writing of the event type does seem like an improvement.

Although I don't think having the single update to the write and read counts are the main benefit for MemWriter, I think it
more comes from the efficiency of the memory copying.
Indeed once you have taken the hit of doing the copies separately (either through memcpy or more efficient means) there might be a benefit in updating the counts quickly, because it means the other side can then start reading or writing, if it is waiting.

I'm happy to try and move more stuff into CanvasEventRingBuffer though and see if it has a detectable effect on performance.

Flags: needinfo?(bobowencode)

(In reply to Bob Owen (:bobowen) from comment #81)

(In reply to Jeff Muizelaar [:jrmuizel] from comment #80)
I'm happy to try and move more stuff into CanvasEventRingBuffer though and see if it has a detectable effect on performance.

It would also be good to get some gtests for the ring buffer.

Comment on attachment 9028001 [details] [diff] [review]
Part 11: Add DataSurfaceProvider to be used to retrieve surface data from a SourceSurfaceRecording

Review of attachment 9028001 [details] [diff] [review]:
-----------------------------------------------------------------

I think my preference would be for Canvas to deal with getting the pixel data for a DrawTargetRecording without needing DrawTargetRecording to know about it. Is that possible?

(In reply to Jeff Muizelaar [:jrmuizel] from comment #82)

(In reply to Bob Owen (:bobowen) from comment #81)

(In reply to Jeff Muizelaar [:jrmuizel] from comment #80)
I'm happy to try and move more stuff into CanvasEventRingBuffer though and see if it has a detectable effect on performance.

It would also be good to get some gtests for the ring buffer.

More testing would definitely be good, at the moment it very much relies on integration testing with the canvas tests.
I've not used gtests or unit tested something similar to this before, do you know of any good examples I could look at?

(In reply to Jeff Muizelaar [:jrmuizel] from comment #83)

Comment on attachment 9028001 [details] [diff] [review]
Part 11: Add DataSurfaceProvider to be used to retrieve surface data from a
SourceSurfaceRecording

Review of attachment 9028001 [details] [diff] [review]:

I think my preference would be for Canvas to deal with getting the pixel
data for a DrawTargetRecording without needing DrawTargetRecording to know
about it. Is that possible?

Possibly, my only reservation is when the SourceSurface is handed off to something else which then needs the data.
I guess Canvas would know when that will happen though, I'll look into it.
I should be able to get back to this soon.

Flags: needinfo?(jmuizelaar)

(In reply to Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧ from comment #53)

Comment on attachment 9028447 [details] [diff] [review]
Part 12: Add CanvasParent, CanvasChild and RecordedTextureData

Review of attachment 9028447 [details] [diff] [review]:

The IPC parts look reasonable.

(I was going to point out CrossProcessSemaphore::CloseHandle, to save
resources after the semaphore has been shared, but it's a no-op on Windows
and this is Windows-only, so it looks like you won't need it.)

I must have missed this originally.
This is currently Windows only, but should be fairly easy to extend to other platforms.
At first I thought I was already doing this, but that was for the ring buffer shared memory.
I see this is basically the same for CrossProcessSemaphore's internal shared memory on posix.
I've added them locally to part 8 that creates the CrossProcessSemaphores, thanks.

(In reply to Bob Owen (:bobowen) from comment #84)

More testing would definitely be good, at the moment it very much relies on integration testing with the canvas tests.
I've not used gtests or unit tested something similar to this before, do you know of any good examples I could look at?

We have some stuff in https://searchfox.org/mozilla-central/source/gfx/tests/gtest. I'm not sure what kind of testing we have for IPC things.

Flags: needinfo?(jmuizelaar)

I've rebased and made changes addressing up to comment 80.
While running performance tests to get a baseline for the suggested changes in that comment, I've hit issues where occasionally the reading (or possible recording) of RecordedPathCreation (or maybe RecordedPathDestruction) is failing. I'm struggling at the moment to see what the issue is.

I'll pick this up again when I get back from PTO.

Alias: canvas-ipc

I fixed the race I had, which was in the code when I had a clash between the Reader moving from Waiting to Stopped and the Writer moving it from Waiting to Processing. This only showed up on my new dev machine, I think because of the large number of cores available.

I'm going to push new patches to Phabricator.
Unfortunately, I think that means previously reviewed patches will need to be r+ed again.

(In reply to Bob Owen (:bobowen) from comment #81)

(In reply to Jeff Muizelaar [:jrmuizel] from comment #80)

(In reply to Bob Owen (:bobowen) from comment #79)
...
I've filed bug 1529677 to move the writing of the event type into RecordToStream. If we take that change and move some of the reading complexity so that we only adjust the counts once per event it seems like it should be possible consolidate more of the ring buffer code in CanvasEventRingBuffer.

i.e. the code in UpdateWriteCount and UpdateReadCount would be moved to virtual functions, with these functions ideally (when not reading/writing in parts) only being called once per event.

Does this seem workable?

Moving the writing of the event type does seem like an improvement.

Although I don't think having the single update to the write and read counts are the main benefit for MemWriter, I think it
more comes from the efficiency of the memory copying.
Indeed once you have taken the hit of doing the copies separately (either through memcpy or more efficient means) there might be a benefit in updating the counts quickly, because it means the other side can then start reading or writing, if it is waiting.

I'm happy to try and move more stuff into CanvasEventRingBuffer though and see if it has a detectable effect on performance.(In reply to Bob Owen (:bobowen) from comment #84)

I've moved most of the reading and writing logic back into CanvasEventRingBuffer, Only the MemWriter part is still in EventRingBuffer. This doesn't seem to affect the performance, which is mainly from using MemWriter most of the tome anyway.
I think it generally makes thing easier to follow in CanvasEventRingBuffer.
This is done on top of the patch for bug 1529677.

(In reply to Jeff Muizelaar [:jrmuizel] from comment #82)

(In reply to Bob Owen (:bobowen) from comment #81)

(In reply to Jeff Muizelaar [:jrmuizel] from comment #80)
I'm happy to try and move more stuff into CanvasEventRingBuffer though and see if it has a detectable effect on performance.

It would also be good to get some gtests for the ring buffer.

More testing would definitely be good, at the moment it very much relies on integration testing with the canvas tests.
I've not used gtests or unit tested something similar to this before, do you know of any good examples I could look at?

I've not had chance to look at this yet, but I'd like to get something landed (preffed off) and make this one of the follow-ups before enabling on Nightly.

(In reply to Jeff Muizelaar [:jrmuizel] from comment #83)

Comment on attachment 9028001 [details] [diff] [review]
Part 11: Add DataSurfaceProvider to be used to retrieve surface data from a
SourceSurfaceRecording

Review of attachment 9028001 [details] [diff] [review]:

I think my preference would be for Canvas to deal with getting the pixel
data for a DrawTargetRecording without needing DrawTargetRecording to know
about it. Is that possible?

Possibly, my only reservation is when the SourceSurface is handed off to something else which then needs the data.
I guess Canvas would know when that will happen though, I'll look into it.
I should be able to get back to this soon.

I've managed to do this in what is hopefully an acceptable way.
In order to be able to intervene and fetch the data, I've made PersistentBufferProvider::BorrowSnapshot allowable inside or outside of a lock and made CanvasRenderingContext2D use it whenever it fetches a SourceSurface for which we require (or might require) the data.
I think this also removes the possibility of that needing to happen off the main thread.

This patch modifies DoWithEvent so that we can more easily have a
DoWithEventFromStream callable from outside of Moz2D similar to
LoadEventFromStream. We will add that in a later patch for the new
EventRingBuffer. It also changes the only user of LoadEventFromStream
over to it, so we can can get rid of it and LoadEvent entirely.

Depends on D27076

Otherwise, we crash in the content process when we try to record this.

Depends on D27080

This is ground work for when we will be returning a recording TextureData for
certain types in subsequent patches.

Depends on D27082

This is so we don't need to lock the previous back buffer when it might also be
locked by the compositor. These locks are generally for copying to the next
back buffer or when getting image data from the previous back buffer.
This also makes it easier to asynchronously cache the DataSourceSurface in the
GPU process, when a page is using getImageData. This is done in a later patch.

Depends on D27086

These are to be used as part of recording canvas drawing in the content
processes and playing it back in the GPU process through shared memory.

Depends on D27088

These are extensions to the Moz2D RecordedEvents to record and play back canvas
texture related functions in the GPU process.
The CanvasTranslator handles the playback of these and the Moz2D ones.

Depends on D27090

RecordedTextureData records TextureData calls for play back in the GPU process.
CanvasChild and CanvasParent set up the recorder and translator.
They also help to manage the starting of translation and co-ordinating the
translation with the frame transactions.
This patch also includes other changes to wire up recording and playback.

Depends on D27092

This also improves the recording and translation speeds.

Depends on D27095

Attachment #9027911 - Attachment is obsolete: true
Attachment #9027913 - Attachment is obsolete: true
Attachment #9027914 - Attachment is obsolete: true
Attachment #9027966 - Attachment is obsolete: true
Attachment #9027981 - Attachment is obsolete: true
Attachment #9027981 - Flags: review?(jmuizelaar)
Attachment #9027992 - Attachment is obsolete: true
Attachment #9027992 - Flags: review?(jmuizelaar)
Attachment #9028001 - Attachment is obsolete: true
Attachment #9028001 - Flags: review?(jmuizelaar)
Attachment #9028022 - Attachment is obsolete: true
Attachment #9028027 - Attachment is obsolete: true
Attachment #9028445 - Attachment is obsolete: true
Attachment #9028445 - Flags: review?(jmuizelaar)
Attachment #9028446 - Attachment is obsolete: true
Attachment #9028446 - Flags: review?(jmuizelaar)
Attachment #9028447 - Attachment is obsolete: true
Attachment #9028447 - Flags: review?(matt.woodrow)
Attachment #9028447 - Flags: review?(jmuizelaar)
Attachment #9028448 - Attachment is obsolete: true
Attachment #9028864 - Attachment is obsolete: true
Attachment #9044211 - Attachment is obsolete: true
Attachment #9044222 - Attachment is obsolete: true

Bob, one thing I was just thinking about that may have missed. How does this change handle image data? Does it get put in the recording every frame? e.g. If I'm drawing a bunch of 10k x 10k images to canvas every frame, how is that handled?

Flags: needinfo?(bobowencode)

(In reply to Jeff Muizelaar [:jrmuizel] from comment #104)

Bob, one thing I was just thinking about that may have missed. How does this change handle image data? Does it get put in the recording every frame? e.g. If I'm drawing a bunch of 10k x 10k images to canvas every frame, how is that handled?

I haven't done anything specific for cases like that and I may be misunderstanding what you mean, but ...
In general assuming we're talking about the same images (SourceSurfaces?) being used over and over again then I think the caching will kind of work as it does now. Except you have a SourceSurfaceRecording cached in the content process that references the real SourceSurface in the GPU process.

If you have any examples of this sort of behaviour then I can test to see how it goes.

Flags: needinfo?(bobowencode)

Hey bob I've been asked to help review "Part 8: Add a CanvasEventRingBuffer and CanvasDrawEventRecorder" -- is the ring buffer design copied or based off any other implementation/literature? It's a pretty subtle thing to review so I want to be sure I have all the context I can.

(In reply to Bob Owen (:bobowen) from comment #105)

If you have any examples of this sort of behaviour then I can test to see how it goes.

What prevents this example from consuming infinite memory?

<html>
        <canvas width=1920 height=1920 id=dest></canvas>
        <script>
                c = document.getElementById("dest").getContext("2d");
                i = 0;
                function loadImage() {
                        var img = new Image;
                        img.onload = function() {
                                c.drawImage(img, 0, 0);
                                img.src = "";
                                loadImage();
                        }
                        img.src = "http://lorempixel.com/1920/1920/?" + i++
                }
                loadImage();
        </script>

Flags: needinfo?(bobowencode)

(In reply to Alexis Beingessner [:Gankro] from comment #106)

Hey bob I've been asked to help review "Part 8: Add a CanvasEventRingBuffer and CanvasDrawEventRecorder" -- is the ring buffer design copied or based off any other implementation/literature? It's a pretty subtle thing to review so I want to be sure I have all the context I can.

Hi Alexis, I just read the basic theory and then implemented from scatch to fit in with the fairly simple requirements for Moz2D recording and our IPC and synchronisation classes. Sorry to be brief on mobile.

Flags: needinfo?(bobowencode)

I wrote a sync version of the script above but it consumes infinite memory on current Firefox so maybe it's ok not to worry too much about.

<html>

        <canvas width=1920 height=1920 id=src></canvas>
        <canvas width=1920 height=1920 id=dest></canvas>

        <script>
                c = document.getElementById("dest").getContext("2d");
                src = document.getElementById("src").getContext("2d");

                for (var i = 0; i < 100; i++ ) {
                     src.fillStyle = `rgb(${255.*Math.random()}, 0, 255, 1)`;
                     src.fillRect(Math.random()*1920, Math.random()*1920, 150, 100);
                }
                for (var i = 0; i<10000; i++) {
                        img = src.getImageData(0, 0, 1920, 1920);
                        c.putImageData(img, 0, 0);
                }
        </script>

(In reply to Jeff Muizelaar [:jrmuizel] from comment #109)

I wrote a sync version of the script above but it consumes infinite memory on current Firefox so maybe it's ok not to worry too much about.

Yeah, this eats impressive amounts of memory in the content process with the remote canvas enabled and disabled.

In the dev build it crashes the GPU process because I currently have a gfxDevCrash if an event fails to play back and I think it fails at some point because of a timeout.
If I remove the gfxDevCrash is seems to behave in a fairly similar manner either way.

I might need to think about removing the gfxDevCrash and handle things differently before we enable on Nightly or possible before we roll out.
However I think having the crash is useful at the moment, but maybe I should have another gfxDevCrash at the actual timeout, to distinguish between that and a failure for a different reason.

Pushed by bobowencode@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/be11539bd8d8
Part 1: Fix unified build issues. r=jrmuizel
https://hg.mozilla.org/integration/mozilla-inbound/rev/274a9d9596a2
Part 2: Remove LoadEvent and replace with DoWithEvent. r=jrmuizel
https://hg.mozilla.org/integration/mozilla-inbound/rev/b2e0e341bb82
Part 3: Remove unused GetObjectRef. r=jrmuizel
https://hg.mozilla.org/integration/mozilla-inbound/rev/76f6050cb57f
Part 4: Record DrawTarget::Flush and DrawTarget::DetachAllSnapshots. r=jrmuizel
https://hg.mozilla.org/integration/mozilla-inbound/rev/c83dbcc4dade
Part 5: Make sure the DrawTarget can create a similar DrawTarget when falling back to empty surface. r=mstange
https://hg.mozilla.org/integration/mozilla-inbound/rev/258c6c199656
Part 6: Add remote canvas pref and refactor TextuteData creation to use it. r=mattwoodrow
https://hg.mozilla.org/integration/mozilla-inbound/rev/a2720ec3086f
Part 7: Take snapshot before return for TextureClients with synchronization. r=mattwoodrow
https://hg.mozilla.org/integration/mozilla-inbound/rev/fcc9f5f6dfe1
Part 8: Add a CanvasEventRingBuffer and CanvasDrawEventRecorder. r=Gankro
https://hg.mozilla.org/integration/mozilla-inbound/rev/0e6cf27e9728
Part 9: Add a D3D11 device to be used on canvas threads in the GPU process. r=jrmuizel
https://hg.mozilla.org/integration/mozilla-inbound/rev/04067aec22bb
Part 10: Add a CanvasTranslator and canvas recorded events. r=jrmuizel
https://hg.mozilla.org/integration/mozilla-inbound/rev/e32d6c3ba887
Part 11: Make SourceSurface from DrawTargetRecording::CreateSourceSurfaceFromData hold its data. r=jrmuizel
https://hg.mozilla.org/integration/mozilla-inbound/rev/4357d695b8d5
Part 12: Add CanvasParent, CanvasChild and RecordedTextureData. r=mattwoodrow, jld
https://hg.mozilla.org/integration/mozilla-inbound/rev/09b2e60abc85
Part 13: Make the recording of surface data more efficient. r=rhunt
https://hg.mozilla.org/integration/mozilla-inbound/rev/2fd1ba3d96cd
Part 14: Refactor Path recording to record Arc properly. r=jrmuizel
https://hg.mozilla.org/integration/mozilla-inbound/rev/2195b79ea888
Part 15: Spread the playback of canvas recordings across multiple threads in the GPU process. r=mattwoodrow
Regressions: 1558009
Regressions: 1558117
Regressions: 1559437
Regressions: 1559676
Attachment #9057541 - Attachment is obsolete: true
Attachment #9057561 - Attachment is obsolete: true
Regressions: 1567054
You need to log in before you can comment on or make changes to this bug.