Allow mapping the APZ updater thread to the WR scene builder thread

RESOLVED FIXED in Firefox 61

Status

()

enhancement
P1
normal
RESOLVED FIXED
Last year
Last year

People

(Reporter: kats, Assigned: kats)

Tracking

Other Branch
mozilla61
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox61 fixed)

Details

(Whiteboard: [gfx-noted])

Attachments

(11 attachments)

59 bytes, text/x-review-board-request
botond
: review+
Details
59 bytes, text/x-review-board-request
nical
: review+
Details
59 bytes, text/x-review-board-request
botond
: review+
Details
59 bytes, text/x-review-board-request
nical
: review+
Details
59 bytes, text/x-review-board-request
nical
: review+
Details
59 bytes, text/x-review-board-request
botond
: review+
Details
59 bytes, text/x-review-board-request
botond
: review+
Details
59 bytes, text/x-review-board-request
nical
: review+
Details
59 bytes, text/x-review-board-request
botond
: review+
Details
59 bytes, text/x-review-board-request
botond
: review+
nical
: review+
Details
59 bytes, text/x-review-board-request
botond
: review+
Details
As discussed somewhat in bug 1391318, we need to apply APZ updates for a particular content transaction at the same time that WR swaps in the corresponding built scene coming from the scene builder thread. And in particular, we need to acquire the APZ tree lock while we do both the scene swap and the APZ update, and then release the lock when done. (This is so that any hit-testing that is happening concurrently always get consistent results).

And also, we can't acquire the APZ tree lock on the render backend thread because that would result in a deadlock situation. So we have to do it from the scene builder thread. This means that when WR is enabled with async scene building, the scene builder thread has to serve as the APZ updater thread (which is an abstract concept introduced in bug 1449620).

This bug will do all the plumbing to hook this up.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=9952ed3949849d7ba3204bbddf4028ef5e44c585

Need to debug this, it's not working as expected when the async scene building is turned on.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=631461e4752531e4b72b1a95c0ab50d1713a6e98

Now the code is mostly behaving as I want it to. When I turn it on there's some flickering which seems to be due to the async scroll offsets not getting applied on some frames. I don't think it's a blocker to getting these patches landed since it's off by default anyway and again my patch queue is getting too large to manage.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #2)
> When I turn it on there's
> some flickering which seems to be due to the async scroll offsets not
> getting applied on some frames.

I'm fairly sure this is because the async scroll offsets are currently being sent to WR as part of the "GenerateFrame" transaction at [1] and therefore races with the scene building, when async scene building is enabled. In other words, WRBP sends a display list transaction to WR, which gets delegated to the scene builder thread. WRBP then sends a GenerateFrame transaction which can end up getting processed either before or after the scene swap. If the GenerateFrame transaction is processed after the scene swap, then the async scroll offsets in the GenerateFrame transaction are invalidated by the new scene and that results in the wrong scroll offsets. I expect this to be fixed by the changes for bug 1451469, because we'll stop computing the async scroll offsets in WRBP when we send the GenerateFrame transaction, but instead pull them from WR at render time.

[1] https://searchfox.org/mozilla-central/rev/f860c2bf00576959b07e2d9619c7b5088b0005be/gfx/layers/wr/WebRenderBridgeParent.cpp#1240
Try push from comment 15 is showing debug assertion failures on Windows, which don't make any sense to me. The codepath in question shouldn't be getting run unless the pref is turned on, which it isn't. I'll investigate but it shouldn't block reviewing the patches.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #8)
> Created attachment 8965369 [details]
> Bug 1449982 - Introduce pref to control async scene building.
> 
> Until all the pieces are in place, turning on this pref will probably
> break horribly. But we need the pref so we can add the rest of the
> pieces with enabling them by default.

I assume you mean "without enabling them by default".
(In reply to Botond Ballo [:botond] from comment #18)
> I assume you mean "without enabling them by default".

Whoops, yes.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #17)
> Try push from comment 15 is showing debug assertion failures on Windows,
> which don't make any sense to me. The codepath in question shouldn't be
> getting run unless the pref is turned on, which it isn't. I'll investigate
> but it shouldn't block reviewing the patches.

https://treeherder.mozilla.org/logviewer.html#?job_id=172152618&repo=try&lineNumber=1776 shows that we're definitely setting the flag somehow but the stack is not symbolicated. I kicked off another try push with more panics/assertions to track down this is happening. From a code inspection it seems impossible, unless maybe gfxPrefs is busted in the GPU process or something.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #20)
> From a code inspection it seems impossible, unless maybe gfxPrefs
> is busted in the GPU process or something.

Fairly sure now this a miscompilation problem. If I store the gfxPrefs::WebRenderAsyncSceneBuild() into a local bool and pass that, it works fine. Passing the pref without putting it in a local var ends up passing true instead of false. Doing one more try push to confirm.
Yeah, confirmed. The good scenario [1] has a green try push [2] and the bad scenario [3] has panics [4]. In both cases gfxPrefs::WebRenderAsyncSceneBuild() is false on the C++ side, but in the bad scenario that false gets turned into true crossing the FFI boundary, and is true in rust code.

I'll update the patch to avoid the miscompilation, and see if I can reproduce that locally to debug the generated code.

[1] https://hg.mozilla.org/try/rev/14e0f49dc1e89e37e80886f681aa7de16d955b3d#l4.13
[2] https://treeherder.mozilla.org/#/jobs?repo=try&revision=14ce3103c4362787ffbea1bfd79d1a1b98bdd07a
[3] https://hg.mozilla.org/try/rev/4e1f0bc7249954d0f455a2d9e9ab969830c91e3f#l4.12
[4] https://treeherder.mozilla.org/#/jobs?repo=try&revision=ff5297363729451e19d71b7d03dcff80fdbf0968
Comment on attachment 8965366 [details]
Bug 1449982 - Don't hold the sIndirectLayerTreesLock unnecessarily while notifying APZ of a layer tree removal.

https://reviewboard.mozilla.org/r/234090/#review240262
Attachment #8965366 - Flags: review?(botond) → review+
Comment on attachment 8965368 [details]
Bug 1449982 - Maintain a map from WrWindowId to APZUpdater.

https://reviewboard.mozilla.org/r/234094/#review240264

::: gfx/layers/apz/public/APZUpdater.h:114
(Diff revision 1)
> +
> +  // Used to manage the mapping from a WR window id to APZUpdater. These are only
> +  // used if WebRender is enabled. Both sWindowIdMap and mWindowId should only
> +  // be used while holding the sWindowIdLock.
> +  static StaticMutex sWindowIdLock;
> +  static std::unordered_map<uint64_t, APZUpdater*> sWindowIdMap;

It's possible to store the WindowId itself as the key type if you specialize std::hash for it (or provide your own hasher as a third template argument to unordered_map).

I'll leave it up to you if you want to do that.
Attachment #8965368 - Flags: review?(botond) → review+
Comment on attachment 8965371 [details]
Bug 1449982 - Implement the WR updater thread registration.

https://reviewboard.mozilla.org/r/234100/#review240268

::: gfx/layers/apz/public/APZUpdater.h:48
(Diff revision 1)
>    explicit APZUpdater(const RefPtr<APZCTreeManager>& aApz);
>  
>    bool HasTreeManager(const RefPtr<APZCTreeManager>& aApz);
>    void SetWebRenderWindowId(const wr::WindowId& aWindowId);
>  
> +  static void SetUpdaterThread(const wr::WrWindowId& aWindowId);

I spent a good minute trying to figure out what this method does based just on its declaration, and wasn't able to.

Please either rename it to something more descriptive (like MarkCurrentThreadAsUpdaterForWindow()), or add a comment.

::: gfx/layers/apz/public/APZUpdater.h:129
(Diff revision 1)
> +  // thread. Additionally, if the async scene building pref is enabled, this
> +  // becomes the thread which will serve as the updater thread.
> +  // This is written to once during init and never cleared, and so reading it
> +  // from multiple threads during normal operation (after initialization)
> +  // without locking should be fine.
> +  Maybe<PlatformThreadId> mUpdaterThreadId;

It looks like this field is only read if async scene building is enabled.

Assuming that's the intention, can we mention that in the comment (and maybe use that fact to simplify the comment, i.e. if that's the case we don't care what this stores unless async scene building is enabled).
Attachment #8965371 - Flags: review?(botond) → review+
Comment on attachment 8965372 [details]
Bug 1449982 - Add the task queue for running things on the updater thread.

https://reviewboard.mozilla.org/r/234102/#review240276

::: commit-message-d1eba:5
(Diff revision 1)
> +Bug 1449982 - Add the task queue for running things on the updater thread. r?botond
> +
> +If we're using the WR scene builder thread as the updater thread, we
> +cannot just post a task to its message loop, because it's a rust thread
> +doesn't have a message loop. Instead, we put these tasks into a queue

There's a missing word here (e.g. "it's a rust thread _that_ doesn't have a message loop")

::: gfx/layers/apz/src/APZUpdater.cpp:386
(Diff revision 1)
>  
>  void
>  apz_deregister_updater(mozilla::wr::WrWindowId aWindowId)
>  {
> +  // Run anything that's still left. Note that this function gets called even
> +  // if async scene building is off, but in that case we don't want to do

So this function gets called even if async scene building is off, but apz_run_updater() doesn't?
Attachment #8965372 - Flags: review?(botond) → review+
Comment on attachment 8965374 [details]
Bug 1449982 - Move the WebRenderScrollData storage from WebRenderBridgeParent to APZUpdater.

https://reviewboard.mozilla.org/r/234106/#review240286

::: gfx/layers/apz/public/APZUpdater.h:62
(Diff revision 1)
>    void UpdateHitTestingTree(LayersId aRootLayerTreeId,
>                              Layer* aRoot,
>                              bool aIsFirstPaint,
>                              LayersId aOriginatingLayersId,
>                              uint32_t aPaintSequenceNumber);
>    void UpdateHitTestingTree(LayersId aRootLayerTreeId,

This subtle signature change belies a very significant behavioural change.

This UHTT overload is not like the other any more; please call it something like StoreScrollDataAndUpdateHitTestingTree().

Also, the |aScrollData| parameter has gone from being the root layer tree's scroll data, to being the originating layer tree's scroll data. Please document that, either via a comment or the parameter's name.

::: gfx/layers/apz/src/APZUpdater.cpp:145
(Diff revision 1)
> -                                       bool,
> -                                       LayersId,
> -                                       uint32_t>(
> -      "APZUpdater::UpdateHitTestingTree",
> +    "APZUpdater::UpdateHitTestingTree",
> -      mApz,
> -      func,
> +    [=]() {
> +      self->mScrollData[aOriginatingLayersId] = aScrollData;

WebRenderScrollData is a heavy class to be copying around. Can we pass it by UniquePtr, or at the very least Move() it here?

::: gfx/layers/apz/src/APZUpdater.cpp:151
(Diff revision 1)
> -      aRootLayerTreeId,
> -      aScrollData,
> -      aIsFirstPaint,
> -      aOriginatingLayersId,
> -      aPaintSequenceNumber));
> +      auto root = self->mScrollData.find(aRootLayerTreeId);
> +      if (root == self->mScrollData.end()) {
> +        return;
> +      }
> +      self->mApz->UpdateHitTestingTree(aRootLayerTreeId,
> +          WebRenderScrollDataWrapper(*self.get(), &(root->second)),

RefPtr overloads operator*, so that |*self| is sufficient.
Attachment #8965374 - Flags: review?(botond) → review+
Comment on attachment 8965374 [details]
Bug 1449982 - Move the WebRenderScrollData storage from WebRenderBridgeParent to APZUpdater.

https://reviewboard.mozilla.org/r/234106/#review240308

::: gfx/layers/apz/public/APZUpdater.h:125
(Diff revision 1)
> +  // Map from layers id to WebRenderScrollData. This can only be touched on
> +  // the updater thread.
> +  std::unordered_map<LayersId,
> +                     WebRenderScrollData,
> +                     LayersId::HashFn,
> +                     LayersId::EqualFn> mScrollData;

Note: you shouldn't actually need LayersId::EqualFn. Since LayersId overloads operator==, the default equal function should just work.
Comment on attachment 8965374 [details]
Bug 1449982 - Move the WebRenderScrollData storage from WebRenderBridgeParent to APZUpdater.

https://reviewboard.mozilla.org/r/234106/#review240324

::: gfx/layers/wr/WebRenderBridgeParent.cpp:523
(Diff revision 1)
>    if (!rootWrbp) {
>      return;
>    }
>    if (RefPtr<APZUpdater> apz = cbp->GetAPZUpdater()) {
> +    if (aData) {
> -    apz->UpdateFocusState(rootLayersId, GetLayersId(),
> +      apz->UpdateFocusState(rootLayersId, GetLayersId(),

Since we always update the focus state, I think it would make this function's interface cleaner if the signature was:

  UpdateAPZ(const FocusTarget&,
            const Maybe<WebRenderScrollData>&)
            
and the caller passed in data.GetFocusTarget() in the case where the scroll data is also being updated.
Comment on attachment 8965375 [details]
Bug 1449982 - Hook up epoch-based task processing on the updater thread.

https://reviewboard.mozilla.org/r/234108/#review240326

This is just a first pass; I'm posting my comments so far.

I need to think about some of the things in this patch a bit more.

I'd also feel a bit more comfortable if :nical had a look at the change of WebRenderBridgeParent.cpp, and the WR API usage in APZUpdater, as well.

::: gfx/layers/apz/src/APZCTreeManager.cpp:3249
(Diff revision 1)
>  
>  void
> +APZCTreeManager::LockTree()
> +{
> +  AssertOnUpdaterThread();
> +  mTreeLock.Lock();

This really scares me.

How sure are we that there is no way we'll get stuck with the lock held as something in between pre_scene_swap and post_scene_swap running into an error, or a message never arriving, or something?

Do we need a mechanism to ensure the lock is always released, even if something goes wrong (assuming that something isn't going to take down the entire process, in which case it doesn't matter)?

::: gfx/layers/apz/src/APZUpdater.cpp:95
(Diff revision 1)
> +    return;
> +  }
> +
> +  wr::WrPipelineId pipeline;
> +  wr::WrEpoch epoch;
> +  while (wr_pipeline_info_next_removed_pipeline(aInfo, &pipeline)) {

I don't know much about how Rust <--> C++ interop works, but is there an advantage to having an API like this, that crosses the FFI boundary once for every element in a collection (as opposed to querying the entire collection in one call)?

I also don't love this API in that it requires a variable in the enclosing scope to use as the out-param, which makes it seem like the variable is an output of the loop, when it's not.

Scoping the variable might make this clearer:

{
  wr::WrPipelineId pipeline;
  while (wr_pipeline_info_next_removed_pipeline(...)) {
    ...
  }
}

...

{
  wr::WrPipelineId pipeline;
  wr::WrEpoch epoch;
  while (wr_pipeline_info_next_epoch(...)) {
    ...
  }
}

(though an API that just returns a collection would be the cleanest).

::: gfx/layers/apz/src/APZUpdater.cpp:101
(Diff revision 1)
> +    LayersId layersId = wr::AsLayersId(pipeline);
> +    updater->mEpochData.erase(layersId);
> +  }
> +  // Reset the built info for all pipelines, then put it back for the ones
> +  // that got built in this scene swap.
> +  for (auto &i : updater->mEpochData) {

nit: '&' goes next to the type

::: gfx/layers/apz/src/APZUpdater.cpp:104
(Diff revision 1)
> +  // Reset the built info for all pipelines, then put it back for the ones
> +  // that got built in this scene swap.
> +  for (auto &i : updater->mEpochData) {
> +    i.second.mBuilt = Nothing();
> +  }
> +  while (wr_pipeline_info_next_epoch(aInfo, &pipeline, &epoch)) {

It would also be nice to document either these functions, or the WrPipelineInfo struct in bindings.rs. I'd like to see things like:

  - this structure represents information
    about layer trees after a scene build
    [I know "layer tree" is bad term,
    feel free to use a better one]

  - deleted_pipelines contains one entry for
    every layer tree that went away / was
    destroyed since the last scene build
    
  - epochs contains one entry for every
    layer tree that is visible in this scene,
    together with the epoch representing
    the last scene build when it was built
    
  - (there may be layer trees which are
    not destroyed but also not visible, which
    would show up in neither deleted_pipelines
    nor epochs)
    
(I may have gotten some of that wrong, but if so that just underscores the need for documentation.)

::: gfx/layers/apz/src/APZUpdater.cpp:431
(Diff revision 1)
> +bool
> +APZUpdater::IsQueueBlocked() const
> +{
> +  AssertOnUpdaterThread();
> +
> +  for (auto i : mEpochData) {

const auto&

::: gfx/layers/apz/src/APZUpdater.cpp:442
(Diff revision 1)
> +}
> +
> +void
> +APZUpdater::ProcessQueue()
> +{
> +  // Note that running a task might update mEpochData, so we can't

Perhaps add an early exit for if the queue is empty at the beginning of the function?

::: gfx/layers/wr/WebRenderBridgeParent.cpp:509
(Diff revision 1)
>    }
>    return lts->mParent;
>  }
>  
>  void
> -WebRenderBridgeParent::UpdateAPZ(const Maybe<WebRenderScrollData>& aData,
> +WebRenderBridgeParent::UpdateAPZ(const Maybe<wr::Epoch>& aEpoch,

The more Maybe's we add to this function's signature, the less I'm liking it :)

I'm now thinking I'd prefer two methods:

  UpdateAPZFocusState(const FocusTarget&)
  UpdateAPZHitTestTree(const WRScrollData&, const Epoch&)
  
even if it means e.g. calling GetRootCompositorBridgeParent() twice in the case of a nonempty transaction (though I'm willing to be dissuaded on that front).
(In reply to Botond Ballo [:botond] from comment #30)
>   UpdateAPZFocusState(const FocusTarget&)
>   UpdateAPZHitTestTree(const WRScrollData&, const Epoch&)
>   
> even if it means e.g. calling GetRootCompositorBridgeParent() twice in the
> case of a nonempty transaction (though I'm willing to be dissuaded on that
> front).

Alternatively, we could have APZUpdater::UpdateHitTestingTree() _also_ call UpdateFocusState(), and then we can have the above two signatures without any redundant code execution (as an added bonus, the two updates could be combined into a single Runnable as well).
(In reply to Botond Ballo [:botond] from comment #30)
> The more Maybe's we add to this function's signature, the less I'm liking it
> :)

(In case it's not clear why, it's because having three Maybe parameters makes it seem like there are 8 potential combinations of Nothing/Some values that could be passed, when in fact only 2 of the combinations are valid.)
(In reply to Botond Ballo [:botond] from comment #30)
> I need to think about some of the things in this patch a bit more.

Ok, I've thought about it a bit, and my main concern is that this approach can introduce unnecessary latency in certain cases.

Consider a scenario where two layer trees / pipelines are visible, A and B. Suppose B is slow to paint / render, while A is fast.

Now suppose the following happens in the following order (not talking about threads / events yet, just wallclock time):

  1. A completes a display list build.
  2. B completes a display list build.
  3. A responds to a SetTargetAPZC notification.

(SetTargetAPZC is just an example; it can be any message that goes into the queue.)

Now, without WebRender, (1) and (2) only send transactions to the compositor once the correspondings _paints_ complete. Since we're assuming that A is fast to paint while B is slow to paint, let's assume the resulting messages arrive at the compositor in the following order:

  1. A: transaction
  2. A: SetTargetAPZC notification
  3. B: transaction

They are then processed in that order; in particular, APZ can process the SetTargetAPZC notification without waiting for B's slow paint to complete.

Now, let's assume for sake of the example that with WebRender, the slowness of B's paint manifests as scene-building being slow. Since, with WR, we send transactions as soon as the display list build is complete, messages will now (likely) arrive at the compositor in the following order:

  1. A: transaction
  2. B: transaction
  3. A: SetTargetAPZC notification

The messages go into the queue, with the "B: transaction" message enqueuing an epoch requirement that isn't satisfied until the (slow) scene-build for B completes.

This seems unnecessary. If, pre-WR, A's SetTargetAPZC didn't need to wait for B's paint to complete, why does it now need to wait for B's scene-build to complete?

I think this can be addressed by the following modification to the approach:

  - For each message in the queue, track which layers id the message came from.

  - Process the message as soon as _its_ layers id has an up-to-date scene
    (rather than _all_ layers ids having an up-to-date scene).

(This may be equivalent to having a separate queue per layers id; if so, and it's cleaner to implement that way, I'd be fine with that.)
Comment on attachment 8965375 [details]
Bug 1449982 - Hook up epoch-based task processing on the updater thread.

Clearing the review flag until the concerns about the tree locking and the added latency are addressed.
Attachment #8965375 - Flags: review?(botond)
(In reply to Botond Ballo [:botond] from comment #33)
>   3. A responds to a SetTargetAPZC notification.

(I mean, of course, "A sends a SetTargetAPZC notification in response to an event").
Comment on attachment 8965367 [details]
Bug 1449982 - Move the window id allocation to CompositorBridgeParent.

https://reviewboard.mozilla.org/r/234092/#review240512
Attachment #8965367 - Flags: review?(nical.bugzilla) → review+
Comment on attachment 8965369 [details]
Bug 1449982 - Introduce pref to control async scene building.

https://reviewboard.mozilla.org/r/234096/#review240514
Attachment #8965369 - Flags: review?(nical.bugzilla) → review+
Comment on attachment 8965370 [details]
Bug 1449982 - Add the plumbing for scene builder thread interaction.

https://reviewboard.mozilla.org/r/234098/#review240516
Attachment #8965370 - Flags: review?(nical.bugzilla) → review+
Comment on attachment 8965373 [details]
Bug 1449982 - Clean up WrEpoch usage.

https://reviewboard.mozilla.org/r/234104/#review240518

Nice.
Attachment #8965373 - Flags: review?(nical.bugzilla) → review+
(In reply to Botond Ballo [:botond] from comment #18)
> I assume you mean "without enabling them by default".

Fixed

(In reply to Botond Ballo [:botond] from comment #24)
> It's possible to store the WindowId itself as the key type if you specialize
> std::hash for it (or provide your own hasher as a third template argument to
> unordered_map).
> 
> I'll leave it up to you if you want to do that.

I'll leave this as-is for now.

(In reply to Botond Ballo [:botond] from comment #25)
> > +  static void SetUpdaterThread(const wr::WrWindowId& aWindowId);
> 
> I spent a good minute trying to figure out what this method does based just
> on its declaration, and wasn't able to.
> 
> Please either rename it to something more descriptive (like
> MarkCurrentThreadAsUpdaterForWindow()), or add a comment.

Added a comment to explain this function.

> ::: gfx/layers/apz/public/APZUpdater.h:129
> > +  Maybe<PlatformThreadId> mUpdaterThreadId;
> 
> It looks like this field is only read if async scene building is enabled.
> 
> Assuming that's the intention, can we mention that in the comment (and maybe
> use that fact to simplify the comment, i.e. if that's the case we don't care
> what this stores unless async scene building is enabled).

Yup, updated the comment accordingly.

(In reply to Botond Ballo [:botond] from comment #26)
> > +cannot just post a task to its message loop, because it's a rust thread
> > +doesn't have a message loop. Instead, we put these tasks into a queue
> 
> There's a missing word here (e.g. "it's a rust thread _that_ doesn't have a
> message loop")

Good catch, fixed.

> ::: gfx/layers/apz/src/APZUpdater.cpp:386
> > +  // Run anything that's still left. Note that this function gets called even
> > +  // if async scene building is off, but in that case we don't want to do
> 
> So this function gets called even if async scene building is off, but
> apz_run_updater() doesn't?

Yeah. I added assertions and comments in apz_run_updater (as well as the pre/post swap functions) to that effect.

(In reply to Botond Ballo [:botond] from comment #27)
> >    void UpdateHitTestingTree(LayersId aRootLayerTreeId,
> 
> This subtle signature change belies a very significant behavioural change.
> 
> This UHTT overload is not like the other any more; please call it something
> like StoreScrollDataAndUpdateHitTestingTree().
> 
> Also, the |aScrollData| parameter has gone from being the root layer tree's
> scroll data, to being the originating layer tree's scroll data. Please
> document that, either via a comment or the parameter's name.

I ended up renaming this to UpdateScrollDataAndTreeState, and added some documentation in the .h file. This function now updates the scroll data, focus state, and hit-testing tree all inside one runnable.

> ::: gfx/layers/apz/src/APZUpdater.cpp:145
> > +      self->mScrollData[aOriginatingLayersId] = aScrollData;
> 
> WebRenderScrollData is a heavy class to be copying around. Can we pass it by
> UniquePtr, or at the very least Move() it here?

Done (via Move)

> ::: gfx/layers/apz/src/APZUpdater.cpp:151
> > +      self->mApz->UpdateHitTestingTree(aRootLayerTreeId,
> > +          WebRenderScrollDataWrapper(*self.get(), &(root->second)),
> 
> RefPtr overloads operator*, so that |*self| is sufficient.

Fixed

(In reply to Botond Ballo [:botond] from comment #28)
> > +                     LayersId::EqualFn> mScrollData;
> 
> Note: you shouldn't actually need LayersId::EqualFn. Since LayersId
> overloads operator==, the default equal function should just work.

Fixed. Also filed bug 1452601 to remove it in other places.

(In reply to Botond Ballo [:botond] from comment #29)
> Since we always update the focus state, I think it would make this
> function's interface cleaner if the signature was:
> 
>   UpdateAPZ(const FocusTarget&,
>             const Maybe<WebRenderScrollData>&)

I ended up implementing your other suggestion and split this into two functions, UpdateAPZFocusState and UpdateAPZScrollData. The focus state one just updates the focus state, and the scroll data one calls APZUpdater::UpdateScrollDataAndTreeState which internal updates the focus state and the hit-testing tree. This way the non-empty transaction version still just does the CompositorBridgeParent lookup once, and we have two functions with non-Maybe arguments so it's much clearer.


(In reply to Botond Ballo [:botond] from comment #30)
> > +  AssertOnUpdaterThread();
> > +  mTreeLock.Lock();
> 
> This really scares me.
> 
> How sure are we that there is no way we'll get stuck with the lock held as
> something in between pre_scene_swap and post_scene_swap running into an
> error, or a message never arriving, or something?
> 
> Do we need a mechanism to ensure the lock is always released, even if
> something goes wrong (assuming that something isn't going to take down the
> entire process, in which case it doesn't matter)?

This is a little scary to me as well, but the code that invokes this is really straightforward - it's basically straight-line code without any intermediate conditions, so for now at least I'm confident there's no cases where we would call pre_scene_swap without also calling post_scene_swap except in termination conditions where the scene build thread died before it got a chance to do post_scene_swap, and in that case it doesn't matter. I couldn't think of any other way to do this (i.e. we need to have separate calls to acquire/release the lock) and I'm open to suggestions on ways to make this less scary.

> ::: gfx/layers/apz/src/APZUpdater.cpp:95
> > +  while (wr_pipeline_info_next_removed_pipeline(aInfo, &pipeline)) {
> 
> I don't know much about how Rust <--> C++ interop works, but is there an
> advantage to having an API like this, that crosses the FFI boundary once for
> every element in a collection (as opposed to querying the entire collection
> in one call)?

I agree this isn't great, but I didn't want to fiddle with that API too much in this bug. I think the main advantage is that we don't make a second copy of the data structure, but instead read directly from the one copy held on the rust side. But really this structure isn't that big, so I don't foresee a problem with copying it. I've filed bug 1452620 for this, and I'd like to defer changes here to that bug.

> ::: gfx/layers/apz/src/APZUpdater.cpp:101
> > +  for (auto &i : updater->mEpochData) {
> 
> nit: '&' goes next to the type

Fixed

> ::: gfx/layers/apz/src/APZUpdater.cpp:104
> > +  while (wr_pipeline_info_next_epoch(aInfo, &pipeline, &epoch)) {
> 
> It would also be nice to document either these functions, or the
> WrPipelineInfo struct in bindings.rs. I'd like to see things like:

Will do this in bug 1452620 as well, since I might change the API entirely.

> ::: gfx/layers/apz/src/APZUpdater.cpp:431
> > +  for (auto i : mEpochData) {
> 
> const auto&

Fixed

> ::: gfx/layers/apz/src/APZUpdater.cpp:442
> Perhaps add an early exit for if the queue is empty at the beginning of the
> function?

Done. I wasn't a huge fan of this because now we have two places where we lock the queue inside this function but I don't care that much.

> ::: gfx/layers/wr/WebRenderBridgeParent.cpp:509
> > -WebRenderBridgeParent::UpdateAPZ(const Maybe<WebRenderScrollData>& aData,
> > +WebRenderBridgeParent::UpdateAPZ(const Maybe<wr::Epoch>& aEpoch,
> 
> The more Maybe's we add to this function's signature, the less I'm liking it
> :)
> 
> I'm now thinking I'd prefer two methods:
> 
>   UpdateAPZFocusState(const FocusTarget&)
>   UpdateAPZHitTestTree(const WRScrollData&, const Epoch&)
>   
> even if it means e.g. calling GetRootCompositorBridgeParent() twice in the
> case of a nonempty transaction (though I'm willing to be dissuaded on that
> front).

Yeah I ended up doing something similar to this.

(In reply to Botond Ballo [:botond] from comment #33)
> Ok, I've thought about it a bit, and my main concern is that this approach
> can introduce unnecessary latency in certain cases.

Yeah, thanks for catching this! I kinda had this in the back of mind at some point but I guess I never gave it too much thought. I'll split the queue so that it's per-layers id.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #40)
> Yeah, thanks for catching this! I kinda had this in the back of mind at some
> point but I guess I never gave it too much thought. I'll split the queue so
> that it's per-layers id.

It turned out to be somewhat complex to actually have multiple std::deque's because I would need to have a map from layers id -> deque (or stuff in the EpochState struct) and clearing out entries from that map then becomes messy. So I just augmented the existing queue to store a (layersid,runnable) and then just processed the runnables with unblocked layers ids.
(In reply to Botond Ballo [:botond] from comment #30)
> I'd also feel a bit more comfortable if :nical had a look at the change of
> WebRenderBridgeParent.cpp, and the WR API usage in APZUpdater, as well.

@nical: I flagged you for review on another patch for this ^
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #40)
> This is a little scary to me as well, but the code that invokes this is
> really straightforward - it's basically straight-line code without any
> intermediate conditions, so for now at least I'm confident there's no cases
> where we would call pre_scene_swap without also calling post_scene_swap
> except in termination conditions where the scene build thread died before it
> got a chance to do post_scene_swap, and in that case it doesn't matter. I
> couldn't think of any other way to do this (i.e. we need to have separate
> calls to acquire/release the lock) and I'm open to suggestions on ways to
> make this less scary.

The potential concerns I can think of:

  - A catchable panic / exception-like thing escaping the intervening
    code and skipping the unlock.

  - Hanging indefinitely / for a long time inside the intervening
    code (e.g. I notice [1] adds a recv() call).

If we can rule these out with reasonable confidence, then I'm happy with it.

[1] https://github.com/servo/webrender/pull/2613/commits/24eec95667aa036025f4bcf822f12cfd7f0891c3
(In reply to Botond Ballo [:botond] from comment #54)
> The potential concerns I can think of:
> 
>   - A catchable panic / exception-like thing escaping the intervening
>     code and skipping the unlock.

All our rust code is built with panic=abort, and rust doesn't have other exceptions as far as I know. So this shouldn't be an issue. In C++ we would probably use an RAII wrapper here, so I wrote a similar thing (untested) for this code, the patch is at https://github.com/staktrace/webrender/commit/f724f2a3849c2ce86eda6e234c61f7e8459c8cd1 but honestly I feel less confident with that code than the original code so unless you feel strongly I'd rather not land it.

>   - Hanging indefinitely / for a long time inside the intervening
>     code (e.g. I notice [1] adds a recv() call).

The only way this can happen is if the render backend thread itself gets wedged, which should again only happen in case of shutdown or deadlock. The recv() call is on a channel that's specific to the message, so there's never anything else on that channel.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #22)
> I'll update the patch to avoid the miscompilation, and see if I can
> reproduce that locally to debug the generated code.

I can't seem to reproduce the problem locally, and just using a local bool var doesn't work either, probably because it gets optimized away (in my try push previously I had a MOZ_ASSERT that was preventing it from getting optimized away, but that's not an assert we want to land). I'll have to try more complicated things.
Comment on attachment 8965375 [details]
Bug 1449982 - Hook up epoch-based task processing on the updater thread.

https://reviewboard.mozilla.org/r/234108/#review240704

::: gfx/layers/apz/public/APZUpdater.h:54
(Diff revisions 1 - 2)
>    void SetWebRenderWindowId(const wr::WindowId& aWindowId);
>  
> +  /**
> +   * This function is invoked from rust on the scene builder thread when it
> +   * is created. It effectively tells the APZUpdater "the current thread is
> +   * the updater thread from this window id" and allows APZUpdater to remember

_for_ this window id?

::: gfx/layers/apz/src/APZUpdater.cpp:211
(Diff revisions 1 - 2)
>        self->mEpochData[aOriginatingLayersId].mRequired = aEpoch;
>      }
>    ));
>    RunOnUpdaterThread(NS_NewRunnableFunction(
>      "APZUpdater::UpdateHitTestingTree",
> -    [=]() {
> +    [=,aScrollData=Move(aScrollData)]() {

The first use of a C++14 init-capture in APZ code. Nice :)

(On the other hand, sigh, I now need to get moving and implement this in Eclipse if I don't want to see a gigantic syntax error here :p)

::: gfx/layers/apz/src/APZUpdater.cpp:446
(Diff revisions 1 - 2)
>  
>  void
>  APZUpdater::ProcessQueue()
>  {
> +  { // scope lock to check for emptiness
> +    MutexAutoLock lock(mQueueLock);

Do we actually need to acquire the lock for the check?

Seems like the worst thing that can happen in case of a race, is that empty() gives the wrong answer. If that answer is "not empty", we proceed to acquire the lock again and get the correct answer.

If the answer is "empty", then the queue must just have been populated by another thread; the current thread could easily have lost the race and correctly early-exited here, so early-exiting doesn't seem like a bad outcome.

::: gfx/layers/wr/WebRenderBridgeParent.cpp:523
(Diff revisions 1 - 2)
> +  }
> +}
> +
> +void
> +WebRenderBridgeParent::UpdateAPZScrollData(const wr::Epoch& aEpoch,
> +                                           const WebRenderScrollData& aData)

While I don't think it makes a difference in what it compiles down to, making this parameter |WebRenderScrollData&&| would be conceptually more correct.
Attachment #8965375 - Flags: review?(botond) → review+
(In reply to Botond Ballo [:botond] from comment #57)
> ::: gfx/layers/apz/src/APZUpdater.cpp:446
> (Diff revisions 1 - 2)
> >  
> >  void
> >  APZUpdater::ProcessQueue()
> >  {
> > +  { // scope lock to check for emptiness
> > +    MutexAutoLock lock(mQueueLock);
> 
> Do we actually need to acquire the lock for the check?
> 
> Seems like the worst thing that can happen in case of a race, is that
> empty() gives the wrong answer. If that answer is "not empty", we proceed to
> acquire the lock again and get the correct answer.
> 
> If the answer is "empty", then the queue must just have been populated by
> another thread; the current thread could easily have lost the race and
> correctly early-exited here, so early-exiting doesn't seem like a bad
> outcome.

Then again, we should probably listen to Hans Boehm [1] who argues that no data race in C++ is truly benign, and keep the lock.

[1] https://www.usenix.org/legacy/event/hotpar11/tech/final_files/Boehm.pdf
(In reply to Botond Ballo [:botond] from comment #57)
> > +   * is created. It effectively tells the APZUpdater "the current thread is
> > +   * the updater thread from this window id" and allows APZUpdater to remember
> 
> _for_ this window id?

Whoops, fixed.

> ::: gfx/layers/wr/WebRenderBridgeParent.cpp:523
> > +WebRenderBridgeParent::UpdateAPZScrollData(const wr::Epoch& aEpoch,
> > +                                           const WebRenderScrollData& aData)
> 
> While I don't think it makes a difference in what it compiles down to,
> making this parameter |WebRenderScrollData&&| would be conceptually more
> correct.

Ok, will update that. Does the same apply to APZUpdater::UpdateScrollDataAndTreeState?

(In reply to Botond Ballo [:botond] from comment #58)
> 
> Then again, we should probably listen to Hans Boehm [1] who argues that no
> data race in C++ is truly benign, and keep the lock.

Yeah I'm ok with keeping the lock. With the changes in the last patch the early-return seems more worth it.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #59)
> > While I don't think it makes a difference in what it compiles down to,
> > making this parameter |WebRenderScrollData&&| would be conceptually more
> > correct.
> 
> Ok, will update that. Does the same apply to
> APZUpdater::UpdateScrollDataAndTreeState?

Yeah, we should make it WebRenderScrollata&& throughout the call chain.
Comment on attachment 8966263 [details]
Bug 1449982 - Conceptually split the mUpdaterQueue into separate queues per layers id.

https://reviewboard.mozilla.org/r/235014/#review240752

::: gfx/layers/apz/src/APZCTreeManager.cpp:1981
(Diff revision 1)
>      // nsBaseWidget (as opposed to over PAPZCTreeManager). We want this function
>      // to run on the updater thread, so bounce it over.
>      MOZ_ASSERT(XRE_IsParentProcess());
>  
>      GetUpdater()->RunOnUpdaterThread(
> +        mRootLayersId,

Why not aGuid.mLayersId?

::: gfx/layers/apz/src/APZUpdater.cpp:452
(Diff revision 1)
> +        // still blocked back in and finish
> +        std::swap(mUpdaterQueue, blockedTasks);
>          break;
>        }
>        task = mUpdaterQueue.front();
>        mUpdaterQueue.pop_front();

I think the correctness of this (specifically, the property that it preserves relative ordering of events within a layers id) depends on the blocked state of a layers id not changing (from blocked to unblocked) in the middle of a ProcessQueue() operation.

(Otherwise, if a new event for a previously blocked layers id is enqueued at the right time, it can "jump the queue" in front of events waiting in blockedTasks.)

Do we know that this is the case (a layers id will not become unblocked in the middle of a ProcessQueue)? (I think the answer is yes, because only CompleteSceneSwap() will unblock a layers id, and that can only get called on the updater thread and therefore not in the middle of ProcessQueue().)

Even if it is the case, it would be good to document the fact that we are relying on this property.
(In reply to Botond Ballo [:botond] from comment #61)
> >      GetUpdater()->RunOnUpdaterThread(
> > +        mRootLayersId,
> 
> Why not aGuid.mLayersId?

They should always be the same here, I can add an assertion to that effect if you prefer. If they do ever differ, mRootLayersId would be the safer option because it comes from a more trustworthy source (CompositorBridgeParent) than aGuid.mLayersId (which comes from widget but maybe in the future from less-trustworthy call sites).

> ::: gfx/layers/apz/src/APZUpdater.cpp:452
> I think the correctness of this (specifically, the property that it
> preserves relative ordering of events within a layers id) depends on the
> blocked state of a layers id not changing (from blocked to unblocked) in the
> middle of a ProcessQueue() operation.

Good point, I hadn't actually considered this but I agree with your analysis. I'll add a comment to this effect.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #62)
> (In reply to Botond Ballo [:botond] from comment #61)
> > >      GetUpdater()->RunOnUpdaterThread(
> > > +        mRootLayersId,
> > 
> > Why not aGuid.mLayersId?
> 
> They should always be the same here, I can add an assertion to that effect
> if you prefer. If they do ever differ, mRootLayersId would be the safer
> option because it comes from a more trustworthy source
> (CompositorBridgeParent) than aGuid.mLayersId (which comes from widget but
> maybe in the future from less-trustworthy call sites).

What about the case where this comes from a child process [1]?

[1] https://searchfox.org/mozilla-central/rev/b55e1a1cbcaee34878e133fbac20c4c2af6e11b5/dom/ipc/TabChild.cpp#544
(In reply to Botond Ballo [:botond] from comment #74)
> (In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #62)
> > (In reply to Botond Ballo [:botond] from comment #61)
> > > >      GetUpdater()->RunOnUpdaterThread(
> > > > +        mRootLayersId,
> > > 
> > > Why not aGuid.mLayersId?
> > 
> > They should always be the same here, I can add an assertion to that effect
> > if you prefer. If they do ever differ, mRootLayersId would be the safer
> > option because it comes from a more trustworthy source
> > (CompositorBridgeParent) than aGuid.mLayersId (which comes from widget but
> > maybe in the future from less-trustworthy call sites).
> 
> What about the case where this comes from a child process [1]?
> 
> [1]
> https://searchfox.org/mozilla-central/rev/
> b55e1a1cbcaee34878e133fbac20c4c2af6e11b5/dom/ipc/TabChild.cpp#544

Ok, I see now that there is a comment in this branch of UpdateZoomConstraints() that says:

 // This can happen if we're in the UI process and got a call directly from
 // nsBaseWidget (as opposed to over PAPZCTreeManager). We want this function
 // to run on the updater thread, so bounce it over.

However, is this comment still accurate with the updater thread refactor in place? As far as I can tell, APZCTreeManagerParent still lives on the compositor thread, which can now be different from the updated thread.
Comment on attachment 8966263 [details]
Bug 1449982 - Conceptually split the mUpdaterQueue into separate queues per layers id.

https://reviewboard.mozilla.org/r/235014/#review240788

Anyways, r+ assuming the UpdateZoomConstraints issue is sorted out.
Attachment #8966263 - Flags: review?(botond) → review+
(In reply to Botond Ballo [:botond] from comment #75)
> Ok, I see now that there is a comment in this branch of
> UpdateZoomConstraints() that says:
> 
>  // This can happen if we're in the UI process and got a call directly from
>  // nsBaseWidget (as opposed to over PAPZCTreeManager). We want this function
>  // to run on the updater thread, so bounce it over.
> 
> However, is this comment still accurate with the updater thread refactor in
> place? As far as I can tell, APZCTreeManagerParent still lives on the
> compositor thread, which can now be different from the updated thread.

Shoot, you're right, I was thinking the comment was still accurate but it's not (in the case where async scene-building is enabled). It didn't show up in my try pushes because I never did one with async scene building enabled (and maybe it's not covered by any tests anyway, who knows).

I'll use aGuid.mLayersId here, and I'll fix up the comment/other assertion in another bug since it's conceptually different from this bug.
The latest patch set has all the above comments addressed. I filed bug 1452845 to fix APZCTreeManager::UpdateZoomConstraints (and anything else that turns out to be buggy).

Try push with the above patchset is at https://treeherder.mozilla.org/#/jobs?repo=try&revision=f6f8a5ca70446eb0185301d63b9b0fb55e2a3c57

Same patches rebased on master are at https://treeherder.mozilla.org/#/jobs?repo=try&revision=4f2b25aece7d4233876538d4320e93d3ad196cfc

Note also that in order to avoid the miscompilation issue I ended up adding an if condition with separate branches for the enabled and disabled case, that's in the WebRenderAPI.cpp change in part 4.
Comment on attachment 8965375 [details]
Bug 1449982 - Hook up epoch-based task processing on the updater thread.

https://reviewboard.mozilla.org/r/234108/#review241050
Attachment #8965375 - Flags: review?(nical.bugzilla) → review+
Pushed by kgupta@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/433668f7db7b
Don't hold the sIndirectLayerTreesLock unnecessarily while notifying APZ of a layer tree removal. r=botond
https://hg.mozilla.org/integration/autoland/rev/7af325560dfa
Move the window id allocation to CompositorBridgeParent. r=nical
https://hg.mozilla.org/integration/autoland/rev/c40aa91bc0b6
Maintain a map from WrWindowId to APZUpdater. r=botond
https://hg.mozilla.org/integration/autoland/rev/7b83f46b5fb3
Introduce pref to control async scene building. r=nical
https://hg.mozilla.org/integration/autoland/rev/f14bce5630f5
Add the plumbing for scene builder thread interaction. r=nical
https://hg.mozilla.org/integration/autoland/rev/1c9e68c30a81
Implement the WR updater thread registration. r=botond
https://hg.mozilla.org/integration/autoland/rev/0c1ab4c3acd4
Add the task queue for running things on the updater thread. r=botond
https://hg.mozilla.org/integration/autoland/rev/3eab97d46f5a
Clean up WrEpoch usage. r=nical
https://hg.mozilla.org/integration/autoland/rev/526faf046875
Move the WebRenderScrollData storage from WebRenderBridgeParent to APZUpdater. r=botond
https://hg.mozilla.org/integration/autoland/rev/1a6398d3f9b3
Hook up epoch-based task processing on the updater thread. r=botond,nical
https://hg.mozilla.org/integration/autoland/rev/3542146e9973
Conceptually split the mUpdaterQueue into separate queues per layers id. r=botond
You need to log in before you can comment on or make changes to this bug.