TextureClient::TryReadLock() shows up a lot in the 8-second BHR hang data

NEW
Unassigned

Status

()

Core
Graphics: Layers
P3
normal
7 months ago
2 months ago

People

(Reporter: Away for a while, Unassigned)

Tracking

(Blocks: 1 bug, {perf})

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [qf:meta][bhr][gfx-noted])

(Reporter)

Description

7 months ago
See <https://people-mozilla.org/~mlayzell/bhr/20170429/all.html>, and search for "TextureClient::TryReadLock()".  These are stack traces that are captured from periods of times when the main thread has been unresponsive (i.e., not processing events) for 8 seconds.  This doesn't necessarily mean that the stack trace was an 8 second hang itself, but that it was the tail of an 8 second execution where we didn't process any events from our event queue.

Obviously the paints happening here can be expensive, but then again 

The code here <https://searchfox.org/mozilla-central/rev/b0e1da2a90ada7e00f265838a3fafd00af33e547/gfx/layers/client/TextureClient.cpp#465> allows for a maximum of 500ms jank on the main thread each time it is called in the worst cast, at least in theory.  That's really bad.  :-(

We know this has regressed LinkedIn (bug 1349696), and this has regressed Talos (bug 1339454).  This is now on beta ongoing to be shipped in 54.

Can we do something about this please Matt?  Do we have any options, such as reducing this timeout?  Or backing out bug 1325227?
Flags: needinfo?(matt.woodrow)
(Reporter)

Updated

7 months ago
See Also: → bug 1339454
(Reporter)

Updated

7 months ago
Depends on: 1339454
(Reporter)

Updated

7 months ago
No longer depends on: 1339454
(Reporter)

Comment 1

7 months ago
(Although the Talos regressions may be unrelated... but perhaps it's interesting to experiment with different values for the lock timeout on Talos and see if numbers change in a meaningful way?)
(In reply to :Ehsan Akhgari (super long backlog, slow to respond) from comment #0)
> 
> We know this has regressed LinkedIn (bug 1349696), and this has regressed
> Talos (bug 1339454).  This is now on beta ongoing to be shipped in 54.

Bug 1349696 wasn't obviously a regression to me, and the newer profiles didn't show the read lock at all, just general GPU slowness.

There were a couple of talos regressions, but also a much larger number of improvements (both in number and magnitude).

> 
> Can we do something about this please Matt?  Do we have any options, such as
> reducing this timeout?  Or backing out bug 1325227?

As above, it's not obvious to me that anything is really worse than what we had previously (where we had an unbounded sync call blocking on the compositor), so backing out doesn't seem like a great choice at this point.

We probably don't want to be waiting 500ms per-lock, having that be the total accumulated wait for the paint would be much better.

500ms may also be unnecessarily large, at least for the potential case where the compositor has forgotten to actually release the lock we're waiting on.

I'm pretty busy with retained-dl stuff at the moment unfortunately, but I can try take a look at this soon.
Flags: needinfo?(matt.woodrow)

Updated

7 months ago
Whiteboard: [qf:p1][bhr] → [qf:p1][bhr][gfx-noted]
(Reporter)

Comment 3

7 months ago
(In reply to Matt Woodrow (:mattwoodrow) from comment #2)
> (In reply to :Ehsan Akhgari (super long backlog, slow to respond) from
> comment #0)
> > 
> > We know this has regressed LinkedIn (bug 1349696), and this has regressed
> > Talos (bug 1339454).  This is now on beta ongoing to be shipped in 54.
> 
> Bug 1349696 wasn't obviously a regression to me, and the newer profiles
> didn't show the read lock at all, just general GPU slowness.

Yeah sorry I tested that myself in the end and in turned out to not be an issue.

> There were a couple of talos regressions, but also a much larger number of
> improvements (both in number and magnitude).

Fair enough.

> > 
> > Can we do something about this please Matt?  Do we have any options, such as
> > reducing this timeout?  Or backing out bug 1325227?
> 
> As above, it's not obvious to me that anything is really worse than what we
> had previously (where we had an unbounded sync call blocking on the
> compositor), so backing out doesn't seem like a great choice at this point.

Ugh, I was afraid it wouldn't be that easy...

> We probably don't want to be waiting 500ms per-lock, having that be the
> total accumulated wait for the paint would be much better.
> 
> 500ms may also be unnecessarily large, at least for the potential case where
> the compositor has forgotten to actually release the lock we're waiting on.
> 
> I'm pretty busy with retained-dl stuff at the moment unfortunately, but I
> can try take a look at this soon.

Thanks, I appreciate it.  I will keep an eye on the BHR data for this as we keep getting better data and will let you know if we find out any more information that looks related.
Waiting to see if we get more data.
Whiteboard: [qf:p1][bhr][gfx-noted] → [qf:meta][bhr][gfx-noted]
Priority: -- → P3
Keywords: perf
You need to log in before you can comment on or make changes to this bug.