Closed Bug 1361918 Opened 7 years ago Closed 6 months ago

[meta] TextureClient::TryReadLock() shows up a lot in the 8-second BHR hang data

Categories

(Core :: Graphics: Layers, defect, P3)

defect

Tracking

()

RESOLVED INACTIVE

People

(Reporter: ehsan.akhgari, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: meta, perf, Whiteboard: [bhr:TryReadLock][gfx-noted])

See <https://people-mozilla.org/~mlayzell/bhr/20170429/all.html>, and search for "TextureClient::TryReadLock()".  These are stack traces that are captured from periods of times when the main thread has been unresponsive (i.e., not processing events) for 8 seconds.  This doesn't necessarily mean that the stack trace was an 8 second hang itself, but that it was the tail of an 8 second execution where we didn't process any events from our event queue.

Obviously the paints happening here can be expensive, but then again 

The code here <https://searchfox.org/mozilla-central/rev/b0e1da2a90ada7e00f265838a3fafd00af33e547/gfx/layers/client/TextureClient.cpp#465> allows for a maximum of 500ms jank on the main thread each time it is called in the worst cast, at least in theory.  That's really bad.  :-(

We know this has regressed LinkedIn (bug 1349696), and this has regressed Talos (bug 1339454).  This is now on beta ongoing to be shipped in 54.

Can we do something about this please Matt?  Do we have any options, such as reducing this timeout?  Or backing out bug 1325227?
Flags: needinfo?(matt.woodrow)
See Also: → 1339454
Depends on: 1339454
No longer depends on: 1339454
(Although the Talos regressions may be unrelated... but perhaps it's interesting to experiment with different values for the lock timeout on Talos and see if numbers change in a meaningful way?)
(In reply to :Ehsan Akhgari (super long backlog, slow to respond) from comment #0)
> 
> We know this has regressed LinkedIn (bug 1349696), and this has regressed
> Talos (bug 1339454).  This is now on beta ongoing to be shipped in 54.

Bug 1349696 wasn't obviously a regression to me, and the newer profiles didn't show the read lock at all, just general GPU slowness.

There were a couple of talos regressions, but also a much larger number of improvements (both in number and magnitude).

> 
> Can we do something about this please Matt?  Do we have any options, such as
> reducing this timeout?  Or backing out bug 1325227?

As above, it's not obvious to me that anything is really worse than what we had previously (where we had an unbounded sync call blocking on the compositor), so backing out doesn't seem like a great choice at this point.

We probably don't want to be waiting 500ms per-lock, having that be the total accumulated wait for the paint would be much better.

500ms may also be unnecessarily large, at least for the potential case where the compositor has forgotten to actually release the lock we're waiting on.

I'm pretty busy with retained-dl stuff at the moment unfortunately, but I can try take a look at this soon.
Flags: needinfo?(matt.woodrow)
Whiteboard: [qf:p1][bhr] → [qf:p1][bhr][gfx-noted]
(In reply to Matt Woodrow (:mattwoodrow) from comment #2)
> (In reply to :Ehsan Akhgari (super long backlog, slow to respond) from
> comment #0)
> > 
> > We know this has regressed LinkedIn (bug 1349696), and this has regressed
> > Talos (bug 1339454).  This is now on beta ongoing to be shipped in 54.
> 
> Bug 1349696 wasn't obviously a regression to me, and the newer profiles
> didn't show the read lock at all, just general GPU slowness.

Yeah sorry I tested that myself in the end and in turned out to not be an issue.

> There were a couple of talos regressions, but also a much larger number of
> improvements (both in number and magnitude).

Fair enough.

> > 
> > Can we do something about this please Matt?  Do we have any options, such as
> > reducing this timeout?  Or backing out bug 1325227?
> 
> As above, it's not obvious to me that anything is really worse than what we
> had previously (where we had an unbounded sync call blocking on the
> compositor), so backing out doesn't seem like a great choice at this point.

Ugh, I was afraid it wouldn't be that easy...

> We probably don't want to be waiting 500ms per-lock, having that be the
> total accumulated wait for the paint would be much better.
> 
> 500ms may also be unnecessarily large, at least for the potential case where
> the compositor has forgotten to actually release the lock we're waiting on.
> 
> I'm pretty busy with retained-dl stuff at the moment unfortunately, but I
> can try take a look at this soon.

Thanks, I appreciate it.  I will keep an eye on the BHR data for this as we keep getting better data and will let you know if we find out any more information that looks related.
Waiting to see if we get more data.
Whiteboard: [qf:p1][bhr][gfx-noted] → [qf:meta][bhr][gfx-noted]
Keywords: perf
Whiteboard: [qf:meta][bhr][gfx-noted] → [qf:meta][bhr:TryReadLock][gfx-noted]
Performance Impact: --- → ?
Keywords: meta
Whiteboard: [qf:meta][bhr:TryReadLock][gfx-noted] → [bhr:TryReadLock][gfx-noted]
Summary: TextureClient::TryReadLock() shows up a lot in the 8-second BHR hang data → [meta] TextureClient::TryReadLock() shows up a lot in the 8-second BHR hang data
Performance Impact: ? → ---
Severity: normal → S3

The meta keyword is there, the bug doesn't depend on other bugs and there is no activity for 12 months.
:bhood, maybe it's time to close this bug?

Flags: needinfo?(bhood)

Closing. OP is no longer active, and neither is this meta.

Status: NEW → RESOLVED
Closed: 6 months ago
Flags: needinfo?(bhood)
Resolution: --- → INACTIVE
You need to log in before you can comment on or make changes to this bug.