Open Bug 1602842 Opened 5 years ago Updated 6 months ago

OSXDisplay::EnableVsync hangs in CVDisplayLinkCreateWithActiveCGDisplays

Categories

(Core :: Widget: Cocoa, defect, P2)

defect

Tracking

()

Performance Impact high

People

(Reporter: florian, Unassigned)

References

(Depends on 1 open bug)

Details

(Keywords: perf:responsiveness, Whiteboard: [bhr:CVDisplayLinkCreateWithActiveCGDisplays])

In BHR reports from the 20191207 nightly, we see lots[*] of hangs in:
CVDisplayLinkCreateWithActiveCGDisplays CoreVideo
OSXVsyncSource::OSXDisplay::EnableVsync() XUL
http://queze.net/bhr/test/#row=39&filter=CVDisplayLinkCreateWithActiveCGDisplays

  • We have a total of 23,515 hang reports, for a total of 66,610s.
Flags: needinfo?(florian)

(In reply to Steven Michaud [:smichaud] (Retired) from comment #1)

Did these start with the 20191207 nightly?

No, we already had some on the 20191206 nightly, but at a much lower volume. I don't know the exact build id. What made me notice this today is that it's in the second position on my BHR dashboard (http://queze.net/bhr/test/ This ranks the hang stacks by how much user time was wasted). This is more than 6% of the total reported hang time, which seems a lot given it's OS X only.

By 'hangs' for BHR we mean "a period of more than 128ms during which the event loop was blocked".

Flags: needinfo?(florian)

Thanks for the info. Before I can do anything here, I really do need the build id for the nightly on which these "hangs" increased dramatically.

Markus, do you have any idea which patch might have triggered this problem?

Flags: needinfo?(mstange)

No, I don't know if that's even the case. I just CC'd you in case you might have ideas what could be going on.
I agree, we need a better "regression range" to be able to find out what's going on here.

Flags: needinfo?(mstange)

What is BHR? Is there any documentation on it? Where does its data come from? How far back does it go?

CVDisplayLinkCreateWithActiveCGDisplays() is a macOS system call, for which there are no obvious alternatives. We probably have to keep using it. And if the reported "hangs" aren't noticeable in everyday use, or even if they are and aren't too annoying, we may not need to do anything about them.

The one case where further examination might be worthwhile is if a number of users are very badly effected. Is there any way to find this out from the BHR data?

Flags: needinfo?(florian)

CVDisplayLinkCreateWithActiveCGDisplays() is a macOS system call, for which there are no obvious alternatives. We probably have to keep using it.

That said, we do call this method much more often than Safari or Chrome (see bug 1201401 comment 128). So at some point Mozilla may want to refactor the code that uses CVDisplayLinkCreateWithActiveCGDisplays() to call it less often. That's not something I want to do, though :-)

(In reply to Steven Michaud [:smichaud] (Retired) from comment #7)

What is BHR?

BHR is Background Hang Reporter.

Is there any documentation on it?

I'm not sure; I don't know any.

Where does its data come from?

It comes from Telemetry, from our Nightly users (it's disabled for beta and release). Whenever the main thread is unresponsive for 128+ms, we capture a stack and send it through telemetry.

How far back does it go?

Telemetry servers have quite a bit of history, but it's not in an easy to access form. I don't have easy access to data for previous days (it's something I would like to add at some point).

The one case where further examination might be worthwhile is if a number of users are very badly effected. Is there any way to find this out from the BHR data?

I don't have an easy way to find out right now, but the data exists on the servers as for every hang report a stack + the hang duration is sent to the telemetry servers.

Flags: needinfo?(florian)

From looking at the hang data from comment 0, it appears a "typical" hang is an average of 2-4s, which is a pretty long and user-observable hang

Whiteboard: [qf] → [qf:p1:responsiveness]

it appears a "typical" hang is an average of 2-4s

That surprises me. While working on bug 1201401 I hooked CVDisplayLinkCreateWithActiveCGDisplays() as part of my research into the bug. I'd have noticed such a long hang, but didn't. Florian talked about the problem having got much worse on 2019-12-07. We may have wonky data. Let's see how things change over time.

Here, "typical" means "typical among hangs that already take more than 128ms", or in other words: Calls to this function are either fast or really slow.

Maybe this happens when connecting or disconnecting external monitors? During those transitions, the window server is often blocked for a second or more, so any function that waits for a reply from the window server will show up as hanging, and this function might just be the first of those that we call when this happens. But that still doesn't explain why its hang rate has increased recently. Unless this is just new data from all the users that would have crashed in the past.

The priority flag is not set for this bug.
:spohl, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(spohl.mozilla.bugs)
Flags: needinfo?(spohl.mozilla.bugs)
Priority: -- → P2
Severity: normal → S3
Whiteboard: [qf:p1:responsiveness] → [qf:p1:responsiveness][bhr:CVDisplayLinkCreateWithActiveCGDisplays]
See Also: → 1415923
Priority: P2 → P3
See Also: → 1728764

I'll take a look.

Assignee: nobody → mstange.moz
Status: NEW → ASSIGNED

And with that I mean, I'll take a look at rewriting our vsync code. I'd like to stop using CVDisplayLinkCreateWithActiveCGDisplays altogether, and instead switch to per-window vsync where we create the display link for the correct (virtual) screen. We should also switch to a model where we keep the display link alive and disable+reenable it, rather than destroying and recreating it, which is what we do at the moment.

Performance Impact: --- → P1
Whiteboard: [qf:p1:responsiveness][bhr:CVDisplayLinkCreateWithActiveCGDisplays] → [bhr:CVDisplayLinkCreateWithActiveCGDisplays]
Depends on: 1759232

The Performance Priority Calculator has determined this bug's performance priority to be P1.

Platforms: macOS
Impact on browser UI: Renders browser effectively unusable
[x] Bug affects multiple sites
[x] Multiple reporters

The severity field for this bug is set to S3. However, the Performance Impact field flags this bug as having a high impact on the performance.
:mstange, could you consider increasing the severity of this performance-impacting bug? Alternatively, if you think the performance impact is lower than previously assessed, could you request a re-triage from the performance team by setting the Performance Impact flag to ??

For more information, please visit auto_nag documentation.

Flags: needinfo?(mstange.moz)

I haven't gotten a chance to work on bug 1759232 more, but I don't think there's an alternative for this particular bug. This is definitely still worth fixing properly.

Assignee: mstange.moz → nobody
Status: ASSIGNED → NEW
Flags: needinfo?(mstange.moz)
Priority: P3 → P2

:sphol can you review this bug and determine if the severity matches the performance impact? Same message as comment 17 - a High impact would translate to a S2. If you feel this is an S3, can you comment on the mismatch between our calculations and yours and we can reassess the impact?

Flags: needinfo?(spohl.mozilla.bugs)

Markus is better positioned to evaluate this.

Flags: needinfo?(spohl.mozilla.bugs) → needinfo?(mstange.moz)

I don't see CVDisplayLinkCreateWithActiveCGDisplays on http://queze.net/bhr/test/ at the moment, so maybe this is lower priority now.

Flags: needinfo?(mstange.moz)

(In reply to Markus Stange [:mstange] from comment #21)

I don't see CVDisplayLinkCreateWithActiveCGDisplays on http://queze.net/bhr/test/ at the moment, so maybe this is lower priority now.

It's not in the top 50 hangs, but if you search for it, you see it has 225 hangs reported for a total of 128s on the 20240417 Nightly builds.

Oh! I missed that. Ok, then the high priority on this bug is probably still adequate.

:spohl based on comment 23 this still seems to be a High impact, which we suggest be associated with an S2 bug. Do you agree with the assessment or are there other factors which would keep this bug from being marked S2?

Adding NI for attention to comment 24

Flags: needinfo?(spohl.mozilla.bugs)
Severity: S3 → S2
Flags: needinfo?(spohl.mozilla.bugs)
You need to log in before you can comment on or make changes to this bug.