Closed Bug 1303369 Opened 3 years ago Closed 3 years ago

Don't fill the gfx critical log if the compositor is blocked forever.

Categories

(Core :: Graphics: Layers, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla52
Tracking Status
firefox51 --- fixed
firefox52 --- fixed

People

(Reporter: nical, Assigned: nical)

Details

Attachments

(2 files, 1 obsolete file)

We should limit the amount of logging coming from nsRefreshDriver if the compositor is blocked for indefinitely, because the log could contain important information about why the problem happened.
Attached patch Stop logging after 8 seconds (obsolete) — Splinter Review
Attachment #8792024 - Flags: review?(milan)
Comment on attachment 8792024 [details] [diff] [review]
Stop logging after 8 seconds

Review of attachment 8792024 [details] [diff] [review]:
-----------------------------------------------------------------

Correct me if I'm wrong, but this is going to limit us to getting a few messages at a time, but nothing is stopping multiple refresh problems from adding up, three or so entries each, and hiding the other messages?  If we know that there was a refresh lag for 8+ seconds, what's the value in knowing there were 5 of such events, vs. just one, and have those other four*3 wipe out the rest of the messages?
Here's one where I just crashed - https://crash-stats.mozilla.com/report/index/5520bb85-38ab-4501-8417-f37a72160916.  15 out of 16 messages are with the refresh driver < 8 seconds.  The 16th (max size by default) was a refresh driver > 8 seconds.  Would the patch above have 15 out of those 16 show up in the log?
We already have >3k reports with the refresh messages in the error log.  Can we get rid of this message while we're figuring out how to limit it?  Because it's making the critical log useless at this point.
Flags: needinfo?(nical.bugzilla)
(In reply to Milan Sreckovic [:milan] from comment #3)
> For example, that same crash -
> https://crash-stats.mozilla.com/report/index/d366d0e8-9b36-44c3-bafc-
> b48872160914

(In reply to Milan Sreckovic [:milan] from comment #4)
> Here's one where I just crashed -
> https://crash-stats.mozilla.com/report/index/5520bb85-38ab-4501-8417-
> f37a72160916.

Both of these are with builds prior to the patch that fixed the log spam.

(In reply to Milan Sreckovic [:milan] from comment #5)
> We already have >3k reports with the refresh messages in the error log.  Can
> we get rid of this message while we're figuring out how to limit it? 
> Because it's making the critical log useless at this point.

The bulk of the issue is already figured out, there's still some adjustments to make, but the vast majority of the reports with this in the log are from nightly builds prior to the fix.

(In reply to Milan Sreckovic [:milan] from comment #2)
> Comment on attachment 8792024 [details] [diff] [review]
> Stop logging after 8 seconds
> 
> Review of attachment 8792024 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Correct me if I'm wrong, but this is going to limit us to getting a few
> messages at a time, but nothing is stopping multiple refresh problems from
> adding up, three or so entries each, and hiding the other messages?

Sure, that's true with almost all of the gfxCriticalError/Note messages in the code.
If it fires too much it means either the logging logic is incorrect or someone needs to stop what he is doing and fix a critical bug asap. In this case the logging code was bogus and spammed the log, which should be addressed now.

> If we know that there was a refresh lag for 8+ seconds, what's the value in
> knowing there were 5 of such events, vs. just one, and have those other
> four*3 wipe out the rest of the messages?

I'd rather avoid having the log fire only once because of cases where a bad tab switch takes a second or some addon messes up, and then the log doesn't fire again later when the compositor actually froze for a very long time. Knowing that the problem happened several time is useful info too.

(In reply to Milan Sreckovic [:milan] from comment #5)
> We already have >3k reports with the refresh messages in the error log.  Can
> we get rid of this message while we're figuring out how to limit it? 
> Because it's making the critical log useless at this point.

Looking at https://crash-stats.mozilla.com/search/?graphics_critical_error=~Refresh%20driver%20waiting&build_id=%3E20160916030204&product=Firefox&_sort=build_id&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=graphics_critical_error#crash-reports

There has been around 450 crashes with a 1s or more freeze since the logging fix landed which is still too much, but for most of them the log is not filling up all the gfxCriticalNote slots (not sure how to build a query that shows crashes where the same note filled all slots), so we are not in the emergency situation we were in a few days ago. These 450 crashes most likely did freeze badly for long periods of time, so I would not try to minimize this number by tweaking the log. We just want to make sure it doesn't prevent us from seeing the other logging items.


We could do the same thing without resetting the timeout threshold next time we paint. This would mean we only log a freeze if it held longer than the previous one up to 16sec and never log more than 4 times total. As long as we don't do gfxCriticalNoteOnce which will only record the first 1sec freeze and not show potential worse ones after that.
Flags: needinfo?(nical.bugzilla)
With this patch we log only the first time we hit each threshold (1s, 2s, 4s and 8s).
Attachment #8792024 - Attachment is obsolete: true
Attachment #8792024 - Flags: review?(milan)
Attachment #8792453 - Flags: review?(milan)
Comment on attachment 8792453 [details] [diff] [review]
Stop logging after 8 seconds & only log if the freeze is worse than the previous one

Review of attachment 8792453 [details] [diff] [review]:
-----------------------------------------------------------------

Still feels like this belongs in telemetry, rather than crash reports.  It looks like a normal thing to happen (sad, but not unexpected), doesn't feel like a critical error.
Attachment #8792453 - Flags: review?(milan) → review+
(In reply to Milan Sreckovic [:milan] from comment #8)
> Still feels like this belongs in telemetry, rather than crash reports.  It
> looks like a normal thing to happen (sad, but not unexpected), doesn't feel
> like a critical error.

You have a point about telemetry.
It should not be a normal thing to happen, though. 1 second is not so far from "forever" in terms of compositor responsiveness. It's scary and I'd like to have reliable numbers.
Someone just reported to me that he was still seeing the log sometimes when the browser didn't look like it was frozen, So I guess I'll disable the log entirely in the aurora train that starts today/tomorrow and see if there's a way to get the logging to be more reliable.
Attachment #8792536 - Flags: review?(milan)
Attachment #8792536 - Flags: review?(milan) → review+
Pushed by nsilva@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/aeef09f7560c
Disable the nsRefreshDriver warnings while false positives are investigated. r=milan
https://hg.mozilla.org/mozilla-central/rev/aeef09f7560c
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
Comment on attachment 8792536 [details] [diff] [review]
Disable the warning (aurora)

Approval Request Comment
[Feature/regressing bug #]:
[User impact if declined]: None, we would have a harder time collecting the gfxCriticalNote info in crash-stats. This disables a log which has false positives so we can't trust enough to draw conclusions from it and it occupies space in the limited list of logged items.
[Describe test coverage new/current, TreeHerder]: None.
[Risks and why]: None, this just comments out a log.
[String/UUID change made/needed]: None.
Attachment #8792536 - Flags: approval-mozilla-aurora?
Comment on attachment 8792536 [details] [diff] [review]
Disable the warning (aurora)

The patch disables logs to better collect information. Take it in 51 aurora.
Attachment #8792536 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
You need to log in before you can comment on or make changes to this bug.