Closed Bug 1298085 Opened 4 years ago Closed 4 years ago

If the compositor thread is blocked, the main thread will keep piling transactions up.

Categories

(Core :: Graphics: Layers, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla51
Tracking Status
firefox51 --- fixed

People

(Reporter: nical, Assigned: nical)

References

(Depends on 1 open bug)

Details

(Whiteboard: [gfx-noted])

Attachments

(3 files, 2 obsolete files)

The refresh driver throttles itself to one frame every 200ms if the compositor is not keeping up to avoid being more than one frame in advance.

The problem is, sometimes the compositor thread can be blocked, and the transactions pile up. Since the compositor is not consuming and releasing the allocated textures, the content side will allocate new textures and add them to the next transactions, and eventually things blow up.

I can reproduce this fairly easily on Linux with GL layers: The compositor blocks on the first draw call (glClear() in CompositorOGL::BeginFrame) if the screen is locked, and boom.

There has been reports of the same problem on other configurations so I assume this is not the only way to reproduce the issue.
Perhaps we could up the nsRefresDriver threshold to, say 500ms or more, send pings to the compositor if the threshold is reached here: http://searchfox.org/mozilla-central/rev/44f6964ba95b8ddd8ebf70c55b34cd2323afeef4/layout/base/nsRefreshDriver.cpp#2069
And only after receiving the pong, would we Tick.

This way, if a bug makes us miss a composite notification, we have another safe-guard and we'll keep refreshing, but only if the compositor is able to process incoming transactions (if not, we are better off not refreshing at all).

Thoughts?
(In reply to Nicolas Silva [:nical] from comment #1)
> Perhaps we could up the nsRefresDriver threshold to, say 500ms or more, send
> pings to the compositor if the threshold is reached here:
> http://searchfox.org/mozilla-central/rev/
> 44f6964ba95b8ddd8ebf70c55b34cd2323afeef4/layout/base/nsRefreshDriver.cpp#2069
> And only after receiving the pong, would we Tick.
> 
> This way, if a bug makes us miss a composite notification, we have another
> safe-guard and we'll keep refreshing, but only if the compositor is able to
> process incoming transactions (if not, we are better off not refreshing at
> all).
> 
> Thoughts?

Do we know why the compositor isn't ticking at all in the first place? If a transaction exists, shouldn't it be compositing or is something bad happening in the compositor? 200 ms block in the compositor seems like a big problem. Or maybe the better thing is even at 200ms, we should ping the compositor to see if something happened rather than just do another transaction?
Summary: If the compositor thread is blocked, the main thread will keep piling transaction up. → If the compositor thread is blocked, the main thread will keep piling transactions up.
(In reply to Mason Chang [:mchang] from comment #2)
> Do we know why the compositor isn't ticking at all in the first place? If a
> transaction exists, shouldn't it be compositing or is something bad
> happening in the compositor?

One way to reproduce this is when locking the screen on linux: the next gl call will block for a long while, (probably until the screen is unlocked).

I've seen similar similar symptoms (crash coming back from screen lock/sleep mode) being reported on windows too. I suppose it depends on the drivers. It would also be a problem if you are simply debugging something on the compositor thread and the page animates, I guess.

With a debug build, it also just happens a bunch of times while the browser starts up.

> 200 ms block in the compositor seems like a big
> problem. Or maybe the better thing is even at 200ms, we should ping the
> compositor to see if something happened rather than just do another
> transaction?

I need something quickly that I can uplift to aurora because of the issues it causes with canvas, so I think that a short term compromise is to bump up the threshold to something much more noticeable (like 1sec) to see how many bugs this is papering over and reduce the number of times this is happening. I'll add some sync handshake trickery in the canvas code to remove the immediate issue.

Then I think that we should move the unthrottling logic out of nsRefreshDriver (which does not know about IPDL), and see if users of this class can ask the refresh driver for how long it has been waiting for a transaction id. Then the caller hopefully can call into something that issues a synchronous call to the compositor thread, or better even, do a ping as you said and scedule a forced-refresh upon reception of the pong.
Let's start with this.
Assignee: nobody → nical.bugzilla
Attachment #8787603 - Flags: review?(mchang)
Comment on attachment 8787603 [details] [diff] [review]
Step 1 - increase the refresh driver's transaction timeout and emmit a gfxCriticalNote

Review of attachment 8787603 [details] [diff] [review]:
-----------------------------------------------------------------

nit: Add a comment explaining why this is probably bad.
Attachment #8787603 - Flags: review?(mchang) → review+
This makes sure canvas 2D does not allocate dozens of TextureClients if the compositor is not consuming/releasing them. Since canvas is often used with requestAnimationFrame it really needs a short term solution, but other layer types are affected as well (tiling in particular, not sure about webgl).
Attachment #8788419 - Flags: review?(mchang)
It's likely that we could just remove this timeout entirely. I added it as a fallback in case something else went wrong, but maybe we should just remove it and fix any bugs that show up.
How about let's do what Matt suggested in comment 7?
Flags: needinfo?(nical.bugzilla)
Sounds good to me!
Flags: needinfo?(nical.bugzilla)
No more 200ms threshold. Instead we fire a gfxCriticalNote every 2^N seconds.
Attachment #8787603 - Attachment is obsolete: true
Attachment #8788419 - Attachment is obsolete: true
Attachment #8788419 - Flags: review?(mchang)
Attachment #8789341 - Flags: review?(mchang)
Comment on attachment 8789341 [details] [diff] [review]
Remove the 200ms unthrottling thing, warn after 1s, 2s, 4s, 8s, etc.

Review of attachment 8789341 [details] [diff] [review]:
-----------------------------------------------------------------

::: layout/base/nsRefreshDriver.h
@@ +423,5 @@
>    // start of every tick.
>    bool mResizeSuppressed;
>  
>    int64_t mMostRecentRefreshEpochTime;
> +  uint64_t mWarningThreshold;

nit: add a comment explaining what this is for.
Attachment #8789341 - Flags: review?(mchang) → review+
Pushed by nsilva@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/26862b624c71
Don't unthrottle the refresh driver after 200ms when waiting for compositor transactions. r=mchang
sorry had to backout since this closed the tree for too long for waiting for the follow up.

backed out for bustage like https://treeherder.mozilla.org/logviewer.html#?job_id=35619031&repo=mozilla-inbound
Flags: needinfo?(nical.bugzilla)
Backout by cbook@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/528eaed8c185
Backed out changeset 26862b624c71 for bustage on a CLOSED TREE
Pushed by nsilva@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/96e65eb8c472
Don't unthrottle the refresh driver after 200ms when waiting for compositor transactions. r=mchang
Backed out for failing browser-chrome test browser_Troubleshoot.js (looks like updating the expectations of the test should fix this):

https://hg.mozilla.org/integration/mozilla-inbound/rev/f19715529c24

Push with failures: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=96e65eb8c47237e91ceee9f4565d6856baa33862
Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=35623443&repo=mozilla-inbound

15:08:09     INFO -  449 INFO TEST-START | toolkit/modules/tests/browser/browser_Troubleshoot.js
15:08:09     INFO -  WebGL(0x7f5e14f78800)::ForceLoseContext
15:08:10     INFO -  JavaScript warning: resource://gre/modules/Troubleshoot.jsm, line 451: Error: WebGL: Failed to create WebGL context: WebGL creation failed:
15:08:10     INFO -  * Error during native OpenGL init.
15:08:10     INFO -  * WebGL 2 requires support for the following features:
15:08:10     INFO -    clear_buffers
15:08:10     INFO -    get_integer64_indexed
15:08:10     INFO -    gpu_shader4
15:08:10     INFO -  * Exhausted GL driver options.
15:08:10     INFO -  TEST-INFO | started process screentopng
15:08:11     INFO -  TEST-INFO | screentopng: exit 0
15:08:11     INFO -  450 INFO checking window state
15:08:11     INFO -  451 INFO Console message: [JavaScript Warning: "Error: WebGL: Failed to create WebGL context: WebGL creation failed:
15:08:11     INFO -  * Error during native OpenGL init.
15:08:11     INFO -  * WebGL 2 requires support for the following features:
15:08:11     INFO -    clear_buffers
15:08:11     INFO -    get_integer64_indexed
15:08:11     INFO -    gpu_shader4
15:08:11     INFO -  * Exhausted GL driver options." {file: "resource://gre/modules/Troubleshoot.jsm" line: 451}]
15:08:11     INFO -  452 INFO TEST-UNEXPECTED-FAIL | toolkit/modules/tests/browser/browser_Troubleshoot.js | Schema mismatch, Error: Validation error: Object has property not in schema: object={"numTotalWindows":2,"numAcceleratedWindows":0,"windowLayerManagerType":"Basic","windowLayerManagerRemote":true,"supportsHardwareH264":"No","currentAudioBackend":"pulse","numAcceleratedWindowsMessage":["blockedGfxCard"],"adapterDescription":"VMware, Inc. -- Gallium 0.4 on llvmpipe (LLVM 3.3, 128 bits)","adapterVendorID":"VMware, Inc.","adapterDeviceID":"Gallium 0.4 on llvmpipe (LLVM 3.3, 128 bits)","adapterRAM":"","adapterDrivers":"","driverVersion":"2.1 Mesa 9.2.1","driverDate":"","webglRenderer":"VMware, Inc. -- Gallium 0.4 on llvmpipe (LLVM 3.3, 128 bits)","webgl2Renderer":"WebGL creation failed: \n* Error during native OpenGL init.\n* WebGL 2 requires support for the following features: \n  clear_buffers\n  get_integer64_indexed\n  gpu_shader4\n* Exhausted GL driver options.","info":{"AzureCanvasBackend":"skia","AzureCanvasAccelerated":0,"AzureFallbackCanvasBackend":"none","AzureContentBackend":"skia","CairoUseXRender":0},"failures":["[GFX1-]: Refresh driver waiting for the compositor for1.32642 seconds.","[GFX1-]: Refresh driver waiting for the compositor for1.15291 seconds.","[GFX1-]: Refresh driver waiting for the compositor for7.93974 seconds.","[GFX1-]: Refresh driver waiting for the compositor for1.93608 seconds.","[GFX1-]: Refresh driver waiting for the compositor for2.68407 seconds.","[GFX1-]: Refresh driver waiting for the compositor for1.02396 seconds.","[GFX1-]: Refresh driver waiting for the compositor for2.11269 seconds."],"indices":[0,1,2,3,4,5,6],"featureLog":{"features":[{"name":"HW_COMPOSITING","description":"Compositing","status":"available","log":[{"type":"default","status":"available"}]},{"name":"OPENGL_COMPOSITING","description":"OpenGL Compositing","status":"blacklisted","log":[{"type":"default","status":"available"},{"type":"env","status":"blacklisted","message":"#BLOCKLIST_FEATURE_FAILURE_SOFTWARE_GL"}]}],"fallbacks":[]},"crashGuards":[]}, schema={"required":true,"type":"object","properties":{"numTotalWindows":{"required":true,"type":"number"},"numAcceleratedWindows":{"required":true,"type":"number"},"windowLayerManagerType":{"type":"string"},"windowLayerManagerRemote":{"type":"boolean"},"supportsHardwareH264":{"type":"string"},"currentAudioBackend":{"type":"string"},"numAcceleratedWindowsMessage":{"type":"array"},"adapterDescription":{"type":"string"},"adapterVendorID":{"type":"string"},"adapterDeviceID":{"type":"string"},"adapterSubsysID":{"type":"string"},"adapterRAM":{"type":"string"},"adapterDrivers":{"type":"string"},"driverVersion":{"type":"string"},"driverDate":{"type":"string"},"adapterDescription2":{"type":"string"},"adapterVendorID2":{"type":"string"},"adapterDeviceID2":{"type":"string"},"adapterSubsysID2":{"type":"string"},"adapterRAM2":{"type":"string"},"adapterDrivers2":{"type":"string"},"driverVersion2":{"type":"string"},"driverDate2":{"type":"string"},"isGPU2Active":{"type":"boolean"},"direct2DEnabled":{"type":"boolean"},"directWriteEnabled":{"type":"boolean"},"directWriteVersion":{"type":"string"},"clearTypeParameters":{"type":"string"},"webglRenderer":{"type":"string"},"webgl2Renderer":{"type":"string"},"info":{"type":"object"},"failures":{"type":"array","items":{"type":"string"}},"featureLog":{"type":"object"},"crashGuards":{"type":"array"},"direct2DEnabledMessage":{"type":"array"}}} -
15:08:11     INFO -  Stack trace:
15:08:11     INFO -  chrome://mochitests/content/browser/toolkit/modules/tests/browser/browser_Troubleshoot.js:snapshotSchema/<:40
15:08:11     INFO -  453 INFO TEST-PASS | toolkit/modules/tests/browser/browser_Troubleshoot.js | The pref should be set: javascript.troubleshoot -
15:08:11     INFO -  454 INFO TEST-PASS | toolkit/modules/tests/browser/browser_Troubleshoot.js | The pref should be set: troubleshoot.foo -
15:08:11     INFO -  455 INFO TEST-PASS | toolkit/modules/tests/browser/browser_Troubleshoot.js | The pref should be set: javascript.print_to_filename -
15:08:11     INFO -  456 INFO TEST-PASS | toolkit/modules/tests/browser/browser_Troubleshoot.js | The pref should be set: network.proxy.troubleshoot -
Attached patch Fix the testSplinter Review
The test maintains a list of properties that can appear in about:suport and in what form they can appear. The list did not contain the optional "indices" property that is present only if the number if number of items in the gfx critical log is greater than a certain number. (see http://searchfox.org/mozilla-central/rev/950e13cca1fda6507dc53c243295044e8eda4493/gfx/2d/Logging.h#197 ).

The fix is simply to add the indices entry in the schema that about:support is tested against.
Flags: needinfo?(nical.bugzilla)
Attachment #8790217 - Flags: review?(mchang)
Attachment #8790217 - Flags: review?(mchang) → review+
Pushed by nsilva@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/8d41ddbda49d
Don't unthrottle the refresh driver after 200ms when waiting for compositor transactions. r=mchang
https://hg.mozilla.org/integration/mozilla-inbound/rev/533af0114689
Update test browser_Troubleshoot.js to recognize the indices field from gfx logging. r=mchang
It appears this patch is causing many of these error messages in about:support.  This patch is in the regression range.

failures	[GFX1-]: Refresh driver waiting for the compositor for1.43333 seconds.
Failure Log
(#0) Error	Refresh driver waiting for the compositor for1.43333 seconds.
(#13) Error	Refresh driver waiting for the compositor for1.16809 seconds.
(#14) Error	Refresh driver waiting for the compositor for1.30978 seconds.
(#15) 	CP+[GFX1-]: Refresh driver waiting for the compositor for1.61857 seconds.
(#16) Error	Refresh driver waiting for the compositor for1.26734 seconds.
(#17) Error	Refresh driver waiting for the compositor for1.21802 seconds.
(#18) Error	Refresh driver waiting for the compositor for1.73488 seconds.
(#19) Error	Refresh driver waiting for the compositor for1.93955 seconds.
(#20) Error	Refresh driver waiting for the compositor for27.5069 seconds.
(#21) Error	Refresh driver waiting for the compositor for2.0159 seconds.
(#22) Error	Refresh driver waiting for the compositor for3.09471 seconds.
(#23) Error	Refresh driver waiting for the compositor for1.35097 seconds.
(#24) Error	Refresh driver waiting for the compositor for2.91203 seconds.
(#25) Error	Refresh driver waiting for the compositor for1.93484 seconds.
(#26) Error	Refresh driver waiting for the compositor for1.02408 seconds.
(#27) Error	Refresh driver waiting for the compositor for1.00016 seconds.
https://hg.mozilla.org/mozilla-central/rev/8d41ddbda49d
https://hg.mozilla.org/mozilla-central/rev/533af0114689
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla51
(In reply to Gary [:streetwolf] from comment #19)
> It appears this patch is causing many of these error messages in
> about:support.  This patch is in the regression range.

This patch is causing us to report these failures. The failures already existed before, though, and we should investigate why there are so many of them.
Depends on: 1302701
(In reply to Nicolas Silva [:nical] from comment #22)
> (In reply to Gary [:streetwolf] from comment #19)
> > It appears this patch is causing many of these error messages in
> > about:support.  This patch is in the regression range.
> 
> This patch is causing us to report these failures. The failures already
> existed before, though, and we should investigate why there are so many of
> them.

I opened up Bug 1302713.
No longer depends on: 1302701
Blocks: 1302713
Depends on: 1302922
:nical, we may want to get rid of the critical note, or perhaps do a "Once" version of it.  Since we have a fixed size buffer for all the critical errors and notes, this one can happen often enough that it fills it in all by itself, and we lose all the other information.  For example https://crash-stats.mozilla.com/report/index/d366d0e8-9b36-44c3-bafc-b48872160914.
Flags: needinfo?(nical.bugzilla)
I just landed two patches in bug 1302713 which should fix the log spam. Let's wait for a day and see.

If I got the logging logic right this time, the log should really only happen if the compositor has been unresponsive for for more than 1 second in release builds and 5 seconds in debug builds. If it keeps filling up the critical log it means something terribly bad is happening and needs to be fixed asap.

knowing that the compositor has been unresponsive several times is useful info (it will let us make the difference between "froze several times" and just "froze once at startup"), and if it has been unresponsive for 2^N seconds, it's bad enough that I think it's worth having N items in the log (we could put a limit to this N, like 3 or 4, but more than 1).
Flags: needinfo?(nical.bugzilla)
Well, the crash from comment 24 has > 100s delays, so, yes, let's fix it right away :)  Maybe it is the false positives.

I'm not arguing we don't need all this information, just that I don't want it hiding the other useful stuff we get in the crash reports. That information is useful across a wide range of bugs, not just this one.  We can introduce another field, outside of the critical errors crash reporter field if we want it all.
Flags: needinfo?(milan)
Attachment #8791736 - Flags: feedback?(nical.bugzilla)
Flags: needinfo?(milan)
Depends on: 874423
You need to log in before you can comment on or make changes to this bug.