Closed
Bug 1298085
Opened 9 years ago
Closed 8 years ago
If the compositor thread is blocked, the main thread will keep piling transactions up.
Categories
(Core :: Graphics: Layers, defect)
Core
Graphics: Layers
Tracking
()
RESOLVED
FIXED
mozilla51
Tracking | Status | |
---|---|---|
firefox51 | --- | fixed |
People
(Reporter: nical, Assigned: nical)
References
Details
(Whiteboard: [gfx-noted])
Attachments
(3 files, 2 obsolete files)
5.14 KB,
patch
|
mchang
:
review+
|
Details | Diff | Splinter Review |
1.75 KB,
patch
|
mchang
:
review+
|
Details | Diff | Splinter Review |
1.21 KB,
patch
|
Details | Diff | Splinter Review |
The refresh driver throttles itself to one frame every 200ms if the compositor is not keeping up to avoid being more than one frame in advance.
The problem is, sometimes the compositor thread can be blocked, and the transactions pile up. Since the compositor is not consuming and releasing the allocated textures, the content side will allocate new textures and add them to the next transactions, and eventually things blow up.
I can reproduce this fairly easily on Linux with GL layers: The compositor blocks on the first draw call (glClear() in CompositorOGL::BeginFrame) if the screen is locked, and boom.
There has been reports of the same problem on other configurations so I assume this is not the only way to reproduce the issue.
Assignee | ||
Comment 1•9 years ago
|
||
Perhaps we could up the nsRefresDriver threshold to, say 500ms or more, send pings to the compositor if the threshold is reached here: http://searchfox.org/mozilla-central/rev/44f6964ba95b8ddd8ebf70c55b34cd2323afeef4/layout/base/nsRefreshDriver.cpp#2069
And only after receiving the pong, would we Tick.
This way, if a bug makes us miss a composite notification, we have another safe-guard and we'll keep refreshing, but only if the compositor is able to process incoming transactions (if not, we are better off not refreshing at all).
Thoughts?
Updated•8 years ago
|
Whiteboard: [gfx-noted]
Comment 2•8 years ago
|
||
(In reply to Nicolas Silva [:nical] from comment #1)
> Perhaps we could up the nsRefresDriver threshold to, say 500ms or more, send
> pings to the compositor if the threshold is reached here:
> http://searchfox.org/mozilla-central/rev/
> 44f6964ba95b8ddd8ebf70c55b34cd2323afeef4/layout/base/nsRefreshDriver.cpp#2069
> And only after receiving the pong, would we Tick.
>
> This way, if a bug makes us miss a composite notification, we have another
> safe-guard and we'll keep refreshing, but only if the compositor is able to
> process incoming transactions (if not, we are better off not refreshing at
> all).
>
> Thoughts?
Do we know why the compositor isn't ticking at all in the first place? If a transaction exists, shouldn't it be compositing or is something bad happening in the compositor? 200 ms block in the compositor seems like a big problem. Or maybe the better thing is even at 200ms, we should ping the compositor to see if something happened rather than just do another transaction?
Assignee | ||
Updated•8 years ago
|
Summary: If the compositor thread is blocked, the main thread will keep piling transaction up. → If the compositor thread is blocked, the main thread will keep piling transactions up.
Assignee | ||
Comment 3•8 years ago
|
||
(In reply to Mason Chang [:mchang] from comment #2)
> Do we know why the compositor isn't ticking at all in the first place? If a
> transaction exists, shouldn't it be compositing or is something bad
> happening in the compositor?
One way to reproduce this is when locking the screen on linux: the next gl call will block for a long while, (probably until the screen is unlocked).
I've seen similar similar symptoms (crash coming back from screen lock/sleep mode) being reported on windows too. I suppose it depends on the drivers. It would also be a problem if you are simply debugging something on the compositor thread and the page animates, I guess.
With a debug build, it also just happens a bunch of times while the browser starts up.
> 200 ms block in the compositor seems like a big
> problem. Or maybe the better thing is even at 200ms, we should ping the
> compositor to see if something happened rather than just do another
> transaction?
I need something quickly that I can uplift to aurora because of the issues it causes with canvas, so I think that a short term compromise is to bump up the threshold to something much more noticeable (like 1sec) to see how many bugs this is papering over and reduce the number of times this is happening. I'll add some sync handshake trickery in the canvas code to remove the immediate issue.
Then I think that we should move the unthrottling logic out of nsRefreshDriver (which does not know about IPDL), and see if users of this class can ask the refresh driver for how long it has been waiting for a transaction id. Then the caller hopefully can call into something that issues a synchronous call to the compositor thread, or better even, do a ping as you said and scedule a forced-refresh upon reception of the pong.
Assignee | ||
Comment 4•8 years ago
|
||
Let's start with this.
Assignee: nobody → nical.bugzilla
Attachment #8787603 -
Flags: review?(mchang)
Comment 5•8 years ago
|
||
Comment on attachment 8787603 [details] [diff] [review]
Step 1 - increase the refresh driver's transaction timeout and emmit a gfxCriticalNote
Review of attachment 8787603 [details] [diff] [review]:
-----------------------------------------------------------------
nit: Add a comment explaining why this is probably bad.
Attachment #8787603 -
Flags: review?(mchang) → review+
Assignee | ||
Comment 6•8 years ago
|
||
This makes sure canvas 2D does not allocate dozens of TextureClients if the compositor is not consuming/releasing them. Since canvas is often used with requestAnimationFrame it really needs a short term solution, but other layer types are affected as well (tiling in particular, not sure about webgl).
Attachment #8788419 -
Flags: review?(mchang)
Comment 7•8 years ago
|
||
It's likely that we could just remove this timeout entirely. I added it as a fallback in case something else went wrong, but maybe we should just remove it and fix any bugs that show up.
Comment 8•8 years ago
|
||
How about let's do what Matt suggested in comment 7?
Flags: needinfo?(nical.bugzilla)
Assignee | ||
Comment 10•8 years ago
|
||
No more 200ms threshold. Instead we fire a gfxCriticalNote every 2^N seconds.
Attachment #8787603 -
Attachment is obsolete: true
Attachment #8788419 -
Attachment is obsolete: true
Attachment #8788419 -
Flags: review?(mchang)
Attachment #8789341 -
Flags: review?(mchang)
Comment 11•8 years ago
|
||
Comment on attachment 8789341 [details] [diff] [review]
Remove the 200ms unthrottling thing, warn after 1s, 2s, 4s, 8s, etc.
Review of attachment 8789341 [details] [diff] [review]:
-----------------------------------------------------------------
::: layout/base/nsRefreshDriver.h
@@ +423,5 @@
> // start of every tick.
> bool mResizeSuppressed;
>
> int64_t mMostRecentRefreshEpochTime;
> + uint64_t mWarningThreshold;
nit: add a comment explaining what this is for.
Attachment #8789341 -
Flags: review?(mchang) → review+
Comment 12•8 years ago
|
||
Pushed by nsilva@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/26862b624c71
Don't unthrottle the refresh driver after 200ms when waiting for compositor transactions. r=mchang
Comment 13•8 years ago
|
||
sorry had to backout since this closed the tree for too long for waiting for the follow up.
backed out for bustage like https://treeherder.mozilla.org/logviewer.html#?job_id=35619031&repo=mozilla-inbound
Flags: needinfo?(nical.bugzilla)
Comment 14•8 years ago
|
||
Backout by cbook@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/528eaed8c185
Backed out changeset 26862b624c71 for bustage on a CLOSED TREE
Comment 15•8 years ago
|
||
Pushed by nsilva@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/96e65eb8c472
Don't unthrottle the refresh driver after 200ms when waiting for compositor transactions. r=mchang
![]() |
||
Comment 16•8 years ago
|
||
Backed out for failing browser-chrome test browser_Troubleshoot.js (looks like updating the expectations of the test should fix this):
https://hg.mozilla.org/integration/mozilla-inbound/rev/f19715529c24
Push with failures: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=96e65eb8c47237e91ceee9f4565d6856baa33862
Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=35623443&repo=mozilla-inbound
15:08:09 INFO - 449 INFO TEST-START | toolkit/modules/tests/browser/browser_Troubleshoot.js
15:08:09 INFO - WebGL(0x7f5e14f78800)::ForceLoseContext
15:08:10 INFO - JavaScript warning: resource://gre/modules/Troubleshoot.jsm, line 451: Error: WebGL: Failed to create WebGL context: WebGL creation failed:
15:08:10 INFO - * Error during native OpenGL init.
15:08:10 INFO - * WebGL 2 requires support for the following features:
15:08:10 INFO - clear_buffers
15:08:10 INFO - get_integer64_indexed
15:08:10 INFO - gpu_shader4
15:08:10 INFO - * Exhausted GL driver options.
15:08:10 INFO - TEST-INFO | started process screentopng
15:08:11 INFO - TEST-INFO | screentopng: exit 0
15:08:11 INFO - 450 INFO checking window state
15:08:11 INFO - 451 INFO Console message: [JavaScript Warning: "Error: WebGL: Failed to create WebGL context: WebGL creation failed:
15:08:11 INFO - * Error during native OpenGL init.
15:08:11 INFO - * WebGL 2 requires support for the following features:
15:08:11 INFO - clear_buffers
15:08:11 INFO - get_integer64_indexed
15:08:11 INFO - gpu_shader4
15:08:11 INFO - * Exhausted GL driver options." {file: "resource://gre/modules/Troubleshoot.jsm" line: 451}]
15:08:11 INFO - 452 INFO TEST-UNEXPECTED-FAIL | toolkit/modules/tests/browser/browser_Troubleshoot.js | Schema mismatch, Error: Validation error: Object has property not in schema: object={"numTotalWindows":2,"numAcceleratedWindows":0,"windowLayerManagerType":"Basic","windowLayerManagerRemote":true,"supportsHardwareH264":"No","currentAudioBackend":"pulse","numAcceleratedWindowsMessage":["blockedGfxCard"],"adapterDescription":"VMware, Inc. -- Gallium 0.4 on llvmpipe (LLVM 3.3, 128 bits)","adapterVendorID":"VMware, Inc.","adapterDeviceID":"Gallium 0.4 on llvmpipe (LLVM 3.3, 128 bits)","adapterRAM":"","adapterDrivers":"","driverVersion":"2.1 Mesa 9.2.1","driverDate":"","webglRenderer":"VMware, Inc. -- Gallium 0.4 on llvmpipe (LLVM 3.3, 128 bits)","webgl2Renderer":"WebGL creation failed: \n* Error during native OpenGL init.\n* WebGL 2 requires support for the following features: \n clear_buffers\n get_integer64_indexed\n gpu_shader4\n* Exhausted GL driver options.","info":{"AzureCanvasBackend":"skia","AzureCanvasAccelerated":0,"AzureFallbackCanvasBackend":"none","AzureContentBackend":"skia","CairoUseXRender":0},"failures":["[GFX1-]: Refresh driver waiting for the compositor for1.32642 seconds.","[GFX1-]: Refresh driver waiting for the compositor for1.15291 seconds.","[GFX1-]: Refresh driver waiting for the compositor for7.93974 seconds.","[GFX1-]: Refresh driver waiting for the compositor for1.93608 seconds.","[GFX1-]: Refresh driver waiting for the compositor for2.68407 seconds.","[GFX1-]: Refresh driver waiting for the compositor for1.02396 seconds.","[GFX1-]: Refresh driver waiting for the compositor for2.11269 seconds."],"indices":[0,1,2,3,4,5,6],"featureLog":{"features":[{"name":"HW_COMPOSITING","description":"Compositing","status":"available","log":[{"type":"default","status":"available"}]},{"name":"OPENGL_COMPOSITING","description":"OpenGL Compositing","status":"blacklisted","log":[{"type":"default","status":"available"},{"type":"env","status":"blacklisted","message":"#BLOCKLIST_FEATURE_FAILURE_SOFTWARE_GL"}]}],"fallbacks":[]},"crashGuards":[]}, schema={"required":true,"type":"object","properties":{"numTotalWindows":{"required":true,"type":"number"},"numAcceleratedWindows":{"required":true,"type":"number"},"windowLayerManagerType":{"type":"string"},"windowLayerManagerRemote":{"type":"boolean"},"supportsHardwareH264":{"type":"string"},"currentAudioBackend":{"type":"string"},"numAcceleratedWindowsMessage":{"type":"array"},"adapterDescription":{"type":"string"},"adapterVendorID":{"type":"string"},"adapterDeviceID":{"type":"string"},"adapterSubsysID":{"type":"string"},"adapterRAM":{"type":"string"},"adapterDrivers":{"type":"string"},"driverVersion":{"type":"string"},"driverDate":{"type":"string"},"adapterDescription2":{"type":"string"},"adapterVendorID2":{"type":"string"},"adapterDeviceID2":{"type":"string"},"adapterSubsysID2":{"type":"string"},"adapterRAM2":{"type":"string"},"adapterDrivers2":{"type":"string"},"driverVersion2":{"type":"string"},"driverDate2":{"type":"string"},"isGPU2Active":{"type":"boolean"},"direct2DEnabled":{"type":"boolean"},"directWriteEnabled":{"type":"boolean"},"directWriteVersion":{"type":"string"},"clearTypeParameters":{"type":"string"},"webglRenderer":{"type":"string"},"webgl2Renderer":{"type":"string"},"info":{"type":"object"},"failures":{"type":"array","items":{"type":"string"}},"featureLog":{"type":"object"},"crashGuards":{"type":"array"},"direct2DEnabledMessage":{"type":"array"}}} -
15:08:11 INFO - Stack trace:
15:08:11 INFO - chrome://mochitests/content/browser/toolkit/modules/tests/browser/browser_Troubleshoot.js:snapshotSchema/<:40
15:08:11 INFO - 453 INFO TEST-PASS | toolkit/modules/tests/browser/browser_Troubleshoot.js | The pref should be set: javascript.troubleshoot -
15:08:11 INFO - 454 INFO TEST-PASS | toolkit/modules/tests/browser/browser_Troubleshoot.js | The pref should be set: troubleshoot.foo -
15:08:11 INFO - 455 INFO TEST-PASS | toolkit/modules/tests/browser/browser_Troubleshoot.js | The pref should be set: javascript.print_to_filename -
15:08:11 INFO - 456 INFO TEST-PASS | toolkit/modules/tests/browser/browser_Troubleshoot.js | The pref should be set: network.proxy.troubleshoot -
Assignee | ||
Comment 17•8 years ago
|
||
The test maintains a list of properties that can appear in about:suport and in what form they can appear. The list did not contain the optional "indices" property that is present only if the number if number of items in the gfx critical log is greater than a certain number. (see http://searchfox.org/mozilla-central/rev/950e13cca1fda6507dc53c243295044e8eda4493/gfx/2d/Logging.h#197 ).
The fix is simply to add the indices entry in the schema that about:support is tested against.
Flags: needinfo?(nical.bugzilla)
Attachment #8790217 -
Flags: review?(mchang)
Updated•8 years ago
|
Attachment #8790217 -
Flags: review?(mchang) → review+
Comment 18•8 years ago
|
||
Pushed by nsilva@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/8d41ddbda49d
Don't unthrottle the refresh driver after 200ms when waiting for compositor transactions. r=mchang
https://hg.mozilla.org/integration/mozilla-inbound/rev/533af0114689
Update test browser_Troubleshoot.js to recognize the indices field from gfx logging. r=mchang
Comment 19•8 years ago
|
||
It appears this patch is causing many of these error messages in about:support. This patch is in the regression range.
failures [GFX1-]: Refresh driver waiting for the compositor for1.43333 seconds.
Failure Log
(#0) Error Refresh driver waiting for the compositor for1.43333 seconds.
(#13) Error Refresh driver waiting for the compositor for1.16809 seconds.
(#14) Error Refresh driver waiting for the compositor for1.30978 seconds.
(#15) CP+[GFX1-]: Refresh driver waiting for the compositor for1.61857 seconds.
(#16) Error Refresh driver waiting for the compositor for1.26734 seconds.
(#17) Error Refresh driver waiting for the compositor for1.21802 seconds.
(#18) Error Refresh driver waiting for the compositor for1.73488 seconds.
(#19) Error Refresh driver waiting for the compositor for1.93955 seconds.
(#20) Error Refresh driver waiting for the compositor for27.5069 seconds.
(#21) Error Refresh driver waiting for the compositor for2.0159 seconds.
(#22) Error Refresh driver waiting for the compositor for3.09471 seconds.
(#23) Error Refresh driver waiting for the compositor for1.35097 seconds.
(#24) Error Refresh driver waiting for the compositor for2.91203 seconds.
(#25) Error Refresh driver waiting for the compositor for1.93484 seconds.
(#26) Error Refresh driver waiting for the compositor for1.02408 seconds.
(#27) Error Refresh driver waiting for the compositor for1.00016 seconds.
Comment 20•8 years ago
|
||
Comment 21•8 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/8d41ddbda49d
https://hg.mozilla.org/mozilla-central/rev/533af0114689
Status: NEW → RESOLVED
Closed: 8 years ago
status-firefox51:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla51
Assignee | ||
Comment 22•8 years ago
|
||
(In reply to Gary [:streetwolf] from comment #19)
> It appears this patch is causing many of these error messages in
> about:support. This patch is in the regression range.
This patch is causing us to report these failures. The failures already existed before, though, and we should investigate why there are so many of them.
Comment 23•8 years ago
|
||
(In reply to Nicolas Silva [:nical] from comment #22)
> (In reply to Gary [:streetwolf] from comment #19)
> > It appears this patch is causing many of these error messages in
> > about:support. This patch is in the regression range.
>
> This patch is causing us to report these failures. The failures already
> existed before, though, and we should investigate why there are so many of
> them.
I opened up Bug 1302713.
:nical, we may want to get rid of the critical note, or perhaps do a "Once" version of it. Since we have a fixed size buffer for all the critical errors and notes, this one can happen often enough that it fills it in all by itself, and we lose all the other information. For example https://crash-stats.mozilla.com/report/index/d366d0e8-9b36-44c3-bafc-b48872160914.
Flags: needinfo?(nical.bugzilla)
Something like this.
Attachment #8791736 -
Flags: feedback?(nical.bugzilla)
Assignee | ||
Comment 26•8 years ago
|
||
I just landed two patches in bug 1302713 which should fix the log spam. Let's wait for a day and see.
If I got the logging logic right this time, the log should really only happen if the compositor has been unresponsive for for more than 1 second in release builds and 5 seconds in debug builds. If it keeps filling up the critical log it means something terribly bad is happening and needs to be fixed asap.
knowing that the compositor has been unresponsive several times is useful info (it will let us make the difference between "froze several times" and just "froze once at startup"), and if it has been unresponsive for 2^N seconds, it's bad enough that I think it's worth having N items in the log (we could put a limit to this N, like 3 or 4, but more than 1).
Flags: needinfo?(nical.bugzilla)
Well, the crash from comment 24 has > 100s delays, so, yes, let's fix it right away :) Maybe it is the false positives.
I'm not arguing we don't need all this information, just that I don't want it hiding the other useful stuff we get in the crash reports. That information is useful across a wide range of bugs, not just this one. We can introduce another field, outside of the critical errors crash reporter field if we want it all.
Flags: needinfo?(milan)
Assignee | ||
Updated•8 years ago
|
Attachment #8791736 -
Flags: feedback?(nical.bugzilla)
Updated•8 years ago
|
Flags: needinfo?(milan)
You need to log in
before you can comment on or make changes to this bug.
Description
•