Open Bug 1462777 Opened 2 years ago Updated 7 months ago

Intermittent webrender::device: Failed to compile shader:

Categories

(Core :: Graphics: WebRender, defect, P5)

defect

Tracking

()

People

(Reporter: intermittent-bug-filer, Unassigned)

References

(Blocks 2 open bugs)

Details

(Keywords: intermittent-failure)

Filed by: apavel [at] mozilla.com

https://treeherder.mozilla.org/logviewer.html#?job_id=179226503&repo=autoland

https://queue.taskcluster.net/v1/task/a_SS4iFRTo65FpwkYl4oGw/runs/0/artifacts/public/logs/live_backing.log

11:18:58     INFO -  PID 23453 | ERROR 2018-05-18T18:18:58Z: webrender::device: Failed to compile shader: brush_image
11:18:58     INFO -  PID 23453 |
11:18:58     INFO -  PID 23453 | [GFX1-]: wr_renderer_render: Shader(Compilation("brush_image", ""))
11:18:58     INFO -  PID 23453 | [GFX1-]: Compositors might be mixed (5,2)
11:21:56     INFO -  PID 23453 | Cycle 1(1): loaded http://localhost:37677/tests/dromaeo/cssquery-dojo.html (next: http://localhost:37677/tests/dromaeo/cssquery-ext.html)
11:21:57     INFO -  PID 23453 | ExceptionHandler::GenerateDump cloned child 23688
11:21:57     INFO -  PID 23453 | ExceptionHandler::SendContinueSignalToChild sent continue signal to child
11:21:57     INFO -  PID 23453 | ExceptionHandler::WaitForContinueSignal waiting for continue signal...
11:21:57     INFO -  PID 23453 | [GFX1-]: Receive IPC close with reason=AbnormalShutdown
11:21:57     INFO -  PID 23453 | [Child 23582, Chrome_ChildThread] WARNING: pipe error (3): Connection reset by peer: file /builds/worker/workspace/build/src/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 353
11:21:57     INFO -  PID 23453 | [Child 23529, Chrome_ChildThread] WARNING: pipe error (23): Connection reset by peer: file /builds/worker/workspace/build/src/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 353
11:21:57     INFO -  Terminating psutil.Process(pid=23453, name='firefox', started='11:18:36')
11:21:57     INFO -  TEST-INFO | 23453: exit 11
11:21:57     INFO -  TEST-UNEXPECTED-ERROR | dromaeo_css | Could not find report in browser output: [('tsformat', ('__start_report', '__end_report')), ('tpformat', ('__start_tp_report', '__end_tp_report'))]
11:21:57    ERROR -  Traceback (most recent call last):
11:21:57     INFO -    File "/home/cltbld/workspace/build/tests/talos/talos/run_tests.py", line 296, in run_tests
11:21:57     INFO -      talos_results.add(mytest.runTest(browser_config, test))
11:21:57     INFO -    File "/home/cltbld/workspace/build/tests/talos/talos/ttest.py", line 63, in runTest
11:21:57     INFO -      return self._runTest(browser_config, test_config, setup)
11:21:57     INFO -    File "/home/cltbld/workspace/build/tests/talos/talos/ttest.py", line 270, in _runTest
11:21:57     INFO -      else None)
11:21:57     INFO -    File "/home/cltbld/workspace/build/tests/talos/talos/results.py", line 95, in add
11:21:57     INFO -      global_counters=self.global_counters
11:21:57     INFO -    File "/home/cltbld/workspace/build/tests/talos/talos/results.py", line 326, in __init__
11:21:57     INFO -      self.parse()
11:21:57     INFO -    File "/home/cltbld/workspace/build/tests/talos/talos/results.py", line 353, in parse
11:21:57     INFO -      % self.report_tokens)
11:21:57     INFO -    File "/home/cltbld/workspace/build/tests/talos/talos/results.py", line 337, in error
11:21:57     INFO -      raise utils.TalosError(message)
11:21:57     INFO -  TalosError: Could not find report in browser output: [('tsformat', ('__start_report', '__end_report')), ('tpformat', ('__start_tp_report', '__end_tp_report'))]
11:21:57     INFO -  TEST-INFO took 237644ms
11:21:57     INFO -  SUITE-END | took 237s
11:21:58    ERROR - Return code: 2
11:21:58  WARNING - setting return code to 2
11:21:58    ERROR - # TBPL FAILURE #
11:21:58     INFO - Running post-action listener: _package_coverage_data
11:21:58     INFO - Running post-action listener: _resource_record_post_action
11:21:58     INFO - [mozharness: 2018-05-18 18:21:58.243239Z] Finished run-tests step (success)
11:21:58     INFO - Running post-run listener: _resource_record_post_run
11:21:58     INFO - Total resource usage - Wall time: 256s; CPU: 83.0%; Read bytes: 0; Write bytes: 475078656; Read time: 0; Write time: 64468
11:21:58     INFO - TinderboxPrint: CPU usage<br/>83.1%
11:21:58     INFO - TinderboxPrint: I/O read bytes / time<br/>0 / 0
11:21:58     INFO - TinderboxPrint: I/O write bytes / time<br/>475,078,656 / 64,468
11:21:58     INFO - TinderboxPrint: CPU idle<br/>319.3 (16.1%)
11:21:58     INFO - TinderboxPrint: CPU system<br/>1,201.6 (60.6%)
11:21:58     INFO - TinderboxPrint: CPU user<br/>446.9 (22.6%)
11:21:58     INFO - TinderboxPrint: Swap in / out<br/>0 / 0
11:21:58     INFO - install - Wall time: 19s; CPU: 84.0%; Read bytes: 0; Write bytes: 11984896; Read time: 0; Write time: 4088
11:21:58     INFO - setup-mitmproxy - Wall time: 0s; CPU: Can't collect data; Read bytes: 0; Write bytes: 0; Read time: 0; Write time: 0
11:21:58     INFO - run-tests - Wall time: 238s; CPU: 83.0%; Read bytes: 0; Write bytes: 463093760; Read time: 0; Write time: 60380
11:21:58     INFO - Running post-run listener: _upload_blobber_files
11:21:58  WARNING - Blob upload gear skipped. Missing cmdline options.
11:21:58     INFO - Running post-run listener: copy_logs_to_upload_dir
11:21:58     INFO - Copying logs to upload dir...
11:21:58     INFO - mkdir: /home/cltbld/workspace/build/upload/logs
11:21:58     INFO - Copying logs to upload dir...
11:21:58  WARNING - returning nonzero exit status 2
Component: Talos → Graphics: WebRender
Product: Testing → Core
Summary: Intermittent dromaeo_css | Could not find report in browser output: [('tsformat', ('__start_report', '__end_report')), ('tpformat', ('__start_tp_report', '__end_tp_report'))] → Intermittent webrender::device: Failed to compile shader: brush_image
Version: Version 3 → unspecified
This is weird; the shader randomly fails to compile?
Glenn, do you have any ideas on what could be causing these random shader compilation failures on Linux? They seem to be happening not-infrequently.

From bug 1220362 comment 136:

> Looking at some logs I see a lot of tps failures with this:
> 2:00:27     INFO -  [GFX1-]: Failed to compile shader: cs_clip_image
> 12:00:27     INFO -  ERROR 2018-07-03T19:00:27Z: webrender::device::gl:
> Failed to compile shader: cs_clip_image
> 12:00:27     INFO -  [GFX1-]: wr_renderer_render:
> Shader(Compilation("cs_clip_image", ""))
> 12:00:27     INFO -  [GFX1-]: Failed to compile shader: cs_clip_image
> 12:00:27     INFO -  ERROR 2018-07-03T19:00:27Z: webrender::device::gl:
> Failed to compile shader: cs_clip_image
> 12:00:27     INFO -  [GFX1-]: wr_renderer_render:
> Shader(Compilation("cs_clip_image", ""))
> 12:00:27     INFO -  [GFX1-]: Failed to compile shader: cs_clip_image
> 12:00:27     INFO -  ERROR 2018-07-03T19:00:27Z: webrender::device::gl:
> Failed to compile shader: cs_clip_image
> 12:00:27     INFO -  [GFX1-]: wr_renderer_render:
> Shader(Compilation("cs_clip_image", ""))
> 12:00:28     INFO -  [GFX1-]: Compositors might be mixed (5,2)
> 
> most other tests have similar messages, but this is a consistent one which
> might be easier to pinpoint and fix.
Flags: needinfo?(gwatson)
No, that seems very odd. Does the 'Compositors might be mixed' have any meaning here?

One possibility might be if there is some kind of bug / race where the wrong GL context is being bound by Gecko (or not bound at all) before trying to init WR.

One option could be to add some debug logging that prints the currently bound GL context before the WR init call (getting the current context is platform-specific, I think Gecko probably already has a method to get the current context?).
Flags: needinfo?(gwatson)
The 'Compositors might be mixed' usually means that we ended up falling back to some other compositor. In this context it's probably the case that WebRender failed to start up because of the shader compilation errors, and so we fell back to basic layers or some other compositor.
Here's the relevant output from [1]

07:48:16     INFO -  MakeCurrent on 0x7f0b81433000, thread 139687417071360
07:48:16     INFO -  MakeCurrent on 0x7f0b7e1fb000, thread 139687425464064
07:48:16     INFO -  Created GL 0x7f0b7e1ee010 using gl_context 0x7f0b7e1fb000
07:48:16     INFO -  MakeCurrent on 0x7f0b7c13c000, thread 139687425464064
07:48:16     INFO -  Created GL 0x7f0b7c13f010 using gl_context 0x7f0b7c13c000
07:48:16     INFO -  Compiling shader brush_solid on GL 0x7f0b7c13f010
07:48:16     INFO -  Compiling shader brush_solid on GL 0x7f0b7c13f010
07:48:16     INFO -  Compiling shader debug_font on GL 0x7f0b7c13f010
07:48:16     INFO -  Compiling shader debug_font on GL 0x7f0b7c13f010
07:48:16     INFO -  Compiling shader debug_color on GL 0x7f0b7c13f010
07:48:16     INFO -  Compiling shader debug_color on GL 0x7f0b7c13f010
07:48:16     INFO -  MakeCurrent on 0x7f0b7e1fb000, thread 139687425464064
07:48:21     INFO -  Compiling shader cs_clip_rectangle on GL 0x7f0b7e1ee010
07:48:21     INFO -  [GFX1-]: Failed to compile shader: cs_clip_rectangle

It shows three GL contexts being used (0x7f0b81433000, 0x7f0b7e1fb000, and 0x7f0b7c13c000) in Gecko. The first one is on one thread and the other two are on a different thread (the Renderer thread). Those two get mapped to the rust GL objects 0x7f0b7e1ee010 and 0x7f0b7c13f010 respectively. A bunch of shaders get compiled successfully on the second of those two contexts, then we switch to the first one, and attempt to compile cs_clip_rectangle on it, and that fails. As far as I can tell the correct GL context is active on the correct thread.

[1] https://treeherder.mozilla.org/logviewer.html#?job_id=188964346&repo=try&lineNumber=912
Hit again https://treeherder.mozilla.org/logviewer.html#?job_id=198432948&repo=mozilla-beta This time for cs_clip_image
Summary: Intermittent webrender::device: Failed to compile shader: brush_image → Intermittent webrender::device: Failed to compile shader:
You need to log in before you can comment on or make changes to this bug.