Closed Bug 1562462 Opened 5 months ago Closed 4 months ago

Screen glitches - black squares (Win10/Intel HD Graphics 4600) (picture-caching)

Categories

(Core :: Graphics: WebRender, defect, P2)

x86_64
Windows 10
defect

Tracking

()

VERIFIED FIXED
mozilla70
Tracking Status
firefox-esr60 --- unaffected
firefox-esr68 --- unaffected
firefox68 --- unaffected
firefox69 + verified
firefox70 + verified

People

(Reporter: apavel, Assigned: gw)

References

(Blocks 1 open bug)

Details

(Keywords: correctness, regression)

Attachments

(10 files)

I use Firefox Nightly 69.0a1 (2019-06-29) (64-bit) and sometimes these glitchy black squares appear on my screen.

I made a recording the first time I noticed it: https://send.firefox.com/download/289c7cbb722314e5/#dz5C3kXEnmXvSP2SdZxUBA (recording expires after 100 downloads).

**I apologize for the audio.

Which platform are you on? Could you attach your about:support?

Black squares sounds like graphics issues.

Also, any chance you could attach the recording to the bug? I haven't figured out how to download it, maybe it expired already somehow?

Flags: needinfo?(apavel)
Attached file raw_data.txt
Attached video browser.mp4

Posted the info you requested above. The raw data is from about:support.
The glitches occur when there are 47 seconds left in the video and again when there are 07 seconds left.

Flags: needinfo?(apavel)

So you have WR enabled, and this looks like some kind of graphics corruption, so I'm going to move it there for now since it seems more likely to be the culprit. Glenn, do you know if you've seen something like this or some recent change that could've introduced this?

Component: Web Painting → Graphics: WebRender
Flags: needinfo?(gwatson)
OS: Unspecified → Windows 10
Hardware: Unspecified → x86_64

I have seen some black screen glitches when running with ANGLE force disabled. I'm investigating those at the moment as part of some research into running WR on top of SwiftShader. However, they look very different to the glitches in the video above, so I think it's unlikely they are the same issue.

I've never seen any glitches similar to the ones in this video.

If I'm reading the support log correctly, this is on an Intel HD4600 GPU - is that right? It looks like the associated driver is dated 10-16-2017 - I wonder if there is an updated driver available, which we could install just to get an idea of whether this is a driver-related bug? It's a (relatively) old GPU, so it's possible there aren't any newer drivers available for it.

Flags: needinfo?(gwatson)

(In reply to Andreea Pavel [:apavel] from comment #3)
At 1m9s, after you've clicked on Save, an X icon disappears in the top-right corner (which looks similar to bug 1558107) at the same time the large corruption on the left appears. The icon didn't disappear the other times you've clicked on Save.

Blocks: wr-intel
See Also: → 1558107
Summary: Screen glitches - black squares → Screen glitches - black squares (Win10/Intel HD Graphics 4600)

Andreea first observed this on June 12th or 13th. The processor is an i5-4590S and the latest driver version, 15.40.42.5063, is from 2019-03-19: https://downloadcenter.intel.com/product/97500/Graphics-for-4th-Generation-Intel-Processors But they are managed by admins, so an update for testing might not be quick.

Talked about this issue with Matt at the start of the work week; he suggested letting Ryan know if it happens again.

Flags: needinfo?(rhunt)

WebRender isn't in my area of expertise so I'm not sure what's going on here. I don't think I'll have time to look at it further any time soon, either.

Flags: needinfo?(rhunt)

I haven't had any luck reproducing this locally. Jeff, would you or someone else in Toronto be able to test on one of the Toronto machines with this hardware configuration?

Flags: needinfo?(jmuizelaar)
Priority: -- → P2

I tried reproducing this on an HD4400 with an older driver. I didn't see the issue. Andreea, can you see if you can reproduce the issue with gfx.webrender.picture-caching set to false?

Flags: needinfo?(apavel)

Hi Jeff. I've set gfx.webrender.picture-caching to false and will see what happens.

I'll post the result at the end of the shift (~11h)

Flags: needinfo?(apavel)

Andreea no longer sees the issue after switching the pref.

Flags: needinfo?(jmuizelaar)
Summary: Screen glitches - black squares (Win10/Intel HD Graphics 4600) → Screen glitches - black squares (Win10/Intel HD Graphics 4600) (picture-caching)

Debian Testing, KDE, X11, Macbook Pro A1502, Intel Iris 6100 (Broadwell GT3)
A few moments ago I saw something similar I haven't seen before in this form. The left part of a website's light-grey background suddenly became transparent and revealed my desktop background. Elements of a fixed navigation (that do not scroll with the page) and the fixed header were still flawlessly painted on top of it. Circumstances: Two open windows, heat and blowing fans. Unfortunately no screenshot.

According to comment 2 Andreea also had two windows open, maybe it was too stressful?

Blocks: wr-intel-mvp
No longer blocks: wr-intel
Assignee: nobody → gwatson

Bug 1559688 fixed a horrible graphics corruption (bug 1565297).

See Also: → 1559688
See Also: → 1565809

[Tracking Requested - why for this release]: Display artifacts

Blocks: wr-69
See Also: → 1565891

Timea, can you reproduce this on the 4600?

Flags: needinfo?(timea.zsoldos)

On a local Win10 + Intel HD530 machine, I can see black squares if I set gfx.webrender.force-angle to be false (which runs Gecko through the native GL driver instead of ANGLE/D3D), although I can't reproduce without that setting.

The black squares disappear if I disable picture caching in this configuration.

It might be a red herring (there are also other artifacts visible in this mode), but I will investigate this configuration and see if I can identify the cause, as it may be the same underlying issue.

I made some progress on this today.

I managed to reduce the test case I have down to a single rectangle, followed by a single border.

Drawing the border on the 2nd to last tile results in NaN in the VS outputs gl_Position and some of the interpolators (at least, according to RenderDoc).

From what I can tell, this seems related to the textureSize call in the vertex shader in brush_image.

Specifically, if I replace that code with:

texture_size = textureSize(sColor0, 0);
texture_size = vec2(512.0);

Then the following occurs:

  • If both lines are present, the bug occurs.
  • Commenting out just the first line, bug does not occur.
  • Commenting out just the second line, bug does occur.

Needs more investigation tomorrow to try and narrow this down further, and see if it is indeed related to the same symptoms under ANGLE.

I wrote a patch that removes all textureSize usage, replacing them with uniforms. Unfortunately the black squares on native GL are still appearing with those calls removed.

Trying to reproduce another capture with that patch, to see if renderdoc reports anything else strange.

Small progress on this today - it does seem to be somehow related to brush_image and/or sampling from an array texture. Tomorrow I'll continue investigating, and also try out some different hardware / driver variations to see if I can get a better repro case.

A little bit more progress. I can now reproduce the bug as originally described, when running under ANGLE.

It occurs fairly commonly, but randomly enough to make it difficult to capture. I would estimate I see the glitch for one frame every few minutes of browsing on my local configuration.

I managed to capture a trace file in apitrace when the glitch occurred. The glitch shows up in the thumbnails view for the frame, but doesn't appear when I replay each draw call individually.

I was unable to capture the glitch in PIX, GPA or RenderDoc.

Next step - continue investigating the apitrace capture, and try to work out what looks different in the command stream on the frame the glitch occurs.

Attached image 1_rdc.PNG

I managed to get a capture of the problem in RenderDoc, but I'm struggling to see what's going on. Perhaps someone else can make sense of these attached captures? Context below:

  1. In the attached image, you can see the RenderDoc thumbnails of two consecutive frames. In each of the thumbnails, the glitch is apparent (one picture cache tile is corrupted).

  2. If I open the first capture (frame 669) in RenderDoc, the output at the end of the frame looks correct, it's only the thumbnail that has the glitch. This frame is the one that draws that tile into the picture cache texture array.

  3. If I open the second capture (frame 670) in RenderDoc, the glitch is shown in the final output image.

  4. Looking at the content of the picture cache texture array slice in question, on frame 669 it looks correct. In frame 670, the content of the picture cache tile looks corrupted in the texture array. So it's not the drawing of the tile, it's the actual content of the tile that appears to be wrong.

  5. As far as I can tell, nothing alters that texture slice after it's written to. Pixel history in RenderDoc doesn't seem to reveal anything - it looks like it gets written to in frame 669, and then just read from in frame 670. The corruption in the tile is odd: it looks like something writes to it with some kind of blend, since the AA on the rounded corners of the image appears to be different.

Questions:

  • Could the texture array be getting modified between frames in a way that RenderDoc can't see (or is there some kind of blit / resize of the array that I missed in the RenderDoc trace)?
  • Since we can see the glitch in a RenderDoc capture of the D3D command stream, that seems unlikely to be an ANGLE bug? Although I guess it still could be...
  • Is the way we use render targets / textures causing some kind of race condition / undefined behavior?
  • Does this make sense at all? Are we most likely looking at a driver bug?
Attached image gecko-frame669.rdc
Attached image gecko-frame670.rdc
Flags: needinfo?(nical.bugzilla)
Flags: needinfo?(jmuizelaar)
Flags: needinfo?(dmalyshau)

I had a look at the GPU captures...
TL;DR: I found no evidence that either us or Angle are doing anything wrong. D3D11 command stream is reasonable. Looks like a driver bug so far.

> Could the texture array be getting modified between frames in a way that RenderDoc can't see?

AFAIK, RenderDoc captures all commands from one Present() to another. There should be no gaps.

> (or is there some kind of blit / resize of the array that I missed in the RenderDoc trace)

Both frames have 20 slices, so no resize is taking place.

> Since we can see the glitch in a RenderDoc capture of the D3D command stream, that seems unlikely to be an ANGLE bug? Although I guess it still could be...

Right. We'd see the problem in D3D commands if it was Angle.

> Does this make sense at all? Are we most likely looking at a driver bug?

I confirm your observations to be correct.


Suggestions:

  • investigate the non-Angle issue further, confirm if it's related/unrelated.
  • play with parameters: tile size, picture cache texture format (e.g. https://phabricator.services.mozilla.com/D21965 has it as RGBA8), blit tiles instead of drawing, etc
  • Force Dx11 debug runtime (i.e. using DX control panel) and run with NSDebugView attached to see the DX11 runtime debug messages; try to associate any with the glitch, if caught
Flags: needinfo?(dmalyshau)

https://mozilla.logbot.info/gfx/20190726#c16497308-c16497309

pseudo-free texture memory

For me, problems of bug 1565809 so far only appeared after longer active usage with many tabs.
And in the background, Thunderbird is always running with WebRender enabled, but behaves well and never shows any bugs.

Attached patch 0001-WIP.patch

I tried an experiment today to remove the use of texture arrays for picture caching, wondering if the texture arrays were the cause of this (apparent) driver bug. With the attached patch, picture cache tiles are allocated and stored as blocks inside a normal 2D texture / render target.

Unfortunately, the bug persists even with this patch applied.

This seems to suggest it's a rendering issue with the content. Next steps I am going to try:

  • Investigate z-buffer values and any potential z-accuracy problem.
  • Investigate opaque / alpha pass differences and see if the problem still occurs with z/opaque optimizations disabled.

Well, this is strange. Having pulled the latest code today, I can now no longer reproduce the bug locally.

Is anyone else able to confirm in the next nightly if it's still occurring for them? Or does anyone know of any patches that have landed recently which might be related?

Notes from today:

  • Managed to get a reliable repro again. Seems to sometimes stop happening for short periods of time.

  • Tried to see if the bug occurs under various API scenarios:

      • Force enable WARP - bug does not occur.

      • Force native GL - bug does not occur.

      • Only seems to occur when running ANGLE + D3D11.

  • Seems to depend on a high(ish) frame rate to occur. Possible race condition etc. This might explain why it doesn't occur with WARP enabled.

  • Tried disabling all z-buffer / opaque optimizations. Force everything through the alpha pass. Bug still occurs, although seems less frequent.

  • Managed to create a very simple HTML test case that can reliably reproduce the bug each run (although only on random frames every 20 seconds or so). Manifests as the solid rectangles (the background and div in the test case) failing to draw and/or drawing with invalid geometry. So far unable to capture this test case in RenderDoc - possibly slows the frame rate down enough to prevent the bug occurring.

  • The above things are making me wonder if there is a WR / ANGLE / driver issue with a buffer that gets mapped and discarded / overwritten incorrectly, causing stale data to be read from a vertex texture and/or vertex/index buffer. I've tried a few hacks in WR and ANGLE to experiment with this, haven't found anything yet. It does seem like a plausible explanation for the above results though. More investigation into this tomorrow.

I believe that NI? was for me.
Observed some black squares while testing WR on Beta last week. No reliable repro steps though; it mostly happens when videos or pages load quickly. Sometimes I can reproduce, sometimes it doesn't happen at all.

Could you provide that test case for me too, Glenn?

Meanwhile, I asked Andreea Pavel if she can still see this issue on the latest Nightly with Webrender enabled. Waiting for her results and will update here.

Flags: needinfo?(timea.zsoldos) → needinfo?(gwatson)
Flags: needinfo?(jmuizelaar)
Attached file bug.html

Added a test case that reproduces the bug on the specific hardware. On my setup, it reliably reproduces, but may only happen one frame every minute or so.

Flags: needinfo?(gwatson)

I made some progress on this today.

The bug appears to be related to the way we resize the vertex data textures (primitive headers, render task data etc). As far as I can tell, WR is behaving correctly here. I suspect there is a bug either in ANGLE or the underlying D3D driver that is occurring when a texture is deleted.

I verified that if I remove the calls to delete the vertex data textures when creating a new one, I can no longer reproduce the bug (of course, this leaks textures so isn't a proper solution).

I tried a solution where we use a texture pool, but I think I still saw the bug occur very occasionally, even with a pool size of 32. However, it definitely reduces the frequency of the bug very significantly, so seems to be on the right track.
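To make the pooling idea above concrete, here is a minimal Python sketch. It is not WebRender's code: `TexturePool`, `acquire` and `release` are hypothetical names, and textures are modeled as plain integer ids. The only point is that a released texture is parked in a bounded pool for reuse instead of being deleted right away, so real deletions become rare.

```python
from collections import deque


class TexturePool:
    """Hypothetical pool that recycles texture ids instead of deleting them."""

    def __init__(self, capacity):
        self.capacity = capacity  # max number of idle textures kept alive
        self.idle = deque()       # released textures, oldest first
        self.next_id = 0          # stand-in for a driver-allocated handle
        self.deleted = []         # ids that were actually deleted

    def acquire(self):
        # Reuse the oldest idle texture if there is one; otherwise allocate.
        if self.idle:
            return self.idle.popleft()
        tex = self.next_id
        self.next_id += 1
        return tex

    def release(self, tex):
        # Park the texture for reuse; only delete when the pool overflows.
        self.idle.append(tex)
        if len(self.idle) > self.capacity:
            self.deleted.append(self.idle.popleft())


# Model 100 "resizes": grab a replacement texture, then release the old one.
pool = TexturePool(capacity=32)
current = pool.acquire()
for _ in range(100):
    replacement = pool.acquire()
    pool.release(current)
    current = replacement
```

In this toy model the loop ends up alternating between two texture ids and never deletes anything. A real pool would also have to make sure the GPU has finished with a texture before handing it out again, which this sketch does not model.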

Tomorrow, I will try to narrow this down further and find a reasonable workaround.

Checked the test case on latest Nightly and Beta and I can't reproduce the glitches.

Andreea mentioned she can't reproduce the glitches anymore after enabling WebRender on the latest Nightly. This was after 4h of work with WR enabled. She will ping me in case it happens again.

We both have Intel 4600.

Attached patch 0001-FIX.patch

I'm still able to reproduce the bug on this machine, both with the test case and on real pages.

I'm reasonably convinced it is a driver bug now. D3D debug runtime doesn't detect any issues, even with GPU validation enabled.

ANGLE has a setDataFasterThanImageUpload option in the D3D11 workarounds struct, which changes the texture upload to not use UpdateSubresource. This is defaulted to off for D3D11-class hardware, and on for D3D9-class hardware.

If I switch that workaround on, the bug seems to disappear! I can't say with 100% certainty it fixes it, due to the nature of the bug. However, without this change, I would typically see the bug at least a few times per minute. Whereas, with this fix I browsed for ~30mins without seeing any glitches.

The attached patch sets this workaround for any Intel + Haswell combinations, since it occurs even with the most recent driver update.
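The gating described above could look roughly like the following Python sketch. The real change lives in ANGLE's C++ and is not reproduced here: `should_use_set_data_upload` and `HASWELL_DEVICE_IDS` are hypothetical names, and the device ids in the set are placeholders for illustration, not a verified list of Haswell parts. The only value taken as fact is Intel's PCI vendor id, 0x8086.

```python
INTEL_VENDOR_ID = 0x8086  # Intel's PCI vendor id

# Placeholder ids for illustration only; NOT a verified list of Haswell parts.
HASWELL_DEVICE_IDS = frozenset({0x0001, 0x0002})


def should_use_set_data_upload(vendor_id, device_id, is_d3d9_class):
    """Decide whether to enable the setDataFasterThanImageUpload-style path."""
    # Per the comment above, the option already defaults to on for
    # D3D9-class hardware.
    if is_d3d9_class:
        return True
    # Rule sketched here: also enable it for Intel + Haswell combinations.
    return vendor_id == INTEL_VENDOR_ID and device_id in HASWELL_DEVICE_IDS
```

Keying on vendor and device id rather than driver version matches the observation that the glitch occurs even with the most recent driver update.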

Jeff and Jeff, what are your thoughts on such a fix? Would a workaround like this be accepted upstream? Could we apply it to our local ANGLE, even if just in the interim while we do more research on the problem at a lower priority?

Flags: needinfo?(nical.bugzilla)
Flags: needinfo?(jmuizelaar)
Flags: needinfo?(jgilbert)

(In reply to Timea Babos [on PTO until 19th Aug - ni? Brindusa Tot] from comment #36)

> Checked the test case on latest Nightly and Beta and I can't reproduce the glitches.
>
> Andreea mentioned she can't reproduce the glitches anymore after enabling WebRender on the latest Nightly. This was after 4h of work with WR enabled. She will ping me in case it happens again.
>
> We both have Intel 4600.

Since Timea is on PTO now, I'll write here. It's been ~1h since I got to work and this started occurring again. WebRender is enabled.

Flags: needinfo?(jmuizelaar)
Pushed by jgilbert@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/fe1d262c0542
Revert py-2to3 changes on py3 files from bug 1559975. NPOTB
https://hg.mozilla.org/integration/autoland/rev/d7f116f6262f
ANGLE Cherry-pick: Fix occasional corruption of vertex textures in HD4600 GPUs for WebRender. r=gw
Flags: needinfo?(jgilbert)
See Also: → 1559975
Status: NEW → RESOLVED
Closed: 4 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla70

Jeff, can we get this uplifted to beta?

Flags: needinfo?(jgilbert)

I think in this case it applies directly.

Flags: needinfo?(jgilbert)

Can you fill out the request for beta uplift?

Flags: needinfo?(gwatson)

Comment on attachment 9082502 [details]
Bug 1562462 - ANGLE Cherry-pick: Fix occasional corruption of vertex textures in HD4600 GPUs for WebRender.

Beta/Release Uplift Approval Request

  • User impact if declined: Users with WebRender on Haswell chipsets will see black flickering.
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): It's a very small patch that enables a tested workaround path inside the ANGLE library we use.
  • String changes made/needed:
Flags: needinfo?(gwatson)
Attachment #9082502 - Flags: approval-mozilla-beta?
Attachment #9082501 - Flags: approval-mozilla-beta?

Comment on attachment 9082502 [details]
Bug 1562462 - ANGLE Cherry-pick: Fix occasional corruption of vertex textures in HD4600 GPUs for WebRender.

Low risk patch for a 69 graphics regression, uplift approved for 69 beta 12, thanks.

Attachment #9082502 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #9082501 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
QA Whiteboard: [qa-triaged]

Hi, I tried to reproduce this issue using the test case from Comment 34 on several older Beta and Nightly builds, but without any success. I also tested the latest Nightly and Beta 69.0b12, and the issue does not occur there either.

I tested this on Windows 10 with Intel HD Graphics 4600.

Andreea, can you please take a look at this? You mentioned that you reproduced the issue 8 days ago; can you please recheck (when you come back from PTO)?

Hi everybody, I just got back from PTO. I had no issues this shift and could not reproduce this anymore.

I have tested today on Firefox Quantum 69.0b16 (64-bit) and the issue no longer reproduces.

Hi, based on Comment 50 as well as Comment 51, it seems this issue no longer occurs. I will update the flags for this issue. Thank you, Andreea.

Status: RESOLVED → VERIFIED
QA Whiteboard: [qa-triaged]
Flags: qe-verify+

Recently (for some weeks at least) while browsing websites, I started to notice drawing problems, best described as black rectangular areas [1], in Firefox (v68 for sure). They now seem to be fixed with Firefox v69 (only tested for 30 minutes so far). I just thought I'd better mention that v68 and Intel HD 530 may have been affected as well.

By disabling Hardware acceleration (Options -> Performance -> [ ] Use recommended performance settings -> [ ] Use hardware acceleration when available) or 'about:config layers.acceleration.disabled=true', after restart of browser, no black rectangles can be observed any longer.

System specs: i5-6600 (Gfx Intel HD 530), W10-1903-x64 (w/ latest security patches as of 2019-08), Firefox 64-bit, dual screen (1x 1920x1200, 1x 1600x1200) - tried newest Intel Graphics driver 26.20.100.7000 without success first (upgrading from some older Intel Graphics driver).

[1]

  • a permanent small black rectangle on top left of title bar
  • occasional drawing problems (usually the browser needs to run for a couple of minutes to hours before I notice the problem): black areas on displayed websites (can get as big as the whole window, except the title bar if I remember correctly); minimizing and restoring the browser window immediately restored the displayed website (no black areas any more)

It's quite unfortunate that we switched all the texture uploads to this path. As far as I can see, Angle doesn't do any ring buffering and GPU tracking for the staging texture area, so it just tries to map it every time we update the contents (on that slow path), which means there is a forced stall for GPU. The bug is resolved, but we can't reasonably ship anything that uses this slow path (see bug 1576637).

The "proper" solution here would be to have an entirely new texture uploading path, either by manually ring-buffering the staging textures, or invoking the GPU scatter (like we can do for GPU cache today). But before we go there (and it's arguably a significantly complex affair), it would be good to constrain the problematic domain:

  • is it relevant that the VS stage uses the textures?
  • is the texture format relevant? i.e. does it only happen for RGBA32F, or also for other formats?
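As a rough illustration of the manual ring-buffering option mentioned above, here is a minimal Python sketch. `StagingRing`, `begin_upload`, `retire` and `simulate` are hypothetical names, and GPU completion is modeled by a simple frame counter rather than real fences. It shows the intended property: a staging slot is only reused once the GPU is done with it, so the CPU does not stall on a map unless the next slot is still in flight.

```python
class StagingRing:
    """Hypothetical ring of staging slots, tracked by the frame that last used each."""

    def __init__(self, slot_count):
        # in_flight[i] is the frame that last wrote slot i, or None if free.
        self.in_flight = [None] * slot_count
        self.cursor = 0
        self.stalls = 0  # how many times we would have had to wait for the GPU

    def retire(self, completed_frame):
        # Mark slots free once the GPU has finished the frame that used them.
        for i, frame in enumerate(self.in_flight):
            if frame is not None and frame <= completed_frame:
                self.in_flight[i] = None

    def begin_upload(self, frame):
        # Take the next slot in ring order; count a stall if it is still busy.
        slot = self.cursor
        if self.in_flight[slot] is not None:
            self.stalls += 1  # a real implementation would wait here
        self.in_flight[slot] = frame
        self.cursor = (self.cursor + 1) % len(self.in_flight)
        return slot


def simulate(slot_count, frames, gpu_latency):
    """Upload once per frame; the GPU finishes frame f at frame f + gpu_latency."""
    ring = StagingRing(slot_count)
    for frame in range(frames):
        ring.retire(frame - gpu_latency)
        ring.begin_upload(frame)
    return ring.stalls
```

With a single slot and two frames of GPU latency, `simulate(1, 10, 2)` counts a stall on every frame after the first (9 in total), which corresponds to the forced-stall behaviour described above; with three slots, `simulate(3, 10, 2)` counts none.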

Glenn, do you think we could run a series of experiments to narrow down the issue?

Flags: needinfo?(gwatson)

I'm not sure I understand the question - do you mean experiments with telemetry? Or running tests on various hardware in Toronto? Or something else?

Flags: needinfo?(gwatson)

Glenn, I mean experiments on your machine, given that you were able to consistently reproduce the issue and investigate it.

OK, sure - I don't use that as my main development machine now, so we can freely use it to run whatever experiments and tests we want to.

See Also: → 1578910