Closed Bug 1709600 Opened 3 years ago Closed 3 months ago

Timeouts in mozilla::layers::LockD3DTexture<T>

Categories

(Core :: Graphics: WebRender, defect)

All
Windows
defect

Tracking

RESOLVED FIXED
123 Branch
Tracking Status
firefox122 --- wontfix
firefox123 --- fixed

People

(Reporter: bobowen, Unassigned)

References

(Blocks 1 open bug)

Creating a meta to track the various attempts to fix this issue.
This is a gfxDevCrash so only happens on Nightly.

We tend to see other timeouts particularly at RenderDXGITextureHost::LockInternal, in the Graphics Critical Error.
These could be directly related or possibly the same root cause.

They don't appear to be down to device resets.

Severity: -- → S3
Depends on: 1709603
Summary: Timeouts in mozilla::layers::LockD3DTexture<T> → [meta] Timeouts in mozilla::layers::LockD3DTexture<T>

Just adding the crash signature to this tracking bug, so we can see the residual crashes.

Crash Signature: [@ mozilla::layers::LockD3DTexture<T>]

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 10 desktop browser crashes on nightly

For more information, please visit auto_nag documentation.

Keywords: topcrash

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit BugBot documentation.

Keywords: topcrash

Sorry for removing the keyword earlier but there is a recent change in the ranking, so the bug is again linked to a topcrash signature, which matches the following criterion:

  • Top 10 desktop browser crashes on nightly

For more information, please visit BugBot documentation.

Keywords: topcrash

This is spiking a lot on nightly. Most of the affected users seem to be running Intel cards with the 31.0.101.x series drivers.

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit BugBot documentation.

Keywords: topcrash

90% of the crashes happen on Intel integrated graphics: Tiger Lake, Alder Lake, Raptor Lake, or Rocket Lake. So all pretty recent Intel chips.
About 70% of the crashes are on two CPU models: family 6 model 154 stepping 3 (Alder Lake) and family 6 model 140 stepping 1 (Tiger Lake). The driver versions seem pretty recent.

The common pattern is that we hit this intentional crash

https://searchfox.org/mozilla-central/rev/57f6fbd39c0b5957e11b27b4db58b821d8e1607d/gfx/layers/d3d11/TextureD3D11.cpp#244

when we try to acquire the texture mutex lock and time out.

And this is almost always preceded by two gfx critical notes

RenderDXGITextureHost AcquireSync timeout, hr=0x80070057 (E_INVALIDARG)
RenderDXGITextureHost AcquireSync timeout, hr=0x887a0001 (DXGI_ERROR_INVALID_CALL)

which happens here

https://searchfox.org/mozilla-central/rev/57f6fbd39c0b5957e11b27b4db58b821d8e1607d/gfx/webrender_bindings/RenderD3D11TextureHost.cpp#358

I'm not sure how it could be an invalid argument. AcquireSync takes a key and a timeout in milliseconds. The timeout has to be valid or else this code would never work (it's always 10000), and the key is always 0. I'm also not sure how DXGI_ERROR_INVALID_CALL could come about: the MSDN page for AcquireSync doesn't mention it and only says this function can return S_OK, E_FAIL, WAIT_ABANDONED, or WAIT_TIMEOUT.
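For anyone reading these codes out of crash reports, here is a minimal sketch of how the two HRESULT values quoted above break down. It uses the same shifts and masks as the Windows HRESULT_FACILITY and HRESULT_CODE macros, but is written as portable C++ so it does not need the Windows headers. `HresultParts`, `DecodeHresult`, and `HresultName` are hypothetical helper names, not anything in the Gecko tree.

```cpp
#include <cstdint>
#include <string>

// Fields of a 32-bit HRESULT. The top bit is the severity (1 = failure),
// the facility is extracted with mask 0x1FFF after shifting by 16, and the
// low 16 bits are the facility-specific code.
struct HresultParts {
  bool failed;
  uint32_t facility;
  uint32_t code;
};

// Hypothetical helper: splits an HRESULT into its fields.
constexpr HresultParts DecodeHresult(uint32_t aHr) {
  return HresultParts{(aHr >> 31) != 0, (aHr >> 16) & 0x1FFFu, aHr & 0xFFFFu};
}

// Hypothetical helper: names only the two values seen in the critical notes.
inline std::string HresultName(uint32_t aHr) {
  switch (aHr) {
    case 0x80070057u:
      return "E_INVALIDARG";  // facility 7 (Win32), code 0x57
    case 0x887A0001u:
      return "DXGI_ERROR_INVALID_CALL";  // facility 0x87A (DXGI), code 1
    default:
      return "unknown";
  }
}
```

The first value is a wrapped Win32 error (facility 7, code 0x57 is ERROR_INVALID_PARAMETER), while the second comes from the DXGI facility, so the two notes originate from different layers even though they are logged at the same call site.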

50% of the crashes are on machines with dual GPUs. Maybe this problem is more likely to happen when textures are shared between GPUs?

I run into this crash somewhat regularly (Windows 10, Tiger Lake integrated graphics). Usually it seems to happen while watching YouTube videos (often at 3-4x via a short extension I made - it just sets .playbackRate - but I don't know if that's a necessary factor or not). Is there anything I can do to help narrow down the cause of this?

Let's address this as a normal bug, and we can always spin out new bugs for sub-bugs.

Keywords: meta
Summary: [meta] Timeouts in mozilla::layers::LockD3DTexture<T> → Timeouts in mozilla::layers::LockD3DTexture<T>

Sotaro, could you have a look here when you have time and see if there's a possible solution to this D3D lock?

Flags: needinfo?(sotaro.ikeda.g)

The problem might be reduced if ID3D11Fence is used instead of a keyed mutex, as in Bug 1863474.

Chromium uses ID3D11Fence where it is available.

How would ID3D11Fence help?

Each D3D11TextureData::Lock() acquires exclusive access by using IDXGIKeyedMutex::AcquireSync(0, 10000) in LockD3DTexture(). And each D3D11TextureData::Unlock() releases exclusive access by using IDXGIKeyedMutex::ReleaseSync(0) in UnlockD3DTexture().

From crash reports, crashes seemed to be triggered by IDXGIKeyedMutex::AcquireSync() failure in D3D11TextureData::Lock().
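The Lock()/Unlock() pairing described above can be sketched with a portable model. This is not the DXGI API: `KeyedMutexModel`, `AcquireResult`, `LockTextureModel`, and `UnlockTextureModel` are hypothetical names, and the model is built on std::mutex and std::condition_variable purely to show the acquire-with-timeout shape (key 0, 10000 ms by default) and what a timeout looks like to the caller.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Stand-ins for S_OK and WAIT_TIMEOUT in this model.
enum class AcquireResult { Ok, Timeout };

// Portable stand-in for a keyed mutex: the holder releases with a key, and
// the next acquirer must ask for that same key. The key starts at 0.
class KeyedMutexModel {
 public:
  AcquireResult AcquireSync(uint64_t aKey, std::chrono::milliseconds aTimeout) {
    std::unique_lock<std::mutex> lock(mMutex);
    // Wait until nobody holds the resource and it was released with aKey.
    bool ok = mCond.wait_for(lock, aTimeout,
                             [&] { return !mHeld && mKey == aKey; });
    if (!ok) {
      return AcquireResult::Timeout;
    }
    mHeld = true;
    return AcquireResult::Ok;
  }

  void ReleaseSync(uint64_t aKey) {
    {
      std::lock_guard<std::mutex> lock(mMutex);
      mHeld = false;
      mKey = aKey;
    }
    mCond.notify_all();
  }

 private:
  std::mutex mMutex;
  std::condition_variable mCond;
  uint64_t mKey = 0;
  bool mHeld = false;
};

// Lock/Unlock in the shape described above: key 0, 10 second timeout.
inline bool LockTextureModel(
    KeyedMutexModel& aMutex,
    std::chrono::milliseconds aTimeout = std::chrono::milliseconds(10000)) {
  return aMutex.AcquireSync(0, aTimeout) == AcquireResult::Ok;
}

inline void UnlockTextureModel(KeyedMutexModel& aMutex) {
  aMutex.ReleaseSync(0);
}
```

In this model a second LockTextureModel() call without an intervening UnlockTextureModel() returns false after the timeout, which is the situation the intentional crash reports.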

With ID3D11Fence, the fence is created in DeviceManagerDx::GetCanvasDevice() from the same ID3D11Device as D3D11TextureData::mTexture. D3D11TextureData::Lock() does not need to wait on the fence, since the device that creates the fence and the device that signals it are the same. D3D11TextureData::Unlock() needs to increment the fence value and signal the fence using ID3D11DeviceContext4::Signal().

With ID3D11Fence, D3D11TextureData::Lock() does not need to wait in the CanvasTranslator use case.
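A rough sketch of the fence-based scheme, again as a portable model rather than real D3D11 code: `FenceModel` and `FencedTextureModel` are hypothetical names. One simplification to keep in mind is that a real GPU fence is signaled asynchronously when the GPU reaches the Signal() in its command stream, whereas this model signals immediately.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Portable stand-in for a monotonically increasing fence.
class FenceModel {
 public:
  // Mark all work up to aValue as complete and wake any waiters.
  void Signal(uint64_t aValue) {
    {
      std::lock_guard<std::mutex> lock(mMutex);
      if (aValue > mCompleted) {
        mCompleted = aValue;
      }
    }
    mCond.notify_all();
  }

  uint64_t GetCompletedValue() {
    std::lock_guard<std::mutex> lock(mMutex);
    return mCompleted;
  }

  // Consumer side (e.g. another device): block until aValue is reached.
  bool Wait(uint64_t aValue, std::chrono::milliseconds aTimeout) {
    std::unique_lock<std::mutex> lock(mMutex);
    return mCond.wait_for(lock, aTimeout,
                          [&] { return mCompleted >= aValue; });
  }

 private:
  std::mutex mMutex;
  std::condition_variable mCond;
  uint64_t mCompleted = 0;
};

// Producer-side texture wrapper in the shape described above.
class FencedTextureModel {
 public:
  explicit FencedTextureModel(FenceModel& aFence) : mFence(aFence) {}

  // The producer signals on the same device that owns the fence, so there
  // is nothing to wait for here.
  void Lock() {}

  // Increment the fence value and signal it. Returns the value that a
  // consumer would have to wait for before reading the texture.
  uint64_t Unlock() {
    ++mFenceValue;
    mFence.Signal(mFenceValue);
    return mFenceValue;
  }

 private:
  FenceModel& mFence;
  uint64_t mFenceValue = 0;
};
```

The point of the design is that the producer never blocks; only a consumer that needs the finished contents waits, and it waits for a specific value rather than for exclusive ownership.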

And if we want to implement GPUQueue::copyExternalImageToTexture(), HTMLCanvasElement and OffscreenCanvas could also be used as sources.

The Chromium implementation works as follows. BeginAccessDawn() sends the wait fences to Dawn (WebGPU).

GPUQueue::CopyFromCanvasSourceImage()
->WebGPUMailboxTexture::FromStaticBitmapImage()
->WebGPUMailboxTexture::FromCanvasResource()
->WebGPUMailboxTexture::WebGPUMailboxTexture()
->/// ipc
->WebGPUImplementation::AssociateMailbox()
->WebGPUDecoderImpl::HandleAssociateMailboxImmediate()
->WebGPUDecoderImpl::AssociateMailboxDawn()
->DawnImageRepresentation::BeginScopedAccess()
->DawnImageRepresentation::BeginAccess()
->DawnD3DImageRepresentation::BeginAccess()
->D3DImageBacking::BeginAccessDawn()

This is the top crash on Windows Nightlies from December 14, exceeding (slightly) OOM | small.

Looking at the graph, it looks like this has been steadily creeping up from ~10 crashes a day to ~58 crashes a day.

Tracking this against 122 for investigation. :bhood, tagging you as triage owner in case there are any explanations for comment 23.

Flags: needinfo?(bhood)

Glenn, anything we can do to mitigate this?

Component: Graphics: Layers → Graphics: WebRender
Flags: needinfo?(bhood) → needinfo?(gwatson)
See Also: → 1870393

Seems like the vast majority of these crashes happen on Windows 11. Over the past 6 months, 80.78% of these crashes occurred on Windows 11. I wonder if the trend line follows the growth of our Windows 11 user base, though it's hard to see from the existing gfx telemetry we have. As Timothy points out, it could be related to an increase in users of newer hardware or drivers that might track with that stat as well, so it could be a red herring.

I'm not sure what could be causing these, though what Lee has mentioned sounds like a plausible explanation. Perhaps a deeper dive into the telemetry might provide some correlations with hardware / driver version?

Flags: needinfo?(gwatson)

(In reply to Lee Salzman [:lsalzman] from comment #26)

Seems like the vast majority of these crashes happen on Windows 11.

I'm guessing that's mostly a result of this primarily occurring on the last few generations of Intel integrated graphics, rather than being caused by the OS version. (I say this as someone who experiences it on Windows 10.)

See Also: → 1664063
See Also: → 1870932

The bug is marked as tracked for firefox122 (beta). However, the bug still isn't assigned and has low severity.

:bhood, could you please find an assignee and increase the severity for this tracked bug? If you disagree with the tracking decision, please talk with the release managers.

For more information, please visit BugBot documentation.

Flags: needinfo?(bhood)

It is too soon to make any assumptions because of the nightly issues yesterday, but it is possible our recent improvements to locking/unlocking have reduced the frequency.

I would like to see in particular whether bug 1870950 helps; it hasn't gotten into a nightly yet. If we manage to eliminate the crashes, then bug 1664063, bug 1870932, and bug 1870950 would need to be uplifted. We would need to refactor bug 1870950, as it will conflict.

See Also: → 1870950

(FWIW, I think it should ride the trains and not uplift to 122.)

Going off Andrew's recommendation, we'll keep this at an S3 and monitor results. Cc: Donal.

Flags: needinfo?(bhood) → needinfo?(dmeehan)

I'll remove the tracking but will leave 122 as affected to keep monitoring.

So far there's been nothing reported since 122 beta, there are still a few from 123 nightly.
Is it possible there's something nightly only that's having an impact here?

Flags: needinfo?(dmeehan)

(In reply to Donal Meehan [:dmeehan] from comment #34)

I'll remove the tracking but will leave 122 as affected to keep monitoring.

So far there's been nothing reported since 122 beta, there are still a few from 123 nightly.
Is it possible there's something nightly only that's having an impact here?

We only crash on nightly. We don't have any telemetry for beta/release on how frequently we hit this.

See Also: → 1871078

(In reply to Lee Salzman [:lsalzman] from comment #26)

Seems like the vast majority of these crashes happen on Windows 11. Over the past 6 months, 80.78% of these crashes occurred on Windows 11. I wonder if the trend line follows the growth of our Windows 11 user base, though it's hard to see from the existing gfx telemetry we have. As Timothy points out, it could be related to an increase in users of newer hardware or drivers that might track with that stat as well, so it could be a red herring.

FWIW I have some of these crashes here on my (relatively) new Dell XPS 15 9520, latest is here. Let me know if I can try or look at something locally.

Bug 1871467 will put this issue to rest. We redesigned things to not need the lock by using RemoteTextureMap, the same pipeline as DrawTargetWebgl.

Depends on: 1871467

We can see the crash rate is now zero because it is impossible (for the canvas path).

Status: NEW → RESOLVED
Closed: 3 months ago
Resolution: --- → FIXED

There are too many patches with too much risk for uplifting to 122.

Flags: needinfo?(sotaro.ikeda.g)
Target Milestone: --- → 123 Branch

Bug 1871744 seems to need to revive the keyed mutex usage. It seems to affect this crash.

(In reply to Sotaro Ikeda [:sotaro PTO Dec/28 -Jan/4] from comment #40)

Bug 1871744 seems to need to revive the keyed mutex usage. It seems to affect this crash.

I understand it might fix it, but if it does, it makes no sense and needs further investigation, as presumably the locking timeouts are related to this. There is zero contention on the texture, as it is only ever accessed by the CanvasWorker task queue (so only ever one thread at a time, guaranteed).

To be clear, I'm not opposed to landing the fix in the meantime while we try to understand why keyed mutex is necessary.
