Timeouts in mozilla::layers::LockD3DTexture<T>
Categories
(Core :: Graphics: WebRender, defect)
People
(Reporter: bobowen, Unassigned)
References
(Blocks 1 open bug)
Details
Crash Data
Creating a meta to track the various attempts to fix this issue.
This is a gfxDevCrash, so it only happens on Nightly.
We also tend to see other timeouts, particularly at RenderDXGITextureHost::LockInternal, in the Graphics Critical Error notes.
These could be directly related or possibly the same root cause.
They don't appear to be down to device resets.
Comment 1•3 years ago (Reporter)
Just adding the crash signature to this tracking bug, so we can see the residual crashes.
Comment 2•1 year ago
The bug is linked to a topcrash signature, which matches the following criterion:
- Top 10 desktop browser crashes on nightly
For more information, please visit auto_nag documentation.
Comment 3•10 months ago
Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.
For more information, please visit BugBot documentation.
Comment 4•9 months ago
Sorry for removing the keyword earlier but there is a recent change in the ranking, so the bug is again linked to a topcrash signature, which matches the following criterion:
- Top 10 desktop browser crashes on nightly
For more information, please visit BugBot documentation.
Comment 5•8 months ago
Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.
For more information, please visit BugBot documentation.
Comment 6•8 months ago
Sorry for removing the keyword earlier but there is a recent change in the ranking, so the bug is again linked to a topcrash signature, which matches the following criterion:
- Top 10 desktop browser crashes on nightly
For more information, please visit BugBot documentation.
Comment 7•7 months ago
Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.
For more information, please visit BugBot documentation.
Comment 8•7 months ago
Sorry for removing the keyword earlier but there is a recent change in the ranking, so the bug is again linked to a topcrash signature, which matches the following criterion:
- Top 10 desktop browser crashes on nightly
For more information, please visit BugBot documentation.
Comment 9•6 months ago
This is spiking a lot on Nightly. Most of the affected users seem to be running Nvidia cards with the 31.0.101.x series drivers.
Comment 10•6 months ago
Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.
For more information, please visit BugBot documentation.
Comment 11•6 months ago
90% of the crashes happen on Intel integrated graphics: Tiger Lake, Alder Lake, Raptor Lake, or Rocket Lake. So all pretty recent Intel chips.
About 70% of the crashes are on two CPU models: family 6 model 154 stepping 3 and family 6 model 140 stepping 1, which are Alder Lake and Tiger Lake respectively. The driver versions seem pretty recent.
The common pattern is that we hit this intentional crash while trying to acquire the texture mutex lock, because the acquire times out.
And this is almost always preceded by two gfx critical notes:
RenderDXGITextureHost AcquireSync timeout, hr=0x80070057 (E_INVALIDARG)
RenderDXGITextureHost AcquireSync timeout, hr=0x887a0001 (DXGI_ERROR_INVALID_CALL)
which happens here
I'm not sure how it could be an invalid argument. AcquireSync takes a key and a timeout in milliseconds. The timeout has to be valid or else this code would never work (it's always 10000). The key is always 0. I'm also not sure how DXGI_ERROR_INVALID_CALL could come about; the MSDN page for AcquireSync doesn't say. It only says this function can return S_OK, E_FAIL, WAIT_ABANDONED, or WAIT_TIMEOUT.
50% of the crashes have dual GPUs. Maybe this problem is more likely to happen when textures are shared between GPUs?
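To make the return codes above concrete, here is a minimal sketch of how an AcquireSync result could be classified. It is not the Gecko code: `HResult`, `Succeeded`, `AcquireOutcome`, `ClassifyAcquireSync` and `CheckReportedCodes` are hypothetical names, and the constants are copied in by value so the sketch builds without the Windows SDK. One thing it shows is that WAIT_TIMEOUT and WAIT_ABANDONED are success-class values, so a caller has to compare against S_OK rather than rely on a SUCCEEDED()-style check. The two values from the critical notes are both failure codes, and neither is WAIT_TIMEOUT, even though the notes label them as a timeout.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the Windows HRESULT type, so the sketch builds
// without the Windows SDK.
using HResult = std::uint32_t;

// Numeric values as defined in the Windows SDK headers.
constexpr HResult kS_OK = 0x00000000u;
constexpr HResult kWAIT_ABANDONED = 0x00000080u;
constexpr HResult kWAIT_TIMEOUT = 0x00000102u;
constexpr HResult kE_FAIL = 0x80004005u;
constexpr HResult kE_INVALIDARG = 0x80070057u;
constexpr HResult kDXGI_ERROR_INVALID_CALL = 0x887A0001u;

// Same test as the SDK's SUCCEEDED() macro: the severity (top) bit is clear.
constexpr bool Succeeded(HResult hr) { return (hr & 0x80000000u) == 0; }

// Hypothetical classification of an AcquireSync return value.
enum class AcquireOutcome { Acquired, Abandoned, TimedOut, Error };

constexpr AcquireOutcome ClassifyAcquireSync(HResult hr) {
  return hr == kS_OK             ? AcquireOutcome::Acquired
         : hr == kWAIT_ABANDONED ? AcquireOutcome::Abandoned
         : hr == kWAIT_TIMEOUT   ? AcquireOutcome::TimedOut
                                 : AcquireOutcome::Error;
}

// Usage sketch: the documented timeout is a "success" HRESULT, while the two
// codes seen in the critical notes are plain failures.
inline void CheckReportedCodes() {
  assert(Succeeded(kWAIT_TIMEOUT));
  assert(ClassifyAcquireSync(kWAIT_TIMEOUT) == AcquireOutcome::TimedOut);
  assert(!Succeeded(kE_INVALIDARG));
  assert(ClassifyAcquireSync(kE_INVALIDARG) == AcquireOutcome::Error);
  assert(ClassifyAcquireSync(kDXGI_ERROR_INVALID_CALL) == AcquireOutcome::Error);
}
```

Under these assumptions, only an exact S_OK means the caller owns the texture; every other value, success-class or not, has to be treated as "not acquired".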
Comments 12 and 13 hidden (off-topic)
Comment 14•5 months ago
I run into this crash somewhat regularly (Windows 10, Tiger Lake integrated graphics). Usually it seems to happen while watching YouTube videos (often at 3-4x via a short extension I made, which just sets .playbackRate, but I don't know if that's a necessary factor or not). Is there anything I can do to help narrow down the cause of this?
I just hit this crash on YouTube when running Nightly:
https://crash-stats.mozilla.org/report/index/75b1915a-bf68-4e1e-910e-ca7720231115
Comment 16•4 months ago
Let's address this as a normal bug, and we can always spin out new bugs for sub-bugs.
Comment 17•4 months ago
Sotaro, could you have a look here when you have time and see if there's a possible solution to this D3D lock?
Comment 18•4 months ago
The problem might be reduced if ID3D11Fence is used instead of a keyed mutex, as in Bug 1863474.
Chromium uses ID3D11Fence where it is available, like the following.
Comment 19•4 months ago
How would ID3D11Fence help?
Comment 20•4 months ago
Each D3D11TextureData::Lock() acquires exclusive access by using IDXGIKeyedMutex::AcquireSync(0, 10000) in LockD3DTexture(). And each D3D11TextureData::Unlock() releases exclusive access by using IDXGIKeyedMutex::ReleaseSync(0) in UnlockD3DTexture().
From crash reports, crashes seemed to be triggered by IDXGIKeyedMutex::AcquireSync() failure in D3D11TextureData::Lock().
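The Lock()/Unlock() pairing described above can be modelled portably. The sketch below is not the real IDXGIKeyedMutex and not the Gecko implementation: `KeyedMutexModel` and `TextureModel` are hypothetical names, built on the C++ standard library, and they only mimic the acquire-with-key-and-timeout / release-with-key behaviour, assuming the mutex starts out released with key 0.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Portable model of keyed-mutex semantics. NOT the real IDXGIKeyedMutex.
class KeyedMutexModel {
 public:
  // Waits up to `timeout` for the mutex to be free and released with `key`.
  // Returns true if ownership was acquired, false if the wait timed out.
  bool AcquireSync(std::uint64_t key, std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(mMutex);
    const bool acquired = mCond.wait_for(
        lock, timeout, [&] { return !mHeld && mKey == key; });
    if (acquired) {
      mHeld = true;
    }
    return acquired;
  }

  // Gives up ownership and publishes the key the next acquirer must ask for.
  void ReleaseSync(std::uint64_t key) {
    {
      std::lock_guard<std::mutex> lock(mMutex);
      assert(mHeld && "ReleaseSync without a matching AcquireSync");
      mHeld = false;
      mKey = key;
    }
    mCond.notify_all();
  }

 private:
  std::mutex mMutex;
  std::condition_variable mCond;
  bool mHeld = false;
  std::uint64_t mKey = 0;  // Assumption: starts released with key 0.
};

// Hypothetical texture wrapper mirroring the Lock()/Unlock() pattern:
// acquire key 0 with a 10000 ms timeout, release key 0.
struct TextureModel {
  KeyedMutexModel mutex;

  bool Lock(std::chrono::milliseconds timeout =
                std::chrono::milliseconds(10000)) {
    return mutex.AcquireSync(0, timeout);
  }

  void Unlock() { mutex.ReleaseSync(0); }
};
```

In this model a second Lock() without an intervening Unlock() simply waits out its timeout and returns false, which is the shape of failure the crash reports point at.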
When using ID3D11Fence, the ID3D11Fence is created from the same ID3D11Device as D3D11TextureData::mTexture, in DeviceManagerDx::GetCanvasDevice(). Each D3D11TextureData::Lock() does not need to wait on the fence, since the device that creates the ID3D11Fence and the device that signals it are the same. Each D3D11TextureData::Unlock() needs to increment the fence value and signal the fence by using ID3D11DeviceContext4::Signal().
By using ID3D11Fence, D3D11TextureData::Lock() does not need to wait in the CanvasTranslator use case.
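The fence-based flow can be sketched in the same portable way. Again this is only a model under stated assumptions: `FenceModel` and `CanvasTextureModel` are hypothetical names, not ID3D11Fence or D3D11TextureData, and the fence is reduced to a monotonically increasing counter that a consumer can wait on.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Portable model of a monotonically increasing fence. NOT the real ID3D11Fence.
class FenceModel {
 public:
  // Marks all work up to `value` as complete and wakes any waiters.
  void Signal(std::uint64_t value) {
    {
      std::lock_guard<std::mutex> lock(mMutex);
      assert(value >= mCompleted && "fence values must not go backwards");
      mCompleted = value;
    }
    mCond.notify_all();
  }

  std::uint64_t GetCompletedValue() const {
    std::lock_guard<std::mutex> lock(mMutex);
    return mCompleted;
  }

  // Consumer side: waits until the fence reaches `value` or the timeout hits.
  bool Wait(std::uint64_t value, std::chrono::milliseconds timeout) const {
    std::unique_lock<std::mutex> lock(mMutex);
    return mCond.wait_for(lock, timeout, [&] { return mCompleted >= value; });
  }

 private:
  mutable std::mutex mMutex;
  mutable std::condition_variable mCond;
  std::uint64_t mCompleted = 0;
};

// Hypothetical producer-side texture. The same "device" creates and signals
// the fence, so Lock() has nothing to wait for.
class CanvasTextureModel {
 public:
  explicit CanvasTextureModel(FenceModel& fence) : mFence(fence) {}

  bool Lock() { return true; }

  // Increments the fence value, signals it, and returns the value a consumer
  // should wait for before reading the texture.
  std::uint64_t Unlock() {
    mFence.Signal(++mFenceValue);
    return mFenceValue;
  }

 private:
  FenceModel& mFence;
  std::uint64_t mFenceValue = 0;
};
```

The design point this illustrates is that the wait moves from the producer's Lock() to whoever consumes the texture, which only has to wait for the fence value returned by the matching Unlock().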
Comment 21•4 months ago
And if we want to implement GPUQueue::copyExternalImageToTexture(), HTMLCanvasElement and OffscreenCanvas could also be used as a source.
The Chromium implementation works like the following. BeginAccessDawn() sends wait fences to Dawn (WebGPU):
GPUQueue::CopyFromCanvasSourceImage()
->WebGPUMailboxTexture::FromStaticBitmapImage()
->WebGPUMailboxTexture::FromCanvasResource()
->WebGPUMailboxTexture::WebGPUMailboxTexture()
->/// ipc
->WebGPUImplementation::AssociateMailbox()
->WebGPUDecoderImpl::HandleAssociateMailboxImmediate()
->WebGPUDecoderImpl::AssociateMailboxDawn()
->DawnImageRepresentation::BeginScopedAccess()
->DawnImageRepresentation::BeginAccess()
->DawnD3DImageRepresentation::BeginAccess()
->D3DImageBacking::BeginAccessDawn()
Comment 22•4 months ago
This is the top crash on Windows Nightlies from December 14, exceeding (slightly) OOM | small.
Comment 23•4 months ago
Looking at the graph, it looks like this has been steadily creeping up from ~10 crashes a day to ~58 crashes a day.
Comment 24•4 months ago
Tracking this against 122 for investigation. :bhood, tagging you as triage owner in case there are any explanations for comment 23.
Comment 25•4 months ago
Glenn, anything we can do to mitigate this?
Comment 26•3 months ago
Seems like the vast majority of these crashes happen on Windows 11. Over the past 6 months, 80.78% of these crashes occurred on Windows 11. I wonder if the trend line follows the creeping up of our Windows 11 user base, though it's hard to see from the existing gfx telemetry we have. As Timothy points out, it could be related to an increase in users of newer hardware or drivers that might track with that stat as well, so it could be a red herring.
Comment 27•3 months ago
I'm not sure what could be causing these, though what Lee has mentioned sounds like a plausible explanation. Perhaps a deeper dive into the telemetry might provide some correlations with hardware / driver version?
Comment 28•3 months ago
(In reply to Lee Salzman [:lsalzman] from comment #26)
> Seems like the vast majority of these crashes happen on Windows 11.
I'm guessing that's mostly a result of this primarily occurring on the last few generations of Intel integrated graphics, rather than being caused by the OS version. (I say this as someone who experiences it on Windows 10.)
Comment 29•3 months ago
The bug is marked as tracked for firefox122 (beta). However, the bug still isn't assigned and has low severity.
:bhood, could you please find an assignee and increase the severity for this tracked bug? If you disagree with the tracking decision, please talk with the release managers.
For more information, please visit BugBot documentation.
Comment 30•3 months ago
It is too soon to make any assumptions because of the nightly issues yesterday, but it is possible our recent improvements to locking/unlocking have reduced the frequency.
Comment 31•3 months ago
I would like to see in particular if bug 1870950 helps, and it hasn't gotten into a nightly yet. If we manage to eliminate the crashes, then bug 1664063, bug 1870932, and bug 1870950 would need to be uplifted. We would need to refactor bug 1870950 as it will conflict.
Comment 32•3 months ago
(FWIW, I think it should ride the trains and not uplift to 122.)
Comment 33•3 months ago
Going off Andrew's recommendation, we'll keep this at an S3 and monitor results. Cc: Donal.
Comment 34•3 months ago
I'll remove the tracking but will leave 122 as affected to keep monitoring.
So far there's been nothing reported since 122 beta, there are still a few from 123 nightly.
Is it possible there's something nightly only that's having an impact here?
Comment 35•3 months ago
(In reply to Donal Meehan [:dmeehan] from comment #34)
> I'll remove the tracking but will leave 122 as affected to keep monitoring.
> So far there's been nothing reported since 122 beta, there are still a few from 123 nightly.
> Is it possible there's something nightly only that's having an impact here?
We only crash on nightly. We don't have any telemetry for beta/release on how frequently we hit this.
Comment 36•3 months ago
(In reply to Lee Salzman [:lsalzman] from comment #26)
> Seems like the vast majority of these crashes happen on Windows 11. Over the past 6 months, 80.78% of these crashes occurred on Windows 11. I wonder if the trend line follows the creeping up of our Windows 11 user base, though it's hard to see from the existing gfx telemetry we have. As Timothy points out, it could be related to an increase in users of newer hardware or drivers that might track with that stat as well, so it could be a red herring.
FWIW I have some of these crashes here on my (relatively) new Dell XPS 15 9520, latest is here. Let me know if I can try or look at something locally.
Comment 37•3 months ago
Bug 1871467 will put this issue to rest. We redesigned things to not need the lock by using RemoteTextureMap, the same pipeline as DrawTargetWebgl.
Comment 38•3 months ago
We can see the crash rate is now zero because it is impossible (for the canvas path).
Comment 39•3 months ago
There are too many patches with too much risk for uplifting to 122.
Comment 40•3 months ago
Bug 1871744 seems to need to revive keyed mutex usage. It seems to affect the crash.
Comment 41•3 months ago
(In reply to Sotaro Ikeda [:sotaro PTO Dec/28 -Jan/4] from comment #40)
> Bug 1871744 seems to need to revive keyed mutex usage. It seems to affect the crash.
I understand it might fix it, but if it does, it makes no sense and needs further investigation, as presumably the locking timeouts are related to this. There is zero contention on the texture, as it is only ever accessed by the CanvasWorker task queue (so only ever one thread at a time, guaranteed).
Comment 42•3 months ago
To be clear, I'm not opposed to landing the fix in the meantime while we try to understand why keyed mutex is necessary.