Closed Bug 1863872 Opened 6 months ago Closed 2 months ago

Categories

(Core :: Graphics: WebGPU, defect, P1)

Tracking

()

VERIFIED FIXED
125 Branch
Tracking Status
firefox-esr115 --- unaffected
firefox119 --- unaffected
firefox120 --- unaffected
firefox121 --- disabled
firefox122 --- disabled
firefox123 --- disabled
firefox124 --- disabled
firefox125 --- disabled
firefox126 --- disabled

People

(Reporter: mayankleoboy1, Assigned: ErichDonGubler)

References

(Blocks 2 open bugs, Regression, )

Details

(Keywords: regression)

Crash Data

Attachments

(2 files, 7 obsolete files)

Crash Signature: @ d3d12::com::ComPtr<T>::as_unknown ] [@ mozilla::webgpu::CommandEncoder::CommandEncoder ]
Summary: [WebGPU] with n-readback dx12 enabled, crash on https://wgpu-game-of-life.fornwall.net/#rule=1&size=2048&seed=0&density=29&gps=8 → [WebGPU] with no-readback dx12 enabled, crash on https://wgpu-game-of-life.fornwall.net/#rule=1&size=2048&seed=0&density=29&gps=8
Keywords: regression
Regressed by: 1856787
Flags: needinfo?(sotaro.ikeda.g)
Summary: [WebGPU] with no-readback dx12 enabled, crash on https://wgpu-game-of-life.fornwall.net/#rule=1&size=2048&seed=0&density=29&gps=8 → [WebGPU] with no-readback dx12 enabled, crash on https://wgpu-game-of-life.fornwall.net/#rule=1&size=2048&seed=0&density=29&gps=30
Crash Signature: @ d3d12::com::ComPtr<T>::as_unknown ] [@ mozilla::webgpu::CommandEncoder::CommandEncoder ] → [@ d3d12::com::ComPtr<T>::as_unknown ] [@ mozilla::webgpu::CommandEncoder::CommandEncoder ]

Set release status flags based on info from the regressing bug 1856787

Assignee: nobody → sotaro.ikeda.g
Flags: needinfo?(sotaro.ikeda.g)

STR3 :
Follow the steps of comment 0 and let the tab crash.
Open the demo again

AR: The whole browser crashes.
https://crash-stats.mozilla.org/report/index/fc092041-d9fe-41e4-a660-cae200231109

Crash Signature: [@ d3d12::com::ComPtr<T>::as_unknown ] [@ mozilla::webgpu::CommandEncoder::CommandEncoder ] → [@ d3d12::com::ComPtr<T>::as_unknown ] [@ mozilla::webgpu::CommandEncoder::CommandEncoder ] [@ wgpu_core::storage::Storage<T>::get_mut<T> ] [@ core::result::unwrap_failed | wgpu_core::command::CommandEncoder<T>::open<T> ]
Blocks: 1859780
Severity: -- → S2
Priority: -- → P2
Blocks: webgpu-apps

A D3D11Device reset happened during the STR, and it caused the crash.

By reverting Bug 1860801, the crash was addressed.

Attachment #9363194 - Attachment is obsolete: true

With the patch, a valid TextureView keeps its Texture alive, and the crash happened when the number of open d3d12::Resources was around 2020.

This suggests there may be a limit on the number of d3d12::Resources of D3D11 textures.

Severity: S2 → S4

I filed a bug upstream in https://github.com/gfx-rs/wgpu/issues/4700.
Edit: Actually there was already an issue filed at: https://github.com/gfx-rs/wgpu/issues/3350

I think that (short of implementing the whole descriptor cache solution that Dawn has to the sampler descriptor limit, which is a fair amount of work) this should be addressed in wgpu by eagerly destroying texture views when a texture is destroyed.

Attachment #9363420 - Attachment is obsolete: true

(In reply to Nicolas Silva [:nical] from comment #9)

I filed a bug upstream in https://github.com/gfx-rs/wgpu/issues/4700.
Edit: Actually there was already an issue filed at: https://github.com/gfx-rs/wgpu/issues/3350

I think that (short of implementing the whole descriptor cache solution that Dawn has to the sampler descriptor limit, which is a fair amount of work) this should be addressed in wgpu by eagerly destroying texture views when a texture is destroyed.

:nical, do you plan to work on this issue?

Flags: needinfo?(nical.bugzilla)

We landed improvements to the memory reclamation of buffers and textures (including the ability to destroy a texture while a texture view is alive). So it would be good to check back and see if the problem is still happening. The linked demo does not start in Firefox for me, it looks like something is failing at initialization, unrelated to the original issue.
Regardless, we still haven't made progress on the 2k sampler limit and don't have a good concrete plan. I'll try to get this prioritized soon.

Flags: needinfo?(nical.bugzilla)

I still see the same crash with pref dom.webgpu.swap-chain.external-texture-dx12 = true.

https://crash-stats.mozilla.org/report/index/55a89806-440c-4baa-8e47-028bb0240117

(In reply to Nicolas Silva [:nical] from comment #11)

We landed improvements to the memory reclamation of buffers and textures (including the ability to destroy a texture while a texture view is alive). So it would be good to check back and see if the problem is still happening.

:nical, which pull request made those improvements?

Flags: needinfo?(nical.bugzilla)

With the latest m-c, the crash happened at texture.resource.clone() in create_texture_view(), because texture.resource was nullptr.

Attachment #9363418 - Attachment is obsolete: true

With Attachment 9373170 [details] [diff], the crash happened when the number of TextureViews reached 2420.

With the patch, the crash did not happen. So TextureViews seem to consume d3d12 resources.

This is also reproducible with the demos at https://usegpu.live/demo/geometry/data . There is a dropdown at the bottom right of the page where you can select other demos that also crash or hang.

(In reply to Sotaro Ikeda [:sotaro] from comment #17)

Created attachment 9373171 [details]
temporal patch - Call TextureView::Cleanup() from Texture::Destroy()

With the patch, the crash did not happen. So TextureViews seem to consume d3d12 resources.

The problem with this patch is that we cannot call TextureView's drop method until the very last texture view reference is destroyed. Otherwise JS code could try to use a dead texture view in some command, and that would be the equivalent of a use-after-free.

The commit that improved texture memory reclamation is https://github.com/gfx-rs/wgpu/commit/4b82121501a61c2c2e11cb472d70ba54af3aa12d which makes it so that if the user of the API calls texture.destroy(), wgpu internally manages to deallocate the texture memory safely even if references to the texture still exist. That would not help, though, if the user is not calling texture.destroy() (memory reclamation would still be at the whims of the garbage collector).

Besides that there are issues with the number of live samplers which is limited to about 2048 (that should actually be affected by the number of bind groups rather than texture views, though, so I'm less sure about how it relates to this bug).

With the latest m-c, the crash happened at texture.resource.clone() in create_texture_view(), because texture.resource was nullptr.

The linked crash reports show the content process crashing (as a result of the GPU process crashing). The specific issue of the content process crashing should be fixed once bug 1873047 lands.
If you have a crash stack that shows some details of what's going on on the GPU process it would be handy.

I'm going to spend some time to better understand this in the coming days.

Flags: needinfo?(nical.bugzilla)

If you have a crash stack that shows some details of what's going on on the GPU process it would be handy.

Following the STR with a debugger attached to the GPU process, I got the following stack.

[Inline Frame] xul.dll!d3d12::com::impl$2::clone(d3d12::com::ComPtr<winapi::um::d3d12::ID3D12Resource> * self) line 69 Rust
xul.dll!wgpu_hal::dx12::device::impl$1::create_texture_view(wgpu_hal::dx12::Device * self, wgpu_hal::dx12::Texture * texture, wgpu_hal::TextureViewDescriptor * desc) line 469 Rust
xul.dll!wgpu_core::device::resource::Device<wgpu_hal::dx12::Api>::create_texture_view<wgpu_hal::dx12::Api>(alloc::sync::Arc<wgpu_core::resource::Texture<wgpu_hal::dx12::Api>> * self, wgpu_core::resource::TextureViewDescriptor * texture) line 1174 Rust
xul.dll!wgpu_core::global::Global<wgpu_bindings::identity::IdentityRecyclerFactory>::texture_create_view<wgpu_bindings::identity::IdentityRecyclerFactory,wgpu_hal::dx12::Api>(wgpu_core::id::Id<wgpu_core::resource::Texture<wgpu_hal::empty::Api>> self, wgpu_core::resource::TextureViewDescriptor * texture_id, wgpu_core::id::Id<wgpu_core::resource::TextureView<wgpu_hal::empty::Api>> desc) line 811 Rust
xul.dll!wgpu_bindings::server::Global::texture_action<wgpu_hal::dx12::Api>(wgpu_core::id::Id<wgpu_core::resource::Texture<wgpu_hal::empty::Api>> self, enum2$<wgpu_bindings::TextureAction> self_id, wgpu_bindings::error::ErrorBuffer action) line 771 Rust
xul.dll!wgpu_bindings::server::wgpu_server_texture_action(wgpu_bindings::server::Global * global, wgpu_core::id::Id<wgpu_core::resource::Texture<wgpu_hal::empty::Api>> self_id, wgpu_bindings::ByteBuf * byte_buf, wgpu_bindings::error::ErrorBuffer error_buf) line 933 Rust
xul.dll!mozilla::webgpu::WebGPUParent::RecvTextureAction(unsigned __int64 aTextureId, unsigned __int64 aDeviceId, const mozilla::ipc::ByteBuf & aByteBuf) line 1297 C++
xul.dll!mozilla::webgpu::PWebGPUParent::OnMessageReceived(const IPC::Message & msg) line 420 C++
xul.dll!mozilla::gfx::PCanvasManagerParent::OnMessageReceived(const IPC::Message & msg) line 279 C++
xul.dll!mozilla::ipc::MessageChannel::DispatchAsyncMessage(mozilla::ipc::ActorLifecycleProxy * aProxy, const IPC::Message & aMsg) line 1813 C++
xul.dll!mozilla::ipc::MessageChannel::DispatchMessage(mozilla::ipc::ActorLifecycleProxy * aProxy, mozilla::UniquePtr<IPC::Message,mozilla::DefaultDelete<IPC::Message>> aMsg) line 1736 C++
xul.dll!mozilla::ipc::MessageChannel::RunMessage(mozilla::ipc::ActorLifecycleProxy * aProxy, mozilla::ipc::MessageChannel::MessageTask & aTask) line 1526 C++
xul.dll!mozilla::ipc::MessageChannel::MessageTask::Run() line 1632 C++
xul.dll!nsThread::ProcessNextEvent(bool aMayWait, bool * aResult) line 1194 C++
xul.dll!NS_ProcessNextEvent(nsIThread * aThread, bool aMayWait) line 480 C++
xul.dll!mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate * aDelegate) line 300 C++
[Inline Frame] xul.dll!MessageLoop::RunInternal() line 370 C++
xul.dll!MessageLoop::RunHandler() line 364 C++
xul.dll!MessageLoop::Run() line 346 C++
xul.dll!nsThread::ThreadFunc(void * aArg) line 372 C++
nss3.dll!_PR_NativeRunThread(void * arg) line 421 C
nss3.dll!pr_root(void * arg) line 140 C
[External Code]
[Inline Frame] mozglue.dll!mozilla::interceptor::FuncHook<mozilla::interceptor::WindowsDllInterceptor<mozilla::interceptor::VMSharingPolicyShared>,void (*)(int, void *, void *)>::operator()(int & aArgs, void * & aArgs, void * & aArgs) line 150 C++
mozglue.dll!patched_BaseThreadInitThunk(int aIsInitialThread, void * aStartAddress, void * aThreadParam) line 561 C++
[External Code]

texture.resource.clone() failed because texture.resource was nullptr.

Flags: needinfo?(nical.bugzilla)

I reproduced the issue. I see this in the log:

D3D12 ERROR: ID3D12CommandQueue::ExecuteCommandLists: Command lists must be successfully closed before execution. [ EXECUTION ERROR #838: EXECUTECOMMANDLISTS_FAILEDCOMMANDLIST]
Exception thrown at 0x00007FFA9C07CF19 in firefox.exe: Microsoft C++ exception: _com_error at memory location 0x0000007CA7FF5FF8.
D3D12: Removing Device.
D3D12 ERROR: ID3D12Device::RemoveDevice: Device removal has been triggered for the following reason (DXGI_ERROR_INVALID_CALL: There is strong evidence that the application has performed an illegal or undefined operation, and such a condition could not be returned to the application cleanly through a return code). [ EXECUTION ERROR #232: DEVICE_REMOVAL_PROCESS_AT_FAULT]

Note: I initially suspected we were not properly calling destroy on the swap chain textures, but I verified that we do and that wgpu-core is internally destroying all of the swap chain textures. So my next guess is that it has something to do with the error causing the device to be removed above. If not, fixing the device removal will at least help with getting a cleaner repro. I'll keep digging

A command list isn't successfully closed: close returns an out-of-memory error. After spending a bit more time in the d3d12 backend, I better understand what's going on. Internally, both the texture and the texture view point to a reference-counted ID3D12Resource, which holds the GPU memory allocation. It has been right under my nose the whole time; the reason I kept missing it is that the problem only occurs when the suballocation feature is disabled, which is the case in Gecko but not by default, so the language server was always pointing me to another implementation that does not have this issue.

The solution is to either get suballocation or track texture views in textures and deallocate them eagerly (like you did, Sotaro, but from inside wgpu-core where it can be done safely).
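
To make the second option concrete, here is a minimal sketch of a texture that tracks its views and releases their hold on the allocation as soon as the texture is destroyed. It is only an illustration under stated assumptions: RawResource, Texture, TextureView and live_resources are hypothetical stand-ins written for this example, not the wgpu-core or wgpu-hal types, and the real change may be structured differently.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex, Weak};

/// Number of simulated GPU allocations currently alive.
static LIVE_RESOURCES: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for a reference-counted ID3D12Resource.
struct RawResource;

impl RawResource {
    fn new() -> Arc<Self> {
        LIVE_RESOURCES.fetch_add(1, Ordering::SeqCst);
        Arc::new(RawResource)
    }
}

impl Drop for RawResource {
    fn drop(&mut self) {
        LIVE_RESOURCES.fetch_sub(1, Ordering::SeqCst);
    }
}

fn live_resources() -> usize {
    LIVE_RESOURCES.load(Ordering::SeqCst)
}

/// A view holds its own strong reference to the allocation until it is taken away.
struct TextureView {
    raw: Mutex<Option<Arc<RawResource>>>,
}

impl TextureView {
    /// Returns None once the parent texture was destroyed, so a stale view
    /// can be rejected instead of touching freed memory.
    fn raw(&self) -> Option<Arc<RawResource>> {
        self.raw.lock().unwrap().clone()
    }
}

struct Texture {
    raw: Mutex<Option<Arc<RawResource>>>,
    /// Weak references, so tracking views does not keep them alive.
    views: Mutex<Vec<Weak<TextureView>>>,
}

impl Texture {
    fn new() -> Self {
        Texture {
            raw: Mutex::new(Some(RawResource::new())),
            views: Mutex::new(Vec::new()),
        }
    }

    /// Fails (returns None) if the texture has already been destroyed.
    fn create_view(&self) -> Option<Arc<TextureView>> {
        let raw = self.raw.lock().unwrap().clone()?;
        let view = Arc::new(TextureView { raw: Mutex::new(Some(raw)) });
        self.views.lock().unwrap().push(Arc::downgrade(&view));
        Some(view)
    }

    /// Eagerly drops the allocation held by the texture and by every live view.
    fn destroy(&self) {
        for weak in self.views.lock().unwrap().drain(..) {
            if let Some(view) = weak.upgrade() {
                view.raw.lock().unwrap().take();
            }
        }
        self.raw.lock().unwrap().take();
    }
}
```

In this sketch the view object itself stays valid after destroy(), which is the property the JS side needs; only the underlying allocation goes away, and later uses of the view see None.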

I'll give it a go.

Assignee: sotaro.ikeda.g → nical.bugzilla
Flags: needinfo?(nical.bugzilla)

The wgpu-core side change for destroying texture views associated with destroyed textures is up for review in https://github.com/gfx-rs/wgpu/pull/5131

While discussing this with Jeff and Jim, we noted that another thing we should do for performance, which would also have an impact here, is to reuse the canvas textures more aggressively and reuse the associated texture views. Jim filed Bug 1876114.

Depends on: 1876389

gfx-rs/wgpu#5131 has been reviewed and merged, and is awaiting re-vendoring of WGPU (see bug 1876389).

Blocks: 1843891

With the latest m-c, the STR with pref dom.webgpu.swap-chain.external-texture-dx12 = true no longer caused the crash, but global.queue_submit() in wgpu_server_queue_submit() did not return.

:nical, do you have any idea about comment 25?

Flags: needinfo?(nical.bugzilla)

The demo does not animate with the latest Nightly (containing bug 1876389) and dx12-no-readback thingy enabled... A lot of other webgpu demos are not working either (https://react-webgpu-samples.vercel.app/ , https://webgpu.github.io/webgpu-samples/samples/resizeCanvas, etc.)

I heard that a recent wgpu update caused several deadlock problems.

Call stack when the deadlock from comment 25 happened:

xul.dll!parking_lot_core::thread_parker::imp::waitaddress::WaitAddress::wait_on_address(core::sync::atomic::AtomicUsize * self, unsigned int key) line 100 Rust
xul.dll!parking_lot_core::thread_parker::imp::waitaddress::WaitAddress::park(core::sync::atomic::AtomicUsize * self) line 53 Rust
xul.dll!parking_lot_core::thread_parker::imp::impl$1::park(parking_lot_core::thread_parker::imp::ThreadParker * self) line 119 Rust
xul.dll!parking_lot_core::parking_lot::park::closure$0(parking_lot_core::parking_lot::park::closure_env$0<parking_lot::raw_rwlock::impl$10::wait_for_readers::closure_env$0,parking_lot::raw_rwlock::impl$10::wait_for_readers::closure_env$1,parking_lot::raw_rwlock::impl$10::wait_for_readers::closure_env$2> thread_data, parking_lot_core::parking_lot::ThreadData *) line 635 Rust
xul.dll!parking_lot_core::parking_lot::with_thread_data(parking_lot_core::parking_lot::park::closure_env$0<parking_lot::raw_rwlock::impl$10::wait_for_readers::closure_env$0,parking_lot::raw_rwlock::impl$10::wait_for_readers::closure_env$1,parking_lot::raw_rwlock::impl$10::wait_for_readers::closure_env$2> f) line 207 Rust
xul.dll!parking_lot_core::parking_lot::park(unsigned __int64 key, parking_lot::raw_rwlock::impl$10::wait_for_readers::closure_env$0 validate, parking_lot::raw_rwlock::impl$10::wait_for_readers::closure_env$1) line 600 Rust
xul.dll!parking_lot::raw_rwlock::RawRwLock::wait_for_readers(enum2$<core::option::Option<std::time::Instant>> self, unsigned __int64 prev_value) line 1013 Rust
xul.dll!parking_lot::raw_rwlock::RawRwLock::lock_exclusive_slow(enum2$<core::option::Option<std::time::Instant>> self) line 645 Rust
xul.dll!parking_lot::raw_rwlock::impl$0::lock_exclusive(parking_lot::raw_rwlock::RawRwLock * self) line 73 Rust
xul.dll!lock_api::rwlock::RwLock<parking_lot::raw_rwlock::RawRwLock,tuple$<>>::write() line 480 Rust
xul.dll!wgpu_core::snatch::SnatchLock::write() line 90 Rust
xul.dll!wgpu_core::resource::impl$19::drop(wgpu_core::resource::DestroyedTexture<wgpu_hal::dx12::Api> * self) line 1045 Rust
xul.dll!core::ptr::drop_in_place<wgpu_core::resource::DestroyedTexture<wgpu_hal::dx12::Api>>(wgpu_core::resource::DestroyedTexture<wgpu_hal::dx12::Api> *) line 497 Rust
xul.dll!alloc::sync::Arc<wgpu_core::resource::DestroyedTexture<wgpu_hal::dx12::Api>>::drop_slow<wgpu_core::resource::DestroyedTexture<wgpu_hal::dx12::Api>>() line 1266 Rust
xul.dll!alloc::sync::impl$27::drop(alloc::sync::Arc<wgpu_core::resource::DestroyedTexture<wgpu_hal::dx12::Api>> * self) line 1897 Rust
xul.dll!core::ptr::drop_in_place(alloc::sync::Arc<wgpu_core::resource::DestroyedTexture<wgpu_hal::dx12::Api>> *) line 497 Rust
xul.dll!core::ptr::drop_in_place(tuple$<wgpu_core::id::Id<enum2$<wgpu_core::id::markers::Texture>>,alloc::sync::Arc<wgpu_core::resource::DestroyedTexture<wgpu_hal::dx12::Api>>> *) line 497 Rust
xul.dll!core::ptr::mut_ptr::impl$0::drop_in_place(tuple$<wgpu_core::id::Id<enum2$<wgpu_core::id::markers::Texture>>,alloc::sync::Arc<wgpu_core::resource::DestroyedTexture<wgpu_hal::dx12::Api>>> * self) line 1431 Rust
xul.dll!hashbrown::raw::Bucket<tuple$<wgpu_core::id::Id<enum2$<wgpu_core::id::markers::Texture>>,alloc::sync::Arc<wgpu_core::resource::DestroyedTexture<wgpu_hal::dx12::Api>>>>::drop() line 581 Rust
xul.dll!hashbrown::raw::RawTable<tuple$<wgpu_core::id::Id<enum2$<wgpu_core::id::markers::Texture>>,alloc::sync::Arc<wgpu_core::resource::DestroyedTexture<wgpu_hal::dx12::Api>>>,alloc::alloc::Global>::drop_elements<tuple$<wgpu_core::id::Id<enum2$<wgpu_core::id::markers::Texture>>,alloc::sync::Arc<wgpu_core::resource::DestroyedTexture<wgpu_hal::dx12::Api>>>,alloc::alloc::Global>() line 1038 Rust
xul.dll!hashbrown::raw::impl$17::drop(hashbrown::raw::RawTable<tuple$<wgpu_core::id::Id<enum2$<wgpu_core::id::markers::Texture>>,alloc::sync::Arc<wgpu_core::resource::DestroyedTexture<wgpu_hal::dx12::Api>>>,alloc::alloc::Global> * self) line 2699 Rust
xul.dll!core::ptr::drop_in_place(hashbrown::raw::RawTable<tuple$<wgpu_core::id::Id<enum2$<wgpu_core::id::markers::Texture>>,alloc::sync::Arc<wgpu_core::resource::DestroyedTexture<wgpu_hal::dx12::Api>>>,alloc::alloc::Global> *) line 497 Rust
xul.dll!core::ptr::drop_in_place(hashbrown::map::HashMap<wgpu_core::id::Id<enum2$<wgpu_core::id::markers::Texture>>,alloc::sync::Arc<wgpu_core::resource::DestroyedTexture<wgpu_hal::dx12::Api>>,core::hash::BuildHasherDefault<rustc_hash::FxHasher>,alloc::alloc::Global> *) line 497 Rust
xul.dll!core::ptr::drop_in_place(std::collections::hash::map::HashMap<wgpu_core::id::Id<enum2$<wgpu_core::id::markers::Texture>>,alloc::sync::Arc<wgpu_core::resource::DestroyedTexture<wgpu_hal::dx12::Api>>,core::hash::BuildHasherDefault<rustc_hash::FxHasher>> *) line 497 Rust
xul.dll!core::ptr::drop_in_place<wgpu_core::device::life::ResourceMaps<wgpu_hal::dx12::Api>>(wgpu_core::device::life::ResourceMaps<wgpu_hal::dx12::Api> *) line 497 Rust
xul.dll!wgpu_core::device::life::LifetimeTracker<wgpu_hal::dx12::Api>::triage_submissions<wgpu_hal::dx12::Api>(unsigned __int64 self, wgpu_core::device::CommandAllocator<wgpu_hal::dx12::Api> * last_done) line 368 Rust
xul.dll!wgpu_core::device::resource::Device<wgpu_hal::dx12::Api>::maintain<wgpu_hal::dx12::Api>(wgpu_hal::dx12::Fence * self, enum2$<wgpu_types::Maintain<wgpu_core::device::queue::WrappedSubmissionIndex>> fence) line 357 Rust
xul.dll!wgpu_core::global::Global::queue_submit<wgpu_hal::dx12::Api>(wgpu_core::id::Id<enum2$<wgpu_core::id::markers::Queue>> self, ref$<slice2$<wgpu_core::id::Id<enum2$<wgpu_core::id::markers::CommandBuffer>>>> queue_id) line 1546 Rust
xul.dll!wgpu_bindings::server::wgpu_server_queue_submit(wgpu_bindings::server::Global * global, wgpu_core::id::Id<enum2$<wgpu_core::id::markers::Queue>> self_id, wgpu_core::id::Id<enum2$<wgpu_core::id::markers::CommandBuffer>> * command_buffer_ids, unsigned __int64 command_buffer_id_length, wgpu_bindings::error::ErrorBuffer error_buf) line 1099 Rust
xul.dll!mozilla::webgpu::WebGPUParent::RecvQueueSubmit(unsigned __int64 aQueueId, unsigned __int64 aDeviceId, const nsTArray<unsigned long long> & aCommandBuffers, const nsTArray<unsigned long long> & aTextureIds) line 760

It seems like SnatchLock::write is hanging, probably because an older frame is holding a read lock.
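
The suspected mechanism can be shown in isolation with a small sketch. It uses std::sync::RwLock as a stand-in (the stack above goes through parking_lot, and the real SnatchLock is not reproduced here), and try_write instead of write so that the example reports the conflict rather than hanging. The function name write_blocked_by_reader is made up for this illustration.

```rust
use std::sync::RwLock;

/// Returns (write refused while a read guard is alive, write granted after it is dropped).
fn write_blocked_by_reader() -> (bool, bool) {
    let lock = RwLock::new(());

    // Something, for example an older frame, still holds a read guard.
    let reader = lock.read().unwrap();

    // A blocking `write()` would park here until the reader goes away;
    // `try_write()` fails immediately instead.
    let refused = lock.try_write().is_err();

    // Once the read guard is released, exclusive access can be taken.
    drop(reader);
    let granted = lock.try_write().is_ok();

    (refused, granted)
}
```

If the read guard is never released, or is held further up the same call stack that later asks for the write lock, the blocking write never returns, which would match queue_submit() not returning.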

Flags: needinfo?(nical.bugzilla)

I still get the crash with the latest Nightly; it just takes longer. I opened the testcase and let it run for 60-120 seconds. The browser crashed. https://crash-stats.mozilla.org/report/index/265b8bd0-4ba3-4c56-bedb-43a840240210
(ni? :nical so they see my comment)

Flags: needinfo?(nical.bugzilla)

With a debugger attached, the log output contained the following.

D3D11: Removing Device.
Exception thrown at 0x00007FFE1EE94D8C (KernelBase.dll) in firefox.exe: WinRT originate error - 0x887A0005 : 'The GPU device instance has been suspended. Use GetDeviceRemovedReason to determine the appropriate action.'

Attached patch patch - Add log (obsolete) — Splinter Review
Attachment #9373170 - Attachment is obsolete: true

With the patch in Attachment 9379901 [details] [diff], the device reset happened when the dx12 TextureView count reached 1753.

Erich will pick this up going forward.

Flags: needinfo?(nical.bugzilla) → needinfo?(egubler)
Depends on: 1879989

I still get a crash on the testcase (open the testcase and maximize the "initial density" slider) : https://crash-stats.mozilla.org/report/index/73a2c52f-5029-4ed9-b1b3-a88be0240229

I also get a crash. With the log patch in Attachment 9388300 [details] [diff], the device reset happened when the dx12 TextureView count reached 1582.

Attached patch patch - Add log (obsolete) — Splinter Review
Attachment #9379901 - Attachment is obsolete: true
Assignee: nical.bugzilla → egubler
Status: NEW → ASSIGNED
Flags: needinfo?(egubler)
Priority: P2 → P1

Passing investigation on to :sotaro.

Assignee: egubler → sotaro.ikeda.g
Depends on: 1881518
Attached patch patch - Add logSplinter Review

With the patch on the latest m-c, the device reset happened when the count of dx12 TextureViews reached 1978. This suggests that the large number of dx12 TextureViews triggered the device reset.

The TextureView count became large because the dx12 TextureViews stayed alive until texture_view_drop() was called.

texture_view_drop() is triggered from TextureView::Cleanup(), which is not called often because it is triggered by cycle collection.

As a result, the count of dx12 TextureViews kept growing.
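
As a rough illustration of that reasoning, the sketch below models one new view per frame, with views only released when a periodic collection runs. The function peak_live_views and its parameters are hypothetical and the numbers passed to it are arbitrary; it is not a measurement of Firefox behaviour.

```rust
/// Peak number of simultaneously live views when one view is created per frame
/// and all views are released every `collect_every` frames.
/// `collect_every == 0` means the collector never runs.
fn peak_live_views(frames: usize, collect_every: usize) -> usize {
    let mut live = 0usize;
    let mut peak = 0usize;
    for frame in 1..=frames {
        // Each presented frame creates a fresh view of the canvas texture.
        live += 1;
        peak = peak.max(live);
        // Views are only dropped when the (infrequent) collection happens.
        if collect_every != 0 && frame % collect_every == 0 {
            live = 0;
        }
    }
    peak
}
```

With frequent collection, for example peak_live_views(3000, 1), the peak stays at 1. With a collection that runs only every 2500 frames the peak reaches 2500, which is beyond the roughly 1500 to 2400 live views at which the resets and crashes were logged in this bug.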

Attachment #9388300 - Attachment is obsolete: true
Attachment #9363420 - Attachment is obsolete: false
Attachment #9363420 - Attachment description: Bug 1863872 - Call TextureView::Cleanup() in Texture::ForceDestroy() → Bug 1863872 - Drop TextureView in Texture::Destroy()

With D193511, the problem did not happen for me. But it is not the correct fix for wgpu.

:nical wants a correct fix in wgpu. The following is a comment from :nical.

We can't drop a texture view while the JS object still exists, so we can't take this patch in its current state. That said wgpu internally does the same thing: textures have a list of texture views and when destroy is called on a texture, the internal resource of its views are internally removed.
ErichDonGubler: it would be good to double check that this system is working as expected and more generally instrument the number and size of all hal resources over time. Maybe we are incorrectly tracking the views of a texture or maybe the number of texture views is just a correlation.

Re-assign to :ErichDonGubler.

Assignee: sotaro.ikeda.g → egubler
Attachment #9390097 - Attachment is patch: true
Attachment #9390097 - Attachment mime type: application/octet-stream → text/plain

For posterity, the link in the description for this report has expired/disappeared. Here is a functioning link to the WebGPU demo: https://usegpu.live/demo/geometry/data

After pairing with :jimb and :nical, I believe we have a fix for the device getting reset by having too many textures: wgpu#5378. I've included more narrative about the fix there.

:sotaro, can you confirm that this resolves the crash for you? You should be able to re-vendor WGPU in your local checkout with mach vendor --ignore-modified --force gfx/wgpu_bindings/moz.yaml --revision 27991d1b272b3d367d446daece6fde58d3cdfb5d.

Assuming that the fix works, and we get it landed promptly, this fix should arrive with the next iteration of webgpu-update-wgpu (which I'm responsible for this week).

Flags: needinfo?(sotaro.ikeda.g)

:ErichDonGubler, thank you! I confirmed that the problem is addressed for me!!!

Flags: needinfo?(sotaro.ikeda.g)
Attachment #9363420 - Attachment is obsolete: true
Attachment #9373171 - Attachment is obsolete: true

wgpu#5378 has merged upstream. Now awaiting the next iteration of webgpu-update-wgpu.

Depends on: webgpu-update-wgpu
No longer depends on: webgpu-update-wgpu
Depends on: 1884946

Testing autoland builds from bug 1884946 suggests that this bug is fixed.

This is fixed on the latest Nightly.

Status: ASSIGNED → RESOLVED
Closed: 2 months ago
Resolution: --- → FIXED
Target Milestone: --- → 125 Branch
Flags: qe-verify+
Attached image Fx125.0b5.png

I've replicated this issue using Nightly 121.0a1 (2023-11-08) on Windows 10 x64 following the STR from Comment 0, while pref dom.webgpu.swap-chain.external-texture-dx12=true.
However, I'm unable to verify this in Firefox 125.0b5, as the provided test case does not work as expected (warning message "WebGPU is not working in our browser"). Please refer to the attached screenshot for details.
I can confirm that the issue no longer occurs in the latest Nightly 126.0a1 version.

Flags: needinfo?(egubler)

Ina: I presume you're using a beta build, rather than a Nightly version. That's expected behavior in beta and stable; we don't expose the navigator.gpu variable, because we don't want to expose WebGPU anywhere other than Nightly yet. We're using webgpu-v1 to track WebGPU's readiness for this.

Flags: needinfo?(egubler)

Thank you for the clarification.
Marking this as "Verified Fixed" as the issue is no longer present in the latest Nightly 126.0a1 version.

Status: RESOLVED → VERIFIED
Flags: qe-verify+