Open Bug 1704526 Opened 4 years ago Updated 7 months ago

Crash when running with WGPU_TRACE environment

Component: Core :: Graphics: WebGPU (defect, P2)
Platform: x86_64, Linux
Reporter: kvark
Assignee: Unassigned
Blocks: 1 open bug
Attachments: 2 files

Attached file trace-log.txt

Having WGPU_TRACE working is essential to debugging issues. It appears to be crashing now, rather mysteriously, on a specific page (attached). Logs are also attached.

Trying to catch it in a debugger attached to either the child or the GPU process doesn't yield anything. The crash seems to originate from the IPC thread, and the first relevant log messages are:

[Parent 664563, IPC I/O Parent] WARNING: Message needs unreceived descriptors channel:7f1e574d8d00 message-type:65531 header()->num_fds:1 num_fds:0 fds_i:0: file /mnt/code/firefox/_webgpu/ipc/chromium/src/chrome/common/ipc_channel_posix.cc:507

###!!! [Child][MessageChannel] Error: (msgtype=0xB40001,name=PWebGPU::Msg_DeviceAction) Channel error: cannot send/recv

[2021-04-10T17:30:56Z INFO wgpu_core::device] Created buffer Valid((2173, 1, Vulkan)) with BufferDescriptor { label: None, size: 576, usage: VERTEX, mapped_at_creation: true }
[Child 664650, IPC I/O Child] WARNING: FileDescriptorSet destroyed with unconsumed descriptors: file /mnt/code/firefox/_webgpu/ipc/chromium/src/chrome/common/file_descriptor_set_posix.cc:19

Attached file testPerfWebGPU.html

Attaching the test case.

I used this function to decode the message type 65531 = SHMEM_CREATED_MESSAGE.

This isn't a crash, it seems. Adding an assertion there gives me a proper call stack:

Assertion failure: false, at /mnt/code/firefox/_webgpu/ipc/chromium/src/chrome/common/ipc_channel_posix.cc:498
#01: IPC::Channel::ChannelImpl::OnFileCanReadWithoutBlocking(int) (/mnt/code/firefox/_webgpu/ipc/chromium/src/chrome/common/ipc_channel_posix.cc:828)
#02: base::MessagePumpLibevent::OnLibeventNotification(int, short, void*) (/mnt/code/firefox/_webgpu/ipc/chromium/src/base/message_pump_libevent.cc:251)
#03: event_process_active_single_queue (/mnt/code/firefox/_webgpu/ipc/chromium/src/third_party/libevent/event.c:1639)
#04: event_base_loop (/mnt/code/firefox/_webgpu/ipc/chromium/src/third_party/libevent/event.c:1961)
#05: base::MessagePumpLibevent::Run(base::MessagePump::Delegate*) (/mnt/code/firefox/_webgpu/ipc/chromium/src/base/message_pump_libevent.cc:0)
#06: MessageLoop::RunInternal() (/mnt/code/firefox/_webgpu/ipc/chromium/src/base/message_loop.cc:0)
#07: MessageLoop::Run() (/mnt/code/firefox/_webgpu/ipc/chromium/src/base/message_loop.cc:311)
#08: base::Thread::ThreadMain() (/mnt/code/firefox/_webgpu/ipc/chromium/src/base/thread.cc:194)
#09: ThreadFunc(void*) (/mnt/code/firefox/_webgpu/ipc/chromium/src/base/platform_thread_posix.cc:41)

Here are a few more details of the story:

  • the application creates 3000 * N resources at once, and many end up with an associated Shmem on our side
  • we aren't freeing them nearly as fast as we are asked to create them
  • when WGPU_TRACE is enabled, we are also writing many files in the GPU/parent process, about 5000 of them. All of them are written with std::fs in Rust, and all of these handles are closed.

Perhaps, we are running into some kind of file descriptor exhaustion?

(In reply to Dzmitry Malyshau [:kvark] from comment #4)

> Here are a few more details of the story:
>
>   • the application creates 3000 * N resources at once, many end up with an associated Shmem on our side
>   • we aren't freeing them nearly as fast as we are asked to create them
>   • when WGPU_TRACE is enabled, we are also writing down many files in the GPU/parent process. ~5000 of them. All of them are written with std::fs in Rust, and all of these handles are closed.
>
> Perhaps, we are running into some kind of file descriptor exhaustion?

That is likely to be the problem. We have a limit of 4096 file descriptors per process, and the last time this was investigated, we found that some Linux distros have a hard limit of 4k so we can't raise it further in that case.

This probably would have been more immediately obvious if not for a lack of checks for fd exhaustion in some relevant places.
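
As a rough illustration of the kind of check that is missing, here is a minimal Python sketch that compares a process's open descriptor count against its RLIMIT_NOFILE soft limit. It is not Firefox code: the helper names `fd_headroom` and `warn_if_near_limit` and the 90% threshold are made up for this example, and it assumes a system that exposes /proc/self/fd (or /dev/fd as a fallback).

```python
import os
import resource


def fd_headroom():
    """Return (open_fds, soft_limit, hard_limit) for the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Each entry in this directory corresponds to one open descriptor.
    # Listing the directory briefly opens a descriptor itself, so the
    # count can be off by one.
    fd_dir = "/proc/self/fd" if os.path.isdir("/proc/self/fd") else "/dev/fd"
    open_fds = len(os.listdir(fd_dir))
    return open_fds, soft, hard


def warn_if_near_limit(threshold=0.9):
    """Print a warning and return True when usage reaches threshold * soft limit.

    The threshold value is an arbitrary choice for this sketch.
    """
    open_fds, soft, _hard = fd_headroom()
    near = soft != resource.RLIM_INFINITY and open_fds >= threshold * soft
    if near:
        print(f"WARNING: {open_fds} of {soft} file descriptors in use")
    return near


if __name__ == "__main__":
    print(fd_headroom())
    warn_if_near_limit()
```

A check along these lines, run right before a descriptor is attached to an outgoing message, would probably have pointed at exhaustion directly instead of at the "unreceived descriptors" warning.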

Shmem will automatically close its file descriptor after it's been mapped in each of the two processes involved, but if a large number are created at once, we could have a large peak number of fds. (Note that Linux also has a limit of 64k virtual memory areas due to annoying historical issues, and bug 1700687 reveals that we're already reaching ⅓ of that limit.)

It's going to be necessary to merge these resources into fewer shared memory segments somehow.
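
One possible shape for that, sketched in Python under stated assumptions: a bump allocator that packs many small buffers into a few large segments and hands out (segment index, offset) pairs. The `SegmentPool` class, the 1 MiB segment size, and the 64-byte alignment are all hypothetical choices for illustration; anonymous `mmap` regions stand in for real shared memory segments, and freeing or reuse of ranges is deliberately left out.

```python
import mmap


class SegmentPool:
    """Packs small allocations into a few large memory segments."""

    def __init__(self, segment_size=1 << 20, alignment=64):
        self.segment_size = segment_size
        self.alignment = alignment
        self.segments = []  # anonymous mmaps standing in for shared segments
        self.cursor = 0     # next free offset in the newest segment

    def alloc(self, size):
        """Reserve `size` bytes and return (segment_index, offset)."""
        # Round the size up to the alignment so every offset stays aligned.
        aligned = -(-size // self.alignment) * self.alignment
        if aligned > self.segment_size:
            raise ValueError("allocation larger than one segment")
        # Start a new segment only when the current one cannot fit the request.
        if not self.segments or self.cursor + aligned > self.segment_size:
            self.segments.append(mmap.mmap(-1, self.segment_size))
            self.cursor = 0
        offset = self.cursor
        self.cursor += aligned
        return len(self.segments) - 1, offset

    def view(self, index, offset, size):
        """Return a writable view of a previously allocated range."""
        return memoryview(self.segments[index])[offset:offset + size]


if __name__ == "__main__":
    pool = SegmentPool()
    # 576 bytes matches the vertex buffer size seen in the attached log.
    handles = [pool.alloc(576) for _ in range(3000)]
    print(len(pool.segments), "segments for", len(handles), "buffers")
```

With these numbers, 3000 buffers of 576 bytes fit in 2 segments, so the peak descriptor cost would scale with the number of segments rather than the number of resources. A real implementation would also need a way to release and recycle ranges.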

I no longer think this should be blocking MVP. It doesn't affect users.

Blocks: webgpu-v1
No longer blocks: webgpu-mvp
Priority: -- → P3
Blocks: webgpu-phase-2
No longer blocks: webgpu-v1
No longer blocks: webgpu-triage
Priority: P3 → P2