Open Bug 1730374 Opened 4 months ago Updated 4 days ago

Intermittent [Tier 2] Last test finished | application crashed [@ PR_NativeRunThread(void*)] | After Last test finished | application terminated with exit code 1

Categories

(Core :: IPC, defect, P3)

defect

Tracking

()

Fission Milestone Future
Tracking Status
firefox-esr91 --- unaffected
firefox93 --- unaffected
firefox94 --- affected
firefox95 --- affected
firefox96 --- affected

People

(Reporter: intermittent-bug-filer, Unassigned)

References

Details

(Keywords: crash, intermittent-failure, leave-open, Whiteboard: [fission:m96?][stockwell disabled][stockwell unknown])

Crash Data

Attachments

(2 files)

Filed by: nbeleuzu [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer?job_id=351255856&repo=autoland
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/Bmk8FOpkS-OnNkO1cEw4sw/runs/0/artifacts/public/logs/live_backing.log


[task 2021-09-11T23:21:31.450Z] 23:21:31     INFO - PROCESS-CRASH | Last test finished | application crashed [@ PR_NativeRunThread(void*)]
[task 2021-09-11T23:21:31.450Z] 23:21:31     INFO - Mozilla crash reason: MOZ_CRASH(Shutdown hanging after all known phases and workers finished.)
[task 2021-09-11T23:21:31.466Z] 23:21:31     INFO - Crash dump filename: C:\Users\task_163140047228689\AppData\Local\Temp\tmpb0mrwjnq.mozrunner\minidumps\d3e6ec8a-8e13-4d83-a902-04c303b32dde.dmp
[task 2021-09-11T23:21:31.466Z] 23:21:31     INFO - Operating system: Windows NT
[task 2021-09-11T23:21:31.466Z] 23:21:31     INFO -                   10.0.19041 
[task 2021-09-11T23:21:31.466Z] 23:21:31     INFO - CPU: amd64
[task 2021-09-11T23:21:31.466Z] 23:21:31     INFO -      family 6 model 85 stepping 7
[task 2021-09-11T23:21:31.466Z] 23:21:31     INFO -      8 CPUs
[task 2021-09-11T23:21:31.466Z] 23:21:31     INFO - 
[task 2021-09-11T23:21:31.466Z] 23:21:31     INFO - GPU: UNKNOWN
[task 2021-09-11T23:21:31.466Z] 23:21:31     INFO - 
[task 2021-09-11T23:21:31.466Z] 23:21:31     INFO - Crash reason:  EXCEPTION_BREAKPOINT
[task 2021-09-11T23:21:31.466Z] 23:21:31     INFO - Crash address: 0xc6d80448
[task 2021-09-11T23:21:31.466Z] 23:21:31     INFO - Process uptime: 847 seconds
[task 2021-09-11T23:21:31.466Z] 23:21:31     INFO - 
[task 2021-09-11T23:21:31.466Z] 23:21:31     INFO - Thread 19 tid 5288 (crashed) - Shutdown Hang Terminator 0  xul.dll!mozilla::`anonymous namespace'::RunWatchdog(void*) [nsTerminator.cpp:96fd1bb41036f14d9b18f9d1e462f8382453c6f2 : 246 + 0x0]
[task 2021-09-11T23:21:31.466Z] 23:21:31     INFO -     rax = 0x00007ffdca5b6001   rdx = 0x0000000000000000
[task 2021-09-11T23:21:31.466Z] 23:21:31     INFO -     rcx = 0x00007ffdf6f19898   rbx = 0x00007ffdf28f90f7
[task 2021-09-11T23:21:31.467Z] 23:21:31     INFO -     rsi = 0x00007ffdfd0bada0   rdi = 0x0000000000000276
[task 2021-09-11T23:21:31.467Z] 23:21:31     INFO -     rbp = 0x00007ffdf2951d08   rsp = 0x000000cdc87ffae0
[task 2021-09-11T23:21:31.467Z] 23:21:31     INFO -      r8 = 0x000000cdc87ffcc0    r9 = 0x00007ffdfee70000
[task 2021-09-11T23:21:31.467Z] 23:21:31     INFO -     r10 = 0x00007ffdfeec2651   r11 = 0x000000cdc87ff4a0
[task 2021-09-11T23:21:31.467Z] 23:21:31     INFO -     r12 = 0x000002b626924938   r13 = 0x000002b626924948
[task 2021-09-11T23:21:31.467Z] 23:21:31     INFO -     r14 = 0x0000000000000000   r15 = 0x00007ffdf28f9118
[task 2021-09-11T23:21:31.467Z] 23:21:31     INFO -     rip = 0x00007ffdc6d80448
[task 2021-09-11T23:21:31.467Z] 23:21:31     INFO -     Found by: given as instruction pointer in context
[task 2021-09-11T23:21:31.467Z] 23:21:31     INFO -  1  nss3.dll!PR_NativeRunThread(void*) [pruthr.c:96fd1bb41036f14d9b18f9d1e462f8382453c6f2 : 399 + 0xe]
[task 2021-09-11T23:21:31.467Z] 23:21:31     INFO -     rbx = 0x00007ffdf28f90f7   rbp = 0x00007ffdf2951d08
[task 2021-09-11T23:21:31.467Z] 23:21:31     INFO -     rsp = 0x000000cdc87ffb30   r12 = 0x000002b626924938
[task 2021-09-11T23:21:31.467Z] 23:21:31     INFO -     r13 = 0x000002b626924948   r14 = 0x0000000000000000
[task 2021-09-11T23:21:31.467Z] 23:21:31     INFO -     r15 = 0x00007ffdf28f9118   rip = 0x00007ffdf27a9462
[task 2021-09-11T23:21:31.467Z] 23:21:31     INFO -     Found by: call frame info
[task 2021-09-11T23:21:31.467Z] 23:21:31     INFO -  2  nss3.dll!pr_root(void*) [w95thred.c:96fd1bb41036f14d9b18f9d1e462f8382453c6f2 : 139 + 0xd]
[task 2021-09-11T23:21:31.467Z] 23:21:31     INFO -     rbx = 0x00007ffdf28f90f7   rbp = 0x00007ffdf2951d08
[task 2021-09-11T23:21:31.467Z] 23:21:31     INFO -     rsp = 0x000000cdc87ffbb0   r12 = 0x000002b626924938
[task 2021-09-11T23:21:31.467Z] 23:21:31     INFO -     r13 = 0x000002b626924948   r14 = 0x0000000000000000
[task 2021-09-11T23:21:31.467Z] 23:21:31     INFO -     r15 = 0x00007ffdf28f9118   rip = 0x00007ffdf2799e41
[task 2021-09-11T23:21:31.467Z] 23:21:31     INFO -     Found by: call frame info
[task 2021-09-11T23:21:31.468Z] 23:21:31     INFO -  3  ucrtbase.dll!RtlpReAllocateHeap + 0x1ce
[task 2021-09-11T23:21:31.468Z] 23:21:31     INFO -     rbx = 0x00007ffdf28f90f7   rbp = 0x00007ffdf2951d08
[task 2021-09-11T23:21:31.468Z] 23:21:31     INFO -     rsp = 0x000000cdc87ffbe0   r12 = 0x000002b626924938
[task 2021-09-11T23:21:31.468Z] 23:21:31     INFO -     r13 = 0x000002b626924948   r14 = 0x0000000000000000
[task 2021-09-11T23:21:31.468Z] 23:21:31     INFO -     r15 = 0x00007ffdf28f9118   rip = 0x00007ffdfca41bb2
[task 2021-09-11T23:21:31.468Z] 23:21:31     INFO -     Found by: call frame info
[task 2021-09-11T23:21:31.468Z] 23:21:31     INFO - 
[task 2021-09-11T23:21:31.468Z] 23:21:31     INFO - Thread 0 tid 6544 - GeckoMain
[task 2021-09-11T23:21:31.468Z] 23:21:31     INFO -  0  ntdll.dll!NtWaitForSingleObject + 0x14
[task 2021-09-11T23:21:31.468Z] 23:21:31     INFO -     rax = 0x0000000000000004   rdx = 0x0000000000000000
[task 2021-09-11T23:21:31.468Z] 23:21:31     INFO -     rcx = 0x0000000000000428   rbx = 0x0000000000000000
[task 2021-09-11T23:21:31.468Z] 23:21:31     INFO -     rsi = 0x0000000000000000   rdi = 0x0000000000000428
[task 2021-09-11T23:21:31.468Z] 23:21:31     INFO -     rbp = 0x000000cdc57ff660   rsp = 0x000000cdc57ff268
[task 2021-09-11T23:21:31.468Z] 23:21:31     INFO -      r8 = 0x0000000000000010    r9 = 0x0000000000000008
[task 2021-09-11T23:21:31.468Z] 23:21:31     INFO -     r10 = 0x0000000000000000   r11 = 0x00007ffdfeff0248
[task 2021-09-11T23:21:31.469Z] 23:21:31     INFO -     r12 = 0xaaaaaaaaaaaaaaaa   r13 = 0x000002b626909490
[task 2021-09-11T23:21:31.469Z] 23:21:31     INFO -     r14 = 0x0000000000000428   r15 = 0x0000000000000000
[task 2021-09-11T23:21:31.469Z] 23:21:31     INFO -     rip = 0x00007ffdfef0cea4
[task 2021-09-11T23:21:31.469Z] 23:21:31     INFO -     Found by: given as instruction pointer in context
[task 2021-09-11T23:21:31.469Z] 23:21:31     INFO -  1  KERNELBASE.dll!RtlReleaseSRWLockShared + 0xae
[task 2021-09-11T23:21:31.469Z] 23:21:31     INFO -     rbx = 0x0000000000000000   rbp = 0x000000cdc57ff660
[task 2021-09-11T23:21:31.469Z] 23:21:31     INFO -     rsp = 0x000000cdc57ff270   r12 = 0xaaaaaaaaaaaaaaaa
[task 2021-09-11T23:21:31.469Z] 23:21:31     INFO -     r13 = 0x000002b626909490   r14 = 0x0000000000000428
[task 2021-09-11T23:21:31.469Z] 23:21:31     INFO -     r15 = 0x0000000000000000   rip = 0x00007ffdfc5c19ce
[task 2021-09-11T23:21:31.469Z] 23:21:31     INFO -     Found by: call frame info
[task 2021-09-11T23:21:31.469Z] 23:21:31     INFO -  2  xul.dll!base::MessagePumpForIO::ScheduleWork() [message_pump_win.cc:96fd1bb41036f14d9b18f9d1e462f8382453c6f2 : 389 + 0xd]
[task 2021-09-11T23:21:31.469Z] 23:21:31     INFO -     rbx = 0x0000000000000000   rbp = 0x000000cdc57ff660
[task 2021-09-11T23:21:31.469Z] 23:21:31     INFO -     rsp = 0x000000cdc57ff2a0   r12 = 0xaaaaaaaaaaaaaaaa
[task 2021-09-11T23:21:31.470Z] 23:21:31     INFO -     r13 = 0x000002b626909490   r14 = 0x0000000000000428
[task 2021-09-11T23:21:31.470Z] 23:21:31     INFO -     r15 = 0x0000000000000000   rip = 0x00007ffdc1497271
[task 2021-09-11T23:21:31.470Z] 23:21:31     INFO -     Found by: call frame info
[task 2021-09-11T23:21:31.470Z] 23:21:31     INFO -  3  xul.dll!MessageLoop::PostTask_Helper(already_AddRefed<nsIRunnable>, int) [message_loop.cc:96fd1bb41036f14d9b18f9d1e462f8382453c6f2 : 422 + 0xd]
[task 2021-09-11T23:21:31.470Z] 23:21:31     INFO -     rbx = 0x0000000000000000   rbp = 0x000000cdc57ff660
[task 2021-09-11T23:21:31.470Z] 23:21:31     INFO -     rsp = 0x000000cdc57ff2f0   r12 = 0xaaaaaaaaaaaaaaaa
[task 2021-09-11T23:21:31.470Z] 23:21:31     INFO -     r13 = 0x000002b626909490   r14 = 0x0000000000000428
[task 2021-09-11T23:21:31.470Z] 23:21:31     INFO -     r15 = 0x0000000000000000   rip = 0x00007ffdc14aaebd
[task 2021-09-11T23:21:31.470Z] 23:21:31     INFO -     Found by: call frame info
[task 2021-09-11T23:21:31.470Z] 23:21:31     INFO -  4  xul.dll!base::Thread::Stop() [thread.cc:96fd1bb41036f14d9b18f9d1e462f8382453c6f2 : 124 + 0x9]
[task 2021-09-11T23:21:31.470Z] 23:21:31     INFO -     rbx = 0x0000000000000000   rbp = 0x000000cdc57ff660
[task 2021-09-11T23:21:31.470Z] 23:21:31     INFO -     rsp = 0x000000cdc57ff390   r12 = 0xaaaaaaaaaaaaaaaa
[task 2021-09-11T23:21:31.470Z] 23:21:31     INFO -     r13 = 0x000002b626909490   r14 = 0x0000000000000428
[task 2021-09-11T23:21:31.471Z] 23:21:31     INFO -     r15 = 0x0000000000000000   rip = 0x00007ffdc14b5350
[task 2021-09-11T23:21:31.471Z] 23:21:31     INFO -     Found by: call frame info
[task 2021-09-11T23:21:31.471Z] 23:21:31     INFO -  5  xul.dll!mozilla::ipc::BrowserProcessSubThread::~BrowserProcessSubThread() [BrowserProcessSubThread.cpp:96fd1bb41036f14d9b18f9d1e462f8382453c6f2 : 40 + 0xf]
[task 2021-09-11T23:21:31.471Z] 23:21:31     INFO -     rbx = 0x0000000000000000   rbp = 0x000000cdc57ff660
[task 2021-09-11T23:21:31.471Z] 23:21:31     INFO -     rsp = 0x000000cdc57ff400   r12 = 0xaaaaaaaaaaaaaaaa
[task 2021-09-11T23:21:31.471Z] 23:21:31     INFO -     r13 = 0x000002b626909490   r14 = 0x0000000000000428
[task 2021-09-11T23:21:31.471Z] 23:21:31     INFO -     r15 = 0x0000000000000000   rip = 0x00007ffdc14fe3eb
[task 2021-09-11T23:21:31.471Z] 23:21:31     INFO -     Found by: call frame info
[task 2021-09-11T23:21:31.471Z] 23:21:31     INFO -  6  xul.dll!mozilla::ShutdownXPCOM(nsIServiceManager*) [XPCOMInit.cpp:96fd1bb41036f14d9b18f9d1e462f8382453c6f2 : 796 + 0x1d]
[task 2021-09-11T23:21:31.471Z] 23:21:31     INFO -     rbx = 0x0000000000000000   rbp = 0x000000cdc57ff660
[task 2021-09-11T23:21:31.471Z] 23:21:31     INFO -     rsp = 0x000000cdc57ff440   r12 = 0xaaaaaaaaaaaaaaaa
[task 2021-09-11T23:21:31.471Z] 23:21:31     INFO -     r13 = 0x000002b626909490   r14 = 0x0000000000000428
[task 2021-09-11T23:21:31.471Z] 23:21:31     INFO -     r15 = 0x0000000000000000   rip = 0x00007ffdc0c6c473
[task 2021-09-11T23:21:31.471Z] 23:21:31     INFO -     Found by: call frame info
[task 2021-09-11T23:21:31.471Z] 23:21:31     INFO -  7  xul.dll!ScopedXPCOMStartup::~ScopedXPCOMStartup() [nsAppRunner.cpp:96fd1bb41036f14d9b18f9d1e462f8382453c6f2 : 1683 + 0x8]
[task 2021-09-11T23:21:31.471Z] 23:21:31     INFO -     rbx = 0x0000000000000000   rbp = 0x000000cdc57ff660
[task 2021-09-11T23:21:31.471Z] 23:21:31     INFO -     rsp = 0x000000cdc57ff530   r12 = 0xaaaaaaaaaaaaaaaa
[task 2021-09-11T23:21:31.472Z] 23:21:31     INFO -     r13 = 0x000002b626909490   r14 = 0x0000000000000428
[task 2021-09-11T23:21:31.472Z] 23:21:31     INFO -     r15 = 0x0000000000000000   rip = 0x00007ffdc6dc23b4
[task 2021-09-11T23:21:31.472Z] 23:21:31     INFO -     Found by: call frame info
[task 2021-09-11T23:21:31.472Z] 23:21:31     INFO -  8  xul.dll!XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&) [nsAppRunner.cpp:96fd1bb41036f14d9b18f9d1e462f8382453c6f2 : 5493 + 0x19]
[task 2021-09-11T23:21:31.472Z] 23:21:31     INFO -     rbx = 0x0000000000000000   rbp = 0x000000cdc57ff660
[task 2021-09-11T23:21:31.472Z] 23:21:31     INFO -     rsp = 0x000000cdc57ff610   r12 = 0xaaaaaaaaaaaaaaaa
[task 2021-09-11T23:21:31.472Z] 23:21:31     INFO -     r13 = 0x000002b626909490   r14 = 0x0000000000000428
[task 2021-09-11T23:21:31.472Z] 23:21:31     INFO -     r15 = 0x0000000000000000   rip = 0x00007ffdc6dd1bf2
[task 2021-09-11T23:21:31.472Z] 23:21:31     INFO -     Found by: call frame info
[task 2021-09-11T23:21:31.472Z] 23:21:31     INFO -  9  xul.dll!XRE_main(int, char**, mozilla::BootstrapConfig const&) [nsAppRunner.cpp:96fd1bb41036f14d9b18f9d1e462f8382453c6f2 : 5535 + 0x10]
[task 2021-09-11T23:21:31.472Z] 23:21:31     INFO -     rbx = 0x0000000000000000   rbp = 0x000000cdc57ff660
[task 2021-09-11T23:21:31.472Z] 23:21:31     INFO -     rsp = 0x000000cdc57ff720   r12 = 0xaaaaaaaaaaaaaaaa
[task 2021-09-11T23:21:31.472Z] 23:21:31     INFO -     r13 = 0x000002b626909490   r14 = 0x0000000000000428
[task 2021-09-11T23:21:31.472Z] 23:21:31     INFO -     r15 = 0x0000000000000000   rip = 0x00007ffdc6dd24f2
[task 2021-09-11T23:21:31.472Z] 23:21:31     INFO -     Found by: call frame info
[task 2021-09-11T23:21:31.472Z] 23:21:31     INFO - 10  firefox.exe!NS_internal_main(int, char**, char**) [nsBrowserApp.cpp:96fd1bb41036f14d9b18f9d1e462f8382453c6f2 : 386 + 0x10f]
[task 2021-09-11T23:21:31.472Z] 23:21:31     INFO -     rbx = 0x0000000000000000   rbp = 0x000000cdc57ff660
[task 2021-09-11T23:21:31.472Z] 23:21:31     INFO -     rsp = 0x000000cdc57ff880   r12 = 0xaaaaaaaaaaaaaaaa
[task 2021-09-11T23:21:31.473Z] 23:21:31     INFO -     r13 = 0x000002b626909490   r14 = 0x0000000000000428
[task 2021-09-11T23:21:31.473Z] 23:21:31     INFO -     r15 = 0x0000000000000000   rip = 0x00007ff64ea3183e
[task 2021-09-11T23:21:31.473Z] 23:21:31     INFO -     Found by: call frame info
[task 2021-09-11T23:21:31.473Z] 23:21:31     INFO - 11  firefox.exe!wmain(int, wchar_t**) [nsWindowsWMain.cpp:96fd1bb41036f14d9b18f9d1e462f8382453c6f2 : 131 + 0x15]
[task 2021-09-11T23:21:31.473Z] 23:21:31     INFO -     rbx = 0x0000000000000000   rbp = 0x000000cdc57ff660
[task 2021-09-11T23:21:31.473Z] 23:21:31     INFO -     rsp = 0x000000cdc57ffa60   r12 = 0xaaaaaaaaaaaaaaaa
[task 2021-09-11T23:21:31.473Z] 23:21:31     INFO -     r13 = 0x000002b626909490   r14 = 0x0000000000000428
[task 2021-09-11T23:21:31.473Z] 23:21:31     INFO -     r15 = 0x0000000000000000   rip = 0x00007ff64ea3121b
[task 2021-09-11T23:21:31.473Z] 23:21:31     INFO -     Found by: call frame info
[task 2021-09-11T23:21:31.473Z] 23:21:31     INFO - 12  firefox.exe!__scrt_common_main_seh() [exe_common.inl : 288 + 0x22]
[task 2021-09-11T23:21:31.473Z] 23:21:31     INFO -     rbx = 0x0000000000000000   rbp = 0x000000cdc57ff660
[task 2021-09-11T23:21:31.473Z] 23:21:31     INFO -     rsp = 0x000000cdc57ffb20   r12 = 0xaaaaaaaaaaaaaaaa
[task 2021-09-11T23:21:31.473Z] 23:21:31     INFO -     r13 = 0x000002b626909490   r14 = 0x0000000000000428
[task 2021-09-11T23:21:31.473Z] 23:21:31     INFO -     r15 = 0x0000000000000000   rip = 0x00007ff64eacfce8
[task 2021-09-11T23:21:31.473Z] 23:21:31     INFO -     Found by: call frame info
[task 2021-09-11T23:21:31.473Z] 23:21:31     INFO - 13  kernel32.dll!LdrpSnapKernelBaseExtensions + 0x150
[task 2021-09-11T23:21:31.473Z] 23:21:31     INFO -     rbx = 0x0000000000000000   rbp = 0x000000cdc57ff660
[task 2021-09-11T23:21:31.473Z] 23:21:31     INFO -     rsp = 0x000000cdc57ffb60   r12 = 0xaaaaaaaaaaaaaaaa
[task 2021-09-11T23:21:31.473Z] 23:21:31     INFO -     r13 = 0x000002b626909490   r14 = 0x0000000000000428
[task 2021-09-11T23:21:31.473Z] 23:21:31     INFO -     r15 = 0x0000000000000000   rip = 0x00007ffdfd0b7034
[task 2021-09-11T23:21:31.473Z] 23:21:31     INFO -     Found by: call frame info

Marking as P3 since it's intermittent but not assigning since it is low frequency and debug only.

Priority: -- → P3

In the past week, we had 63 occurrences for this crash, all on windows10-64-2004-qr debug.
Most recent failure: https://treeherder.mozilla.org/logviewer?job_id=352725786&repo=autoland&lineNumber=28567

[task 2021-09-26T11:44:12.393Z] 11:44:12 INFO - PROCESS-CRASH | Last test finished | application crashed [@ PR_NativeRunThread(void*)]
[task 2021-09-26T11:44:12.393Z] 11:44:12 INFO - Mozilla crash reason: MOZ_CRASH(Shutdown hanging after all known phases and workers finished.)
[task 2021-09-26T11:44:12.393Z] 11:44:12 INFO - Crash dump filename: C:\Users\task_163265260581469\AppData\Local\Temp\tmphubucl6h.mozrunner\minidumps\f8875539-a866-402b-8e7e-c82bc80cfa56.dmp
[task 2021-09-26T11:44:12.394Z] 11:44:12 INFO - Operating system: Windows NT
[task 2021-09-26T11:44:12.394Z] 11:44:12 INFO - 10.0.19041
[task 2021-09-26T11:44:12.394Z] 11:44:12 INFO - CPU: amd64
[task 2021-09-26T11:44:12.394Z] 11:44:12 INFO - family 6 model 85 stepping 7
[task 2021-09-26T11:44:12.394Z] 11:44:12 INFO - 8 CPUs
[task 2021-09-26T11:44:12.394Z] 11:44:12 INFO -
[task 2021-09-26T11:44:12.394Z] 11:44:12 INFO - GPU: UNKNOWN
[task 2021-09-26T11:44:12.394Z] 11:44:12 INFO -
[task 2021-09-26T11:44:12.394Z] 11:44:12 INFO - Crash reason: EXCEPTION_BREAKPOINT
[task 2021-09-26T11:44:12.394Z] 11:44:12 INFO - Crash address: 0x4d7c2b78
[task 2021-09-26T11:44:12.394Z] 11:44:12 INFO - Process uptime: 451 seconds
[task 2021-09-26T11:44:12.394Z] 11:44:12 INFO -
[task 2021-09-26T11:44:12.394Z] 11:44:12 INFO - Thread 20 tid 4956 (crashed) - Shutdown Hang Terminator 0 xul.dll!mozilla::`anonymous namespace'::RunWatchdog(void*) [nsTerminator.cpp:9237deee110beeaa7605195528a273b2bbc77352 : 246 + 0x0]
[task 2021-09-26T11:44:12.394Z] 11:44:12 INFO - rax = 0x00007ff85100d341 rdx = 0x0000000000000000
[task 2021-09-26T11:44:12.394Z] 11:44:12 INFO - rcx = 0x00007ff8784cc948 rbx = 0x00007ff87748a0f7
[task 2021-09-26T11:44:12.394Z] 11:44:12 INFO - rsi = 0x00007ff8803eada0 rdi = 0x0000000000000276
[task 2021-09-26T11:44:12.394Z] 11:44:12 INFO - rbp = 0x00007ff8774e2d08 rsp = 0x00000001847ffd90
[task 2021-09-26T11:44:12.394Z] 11:44:12 INFO - r8 = 0x00000001847fff70 r9 = 0x00007ff881d70000
[task 2021-09-26T11:44:12.394Z] 11:44:12 INFO - r10 = 0x00007ff881dc2651 r11 = 0x00000001847ff750
[task 2021-09-26T11:44:12.394Z] 11:44:12 INFO - r12 = 0x00000196b1d04938 r13 = 0x00000196b1d04948
[task 2021-09-26T11:44:12.395Z] 11:44:12 INFO - r14 = 0x0000000000000000 r15 = 0x00007ff87748a118
[task 2021-09-26T11:44:12.395Z] 11:44:12 INFO - rip = 0x00007ff84d7c2b78
[task 2021-09-26T11:44:12.395Z] 11:44:12 INFO - Found by: given as instruction pointer in context
[task 2021-09-26T11:44:12.395Z] 11:44:12 INFO - 1 nss3.dll!PR_NativeRunThread(void*) [pruthr.c:9237deee110beeaa7605195528a273b2bbc77352 : 399 + 0xe]
[task 2021-09-26T11:44:12.395Z] 11:44:12 INFO - rbx = 0x00007ff87748a0f7 rbp = 0x00007ff8774e2d08
[task 2021-09-26T11:44:12.395Z] 11:44:12 INFO - rsp = 0x00000001847ffde0 r12 = 0x00000196b1d04938
[task 2021-09-26T11:44:12.395Z] 11:44:12 INFO - r13 = 0x00000196b1d04948 r14 = 0x0000000000000000
[task 2021-09-26T11:44:12.395Z] 11:44:12 INFO - r15 = 0x00007ff87748a118 rip = 0x00007ff877339462
[task 2021-09-26T11:44:12.395Z] 11:44:12 INFO - Found by: call frame info
[task 2021-09-26T11:44:12.395Z] 11:44:12 INFO - 2 nss3.dll!pr_root(void*) [w95thred.c:9237deee110beeaa7605195528a273b2bbc77352 : 139 + 0xd]
[task 2021-09-26T11:44:12.395Z] 11:44:12 INFO - rbx = 0x00007ff87748a0f7 rbp = 0x00007ff8774e2d08
[task 2021-09-26T11:44:12.395Z] 11:44:12 INFO - rsp = 0x00000001847ffe60 r12 = 0x00000196b1d04938
[task 2021-09-26T11:44:12.395Z] 11:44:12 INFO - r13 = 0x00000196b1d04948 r14 = 0x0000000000000000
[task 2021-09-26T11:44:12.395Z] 11:44:12 INFO - r15 = 0x00007ff87748a118 rip = 0x00007ff877329e41
[task 2021-09-26T11:44:12.395Z] 11:44:12 INFO - Found by: call frame info
[task 2021-09-26T11:44:12.395Z] 11:44:12 INFO - 3 ucrtbase.dll!RtlpReAllocateHeap + 0x1ce
[task 2021-09-26T11:44:12.395Z] 11:44:12 INFO - rbx = 0x00007ff87748a0f7 rbp = 0x00007ff8774e2d08
[task 2021-09-26T11:44:12.396Z] 11:44:12 INFO - rsp = 0x00000001847ffe90 r12 = 0x00000196b1d04938
[task 2021-09-26T11:44:12.396Z] 11:44:12 INFO - r13 = 0x00000196b1d04948 r14 = 0x0000000000000000
[task 2021-09-26T11:44:12.396Z] 11:44:12 INFO - r15 = 0x00007ff87748a118 rip = 0x00007ff87f6d1bb2
[task 2021-09-26T11:44:12.396Z] 11:44:12 INFO - Found by: call frame info
[task 2021-09-26T11:44:12.396Z] 11:44:12 INFO - 4 kernel32.dll!LdrpSnapKernelBaseExtensions + 0x150
[task 2021-09-26T11:44:12.396Z] 11:44:12 INFO - rbp = 0x00007ff8774e2d08 rsp = 0x00000001847ffec0
[task 2021-09-26T11:44:12.396Z] 11:44:12 INFO - rip = 0x00007ff8803e7034
[task 2021-09-26T11:44:12.396Z] 11:44:12 INFO - Found by: stack scanning
[task 2021-09-26T11:44:12.396Z] 11:44:12 INFO -

Whiteboard: [stockwell needswork]

Pleas re-evaluate, it's failing frequently.

Flags: needinfo?(gijskruitbosch+bugs)

Kris or Jens, can you take a look at what's happening here and/or advise on how we'd get more details? I guess this is from https://searchfox.org/mozilla-central/rev/3fa5cc437a4937c621ea068ba5dc246f75831633/toolkit/components/terminator/nsTerminator.cpp#246 , but I have no idea how to investigate. Naïvely, it looks like we're hanging while writing shutdown leak logs or something? This appears to happen exclusively on Windows debug, but on different tests, primarily browser chrome but also plain ones.

Flags: needinfo?(kwright)
Flags: needinfo?(jstutte)
Flags: needinfo?(gijskruitbosch+bugs)

Clearing priority/severity so this goes back to being flagged in triage.

Severity: S4 → --
Priority: P3 → --

My only suggestion would be to increase toolkit.asyncshutdown.crash_timeout and see if this moves the needle. From this log it seems that we pass ~70 seconds from starting the shutdown to the crash, which is close to the default timeout of 60s + 3s additional time (which cannot be reduced).

Flags: needinfo?(jstutte) → needinfo?(aryx.bugmail)

(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #10)

An increase to 120s didn't improve the situation.

At this failure frequency, is it possible to bisect down what broke things? That might be the quickest way of making progress.

Flags: needinfo?(aryx.bugmail)

This seems to be similar to bug 1684441, note that the curves of the intermittent frequency have a very similar shape. We filed bug 1728721 there that might help also here, in many cases shutdown hangs are caused by spinning a nested event loop that waits for a condition that cannot be satisfied any more, and even slight timing and/or order changes in the event processing elsewhere might move the frequency up or down.

And over in bug 1731564 I started work on having the SpinEventLoopUntil annotations also for C++ calls.

Depends on: 1728721
See Also: → 1684441, 1731564

(In reply to :Gijs (he/him) from comment #11)

At this failure frequency, is it possible to bisect down what broke things? That might be the quickest way of making progress.

Not sure this is a new issue, the signature gets reported for the first time on 2021-07-07 - just a change in reported failure lines?

Flags: needinfo?(aryx.bugmail)

(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #13)

(In reply to :Gijs (he/him) from comment #11)

At this failure frequency, is it possible to bisect down what broke things? That might be the quickest way of making progress.

Not sure this is a new issue, the signature gets reported for the first time on 2021-07-07 - just a change in reported failure lines?

Hm, looks like that may have gotten filed as bug 1719481? I guess I assumed there was a change in frequency, but yes, a change in reported lines might also affect this - or perhaps this broke on tests of one kind first and then on tests of another kind? :-(

It's unfortunate this is Windows-specific, as it means we can't use pernosco to gather information about what's happening at the time of the shutdown hang. bug 1719481 comment 10 suggests that the minidumps might help, and blames bug 1685204 which has since been fixed (though 1719481 itself is still open and recorded failure instances in the past week). I don't know much about Windows minidumps and/or how to extract useful information out of them. I tried for a bit using the guide at https://firefox-source-docs.mozilla.org/contributing/debugging/debugging_a_minidump.html but didn't get very far; adding the moco symbol server and MS symbol server doesn't seem to be enough to load the symbol dump, possibly because the Windows version and my (up-to-date) Win10 install have different versions for ntdll.dll and such. :gsvelto, do you know how to do this and/or do we have better / more up-to-date documentation?

I did then look at the raw log from comment 0 some more, and there is no event loop in any of the stacks, beyond the main event loop, so I don't think this is related to the nested event loop spinning as comment #12 suggested. Jens, do you have any other hypotheses, or ways we could investigate further?

Flags: needinfo?(jstutte)
Flags: needinfo?(gsvelto)
See Also: → 1719481

(In reply to :Gijs (he/him) from comment #14)

It's unfortunate this is Windows-specific, as it means we can't use pernosco to gather information about what's happening at the time of the shutdown hang. bug 1719481 comment 10 suggests that the minidumps might help, and blames bug 1685204 which has since been fixed (though 1719481 itself is still open and recorded failure instances in the past week). I don't know much about Windows minidumps and/or how to extract useful information out of them. I tried for a bit using the guide at https://firefox-source-docs.mozilla.org/contributing/debugging/debugging_a_minidump.html but didn't get very far; adding the moco symbol server and MS symbol server doesn't seem to be enough to load the symbol dump, possibly because the Windows version and my (up-to-date) Win10 install have different versions for ntdll.dll and such. :gsvelto, do you know how to do this and/or do we have better / more up-to-date documentation?

AFAIK you will not be able to find the symbols for xul.dll on the symbol server if this is not an official build. I am not aware of another way to get them for this specific dump.

I did then look at the raw log from comment 0 some more, and there is no event loop in any of the stacks, beyond the main event loop, so I don't think this is related to the nested event loop spinning as comment #12 suggested. Jens, do you have any other hypotheses, or ways we could investigate further?

Sorry for the noise then, I should have examined the stacks. Still it remains curious that on bug 1684441 we see a very similar trend in the frequency.

No longer depends on: 1728721
Flags: needinfo?(jstutte)
See Also: 1684441, 1731564

For Try pushes, use the push's menu in Treeherder (top right) > Add new jobs (Search) > Activate Use full job list > Search for symbol > Pick the upload-symbols task for the desired platform and wait for it to complete.

Once the symbols are uploaded you can simply run~/.mozbuild/minidump_stackwalkminidump_stackwalk <path_to_the_minidump.dmp>. It will fetch symbols automatically and will get you a stack trace much faster than Visual Studio. I recommend using Visual Studio only if you need to inspect local variables and arguments, or poke other data on the stack.

Flags: needinfo?(gsvelto)

(In reply to Jens Stutte [:jstutte] from comment #15)

Sorry for the noise then, I should have examined the stacks. Still it remains curious that on bug 1684441 we see a very similar trend in the frequency.

I think the peaks are from Sundays, when there are a much lower number of pushes altogether, and so frequency calculations spike if the number of mozilla-central pushes compared to autoland pushes fluctuates significantly.

(In reply to Gabriele Svelto [:gsvelto] from comment #17)

Once the symbols are uploaded you can simply run~/.mozbuild/minidump_stackwalkminidump_stackwalk <path_to_the_minidump.dmp>. It will fetch symbols automatically and will get you a stack trace much faster than Visual Studio. I recommend using Visual Studio only if you need to inspect local variables and arguments, or poke other data on the stack.

Per discussion with Gabriele, I don't think running minidump_stackwalk will yield information that's not already in the stacks that are in the logs.

At this point, I'm not sure how to make progress here; I think it's going to require one of the XPCOM / crash-stack-tea-leaf-reader folks to take a look, I suspect. I poked at some of the logs - the Breakpad ExceptionHandler is busy writing something to disk in each crash, but I don't know if that's just normal because we've caught it in breakpad or what. There are usually a handful of TaskController threads that are waiting on a condvar (on Windows, via NtWaitForAlertByThreadId) and the crash from comment 0 has us stopped inside the destructor of ipc::BrowserProcessSubThread, in RtlReleaseSRWLockShared, but I don't see that in the other crash logs, and I don't know if that's relevant, coincidence, or a red herring...

Adding in :mccr8 in the hope that either he or Kris can (find someone to) help move this forward

Flags: needinfo?(continuation)

I extracted a better stack trace including symbols for the Windows system libraries from the crash in comment 0 and here's what I can see:

  • The main thread is stuck here waiting for the thread to be joined.
  • A child process of some kind seems to have crashed, because the breakpad exception handler thread is busy writing out a minidump
  • All the other threads appear to be waiting
  • Not knowing all the IPC bits my guess is that the main thread is waiting for a child process to shut down, but the child crashed, and taking minidump is taking so long that we time out
    It's hard to tell which child crashed but I think I can divine it by opening the minidump with Visual Studio
Attached file Better stack trace

FYI this is a better stack trace of the crash in comment 0

 1  KERNELBASE.dll!WaitForSingleObjectEx
 2  xul.dll!`anonymous namespace'::ChildReaper::WillDestroyCurrentMessageLoop()
 3  xul.dll!MessageLoop::~MessageLoop()

This is something we do on builds where we need child processes to fully shut down (instead of doing an early OS-level exit to save time) because of instrumentation for things like memory leaks and coverage: so, debug builds (refcount logging), ASan/LSan, ccov, etc. It's an indefinite wait for all child processes to exit, and it's done in a somewhat hacky way after XPCOM thread shutdown. There'd been talk of changing it to eventually time out and give up, but we'd want to avoid interrupting a crash report in progress.

The same IPC “feature” appears in bug 1719481 but I don't know if the root cause is the same.

Whiteboard: [stockwell disable-recommended] → [stockwell needswork]

There are 65 total failures in the last 7 days on - windows10-64-2004-qr debug
Recent failure log: https://treeherder.mozilla.org/logviewer?job_id=354845815&repo=autoland&lineNumber=15617

[task 2021-10-14T23:20:48.861Z] 23:20:48     INFO - TEST-START | browser/base/content/test/protectionsUI/browser_protectionsUI_trackers_subview.js
[task 2021-10-14T23:20:48.865Z] 23:20:48     INFO - GECKO(8124) | [Child 1104: Main Thread]: I/DocShellAndDOMWindowLeak --DOCSHELL 234c7913000 == 0 [pid = 1104] [id = 0] [url = http://trackertest.org/]
[task 2021-10-14T23:20:48.877Z] 23:20:48     INFO - GECKO(8124) | [Child 1104, Main Thread] WARNING: IPC message discarded: actor cannot send: file /builds/worker/checkouts/gecko/ipc/glue/ProtocolUtils.cpp:527
[task 2021-10-14T23:20:48.882Z] 23:20:48     INFO - GECKO(8124) | [Child 1640, Main Thread] WARNING: IPC message discarded: actor cannot send: file /builds/worker/checkouts/gecko/ipc/glue/ProtocolUtils.cpp:527

[task 2021-10-14T23:22:11.352Z] 23:22:11     INFO - GECKO(8124) | [Child 5136, Main Thread] WARNING: '!ipcActor->SendRecordDiscardedData(discardedData)', file /builds/worker/checkouts/gecko/toolkit/components/telemetry/core/ipc/TelemetryIPCAccumulator.cpp:298
[task 2021-10-14T23:22:13.617Z] 23:22:13     INFO - GECKO(8124) | XPCOM_MEM_BLOAT_LOG: C:\Users\task_163425230863140\AppData\Local\Temp\tmpr5hf2mxk.mozrunner\runtests_leaks.log
[task 2021-10-14T23:22:13.618Z] 23:22:13     INFO - GECKO(8124) | Writing to log: C:\Users\task_163425230863140\AppData\Local\Temp\tmpr5hf2mxk.mozrunner\runtests_leaks.log
[task 2021-10-14T23:22:13.620Z] 23:22:13     INFO - GECKO(8124) | RunWatchdog: Mainthread nested event loops during hang:
[task 2021-10-14T23:22:13.621Z] 23:22:13     INFO - GECKO(8124) |  --- (no nested event loop active)
[task 2021-10-14T23:22:14.635Z] 23:22:14     INFO - GECKO(8124) | Hit MOZ_CRASH(Shutdown hanging after all known phases and workers finished.) at /builds/worker/checkouts/gecko/toolkit/components/terminator/nsTerminator.cpp:256
[task 2021-10-14T23:22:14.733Z] 23:22:14     INFO -  Initializing stack-fixing for the first stack frame, this may take a while...
[task 2021-10-14T23:22:24.308Z] 23:22:24     INFO - GECKO(8124) | #01: mozilla::`anonymous namespace'::RunWatchdog(void*) [toolkit/components/terminator/nsTerminator.cpp:256]
[task 2021-10-14T23:22:24.503Z] 23:22:24     INFO - GECKO(8124) | #02: PR_NativeRunThread(void*) [nsprpub/pr/src/threads/combined/pruthr.c:408]
[task 2021-10-14T23:22:24.514Z] 23:22:24     INFO - GECKO(8124) | #03: pr_root(void*) [nsprpub/pr/src/md/windows/w95thred.c:140]
[task 2021-10-14T23:22:24.518Z] 23:22:24     INFO -  fix-stacks: error: failed to read debug info file `ucrtbase.pdb` for `C:\Windows\System32\ucrtbase.dll`
[task 2021-10-14T23:22:24.518Z] 23:22:24     INFO -  fix-stacks: note: this is expected and harmless for all PDB files on opt automation runs
[task 2021-10-14T23:22:24.518Z] 23:22:24     INFO -  fix-stacks: The system cannot find the file specified. (os error 2)
[task 2021-10-14T23:22:24.519Z] 23:22:24     INFO - GECKO(8124) | #04: configthreadlocale [C:\Windows\System32\ucrtbase.dll + 0x21bb2]
[task 2021-10-14T23:22:24.522Z] 23:22:24     INFO -  fix-stacks: error: failed to read debug info file `wkernel32.pdb` for `C:\Windows\System32\KERNEL32.DLL`
[task 2021-10-14T23:22:24.522Z] 23:22:24     INFO -  fix-stacks: note: this is expected and harmless for all PDB files on opt automation runs
[task 2021-10-14T23:22:24.522Z] 23:22:24     INFO -  fix-stacks: The system cannot find the file specified. (os error 2)
[task 2021-10-14T23:22:24.523Z] 23:22:24     INFO - GECKO(8124) | #05: BaseThreadInitThunk [C:\Windows\System32\KERNEL32.DLL + 0x17034]
[task 2021-10-14T23:22:24.527Z] 23:22:24     INFO -  fix-stacks: error: failed to read debug info file `wntdll.pdb` for `C:\Windows\SYSTEM32\ntdll.dll`
[task 2021-10-14T23:22:24.527Z] 23:22:24     INFO -  fix-stacks: note: this is expected and harmless for all PDB files on opt automation runs
[task 2021-10-14T23:22:24.527Z] 23:22:24     INFO -  fix-stacks: The system cannot find the file specified. (os error 2)
[task 2021-10-14T23:22:24.528Z] 23:22:24     INFO - GECKO(8124) | #06: RtlUserThreadStart [C:\Windows\SYSTEM32\ntdll.dll + 0x52651]
[task 2021-10-14T23:22:24.528Z] 23:22:24     INFO - GECKO(8124) | [GPU 4592, IPC I/O Child] WARNING: [4A180DBF879BE3FB.57C8326487332A4C]: Ignoring message 'EVENT_MESSAGE' to peer 37124C67C69AC9FD.7D14168D4B2BD336 due to a missing broker: file /builds/worker/checkouts/gecko/ipc/glue/NodeController.cpp:297
[task 2021-10-14T23:22:24.529Z] 23:22:24     INFO - GECKO(8124) | [GPU 4592, IPC I/O Child] WARNING: [4A180DBF879BE3FB.57C8326487332A4C]: Ignoring message 'EVENT_MESSAGE' to peer 37124C67C69AC9FD.7D14168D4B2BD336 due to a missing broker: file /builds/worker/checkouts/gecko/ipc/glue/NodeController.cpp:297
[task 2021-10-14T23:22:24.529Z] 23:22:24     INFO - GECKO(8124) | Destroying context 0000027796667000 surface 00000277948652E0 on display 0000027796608200
[task 2021-10-14T23:22:24.530Z] 23:22:24     INFO - GECKO(8124) | [GPU 4592, Main Thread] WARNING: NS_ENSURE_TRUE(Preferences::InitStaticMembers()) failed: file /builds/worker/checkouts/gecko/modules/libpref/Preferences.cpp:4439
[task 2021-10-14T23:22:24.530Z] 23:22:24     INFO - TEST-INFO | Main app process: exit 1
task 2021-10-14T23:22:24.548Z] 23:22:24     INFO - TEST-INFO | browser/base/content/test/protectionsUI/browser_protectionsUI_trackers_subview.js | This test created 1 hidden docshell(s)
[task 2021-10-14T23:22:24.548Z] 23:22:24     INFO - Buffered messages finished
[task 2021-10-14T23:22:24.549Z] 23:22:24    ERROR - TEST-UNEXPECTED-FAIL | Last test finished | application terminated with exit code 1
[task 2021-10-14T23:22:24.549Z] 23:22:24     INFO - runtests.py | Application ran for: 0:04:09.742412
[task 2021-10-14T23:22:24.549Z] 23:22:24     INFO - zombiecheck | Reading PID log: C:\Users\task_163425230863140\AppData\Local\Temp\tmp45prhdxvpidlog
task 2021-10-14T23:22:31.792Z] 23:22:31     INFO - mozcrash Saved app info as Z:\task_163425230863140\build\blobber_upload_dir\25e3ca0f-5074-4ff7-956a-c73d45a10776.extra
[task 2021-10-14T23:22:31.836Z] 23:22:31     INFO - PROCESS-CRASH | Last test finished | application crashed [@ PR_NativeRunThread(void*)]
[task 2021-10-14T23:22:31.836Z] 23:22:31     INFO - Mozilla crash reason: MOZ_CRASH(Shutdown hanging after all known phases and workers finished.)
[task 2021-10-14T23:22:31.847Z] 23:22:31     INFO - Crash dump filename: C:\Users\task_163425230863140\AppData\Local\Temp\tmpr5hf2mxk.mozrunner\minidumps\25e3ca0f-5074-4ff7-956a-c73d45a10776.dmp
[task 2021-10-14T23:22:31.847Z] 23:22:31     INFO - Operating system: Windows NT
[task 2021-10-14T23:22:31.847Z] 23:22:31     INFO -                   10.0.19041 
[task 2021-10-14T23:22:31.847Z] 23:22:31     INFO - CPU: amd64
[task 2021-10-14T23:22:31.847Z] 23:22:31     INFO -      family 6 model 85 stepping 7
[task 2021-10-14T23:22:31.847Z] 23:22:31     INFO -      8 CPUs
[task 2021-10-14T23:22:31.847Z] 23:22:31     INFO - 
[task 2021-10-14T23:22:31.847Z] 23:22:31     INFO - GPU: UNKNOWN
[task 2021-10-14T23:22:31.847Z] 23:22:31     INFO - 
[task 2021-10-14T23:22:31.847Z] 23:22:31     INFO - Crash reason:  EXCEPTION_BREAKPOINT
[task 2021-10-14T23:22:31.847Z] 23:22:31     INFO - Crash address: 0x80ef856a
[task 2021-10-14T23:22:31.847Z] 23:22:31     INFO - Process uptime: 240 seconds
[task 2021-10-14T23:22:31.847Z] 23:22:31     INFO - 
[task 2021-10-14T23:22:31.847Z] 23:22:31     INFO - Thread 19 tid 7448 (crashed) - Shutdown Hang Terminator 0  xul.dll!mozilla::`anonymous namespace'::RunWatchdog(void*) [nsTerminator.cpp:03de2f75a442af1253075b106158bca2a198385b : 256 + 0x0]
[task 2021-10-14T23:22:31.847Z] 23:22:31     INFO -     rax = 0x00007ffe8475fe73   rdx = 0x0000000000000000
[task 2021-10-14T23:22:31.848Z] 23:22:31     INFO -     rcx = 0x00007ffe953bc948   rbx = 0x00007ffe9567a0f7
[task 2021-10-14T23:22:31.848Z] 23:22:31     INFO -     rsi = 0x000000c1f04efd10   rdi = 0x0000000000000276
[task 2021-10-14T23:22:31.848Z] 23:22:31     INFO -     rbp = 0x00007ffe956d2d08   rsp = 0x000000c1f04efcf0
[task 2021-10-14T23:22:31.848Z] 23:22:31     INFO -      r8 = 0x000000c1f04efee0    r9 = 0x00007ffeb9250000
[task 2021-10-14T23:22:31.848Z] 23:22:31     INFO -     r10 = 0x00007ffeb92a2651   r11 = 0x000000c1f04ef6b0
[task 2021-10-14T23:22:31.848Z] 23:22:31     INFO -     r12 = 0x000002b755d33838   r13 = 0x000002b755d33848
[task 2021-10-14T23:22:31.848Z] 23:22:31     INFO -     r14 = 0x0000000000000000   r15 = 0x00007ffe9567a118
[task 2021-10-14T23:22:31.848Z] 23:22:31     INFO -     rip = 0x00007ffe80ef856a
[task 2021-10-14T23:22:31.848Z] 23:22:31     INFO -     Found by: given as instruction pointer in context
[task 2021-10-14T23:22:31.848Z] 23:22:31     INFO -  1  nss3.dll!PR_NativeRunThread(void*) [pruthr.c:03de2f75a442af1253075b106158bca2a198385b : 399 + 0xe]
[task 2021-10-14T23:22:31.848Z] 23:22:31     INFO -     rbx = 0x00007ffe9567a0f7   rbp = 0x00007ffe956d2d08
[task 2021-10-14T23:22:31.848Z] 23:22:31     INFO -     rsp = 0x000000c1f04efd50   r12 = 0x000002b755d33838
[task 2021-10-14T23:22:31.848Z] 23:22:31     INFO -     r13 = 0x000002b755d33848   r14 = 0x0000000000000000
[task 2021-10-14T23:22:31.848Z] 23:22:31     INFO -     r15 = 0x00007ffe9567a118   rip = 0x00007ffe95529462
[task 2021-10-14T23:22:31.848Z] 23:22:31     INFO -     Found by: call frame info
[task 2021-10-14T23:22:31.848Z] 23:22:31     INFO -  2  nss3.dll!pr_root(void*) [w95thred.c:03de2f75a442af1253075b106158bca2a198385b : 139 + 0xd]
[task 2021-10-14T23:22:31.848Z] 23:22:31     INFO -     rbx = 0x00007ffe9567a0f7   rbp = 0x00007ffe956d2d08
[task 2021-10-14T23:22:31.848Z] 23:22:31     INFO -     rsp = 0x000000c1f04efdd0   r12 = 0x000002b755d33838
[task 2021-10-14T23:22:31.848Z] 23:22:31     INFO -     r13 = 0x000002b755d33848   r14 = 0x0000000000000000
[task 2021-10-14T23:22:31.848Z] 23:22:31     INFO -     r15 = 0x00007ffe9567a118   rip = 0x00007ffe95519e41
[task 2021-10-14T23:22:31.848Z] 23:22:31     INFO -     Found by: call frame info
[task 2021-10-14T23:22:31.848Z] 23:22:31     INFO -  3  ucrtbase.dll!RtlpReAllocateHeap + 0x1ce
[task 2021-10-14T23:22:31.848Z] 23:22:31     INFO -     rbx = 0x00007ffe9567a0f7   rbp = 0x00007ffe956d2d08
[task 2021-10-14T23:22:31.848Z] 23:22:31     INFO -     rsp = 0x000000c1f04efe00   r12 = 0x000002b755d33838
[task 2021-10-14T23:22:31.849Z] 23:22:31     INFO -     r13 = 0x000002b755d33848   r14 = 0x0000000000000000
[task 2021-10-14T23:22:31.849Z] 23:22:31     INFO -     r15 = 0x00007ffe9567a118   rip = 0x00007ffeb6ef1bb2
[task 2021-10-14T23:22:31.849Z] 23:22:31     INFO -     Found by: call frame info
Whiteboard: [stockwell disable-recommended] → [stockwell needswork:owner]
Whiteboard: [stockwell disable-recommended] → [stockwell needswork:owner]

(In reply to Gabriele Svelto [:gsvelto] from comment #21)

I extracted a better stack trace including symbols for the Windows system libraries from the crash in comment 0 and here's what I can see:

  • The main thread is stuck here waiting for the thread to be joined.
  • A child process of some kind seems to have crashed, because the breakpad exception handler thread is busy writing out a minidump
  • All the other threads appear to be waiting
  • Not knowing all the IPC bits my guess is that the main thread is waiting for a child process to shut down, but the child crashed, and taking minidump is taking so long that we time out
    It's hard to tell which child crashed but I think I can divine it by opening the minidump with Visual Studio

Looking at another stack it seems we are stuck on PostQueuedCompletionStatus on the main thread. It seems that PostQueuedCompletionStatus can delay its execution in some cases, not sure if this is what happens here, though. FWIW, we create the completion IO port with a concurrency of 1 thread only. From what I read, too much concurrency wrt to the initial setting seems to be a possible issue here.

Severity: -- → N/A
Priority: -- → P3

I think bug 1684441 is the ASan/TSan equivalent of this shutdown hang issue.

See Also: → 1684441

I was confused about what sIOThread was and what was going on here, but Nika explained that sIOThread is the main processes IPC thread, and we're probably waitpiding for the child processes to go away here. Which I guess is what gsvelto was talking about in comment 21.

In logs where we don't hang, the shutdown sequence has these two lines from the GPU process (well, I'm assuming the Destroying Context is from the GPU process):

Destroying context 0000018618D65000 surface 000001861706A580 on display 00000186170FD800
[GPU 3572, Main Thread] WARNING: NS_ENSURE_TRUE(Preferences::InitStaticMembers()) failed: ...

...then a bunch of --DOMWINDOW stuff and the like in the parent, then the parent exits.

Looking at about a half dozen of the failing logs, we get the Hit MOZ_CRASH(Shutdown hanging after all known phases and workers finished.) and THEN we get some error messages from the GPU process:

[GPU 7188, IPC I/O Child] WARNING: ...: Ignoring message 'EVENT_MESSAGE' to peer ... due to a missing broker
[GPU 7188, IPC I/O Child] WARNING: ...: Ignoring message 'EVENT_MESSAGE' to peer ... due to a missing broker
Destroying context 0000027CBDC67000 surface 0000027CBBD6A190 on display 0000027CBBDFD400
[GPU 7188, Main Thread] WARNING: NS_ENSURE_TRUE(Preferences::InitStaticMembers()) failed: ...

The first two messages are presumably the GPU process being sad because the parent went away, but it seems bad that we're apparently not shutting down until then. The previous logging from the GPU process was about a minute earlier.

Some but not all of the failing logs also feature log spam from telemetry in a child failing to send IPC right up until the point of the hang:
[Child 1008, Main Thread] WARNING: '!ipcActor->SendRecordDiscardedData(discardedData)'
...so maybe a content process is also taking a while to go away, sometimes. But the GPU stuff above is still there.

Maybe we could add some logging to figure out exactly which process we're waiting on, including a PID. It might also be good to get some logging of the PID for the process that we're getting a minidump for. Maybe there's some weird shutdown race there.

Whiteboard: [stockwell disable-recommended] → [stockwell needswork:owner]

FWIW I opened one of the dmp files with Visual Studio.

The main thread is waiting for the IPC sIOThread to join.

The IPC thread seems to be waiting for a content process to exit.

It is not clear to me who was supposed to tell the child process to exit, but it apparently did not exit yet (assuming that Windows recognizes an obsolete process handle).

There is some

[Child 6944, Main Thread] WARNING: IPC message discarded: actor cannot send: file /builds/worker/checkouts/gecko/ipc/glue/ProtocolUtils.cpp:531

but I am not sure if this is said by the same child we are waiting for and then some

###!!! [Parent][RunMessage] Error: Channel closing: too late to send/recv, messages will be lost

message but I do not know if some of those were supposed to send the quit message to the child process. From the frequent "message discarded" entries in the log it seems something bad happened to our IPC layer in general?

In any case I am wondering if WillDestroyCurrentMessageLoop() should have some timeout and forced kill via KillProcess() to avoid such situations?

(In reply to Jens Stutte [:jstutte] from comment #37)

message but I do not know if some of those were supposed to send the quit message to the child process. From the frequent "message discarded" entries in the log it seems something bad happened to our IPC layer in general?

This just means that the child tried to send a message to the parent after the IPDL stuff was shut down. We get some of that regularly, so it isn't a big deal. However, it looks like in many of these error logs we're spamming out a bunch of this for telemetry, and IIRC I didn't see that in non-failing logs. But it might just be a symptom rather than a cause of a process failing to shut down.

In any case I am wondering if WillDestroyCurrentMessageLoop() should have some timeout and forced kill via KillProcess() to avoid such situations?

I don't remember where it is off the top of my head, but we do already have code set up to do something along those lines, but it is intentionally disabled for debug and ASan builds because we need those processes to do stuff when shutting down. If we don't let them run their course, we'll either get other error messages or may start silently failing to get LSan/ASan errors from the content process. I'm hoping we can figure out why the child process is not going away. I'm also not sure what exactly is supposed to trigger it.

I'm doing a try run that includes logging of the PID of the process we're waiting for, and the PID of the process we're doing the minidump stuff with. It might confirm my theory that this is the GPU process.

I did a try run with patches that print out the pid of process in WriteMinidumpWithExceptionForProcess(), and the pid of process_ in WillDestroyCurrentMessageLoop().

In the three cases that I saw in my try run ("fortunately" this seems to reproduce fairly often on bc5), the process in WriteMinidumpWithExceptionForProcess() is for the main process, and that isn't happening until after we've hit the hang. So I think the presence of the minidump in the minidump stack is a red herring: it looks like it is just recording the actual crash from the hang watchdog. Hopefully that makes sense.

As for the other logging, in 2 cases there is something like this:

QQQ WillDestroyCurrentMessageLoop pid 8964
[Child 8964, Main Thread] WARNING: IPC message discarded [...]

The parent process is waiting for a content process to exit, but the content process is still far enough away from shutdown that it is still trying to send IPC messages. Well, these are actually telemetry messages, and as Nika found yesterday, there's actually an attempt to send them as late as the shutdown of the XPCOM component manager, via ShutdownTelemetry().

So I guess the GPU process is not the problematic process, it is a content process.

(In the third case, it is also a content process that we're waiting on, but the last logging message from that content process happened much earlier. But there was a crash in some process around the same time so maybe that caused things to go weird?)

In one of these logs, the process that isn't going away has a weird history. It gets created, goes along enough to create a leak log, then there's some ++DOCSHELL stuff, then less than a second later it is hitting "WARNING: MsgDropped in ContentChild:" and then the only logging it produces for the rest of its lifetime is failed attempts to send telemetry IPC messages to the parent. It is sitting around doing that for around 5 minutes! The other log where this is happening follows the same general pattern. Nika was wondering if there could be some kind of startup/shutdown race that causes the process to not shutdown.

(In reply to Andrew McCreight [:mccr8] from comment #40)

In one of these logs, the process that isn't going away has a weird history. It gets created, goes along enough to create a leak log, then there's some ++DOCSHELL stuff, then less than a second later it is hitting "WARNING: MsgDropped in ContentChild:" and then the only logging it produces for the rest of its lifetime is failed attempts to send telemetry IPC messages to the parent. It is sitting around doing that for around 5 minutes! The other log where this is happening follows the same general pattern. Nika was wondering if there could be some kind of startup/shutdown race that causes the process to not shutdown.

FWIW, in that log I read:

[task 2021-10-27T21:16:47.572Z] 21:16:47     INFO - GECKO(5636) | [Child 8964, Main Thread] WARNING: DispatchEvent called on non-current inner window, dropping. Please check the window in the caller instead.: file /builds/worker/checkouts/gecko/dom/base/nsGlobalWindowInner.cpp:4195
[task 2021-10-27T21:16:47.642Z] 21:16:47     INFO - GECKO(5636) | [Child 8964, Main Thread] WARNING: DispatchEvent called on non-current inner window, dropping. Please check the window in the caller instead.: file /builds/worker/checkouts/gecko/dom/base/nsGlobalWindowInner.cpp:4195

Just a wild guess (as I ignore many details): Could it be that we ignore some quit events if we are not the current inner window?

gerard-majax mention bug 1718210 as a shutdown hang involving the GPU process.

See Also: → 1718210

This is probably IPC, not threads, so I'll clear Kris's needinfo.

Flags: needinfo?(kwright)

I've been working on reproducing this locally on my Windows machine.

I can reproduce it very frequently (maybe even 100% of the time) with:
mach test --headless --enable-fission --verify browser/components/extensions/test/browser/browser_ext_tabs_move_window.js

It consistently fails during the first part of verification, where it runs the test 10 times. Next I'm going to figure out how to get the right logging going. Enabling chromium logging unfortunately seems to slow things down enough that the hang goes away.

Unfortunately, the hang I'm getting locally appears to be something else.

Whiteboard: [stockwell disable-recommended] → [stockwell needswork:owner]

mccr8 says this test failure looks like it's Fission and debug only. It might or might not affect real users.

Of the 393 test failures from the last four weeks, all but one are on windows10-64-2004-qr. The one is one macosx1015-64-qr.

Fission Milestone: --- → Future
Whiteboard: [stockwell needswork:owner] → [stockwell needswork:owner] [fission:m96?]

This is the stack for the ContentChild::ActorDestroy call:

#02: mozilla::dom::ContentChild::ActorDestroy(mozilla::ipc::IProtocol::ActorDestroyReason) [dom/ipc/ContentChild.cpp:2226]
#03: mozilla::ipc::IProtocol::DestroySubtree(mozilla::ipc::IProtocol::ActorDestroyReason) [ipc/glue/ProtocolUtils.cpp:623]
#04: mozilla::dom::PContentChild::OnChannelClose() [/ipc/ipdl/PContentChild.cpp::16069]
#05: mozilla::ipc::MessageChannel::OnNotifyMaybeChannelError() [ipc/glue/MessageChannel.cpp:2474]
#06: mozilla::detail::RunnableMethodImpl<mozilla::ipc::MessageChannel ,void (mozilla::ipc::MessageChannel::)(),0,mozilla::RunnableKind::Cancelable>::Run() [xpcom/threads/nsThreadUtils.h:1203]
#07: mozilla::RunnableTask::Run() [xpcom/threads/TaskController.cpp:468]
#08: mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal(mozilla::detail::BaseAutoLock<mozilla::Mutex &> const&) [xpcom/threads/TaskController.cpp:770]
#09: mozilla::TaskController::ExecuteNextTaskOnlyMainThreadInternal(mozilla::detail::BaseAutoLock<mozilla::Mutex &> const&) [xpcom/threads/TaskController.cpp:606]
#10: mozilla::TaskController::ProcessPendingMTTask(bool) [xpcom/threads/TaskController.cpp:390]
#11: mozilla::detail::RunnableFunction<`lambda at /builds/worker/checkouts/gecko/xpcom/threads/TaskController.cpp:126:7'>::Run() [xpcom/threads/nsThreadUtils.h:532]
#12: nsThread::ProcessNextEvent(bool, bool*) [xpcom/threads/nsThread.cpp:1178]
#13: NS_ProcessNextEvent(nsIThread*, bool) [xpcom/threads/nsThreadUtils.cpp:467]
#14: mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) [ipc/glue/MessagePump.cpp:107]
#15: MessageLoop::RunHandler() [ipc/chromium/src/base/message_loop.cc:325]
#16: MessageLoop::Run() [ipc/chromium/src/base/message_loop.cc:307]
#17: nsBaseAppShell::Run() [widget/nsBaseAppShell.cpp:139]
#18: nsAppShell::Run() [widget/windows/nsAppShell.cpp:603]
#19: XRE_RunAppShell() [toolkit/xre/nsEmbedFunctions.cpp:917]
#20: mozilla::ipc::MessagePumpForChildProcess::Run(base::MessagePump::Delegate*) [ipc/glue/MessagePump.cpp:235]
#21: MessageLoop::RunHandler() [ipc/chromium/src/base/message_loop.cc:325]
#22: MessageLoop::Run() [ipc/chromium/src/base/message_loop.cc:307]
#23: XRE_InitChildProcess(int, char**, XREChildData const*) [toolkit/xre/nsEmbedFunctions.cpp:753]

It looks the same in non-failing runs, so whatever oddness is here is not sufficient to explain the hangs.

The failure rate in this bug could likely be greatly reduced by disabling browser/components/extensions/test/browser/browser_ext_tabs_move_window.js on Windows 10 Fission debug. It seems to be happening elsewhere, but a process created during that specific test seems to hang about 2/3 of the time.

Assignee: nobody → apavel
Status: NEW → ASSIGNED
Assignee: apavel → nobody
Status: ASSIGNED → NEW
Keywords: leave-open
Pushed by amccreight@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/bea1d2e79758
disable browser_ext_tabs_move_window.js on win10 fission debug r=mccr8

I'm kind of out of ideas here. Maybe inspiration will strike later. I'll move this to IPC because it seems like some kind of weird IPC shutdown issue, at least in the current form. Hopefully the test disablement will reduce the failure rate enough.

Component: Async Tooling → IPC
Flags: needinfo?(continuation)
Product: Toolkit → Core
Depends on: 1740529

(In reply to Andrew McCreight [:mccr8] from comment #53)

I'm kind of out of ideas here. Maybe inspiration will strike later. I'll move this to IPC because it seems like some kind of weird IPC shutdown issue, at least in the current form. Hopefully the test disablement will reduce the failure rate enough.

Hi Andrew, now that you added the pid in bug 1740529, I see a pattern here when I filter for it:

[task 2021-11-11T16:54:53.110Z] 16:54:53     INFO - GECKO(4396) | ### XPCOM_MEM_BLOAT_LOG defined -- logging bloat/leaks to C:\Users\task_163664756302097\AppData\Local\Temp\tmp1xjaw5t3.mozrunner\runtests_leaks_tab_pid1384.log
[task 2021-11-11T16:54:53.110Z] 16:54:53     INFO - GECKO(4396) | [1384, Main Thread] WARNING: XPCOM_MEM_BLOAT_LOG is set, disabling native allocations.: file /builds/worker/checkouts/gecko/tools/profiler/core/platform.cpp:250
[task 2021-11-11T16:54:54.851Z] 16:54:54     INFO - GECKO(4396) | [Child 1384, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80004005 (NS_ERROR_FAILURE): file /builds/worker/checkouts/gecko/docshell/base/nsDocShell.cpp:8535
[task 2021-11-11T16:54:55.039Z] 16:54:55     INFO - GECKO(4396) | [Child 1384, Main Thread] WARNING: IPC message 'PIdleScheduler::Msg_StartedGC' discarded: actor cannot send: file /builds/worker/checkouts/gecko/ipc/glue/ProtocolUtils.cpp:525
[task 2021-11-11T16:54:55.061Z] 16:54:55     INFO - GECKO(4396) | [Child 1384, Main Thread] WARNING: IPC message 'PIdleScheduler::Msg_DoneGC' discarded: actor cannot send: file /builds/worker/checkouts/gecko/ipc/glue/ProtocolUtils.cpp:525
[task 2021-11-11T16:54:55.093Z] 16:54:55     INFO - GECKO(4396) | [Child 1384, Main Thread] WARNING: IPC message 'PIdleScheduler::Msg_StartedGC' discarded: actor cannot send: file /builds/worker/checkouts/gecko/ipc/glue/ProtocolUtils.cpp:525
[task 2021-11-11T16:54:55.114Z] 16:54:55     INFO - GECKO(4396) | [Child 1384, Main Thread] WARNING: IPC message 'PIdleScheduler::Msg_DoneGC' discarded: actor cannot send: file /builds/worker/checkouts/gecko/ipc/glue/ProtocolUtils.cpp:525
[task 2021-11-11T16:54:55.117Z] 16:54:55     INFO - GECKO(4396) | [Child 1384, Main Thread] WARNING: Extra shutdown CC: 'i < NORMAL_SHUTDOWN_COLLECTIONS', file /builds/worker/checkouts/gecko/xpcom/base/nsCycleCollector.cpp:3353
[task 2021-11-11T16:54:55.120Z] 16:54:55     INFO - GECKO(4396) | [Child 1384, Main Thread] WARNING: IPC message 'PIdleScheduler::Msg_StartedGC' discarded: actor cannot send: file /builds/worker/checkouts/gecko/ipc/glue/ProtocolUtils.cpp:525
[task 2021-11-11T16:54:55.123Z] 16:54:55     INFO - GECKO(4396) | [Child 1384, Main Thread] WARNING: IPC message 'PIdleScheduler::Msg_DoneGC' discarded: actor cannot send: file /builds/worker/checkouts/gecko/ipc/glue/ProtocolUtils.cpp:525
[task 2021-11-11T16:54:55.124Z] 16:54:55     INFO - GECKO(4396) | [Child 1384, Main Thread] WARNING: IPC message 'PIdleScheduler::Msg_StartedGC' discarded: actor cannot send: file /builds/worker/checkouts/gecko/ipc/glue/ProtocolUtils.cpp:525
[task 2021-11-11T16:54:55.125Z] 16:54:55     INFO - GECKO(4396) | [Child 1384, Main Thread] WARNING: IPC message 'PIdleScheduler::Msg_DoneGC' discarded: actor cannot send: file /builds/worker/checkouts/gecko/ipc/glue/ProtocolUtils.cpp:525
[task 2021-11-11T16:54:55.134Z] 16:54:55     INFO - GECKO(4396) | [Child 1384, Main Thread] WARNING: NS_ENSURE_TRUE(Preferences::InitStaticMembers()) failed: file /builds/worker/checkouts/gecko/modules/libpref/Preferences.cpp:4439
[task 2021-11-11T17:00:11.018Z] 17:00:11     INFO - ==> process 3264 launched child process 1384 ("Z:\task_163664756302097\build\application\firefox\firefox.exe" -contentproc --channel="3264.649.62035424\28145897" -childID 325 -isForBrowser -prefsHandle 4632 -prefMapHandle 5476 -prefsLen 11781 -prefMapSize 255484 -jsInitHandle 1332 -jsInitLen 336328 -parentBuildID 20211111152544 -appDir "Z:\task_163664756302097\build\application\firefox\browser" - 3264 "\\.\pipe\gecko-crash-server-pipe.3264" 5792 22fe6348970 tab)
[task 2021-11-11T17:00:11.070Z] 17:00:11     INFO - ==> process 3264 launched child process 1668 ("Z:\task_163664756302097\build\application\firefox\firefox.exe" -contentproc --channel="3264.773.2013840525\1546756485" -childID 387 -isForBrowser -prefsHandle 4916 -prefMapHandle 5904 -prefsLen 11860 -prefMapSize 255484 -jsInitHandle 1332 -jsInitLen 336328 -parentBuildID 20211111152544 -appDir "Z:\task_163664756302097\build\application\firefox\browser" - 3264 "\\.\pipe\gecko-crash-server-pipe.3264" 6068 22fe7606970 tab)
[task 2021-11-11T17:00:11.364Z] 17:00:11     INFO - zombiecheck | Checking for orphan process with PID: 1384
[task 2021-11-11T17:00:19.174Z] 17:00:19     INFO - leakcheck | Processing leak log file C:\Users\task_163664756302097\AppData\Local\Temp\tmp1xjaw5t3.mozrunner\runtests_leaks_tab_pid1384.log
[task 2021-11-11T17:00:19.174Z] 17:00:19     INFO - == BloatView: ALL (cumulative) LEAK AND BLOAT STATISTICS, tab process 1384
[task 2021-11-11T17:07:54.894Z] 17:07:54     INFO - ==> process 6244 launched child process 5976 ("Z:\task_163664756302097\build\application\firefox\firefox.exe" -contentproc --channel="6244.365.1175471384\1084633623" -childID 183 -isForBrowser -prefsHandle 5868 -prefMapHandle 5960 -prefsLen 11579 -prefMapSize 255448 -jsInitHandle 1352 -jsInitLen 336328 -parentBuildID 20211111152544 -appDir "Z:\task_163664756302097\build\application\firefox\browser" - 6244 "\\.\pipe\gecko-crash-server-pipe.6244" 6036 1f0f9540970 tab)
[task 2021-11-11T17:07:54.899Z] 17:07:54     INFO - ==> process 6244 launched child process 3636 ("Z:\task_163664756302097\build\application\firefox\firefox.exe" -contentproc --channel="6244.377.1886900087\1384882622" -childID 189 -isForBrowser -prefsHandle 3640 -prefMapHandle 3352 -prefsLen 11579 -prefMapSize 255448 -jsInitHandle 1352 -jsInitLen 336328 -parentBuildID 20211111152544 -appDir "Z:\task_163664756302097\build\application\firefox\browser" - 6244 "\\.\pipe\gecko-crash-server-pipe.6244" 3376 1f0ef1e9870 tab)
[task 2021-11-11T17:09:49.592Z] 17:09:49     INFO - ==> process 3076 launched child process 8476 ("Z:\task_163664756302097\build\application\firefox\firefox.exe" -contentproc --channel="3076.131.1821038806\520881384" -childID 66 -isForBrowser -prefsHandle 6124 -prefMapHandle 5440 -prefsLen 11637 -prefMapSize 255484 -jsInitHandle 1364 -jsInitLen 336328 -parentBuildID 20211111152544 -appDir "Z:\task_163664756302097\build\application\firefox\browser" - 3076 "\\.\pipe\gecko-crash-server-pipe.3076" 5056 1c7ae21b570 tab)
[task 2021-11-11T17:09:50.158Z] 17:09:50     INFO -    0 |TOTAL                                 |       57        0|   21384        0|
[task 2021-11-11T17:13:26.009Z] 17:13:26     INFO - ==> process 3656 launched child process 628 ("Z:\task_163664756302097\build\application\firefox\firefox.exe" -contentproc --channel="3656.947.1384405750\374374135" -childID 474 -isForBrowser -prefsHandle 3136 -prefMapHandle 3280 -prefsLen 11250 -prefMapSize 255448 -jsInitHandle 1308 -jsInitLen 336328 -parentBuildID 20211111152544 -appDir "Z:\task_163664756302097\build\application\firefox\browser" - 3656 "\\.\pipe\gecko-crash-server-pipe.3656" 4056 1be8696e970 tab)
[task 2021-11-11T17:19:58.515Z] 17:19:58     INFO - GECKO(2852) | ### XPCOM_MEM_BLOAT_LOG defined -- logging bloat/leaks to C:\Users\task_163664756302097\AppData\Local\Temp\tmpc1fun4vn.mozrunner\runtests_leaks_tab_pid1384.log
[task 2021-11-11T17:19:58.516Z] 17:19:58     INFO - GECKO(2852) | [1384, Main Thread] WARNING: XPCOM_MEM_BLOAT_LOG is set, disabling native allocations.: file /builds/worker/checkouts/gecko/tools/profiler/core/platform.cpp:250
[task 2021-11-11T17:20:00.714Z] 17:20:00     INFO - GECKO(2852) | [Child 1384, Main Thread] WARNING: MsgDropped in ContentChild: file /builds/worker/checkouts/gecko/dom/ipc/ContentChild.cpp:2282
[task 2021-11-11T17:20:02.727Z] 17:20:02     INFO - GECKO(2852) | [Child 1384, Main Thread] WARNING: IPC message 'PContent::Msg_AccumulateChildKeyedHistograms' discarded: actor cannot send: file /builds/worker/checkouts/gecko/ipc/glue/ProtocolUtils.cpp:525
[task 2021-11-11T17:20:02.728Z] 17:20:02     INFO - GECKO(2852) | [Child 1384, Main Thread] WARNING: '!ipcActor->SendAccumulateChildKeyedHistograms(keyedHistogramsToSend)', file /builds/worker/checkouts/gecko/toolkit/components/telemetry/core/ipc/TelemetryIPCAccumulator.cpp:283
[task 2021-11-11T17:20:02.729Z] 17:20:02     INFO - GECKO(2852) | [Child 1384, Main Thread] WARNING: IPC message 'PContent::Msg_RecordDiscardedData' discarded: actor cannot send: file /builds/worker/checkouts/gecko/ipc/glue/ProtocolUtils.cpp:525
[task 2021-11-11T17:20:02.730Z] 17:20:02     INFO - GECKO(2852) | [Child 1384, Main Thread] WARNING: '!ipcActor->SendRecordDiscardedData(discardedData)', file /builds/worker/checkouts/gecko/toolkit/components/telemetry/core/ipc/TelemetryIPCAccumulator.cpp:298
[task 2021-11-11T17:20:08.606Z] 17:20:08     INFO - GECKO(2852) | [Child 1384, Main Thread] WARNING: IPC message 'PContent::Msg_AccumulateChildHistograms' discarded: actor cannot send: file /builds/worker/checkouts/gecko/ipc/glue/ProtocolUtils.cpp:525
[task 2021-11-11T17:20:08.607Z] 17:20:08     INFO - GECKO(2852) | [Child 1384, Main Thread] WARNING: '!ipcActor->SendAccumulateChildHistograms(histogramsToSend)', file /builds/worker/checkouts/gecko/toolkit/components/telemetry/core/ipc/TelemetryIPCAccumulator.cpp:279
[task 2021-11-11T17:20:08.608Z] 17:20:08     INFO - GECKO(2852) | [Child 1384, Main Thread] WARNING: IPC message 'PContent::Msg_RecordDiscardedData' discarded: actor cannot send: file /builds/worker/checkouts/gecko/ipc/glue/ProtocolUtils.cpp:525
[task 2021-11-11T17:20:08.608Z] 17:20:08     INFO - GECKO(2852) | [Child 1384, Main Thread] WARNING: '!ipcActor->SendRecordDiscardedData(discardedData)', file /builds/worker/checkouts/gecko/toolkit/components/telemetry/core/ipc/TelemetryIPCAccumulator.cpp:298
[task 2021-11-11T17:20:54.182Z] 17:20:54     INFO - GECKO(2852) | [Child 1384, Main Thread] WARNING: IPC message 'PContent::Msg_UpdateChildScalars' discarded: actor cannot send: file /builds/worker/checkouts/gecko/ipc/glue/ProtocolUtils.cpp:525
[task 2021-11-11T17:20:54.183Z] 17:20:54     INFO - GECKO(2852) | [Child 1384, Main Thread] WARNING: '!ipcActor->SendUpdateChildScalars(scalarsToSend)', file /builds/worker/checkouts/gecko/toolkit/components/telemetry/core/ipc/TelemetryIPCAccumulator.cpp:287
[task 2021-11-11T17:20:54.183Z] 17:20:54     INFO - GECKO(2852) | [Child 1384, Main Thread] WARNING: IPC message 'PContent::Msg_RecordDiscardedData' discarded: actor cannot send: file /builds/worker/checkouts/gecko/ipc/glue/ProtocolUtils.cpp:525
[task 2021-11-11T17:20:54.184Z] 17:20:54     INFO - GECKO(2852) | [Child 1384, Main Thread] WARNING: '!ipcActor->SendRecordDiscardedData(discardedData)', file /builds/worker/checkouts/gecko/toolkit/components/telemetry/core/ipc/TelemetryIPCAccumulator.cpp:298
[task 2021-11-11T17:21:33.293Z] 17:21:33     INFO - GECKO(2852) | Waiting in WillDestroyCurrentMessageLoop for pid 1384
[task 2021-11-11T17:22:31.072Z] 17:22:31     INFO - ==> process 7628 launched child process 1384 ("Z:\task_163664756302097\build\application\firefox\firefox.exe" -contentproc --channel="7628.193.1023192696\825889016" -childID 97 -isForBrowser -prefsHandle 5620 -prefMapHandle 5616 -prefsLen 12080 -prefMapSize 255448 -jsInitHandle 1352 -jsInitLen 336328 -parentBuildID 20211111152544 -appDir "Z:\task_163664756302097\build\application\firefox\browser" - 7628 "\\.\pipe\gecko-crash-server-pipe.7628" 5732 2223b76b470 tab)
[task 2021-11-11T17:22:31.317Z] 17:22:31     INFO - zombiecheck | Checking for orphan process with PID: 1384
[task 2021-11-11T17:22:38.878Z] 17:22:38     INFO - leakcheck | Processing leak log file C:\Users\task_163664756302097\AppData\Local\Temp\tmpc1fun4vn.mozrunner\runtests_leaks_tab_pid1384.log

Could it be that we end up with a race between one parent asking a child process to close and another parent creating a child process who get's assigned the very same process ID before we reach WillDestroyCurrentMessageLoop in the first parent ? IIUC, Windows does reuse process IDs. Still it should be guaranteed to not do this until the last handle to a process has not been freed, so it looks like this should not be possible. But given we are out of ideas...

Flags: needinfo?(continuation)
Whiteboard: [stockwell needswork:owner] [fission:m96?] → [stockwell needswork:owner][fission:m96?][stockwell disabled]

Hmm, are we sure that WillDestroyCurrentMessageLoop ever does what we expect it to do? The message we print is only conditional with an #ifdef and the existence of the process_ handle, but I see it appearing in the logs I looked at only when we actually block.

We call WillDestroyCurrentMessageLoop only if we had a DestructionObserver registered. So my doubt would be that the normal flow unregisters that observer before we get into MessageLoop::~MessageLoop and that if we get here with a registered observer we always block?

(In reply to Jens Stutte [:jstutte] from comment #57)

Hmm, are we sure that WillDestroyCurrentMessageLoop ever does what we expect it to do? The message we print is only conditional with an #ifdef and the existence of the process_ handle, but I see it appearing in the logs I looked at only when we actually block.

We call WillDestroyCurrentMessageLoop only if we had a DestructionObserver registered. So my doubt would be that the normal flow unregisters that observer before we get into MessageLoop::~MessageLoop and that if we get here with a registered observer we always block?

Yeah, my guess is that usually processes should go away by the time we get to WillDestroyCurrentMessageLoop, so we're probably only going to hit this logging code in weird cases.

As for the PID reuse, I'm not sure. When looking at the bc5 test failure (that is now disabled), I saw that the PID of the process occurred a few times in the log in some cases but not others. Though the failures might not all be happening for the same reason.

Flags: needinfo?(continuation)
See Also: → 1537679
See Also: → 1724337
See Also: → 1743362

(In reply to Andrew McCreight [:mccr8] from comment #58)

(In reply to Jens Stutte [:jstutte] from comment #57)

Hmm, are we sure that WillDestroyCurrentMessageLoop ever does what we expect it to do? The message we print is only conditional with an #ifdef and the existence of the process_ handle, but I see it appearing in the logs I looked at only when we actually block.

We call WillDestroyCurrentMessageLoop only if we had a DestructionObserver registered. So my doubt would be that the normal flow unregisters that observer before we get into MessageLoop::~MessageLoop and that if we get here with a registered observer we always block?

Yeah, my guess is that usually processes should go away by the time we get to WillDestroyCurrentMessageLoop, so we're probably only going to hit this logging code in weird cases.

As for the PID reuse, I'm not sure. When looking at the bc5 test failure (that is now disabled), I saw that the PID of the process occurred a few times in the log in some cases but not others. Though the failures might not all be happening for the same reason.

FWIW, I filed bug 1743362 and I'd want to move all these investigations to bug 1724337 and children.

Looks like this went to 0 on Nov 22, which was a good 10 days after the disabling patch landed, so something else either fixed this or influenced timing. Should we try reverting the disabling patch, and/or do we know what "fixed" this?

Flags: needinfo?(jstutte)

Superseded by bug 1742677 (and back to RunWatchdog as signature which had been tracked with 1358818 before).

(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #63)

Superseded by bug 1742677 (and back to RunWatchdog as signature which had been tracked with 1358818 before).

The generic RunWatchdog case has many different possible causes. One is the WillDestroyCurrentMessageLoop cause we want to investigate in bug 1724337 and children.

There are many other possible causes, in the current list of bug 1742677 I see mainly the SpinEventLoopUntil("CompositorThreadHolder::Shutdown", ...) case which is actually bug 1712136 and is happening also in the wild.

We cannot keep these cases separate as in crash-stats if they happen as intermittents on our CI ?

Flags: needinfo?(jstutte)
You need to log in before you can comment on or make changes to this bug.