Closed
Bug 1396112
Opened 7 years ago
Closed 6 years ago
Intermittent marionette.py | application crashed [@ AslHashFree + 0x35028]
Categories
(Core :: IPC, defect, P5)
Core
IPC
Tracking
()
RESOLVED
INCOMPLETE
People
(Reporter: intermittent-bug-filer, Unassigned)
Details
(Keywords: crash, intermittent-failure)
Crash Data
Filed by: wkocher [at] mozilla.com https://treeherder.mozilla.org/logviewer.html#?job_id=127808781&repo=autoland https://queue.taskcluster.net/v1/task/YzTctJK3QR2PiijMDmaNVA/runs/0/artifacts/public/logs/live_backing.log
Comment hidden (Intermittent Failures Robot) |
Comment 2•7 years ago
|
||
There are a lot of threads which all seem to try to call `SdbResolveDatabase` but only one with nearly no calling stack crashed: 17:53:52 INFO - Thread 0 (crashed) 17:53:52 INFO - 0 ntdll.dll!AslHashFree + 0x35028 17:53:52 INFO - rax = 0x0000000973bfcd01 rdx = 0x0000000000000003 17:53:52 INFO - rcx = 0x0000000973bfd1d8 rbx = 0x0000000000000000 17:53:52 INFO - rsi = 0x0000000000000000 rdi = 0x0000000000000000 17:53:52 INFO - rbp = 0x0000000973bff030 rsp = 0x0000000973bfefd8 17:53:52 INFO - r8 = 0x0000000000000000 r9 = 0x00000160ae016160 17:53:52 INFO - r10 = 0x0000000000000002 r11 = 0x0000000973bfcd20 17:53:52 INFO - r12 = 0x0000000000000000 r13 = 0x00000160aab92800 17:53:52 INFO - r14 = 0x00000160aab3d410 r15 = 0x00000160aab3d448 17:53:52 INFO - rip = 0x00007ffc03e28be4 17:53:52 INFO - Found by: given as instruction pointer in context 17:53:52 INFO - 1 ntdll.dll!SdbResolveDatabase + 0x23d 17:53:52 INFO - rbp = 0x0000000973bff030 rsp = 0x0000000973bfefe0 17:53:52 INFO - rip = 0x00007ffc03df01b9 17:53:52 INFO - Found by: stack scanning Starting with thread 20 we have some references to the image library: 17:53:52 INFO - Thread 20 17:53:52 INFO - 0 ntdll.dll!AslHashFree + 0x35028 17:53:52 INFO - rax = 0x00007ffbcbf19e80 rdx = 0x0000000000000001 17:53:52 INFO - rcx = 0x00000160aff3ec00 rbx = 0x0000000000000000 17:53:52 INFO - rsi = 0x0000000000000000 rdi = 0x00000009771ff4f0 17:53:52 INFO - rbp = 0x00000009765ff0d0 rsp = 0x00000009765ff078 17:53:52 INFO - r8 = 0x0000000000000002 r9 = 0x00000160aab3d448 17:53:52 INFO - r10 = 0x0000000000000000 r11 = 0x00000160aaa00198 17:53:52 INFO - r12 = 0x0000000000000000 r13 = 0x0000000000000000 17:53:52 INFO - r14 = 0x00000160ae0ad610 r15 = 0x00000160ae0ad648 17:53:52 INFO - rip = 0x00007ffc03e28be4 17:53:52 INFO - Found by: given as instruction pointer in context 17:53:52 INFO - 1 ntdll.dll!SdbResolveDatabase + 0x23d 17:53:52 INFO - rbp = 0x00000009765ff0d0 rsp = 0x00000009765ff080 17:53:52 INFO - rip = 0x00007ffc03df01b9 17:53:52 INFO - Found by: stack scanning 17:53:52 INFO - 2 0x160ad9ecaa0 17:53:52 INFO - rbp = 0x00000009765ff0d0 rsp = 0x00000009765ff140 17:53:52 INFO - rip = 0x00000160ad9ecaa0 17:53:52 INFO - Found by: call frame info 17:53:52 INFO - 3 xul.dll!mozilla::image::DecodePoolImpl::PopWork() [DecodePool.cpp:a0bef0340a9d : 127 + 0x13] 17:53:52 INFO - rsp = 0x00000009765ff150 rip = 0x00007ffbc9b330c0 17:53:52 INFO - Found by: stack scanning Markus, do you have an idea if this could be all image decoding related? The huge number of threads is a bit worrying.
Flags: needinfo?(mstange)
Comment 3•7 years ago
|
||
I don't think this is image decoding related. I think all threads are idle, in both the content and the parent process, but somehow the content process times out during shutdown and we kill it. This generates two crash reports in the log; first a dump from the parent process (*-browser.dmp) and then a dump from the killed content process. I looked at this run: https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=127808781&lineNumber=968 Unfortunately it looks like there's some problem with stackwalking and symbolication. The functions AslHashFree and SdbResolveDatabase are supposed to be from ntdll.dll, whose breakpadID (according to the .extra artifact) is 41D57AE3B178CDF490C0B78265127E9B1, so it's symbol dump is here: https://symbols.mozilla.org/ntdll.pdb/41D57AE3B178CDF490C0B78265127E9B1/ntdll.sym However, this dump contains neither the string "AslHashFree" nor the string "SdbResolveDatabase", so something must have gone wrong. In the .extra artifact, there's one thread, and its leaf frame has "ip":"0x7ffc03e28be4", which is inside the mapping for ntdll.dll which is mapped at base address 0x7ffc03d80000, and 0x7ffc03e28be4 - 0x7ffc03d80000 = 0xA8BE4, and the closest symbol in the symbol for that offset is at a8bd0, which is "NtWaitForAlertByThreadId". In some of the threads, you can see a few relevant functions a bit up the callstack, which eventually call into AslHashFree. One of those functions is mozilla::BackgroundHangManager::RunMonitorThread() at line 288. That line is: nsresult rv = autoLock.Wait(waitTime); So all the threads in AslHashFree are probably just blocking on a lock or waiting on a condition variable. The next step in debugging this would probably be to load the crash dump from the content process into Visual Studio, in order to find out what the main thread was doing when the content process was killed. It's possible that Visual Studio will be able to get a better stack than what's shown in the log. The content process crash dump is at https://queue.taskcluster.net/v1/task/YzTctJK3QR2PiijMDmaNVA/runs/0/artifacts/public/test_info/459f1270-b8b9-46e0-a446-9b7acc383d5c.dmp .
Flags: needinfo?(mstange)
Comment 4•7 years ago
|
||
(In reply to Markus Stange [:mstange] from comment #3) > The next step in debugging this would probably be to load the crash dump > from the content process into Visual Studio, in order to find out what the > main thread was doing when the content process was killed. It's possible > that Visual Studio will be able to get a better stack than what's shown in > the log. > The content process crash dump is at > https://queue.taskcluster.net/v1/task/YzTctJK3QR2PiijMDmaNVA/runs/0/ > artifacts/public/test_info/459f1270-b8b9-46e0-a446-9b7acc383d5c.dmp . Thank you Markus. Maybe we should wait some more days and see how often this crash happens, or if all this is a red herring based on the issue below? > symbolication. The functions AslHashFree and SdbResolveDatabase are supposed > to be from ntdll.dll, whose breakpadID (according to the .extra artifact) is > 41D57AE3B178CDF490C0B78265127E9B1, so it's symbol dump is here: > https://symbols.mozilla.org/ntdll.pdb/41D57AE3B178CDF490C0B78265127E9B1/ > ntdll.sym > However, this dump contains neither the string "AslHashFree" nor the string > "SdbResolveDatabase", so something must have gone wrong. Ted, any idea what could have gone wrong here?
Flags: needinfo?(ted)
Comment hidden (Intermittent Failures Robot) |
Comment 6•7 years ago
|
||
I downloaded the content process crash and the symbols for this build: https://public-artifacts.taskcluster.net/YzTctJK3QR2PiijMDmaNVA/0/public/test_info/459f1270-b8b9-46e0-a446-9b7acc383d5c.dmp https://queue.taskcluster.net/v1/task/eTXqEQrDTnq5Mrvn3FIpfQ/artifacts/public/build/target.crashreporter-symbols.zip and ran them through the same stackwalking tool we use in CI locally: http://hg.mozilla.org/users/tmielczarek_mozilla.com/stackwalk-http/ and it gave me more sensible-looking results: https://gist.github.com/9852175fda47c7028f21c63bb20f89f4 I'm not sure what happened in the run in automation. Somehow the stackwalker wound up using symbols from an entirely different library. From googling, `SdbResolveDatabase` is from apphelp.dll. Unfortunately we pipe the stackwalker's stderr to /dev/null in mozcrash, so I don't have enough information to say anything more helpful. :-(
Flags: needinfo?(ted)
Comment 7•7 years ago
|
||
Thank you Ted! Given that the crash didn't happen for all the last month, I'm going to close this bug as WFM now.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WORKSFORME
Comment 8•6 years ago
|
||
New occurrence: https://treeherder.mozilla.org/logviewer.html#?job_id=156729445&repo=autoland
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Updated•6 years ago
|
Crash Signature: [@ AslHashFree + 0x35028] → [@ AslHashFree + 0x35028]
[@ AslHashFree + 0x355ce]
Comment 9•6 years ago
|
||
Maybe that is the same or not, but it has a bit more information in the stacktrace: 23:44:47 INFO - Thread 1 (crashed) 23:44:47 INFO - 0 ntdll.dll!AslHashFree + 0x355ce 23:44:47 INFO - rax = 0x0000000000000031 rdx = 0x00007ff883ed486f 23:44:47 INFO - rcx = 0x00007ff802000002 rbx = 0x000002a105653700 23:44:47 INFO - rsi = 0x0000000000000000 rdi = 0x00000000fffffffa 23:44:47 INFO - rbp = 0x000000e4143ff7c0 rsp = 0x000000e4143ff5c0 23:44:47 INFO - r8 = 0x00000000000001b0 r9 = 0x0000000000000000 23:44:47 INFO - r10 = 0x000002a1000e5380 r11 = 0x0000000001c50031 23:44:47 INFO - r12 = 0x0000000000000000 r13 = 0x0000000000000000 23:44:47 INFO - r14 = 0x0000000000000000 r15 = 0x0000000000000000 23:44:47 INFO - rip = 0x00007ff883f5918a 23:44:47 INFO - Found by: given as instruction pointer in context 23:44:47 INFO - 1 ntdll.dll!AslHashFree + 0x355ce 23:44:47 INFO - rbp = 0x000000e4143ff7c0 rsp = 0x000000e4143ff5f8 23:44:47 INFO - rip = 0x00007ff883f5918a 23:44:47 INFO - Found by: stack scanning 23:44:47 INFO - 2 xul.dll!IPC::Channel::ChannelImpl::OnIOCompleted(base::MessagePumpForIO::IOContext *,unsigned long,unsigned long) [ipc_channel_win.cc:b5495d7f42a6 : 544 + 0x15] 23:44:47 INFO - rbp = 0x000000e4143ff7c0 rsp = 0x000000e4143ff610 23:44:47 INFO - rip = 0x00007ff85351d76b 23:44:47 INFO - Found by: stack scanning So it looks to be IPC related.
Status: REOPENED → NEW
Component: Marionette → IPC
Product: Testing → Core
Version: Version 3 → unspecified
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 12•6 years ago
|
||
https://wiki.mozilla.org/Bug_Triage#Intermittent_Test_Failure_Cleanup
Status: NEW → RESOLVED
Closed: 7 years ago → 6 years ago
Resolution: --- → INCOMPLETE
You need to log in
before you can comment on or make changes to this bug.
Description
•