Closed Bug 1396112 Opened 7 years ago Closed 6 years ago

Intermittent marionette.py | application crashed [@ AslHashFree + 0x35028]

Categories

(Core :: IPC, defect, P5)

defect

Tracking

()

RESOLVED INCOMPLETE

People

(Reporter: intermittent-bug-filer, Unassigned)

Details

(Keywords: crash, intermittent-failure)

Crash Data

There are a lot of threads which all seem to try to call `SdbResolveDatabase` but only one with nearly no calling stack crashed:

17:53:52     INFO -  Thread 0 (crashed)
17:53:52     INFO -   0  ntdll.dll!AslHashFree + 0x35028
17:53:52     INFO -      rax = 0x0000000973bfcd01   rdx = 0x0000000000000003
17:53:52     INFO -      rcx = 0x0000000973bfd1d8   rbx = 0x0000000000000000
17:53:52     INFO -      rsi = 0x0000000000000000   rdi = 0x0000000000000000
17:53:52     INFO -      rbp = 0x0000000973bff030   rsp = 0x0000000973bfefd8
17:53:52     INFO -       r8 = 0x0000000000000000    r9 = 0x00000160ae016160
17:53:52     INFO -      r10 = 0x0000000000000002   r11 = 0x0000000973bfcd20
17:53:52     INFO -      r12 = 0x0000000000000000   r13 = 0x00000160aab92800
17:53:52     INFO -      r14 = 0x00000160aab3d410   r15 = 0x00000160aab3d448
17:53:52     INFO -      rip = 0x00007ffc03e28be4
17:53:52     INFO -      Found by: given as instruction pointer in context
17:53:52     INFO -   1  ntdll.dll!SdbResolveDatabase + 0x23d
17:53:52     INFO -      rbp = 0x0000000973bff030   rsp = 0x0000000973bfefe0
17:53:52     INFO -      rip = 0x00007ffc03df01b9
17:53:52     INFO -      Found by: stack scanning

Starting with thread 20 we have some references to the image library:

17:53:52     INFO -  Thread 20
17:53:52     INFO -   0  ntdll.dll!AslHashFree + 0x35028
17:53:52     INFO -      rax = 0x00007ffbcbf19e80   rdx = 0x0000000000000001
17:53:52     INFO -      rcx = 0x00000160aff3ec00   rbx = 0x0000000000000000
17:53:52     INFO -      rsi = 0x0000000000000000   rdi = 0x00000009771ff4f0
17:53:52     INFO -      rbp = 0x00000009765ff0d0   rsp = 0x00000009765ff078
17:53:52     INFO -       r8 = 0x0000000000000002    r9 = 0x00000160aab3d448
17:53:52     INFO -      r10 = 0x0000000000000000   r11 = 0x00000160aaa00198
17:53:52     INFO -      r12 = 0x0000000000000000   r13 = 0x0000000000000000
17:53:52     INFO -      r14 = 0x00000160ae0ad610   r15 = 0x00000160ae0ad648
17:53:52     INFO -      rip = 0x00007ffc03e28be4
17:53:52     INFO -      Found by: given as instruction pointer in context
17:53:52     INFO -   1  ntdll.dll!SdbResolveDatabase + 0x23d
17:53:52     INFO -      rbp = 0x00000009765ff0d0   rsp = 0x00000009765ff080
17:53:52     INFO -      rip = 0x00007ffc03df01b9
17:53:52     INFO -      Found by: stack scanning
17:53:52     INFO -   2  0x160ad9ecaa0
17:53:52     INFO -      rbp = 0x00000009765ff0d0   rsp = 0x00000009765ff140
17:53:52     INFO -      rip = 0x00000160ad9ecaa0
17:53:52     INFO -      Found by: call frame info
17:53:52     INFO -   3  xul.dll!mozilla::image::DecodePoolImpl::PopWork() [DecodePool.cpp:a0bef0340a9d : 127 + 0x13]
17:53:52     INFO -      rsp = 0x00000009765ff150   rip = 0x00007ffbc9b330c0
17:53:52     INFO -      Found by: stack scanning

Markus, do you have an idea if this could be all image decoding related? The huge number of threads is a bit worrying.
Flags: needinfo?(mstange)
I don't think this is image decoding related. I think all threads are idle, in both the content and the parent process, but somehow the content process times out during shutdown and we kill it. This generates two crash reports in the log; first a dump from the parent process (*-browser.dmp) and then a dump from the killed content process.

I looked at this run: https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=127808781&lineNumber=968

Unfortunately it looks like there's some problem with stackwalking and symbolication. The functions AslHashFree and SdbResolveDatabase are supposed to be from ntdll.dll, whose breakpadID (according to the .extra artifact) is 41D57AE3B178CDF490C0B78265127E9B1, so it's symbol dump is here:
https://symbols.mozilla.org/ntdll.pdb/41D57AE3B178CDF490C0B78265127E9B1/ntdll.sym
However, this dump contains neither the string "AslHashFree" nor the string "SdbResolveDatabase", so something must have gone wrong.
In the .extra artifact, there's one thread, and its leaf frame has "ip":"0x7ffc03e28be4", which is inside the mapping for ntdll.dll which is mapped at base address 0x7ffc03d80000, and 0x7ffc03e28be4 - 0x7ffc03d80000 = 0xA8BE4, and the closest symbol in the symbol for that offset is at a8bd0, which is "NtWaitForAlertByThreadId".

In some of the threads, you can see a few relevant functions a bit up the callstack, which eventually call into AslHashFree. One of those functions is mozilla::BackgroundHangManager::RunMonitorThread() at line 288. That line is:
    nsresult rv = autoLock.Wait(waitTime);

So all the threads in AslHashFree are probably just blocking on a lock or waiting on a condition variable.


The next step in debugging this would probably be to load the crash dump from the content process into Visual Studio, in order to find out what the main thread was doing when the content process was killed. It's possible that Visual Studio will be able to get a better stack than what's shown in the log.
The content process crash dump is at https://queue.taskcluster.net/v1/task/YzTctJK3QR2PiijMDmaNVA/runs/0/artifacts/public/test_info/459f1270-b8b9-46e0-a446-9b7acc383d5c.dmp .
Flags: needinfo?(mstange)
(In reply to Markus Stange [:mstange] from comment #3)
> The next step in debugging this would probably be to load the crash dump
> from the content process into Visual Studio, in order to find out what the
> main thread was doing when the content process was killed. It's possible
> that Visual Studio will be able to get a better stack than what's shown in
> the log.
> The content process crash dump is at
> https://queue.taskcluster.net/v1/task/YzTctJK3QR2PiijMDmaNVA/runs/0/
> artifacts/public/test_info/459f1270-b8b9-46e0-a446-9b7acc383d5c.dmp .

Thank you Markus. Maybe we should wait some more days and see how often this crash happens, or if all this is a red herring based on the issue below?

> symbolication. The functions AslHashFree and SdbResolveDatabase are supposed
> to be from ntdll.dll, whose breakpadID (according to the .extra artifact) is
> 41D57AE3B178CDF490C0B78265127E9B1, so it's symbol dump is here:
> https://symbols.mozilla.org/ntdll.pdb/41D57AE3B178CDF490C0B78265127E9B1/
> ntdll.sym
> However, this dump contains neither the string "AslHashFree" nor the string
> "SdbResolveDatabase", so something must have gone wrong.

Ted, any idea what could have gone wrong here?
Flags: needinfo?(ted)
I downloaded the content process crash and the symbols for this build:
https://public-artifacts.taskcluster.net/YzTctJK3QR2PiijMDmaNVA/0/public/test_info/459f1270-b8b9-46e0-a446-9b7acc383d5c.dmp
https://queue.taskcluster.net/v1/task/eTXqEQrDTnq5Mrvn3FIpfQ/artifacts/public/build/target.crashreporter-symbols.zip

and ran them through the same stackwalking tool we use in CI locally:
http://hg.mozilla.org/users/tmielczarek_mozilla.com/stackwalk-http/

and it gave me more sensible-looking results:
https://gist.github.com/9852175fda47c7028f21c63bb20f89f4

I'm not sure what happened in the run in automation. Somehow the stackwalker wound up using symbols from an entirely different library. From googling, `SdbResolveDatabase` is from apphelp.dll. Unfortunately we pipe the stackwalker's stderr to /dev/null in mozcrash, so I don't have enough information to say anything more helpful. :-(
Flags: needinfo?(ted)
Thank you Ted! Given that the crash didn't happen for all the last month, I'm going to close this bug as WFM now.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WORKSFORME
Crash Signature: [@ AslHashFree + 0x35028] → [@ AslHashFree + 0x35028] [@ AslHashFree + 0x355ce]
Maybe that is the same or not, but it has a bit more information in the stacktrace:

23:44:47     INFO -  Thread 1 (crashed)
23:44:47     INFO -   0  ntdll.dll!AslHashFree + 0x355ce
23:44:47     INFO -      rax = 0x0000000000000031   rdx = 0x00007ff883ed486f
23:44:47     INFO -      rcx = 0x00007ff802000002   rbx = 0x000002a105653700
23:44:47     INFO -      rsi = 0x0000000000000000   rdi = 0x00000000fffffffa
23:44:47     INFO -      rbp = 0x000000e4143ff7c0   rsp = 0x000000e4143ff5c0
23:44:47     INFO -       r8 = 0x00000000000001b0    r9 = 0x0000000000000000
23:44:47     INFO -      r10 = 0x000002a1000e5380   r11 = 0x0000000001c50031
23:44:47     INFO -      r12 = 0x0000000000000000   r13 = 0x0000000000000000
23:44:47     INFO -      r14 = 0x0000000000000000   r15 = 0x0000000000000000
23:44:47     INFO -      rip = 0x00007ff883f5918a
23:44:47     INFO -      Found by: given as instruction pointer in context
23:44:47     INFO -   1  ntdll.dll!AslHashFree + 0x355ce
23:44:47     INFO -      rbp = 0x000000e4143ff7c0   rsp = 0x000000e4143ff5f8
23:44:47     INFO -      rip = 0x00007ff883f5918a
23:44:47     INFO -      Found by: stack scanning
23:44:47     INFO -   2  xul.dll!IPC::Channel::ChannelImpl::OnIOCompleted(base::MessagePumpForIO::IOContext *,unsigned long,unsigned long) [ipc_channel_win.cc:b5495d7f42a6 : 544 + 0x15]
23:44:47     INFO -      rbp = 0x000000e4143ff7c0   rsp = 0x000000e4143ff610
23:44:47     INFO -      rip = 0x00007ff85351d76b
23:44:47     INFO -      Found by: stack scanning

So it looks to be IPC related.
Status: REOPENED → NEW
Component: Marionette → IPC
Product: Testing → Core
Version: Version 3 → unspecified
https://wiki.mozilla.org/Bug_Triage#Intermittent_Test_Failure_Cleanup
Status: NEW → RESOLVED
Closed: 7 years ago6 years ago
Resolution: --- → INCOMPLETE
You need to log in before you can comment on or make changes to this bug.