
Intermittent marionette.py | application crashed [@ AslHashFree + 0x35028]

Status: NEW, Unassigned
Component: Testing :: Marionette
Priority: P5
Severity: critical
Reported: 22 days ago
Last updated: 12 days ago
Reporter: Treeherder Bug Filer
Keywords: crash, intermittent-failure
Version: 3
Points: ---
Firefox Tracking Flags: (Not tracked)
Crash signature: [@ AslHashFree + 0x35028]

Description

22 days ago
Filed by: wkocher [at] mozilla.com (treeherder)

https://treeherder.mozilla.org/logviewer.html#?job_id=127808781&repo=autoland

https://queue.taskcluster.net/v1/task/YzTctJK3QR2PiijMDmaNVA/runs/0/artifacts/public/logs/live_backing.log

Comment 1

19 days ago
2 failures in 939 pushes (0.002 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* mozilla-inbound: 1
* autoland: 1

Platform breakdown:
* windows10-64: 2

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1396112&startday=2017-08-28&endday=2017-09-03&tree=all
Comment 2

There are a lot of threads which all seem to call `SdbResolveDatabase`, but only one, with almost no call stack, crashed:

17:53:52     INFO -  Thread 0 (crashed)
17:53:52     INFO -   0  ntdll.dll!AslHashFree + 0x35028
17:53:52     INFO -      rax = 0x0000000973bfcd01   rdx = 0x0000000000000003
17:53:52     INFO -      rcx = 0x0000000973bfd1d8   rbx = 0x0000000000000000
17:53:52     INFO -      rsi = 0x0000000000000000   rdi = 0x0000000000000000
17:53:52     INFO -      rbp = 0x0000000973bff030   rsp = 0x0000000973bfefd8
17:53:52     INFO -       r8 = 0x0000000000000000    r9 = 0x00000160ae016160
17:53:52     INFO -      r10 = 0x0000000000000002   r11 = 0x0000000973bfcd20
17:53:52     INFO -      r12 = 0x0000000000000000   r13 = 0x00000160aab92800
17:53:52     INFO -      r14 = 0x00000160aab3d410   r15 = 0x00000160aab3d448
17:53:52     INFO -      rip = 0x00007ffc03e28be4
17:53:52     INFO -      Found by: given as instruction pointer in context
17:53:52     INFO -   1  ntdll.dll!SdbResolveDatabase + 0x23d
17:53:52     INFO -      rbp = 0x0000000973bff030   rsp = 0x0000000973bfefe0
17:53:52     INFO -      rip = 0x00007ffc03df01b9
17:53:52     INFO -      Found by: stack scanning

Starting with thread 20, we have some references to the image library:

17:53:52     INFO -  Thread 20
17:53:52     INFO -   0  ntdll.dll!AslHashFree + 0x35028
17:53:52     INFO -      rax = 0x00007ffbcbf19e80   rdx = 0x0000000000000001
17:53:52     INFO -      rcx = 0x00000160aff3ec00   rbx = 0x0000000000000000
17:53:52     INFO -      rsi = 0x0000000000000000   rdi = 0x00000009771ff4f0
17:53:52     INFO -      rbp = 0x00000009765ff0d0   rsp = 0x00000009765ff078
17:53:52     INFO -       r8 = 0x0000000000000002    r9 = 0x00000160aab3d448
17:53:52     INFO -      r10 = 0x0000000000000000   r11 = 0x00000160aaa00198
17:53:52     INFO -      r12 = 0x0000000000000000   r13 = 0x0000000000000000
17:53:52     INFO -      r14 = 0x00000160ae0ad610   r15 = 0x00000160ae0ad648
17:53:52     INFO -      rip = 0x00007ffc03e28be4
17:53:52     INFO -      Found by: given as instruction pointer in context
17:53:52     INFO -   1  ntdll.dll!SdbResolveDatabase + 0x23d
17:53:52     INFO -      rbp = 0x00000009765ff0d0   rsp = 0x00000009765ff080
17:53:52     INFO -      rip = 0x00007ffc03df01b9
17:53:52     INFO -      Found by: stack scanning
17:53:52     INFO -   2  0x160ad9ecaa0
17:53:52     INFO -      rbp = 0x00000009765ff0d0   rsp = 0x00000009765ff140
17:53:52     INFO -      rip = 0x00000160ad9ecaa0
17:53:52     INFO -      Found by: call frame info
17:53:52     INFO -   3  xul.dll!mozilla::image::DecodePoolImpl::PopWork() [DecodePool.cpp:a0bef0340a9d : 127 + 0x13]
17:53:52     INFO -      rsp = 0x00000009765ff150   rip = 0x00007ffbc9b330c0
17:53:52     INFO -      Found by: stack scanning

Markus, do you have an idea whether this could all be image-decoding related? The huge number of threads is a bit worrying.
Flags: needinfo?(mstange)
Comment 3

I don't think this is image decoding related. I think all threads are idle, in both the content and the parent process, but somehow the content process times out during shutdown and we kill it. This generates two crash reports in the log: first a dump from the parent process (*-browser.dmp) and then a dump from the killed content process.

I looked at this run: https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=127808781&lineNumber=968

Unfortunately, it looks like there's some problem with stackwalking and symbolication. The functions AslHashFree and SdbResolveDatabase are supposed to be from ntdll.dll, whose breakpadID (according to the .extra artifact) is 41D57AE3B178CDF490C0B78265127E9B1, so its symbol dump is here:
https://symbols.mozilla.org/ntdll.pdb/41D57AE3B178CDF490C0B78265127E9B1/ntdll.sym
However, this dump contains neither the string "AslHashFree" nor the string "SdbResolveDatabase", so something must have gone wrong.
In the .extra artifact, there's one thread whose leaf frame has "ip":"0x7ffc03e28be4". That address is inside the mapping for ntdll.dll, which is mapped at base address 0x7ffc03d80000, so the module-relative offset is 0x7ffc03e28be4 - 0x7ffc03d80000 = 0xA8BE4. The closest preceding symbol in the symbol file for that offset is at a8bd0, which is "NtWaitForAlertByThreadId".
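The lookup described above (subtract the module base from the instruction pointer, then take the nearest preceding PUBLIC symbol in the Breakpad .sym file) can be sketched like this. The `SYM_SAMPLE` content below is a made-up fragment for illustration, not the real ntdll.sym:

```python
# Minimal sketch: map an instruction pointer to the nearest preceding
# PUBLIC symbol in a Breakpad .sym file. SYM_SAMPLE is a fabricated
# fragment; only the record layout matches the real format.

SYM_SAMPLE = """\
MODULE windows x86_64 41D57AE3B178CDF490C0B78265127E9B1 ntdll.pdb
PUBLIC a8b90 0 SomeEarlierFunction
PUBLIC a8bd0 0 NtWaitForAlertByThreadId
PUBLIC a9000 0 SomeLaterFunction
"""

def parse_symbols(sym_text):
    """Collect (offset, name) pairs from PUBLIC records, sorted by offset."""
    symbols = []
    for line in sym_text.splitlines():
        parts = line.split()
        # PUBLIC record layout: PUBLIC <address> <stack-param-size> <name>
        if parts and parts[0] == "PUBLIC":
            symbols.append((int(parts[1], 16), parts[3]))
    return sorted(symbols)

def symbolize(ip, module_base, symbols):
    """Return ((name, delta), offset) for the nearest symbol at or below ip - base."""
    offset = ip - module_base
    best = None
    for addr, name in symbols:
        if addr <= offset:
            best = (name, offset - addr)
    return best, offset

symbols = parse_symbols(SYM_SAMPLE)
best, offset = symbolize(0x7ffc03e28be4, 0x7ffc03d80000, symbols)
print(hex(offset), best)  # 0xa8be4 ('NtWaitForAlertByThreadId', 20)
```

This is the same arithmetic the comment does by hand: 0xA8BE4 lands 0x14 bytes past the a8bd0 entry, so the frame symbolizes to NtWaitForAlertByThreadId.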

In some of the threads, you can see a few relevant functions a bit up the callstack, which eventually call into AslHashFree. One of those functions is mozilla::BackgroundHangManager::RunMonitorThread() at line 288. That line is:
    nsresult rv = autoLock.Wait(waitTime);

So all the threads in AslHashFree are probably just blocking on a lock or waiting on a condition variable.


The next step in debugging this would probably be to load the crash dump from the content process into Visual Studio, in order to find out what the main thread was doing when the content process was killed. It's possible that Visual Studio will be able to get a better stack than what's shown in the log.
The content process crash dump is at https://queue.taskcluster.net/v1/task/YzTctJK3QR2PiijMDmaNVA/runs/0/artifacts/public/test_info/459f1270-b8b9-46e0-a446-9b7acc383d5c.dmp .
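Before loading the dump into Visual Studio, it can be worth sanity-checking that the downloaded artifact is actually a minidump. This is a hedged sketch: minidump files start with the 4-byte signature b"MDMP", and the `sample` bytes below are fabricated for illustration, not taken from the real dump:

```python
# Sketch: verify the MINIDUMP_HEADER signature of a downloaded .dmp file.
# b"MDMP" read as a little-endian uint32 is 0x504D444D.
import struct

def check_minidump_header(data):
    """Return True if the buffer starts with a valid minidump signature."""
    if len(data) < 8:
        return False
    signature, version = struct.unpack_from("<II", data, 0)
    return signature == 0x504D444D

# In practice you'd fetch the dump first, e.g.:
#   import urllib.request
#   data = urllib.request.urlopen(dump_url).read()
sample = b"MDMP" + struct.pack("<I", 0xA793) + b"\x00" * 24
print(check_minidump_header(sample))  # True
```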
Flags: needinfo?(mstange)
Comment 4

(In reply to Markus Stange [:mstange] from comment #3)
> The next step in debugging this would probably be to load the crash dump
> from the content process into Visual Studio, in order to find out what the
> main thread was doing when the content process was killed. It's possible
> that Visual Studio will be able to get a better stack than what's shown in
> the log.
> The content process crash dump is at
> https://queue.taskcluster.net/v1/task/YzTctJK3QR2PiijMDmaNVA/runs/0/
> artifacts/public/test_info/459f1270-b8b9-46e0-a446-9b7acc383d5c.dmp .

Thank you, Markus. Maybe we should wait a few more days and see how often this crash happens, or whether all of this is a red herring caused by the issue below?

> symbolication. The functions AslHashFree and SdbResolveDatabase are supposed
> to be from ntdll.dll, whose breakpadID (according to the .extra artifact) is
> 41D57AE3B178CDF490C0B78265127E9B1, so it's symbol dump is here:
> https://symbols.mozilla.org/ntdll.pdb/41D57AE3B178CDF490C0B78265127E9B1/
> ntdll.sym
> However, this dump contains neither the string "AslHashFree" nor the string
> "SdbResolveDatabase", so something must have gone wrong.

Ted, any idea what could have gone wrong here?
Flags: needinfo?(ted)

Comment 5

12 days ago
4 failures in 924 pushes (0.004 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* autoland: 2
* try: 1
* mozilla-inbound: 1

Platform breakdown:
* windows10-64: 4

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1396112&startday=2017-09-04&endday=2017-09-10&tree=all