Intermittent tsvgx | Found crashes after test run, terminating test

RESOLVED FIXED in Firefox 55

Status

Testing
Talos
RESOLVED FIXED
6 months ago
3 days ago

People

(Reporter: Treeherder Bug Filer, Assigned: dvander)

Tracking

({intermittent-failure})

Version 3
mozilla55
intermittent-failure

Firefox Tracking Flags

(firefox-esr52 unaffected, firefox53 unaffected, firefox54 unaffected, firefox55 fixed)

Details

(Whiteboard: [stockwell fixed])

Attachments

(2 attachments)

(Reporter)

Description

6 months ago
treeherder
Filed by: philringnalda [at] gmail.com

https://treeherder.mozilla.org/logviewer.html#?job_id=82590644&repo=autoland

https://archive.mozilla.org/pub/firefox/tinderbox-builds/autoland-win64/1489012583/autoland_win8_64_test-svgr-e10s-bm111-tests1-windows-build464.txt.gz
See Also: → bug 1345730

Comment 1

5 months ago
10 failures in 790 pushes (0.013 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* mozilla-central: 8
* mozilla-inbound: 1
* autoland: 1

Platform breakdown:
* windows8-64: 9
* windows7-32-vm: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1345735&startday=2017-03-06&endday=2017-03-12&tree=all

Comment 2

5 months ago
35 failures in 777 pushes (0.045 failures/push) were associated with this bug in the last 7 days.   

** This failure happened more than 30 times this week! Resolving this bug is a high priority. **

** Try to resolve this bug as soon as possible. If unresolved for 2 weeks, the affected test(s) may be disabled. ** 

Repository breakdown:
* autoland: 18
* mozilla-inbound: 14
* mozilla-central: 3

Platform breakdown:
* windows8-64: 35

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1345735&startday=2017-03-13&endday=2017-03-19&tree=all
Whiteboard: [stockwell needswork]
Talos, Windows 8, e10s crash (not timeout), bad report, as in several other bugs and under discussion in bug 1310638.
See Also: → bug 1310638
More than 90% of these failures have no crash report, but a few do! Here's one:

https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=85527944&lineNumber=2215

22:57:18     INFO -  PROCESS-CRASH | tsvgx | application crashed [@ CrashStatsLogForwarder::CrashAction(mozilla::gfx::LogReason)]
22:57:18     INFO -  Crash dump filename: c:\users\cltbld~1.t-w\appdata\local\temp\tmp3ohf2e\profile\minidumps\21b16005-da8d-466d-8185-18535b975da9.dmp
22:57:18     INFO -  Operating system: Windows NT
22:57:18     INFO -                    6.2.9200
22:57:18     INFO -  CPU: amd64
22:57:18     INFO -       family 6 model 30 stepping 5
22:57:18     INFO -       8 CPUs
22:57:18     INFO -  GPU: UNKNOWN
22:57:18     INFO -  Crash reason:  EXCEPTION_BREAKPOINT
22:57:18     INFO -  Crash address: 0x7ffb9101b63
22:57:18     INFO -  Assertion: Unknown assertion type 0x00000000
22:57:18     INFO -  Process uptime: 9 seconds
22:57:18     INFO -  Thread 9 (crashed)
22:57:18     INFO -   0  xul.dll!CrashStatsLogForwarder::CrashAction(mozilla::gfx::LogReason) [gfxPlatform.cpp:e0be781966d4 : 408 + 0x11]
22:57:18     INFO -      rax = 0x000007ffcb730a30   rdx = 0x0000000000000000
22:57:18     INFO -      rcx = 0x000007ffbb252fc0   rbx = 0x0000000957e3f3e0
22:57:18     INFO -      rsi = 0x0000000951929b20   rdi = 0x0000000000000004
22:57:18     INFO -      rbp = 0x0000000957e3f3b0   rsp = 0x0000000957e3f2e0
22:57:18     INFO -       r8 = 0x000007ffbba90960    r9 = 0x0000000000000190
22:57:18     INFO -      r10 = 0x00000009515985f1   r11 = 0x0000000957e3f210
22:57:18     INFO -      r12 = 0x0000000000000002   r13 = 0x0000000000000000
22:57:18     INFO -      r14 = 0x0000000088760873   r15 = 0x00000009519067a0
22:57:18     INFO -      rip = 0x000007ffb9101b63
22:57:18     INFO -      Found by: given as instruction pointer in context
22:57:18     INFO -   1  xul.dll!mozilla::gfx::CriticalLogger::CrashAction(mozilla::gfx::LogReason) [Factory.cpp:e0be781966d4 : 971 + 0x11]
22:57:18     INFO -      rbx = 0x0000000957e3f3e0   rbp = 0x0000000957e3f3b0
22:57:18     INFO -      rsp = 0x0000000957e3f310   r12 = 0x0000000000000002
22:57:18     INFO -      r13 = 0x0000000000000000   r14 = 0x0000000088760873
22:57:18     INFO -      r15 = 0x00000009519067a0   rip = 0x000007ffb8f822af
22:57:18     INFO -      Found by: call frame info
22:57:18     INFO -   2  xul.dll!mozilla::gfx::Log<1,mozilla::gfx::CriticalLogger>::Flush() [Logging.h:e0be781966d4 : 280 + 0x3b]
22:57:18     INFO -      rbx = 0x0000000957e3f3e0   rbp = 0x0000000957e3f3b0
22:57:18     INFO -      rsp = 0x0000000957e3f340   r12 = 0x0000000000000002
22:57:18     INFO -      r13 = 0x0000000000000000   r14 = 0x0000000088760873
22:57:18     INFO -      r15 = 0x00000009519067a0   rip = 0x000007ffb8c4361e
22:57:18     INFO -      Found by: call frame info
22:57:18     INFO -   3  xul.dll!mozilla::layers::SyncObjectD3D11::Init() [TextureD3D11.cpp:e0be781966d4 : 1235 + 0x44]
22:57:18     INFO -      rbx = 0x0000000957e3f3e0   rbp = 0x0000000957e3f3b0
22:57:18     INFO -      rsp = 0x0000000957e3f3c0   r12 = 0x0000000000000002
22:57:18     INFO -      r13 = 0x0000000000000000   r14 = 0x0000000088760873
22:57:18     INFO -      r15 = 0x00000009519067a0   rip = 0x000007ffb90b2e74
22:57:18     INFO -      Found by: call frame info
22:57:18     INFO -   4  xul.dll!mozilla::layers::SyncObjectD3D11::FinalizeFrame() [TextureD3D11.cpp:e0be781966d4 : 1274 + 0x5]
22:57:18     INFO -      rbx = 0x0000000957e3f3e0   rbp = 0x0000000957e3f3b0
22:57:18     INFO -      rsp = 0x0000000957e3f500   r12 = 0x0000000000000002
22:57:18     INFO -      r13 = 0x0000000000000000   r14 = 0x0000000088760873
22:57:18     INFO -      r15 = 0x00000009519067a0   rip = 0x000007ffb90b17eb
22:57:18     INFO -      Found by: call frame info
22:57:18     INFO -   5  xul.dll!mozilla::D3D11DXVA2Manager::CopyToImage(IMFSample *,mozilla::gfx::IntRectTyped<mozilla::gfx::UnknownUnits> const &,mozilla::layers::Image * *) [DXVA2Manager.cpp:e0be781966d4 : 948 + 0xa]
22:57:18     INFO -      rbx = 0x0000000957e3f3e0   rbp = 0x0000000957e3f3b0
22:57:18     INFO -      rsp = 0x0000000957e3f6b0   r12 = 0x0000000000000002
22:57:18     INFO -      r13 = 0x0000000000000000   r14 = 0x0000000088760873
22:57:18     INFO -      r15 = 0x00000009519067a0   rip = 0x000007ffb9a1e650
22:57:18     INFO -      Found by: call frame info
22:57:18     INFO -   6  xul.dll!mozilla::WMFVideoMFTManager::CreateD3DVideoFrame(IMFSample *,__int64,mozilla::VideoData * *) [WMFVideoMFTManager.cpp:e0be781966d4 : 931 + 0x1d]
22:57:18     INFO -      rbx = 0x0000000957e3f3e0   rbp = 0x0000000957e3f3b0
22:57:18     INFO -      rsp = 0x0000000957e3f7a0   r12 = 0x0000000000000002
22:57:18     INFO -      r13 = 0x0000000000000000   r14 = 0x0000000088760873
22:57:18     INFO -      r15 = 0x00000009519067a0   rip = 0x000007ffb9a1ef4d
22:57:18     INFO -      Found by: call frame info
22:57:18     INFO -   7  xul.dll!mozilla::WMFVideoMFTManager::Output(__int64,RefPtr<mozilla::MediaData> &) [WMFVideoMFTManager.cpp:e0be781966d4 : 1048 + 0x21]
22:57:18     INFO -      rbx = 0x0000000957e3f3e0   rbp = 0x0000000957e3f3b0
22:57:18     INFO -      rsp = 0x0000000957e3f850   r12 = 0x0000000000000002
22:57:18     INFO -      r13 = 0x0000000000000000   r14 = 0x0000000088760873
22:57:18     INFO -      r15 = 0x00000009519067a0   rip = 0x000007ffb9a22d6d
22:57:18     INFO -      Found by: call frame info
22:57:18     INFO -   8  xul.dll!mozilla::WMFMediaDataDecoder::ProcessOutput(nsTArray<RefPtr<mozilla::MediaData> > &) [WMFMediaDataDecoder.cpp:e0be781966d4 : 155 + 0x13]
22:57:18     INFO -      rbx = 0x0000000957e3f3e0   rbp = 0x0000000957e3f3b0
22:57:18     INFO -      rsp = 0x0000000957e3f950   r12 = 0x0000000000000002
22:57:18     INFO -      r13 = 0x0000000000000000   r14 = 0x0000000088760873
22:57:18     INFO -      r15 = 0x00000009519067a0   rip = 0x000007ffb9a2318f
22:57:18     INFO -      Found by: call frame info
22:57:18     INFO -   9  xul.dll!mozilla::WMFMediaDataDecoder::ProcessDrain() [WMFMediaDataDecoder.cpp:e0be781966d4 : 200 + 0x5]
22:57:18     INFO -      rbx = 0x0000000957e3f3e0   rbp = 0x0000000957e3f3b0
22:57:18     INFO -      rsp = 0x0000000957e3f980   r12 = 0x0000000000000002
22:57:18     INFO -      r13 = 0x0000000000000000   r14 = 0x0000000088760873
22:57:18     INFO -      r15 = 0x00000009519067a0   rip = 0x000007ffb9a22f79
22:57:18     INFO -      Found by: call frame info
Another, very similar: https://treeherder.mozilla.org/logviewer.html#?job_id=85750177&repo=try&lineNumber=2215
David - Can you have a look at the crash reports in comment 4 and 5? We see these crashes infrequently in Windows e10s Talos tests, apparently at shutdown time. (We also see *frequent* Windows e10s Talos crashes at shutdown time without any crash reports.  I'm hoping the SyncObjectD3D11 crash is the same issue...and that you can fix it, or tell us how to avoid these crashes!)
Flags: needinfo?(dvander)
See Also: → bug 1319557, bug 1338639

Comment 7

5 months ago
28 failures in 898 pushes (0.031 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* autoland: 16
* mozilla-inbound: 9
* mozilla-central: 3

Platform breakdown:
* windows8-64: 27
* windows7-32: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1345735&startday=2017-03-20&endday=2017-03-26&tree=all
See Also: → bug 1351818
See Also: → bug 1345724

Comment 8

5 months ago
43 failures in 845 pushes (0.051 failures/push) were associated with this bug in the last 7 days. 

This is the #38 most frequent failure this week.  

** This failure happened more than 30 times this week! Resolving this bug is a high priority. **

** Try to resolve this bug as soon as possible. If unresolved for 2 weeks, the affected test(s) may be disabled. ** 

Repository breakdown:
* autoland: 19
* mozilla-inbound: 16
* mozilla-central: 8

Platform breakdown:
* windows8-64: 41
* windows7-32: 2

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1345735&startday=2017-03-27&endday=2017-04-02&tree=all
:milan, can you help us find someone to look at this? We suspect this is the cause of a 25%+ failure rate in all Windows 8 Talos crashes. Geoff has outlined this nicely in comment 6.
Flags: needinfo?(milan)
If this is only happening on shutdown, that makes a lot of sense - media could be racing with shutting down D3D11 devices.

Can we tell what PID is crashing? If it's the GPU process, then this locking might serve no purpose and can just be removed. If it's content, we'll have to investigate more.
Flags: needinfo?(dvander) → needinfo?(gbrown)
It does appear that all of these crashes are on shutdown.

We cannot easily determine the PID, but the process type is available. There is a .extra file associated with minidumps, available as artifacts of the test job. In both crashes from comment 4 and comment 5, the .extra files show ProcessType=gpu.

(If we really need the PID, :ted says it is available and can be retrieved with code like https://github.com/mozilla/socorro/blob/master/minidump-stackwalk/stackwalker.cc#L1196. Thanks much :ted!)
Flags: needinfo?(gbrown)
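As a rough illustration of the .extra check described above, here is a minimal Python sketch. It assumes the .extra file is plain text with one key=value annotation per line; that layout is an assumption, not something confirmed in this bug. The helper names parse_extra and process_type are hypothetical.

```python
def parse_extra(text):
    """Parse key=value annotation lines into a dict.

    Assumption: one annotation per line, split on the first '='.
    Lines without '=' are ignored.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def process_type(text):
    """Return the ProcessType annotation, or None if it is absent."""
    return parse_extra(text).get("ProcessType")


if __name__ == "__main__":
    # Made-up sample content; only the ProcessType=gpu line mirrors
    # what the comment above reports.
    sample = "ProductName=Firefox\nProcessType=gpu\n"
    print(process_type(sample))
```

Run against each .extra artifact of a failing job, a check like this would separate GPU-process crashes from content-process crashes without needing the PID.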
:dvander, knowing this is a GPU process, can you look into removing the shutdown locking?
Flags: needinfo?(dvander)
I talked with Matt Woodrow and it sounds like we can remove the locking on AMD GPUs only. If our machines use nVidia/Intel we'll have to do something else... do you know what they run, Joel?
Flags: needinfo?(dvander) → needinfo?(jmaher)
our machines are documented here:
https://wiki.mozilla.org/Buildbot/Talos/Misc#Hardware_Profile_of_machines_used_in_automation

and for the linux machines:
iX21X4 2U Neutron "Gemini" Series Four Node Hot-Pluggable Server (4 nodes per 2U)
920W High-Efficiency redundant (1+1) Power Supply
1 Intel X3450 CPU per node
8GB Total: 2 x 4GB DDR3 1333Mhz ECC/REG RAM per node
1 WD5003ABYX hard drive per node
1 NVIDIA GPU GeForce GT 610 per node


We will have new machines later in the summer or fall; I believe they will have Intel graphics.

Glad to hear there is a path forward here.
Flags: needinfo?(jmaher)
Okay, so we can't drop the locking... I'll investigate further then.
Assignee: nobody → dvander
Status: NEW → ASSIGNED
Flags: needinfo?(milan)

Comment 16

5 months ago
49 failures in 867 pushes (0.057 failures/push) were associated with this bug in the last 7 days. 

This is the #36 most frequent failure this week.  

** This failure happened more than 30 times this week! Resolving this bug is a high priority. **

** Try to resolve this bug as soon as possible. If unresolved for 2 weeks, the affected test(s) may be disabled. ** 

Repository breakdown:
* autoland: 21
* mozilla-inbound: 16
* mozilla-central: 11
* try: 1

Platform breakdown:
* windows8-64: 48
* windows7-32: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1345735&startday=2017-04-03&endday=2017-04-09&tree=all
:dvander, checking in here- do you have any updates here?
Flags: needinfo?(dvander)
(In reply to Joel Maher ( :jmaher) from comment #17)
> :dvander, checking in here- do you have any updates here?

Sorry, I haven't had time to look at it yet. I can't seem to find any newer stack traces. The logs on brasstacks all say:

"16:53:12     INFO -  PROCESS-CRASH | tsvgx | application crashed [unknown top frame]"

There's no stack, and the minidump isn't parseable by Visual Studio. Did something break or am I looking in the wrong places?
Flags: needinfo?(dvander) → needinfo?(jmaher)
Recall comments 4 and 6: More than 90% of these Windows Talos shutdown crashes have no crash reports (usually there's a minidump, but it isn't parseable). You have to look through at least 20 logs before you find a stack.

Here's another one, a little different, from yesterday: https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=90753534&lineNumber=2213
Flags: needinfo?(jmaher)
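The manual triage described above (reading through many logs to find the few that carry a usable stack) could be partly automated. The following Python sketch assumes the PROCESS-CRASH line format shown in the log excerpts in this bug; classify_crash_lines is a hypothetical helper, not part of any Mozilla tool.

```python
import re

# Matches lines such as:
#   ... PROCESS-CRASH | tsvgx | application crashed [@ Some::Frame(args)]
#   ... PROCESS-CRASH | tsvgx | application crashed [unknown top frame]
CRASH_RE = re.compile(
    r"PROCESS-CRASH \| (?P<test>\S+) \| application crashed "
    r"\[(?P<frame>.*)\]\s*$"
)


def classify_crash_lines(lines):
    """Split crash lines into those with a top frame and those without.

    Returns (with_stack, without_stack), each a list of (test, frame).
    """
    with_stack, without_stack = [], []
    for line in lines:
        match = CRASH_RE.search(line)
        if match is None:
            continue  # not a PROCESS-CRASH line
        test, frame = match.group("test"), match.group("frame")
        if frame.startswith("@ "):
            with_stack.append((test, frame[2:]))
        else:
            without_stack.append((test, frame))
    return with_stack, without_stack
```

Feeding it the lines of each job log would show which logs are worth opening: only the entries in with_stack have a symbolicated top frame.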
Errr... why do so few runs not have crash reports? Is that intentional?
s/not//
Created attachment 8858182
part 1, don't create bad texture clients

I put a bunch of sleep() calls in right before video locks the SyncObject, to root out race conditions during shutdown. The first thing that hit was this: InitIPDLActor ends up getting a null forwarder.
Attachment #8858182 - Flags: review?(matt.woodrow)
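The debugging technique and the part 1 guard can be illustrated with a small Python stand-in. This is a sketch under stated assumptions, not the Gecko code: VideoBridge, create_texture_client, and the injected delay are all hypothetical names chosen for the example. The injected sleep widens the race window so that shutdown reliably wins, and the guard returns nothing instead of building an object against a torn-down bridge.

```python
import threading
import time


class VideoBridge:
    """Hypothetical stand-in for a resource that shutdown tears down."""

    def __init__(self):
        self._lock = threading.Lock()
        self._alive = True

    def shutdown(self):
        with self._lock:
            self._alive = False

    def create_texture_client(self):
        # Guard: refuse to create a client once the bridge is gone.
        with self._lock:
            if not self._alive:
                return None
            return object()


def decode_one_frame(bridge, delay, results):
    # Injected delay, for debugging only: makes the decoder thread
    # lose the race against shutdown on purpose.
    time.sleep(delay)
    results.append(bridge.create_texture_client())


def run_with_delay(delay):
    bridge = VideoBridge()
    results = []
    worker = threading.Thread(
        target=decode_one_frame, args=(bridge, delay, results)
    )
    worker.start()
    bridge.shutdown()  # main thread shuts down right away
    worker.join()
    return results[0]
```

With a delay of a few hundred milliseconds, the worker observes the shut-down bridge essentially every time, which is the point of the technique: a rare interleaving becomes the common one.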
Created attachment 8858183
part 2, don't dev-crash

This code actually asserted earlier than the talos crash, because CompositorBridgeChild shouldn't be used in the GPU process and definitely not off the main thread. This patch just skips the gfxDevCrash if we're not in the main thread.
Attachment #8858183 - Flags: review?(matt.woodrow)
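The shape of the part 2 change can be sketched in Python as well. Again this is only an illustration under assumptions: report_sync_failure and DevCrash are hypothetical names, and the real patch works on gfxDevCrash in C++. The idea shown is that a failure is fatal only when it is seen on the main thread; on any other thread it is recorded and the caller continues.

```python
import threading


class DevCrash(RuntimeError):
    """Hypothetical stand-in for a deliberate developer-build crash."""


def report_sync_failure(reason, on_main_thread=None, log=None):
    """Crash on the main thread; off the main thread, log and return False.

    on_main_thread may be passed explicitly (useful in tests); when it
    is None, the current thread is checked.
    """
    if on_main_thread is None:
        on_main_thread = (
            threading.current_thread() is threading.main_thread()
        )
    if on_main_thread:
        raise DevCrash(reason)
    if log is not None:
        log.append(reason)
    return False
```

A caller on a decoder thread would treat the False return as "skip this frame" rather than taking down the process during shutdown.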
See Also: → bug 1356445

Comment 24

4 months ago
35 failures in 894 pushes (0.039 failures/push) were associated with this bug in the last 7 days. 

This is the #49 most frequent failure this week.  

** This failure happened more than 30 times this week! Resolving this bug is a high priority. **

** Try to resolve this bug as soon as possible. If unresolved for 2 weeks, the affected test(s) may be disabled. ** 

Repository breakdown:
* mozilla-inbound: 14
* autoland: 14
* mozilla-central: 7

Platform breakdown:
* windows8-64: 35

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1345735&startday=2017-04-10&endday=2017-04-16&tree=all
Attachment #8858182 - Flags: review?(matt.woodrow) → review+
Attachment #8858183 - Flags: review?(matt.woodrow) → review+
Let me know if I can land these patches. Right now our #1 failure on the tree is a new Talos test (glvideo in bug 1356445), which is failing with this same pattern. In total, we have had 400+ failures in the last week related to what looks like this issue.

Comment 26

4 months ago
Pushed by danderson@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/d6348ea45c2d
Don't create TextureClients if the video bridge has shut down. (bug 1345735 part 1, r=mattwoodrow)
https://hg.mozilla.org/integration/mozilla-inbound/rev/7c5628d40478
Don't gfxDevCrash when video fails to acquire a SyncObject. (bug 1345735 part 2, r=mattwoodrow)
this looks to be fixed:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=win%20talos%20e10s%20x64%20g4&group_state=expanded&fromchange=61ff36c046fab1fc70772472371e509b363c3302&tochange=f1c0c24105685fe93c82601e48ec566ffe291528

thank you for the fix, review, and landing!

Comment 28

4 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/d6348ea45c2d
https://hg.mozilla.org/mozilla-central/rev/7c5628d40478
Status: ASSIGNED → RESOLVED
Last Resolved: 4 months ago
status-firefox55: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Duplicate of this bug: 1356445
Duplicate of this bug: 1345730
Duplicate of this bug: 1345724
Duplicate of this bug: 1345723
Duplicate of this bug: 1357512
Duplicate of this bug: 1342735
Whiteboard: [stockwell needswork] → [stockwell fixed]
Duplicate of this bug: 1310638

Comment 36

4 months ago
29 failures in 817 pushes (0.035 failures/push) were associated with this bug in the last 7 days. 

This is the #33 most frequent failure this week.  

Repository breakdown:
* autoland: 19
* mozilla-inbound: 7
* mozilla-central: 3

Platform breakdown:
* windows8-64: 29

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1345735&startday=2017-04-17&endday=2017-04-23&tree=all
status-firefox53: --- → unaffected
status-firefox54: --- → unaffected
status-firefox-esr52: --- → unaffected

Comment 37

4 months ago
7 failures in 883 pushes (0.008 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* mozilla-inbound: 4
* autoland: 3

Platform breakdown:
* windows7-32: 3
* osx-10-10: 3
* windows8-64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1345735&startday=2017-04-24&endday=2017-04-30&tree=all

Comment 38

2 months ago
1 failure in 892 pushes (0.001 failures/push) was associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 1

Platform breakdown:
* windows7-32: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1345735&startday=2017-06-19&endday=2017-06-25&tree=all
See Also: → bug 1375151

Comment 39

24 days ago
1 failure in 1008 pushes (0.001 failures/push) was associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-central: 1

Platform breakdown:
* windows7-32: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1345735&startday=2017-07-24&endday=2017-07-30&tree=all
2 failures in 949 pushes (0.002 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* mozilla-central: 2

Platform breakdown:
* windows7-32: 2

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1345735&startday=2017-08-14&endday=2017-08-20&tree=all