Closed Bug 1345735 Opened 3 years ago Closed 3 years ago

Intermittent tsvgx | Found crashes after test run, terminating test

Categories

(Testing :: Talos, defect)

Version 3
defect
Not set

Tracking

(firefox-esr52 unaffected, firefox53 unaffected, firefox54 unaffected, firefox55 fixed)

RESOLVED FIXED
mozilla55
Tracking Status
firefox-esr52 --- unaffected
firefox53 --- unaffected
firefox54 --- unaffected
firefox55 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: dvander)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell fixed])

Attachments

(2 files)

Whiteboard: [stockwell needswork]
Talos, Windows 8, e10s crash (not timeout), bad report, as in several other bugs and under discussion in bug 1310638.
See Also: → 1310638
More than 90% of these have failures have no crash report, but a few do! Here's one:

https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=85527944&lineNumber=2215

22:57:18     INFO -  PROCESS-CRASH | tsvgx | application crashed [@ CrashStatsLogForwarder::CrashAction(mozilla::gfx::LogReason)]
22:57:18     INFO -  Crash dump filename: c:\users\cltbld~1.t-w\appdata\local\temp\tmp3ohf2e\profile\minidumps\21b16005-da8d-466d-8185-18535b975da9.dmp
22:57:18     INFO -  Operating system: Windows NT
22:57:18     INFO -                    6.2.9200
22:57:18     INFO -  CPU: amd64
22:57:18     INFO -       family 6 model 30 stepping 5
22:57:18     INFO -       8 CPUs
22:57:18     INFO -  GPU: UNKNOWN
22:57:18     INFO -  Crash reason:  EXCEPTION_BREAKPOINT
22:57:18     INFO -  Crash address: 0x7ffb9101b63
22:57:18     INFO -  Assertion: Unknown assertion type 0x00000000
22:57:18     INFO -  Process uptime: 9 seconds
22:57:18     INFO -  Thread 9 (crashed)
22:57:18     INFO -   0  xul.dll!CrashStatsLogForwarder::CrashAction(mozilla::gfx::LogReason) [gfxPlatform.cpp:e0be781966d4 : 408 + 0x11]
22:57:18     INFO -      rax = 0x000007ffcb730a30   rdx = 0x0000000000000000
22:57:18     INFO -      rcx = 0x000007ffbb252fc0   rbx = 0x0000000957e3f3e0
22:57:18     INFO -      rsi = 0x0000000951929b20   rdi = 0x0000000000000004
22:57:18     INFO -      rbp = 0x0000000957e3f3b0   rsp = 0x0000000957e3f2e0
22:57:18     INFO -       r8 = 0x000007ffbba90960    r9 = 0x0000000000000190
22:57:18     INFO -      r10 = 0x00000009515985f1   r11 = 0x0000000957e3f210
22:57:18     INFO -      r12 = 0x0000000000000002   r13 = 0x0000000000000000
22:57:18     INFO -      r14 = 0x0000000088760873   r15 = 0x00000009519067a0
22:57:18     INFO -      rip = 0x000007ffb9101b63
22:57:18     INFO -      Found by: given as instruction pointer in context
22:57:18     INFO -   1  xul.dll!mozilla::gfx::CriticalLogger::CrashAction(mozilla::gfx::LogReason) [Factory.cpp:e0be781966d4 : 971 + 0x11]
22:57:18     INFO -      rbx = 0x0000000957e3f3e0   rbp = 0x0000000957e3f3b0
22:57:18     INFO -      rsp = 0x0000000957e3f310   r12 = 0x0000000000000002
22:57:18     INFO -      r13 = 0x0000000000000000   r14 = 0x0000000088760873
22:57:18     INFO -      r15 = 0x00000009519067a0   rip = 0x000007ffb8f822af
22:57:18     INFO -      Found by: call frame info
22:57:18     INFO -   2  xul.dll!mozilla::gfx::Log<1,mozilla::gfx::CriticalLogger>::Flush() [Logging.h:e0be781966d4 : 280 + 0x3b]
22:57:18     INFO -      rbx = 0x0000000957e3f3e0   rbp = 0x0000000957e3f3b0
22:57:18     INFO -      rsp = 0x0000000957e3f340   r12 = 0x0000000000000002
22:57:18     INFO -      r13 = 0x0000000000000000   r14 = 0x0000000088760873
22:57:18     INFO -      r15 = 0x00000009519067a0   rip = 0x000007ffb8c4361e
22:57:18     INFO -      Found by: call frame info
22:57:18     INFO -   3  xul.dll!mozilla::layers::SyncObjectD3D11::Init() [TextureD3D11.cpp:e0be781966d4 : 1235 + 0x44]
22:57:18     INFO -      rbx = 0x0000000957e3f3e0   rbp = 0x0000000957e3f3b0
22:57:18     INFO -      rsp = 0x0000000957e3f3c0   r12 = 0x0000000000000002
22:57:18     INFO -      r13 = 0x0000000000000000   r14 = 0x0000000088760873
22:57:18     INFO -      r15 = 0x00000009519067a0   rip = 0x000007ffb90b2e74
22:57:18     INFO -      Found by: call frame info
22:57:18     INFO -   4  xul.dll!mozilla::layers::SyncObjectD3D11::FinalizeFrame() [TextureD3D11.cpp:e0be781966d4 : 1274 + 0x5]
22:57:18     INFO -      rbx = 0x0000000957e3f3e0   rbp = 0x0000000957e3f3b0
22:57:18     INFO -      rsp = 0x0000000957e3f500   r12 = 0x0000000000000002
22:57:18     INFO -      r13 = 0x0000000000000000   r14 = 0x0000000088760873
22:57:18     INFO -      r15 = 0x00000009519067a0   rip = 0x000007ffb90b17eb
22:57:18     INFO -      Found by: call frame info
22:57:18     INFO -   5  xul.dll!mozilla::D3D11DXVA2Manager::CopyToImage(IMFSample *,mozilla::gfx::IntRectTyped<mozilla::gfx::UnknownUnits> const &,mozilla::layers::Image * *) [DXVA2Manager.cpp:e0be781966d4 : 948 + 0xa]
22:57:18     INFO -      rbx = 0x0000000957e3f3e0   rbp = 0x0000000957e3f3b0
22:57:18     INFO -      rsp = 0x0000000957e3f6b0   r12 = 0x0000000000000002
22:57:18     INFO -      r13 = 0x0000000000000000   r14 = 0x0000000088760873
22:57:18     INFO -      r15 = 0x00000009519067a0   rip = 0x000007ffb9a1e650
22:57:18     INFO -      Found by: call frame info
22:57:18     INFO -   6  xul.dll!mozilla::WMFVideoMFTManager::CreateD3DVideoFrame(IMFSample *,__int64,mozilla::VideoData * *) [WMFVideoMFTManager.cpp:e0be781966d4 : 931 + 0x1d]
22:57:18     INFO -      rbx = 0x0000000957e3f3e0   rbp = 0x0000000957e3f3b0
22:57:18     INFO -      rsp = 0x0000000957e3f7a0   r12 = 0x0000000000000002
22:57:18     INFO -      r13 = 0x0000000000000000   r14 = 0x0000000088760873
22:57:18     INFO -      r15 = 0x00000009519067a0   rip = 0x000007ffb9a1ef4d
22:57:18     INFO -      Found by: call frame info
22:57:18     INFO -   7  xul.dll!mozilla::WMFVideoMFTManager::Output(__int64,RefPtr<mozilla::MediaData> &) [WMFVideoMFTManager.cpp:e0be781966d4 : 1048 + 0x21]
22:57:18     INFO -      rbx = 0x0000000957e3f3e0   rbp = 0x0000000957e3f3b0
22:57:18     INFO -      rsp = 0x0000000957e3f850   r12 = 0x0000000000000002
22:57:18     INFO -      r13 = 0x0000000000000000   r14 = 0x0000000088760873
22:57:18     INFO -      r15 = 0x00000009519067a0   rip = 0x000007ffb9a22d6d
22:57:18     INFO -      Found by: call frame info
22:57:18     INFO -   8  xul.dll!mozilla::WMFMediaDataDecoder::ProcessOutput(nsTArray<RefPtr<mozilla::MediaData> > &) [WMFMediaDataDecoder.cpp:e0be781966d4 : 155 + 0x13]
22:57:18     INFO -      rbx = 0x0000000957e3f3e0   rbp = 0x0000000957e3f3b0
22:57:18     INFO -      rsp = 0x0000000957e3f950   r12 = 0x0000000000000002
22:57:18     INFO -      r13 = 0x0000000000000000   r14 = 0x0000000088760873
22:57:18     INFO -      r15 = 0x00000009519067a0   rip = 0x000007ffb9a2318f
22:57:18     INFO -      Found by: call frame info
22:57:18     INFO -   9  xul.dll!mozilla::WMFMediaDataDecoder::ProcessDrain() [WMFMediaDataDecoder.cpp:e0be781966d4 : 200 + 0x5]
22:57:18     INFO -      rbx = 0x0000000957e3f3e0   rbp = 0x0000000957e3f3b0
22:57:18     INFO -      rsp = 0x0000000957e3f980   r12 = 0x0000000000000002
22:57:18     INFO -      r13 = 0x0000000000000000   r14 = 0x0000000088760873
22:57:18     INFO -      r15 = 0x00000009519067a0   rip = 0x000007ffb9a22f79
22:57:18     INFO -      Found by: call frame info
David - Can you have a look at the crash reports in comment 4 and 5? We see these crashes infrequently in Windows e10s Talos tests, apparently at shutdown time. (We also see *frequent* Windows e10s Talos crashes at shutdown time without any crash reports.  I'm hoping the SyncObjectD3D11 crash is the same issue...and that you can fix it, or tell us how to avoid these crashes!)
Flags: needinfo?(dvander)
See Also: → 1319557, 1338639
:milan, can you help us find someone to look at this- we suspect this is the cause for a 25%+ failure rate in all windows 8 talos crashes. Geoff has outlined this nicely in Comment 6.
Flags: needinfo?(milan)
If this is only happening on shutdown, that makes a lot of sense - media could be racing with shutting down D3D11 devices.

Can we tell what PID is crashing? If it's the GPU process, then this locking might serve no purpose and can just be removed. If it's content, we'll have to investigate more.
Flags: needinfo?(dvander) → needinfo?(gbrown)
It does appear that all of these crashes are on shutdown.

We cannot easily determine the PID, but the process type is available. There is a .extra file associated with minidumps, available as artifacts of the test job. In both crashes from comment 4 and comment 5, the .extra files show ProcessType=gpu.

(If we really need the PID, :ted says it is available and can be retrieved with code like https://github.com/mozilla/socorro/blob/master/minidump-stackwalk/stackwalker.cc#L1196. Thanks much :ted!)
Flags: needinfo?(gbrown)
:dvander, knowing this is a GPU process, can you look into removing the shutdown locking?
Flags: needinfo?(dvander)
I talked with Matt Woodrow and it sounds like we can remove the locking on AMD GPUs only. If our machines use nVidia/Intel we'll have to do something else... do you know what they run, Joel?
Flags: needinfo?(dvander) → needinfo?(jmaher)
our machines are documented here:
https://wiki.mozilla.org/Buildbot/Talos/Misc#Hardware_Profile_of_machines_used_in_automation

and for the linux machines:
iX21X4 2U Neutron "Gemini" Series Four Node Hot-Pluggable Server (4 nodes per 2U)
920W High-Efficiency redundant (1+1) Power Supply
1 Intel X3450 CPU per node
8GB Total: 2 x 4GB DDR3 1333Mhz ECC/REG RAM per node
1 WD5003ABYX hard drive per node
1 NVIDIA GPU GeForce GT 610 per node


we will have new machines later in the summer or fall, I believe they will be Intel graphics.

Glad to hear there is a path forward here.
Flags: needinfo?(jmaher)
Okay, can't dropping the locking... I'll investigate further then.
Assignee: nobody → dvander
Status: NEW → ASSIGNED
Flags: needinfo?(milan)
:dvander, checking in here- do you have any updates here?
Flags: needinfo?(dvander)
(In reply to Joel Maher ( :jmaher) from comment #17)
> :dvander, checking in here- do you have any updates here?

Sorry, I haven't had time to look at it yet. I can't seem to find any newer stack traces. The logs on brasstacks all say:

"16:53:12     INFO -  PROCESS-CRASH | tsvgx | application crashed [unknown top frame]"

There's no stack, and the minidump isn't parseable by Visual Studio. Did something break or am I looking in the wrong places?
Flags: needinfo?(dvander) → needinfo?(jmaher)
Recall comments 4 and 6: More than 90% of these Windows Talos shutdown crashes have no crash reports (usually there's a minidump, but it isn't parseable). You have to look through at least 20 logs before you find a stack.

Here's another one, a little different, from yesterday: https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=90753534&lineNumber=2213
Flags: needinfo?(jmaher)
Errr... why do so few runs not have crash reports? Is that intentional?
I put a bunch of sleep() calls in right before video locks the SyncObject, to root out race conditions during shutdown. The first thing that hit was this, InitIPDLActor ends up getting a null forwarder.
Attachment #8858182 - Flags: review?(matt.woodrow)
This code actually asserted earlier than the talos crash, because CompositorBridgeChild shouldn't be used in the GPU process and definitely not off the main thread. This patch just skips the gfxDevCrash if we're not in the main thread.
Attachment #8858183 - Flags: review?(matt.woodrow)
See Also: → 1356445
Attachment #8858182 - Flags: review?(matt.woodrow) → review+
Attachment #8858183 - Flags: review?(matt.woodrow) → review+
let me know if I can land these patches, right now our #1 failure on the tree is a new talos test (glvideo in bug 1356445) which is failing with this same pattern- in total we have 400+ instances of failures in the last  week related to what looks like this issue.
Pushed by danderson@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/d6348ea45c2d
Don't create TextureClients if the video bridge has shut down. (bug 1345735 part 1, r=mattwoodrow)
https://hg.mozilla.org/integration/mozilla-inbound/rev/7c5628d40478
Don't gfxDevCrash when video fails to acquire a SyncObject. (bug 1345735 part 2, r=mattwoodrow)
https://hg.mozilla.org/mozilla-central/rev/d6348ea45c2d
https://hg.mozilla.org/mozilla-central/rev/7c5628d40478
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Duplicate of this bug: 1356445
Duplicate of this bug: 1345730
Duplicate of this bug: 1345724
Duplicate of this bug: 1345723
Duplicate of this bug: 1357512
Duplicate of this bug: 1342735
Whiteboard: [stockwell needswork] → [stockwell fixed]
Duplicate of this bug: 1310638
You need to log in before you can comment on or make changes to this bug.