Minidump still not always being written by talos forced shutdowns after timeout (talos crash reporting and windows kill_and_get_minidump busted)
Categories
(Testing :: Talos, defect, P3)
Tracking
(firefox76 fixed)
 | Tracking | Status |
---|---|---|
firefox76 | --- | fixed |
People
(Reporter: Gijs, Assigned: gbrown)
References
(Blocks 1 open bug)
Details
(Whiteboard: dev-prod-2020)
Attachments
(1 file)
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=294290831&repo=autoland&lineNumber=3887
The log says it's writing a minidump, but there's no attachment. I don't really know how to debug this further. This run definitely post-dates bug 1623917 and bug 1622257, since it ran this past Monday.
Geoff, do you know how to investigate this further?
Reporter
Updated•5 years ago
Looks like this is the underlying reason for the intermittent on bug 1557982?
Assignee
Comment 2•5 years ago
Here's an intentional crash in a variety of tasks:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=2e13a8b00b434718d8fd0ac706632c0664137224
I guess it just confirms this bug: crashreporting is working for most test suites, but not for Talos, on any platform.
The Talos code looks okay to me. Looking closer...
Assignee
Comment 3•5 years ago
In some cases, a browser crash may subsequently cause an exception in the harness before
check_for_crashes is called, effectively bypassing crash reporting. This patch catches
the exception to ensure that check_for_crashes is called regardless of such exceptions.
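A minimal sketch of the idea in this patch description, not the actual Talos change: the crash check runs in a `finally` clause, so a harness exception cannot bypass it. The names `run_with_crash_check`, `run_test`, and `check_for_crashes` used here are hypothetical stand-ins passed in by the caller, not the real Talos or mozcrash signatures.

```python
import logging

log = logging.getLogger("talos_sketch")


def run_with_crash_check(run_test, check_for_crashes):
    """Run a test and always check for crashes afterwards.

    Both arguments are hypothetical callables supplied by the caller;
    they only illustrate the control flow.
    """
    try:
        return run_test()
    except Exception:
        # The harness may fail as a side effect of a browser crash.
        # Log it, then let the exception propagate after the crash check.
        log.exception("harness error; still checking for crashes")
        raise
    finally:
        # Runs on success and on failure, before any exception propagates.
        check_for_crashes()
```

With this shape, the crash check is reached whether the test returns normally or the harness raises, and the original harness exception is still reported to the caller.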
Updated•5 years ago
Assignee
Comment 4•5 years ago
My patch (comment 3) improves Talos crash reporting in the common case, but does not appear to address the problem reported in comment 0.
Bug 1626097 will provide slightly improved diagnostics if / when comment 0 is reproduced. However, I suspect that will show that check_for_crashes is checking for minidumps and not finding any, despite kill_and_get_minidump being called appropriately. Maybe that's because the process was already partly shut down when killed?
Comment 5•5 years ago
I'm also working on a talos silent error and I am trying to get familiar with this framework.
Geoff, one place I see as prone to silent errors in mozcrash.kill_and_get_minidump() is here.
Assignee
Comment 6•5 years ago
Thanks - that's a good point. Are you going to add a warning there, or shall I?
I'd also like to see an info message like https://searchfox.org/mozilla-central/rev/fa2df28a49883612bd7af4dacd80cdfedcccd2f6/testing/mozbase/mozcrash/mozcrash/mozcrash.py#504 for the OpenProcess branch.
Comment 7•5 years ago
Let's first see a push to try. I'm not 100% sure until I see the results. Please do that and maybe we'll collaborate further if necessary. Thanks.
Assignee
Comment 8•5 years ago
(In reply to Geoff Brown [:gbrown] from comment #4)
> My patch (comment 3) improves Talos crash reporting in the common case, but does not appear to address the problem reported in comment 0.
> Bug 1626097 will provide slightly improved diagnostics if / when comment 0 is reproduced. However, I suspect that will show that check_for_crashes is checking for minidumps and not finding any, despite kill_and_get_minidump being called appropriately. Maybe that's because the process was already partly shut down when killed?
I tracked down the problem reported in comment 0: On Windows (only), kill_and_get_minidump was consistently failing to create a minidump. That was because the minidump file_name was originally non-unicode and, when running in a Python 2 environment, not converted to unicode before being used in a call to CreateFileW, which fails silently when called with an 8-bit string.
Solution: if not isinstance(file_name, string_types): -> if not isinstance(file_name, text_type):
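To illustrate why the one-word change matters, here is a rough sketch that emulates the Python 2 meaning of six's `string_types` and `text_type` under Python 3. The constants `PY2_STRING_TYPES` and `PY2_TEXT_TYPE`, the two helper functions, and the conversion performed inside the `if` are assumptions for illustration only; the real mozcrash code is not reproduced here.

```python
# Emulated Python 2 types (assumption: on Python 2, six.string_types is
# basestring, which covers both 8-bit str and unicode, while
# six.text_type is unicode only).
PY2_STRING_TYPES = (bytes, str)
PY2_TEXT_TYPE = str


def normalize_old(file_name):
    # Old check: an 8-bit string already counts as a "string",
    # so it is passed through unconverted.
    if not isinstance(file_name, PY2_STRING_TYPES):
        file_name = str(file_name)
    return file_name


def normalize_new(file_name):
    # New check: anything that is not text is converted, including bytes.
    if not isinstance(file_name, PY2_TEXT_TYPE):
        if isinstance(file_name, bytes):
            file_name = file_name.decode("utf-8")
        else:
            file_name = str(file_name)
    return file_name
```

Under the old check, an 8-bit name such as `b"crash.dmp"` stays 8-bit and would reach a wide-character API unchanged; under the new check it becomes text first.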
Assignee
Updated•5 years ago
Reporter
Comment 9•5 years ago
Thanks so much for chasing this, :gbrown ! Great stuff. I'm wondering, could we add some kind of test for the generic Windows mozcrash side of things where we crash deliberately, and check that we get a minidump and stack info, to ensure we can't accidentally break this again in future?
Assignee
Comment 10•5 years ago
I think there would be value in such a test and I have contemplated such in the past, but I don't know how to implement an effective and robust test without a significant time investment.
Comment 11•5 years ago
Comment 12•5 years ago
bugherder