[meta] Make crashes sent to Windows Error Reporting actionable
Categories: Toolkit :: Crash Reporting, task
People: Reporter: gsvelto, Unassigned
Keywords: meta
We're missing a significant number of crashes because they are being intercepted by the Windows Error Reporting (WER) subsystem. The crashes are visible in Microsoft's partner dashboard (if you have an account to access it):
https://partner.microsoft.com/en-us/
This bug will serve as a central point to gather all related activities. Our goal is to reduce those crashes to a minimum and also make them actionable.
Comment 1 (Reporter) • 5 years ago
One annoying discovery I just made is that it doesn't seem possible to access the raw minidumps for the crashes intercepted by WER, so at best we can look at the symbolicated stack traces. The traces appear to be at least two days old, so we'll have to wait until Thursday to see the impact of having uploaded our symbols in bug 1670893.
Comment 2 (Reporter) • 4 years ago
After weeks of uploading symbols there's finally some interesting stuff coming out of those Microsoft health dashboards. David, Toshihito, could you have a look? Most of the crashes with proper stack traces I found are really odd and I'm not sure what to make of them. When you look at the dashboards I suggest filtering by crashes; there are plenty of hangs and the like, but I don't think those are very interesting right now.
One of the odd crashes I found appears to be a recursive call to mozilla::plugins::FunctionBroker_mozilla::plugins::ID_CreateMutexW that eventually leads to a stack overflow.
Another odd one is happening in mozglue.dll and has a stack trace that looks like this:
0 ntdll RtlQueryPerformanceCounter 0x7F
1 mozglue mozilla::glue::ModuleLoadFrame::StaticInit 0x68
2 mozglue DllBlocklist_Initialize 0x4A
3 <unknown> firefox 0x130B
If you don't have an account to access the dashboards, I think Sylvestre can provide you with one.
As discussed in channel, these crashes don't show up on my dashboard.
Comment 4 • 4 years ago
I finally got access to the dashboard. It's hard to decide where to start.
I can see the one with ID_CreateMutexW and the one with ModuleLoadFrame::StaticInit. I couldn't find a way to filter the failure table by name, so I had to download tableChart.tsv, find a signature in it, and then locate the table row using the Hits count as a hint.
ModuleLoadFrame::StaticInit is very strange. The offset 0x68 in mozglue v83.0 means that the call to sLoaderAPI->GetHandleLauncherErrorFn() from here somehow jumped to RtlQueryPerformanceCounter. Maybe GetNtLoaderAPI returned a wrong object.
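The tableChart.tsv workaround described above could be scripted roughly as follows. This is only a sketch: the column names "Failure name" and "Hits" and the sample rows are assumptions made up for illustration, not the dashboard's actual export format, so adjust them to whatever header the downloaded file really has.

```python
import csv
import io

# Hypothetical column names; the real tableChart.tsv header may differ.
NAME_COLUMN = "Failure name"
HITS_COLUMN = "Hits"


def find_signature_rows(tsv_text, needle):
    """Return rows whose failure name contains `needle`, highest hit count first."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    needle = needle.lower()
    rows = [r for r in reader if needle in (r.get(NAME_COLUMN) or "").lower()]
    # Tolerate thousands separators such as "1,234" in the hit count.
    rows.sort(key=lambda r: int(r[HITS_COLUMN].replace(",", "")), reverse=True)
    return rows


# Made-up sample data, for illustration only.
SAMPLE = (
    "Failure name\tHits\n"
    "mozglue!ModuleLoadFrame::StaticInit\t120\n"
    "xul!SomethingElse\t45\n"
    "mozglue!ModuleLoadFrame::StaticInit (variant)\t300\n"
)

for row in find_signature_rows(SAMPLE, "ModuleLoadFrame::StaticInit"):
    print(row[HITS_COLUMN], row[NAME_COLUMN])
```

With a real export, the file contents would be read from disk and passed in place of SAMPLE; the printed hit counts can then be matched against the rows shown in the dashboard table.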
Comment 5 • 4 years ago
For plugins::ID_CreateMutexW, this hook was introduced by bug 1366256. I discussed it with :handyman. Given that the volume of this problem has decreased drastically since Nov-10 (maybe because of Patch Tuesday?) and that we run this code only for Flash, whose EOL is coming very soon, it's fixable but its priority is low.
Comment 6 (Reporter) • 4 years ago
Thanks for looking!
(In reply to Toshihito Kikuchi [:toshi] from comment #4)
> Finally I got access to the dashboard. It's hard to decide where we should start with.
> I can see the one with ID_CreateMutexW and the one with ModuleLoadFrame::StaticInit. I cannot find how to filter the failure table by name, so I needed to download tableChart.tsv, find a signature in it, and then find out a table row using a Hits count as a hint.
Yes, filtering is very problematic. The best I could find is to use the funnel icon to filter only for crashes. However, since there's no filter-by-version tool, the first few pages are full of crashes from old versions.
> ModuleLoadFrame::StaticInit is very strange. The offset 0x68 in mozglue v83.0 means calling sLoaderAPI->GetHandleLauncherErrorFn() from here jumped to RtlQueryPerformanceCounter somehow. Maybe GetNtLoaderAPI returned a wrong object.
Well, glad to know that doesn't have an impact.
That being said, we probably need to make a decision about what to do with these dashboards. Given the above, I'd say their value is low for a number of reasons:
- Things like symbolication trail the upload of new versions by days, possibly more than a week, so even when we detect a bug there, quite some time might have passed
- Since it's impossible to filter by version, triaging new crashes is hard or impossible
- Stack traces are of poor quality and often not available
- We still have to upload symbols for new versions manually
Still, this is obviously catching crashes we're not aware of. Since it appears to be impossible to access the minidumps, this isn't really helping though. IMO the next step should be to investigate whether we can use WerRegisterAppLocalDump to coerce WER into writing minidumps locally and then "hijack" them with our regular crash reporting infrastructure. I'll open a bug for that.