Open Bug 1670890 Opened 5 years ago Updated 4 years ago

[meta] Make crashes sent to Windows Error Reporting actionable

Categories

(Toolkit :: Crash Reporting, task)

Unspecified
Windows
task

People

(Reporter: gsvelto, Unassigned)

Details

(Keywords: meta)

We're missing a significant number of crashes that are being intercepted by the Windows Error Reporting (WER) subsystem. The crashes are visible in Microsoft's partner dashboard (if you have an account to access it):

https://partner.microsoft.com/en-us/

This bug will serve as a central point to gather all related activities. Our goal is to reduce those crashes to a minimum and also make them actionable.

Depends on: 1670893

One annoying discovery I just made is that it doesn't seem possible to access the raw minidumps for the crashes intercepted by WER, so at best we can look at the symbolicated stack traces. The traces appear to be at least two days old, so we'll have to wait until Thursday to see the impact of having uploaded our symbols in bug 1670893.

After weeks of uploading symbols there's finally some interesting stuff coming out of those Microsoft health dashboards. David, Toshihito, could you have a look? Most of the crashes with proper stack traces I found are really odd and I'm not sure what to make of them. When you look at the dashboards I suggest filtering by crashes; there are plenty of hangs and the like, but I don't think those are very interesting right now.

One of the odd crashes I found appears to be a recursive call to mozilla::plugins::FunctionBroker_mozilla::plugins::ID_CreateMutexW that eventually leads to a stack overflow.

Another odd one is happening in mozglue.dll and has a stack trace that looks like this:

0 	ntdll		RtlQueryPerformanceCounter 			0x7F
1 	mozglue 	mozilla::glue::ModuleLoadFrame::StaticInit 	0x68
2 	mozglue 	DllBlocklist_Initialize 			0x4A
3 	<unknown>	firefox 					0x130B

If you don't have an account to access the dashboards, I think Sylvestre can provide you with one.

Flags: needinfo?(tkikuchi)
Flags: needinfo?(dmajor)

As discussed in channel, these crashes don't show up on my dashboard.

Flags: needinfo?(dmajor)

I finally got access to the dashboard. It's hard to decide where we should start.

I can see the one with ID_CreateMutexW and the one with ModuleLoadFrame::StaticInit. I could not find a way to filter the failure table by name, so I had to download tableChart.tsv, find the signature in it, and then locate the matching table row using its Hits count as a hint.
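The lookup in the downloaded file can be scripted. The following is only a minimal sketch: the column names "Failure name" and "Hits" are assumptions about the layout of tableChart.tsv (adjust them to the actual header row), and the sample rows are made up for illustration, not real dashboard data.

```python
import csv
import io


def find_failures(tsv_text, needle, name_column="Failure name", hits_column="Hits"):
    """Return (name, hits) pairs whose name contains `needle`, most hits first.

    The column names are assumptions; pass the real header names of the
    exported TSV if they differ.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    matches = []
    for row in reader:
        name = row.get(name_column) or ""
        if needle.lower() in name.lower():
            # Tolerate thousands separators such as "1,200".
            hits = int((row.get(hits_column) or "0").replace(",", ""))
            matches.append((name, hits))
    return sorted(matches, key=lambda match: match[1], reverse=True)


# Made-up sample rows standing in for the exported tableChart.tsv.
SAMPLE = (
    "Failure name\tHits\n"
    "mozglue.dll!mozilla::glue::ModuleLoadFrame::StaticInit\t1,200\n"
    "xul.dll!mozilla::plugins::ID_CreateMutexW\t300\n"
    "example.dll!SomethingElse\t50\n"
)

for name, hits in find_failures(SAMPLE, "ID_CreateMutexW"):
    print(hits, name)
```

With a real export, the file contents would be read with open("tableChart.tsv", encoding="utf-8").read() and passed in place of SAMPLE; the resulting Hits value can then be used to find the row in the dashboard table.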

ModuleLoadFrame::StaticInit is very strange. The offset 0x68 in mozglue v83.0 means that the call to sLoaderAPI->GetHandleLauncherErrorFn() somehow jumped to RtlQueryPerformanceCounter. Maybe GetNtLoaderAPI returned a wrong object.

Flags: needinfo?(tkikuchi)

For plugins::ID_CreateMutexW, this hook was introduced by bug 1366256. I discussed it with :handyman. Given that the volume of this problem has decreased drastically since Nov-10 (maybe because of Patch Tuesday?) and that we run this code only for Flash, whose EOL is coming very soon, it's fixable but its priority is low.

See Also: → 1681243

Thanks for looking!

(In reply to Toshihito Kikuchi [:toshi] from comment #4)

> I finally got access to the dashboard. It's hard to decide where we should start.
>
> I can see the one with ID_CreateMutexW and the one with ModuleLoadFrame::StaticInit. I could not find a way to filter the failure table by name, so I had to download tableChart.tsv, find the signature in it, and then locate the matching table row using its Hits count as a hint.

Yes, filtering is very problematic. The best I could find is to use the funnel icon to filter only for crashes. However, since there's no filter-by-version tool, the first few pages are full of crashes from old versions.

> ModuleLoadFrame::StaticInit is very strange. The offset 0x68 in mozglue v83.0 means that the call to sLoaderAPI->GetHandleLauncherErrorFn() somehow jumped to RtlQueryPerformanceCounter. Maybe GetNtLoaderAPI returned a wrong object.

Well glad to know that doesn't have an impact.

That said, we probably need to make a decision about what to do with these dashboards. Given the above, I'd say their value is low for a number of reasons:

  • Things like symbolication trail the upload of new versions by days, possibly by more than a week, so by the time we detect a bug there quite some time might have passed
  • Since it's impossible to filter by version, triaging for new crashes is hard or impossible
  • Stack traces are of poor quality and often not available
  • We still have to upload symbols for new versions manually

Still, this is obviously catching crashes we're not aware of. Since it appears to be impossible to access the minidumps, though, this isn't really helping. IMO the next step should be to investigate whether we can use WerRegisterAppLocalDump to coerce WER into writing minidumps locally and then "hijack" them with our regular crash reporting infrastructure. I'll open a bug for that.

Depends on: 1681245
Depends on: 1682490
Depends on: 1682507
Depends on: 1685461
Depends on: 1691356
Depends on: 1691905
Depends on: 1700669
Depends on: 1703761
No longer depends on: 1703761