Closed Bug 1489297 Opened 6 years ago Closed 6 years ago

annotate Windows crash reports with whether system has error-correcting (ECC) RAM (MemoryErrorCorrection)

Categories

(Toolkit :: Crash Reporting, enhancement)

enhancement
Not set
normal

Tracking

()

VERIFIED FIXED
mozilla64
Tracking Status
firefox64 --- fixed

People

(Reporter: dbaron, Assigned: dbaron)

References

Details

Attachments

(4 files)

I'd like to add an additional metadata field to crash reports that records the value of MemoryErrorCorrection as documented at https://docs.microsoft.com/en-us/windows/desktop/cimwin32prov/win32-physicalmemoryarray .  This would record one of these 8 possible values:  Reserved (0), Other (1), Unknown (2), None (3), Parity (4), Single-bit ECC (5), Multi-bit ECC (6), CRC (7).

I suspect the vast majority of our users will report "None".  However, it's *possible* that enough of our users have error-correcting RAM of some sort that this will provide insight into which crashes are the result of memory corruption.  In particular, what I'd like to do is look for patterns of crashes that occur on systems without error-correcting RAM but do not occur on systems with ECC RAM; this may be a sign that these crashes are primarily the result of hardware memory corruption, which implies that we should invest less effort into trying to fix them.

This is somewhat opportunistic -- my expectation is that we probably won't have enough users with ECC RAM to do a useful analysis, but I'd like to be proven wrong, since I think such an analysis could turn out to be pretty useful.

(In the past we've done similar things with the number of CPU cores, which is metadata that we've been reporting since long before we thought to do this sort of analysis.  We could detect crashes as being related to multithreading bugs when those crashes were significantly more common on multicore machines than single core machines.  This is no longer useful today on desktop platforms because we don't have enough users on single core machines anymore.)
The above patch series successfully does this reporting for me.  You can see the "MemoryErrorCorrection" field (a "3", not surprisingly) in bp-e920ef1f-abc4-4ab6-8abd-571540180906#tab-metadata (a test crash I submitted from my debug build) in the "Metadata" tab, if you're logged in to Socorro and have Socorro permissions to view all the fields.
One thought is that maybe the annotation name should be more Windows-specific since the constants are (I think) Windows-specific.
(In reply to David Baron :dbaron: 🏴󠁵󠁳󠁣󠁡󠁿 ⌚UTC-7 from comment #5)
> One thought is that maybe the annotation name should be more
> Windows-specific since the constants are (I think) Windows-specific.

I don't think this is incredibly important, we have a lot of platform-specific annotations and they're not all marked as such.
I took a brief look at your patches and they look sensible, although you probably want someone with more in-depth Windows knowledge like aklotz or ccorcoran to review them.
FYI I just gave this a spin on my workstation (which has SECDED ECC memory) and got a "5" so it seems to be working fine.
bp-ad61ec13-b7e1-4289-8a63-c23340180907 is a test crash after switching to submitting strings, per ted's suggestion.
Comment on attachment 9007333 [details]
request for data collection

1) Is there or will there be **documentation** that describes the schema for the ultimate data set available publicly, complete and accurate?

Yes, it will be added to https://searchfox.org/mozilla-central/source/toolkit/crashreporter/CrashAnnotations.yaml.

2) Is there a control mechanism that allows the user to turn the data collection on and off?

Yes, crash reports are opt-in. We prompt users before submitting each one.

3) If the request is for permanent data collection, is there someone who will monitor the data over time?**

Yes, anybody investigating a crash.

4) Using the **[category system of data types](https://wiki.mozilla.org/Firefox/Data_Collection)** on the Mozilla wiki, what collection type of data do the requested measurements fall under?  **

Category 1.

5) Is the data collection request for default-on or default-off?

Default off (crash reports are opt-in).

6) Does the instrumentation include the addition of **any *new* identifiers** (whether anonymous or otherwise; e.g., username, random IDs, etc.  See the appendix for more details)?

No.

7) Is the data collection covered by the existing Firefox privacy notice?

Yes.

8) Does there need to be a check-in in the future to determine whether to renew the data?

No, permanent.
Attachment #9007333 - Flags: review?(francois) → review+
Comment on attachment 9007054 [details]
Bug 1489297 patch 1 - Rename function that gets system manufacturer data from WMI in preparation for using it for memory error correction as well.

Aaron Klotz [:aklotz] has approved the revision.
Attachment #9007054 - Flags: review+
Comment on attachment 9007055 [details]
Bug 1489297 patch 2 - Refactor WMI query code so that it can be used for something else.

Aaron Klotz [:aklotz] has approved the revision.
Attachment #9007055 - Flags: review+
Comment on attachment 9007056 [details]
Bug 1489297 patch 3 - Report MemoryErrorCorrection from WMI.

Aaron Klotz [:aklotz] has approved the revision.
Attachment #9007056 - Flags: review+
Pushed by dbaron@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/d0e3bb6adff0
patch 1 - Rename function that gets system manufacturer data from WMI in preparation for using it for memory error correction as well. r=aklotz
Pushed by dbaron@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/176376da58ab
patch 2 - Refactor WMI query code so that it can be used for something else. r=aklotz
Pushed by dbaron@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/0c72051a2cc0
patch 3 - Report MemoryErrorCorrection from WMI. r=aklotz
https://hg.mozilla.org/mozilla-central/rev/d0e3bb6adff0
https://hg.mozilla.org/mozilla-central/rev/176376da58ab
https://hg.mozilla.org/mozilla-central/rev/0c72051a2cc0
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla64
See Also: → 1490259
Here's a crash I just submitted from my Windows Nightly on my Lenovo P710 workstation (via about:crashcontent):
https://crash-stats.mozilla.com/report/index/82a59191-e697-40d5-9d1a-d468b0180912

It shows:
 MemoryErrorCorrection Multi-bit ECC

which seems to be correct. From a stackoverflow answer:

C:\Users\ted>wmic memphysical get memoryerrorcorrection
MemoryErrorCorrection
6
6
Status: RESOLVED → VERIFIED
I'd also note that the code is using the WMI API that wmic is the command-line interface for, so it's certainly expected that they're consistent.  Another question, though, is how reliable the information from WMI is.  Hopefully it's good...

(In reply to Wayne Mery (:wsmwk) from comment #22)

Is it expected that queries in comment 21 have null results set (zero hits) for current crashes

No, this should be working. Maybe the code to detect it doesn't work anymore? Can you open a bug to investigate this and assign it to me?

Flags: needinfo?(gsvelto)
Blocks: 1581520
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: