Open Bug 995652 Opened 9 years ago Updated 2 months ago

Run memtest from the crash reporter

Categories

(Core :: JavaScript: GC, defect, P3)

Tracking

()

flash10

People

(Reporter: terrence, Unassigned)

References

(Blocks 2 open bugs)

Details

We suspect that a non-trivial number of crashes come from bad sticks of RAM floating around in the wild. A year or so ago, dolske showed off a working prototype of an in-browser memory checker he had hacked up. We know it works because I gave him two bad sticks of RAM to test against: the prototype detected the bad blocks flawlessly. With a little polish, we could deploy this as part of the crash reporter and find out for sure how many users are affected.

As a side-effect, it would dramatically improve the user experience for the no-doubt frustrated users who are experiencing daily crashes because of bad RAM.
See Also: → 1270554
See Also: → 1034706
I ran into this tidbit that I had forgotten about while looking up something else, from Microsoft's _Debugging in the (Very) Large: Ten Years of Implementation and Experience_ paper:
"The kernel maintains a list of free pages, zeroed when the CPU is idle. On every sixteenth page allocation, the Vista kernel examines the page content and asynchronously creates an error report if any zero has changed to a one. Analysis of these corrupt "zero" pages has located errors in hardware—like a DMA engine that only supported 31-bit addresses—and errors in firmware—like a brand of laptop with a bad resume-from-sleep firmware."

http://research.microsoft.com/apps/pubs/default.aspx?id=81176, section 7.1

I don't know how dolske's prototype works, but something like this as a background idle task might be feasible, and Microsoft's experience is a good data point showing it produces useful results.
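
To make the quoted scheme concrete, here is a minimal sketch of the spot-check it describes: a pool of zeroed pages where every sixteenth allocation is verified to still be all zero. This is a toy model in Python, not Microsoft's or Mozilla's code; `ZeroPagePool`, `PAGE_SIZE`, and the report tuple layout are hypothetical names chosen for illustration.

```python
PAGE_SIZE = 4096
SAMPLE_INTERVAL = 16  # check every sixteenth allocation, as in the quoted passage


class ZeroPagePool:
    """Toy free-page pool that spot-checks zeroed pages on allocation."""

    def __init__(self, page_count):
        # Pages start out zeroed, standing in for the idle-time zeroing.
        self.free_pages = [bytearray(PAGE_SIZE) for _ in range(page_count)]
        self.allocations = 0
        # Each report is (allocation number, byte offset, unexpected value).
        self.reports = []

    def allocate(self):
        page = self.free_pages.pop()
        self.allocations += 1
        if self.allocations % SAMPLE_INTERVAL == 0:
            self._check_zeroed(page)
        return page

    def _check_zeroed(self, page):
        # Fast path: a healthy page contains no non-zero byte.
        if not any(page):
            return
        for offset, value in enumerate(page):
            if value != 0:
                self.reports.append((self.allocations, offset, value))
```

Only sampled pages are examined, so corruption in the other fifteen goes unnoticed; that is the trade-off that keeps the check cheap enough to run all the time.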
> something like this as a background idle task might be feasible

This bug is about doing the memtest in the crash reporter. Doing it as a background idle task sounds a lot more like bug 1270554. We probably don't need both this bug and that bug. How realistic does running a memtest in the crash reporter sound? I suspect it might be far too costly and complex for the crash reporter (though I'm happy to be told otherwise). If so, we can probably close this bug and just focus on bug 1270554.
Flags: needinfo?(ted)
If you're talking about the crash reporter client (the native app that runs after a chrome process crash), then we can run whatever we want there because it's a new process. However, in this age of e10s, doing things there is less useful because the client only runs after chrome process crashes, not after content process crashes.
Flags: needinfo?(ted)
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #1)

> I don't know how dolske's prototype works, but something like this as a
> background idle task might be feasible, and Microsoft's experience is a good
> data point showing it produces useful results.

I basically did a JS port of the guts of memtest86+, running in a worker thread. I'd suggest not bothering with a native implementation running in the crash reporter, since it would largely be duplicative of either running something in the browser, or having a button in about:support to launch something externally.

I'll add a few more remarks to bug 1270554.
I've been looking into doing this.  I wrote a patch that implements a couple of the tests (1 and 3) described here:

http://www.memtest86.com/technical.htm#description

This takes 1-2 ms per MiB on my MBP.
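
For readers unfamiliar with the algorithms on that page, here is a rough sketch of two of them: an own-address test and a moving-inversions test. It is an assumption that these match the patch's tests 1 and 3, since the numbering on the memtest86 page has varied between versions. The code runs over a Python list standing in for memory, so it illustrates the logic only; `StuckBitBuffer` is a hypothetical helper that simulates a cell with a bit stuck at one.

```python
WORD_MASK = 0xFFFFFFFF


def own_address_test(mem):
    """Write each word's own index into it, then verify. Returns bad indices."""
    for i in range(len(mem)):
        mem[i] = i & WORD_MASK
    return [i for i in range(len(mem)) if mem[i] != (i & WORD_MASK)]


def moving_inversions_test(mem, pattern=0x00000000):
    """Moving inversions with a pattern and its complement. Returns bad indices."""
    complement = ~pattern & WORD_MASK
    bad = set()
    n = len(mem)
    for i in range(n):
        mem[i] = pattern
    # Ascending pass: verify the pattern, then write the complement.
    for i in range(n):
        if mem[i] != pattern:
            bad.add(i)
        mem[i] = complement
    # Descending pass: verify the complement, then write the pattern back.
    for i in reversed(range(n)):
        if mem[i] != complement:
            bad.add(i)
        mem[i] = pattern
    return sorted(bad)


class StuckBitBuffer:
    """Simulated memory in which one word has some bits stuck at one."""

    def __init__(self, size, bad_index, stuck_mask):
        self.words = [0] * size
        self.bad_index = bad_index
        self.stuck_mask = stuck_mask

    def __len__(self):
        return len(self.words)

    def __setitem__(self, index, value):
        self.words[index] = value

    def __getitem__(self, index):
        value = self.words[index]
        if index == self.bad_index:
            value |= self.stuck_mask
        return value
```

With a bit stuck at one in word 5 (`StuckBitBuffer(64, 5, 0x8)`), both tests flag index 5. Move the fault to word 8 and the own-address test misses it, because the value written there already has that bit set, while moving inversions still catches it. That is one reason to run more than one test.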
Ted, I'd like to work on this.  Could you provide guidance on where/how to hook this into the crash reporter?

Also I don't understand comment 3 - do we not report content process crashes or does this happen another way now?

My current plan is to run memtest on the process' current address space when it crashes and include the result in the crash report.  This isn't perfect for several reasons, but hopefully will provide useful data nonetheless.

One concern is that the physical pages we see when testing may differ from those that were mapped when the process crashed.  There's not much we can do about this without OS assistance.

I think it will also be slow to run all the tests that memtest performs, particularly if we take care to force all writes to go to main memory so that issues are not hidden by caches.  The plan is to go with a small set of tests and hope that this catches a significant proportion of bad memory.  The memtest site linked above suggests that testing a large enough area of memory without disabling caching is good enough to catch problems.

Finally, since testing requires overwriting memory, the test will have to take care not to touch any stack or heap memory that it is currently using or that the crash reporter will need later.

Julian also suggested installing a custom exception handler to recover from any problems while testing.
Flags: needinfo?(ted)
(In reply to Jon Coppeard (:jonco) from comment #6)
> Ted, I'd like to work on this.  Could you provide guidance on where/how to
> hook this into the crash reporter?
> 
> Also I don't understand comment 3 - do we not report content process crashes
> or does this happen another way now?

For content processes, the exception handler in the content process just sends a message over a pipe to the parent process, which does the actual minidump generation, and then we submit the minidump using some chrome JS code.

For chrome process crashes the exception handler writes a minidump of the current process, then runs the crash reporter client to submit it.

> My current plan is to run memtest on the process' current address space when
> it crashes and include the result in the crash report.  This isn't perfect
> for several reasons, but hopefully will provide useful data nonetheless.

I don't think this is a great idea. Doing anything in a process after it has crashed is incredibly tricky, and if we screw it up and crash again we won't get a crash report for the original crash. We bend over backwards in the Breakpad code to be extremely safe during the exception handler.
 
> One concern is that we may see a different set of physical pages when
> testing to those that were mapped when the process crashed.  There's not
> much we can do about this without OS assistance.

Oh, per your previous paragraph I thought you intended to run this check in the browser process after it had crashed. If you want to run the check in the crashreporter client that code is here:
https://dxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/client/
 
> I think it will also be slow to run all the tests that memtest performs,
> particularly if we take care to force all writes to go to main memory so
> that issues are not hidden by caches.  The plan is to go with a small set of
> tests and hope that this catches a significant proportion of bad memory. 
> The memtest site linked above suggests that testing a large enough area of
> memory without disabling caching is good enough to catch problems.

I'd be worried about the performance of this, since the user experience of crashing is already bad enough. If we spend a bunch of time doing memory tests and that delays getting the user back to browsing after a crash, that would not be great. I'd be more in favor of us setting a flag when we crash (maybe only for certain kinds of crashes, even) and then running some memory tests while the browser is running but idle, and then reporting any faulty memory results along with telemetry or future crash reports.
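
The idle-time variant could look something like the sketch below: testing proceeds in small chunks, each call is bounded by a time budget, and progress is kept between calls so the work can be spread across many idle periods. All names here (`IdleMemoryTester`, `check_chunk`, `run_slice`) are hypothetical, the buffer is an ordinary `bytearray` rather than real process memory, and the crash flag and telemetry reporting are not shown.

```python
import time


def check_chunk(buffer, start, end, pattern=0x55):
    """Write a pattern and its complement to buffer[start:end]; return bad indices."""
    bad = set()
    for value in (pattern, ~pattern & 0xFF):
        for i in range(start, end):
            buffer[i] = value
        bad.update(i for i in range(start, end) if buffer[i] != value)
    return sorted(bad)


class IdleMemoryTester:
    """Tests a buffer incrementally, one time-boxed slice per idle period."""

    def __init__(self, buffer, chunk_size=4096):
        self.buffer = buffer
        self.chunk_size = chunk_size
        self.next_index = 0
        self.bad_indices = []

    @property
    def done(self):
        return self.next_index >= len(self.buffer)

    def run_slice(self, budget_seconds, clock=time.monotonic):
        """Test chunks until the budget is spent. Returns True when finished."""
        deadline = clock() + budget_seconds
        while not self.done and clock() < deadline:
            start = self.next_index
            end = min(start + self.chunk_size, len(self.buffer))
            self.bad_indices.extend(check_chunk(self.buffer, start, end))
            self.next_index = end
        return self.done
```

A caller would invoke `run_slice` with a small budget whenever the browser is idle and read `bad_indices` once it returns `True`. The test is destructive, so in practice it would have to run on freshly allocated memory rather than memory in use.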
Flags: needinfo?(ted)
Priority: -- → P3
Severity: normal → S3