Open Bug 1270554 Opened 4 years ago Updated 2 years ago
Run memtest continuously on the live browser
Here's how Microsoft does it (from section 7.1 of http://www.sigops.org/sosp/sosp09/papers/glerum-sosp09.pdf):

"The kernel maintains a list of free pages, zeroed when the CPU is idle. On every sixteenth page allocation, the Vista kernel examines the page content and asynchronously creates an error report if any zero has changed to a one. Analysis of these corrupt zero pages has located errors in hardware like a DMA engine that only supported 31 bit addresses and errors in firmware like a brand of laptop with a bad resume from sleep firmware."

Notes:

- This only detects memory defects that cause zero bits to be read as ones. I don't know whether the inverse (one bits read as zeroes) is similarly common.

- They can do this in the kernel, where they can work with physical addresses. We're stuck with virtual addresses, unfortunately.

- We could do this at the level of mmap/VirtualAlloc(). We could check every anonymous mapping allocated, or just a subset of them. This would give us maximum coverage, because it would get the entire C heap, the entire JS heap, generated JIT code, and the handful of other things we use mmap/VirtualAlloc for.

I'm going to do some experiments along these lines.
Assignee: nobody → n.nethercote
http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf is a paper in which Google discusses RAM fault rates.
I've been meaning to blog about this for years, but just never got around to it. :( The code I wrote is on GitHub, and a live demo is on my site:

https://github.com/dolske/memtest.js
https://dolske.net/hacks/memtest.js/live/

It's basically a JS port of the guts of memtest86+ (running in a worker thread), with a similar UI (just for fun). The tests allocate large blocks of memory and iterate over them, setting and checking values in various ways. I'm sure the JIT loves tight, simple loops like this, and I was seeing speeds close to native when I tested 3 years ago.

The basic theory was that while it's not entirely possible to replicate memtest86 in a user-mode process (because you can't touch memory used elsewhere in the system, can't avoid virtual memory abstractions, etc.), you could probably do a good enough job by repeatedly testing a subset of memory and making a statistical inference. E.g., test 10% of the memory once a day on idle, and after some weeks you should have some confidence that the memory is good. Or if you ever find a single bad bit, you know for sure there's a problem. :)

The results were encouraging: with some known-bad RAM and testing big chunks of RAM, it was able to find a problem reliably -- I saved a screenshot of the first time that happened. :)

http://dolske.net/hacks/memtest.js/First%20jsmemtest%20fail%202013-05-28%2017_35_36.png

It also demonstrated that some of the non-simple bit flipping was necessary to detect errors, and could be done from JS. (I.e., simply setting a block to a static pattern and checking the result was often insufficient to detect the error.)

The last thing I worked on before setting it aside was seeing whether testing smaller chunks of RAM worked reliably -- you want to minimize the risk of OOM and the runtime, after all. But I had mixed results. In a perfect world, testing N% of the RAM would have an N% chance of finding the bad bit. In reality, there was a dropoff, and below a certain threshold it never found the error.
(I assume because of vagaries in how the OS and Firefox deal with memory allocations and VM mappings.) I can dig up the data I had, but my recollection is that this wasn't severe enough to be a deal-breaker.
(In reply to Nicholas Nethercote [:njn] from comment #1)
> - They can do this in the kernel, where they can work with physical
> addresses. We're stuck with virtual addresses, unfortunately.

One trick I used was to log the error offset within the ArrayBuffer I was using, mod 4096. My (admittedly ancient) understanding of how x86 VM works is that it's all 4K pages underneath, and presumably big allocations are page-aligned... This seemed to result in a stable identifier for the error bit across multiple runs/reboots. But I dunno how _useful_ that is, since once you've found a bad bit it doesn't really matter where it is. :)

Another idea: I think it's possible, at least on Linux and Solaris (did I mention "ancient"? :), to figure out the physical address for a user address. (I didn't fully verify that, but it seemed like the info was there in various tools and /proc output.) So, in theory, you could randomly allocate some memory, look up where it actually lives, and check whether we've already tested that area. And so while you couldn't directly test all physical RAM, you'd have an actual understanding of how much you've randomly tested without relying on statistics.
Assignee: n.nethercote → nobody