Open Bug 1270554 Opened 8 years ago Updated 1 year ago

Run memtest continuously on the live browser

Categories

(Core :: JavaScript: GC, defect)


Tracking


Tracking Status
firefox49 --- affected

People

(Reporter: terrence, Unassigned)

References

(Depends on 1 open bug, Blocks 2 open bugs)

Details

From Jan's analysis of JIT crashes in bug 1034706, it seems that >50% of our crash volume there is related to memory corruption. I'm opening this bug to discuss one of the mitigation strategies brought up there.

(In reply to Luke Wagner [:luke] from comment #47)
> Random idea:
> 
> What if we had a system that allocated a few scattered MiB (i.e., not all in
> one contiguous run or always at the same address, though being careful not
> to unduly increase fragmentation) with a predictable bit-pattern and
> periodically (say, on the daily telemetry or update ping) the system scanned
> all those MiBs to ensure they still had the same bit-pattern.  If a bitflip
> was detected, we set a flag in the browser that gets included in crash
> reports and also persists between browser restarts (at least for a period of
> time).
> 
> This could help us confirm a correlation between these catch-all JIT/GC
> crashes and the corruption flag and also have separate bins so that spikes
> in non-corruption-correlated crashes get more attention.  If we wanted to
> get fancy, we could even pop up a notification to the user suggesting they
> have bad RAM if they had the corruption flag and they were experiencing
> crashes :)

(In reply to Nicholas Nethercote [:njn] from comment #51)
> Thank you for the detailed analysis, Jan.
> 
> The bitflips are scary. Jan, are you assuming that it's faulty hardware
> that's the cause? It sounds like others are assuming that but I can't tell
> if that's what you think.
> 
> I think running a memtest under certain circumstances is a great idea. How hard
> is it to write a memtest? What circumstances would you run it under? Do we have
> a bug open for this idea?

(In reply to Luke Wagner [:luke] from comment #52)
> I remember dolske was experimenting with running a memtest a few years ago;
> I don't know what happened there.  I don't have any bugs on file -- just an
> idea while reading Jan's very interesting analysis -- sorry, don't mean to
> derail to more targeted discussion here.

I talked to dolske about his memtest implementation and memtest in general fairly extensively back when my desk was next to his. The general memtest problem is amazingly hard: for example, if the issue is an address line that is stuck at zero in a certain region, writing the same guard bits to every address doesn't work, because you'll read the "correct" bits back from the wrong slot. Beyond that, errors in the chip's decode logic can lead to surprisingly subtle errors in the output: e.g. a bit gets forced high exactly 2MiB below the written address iff two specific address lines are high and the middle 6 bits of the input data are low, and so on.
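
For concreteness, here is a minimal sketch (C++, not anything from the tree) of the classic countermeasure, memtest86's "own address" test: it catches exactly the aliasing that a uniform guard pattern misses, because two addresses that collapse onto one cell can no longer both hold their own address.

  // Sketch only: write each word's own address into it, then read everything
  // back. With a stuck address line, two addresses alias to one cell; a
  // uniform fill still reads back "correct", but this pattern makes the
  // aliased reads disagree.
  #include <cstdint>
  #include <cstdio>

  static size_t OwnAddressTest(volatile uintptr_t* words, size_t count) {
    for (size_t i = 0; i < count; i++) {
      words[i] = reinterpret_cast<uintptr_t>(&words[i]);
    }
    size_t errors = 0;
    for (size_t i = 0; i < count; i++) {
      if (words[i] != reinterpret_cast<uintptr_t>(&words[i])) {
        fprintf(stderr, "mismatch at index %zu: read 0x%zx\n",
                i, static_cast<size_t>(words[i]));
        errors++;
      }
    }
    return errors;
  }

That still won't catch the subtler decode errors (the 2MiB-below case needs specific data patterns as well as specific addresses), but it's the kind of check a single-pattern scan can never do.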

On the other hand, this may not matter that much in practice. The data we do have is that for the 3 known-bad sticks I gave dolske, his fairly simple JavaScript memtest implementation was able to detect the errors. Then again, those are sticks that were so bad that I noticed and ran memtest86 on them. I don't think we'll know more until we actually implement something.

To date our thoughts have landed in bug 995652: e.g. post-mortem analysis of probable flips. Luke's idea to do bit-flip detection actively on the running process is something that has come up a number of times in GC meetings; in the past, however, we've always considered it a comparatively poor use of the world's rapidly dwindling oil reserves. But if more than half of our JIT crashes also fall under this umbrella...

I think the right way to implement this is with the OMT chunk rebalancing code. We can have it reserve some extra chunks (say, 10% of the heap) as "ballast" with known non-random content. We'll need some way to route the browser's idle callback to us to spawn an off-thread worker to scan some of the ballast, but that's also not hard. If we actually hit memory pressure, we can always reclaim these chunks for real use, so it shouldn't ever cause us to OOM.
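
To make that concrete, here is a rough sketch of the scanning side; BallastChunk, ScanSomeBallast, the chunk size, and the fill pattern are all invented for illustration and none of this is wired into the real chunk code:

  // Sketch only: keep spare chunks filled with a known pattern, verify one
  // chunk per idle-time call, and raise a flag that crash reports could pick up.
  #include <atomic>
  #include <cstdint>
  #include <vector>

  constexpr size_t kChunkSize = 1 << 20;               // 1 MiB per ballast chunk (assumption)
  constexpr uint64_t kPattern = 0xA5A5A5A5A5A5A5A5ULL; // arbitrary non-random fill

  struct BallastChunk {
    std::vector<uint64_t> words;
    BallastChunk() : words(kChunkSize / sizeof(uint64_t), kPattern) {}
  };

  std::atomic<bool> gCorruptionSeen{false};

  // Called from an idle/off-thread task; scans one chunk per call so each
  // slice of work stays small. Returns false if a flipped bit was found.
  bool ScanSomeBallast(std::vector<BallastChunk>& ballast, size_t& cursor) {
    if (ballast.empty()) {
      return true;
    }
    BallastChunk& chunk = ballast[cursor++ % ballast.size()];
    for (uint64_t w : chunk.words) {
      if (w != kPattern) {
        gCorruptionSeen.store(true, std::memory_order_relaxed);
        return false;
      }
    }
    return true;
  }

Per the caveats above, a real version would want an address-dependent fill rather than one constant, so that a stuck address line can't hide.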

I'll try to spend some time on this in the next few weeks.
Depends on: 1272449
Depends on: 988356
Here's how Microsoft does it (from section 7.1 of http://www.sigops.org/sosp/sosp09/papers/glerum-sosp09.pdf):

"The kernel maintains a list of free pages, zeroed when the CPU is idle. On
every sixteenth page allocation, the Vista kernel examines the page content and
asynchronously creates an error report if any zero has changed to a one.
Analysis of these corrupt zero pages has located errors in hardware like a
DMA engine that only supported 31 bit addresses and errors in firmware like a
brand of laptop with a bad resume from sleep firmware."

Notes:

- This only detects memory defects that cause zero bits to be read as ones. I don't know if the inverse (one bits read as zeroes) is similarly common.

- They can do this in the kernel, where they can work with physical addresses. We're stuck with virtual addresses, unfortunately.

- We could do this at the level of mmap/VirtualAlloc(). Every anonymous mapping allocated could be checked, or we could check a subset of them. This would give us maximum coverage, because it would cover the entire C heap, the entire JS heap, generated JIT code, and the handful of other things we use mmap/VirtualAlloc for. (A rough sketch follows below.)
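
To make the mmap idea concrete, here is a sketch under POSIX assumptions; CheckedAnonymousMap is a made-up wrapper, not the real Gecko allocation hook:

  // Fresh MAP_ANONYMOUS pages are zero-filled by the kernel, so a non-zero
  // word in a brand-new mapping means something below us corrupted it.
  // Like the Vista scheme, only every 16th mapping is checked, and only its
  // first page, to avoid committing memory we haven't been asked for.
  #include <sys/mman.h>
  #include <algorithm>
  #include <cstdint>
  #include <cstdio>

  static unsigned gMapCount = 0;  // not thread-safe; just a sketch

  void* CheckedAnonymousMap(size_t length) {
    void* p = mmap(nullptr, length, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
      return nullptr;
    }
    if (++gMapCount % 16 == 0) {
      const uint64_t* words = static_cast<const uint64_t*>(p);
      size_t n = std::min<size_t>(length, 4096) / sizeof(uint64_t);
      for (size_t i = 0; i < n; i++) {
        if (words[i] != 0) {
          fprintf(stderr, "non-zero word at offset %zu in a fresh mapping\n",
                  i * sizeof(uint64_t));
          // A real version would set a flag for telemetry / crash annotations.
          break;
        }
      }
    }
    return p;
  }

The Windows side could hang the same check off VirtualAlloc(MEM_COMMIT), which also guarantees zero-initialized pages.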

I'm going to do some experiments along these lines.
Assignee: nobody → n.nethercote
http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf is a paper where Google discusses RAM fault rates.
I've been meaning to blog about this for years, but just never got around to it. :(

The code I wrote is on GitHub, and a live demo is on my site:

  https://github.com/dolske/memtest.js
  https://dolske.net/hacks/memtest.js/live/

It's basically a JS port of the guts of memtest86+ (running in a worker thread) with a similar UI (just for fun). The tests allocate large blocks of memory, and iterate over them setting and checking values in various ways. I'm sure the JIT loves tight simple loops like this, and I was seeing speeds close to native when I tested 3 years ago.
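
Roughly, one such pass (the memtest86+ "moving inversions" idea) has the following shape; this is a C++ sketch of the technique, not the actual JS from the repo:

  // Sketch only: fill with a pattern, walk up verifying and writing the
  // complement, then walk down verifying the complement and restoring the
  // pattern. The changing address order and data are what let it catch
  // faults that a single static fill misses.
  #include <cstddef>
  #include <cstdint>

  static size_t MovingInversions(volatile uint32_t* buf, size_t count,
                                 uint32_t pattern) {
    size_t errors = 0;
    for (size_t i = 0; i < count; i++) {
      buf[i] = pattern;
    }
    for (size_t i = 0; i < count; i++) {
      if (buf[i] != pattern) errors++;
      buf[i] = static_cast<uint32_t>(~pattern);
    }
    for (size_t i = count; i-- > 0;) {
      if (buf[i] != static_cast<uint32_t>(~pattern)) errors++;
      buf[i] = pattern;
    }
    return errors;
  }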

The basic theory was that while it's not entirely possible to replicate memtest86 in a user-mode process (because you can't touch memory used elsewhere in the system, can't avoid virtual memory abstractions, etc.), you could probably do a good enough job by repeatedly testing a subset of memory and making a statistical inference. E.g., test 10% of the memory once a day on idle, and after some weeks you should have some confidence that the memory is good. Or if you ever find a single bad bit, you know for sure there's a problem. :)
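
Back-of-the-envelope on that inference, under the (strong) assumptions that the bad bit lives in one fixed physical page and each idle run independently covers a uniform 10% of RAM: the chance of catching it within d days is 1 - 0.9^d, i.e. roughly 52% after a week, 77% after two, and 96% after a month.

  // Tiny check of the numbers above; coveragePerDay is the assumed fraction
  // of RAM sampled per idle run.
  #include <cmath>
  #include <cstdio>

  int main() {
    const double coveragePerDay = 0.10;
    const int daysList[] = {7, 14, 30, 60};
    for (int days : daysList) {
      double pDetect = 1.0 - std::pow(1.0 - coveragePerDay, days);
      printf("%2d days: %.1f%% chance of catching the bad bit\n",
             days, 100.0 * pDetect);
    }
    return 0;
  }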

The results were encouraging: with some known-bad RAM and big test blocks, it was able to find a problem reliably -- I saved a screenshot of the first time that happened. :) http://dolske.net/hacks/memtest.js/First%20jsmemtest%20fail%202013-05-28%2017_35_36.png It also demonstrated that some of the non-trivial bit flipping was necessary to detect errors, and that it could be done from JS. (I.e., simply setting a block to a static pattern and checking the result was often insufficient to detect the error.)

The last thing I worked on before setting it aside was seeing if testing smaller chunks of RAM worked reliably. You want to minimize the risk of OOM and runtime, after all. But I had mixed results. In a perfect world, testing N% of the RAM would have an N% chance of finding the bad bit. In reality, there was a dropoff, and below a certain threshold it never found the error (I assume because of vagaries in how the OS and Firefox deal with memory allocations and VM mappings). I can dig up the data I had, but my recollection was that this wasn't severe enough to be a deal-breaker.
(In reply to Nicholas Nethercote [:njn] from comment #1)

> - They can do this in the kernel, where they can work with physical
> addresses. We're stuck with virtual addresses, unfortunately.

One trick I did was to log the error offset within the ArrayBuffer I was using, mod 4096. My (admittedly ancient) understanding of how x86 VM works is that it's all 4K pages underneath, and presumably big allocations are page-aligned... This seemed to result in a stable identifier for the error bit across multiple runs/reboots. But I dunno how _useful_ that is, since once you've found a bad bit it doesn't really matter where it is. :)

Another idea was that I think it's possible, at least on Linux and Solaris (did I mention "ancient"? :), to figure out the physical address for a user address. (I didn't fully verify that, but it seemed like the info was there in various tools and /proc output.) So, in theory, you could randomly allocate some memory, look up where it actually lives, and check whether we've already tested that area. And so while you couldn't directly test all physical RAM, you'd have an actual understanding of how much you've randomly tested without relying on statistics.
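
On Linux, one way to do that lookup is /proc/self/pagemap: each virtual page has an 8-byte entry whose bit 63 says the page is present and whose bits 0-54 hold the physical frame number. A sketch, with the caveat that since Linux 4.0 the frame number reads back as 0 for unprivileged processes, so this needs root/CAP_SYS_ADMIN to be useful:

  // Returns the physical address backing vaddr, or 0 if unknown/not present.
  #include <fcntl.h>
  #include <unistd.h>
  #include <cstdint>

  uint64_t VirtualToPhysical(const void* vaddr) {
    long pageSize = sysconf(_SC_PAGESIZE);
    uint64_t vpn = reinterpret_cast<uintptr_t>(vaddr) / pageSize;

    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0) {
      return 0;
    }
    uint64_t entry = 0;
    ssize_t got = pread(fd, &entry, sizeof(entry), vpn * sizeof(entry));
    close(fd);
    if (got != static_cast<ssize_t>(sizeof(entry)) || !(entry & (1ULL << 63))) {
      return 0;  // read failed or page not present
    }
    uint64_t pfn = entry & ((1ULL << 55) - 1);
    return pfn * pageSize + reinterpret_cast<uintptr_t>(vaddr) % pageSize;
  }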
Assignee: n.nethercote → nobody
Severity: normal → S3