Open Bug 1317253 Opened 6 years ago Updated 2 months ago

Best-effort detection of faulty memory at page-request time

Categories

(Core :: Memory Allocator, defect, P5)

Tracking Status
firefox52 --- affected

People

(Reporter: jseward, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

There is evidence to suggest that at least some browser crashes are
due to faulty RAM.  The proposal here is to test memory that is
"brought in" to the process via mmap-zero calls, or the Windows
equivalents, immediately after the map call succeeds, but before the
new area is made available for use.  This, and the symmetrical
immediately-before-unmap point, are the only places where we can
test memory by trying to modify it.
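
As a rough illustration of the "test after the map call succeeds, before the area is handed out" idea, here is a minimal POSIX sketch. The names map_tested and check_pattern are hypothetical and not taken from the Gecko allocator, and the choice of patterns is an assumption. It makes no attempt to defeat CPU caches, so on its own it may well exercise cache rather than RAM.

```c
#define _DEFAULT_SOURCE  /* for MAP_ANONYMOUS on glibc in strict modes */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Hypothetical helper: store `pattern` in every word, then read every
   word back. Returns the number of words that did not read back equal. */
static size_t check_pattern(volatile uintptr_t *words, size_t nwords,
                            uintptr_t pattern)
{
    size_t bad = 0;
    for (size_t i = 0; i < nwords; i++)
        words[i] = pattern;
    for (size_t i = 0; i < nwords; i++)
        if (words[i] != pattern)
            bad++;
    return bad;
}

/* Hypothetical wrapper: map zero-filled anonymous memory, test it, and
   return it only if no mismatch was seen. Returns NULL if the mapping
   fails or a mismatch is detected. */
static void *map_tested(size_t len)
{
    assert(len % sizeof(uintptr_t) == 0);

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    size_t nwords = len / sizeof(uintptr_t);
    size_t bad = 0;
    /* Alternating-bit patterns; the casts truncate on 32-bit targets. */
    bad += check_pattern(p, nwords, (uintptr_t)0x5555555555555555ULL);
    bad += check_pattern(p, nwords, (uintptr_t)0xAAAAAAAAAAAAAAAAULL);
    /* Final pass restores the zero fill that callers of mmap expect. */
    bad += check_pattern(p, nwords, 0);

    if (bad != 0) {
        munmap(p, len);
        return NULL;
    }
    return p;
}
```

What to do on a detected fault is left open here. The sketch simply unmaps and reports failure, which gives the suspect page back to the kernel; a real implementation would probably want to record the event instead.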

There are a number of difficulties with this, which means we can only
do a best-effort job.

* some areas -- stacks, globals, static code -- won't get tested.  The
  above proposal only really covers heap and heap-like areas, for
  example code buffers for the JITs.

* we have no way to reliably visit any specific physical address
  range, since we have no control over the virtual to physical
  mapping.  To try and maximise the chances that testing N consecutive
  virtual pages actually tests N different physical pages, we could
  visit virtual addresses as follows: first word of the first page,
  first word of the second page, etc .. first word of the last page,
  then repeat for second word of each page, etc.  The idea is to make
  it appear to the kernel's page-swap heuristics that all the pages
  are currently in use, so none get swapped out.

* Rather than just produce a binary fail/no-fail result for the
  testing as a whole, we could make the results available as a vector
  of 4096 booleans, in which the Nth entry is set if a failure has
  been observed at offset N in a page (assuming a page is 4096 bytes).
  This makes it somewhat possible to differentiate different failing
  pages.  For example, if one page shows an error at offset 123, and
  another at offset 456, then we know that we have at least two
  different bad physical pages.  This could help assess the severity
  of errors.

* I don't yet know how to remove the effects of caching from this.
  Large (eg, 8MB) last-level caches are now commonplace, and I don't
  want to be mistakenly testing the L3 cache rather than RAM.

* Given that RAM works reliably almost all of the time, this is
  difficult to test.  At least for development, I propose to create a
  simple Valgrind tool that observes all writes to memory.  The tool
  can be told to mutate writes in specified address ranges during
  specified time periods with a specified probability.  This would
  make it possible to test the detection mechanism during development,
  and more generally to get at least some idea of the effect of faulty
  memory on Gecko as a whole.
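
The page-hopping visit order and the per-offset failure vector from the list above could be combined roughly as follows. This is only a sketch under stated assumptions: 4096-byte pages, word-sized accesses, and hypothetical names (hop_write, hop_verify, ASSUMED_PAGE_SIZE). A failing word is recorded at the byte offset where that word starts. Writing and verifying are separate functions so that a fault can be simulated between them during development.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ASSUMED_PAGE_SIZE 4096  /* assumption, as in the text above */
#define WORDS_PER_PAGE (ASSUMED_PAGE_SIZE / sizeof(uintptr_t))

/* Address of word `word` within page `page` of the region at `base`. */
static volatile uintptr_t *word_at(void *base, size_t page, size_t word)
{
    unsigned char *page_start =
        (unsigned char *)base + page * ASSUMED_PAGE_SIZE;
    return (volatile uintptr_t *)page_start + word;
}

/* Write `pattern` in page-hopping order: word 0 of every page, then
   word 1 of every page, and so on. */
static void hop_write(void *base, size_t npages, uintptr_t pattern)
{
    assert((uintptr_t)base % sizeof(uintptr_t) == 0);
    for (size_t w = 0; w < WORDS_PER_PAGE; w++)
        for (size_t p = 0; p < npages; p++)
            *word_at(base, p, w) = pattern;
}

/* Read back in the same order. For each mismatching word, set the
   entry of bad_offset[] at that word's starting byte offset within its
   page. Entries are only ever set, never cleared, so results
   accumulate across calls. Returns the number of mismatches seen. */
static size_t hop_verify(void *base, size_t npages, uintptr_t pattern,
                         bool bad_offset[ASSUMED_PAGE_SIZE])
{
    size_t failures = 0;
    assert((uintptr_t)base % sizeof(uintptr_t) == 0);
    for (size_t w = 0; w < WORDS_PER_PAGE; w++) {
        for (size_t p = 0; p < npages; p++) {
            if (*word_at(base, p, w) != pattern) {
                bad_offset[w * sizeof(uintptr_t)] = true;
                failures++;
            }
        }
    }
    return failures;
}
```

Two distinct entries set in bad_offset[] then correspond to the "at least two different bad physical pages" inference described above, provided the failures are actually caused by bad RAM.
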
(In reply to Julian Seward [:jseward] from comment #0)
> There is evidence to suggest that at least some browser crashes are
> due to faulty RAM.  The proposal here is to test memory that is
> "brought in" to the process via mmap-zero calls, or the Windows
> equivalents, immediately after the map call succeeds, but before the
> new area is made available for use.  This, and the symmetrical
> immediately-before-unmap point, are the only places where we can
> test memory by trying to modify it.
> 
> There are a number of difficulties with this, which means we can only
> do a best-effort job.
> 
> * some areas -- stacks, globals, static code -- won't get tested.  The
>   above proposal only really covers heap and heap-like areas, for
>   example code buffers for the JITs.
> 
> * we have no way to reliably visit any specific physical address
>   range, since we have no control over the virtual to physical
>   mapping.

Worse, you're not even guaranteed that the same virtual address will always point to the same physical memory.

>  To try and maximise the chances that testing N consecutive
>   virtual pages actually tests N different physical pages, we could
>   visit virtual addresses as follows: first word of the first page,
>   first word of the second page, etc .. first word of the last page,
>   then repeat for second word of each page, etc.  The idea is to make
>   it appear to the kernel's page-swap heuristics that all the pages
>   are currently in use, so none get swapped out.

... although in practice, if it's not swapped out, it's very likely it won't change.

> * Rather than just produce a binary fail/no-fail result for the
>   testing as a whole, we could make the results available as a vector
>   of 4096 booleans, in which the Nth entry is set if a failure has
>   been observed at offset N in a page (assuming a page is 4096 bytes).
>   This makes it somewhat possible to differentiate different failing
>   pages.  For example, if one page shows an error at offset 123, and
>   another at offset 456, then we know that we have at least two
>   different bad physical pages.  This could help assess the severity
>   of errors.
> 
> * I don't yet know how to remove the effects of caching from this.
>   Large (eg, 8MB) last-level caches are now commonplace, and I don't
>   want to be mistakenly testing the L3 cache rather than RAM.

It's possible to disable the caches, but that probably requires kernel privileges.

We usually mmap in chunks of 1MB. I suspect checking 1MB of memory one word at a time, alternating between pages, is going to make mmapping noticeably slower. Also, from my experience running memtest86 back in the day when I had bad RAM, bad RAM can work fine with some patterns and fail with others. You'd need multiple passes with different patterns to discover anything useful, making things even slower.

[Come to think of it, it's possible to force cache flushes on memory ranges from userspace with cacheflush(2) on Linux. I don't know if there are equivalent APIs on OSX and Windows.]
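
For the x86 case specifically, one unprivileged way to push a range out of the cache hierarchy is the clflush instruction, exposed as the _mm_clflush intrinsic in <emmintrin.h>. The sketch below is an assumption-laden illustration, not a portable answer to the question above: flush_range and ASSUMED_CACHE_LINE are hypothetical names, the 64-byte line size is assumed rather than queried, and on targets without SSE2 the function does nothing and reports that by returning false.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_clflush, _mm_mfence */
#endif

#define ASSUMED_CACHE_LINE 64  /* assumption; not queried from the CPU */

/* Flush every cache line overlapping [start, start + len).
   Returns true if flushes were issued, false if this build has no
   flush primitive available (in which case nothing is done). */
static bool flush_range(const void *start, size_t len)
{
    assert(start != NULL || len == 0);
#if defined(__SSE2__)
    /* Round down to the start of the first cache line. */
    uintptr_t p = (uintptr_t)start & ~(uintptr_t)(ASSUMED_CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)start + len;
    for (; p < end; p += ASSUMED_CACHE_LINE)
        _mm_clflush((const void *)p);
    /* Order the flushes before any loads that follow. */
    _mm_mfence();
    return true;
#else
    (void)start;
    (void)len;
    return false;
#endif
}
```

Calling flush_range between the write phase and the read phase of a test makes it more likely that the read comes from RAM, at a per-line cost that adds up over a 1MB chunk.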

Now, the question is how often we actually mmap memory. Maybe we don't map frequently enough for it to matter overall, but making malloc randomly take much longer than usual, just because a call happened to require an mmap, doesn't sound great. That arguably already happens, but we're talking about making it even slower here, which sounds like a good way to introduce new jank. So ideally, if we're really going to do this, it would have to happen on a separate thread that speculatively mmaps memory before we actually need it... but that won't help for allocations that don't match the exact chunk size (and there are multiple ways to get those).

Also, the premise that there's possibly bad RAM to find seems flawed. There are plenty of ways to get bit flips in memory, and (detectable) bad RAM is merely one of them. Overall, I'm not at all convinced this is worth the effort.
Priority: -- → P5
Hmm, the performance aspect might be a showstopper.  Making 8
write/read passes with different random patterns over a 1MB block
takes about 155 milliseconds on a 2.5 GHz Haswell.  On a low end
target that's going to be more like one second.

/usr/bin/perf reports an IPC of 0.05 (!) for the run, which is bad
but expected.  Removing the clflush instruction from the store path
reduces the run time to 22 milliseconds, which confirms that most of
the time goes in cache flushes.

This is with a linear scan through memory.  Using the page-hopping
scheme described in comment 0 would surely be even slower, as it
would presumably have worse TLB locality.
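
For readers without the attachment, the shape of a multi-pass write/read test with a different pseudo-random pattern per pass might look like the following. This is not the attached memtest.c; it is an independent sketch with hypothetical names (run_passes, xorshift64), it uses a plain linear scan, and it omits the cache flushing that the timings above include.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Small xorshift PRNG; the state must be non-zero and stays non-zero. */
static uint64_t xorshift64(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *state = x;
}

/* Run `npasses` write-then-read passes over `block`. Each pass fills
   the block with a fresh pseudo-random sequence, then replays the same
   sequence and compares. Returns the total number of mismatching words
   over all passes. */
static size_t run_passes(volatile uint64_t *block, size_t nwords,
                         unsigned npasses, uint64_t seed)
{
    assert(seed != 0);
    size_t mismatches = 0;
    uint64_t state = seed;

    for (unsigned pass = 0; pass < npasses; pass++) {
        uint64_t pass_start = state;  /* remember where this pass began */

        for (size_t i = 0; i < nwords; i++)
            block[i] = xorshift64(&state);

        uint64_t replay = pass_start;  /* regenerate the same sequence */
        for (size_t i = 0; i < nwords; i++)
            if (block[i] != xorshift64(&replay))
                mismatches++;
        /* `state` carries on, so the next pass uses different values. */
    }
    return mismatches;
}
```

Timing a call such as run_passes(block, (1u << 20) / sizeof(uint64_t), 8, seed) with and without a flush step between the write and read loops is one way to reproduce the kind of comparison described above.
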
Attached patch: memtest.c
Test program referred to in comment 2.
Severity: normal → S3