Open Bug 1317253 Opened 8 years ago Updated 9 months ago

Best-effort detection of faulty memory at page-request time

Status: NEW

Tracking Flags:

firefox52: tracking ---, status affected

People

(Reporter: jseward, Unassigned)

Attachments

(1 file)

memtest.c, 8 years ago, Julian Seward [:jseward], 2.10 KB, patch

Julian Seward [:jseward]

Reporter

Description

8 years ago

There is evidence to suggest that at least some browser crashes are due to faulty RAM. The proposal here is to test memory that is "brought in" to the process via mmap-zero calls, or the Windows equivalents, immediately after the map call succeeds, but before the new area is made available for use. This, and the symmetrical immediately-before-unmap point, are the only places where we can test memory by trying to modify it.

There are a number of difficulties with this, which means we can only do a best-effort job.

* Some areas -- stacks, globals, static code -- won't get tested. The above proposal only really covers heap and heap-like areas, for example code buffers for the JITs.

* We have no way to reliably visit any specific physical address range, since we have no control over the virtual to physical mapping. To try and maximise the chances that testing N consecutive virtual pages actually tests N different physical pages, we could visit virtual addresses as follows: first word of the first page, first word of the second page, etc .. first word of the last page, then repeat for second word of each page, etc. The idea is to make it appear to the kernel's page-swap heuristics that all the pages are currently in use, so none get swapped out.

* Rather than just produce a binary fail/no-fail result for the testing as a whole, we could make the results available as a vector of 4096 booleans, in which the Nth entry is set if a failure has been observed at offset N in a page (assuming a page is 4096 bytes). This makes it somewhat possible to differentiate different failing pages. For example, if one page shows an error at offset 123, and another at offset 456, then we know that we have at least two different bad physical pages. This could help assess the severity of errors.

* I don't yet know how to remove the effects of caching from this. Large (eg, 8MB) last-level caches are now commonplace, and I don't want to be mistakenly testing the L3 cache rather than RAM.

* Given that RAM works reliably almost all of the time, this is difficult to test. At least for development, I propose to create a simple Valgrind tool that observes all writes to memory. The tool can be told to mutate writes in specified address ranges during specified time periods with a specified probability. This would make it possible to test the detection mechanism during development, and more generally to get at least some idea of the effect of faulty memory on Gecko as a whole.
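The page-hopping visit order and the per-offset failure vector described above might look roughly like the following. This is an illustrative sketch only, not code from this bug: the names `write_page_hopping`, `verify_page_hopping` and `PAGE_SIZE`, and the split into separate write and verify phases, are assumptions made for the example. It also does nothing about caching, so on real hardware it could easily end up exercising the cache rather than RAM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: 4096-byte pages, as in the description above. */
#define PAGE_SIZE 4096u

/* Write `pattern` to every word of `npages` consecutive pages, visiting
 * word 0 of each page, then word 1 of each page, and so on. */
static void write_page_hopping(volatile uintptr_t *base, size_t npages,
                               uintptr_t pattern)
{
    const size_t words_per_page = PAGE_SIZE / sizeof(uintptr_t);
    assert(base != NULL);
    for (size_t w = 0; w < words_per_page; w++)
        for (size_t p = 0; p < npages; p++)
            base[p * words_per_page + w] = pattern;
}

/* Read the pages back in the same order. For every byte that differs from
 * the expected pattern, set bad_offset[N], where N is the byte's offset
 * within its page. The caller is expected to zero bad_offset beforehand.
 * Returns the number of mismatching words. */
static size_t verify_page_hopping(const volatile uintptr_t *base,
                                  size_t npages, uintptr_t pattern,
                                  unsigned char bad_offset[PAGE_SIZE])
{
    const size_t words_per_page = PAGE_SIZE / sizeof(uintptr_t);
    size_t bad_words = 0;
    assert(base != NULL && bad_offset != NULL);
    for (size_t w = 0; w < words_per_page; w++) {
        for (size_t p = 0; p < npages; p++) {
            uintptr_t got = base[p * words_per_page + w];
            if (got == pattern)
                continue;
            bad_words++;
            /* Compare byte by byte so the offsets are endian-independent. */
            unsigned char g[sizeof got], e[sizeof pattern];
            memcpy(g, &got, sizeof got);
            memcpy(e, &pattern, sizeof pattern);
            for (size_t b = 0; b < sizeof got; b++)
                if (g[b] != e[b])
                    bad_offset[w * sizeof(uintptr_t) + b] = 1;
        }
    }
    return bad_words;
}
```

A caller would zero a `PAGE_SIZE`-entry vector, call `write_page_hopping` on the freshly mapped region, then call `verify_page_hopping` and inspect which entries were set. Two set entries at different offsets, such as 123 and 456, would indicate at least two distinct failing pages, as argued above.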

Mike Hommey [:glandium]

Comment 1

8 years ago

(In reply to Julian Seward [:jseward] from comment #0)
> There is evidence to suggest that at least some browser crashes are
> due to faulty RAM. The proposal here is to test memory that is
> "brought in" to the process via mmap-zero calls, or the Windows
> equivalents, immediately after the map call succeeds, but before the
> new area is made available for use. This, and the symmetrical
> immediately-before-unmap point, are the only places where we can
> test memory by trying to modify it.
>
> There are a number of difficulties with this, which means we can only
> do a best-effort job.
>
> * some areas -- stacks, globals, static code -- won't get tested. The
> above proposal only really covers heap and heap-like areas, for
> example code buffers for the JITs.
>
> * we have no way to reliably visit any specific physical address
> range, since we have no control over the virtual to physical
> mapping.

Worse, you're not even guaranteed that the same virtual address will always point to the same physical memory.

> To try and maximise the chances that testing N consecutive
> virtual pages actually tests N different physical pages, we could
> visit virtual addresses as follows: first word of the first page,
> first word of the second page, etc .. first word of the last page,
> then repeat for second word of each page, etc. The idea is to make
> it appear to the kernel's page-swap heuristics that all the pages
> are currently in use, so none get swapped out.

... although in practice, if it's not swapped out, it's very likely it won't change.

> * Rather than just produce a binary fail/no-fail result for the
> testing as a whole, we could make the results available as a vector
> of 4096 booleans, in which the Nth entry is set if a failure has
> been observed at offset N in a page (assuming a page is 4096 bytes).
> This makes it somewhat possible to differentiate different failing
> pages. For example, if one page shows an error at offset 123, and
> another at offset 456, then we know that we have at least two
> different bad physical pages. This could help assess the severity
> of errors.
>
> * I don't yet know how to remove the effects of caching from this.
> Large (eg, 8MB) last-level caches are now commonplace, and I don't
> want to be mistakenly testing the L3 cache rather than RAM.

It's possible to disable the caches, but that probably requires kernel privileges.

We usually mmap by chunks of 1MB. I suspect checking 1MB of memory 1 word at a time, alternating between pages, is going to make mmapping noticeably slower. Also, from my experience with running memtest86 back in the day when I had bad RAM, bad RAM can work fine on some patterns and fail on some others. You'd need multiple passes with various different patterns to be able to discover any useful information, making things even slower.

[Come to think of it, it's possible to force cache flushes on memory ranges from userspace with cacheflush(2) on Linux. I don't know if there are equivalent APIs on OSX and Windows.]

Now, the question is how often we are actually mmapping memory. Maybe we're not mapping frequently enough that it would matter overall, but the side effect of making malloc randomly take much more time than usual because it just happened to have required an mmap doesn't sound really awesome. It's arguably already happening, but we're talking about making it even slower here. That sounds like a good way to introduce new jank.

So ideally, if we're really going to do this, this would have to happen on a separate thread that would speculatively mmap memory before we actually need it... but that won't help for allocations that don't match the exact chunk size (which there are multiple ways to get).

Also, the premise that there's possibly bad RAM to find seems flawed. There are plenty of ways to get bit flips in memory, and (detectable) bad RAM is merely one of them.

Overall, I'm not at all convinced this is worth the effort.
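To illustrate the point about pattern sensitivity, here is a minimal sketch of a multi-pattern pass. Everything in it is an assumption made for the example: the four-entry pattern list `k_patterns` (real testers such as memtest86 use many more), and the helper names `fill_words`, `count_mismatches` and `run_all_patterns`. It is not taken from any Gecko or memtest86 source, and it ignores caching entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pattern set: all zeros, all ones, 0x55.. and 0xAA.. */
static const uintptr_t k_patterns[] = {
    0,
    ~(uintptr_t)0,
    ~(uintptr_t)0 / 3,       /* 0x5555... */
    ~(uintptr_t)0 / 3 * 2,   /* 0xAAAA... */
};
#define NUM_PATTERNS (sizeof k_patterns / sizeof k_patterns[0])

/* Store `pattern` into every word of the region. */
static void fill_words(volatile uintptr_t *base, size_t nwords,
                       uintptr_t pattern)
{
    assert(base != NULL);
    for (size_t i = 0; i < nwords; i++)
        base[i] = pattern;
}

/* Count the words that no longer hold `pattern`. */
static size_t count_mismatches(const volatile uintptr_t *base, size_t nwords,
                               uintptr_t pattern)
{
    size_t bad = 0;
    assert(base != NULL);
    for (size_t i = 0; i < nwords; i++)
        if (base[i] != pattern)
            bad++;
    return bad;
}

/* Run one write/read pass per pattern. Bit i of the result is set if
 * pattern i saw at least one mismatch. */
static unsigned run_all_patterns(volatile uintptr_t *base, size_t nwords)
{
    unsigned failed = 0;
    for (size_t i = 0; i < NUM_PATTERNS; i++) {
        fill_words(base, nwords, k_patterns[i]);
        if (count_mismatches(base, nwords, k_patterns[i]) != 0)
            failed |= 1u << i;
    }
    return failed;
}
```

A bit that is stuck at 1 would only be caught by the patterns that try to store a 0 there, so a single-pattern test could report the region as good. Each extra pattern adds another full write and read of the region, which is the cost concern raised above.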

Priority: -- → P5

Julian Seward [:jseward]

Reporter

Comment 2

8 years ago

Hmm, the performance aspect might be a showstopper. Making 8 write/read passes with different random patterns over a 1MB block takes about 155 milliseconds on a 2.5 GHz Haswell. On a low end target that's going to be more like one second. /usr/bin/perf reports an IPC of 0.05 (!) for the run, which is bad but expected. Removing the clflush instruction from the store path reduces the run time to 22 milliseconds, which confirms that most of the time goes in cache flushes. This is with a linear scan through memory. Using the page-hopping scheme described in comment 0 would surely be even slower, as it would presumably have worse TLB locality.
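For readers without access to the attachment, the kind of measurement described here could be sketched as below. This is a hypothetical reconstruction, not the attached memtest.c: the xorshift generator, the per-store flush placement, the `FLUSH_LINE` macro and the function names are all assumptions. The flush uses the SSE2 intrinsic `_mm_clflush` where available and compiles to a no-op elsewhere, so timings on non-x86 targets would not be comparable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

#if defined(__SSE2__)
#include <emmintrin.h>                 /* _mm_clflush */
#define FLUSH_LINE(p) _mm_clflush((const void *)(p))
#else
#define FLUSH_LINE(p) ((void)(p))      /* no flush available: no-op */
#endif

/* Small deterministic PRNG (xorshift64); state must be non-zero. */
static uint64_t xorshift64(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *state = x;
}

/* Make `npasses` linear write-then-read passes over the region, each with
 * a different pseudo-random sequence. If `flush` is non-zero, flush the
 * cache line after every store. Returns the number of mismatching words. */
static size_t write_read_passes(volatile uint64_t *base, size_t nwords,
                                int npasses, int flush)
{
    size_t errors = 0;
    assert(base != NULL);
    for (int pass = 0; pass < npasses; pass++) {
        const uint64_t seed = 0x9E3779B97F4A7C15ull + (uint64_t)pass;
        uint64_t s = seed;
        for (size_t i = 0; i < nwords; i++) {
            base[i] = xorshift64(&s);
            if (flush)
                FLUSH_LINE(&base[i]);
        }
        s = seed;                       /* replay the same sequence */
        for (size_t i = 0; i < nwords; i++)
            if (base[i] != xorshift64(&s))
                errors++;
    }
    return errors;
}

/* CPU time in milliseconds taken by write_read_passes. */
static double time_passes_ms(volatile uint64_t *base, size_t nwords,
                             int npasses, int flush)
{
    clock_t t0 = clock();
    (void)write_read_passes(base, nwords, npasses, flush);
    clock_t t1 = clock();
    return (double)(t1 - t0) * 1000.0 / CLOCKS_PER_SEC;
}
```

Calling `time_passes_ms` on a 1MB buffer with 8 passes, once with `flush` set and once without, would give a rough comparison of the two cases discussed above. The absolute numbers will depend on the machine and should not be expected to match the figures quoted in this comment.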

Julian Seward [:jseward]

Reporter

Comment 3

8 years ago

Attached patch memtest.c

Test program referred to in comment 2.

Nicholas Nethercote [inactive]

Updated

8 years ago

Blocks: 1289666

BMO Automation

Updated

3 years ago

Severity: normal → S3

Categories

(Core :: Memory Allocator, defect, P5)
