Create a badram checking library
Categories
(Toolkit :: Crash Reporting, task)
People
(Reporter: pbone, Assigned: btsoi, Mentored)
References
(Blocks 1 open bug)
Attachments
(1 file, 1 obsolete file)
To support checking for bad RAM, either in the crash reporter or as an idle-time task, we should create a library that helps us scan memory pages for different kinds of memory errors. The library should support scanning with various bit patterns and over various amounts of memory.
Bad RAM could be caused by a bit stuck at either 1 or 0. This can be detected by writing all zeros and verifying that they read back as all zeros, then writing all ones and verifying that they read back as all ones.
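The stuck-bit check described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `fill_and_verify` and `stuck_bit_check` are hypothetical names, not the API of any existing crate, and the sketch ignores CPU caches, so a read-back of a small buffer may be served from cache rather than from DRAM.

```rust
/// Hypothetical helper: fill `buf` with `pattern`, then read it back.
/// Returns true if every word still holds `pattern`.
pub fn fill_and_verify(buf: &mut [usize], pattern: usize) -> bool {
    for word in buf.iter_mut() {
        // Volatile write so the compiler cannot elide the store.
        unsafe { std::ptr::write_volatile(word as *mut usize, pattern) };
    }
    for word in buf.iter() {
        // Volatile read so the compiler cannot assume the value it just wrote.
        if unsafe { std::ptr::read_volatile(word as *const usize) } != pattern {
            return false;
        }
    }
    true
}

/// Hypothetical stuck-bit check: all zeros first, then all ones.
/// A bit stuck at 1 fails the first pass; a bit stuck at 0 fails the second.
pub fn stuck_bit_check(buf: &mut [usize]) -> bool {
    fill_and_verify(buf, 0) && fill_and_verify(buf, usize::MAX)
}
```

A caller would pass in whatever region it wants tested, for example `stuck_bit_check(&mut vec![0usize; 1 << 20])`.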
Or memory errors could be more complex and different bit patterns may cause neighbouring bits to flip: https://fosstodon.org/@gabrielesvelto/112407741329145666. See memtest86+ for ideas.
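For pattern-sensitive faults, one possible approach is to run the same write-then-verify pass over a sequence of patterns. The sketch below uses two patterns commonly seen in memory testers in general, a walking one and alternating-bit checkerboards. All names are hypothetical, and the choice of patterns is an assumption rather than something decided in this bug.

```rust
/// Walking-ones patterns: a single set bit moving through a 64-bit word.
pub fn walking_ones() -> impl Iterator<Item = u64> {
    (0..64).map(|i| 1u64 << i)
}

/// Checkerboard patterns: alternating bits, and their inverse.
pub fn checkerboards() -> [u64; 2] {
    [0x5555_5555_5555_5555, 0xAAAA_AAAA_AAAA_AAAA]
}

/// Hypothetical scan: for each pattern, fill the buffer and read it back.
/// Returns false as soon as any word fails to hold the pattern written to it.
pub fn scan_with_patterns<I>(buf: &mut [u64], patterns: I) -> bool
where
    I: IntoIterator<Item = u64>,
{
    for pattern in patterns {
        for word in buf.iter_mut() {
            // Volatile write so the store is not optimized away.
            unsafe { std::ptr::write_volatile(word as *mut u64, pattern) };
        }
        for word in buf.iter() {
            // Volatile read so the value is actually loaded again.
            if unsafe { std::ptr::read_volatile(word as *const u64) } != pattern {
                return false;
            }
        }
    }
    true
}
```

Patterns can be combined at the call site, for example `scan_with_patterns(&mut buf, walking_ones().chain(checkerboards()))`.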
I don't know if it'd be useful for the library to map new memory or work on existing memory.
It will be necessary for callers to specify how much memory to scan.
I think it would be useful for callers to specify how much "work" to do.
This can then be used by either the crash reporter or in-process idle memory scanning.
Comment 1•1 year ago
For the "how much work to do" I believe we should use a fixed amount of time that we consider acceptable to a user. This can be relatively small (seconds) because the most important issues are immediately apparent (stuck bits) while other ones might require very long scanning times (specific patterns may require minutes or even hours). Additionally, we'd have to take into account the machine's last-level cache size so that we know the minimum amount of memory we have to scan. Scanning based on time instead of size will shield us from issues like swapping.
Re-using the memory of an already existing process which has already crashed is tempting, as we know there's likely a problem already and we have the process under our control.
Comment 2•1 year ago (Reporter)
(In reply to Gabriele Svelto [:gsvelto] from comment #1)
> For the "how much work to do" I believe we should use a fixed amount of time that we consider acceptable to a user. This can be relatively small (seconds) because the most important issues are immediately apparent (stuck bits) while other ones might require very long scanning times (specific patterns may require minutes or even hours).
Agreed. The reason I think the caller should choose how much work to do is that in the future we may make this run during idle time, where setting a work/time budget could be important.
Oh, and once it finds any memory error it should probably stop checking and return. We don't care where the errors are, what kind they are, or how many there are. Just a pass/fail.
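Putting the time budget and the pass/fail result together, a caller-facing entry point might look something like this. It is a minimal sketch, not a proposed API: `ScanResult` and `scan_for` are hypothetical names, the pattern list is an assumption, and the deadline is only checked after each full pass, so a large buffer can overshoot the budget.

```rust
use std::time::{Duration, Instant};

/// Hypothetical result type: callers only learn pass or fail.
#[derive(Debug, PartialEq, Eq)]
pub enum ScanResult {
    Pass,
    Fail,
}

/// Hypothetical entry point: scan `buf` repeatedly until `budget` is spent,
/// returning `Fail` on the first mismatch without recording any details.
pub fn scan_for(buf: &mut [u64], budget: Duration) -> ScanResult {
    // Assumed pattern set: all zeros, all ones, and two checkerboards.
    const PATTERNS: [u64; 4] = [
        0,
        u64::MAX,
        0x5555_5555_5555_5555,
        0xAAAA_AAAA_AAAA_AAAA,
    ];
    let deadline = Instant::now() + budget;

    for &pattern in PATTERNS.iter().cycle() {
        for word in buf.iter_mut() {
            // Volatile write so the store is not optimized away.
            unsafe { std::ptr::write_volatile(word as *mut u64, pattern) };
        }
        for word in buf.iter() {
            if unsafe { std::ptr::read_volatile(word as *const u64) } != pattern {
                // Stop at the first error: where and how many do not matter.
                return ScanResult::Fail;
            }
        }
        // At least one full pass always runs before the budget is checked.
        if Instant::now() >= deadline {
            break;
        }
    }
    ScanResult::Pass
}
```

A crash reporter might call this with a budget of a few seconds, while an idle-time task could pass a much smaller budget and call it more often.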
Comment 4•9 months ago (Assignee)
The memtest crate is currently being developed for this bug. An initial prototype PR landed a few months ago, and there have been releases on crates.io.