Open Bug 1897102 Opened 1 year ago Updated 8 months ago

Create a badram checking library

Categories

(Toolkit :: Crash Reporting, task)

Desktop
All
task

People

(Reporter: pbone, Assigned: btsoi, Mentored)

References

(Blocks 1 open bug)

Attachments

(1 file, 1 obsolete file)

To support checking for badram either in the crash reporter or as an idle-time task, we should create a library that helps us scan memory pages for different kinds of memory errors. The library should support scanning with various bit patterns and over various amounts of memory.

Bad RAM could be caused by a bit stuck at either 1 or 0. This can be detected by writing all zeros and verifying that they read back as all zeros, then writing all ones and verifying that they read back as all ones.
Memory errors could also be more complex, where particular bit patterns cause neighbouring bits to flip: https://fosstodon.org/@gabrielesvelto/112407741329145666. See memtest86+ for ideas.
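
A minimal sketch of the stuck-bit check described above, written in Rust. The names `pattern_holds` and `no_stuck_bits` are hypothetical and are not taken from any existing crate. Volatile accesses only stop the compiler from eliding the stores and loads; they do not bypass CPU caches, so a real scanner would presumably need a region larger than the cache for the reads to reach RAM.

```rust
use std::ptr;

/// Writes `pattern` to every word of `region`, then reads each word back.
/// Returns `true` if every word still holds `pattern`.
/// (Hypothetical helper, for illustration only.)
fn pattern_holds(region: &mut [usize], pattern: usize) -> bool {
    for word in region.iter_mut() {
        // Volatile write: the compiler may not drop or merge this store.
        unsafe { ptr::write_volatile(word, pattern) };
    }
    for word in region.iter() {
        // Volatile read: the compiler may not assume the value it just wrote.
        if unsafe { ptr::read_volatile(word) } != pattern {
            return false;
        }
    }
    true
}

/// Stuck-bit check: a bit stuck at 1 fails the all-zeros pass,
/// a bit stuck at 0 fails the all-ones pass.
fn no_stuck_bits(region: &mut [usize]) -> bool {
    pattern_holds(region, 0) && pattern_holds(region, usize::MAX)
}
```

The same `pattern_holds` helper could be called with other patterns (for example alternating bits) to look for the neighbouring-bit effects mentioned above, although dedicated tools such as memtest86+ use more elaborate sequences than this sketch.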

I don't know whether it would be more useful for the library to map new memory or to work on existing memory.
Callers will need to specify how much memory to scan.
I think it would also be useful for callers to specify how much "work" to do.

This can then be used by either the crash reporter or in-process idle memory scanning.

Blocks: 1565033
No longer blocks: badram
Blocks: 1897104

For the "how much work to do" I believe we should use a fixed amount of time that we consider acceptable to a user. This can be relatively small (seconds) because the most important issues are immediately apparent (stuck bits) while other ones might require very long scanning times (specific patterns may require minutes or even hours). Additionally, we'd have to take into account the machine's last-level cache size so that we know the minimum amount of memory we have to scan. Scanning based on time instead of size will also shield us from issues like swapping.
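
One way the time-based budget could look, as a rough Rust sketch. `ScanReport`, `chunk_holds`, `scan_with_budget` and the `chunk_words` parameter are all hypothetical names introduced here for illustration. The sketch does not query the last-level cache size; it assumes the caller picks a chunk size large enough that reads are not simply served from cache.

```rust
use std::ptr;
use std::time::{Duration, Instant};

/// Hypothetical summary of a time-boxed scan.
#[derive(Debug, PartialEq, Eq)]
struct ScanReport {
    /// False as soon as one word reads back wrong.
    passed: bool,
    /// Number of (pattern, chunk) units fully written and verified.
    chunks_scanned: usize,
}

/// Writes `pattern` over `chunk`, then verifies it with volatile accesses.
fn chunk_holds(chunk: &mut [usize], pattern: usize) -> bool {
    for word in chunk.iter_mut() {
        unsafe { ptr::write_volatile(word, pattern) };
    }
    chunk
        .iter()
        .all(|word| unsafe { ptr::read_volatile(word) } == pattern)
}

/// Scans `region` chunk by chunk with each pattern until `budget` elapses.
/// The deadline is only checked between chunks, so the scan can overrun
/// the budget by at most the time needed for one chunk.
fn scan_with_budget(
    region: &mut [usize],
    chunk_words: usize, // assumption: chosen by the caller, ideally above cache size
    patterns: &[usize],
    budget: Duration,
) -> ScanReport {
    let deadline = Instant::now() + budget;
    let mut chunks_scanned = 0;
    for &pattern in patterns {
        // `max(1)` avoids the panic that `chunks_mut(0)` would cause.
        for chunk in region.chunks_mut(chunk_words.max(1)) {
            if Instant::now() >= deadline {
                // Out of time: nothing wrong was seen so far.
                return ScanReport { passed: true, chunks_scanned };
            }
            if !chunk_holds(chunk, pattern) {
                return ScanReport { passed: false, chunks_scanned };
            }
            chunks_scanned += 1;
        }
    }
    ScanReport { passed: true, chunks_scanned }
}
```

Because the loop is bounded by elapsed time rather than by bytes, a region that happens to be swapped out simply yields fewer scanned chunks instead of a longer stall, which is the property argued for above.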

Re-using the memory of an already existing process which has already crashed is tempting, as we know there's likely a problem already and we have the process under our control.

(In reply to Gabriele Svelto [:gsvelto] from comment #1)

> For the "how much work to do" I believe we should use a fixed amount of time that we consider acceptable to a user. This can be relatively small (seconds) because the most important issues are immediately apparent (stuck bits) while other ones might require very long scanning times (specific patterns may require minutes or even hours).

Agreed. The reason I think the caller should choose how much work to do is that in the future we may make this run during idle time, where setting a work/time budget could be important.

Oh, and once it finds any memory error it should probably stop checking and return. We don't care about where the errors are, what kind they are, or how many there are. Just a pass/fail.
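
The pass/fail, stop-at-first-error behaviour could be sketched as follows. Everything here is hypothetical: the `Verdict` enum, the `Memory` trait and the `StuckAtZero` simulator exist only so that a fault can be injected and the early return observed. This is not the API of any existing crate.

```rust
/// Hypothetical verdict type: callers only learn pass or fail.
#[derive(Debug, PartialEq, Eq)]
enum Verdict {
    Pass,
    Fail,
}

/// Minimal memory abstraction so that a fault can be simulated.
trait Memory {
    fn len(&self) -> usize;
    fn write(&mut self, index: usize, value: usize);
    fn read(&mut self, index: usize) -> usize;
}

/// Simulated RAM in which the bits in `bad_mask` of one word are stuck at 0.
struct StuckAtZero {
    words: Vec<usize>,
    bad_index: usize,
    bad_mask: usize,
    /// Counts reads so the early return can be observed.
    reads: usize,
}

impl Memory for StuckAtZero {
    fn len(&self) -> usize {
        self.words.len()
    }
    fn write(&mut self, index: usize, value: usize) {
        // The faulty word silently loses the stuck bits on every write.
        self.words[index] = if index == self.bad_index {
            value & !self.bad_mask
        } else {
            value
        };
    }
    fn read(&mut self, index: usize) -> usize {
        self.reads += 1;
        self.words[index]
    }
}

/// Writes each pattern over the whole memory, then verifies it.
/// Returns `Verdict::Fail` at the first mismatch without recording
/// where it was or looking for further errors.
fn check<M: Memory>(mem: &mut M, patterns: &[usize]) -> Verdict {
    for &pattern in patterns {
        for i in 0..mem.len() {
            mem.write(i, pattern);
        }
        for i in 0..mem.len() {
            if mem.read(i) != pattern {
                return Verdict::Fail;
            }
        }
    }
    Verdict::Pass
}
```

With a stuck-at-0 bit in word 2 of an 8-word memory, the all-zeros pass succeeds and the all-ones pass stops at the third read, so the caller gets `Verdict::Fail` without the remaining words being examined.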

Attached patch memtest crate (obsolete) — Splinter Review
Attachment #9463857 - Attachment is patch: true
Attachment #9463857 - Attachment is obsolete: true
Attached file GitHub Pull Request

The memtest crate is currently being developed for this bug. An initial prototype PR landed a few months ago and there have been releases on crates.io.

Assignee: nobody → btsoi
No longer blocks: 1565033
See Also: → 1565033