Open Bug 1897102 Opened 1 year ago Updated 8 months ago

Create a badram checking library

Tracking

()

Status:

NEW

People

(Reporter: pbone, Assigned: btsoi, Mentored)

References

(Blocks 1 open bug)

Details

Attachments

(1 file, 1 obsolete file)

memtest crate, 9 months ago, Brian Tsoi [:btsoi], 34 bytes, patch
GitHub Pull Request, 9 months ago, Brian Tsoi [:btsoi], 41 bytes, text/x-github-pull-request

Paul Bone [:pbone]

Reporter

Description

•

1 year ago

To support checking for badram in either the crash reporter or as an idle-time task we should create a library that helps us scan memory pages for different kinds of memory errors. The library should support scanning with various bit patterns and various amounts of memory.

Bad RAM could be caused by a bit stuck at either 1 or 0. This can be detected by writing all zeros and verifying that they read back as all zeros, then writing all ones and verifying that they read back as all ones.
Or memory errors could be more complex and different bit patterns may cause neighbouring bits to flip: https://fosstodon.org/@gabrielesvelto/112407741329145666. See memtest86+ for ideas.
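
A minimal sketch of the write-then-verify idea described above, assuming the caller hands the library a buffer of machine words. The names `pattern_holds` and `scan` and the choice of patterns are illustrative assumptions, not the API of any existing crate. All zeros and all ones target stuck bits, and the two checkerboards are one simple way to put opposite values in neighbouring bits.

```rust
use std::ptr;

/// Fill `region` with `pattern`, then read every word back.
/// Returns `true` when all words still hold the pattern.
/// (Hypothetical helper for this sketch.)
fn pattern_holds(region: &mut [usize], pattern: usize) -> bool {
    // Volatile accesses stop the compiler from eliding the writes or
    // replacing the reads with the value it just stored.
    for word in region.iter_mut() {
        unsafe { ptr::write_volatile(word, pattern) };
    }
    region
        .iter()
        .all(|word| unsafe { ptr::read_volatile(word) } == pattern)
}

/// Run an assumed pattern set over `region`; stops at the first mismatch.
fn scan(region: &mut [usize]) -> bool {
    // usize::MAX / 3 is the alternating bit pattern 0b0101...01.
    const CHECKERBOARD: usize = usize::MAX / 3;
    let patterns = [0, usize::MAX, CHECKERBOARD, !CHECKERBOARD];
    patterns.iter().all(|&p| pattern_holds(region, p))
}
```

Taking a slice leaves open whether the memory is freshly mapped or already in use: a caller could pass a newly allocated `Vec<usize>` or any other region it owns.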

I don't know if it'd be useful for the library to map new memory or work on existing memory.
It will be necessary for callers to specify how much memory to scan.
I think it would be useful for callers to specify how much "work" to do.

This can then be used by either the crash reporter or in-process idle memory scanning.

Paul Bone [:pbone]

Reporter

Updated

•

1 year ago

Blocks: 1565033

Paul Bone [:pbone]

Reporter

Updated

•

1 year ago

No longer blocks: badram

Paul Bone [:pbone]

Reporter

Updated

•

1 year ago

Blocks: 1897104

Gabriele Svelto [:gsvelto]

Comment 1

•

1 year ago

For the "how much work to do" I believe we should use a fixed amount of time that we consider acceptable to a user. This can be relatively small (seconds) because the most important issues are immediately apparent (stuck bits) while other ones might require very long scanning times (specific patterns may require minutes or even hours). Additionally we'd have to take into account the machine's last-level cache size so that we know the minimum amount of memory we have to scan. Scanning based on time instead of size will shield us from issues like swapping.
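
One way a time budget could look, as a rough sketch only. The function name `scan_within`, the chunk size, and the two patterns are assumptions made for illustration; the sketch also ignores the cache-size concern mentioned above, so a real implementation would need a region larger than the last-level cache for the reads to reach RAM.

```rust
use std::ptr;
use std::time::{Duration, Instant};

/// Write and verify patterns over `region` until `budget` has elapsed.
/// Returns `false` on the first mismatch, `true` if the budget ran out
/// without finding one. (Hypothetical function for this sketch.)
fn scan_within(region: &mut [usize], budget: Duration) -> bool {
    // Assumed granularity: the clock is checked once per chunk.
    const CHUNK_WORDS: usize = 4096;
    const PATTERNS: [usize; 2] = [0, usize::MAX];

    if region.is_empty() {
        return true; // nothing to scan
    }
    let deadline = Instant::now() + budget;

    // Keep cycling through the patterns until the deadline passes.
    for &pattern in PATTERNS.iter().cycle() {
        for chunk in region.chunks_mut(CHUNK_WORDS) {
            for word in chunk.iter_mut() {
                unsafe { ptr::write_volatile(word, pattern) };
            }
            if chunk
                .iter()
                .any(|word| unsafe { ptr::read_volatile(word) } != pattern)
            {
                return false;
            }
            if Instant::now() >= deadline {
                return true;
            }
        }
    }
    true
}
```

Because the loop is bounded by the deadline rather than by the buffer size, a slow region (for example one that has been swapped out) shortens how much gets scanned instead of lengthening the run.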

Re-using the memory of an already existing process which has already crashed is tempting, as we know there's likely a problem already and we have the process under our control.

Paul Bone [:pbone]

Reporter

Comment 2

•

1 year ago

(In reply to Gabriele Svelto [:gsvelto] from comment #1)

For the "how much work to do" I believe we should use a fixed amount of time that we consider acceptable to a user. This can be relatively small (seconds) because the most important issues are immediately apparent (stuck bits) while other ones might require very long scanning times (specific patterns may require minutes or even hours).

Agreed. The reason I think the caller should choose how much work to do is that in the future we may make this run during idle time, where setting a work/time budget could be important.

Oh, and once it finds any memory error it should probably stop checking and return. We don't care about where, what kind, or how many errors there are. Just a pass/fail.

Brian Tsoi [:btsoi]

Assignee

Comment 3

•

9 months ago

Attached patch memtest crate (obsolete)

Brian Tsoi [:btsoi]

Assignee

Updated

•

9 months ago

Attachment #9463857 - Attachment is patch: true

Brian Tsoi [:btsoi]

Assignee

Updated

•

9 months ago

Attachment #9463857 - Attachment is obsolete: true

Brian Tsoi [:btsoi]

Assignee

Comment 4

•

9 months ago

Attached file GitHub Pull Request

The memtest crate is currently being developed for this bug. An initial prototype PR landed a few months ago and there have been releases on crates.io.

Assignee: nobody → btsoi

Mathew Hodson

Updated

•

8 months ago

No longer blocks: 1565033

Categories

(Toolkit :: Crash Reporting, task)

Security

(public)
