Bug 1881546 Comment 1 Edit History

Original comment by

on 2024-02-22 13:24:21 PST

I've done a bit of exploration here.
* Out of a sampling of 1000 Win11 crashes, 5 had `is_likely_guard_page` set. 1000 isn't a large sample size, and there's event bias based on whether our code in recent releases actually has any buffer overflow or other guard page access.
* Of those 5, 4 also reported potential bit flips.
* I chose one to look at more closely (`d8c28511-3ef2-4509-b37c-96b270240222`). The [related signature](https://crash-stats.mozilla.org/signature/?product=Firefox&signature=LdrpSnapModule) has many crashes reporting `is_likely_guard_page` (62/85). One even has an `EXCEPTION_GUARD_PAGE` reason. Those that don't have it are either 32-bit (which doesn't support further analysis) or are crashing on a different instruction (otherwise they are all crashing on the same instruction). Many also show potential bit flips for that address. All of the `is_likely_guard_page` crashes are crashing on the first byte of a page.

The likelihood of a true guard page access (as a result of buffer overflow) also being reported with potential bit flips is hard to determine, since it comes down to the memory mapping of the process. However, guard pages are often in heap memory, which one can argue would have more memory mapped in the region (so the bit flip detection may trigger on it and find matching bit-flipped mapped memory more frequently than elsewhere).

On the other hand, assuming a uniform probability distribution of the bit being flipped, a true bit flip causing a guard page access (to a guard page following a group of mapped pages) is not terribly likely, especially given that we only consider mapped memory with a fairly narrow size range as potential guard pages. That is to say, if the cause of a bug is hardware failure, you'd expect to see plenty of crashes with bit flips which _don't_ report `is_likely_guard_page`. You'd also not expect to see the crashing address as the first byte in a page (as is the case for the signature I linked previously).

Given this information, I think we could use a heuristic along the lines of "50% of crashes have `is_likely_guard_page` set for some memory access". After that, a developer can get higher (or lower) confidence by further inspection. For example, all instructions being the same and all addresses being the first byte of a page are red flags. I would have suggested a higher threshold, but there seems to be a decent bit of noise in the crash reports themselves, at least in the example I inspected. This may be due to bad hardware (sadly that probably introduces noise across all of our signatures, proportional to CPU time in the relevant code) or perhaps another bug in the same function (which is less likely but still possible). If we want to account for the potential of 2 bugs in the same signature (assuming equal incidence, which is a big assumption) plus the noise of bad hardware, we might consider lowering that threshold a bit. But the point of this indicator is to give at-a-glance hints, so I don't think it's necessary to do that.

Revision 1 by

Alex Franchuk [:afranchuk]

on 2024-02-22 13:26:35 PST

I've done a bit of exploration here.
* Out of a sampling of 1000 Win11 crashes, 5 had `is_likely_guard_page` set. 1000 isn't a large sample size, and there's event bias based on whether our code in recent releases actually has any buffer overflow or other guard page access.
* Of those 5, 4 also reported potential bit flips.
* I chose one to look at more closely (`d8c28511-3ef2-4509-b37c-96b270240222`). The [related signature](https://crash-stats.mozilla.org/signature/?product=Firefox&signature=LdrpSnapModule) has many crashes reporting `is_likely_guard_page` (62/85). One even has an `EXCEPTION_GUARD_PAGE` reason. Those that don't have it are either 32-bit (which doesn't support further analysis) or are crashing on a different instruction (otherwise they are all crashing on the same instruction). Many also show potential bit flips for that address. All of the `is_likely_guard_page` crashes are crashing on the first byte of a page.

The likelihood of a true guard page access (as a result of buffer overflow) also being reported with potential bit flips is hard to determine, since it comes down to the memory mapping of the process. However, guard pages are often in heap memory, which one can argue would have more memory mapped in the region (so the bit flip detection may trigger on it and find matching bit-flipped mapped memory more frequently than elsewhere).

On the other hand, assuming a uniform probability distribution of the bit being flipped, a true bit flip causing a guard page access (to a guard page following a group of mapped pages) is not terribly likely, especially given that we only consider mapped memory with a fairly narrow size range as potential guard pages. That is to say, if the cause of a bug is hardware failure, you'd expect to see plenty of crashes with bit flips which _don't_ report `is_likely_guard_page`. You'd also not expect to see the crashing address as the first byte (or for that matter, a similar offset across all crashes) in a page, as is the case for the signature I linked previously.

Given this information, I think we could use a heuristic along the lines of "50% of crashes have `is_likely_guard_page` set for some memory access". After that, a developer can get higher (or lower) confidence by further inspection. For example, all instructions being the same and all addresses being the first byte of a page are red flags. I would have suggested a higher threshold, but there seems to be a decent bit of noise in the crash reports themselves, at least in the example I inspected. This may be due to bad hardware (sadly that probably introduces noise across all of our signatures, proportional to CPU time in the relevant code) or perhaps another bug in the same function (which is less likely but still possible). If we want to account for the potential of 2 bugs in the same signature (assuming equal incidence, which is a big assumption) plus the noise of bad hardware, we might consider lowering that threshold a bit. But the point of this indicator is to give at-a-glance hints, so I don't think it's necessary to do that.

Revision 2 by

Alex Franchuk [:afranchuk]

on 2024-02-22 13:27:00 PST

I've done a bit of exploration here.
* Out of a sampling of 1000 Win11 crashes, 5 had `is_likely_guard_page` set. 1000 isn't a large sample size, and there's event bias based on whether our code in recent releases actually has any buffer overflow or other guard page access.
* Of those 5, 4 also reported potential bit flips.
* I chose one to look at more closely (`d8c28511-3ef2-4509-b37c-96b270240222`). The [related signature](https://crash-stats.mozilla.org/signature/?product=Firefox&signature=LdrpSnapModule) has many crashes reporting `is_likely_guard_page` (62/85). One even has an `EXCEPTION_GUARD_PAGE` reason. Those that don't have it are either 32-bit (which doesn't support further analysis) or are crashing on a different instruction (otherwise they are all crashing on the same instruction). Many also show potential bit flips for that address. All of the `is_likely_guard_page` crashes are crashing on the first byte of a page.

The likelihood of a true guard page access (as a result of buffer overflow) also being reported with potential bit flips is hard to determine, since it comes down to the memory mapping of the process. However, guard pages are often in heap memory, which one can argue would have more memory mapped in the region (so the bit flip detection may trigger on it and find matching bit-flipped mapped memory more frequently than elsewhere).

On the other hand, assuming a uniform probability distribution of the bit being flipped, a true bit flip causing a guard page access (to a guard page following a group of mapped pages) is not terribly likely, especially given that we only consider mapped memory with a fairly narrow size range as potential guard pages. That is to say, if the cause of a bug is hardware failure, you'd expect to see plenty of crashes with bit flips which _don't_ report `is_likely_guard_page`. You'd also not expect to see the crashing address as the first byte (or for that matter, a similar offset across all crashes) in a page, as is the case for the signature I linked previously.

Given this information, I think we could use a heuristic along the lines of **50% of crashes have `is_likely_guard_page` set for some memory access**. After that, a developer can get higher (or lower) confidence by further inspection. For example, all instructions being the same and all addresses being the first byte of a page are red flags. I would have suggested a higher threshold, but there seems to be a decent bit of noise in the crash reports themselves, at least in the example I inspected. This may be due to bad hardware (sadly that probably introduces noise across all of our signatures, proportional to CPU time in the relevant code) or perhaps another bug in the same function (which is less likely but still possible). If we want to account for the potential of 2 bugs in the same signature (assuming equal incidence, which is a big assumption) plus the noise of bad hardware, we might consider lowering that threshold a bit. But the point of this indicator is to give at-a-glance hints, so I don't think it's necessary to do that.

Revision 3 by

Alex Franchuk [:afranchuk]

on 2024-02-22 13:27:26 PST

I've done a bit of exploration here.
* Out of a sampling of 1000 Win11 crashes, 5 had `is_likely_guard_page` set. 1000 isn't a large sample size, and there's event bias based on whether our code in recent releases actually has any buffer overflow or other guard page access.
* Of those 5, 4 also reported potential bit flips.
* I chose one to look at more closely (`d8c28511-3ef2-4509-b37c-96b270240222`). The [related signature](https://crash-stats.mozilla.org/signature/?product=Firefox&signature=LdrpSnapModule) has many crashes reporting `is_likely_guard_page` (62/85). One even has an `EXCEPTION_GUARD_PAGE` reason. Those that don't have it are either 32-bit (which doesn't support further analysis) or are crashing on a different instruction (otherwise they are all crashing on the same instruction). Many also show potential bit flips for that address. All of the `is_likely_guard_page` crashes are crashing on the first byte of a page.

The likelihood of a true guard page access (as a result of buffer overflow) also being reported with potential bit flips is hard to determine, since it comes down to the memory mapping of the process. However, guard pages are often in heap memory, which one can argue would have more memory mapped in the region (so the bit flip detection may trigger on it and find matching bit-flipped mapped memory more frequently than elsewhere).

On the other hand, assuming a uniform probability distribution of the bit being flipped, a true bit flip causing a guard page access (to a guard page following a group of mapped pages) is not terribly likely, especially given that we only consider mapped memory with a fairly narrow size range as potential guard pages. That is to say, if the cause of a bug is hardware failure, you'd expect to see plenty of crashes with bit flips which _don't_ report `is_likely_guard_page`. You'd also not expect to see the crashing address as the first byte (or for that matter, a similar offset across all crashes) in a page, as is the case for the signature I linked previously.
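To make the bit-flip argument concrete, here is a rough sketch of how one might enumerate the single-bit-flip neighbours of an address and see which of them land in a candidate region. It assumes 4 KiB pages and 64-bit addresses, and it assumes the detection amounts to flipping one bit at a time and checking for a mapped region; that is only a simplified reading, not a description of the actual analyzer. The helper names (`page_offset`, `single_bit_flips`, `flips_landing_in`) are hypothetical.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def page_offset(address: int) -> int:
    """Offset of an address within its page (0 means first byte of a page)."""
    return address & (PAGE_SIZE - 1)


def single_bit_flips(address: int, bits: int = 64) -> list[int]:
    """All addresses that differ from `address` in exactly one bit."""
    return [address ^ (1 << b) for b in range(bits)]


def flips_landing_in(address: int, regions: list[tuple[int, int]]) -> list[int]:
    """Single-bit-flip neighbours that fall inside any [start, end) region.

    `regions` stands in for whatever memory map is available; here it is
    just a hypothetical list of half-open address ranges.
    """
    return [
        candidate
        for candidate in single_bit_flips(address)
        if any(start <= candidate < end for start, end in regions)
    ]
```

Only a handful of the 64 possible flips can land in any one small region, and flips in bits 12 and above leave the page offset unchanged, which is why a consistent offset of 0 across many crashes looks more like a real overflow than like random hardware noise.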

Given this information, I think we could use a heuristic along the lines of **50% of crashes have `is_likely_guard_page` set for some memory access**. After that, a developer can get higher (or lower) confidence by further inspection. For example, all instructions being the same and all addresses being the same offset into a page are red flags. I would have suggested a higher threshold, but there seems to be a decent bit of noise in the crash reports themselves, at least in the example I inspected. This may be due to bad hardware (sadly that probably introduces noise across all of our signatures, proportional to CPU time in the relevant code) or perhaps another bug in the same function (which is less likely but still possible). If we want to account for the potential of 2 bugs in the same signature (assuming equal incidence, which is a big assumption) plus the noise of bad hardware, we might consider lowering that threshold a bit. But the point of this indicator is to give at-a-glance hints, so I don't think it's necessary to do that.
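A minimal sketch of what that heuristic could look like, assuming each crash report exposes the `is_likely_guard_page` flag along with a crashing address and an identifier for the crashing instruction. The `CrashReport` type, its `crash_address` and `instruction` fields, and the function names are hypothetical and do not correspond to any existing crash-stats API; the 4 KiB page size is also an assumption.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumption: 4 KiB pages


@dataclass
class CrashReport:
    # Hypothetical shape of a report; only `is_likely_guard_page` is a real field name.
    is_likely_guard_page: bool
    crash_address: int
    instruction: str


def guard_page_hint(reports: list[CrashReport], threshold: float = 0.5) -> bool:
    """True when at least `threshold` of the reports have the guard page flag set."""
    if not reports:
        return False
    flagged = sum(r.is_likely_guard_page for r in reports)
    return flagged / len(reports) >= threshold


def red_flags(reports: list[CrashReport]) -> dict[str, bool]:
    """Follow-up checks over the flagged reports only."""
    flagged = [r for r in reports if r.is_likely_guard_page]
    return {
        # Every flagged crash is on the same instruction.
        "same_instruction": len({r.instruction for r in flagged}) == 1,
        # Every flagged crash hits the same offset into a page.
        "same_page_offset": len({r.crash_address % PAGE_SIZE for r in flagged}) == 1,
    }
```

With the numbers above, a signature at 62/85 flagged crashes would pass the 50% threshold, while the overall sample at 5/1000 would not. The `threshold` parameter is left adjustable in case a lower value turns out to be more useful.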