Harden XDR against bugs and corrupt

RESOLVED FIXED in Firefox 58



2 years ago
2 years ago


(Reporter: tcampbell, Assigned: tcampbell)


(Blocks 2 bugs)

Dependency tree / graph

Firefox Tracking Flags

(firefox58 fixed, firefox59 fixed)



(2 attachments, 1 obsolete attachment)

In order to reduce crash rates related to XDR, we should avoid encoding flags and enums using common buggy buffer values like 0x00 and 0xFF. Enums should not have 0 as a legal value, and decoding switch statements should detect and fail the XDR instead of MOZ_CRASHing. There is a pattern in some places of using a MOZ_ASSERT (for debug) followed by a return fail.

This won't solve underlying problems, but should hopefully cut down crash rates experience by end users.
Assignee: nobody → tcampbell
Priority: -- → P1
I'm on the fence about this. I think that ideally, asserting is the right approach, since XDR data is meant to be reliable, and non-fatal errors are easier to overlook. Ideally, I'd like to just have checksums so we can detect cache corruption.

But our crash rates are way too high now and we don't know why, so trying to handle errors non-fatally is probably our best option. But we should at least try to collect metrics about where/how often corruption happens.
I'd propose for now we build some basic error checks into XDR, and then can hard crash at the top level later if we want. There are also a number of cases right now where we don't crash on garbage and keep going.

One thing I'm adding right now is a codeMarker function to insert constants and verify. The current XDR structure has also no ability to resynchronize, so I'd like to do things like codeMarker at start and end of JSScript::objects list (which is itself variable length). Right now a single bad bit (let alone the logic errors we have) will create really dangerous decoded structures and we don't detect them.
Blocks: 1384544
This patch handles default switch values in XDR decoding and makes them assert in debug, and return XDR failures in release. It also XORs a magic value with enum values when transcoding so buffers with bad data are less likely to go unnoticed.
Attachment #8932726 - Flags: review?(nicolas.b.pierron)
This patch adds marker values at key places in XDR transcoding where bad data has lead to crash signatures.
Attachment #8932727 - Flags: review?(nicolas.b.pierron)
These patches try to address crash rates by detecting possible XDR data corruption to avoid building JS structures that lead to complex runtime signatures. We make these detected faults fail the XDR decode, but allow the start-up cache to continue loading the non-cached version.

I intend to uplift them to Beta58 and see if they mitigate any of the signatures that are suspected to be related to XDR.

As a follow-up, we should add telemetry about decode failures. I am still investigating the source of these problems, but hopefully we can reduce the noise these related crashes are adding to crash stats.
Comment on attachment 8932726 [details] [diff] [review]

Review of attachment 8932726 [details] [diff] [review]:

::: js/src/vm/Xdr.h
@@ +279,5 @@
>      template <typename T>
>      bool codeEnum32(T* val, typename mozilla::EnableIf<mozilla::IsEnum<T>::value, T>::Type * = NULL)
>      {
> +        // Mix with magic value to better detect corruption
> +        const uint32_t MAGIC = 0x21AB218C;

nit: improve this comment with something similar to:

Mix the enumerated value with a random magic number, such that a corruption with a low-ranged value is less likely to cause a miss-interpretation of the XDR content, and instead cause a failure.

Note: When updating this value, make sure that you also update the build-id (XDR version) or properly clear all the caches.
Attachment #8932726 - Flags: review?(nicolas.b.pierron) → review+
Comment on attachment 8932727 [details] [diff] [review]

Review of attachment 8932727 [details] [diff] [review]:

::: js/src/jsfun.cpp
@@ +686,5 @@
>          objp.set(fun);
>      }
> +    // Verify marker at end of function to detect buffer trunction.
> +    if (!xdr->codeMarker(0x9E35CA1F))

An end marker is likely to be followed by a start marker, is this one really necessary?
Attachment #8932727 - Flags: review?(nicolas.b.pierron) → review+
Fix warnings in codeEnum32 around tmp being uninitialized. Carrying r+ forward.
Attachment #8932726 - Attachment is obsolete: true
Attachment #8933027 - Flags: review+
Pushed by tcampbell@mozilla.com:
Harden XDR data decoding. r=nbp
Use marker values in XDR data to detect corruption. r=nbp
Comment on attachment 8933027 [details] [diff] [review]

Review of attachment 8933027 [details] [diff] [review]:

::: js/src/jsscript.cpp
@@ +195,5 @@
>              vp.setMagic(JS_ELEMENTS_HOLE);
>          break;
> +      default:
> +        // Fail in debug, but only soft-fail in release
> +        MOZ_ASSERT(false, "Bad XDR value kind");

Can we make these diagnostic asserts? Our XDR crash rates in nightlies are pretty low, and it would be nice to at least have some indication that this is happening in the wild.
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla59
Comment on attachment 8933027 [details] [diff] [review]

Approval Request Comment
[Feature/Bug causing the regression]:
[User impact if declined]:
This patch attempts to mitigate crashes seen in Bug 1384544 and others. Getting this patch to the larger beta population will help determine if XDR is primary cause of Bug 1384544 or if there are other issues we need to investigate.
[Is this code covered by automated tests?]:
The feature this patch affects is covered by automated tests that should detect regressions.
[Has the fix been verified in Nightly?]:
Patch is in current nightly, but limited population makes it difficult to draw conclusions.
[Needs manual test from QE? If yes, steps to reproduce]:
[List of other uplifts needed for the feature/fix]:
Both patches (0001 / 0002) on this bug should be uplifted. These patches rebase cleanly on Beta.
[Is the change risky?]:
[Why is the change risky/not risky?]:
This patch adds additional checks that to stop XDR decode earlier, resulting in startup caches reporting no data and falling back to normal script load. This fallback mechanism is pre-existing in the code. This patch adds more criteria to detect when to fail loading.
[String changes made/needed]:
Attachment #8933027 - Flags: approval-mozilla-beta?
NI myself for follow-up work:
- Use DIAGNOSTIC asserts, per Comment 11
- Add telemetry for start-up cache hit rate
Flags: needinfo?(tcampbell)
Blocks: 1407651
Comment on attachment 8933027 [details] [diff] [review]

This patch helps investigate bug 1384544. Take this to mitigate bug 1384544. Beta58+.
Attachment #8933027 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Blocks: 1367896
Blocks: 1373934
Blocks: 1415748
Blocks: 1409179
Blocks: 1423616
Blocks: 1421786
Flags: needinfo?(tcampbell)
You need to log in before you can comment on or make changes to this bug.