Closed Bug 1376819 Opened 7 years ago Closed 1 month ago

Land JIT Hardening Constant Blinding

Categories

(Core :: JavaScript Engine: JIT, enhancement, P3)

Tracking

RESOLVED WONTFIX

People

(Reporter: tjr, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: parity-edge, sec-want)

Attachments

(3 files)

Based on the discussions in Bug 1333005, it would be good to update the patch in Bug 677272 to add constant blinding.

One proposed idea was to land this behind a non-default flag (such as --enable-hardening or --enable-even-more-hardening) if the performance cost is too great to land it by default.

Our goal for blinding is to prevent an attacker (who we assume has an arbitrary read/write primitive but has not achieved code execution yet) from writing out instructions into an RX page that they can use to get arbitrary code execution. That covers two different attack approaches:
- Writing full code out in the JIT and directing execution to it
- Writing out ROP gadgets, and building a chain off of that

Because we would also like to prevent ROP gadgets from being written out, we can't do selective blinding. (If we only blinded 10% or even 50% of constants, the attacker would just spew 10 or 20 constants for each ROP gadget, find the ones that came out unblinded, and ROP off of those.) So we really do need to blind everything, or at least everything that's 4 bytes or larger. Whether we also need to blind 2-byte constants is TBD.
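To make the mechanism concrete, here is a minimal standalone sketch of what blinding a single 32-bit constant means (the values and names are illustrative, not the actual patch):

// Minimal sketch of the idea (not the actual patch): instead of writing an
// attacker-chosen 32-bit constant verbatim into executable memory, the JIT
// emits it XORed with a random "blind" and XORs the blind back in at runtime,
// so the chosen byte pattern never appears literally in an RX page.
#include <cassert>
#include <cstdint>

int main() {
    uint32_t attackerImm = 0x0badc0deu;  // bytes the attacker hoped to place in RX memory
    uint32_t blind = 0x5eadbeefu;        // fresh random value per emitted constant
    uint32_t inCodeStream = attackerImm ^ blind;  // immediate of:  mov reg, blinded
    uint32_t atRuntime = inCodeStream ^ blind;    // effect of:     xor reg, blind
    assert(atRuntime == attackerImm);    // program semantics are unchanged
    assert(inCodeStream != attackerImm); // but the emitted bytes are unpredictable
    return 0;
}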
Attachment #8881808 - Flags: review?(nicolas.b.pierron)
:bbouvier do you think you could run https://treeherder.mozilla.org/#/jobs?repo=try&revision=d5074b0fa3aaf37227bf42c1f21d3ef754c9eb2c through arewefastyet?  (Is this the best way to ask?)
Flags: needinfo?(bbouvier)
(In reply to Tom Ritter [:tjr] from comment #11)
> :bbouvier do you think you could run
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=d5074b0fa3aaf37227bf42c1f21d3ef754c9eb2c through
> arewefastyet?  (Is this the best way to ask?)

Sure, actually you can even do it yourself by requesting runs:

- go to https://arewefastyet.com/schedule.php
- log in with your mozilla account
- select the benchmark you'd like to run, with revision = d5074b0fa3aaf37227bf42c1f21d3ef754c9eb2c from your mozilla-try build
- check 'email results'

It's a bit tedious because you have to do it once for every (benchmark, platform) pair, but it should work.
Flags: needinfo?(bbouvier)
Okay, we have performance numbers. I ran AWFY on the patch (which omits a RNG) and the base: https://docs.google.com/spreadsheets/d/1JdEALc0_7o34iIMeyJobbEPP6_fWiUPWufz5y908-ok/edit?usp=sharing

The numbers are very close in most cases. A few results don't seem to make much sense, though: many of the large variances are in the speedometer-misc tests, 'dom-traverse', and 'dromaeo-object-array'. I'm not sure whether I need to run AWFY multiple times and average the results...
I've updated the spreadsheet. It now has three runs of each test and also highlights when the runs of non-blinding and blinding are very different (implying that you can't trust the percent change that much).

My conclusion is that constant blinding - at least the toy example - doesn't seem that bad. On many tests it barely impacts anything. On some (speedometer-misc) the results are so variable we'd need to run them a dozen times to even them out. It does hurt in some places with high reliability, but it also helps in others with high reliability - although I'm not sure why (CPU pipelining?)

I'm going to continue working on this to add a real RNG into the patch; but I'd like to get feedback about whether we could land this by default.
I think it would be useful to land this with an #ifdef so that, as the code develops over time (I'd be surprised if this didn't require fixing many places), it's easy to toggle on/off to measure the aggregate effect.  Also, if the overall project (including preventing ROP in C++) stalls out, it would make it easier to turn back off.
(In reply to Luke Wagner [:luke] from comment #16)
> I think it would be useful to land this with an #ifdef so that, as the code
> develops over time (I'd be surprised if this didn't require fixing many
> places), it's easy to toggle on/off to measure the aggregate effect.  Also,
> if the overall project (including preventing ROP in C++) stalls out, it
> would make it easier to turn back off.

Is there a reason it couldn't be a pref?
Yeah, that'd work too and have the advantage of building both paths.  Just a bit more trouble :)
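To make the tradeoff concrete, here is a rough sketch of the two gating styles being discussed. The macro name, options struct, and emit helpers are all invented for illustration and are not existing SpiderMonkey names:

// Rough sketch of the gating styles discussed above; everything here is illustrative.
#include <cstdint>
#include <cstdio>

struct FakeJitOptions {
    bool constantBlinding = false;  // runtime pref variant: both paths compiled, toggled live
};

static void emitMov32(uint32_t imm) { printf("mov  reg, %#x\n", imm); }
static void emitXor32(uint32_t imm) { printf("xor  reg, %#x\n", imm); }

static void move32(uint32_t imm, const FakeJitOptions& opts, uint32_t blind) {
    bool blindIt;
#ifdef JS_JIT_CONSTANT_BLINDING
    blindIt = true;                   // #ifdef variant: decided at build time
    (void)opts;
#else
    blindIt = opts.constantBlinding;  // pref variant: decided when code is emitted
#endif
    if (blindIt) {
        emitMov32(imm ^ blind);       // blinded immediate in the code stream
        emitXor32(blind);             // reconstructs the real value at runtime
    } else {
        emitMov32(imm);               // unblinded fast path
    }
}

int main() {
    FakeJitOptions opts;
    opts.constantBlinding = true;
    move32(0x0badc0deu, opts, 0x5eadbeefu);
    return 0;
}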
Comment on attachment 8881808 [details]
Bug 1376819 Perform constant blinding in the JIT on move32 and move64 immediates

https://reviewboard.mozilla.org/r/152882/#review160814

::: js/src/jit/x64/MacroAssembler-x64-inl.h:34
(Diff revision 12)
> +    Imm32 blinded_bot((imm.value ^ blind_bot));
> +    movq(blinded_top, dest.reg);
> +    xorq(Imm32((blind_top >> 32)), dest.reg);
> +    lshift64(Imm32(32), dest);
> +    addq(blinded_bot, dest.reg);
> +    xorq(Imm32(blind_bot), dest.reg);

This generates an off-by one error in the lowest bit of the top word. I can't quite figure out where the bug is....

https://treeherder.mozilla.org/logviewer.html#?job_id=112652014&repo=try&lineNumber=29152
Attachment #8881808 - Flags: review-
(In reply to Tom Ritter [:tjr] from comment #20)
> Comment on attachment 8881808 [details]
> Bug 1376819 Add some toy constant blinding to test performance
> 
> https://reviewboard.mozilla.org/r/152882/#review160814
> 
> ::: js/src/jit/x64/MacroAssembler-x64-inl.h:34
> (Diff revision 12)
> > +    Imm32 blinded_bot((imm.value ^ blind_bot));
> > +    movq(blinded_top, dest.reg);
> > +    xorq(Imm32((blind_top >> 32)), dest.reg);
> > +    lshift64(Imm32(32), dest);
> > +    addq(blinded_bot, dest.reg);
> > +    xorq(Imm32(blind_bot), dest.reg);
> 
> This generates an off-by one error in the lowest bit of the top word. I
> can't quite figure out where the bug is....
> 
> https://treeherder.mozilla.org/logviewer.
> html#?job_id=112652014&repo=try&lineNumber=29152

Can you reproduce it locally? It seems that the issue is about encoding the 0x123456789abcdef0 int64 constant.
I do not see where the error is, but I suggest using movl & xorl with Imm32, instead of the first movq & xorq with ImmWord.
Comment on attachment 8881808 [details]
Bug 1376819 Perform constant blinding in the JIT on move32 and move64 immediates

https://reviewboard.mozilla.org/r/152882/#review182302

That looks good. (f+)
Add #if guards or a JitOptions flag so constant blinding is enabled only when requested by the user.

::: js/src/jit/x64/MacroAssembler-x64-inl.h:34
(Diff revision 13)
> +    Imm32 blinded_bot((imm.value ^ blind_bot));
> +    movl(blinded_top, dest.reg);
> +    xorl(Imm32((blind_top >> 32)), dest.reg);
> +    lshift64(Imm32(32), dest);
> +    addq(blinded_bot, dest.reg);
> +    xorq(Imm32(blind_bot), dest.reg);

nit: blind_bot is already an Imm32.
Attachment #8881808 - Flags: review?(nicolas.b.pierron) → review-
The addq will clobber the carry flag while the movq it replaced did not. Perhaps wasm relied on this, and that is why you have an off-by-one.
Comment on attachment 8881808 [details]
Bug 1376819 Perform constant blinding in the JIT on move32 and move64 immediates

https://reviewboard.mozilla.org/r/152882/#review160814

Does the addq clobber flags that the original code would not? Perhaps the wasm engine expected move64 not to clobber them. That would explain an off-by-one error.
Changing the addq to an xorq changed the off-by-one error: the old error was got 305419895, expected 305419896; the new error is got -305419897, expected 305419896.

Left shift also affects the carry flag. Also, xorq clears the carry flag. movl (the instruction previously present) affects no flags.

So it seems that there is no way (?) to encode the logic I want in something that does not affect any flags or scratch registers.
BSWAP, NOT, and XCHG do not change the flags.  They could be applied randomly (alone or in some combination), and they can be applied repeatedly, so the number of combinations might be interesting.  I know this is not very strong protection but it does reduce the predictability of the instruction stream.
Another possibility is perhaps to combine RCL and RCR.  These only change the carry flag.  If one loads a constant, then RCRs the low half and RCLs the whole thing, one can end up with a different constant but the carry flag unchanged.  For example, if the carry is x and the initial value in R is abcd efgh ijkl mnop, then:

             x abcd efgh ijkl mnop
 RCR R_lo, 5
             k abcd efgh lmno pxij
 RCL R, 14
             x ijka bcde fghl mnop

On x86 one would use eax/ax etc.; on x64, rax/eax etc. as the high/low parts.  It's tricky for edi/esi/ebp on x86; that adds some churn there if we can't load unblinded constants into those, so maybe fall back to some BSWAP trick then.

Combine rotates with random rotate widths and it'll be hard to predict what the code looks like, though for sure the initial constant will have all the bits of the eventual result, just scrambled.  (No promises about performance.)
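For what it's worth, the scramble/unscramble identity behind the rotate idea can be checked in plain C++. This models only the value transformation, not the RCR/RCL carry-flag behavior, and the names and rotation width are illustrative:

// Illustrative check: emit a rotated form of the constant and an inverse
// rotate to recover it, so the literal bytes in the code stream differ from
// the wanted value (though, as noted above, all the original bits are still
// present, just scrambled).
#include <cassert>
#include <cstdint>

static uint64_t rotl64(uint64_t v, unsigned k) { return (v << k) | (v >> (64 - k)); }
static uint64_t rotr64(uint64_t v, unsigned k) { return (v >> k) | (v << (64 - k)); }

int main() {
    uint64_t wanted = 0x123456789abcdef0ull;
    unsigned k = 13;                          // would be chosen randomly per constant (1..63)
    uint64_t emitted = rotl64(wanted, k);     // constant that lands in the code stream
    assert(rotr64(emitted, k) == wanted);     // runtime rotate restores the value
    assert(emitted != wanted);
    return 0;
}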
Priority: -- → P3
Okay. I'm kind of tired, but I'm pretty sure I figured out the problem, and it isn't the carry flag.

My patch does the following (for x64):

>    //Wipe out the top bit to avoid sign extension problems
>    int64_t blind = 0xdeadbeefdeadbeef & 0x7FFFFFFF7FFFFFFF;
>    int64_t blind_top = (blind & 0x7FFFFFFF00000000);
>    int32_t blind_bot = (blind & 0x000000007FFFFFFF);
>
>    Imm32 blinded_top((imm.value ^ blind_top) >> 32);
>    Imm32 blinded_bot((imm.value ^ blind_bot));
>
>    //Put the blinded top in the bottom 32 bits of the register
>    movl(blinded_top, dest.reg);
>    //XOR the top part of the blind value (shifted 32 bits to the bottom) into the register
>    xorl(Imm32((blind_top >> 32)), dest.reg);
>    //left shift the bottom of the register up into the top
>    lshift64(Imm32(32), dest);
>    //Put the blinded bottom in the bottom 32 bits of the register. Use an xor instead of an add to leave CF untouched
>    xorq(blinded_bot, dest.reg);
>    //XOR the bottom part of the blind value into the register
>    xorq(Imm32(blind_bot), dest.reg);

The problem is with:

>    xorq(blinded_bot, dest.reg);

blinded_bot is an Imm32: the lower 32 bits of the imm XORed with the lower 32 bits of the blind (in my test code that's 0x5eadbeef, since we zero the top bit).

Inside of xorq, we cast this value to an int64: http://searchfox.org/mozilla-central/rev/d0448c970093f94bd986a21d3a9e8c366b316eb4/js/src/jit/x64/BaseAssembler-x64.h#253

If the imm value contains a 1 in the top bit, we sign-extend it into the int64. Then when we do the xor and store the result in $rax, we're affecting the top 32 bits of $rax when we want to leave them alone.

What I think we need to do is create a version of xorq_ir that accepts an int64_t, so I can pass in a value that won't be sign extended.  I had attempted to do this back in this revision: https://reviewboard.mozilla.org/r/152882/diff/3/ (not because I anticipated this problem; it just seemed the shortest distance at the time) - but I don't understand the m_formatter stuff. 

Is there a reason we can't make a xorq_ir(int64_t, RegisterID)?
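A small standalone program shows the sign-extension behavior described in this comment (the values are taken from the failing 0x123456789abcdef0 test case; the variable names are illustrative):

// A 32-bit value with its top bit set, widened to a signed 64-bit type (as the
// existing xorq_ir path does), gets its upper 32 bits filled with 1s, so the
// xor also disturbs the top half of the destination register.
#include <cstdint>
#include <cstdio>

int main() {
    int32_t blindedBot = int32_t(0x9abcdef0u ^ 0x5eadbeefu);  // blinded lower half of the test constant
    int64_t signExtended = int64_t(blindedBot);                // what the int64 immediate receives
    uint64_t zeroExtended = uint64_t(uint32_t(blindedBot));    // what the code actually wants
    printf("sign-extended: %016llx\n", (unsigned long long)signExtended);
    printf("zero-extended: %016llx\n", (unsigned long long)zeroExtended);
    // sign-extended: ffffffffc411601f  -> clobbers the already-built top 32 bits
    // zero-extended: 00000000c411601f
    return 0;
}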
(In reply to Tom Ritter [:tjr] from comment #29)
> Is there a reason we can't make a xorq_ir(int64_t, RegisterID)?

There is nothing against that on x64 backend.

Have a look at andq_ir:
http://searchfox.org/mozilla-central/rev/d0448c970093f94bd986a21d3a9e8c366b316eb4/js/src/jit/x64/BaseAssembler-x64.h#158-171

And use the same opcode as the andl_ir instruction.
Okay! I('m pretty sure I) got it working with 32-bit xors and by eliding one rotate. I fired off a try run to confirm on all platforms.

The next step I'll do is perform AWFY measurements on a few variants to try to understand what the performance cost is, and how much of that cost comes from the blinding vs invoking the RNG all the time.
Comment on attachment 8881808 [details]
Bug 1376819 Perform constant blinding in the JIT on move32 and move64 immediates

https://reviewboard.mozilla.org/r/152882/#review196892

::: js/src/jit/x64/MacroAssembler-x64-inl.h:23
(Diff revisions 14 - 15)
>  // ===============================================================
>  
>  void
>  MacroAssembler::move64(Imm64 imm, Register64 dest)
>  {
> -    //Wipe out the top bit to avoid sign extension problems
> +    // Wipe out the top bit to avoid any potential sign extension problems

Drive-by nits:
  - In general one should use ULL suffix for 64-bit constants to avoid sadness
  - Shift before masking to keep more compilers happy.
  - Might be clearer to do something like:
blind = js::GenerateRandomSeed()
blind_1 = 0x7FFFFFFF & (blind >> 32);
blind_2 = 0x7FFFFFFF & (blind);
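For reference, a concrete rendering of those nits as a standalone sketch (the helper is illustrative, not SpiderMonkey code; in the real patch the seed would come from js::GenerateRandomSeed()):

// Shift before masking, keep everything unsigned, and clear the top bit of
// each 32-bit half to sidestep the sign-extension problem discussed earlier.
#include <cstdint>

struct Blind32Pair {
    uint32_t top;
    uint32_t bot;
};

static Blind32Pair splitBlind(uint64_t seed) {  // seed: e.g. from js::GenerateRandomSeed()
    Blind32Pair b;
    b.top = uint32_t(seed >> 32) & 0x7FFFFFFFu;  // shift first, then mask
    b.bot = uint32_t(seed) & 0x7FFFFFFFu;
    return b;
}

int main() {
    Blind32Pair b = splitBlind(0xdeadbeefdeadbeefull);  // ULL suffix on the 64-bit literal
    return (b.top == 0x5eadbeefu && b.bot == 0x5eadbeefu) ? 0 : 1;
}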
Comment on attachment 8881808 [details]
Bug 1376819 Perform constant blinding in the JIT on move32 and move64 immediates

https://reviewboard.mozilla.org/r/152882/#review196892

> Drive-by nits:
>   - In general one should use ULL suffix for 64-bit constants to avoid sadness
>   - Shift before masking to keep more compilers happy.
>   - Might be clearer to do something like:
> blind = js::GenerateRandomSeed()
> blind_1 = 0x7FFFFFFF & (blind >> 32);
> blind_2 = 0x7FFFFFFF & (blind);

Thanks, fixed. (Although I did need to keep the & mask on the 'blind =' line; otherwise my later xor wouldn't have the top bits cleared.)
Some notes:

(1) Adding GenerateRandomSeed calls to move32/move64 is going to be slow - please #ifdef this out.

(2) Is move32/move64 really the best place for this? It seems we should at least audit all callers and probably split these methods in blindedMove32/unblindedMove32.

(3) Blinding all constants seems unnecessary. IIRC Chakra only blinds constants > 0xffff with exceptions for 0xffff_ffff etc.

I'm okay with experiments but let's #ifdef this until we have answers to each of these questions.
(In reply to Jan de Mooij [:jandem] from comment #37)
> (1) Adding GenerateRandomSeed calls to move32/move64 is going to be slow

(I added GenerateRandomSeed as a strong RNG that relies on platform APIs. We should call it at most once per MacroAssembler or something (even that might be too slow) and then for instance use it to seed a XorShift128PlusRNG.)
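A rough sketch of that suggestion follows. The xorshift128+ step is inlined so the example is self-contained; real code would use js::GenerateRandomSeed() and mozilla's XorShift128PlusRNG rather than the stand-ins below, and note the later objection in this bug that the xorshift state would be readable by the assumed attacker:

// Pay for the strong platform RNG once per assembler, then draw per-constant
// blinds from a cheap non-cryptographic generator seeded by it.
#include <cstdint>
#include <random>

class BlindSource {
  public:
    explicit BlindSource(uint64_t strongSeed)    // one strong seed per MacroAssembler
        : s0_(strongSeed | 1), s1_(strongSeed ^ 0x9E3779B97F4A7C15ull) {}

    uint64_t nextBlind() {                       // xorshift128+ step, cheap per constant
        uint64_t x = s0_, y = s1_;
        s0_ = y;
        x ^= x << 23;
        s1_ = x ^ y ^ (x >> 17) ^ (y >> 26);
        return s1_ + y;
    }

  private:
    uint64_t s0_, s1_;
};

int main() {
    std::random_device rd;                       // stand-in for js::GenerateRandomSeed()
    BlindSource blinds((uint64_t(rd()) << 32) | rd());
    uint64_t blind = blinds.nextBlind();
    (void)blind;
    return 0;
}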
Comment on attachment 8881808 [details]
Bug 1376819 Perform constant blinding in the JIT on move32 and move64 immediates

https://reviewboard.mozilla.org/r/152882/#review197584

feedback+

I agree with Jan. This patch looks nice, but there is currently too much uncertainty about the performance impact of this patch.
Can you push this patch to Try in order to compare it on AWFY?
Attachment #8881808 - Flags: review?(nicolas.b.pierron)
Awesome, thanks all. I'm not trying to land this patch as-is to be clear, I'm moving onto the next phase which is performance testing and figuring out if this needs to be behind an option and which one.

(In reply to Jan de Mooij [:jandem] from comment #37)
> Some notes:
> 
> (1) Adding GenerateRandomSeed calls to move32/move64 is going to be slow -
> please #ifdef this out.

Yup! I'm measuring how bad it is for performance comparisons.

> (2) Is move32/move64 really the best place for this? It seems we should at
> least audit all callers and probably split these methods in
> blindedMove32/unblindedMove32.

The thinking was, from my recollection, that we put it here and measure. If performance is abysmal, the best improvement we might be able to get would be to blind only the constants the user chose, not all constants. 

> (3) Blinding all constants seems unnecessary. IIRC Chakra only blinds
> constants > 0xffff with exceptions for 0xffff_ffff etc.

I'm also testing a version that omits blinding for -1, 0, and 1. I will test omitting blinding for 8- and 16-bit constants, but I also want to review the assembly opcode charts to make sure useful gadgets can't be made from those, rather than blindly following Edge or Chrome.


(In reply to Jan de Mooij [:jandem] from comment #38)
> (In reply to Jan de Mooij [:jandem] from comment #37)
> > (1) Adding GenerateRandomSeed calls to move32/move64 is going to be slow
> 
> (I added GenerateRandomSeed as a strong RNG that relies on platform APIs. We
> should call it at most once per MacroAssembler or something (even that might
> be too slow) and then for instance use it to seed a XorShift128PlusRNG.)

I will try that!


(In reply to Nicolas B. Pierron [:nbp] from comment #39)
> Comment on attachment 8881808 [details]
> Bug 1376819 Perform constant blinding in the JIT on move32 and move64
> immediates
> 
> https://reviewboard.mozilla.org/r/152882/#review197584
> 
> feedback+
> 
> I agree with Jan. This patch looks nice, but there is currently too much
> uncertainty about the performance impact of this patch.
> Can you push this patch to Try in order to compare it on AWFY?

I'm currently setting up a whole bunch of AWFY runs to compare the base revision and various different types of changes. It takes a bit to gather the data because I have to queue up several hundred runs manually on both win32 and win64 and keep the revisions straight. Right now I am running for both 32 and 64:
 - Base
 - Blinding with a constant 0xdeadbeef
 - Blinding with the heavy-weight RNG
 - Blinding with the heavy-weight RNG but omitting blinding for -1, 0, and 1

I will add:
 - Blinding with the lightweight RNG
 - Blinding with the lightweight RNG but omitting blinding for -1, 0, 1, and 16 bit and below
Okay, I have some performance data.  

Here is the 'published' google spreadsheet (which doesn't bog down the browser as much as opening the raw file): https://docs.google.com/spreadsheets/d/e/2PACX-1vT5EhIAj9E0Jkm-6RKe-dBSMrHijGCUBltHFh6IGhYS-pM2I-5i6bcgIfcBRml3uFw26t3ViXHd-27Y/pubhtml

Here is the raw file:
https://docs.google.com/spreadsheets/d/1JdEALc0_7o34iIMeyJobbEPP6_fWiUPWufz5y908-ok/edit?usp=sharing

I ran all AWFY tests four times and averaged the results. I ran it with five variants:
 - Base (No Blinding)
 A) Blind with a constant 0xdeadbeef
 B) Blind with the heavyweight RNG
 C) Blind everything except -1, 0, and 1 with the heavyweight RNG
 D) Blind everything not in the range imm.value >= -32768 && imm.value <= 32767 with the heavyweight RNG

I display the data as a percent change off of the base revision.

The performance cost on all of these is pretty low. So low, in fact, that some of them may be unreliable for exact comparison purposes between each other.  However, Octane and Jetstream show nearly the exact curve we expect: A is the cheapest, B the most expensive, and C and D in between. And all of them agree that the differences between them are marginal.

If you think the data is suspect and you want me to gather more, or in a different manner, I'll do that if you let me know what you'd like to see.  But otherwise, I'll continue assuming this data is good:



The cost is very low on all the tests for C and D, never exceeding 2%, and usually below 1%. 



The reason I omitted XorShift128PlusRNG is because I don't think it will provide the security we desire. We are trying to defend against an attacker with an arbitrary read primitive (and probably arbitrary write). XorShift128PlusRNG's internal state is a mere two uint64s. An attacker would be able to locate and read that data and use it to predict the output sequence for JITed code, defeating the blinding.

In comparison, GenerateRandomSeed uses the system's RNG. An attacker would not be able to read kernel memory to be able to predict this output, and even if they did, it would be a TOCTOU race to read the data and then invoke a JIT before the expected output changed (from mixing or other uses of the RNG).  

XorShift128PlusRNG won't be faster than variant A, and since the cost between C (or D) and A is not very high, I'd strongly prefer to stay with GenerateRandomSeed.



I have not analyzed opcode tables to come to a conclusion on blinding only int16 and above. It's probably okay, cause Chrome and Edge do it, but I want to conclude the same myself.
(In reply to Tom Ritter [:tjr] from comment #41)
> If you think the data is suspect and you want me to gather more, or in a
> different manner, I'll do that if you let me know what you'd like to see. 
> But otherwise, I'll continue assuming this data is good:

We should also measure compilation time explicitly, for instance for Wasm Baseline compilation. It would also be good to measure on Linux and OS X too, as we have different RNGs there.

> An attacker would
> be able to locate and read that data and use it to predict the output
> sequence for JITed code, defeating the blinding.

If an attacker is that far, couldn't they just clobber the AssemblerBuffer Vector directly and write their payload at offset 0? Easier than messing with our RNG state :)

> I have not analyzed opcode tables to come to a conclusion on blinding only
> int16 and above. It's probably okay, cause Chrome and Edge do it, but I want
> to conclude the same myself.

* Note that last I checked, V8 used constant blinding only in their Baseline JIT. Maybe that changed with TurboFan, do you know? If not, we shouldn't take them too seriously here and should assume they don't do constant blinding.

* I still think we should have explicit macro assembler methods that do constant blinding instead of "hiding" it in move32 - many callers of move32 don't want blinding (because they're just moving a known constant). One approach here would be to have TrustedImm32 and UntrustedImm32 types (see the sketch after this list).

* There are a lot of other ways to add constants to the assembler buffer, for instance float32 and doubles... We should probably do an audit. Then there are methods like push/pushValue etc that might bypass move32...
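Here is a type-level sketch of the TrustedImm32/UntrustedImm32 idea from the second bullet above. The names and emit helpers are illustrative; JSC's assembler draws a similar trusted/untrusted distinction:

// Callers must say whether a constant is compiler-chosen (safe to emit
// verbatim) or derived from script (must be blinded).
#include <cstdint>
#include <cstdio>

struct TrustedImm32   { int32_t value; };   // e.g. object sizes, slot offsets
struct UntrustedImm32 { int32_t value; };   // e.g. literals taken from the script

static void emitMov(uint32_t imm) { printf("mov  reg, %#x\n", imm); }
static void emitXor(uint32_t imm) { printf("xor  reg, %#x\n", imm); }

static void move32(TrustedImm32 imm) {
    emitMov(uint32_t(imm.value));                        // no blinding needed
}

static void move32(UntrustedImm32 imm, uint32_t blind) {
    emitMov(uint32_t(imm.value) ^ blind);                // blinded immediate in the code stream
    emitXor(blind);                                      // reconstructed at runtime
}

int main() {
    move32(TrustedImm32{64});                            // compiler-internal constant
    move32(UntrustedImm32{0x0badc0de}, 0x5eadbeefu);     // attacker-influenced constant
    return 0;
}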
(In reply to Jan de Mooij [:jandem] from comment #42)
> If an attacker is that far, couldn't they just clobber the AssemblerBuffer
> Vector directly and write their payload at offset 0?

Oh I see you were talking about an arbitrary read (not write) primitive. How common is arbitrary-read-without-write?
(In reply to Jan de Mooij [:jandem] from comment #42)
> * Note that last I checked, V8 used constant blinding only in their Baseline
> JIT. Maybe that changed with Turbofan, do you know? If not we shouldn't take
> them too seriously and assume they don't do constant blinding.

OK it indeed looks like they removed constant blinding from their MacroAssembler when they removed the baseline compiler (full codegen):

Warning: large diff - https://github.com/v8/v8/commit/1458e8b01a98e9b029aa4349bb0c2be45504b638#diff-10a20736e81312c0e20edb56ddca931cL947

So assuming they don't do it elsewhere, it means V8 and JSC (AFAIK JSC has something primitive, not sure) don't do constant blinding.

I'm not saying we have to follow them and shouldn't do better (I did the W^X work despite V8 not having it) but I also don't want to regress perf too much - for instance I still think using a CSRNG for *every* large-ish constant we emit is overkill.
FWIW, on Linux the |js::GenerateRandomSeed| function will open /dev/urandom on every call if the |getrandom| syscall isn't available; that overhead could be reduced by reusing the file descriptor. The macOS implementation looks fine (it uses |arc4random|, which on recent macOS isn't actually RC4). I'm not familiar enough with the Windows API in use, |RtlGenRandom|, to know whether it needs to do any expensive setup (I'm used to |CryptGenRandom|, which has explicit init vs. read APIs).
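A minimal sketch of the fd-reuse idea (Linux-only and illustrative; it ignores the getrandom() fast path and the fork- and thread-safety questions a real patch would need to handle):

// Open /dev/urandom once and keep reading from the cached descriptor, instead
// of re-opening it for every seed request.
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <fcntl.h>
#include <unistd.h>

static bool readUrandom(void* buf, size_t len) {
    static int fd = -1;                       // cached across calls (not thread-safe as written)
    if (fd < 0) {
        fd = open("/dev/urandom", O_RDONLY | O_CLOEXEC);
        if (fd < 0)
            return false;
    }
    uint8_t* p = static_cast<uint8_t*>(buf);
    while (len > 0) {
        ssize_t n = read(fd, p, len);         // short reads are possible
        if (n <= 0)
            return false;
        p += n;
        len -= size_t(n);
    }
    return true;
}

int main() {
    uint64_t seed = 0;
    if (readUrandom(&seed, sizeof(seed)))
        printf("seed = %016llx\n", (unsigned long long)seed);
    return 0;
}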
(In reply to Jan de Mooij [:jandem] from comment #42)
> (In reply to Tom Ritter [:tjr] from comment #41)
> > If you think the data is suspect and you want me to gather more, or in a
> > different manner, I'll do that if you let me know what you'd like to see. 
> > But otherwise, I'll continue assuming this data is good:
> 
> We should also measure compilation time explicitly, for instance for Wasm
> Baseline compilation. It would also be good to measure on Linux and OS X
> too, as we have different RNGs there.

I've got Talos comparisons. Alex was correct: reopening /dev/urandom _killed_ performance. But I hacked up a simple 'reuse the fd' patch and it brought things back in line. This might be a generic performance improvement to make, depending on how often this function gets called.

The spreadsheet linked above has links to before/after the fd fix. The below links are with the fix:

Blinding w/ constant: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=8e739dccf9aee7092ff11cf5fc57a1e31da9ec15&newProject=try&newRevision=634f546af2b55a283618a90a133d8e8a31277ef8&framework=1

Blinding w/ RNG: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=8e739dccf9aee7092ff11cf5fc57a1e31da9ec15&newProject=try&newRevision=f67f84f75a9b38951cf503da8e948b7cc2c7b2da&framework=1

Blinding except -1, 0, 1: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=8e739dccf9aee7092ff11cf5fc57a1e31da9ec15&newProject=try&newRevision=19f83c72462298d8bd942d0f61207ffec600105f&framework=1

Blinding except int16: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=8e739dccf9aee7092ff11cf5fc57a1e31da9ec15&newProject=try&newRevision=5ad42362138fb12cbb2935333a55a24b95469dac&framework=1

Performance impact is small. I did 10 runs, and the confidence is still low because the change is small. Next week I hope to measure the WASM compilation times.
Attachment #8881808 - Flags: review?(nicolas.b.pierron)
I've tried to measure the compilation times, but I don't think I got much information. First I made a new Ion test (I was told we have no TraceLogger coverage for wasm compilation, but Ion compilation seems well traced via the 'IonCompiler' TLLOG option). I wanted to pick something a bit heavier weight than the existing very small tests, so I picked AES encryption: lots of constants and math.

Then I ran ./jit_test to get the command line for the shell, then I ran:

> TLLOG=IonCompiler TLOPTIONS=EnableActiveThread,EnableOffThread,EnableGraph /home/tom/Documents/moz/constant-blinding/js/src/build_OPT.OBJ/dist/bin/js -f /home/tom/Documents/moz/constant-blinding/js/src/jit-test/lib/prologue.js --js-cache /home/tom/Documents/moz/constant-blinding/js/src/jit-test/.js-cache -e 'const platform='"'"'linux2'"'"'' -e 'const libdir='"'"'/home/tom/Documents/moz/constant-blinding/js/src/jit-test/lib/'"'"'' -e 'const scriptdir='"'"'/home/tom/Documents/moz/constant-blinding/js/src/jit-test/tests/ion/'"'"'' -f /home/tom/Documents/moz/constant-blinding/js/src/jit-test/tests/ion/aes.js

I ran it five times with and without blinding and put the results here: https://ritter.vg/misc/transient/constant-blinding-perf.zip

When I actually look at the results (using https://github.com/h4writer/tracelogger) I don't see very much, and nothing that lets me drill down into actual execution times (vs 'percent of total') or into a function I recognize as containing my modifications.
Okay, I got WASM compilation results. The overhead is... very low! Raw numbers are in the spreadsheet above, but here's the gist:

On a x64 Linux Lenovo tower (xeons):
wget http://godot.eska.me/pub/wasm-benchmark/godot.wasm
./js -e "before = Date.now(); new WebAssembly.Module(os.file.readFile('./godot.wasm', 'binary')); after = Date.now(); console.log(after - before);"

No Blinding: Avg 229.5
Blinding with Constant: Avg 232.9  (or + 1.47%)
Blinding w/ RNG: Avg 246.8  (or + 7.52%)
Blinding w/ RNG and omitting int16's: Avg 231.9 (or + 1%)
(In reply to Tom Ritter [:tjr] from comment #51)
> Okay, I got WASM compilation results. They're... very low!

Well, wasm compilation in particular is made to be very fast, so that's expected :-)

I'd move the file loading out of the timed part:

./js -e "binary=os.file.readFile('./godot.wasm', 'binary'); before = Date.now(); new WebAssembly.Module(binary); after = Date.now(); console.log(after - before);"


To get a better measure, I'd recommend to do your measurements with the following runtime flags, with the spidermonkey shell:

--no-threads, to see the overall sequential impact your patch has (it might be amortized if you have a high number of threads, which is probably more realistic, but my assumption is that we're interested in the worst case behavior here)

Then try two more variants (in addition to the above flag):

--no-wasm-baseline to prevent the ultra-fast wasm baseline compiler from kicking in (which will very likely happen with godot)

then:

--no-wasm-ion to see the impact only on the wasm baseline compiler

(also, general recommendation for benchmarking nightly: make sure to use the env variable JSGC_DISABLE_POISONING=1, as well as disabling async stacks with --no-async-stacks)
Okay, got new numbers!

> for i in `seq 0 9`; do JSGC_DISABLE_POISONING=1 ./js --no-async-stacks --no-threads -e "binary=os.file.readFile('./godot.wasm', 'binary'); before = Date.now(); new WebAssembly.Module(binary); after = Date.now(); console.log(after - before);" ;  done

Blinding with Constant: +1.30%
Blinding with RNG: +2.25%
Blinding with RNG excluding int16: +1.21%

> for i in `seq 0 9`; do JSGC_DISABLE_POISONING=1 ./js --no-async-stacks --no-threads --no-wasm-baseline -e "binary=os.file.readFile('./godot.wasm', 'binary'); before = Date.now(); new WebAssembly.Module(binary); after = Date.now(); console.log(after - before);" ;  done

Blinding with Constant: +0.85%
Blinding with RNG: +2.81%	
Blinding with RNG excluding int16: +1.31%

> for i in `seq 0 9`; do JSGC_DISABLE_POISONING=1 ./js --no-async-stacks --no-threads --no-wasm-ion -e "binary=os.file.readFile('./godot.wasm', 'binary'); before = Date.now(); new WebAssembly.Module(binary); after = Date.now(); console.log(after - before);" ;  done

Blinding with Constant: +0.22%
Blinding with RNG: +6.76%
Blinding with RNG excluding int16: -1.18%
Nice.

The with-RNG results for --no-wasm-ion are sensible since the baseline compiler is very performance sensitive; the excluding-int16 result is less so and is probably some kind of noise.  But looking at your raw data, these numbers are so close that it's hard to say what's noise and what's not.

Does the RNG have per-thread state or shared state (accessed with atomics) BTW?  That won't matter here but would be an issue for parallel baseline compilation.

There's another variable here if you're on a NUMA system, which is that you want to prevent your processes from moving from one node to another, there are games you can play with numactl here.  (You said Lenovo Xeon tower which would describe what I have, and I've had good luck avoiding NUMA effects this way.)

And at the risk of being redundant, make sure cpupower is set to 'performance'.  I thought I saw a remark about that somewhere but it does not seem to be on this bug, so just mentioning it to be comprehensive.

Baseline compilation is almost 20x faster than ion compilation on this program according to above numbers.  This seems like an outlier; mostly we're used to seeing 5x-10x.  May be interesting to try at least one more program just to be sure.

Also trying a consumer laptop would be useful before concluding; tiny caches.

(By and large I am willing to believe that blinding-excluding-int16 is completely affordable.)

BTW before landing any blinding code we should (I should) remeasure ratios of wasm bytecode to generated machine code, since we use those ratios to pre-size some buffers during compilation.
(In reply to Lars T Hansen [:lth] from comment #54)
> Does the RNG have per-thread state or shared state (accessed with atomics)
> BTW?  That won't matter here but would be an issue for parallel baseline
> compilation.

No. Quasi-exception: on Linux <3.17 it open()'s /dev/urandom and that needs to be mutex-ed or similar; after that I *think* read()'s can happen in parallel... That's the third patch on this bug and in general I need advice on how to do it correctly to be thread-safe.

> There's another variable here if you're on a NUMA system, which is that you
> want to prevent your processes from moving from one node to another, there
> are games you can play with numactl here.  (You said Lenovo Xeon tower which
> would describe what I have, and I've had good luck avoiding NUMA effects
> this way.)
> 
> And at the risk of being redundant, make sure cpupower is set to
> 'performance'.  I thought I saw a remark about that somewhere but it does
> not seem to be on this bug, so just mentioning it to be comprehensive.

I don't have that package, but `cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor` says 'powersave'.  If I switch it to performance, won't that make the numbers better (or have no effect?)  Do I need to adjust it and rerun the tests in Comment 53?

> Baseline compilation is almost 20x faster than ion compilation on this
> program according to above numbers.  This seems like an outlier; mostly
> we're used to seeing 5x-10x.  May be interesting to try at least one more
> program just to be sure.

I can do that, do you have a recommendation?
 
> Also trying a consumer laptop would be useful before concluding; tiny caches.

I can use my standard-issue developer mac and run the wasm compilation tests in Comment 53, is that sufficient? I don't have another laptop.
(In reply to Tom Ritter [:tjr] from comment #55)
> (In reply to Lars T Hansen [:lth] from comment #54)
> > Does the RNG have per-thread state or shared state (accessed with atomics)
> > BTW?  That won't matter here but would be an issue for parallel baseline
> > compilation.
> 
> No. Quasi-exception: on Linux <3.17 it open()'s /dev/urandom and that needs
> to be mutex-ed or similar; after that I *think* read()'s can happen in
> parallel... That's the third patch on this bug and in general I need advice
> on how to do it correctly to be thread-safe.

Hm, yes.  Presumably the implementation of read on the random device does some mutual exclusion too, and anyway the getrandom syscall might be expected to.  Looks like getrandom() just reads from urandom anyway.

Having to synchronize on access to the RNG is not great, since we try to compile on all cores and we should expect a lot of contention for the RNG lock, whether it's in userland or in the kernel.  Admittedly by only blinding for large constants the cost of that is probably quite modest, at least for the instructions you currently blind.  Still, makes me a little nervous.  A per-compilation-thread RNG initially seeded from urandom would feel better.

> > There's another variable here if you're on a NUMA system, which is that you
> > want to prevent your processes from moving from one node to another, there
> > are games you can play with numactl here.  (You said Lenovo Xeon tower which
> > would describe what I have, and I've had good luck avoiding NUMA effects
> > this way.)
> > 
> > And at the risk of being redundant, make sure cpupower is set to
> > 'performance'.  I thought I saw a remark about that somewhere but it does
> > not seem to be on this bug, so just mentioning it to be comprehensive.
> 
> I don't have that package, but `cat
> /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor` says 'powersave'.  If
> I switch it to performance, won't that make the numbers better (or have no
> effect?)

They might make the numbers more stable.

> Do I need to adjust it and rerun the tests in Comment 53?

If you have the time and it's not too much bother, you can at least run one configuration to see if the numbers are different.

(Dealing with NUMA might be more important.)

> > Baseline compilation is almost 20x faster than ion compilation on this
> > program according to above numbers.  This seems like an outlier; mostly
> > we're used to seeing 5x-10x.  May be interesting to try at least one more
> > program just to be sure.
> 
> I can do that, do you have a recommendation?

Possibly the wasm from the zen garden demo?  Or the wasm from the tanks demo?  (Do you have the links to these?)

> > Also trying a consumer laptop would be useful before concluding; tiny caches.
> 
> I can use my standard-issue developer mac and run the wasm compilation tests
> in Comment 53, is that sufficient? I don't have another laptop.

We supposedly have something known as a Reference Laptop somewhere...  The mac is a good start.  Or perhaps we could run these tests on AWFY to get a more complete picture with less effort on your part.
(In reply to Lars T Hansen [:lth] from comment #56)
> (In reply to Tom Ritter [:tjr] from comment #55)
> > (In reply to Lars T Hansen [:lth] from comment #54)
> > > Does the RNG have per-thread state or shared state (accessed with atomics)
> > > BTW?  That won't matter here but would be an issue for parallel baseline
> > > compilation.
> > 
> > No. Quasi-exception: on Linux <3.17 it open()'s /dev/urandom and that needs
> > to be mutex-ed or similar; after that I *think* read()'s can happen in
> > parallel... That's the third patch on this bug and in general I need advice
> > on how to do it correctly to be thread-safe.
> 
> Hm, yes.  Presumably the implementation of read on the random device does
> some mutual exclusion too, and anyway the getrandom syscall might be
> expected to.  Looks like getrandom() just reads from urandom anyway.
> 
> Having to synchronize on access to the RNG is not great, since we try to
> compile on all cores and we should expect a lot of contention for the RNG
> lock, whether it's in userland or in the kernel.  Admittedly by only
> blinding for large constants the cost of that is probably quite modest, at
> least for the instructions you currently blind.  Still, makes me a little
> nervous.  A per-compilation-thread RNG initially seeded from urandom would
> feel better.

Implementing an RNG is bad cryptographic practice, doubly so if it's in user space and you don't have to do it. It's just a lot of code to write and potentially get wrong (see Debian, FreeBSD, TrueCrypt). Even if you decide to do it cheap and fast and use a stream cipher as a poor person's CSPRNG, you're forfeiting security and exposing the state of the RNG in memory, as I talked about in Comment 41. (Not exposing the state in memory is not a bulletproof defense by any means, but if it can be avoided it should be.)

I found this old post: https://news.ycombinator.com/item?id=7610273 - but using their shell commands I actually can't replicate the /dev/urandom contention they observed. Maybe the kernel improved since then?


> > > There's another variable here if you're on a NUMA system, which is that you
> > > want to prevent your processes from moving from one node to another, there
> > > are games you can play with numactl here.  (You said Lenovo Xeon tower which
> > > would describe what I have, and I've had good luck avoiding NUMA effects
> > > this way.)
> > > 
> > > And at the risk of being redundant, make sure cpupower is set to
> > > 'performance'.  I thought I saw a remark about that somewhere but it does
> > > not seem to be on this bug, so just mentioning it to be comprehensive.
> > 
> > I don't have that package, but `cat
> > /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor` says 'powersave'.  If
> > I switch it to performance, won't that make the numbers better (or have no
> > effect?)
> 
> They might make the numbers more stable.
> 
> > Do I need to adjust it and rerun the tests in Comment 53?
> 
> If you have the time and it's not too much bother, you can at least run one
> configuration to see if the numbers are different.
> 
> (Dealing with NUMA might be more important.)
> 
> > > Baseline compilation is almost 20x faster than ion compilation on this
> > > program according to above numbers.  This seems like an outlier; mostly
> > > we're used to seeing 5x-10x.  May be interesting to try at least one more
> > > program just to be sure.
> > 
> > I can do that, do you have a recommendation?
> 
> Possibly the wasm from the zen garden demo?  Or the wasm from the tanks
> demo?  (Do you have the links to these?)

I don't.  I found the zen garden demo at least, but I couldn't identify the .wasm file from it.
Zen garden:
https://s3.amazonaws.com/mozilla-games/ZenGarden/2017-03-16-ZenGarden.zip
The main wasm file is UE4Game-HTML5-Shipping.wasm (about 35MB)

Tanks:
Go to https://github.com/WebAssembly/website/tree/master/demo/Tanks/Build
You want tanks.wasm.code.unityweb which you should just gunzip to tanks.wasm (about 12MB)

I know implementing your own RNG is bad cryptographic practice, but if blinding slows down parallel compilation significantly due to MP contention then you may be faced with choosing between bad practice and no blinding at all.

Which reminds me: skimming this thread and the patches it seems like you blind MOV operations but not constant operands in other ALU ops (ADD, SUB, AND, OR, XOR) that would compile from wasm and asm.js down into single, predictable machine instructions.  Is that right?  What are your thoughts on whether work is needed also for those operations?  Is the frequency of those operations in real code high enough that it might matter for compile time, run time, and/or code size?
(The kernel has improved since then: https://marc.info/?l=linux-crypto-vger&m=146583311726544&w=2 I believe)
Okay, I measured godot, ZenGarden, and tanks on a Linux Desktop with cpu in performance mode, and on my macbook, for all four variations. Even the worst results are under 10%, and all but 2 for exclude-int16s are below 5%: https://imgur.com/a/roIGT

Raw data: https://docs.google.com/spreadsheets/d/1JdEALc0_7o34iIMeyJobbEPP6_fWiUPWufz5y908-ok/edit#gid=1823945629

I had some folks review the kernel patch Alex posted, and they agree that the patch is safe (one would hope the kernel would not include unsafe code, but I wanted to double check). Despite Ted's admonishment that "Programs who need this are doing the wrong thing", the cryptographic community's opinion is solidly that implementing your own RNG is "the wrong thing". But since there is no longer a performance concern with locking on kernels from late 2016 or newer (I'm not sure exactly when that patch landed), we shouldn't need one.

Regarding other constant operations: I am not certain; that is something worth investigating. I'll probably open a separate bug about it.
Wow, 9% is quite a lot of overhead, especially considering this is just one source of immediates; presumably a more complete implementation which covers the 100s of other places 4-byte immediates can appear would have more overhead.
(In reply to Luke Wagner [:luke] from comment #61)
> Wow, 9% is quite a lot of overhead, especially considering this is just one
> source of immediates

The "10%" figure was the worst scenario: it uses the RNG for all constants (including int16s), and only two tests out of 12 were in that range. 

On the version we would actually consider, which does not blind int16s, performance is much better. On a Linux Desktop, 8/12 tests ran faster, the slowest increased runtime by 1.5%.  On my macbook, 4/12 tests ran faster, the slowest increased runtime by 5.3%, the next slowest by 4.8%, the third slowest by 3.4%.
Yes, but won't the number of constants increase as other 4-byte immediates are blinded?
I've started investigating other compilation mechanisms that result in unblinded constants.

> function disassemble(f) {
> 	for(var i=0; i<1000; i++)
> 		f(i);
> 	var s = disnative(f);
> 	var lines = s.split("\n")		
> 	for(var l in lines) {
> 		console.log(lines[l]);
> 	}
> }

is my base, and I'm testing without my current patches.

> ./js -f test-asm.js -e "disassemble(function(x) { return x + 2; })"

Will give me the constant in this form:

> 48BB020000000080F8FF                movabs rbx, -0x77ffffffffffe

(Specifically 020000000080F8FF, but ignoring the top 80F8FF).  No matter what arithmetic operator I use I can't seem to coerce it into anything except a mov (which is good.)

I'm using this for asm.js:

> ./js -f test-asm.js -e 'disassemble((function () { "use asm"; function f(x) { x = x|0; x = (x + 0x1337)|0; return x|0
; } return f; })())'

That gives me a very clean

> 0537130000                          add eax, 0x1337

So for asm.js, there's definitely going to be a lot of blinding that needs to happen. I wonder if we'll want to taint user-supplied constants and blind only those...
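If immediates in ALU instructions like the add above end up needing blinding too, one common trick is to split the constant across two instructions so neither emitted immediate equals the script-chosen value. A hedged sketch (illustrative, not the patch, and note the flag-clobbering caveat):

#include <cassert>
#include <cstdint>
#include <cstdio>

static void emitBlindedAdd32(uint32_t imm, uint32_t blind) {
    uint32_t part = imm - blind;              // wraps mod 2^32, which is fine
    printf("add  eax, %#x\n", part);          // first half of the constant
    printf("add  eax, %#x\n", blind);         // second half; sums to imm at runtime
    assert(uint32_t(part + blind) == imm);
    // Note: unlike the mov case, this costs an extra flag-clobbering add, so
    // it would only be safe where the surrounding code does not rely on flags.
}

int main() {
    emitBlindedAdd32(0x1337u, 0xdeadbeefu);
    return 0;
}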
Removing parity-chrome as they aren't doing this (see comment 44).
Keywords: parity-chrome
There was a presentation at OffensiveCon recently about exploiting Firefox using unblinded constants as a quick and easy way to get shellcode working without needing to jump through any hoops. (It also attacked our asm.js allocation procedures, which have since been fixed.)

https://github.com/rh0dev/slides/blob/master/OffensiveCon2018_From_Assembly_to_JavaScript_and_back.pdf
Severity: normal → S3

Constant blinding has been shown to be insufficient, as the generated code itself can be used to produce the desired instruction bytes without relying on attacker-supplied constants.

Constant blinding focuses on making the content of JIT code unpredictable, whereas the newly taken approach focuses on making its location unpredictable.

Status: NEW → RESOLVED
Closed: 1 month ago
Resolution: --- → WONTFIX