Closed Bug 868948 Opened 11 years ago Closed 7 years ago

A patch for NSS: a constant-time implementation of the GHASH function of AES-GCM, for processors that do not have the AES-NI/PCLMULQDQ instructions

Categories

(NSS :: Libraries, defect, P1)

3.14.3
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: shay.gueron, Assigned: franziskus)

References

(Depends on 1 open bug)

Details

Attachments

(2 files, 7 obsolete files)

User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:20.0) Gecko/20100101 Firefox/20.0
Build ID: 20130409194949

Steps to reproduce:

Hello all,

This patch is a contribution to NSS. 
It provides a generic (C code) implementation of the GHASH function of AES-GCM. 
Its purpose is to provide an option for a “side channel protected” AES-GCM on CPUs that do not have 
the AES-NI / PCLMULQDQ instructions (and therefore cannot enjoy the faster, side-channel-protected implementation that is in NSS now). 

This patch offers an alternative to the NSS function “gcm_HashMult”. 

The implementation is “constant time” in the following sense: it has no conditional branches and uses 
128 tables of 32 bytes each (aligned), so that each table fits into a single cache line on (practically) any CPU.
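
For illustration, a lookup into one such 32-byte table might look like the sketch below (my reading of the scheme as described, with invented names - not the patch's actual code). Because each table is 32 bytes long and 32-byte aligned, it can never straddle a 64-byte cache line, so a secret-dependent load touches the same line regardless of the index:

    #include <stdint.h>

    /* One of the 128 tables: 32 bytes = four 64-bit entries, aligned so
     * the whole table always lies within a single 64-byte cache line. */
    typedef struct {
        uint64_t entry[4];
    } __attribute__((aligned(32))) ghash_subtable;

    static uint64_t subtable_read(const ghash_subtable *t, unsigned secret_idx)
    {
        /* A direct load with no conditional branch on the secret index;
         * the cache line touched is the same for every value of secret_idx. */
        return t->entry[secret_idx & 3];
    }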

In addition, the performance provided here is better than that of the two non-constant-time generic 
implementations currently used in NSS (on the tested CPUs).

We measured the following performance (in cycles/byte):

                           MPI                 ALGORITHM_1       THIS PATCH
Sandy Bridge:             54.71                  249.64            29.92
Ivy Bridge:               54.26                  249.40            27.05


Developers and authors:
***************************************************************************
Shay Gueron (1, 2), and Vlad Krasnov (1)
(1) Intel Corporation, Israel Development Center, Haifa, Israel
(2) University of Haifa, Israel
***************************************************************************
Copyright(c) 2013, Intel Corp.
Attachment #745806 - Attachment is patch: true
Attachment #745806 - Flags: review?(bsmith)
Status: UNCONFIRMED → NEW
Ever confirmed: true
Attachment #745806 - Flags: superreview?(rrelyea)
Assignee: nobody → shay.gueron
Priority: -- → P1
Target Milestone: --- → 3.15.1
Version: 3.15.4 → 3.14.3
Target Milestone: 3.15.1 → 3.15.2
Sadly, in practice, this patch is slower than the MPI code!

I set up a test with OpenSSL sending a 1GB file to tstclnt. OpenSSL is on the same machine, but on a different core, and has AESNI support. NSS AESNI is disabled. I'm timing simply with time(1).

AES128-SHA: 13.444
AES128-GCM-SHA256: 26.254
AES128-GCM-SHA256 + this patch: 28.920
AES128-GCM-SHA256 + my patch: 15.091
RC4-SHA: 7.498

The reason for the slowdown is that PKCS#11 doesn't have multi-shot support, so the AES key schedule and GCM precomputation are performed for every record. AES-GCM's poor key agility really hurts in this case!

My patch is actually much dumber (and not constant time, although it only spans four cache lines). It appears to be faster because of the minimal precomputation involved: it's just a table of 16 multiples.

wtc has suggested that we might want to transparently cache precomputed tables. With more work, it's possible that PKCS#11 could be extended. But without that, this patch is problematic for now.
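
(For reference, a "table of 16 multiples" is presumably the standard 4-bit method, where table[i] holds the product of H with each 4-bit polynomial i. A sketch of that precomputation follows - my reconstruction of the general technique in plain bit order for clarity, not the patch itself:)

    #include <stdint.h>

    typedef struct { uint64_t hi, lo; } u128;

    /* Multiply by x in GF(2^128) mod x^128 + x^7 + x^2 + x + 1, written
     * in plain (non-reflected) bit order; GHASH proper uses a
     * bit-reflected variant of this field. */
    static u128 gf128_mulx(u128 a)
    {
        uint64_t carry = a.hi >> 63;
        u128 r;
        r.hi = (a.hi << 1) | (a.lo >> 63);
        r.lo = (a.lo << 1) ^ (carry ? 0x87 : 0); /* branchy, like the approach itself */
        return r;
    }

    /* table[i] = i(x) * H for all sixteen 4-bit polynomials i,
     * built with 7 doublings and 7 additions. */
    static void build_table(u128 table[16], u128 h)
    {
        table[0].hi = table[0].lo = 0;
        table[1] = h;
        for (int i = 2; i < 16; i += 2) {
            table[i] = gf128_mulx(table[i / 2]);  /* double ... */
            table[i + 1].hi = table[i].hi ^ h.hi; /* ... then add H */
            table[i + 1].lo = table[i].lo ^ h.lo;
        }
    }
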
Comment on attachment 772892 [details] [diff] [review]
patch (dumber, but faster in practice for the moment).

Hi, 

The intention of the patch was not to achieve the best performance on a generic processor. Rather, its purpose was to offer generic C code that computes GHASH in constant time, and is therefore side-channel protected. 
The current C code in NSS is not constant time. 
Therefore, comparing the patch to other C code that is not constant time is less relevant in this context.

Superior performance is achieved on processors with the AES-NI and PCLMULQDQ instructions (and NSS already has this constant-time implementation). 

Perhaps it would be advisable to offer “constant time” code as an option (via a compilation flag?) for users who are concerned about side channel analysis, and faster code for those who aren’t. 
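
Such a switch could be as simple as the following (the flag and function names here are hypothetical, purely for illustration; they do not exist in NSS):

    #ifdef NSS_GCM_SIDE_CHANNEL_SAFE              /* hypothetical build flag */
    #define gcm_HashMult_impl gcm_HashMult_ct     /* constant-time variant */
    #else
    #define gcm_HashMult_impl gcm_HashMult_tbl    /* faster, table-driven */
    #endif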

Independently, it is indeed regrettable that the NSS implementation repeats the “init” work multiple times unnecessarily, and consequently suffers performance degradation. 

Shay
Making my patch constant time results in a ~5x slowdown, leaving it slower than Shay's. So Shay's patch sounds like the correct answer.

However, it appears to rely on the assumption that all accesses to the same cache line are equivalent as far as timing goes. That doesn't appear to be correct; see http://cryptojedi.org/peter/data/chesrump-20130822.pdf.

So no option is very good. Shay's patch may be an improvement on the current code, however.
If we were to take Adam's approach or Shay's approach, we'd still need to factor out and cache some precomputed results. Should we do that in a separate bug?

The DJB/Schwabe presentation disproves the idea that all accesses to the same cache line are constant-time in general. However, how general is that result? Are there significant numbers of Intel chips and/or ARM chips where the accesses actually *are* constant-time?
(In reply to Brian Smith (:briansmith, was :bsmith; Please NEEDINFO? me if you want a response) from comment #4)
> If we were to take Adam's approach or Shay's approach, we'd still need to
> factor out and cache some precomputed results. Should we do that in a
> separate bug?

It's certainly true that the code would benefit from being able to reuse some context for multiple AES-GCM calls. However, the PKCS#11 API doesn't seem to permit that.

> The DJB/Schwabe presentation disproves the idea that all accesses to the same
> cache line are constant-time in general. However, how general is that result?
> Are there significant numbers of Intel chips and/or ARM chips where the
> accesses actually *are* constant-time?

I'm not sure. However, if I make my version of the code constant-time by making the cache-line assumption, then it has very little slowdown. So, if we're willing to stretch to that assumption then we can clearly improve on the current code. (Although making sure that the correct part of the struct is 64-byte aligned is a little tricky. It's something that I'll have to tackle next week.)
>>> The DJB/Schwabe presentation disproves the idea that all accesses to the same cache line are constant-time in general. 
>>> However, how general is that result?

I repeat what I wrote about this elsewhere: 

First, a comment: it is difficult to actually understand the precise claim by the authors, from these 6 slides. The code snippet does not include sufficient information to determine what the experiment really is. With this disclaimer, my understanding is this: the authors of the presentation claim that accessing different locations on a cache line shows different latency – depending on the location (within the cache line). The graph shows a *significantly* larger latency for doubleword #0. As a result, they insinuate that the scattering-gathering method for accessing tables does not protect against side channel analysis. 

This, however, seems to me to be incorrect. 

Perhaps what the experiment shows is due to “cache line splits”, where the data being accessed does not reside completely in a single cache line. For example, if the array starts 56 bytes into a cache line and 64-bit quadwords are read, the first one (say Q0) is in one cache line and the rest are in another. So, when flushing (only) the cache line that holds Q0, a subsequent read of Q0 would be “slow” but reads of the rest would be fast. 

It is not possible to see how the array was aligned in the discussed presentation, because the code snippet does not include this part. 

Anyway, this phenomenon is not a vulnerability - it is only about a proper implementation of the gather/scatter method: the elements of the table simply need to be aligned so that every single element lies within a single cache line. The programmer needs to make sure that the elements of the table are naturally aligned (e.g., aligned(64)). As long as there are no cache line splits, there should be no timing difference between accesses to a cache line.
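
Concretely, the fix amounts to nothing more than a declaration along these lines (an illustrative sketch, not the patch):

    #include <stdint.h>

    /* 64-byte alignment guarantees that no element of the table straddles
     * two cache lines; a start 56 bytes into a line would leave the first
     * quadword in a different cache line from the other seven. */
    static uint64_t table[8] __attribute__((aligned(64)));
    /* C11 equivalent: _Alignas(64) static uint64_t table[8]; */
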
(In reply to Shay Gueron from comment #6)
> the authors of the presentation claim
> that accessing different locations on a cache line shows different latency –
> depending on the location (within the cache line). The graph shows a
> *significantly* larger latency for doubleword #0. As a result, they
> insinuate that the scattering-gathering method for accessing tables does not
> protect against side channel analysis.

I believe that is their claim.

I've attempted to reproduce the code shown in the slides as accurately as possible (although I greatly increased the number of iterations). See cacheline.s, which I'll attach to this bug.

Based on slide 5, the authors report a significant effect for the first uint32; somewhere between a doubling and tripling of the access time. I was able to reproduce a smaller, but consistent effect. However, when I added a dummy, pre-warming run of the loop before the real measurements the effect disappeared.

I cannot reproduce the results from the slides so perhaps all that the authors saw was a measurement artifact?

I'm testing on an E5-2690 @ 2.9GHz with HyperThreading and TurboBoost disabled. I'll attach the source to this bug.
Compile this with the ASM source using a command like:
  gcc cacheline.s cacheline_main.c -Wall -g -std=c99
Attached file ASM source for cacheline timing (obsolete) —
The previously attached cacheline.s was after some tweaking. This is the version that I meant to include.
Attachment #8337786 - Attachment is obsolete: true
I emailed Peter and djb about this and Dan pointed out that I probably shouldn't be using the LOOP instruction.

By switching from loop to sub+jg, the effect is clear and remains when compiled with -DREVERSE to run the tests in the opposite order.

On the same system as previously described, I find that accesses to the first uint32 in the cache line are ~3.4x slower with the newly updated code.
Attachment #8337794 - Attachment is obsolete: true
I could reproduce these results, more or less (from a single running program - not from 2 separate threads running together). 
These numbers are about “read-after-write” latency (and memory disambiguation). I guess the threat would be a “remote” scenario where one piece of code constantly writes to such addresses and thereby slows down another piece of code. Upon completion of some task, the added latency is used for estimating the number of times the victim accesses some piece of a line. The question is whether and where there is a real threat from this (one needs a lot of data to collect statistics, and the scenario is not obvious). 

An easy (probably the easiest) change to make in the current patch would be this: start the "gather" function with a "preemptive write"; this eliminates the observed effects (at the cost of some slowdown). But then the question would be whether this behavior is consistent across platforms.

Bottom line - endless discussions... the path of least resistance would be to modify the "fetch from the table" function so that it simply goes over all of the elements, and that's it. I will soon send an update.
Here is a simple implementation - no tables at all (and no branches). 
It is faster than the current NSS code. The performance of the "gfmul" on a Sandy Bridge is 980 cycles (compared to the current 2549).
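
The branchless multiply-and-reduce idea looks roughly like the sketch below: Algorithm 1 from the GCM spec with every secret-dependent conditional replaced by a mask (my reconstruction for illustration, not the attached patch):

    #include <stdint.h>

    typedef struct { uint64_t hi, lo; } u128;  /* hi holds GCM bits 0..63 */

    static u128 gf128_mul(u128 x, u128 y)
    {
        u128 z = { 0, 0 };
        u128 v = x;
        for (int i = 0; i < 128; i++) {
            /* Bit i of Y, numbered from the MSB of the first byte. The
             * branch on the public loop counter i leaks nothing. */
            uint64_t yi = (i < 64) ? (y.hi >> (63 - i)) : (y.lo >> (127 - i));
            uint64_t mask = (uint64_t)0 - (yi & 1);  /* all-ones iff bit set */
            z.hi ^= v.hi & mask;
            z.lo ^= v.lo & mask;
            /* V >>= 1, folding in R = 0xE1 || 0^120 iff the low bit fell off. */
            uint64_t carry = (uint64_t)0 - (v.lo & 1);
            v.lo = (v.lo >> 1) | (v.hi << 63);
            v.hi = (v.hi >> 1) ^ (carry & 0xE100000000000000ULL);
        }
        return z;
    }
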
Attachment #8338551 - Flags: review+
(In reply to Shay Gueron from comment #13)
> Created attachment 8338551 [details] [diff] [review]
> Patch: GCM for processors with no PCLMULQDQ instruction
> 
> Here is a simple implementation - no tables at all (and no branches). 
> It is faster than the current NSS code. The performance of the "gfmul" on a
> Sandy Bridge is 980 cycles (compared to the current 2549).

I'm afraid that the patch in #13 doesn't have any effect because GCM_USE_ALGORITHM_1 isn't defined.

When it is defined there are a couple of minor compile errors that need to be fixed. The result, unfortunately, is half the speed of the current code.
(In reply to Shay Gueron from comment #12)
> disambiguation). I guess the threat would be a “remote” scenario where one
> code constantly writes to such addresses that might slow down another code.
> Upon completion of some task, the added latency is used in for estimating
> the number of times it accesses some piece of a line. The question is
> whether and where there is a real threat from this (one needs a lot of data
> to collect statistics, and the scenario is not obvious).

Though I feel a little bit silly for saying it, I think that in a web browser it is unlikely that we're going to run into this issue in a way that is going to actually be problematic. Servers have other considerations. I would prefer a solution that optimizes mostly for speed but which tries to do a reasonable job at limiting the side channel leakage over one that is less efficient but which is perfect.

On ARM-based phones, we may be able to make use of Shay's original technique. If nobody tests on ARM before I get back from Thanksgiving break, I will do so.
(In reply to Brian Smith (:briansmith, was :bsmith; Please NEEDINFO? me if you want a response) from comment #15)
> Though I feel a little bit silly for saying it, I think that in a web
> browser it is unlikely that we're going to run into this issue in a way that
> is going to actually be problematic. Servers have other considerations. I
> would prefer a solution that optimizes mostly for speed but which tries to
> do a reasonable job at limiting the side channel leakage over one that is
> less efficient but which is perfect.

It's worth noting that the Firefox OS security model for apps allows any installed app to effectively start itself up whenever it wants without explicitly informing the user, allowing it to run code in the background that could potentially infer stuff.  (The app will be visible in the card switcher, however.)  Specifically mozAlarms lets the app automatically start itself up, the simple push API lets a remote server start the app up, and there are others too.

Having said that, timing channel attacks against symmetric encryption using short-lived keys do not seem super duper concerning.  (My interest in this bug and in general is that DT is making an effort to prototype PGP support for e-mail for Firefox OS, and I'm particularly interested in avoiding timing channel attacks that might reveal private keys for that.  We had a short thread at https://mail.mozilla.org/private/openpgpjs/2013-November/000001.html for those interested.  Right now you need to join the list to read, but I complained and it should be made public tomorrow.  The discussion list may be good for those interested in the effort.  List info at https://mail.mozilla.org/listinfo/openpgpjs)
GCM is fundamentally unsuitable for software implementation. If we're happy to use the cacheline trick and hope that it's good enough then I've attached an updated version of my patch.

Using my silly "transfer 1GB using tstclnt" test:

Before: 28.028s 27.992s 27.919s
After:  16.946s 16.906s 16.968s
Attachment #772892 - Attachment is obsolete: true
Attached patch gcm_safe_02.patch (obsolete) — Splinter Review
Fix: now defining Alg 1 as the default.

The performance of the gfmul on Sandy Bridge is 980 cycles (instead of 2549 in NSS Alg 1). 

This implementation has no tables at all (and no branches). This is its merit.

Comparing the speed with Alg 2 is irrelevant, because Alg 2 is not constant time. 
Similarly, comparing with the original Alg 1 is also irrelevant: if we only seek performance here, then improving Alg 1 to a level that is still slower than Alg 2 is not an answer. 

The “cache line size tables” approach of the original patch is faster than all of the above (perhaps Adam has something faster still). The question about that patch is whether the method is "sufficiently constant time”. This is why the current patch was offered. 

Bottom line: robustness (practical or theoretical) against potential side channel analysis is the question. There are different levels, and we need to choose how much performance we are willing to pay for achieving them in the current context.
> Comparing the speed with Alg 2 is irrelevant, because Alg 2 is not constant time.

> we need to choose how much performance we are willing to pay for achieving them in the current context.

These cannot both be true :) If it's a trade-off between security and speed then the speed of non-constant-time implementations does matter.

I ran the patch from #18 via my silly test and got 66 seconds. (See #17 for other times.)

If we want really constant-time then this patch is the only option at 66 seconds.

Falsely making the cacheline assumption makes my #17 the best option at 17 seconds. If NSS were altered to retain GCM state across invocations then Shay's #1 would be best under this assumption, but it's currently at 29 seconds.
Let's keep in mind that the currently-shipping code in NSS is neither fast nor constant time. AFAICT, all of the alternatives provided so far reduce the timing attack surface, even if it isn't eliminated completely in all of them. That is an improvement. Almost all of the alternatives provided so far also improve performance.

Remember that, effectively, we're asking websites to switch from RC4 to AES-GCM. Given the current state of affairs, I don't think it is wise for us to pay a nearly 9x CPU performance penalty over RC4 for AES-GCM, especially when AES-GCM is also less efficient in terms of overhead per record. A penalty of more than 2X is already saddening, but at least it is in line with AES128-HMAC-SHA1.

If Shay's original patch is no less safe than AGL's, but it would be faster if we could cache the precomputation, then let's figure out how to make such caching happen. Looking at the ssl3con.c code, I see we call PK11_Encrypt/PK11_Decrypt for each record. That means we create and close a PKCS#11 session for each record, so we can't cache the precomputation within the session. The only choice is to cache the precomputation within the key object. I don't see why this would be problematic.

One change that would make things clearer would be to factor out the table precomputation into its own function that takes as arguments the inputs that must remain constant for the precomputation to be valid, so that it is crystal clear exactly what we need to check when deciding whether to reuse the existing precomputed table values or to recompute them.

If it makes things easier (e.g. with respect to thread-safe simultaneous access of the same key object from two different threads), we could do the precomputation only for the first usage of the key, and then ignore the precomputed values (re-computing the table) for any usage where the original precomputation would be invalid, and/or we could implement some sort of mutex where only one session at a time can make use of the precomputed table. These techniques would work OK for libssl because, IIUC, libssl always does things in such a way that we'd always be able to use the original precomputed values, and it never tries to use the same key simultaneously from multiple threads.
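
A hypothetical shape for that factoring (none of these names exist in NSS; the point is only that the cache records exactly the inputs the precomputation depends on, so the reuse check is explicit):

    #include <string.h>

    typedef struct {
        unsigned char h[16];   /* hash subkey the tables were built from */
        int valid;
        /* ... the precomputed GHASH tables would live here ... */
    } GHashPrecomp;

    /* Reuse the cached tables only when they were built from this H;
     * otherwise the caller recomputes (or ignores the cache). */
    static int ghash_precomp_reusable(const GHashPrecomp *pc,
                                      const unsigned char h[16])
    {
        return pc->valid && memcmp(pc->h, h, sizeof pc->h) == 0;
    }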

That said, if all of that seems too complicated to do now, then let's go with Adam's latest suggestion. Perfect is the enemy of the good and whatnot.
Caching requires a change to PKCS #11*. We're working on a draft at the OASIS level currently. The problem is that GCM needs pre-block information passed in on each encrypt/decrypt call. This requires C_EncryptInit/DecryptInit(). 

(*Actually, we may be able to do something where we recognize that the same key and the same session were used in softoken and reuse our freebl context. This would involve, for GCM, not freeing the freebl GCM state on Finalize, but delaying the free until the session is destroyed or a new Initialize is called. At Initialize time, we check whether we have an old freebl GCM state still allocated, and free it if the key passed to C_EncryptInit/C_DecryptInit isn't the same as the previous key. PK11_Encrypt/Decrypt could be changed to use the session handle associated with the key; then it wouldn't be a new handle. If someone wants to look at this, I can point them to the appropriate code).
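
A hypothetical sketch of that lifetime management (invented names, not softoken's real structures; function pointers stand in for the freebl calls):

    #include <stddef.h>

    typedef unsigned long ObjHandle;   /* stand-in for CK_OBJECT_HANDLE */

    typedef struct {
        void *gcmState;   /* freebl GCM context kept past Finalize; may be NULL */
        ObjHandle gcmKey; /* key the cached context was built for */
    } SessionGcmCache;

    static void *session_get_gcm(SessionGcmCache *s, ObjHandle key,
                                 void *(*create)(void), void (*destroy)(void *))
    {
        if (s->gcmState != NULL && s->gcmKey != key) {
            destroy(s->gcmState);     /* built for a different key: drop it */
            s->gcmState = NULL;
        }
        if (s->gcmState == NULL) {
            s->gcmState = create();   /* expensive per-key precomputation */
            s->gcmKey = key;
        }
        return s->gcmState;
    }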

I don't think trying to jam the precomputed table into the key object is a good idea. Primarily, it messes up the freebl/softoken boundary quite badly. Since keys in NSS have their own sessions, it's not necessary.


RE thread safety: NSS already protects independent PKCS #11 sessions (a requirement of the PKCS #11 spec). Two threads can't call PKCS #11 on the same session at the same time.
(In reply to Robert Relyea from comment #21)
> PK11_Encrypt/Decrypt could be changed to use the session handle associated
> with the key, then it wouldn't be a new handle. If someone wants to look at
> this, I can point them to the appropriate code).

OK, that sounds simpler and better all around anyway. Why do PK11_Encrypt/PK11_Decrypt create new sessions in the first place? What negative side effects would there be to changing them to use the key's session? Would we be better off creating new variants of PK11_Encrypt/PK11_Decrypt that have this new behavior, to minimize risk?


> I don't think trying to jam the precomupted table into the key object is a
> good idea. Primarily, it messes up the freebl/softoken boundary quite badly.
> Since keys in NSS have their own sessions, it's not necessary.

To be clear, I am suggesting that we cache whatever FreeBL context object we have; I'm not suggesting we try to copy the context information out of FreeBL and into some softoken structure. In other words, I think that your idea and my idea are basically the same idea, except my idea was to store the context information as part of the key object, and your idea is to store it as part of the session object that the key belongs to. I don't know which of those two ways is better or why.

> RE thread safety: NSS already protects around independent PKCS #11 sessions
> (a requirement of the PKCS #11 spec). Two threads can't call PKCS #11 on the
> same thread at the same time.

OK. TBH, this is still unclear to me, because of code like this in PK11_Encrypt:

    session = pk11_GetNewSession(slot, &owner);
    haslock = (!owner || !slot->isThreadSafe);
    if (haslock) PK11_EnterSlotMonitor(slot);

I inferred from this that things regarding threading are more complicated than you seem to suggest. But, I am happy to be wrong about that.
Flags: needinfo?(rrelyea)
I think it's copying the structure of other PK11 single-shot functions (like PK11_Sign). The negative effect is that PK11_Encrypt/PK11_Decrypt would single-thread through their respective keys (that is, two PK11_Encrypt calls would block if called using identical keys). This would be a problem for functions like PK11_Sign (where lots of threads will likely be using the same private key), but PK11_Encrypt/Decrypt would not likely be using the same keys on multiple threads, since the symmetric keys are unique, temporary keys. I don't think this semantic change would require new variants.

> To be clear, I am suggesting that we cache whatever FreeBL context object we have; 

The contexts we have contain more than just the key.

> I inferred from this that things regarding threading are more complicated than you seem to suggest.

No, that's doing exactly what I said it does. If pk11_GetNewSession() returns a brand new session (owner == TRUE), then there is no reason to lock (no other thread even knows about this session) and we proceed without locking. If pk11_GetNewSession() returns the session owned by the slot (some tokens can only create one session, or some limited number of sessions; when NSS starts, it allocates one session at tokenInit time for each slot, which can always be used), then we lock the slot that owns the session (protecting the session from multiple use). In addition, if a slot claims not to be threadsafe, then we also lock it. In the softoken case, we currently don't lock, because we own the session (softoken can generate as many sessions as you have memory for) and softoken is threadsafe.

If we switch to the keysession, we would lock the key over this operation because that's the lock that protects the session owned by the key.

bob
Flags: needinfo?(rrelyea)
Comment on attachment 745806 [details] [diff] [review]
intel_sidechannel_safe_ghash.patch

Review of attachment 745806 [details] [diff] [review]:
-----------------------------------------------------------------

Dropping this review request. rrelyea knows this code much better than me.
Attachment #745806 - Flags: superreview?(rrelyea)
Attachment #745806 - Flags: review?(rrelyea)
Attachment #745806 - Flags: review?(brian)
This seems to have stalled. Any chance we can get something landable?
Flags: needinfo?(ttaubert)
Richard has looked at constant-time arithmetic in bug 1252035 lately.
Flags: needinfo?(ttaubert) → needinfo?(rlb)
Tim: This bug seems worth fixing, but there's a big mess of patches above.  Can you sort through them and summarize status?
Flags: needinfo?(rlb) → needinfo?(ttaubert)
Ping? AES shows up in profiles when watching 4K video. It would be good to reduce that cost.
Jeff: You may wish to file a separate bug. It's unlikely that the AES showing up in your profile is related at all to NSS, and if it is, it's likely the site doing something incredibly stupid with WebCrypto. More likely, it's an AES implementation in something else (for example, the EME CDM, if you're shipping one). A separate bug would help ensure your feedback gets routed appropriately if it's not NSS (and this bug would not likely help that case anyway).
(In reply to Ryan Sleevi from comment #29)
> Jeff: You may wish to file a separate bug. It's unlikely that the AES
> showing up in your profile is related at all to NSS

Here's the relevant stack:

Running Time	Self		Symbol Name
3317.0ms  100.0%	13.0	 	mozilla::net::nsHttpConnection::OnSocketReadable()
3281.0ms   98.9%	3.0	 	 mozilla::net::nsHttpTransaction::WriteSegments(mozilla::net::nsAHttpSegmentWriter*, unsigned int, unsigned int*)
3278.0ms   98.8%	8.0	 	  nsPipeOutputStream::WriteSegments(nsresult (*)(nsIOutputStream*, void*, char*, unsigned int, unsigned int, unsigned int*), void*, unsigned int, unsigned int*)
3189.0ms   96.1%	2.0	 	   mozilla::net::nsHttpTransaction::WritePipeSegment(nsIOutputStream*, void*, char*, unsigned int, unsigned int, unsigned int*)
3179.0ms   95.8%	4.0	 	    non-virtual thunk to mozilla::net::nsHttpConnection::OnWriteSegment(char*, unsigned int, unsigned int*)
3173.0ms   95.6%	4.0	 	     nsSocketInputStream::Read(char*, unsigned int, unsigned int*)
3148.0ms   94.9%	6.0	 	      PSMRecv(PRFileDesc*, void*, int, int, unsigned int)
3084.0ms   92.9%	2.0	 	       ssl_Recv
3076.0ms   92.7%	5.0	 	        ssl_SecureRecv
3020.0ms   91.0%	1.0	 	         ssl3_GatherAppDataRecord
3019.0ms   91.0%	9.0	 	          ssl3_GatherCompleteHandshake
2716.0ms   81.8%	8.0	 	           ssl3_HandleRecord
2704.0ms   81.5%	3.0	 	            ssl3_AESGCM
2701.0ms   81.4%	6.0	 	             PK11_Decrypt
2568.0ms   77.4%	2.0	 	              NSC_Decrypt
2533.0ms   76.3%	0.0	 	               AES_Decrypt
2533.0ms   76.3%	4.0	 	                GCM_DecryptUpdate
1656.0ms   49.9%	0.0	 	                 gcmHash_Update
1651.0ms   49.7%	144.0	 	                  gcm_HashMult
Redirecting to Franziskus, he's going to look at constant-time implementations in the near future.
Flags: needinfo?(ttaubert) → needinfo?(franziskuskiefer)
Assignee: shay.gueron → franziskuskiefer
Flags: needinfo?(franziskuskiefer)
Attachment #745806 - Flags: review?(rrelyea)
Attachment #745806 - Attachment is obsolete: true
Attachment #8338551 - Attachment is obsolete: true
Attachment #8338836 - Attachment is obsolete: true
Attachment #8339102 - Attachment is obsolete: true
Depends on: 1356191
Target Milestone: 3.15.2 → 3.32
See Also: → 1283585
Depends on: 1361687
https://hg.mozilla.org/projects/nss/rev/cd068f7ce6ae11120f8e4427aa2e8ac35a69e764
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: 3.32 → 3.31
follow-up to make this build in Firefox.
https://hg.mozilla.org/projects/nss/rev/a130081b5f20176b447cdfd359405cb16d1b9e65
Target Milestone: 3.31 → 3.32
Depends on: 1383608
Depends on: 1387779
Depends on: 1389432