Closed Bug 1780432 (CVE-2023-5388) Opened 2 years ago Closed 7 months ago

Timing attack against RSA decryption (in TLS)

Categories

(NSS :: Libraries, defect, P2)

3.80

Tracking

(firefox-esr115124+ fixed, firefox122 wontfix, firefox123 wontfix, firefox124+ fixed)

RESOLVED FIXED
Tracking Status
firefox-esr115 124+ fixed
firefox122 --- wontfix
firefox123 --- wontfix
firefox124 + fixed

People

(Reporter: hkario, Assigned: anna.weine)

References

Details

(Keywords: csectype-side-channel, sec-moderate, Whiteboard: [post-critsmash-triage][adv-main124+][adv-esr115.9+])

Attachments

(8 files)

I've been working on testing against timing attacks using tlsfuzzer, the newest version of that code is in the upstream pregenerate branch[1].

What's special about it, is that it is much more sensitive to timing side-channels than both my previous attempts at it, from bug 1651411, and what the published research suggests. In particular, I'm able to detect the OpenSSL side-channel caused by the https://github.com/openssl/openssl/issues/6640 bug in just 10 thousand connections with 2048 bit RSA keys, over local gigabit Ethernet: the measured side channel is about 55ns while the test provides a result with a 95% confidence interval of ±36ns.

I haven't executed extensive tests against NSS, so I don't have good results with NSS just yet, but I've talked with Bob about the NSS code and how it handles multi-precision integers. The problem is similar as the one in OpenSSL.

The MPI objects internally represent large integers as a list of word-sized (64bit or 32bit) integers. The problem is, that many operations on objects perform "clamping" (if the most significant word is zero, they drop it and store the number in fewer words), in particular, the modulo multiplication performing the unblinding after the modular exponentiation in RSA private key decryption.

The problem is that when such number then needs to be converted to a byte string, so that it can be fed into a hash function, or so that the padding can be checked (be it PKCS#1v1.5 or OAEP), that operation can't take the same amount of time, since it operates on different number of words used to represent the number. In other words, that conversion then leaks if the high order bytes are zero or not: precisely the signal necessary for Bleichenbacher oracle.

I'll add detailed test results later.

Please keep this issue embargoed as other implementations are vulnerable, so we'd prefer to release information about it in a coordinated fashion.

1 - https://github.com/tlsfuzzer/tlsfuzzer/tree/pregenerate

Hi Hubert,
Thanks for the report.

MPI not being constant time is a known problem on our side and we know our RSA code is for sure vulnerable to side channels... : (
We will be looking at replacing the RSA code with a formally verified implementation in the next few weeks and we will report back when things are moving.

Here is a related bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1761036

See Also: → CVE-2023-4421

Right, but my point is that we can fix the timing side-channel in RSA by just fixing one multiply and mod operation, which I'd expect is much easier to do that replacing the whole implementation for RSA.

It is also much easier attack to perform than the micro-architectural side-channels or remote sound recording side-channels (where a side-channel free mod-exp or exponent blinding is necessary for protection)...

But sure, if you'll be able to replace it all in the next few weeks, it may not be worth it.

I'll still provide the test results, if nothing else, they'll be useful as the baseline to compare to with the new implementation. Also, if you'd like some timing tests done on the code before it is committed, just send me a line, I'll definitely find time to run them (this whole exercise started with us in Red Hat wanting to have an automated regression test case for timing attacks that we can run as part of CI, so it's not something that will be a one off anyway).

Assignee: nobody → nkulatova

Plot of bootstrapped confidence intervals for median pairwise differences between the first (0th) probe and all the others. Results from 12.2M observations per sample (probe). 95% confidence interval is ±22.5ns.

Probe descriptions:

ID,Name
0,random plaintext - 1
1,random plaintext - 1 zero MSB
2,random plaintext - 2
3,random plaintext - 2 zero MSB
4,random plaintext - 4 zero MSB
5,random plaintext - 8 zero MSB
6,random plaintext - 16 zero MSB
7,random plaintext - 40 zero MSB

Random plaintext in this case refers to the data before RSA public key operation.
The "random plaintext - 1" and "random plaintext - 2" are just a copy of the same kind of probe, to verify that the execution environment isn't biased (for a "canary" test, expected to fail; have high p-value).
The "1 zero MSB" means that one of the Most Significant Bytes is equal to 0x00, "2 zero MSB" for two, etc.

Attached file report.csv

Comparison statistics between all the different samples. Samples are 12.2M observations large, executed against current NSS tip (as of last week), with 2048 bit RSA key on an 64 bit platform.

Most important is the "Sign test" column.

The results show a clear timing signal when the first word (8 bytes) or two of the first words (16 bytes) are zero. When compared to plaintext with all bytes non-zero, the sign test reports p-values of 4.86e-7 and 1.3e-8 respectively.

At the same time, the results from comparing the fully random plaintexts with plaintexts with 1, 2, or 4 MSBs set to zero don't show statistically significant difference between samples (p-values larger than 0.05).

This is consistent with the theory that either the calculation of the last MPI (unblinding) or conversion of it to a byte string is not constant time with regards to the number of zero bytes, but with the granularity of a single "limb" used internally in MPI.

What is a bit surprising, is that when 40 bytes are zero (i.e. five limbs are zero), there is almost no statistically significant difference (p-values of 0.11, 0.22, 0.02) between either the baseline probes or probes with just few bytes set to zero. I've got no good explanation for that. I'll keep running the tests, hopefully it turns out to be just a case of a weaker timing signal, not a lack of it.

As expected, when I collected more data, the 40 zero MSB probe gave a statistically significant result.
While in the graph it looks like the "2 zero MSB" may be significant too, the actual p-values are not significant after applying Bonferroni correction.

This is for 33.5M observations per probe, 95% CI of ±11.5ns.

Attached file report.csv

Detailed results from the 33.5M run.

The p-value for sign test of "random plaintext - 1" (probe 0) and "random plaintext - 40 zero MSB" (probe 7) is 3.11e-7.
Which would suggest that there are two algorithms, one that runs faster while other that runs slower when most significant limbs (words) are equal to zero, but they have different scaling factor.

The severity field is not set for this bug.
:beurdouche, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(bbeurdouche)

I will try to estimate the severity of the bug :)

Setting ni? as a reminder :)

Flags: needinfo?(nkulatova)
Severity: -- → S2
Flags: needinfo?(nkulatova)
Priority: -- → P2
Attached file ctmpi.tar.gz

I've implemented a simple mul() and mod() operations in (largely) portable pure C, and verified that they are indeed constant time on x86_64, aarch64, ppc64le, and s390x.

Included is both the code and the test harnesses for timing tests.

@nkulatova if you think they would be useful to integrate ahead of replacing the whole RSA code, feel free to use them.

(In reply to Hubert Kario from comment #10)

Created attachment 9297296 [details]
ctmpi.tar.gz

I've implemented a simple mul() and mod() operations in (largely) portable pure C, and verified that they are indeed constant time on x86_64, aarch64, ppc64le, and s390x.

Included is both the code and the test harnesses for timing tests.

@nkulatova if you think they would be useful to integrate ahead of replacing the whole RSA code, feel free to use them.

Did you have a chance to test the speed?

The simplest way to measure the speed would be to use the code I wrote here: https://phabricator.services.mozilla.com/D153347

Those are basically the slowest possible ways to implement those operations (long multiplication on words and long division on bits), so they are slow.

On x86_64 a multiplication of two 2048bit numbers takes about 4.2µs on a 3.9GHz CPU (i7-8650U), so not bad.
But the mod() operation of a 4096 bit operand (result of 2048 bit number multiplication) and 2048 bit modulus is very slow, taking 580µs on a 5.2GHz CPU (i9-12900KS).

Side note: using rdtsc alone is not sufficient on Intel CPUs for microbenchmarking, as that instruction is not a speculative execution or memory access barrier, see https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf or get_time_before() and get_time_after() in harness.c in ctmpi.tar.gz.

Attached file ctmpi-v2.tar.gz

I've got sort-of good news. I went ahead and implemented the multi-word version of Montgomery reduction. It's about as fast as the multiplication, taking around 2.9µs on a 3.9GHz CPU (i7-8650U), so quite usable.

The downside is that it assumes that it has access to the multiplicative inverse of the modulus mod word size (i.e. 2^64 or 2^32). And it obviously requires the mod_exp() output or the unblinder to be in the Montgomery form (they both can be in Montgomery form, then Montgomery reduction needs to be applied twice to the output of the mul(), but that's supported too, the function I've implemented can handle any input up to the double the number of words of the modulus).

Could you compare the performance with the current one?

Also, just fyi, the HACL* team implemented the MM/MR for RSA, if you want to take a look for inspiration, here it is: https://github.com/hacl-star/hacl-star/blob/main/dist/gcc-compatible/ (Bignum or RSA).

Finally, do you think if we implement an euclidian division to compute the multiplicative inverse, will it is quicker?

Could you compare the performance with the current one?

I don't have precise measurements, but my experience is that NSS performs the RSA key exchange in about 1 milisecond, so even if we assume that the current unblinding and final Montgomery reduction takes no time, this new code would slow down NSS by less than 1%.

Also, just fyi, the HACL* team implemented the MM/MR for RSA, if you want to take a look for inspiration,

well, I've done it mostly as a proof that it's possible to have side-channel secure pure C code (or that it's possible to have side-channel free multiplication of integers in the first place), I'm not interested in implementing advanced numerical methods... For one, while a neat exercise, I still think those algorithms should be implemented in assembly (e.g.: I wouldn't be surprised if add() couldn't be made like 3 times faster in assembly than what gcc generates from my code).

Finally, do you think if we implement an euclidian division to compute the multiplicative inverse, will it is quicker?

To calculate what? The inverse of modulus mod word, or to calculate the Montgomery reduction?

For inverse, yes, AFAIK, that's the fastest algorithm, and fine to use since we don't have to worry about side-channels: the operands are public.

For the Montgomery reduction: I don't think so, the multi-word reduction operates on much smaller numbers and less data at every step. If you have the full-sized inverse of the modulus you still need to multiply it, and as you can see, my mul() is slower than mod_montgomery() (4.2µs vs 2.9µs). Now, I'm using long multiplication instead of Karatsuba or better, but then for regular Montgomery reduction you have to do two multiplications and with much bigger numbers, not one, so I don't think you'd be able to beat multi-word Montgomery reduction, as it multiplies a modulus-sized multi-precision integer by a single word at a time: there isn't a better algorithm for that than the one I already implemented, multiplying every word in turn, neither Karatsuba nor FFTs can speed it up.

But like I said, I'm not particularly interested in numerical methods, so don't consider it to be an expert opinion, I may very well be wrong. The behaviour may also be more complex, where one algorithm is faster for smaller key sizes and another is faster for larger key sizes.

To calculate what? The inverse of modulus mod word, or to calculate the Montgomery reduction?
The inverse it was.

(In reply to Hubert Kario from comment #10)

Created attachment 9297296 [details]
ctmpi.tar.gz

I've implemented a simple mul() and mod() operations in (largely) portable pure C, and verified that they are indeed constant time on x86_64, aarch64, ppc64le, and s390x.

Included is both the code and the test harnesses for timing tests.

@nkulatova if you think they would be useful to integrate ahead of replacing the whole RSA code, feel free to use them.

Would you prefer to submit the patch by yourself, or would you prefer if I do it?

I'd prefer if You'd do it.

Ok! Do you mind then to review it once it's ready?

Sure, I can both review and test it for timing side-channels.

Lovely, thanks.

@nulatova: Hi,
How is the progress on getting in the constant time HACL implementation of RSA into NSS?
I'm asking because the other libraries with similar issues are planning a release around the end of January, so that may generate increased interest in this kind of vulnerabilities.

Flags: needinfo?(nkulatova)

Hi,

We are very close to adding the RSA-PSS HACL* support to NSS.
Do you consider that the rest of the RSA primitives require the immediate attention?

Flags: needinfo?(nkulatova)

Do you consider that the rest of the RSA primitives require the immediate attention?

Yes,

  1. The unblinding leaking information affects all RSA private key operations (though exploitation of it may be very non-tryvial
  2. The PKCS#1 v1.5 decryption is exploitable over the network
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---

If I understand correctly, RSA#1 v1.5 is not used in TLS1.3. However, we plan to get rid of the non-constant time code as soon as we can.

We still have to support TLS 1.2 servers. I don't know if there's a deprecation timeline and kill-date for TLS 1.2, but judging how TLS 1.0/1.1 went that could be a while still.

Yes, this bug affects only TLS 1.2 and earlier. TLS 1.3 is immune.

Ah, but it also affects any applications that would use NSS API directly for RSA decryption (be it PKCS#1 v1.5 or OEAP).

Yes, this we take into account

Side note: the same issue in OpenSSL, CVE-2022-4304, was made public today.

Flags: needinfo?(bbeurdouche) → needinfo?(nkulatova)

The paper was accepted. I'll be making the issue public in last week of September (between 25-29th).

As I've said, the issue was made public, the details of it are on the https://people.redhat.com/~hkario/marvin/

Can we get CVE assigned specifically for the numerical library side-channel leakage? It affects both RSA key exchange and the RSA-OAEP in WebCrypto, so technically it's a bug in Firefox too.

Anna, could you assign a CVE to it? It would make fix coordination much easier.

What is the process for getting a CVE for this, and is there anything I can help to move this forward?

Dan and Tom can assign CVEs for this.

Flags: needinfo?(tom)
Flags: needinfo?(dveditz)
Alias: CVE-2023-5388
Flags: needinfo?(tom)
Flags: needinfo?(dveditz)
  1. Add Constant time mult mod functions.
    a. constant time mul
    b. use constant time montgomery reduce.

  2. Use montgomery values for blinding.

Anna, I've pushed our downstream patch for this to phabricator. We've tested this patch against the marvin toolkit and our test suite on x86, ppc, and s390 and we've removed the timing vulnerability. I'll let you decide how to proceed with the patch.
Thanks,
bob

Ping, Anna how do you want to proceed with the patch for this issue (see comment 37 and comment 38). Thanks.

Oh great! We will take a look at the patch! (Could you the next time put nss-reviewers as the reviewer option? As I have just found your question, sorry for the delay!)

Flags: needinfo?(nkulatova)
Status: REOPENED → RESOLVED
Closed: 2 years ago7 months ago
Resolution: --- → FIXED
Target Milestone: --- → 3.98
Group: crypto-core-security → core-security-release

I'd like to add the information to the Marvin page that there are Firefox versions with the fixes, but I see that 124 is still at alpha/nightly stage, similarly NSS 3.98 isn't released. Should I put that information there ahead of time, or should I wait till final versions are released?

Blocks: 1880562

NSS 3.98 and NSS 3.90.2 (ESR) have just been released with this patch. Those versions of NSS will be in Firefox 124 and Firefox 115.9, respectively, which will both be released on March 19.

Flags: qe-verify-
Whiteboard: [post-critsmash-triage]

marking esr115 as fixed for 115.9esr by 1880562 which was just uplifted

Whiteboard: [post-critsmash-triage] → [post-critsmash-triage][adv-main124+][adv-esr124+]
Whiteboard: [post-critsmash-triage][adv-main124+][adv-esr124+] → [post-critsmash-triage][adv-main124+][adv-esr115.9+]
Attached file advisory.txt
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: