Closed Bug 1550579 Opened 6 years ago Closed 5 years ago

On Fenix (Arm32) ec_Curve25519_mul() takes a surprising amount of time

Categories

(NSS :: Libraries, defect, P1)

Product:

Component:

Platform:

ARM

Android

Type:

defect

Priority:

P1

Severity:

normal

Tracking

(Performance Impact:high, firefox-esr60 wontfix, firefox-esr68 wontfix, firefox66 wontfix, firefox67 wontfix, firefox67.0.1 wontfix, firefox68 wontfix, firefox69 fixed)

Status:

RESOLVED FIXED

Milestone:

3.45

Performance Impact

high

Tracking Flags:

Tracking

Status

firefox-esr60

---

wontfix

firefox-esr68

---

wontfix

firefox66

---

wontfix

firefox67

---

wontfix

firefox67.0.1

---

wontfix

firefox68

---

wontfix

firefox69

---

fixed

People

(Reporter: jesup, Assigned: kjacobs)

References

(Blocks 1 open bug)

Details

(Keywords: perf, perf:pageload, Whiteboard: [geckoview:fenix:p2] [bcs:p1])

Attachments

(1 file)

Bug 1550579 - Replace arm32 curve25519 ref implementation with fiat-crypto 5 years ago Kevin Jacobs [:kjacobs] 47 bytes, text/x-phabricator-request		Details \| Review

Randell Jesup [:jesup] (needinfo me)

Reporter

Description

•

6 years ago

When profiling Fenix pageloads on a Motorola G5 (Cortex-A53 octacore) generating keys on the master process Socket Thread seems to take an inordinately long time, most of it in ec_Curve25519_mul().

Note that this is 32-bit ARM code; even though the processor can run AArch64.

See https://perfht.ml/2Vwm1MJ and https://perfht.ml/2VnphJY (Gecko Profiler capture of nytimes.com, 4ms sampling) and https://perfht.ml/2VsbsKc (captured with simpleperf -- this implies that mul() and square() are calling into the kernel, which is hard to verify with a Gecko Profiler profile, due to how profiles are captured). In any case, though, it's spending a long time generating keys when loading pages (cold).

Randell Jesup [:jesup] (needinfo me)

Reporter

Updated

•

6 years ago

Whiteboard: [qf][gv] → [qf][geckoview]

Martin Thomson [:mt:]

Comment 1

•

6 years ago

Before I looked, my thought was that we were hitting one of these two problems:

Lots of connections. We can't fix this, but it's worth looking into how many we have (and why). We might be making extra connections, and we could maybe dial back our aggression a little for mobile. It does appear that the site is using HTTP/2, so we're not into the HTTP/1.1 spread - wins might not be easy in this area. We still seem to have 120 connections though...
Too many key generation calls. I thought at first this might have been https://searchfox.org/mozilla-central/rev/b9da45f63cb567244933c77b2c7e827a057d3f9b/security/manager/ssl/nsNSSIOLayer.cpp#2433 which causes us to generate two key shares for each connection, but I only see 4ms/128ms heading that way in the profile I checked.

But Randell is probably right to attribute this to something in the multiply function. We are probably hitting the slow 32-bit code. 25519 without a 64-bit multiply is pretty bad. There might be a build configuration we can hit that will result in deploying faster 64-bit code on this platform, or maybe we should find a way to ship both sets of code and switch hit at runtime.

Randell Jesup [:jesup] (needinfo me)

Reporter

Updated

•

6 years ago

Summary: On Fenix ec_Curve25519_mul() takes a surprising amount of time → On Fenix (Arm32) ec_Curve25519_mul() takes a surprising amount of time

Randell Jesup [:jesup] (needinfo me)

Reporter

Comment 2

•

6 years ago

Regarding (1), in https://perfht.ml/2VrxRrn we're seeing lots of time there, and only about max 18 connections in use (and less being started...) with 9 different domains, per the Network tab.

32-bit code for this is here: https://searchfox.org/mozilla-central/source/security/nss/lib/freebl/ecl/curve25519_32.c#109
32x32=64 multiply should be easy; 3 multiplies and 2 adds (one of which can be combined with a mul): MUL, MLA, UMULL, ADD. 64x64=128 which probably is needed here is tougher. The code above though doesn't appear to even try to use 32x32=64 though; it sticks to pure int32's.

Per http://conradoplg.cryptoland.net/files/2010/12/gcm14.pdf it's possible to build 64x64=128 multiplies with VMULL in Arm7 (8 VMULL's) and arm8 (1 VMULL) in NEON. See also https://cryptojedi.org/peter/data/ches-20120911.pdf (paper: https://cryptojedi.org/papers/neoncrypto-20120320.pdf)

Randell Jesup [:jesup] (needinfo me)

Reporter

Comment 3

•

6 years ago

Would there be any advantage on slow machines (like arm32, perhaps low-end desktops?) in pre-generating a small pool of keys off idle callbacks? Are there any security gotcha's here? (stab in the dark; probably a fair bit of work - but less than trying to recode and verify everything in NEON).

It might even help higher-end/desktop machines (a bit).

Flags: needinfo?(mt)

Martin Thomson [:mt:]

Comment 4

•

6 years ago

It seems like there are some low-hanging fruit when it comes to optimization of this code. I would prefer to chase them down first. Caching keygen isn't going to save us from the use of keys, which is just as expensive. And working out when to run those routines could be tricky.

We don't build on systems that don't have uint64_t any more, so we could look at the 32x32=64 optimizations. I agree that chasing down uint128_t is probably going to be more expensive. It all comes down to how much effort you are willing to expend.

Flags: needinfo?(mt)

Randell Jesup [:jesup] (needinfo me)

Reporter

Comment 5

•

6 years ago

From looking at a number of these, it seems like each handshake generation takes on the order of 35ms, and loading a page may require doing this many times.

Over the ~13seconds for the entire load, the Socket Thread used almost 900ms in Curve25519_pt_mul

jcj: what are our options here? It looks like this has a serious impact on pageload on many devices

Flags: needinfo?(jjones)

Whiteboard: [qf][geckoview] → [qf:p1:pageload][geckoview]

Chris Peterson [:cpeterson]

Updated

•

6 years ago

status-firefox66: --- → wontfix

status-firefox67: --- → wontfix

status-firefox67.0.1: --- → wontfix

status-firefox68: --- → affected

OS: Unspecified → Android

Hardware: Unspecified → ARM

Whiteboard: [qf:p1:pageload][geckoview] → [qf:p1:pageload][geckoview:fenix:p2]

J.C. Jones [:jcj] (he/they)

Comment 6

•

6 years ago

This seems wholly reasonable to me, too. I've double-checked and there's no HACL* work for a 32bit implementation, sadly, but the optimizations here don't look too hard.

I'm passing the ni? to Thyla for prioritization.

Flags: needinfo?(jjones) → needinfo?(tvandermerwe)

Thyla van der Merwe

Updated

•

6 years ago

Assignee: nobody → kjacobs.bugzilla

Flags: needinfo?(tvandermerwe)

Thyla van der Merwe

Updated

•

6 years ago

Status: NEW → ASSIGNED

BugBot [:suhaib / :marco/ :calixte]

Comment 7

•

6 years ago

The priority flag is not set for this bug.
:jcj, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(jjones)

Thyla van der Merwe

Updated

•

5 years ago

Priority: -- → P2

Martin Thomson [:mt:]

Comment 8

•

5 years ago

I don't know how much help this is likely to be, but this implementation might be worth looking into: https://github.com/briansmith/ring/blob/master/crypto/curve25519/asm/x25519-asm-arm.S

J.C. Jones [:jcj] (he/they)

Updated

•

5 years ago

Flags: needinfo?(jjones)

J.C. Jones [:jcj] (he/they)

Updated

•

5 years ago

See Also: → 1555407

Thyla van der Merwe

Updated

•

5 years ago

Priority: P2 → P1

Kevin Jacobs [:kjacobs]

Assignee

Comment 9

•

5 years ago

The multiplication change is not as simple as I was hoping... Unrolling the multiplication gives a fairly minor (~20%) speedup on my host machine. I've also tested two other implementations: Ref10 and Curve25519-donna. Both are drop-in replacements with significant improvements - the former being an 10-15x faster in my tests.

This was discussed in some detail back when the current implementation went in: https://bugzilla.mozilla.org/show_bug.cgi?id=957105#c40. Given these factors, and the fact that we have no HACL* implementation available, are there any concerns with replacing our 32-bit implementation with Ref10?

Flags: needinfo?(mt)

Martin Thomson [:mt:]

Comment 10

•

5 years ago

If we can get Ref10 on reasonable licensing terms (please confirm with mhoye), then I don't see a real problem with taking it for 32-bit machines.

I think that we need to be a little less picky about things when there is a 10x speedup on offer, but we do need to ensure that the licensing is kosher.

Flags: needinfo?(mt)

Kevin Jacobs [:kjacobs]

Assignee

Comment 11

•

5 years ago

Attached file Bug 1550579 - Replace arm32 curve25519 ref implementation with fiat-crypto — Details

This patch changes the 32-bit Curve25519 implementation in NSS from ref to ref10. There is no HACL* variant available for 32-bit platforms. A helpful benchmark comparison is here: https://bench.cr.yp.to/impl-scalarmult/curve25519.html

**Notes about this patch: **

lib/freebl/freebl_base.gypi is updated to replace the 64-bit HACL* implementation with this version. This is ONLY for testing on a wider number of platforms and sanitizers. It will be removed before this is marked ready to land.
mhoye has given initial clearance on the licensing (being derived from public domain code), but must approve here before it can land.
This needs to be tested more fully to ensure we don't have subtle incompatibilities. Conversations are ongoing in //#nss-arm32-perf// on how best to accomplish this (since our CI doesn't target arm32 currently).
Performance on a MacOS 64b optimized build improves our x25519 gtest performance (~90 shared secret computations) from ~90ms to ~8ms. Performance improvements will be larger for workloads that use the implementation more heavily or have less set up/tear-down included in the measurement.
LibSodium also implements ref10. Their adaptation checks the public key against a blacklist at time of use ([[ https://github.com/jedisct1/libsodium/blob/master/src/libsodium/crypto_scalarmult/curve25519/ref10/x25519_ref10.c | here ]]). This blacklist is a subset of what we already check against in [[ https://searchfox.org/mozilla-central/source/security/nss/lib/freebl/ecl/ecp_25519.c#27 | ec_Curve25519_pt_validate ]].

Randell Jesup [:jesup] (needinfo me)

Reporter

Comment 12

•

5 years ago

See this startup-and-load-CNN profile as well: https://perfht.ml/2QNWAAv (same issue, just a cleaner profile (the simpleperf import function has a bugfix - it was mis-drawing samples int he graph in some cases)

Randell Jesup [:jesup] (needinfo me)

Reporter

Comment 13

•

5 years ago

a preliminary look with the patch applied appears to show significantly less overhead for 25519: https://perfht.ml/2QOdi2t

Andrew Creskey [:acreskey]

Comment 14

•

5 years ago

I put up a perfherder comparison:
before (left) and after applying this patch (right)
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=a15a5d367a7b7a197d90f13ff2670610a09a5003&newProject=try&newRevision=3957a47f22a84100cdd05cbbd492fc3dc2c286f0&framework=10

Phabricator Automation

Updated

•

5 years ago

Attachment #9069810 - Attachment description: Bug 1550579 - Replace arm32 curve25519 ref implementation with ref10 → Bug 1550579 - Replace arm32 curve25519 ref implementation with fiat-crypto

J.C. Jones [:jcj] (he/they)

Updated

•

5 years ago

Blocks: 1561967

Chris Peterson [:cpeterson]

Comment 15

•

5 years ago

esr68=affected because we might want to uplift this optimization to Fennec ESR 68.1.

status-firefox68: affected → wontfix

status-firefox69: --- → affected

status-firefox-esr60: --- → wontfix

status-firefox-esr68: --- → affected

Whiteboard: [qf:p1:pageload][geckoview:fenix:p2] → [qf:p1:pageload][geckoview:fenix:p2] [bcs:p1]

J.C. Jones [:jcj] (he/they)

Comment 16

•

5 years ago

https://hg.mozilla.org/projects/nss/rev/7b749ff400db3ecab87f394c091ea0831254e674

Status: ASSIGNED → RESOLVED

Closed: 5 years ago

Resolution: --- → FIXED

Target Milestone: --- → 3.45

J.C. Jones [:jcj] (he/they)

Updated

•

5 years ago

Blocks: 1555407

Kevin Jacobs [:kjacobs]

Assignee

Updated

•

5 years ago

Blocks: 1563279

Kevin Jacobs [:kjacobs]

Assignee

Updated

•

5 years ago

Blocks: 1563280

Chris Peterson [:cpeterson]

Comment 17

•

5 years ago

status-firefox69=fixed by NSS 3.45 in bug 1550889.
status-firefox-esr68=wontfix because we don't plan to uplift these NSS changes to Fennec ESR 68.

Blocks: 1550889

status-firefox69: affected → fixed

status-firefox-esr68: affected → wontfix

Dave Hunt [:davehunt] [he/him] ⌚BST

Updated

•

3 years ago

Performance Impact: --- → P1

Keywords: perf:pageload

Whiteboard: [qf:p1:pageload][geckoview:fenix:p2] [bcs:p1] → [geckoview:fenix:p2] [bcs:p1]

You need to log in before you can comment on or make changes to this bug.