Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add an optimized curve25519_base_donna #23

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

nmathewson
Copy link

In many protocols, it doesn't hurt security for a timing attacker to
know whether we're computing SK_9 ==> PUBKEY or whether we're
computing SK_PUBKEY ==> SHARED_SECRET. For example, when you're doing
Diffie-Hellman, everybody will do one scalarmult of the basepoint (9),
and one scalarmult of the other party's public key.

For this kind of application, we can improve handshake speed by having
an optimized curve25519_base_donna function (name taken from nacl's
"crypto_scalarmult_base") that computes SK*BP.

Here I've only done the most obvious version of the optimization by
making an alternate fmonty_bp that does a scalar multiplication by 9
in place of fmul(.,.,qmqp). On my laptop, the 64-bit
curve25519_base_donna is 11.5% faster than the regular 64-bit
curve25519_donna, and the 32-bit curve25519_base_donna is faster than
its counterpart by 4.7%.

In many protocols, it doesn't hurt security for a timing attacker to
know whether we're computing SK*9 ==> PUBKEY or whether we're
computing SK*PUBKEY ==> SHARED_SECRET.  For example, when you're doing
Diffie-Hellman, everybody will do one scalarmult of the basepoint (9),
and one scalarmult of the other party's public key.

For this kind of application, we can improve handshake speed by having
an optimized curve25519_base_donna function (name taken from nacl's
"crypto_scalarmult_base") that computes SK*BP.

Here I've only done the most obvious version of the optimization by
making an alternate fmonty_bp that does a scalar multiplication by 9
in place of fmul(.,.,qmqp).  On my laptop, the 64-bit
curve25519_base_donna is 11.5% faster than the regular 64-bit
curve25519_donna, and the 32-bit curve25519_base_donna is faster than
its counterpart by 4.7%.
@agl
Copy link
Owner

agl commented Dec 30, 2012

I'm on vacation at the moment and I'm afraid it'll be a file of weeks till
I'm properly back to normal and reviewing.

However, fixed base optimisations are typically much faster than this. If
the time taken to do a key gen is a problem then I can probably do better,
although with more code.

Are you thinking of using curve 25519 for ECDHE where key gen is common?
On Dec 23, 2012 9:38 AM, "Nick Mathewson" [email protected] wrote:

In many protocols, it doesn't hurt security for a timing attacker to
know whether we're computing SK_9 ==> PUBKEY or whether we're
computing SK_PUBKEY ==> SHARED_SECRET. For example, when you're doing
Diffie-Hellman, everybody will do one scalarmult of the basepoint (9),
and one scalarmult of the other party's public key.

For this kind of application, we can improve handshake speed by having
an optimized curve25519_base_donna function (name taken from nacl's
"crypto_scalarmult_base") that computes SK*BP.

Here I've only done the most obvious version of the optimization by
making an alternate fmonty_bp that does a scalar multiplication by 9
in place of fmul(.,.,qmqp). On my laptop, the 64-bit
curve25519_base_donna is 11.5% faster than the regular 64-bit
curve25519_donna, and the 32-bit curve25519_base_donna is faster than

its counterpart by 4.7%.

You can merge this Pull Request by running:

git pull https://github.com/nmathewson/curve25519-donna bp_opt

Or view, comment on, or merge it at:

#23
Commit Summary

  • Add an optimized curve25519_base_donna

File Changes

  • M .gitignore (2)
  • M Makefile (14)
  • M curve25519-donna-c64.c (114)
  • M curve25519-donna.c (130)
  • M speed-curve25519.c (9)
  • A test-curve25519-bp.c (32)
  • M test-curve25519.c (2)

Patch Links

@nmathewson
Copy link
Author

I've got a Tor branch where I'm using it for Ian Goldberg et al's "ntor" protocol, as described in their paper and specified in Tor proposal 216.

The server side of the protocol is performance-critical for me; it involves 3 curve25519 operations, one of which is an operation on the base point.

If you think it's something I could figure out, and you point me at something readable about how to do fixed base optimizations right, I might be able to take a crack at it.

The other conceivable optimization to do for that protocol would be to implement something like Shamir's simultaneous multiple exponentiation -- but that one isn't at all an obvious fit for how curve25519 wants to be implemented.

Alternatively, there's a protocol for a faster protocol out there, but it's less reviewed than ntor, and it would require point addition as well as simultaneous multiple curve25519.

@agl
Copy link
Owner

agl commented Dec 30, 2012

On Dec 30, 2012 2:54 AM, "Nick Mathewson" [email protected] wrote:

If you think it's something I could figure out, and you point me at something readable about how to do fixed base optimizations right, I might be able to take a crack at it.

Firstly, the biggest improvement is probably to grab djb's code from
SUPERCOP and use that when building for amd64 / x86 / arm. (Since I
wrote donna he's added amd64 code and it's faster than donna, the ARM
code is certainly much better.)

But, as for adding fixed-base optimisations to donna, I suggest that
we find Dan in a week and pin him down. (I think we're all at End of
RSA / Real World Crypto.)

I can't find plain addition formula for Montgomery curves and I don't
think there are any, so there are two obvious paths: work in Short
Weierstrass form with Jacobian coordinates or work in twisted Edwards
form with extended coordinates. I think the latter will be cleaner and
I think both can be done without extending the field. Indeed, the
conversion to SW seems to be just to addition of A/3 to the
x-coordinate, and ed25519 uses a twisted Edward's version of
curve25519 so the isomorphism must exist.

From there, one can precompute a table of values tapping two or four
bit positions and so do a fixed based scalar mult with 64 adds and 64
doubles (with four taps). With a second table that's out of phase, it
can be 64 adds and 32 doubles as the cost of more memory bandwidth.

Reading the table in constant time is a bit of a pain. Typically I've
always read the whole table and used bit masks to select the desired
value. However, if you're willing to have some platform specific code
in there then knowledge of the cache line length can speed that up.

Cheers

AGL

@nmathewson
Copy link
Author

I'll see about snarfing the SUPERCOP implementations; I was kind of hoping that the next version of nacl would come out sometime soon, including those implementations, so I wouldn't have to worry about it and could just tell people "link against nacl." The schedule for that is a fine thing for us to ask Dan about when we're co-located.

WRT better implementations for implementing the ntor/ace handshakes: this is definitely a place to step back and talk about what the correct long-term solution really is. But even with the current naive implementation, ntor+curve25519-donna is much faster and much more secure than what Tor's been doing up till now, so I'm fine merging it and then optimizing it or moving to ace later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants