ACME (RFC 8555) § 8.2 Challenge Retries #242

dcow · 2020-04-30T11:50:30Z

Taking over #181. Original issue #162.

These are visual studio code's workspace configuration files.

The mock acme authority needs to in order to conform to the updated acme authority interface.

Makes the line more readable.

acme/authority.go

acme/challenge.go

api/utils.go

acme/challenge.go

dcow · 2020-05-01T02:28:58Z

https://tools.ietf.org/html/rfc8555#section-7.1.6 illustrates the challenge lifecycle. Retries are mentioned:

Note that within the "processing" state, the server may attempt to validate the challenge multiple times (see Section 8.2). Likewise, client requests for retries do not cause a state change. If validation is successful, the challenge moves to the "valid" state; if there is an error, the challenge moves to the "invalid" state.

I read this as "if the server successfully acquires a challenge result (e.g. http resource, dns record, tls cert) but verification fails (the data is not correct), then move to the "invalid" state". This type of "error" is different from a transient network error which should be retried.

So I'm going to sift through all the error logic and bucket things into one of those two definitions. And we'll only retry for transient errors.

Also, once the state is "invalid", I don't think a challenge can be revived. This means the proposed implementation, where a challenge lives in the invalid state between retry attempts, is not correct. The challenge should remain processing until it's either valid or invalid.

edit: s/pending/processing ^

cpu · 2020-05-01T12:51:40Z

Also, once the state is "invalid", I don't think a challenge can be revived. This means the proposed implementation, where a challenge lives in the invalid state between retry attempts, is not correct. The challenge should remain pending until it's either valid or invalid.

@dcow I think you would want it to remain in a "processing" state instead of a "pending" state for failed attempts that support further retries. I think that's the meat of the errata that Rob from Sectico filed against RFC 8555 related to retries, eid5732: https://www.rfc-editor.org/errata/eid5732

(Edit: this part of the spec is a little bit rough. Despite my best efforts it landed in 8555 without anyone ever having implemented it at the time... 😓)

dcow · 2020-05-01T19:01:30Z

(Edit: this part of the spec is a little bit rough. Despite my best efforts it landed in 8555 without anyone ever having implemented it at the time... 😓)

@cpu thanks for the color. The errata is super useful—figuring out when to consider a challenge invalid is exactly what I've been mulling over. Do you have any insight on:

While the server is still trying, the status of the challenge remains "processing"; it is only marked "invalid" once the server has given up.

from section 8.2? Is the intention that orders are only allowed to be in the processing state for a short amount of time after the initial client request and the server needs to put the kibosh on future client retries by marking the challenge invalid? I'm inclined to let a challenge remain in a processing state indefinitely until an authorization expires or otherwise becomes invalid. I guess one could interpret that behavior as a server that never "gives up" even if it won't automatically retry again unless the client pokes it.

cpu · 2020-05-02T14:10:19Z

from section 8.2? Is the intention that orders are only allowed to be in the processing state for a short amount of time after the initial client request and the server needs to put the kibosh on future client retries by marking the challenge invalid?

@dcow I think an order should only enter into the processing state when the client POSTs the finalization URL to begin issuance. At that point all of the challenges need to be valid for the finalization to start. In practice I suspect most clients expect orders to not remain in the processing state for very long simply because Let's Encrypt's issuance is more or less synchronous with the finalization request (but that's a quirk of one implementation and not a protocol guarantee).

I'm inclined to let a challenge remain in a processing state indefinitely until an authorization expires or otherwise becomes invalid. I guess one could interpret that behavior as a server that never "gives up" even if it won't automatically retry again unless the client pokes it.

I think that makes sense for the challenges 👍 I can't think of a reason that would be a problem.

Hope that helps!

Section 8.2 of RFC 8555 explains how retries apply to the validation process. However, much is left up to the implementer. Add retries every 12 seconds for 2 minutes after a client requests a validation. The challenge status remains "processing" indefinitely until a distinct conclusion is reached. This allows a client to continually re-request a validation by sending a post-get to the challenge resource until the process fails or succeeds. Challenges in the processing state include information about why a validation did not complete in the error field. The server also includes a Retry-After header to help clients and servers coordinate. Retries are inherently stateful because they're part of the public API. When running step-ca in a highly available setup with replicas, care must be taken to maintain a persistent identifier for each instance "slot". In kubernetes, this implies a *stateful set*.

dcow · 2020-05-06T15:10:43Z

Remaining TODO:

Reschedule retries when the ca starts up (from a crash or otherwise).
Test & Tests

acme/api/handler_test.go

dcow · 2020-05-07T01:11:45Z

Re rescheduling when we come back up. Discussed with @dopey and @maraino this morning. Given our currentl limitation to k/v db, we decided to split this work out into a separate ticket. We may want to wait to address it until we have better support for crafting the type of query we want: select all the challenges where owner = my_ordinal && status == processing && num_retries < max_retries

The comment in acme/authority directs users to this file so put a TODO in for posterity.

It might make sense to check in the vscode workspace file if we can make everything relative to the project directory.

Stop execution when the error happens. This was previously a typo.

Prior to validation, we must wrap the base challenge in the correct concrete challenge type so that we dispatch the correct validation method.

The linter likes comments on public functions to start with their name, for some reason...

Include the "Link" and "Location" headers on invalid challenge resources. An invalid challenge is still a perfectly acceptable response.

The error message was updated. Make the test should reflect the new changes.

Also, return early from ValidateChallenge if the challenge is already valid. Interestingly, we aren't actually testing most of the ValidateChallenge func, just the early error and return conditions. We should add some more coverage here.

On the challenge resource, set "Link" and "Location" headers for all successful requests to the challenge resource.

maraino · 2020-05-13T18:22:13Z

acme/challenge.go


+	up := &http01Challenge{hc.baseChallenge.clone()}
+
+	url := fmt.Sprintf("http://%s/.well-known/acme-challenge/%s", hc.Value, hc.Token)


HTTP?

Edit: yes HTTP is ok here.

Are Value and Token url safe? In not you should use url.URL

This isn't new I just moved the line. I'm happy to use a URL if you think it'd be best. I'm pretty sure the token is specified to be url safe and so would be the hostname. But I don't know if we ever sanitize these things.

maraino · 2020-05-13T18:43:55Z

acme/challenge.go

+		e := errors.Errorf("no TXT record found at '%s'", record)
+		up.Error = DNSErr(e).ToACME()
+		return up, nil
+	}


It looks to me that this check should go after we do the lookup.

I believe the lookup is here: https://github.com/smallstep/certificates/pull/242/files/5e5a76c3b53e0edc1c654f8b334e651e9b9e3b21#diff-96e563cc3503e2e7e76fad23ad6b0a09R586 (line 586). The following code is just where we try to match one of the returned records with the token.

I added this check so we could distinguish between no dns records and no matched records with the idea being that no records is transient whereas an incorrect record is a failure. We could, however, just consider all mismatches transient, because it's dns.

Based on https://letsencrypt.org/docs/challenge-types/#dns-01-challenge:

You can have multiple TXT records in place for the same name. For instance, this might happen if you are validating a challenge for a wildcard and a non-wildcard certificate at the same time. However, you should make sure to clean up old TXT records, because if the response size gets too big Let’s Encrypt will start rejecting it.

I think my suggestion to leave the DNS challenge open:

We could, however, just consider all mismatches transient, because it's dns.

might be the best course.

The authority now receives the ordinal in its constructor rather than a global variable set at package initialization time. The ordinal is passed via the command line option `--ordinal`.

Since it will ultimately 500 anyway, just return an error.

dopey · 2020-05-15T19:41:10Z

acme/authority.go

+
+// The challenge validation process is specific to the type of challenge (dns-01, http-01, tls-alpn-01).
+// But, we still pass generic "options" to the polymorphic validate call.
+func (a *Authority) validate(ch challenge, jwk *jose.JSONWebKey) (challenge, error) {


I'd like to have the method name here be validateChallenge or _validateChallenge. Otherwise it seems like it's validation of the authority itself. I appreciate that you can tell from the args, but I was still confused.

dopey · 2020-05-15T19:45:52Z

acme/authority.go

 	if err != nil {
-		return nil, Wrap(err, "error attempting challenge validation")
+		return


Let's at least log the error here, otherwise we're failing silently, right?

dopey · 2020-05-15T19:49:26Z

acme/authority.go

+		return
+	}
+	switch ch.getStatus() {
+	case StatusPending:


Let's add the challenge id (and any other info you think might help debugging, you may even consider printing the entire challenge object in the case of an error) to every trace/log output from this method. Debugging is going to be a pain anyway, given that our db is nosql, but it will be harder still without a challenge id.

dopey · 2020-05-15T19:51:51Z

acme/authority.go

+	retry := ch.getRetry()
+	switch {
+	case retry.Owner != a.ordinal:
+		return


We should log here, "Retry of ACME Challenge %s has changed ownership to ordinal %d" or something. Let's make it easy to understand what's happened.

dopey · 2020-05-15T19:54:15Z

acme/authority.go

+// write update the challenge record in the db if the challenge has remaining retry attempts.
+//
+// see: ValidateChallenge
+func (a *Authority) RetryChallenge(chID string) {


More logging.

Anywhere we return due to error we should log the error and, at the very least, the challenge id.

Anywhere we return for logical reasons, consider logging as well. (e.g. the ordinal has changed).

dopey · 2020-05-15T19:57:23Z

acme/authority.go

+	switch {
+	case err != nil:
+		return
+	case t.Before(now):


Can we leave a comment about this check. Just reading the code I'm not sure why this matters. We only want to continue with a retry attempt if the retryTime is in the future?

Looks like I need to flip this. We want to exit if next-attempt is in the future, not the past.

dopey · 2020-05-15T20:00:04Z

acme/authority.go

+	}
+	ch = up
+
+	p, err := a.LoadProvisionerByID(retry.ProvisionerID)


We shouldn't ignore these errors. We should log and try to exit ceremoniously.

Is it possible for someone to remove a provisioner? If so we may just want a warning but not try to exit the entire process.

dopey · 2020-05-15T20:00:14Z

acme/authority.go

+	ch = up
+
+	p, err := a.LoadProvisionerByID(retry.ProvisionerID)
+	acc, err := a.GetAccount(p, ch.getAccountID())


same as above

dopey · 2020-05-15T20:35:33Z

acme/challenge.go


-	if err := upd.save(db, dc); err != nil {
-		return nil, err
+// KeyAuthorization creates the ACME key authorization value from a token


This method is only used in this file. Probably doesn't need to be exported.

I didn't add this, just moved it down to the bottom. I can make it private if it's not intended to be called by others.

Add tests for the starting challenge statuses. Removed unneeded db write test.

This logic was already in the correct form so it was much easier to update.

Add `golanglint-ci` to the modules so it's available when running `make lint`.

codecov-commenter · 2020-05-21T04:16:24Z

Codecov Report

Merging #242 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #242   +/-   ##
=======================================
  Coverage   73.84%   73.84%           
=======================================
  Files          76       76           
  Lines        8921     8921           
=======================================
  Hits         6588     6588           
  Misses       2004     2004           
  Partials      329      329

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 112fc59...112fc59. Read the comment docs.

dcow and others added 8 commits April 30, 2020 04:44

git: Ignore *.code-workspace

6fdbd85

These are visual studio code's workspace configuration files.

Implement acme RFC 8555, challenge retries

40d7c42

Add automated challenge retries, RFC 8555

66b2c4b

Polish retry conditions

f9779d0

Implement standard backoff strategy

8d43567

handler_test: Remove unused field "Backoffs"

8fb558d

handler_test: Add BackoffChallenge

f56c449

The mock acme authority needs to in order to conform to the updated acme authority interface.

acme/authority: Add space around *

5e6a020

Makes the line more readable.

dcow mentioned this pull request Apr 30, 2020

[RFC] Implement ACME RFC 8555, challenge retries #181

Closed

dcow changed the title ~~8555 § 8.3 Challenge Retries~~ ACME (RFC 8555) § 8.2 Challenge Retries Apr 30, 2020

dcow mentioned this pull request Apr 30, 2020

Implement RFC8555 (ACME spec) § 8.2 #168

Open

dcow commented Apr 30, 2020

View reviewed changes

dcow self-assigned this Apr 30, 2020

dcow requested a review from dopey April 30, 2020 12:29

dcow commented May 1, 2020

View reviewed changes

acme/challenge.go Outdated Show resolved Hide resolved

dcow marked this pull request as ready for review May 6, 2020 15:11

dcow commented May 6, 2020

View reviewed changes

acme/api/handler_test.go Outdated Show resolved Hide resolved

dcow mentioned this pull request May 7, 2020

ACME: Resume challenge retries on process restart #250

Open

dcow added 5 commits May 7, 2020 09:27

acme: go fmt

bdadea8

provisioner/acme: Add TODO for retry restarts

9518ba4

The comment in acme/authority directs users to this file so put a TODO in for posterity.

vscode: Ignore vscode binaries

8326632

It might make sense to check in the vscode workspace file if we can make everything relative to the project directory.

acme/api: Add missing return

2d0a00c

Stop execution when the error happens. This was previously a typo.

acme/authority: Polymorph the challenge type

a857c45

Prior to validation, we must wrap the base challenge in the correct concrete challenge type so that we dispatch the correct validation method.

dcow added 4 commits May 13, 2020 05:04

acme: Fix comment style to appease linter

8ae32f5

The linter likes comments on public functions to start with their name, for some reason...

acme/api: Write headers for invalid challenges

609e131

Include the "Link" and "Location" headers on invalid challenge resources. An invalid challenge is still a perfectly acceptable response.

acme/authority: Fix error message in test

b061d0a

The error message was updated. Make the test should reflect the new changes.

acme/authority: Fix tests

976c8f8

Also, return early from ValidateChallenge if the challenge is already valid. Interestingly, we aren't actually testing most of the ValidateChallenge func, just the early error and return conditions. We should add some more coverage here.

dopey requested a review from maraino May 13, 2020 17:02

dcow added 2 commits May 13, 2020 10:56

acme/api: Add func name to beginning of comment

5354906

acme/api: Set Link and Location headers for all 200

5e5a76c

On the challenge resource, set "Link" and "Location" headers for all successful requests to the challenge resource.

maraino reviewed May 13, 2020

View reviewed changes

acme/authority: Add descriptive intro to ValidateChallenge

b8b3ca2

maraino reviewed May 13, 2020

View reviewed changes

dcow added 3 commits May 13, 2020 19:22

acme: Move ordinal to application

c378e00

The authority now receives the ordinal in its constructor rather than a global variable set at package initialization time. The ordinal is passed via the command line option `--ordinal`.

project: go mod tidy

f022818

acme: Don't panic on logic errors

deacbdc

Since it will ultimately 500 anyway, just return an error.

dopey reviewed May 15, 2020

View reviewed changes

dcow added 7 commits May 18, 2020 04:06

Merge branch 'master' into dcow/challenge-retry

d5f95de

Merge branch 'master' into dcow/challenge-retry

9103880

acme: Update http-01 challenge tests

0f63e43

Add tests for the starting challenge statuses. Removed unneeded db write test.

acme/retry: Update DNS challenge tests

d54f963

acme/retry: Cleanup tls-alpn-01 tests

0578055

This logic was already in the correct form so it was much easier to update.

acme: make fmt

6c39439

make: Fix lint errors

112fc59

Add `golanglint-ci` to the modules so it's available when running `make lint`.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ACME (RFC 8555) § 8.2 Challenge Retries #242

ACME (RFC 8555) § 8.2 Challenge Retries #242

dcow commented Apr 30, 2020 •

edited

Loading

dcow commented May 1, 2020 •

edited

Loading

cpu commented May 1, 2020 •

edited

Loading

dcow commented May 1, 2020 •

edited

Loading

cpu commented May 2, 2020 •

edited

Loading

dcow commented May 6, 2020 •

edited

Loading

dcow commented May 7, 2020

maraino May 13, 2020 •

edited

Loading

maraino May 13, 2020

dcow May 14, 2020

maraino May 13, 2020

dcow May 14, 2020

dcow May 14, 2020

dcow May 21, 2020

dopey May 15, 2020

dopey May 15, 2020

dopey May 15, 2020

dopey May 15, 2020

dopey May 15, 2020

dopey May 15, 2020

dcow May 18, 2020

dopey May 15, 2020

dcow May 18, 2020

dopey May 15, 2020

dopey May 15, 2020

dcow May 18, 2020

codecov-commenter commented May 21, 2020 •

edited

Loading


		up := &http01Challenge{hc.baseChallenge.clone()}

		url := fmt.Sprintf("http://%s/.well-known/acme-challenge/%s", hc.Value, hc.Token)

ACME (RFC 8555) § 8.2 Challenge Retries #242

Are you sure you want to change the base?

ACME (RFC 8555) § 8.2 Challenge Retries #242

Conversation

dcow commented Apr 30, 2020 • edited Loading

dcow commented May 1, 2020 • edited Loading

cpu commented May 1, 2020 • edited Loading

dcow commented May 1, 2020 • edited Loading

cpu commented May 2, 2020 • edited Loading

dcow commented May 6, 2020 • edited Loading

dcow commented May 7, 2020

maraino May 13, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codecov-commenter commented May 21, 2020 • edited Loading

Codecov Report

dcow commented Apr 30, 2020 •

edited

Loading

dcow commented May 1, 2020 •

edited

Loading

cpu commented May 1, 2020 •

edited

Loading

dcow commented May 1, 2020 •

edited

Loading

cpu commented May 2, 2020 •

edited

Loading

dcow commented May 6, 2020 •

edited

Loading

maraino May 13, 2020 •

edited

Loading

codecov-commenter commented May 21, 2020 •

edited

Loading