-
Notifications
You must be signed in to change notification settings - Fork 442
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ACME (RFC 8555) § 8.2 Challenge Retries #242
base: master
Are you sure you want to change the base?
Conversation
These are visual studio code's workspace configuration files.
The mock acme authority needs to in order to conform to the updated acme authority interface.
Makes the line more readable.
https://tools.ietf.org/html/rfc8555#section-7.1.6 illustrates the challenge lifecycle. Retries are mentioned:
I read this as "if the server successfully acquires a challenge result (e.g. http resource, dns record, tls cert) but verification fails (the data is not correct), then move to the "invalid" state". This type of "error" is different from a transient network error which should be retried. So I'm going to sift through all the error logic and bucket things into one of those two definitions. And we'll only retry for transient errors. Also, once the state is "invalid", I don't think a challenge can be revived. This means the proposed implementation, where a challenge lives in the invalid state between retry attempts, is not correct. The challenge should remain processing until it's either valid or invalid.
|
@dcow I think you would want it to remain in a "processing" state instead of a "pending" state for failed attempts that support further retries. I think that's the meat of the errata that Rob from Sectico filed against RFC 8555 related to retries, (Edit: this part of the spec is a little bit rough. Despite my best efforts it landed in 8555 without anyone ever having implemented it at the time... 😓) |
@cpu thanks for the color. The errata is super useful—figuring out when to consider a challenge invalid is exactly what I've been mulling over. Do you have any insight on:
from section 8.2? Is the intention that orders are only allowed to be in the processing state for a short amount of time after the initial client request and the server needs to put the kibosh on future client retries by marking the challenge invalid? I'm inclined to let a challenge remain in a processing state indefinitely until an authorization expires or otherwise becomes invalid. I guess one could interpret that behavior as a server that never "gives up" even if it won't automatically retry again unless the client pokes it. |
@dcow I think an order should only enter into the processing state when the client POSTs the finalization URL to begin issuance. At that point all of the challenges need to be valid for the finalization to start. In practice I suspect most clients expect orders to not remain in the processing state for very long simply because Let's Encrypt's issuance is more or less synchronous with the finalization request (but that's a quirk of one implementation and not a protocol guarantee).
I think that makes sense for the challenges 👍 I can't think of a reason that would be a problem. Hope that helps! |
Section 8.2 of RFC 8555 explains how retries apply to the validation process. However, much is left up to the implementer. Add retries every 12 seconds for 2 minutes after a client requests a validation. The challenge status remains "processing" indefinitely until a distinct conclusion is reached. This allows a client to continually re-request a validation by sending a post-get to the challenge resource until the process fails or succeeds. Challenges in the processing state include information about why a validation did not complete in the error field. The server also includes a Retry-After header to help clients and servers coordinate. Retries are inherently stateful because they're part of the public API. When running step-ca in a highly available setup with replicas, care must be taken to maintain a persistent identifier for each instance "slot". In kubernetes, this implies a *stateful set*.
Remaining TODO:
|
Re rescheduling when we come back up. Discussed with @dopey and @maraino this morning. Given our currentl limitation to k/v db, we decided to split this work out into a separate ticket. We may want to wait to address it until we have better support for crafting the type of query we want: select all the challenges where owner = my_ordinal && status == processing && num_retries < max_retries |
The comment in acme/authority directs users to this file so put a TODO in for posterity.
It might make sense to check in the vscode workspace file if we can make everything relative to the project directory.
Stop execution when the error happens. This was previously a typo.
Prior to validation, we must wrap the base challenge in the correct concrete challenge type so that we dispatch the correct validation method.
The linter likes comments on public functions to start with their name, for some reason...
Include the "Link" and "Location" headers on invalid challenge resources. An invalid challenge is still a perfectly acceptable response.
The error message was updated. Make the test should reflect the new changes.
Also, return early from ValidateChallenge if the challenge is already valid. Interestingly, we aren't actually testing most of the ValidateChallenge func, just the early error and return conditions. We should add some more coverage here.
On the challenge resource, set "Link" and "Location" headers for all successful requests to the challenge resource.
|
||
up := &http01Challenge{hc.baseChallenge.clone()} | ||
|
||
url := fmt.Sprintf("http://%s/.well-known/acme-challenge/%s", hc.Value, hc.Token) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HTTP?
Edit: yes HTTP is ok here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are Value and Token url safe? In not you should use url.URL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't new I just moved the line. I'm happy to use a URL if you think it'd be best. I'm pretty sure the token is specified to be url safe and so would be the hostname. But I don't know if we ever sanitize these things.
e := errors.Errorf("no TXT record found at '%s'", record) | ||
up.Error = DNSErr(e).ToACME() | ||
return up, nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks to me that this check should go after we do the lookup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the lookup is here: https://github.com/smallstep/certificates/pull/242/files/5e5a76c3b53e0edc1c654f8b334e651e9b9e3b21#diff-96e563cc3503e2e7e76fad23ad6b0a09R586 (line 586). The following code is just where we try to match one of the returned records with the token.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this check so we could distinguish between no dns records and no matched records with the idea being that no records is transient whereas an incorrect record is a failure. We could, however, just consider all mismatches transient, because it's dns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on https://letsencrypt.org/docs/challenge-types/#dns-01-challenge:
You can have multiple TXT records in place for the same name. For instance, this might happen if you are validating a challenge for a wildcard and a non-wildcard certificate at the same time. However, you should make sure to clean up old TXT records, because if the response size gets too big Let’s Encrypt will start rejecting it.
I think my suggestion to leave the DNS challenge open:
We could, however, just consider all mismatches transient, because it's dns.
might be the best course.
The authority now receives the ordinal in its constructor rather than a global variable set at package initialization time. The ordinal is passed via the command line option `--ordinal`.
Since it will ultimately 500 anyway, just return an error.
|
||
// The challenge validation process is specific to the type of challenge (dns-01, http-01, tls-alpn-01). | ||
// But, we still pass generic "options" to the polymorphic validate call. | ||
func (a *Authority) validate(ch challenge, jwk *jose.JSONWebKey) (challenge, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to have the method name here be validateChallenge
or _validateChallenge
. Otherwise it seems like it's validation of the authority itself. I appreciate that you can tell from the args, but I was still confused.
if err != nil { | ||
return nil, Wrap(err, "error attempting challenge validation") | ||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's at least log the error here, otherwise we're failing silently, right?
return | ||
} | ||
switch ch.getStatus() { | ||
case StatusPending: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add the challenge id (and any other info you think might help debugging, you may even consider printing the entire challenge object in the case of an error) to every trace/log output from this method. Debugging is going to be a pain anyway, given that our db is nosql, but it will be harder still without a challenge id.
retry := ch.getRetry() | ||
switch { | ||
case retry.Owner != a.ordinal: | ||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should log here, "Retry of ACME Challenge %s has changed ownership to ordinal %d" or something. Let's make it easy to understand what's happened.
// write update the challenge record in the db if the challenge has remaining retry attempts. | ||
// | ||
// see: ValidateChallenge | ||
func (a *Authority) RetryChallenge(chID string) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More logging.
-
Anywhere we return due to error we should log the error and, at the very least, the challenge id.
-
Anywhere we return for logical reasons, consider logging as well. (e.g. the ordinal has changed).
switch { | ||
case err != nil: | ||
return | ||
case t.Before(now): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we leave a comment about this check. Just reading the code I'm not sure why this matters. We only want to continue with a retry attempt if the retryTime is in the future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like I need to flip this. We want to exit if next-attempt is in the future, not the past.
} | ||
ch = up | ||
|
||
p, err := a.LoadProvisionerByID(retry.ProvisionerID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't ignore these errors. We should log and try to exit ceremoniously.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible for someone to remove a provisioner? If so we may just want a warning but not try to exit the entire process.
acme/authority.go
Outdated
ch = up | ||
|
||
p, err := a.LoadProvisionerByID(retry.ProvisionerID) | ||
acc, err := a.GetAccount(p, ch.getAccountID()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above
|
||
if err := upd.save(db, dc); err != nil { | ||
return nil, err | ||
// KeyAuthorization creates the ACME key authorization value from a token |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is only used in this file. Probably doesn't need to be exported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't add this, just moved it down to the bottom. I can make it private if it's not intended to be called by others.
Add tests for the starting challenge statuses. Removed unneeded db write test.
This logic was already in the correct form so it was much easier to update.
Add `golanglint-ci` to the modules so it's available when running `make lint`.
Codecov Report
@@ Coverage Diff @@
## master #242 +/- ##
=======================================
Coverage 73.84% 73.84%
=======================================
Files 76 76
Lines 8921 8921
=======================================
Hits 6588 6588
Misses 2004 2004
Partials 329 329 Continue to review full report at Codecov.
|
Taking over #181. Original issue #162.