Added freshness aware table loading using entityId:entityVersion for ETag #1037

Open
wants to merge 12 commits into
base: main
from

Conversation

mansehajsingh

@mansehajsingh mansehajsingh commented Feb 20, 2025

Adding freshness aware table loading to Polaris to match the Iceberg 1.8.0 REST Spec change. createTable, registerTable, and loadTable will now issue an ETag, and loadTable will consume an If-None-Match header and compare it against the current ETag in the TableLikeEntity to determine if the client's version is current.

The ETag value was decided to be a tuple of entityId:entityVersion.

  • Added authorization unit tests to ensure the new PolarisCatalogHandlerWrapper APIs adhere to the same privileges
  • Added integration tests to ensure correct behavior of registerTable, loadTable, and createTable via new API changes

@mansehajsingh mansehajsingh changed the title Pulled in iceberg 1.8.0 spec changes for freshness aware table loadin… Added freshness aware table loading using metadata location for ETag Feb 20, 2025
@snazy
Member

snazy commented Feb 21, 2025

NOTE: DO NOT REVIEW. Currently only to demonstrate a POC.

Please use "Draft PRs" for this - not "ready for review"

Contributor

@eric-maynard eric-maynard left a comment

I'm supportive of the idea, but we need some tests. Otherwise left a few comments

@mansehajsingh mansehajsingh marked this pull request as draft February 21, 2025 19:09
@mansehajsingh mansehajsingh marked this pull request as ready for review February 26, 2025 18:11
Member

@snazy snazy left a comment

It looks like this PR implements ETag and If-None-Match in a non-HTTP-spec conformant way.

I'd really prefer to respect RFC 9110 here.

/**
* Entities that can expose an ETag that can uniquely identify their current state.
*/
public interface ETaggableEntity {
Member

This interface mixes REST/HTTP concerns w/ persistence concerns. Please remove the ETag functionality from persistence.

Author

Removed! Removed all references to it in the persistence layer.

Contributor

I'm not sure this separation is wise. Besides the fact that it complicates the code, we'll eventually want to expose etag semantics to the persistence layer in some form. i.e. if the entity an etag refers to is indeed still in the version specified in the etag, we don't need to pull the entity out of the metastore at all.

@mansehajsingh
Author

I've gone ahead and added HTTP compliant representations and parsing of the ETag and If-None-Match headers. Now,

  • The returned ETag will always be of the form W/"entityId:entityVersion" since we never do byte by byte comparisons
  • The provided If-None-Match header can specify the wildcard * to match any ETag or can specify the format W/"entityId:entityVersion", and even multiple ETags in the header as is defined in the HTTP standard. For example, if the returned ETag was W/"850:2" all of *, W/"850:2", and W/"850:2", "some-other-strong-etag" would return 304. However, W/"850:1" or "850:2" would not match it.
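
The matching behavior described in the list above can be sketched as follows. This is only an illustration: `ETagMatchSketch`, `weakETag`, and `notModified` are hypothetical names, not the classes in this PR. It assumes the comparison is an exact string match on the full ETag, which is what makes `"850:2"` fail against `W/"850:2"`. That is stricter than the weak comparison function in RFC 9110, which ignores the `W/` prefix.

```java
import java.util.List;

/** Illustrative sketch only; the names are hypothetical, not the PR's actual classes. */
final class ETagMatchSketch {

  /** Builds a weak ETag of the form W/"entityId:entityVersion". */
  static String weakETag(long entityId, int entityVersion) {
    return "W/\"" + entityId + ":" + entityVersion + "\"";
  }

  /**
   * Returns true when a 304 Not Modified would be appropriate: the header was
   * the wildcard, or one of the supplied ETags equals the current ETag exactly.
   * Assumption: exact string equality, so a strong "850:2" never matches W/"850:2".
   */
  static boolean notModified(String currentETag, boolean wildcard, List<String> candidates) {
    return wildcard || candidates.contains(currentETag);
  }
}
```

With a current ETag of `W/"850:2"`, calling `notModified` with the candidates `W/"850:2"` and `"some-other-strong-etag"` returns true, while a lone `W/"850:1"` or `"850:2"` returns false.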


protected static Pattern ETAG_PATTERN = Pattern.compile("(W/)?\"([^\"]*)\"");

private final boolean weak;
Contributor

@eric-maynard eric-maynard Feb 28, 2025

Do we really need to support weak & strong etags?

Author

@mansehajsingh mansehajsingh Feb 28, 2025

I supported their parsing, but in the current implementation all table operations only return weak etags. So, we never generate a strong etag.

Contributor

So the strong etag code is dead code?

Author

Followed up in #1037 (comment)

@Override
public boolean equals(Object o) {
if (o instanceof ETag other) {
return weak == other.weak && value.equals(other.value);
Contributor

This doesn't look right. If anything, all of our etags are technically weak because the credentials would be different but we think the responses are semantically identical

Author

I'm a bit confused about what's being pointed out here. The line commented on compares two etags to see if they're both strong or both weak validators, and then checks that they contain the same internal value. I've marked all of our ETags as weak since we never do a byte for byte comparison. My thinking was that we don't know what content we would encode in a weak vs strong etag if we choose to support both in future, so comparisons should only be between weak etags and other weak etags, and between strong etags and other strong etags.

Contributor

Our etags are never strong etags.

Author

I agree. As such, if we receive a strong etag we should ensure that we don't match it. This requires us to define a way to tell strong etags from weak ones.

While we don't generate strong etags, I don't believe the distinction being available is dead code. If I receive a strong etag in the header, the distinction that isWeak() provides me is the ability to check if the received etag was strong or weak, and then only return 304 if the received header was weak.

Contributor

If you somehow receive a strong etag in the header, simple string matching will show that it doesn't match any etag our application can generate

Author

Sounds good, I've removed the wrapper and now we do just regular string comparisons.

private final List<ETag> etags;

/**
* Parses a non wildcard If-None-Match header value into ETags
Contributor

What would be the reason to support this? You can use an etag of * to check if a table exists?

Author

I only added this to be compliant with the HTTP spec as @snazy requested, since * is a valid If-None-Match value with the desired behavior of matching any resource. I believe the wildcard is supposed to be useful to see if something already exists on a put, kind of like a create if not exists.

Contributor

Is lacking wildcard support really noncompliant? Can you link the relevant part of the RFC?

Author

Contributor

@eric-maynard eric-maynard Feb 28, 2025

I see. But in our case the PUT (createTable) already has semantics around failing the request if the table exists. What additional semantics would having an * on that request provide?

I wouldn't have a problem with just failing on an etag like this as we really don't expect one from the client.

Author

The endpoint now rejects the wildcard value. 👍

/**
* Logical representation of an HTTP compliant If-None-Match header.
*/
public class IfNoneMatch {
Contributor

Same here: couldn't this be a record? The current constructor could be turned into a static factory method.

Author

There are certain semantics surrounding the If-None-Match header that I feel would not be as clear with a record.

The If-None-Match header can either take the value of the wildcard * or a (possibly empty) list of etags, not both. The canonical constructor for a class like this would be defined like IfNoneMatch(boolean isWildcard, List<ETag> etags). However, an If-None-Match header is one or the other, not both. I shouldn't be able to construct a header that's both a wildcard and contains etags.

It's easier to capture this when we don't expose the canonical constructor. I've improved the semantics a bit. Now, I can either construct a wildcard header by calling IfNoneMatch.wildcard(), or I can init the header by calling new IfNoneMatch(List.of(etag1, etag2, etag3)), but I can't give it both.

Contributor

public record IfNoneMatch(Boolean isWildcard, List<ETag> etags) {

  public IfNoneMatch(boolean isWildcard) {
    this(isWildcard, null);
  }

  public IfNoneMatch(List<ETag> etags) {
    this(null, etags);
  }

  public IfNoneMatch {
    if (!(isWildcard == null ^ etags == null)) {
      throw new IllegalArgumentException(...);
    }
  }
}

Author

Used suggestion! Changed to record with similar semantics.

@mansehajsingh
Author

OK, I've gone ahead and removed the ETag wrapper. We no longer need explicit distinctions between strong and weak etags; we can just use regular string matches. I fixed some parsing logic for the If-None-Match header and made the wrapper class into a record. It also now fails on * since we don't really have a use for that functionality of If-None-Match.

*/
public record IfNoneMatch(boolean isWildcard, @Nonnull List<String> eTags) {

protected static Pattern ETAG_PATTERN = Pattern.compile("(W/)?\"([^\"]*)\"");
Contributor

It looks like we don't need the first capture group

Author

Yes, you're right. I have taken it out and wrapped the entire thing in a capture group so it can be used to extract the etag from the header/validate an individual etag. See #1037 (comment)

public static IfNoneMatch fromHeader(String rawValue) {
// parse null header as an empty header
if (rawValue == null) {
return new IfNoneMatch(List.of());
Contributor

Can we make IfNoneMatch(List.of()) a constant?

Author

Made constant IfNoneMatch.EMPTY

}

rawValue = rawValue.trim();
if (rawValue.equals("*")) {
Contributor

Let's use a constant here, too

Author

Made constant for the value, WILDCARD_HEADER_VALUE and made the header object a constant, IfNoneMatch.WILDCARD

} else {

List<String> parts = Stream.of(rawValue.split("\\s+")) // Tokenizes string , eg. we will now have [`W/"etag1",`, `W/"etag2"`]
.map(s -> s.endsWith(",") ? s.substring(0, s.length() - 1) : s) // Remove trailing comma from each part, so we now have [`W/"etag1"`, `W/"etag2"`]
Contributor

Can we just split on ,?\\s+

Author

I have taken this logic out entirely, see #1037 (comment) for an explanation as to why the parsing was incorrect.

Comment on lines 80 to 83
boolean allValid = parts.stream().allMatch(s -> {
Matcher matcher = ETAG_PATTERN.matcher(s);
return matcher.matches();
});
Contributor

This looks redundant with the static block

Author

I have taken this part out; now we just use the pattern to extract the etags so we can build the object. The static block still has the validation so that we can validate objects built from the constructor instead of from fromHeader(..)

@mansehajsingh
Author

mansehajsingh commented Mar 4, 2025

I realized I introduced a bug in the parsing of the header: splitting by whitespace or commas is incorrect, because these characters are not always delimiters and can be contained within the value of an ETag. I have updated the parsing to use two regexes: one validates that the entire header is HTTP compliant (e.g. a comma and space between each ETag, each ETag matches, no trailing comma), and the other extracts the captured ETags from the header. I have removed unnecessary capture groups from both of these.

Previously, an invalid header like W/"etag" W/"etag" would have passed through; now the delimiter will be properly validated and parsed.

I have added a test to ensure such a header is rejected going forward.

I have also replaced the empty and wildcard ETags as constants.
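
A minimal sketch of the two-regex approach described above, for readers following along. The class name `IfNoneMatchParserSketch`, the method `parse`, and the exact patterns are assumptions made for illustration, not the code in this PR. In particular, the sketch accepts a comma with optional surrounding whitespace between ETags, and it does not handle the wildcard or a null header.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only; names and patterns are hypothetical. */
final class IfNoneMatchParserSketch {

  // One ETag: optional weak prefix, then a quoted value without embedded quotes.
  private static final String ETAG = "(?:W/)?\"[^\"]*\"";

  // Whole header: one ETag, then zero or more comma-separated ETags.
  private static final Pattern HEADER =
      Pattern.compile(ETAG + "(?:\\s*,\\s*" + ETAG + ")*");

  // Used to pull out each ETag once the header as a whole is known to be valid.
  private static final Pattern SINGLE = Pattern.compile(ETAG);

  static List<String> parse(String rawValue) {
    String value = rawValue.trim();
    if (!HEADER.matcher(value).matches()) {
      throw new IllegalArgumentException("Invalid If-None-Match header: " + rawValue);
    }
    List<String> eTags = new ArrayList<>();
    Matcher matcher = SINGLE.matcher(value);
    while (matcher.find()) {
      eTags.add(matcher.group());
    }
    return eTags;
  }
}
```

Validating the whole header first is what rejects `W/"etag" W/"etag"` and a trailing comma, while extracting with the quoted pattern keeps commas and spaces that appear inside an ETag value intact.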

Comment on lines +351 to +352
if (ifNoneMatch.isWildcard())
throw new BadRequestException("If-None-Match may not take the value of '*'");
Contributor

Let's use braces here

* @return the Polaris table entity for the table
*/
private TableLikeEntity getTableEntity(TableIdentifier tableIdentifier) {
PolarisResolvedPathWrapper target = resolutionManifest.getPassthroughResolvedPath(tableIdentifier);
Contributor

Are we potentially adding another trip to the metastore here after an operation like createTable?

* @param eTag the eTag value
* @param <T> The type of the encapsulated response object
*/
public record ETaggedResponse<T> (@Nonnull T response, @Nonnull String eTag) {}
Contributor

Looks like the trailing newline got dropped

Contributor

@eric-maynard eric-maynard left a comment

Looks very close to me, commented on a few lingering concerns. It seems like we are just trusting the cache to avoid extra hits against the metastore here

Comment on lines +598 to +600
return new ETaggedResponse<>(
doCatalogOperation(() -> CatalogHandlers.createTable(baseCatalog, namespace, request)),
generateETagValueForTable(getTableEntity(identifier))
Contributor

I think we need to ensure that the entity we're using to generate the etag here is the exact same one returned in the response.

Otherwise, imagine the response body describes the table at time T but you call getTableEntity at time T+1 and generate the etag accordingly. A user who keeps polling for the table using the etag will never see the table's state at time T+1
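
One way to picture the concern is a sketch in which the response body and the ETag are both derived from a single snapshot of the entity, so they cannot describe different versions. Everything here is hypothetical: `TableSnapshot`, `respond`, and the simplified `ETaggedResponse` are stand-ins for illustration and do not reflect how the PR or Polaris actually loads entities.

```java
import java.util.function.Function;

/** Illustrative sketch only; all types here are hypothetical stand-ins. */
final class ConsistentETagSketch {

  /** A single point-in-time view of a table entity. */
  record TableSnapshot(long entityId, int entityVersion, String metadataLocation) {}

  /** Simplified stand-in for a response paired with its ETag. */
  record ETaggedResponse<T>(T response, String eTag) {}

  /**
   * Builds the response and the ETag from the same snapshot, so an ETag for
   * version N is never paired with a body describing a different version.
   */
  static <T> ETaggedResponse<T> respond(
      TableSnapshot snapshot, Function<TableSnapshot, T> toResponse) {
    String eTag = "W/\"" + snapshot.entityId() + ":" + snapshot.entityVersion() + "\"";
    return new ETaggedResponse<>(toResponse.apply(snapshot), eTag);
  }
}
```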

4 participants