AWS Glue with Iceberg Source and Delta Target #354
I seem to keep getting a `version-hint.text` error when trying to use the jar file to sync an existing Iceberg table to Delta, using AWS Glue as my catalog. I'm using the `my_config.yaml` file as directed in the docs, but I feel like I'm missing something. For instance: do I need to create a catalog.yaml and clients.yaml file as well? I noticed these, but I thought they were only for implementing custom sources other than Hudi, Iceberg, or Delta. If I do need to create these files, what would I set `catalogImpl:` to for a source Iceberg table that is managed by AWS Glue? Thanks in advance!
Yes, you will need a catalog yaml to specify which catalog implementation you are using for the Iceberg source. If none is specified, it assumes you're not using an external catalog. Once we get this working, it would be good to add some feedback or submit a PR with updates to the README, since this part is not so clear right now :)
Hi, thank you! You will have to forgive me because I'm about to sound dumb, but what would I put here, `catalogImpl: io.my.CatalogImpl`, if I am using AWS managed Glue with Iceberg? And what keys would I need to provide?
@the-other-tim-brown - I am also stuck on this particular issue; do you have an example of what this config would look like? Thanks.
@MrDerecho To sync with the Glue catalog while running the sync, you can look at this example: https://medium.com/@sagarlakshmipathy/using-onetable-to-translate-a-hudi-table-to-iceberg-format-and-sync-with-glue-catalog-8c3071f08877
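For anyone else landing here: a Glue-backed Iceberg source config along the lines of that walkthrough might look like the sketch below. This is a hedged example, not official docs - `catalogImpl` points at Iceberg's real `GlueCatalog` class, but the catalog name and warehouse path are placeholders you'd replace:

```yaml
# catalog.yaml - illustrative sketch for an Iceberg source registered in AWS Glue.
# The catalogOptions map is passed through to the Iceberg catalog implementation.
catalogImpl: org.apache.iceberg.aws.glue.GlueCatalog
catalogName: glue
catalogOptions:
  warehouse: s3://your-bucket/your-warehouse-path   # placeholder
  io-impl: org.apache.iceberg.aws.s3.S3FileIO
```

The file is then passed to the sync utility (recent builds accept it via an `--icebergCatalogConfig` option); the Glue and S3 classes come from the `iceberg-aws-bundle` jar, which needs to be on the classpath.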
@sagarlakshmipathy Thanks for this info! If I'm running OneTable in a Docker container, do I need to use JDK 8 or 11? I noticed that the docs say 11, but the pom file that builds the utilities jar uses 8. Maybe you can help demystify which JDK should be used for build vs. run?
@ForeverAngry We are currently using Java 11 for development and building the jars, but we set a lower target so that projects that are still on Java 8 can use the jar. Your runtime environment just needs to be Java 8 or above for a Java 8 jar to work properly.
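The build-vs-run split described above boils down to the compiler target in the pom; conceptually it is something like this (an illustrative snippet, not copied from the project's actual pom):

```xml
<!-- Illustrative: compile with a newer JDK (e.g. 11) but emit Java 8 bytecode,
     so any Java 8+ runtime can load the resulting jar. -->
<properties>
  <maven.compiler.source>1.8</maven.compiler.source>
  <maven.compiler.target>1.8</maven.compiler.target>
</properties>
```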
Maybe I'm doing something wrong here, but when I try to convert Iceberg to Delta using a Glue catalog, I keep getting an error that the class constructor for catalog.Catalog is not found. I've tried adding the iceberg-aws-bundle to the classpath, as well as many other jars, but still get the same error. I'm executing the OneTable jar sync job from a Docker image. Maybe the fact that it's a Docker image is causing issues? I'm not sure. Does anyone have examples of specific versions of jars that they know work when converting from an Iceberg source managed by Glue to Delta? @the-other-tim-brown @sagarlakshmipathy any help would be very welcome. I'd love to get this tool working!
@ForeverAngry What is the error message you are seeing, and how are you creating this Docker image?
@the-other-tim-brown Well, it was for sure an error with the image. But the error I'm getting now is related to the partitioning. I have a column called `timestamp` with an identity transformation on it. Any thoughts on how to resolve this error?
@ForeverAngry When you say you have an "identity transformation" on the column, what do you mean by that? It sounds like you are actually using the hour transform. The exception looks like there could be an issue in how the source schema is converted. When converting Iceberg's "hidden partitioning" to Delta Lake, we use a generated column to represent the partition values.
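To make the generated-column idea concrete, here is a minimal sketch of how an Iceberg `hour(ts)` partition can be modeled on the Delta side. The table name, column names, and the exact expression are illustrative assumptions, not XTable's internals:

```java
import io.delta.tables.DeltaTable;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;

public class GeneratedColumnSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .master("local[*]")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate();

    // A Delta table partitioned by a column derived from the timestamp,
    // mirroring Iceberg's hidden hour(ts) partitioning.
    DeltaTable.createOrReplace(spark)
        .tableName("example_table")
        .addColumn("f1", DataTypes.StringType)
        .addColumn("ts", DataTypes.TimestampType)
        .addColumn(DeltaTable.columnBuilder("ts_hour_part")
            .dataType(DataTypes.StringType)
            .generatedAlwaysAs("DATE_FORMAT(ts, 'yyyy-MM-dd-HH')")
            .build())
        .partitionedBy("ts_hour_part")
        .execute();
  }
}
```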
@the-other-tim-brown Yeah, I misspoke in my comment; you are correct, I'm using the hour transform.
OK, can you attach more of the stack trace so I can see where exactly that error is thrown?
@the-other-tim-brown Yeah, here is the error:
@the-other-tim-brown - Additional context: the column name is "timestamp", and it is used as a partition transform at the "hour" level. Note that when I query the partitions, they appear to be stored as Unix hours, but the logical partition paths in S3 are formatted as "yyyy-mm-dd-HH".
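That matches Iceberg's hour transform, which stores the count of whole hours since the Unix epoch; the S3 paths just render that same hour in readable form. A small sketch of the conversion (the partition value is a made-up example):

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class HourPartitionDemo {
  public static void main(String[] args) {
    long unixHours = 473_000L; // hypothetical value from Iceberg's hour transform
    // The partition covers the hour starting this many hours after 1970-01-01T00:00:00Z.
    Instant hourStart = Instant.EPOCH.plus(Duration.ofHours(unixHours));
    // Render it the way the logical S3 partition paths look.
    String rendered = DateTimeFormatter.ofPattern("yyyy-MM-dd-HH")
        .withZone(ZoneOffset.UTC)
        .format(hourStart);
    System.out.println(rendered); // prints 2023-12-17-08
  }
}
```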
Is this the full stack trace? If so, I'll need to investigate why we're not getting more error content in the logs.
@the-other-tim-brown That's not the full stack trace, just the error. The machine the program is on is separate and not able to post here, so I have to type a lot of this in by hand. Perhaps you could just let me know: does OneTable have the ability to take a timestamp field, stored as a timestamp, that uses the hour transform from an Iceberg source, and convert that to Delta? Or is that not supported? I guess this is what I really want to know.
This is definitely a bug somewhere in the partition handling logic. We want to support this case, so it just requires developing an end-to-end test case in the repo that reproduces the issue in a way that is easier to debug.
@the-other-tim-brown I was able to get the full stack trace printed out of the machine. I made most of the fields random names like "f1, f2, ...", except for the field the data is partitioned on using the hour transform from Iceberg (the one that is failing). I also redacted the S3 path names. Just to restate: I'm going from an existing Iceberg table in AWS, using S3 and Glue, to Delta (not Databricks Delta). The existing Iceberg table was created by Athena. Take a look and let me know what you think or if anything stands out! Any help would be awesome. Other partition functions work fine. The project is really awesome!
Can you confirm the field's schema type in the Iceberg metadata and in the Parquet file metadata? I tried a local example with hour partitioning on a timestamp field and it worked. I'm wondering if the field is actually a long in the schema.
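For reference, the spot to check in the table's metadata JSON looks roughly like this trimmed, illustrative excerpt; a proper timestamp column should show `"type": "timestamp"` rather than `"long"` (on the Parquet side it would be an INT64 with a timestamp logical annotation):

```json
{
  "schemas": [{
    "fields": [
      { "id": 1, "name": "timestamp", "required": false, "type": "timestamp" }
    ]
  }],
  "partition-specs": [{
    "spec-id": 0,
    "fields": [
      { "name": "timestamp_hour", "transform": "hour", "source-id": 1, "field-id": 1000 }
    ]
  }]
}
```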
@the-other-tim-brown So, I rebuilt the table again to make sure that there wasn't something odd with the partitions. It's odd; the only change I made when rebuilding the table was using upper case.
@ForeverAngry Can you grab the schema of the Iceberg table?
@the-other-tim-brown Here it is; let me know if this wasn't what you were looking for!
Thanks, this is what I was looking for. It lines up with my expectations, but I'm not able to reproduce the same error that you saw. Are any of your partitions null or before 1970-01-01? Just trying to think through the common time-based edge cases here.
@the-other-tim-brown So the table had some null values in the timestamp column; I got rid of those. Now I'm back to the original error I started with. The timestamps include milliseconds but do not have a timezone. Do you think the millisecond precision is an issue? Also, can you tell me what jar versions you are using when you run your test? Maybe mine are causing an issue. Here is the stack:
@ForeverAngry Can you confirm that you are on the latest main branch when testing? If so, can you rebuild off of this branch and test again? #382 - I think there is some issue in how we're handling timestamps with/without timezone set.
@the-other-tim-brown I re-pulled the main branch again and rebuilt the jar. Note: I noticed that the run sync class path changed.
@ForeverAngry The change in paths is expected and required as we transition this project to the Apache Software Foundation. We will need to do a full revamp of the docs/references once this process is complete. Did you try building off of the branch I linked as well?
@the-other-tim-brown Well, I feel a bit foolish - for some reason I thought you only wanted me to re-pull and test the main branch. Sorry, I'll pull that branch you linked and let you know! Again, I appreciate the dialog and support - you're awesome!
@the-other-tim-brown Okay, I rebuilt the jar using the branch you provided; here is the stack trace:
@ForeverAngry I missed a spot, but I just updated the branch to handle this error you are seeing.
@the-other-tim-brown Awesome!! I'll try it here in the next hour and let you know!
@the-other-tim-brown So I pulled that branch and built the jar. Good news: the sync completes successfully! But when I try to run the sync again, I get this error (see below). However, if I delete the
@ForeverAngry We need to handle the protocol versions better. Right now the version pair is hardcoded to (1, 4) when creating the table, but I guess when we add the timestamp ntz field it is getting upgraded to (3, 7). We should remove the hardcoding and compute the versions based on the needs of the table.
@the-other-tim-brown Makes sense; I can see if I can get it figured out and then submit a PR. Any suggestions (or existing examples in the codebase for other types) on where to start to actually compute the values would be appreciated!
https://github.com/apache/incubator-xtable/blob/main/core/src/main/java/org/apache/xtable/delta/DeltaClient.java#L69-L71 is where we are currently hardcoding the values. I would create a method that outputs the min reader and writer versions based on the features used (generated columns, timestamp ntz, etc.). Maybe there is already a public method in Delta that does this that can just be used? Something is upgrading the table, so I am thinking we can try to call into that method.

You can add some testing here: https://github.com/apache/incubator-xtable/blob/main/core/src/test/java/org/apache/xtable/delta/TestDeltaSync.java#L102

Also, feel free to take over my branch or just pull those small changes for the timestamp ntz into your own branch so you can test that case.
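A rough sketch of the kind of helper being suggested, using the version requirements from the Delta protocol spec (generated columns need writer version 4; timestampNtz is a table feature needing reader 3 / writer 7). The method name and boolean flags are illustrative, not existing XTable code:

```java
// Derive minimum Delta protocol versions from the features a table actually
// uses, instead of hardcoding (1, 4) at table creation.
static int[] minProtocolVersions(boolean usesGeneratedColumns, boolean usesTimestampNtz) {
  int minReader = 1;
  int minWriter = 2; // baseline writer version
  if (usesGeneratedColumns) {
    minWriter = Math.max(minWriter, 4); // generated columns: writer version 4+
  }
  if (usesTimestampNtz) {
    minReader = Math.max(minReader, 3); // timestampNtz table feature: reader 3
    minWriter = Math.max(minWriter, 7); // and writer 7
  }
  return new int[] {minReader, minWriter};
}
```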
@the-other-tim-brown I added the changes to support this feature as you outlined. Full disclosure: this is my first time contributing to an open source project, and I'm not overly familiar with large-scale Java projects, so you will have to bear with me. Here is my fork. One final note: I pulled the main branch and made the changes you put on the timestamp_ntz branch there, since that branch was pretty far out of date with the current project at this point. Let me know what you think - any guidance is welcome!
@ForeverAngry Can you create a pull request from your fork? It should allow you to set the main branch in this repo as a target. I will be able to better review and add comments around testing that way.
@the-other-tim-brown Will do! Give me a few minutes!
@the-other-tim-brown Done, take a look.
@ForeverAngry I am getting the class constructor error while running the jar locally. What was the fix that resolved this?
For this specific issue, I just pulled the main branch and rebuilt the jars. That seemed to clear up the class constructor issue. Let me know if it doesn't!
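For anyone hitting the same thing, the rebuild amounts to something like the following (illustrative; the exact jar name/version and optional flags depend on your checkout and build):

```bash
# Rebuild the bundled utilities jar from the current main branch, then rerun the sync.
git clone https://github.com/apache/incubator-xtable.git
cd incubator-xtable
mvn clean package -DskipTests
java -jar utilities/target/utilities-*-bundled.jar \
  --datasetConfig my_config.yaml \
  --icebergCatalogConfig catalog.yaml
```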