AWS Glue with Iceberg Source and Delta Target #354

Open
ForeverAngry opened this issue Feb 29, 2024 · 41 comments

@ForeverAngry

ForeverAngry commented Feb 29, 2024

I seem to keep getting a version-hint.text error when trying to use the .jar file to sync an existing Iceberg table to Delta, using AWS Glue as my catalog. I'm using the my_config.yaml file as directed in the docs, but I feel like I'm missing something.

For instance:

sourceFormat: ICEBERG
targetFormats:
  - DELTA
datasets:
  -
    # (I have this set to the parent path that holds the metadata, data, and _delta_log directories; I've tried other configurations as well. The _delta_log directory gets created, but nothing gets written to it, and I keep getting the error indicated above.)
    tableBasePath: s3://path
    tableName: my_table

Do I need to create a catalog.yaml and clients.yaml file as well? I noticed these, but I thought they were only for implementing custom sources other than Hudi, Iceberg, or Delta.

If I do need to create these files, what would I set catalogImpl: to for a source Iceberg table that is managed by AWS Glue?

catalogImpl: io.my.CatalogImpl
catalogName: name
catalogOptions: # all other options are passed through in a map
  key1: value1
  key2: value2

Thanks in advance!

@the-other-tim-brown
Contributor

Yes you will need a catalog yaml to specify which catalog implementation you are using for the Iceberg source. If none is specified, it assumes you're not using an external catalog. Once we get this working it would be good to add some feedback or submit a PR for updates to the README since this part is not so clear right now :)
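
A minimal sketch of what that catalog yaml could look like for an Iceberg source registered in AWS Glue, assuming the Iceberg AWS module (which provides GlueCatalog and S3FileIO) and the AWS SDK are on the classpath; the catalog name and warehouse path are placeholders, and the options are standard Iceberg catalog properties rather than values verified against this exact setup:

catalogImpl: org.apache.iceberg.aws.glue.GlueCatalog
catalogName: glue
catalogOptions:
  # standard Iceberg catalog properties; passed through as a map to the catalog
  warehouse: s3://path/to/warehouse
  io-impl: org.apache.iceberg.aws.s3.S3FileIO

AWS credentials and region are resolved through the usual AWS SDK defaults (environment variables, profiles, or instance roles), and the file is passed to the sync utility alongside my_config.yaml via the catalog config option described in the README.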

@ForeverAngry
Author

Yes you will need a catalog yaml to specify which catalog implementation you are using for the Iceberg source. If none is specified, it assumes you're not using an external catalog. Once we get this working it would be good to add some feedback or submit a PR for updates to the README since this part is not so clear right now :)

Hi, thank you! You'll have to forgive me because I'm about to sound dumb, but what would I put here

'catalogImpl: io.my.CatalogImpl'

If I am using AWS-managed Glue with Iceberg? And what keys would I need to provide?

@MrDerecho

@the-other-tim-brown - I am also stuck on this particular issue, do you have an example of what this config would look like? Thanks.

@sagarlakshmipathy
Contributor

sagarlakshmipathy commented Mar 1, 2024

@ForeverAngry
Author

@MrDerecho To catalog with Glue while running the sync, you can look at this example: https://medium.com/@sagarlakshmipathy/using-onetable-to-translate-a-hudi-table-to-iceberg-format-and-sync-with-glue-catalog-8c3071f08877

@sagarlakshmipathy thanks for this info! If I'm running OneTable in a Docker container, do I need to use JDK 8 or 11? I noticed that the docs say 11, but the pom file that builds the utilities jar uses 8. Maybe you can help demystify which JDK should be used for build vs. run?

@the-other-tim-brown
Contributor

@ForeverAngry we are currently using java 11 for development and building the jars but we set a lower target so that projects that are still on java 8 can use the jar. Your runtime environment just needs to be java 8 or above for a java 8 jar to work properly.

@ForeverAngry
Author

Maybe I'm doing something wrong here, but when I try to convert Iceberg to Delta using a Glue catalog, I keep getting the error that the class constructor for catalog.Catalog is not found. I've tried adding the Iceberg AWS bundle to the classpath, as well as many other jars, but still the same error. I'm executing the OneTable jar sync job from a Docker image. Maybe the fact that it's a Docker image is causing issues? I'm not sure. Does anyone have examples of specific versions of jars that should be used when converting from an Iceberg source managed by Glue to Delta, that they know work? @the-other-tim-brown @sagarlakshmipathy any help would be very welcome. I'd love to get this tool working!

@the-other-tim-brown
Contributor

@ForeverAngry what is the error message you are seeing and how are you creating this docker image?

@ForeverAngry
Author

ForeverAngry commented Mar 4, 2024

@ForeverAngry what is the error message you are seeing and how are you creating this docker image?

@the-other-tim-brown Well, it was for sure an error with the image. But the error I'm getting now is related to the partition. I have a column called `timestamp` that I use an identity partition transform based on `hour` on, and now I get the following error:

org.apache.spark.sql.AnalysisException: cannot resolve 'date_format(timestamp, 'yyyy-MM-dd-HH')' due to data type mismatch: argument 1 requires timestamp type, however, 'timestamp' is of bigint type.; line 1 pos 0;

Any thoughts on how to resolve this error?

@the-other-tim-brown
Contributor

@ForeverAngry when you say you have an "identity transformation" on the column, what do you mean by that? It sounds like you are using the hour transform from the Iceberg partitioning docs.

The exception looks like there could be an issue in how the source schema is converted. When converting Iceberg's "hidden partitioning" to Delta Lake, we use a generated column to represent the partition values.

@ForeverAngry
Author

@the-other-tim-brown yeah I misspoke in my comment, you are correct, I'm using the hour transform.

@the-other-tim-brown
Contributor

Ok can you attach more of the stacktrace so I can see where exactly that error is thrown?

@ForeverAngry
Author

@the-other-tim-brown yeah here is the error:

2024-03-05 20:30:06 ERROR io.onetable.spi.sync.TableFormatSync:78 - Failed to sync snapshot org.apache.spark.sql.AnalysisException: cannot resolve 'date_format(timestamp, 'yyyy-MM-dd-HH')' due to data type mismatch: argument 1 requires timestamp type, however, 'timestamp' is of bigint type.; line 1 pos 0;

@MrDerecho

@the-other-tim-brown - additional context: the column name is "timestamp", which is used in a partition transform at the "hour" level. Note, when I query the partitions they appear to be stored as Unix hours, but the logical partitions in S3 are formed as "yyyy-mm-dd-HH".

@the-other-tim-brown
Contributor

@the-other-tim-brown yeah here is the error:

2024-03-05 20:30:06 ERROR io.onetable.spi.sync.TableFormatSync:78 - Failed to sync snapshot org.apache.spark.sql.AnalysisException: cannot resolve 'date_format(timestamp, 'yyyy-MM-dd-HH')' due to data type mismatch: argument 1 requires timestamp type, however, 'timestamp' is of bigint type.; line 1 pos 0;

Is this the full stacktrace? If so I'll need to investigate why we're not getting more error content in the logs.

@ForeverAngry
Author

@the-other-tim-brown that's not the full stack trace, just the error. The machine the program is on is separate and can't post here, so I have to type a lot of this in by hand.

Perhaps you could just let me know: does OneTable have the ability to take a timestamp field, stored as a timestamp, that uses the hour transform, from an Iceberg source and convert that to Delta, or is that not supported? I guess this is what I really want to know.

@the-other-tim-brown
Contributor

This is definitely a bug somewhere in the partition handling logic. We want to support this case, so it just requires developing an end-to-end test case in the repo to reproduce the issue in a way that is easier to debug.

@ForeverAngry
Author

@the-other-tim-brown I was able to get the full stack trace printed out of the machine. I made most of the fields random names like "f1, f2, ...", except for the field the data is partitioned on using the hour transform from Iceberg (the one that is failing). I also redacted the S3 path names.

Just to restate, I'm going from an existing Iceberg table in AWS, using S3 and Glue, to Delta (not Databricks Delta). The existing Iceberg table was created by Athena.

Take a look and let me know what you think or if anything stands out! Any help would be awesome. Other partition functions work fine. The project is really awesome!

2024-03-08 21:24:59 INFO  org.apache.spark.sql.delta.DeltaLog:57 - No delta log found for the Delta table at s3://<redacted>
2024-03-08 21:24:59 INFO  org.apache.spark.sql.delta.InitialSnapshot:57 - [tableId=0e175eb3-ad57-4432-8d71-cd3d58b839a2] Created snapshot InitialSnapshot(path=s3://<redacted>/data/_delta_log, version=-1, metadata=Metadata(081118d2-4985-476f-9za6-07b80359891c,null,null,Format(parquet,Map()),null,List(),Map(),Some(1709933099519)), logSegment=LogSegment(s3://<redacted>/data/_delta_log,-1,List(),List(),None,-1), checksumOpt=None)
2024-03-08 21:25:03 ERROR io.onetable.spi.sync.TableFormatSync:78 - Failed to sync snapshot
org.apache.spark.sql.AnalysisException: cannot resolve 'date_format(timestamp, 'yyyy-MM-dd-HH')' due to data type mismatch: argument 1 requires timestamp type, however, 'timestamp' is of bigint type.; line 1 pos 
0;
'Project [date_format(timestamp#91L, yyyy-MM-dd-HH, Some(Etc/UTC)) AS onetable_partition_col_HOUR_timestamp#135]
+- LocalRelation <empty>, [timestamp#91L, f0#92, f1#93, f2#94, f14#95, f11#96, f4#97, f13#98, f5#99, f6#100, f7#101L, f8#102L, f9#103, f10#104, f11#105, f12#106L, f13#107, f14#108, f15#109L, f16#110, f17#111L, f18#112]

        at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:190) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:175) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$2(TreeNode.scala:535) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:535) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$1(TreeNode.scala:532) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1122) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1121) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:467) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:532) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsUpWithPruning$1(QueryPlan.scala:181) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:193) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:193) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:204) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$3(QueryPlan.scala:209) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at scala.collection.immutable.List.foreach(List.scala:392) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at scala.collection.TraversableLike.map(TraversableLike.scala:238) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at scala.collection.TraversableLike.map$(TraversableLike.scala:231) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at scala.collection.immutable.List.map(List.scala:298) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:209) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:214) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:323) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:214) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUpWithPruning(QueryPlan.scala:181) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:161) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:175) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:94) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:263) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:94) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:91) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:182) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:205) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:330) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:202) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:88) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:196) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:196) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:88) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:86) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:78) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:90) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:88) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.Dataset.withPlan(Dataset.scala:3734) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.Dataset.select(Dataset.scala:1454) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.GeneratedColumn$.validateGeneratedColumns(GeneratedColumn.scala:196) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransactionImpl.$anonfun$verifyNewMetadata$2(OptimisticTransaction.scala:431) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.metering.DeltaLogging.recordFrameProfile(DeltaLogging.scala:141) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.metering.DeltaLogging.recordFrameProfile$(DeltaLogging.scala:139) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransaction.recordFrameProfile(OptimisticTransaction.scala:100) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.metering.DeltaLogging.$anonfun$recordDeltaOperationInternal$1(DeltaLogging.scala:134) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at com.databricks.spark.util.DatabricksLogging.recordOperation(DatabricksLogging.scala:77) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at com.databricks.spark.util.DatabricksLogging.recordOperation$(DatabricksLogging.scala:67) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransaction.recordOperation(OptimisticTransaction.scala:100) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.metering.DeltaLogging.recordDeltaOperationInternal(DeltaLogging.scala:133) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.metering.DeltaLogging.recordDeltaOperation(DeltaLogging.scala:123) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.metering.DeltaLogging.recordDeltaOperation$(DeltaLogging.scala:111) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransaction.recordDeltaOperation(OptimisticTransaction.scala:100) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransactionImpl.verifyNewMetadata(OptimisticTransaction.scala:431) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransactionImpl.verifyNewMetadata$(OptimisticTransaction.scala:402) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransaction.verifyNewMetadata(OptimisticTransaction.scala:100) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransactionImpl.updateMetadataInternal(OptimisticTransaction.scala:382) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransactionImpl.updateMetadataInternal$(OptimisticTransaction.scala:296) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransaction.updateMetadataInternal(OptimisticTransaction.scala:100) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransactionImpl.updateMetadata(OptimisticTransaction.scala:289) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransactionImpl.updateMetadata$(OptimisticTransaction.scala:284) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransaction.updateMetadata(OptimisticTransaction.scala:100) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.delta.DeltaClient$TransactionState.commitTransaction(DeltaClient.java:259) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.delta.DeltaClient$TransactionState.access$300(DeltaClient.java:215) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.delta.DeltaClient.completeSync(DeltaClient.java:198) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.spi.sync.TableFormatSync.getSyncResult(TableFormatSync.java:165) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.spi.sync.TableFormatSync.syncSnapshot(TableFormatSync.java:70) [utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.client.OneTableClient.syncSnapshot(OneTableClient.java:179) [utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.client.OneTableClient.sync(OneTableClient.java:116) [utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.utilities.RunSync.main(RunSync.java:168) [utilities-0.1.0-SNAPSHOT-bundled.jar:?]
2024-03-08 21:25:03 ERROR io.onetable.client.OneTableClient:133 - Sync failed for the following formats DELTA

@the-other-tim-brown
Contributor

Can you confirm the field schema type in the iceberg metadata and in the parquet file metadata? I tried a local example with hour partitioning on a timestamp field and it worked. I'm wondering if the field is actually a long in the schema.

@ForeverAngry
Author

@the-other-tim-brown So, I rebuilt the table again to make sure that there wasn't something odd with the partitions. It's odd: the only change I made when rebuilding the table was using upper case PARTITION BY HOUR(timestamp), but ... now I'm getting a different error:

Output: 2024-03-09 00:06:16 INFO  io.onetable.utilities.RunSync:147 - Running sync for basePath s3://<redacted> for following table formats [DELTA]
2024-03-09 00:06:17 WARN  org.apache.hadoop.util.NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2024-03-09 00:06:19 WARN  org.apache.hadoop.metrics2.impl.MetricsConfig:136 - Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
2024-03-09 00:06:19 INFO  org.apache.spark.sql.delta.storage.DelegatingLogStore:57 - LogStore `LogStoreAdapter(io.delta.storage.S3SingleDriverLogStore)` is used for scheme `s3`
2024-03-09 00:06:20 INFO  org.apache.spark.sql.delta.DeltaLog:57 - Creating initial snapshot without metadata, because the directory is empty
2024-03-09 00:06:20 INFO  org.apache.spark.sql.delta.InitialSnapshot:57 - [tableId=cafeb292-12a0-4a27-9382-d436a0977e46] Created snapshot InitialSnapshot(path=s3://<redacted>/data/_delta_log, version=-1, metadata=Metadata(ad0d3e83-149a-4904-afd8-8d655b49f1f8,null,null,Format(parquet,Map()),null,List(),Map(),Some(1709942780618)), logSegment=LogSegment(s3://<redacted>/data/_delta_log,-1,List(),List(),None,-1), checksumOpt=None)
2024-03-09 00:06:20 INFO  io.onetable.client.OneTableClient:234 - No previous OneTable sync for target. Falling back to snapshot sync.
2024-03-09 00:06:21 INFO  org.apache.iceberg.CatalogUtil:302 - Loading custom FileIO implementation: org.apache.iceberg.aws.s3.S3FileIO
2024-03-09 00:06:22 INFO  org.apache.iceberg.BaseMetastoreTableOperations:199 - Refreshing table metadata from new version: s3://<redacted>/metadata/00000-585a7096-6160-4bc7-a3d9-a60634380092.metadata.json
2024-03-09 00:06:22 INFO  org.apache.iceberg.BaseMetastoreCatalog:67 - Table loaded by catalog: onetable.one_table_demo.bluvector_sensor_zeek_connection_log_dev_test_v3
2024-03-09 00:06:22 INFO  org.apache.iceberg.SnapshotScan:124 - Scanning table onetable.one_table_demo.bluvector_sensor_zeek_connection_log_dev_test_v3 snapshot 802574091683561956 created at 2024-03-08T22:31:16.976+00:00 with filter true
2024-03-09 00:06:23 INFO  org.apache.iceberg.metrics.LoggingMetricsReporter:38 - Received metrics report: ScanReport{tableName=onetable.one_table_demo.bluvector_sensor_zeek_connection_log_dev_test_v3, snapshotId=802574091683561956, filter=true, schemaId=0, projectedFieldIds=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22], projectedFieldNames=[timestamp, uid, sensor_name, 
source_ip, source_port, destination_ip, destination_port, proto, service, duration, source_bytes, destination_bytes, conn_state, local_source, local_destination, missed_bytes, history, source_pkts, source_ip_bytes, destination_pkts, destination_ip_bytes, tunnel_parents], scanMetrics=ScanMetricsResult{totalPlanningDuration=TimerResult{timeUnit=NANOSECONDS, totalDuration=PT0.606986723S, count=1}, resultDataFiles=CounterResult{unit=COUNT, value=116}, resultDeleteFiles=CounterResult{unit=COUNT, value=0}, totalDataManifests=CounterResult{unit=COUNT, value=1}, totalDeleteManifests=CounterResult{unit=COUNT, value=0}, scannedDataManifests=CounterResult{unit=COUNT, value=1}, skippedDataManifests=CounterResult{unit=COUNT, value=0}, totalFileSizeInBytes=CounterResult{unit=BYTES, value=3635593429}, totalDeleteFileSizeInBytes=CounterResult{unit=BYTES, value=0}, skippedDataFiles=CounterResult{unit=COUNT, value=0}, skippedDeleteFiles=CounterResult{unit=COUNT, value=0}, scannedDeleteManifests=CounterResult{unit=COUNT, value=0}, skippedDeleteManifests=CounterResult{unit=COUNT, value=0}, indexedDeleteFiles=CounterResult{unit=COUNT, value=0}, equalityDeleteFiles=CounterResult{unit=COUNT, value=0}, positionalDeleteFiles=CounterResult{unit=COUNT, value=0}}, metadata={iceberg-version=Apache Iceberg 1.4.2 (commit f6bb9173b13424d77e7ad8439b5ef9627e530cb2)}}
2024-03-09 00:06:23 ERROR io.onetable.utilities.RunSync:170 - Error running sync for s3://<redacted>
java.lang.NullPointerException: null
        at io.onetable.iceberg.IcebergPartitionValueConverter.toOneTable(IcebergPartitionValueConverter.java:117) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.iceberg.IcebergSourceClient.fromIceberg(IcebergSourceClient.java:165) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.iceberg.IcebergSourceClient.getCurrentSnapshot(IcebergSourceClient.java:147) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.spi.extractor.ExtractFromSource.extractSnapshot(ExtractFromSource.java:35) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.client.OneTableClient.syncSnapshot(OneTableClient.java:177) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.client.OneTableClient.sync(OneTableClient.java:116) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.utilities.RunSync.main(RunSync.java:168) [utilities-0.1.0-SNAPSHOT-bundled.jar:?]


@the-other-tim-brown
Contributor

@ForeverAngry can you grab the schema of the Iceberg table?

@ForeverAngry
Author

@the-other-tim-brown Here it is, let me know if this wasn't what you were looking for!

{
  "format-version" : 2,
  "table-uuid" : "03e38c37-0ac5-4cae-a66f-49038931f265",
  "location" : "s3://<redacted>",
  "last-sequence-number" : 114,
  "last-updated-ms" : 1709948427780,
  "last-column-id" : 22,
  "current-schema-id" : 0,
  "schemas" : [ {
    "type" : "struct",
    "schema-id" : 0,
    "fields" : [ {
      "id" : 1,
      "name" : "timestamp",
      "required" : false,
      "type" : "timestamp"
    }, {
      "id" : 2,
      "name" : "uid",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 3,
      "name" : "sensor_name",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 4,
      "name" : "source_ip",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 5,
      "name" : "source_port",
      "required" : false,
      "type" : "int"
    }, {
      "id" : 6,
      "name" : "destination_ip",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 7,
      "name" : "destination_port",
      "required" : false,
      "type" : "int"
    }, {
      "id" : 8,
      "name" : "proto",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 9,
      "name" : "service",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 10,
      "name" : "duration",
      "required" : false,
      "type" : "double"
    }, {
      "id" : 11,
      "name" : "source_bytes",
      "required" : false,
      "type" : "long"
    }, {
      "id" : 12,
      "name" : "destination_bytes",
      "required" : false,
      "type" : "long"
    }, {
      "id" : 13,
      "name" : "conn_state",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 14,
      "name" : "local_source",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 15,
      "name" : "local_destination",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 16,
      "name" : "missed_bytes",
      "required" : false,
      "type" : "long"
    }, {
      "id" : 17,
      "name" : "history",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 18,
      "name" : "source_pkts",
      "required" : false,
      "type" : "int"
    }, {
      "id" : 19,
      "name" : "source_ip_bytes",
      "required" : false,
      "type" : "long"
    }, {
      "id" : 20,
      "name" : "destination_pkts",
      "required" : false,
      "type" : "int"
    }, {
      "id" : 21,
      "name" : "destination_ip_bytes",
      "required" : false,
      "type" : "long"
    }, {
      "id" : 22,
      "name" : "tunnel_parents",
      "required" : false,
      "type" : "string"
    } ]
  } ],
  "default-spec-id" : 0,
  "partition-specs" : [ {
    "spec-id" : 0,
    "fields" : [ {
      "name" : "timestamp_hour",
      "transform" : "hour",
      "source-id" : 1,
      "field-id" : 1000
    } ]
  } ],
  "last-partition-id" : 1000,
  "default-sort-order-id" : 0,
  "sort-orders" : [ {
    "order-id" : 0,
    "fields" : [ ]
  } ],
  "properties" : {
    "history.expire.max-snapshot-age-ms" : "259200000",
    "athena.optimize.rewrite.min-data-file-size-bytes" : "50000000",
    "write.delete.parquet.compression-codec" : "GZIP",
    "write.format.default" : "PARQUET",
    "write.parquet.compression-codec" : "GZIP",
    "write.binpack.min-file-size-bytes" : "50000000",
    "write.object-storage.enabled" : "true",
    "write.object-storage.path" : "s3://<redacted>/data"
  },

@the-other-tim-brown
Contributor

Thanks, this is what I was looking for. It lines up with my expectations but I'm not able to reproduce the same error that you saw. Are any of your partitions null or before 1970-01-01? Just trying to think through the common time-based edge cases here.

@ForeverAngry
Author

ForeverAngry commented Mar 10, 2024

@the-other-tim-brown So the table had some null values in the timestamp. I got rid of those. Now I'm back to the original error I started with.

The timestamps include milliseconds but do not have a timezone. Do you think the millisecond precision is an issue?

Also, can you tell me what jar versions you are using when you run your test? Maybe mine are causing an issue.

Here is the stack:

Output: 2024-03-10 02:42:51 INFO  io.onetable.utilities.RunSync:147 - Running sync for basePath s3://<redacted> for following table formats [DELTA]
2024-03-10 02:42:51 WARN  org.apache.hadoop.util.NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2024-03-10 02:42:53 WARN  org.apache.hadoop.metrics2.impl.MetricsConfig:136 - Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
2024-03-10 02:42:54 INFO  org.apache.spark.sql.delta.storage.DelegatingLogStore:57 - LogStore `LogStoreAdapter(io.delta.storage.S3SingleDriverLogStore)` is used for scheme `s3`
2024-03-10 02:42:55 INFO  org.apache.spark.sql.delta.DeltaLog:57 - Creating initial snapshot without metadata, because the directory is empty
2024-03-10 02:42:55 INFO  org.apache.spark.sql.delta.InitialSnapshot:57 - [tableId=5ae55a82-f906-438d-8150-a4a8bbec1bc0] Created snapshot InitialSnapshot(path=s3://<redacted>/data/_delta_log, version=-1, metadata=Metadata(7a370e68-bd4f-4908-8893-6bd01b5969b3,null,null,Format(parquet,Map()),null,List(),Map(),Some(1710038575593)), logSegment=LogSegment(s3://<redacted>/data/_delta_log,-1,List(),List(),None,-1), checksumOpt=None)
2024-03-10 02:42:55 INFO  io.onetable.client.OneTableClient:234 - No previous OneTable sync for target. Falling back to snapshot sync.
2024-03-10 02:42:56 INFO  org.apache.iceberg.CatalogUtil:302 - Loading custom FileIO implementation: org.apache.iceberg.aws.s3.S3FileIO
2024-03-10 02:42:57 INFO  org.apache.iceberg.BaseMetastoreTableOperations:199 - Refreshing table metadata from new version: s3://<redacted>/metadata/00116-d628cb20-f502-445e-a87a-b0639b830259.metadata.json   
2024-03-10 02:42:58 INFO  org.apache.iceberg.BaseMetastoreCatalog:67 - Table loaded by catalog: onetable.titanflow_laker_trusted.bluvector_sensor_zeek_connection_log_dev
2024-03-10 02:42:58 INFO  org.apache.iceberg.SnapshotScan:124 - Scanning table onetable.titanflow_laker_trusted.bluvector_sensor_zeek_connection_log_dev snapshot 8863237097523144183 created at 2024-03-10T02:42:10.959+00:00 with filter true
2024-03-10 02:42:59 INFO  org.apache.iceberg.metrics.LoggingMetricsReporter:38 - Received metrics report: ScanReport{tableName=onetable.titanflow_laker_trusted.bluvector_sensor_zeek_connection_log_dev, snapshotId=8863237097523144183, filter=true, schemaId=0, projectedFieldIds=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22], projectedFieldNames=[timestamp, uid, sensor_name, source_ip, source_port, destination_ip, destination_port, proto, service, duration, source_bytes, destination_bytes, conn_state, local_source, local_destination, missed_bytes, history, source_pkts, source_ip_bytes, destination_pkts, destination_ip_bytes, tunnel_parents], scanMetrics=ScanMetricsResult{totalPlanningDuration=TimerResult{timeUnit=NANOSECONDS, totalDuration=PT0.98888014S, count=1}, resultDataFiles=CounterResult{unit=COUNT, value=456}, resultDeleteFiles=CounterResult{unit=COUNT, value=0}, totalDataManifests=CounterResult{unit=COUNT, value=71}, totalDeleteManifests=CounterResult{unit=COUNT, value=0}, scannedDataManifests=CounterResult{unit=COUNT, value=58}, skippedDataManifests=CounterResult{unit=COUNT, value=13}, totalFileSizeInBytes=CounterResult{unit=BYTES, value=13005026723}, totalDeleteFileSizeInBytes=CounterResult{unit=BYTES, value=0}, skippedDataFiles=CounterResult{unit=COUNT, value=0}, skippedDeleteFiles=CounterResult{unit=COUNT, value=0}, scannedDeleteManifests=CounterResult{unit=COUNT, value=0}, skippedDeleteManifests=CounterResult{unit=COUNT, value=0}, indexedDeleteFiles=CounterResult{unit=COUNT, value=0}, equalityDeleteFiles=CounterResult{unit=COUNT, value=0}, positionalDeleteFiles=CounterResult{unit=COUNT, value=0}}, metadata={iceberg-version=Apache Iceberg 1.4.2 (commit f6bb9173b13424d77e7ad8439b5ef9627e530cb2)}}
2024-03-10 02:42:59 INFO  org.apache.spark.sql.delta.DeltaLog:57 - No delta log found for the Delta table at s3://<redacted>/data/_delta_log
2024-03-10 02:42:59 INFO  org.apache.spark.sql.delta.InitialSnapshot:57 - [tableId=7a370e68-bd4f-4908-8893-6bd01b5969b3] Created snapshot InitialSnapshot(path=s3://<redacted>/data/_delta_log, version=-1, metadata=Metadata(884081c4-c241-417c-858a-ef3714fa54dc,null,null,Format(parquet,Map()),null,List(),Map(),Some(1710038579829)), logSegment=LogSegment(s3://<redacted>/data/_delta_log,-1,List(),List(),None,-1), checksumOpt=None)
2024-03-10 02:43:05 ERROR io.onetable.spi.sync.TableFormatSync:78 - Failed to sync snapshot
org.apache.spark.sql.AnalysisException: cannot resolve 'date_format(timestamp, 'yyyy-MM-dd-HH')' due to data type mismatch: argument 1 requires timestamp type, however, 'timestamp' is of bigint type.; line 1 pos 0;
'Project [date_format(timestamp#91L, yyyy-MM-dd-HH, Some(Etc/UTC)) AS onetable_partition_col_HOUR_timestamp#135]
+- LocalRelation <empty>, [timestamp#91L, uid#92, sensor_name#93, source_ip#94, source_port#95, destination_ip#96, destination_port#97, proto#98, service#99, duration#100, source_bytes#101L, destination_bytes#102L, conn_state#103, local_source#104, local_destination#105, missed_bytes#106L, history#107, source_pkts#108, source_ip_bytes#109L, destination_pkts#110, destination_ip_bytes#111L, tunnel_parents#112]

        at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:190) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:175) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$2(TreeNode.scala:535) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:535) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$1(TreeNode.scala:532) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1122) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1121) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:467) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:532) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsUpWithPruning$1(QueryPlan.scala:181) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:193) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:193) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:204) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$3(QueryPlan.scala:209) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at scala.collection.immutable.List.foreach(List.scala:392) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at scala.collection.TraversableLike.map(TraversableLike.scala:238) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at scala.collection.TraversableLike.map$(TraversableLike.scala:231) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at scala.collection.immutable.List.map(List.scala:298) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:209) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:214) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:323) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:214) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUpWithPruning(QueryPlan.scala:181) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:161) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:175) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:94) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:263) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:94) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:91) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:182) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:205) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:330) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:202) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:88) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:196) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:196) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:88) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:86) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:78) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:90) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:88) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.Dataset.withPlan(Dataset.scala:3734) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.Dataset.select(Dataset.scala:1454) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.GeneratedColumn$.validateGeneratedColumns(GeneratedColumn.scala:196) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransactionImpl.$anonfun$verifyNewMetadata$2(OptimisticTransaction.scala:431) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.metering.DeltaLogging.recordFrameProfile(DeltaLogging.scala:141) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.metering.DeltaLogging.recordFrameProfile$(DeltaLogging.scala:139) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransaction.recordFrameProfile(OptimisticTransaction.scala:100) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.metering.DeltaLogging.$anonfun$recordDeltaOperationInternal$1(DeltaLogging.scala:134) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at com.databricks.spark.util.DatabricksLogging.recordOperation(DatabricksLogging.scala:77) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at com.databricks.spark.util.DatabricksLogging.recordOperation$(DatabricksLogging.scala:67) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransaction.recordOperation(OptimisticTransaction.scala:100) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.metering.DeltaLogging.recordDeltaOperationInternal(DeltaLogging.scala:133) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.metering.DeltaLogging.recordDeltaOperation(DeltaLogging.scala:123) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.metering.DeltaLogging.recordDeltaOperation$(DeltaLogging.scala:111) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransaction.recordDeltaOperation(OptimisticTransaction.scala:100) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransactionImpl.verifyNewMetadata(OptimisticTransaction.scala:431) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransactionImpl.verifyNewMetadata$(OptimisticTransaction.scala:402) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransaction.verifyNewMetadata(OptimisticTransaction.scala:100) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransactionImpl.updateMetadataInternal(OptimisticTransaction.scala:382) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransactionImpl.updateMetadataInternal$(OptimisticTransaction.scala:296) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransaction.updateMetadataInternal(OptimisticTransaction.scala:100) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransactionImpl.updateMetadata(OptimisticTransaction.scala:289) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransactionImpl.updateMetadata$(OptimisticTransaction.scala:284) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransaction.updateMetadata(OptimisticTransaction.scala:100) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.delta.DeltaClient$TransactionState.commitTransaction(DeltaClient.java:259) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.delta.DeltaClient$TransactionState.access$300(DeltaClient.java:215) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.delta.DeltaClient.completeSync(DeltaClient.java:198) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.spi.sync.TableFormatSync.getSyncResult(TableFormatSync.java:165) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.spi.sync.TableFormatSync.syncSnapshot(TableFormatSync.java:70) [utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.client.OneTableClient.syncSnapshot(OneTableClient.java:179) [utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.client.OneTableClient.sync(OneTableClient.java:116) [utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.utilities.RunSync.main(RunSync.java:168) [utilities-0.1.0-SNAPSHOT-bundled.jar:?]
2024-03-10 02:43:05 ERROR io.onetable.client.OneTableClient:133 - Sync failed for the following formats DELTA

@the-other-tim-brown
Contributor

@ForeverAngry can you confirm that you are on the latest main branch when testing? If so, can you rebuild off of this branch and test again? #382 I think there is some issue in how we're handling timestamps with/without timezone set.

@ForeverAngry
Author

@the-other-tim-brown I re-pulled the main branch and rebuilt the jar. Note: I noticed that the run sync class changed from io.onetable.utilities.RunSync to org.apache.xtable.utilities.RunSync, just an FYI; I wasn't sure if this was expected or not. Anyway, it looks like I'm still getting the same error. Let me know what you think.

Output: 2024-03-10 13:06:33 INFO  org.apache.xtable.utilities.RunSync:147 - Running sync for basePath s3://<redacted> for following table formats [DELTA]
2024-03-10 13:06:35 WARN  org.apache.hadoop.util.NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2024-03-10 13:06:39 WARN  org.apache.hadoop.metrics2.impl.MetricsConfig:136 - Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
2024-03-10 13:06:40 WARN  org.apache.hadoop.fs.s3a.SDKV2Upgrade:39 - Directly referencing AWS SDK V1 credential provider com.amazonaws.auth.DefaultAWSCredentialsProviderChain. AWS SDK V1 credential providers will be removed once S3A is upgraded to SDK V2
2024-03-10 13:06:40 INFO  org.apache.spark.sql.delta.storage.DelegatingLogStore:60 - LogStore `LogStoreAdapter(io.delta.storage.S3SingleDriverLogStore)` is used for scheme `s3`
2024-03-10 13:06:42 INFO  org.apache.spark.sql.delta.DeltaLog:60 - Creating initial snapshot without metadata, because the directory is empty
2024-03-10 13:06:44 INFO  org.apache.spark.sql.delta.InitialSnapshot:60 - [tableId=37adbca2-cc8a-4d2f-a2b2-5a891368c9a5] Created snapshot InitialSnapshot(path=s3://<redacted>/data/_delta_log, version=-1, metadata=Metadata(608fba2e-531b-455c-8a92-e14f1086a557,null,null,Format(parquet,Map()),null,List(),Map(),Some(1710076004533)), logSegment=LogSegment(s3://<redacted>/data/_delta_log,-1,List(),None,-1), checksumOpt=None)
2024-03-10 13:06:44 INFO  org.apache.xtable.client.OneTableClient:234 - No previous OneTable sync for target. Falling back to snapshot sync.
2024-03-10 13:06:45 INFO  org.apache.iceberg.CatalogUtil:302 - Loading custom FileIO implementation: org.apache.iceberg.aws.s3.S3FileIO
2024-03-10 13:06:46 INFO  org.apache.iceberg.BaseMetastoreTableOperations:199 - Refreshing table metadata from new version: s3://<redacted>/metadata/00116-d628cb20-f502-445e-a87a-b0639b830259.metadata.json   
2024-03-10 13:06:47 INFO  org.apache.iceberg.BaseMetastoreCatalog:67 - Table loaded by catalog: onetable.<redacted>
2024-03-10 13:06:47 INFO  org.apache.iceberg.SnapshotScan:124 - Scanning table onetable.<redacted> snapshot 8863237097523144183 created at 2024-03-10T02:42:10.959+00:00 with filter true
2024-03-10 13:06:48 INFO  org.apache.iceberg.metrics.LoggingMetricsReporter:38 - Received metrics report: ScanReport{tableName=onetable.<redacted>, snapshotId=8863237097523144183, filter=true, schemaId=0, projectedFieldIds=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22], projectedFieldNames=[timestamp, uid, sensor_name, source_ip, source_port, destination_ip, destination_port, proto, service, duration, source_bytes, destination_bytes, conn_state, local_source, local_destination, missed_bytes, history, source_pkts, source_ip_bytes, destination_pkts, destination_ip_bytes, tunnel_parents], scanMetrics=ScanMetricsResult{totalPlanningDuration=TimerResult{timeUnit=NANOSECONDS, totalDuration=PT1.291327918S, count=1}, resultDataFiles=CounterResult{unit=COUNT, value=456}, resultDeleteFiles=CounterResult{unit=COUNT, value=0}, totalDataManifests=CounterResult{unit=COUNT, value=71}, totalDeleteManifests=CounterResult{unit=COUNT, value=0}, scannedDataManifests=CounterResult{unit=COUNT, value=58}, skippedDataManifests=CounterResult{unit=COUNT, value=13}, totalFileSizeInBytes=CounterResult{unit=BYTES, value=13005026723}, totalDeleteFileSizeInBytes=CounterResult{unit=BYTES, value=0}, skippedDataFiles=CounterResult{unit=COUNT, value=0}, skippedDeleteFiles=CounterResult{unit=COUNT, value=0}, scannedDeleteManifests=CounterResult{unit=COUNT, value=0}, skippedDeleteManifests=CounterResult{unit=COUNT, value=0}, indexedDeleteFiles=CounterResult{unit=COUNT, value=0}, equalityDeleteFiles=CounterResult{unit=COUNT, value=0}, positionalDeleteFiles=CounterResult{unit=COUNT, value=0}}, metadata={iceberg-version=Apache Iceberg 1.4.2 (commit f6bb9173b13424d77e7ad8439b5ef9627e530cb2)}}   
2024-03-10 13:06:49 INFO  org.apache.spark.sql.delta.DeltaLog:60 - No delta log found for the Delta table at s3://<redacted>/data/_delta_log
2024-03-10 13:06:49 INFO  org.apache.spark.sql.delta.InitialSnapshot:60 - [tableId=608fba2e-531b-455c-8a92-e14f1086a557] Created snapshot InitialSnapshot(path=s3://<redacted>/data/_delta_log, version=-1, metadata=Metadata(6b778d50-11e8-4490-9892-2d971f078703,null,null,Format(parquet,Map()),null,List(),Map(),Some(1710076009117)), logSegment=LogSegment(s3://<redacted>/data/_delta_log,-1,List(),None,-1), checksumOpt=None)
2024-03-10 13:06:59 ERROR org.apache.xtable.spi.sync.TableFormatSync:78 - Failed to sync snapshot
org.apache.spark.sql.AnalysisException: [DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "date_format(timestamp, yyyy-MM-dd-HH)" due to data type mismatch: Parameter 1 requires the "TIMESTAMP" type, however "timestamp" has the type "BIGINT".; line 1 pos 0
        at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.dataTypeMismatch(package.scala:73) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5(CheckAnalysis.scala:269) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5$adapted(CheckAnalysis.scala:256) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:295) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:294) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:294) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at scala.collection.Iterator.foreach(Iterator.scala:943) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at scala.collection.Iterator.foreach$(Iterator.scala:943) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at scala.collection.IterableLike.foreach(IterableLike.scala:74) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at scala.collection.IterableLike.foreach$(IterableLike.scala:73) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at scala.collection.AbstractIterable.foreach(Iterable.scala:56) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:294) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$4(CheckAnalysis.scala:256) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$4$adapted(CheckAnalysis.scala:256) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at scala.collection.immutable.Stream.foreach(Stream.scala:533) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$1(CheckAnalysis.scala:256) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$1$adapted(CheckAnalysis.scala:163) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:295) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis0(CheckAnalysis.scala:163) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis0$(CheckAnalysis.scala:160) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis0(Analyzer.scala:188) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:156) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:146) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:188) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.GeneratedColumn$.validateColumnReferences(GeneratedColumn.scala:185) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.GeneratedColumn$.$anonfun$validateGeneratedColumns$2(GeneratedColumn.scala:214) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at scala.collection.immutable.List.map(List.scala:293) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.GeneratedColumn$.validateGeneratedColumns(GeneratedColumn.scala:210) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransactionImpl.$anonfun$assertMetadata$2(OptimisticTransaction.scala:611) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.metering.DeltaLogging.recordFrameProfile(DeltaLogging.scala:140) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.metering.DeltaLogging.recordFrameProfile$(DeltaLogging.scala:138) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransaction.recordFrameProfile(OptimisticTransaction.scala:137) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.metering.DeltaLogging.$anonfun$recordDeltaOperationInternal$1(DeltaLogging.scala:133) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at com.databricks.spark.util.DatabricksLogging.recordOperation(DatabricksLogging.scala:128) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at com.databricks.spark.util.DatabricksLogging.recordOperation$(DatabricksLogging.scala:117) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransaction.recordOperation(OptimisticTransaction.scala:137) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.metering.DeltaLogging.recordDeltaOperationInternal(DeltaLogging.scala:132) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.metering.DeltaLogging.recordDeltaOperation(DeltaLogging.scala:122) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.metering.DeltaLogging.recordDeltaOperation$(DeltaLogging.scala:112) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransaction.recordDeltaOperation(OptimisticTransaction.scala:137) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransactionImpl.assertMetadata(OptimisticTransaction.scala:611) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransactionImpl.assertMetadata$(OptimisticTransaction.scala:583) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransaction.assertMetadata(OptimisticTransaction.scala:137) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransactionImpl.updateMetadataInternal(OptimisticTransaction.scala:565) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransactionImpl.updateMetadataInternal$(OptimisticTransaction.scala:395) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransaction.updateMetadataInternal(OptimisticTransaction.scala:137) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransactionImpl.updateMetadata(OptimisticTransaction.scala:379) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransactionImpl.updateMetadata$(OptimisticTransaction.scala:372) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransaction.updateMetadata(OptimisticTransaction.scala:137) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.xtable.delta.DeltaClient$TransactionState.commitTransaction(DeltaClient.java:261) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.xtable.delta.DeltaClient$TransactionState.access$300(DeltaClient.java:217) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.xtable.delta.DeltaClient.completeSync(DeltaClient.java:199) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.xtable.spi.sync.TableFormatSync.getSyncResult(TableFormatSync.java:165) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.xtable.spi.sync.TableFormatSync.syncSnapshot(TableFormatSync.java:70) [utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.xtable.client.OneTableClient.syncSnapshot(OneTableClient.java:179) [utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.xtable.client.OneTableClient.sync(OneTableClient.java:116) [utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.xtable.utilities.RunSync.main(RunSync.java:168) [utilities-0.1.0-SNAPSHOT-bundled.jar:?]
2024-03-10 13:06:59 ERROR org.apache.xtable.client.OneTableClient:133 - Sync failed for the following formats DELTA

@the-other-tim-brown
Copy link
Contributor

@ForeverAngry the change in paths is expected and required as we transition this project to the Apache Software Foundation. We will need to do a full revamp of the docs/references once this process is complete.

Did you try building off of the branch I linked as well?

@ForeverAngry
Copy link
Author

@the-other-tim-brown Well, I feel a bit foolish - for some reason I thought you only wanted me to re-pull and test the main branch. Sorry, I'll pull the branch you linked and let you know!

Again, I appreciate the dialog and support - you're awesome!

@ForeverAngry
Copy link
Author

@the-other-tim-brown Okay, I rebuilt the jar using the branch you provided; here is the stack trace:

Output: 2024-03-10 15:59:14 INFO  io.onetable.utilities.RunSync:147 - Running sync for basePath s3://<redacted> for following table formats [DELTA]
2024-03-10 15:59:15 WARN  org.apache.hadoop.util.NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2024-03-10 15:59:21 WARN  org.apache.hadoop.metrics2.impl.MetricsConfig:136 - Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
2024-03-10 15:59:22 INFO  org.apache.spark.sql.delta.storage.DelegatingLogStore:60 - LogStore `LogStoreAdapter(io.delta.storage.S3SingleDriverLogStore)` is used for scheme `s3`
2024-03-10 15:59:23 INFO  org.apache.spark.sql.delta.DeltaLog:60 - Creating initial snapshot without metadata, because the directory is empty
2024-03-10 15:59:25 INFO  org.apache.spark.sql.delta.InitialSnapshot:60 - [tableId=f5cd0f71-bf22-4b57-8447-b5520c7b7d99] Created snapshot InitialSnapshot(path=s3://<redacted>/data/_delta_log, version=-1, metadata=Metadata(23f44368-3cd6-451f-af9a-d87c7cd8ee9a,null,null,Format(parquet,Map()),null,List(),Map(),Some(1710086365401)), logSegment=LogSegment(s3://<redacted>/data/_delta_log,-1,List(),None,-1), checksumOpt=None)
2024-03-10 15:59:25 INFO  io.onetable.client.OneTableClient:234 - No previous OneTable sync for target. Falling back to snapshot sync.
2024-03-10 15:59:26 INFO  org.apache.iceberg.CatalogUtil:302 - Loading custom FileIO implementation: org.apache.iceberg.aws.s3.S3FileIO
2024-03-10 15:59:27 INFO  org.apache.iceberg.BaseMetastoreTableOperations:199 - Refreshing table metadata from new version: s3://<redacted>/metadata/00116-d628cb20-f502-445e-a87a-b0639b830259.metadata.json   
2024-03-10 15:59:27 INFO  org.apache.iceberg.BaseMetastoreCatalog:67 - Table loaded by catalog: onetable.<redacted>
2024-03-10 15:59:27 INFO  org.apache.iceberg.SnapshotScan:124 - Scanning table onetable.<redacted> snapshot 8863237097523144183 created at 2024-03-10T02:42:10.959+00:00 with filter true
2024-03-10 15:59:29 INFO  org.apache.iceberg.metrics.LoggingMetricsReporter:38 - Received metrics report: ScanReport{tableName=onetable.<redacted>, snapshotId=8863237097523144183, filter=true, schemaId=0, projectedFieldIds=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22], projectedFieldNames=[timestamp, uid, sensor_name, source_ip, source_port, destination_ip, destination_port, proto, service, duration, source_bytes, destination_bytes, conn_state, local_source, local_destination, missed_bytes, history, source_pkts, source_ip_bytes, destination_pkts, destination_ip_bytes, tunnel_parents], scanMetrics=ScanMetricsResult{totalPlanningDuration=TimerResult{timeUnit=NANOSECONDS, totalDuration=PT1.189127061S, count=1}, resultDataFiles=CounterResult{unit=COUNT, value=456}, resultDeleteFiles=CounterResult{unit=COUNT, value=0}, totalDataManifests=CounterResult{unit=COUNT, value=71}, totalDeleteManifests=CounterResult{unit=COUNT, value=0}, scannedDataManifests=CounterResult{unit=COUNT, value=58}, skippedDataManifests=CounterResult{unit=COUNT, value=13}, totalFileSizeInBytes=CounterResult{unit=BYTES, value=13005026723}, totalDeleteFileSizeInBytes=CounterResult{unit=BYTES, value=0}, skippedDataFiles=CounterResult{unit=COUNT, value=0}, skippedDeleteFiles=CounterResult{unit=COUNT, value=0}, scannedDeleteManifests=CounterResult{unit=COUNT, value=0}, skippedDeleteManifests=CounterResult{unit=COUNT, value=0}, indexedDeleteFiles=CounterResult{unit=COUNT, value=0}, equalityDeleteFiles=CounterResult{unit=COUNT, value=0}, positionalDeleteFiles=CounterResult{unit=COUNT, value=0}}, metadata={iceberg-version=Apache Iceberg 1.4.2 (commit f6bb9173b13424d77e7ad8439b5ef9627e530cb2)}}
2024-03-10 15:59:29 INFO  org.apache.spark.sql.delta.DeltaLog:60 - No delta log found for the Delta table at s3://<redacted>/data/_delta_log
2024-03-10 15:59:29 INFO  org.apache.spark.sql.delta.InitialSnapshot:60 - [tableId=23f44368-3cd6-451f-af9a-d87c7cd8ee9a] Created snapshot InitialSnapshot(path=s3://<redacted>/data/_delta_log, version=-1, metadata=Metadata(f152cc9d-0cd4-4692-bc95-68866870b6ef,null,null,Format(parquet,Map()),null,List(),Map(),Some(1710086369624)), logSegment=LogSegment(s3://<redacted>/data/_delta_log,-1,List(),None,-1), checksumOpt=None)
2024-03-10 15:59:29 ERROR io.onetable.spi.sync.TableFormatSync:78 - Failed to sync snapshot
io.onetable.exception.NotSupportedException: Unsupported type: timestamp_ntz
        at io.onetable.delta.DeltaSchemaExtractor.toOneSchema(DeltaSchemaExtractor.java:274) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.delta.DeltaSchemaExtractor.lambda$toOneSchema$9(DeltaSchemaExtractor.java:203) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) ~[?:1.8.0_342]
        at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[?:1.8.0_342]
        at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948) ~[?:1.8.0_342]
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) ~[?:1.8.0_342]
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) ~[?:1.8.0_342]
        at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) ~[?:1.8.0_342]
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:1.8.0_342]
        at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566) ~[?:1.8.0_342]
        at io.onetable.delta.DeltaSchemaExtractor.toOneSchema(DeltaSchemaExtractor.java:217) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.delta.DeltaSchemaExtractor.toOneSchema(DeltaSchemaExtractor.java:145) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.delta.DeltaClient$TransactionState.addColumn(DeltaClient.java:242) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.delta.DeltaClient$TransactionState.access$200(DeltaClient.java:217) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.delta.DeltaClient.syncPartitionSpec(DeltaClient.java:170) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.spi.sync.TableFormatSync.getSyncResult(TableFormatSync.java:158) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.spi.sync.TableFormatSync.syncSnapshot(TableFormatSync.java:70) [utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.client.OneTableClient.syncSnapshot(OneTableClient.java:179) [utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.client.OneTableClient.sync(OneTableClient.java:116) [utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.utilities.RunSync.main(RunSync.java:168) [utilities-0.1.0-SNAPSHOT-bundled.jar:?]
2024-03-10 15:59:29 ERROR io.onetable.client.OneTableClient:133 - Sync failed for the following formats DELTA

@the-other-tim-brown
Copy link
Contributor

@ForeverAngry I missed a spot but just updated the branch to handle this error you are seeing.

@ForeverAngry
Copy link
Author

@the-other-tim-brown awesome!! I'll try it here in the next hour and let you know!

@ForeverAngry
Copy link
Author

@the-other-tim-brown So I pulled that branch and built the jar. Good news is - the sync completes successfully! But when I try to run the sync again, I get the error below. However, if I delete the _delta_log folder and try to sync again, the sync works. Is this expected behavior, or is something else going on?

Output: 2024-03-11 20:05:55 INFO  io.onetable.utilities.RunSync:147 - Running sync for basePath s3:/<redacted> for following table formats [DELTA]
2024-03-11 20:05:56 WARN  org.apache.hadoop.util.NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2024-03-11 20:06:01 WARN  org.apache.hadoop.metrics2.impl.MetricsConfig:136 - Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
2024-03-11 20:06:02 INFO  org.apache.spark.sql.delta.storage.DelegatingLogStore:60 - LogStore `LogStoreAdapter(io.delta.storage.S3SingleDriverLogStore)` is used for scheme `s3`
2024-03-11 20:06:03 INFO  org.apache.spark.sql.delta.DeltaLog:60 - Loading version 0.
2024-03-11 20:06:06 INFO  org.apache.spark.sql.delta.DeltaLogFileIndex:60 - Created DeltaLogFileIndex(JSON, numFilesInSegment: 1, totalFileSize: 436295)
2024-03-11 20:06:12 INFO  org.apache.spark.sql.delta.Snapshot:60 - [tableId=520ce65c-9f62-4aa8-89d1-d2b318ce6c7d] Created snapshot Snapshot(path=s3:/<redacted>/data/_delta_log, version=0, metadata=Metadata(d40894e6-845f-47aa-942e-2f51df8b3d02,<redacted>,,Format(parquet,Map()),{"type":"struct","fields":[{"name":"timestamp","type":"timestamp_ntz","nullable":true,"metadata":{}},{"name":"uid","type":"string","nullable":true,"metadata":{}},{"name":"sensor_name","type":"string","nullable":true,"metadata":{}},{"name":"source_ip","type":"string","nullable":true,"metadata":{}},{"name":"source_port","type":"integer","nullable":true,"metadata":{}},{"name":"destination_ip","type":"string","nullable":true,"metadata":{}},{"name":"destination_port","type":"integer","nullable":true,"metadata":{}},{"name":"proto","type":"string","nullable":true,"metadata":{}},{"name":"service","type":"string","nullable":true,"metadata":{}},{"name":"duration","type":"double","nullable":true,"metadata":{}},{"name":"source_bytes","type":"long","nullable":true,"metadata":{}},{"name":"destination_bytes","type":"long","nullable":true,"metadata":{}},{"name":"conn_state","type":"string","nullable":true,"metadata":{}},{"name":"local_source","type":"string","nullable":true,"metadata":{}},{"name":"local_destination","type":"string","nullable":true,"metadata":{}},{"name":"missed_bytes","type":"long","nullable":true,"metadata":{}},{"name":"history","type":"string","nullable":true,"metadata":{}},{"name":"source_pkts","type":"integer","nullable":true,"metadata":{}},{"name":"source_ip_bytes","type":"long","nullable":true,"metadata":{}},{"name":"destination_pkts","type":"integer","nullable":true,"metadata":{}},{"name":"destination_ip_bytes","type":"long","nullable":true,"metadata":{}},{"name":"tunnel_parents","type":"string","nullable":true,"metadata":{}},{"name":"onetable_partition_col_HOUR_timestamp","type":"string","nullable":true,"metadata":{"delta.generationExpression":"DATE_FORMAT(timestamp, 'yyyy-MM-dd-HH')"}}]},List(onetable_partition_col_HOUR_timestamp),Map(INFLIGHT_COMMITS_TO_CONSIDER_FOR_NEXT_SYNC -> , delta.logRetentionDuration -> interval 168 hours, ONETABLE_LAST_INSTANT_SYNCED -> 2024-03-11T19:02:28.102Z),Some(1710183748102)), logSegment=LogSegment(s3:/<redacted>/data/_delta_log,0,WrappedArray(S3AFileStatus{path=s3:/<redacted>/data/_delta_log/00000000000000000000.json; isDirectory=false; length=436295; replication=1; blocksize=33554432; modification_time=1710184335000; access_time=0; owner=root; group=root; permission=rw-rw-rw-; isSymlink=false; hasAcl=false; isEncrypted=true; isErasureCoded=false} isEmptyDirectory=FALSE eTag=f3fe01c5ffc81f267b3b9932d9f17e27 
versionId=null),None,1710184335000), checksumOpt=None)
2024-03-11 20:06:14 INFO  org.apache.iceberg.CatalogUtil:302 - Loading custom FileIO implementation: org.apache.iceberg.aws.s3.S3FileIO
2024-03-11 20:06:16 INFO  org.apache.iceberg.BaseMetastoreTableOperations:199 - Refreshing table metadata from new version: s3:/<redacted>/metadata/00142-6be157e8-3143-4086-a00a-08a2d56f4f4c.metadata.json
2024-03-11 20:06:17 INFO  org.apache.iceberg.BaseMetastoreCatalog:67 - Table loaded by catalog: onetable.<redacted>.<redacted>
2024-03-11 20:06:17 INFO  io.onetable.client.OneTableClient:240 - Incremental sync is not safe from instant Optional[2024-03-11T19:02:28.102Z]. Falling back to snapshot sync.
2024-03-11 20:06:17 INFO  org.apache.iceberg.SnapshotScan:124 - Scanning table onetable.<redacted>.<redacted> snapshot 5739292547283486572 created at 2024-03-11T20:03:10.657+00:00 with filter true        
2024-03-11 20:06:18 INFO  org.apache.iceberg.metrics.LoggingMetricsReporter:38 - Received metrics report: ScanReport{tableName=onetable.<redacted>.<redacted>, snapshotId=5739292547283486572, filter=true, 
schemaId=0, projectedFieldIds=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22], projectedFieldNames=[timestamp, uid, sensor_name, source_ip, source_port, destination_ip, destination_port, proto, service, duration, source_bytes, destination_bytes, conn_state, local_source, local_destination, missed_bytes, history, source_pkts, source_ip_bytes, destination_pkts, destination_ip_bytes, tunnel_parents], scanMetrics=ScanMetricsResult{totalPlanningDuration=TimerResult{timeUnit=NANOSECONDS, totalDuration=PT1.380608267S, count=1}, resultDataFiles=CounterResult{unit=COUNT, value=1724}, resultDeleteFiles=CounterResult{unit=COUNT, value=0}, totalDataManifests=CounterResult{unit=COUNT, value=84}, totalDeleteManifests=CounterResult{unit=COUNT, value=0}, scannedDataManifests=CounterResult{unit=COUNT, value=84}, skippedDataManifests=CounterResult{unit=COUNT, value=0}, totalFileSizeInBytes=CounterResult{unit=BYTES, value=13496158287}, totalDeleteFileSizeInBytes=CounterResult{unit=BYTES, value=0}, skippedDataFiles=CounterResult{unit=COUNT, value=0}, skippedDeleteFiles=CounterResult{unit=COUNT, value=0}, scannedDeleteManifests=CounterResult{unit=COUNT, value=0}, skippedDeleteManifests=CounterResult{unit=COUNT, value=0}, indexedDeleteFiles=CounterResult{unit=COUNT, value=0}, equalityDeleteFiles=CounterResult{unit=COUNT, value=0}, positionalDeleteFiles=CounterResult{unit=COUNT, value=0}}, metadata={iceberg-version=Apache Iceberg 1.4.2 (commit f6bb9173b13424d77e7ad8439b5ef9627e530cb2)}}
2024-03-11 20:06:19 WARN  org.apache.spark.sql.catalyst.util.package:72 - Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
2024-03-11 20:06:28 ERROR io.onetable.spi.sync.TableFormatSync:78 - Failed to sync snapshot
org.apache.spark.sql.delta.ProtocolDowngradeException: Protocol version cannot be downgraded from (3,7,[timestampNtz],[generatedColumns,timestampNtz]) to (1,4)
        at org.apache.spark.sql.delta.OptimisticTransactionImpl.updateMetadataInternal(OptimisticTransaction.scala:479) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransactionImpl.updateMetadataInternal$(OptimisticTransaction.scala:395) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransaction.updateMetadataInternal(OptimisticTransaction.scala:137) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransactionImpl.updateMetadata(OptimisticTransaction.scala:379) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransactionImpl.updateMetadata$(OptimisticTransaction.scala:372) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at org.apache.spark.sql.delta.OptimisticTransaction.updateMetadata(OptimisticTransaction.scala:137) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.delta.DeltaClient$TransactionState.commitTransaction(DeltaClient.java:261) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.delta.DeltaClient$TransactionState.access$300(DeltaClient.java:217) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.delta.DeltaClient.completeSync(DeltaClient.java:199) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.spi.sync.TableFormatSync.getSyncResult(TableFormatSync.java:165) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.spi.sync.TableFormatSync.syncSnapshot(TableFormatSync.java:70) [utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.client.OneTableClient.syncSnapshot(OneTableClient.java:179) [utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.client.OneTableClient.sync(OneTableClient.java:116) [utilities-0.1.0-SNAPSHOT-bundled.jar:?]
        at io.onetable.utilities.RunSync.main(RunSync.java:168) [utilities-0.1.0-SNAPSHOT-bundled.jar:?]
2024-03-11 20:06:28 ERROR io.onetable.client.OneTableClient:133 - Sync failed for the following formats DELTA

@the-other-tim-brown
Copy link
Contributor

@ForeverAngry we need to handle the protocol versions better. Right now the protocol is hardcoded to (1,4) when creating the table, but when we add the timestamp_ntz column it gets upgraded to (3,7). We should remove the hardcoding and compute the versions based on the needs of the table.
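
As a rough sketch of that computation (a hypothetical helper, not the actual XTable or Delta API; the version mapping follows the Delta protocol and matches the (3,7,[timestampNtz],[generatedColumns,timestampNtz]) protocol shown in the log above):

import java.util.Set;

// Hypothetical helper: derives the minimum Delta reader/writer protocol versions
// from the features the target table actually uses, instead of hardcoding (1,4).
public final class DeltaProtocolVersions {

  public static int[] minProtocolFor(Set<String> features) {
    int minReader = 1;
    int minWriter = 2;
    if (features.contains("generatedColumns")) {
      // Generated columns require writer version 4; the reader version is unaffected.
      minWriter = Math.max(minWriter, 4);
    }
    if (features.contains("timestampNtz")) {
      // TIMESTAMP_NTZ is a reader-writer table feature: reader 3, writer 7.
      minReader = Math.max(minReader, 3);
      minWriter = Math.max(minWriter, 7);
    }
    return new int[] {minReader, minWriter};
  }
}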

@ForeverAngry
Copy link
Author

@the-other-tim-brown Makes sense, I can see if I can get it figured out and then submit a PR. Any suggestions (or existing examples in the codebase for other types) on where to start actually computing the values would be appreciated!

@the-other-tim-brown
Copy link
Contributor

the-other-tim-brown commented Mar 12, 2024

https://github.com/apache/incubator-xtable/blob/main/core/src/main/java/org/apache/xtable/delta/DeltaClient.java#L69-L71 is where we are currently hardcoding the values.

I would create a method that outputs the min reader and writer versions based on the features used (generated columns, timestamp_ntz, etc.). Maybe there is already a public method in Delta that does this and could simply be reused? Something is upgrading the table, so I am thinking we can try to call into that method.

You can add some testing here: https://github.com/apache/incubator-xtable/blob/main/core/src/test/java/org/apache/xtable/delta/TestDeltaSync.java#L102

Also feel free to take over my branch, or just pull those small timestamp_ntz changes into your own branch, so you can test that case.
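
And, as an illustration of the kind of test that could back such a method (again against the hypothetical helper sketched above rather than the real TestDeltaSync hooks):

import static org.junit.jupiter.api.Assertions.assertArrayEquals;

import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import org.junit.jupiter.api.Test;

// Hypothetical JUnit 5 test for the protocol-version helper above; a real test
// would exercise DeltaClient through TestDeltaSync instead.
class TestDeltaProtocolVersions {

  @Test
  void timestampNtzAndGeneratedColumnsNeedReader3Writer7() {
    Set<String> features = new HashSet<>(Arrays.asList("generatedColumns", "timestampNtz"));
    assertArrayEquals(new int[] {3, 7}, DeltaProtocolVersions.minProtocolFor(features));
  }

  @Test
  void plainTableKeepsDefaultProtocol() {
    assertArrayEquals(
        new int[] {1, 2}, DeltaProtocolVersions.minProtocolFor(Collections.emptySet()));
  }
}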

@ForeverAngry
Copy link
Author

ForeverAngry commented Apr 29, 2024

@the-other-tim-brown I added the changes to support this feature as you outlined. Full disclosure, this is my first time contributing to an open source project. Also, I'm not overly familiar with large-scale Java projects - so you will have to bear with me. Here is my fork - https://github.com/ForeverAngry/incubator-xtable.git - can you help me understand how I would add a unit test? It wasn't entirely clear to me, since the method I adjusted is part of a larger call stack.

One final note - I pulled the main branch and made the changes you put on the timestamp_ntz branch, since that branch was pretty far out of date with the current project at this point.

Let me know what you think - any guidance is welcomed!

@the-other-tim-brown
Copy link
Contributor

@ForeverAngry can you create a pull request from your fork? It should allow you to set the main branch in this repo as a target. I will be able to better review and add comments around testing that way.

@ForeverAngry
Copy link
Author

@the-other-tim-brown will do! Give me a few min!

@ForeverAngry
Copy link
Author

@the-other-tim-brown done, take a look: gh pr checkout 428

@psheets
Copy link

psheets commented Jun 7, 2024

Maybe I'm doing something wrong here, but when I try to convert Iceberg to Delta using a Glue catalog, I keep getting the error that the class constructor for catalog.Catalog is not found. I've tried adding the iceberg-aws bundle to the classpath, as well as many other jars, but I still get the same error. I'm executing the OneTable jar sync job from a Docker image. Maybe the fact that it's a Docker image is causing issues? I'm not sure. Does anyone have examples of specific versions of jars that they know work when converting from an Iceberg source managed by Glue to Delta? @the-other-tim-brown @sagarlakshmipathy any help would be very welcome. I'd love to get this tool working!

@ForeverAngry I am getting the class constructor error while running the jar locally. What was the issue fixed to resolve this?

@ForeverAngry
Copy link
Author

ForeverAngry commented Jun 7, 2024

For this specific issue, I just pulled the main branch and rebuilt the jars. That seemed to clear up the class constructor issue. Let me know if it doesn't!
