AWS Glue with Iceberg Source and Delta Target #354
I seem to keep getting a `version-hint.text` error when trying to use the jar file to sync an existing Iceberg table to Delta, using AWS Glue as my catalog. I'm using the `my_config.yaml` file as directed in the docs, but I feel like I'm missing something. For instance: do I need to create a catalog.yaml and clients.yaml file as well? I noticed these, but I thought they were only for implementing custom sources other than Hudi, Iceberg, or Delta. If I do need to create these files, what would I set `catalogImpl:` to for a source Iceberg table that is managed by AWS Glue? Thanks in advance!
Yes, you will need a catalog yaml to specify which catalog implementation you are using for the Iceberg source. If none is specified, it assumes you're not using an external catalog. Once we get this working, it would be good to add some feedback or submit a PR with updates to the README, since this part is not so clear right now :)
Hi, thank you! You will have to forgive me because I'm about to sound dumb, but what would I put here, `catalogImpl: io.my.CatalogImpl`, if I am using AWS managed Glue with Iceberg? And what keys would I need to provide?
@the-other-tim-brown - I am also stuck on this particular issue; do you have an example of what this config would look like? Thanks.
@MrDerecho To sync with the Glue catalog while running the sync, you can look at this example: https://medium.com/@sagarlakshmipathy/using-onetable-to-translate-a-hudi-table-to-iceberg-format-and-sync-with-glue-catalog-8c3071f08877
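For anyone else landing here: a Glue-backed Iceberg source config along the lines of that walkthrough might look like the sketch below. This is a hedged example, not official docs - `catalogImpl` points at Iceberg's real `GlueCatalog` class, but the catalog name and warehouse path are placeholders you'd replace:

```yaml
# catalog.yaml - illustrative sketch for an Iceberg source registered in AWS Glue.
# The catalogOptions map is passed through to the Iceberg catalog implementation.
catalogImpl: org.apache.iceberg.aws.glue.GlueCatalog
catalogName: glue
catalogOptions:
  warehouse: s3://your-bucket/your-warehouse-path   # placeholder
  io-impl: org.apache.iceberg.aws.s3.S3FileIO
```

The file is then passed to the sync utility (recent builds accept it via an `--icebergCatalogConfig` option); the Glue and S3 classes come from the `iceberg-aws-bundle` jar, which needs to be on the classpath.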
@sagarlakshmipathy Thanks for this info! If I'm running OneTable in a Docker container, do I need to use JDK 8 or 11? I noticed that the docs say 11, but the pom file that builds the utilities jar uses 8. Maybe you can help demystify which JDK should be used for build vs. run?
@ForeverAngry We are currently using Java 11 for development and building the jars, but we set a lower target so that projects that are still on Java 8 can use the jar. Your runtime environment just needs to be Java 8 or above for a Java 8 jar to work properly.
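The build-vs-run split described above boils down to the compiler target in the pom; conceptually it is something like this (an illustrative snippet, not copied from the project's actual pom):

```xml
<!-- Illustrative: compile with a newer JDK (e.g. 11) but emit Java 8 bytecode,
     so any Java 8+ runtime can load the resulting jar. -->
<properties>
  <maven.compiler.source>1.8</maven.compiler.source>
  <maven.compiler.target>1.8</maven.compiler.target>
</properties>
```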
Maybe I'm doing something wrong here, but when I try to convert Iceberg to Delta using a Glue catalog, I keep getting an error that the class constructor for catalog.Catalog is not found. I've tried adding the iceberg-aws-bundle to the classpath, as well as many other jars, but still get the same error. I'm executing the OneTable jar sync job from a Docker image. Maybe the fact that it's a Docker image is causing issues? I'm not sure. Does anyone have examples of specific versions of jars that they know work when converting from an Iceberg source managed by Glue to Delta? @the-other-tim-brown @sagarlakshmipathy any help would be very welcome. I'd love to get this tool working!
@ForeverAngry What is the error message you are seeing, and how are you creating this Docker image?
@the-other-tim-brown Well, it was for sure an error with the image. But the error I'm getting now is related to the partitioning. I have a column called `timestamp` with an identity transformation on it. Any thoughts on how to resolve this error?
@ForeverAngry When you say you have an "identity transformation" on the column, what do you mean by that? It sounds like you are actually using the hour transform. The exception looks like there could be an issue in how the source schema is converted. When converting Iceberg's "hidden partitioning" to Delta Lake, we use a generated column to represent the partition values.
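To make the generated-column idea concrete, here is a minimal sketch of how an Iceberg `hour(ts)` partition can be modeled on the Delta side. The table name, column names, and the exact expression are illustrative assumptions, not XTable's internals:

```java
import io.delta.tables.DeltaTable;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;

public class GeneratedColumnSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .master("local[*]")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate();

    // A Delta table partitioned by a column derived from the timestamp,
    // mirroring Iceberg's hidden hour(ts) partitioning.
    DeltaTable.createOrReplace(spark)
        .tableName("example_table")
        .addColumn("f1", DataTypes.StringType)
        .addColumn("ts", DataTypes.TimestampType)
        .addColumn(DeltaTable.columnBuilder("ts_hour_part")
            .dataType(DataTypes.StringType)
            .generatedAlwaysAs("DATE_FORMAT(ts, 'yyyy-MM-dd-HH')")
            .build())
        .partitionedBy("ts_hour_part")
        .execute();
  }
}
```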
@the-other-tim-brown Yeah, I misspoke in my comment; you are correct, I'm using the hour transform.
OK, can you attach more of the stack trace so I can see where exactly that error is thrown?
@the-other-tim-brown Yeah, here is the error:
@the-other-tim-brown - Additional context: the column name is "timestamp", and it is used as a partition transform at the "hour" level. Note that when I query the partitions, they appear to be stored as Unix hours, but the logical partition paths in S3 are formatted as "yyyy-mm-dd-HH".
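That matches Iceberg's hour transform, which stores the count of whole hours since the Unix epoch; the S3 paths just render that same hour in readable form. A small sketch of the conversion (the partition value is a made-up example):

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class HourPartitionDemo {
  public static void main(String[] args) {
    long unixHours = 473_000L; // hypothetical value from Iceberg's hour transform
    // The partition covers the hour starting this many hours after 1970-01-01T00:00:00Z.
    Instant hourStart = Instant.EPOCH.plus(Duration.ofHours(unixHours));
    // Render it the way the logical S3 partition paths look.
    String rendered = DateTimeFormatter.ofPattern("yyyy-MM-dd-HH")
        .withZone(ZoneOffset.UTC)
        .format(hourStart);
    System.out.println(rendered); // prints 2023-12-17-08
  }
}
```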
Is this the full stack trace? If so, I'll need to investigate why we're not getting more error content in the logs.
@the-other-tim-brown That's not the full stack trace, just the error. The machine the program is on is separate and not able to post here, so I have to type a lot of this in by hand. Perhaps you could just let me know: does OneTable have the ability to take a timestamp field, stored as a timestamp, that uses the hour transform from an Iceberg source, and convert that to Delta? Or is that not supported? I guess this is what I really want to know.
This is definitely a bug somewhere in the partition handling logic. We want to support this case, so it just requires developing an end-to-end test case in the repo that reproduces the issue in a way that is easier to debug.
@the-other-tim-brown I was able to get the full stack trace printed out of the machine. I made most of the fields random names like "f1, f2, ...", except for the field the data is partitioned on using the hour transform from Iceberg (the one that is failing). I also redacted the S3 path names. Just to restate: I'm going from an existing Iceberg table in AWS, using S3 and Glue, to Delta (not Databricks Delta). The existing Iceberg table was created by Athena. Take a look and let me know what you think or if anything stands out! Any help would be awesome. Other partition functions work fine. The project is really awesome!
Can you confirm the field's schema type in the Iceberg metadata and in the Parquet file metadata? I tried a local example with hour partitioning on a timestamp field and it worked. I'm wondering if the field is actually a long in the schema.
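For reference, the spot to check in the table's metadata JSON looks roughly like this trimmed, illustrative excerpt; a proper timestamp column should show `"type": "timestamp"` rather than `"long"` (on the Parquet side it would be an INT64 with a timestamp logical annotation):

```json
{
  "schemas": [{
    "fields": [
      { "id": 1, "name": "timestamp", "required": false, "type": "timestamp" }
    ]
  }],
  "partition-specs": [{
    "spec-id": 0,
    "fields": [
      { "name": "timestamp_hour", "transform": "hour", "source-id": 1, "field-id": 1000 }
    ]
  }]
}
```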
@the-other-tim-brown So, I rebuilt the table again to make sure that there wasn't something odd with the partitions. It's odd; the only change I made when rebuilding the table was using upper case.
@ForeverAngry Can you grab the schema of the Iceberg table?
@the-other-tim-brown Here it is; let me know if this wasn't what you were looking for!
Thanks, this is what I was looking for. It lines up with my expectations, but I'm not able to reproduce the same error that you saw. Are any of your partitions null or before 1970-01-01? Just trying to think through the common time-based edge cases here.
@the-other-tim-brown So the table had some null values in the timestamp column; I got rid of those. Now I'm back to the original error I started with. The timestamps include milliseconds but do not have a timezone. Do you think the millisecond precision is an issue? Also, can you tell me what jar versions you are using when you run your test? Maybe mine are causing an issue. Here is the stack:
@ForeverAngry Can you confirm that you are on the latest main branch when testing? If so, can you rebuild off of this branch and test again? #382 - I think there is some issue in how we're handling timestamps with/without timezone set.
@the-other-tim-brown I re-pulled the main branch again and rebuilt the jar. Note: I noticed that the run sync class path changed.
@ForeverAngry The change in paths is expected and required as we transition this project to the Apache Software Foundation. We will need to do a full revamp of the docs/references once this process is complete. Did you try building off of the branch I linked as well?
@the-other-tim-brown Well, I feel a bit foolish - for some reason I thought you only wanted me to re-pull and test the main branch. Sorry, I'll pull that branch you linked and let you know! Again, I appreciate the dialog and support - you're awesome!
@the-other-tim-brown Okay, I rebuilt the jar using the branch you provided; here is the stack trace:
@ForeverAngry I missed a spot, but I just updated the branch to handle this error you are seeing.
@the-other-tim-brown Awesome!! I'll try it here in the next hour and let you know!
@the-other-tim-brown So I pulled that branch and built the jar. Good news: the sync completes successfully! But when I try to run the sync again, I get this error (see below). However, if I delete the
@ForeverAngry We need to handle the protocol versions better. Right now the version pair is hardcoded to (1, 4) when creating the table, but I guess when we add the timestamp ntz field it is getting upgraded to (3, 7). We should remove the hardcoding and compute the versions based on the needs of the table.
@the-other-tim-brown Makes sense; I can see if I can get it figured out and then submit a PR. Any suggestions (or existing examples in the codebase for other types) on where to start to actually compute the values would be appreciated!
https://github.com/apache/incubator-xtable/blob/main/core/src/main/java/org/apache/xtable/delta/DeltaClient.java#L69-L71 is where we are currently hardcoding the values. I would create a method that outputs the min reader and writer versions based on the features used (generated columns, timestamp ntz, etc.). Maybe there is already a public method in Delta that does this that can just be used? Something is upgrading the table, so I am thinking we can try to call into that method.

You can add some testing here: https://github.com/apache/incubator-xtable/blob/main/core/src/test/java/org/apache/xtable/delta/TestDeltaSync.java#L102

Also, feel free to take over my branch or just pull those small changes for the timestamp ntz into your own branch so you can test that case.
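A rough sketch of the kind of helper being suggested, using the version requirements from the Delta protocol spec (generated columns need writer version 4; timestampNtz is a table feature needing reader 3 / writer 7). The method name and boolean flags are illustrative, not existing XTable code:

```java
// Derive minimum Delta protocol versions from the features a table actually
// uses, instead of hardcoding (1, 4) at table creation.
static int[] minProtocolVersions(boolean usesGeneratedColumns, boolean usesTimestampNtz) {
  int minReader = 1;
  int minWriter = 2; // baseline writer version
  if (usesGeneratedColumns) {
    minWriter = Math.max(minWriter, 4); // generated columns: writer version 4+
  }
  if (usesTimestampNtz) {
    minReader = Math.max(minReader, 3); // timestampNtz table feature: reader 3
    minWriter = Math.max(minWriter, 7); // and writer 7
  }
  return new int[] {minReader, minWriter};
}
```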
@the-other-tim-brown I added the changes to support this feature as you outlined. Full disclosure: this is my first time contributing to an open source project, and I'm not overly familiar with large-scale Java projects, so you will have to bear with me. Here is my fork. One final note: I pulled the main branch and made the changes you put on the timestamp_ntz branch there, since that branch was pretty far out of date with the current project at this point. Let me know what you think - any guidance is welcome!
@ForeverAngry Can you create a pull request from your fork? It should allow you to set the main branch in this repo as a target. I will be able to better review and add comments around testing that way.
@the-other-tim-brown Will do! Give me a few minutes!
@the-other-tim-brown Done, take a look.
@ForeverAngry I am getting the class constructor error while running the jar locally. What was the fix that resolved this?
For this specific issue, I just pulled the main branch and rebuilt the jars. That seemed to clear up the class constructor issue. Let me know if it doesn't!
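For anyone hitting the same thing, the rebuild amounts to something like the following (illustrative; the exact jar name/version and optional flags depend on your checkout and build):

```bash
# Rebuild the bundled utilities jar from the current main branch, then rerun the sync.
git clone https://github.com/apache/incubator-xtable.git
cd incubator-xtable
mvn clean package -DskipTests
java -jar utilities/target/utilities-*-bundled.jar \
  --datasetConfig my_config.yaml \
  --icebergCatalogConfig catalog.yaml
```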