Error creating table from pyarrow schema with pa.uuid() #1986

Open
1 of 3 tasks
simw opened this issue May 10, 2025 · 3 comments

simw commented May 10, 2025

Apache Iceberg version

0.9.0 (latest release)

Please describe the bug 🐞

Preamble: using a local sqlite db:

from pyiceberg.catalog import load_catalog

warehouse_path = "data/warehouse"
catalog = load_catalog(
    "default",
    **{
        "type": "sql",
        "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
        "warehouse": f"file://{warehouse_path}",
    },
)

A pyiceberg UUID column works fine:

from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, UUIDType

schema = Schema(
    NestedField(field_id=1, name="uuid", field_type=UUIDType(), required=False),
)

catalog.create_table("default.test2", schema=schema)

But a pyarrow UUID column gives an error:

import pyarrow as pa

schema = pa.schema([pa.field("foo", pa.uuid(), nullable=True)])

catalog.create_table("default.test4", schema=schema)

The exception is:

File ~/Code/Projects/others/icebergs/.venv/lib/python3.13/site-packages/pyiceberg/io/pyarrow.py:1032, in _(obj, visitor)
   1030     result = visit_pyarrow(field_type, visitor)
   1031 except TypeError as e:
-> 1032     raise UnsupportedPyArrowTypeException(obj, f"Column '{obj.name}' has an unsupported type: {field_type}") from e
   1033 visitor.after_field(obj)
   1035 return visitor.field(obj, result)

UnsupportedPyArrowTypeException: Column 'foo' has an unsupported type: extension<arrow.uuid>

Related to simw/pydantic-to-pyarrow#27

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time

jim-ngoo commented May 12, 2025

We have the UUIDType type already; I think what we missed is the visit_pyarrow decorator.

I think we need to handle the case pa.uuid() here:

def primitive(self, primitive: pa.DataType) -> PrimitiveType:

DinGo4DEV commented

def schema_to_pyarrow(
    schema: Union[Schema, IcebergType],
    metadata: Dict[bytes, bytes] = EMPTY_DICT,
    include_field_ids: bool = True,
) -> pa.schema:
    return visit(schema, _ConvertToArrowSchema(metadata, include_field_ids))

UUIDType is converted to pyarrow as fixed_size_binary[16]. As a workaround, you can use pa.binary(16) and store the bytes of each UUID in your pyarrow table.

import uuid
import pyarrow as pa
uuids = pa.array([uuid.uuid4().bytes for _ in range(100)], pa.binary(16))
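
Because a binary(16) column only carries raw bytes, values read back from it need to be turned into UUID objects again. A minimal standard-library sketch of that round trip follows; the helper names `uuid_to_bytes` and `bytes_to_uuid` are hypothetical and not part of pyiceberg or pyarrow.

```python
import uuid


def uuid_to_bytes(value: uuid.UUID) -> bytes:
    """Return the 16-byte big-endian representation of a UUID."""
    return value.bytes


def bytes_to_uuid(raw: bytes) -> uuid.UUID:
    """Rebuild a UUID from a 16-byte value, rejecting other lengths."""
    if len(raw) != 16:
        raise ValueError(f"expected 16 bytes, got {len(raw)}")
    return uuid.UUID(bytes=raw)


original = uuid.UUID("020d4fc7-acd6-45ac-b216-7873f4038e1f")
raw = uuid_to_bytes(original)   # value suitable for a binary(16) column
restored = bytes_to_uuid(raw)   # same UUID as `original`
```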

Tishj commented May 14, 2025

I think I've also encountered this problem while trying to write tests for https://github.com/duckdb/duckdb-iceberg.
I am using a build from main (0.10 dev).

The table.append(...) method won't accept this array:

col_uuid = pa.array([UUID('020d4fc7-acd6-45ac-b216-7873f4038e1f').bytes], pa.uuid())

Results in:

  File "/iceberg-python/pyiceberg/io/pyarrow.py", line 1061, in _
    raise UnsupportedPyArrowTypeException(obj, f"Column '{obj.name}' has an unsupported type: {field_type}") from e
pyiceberg.io.pyarrow.UnsupportedPyArrowTypeException: Column 'col_uuid' has an unsupported type: extension<arrow.uuid>

What does work is this:

col_uuid = pa.array([UUID('020d4fc7-acd6-45ac-b216-7873f4038e1f').bytes], pa.binary(16))

But the problem then is that the Parquet file created by pyiceberg does not have the UUID logical type on the field, which trips up our (DuckDB's) Parquet reader.

I think it's fine to accept pa.binary(16), but pyiceberg should then add the UUID logical type to the Parquet file's field when the destination column is a UUIDType, such as one added like this:

update.add_column("col_uuid", UUIDType(), default_value="f79c3e09-677c-4bbd-a479-3f349cb785e7", required=False)
