ENH: Implement pandas.read_iceberg #61383

Open

wants to merge 37 commits into base: main

Changes from all commits (37 commits)
aa59971
ENH: Implement pandas.read_iceberg
datapythonista Apr 30, 2025
9ef8e4e
Typo in API index
datapythonista Apr 30, 2025
34792c9
Run iceberg tests
datapythonista Apr 30, 2025
c1b2426
Fix docstring example
datapythonista Apr 30, 2025
6081bfb
Fixes to the docstring
datapythonista Apr 30, 2025
91370db
Creating catalog dynamically
datapythonista Apr 30, 2025
ecf40ff
Fixing tests
datapythonista May 2, 2025
8c1e4dc
Adding debug info
datapythonista May 3, 2025
ee61079
Making pyiceberg tests run in a single cpu
datapythonista May 3, 2025
2debd5f
Debugging CI problems
datapythonista May 3, 2025
e593977
Bump pyiceberg version to 0.8.1
datapythonista May 3, 2025
24c0ceb
Revert "Bump pyiceberg version to 0.8.1"
datapythonista May 3, 2025
5fc738e
Commenting debugging information
datapythonista May 3, 2025
4953745
Bump minimum version to 0.7.1
datapythonista May 4, 2025
17d73e8
Removing debug code
datapythonista May 4, 2025
5f07a49
Allowing older version of gcsfs
datapythonista May 4, 2025
32add5f
Allowing an older version of s3fs
datapythonista May 4, 2025
3b0d7ee
Allowing an older version of fsspec
datapythonista May 5, 2025
7018c11
adding pyiceberg to requirements
datapythonista May 5, 2025
75c24d6
empty
datapythonista May 5, 2025
9c343a5
pre-commit
datapythonista May 5, 2025
9cd2d5c
Updating gcsfs minimum version
datapythonista May 5, 2025
301e988
Print difference in validation of minimum versions
datapythonista May 5, 2025
15f6397
Fix diff print
datapythonista May 5, 2025
f973e61
Fix bug when showing diff
datapythonista May 5, 2025
9110f2c
Merge remote-tracking branch 'upstream/main' into read_iceberg
datapythonista May 5, 2025
c13ce5b
debug validate min versions
datapythonista May 5, 2025
bc4d689
Updating new CI deps, reverting validate min versions script changes
datapythonista May 5, 2025
735c48c
Reverting test data to working version
datapythonista May 5, 2025
761b92e
Making read_iceberg experimental
datapythonista May 12, 2025
c25536c
Merge from main
datapythonista May 12, 2025
0a2e9ea
Using fixture for catalog
datapythonista May 12, 2025
2ef6343
Fix tests after using fixture
datapythonista May 12, 2025
7dd054b
Remove unneeded list when defining single cpu pytest mark
datapythonista May 12, 2025
74a1e65
whatsnew entry
datapythonista May 12, 2025
6917b30
Merge branch 'read_iceberg' of github.com:datapythonista/pandas into …
datapythonista May 12, 2025
dd0b5e4
Merge branch 'main' into read_iceberg
datapythonista May 14, 2025
7 changes: 4 additions & 3 deletions ci/deps/actions-310-minimum_versions.yaml
Original file line number Diff line number Diff line change
@@ -28,10 +28,10 @@ dependencies:
- beautifulsoup4=4.12.3
- bottleneck=1.3.6
- fastparquet=2024.2.0
- fsspec=2024.2.0
- fsspec=2023.12.2
- html5lib=1.1
- hypothesis=6.84.0
- gcsfs=2024.2.0
- gcsfs=2023.12.2
- jinja2=3.1.3
- lxml=4.9.2
- matplotlib=3.8.3
@@ -42,14 +42,15 @@ dependencies:
- openpyxl=3.1.2
- psycopg2=2.9.6
- pyarrow=10.0.1
- pyiceberg=0.7.1
- pymysql=1.1.0
- pyqt=5.15.9
- pyreadstat=1.2.6
- pytables=3.8.0
- python-calamine=0.1.7
- pytz=2023.4
- pyxlsb=1.0.10
- s3fs=2024.2.0
- s3fs=2023.12.2
- scipy=1.12.0
- sqlalchemy=2.0.0
- tabulate=0.9.0
7 changes: 4 additions & 3 deletions ci/deps/actions-310.yaml
@@ -26,10 +26,10 @@ dependencies:
- beautifulsoup4>=4.12.3
- bottleneck>=1.3.6
- fastparquet>=2024.2.0
- fsspec>=2024.2.0
- fsspec>=2023.12.2
- html5lib>=1.1
- hypothesis>=6.84.0
- gcsfs>=2024.2.0
- gcsfs>=2023.12.2
- jinja2>=3.1.3
- lxml>=4.9.2
- matplotlib>=3.8.3
@@ -40,14 +40,15 @@ dependencies:
- openpyxl>=3.1.2
- psycopg2>=2.9.6
- pyarrow>=10.0.1
- pyiceberg>=0.7.1
- pymysql>=1.1.0
- pyqt>=5.15.9
- pyreadstat>=1.2.6
- pytables>=3.8.0
- python-calamine>=0.1.7
- pytz>=2023.4
- pyxlsb>=1.0.10
- s3fs>=2024.2.0
- s3fs>=2023.12.2
- scipy>=1.12.0
- sqlalchemy>=2.0.0
- tabulate>=0.9.0
7 changes: 4 additions & 3 deletions ci/deps/actions-311-downstream_compat.yaml
@@ -27,10 +27,10 @@ dependencies:
- beautifulsoup4>=4.12.3
- bottleneck>=1.3.6
- fastparquet>=2024.2.0
- fsspec>=2024.2.0
- fsspec>=2023.12.2
- html5lib>=1.1
- hypothesis>=6.84.0
- gcsfs>=2024.2.0
- gcsfs>=2023.12.2
- jinja2>=3.1.3
- lxml>=4.9.2
- matplotlib>=3.8.3
@@ -41,14 +41,15 @@ dependencies:
- openpyxl>=3.1.2
- psycopg2>=2.9.6
- pyarrow>=10.0.1
- pyiceberg>=0.7.1
- pymysql>=1.1.0
- pyqt>=5.15.9
- pyreadstat>=1.2.6
- pytables>=3.8.0
- python-calamine>=0.1.7
- pytz>=2023.4
- pyxlsb>=1.0.10
- s3fs>=2024.2.0
- s3fs>=2023.12.2
- scipy>=1.12.0
- sqlalchemy>=2.0.0
- tabulate>=0.9.0
7 changes: 4 additions & 3 deletions ci/deps/actions-311.yaml
@@ -26,10 +26,10 @@ dependencies:
- beautifulsoup4>=4.12.3
- bottleneck>=1.3.6
- fastparquet>=2024.2.0
- fsspec>=2024.2.0
- fsspec>=2023.12.2
- html5lib>=1.1
- hypothesis>=6.84.0
- gcsfs>=2024.2.0
- gcsfs>=2023.12.2
- jinja2>=3.1.3
- lxml>=4.9.2
- matplotlib>=3.8.3
@@ -41,13 +41,14 @@ dependencies:
- openpyxl>=3.1.2
- psycopg2>=2.9.6
- pyarrow>=10.0.1
- pyiceberg>=0.7.1
- pymysql>=1.1.0
- pyreadstat>=1.2.6
- pytables>=3.8.0
- python-calamine>=0.1.7
- pytz>=2023.4
- pyxlsb>=1.0.10
- s3fs>=2024.2.0
- s3fs>=2023.12.2
- scipy>=1.12.0
- sqlalchemy>=2.0.0
- tabulate>=0.9.0
7 changes: 4 additions & 3 deletions ci/deps/actions-312.yaml
@@ -26,10 +26,10 @@ dependencies:
- beautifulsoup4>=4.12.3
- bottleneck>=1.3.6
- fastparquet>=2024.2.0
- fsspec>=2024.2.0
- fsspec>=2023.12.2
- html5lib>=1.1
- hypothesis>=6.84.0
- gcsfs>=2024.2.0
- gcsfs>=2023.12.2
- jinja2>=3.1.3
- lxml>=4.9.2
- matplotlib>=3.8.3
@@ -41,13 +41,14 @@ dependencies:
- openpyxl>=3.1.2
- psycopg2>=2.9.6
- pyarrow>=10.0.1
- pyiceberg>=0.7.1
- pymysql>=1.1.0
- pyreadstat>=1.2.6
- pytables>=3.8.0
- python-calamine>=0.1.7
- pytz>=2023.4
- pyxlsb>=1.0.10
- s3fs>=2024.2.0
- s3fs>=2023.12.2
- scipy>=1.12.0
- sqlalchemy>=2.0.0
- tabulate>=0.9.0
6 changes: 3 additions & 3 deletions ci/deps/actions-313.yaml
@@ -27,10 +27,10 @@ dependencies:
- blosc>=1.21.3
- bottleneck>=1.3.6
- fastparquet>=2024.2.0
- fsspec>=2024.2.0
- fsspec>=2023.12.2
- html5lib>=1.1
- hypothesis>=6.84.0
- gcsfs>=2024.2.0
- gcsfs>=2023.12.2
- jinja2>=3.1.3
- lxml>=4.9.2
- matplotlib>=3.8.3
@@ -48,7 +48,7 @@ dependencies:
- python-calamine>=0.1.7
- pytz>=2023.4
- pyxlsb>=1.0.10
- s3fs>=2024.2.0
- s3fs>=2023.12.2
- scipy>=1.12.0
- sqlalchemy>=2.0.0
- tabulate>=0.9.0
9 changes: 5 additions & 4 deletions doc/source/getting_started/install.rst
@@ -299,7 +299,7 @@ Dependency Minimum Versi
Other data sources
^^^^^^^^^^^^^^^^^^

Installable with ``pip install "pandas[hdf5, parquet, feather, spss, excel]"``
Installable with ``pip install "pandas[hdf5, parquet, iceberg, feather, spss, excel]"``

====================================================== ================== ================ ==========================================================
Dependency Minimum Version pip extra Notes
@@ -308,6 +308,7 @@ Dependency Minimum Version pip ex
`zlib <https://github.com/madler/zlib>`__ hdf5 Compression for HDF5
`fastparquet <https://github.com/dask/fastparquet>`__ 2024.2.0 - Parquet reading / writing (pyarrow is default)
`pyarrow <https://github.com/apache/arrow>`__ 10.0.1 parquet, feather Parquet, ORC, and feather reading / writing
`PyIceberg <https://py.iceberg.apache.org/>`__ 0.7.1 iceberg Apache Iceberg reading
`pyreadstat <https://github.com/Roche/pyreadstat>`__ 1.2.6 spss SPSS files (.sav) reading
`odfpy <https://github.com/eea/odfpy>`__ 1.4.1 excel Open document format (.odf, .ods, .odt) reading / writing
====================================================== ================== ================ ==========================================================
@@ -328,10 +329,10 @@ Installable with ``pip install "pandas[fss, aws, gcp]"``
============================================ ================== =============== ==========================================================
Dependency Minimum Version pip extra Notes
============================================ ================== =============== ==========================================================
`fsspec <https://github.com/fsspec>`__ 2024.2.0 fss, gcp, aws Handling files aside from simple local and HTTP (required
`fsspec <https://github.com/fsspec>`__ 2023.12.2 fss, gcp, aws Handling files aside from simple local and HTTP (required
dependency of s3fs, gcsfs).
`gcsfs <https://github.com/fsspec/gcsfs>`__ 2024.2.0 gcp Google Cloud Storage access
`s3fs <https://github.com/fsspec/s3fs>`__ 2024.2.0 aws Amazon S3 access
`gcsfs <https://github.com/fsspec/gcsfs>`__ 2023.12.2 gcp Google Cloud Storage access
`s3fs <https://github.com/fsspec/s3fs>`__ 2023.12.2 aws Amazon S3 access
============================================ ================== =============== ==========================================================

Clipboard
9 changes: 9 additions & 0 deletions doc/source/reference/io.rst
@@ -156,6 +156,15 @@ Parquet
read_parquet
DataFrame.to_parquet

Iceberg
~~~~~~~
.. autosummary::
:toctree: api/

read_iceberg

.. warning:: ``read_iceberg`` is experimental and may change without warning.

ORC
~~~
.. autosummary::
97 changes: 97 additions & 0 deletions doc/source/user_guide/io.rst
@@ -29,6 +29,7 @@ The pandas I/O API is a set of top level ``reader`` functions accessed like
binary,`HDF5 Format <https://support.hdfgroup.org/documentation/hdf5/latest/_intro_h_d_f5.html>`__, :ref:`read_hdf<io.hdf5>`, :ref:`to_hdf<io.hdf5>`
binary,`Feather Format <https://github.com/wesm/feather>`__, :ref:`read_feather<io.feather>`, :ref:`to_feather<io.feather>`
binary,`Parquet Format <https://parquet.apache.org/>`__, :ref:`read_parquet<io.parquet>`, :ref:`to_parquet<io.parquet>`
binary,`Apache Iceberg <https://iceberg.apache.org/>`__, :ref:`read_iceberg<io.iceberg>` , NA
binary,`ORC Format <https://orc.apache.org/>`__, :ref:`read_orc<io.orc>`, :ref:`to_orc<io.orc>`
binary,`Stata <https://en.wikipedia.org/wiki/Stata>`__, :ref:`read_stata<io.stata_reader>`, :ref:`to_stata<io.stata_writer>`
binary,`SAS <https://en.wikipedia.org/wiki/SAS_(software)>`__, :ref:`read_sas<io.sas_reader>` , NA
@@ -5403,6 +5404,102 @@ The above example creates a partitioned dataset that may look like:
except OSError:
pass

.. _io.iceberg:

Iceberg
-------

.. versionadded:: 3.0.0

Apache Iceberg is a high-performance, open-source format for large analytic tables.
Iceberg enables the use of SQL tables for big data while making it possible for different
engines to safely work with the same tables at the same time.

Iceberg supports predicate pushdown and column pruning, which are available to pandas
users via the ``row_filter`` and ``selected_fields`` parameters of the :func:`~pandas.read_iceberg`
function. This makes it convenient to extract from large tables a subset that fits in memory as a
pandas ``DataFrame``.

Internally, pandas uses PyIceberg_ to query Iceberg.

.. _PyIceberg: https://py.iceberg.apache.org/

A simple example loads all data from an Iceberg table ``my_table`` defined in the
``my_catalog`` catalog:

.. code-block:: python

df = pd.read_iceberg("my_table", catalog_name="my_catalog")

Catalogs must be defined in the ``.pyiceberg.yaml`` file, usually in the home directory.
It is possible to change properties of the catalog definition with the
``catalog_properties`` parameter:

.. code-block:: python

df = pd.read_iceberg(
"my_table",
catalog_name="my_catalog",
catalog_properties={"s3.secret-access-key": "my_secret"},
)
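
As a sketch, a minimal ``~/.pyiceberg.yaml`` defining ``my_catalog`` could look like the
following (the URI and credentials are placeholders, not working values):

.. code-block:: yaml

    catalog:
      my_catalog:
        uri: http://127.0.0.1:8181
        s3.endpoint: http://127.0.0.1:9000
        s3.access-key-id: my_key
        s3.secret-access-key: my_secret

See the PyIceberg documentation for the full list of supported catalog properties.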

It is also possible to fully specify the catalog in ``catalog_properties`` and not provide
a ``catalog_name``:

.. code-block:: python

df = pd.read_iceberg(
"my_table",
catalog_properties={
"uri": "http://127.0.0.1:8181",
"s3.endpoint": "http://127.0.0.1:9000",
},
)

To create the ``DataFrame`` with only a subset of the columns:

.. code-block:: python

df = pd.read_iceberg(
"my_table",
catalog_name="my_catalog",
        selected_fields=["my_column_3", "my_column_7"],
)

This executes faster, since the other columns are not read, and it also saves memory,
since the data from those columns is never loaded into the underlying memory of the
``DataFrame``.

To fetch only a subset of the rows, use the ``limit`` parameter:

.. code-block:: python

df = pd.read_iceberg(
"my_table",
catalog_name="my_catalog",
limit=100,
)

This will create a ``DataFrame`` with 100 rows, assuming the table contains at least
that many rows.

To fetch a subset of the rows based on a condition, use the ``row_filter``
parameter:

.. code-block:: python

df = pd.read_iceberg(
"my_table",
catalog_name="my_catalog",
row_filter="distance > 10.0",
)
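
These parameters can be combined. As a sketch, the following would load at most 100 rows
of two columns matching a filter (the column names are placeholders):

.. code-block:: python

    df = pd.read_iceberg(
        "my_table",
        catalog_name="my_catalog",
        row_filter="distance > 10.0",
        selected_fields=["my_column_3", "my_column_7"],
        limit=100,
    )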

Reading a particular snapshot is also possible, by providing its ID via the
``snapshot_id`` parameter.
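
As a sketch (the snapshot ID below is a made-up placeholder; real IDs come from the table
metadata, for example via PyIceberg's ``Table.history()``):

.. code-block:: python

    df = pd.read_iceberg(
        "my_table",
        catalog_name="my_catalog",
        snapshot_id=3051729675574597004,
    )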

More information about the Iceberg format can be found on the `Apache Iceberg official
website <https://iceberg.apache.org/>`__.

.. _io.orc:

ORC
1 change: 1 addition & 0 deletions doc/source/whatsnew/v3.0.0.rst
@@ -78,6 +78,7 @@ Other enhancements
- :py:class:`frozenset` elements in pandas objects are now natively printed (:issue:`60690`)
- Add ``"delete_rows"`` option to ``if_exists`` argument in :meth:`DataFrame.to_sql` deleting all records of the table before inserting data (:issue:`37210`).
- Added half-year offset classes :class:`HalfYearBegin`, :class:`HalfYearEnd`, :class:`BHalfYearBegin` and :class:`BHalfYearEnd` (:issue:`60928`)
- Added support to read from Apache Iceberg tables with the new :func:`read_iceberg` function (:issue:`61383`)
- Errors occurring during SQL I/O will now throw a generic :class:`.DatabaseError` instead of the raw Exception type from the underlying driver manager library (:issue:`60748`)
- Implemented :meth:`Series.str.isascii` and :meth:`Index.str.isascii` (:issue:`59091`)
- Improved deprecation message for offset aliases (:issue:`60820`)
7 changes: 4 additions & 3 deletions environment.yml
@@ -29,10 +29,10 @@ dependencies:
- beautifulsoup4>=4.12.3
- bottleneck>=1.3.6
- fastparquet>=2024.2.0
- fsspec>=2024.2.0
- fsspec>=2023.12.2
- html5lib>=1.1
- hypothesis>=6.84.0
- gcsfs>=2024.2.0
- gcsfs>=2023.12.2
- ipython
- pickleshare # Needed for IPython Sphinx directive in the docs GH#60429
- jinja2>=3.1.3
@@ -44,13 +44,14 @@ dependencies:
- odfpy>=1.4.1
- psycopg2>=2.9.6
- pyarrow>=10.0.1
- pyiceberg>=0.7.1
- pymysql>=1.1.0
- pyreadstat>=1.2.6
- pytables>=3.8.0
- python-calamine>=0.1.7
- pytz>=2023.4
- pyxlsb>=1.0.10
- s3fs>=2024.2.0
- s3fs>=2023.12.2
- scipy>=1.12.0
- sqlalchemy>=2.0.0
- tabulate>=0.9.0
2 changes: 2 additions & 0 deletions pandas/__init__.py
@@ -164,6 +164,7 @@
read_stata,
read_sas,
read_spss,
read_iceberg,
)

from pandas.io.json._normalize import json_normalize
@@ -319,6 +320,7 @@
"read_fwf",
"read_hdf",
"read_html",
"read_iceberg",
"read_json",
"read_orc",
"read_parquet",