Added multi-processing capabilities to LH5Iterator, including map, query, and hist functions #192

iguinn · 2025-10-04T20:32:19Z

Added functions to LH5Iterator:
map takes an input function with arguments of type Table and LH5Iterator, applies it to each block of data, and returns a list of the results (similar to the Python built-in map). In addition, the user can provide:

An aggregator function that combines the results of this function (e.g., joining tables column-wise or adding results together).
Begin and terminate functions that run before the loop begins and after it ends.
Either a concurrent.futures.Executor or a number of processes to run (using ProcessPoolExecutor). This divides the iterator into equal(ish) sized chunks of files/groups and submits each chunk to the Executor. If this is used, results are returned as an iterator over future objects (see concurrent.futures.Executor.map).
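
For readers unfamiliar with the pattern, here is a minimal standalone sketch of the chunk-and-submit idea. It is not the code in this PR: `split_evenly`, `process_chunk`, and `parallel_map` are hypothetical names, a plain list of file names stands in for the iterator, and a `ThreadPoolExecutor` is used only so the snippet runs anywhere. Unlike the behaviour described above, the sketch waits for every future and returns plain results.

```python
from concurrent.futures import Executor, ThreadPoolExecutor


def split_evenly(items, n):
    """Split items into n contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        stop = start + base + (1 if i < extra else 0)
        chunks.append(items[start:stop])
        start = stop
    return chunks


def process_chunk(fun, chunk):
    """Apply fun to every item of one chunk, preserving order."""
    return [fun(item) for item in chunk]


def parallel_map(fun, items, executor: Executor, n_workers, aggregate=None):
    """Hypothetical stand-in for the map described above (not the lgdo API)."""
    chunks = split_evenly(items, n_workers)
    futures = [executor.submit(process_chunk, fun, c) for c in chunks]
    # Collect in submission order so the output order matches the input order.
    results = [r for f in futures for r in f.result()]
    return aggregate(results) if aggregate is not None else results


files = [f"file_{i}.lh5" for i in range(7)]  # placeholder names, never opened
with ThreadPoolExecutor(max_workers=3) as pool:
    lengths = parallel_map(len, files, pool, n_workers=3)
    total = parallel_map(len, files, pool, n_workers=3, aggregate=sum)
```

Here `len` plays the role of the user function and `sum` the role of the aggregator. Any other `concurrent.futures.Executor` could be passed in place of the thread pool, provided the function and its arguments can be sent to the workers.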

query takes an input function with arguments of type Table and LH5Iterator and returns a Table, pandas dataframe, awkward array, or numpy array with some sort of processing (including down-selection) applied. This function is mapped over the full dataset pointed to by the iterator, and the results are merged into a single Table/dataframe/awkward array. This can be used with multiprocessing.

In addition, a pandas query can be provided as a string; this will return a pandas dataframe with the query used to select which entries to include. This feature could be expanded for awkward and other datatypes in the future (see Secure Table.eval() #135 for why this is not done yet).
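
As a rough illustration of the query idea (a selection applied block by block, with the results merged into one table), here is a standalone sketch. It does not use lgdo: dictionaries of lists stand in for Table objects, and `select_high_energy`, `query_blocks`, and the column names `energy` and `channel` are all hypothetical.

```python
def select_high_energy(block):
    """Hypothetical query function: keep rows with energy above 100."""
    keep = [i for i, e in enumerate(block["energy"]) if e > 100]
    return {name: [col[i] for i in keep] for name, col in block.items()}


def query_blocks(blocks, fun):
    """Apply fun to each block and merge the per-block results row-wise."""
    merged = {}
    for block in blocks:
        for name, col in fun(block).items():
            merged.setdefault(name, []).extend(col)
    return merged


# Two small blocks standing in for successive reads from the iterator.
blocks = [
    {"energy": [50, 150, 300], "channel": [1, 2, 3]},
    {"energy": [120, 80], "channel": [4, 5]},
]
selected = query_blocks(blocks, select_high_energy)
```

With pandas, the string form of the same selection would be something like `df.query("energy > 100")` applied to each block.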

hist takes a list of Hist.axis specifications, a query function/str, and a list of keys, and builds a histogram out of the queried data (filling it iteratively so that we never have to hold the full table in memory at once). This can be used with multiprocessing.
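
The memory argument can be seen in a small standalone sketch: counts are accumulated one block at a time, and partial histograms from separate workers are combined by adding their counts. This is plain Python rather than the Hist-based code in the PR; `fill_blockwise`, `merge_counts`, and the bin edges are hypothetical.

```python
from bisect import bisect_right


def fill_blockwise(edges, blocks):
    """Accumulate counts block by block; only one block is held at a time."""
    counts = [0] * (len(edges) - 1)
    for block in blocks:
        for value in block:
            if edges[0] <= value < edges[-1]:  # drop underflow and overflow
                counts[bisect_right(edges, value) - 1] += 1
    return counts


def merge_counts(a, b):
    """Combine partial histograms from two workers (same binning assumed)."""
    return [x + y for x, y in zip(a, b)]


edges = [0, 100, 200, 300]


def blocks_a():
    # A generator, so each block is produced only when it is needed.
    yield [10, 150, 250]
    yield [99, 100]


def blocks_b():
    yield [299, 300, -5]


partial_a = fill_blockwise(edges, blocks_a())
partial_b = fill_blockwise(edges, blocks_b())
total = merge_counts(partial_a, partial_b)
```

Because the merge is a simple addition, the order in which workers finish does not affect the final histogram.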

In addition, some other functions were implemented to support these:

__deepcopy__ is defined; this deep-copies everything except the LH5Store, for which a new store is constructed (since h5py objects are not all deepcopy-able).
__getstate__ and __setstate__ are defined to enable pickling of an LH5Iterator. Pickling is used by multiprocessing to communicate data from the parent process to child processes. This requires skipping the LH5Store when writing, and constructing a new one when reading.
_select_groups is a helper function that reduces the files and groups iterated over to some slice. This is used by...
_generate_workers splits the iterator into n equal(ish) iterators that are passed to each of the processes when using multi-processing
Additional helper functions used by map, query, and hist
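
The skip-on-write, rebuild-on-read approach to pickling and deep-copying can be sketched with the standard library alone. The `Reader` class below is hypothetical and only mimics the situation: an `io.BufferedReader` plays the role of the store because, like some h5py objects, it cannot be pickled or deep-copied.

```python
import copy
import io
import pickle


class Reader:
    """Hypothetical class holding a handle that cannot be pickled."""

    def __init__(self, files):
        self.files = list(files)
        self.handle = self._open()

    def _open(self):
        # Stand-in for constructing a new store.
        return io.BufferedReader(io.BytesIO(b""))

    def __getstate__(self):
        state = self.__dict__.copy()
        del state["handle"]  # skip the handle when writing
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.handle = self._open()  # construct a fresh one when reading

    def __deepcopy__(self, memo):
        new = self.__class__.__new__(self.__class__)
        memo[id(self)] = new
        for key, value in self.__dict__.items():
            if key != "handle":
                setattr(new, key, copy.deepcopy(value, memo))
        new.handle = self._open()  # never share the original handle
        return new


original = Reader(["a.lh5", "b.lh5"])  # placeholder names, never opened
restored = pickle.loads(pickle.dumps(original))
clone = copy.deepcopy(original)
```

Both `restored` and `clone` carry their own handle, which is what lets each child process work independently of the parent.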

Added tests for map, query and hist.

codecov · 2025-10-04T20:34:35Z

Codecov Report

❌ Patch coverage is 80.47337% with 33 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.86%. Comparing base (7428202) to head (fd94d58).

Files with missing lines	Patch %	Lines
src/lgdo/lh5/iterator.py	80.35%	33 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #192      +/-   ##
==========================================
- Coverage   80.91%   80.86%   -0.05%     
==========================================
  Files          47       47              
  Lines        3747     3910     +163     
==========================================
+ Hits         3032     3162     +130     
- Misses        715      748      +33

gipert · 2025-10-05T08:02:11Z

Very cool functionality! I think we should merge it, as you are not changing existing code. But I would suggest writing a documentation page (or notebook) about the iterator. At this point there is quite a lot of functionality, and I think it's worth having a new section in the docs.

iguinn · 2025-10-05T15:35:40Z

Hi Luigi, yes, I would like to merge this soon, but I agree there are a few things to do first. (These are the same changes as another pull request that was getting messy due to cross-dependencies on other PRs, so I resubmitted.)

gipert · 2025-11-28T10:17:53Z

Hi @iguinn, I'm trying this out for the ssc (big) data and it works nicely! Can we add a little more documentation and merge?

gipert · 2025-12-02T10:21:59Z

src/lgdo/lh5/iterator.py

+        Parameters
+        ----------
+        fun:
+            function with signature fun(lh5_obj: Table, it: LH5Iterator) -> Any


Suggested change

- function with signature fun(lh5_obj: Table, it: LH5Iterator) -> Any
+ function with signature ``fun(lh5_obj: Table, it: LH5Iterator) -> Any``

gipert · 2025-12-02T10:23:35Z

src/lgdo/lh5/iterator.py

+        """Map function over iterator blocks and return order-preserving list
+        of outputs. Can be multi-threaded provided there are no attempts to
+        modify existing objects. Multi-threading splits the iterator into
+        multiple independent streams with an approximately equal number of
+        files/groups, concurrently processed under a single program multiple
+        data model. Results will be returned asynchronously for each process.


Suggested change

- """Map function over iterator blocks and return order-preserving list
- of outputs. Can be multi-threaded provided there are no attempts to
- modify existing objects. Multi-threading splits the iterator into
- multiple independent streams with an approximately equal number of
- files/groups, concurrently processed under a single program multiple
- data model. Results will be returned asynchronously for each process.
+ """Map function over iterator blocks.
+ Returns order-preserving list of outputs. Can be multi-threaded
+ provided there are no attempts to modify existing objects.
+ Multi-threading splits the iterator into multiple independent
+ streams with an approximately equal number of files/groups,
+ concurrently processed under a single program multiple data
+ model. Results will be returned asynchronously for each process.

gipert · 2025-12-02T10:25:18Z

src/lgdo/lh5/iterator.py

+        multiple independent streams with an approximately equal number of
+        files/groups, concurrently processed under a single program multiple
+        data model. Results will be returned asynchronously for each process.
+


Can we add a simple example here?

    Examples
    --------
    >>> it = LH5Iterator(...)
    >>>
    >>> def fun():

etc.

gipert · 2025-12-02T10:26:32Z

src/lgdo/lh5/iterator.py

+            number of processes. If ``None``, use number equal to threads available
+            to ``executor`` (if provided), or else do not parallelize
+        executor:
+            `concurrent.futures.Executor <https://docs.python.org/3/library/concurrent.futures.html>`_


Suggested change

- `concurrent.futures.Executor <https://docs.python.org/3/library/concurrent.futures.html>`_
+ :class:`concurrent.futures.Executor`

Does this work?

gipert · 2025-12-02T10:28:06Z

src/lgdo/lh5/iterator.py

+        Query the data files in the iterator and return the selected data
+        as a single table in one of several formats.


Suggested change

- Query the data files in the iterator and return the selected data
- as a single table in one of several formats.
+ Query the data files in the iterator.
+ Returns the selected data as a single table in one of several formats.

Docstrings should always be: one-line description and then full description.

gipert · 2025-12-02T10:29:14Z

src/lgdo/lh5/iterator.py

+              - ``pandas.DataFrame``: pandas dataframe. Treat as mapping from column
+                name to values
+
+            - A string expression. This will call `pd.DataFrame.query <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html>`_ and return


Suggested change

- - A string expression. This will call `pd.DataFrame.query <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html>`_ and return
+ - A string expression. This will call :meth:`pd.DataFrame.query` and return

This should also work, to be tested.
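
For context on whether roles such as :class:`concurrent.futures.Executor` and :meth:`pd.DataFrame.query` resolve: they depend on the Sphinx intersphinx extension being enabled with a mapping for each external project. A minimal sketch of the relevant configuration follows. The file location is an assumption, and the project's existing configuration may already contain some or all of this.

```python
# Sphinx configuration sketch (e.g. docs/source/conf.py; the path is an assumption)
extensions = [
    "sphinx.ext.intersphinx",  # resolves roles against external inventories
]

intersphinx_mapping = {
    # name: (base URL of the external docs, None = fetch objects.inv from there)
    "python": ("https://docs.python.org/3", None),
    "pandas": ("https://pandas.pydata.org/docs", None),
}
```

Intersphinx looks targets up by the name under which the external project documents them, so the pandas reference would likely need to be written as :meth:`pandas.DataFrame.query` rather than with the `pd` alias.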

gipert · 2025-12-02T10:29:33Z

src/lgdo/lh5/iterator.py

+            number of processes. If ``None``, use number equal to threads available
+            to ``executor`` (if provided), or else do not parallelize
+        executor:
+            `concurrent.futures.Executor <https://docs.python.org/3/library/concurrent.futures.html>`_


use intersphinx mapping here too

gipert · 2025-12-02T10:30:16Z

src/lgdo/lh5/iterator.py

+        Parameters
+        ----------
+        ax:
+            Axis object(s) used to construct the histogram. Can provide a ``Hist``


Suggested change

- Axis object(s) used to construct the histogram. Can provide a ``Hist``
+ Axis object(s) used to construct the histogram. Can provide a :class:`.types.hist.Hist`

or whatever is the right path

gipert · 2025-12-02T10:30:51Z

src/lgdo/lh5/iterator.py

+              - ``Mapping[str, ArrayLike]``: mapping from axis name to values
+              - ``pandas.DataFrame``: pandas dataframe. Treat as mapping from column
+                name to values
+            - A string expression. This will call `pd.DataFrame.query <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html>`_ and return


intersphinx reference

gipert · 2025-12-02T10:31:07Z

src/lgdo/lh5/iterator.py

+            number of processes. If ``None``, use number equal to threads available
+            to ``executor`` (if provided), or else do not parallelize
+        executor:
+            `concurrent.futures.Executor <https://docs.python.org/3/library/concurrent.futures.html>`_


intersphinx

gipert · 2025-12-02T10:31:25Z

src/lgdo/lh5/iterator.py

+            object for managing parallelism. If ``None``, create a ``ProcessPoolExecutor`` with
+            number of processes equal to ``processes``.
+        hist_kwargs:
+            additional keyword arguments for constructing Hist. See `hist.Hist`.


Suggested change

- additional keyword arguments for constructing Hist. See `hist.Hist`.
+ additional keyword arguments for constructing Hist. See :class:`.types.hist.Hist`.

gipert · 2025-12-02T10:31:56Z

src/lgdo/lh5/iterator.py

+        elif isinstance(data, Collection):
+            hist.fill(*data)
+        else:
+            msg = "data returned by where is not compatible with hist. Must be a 1d or 2d numpy array, a list of arrays, or a mapping from str to array"


Suggested change

- msg = "data returned by where is not compatible with hist. Must be a 1d or 2d numpy array, a list of arrays, or a mapping from str to array"
+ msg = (
+     "data returned by where is not compatible with hist. Must "
+     "be a 1d or 2d numpy array, a list of arrays, or a mapping from str to array"
+ )

gipert

thanks Ian! I have some suggestions on how to improve the docstrings

gipert · 2025-12-02T14:14:48Z

Could you also update the docs for the buffer_len arg of LH5Iterator? There's no mention of inputting it in units of bytes.

Added multi-processing capabilities to LH5Iterator, including map, query, and hist functions

3f05df2

iguinn added 2 commits December 1, 2025 10:16

Merge branch 'main' into iter

e3180b1

Fix doc strings and replace processes/chunks with processes/executor

fd94d58

gipert reviewed Dec 2, 2025

