Added multi-processing capabilities to LH5Iterator, including map, query, and hist functions #192
base: main
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #192 +/- ##
==========================================
- Coverage 80.91% 80.86% -0.05%
==========================================
Files 47 47
Lines 3747 3910 +163
==========================================
+ Hits 3032 3162 +130
- Misses 715 748 +33
Very cool functionality! I think we should merge it, as you are not changing existing code. But I would suggest writing a documentation page (or notebook) about the iterator. At this point there is quite a lot of functionality, and I think it's worth having a new section in the docs.
Hi Luigi, yes, I would like to merge this soon, but I agree there are a few things to do first. (These are the same changes as another pull request that was getting messy due to cross-dependencies on other PRs, so I resubmitted.)
Hi @iguinn, I'm trying this out on the ssc (big) data and it works nicely! Can we add a little more documentation and merge?
Parameters
----------
fun:
Suggested change:
- function with signature fun(lh5_obj: Table, it: LH5Iterator) -> Any
+ function with signature ``fun(lh5_obj: Table, it: LH5Iterator) -> Any``
Suggested change:
- """Map function over iterator blocks and return order-preserving list
- of outputs. Can be multi-threaded provided there are no attempts to
- modify existing objects. Multi-threading splits the iterator into
- multiple independent streams with an approximately equal number of
- files/groups, concurrently processed under a single program multiple
- data model. Results will be returned asynchronously for each process.
+ """Map function over iterator blocks.
+ Returns order-preserving list of outputs. Can be multi-threaded
+ provided there are no attempts to modify existing objects.
+ Multi-threading splits the iterator into multiple independent
+ streams with an approximately equal number of files/groups,
+ concurrently processed under a single program multiple data
+ model. Results will be returned asynchronously for each process.
multiple independent streams with an approximately equal number of
files/groups, concurrently processed under a single program multiple
data model. Results will be returned asynchronously for each process.

Can we add a simple example here?
Examples
--------
>>> it = LH5Iterator(...)
>>>
>>> def fun():
etc
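For what it's worth, here is a rough sketch of the behaviour such an example could show. It does not use the real classes: `blocks`, `count_above` and `map_blocks` are hypothetical stand-ins, plain lists replace `Table` blocks, and a `ThreadPoolExecutor` is used in place of a process pool so the snippet runs anywhere. The per-block function only reads its input, which is the condition the docstring places on parallel use.

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from typing import Any, Callable, List, Optional, Sequence

# Hypothetical stand-in for the blocks an iterator would yield: each block
# is a plain list of energies. The real iterator yields Table blocks.
blocks = [[1.0, 5.0, 9.0], [2.0, 6.0], [3.0, 7.0, 8.0, 4.0]]


def count_above(block: Sequence[float], threshold: float = 4.5) -> int:
    """Per-block function: reads the block and never modifies it."""
    return sum(1 for energy in block if energy > threshold)


def map_blocks(
    fun: Callable[[Sequence[float]], Any],
    blocks: Sequence[Sequence[float]],
    processes: Optional[int] = None,
    executor: Optional[Executor] = None,
) -> List[Any]:
    """Apply ``fun`` to every block and return the results in block order.

    Illustrative only; this is not the LH5Iterator implementation.
    """
    if executor is not None:
        # Executor.map yields results in the order the inputs were given.
        return list(executor.map(fun, blocks))
    if processes is None:
        # No parallelism requested: plain serial loop.
        return [fun(block) for block in blocks]
    # Assumption: threads instead of processes, to keep the sketch portable.
    with ThreadPoolExecutor(max_workers=processes) as pool:
        return list(pool.map(fun, blocks))


serial = map_blocks(count_above, blocks)
parallel = map_blocks(count_above, blocks, processes=2)
print(serial, parallel)
```

`Executor.map` returns results in submission order, which is one simple way to get an order-preserving list; how the PR itself merges results from its independent streams is not shown here.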
number of processes. If ``None``, use number equal to threads available
to ``executor`` (if provided), or else do not parallelize
executor:
Suggested change:
- `concurrent.futures.Executor <https://docs.python.org/3/library/concurrent.futures.html>`_
+ :class:`concurrent.futures.Executor`
does this work?
Suggested change:
- Query the data files in the iterator and return the selected data
- as a single table in one of several formats.
+ Query the data files in the iterator.
+ Returns the selected data as a single table in one of several formats.
docstrings should always be: one-line description and then full description
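To make the convention concrete, a minimal sketch of the layout follows: a one-line summary, a blank line, then the extended description and the parameter section. The function `query_blocks` and its parameters are hypothetical and exist only to carry the docstring.

```python
import inspect


def query_blocks(where, processes=None):
    """Query the blocks of a dataset.

    Returns the selected entries as a single table. The line above is the
    one-line summary; this paragraph is the full description.

    Parameters
    ----------
    where:
        selection applied to every block.
    processes:
        number of processes. If ``None``, do not parallelize.
    """
    # Hypothetical function: only the docstring layout matters here.
    raise NotImplementedError("illustration of docstring layout only")


# inspect.getdoc strips the common indentation, as documentation tools do.
doc_lines = inspect.getdoc(query_blocks).splitlines()
summary, separator = doc_lines[0], doc_lines[1]
print(summary)
```

Documentation tools typically use the first line as the entry in summary tables, which is why it should stand on its own as a sentence.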
- ``pandas.DataFrame``: pandas dataframe. Treat as mapping from column
name to values
Suggested change:
- - A string expression. This will call `pd.DataFrame.query <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html>`_ and return
+ - A string expression. This will call :meth:`pd.DataFrame.query` and return
this should also work, to be tested
number of processes. If ``None``, use number equal to threads available
to ``executor`` (if provided), or else do not parallelize
executor:
`concurrent.futures.Executor <https://docs.python.org/3/library/concurrent.futures.html>`_

use intersphinx mapping here too
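For context, cross-references such as :class:`concurrent.futures.Executor` only resolve if the Sphinx configuration loads the intersphinx extension and maps the external projects. A sketch of the relevant fragment is below; it assumes a standard `conf.py`, and the project's actual configuration may already contain these entries or differ from them.

```python
# Fragment of a Sphinx conf.py (assumed layout, not copied from the project).

# Enable cross-project references.
extensions = [
    "sphinx.ext.intersphinx",
]

# Map a short project key to the base URL of its documentation; None tells
# Sphinx to fetch objects.inv from the default location under that URL.
intersphinx_mapping = {
    "python": ("https://docs.python.org/3", None),
    "pandas": ("https://pandas.pydata.org/docs", None),
}
```

Targets are looked up by their fully qualified name in the remote inventory, so the pandas method would most likely need to be spelled `pandas.DataFrame.query` rather than `pd.DataFrame.query` for the reference to resolve.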
Parameters
----------
ax:
Suggested change:
- Axis object(s) used to construct the histogram. Can provide a ``Hist``
+ Axis object(s) used to construct the histogram. Can provide a :class:`.types.hist.Hist`
or whatever is the right path
- ``Mapping[str, ArrayLike]``: mapping from axis name to values
- ``pandas.DataFrame``: pandas dataframe. Treat as mapping from column
name to values
- A string expression. This will call `pd.DataFrame.query <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html>`_ and return

intersphinx reference
number of processes. If ``None``, use number equal to threads available
to ``executor`` (if provided), or else do not parallelize
executor:
`concurrent.futures.Executor <https://docs.python.org/3/library/concurrent.futures.html>`_

intersphinx
object for managing parallelism. If ``None``, create a ``ProcessPoolExecutor`` with
number of processes equal to ``processes``.
hist_kwargs:
Suggested change:
- additional keyword arguments for constructing Hist. See `hist.Hist`.
+ additional keyword arguments for constructing Hist. See :class:`.types.hist.Hist`.
elif isinstance(data, Collection):
    hist.fill(*data)
else:
Suggested change:
-     msg = "data returned by where is not compatible with hist. Must be a 1d or 2d numpy array, a list of arrays, or a mapping from str to array"
+     msg = (
+         "data returned by where is not compatible with hist. Must "
+         "be a 1d or 2d numpy array, a list of arrays, or a mapping from str to array"
+     )
gipert left a comment:
thanks Ian! I have some suggestions on how to improve the docstrings
could you also update the docs for the
Added functions to LH5Iterator:
map takes an input function with arguments of type Table and LH5Iterator and applies it to each block of data, returning a list of the results (similar to the Python built-in map). In addition, the user can provide:
query takes an input function with arguments of type Table and LH5Iterator and returns a Table, pandas dataframe, awkward array, or numpy array with some sort of processing (including down-selection) applied. This function is mapped over the full dataset pointed to by the iterator, and the results are merged into a single Table/dataframe/awkward array. This can be used with multiprocessing.
In addition, a pandas query can be provided as a string; this will return a pandas dataframe with the query used to select which entries to include. This feature could be expanded for awkward and other datatypes in the future (see Secure Table.eval() #135 for why this is not done yet).
hist takes a list of Hist.axis specifications, a query function/str, and a list of keys, and builds a histogram out of the queried data (filling it iteratively so that we never have to hold the full table in memory at once). This can be used with multiprocessing.
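The query-then-merge and fill-iteratively ideas described above can be sketched without the iterator itself. In the snippet below, small pandas dataframes stand in for the blocks, `DataFrame.query` plays the role of the string query, and `numpy.histogram` with fixed bin edges stands in for a `Hist` object. The column names, the `where` expression and the bin edges are all made up for illustration, and none of this is the PR's implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the blocks an iterator would yield.
blocks = [
    pd.DataFrame({"energy": [100.0, 250.0, 900.0], "is_good": [True, False, True]}),
    pd.DataFrame({"energy": [450.0, 550.0], "is_good": [True, True]}),
    pd.DataFrame({"energy": [50.0, 999.0, 300.0], "is_good": [False, True, True]}),
]

# A string selection in pandas query syntax (made-up cut).
where = "is_good and energy > 200"

# query-like behaviour: select within each block, then merge the pieces
# into a single dataframe.
selected = pd.concat([block.query(where) for block in blocks], ignore_index=True)

# hist-like behaviour: fix the binning up front and accumulate counts block
# by block, so only one block's selection is held in memory at a time.
edges = np.linspace(0.0, 1000.0, 5)  # bin edges 0, 250, 500, 750, 1000
counts = np.zeros(len(edges) - 1, dtype=np.int64)
for block in blocks:
    block_counts, _ = np.histogram(block.query(where)["energy"], bins=edges)
    counts += block_counts

print(selected["energy"].tolist())
print(counts.tolist())
```

Because per-block counts simply add, the same accumulation also works when blocks are processed by separate workers and the partial histograms are summed at the end.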
In addition, some other functions were implemented to support these:
Added tests for map, query and hist.