# Autopopulate 2.0 #1244
Base branch: `feat/autopopulate2`
## Conversation
fix: improve error handling when `make_fetch` referential integrity fails
```diff
@@ -249,6 +254,8 @@ def populate(
             to be passed down to each ``make()`` call. Computation arguments should be
             specified within the pipeline e.g. using a `dj.Lookup` table.
         :type make_kwargs: dict, optional
+        :param schedule_jobs: if True, run schedule_jobs before doing populate (default: True),
```
Rather than baking this operation into populate, which makes the logic more convoluted, consider making `schedule_jobs` a separate, explicit process.
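
As a rough illustration of the suggested decoupling (a hypothetical usage sketch: `MyTable` is a placeholder, and `schedule_jobs` follows the API proposed in this PR rather than any released DataJoint API):

```python
# Scheduling as its own explicit step, run on its own cadence:
MyTable.schedule_jobs()

# Workers then only consume already-scheduled jobs:
MyTable.populate(schedule_jobs=False, reserve_jobs=True)
```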
datajoint/autopopulate.py (Outdated):

```python
            )
        finally:
            if purge_invalid_jobs:
                self.purge_invalid_jobs()
```
Rather than purging jobs from within `schedule_jobs`, consider implementing it as a separate, explicit step, reducing interdependencies.
datajoint/autopopulate.py (Outdated):

```python
        if purge_invalid_jobs:
            self.purge_invalid_jobs()

    def purge_invalid_jobs(self):
```
Consider naming it simply `cleanup_jobs`, since the jobs are not technically "invalid".
This PR introduces significant changes to the logic of DataJoint's jobs reservation/orchestration scheme, namely the `autopopulate` mechanism. The PR aims to address the issue described in #1243, following proposed solution 1.

I have tested this new autopopulate 2.0 mechanism in some production pipeline settings, and it works great!

In short, the new logic is outlined below.
## Enhancing the Jobs Table in DataJoint-Python

To address current limitations, we'll enhance the jobs table by introducing new job statuses and modifying the `populate()` logic. This approach aims to improve efficiency and maintain data freshness.

### Modifying the Jobs Table

Expand the job statuses within the jobs table to include:

- `scheduled`: for jobs that are identified and queued for execution.
- `success`: to record jobs that have completed without errors.
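
A sketch of how the expanded status set might be queried, assuming the existing per-schema jobs table simply gains the new status values (the `status` attribute exists today; the `scheduled` and `success` values are this proposal's additions):

```python
import datajoint as dj

schema = dj.schema("my_pipeline")  # hypothetical schema name

# The jobs table is a regular table and can be restricted by status:
scheduled = schema.jobs & 'status = "scheduled"'  # queued for execution (new)
succeeded = schema.jobs & 'status = "success"'    # completed without errors (new)
errored = schema.jobs & 'status = "error"'        # failed attempts (existing)
```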
### Dedicated `schedule_jobs` Step

Introduce a new, dedicated step called `schedule_jobs`. This method will be responsible for populating the jobs table with new entries marked as `scheduled`.

- Uses `(table.key_source - table).fetch("KEY")` to identify new jobs. While this operation can be computationally expensive, it mirrors the current approach for job discovery.
- `schedule_jobs` will include configurable rate-limiting logic. For instance, it can skip scheduling if the most recent scheduling event occurred within a defined time period (e.g., 10 seconds); a sketch follows this list.
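
A minimal sketch of what this step might look like, assuming a module-level helper; `insert_job` and the `_last_scheduled` bookkeeping are hypothetical stand-ins, not names from the PR:

```python
import time

_last_scheduled = {}  # per-table timestamp of the most recent scheduling run

def schedule_jobs(table, min_interval=10):
    """Sketch: queue new work as 'scheduled' jobs, with simple rate limiting."""
    now = time.time()
    last = _last_scheduled.get(table.table_name)
    if last is not None and now - last < min_interval:
        return  # a scheduling run happened too recently; skip the expensive query
    _last_scheduled[table.table_name] = now
    # the same (potentially expensive) discovery query that populate uses today
    new_keys = (table.key_source - table).fetch("KEY")
    for key in new_keys:
        insert_job(table, key, status="scheduled")  # hypothetical jobs-table insert
```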
### New `populate()` Logic

The `populate()` function will be updated to:

- `schedule_jobs` can be called at the beginning of the `populate()` process to ensure the jobs table is up-to-date before work commences.
- Instead of `key_source`, `populate()` will fetch keys directly from the jobs table that have a `scheduled` status.
- `make()` will be called for each fetched key. Upon completion, the job's status in the jobs table will be updated to either `error` or `success` (see the sketch below).
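
A sketch of the revised loop under these assumptions; `fetch_scheduled_keys` and `set_job_status` are hypothetical stand-ins for the jobs-table bookkeeping:

```python
def populate_v2(table, schedule_jobs_first=True):
    """Sketch: populate driven by the jobs table rather than key_source."""
    if schedule_jobs_first:
        schedule_jobs(table)  # optional, rate-limited scheduling step (above)
    for key in fetch_scheduled_keys(table):  # hypothetical: jobs with status='scheduled'
        try:
            table.make(key)
        except Exception as err:
            set_job_status(table, key, "error", message=str(err))  # hypothetical
        else:
            set_job_status(table, key, "success")  # hypothetical
```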
### Addressing Stale or Out-of-Sync Jobs Data
The jobs table can become stale or out-of-sync if not updated frequently or if upstream data changes; `success` jobs can also become "invalid."

**`purge_invalid_jobs` Method**: To handle this, a new `purge_invalid_jobs` method will be added. This method will identify and remove these invalid entries from the jobs table, ensuring data integrity.
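
One way such a purge might work, assuming "invalid" means a key that has left `key_source` after upstream changes (this interpretation and the helper names are assumptions, not the PR's definition):

```python
def purge_invalid_jobs(table):
    """Sketch: remove job entries whose keys no longer appear in key_source."""
    valid_keys = table.key_source.fetch("KEY")  # expensive, like scheduling
    for key in fetch_job_keys(table, status=("scheduled", "success")):  # hypothetical
        if key not in valid_keys:
            delete_job(table, key)  # hypothetical helper: key no longer valid upstream
```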
### Keeping the Jobs Table "Fresh"

Maintaining a "fresh" jobs table is crucial for efficient operations:

- `schedule_jobs` will ensure that new tasks are promptly added to the queue.
- `purge_invalid_jobs` will keep the table clean and free of irrelevant or invalid entries.

**Trade-off**: Both `schedule_jobs` and `purge_invalid_jobs` will involve hitting `key_source`, which can be resource-intensive. Users (or system administrators) will need to balance the desired level of "freshness" against the associated resource consumption to optimize performance (see the cadence sketch below).

For a more detailed description of the new logic, see here.
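
As an illustration of that balance, a hypothetical worker loop might run the two expensive steps on a slow cadence while polling the cheap jobs table continuously. `MyTable`, the cadence values, and the function names reuse the sketches above and are placeholders:

```python
import time

EXPENSIVE_CYCLE = 600  # seconds between key_source hits (tune per pipeline)

while True:
    schedule_jobs(MyTable)        # hits key_source (expensive)
    purge_invalid_jobs(MyTable)   # hits key_source (expensive)
    deadline = time.time() + EXPENSIVE_CYCLE
    while time.time() < deadline:
        populate_v2(MyTable, schedule_jobs_first=False)  # cheap: reads jobs table only
        time.sleep(5)
```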