# Autopopulate 2.0 #1244
Base branch: `feat/autopopulate2`
## Conversation
fix: improve error handling when `make_fetch` referential integrity fails
```diff
@@ -249,6 +254,8 @@ def populate(
             to be passed down to each ``make()`` call. Computation arguments should be
             specified within the pipeline e.g. using a `dj.Lookup` table.
         :type make_kwargs: dict, optional
+        :param schedule_jobs: if True, run schedule_jobs before doing populate (default: True),
```
Rather than baking this operation into populate, which makes the logic more convoluted, consider making `schedule_jobs` a separate, explicit process.
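
As a rough illustration of the suggested decoupling (a hypothetical usage sketch: `MyTable` is a placeholder, and `schedule_jobs` follows the API proposed in this PR rather than any released DataJoint API):

```python
# Scheduling as its own explicit step, run on its own cadence:
MyTable.schedule_jobs()

# Workers then only consume already-scheduled jobs:
MyTable.populate(schedule_jobs=False, reserve_jobs=True)
```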
datajoint/autopopulate.py (Outdated):

```python
            )
        finally:
            if purge_invalid_jobs:
                self.purge_invalid_jobs()
```
Rather than purging jobs from within `schedule_jobs`, consider implementing it as a separate, explicit step, reducing interdependencies.
datajoint/autopopulate.py (Outdated):

```python
        if purge_invalid_jobs:
            self.purge_invalid_jobs()

    def purge_invalid_jobs(self):
```
Consider naming it simply `cleanup_jobs`, since the jobs are not technically "invalid".
This PR introduces significant changes to the logic of DataJoint's jobs reservation/orchestration scheme, namely the `autopopulate` mechanism. The PR aims to address the issue described in #1243, following proposed solution 1.

I have tested this new autopopulate 2.0 mechanism in some production pipeline settings, and it works great!

In short, the new logic is outlined below.
## Enhancing the Jobs Table in DataJoint-Python

To address current limitations, we'll enhance the jobs table by introducing new job statuses and modifying the `populate()` logic. This approach aims to improve efficiency and maintain data freshness.

### Modifying the Jobs Table

Expand the job statuses within the jobs table to include:

- `scheduled`: for jobs that are identified and queued for execution.
- `success`: to record jobs that have completed without errors.
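
A sketch of how the expanded status set might be queried, assuming the existing per-schema jobs table simply gains the new status values (the `status` attribute exists today; the `scheduled` and `success` values are this proposal's additions):

```python
import datajoint as dj

schema = dj.schema("my_pipeline")  # hypothetical schema name

# The jobs table is a regular table and can be restricted by status:
scheduled = schema.jobs & 'status = "scheduled"'  # queued for execution (new)
succeeded = schema.jobs & 'status = "success"'    # completed without errors (new)
errored = schema.jobs & 'status = "error"'        # failed attempts (existing)
```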
### Dedicated `schedule_jobs` Step

Introduce a new, dedicated step called `schedule_jobs`. This method will be responsible for populating the jobs table with new entries marked as `scheduled`.

- Uses `(table.key_source - table).fetch("KEY")` to identify new jobs. While this operation can be computationally expensive, it mirrors the current approach for job discovery.
- `schedule_jobs` will include configurable rate-limiting logic. For instance, it can skip scheduling if the most recent scheduling event occurred within a defined time period (e.g., 10 seconds); a sketch follows this list.
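
A minimal sketch of what this step might look like, assuming a module-level helper; `insert_job` and the `_last_scheduled` bookkeeping are hypothetical stand-ins, not names from the PR:

```python
import time

_last_scheduled = {}  # per-table timestamp of the most recent scheduling run

def schedule_jobs(table, min_interval=10):
    """Sketch: queue new work as 'scheduled' jobs, with simple rate limiting."""
    now = time.time()
    last = _last_scheduled.get(table.table_name)
    if last is not None and now - last < min_interval:
        return  # a scheduling run happened too recently; skip the expensive query
    _last_scheduled[table.table_name] = now
    # the same (potentially expensive) discovery query that populate uses today
    new_keys = (table.key_source - table).fetch("KEY")
    for key in new_keys:
        insert_job(table, key, status="scheduled")  # hypothetical jobs-table insert
```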
### New `populate()` Logic

The `populate()` function will be updated to:

- `schedule_jobs` can be called at the beginning of the `populate()` process to ensure the jobs table is up-to-date before work commences.
- Instead of `key_source`, `populate()` will fetch keys directly from the jobs table that have a `scheduled` status.
- `make()` will be called for each fetched key. Upon completion, the job's status in the jobs table will be updated to either `error` or `success` (see the sketch below).
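
A sketch of the revised loop under these assumptions; `fetch_scheduled_keys` and `set_job_status` are hypothetical stand-ins for the jobs-table bookkeeping:

```python
def populate_v2(table, schedule_jobs_first=True):
    """Sketch: populate driven by the jobs table rather than key_source."""
    if schedule_jobs_first:
        schedule_jobs(table)  # optional, rate-limited scheduling step (above)
    for key in fetch_scheduled_keys(table):  # hypothetical: jobs with status='scheduled'
        try:
            table.make(key)
        except Exception as err:
            set_job_status(table, key, "error", message=str(err))  # hypothetical
        else:
            set_job_status(table, key, "success")  # hypothetical
```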
### Addressing Stale or Out-of-Sync Jobs Data
The jobs table can become stale or out-of-sync if not updated frequently or if upstream data changes; `success` jobs can also become "invalid."

**`purge_invalid_jobs` Method**: To handle this, a new `purge_invalid_jobs` method will be added. This method will identify and remove these invalid entries from the jobs table, ensuring data integrity.
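
One way such a purge might work, assuming "invalid" means a key that has left `key_source` after upstream changes (this interpretation and the helper names are assumptions, not the PR's definition):

```python
def purge_invalid_jobs(table):
    """Sketch: remove job entries whose keys no longer appear in key_source."""
    valid_keys = table.key_source.fetch("KEY")  # expensive, like scheduling
    for key in fetch_job_keys(table, status=("scheduled", "success")):  # hypothetical
        if key not in valid_keys:
            delete_job(table, key)  # hypothetical helper: key no longer valid upstream
```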
### Keeping the Jobs Table "Fresh"

Maintaining a "fresh" jobs table is crucial for efficient operations:

- `schedule_jobs` will ensure that new tasks are promptly added to the queue.
- `purge_invalid_jobs` will keep the table clean and free of irrelevant or invalid entries.

**Trade-off**: Both `schedule_jobs` and `purge_invalid_jobs` will involve hitting `key_source`, which can be resource-intensive. Users (or system administrators) will need to balance the desired level of "freshness" against the associated resource consumption to optimize performance (see the cadence sketch below).

For a more detailed description of the new logic, see here.
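
As an illustration of that balance, a hypothetical worker loop might run the two expensive steps on a slow cadence while polling the cheap jobs table continuously. `MyTable`, the cadence values, and the function names reuse the sketches above and are placeholders:

```python
import time

EXPENSIVE_CYCLE = 600  # seconds between key_source hits (tune per pipeline)

while True:
    schedule_jobs(MyTable)        # hits key_source (expensive)
    purge_invalid_jobs(MyTable)   # hits key_source (expensive)
    deadline = time.time() + EXPENSIVE_CYCLE
    while time.time() < deadline:
        populate_v2(MyTable, schedule_jobs_first=False)  # cheap: reads jobs table only
        time.sleep(5)
```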