Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: command to initialize cron job to slowly backfill CAP term data #3425

Merged
merged 8 commits into from
Oct 16, 2024

Conversation

eddiesshop
Copy link
Contributor

@eddiesshop eddiesshop commented Sep 17, 2024

All Submissions:

Changes proposed in this Pull Request:

This cron (and by extension, the backfill script it executes) is crucial for our efforts to improve performance across our network of sites. This cron will slowly backfill CAP author term data on an hourly basis. This PR relies on this separate PR in Co-Authors Plus which creates the new backfill command. Once this data backfill is complete, we can then implement the CAP Author Archive filter to greatly improve author archive page performance.

Closes # .

How to test the changes in this Pull Request:

  1. Run this query and note the number returned: SELECT COUNT(*) FROM wp_posts WHERE post_type IN ( 'post' ) AND post_status IN ( 'publish' ) AND post_author <> 0 AND ID NOT IN ( SELECT tr.object_id FROM wp_term_relationships tr LEFT JOIN wp_term_taxonomy tt ON tr.term_taxonomy_id = tt.term_taxonomy_id WHERE tt.taxonomy = 'author' GROUP BY tr.object_id ) ORDER BY ID
  2. Run wp newspack schedule-co-authors-author-term-backfill
  3. Check that the cron job has been adding by running wp cron event list and searching for newspack_cap_author_term_backfill
  4. Force the cron job to run immediately with: wp cron event run newspack_cap_author_term_backfill
  5. Re-run the query from Step 1, and confirm that the number has gone down.

Other information:

  • Have you added an explanation of what your changes do and why you'd like us to include them?
  • Have you written new tests for your changes, as applicable?
  • Have you successfully ran tests with your changes locally?

@eddiesshop eddiesshop requested a review from a team as a code owner September 17, 2024 00:58
Copy link
Contributor

@dkoo dkoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eddiesshop this is a really cool concept for running a backfill slowly via cron/CLI! I can see this idea being turned into a more abstract tool that would let us schedule any CLI command—something like wp newspack-schedule-command <CLI command name> --args=<CLI command args in stringified format> --batch-size=50 --interval=hourly that would let us set the schedule and batch sizes via args.

No need to pursue that idea in this PR, though. For now I see some issues and potential improvements that we should consider for this specific backfill command:

  • The code as written isn't executing for me when I run the cron job, because the add_action( self::NEWSPACK_SCHEDULE_AUTHOR_TERM_BACKFILL ) is only executed in the callback that gets called on WP_CLI::add_command. We need to hook into the action on every process, like this.
  • Best practice for cron jobs scheduled by a main plugin is to register a deactivation hook to unschedule the job if the plugin gets deactivated, that way the cron job won't continue to execute (and fail with an error due to the callback not existing) if the plugin is inactive. Example
  • Instead of failing with an error if the job is already scheduled, we could simply unschedule future jobs and reschedule, like this
  • I tried for a bit to figure out how to capture the STDOUT output from the runcommand calls, but WP_CLI seems to flush its buffer after every log or line command. If we could somehow capture this output it would make this pattern really powerful as an abstract tool. For example, we could WP_CLI::runcommand( 'co-authors-plus list-posts-without-terms' ); and unschedule future jobs if there are no more posts to process.

I pushed some suggestions to a draft PR here just for convenience

Another major issue I saw with this implementation (not your fault but due to limitations in the CAP plugin): looking at the create-terms-for-posts command in the CAP plugin, I don't see support for the --batched or --records-per-batch args. In fact, the command doesn't seem to support batching at all—if I understand correctly it just always runs the command for all posts 100 at a time. And looking at the output of the command it does seem to query all posts on every run, but skips processing the posts that have terms. So based on that, is the scheduled CLI approach even getting us any performance benefits? It seems like for a site with 1 million posts, every cron job will query 1 million posts. This would make things a lot more complicated, but maybe you could try hooking into the WP query args and adding our own batching/offsetting—WDYT? That would also make this tool useful for scheduling any CLI command, whether or not the command supports batching out of the box.

includes/cli/class-co-authors-plus.php Outdated Show resolved Hide resolved
includes/cli/class-co-authors-plus.php Outdated Show resolved Hide resolved
includes/cli/class-co-authors-plus.php Show resolved Hide resolved
@eddiesshop eddiesshop requested a review from dkoo October 3, 2024 21:02
@eddiesshop
Copy link
Contributor Author

NOTE: This PR relies on this separate PR in Co-Authors Plus to function.

Copy link
Contributor

@dkoo dkoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eddiesshop this is working nicely now, thanks for the changes! I confirmed that with the changes from Automattic/Co-Authors-Plus#1060 the command doesn't do unnecessary queries or processing of all posts, so I think it's okay to let it keep running as a background process in that case. However, I think we should wait to merge this PR until that change gets deployed to the CAP plugin, otherwise the command will query all posts on every cron execution.

I also have a couple of minor suggestions for improving the logging in case of errors, but after that I think this is ready.

includes/cli/class-co-authors-plus.php Outdated Show resolved Hide resolved
includes/cli/class-co-authors-plus.php Outdated Show resolved Hide resolved
eddiesshop and others added 2 commits October 7, 2024 10:31
Adding missing `exit_error` option to be able to receive any errors that may have caused the script to fail.

Co-authored-by: Derrick Koo <[email protected]>
Fixing the payload for Newspack logger so that events can be logged properly.

Co-authored-by: Derrick Koo <[email protected]>
Copy link
Contributor

@dkoo dkoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates! Approved pending release of the dependent PR in CAP.

@github-actions github-actions bot added the [Status] Approved The pull request has been reviewed and is ready to merge label Oct 7, 2024
@dkoo
Copy link
Contributor

dkoo commented Oct 9, 2024

@eddiesshop looks like Automattic/Co-Authors-Plus#1060 has been merged! Do we know when it might get deployed to the plugin repository?

@leogermani
Copy link
Contributor

Do we know when it might get deployed to the plugin repository?

I'm working on this

@eddiesshop eddiesshop merged commit 0b0d79a into trunk Oct 16, 2024
7 checks passed
@eddiesshop eddiesshop deleted the add/author-term-backfill-cron branch October 16, 2024 21:45
matticbot pushed a commit that referenced this pull request Oct 17, 2024
# [5.6.0-alpha.2](v5.6.0-alpha.1...v5.6.0-alpha.2) (2024-10-17)

### Features

* command to initialize cron job to slowly backfill CAP term data ([#3425](#3425)) ([0b0d79a](0b0d79a))
@matticbot
Copy link
Contributor

🎉 This PR is included in version 5.6.0-alpha.2 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

matticbot pushed a commit that referenced this pull request Oct 28, 2024
# [5.6.0](v5.5.2...v5.6.0) (2024-10-28)

### Bug Fixes

* cancelled subscriptions sync ([#3466](#3466)) ([6cad50a](6cad50a))
* **woocommerce:** update how order meta are updated ([#2711](#2711)) ([ae75548](ae75548))

### Features

* add user name to woocommerce data events ([#3473](#3473)) ([2b57d27](2b57d27))
* command to initialize cron job to slowly backfill CAP term data ([#3425](#3425)) ([0b0d79a](0b0d79a))
* **metering:** dispatch RAS activity on content restriction ([#3437](#3437)) ([4e1c262](4e1c262))
@matticbot
Copy link
Contributor

🎉 This PR is included in version 5.6.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
released on @alpha released [Status] Approved The pull request has been reviewed and is ready to merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants