-
Notifications
You must be signed in to change notification settings - Fork 49
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: command to initialize cron job to slowly backfill CAP term data #3425
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eddiesshop this is a really cool concept for running a backfill slowly via cron/CLI! I can see this idea being turned into a more abstract tool that would let us schedule any CLI command—something like wp newspack-schedule-command <CLI command name> --args=<CLI command args in stringified format> --batch-size=50 --interval=hourly
that would let us set the schedule and batch sizes via args.
No need to pursue that idea in this PR, though. For now I see some issues and potential improvements that we should consider for this specific backfill command:
- The code as written isn't executing for me when I run the cron job, because the
add_action( self::NEWSPACK_SCHEDULE_AUTHOR_TERM_BACKFILL )
is only executed in the callback that gets called onWP_CLI::add_command
. We need to hook into the action on every process, like this. - Best practice for cron jobs scheduled by a main plugin is to register a deactivation hook to unschedule the job if the plugin gets deactivated, that way the cron job won't continue to execute (and fail with an error due to the callback not existing) if the plugin is inactive. Example
- Instead of failing with an error if the job is already scheduled, we could simply unschedule future jobs and reschedule, like this
- I tried for a bit to figure out how to capture the
STDOUT
output from theruncommand
calls, but WP_CLI seems to flush its buffer after everylog
orline
command. If we could somehow capture this output it would make this pattern really powerful as an abstract tool. For example, we couldWP_CLI::runcommand( 'co-authors-plus list-posts-without-terms' );
and unschedule future jobs if there are no more posts to process.
I pushed some suggestions to a draft PR here just for convenience
Another major issue I saw with this implementation (not your fault but due to limitations in the CAP plugin): looking at the create-terms-for-posts
command in the CAP plugin, I don't see support for the --batched
or --records-per-batch
args. In fact, the command doesn't seem to support batching at all—if I understand correctly it just always runs the command for all posts 100 at a time. And looking at the output of the command it does seem to query all posts on every run, but skips processing the posts that have terms. So based on that, is the scheduled CLI approach even getting us any performance benefits? It seems like for a site with 1 million posts, every cron job will query 1 million posts. This would make things a lot more complicated, but maybe you could try hooking into the WP query args and adding our own batching/offsetting—WDYT? That would also make this tool useful for scheduling any CLI command, whether or not the command supports batching out of the box.
NOTE: This PR relies on this separate PR in Co-Authors Plus to function. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eddiesshop this is working nicely now, thanks for the changes! I confirmed that with the changes from Automattic/Co-Authors-Plus#1060 the command doesn't do unnecessary queries or processing of all posts, so I think it's okay to let it keep running as a background process in that case. However, I think we should wait to merge this PR until that change gets deployed to the CAP plugin, otherwise the command will query all posts on every cron execution.
I also have a couple of minor suggestions for improving the logging in case of errors, but after that I think this is ready.
Adding missing `exit_error` option to be able to receive any errors that may have caused the script to fail. Co-authored-by: Derrick Koo <[email protected]>
Fixing the payload for Newspack logger so that events can be logged properly. Co-authored-by: Derrick Koo <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates! Approved pending release of the dependent PR in CAP.
@eddiesshop looks like Automattic/Co-Authors-Plus#1060 has been merged! Do we know when it might get deployed to the plugin repository? |
I'm working on this |
# [5.6.0-alpha.2](v5.6.0-alpha.1...v5.6.0-alpha.2) (2024-10-17) ### Features * command to initialize cron job to slowly backfill CAP term data ([#3425](#3425)) ([0b0d79a](0b0d79a))
🎉 This PR is included in version 5.6.0-alpha.2 🎉 The release is available on GitHub release Your semantic-release bot 📦🚀 |
# [5.6.0](v5.5.2...v5.6.0) (2024-10-28) ### Bug Fixes * cancelled subscriptions sync ([#3466](#3466)) ([6cad50a](6cad50a)) * **woocommerce:** update how order meta are updated ([#2711](#2711)) ([ae75548](ae75548)) ### Features * add user name to woocommerce data events ([#3473](#3473)) ([2b57d27](2b57d27)) * command to initialize cron job to slowly backfill CAP term data ([#3425](#3425)) ([0b0d79a](0b0d79a)) * **metering:** dispatch RAS activity on content restriction ([#3437](#3437)) ([4e1c262](4e1c262))
🎉 This PR is included in version 5.6.0 🎉 The release is available on GitHub release Your semantic-release bot 📦🚀 |
All Submissions:
Changes proposed in this Pull Request:
This cron (and by extension, the backfill script it executes) is crucial for our efforts to improve performance across our network of sites. This cron will slowly backfill CAP author term data on an hourly basis. This PR relies on this separate PR in Co-Authors Plus which creates the new backfill command. Once this data backfill is complete, we can then implement the CAP Author Archive filter to greatly improve author archive page performance.
Closes # .
How to test the changes in this Pull Request:
SELECT COUNT(*) FROM wp_posts WHERE post_type IN ( 'post' ) AND post_status IN ( 'publish' ) AND post_author <> 0 AND ID NOT IN ( SELECT tr.object_id FROM wp_term_relationships tr LEFT JOIN wp_term_taxonomy tt ON tr.term_taxonomy_id = tt.term_taxonomy_id WHERE tt.taxonomy = 'author' GROUP BY tr.object_id ) ORDER BY ID
wp newspack schedule-co-authors-author-term-backfill
wp cron event list
and searching fornewspack_cap_author_term_backfill
wp cron event run newspack_cap_author_term_backfill
Other information: