add: created a new CLI cmd to backfill missing author terms for posts. #1060

Merged: 6 commits, Oct 9, 2024

Conversation

eddiesshop
Contributor

Description

This is a new, more efficient command to backfill author term data for any posts that are missing it. It is based on the old command. The key differences between the two are:

  1. If the old command is killed while it is backfilling data, re-running `wp co-authors-plus create-terms-for-posts` always starts again from the first post. On large sites, this can mean the command never finishes executing. The new command instead uses SQL to select only the posts that are missing the requisite author term in the relationships table.
  2. Because only posts that are missing the author term are selected, each subsequent run processes a smaller and smaller list of posts, unlike the existing command.
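
To illustrate the "only select what is missing" idea, here is a minimal sketch using an in-memory SQLite database. It is not the query from this PR: the table and column names are simplified stand-ins for the WordPress schema, the taxonomy name `author` is an assumption, and `find_posts_missing_author_terms` and `build_demo_db` are hypothetical helpers.

```python
import sqlite3


def find_posts_missing_author_terms(conn, limit=None):
    """Return IDs of posts with no relationship row pointing at an 'author' term."""
    sql = """
        SELECT p.ID
        FROM posts AS p
        WHERE p.post_type = 'post'
          AND NOT EXISTS (
              SELECT 1
              FROM term_relationships AS tr
              JOIN term_taxonomy AS tt
                ON tt.term_taxonomy_id = tr.term_taxonomy_id
              WHERE tr.object_id = p.ID
                AND tt.taxonomy = 'author'  -- assumed taxonomy name
          )
        ORDER BY p.ID
    """
    params = ()
    if limit is not None:
        sql += " LIMIT ?"
        params = (limit,)
    return [row[0] for row in conn.execute(sql, params)]


def build_demo_db():
    """Create a tiny, simplified stand-in for the WordPress tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE posts (ID INTEGER PRIMARY KEY, post_type TEXT, post_author INTEGER);
        CREATE TABLE term_taxonomy (term_taxonomy_id INTEGER PRIMARY KEY, taxonomy TEXT);
        CREATE TABLE term_relationships (object_id INTEGER, term_taxonomy_id INTEGER);
        INSERT INTO posts VALUES (1, 'post', 10), (2, 'post', 10), (3, 'post', 11);
        INSERT INTO term_taxonomy VALUES (100, 'author'), (200, 'category');
        -- Post 1 has an author term, post 2 only a category, post 3 nothing.
        INSERT INTO term_relationships VALUES (1, 100), (2, 200);
    """)
    return conn


if __name__ == "__main__":
    conn = build_demo_db()
    print(find_posts_missing_author_terms(conn))
```

Because the query excludes any post that already has an author term, a re-run after an interruption naturally resumes with whatever is still missing rather than starting from the first post.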

Deploy Notes

Are there any new dependencies added that should be taken into account when deploying to WordPress.org?
No, this is an entirely new command.

Steps to Test

  1. Check out PR.
  2. Take a DB backup.
  3. Run the new command: `wp co-authors-plus create-author-terms-for-posts`.
  4. Once complete, run the old command: `wp co-authors-plus create-terms-for-posts`. Notice that no new author terms are created.

@leogermani
Contributor

@GaryJones do you know why we have these failing tests here? I see that we no longer support PHP 7.1 (`* Requires PHP: 7.4`), and for some reason these integration tests are not running in other open PRs.

@GaryJones
Contributor

> do you know why we have these failing tests here?

Since https://github.com/Automattic/Co-Authors-Plus/blob/develop/.github/workflows/integrate.yml uses PHP 7.4, it looks like this PR branch should target develop (not main, since that's our production release branch).

@GaryJones GaryJones removed their request for review October 8, 2024 09:05
@leogermani leogermani changed the base branch from main to develop October 8, 2024 11:46
Contributor

@leogermani leogermani left a comment

Looks and tests well.

Can we just add a comment to the command description highlighting the difference between this and the old command?

I also think we could add a comment in the old command, saying that we should give preference to this new and more robust version of it.

And a NIT, not blocking: since the batching occurs at runtime, there is no real difference from a user perspective running the command with or without it, from a functional point of view. The only thing that can happen is an OOM if you run without batching... In other words, there is no reason to have an option to run this command without doing batches. Is there? We could leave just the option for the batch size and always run in batches... WDYT?

@eddiesshop
Contributor Author

> Can we just add a comment to the command description highlighting the difference between this and the old command?
>
> I also think we could add a comment in the old command, saying that we should give preference to this new and more robust version of it.

Great idea! Done here.

> And a NIT, not blocking: since the batching occurs at runtime, there is no real difference from a user perspective running the command with or without it, from a functional point of view. The only thing that can happen is an OOM if you run without batching... In other words, there is no reason to have an option to run this command without doing batches. Is there? We could leave just the option for the batch size and always run in batches... WDYT?

That batching approach really comes from an L&I point of view. Oftentimes, our migration commands can be interrupted due to memory constraints, or we can cause MySQL binlog replication issues. Although not a perfect approach, tackling a certain number of records at a time usually mitigates those problems. I don't mind removing it though, since the command should operate just fine if it were restarted due to an interruption during execution.

@leogermani
Contributor

> That batching approach really comes from an L&I point of view. Oftentimes, our migration commands can be interrupted due to memory constraints, or we can cause MySQL binlog replication issues. Although not a perfect approach, tackling a certain number of records at a time usually mitigates those problems. I don't mind removing it though, since the command should operate just fine if it were restarted due to an interruption during execution.

I don't mean removing the batching approach. I mean making it the default and only approach. There is no benefit in not using it.

@eddiesshop
Contributor Author

I obviously need some more coffee today!

I think we shouldn't remove the batched flag, but we should make it true by default. That way, if anyone wants to let it rip, they have that option.
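
The outcome of this thread (batching on by default, with an explicit opt-out) could look roughly like the sketch below. It is an illustration only, not the plugin's code: `--unbatched` is the flag named in the follow-up commit, while `--records-per-batch`, `InMemoryStore`, and `backfill` are hypothetical names. The sketch also assumes every `create_term` call succeeds; if a post could never be fixed, the loop would need an extra guard.

```python
import argparse


def parse_args(argv):
    """Batching is the default; --unbatched opts out of it."""
    parser = argparse.ArgumentParser(prog="backfill-sketch")
    parser.add_argument("--unbatched", action="store_true",
                        help="process every missing post in a single pass")
    parser.add_argument("--records-per-batch", type=int, default=250,
                        help="hypothetical batch-size option")
    return parser.parse_args(argv)


class InMemoryStore:
    """Stand-in for the database: tracks posts still missing an author term."""

    def __init__(self, missing_ids):
        self.missing = set(missing_ids)
        self.fetch_calls = 0

    def fetch_missing(self, limit=None):
        self.fetch_calls += 1
        ids = sorted(self.missing)
        return ids if limit is None else ids[:limit]

    def create_term(self, post_id):
        # Assumed to always succeed in this sketch.
        self.missing.discard(post_id)


def backfill(store, unbatched=False, batch_size=250):
    """Re-query the remaining posts after each pass until none are left."""
    limit = None if unbatched else batch_size
    processed = 0
    while True:
        post_ids = store.fetch_missing(limit)
        if not post_ids:
            break
        for post_id in post_ids:
            store.create_term(post_id)
        processed += len(post_ids)
    return processed


if __name__ == "__main__":
    args = parse_args([])
    store = InMemoryStore(range(1, 6))
    print(backfill(store, args.unbatched, args.records_per_batch))
```

Either mode produces the same end state; batching only bounds how many rows are held and written per pass, which is the memory and replication concern raised above.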

@leogermani leogermani merged commit d5f9404 into develop Oct 9, 2024
13 of 15 checks passed
@leogermani leogermani deleted the add/author_term_backfill_command branch October 9, 2024 20:23
leogermani added a commit that referenced this pull request Nov 5, 2024
* Increase composer.json required PHP version to 7.4

* Update README to match required PHP version 7.4

* Remove PHP 7.1 from integration tests

* PHP 7.4: Use array_key_first()

Slightly cleaner to use the native function for getting the first key's value from an array.

* PHP 7.4: Use instanceof

* PHP 7.4: Use null coalescing

* PHP 7.4: Add return types

* PHP 7.4: Collapse nested dirname() calls

* CI: Remove MySQL workaround for PHP <= 7.3

* Increase WordPress required version to 5.9

* Update integration tests to use WordPress 5.9

* Remove unnecessary phpunit versions for WordPress 5.9

* CI: Update tested versions

Doesn't make sense to test WP versions with unsupported PHP versions (e.g. WP 5.9 with PHP 8.3).

* Composer: Update dev-dependencies

* PHPCS: Consolidate config into config file

The PHPCS in the composer.json was duplicating but obscuring some aspects of what was in the `phpcs.xml.dist` file. This change consolidates the Composer commands and the config file.

* Support for Yoast %%name%% variable

* CI: Update deploy.yml

Increase actions/checkout dependency version.

* CI: Update integrate.yml action versions

* Contents edited to consolidate instructions within the Wiki and bring more attention to its existence (#1055)

* add: created a new CLI cmd to backfill missing author terms for posts. (#1060)

* add: created a new CLI cmd to backfill missing author terms for posts.

* add: adding some comments to the new and old backfill commands.

The comments are meant to clarify the key differences between the two commands, and that the new one should be preferred over the old one.

* add: batching is the default, pass `--unbatched` flag to run w/o it.

---------

Co-authored-by: Gary Jones <[email protected]>
Co-authored-by: Alec Geatches <[email protected]>

* Fix/missing wp user type (#988)

* fix: preventing loss of fact that a guest author might also be a WP_User

* fix: making the update operation dependent on $append flag.

This might be a problematic decision. But the way I justify this change is that if you are appending co-authors, there may already be a WP_User set as the author. So we don't really have to care whether one is passed or not. Because of this, we do not need to forcibly return a `false` flag since that is confusing to the caller, especially because we actually do save the guest authors which are given in the call! Instead, if the $append flag is false, we should expect that at least one user will be a WP_User. In that case, if none is passed in, then there is a mismatch of the intended authors. Because now, the `wp_posts.post_author` column will have an old `wp_users.ID` which remains set and most likely isn't the intent of the caller.

* fix: attempting DB update only when $new_author is not empty.

Also, returning the actual response from the DB, to make this call even more accurate in terms of what is actually happening at the DB layer.

* fix: need to ensure pure WP_User is processed correctly as post_author.

A pure WP_User (i.e. a WP_User that IS NOT linked to a Guest Author) needs to be handled specially.

* fix: a necessary refactor of the `get_coauthor_by` function.

This refactor is absolutely necessary in order for all the previous fixes to work as expected. Without this fix, what happens is that when you use `get_coauthor_by` by searching with a Guest Author, if that Guest Author has a valid link to a WP_User, it is summarily ignored. Functions like `add_coauthors` expect at least one coauthor to be a valid WP_User so that the `wp_posts.post_author` column can be appropriately updated. The only case where this function is returning an expected value is when you search by the WP_User first. When it arrives at `$guest_author = $this->guest_authors->get_guest_author_by( $key, $value, $force );`, `$guest_author === false`. It is then forced to move to the switch statement to find a user via their WP_User data.

With this refactor, `get_coauthor_by` will now check if the `linked_account` attribute is set. If so, it will attempt to find the corresponding user for the Guest Account. It still gives priority to returning a Guest Author. When a Guest Author is not found, it will search for a WP_User. If found, it will also search to see if a linked Guest Author account exists. If it does, it will return that Guest Author object instead, without losing the fact that this account also has a WP_User associated with it.
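
The lookup order described above can be sketched as follows. This is a simplified model, not the plugin's PHP: it looks authors up by login only, the `WPUser` and `GuestAuthor` classes and the dictionary arguments are hypothetical, and it returns `None` where the real function would return `false`. The `linked_account` and `wp_user` attribute names are taken from the commit messages.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Union


@dataclass
class WPUser:
    user_login: str


@dataclass
class GuestAuthor:
    user_login: str
    linked_account: str = ""          # login of the linked WP user, if any
    wp_user: Optional[WPUser] = None  # populated when the link resolves


def get_coauthor_by(
    login: str,
    guest_authors: Dict[str, GuestAuthor],
    wp_users: Dict[str, WPUser],
    guest_authors_enabled: bool = True,
) -> Union[GuestAuthor, WPUser, None]:
    # With guest authors disabled, only plain WP users are returned.
    if not guest_authors_enabled:
        return wp_users.get(login)

    # 1. A matching Guest Author takes priority; attach its linked WP user.
    guest = guest_authors.get(login)
    if guest is not None:
        if guest.linked_account:
            guest.wp_user = wp_users.get(guest.linked_account)
        return guest

    # 2. Otherwise, fall back to searching for a WP user.
    user = wp_users.get(login)
    if user is None:
        return None

    # 3. If a Guest Author is linked to that user, return the Guest Author
    #    while keeping the WP user attached to it.
    for candidate in guest_authors.values():
        if candidate.linked_account == user.user_login:
            candidate.wp_user = user
            return candidate

    # 4. A pure WP user with no linked Guest Author.
    return user
```

In this model, searching by either the Guest Author or its linked WP user yields the same Guest Author object with `wp_user` set, which is what lets a caller still find a valid user for `wp_posts.post_author`.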

* fix: returning a plain WP_User if guest authors is not enabled.

I forgot to run tests on my previous commit. This satisfies the test Test_CoAuthors_Plus::test_get_coauthor_by_when_guest_authors_not_enabled which is expecting a WP_User when the plugin is not enabled.

* feat: adding additional tests for co-authors-plus.php functionality.

* fix: preventing loss of fact that a guest author might also be a WP_User

* fix: making the update operation dependent on $append flag.

This might be a problematic decision. But the way I justify this change is that if you are appending co-authors, there may already be a WP_User set as the author. So we don't really have to care whether one is passed or not. Because of this, we do not need to forcibly return a `false` flag since that is confusing to the caller, especially because we actually do save the guest authors which are given in the call! Instead, if the $append flag is false, we should expect that at least one user will be a WP_User. In that case, if none is passed in, then there is a mismatch of the intended authors. Because now, the `wp_posts.post_author` column will have an old `wp_users.ID` which remains set and most likely isn't the intent of the caller.

* fix: attempting DB update only when $new_author is not empty.

Also, returning the actual response from the DB, to make this call even more accurate in terms of what is actually happening at the DB layer.

* fix: need to ensure pure WP_User is processed correctly as post_author.

A pure WP_User (i.e. a WP_User that IS NOT linked to a Guest Author) needs to be handled specially.

* fix: a necessary refactor of the get_coauthor_by function.

This refactor is absolutely necessary in order for all the previous fixes to work as expected. Without this fix, what happens is that when you use `get_coauthor_by` by searching with a Guest Author, any link to a WP_User the Guest Author may have is summarily ignored. Functions like `add_coauthors` expect at least one coauthor to be a valid WP_User so that the `wp_posts.post_author` column can be appropriately updated. The only case where this function is currently returning an expected value is when you search by a WP_User account/field first. When it arrives at `$guest_author = $this->guest_authors->get_guest_author_by( $key, $value, $force );`, `$guest_author === false`. It is then forced to move to the switch statement to find a user via their WP_User data.

With this refactor, `get_coauthor_by` will now check if the `linked_account` attribute is set. If so, it will then attempt to find the corresponding WP_User for the Guest Author. Crucially, it still gives priority to returning a Guest Author. When a Guest Author is not found, it will then attempt to search for a WP_User. If found, it will also search to see if a linked Guest Author account exists. If it does, it will return that Guest Author object instead, without losing the fact that this account also has a WP_User associated with it.

* fix: renaming user_login's for new authors introduced for new tests.

These user_login's were causing other tests to fail because you cannot create another user with the same user_login.

* fix: removing use of assertObjectHasProperty

Older versions of PHPUnit do not have this function available. Updating to workaround: `assertTrue( property_exists( $obj, 'prop' ) )`

* fix: typo in function call

* fix: using strict comparison instead of function call `is_null`

* fix: using more descriptive assertion for array validation.

* fix: using `create_and_get` post factory func, to avoid query call.

* fix: removing use of newly introduced is_wp_user property.

Relying instead on wp_user property which has already been used before.

* fix: PHPCS fixes and added commentary/descriptions to docblocks.

* fix: some small quick fixes for formatting and documentation

* fix: removing repetitive test.

* add: new assertion func that determines if an obj is not a WP_User class

* add: new assertion to help determine if a Post has the correct Authors

* add: new test solely for CoAuthorPlus::get_coauthor_by().

By fully testing CoAuthorPlus::get_coauthor_by(), we can remove some repetitive assertions that don't directly relate to what's being tested.

* fix: was passing string values when I should've been passing Author objs

* fix: using a data provider for very similar tests

---------

Co-authored-by: Gary Jones <[email protected]>

* bumping version to 3.6.2 (#1064)

* bumping version to 3.6.2

* Update CHANGELOG.md

Co-authored-by: Gary Jones <[email protected]>

* add changelog link

---------

Co-authored-by: Gary Jones <[email protected]>

* fix: prevent the backfill from running forever. (#1065)

* fix: prevent the backfill from running forever.

There's an edge case where an author that no longer exists can still be assigned to a post. This throws the backfill script into an infinite loop, because the respective author-term is never found/created, and so the underlying problem of missing author-term records is never resolved. The infinite loop is started when at the end of the while loop, the script asks for "remaining posts which need author terms" and so it returns the same rows over and over.

This fix addresses this in 2 ways:
1. If an author is not found, we look for the most prolific author on the site and assign the posts to them. If there is no prolific author, one is created. And if one can't be created, an exception is thrown so that the script can't proceed.
2. Checks have been added so that the script can't go beyond what should be the maximum number of rows needing to be addressed.

* fix: obtaining the first available admin user account instead.

* fix: updating output to reflect that the ID belongs to an Admin account.

* fix: this function should be private

* fix: switching tactic to skipping posts that have missing post_author.

This approach is more faithful to what the current condition on the site would be anyway. If the post author doesn't exist on the site, you wouldn't be able to see the particular post in question in an author archive anyway. Skipping the post instead of reassigning it to the first available admin user is a cleaner solution.
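
A rough sketch of the two safeguards described in this fix (skipping posts whose author no longer exists, and never handling more rows than were missing at the start) might look like this. The function name, arguments, and in-memory data structures are hypothetical and do not reflect the plugin's actual implementation.

```python
def backfill_with_guards(posts_missing_terms, existing_author_ids, batch_size=2):
    """Backfill author terms without looping forever on orphaned posts.

    posts_missing_terms: dict of post ID -> post_author for posts lacking a term.
    existing_author_ids: set of author IDs that still exist on the site.
    Returns (post IDs that received a term, post IDs that were skipped).
    """
    remaining = dict(posts_missing_terms)
    max_rows = len(remaining)  # upper bound on rows this run should touch
    skipped = set()
    created = []
    handled = 0

    while handled < max_rows:
        # Ask again for posts that still need terms, excluding skipped ones.
        batch = [pid for pid in sorted(remaining) if pid not in skipped][:batch_size]
        if not batch:
            break
        for post_id in batch:
            handled += 1
            if remaining[post_id] not in existing_author_ids:
                # Author is gone: skip instead of retrying on every pass.
                skipped.add(post_id)
                continue
            created.append(post_id)
            del remaining[post_id]

    return created, sorted(skipped)


if __name__ == "__main__":
    posts = {1: 10, 2: 99, 3: 11, 4: 98, 5: 10}
    print(backfill_with_guards(posts, existing_author_ids={10, 11}))
```

Without the skip set, posts 2 and 4 in the demo would be returned by every "remaining posts" query, which is the infinite loop the commit message describes; the `max_rows` cap is a second line of defense in case some other row can never be resolved.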

* fix: removed unused references from a past commit

* fix: appeasing PHPCS

* Bump versions to 3.6.3 (#1070)

---------

Co-authored-by: Alec Geatches <[email protected]>
Co-authored-by: Gary Jones <[email protected]>
Co-authored-by: claudiulodro <[email protected]>
Co-authored-by: Yoli Hodde <[email protected]>
Co-authored-by: Eddie Carrasco <[email protected]>