stsewd
Member

@stsewd stsewd commented Aug 21, 2025

Common changes across providers:

  • A new method that only updates a remote repository from the data returned by the API was added. This makes the method easy to reuse when support for updating a single repository is added (I will do this in another PR on top of this one).
  • We were iterating over repositories twice for GitHub and Bitbucket: once by listing all the repositories the user has access to, and again by iterating over all the repositories from the organizations the user has access to. The second pass was also incorrectly adding a relationship between the user and repositories the user didn't actually have access to, since that endpoint returns all public repositories, even ones the user doesn't belong to.
  • sync_organizations now only creates the user-organization relation.
  • Since several repositories can belong to the same organization, we only need to update each organization once instead of over and over, so a cache was introduced for it. This is also why the creation of the user-organization relationship was moved outside create_organization.
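
The organization cache described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: only the `_organizations_cache` attribute name comes from the diff, while `OrganizationSyncSketch`, `_update_organization`, `get_organization`, and the sample data are hypothetical.

```python
class OrganizationSyncSketch:
    """Hypothetical stand-in for a provider service; not the real class."""

    def __init__(self):
        # Maps the provider's organization ID to the already-updated record.
        self._organizations_cache = {}
        self.update_calls = 0

    def _update_organization(self, org_data):
        # Placeholder for the expensive step (for example, a database write).
        self.update_calls += 1
        return {"remote_id": org_data["id"], "name": org_data["name"]}

    def get_organization(self, org_data):
        # Update each organization at most once per sync run.
        remote_id = org_data["id"]
        if remote_id not in self._organizations_cache:
            self._organizations_cache[remote_id] = self._update_organization(org_data)
        return self._organizations_cache[remote_id]


service = OrganizationSyncSketch()
# Made-up API payloads: two repositories share the same organization.
repositories = [
    {"name": "docs", "organization": {"id": 1, "name": "acme"}},
    {"name": "api", "organization": {"id": 1, "name": "acme"}},
    {"name": "blog", "organization": {"id": 2, "name": "initech"}},
]
organizations = [service.get_organization(repo["organization"]) for repo in repositories]
```

With three repositories spread over two organizations, the placeholder update step runs twice rather than three times. The saving grows with the number of repositories per organization.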

Specific changes:

  • For Bitbucket, we weren't updating a repository if the workspace it belongs to had changed; we now update those. This is the same problem we had with GitLab ("GitLab: handle when a repository is moved to another group", #12233).
  • For Bitbucket, we were creating the repo-user relationship and then updating that relationship for the repositories where the user is an admin. This wasn't resetting the admin status to false for repositories where the user no longer had that permission. We now always default to admin=False, and then update the repositories where the user is an admin in the other call.
  • Bitbucket doesn't have organizations; it has projects and workspaces, and we use workspaces as our organizations. Since every repository in Bitbucket is linked to a workspace (a user can be a workspace), we always create an organization for each repository.
  • Since all other providers fetch all repositories in the sync_repositories method, I changed GitLab to do that as well. To do so, I reverted to using the previous /projects endpoint ("GitLab: handle when a repository is moved to another group", #12233).
  • GitLab sometimes returns avatars with a relative URL (/~/uploads...) instead of including the domain, so we now normalize all URLs from GitLab.
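
The URL normalization mentioned in the last item could look roughly like this. It is only a sketch: the function name `make_absolute_url`, the `GITLAB_BASE_URL` constant, and the sample paths are assumptions for illustration, and the actual `_make_absolute_url` method in the PR may be implemented differently.

```python
from urllib.parse import urljoin

# Assumed base URL for illustration; a self-hosted instance would differ.
GITLAB_BASE_URL = "https://gitlab.com"


def make_absolute_url(url, base_url=GITLAB_BASE_URL):
    """Return an absolute URL, resolving relative paths against base_url."""
    if not url:
        # Keep missing avatars (None or "") untouched.
        return url
    # urljoin leaves already-absolute URLs unchanged.
    return urljoin(base_url, url)


# Hypothetical sample values, not real API responses.
relative_avatar = make_absolute_url("/uploads/avatar.png")
absolute_avatar = make_absolute_url("https://example.com/avatar.png")
```

Using `urljoin` means the same helper can be applied to every URL from the provider, since values that already include a domain pass through unchanged.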

@stsewd stsewd marked this pull request as ready for review August 21, 2025 20:39
@stsewd stsewd requested a review from a team as a code owner August 21, 2025 20:39
@stsewd stsewd requested a review from ericholscher August 21, 2025 20:39
-        for remote_repository_relation in admin_repo_relations:
-            remote_repository_relation.admin = True
-            remote_repository_relation.save()
+        ).update(admin=True)
Member Author

Small optimization here, since we don't need to save each object individually.

Member

Also nice to avoid the save() logic if we don't need it.
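
The Bitbucket admin handling this thread touches on (default every relation to admin=False, then flag the admin repositories in one step) can be modeled without the ORM. The sketch below uses plain dictionaries in place of database rows; `sync_admin_flags` and the repository IDs are hypothetical. In the PR itself, the second pass is the single `.update(admin=True)` queryset call shown in the diff.

```python
def sync_admin_flags(relations, all_repo_ids, admin_repo_ids):
    """Reset and re-apply admin flags.

    relations maps a repository ID to a dict like {"admin": bool} and is
    mutated in place, standing in for the user-repository relation rows.
    """
    # First pass: every repository the user can see starts as non-admin,
    # which clears stale admin flags left over from earlier syncs.
    for repo_id in all_repo_ids:
        relations.setdefault(repo_id, {})["admin"] = False
    # Second pass: flag only the repositories from the admin-only listing.
    for repo_id in admin_repo_ids:
        if repo_id in relations:
            relations[repo_id]["admin"] = True
    return relations


# "repo-a" was admin in a previous sync, but the permission has been revoked.
relations = {"repo-a": {"admin": True}, "repo-b": {"admin": False}}
synced = sync_admin_flags(
    relations,
    all_repo_ids=["repo-a", "repo-b", "repo-c"],
    admin_repo_ids=["repo-b"],
)
```

Resetting first and flagging second is what lets a revoked permission show up as admin=False after the next sync.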

@stsewd stsewd requested a review from Copilot August 21, 2025 23:50
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR refactors the Git service repository synchronization process across GitHub, Bitbucket, and GitLab to eliminate redundant API calls and fix organization-repository relationship issues. The main purpose is to consolidate repository syncing into a single method and separate it from organization syncing, while fixing incorrect permission handling and organization relationships.

Key changes:

  • Separation of repository and organization syncing logic to eliminate duplicate API calls
  • Introduction of organization caching to avoid repeated database queries for the same organization
  • Fix for repository-organization relationship updates when repositories move between organizations

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| readthedocs/oauth/services/github.py | Adds organization caching, separates repository/organization sync, removes duplicate repository fetching from org sync |
| readthedocs/oauth/services/bitbucket.py | Adds organization caching, fixes admin permission handling, ensures all repos have workspace organizations |
| readthedocs/oauth/services/gitlab.py | Adds organization caching, switches to unified /projects endpoint, fixes relative URL handling for avatars |
| readthedocs/rtd_tests/tests/test_oauth_sync.py | Updates test expectations to reflect that org sync no longer creates repositories |
| readthedocs/rtd_tests/tests/test_oauth.py | Updates tests to reflect new repository creation patterns and organization handling |

Member

@ericholscher ericholscher left a comment

This seems like a solid improvement. My questions are mostly around if we should be considering adding some of this logic to the base class.

Overall I don't love having to support so many VCS providers, and wonder if we should be considering trimming down to just GH, but I think that's a longer term discussion given the usage of the other platforms we still have.

In terms of making the code here faster -- it seems like this is a small speedup -- maybe 50-60% with the caching? Do we think this is enough to re-enable the task?

@@ -36,6 +36,10 @@ class GitHubService(UserService):
     url_pattern = re.compile(r"github\.com")
     supports_build_status = True

+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+        self._organizations_cache = {}
Member

If we're setting this on all of the classes, should it just be in the base model?


repo.save()

def _make_absolute_url(self, url):
Member

Should we do this for all the providers, just to be defensive?

Member Author

I only observed this on GitLab, mainly from group objects.

@stsewd stsewd merged commit cc69f23 into main Aug 25, 2025
7 checks passed
@stsewd stsewd deleted the sync-repos-improvements branch August 25, 2025 16:31
Member Author

stsewd commented Aug 25, 2025

> In terms of making the code here faster -- it seems like this is a small speedup -- maybe 50-60% with the caching? Do we think this is enough to re-enable the task?

Yeah, I do feel this is going to be a huge perf improvement for users that belong to large orgs. But I'm not sure about the task... last time I checked, we sync the repositories of 2K users per day. I also want to check whether this task automatically retries on timeout.
