add tool to check for out-of-sync data between DB and production-definitions blob store #84

Open
wants to merge 7 commits into
base: main
from

Conversation

elrayle
Collaborator

@elrayle elrayle commented Jul 1, 2024

Co-authored-by: @ajhenry

Description

The new tool, analyze_data_synchronization, checks for out-of-sync data between the database and the production-definitions blob store. It can be run across multiple months, one month at a time, or for a custom date range, and it writes a JSON file containing summary statistics and the invalid data. The tool is configured through a .env file that sets the start and end dates, the maximum number of documents to process, and the output file name.

See README for examples.

Minor fix

The README includes a fix to rename production-snapshots to changes-notifications. The switch to changes-notifications has been in production use since January 2024.

@elrayle elrayle marked this pull request as draft July 2, 2024 13:09
@elrayle elrayle marked this pull request as ready for review July 2, 2024 13:51
@ajhenry

ajhenry commented Jul 2, 2024

Nice this looks great! Thanks a ton for cleaning this script up

Collaborator

@qtomlinson qtomlinson left a comment

Thanks for putting this together!

# * MONGO_CONNECTION_STRING: The connection string to the MongoDB database
# * START_MONTH: The first month to include in the query
# * END_MONTH: The last month to include in the query
# * OUTPUT_FILE: The file to write the output to
Collaborator

BASE_AZURE_BLOB_URL seems to be mandatory as well
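
One way the script could fail fast when a setting is missing is sketched below. The variable names come from the header comment and the review note above; the helper `load_settings`, the choice to treat every listed variable as required, and the error wording are assumptions for illustration, not code from this PR.

```python
import os

# Names taken from the script header plus the review note above.
# Treating all of them as required is an assumption.
REQUIRED_VARS = (
    "MONGO_CONNECTION_STRING",
    "BASE_AZURE_BLOB_URL",
    "START_MONTH",
    "END_MONTH",
    "OUTPUT_FILE",
)


def load_settings(env=None):
    """Return the required settings, or raise an error naming every missing one."""
    env = os.environ if env is None else env
    # Empty strings count as missing, which is how unset .env entries often appear.
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise ValueError("Missing required settings: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Reporting all missing names in one error saves a round trip per variable when someone sets up the .env file for the first time.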

    print(f"Collection '{COLLECTION_NAME}' not found.")
else:
    print(f"Using collection: '{COLLECTION_NAME}'.")

Collaborator

nit: logging blob container name as well?

elrayle added 4 commits July 12, 2024 07:31
This performs a check one month at a time, hardcoded for all months in 2024. The output file is hardcoded to "2024-invalid_data.json".
* allows for a check of a single week
* continues to support processing a month at a time
* expands support for controlling function through .env file
* provides example .env file
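
To illustrate the month-at-a-time processing described in these commits, here is a minimal sketch of splitting a range into monthly windows. It assumes START_MONTH and END_MONTH are given as "YYYY-MM" strings, which the PR does not state, and `month_windows` is a hypothetical helper name.

```python
from datetime import date


def month_windows(start_month, end_month):
    """Yield (first_day, first_day_of_next_month) for each month, end inclusive.

    Both arguments are assumed to be "YYYY-MM" strings.
    """
    year, month = (int(part) for part in start_month.split("-"))
    end_year, end_mon = (int(part) for part in end_month.split("-"))
    while (year, month) <= (end_year, end_mon):
        # Roll over to January of the next year after December.
        next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
        yield date(year, month, 1), date(next_year, next_month, 1)
        year, month = next_year, next_month
```

Each window is half-open (start inclusive, end exclusive), so consecutive windows never process the same document twice.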
@elrayle elrayle force-pushed the elr/sync-check branch 7 times, most recently from 482e8a2 to 4165126 Compare July 24, 2024 12:39
Batch processing:
* updates just the declared license in the DB documents using `collection.bulk_write()`
* updates definitions using service API `POST /definitions?force=true`

_NOTE: Updating the DB makes the fix of the declared license immediately available.  When the `POST /definitions` request completes, the full DB document will be updated to be in sync with the blob definition._
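
The declared-license fix described above might be prepared along these lines. This is a sketch under assumptions: the field path `licensed.declared`, the use of `_id` as the coordinates key, and the helper names are illustrative and not taken from this PR.

```python
def declared_license(definition):
    """Read licensed.declared, tolerating missing keys."""
    return (definition.get("licensed") or {}).get("declared")


def license_fixes(db_docs, blob_definitions):
    """Return (filter, update) pairs for DB docs whose declared license differs from the blob.

    db_docs: iterable of DB documents keyed by "_id" (assumed to hold the coordinates).
    blob_definitions: dict mapping the same coordinates to the blob definition.
    """
    fixes = []
    for doc in db_docs:
        blob = blob_definitions.get(doc["_id"])
        if blob is None:
            continue  # nothing to compare against; a different kind of problem
        wanted = declared_license(blob)
        if declared_license(doc) != wanted:
            # Only the declared license is touched, matching the commit note.
            fixes.append(({"_id": doc["_id"]}, {"$set": {"licensed.declared": wanted}}))
    return fixes
```

Each pair can be wrapped as `pymongo.UpdateOne(filter, update)` and the resulting list passed to `collection.bulk_write()`.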

Additional changes:
* moves global variable definitions based on .env to the initialize() function
* adds DRYRUN flag to check what would run and how many records would be evaluated
* adds estimated time to complete
* adds script and function level documentation
* includes timestamps to make it easier to estimate how long it will take to complete a run
* generates filename based on date range and offset to avoid overwriting output files
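
Two of the items above, the generated filename and the estimated time to complete, can be sketched as follows. The filename pattern and both function names are hypothetical; the PR does not show the actual format.

```python
def output_filename(start, end, offset=0, suffix="invalid_data"):
    """Build a filename that is unique per date range and offset (pattern is an assumption)."""
    return f"{start}_{end}_offset-{offset}_{suffix}.json"


def estimate_remaining_seconds(processed, total, elapsed_seconds):
    """Linear estimate of the time left, or None before any record is processed."""
    if processed <= 0:
        return None
    seconds_per_record = elapsed_seconds / processed
    return seconds_per_record * (total - processed)
```

For example, `output_filename("2024-01-01", "2024-02-01", 500)` gives `2024-01-01_2024-02-01_offset-500_invalid_data.json`, so rerunning with a different range or offset does not overwrite earlier output.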

_NOTE: Azure only supports fetching one blob at a time. Not able to optimize that part of the code._

_NOTE: Batch size of 500 was selected because that is the max number of coordinates supported in calls to service API `POST /definitions`._
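
A minimal sketch of splitting coordinates into batches of at most 500, as the note above describes. The constant and the helper name `batches` are illustrative, not names from the script.

```python
from itertools import islice

# Limit stated in the commit note for POST /definitions.
MAX_COORDINATES_PER_REQUEST = 500


def batches(items, size=MAX_COORDINATES_PER_REQUEST):
    """Yield consecutive lists of at most `size` items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

Because it works on any iterable, the same helper can be fed a database cursor directly without loading every document into memory first.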
@elrayle elrayle force-pushed the elr/sync-check branch 3 times, most recently from 8259871 to 392f8f6 Compare July 24, 2024 13:48
3 participants