add tool to check for out-of-sync data between DB and production-definitions blob store #84
base: main
Nice, this looks great! Thanks a ton for cleaning this script up.
Thanks for putting this together!
# * MONGO_CONNECTION_STRING: The connection string to the MongoDB database
# * START_MONTH: The first month to include in the query
# * END_MONTH: The last month to include in the query
# * OUTPUT_FILE: The file to write the output to
BASE_AZURE_BLOB_URL seems to be mandatory as well
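One way to surface a missing setting like this early is to validate the environment before any work starts. The following is a minimal sketch, not the script's actual code: the helper name `missing_settings` is hypothetical, and treating all five variables as required is an assumption based on the header comment and the review note above.

```python
import os

# Settings listed in the script header, plus the one flagged in review.
# Treating all of them as required is an assumption.
REQUIRED_SETTINGS = (
    "MONGO_CONNECTION_STRING",
    "START_MONTH",
    "END_MONTH",
    "OUTPUT_FILE",
    "BASE_AZURE_BLOB_URL",
)


def missing_settings(env=None, required=REQUIRED_SETTINGS):
    """Return the names of required settings that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in required if not str(env.get(name, "")).strip()]


# Example with an explicit mapping instead of the real environment.
example_env = {
    "MONGO_CONNECTION_STRING": "mongodb://localhost:27017",
    "START_MONTH": "2024-01",
}
print(missing_settings(example_env))
```

Calling `missing_settings()` with no arguments checks `os.environ`, so the script could exit with a clear message before connecting to the database.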
    print(f"Collection '{COLLECTION_NAME}' not found.")
else:
    print(f"Using collection: '{COLLECTION_NAME}'.")
nit: logging blob container name as well?
This performs a check one month at a time, hardcoded to all months in 2024. The output file is hardcoded to "2024-invalid_data.json".
* allows for a check of a single week
* continues to support processing a month at a time
* expands support for controlling function through .env file
* provides example .env file
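The week and month modes described in these commits can be pictured as date windows handed to the query. This is a rough sketch under stated assumptions: the function names are hypothetical, and the `YYYY-MM` format for `START_MONTH` and `END_MONTH` is assumed rather than taken from the script.

```python
from datetime import date, timedelta


def month_windows(start_month, end_month):
    """Yield (first_day, first_day_of_next_month) for each month, inclusive.

    Both arguments are assumed to be "YYYY-MM" strings.
    """
    year, month = (int(part) for part in start_month.split("-"))
    end_year, end_mon = (int(part) for part in end_month.split("-"))
    while (year, month) <= (end_year, end_mon):
        # Roll over to January of the next year after December.
        next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
        yield date(year, month, 1), date(next_year, next_month, 1)
        year, month = next_year, next_month


def week_window(start_day):
    """Return a half-open seven-day window beginning at start_day."""
    return start_day, start_day + timedelta(days=7)


for window_start, window_end in month_windows("2024-11", "2025-01"):
    print(window_start.isoformat(), "to", window_end.isoformat())
```

Half-open windows (start inclusive, end exclusive) avoid checking a document twice when consecutive months are processed.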
482e8a2 to 4165126
Batch processing:
* updates just the declared license in the DB documents using `collection.bulk_write()`
* updates definitions using service API `POST /definitions?force=true`

_NOTE: Updating the DB makes the fix of the declared license immediately available. When the `POST /definitions` request completes, the full DB document will be updated to be in sync with the blob definition._

Additional changes:
* moves global variable definitions based on .env to the initialize() function
* adds DRYRUN flag to check what would run and how many records would be evaluated
* adds estimated time to complete
* adds script and function level documentation
* includes timestamps to make it easier to estimate how long it will take to complete a run
* generates filename based on date range and offset to avoid overwriting output files

_NOTE: Azure only supports fetching one blob at a time, so that part of the code could not be optimized._

_NOTE: Batch size of 500 was selected because that is the max number of coordinates supported in calls to service API `POST /definitions`._
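The batching and time estimate described in this commit could look roughly like the sketch below. It is illustrative only: `batches` and `estimate_remaining_seconds` are hypothetical names, the estimate is a simple linear extrapolation, and the actual calls to `collection.bulk_write()` and `POST /definitions` are left out.

```python
from itertools import islice

# Max coordinates per POST /definitions call, per the note above.
BATCH_SIZE = 500


def batches(items, size=BATCH_SIZE):
    """Yield consecutive lists of at most `size` items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def estimate_remaining_seconds(elapsed_seconds, done, total):
    """Linear estimate of the time left, based on the average pace so far."""
    if done <= 0:
        return None  # nothing processed yet, so no pace to extrapolate from
    return elapsed_seconds * (total - done) / done


# Hypothetical coordinates, for illustration only.
coordinates = [f"npm/npmjs/-/example-{i}/1.0.0" for i in range(1201)]
for batch in batches(coordinates):
    # Each batch would feed one bulk DB update and one service API request.
    print(len(batch))
```

Because `batches` consumes an iterator lazily, it also works with a database cursor without loading every document into memory first.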
8259871 to 392f8f6
Co-authored-by: @ajhenry
Description
The new tool, analyze_data_synchronization, checks for out-of-sync data between the database and the production-definitions blob store. The tool can be run for multiple months, one month at a time, or for a custom date range. It outputs a JSON file with summary stats and the invalid data. The tool is controlled through a .env file, which can be customized to specify the start and end dates, the maximum number of documents to process, and the output file name.
See README for examples.
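As a rough illustration of the kind of check described above, the sketch below compares the declared license of DB documents against blob definitions and builds a report with summary stats and the invalid data. Everything here is an assumption made for illustration: the function names, the `licensed.declared` field path, the report keys, and the sample coordinates are not taken from the tool.

```python
import json


def declared_license(definition):
    """Return licensed.declared, or None if missing (field path is assumed)."""
    return ((definition or {}).get("licensed") or {}).get("declared")


def find_out_of_sync(db_docs, blob_docs):
    """Compare DB documents with blob definitions, both keyed by coordinates."""
    invalid = {}
    for coordinates, db_doc in db_docs.items():
        db_value = declared_license(db_doc)
        blob_value = declared_license(blob_docs.get(coordinates))
        if db_value != blob_value:
            invalid[coordinates] = {"db": db_value, "blob": blob_value}
    return {
        "summary": {"checked": len(db_docs), "out_of_sync": len(invalid)},
        "invalid_data": invalid,
    }


# Hypothetical sample data.
db_docs = {
    "npm/npmjs/-/alpha/1.0.0": {"licensed": {"declared": "MIT"}},
    "npm/npmjs/-/beta/2.0.0": {"licensed": {"declared": "NOASSERTION"}},
}
blob_docs = {
    "npm/npmjs/-/alpha/1.0.0": {"licensed": {"declared": "MIT"}},
    "npm/npmjs/-/beta/2.0.0": {"licensed": {"declared": "Apache-2.0"}},
}
report = find_out_of_sync(db_docs, blob_docs)
print(json.dumps(report, indent=2))
```

In the real tool the two mappings would come from MongoDB and from blobs fetched one at a time; here they are plain dictionaries so the comparison logic can be read on its own.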
Minor fix
The README includes a fix to rename `production-snapshots` to `changes-notifications`. The switch to `changes-notifications` has been in production use since January 2024.