This script has pretty big memory requirements, so we avoid running it as a Kubernetes job (doing so would mean holding open expensive space on the cluster, or else potentially disrupting the cluster when it’s running; both are bad).
It’s also not well integrated with the readability server (a separate Node.js process).
To deal with those, we currently run this job by hand, which is not great for lots of reasons. Instead, we should solve these two problems together: run it on ~~AWS Batch (designed exactly for running large, intermittent, automated jobs like this)~~ AWS ECS as a scheduled task, which requires containerizing it (and containerizing also conveniently simplifies the parallel Node.js and Python issue).
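The container itself could look something like the rough sketch below. This is only a sketch, not the actual layout of this repo: `requirements.txt`, the `readability-server/` directory, and the `run-analysis.sh` wrapper (which would start the Node server in the background and then invoke the Python script) are assumptions, and the base image and Node install method are placeholders.

```dockerfile
# Rough sketch only — file names and layout are assumptions, not this repo's structure.
FROM python:3.9-slim

# Node.js is needed for the readability server that runs alongside the Python script.
RUN apt-get update \
    && apt-get install -y --no-install-recommends nodejs npm \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Python dependencies for generate_task_sheets.py
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt

# Node dependencies for the readability server (assumed to live in readability-server/)
COPY readability-server/package*.json readability-server/
RUN cd readability-server && npm install --production

COPY . .

# Hypothetical wrapper: starts the readability server, then runs
# generate_task_sheets.py with whatever arguments ECS passes in.
ENTRYPOINT ["./run-analysis.sh"]
```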
Breakdown of Steps
Let’s make each of these a separate PR/task/whatever, and not try to bite this whole thing off in one chunk.
- Create a Dockerfile that encapsulates the `generate_task_sheets.py` script in this repo and the `readability-server` from web-monitoring-changed-terms-analysis (see the sketch above).
- Add a CI job to publish the image to Docker Hub (probably similar to what we do for web-monitoring-processing: https://github.com/edgi-govdata-archiving/web-monitoring-processing/blob/ed122ce241c705601411b9e17d02dca2c7e7537e/.circleci/config.yml#L53-L85).
- Add code to post results directly to Google Drive instead of to disk. (I currently do this by hand; a rough sketch follows the schedule example below.)
- Add code to post a notification that the job is done, with a link to the sheets, either directly to Slack or as an e-mail. (Sketch below.)
- Set it up to run on ECS 1x/week, Wednesday mornings at 16:00 UTC, with arguments like:
```sh
# Assuming this runs at Wednesday@16:00, do an analysis:
#   - From the previous Sunday@16:00
#   - To the current Monday@04:00
python generate_task_sheets.py --after 180 --before 60

# That's equivalent to this, given the script is running on Wednesday 2021-12-15:
python generate_task_sheets.py --after '2021-12-05T16:00:00.000Z' --before '2021-12-13T04:00:00Z'
```
This is roughly the timing and the arguments I aim for when running it by hand now. It gives us 2 days to make sure we’ve imported all the data for this time period (imports run at 3:30am UTC daily), and it analyzes a 7-day period ending early Monday morning (04:00 UTC), with an extra 12-hour overlap into the previous week.
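For clarity, here is a small Python sketch of how those relative arguments appear to map onto absolute timestamps. It assumes `--before N` means “N hours before now” and `--after N` means “N hours before the `--before` point” — that reading is inferred from the example above, so double-check it against `generate_task_sheets.py` before relying on it.

```python
from datetime import datetime, timedelta, timezone

def analysis_window(now, after_hours=180, before_hours=60):
    """Hypothetical helper mirroring --after/--before as used above."""
    end = now - timedelta(hours=before_hours)    # --before 60: end 60 hours before "now"
    start = end - timedelta(hours=after_hours)   # --after 180: start 7.5 days before the end
    return start, end

# Wednesday 2021-12-15 @ 16:00 UTC, as in the example above
now = datetime(2021, 12, 15, 16, 0, tzinfo=timezone.utc)
start, end = analysis_window(now)
print(start.isoformat())  # 2021-12-05T16:00:00+00:00 (Sunday @ 16:00)
print(end.isoformat())    # 2021-12-13T04:00:00+00:00 (Monday @ 04:00)
```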
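For the Google Drive step, a minimal sketch using the google-api-python-client library might look like the following. The service-account file, folder ID, output file name, and the choice to convert the upload into a Google Sheet are all illustrative assumptions, not decisions this issue has made.

```python
from google.oauth2 import service_account
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload

# Credentials and the destination folder are placeholders; in practice they would
# come from the task's configuration/secrets.
creds = service_account.Credentials.from_service_account_file(
    "service-account.json",
    scopes=["https://www.googleapis.com/auth/drive.file"],
)
drive = build("drive", "v3", credentials=creds)

# Upload the generated sheet and ask Drive to convert it into a Google Sheet.
metadata = {
    "name": "task-sheet.csv",
    "parents": ["<shared-folder-id>"],
    "mimeType": "application/vnd.google-apps.spreadsheet",
}
media = MediaFileUpload("task-sheet.csv", mimetype="text/csv")
created = (
    drive.files()
    .create(body=metadata, media_body=media, fields="id, webViewLink")
    .execute()
)
print(created["webViewLink"])  # link to include in the notification
```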
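And for the notification step, posting to a Slack incoming webhook is roughly this much code; the webhook URL and message text are placeholders, and an e-mail would work just as well if that turns out to be easier.

```python
import os
import requests

# The webhook URL would be injected via the task's environment/secrets.
webhook_url = os.environ["SLACK_WEBHOOK_URL"]
sheet_link = "https://docs.google.com/spreadsheets/d/<sheet-id>"  # from the Drive upload

response = requests.post(
    webhook_url,
    json={"text": f"This week's task sheets are ready: {sheet_link}"},
    timeout=30,
)
response.raise_for_status()
```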
**Update 2021-12-17:** use ECS instead of Batch; add breakdown of steps.
After more lessons learned on this, we should probably be running as a scheduled task on ECS, not Batch. But the basic requirements (dockerizing) are the same.
Mr0grog changed the title from “Dockerize this so it can run on AWS Batch” to “Dockerize this so it can run as a scheduled AWS ECS” on Dec 17, 2021.