
Dockerize this so it can run as a scheduled AWS ECS #6

Open · 5 tasks
Mr0grog opened this issue Nov 9, 2020 · 1 comment
Mr0grog commented Nov 9, 2020

This script has pretty big memory requirements, so we avoid running it as a Kubernetes job (doing so would mean either holding expensive space open on the cluster or potentially disrupting the cluster while it runs; both are bad).

It’s also not well integrated with the readability server (a separate Node.js process).

To deal with those issues, we currently run this job by hand, which is not great for lots of reasons. Instead, we should solve these two problems together: run it on ECS (this issue originally proposed AWS Batch, which is designed exactly for running large, intermittent, automated jobs like this; see the update below), which requires containerizing it (and containerizing also simplifies running the parallel Node.js and Python processes).

Breakdown of Steps

Let’s make each of these a separate PR/task/whatever, and not try to bite this whole thing off in one chunk.

  • Create a Dockerfile that encapsulates the generate_task_sheets.py script in this repo and the readability-server from web-monitoring-changed-terms-analysis.
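
    A sketch of what that Dockerfile might look like, assuming a Python base image with Node.js installed alongside; the versions, paths, and entrypoint here are placeholders, not decisions:

    # Sketch only: a multi-stage build or a separate process supervisor
    # may turn out to be better than a single Python + Node image.
    FROM python:3.9-slim

    # Node.js for the readability server.
    RUN apt-get update \
        && apt-get install -y --no-install-recommends nodejs npm \
        && rm -rf /var/lib/apt/lists/*

    WORKDIR /app

    # Python side: this repo's script and its dependencies.
    COPY requirements.txt ./
    RUN pip install --no-cache-dir -r requirements.txt
    COPY generate_task_sheets.py ./

    # Node side: the readability-server from web-monitoring-changed-terms-analysis.
    COPY readability-server/ ./readability-server/
    RUN cd readability-server && npm ci

    # Hypothetical entrypoint script that starts the readability server in the
    # background, waits for it to come up, then runs generate_task_sheets.py.
    COPY entrypoint.sh ./
    ENTRYPOINT ["./entrypoint.sh"]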

  • Add a CI job to publish the image to Docker Hub (probably similar to what we do for web-monitoring-processing: https://github.com/edgi-govdata-archiving/web-monitoring-processing/blob/ed122ce241c705601411b9e17d02dca2c7e7537e/.circleci/config.yml#L53-L85).
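
    A sketch of the CircleCI job, modeled loosely on the linked web-monitoring-processing config; the image name and credential variable names are placeholders:

    # Sketch only: image name and env var names are placeholders.
    version: 2.1
    jobs:
      publish_docker:
        docker:
          - image: cimg/base:stable
        steps:
          - checkout
          - setup_remote_docker
          - run:
              name: Build image
              command: docker build -t envirodgi/web-monitoring-task-sheets:latest .
          - run:
              name: Publish to Docker Hub
              command: |
                docker login -u "$DOCKER_USER" -p "$DOCKER_PASS"
                docker push envirodgi/web-monitoring-task-sheets:latest
    workflows:
      build-and-publish:
        jobs:
          - publish_docker:
              filters:
                branches:
                  only: main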

  • Add code to post results directly to Google Drive instead of to disk. (I currently do this by hand.)
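
    A sketch of the upload step, assuming a service account and the google-api-python-client package; the folder ID, credentials file, and CSV-to-Sheet conversion are assumptions that would need to match how the sheets are organized today:

    import os
    from google.oauth2 import service_account
    from googleapiclient.discovery import build
    from googleapiclient.http import MediaFileUpload

    SCOPES = ['https://www.googleapis.com/auth/drive.file']

    def upload_sheet(path, folder_id, credentials_file='service_account.json'):
        creds = service_account.Credentials.from_service_account_file(
            credentials_file, scopes=SCOPES)
        drive = build('drive', 'v3', credentials=creds)
        metadata = {
            'name': os.path.basename(path),
            'parents': [folder_id],
            # Ask Drive to convert the upload into a native Google Sheet.
            'mimeType': 'application/vnd.google-apps.spreadsheet',
        }
        media = MediaFileUpload(path, mimetype='text/csv')
        created = drive.files().create(body=metadata, media_body=media,
                                       fields='id,webViewLink').execute()
        return created['webViewLink']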

  • Add code to post a notification that the job is done, with a link to the sheets, either directly to Slack or as an e-mail.
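
    For the Slack route, an incoming webhook is probably the simplest option; a sketch, with the webhook URL supplied via an environment variable (name hypothetical):

    import os
    import requests

    def notify_done(sheet_links):
        # Post a completion message with links to the generated sheets.
        message = 'Task sheets are ready:\n' + '\n'.join(sheet_links)
        response = requests.post(os.environ['SLACK_WEBHOOK_URL'],
                                 json={'text': message})
        response.raise_for_status()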

  • Set it up to run on ECS 1x/week, Wednesday mornings at 16:00 UTC with arguments like:

    # Assuming this runs at Wednesday@16:00, do an analysis:
    #   From the previous Sunday@16:00
    #   To the current Monday@04:00
    python generate_task_sheets.py --after 180 --before 60
    
    # That's equivalent to this, given the script is running on Wednesday 2021-12-15:
    python generate_task_sheets.py --after '2021-12-05T16:00:00.000Z' --before '2021-12-13T04:00:00Z'
    

    This is roughly the timing and the arguments I aim for when running it by hand now. It gives us 2 days to make sure we’ve imported all the data for the time period (imports run at 3:30 am UTC daily), and it analyzes a 7-day period ending early Monday morning (04:00 UTC), with an extra 12-hour overlap into the previous week.
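
    Scheduled ECS tasks are driven by an EventBridge (CloudWatch Events) rule. A sketch of the schedule, with a hypothetical rule name; the ECS task definition would still need to be attached as the rule’s target with put-targets:

    aws events put-rule \
      --name task-sheets-weekly \
      --schedule-expression 'cron(0 16 ? * WED *)'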

Update (2021-12-17): use ECS instead of Batch; add breakdown of steps.


Mr0grog commented Dec 17, 2021

After more lessons learned on this, we should probably be running this as a scheduled task on ECS, not Batch. But the basic requirements (dockerizing) are the same.

@Mr0grog changed the title from “Dockerize this so it can run on AWS Batch” to “Dockerize this so it can run as a scheduled AWS ECS” on Dec 17, 2021
@Mr0grog moved this to Inbox in Web Monitoring on Feb 17, 2025
@Mr0grog moved this from Inbox to Prioritized in Web Monitoring on Feb 17, 2025
Labels: none yet
Projects: Web Monitoring (status: Prioritized)
Development: no branches or pull requests
1 participant