
Dockerize this so it can run as a scheduled AWS ECS #6

Open · 5 tasks
Mr0grog opened this issue Nov 9, 2020 · 1 comment
Mr0grog commented Nov 9, 2020

This script has pretty big memory requirements, so we avoid running it as a Kubernetes job (doing so would mean either holding expensive space open on the cluster or potentially disrupting the cluster while it runs; both are bad).

It’s also not well integrated with the readability server (a separate Node.js process).

To deal with those issues, we currently run this job by hand, which is not great for lots of reasons. Instead, we should solve these two problems together: run it on ECS (this issue originally proposed AWS Batch, which is designed exactly for running large, intermittent, automated jobs like this; see the update below), which requires containerizing it (and containerizing also simplifies running the parallel Node.js and Python processes).

Breakdown of Steps

Let’s make each of these a separate PR/task/whatever, and not try to bite this whole thing off in one chunk.

  • Create a Dockerfile that encapsulates the generate_task_sheets.py script in this repo and the readability-server from web-monitoring-changed-terms-analysis.
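
    A sketch of what that Dockerfile might look like, assuming a Python base image with Node.js installed alongside; the versions, paths, and entrypoint here are placeholders, not decisions:

    # Sketch only: a multi-stage build or a separate process supervisor
    # may turn out to be better than a single Python + Node image.
    FROM python:3.9-slim

    # Node.js for the readability server.
    RUN apt-get update \
        && apt-get install -y --no-install-recommends nodejs npm \
        && rm -rf /var/lib/apt/lists/*

    WORKDIR /app

    # Python side: this repo's script and its dependencies.
    COPY requirements.txt ./
    RUN pip install --no-cache-dir -r requirements.txt
    COPY generate_task_sheets.py ./

    # Node side: the readability-server from web-monitoring-changed-terms-analysis.
    COPY readability-server/ ./readability-server/
    RUN cd readability-server && npm ci

    # Hypothetical entrypoint script that starts the readability server in the
    # background, waits for it to come up, then runs generate_task_sheets.py.
    COPY entrypoint.sh ./
    ENTRYPOINT ["./entrypoint.sh"]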

  • Add a CI job to publish the image to Docker Hub (probably similar to what we do for web-monitoring-processing: https://github.com/edgi-govdata-archiving/web-monitoring-processing/blob/ed122ce241c705601411b9e17d02dca2c7e7537e/.circleci/config.yml#L53-L85).
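
    A sketch of the CircleCI job, modeled loosely on the linked web-monitoring-processing config; the image name and credential variable names are placeholders:

    # Sketch only: image name and env var names are placeholders.
    version: 2.1
    jobs:
      publish_docker:
        docker:
          - image: cimg/base:stable
        steps:
          - checkout
          - setup_remote_docker
          - run:
              name: Build image
              command: docker build -t envirodgi/web-monitoring-task-sheets:latest .
          - run:
              name: Publish to Docker Hub
              command: |
                docker login -u "$DOCKER_USER" -p "$DOCKER_PASS"
                docker push envirodgi/web-monitoring-task-sheets:latest
    workflows:
      build-and-publish:
        jobs:
          - publish_docker:
              filters:
                branches:
                  only: main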

  • Add code to post results directly to Google Drive instead of to disk. (I currently do this by hand.)
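
    A sketch of the upload step, assuming a service account and the google-api-python-client package; the folder ID, credentials file, and CSV-to-Sheet conversion are assumptions that would need to match how the sheets are organized today:

    import os
    from google.oauth2 import service_account
    from googleapiclient.discovery import build
    from googleapiclient.http import MediaFileUpload

    SCOPES = ['https://www.googleapis.com/auth/drive.file']

    def upload_sheet(path, folder_id, credentials_file='service_account.json'):
        creds = service_account.Credentials.from_service_account_file(
            credentials_file, scopes=SCOPES)
        drive = build('drive', 'v3', credentials=creds)
        metadata = {
            'name': os.path.basename(path),
            'parents': [folder_id],
            # Ask Drive to convert the upload into a native Google Sheet.
            'mimeType': 'application/vnd.google-apps.spreadsheet',
        }
        media = MediaFileUpload(path, mimetype='text/csv')
        created = drive.files().create(body=metadata, media_body=media,
                                       fields='id,webViewLink').execute()
        return created['webViewLink']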

  • Add code to post a notification that the job is done, with a link to the sheets, either directly to Slack or as an e-mail.
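
    For the Slack route, an incoming webhook is probably the simplest option; a sketch, with the webhook URL supplied via an environment variable (name hypothetical):

    import os
    import requests

    def notify_done(sheet_links):
        # Post a completion message with links to the generated sheets.
        message = 'Task sheets are ready:\n' + '\n'.join(sheet_links)
        response = requests.post(os.environ['SLACK_WEBHOOK_URL'],
                                 json={'text': message})
        response.raise_for_status()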

  • Set it up to run on ECS 1x/week, Wednesday mornings at 16:00 UTC with arguments like:

    # Assuming this runs at Wednesday@16:00, do an analysis:
    #   From the previous Sunday@16:00
    #   To the current Monday@04:00
    python generate_task_sheets.py --after 180 --before 60
    
    # That's equivalent to this, given the script is running on Wednesday 2021-12-15:
    python generate_task_sheets.py --after '2021-12-05T16:00:00.000Z' --before '2021-12-13T04:00:00Z'
    

    This is roughly the timing and the arguments I aim for when running it by hand now. It gives us 2 days to make sure we’ve imported all the data for the time period (imports run at 3:30 am UTC daily), and it analyzes a 7-day period ending early Monday morning (04:00 UTC), with an extra 12-hour overlap into the previous week.
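
    Scheduled ECS tasks are driven by an EventBridge (CloudWatch Events) rule. A sketch of the schedule, with a hypothetical rule name; the ECS task definition would still need to be attached as the rule’s target with put-targets:

    aws events put-rule \
      --name task-sheets-weekly \
      --schedule-expression 'cron(0 16 ? * WED *)'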

Update (2021-12-17): use ECS instead of Batch; add breakdown of steps.


Mr0grog commented Dec 17, 2021

After more lessons learned on this, we should probably be running this as a scheduled task on ECS, not Batch. But the basic requirements (dockerizing) are the same.

@Mr0grog changed the title from “Dockerize this so it can run on AWS Batch” to “Dockerize this so it can run as a scheduled AWS ECS” on Dec 17, 2021
@Mr0grog moved this to Inbox in Web Monitoring on Feb 17, 2025
@Mr0grog moved this from Inbox to Prioritized in Web Monitoring on Feb 17, 2025
Labels: none yet
Projects: Web Monitoring (status: Prioritized)
Development: no branches or pull requests
1 participant