fetcher/tasks.py: handle 'always' in _feed_update_period_mins & catch KeyErrors,
log exceptions, log unknown period names
dokku-scripts/push.sh: fix VERSION extraction; make more verbose
require staging & prod to be pushed only to mediacloud
scripts/db_archive.py: compress stories on the fly, fix headers, add .csv
scripts/queue_feeds.py: refactor to allow more command line params and
fix command line feeds; move FetchEvent creation & feed update to queue_feeds.
multiply fetches_per_minute before rounding (used to truncate then multiply).
scripts/db_archive.py: use max(RSS_OUTPUT_DAYS, NORMALIZED_TITLE_DAYS)
for story_days default. Display default values in help message.
NEW: dokku-scripts/randomize-feeds.sh: randomize feed.next_fetch_attempt times
NEW: dokku-scripts/clone-db.sh: clone production database & randomize
doc/deployment.md: update
scripts/queue_feeds.py: if qlen==0 but db_queue!=0, clear queued feeds (fix leakage).
fetcher/tasks.py: clear queued on insane feeds (stop leakage).
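The fetches_per_minute rounding fix noted above for scripts/queue_feeds.py can be illustrated with a small sketch. The function names are hypothetical, and the multiplier is assumed here to be the loop period in minutes; the actual code in queue_feeds.py may differ.

```python
def batch_truncate_then_multiply(fetches_per_minute: float, period_mins: int) -> int:
    # Old behaviour (as described in the entry): the fractional part of
    # the rate is dropped before scaling, so the loss grows with the period.
    return int(fetches_per_minute) * period_mins


def batch_multiply_then_round(fetches_per_minute: float, period_mins: int) -> int:
    # New behaviour: scale first, then round once at the end.
    return round(fetches_per_minute * period_mins)


if __name__ == "__main__":
    rate, period = 2.75, 5  # hypothetical values
    print(batch_truncate_then_multiply(rate, period))  # 10
    print(batch_multiply_then_round(rate, period))     # 14
```

With these assumed values, truncating first queues four fewer feeds per batch than the advertised rate requires.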
v0.12.0 2022-11-07
Major raking by Phil Budne
runtime.txt updated to python-3.9.13 (security fixes)
autopep.sh runs autopep8 -a -i on all files (except fetcher/database/versions/*.py)
mypy.sh installs and runs mypy in a virtual env. RUNS CLEANLY!
All scripts take uniform command line arguments for logging, initialization, help and version (in "fetcher.logargparse"):
-h, --help show this help message and exit
--verbose, -v set default logging level to 'DEBUG'
--quiet, -q set default logging level to 'WARNING'
--list-loggers list all logger names and exit
--log-config LOG_CONFIG_FILE
configure logging with .json, .yml, or .ini file
--log-file LOG_FILE log file name (default: main.pid.310509.log)
--log-level {critical,fatal,error,warn,warning,info,debug,notset}, -l {critical,fatal,error,warn,warning,info,debug,notset}
set default logging level to LEVEL
--no-log-file don't log to a file
--logger-level LOGGER:LEVEL, -L LOGGER:LEVEL
set LOGGER (see --list-loggers) verbosity to LEVEL (see --level)
--set VAR=VALUE, -S VAR=VALUE
set config/environment variable
--version, -V show program's version number and exit
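A minimal sketch of how a few of the options above could be wired up with the standard argparse and logging modules. This is not the fetcher.logargparse implementation; build_parser and apply_logging are hypothetical names, and only a subset of the options is covered.

```python
import argparse
import logging

LEVELS = ["critical", "fatal", "error", "warn", "warning", "info", "debug", "notset"]


def build_parser(prog: str) -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog=prog)
    # -v and -q are shorthands that store a fixed level name.
    parser.add_argument("--verbose", "-v", action="store_const", const="debug",
                        dest="log_level", help="set default logging level to 'DEBUG'")
    parser.add_argument("--quiet", "-q", action="store_const", const="warning",
                        dest="log_level", help="set default logging level to 'WARNING'")
    parser.add_argument("--log-level", "-l", choices=LEVELS, dest="log_level",
                        help="set default logging level to LEVEL")
    # May be repeated; each value has the form LOGGER:LEVEL.
    parser.add_argument("--logger-level", "-L", action="append", default=[],
                        dest="logger_levels", metavar="LOGGER:LEVEL",
                        help="set LOGGER verbosity to LEVEL")
    # Several options share dest="log_level", so set the default once here
    # (the "info" default is an assumption of this sketch).
    parser.set_defaults(log_level="info")
    return parser


def apply_logging(args: argparse.Namespace) -> None:
    # logging accepts upper-case level names such as "DEBUG" or "WARN".
    logging.getLogger().setLevel(args.log_level.upper())
    for spec in args.logger_levels:
        name, _, level = spec.partition(":")
        logging.getLogger(name).setLevel(level.upper())


if __name__ == "__main__":
    args = build_parser("demo").parse_args(["-v", "-L", "fetcher.tasks:warning"])
    apply_logging(args)
    print(args.log_level, args.logger_levels)
```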
fetcher.queue abstraction
All queue access is abstracted to fetcher.queue, using "rq" for the work
queue (only redis needed, allows length monitoring). Saving of
"result" data (i.e., the celery backend) is disabled, since we only queue
jobs "blind" and never check for function results returned (although
queue_feeds in --loop mode could poll for results).
All database datetimes stored without timezones.
"fetcher" module (fetcher/__init__.py) stripped to bare minimum
(version string and fetching a few environment variables)
All config variables in fetcher.config "conf" object
provides mechanisms for specifying optional, boolean, integer params.
Script startup logging
All script startup logging includes script name and Dokku deployed git hash, followed by ONLY logging the configuration that is referenced.
All scripts log to BASE/storage/logs/APP.DYNO.log
Files are turned over at midnight (to filename.log.YYYY-MM-DD), seven files are kept.
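The rotation policy described above (turn over at midnight, keep seven files) matches what the standard library's TimedRotatingFileHandler provides. A minimal sketch follows, assuming that handler is used; make_log_handler and the log format are hypothetical and not taken from the project.

```python
import logging
import os
from logging.handlers import TimedRotatingFileHandler


def make_log_handler(base_dir: str, app: str, dyno: str) -> TimedRotatingFileHandler:
    # Builds BASE/storage/logs/APP.DYNO.log, creating the directory if needed.
    log_dir = os.path.join(base_dir, "storage", "logs")
    os.makedirs(log_dir, exist_ok=True)
    path = os.path.join(log_dir, f"{app}.{dyno}.log")
    # when="midnight" rolls the file over at midnight; rotated files get a
    # YYYY-MM-DD suffix, and backupCount=7 keeps the seven most recent.
    handler = TimedRotatingFileHandler(path, when="midnight", backupCount=7)
    handler.setFormatter(
        logging.Formatter("%(asctime)s | %(levelname)s | %(name)s | %(message)s")
    )
    return handler
```

A script would attach it with logging.getLogger().addHandler(make_log_handler(base, app, dyno)).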
SQLAlchemy "Session" factory moved to "fetcher.database"
so db params only logged if db access used/needed
All Proctab entries invoke existing ./run-....sh scripts
Only one place to change how a script is invoked.
"fetcher" process (scripts/queue_feeds.py) runs persistently
(no longer invoked by crontab)
[enabled by --loop PERIOD in Proctab]
and: reports statistics (queue length, database counts, etc)
queues ready feeds every PERIOD minutes.
queues only the number of feeds necessary
to cover a day's fetch attempts, divided into equal-sized
batches (based on the advertised update rates of active, enabled feeds, and on config)
Allows any number of feed ids on command line.
Operates as before (queues MAX_FEEDS feeds) if invoked without feed ids or --loop.
Clears queue and exits given --clear
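The batch sizing described above for --loop mode can be sketched as follows. This is an illustration under stated assumptions, not the code in scripts/queue_feeds.py: feeds_per_batch is a hypothetical name, and feeds without an advertised period are assumed to fall back to a twice-a-day default (720 minutes).

```python
import math
from typing import Iterable, Optional

MINUTES_PER_DAY = 24 * 60


def feeds_per_batch(update_minutes: Iterable[Optional[int]],
                    loop_period_mins: int,
                    default_update_mins: int = 720) -> int:
    """Number of feeds to queue each PERIOD so a day's fetches are spread evenly."""
    # Each active, enabled feed needs MINUTES_PER_DAY / period fetches per day;
    # a missing (None) or zero period falls back to the assumed default.
    fetches_per_day = sum(
        MINUTES_PER_DAY / (mins or default_update_mins) for mins in update_minutes
    )
    batches_per_day = MINUTES_PER_DAY / loop_period_mins
    # Round up so the day's total is always covered.
    return math.ceil(fetches_per_day / batches_per_day)


if __name__ == "__main__":
    # 1000 feeds at the default period, queued every 5 minutes:
    # 2000 fetches/day over 288 batches/day.
    print(feeds_per_batch([None] * 1000, loop_period_mins=5))  # 7
```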
Queue "worker" process started by scripts/worker.py
takes common logging arguments, stats connection init
runs a single queue worker (need to use dokku ps:scale worker=8).
workers set process title when active, visible by ps, top:
Saved data from HTTP response Last-Modified: header
next_fetch_attempt
Next time to attempt to fetch the feed
queued
TRUE if the feed is currently in the work queue
system_enabled
Set to FALSE by fetcher after excess failures
update_minutes
Update period advertised by feed
http_304
HTTP 304 (Not Modified) response seen from server
system_status
Human readable result of last fetch attempt
Also: last_fetch_failures is now a float, incremented by 0.5
for "soft" errors that might resolve given some (more) time.
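A minimal sketch of the fractional failure counting described above. The names and the threshold are hypothetical; the source states only that last_fetch_failures is a float incremented by 0.5 for "soft" errors and that system_enabled is set to FALSE after excess failures. A full increment of 1.0 for other errors is an assumption.

```python
SOFT_FAILURE_INCREMENT = 0.5
HARD_FAILURE_INCREMENT = 1.0  # assumption: non-soft errors count fully
MAX_FAILURES = 4.0            # hypothetical threshold, not from the source


def record_failure(last_fetch_failures: float, soft: bool) -> float:
    # Soft errors (ones that might resolve given more time) count half.
    step = SOFT_FAILURE_INCREMENT if soft else HARD_FAILURE_INCREMENT
    return last_fetch_failures + step


def should_disable(last_fetch_failures: float) -> bool:
    # Stands in for the fetcher setting system_enabled to FALSE.
    return last_fetch_failures >= MAX_FAILURES


if __name__ == "__main__":
    failures = 0.0
    for soft in (True, True, False):
        failures = record_failure(failures, soft)
    print(failures, should_disable(failures))  # 2.0 False
```

Under these assumptions a feed tolerates twice as many consecutive soft errors as hard ones before being disabled.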
Archiver process
Run from crontab: archives fetch_event and stories rows based on configuration settings.
Reports statistics via dokku-graphite plugin, displayed by grafana.
v0.11.12 2022-08-22
Handle some more feed and url parsing errors. Update feed title after fetch. Switch database to merged feeds.
v0.11.11 2022-08-12
Integrate non-news-domain skiplist from mcmetadata library.
v0.11.10 2022-08-04
Increase default fetch frequency to twice a day.
v0.11.9 2022-08-02
Pull in more aggressive URL query param removal for URL normalization.
v0.11.8 2022-08-02
Disable extra verbose debugging. Also update some requirements.
v0.11.7 2022-08-02
Fix requirements bug by forcing a minimum version of mediacloud-metadata library.
v0.11.6 2022-07-31
Skip homepage-like URLs.
v0.11.5 2022-07-27
Safer normalized title/url queries.
v0.11.4 2022-07-27
Refactored database code to support testing. Also handling failure counting more robustly now.
v0.11.3 2022-07-27
Properly save and double-check against normalized URLs for uniqueness.
v0.11.2 2022-07-27
Better testing of RSS generation.
v0.11.1 2022-07-27
Better handling of missing dates in output RSS files.
v0.11.0 2022-07-27
Write out our own feed so we can control error handling and output fields more closely. Also fix a small URL validity
check bug.
v0.10.5 2022-07-25
Fix bug in function call
v0.10.4 2022-07-25
Requirements bump.
v0.10.3 2022-07-19
Don't allow NULL chars in story titles.
v0.10.2 2022-07-19
Make Celery Backend a configuration option. We default to RabbitMQ for Broker and Redis for Backend because
that is a super common setup that seems to scale well.
v0.10.1 2022-07-18
Small bug fixes.
v0.10.0 2022-07-15
Add feed history to help debugging, view new FetchEvents objects.
v0.9.4 2022-07-15
Fix some date parsing bugs by using built-in approach from feed parsing library. Also add some more unit tests.
v0.9.3 2022-07-14
Added back in a necessary index for fast querying.
v0.9.2 2022-07-14
More debug logging.
v0.9.1 2022-07-14
Pretending to be a browser in order to see if it fixes a 403 bug.
v0.9.0 2022-07-14
Add fetch_events table for history and debugging. Also move title uniqueness check to software (not DB) to allow for
empty title fields.
v0.8.1 2022-07-14
Rewrite main rss fetching task to make logic more obvious, and also try and streamline database handle usage.
v0.8.0 2022-07-14
Switch to FastAPI for returning counts to help debug. See /redoc or /docs for full API documentation and the OpenAPI
specification file.
v0.7.5 2022-07-11
New option to log RSS info to files on disk, controlled via SAVE_RSS_FILES env-var (1 or 0)
v0.7.4 2022-07-07
Small tweak to skip relative URLs. Also more debug logging.
v0.7.3 2022-07-06
Fix bug that was checking for duplicate titles across all sources within last 7 days, instead of just within one
media source.
v0.7.2 2022-07-06
Update requirements and fix bug related to overly aggressive marking of failures.
v0.7.1 2022-06-02
Add in more feeds from production server.
v0.7.0 2022-05-26
Check a normalized story URL and title for uniqueness before saving, like we do on our production system. This is a
critical de-duplication step.
v0.6.1 2022-05-20
Generate files for yesterday (not 2 days ago) because that will make delivered results more timely.
v0.6.0 2022-05-16
Add in new feed. Prep to show some data on website.
v0.5.5 2022-04-28
More work on concurrency for prod server and related configurations.
v0.5.4 2022-04-27
Tweaks to RSS file generation to make it more robust.
v0.5.3 2022-04-27
Query bug fix.
v0.5.2 2022-04-27
Handle podcast feeds, which don't have links (they have enclosures instead), by ignoring them in the reporting script.
v0.5.1 2022-04-27
Deployment work for generating daily rss files.
v0.5.0 2022-04-27
Retry feeds that we tried but that didn't respond (up to 3 times in a row before giving up).
v0.4.0 2022-04-27
Update dependencies to latest
v0.3.2 2022-03-25
RSS path loaded from env-var
v0.3.1 2022-03-11
Ignore a whole bunch of errors that are expected ones
v0.3.0 2022-03-11
Add title and canonical domain to daily feeds
v0.2.1 2022-02-19
Move max feeds to fetch at a time limit to an env var for easier config (MAX_FEEDS defaults to 1000)
v0.2.0 2022-02-19
Restructured queries to try and solve DB connection leak bug.
v0.1.2 2022-02-18
Production performance-related tweaks.
v0.1.1 2022-02-18
Make sure duplicate story urls don't get inserted (no matter where they are from). This is the quick solution to making
sure an RSS feed with stories we have already saved doesn't create duplicates.