Detect deleted Sources #46

Open
philbudne opened this issue Mar 6, 2025 · 0 comments

philbudne commented Mar 6, 2025

It's now possible for a Source to be deleted in the web-search directory.

There are on the order of 50K unique sources with feeds in the rss-fetcher db.

Having the rss-fetcher check one source per second would take about 14 hours. At one query per minute it would take about 35 days. To go thru everything every two days would take about 1041 queries per hour, or 17 queries per minute. I can (off hand) think of four ways to make this less painful (in order of pain for implementation on the rss-fetcher side); all may require work in the mcweb API/db:

  1. Implement an endpoint where the rss-fetcher can present a list of source ids and get back a list of which ones are valid. This should be doable in a single query to the mcweb-db, along the lines of SELECT id FROM <sources table> WHERE id IN (<list of ids to validate>).
  2. Keep track of deleted sources in the mcweb-db (a separate table of ids would be fine) that the rss-fetcher can fetch (the list should never be terribly large). Keeping the sources in place and marking them deleted would also have worked here, but would require all normal queries to filter out deleted entries. My (limited) past experience is that this is pretty normal.
  3. Add a web-search API endpoint to download a CSV of all sources.
  4. Have the rss-fetcher page thru the sources. NOTE: the current source_id space is 1 thru 1.9 million (1900 pages of 1000), tho the rss-fetcher can optimize this by only fetching pages that start with a previously unchecked id (i.e., keeping a cursor of the last id checked).
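
A rough sketch of what option 1 might look like on the database side, batching the ids so that no single IN list gets too long. Everything named here is an assumption for illustration: the table name `sources`, the helper names, and the use of an in-memory SQLite database as a stand-in for the mcweb-db (which this issue does not describe).

```python
import sqlite3
from typing import Iterator, List, Set


def chunked(ids: List[int], size: int) -> Iterator[List[int]]:
    """Yield successive slices of at most `size` ids."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def valid_ids(conn: sqlite3.Connection, ids: List[int]) -> Set[int]:
    """Return the subset of `ids` still present in the hypothetical `sources` table."""
    if not ids:
        return set()
    # One "?" placeholder per id, so the ids are bound as parameters
    # rather than interpolated into the SQL text.
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id FROM sources WHERE id IN ({placeholders})", ids
    )
    return {row[0] for row in rows}


def find_deleted(
    conn: sqlite3.Connection, known_ids: List[int], batch_size: int = 500
) -> Set[int]:
    """Return the ids the caller knows about that the database no longer has."""
    deleted: Set[int] = set()
    for batch in chunked(known_ids, batch_size):
        deleted.update(set(batch) - valid_ids(conn, batch))
    return deleted


if __name__ == "__main__":
    # Stand-in database: ids 3 and 6 have been "deleted".
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sources (id INTEGER PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO sources (id) VALUES (?)", [(i,) for i in (1, 2, 4, 5)]
    )
    print(sorted(find_deleted(conn, [1, 2, 3, 4, 5, 6], batch_size=4)))  # prints [3, 6]
```

With roughly 50K sources and 500 ids per batch, a full pass would be on the order of 100 requests, which is why this option looks cheap from the rss-fetcher side. The batch size is arbitrary here; a real endpoint would presumably set its own limit.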
@philbudne philbudne self-assigned this Mar 6, 2025