Data.gov archiving guidance? #36

Open
chronick opened this issue Jan 30, 2025 · 2 comments

Comments

@chronick

This seems concerning: https://www.reddit.com/r/climate/comments/1idin45/the_us_governments_open_data_is_currently_being/

From the thread:

I just checked: it shows a large, steady increase in datasets until Jan 21, 2025, reaching 307,854 datasets http://web.archive.org/web/20250120135355/https://data.gov/
Now it has lost 2,290 datasets in 9 days!

Look at this huge decrease on Jan 21, between 03:04:19 and 15:15:42 http://web.archive.org/web/20250120135355/https://data.gov/ http://web.archive.org/web/20250121233247/https://data.gov/

Drops from 307,854 to 306,012 datasets!!! It's been decreasing every day, and today data.gov is at 305,564.

Are data.gov datasets being covered by the EOT archive? I don't see any specific info about these.
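
The figures quoted from the thread can be cross-checked with a few lines of arithmetic. This is only a sketch using the three counts quoted above; the variable and function names are illustrative and not part of any tool.

```python
# Dataset counts as quoted in the Reddit thread (read off data.gov snapshots).
counts = {
    "peak": 307_854,            # count before the Jan 21 drop
    "after_jan21_drop": 306_012,
    "latest": 305_564,          # count reported "today" in the thread
}

def drop(before: int, after: int) -> int:
    """Number of datasets lost between two observations."""
    return before - after

jan21_drop = drop(counts["peak"], counts["after_jan21_drop"])
total_drop = drop(counts["peak"], counts["latest"])

print(jan21_drop)                         # 1842
print(total_drop)                         # 2290
print(f"{jan21_drop / total_drop:.0%}")   # 80%
```

So the 2,290 total is consistent with the quoted counts, and roughly four fifths of it comes from the single Jan 21 drop.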

@ldko
Member

ldko commented Jan 30, 2025

I believe @jcushman has been working on archiving the datasets from data.gov, and some of it will have been captured in the web crawling being done by Internet Archive, but I don't know how complete either capture is at this point.

@jcushman

jcushman commented Jan 30, 2025

We posted a short blog post on this just now: https://lil.law.harvard.edu/blog/2025/01/30/preserving-public-u-s-federal-data/

Basically, we are routinely capturing the metadata of the data.gov index itself, as well as a copy of each URL it points to, and we're figuring out an affordable way to make that searchable and clonable for data science. There are likely still things being missed between the two efforts: anything that needs a deep crawl but either isn't on the EOT list or isn't generically crawlable.
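
For readers wondering what "the metadata of the index plus each URL it points to" might look like in practice, here is a minimal sketch. It is not the pipeline described in the blog post. It assumes the data.gov catalog exposes the standard CKAN Action API at `catalog.data.gov` and that each dataset record carries a `resources` list with `url` fields; `fetch_page`, `resource_urls`, and `sample_page` are hypothetical names made up for illustration.

```python
import json
import urllib.parse
import urllib.request
from typing import Iterator, Tuple

# Assumption: the catalog serves the standard CKAN package_search action here.
CKAN_SEARCH = "https://catalog.data.gov/api/3/action/package_search"

def fetch_page(start: int, rows: int = 100) -> dict:
    """Fetch one page of dataset metadata (network call; not invoked below)."""
    query = urllib.parse.urlencode({"start": start, "rows": rows})
    with urllib.request.urlopen(f"{CKAN_SEARCH}?{query}", timeout=30) as resp:
        return json.load(resp)

def resource_urls(page: dict) -> Iterator[Tuple[str, str]]:
    """Yield (dataset name, resource URL) pairs from one response page."""
    for dataset in page.get("result", {}).get("results", []):
        for resource in dataset.get("resources", []):
            url = resource.get("url")
            if url:  # skip resources with no URL to capture
                yield dataset.get("name", ""), url

# Made-up sample in the CKAN response shape, so the parser runs offline.
sample_page = {
    "result": {
        "count": 1,
        "results": [
            {
                "name": "example-dataset",
                "resources": [
                    {"url": "https://example.gov/data.csv"},
                    {"description": "resource with no URL"},
                ],
            }
        ],
    }
}

pairs = list(resource_urls(sample_page))
print(pairs)
```

Saving each raw page alongside the extracted URLs keeps the index metadata itself, and the URL list is what a separate fetcher would then work through. Anything behind a landing page or a query form would not be reached this way, which matches the "needs a deep crawl" gap mentioned above.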
