Releases: edgi-govdata-archiving/wayback
v0.4.5
In v0.4.4, we broke archived mementos of rate limit errors — they started raising exceptions instead of returning the actual memento. We now correctly return mementos of rate limit errors while still raising exceptions for actual live rate limit errors from the Wayback Machine itself. (#158)
Full Changelog: v0.4.4...v0.4.5
Version 0.4.4
This release makes some small fixes to rate limits and retries in order to better match with the current behavior of Wayback Machine servers:
- Updates the WaybackClient.search rate limit to 1 call per second (it was previously 1.5 per second). (#140)
- Delays retries for 60 seconds when receiving rate limit errors from the server. (#142)
- Adds more logging around requests and rate limiting. This should make it easier to debug future rate limit issues. (#139)
- Fixes calculation of the time attribute on wayback.exceptions.WaybackRetryError. It turns out it was only accounting for the time spent waiting between retries and skipping the time waiting for the server to respond! (#142)
- Fixes some spots where we leaked HTTP connections during retries or during exception handling. (#142)
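To make the timing fix concrete, here is a minimal sketch of a retry loop whose reported time covers both the time spent waiting on the operation and the delays between attempts. It is not the library's implementation: RetryError, call_with_retries, and the injectable clock and sleep arguments are hypothetical names used only for illustration.

```python
import time


class RetryError(Exception):
    """Hypothetical stand-in for an error raised once retries run out."""

    def __init__(self, retries, total_time, cause):
        super().__init__(f'Failed after {retries} retries ({total_time:.2f}s)')
        self.retries = retries
        self.time = total_time
        self.cause = cause


def call_with_retries(operation, retries=2, delay=60.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Call operation(), retrying on OSError up to `retries` times."""
    # Start the clock before the first attempt so the total includes
    # time spent waiting for responses, not just time spent sleeping.
    start = clock()
    attempt = 0
    while True:
        try:
            return operation()
        except OSError as error:
            if attempt >= retries:
                raise RetryError(attempt, clock() - start, error) from error
            attempt += 1
            sleep(delay)
```

Passing in clock and sleep keeps the sketch easy to test without actually waiting.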
The next minor release (v0.5) will almost certainly include some bigger changes to how rate limits and retries are handled.
Version 0.4.3
This is mainly a compatibility release: it adds support for urllib3 v2.x and the next upcoming major release of Python, v3.12.0. It also adds support for multiple filters in searches. There are no breaking changes.
Features
You can now apply multiple filters to a search by using a list or tuple for the filter_field parameter of WaybackClient.search. Previously, you could only supply a string with a single filter. (#119)
For example, to search for all captures at nasa.gov with a 404 status code and “feature” somewhere in the URL:
    client.search('nasa.gov/',
                  match_type='prefix',
                  filter_field=['statuscode:404',
                                'urlkey:.*feature.*'])
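To illustrate what a list of filters amounts to at the HTTP level, here is a minimal sketch, assuming the CDX search API accepts a repeated filter query parameter alongside url and matchType. build_cdx_query is a hypothetical helper, not part of wayback, and the library's real request-building code may differ.

```python
from urllib.parse import urlencode


def build_cdx_query(url, match_type=None, filter_field=None):
    """Build a query string with one `filter` parameter per filter."""
    params = [('url', url)]
    if match_type:
        params.append(('matchType', match_type))

    # Accept a single string (the old behavior) or a list/tuple of strings.
    if filter_field is None:
        filters = []
    elif isinstance(filter_field, str):
        filters = [filter_field]
    else:
        filters = list(filter_field)

    params.extend(('filter', item) for item in filters)
    return urlencode(params)


query = build_cdx_query('nasa.gov/',
                        match_type='prefix',
                        filter_field=['statuscode:404', 'urlkey:.*feature.*'])
```

Using a list of pairs rather than a dict lets the same parameter name appear more than once in the encoded query.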
Fixes & Maintenance
Version 0.4.3a1
This is a test release for properly supporting the upcoming release of Python 3.12.0. Please file an issue if you encounter problems using it on Python 3.12.0rc3 or later. (#123)
Version 0.4.2
Wayback is not compatible with urllib3 v2, and this release updates the package's requirements to make sure Pip and other package managers install compatible versions of Wayback and urllib3. There are no other fixes or new features.
Version 0.4.1
Features
wayback.Memento now has a links property with information about other URLs that are related to the memento, such as the previous or next mementos in time. It’s a dict where the keys identify the relationship (e.g. 'prev memento') and the values are dicts with additional information about the link. (#57)
For example:
    {
        'original': {
            'url': 'https://www.fws.gov/birds/',
            'rel': 'original'
        },
        'first memento': {
            'url': 'https://web.archive.org/web/20050323155300id_/http://www.fws.gov:80/birds',
            'rel': 'first memento',
            'datetime': 'Wed, 23 Mar 2005 15:53:00 GMT'
        },
        'prev memento': {
            'url': 'https://web.archive.org/web/20210125125216id_/https://www.fws.gov/birds/',
            'rel': 'prev memento',
            'datetime': 'Mon, 25 Jan 2021 12:52:16 GMT'
        },
        'next memento': {
            'url': 'https://web.archive.org/web/20210321180831id_/https://www.fws.gov/birds',
            'rel': 'next memento',
            'datetime': 'Sun, 21 Mar 2021 18:08:31 GMT'
        },
        'last memento': {
            'url': 'https://web.archive.org/web/20221006031005id_/https://fws.gov/birds',
            'rel': 'last memento',
            'datetime': 'Thu, 06 Oct 2022 03:10:05 GMT'
        }
    }
One use for these is to iterate through additional mementos. For example, to get the previous memento:
client.get_memento(memento.links['prev memento']['url'])
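Following the same idea a few steps further, here is a minimal sketch of walking backward through earlier captures. previous_mementos is a hypothetical helper, not part of wayback. It only assumes that client has a get_memento(url) method and that each memento has a links dict shaped like the one above.

```python
def previous_mementos(client, memento, limit=3):
    """Yield up to `limit` earlier mementos by following 'prev memento' links."""
    for _ in range(limit):
        link = memento.links.get('prev memento')
        if link is None:
            # The earliest capture has no previous memento to follow.
            break
        memento = client.get_memento(link['url'])
        yield memento
```

With a real client, each step makes a request to the Wayback Machine, so the limit argument keeps the walk bounded.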
Fixes & Maintenance
- Fix an issue where the Memento.url attribute might be slightly off from the exact URL that was captured (it could have a different protocol, different upper/lower-casing, etc.). (#99)
- Fix an error when getting a memento for a redirect in view mode. If you called wayback.WaybackClient.get_memento with a URL that turned out to be a redirect at the given time and set the mode option to wayback.Mode.view, you’d get an exception saying “Memento at {url} could not be played.” Now this works just fine. (#109)
Version 0.4.0
Breaking Changes
This release includes a significant overhaul of parameters for WaybackClient.search.
- Removed parameters that did nothing, could break search, or that were for internal use only: gzip, showResumeKey, resumeKey, page, pageSize, previous_result.
- Removed support for extra, arbitrary keyword parameters that could be added to each request to the search API.
- All parameters now use snake_case. (Previously, parameters that were passed unchanged to the HTTP API used camelCase, while others used snake_case.) The old, non-snake-case names are deprecated, but still work. They’ll be completely removed in v0.5.0.
  - matchType → match_type
  - fastLatest → fast_latest
  - resolveRevisits → resolve_revisits
- The limit parameter now has a default value. There are very few cases where you should not set a limit (not doing so will typically break pagination), and there is now a default value to help prevent mistakes. We’ve also added documentation to explain how and when to adjust this value, since it is pretty complex. (#65)
- Expanded the method documentation to explain things in more depth and link to more external references.
While we were at it, we also renamed the datetime parameter of WaybackClient.get_memento to timestamp for consistency with the CdxRecord and Memento classes. The old name still works for now, but it will be fully removed in v0.5.0.
Features
- Memento.headers is now case-insensitive. The keys of the headers dict are returned with their original case when iterating, but lookups are performed case-insensitively. For example:

      list(memento.headers) == ['Content-Type', 'Date']
      memento.headers['Content-Type'] == memento.headers['content-type']

  (#98)
- There are now built-in, adjustable rate limits for calls to both search() and get_memento(). The default values should keep you from getting temporarily blocked by the Wayback Machine servers, but you can also adjust them when instantiating WaybackSession:

      # Limit get_memento() calls to 2 per second (or one every 0.5 seconds):
      client = WaybackClient(WaybackSession(memento_calls_per_second=2))

      # These now take a minimum of 0.5 seconds, even if the Wayback Machine
      # responds instantly (there's no delay on the first call):
      client.get_memento('http://www.noaa.gov/', timestamp='20180816111911')
      client.get_memento('http://www.noaa.gov/', timestamp='20180829092926')

  A huge thanks to @LionSzl for implementing this. (#12)
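The behavior described above (no delay on the first call, then a minimum gap between calls) can be sketched with a small limiter. This is an illustration only, not the library's implementation: the RateLimit class and its clock and sleep arguments are hypothetical.

```python
import time


class RateLimit:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, per_second, clock=time.monotonic, sleep=time.sleep):
        self.minimum_wait = 1 / per_second
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        now = self._clock()
        if self._last_call is not None:
            # Sleep only for whatever part of the interval is still left.
            remaining = self.minimum_wait - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now
```

Calling wait() before each request spaces requests out without penalizing callers that are already slower than the limit.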
Fixes & Maintenance
- All API requests to archive.org now use HTTPS instead of HTTP. Thanks to @sundhaug92 for calling this out. (#81)
- Headers from the original archived response are again included in Memento.headers. As part of this, the headers attribute is now case-insensitive (see new features above), since the Internet Archive servers now return headers with different cases depending on how the request was made. (#98)
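For readers curious how a mapping can keep the original case of its keys while matching lookups case-insensitively, here is a minimal sketch. CaseInsensitiveHeaders is a hypothetical class written for illustration; it is not the type that Memento.headers actually uses.

```python
from collections.abc import Mapping


class CaseInsensitiveHeaders(Mapping):
    """Read-only mapping that matches keys regardless of case."""

    def __init__(self, headers):
        # Index by lowercased name, but remember the original spelling.
        self._store = {
            key.lower(): (key, value) for key, value in headers.items()
        }

    def __getitem__(self, key):
        return self._store[key.lower()][1]

    def __iter__(self):
        # Iteration yields the keys as they were originally written.
        return (key for key, _ in self._store.values())

    def __len__(self):
        return len(self._store)


headers = CaseInsensitiveHeaders({
    'Content-Type': 'text/html',
    'Date': 'Thu, 16 Aug 2018 11:19:11 GMT',
})
```

Subclassing Mapping provides get(), membership tests, and items() on top of the three methods defined here.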
Version 0.3.3
This release extends the timestamp parsing fix from version 0.3.2 to handle a similar problem, but with the month portion of timestamps in addition to the day. It also implements a small performance improvement in timestamp parsing. Thanks to @edsu for discovering and addressing this issue. (#88)
Full Changelog: v0.3.2...v0.3.3