Releases · edgi-govdata-archiving/wayback

01 Feb 19:01

Mr0grog

v0.4.5

0ef2797

v0.4.5 Latest


In v0.4.4, we broke archived mementos of rate limit errors — they started raising exceptions instead of returning the actual memento. We now correctly return mementos of rate limit errors while still raising exceptions for actual live rate limit errors from the Wayback Machine itself. (#158)
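As a rough illustration of the distinction, the sketch below labels a response by checking for the `Memento-Datetime` header, which the Memento protocol uses to mark archived playback. Whether the library makes this exact check is an assumption, and `classify_response` is a hypothetical helper, not part of the package.

```python
def classify_response(status_code, headers):
    """Label a Wayback Machine response as a memento or a live rate limit error."""
    # Normalize header names, since servers may vary their capitalization.
    names = {name.lower() for name in headers}
    if "memento-datetime" in names:
        # Archived playback: treat it as a memento, whatever its status code.
        return "memento"
    if status_code == 429:
        # No Memento-Datetime header: the server itself is throttling us.
        return "live rate limit"
    return "other"
```

With this approach, an archived copy of a 429 response still counts as a memento, while a 429 without the header is treated as a live error.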

Full Changelog: v0.4.4...v0.4.5


28 Nov 00:47

Mr0grog

v0.4.4

b8f647f

Version 0.4.4

This release makes some small fixes to rate limits and retries to better match the current behavior of Wayback Machine servers:

Updates the WaybackClient.search rate limit to 1 call per second (it was previously 1.5 per second). (#140)
Delays retries for 60 seconds when receiving rate limit errors from the server. (#142)
Adds more logging around requests and rate limiting. This should make it easier to debug future rate limit issues. (#139)
Fixes calculation of the time attribute on wayback.exceptions.WaybackRetryError. It turns out it was only accounting for the time spent waiting between retries and skipping the time waiting for the server to respond! (#142)
Fixes some spots where we leaked HTTP connections during retries or during exception handling. (#142)
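To illustrate the timing fix, here is a minimal sketch of a retry loop that measures total elapsed time from before the first request, so the result includes both time spent waiting on the server and time spent between retries. This is not the library's implementation: `RetryError` and `call_with_retries` are hypothetical names, and the injectable `clock` and `sleep` arguments exist only to make the sketch easy to test.

```python
import time


class RetryError(Exception):
    """Hypothetical stand-in for an error raised after the final retry."""

    def __init__(self, retries, total_time):
        super().__init__(f"Gave up after {retries} retries in {total_time:.1f}s")
        self.retries = retries
        self.time = total_time


def call_with_retries(request, retries=2, delay=60.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Call request() until it succeeds or retries run out."""
    start = clock()  # Started before the first request, not the first wait.
    for attempt in range(retries + 1):
        try:
            return request()
        except ConnectionError:
            if attempt == retries:
                # Elapsed time covers server waits and the delays between tries.
                raise RetryError(retries, clock() - start)
            sleep(delay)
```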

The next minor release (v0.5) will almost certainly include some bigger changes to how rate limits and retries are handled.


26 Sep 16:56

Mr0grog

v0.4.3

edd08e8

Version 0.4.3

This is mainly a compatibility release: it adds support for urllib3 v2.x and the next upcoming major release of Python, v3.12.0. It also adds support for multiple filters in searches. There are no breaking changes.

Features

You can now apply multiple filters to a search by using a list or tuple for the filter_field parameter of WaybackClient.search. Previously, you could only supply a string with a single filter. (#119)

For example, to search for all captures at nasa.gov with a 404 status code and “feature” somewhere in the URL:

from wayback import WaybackClient

client = WaybackClient()
client.search('nasa.gov/',
              match_type='prefix',
              filter_field=['statuscode:404',
                            'urlkey:.*feature.*'])
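For context, the Wayback Machine's CDX search API accepts the `filter` query parameter more than once. The sketch below shows how a list of filters might be turned into such a query string. It only illustrates the idea: `cdx_params` and `build_cdx_query` are hypothetical helpers, and how the library builds its requests internally is an assumption.

```python
from urllib.parse import urlencode

CDX_URL = "https://web.archive.org/cdx/search/cdx"


def cdx_params(url, match_type=None, filter_field=None):
    """Return query parameters, repeating `filter` once per expression."""
    params = [("url", url)]
    if match_type:
        params.append(("matchType", match_type))
    # Accept a single string or a list/tuple of strings.
    if isinstance(filter_field, str):
        filter_field = [filter_field]
    for expression in filter_field or []:
        params.append(("filter", expression))
    return params


def build_cdx_query(url, match_type=None, filter_field=None):
    """Build a full CDX search URL from the parameters above."""
    return f"{CDX_URL}?{urlencode(cdx_params(url, match_type, filter_field))}"
```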

Fixes & Maintenance

Add support for Python 3.12.0. (#123)
Add support for urllib3 v2.x (urllib3 v1.20+ also still works). (#116)


22 Sep 22:02

Mr0grog

v0.4.3a1

45ff79e

Version 0.4.3a1 Pre-release


This is a test release for properly supporting the upcoming release of Python 3.12.0. Please file an issue if you encounter problems using it on Python 3.12.0rc3 or later. (#123)


30 May 06:16

Mr0grog

v0.4.2

43f553f

Version 0.4.2

Wayback is not compatible with urllib3 v2, and this release updates the package's requirements to make sure Pip and other package managers install compatible versions of Wayback and urllib3. There are no other fixes or new features.


08 Mar 05:12

Mr0grog

v0.4.1

7fd74ef

Version 0.4.1

Features

wayback.Memento now has a links property with information about other URLs that are related to the memento, such as the previous or next mementos in time. It’s a dict where the keys identify the relationship (e.g. 'prev memento') and the values are dicts with additional information about the link. (#57)

For example:

{
    'original': {
        'url': 'https://www.fws.gov/birds/',
        'rel': 'original'
    },
    'first memento': {
        'url': 'https://web.archive.org/web/20050323155300id_/http://www.fws.gov:80/birds',
        'rel': 'first memento',
        'datetime': 'Wed, 23 Mar 2005 15:53:00 GMT'
    },
    'prev memento': {
        'url': 'https://web.archive.org/web/20210125125216id_/https://www.fws.gov/birds/',
        'rel': 'prev memento',
        'datetime': 'Mon, 25 Jan 2021 12:52:16 GMT'
    },
    'next memento': {
        'url': 'https://web.archive.org/web/20210321180831id_/https://www.fws.gov/birds',
        'rel': 'next memento',
        'datetime': 'Sun, 21 Mar 2021 18:08:31 GMT'
    },
    'last memento': {
        'url': 'https://web.archive.org/web/20221006031005id_/https://fws.gov/birds',
        'rel': 'last memento',
        'datetime': 'Thu, 06 Oct 2022 03:10:05 GMT'
    }
}

One use for these is to iterate through additional mementos. For example, to get the previous memento:

client.get_memento(memento.links['prev memento']['url'])
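Building on that one-liner, a small generator can walk backward through several mementos. This is a sketch rather than a library feature: `previous_mementos` is a hypothetical helper, and it assumes the earliest memento simply has no 'prev memento' entry in its links.

```python
def previous_mementos(client, memento, limit=5):
    """Yield up to `limit` mementos that precede `memento` in time."""
    for _ in range(limit):
        link = memento.links.get('prev memento')
        if link is None:
            # Assumed: the earliest memento has no 'prev memento' link.
            return
        memento = client.get_memento(link['url'])
        yield memento
```

Each step makes one get_memento() call, so a small `limit` keeps the number of requests predictable.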

Fixes & Maintenance

Fix an issue where the Memento.url attribute might be slightly off from the exact URL that was captured (it could have a different protocol, different upper/lower-casing, etc.). (#99)
Fix an error when getting a memento for a redirect in view mode. If you called wayback.WaybackClient.get_memento with a URL that turned out to be a redirect at the given time and set the mode option to wayback.Mode.view, you’d get an exception saying “Memento at {url} could not be played.” Now this works just fine. (#109)


10 Nov 18:35

Mr0grog

v0.4.0

e2af777

Version 0.4.0

Breaking Changes

This release includes a significant overhaul of parameters for WaybackClient.search.

Removed parameters that did nothing, could break search, or that were for internal use only: gzip, showResumeKey, resumeKey, page, pageSize, previous_result.
Removed support for extra, arbitrary keyword parameters that could be added to each request to the search API.
All parameters now use snake_case. (Previously, parameters that were passed unchanged to the HTTP API used camelCase, while others used snake_case.) The old, non-snake-case names are deprecated, but still work. They’ll be completely removed in v0.5.0.
- matchType → match_type
- fastLatest → fast_latest
- resolveRevisits → resolve_revisits
The limit parameter now has a default value. There are very few cases where you should not set a limit (not doing so will typically break pagination), and there is now a default value to help prevent mistakes. We’ve also added documentation to explain how and when to adjust this value, since it is pretty complex. (#65)
Expanded the method documentation to explain things in more depth and link to more external references.

While we were at it, we also renamed the datetime parameter of WaybackClient.get_memento to timestamp for consistency with the CdxRecord and Memento classes. The old name still works for now, but it will be fully removed in v0.5.0.
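For readers curious how deprecated names can keep working during a transition like this, here is a minimal sketch of the general pattern: map each old keyword to its new name and emit a DeprecationWarning. It is an illustration only. `normalize_search_options` and `RENAMED` are hypothetical and do not reflect the library's actual code.

```python
import warnings

# Old camelCase names mapped to their snake_case replacements.
RENAMED = {
    'matchType': 'match_type',
    'fastLatest': 'fast_latest',
    'resolveRevisits': 'resolve_revisits',
}


def normalize_search_options(**options):
    """Return options keyed by snake_case names, warning about old ones."""
    normalized = {}
    for name, value in options.items():
        new_name = RENAMED.get(name, name)
        if new_name != name:
            warnings.warn(f'"{name}" is deprecated; use "{new_name}" instead.',
                          DeprecationWarning, stacklevel=2)
        normalized[new_name] = value
    return normalized
```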

Features

Memento.headers is now case-insensitive. The keys of the headers dict are returned with their original case when iterating, but lookups are performed case-insensitively. For example:
```
list(memento.headers) == ['Content-Type', 'Date']
memento.headers['Content-Type'] == memento.headers['content-type']
```
(#98)
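To make the behavior concrete, the sketch below shows one way to build a read-only mapping that looks names up case-insensitively while still iterating over keys in their original case. `CaseInsensitiveHeaders` is a hypothetical class written for illustration; the type the library actually uses for Memento.headers may differ.

```python
from collections.abc import Mapping


class CaseInsensitiveHeaders(Mapping):
    """Read-only mapping with case-insensitive lookups and original-case keys."""

    def __init__(self, headers):
        # Map each lowercased name to its (original name, value) pair.
        self._store = {name.lower(): (name, value)
                       for name, value in dict(headers).items()}

    def __getitem__(self, name):
        return self._store[name.lower()][1]

    def __iter__(self):
        return (original for original, _ in self._store.values())

    def __len__(self):
        return len(self._store)
```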

There are now built-in, adjustable rate limits for calls to both search() and get_memento(). The default values should keep you from getting temporarily blocked by the Wayback Machine servers, but you can also adjust them when instantiating WaybackSession:

from wayback import WaybackClient, WaybackSession

# Limit get_memento() calls to 2 per second (or one every 0.5 seconds):
client = WaybackClient(WaybackSession(memento_calls_per_second=2))

# These now take a minimum of 0.5 seconds, even if the Wayback Machine
# responds instantly (there's no delay on the first call):
client.get_memento('http://www.noaa.gov/', timestamp='20180816111911')
client.get_memento('http://www.noaa.gov/', timestamp='20180829092926')

A huge thanks to @LionSzl for implementing this. (#12)
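The idea behind a limit like this can be sketched as a minimum interval between calls, with no delay on the first one. The `RateLimit` class below is a hypothetical, single-threaded illustration and not the code that shipped; the injectable `clock` and `sleep` arguments are there only so it can be tested without real waiting.

```python
import time


class RateLimit:
    """Enforce a minimum interval between calls (a hypothetical helper)."""

    def __init__(self, calls_per_second, clock=time.monotonic, sleep=time.sleep):
        self.minimum_interval = 1 / calls_per_second
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until the next call is allowed; never delays the first call."""
        if self._last_call is not None:
            elapsed = self._clock() - self._last_call
            if elapsed < self.minimum_interval:
                self._sleep(self.minimum_interval - elapsed)
        self._last_call = self._clock()
```

Calling `wait()` before each request spaces requests at least `1 / calls_per_second` seconds apart. A version shared between threads would also need a lock.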

Fixes & Maintenance

All API requests to archive.org now use HTTPS instead of HTTP. Thanks to @sundhaug92 for calling this out. (#81)
Headers from the original archived response are again included in Memento.headers. As part of this, the headers attribute is now case-insensitive (see new features above), since the Internet Archive servers now return headers with different cases depending on how the request was made. (#98)

Contributors

sundhaug92 and lion-sz


30 Sep 19:11

Mr0grog

v0.3.3

3ff9a73

Version 0.3.3

This release extends the timestamp parsing fix from version 0.3.2 to handle a similar problem with the month portion of timestamps, in addition to the day. It also implements a small performance improvement in timestamp parsing. Thanks to @edsu for discovering this issue and fixing it. (#88)
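As an illustration of the kind of input involved, the sketch below parses a 14-digit Wayback timestamp and tolerates "00" in the month or day position. Treating "00" as the first month or day is an assumption made for this example, and the library's actual handling may differ.

```python
from datetime import datetime, timezone


def parse_timestamp(text):
    """Parse a 14-digit Wayback timestamp such as '20180816111911' as UTC."""
    year, month, day = int(text[0:4]), int(text[4:6]), int(text[6:8])
    hour, minute, second = int(text[8:10]), int(text[10:12]), int(text[12:14])
    # Assumption: treat an invalid "00" month or day as the first one.
    return datetime(year, max(month, 1), max(day, 1), hour, minute, second,
                    tzinfo=timezone.utc)
```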

Full Changelog: v0.3.2...v0.3.3

Contributors

edsu


17 Nov 07:35

Mr0grog

v0.3.2

2047b07

Version 0.3.2

Some Wayback CDX records have invalid timestamps with "00" for the day-of-month portion. wayback.WaybackClient.search previously raised an exception when parsing CDX records with this issue, but now handles them safely. Thanks to @8W9aG for discovering this issue and addressing it. (#85)

Contributors

8W9aG


15 Oct 03:30

Mr0grog

v0.3.1

b406f3d

Version 0.3.1

Some Wayback CDX records have no length information, and previously caused WaybackClient.search to raise an exception. These records now have their length property set to None instead of a number. Thanks to @8W9aG for discovering this issue and addressing it! (#83)
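A minimal sketch of this kind of tolerant parsing follows. It assumes a missing length appears in the raw CDX line as "-" or as an empty string, and `parse_length` is a hypothetical helper rather than the library's own function.

```python
def parse_length(field):
    """Convert a CDX length field to an int, or None when it is missing."""
    # Assumption: a missing value appears as "-" (or an empty string).
    if field in ('-', ''):
        return None
    return int(field)
```

Code that consumes search results should check for None before doing arithmetic on the length.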

Contributors

8W9aG

