Version History

Version 1

V1.4.1

add trafilatura.utils to SENTRY_LOGGERS_TO_IGNORE

V1.4.0

add ignore_loggers and sentry_ignore_loggers

V1.3.1

Fix CI Bug from out-of-date dependency

V1.3.0

Add option to override canonical_domain

V1.2.0

metadata now includes canonical URL when available
introduce black preformatter

V1.1.0

add requests_arcana.py for use across Media Cloud stack

V1.0.2

version bump to test automatic releases

V1.0.1

version bump to test automatic releases

V1.0.0

small dependency updates
changed version management system
first release packaged with Flit

Version 0

v0.12.0: Add new const to centrally store MC User-Agent (mcmetadata.webpages.MEDIA_CLOUD_USER_AGENT)
v0.11.2: Fix title parsing and url normalization edge cases, update requirements
v0.11.1: Add new urls.unique_url_hash helper to centralize logic for generating a unique hash for a URL; the hash is also returned from extract in case you choose to use it
v0.11.0: (error release)
v0.10.0: Support defaults and overrides in extract, returning execution time stats, requirements updates, more handling of malformed URLs
v0.9.5: Updated requirements, update non-news site list, fix failing unit tests, tweak title parsing logic
v0.9.4: Updated requirements to use faust-cchardet for py >3.9 support
v0.9.3: Updated content extractor dependencies, added py.typed for typing support
v0.9.2: fixed a bug related to title regex matching
v0.9.1: better support for some non-US government domains
v0.9.0: adds feeds.normalize_url helper
v0.8.2: small fix to url parsing
v0.8.1: handle IP addresses in canonical_domain helper
v0.8.0: update dependencies, fix various edge-case bugs
v0.7.9: fix include_other_metadata processing, upgrade underlying libraries to latest, remove leading and trailing whitespace from extracted text
v0.7.8: add optional include_other_metadata arg to extract method, which includes top_image and authors and other less validated metadata in results
v0.7.7: fix typo
v0.7.6: fix distribution packaging error
v0.7.5: add performance monitoring, handle invalid URLs, add a list of high volume non-news domains that might be worth ignoring (based on high volume "noise" domains in our production database)
v0.7.4: don't treat shortened URLs as homepage ones, also more aggressively strip URL query params
v0.7.3: tweak title extraction for multipart titles, add is_homepage helper boolean
v0.7.2: fix extraction argument bug introduced in last release, fix some more test cases
v0.7.1: fix bug in url normalization, increase robustness in extractor chain
v0.7.0: fix YouTube url normalization, better Trafilatura defaults, limit to pub dates within 90 days of today, ensure language is 2 letters, content extraction performance improvements, fix some title parsing bugs, add more test cases, add script to compare results to older Media Cloud code (which this stuff is extracted from), resolve language guessing conflicts better, handle text encoding errors
v0.6.0: prefer language from metadata over guessing, try Trafilatura as first parser, encoding fixes
v0.5.5: turn off aggressive date finding mode, which was making lots of 1/1 date guesses
v0.5.4: fix bug in regex that parses og:title properties into titles
v0.5.3: bug fixes in title normalization
v0.5.2: more efficient parsing of dates from HTML, remove failing over-specified canonical domain case
v0.5.1: fix small bug related to use of BeautifulSoup
v0.5.0: add normalized URL and normalized title
v0.4.3: more work on title regex bug
v0.4.2: work on title regex bug
v0.4.1: work on deployment
v0.4.0: performance improvements, dependency updates
v0.3.1: update dependencies
v0.3.0: more fault tolerant, faster regexes, track extraction rates, update requirements
v0.2.0: first packaging release for use in other places
v0.1.1: first version for testing with collaborators
