- add trafilatura.utils to SENTRY_LOGGERS_TO_IGNORE
- add ignore_loggers and sentry_ignore_loggers
- Fix CI Bug from out-of-date dependency
- Add option to override canonical_domain
- metadata now includes canonical URL when available
- introduce black preformatter
- add requests_arcana.py for use across Media Cloud stack
- version bump to test automatic releases
- version bump to test automatic releases
- small dependency updates
- changed version management system
- first release packaged with Flit
- v0.12.0: Add new const to centrally store MC User-Agent (`mcmetadata.webpages.MEDIA_CLOUD_USER_AGENT`)
- v0.11.2: Fix title parsing and url normalization edge cases, update requirements
- v0.11.1: Add new `urls.unique_url_hash` helper to centralize logic for generating a unique hash for a URL, also returned from `extract` in case you choose to use it
- v0.11.0: (error release)
- v0.10.0: Support defaults and overrides in `extract`, returning execution time stats, requirements updates, more handling of malformed URLs
- v0.9.5: Updated requirements, update non-news site list, fix failing unit tests, tweak title parsing logic
- v0.9.4: Updated requirements to use faust-cchardet for py >3.9 support
- v0.9.3: Updated content extractor dependencies, added py.typed for typing support
- v0.9.2: fixed a bug related to title regex matching
- v0.9.1: better support for some non-US government domains
- v0.9.0: adds `feeds.normalize_url` helper
- v0.8.2: small fix to url parsing
- v0.8.1: handle IP addresses in canonical_domain helper
- v0.8.0: update dependencies, fix various edge-case bugs
- v0.7.9: fix `include_other_metadata` processing, upgrade underlying libraries to latest, remove leading and trailing whitespace from extracted text
- v0.7.8: add optional `include_other_metadata` arg to extract method, which includes top_image and authors and other less validated metadata in results
- v0.7.7: fix typo
- v0.7.6: fix distribution packaging error
- v0.7.5: add performance monitoring, handle invalid URLs, add a list of high volume non-news domains that might be worth ignoring (based on high volume "noise" domains in our production database)
- v0.7.4: don't treat shortened URLs as homepage ones, also more aggressively strip URL query params
- v0.7.3: tweak title extraction for multipart titles, add is_homepage helper boolean
- v0.7.2: fix extraction argument bug introduced in last release, fix some more test cases
- v0.7.1: fix bug in url normalization, increase robustness in extractor chain
- v0.7.0: fix YouTube url normalization, better Trafilatura defaults, limit to pub dates within 90 days of today, ensure language is 2 letters, content extraction performance improvements, fix some title parsing bugs, add more test cases, add script to compare results to older Media Cloud code (which this stuff is extracted from), resolve language guessing conflicts better, handle text encoding errors
- v0.6.0: prefer language from metadata over guessing, try Trafilatura as first parser, encoding fixes
- v0.5.5: turn off aggressive date finding mode, which was making lots of 1/1 date guesses
- v0.5.4: fix bug in regex that parses og:title properties into titles
- v0.5.3: bug fixes in title normalization
- v0.5.2: more efficient parsing of dates from HTML, remove failing over-specified canonical domain case
- v0.5.1: fix small bug related to use of BeautifulSoup
- v0.5.0: add normalized URL and normalized title
- v0.4.3: more work on title regex bug
- v0.4.2: work on title regex bug
- v0.4.1: work on deployment
- v0.4.0: performance improvements, dependency updates
- v0.3.1: update dependencies
- v0.3.0: more fault tolerant, faster regexes, track extraction rates, update requirements
- v0.2.0: first packaging release for use in other places
- v0.1.1: first version for testing with collaborators