Version History

Version 1

V1.4.1

  • add trafilatura.utils to SENTRY_LOGGERS_TO_IGNORE

V1.4.0

  • add ignore_loggers and sentry_ignore_loggers
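As a rough illustration of what ignoring noisy loggers can look like, here is a minimal sketch using only Python's standard `logging` module. The helper name `silence_loggers` and the `NOISY_LOGGERS` list are hypothetical and not taken from this package; the actual signatures of `ignore_loggers` and `sentry_ignore_loggers` are not shown in this changelog and may work differently.

```python
import logging

# Hypothetical list for illustration only; the package's real
# SENTRY_LOGGERS_TO_IGNORE contents may differ.
NOISY_LOGGERS = ["trafilatura.utils"]


def silence_loggers(names, level=logging.CRITICAL):
    """Raise the threshold of each named logger so messages below `level` are dropped."""
    for name in names:
        logging.getLogger(name).setLevel(level)


silence_loggers(NOISY_LOGGERS)
```

With this in place, warnings and errors logged under `trafilatura.utils` are filtered out before they reach any handler, while critical messages still pass through.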

V1.3.1

  • Fix CI bug caused by an out-of-date dependency

V1.3.0

  • Add option to override canonical_domain

V1.2.0

  • metadata now includes the canonical URL when available
  • introduce black preformatter

V1.1.0

  • add requests_arcana.py for use across Media Cloud stack

V1.0.2

  • version bump to test automatic releases

V1.0.1

  • version bump to test automatic releases

V1.0.0

  • small dependency updates
  • changed version management system
  • first release packaged with Flit

Version 0

  • v0.12.0: Add new const to centrally store MC User-Agent (mcmetadata.webpages.MEDIA_CLOUD_USER_AGENT)
  • v0.11.2: Fix title parsing and url normalization edge cases, update requirements
  • v0.11.1: Add new urls.unique_url_hash helper to centralize logic for generating a unique hash for a URL; the hash is also returned from extract in case you choose to use it
  • v0.11.0: (error release)
  • v0.10.0: Support defaults and overrides in extract, returning execution time stats, requirements updates, more handling of malformed URLs
  • v0.9.5: Updated requirements, update non-news site list, fix failing unit tests, tweak title parsing logic
  • v0.9.4: Updated requirements to use faust-cchardet for Python >3.9 support
  • v0.9.3: Updated content extractor dependencies, added py.typing for typing support
  • v0.9.2: fixed a bug related to title regex matching
  • v0.9.1: better support for some non-US government domains
  • v0.9.0: adds feeds.normalize_url helper
  • v0.8.2: small fix to url parsing
  • v0.8.1: handle IP addresses in canonical_domain helper
  • v0.8.0: update dependencies, fix various edge-case bugs
  • v0.7.9: fix include_other_metadata processing, upgrade underlying libraries to latest, remove leading and trailing whitespace from extracted text
  • v0.7.8: add optional include_other_metadata arg to extract method, which includes top_image and authors and other less validated metadata in results
  • v0.7.7: fix typo
  • v0.7.6: fix distribution packaging error
  • v0.7.5: add performance monitoring, handle invalid URLs, add a list of high volume non-news domains that might be worth ignoring (based on high volume "noise" domains in our production database)
  • v0.7.4: don't treat shortened URLs as homepage ones, also more aggressively strip URL query params
  • v0.7.3: tweak title extraction for multipart titles, add is_homepage helper boolean
  • v0.7.2: fix extraction argument bug introduced in last release, fix some more test cases
  • v0.7.1: fix bug in url normalization, increase robustness in extractor chain
  • v0.7.0: fix YouTube url normalization, better Trafilatura defaults, limit to pub dates within 90 days of today, ensure language is 2 letters, content extraction performance improvements, fix some title parsing bugs, add more test cases, add script to compare results to older Media Cloud code (which this stuff is extracted from), resolve language guessing conflicts better, handle text encoding errors
  • v0.6.0: prefer language from metadata over guessing, try Trafilatura as first parser, encoding fixes
  • v0.5.5: turn off aggressive date finding mode, which was making lots of 1/1 date guesses
  • v0.5.4: fix bug in regex that parses og:title properties into titles
  • v0.5.3: bug fixes in title normalization
  • v0.5.2: more efficient parsing of dates from HTML, remove failing over-specified canonical domain case
  • v0.5.1: fix small bug related to use of BeautifulSoup
  • v0.5.0: add normalized URL and normalized title
  • v0.4.3: more work on title regex bug
  • v0.4.2: work on title regex bug
  • v0.4.1: work on deployment
  • v0.4.0: performance improvements, dependency updates
  • v0.3.1: update dependencies
  • v0.3.0: more fault tolerant, faster regexes, track extraction rates, update requirements
  • v0.2.0: first packaging release for use in other places
  • v0.1.1: first version for testing with collaborators
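Several entries above concern URL normalization and, in v0.11.1, a unique hash for a URL. The general idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the package's implementation: the function names `simple_normalize` and `url_hash` are hypothetical, the normalization rules shown are far less thorough than what the entries above describe, and the choice of SHA-256 is an assumption.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def simple_normalize(url: str) -> str:
    """Lowercase the scheme and host and drop the fragment (illustrative only)."""
    parts = urlsplit(url.strip())
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, "")
    )


def url_hash(url: str) -> str:
    """Return a hex SHA-256 digest of the simplified URL (hash choice is an assumption)."""
    return hashlib.sha256(simple_normalize(url).encode("utf-8")).hexdigest()
```

Hashing the normalized form rather than the raw string means that trivially different spellings of the same address, such as `HTTP://Example.com/a#top` and `http://example.com/a`, map to the same identifier.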