Readability-Extractor

This is a tiny JS command-line wrapper around Mozilla's article-text extraction library, Readability: https://github.com/mozilla/readability.

It's designed to be used as an ArchiveBox extractor.

Install

npm install -g 'git+https://github.com/pirate/readability-extractor'

# which is roughly equivalent to this (the npm route also installs the dependencies listed in package.json):
curl https://raw.githubusercontent.com/pirate/readability-extractor/master/readability-extractor > /usr/local/bin/readability-extractor
chmod +x /usr/local/bin/readability-extractor

Usage

# readability-extractor <input HTML path> <original url?> <suggested encoding?> > <output JSON path>
readability-extractor some_article.html 'https://example.com/original/url/some/article.html' 'UTF-8' > some_article.json

{
    "title": "Title autodetected from article html",
    "byline": "Autodetected author...",
    "excerpt": "Autodetected short description",
    "dir": "ltr",
    "length": 1337,
    "lang": null,
    "charset": "UTF-8",
    "content": "<div id=\"readability-page-1\" class=\"page\">abc some article body text...</div>",
    "textContent": "abc some article body text..."
}
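
To consume this output from a Node script, something like the following sketch may help. It assumes `readability-extractor` is on your `$PATH` and prints the JSON shown above to stdout; the helper names `buildArgs`, `summarize`, and `extract` are hypothetical and not part of this project.

```javascript
const { execFileSync } = require('child_process');

// Build the positional argument list from the usage line above.
// The URL and encoding are optional, but because they are positional,
// an encoding can only be passed when a URL is passed too.
function buildArgs(htmlPath, url, encoding) {
  const args = [htmlPath];
  for (const value of [url, encoding]) {
    if (value === undefined) break;
    args.push(value);
  }
  return args;
}

// Reduce a parsed result to a few fields of interest.
// Fields may be null (as "lang" is in the example above), so default to ''.
function summarize(article) {
  const text = (article.textContent ?? '').trim();
  return {
    title: article.title ?? '',
    byline: article.byline ?? '',
    // Whitespace-separated word count of the plain-text body.
    words: text === '' ? 0 : text.split(/\s+/).length,
  };
}

// Run the CLI and parse its stdout as JSON.
// Assumption: the binary is installed and exits successfully;
// execFileSync throws if it exits with a non-zero status.
function extract(htmlPath, url, encoding) {
  const stdout = execFileSync('readability-extractor', buildArgs(htmlPath, url, encoding), {
    encoding: 'utf8',
    maxBuffer: 64 * 1024 * 1024, // extracted articles can exceed the default buffer
  });
  return JSON.parse(stdout);
}
```

For example, `summarize(extract('some_article.html', 'https://example.com/original/url/some/article.html', 'UTF-8'))` would return the title, byline, and word count for that page.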

ArchiveBox Integration

# You usually don't have to run these commands.
# Readability is on by default, and ArchiveBox will automatically
# find any installed version in your $PATH.

# However, if you explicitly want to turn readability on
# and/or specify a manual path to the binary, you can do this:
archivebox config --set SAVE_READABILITY=True
archivebox config --set READABILITY_BINARY="$(which readability-extractor)"

# test archiving oneshot using only singlefile+readability
archivebox add --extract=singlefile,readability 'https://example.com'
