@adhurjaty
Collaborator

Adds the @extractus/article-extractor library to facilitate article parsing. Functionality remains the same, but we now generate an object that is easier to work with. Example output:

{
  "url": "https://stackoverflow.com/questions/8644428/how-to-highlight-text-using-javascript",
  "title": "How to highlight text using javascript",
  "description": "Can someone help me with a javascript function that can highlight text on a web page.\nAnd the requirement is to - highlight only once, not like highlight all occurrences of the text as we do in cas...",
  "links": [
    "https://stackoverflow.com/questions/8644428/how-to-highlight-text-using-javascript"
  ],
  "image": "https://cdn.sstatic.net/Sites/stackoverflow/Img/apple-touch-icon@2.png?v=73d79a89bded",
  "content": "<div>\n<p>The solutions offered here are quite bad.</p>\n<ol>\n<li>...",
  "author": "",
  "favicon": "https://cdn.sstatic.net/Sites/stackoverflow/Img/favicon.ico?v=ec617d715196",
  "source": "stackoverflow.com",
  "published": "",
  "ttr": 199,
  "type": "website"
}
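
For readers wiring this object into TypeScript code, the shape of the sample above can be sketched as an interface. The field names are taken directly from the example output; the `ParsedArticle` name and the `readingMinutes` helper are hypothetical, and treating `ttr` as a reading time in seconds is an assumption based on the library's README rather than anything stated in this PR.

```typescript
// Hypothetical type mirroring the sample output above; not the library's own type.
interface ParsedArticle {
  url: string;
  title: string;
  description: string;
  links: string[];
  image: string;
  content: string; // HTML of the extracted article body
  author: string; // empty string when the page does not expose one
  favicon: string;
  source: string;
  published: string; // empty string when no publish date was found
  ttr: number; // assumed: estimated time to read, in seconds
  type: string;
}

// Hypothetical helper: converts the assumed seconds value to whole minutes.
function readingMinutes(article: Pick<ParsedArticle, 'ttr'>): number {
  return Math.ceil(article.ttr / 60);
}
```

With the sample above (`ttr` of 199), `readingMinutes` rounds up to the next whole minute.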

Try it here

Webpack changes:

  • switch module to commonjs to allow importing the library
  • add TerserPlugin to remove non-UTF8 characters from the output build
  • split chunks to separate library output from content script output
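
The last two bullets might look roughly like the fragment below. This is a sketch, not the PR's actual config: the `format.ascii_only` Terser option and the `vendor` cache group are assumptions about how the non-UTF8 characters and the chunk split could be handled. The objects are written as plain data so the fragment stands alone; in a real webpack config, `terserOptions` would be passed to `new TerserPlugin({ terserOptions })` and the rest would sit under `optimization`.

```typescript
// Assumed Terser options: ascii_only escapes non-ASCII characters in the
// minified output, so the emitted bundle contains only ASCII bytes.
const terserOptions = {
  format: { ascii_only: true },
};

// Assumed split: anything imported from node_modules goes into a separate
// "vendor" chunk, leaving the content script chunk with the extension's own code.
const optimization = {
  minimize: true,
  // minimizer: [new TerserPlugin({ terserOptions })], // from 'terser-webpack-plugin'
  splitChunks: {
    cacheGroups: {
      vendor: {
        test: /[\\/]node_modules[\\/]/,
        name: 'vendor',
        chunks: 'all' as const,
      },
    },
  },
};
```

Note that Chrome only loads the content script files listed in the manifest, so a separately emitted chunk would most likely need to be listed there as well.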

The build now warns that we exceed the recommended entrypoint asset size. This may be worth addressing down the line.

@adhurjaty adhurjaty marked this pull request as draft November 12, 2024 01:07
Base automatically changed from enable-toggle-transform-button to main November 13, 2024 04:23
Collaborator

@Melvillian Melvillian left a comment

Good idea improving our (read: my crappy) hand-parsing.

I have several questions in my inline comments, mostly about adding code comments: a lot of the changes were unclear to me and need inline comments to explain to future readers why we're doing things a certain way.

More importantly, this did not work for me when I ran npm run build, then "Load Unpacked" in both Brave and Chrome. Does it work locally for you? I tried it on one of the pages given in the extractor demo webpage, specifically: https://edition.cnn.com/2022/04/14/success/savings-mistakes/index.html

console.log(
  `Searching for headline element using these selectors: ${selectors.join(', ')}`,
);
function getElementByXpath(xp: string) {

This is a scary function with a lot of new information for readers (me included!). Can you add some documentation to help the reader quickly understand why it exists and what it does? These are the pages I used for my own learning:

  1. https://developer.mozilla.org/en-US/docs/Web/XPath/Introduction_to_using_XPath_in_JavaScript
  2. https://developer.mozilla.org/en-US/docs/Web/API/Document/evaluate

AFAIK, the reason to use this function is to do the same thing as document.querySelector, so please explain in the comment why this more complicated approach is warranted over just using document.querySelector, which far more people are familiar with.
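
For reference, a documented version of such a helper might look like the sketch below. It assumes the helper wraps `document.evaluate` as described in the MDN pages above; the `doc` parameter and the two `...Like` interfaces are additions made here so the snippet compiles and can be exercised without a browser. The doc comment states one plausible motivation (CSS selectors cannot match on text content), which the PR author would need to confirm.

```typescript
// Minimal structural types standing in for the DOM's Document and XPathResult.
interface XPathResultLike {
  singleNodeValue: unknown;
}
interface XPathDocumentLike {
  evaluate(
    expression: string,
    contextNode: unknown,
    resolver: null,
    type: number,
    result: null,
  ): XPathResultLike;
}

// Numeric value of XPathResult.FIRST_ORDERED_NODE_TYPE in the DOM spec.
const FIRST_ORDERED_NODE_TYPE = 9;

/**
 * Returns the first node, in document order, that matches the XPath
 * expression, or null when nothing matches.
 *
 * Unlike document.querySelector, XPath can filter on an element's text
 * (for example with contains()), which CSS selectors cannot express.
 */
function getElementByXpath(xp: string, doc: XPathDocumentLike): unknown {
  return doc.evaluate(xp, doc, null, FIRST_ORDERED_NODE_TYPE, null).singleNodeValue;
}
```

In the content script the second argument would simply be the page's `document`.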

console.log('Headline element not found on this page.');
console.log('parsedArticle:', parsedArticle);
const headlineElement =
  getElementByXpath(`//h1[contains(., "${parsedArticle.title}")]`) ??

I can't grok why this XPath expression is the way it is. Why the use of contains? Why the ., "..." arguments inside the call to contains? Why is it OK to search only for h1 when before we searched for all the selectors listed here?

Please respond not with a reply to my issue, but rather with an inline comment that answers all these questions, since I reckon future readers of this code will have the same questions as me.
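
For context on the questions above: in XPath, `.` is the context node, and `contains(., "text")` converts that node to its string value (the text of all its descendants, concatenated) and checks whether it includes the given substring. So `//h1[contains(., "...")]` matches any `h1` whose text includes the title. One thing the interpolation in the diff does not handle is a title that itself contains a double quote, since XPath 1.0 string literals have no escape character. The sketch below shows one common workaround using `concat()`; both function names are hypothetical and not part of the PR.

```typescript
// Builds an XPath 1.0 string literal for arbitrary text.
function xpathStringLiteral(text: string): string {
  // No double quotes: a double-quoted literal is safe.
  if (text.indexOf('"') === -1) return `"${text}"`;
  // Double quotes but no single quotes: switch to a single-quoted literal.
  if (text.indexOf("'") === -1) return `'${text}'`;
  // Both kinds: split on double quotes and glue the pieces back together
  // with a single-quoted '"' between them.
  const parts = text.split('"').map((part) => `"${part}"`);
  return `concat(${parts.join(`, '"', `)})`;
}

// Builds the headline lookup from the diff with the title safely quoted.
function headlineXpath(title: string): string {
  return `//h1[contains(., ${xpathStringLiteral(title)})]`;
}
```

For a plain title this produces the same expression as the template string in the diff; only titles containing quotes are treated differently.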

  "compilerOptions": {
    "target": "ES6",
-   "module": "ESNext",
+   "module": "CommonJS",

Why the need to switch to commonjs?

resolve: {
  extensions: ['.ts', '.js'],
},
optimization: {

Why is this custom use of TerserPlugin needed? Starting with webpack v5 (which we use), terser-webpack-plugin is used by default, according to the terser-webpack-plugin README. Please respond with an inline comment so future readers know why we're doing this.

3 participants