Add article extractor library #11

adhurjaty · 2024-11-08T23:09:32Z

Adds the @extractus/article-extractor library to facilitate article parsing. Functionality remains the same, but we now generate an object that is easier to work with. Example output:

{
  "url": "https://stackoverflow.com/questions/8644428/how-to-highlight-text-using-javascript",
  "title": "How to highlight text using javascript",
  "description": "Can someone help me with a javascript function that can highlight text on a web page.\nAnd the requirement is to - highlight only once, not like highlight all occurrences of the text as we do in cas...",
  "links": [
    "https://stackoverflow.com/questions/8644428/how-to-highlight-text-using-javascript"
  ],
  "image": "https://cdn.sstatic.net/Sites/stackoverflow/Img/apple-touch-icon@2.png?v=73d79a89bded",
  "content": "<div>\n<p>The solutions offered here are quite bad.</p>\n<ol>\n<li>...",
  "author": "",
  "favicon": "https://cdn.sstatic.net/Sites/stackoverflow/Img/favicon.ico?v=ec617d715196",
  "source": "stackoverflow.com",
  "published": "",
  "ttr": 199,
  "type": "website"
}
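
To illustrate why this shape is easier to work with, here is a minimal sketch of typed access to it. The `ParsedArticle` interface and `describeArticle` helper are hypothetical names, the field list is taken only from the example output above, and reading `ttr` as a time-to-read in seconds is an assumption.

```typescript
// Hypothetical type mirroring the fields in the example output above.
interface ParsedArticle {
  url: string;
  title: string;
  description: string;
  links: string[];
  image: string;
  content: string;
  author: string;
  favicon: string;
  source: string;
  published: string;
  ttr: number; // assumed: estimated time to read, in seconds
  type: string;
}

// Builds a one-line summary from the parsed fields.
function describeArticle(article: ParsedArticle): string {
  const minutes = Math.ceil(article.ttr / 60);
  // author can be an empty string, as in the example output
  const byline = article.author !== '' ? ` by ${article.author}` : '';
  return `${article.title}${byline} (${article.source}, ~${minutes} min read)`;
}
```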

Try it here

Webpack changes:

switch module to commonjs to allow importing the library
add TerserPlugin to remove non-UTF8 characters from the output build
split chunks to separate library output from content script output

The build now warns that we exceed the recommended entrypoint asset size, which may be worth addressing down the line.

Melvillian

Good idea improving our (read: my crappy) hand-parsing.

I have several questions in my inline comments, mostly requests for more code comments. Many of the changes were unclear to me, and inline comments would help explain to future readers why we're doing things a certain way.

More importantly, this did not work for me when I ran npm run build and then used "Load Unpacked" in both Brave and Chrome. Does it work locally for you? I tried it on one of the pages given in the extractor demo webpage, specifically: https://edition.cnn.com/2022/04/14/success/savings-mistakes/index.html

Melvillian · 2024-11-15T12:07:59Z

ui/browser-extension/src/contentScript.ts

-  console.log(
-    `Searching for headline element using these selectors: ${selectors.join(', ')}`,
-  );
+function getElementByXpath(xp: string) {


This is a scary function with lots of new information for readers (me included!). Can you add some documentation to help the reader quickly understand why this exists and what it's doing? These were the websites I used for my own learning:

https://developer.mozilla.org/en-US/docs/Web/XPath/Introduction_to_using_XPath_in_JavaScript

https://developer.mozilla.org/en-US/docs/Web/API/Document/evaluate

AFAIK, the reason to use this function is to do the same thing as document.querySelector, so please explain in the comment why this more complicated approach is warranted over just using document.querySelector, which far more people are familiar with.
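
For reference, a minimal sketch of what a documented version might look like, assuming the function wraps document.evaluate as described on the MDN pages above. The body is an assumption, since the diff shows only the signature. The `doc` parameter and the `XPathDocument` interface are hypothetical additions so the sketch can also run outside a browser.

```typescript
// Numeric value of XPathResult.FIRST_ORDERED_NODE_TYPE in the DOM spec,
// written as a literal so this sketch does not need the browser global.
const FIRST_ORDERED_NODE_TYPE = 9;

// The slice of the Document API this sketch relies on (hypothetical name).
interface XPathDocument {
  evaluate(
    expression: string,
    contextNode: unknown,
    resolver: null,
    type: number,
    result: null,
  ): { singleNodeValue: object | null };
}

/**
 * Returns the first node, in document order, that matches the XPath
 * expression `xp`, or null when nothing matches.
 *
 * CSS selectors (and so document.querySelector) cannot match on an
 * element's text, while XPath can, e.g. //h1[contains(., "some title")].
 * That is one plausible reason to prefer XPath here.
 */
function getElementByXpath(xp: string, doc: XPathDocument): object | null {
  const result = doc.evaluate(xp, doc, null, FIRST_ORDERED_NODE_TYPE, null);
  return result.singleNodeValue;
}
```

In a content script the call would presumably be getElementByXpath('//h1', document), with the page's own document passed as `doc`.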

Melvillian · 2024-11-15T12:10:33Z

ui/browser-extension/src/contentScript.ts

-  console.log('Headline element not found on this page.');
+  console.log('parsedArticle:', parsedArticle);
+  const headlineElement =
+    getElementByXpath(`//h1[contains(., "${parsedArticle.title}")]`) ??


I can't grok why this XPath expression is the way it is. Why the use of contains? Why the ., "..." characters inside of the call to contains? Why is it ok to just search for h1 when before we searched for all the selectors listed here?

Please respond not with a reply to my issue, but rather with an inline comment that answers all these questions, since I reckon future readers of this code will have the same questions as me.
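
As background for that inline comment: in XPath 1.0, contains(a, b) is true when the string value of a contains b as a substring, and . is the context node, here each h1 being tested. The sketch below builds the same expression as the diff but quotes the title safely, since interpolating a title that itself contains a double quote would produce an invalid expression. `toXPathLiteral` and `headlineXPath` are hypothetical helper names, not part of the PR.

```typescript
// Wraps arbitrary text as an XPath 1.0 string literal.
function toXPathLiteral(text: string): string {
  if (text.indexOf('"') === -1) return `"${text}"`;
  if (text.indexOf("'") === -1) return `'${text}'`;
  // Both quote kinds are present. XPath 1.0 has no escape character,
  // so split on double quotes and rejoin the pieces with concat().
  const parts = text.split('"').map((part) => `"${part}"`);
  return `concat(${parts.join(`, '"', `)})`;
}

// Matches any h1 whose text (the string value of ".") contains the title.
function headlineXPath(title: string): string {
  return `//h1[contains(., ${toXPathLiteral(title)})]`;
}
```

With the title from the example output in the PR description, headlineXPath returns //h1[contains(., "How to highlight text using javascript")], the same expression the diff builds by plain interpolation.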

Melvillian · 2024-11-15T18:27:16Z

ui/browser-extension/tsconfig.json

  "compilerOptions": {
    "target": "ES6",
-    "module": "ESNext",
+    "module": "CommonJS",


Why the need to switch to commonjs?

Melvillian · 2024-11-15T18:28:14Z

ui/browser-extension/webpack.config.js

  resolve: {
    extensions: ['.ts', '.js'],
  },
+  optimization: {


Why is this custom use of TerserPlugin needed? Starting with webpack v5 (which we use), terser-webpack-plugin is used by default (according to the terser-webpack-plugin README). Please respond with an inline comment so future readers know why we're doing this.

Add article extractor library

44a45c0

adhurjaty marked this pull request as draft November 12, 2024 01:07

Base automatically changed from enable-toggle-transform-button to main November 13, 2024 04:23

Melvillian requested changes Nov 15, 2024


3 participants
