-
Notifications
You must be signed in to change notification settings - Fork 6
Add article extractor library #11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Melvillian
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea improving our (read: my crappy) hand-parsing.
I have several questions in my inline comments, mostly around improving commenting because I felt a lot of the changes were unclear to me, and needed inline comments to help explain to future readers why we're doing things a certain way.
More importantly, this did not work for me when I ran npm run build, then "Load Unpacked" in both Brave and Chrome. Does it work locally for you? I tried it on one of the pages given in the extractor demo webpage, specifically: https://edition.cnn.com/2022/04/14/success/savings-mistakes/index.html
| console.log( | ||
| `Searching for headline element using these selectors: ${selectors.join(', ')}`, | ||
| ); | ||
| function getElementByXpath(xp: string) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a scary function with lots of new information to readers (me included!) Can you add some documentation to help the reader quickly understand why this exists and what it's doing? These were the websites I used for my own learning:
- https://developer.mozilla.org/en-US/docs/Web/XPath/Introduction_to_using_XPath_in_JavaScript
- https://developer.mozilla.org/en-US/docs/Web/API/Document/evaluate
AFAIK, the reason do use this function is to do the same thing as document.querySelector, so please explain in the comment why this more complicated approach is warranted over just using document.querySelector, which far more are familiar with.
| console.log('Headline element not found on this page.'); | ||
| console.log('parsedArticle:', parsedArticle); | ||
| const headlineElement = | ||
| getElementByXpath(`//h1[contains(., "${parsedArticle.title}")]`) ?? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't grok why this xPathExression is the way it is. Why the use of contains? Why the ., "..." characters inside of the call to contains? Why is it ok to just search for h1 where before searched for all the selectors listed here?
Please respond not with a reply to my issue, but rather with an inline comment that answers all these questions, since I reckon future readers of this code will have the same questions as me.
| "compilerOptions": { | ||
| "target": "ES6", | ||
| "module": "ESNext", | ||
| "module": "CommonJS", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the need to switch to commonjs?
| resolve: { | ||
| extensions: ['.ts', '.js'], | ||
| }, | ||
| optimization: { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this custom use of TerserPlugin needed, since starting with webpack v5 (which we use) terser-webpack-plugin is used by default (according to the terser-webpack-plugin README). Please respond with an inline comment so future readers know why we're doing this.
Adds the
@extractus/article-extractorlibrary to facilitate article parsing. Functionality remains the same, but we now generate an object that is easier to work with. Example output:Try it here
Webpack changes:
TerserPluginto remove non-UTF8 characters from the output buildWe now get a warning on build that we are exceeding the recommended entrypoint asset size. Something maybe worth addressing down the line.