User-Agent #63

roniemartinez · 2022-03-07T23:28:49Z

Set a dedicated Dude User-Agent instead of relying on each parser backend's default (e.g., pydude/{version} (+https://github.com/roniemartinez/dude))
Add an option to override the User-Agent
For Playwright, Pyppeteer, and Selenium, the User-Agent should keep the original browser string (Chromium, Firefox, WebKit) with the Dude User-Agent inserted into it
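
A minimal sketch of the three points above, assuming the Dude token is simply appended to the browser's own string (the issue only says "inserted", so the exact position is a guess). The names `dude_token` and `build_user_agent` are hypothetical and not part of the code base.

```python
from typing import Optional

DUDE_URL = "https://github.com/roniemartinez/dude"


def dude_token(version: str) -> str:
    """Return the token format proposed in the issue."""
    return f"pydude/{version} (+{DUDE_URL})"


def build_user_agent(
    version: str,
    browser_user_agent: Optional[str] = None,
    override: Optional[str] = None,
) -> str:
    """Pick the effective User-Agent (hypothetical helper).

    - An explicit override always wins.
    - HTTP backends (no browser string) get the bare Dude token.
    - Browser backends keep their original string, with the Dude token
      appended (assumption: appended rather than spliced in elsewhere).
    """
    if override:
        return override
    token = dude_token(version)
    if not browser_user_agent:
        return token
    return f"{browser_user_agent} {token}"
```

For example, `build_user_agent("0.1.0")` would give the bare token for the httpx-based backends, while passing the browser's reported string as `browser_user_agent` keeps the Chromium, Firefox, or WebKit part intact.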


ghost · 2023-06-16T10:30:36Z

@roniemartinez I would like to start with this issue. But I could not completely understand the structure of the code base. Is there any document or architecture diagram to start with?

roniemartinez · 2023-06-16T10:49:29Z

@FluffyDietEngine All of the scrapers are using ScraperAbstract https://github.com/roniemartinez/dude/blob/master/dude/base.py#L433

For this feature, you also need to modify the following:

Playwright's launch() (dude/dude/playwright_scraper.py, line 170 in 6d846aa):

browser = p[browser_type].launch(headless=headless, proxy=proxy, **launch_kwargs)

Selenium's WebDriver options (dude/dude/optional/selenium_scraper.py, line 242 in 6d846aa):

def _get_driver(self, browser_type: str, headless: bool) -> WebDriver:

The httpx Client (the lxml, parsel, and bs4 scrapers each create theirs in a different place; dude/dude/optional/lxml_scraper.py, line 69 in 6d846aa):

with httpx.Client(
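
As a rough sketch of how one User-Agent string could reach each of those three places: the helper names below are hypothetical, and only the keyword names and flags they produce come from the libraries themselves (`headers=` on `httpx.Client`, `user_agent=` on Playwright's `browser.new_context()`, the Chromium `--user-agent=` switch, and Firefox's `general.useragent.override` preference). How these would be wired into the existing scraper classes is left open.

```python
from typing import Any, Dict, Tuple


def httpx_client_kwargs(user_agent: str) -> Dict[str, Any]:
    """Keyword arguments for httpx.Client(...), e.g.
    httpx.Client(**httpx_client_kwargs(ua))."""
    return {"headers": {"User-Agent": user_agent}}


def playwright_context_kwargs(user_agent: str) -> Dict[str, Any]:
    """Keyword arguments for browser.new_context(...) in Playwright.
    Note that the User-Agent is set on the context, not on launch()."""
    return {"user_agent": user_agent}


def chromium_user_agent_argument(user_agent: str) -> str:
    """Command-line switch for Chromium-based Selenium drivers, e.g.
    options.add_argument(chromium_user_agent_argument(ua))."""
    return f"--user-agent={user_agent}"


def firefox_user_agent_preference(user_agent: str) -> Tuple[str, str]:
    """Preference name and value for Firefox in Selenium, e.g.
    options.set_preference(*firefox_user_agent_preference(ua))."""
    return ("general.useragent.override", user_agent)
```

Keeping these as plain data builders means the same string can be computed once and handed to whichever backend is active.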

ghost · 2023-06-17T13:50:34Z

Will start working on it. Please assign the task to me. Will reach out if any help needed.

ghost · 2023-06-22T18:11:52Z

@roniemartinez I am going through the code. Firstly, hats off: the thought process behind the ScraperAbstract class is inspiring. While working on this issue, I came across a few questions and thoughts, and I would like your opinion on them.

1. Can I make the user agent a direct attribute of the ScraperAbstract class?
2. I can see that the user agent is used to check whether it is allowed for the given domain. But if ignore_robots_txt is enabled, where should we apply the user agent?
3. I am thinking of using the fake-useragent library to implement user agents. Any thoughts?
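
For context on the robots.txt question, here is a standalone sketch of a User-Agent-keyed robots.txt check using only the standard library's `urllib.robotparser`. It is not how Dude implements the check; `is_allowed`, `should_fetch`, and the sample rules are made up for illustration. The point is that the User-Agent matters in two separate places: the robots.txt lookup (which can be skipped) and the request header (which is sent regardless).

```python
from urllib.robotparser import RobotFileParser

# Sample rules, invented for this example.
SAMPLE_ROBOTS_TXT = """\
User-agent: pydude
Disallow: /private/

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt rules for the given User-Agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def should_fetch(robots_txt: str, user_agent: str, url: str,
                 ignore_robots_txt: bool = False) -> bool:
    """Skipping robots.txt only bypasses the permission check; the same
    User-Agent would still be sent with the request itself."""
    if ignore_robots_txt:
        return True
    return is_allowed(robots_txt, user_agent, url)
```

`RobotFileParser` matches on the product name before the first `/`, so a string such as `pydude/0.1.0 (+https://github.com/roniemartinez/dude)` is matched against the `pydude` rules.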

roniemartinez · 2023-06-22T21:09:32Z

Hi @FluffyDietEngine thank you for working on this feature.

Can I have the user agent as a direct attribute to the ScraperAbstract class?

I am also leaning towards this, since there should be only one default user agent, though it can of course be overridden by the derived classes. However, if a user wants to replace it with a custom user agent, that should be possible from the terminal args or from the run() function.

I am able to find that the useragent is being used to validate whether it is allowed for the given domain. But what if ignore_robots_txt has been enabled, Where should we implement the user-agents then?

I think the answer to the first question still holds in this scenario.

I am thinking of using fake-useragent library for the implementation of user agents. Any thoughts?

It's a nice suggestion, but I think it might bloat the library if it were a built-in feature. I believe it can be used directly in user code by feeding the generated user agent to run(). An example and some additional documentation might be enough to guide users.
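
The design discussed in this reply could look roughly like the sketch below. The class names are hypothetical stand-ins (not the real ScraperAbstract or its subclasses), the version number and browser string are placeholders, and the `user_agent` parameter on `run()` is the proposed addition, not something that exists yet.

```python
from typing import Optional


class ScraperSketch:
    """Hypothetical stand-in for ScraperAbstract."""

    # Single default shared by all backends (placeholder version number).
    user_agent: str = "pydude/0.0.0 (+https://github.com/roniemartinez/dude)"

    def run(self, user_agent: Optional[str] = None) -> str:
        # A value passed to run() (or parsed from terminal args) takes
        # precedence over the class-level default.
        effective = user_agent or self.user_agent
        # A real scraper would start fetching here; the sketch only
        # returns the string it would use.
        return effective


class BrowserScraperSketch(ScraperSketch):
    """Hypothetical derived class that overrides the default."""

    # Placeholder browser string followed by the Dude token.
    user_agent = "Mozilla/5.0 (X11; Linux x86_64) " + ScraperSketch.user_agent
```

Under this shape, a generated string (for instance one produced by fake-useragent in the caller's own code) would simply be passed as `run(user_agent=...)`, which keeps that dependency out of the library.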

ghost · 2023-10-28T15:07:56Z

@roniemartinez Apologies for the delayed response; I have been busy with work. I would like to resume here, but I see quite a lot of changes and I am a bit lost. Should I make the changes in the run method of the Scraper class at https://github.com/roniemartinez/dude/blob/master/dude/scraper.py ?

roniemartinez · 2023-10-28T21:40:21Z

@FluffyDietEngine No worries, I have also been busy. Yes, that should be the right place, though you will also have to duplicate this in the derived classes. Thanks for looking into it.

roniemartinez added the "enhancement" (New feature or request) and "help wanted" (Extra attention is needed) labels Mar 7, 2022

roniemartinez assigned ghost Jun 18, 2023
