User-Agent #63

Open
roniemartinez opened this issue Mar 7, 2022 · 7 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@roniemartinez
Owner

roniemartinez commented Mar 7, 2022

  • Set a Dude-specific User-Agent instead of relying on each parser backend's default value (e.g., pydude/{version} (+https://github.com/roniemartinez/dude))
  • Add an option to override the User-Agent
  • For Playwright, Pyppeteer, and Selenium, the User-Agent should keep the browser's original string (Chromium, Firefox, WebKit), with the Dude User-Agent inserted into it
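
The three points above can be sketched as follows. This is only an illustration under stated assumptions: the function names and the placeholder version are hypothetical and not part of Dude, and appending the token to the end of the browser string is just one possible reading of "inserted into the string".

```python
DUDE_VERSION = "0.0.0"  # placeholder; a real build would read the package version
DUDE_URL = "https://github.com/roniemartinez/dude"


def dude_product_token(version: str = DUDE_VERSION) -> str:
    """Build the token format proposed in the issue: pydude/{version} (+URL)."""
    return f"pydude/{version} (+{DUDE_URL})"


def insert_into_browser_user_agent(browser_user_agent: str, token: str) -> str:
    """Keep the browser's original User-Agent and add the Dude token.

    Assumption: the token is appended at the end; the issue does not say
    where in the string it should go.
    """
    return f"{browser_user_agent} {token}"


if __name__ == "__main__":
    chromium = (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
    )
    print(dude_product_token("1.2.3"))
    print(insert_into_browser_user_agent(chromium, dude_product_token("1.2.3")))
```

An override option would then simply replace the result of these helpers with a user-supplied string.
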
@roniemartinez roniemartinez added enhancement New feature or request help wanted Extra attention is needed labels Mar 7, 2022
@ghost

ghost commented Jun 16, 2023

@roniemartinez I would like to start on this issue, but I could not completely understand the structure of the code base. Is there any documentation or an architecture diagram to start with?

@roniemartinez
Owner Author

@FluffyDietEngine All of the scrapers use ScraperAbstract: https://github.com/roniemartinez/dude/blob/master/dude/base.py#L433

For this feature, you also need to modify the following:

  1. Playwright's launch():
    browser = p[browser_type].launch(headless=headless, proxy=proxy, **launch_kwargs)
  2. Selenium's Webdriver options:
    def _get_driver(self, browser_type: str, headless: bool) -> WebDriver:
  3. httpx Client (lxml, parsel and bs4 are in different places):
    with httpx.Client(
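
As a rough sketch of what each of the three places above would need, the snippet below only builds the settings as plain data, so it runs without httpx, Playwright, or Selenium installed. The helper names are hypothetical. The comments name the library calls the values are meant for, to the best of my knowledge: httpx accepts default headers on the client, Playwright takes `user_agent` on the browser context rather than on `launch()`, and Selenium passes it through the browser options.

```python
from typing import Any, Dict, List

# Placeholder value; the version number is not taken from the project.
USER_AGENT = "pydude/0.0.0 (+https://github.com/roniemartinez/dude)"


def httpx_client_kwargs(user_agent: str) -> Dict[str, Any]:
    """Keyword arguments intended for httpx.Client(**kwargs)."""
    return {"headers": {"User-Agent": user_agent}}


def playwright_context_kwargs(user_agent: str) -> Dict[str, Any]:
    """Keyword arguments intended for browser.new_context(**kwargs)."""
    return {"user_agent": user_agent}


def selenium_chromium_arguments(user_agent: str) -> List[str]:
    """Arguments intended for options.add_argument(...) on Chromium-based drivers."""
    return [f"--user-agent={user_agent}"]


def selenium_firefox_preferences(user_agent: str) -> Dict[str, str]:
    """Preferences intended for options.set_preference(name, value) on Firefox."""
    return {"general.useragent.override": user_agent}


if __name__ == "__main__":
    print(httpx_client_kwargs(USER_AGENT))
    print(playwright_context_kwargs(USER_AGENT))
    print(selenium_chromium_arguments(USER_AGENT))
    print(selenium_firefox_preferences(USER_AGENT))
```
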

@ghost

ghost commented Jun 17, 2023

I will start working on it. Please assign the task to me. I will reach out if I need any help.

@roniemartinez roniemartinez assigned ghost Jun 18, 2023
@ghost

ghost commented Jun 22, 2023

@roniemartinez I am going through the code. Firstly, hats off: the thought process behind the ScraperAbstract class is inspiring. While working on this issue, I came across a few questions and thoughts, and I would like your opinion on them.

  1. Can I make the user agent a direct attribute of the ScraperAbstract class?
  2. I can see that the user agent is used to check whether it is allowed for the given domain. But if ignore_robots_txt is enabled, where should we apply the user agent?
  3. I am thinking of using the fake-useragent library to implement user agents. Any thoughts?
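
To make the second question concrete, here is a minimal sketch of a robots.txt check that depends on the user agent, using Python's standard `urllib.robotparser` rather than Dude's own code. The robots.txt content, the URLs, and the `should_fetch` helper are made up for illustration. When `ignore_robots_txt` is set, only the permission check is skipped; the same user agent would still be sent with the request.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: "pydude" may crawl everything except /private/,
# all other agents are disallowed entirely.
ROBOTS_TXT = """\
User-agent: pydude
Disallow: /private/

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Ask the standard-library parser whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def should_fetch(
    robots_txt: str, user_agent: str, url: str, ignore_robots_txt: bool = False
) -> bool:
    """Hypothetical helper: skip the check when robots.txt is ignored."""
    if ignore_robots_txt:
        return True
    return is_allowed(robots_txt, user_agent, url)


if __name__ == "__main__":
    agent = "pydude/0.0.0 (+https://github.com/roniemartinez/dude)"
    print(should_fetch(ROBOTS_TXT, agent, "https://example.com/public/page"))
    print(should_fetch(ROBOTS_TXT, agent, "https://example.com/private/page"))
    print(should_fetch(ROBOTS_TXT, agent, "https://example.com/private/page", True))
```
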

@roniemartinez
Owner Author

Hi @FluffyDietEngine thank you for working on this feature.

  1. Can I have the user agent as a direct attribute to the ScraperAbstract class?

I am also leaning towards this, since there should be only one default user agent, though derived classes can of course override it. However, if a user wants to replace it with a custom user agent, it should be replaceable from the terminal arguments or from the run() function.

  2. I am able to find that the useragent is being used to validate whether it is allowed for the given domain. But what if ignore_robots_txt has been enabled, Where should we implement the user-agents then?

I think the answer from 1 is still true in this scenario.

  3. I am thinking of using fake-useragent library for the implementation of user agents. Any thoughts?

It's a nice suggestion, but I think it might bloat the library if it were a built-in feature. I believe it can be used directly in user code by feeding the generated user agent to run(). An example and additional documentation might be enough to guide users.
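
The design discussed in these answers could look roughly like the sketch below: one default on the base class, an optional replacement in a derived class, and a per-call override. The class names and the `user_agent` parameter of `run()` are hypothetical, since this feature does not exist in Dude yet. A string produced by an external generator such as fake-useragent could be passed in the same way as the custom value shown here.

```python
from typing import Iterable, Optional


class ScraperSketch:
    """Hypothetical stand-in for ScraperAbstract; not Dude's actual class."""

    # Single default shared by all backends (the version is a placeholder).
    user_agent: str = "pydude/0.0.0 (+https://github.com/roniemartinez/dude)"

    def resolve_user_agent(self, override: Optional[str] = None) -> str:
        """Prefer the caller's value, otherwise fall back to the class default."""
        return override or self.user_agent

    def run(self, urls: Iterable[str], user_agent: Optional[str] = None) -> str:
        # A real run() would fetch and parse the URLs; this sketch only
        # reports which User-Agent would be sent.
        return self.resolve_user_agent(user_agent)


class BrowserScraperSketch(ScraperSketch):
    """A derived class can replace the default, as a browser backend might."""

    user_agent = (
        "Mozilla/5.0 (X11; Linux x86_64) "
        "pydude/0.0.0 (+https://github.com/roniemartinez/dude)"
    )


if __name__ == "__main__":
    urls = ["https://example.com"]
    print(ScraperSketch().run(urls))
    print(BrowserScraperSketch().run(urls))
    print(ScraperSketch().run(urls, user_agent="custom-agent/1.0"))
```
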

@ghost

ghost commented Oct 28, 2023

@roniemartinez Apologies for the delayed response; I have been busy with work. I would like to resume here, but I see quite a few changes and am somewhat lost. Should I make the changes in the run method of the Scraper class at https://github.com/roniemartinez/dude/blob/master/dude/scraper.py ?

@roniemartinez
Owner Author

@FluffyDietEngine No worries, I have also been busy. Yes, that should be the right place, though you will also have to duplicate the change in the derived classes. Thanks for looking into it.
