Study: ML-based scrapers #70

Open
roniemartinez opened this issue Mar 12, 2022 · 6 comments
Labels
enhancement (New feature or request), good first issue (Good for newcomers), help wanted (Extra attention is needed)

Comments

@roniemartinez
Owner

roniemartinez commented Mar 12, 2022

Possible format:

@select(sample="path/to/training/data")
def handler(result):
    return {"data": result}
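One way the proposed format could be wired up, as a minimal sketch only. This is not Dude's existing `select` implementation: the `HANDLERS` registry and the registration logic are hypothetical, and the `sample` argument is just the path from the proposal above. How a backend would actually train on that data is left open.

```python
from typing import Callable, Dict

# Hypothetical registry: maps a training-data path to the handler that
# should receive whatever a trained backend extracts.
HANDLERS: Dict[str, Callable] = {}


def select(sample: str) -> Callable:
    """Hypothetical decorator that registers a handler for a training-data path."""

    def wrapper(func: Callable) -> Callable:
        HANDLERS[sample] = func
        return func  # return the function unchanged so it stays callable

    return wrapper


@select(sample="path/to/training/data")
def handler(result):
    return {"data": result}
```

A backend could then look up `HANDLERS["path/to/training/data"]` and call it with each extracted result.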

Potential backends:

roniemartinez added the enhancement, help wanted, and good first issue labels Mar 12, 2022
@daniel7an

Hey, it's a good idea to use mlscraper as a backend. But first of all, we need data (inputs and outputs).

@roniemartinez
Owner Author

@daniel7an

Yes, I can see potential in this one.

@daniel7an

daniel7an commented Mar 18, 2022

@roniemartinez

Autoscraper is another one that would be great to have in Dude. It learns the scraping rules and returns similar elements. It needs only a few examples and isn't as complicated as mlscraper.

Input:
wanted_list = ["What are metaclasses in Python?"]

Output:
[
    'How do I merge two dictionaries in a single expression in Python (taking union of dictionaries)?',
    'How to call an external command?',
    'What are metaclasses in Python?',
    'Does Python have a ternary conditional operator?',
    'How do you remove duplicates from a list whilst preserving order?',
    'Convert bytes to a string',
    'How to get line count of a large file cheaply in Python?',
    "Does Python have a string 'contains' substring method?",
    'Why is “1000000000000000 in range(1000000000000001)” so fast in Python 3?'
]

Any ideas to add this one to Dude? Should I open a new issue for this?

@roniemartinez
Owner Author

@daniel7an

The thing is, I've been reading the source code of Autoscraper, and it is not actually using machine learning or AI. It is just using difflib.SequenceMatcher. The project's claims that it runs on ML or AI are incorrect.

https://github.com/alirezamika/autoscraper/blob/973ba6abed840d16907a556bc0192e2bf4806c6d/autoscraper/utils.py#L42-L66

Please correct me if I am wrong. I cannot categorize it as ML, but it certainly learns by saving rules.
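To illustrate the kind of matching difflib.SequenceMatcher provides, here is a minimal sketch. It is not Autoscraper's actual code: the helper names `text_match` and `find_similar`, the sample titles, and the 0.8 threshold are assumptions chosen for illustration. It only shows that a similarity ratio, not a trained model, is enough to pick out near-identical strings.

```python
from difflib import SequenceMatcher
from typing import List


def text_match(a: str, b: str, ratio_limit: float = 1.0) -> bool:
    """Return True when the similarity ratio of a and b reaches ratio_limit."""
    return SequenceMatcher(None, a, b).ratio() >= ratio_limit


def find_similar(wanted: str, candidates: List[str], ratio_limit: float = 0.8) -> List[str]:
    """Keep the candidates that are close enough to the wanted text."""
    return [c for c in candidates if text_match(wanted, c, ratio_limit)]


# Illustrative data: one exact match, one near match, one unrelated title.
titles = [
    "What are metaclasses in Python?",
    "What are metaclasses in Python 3?",
    "Convert bytes to a string",
]
matches = find_similar("What are metaclasses in Python?", titles)
```

With these inputs, `matches` keeps the exact title and the near-identical variant and drops the unrelated one.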

@roniemartinez
Owner Author

@daniel7an

Any ideas to add this one to Dude? Should I open a new issue for this?

Though it seems Autoscraper does not fall into this category, I believe it is a very powerful tool for web scraping and I'd love to include it. Please open a separate ticket.

@daniel7an

Done ✅
