
Advanced High Speed Universal Scrapper

The currently available program is constant_rate_scrapper.py.

  • Multi-threaded Firefox (geckodriver) workers with a queueing mechanism
  • Sends requests at a constant rate, which helps avoid rate limits (see the sketch after this list)
  • Rate-limit detection and pausing mechanism
  • Uses templates from the extractors folder (Yahoo Finance as an example)
  • Logs both successful and failed articles, automatically resumes progress on restart, simple CSV storage
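The sketch below illustrates the combination of these ideas: a thread pool fed from a queue at a constant rate, a pause on HTTP 429, and a CSV log that lets a restart skip finished URLs. It is a minimal sketch, not the repo's code: the real scrapper drives Firefox through geckodriver, while this version uses plain requests for brevity, and names such as PAUSE_ON_RATE_LIMIT and progress.csv are illustrative.

```python
# Minimal sketch of the constant-rate / pause-on-429 / CSV-resume pattern.
import csv
import queue
import threading
import time

import requests

REQUESTS_PER_SECOND = 1.0     # feed rate; tune to the target site's limits
PAUSE_ON_RATE_LIMIT = 60      # seconds to wait after an HTTP 429

url_queue: "queue.Queue[str]" = queue.Queue()
csv_lock = threading.Lock()

def log_result(url: str, status: str) -> None:
    """Append the outcome to a CSV so a restart can skip finished URLs."""
    with csv_lock, open("progress.csv", "a", newline="") as f:
        csv.writer(f).writerow([url, status])

def worker() -> None:
    while True:
        url = url_queue.get()
        try:
            resp = requests.get(url, timeout=30)
            if resp.status_code == 429:          # rate limited: pause, then requeue
                time.sleep(PAUSE_ON_RATE_LIMIT)
                url_queue.put(url)
            else:
                log_result(url, "ok" if resp.ok else f"fail:{resp.status_code}")
        except requests.RequestException:
            log_result(url, "fail:exception")
        finally:
            url_queue.task_done()

def run(urls: list[str], n_threads: int = 4) -> None:
    for _ in range(n_threads):
        threading.Thread(target=worker, daemon=True).start()
    for url in urls:                              # enqueue at a constant rate
        url_queue.put(url)
        time.sleep(1.0 / REQUESTS_PER_SECOND)
    url_queue.join()
```

Feeding the queue at a constant rate bounds the long-run request rate regardless of how many worker threads are running.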

yahoo_links_selenium.py collects all the Yahoo Finance news links recorded on the Internet Archive through its CDX server. It loops through the prefixes "00*" to "zz*", because querying some prefixes directly returns only a limited number of results when there are too many URLs. Every successful fetch is cached in the "parts" folder (which also makes the script resumable after a restart). Finally, it drops duplicates and outputs a CSV file that can be fed to the scraper.
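A per-prefix CDX query might look like the sketch below. This is a hedged illustration, not the repo's code: the real script drives a browser with Selenium, whereas this version calls the CDX API directly with requests, and both the finance.yahoo.com/news/ URL pattern and the digits-plus-lowercase alphabet for "00"–"zz" are assumptions.

```python
# Minimal sketch of a prefix-by-prefix CDX crawl with per-prefix caching.
import csv
import string
from itertools import product
from pathlib import Path

import requests

CDX = "https://web.archive.org/cdx/search/cdx"
ALPHABET = string.digits + string.ascii_lowercase   # "0"-"9" then "a"-"z"

def fetch_prefix(prefix: str) -> list[str]:
    """Fetch archived URLs for one two-character prefix, caching the result."""
    cache = Path("parts") / f"{prefix}.txt"
    if cache.exists():                              # resume: reuse cached parts
        return cache.read_text().splitlines()
    params = {
        "url": f"finance.yahoo.com/news/{prefix}*", # trailing * = prefix match
        "fl": "original",                           # return only the original URL
        "collapse": "urlkey",                       # drop duplicate snapshots
    }
    resp = requests.get(CDX, params=params, timeout=120)
    resp.raise_for_status()
    urls = resp.text.splitlines()
    cache.parent.mkdir(exist_ok=True)
    cache.write_text("\n".join(urls))
    return urls

all_urls: set[str] = set()                          # a set deduplicates across prefixes
for a, b in product(ALPHABET, repeat=2):
    all_urls.update(fetch_prefix(a + b))

with open("yahoo_links.csv", "w", newline="") as f:
    csv.writer(f).writerows([u] for u in sorted(all_urls))
```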

The experimental folder holds all the experimental programs; future development includes a distributed system and a more advanced, computer-vision-based universal templateless scraper.

ticker_symbol_query is used to get the information for each ticker (company name, products, key people, etc.), which can then be matched with news. Note: consider using a VPN with a US IP address if you hit errors.
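Since the matching step below draws on Wikidata, one way to look up a ticker is a SPARQL query against the public Wikidata endpoint. This is a hedged sketch, assuming the properties P249 (ticker symbol), P1056 (product produced), and P169 (chief executive officer); ticker_symbol_query itself may use a different API or property set.

```python
# Sketch: look up one ticker on Wikidata via the public SPARQL endpoint.
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?companyLabel ?productLabel ?ceoLabel WHERE {
  ?company wdt:P249 "%s" .                      # P249 = ticker symbol
  OPTIONAL { ?company wdt:P1056 ?product . }    # P1056 = product produced
  OPTIONAL { ?company wdt:P169 ?ceo . }         # P169  = chief executive officer
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""

def query_ticker(symbol: str) -> list[dict]:
    """Return SPARQL result rows for one ticker symbol."""
    resp = requests.get(
        SPARQL_ENDPOINT,
        params={"query": QUERY % symbol, "format": "json"},
        headers={"User-Agent": "advanced-scrapper-example/0.1"},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["results"]["bindings"]

for row in query_ticker("AAPL"):
    print({k: v["value"] for k, v in row.items()})
```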

match_keywords.py matches the information from Wikidata against the scraped articles to get the corresponding news for each ticker. To use this dataset, you can download the premade version from my HuggingFace.
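As an illustration of the matching step, a simple whole-word keyword match could look like the sketch below. The ticker_keywords entries are hypothetical examples, and the real match_keywords.py may normalize or match text differently.

```python
# Sketch: map articles to tickers by whole-word keyword matching.
import re

# Keywords per ticker, as would be pulled from Wikidata (company name,
# products, key people). Example entries only.
ticker_keywords: dict[str, list[str]] = {
    "AAPL": ["Apple Inc", "iPhone", "Tim Cook"],
    "TSLA": ["Tesla", "Model 3", "Elon Musk"],
}

def match_tickers(article_text: str) -> list[str]:
    """Return tickers whose keywords appear in the article text."""
    hits = []
    for ticker, keywords in ticker_keywords.items():
        for kw in keywords:
            if re.search(rf"\b{re.escape(kw)}\b", article_text, re.IGNORECASE):
                hits.append(ticker)
                break                              # one keyword hit is enough
    return hits

print(match_tickers("Apple unveiled a new iPhone on Tuesday."))  # ['AAPL']
```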


About

High speed, smart distributed news scraping system
