Broken Link Finder

A Python script that scans the web page at a given URL and validates the links on it.

If the page links to the same hostname as the URL given by the user, the destination page is also scanned. The script works like a web crawler, so potentially the whole site will be scanned. Be aware that this may take hours.
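The same-hostname rule can be sketched as a small breadth-first crawl. This is only an illustration and is not taken from the script: the names same_host, crawl and get_links are hypothetical, and get_links stands in for whatever fetches a page and returns the hrefs found on it.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def same_host(url, start_url):
    """Return True when both URLs share a hostname (compared in lowercase)."""
    return urlparse(url).hostname == urlparse(start_url).hostname


def crawl(start_url, get_links):
    """Breadth-first crawl starting at start_url.

    get_links(page_url) is a hypothetical callable returning the hrefs on a page.
    Returns (scanned_pages, found_links). Only same-host links are followed,
    but every link is collected so it can be validated later.
    """
    queue = deque([start_url])
    scanned, found = set(), set()
    while queue:
        page = queue.popleft()
        if page in scanned:
            continue
        scanned.add(page)
        for href in get_links(page):
            # Resolve relative links and drop any "#fragment" part.
            link, _fragment = urldefrag(urljoin(page, href))
            found.add(link)
            if same_host(link, start_url) and link not in scanned:
                queue.append(link)
    return scanned, found
```

External links end up in found_links (so they can still be checked) but are never scanned themselves, which keeps the crawl inside one site.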

All broken links found are saved to broken-links-[date]-[time]-[random-ID].txt.
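As a rough sketch of that naming scheme, the snippet below builds such a filename and writes one link per line. The exact date and time formats, the 8-character ID, and the function names report_filename and save_report are assumptions made for illustration; the script may format these differently.

```python
import uuid
from datetime import datetime
from pathlib import Path


def report_filename(now=None):
    """Build a name like broken-links-20190701-123005-1a2b3c4d.txt."""
    now = now or datetime.now()
    random_id = uuid.uuid4().hex[:8]  # assumed ID style: 8 hex characters
    return f"broken-links-{now:%Y%m%d}-{now:%H%M%S}-{random_id}.txt"


def save_report(broken_links, directory="."):
    """Write one broken link per line and return the path of the report."""
    path = Path(directory) / report_filename()
    path.write_text("\n".join(broken_links) + "\n", encoding="utf-8")
    return path
```

The random ID keeps two runs started within the same second from overwriting each other's report.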

Requisites

Chrome 75 (put the proper driver in the drivers folder if you have another version)

Python 3 (it has been tested on Python 3.7)

Some additional Python modules, including Selenium and Requests (check the script's imports for the full list)

Details

The script scans pages using Selenium to account for links that may be injected via JavaScript.
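A minimal sketch of link collection with Selenium follows. It is not the script's actual code: collect_links and make_headless_chrome are hypothetical names, headless mode is an assumption, and the sketch expects a matching ChromeDriver to be discoverable by Selenium (the script itself looks in its drivers folder).

```python
def collect_links(driver, url):
    """Load url in the browser and return the href of every <a> element.

    Because the page is rendered by a real browser, links injected by
    JavaScript are present in the DOM by the time they are collected.
    """
    driver.get(url)
    anchors = driver.find_elements("tag name", "a")  # same value as By.TAG_NAME
    hrefs = (anchor.get_attribute("href") for anchor in anchors)
    # Drop anchors without an href and remove duplicates.
    return sorted({href for href in hrefs if href})


def make_headless_chrome():
    """Create a headless Chrome driver (assumes ChromeDriver can be located)."""
    from selenium import webdriver  # third-party; imported only when needed

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    return webdriver.Chrome(options=options)


if __name__ == "__main__":
    browser = make_headless_chrome()
    try:
        for link in collect_links(browser, "https://example.com/"):
            print(link)
    finally:
        browser.quit()
```

Starting a browser for every page is the main cost of this approach, which is the trade-off against the Requests-only variant.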

Then, each link found is validated with Requests concurrently, using multiple threads.
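The concurrent validation step could look roughly like the sketch below, which uses a thread pool from the standard library. The names check_with_requests and find_broken, the pool size, the 10-second timeout, and treating any status of 400 or above as broken are all assumptions for illustration, not the script's actual behavior.

```python
from concurrent.futures import ThreadPoolExecutor


def check_with_requests(url, timeout=10):
    """Return a short problem description for a broken link, or None if it is fine."""
    import requests  # third-party; imported lazily so the rest runs without it

    try:
        # stream=True avoids downloading the body; only the status is needed.
        response = requests.get(url, timeout=timeout, stream=True)
        response.close()
    except requests.RequestException as exc:
        return type(exc).__name__
    if response.status_code >= 400:
        return f"HTTP {response.status_code}"
    return None


def find_broken(links, check=check_with_requests, max_workers=8):
    """Check links concurrently and return {url: problem} for the broken ones."""
    links = list(links)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(check, links)  # results come back in input order
        return {url: problem for url, problem in zip(links, results) if problem}
```

Threads suit this job because each check spends most of its time waiting on the network, so many requests can be in flight at once.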

If you don't care about rendering JavaScript, the find_broken_links_req.py script doesn't use Selenium and thus is slightly faster.
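For comparison, links can be pulled from static HTML without a browser. The sketch below uses the standard library's html.parser as a stand-in; how find_broken_links_req.py actually parses pages is not stated here, and LinkExtractor and extract_links are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the absolute URLs of all <a href=...> links in an HTML string."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

This only sees links present in the HTML as served, so anything added later by JavaScript is missed.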

A single-threaded script (find_broken_links_sync.py) is provided for benchmarking.

TODO

Better exception handling
Validate links in other tags besides <a href=...> (like <img src=...>)
Timeout
Link depth limit
Proxy support
NTLM support
Selenium driver parallelization

