ubalklen/Broken-Link-Finder

Broken Link Finder

A Python script that scans the web page at a given URL and validates the links on it.

If the page links to the same hostname as the URL given by the user, the destination page is also scanned. The script works like a web crawler, so potentially the whole site will be scanned. Be aware that this may take hours.
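The same-hostname rule can be sketched roughly as below. This is a minimal illustration, not the script's actual code: `same_host`, `crawl`, and the `get_links` callable (which stands in for whatever fetches a page and returns its links) are hypothetical names.

```python
from collections import deque
from urllib.parse import urljoin, urlparse


def same_host(start_url, link):
    """Return True when link points at the same hostname as start_url."""
    return urlparse(link).hostname == urlparse(start_url).hostname


def crawl(start_url, get_links):
    """Breadth-first crawl limited to the hostname of start_url.

    get_links(url) is a hypothetical callable returning the hrefs found on url.
    Returns (pages that were scanned, every absolute link that was seen).
    """
    to_visit = deque([start_url])
    visited = set()
    seen_links = set()
    while to_visit:
        page = to_visit.popleft()
        if page in visited:
            continue
        visited.add(page)
        for href in get_links(page):
            link = urljoin(page, href)  # resolve relative links against the page
            seen_links.add(link)
            # Only same-host pages are queued for scanning; external links
            # are recorded for validation but never crawled.
            if same_host(start_url, link) and link not in visited:
                to_visit.append(link)
    return visited, seen_links
```

The `visited` set keeps the crawl from looping when pages link back to each other, which is likely on any real site.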

All broken links found are saved to broken-links-[date]-[time]-[random-ID].txt.
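One way to build such a file name is sketched below. The exact date, time, and ID formats used by the script are not documented here, so the `%Y%m%d` / `%H%M%S` patterns and the 8-character hexadecimal ID are assumptions, and `report_filename` is a hypothetical helper.

```python
import uuid
from datetime import datetime


def report_filename(now=None):
    """Build a name like broken-links-[date]-[time]-[random-ID].txt.

    The date/time patterns and the ID length are assumed for illustration.
    """
    now = now or datetime.now()
    random_id = uuid.uuid4().hex[:8]  # short random suffix to avoid collisions
    return f"broken-links-{now:%Y%m%d}-{now:%H%M%S}-{random_id}.txt"
```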

Requisites

Chrome 75 (put the proper driver in the drivers folder if you have another version)

Python 3 (it has been tested on Python 3.7)

Some additional Python modules, including Selenium and Requests (check the script for the full list)

Details

The script scans pages using Selenium to account for links that may be injected via JavaScript.
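Collecting links from a rendered page might look roughly like this. It is a sketch, not the script's actual code: `hrefs_from` and `rendered_links` are hypothetical names, and it assumes the `selenium` package is installed and that a matching ChromeDriver can be found by Selenium (the script itself keeps its driver in the drivers folder).

```python
def hrefs_from(elements):
    """Collect non-empty href values from Selenium-like elements."""
    hrefs = []
    for element in elements:
        href = element.get_attribute("href")
        if href:
            hrefs.append(href)
    return hrefs


def rendered_links(url):
    """Load url in Chrome and return the hrefs of all <a> elements.

    Because the page is rendered by a real browser, anchors injected
    by JavaScript are included.
    """
    # Imported here so hrefs_from can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()  # assumes a compatible driver is discoverable
    try:
        driver.get(url)
        return hrefs_from(driver.find_elements(By.TAG_NAME, "a"))
    finally:
        driver.quit()  # always release the browser process
```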

Then, each link found is validated with Requests in a concurrent (multi-threaded) fashion.
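The concurrent validation step could be sketched as follows, assuming a thread pool from the standard library. `is_broken` and `find_broken` are hypothetical names, and the rule "a status code of 400 or above, or any request error, counts as broken" is an assumption rather than the script's documented behavior. The `timeout` argument is also this sketch's choice.

```python
from concurrent.futures import ThreadPoolExecutor


def is_broken(url, timeout=10):
    """Return True if url cannot be fetched or answers with an error status."""
    import requests  # third-party; imported here so find_broken works without it

    try:
        response = requests.get(url, timeout=timeout)
    except requests.RequestException:
        return True
    return response.status_code >= 400  # assumed definition of "broken"


def find_broken(links, check=is_broken, max_workers=8):
    """Run check on every link in parallel threads and return the broken ones."""
    links = list(links)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with links.
        results = list(pool.map(check, links))
    return [link for link, broken in zip(links, results) if broken]
```

Threads suit this workload because each check spends most of its time waiting on the network. Passing `check` as a parameter also makes it possible to try the function with a stand-in checker and no network access.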

If you don't care about rendering JavaScript, the find_broken_links_req.py script doesn't use Selenium and thus is slightly faster.
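Without a browser, links have to be read from the raw HTML, which misses anything added later by JavaScript. How find_broken_links_req.py parses pages is not described here; the sketch below uses the standard library's `html.parser` purely as an illustration, and `AnchorCollector` and `static_links` are hypothetical names.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect href values from <a> tags in static HTML."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def static_links(html):
    """Return the hrefs of all anchors found in the given HTML text."""
    collector = AnchorCollector()
    collector.feed(html)
    return collector.hrefs
```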

A single-threaded script is provided for benchmarking.

TODO

  • Better exception handling
  • Validate links in other tags besides <a href=...> (like <img src=...>)
  • Timeout
  • Link depth limit
  • Proxy support
  • NTLM support
  • Selenium driver parallelization
