Commit

Add README
Alp Toker committed Jul 14, 2017
1 parent 5f01948 commit cb3269b
Showing 1 changed file with 25 additions and 0 deletions.
25 changes: 25 additions & 0 deletions README.md
@@ -0,0 +1,25 @@
# block-crawler: discovery tool for legally restricted HTTP 451 resources

## Synopsis

The _block-crawler_ module scans web resources to discover content that is withheld for legal reasons and signalled with the HTTP 451 status code specified in [RFC7725](https://tools.ietf.org/html/rfc7725).
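
As a rough illustration of the core check, the sketch below fetches a URL with the Python standard library and reports whether the server answered with HTTP 451. The function name `is_legally_restricted` is hypothetical and is not part of _block-crawler_; the real module may probe resources differently.

```python
import urllib.error
import urllib.request


def is_legally_restricted(url, timeout=10.0):
    """Return True if fetching `url` yields HTTP 451 (hypothetical helper)."""
    request = urllib.request.Request(url, method="GET")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            # Successful responses (2xx) land here, so this is normally False.
            return response.status == 451
    except urllib.error.HTTPError as error:
        # urllib raises HTTPError for 4xx and 5xx responses, including 451.
        return error.code == 451
```

Network failures such as DNS errors or timeouts are deliberately left to propagate, since an unreachable resource is not the same thing as a legally restricted one.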

## Purpose and scope

Unlike other kinds of internet censorship implemented by service providers and governments, resources marked with HTTP 451 are typically blocked _at source_ — that is to say, the publisher has voluntarily complied with demands to restrict the content, either regionally or globally.

_block-crawler_ aims to provide a reference implementation of RFC7725, insofar as it covers all specified features and provisions. The tool includes specialised support for the Link HTTP header field ([RFC5988](https://tools.ietf.org/html/rfc5988)) with the _blocked-by_ relation, whose value is a URI reference that optionally identifies the entity implementing the blockage.
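
To make the _blocked-by_ handling concrete, here is a minimal sketch of extracting the blocking entity from a `Link` header value. The function name `blocked_by` and the regular expressions are assumptions made for illustration, not the module's actual parser. The sketch is also simplified: it does not handle commas inside quoted parameter values.

```python
import re

# One link-value: <URI-reference> followed by parameters up to the next comma.
_LINK = re.compile(r'<([^>]*)>([^,]*)')
# The rel parameter, either quoted (possibly several types) or a bare token.
_REL = re.compile(r';\s*rel\s*=\s*(?:"([^"]*)"|([^\s;,"]+))', re.IGNORECASE)


def blocked_by(link_header):
    """Return the URI references whose rel includes "blocked-by"."""
    uris = []
    for uri, params in _LINK.findall(link_header):
        match = _REL.search(params)
        if not match:
            continue
        # A quoted rel value may list several space-separated relation types.
        rel_types = (match.group(1) or match.group(2) or "").lower().split()
        if "blocked-by" in rel_types:
            uris.append(uri)
    return uris
```

For a header value such as `<https://example.org/legal>; rel="blocked-by"`, the function returns a one-element list holding that URI. An empty list means the response did not identify a blocking entity, which RFC7725 permits.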

## Modes of operation

This module provides a standalone command-line utility as well as developer interfaces and a REST HTTP API for integration into third-party measurement frameworks.

Because HTTP 451 is typically used to 'geoblock' content, it is expected that varied results will be observed from different geographic vantage points. The output of this tool is suitable for aggregation into a larger international dataset which can reveal the global extent of corporate compliance with legal censorship orders and other kinds of localised restrictions on the flow of information online.

### Data formats

Results are produced in a simple streaming JSON annotation format which identifies the affected URL, the observed status code and status text, and the optional blocking entity. A single report entity identifies one HTTP request at a specific point in time, observed from a single IP address.
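
The README does not specify the schema, so the following is only a sketch of what one streamed report might look like, written as one JSON object per line. Every field name here (`url`, `status`, `statusText`, `blockedBy`, `date`) and both function names are hypothetical.

```python
import json
import sys
from datetime import datetime, timezone


def make_report(url, status, status_text, blocked_by=None):
    """Build one report record; all field names are illustrative assumptions."""
    record = {
        "url": url,
        "status": status,
        "statusText": status_text,
        # Time of observation, in UTC, as an ISO 8601 string.
        "date": datetime.now(timezone.utc).isoformat(),
    }
    if blocked_by is not None:
        # The blocking entity is optional, so the key is omitted when unknown.
        record["blockedBy"] = blocked_by
    return record


def write_report(record, stream=sys.stdout):
    """Write a record as a single line so consumers can parse the stream incrementally."""
    stream.write(json.dumps(record, sort_keys=True) + "\n")
    stream.flush()
```

Emitting one object per line keeps each observation independent, which suits aggregation of results collected from many vantage points.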

## Status and contributor guidelines

This tool is under development and not yet recommended for use in production or as a reporting tool for transparency work.