rumca-js/Internet-Places-Database

Overview

This is a database of Internet places: mostly domains, sometimes other things. Think of it as an Internet metadata database. This repository contains link metadata: title, description, publish date, etc.

The entire Internet is in one file! Just unzip internet.zip!


You can easily browse the file using any SQLite program, like DBeaver!

Acceptable link types

Not acceptable link types

  • malware sites
  • porn, casinos, gambling, etc.
  • analytics domains used for user surveillance
  • IT infrastructure domains, CDN domains
  • link shorteners; a subdomain like somethingsomething.lnk.to is not useful, though the main domain lnk.to is acceptable

Some zen rules:

  • Anything that violates the law will be removed from the lists
  • The Internet operates in... many countries, so there are many laws
  • Offensive content does not have to be removed
  • If I suspect that a page is notorious, I may flag it with a tag like "piracy", but the suspicion may not be true
  • If page content is obnoxious, it can, and possibly should, be demoted
  • I do not always follow these rules strictly

The goal is to check how "wide" the Internet is, not how "deep" individual places are!

I do not have the resources to verify all links

  • Links are captured from the Internet automatically
  • If any link is suspicious and should be removed, please create an Issue in this repository
  • Use 'votes' to gauge the credibility of domains
  • Be careful: the Internet is a dangerous place. Know what you are doing when using this list

Example

Simple Search uses domains that have votes > 0.

You can check how this search works and what is inside.
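For instance, you can reproduce that filter directly against the database. A minimal sketch, assuming uservotes stores an entry_id and a numeric vote column (the column names are guesses, so inspect the schema first):

-- entry_id, vote and link column names are assumptions
SELECT linkdatamodel.link, linkdatamodel.title, uservotes.vote
FROM linkdatamodel
JOIN uservotes
ON linkdatamodel.id = uservotes.entry_id
WHERE uservotes.vote > 0;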

Sources of data

The data are obtained by the Django-link-archive web crawler.

Sources:


Benefit - Security

Google Search is known to be susceptible to malvertising. Predatory web pages can disguise themselves as other pages. The link displayed in Google Search is not necessarily the one you will be taken to.

  • This local search does not require Internet access to operate. Once downloaded, you can simply search the metadata
  • This local search might be faster than going through your ISP, depending on your drive, machine, etc.
  • It may be more secure: you can verify a domain, its status, and how long it has been operating before you access the Internet; see the sketch below
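For example, you can look a domain up before visiting it. A minimal sketch against the domains table, assuming it has a column holding the domain name (the column name below is a guess; check the real schema):

-- 'domain' is an assumed column name
SELECT *
FROM domains
WHERE domain LIKE '%example.com%';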

Alternative solutions

Stats

Tables and row counts (these will change over time)

Table: blockentrylist, Row count: 30
Table: compactedtags, Row count: 3305
Table: configurationentry, Row count: 1
Table: dataexport, Row count: 0
Table: domains, Row count: 815164
Table: entryrules, Row count: 12
Table: gateway, Row count: 82
Table: linkdatamodel, Row count: 814242
Table: modelfiles, Row count: 0
Table: readlater, Row count: 0
Table: sourcecategories, Row count: 8
Table: sourcedatamodel, Row count: 11704
Table: sourcesubcategories, Row count: 13
Table: user, Row count: 5
Table: userbookmarks, Row count: 13062
Table: usercomments, Row count: 1
Table: usercompactedtags, Row count: 3305
Table: userconfig, Row count: 3
Table: userentrytransitionhistory, Row count: 6091
Table: userentryvisithistory, Row count: 5001
Table: usersearchhistory, Row count: 304
Table: usertags, Row count: 22712
Table: uservotes, Row count: 27924
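Counts like these can be reproduced with a standard row-count query, one per table:

-- repeat per table to reproduce the stats above
SELECT COUNT(*)
FROM linkdatamodel;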

Files

Data are distributed in the internet.zip file, split into 50 MB parts.

To use it, you have to unpack it.

The resulting internet.db database file can be viewed using any SQLite browser or program.

Each link contains a set of attributes, like:

  • title
  • description
  • page rating
  • date of creation
  • date last seen
  • etc.

You can run queries to find information about tags, etc.

SELECT *
FROM linkdatamodel
JOIN usertags
ON linkdatamodel.id = usertags.entry_id;
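As another example, here is a rough tag-frequency query. It is a sketch that assumes usertags stores the tag text in a column named tag:

-- 'tag' is an assumed column name
SELECT tag, COUNT(*) AS entries
FROM usertags
GROUP BY tag
ORDER BY entries DESC
LIMIT 20;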

Page rating

Content ranking is established by the Django-link-archive project.

To get a good page rating, it is desirable to follow good standards:

  • Pass the Schema Validator
  • Pass the W3C Validator
  • Provide HTML meta information. More info in the Open Graph Protocol
  • Provide a valid title, which is concise, but not too short
  • Provide a valid description, which is concise, but not too short
  • Provide a valid publication date
  • Provide a valid thumbnail / media image
  • Return a valid HTTP status code. No fancy redirects, no JavaScript redirects
  • Provide an RSS feed, with HTML meta information for autodiscovery: https://www.petefreitag.com/blog/rss-autodiscovery/
  • Provide search engine keyword tags

Your page and domain exist alongside thousands of others. Your metadata has an impact on your recognition and page ranking.

Remember: a good page is always ranked higher.
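You can observe this in the data itself by sorting entries by rating. A minimal sketch, assuming the rating column is named page_rating (the name the roadmap below uses):

-- page_rating is an assumed column name, taken from the roadmap
SELECT link, title, page_rating
FROM linkdatamodel
ORDER BY page_rating DESC
LIMIT 50;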

You may wonder why I am writing about the search engine "keywords" meta field if Google does not need it. Well, I don't like Google. If we want alternative solutions to exist, it should be possible to find your page easily from simpler search engines. Provide the keywords field if you support the open web.

Tags

Some tags are quite obvious:

  • company - if the entry exists just to provide information about a company
  • university, museum, etc. - if the entry provides details about a university, museum, etc.
  • disinformation / misinformation - self-explanatory
  • news - if it is a "news" content farm. Might also be "game news", "tech news", etc.
  • web spam - anything annoying, not worthwhile, etc.
  • warhammer - anything that relates to...
  • radio station
  • movie - page describing a movie
  • video game - page describing a video game, etc.
  • fan page - pages created by fans, of topics, of people
  • online tool - web programs that are not accessible if you are offline
  • ad business - if the page owner works in this sector
  • nsfw - not safe for work
  • convention - gatherings of hobbyists, etc.

Some other notable examples:

  • open source - if the entry is "open source" related
  • personal - if it seems to be a personal website
  • personal sites source - pages where you can find more personal sites
  • self-host - software that can be self-hosted
  • amiga / commodore - anything amiga / commodore related
  • demoscene / zx spectrum - related to this kind of music
  • emulator / emulation - anything related to emulators
  • wtf - for really interesting finds
  • funny - anything that makes me chuckle
  • interesting page design - self-explanatory. Some pages are just fun
  • interesting domain name - if the domain name is interesting
  • wargames / tabletop game - there are some old blogs about this hobby
  • internet archive - valuable resources that protect knowledge
  • reverse engineering
  • hacking / cybersecurity / ctf - quite self-explanatory
  • ranking page - a page that shows items with scores, like Metacritic or pepper
  • image assets / music assets /

Other

  • artificial intelligence bot - AI bot, like ChatGPT, etc.
  • gatekeeper - platforms that are too big to fail. Monopolies, big tech, etc.
  • link service - link services, link shorteners, ad counters
  • monetization - if the page includes some kind of monetization: subscriptions, loot boxes
  • gambling - if the page is about gambling
  • redirect issue - the page is not what it is supposed to be; it redirects to some adult page, etc.
  • the left wing - things for democrats, the left wing of the political spectrum
  • the right wing - things for republicans, the right wing of the political spectrum
  • conspiracy theories / 911

How to access the data?

Use any SQLite database reader, such as DBeaver.
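If you prefer raw SQL, you can enumerate the tables directly with SQLite's standard catalog query:

-- lists every table in internet.db
SELECT name
FROM sqlite_master
WHERE type = 'table'
ORDER BY name;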

CLI script

Do you want to search the database? I have got you covered! Use dataanalyzer.py

First install poetry, then run 'poetry update'. After that you can use the script.

Unpack internet.zip, then...

Search for warhammer in the link, title, and description; show the title:

dataanalyzer.py --db internet.db --search "*warhammer*" --title

Search for warhammer in the link only; show the title and tags:

dataanalyzer.py --db internet.db --search "link=*warhammer*" --title --tags

Access via web interface

unpack internet.zip
python3 -m http.server 8000          # start server
http://localhost:8000/search.html    # visit in a browser

You don't like it? Fork it!

I have my own opinions, with which you do not have to agree. Most of the tags and votes are added manually. You can use this repository as a starting point to kick off your own project. Add your own tags. Create your own version of a search engine. Good luck!

Notes

  • Not all domains have to be stored here. It would be best to keep valuable domains. We certainly do not want content farms, and we do not need sites that contribute nothing useful to society or to the reader
  • The distinction is not that clear-cut, but more lenient rules apply toward personal sites
  • I am not that interested in marking Substack or Medium as "personal" sites, as I do not feel they should be tagged as such

Roadmap

  • Initial release. Provide commonly used domains: YouTube, Google, etc.
  • Define sources of data. Use indie web sources
  • Define clean tag names, so that the database can easily be searched
  • Advertise in indie web sources. Potentially: HN, Reddit self-host and web scraping forums, an Amiga board. Nice, hackery places
  • Provide binary releases: an SQLite database, so that it can easily be imported by other tools
  • Establish a plan for binary releases
  • Create a browser extension. The extension should provide domain info for each link: its rating according to page_rating, and how long the page has been operating. The longer a domain is active, the better
  • Create a mobile app for searching. Upload it to Google Play & F-Droid
  • Gather data using a VPN, to receive English meta information
  • Secure funds for an organisation. Kickstarter?
  • Establish a valid, simple domain for the project
  • Provide Google-like search on that domain
  • Conquer the world