rumca-js/Internet-Places-Database

Overview

This is a database of Internet places: mostly domains, sometimes other things. Think of it as an Internet metadata database. This repository contains link metadata: title, description, publish date, etc.

The entire Internet is in one file! Just unzip internet.zip!


You can easily browse the file using any SQLite program, like DBeaver!

Acceptable link types

Not acceptable link types

  • malware sites
  • porn, casinos, gambling, etc.
  • analytics domains used for user surveillance
  • IT infrastructure domains, CDN domains
  • link shorteners; a subdomain like somethingsomething.lnk.to is not useful, though the main domain lnk.to is acceptable

Some zen rules:

  • Anything that violates the law will be removed from the lists
  • The Internet operates in... many countries, so there are many laws
  • Offensive content does not have to be removed
  • If I suspect that a page is notorious, I may flag it with a tag like "piracy", but the suspicion may not be true
  • If page content is obnoxious, it can, and possibly should, be demoted
  • I do not always follow these rules strictly

The goal is to check how "wide" the Internet is, not how "deep" individual places are!

I do not have the resources to verify all links

  • Links are captured from the Internet automatically
  • If any link is suspicious and should be removed, please create an Issue in this repository
  • Use 'votes' to gauge the credibility of domains
  • Be careful: the Internet is a dangerous place. Know what you are doing when using this list

Example

Simple Search uses domains that have votes > 0.

You can check how this search works and what is inside.
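For instance, you can reproduce that filter directly against the database. A minimal sketch, assuming uservotes stores an entry_id and a numeric vote column (the column names are guesses, so inspect the schema first):

-- entry_id, vote and link column names are assumptions
SELECT linkdatamodel.link, linkdatamodel.title, uservotes.vote
FROM linkdatamodel
JOIN uservotes
ON linkdatamodel.id = uservotes.entry_id
WHERE uservotes.vote > 0;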

Sources of data

The data are obtained by the Django-link-archive web crawler.

Sources:


Benefit - Security

Google Search is known to be susceptible to malvertising. Predatory web pages can disguise themselves as other pages. The link displayed in Google Search is not necessarily the one you will be taken to.

  • This local search does not require Internet access to operate. Once downloaded, you can simply search the metadata
  • This local search might be faster than going through your ISP, depending on your drive, machine, etc.
  • It may be more secure: you can verify a domain, its status, and how long it has been operating before you access the Internet; see the sketch below
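For example, you can look a domain up before visiting it. A minimal sketch against the domains table, assuming it has a column holding the domain name (the column name below is a guess; check the real schema):

-- 'domain' is an assumed column name
SELECT *
FROM domains
WHERE domain LIKE '%example.com%';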

Alternative solutions

Stats

Tables and row counts (these will change over time)

Table: blockentrylist, Row count: 30
Table: compactedtags, Row count: 3305
Table: configurationentry, Row count: 1
Table: dataexport, Row count: 0
Table: domains, Row count: 815164
Table: entryrules, Row count: 12
Table: gateway, Row count: 82
Table: linkdatamodel, Row count: 814242
Table: modelfiles, Row count: 0
Table: readlater, Row count: 0
Table: sourcecategories, Row count: 8
Table: sourcedatamodel, Row count: 11704
Table: sourcesubcategories, Row count: 13
Table: user, Row count: 5
Table: userbookmarks, Row count: 13062
Table: usercomments, Row count: 1
Table: usercompactedtags, Row count: 3305
Table: userconfig, Row count: 3
Table: userentrytransitionhistory, Row count: 6091
Table: userentryvisithistory, Row count: 5001
Table: usersearchhistory, Row count: 304
Table: usertags, Row count: 22712
Table: uservotes, Row count: 27924
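Counts like these can be reproduced with a standard row-count query, one per table:

-- repeat per table to reproduce the stats above
SELECT COUNT(*)
FROM linkdatamodel;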

Files

Data are distributed in the internet.zip file, split into 50 MB parts.

To use it, you have to unpack it.

The resulting internet.db database file can be viewed using any SQLite browser or program.

Each link contains a set of attributes, like:

  • title
  • description
  • page rating
  • date of creation
  • date last seen
  • etc.

You can run queries to find information about tags, etc.

SELECT *
FROM linkdatamodel
JOIN usertags
ON linkdatamodel.id = usertags.entry_id;
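As another example, here is a rough tag-frequency query. It is a sketch that assumes usertags stores the tag text in a column named tag:

-- 'tag' is an assumed column name
SELECT tag, COUNT(*) AS entries
FROM usertags
GROUP BY tag
ORDER BY entries DESC
LIMIT 20;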

Page rating

Content ranking is established by the Django-link-archive project.

To get a good page rating, it is desirable to follow good standards:

  • Pass the Schema Validator
  • Pass the W3C Validator
  • Provide HTML meta information. More info in the Open Graph Protocol
  • Provide a valid title, which is concise, but not too short
  • Provide a valid description, which is concise, but not too short
  • Provide a valid publication date
  • Provide a valid thumbnail / media image
  • Return a valid HTTP status code. No fancy redirects, no JavaScript redirects
  • Provide an RSS feed, with HTML meta information for autodiscovery: https://www.petefreitag.com/blog/rss-autodiscovery/
  • Provide search engine keyword tags

Your page and domain exist alongside thousands of others. Your metadata has an impact on your recognition and page ranking.

Remember: a good page is always ranked higher.
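You can observe this in the data itself by sorting entries by rating. A minimal sketch, assuming the rating column is named page_rating (the name the roadmap below uses):

-- page_rating is an assumed column name, taken from the roadmap
SELECT link, title, page_rating
FROM linkdatamodel
ORDER BY page_rating DESC
LIMIT 50;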

You may wonder why I am writing about the search engine "keywords" meta field if Google does not need it. Well, I don't like Google. If we want alternative solutions to exist, it should be possible to find your page easily from simpler search engines. Provide the keywords field if you support the open web.

Tags

Some tags are quite obvious:

  • company - if the entry exists just to provide information about a company
  • university, museum, etc. - if the entry provides details about a university, museum, etc.
  • disinformation / misinformation - self-explanatory
  • news - if it is a "news" content farm. Might also be "game news", "tech news", etc.
  • web spam - anything annoying, not worthwhile, etc.
  • warhammer - anything that relates to...
  • radio station
  • movie - page describing a movie
  • video game - page describing a video game, etc.
  • fan page - pages created by fans, of topics, of people
  • online tool - web programs that are not accessible if you are offline
  • ad business - if the page owner works in this sector
  • nsfw - not safe for work
  • convention - gatherings of hobbyists, etc.

Some other notable examples:

  • open source - if the entry is "open source" related
  • personal - if it seems to be a personal website
  • personal sites source - pages where you can find more personal sites
  • self-host - software that can be self-hosted
  • amiga / commodore - anything amiga / commodore related
  • demoscene / zx spectrum - related to this kind of music
  • emulator / emulation - anything related to emulators
  • wtf - for really interesting finds
  • funny - anything that makes me chuckle
  • interesting page design - self-explanatory. Some pages are just fun
  • interesting domain name - if the domain name is interesting
  • wargames / tabletop game - there are some old blogs about this hobby
  • internet archive - valuable resources that protect knowledge
  • reverse engineering
  • hacking / cybersecurity / ctf - quite self-explanatory
  • ranking page - a page that shows items with scores, like Metacritic or pepper
  • image assets / music assets /

Other

  • artificial intelligence bot - AI bot, like ChatGPT, etc.
  • gatekeeper - platforms that are too big to fail. Monopolies, big tech, etc.
  • link service - link services, link shorteners, ad counters
  • monetization - if the page includes some kind of monetization: subscriptions, loot boxes
  • gambling - if the page is about gambling
  • redirect issue - the page is not what it is supposed to be; it redirects to some adult page, etc.
  • the left wing - things for democrats, the left wing of the political spectrum
  • the right wing - things for republicans, the right wing of the political spectrum
  • conspiracy theories / 911

How to access the data?

Use any SQLite database reader, such as DBeaver.
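If you prefer raw SQL, you can enumerate the tables directly with SQLite's standard catalog query:

-- lists every table in internet.db
SELECT name
FROM sqlite_master
WHERE type = 'table'
ORDER BY name;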

CLI script

Do you want to search the database? I have got you covered! Use dataanalyzer.py

First install poetry, then run 'poetry update'. After that you can use the script.

Unpack internet.zip, then...

Search for warhammer in the link, title, and description; show the title:

dataanalyzer.py --db internet.db --search "*warhammer*" --title

Search for warhammer in the link only; show the title and tags:

dataanalyzer.py --db internet.db --search "link=*warhammer*" --title --tags

Access via web interface

unpack internet.zip
python3 -m http.server 8000          # start server
http://localhost:8000/search.html    # visit in a browser

You don't like it? Fork it!

I have my own opinions, with which you do not have to agree. Most of the tags and votes are added manually. You can use this repository as a starting point to kick off your own project. Add your own tags. Create your own version of a search engine. Good luck!

Notes

  • Not all domains have to be stored here. It would be best to keep valuable domains. We certainly do not want content farms, and we do not need sites that contribute nothing useful to society or to the reader
  • The distinction is not that clear-cut, but more lenient rules apply toward personal sites
  • I am not that interested in marking Substack or Medium as "personal" sites, as I do not feel they should be tagged as such

Roadmap

  • Initial release. Provide commonly used domains: YouTube, Google, etc.
  • Define sources of data. Use indie web sources
  • Define clean tag names, so that the database can easily be searched
  • Advertise in indie web sources. Potentially: HN, Reddit self-host and web scraping forums, an Amiga board. Nice, hackery places
  • Provide binary releases: an SQLite database, so that it can easily be imported by other tools
  • Establish a plan for binary releases
  • Create a browser extension. The extension should provide domain info for each link: its rating according to page_rating, and how long the page has been operating. The longer a domain is active, the better
  • Create a mobile app for searching. Upload it to Google Play & F-Droid
  • Gather data using a VPN, to receive English meta information
  • Secure funds for an organisation. Kickstarter?
  • Establish a valid, simple domain for the project
  • Provide Google-like search on that domain
  • Conquer the world