TvTropes

The base for all links is (https://tvtropes.org/pmwiki/)
TVTropes is organized into Namespaces.
The list of namespaces can be obtained from (articlecount.php)
Each namespace is listed as "0000: Name" with <br/> between each within the fourth <p> subelement within the #wikimiddle <div>
The list of articles in a Namespace can be obtained from (namespace_index.php?ns=)
Both (3) and (4) links have multiple pages.
The last page number can be found from the data-total-pages attribute on the only .pagination-box element on any of the pages
Pages can be navigated to with the ?page= query
Not having a page number is equivelent to ?page=1
All links to other TVTropes pages have the .twikilink class
The link to these pages is (pmwiki.php/Namespace/Identifier)
These pages will have a title found in an h1 with the .entry-title class
The title h1 will also have a strong tag with "Namespace /" prepended to the text if it is not in the Main namespace
A link should either be either Broken 404, Redirect 302, or exist 2xx
A Redirect will have a div with the .aka-title in it's title h1 (some dont?)
The above is a hoax. The .aka-title will repeat anything in the second part of the ?from= query with captials replaced by "\s[A-Z]" as appropriate
A Redirect will change the link to (pmwiki.php/Namespace/Identifier?from=OldNamespace.OldIdentifier)
There is only one definitive link that multiple redirects can send to
Pages that exist have multiple .twikilinks on them that serve as graph edges
A page can be identified by a combination of (Namespace, ID)
The ID is normally the Title with all non alphanumeric characters dropped
IDs and Namespaces should be compared case insensitively
Too much crawling may incur a rate limit

Sqlite3

A Sqlite3 DB can only have one writer
A reader will error if it attempts to access the DB while it is processing a transaction
For these reasons, Sqlite3 is easiest if only 1 thread accesses it

Concurrent Ruby

~~2. Concurrent::Array could be used as a queue via the Array#push and Array#shift methods~~

Ruby has a builtin thread-safe Queue class
The ruby-concurrency gem provides thread safe for all other purposes
A new thread can be created using Thread.new and providing a block

General plan

Database Schema

We wil use my original database schema with some extensions.

create table if not exists pages (
  namespace text collate nocase not null,
  id text collate nocase not null,
  response integer,
  title text,
  alias_of_namespace text collate nocase,
  alias_of_id text collate nocase,
  primary key (namespace, id) on conflict replace,
  foreign key (alias_of_namespace, alias_of_id) references pages (namespace, id) on delete set null
);

This table represents a wiki page (one of the pmwiki.php links). The alias_of foreign keys setup the many to one relationship that redirects can have with another page. The title column will be the contents of the title h1 for most pages, but for redirects it will be the contents of the aka-title minus the preceding "Aka: " text. Broken links will have no title.

create table if not exists links (
  namespace text collate nocase not null,
  id text collate nocase not null,
  link_namespace text collate nocase not null,
  link_id text collage nocase not null,
  primary key (namespace, id, link_namespace, link_id) on conflict ignore,
  foreign key (namespace, id) references pages (namespace, id) on delete cascade,
  foreign key (namespace, id) references pages (link_namespace, link_id) on delete cascade,
);

This table represents the many-to-many links between wiki pages. A link can only be uniquely identified by both the starting and ending pages. Due to the nature of the Tvtropes, this table represents a directed graph, since not all pages link back to pages that link to them.

Web Crawler

For completeness, the web crawler will be seeded with every Namespace from the Article Count page. The first task will be to crawl every Namespace Index for seed pages. There will be 4 work queues: Article Count, Namespace, Wiki, and SQL. The function of each queue is described below. The number of data each of the queues processes will get successively larger.

Article Count

This work queue will be seeded with all the articlecount.php pages. Workers will parse each page for the namespaces therein and each namespace will become a link on the namespace queue equal to "namespace_index.php?ns=".

Namespace

To prevent duplication, a worker will read the Namespace link, if the query contains the page= parameter it will process all .twikilinks on the page and add them to the Wiki queue. Otherwise, it will determine how many pages are in the Namespace Index and add page=2 through page=<count> onto the Namespace queue before processing the page as page=1.

Wiki

The wiki queue will perform the data gathering for the database from the Wiki links the Namespace workers find. A Concurrent::Set of visited links will The worker will perform these actions:

Read a link from the work queue
Attempt to put that link into the visited links set
On failure, return to 1
Open the link and gather the data needed for the SQL tables
Push new .twikilinks onto the Wiki Queue
Push this data onto the SQL queue

{
  namespace,
  id,
  response,
  title,
  alias_of_namespace,
  alias_of_id,
  links: [
    { namespace, id }
  ]
}

SQL

There will only be one worker operating on the SQL queue. It will persist a connection to the sqlite database, wait for data on the SQL queue, parse and insert that data to the database.

Name		Name	Last commit message	Last commit date
Latest commit History 49 Commits
app		app
scripts		scripts
.eslintrc.js		.eslintrc.js
.gitignore		.gitignore
Gemfile		Gemfile
Gemfile.lock		Gemfile.lock
README.md		README.md
package-lock.json		package-lock.json
package.json		package.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

TvTropes

Sqlite3

Concurrent Ruby

General plan

Database Schema

Web Crawler

Article Count

Namespace

Wiki

SQL

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

TvTropes

Sqlite3

Concurrent Ruby

General plan

Database Schema

Web Crawler

Article Count

Namespace

Wiki

SQL

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages