GitHub - jumpingGrendel/dmoz2db: A database importer for the open directory project (aka dmoz) data


forked from JoKnopp/dmoz2db



dmoz2db is a tool to parse the RDF-like dumps from http://rdf.dmoz.org/rdf/ and
put the contents into a database. dmoz2db is tested with MySQL but should work
with other databases as well. IT COMES WITH ABSOLUTELY NO WARRANTY OF ANY KIND.

Instructions

To use dmoz2db, you need to install SQLAlchemy 0.6.5 or higher
(http://www.sqlalchemy.org).

Your database must have UTF-8 support enabled. For MySQL, a description of how
to do that is available here:
http://cameronyule.com/2008/07/configuring-mysql-to-use-utf-8

The database where the dmoz data will be stored must be created manually:

mysql> create database DATABASENAME;
mysql> GRANT ALL ON DATABASENAME.* TO 'USERNAME'@'localhost';

After that you should edit db.sample.conf according to your setup and save it
as db.conf.

The database design is documented in the HTML pages in the doc folder.

Running

If the RDF files are present in your current directory, you can just run
~/dmoz-dir/src $ python dmoz2db.py

but you may want to run
~/dmoz-dir/src $ python dmoz2db.py --help

first and look at the available options. Most of them should be
self-explanatory. If you are not interested in the complete dmoz dataset, you
can specify a topic filter to ignore everything that is not under the given
category, which speeds up the import process. Take care with trailing slashes:
'Top/Computers' includes the category itself, while 'Top/Computers/' matches
only what lies under that category. Every category whose father was filtered
out gets the default father id 1.
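
The trailing-slash behaviour can be illustrated with a small sketch. It assumes
the filter is a plain string-prefix test, and matches_topic_filter is a
hypothetical helper, not a function taken from dmoz2db; the real implementation
may differ.

```python
def matches_topic_filter(topic, topic_filter):
    """Return True if `topic` passes a plain prefix filter.

    Assumption: prefix matching only; this is not dmoz2db's own code.
    """
    return topic.startswith(topic_filter)


topics = ["Top/Computers", "Top/Computers/Software", "Top/Arts"]

# Without a trailing slash the category itself is kept as well.
with_category = [t for t in topics if matches_topic_filter(t, "Top/Computers")]

# With a trailing slash only entries below the category are kept.
below_category = [t for t in topics if matches_topic_filter(t, "Top/Computers/")]

print(with_category)
print(below_category)
```

Note that a pure prefix test without the trailing slash would also match a
sibling such as 'Top/Computers_Old', one more reason to take care with the
slash.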

Debug output should only be turned on in combination with the log file option,
because every SQL statement is printed.

The import will take time, so go to lunch or find something else to do :) And
don't halloo till you're out of the wood: a first parse inserts the basic topic
structure into the db, then the father ids are generated, and after that all
the additional information, like related categories or other languages, is
added in the second parse. Last but not least, the content.rdf file is parsed
to add the externalpage information to the database.
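
To make the father id step more concrete, here is a rough sketch. It assumes
that a topic's father is simply its path with the last segment removed and that
ids are looked up in a dictionary; father_ids and the sample ids are
hypothetical and do not come from dmoz2db. Only the default father id of 1 is
taken from this README.

```python
DEFAULT_FATHER_ID = 1  # from the README: used when the father was filtered out


def father_ids(topic_ids):
    """Map each topic name to the id of its father topic.

    `topic_ids` maps names such as 'Top/Computers' to database ids.
    Assumption: the father is the path minus its last segment.
    """
    result = {}
    for name in topic_ids:
        parent = name.rpartition("/")[0]  # '' if there is no slash
        result[name] = topic_ids.get(parent, DEFAULT_FATHER_ID)
    return result


# 'Top' itself is missing here, as if it had been filtered out.
sample_ids = {
    "Top/Computers": 2,
    "Top/Computers/Software": 3,
    "Top/Computers/Software/Editors": 4,
}
print(father_ids(sample_ids))
```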

On my laptop it took about 20 minutes to complete the first parse, 25 minutes
for generating father ids, 2:08 h for the second parse and 8 h for the
content.rdf file, which adds up to ~11 h in total. One dot in the output means
10,000 processed topics; a newline is generated after every 200,000 topics.

In the structure.rdf file, entries dealing with the last editor are ignored.
For content.rdf, the tags <mediadate>, <type>, <uksite>, <age> and <priority>
are ignored because they are present only in a fraction of the data.
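
Skipping tags in a SAX handler might look like the sketch below. Only the list
of ignored tags is taken from this README. The handler class, the parse_pages
helper and the simplified sample document are assumptions made for
illustration; the real content.rdf layout and dmoz2db's handler may differ.

```python
import xml.sax

# Tags named in the README as ignored for content.rdf.
IGNORED_TAGS = {"mediadate", "type", "uksite", "age", "priority"}


class ExternalPageHandler(xml.sax.ContentHandler):
    """Collect the text of child elements, skipping the ignored tags."""

    def __init__(self):
        super().__init__()
        self.pages = []       # one dict per <ExternalPage>
        self._current = None  # dict for the page being read
        self._tag = None      # child tag whose text is being collected

    def startElement(self, name, attrs):
        if name == "ExternalPage":
            self._current = {"about": attrs.get("about")}
        elif self._current is not None and name not in IGNORED_TAGS:
            self._tag = name
            self._current[name] = ""

    def characters(self, content):
        # Text may arrive in several chunks, so append instead of assigning.
        if self._tag is not None:
            self._current[self._tag] += content

    def endElement(self, name):
        if name == "ExternalPage":
            self.pages.append(self._current)
            self._current = None
        elif name == self._tag:
            self._tag = None


def parse_pages(data):
    """Parse XML bytes and return the collected pages."""
    handler = ExternalPageHandler()
    xml.sax.parseString(data, handler)
    return handler.pages


# Simplified, made-up sample; element names are illustrative only.
SAMPLE = b"""<RDF>
  <ExternalPage about="http://example.org/">
    <Title>Example</Title>
    <priority>1</priority>
    <topic>Top/Computers</topic>
  </ExternalPage>
</RDF>"""

print(parse_pages(SAMPLE))
```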

NB (john): there was a spurious HTML entity hiding in content.rdf; not sure how
typical it is. After struggling with the SAX parser for ages, I finally solved
the problem by removing the entity:
perl -pi -e 's~&#11;~~gs' content.rdf.u8
Apologies to perl and python.
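
For those who prefer to stay in Python, the same cleanup can be sketched as
follows. &#11; refers to the vertical tab character, which XML 1.0 does not
allow, and that is presumably why the SAX parser choked on it. strip_entity is
a hypothetical helper, and unlike perl -pi it writes a new file instead of
editing in place.

```python
def strip_entity(src_path, dst_path, entity=b"&#11;"):
    """Copy src_path to dst_path, dropping every occurrence of `entity`.

    Works line by line in binary mode, so large dumps are never loaded
    into memory at once. Returns the number of occurrences removed.
    """
    removed = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:
            removed += line.count(entity)
            dst.write(line.replace(entity, b""))
    return removed
```

A call such as strip_entity("content.rdf.u8", "content.clean.rdf.u8") would
then produce the cleaned copy; the output file name is only an example.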