forked from JoKnopp/dmoz2db
jumpingGrendel/dmoz2db
dmoz2db is a tool that parses the RDF-like dumps from http://rdf.dmoz.org/rdf/ and loads their contents into a database. dmoz2db is tested with MySQL but should work with other databases as well.

IT COMES WITH ABSOLUTELY NO WARRANTY OF ANY KIND.

Instructions

To use dmoz2db you need to install sqlalchemy 0.6.5 or higher (http://www.sqlalchemy.org).

Your database must have utf8 support enabled. For MySQL, a description of how to do that is available here: http://cameronyule.com/2008/07/configuring-mysql-to-use-utf-8

The database where the dmoz data will be stored must be created manually:

    mysql> create database DATABASENAME;
    mysql> GRANT ALL ON DATABASENAME.* TO 'USERNAME'@'localhost';

After that, edit db.sample.conf according to your setup and save it as db.conf.

The database design can be found in the html pages in the doc folder.

Running

If the rdf files are present in your current directory you can just run

    ~/dmoz-dir/src $ python dmoz.py

but you may want to run

    ~/dmoz-dir/src $ python dmoz2db.py --help

first and look at the available options. Most of them should be self-explanatory.

If you are not interested in the complete dmoz dataset, you can specify a topic filter that ignores everything outside the given category, which speeds up the import. Take care with trailing slashes: 'Top/Computers' includes the category itself, while 'Top/Computers/' matches only what lies under that category. Every category whose father was filtered out gets the default father id 1.

Debug output should be turned on only in combination with the log file option, because every sql statement is printed.

The import will take time, so go to lunch or find something else to do :) And don't halloo till you're out of the wood: a first parse inserts the basic topic structure into the db, then the father ids are generated, and after that a second parse adds all the additional information such as related categories or other languages. Last but not least, the content.rdf file is parsed to add the externalpage information to the database.

On my laptop it took about 20 minutes to complete the first parse, 25 minutes to generate the father ids, 2:08 h for the second parse and 8 h for the content.rdf file, which adds up to about 11 h in total.

One dot in the output means 10,000 processed topics; a newline is printed after every 200,000 topics.

In the structure.rdf file, entries dealing with the last editor are ignored. For content.rdf the tags <mediadate>, <type>, <uksite>, <age> and <priority> are ignored because they are present in only a fraction of the data.

NB (john): there was a spurious html entity hiding in content.rdf, not sure how typical it is. After struggling with the SAX parser for ages, I finally solved the problem by removing the entity:

    perl -pi -e 's~~~gs' content.rdf.u8

apologies to perl and python.
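
The trailing-slash rule for the topic filter described above can be pictured as a plain prefix match on the topic path. The following is only a sketch under that assumption, not the code dmoz2db actually uses: keep_topic, father_id_for and DEFAULT_FATHER_ID are hypothetical names made up for illustration, and the real filter may compare paths differently.

```python
# Hypothetical sketch: assumes the topic filter is a simple prefix match.
DEFAULT_FATHER_ID = 1  # per the README: used when the father was filtered out


def keep_topic(topic, topic_filter):
    """Return True if the topic path starts with the filter string."""
    return topic.startswith(topic_filter)


def father_id_for(topic, kept_ids):
    """Look up the id of the father topic, falling back to the default.

    kept_ids maps topic paths that survived the filter to their ids.
    """
    if "/" not in topic:
        return DEFAULT_FATHER_ID
    father = topic.rsplit("/", 1)[0]
    return kept_ids.get(father, DEFAULT_FATHER_ID)


topics = ["Top", "Top/Computers", "Top/Computers/Software", "Top/Arts"]

# Without trailing slash: the category itself is kept as well.
with_category = [t for t in topics if keep_topic(t, "Top/Computers")]

# With trailing slash: only topics below the category are kept.
below_only = [t for t in topics if keep_topic(t, "Top/Computers/")]
```

In this sketch a filter without the trailing slash would also keep any sibling whose name merely begins with the same characters, which is one more reason to choose the slash deliberately.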
About
A database importer for the open directory project (aka dmoz) data
Languages
- Python 93.2%
- JavaScript 6.8%