Dmoz (the Open Directory Project) holds a wealth of data about websites, along with a comprehensive list of categories. This has been built up through years of maintaining the directory (before and after being bought by Netscape, which AOL later acquired), and through being one of the most “sought after” pieces of real estate in terms of link building.
Recently I came across the Dmoz data through a classification research project I was working on. Essentially, we had a Naive Bayes classifier which we were trying to use to classify companies (through a description snippet) into categories, and then to extract which competitors of each company exist within the same category… Simples!
To import the Dmoz data into SQL Server, I used the Dmoz Data Importer solution by bodzebod. It is very good and did the job well for the Dmoz structure file, which contains all the category classifications, but bodzebod has not yet implemented the import of the Dmoz content file, which actually contains the data. This post presents a solution for importing the Dmoz content file into a SQL Server database through C#, building on the work of bodzebod.
First you will need to grab the Dmoz content and structure dumps, which they have kindly collated and made freely available (under some sort of license, I think, so check the terms before you use the data). Since this solution is complementary to bodzebod’s work, once you have downloaded these semi-large files you will need to download and run the Dmoz Data Importer (linked in the steps below) before starting on this project. As I said before, that tool only imports the Dmoz structure file, and has not yet been modified to import the more valuable content file.
This solution was built with VS2010 against SQL Server 2012, although I don’t reckon it relies on any 2012-specific features.
The code uses an XmlTextReader, which is much more efficient than loading the whole file into memory, and batches inserts into the database (the batch limit can be set in the App.config file). Unlike the bodzebod solution, it does not use intermediate files to process the DMOZ content file.
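To give a feel for the approach, here is a minimal sketch of streaming the content file and grouping records into batches. It is not the code from the download: the names ExternalPageRecord, DmozContentSketch, ReadPages and Batch are made up for illustration, and it assumes the content file stores each page as an ExternalPage element with an about attribute and d:Title, d:Description and topic children.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical record type; the field names are illustrative only.
public sealed class ExternalPageRecord
{
    public string Url;
    public string Title;
    public string Description;
    public string Topic;
}

public static class DmozContentSketch
{
    // Streams one ExternalPage at a time, so the file is never held in memory as a whole.
    // Assumes the element and attribute names described in the text above.
    public static IEnumerable<ExternalPageRecord> ReadPages(TextReader input)
    {
        using (var reader = new XmlTextReader(input))
        {
            ExternalPageRecord current = null;
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element)
                {
                    if (reader.LocalName == "ExternalPage")
                    {
                        current = new ExternalPageRecord { Url = reader.GetAttribute("about") };
                        if (reader.IsEmptyElement)
                        {
                            // A self-closing element has no end tag, so emit it now.
                            yield return current;
                            current = null;
                        }
                    }
                    else if (current != null)
                    {
                        // ReadString returns the text content of the current element.
                        switch (reader.LocalName)
                        {
                            case "Title": current.Title = reader.ReadString(); break;
                            case "Description": current.Description = reader.ReadString(); break;
                            case "topic": current.Topic = reader.ReadString(); break;
                        }
                    }
                }
                else if (reader.NodeType == XmlNodeType.EndElement
                         && reader.LocalName == "ExternalPage" && current != null)
                {
                    yield return current;
                    current = null;
                }
            }
        }
    }

    // Groups records into lists of at most batchSize, one list per database round trip.
    public static IEnumerable<List<T>> Batch<T>(IEnumerable<T> source, int batchSize)
    {
        if (batchSize < 1) throw new ArgumentOutOfRangeException("batchSize");
        var batch = new List<T>(batchSize);
        foreach (var item in source)
        {
            batch.Add(item);
            if (batch.Count == batchSize)
            {
                yield return batch;
                batch = new List<T>(batchSize);
            }
        }
        if (batch.Count > 0) yield return batch; // final, partially filled batch
    }
}
```

Each list coming out of Batch would then be written to the database in one go, for example with SqlBulkCopy or a parameterised INSERT. That part is left out here because it needs a live database, and the sketch makes no claim about which of those the importer actually uses.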
To get the solution working, follow these steps:
- Go to http://dmozimporter.codeplex.com/, download the solution, run the database scripts, and import the DMOZ structure file as per bodzebod’s instructions.
- Once you have completed all the steps in bodzebod’s solution, run DmozContentDBScripts.sql against the DMOZ database created by bodzebod’s solution. This will create two additional import tables (starting with Tmp) and two data tables: ExternalPage and ExternalPageToCategory.
- Change the App.config file to point to your database and the DMOZ content file on your file system.
- Run the bad boy!
Current known issues with DMOZ Content Importer:
1) Issues with character encoding for Russian, Japanese and Arabic languages (bodzebod had a solution for this).
This information is also in the README.txt file included with both the executable and the source code.
I really should merge this with bodzebod’s solution, particularly since they went to the effort of uploading it to Codeplex. This is something I will certainly be looking into at some point.