Dmoz (the Open Directory Project) has a wealth of data in relation to websites, as well as a comprehensive list of categories, this has been established through years of maintaining the directory (before and after being bought by Google), and being one of the most “sought after” real-estate in terms of link building.
Recently I came across Dmoz data through a classification research project I was working on, essentially we had a Naive Bayes classifier which we were trying to use to classify companies (through a description snippet) into categories, and then extract which other competitors of this company exist within the same category… Simples!
In order to import the Dmoz data into SQL Server, I resorted to using the Dmoz Data Importer solution by bodzebod, which although very good and did the job well for the Dmoz Structure files, which contains all the category classifications, bodzebod has not yet implemented the import of the Dmoz content file, which actually contains the data. This post presents a solution to importing Dmoz content file into a SQL Server database through C#, building on the work of bodzebod.
Read more →