Import Dmoz Content through C# to SQL Server
Dmoz (the Open Directory Project) has a wealth of data in relation to websites, as well as a comprehensive list of categories, this has been established through years of maintaining the directory (before and after being bought by Google), and being one of the most “sought after” real-estate in terms of link building.
Recently I came across Dmoz data through a classification research project I was working on, essentially we had a Naive Bayes classifier which we were trying to use to classify companies (through a description snippet) into categories, and then extract which other competitors of this company exist within the same category… Simples!
In order to import the Dmoz data into SQL Server, I resorted to using the Dmoz Data Importer solution by bodzebod, which although very good and did the job well for the Dmoz Structure files, which contains all the category classifications, bodzebod has not yet implemented the import of the Dmoz content file, which actually contains the data. This post presents a solution to importing Dmoz content file into a SQL Server database through C#, building on the work of bodzebod.
First you will need to grab the Dmoz content and structure dumps, which they have kindly collated and made freely available (I think under some sort of license though, so be careful). After downloading these semi-large files, and since this solution is complementary to bodzebod’s work, you will need to download and run the Dmoz Data Importer from the link above before starting on this project, as I said before this will only import the Structure dmoz file, and has not yet been modified to import the more lucrative Content file.
Click here to download the DMOZ Content Importer executable, and here is the DMOZ Content Importer Sourcecode
This solution is written in VS2010 and SQL Server 2012, although I dont reckon there are any 2012 specific features.
The code uses an XmlTextReader, which is much more efficient than placing the whole file in memory, and batches inserts into the database (batch limit can be set in App.config file), and does not utilize intermediate files to process the Content DMOZ file (unlike the bodzebod solution).
In order to get the solution to work, you need to do the following steps:
- Go to http://dmozimporter.codeplex.com/, download the solution, run the database scripts, and import the DMOZ structure file as per bodzebod’s instructions.
- Once you have completed all the steps in bodzebod’s solution, run the DmozContentDBScripts.sql on the DMOZ database created by bodzebod’s solution. This will create two additional import tables (starting with Tmp) and two data tables: ExternalPage and ExternalPageToCategory
- Change the App.config file to point to your database and the DMOZ content file on your file system.
- Run the bad boy!
Current known issues with DMOZ Content Importer:
1) Issues with character encoding for Russian, Japanese and Arabic languages (bodzebod had a solution for this).
This information is also available in the README.txt file available in both the executable and the sourcecode.
I really should merge this with bodzebod’s solution, particularly since they went through the effort of uploading it to Codeplex, this is something I will certainly be looking into at some point.
Import Dmoz Content through C# to SQL Server rated 5 out of 5 by 1 readers
Thanks for putting this together. By any chance, were you able to run the structure program without issue? I had a couple issues. First, it seems the path’s are hard-coded in the program, because updating the config file with my path didn’t do anything. I actually had to mount a H:\ drive in order to run it.
Second, at the end of the structure program, I was hit with this error:
Ouch! an error : System.Data.SqlClient.SqlException: Table ‘CategoryWithLocalize
dName’ does not exist.
This table doesn’t exist, only the view does. It seems the data imported fine, but I’m not sure yet as I haven’t had a chance to validate. Did you have any of these problems?
BTW – I’m running your content program now. So far, so good. I’ll post back if I run into any issues.
Thanks, glad someone is using the DMOZ Content Importer to be honest
I haven’t looked at bodzebod code for a while now, but I seem to remember running into the same issue regarding file path. I think the file path location is actually taken from the Settings.settings file (in the Properties folder of the DMOZ_Importer project), rather than from the App.Config file.
Regarding the error you received, am not entirely sure how that came about, there is no table in the database, only the view… you can either email bodzebod about it or (I would) ignore it unless something is broken :). The view looks like this (just incase you need to rebuild it).
dbo.Category.parentId, dbo.LocalizedName.name AS LocalizedName,
dbo.LocalizedName.idLanguage, dbo.Language.name AS LanguageName
FROM dbo.Category INNER JOIN
dbo.LocalizedName ON dbo.Category.idLabel = dbo.LocalizedName.id INNER JOIN
dbo.Language ON dbo.LocalizedName.idLanguage = dbo.Language.id
I hope the content program runs ok, as I mentioned I still haven’t fixed the issue with the character encoding, but if you do please do not hesitate to post the solution up here or on bodzebod’s Codeplex page.
It looks like the content importer worked most the way, but it looks like a procedure to move tmp data to final tables wasn’t performed. The two tmp tables have data (TmpExternalPage and TmpTopic), but the ExternalPage and ExternalPageToCategory tables are empty
At least the data is basically there, so I should be able to consume that for my test project.
I don’t recall seeing an error along the way, so cannot provide any info on that…
Thats pretty strange.
Technically the app should have called the database procedure: [dbo].[usp_Import_ImportContentData] in order to move data from the staging tables to live. You could try running that stored procedure manually instead?
Let me know if running the SP causes any issues!
Nice work, I think i’ve got everything I need for now to test dmoz data by getting some of it together. Had a few issues with bods bits which I will contact him about (similar to the last commenter).
Did have one issue running your sql script though:
Msg 195, Level 15, State 10, Procedure usp_Import_ImportContentData, Line 41
‘TRY_PARSE’ is not a recognized built-in function name.
Seemed really odd, especially as Google didnt through up people having similar issues – using sql express 2012 but couldnt see any reason why this wouldnt work.. any thoughts?
I got round it by removing the try_parse but then later the app fell over, presumably due to hitting something that didnt quite look like a date – so would be good to work this out to move forward.
Should also say – running that stored proc having removed the try_parse gave:
Msg 1105, Level 17, State 2, Procedure usp_Import_ImportContentData, Line 10
Could not allocate space for object ‘dbo.ExternalPage’.’PK__External__3214EC075BE2A6F2′ in database ‘dmoz’ because the ‘PRIMARY’ filegroup is full. Create disk space by deleting unneeded files, dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup.
But this may well be due to the face that I removed the truncate_only line in bods solution as that function isn’t in sql 2012 anymore.
Regarding the PRIMARY filegroup, it looks like it ran out of space, that is an easy fix, all you need to do is right click on your database, then select “Properties“, go to the “Files” tab, you will be able to see the database file associated with the PRIMARY filegroup, there you will need to set an “Autogrowth/Maxsize” value to allow the filegroup to expand. That should be it really!
There is also the possibility that you have hit the 10GB limit on SQL Server Express, if that is the case then you will need to empty some space on the DMOZ database (particularly the PRIMARY filegroup) to allow new data to be added, or split data on two databases… The total size of my DMOZ database is about 9.8 GB, almost hitting the SQL Express limit.
Hope this helps, let me know how you get on!
That is a strange issue indeed, TRY_PARSE is a SQL 2012 function and should work in Express edition without problems.
Can you try running a select the version variable on the server instance were the DMOZ database is located, like this:
And reporting back the results?
There are ways to get around the Try_Parse function by substituting SQL 08 functions, and returning NULL if the casting fails (or is unsuitable), but would’ve not thought you needed that in SQL 2012, this is rather odd!.