Archive for category: Coding
Hashing can be a very useful technique when dealing with the storage and lookup of large text fields (say, a table of URLs or search keywords). These fields incur high resource utilization on any database engine when used directly in DML statements that filter or aggregate on them. Any index built on these fields is costly to maintain, if it is possible at all, given that SQL Server limits the index key size to 900 bytes.
Using hashing functions, we can make large textual data easier to handle in the relational engine, which improves performance when these fields are compared to satisfy a query. Hashing can also be used to build unique and non-unique indexes that are easier to manage than indexes defined directly on the text fields. In this post we will discuss a few options for hashing large text data using functions native to SQL Server, and look at external hashing algorithms that can be integrated into Microsoft’s SQL Server (or any RDBMS, for that matter) and that might perform better in practice.
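To make the idea concrete outside of any particular database, here is a minimal Python sketch of looking up long text values through a fixed-size hash key. The names `text_hash` and `HashedTextIndex` are made up for illustration, and SHA-256 is just one possible choice of hash, not necessarily one of the options covered in the post. Because two different strings can in principle share a hash, the lookup still compares the full text before returning a match.

```python
import hashlib
from collections import defaultdict


def text_hash(text: str) -> bytes:
    """Return a fixed-size (32-byte) digest for an arbitrarily long string."""
    return hashlib.sha256(text.encode("utf-8")).digest()


class HashedTextIndex:
    """Hypothetical in-memory stand-in for a non-unique index on a hash column."""

    def __init__(self):
        # hash digest -> list of (original text, row id)
        self._buckets = defaultdict(list)

    def add(self, text: str, row_id: int) -> None:
        self._buckets[text_hash(text)].append((text, row_id))

    def find(self, text: str) -> list:
        # Seek on the short hash first, then confirm on the full text
        # so that hash collisions cannot produce false matches.
        candidates = self._buckets.get(text_hash(text), [])
        return [row_id for stored, row_id in candidates if stored == text]


index = HashedTextIndex()
long_url = "http://example.com/search?q=" + "keyword+" * 500  # well over 900 bytes
index.add(long_url, 1)
index.add("http://example.com/", 2)

print(len(long_url.encode("utf-8")), "bytes of text ->", len(text_hash(long_url)), "byte key")
print(index.find(long_url))
```

The same two-step pattern (filter on the hash, then confirm on the original text) is what keeps a hash-based index safe to use for equality lookups.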
Today I was trying to create a two-dimensional data structure that can be queried using string indices rather than integer ones. This is in Python, which I am a total newbie in (but am trying to write a research project with).
The idea was to find something native to Python rather than implement my own structure. Such a data structure is fundamental in programming, so it is very likely that an out-of-the-box implementation already exists in most languages. These structures are generally a dimensional extension of arrays (array of arrays), lists (list of lists) or dictionaries (dictionary of dictionaries), but with string rather than integer indexes.
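As a rough sketch of the dictionary-of-dictionaries approach, the standard library's `collections.defaultdict` gives a two-dimensional structure indexed by strings without any custom class. The row and column names below are invented for illustration, and this is only one of several ways to do it (a plain dict keyed by `(row, column)` tuples is another).

```python
from collections import defaultdict

# Outer keys are row labels, inner keys are column labels.
# Missing rows are created on first access as empty dicts.
scores = defaultdict(dict)

scores["alice"]["maths"] = 90
scores["alice"]["physics"] = 75
scores["bob"]["maths"] = 60

# Look up a single cell by two string indices.
print(scores["alice"]["maths"])

# Use .get() on the inner dict to avoid a KeyError for a missing column.
print(scores["bob"].get("physics", 0))

# Iterate over the whole structure row by row.
for row_label, columns in scores.items():
    for column_label, value in columns.items():
        print(row_label, column_label, value)
```

One thing to be aware of: reading `scores["carol"]` creates an empty row for "carol" as a side effect, so use `"carol" in scores` when you only want to test for existence.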
NLTK (Natural Language Toolkit) is a Python library that allows developers and researchers to extract information and annotations from text, and run classification algorithms such as Naive Bayes or Maximum Entropy, as well as many other interesting natural language tools and processing techniques.
The Maximum Entropy algorithm in NLTK comes in different flavours. This post introduces the Max Ent classification flavours supported by the NLTK library, and provides a MEGAM binary compiled on a 64-bit Linux (Ubuntu) machine, which is required for running NLTK Max Ent classification with the megam algorithm.
Dmoz (the Open Directory Project) has a wealth of data on websites, as well as a comprehensive list of categories. This has been built up through years of maintaining the directory (before and after being acquired by Netscape, and later AOL), and it is one of the most “sought after” pieces of real estate in terms of link building.
Recently I came across Dmoz data through a classification research project I was working on. Essentially, we had a Naive Bayes classifier that we were trying to use to classify companies (through a description snippet) into categories, and then extract which competitors of a company exist within the same category… Simples!
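For readers who have not seen one, the following is a minimal sketch of a Naive Bayes text classifier of the kind described above, written in plain Python with add-one (Laplace) smoothing. It is not the classifier used in the project; the class name `NaiveBayesText`, the categories and the training snippets are all invented for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesText:
    """Toy multinomial Naive Bayes over whitespace-separated words."""

    def __init__(self):
        self.class_counts = Counter()            # label -> number of snippets
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.vocab = set()

    def train(self, samples):
        for text, label in samples:
            self.class_counts[label] += 1
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.vocab.add(word)

    def classify(self, text):
        total = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, count in self.class_counts.items():
            # log P(label) + sum of log P(word | label), with add-one smoothing
            score = math.log(count / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in text.lower().split():
                score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical description snippets and categories.
training = [
    ("online bank offering loans and savings accounts", "Finance"),
    ("investment fund and retirement savings advice", "Finance"),
    ("software company building cloud hosting tools", "Technology"),
    ("mobile app and web software development studio", "Technology"),
]

classifier = NaiveBayesText()
classifier.train(training)
print(classifier.classify("cloud software for developers"))
print(classifier.classify("savings accounts and loans"))
```

Once every company has a predicted category, finding candidate competitors is a matter of grouping companies by that category.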
In order to import the Dmoz data into SQL Server, I resorted to the Dmoz Data Importer solution by bodzebod. It is very good and did the job well for the Dmoz structure file, which contains all the category classifications, but bodzebod has not yet implemented the import of the Dmoz content file, which actually contains the data. This post presents a solution for importing the Dmoz content file into a SQL Server database through C#, building on the work of bodzebod.
Bayesian Networks, particularly in their Naive form (or Idiotic, as some angry physicist might call it), are an absolutely amazing and intuitive way of reasoning with a probabilistic network model. The Bayesian model has been heavily used across a wide array of industries. Even though the Naive model is a very simplistic view of what an actual Bayesian model might look like, it is still a very practical approximation that has gained a lot of popularity in fields such as classification and segmentation. This post introduces a client library for running reasoning patterns on a custom-built Bayesian Network.
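To give a flavour of what a reasoning pattern looks like, here is a small Python sketch of evidential reasoning (from effect back to cause) on the simplest possible network, a single edge Rain -> WetGrass. This is not the client library from the post; the function name and all the probabilities are made up for illustration.

```python
def posterior_cause_given_effect(p_cause, p_effect_given_cause, p_effect_given_no_cause):
    """Bayes' rule on a two-node network: P(cause | effect observed)."""
    joint_with_cause = p_cause * p_effect_given_cause
    joint_without_cause = (1.0 - p_cause) * p_effect_given_no_cause
    return joint_with_cause / (joint_with_cause + joint_without_cause)


# Hypothetical numbers for the Rain -> WetGrass network.
p_rain = 0.2               # prior P(Rain)
p_wet_given_rain = 0.9     # P(WetGrass | Rain)
p_wet_given_no_rain = 0.1  # P(WetGrass | no Rain)

p_rain_given_wet = posterior_cause_given_effect(
    p_rain, p_wet_given_rain, p_wet_given_no_rain
)
print(round(p_rain_given_wet, 4))
```

Observing wet grass moves the belief in rain from the prior of 0.2 to roughly 0.69, which is the kind of update a larger network performs across many connected variables.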
This is more of a rant than anything useful, although I do give a link to an awesome jQuery library for zipping and unzipping files, so stick around!
Running PHP from C# is not a great way to go about building a stable code base, but sometimes it is a necessary evil.
I ran into this very issue recently when trying to convert a new PHP PageRank hashing algorithm into C#. The problem mainly stemmed from the fact that the code did a lot of bit shifting, which does not map easily from one programming language to another, so I thought I would share my experience of running PHP code from C#.
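To illustrate why shift-heavy code is awkward to port, here is a small Python sketch. Integer width differs between languages and platforms, so the same shift can give different results unless the width is pinned down explicitly. Python integers are unbounded, which makes the masking easy to see. The helper names are made up, and this is a generic illustration rather than the PageRank hashing code itself.

```python
MASK32 = 0xFFFFFFFF  # keep only the low 32 bits


def shl32(value: int, bits: int) -> int:
    """Left shift that discards overflow beyond 32 bits."""
    return (value << bits) & MASK32


def ushr32(value: int, bits: int) -> int:
    """Unsigned (zero-filling) right shift on a 32-bit value."""
    return (value & MASK32) >> bits


def to_signed32(value: int) -> int:
    """Reinterpret the low 32 bits as a signed two's-complement integer."""
    value &= MASK32
    return value - 0x100000000 if value & 0x80000000 else value


# Without a mask, the top bit survives the shift as bit 32.
print(hex(0x80000001 << 1))
# With a 32-bit mask, the top bit falls off the end.
print(hex(shl32(0x80000001, 1)))
# The same bit pattern read as unsigned versus signed.
print(0xFFFFFFFF, to_signed32(0xFFFFFFFF))
```

Any port has to decide, operation by operation, which of these behaviours the original code was relying on.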