There has been a lot of buzz about the level of integration Microsoft will offer between SQL Server and the Hadoop distributed processing framework. Ever since the public announcement during the SQL Server 2012 virtual product launch event, SQL Server developers have been rushing to try out Hadoop and understand how it fits within the data management ecosystem SQL Server has created.
What is Hadoop?
Hadoop is the answer Microsoft is currently backing for processing very large data sets, the product of the data explosion in sectors such as social media and general user-activity monitoring. The idea is simple: the more data you can collect about anything that can influence your business decisions, the more accurate those decisions can be. This is crucial in online advertising, for example, where client spend can run into millions of dollars per month and a 1% increase in conversion can be worth millions in revenue. Any information that can be collected about users (their sentiment towards the brand, their interests and activities, their friends’ interests, their income bracket, their marital status, and so on) can give the brand an edge over competitors, and lets it tailor campaigns or even services to get a better ROI from its demographic.
I took online marketing as an example of large data because of how traceable ROI and user behavior are in this sector. On-site behavior is traceable down to the mouse movements on the page (useful in usability analysis), but beyond that you could effectively buy a few years’ worth of Twitter data on most users, and with the right data-mining algorithms you could deduce anything from a user’s holiday schedule (to advertise holidays just ahead of their usual holiday time) to the fact that a user is going through a divorce (to advertise a divorce lawyer). This could get really scary, but it is very awesome for the people designing these cool algorithms.
Hadoop is essentially a way to go through these mountains of data and, through Map and Reduce functions, deduce information from the underlying unstructured or semi-structured data. This is achieved through a few key features:
- Hadoop Distributed File System (HDFS): Hadoop provides its own distributed file system, spread across the disks of the machines in the cluster. It is optimised for the type of activity to expect in big data analysis.
- Optimised for Write Once, Read Many Times: Hadoop’s DFS is optimised for data to be written once and then read multiple times; updates are highly discouraged. This is ideal for processing fact data that does not change at all, for example Twitter communication or web server logs.
- Map/Reduce Locale: In order to achieve high distributed throughput, Hadoop moves the Map and Reduce functions to the system where the data lives, rather than moving the data to where the code lives. This is usually referred to as data locality.
- Batched Processing: This is more a restriction than a feature, considering there is already real demand for processing continuous data in real time, at a much finer granularity than batch jobs allow.
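To make the Map and Reduce idea a little more concrete, here is a minimal single-machine sketch in Python. It is not Hadoop's API (real Hadoop jobs are typically written in Java and run across a cluster); the function names map_phase, shuffle, reduce_phase and run_job are made up for illustration, and the word-count task is just the usual toy example.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, standing in for what the
    framework does between the map and reduce steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse all values for one key into a single count."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence on an in-memory list of lines.
    (Hypothetical helper; a real cluster would run these steps in parallel
    on the nodes that hold the data.)"""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(run_job(["big data is big", "data is everywhere"]))
```

The point of splitting the work this way is that each map call only needs one line and each reduce call only needs one key, so both steps can be farmed out to whichever machine already holds that piece of the data.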
In essence, Hadoop on its own is an awesome, highly distributed processing engine geared towards expressive processing of large (very large) data. The technology itself is still young and there are many limitations, but since it’s an open-source project, there are also many cool extensions and related projects that offer highly promising concepts.
How Does Hadoop Fit with SQL Server 2012?
Microsoft sees Hadoop as a data processing layer, essentially sitting at the OLTP side of a database architecture and pushing data to a relational warehouse after it has been cleansed and processed for storage. Think of it as an SSIS package for processing very large data: extracting the required relational information and dumping it into a relational database model.
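As a rough illustration of that "cleanse, extract, dump into a relational model" flow, here is a small Python sketch. Everything in it is an assumption made for the example: the three-field log format, the page_hits table, and the use of an in-memory SQLite database as a stand-in for the SQL Server warehouse. It does not use any actual Hadoop or SQL Server connector.

```python
import sqlite3


def extract_row(line):
    """Turn one raw log line ("user page status") into a tuple,
    or return None if the line is malformed and should be dropped."""
    parts = line.split()
    if len(parts) != 3 or not parts[2].isdigit():
        return None
    user, page, status = parts
    return user, page, int(status)


def load(lines, conn):
    """Cleanse the raw lines and insert the surviving rows into a
    relational table. Returns the number of rows loaded."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS page_hits "
        "(user_name TEXT, page TEXT, status INTEGER)"
    )
    rows = [row for row in map(extract_row, lines) if row is not None]
    conn.executemany("INSERT INTO page_hits VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


if __name__ == "__main__":
    # In-memory SQLite stands in for the relational warehouse.
    conn = sqlite3.connect(":memory:")
    raw = ["alice /home 200", "not a valid line at all", "bob /cart 500"]
    print(load(raw, conn), "rows loaded")
```

In a real setup the extraction step would be the part running on Hadoop over the raw files, and only the much smaller, structured result would travel on to the relational database.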
In its current state, Microsoft’s stab at handling large data, which is mostly through its enterprise-level SQL Server data warehouse solution, is limited and not optimised to work with the unstructured nature of this raw data. Backing Hadoop therefore seems logical, and it is a move that has already been made by many relational database competitors (such as IBM and Oracle), although I do wonder how long this will last before Microsoft comes up with its own framework.