Hadoop is a pretty neat set of tools for processing loads of data in a distributed, parallel and easy-to-scale-out manner. The Hadoop toolset rightfully holds a pretty high position in the data analysis and BI game, and is a must-consider when embarking on any new big data project. That being said, the Hadoop ecosystem, however advanced in many areas, is still some way from being a complete end-to-end BI solution, particularly when it comes to supporting emerging data analysis and business intelligence concepts such as exploratory data analysis and real-time data querying, or even fully integrated data visualization and report authoring tools.
Archive for category: Hadoop
This is a quick post to describe a solution to a problem I was having with my local installation of the HDInsight Hadoop cluster.
In one corner we have Hadoop, a massively distributed JVM-based data processing engine with a Map & Reduce API and a proven track record in handling huge data-sets. In the other corner we have SSIS, a natively non-distributed ETL engine that is part of the SQL Server family of tools, with .NET code extensibility features and a drag-and-drop UI (for the most part anyway). Two sweet technologies that probably shouldn’t be compared to each other, but we’re doing it anyway: pitted head to head on a data mapping task, to the death (or at least to the recycling of my test VMs)… Now FIGHT!
Recently I have been involved in researching and building a low-latency, high-data-volume OLAP environment for a social entity and interaction analysis platform. It is the perfect mixture of concepts such as Big Data collection and processing, large-scale Network Analysis, Natural Language Processing (NLP) and a highly scaled-out OLAP environment for end users to explore and discover data (essentially a Self-Service and Exploratory BI layer).
It is by no means an easy mission to orchestrate all the technologies that back those concepts, particularly if you are interested in using the optimum solution for the problem at hand. For example, Big Data might be better handled by a Hadoop layer, but Hadoop and Hive (at least on their own) are not geared up to respond to OLAP queries, which are real-time by nature. Even if they were, your end users need familiar tools and interfaces to analyse and study this data, which is where SQL Server Analysis Services and the whole Microsoft BI stack might come in and offer great integration with already existing business applications (such as Office or SharePoint).
This post discusses a few architectural approaches to exposing a Hadoop layer through a SQL Server Analysis Services (SSAS) interface, with reference to data latency, redundancy and overall performance.
A year after announcing their partnership and starting the collaboration project, Microsoft (SQL Server) and Hortonworks (Hadoop) have finally announced the result of this integration: Microsoft HDInsight Server and HDInsight Azure Service.
So what is HDInsight? Well, it is essentially Microsoft’s Hadoop-based distribution, built on top of the Hortonworks Data Platform. If you download Microsoft HDInsight Server for a local installation of the Hadoop distribution, you will end up with a local cluster with your own Hadoop Hive, able to run Hadoop jobs and to benefit from the already released Hadoop integration points with SharePoint and Excel. This is just so powerful!
There has been a lot of buzz going around about the level of integration Microsoft will offer between SQL Server and the Hadoop distributed processing framework. Ever since the public announcement during the SQL Server 2012 virtual product launch event, SQL Server developers have been rushing to try out Hadoop and understand how it fits within the data management ecosystem SQL Server has created.