Hadoop is a pretty neat set of tools for processing loads of data in a distributed, parallel and easy to scale-out manner, and so rightfully the Hadoop toolset owns a pretty high position in the data analysis and BI game, and a must consider when embarking on any new big data project. But that being said, the Hadoop eco-system, however advance in many areas, is still away from being a complete end to end BI solution, particularly when it comes to offering support for emerging data analysis and business intelligence concepts, such as exploratory data analysis and real-time data querying, or even fully-integrated data visualization and report authoring tools.
But when looking at solution providers such as Oracle and Microsoft, who delivers mature BI products that went through years of feedback, refinement and improvement, suddenly the idea of having a hybrid BI environment becomes very attractive.
So a healthy step towards allowing the integration of the Hadoop eco-system with Microsoft’s BI infrastructure is connecting SQL Server with Hadoop, or specifically Hive, and subsequently SQL Server Analysis Service (SSAS). I have talked in the past about the various BI architectures for integrates Hive and Analysis Services, this is an expansion of the “All Data in Hadoop, OLAP Data in SSAS” architecture, showing exactly how to setup the Linked Server and views involved in this approach.
Step 1: Download Hive ODBC Drivers
This is pretty simple, just download and install the Hive ODBC Drivers from Microsoft, and remember to follow the steps to “Create a Hive ODBC Data Source“, which simply involves adding a new DSN (Data Source Name) on the server, this means you need to supply the Hive instance Host, Port number and other connection details.
After installation, you will be able to connect to and query Hive (using HiveQL) just like any other data source (using the DSN) in Excel, queries will be written and issued directly to Hive and turned into Map/Reduce jobs, with the results returned to the user as a table. The same ODBC driver can also be used to connect SQL Server to Hive.
Step 2: Create a Linked Server object
In order to build an SSAS cube with data from Hive in your Hadoop environment, we need to use SQL Server as an interface, which can be done through adding a Linked Server to the Hive instance. The reason for not connecting SSAS directly to Hive is due to some limitation in the way SSAS data-sources are setup, even though it can be circumvented, the approach is a bit more complicated.
Connect to the SQL Server instance you would like to query Hive on, and add a Linked Server object, the code should look something like this:
EXEC master.dbo.sp_addlinkedserver @server = '[linked-server-name]', @srvproduct='HIVE', @provider='MSDASQL', @datasrc='[hive-data-source]', @provstr='Provider=MSDASQL.1;Persist Security Info=True;User ID=[username];Password=[password]';
Remembering to replace:
- linked-server-name: The name you want to give your new linked server
- hive-data-source: The DSN you have setup in Step 1
- username: your username on the Hive instance
- password: password for username
Step 3: Create an OpenQuery View
In order to use the data in SSAS, we need to create a view based on an OpenQuery statement to the Hive linked server, which can look like this:
CREATE VIEW [view-name] AS SELECT * FROM OpenQuery([linked-server-name], 'SELECT * FROM [hive-table];')
The newly created view, based on the linked server connection to hive, can now be referenced in an SSAS project just like any other table or view.
This approach allows you to leverage the Hive engine directly, so any optimization done on the Hive layer, such as using Hive Views, should help improve the speed at which the data is extracted.
I really recommend reading the Klout paper in the resources below, particularly the section on Best Practices.
- Klout paper on MOLAP 2 Hive