On one corner we have Hadoop, a massively distributed JVM-based data processing engine with a Map & Reduce API and a proven track record in handling huge data-sets. On the other corner we have SSIS, a natively non-distributed ETL engine part of the SQL Server family tool-set with .NET code extensibility features and a drag and drop UI (for the most part anyway). Two sweet technologies, probably shouldn’t be compared to each other but we’re doing it anyway, pitted head to head against a data mapping task to the death (or at least to the recycling of my test VMs)… Now FIGHT!
Recently I have been tasked with building a data processing layer tracking social signals with the following characteristics:
- Input data is flat files. Although initially the amount of data might not be classified under “Big Data” per-say, but certainly had the potential to grow very quickly. Files were very small JSON format (1 KB average).
- Output data is flat files. Delimited file which will be queried through a Hive Warehouse layer.
- Data is only Mapped and not Reduced. Which means data is only extracted from the flat files and processed but never aggregated, and in any case SSIS is not capable of reducing (or aggregating) data in a scale-out architecture without building a custom intermediary layer (such as temporarily placing data in a database).
- Data Latency into Hive is of Paramount Importance.
Both technologies are capable of iterating through a large number of flat files, extracting information and building an output, and when we take the Reduce operation out of the equation, we level the playing field and now both technologies can be scaled out, albeit Hadoop in a perhaps more friendly manner.
Although these technologies have a wider application and usage that they might be better suited to, in this experiment I was only interested in performance figures on this basic task.
In order to test these technologies against the mapping task, I have built two test machines, one for SSIS with SQL Server to support the SSIS Catalogue database, and another for a simple 3 node Hadoop cluster, the technical specification for each scenario is as follows:
|Integration Service (SSIS)||Hadoop|
|CPU||4 Cores / Node||2 Cores / Node|
|RAM||8 GB / Node||3 GB / Node|
|Nodes||1 VM||3 VMs|
|OS||Windows Server 2012||CentOS|
|Edition||SQL Server 2012||Cloudera CDH 4|
Although the specifications for each test setup is slightly different, which makes the comparison fairly “unscientific”, the over-all processing resources available for each test scenario should be fairly comparable, with the Hadoop cluster gaining a slight edge in terms of over-all CPU cores and RAM. Besides, we are only looking for a really considerable difference in the result to warrant a favouritism of one technology over the other in this business requirement.
I ran two test scenarios:
- Scenario 1: 33,000 small (1KB) JSON input files, each file will have about 5 – 10 values to extract against a key (mapping).
- Scenario 2: 33 input files (every 1,000 files in scenario 1 is concatenated)
The results of the test were as follows:
|Scenario 1 (33,000)||Scenario 2 (33)|
|SSIS||14.5 (Min)||3.94 (Sec)|
|Hadoop Cluster||957 (Min)||134 (Sec)|
As can be deduced from the results above, 1 SSIS instance showed up to 66X better performance in handling and processing flat files than the same job running in a Hadoop cluster.
Learnings from SSIS vs Hadoop Test
There are a few key learnings that has been gained by doing this experiment:
- Hadoop has a terrible start time when operating on a file, the processing engine could take up-to 5 seconds before it could actually start processing the file, were SSIS takes less than 0.2 of a second. Java has never been a very agile language in my opinion.
- Hadoop is not intended to handle a large number of small files, instead try combining smaller files into bigger concatenations. Sometimes it is considerably faster to have a pre-processing step that concatenates files into smaller batches.
- Although the number of “Reducers” for a Hadoop job could be easily controlled, it is more difficult to control how many “Mappers” available for a job across the cluster, and Hadoop does not always adhere to the user-set number of Mappers.
- Although SSIS outperforms Hadoop by an average of 50X on this simple task, Hadoop scales in a much more user-friendly manner, and allows users to “Reduce” or aggregate the data across all nodes for a particular job, a feature that is not supported by the out-of-the-box Integration Service.
- Don’t just jump on new technologies, you need to test it and ensure that it is suitable for your particular business requirement, Hadoop is a great distributed processing engine when used in the correct context. It is too easy these days for managers and BI people to band around the term “Hadoop” for everything “Big Data”, from data processing to warehousing, but you need to take the time to separate the wheat from the chaff.
- HDInsight (Microsoft’s Hadoop distribution which runs on Windows and Azure) was another technology that we were investigating at the time, although performance was extremely terrible that it was eliminated from the race fairly quickly.