This is a short and simple post showing how to alter (add, remove or edit) a calculated field in an SSAS cube without redeploying the whole project, a useful technique when you do not have the SSAS cube project handy or want to implement a quick change on a live cube.
Until now, I have found working with Kerberos when setting up a SQL Server stack to be a nightmarish experience, mainly for two reasons:
- Working with Kerberos usually requires Active Directory access rights for the account setting up the authentication protocol on the stack, in order to effectively diagnose the setup, configure the Service Principal Names (SPNs) for the various SQL Server and SharePoint service accounts, and set up delegation. This means SQL Server architects and network administrators need to collaborate to configure the stack correctly, which is often an unpleasant and long-winded exercise in trial and error.
- The lack of centralized diagnostic and configuration tools for Kerberos on SQL Server makes the task very tedious, particularly if you follow the limited number of online resources out there, find that they do not apply exactly to your situation or do not work as intended after the lengthy steps, and are then left with very few options for diagnosing exactly what went wrong.
Excel 2013 brings an array of new and exciting features to the fingertips of data analysts, ranging from a shiny new visualization and exploratory data analysis platform (PowerView) to a number of new pivoting features, as well as a powerful in-memory data modelling engine (PowerPivot) enabled by default.
Among all these features, the new Excel delivers a few different options for visualizing geographical and location-based data, each visualization technique serving a different purpose (with a specific set of features) or targeting a particular segment of the overall Excel user base. This is a short post introducing some techniques for visualizing geographic information in Excel.
This is a quick post to describe a solution to a problem I was having with my local installation of the HDInsight Hadoop cluster.
Hashing can be a very useful technique when dealing with the storage and lookup of large text fields (say a table of URLs or search keywords). Used directly in DML statements, whether filtered on or aggregated by, these fields incur high resource utilization on any database engine, and any index built on them is costly to maintain, if it is possible at all given that SQL Server limits index key size to 900 bytes.
Using hashing functions, we can make large textual data easier to handle in the relational engine, improving performance when these fields are compared to satisfy a query. Hashes can also be used to build unique and non-unique indexes that are easier to manage than indexes defined directly on the text fields. In this post we will discuss a few options for hashing large text data using functions native to SQL Server, as well as external hashing algorithms that can be integrated into SQL Server (or any RDBMS for that matter) and might offer better practical performance.
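The post itself works with SQL Server's hashing options, but the underlying pattern is the same anywhere: compare a short, fixed-length digest first, then verify the full text to rule out collisions. The Python sketch below is only a minimal illustration of that lookup-by-hash-then-verify idea, with made-up names and toy data, not the T-SQL discussed in the post.

```python
import hashlib
from collections import defaultdict

def digest(text: str) -> bytes:
    """Fixed-length (32-byte) key for an arbitrarily long text value,
    analogous to what a native function like HASHBYTES('SHA2_256', ...)
    would return inside SQL Server."""
    return hashlib.sha256(text.encode("utf-8")).digest()

class HashedLookup:
    """Index large text values by their digest; verify the original text
    on lookup, since different texts can (rarely) share a hash."""

    def __init__(self):
        self._buckets = defaultdict(list)  # digest -> list of (text, payload)

    def add(self, text, payload):
        self._buckets[digest(text)].append((text, payload))

    def find(self, text):
        # Cheap fixed-length comparison first, full-text check second.
        return [p for t, p in self._buckets[digest(text)] if t == text]

# Hypothetical usage: look up a long URL by its digest rather than the raw string.
urls = HashedLookup()
urls.add("https://example.com/some/very/long/url?with=many&query=params", {"id": 1})
print(urls.find("https://example.com/some/very/long/url?with=many&query=params"))
```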
In one corner we have Hadoop, a massively distributed, JVM-based data processing engine with a Map & Reduce API and a proven track record in handling huge data-sets. In the other corner we have SSIS, a natively non-distributed ETL engine that is part of the SQL Server tool-set, with .NET code extensibility and a drag-and-drop UI (for the most part anyway). Two sweet technologies that probably shouldn't be compared to each other, but we're doing it anyway, pitting them head to head against a data mapping task to the death (or at least until my test VMs are recycled)… Now FIGHT!
Recently I have been involved in researching and building a low-latency, high-volume OLAP environment for a social entity and interaction analysis platform: the perfect mixture of concepts such as Big Data collection and processing, large-scale Network Analysis, Natural Language Processing (NLP) and a highly scaled-out OLAP environment for end users to explore and discover data (essentially a Self-Service and Exploratory BI layer).
It is by no means an easy mission to orchestrate all the technologies behind those concepts, particularly if you want the optimum solution for each problem at hand. For example, Big Data might be better handled by a Hadoop layer, but Hadoop and Hive (at least on their own) are not geared up to respond to OLAP queries, which are real-time by nature; and even if they were, your end users need familiar tools and interfaces to analyse and study this data, which is where SQL Server Analysis Services and the wider Microsoft BI stack come in, offering great integration with existing business applications (such as Office or SharePoint).
This post discusses a few architectural approaches to exposing a Hadoop layer through a SQL Server Analysis Services (SSAS) interface, with reference to data latency, redundancy and overall performance.
In a previous post I described how to convert an SSRS graph into a Highcharts graph by consuming the report's XML output from the SSRS Web Service and converting it into input for Highcharts.
In this article I discuss some methods you could adopt to improve the accuracy of your text classifier. I've taken a generalized approach, so the recommendations here should apply to most text classification problems you are dealing with, be it Sentiment Analysis, Topic Classification or any other text-based classifier. This is by no means a comprehensive list, but it should provide a nice introduction to the subject of optimising text classification algorithms.
One way to increase the accuracy of a classification algorithm is to allow it to return an "Unknown" value, particularly when the probability that the item being classified belongs to any one class is too low and the algorithm is essentially guessing an answer, leading to incorrect classifications.
In this post I will explore a method for researching and implementing the "Unknown" result in your classifier based on the probability distribution produced by a classification. The idea is to give you the tools to find the optimum threshold, the one that gives you the best accuracy while maintaining an acceptable level of overall coverage of the data.
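To make the idea concrete, here is a minimal, hypothetical Python sketch (not the exact method from the post): the top predicted class is returned only when its probability clears a threshold, everything else becomes "Unknown", and a sweep over candidate thresholds exposes the accuracy-versus-coverage trade-off. The function names and toy scores are made up for the illustration.

```python
def predict_with_unknown(class_probs, threshold):
    """Return the most probable label, or 'Unknown' when the classifier
    is not confident enough (top probability below the threshold)."""
    label, prob = max(class_probs.items(), key=lambda kv: kv[1])
    return label if prob >= threshold else "Unknown"

def sweep_thresholds(scored_items, thresholds):
    """For each candidate threshold, report accuracy on the items we do
    classify and coverage (the share of items not labelled 'Unknown')."""
    results = []
    for t in thresholds:
        decided = [(predict_with_unknown(probs, t), truth) for probs, truth in scored_items]
        kept = [(pred, truth) for pred, truth in decided if pred != "Unknown"]
        coverage = len(kept) / len(decided)
        accuracy = (sum(pred == truth for pred, truth in kept) / len(kept)) if kept else 0.0
        results.append((t, accuracy, coverage))
    return results

# Toy example: each item is (class probabilities from the classifier, true label).
scored = [
    ({"positive": 0.92, "negative": 0.08}, "positive"),
    ({"positive": 0.55, "negative": 0.45}, "negative"),
    ({"positive": 0.51, "negative": 0.49}, "positive"),
]
for t, acc, cov in sweep_thresholds(scored, [0.5, 0.6, 0.7, 0.9]):
    print(f"threshold={t:.1f}  accuracy={acc:.2f}  coverage={cov:.2f}")
```

Raising the threshold typically improves accuracy on the items you do classify, at the cost of labelling more of the data as "Unknown"; a sweep like this is one way to pick the point where that trade-off is acceptable for your problem.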