Today, we are pleased to announce the public preview of Apache Spark for Azure HDInsight. Apache Spark is an open source project in the Apache ecosystem that has been gaining in popularity. This post will give you the ins and outs of this new offering.
What is Apache Spark?
Apache Spark is an open source processing framework that runs large-scale data analytics applications in-memory. This allows Spark to deliver queries up to 100 times faster than traditional big data solutions, along with a common execution model for various tasks like extract-transform-load (ETL) processes, batch queries, interactive queries, real-time streaming, machine learning, and graph processing on data stored in Azure Storage.
What is Microsoft’s Offering for Spark?
Microsoft has been on a journey to make big data easy and more approachable. Today, Microsoft is expanding its Azure Big Data offering by announcing the availability of Apache Spark for Azure HDInsight. HDInsight eliminates much of the heavy lifting associated with deploying, managing and executing tasks on Spark, thus raising the bar on what it means to process big data in the cloud.
For customers, we have seen three specific scenarios in which Spark has been able to change the game:
- Make interactive queries over big data in Hadoop using BI tools or Open Source Notebooks
- Create a streaming solution for IoT or a real-time application
- Use machine learning algorithms to be able to predict outcomes in your analysis
Interactive queries over big data using BI tools or Open Source Notebooks
As more and more data is collected from a variety of sources, enterprises are eager to gain deep analytics about their business. However, existing big data technologies fall short in letting analysts and data scientists interactively explore large data sets and build BI models and reports over them. With the release of Spark for HDInsight, analysts and BI professionals can analyze large unstructured data sets and build reports with their BI tool of choice or with open source notebooks.
Using Power BI
With the availability of Spark, we will also announce the general availability of Power BI on July 24, which will include out-of-the-box connectors to Spark. Power BI is a cloud-based business analytics service that enables anyone to visualize and analyze data with greater speed, efficiency, and understanding. Users can start from unstructured or semi-structured data in Azure Storage, schematize the data using notebooks on Azure HDInsight, and build data models using Microsoft Power BI. The reports in Power BI are kept up-to-date with the auto-refresh feature.
Using your BI tool of choice
Spark for Azure HDInsight also has built-in connectivity to other BI or visualization tools. We have partnered with a number of third-party BI tool vendors including Tableau, SAP, and Qlik. Each of these companies offers a rich set of visualizations and report building capabilities that now support Spark for Azure HDInsight. Customers can connect to any Spark cluster in the HDInsight service and use Spark’s interactive query capabilities to visually explore terabytes of data.
Using open-source notebooks
Notebooks can also be used to visualize data in Spark for Azure HDInsight. Notebooks are open source tools that let data scientists combine live code, statistical equations, narrative text, and visualizations to tell a story about their data. We have enabled the popular Jupyter (IPython) and Zeppelin notebooks to run on Spark for Azure HDInsight. Jupyter comes with the standard IPython visual libraries and is ideal for those who code in Python. Zeppelin is ideal for those who write in Scala, and it also supports Spark SQL and Markdown.
Create a streaming solution for IoT or a real-time application
Beyond batch and interactive queries, Spark is also ideal for building real-time solutions that address challenges such as fraud detection, click stream analysis, financial alerts, and telemetry from connected sensors and devices (IoT). Customers using Spark for Azure HDInsight can use the out-of-the-box integration with Azure Event Hubs to ingest data and process it with Spark in near-real time. The Spark Streaming APIs can be used to write complex algorithms expressed with streaming functions like join and window. This makes Spark unique in its ability to handle both batch/interactive queries and streaming functions using the same common execution model. Finally, data can also be ingested from other sources like Kafka, Flume, Twitter, ZeroMQ, or TCP sockets. Customers can find these connectors as part of the open source Apache distribution.
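As a rough illustration of what a window function computes, the sketch below counts events per key over a trailing time window, the kind of aggregate a fraud detection rule might look at. It is plain Python rather than the Spark Streaming API, and the function name `windowed_counts`, the card identifiers, and the 60-second window are all hypothetical choices made for this example.

```python
from collections import defaultdict, deque


def windowed_counts(events, window_seconds):
    """Count events per key over a trailing time window.

    `events` is an iterable of (timestamp, key) pairs ordered by timestamp.
    For each event, the result holds (timestamp, key, count), where count is
    the number of events for that key in the interval
    (timestamp - window_seconds, timestamp].
    """
    windows = defaultdict(deque)  # key -> timestamps still inside the window
    results = []
    for timestamp, key in events:
        window = windows[key]
        window.append(timestamp)
        # Evict timestamps that have fallen out of the trailing window.
        while window[0] <= timestamp - window_seconds:
            window.popleft()
        results.append((timestamp, key, len(window)))
    return results


# Hypothetical card transactions: (seconds since start, card id).
events = [
    (0, "card-a"),
    (5, "card-a"),
    (20, "card-b"),
    (40, "card-a"),
    (58, "card-a"),
    (61, "card-a"),
]
counts = windowed_counts(events, window_seconds=60)
```

In a streaming engine the same idea is applied continuously as events arrive, and an alert could fire when a key's count within the window crosses a threshold.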
Use machine learning algorithms to predict outcomes in your analysis
As part of Spark, customers will also have access to Spark MLlib, a scalable machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as underlying optimization primitives. This will allow customers to incorporate predictive analytic capabilities into their applications. For customers who want to build more machine learning solutions, Azure Machine Learning is also an ideal choice for its easy-to-use experience and its ability to deploy an ML model in minutes as a fully managed web service.
Why choose Microsoft to run Spark?
Spark as an open source project in the Apache ecosystem has been gaining in popularity with many different offerings that support it. Microsoft has made a big bet on Spark by providing users with the best experience by putting the end user first, by hardening Spark for your mission critical application and by making Spark easy to deploy.
- Enterprise hardening of Spark for mission critical deployments: By integrating Spark with Azure, we are ensuring it’s ready to meet the demands of your mission critical deployments. Azure guarantees that you can run Spark with a 99.9% service level agreement at general availability to ensure continuity and protection against catastrophic events. Customers will have peace of mind with our 24/7 enterprise support and cluster monitoring to ensure you are always up and running. We have also enabled premium features not available in open source Spark, such as concurrent queries. This allows multiple queries from one person, or queries from various users and apps, to share the same cluster resources. Finally, we allow you to externalize all of the metadata content and save your notebooks, making the Spark cluster very close to stateless. This allows you to drop and recreate clusters and pick up where you left off.
- Ease of deployment: With Spark for HDInsight, there’s no time-consuming installation or setup. Azure does it for you. You’ll be up and running in minutes and can deploy Spark without buying new hardware or paying other up-front costs. As you need to scale, Azure allows you to create larger clusters of any size to process big data on demand. The choice is yours: you can pick a VM type that makes use of a lot of SSDs or a VM type with large amounts of RAM. While you run Spark, you can choose to cache data either in memory or on SSDs. This allows you to easily adjust resources across your apps to optimize for certain workloads.
How do I get started?
To get started, customers will need to have an Azure subscription or a free trial to Azure. With this in hand, you should be able to get a Spark cluster up and running in minutes by going through this getting started guide.
Also, head over to watch this Channel 9 video below on Azure Fridays:
- Announcement Blog by Microsoft CVP T. K. “Ranga” Rengarajan
- Spark Overview Service page on Azure
- Channel 9 Video on Azure Fridays
Documentation and How-To’s:
- Overview of Apache Spark for Azure HDInsight
- How to provision a Spark cluster
- Doing interactive data analysis with Spark
- Connecting BI tools to Spark
- Using Spark Streaming for Real-time applications
- Using Spark MLlib for Machine Learning
- About the Spark Job Server
- About the Spark Resource Manager
- ODBC Driver download link for Spark