Hello, this is Yu ISHIKAWA at ATL.
I am currently attending Spark Summit 2014, held from 2014/06/30 to 2014/07/02. Here I briefly summarize the announcements I heard on the first day. If you are studying Spark for the first time, I recommend watching the beginning of the keynote and the introduction to Databricks Cloud, which the Databricks company is now developing. In this article I will not discuss what Spark is; I will write about Spark itself later.
For those who have no idea what Spark is, this article may be hard to follow. There are two points I want you to take away from this article:
I summarized what Spark is in the slide below.
I am going to introduce some of the sessions that I think are important.
Spark’s Role in the Big Data Ecosystem
Mr. Matei Zaharia is the person who created Spark at AMPLab at UC Berkeley. He invented the Resilient Distributed Dataset (RDD), the core abstraction in Spark, and implemented Spark in Scala. For an explanation of RDDs, the paper "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing" is very easy to understand.
This announcement covered three aspects of Spark's current trends. First, Spark is one of the most active open source projects in the big data space. Second, there was an explanation of recently added components in Spark. Finally, they talked about their vision for Spark.
They compared Spark's community growth between 2013 and 2014. In 2013 there were 68 contributors from 17 companies; in 2014 there are 255 contributors from 50 companies. Version 1.0 was released in May 2014, and as of June 2014 there are 7 Hadoop distributors and over 20 Spark applications.
There were two big topics among the recent releases. One covers security, monitoring, and the qualities needed for operations at companies requiring high availability (HA). The other covers additional extensions to Spark's libraries, where there are three noteworthy points.
The first is Spark SQL, for SQL on Spark (unlike Shark); the second is MLlib, for machine learning on Spark; and the third is GraphX, for graph processing. Spark SQL is a technology for handling structured big data in Spark, which enables engineers to extract data more cleanly. For example, it can easily handle JSON objects, and you can easily write SQL user-defined functions (UDFs).
Next, several new machine learning algorithms have been added to MLlib for doing machine learning on Spark; examples include decision trees, SVD, PCA, and L-BFGS. Algorithms currently under development include LDA and non-negative matrix factorization. In addition, GraphX, which lets you do network analysis in Spark, has been extended.
For those who want to analyze big data, the most interesting point is the positioning Databricks wants Spark to have. Databricks aims to grow Spark into a unified big data infrastructure that can do many kinds of things. Data analysis in a company can be divided into three aspects: batch processing, interactive analysis, and real-time streaming processing. Their current goal seems to be to handle all three consistently and uniformly within Spark, and Databricks will provide this unified platform as open source through the Apache project.
Apache Spark and Databricks
For me, this was the most exciting announcement of the day. Databricks will provide Databricks Cloud as an infrastructure for big data analysis. Although its beta version has not been released yet, it looked great in the demonstration.
One of the most difficult points for companies that provide actual services is getting engineers, data scientists, and analysts to work well together without barriers. I think Databricks Cloud will lower those barriers. Since the beta version has not been released yet, I am curious to see whether it will actually deliver the expected features, but again, the demonstration was impressive. In a word, Databricks Cloud can be described as "Make Big Data Easy". Let me say a bit more.
The first three points do not sound so remarkable, but the last two are noteworthy features. I can only say "excellent" about the features and ideas behind Notebooks and Dashboards.
The difficulties companies have with data analysis have roughly two aspects: "interactivity" and "reproducibility". I cannot go into detail about them here, so please see the following article.
With Notebooks, you can write Spark / Shark processing in the browser, including streaming, SQL, machine learning, and network analysis. Moreover, you can instantly turn SQL results into bar graphs and line charts, and share the work with other people interactively in real time. In the demonstration, I saw charts updating within seconds through real-time collaboration and streaming processing. You can easily save such analysis results, along with the graphs produced by streaming processing, as a dashboard.
I will write more about other sessions on this blog at a later date.