Hello, this is Yu ISHIKAWA at ATL. I seem to have jet lag, though I would like to believe that I do not. Continuing from yesterday’s post, this report covers the highlights of the 2nd day of Spark Summit 2014.

  • Key technology trends in big data analysis
  • Vision on Spark
  • Hive on Spark

Key technology trends in big data analysis

As Mr. Aiaz Kazi of SAP explained in his talk, big data analysis systems need to scale out using distributed in-memory processing. Development of middleware that supports batch processing as well as real-time or near-real-time processing, such as stream processing, has become very active. Spark supports distributed in-memory processing on a cluster, and it supports stream processing through Spark Streaming.

Beyond these technological trends, another trend is data analysis services, for example ones that support collaboration on data analysis work within companies. Making fast, interactive analysis possible is yet another trend.

Vision on Spark

  • The Future of Spark
  • By Mr. Patrick Wendell (Databricks)

Mr. Patrick Wendell, a cofounder of Databricks, the company that leads the development of Spark, spoke about the vision for Spark. He will also give a talk at Hadoop Conference Japan 2014 on July 8, 2014, so please register for the event if you are interested and have not yet done so. My company, Recruit Technologies, is also one of the sponsors of the event.

The Spark project has several goals, and I will introduce two of them.

More freedom for data scientists and engineers

They aim to ease the burden on engineers by making the analytic environment easy to set up and highly maintainable, and on data scientists by making a faster analysis environment readily available. In addition to batch processing, Spark can support real-time analysis through an interactive analysis environment and through stream processing, all within one framework.
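To make the "one framework" idea a little more concrete, here is a minimal sketch in plain Python. It is not Spark's API: the function names (count_words, micro_batches, run_batch, run_streaming) and the sample lines are hypothetical, made up only for illustration. It assumes that a stream is handled as a series of small batches, so the same analysis logic can be written once and reused for both batch and stream processing.

```python
from collections import Counter
from typing import Iterable, Iterator, List


def count_words(lines: Iterable[str]) -> Counter:
    """Core analysis logic, written once and shared by both modes."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def micro_batches(stream: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group an incoming stream of lines into small batches (hypothetical helper)."""
    batch: List[str] = []
    for line in stream:
        batch.append(line)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch


def run_batch(lines: Iterable[str]) -> Counter:
    """Batch mode: process the whole data set in one pass."""
    return count_words(lines)


def run_streaming(stream: Iterable[str], batch_size: int = 2) -> Counter:
    """Streaming mode: apply the same logic to each small batch and merge the results."""
    running: Counter = Counter()
    for batch in micro_batches(stream, batch_size):
        running.update(count_words(batch))
    return running


# Made-up sample input, used only for this illustration.
LINES = ["spark streaming", "spark sql", "hive on spark"]

batch_result = run_batch(LINES)
stream_result = run_streaming(iter(LINES), batch_size=2)
```

In this sketch both modes end up with the same counts, because only the way the data arrives differs, not the analysis code. A real engine would of course also have to distribute the work across a cluster and recover from failures, which this toy example leaves out.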

In addition, Spark SQL, which lets you write Spark processing in SQL, was added in the recently released version 1.0. There was also an announcement about SparkR (SparkR: Interactive R programs at Scale), which runs Spark from R so that R’s third-party libraries can be executed on Spark.

Providing powerful standard libraries

MLlib, which runs machine learning on Spark, makes it possible to do a wide range of analytic processing on Spark. Its library for running decision trees on Spark was introduced on the 1st day in the talk Scalable Distributed Decision Trees in Spark MLLib.

As you can see, the goal is for Spark to deliver great performance as a core engine for big data analysis.

Hive on Spark

As one example of Spark as a core engine, the talk The Emergence of the Enterprise Data Hub announced ongoing development to run Apache Hive on Spark.

I suspect that much of the Hadoop-related batch processing run by many companies is implemented with Apache Hive. By letting Hive run on Tez and Apache Spark in addition to Hadoop MapReduce, they are going to provide a faster data analysis environment.

There was also an announcement about Tachyon, a fault-tolerant distributed file system that can keep data in memory. I cannot write everything down here, but I will attend Hadoop Conference Japan 2014 on July 8, 2014, so catch me there if you have any questions.