I am Ishikawa from the Advanced Technologies Laboratory at Recruit Technologies. I contributed to the writing of the book A Detailed Explanation of Apache Spark, released by Gijutsu-Hyohron Co., Ltd. on April 29, 2016. I am deeply grateful to everyone at Gijutsu-Hyohron Co., Ltd. who took part in reviewing this book. I am also thankful that, by consulting Spark professionals in Japan, we were able to put together a talented team of writers covering every component.

A Detailed Explanation of Apache Spark (詳解Apache Spark), published by Gijutsu-Hyohron Co., Ltd.

In addition to its core function of distributed processing, Apache Spark encompasses a variety of components, including SQL interfaces, machine learning, and stream processing. The writing for each of these components was divided among the team of authors of this book. Overall, the book is not merely an explanation of basic functions; it is structured to deepen understanding through demonstrations of practical use.

My chief pursuit is ordinarily the machine learning library MLlib, but in this book I was in charge of chapter 5, “DataFrame and Spark SQL.” I will leave it to my co-authors to promote the other chapters, and in this entry I will introduce only chapter 5.

The Target Readership for Chapter 5

In chapter 5, I explain how the DataFrame API and Spark SQL simplify the handling of structured data with Spark. The analysis of structured data is extraordinarily important, to the extent that one could say that cross-tabulation suffices for 90% of analyses. In writing this chapter, what I was conscious of was creating a structure that allows one to learn how to use the DataFrame API and Spark SQL thoroughly, with the basic functions explained as part of the flow of a more practical analysis. One can say that the goals of this chapter have been achieved if readers gain the following from it:

  1. Engineers who do not ordinarily perform data analysis can learn the basics of data analysis through Spark
  2. Data analysts can learn how to use Spark to perform the analyses they want to do
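
As a taste of how compactly the DataFrame API expresses the cross-tabulation mentioned above, here is a minimal sketch. The data and the column names "region" and "product" are hypothetical stand-ins of my own, not examples from the book.

```scala
// A spark-shell (Spark 1.6-era) sketch; `sc` is predefined in spark-shell.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Hypothetical records: which product was sold in which region.
val sales = sc.parallelize(Seq(
  ("east", "book"), ("east", "pen"), ("west", "book")
)).toDF("region", "product")

// One method call yields the contingency table of the two columns.
sales.stat.crosstab("region", "product").show()

// The same counts in long form, as an ordinary aggregation.
sales.groupBy("region", "product").count().show()
```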

The Structure of Chapter 5

The explanation in this chapter proceeds through the analysis of open data that has actually been released to the public. It covers everything from basic operations, such as reading and writing data and handling missing values, to integration with external systems, such as obtaining data through Apache Hive and JDBC. It also gives a brief but thorough explanation of the Dataset API that was experimentally added in Spark 1.6. After that, it shows how the DataFrame API can be used in more practical analytical scenarios by analyzing open data with the basic functions described up to that point. The chapter is structured as follows.

  • 5.1 DataFrame API and Spark SQL
  • 5.2 Data preparation
  • 5.3 Data processing using DataFrame API
  • 5.3.6 Operating DataFrame with SQL queries
  • 5.4 Performance tuning
  • 5.5 Linking with external systems
  • 5.6 Dataset API
  • 5.7 Analysis of DataFrame API using BAY AREA BikeShare
  • 5.8 Synopsis
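
To give a flavor of the basic operations covered in 5.2 through 5.3.6, here is a minimal sketch using the Spark 1.6-era API. The file paths and the column name "amount" are hypothetical stand-ins, not taken from the book.

```scala
// A spark-shell (Spark 1.6-era) sketch; `sc` is predefined in spark-shell.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Reading: JSON support is built in (CSV required the external
// spark-csv package in the 1.x series).
val df = sqlContext.read.json("/path/to/input.json")

// Handling missing values: fill nulls in a numeric column with 0.0
// (df.na.drop() would discard incomplete rows instead).
val cleaned = df.na.fill(0.0, Seq("amount"))

// Operating on a DataFrame with SQL queries: register a temporary table.
cleaned.registerTempTable("records")
sqlContext.sql("SELECT COUNT(*) FROM records").show()

// Writing: persist the result, for example as Parquet.
cleaned.write.parquet("/path/to/output")
```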

I have personally been dissatisfied with most books dealing with data analysis products, as they only explain functions and lack a perspective on how to perform practical analyses. One distinguishing characteristic of this chapter is that “5.7 Analysis of DataFrame API using BAY AREA BikeShare” describes which functions enable the flow of a practical analysis.
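
To illustrate the kind of flow I mean, here is a rough sketch of one exploratory step over trip data. The input path and the column names "start_station" and "duration" are hypothetical, not the actual schema used in section 5.7.

```scala
// A spark-shell (Spark 1.6-era) sketch; `sc` is predefined in spark-shell.
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._

val sqlContext = new SQLContext(sc)
val trips = sqlContext.read.json("/path/to/trips.json")

// Ride count and average trip duration per start station, sorted to
// surface the busiest stations first.
trips
  .groupBy("start_station")
  .agg(count("*").as("rides"), avg("duration").as("avg_duration"))
  .orderBy(desc("rides"))
  .show(10)
```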

Sample Code from this Book

The sample code from this book has been publicly released by Gijutsu-Hyohron Co., Ltd., the publisher of A Detailed Explanation of Apache Spark. I think that reading the sample code and actually running it will deepen one’s understanding of Spark.

Final Comments

I would once again like to thank @ueshin, who took part in the review of this chapter. While I have a little development experience relating to the public API side of the DataFrame API, my understanding does not go much deeper than that, and without that help it would have been impossible for me to finish the chapter in its present form. Thank you for all your help.