Hello. I’m Yu Ishikawa from the Advanced Technology Lab. Apache Spark™ 1.5, the latest version of Apache Spark and a project I am involved with through my work, was released just the other day. In this article I would like to write about what we were and were not able to accomplish in the development leading up to the release of SparkR 1.5, one of Spark’s components. Information on what is supported and how to use it is available elsewhere, so here I will focus on the development of SparkR, in which I was personally involved. Please note that I will also omit explanations of Spark itself and of Spark’s DataFrame.
During the development of Spark 1.5, the work involving SparkR fell broadly into two areas:
1. Determining the coding rules and applying static analysis tools
2. Adjustments around DataFrame
Determining the Coding Rules and Applying Static Analysis Tools
When I first read through the SparkR source that had been merged into Spark 1.4, I was, frankly, underwhelmed. One reason was that coding rules had not been established, and there was no automated checking with static analysis tools.
First, Spark development is basically hosted on GitHub and carried out through pull requests (“PRs”). When someone sends a PR, tests are run automatically on the project’s Jenkins. These runs include not only unit tests but also static analysis: scalastyle for Scala and PEP8 checks for Python. As of release 1.4, this mechanism did not cover SparkR.
One thing we were able to accomplish during the Spark 1.5 development was settling SparkR’s coding rules and wiring up static analysis tooling. In the end, the SparkR coding rules basically follow the conventions of Advanced R; Google’s R Style Guide also came up as a candidate during the discussion. The reason for following the Advanced R conventions was that it seemed sensible to take full advantage of jimhester/lintr, a static analysis tool that supports them.
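As a toy illustration of what such a check buys us (the file below is made up for this article, not taken from the SparkR code base), lintr reports style problems of roughly this kind when run over an R source file:

```r
# toy_example.R -- deliberately sloppy code; lintr, with its default linters,
# would typically flag issues like the ones noted in the comments.
x = 1                       # assignment with "=" instead of "<-"
add_one <- function( a ){   # space inside the parentheses, no space before "{"
  return(a+1)               # no spaces around the "+" operator
}
```

Running lintr::lint("toy_example.R") then lists each finding with its file and line number.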
With agreement reached on adopting the coding rules, let me write about actually applying the static analysis tools. Doing so involved several tasks:
- Send requests and PRs to lintr for functionality it was missing
- Update SparkR itself
- Create a script to be run on the project’s Jenkins
- Request changes to the Jenkins settings
- Coordinate between the lintr and Spark communities
As is normal in life, things did not go entirely as planned: lintr, the tool we had decided to use, could not satisfy all of our requirements for SparkR. For the missing features we thought we could implement ourselves, we sent PRs; for those we thought too difficult, we sent requests to the lintr author. The author was extremely kind and promptly reworked the code along the lines we proposed.
Once we had a lintr that satisfied our rules, we needed to write a script to run the static analysis and to have the settings changed so that it would run on the project’s Jenkins. The lintr version available on CRAN was out of date, with no update scheduled, so we had to install the latest version directly from GitHub. The Jenkins administrator installed lintr and updated the relevant settings.
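To give a concrete idea of what such a script can look like, here is a minimal sketch (my own simplification for this article, not the actual script in the Spark repository) that runs lintr over the SparkR package directory and fails the build when anything is reported:

```r
#!/usr/bin/env Rscript
# Minimal sketch of a lint script for CI (simplified for this article;
# not the script actually used by Spark's Jenkins).
library(lintr)  # assumed to be installed from GitHub (jimhester/lintr)

# Lint every R source file under the SparkR package directory.
lints <- lint_package(path = "R/pkg")

if (length(lints) > 0) {
  print(lints)
  quit(save = "no", status = 1)  # a non-zero exit status makes the CI job fail
}
```

The key design point is simply that the script’s exit status tells Jenkins whether the PR passes the style check.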
What we were not able to finish during the Spark 1.5 development is the lintr configuration, which is still incomplete. In addition, some rules are not yet fully checked by the static analysis, so I would like to address these points in Spark 1.6.
DataFrame Adjustments
Spark has an API called DataFrame, and it is supported not only in Scala, the core language, but also in Python, Java, and R without exception. Keeping the way users work with it as uniform as possible across such different languages is extremely difficult. For example, if you try to provide a function equivalent to IN in SQL, you find that in Python “in” is a reserved word and cannot be used… Here I would like to write about what we could and could not do while adjusting SparkR’s DataFrame during the 1.5 development.
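To make that example concrete, here is a small SparkR snippet written against the 1.5-era API (the session and data are made up for illustration). It uses the %in% operator on a Column, one of the additions from around this time, to play the role of SQL’s IN; PySpark instead exposes this as a method (isin) precisely because in is a Python keyword.

```r
library(SparkR)

# Hypothetical local session, using the 1.5-era initialization API.
sc <- sparkR.init(master = "local[2]", appName = "in-operator-example")
sqlContext <- sparkRSQL.init(sc)

df <- createDataFrame(sqlContext,
                      data.frame(name = c("a", "b", "c"),
                                 age  = c(19L, 23L, 31L)))

# SQL's IN, spelled with R's %in% operator on a Column.
kept <- filter(df, df$age %in% c(19, 31))
head(kept)

sparkR.stop()
```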
What we were able to do during the Spark 1.5 development was add support for more than 100 DataFrame functions in SparkR. Over 100 functions such as min and max were added to Spark’s DataFrame in 1.5, and in principle every one of them has to be ported to each language other than Scala.
In Python, for example, class methods have an attribute called __doc__, and by setting __doc__ you can attach documentation even to methods that are defined dynamically. Thanks to this mechanism, the port to Python can apparently be done fairly efficiently.
R does have mechanisms for defining functions and methods dynamically, but there is no way to attach documentation at the same time. In other words, it was rote transcription work: we ported the functions implemented in Scala over to R one by one so that the documentation could be kept in order. Using Scala reflection and the like, the definitions can be generated semi-automatically to some extent, but that approach does not mesh with how R objects are specified, so in the end it was manual work, double-checked by eye. We also received help from the authors of roxygen2, the package that generates the documentation, in getting the docs into shape.
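For a sense of what this hand porting looks like, here is a simplified sketch of one such definition, following the roxygen2-documented setMethod pattern used in SparkR’s functions.R (trimmed for this article; callJStatic and column are SparkR-internal helpers that call into the JVM and wrap the returned reference):

```r
#' min
#'
#' Aggregate function: returns the minimum value of the expression in a group.
#'
#' @rdname min
#' @export
setMethod("min",
          signature(x = "Column"),
          function(x) {
            # Call the corresponding static method on the Scala side and
            # wrap the returned Java object reference back into an R Column.
            jc <- callJStatic("org.apache.spark.sql.functions", "min", x@jc)
            column(jc)
          })
```

Multiply a block like this by more than a hundred functions, each with its own roxygen2 comment, and you get an idea of the volume of transcription involved.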
Conversely, what we were not able to do during the Spark 1.5 development was port every DataFrame feature: window functions and the handling of complex types are still unsupported. This is because there is still room for improvement in the type conversion between Scala and R. Since the core is written in Scala, results cannot simply be pulled out of Scala from R; instead, SparkR has a mechanism that exchanges data with Scala at the byte level and converts the types between the two. Until that mechanism is improved, window functions and conversions of complex types will not be possible, which is why I want to push to get them into the Spark 1.6 release.
Closing Thoughts
SparkR, and Apache Spark as a whole, depend on a great many external libraries, including the static analysis tool lintr and the documentation package roxygen2 described in this article. The friendly responses we received from the authors of these libraries to our questions and requests for advice were a real help. It was extremely gratifying not merely to send PRs, but to experience this style of contribution while coordinating with the maintainers of external libraries.