--- title: "R Interfaces to Apache Spark" author: "Luke Tierney" output: html_document --- ## Apache Spark [Apache Spark](https://spark.apache.org/) is described as a unified analytics engine for large-scale data processing. A [Wikipedia article](https://en.wikipedia.org/wiki/Apache_Spark) describes it as an an open-source distributed general-purpose cluster-computing framework There are (at least) two R interfaces available: - [`SparkR`](https://spark.apache.org/docs/latest/sparkr.html), a lower level interface from the Spark project. - [`sparklyr`](https://spark.rstudio.com/), a higher-level interface from RStudio. ## Some `sparklyr` Examples ### Installation and Startup You can install Spark and `sparklyr` with ```{r, eval = FALSE} install.packages("sparklyr") library(sparklyr) spark_install() ``` You may need to set the `JAVA_HOME` environment variable; something like this: ```{r, eval = FALSE} Sys.setenv(JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64/") ``` ```{r, include = FALSE} if (file.exists("/usr/lib/jvm/java-8-openjdk-amd64/")) Sys.setenv(JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64/") ``` Connect to a local Spark instance with ```{r} library(sparklyr) sc <- spark_connect(master = "local") ``` A web interface allows you to examine various aspects of the Spark instance. ```{r, eval = FALSE} spark_web(sc) ``` ### Adding Some Data This example used the `flights` table from the `nycflights13` package. The `flights` table can be copied to Spark with `copy_to`: ```{r} library(nycflights13) flights_tbl <- copy_to(sc, nycflights13::flights, "flights") ``` ### Data Base Operations with `dplyr` Spark provides a `dplyr` backend that can carry out data base operations specified by `dplyr` expressions: ```{r, message = FALSE} library(dplyr) delays <- summarise(group_by(flights_tbl, tailnum), count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE)) delays <- filter(delays, count > 20, dist < 2000, !is.na(delay)) ``` The `collect` function brings values from Spark back into R for further examination and visualization: ```{r, message = FALSE} delays <- collect(delays) library(ggplot2) ggplot(delays, aes(dist, delay)) + geom_point(aes(size = count), alpha = 1/2) + geom_smooth() + scale_size_area(max_size = 2) ``` ### Using Spark Machine Learning Tools ```{r} ad_delays_no_na <- filter(flights_tbl, ! is.na(arr_delay), ! is.na(dep_delay)) fit <- ml_linear_regression(ad_delays_no_na, response="arr_delay", features = "dep_delay") summary(fit) ``` ### Running R Code in Spark These work but take for ever with the full data set: ```{r} spark_apply(head(flights_tbl, 100), function(d) dplyr::count(d), group_by = "origin") spark_apply(head(flights_tbl, 100), function(d) broom::tidy(lm(arr_delay ~ dep_delay, data = d)), names = c("term", "estimate", "std.error", "statistic", "p.value"), group_by = "origin") ```