Scaling data science tasks for speed
Spark is great for scaling up data science tasks and workloads! As long as you’re using Spark data frames and libraries that operate on these data structures, you can scale to massive data sets that distribute across a cluster. However, there are some scenarios where libraries may not be available for working with Spark data frames, and other approaches are needed to achieve parallelization with Spark. This post discusses three different ways of achieving parallelization in PySpark:
- Native Spark: if you’re using Spark data frames and libraries (e.g. MLlib), then your code will be parallelized and distributed natively by Spark.
- Thread Pools: The multiprocessing library can be used to run concurrent Python threads, and even perform operations with Spark data frames.
- Pandas UDFs: A new feature in Spark that enables parallelized processing on Pandas data frames within a Spark environment.
I’ll provide examples of each of these different approaches to achieving parallelism in PySpark, using the Boston housing data set as a sample data set.
Before getting started, it’s important to make a distinction between parallelism and distribution in Spark. When a task is parallelized in Spark, it means that concurrent tasks may be running on the driver node or worker nodes. How the task is split across these different nodes in the cluster depends on the types of data structures and libraries that you’re using. It’s possible to have parallelism without distribution in Spark, which means that the driver node may be performing all of the work. This is a situation that happens with the scikit-learn example with thread pools that I discuss below, and should be avoided if possible. When a task is distributed in Spark, it means that the data being operated on is split across different nodes in the cluster, and that the tasks are being performed concurrently. Ideally, you want to author tasks that are both parallelized and distributed.
The full notebook for the examples presented in this tutorial is available on GitHub, and a rendering of the notebook is available here. I used the Databricks community edition to author this notebook and previously wrote about using this environment in my PySpark introduction post.
Before showing off parallel processing in Spark, let’s start with a single node example in base Python. I used the Boston housing data set to build a regression model for predicting house prices using 13 different features. The code below shows how to load the data set, and convert the data set into a Pandas data frame.
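The original snippet isn’t reproduced here, but a minimal sketch of this step might look like the following. Note that `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so this assumes an older scikit-learn release.

```python
import pandas as pd
from sklearn.datasets import load_boston  # removed in scikit-learn >= 1.2

# load the Boston housing data set: 506 rows, 13 features
boston = load_boston()

# convert the feature matrix to a Pandas data frame and
# append the label (median house price) as a 'target' column
boston_pd = pd.DataFrame(boston.data, columns=boston.feature_names)
boston_pd['target'] = boston.target
```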
Next, we split the data set into training and testing groups and separate the features from the labels for each group. We then use the LinearRegression class to fit the training data set and create predictions for the test data set. The last portion of the snippet below shows how to calculate the correlation coefficient between the actual and predicted house prices.
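A hedged reconstruction of this step is below; the split ratio and `random_state` are assumptions, not values from the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston  # requires scikit-learn < 1.2
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

boston = load_boston()
boston_pd = pd.DataFrame(boston.data, columns=boston.feature_names)
boston_pd['target'] = boston.target

# separate the features from the labels and split into train/test groups
X = boston_pd.drop('target', axis=1)
y = boston_pd['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# fit a linear regression model and predict prices for the test set
lr = LinearRegression()
lr.fit(X_train, y_train)
predictions = lr.predict(X_test)

# correlation coefficient between actual and predicted house prices
r = np.corrcoef(y_test, predictions)[0, 1]
print(r)
```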
We now have a task that we’d like to parallelize. For this tutorial, the goal of parallelizing the task is to try out different hyperparameters concurrently, but this is just one example of the types of tasks you can parallelize with Spark.
If you use Spark data frames and libraries, then Spark will natively parallelize and distribute your task. First, we’ll need to convert the Pandas data frame to a Spark data frame, and then transform the features into the sparse vector representation required for MLlib. The snippet below shows how to perform this task for the housing data set.
In general, it’s best to avoid loading data into a Pandas representation before converting it to Spark. Instead, use interfaces such as spark.read to directly load data sources into Spark data frames.
Now that we have the data prepared in the Spark format, we can use MLlib to perform parallelized fitting and model prediction. The snippet below shows how to instantiate and train a linear regression model and calculate the correlation coefficient for the estimated house prices.
When operating on Spark data frames in the Databricks environment, you’ll notice a list of tasks shown below the cell. This output indicates that the task is being distributed to different worker nodes in the cluster. In the single threaded example, all code executed on the driver node.
We now have a model fitting and prediction task that is parallelized. However, what if we also want to concurrently try out different hyperparameter configurations? You can do this manually, as shown in the next two sections, or use the CrossValidator class that performs this operation natively in Spark. The code below shows how to try out different elastic net parameters using cross validation to select the best performing model.
If MLlib has the libraries you need for building predictive models, then it’s usually straightforward to parallelize a task. However, you may want to use algorithms that are not included in MLlib, or use other Python libraries that don’t work directly with Spark data frames. This is where thread pools and Pandas UDFs become useful.
One of the ways that you can achieve parallelism in Spark without using Spark data frames is by using the multiprocessing library. The library provides a thread abstraction that you can use to create concurrent threads of execution. However, by default all of your code will run on the driver node. The snippet below shows how to create a set of threads that run in parallel and return results for different hyperparameters for a random forest.
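A reconstruction of this pattern is sketched below, using `ThreadPool` from `multiprocessing.pool`; the pool size and hyperparameter values are assumptions.

```python
from multiprocessing.pool import ThreadPool

import pandas as pd
from sklearn.datasets import load_boston  # requires scikit-learn < 1.2
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

boston = load_boston()
boston_pd = pd.DataFrame(boston.data, columns=boston.feature_names)
X_train, X_test, y_train, y_test = train_test_split(
    boston_pd, boston.target, test_size=0.3, random_state=0)

def run_forest(n_estimators):
    # everything in this function executes on the driver node
    rf = RandomForestRegressor(n_estimators=n_estimators, random_state=0)
    rf.fit(X_train, y_train)
    return (n_estimators, r2_score(y_test, rf.predict(X_test)))

# map the hyperparameter values over a pool of concurrent threads
pool = ThreadPool(10)
results = pool.map(lambda trees: run_forest(trees), [10, 20, 50, 100])
print(results)
```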
This approach works by using the map function on a pool of threads. The map function takes a lambda expression and array of values as input, and invokes the lambda expression for each of the values in the array. Once all of the threads complete, the output displays the hyperparameter value (n_estimators) and the R-squared result for each thread.
Using thread pools this way is dangerous, because all of the threads will execute on the driver node. If possible it’s best to use Spark data frames when working with thread pools, because then the operations will be distributed across the worker nodes in the cluster. The MLlib version of using thread pools is shown in the example below, which distributes the tasks to worker nodes.
One of the newer features in Spark that enables parallel processing is Pandas UDFs. With this feature, you can partition a Spark data frame into smaller data sets that are distributed and converted to Pandas objects, where your function is applied, and then the results are combined back into one large Spark data frame. Essentially, Pandas UDFs enable data scientists to work with base Python libraries while getting the benefits of parallelization and distribution. I provided an example of this functionality in my PySpark introduction post, and I’ll be presenting how Zynga uses this functionality at Spark Summit 2019.
The code below shows how to perform parallelized (and distributed) hyperparameter tuning when using scikit-learn. The first part of this script takes the Boston data set and performs a cross join that creates multiple copies of the input data set, appending a tree value (n_estimators) to each group. Next, we define a Pandas UDF that takes a partition as input (one of these copies) and returns a Pandas data frame specifying the hyperparameter value that was tested and the result (r-squared). The final step is the groupby and apply call that performs the parallelized calculation.
With this approach, the result is similar to the method with thread pools, but the main difference is that the task is distributed across worker nodes rather than performed only on the driver node.
There are multiple ways of achieving parallelism when using PySpark for data science. It’s best to use native libraries if possible, but depending on your use case there may not be Spark libraries available. In this situation, it’s possible to use thread pools or Pandas UDFs to parallelize your Python code in a Spark environment. Just be careful about how you parallelize your tasks, and try to also distribute workloads when possible.