Latest revision: September 22, 2021

Ray - Fast and simple distributed computing

Ray is such a cool distributed computing library that it makes me weep for joy. Not only is it cool because batu + ray = baturay, it is also cool because it really simplifies distributed computing, especially for machine learning purposes. Thanks to Ray, you can distribute your jobs with much less effort and minimal lines of code.

Distributed computing is an important concept: it splits your computation job over multiple computers, nodes, or computing devices. It has two major benefits: scalability and redundancy.

Scalability: Based on the workload, you can simply add more machines, or remove some to make your application less costly.

Redundancy: Since we have multiple machines, if one machine fails, the other machines are still up and running to complete our computing job.

Distributed computing can help us a lot in machine learning as well. With Ray, you can do the following distributed computations:

  • Distributed data loading and transformation: Like pandas dataframes, Ray has its own dataset abstraction called "Ray Datasets". For example, if a dataset has 3000 rows, you can simply split it into 3 Ray objects with 1000 rows each to ease the distributed computation. You can then apply a custom data pipeline to each object and run it over multiple machines; Ray automatically handles the distributed computing for you. If you need more advanced transformations afterwards, you can still convert a Ray dataset into a pandas or Dask dataframe. Currently, Ray Datasets are in beta but very promising. You can check the details about the datasets from here.
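To make the partitioning idea concrete, here is a rough plain-Python sketch (not the actual Ray Datasets API) of splitting 3000 rows into three 1000-row partitions and transforming each partition independently; these independent pieces of work are exactly what Ray can schedule across machines:

```python
# Illustrative sketch in plain Python, NOT Ray's actual internals:
# split 3000 rows into 3 partitions of 1000 rows each, then apply the
# same transformation to every partition independently.

rows = list(range(3000))          # stand-in for a 3000-row dataset
partition_size = 1000

partitions = [rows[i:i + partition_size]
              for i in range(0, len(rows), partition_size)]

def transform(partition):
    # any per-partition pipeline step, e.g. scaling each value
    return [value * 2 for value in partition]

# Ray would run these transforms in parallel on different machines;
# here they just run sequentially to show the shape of the work.
transformed = [transform(p) for p in partitions]

print(len(partitions))     # 3 partitions
print(len(transformed[0])) # 1000 rows each
```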


  • Distributed tasks: If you have an array of objects and you need to process each element, you may also want to use distributed computing. Let's say there are 4 objects to process and you have 4 machines; in this case, each machine can handle one job. A very simple example is shared below:
import ray

ray.init()

@ray.remote
def f(i):
    return i

futures = [f.remote(i) for i in range(4)]
print(ray.get(futures))  # [0, 1, 2, 3]
  • Distributed object creation and instance modification: Ray can be applied to Python classes as well, to create objects via multiple machines. If you want to create 4 objects and modify their state on 4 different machines, have a look at the following example:
import ray

ray.init()

@ray.remote
class Counter(object):
    def __init__(self):
        self.n = 0
    def increment(self):
        self.n += 1
    def read(self):
        return self.n

counters = [Counter.remote() for i in range(4)]
[c.increment.remote() for c in counters]
futures = [c.read.remote() for c in counters]
print(ray.get(futures))  # [1, 1, 1, 1]
  • Hyperparameter tuning: Ray focuses more on the machine learning side, and thus it also has cool hyperparameter tuning functionality. It supports many ML frameworks like scikit-learn, XGBoost, Keras, TensorFlow, PyTorch, etc. In machine learning, we often need to do hyperparameter tuning to improve model performance, and it is an expensive job. Ray parallelizes each training job to find the best hyperparameter configuration within the provided search space. Not only that, it also has a very good feature called early stopping. If a hyperparameter configuration is not promising, i.e. has very poor performance in the first iterations or epochs, it doesn't continue and immediately jumps to the next configuration. So it prunes unnecessary computations and improves performance. As hyperparameter tuning strategies, Ray has several state-of-the-art methods like Population Based Training, BayesOptSearch, and HyperBand/ASHA.
from ray.tune.sklearn import TuneGridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification
import numpy as np

# Create dataset
X, y = make_classification(
    n_samples=11000,
    n_features=1000,
    n_informative=50,
    n_redundant=0,
    n_classes=10,
    class_sep=2.5,
)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=1000)

# Example parameters to tune from SGDClassifier
parameter_grid = {"alpha": [1e-4, 1e-1, 1], "epsilon": [0.01, 0.1]}

tune_search = TuneGridSearchCV(
    SGDClassifier(), parameter_grid, early_stopping=True, max_iters=10)

import time  # Just to compare fit times
start = time.time()
tune_search.fit(x_train, y_train)
end = time.time()
print("Tune GridSearch Fit Time:", end - start)
# Tune GridSearch Fit Time: 15.436315774917603 (for an 8 core laptop)
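The early-stopping idea itself is easy to sketch in plain Python. The toy snippet below is an illustration of the pruning principle, not Ray Tune's actual scheduler, and the `score` function is a made-up stand-in for a partial training run: screen every configuration cheaply first, keep only the promising ones, and spend the full training budget on the survivors.

```python
# Toy sketch of early stopping / pruning, NOT Ray Tune's real scheduler.
configs = [{"alpha": a} for a in (1e-4, 1e-3, 1e-2, 1e-1)]

def score(config, iterations):
    # Hypothetical stand-in for a partial training run:
    # in this toy setup, smaller alpha scores better.
    return iterations / (1 + 1000 * config["alpha"])

# Cheap screening pass: run every configuration for 1 iteration only.
screened = sorted(configs, key=lambda c: score(c, 1), reverse=True)

# Keep only the top half, then give the survivors the full 10-iteration budget.
survivors = screened[: len(screened) // 2]
final_scores = {c["alpha"]: score(c, 10) for c in survivors}

best_alpha = max(final_scores, key=final_scores.get)
print(best_alpha)  # 0.0001
```

Half of the configurations never receive the full budget, which is exactly why the pruning saves so much computation when each "iteration" is a real training epoch.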
  • Serving: You can serve a Python function or a stateful Python class via Ray as well. It also distributes the endpoints of your application automatically, and you can access them via a REST API. You can follow this link for an end-to-end tutorial on serving your application. You may ask why Ray and not Flask: the answer is that with a Flask app, you need to handle the distributed computing part yourself. TF Serving is only useful if you have TensorFlow models, so you are not flexible. SageMaker could be another option, but it is a bit of a black box and you are in the hands of Bezos. Ray offers you a simple deployment with high scalability in production environments. It can be very useful if you want to scalably deploy your machine learning models without using expensive cloud services (e.g. SageMaker). A simple example with a stateful class deployment is shared below. Also check the machine learning model deployments from here.
import ray
from ray import serve
import requests

client = serve.start()

class Counter:
    def __init__(self):
        self.count = 0
    def __call__(self, request):
        self.count += 1
        return {"count": self.count}

client.create_backend("my_backend", Counter)
client.create_endpoint("my_endpoint", backend="my_backend", route="/counter")

print(requests.get("http://127.0.0.1:8000/counter").json())
# > {"count": 1}
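To see what "a stateful class behind an HTTP endpoint" boils down to, here is a standard-library-only sketch. This is not Ray Serve; Ray adds replication, routing, and load balancing across machines on top of this basic idea:

```python
# Standard-library sketch of serving a stateful class over HTTP.
# NOT Ray Serve -- just an illustration of the single-node concept.
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class Counter:
    def __init__(self):
        self.count = 0
    def __call__(self):
        self.count += 1
        return {"count": self.count}

counter = Counter()  # the stateful object behind the endpoint

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = json.dumps(counter()).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)
    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

url = "http://127.0.0.1:%d/counter" % server.server_port
first = json.loads(urllib.request.urlopen(url).read())
print(first)  # {'count': 1}
server.shutdown()
```

With Ray Serve, the same `Counter` class would be deployed as a backend replicated across the cluster, with requests load-balanced between replicas instead of hitting one in-process object.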

Ray provides universal distributed computing for almost all frameworks. As shown in the examples, you just need a few more lines of code to make your solution distributed. The good part for us is that it is more focused on machine learning, and there are many good tutorials about machine learning applications. I hope you can find a good use case to implement Ray in your projects.


