Building A Machine Learning Platform With Kubeflow And Ray On Google Kubernetes Engine

  • aster.cloud
  • October 25, 2022
  • 10 minute read

Enterprises are increasingly adopting Machine Learning (ML) capabilities to enhance their services, products, and operations. As their ML capabilities mature, they build centralized ML Platforms to serve many teams and users across their organization. Machine learning is inherently experimental and requires repeated iteration. An ML Platform standardizes the model development and deployment workflow, offering greater consistency for this repeated process. This boosts productivity and reduces the time from prototype to production.

Every organization and ML project has unique requirements, and there are many options for ML Platforms. With Google Cloud, you can choose Vertex AI, a fully managed ML Platform, or use Google Kubernetes Engine (GKE) to build a custom one on self-managed resources. Vertex AI provides fully managed workflows, tools, and infrastructure that reduce complexity, accelerate ML deployments, and make it easier to scale ML in an organization. Some organizations prefer to build their own custom ML Platform instead, an approach that gives them the flexibility to meet highly specialized requirements around ML frameworks; typically, they do so to control specific resource utilization behaviors and infrastructure strategies.

For ML Platforms, Open Source Software (OSS) is an important driver of digital innovation. If you follow the evolution of ML technologies, you are probably aware of the ever-growing ecosystem of OSS ML frameworks, platforms, and tools. However, no single OSS library delivers a complete ML solution, so multiple OSS projects must be integrated to build an ML platform.


To start building an ML Platform, you should support the basic ML user journey from notebook prototyping to scaled training to online serving. If your organization has multiple teams, you may additionally need to support administrative requirements such as multi-user support with identity-based authentication and authorization. Two popular OSS projects – Kubeflow and Ray – together can support these needs. Kubeflow provides the multi-user environment and interactive notebook management. Ray orchestrates distributed computing workloads across the entire ML lifecycle, including training and serving.

Google Kubernetes Engine (GKE) simplifies deploying OSS ML software in the cloud with autoscaling and auto-provisioning. GKE reduces the effort needed to deploy and manage the underlying infrastructure at scale and offers the flexibility to use your ML frameworks of choice. In this article, we show how Kubeflow and Ray can be assembled into a seamless experience, and demonstrate how platform builders can deploy both to GKE to provide a comprehensive, production-ready ML platform.

Kubeflow and Ray

First, let’s take a closer look at these two OSS projects. While both Kubeflow and Ray deal with the problem of enabling ML at scale, they focus on very different aspects of the puzzle.

Kubeflow is a Kubernetes-native ML platform aimed at simplifying the build-train-deploy lifecycle of ML models. As such, its focus is on general MLOps. Some of the unique features offered by Kubeflow include:

  • Built-in integration with Jupyter notebooks for prototyping
  • Multi-user isolation support
  • Workflow orchestration with Kubeflow Pipelines (see the sketch after this list)
  • Identity-based authentication and authorization through Istio integration
  • Out-of-the-box integration with major cloud providers such as GCP, Azure, and AWS

Source: https://www.kubeflow.org/docs/started/architecture/
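
To give a flavor of workflow orchestration with Kubeflow Pipelines, here is a minimal sketch, assuming the KFP v1 Python SDK that ships alongside Kubeflow 1.5; the component and pipeline names are illustrative, not part of any setup referenced later in this article:

import kfp
from kfp import dsl
from kfp.components import create_component_from_func

# A trivial step; a real component would train or evaluate a model.
def add(a: float, b: float) -> float:
    return a + b

# Wrap the Python function as a reusable, containerized pipeline component.
add_op = create_component_from_func(add, base_image="python:3.7")

@dsl.pipeline(name="demo-pipeline", description="A minimal two-step pipeline.")
def demo_pipeline(a: float = 1.0, b: float = 2.0):
    first = add_op(a, b)
    add_op(first.output, b)  # Chain a second step on the first step's output.

# Compile to a package that can be uploaded in the Kubeflow Pipelines UI.
kfp.compiler.Compiler().compile(demo_pipeline, "demo_pipeline.yaml")

Each step runs as its own pod on the cluster, which is what makes the workflow reproducible and schedulable.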

Ray is a general-purpose distributed computing framework with a rich set of libraries for large-scale data processing, model training, reinforcement learning, and model serving. It is popular as a simple API for building and scaling AI and Python workloads. Its focus is on the application itself – allowing users to build distributed computing software with a unified and flexible set of APIs (a minimal sketch follows the list below). Some of the advanced libraries offered by Ray include:

  • RLlib for reinforcement learning
  • Ray Tune for hyperparameter tuning
  • Ray Train for distributed deep learning
  • Ray Serve for scalable model serving
  • Ray Data for preprocessing

Source: https://docs.ray.io/en/latest/index.html#what-is-ray
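
Underneath these libraries is a small core API. As a minimal sketch (assuming Ray 1.13, the version used later in this article), this is all it takes to fan work out across a cluster:

import ray

# Connect to a cluster if one is configured; otherwise start Ray locally.
ray.init()

@ray.remote
def square(x: int) -> int:
    return x * x

# Launch four tasks in parallel; futures are returned immediately.
futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))  # [0, 1, 4, 9]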

It should be noted that Ray is not a Kubernetes-native project. To deploy Ray on Kubernetes, the OSS community has created KubeRay, which is exactly what it sounds like – a toolkit for deploying Ray on Kubernetes. KubeRay offers a powerful set of tools, including custom resource APIs and a scalable operator. You can learn more about it here.

Now that we have examined the differences between Kubeflow and Ray, you might ask which one is right for your organization. Kubeflow's MLOps capabilities and Ray's distributed computing libraries are both independently useful, with different advantages. What if we could combine the benefits of both systems? Imagine having an environment that:

  • Supports Ray Train with autoscaling and resource provisioning
  • Is integrated with identity-based authentication and authorization
  • Supports multi-user isolation and collaboration
  • Contains an interactive notebook server

Let’s now take a look at how we can put these two platforms together and take advantage of the useful features offered by each. Specifically, we will deploy KubeRay in a GKE cluster with Kubeflow installed. The system looks something like this:

In this system, the Kubernetes cluster is partitioned into logically isolated workspaces called “profiles”. Each new user creates their own profile, which is a container for all of their resources in the Kubernetes cluster. Users can then provision their own resources within their designated namespace, including Ray clusters and Jupyter notebooks. If a user’s resources are provisioned through the Kubeflow dashboard, Kubeflow automatically places them in that user’s profile namespace.

Under this setup, each Ray cluster is protected by default with role-based access control policies (enforced by Istio) that prevent unauthorized access. This allows users to interact with their own Ray clusters independently of each other, and to share a Ray cluster with other team members.

For this setup, I used the following versions:

  • Google Kubernetes Engine 1.21.12-gke.2200
  • Kubeflow 1.5.0
  • KubeRay 0.3.0
  • Python 3.7
  • Ray 1.13.1

The configuration files used for this deployment can be found here.

Deploying Kubeflow and KubeRay

For deploying Kubeflow, we will be using the GCP instructions here. For simplicity, I used mostly default configuration settings. You can freely experiment with customizations before deploying; for example, you can enable GPU nodes in your cluster by following these instructions.

Deploying the KubeRay operator is pretty straightforward. We will be using the latest released version:

export KUBERAY_VERSION=v0.3.0
kubectl create -k "github.com/ray-project/kuberay/manifests/cluster-scope-resources?ref=${KUBERAY_VERSION}"
kubectl apply -k "github.com/ray-project/kuberay/manifests/base?ref=${KUBERAY_VERSION}"

This will deploy the KubeRay operator in the “ray-system” namespace in your cluster.

Creating Your Kubeflow User Profile

Before you can deploy and use resources in Kubeflow, you need to first create your user profile. If you follow the GKE installation instructions, you should be able to navigate to https://[cluster].endpoints.[project].cloud.goog/ in your browser, where [cluster] is the name of your GKE cluster and [project] is your GCP project name.

This should redirect you to a web page where you can use your GCP credentials to authenticate yourself.

Follow the dialogue, and Kubeflow will create a namespace with you as the administrator. We’ll discuss later in this article how to invite others to your workspace.

Build the Ray Worker Image

Next, let’s build the image we’ll be using for the Ray cluster. Ray is very sensitive to version compatibility (for example, the head and worker nodes must use the same versions of Ray and Python), so it is highly recommended to prepare and version-control your own worker images. Find the base image you want on Ray’s Docker Hub page: rayproject/ray – Docker Image.

The following is a functioning worker image using Ray 1.13 and Python 3.7:

FROM rayproject/ray:1.13.1-py37

RUN pip install numpy tensorflow

CMD ["/bin/bash"]

If you prefer GPUs over CPUs, here is the same Dockerfile for a GPU worker image:
FROM rayproject/ray:1.13.1-py37-gpu

RUN pip install numpy tensorflow

CMD ["/bin/bash"]

Use Docker to build and push both images to your image repository:
$ docker build -t <path-to-your-image> -f Dockerfile .
$ docker push <path-to-your-image>

Build the Jupyter Notebook Image

Similarly, we need to build the notebook image that we are going to use. Because this notebook will interact with the Ray cluster, it must use the same versions of Ray and Python as the Ray workers.
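
Since a mismatch only surfaces when the client connects to the cluster, a quick, hypothetical sanity check at the top of the notebook can save debugging time later:

import sys
import ray

# These values must match the Ray worker image built in the previous step.
assert ray.__version__.startswith("1.13"), f"Ray version mismatch: {ray.__version__}"
assert sys.version_info[:2] == (3, 7), f"Python version mismatch: {sys.version_info[:2]}"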

The Kubeflow example Jupyter notebooks can be found at Example Notebook Servers. For this example, I changed the PYTHON_VERSION in components/example-notebook-servers/jupyter/Dockerfile to the following:

ARG MINIFORGE_VERSION=4.10.1-4
ARG PIP_VERSION=21.1.2
ARG PYTHON_VERSION=3.7.10

Use Docker to build and push the notebook image to your image repository, similar to the previous step:
$ docker build -t <path-to-your-image> -f Dockerfile .
$ docker push <path-to-your-image>

Deploy a Ray Cluster

Now we are ready to configure and deploy our Ray cluster.

1. Copy the following sample yaml file from GitHub:

curl -L https://raw.githubusercontent.com/richardsliu/ray-on-gke/main/manifests/ray-cluster.serve.yaml -o ray-cluster.serve.yaml

2. Edit the settings in the file:

a. For the user namespace, change the value to match your Kubeflow profile name:

namespace: %your_name%

b. For the Ray head and worker settings, change the value to point to the image you built previously:

image: %your_image%

c. Edit resource requests and limits as required. For example, you can change the CPU or GPU requirements for worker nodes here:

resources:
  limits:
    cpu: 1
  requests:
    cpu: 200m
3. Deploy the cluster:
kubectl apply -f ray-cluster.serve.yaml

4. Your cluster should be ready to go momentarily. If you have enabled node auto-provisioning on your GKE cluster, you should see the cluster dynamically scale up and down according to usage. You can check the status of your cluster by running:
$ kubectl get pods -n <user name>
NAME                                       READY   STATUS    RESTARTS   AGE
example-cluster-head-8cbwb                 1/1     Running   0          12s
example-cluster-worker-large-group-75lsr   1/1     Running   0          12s
example-cluster-worker-large-group-jqvtp   1/1     Running   0          11s
example-cluster-worker-large-group-t7t4n   1/1     Running   0          12s

You can also verify that the service endpoints are created:
$ kubectl get services -n <user name>
NAME                       TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                                AGE
example-cluster-head-svc   ClusterIP   10.52.9.88   <none>        8265/TCP,10001/TCP,8000/TCP,6379/TCP   18s

Remember this service name – we will come back to it later.

Now our ML Platform is all set up, and we are ready to start training a model.

Training an ML Model

We are going to use a Notebook to orchestrate our model training. We can access Ray from a Jupyter notebook session.

1. In the Kubeflow dashboard, navigate to the “Notebooks” tab.

2. Click on “New Notebook”.
3. In the “Image” section, click on “Custom Image” and enter the path to the Jupyter notebook image that you built earlier.
4. Configure resource requirements for the notebook as needed. The default notebook uses half a CPU and 1 GB of memory. Note that these resources are only for the notebook session, not for training; later, we will use Ray to orchestrate resources at scale on GKE.

5. Click on “LAUNCH”.

6. When the notebook finishes deploying, click on “Connect” to start a new notebook session.

7. Inside the notebook, open a terminal by clicking File -> New -> Terminal.

8. Install Ray 1.13 in the terminal:
pip install ray==1.13

9. Now you are ready to run an actual Ray application, using this notebook and the Ray cluster you deployed in the previous section. I have made a .ipynb file based on the canonical Ray trainer example here.

10. Run through the cells in the notebook. The magic line that connects to the Ray cluster is:
ray.init("ray://example-cluster-head-svc:10001")

This should match the service endpoint you created earlier. If you have several different Ray clusters, you can simply change the endpoint here to connect to a different one.

11. The next few lines will start a Ray Trainer process on the cluster:
trainer = Trainer(backend="tensorflow", num_workers=4)
trainer.start()
results = trainer.run(train_func_distributed)
trainer.shutdown()

Note that we specify 4 workers here, which matches our Ray cluster’s number of replicas. If we change this number, the Ray cluster will automatically scale up or down according to resource demands.
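
The article does not reproduce train_func_distributed itself. As a condensed sketch in the spirit of the canonical Ray trainer example (the random in-memory data here is a stand-in for the real MNIST dataset), it looks something like this:

import numpy as np
import tensorflow as tf

def train_func_distributed():
    # Ray Train sets TF_CONFIG on each worker, so TensorFlow's
    # MultiWorkerMirroredStrategy can coordinate all four workers.
    strategy = tf.distribute.MultiWorkerMirroredStrategy()
    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Flatten(input_shape=(28, 28)),
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dense(10),
        ])
        model.compile(
            optimizer="adam",
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
            metrics=["accuracy"],
        )
    # Toy in-memory data; the real example trains on MNIST.
    x = np.random.rand(256, 28, 28).astype("float32")
    y = np.random.randint(0, 10, size=(256,))
    history = model.fit(x, y, epochs=3, verbose=0)
    return history.history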

Serving an ML Model

In this section, we will look at how to serve the machine learning model that we just trained.

1. Using the same notebook, wait for the training steps to complete. You should see some output logs with metrics for the model that we have trained.

2. Run the next cell:

serve.start(detached=True, http_options={"host": "0.0.0.0"})
TFMnistModel.deploy(TRAINED_MODEL_PATH)

This will start serving the model that we have just trained, using the same service endpoint we created before.

3. To verify that the inference endpoint is now working, we can create a new notebook. You can use this one here.

4. Note that we are calling the same inference endpoint as before, but on a different port:

import numpy as np
import requests

resp = requests.get(
    "http://example-cluster-head-svc:8000/mnist",
    json={"array": np.random.randn(28 * 28).tolist()})

5. You should see the inference results displayed in your notebook session.
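
For reference, TFMnistModel in the example notebook is a Ray Serve deployment. A minimal sketch of what such a class can look like under Ray 1.13’s Serve API (the request parsing and response shape here are assumptions, not the notebook’s exact code):

import numpy as np
from ray import serve

@serve.deployment(route_prefix="/mnist")
class TFMnistModel:
    def __init__(self, model_path: str):
        import tensorflow as tf
        # Load the model saved by the training step.
        self.model = tf.keras.models.load_model(model_path)

    async def __call__(self, request):
        # Parse the JSON body sent by the client, e.g. {"array": [...]}.
        payload = await request.json()
        arr = np.array(payload["array"], dtype="float32").reshape(-1, 28, 28)
        # Run inference and return a JSON-serializable result.
        return {"prediction": self.model(arr).numpy().tolist()}

# Assumes serve.start() has already run, as in the cell above;
# the path argument is passed through to __init__.
TFMnistModel.deploy("/tmp/mnist_model")  # Hypothetical TRAINED_MODEL_PATH.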

Sharing the Ray Cluster with Others

Now that you have a functional workspace with an interactive notebook and a Ray cluster, let’s invite others to collaborate.

1. On Cloud Console, grant the user minimal cluster access here.

2. In the left-hand panel of the Kubeflow dashboard, select “Manage Contributors”.

3. In the “Contributors to your namespace” section, enter the email address of the user to whom you are granting access. Press enter.

4. That user can now select your namespace and access your notebooks, including your Ray cluster.

Using Ray Dashboard

Finally, you can also bring up the Ray Dashboard using an Istio virtual service. With the following steps, you can surface the dashboard UI inside the Kubeflow central dashboard console:

1. Create an Istio Virtual Service config file:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: example-cluster-virtual-service
  namespace: kubeflow
spec:
  gateways:
  - kubeflow-gateway
  hosts:
  - '*'
  http:
  - match:
    - uri:
        prefix: /example-cluster/
    rewrite:
      uri: /
    route:
    - destination:
        host: example-cluster-head-svc.$(USER_NAMESPACE).svc.cluster.local
        port:
          number: 8265

Replace $(USER_NAMESPACE) with the namespace of your user profile, and save this to a local file (for example, virtual_service.yaml).

2. Deploy the virtual service:
kubectl apply -f virtual_service.yaml

3. In your browser window, navigate to https://<host>/_/example-cluster/. The Ray dashboard should be displayed in the window.

Conclusion

Let’s take a minute to recap what we have done. In this article, we have demonstrated how to deploy two popular ML frameworks, Kubeflow and Ray, in the same GCP Kubernetes cluster. The setup also takes advantage of GCP features like IAP (Identity-Aware Proxy) for user authentication, which protects your applications while simplifying the experience for cloud admins. The end result is a well-integrated and production-ready system that pulls in useful features offered by each system:

  • Orchestrating distributed computing workloads using Ray APIs;
  • Multi-user isolation using Kubeflow;
  • Interactive notebook environment using Kubeflow notebooks;
  • Cluster autoscaling and auto-provisioning using Google Kubernetes Engine

We’ve only scratched the surface of the possibilities, and you can expand from here:

  • Integrations with other MLOps offerings, such as Vertex Model monitoring;
  • Faster and safer image storage and management, through Artifact Registry;
  • High throughput storage for unstructured data using GCSFuse;
  • Improved network throughput for collective communication with NCCL Fast Socket.

We look forward to the growth of your ML Platform and how your team innovates with Machine Learning. Look out for future articles on how to enable additional ML Platform features.

By: Richard Liu (Senior Software Engineer, Google Kubernetes Engine) and Winston Chiang (Product Manager, Google Kubernetes Engine AI/ML)
Source: Google Cloud Blog

