aster.cloud aster.cloud
  • /
  • Platforms
    • Public Cloud
    • On-Premise
    • Hybrid Cloud
    • Data
  • Architecture
    • Design
    • Solutions
    • Enterprise
  • Engineering
    • Automation
    • Software Engineering
    • Project Management
    • DevOps
  • Programming
    • Learning
  • Tools
  • About
  • /
  • Platforms
    • Public Cloud
    • On-Premise
    • Hybrid Cloud
    • Data
  • Architecture
    • Design
    • Solutions
    • Enterprise
  • Engineering
    • Automation
    • Software Engineering
    • Project Management
    • DevOps
  • Programming
    • Learning
  • Tools
  • About
aster.cloud aster.cloud
  • /
  • Platforms
    • Public Cloud
    • On-Premise
    • Hybrid Cloud
    • Data
  • Architecture
    • Design
    • Solutions
    • Enterprise
  • Engineering
    • Automation
    • Software Engineering
    • Project Management
    • DevOps
  • Programming
    • Learning
  • Tools
  • About
  • Design
  • Engineering

Load Testing I/O Adventure With Cloud Run

  • aster.cloud
  • December 27, 2022
  • 6 minute read
In 2021 and 2022, Google I/O invited remote attendees to meet each other and explore a virtual world in the I/O Adventure online conference experience powered by Google Cloud technologies. (For more details on how this was done, see my previous post.)
When building an online experience like I/O Adventure, it’s important to decide on a provisioning strategy early on. Before we could make that decision, however, we needed to have a reasonably accurate estimate of how many attendees to expect.Estimating the actual number of attendees in advance is easier for an in-person event than it is for a (free) online event. We recognized that this number could vary wildly from our best guesses, in either direction. The only safe strategy was to design a server architecture that would be able to handle much more traffic than we actually expected. Since the I/O Adventure experience was going to be live for only a few days during the conference, we determined that it would be affordable to overprovision by spinning up many server instances before the event started.To further ensure that heavy traffic would not degrade the attendees’ experience, we decided to implement a queue. Our system would welcome as many potential simultaneous attendees as it could smoothly support, and steer additional users to a waiting queue. In the unlikely event that the actual traffic exceeded the large allocated capacity, the queue would prevent the system from becoming overly congested.Designing a scalable cloud architecture for our project was one thing. Making sure that it would actually be able to support a heavy load is quite another! This post describes how we performed load testing on the cloud backend as a whole system, rather than on individual cloud components such as a single VM.

When load testing an entire cloud backend, you need to address several concerns that are not necessarily accounted for in component-level testing. Quotas are a good example. Quota issues are difficult to foresee before you actually hit them. And, we needed to consider many quotas! Do we have enough quota to spin up 200 VMs? More specifically, do we have enough for the type of machine (E2, N2, etc.) that we use? And even more specifically, in the cloud region where the project is deployed?


Partner with aster.cloud
for your next big idea.
Let us know here.



From our partners:

CITI.IO :: Business. Institutions. Society. Global Political Economy.
CYBERPOGO.COM :: For the Arts, Sciences, and Technology.
DADAHACKS.COM :: Parenting For The Rest Of Us.
ZEDISTA.COM :: Entertainment. Sports. Culture. Escape.
TAKUMAKU.COM :: For The Hearth And Home.
ASTER.CLOUD :: From The Cloud And Beyond.
LIWAIWAI.COM :: Intelligence, Inside and Outside.
GLOBALCLOUDPLATFORMS.COM :: For The World's Computing Needs.
FIREGULAMAN.COM :: For The Fire In The Belly Of The Coder.
ASTERCASTER.COM :: Supra Astra. Beyond The Stars.
BARTDAY.COM :: Prosperity For Everyone.

 

Pouring thousands of bots in the I/O Adventure world

Read More  How A Zero Trust Approach Protects Governments And Constituents Against Fraud

Infrastructure

To design the load test and simulate thousand of attendees, we had to take two key factors into account:

  • Attendees communicate from their browser with the cloud backend using WebSockets
  • A typical attendee session lasts for at least 15 minutes without disconnecting, which is long-lived compared to some common load testing methodologies more focused on individual HTTP requests

 

I/O Adventure Attendee-Server communication between attendee’s browser and a GKE server pod, using WebSockets

While it is possible to set up a load test suite by provisioning and managing VMs, I have a strong preference for serverless solutions like Cloud Functions and Cloud Run. So, I wanted to know if we could use one of them to simulate user sessions – to essentially play the role of a load injector. Does Google Cloud’s serverless infrastructure support the necessary protocol – WebSockets – for this use case?Yes! It turns out that Cloud Run supports WebSockets for egress and ingress, and has a configurable per-request timeout up to 1 hour. 

Load test Client-Server communication using Cloud Run communicating via WebSockets with GKE pods

In the load test we mimicked a typical attendee session, in which a WebSocket connection transmits thousands of messages over several minutes, without disconnecting.

Concurrency

On the backend, the I/O Adventures servers handle thousands of simultaneous attendees by:

  • Accepting 500 attendees in each “shard”, where each shard is a server representing a part of the whole conference world, in which attendees can interact with each other;
  • Having hundreds of independent, preprovisioned shards;
  • Running several shards in each GKE Node;
  • Routing each incoming attendee to a shard with free capacity.

On the client side (the load injector), we implemented multiple levels of concurrency:

  • Each trigger (e.g. an HTTPS request from my workstation initiated with curl) can launch many concurrent sessions of 15 minutes and wait for their completion.
  • Each Cloud Run instance can handle many concurrent triggering requests (maximum concurrent requests per instance).
  • Cloud Run automatically starts new instances when the existing instances approach their full capacity. A Cloud Run service can scale to hundreds or thousands of container instances as needed.
  • We created an additional Cloud Run service specifically for triggering more simultaneous requests to the main Cloud Run injector as a way to amplify the load test.
Read More  Introducing Data Studio As Our Newest Google Cloud Service

 

Simulating a single attendee story

A simulated “user story” is a load test scenario that consists of logging in, being routed to the GKE pod of a shard, making a few hundred random attendee movements for 15 minutes, and disconnecting.

For this project, I ran the simulation in Cloud Run, and I kicked off the test by issuing a curl command from my laptop. In this setup, the scenario (story) initiates a connection as a WebSocket client, and the pods are WebSocket servers.

 

Injecting one attendee stroy

Initiating many attendee stories with one triggerWe created an injector service handler (implemented as a Node.js script) to start many stories in parallel, and wait for their completion. 

Injecting many attendee stories with one trigger

Injecting many attendee stories with several concurrent triggersI can multiply the load by triggering the injector many times concurrently with curl, from a command line terminal:
for i in {1..10}; do 
    curl -X POST "https://fancy-load-test.run.app" &
done
wait

Cloud Run automatically scales up by spinning new injector machines (i.e Cloud Run instances) when needed. 

Injecting attendee stories through many curl requests, triggering many Cloud Run injector instances(Login service details omitted for clarity.)

Injecting more attendee stories through an extra Cloud Run serviceIn the previous setup, my workstation became the bottleneck as I was launching too many long-lived trigger requests with curl. My workstation limited the number of TCP connections, and struggled to keep up with the CPU load for all the SSL handshakes.We fixed this by creating a new intermediate service specifically for dealing with a large number of triggers. We tried several parameter values (number of stories per trigger, max requests per Cloud Run instance, etc.) to maximize the injector’s throughput. 

Injecting attendee stories through an extra Cloud Run trigger service

Note the BigQuery component, used for log ingestion on both the injector side and the server side.

Measuring the success rate

Unlike HTTP requests, which have an explicit response code, WebSocket messages are unidirectional and by default don’t expect an acknowledgement.

To keep track of how many stories have run successfully to completion, we wrote a few events (login, start, finish…) to the standard output and activated a logging sink to stream all of the logs to BigQuery. The events were logged from the point of view of the clients (the injector) and from the point of view of the servers (GKE pods).

Read More  Introducing Our New Cohort Of Startups For The 2022 Google Cloud Accelerator Canada

This made it very convenient, with aggregate SQL queries, to:

  • make sure that at least 99% of all the stories did finish successfully, and
  • make sure the stories did not take more time than expected.

 

We also kept an eye on several Grafana dashboards to monitor live metrics of our GKE cluster, and make sure that CPU, memory, and bandwidth resources didn’t get overwhelmed. 

Visual check

As a bonus, it was very fun to connect as a “real” attendee and watch hundreds of bots running everywhere!

 

Connecting to the system under stress with a browser also enabled us to assess the subjective experience. We could see, for example, how smooth the animations were when a shard was hosting 500 attendees and the frontend was rendering dozens of moving avatars.

Results

With a total of 4000 triggers for 40 stories each, and a max concurrency of 40 requests per Cloud Run instance, our tests used just over 100 instances and successfully injected 160,000 simultaneous active attendees. We ran this load test script several times over a few days, for a total cost of about $100. The test took advantage of the full capacity of all of the server CPU cores (used by the GKE cluster) that our quota allowed. Mission accomplished!

We learned that:

  • The cost of the load test was acceptable.
  • The quotas we needed to raise were the number of specific CPU cores and the number of external IPv4 addresses.
  • Our platform would successfully sustain a load target of 160K attendees.

During the actual event, the peak traffic turned out to be less than the maximum supported load. (As a result, no attendees had to wait in the queue that we had implemented.) Following our tests, we were confident that the backend would handle the target load without any major issues, and it did.

Of course, Cloud Run and Cloud Run Jobs can handle many types of workloads, not only website backends and load tests. Take some time to explore them further and think about where you can put them to use in your own workflows!

 

By: Valentin Deleplace (Google Cloud Developer Advocate)
Source: Google Cloud Blog


For enquiries, product placements, sponsorships, and collaborations, connect with us at [email protected]. We'd love to hear from you!

Our humans need coffee too! Your support is highly appreciated, thank you!

aster.cloud

Related Topics
  • Cloud Run
  • Google Cloud
  • I/O Adventure
You May Also Like
View Post
  • Engineering
  • Technology

Guide: Our top four AI Hypercomputer use cases, reference architectures and tutorials

  • March 9, 2025
View Post
  • Computing
  • Engineering

Why a decades old architecture decision is impeding the power of AI computing

  • February 19, 2025
View Post
  • Engineering
  • Software Engineering

This Month in Julia World

  • January 17, 2025
View Post
  • Engineering
  • Software Engineering

Google Summer of Code 2025 is here!

  • January 17, 2025
View Post
  • Data
  • Engineering

Hiding in Plain Site: Attackers Sneaking Malware into Images on Websites

  • January 16, 2025
View Post
  • Computing
  • Design
  • Engineering
  • Technology

Here’s why it’s important to build long-term cryptographic resilience

  • December 24, 2024
IBM and Ferrari Premium Partner
View Post
  • Data
  • Engineering

IBM Selected as Official Fan Engagement and Data Analytics Partner for Scuderia Ferrari HP

  • November 7, 2024
View Post
  • Engineering

Transforming the Developer Experience for Every Engineering Role

  • July 14, 2024

Stay Connected!
LATEST
  • college-of-cardinals-2025 1
    The Definitive Who’s Who of the 2025 Papal Conclave
    • May 7, 2025
  • conclave-poster-black-smoke 2
    The World Is Revalidating Itself
    • May 6, 2025
  • oracle-ibm 3
    IBM and Oracle Expand Partnership to Advance Agentic AI and Hybrid Cloud
    • May 6, 2025
  • 4
    Conclave: How A New Pope Is Chosen
    • April 25, 2025
  • Getting things done makes her feel amazing 5
    Nurturing Minds in the Digital Revolution
    • April 25, 2025
  • 6
    AI is automating our jobs – but values need to change if we are to be liberated by it
    • April 17, 2025
  • 7
    Canonical Releases Ubuntu 25.04 Plucky Puffin
    • April 17, 2025
  • 8
    United States Army Enterprise Cloud Management Agency Expands its Oracle Defense Cloud Services
    • April 15, 2025
  • 9
    Tokyo Electron and IBM Renew Collaboration for Advanced Semiconductor Technology
    • April 2, 2025
  • 10
    IBM Accelerates Momentum in the as a Service Space with Growing Portfolio of Tools Simplifying Infrastructure Management
    • March 27, 2025
about
Hello World!

We are aster.cloud. We’re created by programmers for programmers.

Our site aims to provide guides, programming tips, reviews, and interesting materials for tech people and those who want to learn in general.

We would like to hear from you.

If you have any feedback, enquiries, or sponsorship request, kindly reach out to us at:

[email protected]
Most Popular
  • 1
    Tariffs, Trump, and Other Things That Start With T – They’re Not The Problem, It’s How We Use Them
    • March 25, 2025
  • 2
    IBM contributes key open-source projects to Linux Foundation to advance AI community participation
    • March 22, 2025
  • 3
    Co-op mode: New partners driving the future of gaming with AI
    • March 22, 2025
  • 4
    Mitsubishi Motors Canada Launches AI-Powered “Intelligent Companion” to Transform the 2025 Outlander Buying Experience
    • March 10, 2025
  • PiPiPi 5
    The Unexpected Pi-Fect Deals This March 14
    • March 13, 2025
  • /
  • Technology
  • Tools
  • About
  • Contact Us

Input your search keywords and press Enter.