
Cloud TPU V4 Records Fastest Training Times On Five MLPerf 2.0 Benchmarks

  • aster.cloud
  • July 13, 2022
  • 4 minute read

ML-driven innovation is fundamentally transforming computing, enabling entirely new classes of internet services. For example, recent state-of-the-art large models such as PaLM and Chinchilla herald a coming paradigm shift where ML services will augment human creativity. All indications are that we are still in the early stages of what will be the next qualitative step function in computing. Realizing this transformation will require democratized and affordable access through cloud computing, where the best of compute, networking, storage, and ML can be brought to bear seamlessly on ever larger-scale problem domains.

Today’s release of MLPerf™ 2.0 results from the MLCommons® Association highlights the public availability of the most powerful and efficient ML infrastructure anywhere. Google’s TPU v4 ML supercomputers set performance records on five benchmarks, with an average speedup of 1.42x over the next-fastest non-Google submission and 1.5x over our MLPerf 1.0 submission. Even more compelling: four of these record runs were conducted on the publicly available Google Cloud ML Hub that we announced at Google I/O. ML Hub runs out of our Oklahoma data center, which uses over 90% carbon-free energy.


Let’s take a closer look at the results.

Figure 1: TPUs demonstrated significant speedups in all five published benchmarks over the fastest non-Google submission (NVIDIA on-premises). Taller bars are better. The numbers inside the bars represent the number of chips/accelerators used for each submission.

 

Performance at scale…and in the public cloud

Our 2.0 submissions, all running on TensorFlow, demonstrated leading performance across all five benchmarks. We scaled two of our submissions to run on full TPU v4 Pods. Each Cloud TPU v4 Pod consists of 4096 chips connected via an ultra-fast interconnect network with an industry-leading 6 terabits per second (Tbps) of bandwidth per host, enabling rapid training for the largest models.
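A rough sense of the aggregate host bandwidth in such a pod can be sketched from the figures above. Note that the chips-per-host count below is an assumption for illustration only; the 4096-chip and 6 Tbps/host figures come from the text.

```python
# Back-of-envelope aggregate bandwidth for a TPU v4 Pod.
# Assumption (not from the article): 4 TPU chips per host.
chips_per_pod = 4096    # from the text
chips_per_host = 4      # hypothetical, for illustration
tbps_per_host = 6       # from the text

hosts = chips_per_pod // chips_per_host      # 1024 hosts under this assumption
aggregate_tbps = hosts * tbps_per_host       # total host-level bandwidth in Tbps
print(f"{hosts} hosts, ~{aggregate_tbps:,} Tbps aggregate host bandwidth")
```

Under that assumption the pod-wide host bandwidth lands in the petabit-per-second range, which is what makes synchronous training of the largest models across the whole pod feasible.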


Hardware aside, these benchmark results were made possible in no small part by our work to improve the TPU software stack. Scalability and performance optimizations in the TPU compiler and runtime, including faster embedding lookups and improved model weight distribution across the TPU pod, enabled much of these improvements, and are now widely available to TPU users. For example, we made a number of performance improvements to the virtualization stack to fully utilize the compute power of both CPU hosts and TPU chips to achieve peak performance on image and recommendation models. These optimizations reflect lessons from Google’s cutting-edge internal ML use cases across Search, YouTube, and more. We are excited to bring the benefits of this work to all Google Cloud users as well.

Figure 2: Our 2.0 submissions make use of advances in our compiler infrastructure to achieve a larger scale and better per-chip performance across the board than previously possible, averaging a 1.5x speedup over our 1.0 submissions.

 

Translating MLPerf wins to customer wins

Cloud TPU’s industry-leading performance at scale also translates to cost savings for customers. As summarized in Figure 3, Cloud TPUs on Google Cloud provide roughly 35–50% savings versus the A100 on Microsoft Azure. We used the following methodology to calculate this result:

  1. We compared the end-to-end times of the largest-scale MLPerf submissions, namely ResNet and BERT, from Google and NVIDIA. These submissions make use of a similar number of chips — upwards of 4000 TPU and GPU chips. Since performance does not scale linearly with chip count, we compared two submissions with roughly the same number of chips.
  2. To simplify the 4216-chip A100 comparison for ResNet vs our 4096-chip TPU submission, we made an assumption in favor of GPUs that 4096 A100 chips would deliver the same performance as 4216 chips.
  3. For pricing, we compared our publicly available Cloud TPU v4 on-demand prices ($3.22 per chip-hour) to Azure’s on-demand prices for the A100 ($4.10 per chip-hour). This once again favors the A100s, since we assume zero virtualization overhead in moving from on-prem (NVIDIA’s results) to Azure Cloud.
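The methodology above boils down to cost = chips × price per chip-hour × hours. A minimal sketch, using the on-demand prices quoted above but with hypothetical end-to-end training times (the actual MLPerf wall-clock results are not reproduced in this article):

```python
# Sketch of the cost methodology described in the steps above.
# Prices are the on-demand figures quoted in the text; the training
# times below are hypothetical placeholders, not the MLPerf results.

def training_cost(num_chips: int, price_per_chip_hour: float,
                  train_minutes: float) -> float:
    """End-to-end training cost in dollars: chips x $/chip-hour x hours."""
    return num_chips * price_per_chip_hour * (train_minutes / 60.0)

TPU_V4_PRICE = 3.22   # $/chip-hour, Cloud TPU v4 on-demand (from the text)
A100_PRICE = 4.10     # $/chip-hour, Azure A100 on-demand (from the text)

# Hypothetical wall-clock times for illustration only.
tpu_cost = training_cost(4096, TPU_V4_PRICE, train_minutes=12.0)
# 4216 A100 chips treated as 4096, per step 2 of the methodology.
gpu_cost = training_cost(4096, A100_PRICE, train_minutes=15.0)

savings = 1.0 - tpu_cost / gpu_cost
print(f"TPU run: ${tpu_cost:,.0f}, GPU run: ${gpu_cost:,.0f}, "
      f"savings: {savings:.0%}")
```

With these illustrative times, the savings come out near the ~35% figure cited for BERT; plugging in the published MLPerf times and chip counts reproduces the full analysis.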

The savings are especially meaningful given that real-world models such as GPT-3 and PaLM are much larger than the BERT and ResNet models used in the MLPerf benchmark: PaLM is a 540-billion-parameter model, while the BERT model used in the MLPerf benchmark has only 340 million parameters, more than a 1,000x difference in scale. In our experience, the benefits of TPUs grow significantly with scale, making the case for training on Cloud TPU v4 all the more compelling.
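For a sense of that scale gap, the ratio of the two parameter counts cited above can be computed directly:

```python
# Parameter counts cited in the text.
palm_params = 540e9   # PaLM: 540 billion parameters
bert_params = 340e6   # MLPerf BERT benchmark model: 340 million parameters

ratio = palm_params / bert_params
print(f"PaLM is ~{ratio:,.0f}x larger than the MLPerf BERT model")  # ~1,588x
```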

Figure 3: For the BERT model, using Cloud TPU v4 provides ~35% savings over A100, and ~50% savings for ResNet.

Have your cake and eat it too — a continued focus on sustainability

Performance at scale must take environmental concerns as a primary constraint and optimization target. The Cloud TPU v4 pods powering our MLPerf results run on 90% carbon-free energy with a power usage effectiveness (PUE) of 1.10, meaning that less than 10% of the power delivered to the data center is lost to conversion, heat, or other sources of inefficiency. The TPU v4 chip delivers 3x the peak FLOPs per watt of the v3 generation. This combination of carbon-free energy and extraordinary power-delivery and computation efficiency makes Cloud TPUs among the most efficient in the world.
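As a quick check on that claim: PUE is the ratio of total facility power to IT equipment power, so the overhead fraction follows directly from the 1.10 figure above.

```python
# PUE = total facility power / IT equipment power (from the text: 1.10).
pue = 1.10

# Fraction of total delivered power that goes to overhead
# (cooling, power conversion, etc.) rather than to the IT load.
overhead_fraction = (pue - 1.0) / pue
print(f"Overhead: {overhead_fraction:.1%} of delivered power")  # ~9.1%
```

That works out to roughly 9.1% overhead, consistent with the "less than 10%" statement.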

Making the switch to Cloud TPUs

There has never been a better time for customers to adopt Cloud TPUs. Significant performance and cost savings at scale as well as a deep-rooted focus on sustainability are why customers such as Cohere, LG AI Research, Innersight Labs, and Allen Institute have made the switch. If you are ready to begin using Cloud TPUs for your workloads, please fill out this form. We are excited to partner with ML practitioners around the world to further accelerate the incredible rate of ML breakthroughs and innovation with Google Cloud’s TPU offerings.


 


By: Naveen Kumar (Principal Engineer) and Vikram Kasivajhula (Director, Product Management, ML Infrastructure)
Source: Google Cloud Blog

