Introducing the Cloud HPC Toolkit
Running High Performance Computing (HPC) workloads in the cloud provides many benefits, including the flexibility to create and tear down whole clusters within minutes. However, that flexibility comes with complexity. How do you pick the right machine configuration? How do you install your preferred scheduler? Set up your choice of filesystem? Of course, you also want to get the best performance. Finally, you also want a standardized process that is easy, flexible, and repeatable.
We are excited to share our next step in simplifying HPC on Google Cloud: Cloud HPC Toolkit, an open source tool that lets you create repeatable, turnkey HPC clusters within minutes, based on proven best practices.
For the past several years, Google Cloud has worked to optimize running HPC workloads and ensure compatibility across the HPC ecosystem. We have made significant progress, offering you simple ways to deploy your favorite job schedulers like Altair’s PBS Professional and Altair Grid Engine, Slurm supported by SchedMD, and IBM Spectrum LSF, and building the highest scoring Lustre IO500 system with DDN.
HPC Toolkit Features
The HPC Toolkit features a modular design that enables composable HPC environments, so you can define and deploy both simple and advanced HPC environments. An HPC blueprint defines the infrastructure and software configuration of an HPC environment via a high-level YAML-formatted file that composes Terraform modules, Packer templates, and Ansible playbooks. You can create a cluster with an existing blueprint or modify it to fit your needs. By changing a few lines of text in the blueprint, you can provision the infrastructure and industry-specific tools required for the job.
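To make the idea of a composable blueprint concrete, here is a minimal Python sketch of a blueprint as a list of modules that reference one another. The field names (`modules`, `id`, `kind`, `uses`, `settings`), the module names, and the helper functions `customize` and `validate` are illustrative assumptions, not the Toolkit's actual YAML schema or API; see the Toolkit's example blueprints for the real format.

```python
from copy import deepcopy

# Hypothetical blueprint layout: every key and value below is illustrative
# and is NOT taken from the Cloud HPC Toolkit schema.
base_blueprint = {
    "blueprint_name": "small-cluster",
    "vars": {"region": "us-central1", "zone": "us-central1-a"},
    "modules": [
        {"id": "network", "kind": "terraform"},
        {"id": "homefs", "kind": "terraform", "settings": {"size_gb": 1024}},
        {"id": "compute", "kind": "terraform",
         "settings": {"machine_type": "c2-standard-60", "max_nodes": 20}},
        {"id": "scheduler", "kind": "terraform",
         "uses": ["network", "homefs", "compute"]},
    ],
}


def customize(blueprint, module_id, **settings):
    """Return a copy of the blueprint with settings overridden on one module."""
    updated = deepcopy(blueprint)  # leave the original blueprint untouched
    for module in updated["modules"]:
        if module["id"] == module_id:
            module.setdefault("settings", {}).update(settings)
            return updated
    raise KeyError(f"no module with id {module_id!r}")


def validate(blueprint):
    """Check that each 'uses' entry refers to a module defined earlier."""
    seen = set()
    for module in blueprint["modules"]:
        for dependency in module.get("uses", []):
            if dependency not in seen:
                raise ValueError(
                    f"{module['id']!r} uses undefined module {dependency!r}"
                )
        seen.add(module["id"])
    return True


# Changing one value yields a larger cluster from the same base definition.
large_blueprint = customize(base_blueprint, "compute", max_nodes=100)
validate(large_blueprint)
```

The sketch copies the base definition before changing it, which mirrors the workflow of starting from an existing blueprint and adjusting a few lines rather than writing a cluster definition from scratch.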
The HPC Toolkit comes with several example configuration blueprints, including a small basic cluster and a high I/O cluster. These can be used as-is to get familiar with the operations of the HPC Toolkit, or they can be modified to build different configurations.
HPC modules are components that are imported to assemble the HPC environment, including compute, schedulers, storage, and networking. You can develop and import these modules locally or import them automatically from GitHub. Currently, Cloud HPC Toolkit supports the following infrastructure, solutions, and modules:
- Compute: All VM Types, GPUs, HPC VM Image, Instance Templates, Configurable SMT
- Scheduler: Slurm
- Storage: Intel DAOS, DDN EXAScaler (Lustre), Filestore, Local SSD, Persistent Disk
- Network: 100 Gbps (Tier 1) bandwidth, Placement Groups
- Other key functionality: Spack, Dell Omnia, Cloud Monitoring
Using Cloud HPC Toolkit with an Intel® Select Solutions for Simulations and Modeling blueprint brings the added benefit of automatically spinning up a hardware-software configuration that has been rigorously tested and optimized for real-world performance, eliminating guesswork. The Intel® Select Solutions for Simulations and Modeling blueprint includes the Intel® oneAPI HPC Toolkit (HPC kit), which simplifies the work to build, analyze, optimize, and scale HPC applications with the latest techniques in vectorization, multithreading, multi-node parallelization, and memory optimization. It also includes the popular Intel® MPI Library and Intel® Math Kernel Library.
We have already started working on features and integrations that will be released in later versions, including support for the Altair PBS Professional and Altair Grid Engine schedulers.
What our partners are saying
The demand for high performance storage is one of the fastest growing needs in HPC. Distributed Asynchronous Object Storage (DAOS) is an open source, software-defined, scale-out object store that provides high bandwidth, low latency, and high IOPS storage containers to HPC applications. It is seeing growing use in AI and HPDA. Google Cloud HPC Toolkit makes it much easier to use DAOS in GCP. Google HPC users can now provision DAOS ephemeral storage for projects of any size in minutes. A hybrid model of DAOS combined with object storage brings accelerated performance and cost effectiveness. DAOS is now fully integrated with the Google Cloud environment and hosted in Google’s newly announced HPC Toolkit for a fully automated experience. “DAOS is the future of HPC. We’re excited to see the benefits of our year plus technical collaboration pay off with today’s announcement of fast and easy access to DAOS in Google Cloud,” – Kelsey Prantis, Director of Engineering, High-Performance Storage at Intel.
Cloud computing enables scalability, ease of implementation, and incredible price-performance for our customers’ most demanding workloads. Google’s new Cloud HPC Toolkit further eases deployment by enabling anyone to create an HPC environment with C2D VMs powered by 3rd Gen AMD EPYC™ processors. “The HPC Toolkit reduces complexities and improves automation while mitigating errors for HPC in the cloud. We are happy to collaborate with Google Cloud to optimize AMD powered virtual machines for higher accessibility for all customers.” – Suresh Andani, Director, Cloud Business Development, AMD
“We’re thrilled to be part of this strategic technical collaboration with Google. By integrating Altair PBS Professional and Altair Grid Engine with the HPC Toolkit, we’re simplifying access to Google Cloud and democratizing HPC for the masses.” – Piush Patel, Sr VP, Strategic Relationships
Over the course of the last year, NAG and Google collaborated on the development of key components of the Google Cloud HPC Toolkit, which goes public today. NAG enjoys a close partnership with Google, working together to provide additional services, including end user support and consulting on top of the Cloud HPC Toolkit, for GCP clients as part of NAG® Cloud HPC Migration Services. “Using the Cloud HPC Toolkit, we can now create an HPC Cluster in GCP in a matter of minutes.” – Adrian Tate, CEO, NAG
Getting started with Cloud HPC Toolkit
Get started with the HPC Toolkit today using one of our existing blueprints, such as a basic cluster or one with higher I/O performance, or modify the examples to make your own. For a full list of example HPC blueprints, see the Cloud HPC Toolkit GitHub repository. You can read more about using the HPC Toolkit in the HPC Toolkit documentation, including our quickstart guides. We would love to hear how the Cloud HPC Toolkit is working for you through our support channels. You can read more about Google Cloud’s HPC solutions on our HPC Solution page, and contact us to learn more.
By: Chelsie Czop (Peterson) (HPC Product Manager, Google Cloud) and Carlos Boneti (HPC Software Engineer, Google Cloud)
Source: Google Cloud Blog