Cloud Storage As A File System In AI Training

  • aster.cloud
  • November 23, 2021
  • 4 minute read

Cloud Storage is a common choice for Vertex AI and AI Platform users to store their training data, models, checkpoints and logs. Now, with Cloud Storage FUSE, training jobs on both platforms can access their data on Cloud Storage as files in the local file system.

This post introduces Cloud Storage FUSE for Vertex AI Custom Training. The feature works in a very similar way on AI Platform Training.


Cloud Storage FUSE provides three benefits over the traditional ways of accessing Cloud Storage:

  • Training jobs can start quickly without downloading any training data.
  • Training jobs can perform I/O easily at scale, without the friction of calling the Cloud Storage APIs, handling the responses, or integrating with client-side libraries.
  • Training jobs can leverage the optimized performance of Cloud Storage FUSE.

The problems

Traditionally, training jobs have two ways to use data from Cloud Storage.

  1. They can use gsutil to download the entire dataset before training. Depending on the dataset size, this may take hours, which significantly slows job start-up.
  2. They can call the Cloud Storage APIs directly or through an integrated client library. This adds considerable complexity to the training code, and with it the cost of development and maintenance.

Cloud Storage FUSE

Cloud Storage FUSE is a Filesystem in Userspace (FUSE) file system mounted on Vertex AI systems.

When you start a custom training job, the job sees a directory, /gcs, which contains all the Cloud Storage buckets as subdirectories. The job can access a subdirectory (i.e., a bucket) once it has been granted the required permissions.

For instance, a training job can read the file /gcs/example-bucket/data.csv to get the training data stored in the object gs://example-bucket/data.csv:

# Read the object gs://example-bucket/data.csv through the FUSE mount.
with open('/gcs/example-bucket/data.csv', 'r') as f:
    lines = f.readlines()

Training jobs can also write to the bucket:

# Append a line to the object gs://example-bucket/epoch3.log.
with open('/gcs/example-bucket/epoch3.log', 'a') as f:
    f.write('success!\n')
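
Because each bucket appears as a subdirectory of /gcs, a gs:// URI can be translated into a local path mechanically. The helper below is a minimal sketch of that mapping; the function name gcs_uri_to_fuse_path is hypothetical and not part of any Google library.

```python
from pathlib import PurePosixPath

# Mount point described in this post.
FUSE_ROOT = PurePosixPath("/gcs")


def gcs_uri_to_fuse_path(uri: str) -> str:
    """Translate gs://bucket/object into /gcs/bucket/object (hypothetical helper)."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a Cloud Storage URI: {uri!r}")
    bucket_and_object = uri[len(prefix):]
    if not bucket_and_object or bucket_and_object.startswith("/"):
        raise ValueError(f"missing bucket name: {uri!r}")
    return str(FUSE_ROOT / bucket_and_object)


print(gcs_uri_to_fuse_path("gs://example-bucket/data.csv"))
# prints /gcs/example-bucket/data.csv
```

A helper like this lets training code keep accepting gs:// URIs in its configuration while doing all I/O through ordinary file calls.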

 

Permissions

You can assign a service account to a training job to configure its permissions for Cloud Storage buckets.

  • If no service account is assigned to the training job, the job can access all the buckets owned by the same project.
  • If the training job is assigned a service account that has Cloud Storage roles, the job has the permissions granted by those roles.

For instance, you may create a service account that has

  • storage.objectAdmin on bucket A, and
  • storage.objectViewer on bucket B.

If you assign it to your training job, your training job will be able to

  • read and write in bucket A, and
  • read only in bucket B.

The training job will fail with error “permission denied” if it tries to write to bucket B.
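
If a job should survive a write to a read-only bucket instead of failing, it can catch the error. The sketch below assumes the "permission denied" error reaches Python as the built-in PermissionError, which is how the operating system's EACCES error is normally raised; the function name try_append is hypothetical.

```python
def try_append(path: str, text: str) -> bool:
    """Append text to path; return False if the write is not permitted.

    Assumption: a denied write on the FUSE mount surfaces as PermissionError.
    The error may be raised on open or on close, so both happen inside the try.
    """
    try:
        with open(path, "a") as f:
            f.write(text)
    except PermissionError as err:
        print(f"cannot write to {path}: {err}")
        return False
    return True
```

For example, try_append('/gcs/bucket-b/epoch3.log', 'success!\n') would be expected to return False when the job's service account only has storage.objectViewer on that bucket (bucket-b is a placeholder name).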

Performance

I/O is often a bottleneck for training jobs with large datasets. Here are some tips to improve the read throughput of Cloud Storage FUSE:

  • Store data in large files to reduce the number of files used in training. Fewer files mean less lookup overhead in locating and opening objects in Cloud Storage.
  • Use multiple threads. Higher concurrency makes better use of the available bandwidth.
  • Keep the files warm. Files that are accessed frequently (i.e., warm files) are generally better cached and faster to read.
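
The multi-threading tip can be sketched with Python's standard thread pool. This is an illustration, not a tuned implementation: the function names and the default of 8 workers are assumptions, and the right level of concurrency depends on the machine and the dataset.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import List


def read_file(path: Path) -> bytes:
    """Read one file completely."""
    return path.read_bytes()


def read_all(paths: List[Path], max_workers: int = 8) -> List[bytes]:
    """Read several files concurrently; results keep the order of paths."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(read_file, paths))
```

A job could call it as read_all(sorted(Path('/gcs/example-bucket/shards').glob('*')), max_workers=16), where the shards directory is a hypothetical layout. Threads suit this workload because the time is spent waiting on I/O rather than on Python computation.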

Restrictions

Cloud Storage FUSE is not a POSIX-compliant file system. Some operations that work on a POSIX file system therefore produce unwanted results and should be avoided.

Directories:

  • The root directory /gcs is not readable. If you run ls /gcs, you will get an “Input/output error”. However, it is fine to list a bucket root, for example with ls /gcs/example-bucket.
  • Renaming a directory is not atomic. An interrupted rename leaves a partial result, with some files in the new directory and others still in the old one. A directory that contains too many files, directly or in subdirectories, cannot be renamed.

Files:

  • Hard links are not supported.
  • File metadata such as ownership, permissions, mtime, and extended attributes is not supported. Do not rely on file metadata for training logic.
  • Flushing a file pushes the entire file to Cloud Storage, which is expensive. Closing a file triggers a flush. Avoid closing and flushing files frequently.
  • Concurrent writes to the same file lead to data corruption.
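
One way to respect the last two restrictions is to batch log lines in memory and to give each worker its own file. The class and helper below are a hypothetical sketch of that pattern; the names BufferedLog and worker_log_path and the default batch size are assumptions.

```python
import os
from typing import List


def worker_log_path(directory: str, worker_index: int) -> str:
    """Give each worker its own file so no two workers write to the same object."""
    return os.path.join(directory, f"worker-{worker_index}.log")


class BufferedLog:
    """Collect lines in memory and write them in batches to limit flushes."""

    def __init__(self, path: str, flush_every: int = 1000) -> None:
        self.path = path
        self.flush_every = flush_every
        self._lines: List[str] = []

    def write(self, line: str) -> None:
        self._lines.append(line)
        if len(self._lines) >= self.flush_every:
            self.flush()

    def flush(self) -> None:
        if not self._lines:
            return
        # One open/close cycle per batch, so one flush per batch.
        with open(self.path, "a") as f:
            f.writelines(self._lines)
        self._lines.clear()
```

Call flush() once more at the end of training so the final partial batch is written. Lines still held in memory are lost if the job crashes, so the batch size is a trade-off between I/O cost and durability.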

Logs

The logs from Cloud Storage FUSE can help you diagnose errors in training.

  • First, follow the link to the Logs Explorer on the training job’s page in the Google Cloud console. In the explorer, you can run queries to inspect the logs generated by your training job.
  • Second, look for logs with “gcsfuse” in the resource.labels.taskName property. For instance, the task name “workerpool0-0.gcsfuse” indicates that the log comes from the Cloud Storage FUSE mounted for the first worker, “0”, in the first worker pool, “workerpool0”.
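
The task-name convention above can be turned into a query string for the Logs Explorer. This sketch takes the property name resource.labels.taskName directly from the text above; the exact field name in your log entries is an assumption to check against a real entry before relying on the filter. Both function names are hypothetical.

```python
def gcsfuse_task_name(pool_index: int, worker_index: int) -> str:
    """Build the task name, e.g. workerpool0-0.gcsfuse for the first worker."""
    return f"workerpool{pool_index}-{worker_index}.gcsfuse"


def gcsfuse_log_filter(pool_index: int, worker_index: int) -> str:
    """Build a filter that matches one worker's Cloud Storage FUSE logs.

    Assumption: the label is exposed as resource.labels.taskName, as written
    in this post; verify the field name in an actual log entry.
    """
    task = gcsfuse_task_name(pool_index, worker_index)
    return f'resource.labels.taskName="{task}"'


print(gcsfuse_log_filter(0, 0))
# prints resource.labels.taskName="workerpool0-0.gcsfuse"
```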

What’s next

You can find more information on Cloud Storage FUSE in the documentation:

  • https://cloud.google.com/vertex-ai/docs/training/code-requirements#fuse
  • https://cloud.google.com/storage/docs/gcs-fuse

You can also find code samples using Cloud Storage FUSE for Vertex AI Custom Training:

  • https://github.com/GoogleCloudPlatform/vertex-ai-samples/tree/master/community-content

 

By: Oliver Zhuang (Software Engineer)
Source: Google Cloud Blog

