
Automate Annotations For Vertex AI Text Datasets With Cloud Vision API And BigQuery

  • aster.cloud
  • June 28, 2022
  • 3 minute read

One of the main challenges machine learning practitioners face is the scarcity of annotated training datasets. In many cases, however, practitioners have access to existing datasets whose fields were extracted manually, and they can use these to accelerate model training.

In this post, we demonstrate how Google Cloud AI/ML products can be used to train a text entity extraction model for patent application PDFs. We use BigQuery, Vision API, and a Jupyter Notebook to automatically annotate an existing dataset, which is then used for model training. Although we won’t go into the details of each step, the complete version is available in a Jupyter Notebook released as part of the Vertex AI Samples GitHub repository.



The sample dataset

The dataset used in this example is Patent PDF Samples with Extracted Structured Data, one of the BigQuery public datasets. It contains links to PDFs of the first page of a subset of US and EU patents, stored in Google Cloud Storage. The dataset also contains labels for multiple patent entities, including the application number, patent inventor, and publication date. Because the extracted values are already paired with their source PDFs, the dataset is well suited to automatic annotation.
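As a rough sketch, the labeled rows could be pulled into Python with the BigQuery client library. The table name and column names below are assumptions based on the public dataset listing and are not taken from this post; verify them in the BigQuery console before relying on them.

```python
# Assumed table and column names; confirm them in the BigQuery console.
TABLE = "bigquery-public-data.labeled_patents.extracted_data"
ENTITY_COLUMNS = ["application_number", "inventor_line_1", "publication_date"]


def build_query(table: str, columns: list[str], limit: int) -> str:
    """Build a SELECT that returns the PDF path plus the labeled entity columns."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    selected = ", ".join(["gcs_path", *columns])
    return f"SELECT {selected} FROM `{table}` LIMIT {limit}"


def fetch_rows(sql: str) -> list[dict]:
    """Run the query; needs google-cloud-bigquery and default credentials."""
    # Imported lazily so build_query() can be used without the dependency.
    from google.cloud import bigquery

    client = bigquery.Client()
    return [dict(row.items()) for row in client.query(sql).result()]
```

Calling `fetch_rows(build_query(TABLE, ENTITY_COLUMNS, limit=5))` would return a handful of rows, each pairing a Cloud Storage path with the values already extracted for that patent.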

Preprocessing PDF documents using Cloud Vision API

At the time of writing, Vertex AI AutoML entity extraction supports only text data for training datasets, so our first step is to convert the PDF files to text. Cloud Vision API offers a text detection feature that uses Optical Character Recognition (OCR) to detect and extract text from PDF and TIFF files. It also offers a batch operation mode that allows us to process multiple files at once.
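The sketch below shows one way this step might look with the `google-cloud-vision` Python client. The bucket URIs passed in are placeholders, the function names are hypothetical, and the parsing assumes that each output JSON file written to Cloud Storage holds a `responses` list with a `fullTextAnnotation` per page. It is not necessarily how the notebook implements the step.

```python
import json


def submit_pdf_ocr(source_uri: str, destination_uri: str, timeout_s: int = 600) -> None:
    """Start an asynchronous OCR job for one PDF in Cloud Storage and wait for it."""
    # Imported lazily: requires the google-cloud-vision package and credentials.
    from google.cloud import vision

    client = vision.ImageAnnotatorClient()
    request = vision.AsyncAnnotateFileRequest(
        features=[vision.Feature(type_=vision.Feature.Type.DOCUMENT_TEXT_DETECTION)],
        input_config=vision.InputConfig(
            gcs_source=vision.GcsSource(uri=source_uri),
            mime_type="application/pdf",
        ),
        output_config=vision.OutputConfig(
            gcs_destination=vision.GcsDestination(uri=destination_uri),
            batch_size=1,  # pages per output JSON file
        ),
    )
    # Several requests can be passed in one call to process files as a batch.
    operation = client.async_batch_annotate_files(requests=[request])
    operation.result(timeout=timeout_s)


def extract_text(output_json: str) -> str:
    """Concatenate the page texts found in one Vision output JSON file."""
    payload = json.loads(output_json)
    pages = [
        response.get("fullTextAnnotation", {}).get("text", "")
        for response in payload.get("responses", [])
    ]
    return "".join(pages)
```

Keeping the parsing separate from the API call means the text extraction logic can be checked locally against a saved output file, without submitting a new OCR job.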


Preparing the training dataset

Vertex AI offers multiple ways to upload our training dataset. The most convenient choice in our case is to include annotations as part of the import process by using an import file. The import file follows a specific format that specifies the text content and the list of annotations for each label we want the model to learn.
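To illustrate what one line of such an import file might contain, here is a minimal sketch. The field names (`textContent`, `textSegmentAnnotations`, `startOffset`, `endOffset`, `displayName`) reflect our understanding of the Vertex AI text entity extraction import schema and should be treated as assumptions; check them, and whether `endOffset` is exclusive as assumed here, against the current documentation.

```python
import json


def make_import_line(text: str, annotations: list[tuple[int, int, str]]) -> str:
    """Serialize one document and its (start, end, label) spans as one JSONL record."""
    segments = []
    for start, end, label in annotations:
        # Assumption: offsets are character indexes and the end offset is exclusive.
        if not 0 <= start < end <= len(text):
            raise ValueError(f"span ({start}, {end}) is outside the text")
        segments.append({"startOffset": start, "endOffset": end, "displayName": label})
    record = {"textContent": text, "textSegmentAnnotations": segments}
    return json.dumps(record, ensure_ascii=False)


# Illustrative input only; not taken from the patent dataset.
line = make_import_line("Inventor: Jane Doe", [(10, 18, "inventor")])
```

Each document becomes one such line, and the lines are concatenated into a single file that is uploaded to Cloud Storage.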

To generate the annotations, we query the existing data stored in BigQuery and find the location of each extracted entity in the text of each file. If an entity occurs multiple times in the text, all of its occurrences are included in the annotations. We then export the annotations in JSON Lines format to a file in Google Cloud Storage and use that file in our model training. We can also review the annotated dataset in the Google Cloud Console to check the accuracy of the annotations.
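The occurrence search described above can be sketched with plain string matching. This assumes the OCR text reproduces each BigQuery value exactly; real documents may need normalization of whitespace or formatting first. The function names and the sample text are hypothetical, and the notebook may locate entities differently.

```python
def find_occurrences(text: str, value: str) -> list[tuple[int, int]]:
    """Return (start, end) offsets of every non-overlapping match of value in text."""
    spans = []
    if not value:
        return spans
    start = text.find(value)
    while start != -1:
        end = start + len(value)  # exclusive end offset
        spans.append((start, end))
        start = text.find(value, end)
    return spans


def annotate(text: str, entities: dict[str, str]) -> list[dict]:
    """Map each label to the spans where its known value occurs in the OCR text."""
    return [
        {"label": label, "start": start, "end": end}
        for label, value in entities.items()
        for start, end in find_occurrences(text, value)
    ]


# Illustrative text only; the application number appears twice.
sample_text = "App 12/345 filed. Ref 12/345."
sample_spans = annotate(sample_text, {"application_number": "12/345"})
```

Here `sample_spans` holds two entries for the same label, one per occurrence, which matches the rule that every occurrence of an entity is annotated.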

Training the model

Once the import file is ready, we can create a new text dataset in Vertex AI and use it to train a new entity extraction model. After a few hours of training, the model is ready for deployment and testing.
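With the Vertex AI SDK for Python, these two steps might look roughly like the sketch below. The display names and the 80/10/10 split are hypothetical choices for illustration, the project, location, and import file URI are supplied by the caller, and the code assumes the `google-cloud-aiplatform` package with AutoML text training available in your project.

```python
def split_fractions(training: float, validation: float) -> dict[str, float]:
    """Derive the test fraction so the three dataset splits sum to 1."""
    test = round(1.0 - training - validation, 6)
    if min(training, validation, test) <= 0:
        raise ValueError("each split must be a positive fraction")
    return {
        "training_fraction_split": training,
        "validation_fraction_split": validation,
        "test_fraction_split": test,
    }


def train_entity_extractor(project: str, location: str, import_file_uri: str):
    """Create a text dataset from the import file and train an extraction model."""
    # Imported lazily: requires google-cloud-aiplatform and credentials.
    from google.cloud import aiplatform

    aiplatform.init(project=project, location=location)
    dataset = aiplatform.TextDataset.create(
        display_name="patents-entities",  # hypothetical name
        gcs_source=import_file_uri,
        import_schema_uri=aiplatform.schema.dataset.ioformat.text.extraction,
    )
    job = aiplatform.AutoMLTextTrainingJob(
        display_name="patents-entity-extraction",  # hypothetical name
        prediction_type="extraction",
    )
    # run() blocks until training finishes, which can take hours.
    return job.run(
        dataset=dataset,
        model_display_name="patents-entity-extraction",
        **split_fractions(0.8, 0.1),
    )
```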

Evaluating the model

Once model training is complete, you can review the model’s evaluation results in the Google Cloud Console. The Vertex AI documentation explains in more detail how to evaluate AutoML models.

Model evaluation results
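Assuming the evaluation page reports precision, recall, and F1 score for each label (an assumption, since this post does not list the metrics), the following sketch shows how those figures relate to counts of correct and incorrect extractions. The counts are illustrative only and are not results from this model.

```python
def precision_recall_f1(true_pos: int, false_pos: int, false_neg: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from extraction counts for one label."""
    predicted = true_pos + false_pos  # entities the model extracted
    actual = true_pos + false_neg     # entities present in the ground truth
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts: 8 correct extractions, 2 spurious, 2 missed.
precision, recall, f1 = precision_recall_f1(true_pos=8, false_pos=2, false_neg=2)
```

Low precision for a label suggests the model extracts spans that are not that entity, while low recall suggests it misses occurrences; either can point back to gaps in the annotations.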

Putting it all together

The diagram below shows the various components used to build the complete solution and how they interact with each other.


Solution diagram
Note: This diagram was created using the free Google Cloud Architecture Diagramming Tool, which makes it easy to document your Google Cloud architecture. Check it out and begin using it for your own project!

Summary

In this post, we’ve learned how to train a Vertex AI text entity extraction model by using BigQuery and Vision API to annotate ground truth data. This approach makes it easier to replicate the solution and to leverage existing datasets to accelerate your AI/ML journey.


Next Steps

You can try this solution by using the accompanying Jupyter Notebook, which you can run on your machine, in Colab, or in Vertex AI Workbench. You can also check out the Vertex AI Samples GitHub repository for more examples of developing and managing machine learning workflows with Google Cloud Vertex AI.

If you’d like to review more of the latest Google Cloud tools for ML practitioners, you can watch recordings of the second Google Cloud Applied ML Summit. Catch up on the latest product announcements, insights from experts, and customer stories that can help you grow your skills at the pace of innovation.

We wish you a happy machine learning journey!

Special thanks to Karl Weinmeister, Andrew Ferlitsch and Daniel Wang for their help in reviewing this post’s content, and to Terrie Pugh for her editorial support. You rock!

By: Mohammad Al-Ansari (Customer Engineer, Infrastructure Modernization (GCloud Customers))
Source: Google Cloud Blog


