aster.cloud aster.cloud
  • /
  • Platforms
    • Public Cloud
    • On-Premise
    • Hybrid Cloud
    • Data
  • Architecture
    • Design
    • Solutions
    • Enterprise
  • Engineering
    • Automation
    • Software Engineering
    • Project Management
    • DevOps
  • Programming
    • Learning
  • Tools
  • About
  • /
  • Platforms
    • Public Cloud
    • On-Premise
    • Hybrid Cloud
    • Data
  • Architecture
    • Design
    • Solutions
    • Enterprise
  • Engineering
    • Automation
    • Software Engineering
    • Project Management
    • DevOps
  • Programming
    • Learning
  • Tools
  • About
aster.cloud aster.cloud
  • /
  • Platforms
    • Public Cloud
    • On-Premise
    • Hybrid Cloud
    • Data
  • Architecture
    • Design
    • Solutions
    • Enterprise
  • Engineering
    • Automation
    • Software Engineering
    • Project Management
    • DevOps
  • Programming
    • Learning
  • Tools
  • About
  • Data
  • Engineering

Streaming Data Into BigQuery Using Storage Write API

  • aster.cloud
  • March 8, 2022
  • 4 minute read

BigQuery is a serverless, highly scalable, and cost-effective data warehouse that customers love. Similarly, Dataflow is a serverless, horizontally and vertically scaling platform for large scale data processing. Many users use both these products in conjunction to get timely analytics from the immense volume of data a modern enterprise generates.  To make sure that users have a great experience using BigQuery and Dataflow together,  we are constantly working on making new integrations that make it easier to use, scale and optimize.  For example, recently we launched support for auto sharding for BigQueryIO connector, which improves the throughput of streaming pipelines on average by 3x.

 


Partner with aster.cloud
for your next big idea.
Let us know here.



From our partners:

CITI.IO :: Business. Institutions. Society. Global Political Economy.
CYBERPOGO.COM :: For the Arts, Sciences, and Technology.
DADAHACKS.COM :: Parenting For The Rest Of Us.
ZEDISTA.COM :: Entertainment. Sports. Culture. Escape.
TAKUMAKU.COM :: For The Hearth And Home.
ASTER.CLOUD :: From The Cloud And Beyond.
LIWAIWAI.COM :: Intelligence, Inside and Outside.
GLOBALCLOUDPLATFORMS.COM :: For The World's Computing Needs.
FIREGULAMAN.COM :: For The Fire In The Belly Of The Coder.
ASTERCASTER.COM :: Supra Astra. Beyond The Stars.
BARTDAY.COM :: Prosperity For Everyone.

Today, we are happy to launch another such integration that brings the best of BigQuery to Dataflow users. The BigQuery team recently launched the new  BigQuery Storage Write API into general availability.  The BigQuery Storage Write API is a unified data-ingestion API for BigQuery. For Dataflow users, this means you can combine  streaming ingestion and batch loading into a single high-performance API. You can use the Storage Write API to stream records into BigQuery that become available for querying as they are written, or to batch process a large number of records and commit them in a single atomic operation. The new API offers higher throughput than its predecessor, table.insertAll() API, and significantly lowers streaming ingestion cost, including up to 2TB free usage per month.

The new API is available via a Java client library, a Python client library or you can use it from any language which supports gRPC.

We are now making support for the Storage Write API in Dataflow available by providing two additional methods to the BigQueryIO connector. You have a choice of using a method with exactly-once semantics of inserting data into BigQuery or a  lower latency and potentially cheaper method with at-least-once semantics.

Read More  Automate Identity Document Processing With Document AI

Using BigQuery Storage Write API with exactly-once semantics

The following section  shows how with a few small changes you can update your existing Java pipelines and take advantage of the Storage Write API’s strong transactional semantics.

1. Update your pipeline to the Beam SDK version that supports the Write API (we recommend using version 2.36.0 or newer) and is supported by Dataflow.

2. To use the new API, set the new method, STORAGE_WRITE_API, when creating the BigQueryIO’s Write transform. A typical code will look like this:

 

WriteResult writeResult = rows.apply("Save Rows to BigQuery", 
BigQueryIO.writeTableRows()
        .to(options.getFullyQualifiedTableName())
        .withWriteDisposition(WriteDisposition.WRITE_APPEND)
        .withCreateDisposition(CreateDisposition.CREATE_NEVER)
        .withMethod(Method.STORAGE_WRITE_API);
);

 

3. If your pipeline needs to create the table (in case it doesn’t exist and you specified the create disposition as CREATE_IF_NEEDED) you will need to provide the table schema. It will be also used by the API to validate data and convert it to the efficient binary protocol buffer message before calling the backend. The new API will be using this schema to do data validation before it’s submitted to the backend.

 

TableSchema schema = new TableSchema().setFields(
        List.of(
            new TableFieldSchema()
                .setName("request_ts")
                .setType("TIMESTAMP")
                .setMode("REQUIRED"),
            new TableFieldSchema()
                .setName("user_name")
                .setType("STRING")
                .setMode("REQUIRED")));

 

4. Finally, for the streaming pipelines two additional parameters need to be set – number of streams and triggering frequency.

Number of streams defines the parallelism of the BigQueryIO’s Write transform and roughly corresponds to the number of Storage Write API’s streams which will be used by the pipeline. You can set it explicitly on the transform via the withNumStorageWriteApiStreams method or provide the “numStorageWriteApiStreams” option to the pipeline as defined in BigQueryOptions class.

Read More  Here’s How Much Your Personal Information Is Worth To Cybercriminals – And What They Do With It

Triggering frequency will determine how soon the data will be visible for querying in BigQuery. You can explicitly set it via the withTriggeringFrequency method or specify the number of seconds by setting the “storageWriteApiTriggeringFrequencySec” option.

The combination of these two parameters affect the size of the batches of rows the BigqueryIO creates before calling the Storage Write API. Setting the frequency too high can result in smaller batches which can affect the performance.

We recommend that you test your pipeline on a representative volume before running it in production (good idea for all pipelines!) and find optimal values for the two parameters we mentioned above.  As a starting point here’s some guidance: a single stream should be able to handle throughput of at least 1Mb per second. Creating exclusive streams used by this method is an expensive operation for the BigQuery service; use only as many streams as needed for your use case. Triggering frequency in single-digit seconds is a good choice for most pipelines.

In the future we will enable auto sharding support to determine and adjust these parameters at the run time.

Using BigQuery Storage Write API with at-least-once semantics

For the use cases where potential duplicate records in the target table are acceptable you can use the STORAGE_WRITE_API method’s cousin, the STORAGE_API_AT_LEAST_ONCE method. Because this method doesn’t persist the records to be written to BigQuery into its shuffle storage (needed to provide the exactly-once semantics of the STORAGE_WRITE_API method) it will be cheaper and will result in lower latency for most pipelines.  It is also simpler to use – it doesn’t need the two additional parameters we mentioned earlier.

Read More  Introducing Pay-As-You-Go Pricing For Apigee API Management

Before running large scale pipelines which will require a substantial number of streams (thousands), review the Storage Write API quotas. These quotas are related to the number of open gRPC connections to the BigQuery service. In BigQueryIO implementation the number of connections varies based on the method. For the STORAGE_WRITE_API method it is roughly the number of streams;  for the STORAGE_API_AT_LEAST_ONCE method it can be up to the maximum number of workers in the pipeline. Notice that the maximum number of connections for ingestion into tables located in multi-regional BigQuery locations (“us” and “eu”) is higher than for regional locations.

Next steps

BigQuery Storage Write API and Dataflow support for it are now available for all users with  Beam SDK 2.36.0 (or newer).

 

 

By: Shanmugam (Shan) Kulandaivel (Product Manager, Streaming Analytics, Google Cloud) and Sergei Lilichenko (Solutions Architect)
Source: Google Cloud Blog


For enquiries, product placements, sponsorships, and collaborations, connect with us at [email protected]. We'd love to hear from you!

Our humans need coffee too! Your support is highly appreciated, thank you!

aster.cloud

Related Topics
  • API
  • BigQuery;
  • Data Analytics
  • Dataflow
  • Google Cloud
You May Also Like
Getting things done makes her feel amazing
View Post
  • Computing
  • Data
  • Featured
  • Learning
  • Tech
  • Technology

Nurturing Minds in the Digital Revolution

  • April 25, 2025
View Post
  • Engineering
  • Technology

Guide: Our top four AI Hypercomputer use cases, reference architectures and tutorials

  • March 9, 2025
View Post
  • Computing
  • Engineering

Why a decades old architecture decision is impeding the power of AI computing

  • February 19, 2025
View Post
  • Engineering
  • Software Engineering

This Month in Julia World

  • January 17, 2025
View Post
  • Engineering
  • Software Engineering

Google Summer of Code 2025 is here!

  • January 17, 2025
View Post
  • Data
  • Engineering

Hiding in Plain Site: Attackers Sneaking Malware into Images on Websites

  • January 16, 2025
View Post
  • Computing
  • Design
  • Engineering
  • Technology

Here’s why it’s important to build long-term cryptographic resilience

  • December 24, 2024
IBM and Ferrari Premium Partner
View Post
  • Data
  • Engineering

IBM Selected as Official Fan Engagement and Data Analytics Partner for Scuderia Ferrari HP

  • November 7, 2024

Stay Connected!
LATEST
  • college-of-cardinals-2025 1
    The Definitive Who’s Who of the 2025 Papal Conclave
    • May 7, 2025
  • conclave-poster-black-smoke 2
    The World Is Revalidating Itself
    • May 6, 2025
  • 3
    Conclave: How A New Pope Is Chosen
    • April 25, 2025
  • Getting things done makes her feel amazing 4
    Nurturing Minds in the Digital Revolution
    • April 25, 2025
  • 5
    AI is automating our jobs – but values need to change if we are to be liberated by it
    • April 17, 2025
  • 6
    Canonical Releases Ubuntu 25.04 Plucky Puffin
    • April 17, 2025
  • 7
    United States Army Enterprise Cloud Management Agency Expands its Oracle Defense Cloud Services
    • April 15, 2025
  • 8
    Tokyo Electron and IBM Renew Collaboration for Advanced Semiconductor Technology
    • April 2, 2025
  • 9
    IBM Accelerates Momentum in the as a Service Space with Growing Portfolio of Tools Simplifying Infrastructure Management
    • March 27, 2025
  • 10
    Tariffs, Trump, and Other Things That Start With T – They’re Not The Problem, It’s How We Use Them
    • March 25, 2025
about
Hello World!

We are aster.cloud. We’re created by programmers for programmers.

Our site aims to provide guides, programming tips, reviews, and interesting materials for tech people and those who want to learn in general.

We would like to hear from you.

If you have any feedback, enquiries, or sponsorship request, kindly reach out to us at:

[email protected]
Most Popular
  • 1
    IBM contributes key open-source projects to Linux Foundation to advance AI community participation
    • March 22, 2025
  • 2
    Co-op mode: New partners driving the future of gaming with AI
    • March 22, 2025
  • 3
    Mitsubishi Motors Canada Launches AI-Powered “Intelligent Companion” to Transform the 2025 Outlander Buying Experience
    • March 10, 2025
  • PiPiPi 4
    The Unexpected Pi-Fect Deals This March 14
    • March 13, 2025
  • Nintendo Switch Deals on Amazon 5
    10 Physical Nintendo Switch Game Deals on MAR10 Day!
    • March 9, 2025
  • /
  • Technology
  • Tools
  • About
  • Contact Us

Input your search keywords and press Enter.