Streaming Data Into BigQuery Using Storage Write API

  • aster.cloud
  • March 8, 2022
  • 4 minute read

BigQuery is a serverless, highly scalable, and cost-effective data warehouse that customers love. Similarly, Dataflow is a serverless, horizontally and vertically scaling platform for large-scale data processing. Many users run these two products in conjunction to get timely analytics from the immense volume of data a modern enterprise generates. To make sure that users have a great experience using BigQuery and Dataflow together, we are constantly working on new integrations that make the pair easier to use, scale, and optimize. For example, we recently launched auto-sharding support for the BigQueryIO connector, which improves the throughput of streaming pipelines on average by 3x.

 


Today, we are happy to launch another such integration that brings the best of BigQuery to Dataflow users. The BigQuery team recently brought the new BigQuery Storage Write API to general availability. The BigQuery Storage Write API is a unified data-ingestion API for BigQuery: it combines streaming ingestion and batch loading into a single high-performance API. You can use the Storage Write API to stream records into BigQuery that become available for querying as they are written, or to batch process a large number of records and commit them in a single atomic operation. The new API offers higher throughput than its predecessor, the tabledata.insertAll API, and significantly lowers streaming ingestion cost, including up to 2 TB of free usage per month.

The new API is available via a Java client library and a Python client library, and you can use it from any language that supports gRPC.
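
To give a feel for the API outside of Dataflow, here is a minimal sketch of streaming JSON records into a table's default stream with the Java client library. The project, dataset, table, and field names are hypothetical, and error handling is omitted:

import com.google.api.core.ApiFuture;
import com.google.cloud.bigquery.storage.v1.AppendRowsResponse;
import com.google.cloud.bigquery.storage.v1.JsonStreamWriter;
import com.google.cloud.bigquery.storage.v1.TableFieldSchema;
import com.google.cloud.bigquery.storage.v1.TableName;
import com.google.cloud.bigquery.storage.v1.TableSchema;
import org.json.JSONArray;
import org.json.JSONObject;

public class StorageWriteApiSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical project, dataset, and table names.
    TableName table = TableName.of("my-project", "my_dataset", "user_events");

    // Note: the Storage Write API client has its own TableSchema type,
    // distinct from the com.google.api.services.bigquery.model schema
    // that Beam's BigQueryIO uses.
    TableSchema schema = TableSchema.newBuilder()
        .addFields(TableFieldSchema.newBuilder()
            .setName("user_name")
            .setType(TableFieldSchema.Type.STRING)
            .setMode(TableFieldSchema.Mode.REQUIRED)
            .build())
        .build();

    // Appends to the table's default stream are committed as they are
    // written and become available for querying immediately.
    try (JsonStreamWriter writer =
        JsonStreamWriter.newBuilder(table.toString(), schema).build()) {
      JSONArray rows = new JSONArray();
      rows.put(new JSONObject().put("user_name", "alice"));
      ApiFuture<AppendRowsResponse> future = writer.append(rows);
      future.get(); // block until the append is acknowledged
    }
  }
}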

We are now making support for the Storage Write API in Dataflow available by providing two additional methods to the BigQueryIO connector. You have a choice between a method with exactly-once semantics for inserting data into BigQuery, and a lower-latency, potentially cheaper method with at-least-once semantics.


Using BigQuery Storage Write API with exactly-once semantics

The following section shows how, with a few small changes, you can update your existing Java pipelines to take advantage of the Storage Write API's strong transactional semantics.

1. Update your pipeline to a Beam SDK version that supports the Write API and is supported by Dataflow (we recommend version 2.36.0 or newer).

2. To use the new API, set the new method, STORAGE_WRITE_API, when creating the BigQueryIO Write transform. Typical code looks like this:

 

import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.Method;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.WriteResult;

// Write the PCollection of TableRows to BigQuery via the Storage Write API.
WriteResult writeResult = rows.apply("Save Rows to BigQuery",
    BigQueryIO.writeTableRows()
        .to(options.getFullyQualifiedTableName())
        .withWriteDisposition(WriteDisposition.WRITE_APPEND)
        .withCreateDisposition(CreateDisposition.CREATE_NEVER)
        .withMethod(Method.STORAGE_WRITE_API));

 

3. If your pipeline needs to create the table (in case it doesn't exist and you specified the create disposition as CREATE_IF_NEEDED), you will need to provide the table schema. The API uses this schema to validate the data and convert it into an efficient binary protocol buffer message before submitting it to the backend.

 

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.List;

TableSchema schema = new TableSchema().setFields(
        List.of(
            new TableFieldSchema()
                .setName("request_ts")
                .setType("TIMESTAMP")
                .setMode("REQUIRED"),
            new TableFieldSchema()
                .setName("user_name")
                .setType("STRING")
                .setMode("REQUIRED")));

 

4. Finally, for streaming pipelines, two additional parameters need to be set: the number of streams and the triggering frequency.

The number of streams defines the parallelism of the BigQueryIO Write transform and roughly corresponds to the number of Storage Write API streams the pipeline will use. You can set it explicitly on the transform via the withNumStorageWriteApiStreams method, or provide the "numStorageWriteApiStreams" option to the pipeline as defined in the BigQueryOptions class.


The triggering frequency determines how soon the data becomes visible for querying in BigQuery. You can set it explicitly via the withTriggeringFrequency method, or specify the number of seconds by setting the "storageWriteApiTriggeringFrequencySec" option.

Together, these two parameters determine the size of the batches of rows that BigQueryIO creates before calling the Storage Write API; a sketch combining them follows below. Setting the frequency too high can result in smaller batches, which can hurt performance.
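
As an illustration, here is the earlier write with both parameters set on the transform. The stream count and frequency values are made-up examples, not recommendations:

import org.joda.time.Duration;

// Equivalent pipeline options: --numStorageWriteApiStreams=3 and
// --storageWriteApiTriggeringFrequencySec=5.
WriteResult writeResult = rows.apply("Save Rows to BigQuery",
    BigQueryIO.writeTableRows()
        .to(options.getFullyQualifiedTableName())
        .withMethod(Method.STORAGE_WRITE_API)
        .withNumStorageWriteApiStreams(3)                       // write parallelism
        .withTriggeringFrequency(Duration.standardSeconds(5))); // commit every 5s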

We recommend that you test your pipeline on a representative volume before running it in production (a good idea for all pipelines!) to find optimal values for the two parameters mentioned above. As a starting point, here's some guidance: a single stream should be able to handle throughput of at least 1 MB per second, so a pipeline that peaks at, say, 50 MB/s of ingestion would need on the order of 50 streams. Creating the exclusive streams used by this method is an expensive operation for the BigQuery service, so use only as many streams as needed for your use case. A triggering frequency in single-digit seconds is a good choice for most pipelines.

In the future we will enable auto-sharding support to determine and adjust these parameters at runtime.

Using BigQuery Storage Write API with at-least-once semantics

For use cases where potential duplicate records in the target table are acceptable, you can use the STORAGE_WRITE_API method's cousin, the STORAGE_API_AT_LEAST_ONCE method. Because this method doesn't persist the records to be written to BigQuery in Dataflow's shuffle storage (which is needed to provide the exactly-once semantics of the STORAGE_WRITE_API method), it is cheaper and results in lower latency for most pipelines. It is also simpler to use: it doesn't need the two additional parameters mentioned earlier.
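
Switching the earlier snippet to at-least-once semantics is a one-line change; a sketch:

// Same write as before, but with at-least-once semantics. No stream count
// or triggering frequency is required for this method.
WriteResult writeResult = rows.apply("Save Rows to BigQuery",
    BigQueryIO.writeTableRows()
        .to(options.getFullyQualifiedTableName())
        .withMethod(Method.STORAGE_API_AT_LEAST_ONCE));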


Before running large-scale pipelines that require a substantial number of streams (thousands), review the Storage Write API quotas. These quotas limit the number of open gRPC connections to the BigQuery service. In the BigQueryIO implementation, the number of connections varies by method: for the STORAGE_WRITE_API method it is roughly the number of streams; for the STORAGE_API_AT_LEAST_ONCE method it can be up to the maximum number of workers in the pipeline. Note that the maximum number of connections for ingestion into tables located in multi-regional BigQuery locations ("us" and "eu") is higher than for regional locations.

Next steps

The BigQuery Storage Write API and Dataflow support for it are now available to all users with Beam SDK 2.36.0 or newer.

By: Shanmugam (Shan) Kulandaivel (Product Manager, Streaming Analytics, Google Cloud) and Sergei Lilichenko (Solutions Architect)
Source: Google Cloud Blog



Related Topics
  • API
  • BigQuery
  • Data Analytics
  • Dataflow
  • Google Cloud