aster.cloud aster.cloud
  • /
  • Platforms
    • Public Cloud
    • On-Premise
    • Hybrid Cloud
    • Data
  • Architecture
    • Design
    • Solutions
    • Enterprise
  • Engineering
    • Automation
    • Software Engineering
    • Project Management
    • DevOps
  • Programming
    • Learning
  • Tools
  • About
  • /
  • Platforms
    • Public Cloud
    • On-Premise
    • Hybrid Cloud
    • Data
  • Architecture
    • Design
    • Solutions
    • Enterprise
  • Engineering
    • Automation
    • Software Engineering
    • Project Management
    • DevOps
  • Programming
    • Learning
  • Tools
  • About
aster.cloud aster.cloud
  • /
  • Platforms
    • Public Cloud
    • On-Premise
    • Hybrid Cloud
    • Data
  • Architecture
    • Design
    • Solutions
    • Enterprise
  • Engineering
    • Automation
    • Software Engineering
    • Project Management
    • DevOps
  • Programming
    • Learning
  • Tools
  • About
  • Data
  • Engineering

Ingesting Google Cloud Storage Files To BigQuery Using Cloud Functions And Serverless Spark

  • aster.cloud
  • April 11, 2022
  • 4 minute read

Apache Spark has become a popular platform as it can serve all of data engineering, data exploration, and machine learning use cases. However, Spark still requires the on-premises way of managing clusters and tuning infrastructure for each job. This increases costs, reduces agility, and makes governance extremely hard; prohibiting enterprises from making insights available to the right users at the right time.Dataproc Serverless lets you run Spark batch workloads without requiring you to provision and manage your own cluster. Specify workload parameters, and then submit the workload to the Dataproc Serverless service. The service will run the workload on a managed compute infrastructure, autoscaling resources as needed. Dataproc Serverless charges apply only to the time when the workload is executing.

You can run the following Spark workload types on the Dataproc Serverless for Spark service:


Partner with aster.cloud
for your next big idea.
Let us know here.



From our partners:

CITI.IO :: Business. Institutions. Society. Global Political Economy.
CYBERPOGO.COM :: For the Arts, Sciences, and Technology.
DADAHACKS.COM :: Parenting For The Rest Of Us.
ZEDISTA.COM :: Entertainment. Sports. Culture. Escape.
TAKUMAKU.COM :: For The Hearth And Home.
ASTER.CLOUD :: From The Cloud And Beyond.
LIWAIWAI.COM :: Intelligence, Inside and Outside.
GLOBALCLOUDPLATFORMS.COM :: For The World's Computing Needs.
FIREGULAMAN.COM :: For The Fire In The Belly Of The Coder.
ASTERCASTER.COM :: Supra Astra. Beyond The Stars.
BARTDAY.COM :: Prosperity For Everyone.

  • Pyspark
  • Spark SQL
  • Spark R
  • Spark Java/Scala

This post walks you through the process of ingesting files into BigQuery using serverless service such as Cloud Functions, Pub/Sub & Serverless Spark. We use Daily Shelter Occupancy data in this example.

You can find the complete source code for this solution within our Github.

Here’s a look at the architecture we’ll be using:

 

Here’s how to get started with ingesting GCS files to BigQuery using Cloud Functions and Serverless Spark:

1. Create a bucket, the bucket holds the data to be ingested in GCP. Once the object is upload in a bucket, the notification is created in Pub/Sub topic

 

PROJECT_ID=<<project_id>>
GCS_BUCKET_NAME=<<Bucket name>>
gsutil mb gs://${GCS_BUCKET_NAME}
gsutil notification create \
    -t projects/${PROJECT_ID}/topics/create_notification_${GCS_BUCKET_NAME} \
    -e OBJECT_FINALIZE \
    -f json gs://${GCS_BUCKET_NAME}

 

Read More  Get A Head Start With No-Cost Learning Challenges Before Next ‘22

2. Build and copy the jar to a GCS bucket(Create a GCS bucket to store the jar if you don’t have one).

Follow the steps to create a GCS bucket and copy JAR to the same.

 

GCS_ARTIFACT_REPO=<<artifact repo name>>
gsutil mb gs://${GCS_ARTIFACT_REPO}
cd gcs2bq-spark
mvn clean install
gsutil cp target/GCS2BQWithSpark-1.0-SNAPSHOT.jar gs://${GCS_ARTIFACT_REPO}/

 

3. Enable network configuration required to run serverless spark

 

  • Open subnet connectivity: The subnet must allow subnet communication on all ports. The following gcloud command attaches a network firewall to a subnet that allows ingress communications using all protocols on all ports if the source and destination are tagged with “serverless-spark”

 

gcloud compute firewall-rules create allow-internal-ingress \
--network="default" \
--source-tags="serverless-spark" \
--target-tags="serverless-spark" \
--direction="ingress" \
--action="allow" \
--rules="all"

 

Note: The default VPC network in a project with the default-allow-internal firewall rule, which allows ingress communication on all ports (tcp:0-65535, udp:0-65535, and icmp protocols:ports), meets this requirement. However, it also allows ingress by any VM instance on the network

  • Private Google Access: The subnet must have Private Google Access enabled. External network access. Drivers and executors have internal IP addresses. You can set up Cloud NAT to allow outbound traffic using internal IPs on your VPC network.

4. Create necessary GCP resources required by Serverless Spark

  • Create BQ Dataset Create a dataset to load csv files.

 

DATASET_NAME=<<dataset_name>>
bq --location=US mk -d \
    --default_table_expiration 3600 \
    ${DATASET_NAME}

 

  • Create BQ table Create a table using the schema in schema/schema.json

 

TABLE_NAME=<<table_name>>
bq mk --table ${PROJECT_ID}:${DATASET_NAME}.${TABLE_NAME} \
    ./schema/schema.json

 

  • Create service account and permission required to read from GCS bucket and write to BigQuery table
Read More  Introducing SWIFT On Google Cloud

 

SERVICE_ACCOUNT_ID="gcs-to-bq-sa"
gcloud iam service-accounts create ${SERVICE_ACCOUNT_ID} \
    --description="GCS to BQ service account for Serverless Spark" \
    --display-name="GCS2BQ-SA"

roles=("roles/dataproc.worker" "roles/bigquery.dataEditor" "roles/bigquery.jobUser" "roles/storage.objectViewer")
for role in ${roles[@]}; do
    gcloud projects add-iam-policy-binding ${PROJECT_ID} \
        --member="serviceAccount:${SERVICE_ACCOUNT_ID}@${PROJECT_ID}.iam.gserviceaccount.com" \
        --role="$role"
done

 

  • Create GCS bucket to load data to BigQuery

 

GCS_TEMP_BUCKET=<<temp_bucket>>
gsutil mb gs://${GCS_TEMP_BUCKET}

 

  • Create Dead Letter Topic and Subscription

 

ERROR_TOPIC=err_gcs2bq_${GCS_BUCKET_NAME}
gcloud pubsub topics create $ERROR_TOPIC
gcloud pubsub subscriptions create err_sub_${GCS_BUCKET_NAME}} \
--topic=${ERROR_TOPIC}

 

Note: Once all resources are created, change the variables value () in trigger-serverless-spark-fxn/main.py from line 27 to 31. The code of the function is in Github

5. Deploy the cloud function.

The cloud function is triggered once the object is copied to the bucket. The cloud function triggers the Servereless spark which loads data into Bigquery.

 

cd trigger-serverless-spark-fxn
gcloud functions deploy trigger-serverless-spark-fxn --entry-point \
invoke_sreverless_spark --runtime python37 \
--trigger-resource ${GCS_BUCKET_NAME}_create_notification \
--trigger-event google.pubsub.topic.publish

 

6. Invoke end to end pipeline

Invoke the end-to-end pipeline by Downloading 2020 Daily Center Data and uploading to the GCS bucket(GCS_BUCKET_NAME).(Note: replace with the bucket name created in Step-1)

Debugging Pipelines

Error messages for the failed data pipelines are published to Pub/Sub topic (ERROR_TOPIC) created in Step 4 (Create Dead Letter Topic and Subscription). The errors from both cloud function and spark are forwarded to Pub/Sub. Pub/Sub topics might have multiple entries for the same data-pipeline instance. Messages in Pub/Sub topics can be filtered using the “oid” attribute. The attribute(oid) is unique for each pipeline run and holds a full object name with the generation id.

Conclusion

In this post, we’ve shown you how to ingest GCS files to BigQuery using Cloud Functions and Serverless Spark. Register interest here to request early access to the new solutions for Spark on Google Cloud.

Read More  Developers - Build, Learn, And Grow Your Career Faster With Google Cloud

 

 

By: Swati Sindwani (Big Data and Analytics Cloud Consultant) and Bipin Upadhyaya (Strategic Cloud Engineer)
Source: Google Cloud Blog


For enquiries, product placements, sponsorships, and collaborations, connect with us at [email protected]. We'd love to hear from you!

Our humans need coffee too! Your support is highly appreciated, thank you!

aster.cloud

Related Topics
  • Apache Spark
  • BigQuery;
  • Cloud Functions
  • Data Analytics
  • Google Cloud
  • Serverless
  • Tutorial
You May Also Like
Data center
View Post
  • Data
  • Public Cloud

Data Sovereignty in Spain. It’s Not Just About the Law, It’s About Efficiency

  • June 3, 2026
View Post
  • Data
  • Platforms
  • Technology

Scaling cloud and AI: Microsoft Azure’s commitment to Europe’s digital future

  • May 11, 2026
View Post
  • Data

Streamline read scalability with Cloud SQL autoscaling read pools

  • March 23, 2026
View Post
  • Data
  • Platforms
  • Public Cloud

PayPal’s historically large data migration is the foundation for its gen AI innovation

  • March 4, 2026
View Post
  • Data
  • Technology

3 obstacles to agentic AI adoption and how to overcome them

  • December 22, 2025
Points, Lines and a Question
View Post
  • Architecture
  • Design
  • Engineering
  • People

What Is The Point In Making Points?

  • November 26, 2025
View Post
  • Engineering
  • Software Engineering

Development gets better with Age

  • October 9, 2025
View Post
  • Engineering
  • Technology

Apple supercharges its tools and technologies for developers to foster creativity, innovation, and design

  • June 9, 2025

Stay Connected!
LATEST
  • 1
    You Do Not Need to Invest in the IPO of SpaceX, Anthropic, and OpenAI
    • June 10, 2026
  • 2
    The consequences of relying on AI for accurate news
    • June 10, 2026
  • 3
    Connecting AI agents with unstructured data using Google Cloud Storage MCP Servers
    • June 10, 2026
  • 4
    IBM and Google Cloud Announce Strategic Partnership to Scale AI with Human Expertise and AI‑Powered Delivery
    • June 4, 2026
  • Data center 5
    Data Sovereignty in Spain. It’s Not Just About the Law, It’s About Efficiency
    • June 3, 2026
  • 6
    Ink vs Pixels. What you miss versus what you are actually missing.
    • June 1, 2026
  • 7
    Banks race to patch new cyber vulnerabilities, and other cybersecurity news
    • May 25, 2026
  • pope-leo-xiv-cq5dam-1500.844 8
    Pope Leo XIV to Publish First Encyclical on Artificial Intelligence and Human Dignity on 25 May
    • May 22, 2026
  • 9
    Portfolio to Clients, and is Strengthened by Ongoing Project Glasswing Work
    • May 20, 2026
  • reMarkable Paper Pure 10
    Everything The reMarkable Paper Pure Actually Does
    • May 14, 2026
about
Hello World!

We are aster.cloud. We’re created by programmers for programmers.

Our site aims to provide guides, programming tips, reviews, and interesting materials for tech people and those who want to learn in general.

We would like to hear from you.

If you have any feedback, enquiries, or sponsorship request, kindly reach out to us at:

[email protected]
Most Popular
  • 1
    Scaling cloud and AI: Microsoft Azure’s commitment to Europe’s digital future
    • May 11, 2026
  • Anthropic Institute 2
    Introducing The Anthropic Institute
    • March 11, 2026
  • reMarkable Paper Pure 3
    The Quiet Revolution You Did Not Know You Needed
    • May 9, 2026
  • spain-qNO3XMQILTA-unsplash 4
    When the World Feels Unstable, Spain Remains the Calm. Here’s How to Get There Safely.
    • May 2, 2026
  • 5
    Why The CLOUD Act And Geopolitics Are Forcing A Data Sovereignty Reckoning In Europe
    • May 2, 2026
  • /
  • Technology
  • Tools
  • About
  • Contact Us

Input your search keywords and press Enter.