aster.cloud aster.cloud
  • /
  • Platforms
    • Public Cloud
    • On-Premise
    • Hybrid Cloud
    • Data
  • Architecture
    • Design
    • Solutions
    • Enterprise
  • Engineering
    • Automation
    • Software Engineering
    • Project Management
    • DevOps
  • Programming
    • Learning
  • Tools
  • About
  • /
  • Platforms
    • Public Cloud
    • On-Premise
    • Hybrid Cloud
    • Data
  • Architecture
    • Design
    • Solutions
    • Enterprise
  • Engineering
    • Automation
    • Software Engineering
    • Project Management
    • DevOps
  • Programming
    • Learning
  • Tools
  • About
aster.cloud aster.cloud
  • /
  • Platforms
    • Public Cloud
    • On-Premise
    • Hybrid Cloud
    • Data
  • Architecture
    • Design
    • Solutions
    • Enterprise
  • Engineering
    • Automation
    • Software Engineering
    • Project Management
    • DevOps
  • Programming
    • Learning
  • Tools
  • About
  • Engineering
  • Practices

Spatial Clustering On BigQuery – Best Practices

  • aster.cloud
  • September 17, 2022
  • 6 minute read

Most data analysts are familiar with the concept of organizing data into clusters so that it can be queried faster and at a lower cost. The user behavior dictates how the dataset should be clustered: for example, when a user seeks to analyze or visualize geospatial data (a.k.a location data), it is most efficient to cluster on a geospatial column. This practice is known as spatial clustering, and in this blog, we will share best practices for implementing it in BigQuery (hint — let BigQuery do it for you).

BigQuery is a petabyte-scale data warehouse that has many geospatial capabilities and functions. In the following sections, we will describe how BigQuery does spatial clustering out of the box using the S2 indexing system. We will also touch on how to use other spatial indexes like H3 and geohash, and compare the cost savings of different approaches.


Partner with aster.cloud
for your next big idea.
Let us know here.



From our partners:

CITI.IO :: Business. Institutions. Society. Global Political Economy.
CYBERPOGO.COM :: For the Arts, Sciences, and Technology.
DADAHACKS.COM :: Parenting For The Rest Of Us.
ZEDISTA.COM :: Entertainment. Sports. Culture. Escape.
TAKUMAKU.COM :: For The Hearth And Home.
ASTER.CLOUD :: From The Cloud And Beyond.
LIWAIWAI.COM :: Intelligence, Inside and Outside.
GLOBALCLOUDPLATFORMS.COM :: For The World's Computing Needs.
FIREGULAMAN.COM :: For The Fire In The Belly Of The Coder.
ASTERCASTER.COM :: Supra Astra. Beyond The Stars.
BARTDAY.COM :: Prosperity For Everyone.

How BigQuery does spatial clustering under the hood

Clustering ensures that blocks of data with similar values are colocated in storage, which means that the data is easier to retrieve at query time. It also sorts the blocks of data, so that only the necessary blocks need to be scanned, which reduces cost and processing time. In geospatial terms, this means that when you’re querying a particular region, only the rows within or close to that region are scanned, rather than the whole globe.

All of the optimizations described above will occur automatically in BigQuery if you cluster your tables on a GEOGRAPHY column. It’s as easy as typing CLUSTER BY [GEOGRAPHY] when creating the table. Only predicate functions (e.g. ST_Intersects, ST_DWithin) leverage clustering, with the exception of ST_DISJOINT. It should also be noted that while BigQuery supports partitioning and clustering on a variety of fields, only clustering is supported on a geospatial field. This is because geometries can be large and could span across partitions, no matter how BigQuery chooses to partition the space. Finally, cluster sizes will range from 100MB to 1GB, so clustering on a table smaller than 100MB will provide no benefit.

When writing to a table that is clustered by GEOGRAPHY, BigQuery shards the data into spatially-compact blocks. For each block, BigQuery computes a bit of metadata called an S2 covering that includes the spatial area of the data contained within. When querying a geography-clustered table using spatial predicates, BigQuery reads the covering and evaluates whether a particular covering can satisfy the filter. BigQuery then prunes the blocks that cannot satisfy the filter. Users are only charged for data from remaining blocks. Note that S2 coverings can overlap, as it is often impossible to divide data into non-overlapping regions.

Read More  Essential Tips And Tricks For Your First Tech Job

Fundamentally, BigQuery is using the S2 index to map a geometry into a 64-bit integer, then BigQuery clusters on that integer using existing integer-based clustering mechanisms. In the past, customers have manually implemented an S2 indexing system in BigQuery. This was done prior to BigQuery’s native support of spatial clustering via S2. Using BigQuery’s native clustering resulted in a large performance increase, not to mention the added simplicity of not having to manage your own S2 indexes.

Alternative Spatial Indexes

Spatial clustering utilizes a spatial indexing system, or hierarchy, to organize the stored data. The purpose of all spatial indices is to represent this globe we call Earth in numerical terms, allowing us to define a location as a geometric object like a point, polygon or line. There are dozens of spatial indexes, and most databases implement them in their own unique way. Although BigQuery natively uses S2 cells for clustering, other indexes can be manually implemented, such as H3, geohash, or quadkeys. The examples below will involve the following spatial indexes:

  • S2:  The S2 system represents geospatial data as cells on a three dimensional sphere. It is used by Google Maps.
    • uses quadrilaterals, which are more efficient than hexagons
    • Higher precision than H3 or geohashing
  • H3:  The H3 system represents geospatial data on overlapping hexagonal grids.
    • Hexagons are more visually appealing
    • Convolutions and smoothing algorithms are more efficient than S2
  • Geohash – Geohash is a public domain system that represents geospatial data on a curved grid.
    • Length of the Geohash id determines the spatial precision
    • Fairly poor spatial locality, so clustering does not work as well

Spatial clustering in BQ — S2 vs. Geohash

In most cases for analysis, BigQuery’s built-in spatial clustering will give the best performance with the least effort. But if the data is queried according to other attributes, e.g. by geohash box, a custom indexing is necessary. The method of querying the spatial indexes has implications on the performance, as is illustrated in the example below.

Read More  Data Here, Data There, Look, There’s Data Everywhere! Replicating Your Data From Cloud SQL For SQL Server

Example

First, you will create a table with random points in longitude and latitude. Use the BigQuery function st_geohash to generate a geohash id for each point.

 

drop table if exists tmp.points;

create or replace table tmp.tenkrows as
select x from unnest(generate_array(1, 10000)) x;

create or replace table tmp.points
cluster by point
as
with pts as (
 select st_geogpoint(rand() * 360 - 180, rand() * 180 - 90) point
 from tmp.tenkrows a, tmp.tenkrows b
)
select st_geohash(point) gh, pts.point
from pts

 

Use the st_geogpoint function to transform the latitude and longitude into a GEOGRAPHY, BigQuery’s native geospatial type, which uses S2 cells as the index. Select a collection of around 3,000 points. This should cost around 25MB. If you run the same query on an unclustered table, it would cost 5.77GB (the full table size).

 

select * from  tmp.points
where st_dwithin(st_geogpoint(1, 2), point, 10000)

 

Now you will query by geohash id. BigQuery’s ability to leverage the spatial clustering will depend on whether the BQ SAT solver can prove the cluster of data can be pruned. The queries below are both leveraging the geospatial clustering, costing only 340 MB. Note that if we had clustered the table by the ‘gh’ field (ie geohash id), these queries would cost the same as the one above, around 25MB.

 

select * from  tmp.points
where starts_with(gh, 'bbb')

select * from  tmp.points
where gh between 'bbb' and 'bbb~'

 

The query below is much less efficient, costing 5.77GB, a full scan of the table. BigQuery cannot prove this condition fails based on the min/max values of the cluster so it must scan the entire table.

 

select * from  tmp.points
where left(gh, 3) = 'bbb'

 

As the examples show, the least costly querying option is to use the indexing consistent with the query method — native S2 indexing when querying by geography, string indexing when querying by geohash. When using geohashing, avoid left() or right() functions, as it will cause BigQuery to scan the entire table.

Spatial clustering in BQ with H3

One may also find themselves in a situation where they need to use H3 as a spatial index in BigQuery. It is still possible to leverage the performance benefits of clustering, but as with geohashing, it is important to avoid certain patterns.

Read More  HDFC ERGO Partners With Google Cloud To Digitize Insurance Purchasing In India

Suppose you have a huge table of geography points indexed by H3 cell ID at level 15, which you’ve clustered by H3_index (note: these functions are supported through the Carto Spatial Extension for BigQuery). You want to find all the points that belong to lower resolution cells, e.g. at level 7. You might write a query like this:

 

select * from  points
 where H3_ToParent(h3_index, 7) = @parent_cell_id

 

Where H3_ToParent is a custom function that computes the parent cell ID from a higher resolution index. Since you’ve clustered by the H3 index, you might expect a lower cost, however this query will scan the entire table. This happens because H3_ToParent involves bit operations, and is too complex for the BigQuery query analyser to understand how the query’s result is related to cluster boundaries. What you should do instead is give BigQuery the range of the H3 cell IDs at the level that the geographies are indexed, like the following example:

 

select * from  points
 where h3_index BETWEEN H3_CellRangeStart(@parent_cell_id, 15)
                    AND H3_CellRangeEnd(@parent_cell_id, 15)

 

Where H3_CellRangeStart and H3_CellRangeEnd are custom functions that map the lower-resolution parent ID to the appropriate start and end IDs of the higher resolution cells. Now BigQuery will be able to figure out relevant clusters, reducing the cost and improving the performance of the query.

What’s Next?

Spatial clustering is a complex topic that requires specialized knowledge to implement. Using BiqQuery’s native spatial clustering will take most of the work out of your hands. With your geospatial data in BigQuery, you can do amazing spatial analyses like querying the stars, even on large datasets. You can also use BigQuery as a backend for a geospatial application, such as an application that allows customers to explore the climate risk of their assets. Using spatial clustering, and querying your clusters correctly will ensure you get the best performance at the lowest cost.


Acknowledgments: Thanks to Eric Engle and Travis Webb for their help with this post.

 

 

By: Remy Welch (Customer Engineer) and Michael Entin (Software Engineer)
Source: Google Cloud Blog


For enquiries, product placements, sponsorships, and collaborations, connect with us at [email protected]. We'd love to hear from you!

Our humans need coffee too! Your support is highly appreciated, thank you!

aster.cloud

Related Topics
  • Best Practice
  • BigQuery;
  • Data Analytics
  • Design
  • Google Cloud
  • Indexes
  • Query Design
  • Spatial Clustering
You May Also Like
View Post
  • Engineering

Just make it scale: An Aurora DSQL story

  • May 29, 2025
View Post
  • Engineering
  • Technology

Guide: Our top four AI Hypercomputer use cases, reference architectures and tutorials

  • March 9, 2025
View Post
  • Computing
  • Engineering

Why a decades old architecture decision is impeding the power of AI computing

  • February 19, 2025
View Post
  • Engineering
  • Software Engineering

This Month in Julia World

  • January 17, 2025
View Post
  • Engineering
  • Software Engineering

Google Summer of Code 2025 is here!

  • January 17, 2025
View Post
  • Data
  • Engineering

Hiding in Plain Site: Attackers Sneaking Malware into Images on Websites

  • January 16, 2025
View Post
  • Computing
  • Design
  • Engineering
  • Technology

Here’s why it’s important to build long-term cryptographic resilience

  • December 24, 2024
IBM and Ferrari Premium Partner
View Post
  • Data
  • Engineering

IBM Selected as Official Fan Engagement and Data Analytics Partner for Scuderia Ferrari HP

  • November 7, 2024

Stay Connected!
LATEST
  • 1
    Just make it scale: An Aurora DSQL story
    • May 29, 2025
  • 2
    Reliance on US tech providers is making IT leaders skittish
    • May 28, 2025
  • Examine the 4 types of edge computing, with examples
    • May 28, 2025
  • AI and private cloud: 2 lessons from Dell Tech World 2025
    • May 28, 2025
  • 5
    TD Synnex named as UK distributor for Cohesity
    • May 28, 2025
  • Weigh these 6 enterprise advantages of storage as a service
    • May 28, 2025
  • 7
    Broadcom’s ‘harsh’ VMware contracts are costing customers up to 1,500% more
    • May 28, 2025
  • 8
    Pulsant targets partner diversity with new IaaS solution
    • May 23, 2025
  • 9
    Growing AI workloads are causing hybrid cloud headaches
    • May 23, 2025
  • Gemma 3n 10
    Announcing Gemma 3n preview: powerful, efficient, mobile-first AI
    • May 22, 2025
about
Hello World!

We are aster.cloud. We’re created by programmers for programmers.

Our site aims to provide guides, programming tips, reviews, and interesting materials for tech people and those who want to learn in general.

We would like to hear from you.

If you have any feedback, enquiries, or sponsorship request, kindly reach out to us at:

[email protected]
Most Popular
  • Understand how Windows Server 2025 PAYG licensing works
    • May 20, 2025
  • By the numbers: How upskilling fills the IT skills gap
    • May 21, 2025
  • 3
    Cloud adoption isn’t all it’s cut out to be as enterprises report growing dissatisfaction
    • May 15, 2025
  • 4
    Hybrid cloud is complicated – Red Hat’s new AI assistant wants to solve that
    • May 20, 2025
  • 5
    Google is getting serious on cloud sovereignty
    • May 22, 2025
  • /
  • Technology
  • Tools
  • About
  • Contact Us

Input your search keywords and press Enter.