aster.cloud aster.cloud
  • /
  • Platforms
    • Public Cloud
    • On-Premise
    • Hybrid Cloud
    • Data
  • Architecture
    • Design
    • Solutions
    • Enterprise
  • Engineering
    • Automation
    • Software Engineering
    • Project Management
    • DevOps
  • Programming
    • Learning
  • Tools
  • About
  • /
  • Platforms
    • Public Cloud
    • On-Premise
    • Hybrid Cloud
    • Data
  • Architecture
    • Design
    • Solutions
    • Enterprise
  • Engineering
    • Automation
    • Software Engineering
    • Project Management
    • DevOps
  • Programming
    • Learning
  • Tools
  • About
aster.cloud aster.cloud
  • /
  • Platforms
    • Public Cloud
    • On-Premise
    • Hybrid Cloud
    • Data
  • Architecture
    • Design
    • Solutions
    • Enterprise
  • Engineering
    • Automation
    • Software Engineering
    • Project Management
    • DevOps
  • Programming
    • Learning
  • Tools
  • About
  • Engineering
  • Practices

Spatial Clustering On BigQuery – Best Practices

  • aster.cloud
  • September 17, 2022
  • 6 minute read

Most data analysts are familiar with the concept of organizing data into clusters so that it can be queried faster and at a lower cost. The user behavior dictates how the dataset should be clustered: for example, when a user seeks to analyze or visualize geospatial data (a.k.a location data), it is most efficient to cluster on a geospatial column. This practice is known as spatial clustering, and in this blog, we will share best practices for implementing it in BigQuery (hint — let BigQuery do it for you).

BigQuery is a petabyte-scale data warehouse that has many geospatial capabilities and functions. In the following sections, we will describe how BigQuery does spatial clustering out of the box using the S2 indexing system. We will also touch on how to use other spatial indexes like H3 and geohash, and compare the cost savings of different approaches.


Partner with aster.cloud
for your next big idea.
Let us know here.



From our partners:

CITI.IO :: Business. Institutions. Society. Global Political Economy.
CYBERPOGO.COM :: For the Arts, Sciences, and Technology.
DADAHACKS.COM :: Parenting For The Rest Of Us.
ZEDISTA.COM :: Entertainment. Sports. Culture. Escape.
TAKUMAKU.COM :: For The Hearth And Home.
ASTER.CLOUD :: From The Cloud And Beyond.
LIWAIWAI.COM :: Intelligence, Inside and Outside.
GLOBALCLOUDPLATFORMS.COM :: For The World's Computing Needs.
FIREGULAMAN.COM :: For The Fire In The Belly Of The Coder.
ASTERCASTER.COM :: Supra Astra. Beyond The Stars.
BARTDAY.COM :: Prosperity For Everyone.

How BigQuery does spatial clustering under the hood

Clustering ensures that blocks of data with similar values are colocated in storage, which means that the data is easier to retrieve at query time. It also sorts the blocks of data, so that only the necessary blocks need to be scanned, which reduces cost and processing time. In geospatial terms, this means that when you’re querying a particular region, only the rows within or close to that region are scanned, rather than the whole globe.

All of the optimizations described above will occur automatically in BigQuery if you cluster your tables on a GEOGRAPHY column. It’s as easy as typing CLUSTER BY [GEOGRAPHY] when creating the table. Only predicate functions (e.g. ST_Intersects, ST_DWithin) leverage clustering, with the exception of ST_DISJOINT. It should also be noted that while BigQuery supports partitioning and clustering on a variety of fields, only clustering is supported on a geospatial field. This is because geometries can be large and could span across partitions, no matter how BigQuery chooses to partition the space. Finally, cluster sizes will range from 100MB to 1GB, so clustering on a table smaller than 100MB will provide no benefit.

When writing to a table that is clustered by GEOGRAPHY, BigQuery shards the data into spatially-compact blocks. For each block, BigQuery computes a bit of metadata called an S2 covering that includes the spatial area of the data contained within. When querying a geography-clustered table using spatial predicates, BigQuery reads the covering and evaluates whether a particular covering can satisfy the filter. BigQuery then prunes the blocks that cannot satisfy the filter. Users are only charged for data from remaining blocks. Note that S2 coverings can overlap, as it is often impossible to divide data into non-overlapping regions.

Read More  5 Google Cloud Next ’22 Sessions On Cloud FinOps Cost Optimization

Fundamentally, BigQuery is using the S2 index to map a geometry into a 64-bit integer, then BigQuery clusters on that integer using existing integer-based clustering mechanisms. In the past, customers have manually implemented an S2 indexing system in BigQuery. This was done prior to BigQuery’s native support of spatial clustering via S2. Using BigQuery’s native clustering resulted in a large performance increase, not to mention the added simplicity of not having to manage your own S2 indexes.

Alternative Spatial Indexes

Spatial clustering utilizes a spatial indexing system, or hierarchy, to organize the stored data. The purpose of all spatial indices is to represent this globe we call Earth in numerical terms, allowing us to define a location as a geometric object like a point, polygon or line. There are dozens of spatial indexes, and most databases implement them in their own unique way. Although BigQuery natively uses S2 cells for clustering, other indexes can be manually implemented, such as H3, geohash, or quadkeys. The examples below will involve the following spatial indexes:

  • S2:  The S2 system represents geospatial data as cells on a three dimensional sphere. It is used by Google Maps.
    • uses quadrilaterals, which are more efficient than hexagons
    • Higher precision than H3 or geohashing
  • H3:  The H3 system represents geospatial data on overlapping hexagonal grids.
    • Hexagons are more visually appealing
    • Convolutions and smoothing algorithms are more efficient than S2
  • Geohash – Geohash is a public domain system that represents geospatial data on a curved grid.
    • Length of the Geohash id determines the spatial precision
    • Fairly poor spatial locality, so clustering does not work as well

Spatial clustering in BQ — S2 vs. Geohash

In most cases for analysis, BigQuery’s built-in spatial clustering will give the best performance with the least effort. But if the data is queried according to other attributes, e.g. by geohash box, a custom indexing is necessary. The method of querying the spatial indexes has implications on the performance, as is illustrated in the example below.

Read More  New Anthos Training: A Masterclass In Hybrid Cloud Architecture And Management

Example

First, you will create a table with random points in longitude and latitude. Use the BigQuery function st_geohash to generate a geohash id for each point.

 

drop table if exists tmp.points;

create or replace table tmp.tenkrows as
select x from unnest(generate_array(1, 10000)) x;

create or replace table tmp.points
cluster by point
as
with pts as (
 select st_geogpoint(rand() * 360 - 180, rand() * 180 - 90) point
 from tmp.tenkrows a, tmp.tenkrows b
)
select st_geohash(point) gh, pts.point
from pts

 

Use the st_geogpoint function to transform the latitude and longitude into a GEOGRAPHY, BigQuery’s native geospatial type, which uses S2 cells as the index. Select a collection of around 3,000 points. This should cost around 25MB. If you run the same query on an unclustered table, it would cost 5.77GB (the full table size).

 

select * from  tmp.points
where st_dwithin(st_geogpoint(1, 2), point, 10000)

 

Now you will query by geohash id. BigQuery’s ability to leverage the spatial clustering will depend on whether the BQ SAT solver can prove the cluster of data can be pruned. The queries below are both leveraging the geospatial clustering, costing only 340 MB. Note that if we had clustered the table by the ‘gh’ field (ie geohash id), these queries would cost the same as the one above, around 25MB.

 

select * from  tmp.points
where starts_with(gh, 'bbb')

select * from  tmp.points
where gh between 'bbb' and 'bbb~'

 

The query below is much less efficient, costing 5.77GB, a full scan of the table. BigQuery cannot prove this condition fails based on the min/max values of the cluster so it must scan the entire table.

 

select * from  tmp.points
where left(gh, 3) = 'bbb'

 

As the examples show, the least costly querying option is to use the indexing consistent with the query method — native S2 indexing when querying by geography, string indexing when querying by geohash. When using geohashing, avoid left() or right() functions, as it will cause BigQuery to scan the entire table.

Spatial clustering in BQ with H3

One may also find themselves in a situation where they need to use H3 as a spatial index in BigQuery. It is still possible to leverage the performance benefits of clustering, but as with geohashing, it is important to avoid certain patterns.

Read More  How Pub/Sub Eliminates Boring Meetings And Makes Your Systems Scale

Suppose you have a huge table of geography points indexed by H3 cell ID at level 15, which you’ve clustered by H3_index (note: these functions are supported through the Carto Spatial Extension for BigQuery). You want to find all the points that belong to lower resolution cells, e.g. at level 7. You might write a query like this:

 

select * from  points
 where H3_ToParent(h3_index, 7) = @parent_cell_id

 

Where H3_ToParent is a custom function that computes the parent cell ID from a higher resolution index. Since you’ve clustered by the H3 index, you might expect a lower cost, however this query will scan the entire table. This happens because H3_ToParent involves bit operations, and is too complex for the BigQuery query analyser to understand how the query’s result is related to cluster boundaries. What you should do instead is give BigQuery the range of the H3 cell IDs at the level that the geographies are indexed, like the following example:

 

select * from  points
 where h3_index BETWEEN H3_CellRangeStart(@parent_cell_id, 15)
                    AND H3_CellRangeEnd(@parent_cell_id, 15)

 

Where H3_CellRangeStart and H3_CellRangeEnd are custom functions that map the lower-resolution parent ID to the appropriate start and end IDs of the higher resolution cells. Now BigQuery will be able to figure out relevant clusters, reducing the cost and improving the performance of the query.

What’s Next?

Spatial clustering is a complex topic that requires specialized knowledge to implement. Using BiqQuery’s native spatial clustering will take most of the work out of your hands. With your geospatial data in BigQuery, you can do amazing spatial analyses like querying the stars, even on large datasets. You can also use BigQuery as a backend for a geospatial application, such as an application that allows customers to explore the climate risk of their assets. Using spatial clustering, and querying your clusters correctly will ensure you get the best performance at the lowest cost.


Acknowledgments: Thanks to Eric Engle and Travis Webb for their help with this post.

 

 

By: Remy Welch (Customer Engineer) and Michael Entin (Software Engineer)
Source: Google Cloud Blog


For enquiries, product placements, sponsorships, and collaborations, connect with us at [email protected]. We'd love to hear from you!

Our humans need coffee too! Your support is highly appreciated, thank you!

aster.cloud

Related Topics
  • Best Practice
  • BigQuery;
  • Data Analytics
  • Design
  • Google Cloud
  • Indexes
  • Query Design
  • Spatial Clustering
You May Also Like
View Post
  • Engineering
  • Technology

Guide: Our top four AI Hypercomputer use cases, reference architectures and tutorials

  • March 9, 2025
View Post
  • Computing
  • Engineering

Why a decades old architecture decision is impeding the power of AI computing

  • February 19, 2025
View Post
  • Engineering
  • Software Engineering

This Month in Julia World

  • January 17, 2025
View Post
  • Engineering
  • Software Engineering

Google Summer of Code 2025 is here!

  • January 17, 2025
View Post
  • Data
  • Engineering

Hiding in Plain Site: Attackers Sneaking Malware into Images on Websites

  • January 16, 2025
View Post
  • Computing
  • Design
  • Engineering
  • Technology

Here’s why it’s important to build long-term cryptographic resilience

  • December 24, 2024
IBM and Ferrari Premium Partner
View Post
  • Data
  • Engineering

IBM Selected as Official Fan Engagement and Data Analytics Partner for Scuderia Ferrari HP

  • November 7, 2024
View Post
  • Engineering

Transforming the Developer Experience for Every Engineering Role

  • July 14, 2024

Stay Connected!
LATEST
  • college-of-cardinals-2025 1
    The Definitive Who’s Who of the 2025 Papal Conclave
    • May 7, 2025
  • conclave-poster-black-smoke 2
    The World Is Revalidating Itself
    • May 6, 2025
  • 3
    Conclave: How A New Pope Is Chosen
    • April 25, 2025
  • Getting things done makes her feel amazing 4
    Nurturing Minds in the Digital Revolution
    • April 25, 2025
  • 5
    AI is automating our jobs – but values need to change if we are to be liberated by it
    • April 17, 2025
  • 6
    Canonical Releases Ubuntu 25.04 Plucky Puffin
    • April 17, 2025
  • 7
    United States Army Enterprise Cloud Management Agency Expands its Oracle Defense Cloud Services
    • April 15, 2025
  • 8
    Tokyo Electron and IBM Renew Collaboration for Advanced Semiconductor Technology
    • April 2, 2025
  • 9
    IBM Accelerates Momentum in the as a Service Space with Growing Portfolio of Tools Simplifying Infrastructure Management
    • March 27, 2025
  • 10
    Tariffs, Trump, and Other Things That Start With T – They’re Not The Problem, It’s How We Use Them
    • March 25, 2025
about
Hello World!

We are aster.cloud. We’re created by programmers for programmers.

Our site aims to provide guides, programming tips, reviews, and interesting materials for tech people and those who want to learn in general.

We would like to hear from you.

If you have any feedback, enquiries, or sponsorship request, kindly reach out to us at:

[email protected]
Most Popular
  • 1
    IBM contributes key open-source projects to Linux Foundation to advance AI community participation
    • March 22, 2025
  • 2
    Co-op mode: New partners driving the future of gaming with AI
    • March 22, 2025
  • 3
    Mitsubishi Motors Canada Launches AI-Powered “Intelligent Companion” to Transform the 2025 Outlander Buying Experience
    • March 10, 2025
  • PiPiPi 4
    The Unexpected Pi-Fect Deals This March 14
    • March 13, 2025
  • Nintendo Switch Deals on Amazon 5
    10 Physical Nintendo Switch Game Deals on MAR10 Day!
    • March 9, 2025
  • /
  • Technology
  • Tools
  • About
  • Contact Us

Input your search keywords and press Enter.