As businesses scale to meet the demands of their customers, so do their need for efficient products to collect, manage and analyze data to meet their business goals. Whether you are building a multi-player game or a global e-commerce platform, it’s critical to ensure that data can be stored and queried at scale with strong consistency and then processed for analysis to deliver real-time insights.
In this blog we’ll discuss how Cloud Spanner and BigQuery are a match made in heaven, and can be used together to process transactions at scale and generate real-time insights to deliver amazing customer experiences. Using Cloud Spanner and BigQuery also allows customers to build their data clouds using Google Cloud, a unified, open approach to data-driven transformation that is unmatched for speed, scale, and security.
Industry leading speed, scalability and reliability
Spanner is Google Cloud’s fully managed relational database optimized for transactional workloads.
Spanner provides seamless replication across regions in Google Cloud. It also has strong external consistency, unlimited scale and processes over 1 billion requests per second at peak. Customers can start Spanner with 100 Processing Units at $65/month and will not have to worry about the scalability of their database even if their business grows 100x.
BigQuery is a serverless, highly scalable, and cost-effective multi cloud data warehouse designed for business agility.
It is optimized to remove the limits for ad-hoc analysis and reporting, which makes it best suited for gaining organizational insights. BigQuery customers analyze over 110 terabytes (TB) of data per second. BigQuery and Spanner are powerful tools independently, but seamlessly work together to execute transactional and analytical workloads and handle high throughput needs.
Built on proven infrastructure
Spanner and BigQuery can scale independently from each other in both compute and storage resources as workload demands change. Historically databases have been architected with tightly coupled storage and compute, but Spanner and BigQuery are architected with separate compute and storage. Both products are built on Colossus, Google’s distributed storage system, that powers Google’s most popular, globally available products, like YouTube, Search, Maps, and Gmail.
Colossus provides high durability, availability, performance and scalability globally for the services built on it. Users never have to worry about running out of capacity during peak seasons. For example, year end financial planning and forecasting exercises that run many complex queries against large datasets or the massive retail transaction volumes during Cyber Monday can all be handled by both services.
Spanner and BigQuery can scale up or down independently or together as needed by users. Both products are also built on Borg, Google’s internal cluster management system and Jupiter, in-house custom network hardware and software that connects all servers in Google’s data centers.
Real time analytics made easy
Users have long performed analytics against transactional data by extracting required data from OLTP databases into analytics (OLAP) systems, with homegrown scripts or external ETL/ELT tools. This worked well for many years, but as workloads grew from tens to millions of transactions per second and analytics expanded from weekly and daily static reports to include ad hoc queries against the latest transactional data, entire teams were dedicated to maintain the integration of these systems. Spanner and BigQuery were architected to unify the data lifecycle and quickly enable analytics in BigQuery using the latest transactional data from Cloud Spanner.
Depending on the use case, users have the option of ingesting data into BigQuery as physical tables first or querying the data in Spanner directly when needed. In both cases, users can leverage query federation for a simple setup. Users only need to configure an external data source in BigQuery pointing to the intended Spanner instance and write the appropriate queries. These queries may be used to populate a BigQuery table on demand or scheduled to run as needed. Alternatively it can be joined with another BigQuery result set dynamically. There is no additional maintenance or dependency on other components for BigQuery users to be able to consume the latest and greatest OLTP data from Spanner. In cases where there may be more complex transformations or external dependencies, users may also ingest Spanner data into BigQuery using a service like DataFlow.
Sample BigQuery and Spanner usage scenario
To get a better picture of how BigQuery and Spanner are used together, look at a sample scenario of an online gaming startup company Cymbal with just over a million players around the world.
The company is fictional but the scenario is real. Cymbal stores game data in Spanner including user profile, inventory items, customizations, actions and other gaming activities. This information is physically located in Colossus in Spanner’s file format designed for efficiency and scalability. In addition to high transaction rates to support game play data, Cymbal also performs analytics on player behavior including how they interact with other players, equipment, even non-player characters. Cymbal analysts routinely need to combine competition data already in BigQuery with data in Spanner. Cymbal has two ways to accomplish this. They can replicate data from Spanner into BigQuery and perform analytics against local data or they can use federated queries to retrieve data from Spanner on-demand.
As mentioned earlier, users can leverage federated query to read data from Spanner and write to a native BigQuery table. This is most commonly used when data replication for analytics is needed infrequently and the data can be minutes to days old. This works well for some of Cymbal’s analysts and we will dive deeper into federated queries in the next section including step-by-step details on setting it up. Other Cymbal analysts need to have data refreshed daily from Spanner and do not want to manage this themselves. In this case, Cymbal admins can configure a recurring job to replicate data from Spanner to BigQuery using federated queries, a service like Google DataFlow or one of the many solutions by Google Cloud partners.
BigQuery users can easily run federated queries against data residing in Spanner on-demand. The data can be consumed in real-time as part of a bigger analytics query. The on-demand nature of federated queries provides users flexibility and control over how fresh they need the data to be. No need to wait for the next scheduled execution of a data movement (ETL/ELT) job to complete.
Although federated queries are powerful there are some limitations. A federated query might not be as fast as querying local BigQuery tables. There may be higher latency because of the small wait time for the source database to execute the external query and then move the resultset from Spanner to BigQuery. Federated queries really are just another workload on the Spanner instance that consumes compute resources so users should make sure it does not negatively impact the existing OLTP workload..
How to run federated queries
The steps to run federate queries are simple!
1) Start by launching BigQuery and choosing the Google Cloud project that contains the Spanner instance that includes the database you need.
2) Then set up an external data source for the Spanner database in BigQuery. You’ll need to have bigquery.admin permissions to set this up.
Note: The “Read data in parallel” checkbox will enable Spanner to divide a federated query into smaller partitions for parallel execution. This option is restricted to queries whose first operator in the execution plan is a distributed union operator. Other queries will return an error. To learn more, see Understand how Cloud Spanner executes queries.
3) Finally write a query in BigQuery that accesses the data in the Spanner data source. If you’d like other users to access this external data source in BigQuery, simply grant them permission to use the connection resource you just created.
When to use which?
If your use case has a clear scope of the Spanner data to be used in BigQuery, fixed data freshness requirements (e.g. 15 minutes or 1 day), have a large number of users of the same dataset or only want to share a specific dataset (not the entire database), replicating data to local BigQuery tables might be the right approach. This provides security and resource isolation of your BigQuery analytic workload from your Spanner OLTP workload. If your use case involves mostly ad hoc queries that are unpredictable in the data to be retrieved, when the queries might run and can allow permitted BigQuery users access to all Spanner data, then federated queries might be the better option. To learn more about the topic in this blog post, please visit the links below.
- To get started with Spanner, create an instance or try it out with a Spanner Qwiklab.
- To learn more about Spanner federated queries with BigQuery, visit this tutorial.
- To learn more about how to replicate data from Spanner To Big Query, visit this blog post.
- To learn more about BigQuery, make sure to check out these training and tutorials
By: Bukola Ayodele (Developer Advocate)
Source: Google Cloud Blog