Over one third of organizations believe that data analytics and machine learning have the most potential to significantly alter the way they run business over the next 3 to 5 years. However, only 26% of organizations are data driven. One of the biggest reasons for this gap is that a major portion of the data generated today is unstructured, which includes images, documents, and videos. It is estimated to cover roughly up to 80% of all data, which has so far remained untapped by organizations.One of the goals of Google’s data cloud is to help customers realize value from data of all types and formats. Earlier this year, we announced BigLake, which unifies data lakes and warehouses under a single management framework, enabling you to analyze, search, secure, govern and share unstructured data using BigQuery.At Next ‘22, we announced the preview of object tables, a new table type in BigQuery that provides a structured record interface for unstructured data stored in Google Cloud Storage. This enables you to directly run analytics and machine learning on images, audio, documents and other file types using existing frameworks like SQL and remote functions natively in BigQuery itself. Object tables also extend our best practices of securing, sharing and governing structured data to unstructured, without needing to learn or deploy new tools.

Directly process unstructured data using BigQuery ML

Object tables contain metadata such as URI (Uniform Resource Identifier), content type, and size that can be queried just like other BigQuery tables. You can then derive inferences using machine learning models on unstructured data with BigQuery ML. As part of preview, you can import open source TensorFlow Hub image models, or your own custom models to annotate the images. Very soon, we plan to enable this for audio, video, text and many other formats, and pre-trained models to enable out-of-the box analysis. Check out this video to learn more and watch a demo.

# Create an object table
CREATE EXTERNAL TABLE my_dataset.object_table
WITH CONNECTION us.my_connection 
OPTIONS(uris=["gs://mybucket/images/*.jpg"],
        object_metadata="SIMPLE", metadata_cache_mode="AUTOMATIC");

     # Generate inferences with BQML
SELECT * FROM ML.PREDICT(
    MODEL my_dataset.vision_model, 
    (SELECT ML.DECODE_IMAGE(data) AS img FROM my_dataset.object_table)
);

By analyzing unstructured data natively in BigQuery, businesses can

  • Eliminate manual effort as pre-processing steps such as tuning image sizes to model requirements are automated
  • Leverage the simple and familiar SQL interface to quickly gain insights
  • Save costs by utilizing existing BigQuery slots without needing to provision new forms of compute

Adswerve is a leading Google Marketing, Analytics and Cloud partner on a mission to humanize data. Twiddy & Co. is Adswerve’s client – a vacation rental company in North Carolina. By combining structured and unstructured data, Twiddy and Adswerve used BigQuery ML to analyze images of rental listings and predict the click-through rate, enabling data-driven photo editorial decisions.

“Twiddy now has the capability to use advanced image analysis to stay competitive in an ever changing landscape of vacation rental providers – and can do this using their in-house SQL skills.” said Pat Grady, Technology Evangelist, Adswerve

Process unstructured data using remote functions

Customers today use remote functions (UDFs) to process structured data for languages and libraries that are not supported in BigQuery. We are extending this capability to process unstructured data using object tables.

Object tables provide signed URLs to allow remote UDFs running on Cloud Functions or Cloud Run to process the object table content. This is particularly useful for running Google’s pre-trained AI models, including Vision AI, Speech-to-Text, Document AI, open source libraries such as Apache Tika, or deploying your own custom models where performance SLAs are important.

Here’s an example of an object table being created over PDF files that are parsed using an open source library running as a remote UDF.

SELECT uri, extract_title(samples.parse_tika(signed_url)) AS title
FROM EXTERNAL_OBJECT_TRANSFORM(TABLE pdf_files_object_table, 
                               ["SIGNED_URL"]);

Extending more BigQuery capabilities to unstructured data

Business intelligence – The results of analyzing unstructured data either directly in BigQuery ML or via UDFs can be combined with your structured data to build unified reports using Looker Studio (at no charge), Looker or any of your preferred BI solutions. This allows you to gain more comprehensive business insights. For example, online retailers can analyze product return rates by correlating them with the images of defective products. Similarly, digital advertisers can correlate ad performance with various attributes of ad creatives to make more informed decisions.

BigQuery search index – Customers are increasingly using the search functionality of BigQuery to power search use cases. These capabilities now extend to unstructured data analytics as well. Whether you use BigQueryML to produce inference on images or use remote UDFs with Doc AI to produce document extraction, the results can now be search indexed and used to support search access patterns.

Here’s an example of search index on data that is parsed from PDF files:

CREATE SEARCH INDEX my_index ON pdf_text_extract(ALL COLUMNS);

SELECT * FROM pdf_text_extract WHERE SEARCH(pdf_text, "Google");

Security and governance – We are extending BigQuery’s row-level security capabilities to help you secure objects in Google Cloud Storage. By securing specific rows in an object table, you can restrict the ability of end users to retrieve the signed URLs of corresponding URIs present in the table. This is a shared responsibility security model, for which administrators need to ensure that end users don’t have direct access to Google Cloud Storage, and use signed URLs from object tables as the only access mechanism.Here’s an example of a policy for PII images that are secured to be first processed through a blur pipeline:
CREATE ROW ACCESS POLICY pii_data ON object_table_images
GRANT TO ("group:[email protected]") 
FILTER USING (ARRAY_LENGTH(metadata)=1 AND   
              metadata[OFFSET(0)].name="face_detected")

Soon, Dataplex will support object tables, allowing you to automatically create object tables in BigQuery and manage and govern unstructured data at scale.Data sharing – You can now use Analytics Hub to share unstructured data with partners, customers and suppliers while not compromising on security and governance. Subscribers can consume the rows of object tables that are shared with them, and use signed URLs for unstructured data objects.

Getting Started

Submit this form to try these new capabilities that unlock the power of your unstructured data in BigQuery. Watch this demo to learn more about these new capabilities.


Special thanks to engineering leaders Amir Hormati, Justin Levandoski and Yuri Volobuev for contributing to this post.

 

 

By: Gaurav Saxena (Sr. Product Manager, Google Cloud) and Thibaud Hottelier (Software Engineer, Google Cloud)
Source: Google Cloud Blog

Previous Meet Google Cloud At Supercomputing 2022
Next The New Google Cloud Region In Israel Is Now Open