Directly process unstructured data using BigQuery ML
Object tables contain metadata such as URI (Uniform Resource Identifier), content type, and size that can be queried just like other BigQuery tables. You can then derive inferences using machine learning models on unstructured data with BigQuery ML. As part of preview, you can import open source TensorFlow Hub image models, or your own custom models to annotate the images. Very soon, we plan to enable this for audio, video, text and many other formats, and pre-trained models to enable out-of-the box analysis. Check out this video to learn more and watch a demo.
# Create an object table CREATE EXTERNAL TABLE my_dataset.object_table WITH CONNECTION us.my_connection OPTIONS(uris=["gs://mybucket/images/*.jpg"], object_metadata="SIMPLE", metadata_cache_mode="AUTOMATIC"); # Generate inferences with BQML SELECT * FROM ML.PREDICT( MODEL my_dataset.vision_model, (SELECT ML.DECODE_IMAGE(data) AS img FROM my_dataset.object_table) );
- Eliminate manual effort as pre-processing steps such as tuning image sizes to model requirements are automated
- Leverage the simple and familiar SQL interface to quickly gain insights
- Save costs by utilizing existing BigQuery slots without needing to provision new forms of compute
Adswerve is a leading Google Marketing, Analytics and Cloud partner on a mission to humanize data. Twiddy & Co. is Adswerve’s client – a vacation rental company in North Carolina. By combining structured and unstructured data, Twiddy and Adswerve used BigQuery ML to analyze images of rental listings and predict the click-through rate, enabling data-driven photo editorial decisions.
“Twiddy now has the capability to use advanced image analysis to stay competitive in an ever changing landscape of vacation rental providers – and can do this using their in-house SQL skills.” said Pat Grady, Technology Evangelist, Adswerve
Process unstructured data using remote functions
Customers today use remote functions (UDFs) to process structured data for languages and libraries that are not supported in BigQuery. We are extending this capability to process unstructured data using object tables.
Object tables provide signed URLs to allow remote UDFs running on Cloud Functions or Cloud Run to process the object table content. This is particularly useful for running Google’s pre-trained AI models, including Vision AI, Speech-to-Text, Document AI, open source libraries such as Apache Tika, or deploying your own custom models where performance SLAs are important.
Here’s an example of an object table being created over PDF files that are parsed using an open source library running as a remote UDF.
SELECT uri, extract_title(samples.parse_tika(signed_url)) AS title FROM EXTERNAL_OBJECT_TRANSFORM(TABLE pdf_files_object_table, ["SIGNED_URL"]);
Extending more BigQuery capabilities to unstructured data
Business intelligence – The results of analyzing unstructured data either directly in BigQuery ML or via UDFs can be combined with your structured data to build unified reports using Looker Studio (at no charge), Looker or any of your preferred BI solutions. This allows you to gain more comprehensive business insights. For example, online retailers can analyze product return rates by correlating them with the images of defective products. Similarly, digital advertisers can correlate ad performance with various attributes of ad creatives to make more informed decisions.
BigQuery search index – Customers are increasingly using the search functionality of BigQuery to power search use cases. These capabilities now extend to unstructured data analytics as well. Whether you use BigQueryML to produce inference on images or use remote UDFs with Doc AI to produce document extraction, the results can now be search indexed and used to support search access patterns.
Here’s an example of search index on data that is parsed from PDF files:
CREATE SEARCH INDEX my_index ON pdf_text_extract(ALL COLUMNS); SELECT * FROM pdf_text_extract WHERE SEARCH(pdf_text, "Google");
CREATE ROW ACCESS POLICY pii_data ON object_table_images GRANT TO ("group:[email protected]") FILTER USING (ARRAY_LENGTH(metadata)=1 AND metadata[OFFSET(0)].name="face_detected")
Special thanks to engineering leaders Amir Hormati, Justin Levandoski and Yuri Volobuev for contributing to this post.
By: Gaurav Saxena (Sr. Product Manager, Google Cloud) and Thibaud Hottelier (Software Engineer, Google Cloud)
Source: Google Cloud Blog