Sentiment Analysis With BigQuery ML

  • aster.cloud
  • March 13, 2023
  • 5 minute read

Introduction

We recently announced BigQuery support for sparse features, which lets users store and process sparse data efficiently. This functionality enables users to represent sparse tensors and train machine learning models directly in the BigQuery environment. Being able to represent sparse tensors is useful because they are used extensively in encoding schemes like TF-IDF as part of data pre-processing in NLP applications, and for pre-processing images with a lot of dark pixels in computer vision applications.

There are numerous applications of sparse features, such as text generation and sentiment analysis. In this blog, we’ll demonstrate how to perform sentiment analysis with sparse features in BigQuery ML by training a machine learning model on a public dataset and running inference with it. This blog also highlights how easy it is to work with unstructured text data in BigQuery, an environment traditionally used for structured data.
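As a minimal sketch of the idea (our illustration, using plain GoogleSQL types): a mostly-zero dense vector can be stored as an array of (index, value) pairs that records only the non-zero entries.

-- The dense vector [0, 0, 3.0, 0, 1.0] stored sparsely as (index, value) pairs;
-- only the non-zero entries are kept
SELECT ARRAY<STRUCT<index INT64, value FLOAT64>>[(2, 3.0), (4, 1.0)] AS sparse_feature;

The feature we build for the reviews below has exactly this shape, with word indexes as positions and word frequencies as values.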


Using sample IMDb dataset

Let’s say you want to conduct a sentiment analysis on movie reviews from the IMDb website. For the benefit of readers who want to follow along, we will be using the IMDb reviews dataset from BigQuery public datasets. Let’s look at the top 2 rows of the dataset.
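If you want to follow along in the BigQuery console, a simple preview query like this will show them (the exact rows returned may vary):

-- Preview a couple of rows from the public IMDb reviews table
SELECT
 *
FROM
 `bigquery-public-data.imdb.reviews`
LIMIT 2;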

Although the reviews table has 7 columns, we only use the review and label columns to perform sentiment analysis in this case. Also, we only consider the Negative and Positive values in the label column. The following query selects just the required information from the dataset.

SELECT
 review,
 label
FROM
 `bigquery-public-data.imdb.reviews`
WHERE
 label IN ('Negative', 'Positive')

The top 2 rows of the result are as follows:

Methodology

Based on the dataset that we have, the following steps will be carried out:

  1. Build a vocabulary list using the review column
  2. Convert the review column into sparse tensors
  3. Train a classification model using the sparse tensors to predict the label (“positive” or “negative”)
  4. Make predictions on new test data to classify reviews as positive or negative

Feature engineering

In this section, we will convert the text from the reviews column to numerical features so that we can feed them into a machine learning model. One of the ways is the bag-of-words approach where we build a vocabulary using the words from the reviews and select the most common words to build numerical features for model training. But first, we must extract the words from each review. The following code creates a dataset and a table with row numbers and extracted words from reviews.

-- Create a dataset named `sparse_features_demo` if it doesn't exist
CREATE SCHEMA IF NOT EXISTS sparse_features_demo;

-- Select unique reviews with only negative and positive labels
CREATE OR REPLACE TABLE sparse_features_demo.processed_reviews AS (
 SELECT
   ROW_NUMBER() OVER () AS review_number,
   review,
   REGEXP_EXTRACT_ALL(LOWER(review), '[a-z]{2,}') AS words,
   label,
   split
 FROM (
   SELECT
     DISTINCT review,
     label,
     split
   FROM
     `bigquery-public-data.imdb.reviews`
   WHERE
     label IN ('Negative', 'Positive')
 )
);

The output table from the query above should look like this:
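You can inspect it yourself with a query along these lines (a quick spot-check, not part of the pipeline):

-- Spot-check the processed reviews: row number, extracted word count, label, and split
SELECT
 review_number,
 ARRAY_LENGTH(words) AS word_count,
 label,
 split
FROM
 sparse_features_demo.processed_reviews
LIMIT 5;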

The next step is to build a vocabulary using the extracted words. The following code creates a vocabulary from the reviews, recording each word's frequency and index. For this case, we are going to select only the top 20,000 words to reduce the computation time.

-- Create a vocabulary using train dataset and select only top 20,000 words based on frequency
CREATE OR REPLACE TABLE sparse_features_demo.vocabulary AS (
 SELECT
   word,
   word_frequency,
   word_index
 FROM (
   SELECT
     word,
     word_frequency,
     ROW_NUMBER() OVER (ORDER BY word_frequency DESC) - 1 AS word_index
   FROM (
     SELECT
       word,
       COUNT(word) AS word_frequency
     FROM
       sparse_features_demo.processed_reviews,
       UNNEST(words) AS word
     WHERE
       split = "train"
     GROUP BY
       word
   )
 )
 WHERE
   word_index < 20000 -- Select the top 20,000 words based on frequency
);

The following shows the top 10 words by frequency, and their respective indexes, from the resulting table of the query above.
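You can reproduce that view with a query like the following:

-- Show the 10 most frequent words and their indexes
SELECT
 word,
 word_frequency,
 word_index
FROM
 sparse_features_demo.vocabulary
ORDER BY
 word_frequency DESC
LIMIT 10;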

Creating a sparse feature

Now we will use the newly built vocabulary to create a sparse feature in BigQuery. For this case, we aggregate `word_index` and `word_frequency` within each review, which generates a column of `ARRAY<STRUCT<int, numerical>>` type. Each review is then represented as an array of (word_index, word_frequency) pairs.

-- Generate a sparse feature by aggregating word_index and word_frequency in each review.
CREATE OR REPLACE TABLE sparse_features_demo.sparse_feature AS (
 SELECT
   review_number,
   review,
   ARRAY_AGG(STRUCT(word_index, word_frequency)) AS feature,
   label,
   split
 FROM (
   SELECT
     DISTINCT review_number,
     review,
     word,
     label,
     split
   FROM
     sparse_features_demo.processed_reviews,
     UNNEST(words) AS word
   WHERE
     word IN (SELECT word FROM sparse_features_demo.vocabulary)
 ) AS word_list
 LEFT JOIN
   sparse_features_demo.vocabulary AS topk_words
   ON
     word_list.word = topk_words.word
 GROUP BY
   review_number,
   review,
   label,
   split
);

Once the query is executed, a sparse feature named `feature` will be created. The `feature` column is an ARRAY of STRUCTs, each made up of a `word_index` and a `word_frequency`. The picture below displays the resulting table at a glance.
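To take a similar glance yourself, you can inspect a few rows, for example:

-- Glance at the sparse feature: array length and the first (word_index, word_frequency) pair
SELECT
 review_number,
 ARRAY_LENGTH(feature) AS distinct_words,
 feature[SAFE_OFFSET(0)].word_index AS first_word_index,
 feature[SAFE_OFFSET(0)].word_frequency AS first_word_frequency
FROM
 sparse_features_demo.sparse_feature
LIMIT 5;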


Training a BigQuery ML model

We just created a dataset with a sparse feature in BigQuery. Let's see how we can use that dataset to train a machine learning model with BigQuery ML. In the following query, we train a logistic regression model to predict the label, using the TRANSFORM clause to exclude review_number and review so that only the sparse feature is used as model input:

-- Train a logistic regression classifier using the data with sparse feature
CREATE OR REPLACE MODEL sparse_features_demo.logistic_reg_classifier
 TRANSFORM (
   * EXCEPT (
       review_number,
       review
     )
 )
 OPTIONS(
   MODEL_TYPE='LOGISTIC_REG',
   INPUT_LABEL_COLS = ['label']
 ) AS
 SELECT
   review_number,
   review,
   feature,
   label
 FROM
    sparse_features_demo.sparse_feature
 WHERE
   split = "train"
;

Now that we have trained a BigQuery ML model using a sparse feature, let's evaluate it and tune as needed.

-- Evaluate the trained logistic regression classifier
SELECT * FROM ML.EVALUATE(MODEL sparse_features_demo.logistic_reg_classifier);

For a logistic regression model, ML.EVALUATE returns precision, recall, accuracy, f1_score, log_loss, and roc_auc. The scores look like a decent starting point, so let's go ahead and test the model with the test dataset.

-- Evaluate the trained logistic regression classifier using test data
SELECT * FROM ML.EVALUATE(MODEL sparse_features_demo.logistic_reg_classifier,
 (
   SELECT
     review_number,
     review,
     feature,
     label
   FROM
     sparse_features_demo.sparse_feature
   WHERE
     split = "test"
 )
);

The model performance on the test dataset looks satisfactory, and the model can now be used for inference. One thing to note here is that since the model was trained on numerical features, it will only accept numerical features as input. Hence, new reviews have to go through the same transformation steps before they can be used for inference. The next step shows how the transformation can be applied to a user-defined dataset.

Sentiment predictions from the BigQuery ML model

All that is left to do now is to create a user-defined dataset, apply the same transformations to the reviews, and use the resulting sparse features to perform model inference. This can be achieved using a WITH statement as shown below.

WITH
 -- Create user-defined reviews
 user_defined_reviews AS (
   SELECT
     ROW_NUMBER() OVER () AS review_number,
     review,
     REGEXP_EXTRACT_ALL(LOWER(review), '[a-z]{2,}') AS words
   FROM (
     SELECT "What a boring movie" AS review UNION ALL
     SELECT "I don't like this movie" AS review UNION ALL
     SELECT "The best movie ever" AS review
   )
 ),


 -- Create a sparse feature from user defined reviews
 user_defined_sparse_feature AS (
   SELECT
     review_number,
     review,
     ARRAY_AGG(STRUCT(word_index, word_frequency)) AS feature
   FROM (
     SELECT
       DISTINCT review_number,
       review,
       word
     FROM
       user_defined_reviews,
       UNNEST(words) as word
     WHERE
       word IN (SELECT word FROM sparse_features_demo.vocabulary)
   ) AS word_list
   LEFT JOIN
     sparse_features_demo.vocabulary AS topk_words
     ON
       word_list.word = topk_words.word
   GROUP BY
     review_number,
     review
 )


-- Evaluate the trained model using user defined data
SELECT review, predicted_label FROM ML.PREDICT(MODEL sparse_features_demo.logistic_reg_classifier,
 (
   SELECT
     *
   FROM
     user_defined_sparse_feature
 )
);

Executing the query above returns a predicted_label of 'Positive' or 'Negative' for each of the three reviews.
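For logistic regression models, ML.PREDICT also returns a predicted_label_probs array. If you want a confidence score alongside each label, the final SELECT of the statement above can be swapped for a sketch like this:

-- Return each review's predicted label together with its probability
SELECT
 review,
 predicted_label,
 (SELECT prob FROM UNNEST(predicted_label_probs) WHERE label = predicted_label) AS prob
FROM ML.PREDICT(MODEL sparse_features_demo.logistic_reg_classifier,
 (
   SELECT
     *
   FROM
     user_defined_sparse_feature
 )
);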


And that’s it! We just performed sentiment analysis on the IMDb dataset from the BigQuery public datasets using only SQL statements and BigQuery ML. Now that we have demonstrated how sparse features can be used with BigQuery ML models, we can’t wait to see all the amazing projects you will create by harnessing this functionality.

If you’re just getting started with BigQuery, check out our interactive tutorial to begin exploring.

By: Thu Ya Kyaw (Developer Advocate, Google Cloud) and Xiaoqiu Huang (Software Engineer, Google Cloud)
Originally published at Google Cloud Blog
