aster.cloud aster.cloud
  • /
  • Platforms
    • Public Cloud
    • On-Premise
    • Hybrid Cloud
    • Data
  • Architecture
    • Design
    • Solutions
    • Enterprise
  • Engineering
    • Automation
    • Software Engineering
    • Project Management
    • DevOps
  • Programming
    • Learning
  • Tools
  • About
  • /
  • Platforms
    • Public Cloud
    • On-Premise
    • Hybrid Cloud
    • Data
  • Architecture
    • Design
    • Solutions
    • Enterprise
  • Engineering
    • Automation
    • Software Engineering
    • Project Management
    • DevOps
  • Programming
    • Learning
  • Tools
  • About
aster.cloud aster.cloud
  • /
  • Platforms
    • Public Cloud
    • On-Premise
    • Hybrid Cloud
    • Data
  • Architecture
    • Design
    • Solutions
    • Enterprise
  • Engineering
    • Automation
    • Software Engineering
    • Project Management
    • DevOps
  • Programming
    • Learning
  • Tools
  • About
  • Data
  • Design
  • Engineering
  • Solutions
  • Technology

Building Trust In The Data With Dataplex

  • aster.cloud
  • October 19, 2022
  • 5 minute read

Analytics data is growing exponentially and so is the dependence on the data in making critical business and product decisions.  In fact, the best decisions are said to be the ones which are backed by data. In data, we trust!

But do we trust the data ?


Partner with aster.cloud
for your next big idea.
Let us know here.



From our partners:

CITI.IO :: Business. Institutions. Society. Global Political Economy.
CYBERPOGO.COM :: For the Arts, Sciences, and Technology.
DADAHACKS.COM :: Parenting For The Rest Of Us.
ZEDISTA.COM :: Entertainment. Sports. Culture. Escape.
TAKUMAKU.COM :: For The Hearth And Home.
ASTER.CLOUD :: From The Cloud And Beyond.
LIWAIWAI.COM :: Intelligence, Inside and Outside.
GLOBALCLOUDPLATFORMS.COM :: For The World's Computing Needs.
FIREGULAMAN.COM :: For The Fire In The Belly Of The Coder.
ASTERCASTER.COM :: Supra Astra. Beyond The Stars.
BARTDAY.COM :: Prosperity For Everyone.

As the data volumes have grown – one of the key challenges organizations are facing is how to maintain the data quality in a scalable and consistent way across the organization.  While data quality is not a newly found need,  the needs used to be  contained when the data footprint was small and data consumers were few. In such a world, data consumers knew who the producers were and producers knew what the consumers needed. But today, data ownership is getting distributed and data consumption is finding new users and use cases.  So the existing data quality approaches find themselves limited and are isolated to certain pockets of the organization.  This often exposes data consumers to inconsistent and inaccurate data which ultimately impacts the decisions made from that data.  As a result, organizations today are losing 10s of millions of dollars due to the low quality of data.

These organizations are looking for solutions that empower their data producers to consistently create high quality data cloud scale.

Building Trust with Dataplex data quality

Earlier this year,  at Google Cloud,  we launched Dataplex, an intelligent data fabric that enables governance and data management across distributed data at scale.  One of the key things Dataplex enables out-of-box is for data producers to build trust in the data with a built-in data quality.

Dataplex data quality task delivers a declarative, data-ops centric experience for validating data across BigQuery and Google Cloud Storage.  Producers can now easily build and publish quality reports or can easily include data validations as part of their data production pipeline.  Reports can be aggregated across various data quality dimensions and the execution is entirely serverless.

Read More  Connecting Apigee To GKE Using Headless Services And Cloud DNS

 

Dataplex data quality task provides  –

  1. A declarative approach for defining “what good looks like” that can be managed as part of a CI/CD workflow.
  2. A serverless and managed execution with no infrastructure to provision.
  3. Ability to validate across data quality dimensions like  freshness, completeness, accuracy and validity.
  4. Flexibility in execution – either by using Dataplex serverless scheduler (at no extra cost) or executing the data validations as part of a pipeline (e.g. Apache Airflow).
  5. Incremental execution – so you save time and money by validating new data only.
  6. Secure and performant execution with zero data-copy from BigQuery environments and projects.
  7. Programmatic consumption of quality metrics for Dataops workflows.

Users can also execute these checks on data that is stored in BigQuery and Google Cloud Storage but is not yet organized with Dataplex.  For Google Cloud Storage data that is managed by Dataplex, Dataplex auto-detects and auto-creates tables for structured and semi-structured data. These tables can be referenced with the Dataplex data quality task as well.

Behind the scenes – Dataplex makes use of an open source data quality engine – Cloud Data Quality Engine – to run these checks. Providing an open platform is one of our key goals and we have made contributions to this engine to integrate seamlessly with Dataplex’s  metadata and serverless environment.

You can learn more about this in our product documentation.

Building enterprise trust at American Eagle Outfitters

One of our enterprise customers  – American Eagle Outfitters (AEO) – is continuing to build trust in their critical data using Dataplex Data Quality Task.  Kanhu Badtia, lead data engineer from AEO, shares their rationale and experience with Dataplex data quality task:

“AEO is a leading global specialty retailer offering high-quality & on-trend clothing under its American Eagle® and Aerie® brands. Our company operates stores in the United States, Canada, Mexico, and Hong Kong, and ships to 81 countries worldwide through its websites.

Read More  GraphCast: AI Model For Faster And More Accurate Global Weather Forecasting

We are a data-driven organization that utilizes data from physical and digital store fronts, from social media channels, from logistics/delivery partners and many other sources through established compliant processes. We have a team of data scientists and analysts who create models, reports and dashboards that inform responsible business decision-making on such matters as inventory, promotions, new product launches and other internal business reviews. As the data engineering team at AEO, our goal is to provide highly trusted data for our internal data consumers.

Before Dataplex – AEO had methods for maintaining data quality that were effective for their purpose. However, those methods were not scalable with the continual expansion of data volume and demand for quality results from our data consumers.  Internal data consumers identified and reported quality issues where ‘bad data’ was impacting business critical dashboards/reports . As a result, our teams were often in “fire-fighting” mode – finding & fixing bad data. We were looking for a solution that would standardize and scale data quality across the production data pipelines.

The majority of AEO’s business data is in Google’s BigQuery or in Google Cloud Storage (GCS). When Dataplex launched the data quality capabilities, we immediately started a proof-of-concept. After a careful evaluation, we decided to use it as the central data quality framework for production pipelines. We liked that –

 

  1. It provides an easy declarative (YAML) & flexible way of defining data quality. We were able to parameterize it to use across multiple tables.
  2. It allows validating data in any BigQuery table with a completely serverless and native execution using existing slot reservations.
  3. It allows executing these checks as part of the ETL pipelines using DataPlex Airflow Operators. This is a huge win as pipelines can now pause further processing if critical rules do not pass.
  4. Data quality checks are executed in parallel which gives us the required execution efficiency in pipelines.
  5. Data quality results are stored centrally in BigQuery & can be queried to identify which rules failed/succeeded and how many rows failed. This enables defining custom thresholds for success.
  6. Organizing data in Dataplex Lakes is optional when using Dataplex data quality.
Read More  Built With BigQuery: How Oden Provides Actionable Recommendations With Network Resiliency To Optimize Manufacturing Processes

 

Our team truly believes that data quality is an integral part of any data-driven organization and Dataplex DQ capabilities align perfectly with that fundamental principle.

For example, here is a sample Google Cloud Composer / Airflow DAG  that loads & validates the “item_master” table and stops downstream processing if the validation fails.

 

It includes simple rules for uniqueness, completeness and more complex rules for referential integrity or business rules such as checking daily price variance. We publish all data quality results centrally to a BigQuery table, such as this:

Sample data quality output table

 

We query this output table for data quality issues & fail the pipeline in case of critical rule failure. This stops low quality data from flowing downstream.

We now have a repeatable process for data validation that can be used across the key data production pipelines. It standardizes the data production process and effectively ensures that bad data doesn’t break downstream reports and analytics.”

Learn more

Here at Google – we are excited to enable our customer’s journey to high quality, trusted data.  To learn more about our current data quality capabilities please refer to –

  • Dataplex Data Quality Overview
  • Sample Airflow DAG with Dataplex Data Quality task

 

By: Sandeep Karmarkar (Product Manager, Google Cloud) and Kanhu Badtia (Sr. Data Engineer, American Eagle Outfitters)
Source: Google Cloud Blog


For enquiries, product placements, sponsorships, and collaborations, connect with us at [email protected]. We'd love to hear from you!

Our humans need coffee too! Your support is highly appreciated, thank you!

aster.cloud

Related Topics
  • American Eagle Outfitters
  • Dataplex
  • Google Cloud
You May Also Like
View Post
  • Computing
  • Multi-Cloud
  • Technology

Host a static website on AWS with Amazon S3 and Route 53

  • June 27, 2025
View Post
  • Computing
  • Multi-Cloud
  • Technology

Prioritize security from the edge to the cloud

  • June 25, 2025
View Post
  • Computing
  • Multi-Cloud
  • Technology

6 edge monitoring best practices in the cloud

  • June 25, 2025
Genome
View Post
  • Technology

AlphaGenome: AI for better understanding the genome

  • June 25, 2025
View Post
  • Computing
  • Multi-Cloud
  • Technology

Pure Accelerate 2025: All the news and updates live from Las Vegas

  • June 18, 2025
View Post
  • Computing
  • Multi-Cloud
  • Technology

‘This was a very purposeful strategy’: Pure Storage unveils Enterprise Data Cloud in bid to unify data storage, management

  • June 18, 2025
View Post
  • Computing
  • Multi-Cloud
  • Technology

What is cloud bursting?

  • June 18, 2025
View Post
  • Computing
  • Multi-Cloud
  • Technology

There’s a ‘cloud reset’ underway, and VMware Cloud Foundation 9.0 is a chance for Broadcom to pounce on it

  • June 17, 2025

Stay Connected!
LATEST
  • Camping 1
    The Summer Adventures : Camping Essentials
    • June 27, 2025
  • Host a static website on AWS with Amazon S3 and Route 53
    • June 27, 2025
  • Prioritize security from the edge to the cloud
    • June 25, 2025
  • 6 edge monitoring best practices in the cloud
    • June 25, 2025
  • Genome 5
    AlphaGenome: AI for better understanding the genome
    • June 25, 2025
  • 6
    Pure Accelerate 2025: All the news and updates live from Las Vegas
    • June 18, 2025
  • 7
    ‘This was a very purposeful strategy’: Pure Storage unveils Enterprise Data Cloud in bid to unify data storage, management
    • June 18, 2025
  • What is cloud bursting?
    • June 18, 2025
  • 9
    There’s a ‘cloud reset’ underway, and VMware Cloud Foundation 9.0 is a chance for Broadcom to pounce on it
    • June 17, 2025
  • What is confidential computing?
    • June 17, 2025
about
Hello World!

We are aster.cloud. We’re created by programmers for programmers.

Our site aims to provide guides, programming tips, reviews, and interesting materials for tech people and those who want to learn in general.

We would like to hear from you.

If you have any feedback, enquiries, or sponsorship request, kindly reach out to us at:

[email protected]
Most Popular
  • Oracle adds xAI Grok models to OCI
    • June 17, 2025
  • Fine-tune your storage-as-a-service approach
    • June 16, 2025
  • 3
    Advanced audio dialog and generation with Gemini 2.5
    • June 15, 2025
  • Google Cloud, Cloudflare struck by widespread outages
    • June 12, 2025
  • 5
    Global cloud spending might be booming, but AWS is trailing Microsoft and Google
    • June 13, 2025
  • /
  • Technology
  • Tools
  • About
  • Contact Us

Input your search keywords and press Enter.