In this blog, we’ll cover data governance as it relates to managing data in the cloud. We’ll discuss the operating model which is independent of technologies whether on-prem or cloud, processes to ensure governance, and finally the technologies that are available to ensure data governance in the cloud. This is a two part blog on data governance. In this first part, we’ll discuss the role of data governance, why it’s important, and processes that need to be implemented to run an effective data governance program.
In the second part, we’ll dive into the tools and technologies that are available to implement data governance processes, e.g. data quality, data discovery, tracking lineage, and security.
For an in-depth and comprehensive text on Data governance, check Data Governance: People, Processes, and Tools to Operationalize Data Trustworthiness.
What is Data Governance?
Data governance is a function of data management which creates value for the organization by implementing processes to ensure high data quality, and provides a platform that makes it easier to share data securely across the organization while ensuring compliance with all the regulations.
The goal of data governance is to maximize the value derived from data, build user trust, and ensure compliance by implementing required security measures.
Data governance needs to be in place from the time a factoid of data is collected or generated and until the point in time at which that data is retired. Along the way, in this full lifecycle of the data, data governance focuses on making the data available to all stakeholders in a form that they can readily access and use in a manner that generates the desired business outcomes (insights, analysis), and if relevant, conforms to regulatory standards. These regulatory standards are often an intersection of industry (e.g. healthcare), government (e.g. privacy), and company (e.g. non-partisan) rules and codes of behavior. See more details here.
Why is Data Governance Important?
In the last decade, the amount of data generated by users using mobile phones, health & fitness and IOT devices, retail beacons etc. have caused an exponential growth in data. At the same time, the cloud has made it easier to collect, store, and analyze this data at a lower cost. As the volume of data and adoption of cloud continues to grow, organizations are challenged with a dual mandate to democratize and embed data in all decision making while ensuring that it is secured and protected from unauthorized use.
An effective data governance program is needed to implement this dual mandate to make the organization data driven on one hand and securing data from unauthorized use on the other. Organizations without an effective data governance program will suffer from compliance violations leading to fines, poor data quality which leads to lower quality insights impacting business decisions, challenges in finding data which results in delayed analysis and missed business opportunities, poorly trained data models for AI which reduces the model accuracy and benefits of using AI.
An effective data governance strategy encompasses people, processes, and tools & technologies. It drives data democratization to embed data in all decision making, builds user trust, increases brand value, reduces the chances of compliance violations which can lead to substantial fines, and loss of business.
Components of Data Governance
People and Roles in Data Governance
A comprehensive data governance program starts with a data governance council composed of leaders representing each business unit in the organization. This council establishes the high level governing principles on how the data will be used to drive business decisions. The council with the help of key people in each b business functions identify the data domains, e.g. customer, product, patient, and provider. The council then assigns data ownership and stewardship roles for each data domain. These are senior level roles and each owner is held accountable and accordingly rewarded for driving the data goals set by the data governance council. Data owners and stewards are assigned from business, for example customer data owner may be from marketing or sales, finance data owner from finance, while HR data owner from HR.
The role of IT is that of data custodian. IT ensures the data is acquired, protected, stored, and shared according to the policies specified by data owners. As data custodians, IT does not make the decisions on data access or data sharing. IT’s role is limited to managing technology to support the implementation of data management policies set by data owners.
Processes in Data Governance
Each organization will establish processes to drive towards the implementation of goals set by the data governance council. The processes are established by data owners and data stewards for each of their data domains. The processes focus on the following high level goals:
1. Data meets the specified data quality standards – e.g. 98% completeness, no more than 0.1% duplicate values, 99.99% consistent data across different tables, and what constitutes on-time delivery
2. Data security policies to ensure compliance with internal and external policies
- Data is encrypted at rest and on wire
- Data access is limited to authorized users only
- All sensitive data fields are redacted or encrypted and dynamically decrypted only for authorized users
- Data can be joined for analytics in de-identified form, e.g. using deterministic encryption or hashing
- Audits are available for authorized access as well as unauthorized attempts
3. Data sharing with external partners is available securely via APIs
4. Compliance with industry and geo specific regulations, e.g. HIPAA, PCI DSS, GDPR, CCPA, LGPD
5. Data replication is minimized
6. Centralized data discovery for data users via data catalogs
7. Trace data lineage to identify data quality issues, data replication sources, and help with audits
Implementing the processes as specified in the data governance program requires use of technology. From securing data, retaining and reporting audits, to automate monitoring and alerts, multiple technologies are integrated to manage data life cycle.
In Google Cloud, a comprehensive set of tools enables organizations to manage their data securely and drive data democratization. Data Catalog enables users to easily find data from one centralized place across Google Cloud. Data Fusion tracks lineage so data owners can trace data at every point in the data life cycle and fix issues that may be corrupting data. Cloud Audit Logs retain audits needed for compliance. Dataplex provides intelligent data management, centralized security and governance, automatic data discovery, metadata harvesting, lifecycle management, and data quality with built-in AI-driven intelligence.
We will discuss the use of tools and technologies to implement governance in part 2 of this blog.
By: Imad Qureshi (Customer Engineer, Google Cloud)
Source: Google Cloud Blog