Data Cleaning In Python: The Ultimate Guide

Data cleaning or cleansing is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.

What a long definition! It is certainly not fun and very time-consuming.

machine learning class - send you back to data cleaning — Source: kdnuggets.com

To make it easier, we created this new complete step-by-step guide in Python. You’ll learn techniques on how to find and clean:

Missing Data
Irregular Data (Outliers)
Unnecessary Data — Repetitive Data, Duplicates and more
Inconsistent Data — Capitalization, Addresses and more

Within this guide, we use the Russian housing dataset from Kaggle. The goal of this project is to predict housing price fluctuations in Russia. We are not cleaning the entire dataset but will show examples from it.

Before we jump into the cleaning process, let’s take a brief look at the data.

# import packages

import pandas as pd

import numpy as np

import seaborn as sns

import matplotlib.pyplot as plt

import matplotlib.mlab as mlab

import matplotlib

plt.style.use(‘ggplot’)

from matplotlib.pyplot import figure

%matplotlib inline

matplotlib.rcParams[‘figure.figsize’] = (12,8)

pd.options.mode.chained_assignment = None

# read the data

df = pd.read_csv(‘sberbank.csv’)

# shape and data types of the data

print(df.shape)

print(df.dtypes)

# select numeric columns

df_numeric = df.select_dtypes(include=[np.number])

numeric_cols = df_numeric.columns.values

print(numeric_cols)

# select non numeric columns

df_non_numeric = df.select_dtypes(exclude=[np.number])

non_numeric_cols = df_non_numeric.columns.values

print(non_numeric_cols)

view raw read_explore_data.py hosted with

by GitHub

From these results, we learn that the dataset has 30,471 rows and 292 columns. We also identify whether the features are numeric or categorical variables. These are all useful information.

Now we can run through the checklist of “dirty” data types and fix them one by one.

Let’s get started.

Source: GIPHY

Missing data

Dealing with missing data/value is one of the most tricky but common parts of data cleaning.While many models can live with other problems of the data, most models don’t accept missing data.

How to find out?

We cover three techniques to learn more about missing data in our dataset.

Technique #1: Missing Data Heatmap

When there is a smaller number of features, we can visualize the missing data via heatmap.

cols = df.columns[:30] # first 30 columns

colours = [‘#000099’, ‘#ffff00’] # specify the colours – yellow is missing. blue is not missing.

sns.heatmap(df[cols].isnull(), cmap=sns.color_palette(colours))

view raw missing_data_find1.py hosted with

by GitHub

The chart below demonstrates the missing data patterns of the first 30 features. The horizontal axis shows the feature name; the vertical axis shows the number of observations/rows; the yellow color represents the missing data while the blue color otherwise.

For example, we see that the life_sq feature has missing values throughout many rows. While the floor feature only has little missing values around the 7000th row.

missing data heatmap — Missing Data Heatmap

Technique #2: Missing Data Percentage List

When there are many features in the dataset, we can make a list of missing data % for each feature.

# if it’s a larger dataset and the visualization takes too long can do this.

# % of missing.

for col in df.columns:

pct_missing = np.mean(df[col].isnull())

print(‘{} – {}%’.format(col, round(pct_missing*100)))

view raw missing_data_find2.py hosted with

by GitHub

This produces a list below showing the percentage of missing values for each of the features.

Specifically, we see that the life_sq feature has 21% missing, while floor has only 1% missing. This list is a useful summary that can complement the heatmap visualization.

missing data percentage list — Missing Data % List — the first 30 features

Technique #3: Missing Data Histogram

Missing data histogram is also a technique for when we have many features.

To learn more about the missing value patterns among observations, we can visualize it by a histogram.

# first create missing indicator for features with missing data

for col in df.columns:

missing = df[col].isnull()

num_missing = np.sum(missing)

if num_missing > 0:

print(‘created missing indicator for: {}’.format(col))

df[‘{}_ismissing’.format(col)] = missing

# then based on the indicator, plot the histogram of missing values

ismissing_cols = [col for col in df.columns if ‘ismissing’ in col]

df[‘num_missing’] = df[ismissing_cols].sum(axis=1)

df[‘num_missing’].value_counts().reset_index().sort_values(by=‘index’).plot.bar(x=‘index’, y=‘num_missing’)

view raw missing_data_dropping1.py hosted with

by GitHub

This histogram helps to identify the missing values situations among the 30,471 observations.

For example, there are over 6000 observations with no missing values and close to 4000 observations with one missing value.

missing data histogram — Missing Data Histogram

What to do?

There are NO agreed-upon solutions to dealing with missing data. We have to study the specific feature and dataset to decide the best way of handling them.

Below covers the four most common methods of handling missing data. But, if the situation is more complicated than usual, we need to be creative to use more sophisticated methods such as missing data modeling.

Solution #1: Drop the Observation

In statistics, this method is called the listwise deletion technique. In this solution, we drop the entire observation as long as it contains a missing value.

Only if we are sure that the missing data is not informative, we perform this. Otherwise, we should consider other solutions.

There could be other criteria to use to drop the observations.

For example, from the missing data histogram, we notice that only a minimal amount of observations have over 35 features missing altogether. We may create a new dataset df_less_missing_rows deleting observations with over 35 missing features.

# drop rows with a lot of missing values.

ind_missing = df[df[‘num_missing’] > 35].index

df_less_missing_rows = df.drop(ind_missing, axis=0)

view raw missing_data_dropping2.py hosted with

by GitHub

Solution #2: Drop the Feature

Similar to Solution #1, we only do this when we are confident that this feature doesn’t provide useful information.

For example, from the missing data % list, we notice that hospital_beds_raion has a high missing value percentage of 47%. We may drop the entire feature.

# hospital_beds_raion has a lot of missing.

# If we want to drop.

cols_to_drop = [‘hospital_beds_raion’]

df_less_hos_beds_raion = df.drop(cols_to_drop, axis=1)

view raw missing_data_dropping3.py hosted with

by GitHub

Solution #3: Impute the Missing

When the feature is a numeric variable, we can conduct missing data imputation. We replace the missing values with the average or median value from the data of the same feature that is not missing.

When the feature is a categorical variable, we may impute the missing data by the mode (the most frequent value).

Using life_sq as an example, we can replace the missing values of this feature by its median.

# replace missing values with the median.

med = df[‘life_sq’].median()

print(med)

df[‘life_sq’] = df[‘life_sq’].fillna(med)

view raw missing_data_imputation1.py hosted with

by GitHub

Moreover, we can apply the same imputation strategy for all the numeric features at once.

# impute the missing values and create the missing value indicator variables for each numeric column.

df_numeric = df.select_dtypes(include=[np.number])

numeric_cols = df_numeric.columns.values

for col in numeric_cols:

missing = df[col].isnull()

num_missing = np.sum(missing)

if num_missing > 0: # only do the imputation for the columns that have missing values.

print(‘imputing missing values for: {}’.format(col))

df[‘{}_ismissing’.format(col)] = missing

med = df[col].median()

df[col] = df[col].fillna(med)

view raw missing_data_imputation2.py hosted with

by GitHub

Luckily, our dataset has no missing value for categorical features. Yet, we can apply the mode imputation strategy for all the categorical features at once.

# impute the missing values and create the missing value indicator variables for each non-numeric column.

df_non_numeric = df.select_dtypes(exclude=[np.number])

non_numeric_cols = df_non_numeric.columns.values

for col in non_numeric_cols:

missing = df[col].isnull()

num_missing = np.sum(missing)

if num_missing > 0: # only do the imputation for the columns that have missing values.

print(‘imputing missing values for: {}’.format(col))

df[‘{}_ismissing’.format(col)] = missing

top = df[col].describe()[‘top’] # impute with the most frequent value.

df[col] = df[col].fillna(top)

view raw missing_data_imputation3.py hosted with

by GitHub

Solution #4: Replace the Missing

For categorical features, we can add a new category with a value such as “_MISSING_”. For numerical features, we can replace it with a particular value such as -999.

This way, we are still keeping the missing values as valuable information.

# categorical

df[‘sub_area’] = df[‘sub_area’].fillna(‘_MISSING_’)

# numeric

df[‘life_sq’] = df[‘life_sq’].fillna(–999)

view raw missing_data_replace.py hosted with

by GitHub

Irregular data (Outliers)

Outliers are data that is distinctively different from other observations. They could be real outliers or mistakes.

How to find out?

Depending on whether the feature is numeric or categorical, we can use different techniques to study its distribution to detect outliers.

Technique #1: Histogram/Box Plot

When the feature is numeric, we can use a histogram and box plot to detect outliers.

Below is the histogram of feature life_sq.

# histogram of life_sq.

df[‘life_sq’].hist(bins=100)

view raw outlier_histogram.py hosted with

by GitHub

The data looks highly skewed with the possible existence of outliers.

To study the feature closer, let’s make a box plot.

# box plot.

df.boxplot(column=[‘life_sq’])

view raw outlier_boxplot.py hosted with

by GitHub

In this plot, we can see there is an outlier at a value of over 7000.

Technique #2: Descriptive Statistics

Also, for numeric features, the outliers could be too distinct that the box plot can’t visualize them. Instead, we can look at their descriptive statistics.

For example, for the feature life_sq again, we can see that the maximum value is 7478, while the 75% quartile is only 43. The 7478 value is an outlier.

df[‘life_sq’].describe()

view raw outlier_describe.py hosted with

by GitHub

Technique #3: Bar Chart

When the feature is categorical. We can use a bar chart to learn about its categories and distribution.

For example, the feature ecology has a reasonable distribution. But if there is a category with only one value called “other”, then that would be an outlier.

	# bar chart - distribution of a categorical variable
	df[‘ecology’].value_counts().plot.bar()

view raw outlier_barchart.py hosted with

by GitHub

Other Techniques: Many other techniques can spot outliers as well, such as scatter plot, z-score, and clustering. This article does not cover all of those.

What to do?

While outliers are not hard to detect, we have to determine the right solutions to handle them. It highly depends on the dataset and the goal of the project.

The methods of handling outliers are somewhat similar to missing data. We either drop or adjust or keep them. We can refer back to the missing data section for possible solutions.

Unnecessary data

After all the hard work done for missing data and outliers, let’s look at unnecessary data, which is more straightforward.

All the data feeding into the model should serve the purpose of the project. The unnecessary data is when the data doesn’t add value. We cover three main types of unnecessary data due to different reasons.

Unnecessary type #1: Uninformative / Repetitive

Sometimes one feature is uninformative because it has too many rows being the same value.

How to find out?

We can create a list of features with a high percentage of the same value.

For example, we specify below to show features with over 95% rows being the same value.

num_rows = len(df.index)

low_information_cols = [] #

for col in df.columns:

cnts = df[col].value_counts(dropna=False)

top_pct = (cnts/num_rows).iloc[0]

if top_pct > 0.95:

low_information_cols.append(col)

print(‘{0}: {1:.5f}%’.format(col, top_pct*100))

print(cnts)

print()

view raw irrelevant_data.py hosted with

by GitHub

We can look into these variables one by one to see whether they are informative or not. We won’t show the details here.

What to do?

We need to understand the reasons behind the repetitive feature. When they are genuinely uninformative, we can toss them out.

Unnecessary type #2: Irrelevant

Again, the data needs to provide valuable information for the project. If the features are not related to the question we are trying to solve in the project, they are irrelevant.

How to find out?

We need to skim through the features to identify irrelevant ones.

For example, a feature recording the temperature in Toronto doesn’t provide any useful insights to predict Russian housing prices.

What to do?

When the features are not serving the project’s goal, we can remove them.

READ MORE: [button style=’accent’ url=’https://aster.cloud/2021/09/08/beautiful-is-better-than-ugly-using-python-to-automate-your-seo-strategy/’ target=’_blank’ arrow=’true’ fullwidth=’true’]BEAUTIFUL IS BETTER THAN UGLY’: USING PYTHON TO AUTOMATE YOUR SEO STRATEGY[/button]

[button style=’accent’ url=’https://aster.cloud/2021/11/01/why-hackers-should-learn-python-for-pen-testing/’ target=’_blank’ arrow=’true’ fullwidth=’true’]WHY HACKERS SHOULD LEARN PYTHON FOR PEN TESTING[/button]

Unnecessary type #3: Duplicates

The duplicate data is when copies of the same observation exist.

There are two main types of duplicate data.

Duplicates type #1: All Features based

How to find out?

This duplicate happens when all the features’ values within the observations are the same. It is easy to find.

We first remove the unique identifier id in the dataset. Then we create a dataset called df_dedupped by dropping the duplicates. We compare the shapes of the two datasets (df and df_dedupped) to find out the number of duplicated rows.

# we know that column ‘id’ is unique, but what if we drop it?

df_dedupped = df.drop(‘id’, axis=1).drop_duplicates()

# there were duplicate rows

print(df.shape)

print(df_dedupped.shape)

view raw duplicate_data_rows.py hosted with

by GitHub

10 rows are being complete duplicate observations.

What to do?

We should remove these duplicates, which we already did.

Duplicates type #2: Key Features based

How to find out?

Sometimes it is better to remove duplicate data based on a set of unique identifiers.

For example, the chances of two transactions happening at the same time, with the same square footage, the same price, and the same build year are close to zero.

We can set up a group of critical features as unique identifiers for transactions. We include timestamp, full_sq, life_sq, floor, build_year, num_room, price_doc. We check if there are duplicates based on them.

key = [‘timestamp’, ‘full_sq’, ‘life_sq’, ‘floor’, ‘build_year’, ‘num_room’, ‘price_doc’]

df.fillna(–999).groupby(key)[‘id’].count().sort_values(ascending=False).head(20)

view raw duplicate_data_key_check.py hosted with

by GitHub

There are 16 duplicates based on this set of key features.

What to do?

We can drop these duplicates based on the key features.

# drop duplicates based on an subset of variables.

key = [‘timestamp’, ‘full_sq’, ‘life_sq’, ‘floor’, ‘build_year’, ‘num_room’, ‘price_doc’]

df_dedupped2 = df.drop_duplicates(subset=key)

print(df.shape)

print(df_dedupped2.shape)

view raw duplicate_data_drop.py hosted with

by GitHub

We dropped the 16 duplicates within the new dataset named df_dedupped2.

Inconsistent data

It is also crucial to have the dataset follow specific standards to fit a model. We need to explore the data in different ways to find out the inconsistent data. Much of the time, it depends on observations and experience. There is no set code to run and fix them all.

Below we cover four inconsistent data types.

Inconsistent type #1: Capitalization

Inconsistent usage of upper and lower cases in categorical values is a common mistake. It could cause issues since analyses in Python is case sensitive.

How to find out?

Let’s look at the sub_area feature.

df[‘sub_area’].value_counts(dropna=False)

view raw string_lower_case1.py hosted with

by GitHub

It stores the name of different areas and looks very standardized.

But sometimes there is inconsistent capitalization usage within the same feature. The “Poselenie Sosenskoe” and “pOseleNie sosenskeo” could refer to the same area.

What to do?

To avoid this, we can put all letters to lower cases (or upper cases).

# make everything lower case.

df[‘sub_area_lower’] = df[‘sub_area’].str.lower()

df[‘sub_area_lower’].value_counts(dropna=False)

view raw string_lower_case2.py hosted with

by GitHub

Inconsistent type #2: Formats

Another standardization we need to perform is the data formats. One example is to convert the feature from string to DateTime format.

How to find out?

The feature timestampis in string format while it represents dates.

view raw string_to_datetime1.py hosted with

by GitHub

What to do?

We can convert it and extract the date or time values by using the code below. After this, it’s easier to analyze the transaction volume group by either year or month.

df[‘timestamp_dt’] = pd.to_datetime(df[‘timestamp’], format=‘%Y-%m-%d’)

df[‘year’] = df[‘timestamp_dt’].dt.year

df[‘month’] = df[‘timestamp_dt’].dt.month

df[‘weekday’] = df[‘timestamp_dt’].dt.weekday

print(df[‘year’].value_counts(dropna=False))

print()

print(df[‘month’].value_counts(dropna=False))

view raw string_to_datetime2.py hosted with

by GitHub

Inconsistent type #3: Categorical Values

Inconsistent categorical values are the last inconsistent type we cover. A categorical feature has a limited number of values. Sometimes there may be other values due to reasons such as typos.

How to find out?

We need to observe the feature to find out this inconsistency. Let’s show this with an example.

We create a new dataset below since we don’t have such a problem in the real estate dataset. For instance, the value of city was typed by mistakes as “torontoo” and “tronto”. But they both refer to the correct value “toronto”.

A simple way to identify them is fuzzy logic (or edit distance). It measures how many letters (distance) we need to change the spelling of one value to match with another value.

We know that the categories should only have four values of “toronto”, “vancouver”, “montreal”, and “calgary”. We calculate the distance between all the values and the word “toronto” (and “vancouver”). We can see that the ones likely to be typos have a smaller distance with the correct word. Since they only differ by a couple of letters.

from nltk.metrics import edit_distance

df_city_ex = pd.DataFrame(data={‘city’: [‘torontoo’, ‘toronto’, ‘tronto’, ‘vancouver’, ‘vancover’, ‘vancouvr’, ‘montreal’, ‘calgary’]})

df_city_ex[‘city_distance_toronto’] = df_city_ex[‘city’].map(lambda x: edit_distance(x, ‘toronto’))

df_city_ex[‘city_distance_vancouver’] = df_city_ex[‘city’].map(lambda x: edit_distance(x, ‘vancouver’))

df_city_ex

view raw fuzzy_logic_distance.py hosted with

by GitHub

What to do?

We can set criteria to convert these typos to the correct values. For example, the below code sets all the values within 2 letters distance from “toronto” to be “toronto”.

msk = df_city_ex[‘city_distance_toronto’] <= 2

df_city_ex.loc[msk, ‘city’] = ‘toronto’

msk = df_city_ex[‘city_distance_vancouver’] <= 2

df_city_ex.loc[msk, ‘city’] = ‘vancouver’

df_city_ex

view raw fuzzy_logic_replace.py hosted with

by GitHub

Inconsistent type #4: Addresses

The address feature could be a headache for many of us. Because people entering the data into the database often don’t follow a standard format.

How to find out?

We can find messy address data by looking at it. Even though sometimes we can’t spot any issues, we can still run code to standardize them.

There is no address column in our dataset for privacy reasons. So we create a new dataset df_add_ex with feature address.

# no address column in the housing dataset. So create one to show the code.

df_add_ex = pd.DataFrame([‘123 MAIN St Apartment 15’, ‘123 Main Street Apt 12 ‘, ‘543 FirSt Av’, ‘ 876 FIRst Ave.’], columns=[‘address’])

df_add_ex

view raw address_cleaning1.py hosted with

by GitHub

As we can see, the address feature is quite messy.

What to do?

We run the below code to lowercase the letters, remove white space, delete periods and standardize wordings.

df_add_ex[‘address_std’] = df_add_ex[‘address’].str.lower()

df_add_ex[‘address_std’] = df_add_ex[‘address_std’].str.strip() # remove leading and trailing whitespace.

df_add_ex[‘address_std’] = df_add_ex[‘address_std’].str.replace(‘\\.’, ”) # remove period.

df_add_ex[‘address_std’] = df_add_ex[‘address_std’].str.replace(‘\\bstreet\\b’, ‘st’) # replace street with st.

df_add_ex[‘address_std’] = df_add_ex[‘address_std’].str.replace(‘\\bapartment\\b’, ‘apt’) # replace apartment with apt.

df_add_ex[‘address_std’] = df_add_ex[‘address_std’].str.replace(‘\\bav\\b’, ‘ave’) # replace apartment with apt.

df_add_ex