Machine Learning - Handling Missing Data

Here’s the data that we will use for the tutorial: Click here

When you have empty values in your data, they’ll often show up in your IDE as NAN values. You need to handle this missing data before you begin training. There are two basic ways to go about it: remove the rows with missing values, or fill the missing values in.

Image 1: Screenshot of the dataset from PC

Above you can see just how many cells have missing data.

Check if NANs exist

Datasets are really big, so how can we find empty values quickly?

We can use the isnull() method to flag every NAN value in the dataset (note the lowercase name: pandas method names are case-sensitive). Chaining it with the .sum() method gives us a count of NAN values per column.

Code Snippet: dataset.isnull().sum()

Here’s a partial example of the dataset from above:

Image 2: Screenshot taken from PC

We see here that ‘Date‘ has 2 NAN values and ‘DeathCounty‘ has 1100. This code snippet gives us the exact number of NAN values per column in our dataset.
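If you want to try the count on something small first, here is a minimal sketch. The toy DataFrame is made up for illustration (only the column names echo the tutorial's dataset), so the numbers will not match the screenshot.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column names echo the tutorial, values are invented.
dataset = pd.DataFrame({
    "Date": ["2020-01-01", None, "2020-01-03", None],
    "Age": [34.0, np.nan, 51.0, 28.0],
    "Race": ["White", "Black", None, "Unknown"],
})

# isnull() returns a same-shaped frame of booleans;
# sum() then counts the True values in each column.
nan_counts = dataset.isnull().sum()
print(nan_counts)

# On a wide dataset it helps to show only the columns that have gaps.
columns_with_gaps = nan_counts[nan_counts > 0]
print(columns_with_gaps)
```

Filtering the counts this way is just a convenience for wide datasets; the plain isnull().sum() call already gives the full picture.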

Dealing with NAN values

There are a few options for dealing with NAN values, and which one fits depends on your dataset.

1. Remove Missing Values

In the example above, we saw that ‘Date‘ has 2 NAN values.

Now for me, ‘Date‘ might be necessary for my processing and so I would want to remove any row that didn’t have date values.

To achieve this we can use the built-in method, dropna(). By default this method drops any row that contains a NAN. However, we’re going to want to work with specific columns, so let’s see how we can do that.

a. Drop rows with missing values in a column

Code Snippet: dataset = dataset.dropna(subset=['ColumnName1', 'ColumnName2'], how='any')

Here we see that there are no longer any rows where Date is NAN.

A quick side note: how='any' drops the row if any of the selected columns has a NAN. In this example I was only looking at one column, so it doesn’t matter, but let’s say we had subset=['Date', 'Age']. If either ‘Date‘ or ‘Age‘ had a NAN value, the row would be dropped.

If we wanted the row to drop only when both ‘Date‘ and ‘Age‘ were NAN, we could use how='all', which checks that all of the selected values are NAN before dropping the row.
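To make the difference between how='any' and how='all' concrete, here is a small sketch on made-up data. The frame name toy and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical rows covering each case:
# 0: nothing missing, 1: Date missing, 2: Age missing, 3: both missing.
toy = pd.DataFrame({
    "Date": ["2020-01-01", None, "2020-01-03", None],
    "Age": [34.0, 40.0, np.nan, np.nan],
})

# how='any': drop a row if Date OR Age is missing.
any_dropped = toy.dropna(subset=["Date", "Age"], how="any")

# how='all': drop a row only if Date AND Age are both missing.
all_dropped = toy.dropna(subset=["Date", "Age"], how="all")

print(list(any_dropped.index))
print(list(all_dropped.index))
```

Note that dropna() returns a new DataFrame and leaves toy untouched, which is why the result is assigned to a new name here.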

b. Drop all rows with NAN

It may be the case that I just want to remove any row when any column has a NAN. This is fairly easy to do with the .dropna() method:

Code Snippet: dataset = dataset.dropna(how='any')

In our case though, performing this deletes all rows so it isn’t very useful.

c. Drop rows where every column is NAN

Sometimes a dataset contains rows where every value is NAN. To remove these unnecessary rows we can run:

Code Snippet: dataset = dataset.dropna(how='all')
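Since dropping rows can wipe out a whole dataset, as it did above with how='any', it can be worth checking how many rows would survive before overwriting anything. A minimal sketch on invented data (the frame and its values are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical rows: 0 is complete, 1 is partly missing, 2 is entirely missing.
toy = pd.DataFrame({
    "Date": ["2020-01-01", None, None],
    "Age": [34.0, 40.0, np.nan],
    "Race": ["White", None, None],
})

# Rows left if we drop every row that has at least one NaN.
complete_rows = toy.dropna(how="any")

# Rows left if we drop only the rows that are entirely NaN.
non_empty_rows = toy.dropna(how="all")

print(len(toy), len(complete_rows), len(non_empty_rows))
```

If the how='any' count comes back close to zero, that is a hint to drop on a subset of columns or to fill values instead.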

2. Filling Missing Values

So there are times when our data is almost perfect but we just have a few missing values that need to be filled in so that we can process the data.

a. Filling in Categorical Data

Let’s take the example of the ‘Race‘ column. We can look at its values with the value_counts() method, making sure to set dropna=False so that NAN values are counted alongside the categories:

Code Snippet: dataset['Race'].value_counts(dropna=False)

We see here that there are 13 NAN values in the ‘Race‘ column. Now lucky for us there are already categories like ‘Unknown‘ or ‘Other‘ that we can use.

In this case, it makes sense for us to map all NAN values to the ‘Unknown‘ category, since we don’t know whether these values belong to one of the primary categories or to the ‘Other‘ category.

To do this we just need to use the .fillna() method like below, assigning the result back to the column so that all the missing values become ‘Unknown‘. (Calling fillna with inplace=True on a selected column is unreliable in recent pandas versions, so assigning back is the safer form.)

Code Snippet: dataset['ColName'] = dataset['ColName'].fillna(value='Replace')

And bam! All the NAN values have been mapped to ‘Unknown‘ just like we wanted.
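Here is the same idea end to end on a tiny made-up column, as a sketch rather than the tutorial's actual data. The category values are assumptions chosen to mirror the ones mentioned above.

```python
import pandas as pd

# Hypothetical 'Race' column with two missing entries.
toy = pd.DataFrame({"Race": ["White", None, "Unknown", "Other", None]})

# Count the gaps before filling.
missing_before = toy["Race"].isnull().sum()
print(toy["Race"].value_counts(dropna=False))

# Map every missing value to the existing 'Unknown' category.
toy["Race"] = toy["Race"].fillna("Unknown")

# The NaN bucket is gone and 'Unknown' has absorbed its rows.
after = toy["Race"].value_counts(dropna=False)
print(after)
```

Reusing a category that already exists keeps the number of categories the same, which is convenient if the column is later encoded for training.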

b. Filling in Numerical Data

If we take ‘Age‘ as an example, we can see from our previous run (Image 3) there are 3 rows where Age is NAN. Now since this isn’t a lot of missing data, normally we could just remove them because it probably wouldn’t affect our training all that much. But let’s say we needed to handle them, what strategies could we employ?

Replace by the average, median or most frequent value of the column

Depending on the situation you may want to replace the values with the mean, median, or mode. In this situation the mean would make sense, but let’s do it all three ways.

Average: dataset['col'] = dataset['col'].fillna(dataset['col'].mean())

Most Frequent: dataset['col'] = dataset['col'].fillna(dataset['col'].mode().iloc[0])

Median: dataset['col'] = dataset['col'].fillna(dataset['col'].median())

In this example for ‘Age‘, we can see that indexes 0, 12, and 4887 are NAN values.
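To compare the three strategies side by side, here is a small sketch. The ages are invented for illustration and are not the values from the tutorial's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries (first and last row).
ages = pd.DataFrame({"Age": [np.nan, 20.0, 30.0, 30.0, 60.0, np.nan]})

# mean(), median() and mode() all skip NaN values by default.
mean_age = ages["Age"].mean()            # (20 + 30 + 30 + 60) / 4
median_age = ages["Age"].median()
mode_age = ages["Age"].mode().iloc[0]    # mode() returns a Series; take the first value

# fillna() returns a new Series, so the original column stays unchanged
# and the three results can be compared.
filled_mean = ages["Age"].fillna(mean_age)
filled_median = ages["Age"].fillna(median_age)
filled_mode = ages["Age"].fillna(mode_age)

print(filled_mean.iloc[0], filled_median.iloc[0], filled_mode.iloc[0])
```

The mean gets pulled toward outliers (the 60 here), while the median and mode do not, so it is worth looking at the column's distribution before picking one.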