Handling Missing Data in Pandas
Overview
When no information is given for one or more elements, a full unit, or both, this is known as missing data. Missing data is a serious issue in real-world situations. In pandas, missing data may also be represented as NA (Not Available) values. Many datasets loaded into a DataFrame come with missing data, either because the data was never gathered or because it was present but was not captured. Assume, for instance, that some people being surveyed opt not to disclose their income and that some users choose not to disclose their addresses. As a result, many datasets end up with missing values.
Scope
- We will discuss different ways of handling missing data using pandas.
- We will learn about the steps involved, along with syntax and code explanation.
- We'll discuss how to identify missing data in a DataFrame, what it comprises, and how calculations behave in the presence of missing data.
- We will also discuss Cleaning Missing Data, Dropping Missing Data, and Replacing Missing Data.
- After discussing the steps, we'll discuss in more detail the functions for handling missing data in pandas, such as isnull(), notnull(), dropna(), fillna(), replace(), and interpolate().
Introduction
When no information is provided for one or more elements or for the entire unit, this is referred to as missing data. Missing data poses a serious issue in real-world situations. In pandas, missing data can also be referred to as NA (Not Available) values. Many datasets simply have missing data when they are imported into a DataFrame, either because the data was never gathered or because it was present but was not captured. For example, suppose different individuals being surveyed opt not to reveal their income, and some users choose not to share their address; as a result, many values end up missing.
Pandas supports two values to represent missing data:
- None: None is a Python singleton object that is commonly used in Python programs to represent missing data.
- NaN: Short for Not a Number, NaN is a special floating-point value recognized by all systems that use the IEEE 754 standard for floating-point representation.
Handling Missing Data
We handle missing data in the following stages. We'll go through each step in more detail, but here's the general idea:
- We start by importing the necessary packages.
- We use the read_csv() function to read the dataset.
- The dataset is printed, and we check whether any record has NaN values or missing data.
- On the dataset, we apply the dropna() function. This method deletes the records that contain missing values. We additionally pass the argument inplace=True so that the entries are removed and the updated dataset is stored in the same variable.
- The dataset is printed. No records contain missing values anymore.
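The steps above can be sketched as follows. The CSV content here is an inline stand-in for a hypothetical data file, so the column names and values are purely illustrative:

```python
import pandas as pd
from io import StringIO

# Inline stand-in for a hypothetical CSV file; blank fields become NaN.
csv_data = StringIO(
    "Name,Age,Salary\n"
    "Alice,25,50000\n"
    "Bob,,60000\n"
    "Carol,30,\n"
)

df = pd.read_csv(csv_data)   # read the dataset
print(df.isnull().sum())     # check which columns contain missing values

df.dropna(inplace=True)      # drop records with missing values, in place
print(df)                    # only complete records remain
```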
Calculation with Missing Data
None is the Python singleton object that is frequently used for missing data in Python programs. Because it is a Python object, None can only be used in arrays of the data type "object" (i.e., arrays of Python objects), and cannot be used in ordinary numeric NumPy/Pandas arrays:
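A small illustration of this behavior (the array values are arbitrary):

```python
import numpy as np

vals1 = np.array([1, None, 3, 4])
print(vals1.dtype)  # object -- NumPy falls back to Python objects

# Aggregations over object arrays containing None raise a TypeError,
# because an integer cannot be added to None:
try:
    vals1.sum()
except TypeError as exc:
    print("sum() failed:", exc)
```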
The alternative missing data representation, NaN (an acronym for Not a Number), is distinct; it is a unique floating-point value recognized by all systems that utilize the common IEEE floating-point notation:
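For example (again with arbitrary values):

```python
import numpy as np

vals2 = np.array([1, np.nan, 3, 4])
print(vals2.dtype)  # float64 -- NaN forces a native floating-point dtype

# NaN propagates through ordinary aggregations...
print(vals2.sum())  # nan

# ...but NaN-aware variants simply skip the missing entries:
print(np.nansum(vals2))  # 8.0
```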
It's important to note that NumPy selected a native floating-point type for this array. In contrast to the object array from earlier, this means the array supports fast, vectorized operations that execute in compiled code.
Cleaning Missing Data
The isna() and isnull() methods return a Boolean check of whether or not each cell of the DataFrame has a missing value: if a value is absent from a certain cell, the method returns True; otherwise, it returns False (the cell has a value).
Both methods give an identical response (isnull() is an alias for isna()), so you can use either one to display the Boolean check of whether there is missing data or not.
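For example, with a small illustrative DataFrame containing two missing cells:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Name": ["Alice", "Bob", None],
    "Score": [85.0, np.nan, 92.0],
})

print(df.isna())    # True marks each missing cell
print(df.isnull())  # identical result; isnull() is an alias for isna()
```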
Dropping Missing Data
When handling missing data, you can choose either to ignore it or to substitute values for it. The dropna() method drops every record that contains a missing value, resulting in a clean DataFrame with no missing data.
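A minimal sketch with illustrative data, where two of the three rows contain a missing value:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol"],
    "Age": [25.0, np.nan, 30.0],
    "Salary": [50000.0, 60000.0, np.nan],
})

clean = df.dropna()  # keep only rows with no missing values
print(clean)
```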
Replacing Missing Data
When handling missing data, you can choose either to ignore it or to substitute values for it. Fortunately, the Pandas fillna() method can replace missing values in a DataFrame with a user-supplied value. To replace every missing value with the number 0 (the value 0 is arbitrary and may be any other value of your choice):
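A minimal sketch, using illustrative column names and values:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [np.nan, 5.0, 6.0]})

filled = df.fillna(0)  # 0 is arbitrary; any replacement value works
print(filled)
```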
Important Functions for Handling Missing Data in Pandas
isnull()
When checking for null values in a Pandas DataFrame, the isnull() method returns a DataFrame of Boolean values that are True for NaN cells.
For a specific column, we access the data using the df[] syntax.
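Both usages can be sketched with illustrative data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "First Score": [100, 90, np.nan, 95],
    "Second Score": [30, np.nan, 45, 56],
})

print(df.isnull())                 # whole DataFrame: True marks NaN cells
print(df["First Score"].isnull())  # a single column via the df[] syntax
```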
notnull()
When checking for null values in a Pandas DataFrame, the notnull() method returns a DataFrame of Boolean values that are False for NaN cells.
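For example, with the same kind of illustrative data as above:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "First Score": [100, 90, np.nan, 95],
    "Second Score": [30, np.nan, 45, 56],
})

print(df.notnull())  # False marks each NaN cell -- the inverse of isnull()
```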
dropna()
We use the dropna() method to remove null values from a DataFrame. It can drop rows or columns of the dataset containing null values in several ways. We'll start with an example of dropping rows that have at least one null value.
Eliminate columns with at least one missing value.
Dropping rows in a CSV file that include at least one null value.
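The row and column variants can be sketched as follows; both return a new DataFrame unless inplace=True is passed (the data is illustrative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "A": [1.0, 2.0, np.nan],
    "B": [4.0, np.nan, np.nan],
    "C": [7.0, 8.0, 9.0],
})

print(df.dropna())        # drop rows with at least one missing value
print(df.dropna(axis=1))  # drop columns with at least one missing value
```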
fillna()
By using the fillna(), replace(), and interpolate() functions, we can fill null values in a dataset by replacing NaN values with alternative values. All of these functions let us fill the null values in a DataFrame's dataset. The interpolate() method likewise fills NA values in a DataFrame, but it computes the replacements with various interpolation techniques rather than hard-coding a value. We'll begin by looking at how to fill a null value with a single value.
We'll look at an illustration of how to fill in a null value with a previous value.
We'll look at an illustration of how to fill in a null value with a next value.
We'll look at an illustration of how to fill in a null value of a specific column with a specified value.
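These fill strategies can be sketched as follows, using df.ffill() for the previous value and df.bfill() for the next value (the data is illustrative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "B": [np.nan, 5.0, np.nan],
})

print(df.fillna(0))          # fill every NaN with a single value
print(df.ffill())            # fill with the previous value (B's first NaN has none)
print(df.bfill())            # fill with the next value (B's last NaN has none)
print(df.fillna({"B": 99}))  # fill only column B with a specified value
```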
replace()
Using the replace() method, we can replace not just null values but any value specified as a function argument: we pass the value to be replaced in to_replace and the new value in value.
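A minimal sketch showing both uses, with an illustrative sentinel value of -999:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"A": [1, -999, 3], "B": [np.nan, 5, -999]})

print(df.replace(to_replace=np.nan, value=0))  # fill missing values
print(df.replace(to_replace=-999, value=0))    # replace an ordinary value
```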
interpolate()
Another useful Pandas feature is the ability to substitute missing values with values that make sense, using the interpolate() function. With linear interpolation, Pandas fills each gap using the midpoint between the surrounding points. Naturally, if the data were curvilinear, a function would be fitted to it to estimate the missing values differently. We'll discuss the linear approach to interpolating the missing numbers. Keep in mind that the linear technique treats the data as equally spaced and disregards the index.
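A minimal sketch using an illustrative Salary column with gaps; linear interpolation fills each gap with the value midway between its neighbors:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"Salary": [50000.0, np.nan, 70000.0, np.nan, 90000.0]})

print(df.interpolate(method="linear"))  # gaps become 60000 and 80000
```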
We can see that the missing values in the Salary column have been filled in by the interpolate() function.
Conclusion
- In conclusion, NumPy lacks a built-in concept of NA data for non-floating-point data types, which limits how Pandas can represent missing values.
- We import the necessary packages and read the dataset. We check if any record has NaN values or missing data.
- We use the isnull() function to print the null values. We can also print the null values column-wise.
- We use the notnull() method to return a DataFrame of Boolean values that are False for NaN cells when checking for null values in a Pandas DataFrame.
- We can choose either to ignore missing data or to substitute values for it. We apply the dropna() function, additionally passing inplace=True so that the entries are removed and the updated dataset is stored in the same variable.
- By using the fillna(), replace(), and interpolate() functions, we can fill in any null values in a dataset by replacing NaN values with alternative values.
- The interpolate() function fills in the gaps beautifully by using the points’ midpoints in the dataset.