Exploratory Data Analysis in Python

Overview

Exploratory data analysis (EDA) is the process of analyzing and summarizing a dataset to better understand its characteristics and the relationships among its features. It is an iterative process that involves computing statistics (descriptive statistics, correlation, etc.), visualizing the data, identifying underlying patterns and trends, and deriving insights from the data. It is a critical step in any data science project, as it helps you understand which features are likely to be useful for ML model development. In this article, let’s perform exploratory data analysis on a dataset and derive insights from it.

What Are We Building?

In this project, we will use the Titanic dataset, which contains details of the passengers on board. The feature Survived indicates whether the passenger survived or not. You can download the dataset from here. It contains details of 891 passengers, such as passenger class, gender, age, fare, and survival status. We will perform exploratory data analysis (EDA) on this data to identify underlying patterns and derive insights.

Pre-requisites

  • Python
  • Data Visualization
  • Descriptive Statistics

How Are We Going to Build This?

  • We will load the dataset and will explore it using various descriptive statistics techniques.
  • Further, we will perform univariate and bi-variate analyses to identify underlying patterns and trends.
  • Finally, we will derive insights by grouping the data and analyzing the correlation heatmap.

Requirements

We will be using the following libraries in this project -

  • Pandas
  • NumPy
  • Matplotlib
  • Seaborn

Dataset Feature Descriptions

The description for the features present in this dataset is -

  • PassengerId - Unique ID of the passenger.
  • Survived - Whether the passenger survived or not.
  • Pclass - This feature represents the class of the passenger’s reservation, i.e., class-1, class-2, class-3.
  • Name, Sex - The name and gender of the passenger.
  • Age - Age of the passenger.
  • SibSp - It represents the number of siblings or spouses of a person onboard.
  • Parch - Similar to SibSp, it represents the number of parents or children of a person onboard.
  • Ticket - It is the ticket number of the passenger.
  • Fare - The cost of the ticket.
  • Cabin - It is the cabin number of the passenger.
  • Embarked - It represents the location from where the passengers embarked on the ship. In this feature, S represents Southampton, C represents Cherbourg, and Q represents Queenstown.

Doing the Exploratory Data Analysis

Importing Libraries and Loading the Dataset

Let’s start the project by importing all necessary libraries to perform exploratory data analysis and loading the dataset.

[Figure: importing libraries to perform exploratory data analysis]
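A minimal sketch of this step is shown below. It imports the four libraries listed under Requirements and wraps the loading step in a small helper. The helper name `load_titanic` and the default file name `titanic.csv` are assumptions made for illustration; adjust the path to wherever the downloaded file is stored.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


def load_titanic(path="titanic.csv"):
    """Read the Titanic CSV file into a pandas DataFrame.

    The default file name is an assumption; pass the real location
    of the downloaded file instead.
    """
    return pd.read_csv(path)


# Example (assumes the file exists next to the script):
# df = load_titanic()
# print(df.head())
```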

Descriptive Statistics

  • First, we will understand the data by exploring its data types and how many NULL values are in the dataset.

[Figure: exploring data types and null values in the dataset]
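One possible way to get this overview is sketched below. The helper `summarize_columns` is a hypothetical name, and it expects a DataFrame holding the Titanic data; calling `df.info()` prints similar information directly.

```python
import pandas as pd


def summarize_columns(df):
    """Return one row per column with its data type and missing-value count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),   # data type of each column
        "missing": df.isnull().sum(),     # number of NULL (NaN) entries
    })


# Example, given a DataFrame `df` with the Titanic data:
# print(summarize_columns(df))
```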

  • As we can see, this dataset contains a mix of categorical and numerical features. The features Age and Cabin contain many missing values, and Embarked is missing two. Now, let’s explore the numerical features using descriptive statistics.

[Figure: descriptive statistics of the numerical features]
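The summary table can be produced with pandas' `describe` method. The sketch below restricts it to four numerical columns; the helper name `numeric_summary` and the default column selection are assumptions for illustration.

```python
import pandas as pd


def numeric_summary(df, columns=("Age", "Fare", "SibSp", "Parch")):
    """Return count, mean, std, min, quartiles and max for the given columns."""
    return df[list(columns)].describe()


# Example, given a DataFrame `df` with the Titanic data:
# print(numeric_summary(df).loc[["mean", "max"]])
```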

  • As seen in the above figure, the mean age of the passengers is around 30 years, the average ticket fare is around 32, the maximum number of siblings or spouses on board for a single passenger is 8, and the maximum number of parents or children on board is 6. Let’s perform a univariate analysis of the features in the next step.

Univariate Analysis

  • In this step, we will explore the distribution of each feature using various visualization techniques. We will use a count plot for categorical variables, and for numerical variables, we will use histograms. Let’s explore the target variable Survived and its distribution.

[Figure: distribution of the target variable Survived]
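A count plot like this one can be drawn with Seaborn's `countplot`. The sketch below also returns the percentage share of each category, which is how a figure such as the survival percentage can be read off. The helper name `plot_category_counts` is hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_category_counts(df, column, ax=None):
    """Draw a count plot for one categorical column.

    Returns the axes and the percentage share of each category.
    """
    if ax is None:
        _, ax = plt.subplots(figsize=(6, 4))
    sns.countplot(data=df, x=column, ax=ax)
    ax.set_title(f"Distribution of {column}")
    shares = df[column].value_counts(normalize=True).mul(100).round(1)
    return ax, shares


# Example, given a DataFrame `df` with the Titanic data:
# ax, shares = plot_category_counts(df, "Survived")
# print(shares)
# plt.show()
```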

  • As we can see, around 38% of the 891 passengers in the dataset survived. Let’s explore the passenger class feature and its category distribution.

[Figure: passenger class feature and its category distribution]

  • We can see in the above figure that most of the passengers were from class-3 and the least number of passengers were from class-2. Let’s explore the Embarked feature and its category distribution.

[Figure: Embarked feature and its category distribution]

  • From the above figure, we can see that most passengers (around 72%) embarked at Southampton and the fewest embarked at Queenstown. Now we will explore the gender distribution of the passengers.

[Figure: gender distribution of the passengers]
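The count plots for the categorical features Pclass, Embarked, and Sex could also be drawn side by side for easier comparison. The sketch below is one way to do that; the helper name `plot_count_grid` and the figure size are assumptions.

```python
import matplotlib.pyplot as plt
import seaborn as sns


def plot_count_grid(df, columns=("Pclass", "Embarked", "Sex")):
    """Draw one count plot per categorical column in a single row."""
    fig, axes = plt.subplots(
        1, len(columns), figsize=(5 * len(columns), 4), squeeze=False
    )
    for ax, column in zip(axes[0], columns):
        # Rows with a missing value in `column` are left out of the counts.
        sns.countplot(data=df, x=column, ax=ax)
        ax.set_title(column)
    fig.tight_layout()
    return fig


# Example, given a DataFrame `df` with the Titanic data:
# plot_count_grid(df)
# plt.show()
```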

  • From the above diagram, we can see that most of the passengers (around 65%) were male and the rest were female. Now we will move on to the numerical variables. First, we will explore the distribution of age among the passengers.

[Figure: distribution of age among the passengers]

  • We can see in the above diagram that most of the passengers were between the ages of 20 and 50. Similarly, let’s explore the histogram of the Fare feature.

[Figure: histogram of the Fare feature]
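Histograms such as the ones for Age and Fare can be drawn with Matplotlib. The sketch below is a minimal version; the helper name `plot_histogram` and the default of 20 bins are assumptions, so the bin edges may differ from the figures above.

```python
import matplotlib.pyplot as plt


def plot_histogram(df, column, bins=20, ax=None):
    """Draw a histogram of one numerical column and return the axes."""
    if ax is None:
        _, ax = plt.subplots(figsize=(6, 4))
    # Drop missing values first (Age, for example, is not always recorded).
    ax.hist(df[column].dropna(), bins=bins, edgecolor="black")
    ax.set_xlabel(column)
    ax.set_ylabel("Number of passengers")
    ax.set_title(f"Distribution of {column}")
    return ax


# Example, given a DataFrame `df` with the Titanic data:
# plot_histogram(df, "Age")
# plot_histogram(df, "Fare")
# plt.show()
```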

  • As we can see, most passengers paid a fare below 50 for the journey. In the next step, we will perform a bi-variate analysis of the features.

Bi-variate Analysis

  • Bivariate analysis is a statistical method used to examine the relationship between two variables. It is used to determine if there is a significant association between the two variables and to identify the strength and direction of the relationship. In this step, we will perform bi-variate analysis using various visualization techniques, such as box plots, scatter plots, bar plots, etc. Let’s explore what Fare distribution looks like for each passenger class.

[Figure: box plot of Fare for each passenger class]
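A box plot of Fare per passenger class can be sketched with Seaborn's `boxplot`, as below. The helper name `plot_fare_by_class` is hypothetical.

```python
import matplotlib.pyplot as plt
import seaborn as sns


def plot_fare_by_class(df, ax=None):
    """Draw one box of Fare values for each passenger class."""
    if ax is None:
        _, ax = plt.subplots(figsize=(6, 4))
    sns.boxplot(data=df, x="Pclass", y="Fare", ax=ax)
    ax.set_title("Fare by passenger class")
    return ax


# Example, given a DataFrame `df` with the Titanic data:
# plot_fare_by_class(df)
# plt.show()
```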

  • We can infer from the above diagram that class-1 has the highest fares and class-3 has the lowest. From the univariate analysis, we also observed that the highest number of passengers belonged to class-3. This suggests that class-1 passengers were mostly wealthy, upper-class travelers. Now, let’s explore the distribution of surviving passengers based on each passenger class.

[Figure: distribution of surviving passengers by passenger class]

  • As we can see, in class-1 the number of passengers who survived is higher than the number who did not. This suggests that passengers from class-1 and class-2 were given priority in the rescue over passengers from class-3. Now, let’s explore how gender plays a role in the survival of passengers.

[Figure: role of gender in the survival of passengers]
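The two survival comparisons above (by passenger class and by gender) follow the same pattern: a count plot split by Survived. The sketch below draws such a plot and also computes the survival rate per category, which makes the comparison explicit. The helper name `plot_survival_by` is an assumption.

```python
import matplotlib.pyplot as plt
import seaborn as sns


def plot_survival_by(df, by, ax=None):
    """Count plot of `by` split by Survived.

    Returns the axes and the survival rate (in percent) per category.
    """
    if ax is None:
        _, ax = plt.subplots(figsize=(6, 4))
    sns.countplot(data=df, x=by, hue="Survived", ax=ax)
    ax.set_title(f"Survival by {by}")
    # Survived is coded 0/1, so its mean is the share of survivors.
    rates = df.groupby(by)["Survived"].mean().mul(100).round(1)
    return ax, rates


# Example, given a DataFrame `df` with the Titanic data:
# _, rates_by_class = plot_survival_by(df, "Pclass")
# _, rates_by_sex = plot_survival_by(df, "Sex")
# plt.show()
```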

  • The above figure suggests that females were given higher priority in the rescue, which seems intuitive as well. Now, let’s explore the scatter plot between Age and Fare.

[Figure: scatter plot between Age and Fare]
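A scatter plot of this kind can be sketched with Seaborn's `scatterplot`. Coloring the points by Survived is an assumption here, made because the insights drawn from the figure concern survival. The helper name `plot_age_vs_fare` is hypothetical.

```python
import matplotlib.pyplot as plt
import seaborn as sns


def plot_age_vs_fare(df, ax=None):
    """Scatter plot of Age against Fare, colored by survival status."""
    if ax is None:
        _, ax = plt.subplots(figsize=(8, 5))
    # Rows with a missing Age or Fare are not drawn.
    sns.scatterplot(data=df, x="Age", y="Fare", hue="Survived", ax=ax)
    ax.set_title("Age vs. Fare by survival status")
    return ax


# Example, given a DataFrame `df` with the Titanic data:
# plot_age_vs_fare(df)
# plt.show()
```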

  • Insights from the above figure -
    • Children (aged 0 to 8) had a higher chance of survival, as children and females are generally given priority in a rescue.
    • Among adults, the chances of survival increase with the fare: passengers who paid a higher fare were more likely to survive.
  • In the next step, let’s explore the dataset by grouping it.

Grouping Data

  • groupby is a pandas method that splits the data by one or more categorical attributes, which helps us understand their effect on other variables. Let’s explore the average age and fare for male and female passengers based on their survival status.

[Figure: average age and fare grouped by gender and survival status]
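A grouped table like this can be computed with `groupby` on two columns followed by `mean`. The sketch below is a minimal version; the helper name is hypothetical.

```python
import pandas as pd


def mean_age_fare_by_sex_and_survival(df):
    """Average Age and Fare for every (Sex, Survived) combination."""
    # mean() skips missing values, so passengers without a recorded Age
    # still contribute to the average Fare.
    return df.groupby(["Sex", "Survived"])[["Age", "Fare"]].mean()


# Example, given a DataFrame `df` with the Titanic data:
# print(mean_age_fare_by_sex_and_survival(df))
```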

  • Insights from the above table are -
    • Surviving female passengers are older on average than females who did not survive, whereas surviving male passengers are younger on average than males who did not survive.
    • The average fare is considerably higher for surviving passengers, irrespective of their gender.
  • Let’s explore the average fare based on each passenger class. As you can see in the results below, the average fare is highest in class-1 and lowest in class-3. This is consistent with our findings from the bi-variate analysis.

[Figure: average fare for each passenger class]
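The per-class fare comparison can be sketched with a single-column `groupby`. Reporting the median and the passenger count next to the mean is an addition made here for context, not something the figure above is known to include; the helper name `fare_by_class` is hypothetical.

```python
import pandas as pd


def fare_by_class(df):
    """Mean, median and number of recorded fares for each passenger class."""
    return df.groupby("Pclass")["Fare"].agg(["mean", "median", "count"])


# Example, given a DataFrame `df` with the Titanic data:
# print(fare_by_class(df))
```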

Correlation

  • Correlation is a statistical measure that describes the strength and direction of the relationship between two variables. Its value ranges from -1 (perfect negative relationship) to +1 (perfect positive relationship), with values near 0 indicating little or no linear relationship. We will use a heat map to visualize the correlation between variables.

[Figure: heat map visualizing the correlation between variables]
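One way to build such a heat map is to compute the correlation matrix of the numerical columns with pandas and pass it to Seaborn's `heatmap`. The sketch below assumes Pearson correlation (the pandas default) and a `coolwarm` color map; the original figure may have used different settings. The helper name `plot_correlation_heatmap` is hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_correlation_heatmap(df, ax=None):
    """Heat map of pairwise Pearson correlations between numerical columns.

    Returns the axes and the correlation matrix.
    """
    # Keep numerical columns only; text columns such as Name are skipped.
    corr = df.select_dtypes(include="number").corr()
    if ax is None:
        _, ax = plt.subplots(figsize=(8, 6))
    sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm",
                vmin=-1, vmax=1, ax=ax)
    ax.set_title("Correlation between numerical features")
    return ax, corr


# Example, given a DataFrame `df` with the Titanic data:
# _, corr = plot_correlation_heatmap(df)
# print(corr.loc["Pclass", "Fare"])
# plt.show()
```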

  • Insights from the above correlation heat map are -
    • Most of the variables have little to no linear correlation with each other.
    • The passenger class has a moderate negative correlation with the fare, which is consistent with our previous insights. Recall that class-1 had the highest fare and class-3 had the lowest fare.

Conclusion

  • We loaded the Titanic dataset containing details of passengers along with their survival status.
  • We performed exploratory data analysis (EDA) on this dataset using various methods, such as descriptive statistics, univariate and bivariate visualization, grouping the data, and correlation analysis.