Detecting Parkinson’s Disease

Overview

Parkinson’s disease is a progressive disorder of the central nervous system that affects movement, causing tremors and stiffness. It is commonly described in five stages and is reported to affect more than 10 lakh (one million) individuals annually in India. It is a chronic disease with no known cure. In this article, we will build a machine learning (ML) model to detect Parkinson’s disease from a patient’s diagnostic results.

What are We Building?

In this project, we will use the UCI ML Parkinsons dataset, which is available from the UCI Machine Learning Repository. It consists of 195 records and 24 columns. The status column indicates whether the patient has Parkinson’s disease (1) or not (0). Our objective is to build a machine learning-based solution that detects Parkinson’s disease in a patient.

Pre-requisites

  • Python
  • Data Visualization
  • Descriptive Statistics
  • Machine Learning
  • Data Cleaning and Preprocessing

How are We Going to Build This?

  • We will perform exploratory data analysis (EDA) using various visualization techniques to identify underlying patterns and correlations.
  • Next, we will train Decision Tree, Random Forest, and XGBoost models and compare their performance.

Requirements

We will use the following libraries and tools in this project:

  • pandas
  • NumPy
  • Matplotlib
  • Seaborn
  • scikit-learn (sklearn)
  • XGBoost

Building the Parkinson’s Disease Detector

Import Libraries

Let’s start the project by importing all necessary libraries to load the dataset and perform EDA on it.
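The original import cell did not survive on this page. A minimal sketch of what it might look like, assuming the conventional aliases for the libraries listed under Requirements:

```python
# Core data handling
import numpy as np
import pandas as pd

# Plotting libraries used in the EDA steps
import matplotlib.pyplot as plt
import seaborn as sns

# Printing the installed versions helps when reproducing results later
print("numpy", np.__version__)
print("pandas", pd.__version__)
print("seaborn", sns.__version__)
```

The model classes from scikit-learn and XGBoost can be imported later, at the point where they are first needed.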

Data Understanding

  • Let’s load the dataset in a pandas dataframe and explore variables and their data types.
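The loading step could look roughly like the sketch below. The helper name load_parkinsons and the file name parkinsons.data are assumptions, and the three rows are made-up stand-ins (reusing a few real column names) so that the example runs without the downloaded file.

```python
import io

import pandas as pd


def load_parkinsons(source):
    """Read the comma-separated Parkinsons file from a path or file-like object."""
    return pd.read_csv(source)


# Made-up rows, not real patient data; only the column names come from the dataset
sample_csv = io.StringIO(
    "name,MDVP:Jitter(%),MDVP:Shimmer,HNR,status\n"
    "subject_a,0.0078,0.0437,21.0,1\n"
    "subject_b,0.0097,0.0613,19.1,1\n"
    "subject_c,0.0018,0.0163,26.8,0\n"
)

# With the real download this would be: df = load_parkinsons("parkinsons.data")
df = load_parkinsons(sample_csv)

print(df.head())   # first rows
print(df.shape)    # (rows, columns)
df.info()          # column names, dtypes, and non-null counts
```

On the full dataset, df.shape should report the 195 records and 24 columns mentioned earlier.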

[Figure: the dataset loaded in a pandas DataFrame]

[Figure: variables and their data types]

  • As we can see, this dataset has 24 columns, and all of them are numeric except name. None of the columns contains NULL values. Let’s look at the summary statistics of the features in this dataset.
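The checks described above can be sketched as follows. The DataFrame here is a made-up stand-in that reuses a few of the real column names; on the real data the same calls would run on the loaded DataFrame.

```python
import pandas as pd

# Made-up rows, not real patient data
toy = pd.DataFrame({
    "name": ["subject_a", "subject_b", "subject_c", "subject_d"],
    "MDVP:Jitter(%)": [0.008, 0.010, 0.002, 0.004],
    "HNR": [21.0, 19.0, 27.0, 25.0],
    "status": [1, 1, 0, 1],
})

# Number of missing values per column
null_counts = toy.isnull().sum()
print(null_counts)

# Columns with a numeric dtype (everything except name in this toy frame)
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)

# Summary statistics, transposed so that each feature is one row
summary = toy.describe().T
print(summary[["mean", "std", "min", "max"]])
```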

[Figure: summary statistics of the dataset]

Exploratory Data Analysis

  • Let’s explore the distribution of the target variable. In this dataset, about 75% of the records belong to patients with Parkinson’s disease.
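One way to compute this class balance is with value_counts. The labels below are made up to mirror the roughly 3:1 split described above; they are not taken from the dataset.

```python
import pandas as pd

# Made-up labels: 1 = Parkinson's, 0 = healthy
status = pd.Series([1, 1, 1, 0, 1, 0, 1, 1], name="status")

counts = status.value_counts()                 # absolute count per class
shares = status.value_counts(normalize=True)   # fraction per class

print(counts)
print(shares.round(2))
```

With this imbalance, a model that always predicts the positive class would already reach about 75% accuracy, which is one reason to report the F1 score alongside accuracy.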

[Figure: distribution of the target variable]

  • As most of the features in this dataset are continuous and numeric, let’s examine its correlation matrix.
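A correlation heatmap along these lines can be sketched with pandas and Seaborn. The data is randomly generated so that the example runs on its own; only the three column names are borrowed from the dataset, and the built-in correlations are an assumption of this toy setup, not a result.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in: one base signal plus two noisy columns derived from it
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "MDVP:Jitter(%)": base,
    "MDVP:Shimmer": base + 0.3 * rng.normal(size=200),   # positively related
    "HNR": -base + 0.3 * rng.normal(size=200),           # negatively related
})

# Pairwise Pearson correlations between the numeric columns
corr = toy.corr()

fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation matrix")
fig.tight_layout()
# In an interactive session, plt.show() displays the figure
```

On the real DataFrame, the non-numeric name column would need to be dropped before calling corr().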

[Figure: correlation matrix heatmap]

  • As the heatmap above shows, MDVP:Jitter(%) and MDVP:Shimmer have a strong positive correlation, while MDVP:Jitter(%) and HNR have a strong negative correlation. Let’s draw scatter plots to visualize these relationships.
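The two scatter plots could be drawn side by side as in this sketch. It uses plain Matplotlib and randomly generated stand-in arrays (an assumption made so the example is self-contained); with the real data, the arrays would be the corresponding DataFrame columns.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for the three features
rng = np.random.default_rng(1)
jitter = rng.normal(size=150)
shimmer = jitter + 0.3 * rng.normal(size=150)   # positively related to jitter
hnr = -jitter + 0.3 * rng.normal(size=150)      # negatively related to jitter

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, other, label in zip(axes, [shimmer, hnr], ["MDVP:Shimmer", "HNR"]):
    r = np.corrcoef(jitter, other)[0, 1]        # Pearson correlation coefficient
    ax.scatter(jitter, other, s=12, alpha=0.7)
    ax.set_xlabel("MDVP:Jitter(%)")
    ax.set_ylabel(label)
    ax.set_title(f"Pearson r = {r:.2f}")
fig.tight_layout()
```

An upward-sloping cloud corresponds to a positive correlation and a downward-sloping one to a negative correlation.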

[Figures: scatter plots of MDVP:Jitter(%) against MDVP:Shimmer and against HNR]

Data Preprocessing

  • Before developing the ML models, let’s standardize each feature in the dataset. We will use the StandardScaler class provided by the sklearn library in Python. Further, we will split the input data into training and testing data with an 80:20 ratio.
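A minimal sketch of this step follows, on a random stand-in matrix with the same shape as the real feature set (195 rows and 22 features once name and status are removed). Note one deliberate difference from the order described above: the sketch splits first and fits StandardScaler on the training portion only, so that no test-set statistics leak into the scaler. The stratify and random_state settings are also assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random stand-in data, not the real measurements
rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=2.0, size=(195, 22))
y = (rng.random(195) < 0.75).astype(int)   # roughly 75% positives

# 80:20 split; stratify keeps the class ratio similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Learn mean and standard deviation from the training data only,
# then apply the same transformation to the test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```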

Developing the ML Models

  • Let’s first train a simple decision tree classifier and evaluate its performance. In this project, we will use accuracy and F1 score to compare and evaluate the performance of the ML models.
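Training and scoring a decision tree might look like the sketch below. It runs on synthetic data from make_classification, shaped like the real dataset, so its scores will not match the figures reported in this article. The hyperparameters are scikit-learn defaults plus a fixed random_state, which is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 195 rows, 22 features, about 75% positives
X, y = make_classification(
    n_samples=195, n_features=22, n_informative=8,
    weights=[0.25, 0.75], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
f1_positive = f1_score(y_test, y_pred, pos_label=1)   # F1 for the positive class
print(f"accuracy={accuracy:.2f}  F1(positive)={f1_positive:.2f}")
print(classification_report(y_test, y_pred))
```

Scaling is omitted here for brevity; tree-based models split on thresholds, so standardizing the features does not change their predictions.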

[Figure: Decision Tree evaluation results]

  • As the figure above shows, the accuracy is 77% and the F1 score for the positive class is 0.85. These numbers leave room for improvement, so let’s build a Random Forest classifier and check whether accuracy and F1 score improve.
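A side-by-side comparison could be set up as in this sketch. The evaluate helper, the synthetic data, and the choice of 200 trees are all assumptions made for illustration; the printed scores come from the synthetic data and say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in shaped like the real data
X, y = make_classification(
    n_samples=195, n_features=22, n_informative=8,
    weights=[0.25, 0.75], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)


def evaluate(model):
    """Fit on the training split and return (accuracy, F1 of the positive class)."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return accuracy_score(y_test, y_pred), f1_score(y_test, y_pred, pos_label=1)


results = {
    "Decision Tree": evaluate(DecisionTreeClassifier(random_state=42)),
    "Random Forest": evaluate(RandomForestClassifier(n_estimators=200, random_state=42)),
}
for name, (acc, f1) in results.items():
    print(f"{name}: accuracy={acc:.2f}  F1={f1:.2f}")
```

A random forest averages many decision trees trained on resampled data, which usually reduces the variance of a single tree.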

[Figure: Random Forest evaluation results]

  • With Random Forest, we get an accuracy of 92% and an F1 score of 0.95, a substantial improvement over the simple Decision Tree classifier. Let’s see whether the XGBoost model improves on this.

[Figure: XGBoost evaluation results]

  • XGBoost did not improve on the Random Forest results. We can therefore conclude that, of the three models, Random Forest gives the best accuracy and F1 score for this detector.

What’s Next

  • You can check whether the features contain outliers. You can also explore the correlation between the target variable and the independent variables.
  • You can perform a grid search-based hyperparameter tuning to come up with the best hyperparameter combination for a given model.
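As a starting point for the tuning suggestion above, here is a small sketch using scikit-learn's GridSearchCV with a Random Forest. The parameter grid, the 5-fold cross-validation, the F1 scoring choice, and the synthetic data are all assumptions chosen to keep the example quick to run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in shaped like the real data
X, y = make_classification(
    n_samples=195, n_features=22, n_informative=8,
    weights=[0.25, 0.75], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Deliberately tiny grid; a real search would cover more values
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1",   # optimize F1 of the positive class
    cv=5,           # 5-fold cross-validation on the training split
)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print(f"best cross-validated F1: {search.best_score_:.2f}")
print(f"F1 on the held-out test split: {search.score(X_test, y_test):.2f}")
```

Because the search only sees the training split, the test split remains available for an unbiased final check.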

Conclusion

  • We examined the UCI ML Parkinsons dataset by applying various statistical and visualization techniques.
  • We trained three ML models: Decision Tree, Random Forest, and XGBoost. Random Forest performed best, and XGBoost gave no further improvement in accuracy.