Read Excel File in Python Pandas
Learn via video courses
Overview
In this article, we are going to learn how to "work with an Excel sheet using Pandas". We will cover the pandas read Excel function and its parameters used for reading the data of the Excel file and also various ways to read the data of the Excel file.
Introduction
In Python, we can work with the data in the Excel sheet with the help of the Pandas module. There is a function called the pandas read_excel function for reading the Excel file. There are lots of parameters for this function, like "io", "sheet_name", "dtype", etc., for reading the data in different ways. We can also get a specific part of the data using pandas read_excel function parameters.
Getting Started with Excel Files in Pandas
First, to work with Excel files in Pandas, it is necessary to import the pandas module by running the following command :
Understanding the Pandas' read_excel Function
pandas.read excel() is pandas read_excel function which is used to read the excel sheets with extensions (.xlsx, .xls, .xlsx, .xlsm, .xlsb, .odf, .ods and .odt) into pandas DataFrame object.
A "Pandas DataFrame object" is returned by reading a single sheet while reading two sheets results in a Dict of DataFrame.
We can also load Excel files from a URL or which are stored in the local filesystem.
It supports http, ftp, s3, and file for URLs.
The Pandas Read Excel Function has various parameters.
Some of them are as follows:
- io :
This parameter describes the path to the file. The string path can be any valid string. There should be a host for file URLs.- can be str, bytes, ExcelFile, xlrd.Book, path object, or file-like object
- eg. :
-
sheet_name :
For sheet names, strings are used. In positions on sheets with index zero, integers are used (chart sheets do not count as a sheet position). Multiple sheets can be requested using lists of strings or integers. To receive all worksheets, specify none.- can be str(for sheet names), int(for position), list(for multiple sheets), or None, default 0.
- eg. :
-
header :
The column labels of the parsed DataFrame should be used in a row (0-indexed). The row positions of a MultiIndex will be created if a list of integers is passed. If no header is present, use None.- int, list of int, default 0.
-
usecols :
This parameter is used to read specific columns.- can be strings of columns, integers representing positions columns, or Excel-style columns like ("A:C").
-
dtype :
The data type name to use for a column or data- can be a dictionary using data types as values and columns as keys.
- eg. : "a" is np.float64 and "b" is np.int32
- it is by default None and can be list-like, or int.
-
nrows :
number of rows that need to be parsed.- int, default None
Specify Data Types in Pandas read_excel
When we are reading an Excel file, it is easy to specify the data type of the columns because of pandas.
The main three objectives served by this are :
- preventing the improper reading of data
- accelerating the reading process
- preserving memory
We can enter a dictionary where the data types are the values and the keys are the columns. This guarantees that the data are prepared correctly. Let's look at how to define the data types for our columns.
Code:
Output:
Explanation:
In the above example, module pandas are imported as pd. Then the read_excel function is used to read the Excel file and store the dataset into a data variable. dtype parameter is used to specify the data type of columns in the form of key-value pairs, i.e., a dictionary.
Pandas read_excel() Example
Let's see how to read Excel files using the pandas read_excel function by following these examples :
Here, I am using this sample Excel file from Github.
Importing an Excel file from a URL link.
Code:
Output:
Explanation:
In the above code, we just import the file URL link, read the file using the read_excel function, and print the data from the excel file.
Let's try to import a file from the local file system. So, Initially, my excel file looked like this :
Code:
Output:
Explanation:
In this code example, we are importing a file from the disk and reading it using the read_excel function.
Code:
Output:
Explanation:
Using the read_excel function, we read the data from the file. Then using the head function, by default, data of 5 rows from the start is printed.
Pandas read_excel() usecols Example
When we want to access some specific data from our dataset, i.e., some specific column or range of columns, then we usecols parameter of the read_excel function. It takes the column name, position, or range of columns (int, list, string, or callable default None) to specify which column we want to access.
Let's see the below examples:
Code:
Output:
Explanation:
Here, we tried to read the valies of some specific columns using usecols function. Columns in the range from A to B as index values i.e. first two columns are printed as output.
Code:
Output:
Explanation:
In this above code example column at index "C" is printed using the usecols parameter.
Code:
Output:
Explanation:
Here, we use integer indexes to the usecols parameter to specify the column range. As a result, intial two columns are printed.
Code:
Output:
Explanation:
In this above code example, we are using only a single integerlist for printing the column data at that specific index.
Code:
Output:
Explanation:
We can also use the name of the column as a string list to get the data of that column.
How to Read a Subset of Columns?
Using only the square brackets, you can select columns, rows, or a combination of columns and rows. See how this works now.
Selecting only column:
Code:
Output:
Explanation:
Here, in this example, only column data of the specified given column name is printed.
Code:
Output:
Explanation:
In this above code example, the head() function is used to read the starting values of the file. Usually, It takes the data of the starting first 5 five rows by default if nothing is passed in the head function.
Code:
Output:
Explanation :
we can also filter data, like in the above code example , we get only the marks in English that are greater than 86.
There is another approach for selecting columns in python using the .loc() function.
Code:
Output:
Explanation:
Using .loc we can access a collection of rows and columns by label(s) or a boolean array. In this above example, we use an integer index [1] to access the row at the 1st index.
How to Get the List of Columns Headers in Excel ?
There are various approaches to getting the list of column headers or column names.
-
Using list(data.columns) function.
Code:
Output:
Explanation:
Pandas is imported using import pandas as pd. Using data.columns inside the list, the list of the headers of the column is stored in the column_header variable.
-
Using list(data) function.
Code:
Output:
Explanation:
If we use list(data) instead of list(data.columns), we get the same results.
-
Using the coulmns property.
Code:
Output:
Explanation:
ravel returns the contiguous flattened array(1-D array of the same type of input elements). Using data.columns.ravel(), we can get the headers of the columns.
-
Using the data.keys() function.
Code:
Output:
Explanation:
Here, we use .values which returns an array, and .tolist() to get column names in the list.
How to Print a Column Data ?
To get the column data we specify the column name of which data we want and the .tolist() function to convert values into a list.
Code:
Output:
There is another way to print the column data is by using the column name inside "[]" brackets as :
Cod:
Output:
We can also use data.column_name to get the column data with the index. Let's access the data in the Maths column :
Code:
Output:
Reading Excel File without Header Row
Pass the header as None if the excel sheet does not have any header row.
Code:
Output:
If we pass the header as any integer, then that row is treated as a header row and values start to read from the next row onwards. Data before the header row will be discarded. Let's see the below example in which we pass the header as 2.
Code:
Output:
How to Skip Rows when Reading Excel Files in Pandas ?
The skiprows parameter is used to skip the rows from the excel file.
Code:
Output:
Explanation:
In the above code, the skiprows parameter is used to skip the first 3 rows of the dataset and the 4th row is considered as a header.
How to Read Multiple Sheets from an Excel File in Pandas?
We can read multiple sheets from an excel file using pandas. Let's see how it works.
Getting sheets from excel file from the system :
Code:
Output:
Explnation:
In this example, we imported multiple sheets of the excel file using the pd.read_excel function and specified the index of the sheets using sheet_name.
Let's see how to read the multiple sheets :
Code:
Output:
Explanation:
In the above code example, pd.read_excel is used to read the sheets and store them in the df_multi_sheet variable. We have two sheets in our dataset so using the index of the sheets data of both sheets can be read.
How to Read n Lines from an Excel File in Pandas ?
The nrows parameter is used to read the desired number of lines by the user.
Code:
Output:
Explanation :
Here, we are reading the excel file using the read_excel function and the nrows parameter for printing the desired number of rows. Like in this example, 4 rows are printed.
Conclusion
- For using pandas, we first need to import them using the import pandas as pd command.
- Pandas read_excel() function is used for reading the Excel files.
- Files can be imported using a URL link or can be directly imported from the disk.
- There are different parameters like io for path,sheet_name for specifying the name of the sheet, etc., for reading the file.
- We can also print specific columns using the usecols parameter.
- We can also read multiple sheets.
- We can also print subsets of the data.