What are Different Types of Dataset Formats Generally Used?

Challenge Inside! : Find out where you stand! Try quiz, solve problems & win rewards!

Learn via video courses

Overview

There are different types of Datasets in pandas that are supported by pandas in Python. There are various methods to work with files of different formats.

Scope

In this article, we will cover:

  • Different file formats of Dataset in pandas like CSV, XLSX, ZIP, TXT, HTML, and PDF file formats.
  • Different methods to work with different file formats for CSV files. There is .read_csv() functions and many more.

Introduction

As all file formats are not the same, some files are in CSV format. Some are Excel file datasets, etc. To work with the different file formats of the pandas have various methods like for CSV file .read_csv, for excel file .read_excel etc. Let's see some of the file formats.

Some of the File Formats Used in Pandas

There are Datasets in pandas with different formats when we are working with the data in python, which is generally used. Let's discuss some of them.

CSV

CSV stands for Comma-separated values. Each line in a CSV file represents a data record, and each record has one or more data fields, which are separated by commas. Here CSV dataset is loaded from the GitHub link

Code#1:

Output:

Explanation:

Pandas is imported as pd and to load the CSV data .read_csv() function of the pandas is used.

Code#2:

Output:

Explanation: Here the CSV Dataset in pandas is loaded from the device using the .read_csv() function. The .head() function is used to get the data from the start. By default, it takes the first five rows of the dataset.

XLSX

The XLSX file is a spreadsheet in the Open XML Format of Microsoft Excel. This is used to store any kind of data, although it's primarily used to develop mathematical models and store financial data, among other things.

The Excel file looks like this:

example-xlsx-file

Code#3:

Output:

Explanation:

In the above code example, the Excel file is loaded by using the .read_excel() function of the pandas, and .head() is used to get the starting rows of the data.

ZIP

Related files are bundled together to make them smaller and more compact so that they can be shared more quickly and easily online. Since they take up less space on your computer, zip files are great for archiving. Additionally, they help encrypt data and protect it from theft.

Code#4:

Output:

Explanation: We can load the zip file by using the .read_csv() function of the data. Here encoding='ISO-8859-1' is used to avoid the bit error.

TXT

TXT files can be used to store information in plain text without any further formatting other than the most basic fonts and font styles. Any text editing program, as well as other software, that can recognize it.

Initially, the Text file looks like this:

example-txt-file

Code#5:

Output:

Explanation:

In the above code example, the text file is loaded using .read_csv() function.

Code#6:

Output:

Explanation: In the above code example, the text file is loaded using .read_csv(), and sep=" " is used to separate the columns by space.

Code#7:

Output:

Explanation: In the above examples, the top row is considered the header. When we set the header as None automatic numeric values are assigned to the header.

Code#8:

Output:

Explanation:

Here, Team1 and Team2 is assigned to the columns as column names using the names attribute.

Code#9:

Output:

Explanation:

We can also read tables in a text file using the pd.read_table() function.

Code#10:

Output:

Explanation:

In the pd.read_table() function, delimeter=" " is used to separate the columns. Here we are using a delimiter as space to separate the columns by spaces.

There is another method i.e. pd.read_fwf(), to read a table of fixed-width formatted lines into a DataFrame. Let's see how this function works with the help of an example.

Code#11:

Output:

HTML

Web pages are created using HTML, which stands for Hyper Text Markup Language. With the help of the .read_html() method, we can read HTML tables in Python.

Code#12:

Output:

Explanation: In the above code example, the HTML file is loaded using the .read_html() function.

PDF

Portable Document Format is referred to as PDF. It has the extension .pdf. It can reliably exhibit and exchange documents regardless of the operating system, hardware, or applications used. The International Organization for Standardization (ISO) maintains the open standard PDF, which was created by Adobe (ISO). Links, buttons, form fields, music, video, and business logic can all be found in PDF documents.

First, we need to install the Tabula library to work with PDF files using the following command.

Code#13:

Output:

Explanation: To work with the pdf format Dataset in pandas, python has the tabula.read_pdf() function. Pass the path inside the function to load the pdf file.

At a time, it can only read one page, but as a pdf consists of more than one page, so for that, use the following code.

Code#14:

To read all the pages of the file pages='all' is used.

There are other ways in Python to read and write PDF files, such as using the PyPDF2 library, etc.

Conclusion

  • There are Datasets in pandas of different file formats like some CSV, HTML, XLSX, etc.
  • To work with CCS file .read_csv() function of pandas is used.
  • For Excel or XLSX file format .read_excel() and for HTML file format .read_html() functions of pandas are used.
  • For the pdf file, we need to install the tabula-py package.