Python Pandas - Sparse Data
Overview
Sparse data is data in which most elements are equal to a single value. pandas offers specific data structures and methods for storing and processing such data more efficiently, and NumPy functions can perform various math and comparison operations on it. In recent pandas versions, the SparseSeries and SparseDataFrame subclasses were deprecated and then removed, and many approaches to working with sparse data have changed. This has to be taken into account when migrating code from older to newer pandas versions.
Scope
This article discusses the following:
- What sparse data is.
- Why use it.
- Which basic structures, subclasses, and methods are related to it.
- How to create sparse data.
- How to work with it.
This article outlines both previous and current approaches to creating and using sparse data and how to migrate the code from the older pandas versions. Some practical examples of using sparse data are demonstrated.
Introduction to Sparse Data in Pandas
Sparse data has more than half of its elements equal to a certain value, known as the fill value. This value usually represents missing data, such as NaN, None, or zero. However, it does not have to be a missing value: it can be any value that predominates in the data and hence carries little useful information.
When stored in a regular structure, sparse data often takes a significant amount of memory and processing time, because every repeated fill value is kept explicitly. The opposite of sparse data is called dense data.
To store sparse data more efficiently and process it faster, pandas offers specific data structures that compress the data where its elements match a predefined fill value. In other words, those values are omitted from the data representation and are not stored in memory.
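The memory saving can be checked directly. The following is a minimal sketch, assuming a hypothetical array of 10,000 floats of which only 100 are not NaN; the variable names and sizes are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 10,000 values, only every 100th one is not NaN.
values = np.full(10_000, np.nan)
values[::100] = 1.0

dense = pd.Series(values)
# With float data, the fill value of a SparseArray defaults to NaN.
sparse = pd.Series(pd.arrays.SparseArray(values))

# Bytes used by the values alone (the index is excluded).
dense_bytes = dense.memory_usage(index=False)
sparse_bytes = sparse.memory_usage(index=False)
print(dense_bytes, sparse_bytes)
```

The sparse series stores only the 100 meaningful values and their positions, so its footprint should be a small fraction of the dense one.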
Pandas DataFrame.to_sparse
This method was used in older pandas versions to convert regular DataFrame objects into SparseDataFrame objects. It was deprecated in version 0.25.0 and removed in version 1.0.0.
Syntax: DataFrame.to_sparse(fill_value=None, kind='block')
Parameters:
- fill_value: the value to omit from the sparse representation. A float (which technically also includes NaN); None by default.
- kind: how the locations of values that differ from the fill value are tracked. Either 'block' (the default, which tracks only the locations and sizes of contiguous blocks of data) or 'integer' (which tracks the location of every value).
Return value:
An object of the SparseDataFrame subclass, which is a sparse representation of the original dataframe.
SparseArray in Pandas
A SparseArray is a 1-dimensional array object that stores sparse data more efficiently; its dtype is an instance of the SparseDtype class. It is currently the basic structure for working with sparse data in pandas. Essentially, it stores only the meaningful values and omits the specified predominant value. A SparseArray object can be passed to the pd.Series() or pd.DataFrame() constructors to create a sparse series or data frame from an array of sparse data.
To convert a regular array with sparse data to a sparse array, we pass this array to the pd.arrays.SparseArray() constructor. For the opposite operation, we pass the sparse array to the numpy.asarray() function.
Sparse Dtypes
A SparseArray has a dtype that provides two pieces of information about the object:
- The data type of the meaningful values
- The fill value
For example, a sparse array of floats whose fill value is NaN has the dtype Sparse[float64, nan].
It is possible to create a SparseDtype by providing only a data type. In this case, pandas automatically assigns the default fill value that is typically used as a missing value for that data type (NaN for floats, 0 for integers). Alternatively, we can override this default and explicitly pass any other fill value to the constructor.
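Both ways of building a SparseDtype can be sketched as follows. The variable names are illustrative, and the custom fill value of 0.0 is just an assumption chosen for the example.

```python
import numpy as np
import pandas as pd

# Only the data type is given: pandas picks the default fill value.
default_float = pd.SparseDtype(np.dtype("float64"))
default_int = pd.SparseDtype(np.dtype("int64"))

# The fill value is passed explicitly and overrides the default.
custom_float = pd.SparseDtype(np.dtype("float64"), fill_value=0.0)

# Inspect the fill value carried by each dtype.
print(default_float.fill_value)
print(default_int.fill_value)
print(custom_float.fill_value)
```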
Sparse Accessor
The pandas .sparse accessor gives access to methods and attributes that are specific to sparse dtypes and can be used on a series or data frame holding sparse values. This accessor works similarly to the .str and .dt accessors on string and datetime data, respectively. It is called on the object itself, followed by the attribute or method name, for example s.sparse.density.
Sparse Accessor Applications
- To find the number of meaningful values, i.e., those that are not equal to the fill value (only for a sparse series)
- To calculate the density of the object defined as the number of meaningful values divided by the total number of elements
- To create a scipy sparse matrix from a sparse object or vice versa (see also the Interaction with scipy.sparse section)
- To display the fill value (only for a sparse series)
- To convert a data frame with sparse values to a dense form
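As a small illustration of the accessor on a data frame, the sketch below computes the density of a hypothetical 4x2 frame in which zero is the fill value; the column names and values are made up for the example.

```python
import pandas as pd

# Hypothetical frame: 8 elements, 2 of them differ from the fill value 0.
sdf = pd.DataFrame({
    "a": pd.arrays.SparseArray([0, 0, 1, 0], fill_value=0),
    "b": pd.arrays.SparseArray([0, 2, 0, 0], fill_value=0),
})

# Share of meaningful values among all elements: 2 of 8.
print(sdf.sparse.density)  # 0.25
```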
Interaction with scipy.sparse
Some dedicated attributes and methods of the pandas sparse accessor allow the creation of sparse data frames and series from scipy sparse matrices (scipy.sparse). The opposite is also possible: a sparse data frame or series can be converted into a sparse matrix. Various machine learning libraries, including sklearn, accept sparse matrices as input for building machine learning models.
We will see an example of creating a sparse data frame from a scipy sparse matrix in the Examples section.
What is Sparse Calculation in Pandas?
We can apply numerous NumPy universal functions, also known as ufuncs, to a SparseArray to perform different math and comparison operations on it. Such functions operate on an object (whether a sparse or a regular ndarray) in an element-wise manner, taking a fixed number of inputs and returning a fixed number of outputs. Using ufuncs on a SparseArray produces another SparseArray or a single float value as a result.
A ufunc is also applied to the fill value, so that the dense form of the result is correct. Some examples of NumPy ufuncs are np.abs(), np.square(), np.sqrt() (if applicable), np.isnan(), np.reciprocal(), and np.sign().
Sparse Subclasses
There are two sparse subclasses in pandas: SparseSeries and SparseDataFrame. They were used in older versions of pandas for working with sparse data as constructors for sparse series and data frames from array-like structures and matrices. In the objects of the SparseDataFrame subclass, all columns have to be sparse.
Both subclasses were deprecated in version 0.25.0 and removed in version 1.0.0. They are now replaced by regular series or dataframe objects with sparse values, with no loss in performance or memory efficiency.
Migrating
Since the SparseSeries and SparseDataFrame pandas subclasses are not in use anymore, the old code needs to be migrated properly from old to new pandas versions. Let us outline the essential differences between the previous and current approaches.
1. Creating a sparse series or dataframe from an array
- Previous approach: using the pd.SparseSeries() or pd.SparseDataFrame() constructors and passing a regular array
- Current approach: using the regular pd.Series() or pd.DataFrame() constructors and passing a SparseArray (pd.arrays.SparseArray())
2. Creating a sparse dataframe from a scipy matrix
- Previous approach: using the pd.SparseDataFrame() constructor
- Current approach: using the from_spmatrix() method of the sparse accessor, called on the class itself (pd.DataFrame.sparse.from_spmatrix())
3. Converting a regular series to a sparse version
- Previous approach: using the Series.to_sparse() function
- Current approach: creating a SparseArray from the series (using pd.arrays.SparseArray()) and passing the result to the pd.Series() constructor
4. Converting a regular dataframe to a sparse version
- Previous approach: using the DataFrame.to_sparse() function
- Current approach: using the DataFrame.astype() method and passing in a pd.SparseDtype object
5. Converting a sparse series or dataframe to a dense version
- Previous approach: calling the to_dense() method directly on the object
- Current approach: calling the to_dense() method on the sparse accessor of the object (e.g., df.sparse.to_dense())
6. Checking sparse-specific properties of a sparse series or dataframe
- Previous approach: calling the property directly on the object (e.g., df.density)
- Current approach: calling the property directly on the sparse accessor of the object (e.g., df.sparse.density)
7. SparseDataFrame vs. DataFrame: assigning a new column with sparse data
- Previous approach: The new column automatically becomes sparse
- Current approach: It is necessary to explicitly convert the new column to a sparse data type
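Point 7 is easy to overlook in migrated code, so here is a minimal sketch of the current behavior. The frame and its column names are hypothetical.

```python
import pandas as pd

# A data frame whose only column is sparse.
df = pd.DataFrame({"a": pd.arrays.SparseArray([0, 1, 0, 0])})

# A plain list is stored as a regular dense column.
df["b"] = [0, 0, 2, 0]

# Wrapping the values in a SparseArray keeps the new column sparse.
df["c"] = pd.arrays.SparseArray([0, 0, 0, 3])

print(df.dtypes)
```

Only columns "a" and "c" should report a sparse dtype; column "b" stays dense until it is converted explicitly.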
Examples
1. Converting a regular dataframe with sparse data to a sparse dataframe
- Old style:
Code:
Output:
- New style:
Code:
Output:
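A minimal sketch of the new-style conversion, assuming a small hypothetical data frame in which NaN predominates (column names and values are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical dense frame: 8 elements, 6 of them NaN.
df = pd.DataFrame({
    "x": [1.0, np.nan, np.nan, np.nan],
    "y": [np.nan, np.nan, 2.0, np.nan],
})

# astype() with a SparseDtype converts every column to a sparse one.
sdf = df.astype(pd.SparseDtype("float64", fill_value=np.nan))
print(sdf.dtypes)
```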
2. Converting a sparse dataframe to the standard dense form
Code:
Output:
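A minimal sketch of the conversion back to dense form, using the to_dense() method of the sparse accessor on a hypothetical sparse frame:

```python
import pandas as pd

# Hypothetical sparse frame with 0 as the fill value.
sdf = pd.DataFrame({
    "a": pd.arrays.SparseArray([0, 0, 1, 0], fill_value=0),
    "b": pd.arrays.SparseArray([0, 2, 0, 0], fill_value=0),
})

# The result holds the same values in regular (dense) columns.
dense_df = sdf.sparse.to_dense()
print(dense_df.dtypes)
```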
3. Finding the number of meaningful values and the fill value of a sparse series
Code:
Output:
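A minimal sketch with a hypothetical sparse series of six integers, two of which differ from the fill value 0:

```python
import pandas as pd

s = pd.Series(pd.arrays.SparseArray([0, 0, 5, 0, 7, 0], fill_value=0))

# Number of stored values, i.e. those not equal to the fill value.
print(s.sparse.npoints)     # 2

# The value that is omitted from storage.
print(s.sparse.fill_value)  # 0
```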
4. Converting a regular array to a sparse array and vice versa
Code:
Output:
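A minimal sketch of the round trip between a regular NumPy array and a SparseArray; the input values are hypothetical:

```python
import numpy as np
import pandas as pd

arr = np.array([0, 0, 3, 0, 0, 4])

# Regular array -> sparse array, omitting the zeros from storage.
sparse_arr = pd.arrays.SparseArray(arr, fill_value=0)
print(sparse_arr)

# Sparse array -> regular NumPy array.
dense_again = np.asarray(sparse_arr)
print(dense_again)  # [0 0 3 0 0 4]
```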
5. Creating a sparse dataframe from a scipy matrix
Code:
Output:
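A minimal sketch, assuming scipy is installed. The 4x3 matrix and the column labels are hypothetical; the conversion back with to_coo() is included only to show the opposite direction.

```python
import pandas as pd
from scipy import sparse

# Hypothetical 4x3 matrix with three non-zero entries.
mat = sparse.coo_matrix(
    ([1.0, 2.0, 3.0], ([0, 1, 3], [0, 2, 1])),
    shape=(4, 3),
)

# scipy sparse matrix -> data frame with sparse columns.
sdf = pd.DataFrame.sparse.from_spmatrix(mat, columns=["a", "b", "c"])
print(sdf.dtypes)

# Data frame with sparse columns -> scipy COO matrix.
back = sdf.sparse.to_coo()
print(back.shape)  # (4, 3)
```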
6. Applying NumPy ufuncs to a SparseArray
Code:
Output:
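A minimal sketch of applying np.abs() to two hypothetical sparse arrays. The second array uses -1.0 as its fill value to show that the ufunc is applied to the fill value as well.

```python
import numpy as np
import pandas as pd

# Fill value defaults to NaN for float data.
arr = pd.arrays.SparseArray([1.0, np.nan, np.nan, -2.0, np.nan])
abs_arr = np.abs(arr)  # the result is again a SparseArray
print(abs_arr)

# Here -1.0 is the fill value; np.abs() transforms it too.
arr_neg = pd.arrays.SparseArray([1.0, -1.0, -1.0, -2.0, -1.0], fill_value=-1.0)
abs_neg = np.abs(arr_neg)
print(abs_neg.fill_value)
print(np.asarray(abs_neg))
```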
Conclusion
- Sparse data has more than half of its elements equal to a certain value.
- pandas offers specific data structures that compress sparse data to store and process it more efficiently.
- A SparseArray is the basic structure used for working with sparse data.
- The sparse accessor of pandas is used to access different "sparse-dtype" specific methods and attributes of a sparse object, such as finding its density, converting it to a dense form, or creating a scipy sparse matrix from it.
- NumPy universal functions can be applied to a SparseArray to perform different math and comparison operations on it. They also work on the fill values.
- The SparseSeries and SparseDataFrame subclasses were deprecated in version 0.25.0 and removed from pandas in version 1.0.0.
- In the latest pandas versions, many approaches to working with sparse data have been changed. This has to be considered when migrating the code from previous to new versions.