Hashing in Data Structure
Learn via video course
Overview
Hashing is a technique or process of mapping keys, values into the hash table by using a hash function. It is done for faster access to elements. The efficiency of mapping depends on the efficiency of the hash function used.
Takeaways
-
Complexity of Hashing
- Time complexity - O()
- Space complexity - O()
Introduction to Hashing in Data Structure
Before going into Hashing in Data Structure let’s understand what is mapping
- Mapping is the way of linking 2 things together, referred to as “Key” and “Value”.
- Key – An Identifier to uniquely identify the Data(Entity).
- Value – The Actual Entity(and its details) which we want to store.
Let’s try to understand with an example of an example Entity- Student.
Each student can have different properties like his name, his address, etc. But to recognize him as a student, he must be given a unique identifier, like Roll Number. So Roll Number can be thought of as a way to identify the Student.
i.e. Key for a Student = Roll Number and Value for Student = Student details of Name, Address, Age, and so on.
This linking of all student details with a Roll Number can also be treated as the mapping of Student properties to a roll Number.
Now whenever someone wants to refer to a particular student, all he needs to know is his roll number.
Another advantage of mapping is that the name/age/address can be the same across multiple Students, but the Roll Num would be unique. So instead of referring to a Student with all his details, we can simply refer to him via his Roll Number.
This can be boiled down to the fact that mapping helps to link an easier/smaller “key” (here Roll Number) to a particular “value”(here Student’s properties).
With the above picture clear, it becomes easier to understand Hashing.
Hashing is a way where we can get an Integer value from any Key. This Key may or may not be an integer, but after hashing is performed, it will return an Integer value for any Key.
Hashing is required as the Key which was given in the input can not be used as the Memory location to place this key value.
So, Hashing performs a transformation on this Key, to return an Integer value which can be used as the memory address(or Index in the Data structure) to place this Key-Value at. We will see how this transformation looks, in later sections.
Please note that to make things easier to understand, we will assume we always have Keys as Integer values, and hashing will be done on these Integral Keys.
But to talk briefly on how hashing would work on non-integral keys, there are 2 steps involved. It first gets some integral value(say N) for that non-integral Key, and then performs hashing on N. This Integral number(N) can be retrieved from the Key in any way e.g. we can choose to take the first character of the key if the key = String, or if key itself is an Entity, e.g. Address, then take Pincode(which is numeric) as the Integer to run hashing on.
The basic point is that hashing works in 2 steps:
- It takes any Entity(as a Key) as input, and if that key is not an Integer, we need to tell it some way to get an Integer value(N) from that Entity
- Once we have got an initial Integral value (N), this Integer cant be used as the Memory address to place this Key-Value at. We will see why this won’t work, but hashing converts this Integer to another Integer value which can be used as the Memory Address/Index to place this key-Value at.
Let’s understand with 2 Examples:
Example 1: If Key=Roll Number and Value=Student, then Key = Integer itself, and hashing will run on this Roll Number to return the Location to place this Key-Value pair at.
Example 2: If Key=Address and Value=Student, and where Key is itself an Entity:
Step 1: We define a way to get some integer value from this Key. e.g. for Address, the pin code makes sense.
Step 2: Hashing takes Pin code as the Input, and will return the Memory Location to place this Key-Value pair at.
We will see in detail how Hashing can take an Integer, and return the Memory Address(or Index in Data Structure), and why it needs to do so.
What is Hashing in Data Structure?
Hashing in Data Structure is an important method designed to solve the problem of efficiently finding and storing data in an array.
What is data here to be stored?
It can be thought of as an object in OOPS terms, or a group of properties that all belong to a specific thing.
Using the same Student example above, data can be Student details, which can be his Roll Number, his name, his address, and so on.
All these details can be thought of as “data” we want to store. And Since each Student has a unique Roll Number, Roll number can be treated as the Key.
So Key = Roll Number, and Value = Student details.
In the Data structures’ realm, we use the concept of hashing to get the Memory address or the Index in the given data structure, where we can store this Key-Value pair at.
E.g. say the Key=100 and Value = certain Student detail, we can’t use either 100 or any Student detail to place this data at.
So the hashing is used which takes Key as the Input and returns us the memory address(or the Index in the Data structure) to place this Key-value pair at.
We will see how this works internally in detail, in later sections.
Hash Table in Data Structure
Hashing in data structure uses hash tables to store the key-value pairs.
Suppose we want to store some data(i.e. Value) identified by a unique Key, we can use the Hash Table.
Q: How is data stored inside the hash Table?
Ans: The Data Structures(e.g. Hash table) used to store the Key-Value are of limited size.
E.g. If we need to store 3 students with their Roll number as 302, 312, 342, then we need the array of size at least 343, so that these 3 students are placed at , and Index of that table.
Which might improve the search efficiency, but is a massive waste of space.
So instead of directly storing the keys in the hash table, hashing comes into play, which runs the hash function converts these big Keys to the Integer value <= size of Array. As discussed above, this Integer value is called Hash Index.
Hash Function is a mathematical formula that can take Key as an input and can return some integer value, which would be <= size of Array.
We will learn about hash functions in later sections in detail.
Let’s try to understand with an example:
Student Details
Key: Roll num
Value: Name, Age, Address
And we have 3 Students:
Student 1: 101 → Anmol, 25, Haryana
Student 2: 104 → Manoj, 26, New Delhi
Student 3: 106 → Nitin, 24, Gurgaon
Since the Roll Numbers are 101, 104, 106, either we need a very big array(of at least size=106) to store just 3 entries, or we can use something called a Hash Index which will allow us to just use Array of size=3.
Hash Function would take the Roll Number(Key) as input and will return the hashIndex as Output.
So the hash Function e.g. can be as simple as KaTeX parse error: Expected 'EOF', got '%' at position 12: RollNumber %̲ ArraySize.
e.g. for Student1: RollNumber = 101, so this hash function will generate 101%3 = 2. So Student 1 can be placed at Index in the Array. Note that we have taken %3 as the size of the array can be 3(3 students so space required = at least 3).
Student 2: 104%3 = 0 Student 3: 106%3 = 1
So Student1 can be at 2nd Index, Student1 at 0th and Student2 at 1st index of the array.
Hash Table(containing Student Roll numbers): [104, 106, 101]
What if we want to search for a Student e.g. with RollNumber = 101 ?
He would run the similar Hash function on 101 and will get the same hash Index back i.e. 2.
Then we can go to the Index of Array, to get the Student with id=101.
As you can see, using the Hashing technique you only need to find the index of the Key, rather than finding/comparing the Details(i.e. Values).
Problems:
But, if you would have read it carefully you might have stumbled upon one issue with the above technique, i.e. what if this hash function returns the same index for 2 RollNumbers?
It’s called Collision, and we will get to Collisions in detail in later sections. But we need to avoid collisions, so the hash function should be good enough that it gives distinct Indexes for every Key.
Given some Key-Value, to store it we would need to:
- Pass this Key to the hash function to get a unique index in the Array.
- Store the Key inside the hash table, against this index.
To summarize – The hash function converts the Key into a small integer or hash Index. This integer is used as an index to store the original data.
Hashing Function in DBMS
For a huge database structure of say 1000000 entries, iterating over these 1000000 entries again and again to search for a value, will make the system very slow. Hashing is an effective technique to calculate the direct location of a data record on the disk.
In Databases, Data is stored at the data blocks whose address is generated by using the hashing function. The memory location where these records are stored is known as data buckets or data blocks.
A hash function can choose any of the column values to generate the address. Such a column is called a Key. Most of the time, the hash function uses the primary key to generate the address of the data block.
To explain with an example, consider we have millions of records of students in the Database, each having a Roll Number.
Now when the insertion is to be done, DBMS(Database Management System) first calculates the hash Index of Primary Key of Student(Roll Number). This hash Index can be used as the memory address where the student can be placed. It then places this student at that memory location.
When we need to search if a Student with a certain Roll Number exists, DBMS again calculates the Memory Location by using the same hash function on Roll Number of input student and retrieves the Student details from this Memory Location(if it exists).
Ways to Calculate Hashing in Data Structure
As Briefly discussed above, the hash function is just a mathematical formula that can give a small Integer value(within the array size) for some big keys.
Below we will see 3 ways of how this method works internally.
1. Division Method
This is the easiest of all the methods to understand.
Consider a scenario where we only have 10 cells in Array, and the Key to insert is 15. i.e. > 10. Then we need some way to find a method that takes 15 and returns an index < 10.
Modulus operators can do this pretty easily.
Division method: HashFunction = Key % size
Here % = Modulus Operator which returns the remainder when Key is divided by Size.
E.g. if key=15, then KaTeX parse error: Expected 'EOF', got '%' at position 3: 15%̲10=5 and so we can insert this Data with key=15 at the 5th index of the array of size 10.
2. Multiplication Method
Similar to Division, it uses the function like:
Here, k is the key and A can be any constant value between 0 and 1.
Both k and A are multiplied and their fractional part is separated using the modulus operator. This is then multiplied with n to get the hash value.
Example:
So K with Key=123 can be placed at Index 1.
3. Universal Hashing
Universal hashing means that we don’t use a fixed hash function, but there is a family of hash functions from which we can randomly pick any hash function.
These hash functions must obey the basic principle of Universal hashing that the expected number of collisions for a particular key x is < 1.
Once we have such a family of hash functions, then any hash function can be chosen out of this, at runtime.
This brings randomness to the hash functions, and hence it favors the most evenly spread amongst all other hashing techniques.
4. Folding Method
Folding, as the name suggests, folds the given key in a way to find such an index that can fit in the given size of the array.
By folding we mean that, if the array size = 100, that means 2-digit numbers can only fit inside it, so we fold the given key in parts of 2.
I.e. if Key=20574 Then we will fold it in parts of 2, which will be:
20, 57, and 4.
So to get an index < 100 from these parts, we can sum these up i.e.
20 + 57 + 4 = 81. So 81 is the index where this Key=20574 can be placed.
Please note that there can be cases where even after folding we get a number > size e.g. Key=56571 then breaking it down in parts of 2= 56+57+1=114
Now we cant place this Data at index 114 inside the array of size 100, so we can either apply this algorithm again or can use the Division method(mostly used in such scenarios) to get KaTeX parse error: Expected 'EOF', got '%' at position 4: 114%̲100=14
So this Data can be placed at the 14th Index of this array.
Similarly, if the array size = 1000, then we can insert 3-digit numbers, so then we can break Key=20574 as
So Data with Key=20574 can be placed at 279th Index in an array of size 1000.
5. Mid Square Method
In this method, we basically square the given number and pick N middle numbers in that squared number, to find the index. E.g. If the array size = 100, meaning we can fit only 2 digit numbers, let’s take N=2 here (N can be taken anything).
. Picking middle elements i.e. 67
So 67 can be the potential index to place 93 at.
E.g. 2 Where the index obtained from the mid square method is > size of array, we can again resort back to the Division method.
Let’s take N=2 and Array size=11.
Hash(93) = 93^2 = 8649. Picking N(=2) middle elements i.e. 64 Now since 64>11, so we can apply the Division method to get
64%11=9.
So 93 can be placed at 9th Index in the array of size 11.
Collision Resolution Technique
If we notice carefully, we can notice that the same index can be obtained by running the Hash Function on different keys.
For E.g.
If using the Division method: Let’s say the array size=10
Then Hash(15) = 15%10 = 5
And Hash(25) = 25%10 = 5
So this hash function will return 5 as the index for both of these keys.
This is called a Collision, where the resultant hashes of 2 or more Keys, wrongly map to the same place in the hash table.
The collision creates a problem because each index in a hash table is supposed to store only one value
There are two fundamentals that can be used to tackle a hash collision:
1. Open Addressing –
Open addressing means where we try to resolve the collision by finding the alternative index to place the Key at.
There are 3 types of Open Addressing:
-
Rehashing
This method invokes 2 hash functions to get rid of collisions. These 2 hash functions can either be the same or different and the secondary hash function is applied continuously until an unused/vacant index is found.
E.g. Let’s take the Division Method as Primary and mid square as a secondary method.
Array size = 10.
Key=15
Hash1(15) = 15%10 = 5
So 15 is placed at the 5th Index.
Key=25
Hash1(25) = 25%10 = 5
Since 5 is already used, it’s a collision.
Hash2(25) = 625. Picking 1 middle element we get 2. So 25 can be placed at the 2nd Index.
So as we can see, by using 2 different hash functions, we are able to place both 15 and 25 at the 5th and 2nd index respectively.
In such cases, to retrieve back the value, we first calculate the hash of the key to find the Index to first look at.
Then we need a second comparison(as the key might/not be stored at this index) of the key we are searching against the key stored at that Index. We Linearly again probe down from this Index, until the keys match too.
-
Linear Probing
When the collision occurs, using the Linear Probing technique, the closest free cell is found by linearly iterating over the cells. Here, the searching is performed sequentially, starting from the position where the collision occurs till the empty cell is not found.
Let’s understand the linear probing through an example:
Let’s have an array of size 10.
And choose a simple hash function hash(K) = (2K+ 1) % size
So we Note that the calculated index of 7 is 5 which is already occupied by another key, i.e. 2. When linear probing is applied, the nearest empty cell to the index 5 is 6; therefore, the Key=7 will be added at the index 6.
As the 5th Index is again occupied with Key=2, we can’t place 12 at 5, so we will again need to linearly probe from 5.
Similarly, the 6th index is also occupied with Key=7, we look for the next Free Index.
The 7th index is free, so we can place Key=12 at the 7th Index.
-
Quadratic probing
Quadratic probing is similar to linear probing, but instead of moving one step at a time to find the empty cell to place this Data at, here the next slot is computed by adding the successive value of an arbitrary polynomial in the original hashed index. To put it simply, it means we can move in a Quadratic fashion.
E.g. From 5 we can go to 5+12 = 6
Then from 6, the next slot to check can be 6 + 22 = 10.
And then next from 10 would be 10 + 32 = 19.
So as we can see, we are moving from 5 → 6 → 10 → 19
Versus the 5 → 6 → 7 → 8 in case of Linear Probing.
Here the polynomial function taken is .
2. Closed Addressing –
Contrary to the Open addressing, where we found an alternative index incase of collision, here we place it at that index only using the Chaining technique.
-
Chaining
The chaining method builds a Linked list of items whose key hashes to the same value. This method requires an extra link field to each table position.
So if we compare it against Rehashing, it only uses 1 hash function, but how the Data is stored is different.
Again let’s use the same example as above to understand this:
Array size = 10.
Key=15
Hash(15) = 15%10 = 5
So 15 is placed at the 5th Index.
Key=25
Hash(25) = 25%10 = 5
Since 5 is already used, it’s a collision.
But using the Chaining mechanism, we can place both 15 and 25 at the same 5th Index, by chaining 15 to 25.
So the 5th Index of Array will contain a LinkedList of 15 → 25.
- To retrieve the data in such cases, e.g. here we have 15 and 25 chained at Index=5.
Now if any of 15 and 25 is given as input to search, we first calculate the hash to arrive at Index 5. Now since we have a chain, now we iterate over this chain until the Key matches. E.g. If the input Key to search was 25, we iterate over this chain of 15 → 25 to find 25.
So to summarize:
- To store the Key-Value, we calculate the Hash of the key and place it at that index.
- Incase of a Collision, we chain to the existing element at that Index.
- To search back, we again calculate the hash to arrive at the index that(along with other chain of elements, perhaps) is stored.
- Iterate over the chain, comparing keys, to find the Element we are interested in.
Hashing Terminologies
-
Data Bucket
Data buckets are memory locations where the records of the Database are stored. It is also known as Unit Of Storage.
This is analogous to the cells in Arrays.
-
DBMS Key
A DBMS key is an attribute or set of an attribute that helps you to identify a row(tuple) in a relation(table). This allows you to find the relationship between the two tables.
E.g. For the Students table, the Roll number can be thought of as a key, using which we can identify a Student’s record in the Students table.
-
Hash Function
A hash function is a mapping function that maps all the keys to the address where this Data must be stored, and the same function can then also be used to retrieve the address where this Key was stored.
E.g. To place Data with key=5.
We run HashFunction to get its Index=100, for instance.
So this Data is stored at index 100.
Similarly, when we want to retrieve Data with Key=5, we can run this hash function to get Index=100 and can read the data stored at 100th Index.
-
Hash Index
It is an address of the data block where the data is stored.
In Data Structures terms, this can be thought of as the index of Arrays or hash tables where the Data element is stored.
A hash function could be a simple mathematical function to even a complex mathematical function,
e.g. HashFunction(K) = 2*K + 7.
-
Bucket Overflow
Bucket in DBMS is a unit of storage holding records. The bucket is a disk block or contiguous block and each Bucket can contain multiple records.
So in DBMS terms, the hash Functions maps the given input record to the Bucket to place that record in.
Now as we have discussed, a hash function can give the same index for multiple Keys, similarly, in DBMS the Records may be mapped to a similar bucket.
In case the Bucket becomes full, there will be a bucket Overflow situation.
We can deploy any Collision resolution technique to tackle this:
- Chaining where the overflow bucket is chained together to another bucket in a Linked List.
- Place in another available bucket. This is not suitable in DBMS however.
- Rehash to find the next available bucket with some vacant slot.
The condition of bucket-overflow is called a collision. This is a fatal stage for any static to function.
-
Delete Operation
Using the hash function, you can first fetch the index where the Key-Value would have been stored. Then you can fetch the Key-Value from that index, which you want to delete.
Then you can remove the Data stored at that index.
-
Search Operation
When you need to retrieve the Key-Value, the same hash function should be helpful to retrieve the address of the bucket where this data should be stored.
Advantages of Hashing
- As we have seen, hashing improves the insertion and lookup significantly.
- To insert and lookup large data elements will be a costly operation.
E.g. Consider the simple example of the Address table, where we store the address of people in the Database(or Arrays/hash tables e.g.). - Here each row will contain big Strings containing Addresses.
- To search if a particular address exists in the table, we would need to compare each row and then each character of the address contained in each row.
Now compare this with using Hashing:
While inserting, we can take a hash to find the index to store this address too.
When we need to look up, we can again calculate the hash and can go to the index we got to fetch the data from there. Hence if you compare it takes constant time to calculate the Hash(as it’s just a mathematical formula) and the lookups and insertions are hence reduced to constant time i.e. O(1).
Being able to look up and insert data on the order of O(1), independent of the size of the data structure is the biggest advantage of Hashing.
However, nothing is perfect.
Hashing also suffers from O(N) worst-case time complexity.
If too many elements were hashed into the same Index: looking at the chained elements at this index may take O(n) time, n being the number of elements in that chain. And in the worst case all elements can get linked to the same Index, hence giving the Worst Case Time Complexity of O(N), N being total elements in the HashTable.
Note that in such a case, the HashTable will just be a chain of elements, or just be a LinkedList, and search in LinkedList = O(N).
To understand with an example, say if chaining is used, and say 10 elements were chained to 1 single Index. Then to search back we need to linearly search in these 10 elements chained together.
So it becomes very important to choose a hash function that is able to distribute the elements as evenly as possible, to get the average Constant time for its operations.
Conclusion
Hashing in Data Structure is a technique to find a small Number from some Data provided as an input. This number can be used to find the location to store this data at. This could be an index in Arrays or the Memory location on the disk in case of Database storage.
- Hashing in Data Structure is used to index and retrieve items as it is faster to search that specific item using the shorter hashed key instead of using its original value.
- Data bucket, Key, Hash function, Linear Probing, Quadratic probing, Hash index,Collisions are important terminologies used in hashing.
- In case the resultant index for 2 different inputs comes to be the same, it’s called a collision.
- Rehashing and chaining are two methods that help you to avoid hashing collisions.
- Rehashing is where we keep on hashing until we find the vacant index, and chaining is where we let both these values be placed at the same index with these values chained to each other.
- Using Hashing in Data Structure we can get the Insertion and Search done in Constant Time i.e. O(1) in Average/ cases.
- A good hash function must be chosen, which can bring the complexity close to O(1) if it is able to distribute the elements as evenly as possible.