Hash Collisions: Causes and Solutions in Hash Tables¶
A hash table is an efficient data structure that allows us to quickly find corresponding “values” using “keys.” Imagine a hash table as a huge collection of drawers, where each drawer has a unique number (array index). We use a “key rule” (hash function) to place each “item” (key) into its corresponding drawer. When searching, we directly open the target drawer to retrieve the item, without rummaging through all drawers. However, this “key rule” can fail—different items (keys) may end up in the same drawer (array index)—this is a hash collision.
I. What is a Hash Collision?¶
A hash collision occurs when two distinct keys (e.g., student IDs 101 and 201) map to the same array index (e.g., the 5th drawer) after applying the hash function. It is analogous to two students with different names (keys) being assigned to the same classroom seat (array index) due to a seating rule (hash function), requiring them to “find their own seats.”
II. Why Do Collisions Occur?¶
The root cause of hash collisions is either:
- The number of keys far exceeds the array capacity, or
- The hash function is poorly designed (not uniformly distributed).
Example: If the hash table has an array length of 10 (10 drawers) but 100 distinct keys (e.g., student IDs 1-100), using the “modulo 10” hash function (key % 10) will cluster keys like 1, 11, 21…91 into drawer 1, 2, 12, 22…92 into drawer 2, etc. Since there are only 10 possible remainders (0-9), more than 10 keys will collide.
III. How to Resolve Hash Collisions?¶
The core idea is to ensure conflicting keys “occupy distinct positions” when collisions occur. Common solutions include:
1. Chaining (Zipping Method) – Most Common¶
Principle: Instead of a single empty drawer, each array index (drawer) stores a linked list. Conflicting keys are “linked” sequentially to the same list.
Example:
- Array has 4 indices (drawers 0-3), hash function key % 4.
- Insert key 5 (5%4=1) → Drawer 1: [5].
- Insert key 1 (1%4=1, collision) → Drawer 1: 5 → 1.
- Insert key 9 (9%4=1, collision) → Drawer 1: 5 → 1 → 9.
- Insert key 13 (13%4=1, collision) → Drawer 1: 5 → 1 → 9 → 13.
Search: Compute the array index (drawer 1), then traverse the linked list to find the target key.
Advantages: Simple implementation, high space efficiency (no wasted empty slots), and widely used in languages like Java’s HashMap.
2. Open Addressing – Find Empty Positions Sequentially¶
Principle: When a collision occurs, search for the next empty position starting from the collision index.
Linear Probing (Simplest Open Addressing):
- Step size = 1: i+1, i+2, i+3... (where i is the collision index).
Example:
- Insert key 5 at index 1.
- Insert key 1 (collision at 1) → Try index 2 (empty) → Place at 2.
- Insert key 2 (2%4=2, collision) → Try index 3 (empty) → Place at 3.
- Insert key 6 (6%4=2, collision) → Index 2 occupied → Try 3 (occupied) → Try 4 (empty, assuming array length 5) → Place at 4.
Disadvantage: Prone to “clustering” (multiple conflicting keys occupy consecutive positions), reducing efficiency for large datasets.
3. Quadratic Probing – Larger Step Sizes to Reduce Clustering¶
Principle: Step sizes are not 1 but squared values (1², 2², 3²…), e.g., i+1², i+2², i+3²....
Example:
- Collision at index i=1: First probe at 1+1²=2; if collision, probe at 1+2²=5; if still colliding, probe at 1+3²=10, etc.
Advantages: Reduces clustering compared to linear probing but may still cluster with poorly chosen step sizes. Implementation is slightly more complex.
4. Double Hashing – Use Multiple Hash Functions¶
Principle: If the first hash function collides, use a second hash function to compute a new position.
Example:
- First hash function: h1(k) = k % 4 (collision at index 1).
- Second hash function: h2(k) = (k // 10) % 5 (alternative mapping).
- New position: h1(k) + h2(k) until an empty slot is found.
Advantages: Scatters collisions using multiple hash functions, reducing clustering. Disadvantages: Requires maintaining multiple hash functions, complex to implement.
5. Public Overflow Area – Main Area + Overflow Area¶
Principle: Store non-colliding keys in the main array. Conflicting keys are unified into an “overflow array.”
Search: Check the main array first; if not found, search the overflow array.
Disadvantage: Overflow array size is hard to predict, leading to potential insertion failures if space is exhausted.
IV. Summary¶
Hash collisions are inherent to hash tables, as the number of keys may exceed array capacity, and hash functions cannot be perfectly uniform. Among solutions, chaining (zipping method) is the most intuitive and widely used, relying on linked lists to resolve collisions efficiently. Open addressing (linear/quadratic/double hashing) also works but may suffer from clustering. The choice depends on implementation complexity and space efficiency requirements.
Understanding hash collisions and their solutions helps design and optimize hash tables for efficient code!