Hashing #
- Generic data structure that allows fast insert/delete/search
- Doesn’t need items to be comparable, only hashable
- Idea: get better performance in expectation using randomness
Primitive: direct addressing #
- Suppose all keys are in the set {1..9}
- Create buckets 1..9
- Insert: map key i to bucket i
- Delete: remove from bucket i
- Search: look for bucket i
Problem: if keys are arbitrary, this requires a lot of memory
An improvement: put items in buckets based on one digit
- Problem with this: if keys have same common digit, one bucket becomes very large
Hash functions improve bucket allocation by using randomness.
Terminology #
- U: universe of size M (very very large)
- n: number of (distinct) elements of U we expect to see, where (n « M)
Hash functions #
- A hash function h: U -> {1..n} is a function that maps elements of U to buckets 1..n
- Example: h(x) = least significant digit of x
- For simplicity, we assume the number of buckets is n, although this doesn’t need to be the case
- In general, want #buckets = O(#elements we expect to see)
Hash tables #
- Array of n buckets, where each bucket stores a linked list
- Can insert into a linked list in time O(1)
- Search/deletion in a linked list takes time O(length(list))
- Bucket index: given by hash function for any element
- Goals:
- Not too many buckets (at most n) to not use too much space
- Items should be spread out across buckets for fast insert/delete/search
Designing a hash function #
Goal: Design a function h: U -> {1..n} where no matter what n items of U are adversarially chosen, buckets are balanced (i.e., O(1) entries per bucket)
This requires a random hash function!
- If the function were deterministic, an adversary could pick inputs that are guaranteed to result in one bucket being overloaded
Testing parameters:
- Adversary chooses n items u1…un and any sequence of insert/search/delete operations on those items
- Algorithm chooses random hash function h: U -> {1..n}
- Hash table should provide O(1) search/insert/delete performance
Refined goal: For all adversarial inputs u1..un, and for all i in {1..n}, want E[length(list containing ui)] <= 2
Example: random hash function #
- Suppose h: U -> {1..n} is a uniformly random function (h(1) is uniformly random between 1 and n); write down the generated permutation h(i) for all i in U
- Pro: Adversary has a very hard time trying to defeat this; fulfills expectation in refined goal
- Con: Needs a lot of storage space to store the permutation
Hash families #
- Cleverly chosen subset H of functions; only need log(|H|) bits to store an element of H
- e.g. H = {LeastSignificantDigit, MostSignificantDigit} and pick a random function in H for each element; use one bit to denote what we used
Universal hash families: hash family that satisfies P{h in H} {h(ui) = h(uj)} <= 1/n for all nonequal ui, uj in U
Example:
- Pick prime p >=M
- fa,b(x) = ax + b (mod p)
- ha,b(x) = fa,b(x) (mod n)
- H = {ha,b | a, b in {1..p-1}}
Using this structure: hash table takes O(n log M) bits of space
- O(n) buckets
- O(n) items requiring O(log M) bits each
- O(log M) bits for the hash function