Hashing #

Generic data structure that allows fast insert/delete/search
- Doesn’t need items to be comparable, only hashable
- Idea: get better performance in expectation using randomness

Primitive: direct addressing #

Suppose all keys are in the set {1..9}
Create buckets 1..9
Insert: map key i to bucket i
Delete: remove from bucket i
Search: look for bucket i

Problem: if keys are arbitrary, this requires a lot of memory

An improvement: put items in buckets based on one digit

Problem with this: if keys have same common digit, one bucket becomes very large

Hash functions improve bucket allocation by using randomness.

Terminology #

U: universe of size M (very very large)
n: number of (distinct) elements of U we expect to see, where (n « M)

Hash functions #

A hash function h: U -> {1..n} is a function that maps elements of U to buckets 1..n
- Example: h(x) = least significant digit of x
- For simplicity, we assume the number of buckets is n, although this doesn’t need to be the case
- In general, want #buckets = O(#elements we expect to see)

Hash tables #

Array of n buckets, where each bucket stores a linked list
- Can insert into a linked list in time O(1)
- Search/deletion in a linked list takes time O(length(list))
- Bucket index: given by hash function for any element
Goals:
- Not too many buckets (at most n) to not use too much space
- Items should be spread out across buckets for fast insert/delete/search

Designing a hash function #

Goal: Design a function h: U -> {1..n} where no matter what n items of U are adversarially chosen, buckets are balanced (i.e., O(1) entries per bucket)

This requires a random hash function!

If the function were deterministic, an adversary could pick inputs that are guaranteed to result in one bucket being overloaded

Testing parameters:

Adversary chooses n items u1…un and any sequence of insert/search/delete operations on those items
Algorithm chooses random hash function h: U -> {1..n}
Hash table should provide O(1) search/insert/delete performance

Refined goal: For all adversarial inputs u1..un, and for all i in {1..n}, want E[length(list containing ui)] <= 2

Example: random hash function #

Suppose h: U -> {1..n} is a uniformly random function (h(1) is uniformly random between 1 and n); write down the generated permutation h(i) for all i in U
Pro: Adversary has a very hard time trying to defeat this; fulfills expectation in refined goal
Con: Needs a lot of storage space to store the permutation

Hash families #

Cleverly chosen subset H of functions; only need log(|H|) bits to store an element of H
e.g. H = {LeastSignificantDigit, MostSignificantDigit} and pick a random function in H for each element; use one bit to denote what we used

Universal hash families: hash family that satisfies P{h in H} {h(ui) = h(uj)} <= 1/n for all nonequal ui, uj in U

Example:

Pick prime p >=M
fa,b(x) = ax + b (mod p)
ha,b(x) = fa,b(x) (mod n)
H = {ha,b | a, b in {1..p-1}}

Using this structure: hash table takes O(n log M) bits of space

O(n) buckets
O(n) items requiring O(log M) bits each
O(log M) bits for the hash function