Hashing

Hashing #

  • Generic data structure that allows fast insert/delete/search
    • Doesn’t need items to be comparable, only hashable
    • Idea: get better performance in expectation using randomness

Primitive: direct addressing #

  • Suppose all keys are in the set {1..9}
  • Create buckets 1..9
  • Insert: map key i to bucket i
  • Delete: remove from bucket i
  • Search: look for bucket i

Problem: if keys are arbitrary, this requires a lot of memory

An improvement: put items in buckets based on one digit

  • Problem with this: if keys have same common digit, one bucket becomes very large

Hash functions improve bucket allocation by using randomness.

Terminology #

  • U: universe of size M (very very large)
  • n: number of (distinct) elements of U we expect to see, where (n « M)

Hash functions #

  • A hash function h: U -> {1..n} is a function that maps elements of U to buckets 1..n
    • Example: h(x) = least significant digit of x
    • For simplicity, we assume the number of buckets is n, although this doesn’t need to be the case
    • In general, want #buckets = O(#elements we expect to see)

Hash tables #

  • Array of n buckets, where each bucket stores a linked list
    • Can insert into a linked list in time O(1)
    • Search/deletion in a linked list takes time O(length(list))
    • Bucket index: given by hash function for any element
  • Goals:
    • Not too many buckets (at most n) to not use too much space
    • Items should be spread out across buckets for fast insert/delete/search

Designing a hash function #

Goal: Design a function h: U -> {1..n} where no matter what n items of U are adversarially chosen, buckets are balanced (i.e., O(1) entries per bucket)

This requires a random hash function!

  • If the function were deterministic, an adversary could pick inputs that are guaranteed to result in one bucket being overloaded

Testing parameters:

  1. Adversary chooses n items u1…un and any sequence of insert/search/delete operations on those items
  2. Algorithm chooses random hash function h: U -> {1..n}
  3. Hash table should provide O(1) search/insert/delete performance

Refined goal: For all adversarial inputs u1..un, and for all i in {1..n}, want E[length(list containing ui)] <= 2

Example: random hash function #

  • Suppose h: U -> {1..n} is a uniformly random function (h(1) is uniformly random between 1 and n); write down the generated permutation h(i) for all i in U
  • Pro: Adversary has a very hard time trying to defeat this; fulfills expectation in refined goal
  • Con: Needs a lot of storage space to store the permutation

Hash families #

  • Cleverly chosen subset H of functions; only need log(|H|) bits to store an element of H
  • e.g. H = {LeastSignificantDigit, MostSignificantDigit} and pick a random function in H for each element; use one bit to denote what we used

Universal hash families: hash family that satisfies P{h in H} {h(ui) = h(uj)} <= 1/n for all nonequal ui, uj in U

Example:

  1. Pick prime p >=M
  2. fa,b(x) = ax + b (mod p)
  3. ha,b(x) = fa,b(x) (mod n)
  4. H = {ha,b | a, b in {1..p-1}}

Using this structure: hash table takes O(n log M) bits of space

  • O(n) buckets
  • O(n) items requiring O(log M) bits each
  • O(log M) bits for the hash function