# Hashing #

- Generic data structure that allows fast insert/delete/search
- Doesn’t need items to be comparable, only hashable
- Idea: get better performance in expectation using randomness

## Primitive: direct addressing #

- Suppose all keys are in the set {1..9}
- Create buckets 1..9
- Insert: map key
*i*to bucket*i* - Delete: remove from bucket
*i* - Search: look for bucket
*i*

Problem: if keys are arbitrary, this requires a *lot* of memory

An improvement: put items in buckets based on one digit

- Problem with this: if keys have same common digit, one bucket becomes very large

**Hash functions improve bucket allocation by using randomness.**

## Terminology #

*U*: universe of size*M*(very very large)*n*: number of (distinct) elements of U we expect to see, where (*n*«*M*)

## Hash functions #

- A hash function
*h: U -> {1..n}*is a function that maps elements of*U*to buckets 1..n- Example:
*h(x)*= least significant digit of*x* - For simplicity, we assume the number of buckets is
*n*, although this doesn’t need to be the case - In general, want #buckets = O(#elements we expect to see)

- Example:

## Hash tables #

- Array of
*n*buckets, where each bucket stores a linked list- Can insert into a linked list in time O(1)
- Search/deletion in a linked list takes time O(length(list))
- Bucket index: given by hash function for any element

- Goals:
- Not too many buckets (at most
*n*) to not use too much space - Items should be spread out across buckets for fast insert/delete/search

- Not too many buckets (at most

## Designing a hash function #

**Goal: Design a function h: U -> {1..n} where no matter what n items of U are adversarially chosen, buckets are balanced (i.e., O(1) entries per bucket)**

This requires a random hash function!

- If the function were deterministic, an adversary could pick inputs that are guaranteed to result in one bucket being overloaded

Testing parameters:

- Adversary chooses
*n*items u1…u*n*and any sequence of insert/search/delete operations on those items - Algorithm chooses random hash function
*h: U -> {1..n}* - Hash table should provide O(1) search/insert/delete performance

**Refined goal: For all adversarial inputs u1..u n, and for all i in {1..n}, want E[length(list containing ui)] <= 2**

### Example: random hash function #

- Suppose
*h: U -> {1..n}*is a uniformly random function (h(1) is uniformly random between 1 and*n*); write down the generated permutation*h(i)*for all*i*in*U* - Pro: Adversary has a very hard time trying to defeat this; fulfills expectation in refined goal
- Con: Needs a
**lot**of storage space to store the permutation

### Hash families #

- Cleverly chosen subset
*H*of functions; only need log(|*H*|) bits to store an element of*H* - e.g.
*H*= {LeastSignificantDigit, MostSignificantDigit} and pick a random function in*H*for each element; use one bit to denote what we used

Universal hash families: hash family that satisfies P{h in H} {h(u*i*) = h(u*j*)} <= 1/*n* for all nonequal u*i*, u*j* in *U*

Example:

- Pick prime
*p*>=*M* - fa,b(x) = ax + b (mod p)
- ha,b(x) = fa,b(x) (mod n)
- H = {ha,b | a, b in {1..p-1}}

Using this structure: hash table takes O(n log M) bits of space

- O(n) buckets
- O(n) items requiring O(log M) bits each
- O(log M) bits for the hash function