# 2.3. data structuresÂ¶

Some notes on advanced data structures, based on UVAâs âProgram and Data Representationâ course.

## 2.3.1. listsÂ¶

### 2.3.1.1. arrays and stringsÂ¶

start by checking for null, length 0

*ascii*is 128, extended is 256

### 2.3.1.2. queue - linkedlistÂ¶

has insert at back (enqueue) and remove from front (dequeue)

```
class Node {
Node next;
int val;
public Node(int d) { val = d; }
}
```

finding a loop is tricky, use visited

reverse a linked list

requires 3 ptrs (one temporary to store next)

return pointer to new end

### 2.3.1.3. stackÂ¶

```
class Stack {
Node top;
Node pop() {
if (top != null) {
Object item = top.data;
top = top.next;
return item;
}
return null;
}
void push(Object item) {
Node t = new Node(item);
t.next = top;
top = t;
}
}
```

sort a stack with 2 stacks

make a new stack called ans

pop from old

while old element is > ans.peek(), old.push(ans.pop())

then new.push(old element)

stack with min - each el stores min of things below it

queue with 2 stacks - keep popping everything off of one and putting them on the other

sort with 2 stacks

## 2.3.2. treesÂ¶

Balanced binary trees are generally logarithmic

Root: a node with no parent; there can only be one root

Leaf: a node with no children

Siblings: two nodes with the same parent

Height of a node: length of the longest path from that node to a leaf

Thus, all leaves have height of zero

Height of a tree: maximum depth of a node in that tree = height of the root

Depth of a node: length of the path from the root to that node

Path: sequence of nodes n1, n2, âŠ, nk such that ni is parent of ni+1 for 1 â€ i â€ k

Length: number of edges in the path

Internal path length: sum of the depths of all the nodes

Binary Tree - every node has at most 2 children

Binary Search Tree - Each node has a key value that can be compared

Every node in left subtree has a key whose value is less than the rootâs key value

Every node in right subtree has a key whose value is greater than the rootâs key value

```
void BST::insert(int x, BinaryNode * & curNode){ //we pass in by reference because we want a change in the method to actually modify the parameter (the parameter is the curNode *)
//left associative so this is a reference to a pointer
if (curNode==NULL)
curNode = new BinaryNode(x,NULL,NULL);
else if(x<curNode->element)
insert(x,curNode->left);
else if(x>curNode->element)
insert(x,curNode->right);
}
```

BST Remove

if no children: remove node (reclaiming memory), set parent pointer to null - one child: Adjust pointer of parent to point at child, and reclaim memory - two children: successor is min of right subtree - replace node with successor, then remove successor from tree

worst-case depth = n-1 (this happens when the data is already sorted)

maximum number of nodes in tree of height h is 2^(h+1) - 1

minimum height h â„ log(n+1)-1

Perfect Binary tree - impractical because you need the perfect amount of nodes

all leaves have the same depth

number of leaves 2^h

### 2.3.2.1. AVL TreeÂ¶

For every node in the tree, the height of the left and right sub-trees differs at most by 1

guarantees log(n)

balance factor := The height of the right subtree minus the height of the left subtree

âUnbalancedâ trees: A balance factor of -2 or 2

AVL Insert - needs to update balance factors

same sign -> single rotation

-2, -1 -> needs right rotation

-2, +1 -> needs left then right

Find: Î(log n) time: height of tree is always Î(log n)

Insert: Î(log n) time: find() takes Î(log n), then may have to visit every node on the path back up to root to perform up to 2 single rotations

Remove: Î(log n): left as an exercise

Print: Î(n): no matter the data structure, it will still take n steps to print n elements

### 2.3.2.2. Red-Black TreesÂ¶

definition

A node is either red or black

The root is black

All leaves are black The leaves may be the NULL children

Both children of every red node are black Therefore, a black node is the only possible parent for a red node

Every simple path from a node to any descendant leaf contains the same number of black nodes

properties

The height of the right and left subtree can differ by a factor of n

insert (Assume node is red and try to insert)

The new node is the root node

The new nodeâs parent is black

Both the parent and uncle (aunt?) are red

Parent is red, uncle is black, new node is the right child of parent

Parent is red, uncle is black, new node is the left child of parent

Removal

Do a normal BST remove

Find next highest/lowest value, put itâs value in the node to be deleted, remove that highest/lowest node

Note that that node wonât have 2 children!

We replace the node to be deleted with itâs left child

This child is N, itâs sibling is S, itâs parent is P

### 2.3.2.3. Splay TreesÂ¶

A self-balancing tree that keeps ârecentlyâ used nodes close to the top

This improves performance in some cases

Great for caches

Not good for uniform access

Anytime you find / insert / delete a node, you splay the tree around that node

Perform tree rotations to make that node the new root node

Splaying is Î(h) where h is the height of the tree

At worst this is linear time - Î(n)

We say it runs in Î(log n) amortized time - individual operations might take linear time, but other operations take almost constant time - averages out to logarithmic time

m operations will take m*log(n) time

### 2.3.2.4. other treesÂ¶

to go through

*bst (without recursion) in order*, use stackspush and go left

if canât go left, pop

add new left nodes

go right

*breadth-first tree*recursively print only at a particular level each time

create pointers to nodes on the right

*balanced tree*= any 2 nodes differ in height by more than 1(maxDepth - minDepth) <=1

*trie*is an infix of the word âretrievalâ because the trie can find a single word in a dictionary with only a prefix of the wordroot is empty string

each node stores a character in the word

if ends, full word

need a way to tell if prefix is a word -> each node stores a boolean isWord

## 2.3.3. heapsÂ¶

used for

*priority queue*peek(): just look at the root node

add(val): put it at correct spot, percolate up

percolate - Repeatedly exchange node with its parent if needed

expected run time: âi=1..n 1/2^nân=2

pop(): put last leaf at root, percolate down

Remove root (that is always the min!)

Put âlastâ leaf node at root

Repeatedly find smallest child and swap node with smallest child if needed.

Priority Queue - Binary Heap is always used for Priority Queue

insert

inserts with a priority

findMin

finds the minimum element

deleteMin

finds, returns, and removes minimum element

perfect (or complete) binary tree - binary tree with all leaf nodes at the same depth; all internal nodes have 2 children.

height h, 2h+1-1 nodes, 2h-1 non-leaves, and 2h leaves

Full Binary Tree

A binary tree in which each node has exactly zero or two children.

Min-heap - parent is min

Heap Structure Property: A binary heap is an almost complete binary tree, which is a binary tree that is completely filled, with the possible exception of the bottom level, which is filled left to right.

in an array - this is faster than pointers

left child: 2*i

right child: (2*i)+1

parent: floor(i/2)

pointers need more space, are slower

multiplying, dividing by 2 are very fast

Heap ordering property: For every non-root node X, the key in the parent of X is less than (or equal to) the key in X. Thus, the tree is partially ordered.

Heap operations

findMin: just look at the root node

insert(val): put it at correct spot, percolate up

percolate - Repeatedly exchange node with its parent if needed

expecteed run time: âi=1..n 1/2^nân=2

deleteMin: put last leaf at root, percolate down

Remove root (that is always the min!)

Put âlastâ leaf node at root

Repeatedly find smallest child and swap node with smallest child if needed.

Compression

Lossless compression: X = Xâ

Lossy compression: X != Xâ

Information is lost (irreversible)

Compression ratio: \(\vert X\vert /\vert Y\vert \)

Where \(\vert X\vert \) is the number of bits (i.e., file size) of X

Huffman coding

Compression

Determine frequencies

Build a tree of prefix codes

no code is a prefix of another code

start with minheap, then keep putting trees together

Write the prefix codes to the output

reread source file and write prefix code to the output file

Decompression

read in prefix code - build tree

read in one bit at a time and follow tree

ASCII characters - 8 bits, 2^7 = 128 characters

cost - total number of bits

âstraight costâ - bits / character = log2(numDistinctChars)

Priority Queue Example

insert (x)

deleteMin()

findMin()

isEmpty()

makeEmpty()

size()

## 2.3.4. Hash tablesÂ¶

java: load factor = .75, default init capacity: 16, uses buckets

string hash function: s[0]*31^(n-1) + s[1]*31^(n-2) + âŠ + s[n-1] where n is length mod (table_size)

Standard set of operations: find, insert, delete

No ordering property!

Thus, no findMin or findMax

Hash tables store key-value pairs

Each value has a specific key associated with it

fixed size array of some size, usually a prime number

A hash function takes in a âthingâ )string, int, object, etc._

returns hash value - an unsigned integer value which is then modâed by the size of the hash table to yield a spot within the bounds of the hash table array

Three required properties

Must be deterministic

Meaning it must return the same value each time for the same âthingâ

Must be fast

Must be evenly distributed

implies avoiding collisions

Technically, only the first is required for correctness, but the other two are required for fast running times

A perfect hash function has:

No blanks (i.e., no empty cells)

No collisions

Lookup table is at best logarithmic

We canât just make a very large array - we assume the key space is too large

you canât just hash by social security number

hash(s)=(âkâ1i=0siâ37^i) mod table_size

you would precompute the powers of 37

collision - putting two things into same spot in hash table

Two primary ways to resolve collisions:

Separate Chaining (make each spot in the table a âbucketâ or a collection)

Open Addressing, of which there are 3 types:

Linear probing

Quadratic probing

Double hashing

Separate Chaining

each bucket contains a data structure (like a linked list)

analysis of find

The load factor, Î», of a hash table is the ratio of the number of elements divided by the table size

For separate chaining, Î» is the average number of elements in a bucket

Average time on unsuccessful find: Î»

Average length of a list at hash(k)

Average time on successful find: 1 + (Î»/2)

One node, plus half the average length of a list (not including the item)

typical case will be constant time, but worst case is linear because everything hashes to same spot

Î» = 1

Make hash table be the number of elements expected

So average bucket size is 1

Also make it a prime number

Î» = 0.75

Javaâs Hashtable but can be set to another value

Table will always be bigger than the number of elements

This reduces the chance of a collision!

Good trade-off between memory use and running time

Î» = 0.5

Uses more memory, but fewer collisions

Open Addressing: The general idea with all of them is that, if a spot is occupied, to âprobeâ, or try, other spots in the table to use

3 Types:

General: pi(k) = (hash(k) + f(i)) mod table_size 1.Linear Probing: f(i) = i

Check spots in this order :

hash(k)

hash(k)+1

hash(k)+2

hash(k)+3

These are all mod table_size

find - keep going until you find an empty cell (or get back)

problems

cannot have a load factor > 1, as you get close to 1, you get a lot of collisons

clustering - large blocks of occupied cells

âholesâ when an element is removed 2.Quadratic: f(i) = i^2

hash(k)

hash(k)+1

hash(k)+4

hash(k)+9

you move out of clusters much quicker 3.Double hashing: i * hash2(k)

hash2 is another hash function - typically the fastest

problem where you loop over spots that are filled - hash2 yields a factor of the table size

solve by making table size prime

hash(k) + 1 * hash2(k)

hash(k) + 2 * hash2(k)

hash(k) + 3 * hash2(k)

a prime table size helps hash function be more evenly distributed

problem: when the table gets too full, running time for operations increases

solution: create a bigger table and hash all the items from the original table into the new table

position is dependent on table size, which means we have to rehash each value

this means we have to re-compute the hash value for each element, and insert it into the new table!

When to rehash?

When half full (Î» = 0.5)

When mostly full (Î» = 0.75)

Javaâs hashtable does this by default

When an insertion fails

Some other threshold

Cost of rehashing

Letâs assume that the hash function computation is constant

We have to do n inserts, and if each key hashes to the same spot, then it will be a Î(n2) operation!

Although it is not likely to ever run that slow

Removing

You could rehash on delete

You could put in a âplaceholderâ or âsentinelâ value

gets filled with these quickly

perhaps rehash after a certain number of deletes

has functions

MD5 is a good hash function (given a string or file contents)

generates 128 bit hash

when you download something, you download the MD5, your computer computes the MD5 and they are compared to make sure it downloaded correctly

not reversible because when a file has more than 128 bits, wonât be 1-1 mapping

you can lookup a MD5 hash in a rainbow table - gives you what the password probably is based on the MD5 hash

SHA (secure Hash algorithm) is much more secure

generates hash up to 512 bits