data structures
view markdownSome notes on advanced data structures, based on UVA’s “Program and Data Representation” course.
lists
arrays and strings
 start by checking for null, length 0
 ascii is 128, extended is 256
queue  linkedlist
 has insert at back (enqueue) and remove from front (dequeue)
class Node { Node next; int val; public Node(int d) { val = d; } }
 finding a loop is tricky, use visited
 reverse a linked list
 requires 3 ptrs (one temporary to store next)
 return pointer to new end
stack
class Stack {
Node top;
Node pop() {
if (top != null) {
Object item = top.data;
top = top.next;
return item;
}
return null;
}
void push(Object item) {
Node t = new Node(item);
t.next = top;
top = t;
}
}
 sort a stack with 2 stacks
 make a new stack called ans
 pop from old
 while old element is > ans.peek(), old.push(ans.pop())
 then new.push(old element)
 stack with min  each el stores min of things below it
 queue with 2 stacks  keep popping everything off of one and putting them on the other
 sort with 2 stacks
trees
 Balanced binary trees are generally logarithmic
 Root: a node with no parent; there can only be one root
 Leaf: a node with no children
 Siblings: two nodes with the same parent
 Height of a node: length of the longest path from that node to a leaf
 Thus, all leaves have height of zero
 Height of a tree: maximum depth of a node in that tree = height of the root
 Depth of a node: length of the path from the root to that node
 Path: sequence of nodes n1, n2, …, nk such that ni is parent of ni+1 for 1 ≤ i ≤ k
 Length: number of edges in the path
 Internal path length: sum of the depths of all the nodes
 Binary Tree  every node has at most 2 children
 Binary Search Tree  Each node has a key value that can be compared
 Every node in left subtree has a key whose value is less than the root’s key value
 Every node in right subtree has a key whose value is greater than the root’s key value
void BST::insert(int x, BinaryNode * & curNode){ //we pass in by reference because we want a change in the method to actually modify the parameter (the parameter is the curNode *)
//left associative so this is a reference to a pointer
if (curNode==NULL)
curNode = new BinaryNode(x,NULL,NULL);
else if(x<curNode>element)
insert(x,curNode>left);
else if(x>curNode>element)
insert(x,curNode>right);
}
 BST Remove
 if no children: remove node (reclaiming memory), set parent pointer to null
 one child: Adjust pointer of parent to point at child, and reclaim memory
 two children: successor is min of right subtree
 replace node with successor, then remove successor from tree
 worstcase depth = n1 (this happens when the data is already sorted)
 maximum number of nodes in tree of height h is 2^(h+1)  1
 minimum height h ≥ log(n+1)1
 Perfect Binary tree  impractical because you need the perfect amount of nodes
 all leaves have the same depth
 number of leaves 2^h
AVL Tree
 if no children: remove node (reclaiming memory), set parent pointer to null
 one child: Adjust pointer of parent to point at child, and reclaim memory
 two children: successor is min of right subtree
 replace node with successor, then remove successor from tree
 For every node in the tree, the height of the left and right subtrees differs at most by 1
 guarantees log(n)
 balance factor := The height of the right subtree minus the height of the left subtree
 “Unbalanced” trees: A balance factor of 2 or 2
 AVL Insert  needs to update balance factors
 same sign > single rotation
 2, 1 > needs right rotation
 2, +1 > needs left then right
 Find: Θ(log n) time: height of tree is always Θ(log n)
 Insert: Θ(log n) time: find() takes Θ(log n), then may have to visit every node on the path back up to root to perform up to 2 single rotations
 Remove: Θ(log n): left as an exercise
 Print: Θ(n): no matter the data structure, it will still take n steps to print n elements
RedBlack Trees
 definition
 A node is either red or black
 The root is black
 All leaves are black The leaves may be the NULL children
 Both children of every red node are black Therefore, a black node is the only possible parent for a red node
 Every simple path from a node to any descendant leaf contains the same number of black nodes
 properties
 The height of the right and left subtree can differ by a factor of n
 insert (Assume node is red and try to insert)
 The new node is the root node
 The new node’s parent is black
 Both the parent and uncle (aunt?) are red
 Parent is red, uncle is black, new node is the right child of parent
 Parent is red, uncle is black, new node is the left child of parent
 Removal
 Do a normal BST remove
 Find next highest/lowest value, put it’s value in the node to be deleted, remove that highest/lowest node
 Note that that node won’t have 2 children!
 We replace the node to be deleted with it’s left child
 This child is N, it’s sibling is S, it’s parent is P
Splay Trees
 This child is N, it’s sibling is S, it’s parent is P
 Find next highest/lowest value, put it’s value in the node to be deleted, remove that highest/lowest node
 Do a normal BST remove
 A selfbalancing tree that keeps “recently” used nodes close to the top
 This improves performance in some cases
 Great for caches
 Not good for uniform access
 Anytime you find / insert / delete a node, you splay the tree around that node
 Perform tree rotations to make that node the new root node
 Splaying is Θ(h) where h is the height of the tree
 At worst this is linear time  Θ(n)
 We say it runs in Θ(log n) amortized time  individual operations might take linear time, but other operations take almost constant time  averages out to logarithmic time
 m operations will take m*log(n) time
other trees
 to go through bst (without recursion) in order, use stacks
 push and go left
 if can’t go left, pop
 add new left nodes
 go right
 add new left nodes
 breadthfirst tree
 recursively print only at a particular level each time
 create pointers to nodes on the right
 balanced tree = any 2 nodes differ in height by more than 1
 (maxDepth  minDepth) <=1
 trie is an infix of the word “retrieval” because the trie can find a single word in a dictionary with only a prefix of the word
 root is empty string
 each node stores a character in the word
 if ends, full word
 need a way to tell if prefix is a word > each node stores a boolean isWord
heaps
 used for priority queue
 peek(): just look at the root node
 add(val): put it at correct spot, percolate up
 percolate  Repeatedly exchange node with its parent if needed
 expected run time: ∑i=1..n 1/2^n∗n=2
 pop(): put last leaf at root, percolate down
 Remove root (that is always the min!)
 Put “last” leaf node at root
 Repeatedly find smallest child and swap node with smallest child if needed.
 Priority Queue  Binary Heap is always used for Priority Queue
 insert
 inserts with a priority
 findMin
 finds the minimum element
 deleteMin
 finds, returns, and removes minimum element
 insert
 perfect (or complete) binary tree  binary tree with all leaf nodes at the same depth; all internal nodes have 2 children.
 height h, 2h+11 nodes, 2h1 nonleaves, and 2h leaves
 Full Binary Tree
 A binary tree in which each node has exactly zero or two children.
 Minheap  parent is min
 Heap Structure Property: A binary heap is an almost complete binary tree, which is a binary tree that is completely filled, with the possible exception of the bottom level, which is filled left to right.
 in an array  this is faster than pointers
 left child: 2*i
 right child: (2*i)+1
 parent: floor(i/2)
 pointers need more space, are slower
 multiplying, dividing by 2 are very fast
 in an array  this is faster than pointers
 Heap ordering property: For every nonroot node X, the key in the parent of X is less than (or equal to) the key in X. Thus, the tree is partially ordered.
 Heap Structure Property: A binary heap is an almost complete binary tree, which is a binary tree that is completely filled, with the possible exception of the bottom level, which is filled left to right.
 Heap operations
 findMin: just look at the root node
 insert(val): put it at correct spot, percolate up
 percolate  Repeatedly exchange node with its parent if needed
 expecteed run time: ∑i=1..n 1/2^n∗n=2
 deleteMin: put last leaf at root, percolate down
 Remove root (that is always the min!)
 Put “last” leaf node at root
 Repeatedly find smallest child and swap node with smallest child if needed.
 Compression
 Lossless compression: X = X’
 Lossy compression: X != X’
 Information is lost (irreversible)
 Compression ratio: $\vert X\vert /\vert Y\vert $
 Where $\vert X\vert $ is the number of bits (i.e., file size) of X
 Huffman coding
 Compression
 Determine frequencies
 Build a tree of prefix codes
 no code is a prefix of another code
 start with minheap, then keep putting trees together
 Write the prefix codes to the output
 reread source file and write prefix code to the output file
 Decompression
 read in prefix code  build tree
 read in one bit at a time and follow tree
 Compression
 ASCII characters  8 bits, 2^7 = 128 characters
 cost  total number of bits
 “straight cost”  bits / character = log2(numDistinctChars)
 Priority Queue Example
 insert (x)
 deleteMin()
 findMin()
 isEmpty()
 makeEmpty()
 size()
Hash tables
 java: load factor = .75, default init capacity: 16, uses buckets
 string hash function: s[0]31^(n1) + s[1]31^(n2) + … + s[n1] where n is length mod (table_size)
 Standard set of operations: find, insert, delete
 No ordering property!
 Thus, no findMin or findMax
 Hash tables store keyvalue pairs
 Each value has a specific key associated with it
 fixed size array of some size, usually a prime number

A hash function takes in a “thing” )string, int, object, etc._
 returns hash value  an unsigned integer value which is then mod’ed by the size of the hash table to yield a spot within the bounds of the hash table array
 Three required properties
 Must be deterministic
 Meaning it must return the same value each time for the same “thing”
 Must be fast
 Must be evenly distributed
 implies avoiding collisions  Technically, only the first is required for correctness, but the other two are required for fast running times
 Must be deterministic
 A perfect hash function has:
 No blanks (i.e., no empty cells)
 No collisions
 Lookup table is at best logarithmic

We can’t just make a very large array  we assume the key space is too large
 you can’t just hash by social security number

hash(s)=(∑k−1i=0si∗37^i) mod table_size
 you would precompute the powers of 37
 collision  putting two things into same spot in hash table
 Two primary ways to resolve collisions:
 Separate Chaining (make each spot in the table a ‘bucket’ or a collection)
 Open Addressing, of which there are 3 types:
 Linear probing
 Quadratic probing
 Double hashing
 Two primary ways to resolve collisions:
 Separate Chaining
 each bucket contains a data structure (like a linked list)
 analysis of find
 The load factor, λ, of a hash table is the ratio of the number of elements divided by the table size
 For separate chaining, λ is the average number of elements in a bucket
 Average time on unsuccessful find: λ
 Average length of a list at hash(k)
 Average time on successful find: 1 + (λ/2)
 One node, plus half the average length of a list (not including the item)
 Average time on unsuccessful find: λ
 typical case will be constant time, but worst case is linear because everything hashes to same spot
 λ = 1
 Make hash table be the number of elements expected
 So average bucket size is 1
 Also make it a prime number
 λ = 0.75
 Java’s Hashtable but can be set to another value
 Table will always be bigger than the number of elements
 This reduces the chance of a collision!
 Good tradeoff between memory use and running time
 λ = 0.5
 Uses more memory, but fewer collisions
 For separate chaining, λ is the average number of elements in a bucket
 The load factor, λ, of a hash table is the ratio of the number of elements divided by the table size
 Open Addressing: The general idea with all of them is that, if a spot is occupied, to ‘probe’, or try, other spots in the table to use
 3 Types:
 General: pi(k) = (hash(k) + f(i)) mod table_size
1.Linear Probing: f(i) = i
 Check spots in this order :
 hash(k)
 hash(k)+1
 hash(k)+2
 hash(k)+3
 These are all mod table_size
 find  keep going until you find an empty cell (or get back)
 problems
 cannot have a load factor > 1, as you get close to 1, you get a lot of collisons
 clustering  large blocks of occupied cells
 “holes” when an element is removed 2.Quadratic: f(i) = i^2
 hash(k)
 hash(k)+1
 hash(k)+4
 hash(k)+9
 you move out of clusters much quicker 3.Double hashing: i * hash2(k)
 hash2 is another hash function  typically the fastest
 problem where you loop over spots that are filled  hash2 yields a factor of the table size
 solve by making table size prime
 hash(k) + 1 * hash2(k)
 hash(k) + 2 * hash2(k)
 hash(k) + 3 * hash2(k)
 Check spots in this order :
 General: pi(k) = (hash(k) + f(i)) mod table_size
1.Linear Probing: f(i) = i
 a prime table size helps hash function be more evenly distributed
 problem: when the table gets too full, running time for operations increases
 solution: create a bigger table and hash all the items from the original table into the new table
 position is dependent on table size, which means we have to rehash each value
 this means we have to recompute the hash value for each element, and insert it into the new table!
 When to rehash?
 When half full (λ = 0.5)
 When mostly full (λ = 0.75)
 Java’s hashtable does this by default
 When an insertion fails
 Some other threshold
 Cost of rehashing
 Let’s assume that the hash function computation is constant
 We have to do n inserts, and if each key hashes to the same spot, then it will be a Θ(n2) operation!
 Although it is not likely to ever run that slow
 Removing
 You could rehash on delete
 You could put in a ‘placeholder’ or ‘sentinel’ value
 gets filled with these quickly
 perhaps rehash after a certain number of deletes
 3 Types:
 has functions
 MD5 is a good hash function (given a string or file contents)
 generates 128 bit hash
 when you download something, you download the MD5, your computer computes the MD5 and they are compared to make sure it downloaded correctly
 not reversible because when a file has more than 128 bits, won’t be 11 mapping
 you can lookup a MD5 hash in a rainbow table  gives you what the password probably is based on the MD5 hash
 SHA (secure Hash algorithm) is much more secure
 generates hash up to 512 bits
 MD5 is a good hash function (given a string or file contents)