want big h(n) because A* expands every node with $f(n) < C^*$, so a larger (still admissible) h(n) prunes more of the tree
relaxed problem yields admissible heuristics
schema - representation
could just discretize neighborhood of each state
SGD / gradient descent: line search - keep doubling the step size while the objective keeps improving
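A sketch of that doubling line search (the quadratic objective and starting step size are assumptions for illustration):

```python
# Doubling line search: step along -gradient, doubling the step size
# while the objective keeps improving, and return the last improving point.

def line_search_step(f, grad, x, alpha0=1e-3):
    g = grad(x)
    alpha, best = alpha0, x
    while f(x - alpha * g) < f(best):   # still improving -> double alpha
        best = x - alpha * g
        alpha *= 2
    return best

f = lambda x: (x - 3.0) ** 2            # assumed objective, minimum at 3
grad = lambda x: 2.0 * (x - 3.0)
x = 0.0
for _ in range(20):                     # repeat until converged
    x = line_search_step(f, grad, x)
```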
this is not the search graph!
initialize state values to zero and iteratively update them
Bellman update for each state: $V_{k+1}(s) \leftarrow \underset{a}{\max} \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V_k(s') \right]$
once we have values for each state, $\pi^*(s) = \underset{a}{\text{argmax}} \: Q^*(s, a)$
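A minimal value-iteration sketch of the two steps above (init to zero, Bellman updates, then extract the greedy policy); the 3-state chain MDP and all its numbers are made up for illustration:

```python
# Value iteration on a hand-made 3-state chain MDP (all numbers assumed).
# T[s][a] = list of (next_state, prob); reward depends only on the
# landing state here, a simplification.

GAMMA = 0.9

T = {
    0: {"stay": [(0, 1.0)], "right": [(1, 0.9), (0, 0.1)]},
    1: {"stay": [(1, 1.0)], "right": [(2, 0.9), (1, 0.1)]},
    2: {"stay": [(2, 1.0)], "right": [(2, 1.0)]},
}
R = {0: 0.0, 1: 0.0, 2: 1.0}  # reward for landing in each state

def q_value(V, s, a, gamma=GAMMA):
    return sum(p * (R[sp] + gamma * V[sp]) for sp, p in T[s][a])

def value_iteration(eps=1e-6):
    V = {s: 0.0 for s in T}                      # init values to zero
    while True:
        # Bellman update for each state
        V_new = {s: max(q_value(V, s, a) for a in T[s]) for s in T}
        if max(abs(V_new[s] - V[s]) for s in T) < eps:  # converged
            return V_new
        V = V_new

def extract_policy(V):
    # pi*(s) = argmax_a Q*(s, a)
    return {s: max(T[s], key=lambda a: q_value(V, s, a)) for s in T}

V = value_iteration()
pi = extract_policy(V)
```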
initialize policy and iteratively update it
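The policy-based variant (alternate evaluating the current policy and greedily improving it) as a self-contained sketch; the 2-state MDP is again an assumption:

```python
# Policy-iteration sketch on a toy 2-state MDP (numbers assumed).
GAMMA = 0.9
T = {0: {"a": [(0, 1.0)], "b": [(1, 1.0)]},   # T[s][act] = [(s', p)]
     1: {"a": [(1, 1.0)], "b": [(1, 1.0)]}}
R = {0: 0.0, 1: 1.0}                          # reward for landing state

def evaluate(pi, sweeps=200):
    # policy evaluation: Bellman updates with the action fixed by pi
    V = {s: 0.0 for s in T}
    for _ in range(sweeps):
        V = {s: sum(p * (R[sp] + GAMMA * V[sp]) for sp, p in T[s][pi[s]])
             for s in T}
    return V

def policy_iteration():
    pi = {s: "a" for s in T}                  # arbitrary initial policy
    while True:
        V = evaluate(pi)
        # policy improvement: greedy one-step lookahead
        new_pi = {s: max(T[s], key=lambda a: sum(
            p * (R[sp] + GAMMA * V[sp]) for sp, p in T[s][a])) for s in T}
        if new_pi == pi:                      # stable -> optimal
            return pi, V
        pi = new_pi

pi, V = policy_iteration()
```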
| path active? | Causal chain | Common cause | Common effect |
|---|---|---|---|
| middle node observed | ❌ | ❌ | ✅ |
| middle node unobserved | ✅ | ✅ | ❌ |
similar to Bayes nets, but we add 2 things:
filtering = state estimation - compute $P(X_t \mid e_{1:t})$
prediction - compute $P(X_{t+k} \mid e_{1:t})$ for $k > 0$
smoothing - compute $P(X_k \mid e_{1:t})$ for $0 \le k < t$
most likely explanation - compute $\underset{x_{1:t}}{\text{argmax}} \: P(x_{1:t} \mid e_{1:t})$
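Filtering can be sketched with the forward algorithm; the numbers below are the umbrella model worked through in Russell & Norvig:

```python
# Forward algorithm (filtering) on the R&N umbrella HMM.
TRANS = {"rain": {"rain": 0.7, "sun": 0.3},      # P(X_t | X_{t-1})
         "sun":  {"rain": 0.3, "sun": 0.7}}
EMIT = {"rain": {"umbrella": 0.9, "none": 0.1},  # P(e_t | X_t)
        "sun":  {"umbrella": 0.2, "none": 0.8}}
PRIOR = {"rain": 0.5, "sun": 0.5}

def filter_step(belief, evidence):
    # predict: push the belief through the transition model
    predicted = {s: sum(belief[sp] * TRANS[sp][s] for sp in belief)
                 for s in TRANS}
    # update: weight by the evidence likelihood, then normalize
    unnorm = {s: EMIT[s][evidence] * predicted[s] for s in predicted}
    z = sum(unnorm.values())
    return {s: v / z for s, v in unnorm.items()}

belief = PRIOR
for e in ["umbrella", "umbrella"]:
    belief = filter_step(belief, e)
# belief["rain"] matches the book's worked example (~0.883)
```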
sample a bunch of particles and use them to approximate probabilities:
speed-ups when num particles < num possible states
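A particle-filter sketch of the elapse-time / weight / resample loop; the model numbers reuse the umbrella HMM from Russell & Norvig, and the particle count is an arbitrary assumption:

```python
import random

# Particle filtering: approximate the filtering distribution with samples.
TRANS = {"rain": {"rain": 0.7, "sun": 0.3},      # P(X_t | X_{t-1})
         "sun":  {"rain": 0.3, "sun": 0.7}}
EMIT = {"rain": {"umbrella": 0.9, "none": 0.1},  # P(e_t | X_t)
        "sun":  {"umbrella": 0.2, "none": 0.8}}

def particle_filter_step(particles, evidence):
    # 1) elapse time: move each particle by sampling the transition model
    moved = [random.choices(list(TRANS[p]), weights=TRANS[p].values())[0]
             for p in particles]
    # 2) weight each particle by the evidence likelihood,
    # 3) resample proportionally to the weights
    weights = [EMIT[p][evidence] for p in moved]
    return random.choices(moved, weights=weights, k=len(particles))

random.seed(0)
particles = ["rain"] * 500 + ["sun"] * 500       # uniform prior
for e in ["umbrella", "umbrella"]:
    particles = particle_filter_step(particles, e)
p_rain = particles.count("rain") / len(particles)  # near the exact 0.883
```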
first-order logic: add objects, relations, quantifiers ($\forall$, $\exists$)
simple: first-order logic forward-chaining: FOL-FC-ASK
this restates the relevant equations from Russell & Norvig (all based on the utility function U(s))
given: an MDP with unknown transition model and rewards (we only see sampled transitions from acting)
find: the optimal policy $\pi^*(s)$
here only Q-learning: $Q(s, a) \leftarrow (1 - \alpha) \, Q(s, a) + \alpha \left[ r + \gamma \: \underset{a'}{\max} \: Q(s', a') \right]$
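A tabular Q-learning sketch of that update with epsilon-greedy exploration; the chain environment, constants, and reward are all assumptions:

```python
import random

# Tabular Q-learning on a tiny assumed chain: states 0..3, actions
# left/right, reward 1 for reaching the terminal state 3.

ALPHA, GAMMA, EPS = 0.5, 0.9, 0.1
ACTIONS = ["left", "right"]

def step(s, a):
    # deterministic toy dynamics; episode ends at state 3
    s2 = min(s + 1, 3) if a == "right" else max(s - 1, 0)
    return s2, (1.0 if s2 == 3 else 0.0), s2 == 3

random.seed(0)
Q = {(s, a): 0.0 for s in range(4) for a in ACTIONS}

for _ in range(500):                               # episodes
    s, done = 0, False
    while not done:
        # epsilon-greedy action selection
        a = (random.choice(ACTIONS) if random.random() < EPS
             else max(ACTIONS, key=lambda act: Q[(s, act)]))
        s2, r, done = step(s, a)
        # sample-based Bellman optimality backup
        target = r + (0.0 if done
                      else GAMMA * max(Q[(s2, b)] for b in ACTIONS))
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
        s = s2

policy = {s: max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(3)}
```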