decisions, rl view markdown
Some notes on decision theory based on Berkeley's CS 188 course and "Artificial Intelligence" Russel & Norvig 3rd Edition
cience
 Means–ends analysis  for planning subgoals, use the distancetothegoal as a continuous reward signal (and basically do greedy search with backtracking)
 at testtime, we solve an optimization problem
game trees  R&N 5.25.5
 like search (adversarial search)
 minimax algorithm
 ply  half a move in a tree
 for multiplayer, the backedup value of a node n is the vector of the successor state with the highest value for the player choosing at n
 time complexity  $O(b^m)$
 space complexity  $O(bm)$ or even $O(m)$
 alphabeta pruning cuts in half the exponential depth
 once we have found out enough about n, we can prune it
 depends on moveordering
 might want to explore best moves = killer moves first
 transposition table can hash different movesets that are just transpositions of each other
 imperfect realtime decisions
 can evaluate nodes with a heuristic and cutoff before reaching goal
 heuristic uses features
 want quiescent search  consider if something dramatic will happen in the next ply
 horizon effect  a position is bad but isn’t apparent for a few moves
 singular extension  allow searching for certain specific moves that are always good at deeper depths
 forward pruning  ignore some branches
 beam search  consider only n best moves
 PROBCUT prunes some more
 search vs lookup
 often just use lookup in the beginning
 program can solve and just lookup endgames
 stochastic games
 include chance nodes
 change minimax to expectiminimax
 $O(b^m numRolls^m)$
 cutoff evaluation function is sensitive to scaling  evaluation function must be a positive linear transformation of the probability of winning from a position
 can do alphabeta pruning analog if we assume evaluation function is bounded in some range
 alternatively, could simulate games with Monte Carlo simulation
utilities / decision theory – R&N 16.116.3, mazzonni quant finance book
 lottery  any function of a random variable
 utility function  lottery that satisfiers certain properties (e.g. transitivity)
 expected utility = von NeumannMorgenstern utility
 goal: maximize utility by taking actions (focus on single actions)
 utility function U(s) gives utility of a state
 actions are probabilistic: $P[RESULT(a)=s’ \vert a,e]$
 s  state, e  observations, a  action
 soln: pick action with maximum expected utility
 expected utility $EU(a\vert e) = \sum_{s’} P(RESULT(a)=s’ \vert a,e) U(s’)$
 notation
 A>B  agent prefers A over B
 A~B  agent is indifferent between A and B
 preference relation has 6 axioms of utility theory
 orderability  A>B, A~B, or A<B
 transitivity
 continuity
 substitutability  can do algebra with preference eqns
 monotonicity  if A>B then must prefer higher probability of A than B
 decomposability  2 consecutive lotteries can be compressed into single equivalent lottery
 these axioms yield a utility function
 isn’t unique (ex. affine transformation yields new utility function)
 value function = ordinal utility function  sometimes ranking, numbers not needed
 agent might not be explicitly maximizing the utility function
utility functions
 preference elicitation  finds utility function
 normalized utility to have min and max value
 assess utility of s by asking agent to choose between s and $(p: \min, (1p): \max)$
 people have complicated utility functions
 ex. micromort  one in a million chance of death
 ex. QALY  qualityadjusted life year
 risk
 agents exhibits monotonic preference for more money
 gambling has expected monetary value = EMV
 risk averse = when utility of money is sublinear
 risk premium = value agent will accept in lieu of lottery = certainty equivalent= insurance premium
 riskneutral = linear
 riskseeking = supralinear
 absolute risk aversion $ARA(x) =  \frac{u’‘(x)}{u’(x)} $ : higher is more risk averse
 relative risk aversion $ARA(x) =  \frac{x \cdot u’‘(x)}{u’(x)} $
 optimizer’s curse  tendency for E[utility] to be too high because we keep picking high utility randomness
 normative theory  how idealized agents work
 descriptive theory  how actual agents work
 certainty effect  people are drawn to things that are certain
 ambiguity aversion
 framing effect  wording can influence people’s judgements
 anchoring effect  buy middletier wine because expensive is there
decision theory / VPI – R&N 16.5 & 16.6
 note: here we are just making 1 decision
 decision network (sometimes called influence diagram)
 chance nodes  represent RVs (like BN)
 decision nodes  points where decision maker has a choice of actions
 utility nodes  represent agent’s utility function
 can ignore chance nodes
 then actionutility function = Qfunction maps directly from actions to utility
 evaluation
 set evidence
 for each possible value of decision node
 set decision node to that value
 calculate probabilities of parents of utility node
 calculate resulting utility
 return action with highest utility
the value of information
 information value theory  enables agent to choose what info to acquire
 observations only affect agent’s belief state
 value of info = difference in best expected value with/without info

maximum $EU(\alpha e) = \underset{a}{\max} \sum_{s’} P(Result(a)=s’ a, e) U(s’)$
 value of perfect information VPI  assume we can obtain exact evidence for a variable (ex. variable $T=t$)

$VPI(T) = \mathbb{E}_{T}\left[ EU(\alpha e, T) \right]  \underbrace{EU(\alpha \vert e)}_{\text{original EU}}$  first term expands to $\sum_t P(T=t \vert e) \cdot EU(\alpha \vert e, T=t) $
 within each of these EU, we take a max over actions
 VPI not linearly additive, but is orderindependent
 intuition
 info is more valuable when it is likely to cause a change of plan
 info is more valuable when the new plan will be much better than the old plan

 informationgathering agent
 myopic  greedily obtain evidence which yields highest VPI until some threshold
 conditional plan  considers more things
mdps and rl  R&N 17.117.4
 sequences of actions
 fully observable  agent knows its state
 markov decision process  all these things are given
 set of states s
 set of actions a
 stochastic transition model $P(s’ \vert s,a)$
 reward function $R(s)$
 utility aggregates rewards, for models more complex than mdps reward can be a function of past sequences of actions / observations
 want policy $\pi (s)$  what action to do in state s
 optimal policy yields highest expected utlity
 optimizing MDP  multiattribute utility theory
 could sum rewards, but results are infinite
 instead define objective function (maps infinite sequences of rewards to single real numbers)
 ex. discounting to prefer earlier rewards (most common)
 discount reward n steps away by $\gamma^n, 0<\gamma<1$
 ex. set a finite horizon and sum rewards
 optimal action in a given state could change over time = nonstationary
 ex. average reward rate per time step
 ex. agent is guaranteed to get to terminal state eventually  proper policy
 ex. discounting to prefer earlier rewards (most common)
 expected utility executing $\pi$: $U^\pi (s) = \mathbb E_{s_1,…,s_t}\left[\sum_t \gamma^t R(s_t)\right]$
 when we use discounted utilities, $\pi$ is independent of starting state
 $\pi^*(s) = \underset{\pi}{\text{argmax}} : U^\pi (s) = \underset{a}{\text{argmax}} \sum_{s’} P(s’ \vert s,a) U(s’)$
 experience replay  instead of learning from samples one by one, want to reduce correlation between subsequent samples
 take a large batch of samples and sample randomly from it, rather than going sequentially
value iteration
 value iteration  calculates utility of each state and uses utilities to find optimal policy
 bellman eqn: $U(s) = R(s) + \gamma : \underset{a}{\max} \sum_{s’} P(s’ \vert s, a) U(s’)$
 start with arbitrary utilities
 recalculate several times with Bellman update to approximate solns to bellman eqn
 value iteration eventually converges
 contraction  function that brings variables together
 contraction only has 1 fixed point
 Bellman update is a contraction on the space of utility vectors and therefore converges
 error is reduced by factor of $\gamma$ each iteration
 also, terminating condition: if $ \vert \vert U_{i+1}U_i \vert \vert < \epsilon (1\gamma) / \gamma$ then $ \vert \vert U_{i+1}U \vert \vert <\epsilon$
 what actually matters is policy loss $ \vert \vert U^{\pi_i}U \vert \vert $  the most the agent can lose by executing $\pi_i$ instead of the optimal policy $\pi^*$
 if $ \vert \vert U_i U \vert \vert < \epsilon$ then $ \vert \vert U^{\pi_i}  U \vert \vert < 2\epsilon \gamma / (1\gamma)$
 contraction  function that brings variables together
policy iteration
 another way to find optimal policies
 policy evaluation  given a policy $\pi_i$, calculate $U_i=U^{\pi_i}$, the utility of each state if $\pi_i$ were to be executed
 like value iteration, but with a set policy so there’s no max
 $U_i(s) = R(s) + \gamma : \sum_{s’} P(s’ \vert s, \pi_i(s)) U_i(s’)$
 can solve exactly for small spaces, or approximate (set of lin. eqs.)
 policy improvement  calculate a new MEU policy $\pi_{i+1}$ using $U_i$
 same as above, just $\pi^*(s) = \underset{\pi}{\text{argmax}} : U^\pi (s) = \underset{a}{\text{argmax}} \sum_{s’} P(s’ \vert s,a) U’(s)$
 policy evaluation  given a policy $\pi_i$, calculate $U_i=U^{\pi_i}$, the utility of each state if $\pi_i$ were to be executed
 asynchronous policy iteration  don’t have to update all states at once
partially observable markov decision processes (POMDP)

agent is not sure what state it’s in

same elements but add sensor model $P(e \vert s)$
 have distr $b(s)$ for belief states
 updates like the HMM: $b’(s’) = \alpha P(e \vert s’) \sum_s P(s’ \vert s, a) b(s)$
 changes based on observations

optimal action depends only on the agent’s current belief state
 use belief states as the states of an MDP and solve as before
 changes because state space is now continuous
 value iteration
 expected utility of executing p in belief state is just $b \cdot \alpha_p$ (dot product)
 $U(b) = U^{\pi^*}(b)=\underset{p}{\max} : b \cdot \alpha_p$
 belief space is continuous [0, 1] so we represent it as piecewise linear, and store these discrete lines in memory
 do this by iterating and keeping any values that are optimal at some point
 remove dominated plans  generally this is far too inefficient
 belief space is continuous [0, 1] so we represent it as piecewise linear, and store these discrete lines in memory
 dynamic decision network  online agent
reinforcement learning – R&N 21.121.6
 reinforcement learning  use observed rewards to learn optimal policy for the environment

in ch 17, agent had model of environment ($P(s’ s, a)$ and $R(s)$)

 2 problems
 passive  given $\pi$, learn $U^\pi (s)$
 active  explore states to find utilities and exploit to get highest reward
 2 model types, 3 agent designs

modelbased: can predict next state/reward before taking action (for MDP, requires learning $P(s’ s,a)$)  utilitybased agent  learns $U(S)$  utility function on states
 requires model of the environment
 utilitybased agent  learns $U(S)$  utility function on states
 modelfree
 Qlearning agent: learns $Q(s, a)$  actionutility function = Qfunction maps actions $\to$ utility
 reflex agent: learns $Q(s)$  policy that maps directly from states to actions

passive reinforcement learning (estimate value function given policy)

given policy $\pi$, learn $U^\pi (s) = \mathbb E\left[ \sum_{t=0}^{\infty} \gamma^t R(S_t)\right]$
 like policy evaluation, but transition model / reward function are unknown
 direct utility estimation: treat states independently
 run trials to sample utility
 average to get expected total reward for each state = expected total reward from each state
 adaptive dynamic programming (ADP)  2 steps

sample to estimate transition model $P(s’ s, a)$ and rewards $R(s)$  find $U^\pi(s)$ with the Bellman eqn (plug in at each step)
 we might want to enforce a prior on the model (two ways)
 Bayesian reinforcement learning  assume a prior $P(h)$ on transition model h
 use prior to calculate $P(h \vert e)$

use $P(h e)$ to calculate optimal policy: $\pi^* = \underset{\pi}{argmax} \sum_h P(h \vert e) u_h^\pi$
 $u_h^\pi$= expected utility over all possible start states, obtained by executing policy $\pi$ in model h
 robust control theory  give best outcome in the worst case over H
 $\pi^* = \underset{\pi}{argmax}: \underset{h}{\min} : u_h^\pi$
 Bayesian reinforcement learning  assume a prior $P(h)$ on transition model h

 temporaldifference learning  adjust utility estimates towards local equilibrium for correct utilities
 like an approximation of ADP
 when we transition $s \to s’$, update $U^\pi(s) = U^\pi (s) + \alpha \left[R(s)  U^\pi (s) + \gamma :U^\pi (s’) \right]$
 $\alpha$ should decrease over time to converge
 prioritized sweeping  prefer adjustments to states whose likely successors have just undergone a large adjustment in their own utility estimates
 speeds things up
active reinforcement learning

no longer following set policy

explore states to find their utilities and exploit model to get highest reward

must explore all actions, not just those in the policy

 bandit problems  determining exploration policy
 narmed bandit  pulling n levelers on a slot machine, each with different distr.
 Gittins index  function of number of pulls / payoff
 coorect schemes should be GLIE  greedy in the limit of infinite exploration  visits all states infinitely, but eventually become greedy
agent examples
 ex. choose random action $1/t$ of the time
 ex. active adp agent
 give optimistic utility to relatively unexplored states
 uses exploration function f(u, numTimesVisited) around the sum in the bellman eqn
 high utilities will propagate
 ex. active TD agent
 now must learn transitions (same as adp)
 update rule same as passive TD
learning actionutility function $Q(s, a)$
 $U(s) = \underset{a}{\max} : Q(s,a)$

ADP version: $Q(s, a) = R(s) + \gamma \sum_{s’} P(s’ s, a) \underset{a’}{\max} Q(s’, a’)$  TD version: $Q(s,a) = Q(s,a) + \alpha [R(s)  Q(s,a) + \gamma : \underset{a’}{\max} Q(s’, a’)]$  this is what is usually referred to as Qlearning
 this is offpolicy (only uses best Qvalue, doesn’t pay attention to actualy policy being followed)  more flexible

 SARSA (stateactionrewardstateaction) is related: $Q(s,a) = Q(s,a) + \alpha [R(s) + \gamma : Q(s’, a’)  Q(s,a) ]$
 here, $a’$ is action actually taken
 SARSA is onpolicy (pays attention to actual policy being followed)
 can approximate Qfunction with something other than a lookup table
 ex. linear function of parameters $\hat{U}_\theta(s) = \theta_1f_1(s) + … + \theta_n f_n(s)$
 can learn params online with delta rule = wildrowhoff rule: $\theta_i = \theta  \alpha : \frac{\partial Loss}{\partial \theta_i}$
 ex. linear function of parameters $\hat{U}_\theta(s) = \theta_1f_1(s) + … + \theta_n f_n(s)$
policy search
 keep twiddling the policy as long as it improves, then stop
 store one Qfunction (parameterized by $\theta$) for each action
 ex. $\pi(s) = \underset{a}{\max} : \hat{Q}_\theta (s,a)$
 this is discontinunous, instead often use stochastic policy representation (ex. softmax for $\pi_\theta (s,a)$)
 learn $\theta$ that results in good performance
 Qlearning learns actual Q* function  could be different (scaling factor etc.)
 to find $\pi$ maximize policy value $p(\theta) = $ expected reward executing $\pi_\theta$
 could do this with sgd using policy gradient
 when environment/policy is stochastic, more difficult
 could sample mutiple times to compute gradient
 REINFORCE algorithm  could approximate gradient at $\theta$ by just sampling at $\theta$: $\nabla_\theta p(\theta) \approx \frac{1}{N} \sum_{j=1}^N \frac{(\nabla_\theta \pi_\theta (s, a_j)) R_j (s)}{\pi_\theta (s, a_j)}$
 PEGASUS  correlated sampling  ex. 2 blackjack programs would both be dealt same hands  want to see different policies on same things
deep rl course
“supervised rl” (imitation learning)

imitation learning / behavioral cloning – given pairs of observations / actions, learn policy to take action given observation $\pi_\theta(a_t o_t)$  basic example: cost function is 0 when action is same as human’s in data and 1 otherwise
 usually inefficient / insufficient
 one improvement: DAgger (ross et al. 2011)  use learned policy to generate synthetic observations and have humans label those
 we can query observations when deviate slightly from expert trajectory

goalconditioned behavioral cloning  subdivides data based on different goals  learn $\pi_\theta(a s, g)$  example  given a goal location, take actions to move a robot there
 Learning to Reach Goals via Iterated Supervised Learning (ghosh … levine, 2020)
 move robot arm based on policy (initially random)
 see which random goals are met
 use this as goalconditioned behavioral cloning
 update policy and repeat
rl algorithms overview
 3 general steps (iterated)
 fit a model / estimate the return
 improve the policy
 generate samples (i.e. run the policy)
 $\theta^{\star}=\arg \max {\theta} E{\tau \sim p_{\theta}(\tau)}\left[\sum_{t} r\left(\mathbf{s}{t}, \mathbf{a}{t}\right)\right]$
 Valuebased: estimate value function or $Q$function of the optimal policy (no explicit policy)
 Policy gradients: directly differentiate the above objective

policy network $\pi_\theta(a s)$  gradients are extremely noisy compared to supervised learning

 Actorcritic: estimate value function or Qfunction of the current policy (critic), use it to improve policy (actor)
 Modelbased RL: estimate the transition model, and then…
 just use the model to plan (no plicy)
 trajectory optimization/optimal control (continuous space)  optimize over actions
 discrete planning  e.g. monte carlo tree search
 backpropagate gradients into the policy
 use the model to learn a value function
 just use the model to plan (no plicy)
fit model / estimate return  improve policy  

valuebased  fit $V(s)$ or $Q(s, a)$  $\pi(s) = \text{argmax}_a Q(s, a)$  
(direct) policy gradients  evaluate returns $R_\tau = \sum_t r(s_t, a_t)$  $\theta = \theta + \alpha \nabla_\theta E[\sum_t r(s_t, a_t)]$  
actorcritic  fit $V(s)$ or $Q(s, a)$  $\theta = \theta + \alpha \nabla_\theta E[Q(s, a)]$  
modelbased  maybe model $P(s’  s, a)$ 
 sample efficiency
 on policy  must generate new samples each time the policy is changed
 off policy  can improve policy without generating new samples from it (so look at a bunch of samples then update)
 offline rl  data collected only once with any policy, then want to learn good policy from that
other problems
inverse rl
Inverse RL  learning reward functions from example
 ai should be uncertain about utitilies
 utilties should be inferred from human preferences
 in systems that interact, need to express preferences in terms of game theory
 can solve it with a GAN: e.g. A Connection between Generative Adversarial Networks, Inverse Reinforcement Learning, and EnergyBased Models (finn et al. 2016)

[Apprenticeship learning via inverse reinforcement learning Proceedings of the twentyfirst international conference on Machine learning](https://dl.acm.org/doi/abs/10.1145/1015330.1015430?casa_token=inLo77czqAAAAAA:JirDplyZ_94a7A0WXavY3V5napVOxdJ5Qjkzin02K4bapip6D5bQNuPAefmkYeYAb9OSVyqkEKRcCJs) (abbeel & ng, 2004)  good intro to inverse RL
planning
 Efficient Learning in Cellular Simultaneous Recurrent Neural Networks  The Case of Maze Navigation Problem (ilin et al. 2007)  explored connections between planning algorithms and recurrent NNs
 Value Iteration Networks (tamar…levine, & abbeel, 2017)
 represent value iteration as a fully differentiable DNN using recurrence
metalearning
 learning to learn (very close to multitask learning)
 e.g. learn optimizer, representation
 e.g. learn how to explore
 e.g. learn how to do RL for various different walking tasks and then generalize to new walking task with few samples
 ModelAgnostic MetaLearning for Fast Adaptation of Deep Networks (finn, levine, & abbeel, 2017)  treat metalearner itself as an RL algorithm
 multitask learning
 sometimes have the ability to decide which new tasks to add (e.g. by changing simulator)
offline rl
 core issue with offline RL: want policy that improves over the training policy, but can’t deviate from the training policy due to distr. shift
 one idea: instead of collecting new samples with policy, reweight samples using importance sampling based on policy
 IQL: Implicit Qlearning (ashvin’s paper, 2021)
 IQL  foregoes need to evaluate unseen actions