~Today’s Topic~
The problem of figuring out what the right thing to do is when your actions may have delayed consequences.
We may have to sacrifice immediate reward in order to maximize long-term reward.
Markov Process or Markov Chain
• Memoryless random process: a sequence of random states with the Markov property
Markov Process
• The future is independent of the past given the present
• A finite or potentially infinite set of states S
• A dynamics model P that specifies the probability of the next state given the current state (no rewards, no actions yet)
• Definition of a Markov Process: (S, P), where S is the set of states (s ∈ S) and P is the dynamics/transition model specifying P(s_{t+1} = s' | s_t = s)
• If there is a finite number (N) of states, P can be expressed as an N × N matrix whose (i, j) entry is P(s_{t+1} = s_j | s_t = s_i)
Markov Chain
• Essentially just the sequence of random states
• The transition dynamics satisfy the Markov property (a small sampling sketch follows below)
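As a concrete illustration (not from the lecture), here is a minimal NumPy sketch of sampling a Markov chain from a hypothetical 3-state transition matrix; the matrix values and names are made up for illustration:
```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-state Markov chain; row i is P(next state | current state i)
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.8, 0.2],
              [0.1, 0.0, 0.9]])

def sample_chain(P, s0, T):
    """Sample a length-T trajectory of states from transition matrix P."""
    states = [s0]
    for _ in range(T):
        # Markov property: the next state depends only on the current state
        states.append(rng.choice(len(P), p=P[states[-1]]))
    return states

print(sample_chain(P, s0=0, T=10))
```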
MRP (Markov Reward Process)
• Markov Chain + rewards
• But still no actions
• Definition of a Markov Reward Process: (S, P, R, γ), where R is a reward function R(s) = E[r_t | s_t = s] and γ ∈ [0, 1] is a discount factor
• If there is a finite number (N) of states, R can be expressed as a vector
(it's just the expected reward we get for being in each state)
Return & Value Function
• Horizon (H)
◦ Number of time steps in each episode
◦ Can be infinite
◦ Otherwise the process is called a finite Markov reward process
• Return (G_t)
◦ Discounted sum of rewards from time step t to the horizon: G_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + ... + γ^{H-1} r_{t+H-1}
◦ The value function will be the average (expected) return starting in state s
◦ We care not only about the reward at the current time step but also about the returns from future time steps
• State Value Function (V(s))
◦ Expected return from starting in state s: V(s) = E[G_t | s_t = s] (a small sketch of computing a discounted return follows below)
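A tiny sketch of computing a discounted return for one episode; the reward sequence and γ are made-up illustrative values:
```python
def discounted_return(rewards, gamma):
    """G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for one episode."""
    G = 0.0
    for r in reversed(rewards):   # accumulate from the end of the episode backwards
        G = r + gamma * G
    return G

# e.g. rewards [1, 0, 0, 10] with gamma = 0.9 give 1 + 0.9**3 * 10 = 8.29
print(discounted_return([1, 0, 0, 10], gamma=0.9))
```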
Computing the Value of a Markov Reward Process
• Could estimate by simulation (Monte Carlo):
◦ Generate a large number of episodes
◦ Average the returns
◦ Concentration inequalities bound how quickly the empirical average concentrates to the expected value, i.e., how many simulations you would need for the empirical average to be close to the true expected value
◦ Requires no assumption about the Markov structure
▪ Without using the Markov structure, this treats the Markov Reward Process as just a way to generate sums of rewards: it simply sums up the rewards along each episode
• The Markov property yields additional structure
◦ The Markov structure (the future is independent of the past given the present) ⇒ we can decompose the value function into the immediate reward plus the discounted value of the next state
• The MRP value function satisfies V(s) = R(s) + γ Σ_{s'∈S} P(s'|s) V(s')
◦ In this formula, only V(s) is unknown; everything else (R, P, γ) is known (see the matrix-form sketch below)
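In matrix form the equation reads V = R + γPV, so for a finite MRP it can be solved exactly as V = (I - γP)^{-1} R. A minimal NumPy sketch, with illustrative argument names:
```python
import numpy as np

def mrp_value_analytic(P, R, gamma):
    """Solve V = R + gamma * P @ V exactly with a linear solve.

    P     : (N, N) transition matrix, P[i, j] = P(s_j | s_i)
    R     : (N,)   expected immediate reward in each state
    gamma : discount factor (must be < 1 for the solve to be well-posed in general)
    """
    N = P.shape[0]
    # (I - gamma * P) V = R   =>   V = (I - gamma * P)^{-1} R
    return np.linalg.solve(np.eye(N) - gamma * P, R)
```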
Iterative Algorithm for Computing Value of a MRP
• Dynamic programming
• Initialize V_0(s) = 0 for all s
• For k = 1 until convergence:
◦ For all s in S: V_k(s) = R(s) + γ Σ_{s'∈S} P(s'|s) V_{k-1}(s')
• Computational complexity: O(|S|^2) per iteration
Three ways to compute the value of a MRP (a dynamic-programming sketch follows below):
• Simulation
• Analytically
◦ Requires a finite set of states and a matrix-inversion step: V = (I - γP)^{-1} R
• Dynamic programming
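A minimal sketch of the dynamic-programming option, assuming the same array conventions as the analytic sketch above (P an N×N transition matrix, R an N-vector of expected rewards):
```python
import numpy as np

def mrp_value_dp(P, R, gamma, tol=1e-8):
    """Compute the MRP value function by repeated Bellman backups.

    Iterates V_k = R + gamma * P @ V_{k-1}, starting from V_0 = 0,
    until the largest change across states is below tol.
    """
    V = np.zeros(R.shape[0])           # V_0(s) = 0 for all s
    while True:
        V_new = R + gamma * P @ V      # O(|S|^2) work per sweep
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```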
MDP (Markov Decision Process)
• Markov Reward Process + actions
• Definition of a Markov Decision Process
• An MDP is a tuple (S, A, P, R, γ): the set of states, the set of actions, the dynamics model P(s'|s, a), the reward function R(s, a), and the discount factor
• In many cases we don't have perfect, deterministic models of the environment → the dynamics are stochastic processes
MDP Policies
• A policy specifies what action to take in each state
◦ Can be deterministic or stochastic → either a distribution over the next action or a deterministic mapping from states to actions
• For generality, consider the policy as a conditional distribution
◦ Given a state, it specifies a distribution over actions
• Policy: π(a|s) = P(a_t = a | s_t = s)
MDP + Policy
• MDP + π(a|s) = Markov Reward Process, with
◦ R^π(s) = Σ_{a∈A} π(a|s) R(s, a)
◦ P^π(s'|s) = Σ_{a∈A} π(a|s) P(s'|s, a)
• Implies we can use the same techniques to evaluate the value of a policy for an MDP as we use to compute the value of an MRP, by defining an MRP with R^π and P^π (a sketch follows below)
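A minimal sketch of collapsing an MDP plus a stochastic policy into the induced MRP; the array shapes and names are assumptions for illustration:
```python
import numpy as np

def mdp_plus_policy_to_mrp(P, R, pi):
    """Collapse an MDP and a stochastic policy into the induced MRP.

    P  : (A, S, S) dynamics, P[a, s, s2] = P(s2 | s, a)
    R  : (S, A)    reward for taking action a in state s
    pi : (S, A)    policy, pi[s, a] = pi(a | s)
    Returns (P_pi, R_pi), the MRP transition matrix and reward vector.
    """
    R_pi = np.sum(pi * R, axis=1)            # R^pi(s)    = sum_a pi(a|s) R(s, a)
    P_pi = np.einsum('sa,ast->st', pi, P)    # P^pi(s'|s) = sum_a pi(a|s) P(s'|s, a)
    return P_pi, R_pi
```
The returned (P_pi, R_pi) can then be passed to either MRP routine sketched earlier (mrp_value_analytic or mrp_value_dp) to obtain V^π.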
MDP Policy Evaluation, Iterative Algorithm
• Exactly the same as for an MRP!
• Initialize V_0(s) = 0 for all s
• For k = 1 until convergence:
◦ For all s in S: V^π_k(s) = Σ_{a∈A} π(a|s) [R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V^π_{k-1}(s')]
• This update is a Bellman backup for a particular policy
◦ Instantiate the reward function and dynamics by always picking the action that the policy would take → this evaluates the value of that particular policy
◦ The value of a state under this policy = the immediate reward from following the policy in the current state + the expected discounted sum of rewards from following the policy thereafter (a sketch follows below)
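A minimal sketch of iterative policy evaluation applied directly to the MDP arrays (same assumed shapes as above); it is equivalent to running the MRP dynamic-programming routine on the MRP induced by π:
```python
import numpy as np

def policy_evaluation(P, R, pi, gamma, tol=1e-8):
    """Iterative policy evaluation via the Bellman backup for policy pi.

    P : (A, S, S) dynamics, R : (S, A) rewards, pi : (S, A) stochastic policy.
    """
    V = np.zeros(R.shape[0])                          # V_0(s) = 0 for all s
    while True:
        # Q_k(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) V_{k-1}(s')
        Q = R + gamma * np.einsum('ast,t->sa', P, V)
        V_new = np.sum(pi * Q, axis=1)                # average over the policy's actions
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```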
MDP Control
• There exists a unique optimal value function
• The optimal policy for an MDP in an infinite-horizon problem is deterministic
• Stationary: it does not depend on the time step (because the horizon is infinite)
→ this is why it is sufficient to focus on deterministic, stationary policies
• Policy: a mapping from states to actions
• The optimal value function is unique, but the optimal policy may not be
◦ More than one policy can achieve the maximal expected discounted sum of returns,
but there is at least one optimal policy that achieves the maximum value for every state
Policy Search
• One option is to search over policies to compute the best one
• The number of deterministic policies is |A|^|S| (a brute-force sketch follows below)
• Policy iteration is generally more efficient than enumeration
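For a very small MDP, the enumeration option can be sketched directly; this is only to make the |A|^|S| blow-up concrete, and it reuses the policy_evaluation helper assumed above:
```python
import itertools
import numpy as np

def brute_force_policy_search(P, R, gamma):
    """Enumerate all |A|^|S| deterministic policies and keep the best one.

    Only feasible for tiny MDPs; shown to motivate policy iteration.
    """
    A, S = P.shape[0], R.shape[0]
    best_V, best_actions = None, None
    for actions in itertools.product(range(A), repeat=S):   # |A|^|S| candidates
        pi = np.eye(A)[list(actions)]                        # one-hot (S, A) policy
        V = policy_evaluation(P, R, pi, gamma)
        if best_V is None or np.all(V >= best_V):            # keep a dominating policy
            best_V, best_actions = V, actions
    return best_actions, best_V
```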
MDP Policy Iteration (PI)
• Set i = 0
• Initialize π_0(s) randomly for all states s
• While i == 0 or ||π_i - π_{i-1}||_1 > 0 (the L1 norm measures whether the policy changed for any state):
◦ V^{π_i} ← MDP policy evaluation of π_i
◦ π_{i+1} ← policy improvement (defined below)
◦ i ← i + 1
State-action Value (Q)
• Q^π(s, a) = R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V^π(s')
◦ Take action a, then follow the policy π afterwards; V^π(s') is the state value under π
Policy Improvement
• Compute the state-action value of policy π_i:
◦ For s in S and a in A: Q^{π_i}(s, a) = R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V^{π_i}(s')
• Compute the new policy π_{i+1} for all s in S: π_{i+1}(s) = arg max_a Q^{π_i}(s, a) (a full policy-iteration sketch follows below)
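Putting evaluation and improvement together, a minimal policy-iteration sketch under the same assumed array shapes (and reusing the policy_evaluation helper):
```python
import numpy as np

def policy_iteration(P, R, gamma, tol=1e-8):
    """Policy iteration: alternate policy evaluation with greedy improvement.

    P : (A, S, S) dynamics, R : (S, A) rewards. Returns a deterministic
    policy (one action index per state) and its value function.
    """
    A, S = P.shape[0], R.shape[0]
    policy = np.zeros(S, dtype=int)              # pi_0: arbitrary deterministic policy
    while True:
        pi = np.eye(A)[policy]                   # one-hot (S, A) form of the policy
        V = policy_evaluation(P, R, pi, gamma, tol)
        # Q^{pi_i}(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) V^{pi_i}(s')
        Q = R + gamma * np.einsum('ast,t->sa', P, V)
        new_policy = np.argmax(Q, axis=1)        # pi_{i+1}(s) = argmax_a Q^{pi_i}(s, a)
        if np.array_equal(new_policy, policy):   # stop when the policy no longer changes
            return policy, V
        policy = new_policy
```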
Delving Deeper Into Policy Improvement Step
• Suppose we take the action π_{i+1}(s) for one step, then follow π_i forever after
◦ Our expected sum of rewards is at least as good as if we had always followed π_i: V^{π_i}(s) ≤ max_a Q^{π_i}(s, a)
• But the new proposed policy is to always follow π_{i+1}; the monotonic improvement result below shows that this is at least as good as well
Monotonic Improvement in Policy
• V^{π_{i+1}}(s) ≥ V^{π_i}(s) for all states s, with strict inequality in some state if π_i is suboptimal
PI & VI
Policy Iteration (PI)
MDP : Computing Optimal Policy and Optimal Value
• Policy iteration computes the optimal value and policy
• Value iteration is another technique
◦ Idea: we always have a policy in hand; maintain the optimal value of starting in state s if we have a finite number of steps k left in the episode
◦ Iterate to consider longer and longer episodes
Bellman Equation and Bellman Backup Operators
• The value function of a policy must satisfy the Bellman equation: V^π(s) = R^π(s) + γ Σ_{s'∈S} P^π(s'|s) V^π(s')
• Bellman backup operator B
◦ Applied to a value function
◦ Returns a new value function
◦ Improves the value if possible
◦ BV(s) = max_a [R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V(s')] yields a value function over all states s
Value Iteration (VI)
• Set k = 1
• Initialize V_0(s) = 0 for all states s
• Loop until [finite horizon reached | convergence]:
◦ For each state s: V_{k+1}(s) = max_a [R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V_k(s')]
◦ View as a Bellman backup on the value function: V_{k+1} = BV_k (a sketch follows below)
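A minimal value-iteration sketch under the same assumed array shapes, applying the Bellman backup operator until the value function stops changing:
```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Value iteration: repeatedly apply the Bellman backup operator B.

    P : (A, S, S) dynamics, R : (S, A) rewards. Returns the (approximately)
    optimal value function and a greedy deterministic policy for it.
    """
    V = np.zeros(R.shape[0])                              # V_0(s) = 0 for all s
    while True:
        # (BV)(s) = max_a [ R(s, a) + gamma * sum_s' P(s'|s, a) V(s') ]
        Q = R + gamma * np.einsum('ast,t->sa', P, V)
        V_new = np.max(Q, axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, np.argmax(Q, axis=1)            # greedy policy w.r.t. V
        V = V_new
```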