Search

2강_Given a Model of the World

~Today’s Topic~

the problem of figuring out what is the right thing to do, when your actions may have delayed consequences
we may have to sacrifice immediate reward in order to maximize long-term reward

Markov Process or Markov Chain

Memoryless random process - sequence of random state with Markov property

Markov Process

future is independent of past given present
a finite or potentially infinite set of states
a dynamics model which specifies the probability for the next state given the previous state with no rewards, no actions yet
Definition of Markov Process
If finite number (N) of states, can express P as a matrix

Markov Chain

sort of just the sequence of random states
the transition dynamics satisfies this Markov property

MRP(Markov Reward Process)

Markov Chain + reward
but still no actions
Definition of Markov Reward Process
If finite number (N) of states, can express R as a vector
(it's just the expected reward we get for being in each state.)

Return & Value Function

Horizon
number of time steps in each episode
can be infinite
otherwise called finite Markov reward process
Return (G_t)
discounted sum of rewards from time step to horizion
average return of reward in state s
현재 시점 뿐 만 아니라 미래 시점의 return 값도 신경써야 함
State Value Function (V(s))
expected return from starting in state s

Computing the Value of a Markov Reward Process

Generate a large number of episodes
Average returns
Concentration inequalities bound how quickly average concentrates to expected value
Could estimate by simulation
how many simulations would you need to do in order for your empirical average to be close to the true expected value
requires no assumption of the Markov structure
the Markov structure : Markov Reward Process = just a way to estimate sums of returns- sums up rewards
Markov property yields additional structure
Markov structure (the present that the future is independent of the past given the present) ⇒ can decompose the value function
MRP value function satisfies →이 공식에서 V(s)만 찾으면 되고 나머지는 known 값

Iterative Algorithm for Computing Value of a MRP

Dynamic programming
Initialize V_0(s) = 0 for all s
For k=1 until convergence
for all s in S
Computational complexity :

three ways to try to compute the value of MRP

simulation
analytically
requires a step a finite set of states
dynamic programming

MDP(Markov Decision Process)

Markov Reward process + actions
Definition of Markov Decision Process
MDP is a tuple : (S, A, P, R, γ)
described as a tuple which is just the set of states, actions, dynamics model, rewards, and discount factor
there are a lot of cases where we don't have perfect models of the environment → there are stochastic processes

MDP Policies

Policy specifies what action to take in each state
can be deterministic or stochastic → have a distribution over in the next action, or have a deterministic mapping
For generality, consider as a conditional distribution
Given a state, specifies a distribution over actions
Policy :

MDP+Policy

MDP + Policy = MRP (by
Implies we can use same techniques to evaluate the value of a policy for a MDP as we could to compute the value of a MRP, by defining a MRP with R^π and P^π

MDP Policy Evaluation, Iterative Algorithm

exactly same with MRP!
Initialize V_0(s)=0 for all s
For k=1 until convergence
For all s in S
There is a Bellman backup for a particular policy
instantiate the reward function by always picking the action that the policy would take → learn what is the value of a particular policy
value of the state under this policy : the immediate reward I would get by following the policy in the current state + the expected discounted sum of rewards I get by following this policy

MDP Control

There are exists a unique optimal value function
Optimal policy for MDP in an infinite horizon problem is deterministic
Stationary - timestamp에 구애받지 X( because it’s “infinite horizon”)
→ focus on deterministic policy이 sufficient한 이유
Policy : a mapping be going to be a map from states to actions
The value function is unique but if there may be cases where it's not
the expected discounted sum of returns can be more than one
but at least one optimal policy which leads to the maximum value for that state

Policy Search

One option is searching to compute best policy
Number of deterministic policies is 〖|A|〗^(|S|)
Policy iteration is generally more efficient than enumeration

MDP Policy Iteration (PI)

set i=0
Initialize π_0(s) randomly for all states s
While i==0 or ||π_i - π_(i-1)|| > 0 (L1-norm, measures if the policy changed for any stae):

State-action Value (Q)

take action a, then follow the policy π, state value V_π(s)

Policy Improvement

Compute the state-action value of a policy π;
For s in S and a in A;
Compute new policy π_(i+1) for all s in S

Delving Deeper Into Policy Improvement Step

Suppose we take π_(i+1)(s) for one action, then the follow π_i forever
Our expected sum of rewards is at least as good as if we had always followed π_i
But new proposed policy is to always follow π_(i+1) ......

Monotonic Improvement in Policy

PI & VI

Policy Iteration (PI)

MDP : Computing Optimal Policy and Optimal Value

Policy iteration computes optimal value and policy
Value iteration is another technique
Idea : We always have a policy. Maintain optimal value of starting in a state s if have a finite number of steps k left in the episode
Iterate to consider longer and longer episodes

Bellman Equation and Bellman Backup Operators

Value function of a policy must satisfy the Bellman equsation
Bellman backup operator
Applied to a value function
Returns a new value function
Improves the value if possible
BV yields a value function over all states s

Value Iteration (VI)

Set k=1
Initialize V_0(s)=0 for all states s
Loop until [finite horizon, convergence]:
for each state s
View as Bellman backup on value function

Policy Iteration as Bellman Operations

Contraction Operator

Will Value Iteration Converge?