Lecture 3: Model-Free Policy Evaluation


• Recall

◦ Definition of Return G_t

◦ Definition of State Value Function: expected return from starting in state s under policy π.

◦ Definition of State-Action Value Function: expected return from starting in state s, taking action a, and then following policy π.
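For reference, these three definitions are usually written as below. This is the standard notation and is assumed here, since the formulas themselves are not reproduced in these notes; γ denotes the discount factor and r_t the reward received at step t.

```latex
\begin{align}
G_t &= r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \\
V^{\pi}(s) &= \mathbb{E}_{\pi}\left[ G_t \mid s_t = s \right] \\
Q^{\pi}(s, a) &= \mathbb{E}_{\pi}\left[ G_t \mid s_t = s,\ a_t = a \right]
\end{align}
```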

Estimating the expected return of a particular policy when we don't have access to the true MDP model.

Dynamic Programming

Monte Carlo policy evaluation

Policy evaluation when we don't have a model of how the world works

given on-policy samples

temporal difference (TD)

metrics to evaluate and compare algorithms

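Dynamic programming

Dynamic programming (DP) was the most commonly used approach to planning: given an MDP that is provided or already known, rewards are defined according to it and the agent is trained within that restricted environment.

Dynamic: solving problems that arise sequentially.

Dynamic programming: solving a sequentially arising problem by optimizing it mathematically. It amounts to breaking a large problem into smaller subproblems and solving those (divide & conquer).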
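As a rough illustration of planning with a known model, the sketch below runs iterative policy evaluation on a toy two-state chain. The function name `dp_policy_evaluation`, the transition matrix `P`, the reward vector `R`, and the toy numbers are all assumptions made for this example; they are not taken from the lecture.

```python
import numpy as np


def dp_policy_evaluation(P, R, gamma=0.9, tol=1e-8, max_iters=10_000):
    """Iterative policy evaluation for a fixed policy with a KNOWN model.

    P[s, s2]: probability of moving from s to s2 under the policy (assumed input).
    R[s]: expected immediate reward in state s under the policy (assumed input).
    """
    P = np.asarray(P, dtype=float)
    R = np.asarray(R, dtype=float)
    V = np.zeros(len(R))
    for _ in range(max_iters):
        # Bellman backup: the value of each state is expressed through
        # the (smaller) subproblem of its successor states' values.
        V_new = R + gamma * P @ V
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V


# Toy chain: state 0 stays put or moves to state 1 with probability 0.5 each;
# state 1 is absorbing and yields no reward.
P = [[0.5, 0.5], [0.0, 1.0]]
R = [1.0, 0.0]
V = dp_policy_evaluation(P, R, gamma=0.9)
```

Each sweep reuses the previous value estimates of successor states, which is the "solve small subproblems and combine them" idea in the note above. It only works because `P` and `R` are given.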

Model-free methods apply when the MDP is not given (i.e., it is unknown): the agent interacts directly with the environment to accumulate experience and keeps learning from it. In this process it optimizes the value function to find the optimal policy.

Two model-free methods are covered here: Monte Carlo learning and temporal difference learning.

• Monte Carlo methods learn directly from episodes of experience.

• Monte Carlo is model-free: no knowledge of MDP transitions / rewards.

• Monte Carlo learns from complete episodes, which means there is no bootstrapping.

• Uses the simplest possible idea: value = mean return.

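The agent learns about the environment through direct experience, episode by episode. It has no prior knowledge of the transitions / rewards and is only given observations and rewards.

After an episode ends, the mean of the observed returns is used as the value estimate.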
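The "value = mean return" idea can be sketched as follows. This is a minimal first-visit variant, which is an assumption: the notes do not say whether first-visit or every-visit averaging is meant. The function name `first_visit_mc`, the episode format (a list of `(state, reward)` pairs), and the toy episodes are hypothetical.

```python
from collections import defaultdict


def first_visit_mc(episodes, gamma=1.0):
    """Estimate V(s) as the mean return following the first visit to s.

    episodes: list of episodes, each a list of (state, reward) pairs
    (an assumed format for this sketch).
    """
    returns_sum = defaultdict(float)
    visit_count = defaultdict(int)
    for episode in episodes:
        # Compute the return G_t for every step by scanning backwards;
        # this needs the COMPLETE episode, so there is no bootstrapping.
        G = 0.0
        returns = [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            _, reward = episode[t]
            G = reward + gamma * G
            returns[t] = G
        # Record the return only for the first occurrence of each state.
        seen = set()
        for t, (state, _) in enumerate(episode):
            if state in seen:
                continue
            seen.add(state)
            returns_sum[state] += returns[t]
            visit_count[state] += 1
    return {s: returns_sum[s] / visit_count[s] for s in returns_sum}


# Three made-up episodes over states "A" and "B".
episodes = [
    [("A", 0.0), ("B", 1.0)],
    [("A", 0.0), ("B", 0.0)],
    [("B", 1.0)],
]
V = first_visit_mc(episodes, gamma=1.0)
```

No transition or reward model appears anywhere in the function; it only consumes sampled episodes.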

Obtain the final return → define the value function → the value is an expectation E, but here it is estimated differently, by the mean of sampled returns??

Bootstrapping?? → When updating an estimate, using one or more other estimates of the same kind.

• Temporal difference methods learn directly from episodes of experience.

• TD is model-free: requires no knowledge of MDP transitions / rewards.

• Unlike Monte Carlo, TD uses bootstrapping.
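To make the contrast with Monte Carlo concrete, here is a minimal sketch of tabular TD(0) policy evaluation. TD(0) is chosen as the simplest temporal difference variant, which is an assumption about what the notes intend. The function name `td0_policy_evaluation`, the step size `alpha`, the `(state, reward, next_state)` transition format, and the toy data are hypothetical.

```python
from collections import defaultdict


def td0_policy_evaluation(episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0) value estimation from sampled transitions.

    episodes: list of episodes, each a list of (state, reward, next_state)
    tuples, with next_state=None at termination (assumed format).
    """
    V = defaultdict(float)
    for episode in episodes:
        for state, reward, next_state in episode:
            # Bootstrapping: the target uses the current ESTIMATE of the
            # next state's value instead of waiting for the full return.
            next_value = 0.0 if next_state is None else V[next_state]
            td_target = reward + gamma * next_value
            V[state] += alpha * (td_target - V[state])
    return dict(V)


# The same made-up two-step episode, experienced twice.
episodes = [[("A", 0.0, "B"), ("B", 1.0, None)]] * 2
V = td0_policy_evaluation(episodes, alpha=0.5, gamma=1.0)
```

The update happens after every transition, so the estimate for "A" moves only once the estimate for "B" has changed. This is the bootstrapping that Monte Carlo avoids.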

Important Properties to Evaluate Policy Evaluation Algorithms

• Robustness to Markov assumption

• Bias/variance characteristics

• Data efficiency

• Computational efficiency