## BlackJack with Monte Carlo Prediction

In Dynamic Programming (DP), we solve the Markov Decision Process (MDP) by using value iteration and policy iteration. Both of

Learning from human preference is a major breakthrough in reinforcement learning (RL). The algorithm was proposed by researchers at OpenAI

Are you a fan of the game chess? If I asked you to play chess, how would you play the

In TRPO, we improve the policy subject to a constraint: the KL divergence between the old policy and the new policy must be less than some constant.
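
As a rough sketch of that constraint, the update is usually written as the optimization problem below. The notation is assumed here rather than taken from the post: $\pi_{\theta_{\text{old}}}$ is the old policy, $\pi_\theta$ the new one, $A(s,a)$ an advantage estimate under the old policy, and $\delta$ the constant bounding the KL divergence.

$$
\max_{\theta}\; \mathbb{E}_{s,a \sim \pi_{\theta_{\text{old}}}}\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, A(s,a)\right]
\quad \text{subject to} \quad
\mathbb{E}_{s}\left[D_{\mathrm{KL}}\big(\pi_{\theta_{\text{old}}}(\cdot \mid s)\,\|\,\pi_\theta(\cdot \mid s)\big)\right] \le \delta
$$

A smaller $\delta$ keeps each new policy closer to the old one, at the cost of smaller improvement steps.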

In this post, we will look into Q-learning, a very popular off-policy TD control algorithm. Q-learning is simple and widely used.

We will learn a new technique called hindsight experience replay (HER), proposed by OpenAI researchers for dealing with sparse rewards.

Reinforcement learning (RL) is a branch of machine learning where the learning occurs via interacting with an environment. It is