In Dynamic Programming (DP), we solve a Markov Decision Process (MDP) using value iteration and policy iteration. Both of these techniques require the transition and reward probabilities to find the optimal policy. But how can we solve an MDP when we don’t know the transition and reward probabilities? In that case, we use the Monte Carlo method. The Monte Carlo method requires only sample sequences of states, actions, and rewards. Monte Carlo methods are applied only to episodic tasks. Since Monte Carlo doesn’t require a model, it is called a model-free learning algorithm.

A value function is basically the expected return from a state S under a policy π. Here, instead of the expected return, we use the mean return.

In Monte Carlo prediction, we approximate the value function by taking the mean return instead of the expected return.
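To see why the mean return is a sensible stand-in for the expected return, here is a tiny self-contained sketch. The 0.7/0.3 reward distribution is made up purely for illustration; the point is that the average of sampled returns approaches the true expectation:

```python
import random

random.seed(0)

# Hypothetical returns sampled from one state under a fixed policy:
# with probability 0.7 the episode yields a return of 1, otherwise -1,
# so the true expected return is 0.7 * 1 + 0.3 * (-1) = 0.4
sample_returns = [1 if random.random() < 0.7 else -1 for _ in range(100000)]

# Monte Carlo estimate: the mean of the sampled returns
estimate = sum(sample_returns) / len(sample_returns)
print(estimate)  # close to 0.4
```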

Using Monte Carlo prediction, we can estimate the value function of any given policy. The steps involved in the Monte Carlo prediction are very simple and are as follows:

- First, we initialize our value function with random values.
- Then we initialize an empty list called returns to store our returns.
- Then, for each state in the episode, we calculate the return.
- Next, we append the return to our returns list.
- Finally, we take the average of the returns as our value function.
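The steps above can be sketched as a short loop. The two-state episode generator below is a made-up stand-in for a real environment, just to exercise the procedure:

```python
from collections import defaultdict
import random

random.seed(1)

# Hypothetical episodic task used only to exercise the loop: from state 'A'
# we get a random reward of 0 or 1 and move to 'B'; from 'B' we get reward 1
# and the episode ends. Each entry is a (state, reward) pair.
def generate_episode():
    return [('A', random.choice([0, 1])), ('B', 1)]

value_table = defaultdict(float)  # running mean return per state
counts = defaultdict(int)         # number of returns averaged per state

for _ in range(10000):
    episode = generate_episode()
    G = 0
    # walk the episode backwards so G is the return from each state onwards
    for state, reward in reversed(episode):
        G += reward
        counts[state] += 1
        # incremental average of the returns seen for this state
        value_table[state] += (G - value_table[state]) / counts[state]

print(value_table['A'])  # close to 1.5 (0.5 expected from 'A' plus 1 from 'B')
print(value_table['B'])  # exactly 1.0
```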

The following flowchart makes this clearer:

The Monte Carlo prediction algorithm is of two types:

- First visit Monte Carlo
- Every visit Monte Carlo

### First visit Monte Carlo

As we have seen, in Monte Carlo methods we approximate the value function by taking the average return. In the first visit MC method, however, we average the return only for the first time the state is visited in an episode. For example, consider an agent playing the snakes and ladders game: there is a good chance the agent will return to a state if it is bitten by a snake. When the agent revisits the state, we don’t include that return in the average. We consider a return only for the agent’s first visit to the state.

### Every visit Monte Carlo

In every visit Monte Carlo, we average the return every time the state is visited in an episode. Consider the same snakes and ladders example: if the agent returns to the same state after a snake bites it, we still include that return in the average, even though the agent is revisiting the state. In this case, we average the returns over every visit to the state.
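The difference between the two methods shows up on a single hand-made episode in which a state 's' is visited twice. The states and rewards here are invented purely for illustration:

```python
# A hand-made episode where state 's' occurs twice.
# Each entry is (state, reward received on leaving that state).
episode = [('s', 0), ('x', 1), ('s', 0), ('goal', 1)]

# Compute the return from each time step (sum of rewards from that step on)
returns = []
G = 0
for _, reward in reversed(episode):
    G += reward
    returns.append(G)
returns.reverse()  # returns[t] is the return from step t; here [2, 2, 1, 1]

states = [s for s, _ in episode]

# Every visit MC: average the return over ALL occurrences of 's'
every_visit = sum(r for s, r in zip(states, returns) if s == 's') / states.count('s')

# First visit MC: use only the FIRST occurrence of 's' in the episode
first_visit = returns[states.index('s')]

print(first_visit)  # 2
print(every_visit)  # 1.5
```

With more episodes, both estimates converge to the true value function; they simply weight the repeated visits differently.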

### Let’s play Blackjack with Monte Carlo

Now let’s understand Monte Carlo better with the Blackjack game. Blackjack, also called 21, is a popular card game played in casinos. The goal of the game is to have the sum of your cards be close to 21 without exceeding it. The value of the cards J, K, and Q is 10. The value of an ace can be 1 or 11, depending on the player’s choice. The value of the rest of the cards (2 to 10) is the number they show.

The rules of the game are very simple:

- The game can be played with one or many players and one dealer.
- Each player competes only with the dealer and not another player.
- Initially, a player is given two cards. Both of these cards are face up, that is, visible to others.
- A dealer is also given two cards. One card is face up and the other is face down. That is, the dealer only shows one of his cards.
- If the sum of a player’s cards is 21 immediately after receiving two cards (say a player has received a jack and an ace, which is 10 + 11 = 21), then it is called a natural or Blackjack and the player wins.
- If the dealer’s sum of cards is also 21 immediately after receiving two cards, then it is a draw, as both of them have 21.
- In each round, the player decides whether to take another card to bring the sum of the cards close to 21.
- If a player asks for a card, it is called a hit.
- If a player doesn’t want another card, it is called a stand.
- If the sum of a player’s cards exceeds 21, it is called a bust, and the dealer wins the game.
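The card-value rules above can be sketched as a small helper. The function names hand_value and is_natural are ours, not part of any library, and an ace is represented as 1 with J/Q/K already counted as 10:

```python
def hand_value(cards):
    """Blackjack value of a hand; cards are 1 (ace) through 10,
    with J, Q, and K already counted as 10."""
    total = sum(cards)
    # count one ace as 11 (i.e. add 10) if that keeps the hand at 21 or below
    if 1 in cards and total + 10 <= 21:
        return total + 10
    return total

def is_natural(cards):
    # a two-card 21 (an ace plus a ten-valued card) is a natural, or Blackjack
    return len(cards) == 2 and hand_value(cards) == 21

print(hand_value([1, 10]))    # 21 - a natural
print(hand_value([10, 7]))    # 17
print(hand_value([10, 7, 7])) # 24 - a bust
```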

Let’s understand Blackjack better by playing. I’ll let you be the player and I’ll be the dealer:

In the preceding diagram, we have one player and a dealer. Both of them are given two cards. Both of the player’s two cards are face up (visible) while the dealer has one card face up (visible) and the other face down (invisible).

In the first round, you are given two cards, say a jack and a 7, which is 10 + 7 = 17, and I, as the dealer, show you only one card, a 2. My other card is face down. Now you have to decide whether to hit (take another card) or stand (take no more cards). If you choose to hit and receive a 3, you get 10 + 7 + 3 = 20, which is close to 21, and you win:

But if you receive, say, a 7, then 10 + 7 + 7 = 24, which exceeds 21. That is a bust and you lose the game. If you decide to stand with your initial cards, you have only 10 + 7 = 17. Then we check the dealer’s sum of cards: if it is greater than 17 and does not exceed 21, the dealer wins; otherwise you win:

The rewards here are:

- +1 if the player wins the game
- -1 if the player loses the game
- 0 if the game is a draw

The possible actions are:

- Hit: If the player needs a card
- Stand: If the player doesn’t need a card

The player has to decide the value of an ace. If the player’s sum of cards is 10 and the player gets an ace after a hit, he can count it as 11, since 10 + 11 = 21. But if the player’s sum of cards is 15 and he gets an ace after a hit, counting it as 11 gives 15 + 11 = 26, which is a bust. An ace the player can count as 11 without going bust is called a usable ace; an ace that would cause a bust if counted as 11 is called a nonusable ace.
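This ace rule boils down to a one-line decision. The helper below is hypothetical, not part of the Gym environment:

```python
def ace_value(current_sum):
    # counting the ace as 11 is the "usable" case: only safe if it doesn't bust
    return 11 if current_sum + 11 <= 21 else 1

print(ace_value(10))  # 11 -> 10 + 11 = 21
print(ace_value(15))  # 1  -> counting it as 11 would give 26, a bust
```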

Now we will see how to implement Blackjack using the first visit Monte Carlo algorithm. First, we import the necessary libraries:


```python
import gym
import numpy as np
from matplotlib import pyplot
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from collections import defaultdict
from functools import partial
%matplotlib inline
plt.style.use('ggplot')
```

Next, we simulate the Blackjack environment:

```python
env = gym.make('Blackjack-v0')
```

Then we define the policy function, which takes the current state and checks whether the score is greater than or equal to 20: if it is, we return 0, else we return 1. That is, if the score is greater than or equal to 20 we stand (0), otherwise we hit (1):

```python
def sample_policy(observation):
    score, dealer_score, usable_ace = observation
    return 0 if score >= 20 else 1
```

We define a function called generate_episode for generating episodes:

```python
def generate_episode(policy, env):
    # we initialize the lists for storing states, actions, and rewards
    states, actions, rewards = [], [], []

    # initialize the gym environment
    observation = env.reset()

    while True:
        # append the state to the states list
        states.append(observation)

        # now, we select an action using the given policy and append it to the actions list
        action = policy(observation)
        actions.append(action)

        # we perform the action in the environment, move to the next state,
        # and receive the reward
        observation, reward, done, info = env.step(action)
        rewards.append(reward)

        # break if we have reached a terminal state
        if done:
            break

    return states, actions, rewards
```

Now that we know how to generate an episode, we will see how to perform first visit MC prediction:

```python
def first_visit_mc_prediction(policy, env, n_episodes):
    # first, we initialize the empty value table as a dictionary
    # for storing the values of each state
    value_table = defaultdict(float)
    N = defaultdict(int)

    for _ in range(n_episodes):
        # next, we generate an episode and store its states and rewards
        states, _, rewards = generate_episode(policy, env)
        returns = 0

        # then we walk the episode backwards, storing the reward in R and the
        # state in S, and accumulating the return as a sum of rewards
        for t in range(len(states) - 1, -1, -1):
            R = rewards[t]
            S = states[t]
            returns += R

            # to perform first visit MC, we update a state's value only if this
            # is the first time the state occurs in the episode, taking the
            # running average of the returns observed for it
            if S not in states[:t]:
                N[S] += 1
                value_table[S] += (returns - value_table[S]) / N[S]

    return value_table
```

```python
value = first_visit_mc_prediction(sample_policy, env, n_episodes=500000)
```

Let us look at the first few elements of the value table:

```python
for i in range(10):
    print(value.popitem())
```

```
((7, 1, False), -0.6222222222222212)
((4, 7, False), -0.4941634241245135)
((18, 6, False), -0.6950909090909095)
((6, 7, False), -0.5026595744680847)
((17, 8, True), -0.38016528925619797)
((20, 6, False), 0.6994790971016418)
((8, 9, False), -0.5072024260803637)
((4, 4, False), -0.5106382978723403)
((13, 1, False), -0.670459424645771)
((12, 10, True), -0.28959276018099556)
```

We define the function plot_blackjack for plotting the value function, so we can see how our value function converges:

```python
def plot_blackjack(V, ax1, ax2):
    player_sum = np.arange(12, 21 + 1)
    dealer_show = np.arange(1, 10 + 1)
    usable_ace = np.array([False, True])

    state_values = np.zeros((len(player_sum), len(dealer_show), len(usable_ace)))

    for i, player in enumerate(player_sum):
        for j, dealer in enumerate(dealer_show):
            for k, ace in enumerate(usable_ace):
                state_values[i, j, k] = V[player, dealer, ace]

    X, Y = np.meshgrid(player_sum, dealer_show)

    ax1.plot_wireframe(X, Y, state_values[:, :, 0])
    ax2.plot_wireframe(X, Y, state_values[:, :, 1])

    for ax in ax1, ax2:
        ax.set_zlim(-1, 1)
        ax.set_ylabel('player sum')
        ax.set_xlabel('dealer showing')
        ax.set_zlabel('state-value')
```

```python
fig, axes = pyplot.subplots(nrows=2, figsize=(5, 8),
                            subplot_kw={'projection': '3d'})
axes[0].set_title('value function without usable ace')
axes[1].set_title('value function with usable ace')
plot_blackjack(value, axes[0], axes[1])
```