Reinforcement Learning for the Blackjack

Overview

Reinforcement Learning for Blackjack

Author: ZHA Mengyue

Math Department of HKUST

Problem Statement

We study how to play Blackjack with reinforcement learning. The prediction methods used to update the Q-value function are Monte Carlo, Q Learning and Temporal Difference. We also test the algorithms under different combinations of (M, N), where M is the number of decks and N is the number of people at the table: N-1 players and 1 dealer. For each configuration, we find the optimal policy after a number of iterations. The outcomes of the three prediction methods are compared through visualizations and tables.

Since the detailed rules vary a lot between casinos and regions, we describe the ones adopted in the code here. They basically follow the rules in Sutton's book (Example 5.1, p.93, Chapter 5).

Card Count:

  • 2-9: the number on the cards
  • Jack, Queen, King: 10
  • Ace: 1 or 11, whichever gives the player more points without exceeding 21
  • Jokers: not used in Blackjack
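The Ace rule above can be sketched as a small helper. This is only an illustration, not the repository's Player.call_points(); the integer card encoding (1 for an Ace, 10 for Jack, Queen and King) and the function name hand_points are assumptions made for this sketch.

```python
def hand_points(cards):
    """Return (points, usable_ace) for a list of card values.

    Assumed encoding for this sketch: 1 is an Ace, 2-10 are number cards,
    and Jack, Queen, King are all stored as 10.
    """
    total = sum(cards)
    # Count a single Ace as 11 only if the hand stays at 21 or below.
    if 1 in cards and total + 10 <= 21:
        return total + 10, True
    return total, False


print(hand_points([1, 10]))    # an Ace and a 10-card
print(hand_points([1, 5, 9]))  # here the Ace has to count as 1
```

At most one Ace can count as 11, since two Aces at 11 would already exceed 21.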

Game Initialization

Cards Initialization

We consider the case where each player competes independently against the dealer. The game starts with two cards dealt to each player and to the dealer. All cards dealt during initialization are face up except the second one dealt to the dealer.

Instant Wins

If a player has 21 after initialization (an Ace and a 10-card), it is called a natural and that player wins unless the dealer also has a natural. If both a player and the dealer have a natural, the game is a draw.

Game On

The players go first:

Each player in turn requests additional cards one at a time (Hit) until the player chooses to stop (Stick) or the points after the last hit exceed 21 (Bust); then the next player's turn begins. A player who goes bust loses immediately; otherwise the outcome is decided after the dealer's turn.

If all players go bust, the dealer wins immediately regardless of the dealer's own points. If at least one player sticks without going bust, the dealer's turn begins. The dealer sticks on any sum of 17 or greater and hits otherwise. Note that the dealer's strategy is fixed and involves no choice.
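The dealer's fixed rule is simple enough to sketch in a few lines. The code below is a hedged illustration and not the repository's implementation: the names best_total and dealer_play, the integer card encoding (Ace stored as 1) and the draw callable are all assumptions.

```python
def best_total(cards):
    """Best hand total: count one Ace (stored as 1) as 11 when that stays <= 21."""
    total = sum(cards)
    return total + 10 if 1 in cards and total + 10 <= 21 else total


def dealer_play(cards, draw):
    """Apply the fixed dealer rule: hit while the total is below 17, then stick.

    `draw` is a hypothetical zero-argument callable that returns the next card.
    """
    cards = list(cards)
    while best_total(cards) < 17:
        cards.append(draw())
    return best_total(cards)


upcoming = iter([5, 10])
print(dealer_play([10, 2], lambda: next(upcoming)))
```

Passing the card source in as a callable keeps the rule itself independent of how the decks are modelled.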

Game over

We compare the points of each player who stuck without going bust against the dealer's points to determine the final reward. If the dealer goes bust, the surviving players win. Otherwise the final outcome (win, lose or draw) is determined by whose final sum is closer to 21.

Rewards

  • win: +1
  • lose: -1
  • draw: 0
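Putting the end-of-game comparison and the reward values together gives a function like the one below. It is a minimal sketch under the rules stated above; the name final_reward is hypothetical, and naturals are deliberately left out to keep it short.

```python
def final_reward(player_points, dealer_points):
    """Reward for one player at the end of a game (naturals not handled here)."""
    if player_points > 21:
        return -1  # the player went bust and loses immediately
    if dealer_points > 21:
        return 1   # the dealer went bust, surviving players win
    if player_points > dealer_points:
        return 1   # the player is closer to 21
    if player_points < dealer_points:
        return -1  # the dealer is closer to 21
    return 0       # equal sums: draw


print(final_reward(20, 18), final_reward(17, 19), final_reward(19, 19))
```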

Game Implementation Details

All rewards within a game are zero and we use the discount factor $\gamma=1$, which means the terminal reward is also the return.
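To spell this out, write $R_{t+1}, \dots, R_T$ for the rewards received after time $t$ in a game that ends at time $T$ (standard notation from Sutton's book, not taken from the code). Then

$$G_t = \sum_{k=0}^{T-t-1} \gamma^{k} R_{t+k+1} = R_T \quad \text{since } \gamma = 1 \text{ and } R_{t+1} = \dots = R_{T-1} = 0.$$

So every state-action pair visited in a game receives the same return: +1, -1 or 0.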

State:

  • (player's card points, dealer's showing card points)

Action:

  • hit: 0
  • stick: 1

Decks:

Denoted by the terminal input variable m (e.g. --m=2 means two decks are used in the game). To use an infinite deck, i.e. drawing with replacement, pass --m=0, because the code treats 0 decks as an infinite deck.

To make sure there are always enough cards, the decks are automatically reinitialized once the number of cards left is smaller than $m \times 52 \times 0.6$, where $m$ is the number of decks. Note that with infinite decks it is impossible to keep track of the cards already dealt.
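The deck handling described above could look roughly like the following. This is a sketch, not the repository's Deck class: the name SketchDeck, the seed argument and the integer card encoding are assumptions. Only the threshold $m \times 52 \times 0.6$ and the meaning of m=0 come from the text.

```python
import random

# One suit with Jack, Queen, King stored as 10 (an encoding assumed for this sketch).
ONE_SUIT = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10]


class SketchDeck:
    def __init__(self, m, seed=None):
        self.m = m  # number of decks; 0 means infinite decks
        self.rng = random.Random(seed)
        self.cards = []
        if m > 0:
            self._reset()

    def _reset(self):
        """Rebuild and shuffle all m decks (4 suits of 13 cards each)."""
        self.cards = ONE_SUIT * 4 * self.m
        self.rng.shuffle(self.cards)

    def pop(self):
        if self.m == 0:
            # Infinite decks: draw with replacement, nothing to keep track of.
            return self.rng.choice(ONE_SUIT)
        if len(self.cards) < self.m * 52 * 0.6:
            self._reset()  # too few cards left, reinitialize the decks
        return self.cards.pop()


deck = SketchDeck(2, seed=0)
print([deck.pop() for _ in range(5)], len(deck.cards))
```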

Homework Statement

Assume that in the Blackjack game, there are $m$ decks of cards, and $n$ players (a dealer vs $n-1$ players). The rules of the game are explained above.

(1) Find the optimal policy for the Blackjack, when m=inf, n=2. You can use any of the methods learned so far in class (e.g. Monte Carlo, TD, or Q-Learning). If you use more than one method, do they reach the same optimal policy?

(2) Visualise the value functions and policy as done in Figures 5.1 and 5.2 in Sutton's book.

(3) Redo (1) for different combinations of (m,n), e.g. $m=6, 3, 1$, and $n=3,4,6$. What are the differences?

Implementation

File Structure

  • main.py: the main script. It takes command-line arguments for the number of decks $m$ (--m) and the number of people $n$ (--n), i.e. one dealer and $n-1$ players.

    NOTE: main.py accepts a list of m values and a list of n values and runs experiments on the combinations of these m's and n's. When you run main.py, it builds several instances under the corresponding INSTANCE folder, where each instance is an experiment with one specific set of hyperparameters. We list the hyperparameters below and discuss them later.

    • m: number of decks
    • n: number of people
    • update: the method used to update the value or Q-value function, e.g. Monte Carlo, Q Learning and Temporal Difference
    • policy: the policy improvement strategy. Choices are the epsilon greedy policy and the best policy.

    Other hyperparameters are epochs, n_zeros and session.

  • config.json: stores the configuration. This config.json is only a template. We will create new ones for experiments with different hyperparameters combination later.

    • epochs: how many games of Blackjack are played to train the algorithm
    • update: the method used to update the q-value function
    • name: the name of the experiment
    • policy: policy method used for the experiment
    • n_zeros: a factor used to calculate the $\epsilon$ in epsilon greedy policy
  • deck.py: class Deck()

    • def __init__(): initialize the $m$ decks
    • def shuffle(): shuffle the decks
    • def pop(): pop a card and delete it from the decks
  • player.py: class Player()

    • def hit(): hit action
    • def call_points(): return the points the player currently has
  • game.py: class Game()

    • def __init__(): Initialize a game as described in the problem statement, game initialization section.
    • def step(): given the current state and action, return the next state and reward
  • utils.py

    • def MC(): Monte Carlo update function
    • def QL(): Q Learning update function
    • def TD(): Temporal Difference update function
    • def save_value(): save the Q value function in the form that every row is (player's points, dealer's points, action, value)
    • def save_win_records(): save the (state, action, value) pairs visited by a specific player
  • plot.py

    • def plot_single_player(): plot the (state, action, value) pairs visited by a specific player
    • def plot_state_action_value(): plot the value function learned
    • All pics created in this section will be stored in the path HOME+STORAGE+INSTANCE+pic
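The note on main.py says it accepts lists of m and n and runs every combination. A minimal sketch of how such arguments could be parsed is shown below. It assumes argparse with nargs="+"; the helper names parse_args and instance_names are hypothetical. The folder naming pattern is inferred from the path Blackjack/storage/m0n2/pic/ mentioned later in this README, and the real main.py may differ.

```python
import argparse
from itertools import product


def parse_args(argv=None):
    """Parse lists of deck counts (--m) and people counts (--n)."""
    parser = argparse.ArgumentParser(description="Blackjack experiments (sketch)")
    parser.add_argument("--m", type=int, nargs="+", default=[0],
                        help="numbers of decks, 0 means infinite decks")
    parser.add_argument("--n", type=int, nargs="+", default=[2],
                        help="numbers of people, one dealer plus n-1 players")
    return parser.parse_args(argv)


def instance_names(ms, ns):
    """One instance name per (m, n) combination, e.g. 'm0n2'."""
    return [f"m{m}n{n}" for m, n in product(ms, ns)]


args = parse_args(["--m", "6", "3", "1", "--n", "3", "4", "6"])
print(instance_names(args.m, args.n))
```

With nargs="+", both the form --m 6 3 1 and the form --m=2 are accepted.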

Example

  1. Prepare the environment

    conda create -n Blackjack python=3.6
    conda activate Blackjack

    Your working environment is now Blackjack. Let's install the necessary packages, all of which are listed in requirement.txt.

    pip install -r requirement.txt
    

    Now your environment should be fully ready.

  2. Experiments on a single Instance

    The following command plays Blackjack with m=2 decks and n=3 people, where two are players and one is the dealer.

    python main.py --m=2 --n=3
    

    Note that when $m=\infty$, we use --m=0 instead.

    python main.py --m=0 --n=2
  3. Experiments on instances of combinations of (m, n)

    You can also test combinations of (m, n) pairs, for example m = 6, 3, 1 and n = 3, 4, 6.

    python main.py --m 6 3 1 --n 3 4 6
  4. Experiments on $m=\infty$

    Passing --m=0 tells the code to use infinite decks in the game.

  5. The optimal policy

    We store the final Q-value function, and the optimal policy is derived from it by either the best policy or the epsilon greedy policy.

    The value.csv files are stored in the corresponding instance folder as:

    MC_best_value.csv

    MC_epsilon_value.csv

    QL_best_value.csv

    etc.
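As an illustration of how a policy can be read off a stored Q-value function, here is a hedged sketch that takes rows in the form (player's points, dealer's points, action, value) and picks the action with the highest value in each state. The function name greedy_policy, the assumption that the file has no header row, and the sample numbers are all made up for this example.

```python
import csv
import io


def greedy_policy(rows):
    """Map each (player points, dealer points) state to its highest-valued action."""
    best = {}
    for player, dealer, action, value in rows:
        state = (int(player), int(dealer))
        value = float(value)
        if state not in best or value > best[state][1]:
            best[state] = (int(action), value)
    return {state: action for state, (action, _) in best.items()}


# Illustrative values only, not results from the experiments.
sample = "20,10,0,-0.85\n20,10,1,0.43\n12,6,0,0.10\n12,6,1,-0.15\n"
policy = greedy_policy(csv.reader(io.StringIO(sample)))
print(policy)  # action 0 is hit, action 1 is stick
```

For a real file, replace the io.StringIO object with an open file handle on one of the value.csv files.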

Tabular Summary for the Experiments

Choices for policy update: policy=['best', 'epsilon']

Choices for policy evaluation (value function update): update=['MC', 'QL', 'TD']

  • best: best policy evaluation
  • epsilon: epsilon greedy policy evaluation
  • MC: Monte Carlo
  • QL: Q Learning
  • TD: Temporal Difference
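For readers who want the three update rules side by side, the sketch below writes them as tabular updates on a dictionary Q keyed by (state, action). It is not the code in utils.py. The step size alpha, the function signatures, and the reading of "TD" as on-policy Sarsa are assumptions. The discount factor is fixed to 1 as stated earlier.

```python
from collections import defaultdict

ACTIONS = (0, 1)  # hit: 0, stick: 1


def mc_update(Q, N, episode, final_reward):
    """Every-visit Monte Carlo: with gamma=1 the return is the final reward."""
    for state, action in episode:
        N[(state, action)] += 1
        # Incremental mean of the returns observed for this pair.
        Q[(state, action)] += (final_reward - Q[(state, action)]) / N[(state, action)]


def ql_update(Q, s, a, r, s_next, alpha, terminal):
    """Q Learning: bootstrap from the best action in the next state."""
    target = r if terminal else r + max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def td_update(Q, s, a, r, s_next, a_next, alpha, terminal):
    """Sarsa-style TD: bootstrap from the action actually taken next."""
    target = r if terminal else r + Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])


Q, N = defaultdict(float), defaultdict(int)
mc_update(Q, N, [((13, 10), 0), ((18, 10), 1)], final_reward=1)
print(dict(Q))
```

The only difference between ql_update and td_update is the bootstrap target: the maximum over next actions versus the value of the next action chosen by the current policy.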

Single Instance of $m=\infty$, $n=2$

| m=$\infty$, n=2 | MC | QL | TD |
|---|---|---|---|
| best policy | 39.9040% | 37.9700% | 37.1210% |
| epsilon greedy policy | 42.4840% | 41.3440% | 41.2620% |

Conclusions:

  • epsilon greedy policy outperforms best policy
  • The best update strategy is MC and TD has the lowest performance

Combination of m=[6, 3, 1], n=[3, 4, 6]

We summarize the performance of each (update, policy) combination in the tables below.

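| MC_best | n=3 | n=4 | n=6 |
|---|---|---|---|
| m=6 | 40.0435% | 40.1663% | 39.7334% |
| m=3 | 39.7110% | 39.7077% | 39.1590% |
| m=1 | 40.5960% | 40.2913% | 39.2028% |

| MC_epsilon | n=3 | n=4 | n=6 |
|---|---|---|---|
| m=6 | 42.0310% | 42.1147% | 42.5882% |
| m=3 | 42.4060% | 42.5710% | 42.1484% |
| m=1 | 42.5700% | 42.6407% | 42.6614% |

| QL_best | n=3 | n=4 | n=6 |
|---|---|---|---|
| m=6 | 38.935% | 38.5240% | 38.5292% |
| m=3 | 39.1540% | 38.4173% | 38.7022% |
| m=1 | 39.5825% | 39.3437% | 38.9810% |

| QL_epsilon | n=3 | n=4 | n=6 |
|---|---|---|---|
| m=6 | 41.3675% | 41.5430% | 41.4012% |
| m=3 | 41.5625% | 41.7900% | 41.3582% |
| m=1 | 41.8030% | 42.0723% | 41.7474% |

| TD_best | n=3 | n=4 | n=6 |
|---|---|---|---|
| m=6 | 39.3855% | 40.2017% | 39.5762% |
| m=3 | 39.9090% | 40.3023% | 39.6646% |
| m=1 | 39.4165% | 39.6960% | 40.2408% |

| TD_epsilon | n=3 | n=4 | n=6 |
|---|---|---|---|
| m=6 | 41.4880% | 41.0790% | 41.0342% |
| m=3 | 41.1925% | 41.0230% | 41.2132% |
| m=1 | 41.6990% | 41.3067% | 41.3138% |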

Conclusions

  • epsilon greedy policy outperforms best policy
  • The best update strategy is MC and TD has the lowest performance
  • For MC_best, the more players there are, the lower their chance of winning.
  • For MC_epsilon, if we view the values in the table as a matrix, the lower triangular part is greater than the upper triangular part. This means players enjoy a greater chance of winning when many players play with few decks (just one deck is perfect!).
  • The conclusions for QL_best and QL_epsilon are the same as for MC_epsilon.
  • For TD_best and TD_epsilon, the phenomenon seen in MC_epsilon is quite weak. Some combinations of $(m, n)$ in the upper triangular part perform quite well.
    • TD_best: (m=6, n=4), (m=3,n=4)
    • TD_epsilon: (m=6, n=3)

Testing

We provide useful test code and print statements, commented out with the triple-quote sign """ """, inside the code. If you would like to test the code at small scale, you can set epochs to 10 and n_zeros to 2, then uncomment the print statements in lines 79-81, 159-165, 172-181, 203-212 of main.py. You may also test objects like player, deck and game in the corresponding Python file after uncommenting the last few lines.

Hyperparameters

All settable hyperparameters except $m$ and $n$ are assigned by the instance-level config.json under the instance's folder.

Some hyperparameters have finitely many choices and are generated in main.py when the different instances are created. These hyperparameters are written into the instance-level config.json, which inherits from the template config.json (under the INSTANCE folder).

  • update: choices in ['MC', 'QL', 'TD']
  • name: choices in the combination of form 'update-epsilon' or 'update-best' for policy being epsilon greedy policy and best policy respectively.
  • policy: choices in ['epsilon_greedy_policy', 'best_policy']

We also have some higher-level hyperparameters that are assigned in the template config.json. Note that these hyperparameters are the same for all instances created by a single call to main.py. They are:

  • epochs: number of iterations.
  • n_zeros: a constant used to determine the value of $\epsilon$ in the epsilon greedy policy
  • session: denotes how often we summarize the performance of a given player in plot.py. For example, if session = 1000, we summarize its wins, losses and draws every 1000 actions.
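This README does not state how n_zeros determines $\epsilon$. One common choice, assumed here purely for illustration, is $\epsilon = n_0 / (n_0 + N(s))$, where $N(s)$ is the number of visits to state $s$, so that exploration decays as a state is seen more often. The function names below are hypothetical and the repository may use a different formula.

```python
import random


def epsilon_for(n_zeros, visits):
    """Assumed schedule: epsilon = n_zeros / (n_zeros + visits to this state)."""
    return n_zeros / (n_zeros + visits)


def epsilon_greedy(q_hit, q_stick, epsilon, rng=random):
    """Return 0 (hit) or 1 (stick); explore with probability epsilon."""
    if rng.random() < epsilon:
        return rng.choice((0, 1))
    # Greedy choice; ties go to stick in this sketch.
    return 0 if q_hit > q_stick else 1


print(epsilon_for(100, 0), epsilon_for(100, 100))
print(epsilon_greedy(q_hit=0.3, q_stick=0.1, epsilon=0.0))
```

Under this assumed schedule, a larger n_zeros keeps the player exploring for longer, and setting epsilon to 0 gives the deterministic best policy.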

Visualization

We show typical plots as examples. If you want to see more, please visit the subfolder at path = STORAGE/INSTANCE/pic

Visualization on m=inf, n=2

We only take update=MC as an example; refer to Blackjack/storage/m0n2/pic/ for the QL and TD outcomes.

Value Function Visualization

MC_best_value visualization

MC_epsilon_value visualization

Remark

Since I forgot to add the labels for the x-axis, y-axis and z-axis when running the experiments, their positions and labels are shown in the following Pseudo Value Function Plot. The axis arrangements of all figures in this repository follow the left-hand rule. You may refer to the following picture to identify the arrangement and meaning of the x, y and z axes.

Player Performance Visualization

Visualize MC

MC_best_player_1 visualization

MC_epsilon_player_1 vs. MC_best_player_1 visualization

We see clearly that under the MC update rule, the player with the epsilon greedy policy performs consistently better than the player with the deterministic best policy. The outcome shows that exploration is important!

Compare MC, QL, TD and best, epsilon

We draw the following conclusions from the player performance visualizations for update=[MC, QL, TD] and policy=[best, epsilon]:

  • epsilon greedy policy outperforms the best policy consistently no matter which update strategy we adopt.

  • For a fixed policy, the performances of update strategies are MC>QL>TD

    Our guess at the reason is that, since Blackjack has a relatively small state space and action space, some advantages of MC are maximized:

    • precise real returns without approximation
    • long sampled trajectories, which make it possible to remember the cards already dealt.

Citation

If you use my Blackjack in any context, please cite this repository:

@article{
  ZHA2021:RL_Blackjack,
  title={Reinforcement Learning for the Blackjack},
  author={ZHA Mengyue},
  year={2021},
  url={https://github.com/Dolores2333/Blackjack}
}

This work was done by ZHA Mengyue for Homework 1 of MATH6450I Reinforcement Learning, taught by Prof Bing-yi Jing at HKUST. Please cite the repository if you use the code and outcomes.
