REINFORCEMENT LEARNING IN
GAME PLAYING
By
KEAN GIAP LIM
Bachelor of Science
Oklahoma State University
Stillwater, Oklahoma
1995
Submitted to the Faculty of the
Graduate College of the
Oklahoma State University
in partial fulfillment of
the requirements for
the Degree of
MASTER OF SCIENCE
December 1998
REINFORCEMENT LEARNING IN
GAME PLAYING
Thesis Approved:
Dean of the Graduate College
PREFACE
This study was conducted to understand the internal workings of reinforcement
learning. In a scene from the movie "Terminator II", Arnold Schwarzenegger tells
the little boy he is protecting from the Terminator, "My CPU is a neural net
computer. The more I interact with humans, the more I will learn and understand
about humans." Reinforcement learning (RL) is one mechanism that improves an
agent's intelligence by evaluating the feedback that it receives from the environment
with which it interacts. RL rewards well-chosen actions and punishes bad decisions.
The RL algorithm experimented with in this study is Q-learning. The agent
was given the task of learning to play the trivial game of tic-tac-toe. Without any
winning strategy encoded into it, the agent improved its move selection by
playing against its opponent. The first half of the study examined the parameters of
Q-learning. The second half of this study used a neural network to generalize the
agent's experience. The advantages and disadvantages of both generalized Q-learning
and Q-learning are discussed.
ACKNOWLEDGEMENTS
I would like to sincerely thank all my committee members: Dr. J. P. Chandler
(Chair), Dr. William Nick Street (ex-Chair), Dr. Blayne Mayfield, and Dr. Mansur
Samadzadeh for assisting with my research. Without their guidance, friendship, and
all the different kinds of help, I would not have completed this study. I would like
to extend my appreciation to the Computer Science Department for giving me the
opportunity to be a part of the graduate program.
I would also like to specially thank my mom and dad, who work very hard day
and night, to provide the opportunity for me to further my education at Oklahoma
State University. Their sacrifices, unlimited love, and support made my dream of
studying at the graduate school come true. I will never forget my sister, Lim Wei
Cheng, for putting up with me for six years of my educational life in Stillwater and
for the numerous suggestions that she gave me throughout this study.
Finally, I would like to express my appreciation for all the special people who work
at the Center for Computer Integrated Manufacturing, especially Dr. Manjunath
Kamath, who gave me the opportunity to be a part of this research team. Without
this opportunity, I would not have enjoyed and appreciated the value of carrying out
my research. My thanks also go to Partha Ramachandran, who introduced and helped
me greatly with LaTeX, and also for reviewing this report.
TABLE OF CONTENTS
I LITERATURE REVIEW 1
1.1 Introduction 1
1.2 Markov Decision Processes 2
1.3 Reinforcement Learning 3
1.3.1 Policy 5
1.3.2 Environment 5
1.3.3 Reward Function 6
1.3.4 Value Function 8
1.4 Exploration vs. Exploitation 10
1.5 Temporal Difference Learning 12
1.5.1 Q-Learning 13
1.5.2 SARSA 16
1.6 Generalization in Reinforcement Learning 17
1.7 Artificial Neural Networks 17
1.7.1 LMS, Delta Rule, ADALINE, or Widrow-Hoff Rule 24
1.7.2 Multilayer Neural Network 28
1.7.3 Backpropagation or Generalized Delta Rule 29
1.7.4 Batch and Online Training 32
1.8 TD(λ) 32
1.9 Game Playing 35
1.9.1 General 35
1.9.2 Reinforcement Learning in Game Playing 36
II RESEARCH OBJECTIVE AND METHODOLOGY 39
2.1 Research Objectives 39
2.2 Details of Implementation 40
2.3 Details of the Opponent 41
2.4 Details of Neural Network Training 42
2.5 Method of Analysis 44
III EMPIRICAL RESULTS 45
3.1 Bounded Random Walks 45
3.2 Tic-Tac-Toe 47
3.2.1 Experiment 1. Learning Rate 47
3.2.2 Experiment 2. Discount Rate 48
3.2.3 Experiment 3. Exploration Rate 50
3.2.4 Experiment 4. Q-learning Versus SARSA 52
3.3 Function Approximator 53
3.3.1 Experiment 1. Raw Board Representation 54
3.3.2 Experiment 2. Feature Selection 55
IV SUMMARY, CONCLUSIONS, AND RECOMMENDATIONS 59
4.1 Summary 59
4.2 Conclusions 61
4.3 Recommendations 62
4.4 Concluding Comment 63
APPENDICES:
A. GLOSSARY 68
B. SAMPLE PROGRAM CODE 70
LIST OF TABLES
I The process of backing up transition values 10
II Effect of discount on Q-value 15
III Facts about the most publicized chess match between a computer and a human 36
IV Performance of the tic-tac-toe opponent that is used in the experiments. Its performance is tested by playing against a random move generator 42
V Q-values generated by SARSA 46
VI Q-values generated by Q-learning 46
VII Q-learning versus SARSA: Mean equity of the first and next 25,000 games 53
LIST OF FIGURES
1 Reinforcement learning system 5
2 A tic-tac-toe example 6
3 Branch and bound heuristic search (left) and value updates of the second iteration of path ACDE (right) 9
4 The need for exploration 11
5 Q-learning: An off-policy TD control algorithm 14
6 An example MDP with rewards 15
7 SARSA: An on-policy TD control algorithm 16
8 Two-layer perceptron 18
9 The exclusive-or, a classification problem that is not linearly separable 23
10 The mean square error surface 25
11 Gradient descent orthogonal search direction 26
12 A small learning rate converges (top) and a slightly larger learning rate may diverge (bottom) 27
13 A multilayer feedforward neural network 28
14 The sigmoid threshold function 29
15 Garry Kasparov versus Deep Blue: Game 6 final position 37
16 Reinforcement learning simulation class model 41
17 Evaluation function neural network for a tic-tac-toe example 44
18 Bounded random walks 45
19 Mean square error: Effects of learning rate on the performance 47
20 Equity: Effects of learning rate on the performance 48
21 Equity: Discounted versus nondiscounted learning 49
22 Mean square error: Discounted versus nondiscounted learning 49
23 Mean square error: Exploration versus exploitation 51
24 Equity: Exploration versus exploitation 51
25 Equity: Q-learning versus SARSA 52
26 Mean square error: Q-learning versus SARSA 53
27 Equity: An 18-15-9 network trained with backpropagation is used to approximate the value function 54
28 Mean square error: An 18-15-9 network trained with backpropagation is used to approximate the value function 55
29 Mean square error of a network in which the input representation incorporated hand-selected features and was trained with backpropagation 56
30 Equity: A network in which the input representation incorporated hand-selected features (feature) and was trained with backpropagation compared to an ANN using the raw board representation (regbp) and a lookup table (table) 57
31 Number of losses: Comparing the results obtained by the lookup table to a network in which the input representation used hand-selected features 58
CHAPTER I
LITERATURE REVIEW
1.1 Introduction
In recent years, researchers from different fields have shown growing interest in the
field of reinforcement learning (RL). RL methods have been applied in many different
domains [9] as a practical computational tool for constructing autonomous systems
that improve themselves with experience. The report of the National Science Foundation
(NSF) Winter 1996 Workshop on RL [13] states that researchers are surprised by the
failures and successes of RL and that there are still many open questions to be answered.
Game playing has been an important topic in the Artificial Intelligence world and
much RL research has examined this problem domain. One success story that stands
out in the domain of game playing is Gerry Tesauro's backgammon program TD-Gammon
[32, 33, 34]. This program plays backgammon as well as the best human players.
Since then, many researchers have attempted to recreate the success of TD-Gammon
in other games such as Go and chess. However, these attempts have been less successful.
This thesis presents an experimental study of RL applied in the domain of game
playing. The study serves mainly as a theoretical tool for examining the principles
of agents learning to act, by constructing an RL agent that learns to play the game
of tic-tac-toe.
1.2 Markov Decision Processes
Each day we make many decisions, and today's decisions have impacts on tomorrow's
and tomorrow's on the day after tomorrow's. We observe the situation that we are
in at a point in time and choose one of the available actions. We, the decision maker,
receive an immediate reward or cost and evolve into a new situation at a subsequent
point in time. At this subsequent point in time, we repeat the same process over and
over again. This process is known as sequential decision making.
The key ingredients of this sequential decision making model are the following:
1. A set of decision epochs.
2. A set of system states.
3. A set of available actions.
4. A set of state- and action-dependent immediate rewards.
5. A set of state- and action-dependent transition probabilities.
At each decision epoch (or time), the system provides the decision maker with all
necessary information for choosing an action from the set of available actions in that
state. As a result of choosing an action in a state, two things happen: the decision
maker receives an immediate reward and the current state evolves into a new state
at the next decision epoch. As this process evolves through time, the decision maker
receives a sequence of rewards, some positive and some negative.
A Markov decision process (MDP) model [19] is one particular sequential decision
model whereby the set of available actions, the rewards, and the transition probabilities
depend only on the current state and action and not on states occupied and
actions chosen in the past. In other words, if the current state summarizes everything
important about the actions that produced it, then the sequential decision model is
said to have the Markov property.
MDPs are studied extensively in operations research and optimal control. If an MDP
has a finite number of states and a finite number of actions for each state, then it is a
finite MDP. The context of this study deals only with finite MDPs. The MDP is important
because it plays a critical role in the theory of reinforcement learning. For more
information about MDPs, please refer to [19].
Classical optimization methods for sequential decision problems, such as dynamic
programming, can compute an optimal solution. Dynamic programming is a term
that refers to the mathematical formulation of a sequential decision problem. Once the
problem is formulated, we can find the optimal solution, which tells
us which action to choose in each situation at each point in time. The formulation of the
problem requires the five ingredients mentioned above.
However, there are several disadvantages to this method. A major disadvantage
of dynamic programming is that it involves exhaustive sweeps through the state
space, which makes it very inappropriate for large problems. Secondly, this method
requires a complete specification of the transition probabilities of each state. This information
is normally not available a priori for the vast majority of practical problems.
On the other hand, such information can be estimated from experience through trial
and error with the system. This is the key idea in the field of reinforcement learning:
learning from interaction to achieve long-term goals.
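To make the dynamic-programming approach concrete, the sketch below runs value iteration on a tiny finite MDP. The states, actions, rewards, and transition probabilities here are invented purely for illustration; the algorithm sweeps the whole state space repeatedly, which is exactly the cost that makes the method impractical for large problems.

```python
# Value iteration on a tiny, made-up finite MDP (illustrative only).
# transitions[state][action] = list of (probability, next_state, reward).
transitions = {
    "s0": {"a": [(1.0, "s1", 0.0)], "b": [(1.0, "s2", 0.0)]},
    "s1": {"a": [(1.0, "goal", 1.0)]},
    "s2": {"a": [(1.0, "goal", 10.0)]},
    "goal": {},  # terminal state: no actions
}
gamma = 0.5  # discount rate

V = {s: 0.0 for s in transitions}
for _ in range(100):  # exhaustive sweeps through the state space
    delta = 0.0
    for s, actions in transitions.items():
        if not actions:
            continue
        best = max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
            for outcomes in actions.values()
        )
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < 1e-9:
        break

print(V["s0"])  # choosing "b" at s0 is better: 0 + 0.5 * 10 = 5.0
```

Note that computing these values required knowing every transition probability in advance; reinforcement learning, discussed next, estimates the same kind of values from experience instead.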
1.3 Reinforcement Learning
Reinforcement learning (RL) has attracted a lot of attention from researchers in
different fields, especially in the past ten years, because of reports of several
breakthroughs in learning different tasks using this method. As
mentioned in the introduction, Gerry Tesauro's TD-Gammon [32, 33, 34] learns the
game of backgammon and is capable of playing at a grandmaster level. Crites and
Barto [4] used Q-learning (explained later in Section 1.5.1) in an elevator scheduling
task. The average squared waiting time for passengers was approximately 7% less
than with the best alternative algorithm and less than half the squared waiting time of the
most frequently used elevator scheduling algorithms.
Another attractive property of this learning mechanism is that it enables the
agent to learn tasks autonomously. In reinforcement learning, the decision maker,
or learner, is called the agent. Autonomous learning means learning without
assistance from a teacher. Instead, an autonomous agent learns by experiencing the
task. The agent improves its knowledge by evaluating the feedback that it receives
from the environment for all chosen actions.
Reinforcement learning is studied predominantly in the field of animal behavioral
sciences. Reinforcement learning is a major part of the human learning process, thus
making it easy to understand. Do you remember how you learned to ride a bicycle
when you were young? You probably fell dozens of times before you successfully
learned the appropriate ways to steer, brake, and pedal the bicycle. The objective is
to learn to ride a bicycle without falling.
In reinforcement learning, the computer is given a goal to achieve. Reinforcement
learning then learns a mapping from situations to actions through trial-and-error interactions
with a dynamic environment, as depicted in Figure 1. The agent's goal is to
discover which actions yield the most reward by trying them, based on its experience
with the environment.
There are four fundamental parts in reinforcement learning. These are the policy,
the environment, the reward function, and the value function.
[Diagram: the environment sends a situation and a reward to the agent; the agent sends an action back to the environment.]
Figure 1. Reinforcement learning system
1.3.1 Policy
The policy π is the decision-making function of the agent, telling it what action to
perform in each state. A policy is a mapping from states to actions. In reinforcement
learning, we are trying to obtain the optimal policy

    π* = {(s_0, a_0), (s_1, a_1), ..., (s_{T-1}, a_{T-1}), s_T}

At each point of time, or epoch t, each pair (s_t, a_t) of the optimal policy π* tells us
that taking action a_t when the agent is in state s_t yields the highest value in the long
run [3, 19]. State s_T is the terminal state. The terminal state s_T is our goal.
This is the core of the reinforcement learning agent, because the other components
serve to improve the policy. Ultimately it is the policy itself that determines how well
the agent performs.
1.3.2 Environment
The environment is a simulation model with which the agent interacts. For example,
an RL game-playing agent interacts with its opponent to increase its experience. In
this case, the opponent is the environment. A decision-making agent seeks to achieve
its goal or goals despite uncertainty about its environment. Consider a tic-tac-toe
learning agent that is in state s_t with three possible actions a_1, a_2, and a_3 to choose
from. If it chooses action a_1, the opponent then has actions a_2 and a_3 to select from.
The uncertainty of the environment is thus the action that will be chosen by the opponent.
The agent is uncertain about the next state s_{t+1} because it does not know which
action will be chosen by the opponent, as depicted in Figure 2.
[Diagram: from state s_t at epoch t the agent chooses one of its available moves; the boards that could result from the opponent's reply form the uncertain next state s_{t+1}.]

Figure 2. A tic-tac-toe example
1.3.3 Reward Function
The reward function is also known as the reinforcement function or the utility function.
The reward function defines the goal of the reinforcement learning system. It
maps the state of the environment to a single number. When the agent achieves the
goal, the action that leads to the goal state is rewarded, or reinforced, with a
positive reward. On the other hand, if the agent makes a mistake, then the agent
is punished for making that bad decision. This punishment is also referred to as a
negative reward. For instance, if the goal of a tic-tac-toe agent is to learn to win, a
positive reward of 1 is assigned for taking the action that leads to the winning state.
Similarly, the move that loses a game is punished with a negative reward of -1. So
the reward function for a tic-tac-toe agent is

             +1   win
    R(x) =    0   draw
             -1   loss
              0   otherwise
where win, draw, and loss are terminal states, and otherwise defines the reward for
all other intermediate moves of a game. As you may have noticed, all the other
intermediate actions are neither punished nor rewarded. Let us take the game of
chess as an example using the same reward function. The game of chess is won by
good defenses and traps throughout the game. So what is wrong with assigning a
positive reward for a good chess move such as a knight fork (an attacking
move where the knight is in a position to attack more than one enemy piece)? In
this case, the agent will choose the intermediate move that has the highest immediate
reward. As a result, the agent prefers actions that are advantageous in the short term,
not actions that lead to a win. Unless one of the agent's goals is to learn knight forks
instead of winning the game, the fork should not be encoded in the reward function.
It is crucial to stress again the importance of rewarding and punishing only
actions that meet the goal or goals. This is the reason why the intermediate moves
are neither rewarded nor punished.
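The piecewise reward function above translates directly into code. In the sketch below, the string outcome labels are a hypothetical encoding chosen purely for illustration.

```python
def reward(outcome):
    """Reward for a tic-tac-toe agent: only terminal outcomes matter.

    `outcome` is one of "win", "draw", "loss", or None for any
    intermediate position (an invented encoding for illustration).
    """
    if outcome == "win":
        return 1
    if outcome == "loss":
        return -1
    # Draws and all intermediate moves are neither rewarded nor punished.
    return 0

print(reward("win"), reward("loss"), reward(None))  # → 1 -1 0
```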
In contrast to reinforcement learning, in all of today's successful chess-playing programs,
which search the position tree for every move, winning material is the most
important component of the objective function. In chess and checkers, the end of
the game is usually not visible because it is too far down the tree, so intermediate
positions are evaluated by a tree-searching method.
The reward function is myopic; it determines the immediate reward that an agent
will receive by taking a certain action. It is only one component of reinforcement
learning, the one that defines the goal or goals of the learning system. The objective of an RL
agent is to maximize the total reward it will receive in the long run, meaning, in this
case, over many games played. So we need something else to tell the agent which
action to choose from the beginning to the end of every trial. For this, we need a
value function.
1.3.4 Value Function
The value function specifies which action selection is best in the
long run. Values indicate the long-term desirability of a state, taking into consideration
all the states that are likely to follow. The value function is useful because it can
be used to improve the policy. A reinforcement learning system changes its value
function to bring it closer to the optimal solution.
A heuristic evaluation function is the form of value function that we are most familiar
with. Heuristic search is an efficient and intelligent search method for generating
good solutions without having to search exhaustively. Heuristic search, however, incorporates
prior knowledge about state values into the search algorithm, thus making
it inflexible [23]. Q-learning (explained later in Section 1.5.1) does not require prior
knowledge of the domain but rather learns a value function through experience.
To better understand the value function, let us look at how
heuristic search and the value function are used to find a solution. Consider the
branch and bound heuristic search. At each step of the branch and bound process, we
select the most promising of the states we have so far and expand all the branches
of that state. We stop if one of them is a solution. The left half of
Figure 3 shows an example where state A is the beginning state. First, all
of state A's branches are expanded, producing two successor states, B and C. In this example, we
attempt to minimize the value of the objective function. Next, state C is chosen to
be expanded because it has a lower value, 4, than AB, which is 5. Now the paths that
have been expanded are AC(4), ACD(6), ACK(11), and ACL(12). These paths are put into
a list that is sorted by the path's value. Next, branch AB is expanded to ABF(6) and
ABI(8). This time, ABF and ACD have the same value of 6. Thus states D and F
are expanded, generating ABFH(10), ABFG(12), and ACDE(15). This makes path
ABI the shortest path with a value of 8. Finally, when ABI is expanded to ABIJ(11),
path ABFH becomes the optimal solution with its path value of 10, because
there is no other branch that has a smaller value than ABFH.
[Diagram: a search tree rooted at state A with terminal states at the leaves; the legend distinguishes activated transitions, transitions not activated, and backed-up values.]

Figure 3. Branch and bound heuristic search (left) and value updates of the second
iteration of path ACDE (right)
Next, let us look at the value function. Reinforcement learning updates the value
of each transition that the agent has chosen. This process of re-estimating values is
called the backup process. As mentioned before, RL learns through experimentation.
A simple value update function can be written as

    v(transition) = r(transition) + min v(next transition)

where v(transition) is the value of a transition, r(transition) is the reward of the
transition, and min v(next transition) is the minimum value over the next transitions.
Each transition value is initialized to zero. Suppose that the agent has chosen path
ACDE for the first two trials. Table I shows how the value of each transition of path
ACDE is backed up for two iterations. In the second iteration, min v of C→D is 9
because the value of branch D→E was updated to 9 in the first iteration. Through
trial and error, the agent keeps backing up transition values of the paths that it
chooses. If the agent continues to choose different actions, it will eventually learn
that the optimal path is ABFH.
Table I. The process of backing up transition values

Transition    A→C                     C→D                     D→E
              v(AC) = r(AC) + min v   v(CD) = r(CD) + min v   v(DE) = r(DE) + 0
Iteration 1   4 + 0 = 4               2 + 0 = 2               9 + 0 = 9
Iteration 2   4 + 0 = 4               2 + 9 = 11              9 + 0 = 9
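The two iterations of Table I can be reproduced with a few lines of code. The rewards below are inferred from the path values in the branch and bound example (e.g., path ACK(11) with AC(4) implies a C→K reward of 7); note that min v at a state ranges over all of that state's outgoing transitions, including unexplored ones still valued at zero, which is why A→C stays at 4 in iteration 2.

```python
# Reproduce Table I: back up transition values along path A->C->D->E.
# Rewards inferred from the branch and bound example; all values start at zero.
rewards = {("A", "C"): 4, ("C", "D"): 2, ("C", "K"): 7, ("C", "L"): 8, ("D", "E"): 9}
successors = {"C": [("C", "D"), ("C", "K"), ("C", "L")], "D": [("D", "E")], "E": []}
v = {t: 0 for t in rewards}
path = [("A", "C"), ("C", "D"), ("D", "E")]

history = []
for _ in range(2):  # the two iterations shown in Table I
    for s, s2 in path:
        # min v over the next state's transitions (0 at the terminal state E)
        min_next = min((v[t] for t in successors[s2]), default=0)
        v[(s, s2)] = rewards[(s, s2)] + min_next
    history.append(dict(v))

print(history[1][("C", "D")])  # → 11, matching iteration 2 of Table I
```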
But how does the agent decide which transition to choose? This is the topic of
discussion in the next section.
1.4 Exploration vs. Exploitation
A reinforcement learning agent's objective is to maximize the total expected reward
over some time period. If we always choose the action that yields the highest estimated
value, as in greedy search, then we are exploiting the current knowledge of the value
of the actions. Otherwise we are exploring. Exploration enables us to improve the
estimate of a non-greedy action's value that may generate greater total reward in
the long run. Consider the example shown in Figure 4, where the starting state is 1.
If the agent always chooses the path that leads to the highest immediate reward, it
takes path 1 → 3. It is obvious that the agent needs to explore to realize a better
path by trying 1 → 2 → H, which yields a total reward of 10, higher than path
1 → 3 → H with its cumulative reward of 6.
Figure 4. The need for exploration
The example given above can be applied to learning the optimal policy of general
decision problems. It is based on the reinforcement learning method applied in
this study, which learns a mapping from each state-action pair to a value.
In most games like chess or checkers that use heuristic search to find
moves, however, the value depends only on the position (state).
To ensure that the agent keeps exploring all actions, it is necessary for the
agent to continue to select them. There are two approaches to ensure this, called
on-policy learning and off-policy learning [30]. On-policy methods attempt to
evaluate and improve the same policy that they use to make decisions. In off-policy
methods, the policy used to generate behavior, called the behavior policy, is unrelated
to the policy that is evaluated and improved, the estimation policy. This separation
has the advantage that the estimation policy can keep getting greedier while the
behavior policy can continue to sample all possible actions. One behavior policy that
is commonly used in RL is the ε-greedy policy [30], where 0 ≤ ε ≤ 1. If ε is set to 0.1,
the agent chooses one action randomly in every ten decisions, on average, and
behaves greedily for the other nine moves.
ε-greedy action selection is not suitable for problems in which it is unacceptable
for the agent to experiment with a non-greedy action that is very bad. This is because
when the agent explores, it chooses among all actions with equal likelihood.
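An ε-greedy selector takes only a few lines. The sketch below assumes, for illustration, that the current Q-value estimates are kept in a dictionary keyed by action.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one.

    `q_values` maps each available action to its current estimate
    (a hypothetical representation chosen for illustration).
    """
    actions = list(q_values)
    if rng.random() < epsilon:
        return rng.choice(actions)          # explore: any action, equal likelihood
    return max(actions, key=q_values.get)   # exploit: highest estimated value

q = {"a1": 0.5, "a2": 0.125, "a3": 0.0}
print(epsilon_greedy(q, 0.0))  # → a1 (epsilon 0 is purely greedy)
```

Note the drawback described above: during the exploratory branch, a disastrous action is exactly as likely to be tried as a merely second-best one.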
A slightly more sophisticated alternative to ε-greedy is softmax, or Boltzmann exploration
[30], which is also used in simulated annealing [10]. In this case, an action a is chosen
probabilistically according to the distribution

    P(a) = e^{Q(a)/T} / Σ_b e^{Q(b)/T}

where Q(a) is the value of action a, and T is the temperature parameter, which can be
reduced over time to reduce exploration. If T is set high, each action is chosen with
nearly equal probability. As T approaches 0, the selection probabilities of the actions are
distanced from each other.
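The Boltzmann distribution above can be sketched as follows; evaluating it at two temperatures shows how T moves the selection between near-uniform and near-greedy.

```python
import math

def softmax_probs(q_values, temperature):
    """Boltzmann selection probabilities: P(a) = e^{Q(a)/T} / sum_b e^{Q(b)/T}."""
    exps = {a: math.exp(q / temperature) for a, q in q_values.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}

q = {"a1": 1.0, "a2": 0.0}
hot = softmax_probs(q, temperature=100.0)   # high T: nearly equal probabilities
cold = softmax_probs(q, temperature=0.01)   # T near 0: almost always the best action
print(hot["a1"], cold["a1"])
```

Unlike ε-greedy, a clearly bad action here receives an exponentially small probability rather than an equal share of the exploration.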
1.5 Temporal Difference Learning
The key idea of reinforcement learning, as agreed by many researchers, is temporal
difference (TD) learning [28], a technique for recursively learning an evaluation
function.
The power of TD learning is that it eliminates the "rollout" method [29] used
in traditional methods such as dynamic programming. The "rollout" method backs
up, or refines, the values of the policy only when it reaches the final state. TD learning
improves its policy by making estimates from other estimates. This mechanism is
actually closer to how humans learn. Suppose you have experience drawing a circle;
you do not need to wait until the circle is completely drawn to know what adjustments
are needed to draw a well-rounded circle.
Temporal-difference learning is separated into two categories: multistep prediction
learning and single-step prediction learning. In a single-step prediction problem,
you know the correctness of a prediction one step after making it, whereas in a
multistep prediction problem, the correctness of a prediction is not determined until
more than one step after the prediction. It is beneficial to update a prediction
made several steps in the past using new observations gathered after that. For example,
a bad move made in a chess game may not be revealed until several moves
later. It has been shown that treating most prediction problems as multistep
converges faster. However, the multistep prediction algorithm is slightly more
computationally intensive and harder to implement. The multistep prediction algorithm,
called TD(λ), is discussed after the discussion of artificial neural networks.
The single-step prediction algorithm is a special case of TD(λ) known as TD(0).
The best-known single-step algorithm is Q-learning. A variation of Q-learning
called SARSA is also discussed in detail later.
1.5.1 Q-Learning
Q-learning is an example of an off-policy TD control algorithm that was developed
by Watkins in 1989 [39]. The notation Q(s, a) is used to represent the estimated value
of taking action a in state s. The goal is to learn the value function Q : S × A → R.
Traditionally this function is implemented as a table, with a value for each state-action
pair. Its simplest form, 1-step Q-learning, is defined by

    ΔQ(s, a) = α[r + γ max_a' Q(s', a') - Q(s, a)]    (1.1)

where r is the reward, α is the step-size parameter or learning rate, γ is the discount
rate, s', a' is the state-action pair at the subsequent point in time t + 1, and
max_a' Q(s', a') is the maximum Q-value for picking any action a' in state s'. The
algorithm for Q-learning using table lookup is shown in Figure 5. To illustrate how
the learning function is used, consider the following settings for a tic-tac-toe agent
that learns to win, where α = 0.5, γ = 0.5, and all values of Q(s, a) are set to 0. The
opponent selects its moves randomly, and the agent wins the first game. Thus, the
agent receives a reward of 1 for winning the game by taking action a_w in state s_w.
Initialize Q(s, a) arbitrarily
Repeat (for each trial):
    Initialize s to start state
    Repeat (for each step of trial):
        Choose a from s using policy derived from Q (e.g., ε-greedy)
        Take action a, observe r, s'
        Q(s, a) := Q(s, a) + α[r + γ max_a' Q(s', a') - Q(s, a)]
        s := s'
    until s is terminal

Figure 5. Q-learning: An off-policy TD control algorithm
According to (1.1),

    ΔQ(s_w, a_w) = 0.5[1 + 0.5(0) - 0]
                 = 0.5(1 + 0)
                 = 0.5

Since the initial value of Q(s_w, a_w) is 0, the updated value of Q(s_w, a_w) is 0.5.
The agent continues to play another game, and the agent reaches state s_{w-1}.
The opponent responds by taking an action that ends up in state s_w. So
max_a' Q(s', a') is Q(s_w, a_w), which is 0.5. Thus Q(s_{w-1}, a_{w-1}) is

    Q(s_{w-1}, a_{w-1}) = Q(s_{w-1}, a_{w-1}) + α[r_{t+1} + γQ(s_w, a_w) - Q(s_{w-1}, a_{w-1})]
                        = 0 + 0.5[0 + 0.5(0.5) - 0]
                        = 0.125

A discounted positive Q-value is assigned to Q(s_{w-1}, a_{w-1}) because the agent may
reach state s_w. If each action is executed in each state an infinite number of times
on an infinite run and α is decayed appropriately, the Q-values will converge with
probability 1 to Q*, where Q* is the optimal Q-value [8].
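The tabular update of Figure 5 and the two backups of the worked example can be reproduced in a few lines; the state and action names below (such as "s_w") are just labels for illustration.

```python
from collections import defaultdict

alpha, gamma = 0.5, 0.5
Q = defaultdict(float)  # every Q(s, a) starts at 0, as in the example

def q_update(s, a, r, next_max):
    """One-step Q-learning backup, equation (1.1)."""
    Q[(s, a)] += alpha * (r + gamma * next_max - Q[(s, a)])

# Game 1: the agent wins by taking action a_w in state s_w (terminal, so next_max = 0).
q_update("s_w", "a_w", 1, 0)
# Game 2: from the predecessor state, the opponent's reply leads to s_w,
# so the backed-up value is max_a' Q(s_w, a') = Q(s_w, a_w) = 0.5.
q_update("s_w-1", "a_w-1", 0, Q[("s_w", "a_w")])

print(Q[("s_w", "a_w")], Q[("s_w-1", "a_w-1")])  # → 0.5 0.125
```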
Figure 6. An example MDP with rewards
A discount rate, γ, is introduced to account for the time value of rewards. It
measures the value at time t of a one-unit reward received at time t + 1. A one-unit
reward received t periods in the future has a present value of γ^t. To see the effects of γ,
let us look at Figure 6. The figure shows that a positive reward of 1 is assigned only
when state H is reached. The effect of γ on each Q-value is shown in Table II.
Table II. Effect of discount on Q-value

Path                            1→2    2→3    3→H    1→4    4→H
Discounted Q-values (γ = 0.5)   0.25   0.5    1      0.5    1
Undiscounted Q-values (γ = 1)   1      1      1      1      1
Taking the discount rate into account affects the agent's preference for policies. If
this example is an undiscounted problem, then both paths to state H have the same
Q-value of 1. In contrast, using the discount rate, the agent prefers path 1→4→H
because the goal state H is reached in fewer steps.
The Q-value of the terminal state is always 0. A Q-value is assigned to every
state-action pair that involves a state transition. The terminal states are the only
exception, since they are the only states that have no transition to another state.
1.5.2 SARSA
SARSA is a variation of Q-learning that was developed by Sutton [30]. Just like
Q-learning, the SARSA agent begins by taking an action a in state s. It then waits
for the opponent to respond. After the opponent has moved, state s evolves into
state s'. The SARSA agent then chooses an action a' based on its behavior
policy. This is the step that distinguishes SARSA from Q-learning. SARSA's chosen
action, a', may be an exploratory action, so Q(s', a') in SARSA will not always be
the highest value, while Q-learning always backs up each Q(s, a) by discounting the
highest Q-value of the s', a' pair. The SARSA algorithm is shown in Figure 7.
SARSA is an onpolicy learning algorithm because it learns by improving the same
Initialize Q(s, a) arbitrarily
Repeat (for each trial):
    Initialize s, a
    Repeat (for each step of trial):
        Take action a, observe r, s'
        Choose a' from s' using policy derived from Q (e.g., ε-greedy)
        Q(s, a) := Q(s, a) + α[r + γQ(s', a') - Q(s, a)]
        s := s'; a := a'
    until s is terminal

Figure 7. SARSA: An on-policy TD control algorithm
policy that it uses to select all actions. If the behavior policy of SARSA always acts
greedily, then SARSA and Q-learning are the same.
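The loop of Figure 7 can be sketched for a tabular Q-function. This is an illustrative sketch, not the thesis's C++ implementation; the environment interface (`reset`, `actions`, `step`), the tiny example environment, and the parameter values are assumptions made for the example:

```python
import random
from collections import defaultdict

def sarsa(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular SARSA as in Figure 7. env must provide reset() -> s,
    actions(s) -> list of actions, and step(s, a) -> (r, s_next, terminal)."""
    Q = defaultdict(float)  # Q(s, a), initialized arbitrarily (here: 0)

    def choose(s):  # epsilon-greedy behavior policy
        acts = env.actions(s)
        if random.random() < epsilon:
            return random.choice(acts)
        return max(acts, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        a = choose(s)
        while True:
            r, s2, terminal = env.step(s, a)
            if terminal:
                # terminal states have Q-value 0, so the target is just r
                Q[(s, a)] += alpha * (r - Q[(s, a)])
                break
            a2 = choose(s2)  # a' may be exploratory: this makes SARSA on-policy
            Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])
            s, a = s2, a2
    return Q

# tiny deterministic example (an assumption, not from the text):
# a one-step decision where action 1 pays reward 1 and action 0 pays 0
class Bandit:
    def reset(self): return 0
    def actions(self, s): return [0, 1]
    def step(self, s, a): return (1.0 if a == 1 else 0.0, None, True)

random.seed(0)
Q = sarsa(Bandit(), episodes=500)
```

After training, Q(0, 1) approaches 1 and Q(0, 0) stays near 0, so the greedy policy prefers the rewarding action.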
1.6 Generalization in Reinforcement Learning
The artificial neural network functions as an approximation mechanism for the Q-values
instead of storing them in a table. Why would we want to approximate the Q-values
when we can have the actual Q-values stored in a table? The table Q-values are
updated each time the corresponding state-action pair is encountered during training.
This means that at the end of the learning process, there are possibly many Q-values
that have not been updated because these state-action pairs were not experienced.
The ability of an artificial neural network to generalize experiences allows the agent
to make better decisions even in situations that it has never experienced before. Generalization
allows the network to evaluate a state and generate a response based on
features that are similar to some states that the network has encountered. Therefore,
generalization may speed up learning by discovering commonalities among states.
In addition, because of the large number of states in many environments, using
a lookup table to represent the function is not feasible. This shortcoming is known
as the 'curse of dimensionality'. Generalization over these large spaces can be achieved
using a function approximator such as an artificial neural network (ANN) [1]. This is a
very powerful learning tool. It is essential to understand artificial neural networks if
we want to scale up reinforcement learning to solve larger problems.
1.7 Artificial Neural Networks
Today, there is a large body of literature on the subject of artificial neural networks,
ANN for short. An ANN is a simplified model of the brain's cells, or neurons, that
are massively interconnected, usually simulated in software on a computer. In
the 1950s and 1960s, when the mass enthusiasm began, this field was known as connectionism.
That was when the perceptron was introduced by Frank Rosenblatt [21],
and this is the first network we will discuss.
Figure 8. Two-layer perceptron
Figure 8 illustrates a two-layer perceptron. The connections between neurons
determine the function of the network. Associated with each connection is a weight
w_i. The strength of the weights is represented by real numbers. The neuron sums
up the weighted activated inputs. If the sum of the inputs is greater than some
threshold value, θ, the neuron turns on (outputs a 1); otherwise it is off (outputs a
0). The function that represents the threshold value is called the activation function.
In supervised learning, an artificial neural network, including the perceptron, learns
a mapping from a set of inputs to the corresponding target outputs. Learning takes
place by updating the weights and thresholds so that the network generates the desired
outputs corresponding to the given inputs. The output of the perceptron is given by
    o = hardlim(Σ_i w_i x_i + θ),

where x_i is the ith input, w_i denotes the weight that connects input neuron i
to the output neuron, and θ is the threshold value. Let us denote by net the weighted
sum of inputs, net = Σ_i w_i x_i + θ. The hardlim activation function is defined as

    hardlim(net) = 1 if net ≥ 0, and 0 otherwise.
The perceptron learning rule indicates how the weights and thresholds are adjusted
if the network is not producing all the desired outputs. In other words, the perceptron
keeps changing both of its adjustable parameters. The error of the network, e, is the
difference between the desired output value, y, and the estimated output value
produced by the network, o. The perceptron stops adjusting the parameters once it
generates all the desired outputs for the examples that it has seen. The perceptron
learning rule is

    w := w + e x,
    θ := θ + e,
    e = y - o.
Let us now look at a trivial example to see how a perceptron works. Suppose
you would like a vending machine that accepts one and two dollar bills. The way the
machine separates them is by using its somewhat primitive scanning device to read
in three selected features on the face of the bill. The first feature is the seal of the
Department of the Treasury (both bills have this), then the fullness of the president's
hair (Jefferson has more hair than Washington does), and finally whether the words
"THE UNITED STATES OF AMERICA" appear on top of the president's picture
(the one dollar bill has them whereas the two dollar bill does not). The features of the
input can be represented as a vector such that

    x = [seal, hair, USA]^T.

If a feature exists, a 1 can be used to indicate that, and a -1 to indicate
otherwise. So the respective input vectors for the one dollar bill and the two dollar
bill are

    x_$1 = [1, -1, 1]^T   and   x_$2 = [1, 1, -1]^T.
First of all, we need to select a network architecture to represent this task. Since
there are three features that the vending machine scans, we need three input
neurons. And we want only one answer: whether it is a one dollar or a two dollar
bill. Hence one output node is essential. The output is either a 0 or a 1 because
the hardlim activation function is used. In this example, an output of 1 means a one
dollar bill and an output of 0 means a two dollar bill. This artificial neural network
architecture is a two-layer 3-1 network: it consists of one layer of
three input neurons, and another layer that consists of one output neuron.
The weights and thresholds of the perceptron are commonly initialized with small
random numbers. Let us pick 0.2, 0.3, 0.1, and -0.4 for w_1, w_2, w_3, and θ respectively.
Now we are ready to train the perceptron to recognize both the one and two dollar bill.
Let us see if these initial values of the network parameters actually produce the two
outputs that we desire. This could happen because this is not a complicated example.
We begin the training with a one dollar bill, where w_1 = 0.2, w_2 = 0.3, w_3 = 0.1,
x_1 = 1, x_2 = -1, x_3 = 1 and θ = -0.4.
Iteration 1:

    o = hardlim(Σ_i w_i x_i + θ)
      = hardlim(0.2(1) + 0.3(-1) + 0.1(1) + (-0.4))
      = hardlim(-0.4)
      = 0.

We expect the perceptron to output a 1; instead it outputs a 0, which is an error.
The perceptron learning rule needs to be applied to change the weights and threshold
in order to yield the correct answer. First we need to calculate the error:

    e = y - o = 1 - 0 = 1.

Then the weight updates, w := w + e x, are

    w_1 = 0.2 + (1)(1) = 1.2
    w_2 = 0.3 + (1)(-1) = -0.7
    w_3 = 0.1 + (1)(1) = 1.1

The threshold update is

    θ = θ + e = -0.4 + 1 = 0.6.

These are the adjustments to the weights and threshold, where now w_1 = 1.2, w_2 =
-0.7, w_3 = 1.1 and θ = 0.6. But does the perceptron recognize the one dollar bill
now? Let us find out.

    o = hardlim(1.2(1) + (-0.7)(-1) + 1.1(1) + 0.6)
      = hardlim(3.6)
      = 1.
Next we look at whether the perceptron recognizes the features of a two dollar
bill with its current weights and threshold values, where x_1 = 1, x_2 = 1, x_3 = -1.

Iteration 2:

    o = hardlim(1.2(1) + (-0.7)(1) + 1.1(-1) + 0.6)
      = hardlim(0)
      = 1.

    e = y - o = 0 - 1 = -1.

    w_1 = 1.2 + (-1)(1) = 0.2
    w_2 = -0.7 + (-1)(1) = -1.7
    w_3 = 1.1 + (-1)(-1) = 2.1
    θ = 0.6 + (-1) = -0.4
Let us verify that the network recognizes the two dollar bill:

    o = hardlim(0.2(1) + (-1.7)(1) + 2.1(-1) + (-0.4))
      = hardlim(-4.0)
      = 0.
We know that the perceptron recognizes the two dollar bill now, but does it still
recognize the one dollar bill? Since we have only two training samples, the third
iteration tests the network with the one dollar bill again.

Iteration 3:

    o = hardlim(0.2(1) + (-1.7)(-1) + 2.1(1) + (-0.4))
      = hardlim(3.6)
      = 1.
Since the weights and threshold generate the correct output of 1, indicating
that the input is a one dollar bill, no adjustment to the parameters is needed after
the third iteration. So the network parameters have converged after two iterations,
with w_1 = 0.2, w_2 = -1.7, w_3 = 2.1, and θ = -0.4. The perceptron learning rule is
proven to converge to parameters that accomplish the desired classification [7], given
that such parameters exist. Remember that this perceptron is trained to recognize
only the one and two dollar bill. If the perceptron is presented with a twenty dollar
bill in which all three features are present, the perceptron will output a 1. It is not
capable of correctly classifying inputs other than one and two dollar bills,
because it was not trained to do so. The simple example presented above is a pattern
recognition problem, for which artificial neural networks are often used.
The network parameters define a hyperplane in the three-dimensional input
space (since there are three inputs in this example) that linearly separates the one and
two dollar bills into two categories. That is why the twenty dollar bill above is classified
as a one dollar bill: it falls into the one dollar half of the input space. The
perceptron keeps adjusting the parameters until it has found a hyperplane
that correctly separates the two patterns.
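The training run above can be reproduced in a few lines. This is an illustrative sketch, not the thesis code; exact rational arithmetic is used so that the tie at net = 0 in iteration 2 is handled exactly as hardlim defines it (floating point would make that sum slightly negative and change the run):

```python
from fractions import Fraction as F

def hardlim(net):
    # fires (outputs 1) when the weighted sum reaches the threshold, else 0
    return 1 if net >= 0 else 0

# features: [seal, hair, USA]; 1 = feature present, -1 = feature absent
x_one = [F(1), F(-1), F(1)]   # one dollar bill, target output 1
x_two = [F(1), F(1), F(-1)]   # two dollar bill, target output 0
data = [(x_one, 1), (x_two, 0)]

w = [F(2, 10), F(3, 10), F(1, 10)]   # w1, w2, w3 = 0.2, 0.3, 0.1
theta = F(-4, 10)                     # threshold = -0.4

# perceptron learning rule: w := w + e*x, theta := theta + e, e = y - o
converged = False
while not converged:
    converged = True
    for x, y in data:
        o = hardlim(sum(wi * xi for wi, xi in zip(w, x)) + theta)
        e = y - o
        if e != 0:
            w = [wi + e * xi for wi, xi in zip(w, x)]
            theta = theta + e
            converged = False
```

The loop stops with w = (0.2, -1.7, 2.1) and θ = -0.4, matching the hand computation, and a twenty dollar bill encoded as (1, 1, 1) lands on the one dollar side of the learned hyperplane.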
Even though the perceptron is powerful enough to solve many classification problems,
it is limited to problems that are linearly separable. In 1969, Minsky and Papert
documented the shortcomings and capabilities of the perceptron [15]. One problem that
is not linearly separable by one hyperplane is the classical exclusive-or, as depicted
in Figure 9.
Input 1   Input 2   Output
   0         0        0
   0         1        1
   1         0        1
   1         1        0

Figure 9. The exclusive-or, a classification problem that is not linearly separable
In order to learn problems that are not linearly separable, we can build a multilayer
neural network with one or more "hidden" layers to learn a more sophisticated
function. This is also known as a multilayer perceptron or multilayer feedforward
network. Multilayer neural networks can be trained to solve problems that are not
linearly separable, such as the exclusive-or, by using the backpropagation algorithm.
But before looking at backpropagation, first we have to understand the learning system
behind it. This learning system is known as the Least-Mean-Square rule.
1.7.1 LMS, Delta Rule, ADALINE, or Widrow-Hoff Rule
In 1960, Widrow and Hoff introduced an adaptive algorithm known as the Least-Mean-Square
(LMS) rule. It is also known as the delta rule, the Adaptive Linear
Neuron (ADALINE) rule, and the Widrow-Hoff rule. Like the perceptron rule, the LMS
algorithm is an adaptive or optimization algorithm, which means the parameters
(weights and thresholds) can be properly adjusted to achieve the correct values. The
LMS algorithm is applicable to a two-layer network, consisting of a layer of input
neurons and another layer of output neurons. However, the LMS algorithm also
suffers from not being capable of learning nonlinear functions. Yet it is essential to
understand the LMS algorithm because the backpropagation algorithm is a powerful
extension of the LMS rule.
The LMS algorithm observes the performance of the network during training.
The performance of a network is better when the network is closer to classifying all
the patterns correctly. As mentioned before, if not all the patterns are classified
correctly, then the network has erroneous parameters. Consequently, if we have a
way to measure the error of the network parameters, we know the performance of the
network.
The performance measure of the LMS rule is the approximate mean square error.
The LMS rule tries to reduce the approximate mean square error in order to obtain
the best-performing network. This is where the name Least-Mean-Square (LMS)
comes from. The mean square error (MSE) is defined as

    MSE = (1/n) Σ_{i=1}^{n} δ_i² = (1/n) Σ_{i=1}^{n} (y_i - o_i)²,

where n is the number of input-output pairs, y_i is the ith desired output, o_i is the ith
estimated output, and δ_i is the difference between y_i and o_i. At each iteration, the
LMS algorithm estimates the mean square error by

    MSE ≈ δ² = (y - o)²,

where the mean square error is replaced by the square error of the current example.
The error of the output is reduced through adjustment of the weights and thresholds.
A space has dimension n when points in it can be specified by n coordinates. The
two-dimensional plane requires two coordinates (x, y), three-dimensional space requires
three, and so on. Thus the approximate mean square error at each point in time is a
coordinate on the error surface that is specified by the weights and thresholds. For a
two-layer network, the error surface with respect to the network parameters is a
paraboloid, as shown in Figure 10.
Figure 10. The mean square error surface
The bottom of the paraboloid, the minimum or optimum point, is where the
error is reduced to zero. To reduce the approximate mean square error is to descend
downhill to the minimum point on the error surface. The steepest slope downhill is
the negative of the gradient, -∇_w MSE, because the gradient vector points uphill.
This method of minimizing the error is known as steepest descent or gradient descent.
The gradient with respect to the network weights is defined as

    ∇_w MSE = [∂MSE/∂w_1, ∂MSE/∂w_2, ..., ∂MSE/∂w_n],

such that

    ∂MSE/∂w_i ≈ -2 δ x_i,

where ∂MSE/∂w_i is the first derivative of the MSE along the w_i axis, n is the number of
neurons in the input layer, and x_i is the ith input. The derivative is zero at the
minimum point, and at each step the new gradient is orthogonal (perpendicular) to
the previous search direction. Thus, consecutive search directions of gradient descent
are always orthogonal to each other, as illustrated in Figure 11.
Figure 11. Gradient descent: orthogonal search directions
Using the approximation of ∇_w MSE, the LMS algorithm updates each weight by the
negative gradient,

    Δw_i = -α (∇_w MSE)_i = 2 α δ x_i,

and updates the threshold by

    θ := θ + 2 α δ,
where α is a constant learning rate. The learning rate decides the size of the step
taken along the gradient. Hence a small learning rate is normally used to keep the
parameters from changing too fast. If the magnitude of the new gradient is greater
than that of the previous one, then the parameters may not converge (illustrated in
Figure 12) [7]. The best learning rate is typically found through trial and error.
Figure 12. A small learning rate converges (top) and a slightly larger learning rate
may diverge (bottom)
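The updates Δw_i = 2αδx_i and θ := θ + 2αδ can be applied online to a linear unit. The following sketch is illustrative, not from the thesis; the toy dataset, learning rate, and target parameters are assumptions chosen so that the rule has an exact solution to recover:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy dataset: a noiseless linear target the ADALINE should recover
X = rng.uniform(-1, 1, size=(200, 3))
true_w, true_theta = np.array([0.5, -0.3, 0.8]), 0.1
y = X @ true_w + true_theta

w = np.zeros(3)
theta = 0.0
alpha = 0.05  # small constant learning rate, as the text recommends

# online LMS: one gradient step per example
for _ in range(50):
    for x, target in zip(X, y):
        delta = target - (w @ x + theta)   # error of the linear output
        w += 2 * alpha * delta * x          # delta_w_i = 2 * alpha * delta * x_i
        theta += 2 * alpha * delta          # theta := theta + 2 * alpha * delta

mse = np.mean((y - (X @ w + theta)) ** 2)
```

Because the data are noiseless and the target is linear, the approximate mean square error descends to (essentially) zero and the learned parameters match the generating ones.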
1.7.2 Multilayer Neural Network

Up to this point, we understand that both the perceptron and the ADALINE network
are incapable of solving problems that are not linearly separable. Researchers believed
that this barrier could be overcome by building a different network architecture and
using a more powerful algorithm to train the network. Studies were conducted on
constructing an algorithm to train a multilayer neural network. An additional layer
of neurons, called the "hidden layer", was added in between the input and the output
layers. Finally, Rumelhart, Hinton, and Williams introduced the backpropagation
algorithm, which is capable of training a multilayer neural network. By applying the
backpropagation algorithm, a multilayer network is capable of classifying nonlinear
problems. Figure 13 illustrates a 2-3-1 multilayer feedforward neural network architecture
that has 2 input units, 3 hidden units, and 1 output unit.
Figure 13. A multilayer feedforward neural network
The notation w_ij denotes the weight connecting neuron unit i in a previous layer
with neuron unit j, x_i is the input to unit i, y_j is the desired output for the current
input, and o_j is the output value of neuron j.
There is no clear understanding as to how many hidden layers should be used.
Normally, a network of either one or two hidden layers is used in practice. In
addition, it has been shown that a one-hidden-layer MLP, if given enough hidden
units, is capable of approximating any nonlinear function. Thus a three-layer network
is sufficient for handling most classification problems.
1.7.3 Backpropagation or Generalized Delta Rule
The backpropagation technique of Rumelhart, Hinton, and Williams [22] has become
the most commonly used neural network training algorithm. The term "backpropagation"
refers to the way the partial derivatives are efficiently computed in a backward propagating
sweep through the network [28]. It is the optimization algorithm chosen
for this experiment because it is very well studied.
Backpropagation is a simple optimization method, but it is considered obsolete and
probably should not be used for production work. Conjugate gradient methods are
generally faster and, if properly coded, just as robust.
Figure 14. The sigmoid threshold function
The backpropagation algorithm will work with many activation functions. The
most commonly used activation function is

    o_j = 1 / (1 + e^(-net_j)),   for all j ∈ hidden, output

where o_j is the output value of the jth neuron, either in the hidden or in the output
layer. The S-shaped activation function depicted in Figure 14 is known as a sigmoid.
Let w_ij denote the weight connecting neuron i in a previous layer to neuron j in the
successive layer. Note that the output of neuron i is essentially the input to neuron j.
So net_j, the weighted sum for neuron j, is computed by

    net_j = Σ_i w_ij o_i + θ_j,   for all j ∈ hidden, output

where neuron j is either a neuron of the output layer or a neuron of a hidden layer,
and θ_j represents the bias value corresponding to the jth neuron. Every hidden and
output neuron has its own bias value.
In gradient descent, the current value of the weights is moved in the direction in
which the expected error falls most rapidly, the direction of steepest descent. The
gradient, ∇_w MSE, can be obtained by using the chain rule. This way, the mean
square error can be distributed among all the weights of the network. The first part
of the problem is to find how the MSE changes as the weight w_jk changes:

    -∂MSE/∂w_jk = -(∂MSE/∂o_k)(∂o_k/∂net_k)(∂net_k/∂w_jk)
                = (y_k - o_k) o_k (1 - o_k) o_j,   for all j ∈ hidden, k ∈ output

where w_jk denotes the weight connecting the jth hidden neuron to the kth output
neuron, and y_k denotes the target output for the kth neuron in the output layer.
Secondly, we need to relate the change in MSE to the change in w_ij, the weight
that connects neurons from the input layer to the hidden layer. Any change to the
weight w_ij changes o_j, which becomes an input of unit k. By using the chain rule,
the gradient with respect to w_ij is therefore the sum of the changes to the output
value of each hidden and output neuron of the network. Let i ∈ input, j ∈ hidden,
and k ∈ output; then

    -∂MSE/∂w_ij = Σ_k -(∂MSE/∂o_k)(∂o_k/∂net_k)(∂net_k/∂o_j)(∂o_j/∂net_j)(∂net_j/∂w_ij)
                = Σ_k (y_k - o_k) o_k (1 - o_k) w_jk o_j (1 - o_j) o_i
                = Σ_k δ_k w_jk o_j (1 - o_j) o_i
                = o_i o_j (1 - o_j) Σ_k δ_k w_jk,

where δ_k = (y_k - o_k) o_k (1 - o_k). Therefore, the negative gradient of the sample
error at hidden neuron j, δ_j, is computed by a backpropagation process defined by

    δ_j = o_j (1 - o_j) Σ_{k ∈ output} w_jk δ_k,   for all j ∈ hidden.
With δ_j computable, the update equation can be written. The weight update rule is

    Δw_ij = α δ_j o_i,

where α is the learning rate, and o_i is the output value of neuron i, which is equivalent
to x_j. Each neuron in the hidden and output layers has a bias value. The bias of unit
j, θ_j, is adjusted by

    Δθ_j = α δ_j.
A momentum coefficient, η, can be used to speed up convergence, because it keeps
the process moving in a consistent direction. Using momentum, the weight update
formula becomes

    Δw_ij(t) = α δ_j o_i + η Δw_ij(t - 1).
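The forward pass, the backpropagated deltas, and the momentum update can be combined into a short training loop. The following sketch trains a 2-3-1 network like the one in Figure 13 on the exclusive-or of Figure 9; the parameter values, initialization range, and the restart loop (to escape the occasional local minimum) are assumptions, not from the thesis:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_xor(seed, epochs=4000, alpha=0.5, eta=0.9):
    """Train a 2-3-1 sigmoid network on XOR with backprop plus momentum."""
    rng = np.random.default_rng(seed)
    X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
    Y = np.array([0., 1., 1., 0.])
    W1 = rng.uniform(-1, 1, (2, 3)); b1 = rng.uniform(-1, 1, 3)
    W2 = rng.uniform(-1, 1, (3, 1)); b2 = rng.uniform(-1, 1, 1)
    dW1 = np.zeros_like(W1); db1 = np.zeros_like(b1)
    dW2 = np.zeros_like(W2); db2 = np.zeros_like(b2)
    for _ in range(epochs):
        for x, y in zip(X, Y):
            h = sigmoid(x @ W1 + b1)                 # hidden outputs o_j
            o = sigmoid(h @ W2 + b2)[0]              # network output o_k
            delta_k = (y - o) * o * (1 - o)          # output delta
            delta_j = h * (1 - h) * (W2[:, 0] * delta_k)  # hidden deltas
            # momentum form from the text: dw(t) = alpha*delta*o_prev + eta*dw(t-1)
            dW2 = alpha * delta_k * h[:, None] + eta * dW2
            db2 = alpha * delta_k * np.ones(1) + eta * db2
            dW1 = alpha * np.outer(x, delta_j) + eta * dW1
            db1 = alpha * delta_j + eta * db1
            W1 += dW1; b1 += db1; W2 += dW2; b2 += db2
    return [round(float(sigmoid(sigmoid(x @ W1 + b1) @ W2 + b2)[0])) for x in X]

# backprop on XOR can occasionally stall, so restart with a new seed if needed
preds = None
for seed in range(10):
    preds = train_xor(seed)
    if preds == [0, 1, 1, 0]:
        break
```

A successful run classifies all four XOR patterns, which the single-layer networks of the previous sections provably cannot do.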
1.7.4 Batch and Online Training

There are two ways to train a neural network: batch and online training. Batch, or
offline, training means a set of training examples is obtained and used to approximate
the function before it is used in the application. Thus the parameters are updated
only after the entire training set is presented to the network. The gradients
of all examples are computed, and the average of all the gradients is used to get a
better estimate of the gradient [7].
Online training involves continuous weight adjustment of the network by using
the data gathered while the system is in operation. This way of training is capable of
adapting to a time-varying function. This is essential in temporal difference learning
because the target functions change over time. However, there have been mixed successes
in online applications. A problem exists because the network "forgets" previously
learned examples as new examples are presented for training. In practice, it can
be overcome by storing old examples and retraining on them, or by learning slowly
and training extensively [31].
1.8 TD(λ)

TD(λ) is a multi-step prediction algorithm. This multi-step algorithm makes greater
alterations to more recent predictions. The algorithm uses an exponential decay, λ,
where predictions n steps in the past are weighted according to λ^n for 0 ≤ λ ≤ 1.
The weight update equation of TD(λ) is given by

    Δw_t = α (P_{t+1} - P_t) Σ_{n=1}^{t} λ^{t-n} ∇_w P_n,

where α is a constant learning rate, P_t is the prediction at time t, P_{t+1} stands for
the prediction made at time t + 1, and ∇_w P_n is the gradient of the nth prediction
with respect to the weights.
It is important to note that this TD rule is an offline algorithm, so each ∇_w P_n
needs to be recorded and the weights changed at the end of the trial. The case
of TD(0) is very like that of conventional backpropagation. TD(0) is essentially a
single-step prediction algorithm: the TD error is backpropagated to each weight,
and it is determined only by the most recent observation when λ is 0. The weight
update rule is

    Δw_t = α (P_{t+1} - P_t) ∇_w P_t.

So each individual weight, w_ij, is updated using

    Δw_ij = α (P_{t+1} - P_t) δ_j o_i,

where o_i is the estimated output of neuron i, and δ_j is computed using the backpropagation
process discussed in the previous section.
But TD(λ) is treated slightly differently. The process of backpropagation produces
an "eligibility" term for each weight. The exponentially decayed gradient of each
prediction is known as the eligibility. As a temporal difference is determined
at each time step, it is broadcast to all weights. The temporal difference
is then combined with each weight's eligibility to determine the changes to the weights [28].
In order to reduce the computational resources required, the online version of the
algorithm is normally preferred. If we have a way to compute the eligibility incrementally, we can
do the computation online. The eligibility, e, at every time step is computed by

    e_{t+1} = Σ_{n=1}^{t+1} λ^{t+1-n} ∇_w P_n
            = ∇_w P_{t+1} + Σ_{n=1}^{t} λ^{t+1-n} ∇_w P_n
            = ∇_w P_{t+1} + λ Σ_{n=1}^{t} λ^{t-n} ∇_w P_n
            = ∇_w P_{t+1} + λ e_t.
If the neural network has k output neurons, then each weight has k eligibility
traces. The kth eligibility of the weight from neuron i to neuron j is defined as

    e_ij^k(t+1) = λ e_ij^k(t) + δ_j^k o_i,

where o_i is the output of neuron unit i, P^k_{t+1} is the new prediction of the kth
output node, and e_ij^k(t) is the kth eligibility at time t of the weight connecting
unit i to unit j. The δ_j^k = ∂P^k_{t+1}/∂net_j is computed using the backpropagation
algorithm defined as

    δ_j^k = o_j (1 - o_j)                               if j = k (j an output neuron)
    δ_j^k = 0                                           if j ∈ output and j ≠ k
    δ_j^k = o_j (1 - o_j) Σ_{m ∈ output} w_jm δ_m^k     if j ∈ hidden.
Let us try to understand the equation for computing δ. The first and second
conditions indicate how to compute the δs that correspond to weights connecting to
the output layer. So if we are trying to determine the kth δ of a weight that connects
to the kth output neuron, then δ is computed using the first condition. The rest of
the δs are set to 0. The δ that corresponds to a weight that connects an input neuron
to a hidden neuron is calculated by following the third condition. It should be noted
that the equation of TD(λ) for computing δ is slightly different when λ > 0. The
incremental version of the TD(λ) weight update rule is

    w_ij^{t+1} = w_ij^t + α Σ_{k ∈ output} (P^k_{t+1} - P^k_t) e_ij^k.
TD(0) is the focus of this study. It has been shown in Tesauro's TD-Gammon and
Thrun's NeuroChess papers that 0 is the optimal value for λ. The TD(λ) algorithm
is provided because most of the literature that discusses TD(λ) does not provide the
details of the implementation. The details of the algorithm provided above come
from an unpublished paper written by Sutton [27]. That paper is a follow-up
to his introductory paper on temporal difference learning and TD(λ) [28].
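The incremental trace update e_{t+1} = ∇_w P_{t+1} + λ e_t can be sketched for a linear predictor P = w · x, where ∇_w P = x. This is an illustrative version, not the thesis code; it adds an explicit reward term and an optional discount (common in TD implementations), and the example chain is an assumption:

```python
import numpy as np

def td_lambda_episode(w, states, rewards, alpha=0.1, lam=0.8, gamma=1.0):
    """One episode of online TD(lambda) for a linear predictor P = w @ x.

    states  : feature vectors x_t, terminal state last (zero features here)
    rewards : reward received on each of the len(states)-1 transitions
    """
    e = np.zeros_like(w)                    # eligibility trace, one per weight
    for t in range(len(states) - 1):
        x, x_next = states[t], states[t + 1]
        e = gamma * lam * e + x             # e_{t+1} = lambda * e_t + grad_w P_t
        td_error = rewards[t] + gamma * (w @ x_next) - (w @ x)
        w = w + alpha * td_error * e        # broadcast the TD error to all weights
    return w

# example (an assumption): a 3-state deterministic chain, one-hot features,
# reward 1 on entering the terminal state, whose features are all zero
w = np.zeros(3)
chain = [np.array([1., 0., 0.]), np.array([0., 1., 0.]),
         np.array([0., 0., 1.]), np.zeros(3)]
for _ in range(300):
    w = td_lambda_episode(w, chain, rewards=[0.0, 0.0, 1.0])
```

After repeated episodes every component of w approaches 1, the undiscounted value of each state in the chain, with earlier states updated through their decayed eligibilities rather than only on direct visits.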
1.9 Game Playing
1.9.1 General
Many outstanding names in the history of computer science touched upon the domain
of game playing. These researchers include Claude Shannon, the father of information
theory; Alan Turing, renowned for his contributions to the theory of computation
and for his work during World War II in deciphering German war codes; and Herbert
Simon, the father of artificial intelligence.
From the very beginning of the development of digital computers, researchers
became interested in studying how computers might solve complex problems by examining
the game of chess. In 1944, John von Neumann presented the minimax
algorithm for selecting the move to make in chess [16]. In 1950 Claude Shannon published
the paper that detailed the procedures to implement computer chess [26]. To this
day, chess programs are based on Shannon's ideas. Simultaneously with Shannon's
work, Alan Turing published his approach to the automation of chess strategy [38].
His ideas were very similar to Shannon's. Finally, in 1955, Allen Newell, John Shaw,
and Herbert Simon wrote the Newell, Shaw, and Simon (NSS) program that attempted
to simulate the human mind's approach to selecting moves in chess [18].
In a recent game playing development, an IBM supercomputer named Deep Blue
defeated the World Champion, and possibly the best human chess player ever,
Garry Kasparov, 3.5 to 2.5 in a six-game match [17]. Table III shows facts
about both contenders. The most shocking moment of the match came in
the sixth game, in which Deep Blue defeated the World Champion in just nineteen
moves. Figure 15 shows the final position of the historic sixth game.
Table III. Facts about the most publicized chess match between a computer and a
human

Facts                  Garry Kasparov        Deep Blue
Height                 5'10"                 6'5"
Weight                 176 lbs.              1.4 tons
Age                    34 years              4 years
Birthplace             Azerbaijan            Yorktown, NY
Number of processors   50 billion neurons    32 P2SC processors
Moves per second       2                     200 million
Power source           electrical/chemical   electrical
Next career            champion              pharmaceutical design
1.9.2 Reinforcement Learning in Game Playing
The one application that is always mentioned for its pioneering success in applying
reinforcement learning (RL) to game playing is Samuel's checkers playing system
[24, 25]. The value function is learned and represented by a linear function approximator,
and the training is done similarly to TD updates.
Deep Blue (White) versus Garry Kasparov (Black):

 1. e4 c6
 2. d4 d5
 3. Nc3 dxe4
 4. Nxe4 Nd7
 5. Ng5 Ngf6
 6. Bd3 e6
 7. N1f3 h6
 8. Nxe6 Qe7
 9. O-O fxe6
10. Bg6+ Kd8
11. Bf4 b5
12. a4 Bb7
13. Re1 Nd5
14. Bg3 Kc8
15. axb5 cxb5
16. Qd3 Bc6
17. Bf5 exf5
18. Rxe7 Bxe7
19. c4 Resigns

Figure 15. Garry Kasparov versus Deep Blue: Game 6 final position
TD-Gammon successfully applied TD learning to backgammon [32,
33, 34]. Backgammon has approximately 10^20 states. Tesauro used a combination
of TD and a three-layer ANN with 80 hidden units as the function approximator
for generalizing experience. The successful result was achieved by constant self-play.
The program always acts greedily, choosing the move with the best chance to
win. The success of this strategy is rather surprising considering all the studies
that have been done to discover better exploration methods. Training required
several months of computer time for 1,500,000 games. The program has
competed at the very top level of international human play.
Many researchers have attempted to reproduce the success of TD-Gammon by
using the TD learning approach in other games such as chess. Gould [6] implemented
Morph using Adaptive Predictive Search (APS), a learning framework that, given
little initial domain knowledge, increases its predictive abilities in complex problem
domains such as chess. The results show that GnuChess's [14] level of play is
higher than Morph. Sebastian Thrun [35} combined TD learning and Explanation
'I
:'I,
'II
Based Neural etwork (EBN ) to train euraChe . The EB
38
learning algorithm
[36] enables the computer to learn more accurately from less training data by taking
advantage of other previously acquired knowledge, even if it is inexact, to significantly
improve accuracy for the new learning task. His research shows that EBNN s
generalize better than AN's using backpropagation. The main drawback of euroChess
is that it does not develop good chess openings. Jonathan Baxter [2] created
KnightCap, a chess program that learns by combining TD(λ) with minimax search.
This program learns to play only the middle and end games; KnightCap selects chess
openings from a database. These results are not very successful compared to TD-Gammon's.
Several very successful chess-playing programs have been developed over
the years [12, 11, 17]. However, none of these programs used reinforcement learning.
CHAPTER II
RESEARCH OBJECTIVE AND METHODOLOGY
2.1 Research Objectives
The main objective of this research is to conduct a study in the area of reinforcement
learning. This study attempts to understand the internal workings of an RL agent
learning to act. This is accomplished by implementing the reinforcement learning
framework and experimenting with it in the domain of game playing. The learning
mechanism that is of interest in this study is the temporal difference learning method.
Tic-tac-toe is commonly used to begin the understanding of reinforcement learning
methods in the domain of game playing. Tic-tac-toe is a small problem with at most
3^9 states in its state space. It is used to answer most of the questions of this study.
The first area of concern is to find out how learning is affected by the learning rate,
the discount rate, the exploration rate, and by the difference between the off-policy
and on-policy learning methods. The agent is implemented to play many games of
tic-tac-toe by trial and error. Its performance is then recorded to conduct the
analysis. In this part of the study, the Q-value of each state-action pair is stored in
a lookup table.
This research then switches its attention to examining the effectiveness of approximating
the value function using a multilayer feedforward network trained by the
backpropagation algorithm. It is not the intention of this research to optimize the
learning rate of the backpropagation algorithm. The goal of this part of the research
is to understand how reinforcement learning and a function approximator are combined
to solve sequential decision problems. This is important because generalization scales
up reinforcement learning to solve practical problems that have a large state space.
It is an objective of this part of the study to find out whether an RL agent will learn
an optimal policy. Analysis will be carried out to find factors that could lead to the
success of generalized reinforcement learning.
2.2 Details of Implementation
Simulation is the evaluation tool for this research. The simulation is written in C++
to utilize the power of object-oriented programming. This allows convenient interaction
between the various modules of the program, as well as good run-time speed. I
used Microsoft Visual C++ 4.0 as the tool for coding and debugging.
The class diagram of this experiment is shown in Figure 16:
1. SIMULATOR: The main function of this module is to provide the user with
a user interface to configure the experiment model.
2. RL FRAMEWORK: A module that models the framework shown in Figure 1.
Learning through trial and error is done here. The learning algorithms are Q-learning
and SARSA.
3. POLICY: In this module, the agent decides which action to take, given the
state that it is in at a given time, by following its behavioral policy. The
behavior policy that is implemented is ε-greedy.
4. ENVIRONMENT: This is the opponent that the agent is playing against.
After the opponent has chosen its move, it returns two things to the agent:
the subsequent state that the system has evolved into, and the reward for the
agent's move selection. The opponent used in this experimentation for learning
tic-tac-toe is discussed in further detail in Section 2.3.
5. Q-VALUE: The Q-value is either stored in a lookup table or approximated
by an ANN. Double hashing [40] was implemented for storing and looking
up the Q-values. The multilayer neural network using the backpropagation
algorithm was implemented using source code from Rogers's book [20]. The
ANN and the lookup table are both subclasses of Q-VALUE. This module returns
and modifies the Q-value of each state-action pair.
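The ε-greedy rule in the POLICY module can be sketched as follows. This is a minimal illustration rather than the thesis's C++ code; the function and variable names are hypothetical:

```python
import random

def epsilon_greedy(q_values, legal_actions, epsilon):
    """With probability epsilon pick a random legal action (explore);
    otherwise pick the legal action with the highest Q-value (exploit)."""
    if random.random() < epsilon:
        return random.choice(legal_actions)
    return max(legal_actions, key=lambda a: q_values[a])

# With epsilon = 0 the agent is purely greedy.
q = {0: 0.1, 1: 0.7, 2: -0.3}
print(epsilon_greedy(q, [0, 1, 2], 0.0))
```

Setting epsilon to 0.1, as in most of the experiments reported later, makes roughly one move in ten exploratory.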
[Class diagram omitted: SIMULATOR aggregates the RL FRAMEWORK, which aggregates the POLICY, ENVIRONMENT, and Q-VALUE modules; the lookup table and the ANN inherit from Q-VALUE.]
Figure 16. Reinforcement learning simulation class model
2.3 Details of the Opponent
The tic-tac-toe opponent is a minimax player that searches two plies for a move. The
minimax algorithm has random noise added to its move selection, which causes it to
lose games by missing blocks or missing winning moves. The stochastic behavior
of this opponent should be ideal for testing the effectiveness of reinforcement learning
in making decisions under uncertainty.
This opponent was put to a test by playing against a random move generator, with its
performance averaged over ten thousand games. The equity is the expected number of
wins minus the expected number of losses. The skill of this opponent is shown in
Table IV.
Table IV. Performance of the tic-tac-toe opponent used in the experiments, tested
by playing against a random move generator

Opponent                                 Wins    Draws   Losses   Equity
Minimax 2-ply with random exploration    70.51   8.63    20.86    49.65
2.4 Details of Neural Network Training
The update equation for TD learning used in value function approximation follows
Thrun's [37] training strategy, where:

    Q(s, a) = +1                         if the game is won
               0                         if the game is a draw
              -1                         if the game is lost
               r + γ max_a' Q(s', a')    otherwise
It is not the primary goal of this project to fine-tune the network parameters
for the fastest learning rate possible. These are the common settings used in the
experiments:
• Network weights are initialized randomly between -0.1 and 0.1.
• Weights of the network are adjusted online.
• The original backpropagation algorithm is used, with no momentum coefficient.
• Each network has one layer of hidden units.
• The sigmoid function is used as the activation function for both the hidden and
output layers. The sigmoid function produces output values in [0, 1], so the
reward signal, which lies in [-1, 1], must be rescaled:
  - Before training the network, the Q-value in [-1, 1] is scaled down to the
    range [0, 1] before it can be used as the target (denoted by o):

        o = (Q + 1) / 2.

  - The output of the network is scaled back into the range [-1, 1] before
    it is used for TD learning:

        Q = 2o - 1.
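These two rescalings are inverse linear maps; a quick sketch:

```python
def q_to_target(q):
    """Scale a Q-value in [-1, 1] to a network target o in [0, 1]."""
    return (q + 1.0) / 2.0

def output_to_q(o):
    """Scale a network output o in [0, 1] back to a Q-value in [-1, 1]."""
    return 2.0 * o - 1.0

# A win (+1) becomes target 1.0, a draw (0) becomes 0.5,
# and a loss (-1) becomes 0.0; the two maps invert each other.
print(q_to_target(1.0), q_to_target(0.0), q_to_target(-1.0))
```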
A multilayer neural network is used to learn the mapping from each state-action
pair to its Q-value. Two input units are used to represent one tic-tac-toe square: 00,
01, and 10 for ' ', 'O', and 'X' respectively. The Q-value of each action is represented
by nine output neurons, where each neuron represents a tic-tac-toe square. Each
time, only one of the nine neurons is updated, according to the square that was played.
The training examples are then fed to the network to approximate Q*, as shown in
Figure 17.
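The raw-board input described above can be sketched as follows (a hypothetical helper, not the thesis code):

```python
# Two input units per square: ' ' -> 00, 'O' -> 01, 'X' -> 10.
ENCODING = {' ': (0, 0), 'O': (0, 1), 'X': (1, 0)}

def encode_board(board):
    """Flatten a 9-square tic-tac-toe board into the 18 input units."""
    bits = []
    for square in board:
        bits.extend(ENCODING[square])
    return bits

board = ['X', 'O', ' ',
         ' ', 'X', ' ',
         ' ', ' ', 'O']
print(encode_board(board))  # 18 values feeding the 18-15-9 network
```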
Studies have shown that feature selection is important to the effectiveness and
success of representing the problem [35]. There are eight different positions in which
a player can win a game of tic-tac-toe: one of the three columns, one of the three
rows, or one of the two diagonals. Since these are the key features that we look for
in a game, this information is used as the input. This neural network uses 00, 01,
and 10 to represent ' ', 'O', and 'X'. It then uses 48 neurons to represent all eight
winning positions, whereby six neurons are used to represent each winning position.
These inputs are mapped to nine output
[Figure omitted: a feedforward network mapping the board state (two input units per square) through one hidden layer to nine Q-value outputs; only the output neuron for the action selected by the agent is updated.]
Figure 17. Evaluation function neural network for a tic-tac-toe example
nodes. Each output node represents the Q-value of one possible action. The results
of using this representation are reported in Chapter III.
2.5 Method of Analysis
Data were gathered by running the simulation. The data include the wins, losses,
draws, mean square error, and equity. The agent's ability to win is measured by its
equity. The performance is analyzed by generating charts using these data.
The results were gathered when the agent halted learning temporarily and played
one hundred games greedily. It is important to halt learning temporarily because
exploration is involved during the learning process. These data were gathered after
every one thousand games of training. The effectiveness of using generalization in
reinforcement learning is determined by using the results from the lookup table as the
benchmark.
CHAPTER III
EMPIRICAL RESULTS
3.1 Bounded Random Walks
To test whether both Q-learning and SARSA actually converge to the optimal policy,
a simple test was conducted with a bounded random walk. In this experiment, there
are seven states that the agent can be in. The agent starts at state 4 and steps to
a neighboring state at random. A reward of +1 is assigned if the agent reaches
state 7, and a reward of -1 is assigned for ending the walk at state 1. This
problem is shown in Figure 18.
[Figure omitted: seven states in a row; state 4 is the starting state, and the walk terminates at state 1 with reward -1 or at state 7 with reward +1.]
Figure 18. Bounded random walks
The Q-values of each state-action pair after 1000 trials are shown in Table V and
Table VI.
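The bounded random walk is small enough that both update rules can be reproduced in a few lines. The sketch below is not the thesis's C++ implementation; the learning rate, the discount rate of 0.9, and the trial count are assumptions. The agent behaves as a pure random walk while learning Q-values with either backup rule:

```python
import random

LEFT, RIGHT = -1, +1

def run_walks(algorithm, trials=20000, alpha=0.02, gamma=0.9, seed=1):
    """Learn Q-values for the bounded random walk: states 1..7,
    start at state 4, terminal reward -1 at state 1 and +1 at state 7.
    The behavior policy is a pure random walk; algorithm selects the
    backup rule, 'q' (Q-learning) or 'sarsa'."""
    random.seed(seed)
    q = {(s, a): 0.0 for s in range(2, 7) for a in (LEFT, RIGHT)}
    for _ in range(trials):
        s, a = 4, random.choice((LEFT, RIGHT))
        while True:
            s2 = s + a
            if s2 in (1, 7):                  # terminal step: no bootstrap
                r = 1.0 if s2 == 7 else -1.0
                q[(s, a)] += alpha * (r - q[(s, a)])
                break
            a2 = random.choice((LEFT, RIGHT))
            if algorithm == 'q':              # max over next actions
                target = gamma * max(q[(s2, LEFT)], q[(s2, RIGHT)])
            else:                             # SARSA: the action actually taken
                target = gamma * q[(s2, a2)]
            q[(s, a)] += alpha * (target - q[(s, a)])
            s, a = s2, a2
    return q

for name in ('sarsa', 'q'):
    q = run_walks(name)
    print(name, {s: round(q[(s, RIGHT)], 3) for s in range(2, 7)})
```

With either rule the greedy policy comes to prefer walking right in every state, while the SARSA values retain the sign information discussed in the text.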
Table V. Q-values generated by SARSA

State\Action   Left          Right
4              -0.0811       0.0559175
3              -0.260868     0.0120768
2              -0.921006     0.0876725
5              0.00294266    0.264135
6              0.0762577     0.921555
Table VI. Q-values generated by Q-learning

State\Action   Left          Right
4              0.466433      0.473384
3              0.460682      0.463019
2              -0.927747     0.46221
5              0.462804      0.463898
6              0.481388      0.928404
Both SARSA and Q-learning have learned the optimal policy. However, the Q-values
produced by Q-learning are somewhat misleading: it only punishes the action of
walking to the left from state 2 to state 1, and the rest of the Q-values are all
positive because of its max operator. The Q-values generated by SARSA, on the
other hand, are more informative. Starting from state 4, a negative Q(4, 3) of
-0.0811 tells us that we will probably end up getting punished for walking to
the left, while a Q(4, 5) of 0.056 tells us that the agent is more likely to be
rewarded by choosing to walk to the right.
3.2 TicTacToe
The Q-values of experiments 1, 2, 3, and 4 are stored in a lookup table. Using the
actual Q-values allows us to observe the effects of changing the parameters of temporal
difference learning. Experiment 5 approximates the Q-value using an ANN trained
with backpropagation, using the network architecture illustrated in Section 2.4.
3.2.1 Experiment 1. Learning Rate
The first experiment tested the learning rate. SARSA was selected to learn the game
of tic-tac-toe. The agent played against a minimax opponent that searched 2 plies
deep for a move. It followed the ε-greedy behavior policy to handle exploration. The
exploration rate was set to 0.1 and a total of 50,000 games were played by the agent,
using two different learning rates of 0.1 and 1.0. Figures 19 and 20 show how the
learning curve was affected by the learning rate, α.
[Figure omitted: mean square error per 1,000 games for learning rates 0.1 and 1.0; the curve for 1.0 fluctuates heavily.]
Figure 19. Mean square error: Effects of learning rate on the performance
[Figure omitted: equity per 1,000 games for learning rates 0.1 and 1.0.]
Figure 20. Equity: Effects of learning rate on the performance
When the learning rate was set to 1.0, each backup added the full temporal
difference between Q(s_t, a_t) and Q(s_{t+1}, a_{t+1}) instead of a fraction of the
temporal difference. So Q(s_t, a_t) was replaced by Q(s_{t+1}, a_{t+1}) every time
Q(s_t, a_t) was backed up. When the agent explored, a positive Q(s_t, a_t) could be
replaced by a negative Q(s_{t+1}, a_{t+1}). This is what caused the fluctuations of
the mean square error.
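The effect is visible directly in the update rule Q(s_t, a_t) ← Q(s_t, a_t) + α[target - Q(s_t, a_t)]. A small numerical illustration (the values are chosen for the example):

```python
def td_update(q, target, alpha):
    """One temporal-difference backup of a single Q-value."""
    return q + alpha * (target - q)

q_old = 0.5        # current estimate for a genuinely good action
bad_target = -1.0  # bootstrapped from an exploratory, losing continuation

print(td_update(q_old, bad_target, 0.1))  # moves only a fraction toward -1
print(td_update(q_old, bad_target, 1.0))  # old estimate replaced by the target
```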
3.2.2 Experiment 2. Discount Rate
In Section 1.5.1, we said that the use of a discount rate in the update equation should
affect only the agent's preference among policies; it should not affect the agent's
performance. An experiment was carried out to verify whether this statement holds.
The agent played 50,000 games using SARSA as the learning algorithm. The objective
was to compare undiscounted learning with discounted learning. The learning rate
was 0.1, the exploration rate was 0.1, and the discount rates used were 0.8 and 1.0.
Figures 21 and 22 show the equity and the mean square error for both discount rates.
[Figure omitted: equity per 1,000 games for discount rates 0.8 and 1.0.]
Figure 21. Equity: Discounted versus nondiscounted learning
[Figure omitted: mean square error per 1,000 games for discount rates 0.8 and 1.0.]
Figure 22. Mean square error: Discounted versus nondiscounted learning
The tic-tac-toe opponent used in this experiment uses a minimax search algorithm
to determine its moves. However, the algorithm has random moves incorporated into
it, which makes the opponent's move selection nondeterministic. Using the method
of temporal difference learning, the Q-value of each action will converge to its
asymptotic value if sufficient training time is provided. By not discounting the
Q-value, the agent does not prefer moves that win the game in fewer moves. When
the agent does not prefer to win the game immediately, it indirectly increases the
chances that its opponent settles for more ties. That is why the agent that used
a discount rate performed better. Therefore, the use of a discount rate in a certain
class of problems may greatly improve the solution.
3.2.3 Experiment 3. Exploration Rate
Exploration enables the agent to discover better decisions given that such decisions
exist. When the exploration rate was varied, there was an unexpected result. The
experiment used an €greedy strategy as the exploration strategy to test the performance
difference between a greedy agent and an agent that explores. Figur s 24 and
23 show the differences in the learning curve between an agent that explores (€=0.1)
and an agent that does not (€=1.0). The learning rate was 0.1, the discount rate was
0.8, and the learning algorithm was SARSA.
The information shown in the graphs is misleading. Looking at the equity graph,
it seems as though the agent that does not explore performed better than the agent
that does. The mean equities, obtained by averaging the equity of the last 20,000
games, are 61.0 and 50.48 for the greedy agent and the non-greedy agent respectively.
Without exploration, the equity is higher because the greedy agent has only
experienced a small subset of the state space, whereas the agent that explores is
learning to act in all possible states. Thus, if both agents are allowed to play
against a human player after training, the greedy agent will lose more games because
it has many states that are left unexplored. So exploration is a key factor in
determining the overall performance of the agent.
[Figure omitted: mean square error per 1,000 games for ε = 0.0 and ε = 0.1.]
Figure 23. Mean square error: Exploration versus exploitation
[Figure omitted: equity per 1,000 games for ε = 0.0 and ε = 0.1.]
Figure 24. Equity: Exploration versus exploitation
3.2.4 Experiment 4. Q-learning versus SARSA
This section reports the results obtained by comparing Q-learning with SARSA. This
was an effort to find out whether there is any advantage to Q-learning or its variation,
SARSA. The agent was trained to learn the game of tic-tac-toe. In both cases, the
learning rate was set to 0.1, the exploration rate was 0.1 using the ε-greedy exploration
strategy, and the discount rate was 0.8. The equity and mean square error charts are
illustrated in Figures 25 and 26.
[Figure omitted: equity per 1,000 games for Q-learning and SARSA.]
Figure 25. Equity: Q-learning versus SARSA
Q-learning and SARSA were expected to learn the optimal policy. Indeed, the
results show that there are virtually no differences in their performances after they
stabilize. To verify this, the equities of both methods were averaged before
and after 25,000 games of training. The average equity of the first 25,000 games
shows the speed at which each algorithm learns to win. The remaining
25,000 games were used to determine the mean equity of both methods
after the equity stabilized. The results are shown in Table VII. In both cases, the
equity increases at almost the same rate. The mean equity of the two methods for the
[Figure omitted: mean square error per 1,000 games for Q-learning and SARSA.]
Figure 26. Mean square error: Q-learning versus SARSA
remaining 25,000 games differs by only 0.47. The mean square errors of Q-learning
and SARSA were reduced at a similar rate as well. Thus, there is no visible
advantage to using one over the other.
Table VII. Q-learning versus SARSA: Mean equity of the first and last 25,000
games

Games        1 - 25,000   25,001 - 50,000
Q-learning   31.37        49.88
SARSA        30.96        50.35
3.3 Function Approximator
This section discusses the results obtained by applying a multilayer feedforward neural
network, trained with the backpropagation algorithm, to approximate the value function.
By doing this, we hope to obtain a near-optimal value function. The results obtained by
generalizing the value function with different network architectures are provided and
discussed.
3.3.1 Experiment 1. Raw Board Representation
The first experiment dealt with using a three-layer 18-15-9 network architecture to learn
the mapping from states to the Q-values of actions. The input to the neural network
was the raw board representation of the game in progress (explained in Section 2.4).
The learning rate was set to 0.1, the discount rate was 0.8, the exploration rate was
0.1, and the agent played 500,000 games. The learning algorithm was SARSA. The
objective of this experiment was to determine the efficiency of this method and
the performance of the agent.
As shown in Figure 28, the average mean square error over the last 100,000 games
was 0.059. It stabilized after approximately 250,000 games of training.
[Figure omitted: equity per 1,000 games over 500,000 games of training.]
Figure 27. Equity: An 18-15-9 network trained with backpropagation is used to approximate the value function
Compared with using a lookup table, the equity obtained in this experiment
is much lower. The equity of this experiment is shown in Figure 27. Averaging over
the last 50,000 games of this experiment, the mean equity was 26.98. This is
significantly lower than the mean equity obtained using a lookup table, which was
approximately 50.
[Figure omitted: mean square error per 1,000 games over 500,000 games of training.]
Figure 28. Mean square error: An 18-15-9 network trained with backpropagation is used to approximate the value function
Secondly, the gradient descent method for approximating the value function is
slow. It takes close to 350,000 games to reach its maximum equity of around 25, whereas
when a lookup table was used to store the Q-values, the agent reached an equity of 25
in fewer than 10,000 games of training. In this case, generalization did not speed up
learning, and it took a significantly larger number of games to learn a suboptimal
solution using this representation.
3.3.2 Experiment 2. Feature Selection
This section reports the findings about using all eight winning positions in tic-tac-toe
to represent the input. Let us call this network NETFS, and the network that uses
the raw board representation NETBP. In each case, the agent was trained for 150,000
games. The learning rate was 0.1, the exploration rate was 0.1, and the discount
rate was 0.8. The learning algorithm was SARSA. The results of the lookup table,
NETFS, and NETBP are compared in this section.
The mean square error of NETFS is much lower than that of the lookup table, as
shown in Figure 29. However, a lower mean square error does not mean that a better
policy was learned: the convergence of the mean square error does not necessarily
imply that an optimal policy is learned [5].
[Figure omitted: mean square error per 1,000 games for the feature-based network and the raw-board network.]
Figure 29. Mean square error of a network in which the input representation incorporated hand-selected features and was trained with backpropagation
Figure 30 compares the equity of the lookup table, NETBP, and NETFS. Averaging
the equity of the last 25,000 games, the average equity of the lookup table is
17.03 units higher than that of NETFS. On the other hand, NETFS reached a higher
equity than NETBP.
However, when we look at the number of losses of the lookup table and NETFS,
the numbers are quite close, as shown in Figure 31. NETFS resulted in more
draws than the lookup table. NETFS has learned a near-optimal policy compared to
[Figure omitted: equity per 1,000 games for the lookup table (table), the feature-based network (feature), and the raw-board network (regbp).]
Figure 30. Equity: A network in which the input representation incorporated hand-selected features (feature) and was trained with backpropagation, compared to an ANN using the raw board representation (regbp) and a lookup table (table)
the policy obtained by the lookup table. In another comparison, NETFS learns
significantly faster and has better performance than NETBP. This experiment shows
that the selection of input features to represent a problem plays a major role in
determining the success of approximating the value function.
[Figure omitted: number of losses per 1,000 games for the lookup table and the feature-based network.]
Figure 31. Number of losses: Comparing the results obtained by the lookup table to a network in which the input representation used hand-selected features
CHAPTER IV
SUMMARY, CONCLUSIONS, AND RECOMMENDATIONS
4.1 Summary
This research studies the internal workings of reinforcement learning. Reinforcement
learning solves sequential decision problems through trial and error, which gives it the
advantage of being able to learn in real time. The learning algorithms in this
experimentation were Q-learning and a variation of it called SARSA. Both of these
learning mechanisms are one-step prediction algorithms that learn a value function, a
mapping from a state-action pair to a real number. They update a prediction at time
t to be a fraction closer to the prediction at time t + 1. These algorithms are
special cases of a multistep prediction algorithm known as TD(λ). At every
time step, TD(λ) updates a prediction by exponentially decaying previous predictions
based on recency, where λ is the decay factor. Q-learning and SARSA are essentially
TD(0), a one-step prediction algorithm in which λ is 0. TD is the abbreviation for
temporal difference learning.
In this study, temporal difference learning was used to learn the game of tic-tac-toe.
The Q-values were either stored in a lookup table or approximated by a
multilayer neural network using the backpropagation algorithm. The effects of the
learning rate, the discount rate, the exploration rate, and both forms of Q-learning were
experimented with. In these experiments, the Q-values were stored in a lookup table.
If a function approximator was used, then the objective switched to analyzing
the effectiveness and efficiency of generalization in reinforcement learning.
When using a lookup table to store the Q-values, the results of learning tic-tac-toe
were successful. The constant learning rate did not affect the solution except when
the learning rate was set close to 1. This was less effective because we want each
prediction to be a fraction closer to the next prediction, not to take on the value of
the next prediction. Otherwise, a good action could be updated to become a bad
action when the next prediction comes from an exploratory move. The second test dealt
with examining discounted versus nondiscounted learning. The discount rate
should affect only the selection among optimal policies. By using a discount rate, the
tic-tac-toe agent prefers actions that lead to a win in the fewest moves.
However, this experiment showed that using a discount rate generates a better policy:
when no discount is used, the agent has no preference for winning quickly, and increasing
the length of the game inadvertently increases the probability of not winning the
game. This is why a discount rate should be used in learning to win at tic-tac-toe.
The third experiment dealt with the exploration rate. A test was conducted to
compare the outcome of using a greedy policy versus a ten percent exploration
rate in the agent's decision making. When no exploration was used, the results gave
a very misleading statistical report. The mean square error was reduced significantly
faster than with exploration. Secondly, the results show that the greedy agent actually
learned to win at tic-tac-toe better than the agent that explores. However, this
does not mean that no exploration is a better strategy. If the agent does not explore,
it leaves a lot of the state space unexplored. So when the greedy agent is put to
the test after training by playing against a better player, say a human player who will
not lose, it will be clear that the agent that explores is the better overall player. Thus
exploration is an important factor in determining the overall performance of the agent.
The second half of the experiment dealt with using a multilayer feedforward network
trained by the backpropagation algorithm to approximate the value
function. A function approximator is commonly used to generalize the value function
because it may speed up learning. In addition, the ability to generalize also provides
the ability to scale up reinforcement learning for solving problems that have large
state spaces. In the first experiment, the input was encoded using the raw board
representation. The network then mapped the state of the game to each action's Q-value.
The empirical results showed that learning was much slower compared
to storing the Q-values in a lookup table; it took roughly thirty times as many
games to reach the performance attained by table Q-learning. In the second experiment
with a generalized value function, the tic-tac-toe board was represented by the eight
winning positions of tic-tac-toe as the input. By mapping the winning positions of
tic-tac-toe to each action's Q-value, the agent was capable of learning a policy
comparable to the results of table Q-learning. Although this representation of tic-tac-toe
reduced the mean square error faster than table Q-learning, table Q-learning had
a slightly higher equity. This finding suggests that the selection of features for the
input representation of the problem is critical to the success of generalized
reinforcement learning.
4.2 Conclusions
The experiments conducted in this study show that applying Q-learning to the game
of tic-tac-toe is indeed successful. The optimal policies of the agent beat
the opponent close to seventy percent of the time in just 50,000 games of training.
The selection of the discount rate, the learning rate, and the exploration rate affects
the policy that is learned by the agent.
Experimental findings about combining a multilayer feedforward network using
backpropagation with Q-learning show that the combination is not very reliable. Different
results can be obtained if different representations of the same problem are used. It takes
a lot of experience and understanding of the tricks and tips in this field of study to
make them work together successfully. In my experiments, when a function
approximator was used, the tic-tac-toe agent initially learned to win at a pace similar
to table Q-learning. After 10,000 games of training, it reached a point where it appeared
to have stopped learning. At this point, the agent lost half of its games. It was
discovered later that the agent was actually still reducing the mean square error at a
very slow pace. Without the analysis of the mean square error, I would not have
realized that the agent was improving its value function slowly. After an additional
300,000 games of training, the agent improved its equity from approximately 0 to
approximately 25. However, it was learning very slowly; it would be beneficial to
look at some of the algorithms that are designed to speed up backpropagation. My
final conclusion concerning generalized reinforcement learning is that extensive
experimentation with different network architectures for representing the same problem,
together with analysis of the mean square error during the learning process, is key
to generating a better policy.
4.3 Recommendations
Here are some suggestions for potential future research:
1. Find algorithms that can speed up the backpropagation algorithm and implement
them to generalize the value function. Use them to train an RL agent that
learns to win at tic-tac-toe and several other sequential decision problems. Compare
the results to see whether every approach produces the optimal solution.
2. First find optimization algorithms for backpropagation and use them to train an
RL agent to play tic-tac-toe. If they are successful, then apply this method in
the domain of backgammon, chess, or Go. Find out how other researchers select
features that they think are important in one of those games. Then use those
hand-selected features to represent the input and use them to learn that game.
4.4 Concluding Comment
This study provides a close-up understanding of the internal workings of reinforcement
learning methods. The empirical results obtained in the experiments are successful
when a lookup table is used to store the Q-values. This method of learning through
trial-and-error proved to be very robust only when a lookup table was used, and using
a lookup table is prohibitively costly for games with a large state space such as chess.
In my opinion, even though reinforcement learning and supervised learning are
considered two separate fields of study, it seems that an understanding of both
fields is compulsory when studying reinforcement learning. I say this because
most of the published literature that I have read was concerned with the scalability
of reinforcement learning. Researchers hope that the reinforcement learning mechanism
can be integrated with different function approximators more successfully in solving large
real-world problems. Many papers have been published that prove the convergence of
temporal difference learning algorithms used with different function approximators.
But convergence does not mean that an optimal policy is learned. Local function
approximators such as CMAC and radial basis networks are recommended for use
with TD learning. However, CMAC is not fully scalable, even though
hashing provides better and more efficient memory usage. Experience with scalable global
function approximators, such as an artificial neural network trained by the
backpropagation algorithm, shows that this method of generalizing the value function may
produce only suboptimal solutions or, in the worst case, may diverge. Discovering
better approaches to scaling up reinforcement learning is still an ongoing research area.
REFERENCES
[1] D. H. Ackley and M. L. Littman. Generalization and scaling in reinforcement learning. Advances in Neural Information Processing Systems, 2:550-557, 1990.
[2] J. Baxter, A. Tridgell, and L. Weaver. KnightCap: A chess program that learns by combining TD(λ) with minimax search. Technical report, Australian National University, Department of Systems Engineering, November 1997.
[3] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.
[4] R. H. Crites and A. G. Barto. Improving elevator performance using reinforcement learning. Advances in Neural Information Processing Systems, 8:1017-1023, 1996.
[5] G. J. Gordon. Stable function approximation in dynamic programming. In Proceedings of the Twelfth International Conference on Machine Learning, pp. 261-268. Morgan Kaufmann, San Francisco, CA, 1995.
[6] J. Gould and R. Levinson. Machine learning: A multistrategy approach: Experience-based adaptive search. Advances in Neural Information Processing Systems, 4:579-603, 1994.
[7] M. T. Hagan, H. B. Demuth, and M. Beale. Neural Network Design. PWS, Boston, MA, 1996.
[8] T. Jaakkola, M. I. Jordan, and S. P. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6):1185-1201, November 1994.
[9] L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237-285, 1996.
[10] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671-680, 1983.
[11] D. N. L. Levy and M. M. Newborn. All About Chess and Computers: Containing the Complete Works, Chess and Computers. Computer Science Press, Potomac, MD, 1983.
[12] D. N. L. Levy and M. M. Newborn. How Computers Play Chess. W. H. Freeman and Company, New York, NY, 1991.
[13] S. Mahadevan and L. P. Kaelbling. The ational Science Foundation work hop
on reinforcement learning. AI Magazine, 17(4):8997, 1996.
[14] T. Mann. Gnu chess and XBoard: Frequently asked question. URL:
http://www.research.digital.com/SRC/personal/Tim...Mann/gnuches /FAQ.html.
ewsgroup: gnu. chess. (Date accessed: Nov. 1998)
[15] M. Minsky and S. Papert. Perceptrons. MIT Press, Cambridge, MA, 1969.
[16] J. Von Neumann and O. Morgenstern. Theory of Games and Economic Behavior.
Princeton Univ. Press, Princeton, NJ, 1944.
[17] M. M. ewborn. Kasparov Versus Deep Blue: Computer Chess Comes of Age.
SpringerVerlag, New York, NYt 1996.
[18] A. Newell, J. Shaw, and H. Simon. Chessplaying programs and the problem of
complexity. In E. Feigenbaum and J. Feldman, editors, Computers and Thought,
pp. 3970. McGrawHill, New York, NY, 1963.
[19] M. L. Puterman. Markov Decision Processes. John Wiley and Sons, New York,
NY, 1994.
[20] J. Rogers. Object-Oriented Neural Networks in C++. Academic Press,
Huntsville, AL, 1997.
[21] F. Rosenblatt. The perceptron: A probabilistic model for information storage
and organization in the brain. Psychological Review, 65(6):386-408, 1958.
[22] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations
by back-propagating errors. Nature, 323(6088):533-536, 1986.
[23] S. J. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice
Hall, Upper Saddle River, NJ, 1995.
[24] A. L. Samuel. Some studies in machine learning using the game of checkers. IBM
Journal of Research and Development, 3(3):211-229, 1959.
[25] A. L. Samuel. Some studies in machine learning using the game of checkers II.
IBM Journal of Research and Development, 11(6):601-617, 1967.
[26] C. E. Shannon. Programming a computer for playing chess. In N. J. A. Sloane
and A. D. Wyner, editors, Claude Elwood Shannon Collected Papers, pp. 637-656.
IEEE Press, New York, NY, 1993.
[27] R. S. Sutton. Implementation details of the TD(λ) procedure for the case of
vector predictions and backpropagation. Technical Report TN87-509.1, GTE
Laboratories Incorporated, May 1987.
[28] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine
Learning, 3(1):9-44, 1988.
[29] R. S. Sutton. Generalization in reinforcement learning: Successful examples
using sparse coarse coding. Advances in Neural Information Processing Systems,
8:1038-1044, 1996.
[30] R. S. Sutton and A. G. Barto. An Introduction to Reinforcement Learning. MIT
Press, Cambridge, MA, 1998.
[31] R. S. Sutton and S. D. Whitehead. Online learning with random representations.
In Proceedings of the Tenth International Conference on Machine Learning, pp.
314-321. Morgan Kaufmann, San Mateo, CA, 1993.
[32] G. Tesauro. Practical issues in temporal difference learning. Advances in Neural
Information Processing Systems, 4:259-266, 1992.
[33] G. Tesauro. TD-Gammon, a self-teaching backgammon program, achieves
master-level play. Neural Computation, 6(2):215-219, 1994.
[34] G. Tesauro. Temporal difference learning and TD-Gammon. Communications of
the ACM, 38(3):58-67, March 1995.
[35] S. Thrun. Learning to play the game of chess. Advances in Neural Information
Processing Systems, 7:1069-1076, 1995.
[36] S. Thrun. Lifelong learning: A case study. Technical Report CMU-CS-95-208,
Carnegie Mellon University, Pittsburgh, PA, 1995.
[37] S. Thrun. ExplanationBased Neural Network Learning: A Lifelong Learning
Approach. Kluwer Academic Publishers, Norwell, MA, 1996.
[38] A. Turing. In B. V. Bowden, editor, Faster than Thought: A Symposium on
Digital Computing Machines. Pitman, London, UK, 1953.
[39] C. J. Watkins. Learning from delayed rewards. Ph.D. thesis, King's College,
Cambridge, England, 1989.
[40] M. A. Weiss. Data Structures and Algorithm Analysis. Benjamin/Cummings,
Redwood City, CA, 1992.
Appendix A
GLOSSARY
ε-greedy: A policy that chooses a random action with probability ε and the
greedy action otherwise, where ε denotes the exploration rate.
Agent: The decision maker or the learner.
Artificial Neural Network: A simplified model of brain cells (neurons) that are
massively interconnected, usually simulated in software on a computer.
Autonomous Learning: Learning by experiencing the task through trial and error
interactions with the environment.
Backpropagation: An optimization algorithm that is capable of training a multilayer
neural network by computing the partial derivatives in a backward-propagating
sweep through the network.
Environment: A simulation model with which the agent interacts to increase
its experience.
Exploitation: Choosing greedily the action that yields the highest value.
Exploration: Choosing actions that do not necessarily yield the highest value.
Least Mean Square: An algorithm, applicable to a two-layer network, that
optimizes the performance of the network by reducing the approximate
mean square error.
Markov Decision Process: One particular sequential decision model whereby the
current state summarizes everything important about the actions that
produced it.
Off-Policy: The policy used to generate behavior is unrelated to the policy that is
evaluated and improved.
On-Policy: This method evaluates and improves the same policy that it uses to make
decisions.
Policy: The decisionmaking function of the agent, telling it what action to
choose in each state.
Q-Learning: An off-policy temporal difference learning algorithm developed by
Watkins.
Reinforcement Learning: A learning mechanism that rewards good decisions and
punishes bad choices made by the agent; and the learning is autonomous.
Reward Function: The definition of the agent's goal that maps the state of the
environment to a single number.
SARSA: An onpolicy temporal difference learning algorithm.
TD(λ): A multi-step temporal difference learning algorithm whereby at each time
step, TD(λ) updates a prediction by exponentially decaying previous
predictions based on recency, where λ is the decay factor.
Temporal Difference Learning: A reinforcement learning method that updates a
prediction at time t to be a fraction closer to the prediction at time t+1.
Value Function: The function that specifies which action selection is the best in the
long run.
Appendix B
SAMPLE PROGRAM CODE
//////////////////////////////////////////////////////////////////////////////////
// main.cpp
//////////////////////////////////////////////////////////////////////////////////
#include <iostream.h>
#include <fstream.h>
#include "str.h"
#include "clock.h"
#include "simulator.h"
//////////////////////////////////////////////////////////////////////////////////
// Main program of reinforcement learning system
//////////////////////////////////////////////////////////////////////////////////
int main()
{
    ofstream fout;
    RL_Simulator_Class simulator;
    if (simulator.setup(cout, cin) == 1)
    {
        // if simulator setup is completed
        cout << endl << "Constructing RL system ..." << endl;
        simulator.construct();
        cout << "Done." << endl;
        clockClass clock;
        cout << endl << "Agent is learning ..." << endl;
        clock.start();
        simulator.learn(cout, fout);    // RL agent is learning
        clock.stop();
        cout << "Learning is done in " << clock << endl;
    }
    // allow user to train more games if data gathered is inadequate
    cout << endl << "Do you want to train more games (y, n)?" << endl << "> ";
    char response;
    long numOfGames;
    double explorationRate;
    cin >> response;
    while (response == 'y')
    {   // network weight, move selection, score stat files
        cout << "Please save weights, play, and score file" << endl << endl;
        cout << "How many additional games do you want to train?" << endl << "> ";
        cin >> numOfGames;
        simulator.setNumOfGames(numOfGames);
        // experiment with the effects of changing the exploration rate if wanted
        cout << "What exploration rate do you want to use?" << endl;
        cin >> explorationRate;
        simulator.setExplorationRate(explorationRate);
        clockClass clock;
        cout << endl << "Agent is learning ..." << endl;
        clock.start();
        simulator.learn(cout, fout);
        clock.stop();
        cout << "Learning is done in " << clock << endl;
        cout << endl << "Do you want to train more games (y, n)?" << endl << "> ";
        cin >> response;
    }
    return 0;
}
//////////////////////////////////////////////////////////////////////////////////
// simulator.h - interface file
//////////////////////////////////////////////////////////////////////////////////
#ifndef SIMULATOR
#define SIMULATOR
#include <iostream.h>
#include <fstream.h>
#include "str.h"
#include "rl.h"
//////////////////////////////////////////////////////////////////////////////////
// Provides a user interface to set up and construct the simulation framework
//////////////////////////////////////////////////////////////////////////////////
class RL_Simulator_Class
{
  private:
    RL_Framework_Class *agent;
    int tableOrNN;              // hash table or neural network
    int learning;               // RL learning algorithm selection
    int restore;                // restoring network weights or not
    double stepSize;            // step size of Q-learning
    double discount;            // discount rate of Q-learning
    double learningRate;        // learning rate of neural network
    double momentum;            // neural net's momentum
    double exploration;         // agent's exploration rate
    int policy;                 // type of behavior policy
    int difficulty;             // opponent's playing level
    int numOfHidden;            // number of neural net hidden nodes
    double rewardWin;           // reward for a win
    double rewardTie;           // reward for a tie
    double rewardLoss;          // reward for a loss
    double rewardNone;          // reward for intermediate moves
    long numOfGames;            // number of games for training
    stringClass scoreFile;
    int display;                // whether to display
  public:
    RL_Simulator_Class();
    ~RL_Simulator_Class();
    void construct();
    void displaySetup(ostream&);    // display system information
    void systemMenu(ostream&);
    void editMenu(ostream&);
    void editSystem(ostream&, istream&);
    int setup(ostream&, istream&);  // set up system
    void learn(ostream&, ofstream&);
    void printSetupToFile(ofstream&);
    void genScoreFileName();
    void setNumOfGames(long);
    void setExplorationRate(double);
};
#endif // SIMULATOR
//////////////////////////////////////////////////////////////////////////////////
// simulator.cpp - implementation file
//////////////////////////////////////////////////////////////////////////////////
#include <iomanip.h>    // for setw()
#include <strstrea.h>
#include <time.h>
#include "rlstd.h"
#include "simulator.h"
//////////////////////////////////////////////////////////////////////////////////
// constructor
//////////////////////////////////////////////////////////////////////////////////
RL_Simulator_Class::RL_Simulator_Class()
{
    display = 0;
    rewardWin = 1.0;
    rewardLoss = -1.0;
    rewardTie = 0.0;
    rewardNone = 0.0;
    restore = 0;                  // create new network, default
    tableOrNN = NEURAL_NET;       // hash table or neural network approximation
    learning = Q_LEARNING;        // Sarsa or Q-learning
    stepSize = 0.2;               // default step size for update equation (alpha)
    discount = 0.8;               // discount rate for update equation (gamma)
    learningRate = 0.1;           // neural network learning rate
    momentum = 0.9;               // neural network momentum
    exploration = 0.1;            // exploration rate
    policy = EPSILON_GREEDY;      // behavior policy of agent
    difficulty = MEDIUM;          // opponent's difficulty level, or intelligence
    numOfHidden = 20;             // number of hidden nodes for neural network
    numOfGames = 40000;           // number of trials or games for learning
}
//////////////////////////////////////////////////////////////////////////////////
// destructor
//////////////////////////////////////////////////////////////////////////////////
RL_Simulator_Class::~RL_Simulator_Class()
{
    delete agent;
}
//////////////////////////////////////////////////////////////////////////////////
// generate a file that stores the score based on the setup
//////////////////////////////////////////////////////////////////////////////////
void RL_Simulator_Class::genScoreFileName()
{
    char qOrSarsa, epsilonOrSoftmax, diff;
    switch (learning)
    {
        case Q_LEARNING : qOrSarsa = 'q'; break;
        case SARSA      : qOrSarsa = 's'; break;
    }
    switch (policy)
    {
        case EPSILON_GREEDY : epsilonOrSoftmax = 'e'; break;
        case SOFTMAX        : epsilonOrSoftmax = 's'; break;
    }
    switch (difficulty)
    {
        case EASY      : diff = 'e'; break;
        case MEDIUM    : diff = 'm'; break;
        case DIFFICULT : diff = 'd'; break;
    }
    char tmptime[20], tmpdate[20];
    /* Set time zone from TZ environment variable. If TZ is not set,
     * the operating system default is used, otherwise PST8PDT is used
     * (Pacific standard time, daylight savings).
     */
    _tzset();
    /* Get operating-system-style date and time. */
    _strtime( tmptime );
    _strdate( tmpdate );
    for (int i=0; i<20; i++)
    {
        if (tmptime[i] == ':')
            tmptime[i] = '.';
        if (tmpdate[i] == '/')
            tmpdate[i] = '.';
    }
    ostrstream *ostr = new ostrstream;
    if (tableOrNN == NEURAL_NET)
        *ostr << qOrSarsa << "n" << numOfHidden << epsilonOrSoftmax << exploration
              << "a" << stepSize << "g" << discount << "_l" << learningRate << "m"
              << momentum << diff << numOfGames << rewardWin << rewardTie << rewardLoss
              << "_" << tmpdate << tmptime << ".txt" << ends;
    else *ostr << qOrSarsa << "t" << epsilonOrSoftmax << exploration << "_a"
              << stepSize << "g" << discount << diff << numOfGames << rewardWin
              << rewardTie << rewardLoss << "_" << tmpdate << tmptime << ".txt" << ends;
    char *name = ostr->str();
    scoreFile = name;    // generate data file name based on the setup
    delete ostr;
    delete name;
}
//////////////////////////////////////////////////////////////////////////////////
// construct the simulator
//////////////////////////////////////////////////////////////////////////////////
void RL_Simulator_Class::construct()
{
    if (tableOrNN == NEURAL_NET)
        agent = new RL_Framework_Class(restore, stepSize, discount,
                        learningRate, momentum, exploration, policy, difficulty,
                        numOfHidden, rewardWin, rewardTie, rewardLoss, rewardNone,
                        display, tableOrNN);
    else if (tableOrNN == HASH_TABLE)
        agent = new RL_Framework_Class(stepSize, discount, exploration, policy,
                        difficulty, rewardWin, rewardTie, rewardLoss,
                        rewardNone, display);
}
//////////////////////////////////////////////////////////////////////////////////
// display the setup to screen
//////////////////////////////////////////////////////////////////////////////////
void RL_Simulator_Class::displaySetup(ostream& out)
{
    out << "Learning algorithm    ";
    if (learning == Q_LEARNING)
        out << "Q-learning" << endl;
    else if (learning == SARSA)
        out << "Sarsa" << endl;
    out << "Behavior Policy       ";
    if (policy == EPSILON_GREEDY)
        out << "epsilon-greedy" << endl;
    else out << "softmax" << endl;
    if (tableOrNN == NEURAL_NET)
    {
        out << "Neural Network        " << numOfHidden << " hidden nodes" << endl
            << "Restore weight        " << restore << endl
            << "Learning rate         " << learningRate << endl
            << "Momentum              " << momentum << endl;
    }
    else out << "Hash Table" << endl;
    out << "Step size             " << stepSize << endl
        << "Discount rate         " << discount << endl
        << "Exploration rate      " << exploration << endl
        << "Reward for win        " << rewardWin << endl
        << "Reward for tie        " << rewardTie << endl
        << "Reward for loss       " << rewardLoss << endl
        << "No reward             " << rewardNone << endl
        << "Opponent Difficulty   ";
    if (difficulty == EASY)
        out << "easy" << endl;
    else if (difficulty == MEDIUM)
        out << "medium" << endl;
    else out << "difficult" << endl;
    out << "Number of Games       " << numOfGames << endl;
}
//////////////////////////////////////////////////////////////////////////////////
// print system setup information into a file
//////////////////////////////////////////////////////////////////////////////////
void RL_Simulator_Class::printSetupToFile(ofstream& fout)
{
    fout << "Learning algorithm    ";
    if (learning == Q_LEARNING)
        fout << "Q-learning" << endl;
    else if (learning == SARSA)
        fout << "Sarsa" << endl;
    fout << "Behavior Policy       ";
    if (policy == EPSILON_GREEDY)
        fout << "epsilon-greedy" << endl;
    else fout << "softmax" << endl;
    if (tableOrNN == NEURAL_NET)
    {
        fout << "Neural Network        " << numOfHidden << " hidden nodes" << endl
             << "Restore weight        " << restore << endl
             << "Learning rate         " << learningRate << endl
             << "Momentum              " << momentum << endl;
    }
    else fout << "Hash Table" << endl;
    fout << "Step size             " << stepSize << endl
         << "Discount rate         " << discount << endl
         << "Exploration rate      " << exploration << endl
         << "Reward for win        " << rewardWin << endl
         << "Reward for tie        " << rewardTie << endl
         << "Reward for loss       " << rewardLoss << endl
         << "No reward             " << rewardNone << endl
         << "Opponent Difficulty   ";
    if (difficulty == EASY)
        fout << "easy" << endl;
    else if (difficulty == MEDIUM)
        fout << "medium" << endl;
    else fout << "difficult" << endl;
    fout << "Number of Games       " << numOfGames << endl << endl;
}
//////////////////////////////////////////////////////////////////////////////////
// print the choice of menus
// - e.g. type "q 0" to use Sarsa as the learning algorithm
//////////////////////////////////////////////////////////////////////////////////
void RL_Simulator_Class::systemMenu(ostream& out)
{
    out << "c    setup complete" << endl
        << "d    display system setup" << endl
        << "e    edit system" << endl
        << "l    load system from file" << endl
        << "x    exit or abort" << endl << endl;
}
//////////////////////////////////////////////////////////////////////////////////
// print the setup options
//////////////////////////////////////////////////////////////////////////////////
void RL_Simulator_Class::editMenu(ostream& out)
{
    out << endl
        << "q    Learning algorithm (0 - Sarsa, 1 - Q-learning)" << endl
        << "b    Behavior policy (0 - Softmax, 1 - Epsilon-greedy)" << endl
        << "t    Method to set value (0 - Hash Table, 1 - Neural Network)" << endl
        << "d    Update equation discount (0 <= d <= 1)" << endl
        << "s    Update equation step size (0 <= s <= 1)" << endl
        << "l    Neural network learning rate (0 <= l <= 1)" << endl
        << "m    Neural network momentum (0 <= m <= 1)" << endl
        << "e    Exploration rate (0 <= e <= 1)" << endl
        << "r    Restore (1 - true, 0 - false)" << endl
        << "o    Opponent Difficulty" << endl
        << "+    Reward for a win" << endl
        << "=    Reward for a tie" << endl
        << "-    Reward for a loss" << endl
        << "0    No reward" << endl
        << "g    Number of games (g > 0)" << endl
        << "n    Score file name" << endl;
    if (tableOrNN == NEURAL_NET)
    {
        out << "h    Number of hidden nodes" << endl;
    }
    out << "x    exit edit menu" << endl << endl
        << "Choose the option followed by the value (e.g. > d 0.25)"
        << endl << endl << endl;
}
//////////////////////////////////////////////////////////////////////////////////
// modify the setups provided by the user
//////////////////////////////////////////////////////////////////////////////////
void RL_Simulator_Class::editSystem(ostream& out, istream& in)
{
    out << "Type ? for edit help" << endl << endl
        << "> ";
    char choice;
    in >> choice;
    while (choice != 'x')    // continue until exit option is chosen
    {
        switch (choice)
        {
            case '?' : editMenu(out); break;
            case 'q' : in >> learning; break;
            case 'b' : in >> policy; break;
            case 't' : in >> tableOrNN; break;
            case 'r' : in >> restore; break;
            case 'd' : in >> discount; break;
            case 's' : in >> stepSize; break;
            case 'l' : in >> learningRate; break;
            case 'm' : in >> momentum; break;
            case 'e' : in >> exploration; break;
            case 'o' : in >> difficulty; break;
            case '+' : in >> rewardWin; break;
            case '0' : in >> rewardNone; break;
            case '-' : in >> rewardLoss; break;
            case '=' : in >> rewardTie; break;
            case 'g' : in >> numOfGames; break;
            case 'n' : in >> scoreFile; break;
            case 'h' : if (tableOrNN == NEURAL_NET)
                           in >> numOfHidden;
                       break;
            default  : out << "Invalid edit option" << endl;
        }
        out << "> ";
        in >> choice;
    }
}
//////////////////////////////////////////////////////////////////////////////////
// setup the simulation
//////////////////////////////////////////////////////////////////////////////////
int RL_Simulator_Class::setup(ostream& out, istream& in)
{
    out << "Reinforcement learning system" << endl << endl;
    systemMenu(out);
    out << "Type ? for help" << endl << endl
        << "> ";
    char choice;
    in >> choice;
    while (choice != 'x' && choice != 'c')
    {
        switch (choice)
        {
            case 'c' : break;
            case '?' : systemMenu(out); break;      // display system menu
            case 'd' : displaySetup(out); break;    // display current system setup
            case 'e' : editSystem(out, in); break;
            default  : out << "Invalid option" << endl;
        }
        out << "> ";
        in >> choice;
    }
    if (choice == 'x')
        return 0;
    else return 1;
}
//////////////////////////////////////////////////////////////////////////////////
// The selected RL learning algorithm is called
//////////////////////////////////////////////////////////////////////////////////
void RL_Simulator_Class::learn(ostream& out, ofstream& fout)
{
    genScoreFileName();
    fout.open(scoreFile, ios::out);
    fout.flags(ios::left | ios::fixed | ios::floatfield | ios::showpoint);
    // output system setup into a file
    RL_Simulator_Class::printSetupToFile(fout);
    fout << setw(10) << "Win" << setw(10) << "Tie" << setw(10) << "Loss" << endl;
    if (tableOrNN == NEURAL_NET)
    {
        if (learning == Q_LEARNING)
            agent->Approx_Q_Learning(numOfGames, fout, out);
        else if (learning == SARSA)
            agent->Approx_Sarsa_Learning(numOfGames, fout, out);
    }
    else
    {
        if (learning == Q_LEARNING)
            agent->Table_Q_Learning(numOfGames, fout, out);
        else if (learning == SARSA)
            agent->Table_Sarsa_Learning(numOfGames, fout, out);
    }
    fout.close();
}
//////////////////////////////////////////////////////////////////////////////////
// set the number of games for additional training
//////////////////////////////////////////////////////////////////////////////////
void RL_Simulator_Class::setNumOfGames(long games)
{
    numOfGames = games;
}
//////////////////////////////////////////////////////////////////////////////////
// set exploration rate, only called when we want to train additional games
//////////////////////////////////////////////////////////////////////////////////
void RL_Simulator_Class::setExplorationRate(double exploration)
{
    agent->setExploration(exploration);
}
//////////////////////////////////////////////////////////////////////////////////
// rl.h - RL class interface file
//////////////////////////////////////////////////////////////////////////////////
#ifndef RL
#define RL
#include <fstream.h>
#include <iostream.h>
#include "str.h"
#include "environment.h"
#include "experience.h"
#include "policy.h"
//////////////////////////////////////////////////////////////////////////////////
// Reinforcement Learning Framework Class
//
// The reinforcement learning framework is coded here, whereby the agent
// interacts with its environment and learns by receiving feedback from
// the environment on whether its decisions are good or bad.
//////////////////////////////////////////////////////////////////////////////////
class RL_Framework_Class
{
  private:
    RL_Policy_Class      *policy;
    RL_Environment_Class *envir;
    RL_Qvalue_Class      *Q;
    stringClass          curState;
    stringClass          nextState;
    double               alpha;     // learning rate
    double               gamma;     // discount rate
    BOOL                 net;       // flag - neural net or hash table
    ofstream             gameFile;  // file output of learning statistics
    ofstream             playFile;  // play selection file output
    void initialize();
  public:
    RL_Framework_Class(int, double, double, double, double,
        double, int, int, int, double, double, double, double, int, int);
    RL_Framework_Class(double, double, double, int, int, double,
        double, double, double, int);
    ~RL_Framework_Class()
    {
        delete policy;
        delete Q;
        gameFile.close();
        playFile.close();
    }
    void Approx_Sarsa_Learning(long int, ofstream&, ostream&);
    void Approx_Q_Learning(long int, ofstream&, ostream&);
    void Table_Q_Learning(long int, ofstream&, ostream&);
    void Table_Sarsa_Learning(long int, ofstream&, ostream&);
    void printState();
    void approx_GreedyPlay(ostream&);
    void table_GreedyPlay(ostream&);
    void setExploration(double);
    double get_alpha() { return alpha; }
    double get_gamma() { return gamma; }
};
#endif // RL
//////////////////////////////////////////////////////////////////////////////////
// rl.cpp - implementation file for class reinforcementLearningClass
//////////////////////////////////////////////////////////////////////////////////
#include <iostream.h>
#include <fstream.h>
#include <string.h>
#include <iomanip.h>
#include <math.h>      // for exponential function
#include "random.h"    // random number generator to choose who starts the game
#include "rl.h"
#include "rlstd.h"
//////////////////////////////////////////////////////////////////////////////////
// allow change of the exploration rate when user wants to train more games
//////////////////////////////////////////////////////////////////////////////////
void RL_Framework_Class::setExploration(double rate)
{
    policy->setExploration(rate);
}
//////////////////////////////////////////////////////////////////////////////////
// Q-learning constructor, construct learning system (hash table version)
//////////////////////////////////////////////////////////////////////////////////
RL_Framework_Class::RL_Framework_Class(double stepSize, double discount,
        double exploration, int bPolicy, int difficulty, double
        rewardWin, double rewardTie, double rewardLoss,
        double rewardNone, int display)
{
    gameFile.open("state.txt", ios::out);
    gameFile.flags(ios::left | ios::fixed | ios::floatfield | ios::showpoint);
    playFile.open("play.txt", ios::out);
    playFile.flags(ios::left | ios::fixed | ios::floatfield | ios::showpoint);
    playFile << setw(10) << "Win" << setw(10) << "Draw" << setw(10)
             << "Loss" << setw(10) << "MSE" << endl;
    alpha = stepSize;
    gamma = discount;
    net = FALSE;    // means it is using the Q-table
    if (bPolicy == EPSILON_GREEDY)
        policy = new RL_Epsilon_Greedy_Class(exploration);
    else policy = new RL_Softmax_Class(exploration);
    // reward for tie is used to initialize the Q-table
    Q = new RL_Hashing_Class(rewardTie, display);
    envir = new RL_Connect3_Class(difficulty, rewardWin, rewardTie,
                                  rewardLoss, rewardNone);
}
//////////////////////////////////////////////////////////////////////////////////
// Q-learning constructor, construct learning system (neural network version)
//////////////////////////////////////////////////////////////////////////////////
RL_Framework_Class::RL_Framework_Class(int restore, double stepSize,
        double discount, double learningRate, double momentum,
        double exploration, int bPolicy, int difficulty,
        int numOfHidden, double rewardWin, double rewardTie,
        double rewardLoss, double rewardNone, int display, int tableOrNN)
{
    gameFile.open("state1.txt", ios::out);
    gameFile.flags(ios::left | ios::fixed | ios::floatfield | ios::showpoint);
    playFile.open("play.txt", ios::out);
    playFile.flags(ios::left | ios::fixed | ios::floatfield | ios::showpoint);
    playFile << setw(10) << "Win" << setw(10) << "Draw" << setw(10)
             << "Loss" << setw(10) << "MSE" << endl;
    alpha = stepSize;
    gamma = discount;
    net = TRUE;    // is using neural network
    if (bPolicy == EPSILON_GREEDY)
        policy = new RL_Epsilon_Greedy_Class(exploration);
    else policy = new RL_Softmax_Class(exploration);
    if (tableOrNN == NEURAL_NET)
        Q = new RL_Neural_Network_Class(restore, learningRate, momentum,
                    numOfHidden, curState, 0.5, rewardNone, display);
    envir = new RL_Connect3_Class(difficulty, rewardWin,
                    rewardTie, rewardLoss, rewardNone);
}
//////////////////////////////////////////////////////////////////////////////////
// initialize state to the start of a game, empty board
//////////////////////////////////////////////////////////////////////////////////
void RL_Framework_Class::initialize()
{
    curState  = "         ";    // blank board, one space per square
    nextState = "         ";
}
//////////////////////////////////////////////////////////////////////////////////
// DESCRIPTION: Off-policy Q-learning where the Q-value is approximated using
//              a neural network. Value function approximation is in
//              accordance with Sebastian Thrun's method.
//
// PSEUDOCODE:
//   Initialize Q(s, a) arbitrarily
//   Repeat (for each trial):
//     Initialize s, a
//     Repeat (for each step of trial):
//       Choose a from s using policy derived from Q (e.g., e-greedy)
//       Take action a, observe r, s'
//       Q(s,a) := gamma * max Q(s',a')   (or r if s' is terminal)
//       s := s'
//     until s is terminal
//////////////////////////////////////////////////////////////////////////////////
void RL_Framework_Class::Approx_Q_Learning(long int numOfTrial, ofstream& fout,
                                           ostream& out)
{
    stringClass action;
    validActionClass move;
    double reward, Qvalue;
    double x_percent = 0.01, process;
    randomNumClass random;
    BOOL print = FALSE;
    for (int i=0; i<numOfTrial; i++)
    {
        initialize();                          // reset state, or empty board
        action = "";                           // reset action to blank
        if (random.randomInteger(1,2) == 1)    // opponent to start
            curState = envir->chooseAction(curState, action);
        if (i % 1000 == 0)
        {   // print the state transitions once every 1000 games
            print = TRUE;
            gameFile << "******* Game " << i+1 << " *******" << endl;
        }
        do
        {   // **** Approx method *****
            move.generate(curState);
            action = policy->choose(*Q, move, curState);
            if (print)
                printState();
            nextState = envir->chooseAction(curState, action);
            reward = envir->feedback(nextState);
            // Sebastian's method: back up the discounted value of the
            // successor state, or the final reward at game's end
            if (envir->gameOver(nextState) == FALSE)
                Qvalue = gamma * Q->max(nextState);
            else Qvalue = reward;
            // end of Sebastian's method
            Q->setValue(curState, action, Qvalue);
            if (print)
                printState();
            curState = nextState;
        }
        while (envir->gameOver(curState) == FALSE);
        if (print)
            printState();
        print = FALSE;
        envir->keepScore(curState, fout);
        process = (double)i / (double)numOfTrial;
        if (process >= x_percent)    // indication of % learning completed
        {
            out << int(process * 100) << "%" << endl;
            x_percent += 0.01;
        }
        if (i % 1000 == 0)    // test agent's knowledge every 1000 games
            approx_GreedyPlay(out);
    }
    out << "100%" << endl;
}
//////////////////////////////////////////////////////////////////////////////////
// DESCRIPTION: Off-policy Q-learning. The Q-value is stored in a hash table.
//
// PSEUDOCODE:
//   Initialize Q(s, a) arbitrarily
//   Repeat (for each trial):
//     Initialize s to start state
//     Repeat (for each step of trial):
//       Choose a from s using policy derived from Q (e.g., e-greedy)
//       Take action a, observe r, s'
//       Q(s,a) := Q(s,a) + alpha[r + gamma * max Q(s',a') - Q(s,a)]
//       s := s'
//     until s is terminal
//