DEVELOPMENT OF INTERATOMIC POTENTIALS BASED ON
AB INITIO METHODS AND NEURAL NETWORKS FOR
MOLECULAR DYNAMICS SIMULATIONS
By
MILIND MALSHE
Master of Science in Mechanical Engineering
Oklahoma State University
Stillwater, Oklahoma
2004
Submitted to the Faculty of the
Graduate College of the
Oklahoma State University
in partial fulfillment of
the requirements for
the Degree of
DOCTOR OF PHILOSOPHY
December, 2009
DEVELOPMENT OF INTERATOMIC POTENTIALS BASED ON
AB INITIO METHODS AND NEURAL NETWORKS FOR
MOLECULAR DYNAMICS SIMULATIONS
Dissertation Approved:
Dr. R. Komanduri
Dissertation Adviser
Dr. L. M. Raff
Dr. M. Hagan
Dr. R. D. Delahoussaye
Dr. S. Bukkapatnam
Dr. A. Gordon Emslie
Dean of the Graduate College
ACKNOWLEDGEMENTS
I would like to express my gratitude to my parents and brother for their support,
encouragement, love and confidence in me.
I would like to thank my advisor, Dr. R. Komanduri, for his inspiration, technical
guidance, and financial support. I wish to express my sincere appreciation to Dr. L. M.
Raff for introducing me to the concept of Molecular Dynamics and Monte Carlo
simulations and for providing guidance and encouragement throughout the study. I
would like to thank Dr. M. Hagan for introducing me to the field of Neural Networks and
for his invaluable guidance and encouragement. I would like to thank Dr. P. M. Agrawal
for his guidance in understanding various aspects of simulation techniques. This project
is a result of coordinated effort linking simulation methods and neural networks. I would
also like to thank Dr. S. Roy and Dr. H. Lu for their suggestions in this study. I’m grateful
to Dr. R. Komanduri and my committee members for helping me to explore and develop
new methods for improving simulation techniques.
This project was funded by a grant (DMI-0457663) from the National Science Foundation's
Division of Civil, Mechanical, and Manufacturing Innovation (CMMI). I would like to thank Dr. G.
Hazelrigg and Dr. J. Harrison, Program Directors for Materials Processing and Manufacturing,
for their interest in and support of this work.
Table of Contents
1 Introduction ................................................................................................................. 1
2 Literature Review ........................................................................................................ 5
2.1 Literature review of the development of interatomic potentials ...................... 5
3 Molecular Dynamics Simulation ................................................................................ 15
3.1 Brief history of molecular dynamics simulation ............................................... 15
3.2 Applicability of MD simulations ........................................................................ 16
3.3 Formulation of the differential equations of motion ....................................... 18
3.4 Time integration algorithms ............................................................................. 21
3.4.1 Verlet algorithm ............................................................................................ 21
3.4.2 Leap-frog Verlet algorithm............................................................................ 22
3.4.3 Velocity Verlet algorithm .............................................................................. 23
3.4.4 Beeman algorithm ........................................................................................ 23
4 Interatomic potentials ............................................................................................... 25
4.1 Design of interatomic potentials ...................................................................... 26
4.2 Types of potential energy functions ................................................................. 27
4.2.1 Pair potentials ............................................................................................... 29
4.2.2 Many-body potentials ................................................................................... 30
5 Ab initio methods ...................................................................................................... 36
5.1 Born-Oppenheimer approximation .................................................................. 37
5.2 Basis sets ........................................................................................................... 37
5.2.1 Slater type orbitals (STO) .............................................................................. 38
5.2.2 Gaussian type orbitals (GTO) ........................................................................ 38
5.3 Hartree-Fock method ........................................................................................ 40
5.4 Electron correlation methods ........................................................................... 43
5.4.1 Møller-Plesset Perturbation theory (MP) ..................................................... 44
5.4.2 Density Functional Theory (DFT) ................................................................... 45
6 Neural Networks ........................................................................................................ 48
6.1 Biological neurons ............................................................................................. 48
6.2 History of multilayer neural networks .............................................................. 52
6.3 Neural networks versus conventional computers ............................................ 53
6.4 Model of a neuron ............................................................................................ 55
6.5 Multilayer network ........................................................................................... 55
6.6 Transfer functions ............................................................................................. 56
6.7 Neural network training .................................................................................... 57
6.7.1 Supervised training and parameter optimization ......................................... 59
6.8 Least Mean Squared Error (LMS) algorithm ..................................................... 60
6.9 Backpropagation algorithm .............................................................................. 62
6.10 Numerical optimization techniques .................................................................. 66
6.10.1 Conjugate gradient methods .................................................................... 66
6.10.2 Newton and quasi-Newton methods ....................................................... 67
6.10.3 Levenberg-Marquardt algorithm .............................................................. 68
6.11 Bias/variance dilemma...................................................................................... 68
6.11.1 Neural network growing and neural network pruning ............................. 70
6.12 Regularization and early stopping .................................................................... 70
6.13 Early stopping.................................................................................................... 71
6.14 Normalizing the data ........................................................................................ 72
6.15 Neural network derivatives .............................................................................. 73
6.15.1 First derivative with respect to weights ................................................... 73
6.15.2 Derivatives of the transfer function.......................................................... 74
6.15.3 First derivative with respect to inputs ...................................................... 75
6.16 Novelty detection ............................................................................................. 76
6.16.1 Minimum distance computation .............................................................. 78
7 Problem statement .................................................................................................... 82
8 Parameterization of Tersoff potential function using genetic algorithm (GA) ......... 84
8.1 Approximation of Tersoff potential .................................................................. 93
8.1.1 Neural network training for mean squared error (MSE) .............................. 93
8.1.2 GA search for optimal Tersoff parameters ................................................... 94
8.2 Approximation of ab initio energy using Tersoff functional form .................... 98
9 Study of dissociation dynamics of vibrationally excited vinyl bromide .................. 105
9.1 Novelty sampling procedure ........................................................................... 108
9.2 Improving the neural network accuracy using novelty sampling ................... 111
9.3 Dissociation dynamics of vinyl bromide ......................................................... 117
9.4 Procedure for the development of potential energy surfaces based on ab initio
methods using neural networks and novelty sampling .............................................. 120
10 Parameterization of interatomic potential functions using neural networks .... 124
10.1 Parameterization of modified Tersoff potential functional form ................... 128
11 Development of Generalized Potential Energy Surfaces (GPES) ........................ 135
11.1 Fitting ab initio energies for 5-atom silicon clusters using GPES .................. 142
11.2 Fitting ab initio energies for Si_n (n = 3, 4, 5) clusters using GPES .................... 144
11.3 Fitting ab initio energies for Si_n (n = 3, 4, 5, 6, 7) clusters using GPES............. 146
11.4 GPES development for vinyl bromide ............................................................. 149
12 Conclusions ......................................................................................................... 153
13 References .......................................................................................................... 156
Chapter 1
1 Introduction
Computer simulation techniques play an important role in the study of various
chemical, physical, biological and mechanical processes at the atomistic level. By
comparing the results of the simulations with experimental observations, theoretical
models used in the simulations can be corrected and validated. Once validated, the
simulations can be used to study the system under different conditions for which
experimental results are not easily available. By analyzing the results from the
simulations, experiments can then be performed under specific conditions to achieve
the desired results. Thus computer simulations can complement both theoretical and
experimental approaches. In conjunction with the developments in computer
technology, the use of simulations has greatly extended the range of problems that can
be studied. In chemistry, simulations are performed to study reaction dynamics. In
engineering, simulations are used to study the behavior of a material under varying
conditions in various processes, such as machining, indentation, tribology and melting.
Such simulations can provide useful insight into important mechanisms, such as phase
transformation and dislocation dynamics, which are otherwise not easily understood
using other techniques.
The most important part of MD/MC simulations is the development of potential
energy surfaces, which describe the interactions between the atoms within a system. A
potential should provide a description of the interatomic interactions realistic enough to
match experimental observations. The usual approach for the development of potentials
is to choose a functional form motivated by physical intuition and then adjust its
parameters to fit ab initio data and/or physical properties, such as the lattice energy and
melting point, to arrive at an empirical potential. Although such empirical potentials
provide a simple and physically interpretable description of the interatomic interactions,
their applicability is limited to the type of data to which they were fitted. Once fitted,
there is no easy way to improve a potential without refitting the data. A solution to this
problem is to model the system using ab initio electronic structure calculations, i.e., by
solving the Schrödinger equation and computing the potential energy and forces from
first principles. Such techniques can provide a very accurate description of the
interatomic interactions, but they are computationally very expensive and hence limited
to small systems involving only a few atoms.
Genetic algorithms (GA) and other nonlinear function-approximation routines
based on a stochastic search require the specification and evaluation of an objective or
fitness function to determine the quality of fit. Problems involving repeated evaluation
of the objective function, such as the root mean squared error (RMSE) in fitting a
complicated function, become computationally time consuming. This study presents an
approach that uses the function-approximation capability of neural networks (NN) to
compute the objective function. The method is applied to the determination of the
parameters of an interatomic potential for silicon using the Tersoff functional form.
In the second part of the investigation, the reaction dynamics of vibrationally
excited vinyl bromide were investigated using a neural network potential surface fitted
to ab initio energies obtained from electronic structure calculations. Dissociation rate
coefficients and branching ratios for the open reaction channels were computed at an
internal energy of 6.44 eV. The potential energy hypersurface was developed by fitting
the ab initio energies with a multilayer neural network (NN). The use of NNs for
developing potential energy hypersurfaces obviates the need to assume a fixed
underlying functional form and thus provides greater flexibility.
Empirical potential surfaces are frequently employed to represent the force
fields present in the system under investigation. In most cases, the functional forms
present in these potentials are selected on the basis of chemical and physical intuition.
The parameters of the surface are frequently adjusted to fit a very small set of
experimental data that comprise bond energies, equilibrium bond distances and angles,
fundamental vibrational frequencies, and perhaps measured barrier heights to reactions
of interest. Such potentials generally yield only qualitative or semi-quantitative
descriptions of the system dynamics.
Neural networks (NN) provide a powerful method for interpolating between the
energies of an ensemble of points in a database. In this investigation, a generalized NN
fitting procedure is developed to fit the parameters of the empirical potential, along with
the weight and bias matrices of the network, to a database of ab initio energies. The
potential parameters can be made functions of the atomic structure of the system, with
the parameters defining those functions fitted using a NN.
The potential energy can be expanded in terms of two-body, three-body, and
many-body interaction terms. In the last part of the investigation, a general method for
the development of potential-energy surfaces is presented. The Generalized Potential
Energy Surface (GPES) combines a many-body expansion of the potential-energy
surface with multilayer NNs. The summations run over the two-, three-, ..., M-body
atomic clusters rather than over the 2-, 3-, ..., m-dimensional component functions.
Each of the M-body terms in the expansion is represented by a neural network.
The bottleneck presented by the combinatorial problem is significantly reduced by
summing over atomic clusters. The problem is simplified further by extending the
concept of bond energy to a moiety energy (ME). The prime advantage of using a neural
network is that it avoids the assumption of a specific functional form, so the accuracy
of the fit is not restricted by an underlying functional form. The network parameters are
adjusted using a Jacobian-based method such as the Levenberg-Marquardt (LM) algorithm.
Representing each of the M-body terms with a neural network provides the flexibility
needed to accurately fit a complex potential surface, while the ME concept reduces
the number of neural networks required.
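The many-body expansion just described can be sketched in code. The following Python toy is purely illustrative and is not the author's implementation: one small feed-forward network per body order (here M = 3), with random placeholder weights standing in for fitted values, summed over all 2- and 3-atom clusters of a configuration.

```python
import itertools
import numpy as np

# Illustrative sketch of a many-body expansion with one neural network
# per body order. Weights are random placeholders, not fitted values.
rng = np.random.default_rng(0)

def make_net(n_in, n_hidden=5):
    """Tiny one-hidden-layer network mapping cluster distances to an energy."""
    W1 = rng.normal(size=(n_hidden, n_in))
    b1 = rng.normal(size=n_hidden)
    W2 = rng.normal(size=n_hidden)
    b2 = rng.normal()
    return lambda x: float(W2 @ np.tanh(W1 @ x + b1) + b2)

# One network per body order; inputs are the m*(m-1)/2 pairwise distances.
nets = {m: make_net(m * (m - 1) // 2) for m in (2, 3)}

def gpes_energy(positions):
    """Sum network outputs over all 2- and 3-atom clusters."""
    positions = np.asarray(positions, dtype=float)
    total = 0.0
    for m, net in nets.items():
        for cluster in itertools.combinations(range(len(positions)), m):
            dists = [np.linalg.norm(positions[i] - positions[j])
                     for i, j in itertools.combinations(cluster, 2)]
            total += net(np.array(dists))
    return total
```

Because the inputs are interatomic distances, the sketch is automatically invariant to rigid translations of the whole configuration, one of the properties a potential-energy surface must have.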
Chapter 2
2 Literature Review
The most important part of MD/MC simulations is the development of potential
energy surfaces that model the system under consideration sufficiently closely. In
recent years, many empirical potentials for silicon, such as the Stillinger-Weber,
Tersoff, Bolding-Anderson, and Brenner potentials, have been developed and applied to
a number of different systems.
2.1 Literature review of the development of interatomic potentials
Existing models for interatomic potentials differ in their degree of sophistication,
functional form, fitting strategy, and range of applicability, with each designed for a
specific class of problems. Not only are surfaces and small clusters difficult to
model (Balamane et al., 1992; Kaxiras, 1996); even the bulk material, including the
crystalline and amorphous phases, solid defects, and the liquid phase, has not received a
transferable description from any single potential. The usual approach for developing an
empirical potential is to choose a parameterized functional form motivated by physical
intuition and then find a set of parameters by fitting to ab initio data, experimental
observations, or both, for various atomic structures. Covalent materials present a
difficult challenge because complex quantum mechanical effects, such as
chemical bond formation, hybridization, metallization, charge transfer, and bond
bending, must be described by an effective interaction between atoms. In spite of a wide
range of functional forms and fitting strategies, most of these models have been
successful in the regime for which they were parameterized but have shown a lack of
transferability. Balamane and Halicioglu (1992) performed a comparative study of six
classical many-body potentials: the Pearson, Takai, Halicioglu, and Tiller potential; the
Biswas and Hamann potential; the Stillinger and Weber potential; the Dodson potential;
the Tersoff potential (T2); and the modified Tersoff potential (T3). They concluded that
each has its own strengths and limitations, none appearing clearly superior to the others,
and none being fully transferable.
The Stillinger-Weber potential (1985) was parameterized to experimental
properties of solid and liquid silicon, such as the lattice energy and melting point. The
model has the form of a third-order cluster potential (Carlsson, 1990) in which the
potential energy is expressed as a linear combination of two- and three-body terms:
$$E = \sum_{i<j} v_2(i,j) \;+\; \sum_{i<j<k} v_3(i,j,k) \qquad (2.1)$$
where v2(i,j) represents two-body interactions and v3(i,j,k) represents three-body
interactions. The two-body interactions are expressed as :
$$v_2(r_{ij}) = \varepsilon\, f_2(r_{ij}/\alpha), \qquad
f_2(r) = \begin{cases} A\,(B r^{-p} - r^{-q})\,\exp\!\big[(r-a)^{-1}\big], & r < a \\ 0, & r \ge a \end{cases} \qquad (2.2)$$
where ε has energy units and α has length units.
The three-body interaction is expressed as a separable product of radial functions g(r)
and angular functions h(θ). Thus:
$$v_3(i,j,k) = g(r_{ij})\, g(r_{ik}) \left(\cos\theta_{ijk} + \tfrac{1}{3}\right)^{2} \qquad (2.3)$$
where θijk is the bond angle formed by the bonds ij and ik, and g(r) is a decaying
function. The angular function has a minimum at θ = 109.5°, the tetrahedral angle,
representing the angular preference for sp3 bonding. This potential gives a fairly
realistic description of crystalline silicon, point defects, and the liquid and amorphous
phases, but it provides an inadequate description of under- or over-coordinated structures.
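As a concrete illustration of Eqs. (2.1)-(2.3), the Stillinger-Weber two- and three-body terms can be coded directly. The sketch below works in reduced units; the parameter values are the commonly quoted Stillinger-Weber silicon set and should be verified against the original 1985 paper before any serious use.

```python
import math

# Stillinger-Weber silicon parameters (reduced units); values as commonly
# quoted in the literature -- verify against Stillinger & Weber (1985).
A, B, p, q, a = 7.049556277, 0.6022245584, 4.0, 0.0, 1.80
lam, gam = 21.0, 1.20

def f2(r):
    """Two-body term of Eq. (2.2); zero at and beyond the cutoff a."""
    if r >= a:
        return 0.0
    return A * (B * r**(-p) - r**(-q)) * math.exp(1.0 / (r - a))

def h(r_ij, r_ik, cos_theta):
    """Three-body term of Eq. (2.3) for the angle at atom i.
    Writing g(r) = sqrt(lam) * exp(gam / (r - a)) recovers the
    separable g(r_ij) g(r_ik) form used in the text."""
    if r_ij >= a or r_ik >= a:
        return 0.0
    return (lam * math.exp(gam / (r_ij - a) + gam / (r_ik - a))
            * (cos_theta + 1.0 / 3.0) ** 2)
```

Note that h vanishes exactly at cos θ = −1/3, i.e. at the tetrahedral angle of 109.5°, which is how the sp3 angular preference enters.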
Mistriotis et al. (1989) proposed an improvement over the Stillinger-Weber potential that
includes four-body interactions as well. It gives good agreement with the
melting point of the crystal and with the geometries and energies of the ground state
and low-lying metastable states of silicon clusters.
Tersoff (1988; 1989) proposed an approach that couples two-body and multi-body
correlations into the model. The potential is based on the bond order, which takes the
local environment into account. The bond strength depends on the geometry: the more
neighbors an atom has, the weaker the bond to each neighbor. The energy is a sum of
repulsive and attractive interactions that depend on the local bonding
environment. Thus:
$$E = \frac{1}{2}\sum_{i \ne j} f_C(r_{ij})\left[\, f_R(r_{ij}) + b_{ij}\, f_A(r_{ij}) \,\right],
\qquad \zeta_{ij} = \sum_{k \ne i,\, j} v_3(i,j,k) \qquad (2.4)$$
The fR and fA terms are repulsive and attractive pair potentials, respectively, and fC is a
cutoff function. The main feature of this form is the term bij: the strength of each bond
depends on the local environment and is lowered when the number of neighbors is
relatively high. The term ζij defines the effective coordination number of an atom,
taking into account the relative distance between two neighbors (rij − rik)
and the bond angle θijk.
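The bond-order mechanism of Eq. (2.4) can be made concrete with a short sketch. The code below evaluates the energy contribution of a single i-j bond; the parameter values resemble a commonly quoted Tersoff silicon set but are included only for concreteness and should be checked against the 1988/1989 papers before use.

```python
import math

# Illustrative Tersoff-form evaluation for one bond (Eq. 2.4).
# Parameter values resemble a published silicon set; verify before use.
A, B = 1830.8, 471.18          # repulsive/attractive prefactors (eV)
lam1, lam2 = 2.4799, 1.7322    # pair-term decay constants (1/Angstrom)
beta, n = 1.1e-6, 0.78734      # bond-order parameters
c, d, hh = 1.0039e5, 16.217, -0.59825
R, D = 2.85, 0.15              # cutoff centre and half-width (Angstrom)

def f_C(r):
    """Smooth cutoff: 1 inside, 0 outside, cosine-like taper between."""
    if r < R - D:
        return 1.0
    if r > R + D:
        return 0.0
    return 0.5 - 0.5 * math.sin(0.5 * math.pi * (r - R) / D)

def g(cos_theta):
    """Angular function entering the effective coordination zeta_ij."""
    return 1.0 + c**2 / d**2 - c**2 / (d**2 + (hh - cos_theta) ** 2)

def bond_energy(r_ij, neighbours):
    """Energy of bond i-j; `neighbours` holds (r_ik, cos(theta_ijk)) pairs."""
    zeta = sum(f_C(r_ik) * g(ct) for r_ik, ct in neighbours)
    b_ij = (1.0 + (beta * zeta) ** n) ** (-1.0 / (2.0 * n))
    f_R = A * math.exp(-lam1 * r_ij)
    f_A = -B * math.exp(-lam2 * r_ij)
    return f_C(r_ij) * (f_R + b_ij * f_A)
```

Adding neighbours raises ζij, lowers bij, and so weakens the attractive part of the bond, which is exactly the behaviour described in the text.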
The Bolding-Anderson potential (1990) is a generalized form of the Tersoff potential in
which the interaction between a pair of atoms depends on the environment. They
suggested that clusters of four or more atoms have local minima on the potential
energy surface corresponding to multiple configurations that involve atoms with high
coordination numbers and strained bond angles. The increase in potential energy caused
by the strained bond angles is offset by the large number of weak bonds formed. Their
potential incorporates an angle dependence that differs between clusters and crystals,
introduced through interference functions.
Bazant and Kaxiras (1997) developed an environment-dependent interatomic
potential (EDIP) for bulk silicon. The interaction model includes two-body and three-body
terms that depend on the local atomic environment through an effective coordination
number:
$$V_2(r, Z) = \Phi_R(r) + p(Z)\, \Phi_A(r), \qquad (2.5)$$
where ΦR(r) represents the short-range repulsion between atoms, ΦA(r) represents the
attractive force of bond formation, and p(Z) is the bond order, which determines the
strength of the attraction as a function of the atomic environment, measured by the
coordination number Z. The interpretation of the bond order is that, as an atom becomes
overcoordinated beyond the ideal coordination and beyond its valency, the bonds
become more metallic and are characterized by delocalized electrons (Carlsson, 1985;
1990; Abell, 1985; Pettifor, 1990). Bazant and Kaxiras provided a test of empirical
theories of covalent bonding in solids using a procedure to invert ab initio cohesive
energy curves. They showed that this inversion verified trends in chemical bonding
across various bulk bonding arrangements consistent with theoretical predictions
(Bazant and Kaxiras, 1996). Their results indicated that a coordination-dependent pair
interaction can provide a fair description of high-symmetry crystal structures without
requiring additional many-body interactions, and that angular forces are needed only to
stabilize the structure under symmetry-breaking distortions, primarily at small
coordination numbers.
Ischtwan and Collins (1994) developed a moving interpolation technique in
which the potential energy in the neighborhood of any point is approximated by a Taylor
series using inverse bond-length coordinates as the expansion variables. Data from
ab initio calculations are used to obtain a set of initial Taylor series expansions. The
procedure expresses the potential at any configuration as a weighted average of the
Taylor series about all points in the data set. Subsequently, the results are iteratively
improved by computing trajectories on the Taylor-series-fitted surface and storing the
internal coordinates. New points are added to the data set according to a weight factor
determined by the relative density of existing data points in the region of the new point.
Convergence of the iteratively improved potential to the final surface is assessed using
dynamic variables, such as the reaction probability.
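The weighted-average-of-Taylor-expansions idea can be sketched in a few lines. The following is a minimal, generic illustration of that scheme (a modified Shepard interpolation), not the authors' implementation; the data points, energies, gradients, and Hessians are placeholders the caller must supply.

```python
import numpy as np

# Minimal sketch of interpolation by a weighted average of second-order
# Taylor expansions about data points, in the spirit of the moving
# interpolation described above. All inputs are caller-supplied placeholders.

def shepard_potential(q, points, energies, gradients, hessians, p=2):
    """Potential at configuration q from inverse-distance-weighted
    Taylor expansions about each data point."""
    q = np.asarray(q, dtype=float)
    taylor_vals, raw_weights = [], []
    for qi, Ei, gi, Hi in zip(points, energies, gradients, hessians):
        dq = q - qi
        # Second-order Taylor expansion about data point qi
        taylor_vals.append(Ei + gi @ dq + 0.5 * dq @ Hi @ dq)
        # Inverse-distance weight; it diverges at qi, so return exactly there
        dist2 = dq @ dq
        if dist2 == 0.0:
            return float(Ei)
        raw_weights.append(dist2 ** (-p))
    w = np.array(raw_weights)
    w /= w.sum()                      # normalize weights to sum to 1
    return float(w @ np.array(taylor_vals))
```

Near a data point the weight of that point's expansion dominates, so the interpolant reproduces the local Taylor series; far from all points it blends them smoothly.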
Jordan et al. (1995) used the moving interpolation trajectory sampling method to
investigate reaction dynamics. Maisuradze et al. (2003) introduced an interpolating
moving least squares (IMLS) method for interpolation between computed ab
initio points. When the method is unrestricted, the least squares coefficients are
obtained from the solution of a large matrix equation that must be solved repeatedly
during a trajectory study.
Rahman and Raff (2001) presented an analytical procedure for fitting the ab initio
data for the dissociation dynamics of vinyl bromide. Their approach was to develop
analytic functions for each of the energetically open reaction channels using functional
forms suggested by physical and chemical considerations. The parameters of these
functions are expressed as appropriate functions of the system's geometry. The
channel potentials are connected smoothly by parameterized switching functions that
play a central role in fitting the potential barriers, the reaction-coordinate curvatures,
and the coordinate-coupling terms obtained from the ab initio calculations.
Another approach, by Ercolessi and Adams (1994), bypasses the physical
motivation behind the functional form of the potential in favor of elaborate fitting. The
potential involves combinations of cubic splines with effectively hundreds of adjustable
parameters, and the approach is a force-matching method: the potential is fitted to ab
initio atomic forces for many atomic configurations, including surfaces, clusters, liquids,
and crystals. The potential form is built from single-variable functions of the atomic
coordinates, such as the glue potential, which is defined as:
$$V = \frac{1}{2}\sum_{i \ne j} \Phi(r_{ij}) \;+\; \sum_i U\!\left(\sum_j \rho(r_{ij})\right) \qquad (2.6)$$
where Φ(r_ij) is a pair potential, ρ(r_ij) is an atomic-density function, and U(n) is the
glue function. They attempted to match the forces from first-principles calculations for a
large set of configurations with those predicted by a classical potential by minimizing
the objective function, namely,
$$Z(\alpha) = Z_F(\alpha) + Z_C(\alpha), \qquad (2.7)$$
where ZF contains the ab initio data and ZC consists of physical quantities from
experimental results. If emphasis is given to ZF, the minimizing potential appears as the
best approximation of the first-principles system, with ZC acting as a guide toward the
correct region of parameter space. If, on the other hand, the emphasis is on ZC, the
method resembles a conventional fit, but the ZF term relieves the burden of guessing
the functional form.
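The structure of the objective in Eq. (2.7) can be sketched generically. In the code below, `model_forces` is a hypothetical callable returning classical forces for parameter vector `alpha`, and `constraints` pairs hypothetical property evaluators with experimental target values; none of these names come from the original work.

```python
import numpy as np

# Generic sketch of a force-matching objective Z = Z_F + Z_C (Eq. 2.7).
# `model_forces`, `constraints` are hypothetical caller-supplied objects.

def force_matching_objective(alpha, configs, ab_initio_forces,
                             model_forces, constraints, weight_C=1.0):
    # Z_F: mean squared deviation from the first-principles forces
    Z_F, count = 0.0, 0
    for cfg, f_ref in zip(configs, ab_initio_forces):
        f_model = model_forces(alpha, cfg)
        Z_F += np.sum((f_model - f_ref) ** 2)
        count += f_ref.size
    Z_F /= count
    # Z_C: weighted misfit to experimental quantities (e.g. lattice energy)
    Z_C = sum(w * (prop(alpha) - target) ** 2
              for prop, target, w in constraints)
    return Z_F + weight_C * Z_C
```

Raising `weight_C` pushes the fit toward the experimental constraints; lowering it lets the ab initio forces dominate, mirroring the trade-off described in the text.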
A different approach utilizes genetic algorithms and genetic programming. The
idea is to let the computer program determine the functional form and arrive at
the best possible solution with minimal human input. The concept is based on genetics
and evolutionary theory, in which different genetic representations of a set of variables
are combined to produce a better solution. The technique is stochastic rather than
gradient-based, so the dimensionality of the problem does not pose a serious limitation.
It works by randomly generating and exchanging functional elements (variables and
arithmetic operators, such as addition and multiplication) and selecting the fittest
individuals (each representing a set of parameters), assessed by comparison with target
values; a population of potentials evolves until a superior form emerges. Makarov and
Metiu (1998) proposed a procedure by which genetic programming could be used to find
the best functional form and the best set of parameters to fit the energy and its
derivatives from ab initio calculations. They suggested a directed search that guides the
computer to use a specified functional form without overly limiting the flexibility of the
search. The directed search was used to fill in portions of a human-engineered functional
template, without which the computer could not find a simple interpretable form even
for a pair potential.
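The selection-crossover-mutation cycle just described can be illustrated with a toy genetic algorithm for parameter fitting. The quadratic `fitness_error` below is a stand-in for an expensive objective such as the RMSE against ab initio energies; the target values and GA settings are arbitrary choices for the example.

```python
import random

# Toy genetic algorithm: candidate parameter sets evolve by selection,
# one-point crossover, and Gaussian mutation toward a lower fitting error.

def fitness_error(params, target=(1.0, -2.0, 0.5)):
    """Placeholder objective; in practice this would be an RMSE fit."""
    return sum((p - t) ** 2 for p, t in zip(params, target))

def evolve(pop_size=40, n_params=3, generations=200, seed=0):
    rng = random.Random(seed)
    pop = [[rng.uniform(-5, 5) for _ in range(n_params)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness_error)
        survivors = pop[:pop_size // 2]           # selection: keep the fittest
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_params)      # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.3:                # occasional mutation
                i = rng.randrange(n_params)
                child[i] += rng.gauss(0.0, 0.1)
            children.append(child)
        pop = survivors + children
    return min(pop, key=fitness_error)
```

Because the search uses only objective-function evaluations, never gradients, it behaves the same in high-dimensional parameter spaces, which is the property the text highlights.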
As an alternative to these methods, Blank et al. (1993; 1995) explored an
interpolation method based on feed-forward neural networks. Functional approximation
using neural networks does not require any assumption regarding the functional form of
the potential: a multilayer feed-forward neural network can in principle fit any real-valued
continuous function of any dimension to arbitrary accuracy, provided it has a sufficient
number of neurons (Cybenko, 1989). Blank et al. used neural networks to fit data sets
produced by low-dimensional models of a CO molecule chemisorbed on a Ni(111)
surface. The models were constrained versions of a many-dimensional empirical
potential that had been used in simulations of surface dynamics, where it had been
shown to give a reliable qualitative picture of the molecule-surface interaction.
Raff et al. (2005) presented an approach for the development of accurate potential
energy hypersurfaces based on ab initio electronic structure calculations. The method
uses a novelty sampling procedure to sample the regions of configuration space
that are relevant for the simulation. Hobday et al. (1999) developed a neural network
model for the more complex C-N system, for which the potential energy function had
not been parameterized. They used the neural network to fit the many-body term in the
Tersoff functional form rather than fitting the entire potential. In contrast to the
conventional network architecture, in which a single input vector is presented to the
network, a set of N vectors is presented, where N is the number of neighbors of the two
atoms i and j, not including i and j themselves. The value of N depends on the i-j bond:
for example, if the i-j bond is part of a monocyclic chain, then N = 2; if the i-j bond is
tetrahedrally coordinated, then N = 6. To present the local environment to the network,
the bond-order term is expressed as a function of i-j-k contributions, where k is a first
neighbor of i or j. The network input consists of N vectors, where N is the number of
different i-j-k contributions. Each input vector contains pairwise information, such as
direction cosines, bond lengths, and cutoff functions; first-neighbor information, such as
bond lengths and torsional angles; and second-neighbor information, such as the number
and types of atoms used to describe the i-j-k triplet. The size of each input vector is
constant, but the number of input vectors N is a variable that can change during the
simulation.
Chapter 3
3 Molecular Dynamics Simulation
Molecular dynamics is a computer simulation technique wherein the time
evolution of a set of interacting atoms is followed by integrating their equations of
motion. While the continuum mechanics approach provides a good understanding of
processes at the macro and micro scales, it cannot be used to simulate processes at the
atomistic scale. Unlike finite element analysis and other continuum approaches, in
which the nodes and the distances between them are selected arbitrarily, molecular
dynamics is based on more fundamental units of a specific material, such as the lattice
spacing and interatomic distances. Thus, processes can be studied at the fundamental
level. However, since a very large number of atoms (~10^23) constitutes any common
material, one has to consider the interactions of several thousand atoms even for a
nanometric study. Such simulations require a large amount of computing resources, but
in return they can provide insight into processes happening at the smallest, namely
atomistic, level.
3.1 Brief history of molecular dynamics simulation
The molecular dynamics method was first introduced by Alder and Wainwright in
the late 1950's (Alder and Wainwright, 1957, 1959) to study the interactions of hard
spheres. The purpose of the paper was to investigate the phase diagram of a hard
sphere system and in particular the solid and liquid regions. In a hard sphere system,
particles interact via instantaneous collisions, and travel as free particles between
collisions. Gibson et al. (1960) investigated the dynamics of radiation damage. It was the
first example of molecular dynamics calculation with a continuous potential based on a
finite time integration method. Rahman (1964) carried out perhaps the first simulation
using a realistic potential for liquid argon. He studied properties of liquid argon using
the Lennard-Jones potential. Stillinger and Rahman (1974) carried out simulations of liquid water. Verlet (1967) calculated the phase diagram of argon using the Lennard-Jones potential and computed correlation functions to test theories of the liquid state. That work also introduced the bookkeeping device that became known as the Verlet neighbor list and used the Verlet time integration algorithm. Phase transitions in the same system were investigated by Hansen and Verlet (1969). The number of simulation techniques has expanded since then; there now exist many specialized techniques for particular problems, including mixed quantum mechanical - classical simulations.
3.2 Applicability of MD simulations
At the atomistic level, the system follows laws of quantum mechanics rather than
classical mechanics, whereas molecular dynamics follows the laws of classical
mechanics. A classical description of the system can be used when quantum effects, such as tunneling and electronic excitations, play no essential role in the dynamics. In addition, the kinetic energy of a particle must be large enough to ensure that the de Broglie wavelength is much smaller than the lattice constant of the solid.
A simple test of the validity of the classical approximation is based on the de Broglie thermal wavelength, defined as:

\Lambda = \sqrt{ \frac{h^2}{2\pi M k_B T} }    (3.1)
where M is the atomic mass and T is the temperature. The classical approximation is justified if Λ << a, where a is the mean nearest-neighbor distance. The classical approximation is poor for lighter elements, such as hydrogen and helium. Moreover,
quantum effects become important in any system when T is sufficiently low. The drop in
the specific heat of crystals below the Debye temperature or the anomalous behavior of
the thermal expansion coefficient are well known examples of measurable quantum
effects in solids. Hence, molecular dynamics results should be interpreted with caution
in these regions.
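The criterion Λ << a is easy to check numerically. A minimal sketch (not from the dissertation; the argon and hydrogen values below are illustrative) evaluating Equation 3.1:

```python
import math

H = 6.62607015e-34    # Planck constant, J s
KB = 1.380649e-23     # Boltzmann constant, J/K
AMU = 1.66053907e-27  # atomic mass unit, kg

def thermal_wavelength(mass_amu, temperature_k):
    """de Broglie thermal wavelength of Eq. (3.1), in metres."""
    m = mass_amu * AMU
    return math.sqrt(H**2 / (2.0 * math.pi * m * KB * temperature_k))

# Liquid argon near 94 K: Lambda ~ 0.3 A, versus a nearest-neighbour
# distance of roughly 3.7 A, so the classical approximation holds.
print(thermal_wavelength(39.95, 94.4) / 3.7e-10)

# Hydrogen at 10 K: Lambda is an order of magnitude larger, so a
# classical treatment is poor, as noted above for light elements.
print(thermal_wavelength(1.008, 10.0))
```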
An important issue with MD simulations is the temporal scale that can be
explored. The maximum time step with which Newton’s equations can be integrated
numerically is inherently restricted to very small values (about 1 fs), in order to
reproduce atomic vibrations accurately. As a result molecular dynamics simulations
require long computational times because the most interesting motions occurring
during the simulation processes are very slow compared with the fast oscillations of
bond lengths and bond angles that limit the integration time step. This limits the time
scales that can be handled by MD in a reasonable amount of actual clock time. MD
simulations are usually limited to molecular processes that happen over relatively short times. A particular difficulty is the so-called "rare events" problem, in which infrequent processes, once they begin, are completed quickly: dissociation of a chemical bond, for example, is a rapid process completed on the picosecond time scale, yet the application of an external load lasts orders of magnitude longer than the dissociation itself, making it difficult to observe such events at experimental speeds. Two approaches are usually followed to overcome the time scale limitation of MD. In the first approach, coarse-graining, the dynamic variables are divided into slow and fast degrees of freedom and averaging is performed over the fast degrees of freedom. In the second approach, fast (and rare) trajectories between the desired states are computed explicitly, and their probabilities (relative to non-reactive trajectories) are calculated.
3.3 Formulation of the differential equations of motion
Forces between atoms are computed explicitly and the motion of the molecule is
followed over a period of time using a suitable numerical integration method. Classical
mechanics describes how physical objects move and how their positions change with
time. Consider an isolated system consisting of N bodies with coordinates (xi, yi, zi)
where i = 1, 2, 3, ..., N (Goldstein, 1965). By an isolated system, it is meant that all other bodies
are sufficiently remote to have a negligible influence on the system. Each of the N
bodies is assumed to be small enough to be treated as a point particle. The position of
the ith body with respect to a given inertial frame is denoted as r_i(t). Its velocity and acceleration are given by

v_i(t) = \dot{r}_i(t) \quad \text{and} \quad a_i(t) = \ddot{r}_i(t)
Each atom has a mass m_i. From Newton's second law:

\vec{F}_i = m_i \vec{a}_i

where \vec{F}_i is the total force acting on atom i. This force is the sum of the forces due to each of the other interacting atoms in the system. If the force on the ith body due to the jth body is denoted as \vec{F}_{ij}, then

\vec{F}_i = \sum_{j=1,\, j \neq i}^{N} \vec{F}_{ij}, \qquad \vec{F}_i = m_i \frac{d^2 \vec{r}_i}{dt^2} = \frac{d(m_i \vec{v}_i)}{dt} = \frac{d\vec{p}_i}{dt}    (3.2)

where \vec{F}_{ii} is zero, since there is no force on the ith atom due to itself, and \vec{p}_i is the momentum of the ith body.
The force on atom i is the negative gradient of the potential energy function with respect to the coordinates of atom i:

\vec{F}_i = -\nabla_i V(\vec{r}_1, \vec{r}_2, ..., \vec{r}_N)    (3.3)

where V is the potential energy function, \vec{r}_i is the position vector of atom i, and x_i, y_i, and z_i are the Cartesian coordinates of atom i:

\vec{r}_i = x_i \hat{i} + y_i \hat{j} + z_i \hat{k}    (3.4)
and,

\nabla_i = \hat{i} \frac{\partial}{\partial x_i} + \hat{j} \frac{\partial}{\partial y_i} + \hat{k} \frac{\partial}{\partial z_i}    (3.5)
Now, the force in the X-direction on atom i is given by:

F_x = \frac{dp_x}{dt} = \frac{d(m v_x)}{dt} = -\frac{\partial V}{\partial x}    (3.6)
The velocity of atom i in the X-direction is:

v_x = \frac{dx}{dt} = \frac{p_x}{m}    (3.7)
Similar equations can be derived in the Y- and Z- directions.
It should be noted that to update the position of an atom, one has to solve six coupled first-order differential equations. If the number of atoms in the system is N, then the number of differential equations to solve is 6N. If the potential function is a simple pairwise potential, such as the Lennard-Jones or Morse potential, then the total number of terms in the potential is N(N-1)/2. For example, a system of 2000 atoms requires integration of 12,000 coupled first-order differential equations of motion, and about 2 million pairwise terms must be calculated each time a derivative is evaluated. Such evaluations must be done at every integration step of every trajectory calculation. The computational time therefore increases rapidly with the number of atoms in the system.
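The operation counts quoted above follow directly from N; a trivial sketch:

```python
def ode_count(n_atoms):
    # three position and three momentum first-order ODEs per atom
    return 6 * n_atoms

def pair_term_count(n_atoms):
    # distinct i-j pairs in a simple pairwise potential
    return n_atoms * (n_atoms - 1) // 2

print(ode_count(2000))        # 12000
print(pair_term_count(2000))  # 1999000, i.e. about 2 million
```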
3.4 Time integration algorithms
The MD trajectories are defined by both position and velocity vectors and they
describe the time evolution of the system. Accordingly, the positions and velocities are
propagated with a finite time interval by solving a set of coupled first order differential
equations. Time integration algorithms are based on finite difference methods, where
time is discretized on a finite grid. These are based on Taylor series expansion truncated
after some finite number of terms. The truncation error varies as (Δt)^n, where Δt is the time step and n is the order of the method (which depends on the number of time derivatives retained in the expansion). Hence, for higher-order methods the truncation error drops off rapidly even with a small reduction in the time step. The idea
of the numerical integration of Newton’s equations of motion is to find an expression
that defines positions at time t+Δt in terms of already known positions and their time
derivatives, such as velocities and accelerations at time t. Thus:
r(t + \Delta t) = r(t) + v(t)\,\Delta t + \frac{1}{2} a(t)\,\Delta t^2 + ...
v(t + \Delta t) = v(t) + a(t)\,\Delta t + \frac{1}{2} b(t)\,\Delta t^2 + ...
a(t + \Delta t) = a(t) + b(t)\,\Delta t + ...    (3.8)

where b(t) denotes the third time derivative of the position.
3.4.1 Verlet algorithm
This is perhaps the most widely used method of integrating the equations of motion, initially developed by Verlet (1967) and attributed to Störmer (Gear, 1971). The Verlet algorithm uses the acceleration at time t and the positions at times t and t-Δt to calculate the new positions at time t+Δt. The algorithm is time reversible and, given a conservative force field, it guarantees the conservation of linear momentum.
r(t + \Delta t) = r(t) + v(t)\,\Delta t + \frac{1}{2} a(t)\,\Delta t^2 + ...    (3.9a)
r(t - \Delta t) = r(t) - v(t)\,\Delta t + \frac{1}{2} a(t)\,\Delta t^2 - ...    (3.9b)

Adding Equations 3.9a and 3.9b and neglecting the higher-order terms,

r(t + \Delta t) = 2 r(t) - r(t - \Delta t) + a(t)\,\Delta t^2    (3.10)
and the velocity is computed as,

v(t) = \frac{r(t + \Delta t) - r(t - \Delta t)}{2\,\Delta t}    (3.11)
The basic Verlet algorithm does not use the velocities explicitly, which may introduce numerical error: the local error in updating the positions is proportional to Δt^4, whereas the error in the velocities computed from Equation 3.11 is proportional to Δt^2.
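A minimal sketch of the basic Verlet scheme of Equations 3.10 and 3.11, applied to a one-dimensional harmonic oscillator (a test problem assumed here purely for illustration, not an example from this study). Monitoring the total energy, as suggested later in this section, shows the expected stability:

```python
def verlet(x0, v0, accel, dt, n_steps):
    """Position (Stoermer-)Verlet, Eqs. (3.10)-(3.11), for one degree of freedom."""
    xs = [x0, x0 + v0 * dt + 0.5 * accel(x0) * dt**2]  # Taylor start-up step
    for _ in range(n_steps - 1):
        xs.append(2.0 * xs[-1] - xs[-2] + accel(xs[-1]) * dt**2)
    # central-difference velocities, Eq. (3.11), at the interior points
    vs = [(xs[i + 1] - xs[i - 1]) / (2.0 * dt) for i in range(1, len(xs) - 1)]
    return xs, vs

# harmonic oscillator with m = k = 1, so a(x) = -x and E = v^2/2 + x^2/2
xs, vs = verlet(1.0, 0.0, lambda x: -x, 0.01, 10000)
energies = [0.5 * v * v + 0.5 * x * x for x, v in zip(xs[1:-1], vs)]
print(max(energies) - min(energies))  # small: energy is well conserved
```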
3.4.2 Leap-frog Verlet algorithm
This is a modification of the Verlet algorithm in which velocities are computed at
half time step as well. The algorithm is given as,
v\left(t + \frac{1}{2}\Delta t\right) = v\left(t - \frac{1}{2}\Delta t\right) + a(t)\,\Delta t
r(t + \Delta t) = r(t) + v\left(t + \frac{1}{2}\Delta t\right)\Delta t    (3.12)
The advantage of this method is that the velocities are explicitly calculated, but the
disadvantage is that they are not calculated at the same time as the positions. The
velocities at time t can be approximated as,
v(t) = \frac{1}{2}\left[ v\left(t - \frac{1}{2}\Delta t\right) + v\left(t + \frac{1}{2}\Delta t\right) \right]    (3.13)
3.4.3 Velocity Verlet algorithm
The velocity Verlet algorithm provides a better description of the velocities,
without the need to compute the velocities at half the time step.
r(t + \Delta t) = r(t) + v(t)\,\Delta t + \frac{1}{2} a(t)\,\Delta t^2
v(t + \Delta t) = v(t) + \frac{1}{2}\left[ a(t) + a(t + \Delta t) \right]\Delta t    (3.14)
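Equation 3.14 translates almost line for line into code. A brief sketch on the same illustrative harmonic oscillator, showing that positions and velocities are now available at the same time points:

```python
def velocity_verlet(x0, v0, accel, dt, n_steps):
    """Velocity Verlet, Eq. (3.14): one position update, then a velocity
    update using the average of the old and new accelerations."""
    x, v, a = x0, v0, accel(x0)
    for _ in range(n_steps):
        x = x + v * dt + 0.5 * a * dt**2
        a_new = accel(x)
        v = v + 0.5 * (a + a_new) * dt
        a = a_new
    return x, v

# harmonic oscillator, a(x) = -x; the total energy should stay near 0.5
x_end, v_end = velocity_verlet(1.0, 0.0, lambda x: -x, 0.01, 10000)
print(0.5 * v_end**2 + 0.5 * x_end**2)
```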
3.4.4 Beeman algorithm
The Beeman algorithm (Beeman, 1976) provides a more accurate expression for the velocities:

r(t + \Delta t) = r(t) + v(t)\,\Delta t + \frac{2}{3} a(t)\,\Delta t^2 - \frac{1}{6} a(t - \Delta t)\,\Delta t^2
v(t + \Delta t) = v(t) + \frac{1}{3} a(t + \Delta t)\,\Delta t + \frac{5}{6} a(t)\,\Delta t - \frac{1}{6} a(t - \Delta t)\,\Delta t    (3.15)
It should be noted that these algorithms assume that the force remains constant while the positions are updated from t to t+Δt. To circumvent this assumption, higher-order terms involving higher time derivatives should be included in the algorithm.
It may be noted that the above-mentioned methods are self-starting, i.e., it is only necessary to know the positions and velocities at time t = t_0. An estimate of the accuracy during the integration can be obtained by monitoring the total energy and the angular momentum of the system, which should be conserved. Step-size reduction and back integration are other means of checking the accuracy of the integration results.
Predictor-corrector methods provide an automatic error estimate at each integration
step, allowing the program to employ a variable step-size to achieve a specified
accuracy. However, these methods are not self-starting. Therefore it is necessary to use
velocity Verlet or similar single step methods to start the integration.
Chapter 4
4 Interatomic potentials
The interatomic potential used in a simulation to model the lattice structure plays
an important role in determining the accuracy of the simulation results. The accuracy of
the trajectories obtained from MD simulation is affected by the choice of the potential
energy function. The total energy of the system is the sum of kinetic energy (KE) and
potential energy (PE). The KE is simple to compute but the PE computation is complex
since it depends on the positions of all interacting atoms. The complexity of the
potential also determines the computational time required for the simulation. The force
acting on an atom is proportional to the first derivative of the potential energy function.
The largest part of a molecular dynamics simulation is the evaluation of the forces which
are required to calculate new positions of atoms. The potential energy can be obtained
using two approaches. One approach is to perform ab initio calculations by solving the Schrödinger equation for the entire system at each integration step. Although theoretically this is the best approach (free of arbitrary assumptions and ad hoc parameterization), it is not feasible to perform such calculations for the entire system at every time step. For larger systems, comparatively simple and computationally efficient empirical potential functions are used, which take into account factors such as bond stretching, bending, and torsion, covalent bonding, and long-range interactions, such as van der Waals and Coulombic forces.
4.1 Design of interatomic potentials
The traditional approach is to completely eliminate the electronic degrees of
freedom and move the nuclei using some appropriate analytical potential function. The
functional form of this potential is so chosen as to mimic the behavior of the true
potential in a realistic way for specific materials under specific conditions. Design of the
potential involves two steps:
1. Selection of the analytical form for the potential. This could be a sum of pairwise terms, where the energy of a pair depends on the relative distance, or a many-body form appropriate for a specific type of bonding involving bond distances, bond angles, and coordination numbers.
2. Once the form is decided, the next step is to determine a specific parameterization of the functions. A set of parameters is found which fits data from ab initio calculations, experimental properties (such as density, cohesive energy, and phonon frequencies), or a data set containing some combination of theoretical and experimental observations.
The most important factor is the range of applicability of the potential energy
function. Since an ad hoc functional form is used, it has no theoretical basis, and once the empirical potential is developed there is no straightforward means to improve it.
Normally, the parameters are adjusted to equilibrium data, which means that the potential can be used to model the bulk properties of the crystal lattice close to equilibrium conditions. It is therefore highly unlikely that the energies and forces of amorphous structures will be accurately represented by the potential. For example, a potential function designed for the bulk may not be appropriate for studying bond dissociation or a diatomic molecule of the same element, since the environment would be very different from the one for which the potential was designed. However, it may be possible to model both the bulk and the surface, where the environment differs owing to the reduced coordination of the atoms at the surface. These problems are intractable, and one should be cautious when using analytical potentials to simulate high temperature and pressure conditions.
4.2 Types of potential energy functions
The empirical potentials are further classified into two-body, three-body and
many-body potential forms. Most empirical potential forms comprise attractive and repulsive terms. The term "empirical" may be misleading, as these potentials are not as strictly empirical as the term suggests (Komanduri and Raff, 1999). These potentials can provide
a more realistic model than the potentials derived from purely theoretical
considerations (Torrens, 1972) since parameterization of these potentials can take into
account some physical properties as well.
Empirical potential forms are based on mathematical expressions for pairwise
interaction between two atoms or ions, which may or may not be justified from theory.
These forms can be expressed in terms of attractive and repulsive terms. By convention,
repulsive forces are considered positive and attractive forces as negative. As the
distance between two atoms decreases, both attractive and repulsive forces increase.
The repulsive forces increase more rapidly than the attractive forces. The curvature of
the potential energy function is mainly determined by the repulsive force component,
which in turn also governs the elastic behavior of the solid. When attractive and
repulsive forces exactly balance each other, this corresponds to the equilibrium position
with the potential energy being minimum, which is also the bond energy. The cohesive
properties of solids, such as melting and vaporization behavior are determined by the
magnitude of the maximum binding energy, which is governed by the attractive force component. The higher the bonding energy, the higher the melting temperature and elastic modulus, and the lower the coefficient of thermal expansion.
Several empirical potential functions have been developed for different materials. Some examples of simple pair potentials are the Lennard-Jones (also known as 6-12) potential (1925), the Morse potential (1929), and the Born-Mayer potential (1933). For complex systems, such as those involving covalent bonding, common potential functions are the Stillinger-Weber potential (1985), the Tersoff potential (1988, 1989), the Bolding-Andersen potential (1990), the Brenner potential (1990), and the embedded atom model (EAM) and modified embedded atom model (MEAM) potentials (Baskes, 1989).
In general, each atom can interact with all other atoms in the simulation. One
method of increasing the computational speed is to limit the range of the potential.
Normally, the potential functions are truncated at a certain value of the bond distance called the "cutoff radius". By neglecting the interactions beyond the cutoff radius, the computational time is significantly reduced with very little loss in accuracy. Truncation of the potential energy function also results in truncation of the forces. The cutoff distance is generally taken as the distance at which the potential energy is ~3-5% of the equilibrium potential energy. Consequently, a finite cutoff results in small fluctuations in the total energy instead of a strictly constant energy.
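For a concrete (illustrative, reduced-unit) example of the 3-5% rule, one can scan the Lennard-Jones potential of Equation 4.1 outward from its minimum until the magnitude falls below 5% of the well depth:

```python
def lj(r, eps=1.0, sigma=1.0):
    """Lennard-Jones potential, Eq. (4.1), in reduced units."""
    return 4.0 * eps * ((sigma / r)**12 - (sigma / r)**6)

eps = 1.0
r = 2.0**(1.0 / 6.0)            # LJ minimum, where V = -eps
while abs(lj(r)) > 0.05 * eps:  # 5% of the well depth
    r += 0.001
print(r)  # a little over 2 sigma for the 5% criterion
```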
4.2.1 Pair potentials
Pair potentials are applicable to many metals, such as copper, aluminum, and silver, for atomistic studies. Pair potentials can be classified into two basic categories (Vitek, 1996): one in which the potential determines the total energy of the system, and the other in which it determines the change in energy when a configuration varies under constant-density conditions.
The Lennard-Jones potential (Lennard-Jones, 1925) is given as:

V_{LJ}(r) = 4\varepsilon \left[ \left(\frac{\sigma}{r}\right)^{12} - \left(\frac{\sigma}{r}\right)^{6} \right]    (4.1)
where ε and σ are parameters that determine the well depth and the position of the potential minimum, respectively. One of the most commonly used forms of the pair potential is the Morse potential (Morse, 1929). It is used to model the interaction between atoms of two different materials and is given as:
V(r) = D\left(1 - e^{-\alpha (r - r_e)}\right)^2    (4.2)

or

V(r) = D\left[ e^{-2\alpha (r - r_e)} - 2 e^{-\alpha (r - r_e)} \right], \quad r \leq r_c

where r is the bond distance between the two atoms and r_c is the cutoff radius beyond which the potential is truncated to zero. Potential values computed using the two forms of Equation 4.2 differ by the constant D. As a result, the forces computed using the two forms are the same, and either form can be used in the simulation. The potential parameters D, α, and r_e are
adjusted to the measured sublimation enthalpy, Debye temperature, and the nearest
neighbor spacing of the material. Girifalco and Weizer (1959) calculated Morse potential parameters using experimental values of the energy of vaporization, the lattice constant, and the compressibility. The Morse potential can model FCC metals well, but not BCC metals or covalent materials. A general feature of a physically reasonable pair potential is that it favors the formation of close-packed structures, so pair potentials are unsuitable for describing covalent systems, which assume more open structures.
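A brief sketch of the two pair potentials described above (the parameter values are arbitrary, chosen only for illustration). It also confirms that the two forms of Equation 4.2 differ by the constant D, so their forces coincide:

```python
import math

def lj(r, eps, sigma):
    """Lennard-Jones 6-12 potential, Eq. (4.1)."""
    return 4.0 * eps * ((sigma / r)**12 - (sigma / r)**6)

def morse(r, d, alpha, r_e):
    """Morse potential, second form of Eq. (4.2): goes to zero at large r."""
    x = math.exp(-alpha * (r - r_e))
    return d * (x * x - 2.0 * x)

def morse_sq(r, d, alpha, r_e):
    """Morse potential, first (squared) form of Eq. (4.2): goes to D at large r."""
    return d * (1.0 - math.exp(-alpha * (r - r_e)))**2

print(lj(2.0**(1.0 / 6.0), 1.0, 1.0))  # -eps: LJ well depth at its minimum
print(morse(2.0, 1.0, 1.5, 2.0))       # -D: Morse well depth at r = r_e
print(morse_sq(2.5, 1.0, 1.5, 2.0)
      - morse(2.5, 1.0, 1.5, 2.0))     # the two forms differ by exactly D
```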
4.2.2 Many-body potentials
Pair potentials, such as the Lennard-Jones potential, are suitable for modeling interactions of inert gases, such as helium, argon, and xenon. A covalent material presents a difficult challenge because of complex quantum mechanical effects, such as chemical bond formation, hybridization, metallization, charge transfer, and bond bending, which must be described by an effective interaction between atoms. For instance, silicon undergoes a series of structural phase transitions under pressure, from the tetrahedral structure to a denser phase called β-tin, then to simple cubic and FCC. This indicates that the energy difference between these structures is not too large, which suggests that the cohesive energy is nearly independent of the coordination. A pairwise potential model would favor the more densely packed structure. Consequently, no reasonable pair potential can stabilize the diamond tetrahedral structure without considering the angular terms. Various methods have been introduced to circumvent this problem.
Several authors have introduced potentials with a strong angular dependence (Stillinger
and Weber, 1985; Biswas and Hamann, 1985; Baskes, 1987; Baskes et al., 1989). Pettifor (1989) developed a similar approach from tight binding. Models based on the bond charge (Brenner and Garrison, 1985) and on a function of the local coordination (Tersoff, 1986) have also been developed. Most of these models have been successful in the regime for which they were parameterized but have shown a lack of transferability. A survey of six such potentials (Balamane and Halicioglu, 1992) concluded that each has strengths and limitations, none appearing clearly superior to the others, and none being fully transferable. More recently, the environment-dependent interatomic potential (EDIP) (Justo et al., 1997, 1998; Bazant et al., 1997) has been developed as an improvement over previous models.
The other group of potentials (Kane, 1985; Keating, 1966) is constructed to accurately describe small distortions from the ground state in complex systems, such as the diamond structure of semiconductors. The Keating model uses a Taylor series expansion of the energy about its minimum. It can give accurate descriptions of small displacements but becomes progressively less accurate for large displacements. Such potentials are useful for describing phonons and elastic deformations.
In general, any many-body potential energy function describing interactions among N identical particles can be resolved into one-body, two-body, three-body, etc. contributions as follows:

E = \sum_i V_1(i) + \sum_{i<j} V_2(i,j) + \sum_{i<j<k} V_3(i,j,k) + ... + V_N(1, ..., N)    (4.3)
The single-particle potential V_1 describes external forces. The first term that describes interactions between atoms is the second term, which has the form of a pair potential. For this description to be useful in theoretical modeling, it is necessary that the component functions V_n converge quickly to zero with increasing n. The first step towards a many-body form is to include three-body terms that favor bond angles corresponding to those of the diamond structure. Stillinger and Weber (1985) proposed such a potential, which has the form:
V = \frac{1}{2} \sum_{ij} \Phi(r_{ij}) + \sum_{ijk} g(r_{ij})\, g(r_{ik}) \left( \cos\theta_{ijk} + \frac{1}{3} \right)^2    (4.4)
where θijk is the bond angle formed by the bonds ij and ik, and g(r) is a decaying function
with a cutoff between the first and second neighbors. The bond angles in the diamond tetrahedral structure are ~109.5°, and the cosine of this angle is ~ -1/3. This makes the term cos(θijk) + 1/3 go to zero for θ ~ 109.5°, which makes the tetrahedral structure more stable than compact structures. This potential gives a fairly realistic description of crystalline silicon. However, its built-in tetrahedral bias creates problems in other situations. For example, it cannot predict the right energies of the non-tetrahedral structures found under pressure, the coordination of the liquid is too low, and the surface structures are not correct. Consequently, this potential is not transferable to situations other than those for which it was designed.
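The tetrahedral bias of Equation 4.4 is easy to see numerically. A small sketch of just the angular factor (the radial functions g(r) are omitted for clarity):

```python
import math

def sw_angular(theta):
    """Three-body angular factor (cos(theta) + 1/3)^2 from Eq. (4.4)."""
    return (math.cos(theta) + 1.0 / 3.0)**2

tetrahedral = math.acos(-1.0 / 3.0)   # ~109.47 degrees
print(sw_angular(tetrahedral))        # ~0: tetrahedral angles cost nothing
print(sw_angular(math.pi / 2.0))      # 1/9: 90-degree angles are penalized
```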
Work by Carlsson (1990) showed that to design a realistic potential, one should consider analytic forms that take the local environment into account, with bond strengths depending on it. The work of Biswas and Hamann (1987) suggests that a three-body potential is not adequate for accurately describing the cohesive energy of silicon over a wide range of bonding geometries and coordinations. However, a general form for a four-body or five-body potential becomes intractable, as it would contain too many parameters. Instead of a general N-body form, Tersoff (1988, 1989) proposed an approach that couples two-body and multi-body correlations into the model. The potential is based on the bond order, which takes the local environment into account. The bond strength depends on the geometry: the more neighbors an atom has, the weaker the bond to each neighbor. If the energy per bond decreased rapidly with increasing coordination, then diatomic molecules would form the most stable arrangements of atoms. On the other hand, if the bond order depends only weakly on coordination, close-packed structures are favored, since they maximize the number of bonds. Silicon undergoes structural transitions with varying coordination under pressure with only a small difference in cohesive energy (Yin and Cohen, 1980, 1982, 1984; Chang and Cohen, 1984). This suggests that the decrease in bond strength with increasing coordination number almost cancels the increase in the number of bonds over a range of coordinations (Tersoff, 1988).
The Tersoff potential (Tersoff, 1989) is given as:

E = \sum_i E_i = \frac{1}{2} \sum_{i \neq j} V_{ij}, \qquad V_{ij} = f_C(r_{ij}) \left[ f_R(r_{ij}) + b_{ij} f_A(r_{ij}) \right]    (4.5)
where E is the total potential energy of the system, f_R and f_A are repulsive and attractive pair potentials, respectively, and f_C is a cutoff function given by:

f_R(r_{ij}) = A_{ij}\, e^{-\lambda_{ij} r_{ij}},
f_A(r_{ij}) = -B_{ij}\, e^{-\mu_{ij} r_{ij}},
f_C(r_{ij}) = \begin{cases} 1, & r_{ij} < R_{ij} \\ \frac{1}{2} + \frac{1}{2} \cos\left[ \pi\, \frac{r_{ij} - R_{ij}}{S_{ij} - R_{ij}} \right], & R_{ij} < r_{ij} < S_{ij} \\ 0, & r_{ij} > S_{ij} \end{cases}    (4.6)
where r_{ij} is the bond distance between atoms i and j, S_{ij} is the cutoff radius, and λ_{ij}, μ_{ij}, and R_{ij} are potential parameters. The main feature of this form is the term b_{ij}. The idea is that
the strength of each bond depends upon the local environment and is lowered when
the number of neighbors is relatively high. This dependence is expressed by the
parameter bij which can diminish the attractive force relative to the repulsive force.
b_{ij} = \chi_{ij} \left( 1 + \beta_i^{n_i}\, \zeta_{ij}^{n_i} \right)^{-1/(2 n_i)},
\zeta_{ij} = \sum_{k \neq i,j} f_C(r_{ik})\, \omega_{ik}\, g(\theta_{ijk}),
g(\theta_{ijk}) = 1 + \frac{c_i^2}{d_i^2} - \frac{c_i^2}{d_i^2 + (h_i - \cos\theta_{ijk})^2}    (4.7)
The term ζ_{ij} defines the effective coordination number of atom i, taking into account the relative distance between two neighbors (r_{ij} - r_{ik}) and the bond angle θ_{ijk}. The function
g(θ) has a minimum at h = cos(θ); the parameter d determines how sharp the dependence on the angle is, and c expresses the strength of the angular effect. The parameters R and S are not optimized but are chosen so as to include only the first neighbors for several selected high-symmetry structures, such as graphite, diamond, simple cubic, and face-centered cubic.
\lambda_{ij} = \frac{\lambda_i + \lambda_j}{2}; \quad \mu_{ij} = \frac{\mu_i + \mu_j}{2};
A_{ij} = (A_i A_j)^{1/2}; \quad B_{ij} = (B_i B_j)^{1/2};
R_{ij} = (R_i R_j)^{1/2}; \quad S_{ij} = (S_i S_j)^{1/2}    (4.8)
The parameter χ_{ij} strengthens or weakens heteropolar bonds in multi-component systems, with χ_{ii} = 1 and χ_{ij} = χ_{ji}. Also, ω_{ii} = 1. The potential was first calibrated for silicon (Tersoff, 1988a) and then for carbon (Tersoff, 1988b). The parameters were chosen to fit theoretical and experimental data, such as cohesive energies, lattice constants, and bulk moduli obtained from realistic and hypothetical configurations.
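As an illustration of the short range of the Tersoff form, the cutoff function f_C of Equation 4.6 can be sketched directly. The silicon values R = 2.7 Å and S = 3.0 Å used below are commonly quoted but should be checked against the original papers:

```python
import math

def f_cutoff(r, r_inner, s_outer):
    """Tersoff cutoff f_C, Eq. (4.6): 1 inside R, 0 beyond S, with a
    smooth half-cosine bridge in between."""
    if r < r_inner:
        return 1.0
    if r > s_outer:
        return 0.0
    return 0.5 + 0.5 * math.cos(math.pi * (r - r_inner) / (s_outer - r_inner))

print(f_cutoff(2.0, 2.7, 3.0))   # 1.0: full interaction inside R
print(f_cutoff(2.85, 2.7, 3.0))  # 0.5: midpoint of the bridge region
print(f_cutoff(3.5, 2.7, 3.0))   # 0.0: no interaction beyond S
```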
Chapter 5
5 Ab initio methods
'Ab initio' is a term used to describe a solution of the non-relativistic, time-independent Schrödinger equation: ĤΨ = EΨ. Ĥ is the Hamiltonian operator for the
system, which is a function of kinetic and potential energies of the particles of the
system, Ψ is the wavefunction, and E is the energy.
The Hamiltonian for an N-electron atom is

H = -\frac{\hbar^2}{2 m_n}\nabla_n^2 - \frac{\hbar^2}{2 m_e}\sum_{i=1}^{N}\nabla_i^2 - \sum_{i=1}^{N}\frac{Z e^2}{r_i} + \sum_{i=1}^{N}\sum_{j>i}^{N}\frac{e^2}{r_{ij}}    (5.1)
The first term on the right-hand side of Equation 5.1 represents the kinetic energy of the nucleus. It contains the coordinates of the nucleus and derivatives with respect to those coordinates, but it does not contain the coordinates of any of the N electrons or derivatives with respect to them; it is therefore called the zero-electron term. The second and third terms of Equation 5.1, which correspond to the kinetic and potential energies of the electrons, respectively, each contain the coordinates of one electron and hence are called one-electron terms, whereas the last term, the electron-electron repulsion, depends on the coordinates of two electrons and hence is called the two-electron term.
5.1 Born-Oppenheimer approximation
It is computationally impossible to solve Equation 5.1 exactly for anything other than the hydrogen atom. For this reason, approximations are introduced to simplify the calculations for many-body systems. One of these is the Born-Oppenheimer approximation. Note that the nuclear mass, which appears in the denominator of the nuclear kinetic energy term, is much larger than the mass of an electron. This large mass means the nuclei move very slowly relative to the electrons, and the nuclei can be considered to move in the potential generated by the fast-moving electrons. This approximation allows the electronic Hamiltonian to be considered for a fixed set of nuclear coordinates.
5.2 Basis sets
The next approximation involves expressing the molecular orbitals as linear combinations of a pre-defined set of one-electron functions known as basis functions. A basis set is the mathematical description of the orbitals within a system, which in turn combine to approximate the total electronic wavefunction. These basis functions are usually centered on the atomic nuclei and so bear some resemblance to atomic orbitals. An individual molecular orbital is defined as:

\Phi_i = \sum_{\mu=1}^{N} c_{\mu i}\, \chi_\mu    (5.2)
The coefficients c_{\mu i} are known as the molecular orbital expansion coefficients. The basis functions χ_1, ..., χ_N are also chosen to be normalized.
5.2.1 Slater type orbitals (STO)
STOs are constructed from a radial part describing the radial extent of the orbital and an angular part describing the shape of the orbital:

\Phi_\mu = C\, r^{n-1}\, e^{-\zeta r}\, Y_{lm}    (5.3)

The radial part, r^{n-1} e^{-\zeta r}, depends on the distance r from the origin of the basis function (usually the location of the nucleus), the orbital exponent ζ, and the principal quantum number n. The spherical part, Y_{lm}, known as a spherical harmonic, depends on the angular quantum number l and the magnetic quantum number m. The normalization constant C is chosen such that the integral over the square of the basis function yields unity.
5.2.2 Gaussian type orbitals (GTO)
Gaussian-type orbitals are also constructed from a radial and a spherical part, but the radial part has a different dependence on r: the GTO uses r² instead of r in the exponential, so that the product of two Gaussian primitives is another Gaussian. This makes the integrals much easier to evaluate, at the expense of accuracy. To compensate, several Gaussian functions are combined to obtain more accurate results. Gaussian and other ab initio electronic structure programs use Gaussian-type atomic functions as basis functions. Gaussian functions have the general form:
g = c\, x^a y^b z^c\, e^{-\alpha r^2}    (5.4)

where α is a constant determining the size (radial extent) of the function. In a Gaussian function, e^{-\alpha r^2} is multiplied by powers of x, y, and z and by a normalization constant c so that:

\int_{\text{all space}} g^2 = 1    (5.5)

The sum of the exponents, L = a + b + c, is used to define the angular momentum of the basis function: s-type (L=0), p-type (L=1), d-type (L=2), f-type (L=3), g-type (L=4), and so on.
Following are some representative Gaussian functions for s, py and dxy respectively:
\begin{aligned}
g_s(\alpha, \mathbf{r}) &= \left( \frac{2\alpha}{\pi} \right)^{3/4} e^{-\alpha r^2},\\
g_y(\alpha, \mathbf{r}) &= \left( \frac{128\,\alpha^{5}}{\pi^{3}} \right)^{1/4} y\, e^{-\alpha r^2},\\
g_{xy}(\alpha, \mathbf{r}) &= \left( \frac{2048\,\alpha^{7}}{\pi^{3}} \right)^{1/4} xy\, e^{-\alpha r^2}. \qquad (5.6)
\end{aligned}
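As a numerical check of the prefactor in Eq. (5.6), the s-type primitive can be squared and integrated over all space; the exponent α = 0.5 below is an arbitrary illustrative choice:

```python
import numpy as np

# Verify that g_s = (2*alpha/pi)^(3/4) * exp(-alpha*r^2) is normalized:
# integrating |g_s|^2 over spherical shells 4*pi*r^2 dr should give 1.
alpha = 0.5                                   # illustrative exponent only
r = np.linspace(0.0, 20.0, 400001)
dr = r[1] - r[0]
g_s = (2.0 * alpha / np.pi) ** 0.75 * np.exp(-alpha * r ** 2)
integral = np.sum(g_s ** 2 * 4.0 * np.pi * r ** 2) * dr
print(round(integral, 6))  # 1.0
```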
Linear combinations of primitive Gaussians are used to form the actual basis functions
called contracted Gaussians which have the form:
\chi_\mu = \sum_{p} d_{\mu p}\, g_p, \qquad (5.7)

where the g_p are primitive Gaussians and the d_{μp} are fixed constants within a given basis set.
This results in the following expansion for molecular orbitals:
\Phi_i = \sum_{\mu} c_{\mu i}\, \chi_\mu = \sum_{\mu} c_{\mu i} \left( \sum_{p} d_{\mu p}\, g_p \right). \qquad (5.8)
5.3 Hartree-Fock method
Molecular orbital theory decomposes the wave function Ψ into a combination of
molecular orbitals Φ_1, Φ_2, …, Φ_i, chosen to form a normalized, orthogonal set. Thus

\int\!\!\int\!\!\int \Phi_i^{*} \Phi_i \; dx\, dy\, dz = 1; \qquad
\int\!\!\int\!\!\int \Phi_i^{*} \Phi_j \; dx\, dy\, dz = 0, \quad i \neq j. \qquad (5.9)
The simplest way of expressing the wave function of a many electron system is to take
the product of the spin-orbitals of the individual electrons. For the case of two electrons,

\Psi(x_1, x_2) = \chi_1(x_1)\, \chi_2(x_2), \qquad (5.10)
which is called the Hartree product.
This function is not antisymmetric, since interchanging the two electrons does not
result in a sign change, i.e.,

\Psi(x_1, x_2) \neq -\,\Psi(x_2, x_1). \qquad (5.11)
Therefore, the Hartree product does not satisfy the Pauli principle. This problem can be
overcome by taking a linear combination of both Hartree products:
\Psi(x_1, x_2) = \frac{1}{\sqrt{2}}\left\{ \chi_1(x_1)\,\chi_2(x_2) - \chi_1(x_2)\,\chi_2(x_1) \right\}, \qquad (5.12)
where the coefficient is the normalization factor. This wave function is antisymmetric.
Moreover, it also goes to zero if any two wave functions or two electrons are the same.
The expression can be generalized by writing it as a determinant, called the Slater
determinant:
\Psi(x_1, x_2, \ldots, x_N) = \frac{1}{\sqrt{N!}}
\begin{vmatrix}
\chi_1(x_1) & \chi_2(x_1) & \cdots & \chi_N(x_1) \\
\chi_1(x_2) & \chi_2(x_2) & \cdots & \chi_N(x_2) \\
\vdots & \vdots & & \vdots \\
\chi_1(x_N) & \chi_2(x_N) & \cdots & \chi_N(x_N)
\end{vmatrix} \qquad (5.13)
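Equation (5.13) can be evaluated directly for small N. In the sketch below, toy one-dimensional functions stand in for the spin-orbitals χ_i (they are illustrative only, not real orbitals), and the determinant's antisymmetry under electron exchange is checked numerically:

```python
import math
import numpy as np

# Build the Slater determinant of Eq. (5.13) for N = 3 "electrons", using toy
# 1-D functions as stand-ins for the spin-orbitals chi_i.
def orbitals(x):
    """Values of chi_1, chi_2, chi_3 at coordinate x (toy choices)."""
    return np.array([np.exp(-x * x), x * np.exp(-x * x), x * x * np.exp(-x * x)])

def slater(xs):
    """Psi(x_1,...,x_N) = (1/sqrt(N!)) det[chi_i(x_j)]."""
    n = len(xs)
    mat = np.array([orbitals(x) for x in xs])   # row j holds chi_1..chi_N at x_j
    return np.linalg.det(mat) / math.sqrt(math.factorial(n))

a = slater([0.1, 0.5, 0.9])
b = slater([0.5, 0.1, 0.9])        # two electrons exchanged: two rows swapped
print(np.isclose(a, -b))            # True: exchanging electrons flips the sign
print(abs(slater([0.3, 0.3, 0.9])) < 1e-12)  # True: identical rows give zero
```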
A single Slater determinant is used as an approximation to the electronic wave function
in Hartree-Fock theory. In more accurate theories, such as configuration interaction and
multiconfiguration self-consistent field (MCSCF) calculations (Hegarty and Robb, 1979;
Eade and Robb, 1981; Schlegel and Robb, 1982; Bernardi et al., 1984; Frisch et al., 1992),
a linear combination of Slater determinants is used. Each row is formed by representing
all possible assignments of electron i to all orbital-spin combinations. Swapping two
electrons corresponds to interchanging two rows of the determinant, which has the effect
of changing its sign. The original Hartree method expresses the total wave function of
the system as a product of one-electron orbitals. In the Hartree-Fock method (Hehre et
al., 1986), the wave function is an antisymmetrized determinantal product of one-electron
orbitals (the Slater determinant). Schrödinger's equation is transformed into a set of
Hartree-Fock equations, and the problem becomes one of solving for the set of molecular
orbital expansion coefficients c_μi of Equation (5.2).

Hartree-Fock theory takes advantage of the variational principle, which states
that for the ground state of any antisymmetric normalized function of the electronic
coordinates denoted as Ξ, the expectation value corresponding to Ξ will always be
greater than the energy for the exact wave function:
E(Ξ) > E(Ψ). (5.14)
In other words, the energy of the exact wave function serves as a lower bound to the
energies calculated by any other normalized antisymmetric function. Thus the problem
becomes one of finding the set of coefficients that minimize the energy of the resultant
wave function.
The variables in the Hartree-Fock equations depend on themselves, so the equations
must be solved in an iterative manner. Convergence may be improved by changing the
form of the initial guess. Since the equations are solved self-consistently, Hartree-Fock is
an example of a self-consistent field (SCF) method.
Hartree-Fock calculation involves the following steps:
1. Begin with a set of approximate orbitals for all the electrons in the system.
2. One electron is selected, and the potential in which it moves is calculated by
freezing the distribution of all the other electrons and treating their averaged
distribution as a centro-symmetric source of potential.
3. The Schrödinger equation is solved for this potential, which gives a new orbital
for it.
4. The procedure is repeated for all the other electrons in the system, using the
electrons in the frozen orbitals as the source of the potential.
5. At the end of one cycle, a new set of orbitals has been obtained from the original set.
6. The process is repeated until there is little or no change in the orbitals.
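The essence of steps 1-6 is a fixed-point iteration: solve with a frozen field, compare with the previous solution, and repeat until nothing changes. The toy sketch below illustrates this self-consistency loop on a scalar fixed-point problem (x = cos x), not on an actual Fock operator:

```python
import math

# Generic self-consistency loop: apply one "cycle" repeatedly until the
# solution stops changing (steps 1-6 above, stripped to their skeleton).
def scf_iterate(cycle, guess, tol=1e-10, max_cycles=200):
    old = guess
    for i in range(max_cycles):
        new = cycle(old)             # solve using the field generated by `old`
        if abs(new - old) < tol:     # little or no change: self-consistent
            return new, i + 1
        old = new
    raise RuntimeError("SCF did not converge")

# Toy "cycle": x = cos(x) has the self-consistent solution x ~ 0.739085
x, cycles = scf_iterate(math.cos, 1.0)
print(round(x, 6))  # 0.739085
```

Real SCF programs iterate on the density or Fock matrix rather than a scalar, but the convergence test against the previous cycle is the same idea.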
The unrestricted Hartree-Fock method is capable of treating the unpaired electrons of
open-shell systems. In this case the α and β electrons are in different orbitals, resulting
in two sets of molecular orbital expansion coefficients:
\phi_i^{\alpha} = \sum_{\mu} c_{\mu i}^{\alpha}\, \chi_\mu, \qquad
\phi_i^{\beta} = \sum_{\mu} c_{\mu i}^{\beta}\, \chi_\mu.
The two sets of coefficients result in two sets of Fock matrices, producing two sets of
orbitals. These separate orbitals give proper dissociation to separate atoms and correct
delocalized orbitals for resonant systems. However, the eigenfunctions are not pure spin
states, but contain some amount of spin contamination from higher states. For example,
doublets are contaminated to some degree by functions corresponding to quartets and
higher states.
5.4 Electron correlation methods
Hartree-Fock theory provides an inadequate treatment of the correlation
between the motions of electrons within a molecular system, especially that arising
between electrons of opposite spin. When Hartree-Fock theory fulfills the requirement
that |Ψ|2 be invariant with respect to the exchange of any two electrons by
antisymmetrizing the wave function, it automatically includes the major correlation
effects arising from pairs of electrons with the same spin. This correlation is termed
exchange correlation. The motion of electrons of opposite spin remains uncorrelated
under Hartree-Fock theory. Any method that goes beyond SCF in attempting to treat
this phenomenon is known as an electron correlation method or post-SCF method.
5.4.1 Møller-Plesset Perturbation theory (MP)
Møller-Plesset (MP) theory (Møller and Plesset, 1934) treats the electron
correlation as a perturbation on the Hartree-Fock problem. It adds higher order terms to
the Hartree-Fock theory as a non-iterative correction. The wave function and energy are
expanded in a power series of the perturbation. The Hamiltonian is expressed in two
parts:
H = H0 + H’ (5.15)
where H0 is called the zeroth order Hamiltonian, and H’ is the perturbation. H0 is chosen
such that the Schrödinger equation H0Ψ0 = E0 Ψ0 can be solved exactly for the zeroth
order eigenfunctions Ψ0 and the corresponding zeroth order energy eigenvalues E0. It is
assumed that perturbation is sufficiently small that its presence does not appreciably
alter the eigenfunctions of the system.
The Hartree-Fock energy is correct to first order; hence, the perturbation energies
start contributing from second order. In an MP calculation, an HF calculation is
performed first, and then the second-, third-, fourth-, and fifth-order perturbation
corrections are computed. The final ground state energy is given by
E_{\text{ground state}} = E_{\text{HF}} + E^{(2)} + E^{(3)} + E^{(4)} + E^{(5)} + \cdots, \qquad (5.16)

where E_{\text{HF}} is the Hartree-Fock energy and the E^{(i)} are the successive Møller-Plesset
perturbation corrections.
In perturbation theory, the different correlation contributions emerge through
their interaction with the starting HF wave function Ψ0. Since the Hamiltonian contains
only one- and two-electron terms, only single and double excitations can contribute via
direct coupling to Ψ0 in the lowest orders. However, the self-consistent optimization of
the HF wave function prevents direct mixing between single excitations and Ψ0. Thus,
the second and third-order energies have contributions only from double excitations. In
higher orders, there is indirect coupling via double excitations, and thus the fourth and
fifth-order energies have contributions from single, double, triple, and quadruple
excitations.
5.4.2 Density Functional Theory (DFT)
Shortly after the formulation of quantum mechanics in the mid 1920's, Thomas
(1926) and Fermi (1928) introduced the idea of expressing the total energy of a system
as a functional of the total electron density. Because of the crude treatment of the
kinetic energy term, i. e. the absence of molecular orbitals, the accuracy of these early
attempts was far from satisfactory. It was not until the 1960's that an exact theoretical
framework called density functional theory (DFT) was formulated by Hohenberg and
Kohn (1964) and Kohn and Sham (1965) that provided the foundation for accurate
calculations.
In contrast to the Hartree-Fock picture, which deals with a description of
individual electrons interacting with the nuclei and all other electrons in the system,
density functional theory decomposes the total energy into three contributions: a
kinetic energy, a Coulomb energy due to classical electrostatic interactions among all
charged particles in the system, and a term called the exchange-correlation energy that
captures all many-body interactions.
In density functional theory (Hohenberg and Kohn, 1964; Kohn and Sham, 1965),
correlation energy along with exchange energy is treated as a functional of the three
dimensional electron density. DFT partitions the electronic energy as:

E = E^{T} + E^{V} + E^{J} + E^{XC}, \qquad (5.17)

where E^{T} is the kinetic energy term, E^{V} includes terms describing the potential energy of
the nuclear-electron attraction and the repulsion between pairs of nuclei, E^{J} is the
electron-electron repulsion term, and E^{XC} is the exchange-correlation term.
All terms, except the nuclear-nuclear repulsion, in Equation (5.17) are functionals of
the electron density ρ. The term E^{XC} accounts for the exchange energy arising from
the antisymmetry of the quantum mechanical wavefunction and for the dynamic
correlation in the motions of the individual electrons. Hohenberg and Kohn (1964)
demonstrated that E^{XC} is determined entirely by the electron density, i.e., it is a
functional of the electron density. The term E^{XC} is usually approximated as an
integral involving only the spin densities and possibly their gradients, as given by:
E^{XC}(\rho) = \int f\!\left(\rho_\alpha(\mathbf{r}), \rho_\beta(\mathbf{r}), \nabla\rho_\alpha(\mathbf{r}), \nabla\rho_\beta(\mathbf{r})\right) d^3\mathbf{r}, \qquad (5.18)
where ρ_α and ρ_β refer to the α and β spin densities, and the total electron density ρ is
(ρ_α + ρ_β). E^{XC} is separated into exchange and correlation parts, corresponding to
same-spin and mixed-spin interactions, as given by:
E^{XC}(\rho) = E^{X}(\rho) + E^{C}(\rho), \qquad (5.19)

where the terms E^{X}(ρ) and E^{C}(ρ) are again functionals of the electron density, termed
the exchange functional and the correlation functional, respectively. Local functionals
depend on the electron density alone, while gradient-corrected functionals depend on
both the electron density and its gradient.
CHAPTER 6

Neural Networks
As the term neural network (NN) implies, this approach is aimed towards
modeling real networks of neurons in the brain. It was inspired by a number of features
of the brain that would be desirable in artificial systems, such as its robustness and fault
tolerance, its flexibility and the ability to deal with fuzzy and noisy information and its
highly parallel structure.
6.1 Biological neurons
The brain is composed of about 10^11 highly connected elements called neurons,
each of which is connected to about 10^4 other neurons. Fig 6.1 shows the schematic of a
biological neuron.
Fig 6.1 Schematic of biological neuron
A neuron's dendritic tree is connected to thousands of neighboring neurons. Each
neuron receives electrochemical inputs from other neurons at the dendrites. When one
of those neurons fires, a positive or negative charge is received by one of the dendrites.
The strengths of all the received charges are added together through the processes of
spatial and temporal summation: spatial summation combines several weak signals
into a single large one, while temporal summation converts a rapid series
of weak pulses from one source into one large signal. The aggregate input is then passed
to the soma (cell body). The soma and the enclosed nucleus do not play a significant role
in the processing of incoming and outgoing data. Their primary function is to perform
the continuous maintenance required to keep the neuron functional. If the sum of these
electrical inputs is sufficiently powerful to activate the neuron, it transmits an
electrochemical signal along the axon, and passes this signal to the other neurons whose
dendrites are attached at any of the axon terminals. These attached neurons may then
fire. The part of the soma that does concern itself with the signal is the axon hillock. If
the aggregate input is greater than the axon hillock's threshold value, then the neuron
fires, and an output signal is transmitted down the axon. The strength of the output is
constant, regardless of whether the input was just above the threshold, or a hundred
times as great. The output strength is unaffected by the many divisions in the axon; it
reaches each terminal button with the same intensity it had at the axon hillock. This
uniformity is critical in an analogue device such as a brain where small errors can be
critical, and where error correction is more difficult than in a digital system. It is
important to note that a neuron fires only if the total signal received at the cell body
exceeds a certain level. The neuron either fires or it does not; there are no different
grades of firing.
Each terminal button is connected to other neurons across a small gap called a synapse,
as shown in Fig 6.2. The physical and neuro-chemical characteristics of each synapse
determine the strength and polarity of the new input signal. This is where the brain is
most flexible.
Fig 6.2 Synaptic gap
Changing the constitution of various neuro-transmitter chemicals can increase or
decrease the amount of stimulation that the firing axon imparts on the neighboring
dendrite. Altering the neurotransmitters can also change whether the stimulation is
excitatory or inhibitory. Many drugs, such as alcohol have dramatic effects on the
production or destruction of these critical chemicals. The infamous nerve gas, sarin, can
kill because it neutralizes a chemical (acetylcholinesterase) that is normally responsible
for the destruction of a neurotransmitter (acetylcholine). This means that once a neuron
fires, it keeps triggering the neurons in its vicinity, and one no longer has control
over the muscles.
So, from a very large number of extremely simple processing units (each
performing a weighted sum of its inputs, and then firing a binary signal, if the total input
exceeds a certain level) the brain manages to perform extremely complex tasks. This is
the model on which artificial neural networks (NN) are based. Thus far, artificial neural
networks have not even come close to modeling the complexity of the brain. It is worth
noting that even though biological neurons are slow compared to electrical circuits, the
brain is able to perform many tasks much faster than any conventional computer
because of the massively parallel structure of biological neural networks.
While in actual neurons the dendrite receives electrical signals from the axons of
other neurons, in an artificial neuron (perceptron) these electrical signals are
represented as numerical values. At the synapses between the dendrite and axons,
electrical signals are modulated in various amounts. This is also modeled in an artificial
neuron by multiplying each input value by a value called the weight. An actual neuron
fires an output signal only when the total strength of the input signals exceeds a certain
threshold. We model this phenomenon in an ANN by calculating the weighted sum of the
inputs to represent the total strength of the input signals, and applying a step function
to the sum to determine its output. Fig 6.3 shows a simple model of a neuron.
Fig 6.3 Model of a neuron
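This description (weighted inputs, a threshold, and an all-or-nothing output) can be written down in a few lines. The weights and bias below are arbitrary illustrative values chosen to realize an AND-like unit:

```python
# A single artificial neuron: weighted sum of inputs plus a bias, passed
# through a hard threshold, mirroring the fire/don't-fire behavior described
# above. The particular weights and bias are illustrative only.
def neuron(inputs, weights, bias):
    n = sum(w * p for w, p in zip(weights, inputs)) + bias  # net input
    return 1 if n > 0 else 0                                # all-or-nothing output

# With these values the unit fires only when both inputs are active (AND)
print(neuron([1, 1], [1.0, 1.0], -1.5))  # 1
print(neuron([1, 0], [1.0, 1.0], -1.5))  # 0
```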
6.2 History of multilayer neural networks
The first step toward artificial neural networks came in 1943 when Warren
McCulloch and Walter Pitts (1943) wrote a paper “A logical calculus of the ideas
immanent in nervous activity” describing how neurons might work that united the
studies of neurophysiology and mathematical logic. They modeled a simple neural
network with electrical circuits and showed that the network, in principle, could compute
any arithmetic function. Training of neural networks was introduced when Hebb (1949)
published the book "The Organization of Behavior," which pointed out that
neural pathways are strengthened each time they are used, a concept fundamentally
essential to the ways in which humans learn. If two nerves fire at the same time, the
connection between them is enhanced. Frank Rosenblatt (1958) invented the
perceptron and associated learning rule. It was built in hardware. A single-layer
perceptron was found to be useful in classifying a continuous-valued set of inputs into
one of two classes. This was the first practical application of artificial neural networks. In
1959, Bernard Widrow and Marcian Hoff developed models called "ADALINE" (ADAptive
LInear Neuron) and "MADALINE” (Multiple ADAptive LInear Neuron). ADALINE was
developed to recognize binary patterns so that if it was reading streaming bits from a
phone line, it could predict the next bit. MADALINE was the first neural network applied
to a real world problem, using an adaptive filter to eliminate echo on phone lines. In
1962, Widrow and Hoff developed a learning rule to train ADALINE networks. Minsky
and Papert (1969) showed that perceptron and ADALINE networks were limited to
classification problems in which the two classes are linearly separable. These networks
are incapable of handling nonlinear classification problems, such as the exclusive-or (XOR)
problem. Rosenblatt and Widrow were aware of these limitations and proposed
multilayer networks. The first description of an algorithm to train multilayer networks
was given by Werbos (1974), who described it for general networks, with neural networks
as a special case. It was rediscovered by Rumelhart, Hinton, and Williams (1986), Parker
(1985), and Le Cun (1985). Backpropagation was developed by generalizing the Widrow-Hoff
learning rule to multilayer neural networks with nonlinear differentiable transfer
functions.
6.3 Neural networks versus conventional computers
Neural networks take a different approach to problem solving than that of
conventional computers. Conventional computers use an algorithmic approach, i.e., the
computer follows a set of instructions in order to solve a problem. Unless the specific
steps that the computer needs to follow are known, the computer cannot solve the
problem. This restricts the problem-solving capability of conventional computers to
problems that we already understand and know how to solve, particularly repetitive
tasks. But computers would be so much more useful if they could do things that we do
not exactly know how to do. Neural networks process information in a way similar to
the human brain. The network is composed of a large number of highly interconnected
processing elements (neurons) working in parallel to solve a specific problem. Neural
networks learn by example. They are not programmed to perform a specific task. The
examples must be selected carefully otherwise the network might function incorrectly.
The disadvantage is that because the network finds out how to solve the problem by
itself, its operation can be unpredictable.
On the other hand, conventional computers use a predictive approach to
problem solving. The way the problem is to be solved must be known and stated in
unambiguous instructions. These instructions are then converted to a high level
language program and then into machine code that the computer can understand. Their
operation is predictable.
Neural networks and conventional algorithmic computers are not in competition but
complement each other. There are tasks that are more suited to an algorithmic
approach, such as arithmetic operations, and tasks that are more suited to neural
networks. Moreover, a large number of tasks require systems that use a combination
of the two approaches (normally a conventional computer is used to supervise the
neural network) in order to perform at maximum efficiency. Neural networks have
been shown to be good at problems that are easy for a human but difficult for a
traditional computer, such as image and voice recognition, and at predictions based on
past knowledge.
6.4 Model of a neuron
Fig 6.4 shows a single-input neuron, which forms the building block of the
Artificial Neural Network (ANN). The scalar input p is multiplied by the scalar weight w;
the other input, 1, is multiplied by the bias b. The summer adds wp and b to form the net
input n, which goes into a transfer function f to produce the scalar output a.
The weight w and bias b are adjustable scalar parameters of the neuron, which are
adjusted by the learning rule so as to minimize the error (i.e., the difference between the
neuron outputs and the target values).
Fig 6.4 Single input neuron (Hagan et al. 1996)
6.5 Multilayer network
Fig 6.5 shows a neural network with multiple layers with each layer having
multiple neurons. Each layer has its own weight matrix W and bias vector b. The
superscript to each parameter indicates the layer it is associated with. The network has
R inputs. The outputs from one layer are inputs to the next layer.
Fig 6.5 Multilayer network (Hagan et al. 1996)
A layer whose output is the output of the entire neural network is called the output
layer, whereas the other layers are called hidden layers. The number of neurons in the
input layer is the same as the number of input variables and the number of neurons in
the output layer is same as the number of outputs. The number of neurons to be used
for the hidden layer is decided by the designer depending upon the complexity of the
function involved.
6.6 Transfer functions
A transfer function is a linear or non-linear function of the net input to the
neuron. Nonlinear transfer functions give neural networks their nonlinear capabilities.
The function must be differentiable for the optimization of the parameters, and it is
desirable that it saturate at both extremes. The most common forms of transfer functions
are the monotonically increasing sigmoidal and hyperbolic tangent functions.
Hyperbolic tangent function:

f(n) = \frac{e^{n} - e^{-n}}{e^{n} + e^{-n}}. \qquad (6.1)

Log sigmoid function:

f(n) = \frac{1}{1 + e^{-n}}. \qquad (6.2)
Fig 6.6 a) Hyperbolic tangent function, ranging over [-1, +1] as n → ±∞
b) Log sigmoid function, ranging over [0, +1] as n → ±∞
These functions are also called squashing functions because of their
asymptotic behavior at ±∞. Convergence is faster with functions that are symmetric
about the origin, like the hyperbolic tangent (LeCun, 1998).
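Both transfer functions can be written directly from Eqs. (6.1) and (6.2):

```python
import math

# The hyperbolic tangent (Eq. 6.1) and log sigmoid (Eq. 6.2) transfer
# functions, applied to the net input n.
def tansig(n):
    return (math.exp(n) - math.exp(-n)) / (math.exp(n) + math.exp(-n))

def logsig(n):
    return 1.0 / (1.0 + math.exp(-n))

print(round(tansig(0.0), 3), round(logsig(0.0), 3))  # 0.0 0.5
```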
6.7 Neural network training
Neural network training is a procedure for modifying weight matrices and bias
vectors of the network to meet certain goals, such as minimizing the difference between
the network output and target values. It is equivalent to performing the non-linear
optimization of the network parameters. There are two approaches for neural network
training: supervised learning and unsupervised learning.
In supervised training, both the inputs and the outputs are provided. The
network then processes the inputs and compares its resulting outputs against the
desired outputs. Errors are then propagated back through the system, causing the
system to adjust the weights which control the network. This process occurs over and
over and the weights are continually adjusted. During the training of a network the
same set of data is processed many times as the weights and biases are adjusted. It is
very important to have enough data points for the training so as to cover the
configuration space adequately. The number of data points required depends on the
number of input variables: as the number of input variables increases, the dimensionality
of the problem grows, and more points are required in order to cover the entire
configuration space. The number of data points required also depends on the
complexity of the problem. It is equally important to choose the input variables
carefully; sometimes it may not be obvious which variables to choose so that the
input data contain complete information from which the target output can be derived.
Supervised learning has applications in function approximation and pattern classification
problems where the number and type of patterns is known beforehand.
In unsupervised training, the network is provided with the inputs but not with
the desired outputs. The system itself decides what features to use to group the input
data. This is referred to as self-organization. These networks look for regularities or
trends in the input to adapt the network parameters. Unsupervised learning has
applications in clustering operations; such networks learn to categorize the input
patterns into a finite number of classes.
6.7.1 Supervised training and parameter optimization
Supervised training is done by comparing the output with known target
values. The optimization of the network parameters is performed by some iterative
optimization scheme until a desirable value of the scalar cost function is reached.
Normally the cost function is taken as the sum of the squared differences between the
network outputs and the target values.
E(e^2) = E[(t - a)^2], \qquad (6.3)
where t is the target vector, a is the vector of neural network output and e is the
difference between the target and the neural network output. E(e2) is the expectation of
the sum squared error. The training process can be classified into two types:
incremental training and batch training. In incremental training the weights and biases
of the network are updated each time an input is presented to the network, while in
batch training, the weights and biases are updated after all inputs are presented.
Incremental training is comparatively faster, since in batch training the redundancy, or
presence of very similar patterns, leads to slower convergence. In incremental training,
the noise present in the updates, which come from computing the gradient on each
successive sample, can allow the weights to find a different minimum rather than getting
stuck in a local minimum; in batch training, by contrast, the random initialization of the
weights is crucial. However, such noise can also prevent full convergence to the
minimum. For batch training the
conditions of convergence are well understood. Many optimization schemes, such as
conjugate gradient method are applicable only for batch training. Also, the analysis of
convergence rates is simpler.
In order to minimize the cost function, the training algorithm cycles through the
following steps:
1. The network is presented with one (for incremental training) or all (for batch
training) training samples consisting of both inputs and targets.
2. The network output is measured and the squared error between the target and
the output is calculated.
3. The weights and biases are adjusted so as to minimize the cost function i.e. the
squared error.
4. The procedure is repeated until the squared error reaches the desired limit.
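As a concrete illustration of steps 1-4, the batch loop below fits a single linear neuron a = wp + b to toy data by steepest descent; the data set, learning rate, and stopping criterion are all arbitrary illustrative choices:

```python
# Batch training of a single linear neuron a = w*p + b on toy data
# (targets generated by t = 2p + 1), following steps 1-4 above.
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]  # (input p, target t)
w, b, alpha = 0.0, 0.0, 0.1
for epoch in range(2000):
    gw = gb = sse = 0.0
    for p, t in samples:                  # step 1: present all training samples
        e = t - (w * p + b)               # step 2: error between target and output
        sse += e * e
        gw += -2.0 * e * p                # gradient of squared error w.r.t. w
        gb += -2.0 * e                    # gradient of squared error w.r.t. b
    w -= alpha * gw / len(samples)        # step 3: adjust parameters downhill
    b -= alpha * gb / len(samples)
    if sse < 1e-12:                       # step 4: stop once the error is tiny
        break
print(round(w, 3), round(b, 3))  # 2.0 1.0
```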
The optimization of the parameters is usually done using gradient-based methods, such
as steepest descent, conjugate gradient, or quasi-Newton algorithms.
6.8 Least Mean Squared Error (LMS) algorithm
The LMS or Widrow-Hoff algorithm is based on an approximate steepest descent
algorithm. Widrow and Hoff noticed that an estimate of the mean squared error could
be obtained by using the squared error at each iteration:
\nabla F(\mathbf{x}) = \nabla e^2(k), \qquad (6.4)

where k is the iteration number.
The partial derivatives of the squared error with respect to the weights and bias at the
kth iteration are given by:

\frac{\partial e^2(k)}{\partial w_{1,j}} = 2\, e(k)\, \frac{\partial e(k)}{\partial w_{1,j}}, \quad j = 1, 2, \ldots, R, \qquad \text{and} \qquad
\frac{\partial e^2(k)}{\partial b} = 2\, e(k)\, \frac{\partial e(k)}{\partial b}, \qquad (6.5)

where R is the number of input variables.
Next,

\frac{\partial e(k)}{\partial w_{1,j}} = \frac{\partial\,[t(k) - a(k)]}{\partial w_{1,j}} = \frac{\partial}{\partial w_{1,j}} \left[ t(k) - \left( \sum_{i=1}^{R} w_{1,i}\, p_i(k) + b \right) \right] = -\,p_j(k), \qquad (6.6)

where p_i(k) is the ith element of the input vector at the kth iteration.
Similarly,
\frac{\partial e(k)}{\partial b} = -1. \qquad (6.7)
Therefore,

\nabla F(\mathbf{x}) = \nabla e^2(k) = -2\, e(k) \begin{bmatrix} \mathbf{p}(k) \\ 1 \end{bmatrix},

where p is the input vector and 1 is the constant input for the bias. That is, the
approximate gradient is obtained by multiplying the input by the error. The steepest
descent method then proceeds as:

\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha \left. \nabla F(\mathbf{x}) \right|_{\mathbf{x} = \mathbf{x}_k}. \qquad (6.8)
The LMS algorithm can then be written in matrix notation as:

W(k+1) = W(k) + 2\alpha\, e(k)\, \mathbf{p}^{T}(k), \qquad \text{and} \qquad
\mathbf{b}(k+1) = \mathbf{b}(k) + 2\alpha\, e(k). \qquad (6.9)
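The updates of Eq. (6.9) can be applied sample by sample. In the sketch below, a two-input linear neuron is trained on a toy, exactly solvable data set (whose solution is w = [1, -1], b = 0); the learning rate is an arbitrary illustrative choice:

```python
# LMS (Widrow-Hoff) updates of Eq. (6.9) for a two-input linear neuron,
# applied at each iteration k as one sample is presented.
w, b, alpha = [0.0, 0.0], 0.0, 0.05
samples = [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 0.0)]
for k in range(600):
    p, t = samples[k % len(samples)]           # input p(k) and target t(k)
    a = w[0] * p[0] + w[1] * p[1] + b          # network output a(k)
    e = t - a                                   # error e(k)
    w = [wi + 2 * alpha * e * pi for wi, pi in zip(w, p)]  # W(k+1) = W(k) + 2*alpha*e(k)*p^T(k)
    b = b + 2 * alpha * e                       # b(k+1) = b(k) + 2*alpha*e(k)
print([round(wi, 3) for wi in w], round(b, 3))  # w approaches [1, -1], b approaches 0
```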
6.9 Backpropagation algorithm
Neural networks employing only one layer of neurons can solve only linearly
separable classification problems. The first description of an algorithm to train multilayer
networks was given by Werbos (1974), who described it for general networks, with neural
networks as a special case. It was rediscovered by Rumelhart, Hinton, and Williams (1986),
Parker (1985), and Le Cun (1985). Backpropagation was developed by generalizing the Widrow-Hoff learning
rule to multilayer neural networks with nonlinear differentiable transfer functions. It is
also a gradient descent algorithm where the parameters are adjusted along the negative
of the gradient of the cost function. The term backpropagation refers to the manner in
which the gradient is computed for nonlinear multilayer networks. With single layer
network, the weights are adjusted based on the error values as in LMS (Widrow-Hoff)
learning. However, with multilayer networks, it is less obvious how to calculate the error
for the hidden layers as the error from the output layer is not an explicit function of
weights in the hidden layers.
The chain rule is applied to compute the gradient for the steepest descent algorithm:
\frac{\partial F}{\partial w_{i,j}^{m}} = \frac{\partial F}{\partial n_i^{m}} \cdot \frac{\partial n_i^{m}}{\partial w_{i,j}^{m}}, \qquad \text{and} \qquad
\frac{\partial F}{\partial b_i^{m}} = \frac{\partial F}{\partial n_i^{m}} \cdot \frac{\partial n_i^{m}}{\partial b_i^{m}}, \qquad (6.10)
where n_i^m is the net input to the ith neuron in layer m, and w_{i,j}^m and b_i^m are the
weights and biases of layer m. The net input n_i^m is computed as:
n_i^{m} = \sum_{j=1}^{S^{m-1}} w_{i,j}^{m}\, a_j^{m-1} + b_i^{m}, \qquad
\frac{\partial n_i^{m}}{\partial w_{i,j}^{m}} = a_j^{m-1}, \qquad \text{and} \qquad
\frac{\partial n_i^{m}}{\partial b_i^{m}} = 1. \qquad (6.11)
Now a term called the sensitivity is defined as the change in the cost function F with
respect to a change in the ith element of the net input at layer m:
s_i^{m} = \frac{\partial F}{\partial n_i^{m}}. \qquad (6.12)
Therefore,

\frac{\partial F}{\partial w_{i,j}^{m}} = s_i^{m}\, a_j^{m-1}, \qquad \text{and} \qquad
\frac{\partial F}{\partial b_i^{m}} = s_i^{m}. \qquad (6.13)
Now the steepest descent algorithm can be expressed as:
W^{m}(k+1) = W^{m}(k) - \alpha\, \mathbf{s}^{m} \left( \mathbf{a}^{m-1} \right)^{T}, \qquad \text{and} \qquad
\mathbf{b}^{m}(k+1) = \mathbf{b}^{m}(k) - \alpha\, \mathbf{s}^{m}, \qquad (6.14)
where W^m and b^m are the weight matrix and bias vector at layer m, and the sensitivity
vector is

\mathbf{s}^{m} = \frac{\partial F}{\partial \mathbf{n}^{m}} =
\begin{bmatrix}
\partial F / \partial n_1^{m} \\
\partial F / \partial n_2^{m} \\
\vdots \\
\partial F / \partial n_{S^{m}}^{m}
\end{bmatrix}. \qquad (6.15)
The next step is to compute the sensitivities. This involves a recurrence relationship
in which the sensitivity at layer m is computed from the sensitivity at layer m+1, which
gives the algorithm the name backpropagation. The Jacobian matrix is defined as:
\frac{\partial \mathbf{n}^{m+1}}{\partial \mathbf{n}^{m}} =
\begin{bmatrix}
\dfrac{\partial n_1^{m+1}}{\partial n_1^{m}} & \dfrac{\partial n_1^{m+1}}{\partial n_2^{m}} & \cdots & \dfrac{\partial n_1^{m+1}}{\partial n_{S^m}^{m}} \\
\dfrac{\partial n_2^{m+1}}{\partial n_1^{m}} & \dfrac{\partial n_2^{m+1}}{\partial n_2^{m}} & \cdots & \dfrac{\partial n_2^{m+1}}{\partial n_{S^m}^{m}} \\
\vdots & \vdots & & \vdots \\
\dfrac{\partial n_{S^{m+1}}^{m+1}}{\partial n_1^{m}} & \dfrac{\partial n_{S^{m+1}}^{m+1}}{\partial n_2^{m}} & \cdots & \dfrac{\partial n_{S^{m+1}}^{m+1}}{\partial n_{S^m}^{m}}
\end{bmatrix}. \qquad (6.16)
The element \partial n_i^{m+1} / \partial n_j^{m} is computed as:

\frac{\partial n_i^{m+1}}{\partial n_j^{m}} = \frac{\partial}{\partial n_j^{m}} \left( \sum_{l=1}^{S^{m}} w_{i,l}^{m+1}\, a_l^{m} + b_i^{m+1} \right) = w_{i,j}^{m+1}\, \frac{\partial a_j^{m}}{\partial n_j^{m}} = w_{i,j}^{m+1}\, \dot{f}^{m}(n_j^{m}), \qquad (6.17)

where \dot{f}^{m}(n_j^{m}) = \partial f^{m}(n_j^{m}) / \partial n_j^{m}.
Hence, the Jacobian matrix can be written as:

\[
\frac{\partial n^{m+1}}{\partial n^m} = W^{m+1}\,\dot{F}^m(n^m),
\qquad\text{where}\qquad
\dot{F}^m(n^m) =
\begin{bmatrix}
\dot{f}^m(n^m_1) & 0 & \cdots & 0 \\
0 & \dot{f}^m(n^m_2) & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
0 & 0 & \cdots & \dot{f}^m(n^m_{S^m})
\end{bmatrix}.
\]  (6.18)
And the recurrence relation can be written as:

\[
s^m = \dot{F}^m(n^m)\,\left(W^{m+1}\right)^T s^{m+1}.
\]  (6.19)
Thus, the sensitivities are propagated backward through the network from last to first
layer, which gives it the name backpropagation.
For the output layer the sensitivities are computed as:

\[
s^M = -2\,\dot{F}^M(n^M)\,(t - a),
\]  (6.20)

where t is the target value and a is the neural network output.
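The full recurrence of Eqs. (6.19) and (6.20) can be sketched for a small two-layer network with a tan-sigmoid hidden layer and a linear output layer. All sizes and values below are illustrative assumptions, not taken from the text:

```python
import numpy as np

# Illustrative two-layer network: 2 inputs -> 4 tanh neurons -> 1 linear output.
rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 2)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))

p = np.array([[0.3], [0.7]])   # input vector
t = np.array([[1.0]])          # target

# Forward pass
n1 = W1 @ p + b1
a1 = np.tanh(n1)
n2 = W2 @ a1 + b2
a2 = n2                        # linear output layer

# Output-layer sensitivity, Eq (6.20): s^M = -2 F'(n^M)(t - a); F' = 1 for linear
s2 = -2.0 * (t - a2)
# Recurrence, Eq (6.19): s^1 = F'(n^1) (W^2)^T s^2, with F'(n) = diag(1 - a^2) for tanh
s1 = (1.0 - a1**2) * (W2.T @ s2)
```

The sensitivity s1 is exactly the gradient of the squared error with respect to the hidden-layer net input n1, which can be checked numerically.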
6.10 Numerical optimization techniques
The mean squared error of a single-layer network with a linear transfer function is
a quadratic function with only one minimum and constant curvature. In such cases,
the LMS algorithm guarantees convergence, so long as the learning rate is not too
large. For a multilayer network, however, the mean squared error may be a non-quadratic
function whose curvature varies drastically over the parameter space, which makes it
difficult to choose the learning rate. To overcome some of these difficulties, heuristic
techniques such as a variable learning rate, momentum, and rescaling of the variables
are employed. Another approach employs standard numerical optimization techniques,
such as conjugate gradient and quasi-Newton methods.
6.10.1 Conjugate gradient methods
The backpropagation algorithm adjusts the weights along the steepest descent
direction. Although this is the direction in which the function decreases most rapidly,
it does not necessarily produce the fastest convergence. Instead of using orthogonal search
directions as in the steepest descent method, the conjugate gradient method adjusts the
search directions so that each step does not undo the minimization achieved along the
previous ones. The conjugate gradient method uses information on the first derivative of
the error function only, and it has the quadratic convergence property, i.e. it converges
to the minimum of a quadratic function in a finite number of steps. It is an O(N) method.
6.10.2 Newton and quasi-Newton methods
Newton method
The cost function, i.e. the mean squared error, can be expanded in a Taylor
series as:

\[
F(w + \Delta w) = F(w) + g^T \Delta w + \tfrac{1}{2}\,\Delta w^T H\, \Delta w + \ldots,
\]  (6.21)
where g is the gradient vector and H is the Hessian matrix, defined as:

\[
g = \frac{\partial F(w)}{\partial w}
\qquad\text{and}\qquad
H = \frac{\partial^2 F(w)}{\partial w^2}.
\]  (6.22)
A quadratic approximation of the error surface is obtained by ignoring the higher-order terms
in the Taylor series expansion. The optimum value of \(\Delta w\), i.e. the adjustment to the
weights, is:

\[
\Delta w = -H^{-1}\, g.
\]  (6.23)
Newton's method requires computing the Hessian matrix, which involves second-order
derivatives, and then finding its inverse, which takes O(N³) operations. Moreover, this does not
guarantee convergence: if the error surface is not quadratic, the Hessian matrix
may not always be positive definite and the algorithm can diverge. In quasi-Newton
methods, the Hessian matrix is estimated by a positive definite matrix to ensure
convergence. The Hessian matrix is further approximated in terms of the Jacobian matrix:

\[
H \cong 2\, J^T J.
\]  (6.24)
6.10.3 Levenberg-Marquardt algorithm
The approximate Hessian matrix in the Gauss-Newton method may not always be invertible.
The Levenberg-Marquardt algorithm therefore modifies the approximation to the Hessian matrix,
giving the weight update:

\[
\Delta w = -\left[J^T J + \mu I\right]^{-1} J^T e,
\]  (6.25)

where J is the Jacobian matrix consisting of the first derivatives of the network errors with
respect to the weights and biases, and e is the vector of network errors. When μ is zero, the
method reduces to the Gauss-Newton method; when μ is large, it approaches the gradient descent
algorithm with a small step size. The Jacobian matrix can be computed through a standard
backpropagation technique that is less complex than computing the Hessian matrix (Hagan and
Menhaj, 1994).
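A single Levenberg-Marquardt step of Eq. (6.25) can be sketched as follows; the Jacobian J, error vector e, and damping parameter μ below are arbitrary illustrative values standing in for quantities produced by backpropagation:

```python
import numpy as np

# Illustrative Jacobian of errors w.r.t. two parameters, and an error vector.
J = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
e = np.array([0.5, -0.2, 0.1])
mu = 0.01                                    # damping parameter

# Eq (6.25): delta_w = -[J^T J + mu I]^{-1} J^T e
A = J.T @ J + mu * np.eye(J.shape[1])
delta_w = -np.linalg.solve(A, J.T @ e)       # solve instead of explicit inverse
```

Using `np.linalg.solve` rather than forming the inverse explicitly is the usual numerically preferable choice.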
6.11 Bias/variance dilemma
One of the most serious problems that arises in connectionist learning by neural
networks is overfitting of the provided training examples. The trained function fits the
training data very closely but does not generalize well, i.e. it cannot model unseen data
sufficiently well. The problem is to choose a function that both fits the training data and
generalizes well over a specified range. If the function is parameterized so as to fit the
training data perfectly using a high-order polynomial (which is equivalent to having a
larger number of neurons in the hidden layers), then the fits would be drastically different
for different sets of data randomly drawn from the same underlying function. Greater
parameter flexibility has the benefit of being able to fit anything, but at the cost of
sensitivity to the variability in the data set. There is a variation between the average fit
over all data sets and the fit for a single data set, i.e. a variance introduced by the fits
over multiple training sets. This can be addressed to some extent by providing more data
points for the fitting; however, a drawback of the flexibility afforded by extra degrees of
freedom is that the amount of data required grows exponentially with the order of the fit.
On the other hand, if very few parameters are available, the fit would be too restrictive:
although it would not be as sensitive to the variability in the training data, there would be
a fixed error, or bias, no matter how much data is collected. This problem is referred to as
the bias/variance dilemma (Geman et al., 1992). It becomes severe when there are
relatively few data points, especially in a high-dimensional space.
Statistical bias is the complexity restriction that the neural network architecture
imposes on the degree to which the target function can be fitted accurately. The statistical
bias accounts only for the degree of fitting the given training data, not for the level of
generalization. Statistical variance is the deviation of the neural network training
efficacy from one data sample to another that could be described by the same target
function model. This statistical variance accounts for generalization, i.e. whether or
not the neural network fits the examples without regard to the specificities of the
provided data. One approach to avoid overfitting is to deliberately add random noise
with a mean value of zero to the model, which makes it difficult for the network to
fit any specific data point too closely. Another way of introducing noise is to use
incremental training, i.e. updating the weights after every data point is presented, and to
randomly reorder the data points at the end of each training cycle. In this manner, each
weight update is based on a noisy estimate of the true gradient.
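The incremental-training scheme described above can be sketched as a short loop. The `update_weights` callback is a hypothetical stand-in for one backpropagation step; only the per-pattern updating and the end-of-cycle reshuffling are the point here:

```python
import random

# Incremental (per-pattern) training with reshuffling at the end of each cycle,
# so that each weight update sees a noisy estimate of the true gradient.
def train_incrementally(data, n_cycles, update_weights, seed=0):
    rng = random.Random(seed)
    data = list(data)
    for _ in range(n_cycles):
        rng.shuffle(data)              # randomly reorder patterns each cycle
        for p, t in data:
            update_weights(p, t)       # update after every single data point
```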
6.11.1 Neural network growing and neural network pruning
To reduce the statistical bias, neural network growing is employed, where training
is initially started with a small number of neurons. During the training process, when
the decrease in error falls below a certain threshold value, more neurons are
added. To reduce the statistical variance, neural network pruning is
employed, where a large network is first trained and the relative importance of the
weights is assessed. The less important weights are then removed, without affecting the
performance much, to obtain a smaller network.
6.12 Regularization and early stopping
Regularization involves modifying the performance function, which is the mean
squared error. The risk of overfitting can be reduced if a term is included in the error
function to penalize neural network models with high curvature. The usual
performance function is defined as:

\[
F = \text{Mean Squared Error (MSE)} = \frac{1}{N}\sum_{i=1}^{N}\left(t_i - a_i\right)^2.
\]  (6.26)
It is modified as:

\[
F = \frac{1}{N}\sum_{i=1}^{N}\left(t_i - a_i\right)^2 + \gamma\, R(W),
\]  (6.27)
where R(W) is a function of the network parameters. Regularization is commonly performed by
weight decay (Hertz et al., 1996). A common form of R(W) used to penalize the model is
the sum of the squares of the weights, since overfitting with large curvature occurs when
the weights of the network become too large.
\[
F = \text{Regularization} = \gamma\cdot\text{mse} + (1-\gamma)\cdot\text{msw},
\qquad\text{where}\qquad
\text{msw} = \frac{1}{n}\sum_{j=1}^{n} w_j^2,
\]  (6.28)
where γ is the performance ratio. Using this performance function causes the
network to have smaller weights and biases, which makes the network response
smoother and less likely to overfit.
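The regularized performance index of Eq. (6.28) is straightforward to compute; the sketch below assumes the network weights are supplied as a list of arrays (the function and argument names are illustrative):

```python
import numpy as np

# Eq (6.28): F = gamma * mse + (1 - gamma) * msw, with msw the mean squared weight.
def regularized_performance(targets, outputs, weights, gamma):
    mse = np.mean((targets - outputs) ** 2)
    all_w = np.concatenate([w.ravel() for w in weights])
    msw = np.mean(all_w ** 2)
    return gamma * mse + (1.0 - gamma) * msw
```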
Bayesian framework (MacKay, 1992) provides an approach to determine the
optimal regularization parameter in an automated fashion. In this framework the
weights and biases are assumed to be random variables with a specified distribution.
The regularization parameters are related to unknown variances associated with these
distributions. Bayesian regularization is implemented along with the Levenberg-Marquardt
algorithm (Foresee and Hagan, 1997).
6.13 Early stopping
In order to achieve good generalization, overfitting needs to be controlled. This
can be achieved by early stopping. In this technique, the available data set is divided
into three subsets, viz. a training set, a validation set, and a testing set. The training set
is used to compute the gradients and to update the weights and biases. The validation
and testing sets are used to monitor the error. Generally, the errors on the training and
validation sets decrease during the initial phase of training. However, when the network
starts overfitting, the error on the training set continues to decrease while that on the
validation set starts increasing. When the error on the validation set increases for a
specified number of iterations, the training is stopped. Fig 6.7 shows a typical variation
of the error on the training and validation sets during neural network training.
Fig 6.7 Evolution of error during training
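The early-stopping rule described above can be sketched as a generic loop. The `train_step` and `val_error` callbacks are hypothetical stand-ins for one training iteration and a validation-error evaluation:

```python
# Stop training when the validation error has not improved for `patience`
# consecutive iterations; return the best validation error seen.
def train_with_early_stopping(train_step, val_error, max_iters, patience):
    best, stale = float("inf"), 0
    for _ in range(max_iters):
        train_step()
        err = val_error()
        if err < best:
            best, stale = err, 0       # validation error improved
        else:
            stale += 1                 # validation error stalled or increased
            if stale >= patience:
                break                  # stop before overfitting worsens
    return best
```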
6.14 Normalizing the data
If the range of the input or output variables is quite large, then before training it is often
useful to scale the inputs and targets so that they fall within a specified range. Since the
output ranges of nonlinear transfer functions such as the log-sigmoid and hyperbolic tangent
functions are [0, 1] and [-1, +1] respectively, the inputs and targets are scaled to the
range [-1, +1] with the average close to zero:

\[
\bar{p}_i = 2\cdot\frac{p_i - p_{\min}}{p_{\max} - p_{\min}} - 1,
\]  (6.29)

where \(p_{\min}\) and \(p_{\max}\) are the minimum and maximum of the input or target values.
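Eq. (6.29) is a one-line min-max scaling; a minimal sketch (the function name is illustrative):

```python
import numpy as np

# Eq (6.29): map values linearly onto [-1, +1].
def scale_to_pm_one(p):
    p = np.asarray(p, dtype=float)
    return 2.0 * (p - p.min()) / (p.max() - p.min()) - 1.0
```

Note that the same p_min and p_max found from the training data must be reused when scaling new inputs at simulation time.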
6.15 Neural network derivatives
In general, for function approximation problems, a neural network is used to
map the functional relationship between input and output variables. Once trained, the
neural network is presented with an input vector to calculate the corresponding output.
In some cases, however, the derivatives of the output variables with respect to the input
variables may also be desired. An example of such an application is a molecular
dynamics simulation, where a multilayer neural network is trained to give the energy of a
molecular cluster with the bond distances and angles that define the geometry of the
cluster as the input. During the molecular dynamics simulation, the force, i.e. the first
derivative of the energy with respect to the coordinates of the atoms, is required.
6.15.1 First derivative with respect to weights
For the multilayer network, the network output is given as:
\[
a^{m+1} = f^{m+1}\left(W^{m+1} a^m + b^{m+1}\right)
\qquad\text{for } m = 0,1,\ldots,M-1,
\]  (6.30)

where the superscript represents the layer number, M is the number of layers of the
network, and the input to the first layer is \(a^0 = p\).
Now,

\[
\frac{\partial F}{\partial w^m_{i,j}} = \frac{\partial F}{\partial n^m_i}\cdot\frac{\partial n^m_i}{\partial w^m_{i,j}}
\qquad\text{and}\qquad
\frac{\partial F}{\partial b^m_i} = \frac{\partial F}{\partial n^m_i}\cdot\frac{\partial n^m_i}{\partial b^m_i},
\]  (6.31)
\[
\frac{\partial F}{\partial w^m_{i,j}} = s^m_i\, a^{m-1}_j
\qquad\text{and}\qquad
\frac{\partial F}{\partial b^m_i} = s^m_i,
\]  (6.32)
where the sensitivity is defined as \(s^m_i = \partial F / \partial n^m_i\).
6.15.2 Derivatives of the transfer function
The derivatives of the transfer functions are required to calculate the derivatives
of the network output with respect to the network input.
Linear function:

\[
f(x) = x,\qquad \frac{df(x)}{dx} = 1,\qquad \frac{d^2 f(x)}{dx^2} = 0.
\]  (6.33)
Log-sigmoid function
The log-sigmoid transfer function is defined as:

\[
f(x) = \frac{1}{1 + e^{-x}},\qquad
\frac{df(x)}{dx} = \frac{e^{-x}}{\left(1 + e^{-x}\right)^2} = f(x)\left(1 - f(x)\right),\qquad
\frac{d^2 f(x)}{dx^2} = f(x)\left(1 - f(x)\right)\left(1 - 2 f(x)\right).
\]  (6.34)
Hyperbolic tangent (tan-sigmoid) function
The hyperbolic tangent function is defined as:

\[
f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}},\qquad
\frac{df(x)}{dx} = 1 - f^2(x),\qquad
\frac{d^2 f(x)}{dx^2} = -2 f(x)\left(1 - f^2(x)\right).
\]  (6.35)
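The transfer functions of Eqs. (6.34) and (6.35) and their first derivatives, written in terms of the function value as used in backpropagation, can be sketched as:

```python
import numpy as np

def logsig(x):
    return 1.0 / (1.0 + np.exp(-x))

def dlogsig(x):
    f = logsig(x)
    return f * (1.0 - f)               # Eq (6.34): f'(x) = f(x)(1 - f(x))

def dtansig(x):
    f = np.tanh(x)
    return 1.0 - f ** 2                # Eq (6.35): f'(x) = 1 - f^2(x)
```

Expressing the derivative in terms of the already-computed activation avoids a second exponential evaluation during training.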
6.15.3 First derivative with respect to inputs
For the multilayer network, the network output is given as:

\[
a^{m+1} = f^{m+1}\left(W^{m+1} a^m + b^{m+1}\right)
\qquad\text{for } m = 0,1,\ldots,M-1,
\]  (6.36)

where the superscript represents the layer number, M is the number of layers of the
network, and the input to the first layer is \(a^0 = p\).
The sensitivity, now defined with respect to the network output \(a^M\), is:

\[
s^m = \frac{d a^M}{d n^m} = \dot{F}^m(n^m)\left(W^{m+1}\right)^T s^{m+1},
\]  (6.37)
where \(\dot{F}^m(n^m)\) is the Jacobian matrix

\[
\dot{F}^m(n^m) =
\begin{bmatrix}
\dot{f}^m(n^m_1) & 0 & \cdots & 0 \\
0 & \dot{f}^m(n^m_2) & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
0 & 0 & \cdots & \dot{f}^m(n^m_{S^m})
\end{bmatrix}.
\]  (6.38)
For the output layer M, the sensitivity \(s^M\) is 1.
The sensitivity of the hidden layers is computed as:

\[
s^m = \frac{d f^m}{d n^m}\cdot\left(\left(W^{m+1}\right)^T s^{m+1}\right)
\qquad\text{for } m = 1,2,\ldots,M-1,
\]  (6.39)

where \(d f^m / d n^m\) is the derivative of the transfer function at layer m, and M is the number of
layers. The derivative with respect to the network input is then computed as:

\[
\frac{\partial a^M}{\partial p} = \left(W^1\right)^T s^1.
\]
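The input-derivative computation of Eqs. (6.37)-(6.39) can be sketched for a two-layer network with a tanh hidden layer and a single linear output neuron (all sizes and weight values below are illustrative):

```python
import numpy as np

# Illustrative network: 3 inputs -> 5 tanh neurons -> 1 linear output.
rng = np.random.default_rng(2)
W1, b1 = rng.normal(size=(5, 3)), np.zeros((5, 1))
W2, b2 = rng.normal(size=(1, 5)), np.zeros((1, 1))

def output_and_input_gradient(p):
    n1 = W1 @ p + b1
    a1 = np.tanh(n1)
    a2 = W2 @ a1 + b2                  # linear output layer, so s^M = 1
    s2 = np.ones((1, 1))
    s1 = (1.0 - a1**2) * (W2.T @ s2)   # Eq (6.39), with f'(n) = 1 - tanh^2(n)
    grad = W1.T @ s1                   # d a^M / d p = (W^1)^T s^1
    return a2, grad
```

In a molecular dynamics context this gradient is the negative of the force (up to the scaling of the normalized inputs).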
6.16 Novelty detection
A multilayer feed-forward neural network can, in principle, fit any real-valued
continuous function of any dimension to arbitrary accuracy, provided it has a sufficient
number of neurons (Cybenko, 1989). However, the performance of the trained neural
network depends upon the data set provided during training. The training algorithm
minimizes the errors between the targets and the network outputs for the training data, but
this does not guarantee the neural network's performance on data that was not
present in the training set, usually called testing data.
Consider the example of approximating the function f(x) = x4. The inputs p used for
training the neural network are selected within the range [-1, 1]. This trained
network is then tested for new inputs over the range [-2, 2]. Fig 6.8 shows a plot of the
neural network output trained over the range [-1, 1] and tested over the range [-2, 2].
Fig 6.8 Plot of neural network output for f(x) = x4
In the region of the input between [-1, 1], which is the same as that of the training data, the
network performed well. However, the network performs poorly outside this region,
producing very large errors between targets and outputs.
In the case of a neural network fit for a multivariate function, it becomes
difficult to decide which input vectors lie outside the range of the inputs in the training
set, and hence whether to trust the neural network output for a given input vector. Moreover,
when the dimension of the input vector is large, the distinction between
interpolation and extrapolation becomes more ambiguous. Differentiating
the boundaries between normal and abnormal data is therefore much more difficult, making it
hard to decide whether a specific input should be accepted or rejected. Novelty
detection is a technique used to identify abnormal data points. Thus, a novelty
detection algorithm should be capable of identifying inputs that fall outside the
range of the training data. In the above example of a function of one
variable, if an input lies within [-1, 1], then the corresponding output is accepted,
whereas if an input is outside the range [-1, 1], the corresponding output is rejected.
Some of the novelty detection techniques (Pukrittayakamee and Hagan, 2001) are:
1. Minimum distance computation
2. Neural tree algorithm (Martinez, 1998)
3. Gaussian kernel estimator (Bishop, 1994)
4. Associative multilayer perceptron (Frosini, 1996)
6.16.1 Minimum distance computation
This method is based on the 2-norm distance between the training vectors and
the testing vector. When two vectors are close, their individual elements would also
be close to each other. The 2-norm distance between two vectors is given as:

\[
\lVert a_i - b_j \rVert = \left[\left(a_i - b_j\right)^T \left(a_i - b_j\right)\right]^{1/2},
\]  (6.40)

where \(a_i\) and \(b_j\) are the two vectors.
A vector d is constructed whose elements are the distances of a testing vector to all input
vectors used for training. The minimum distance separating the testing vector from all
training vectors is

\[
d_m = \min(d),
\]

and the mean separation distance ⟨d⟩ is the average value of the elements of d. A large
value of d_m would indicate that the testing vector is different from the training vectors for
which the neural network was trained. The next step would be to incorporate this testing
vector into a new training data set along with the original training data set, and to train a new
neural network on this new data set. On the other hand, a smaller value of d_m suggests
that there is at least one point in the training data set that is close to the testing vector,
and the testing vector may or may not be included in the new training data set. This is
decided based on the average distance ⟨d⟩ as well. If ⟨d⟩ is also small, then there are
many data points in the training set close to the testing data point, and this testing data
point need not be included in the new training data set; in other words, the density of
points close to the testing point is high. But if d_m is small and ⟨d⟩ is large, then although
there is at least one training data point close to the testing point, the density of points
close to the testing point is small. In other words, this point lies in a region isolated from
the other data points in the training set, and this testing point should be incorporated into
the new training set to improve the overall neural network fit.
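The d_m and ⟨d⟩ test above can be sketched in a few lines; the threshold values in the decision function are illustrative, not prescribed by the text:

```python
import numpy as np

# 2-norm distances (Eq 6.40) from one testing vector to every training vector.
def novelty_measures(test_vec, train_vecs):
    diffs = train_vecs - test_vec            # one row per training vector
    d = np.sqrt(np.sum(diffs ** 2, axis=1))
    return d.min(), d.mean()                 # d_m and <d>

def should_add_to_training_set(test_vec, train_vecs, dm_tol, mean_tol):
    dm, dmean = novelty_measures(test_vec, train_vecs)
    # Add the point if no training point is nearby (large d_m), or if the local
    # density of training points is low (small d_m but large <d>).
    return dm > dm_tol or dmean > mean_tol
```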
A selection procedure based only on the magnitude of d_m may not be able to
detect points that are critical to improving the fit. In regions where the gradient is large,
many points would be required to obtain an accurate fit. The network outputs and
targets can also be used for novelty detection; the minimum distance algorithm is
modified to include the outputs as well.
For an input vector p, the error between the output a and the target t is written as:

\[
e^T = t^T - a^T = F(p^T) - a^T,
\]  (6.41)

where F is the function to be approximated.
F can be expanded about the training input vector closest to p using a Taylor series
expansion as:
\[
e^T = F(p_0^T) + \left.\frac{\partial F(p^T)}{\partial p^T}\right|_{p=p_0}\left(p - p_0\right) + \ldots - a^T
= \left(t_0 - a\right)^T + \left.\frac{\partial F(p^T)}{\partial p^T}\right|_{p=p_0}\left(p - p_0\right) + \ldots,
\]  (6.42)

where \(p_0\) is the training vector and \(\partial F(p^T)/\partial p^T\big|_{p=p_0}\) is the first derivative of the function with
respect to the input, evaluated at the training vector.
Now, if \(p_m\) is the closest vector in the training set to p, with minimum distance \(d_m\),
and if the higher-order terms can be ignored, then

\[
e^T = t^T - a^T \approx \left.\frac{\partial F(p^T)}{\partial p^T}\right|_{p=p_m}\left(p - p_m\right).
\]  (6.43)
So the network error depends not only on the distance between the testing vector p
and the closest training vector p_m, but also on the gradient evaluated at the training
vector. Therefore, if the underlying function has a large slope, then the error can be
large even though the minimum distance is small.
The underlying function is generally not known, and therefore its gradient cannot
be computed. Pukrittayakamee and Hagan (2001) found that an approximation to
the gradient can be used if the minimum distance computation is modified to include
the distance between the network output and the target. The modified vector is [p t],
where p is the input vector and t is the network output vector.
\[
d =
\begin{bmatrix}
\lVert p - p_1 \rVert + \alpha\,\lVert a - t_1 \rVert \\
\lVert p - p_2 \rVert + \alpha\,\lVert a - t_2 \rVert \\
\vdots \\
\lVert p - p_N \rVert + \alpha\,\lVert a - t_N \rVert
\end{bmatrix},
\]  (6.44)

where α is greater than zero and N is the total number of training points.
Chapter 7
7 Problem statement
Designing the potential function involves two main steps: 1) choosing a proper
functional form to describe the interatomic interactions and 2) optimizing the
parameters in the potential function to fit the energies from ab initio calculations and
physical properties measured from experiments, such as the dissociation energy, lattice
constant, and melting point. The parameterization involves optimizing a set of
parameters to minimize an objective function, such as the root mean squared error
(RMSE). Commonly used gradient-based methods, such as the steepest descent method,
tend to get stuck in local minima. Stochastic optimization techniques such as the genetic
algorithm (GA) and simulated annealing can provide the global optimum solution for a
multivariate function. Further, these methods do not require the evaluation of the
gradient of the function.
In this study, the parameters of the Tersoff interatomic potential function were
optimized for silicon clusters using a genetic algorithm. A multilayer neural network was
used to estimate the objective function during the optimization. The use of the neural
network bypasses the need for direct evaluation of the objective function at every
iteration. The accuracy of the neural network was improved during successive stages of
optimization.
In the next part of the investigation, the reaction dynamics of vibrationally
excited vinyl bromide was investigated using a neural network potential surface fitted to
ab initio energies obtained from electronic structure calculations. A multilayer neural
network was used to approximate the energy as a function of the internal coordinates
defining the geometry of a configuration. Dissociation rate coefficients and branching
ratios for six open reaction channels were computed at an internal energy of 6.44 eV.
In the next part of the investigation, a general method for the development of
potential-energy hypersurfaces is presented. The method combines a many-body
expansion to represent the potential-energy surface with two-layer neural networks (NN)
for each M-body term in the summations. The total number of NNs required is
significantly reduced by employing a moiety energy (ME) approximation. An algorithm
is presented that efficiently adjusts all the coupled NN parameters to the database for the
surface. Application of the method to four different systems of increasing complexity
shows that the fitting accuracy of the method is good to excellent. The method is
illustrated by fitting large databases of ab initio energies for Sin (n = 3, 4, ..., 7) clusters
obtained from DFT calculations and for vinyl bromide (C2H3Br).
Chapter 8
8 Parameterization of Tersoff potential function using genetic
algorithm (GA) accelerated with neural networks (NN)
A genetic algorithm (GA) can be used to fit highly nonlinear multivariate functional
forms. A GA uses a stochastic global search method that mimics the process of natural
biological evolution (Holland, 1975). The concept is based on genetics and evolutionary
theory, in which different genetic representations of a set of variables are combined to
produce a better solution. Because the technique is stochastic rather than a
conventional gradient-based approach, the dimensionality of the problem
does not pose a serious limitation. It works by randomly generating and exchanging
functional elements (variables and arithmetic operators, such as addition and
multiplication) and selecting the fittest individuals (each representing a set of
parameters), assessed by comparison with target values; a population of potential
solutions evolves until a superior form emerges. Designing a potential energy function
for a particular element or a combination of elements involves finding a set of parameters
that can accurately fit results from ab initio calculations, using genetic algorithms and
genetic programming. The objective is to let the computer program determine the
functional form and arrive at the best possible solution with minimal human input.
Makarov and Metiu (1998) proposed a procedure by which genetic programming
could be used to find the best functional form and the best set of parameters to fit the
energy and derivatives from ab initio calculations. They suggested a directed search, which
guides the computer to use a specified functional form without limiting the flexibility of
the search too much. The directed search was used to fill in portions of a human-engineered
functional template, without which the computer could not find a simple interpretable
form even for a pair potential.
Yan and Minsker (2003) used a dynamic meta-modeling approach in which neural
networks (NN) and support vector machines (SVM) were incorporated into the GA
optimization framework to replace time-consuming flow and contaminant transport models.
Data produced from early generations of the GA were sampled to train the NN and SVM,
allowing the meta-model to adapt to the region in which the GA was searching. Farritor and Zhang (2001)
used a NN to evaluate the performance of modular robots designed using GA. Candidate
designs generated by human designers along with randomly generated designs were
used to train a NN fitness function. The GA evolved new designs that human designers might
not have conceived.
Coit and Smith (1995, 1996) presented an optimization approach using a GA to
identify the preferred choice of components and the optimal levels of redundancy for a
reliability design problem. The approach was to use NN to estimate system reliability as
a function of the component reliabilities and the design configuration. Burton and
Vladimirova (1997) developed a NN-based adaptive resonance theory (ART) network called
ARTMAP. The ART NN utilizes unsupervised learning algorithms to recognize patterns.
The fitness of the individuals is determined as a function of the degree of similarity
between clustered patterns. Buche et al. (2004) developed a Gaussian process model for
fast surrogate evaluation of the objective function.
In this study, the parameters of the Tersoff potential function are found for silicon
clusters using a GA. The most time-consuming part in the implementation of a GA is often
the evaluation of the objective or fitness function. The objective function O[P] is
expressed as a sum squared error computed over a given large ensemble of data.
Consequently, the time required for evaluating the objective function becomes an
important factor. Since a GA is well suited for implementation on parallel computers, the
time required for evaluating the objective function can be reduced significantly by
parallel processing. The approach presented in this study instead combines the universal
function approximation capability of multilayer neural networks with the GA
optimization procedure to accelerate it. A neural network, trained beforehand on several
possible solutions to predict the objective function, improves the computational
efficiency of the GA by obviating the need for the cumbersome direct evaluation of the
objective function.
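The surrogate-accelerated GA described above can be sketched conceptually as follows. This is not the study's implementation: the `surrogate` callback stands in for the trained neural network, and the population size, crossover, and mutation operators are simplified illustrative choices:

```python
import random

# Minimal GA in which a cheap surrogate replaces direct objective evaluation.
def surrogate_ga(surrogate, n_params, pop_size=20, n_gens=30, seed=0):
    rng = random.Random(seed)
    pop = [[rng.uniform(-1, 1) for _ in range(n_params)] for _ in range(pop_size)]
    for _ in range(n_gens):
        pop.sort(key=surrogate)               # rank by the surrogate, not the true objective
        parents = pop[: pop_size // 2]        # keep the fitter half unchanged
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_params) if n_params > 1 else 0
            child = a[:cut] + b[cut:]         # one-point crossover
            i = rng.randrange(n_params)
            child[i] += rng.gauss(0.0, 0.1)   # Gaussian mutation
            children.append(child)
        pop = parents + children
    return min(pop, key=surrogate)
```

Because the fitter half of the population is carried over unchanged, the best surrogate fitness is monotonically non-increasing across generations.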
The Tersoff potential is one of the most commonly used functional forms for
modeling the interatomic interactions of group IV semiconductor materials, such as
silicon. The functional form of the Tersoff potential is given as follows:
\[
E = \sum_i E_i = \frac{1}{2}\sum_{i \neq j} V_{ij},
\qquad
V_{ij} = f_C(r_{ij})\left[f_R(r_{ij}) + b_{ij}\, f_A(r_{ij})\right],
\]  (8.1)
where E is the total potential energy of the system, f_R and f_A are the repulsive and attractive
pair potentials, respectively, and f_C is a cutoff function, given by:

\[
f_R(r_{ij}) = A_{ij}\, e^{-\lambda_{ij} r_{ij}},
\qquad
f_A(r_{ij}) = -B_{ij}\, e^{-\mu_{ij} r_{ij}},
\]
\[
f_C(r_{ij}) =
\begin{cases}
1, & r_{ij} < R_{ij}, \\[4pt]
\dfrac{1}{2} + \dfrac{1}{2}\cos\!\left[\pi\,\dfrac{r_{ij} - R_{ij}}{S_{ij} - R_{ij}}\right], & R_{ij} < r_{ij} < S_{ij}, \\[4pt]
0, & r_{ij} > S_{ij},
\end{cases}
\]  (8.2)
where r_ij is the bond distance between atoms i and j, S_ij is the cutoff radius, and λ_ij, μ_ij, and R_ij
are potential parameters. The main feature of this form is the term b_ij. The idea is that
the strength of each bond depends upon the local environment and is lowered when
the number of neighbors is relatively high. This dependence is expressed by the
parameter b_ij, which can diminish the attractive force relative to the repulsive force.
\[
b_{ij} = \chi_{ij}\left(1 + \beta_i^{n_i}\,\zeta_{ij}^{n_i}\right)^{-1/(2 n_i)},
\qquad
\zeta_{ij} = \sum_{k \neq i,j} f_C(r_{ik})\,\omega_{ik}\, g(\theta_{ijk}),
\]
\[
g(\theta_{ijk}) = 1 + \frac{c_i^2}{d_i^2} - \frac{c_i^2}{d_i^2 + \left(h_i - \cos\theta_{ijk}\right)^2}.
\]  (8.3)
The term ζij defines the