



PREDICTING FINANCIAL MARKETS USING NEURO FUZZY GENETIC SYSTEMS

By BRENT ARTHUR DOEKSEN

Thesis Approved:

PREFACE

This study was conducted to provide knowledge in stock market prediction through the use of several different types of artificial intelligence systems. Many attempts have been made to accurately predict the stock market, with only marginal success. This study shows that predicting the stock market is possible with very little input data and compares the abilities of several different methods: Neural Networks, [...]

TABLE OF CONTENTS

Chapter
I. INTRODUCTION
    Neural Networks
    Conjugate Gradient
    Fuzzy Logic
    Genetic Algorithms
    Decision Trees
    Classification and Regression Tree
    Objective of Study
    Significance of Study
    Data Set and Tools Used
IV. HYBRID INTELLIGENCE SYSTEMS
    ANFIS
    Neuro Fuzzy
    Takagi Sugeno Neuro Fuzzy
    Mamdani Neuro Fuzzy
    Input Selection
V. HURST EXPONENT ON DATA
VI. DATA PREPARATION
    Input Reduction
    Data Reduction
    Data Transformation
VII. RESULTS

LIST OF TABLES

Table
2.2 Normalization of a series
5.1 Hurst Exponent Calculations
6.1.1 Spearman Correlations
6.1.2 Greedy Input Reduction: 8 inputs
6.1.3 Greedy Input Reduction: 7 inputs
7.2 Conjugate Gradient vs. Back Propagation

LIST OF FIGURES

Figure
1.1 Neural Network
1.3.1 Binary String Representation
1.3.2 Elitism
1.4 Example Decision Tree
1.7 Microsoft and Intel Stock Price

NOMENCLATURE

AI      Artificial Intelligence
ANFIS   Artificial Neuro Fuzzy Inference System
CSI     Michigan's Consumer Sentiment Index
CCI     United States' Consumer Confidence Index
INTC    Intel's Trade Symbol
MF      Membership Function

Chapter 1 Introduction

Moore's law is still intact, and thus processors are doubling in speed approximately every 18 months. This new power is very helpful to artificial intelligence, which was little more than a concept a few decades ago.
Now, thanks to an abundance of processing power, we can even combine artificial intelligence techniques in ways not possible just 10 years ago. Inference systems can learn patterns in megabytes of data in only seconds, allowing more and more data to be learned by the machines. The ability to parse through tons of data is critical in the financial world as uncertainty [...] Neural Networks can even determine trends over time [18], which is a limitation of decision trees and many other artificial intelligence mechanisms. Time series analysis is critical to any financial model because we must learn how prices change over time and what inputs are most critical to future prices.

Database marketing is an area that would benefit from neural networks. Database marketing often has hundreds of independent variables, which makes it well suited for a neural network. Figure 1.1 shows what a neural network could look like. Independent variables are input to every node in the hidden decision layer and their output is passed on to the next decision layer (depending on how many layers have been set up). Once the output is determined, its result is compared to the actual outcome and the result is backward [...]

[Figure 1.1 Neural Network]

The two key points that must be followed when designing a neural network are: [...] the high price of software that runs this algorithm. ModelMAX, a tool used by many direct marketers, can be an extravagant expense to many companies [24]. The positive side of using a neural network is that it can adapt to areas of higher uncertainty and has the ability to solve larger problems. Neural networks are well suited for problems that are highly nonlinear. These types of problems are very common in database marketing, with hard to define variables such as customer satisfaction and even harder to define dependent variables such as customer loyalty.
Another strength of neural networks is the ability to predict a continuous variable, whereas decision trees have problems with this task. The ability to learn new situations and recognize trends is another reason that neural networks are popular. The ability of the neural network to [...] faster than back propagation and require fewer epochs, resulting in less expensive hardware being required.

1.1.1 Conjugate Gradient

Conjugate gradient is a method that uses an approximation of the second-order derivative without actually calculating the second derivative. This process was originally discovered in the 1960s for solving linear systems [18]. This method is exceptionally fast and thus is very useful for large data sets or when many networks need to be built. The gradient uses a vector of previous points to determine the conjugate direction. Imagine that you are standing on a steep embankment that leads to a [...] control, data classification, decision analysis, time series prediction, and pattern recognition [16]. Petrovic et al. [24] use fuzzy logic in a multiple objective decision model for a manufacturing plant. The rules for a fuzzy system can be generated either by interviewing experts in the field or by the automated mechanisms of a fuzzy inference system, which uses supervised learning to recognize patterns in the data. A typical fuzzy rule is given below:

If (customer has high credit score) and (customer has high income) then (grant loan). (Equation 1.2)

In the above example it is obvious that there is no absolute definition for either statement. [...] allows the system to weigh rules within the system and give preference to rules that the customer fits better. Unlike traditional probability theory, the possibilities need not add up to 100% [29]. For example, let us say that there are two cases: a person is rich or a person is poor. It is possible that, according to a membership function, Jack is rich (CF = 0.65) and Jack is poor (CF = 0.20).
However, 0.65 + 0.20 ≠ 1.00, and this case is possible in fuzzy logic but not in probability theory.

Fuzzy Logic is used today in many different real world applications. One such example is an anti-lock braking system [29] where, instead of the traditional anti-lock braking system, which uses an on/off pumping action to unlock the wheel, there are about 18 sensing factors. When a sensor begins to come close to being locked, the pressure on [...] string. This chromosome defines the characteristics of the member of the population and allows the algorithm to determine its fitness. A population is a group of members and changes from generation to generation through methods such as mutation and crossover. The fitness function is used at every generation to see which members are fit and most likely to survive to the next generation through a crossover operation that can be thought of as mating.

[Figure 1.3.1 Binary String Representation: the integer equivalent of the binary value is used to determine its fitness; for example, the string 10010 has a fitness evaluation of 18.]

[Figure 1.3.2 Elitism: the next generation is created from the current generation using integer fitness, elitism, and crossover.]

We must also introduce some randomness to ensure more of the search space is covered, and this can be done by mutation. A mutation can be done by simply flipping a bit in the string to produce a new mutated string. Mutation does not occur in every [...] Genetic Algorithms are an exceptionally powerful tool, as they are very effective at searching a predefined search space, and this ability helps genetic algorithms to be used in a hybrid manner with other tools.

1.4 Decision Trees

A decision tree can be used to predict an outcome for a dependent variable based on many independent variables. The root node of the tree contains the most significant independent variable.
As the tree is traversed, the nodes become less important to the outcome until a leaf node is reached and an outcome is predicted. Figure 1 below shows [...] elements in class N.) For example, class P could be the people who receive a catalog and class N could be the people who do not receive a catalog.

$I(p,n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$ (Equation 1.4)

Set S is partitioned into sets $\{S_1, S_2, \ldots, S_v\}$. For set $S_i$, $p_i$ is the number of p's in the set and $n_i$ is the number of n's in the set. I(p,n) is the importance to the model. The higher I(p,n) is, the better this combination is for a split. A value of zero means to attach no importance to I(p,n) and a value of 1 means n and p have ideal values. Gain(A) is the amount of information gained for an attribute A, with the attribute of highest gain being used as the root. [...] were right only 50% of the time. After a decision support system that used a decision tree was implemented, the success rate increased to 70%, saving the company money [29].

1.4.1 Classification and Regression Tree

CART (Classification and Regression Tree) is a special case of a decision tree that can be constructed by examining data in a systematic way; the CART grows through a series of splits. A CART determines the importance of each variable before adding a splitter to the tree. Starting from the root node, an exhaustive search is performed on all inputs to determine which input creates the least error when picked. After finding the split, two disjoint sets are created according to the split and each set is [...]

1.5 Objective of Study

The main focus of this study is to compare the performance of different artificial intelligence paradigms on predicting the direction of individual stocks, and to show how hybrid intelligence can be used to better solve problems. The first algorithm examined is an Artificial Neural Network using the conjugate gradient descent algorithm. The second algorithm used is a straightforward back propagation method.
A Mamdani Neuro Fuzzy inference system was built and then the membership functions were modified using back propagation and a Genetic Algorithm. This showed how effective Genetic Algorithms could be and provided a comparison with the Takagi Sugeno Neuro Fuzzy model. The ANFIS model is based on the Takagi Sugeno Fuzzy Inference System and was compared with a [...]

1.6 Significance of Study

The most recent studies compare indexes such as the S&P 500, NASDAQ, and the Dow Jones [2][8][28][31][32]. The experiments done in this project examine the chaotic behavior of actual companies, which tend to be less stable and thus harder to predict. Studies have also shown that predicting direction, as opposed to predicting the price itself, can generate higher profits [8], and this study will try to capitalize on that idea. The prediction will also examine a more realistic situation where an investor has the choice between multiple stocks, in this case two, and chooses the stock that is most likely to increase in value. The experiments also compare many hybrid techniques and their abilities to predict a categorical output. The ability to predict the direction of the stock prices is the most [...]

[Figure 1.7 Microsoft and Intel Stock Price: stock price in dollars ($0 to $80) by date for MSFT and INTC.]

[...] application was developed to randomize the rows of a .CSV file to ensure the data was fully randomized; this could also have been done using the preprocessing built into Neuro Solutions. FuzzyCope3 is designed to perform regression testing only and not classification.
Thus it was necessary to write an application to transform the predicted value to 0 or 1 and then do a comparison for accuracy. All Java applications were developed using JDeveloper by Oracle.

Chapter 2 Literature Review

2.1 Hurst Exponent

Some papers have used the Hurst Exponent [9][12][32][33] to show that the data is not completely random but in fact exhibits a correspondence between the input and the output data. The Hurst Exponent was originally discovered by Hurst et al. [14] in 1965. The Hurst Exponent can show the degree of correlation. If the exponent is 0.5, the data is completely random; no network will be able to predict the output, and thus it is a waste of time to attempt to learn any pattern in the data. The closer the Hurst Exponent [...]

$X_{t,N} = \sum_{u=1}^{t} (x_u - \mu_N)$ (Equation 2.3)

$\mu_N$ is the mean of $x_u$ over all N elements. The Hurst exponent can be very useful for any set and provides a method of comparing sets of data. For example, a set with a Hurst Exponent of 0.55 is very difficult to predict, and any network with decent results on it should be considered very good. However, for a data set with a Hurst Exponent of 0.95, a network should be extremely accurate to be considered good.

2.2 Scaling and Normalization

xseries    nseries
35.25      0.5478
37.25      0.9462
37.52      1.0000
37.5       0.9960
34.87      0.4721
32.5       0.0000

Table 2.2 Normalization of a series

The above set of data shows how a data set can be spread out by using normalization, making it easier for the network to understand. [...] [33], which states that the correct number of hidden neurons is a multiple k times the number of inputs (n), minus one.

$\#\,neurons = (k \cdot n) - 1$ (Equation 2.5)

A second rule of thumb popular in newsgroups is

$\#\,neurons = \sqrt{inputs \cdot outputs}$ (Equation 2.6)

[...]

$H_{n+1} = \ln(H_n)$ (Equation 2.7)

The Baum-Haussler rule for determining the correct number of hidden neurons is defined by the following function.
$\#\,neurons \le \frac{N_{records} \cdot E_{tolerance}}{N_{inputs} \cdot N_{outputs}}$ (Equation 2.8)

Using any of these rules of thumb can prevent the networks from memorizing and thus [...] whenever there is sufficient data with both inputs and outputs. When a known outcome is available, it is ideal to use supervised learning [26]. Unsupervised learning means the system does not have the expected outputs and attempts to recognize the patterns in the data on its own. Self-organizing maps are a common tool for unsupervised learning, in which the network attempts to recognize clusters of data and to group them according to similarities with other members.

2.6 Recent Trends

Many papers have dealt with input selection when it comes to mapping financial indexes and stocks [2][8][28][31][32]. Inputs have been broken into two different types: financial and political (which tend to be qualitative). Kuo et al. [19] use a genetic algorithm based fuzzy neural network to measure the qualitative effect on the stock price. Variable selection is critical to the success of any network, and 5 key parts of the financial viability of a company were identified by Quah et al. [26] as yield, liquidity, risk, growth, and momentum factors. These variables are widely available in quantitative form; for example, the P/E ratio can be used for yield and the return on equity could be used for growth. Macroeconomic factors such as inflation and the short-term interest rate [8] have been shown to have direct impacts on stock returns. A better measure of fitness which considers profit [31] has been suggested to replace root mean squared error. Yao and Poh [32] showed an example where a model with a low NMSE had a lower return than a model with a higher NMSE.
Brownstone [6] recommends using percentages to measure performance so that the results can be better understood by traders and other people who might need the research and are not experts in the field. Chen et al. [8] used a 68-day sliding window to predict the next day's price of the index. Commission is commonly overlooked in research relating to stock market prediction; however, if any model is actually implemented it is going to incur fees, which could greatly affect the profit predicted by the model. Chen et al. [8] consider 3 different levels of commissions and how they would affect the best buying strategy used by investors. Simulation [34] has been used to show how these models can produce profits on real world testing data that is not seen by the network.

Chapter 3 Hybrid Intelligence Systems Architecture

3.1 Stand Alone

"Standalone models consist of independent software components which do not interact in any way [1]." These systems can work in a parallel environment to allow the user to determine which model is the best fit to learn the signal of the data. Once the standalone system has aided in picking the best system, that system would then be developed by itself to make the best possible single intelligent system. The advantage of this model is that it is fast to build and uses software that is already available. A disadvantage is that the system doesn't incorporate any strengths of the discarded system, and as a result the performance is not any better than that of a single intelligence system.

3.2 Transformational Hybrid Intelligent System

The system begins as one system and then transitions into an entirely new system. Thus, once the model is built, only one system is required to be worked on. Like the standalone model, this system suffers from not being able to use the strengths of both systems. These systems also tend to be application-oriented [1]. A disadvantage of this system is that there is not any readily available software that supports this type of architecture.
3.3 Hierarchical Hybrid Intelligent System

The Hierarchical Hybrid Intelligent System uses the strengths of multiple types of artificial intelligence systems to produce the best possible intelligent system. The design is broken up into layers, with each layer having a single intelligence that is best at that layer. A common usage of hierarchical hybrid intelligent systems uses an evolutionary algorithm to produce the inputs or the best settings for another artificial intelligence system. Leigh (Forecasting the NY composite index) used a genetic algorithm to determine which of the 22 inputs were the most useful and which could be eliminated to generate a better R squared correlation. The findings from the genetic algorithm were then used to create a better neural network. A hierarchical hybrid intelligent system is when the system begins as one type of intelligent system, and then is transformed into a different type, with the final product having no proof of ever being of the first type of intelligent system.

[Figure 3.3 Hierarchical Hybrid Intelligent System: 22 inputs, Genetic Algorithm, Neural Network.]

The design shown in figure 3.3 was used in this study to reduce the number of inputs to 9, which were then given to the neural network for training. Hierarchical hybrid intelligence systems show dramatic improvement over using a single intelligent system. This allows the user to focus on the bigger picture, while the computer can figure out the details of the design, such as how many hidden neurons should be used.

3.4 Integrated Intelligent System

Integrated Intelligent Systems use fused architectures [1] that provide a single model with the best characteristics of all models. There are numerous advantages to this type of model. Integrated Intelligent systems provide increased performance and are more robust because they are both noise resistant and able to explain themselves.
The biggest disadvantage of this system is its complexity; to design this type of system is a complex undertaking for any company. Nevertheless, these types of systems are needed by companies and so are actually being developed. The hope is that as more Integrated Intelligent systems are developed, the aforementioned problems will begin to dissipate. One such model that is currently available is FuzzyCope, which provides a Neuro Fuzzy model, and it is available at [10]. Similarly, Neuro Solutions has an ANFIS (Artificial Neural Fuzzy Inference System) that uses an integrated intelligent system [22]. Hierarchical design has been very popular in recent studies. Abraham [1] discusses a 5-layered system that evolves a Neuro-Fuzzy-Evolutionary system (EvoNF). This type of system would require the largest computer systems available today to build its model, which is a huge disadvantage of the hierarchical architecture. The cost of the systems needed to run these programs can be huge, but the biggest strength of these systems is their performance once the model has been built. Mamdani Fuzzy Inference, shown in figure 3.4, is an example of an integrated intelligence system.

[Figure 3.4 Integrated Hybrid Intelligent System [3][4]: Mamdani fuzzy inference on inputs x and y, with output z obtained by a method such as the centroid.]

3.5 Conclusion of Hybrid Intelligence Systems

The most interesting of the intelligence systems are the Integrated and Hierarchical hybrids, because these two methods provide the most significant performance improvements and can realize the strengths of many different intelligent systems. However, we are not limited to choosing one of these two systems; in fact, it would be perfectly reasonable to create a Hierarchical Integrated Hybrid Intelligence System. This system would contain layers as in the hierarchical system, with one or more layers containing an integrated system.
Chapter 4 Hybrid Intelligence Systems

4.1 ANFIS

ANFIS, Adaptive Network-based Fuzzy Inference Systems, have been shown to provide better results than artificial neural networks and fuzzy models [16]. A common model used today in ANFIS is the Takagi Sugeno Fuzzy Model. In the Sugeno model each rule has its own function.

if (x is A) and (y is B) then z = f(x, y) (Equation 4.1.1)

In the above, f(x, y) is a crisp function and the sets A and B are fuzzy sets; thus they don't have absolute members, but rather degrees of membership. Jang [16] gives an excellent example of an ANFIS with only 2 inputs. The diagram below shows the procedure for inputs x and y. Each layer is then described below.

[Figure 4.1 ANFIS [16]: inputs x and y pass through Layer 1 to Layer 5 to produce the output f.]

The ANFIS consists of 5 different layers, described below:

Layer 1 (Membership Function): This bell shaped graph determines whether x is in A and to what degree it is a member. The bell shape of the graph can be manipulated by changing the value of any of its parameters. Thus the end result is a bell shape that better matches the real world.

$\mu_{A_i}(x) = \frac{1}{1 + \left|\frac{x - c_i}{a_i}\right|^{2b_i}}$ (Equation 4.1.2)

a, b, and c are constants that determine the shape of the bell. A is the linguistic label (tall, short, etc.) that is associated with the node.

Layer 2 (Firing Strength): Every node in layer two corresponds to the firing strength of a rule. Any T-norm operator could be used in this layer. Two common T-norm operators are the AND (minimum) and product functions.

$O_{2,i} = w_i = \mu_{A_i}(x)\,\mu_{B_i}(y), \quad i = 1, 2$ (Equation 4.1.3)

Layer 3 (Normalized Strength): Layer three calculates a normalized firing strength so that the output of one node doesn't overshadow all other nodes.

$O_{3,i} = \bar{w}_i = \frac{w_i}{w_1 + w_2}, \quad i = 1, 2$ (Equation 4.1.4)

Layer 4 (Adaptive function): Each node has a node function defined by $\bar{w}_i$, the normalized firing strength, and by 3 new constants p, q, and r. These three parameters are referred to as the consequent parameters.
$O_{4,i} = \bar{w}_i f_i = \bar{w}_i (p_i x + q_i y + r_i)$ (Equation 4.1.5)

Layer 5 (Calculate Output): A summation of all input signals is used in this single node to compute the overall output, as described in the formula below.

$O_{5} = \sum_i \bar{w}_i f_i = \frac{\sum_i w_i f_i}{\sum_i w_i}$ (Equation 4.1.6)

The Mamdani fuzzy inference system is a special case of the Sugeno fuzzy model in which the order of the model is zero. Since the order of the system is zero, f is a constant.

4.2 Neuro Fuzzy

Neuro fuzzy systems are an attempt to combine the natural linguistics used in fuzzy inference systems with the proven capabilities of artificial neural networks [13]. The combined system's goal is to be more transparent, like a fuzzy system, giving the users a list of general and understandable rules, while at the same time building in the ability of a neural network to predict nonlinear trends in data. Central to this idea is building a bridge between fuzzy logic, which uses membership functions, and artificial neural networks, which possess quantitative adaptive number crunching power. Castellano et al. [7] designed a Neuro Fuzzy model where the parameters of the fuzzy rule base were configured by a two-phase learning of the neural network.

4.2.1 Takagi Sugeno Neuro Fuzzy

A common fuzzy inference system (FIS) used today is the Takagi Sugeno fuzzy inference system [27]. The idea was to formalize a systematic method for generating rules that a computer could use for any given data set [17]. A Takagi Sugeno FIS has rules that follow the format:

if (pressure is high) then volume = 2 * pressure (Equation 4.2.1)

In a Takagi Sugeno FIS the consequent is a crisp function that can be expressed in terms of f(x). A first-order Sugeno fuzzy model occurs when the function f is a first-order polynomial. A zero-order Sugeno fuzzy model occurs when the function f is a constant. This can also be viewed as a special case of the Mamdani fuzzy inference system [17][18]. Takagi Sugeno has a 2-step process of learning that occurs for every epoch through the training set.
The first step holds the membership functions constant and updates the input patterns learned according to an iterative least squares method. The second step of the learning updates the membership functions while the input patterns are held constant [3]. These steps provide for a very efficient learning tool.

$z = \frac{w_1 z_1 + w_2 z_2}{w_1 + w_2}$

[Figure 4.2.1 TSK Fuzzy Inference System [3][4], with inputs x and y.]

4.2.2 Mamdani Neuro Fuzzy

[...] $A_i$ and $B_j$ are input fuzzy sets and the result is the output of the fuzzy set [3][21]. A supervised learning technique is used to learn the membership functions in a Mamdani Fuzzy Inference system. The Mamdani system has 6 layers instead of the 5 that are in the Takagi Sugeno model. The first layer is for the inputs. The second layer is a fuzzification layer. The third layer is the rule antecedent layer. The fourth layer is the strength normalization layer and the fifth is the consequent layer. The final layer in the Mamdani Neuro Fuzzy system is the rule inference layer.

4.3 Input Selection

In real world problems there can be hundreds [16] of different possible inputs for any artificial intelligence system. For instance, in a financial model the inputs are not limited to the stock price, dividends, and volume traded of the particular stock or index in question. The inputs could extend to the overall performance of the market, consumer confidence, Federal Reserve interest rates, or even world policy indicators such as how the current war is proceeding. Once all these possible inputs have been found, it is good to find a mechanism for reducing the sheer number of inputs, as having too many inputs can cause many problems, such as complexity of computation and less transparency of the underlying model. Four rules of thumb to guide input selection are given by Jang [17], and it is reasonable to believe that these rules are general enough that they could work for other models.
1) Remove noise/irrelevant inputs
2) Remove inputs that are dependent on other inputs
3) Keep inputs that create a more concise and transparent model
4) Reduce time for model construction

Chapter 5 Hurst Exponent on Data

Once the data was transformed into the most viable form to use in all the networks, the Hurst Exponent [9][12][32][33] was calculated to show both that prediction is possible and that prediction is going to be very difficult. The time series used to calculate the Hurst exponent consisted only of the percentage change in price from the previous day; the actual value was not used.

$PercentageChange = \frac{P_{today} - P_{yesterday}}{P_{yesterday}}$ (Equation 6.1)

This equation was applied to all 1398 days in the testing set from January 1st, 1997 to July 31, 2002. Then $X_{t,N}$ was calculated for all days using equation 2.3. Once that was done, $R_N$ (1.3387 for MSFT) could be found using equation 2.2. The standard deviation for MSFT was found to be 0.02742, which gave us all the information needed by equation 2.1 to determine the Hurst exponent to be 0.537 for MSFT. This proves both points stated earlier. First, the data is not a complete random walk, because neither data set had a Hurst Exponent of 0.5. Second, it shows that good performance will be very difficult to achieve for any network, as the data is nearly random. Similar tests were run on Intel's data to produce a Hurst Exponent of 0.513. Thus, based on the Hurst Exponent, Intel's data is more random and it should be more difficult to produce good results for it.
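The calculation described above can be sketched in a few lines of Python. This is only an illustration, not the application used in the study: the function names are hypothetical, and the use of the population standard deviation and of natural logarithms is an assumption (the ratio of two logarithms does not depend on the base). The steps follow the percentage-change transformation and the cumulative deviation, range, and log ratio as they are described in the text.

```python
import math


def percentage_changes(prices):
    """Day-over-day change: (P_today - P_yesterday) / P_yesterday."""
    return [(today - yesterday) / yesterday
            for yesterday, today in zip(prices, prices[1:])]


def hurst_from_summary(r, s, n):
    """Hurst exponent from the range R, standard deviation S, and length N."""
    return math.log(r / s) / math.log(n)


def hurst_rs(changes):
    """Rescaled-range estimate of the Hurst exponent for one series."""
    n = len(changes)
    mean = sum(changes) / n

    # Cumulative deviation from the mean: X_t = sum over u <= t of (x_u - mean).
    cumulative = []
    running = 0.0
    for x in changes:
        running += x - mean
        cumulative.append(running)

    # Range of the cumulative deviations: R = max X - min X.
    r = max(cumulative) - min(cumulative)

    # Population standard deviation (assumption; a sample estimate is also possible).
    s = math.sqrt(sum((x - mean) ** 2 for x in changes) / n)

    return hurst_from_summary(r, s, n)
```

Plugging the summary values reported for MSFT into `hurst_from_summary` (R = 1.3387, S = 0.02742, N = 1398) gives a value of about 0.537, in line with the exponent quoted above.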
Table 5.1 shows how the Hurst exponent of 0.5368 for Microsoft was calculated.

t       Price Change    X_{t,N}
1       0.0025          0.0035
2       0.0033          0.0078
3       0.0642          0.0554
4       0.0585          0.1129
5       0.0734          0.0386
6       0.0746          0.1122
7       0.0946          0.0167
8       0.0413          0.0256
9       0.0302          0.0568
10      0.0174          0.0752
11      0.0149          0.0613
12      0.0108          0.0731
13      0.0012          0.0753
14      0.0197          0.0960
15      0.0127          0.0843
16      0.0181          0.1034
17      0.0053          0.0990
18      0.0351          0.1351
19      0.0580          0.0781
20      0.0078          0.0713
21      0.0232          0.0955
22      0.0371          0.1335
23      0.0040          0.1385
...
1398    0.0364          0

Mean = 0.000985
max X_{t,N} = 0.1129
min X_{t,N} = -1.2257
R = 1.3397
S = 0.0274
N = 1398
H = Log(R/S) / Log(N) = 0.536

Table 5.1 Hurst Exponent Calculations

Chapter 6 Data Preparation

A key when designing any artificial intelligence system is to present the data in the most meaningful and understandable format for the algorithm. The 3 steps of presenting the data to the system are to choose the best inputs, remove misleading or corrupt data rows, and transform the data.

6.1 Input Reduction

There are 16 inputs to begin with in each data model. This must be reduced to aid the performance time of the neuro fuzzy engine, since execution time grows exponentially with the number of inputs. It takes about 3 days to train an ANFIS with five inputs and five MFs per input, thus the number of inputs must be reduced to no more than five. Also, results with neural networks tend to be better when unnecessary inputs are removed and duplicate or similar inputs are eliminated. This is shown to be the case in the testing of these networks.
The original 16 inputs that were considered are: Consumer Confidence Index, the prime rate, Michigan Consumer Sentiment Index, price 1 (price yesterday), price 2 (price the day before yesterday), price 3 (price 3 days ago), price 4 (price 4 days ago), price 5 (price 5 days ago), volume 1 (volume of trades yesterday), low 1 (low price yesterday), high 1 (high price yesterday), open 1 (open price yesterday), low 2 (low price the day before yesterday), high 2 (high price the day before yesterday), open 1 (open price yesterday), and volume 2 (volume the day before yesterday). Several models were used to determine the least important inputs, and these inputs were then removed before a more accurate and iterative approach could be used to determine the final inputs. Models were built using decision trees (CART) to determine which variable offered the most 'gain.' Linear regression models were also designed to determine which inputs contributed the most to the model. Significance towards the price model was assessed using correlation models to determine each input's contribution, which is shown below in Table 6.1.1.

Variable (p < .05000)
          Vol1    Vol2    Price1  Price2  Price3  Price4  Price5  Prime   CCI     Price
Vol1      1       0.876   0.0138  0.052   0.079   0.065   0.071   0.12    0.045   0.0431
Vol2              1       0.0402  0.013   0.053   0.082   0.064   0.12    0.046   0.0445
Price1                    1       0.023   0.019   0.016   0.018   0.008   0.015   0.0289
Price2                            1       0.023   0.018   0.016   0.008   0.011   0.0082
Price3                                    1       0.025   0.017   0.009   0.007   0.0176
Price4                                            1       0.023   0.011   0.007   0.0196
Price5                                                    1       0.007   0.011   0.0045
Prime                                                             1       0.709   0.0062
CCI                                                                       1       0.0207
Price                                                                             1

Table 6.1.1 Spearman Correlation

Also, a genetic algorithm with a population of 50 was trained for 100 generations, and the results were examined to see which variables it had chosen to remove from the data model. Each network was trained for 10,000 epochs with a threshold of 500. The mutation rate used was 0.1, with uniform crossover, roulette selection, and rank-based fitness. Generational progression was used instead of steady state.
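The genetic algorithm settings listed above can be illustrated with a short sketch. It is not the Neuro Solutions implementation: the bit-mask encoding of the 16 inputs, the per-bit reading of the 0.1 mutation rate, the `USEFUL` index set, and the `toy_score` fitness (which stands in for training a network on the selected inputs) are all assumptions made for illustration.

```python
import random

# Hypothetical indices of the inputs that a real fitness function would reward.
USEFUL = {3, 4, 9, 10, 11}


def toy_score(mask):
    """Stand-in fitness: reward useful inputs, penalize every kept input."""
    return 2 * sum(mask[i] for i in USEFUL) - sum(mask)


def rank_fitness(scores):
    """Rank-based fitness: the worst member gets 1, the best gets len(scores)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0] * len(scores)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


def roulette(population, weights, rng):
    """Roulette selection: pick one member with probability proportional to weight."""
    return rng.choices(population, weights=weights, k=1)[0]


def uniform_crossover(a, b, rng):
    """Uniform crossover: each bit comes from either parent with equal chance."""
    return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]


def mutate(mask, rate, rng):
    """Flip each bit independently with probability `rate` (assumed per bit)."""
    return [1 - bit if rng.random() < rate else bit for bit in mask]


def evolve(score, n_inputs=16, pop_size=50, generations=100,
           mutation_rate=0.1, seed=0):
    """Generational GA: the whole population is replaced every generation."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_inputs)]
                  for _ in range(pop_size)]
    best = max(population, key=score)
    for _ in range(generations):
        weights = rank_fitness([score(member) for member in population])
        population = [
            mutate(uniform_crossover(roulette(population, weights, rng),
                                     roulette(population, weights, rng), rng),
                   mutation_rate, rng)
            for _ in range(pop_size)
        ]
        candidate = max(population, key=score)
        if score(candidate) > score(best):
            best = candidate
    return best
```

A call such as `evolve(toy_score)` returns a list of 16 bits, where a 1 means the corresponding input is kept. In the study the fitness of a mask would come from a trained network rather than from a fixed scoring rule.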
After building a network, the sensitivity to each input can be tested by holding all variables constant except the one in question and then feeding the network a series of values for that variable. How the output changes across these values gives the sensitivity for the variable. The Sensitivity About the Mean test was conducted to see how important each input is relative to a particular neural network. The result for the Microsoft data set is shown below in Figure 6.1.

[Figure 6.1 Sensitivity About the Mean: MSFT. Bar chart; x-axis: Input Name, y-axis: Sensitivity.]

All the results from the above experiments were then analyzed to determine which variables to use based on their importance in all the models. From the results given, the best 9 inputs were picked. Using the models built, the number of inputs was reduced to a much more manageable size, and from there a greedy systematic test was performed to eliminate the least important variable according to a neural network. Given the 9 inputs, a network was built and tested with one input missing. This iterative approach was impressive at improving performance; ideally, however, a power set should be created to pick the best inputs from every possible combination. Each network took about 20 minutes to train and test. For the original 16 inputs, the power set would contain 2^16 = 65,536 combinations, which would take 2 years, 5 months, and 27 days, and this is not feasible for this study.

The network which did the best on the test set was kept, so after the first iteration there were 8 inputs left. This process was repeated with each of the 8 inputs being removed individually to find out which input was the least significant. That input was then removed, and so on until the data set contained the 5 most significant inputs. The table below shows the results that determined that CCI (Consumer Confidence Index) should be removed from Microsoft's inputs.
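The greedy elimination loop described above can be sketched as follows. It is an illustration only: `score_without` stands in for building and testing a network on the remaining inputs and returning its test error, and the function and argument names are hypothetical.

```python
def greedy_reduce(inputs, score_without, target_size=5):
    """Drop one input per round: the one whose removal gives the lowest error.

    score_without(remaining_inputs) must return an error measure (lower is
    better) for a model trained on exactly those inputs.
    """
    current = list(inputs)
    history = []
    while len(current) > target_size:
        # Try removing each input in turn and record the resulting error.
        trials = {
            name: score_without([x for x in current if x != name])
            for name in current
        }
        dropped = min(trials, key=trials.get)
        current.remove(dropped)
        history.append((dropped, trials[dropped]))
    return current, history


if __name__ == "__main__":
    # Hypothetical importance weights, used only to make the sketch runnable.
    weights = {"in%d" % i: i for i in range(1, 10)}

    def toy_error(subset):
        return 100 - sum(weights[name] for name in subset)

    kept, history = greedy_reduce(list(weights), toy_error)
    print("kept:", kept)
    print("dropped, in order:", [name for name, _ in history])
```

Going from 9 inputs to 5 this way needs 9 + 8 + 7 + 6 = 30 trained networks. For comparison, an exhaustive search over all 65,536 subsets at 20 minutes each would need 1,310,720 minutes, or roughly 910 days, which is in line with the estimate quoted above.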
Network Name       MSE cross validation   NMSE on Testing
without CCI        0.000521               0.00712
without close3     0.000535               0.00729
without volume     0.000527               0.00731
without high2      0.000515               0.0074
without close2     0.00053                0.00757
without open1      0.000516               0.00765
without high1      0.000513               0.00779
without close1     0.0006                 0.00880

Table 6.1.2 Greedy Input Reduction: 8 inputs

Interestingly, the network with CCI had a performance of 0.007506063, which was worse than the network that had the input removed. Similarly, the next iteration, which chose volume to be removed, outperformed even this network with an NMSE of 0.006871094. This iteration is shown below.

Network Name       MSE cross validation   NMSE on Testing
without volume1    0.000515               0.00687
without close2     0.000518               0.00719
without high2      0.000509               0.00726
without close3     0.000520               0.00746
without high1      0.000532               0.00766
without open1      0.000524               0.00769
without close1     0.000584               0.00809

Table 6.1.3 Greedy Input Reduction: 7 inputs

The final selection of inputs using this approach was: close1, open1, high1, high2, and close3. The final performance on testing was 0.00693, which is just slightly worse than including 6 inputs. All networks in the above example were trained 3 separate times with randomly initialized weights, and the best network was then chosen by the system to perform the test on. Each run used conjugate gradient descent, 10,000 epochs and a threshold of 500. A threshold of 500 in this study means that if the error on the cross validation set does not improve for 500 epochs, training is terminated. The network used contained 20 processing elements in layer one and 7 processing elements in the second layer.

The NMSE used by Neuro Solutions is the MSE of the network divided by the MSE of a straightforward network that picks the average value each time. Thus when NMSE is 1, the network has learned nothing about the data set, and a value of 0 means the network is perfect.
$$\mathrm{NMSE}_{network} = \frac{\mathrm{MSE}_{network}}{\mathrm{MSE}_{dumbNetwork}} \qquad \text{(Equation 6.1)}$$

6.2 Data Reduction

There are many things to do for data clean up. First, a decision needs to be made on what to do with missing or invalid data. If there is enough data, as in this model, it is wise just to throw away those records. Both data sets were examined for missing or incomplete data and none was found. We must also decide what to do with outliers, which can throw a model off. In some cases, such as fraud detection, the outliers are the meat of the problem; however, that is not the case for our data, and thus all outliers were removed from the set.

All days on which the price increased or decreased by more than 10% in a single day were removed from the database, as these outliers were most likely caused by external forces. Days on which little (less than 0.1%) or no change occurred were also removed, because an action performed by the investor would not affect their bottom line. The volume outliers for these models were determined to be days when trade volume is 4 times the average trade volume or more. For Microsoft this value came out to any day on which more than 105 million shares were traded. These days are most likely caused by a natural or manmade disaster which no network would have the ability to foresee. The prime rate was also used, and a change in prime can greatly affect the market, so any day on which prime was changed was removed from the data. The daily value of the prime rate since 1947 is published on the web [25] and is updated every time the prime is changed. Removing this noisy data helps the networks obtain better predictions.

Data reduction is also necessary to give a more manageable size to the data set. Manmade disasters such as 9/11 have a huge impact on the stock market, which could not have been foreseen by any algorithm. Thus the entire week following September 11, 2001 was removed from the data sets of both Intel and Microsoft.
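The row filters described in this section can be sketched as below. The record layout (`Day`) and the function name are hypothetical, and two details are assumptions rather than facts from the text: the average volume is taken over the unfiltered data, and the removal of the week following September 11, 2001 is left out for brevity.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Day:
    date: str             # trading date (ISO string, illustrative only)
    pct_change: float     # fractional daily price change, e.g. 0.02 for +2%
    volume: float         # shares traded that day
    prime_changed: bool   # True if the prime rate changed on this day


def filter_days(days, max_move=0.10, min_move=0.001, volume_multiple=4.0):
    """Keep only days that pass all four removal rules."""
    # Assumption: the average volume is computed before any rows are removed.
    avg_volume = mean(d.volume for d in days)
    kept = []
    for d in days:
        move = abs(d.pct_change)
        if move > max_move:        # price moved more than 10% in one day
            continue
        if move < min_move:        # little (under 0.1%) or no change
            continue
        if d.volume >= volume_multiple * avg_volume:   # volume outlier
            continue
        if d.prime_changed:        # prime rate changed that day
            continue
        kept.append(d)
    return kept


if __name__ == "__main__":
    sample = [
        Day("d1", 0.02, 10, False),
        Day("d2", 0.15, 10, False),
        Day("d3", 0.0005, 10, False),
        Day("d4", -0.03, 100, False),
        Day("d5", 0.01, 10, True),
        Day("d6", -0.05, 10, False),
    ]
    print([d.date for d in filter_days(sample)])
```

Applying the rules in a single pass keeps each removal criterion independent of the others, so the order of the checks does not change which rows survive.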
The data sets used in this paper originally had data from January 1990 all the way to August 2003, and the price of the stocks had changed so much that very little would be learned by using the entire dataset. All the data prior to 1997 was removed from the data set. This provided two services: first, it reduced the size of the data set, and second, it gave a better split of increasing days to decreasing days. Studies have shown that in classification problems it is important to have equal representation of both cases in order to prevent the network from becoming biased towards the more common value, in this case increasing days. Intel's data set contained less than a 1% difference in the representation of increasing days as compared with decreasing days, so further manipulation of the data set was not required. However, Microsoft did much better during this period and had an increase on 59.6% of the days in the sample set. In order to prevent the network from heavily favoring the increased prediction, increasing days were randomly removed from the data set until a 55/45 split was achieved. This allowed experiments to be run to show the difference with a strong bias and without a bias. The last year of data, from August 1st, 2002 until July 31st, 2003, was held out of the set to be used as an unseen testing set in the simulation.

6.3 Data Transformation

Transforming the data in the most meaningful manner for the network is essential for the success of the network. The daily price was modified from a stock price to a percentage change from the previous day in an attempt to help the network better understand the data. This gets the data closer to the actual goal of predicting whether the stock price goes up or down, rather than simply trying to guess the price, which we don't care about in this paper. However, after days of testing this was found to actually hurt performance, and a straight price model was then used.
All data should also be standardized or normalized so that one column of values cannot completely dominate the prediction. There are many types of normalization, such as min-max normalization, z-score normalization and normalization by decimal scaling. Most of the better software in the industry handles the normalization for the user. The data in this paper was normalized using the min-max normalization specified in equation 2.4. Z-score normalization was also tried, and the results tended to be worse than with the much more straightforward min-max normalization, so min-max normalization was used. Many neural network tools, such as WebStatistica and Neuro Solutions, automatically normalize the data for the user.

Chapter 7 Results

7.1 Testing Standards

The test standards listed below were used for all tests unless otherwise noted in the results. All tests on the neural networks were given 10,000 epochs to train, and cross validation was used to terminate training after 200 epochs with no improvement. Networks were also run 3 times each with randomized starting weights to ensure the best possible network by minimizing the chance of settling in a local minimum. Neuro Solutions allows for varying of a single parameter, so it was often the case that, with everything else held constant, the number of hidden neurons in the first layer would vary from 10 to 50, with a step size of 2, to determine which network was best at learning the data. This type of gradual improvement was key to the success seen in the final networks, which averaged around 63% correct; this was much better than the original networks, which had a very dismal performance of around 53%. When a 2-layered neural network was designed, the second layer would contain the ceiling of the log of the number of neurons in the first layer. All networks used conjugate gradient descent unless otherwise noted. The transfer function used for all networks was TanhAxon.
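The layer-sizing rule and the first-layer sweep can be illustrated with a short sketch. The base of the logarithm is not stated in the text; base 2 is assumed here because it reproduces the two-layer configurations that appear in the result tables (for example 24 PE - 5 PE and 48 PE - 6 PE). The function names are hypothetical.

```python
import math


def second_layer_size(first_layer_size: int) -> int:
    """Ceiling of the base-2 log of the first-layer size (base 2 is assumed)."""
    return math.ceil(math.log2(first_layer_size))


def first_layer_sweep(start: int = 10, stop: int = 50, step: int = 2):
    """First-layer sizes tried in the sweep, inclusive of both ends."""
    return list(range(start, stop + 1, step))


if __name__ == "__main__":
    for size in first_layer_sweep():
        print(size, "PE -", second_layer_size(size), "PE")
```

With three randomized restarts per configuration, the sweep from 10 to 50 in steps of 2 amounts to 21 first-layer sizes and 63 training runs.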
The data split for the networks was 65% training, 15% cross validation and 20% testing, except in the simulation, when the test set was the entire year. In that case the training set was broken up as 80% training and 20% cross validation. Networks given to Fuzzy Cope used min-max normalization.

7.2 Conjugate Gradient vs. Back Propagation

The original neural network designed used the very straightforward method of back propagation. The error would filter through the network based on the learning rate and momentum of the network. Several new ideas have been published to provide faster learning and also to provide better results for the networks. This paper examines the conjugate gradient descent and back propagation algorithms. Figure 7.2 shows that it takes about one-third the amount of time to train a conjugate gradient network as a back propagation network. Much of the improvement in speed was because conjugate gradient networks would normally cross the threshold (set at 200 epochs without improvement) before ever reaching the epoch limit of 10,000. An exact time cannot be given, since the weights are randomly initialized and the choice of weights can greatly affect the convergence of the network. For example, it took just over an hour to train a series of conjugate gradient networks with the number of processing elements varying from 20 to 50 in a single layer, with each network built 3 times. The same test, with back propagation, takes about 4 hours.

[Figure 7.2 Conjugate vs. Back Propagation. Horizontal bar chart; x-axis: Epochs (0 to 12,000). Bars: Back Propagation 3 (MSE 0.000361), Back Propagation 2 (MSE .0000349), Back Propagation 1 (MSE 0.000363), Conjugate 3 (MSE 0.000330), Conjugate 2 (MSE 0.000344), Conjugate 1 (MSE 0.000363).]

The table below shows the performance of both back propagation and conjugate gradient, both using the default learning rates provided in Neuro Solutions.
It is clear that Conjugate Gradient is better in almost every category in this regression test.

                     MSE Error   Overall Correct   Decrease   Increase
Back Propagation
30 PE* - 5 PE**      0.3725      59.98%            41.01%     78.72%
40 PE* - 6 PE**      0.3748      59.10%            41.60%     76.41%
36 PE*               0.3702      61.14%            36.92%     85.08%
Conjugate Gradient Descent
24 PE - 5 PE         0.3702      59.98%            37.50%     82.19%
48 PE - 6 PE         0.3689      59.40%            35.16%     83.35%
22 PE                0.3745      62.59%            42.18%     82.77%

*Step size 1.0 first layer. **Step size 0.1 second layer. Momentum 0.7 as default. Random Set 1.

Table 7.2 Conjugate Gradient vs. Back Propagation

7.3 Takagi Sugeno Neuro Fuzzy Inference System Tests

Neuro Solutions provides an ANFIS training kit, which they define as follows: "The ANFIS (CoActive Neuro-Fuzzy Inference System) model integrates adaptable fuzzy inputs with a modular neural network to rapidly and accurately approximate complex functions. Fuzzy inference systems are also valuable as they combine the explanatory nature of rules (membership functions) with the power of "black box" neural networks." [22]

The ANFIS package is not part of the educator version, and thus it had to be tested only in evaluation mode, limiting the number of exemplars to 300. This greatly reduced the network's ability to learn, as the other networks had about 850 exemplars to be trained on. In order to make a comparison between a straight neural network and the ANFIS, the neural network was trained with the exact same restrictions. The results for both were mediocre at best. The significance here is simply to show that the Takagi Sugeno Neuro Fuzzy system was superior to the neural network in learning the signal; this also shows the extreme complexity of this network. The networks used 5 inputs: price yesterday, price a week ago, volume yesterday, high price yesterday and high price 2 days ago. The last two inputs were discovered in Neural Trading Solutions, a powerful stock market predictor software available for commercial use [22].
To show the extreme complexity of these networks, a rough approximation of time is given. It should be noted that many of the networks terminated, due to the cross validation set not showing any improvement for 500 epochs, well before ever reaching the 10,000 epoch maximum. The time complexity of using an ANFIS is its worst aspect. It is simply impossible to test hundreds of different weights for a network when it takes 2 days to train a single network on one of today's fastest machines.

Type of Network      NMSE on testing   % Correct   Time
3 MFs per input      0.003811          53.67%      30 min
4 MFs per input      0.004515          56.00%      4 hours
5 MFs per input      0.005447          55.33%      22 hours
NN207                0.006474          55.00%      1 min
NN10                 0.008856          56.67%      15 sec
NN9                  0.006148          55.33%      15 sec
Randomize Records
3 MFs per input      0.004685          55.33%      30 min

300 exemplars, Conjugate Gradient, TSK, Bell MF, Axon Transfer, 10,000 epochs, 500 threshold, cross validation

Table 7.3 Takagi Sugeno Neuro Fuzzy vs. Neural Networks

7.4 Mamdani Neuro Fuzzy Tests

The results reported by FuzzyCope3 were very poor. The software, simply put, lacks the power to learn such a complex network. It suffers from being unable to get out of a local minimum, and thus hundreds of similar networks must be built. The software must then be reloaded to clear its memory so it does not find the same local minimum. This technique was able to find a good solution for the problem using 3 membership functions per input. However, the problem never converged when 4 and 5 membership functions were used per input, despite around 100 attempts for both networks using varying weights. NMSE in these tests means the mean of all the squared errors on the normalized data set, in which the data is in the range [0, 1]. The results are shown below.
MF per input   Step   Momentum   Epochs   RMS Training   NMSE Testing   % Correct
3              0.2    0.8        1000     0.0232         0.0414         53.31%
3              0.2    0.8        2000     0.0230         0.0412         53.01%
4              0.2    0.8        2000     0.2059         0.4792         NA
5              0.2    0.8        2000     0.2059         0.4792         NA

               Population   Min, Max   Generation   RMS Training   NMSE Testing   % Correct
3 MF - GA      50           20, 20     50           0.0228         0.0412         53.01%

GA settings as listed: Cross Over Point, Fitness, Selection: Uniform, Normalization, Elitism, Tournament

Table 7.4 Mamdani Neuro Fuzzy Results

The errors for both 4 and 5 membership functions per input are actually 10 times worse than that of the network with 3 membership functions. This is caused by an error in FuzzyCope3, which is unable to converge for these networks.

7.5 Classification vs. Regression

This fundamental question comes down to when to translate the data: before or after submitting it to the neural network. When performing regression testing, all 5 inputs are given to the network, and the output is the stock price for the next day in dollars and cents. Once a test is completed, the predicted price is compared with yesterday's price to see if the price increased or decreased. The actual price is likewise compared with yesterday's price to determine if it increased or decreased. If both comparisons produce the same output, then the network is considered to have correctly predicted the value. Classification is much more straightforward. Before presenting the data to the network, the output is translated to 1 for an increase or 0 for a decrease, and the objective of the network is to correctly predict 0 or 1.

                 Min MSE Cross   NMSE Testing   Testing split   Correct   Decrease   Increase
Regression
28 PE - 5 PE     0.0004663       0.006659       50.29%          57.94%    37.50%     78.14%
42 PE - 6 PE     0.0004583       0.006789       50.29%          52.71%    31.65%     73.52%
26 PE            0.0004918       0.007338       50.29%          54.45%    36.33%     72.36%
Classification
24 PE - 5 PE     0.3702          NA             50.29%          59.98%    37.50%     82.19%
48 PE - 6 PE     0.3689          NA             50.29%          59.40%    35.16%     83.35%
22 PE            0.3745          NA             50.29%          62.59%    42.18%     82.77%

Table 7.5 Classification vs. Regression

7.6 Random Data Sets

When dealing with random sets, it is often very important to show that the data can actually be learned at the reported rate for more than just a single set, since one set, chosen at random, could simply have been lucky. Showing the same tests for different randomly selected data sets shows that the results are realizable. The Microsoft data set used in the previous two sections was randomized three different times, and networks were built for each data set using the same standards discussed in previous sections to ensure the best possible networks were created.

[Figure 7.6 Random Data Sets. "Composite of All Sets": bar chart with a percentage y-axis showing the Training, Cross and Testing sets for Set 1, Set 2 and Set 3; x-axis: Random Sets.]

It is important that the training set and the test set have similar composites in order for the network to perform well. Set 1 has about a 6% difference between the makeup of the randomly selected training and testing rows, making prediction much more difficult.

                 MSE        Correct   Decrease   Increase
Set 1
22 PE            0.3745     62.59%    42.18%     82.77%
24 PE - 5 PE     0.3702     59.98%    37.50%     82.19%
48 PE - 6 PE     0.3689     59.40%    35.16%     83.35%
Set 2 Classification
34 PE            0.3920     55.91%    16.21%     90.03%
20 PE - 5 PE     0.3924     58.23%    10.40%     95.92%
38 PE - 6 PE     0.3905     59.40%    26.27%     87.86%
Set 3 Classification
42 PE            0.379365   60.85%    36.33%     80.49%
32 PE - 5 PE     0.382199   59.98%    36.99%     78.39%
54 PE - 6 PE     0.380529   60.85%    37.64%     79.44%

Conjugate Gradient, MSFT

Table 7.6 Classification across 3 Random Sets

The importance of the table above is to show that performance is comparable across all sets. Thus the results can be considered reliable for any random grouping of the dataset.

7.7 Even Split in Classification

Studies have also shown that in classification it is important to have a similar split so that the network does not become biased. The data consists of MSFT stock prices from 1997 onward, randomized, with 55% of the total set representing increasing days.
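One way to produce a given class split, such as the 55/45 split used for Microsoft or a fully even split, is to randomly discard rows of the over-represented class. The sketch below assumes labels of 1 for increasing days and 0 for decreasing days; the function name and the seeding are hypothetical and are not taken from the tools used in the study.

```python
import math
import random


def downsample_to_share(labels, target_share, seed=0):
    """Return sorted row indices so that class 1 makes up at most target_share.

    labels: sequence of 0/1 values (1 = increasing day, 0 = decreasing day).
    target_share: desired maximum fraction of class 1, strictly below 1.
    """
    rng = random.Random(seed)
    ones = [i for i, y in enumerate(labels) if y == 1]
    zeros = [i for i, y in enumerate(labels) if y == 0]
    # Largest number of ones with ones / (ones + zeros) <= target_share.
    # The small epsilon guards against floating-point round-off.
    max_ones = math.floor(target_share * len(zeros) / (1.0 - target_share) + 1e-9)
    if len(ones) > max_ones:
        ones = rng.sample(ones, max_ones)
    return sorted(ones + zeros)


if __name__ == "__main__":
    labels = [1] * 60 + [0] * 40          # toy data: 60% increasing days
    keep = downsample_to_share(labels, 0.5)
    kept = [labels[i] for i in keep]
    print(len(kept), "rows kept,", sum(kept), "of them increasing")
```

Only the majority class is thinned, so every decreasing day is preserved; the price is a smaller training set.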
In this study the inputs we use show the difference in performance between a training set with an even split and one that has not been evened out. The results are below.

                 MSE Error   Correct   Decrease   Increase
Classification:
24 PE - 5 PE     0.37022     59.98%    37.50%     82.19%
48 PE - 6 PE     0.36889     59.40%    35.16%     83.35%
22 PE            0.37448     62.59%    42.18%     82.77%
Classification Even Split:
22 PE - 5 PE     0.37520     58.23%    57.97%     58.49%
46 PE - 6 PE     0.37578     58.23%    61.48%     55.02%
18 PE            0.38068     61.14%    59.14%     63.12%

Table 7.7 Classification with Even Split

These findings are consistent with studies published on dataset splits. The overall performance is slightly better when the training set is not made an even split; however, this creates a big bias even when the data is split 55/45. As you can see in Table 7.6, the network that was not evenly split got over 80% correct for days that increased in value and less than 40% for days that decreased in value. The even split data, however, was able to score around 60% correct regardless of whether the predicted day is increasing or decreasing.

Chapter 8 Simulations and Discussions

The best possible neural networks were designed using only data from January 1997 until July 2002. Data from these sets (MSFT and INTC) were removed in the manner mentioned in section 6.2. One additional step was taken after the results were shown to be good. A regression based neural network predicted all values in the training period, and the results were compared with the actual price. Any day with a prediction error of more than 5% was removed from the training data set. These days were outliers and most likely caused by external forces. Each model was given $100 at the beginning of the testing period, August 1, 2002, and the model bought if it predicted an increase with over 50% certainty. The model would hold onto the stock until it came to a day which had an increase certainty of less than 50%, and then it would sell the stock. The table below shows an example of this strategy.
Increase Output   Action   # of shares   Cash
0.460738          STAY     0             $ 100.00
0.617069          BUY      4.411116      $ -
0.566166          HOLD     4.411116      $ -
0.551663          HOLD     4.411116      $ -
0.595844          HOLD     4.411116      $ -
0.521784          HOLD     4.411116      $ -
0.489340          SELL     0             $ 106.93
0.552275          BUY      4.483247      $ -
0.478012          SELL     0             $ 107.69
0.651398          BUY      4.617822      $ -
0.485218          SELL     0             $ 113.78
0.557304          BUY      4.612206      $ -
0.479946          SELL     0             $ 114.29
0.461905          STAY     0             $ 114.29

Table 8.1 Profit Simulation

A similar model was also built which earned the prime rate on any money that was not invested in the stock; this resulted in only marginal improvement for all models.

The results from the simulation were astonishing. Profit was improved by as much as 889% over a straightforward buy and hold strategy. The figure below shows how profit increased dramatically by using the neural network to make decisions.

[Figure 8.1 Microsoft Profit Simulation: $100 investment. Bar chart of profit ($0 to $120) for Buy and Hold, 18 PE - 5 PE, 36 PE - 6 PE and 44 PE; series: Profit w/ Prime, Profit.]

The model created with 44 processing elements in one layer was able to make $103.17 in a one-year period with an initial investment of $100.00. For comparison, if the same money had bought stock at the beginning of the period and sold it at the end of the period, it would have made a profit of $10.43; this is referred to as a buy and hold strategy. This model predicted the direction of the stock correctly 63% of the time. Despite the lackluster performance, the model was able to predict correctly when it counts most. The biggest drawback of such a scheme in the real world is commission. The model mentioned above bought and sold stock a total of 143 times during the 252 trading days in the simulation period. The total number of trades is shown in the figure below, which shows that the Microsoft models, which were much more accurate, traded more often.
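The buy, hold and sell rule illustrated in Table 8.1 can be sketched as follows. The prices in the example are hypothetical, since the table lists only the network outputs, share counts and cash. A certainty of exactly 0.5 is treated here as "not over 50%", which is an assumption; commission and interest on idle cash are ignored.

```python
def simulate(certainties, prices, cash=100.0, threshold=0.5):
    """Replay the strategy day by day; returns a list of (action, shares, cash).

    certainties: network output per day (certainty that the price will rise).
    prices: hypothetical price at which a trade executes on that day.
    """
    shares = 0.0
    log = []
    for p_up, price in zip(certainties, prices):
        if p_up > threshold:
            if shares == 0.0:
                # Invest all available cash.
                shares, cash, action = cash / price, 0.0, "BUY"
            else:
                action = "HOLD"
        else:
            if shares > 0.0:
                # Liquidate the whole position.
                cash, shares, action = shares * price, 0.0, "SELL"
            else:
                action = "STAY"
        log.append((action, shares, cash))
    return log


if __name__ == "__main__":
    outputs = [0.460738, 0.617069, 0.566166, 0.489340, 0.461905]
    prices = [20.0, 25.0, 26.0, 30.0, 31.0]   # hypothetical prices
    for row in simulate(outputs, prices):
        print(row)
```

Raising `threshold` above 0.5, or requiring a margin on both sides of it, would reduce the number of transactions, which matters once commission is taken into account.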
[Figure 8.2 Transactions. Bar chart of the number of transactions for the Intel models (22 PE - 5 PE, 36 PE - 6 PE, 17 PE) and the Microsoft models (18 PE - 5 PE, 36 PE - 6 PE, 44 PE).]

The prediction on Intel's stock was inferior in all cases. Two of the 3 models were able to beat the buy and hold strategy. The only plus side of Intel's performance is that it minimized the number of transactions, and thus it would incur less commission if it were actually implemented. The table below shows the best models, picked by the lowest MSE on the cross validation set, for both Intel and Microsoft.

Network        MSE Training   MSE Cross   Correct   Profit
MSFT 44 PE     0.3959         0.3968      63.32%    $ 103.17
INTEL 17 PE    0.3957         0.3972      54.98%    $ 49.23

Table 8.2 Profit on MSFT vs. INTC

The best model for Intel increased profit over the buy and hold strategy by 55%. The graph below shows the results of all 3 models on Intel's data.

[Figure 8.3 Intel Profit Simulation: $100 investment. Bar chart of profit ($10 to $50) for Buy and Hold, 22 PE - 5 PE, 36 PE - 6 PE and 17 PE; series: Profit w/ Prime, Profit.]

Chapter 9 Conclusion

Predicting stocks on a daily basis is very difficult for even the most advanced AI techniques. The Hurst exponent confirmed the hypothesis that prediction is possible but especially difficult. Surprisingly, the Consumer Confidence Index and Prime Rate were not able to improve the predictability of these networks. Through the use of many techniques, it is possible to correctly predict the direction of the stock 63% of the time for a large company like Microsoft. The study demonstrated that the Takagi Sugeno (TSK) Neuro Fuzzy System was able to produce a much better result than a pure neural network when given the same training set. The TSK Neuro Fuzzy system requires a great deal of processing power, and a network with five membership functions per input took as long as 22 hours to train.
The Takagi Sugeno Neuro Fuzzy model was superior in prediction to the Mamdani Neuro Fuzzy Inference System, and both networks required about the same amount of time to train. Genetic Algorithms are a very efficient way of determining which inputs are the most valuable, and by removing unnecessary inputs the performance of the network can be increased. Genetic Algorithms provide a "good enough" solution when searching the entire search space could take years of CPU time on even the best system. Genetic Algorithms were used to fine tune the membership functions in the Mamdani Neuro Fuzzy system, and the result was a marginal improvement in the RMSE; however, it didn't improve the percentage of directional correctness, which was the goal of this study. The GA - Mamdani didn't perform as well as the ANFIS.

The neural networks using conjugate gradient descent were able to achieve 63% correct on the test set, and thus this research provides the groundwork for a great deal of profit. Even the worst model used for Microsoft produced a return on investment of 66%, and the best network scored an astounding 103% return. This study also demonstrated that picking the correct stock is as important as building the best network (as the best network for Intel was outperformed by the worst network on Microsoft data). The best network for Intel scored a mediocre return on investment of just under 55%. The biggest downfall of these networks was that the transaction cost of buying and selling stocks would be very high. However, it would be feasible to fine tune the buy and sell strategy to lower this cost.

Chapter 10 Future Work

There are many areas in which this research can be extended. The main area of interest is the stock market simulation. It is clear from all the tests that the networks were able to learn the pattern in Microsoft's data much more easily than Intel's data, so it is possible that another stock is more learnable than Microsoft.
More stocks could be picked to determine which have the most learnable pattern. There also exist many different types of networks, such as Support Vector Machines, Generalized Feed Forward, Jordan/Elman Networks, ANFIS (attempted, but the software version used limited its ability) and Time Lag Recurrent Networks. Any one of these might actually outperform the neural networks designed here; the only way to be sure is to design all the networks and see which does the best. Similarly, there are many different transfer functions, and trying many different combinations might also improve performance. Each problem was given well over a hundred networks before picking the best network; however, there are an infinite number of networks, and thus many more networks could be built in the hope of finding a more optimal solution. The data sets themselves might not be optimized for learning, and thus trying more than just 3 random data sets could improve performance.

The input space could be searched more thoroughly. The original 16 inputs were cut down to 5 inputs using multiple techniques such as Genetic Algorithms and an iterative approach. However, all 65,536 combinations could be checked to find the best network for each approach. The simulation was performed using a static model, and research has been done showing that a moving model that is updated each day can outperform a static model. Thus it is possible that, if a new neural network were built every day in the simulation given the most recent information, performance could improve.

The cost of commission is omitted from this research, as profit is not the goal of this study. However, if an actual real world model were to be used for making profit, it would be necessary to modify the entire model to consider commission and thus reduce the number of trades performed by the network. Many of the networks had 150 trades over the 252 days the market was open during the testing period.
It would be very advisable to set a threshold so that the predicted action is not performed unless the likelihood is much better than just 50%. One last method for improving the network performance would be to remove more outliers from the networks, in different combinations. All data in the training and cross validation sets that was missed by more than 5% by a well trained regression network was removed. The value of 5% should be varied many times to find an ideal value for learning. From all the above recommendations it is clear that this problem is intractable, with so many combinations that not all possible networks could ever be built. So we must be satisfied with a good enough result.

Bibliography

1. Abraham, A. Recent Advances in Intelligent Paradigm. Studies in Fuzziness and Soft Computing. Ed. Ajith Abraham, Lakhmi Jain, and Janusz Kacprzyk. Physica-Verlag (2002): 1-28.
2. Abraham, A., Philip, N.S., and Saratchandran, P. "Modeling Chaotic Behavior of Stock Indices Using Intelligent Paradigms." Neural, Parallel and Scientific Computations. 11 (2003): 143-160.
3. Abraham, A. "It is Time to Fuzzify Neural Networks." Intelligent Multimedia, Computing and Communications: Technologies and Applications of the Future. (2001): pp. 253-263.
4. Abraham, A. "Meta Learning Evolutionary Artificial Neural Networks." Neurocomputing Journal, Elsevier Science, Netherlands, (2003) (forthcoming).
5. Bartos, F.J. "Motion Control Tunes into AI Methods." Control Engineering 46 No. 5, May (1999).
6. Brownstone, D. "Using Percentage Accuracy to Measure Neural Network Predictions in Stock Market Movements." Neurocomputing. 10 (1996): 237-250.
7. Castellano, G., Castiello, C., Fanelli, A.M., and Giovannini, M. "A Neuro-Fuzzy Framework for Predicting Ash Properties in Combustion Processes." Neural, Parallel and Scientific Computations. 11 (2003): 69-82.
8. Chen, A.S., Leung, M.T., and Daouk, H. "Application of Neural Networks to an Emerging Financial Market: Forecasting and Trading the Taiwan Stock Index." Computers and Operations Research. 30 (2003): 901-923.
9. Feder, Jens. Fractals. Plenum Press, New York, New York, 1988.
10. Fuzzy Cope 3. http://kel.otago.ac.nz/software/FuzzyCOPE3/, 11/2003.
11. Garver, M.S. "Try new data-mining techniques." Marketing News. Volume 36, issue 19, Sept (2002).
12. Grossglauser, M. and Bolot, J.C. "On the Relevance of Long-Range Dependence in Network Traffic." ACM SIGCOMM '96. August (1996).
13. Harris, C.J. and Hong, X. "Neuro-Fuzzy Network Model Construction Using Bezier Bernstein Polynomial Functions." IEEE Proc D Control Theory and Application, in Press. 47 (2000): 337+.
14. Hurst, H.E., Black, R.P., and Simaika, Y.M. Long-Term Storage: An Experimental Study. London, England: Constable, 1965.
15. Izumi, K. and Ueda, K. "Analysis of Exchange Rate Scenarios Using an Artificial Market Approach." Proceedings of the International Conference on Artificial Intelligence. 2 (1999): 360-366.
16. Jang, J.S.R. "ANFIS: Adaptive-Network-Based Fuzzy Inference System." IEEE Transactions on Systems, Man, and Cybernetics. 23 (1993): 665-684.
18. Jang, J.S.R., Sun, C.T., and Mizutani, E. Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence. New Jersey: Prentice Hall, 1997.
19. Kuo, R.J., Chen, C.H., and Hwang, Y.C. "An Intelligent Stock Trading Decision Support System through Integration of Genetic Algorithm Based Fuzzy Neural Network and Artificial [...]" 118 (2001): 21-24.
20. Labbi, A. and Gauthier, E. "Combining Fuzzy Knowledge and Data for Neuro-Fuzzy [...]" 1/2 (1997).
21. Mamdani, E.H. and Assilian, S. "An Experiment in Linguistic Synthesis with a Fuzzy Logic Controller." International Journal of Man-Machine Studies. No. 1, pp. 1-13, (1975).
22. Neuro Solutions. http://www.nd.com. 11/2003.
23. O'Brian, T.V. "Neural Nets for Direct Marketers." Marketing Research. Volume 6, Issue 1, Dec (1994).
24. Petrovic-Lazarevic, Sonja, and Abraham, A. "Hybrid Fuzzy-Linear Programming Approach for Multi Criteria Decision Making Problems." Neural, Parallel and Scientific Computations. 11 (2003): 53-68.
25. Prime Rate. http://research.stlouisfed.org/fred2/data/PRIME.txt. 11/2003.
26. Quah, T.S. and Srinivasan, B. "Improving Returns on Stock Investment through Neural Network Selection." Expert Systems with Applications. 17 (1999): 295-301.
27. Sugeno, M. "Industrial Applications of Fuzzy Control." Elsevier Science Pub. Co. (1985).
28. Tsaih, R., Hsu, Y., and Lai, C.C. "Forecasting S&P 500 Stock Index Futures with a Hybrid AI System." Decision Support Systems. 23 (1998): 161-174.
29. Turban, E. and Aronson, J.E. Decision Support Systems and Intelligent Systems. Delhi, India: Pearson Education, Inc., 2001.
30. Weka. http://www.cs.waikato.ac.nz/ml/weka. 11/2003.
31. Yao, J.T. and Tan, C.L. "A Study on Training Criteria for Financial Time Series Forecasting." Proceedings of International Conference on Neural Information Processing. Nov. 2001: 772-777.
32. Yao, J. and Poh, H.L. "Forecasting the KLSE Index Using Neural Networks." IEEE International Conference on Neural Networks. 2 (1995): 1012-1017.
33. Yao, J., Tan, C.L., and Poh, H.L. "Neural Networks for Technical Analysis: A Study of KLCI." International Journal of Theoretical and Applied Finance. 2 (1999): 221-241.
34. Yao, J. and Poh, H.L. "Equity Forecasting: A Case Study on the KLSE Index." NNCM '95 (3rd International Conference on Neural Networks in the Capital Markets). Oct (1995): 341-353.
35. Zadeh, L.A. "Fuzzy Sets." Information and Control. June (1965), 8(3): 338-353.
VITA

Brent Arthur Doeksen

Candidate for the Degree of Master of Computer Science

Thesis: PREDICTING FINANCIAL MARKETS USING NEURO FUZZY GENETIC SYSTEMS

Major Field: Computer Science

Biographical:

Personal Data: Born in Stillwater, Oklahoma, on November 14, 1976, the son of Gerald and Cheryl Doeksen.

Education: Graduated from Stillwater High School, Stillwater, Oklahoma in May 1995; received Bachelor of Science degree in Computer Science and Mathematics from Oklahoma State University, Stillwater, Oklahoma in December 1999. Completed the requirements for the Master of Science degree with a major in Computer Science at Oklahoma State University in December 2003.

Experience: Brent has been a professional software developer since 199[...] and has worked for several companies across the United States: Sabre Inc., J.B. Hunt, Philip Morris, OneOK, and Williams Communications Group.
Title  Predicting Financial Markets Using Neuro Fuzzy Genetic Systems 
Date  20031201 
Author  Doeksen, Brent Arthur 
Full Text Type  Open Access 
Note  Thesis 
Rights  © Oklahoma Agricultural and Mechanical Board of Regents 
Transcript  PREDICTING FINANCIAL MARKETS USING NEURO FUZZY GENETIC SYSTEMS

By BRENT ARTHUR DOEKSEN

Thesis Approved:

PREFACE

This study was conducted to provide knowledge in stock market prediction through the use of several different types of artificial intelligence systems. Many attempts have been made to accurately predict the stock market with only marginal success. This study shows that predicting the stock market is possible with very little input data and compares the abilities of several different methods: Neural Networks, [...]

TABLE OF CONTENTS

I. INTRODUCTION
   Neural Networks
   Conjugate Gradient
   Fuzzy Logic
   Genetic Algorithms
   Decision Trees
   Classification and Regression Tree
   Objective of Study
   Significance of Study
   Data Set and Tools Used
IV. HYBRID INTELLIGENCE SYSTEMS
   ANFIS
   Neuro Fuzzy
   Takagi Sugeno Neuro Fuzzy
   Mamdani Neuro Fuzzy
   Input Selection
V. HURST EXPONENT ON DATA
VI. DATA PREPARATION
   Input Reduction
   Data Reduction
   Data Transformation
VII. RESULTS

LIST OF TABLES

2.2 Normalization of a series
5.1 Hurst Exponent Calculations
6.1.1 Spearman Correlations
6.1.2 Greedy Input Reduction: 8 inputs
6.1.3 Greedy Input Reduction: 7 inputs
7.2 Conjugate Gradient vs. Back Propagation

LIST OF FIGURES

1.1 Neural Network
1.3.1 Binary String Representation
1.3.2 Elitism
1.4 Example Decision Tree
1.7 Microsoft and Intel Stock Price

NOMENCLATURE

AI     Artificial Intelligence
ANFIS  Artificial Neuro Fuzzy Inference System
CSI    Michigan's Consumer Sentiment Index
CCI    United States' Consumer Confidence Index
INTC   Intel's Trade Symbol
MF     Membership Function

Chapter 1 Introduction

Moore's law is still intact and thus processors are doubling in speed approximately every 18 months.
This new power is very helpful to artificial intelligence, which was only a mere conception a few decades ago. Now, thanks to the abundance of processing power, we can even combine artificial intelligence techniques in ways not possible just 10 years ago. Inference systems can learn patterns in megabytes of data in only seconds, thus allowing more and more data to be learned by the machines. The ability to parse through tons of data is critical in the financial world as uncertainty [...]

Neural Networks can even determine trends over time [18], which is a limitation of decision trees and many other artificial intelligence mechanisms. Time series analysis is critical to any financial model because we must learn how prices change over time and which inputs are most critical to future prices. Database marketing is an area that would benefit from neural networks. Database marketing often has hundreds of independent variables, which is well suited to a neural network. Figure 1.1 shows what a neural network could look like. Independent variables are input to every node in the hidden decision layer and their output is passed on to the next decision layer (depending on how many layers have been set up). Once the output is determined, its result is compared to the actual outcome and the result is backward [...]

[Figure 1.1 Neural Network]

The two key points that must be followed when designing a neural network are: [...] the high price of software that runs this algorithm. ModelMAX, a tool used by many direct marketers, can be an extravagant expense to many companies [24]. The positive side of using a neural network is that it can adapt to areas of higher uncertainty and has the ability to solve larger problems. Neural networks are well suited for problems that are highly nonlinear.
These types of problems are very common in database marketing, with hard-to-define variables such as customer satisfaction and even harder-to-define dependent variables such as customer loyalty. Another strength of neural networks is the ability to predict a continuous variable, whereas decision trees have problems with this. The ability to learn new situations and recognize trends is another reason that neural networks are popular. The ability of the neural network to [...] faster than back propagation and require fewer epochs, resulting in less expensive hardware being required.

1.1.1 Conjugate Gradient

The conjugate gradient is a method that uses an approximation of the second order derivative without actually calculating the second derivative. This process was originally discovered in the 1960s for solving linear systems [18]. This method is exceptionally fast and thus is very useful for large data sets or when many networks need to be built. The gradient uses a vector of previous points to determine the conjugate direction. Imagine that you are standing on a steep embankment that leads to a [...]

[...] control, data classification, decision analysis, time series prediction, and pattern recognition [16]. Petrovic et al. [24] use fuzzy logic in a multiple objective decision model for a manufacturing plant. The rules for a fuzzy system can be generated either by interviewing experts in the field or by the mechanical mechanisms of a fuzzy inference system, which uses supervised learning to recognize patterns in the data. A typical fuzzy rule is given below:

If (customer has high credit score) and (customer has high income) then (grant loan).   Equation 1.2

In the above example it is obvious that there is no absolute definition for either statement. [...] allows the system to weigh rules within the system and give preference to rules that the customer fits better. As opposed to traditional probability theory, not all possibilities must add up to 100% [29].
For example, let us say that there are two cases: a person is rich or a person is poor. It is possible that, according to a membership function, Jack is rich (CF = 0.65) and Jack is poor (CF = 0.20). Here 0.65 + 0.20 ≠ 1.00, and this case is possible in fuzzy logic but not in probability theory. Fuzzy logic is used today in many different real world applications. One such example is an anti-lock braking system [29] where, instead of the traditional anti-lock braking system, which uses an on/off pumping action to unlock the wheel, there are about 18 sensing factors. When a sensor begins to come close to being locked, the pressure on [...]

[...] string. This chromosome defines the characteristics of the member of the population and allows the algorithm to determine its fitness. A population is a group of members and changes from generation to generation through methods such as mutation and crossover. The fitness function is used at every generation to see which members are fit and most likely to survive to the next generation through a crossover operation that can be thought of as mating.

[Figure 1.3.1 Binary String Representation: the integer equivalent of the binary value is used as its fitness, so the string 10010 has a fitness of 18]

[Figure 1.3.2 Elitism: using integer fitness, elitism and crossover create the next generation from the current generation]

We must also introduce some randomness to ensure more of the search space is covered, and this can be done by mutation. Mutations can be done by simply flipping a bit in the string to produce a new mutated string. Mutation does not occur in every [...]

Genetic Algorithms are an exceptionally powerful tool, as they are very effective at searching a predefined search space, and this ability helps genetic algorithms to be used in a hybrid manner with other tools.
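The genetic algorithm steps described above (binary chromosomes, integer fitness, elitism, crossover, and bit-flip mutation) can be sketched as follows. This is a minimal illustration, not the implementation used in the thesis: the function names, the fitness-proportional parent selection, the single-point crossover, and the default rates are all assumptions made for the example.

```python
import random

def fitness(chromosome: str) -> int:
    """Integer equivalent of the binary string, e.g. '10010' -> 18."""
    return int(chromosome, 2)

def crossover(parent_a: str, parent_b: str, point: int) -> str:
    """Single-point crossover (assumed): head of parent_a, tail of parent_b."""
    return parent_a[:point] + parent_b[point:]

def mutate(chromosome: str, rate: float, rng: random.Random) -> str:
    """Flip each bit independently with probability `rate`."""
    flipped = {"0": "1", "1": "0"}
    return "".join(flipped[b] if rng.random() < rate else b for b in chromosome)

def next_generation(population, rng, elite_count=1, mutation_rate=0.05):
    """Build the next generation using elitism, crossover and mutation."""
    ranked = sorted(population, key=fitness, reverse=True)
    new_population = ranked[:elite_count]        # elites survive unchanged
    weights = [fitness(c) + 1 for c in ranked]   # +1 so '00000' can still be picked
    while len(new_population) < len(population):
        parent_a, parent_b = rng.choices(ranked, weights=weights, k=2)
        point = rng.randint(1, len(parent_a) - 1)
        child = crossover(parent_a, parent_b, point)
        new_population.append(mutate(child, mutation_rate, rng))
    return new_population

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    population = ["11011", "00010", "10000", "01000", "01010", "00110"]
    for _ in range(5):
        population = next_generation(population, rng)
    print(max(population, key=fitness))
```

Because the elite member is copied unchanged, the best fitness in the population can never decrease from one generation to the next, which is the point of elitism.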
1.4 Decision Trees

A decision tree can be used to predict an outcome for a dependent variable based on many independent variables. The root node of the tree contains the most significant independent variable. As the tree is traversed, the nodes become less important to the outcome until a leaf node is reached and an outcome is predicted. Figure 1 below shows [...] elements in class N.) For example, class P could be the people to receive a catalog and class N could be the people who do not receive a catalog.

$I(p,n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$   Equation 1.4

Set S is partitioned into sets $\{S_1, S_2, \ldots, S_v\}$. For set $S_i$, $p_i$ is the number of p's in the set and $n_i$ is the number of n's in the set. I(p,n) is the importance to the model. The higher I(p,n) is, the better this combination is for a split. A value of zero means to attach no importance to I(p,n) and a value of 1 means n and p have ideal values. Gain(A) is the amount of information gained for an attribute A, with the attribute of highest gain being used as the root. [...] were right only 50% of the time. After a decision support system that used a decision tree was implemented, the success rate increased to 70%, saving the company money [29].

1.4.1 Classification and Regression Tree

CART (Classification and Regression Tree) is a special case of a decision tree that can be constructed by examining data in a systematic approach; the CART grows through a series of splits. A CART determines the importance of each variable before adding a splitter to the tree. Starting from the root node, an exhaustive search is performed on all inputs to determine which input creates the least error when picked. After finding the split, two disjoint sets are created according to the split and each set is [...]

1.5 Objective of Study

The main focus of this study is to compare the performance of different artificial intelligence paradigms in predicting the direction of individual stocks, and how hybrid intelligence can be used to better solve problems.
The first algorithm examined is an Artificial Neural Network using the conjugate gradient descent algorithm. The second algorithm used is a straightforward back propagation method. A Mamdani Neuro Fuzzy inference system was built and then the membership functions were modified using back propagation and a Genetic Algorithm. This showed how effective Genetic Algorithms could be and provided a comparison with the Takagi Sugeno Neuro Fuzzy model. The ANFIS model is based on the Takagi Sugeno Fuzzy Inference System and was compared with a [...]

1.6 Significance of Study

The most recent studies compare indexes such as the S&P 500, NASDAQ, and the Dow Jones [2][8][28][31][32]. The experiments done in this project examine the chaotic behavior of actual companies, which tend to be less stable and thus harder to predict. Studies have also shown that using direction as compared to prediction can generate higher profits [8], and this study will try to capitalize on that idea. Also, the prediction will examine a more realistic situation where an investor has the choice between multiple stocks, in this case 2, and chooses the stock that is most likely to increase in value. The experiments also compare many hybrid techniques and their abilities to predict a categorical output. The ability to predict the direction of the stock prices is the most [...]

[Figure 1.7 Microsoft and Intel Stock Price: stock price in dollars ($0 to $80) by date for MSFT and INTC]

[...] application was developed to randomize the rows of a .CSV file to ensure the data given to the network was fully randomized; this could also have been done using the preprocessing built into Neuro Solutions.
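Randomizing the rows of a .CSV file, as mentioned above, can be sketched in a few lines. This is only an illustration under stated assumptions: the thesis applications were written in Java, while this sketch uses Python, and the function name, the header handling, and the column names in the usage example are hypothetical.

```python
import csv
import io
import random

def shuffle_csv_rows(csv_text: str, seed=None, has_header: bool = True) -> str:
    """Return the CSV text with its data rows in random order.

    The header row (if any) stays first; `seed` makes the shuffle repeatable.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = (rows[:1], rows[1:]) if has_header else ([], rows)
    random.Random(seed).shuffle(body)
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(header + body)
    return out.getvalue()

if __name__ == "__main__":
    # Hypothetical columns, for illustration only.
    sample = "price1,volume1\n35.25,100\n37.25,200\n37.52,300\n"
    print(shuffle_csv_rows(sample, seed=0))
```

Shuffling before splitting the data into training, cross validation, and test sets keeps each set from being dominated by one period of the price history.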
FuzzyCope3 is designed to perform regression testing only and not classification. Thus it was necessary to write an application to transform the predicted value to 0 or 1 and then do a comparison for accuracy. All Java applications were developed using JDeveloper by Oracle.

Chapter 2 Literature Review

2.1 Hurst Exponent

Some papers have used the Hurst Exponent [9][12][32][33] to prove that the data is not completely random but in fact has a correspondence between the input and the output data. The Hurst Exponent was originally discovered by Hurst et al. [14] in 1965. The Hurst Exponent can show the degree of correlation. If the exponent is 0.5 the data is completely random, thus no network will be able to predict the output and it is a waste of time to attempt to learn any pattern in the data. The closer the Hurst Exponent [...]

$X_{t,N} = \sum_{u=1}^{t} (x_u - \mu_N)$   Equation 2.3

$\mu_N$ is the mean of $x_u$ for all N elements. The Hurst exponent can be very useful for any set and gives a method of comparing sets of data. For example, a set with a Hurst Exponent of 0.55 is very difficult to predict, and any network with decent results on it should be considered great. However, for a data set with a Hurst Exponent of 0.95 the network should be extremely accurate to be considered good.

2.2 Scaling and Normalization

[...]

xseries   nseries
35.25     0.5478
37.25     0.9462
37.52     1.0000
37.5      0.9960
34.87     0.4721
32.5      0.0000

Table 2.2 Normalization of a series

The above set of data shows how a data set can be spread out by using normalization, making it easier for the network to understand.

[...] [33], which states that the correct number of hidden neurons is a multiple k times the number of inputs (n) minus one.

$\#\,\text{neurons} = (k \cdot n) - 1$   Equation 2.5

A second rule of thumb popular in newsgroups is

$\#\,\text{neurons} = \sqrt{\text{inputs} \cdot \text{outputs}}$   Equation 2.6

[...] $H_{n+1} = \ln(H_n)$   Equation 2.7

The Baum-Haussler rule for determining the correct number of hidden neurons is defined by the following function.
$\#\,\text{neurons} \le \frac{N_{records} \cdot E_{tolerance}}{N_{inputs} \cdot N_{outputs}}$   Equation 2.8

Using any of these rules of thumb can prevent the networks from memorizing and thus [...]

[...] whenever there is sufficient data with both inputs and outputs. When a known outcome is available it is ideal to use supervised learning [26]. Unsupervised learning means the system does not have the expected outputs and attempts to recognize the patterns in the data by itself. Self-organizing maps are a common tool for unsupervised learning, in which the network attempts to recognize clusters of data and to group them according to similarities with other members.

2.6 Recent Trends

Many papers have dealt with input selection when it comes to mapping financial indexes and stocks [2][8][28][31][32]. Inputs have been broken into two different types: financial and political (which tend to be qualitative). Kuo et al. [19] use a genetic algorithm based fuzzy neural network to measure the qualitative effect on the stock price. Variable selection is critical to the success of any network, and 5 key parts of the financial viability of a company were identified by Quah et al. [26] as yield, liquidity, risk, growth, and momentum factors. These variables are widely available in quantitative form; for example, the P/E ratio can be used for yield and the return on equity could be used for growth. Macroeconomic factors such as inflation and the short-term interest rate [8] have been shown to have direct impacts on stock returns. A better measure of fitness which considers profit [31] has been suggested to replace the root mean squared error. Yao and Poh [32] showed an example where a model with a low NMSE had a lower return than a model with a higher NMSE.
Brownstone [6] recommends using percentages to measure performance so that the results can be better understood by traders and other people who might need the research and are not experts in the field. Chen et al. [8] used a 68-day sliding window to predict the next day's price of the index. Commission is commonly overlooked when doing research relating to stock market prediction; however, if any model is actually implemented it is going to incur fees, which could greatly affect the profit predicted by the model. Chen et al. [8] consider 3 different levels of commissions and how they would affect the best buying strategy used by investors. Simulation [34] has been used to show how these models can produce profits on real world testing data that is not seen by the network.

Chapter 3 Hybrid Intelligence Systems Architecture

3.1 Stand Alone

"Stand-alone models consist of independent software components which do not interact in any way [1]." These systems can work in a parallel environment to allow users to determine which model is the best fit to learn the signal of the data. Once the stand-alone system has aided in picking the best system, that system would then be developed by itself to make the best possible single intelligent system. The advantage of this model is that it is fast to build and uses software that is already available. A disadvantage is that the system doesn't incorporate any strengths of the discarded systems, and as a result the performance is not any better than that of a single intelligence system.

3.2 Transformational Hybrid Intelligent System

The system begins as one system and then transitions into an entirely new system. Thus once the model is built, only one system is required to be worked on. Like the stand-alone model, this system suffers from not being able to use the strengths of both systems. These systems also tend to be application-oriented [1]. A disadvantage of this system is that there is not any readily available software that supports this type of architecture.
3.3 Hierarchical Hybrid Intelligent System

The Hierarchical Hybrid Intelligent System uses the strengths of multiple types of artificial intelligence systems to produce the best possible intelligent system. The design is broken up into layers, with each layer having a single intelligence that does what is best at that layer. A common usage of hierarchical hybrid intelligent systems is to use an evolutionary algorithm to produce the inputs or the best settings for another artificial intelligence system. Leigh (Forecasting the NY composite index) used a genetic algorithm to determine which of the 22 inputs were the most useful and which could be eliminated to generate a better R squared correlation. The findings from the genetic algorithm were then used to create a better neural network. A hierarchical hybrid intelligent system is when the system begins as one type of intelligent system, and then is transformed into a different type, with the final product having no proof of ever being of the first type of intelligent system.

[Figure 3.3 Hierarchical Hybrid Intelligent System: 22 inputs feed a Genetic Algorithm, whose output feeds a Neural Network]

The design shown in figure 3.3 was used in this study to reduce the number of inputs to 9, which were then given to the neural network for training. Hierarchical hybrid intelligence systems show dramatic improvement over using a single intelligent system. This allows the user to focus on the bigger picture, and the computer can figure out the details of the design, such as how many hidden neurons should be used.

3.4 Integrated Intelligent System

Integrated Intelligent Systems use fused architectures [1] that provide in a single model the best characteristics of all models. There are numerous advantages to this type of model. Integrated Intelligent systems provide increased performance and are more robust because they are both noise resistant and have the ability to explain themselves.
The biggest disadvantage of this system is its complexity; designing this type of system is a complex undertaking for any company. Nevertheless, these types of systems are needed by companies and so are actually being developed. The hope is that as more Integrated Intelligent systems are developed, the aforementioned problems will begin to dissipate. One such model that is currently available is FuzzyCope, which provides a Neuro-Fuzzy model and is available at [10]. Similarly, Neuro Solutions has an ANFIS (Artificial Neural Fuzzy Inference System) that uses an integrated intelligent system [22]. Hierarchical design has been very popular in recent studies. Abraham [1] discusses a 5-layered system that evolves a Neuro-fuzzy-Evolutionary system (EvoNF). This type of system would require the largest computer systems available today to build its model, which is a huge disadvantage of the hierarchical architecture. The cost of the systems needed to run these programs can be huge, but the biggest strength of these systems is their performance once the model has been built. The Mamdani Fuzzy Inference system shown in figure 3.4 is an example of an integrated intelligence system.

[Figure 3.4 Integrated Hybrid Intelligent System [3][4]]

3.5 Conclusion of Hybrid Intelligence Systems

The most interesting of the intelligence systems are the Integrated and Hierarchical hybrids, because these two methods provide the most significant performance improvements and can realize the strengths of many different intelligent systems. However, we are not limited to choosing one of these two systems; in fact, it would be perfectly reasonable to create a Hierarchical Integrated Hybrid Intelligence System. This system would contain layers as in the hierarchical system, with one or more layers containing an integrated system.
Chapter 4 Hybrid Intelligence Systems

4.1 ANFIS

ANFIS, Adaptive Network-based Fuzzy Inference Systems, have been shown to provide better results than artificial neural networks and fuzzy models [16]. A common model used today in ANFIS is the Takagi Sugeno Fuzzy Model. In the Sugeno model each different rule has its own function.

if (x is A) and (y is B) then z = f(x, y)   Equation 4.1.1

In the above, f(x, y) is a crisp function and the sets A and B are fuzzy sets; thus they don't have absolute members, but rather a degree of membership. Jang [16] gives an excellent example of an ANFIS with only 2 inputs. The diagram below shows the procedure for inputs x and y. Each layer is then described below.

[Figure 4.1 ANFIS [16]: Layers 1 to 5, inputs x and y, output f]

The ANFIS consists of 5 different layers described below:

Layer 1 (Membership Function): This bell shaped graph determines if x is in A and to what degree it is a member. The bell shape of the graph can be manipulated by changing the value of any variable. Thus the end result is a bell shape that better matches the real world.

$\mu_{A_i}(x) = \frac{1}{1 + \left|\frac{x - c_i}{a_i}\right|^{2 b_i}}$   Equation 4.1.2

a, b, and c are constants that determine the shape of the bell. A is the linguistic label (tall, short, etc.) that is associated with the node.

Layer 2 (Firing Strength): Every node in layer two corresponds to the firing strength of a rule. Any T-norm operator could be used in this layer. Two common T-norm operators are the MIN and product functions.

[...]   Equation 4.1.3

Layer 3 (Normalized Strength): Layer three calculates a normalized firing strength so that the output of one node doesn't overshadow all other nodes.

$O_{3,i} = \bar{w}_i = \frac{w_i}{w_1 + w_2}, \quad i = 1, 2$   Equation 4.1.4

Layer 4 (Adaptive function): Each node has a node function defined by $\bar{w}_i$, the normalized firing strength, and by 3 new constants p, q and r. These three parameters are referred to as the consequent parameters.
$O_{4,i} = \bar{w}_i f_i = \bar{w}_i (p_i x + q_i y + r_i)$   Equation 4.1.5

Layer 5 (Calculate Output): A summation of all incoming signals is used in this single node to compute the overall output, as described in the formula below.

$O_{5} = \sum_i \bar{w}_i f_i$   Equation 4.1.6

The Mamdani fuzzy inference system is a special case of the Sugeno fuzzy model in which the order of the model is zero. Since the order of the system is zero, f is a constant.

4.2 Neuro Fuzzy

Neuro fuzzy systems are an attempt to combine the natural linguistics used in fuzzy inference systems with the proven capabilities of artificial neural networks [13]. The combined system's goal is to be more transparent, like a fuzzy system, giving the users a list of general and understandable rules while at the same time building in the ability of a neural network to predict nonlinear trends in data. Central to this idea is building a bridge between fuzzy logic using membership functions and artificial neural networks that possess quantitative, adaptive number crunching power. Castellano et al. [7] designed a Neuro-fuzzy model where the parameters of the fuzzy rule base were configured by a two-phase learning of the neural network.

4.2.1 Takagi Sugeno Neuro Fuzzy

A common fuzzy inference system (FIS) used today is the Takagi Sugeno fuzzy inference system [27]. The idea was to formalize a systematic method for generating rules that a computer could use for any given data set [17]. A Takagi Sugeno FIS has rules that follow the format:

if (pressure is high) then volume = 2 * pressure   Equation 4.2.1

In a Takagi Sugeno FIS the consequent is a crisp function that can be expressed in terms of f(x). A first-order Sugeno fuzzy model occurs when the function f is a first order polynomial. A zero-order Sugeno fuzzy model occurs when the function f is a constant. This can also be viewed as a special case of the Mamdani fuzzy inference system [17][18]. Takagi Sugeno has a 2-step process of learning that occurs for every epoch through the training set.
The first step holds the membership functions constant and updates the input patterns learned according to an iterative least squares method. The second part of the learning updates the membership functions while the input patterns are held constant [3]. These steps provide for a very efficient learning tool.

[Figure 4.2.1 TSK Fuzzy Inference System [3][4]: inputs x and y, output $z = \frac{w_1 z_1 + w_2 z_2}{w_1 + w_2}$]

4.2.2 Mamdani Neuro Fuzzy

$A_i$ and $B_j$ are the input fuzzy sets and the result is the output fuzzy set [3][21]. A supervised learning technique is used to learn the membership functions in a Mamdani Fuzzy Inference system. The Mamdani system has 6 layers instead of the 5 that are in the Takagi Sugeno model. The first layer is for the inputs. The second layer is a fuzzification layer. The third layer is the rule antecedent layer. Then the fourth is the rule strength normalization layer and the fifth is the rule consequent layer. The final layer in the Mamdani Neuro Fuzzy system is the rule inference layer.

4.3 Input Selection

In real world problems there can be hundreds [16] of different possible inputs for any artificial intelligence system. For instance, in a financial model the inputs are not limited to the stock price, dividends, and volume traded of the particular stock or index in question. The inputs could extend to the overall performance of the market, consumer confidence, Federal Reserve interest rates, or even world policy indicators such as how the current war is proceeding. Once all these possible inputs have been found, it is good to find a mechanism for reducing the sheer number of inputs, as having too many inputs can cause many problems, such as complexity of computation and less transparency of the underlying model. Four rules of thumb to guide input selection have been given by Jang [17], and it is reasonable to believe that these rules are general enough that they could work for other models.
1) Remove noise/irrelevant inputs
2) Remove inputs that are dependent on other inputs
3) Select inputs that create a more concise and transparent model
4) Reduce time for model construction

Chapter 5 Hurst Exponent on Data

Once the data was transformed into the most viable form to use in all the networks, the Hurst Exponent [9][12][32][33] was calculated to show both that prediction is possible and that prediction is going to be very difficult. The time series used to calculate the Hurst exponent consisted only of the percentage change in price from the previous day; the actual value was not used.

$\text{PercentageChange} = \frac{P_{today} - P_{yesterday}}{P_{yesterday}}$   Equation 6.1

This equation was performed on all 1398 days in the testing set from January 1st, 1997 to July 31, 2002. Then $X_{t,N}$ was calculated for all days using equation 2.3. Once that was done, $R_N$ (1.3387 for MSFT) could be found using equation 2.2. The standard deviation for MSFT was found to be 0.02742, which gave us all the information needed by equation 2.1 to determine the Hurst exponent to be 0.537 for MSFT. This proves both points stated earlier. First, the data is not a complete random walk because neither data set had a Hurst Exponent of 0.5. Second, it shows that good performance will be very difficult to achieve for any network, as the data is nearly random. Similar tests were run on Intel's data to produce a Hurst Exponent of 0.513. Thus, based on the Hurst Exponent, Intel's data is more random and it should be more difficult to produce good results.
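The calculation described above can be sketched as a short script. This is a minimal illustration that follows the single-window formula H = log(R/S) / log(N) shown in Table 5.1; the function names are hypothetical, and the use of the population standard deviation for S is an assumption, since the text does not say which form was used.

```python
import math

def percentage_changes(prices):
    """Daily percentage change: (P_today - P_yesterday) / P_yesterday."""
    return [(today - yesterday) / yesterday
            for yesterday, today in zip(prices, prices[1:])]

def hurst_from_rs(r, s, n):
    """H = log(R/S) / log(N), the form used in Table 5.1."""
    return math.log(r / s) / math.log(n)

def hurst_exponent(changes):
    """Estimate H from a series of percentage changes."""
    n = len(changes)
    mean = sum(changes) / n
    # X_{t,N}: running sum of deviations from the mean (Equation 2.3).
    cumulative, running = [], 0.0
    for x in changes:
        running += x - mean
        cumulative.append(running)
    r = max(cumulative) - min(cumulative)  # range R_N
    # Population standard deviation (assumed form of S).
    s = math.sqrt(sum((x - mean) ** 2 for x in changes) / n)
    return hurst_from_rs(r, s, n)

if __name__ == "__main__":
    # Values reported in the text for MSFT: R = 1.3387, S = 0.02742, N = 1398.
    print(round(hurst_from_rs(1.3387, 0.02742, 1398), 4))
```

Plugging in the MSFT values reported in the text gives a result that agrees with the reported exponent of about 0.537. More elaborate estimators fit a slope over many window sizes; that is a different technique from the single-window calculation sketched here.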
Table 5.1 shows how the Hurst exponent of 0.5368 was calculated for Microsoft.

   t    Price Change   X_{t,N}
   1    0.0025         0.0035
   2    0.0033         0.0078
   3    0.0642         0.0554
   4    0.0585         0.1129
   5    0.0734         0.0386
   6    0.0746         0.1122
   7    0.0946         0.0167
   8    0.0413         0.0256
   9    0.0302         0.0568
  10    0.0174         0.0752
  11    0.0149         0.0613
  12    0.0108         0.0731
  13    0.0012         0.0753
  14    0.0197         0.0960
  15    0.0127         0.0843
  16    0.0181         0.1034
  17    0.0053         0.0990
  18    0.0351         0.1351
  19    0.0580         0.0781
  20    0.0078         0.0713
  21    0.0232         0.0955
  22    0.0371         0.1335
  23    0.0040         0.1385
  ...
1398    0.0364         0

Mean = 0.000985
max[X_{t,N}] = 0.1129, min[X_{t,N}] = -1.2257 (1 ≤ t ≤ N)
R = 1.3387, S = 0.0274, N = 1398
H = Log(R/S) / Log(N) = 0.536

Table 5.1 Hurst Exponent Calculations

Chapter 6 Data Preparation

A key when designing any artificial intelligence system is to present the data in the most meaningful and understandable format for the algorithm. The 3 steps of presenting the data to the system are to choose the best inputs, remove misleading or corrupt data rows, and transform the data.

6.1 Input Reduction

There are 16 inputs to begin with in each data model. This must be reduced to improve the performance time of the neuro fuzzy engine, since execution time grows exponentially with the number of inputs. It takes about 3 days to train an ANFIS with five inputs and five MFs per input, thus the number of inputs must be reduced to no more than five. Also, results with neural networks tend to be better when unnecessary inputs are removed and duplicate or similar inputs are eliminated. This is shown to be the case in the testing of these networks.
The original 16 inputs that were considered are: Consumer Confidence Index, the prime rate, Michigan Consumer Sentiment Index, price 1 (price yesterday), price 2 (price the day before yesterday), price 3 (price 3 days ago), price 4 (price 4 days ago), price 5 (price 5 days ago), volume 1 (volume of trades yesterday), low 1 (low price yesterday), high 1 (high price yesterday), open 1 (open price yesterday), low 2 (low price the day before yesterday), high 2 (high price the day before yesterday), open 1 (open price yesterday), and volume 2 (volume the day before yesterday). Several models were used to determine the least important inputs, and these inputs were then removed before a more accurate, iterative approach was used to determine the final inputs. Models were built using decision trees (CART) to determine which variables offered the most 'gain.' Linear regression models were also designed to determine which inputs contributed the most to the model. Significance towards the price model was assessed using correlation models to determine each input's contribution, which is shown below in Table 6.1.1.

Variable (<.05000)  Vol1    Vol2    Price1  Price2  Price3  Price4  Price5  Prime   CCI     Price
Vol1                        0.876   0.0138  0.052   0.079   0.065   0.071   0.12    0.045   0.0431
Vol2                        1       0.0402  0.013   0.053   -0.082  0.064   0.12    0.046   0.0445
Price1                                      0.023   0.019   0.016   0.018   0.008   0.015   0.0289
Price2                                      1       0.023   0.018   0.016   0.008   0.011   0.0082
Price3                                                      0.025   0.017   0.009   0.007   0.0176
Price4                                                              0.023   0.011   0.007   0.0196
Price5                                                              1       0.007   0.011   0.0045
Prime                                                                               0.709   0.0062
CCI                                                                                         0.0207
Price

Table 6.1.1 Spearman Correlations

Also, a genetic algorithm with a population of 50 was trained for 100 generations, and the results were examined to see which variables it had chosen to remove from the data model. Each network was trained for 10,000 epochs with a threshold of 500. The mutation rate used was 0.1, with uniform crossover, roulette selection, and rank-based fitness. A generational progression was used instead of a steady state.
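The genetic search described above can be sketched as follows. This is a minimal illustration and not the software used in this study: the `evaluate` callback is a hypothetical stand-in for training a network on the selected inputs and returning its error, and only the settings named in the text (population 50, 100 generations, mutation rate 0.1, uniform crossover, roulette selection, rank-based fitness, generational replacement) are taken from the source.

```python
import random

def rank_fitness(errors):
    """Rank-based fitness: worst error gets rank 1, best gets the top rank."""
    worst_first = sorted(range(len(errors)), key=lambda i: errors[i], reverse=True)
    fitness = [0] * len(errors)
    for rank, index in enumerate(worst_first, start=1):
        fitness[index] = rank
    return fitness

def breed(population, fitness, mutation_rate, rng):
    """Roulette selection of two parents, uniform crossover, then mutation."""
    mother, father = rng.choices(population, weights=fitness, k=2)
    child = [m if rng.random() < 0.5 else f for m, f in zip(mother, father)]
    return [(not bit) if rng.random() < mutation_rate else bit for bit in child]

def select_inputs(evaluate, num_inputs, pop_size=50, generations=100,
                  mutation_rate=0.1, seed=0):
    """Search for the input mask (True = keep the input) with lowest error."""
    rng = random.Random(seed)
    population = [[rng.random() < 0.5 for _ in range(num_inputs)]
                  for _ in range(pop_size)]
    best_mask, best_error = None, float("inf")
    for generation in range(generations):
        errors = [evaluate(mask) for mask in population]
        for mask, error in zip(population, errors):
            if error < best_error:
                best_mask, best_error = list(mask), error
        if generation == generations - 1:
            break
        # Generational progression: the whole population is replaced
        fitness = rank_fitness(errors)
        population = [breed(population, fitness, mutation_rate, rng)
                      for _ in range(pop_size)]
    return best_mask, best_error
```

In the study each evaluation would mean training a network for up to 10,000 epochs, so the callback dominates the running time; the bookkeeping above is cheap by comparison.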
After building a network, the sensitivity for each input can be tested by holding all variables constant except the one variable in question; the network is then fed a series of values for this variable. How the output changes for each different input value gives us the sensitivity for this variable. The Sensitivity About the Mean test was conducted to see how important each input is relative to a particular neural network. The result for the Microsoft data set is shown below in Figure 6.1.

Figure 6.1 Sensitivity About the Mean: MSFT (x-axis: Input Name; y-axis: sensitivity, 0 to 0.35)

All the results from the above experiments were then analyzed to determine which variables to use based on their importance in all the models. From the results given, the best 9 inputs were picked. Using the models built, the number of inputs was reduced to a much more manageable size, and from there a greedy systematic test was performed to eliminate the least important variable according to a neural network. Given the 9 inputs, a network was built and tested with one input missing. This iterative approach was impressive at improving performance; however, ideally a power set should be created to pick the best inputs from every possible combination. Each network took about 20 minutes to train and test. For the original 16 inputs, the power set would contain 2^16 = 65,536 combinations, which would take 2 years, 5 months, and 27 days, and this is not feasible for this study. The network which did the best on the test was kept; thus after the first iteration there were 8 inputs left. This process was repeated with each of the 8 inputs being removed individually to find out which input was the least significant. Then that input was removed, and so on until the data set contained the 5 most significant inputs. Table 6.1.2 shows the results that determined that CCI (Consumer Confidence Index) should be removed from Microsoft's inputs.
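The greedy procedure described above, together with the cost of the exhaustive alternative, can be sketched as follows. The `score` callback is a hypothetical stand-in for training a network on a subset of inputs and returning its testing error; nothing here reproduces the actual Neuro Solutions workflow.

```python
def greedy_backward_elimination(inputs, score, keep=5):
    """Drop one input per iteration until only `keep` inputs remain.

    score(subset) returns an error (lower is better) for a network
    trained on that subset of inputs.
    """
    current = list(inputs)
    history = []
    while len(current) > keep:
        trials = []
        for candidate in current:
            subset = [name for name in current if name != candidate]
            trials.append((score(subset), candidate))
        # Keep the network that did best with one input missing
        best_error, removed = min(trials)
        current.remove(removed)
        history.append((removed, best_error))
    return current, history

def exhaustive_search_days(num_inputs, minutes_per_network=20):
    """Days needed to train one network for every subset of the inputs."""
    return (2 ** num_inputs) * minutes_per_network / (60 * 24)

print(round(exhaustive_search_days(16)))  # about 910 days for 16 inputs
```

Going from 9 inputs to 5 this way needs only 9 + 8 + 7 + 6 = 30 trained configurations, compared with 65,536 for the full power set of the original 16 inputs.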
Network Name       MSE cross validation    NMSE on Testing
without CCI        0.000521                0.00712
without close3     0.000535                0.00729
without volume     0.000527                0.00731
without high2      0.000515                0.0074
without close2     0.00053                 0.00757
without open1      0.000516                0.00765
without high1      0.000513                0.00779
without close1     0.0006                  0.00880

Table 6.1.2 Greedy Input Reduction: 8 inputs

Interestingly, the network with CCI had a performance of 0.007506063, which was worse than the network that had the input removed. Similarly, the next iteration, which chose volume to be removed, outperformed even this network with an NMSE of 0.006871094. This iteration is shown below.

Network Name       MSE cross validation    NMSE on Testing
without volume1    0.000515                0.00687
without close2     0.000518                0.00719
without high2      0.000509                0.00726
without close3     0.000520                0.00746
without high1      0.000532                0.00766
without open1      0.000524                0.00769
without close1     0.000584                0.00809

Table 6.1.3 Greedy Input Reduction: 7 inputs

The final selection of inputs using this approach was: close1, open1, high1, high2, and close3. The final performance on testing was 0.00693, which is just slightly worse than including 6 inputs. All networks in the above example were trained 3 separate times with randomly initialized weights, and the best network was then chosen by the system to perform the test on. Each run used conjugate gradient descent, 10,000 epochs, and a threshold of 500. A threshold of 500 in this study means that if the error on the cross validation set does not improve for 500 epochs, training is terminated. The network used contained 20 processing elements in layer one and 7 processing elements in the second layer. The NMSE used by Neuro Solutions is the MSE of the network divided by the MSE of a straightforward "dumb" network that picks the average value each time. Thus when NMSE is 1, the network has learned nothing about the data set, and a value of 0 means the network is perfect.
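The NMSE described above can be illustrated with a short sketch. It assumes the "dumb" baseline predicts the mean of the same targets it is scored on; whether Neuro Solutions takes that mean from the training or the testing data is not stated here, so this detail is an assumption.

```python
def mse(predictions, targets):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def nmse(predictions, targets):
    """MSE of the network divided by the MSE of always guessing the mean."""
    mean_target = sum(targets) / len(targets)
    baseline = [mean_target] * len(targets)
    return mse(predictions, targets) / mse(baseline, targets)

targets = [1.0, 2.0, 3.0, 4.0]
print(nmse(targets, targets))      # perfect network
print(nmse([2.5] * 4, targets))    # network that only predicts the mean
```

A network that reproduces the targets exactly scores 0, and one that only ever outputs the mean scores 1, matching the interpretation given in the text.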
$$ \text{NMSE}_{\text{network}} = \frac{\text{MSE}_{\text{network}}}{\text{MSE}_{\text{dumbNetwork}}} \qquad \text{(Equation 6.1)} $$

6.2 Data Reduction

There are many things to do for data clean up. First, a decision needs to be made on what to do with missing or invalid data. If there is enough data, as in this model, it is wise just to throw away those records. Both data sets were examined for missing or incomplete data and none was found. We must also decide what to do with outliers, which can throw a model off. In some cases, such as fraud detection, the outliers are the meat of the problem; however, that is not the case for our data, and thus all outliers were removed from the set. All days in which the price increased or decreased by more than 10% in a single day were removed from the database, as these outliers were most likely caused by external forces. Days in which little (less than 0.1%) or no change occurred were also removed, because an action performed by the investor would not affect their bottom line. The outliers for these models were determined to be days when trade volume is 4 times the average trade volume or more. For Microsoft this value came out to any day on which more than 105 million shares were traded. These days are most likely caused by a natural or manmade disaster, which no network would have the ability to foresee. The prime rate was also used, and a change in prime can greatly affect the market, so any day on which prime was changed was removed from the data. The daily value of the prime rate since 1947 is published on the web [25] and is updated every time the prime is changed. Removing this noisy data helps the networks obtain better predictions. Data reduction is also necessary to give a more manageable size to the data set. Manmade disasters such as 9/11 have a huge impact on the stock market, which could not have been foreseen by any algorithm. Thus the entire week following September 11, 2001 was removed from the data sets of both Intel and Microsoft.
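The filtering rules above can be expressed as a small sketch. The `Day` record and `reduce_days` function are hypothetical names introduced for illustration, and the average volume is assumed to be computed over all days before any are removed, which the text does not specify.

```python
from dataclasses import dataclass

@dataclass
class Day:
    date: str
    close: float
    prev_close: float
    volume: float
    prime_changed: bool = False

def reduce_days(days, max_move=0.10, min_move=0.001, volume_factor=4.0):
    """Drop outlier days following the rules described in Section 6.2."""
    average_volume = sum(d.volume for d in days) / len(days)
    kept = []
    for d in days:
        change = abs(d.close - d.prev_close) / d.prev_close
        if change > max_move:
            continue  # price moved more than 10% in a single day
        if change < min_move:
            continue  # little (under 0.1%) or no change
        if d.volume >= volume_factor * average_volume:
            continue  # trade volume at least 4 times the average
        if d.prime_changed:
            continue  # the prime rate changed on this day
        kept.append(d)
    return kept
```

Fixed date ranges, such as the week following September 11, 2001, would be removed by a separate date filter that is not shown here.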
The data sets used in this paper originally had data from January 1990 all the way to August 2003, and the price of the stocks had changed so much that very little would be learned by using the entire dataset. All the data prior to 1997 was removed from the data set. This provided two services: first, it reduced the size of the data set, and second, it gave a better split of increasing days to decreasing days. Studies have shown that in classification problems it is important to have equal representation of both cases in order to prevent the network from becoming biased towards the more common value, in this case increasing days. Intel's data set contained less than a 1% difference in the representation of increasing days as compared with decreasing days, thus further manipulation of the data set was not required. However, Microsoft did much better during this period and had an increase on 59.6% of the days in the sample set. In order to prevent the network from heavily favoring the increase prediction, increasing days were randomly removed from the data set until a 55/45 split was achieved. This allowed experiments to show the difference between training with a strong bias and without a bias. The last year of data, from August 1st, 2002 until July 31st, 2003, was held out of the set to be used as an unseen testing set in the simulation.

6.3 Data Transformation

Transforming the data in the most meaningful manner for the network is essential for the success of the network. The daily price was modified from a stock price to a percentage change from the previous day in an attempt to help the network better understand the data. This gets the data closer to the prediction actually being attempted: predicting whether the stock price goes up or down, not simply trying to guess the price, which we don't care about in this paper. However, after days of testing this was found to actually hurt performance, and a straight price model was then used.
All data should also be standardized or normalized so that one column of values cannot completely dominate the prediction. There are many types of normalization, such as min-max normalization, z-score normalization, and normalization by decimal scaling. Most of the better software in the industry handles the normalization for the user. The data in this paper was normalized using the min-max normalization specified in equation 2.4. Z-score normalization was also tried, and the results tended to be worse than with the much more straightforward min-max normalization; thus min-max normalization was used. Many neural network tools, such as WebStatistica and Neuro Solutions, automatically normalize the data for the user.

Chapter 7 Results

7.1 Testing Standards

The test standards listed below were used for all tests unless otherwise noted in the results. All tests on the neural networks were given 10,000 epochs to train, and cross validation was used to terminate training after 200 epochs with no improvement. Also, networks were run 3 times each with randomized starting weights to ensure the best possible network by minimizing the chance of settling in a local minimum. Neuro Solutions allows for varying a single parameter; so often, with everything else held constant, the number of hidden neurons in the first layer would vary from 10 to 50, with a step size of 2, to determine which network was best at learning the data. This type of gradual improvement was key to the success seen in the final networks, which averaged around 63% correct, much better than the original networks, which had a very dismal performance of around 53%. When a 2-layered neural network was designed, the second layer would contain the ceiling of the log of the number of neurons in the first layer. All networks used conjugate gradient descent unless otherwise noted. The transfer function used for all networks was TanhAxon.
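Two of the standards above lend themselves to a short sketch: the sizing rule for the second hidden layer and the cross validation stopping rule. The base of the logarithm is not stated in the text; base 2 is assumed here because it reproduces layer pairs reported in the results (for example 24 and 5, or 48 and 6). The `run_epoch` callback is a hypothetical stand-in for one training epoch that returns the cross validation error.

```python
import math

def second_layer_size(first_layer_pes):
    """Ceiling of the log (assumed base 2) of the first-layer size."""
    return math.ceil(math.log2(first_layer_pes))

def train_with_early_stopping(run_epoch, max_epochs=10_000, patience=200):
    """Stop once the cross validation error has not improved for `patience` epochs."""
    best_error, best_epoch, epochs_run = float("inf"), -1, 0
    for epoch in range(max_epochs):
        error = run_epoch(epoch)
        epochs_run = epoch + 1
        if error < best_error:
            best_error, best_epoch = error, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_error, best_epoch, epochs_run

# Candidate first-layer sizes from 10 to 50 in steps of 2, as in the text
candidates = [(pes, second_layer_size(pes)) for pes in range(10, 51, 2)]
print(candidates[:3])
```

In practice each candidate would be trained 3 times with different random weights and the best run kept, as described above.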
The data split for the networks was 65% training, 15% cross validation, and 20% testing, except in the simulation, when the test set was the entire year. The training set was broken up as 80% training and 20% cross validation. Networks given to FuzzyCope used min-max normalization.

7.2 Conjugate Gradient vs. Back Propagation

The original neural network designed used the very straightforward method of back propagation. The error would filter back through the network based on the learning rate and momentum of the network. Several new ideas have been published to provide faster learning and also to provide better results for the networks. This paper examines the conjugate gradient descent and back propagation algorithms. Figure 7.2 shows that it takes about one-third the amount of time to train a conjugate gradient network as a back propagation network. Much of the improvement in speed was because conjugate gradient networks would normally cross the threshold (set at 200 epochs without improvement) before ever reaching the epoch limit of 10,000. An exact time cannot be given, since the weights are randomly initialized and the picking of the weights can greatly affect the convergence of the network. For example, it took just over an hour to train a series of conjugate gradient networks with the number of processing elements varying from 20 to 50 in a single layer, with each network built 3 times.

Figure 7.2 Conjugate vs. Back Propagation (x-axis: Epochs, 0 to 12,000; runs shown: Back Propagation 1, MSE 0.000363; Back Propagation 2, MSE .0000349; Back Propagation 3, MSE 0.000361; Conjugate 1, MSE 0.000363; Conjugate 2, MSE 0.000344; Conjugate 3, MSE 0.000330)

The same test, with back propagation, takes about 4 hours. The chart below shows the performance of both back propagation and conjugate gradient, both using the default learning rates provided in Neuro Solutions.
It is clear that conjugate gradient is better in almost every category in this regression test.

Network                       MSE Error   Overall Correct   Decrease   Increase
Back Propagation
30 PE* - 5 PE**               0.3725      59.98%            41.01%     78.72%
40 PE* - 6 PE**               0.3748      59.10%            41.60%     76.41%
36 PE*                        0.3702      61.14%            36.92%     85.08%
Conjugate Gradient Descent
24 PE - 5 PE                  0.3702      59.98%            37.50%     82.19%
48 PE - 6 PE                  0.3689      59.40%            35.16%     83.35%
22 PE                         0.3745      62.59%            42.18%     82.77%

*Step size 1.0 first layer. **Step size 0.1 second layer. Momentum 0.7 as default. Random Set 1.

Table 7.2 Conjugate Gradient vs. Back Propagation

7.3 Takagi Sugeno Neuro Fuzzy Inference System Tests

Neuro Solutions provides an ANFIS training kit, which they define as follows: "The ANFIS (CoActive NeuroFuzzy Inference System) model integrates adaptable fuzzy inputs with a modular neural network to rapidly and accurately approximate complex functions. Fuzzy inference systems are also valuable as they combine the explanatory nature of rules (membership functions) with the power of "black box" neural networks." [22] The ANFIS package is not part of the educator version and thus it could be tested only in evaluation mode, limiting the number of exemplars to 300. This greatly reduced the network's ability to learn, as the other networks had about 850 exemplars to be trained on. In order to make a comparison between a straight neural network and the ANFIS, the neural network was trained with the exact same restrictions. The results for both were mediocre at best. The significance here is simply to show that the Takagi Sugeno Neuro Fuzzy system was superior to the neural network in learning the signal, and this also shows the extreme complexity of this network. The networks used 5 inputs: price yesterday, price a week ago, volume yesterday, high price yesterday, and high price 2 days ago. The last two inputs were discovered in Neural Trading Solutions, a powerful stock market predictor software package available for commercial use [22].
To show the extreme complexity of these networks, a rough approximation of time is given. It should be noted that many of the networks terminated, due to the cross validation set not having any improvement for 500 epochs, well before ever reaching the 10,000 epoch maximum. The time complexity of using an ANFIS is its worst aspect. It is simply impossible to test hundreds of different weights for a network when it takes 2 days to train a single network on one of today's fastest machines.

Type of Network      NMSE on testing   % Correct   Time
3 MFs per input      0.003811          53.67%      30 min
4 MFs per input      0.004515          56.00%      4 hours
5 MFs per input      0.005447          55.33%      22 hours
NN 20-7              0.006474          55.00%      1 min
NN10                 0.008856          56.67%      15 sec
NN9                  0.006148          55.33%      15 sec
Randomize Records
3 MFs per input      0.004685          55.33%      30 min

300 exemplars, Conjugate Gradient, TSK, Bell MF, Axon Transfer, 10,000 epochs, 500 threshold, cross validation

Table 7.3 Takagi Sugeno Neuro Fuzzy vs. Neural Networks

7.4 Mamdani Neuro Fuzzy Tests

The results reported by FuzzyCope3 were very poor. The software, simply put, lacks the power to learn such a complex network. It suffers from being unable to get out of a local minimum, and thus hundreds of similar networks must be built. The software must then be reloaded to clear its memory so it does not find the same local minimum. This technique was able to find a good solution for the problem using 3 membership functions per input. However, the problem never converged when 4 and 5 membership functions were used per input, despite around 100 attempts for both networks using varying weights. NMSE in these tests means the mean of all the squared errors on the normalized data set, in which the data is in the range [0, 1]. The results are shown below.
MF per input   Step         Momentum   Epochs       RMS Training   NMSE Testing   % Correct
3              0.2          0.8        1000         0.0232         0.0414         53.31%
3              0.2          0.8        2000         0.0230         0.0412         53.01%
4              0.2          0.8        2000         0.2059         0.4792         NA
5              0.2          0.8        2000         0.2059         0.4792         NA

               Population   Min, Max   Generation   RMS Training   NMSE Testing   % Correct
3 MF - GA      50           20,20      50           0.0228         0.0412         53.01%

GA settings: Cross Over Point: Uniform; Fitness Normalization: Elitism; Selection: Tournament

Table 7.4 Mamdani Neuro Fuzzy Results

The error for both 4 and 5 membership functions per input is actually 10 times worse than that of the network with 3 membership functions. This is caused by an error in FuzzyCope3, which is unable to converge for these networks.

7.5 Classification vs. Regression

This fundamental question comes down to when to translate the data: before or after submitting it to the neural network. When performing regression testing, all 5 inputs are given to the network, and the output is the stock price for the next day in dollars and cents. Once a test is completed, the predicted price is compared with yesterday's price to see if the price increased or decreased. The actual price is compared with yesterday's price to determine if it increased or decreased. If both comparisons produce the same output, then it is determined that the network correctly predicted the value. Classification is much more straightforward. Before presenting the data to the network, the output is translated to 1 for increase or 0 for decrease, and the objective of the network is to correctly predict 0 or 1.

Network          Min MSE Cross   NMSE Testing   Testing split   Correct   Decrease   Increased
Regression
28 PE - 5 PE     0.0004663       0.006659       50.29%          57.94%    37.50%     78.14%
42 PE - 6 PE     0.0004583       0.006789       50.29%          52.71%    31.65%     73.52%
26 PE            0.0004918       0.007338       50.29%          54.45%    36.33%     72.36%
Classification
24 PE - 5 PE     0.3702          NA             50.29%          59.98%    37.50%     82.19%
48 PE - 6 PE     0.3689          NA             50.29%          59.40%    35.16%     83.35%
22 PE            0.3745          NA             50.29%          62.59%    42.18%     82.77%

Table 7.5 Classification vs. Regression

7.6 Random Data Sets

When dealing with random sets, it is often very important to show that the data can actually be learned at the shown rate for more than just a single set, as a set chosen at random could have been lucky. Showing the same tests for different randomly selected data sets shows the results are realizable. The Microsoft data set used in the previous two sections was randomized three different times, and networks were built for each data set using the same standards discussed in previous sections to ensure the best possible networks were created.

Figure 7.6 Random Data Sets (chart title: Composite of All Sets; series: Training, Cross, Testing; x-axis: Random Sets, Set 1 to Set 3; y-axis: percentage)

It is important that the training set and the test set have similar composition in order for the network to perform well. Set 1 has about a 6% difference in the makeup of training and testing of the randomly selected rows, making prediction much more difficult.

Network                  MSE        Correct   Decrease   Increase
Set 1
22 PE                    0.3745     62.59%    42.18%     82.77%
24 PE - 5 PE             0.3702     59.98%    37.50%     82.19%
48 PE - 6 PE             0.3689     59.40%    35.16%     83.35%
Set 2 Classification
34 PE                    0.3920     55.91%    16.21%     90.03%
20 PE - 5 PE             0.3924     58.23%    10.40%     95.92%
38 PE - 6 PE             0.3905     59.40%    26.27%     87.86%
Set 3 Classification
42 PE                    0.379365   60.85%    36.33%     80.49%
32 PE - 5 PE             0.382199   59.98%    36.99%     78.39%
54 PE - 6 PE             0.380529   60.85%    37.64%     79.44%

Conjugate Gradient, MSFT

Table 7.6 Classification across 3 Random Sets

The importance of the table above is to show that performance is comparable across all sets. Thus the results can be considered reliable for any random grouping of the dataset.

7.7 Even Split in Classification

Studies have also shown that in classification it is important to have a similar split so that the network does not become biased. The data consists of MSFT stock prices from 1997 onward, randomized, with 55% of the total set representing increasing days.
In this study the inputs we use show the difference in performance between an evenly split training set and one that is not evened out. The results are below.

Network                       MSE Error   Correct   Decrease   Increase
Classification:
24 PE - 5 PE                  0.37022     59.98%    37.50%     82.19%
48 PE - 6 PE                  0.36889     59.40%    35.16%     83.35%
22 PE                         0.37448     62.59%    42.18%     82.77%
Classification Even Split:
22 PE - 5 PE                  0.37520     58.23%    57.97%     58.49%
46 PE - 6 PE                  0.37578     58.23%    61.48%     55.02%
18 PE                         0.38068     61.14%    59.14%     63.12%

Table 7.7 Classification with Even Split

These findings are consistent with studies published on dataset splits. The overall performance is slightly better when the training set is not made an even split; however, this creates a big bias even when the data is split 55/45. As you can see in Table 7.6, the network that was not evenly split got over 80% correct for days that increased in value and less than 40% for days that decreased in value. However, the evenly split data was able to score around 60% correct, regardless of whether the predicted day is increasing or decreasing.

Chapter 8 Simulations and Discussions

The best possible neural networks were designed using only data from January 1997 until July 2002. Data from these sets (MSFT and INTC) were removed in the manner mentioned in section 6.2. One additional step was taken after the results were shown to be good. A regression based neural network predicted all values in the training period, and the results were compared with the actual price. Any day with a prediction error of more than 5% was removed from the training data set. These days were outliers and most likely caused by external forces. Each model was given $100 at the beginning of the testing period, August 1, 2002, and the model bought if it predicted an increase with over a 50% certainty. The model would hold onto the stock until it came to a day that had an increase certainty of less than 50%, and then it would sell the stock. The table below shows an example of this strategy.
Increase Output   Action   # of shares   Cash
0.460738          STAY     0             $ 100.00
0.617069          BUY      4.411116      $ -
0.566166          HOLD     4.411116      $ -
0.551663          HOLD     4.411116      $ -
0.595844          HOLD     4.411116      $ -
0.521784          HOLD     4.411116      $ -
0.489340          SELL     0             $ 106.93
0.552275          BUY      4.483247      $ -
0.478012          SELL     0             $ 107.69
0.651398          BUY      4.617822      $ -
0.485218          SELL     0             $ 113.78
0.557304          BUY      4.612206      $ -
0.479946          SELL     0             $ 114.29
0.461905          STAY     0             $ 114.29

Table 8.1 Profit Simulation

A similar model was also built which earned the prime rate for any money that was not invested in the stock; this resulted in only marginal improvement for all models.

The results from the simulation were astonishing. Profit was improved by as much as 889% over a straightforward buy and hold strategy. The figure below shows how profit increased dramatically by using the neural network to make decisions.

Figure 8.1 Microsoft Profit Simulation (chart title: Microsoft Profit Simulation: $100 investment; series: Profit w/ Prime, Profit; models: Buy and Hold, 18 PE - 5 PE, 36 PE - 6 PE, 44 PE; y-axis: dollars)

The model which was created with 44 processing elements in one layer was able to make $103.17 in a one-year period with an initial investment of $100.00. For comparison, if the same money had bought stock at the beginning of the period and sold it at the end of the period, it would have made a profit of $10.43; this is referred to as a buy and hold strategy. This model predicted the direction of the stock correctly 63% of the time. Despite the lackluster performance, the model was able to predict correctly when it counts most. The biggest drawback of such a scheme in the real world is commission. The model mentioned above bought and sold stock a total of 143 times during the 252 trading days in the simulation period. Figure 8.2 shows the total number of trades for each model; the Microsoft models, which were much more accurate, traded more often.
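The buy, hold and sell rule used in the simulation can be sketched as follows. The prices in the usage example are hypothetical, since Table 8.1 lists only the network outputs, share counts and cash; a certainty of exactly 50% is treated here as "not an increase", which the text leaves unspecified, and commission and interest on idle cash are ignored.

```python
def simulate(prices, increase_outputs, cash=100.0, threshold=0.5):
    """Buy when certainty of an increase is over the threshold, sell otherwise.

    prices[i] is the price at which day i's action executes and
    increase_outputs[i] is the network's certainty of an increase.
    """
    shares = 0.0
    actions = []
    for price, certainty in zip(prices, increase_outputs):
        if certainty > threshold:
            if shares == 0.0:
                shares, cash = cash / price, 0.0  # invest all available cash
                actions.append("BUY")
            else:
                actions.append("HOLD")
        elif shares > 0.0:
            cash, shares = shares * price, 0.0    # liquidate the position
            actions.append("SELL")
        else:
            actions.append("STAY")
    return actions, cash, shares

# First seven outputs from Table 8.1 with made-up prices
outputs = [0.460738, 0.617069, 0.566166, 0.551663, 0.595844, 0.521784, 0.489340]
prices = [10.0, 10.0, 11.0, 12.0, 11.0, 12.0, 12.0]
print(simulate(prices, outputs))
```

Raising `threshold` above 0.5 would make the model trade less often, which is one way the number of transactions, and therefore commission, could be reduced.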
Figure 8.2 Transactions (number of transactions per model; Intel: 22 PE - 5 PE, 36 PE - 6 PE, 17 PE; Microsoft: 18 PE - 5 PE, 36 PE - 6 PE, 44 PE)

The prediction on Intel's stock was inferior in all cases. Two of the 3 models were able to beat the buy and hold strategy. The only plus side of Intel's performance is that it minimized the number of transactions, thus it would incur less commission if it were actually implemented. The table below shows the best models picked by the lowest MSE on the cross validation set for both Intel and Microsoft.

Network         MSE Training   MSE Cross   Correct   Profit
MSFT 44 PE      0.3959         0.3968      63.32%    $ 103.17
INTEL 17 PE     0.3957         0.3972      54.98%    $ 49.23

Table 8.2 Profit on MSFT vs. INTC

The best model for Intel increased profit over the buy and hold strategy by 55%. The graph below shows the results of all 3 models on Intel's data.

Figure 8.3 Intel Profit Simulation (chart title: Intel Profit Simulation: $100 investment; series: Profit w/ Prime, Profit; models: Buy and Hold, 22 PE - 5 PE, 36 PE - 6 PE, 17 PE; y-axis: dollars, $10.00 to $50.00)

Chapter 9 Conclusion

Predicting stocks on a daily basis is very difficult for even the most advanced AI techniques. The Hurst exponent confirmed the hypothesis that prediction is possible but especially difficult. Surprisingly, the Consumer Confidence Index and the prime rate were not able to improve the predictability of these networks. Through the use of many techniques, it is possible to correctly predict the direction of the stock 63% of the time for a large company like Microsoft. The study demonstrated that the Takagi Sugeno (TSK) Neuro Fuzzy system was able to produce a much better result than a pure neural network when given the same training set. The TSK Neuro Fuzzy system requires a great deal of processing power, and a network with five membership functions per input took as long as 22 hours to train.
The Takagi Sugeno Neuro Fuzzy model was superior in prediction to the Mamdani Neuro Fuzzy Inference System, and both networks required about the same amount of time to train. Genetic Algorithms are a very efficient way of determining which inputs are the most valuable, and by removing unnecessary inputs the performance of the network can be increased. Genetic Algorithms provide a "good enough" solution when searching the entire search space could take years of CPU time on even the best system. Genetic Algorithms were used to fine tune the membership functions in the Mamdani Neuro Fuzzy system, and the result was a marginal improvement in the RMSE; however, it didn't improve the percentage of directional correctness, which was the goal of this study. The GA-Mamdani didn't perform as well as the ANFIS.

The neural networks using conjugate gradient descent were able to achieve 63% correct on the test set; thus this research provides the groundwork for a great deal of profit. Even the worst model used for Microsoft produced a return on investment of 66%, and the best network scored an astounding 103% return. This study also demonstrated that picking the correct stock is as important as building the best network (as the best network for Intel was outperformed by the worst network on Microsoft data). The best network for Intel scored a mediocre return on investment of just under 55%. The biggest downfall of these networks was that the transaction cost of buying and selling stocks would be very high. However, it would be feasible to fine tune the buy and sell strategy to lower this cost.

Chapter 10 Future Work

There are many areas in which this research can be extended. The main area of interest is the stock market simulation. It is clear from all the tests that the networks were able to learn the pattern in Microsoft's data much more easily than Intel's data; thus it might be possible that another stock is more learnable than Microsoft.
More stocks could be picked to determine which have the most learnable pattern. There also exist many different types of networks, such as Support Vector Machines, Generalized Feed Forward, Jordan/Elman Networks, ANFIS (attempted, but the software version used limited its ability), and Time Lag Recurrent Networks. Any one of these might actually outperform the neural networks designed here; the only way to be sure is to design all the networks and see which does the best. Similarly, there are many different transfer functions, and trying many different combinations might also improve performance. Each problem was given well over a hundred networks before picking the best network; however, there are an infinite number of networks, and thus many more networks could be built in the hope of finding a more optimal solution. The data sets themselves might not be optimized for learning, and thus trying more than just 3 random data sets could improve performance. The input space could be searched more thoroughly. The original 16 inputs were cut down to 5 inputs using multiple techniques such as Genetic Algorithms and an iterative approach. However, all 65,536 combinations could be checked to find the best network for each approach. The simulation was performed using a static model, and research has been done showing that a moving model that is updated each day can outperform a static model. Thus it might be possible that, if a new neural network were built every day given the most recent information, performance in the simulation could improve. The cost of commission is omitted from this research, as profit is not the goal of this study. However, if an actual real world model were to be used for making profit, it would be necessary to modify the entire model to consider commission and thus reduce the number of trades performed by the network. Many of the networks had 150 trades over the 252 days the market was open during the testing period.
It would be very advisable to set a threshold so that the predicted action is not performed unless its likelihood is much better than just 50%.

One last method for improving network performance would be to remove more outliers from the data, in different combinations. All data points in the training and cross validation sets that were missed by more than 5% by a well trained regression network were removed. The value of 5% should be varied many times to find an ideal value for learning.

From all the above recommendations it is clear that this problem is intractable, with so many combinations, and that not all possible networks could ever be built. So we must be satisfied with a good enough result.

Bibliography

1. Abraham, A. Recent Advances in Intelligent Paradigms, Studies in Fuzziness and Soft Computing. Ed. Ajith Abraham, Lakhmi C. Jain, and Janusz Kacprzyk. Physica-Verlag (2002): 1-28.
2. Abraham, A., Philip, N. S., and Saratchandran, P. "Modeling Chaotic Behavior of Stock Indices Using Intelligent Paradigms." Neural, Parallel and Scientific Computations. 11 (2003): 143-160.
3. Abraham, A. "It is Time to Fuzzify Neural Networks." Intelligent Multimedia, Computing and Communications: Technologies and Applications of the Future. (2001): pp. 253-263.
4. Abraham, A. "Meta Learning Evolutionary Artificial Neural Networks." Neurocomputing Journal, Elsevier Science, Netherlands, (2003) (forthcoming).
5. Bartos, F.J. "Motion Control Tunes into AI Methods." Control Engineering 46, No. 5, May (1999).
6. Brownstone, D. "Using Percentage Accuracy to Measure Neural Network Predictions in Stock Market Movements." Neurocomputing. 10 (1996): 237-250.
7. Castellano, G., Castiello, C., Fanelli, A.M., and Giovannini, M. "A Neuro-Fuzzy Framework for Predicting Ash Properties in Combustion Processes." Neural, Parallel and Scientific Computations. 11 (2003): 69-82.
8. Chen, A.S., Leung, M.T., and Daouk, H. "Application of Neural Networks to an Emerging Financial Market: Forecasting and Trading the Taiwan Stock Index." Computers and Operations Research. 30 (2003): 901-923.
9. Feder, Jens. Fractals. Plenum Press, New York, New York, 1988.
10. Fuzzy Cope 3. http://kel.otago.ac.nz/software/FuzzyCOPE3/, 11/2003.
11. Garver, M. S. "Try New Data Mining Techniques." Marketing News. Volume 36, Issue 19, Sept (2002).
12. Grossglauser, M. and Bolot, J.C. "On the Relevance of Long-Range Dependence in Network Traffic." ACM SIGCOMM '96. August (1996).
13. Harris, C.J. and Hong, X. "Neuro-Fuzzy Network Model Construction Using Bezier Bernstein Polynomial Functions." IEEE Proc D Control Theory and Application, in Press. 47 (2000): 337+.
14. Hurst, H.E., Black, R.P., and Simaika, Y.M. Long-Term Storage: An Experimental Study. London, England: Constable, 1965.
15. Izumi, K. and Ueda, K. "Analysis of Exchange Rate Scenarios Using an Artificial Market Approach." Proceedings of the International Conference on Artificial Intelligence. 2 (1999): 360-366.
16. Jang, J.S.R. "ANFIS: Adaptive-Network-Based Fuzzy Inference System." IEEE Transactions on Systems, Man, and Cybernetics. 23 (1993): 665-684.
18. Jang, J.S.R., Sun, C.T., and Mizutani, E. Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence. New Jersey: Prentice Hall, 1997.
19. Kuo, R.J., Chen, C.H., and Hwang, Y.C. "An Intelligent Stock Trading Decision Support System through Integration of Genetic Algorithm Based Fuzzy Neural Network and Artificial Neural Network." Fuzzy Sets and Systems. 118 (2001): 21-24.
20. Labbi, A. and Gauthier, E. "Combining Fuzzy Knowledge and Data for Neuro-Fuzzy 1/2 (1997).
21. Mamdani, E.H. and Assilian, S. "An Experiment in Linguistic Synthesis with a Fuzzy Logic Controller." International Journal of Man-Machine Studies. No. 1, pp. 1-13, (1975).
22. Neuro Solutions. http://www.nd.com. 11/2003.
23. O'Brian, T. V. "Neural Nets for Direct Marketers." Marketing Research. Volume 6, Issue 1, Dec (1994).
24. Petrovic-Lazarevic, Sonja, and Abraham, A. "Hybrid Fuzzy-Linear Programming Approach for Multi Criteria Decision Making Problems." Neural, Parallel and Scientific Computations. 11 (2003): 53-68.
25. Prime Rate. http://research.stlouisfed.org/fred2/data/PRIME.txt. 11/2003.
26. Quah, T.S. and Srinivasan, B. "Improving Returns on Stock Investment through Neural Network Selection." Expert Systems with Applications. 17 (1999): 295-301.
27. Sugeno, M. "Industrial Applications of Fuzzy Control." Elsevier Science Pub. Co. (1985).
28. Tsaih, R., Hsu, Y., and Lai, C.C. "Forecasting S&P 500 Stock Index Futures with a Hybrid AI System." Decision Support Systems. 23 (1998): 161-174.
29. Turban, E. and Aronson, J. E. Decision Support Systems and Intelligent Systems. Delhi, India: Pearson Education, Inc., 2001.
30. Weka. http://www.cs.waikato.ac.nz/ml/weka. 11/2003.
31. Yao, J.T. and Tan, C.L. "A Study on Training Criteria for Financial Time Series Forecasting." Proceedings of International Conference on Neural Information Processing. Nov. 2001: 772-777.
32. Yao, J. and Poh, H.L. "Forecasting the KLSE Index Using Neural Networks." IEEE International Conference on Neural Networks. 2 (1995): 1012-1017.
33. Yao, J., Tan, C.L., and Poh, H.L. "Neural Networks for Technical Analysis: A Study of KLCI." International Journal of Theoretical and Applied Finance. 2 (1999): 221-241.
34. Yao, J. and Poh, H.L. "Equity Forecasting: A Case Study on the KLSE Index." NNCM '95 (3rd International Conference on Neural Networks in the Capital Markets). Oct (1995): 341-353.
35. Zadeh, L. A. "Fuzzy Sets." Information and Control. June (1965), 8(3): 338-353.
VITA

Brent Arthur Doeksen

Candidate for the Degree of

Master of Computer Science

Thesis: Predicting Financial Markets Using Neuro Fuzzy Genetic Systems

Major Field: Computer Science

Biographical:

Personal Data: Born in Stillwater, Oklahoma, on November 14, 1976, the son of Gerald and Cheryl Doeksen.

Education: Graduated from Stillwater High School, Stillwater, Oklahoma, in May 1995; received Bachelor of Science degree in Computer Science and Mathematics from Oklahoma State University, Stillwater, Oklahoma, in December 1999. Completed the requirements for the Master of Science degree with a major in Computer Science at Oklahoma State University in December 2003.

Experience: Brent has been a professional software developer since 199 and has worked for several companies across the United States: Sabre Inc., J.B. Hunt, Philip Morris, OneOK, and Williams Communications Group.





