



PROBABILISTIC GRAPHIC MODELS FOR SPORTS VIDEO MINING: HYBRID GENERATIVE-DISCRIMINATIVE APPROACHES

By YI DING

Bachelor of Science, Xi'an University of Technology, Xi'an, China, 2002
Master of Science, Xidian University, Xi'an, China, 2005

Submitted to the Faculty of the Graduate College of Oklahoma State University in partial fulfillment of the requirements for the Degree of DOCTOR OF PHILOSOPHY, December, 2010

COPYRIGHT (c) By YI DING, December, 2010

Dissertation Approved:
Dr. Guoliang Fan, Dissertation Advisor
Dr. Carl D. Latino
Dr. Damon Chandler
Dr. Douglas Heisterkamp
Dr. Mark E. Payton, Dean of the Graduate College

ACKNOWLEDGMENTS

I would like to offer my gratitude to all who have made this dissertation possible! First, I am heartily thankful to my supervisor, Dr. Guoliang Fan, for his invaluable guidance and vision that directed me toward my research goals. I have benefited so much from his tireless help in both my research and my daily life. I feel extremely lucky to have had the opportunity to work with him over the past five years. Meanwhile, I wish to acknowledge Prof. Laiyao Fan, my advisor at Xidian University, for his continuous kind advice and inspiration. I would like to thank the members of my dissertation committee: Dr. Carl D. Latino, Dr. Damon Chandler, Dr. Douglas Heisterkamp, and Dr. Xiaolin Li, as well as other mentors, Dr. Jianping Fan and Dr. Jiebo Luo, who have guided me through my research. I appreciate their guidance and encouragement. Many thanks to all of my colleagues in the Visual Computing and Image Processing Lab (VCIPL), especially Cheng Chen, Xin Fan, Xin Zhang, Vijay Bhaskar Venkataraman, John Davis, Andy Hammett, and Bryan T. Wright. Thank you all for your suggestions, discussions, help, and friendship. I would also like to thank all my friends in Stillwater; you have made my five years of Ph.D. life in Stillwater more colorful.
Lastly but most importantly, I want to express my deepest gratitude to my parents, He Ding and Li Gao, for their everlasting love, continuous encouragement, and unconditional support during my Ph.D. study, even when they were in a difficult situation. I am also grateful to my girlfriend, Xiaojing Yang, whose dedication, support, and understanding fill my life with love and happiness. I love you!

TABLE OF CONTENTS

1 INTRODUCTION
1.1 Motivation
1.2 Research Goals and Challenges
1.3 Our Approaches and Contributions
1.3.1 A three-layer semantic representation
1.3.2 Inferring mid-level keywords by generative graphical models
1.3.3 Inferring high-level semantics by discriminative graphical models
1.3.4 Our Contributions
1.4 Outline
2 LITERATURE REVIEW
2.1 Probabilistic Graphical Models
2.2 Generative Models
2.3 Discriminative Models
2.4 Comparisons
2.5 Applications in Video Mining
3 PROBLEM FORMULATION
3.1 Semantic Space
3.2 Low-level Visual Features
3.3 Mid-level Keywords
3.4 High-level Semantics
3.5 Two Inference Problems
4 LOW-LEVEL VISUAL FEATURES
4.1 Background
4.2 Color/Landmark-Based Visual Features
4.3 Motion-Based Visual Features
4.4 Feature Evaluation
5 HMMS FOR KEYWORDS DETECTION
5.1 Background
5.2 Traditional HMMs
5.2.1 Model Specifications
5.2.2 Learning and Inferences
5.2.3 Experimental Results
5.3 Embedded HMM-GMMs
5.3.1 Model Specifications
5.3.2 Learning and Inferences
5.3.3 Experimental Results
5.4 Segmental HMMs (SHMMs)
5.4.1 Model Specifications
5.4.2 Learning and Inferences
5.4.3 Experimental Results
5.5 Discussions
6 MULTI-CHANNEL SHMM FOR KEYWORDS DETECTION
6.1 Background
6.2 Model Specifications
6.3 Learning and Inferences
6.4 Experimental Results
6.5 Discussions
7 CONDITIONAL RANDOM FIELDS FOR EVENT DETECTION
7.1 Background
7.2 Model Specification
7.3 Learning and Inferences
7.4 Experimental Results
7.5 Discussions
8 SEGMENTATION CRFS FOR GAME FLOW ANALYSIS
8.1 Background
8.2 Segmentation CRFs (SCRFs) for Game Flow Analysis
8.2.1 Model Specifications
8.2.2 Learning and Inferences
8.3 Missing Data Handling by Auxiliary SCRFs (ASCRFs)
8.3.1 Model Specification
8.3.2 Learning and Inferences
8.3.3 Experimental Results
9 CONCLUSIONS
BIBLIOGRAPHY

LIST OF TABLES

2.1 The comparison between the discriminative and generative models
4.1 Rank of information gain (I.G.)
5.1 Single-shot classification results on camera views
5.2 Shot-based classification results of HMM-based algorithms for nine 30-min NFL videos
5.3 Shot-based classification results of seven algorithms for eight 30-min NFL videos
5.4 Shot-based classification results of eight algorithms for nine 30-min NFL videos
6.1 The model settings for all testing methods
6.2 Shot-based classification results of nine algorithms for nine 30-min NFL videos
7.1 High-level event detection by rule-based and CRF-based approaches

LIST OF FIGURES

1.1 The increasing demand for video mining
1.2 Examples of two major categories of video data: news, a scripted video (top); football, a non-scripted video (bottom)
1.3 Physical representation of the video data
1.4 The first challenge: the semantic gap
1.5 The second challenge: a unified representation
1.6 The outline of the dissertation
2.1 Structures of discriminative models and generative models
2.2 Generative models: variations of the DBN: Hidden Markov Model (HMM), Coupled HMM (CHMM), Factorial HMM (FHMM), and Hierarchical HMM (HHMM). The white and black nodes represent the states and observations, respectively
2.3 Discriminative models: variations of the conditional random field (CRF): the Conditional Random Field (CRF), the Dynamic CRF (DCRF), and the Hidden CRF (HCRF)
3.1 The three-layer semantic space
3.2 Visual features in football video: (a) play type analysis; (b) camera view analysis
3.3 Two semantic keywords and their dynamics in American football video, i.e., camera views (top) and play types (bottom)
3.4 A partial game flow in an American football game that contains three drives (a scored drive, a non-scored drive, and a turnover), where each drive includes a series of annotated plays
4.1 Play ground detection result
4.2 Yard line detection result
4.3 Distributions of shot-based (top) and frame-based (bottom) D_ave with respect to four camera views
5.1 Hidden Markov Models (HMMs)
5.2 A new generative model
5.3 The segmental hidden Markov model (SHMM)
6.1 In the MCSHMM, the first layer includes two SHMMs, and the second layer captures the interaction between two channels
6.2 The experimental results on synthetic data
6.3 A proposed semantic analysis paradigm for sports video mining
6.4 Comparison of the state transitions between the ground truth (top) and learning results (bottom). The first three are for the MCSHMM defined in (6.3), (6.4), and (6.5), and the last one for the CHMM: (a) A1 (4x64); (b) A2 (4x64); (c) A3 (16x16); (d) full state transitions of a CHMM (16x16)
7.1 The second inference problem
7.2 From mid-level semantic keywords to high-level semantics
7.3 (a) CRF model; (b) CRF for event detection. S: touchdown; M, R, E: central, right, and end-zone camera views; L, S, G, K: long play, short play, field goal, and kick; LR, RL: possessions from left to right and from right to left
7.4 Examples of the touchdown in terms of individual mid-level semantic keywords
8.1 An example of a scored drive in football game flow (the football stadium background is from NFL.com)
8.2 (Left) SCRF and (Right) ASCRF
8.3 The experimental results for four models. Left: the segmentation accuracy; right: the overall recognition accuracy
8.4 The comparison of the matching scores in different drives of two feature functions in ASCRF. Integrity template: (a) scored, (b) turnover, (c) non-scored; similarity score: (d) scored, (e) turnover, (f) non-scored

CHAPTER 1

INTRODUCTION

1.1 Motivation

Video mining is the discovery of knowledge, patterns, and events, collectively called semantics, in video data that is stored in databases, data warehouses, or other online repositories [1, 2]. Its benefits range from efficient browsing and summarization of video content to facilitating video access and retrieval. As shown in Fig. 1.1, driven by the ever-increasing needs of numerous multimedia and online database applications, there is a growing demand for efficient tools that allow us to quickly find the video data and information of interest.

Figure 1.1: The increasing demand for video mining.

A book offers its readers two main tools, the Table of Contents (TOC) and the Index, which provide an overview and a quick way to find information of interest. One goal of video mining is to allow a viewer to access video data much like reading a book. Correspondingly, there are three main research issues for video mining: summarization/abstraction [3], browsing/skimming [4], and indexing/retrieval [5, 6]. These three research aspects facilitate the analysis of video content and help us build the TOC and Index for the video data.

According to their production and editing styles, videos can be classified into two main categories, scripted and non-scripted videos [7], which are usually associated with different video mining tasks. Scripted videos are produced or edited according to a predefined script or plan. News and movies are typically highly scripted videos that are composed of predefined segments or episodes.
Building a TOC [8] is more suitable for scripted videos, as it supports efficient browsing and indexing. In non-scripted videos, on the other hand, events happen spontaneously and usually in a relatively fixed setting, as in meeting, sports, and surveillance videos. For non-scripted videos, the key question is how to detect the highlights or events of interest.

Sports video mining has been widely studied due to its great commercial value [9, 10, 11, 4]. Although (unedited) sports videos are considered non-scripted, they usually have a relatively well-defined structure (such as the field scene) or repetitive patterns (such as a certain play type), which could help us enhance the scriptedness of sports videos for more versatile and flexible access. In this dissertation, non-scripted video, i.e., sports video, is studied using American football video as a case study. The proposed framework for sports video mining can also be applied to other types of sports video and to other non-scripted or scripted video data.

1.2 Research Goals and Challenges

Traditionally, in online video services and general multimedia applications, video data is parsed and stored as frames, shots, and video sequences according to the physical representation shown in Fig. 1.3, and then labeled manually or based on some predefined rules for specific types of video data. However, such labeling is insufficient or ineffective because of the difference between how humans perceive the content of the video data and how the editor organizes the video content physically.

Figure 1.2: Examples of two major categories of video data: news, a scripted video (top); football, a non-scripted video (bottom).

In general, there are two kinds of video mining approaches: event-based [12, 13] and structure-based [9, 14, 15].
Event-based approaches detect events of interest or highlights, e.g., goals in a soccer game, which enhances the semantic understanding of video content of interest. However, this kind of approach is usually task-dependent and designed for a specific purpose. In contrast, structure-based approaches can parse a long video sequence into individual segments, e.g., plays and breaks, which can be used as an overall video semantic representation. Although structure-based approaches provide limited semantics, they can support further high-level semantic analysis. Due to the complementary nature of events and structures, researchers have proposed new approaches that integrate the two kinds of approaches in one video mining framework. For example, a mid-level representation framework was proposed for semantic sports video analysis in which both temporal structures and an event hierarchy were involved [11], and a mosaic-based generic scene representation was developed from video shots and used to explore both events and structures [1].

Figure 1.3: Physical representation of the video data (a sports video sequence parsed into shots 1, ..., N).

Our goal is to address both event detection and structure exploration in a unified multi-level video mining framework. The unified framework should be capable of both mid-level keyword detection and high-level semantic analysis, and should support a low-level to high-level video semantic description, which will facilitate the discovery of knowledge, patterns, and events in the video data in order to support video summarization, browsing, and retrieval. There are two challenges: one is how to bridge the semantic gap between low-level visual features and high-level semantics, and the other is how to unify both events and structures in one semantic representation framework:

• The semantic gap characterizes the difference between two types of descriptions of the same video data under different representations.
In sports video mining, the semantic gap is the difference between understandable semantic concepts and computable visual features. As shown in Fig. 1.4, when watching a football game, people are interested in specific types of moments or events, but the computer can only deal with visual features extracted from the raw video data, which makes it difficult to automatically find perceivable video content of interest from the computable visual features.

• As shown in Fig. 1.5, there are two types of representations of high-level semantics for the same video data. They can be interpreted either as semantic events, e.g., kicks, punts, etc., or as semantic structures, e.g., plays and breaks. These two types of semantic representations take different points of view but are complementary to each other. Therefore, we need a unified representation framework capable of revealing both specific semantic events and the general semantic structure in a complementary way.

Figure 1.4: The first challenge: the semantic gap (visual features vs. semantics).

Figure 1.5: The second challenge: a unified representation (structure vs. events).

1.3 Our Approaches and Contributions

Machine learning refers to algorithms that are capable of the autonomous acquisition and integration of data and knowledge by computers. It is considered one of the most effective approaches for video mining [4] because of its capability of converting video mining tasks into learnable mathematical problems that can be largely handled by the computer. In particular, probabilistic graphical models such as Hidden Markov Models (HMMs) and Conditional Random Fields (CRFs) have been regarded as the most popular methods for dealing with temporal series information. Their popularity is rooted in their intuitive representation of the dependency relationships between variables by a graph, which fits video mining tasks very well.
A variety of graphical-model-based approaches have been proposed for diverse video mining tasks. In our research, we propose a unified, probabilistic-graphical-model-based, three-layer framework for semantic video analysis, in which low-level visual features, mid-level keywords, and high-level semantics are incorporated through two correlated inference processes.

1.3.1 A three-layer semantic representation

We advocate a three-layer semantic space that includes low-level visual features, mid-level keywords, and high-level semantics. The low-level visual features are salient visual characteristics that can be extracted from the raw video data; the mid-level keywords are a set of stable, recurrent, and well-defined semantic units that can be inferred from the low-level features; and the high-level semantics, including events and structures, can be inferred from the mid-level keywords and are of direct user interest. The semantic space is a unique framework with the following two characteristics:

• The semantic space is a unified representation of the semantic content of the video data. It provides us an explicit semantic modeling framework, which can represent both events and structures jointly. In addition, the mid-level keywords introduced in the semantic space can serve as building blocks of high-level semantics to further bridge the semantic gap.

• The semantic space guides us to convert the video mining problem into two related statistical inference problems, i.e., from low-level visual features to mid-level keywords, and then from mid-level keywords to high-level semantics. Considering the distinct characteristics of the different levels, we address these two inference steps individually by adopting different approaches, thereby reducing the complexity and uncertainty of the video mining problem.
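The two-stage inference just described can be sketched in miniature. The code below is purely illustrative: the function names, the thresholding "keyword model," and the rule-like "semantics model" are hypothetical stand-ins for the HMM- and CRF-based stages developed in later chapters.

```python
# A minimal sketch of the two-stage inference pipeline implied by the
# three-layer semantic space. All names and models here are hypothetical
# illustrations, not the dissertation's actual implementation.

def infer_keywords(shot_features, keyword_model):
    """Stage 1: low-level visual features -> mid-level keywords (per shot)."""
    return [keyword_model(f) for f in shot_features]

def infer_semantics(keywords, semantics_model):
    """Stage 2: mid-level keywords -> high-level semantics."""
    return semantics_model(keywords)

# Toy stand-ins: a thresholding "keyword model" and a rule-like
# "semantics model" play the roles of the HMM and CRF stages.
keyword_model = lambda f: "play" if f > 0.5 else "break"
semantics_model = lambda kws: "drive" if kws.count("play") >= 2 else "no-drive"

features = [0.9, 0.7, 0.1]                           # one scalar feature per shot
keywords = infer_keywords(features, keyword_model)   # ['play', 'play', 'break']
event = infer_semantics(keywords, semantics_model)   # 'drive'
```

The point of the sketch is the factorization itself: stage 1 can be replaced by any keyword detector and stage 2 by any structure recognizer without changing the overall pipeline, which is what makes the two inference problems separable.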
1.3.2 Inferring mid-level keywords by generative graphical models

We propose a series of generative-model-based approaches, especially extended hidden Markov models, that support the detection of mid-level keywords and offer more flexibility, capability, and functionality than existing HMMs for video mining. Specifically, we address the following three main technical issues:

• How to enhance the observation model in HMM-based approaches to handle variable-length frame-wise observations for shot-wise state inference?

• How to capture the interaction between multiple coexistent semantic keywords, such as the play type (what happened?), the camera view (where did it happen?), and the possession (who is holding the ball?)?

• How to optimize the model structure and learn the model parameters simultaneously?

1.3.3 Inferring high-level semantics by discriminative graphical models

We study discriminative models, especially CRF-based approaches, for high-level semantic analysis, in which both semantic events and semantic structures are studied based on conditional random fields (CRFs) and their extensions. These approaches take the mid-level semantic keywords generated by the MCSHMM and then detect the high-level semantics that people are most interested in. Specifically, we address the following three main technical issues:

• How to detect highlights/events given a series of consecutive mid-level keywords?

• How to jointly perform segmentation and recognition of variable-length semantic structures from a set of annotated keywords?

• How to handle missing keywords in the inference process, i.e., when some of the mid-level keywords are incomplete, how to find the structure accurately?
1.3.4 Our Contributions

The major advantages of our approach are the explicit semantic modeling and direct semantic computing implemented by the semantic space and the graphical models, which are unique in three aspects:

• First, the three-layer semantic space, involving low-level visual features, mid-level keywords, and high-level semantics, provides a unified representation of the different levels of semantics in sports video data. More importantly, guided by the semantic space, sports video mining is converted into two inter-correlated inference problems that can be solved by probabilistic graphical models.

• Second, mid-level keyword detection is addressed by a series of HMM-based approaches. We enhance the modeling capability of traditional HMMs in two aspects: the observation model and the dynamic model. On the one hand, segmental modeling techniques improve the observation model so that it can handle the variable-length observations belonging to a single keyword, which is natural in most sports video but cannot be handled by traditional HMM-based methods. On the other hand, multi-channel segmental models provide a powerful way to improve the dynamic model by formulating the useful interactions between different semantic keyword sequences, thereby improving the recognition accuracy of each channel, which is very important for handling a complex semantic network involving multiple inter-correlated semantic sequences.

• Third, event detection and structure exploration are solved by CRF-based discriminative models, which are able to capture long-range dependencies in both high-level semantics and mid-level keywords. They can also jointly segment and recognize a set of annotated keywords into a series of variable-length structures.
More importantly, the proposed models can handle missing data, i.e., incomplete keywords, by adding an auxiliary layer that learns additional contextual information during the model learning stage, which can then be used to compensate for possibly missing keywords in the inference stage.

In this work, we choose American football video as a case study, in which keyword detection, event recognition, structure analysis, and missing data handling are incorporated into one framework. The proposed paradigm can be flexibly extended to other sports video mining applications by incorporating a game-specific semantic space and then reformulating the two inference problems.

1.4 Outline

This dissertation is organized as shown in Fig. 1.6. In Chapter 2, we first review recent work on video mining, introducing both generative and discriminative model-based video mining approaches. In Chapter 3, we introduce the concept of the semantic space, where the relevant low-level visual features, mid-level keywords, and high-level semantics are specified. Then, in Chapters 4-8, we address the video mining tasks at the three levels respectively: how to extract informative low-level features; how to use generative-model-based approaches, i.e., HMMs, Embedded HMMs, and Segmental HMMs, to detect mid-level keywords; and how to use discriminative graphical models, i.e., CRFs and Segmentation CRFs, to recognize high-level semantics, comparing them in terms of modeling capability, efficiency, and video mining performance. Specifically, in Chapter 6, we introduce a novel algorithm called the Multi-Channel Segmental Hidden Markov Model (MCSHMM) and compare it with previous approaches. Then, in Chapter 8, we investigate the Segmentation CRF (SCRF) based approach for game flow analysis and propose a new approach called Auxiliary SCRFs (ASCRFs). Finally, we conclude this dissertation with some discussion of future research in Chapter 9.
Figure 1.6: The outline of the dissertation.

CHAPTER 2

LITERATURE REVIEW

The probabilistic-graphical-model-based approaches for video mining are generally classified into two categories, as shown in Fig. 2.1. One is the generative-model-based approach, which estimates the posterior probabilities of hidden states by combining conditional densities (likelihoods) for the observations with priors. The other is the discriminative approach, which uses a parametric model to directly represent the posterior probabilities of the hidden states given the observations.

2.1 Probabilistic Graphical Models

A probabilistic graphical model is a probabilistic model in which a graph denotes the conditional dependencies between random variables. Such models are commonly used in probability theory, statistics, and machine learning [16].
Generally, probabilistic graphical models use a graph-based representation as the foundation for encoding a complete distribution over a multi-dimensional space; the graph is a compact, factorized representation of a set of independencies that hold in the specific distribution. Two branches of graphical representations of distributions are commonly used, namely Bayesian networks and Markov networks. Both families encompass the properties of factorization and independence, but they differ in the set of independencies they can encode and in the factorization of the distribution that they induce [17]. The literature on probabilistic graphical models covers a variety of models, spanning Bayesian networks, undirected Markov networks, discrete and continuous models, and extensions to deal with dynamical systems and relational data; for each class of models, it addresses the three fundamental cornerstones of representation, inference, and learning, at both basic and advanced levels.

For many applications of machine learning, the goal is to predict the value of a vector C given the value of a vector X of input features. In a classification problem, C represents a discrete class label, whereas in a regression problem it corresponds to one or more continuous variables. From a probabilistic perspective, the goal is to find the conditional distribution p(C | X). The most common approach to this problem is to represent the conditional distribution using a parametric model, and then to determine the parameters using a training set consisting of pairs of input vectors along with their corresponding target output vectors. The resulting conditional distribution can be used to make predictions of C for new values of X. This is known as a discriminative approach, since the conditional distribution discriminates directly between the different values of C.
An alternative approach is to find the joint distribution p(X, C), expressed for instance as a parametric model, and then to use this joint distribution to evaluate the conditional p(C | X) in order to make predictions of C for new values of X. This is known as a generative approach, since by sampling from the joint distribution it is possible to generate synthetic examples of the feature vector X. In practice, the generalization performance of generative models is often found to be poorer than that of discriminative models due to differences between the model and the true distribution of the data.

2.2 Generative Models

Generative models, e.g., in the simplest case hidden Markov models as shown in Fig. 2.2, are usually preferred for large data sets due to their better generality when prior knowledge is available. They rely on a dynamic model of the state and an observation likelihood function, and the decision is made via Bayes' rule.

Figure 2.1: Structures of discriminative models and generative models (hidden states and observations).

Fig. 2.2 shows several typical generative models used for content-based and semantic-based video analysis, including the prototype hidden Markov model (HMM), the Coupled HMM (CHMM), the Factorial HMM (FHMM), and the Hierarchical HMM (HHMM). These models involve specific state dynamics and an observation likelihood function. The expectation-maximization (EM) algorithm is usually used for model learning, and Bayesian inference is involved in decision making given an observation sequence. The Dynamic Bayesian Network (DBN) [18] provides a unified probabilistic framework to represent various graphical structures with direct dependencies, and the HMM is considered the simplest DBN; it has been widely used in many video analysis applications [19, 20].
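As a concrete illustration of the generative route, the following sketch fits class-conditional Gaussians to one-dimensional, two-class data and evaluates the posterior p(c | x) through Bayes' rule; a discriminative method would instead parameterize p(c | x) directly (e.g., logistic regression). All data and parameter values here are toy choices for illustration, not from the dissertation's experiments.

```python
import numpy as np

# Generative classification on 1-D, two-class data: model the joint
# p(x, c) = p(x | c) p(c), then apply Bayes' rule to obtain p(c | x).
rng = np.random.default_rng(0)
x0 = rng.normal(loc=-2.0, scale=1.0, size=500)   # samples from class 0
x1 = rng.normal(loc=+2.0, scale=1.0, size=500)   # samples from class 1

# Estimate class-conditional Gaussian parameters and class priors.
mu = np.array([x0.mean(), x1.mean()])
var = np.array([x0.var(), x1.var()])
prior = np.array([0.5, 0.5])

def posterior(x):
    """p(c | x) via Bayes' rule from the learned joint model."""
    lik = np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    joint = lik * prior               # p(x, c) = p(x | c) p(c)
    return joint / joint.sum()        # normalize over classes

# A point near +2 should be assigned to class 1 with high confidence.
p = posterior(1.8)
```

Because the joint model is available, the same fitted Gaussians could also be sampled to generate synthetic feature values, which is exactly the property that gives the generative approach its name.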
Recently, there have been intensive efforts to enhance HMMs for semantic video analysis, and recent studies mainly focus on two issues, i.e., structuring and learning. Various extended HMMs have been proposed to further enhance the capability of handling complex temporal series. Specifically, these extended HMMs focus on improving the original HMM in terms of its two major components, i.e., the dynamic and observation models [21]. There are usually two types of dynamic structure involved in semantic video analysis, parallel and hierarchical ones. A parallel structure involves information fusion at either the decision level or the feature level. For example, when there are two parallel semantic structures involved in a video sequence, the coupled HMM (CHMM) [22] and the influence model [23] were proposed to capture the interaction between two Markov processes via decision-level fusion.

Figure 2.2: Generative models as variations of the DBN: the hidden Markov model (HMM), coupled HMM (CHMM), factorial HMM (FHMM), and hierarchical HMM (HHMM). X denotes latent states, F switch variables, and Y observations; the white and black nodes represent states and observations, respectively.

In contrast, the factorial HMM (FHMM) [24] includes multiple uncoupled state transitions that share the same observation sequence (feature-level fusion) [25], or whose observation vectors are obtained by concatenating low-level features from different modalities or sources [26].
On the other hand, the hierarchical structure usually imposes a multi-layer representation in which semantic event analysis can be accomplished in two steps: recognition of primitives and recognition of structures. For example, the hierarchical HMM (HHMM) is able to capture low-level primitives, based on which we can represent some mid-level structures, e.g., plays and breaks [27]. Furthermore, some signal processing applications may desire a more effective observation model that can represent variable-length observations for state estimation. The segmental HMM (SHMM) was proposed for speech recognition; it effectively handles variable-length observations by involving segmental observation models [28]. The key idea in the SHMM is a two-layer observation model that captures feature variability both within a segment and across segments. Additionally, the observation model (also called the emission function) in an HMM plays an important role in model learning and inference. Traditional HMMs normally represent each state by a single observation, modeled either by a Gaussian or a Gaussian mixture model (GMM). However, in some video analysis tasks, a state may emit a variable-length observation sequence, such as a video sequence composed of a set of consecutive shots where each shot (where the state is defined) consists of many frames (where the observations are collected). How to effectively capture the rich statistics of variable-length observations is essential to model learning and inference. An interesting segmental HMM was proposed in [28] where a segmental observation model was used to characterize the variable-length observations of speech signals. The key assumption in the SHMM is that the observations in a segment are conditionally independent given the segment mean, resulting in a closed-form likelihood function for model learning and inference. Model learning is another important issue, especially for an HMM with a complex state space or variable-length observations.
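The SHMM's conditional-independence assumption can be sketched as follows. This is an illustrative simplification, not the cited SHMM of [28]: the segment is one-dimensional, the segment mean `mu` is treated as given, and the within-segment variability is an assumed fixed-variance Gaussian, so the segment likelihood factorizes into a sum of per-frame log densities regardless of segment length.

```python
import math

def segment_loglik(obs, mu, sigma):
    """Log-likelihood of a variable-length segment under the SHMM-style
    assumption: frame observations are conditionally independent given
    the segment mean mu (Gaussian with fixed std sigma here)."""
    const = -0.5 * math.log(2 * math.pi * sigma ** 2)
    return sum(const - (y - mu) ** 2 / (2 * sigma ** 2) for y in obs)
```

Because the factorized form handles any number of frames per shot, the same function scores a 10-frame shot and a 300-frame shot under a single latent state, which is the property that makes segmental observation models attractive for shot-level states with frame-level observations.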
Generally, there are two learning issues: model structure learning [29] and model parameter learning [30]. The former tries to provide a compact and effective model representation by condensing the state space and eliminating unnecessary state dynamics. The latter aims at estimating the model parameters given a certain model structure. Usually, a generative model contains two related components, i.e., the dynamic model and the observation model. Since the structure of the generative model is mainly determined by the structure of the dynamic model, model structure learning mainly involves state space reconstruction, i.e., selection of the hidden states and their statistical relationships. When the model structure is available, model parameters can be obtained by either supervised or unsupervised learning algorithms [31]. Additionally, it is imperative to have simultaneous structure learning and parameter learning for complex DBNs to ensure their effectiveness and efficiency in practice. In [32], the concepts of entropic prior and parameter extinction were proposed to simplify and optimize the model structure by a state-trimming process. Another possible way is to utilize reversible-jump Markov chain Monte Carlo (RJMCMC). In [27], hierarchical HMMs (HHMMs) were used for unsupervised sports video mining, where the RJMCMC approach is embedded in the learning of HHMMs to support model selection and feature selection.

2.3 Discriminative Models

Discriminative model based approaches, like the conditional random field (CRF) [33], the support vector machine (SVM) [34], and traditional neural networks [35], tend to directly employ the posterior probability of possible label sequences given an observation sequence to perform learning and classification, as shown in Fig. 2.3. SVM approaches focus on structural risk minimization by maximizing the decision margin.
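The supervised side of parameter learning mentioned above is straightforward for an HMM once labeled state sequences are available. The sketch below shows the maximum-likelihood estimate of the transition matrix by transition counting and row normalization; it is a toy illustration (no smoothing, uniform fallback for unseen states), not the learning algorithm of any cited work.

```python
from collections import Counter

def learn_transitions(state_seqs, n_states):
    """Supervised ML estimate of an HMM transition matrix from labeled
    state sequences: count each i -> j transition, normalize each row.
    Rows with no observed transitions fall back to a uniform row."""
    counts = Counter()
    for seq in state_seqs:
        for i, j in zip(seq, seq[1:]):
            counts[(i, j)] += 1
    probs = []
    for i in range(n_states):
        row = [counts[(i, j)] for j in range(n_states)]
        total = sum(row)
        probs.append([c / total if total else 1.0 / n_states for c in row])
    return probs
```

Unsupervised learning (Baum-Welch/EM) replaces the hard counts with expected counts computed from the forward-backward recursions, but the normalization step is the same.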
SVM-based approaches can be sensitive to noise and are limited in some cases where a large and complex data set is involved. However, because the SVM directly models the decision boundary, it can efficiently separate two sets of data when the data set is small or not much prior knowledge is available. In [34], an SVM-based approach is proposed to detect events in field-sports video using audio-visual features. In [36], the SVM was used to classify video shots into certain high-level semantic concepts, such as sports, buildings, and intermediate topic concepts. Compared with the DBN, the CRF and its variations are more expressive and direct, since they allow more flexible statistical dependency structures on observations and states, as shown in Fig. 2.3. The traditional CRF is an undirected conditional probabilistic graphical model for segmenting and labeling sequential data. In order to handle complex dependencies with no clear prior knowledge, the authors of [37] proposed a CRF-based method that employs the Markov blanket to specify the dependency among observations and states. To label sequential data in an interactive way, the authors of [] proposed the dynamic CRF (DCRF) for part-of-speech tagging and noun-phrase segmentation. Due to its unique structure, the DCRF is able to accomplish several labeling tasks at once by sharing information between them. For example, in [38], objects and scenes in the video sequence are jointly and successfully labeled by integrating the interaction and dynamic information between them via a DCRF. To handle long-range dependencies as well as latent structure, the authors of [39] proposed the hidden CRF (HCRF) for gesture recognition. Different from the traditional CRF, where a label is assigned to each single observation, the HCRF incorporates a subsequence of hidden state variables into the CRF to assign a label to an entire observation sequence.
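The linear-chain CRF posterior described above can be sketched as follows. This is a minimal illustration under assumed toy log-potentials (the `unary` and `pairwise` tables are hypothetical, not learned features), and it normalizes by brute-force enumeration rather than the forward algorithm, which is only feasible for very short sequences.

```python
import itertools
import math

def crf_posterior(unary, pairwise, labels):
    """Posterior p(y | x) of a linear-chain CRF.
    unary[t][y]      : log-potential of label y at position t (given x)
    pairwise[y1][y2] : log-potential of adjacent labels y1 -> y2
    labels           : the integer label set (used to index the tables).
    Normalization is brute force over all label sequences."""
    T = len(unary)

    def score(y):
        s = sum(unary[t][y[t]] for t in range(T))
        s += sum(pairwise[y[t]][y[t + 1]] for t in range(T - 1))
        return s

    logZ = math.log(sum(math.exp(score(y))
                        for y in itertools.product(labels, repeat=T)))
    return lambda y: math.exp(score(y) - logZ)
```

The key point mirrored from the text: the model is conditional from the start, so nothing forces the observation features feeding `unary` to be independent, which is the flexibility the CRF offers over generative sequence models.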
The proposed HCRF extends the spatial relationship in the CRF in a joint spatial-temporal way, where the labels of individual observations are optimized and incorporated into a sequence classifier for the recognition task.

Figure 2.3: Discriminative models as variations of the conditional random field: the conditional random field (CRF), the dynamic CRF (DCRF), and the hidden CRF (HCRF). X and Z denote latent states, F sub-states, and Y observations.

CRFs have been widely used in video mining research, e.g., highlight detection [37], human tracking and analysis [40], video browsing [41], and personalized content-based retrieval [42]. More specifically, a segmentation CRF (SCRF) was proposed in [43] to capture long-range interactions in both observations and semantics for protein segmentation and recognition; this is very similar to video mining research, which involves both segmentation and recognition tasks simultaneously.

2.4 Comparisons

In Fig. 2.1 and Table 2.1, we show some key differences between discriminative models and generative models. In our recent research, we are interested in developing new HMMs that offer more capacity, functionality, and flexibility than existing HMMs. The new HMMs should be able to take advantage of explicit semantic modeling, where mid-level semantic keywords defined at the shot level directly correspond to the latent states and visual features extracted at the frame level are used as the observations. We expect the new HMMs to accommodate rich statistics of frame-wise observations for shot-wise state estimation. Also, we want to explore the mutual interaction between two semantic keywords by integrating two Markov processes into the same inference process. Moreover, as the modeling complexity goes up, model training becomes more challenging and even problematic.
We need a more powerful learning algorithm that can optimize the model structure and learn the model parameters at the same time. Although recently proposed HMMs can deal with these issues individually, we want to develop an integrated approach that addresses all of the above issues in one framework. In addition, once mid-level keywords are ready, we are interested in developing a discriminative model based approach that supports high-level semantics detection. High-level semantics are quite specific and can hardly be defined uniquely from low-level features. The discriminative model provides a feasible way to relax the strong independence assumptions and prior knowledge requirements that are not available for high-level semantics detection.

2.5 Applications in Video Mining

As mentioned above, sports video mining research can be roughly divided into two groups: structure-based approaches [4, 27] and event-based approaches [12, 37]. Structure-based approaches usually attempt to parse a long video sequence into individual segments by detecting scene changes or the boundaries between camera shots in a video stream. For example, in [15], a soccer video sequence is segmented into plays and breaks, and in [44], the authors classify a video sequence into plays and replays. Both works focused on the detection of general semantic keywords in the video data and treated them as the basic semantic components. Event-based approaches aim to summarize the video data by highlights or specific events of interest.
Event detection in sports video can be done at both the object level [45, 46, 47] and the scene level [48, 49, 50].

Table 2.1: The comparison between the discriminative and generative models.
• Graphic modeling: generative models are directed; discriminative models are undirected.
• Model inference: generative models estimate a distribution over all variables, i.e., input and output, represented by a joint probability distribution; discriminative models make no direct attempt to model the underlying distribution and only compute input-to-output mappings via a conditional probability.
• Use of prior knowledge: generative models can parametrically constrain the distribution by specifying prior distributions and are not sensitive to a lack of training data; discriminative models have no explicit expression of prior knowledge about the distribution, so more training data is needed.
• Model extendability: generative models are versatile, since in the joint distribution space we can insert knowledge about relationships between variables, invariants, independencies, prior distributions, and so forth; discriminative models are task dependent and often lack flexible modeling tools and methods for incorporating prior knowledge.

Object-level event detection approaches usually associate semantic events with the appearance or behavior of specific objects. For example, in [46], the authors employed object-based features such as the trajectories of the ball and players to detect goals in a soccer game. However, most semantic events are typically defined on complex collections of multiple objects. Also, in some cases, the object of interest may not always be available. Instead of focusing on objects, scene-level event detection algorithms, e.g., [48], utilize various visual cues in the video data, including color distribution, specific landmarks, camera motion, and even caption information, to detect semantic events. Usually, feature extraction at the scene level is more robust and efficient compared with that at the object level.
There are two major kinds of approaches for event detection: rule-based approaches [51, 52] and statistical model based approaches [50, 34]. The major advantage of rule-based methods is that they are usually goal-oriented and can support specific video search tasks effectively. In [51], camera motion is used to classify each play shot into one of seven different plays. However, in general, rule-based approaches may not be effective in dealing with the uncertainty and ambiguity of visual features for high-level semantic analysis, and their task-specificity also limits their flexibility and extendability. In contrast, statistical model based approaches are usually more flexible and general in discovering underlying structures or events in video data, such as recurrent events or highlights [50]. Recently, two types of statistical models have been intensively studied for semantic video analysis: generative models [27] and discriminative models [34, 37]. Some hybrid approaches that combine both generative and discriminative models have also been advanced, either to enhance data classification [53] or to detect semantic events in video data [54]. Compared with rule-based approaches, statistical model based approaches have great potential to offer better capability, flexibility, and generality in handling a large volume of video data with high complexity and diverse nature. Recently, an interesting video understanding framework was proposed that uses an AND-OR graph to detect the storyline from annotated video sequences [55]. This method was mainly designed to handle single-link causal relationships due to the nature of the AND-OR graph. However, we want to consider complex contextual relationships that may require multi-link undirected dependencies among multiple semantic keywords. Our goal is to deliver a complete game flow that summarizes the game and supports both event-based and structure-based video understanding.
CHAPTER 3
PROBLEM FORMULATION

We believe that explicit semantic modeling and direct semantic computing are effective in bridging the semantic gap for sports video mining. There are two methodologies to address this issue, i.e., data-driven bottom-up methods and concept-driven top-down approaches. A balanced interaction between both should be encouraged to provide a generic representation for semantic video analysis. Specifically, we argue that the top-down concept-guided semantic representation plays a critical role in sports video mining and can be supported by data-driven bottom-up methods in an inference framework. Therefore, we build a three-layer semantic space that helps us understand the semantic structure of the video data. The semantic space contains low-level visual features, mid-level keywords, and high-level semantics. The low-level visual features, e.g., color, landmarks, motion, etc., are extracted from the raw video data to support the mid-level semantic keywords. The mid-level semantic keywords are rudimentary semantic building blocks. These building blocks should be frequent, repeatable, and relatively well defined, and their states can be governed by certain dynamics, e.g., camera views and play types, which are two common structures in most field-based sports. For example, in American football videos, we can define four camera views (central, left, right, and end-zone views) and four play types (long play, short play, kick, and field goal), as shown in Fig. 3.3. Other keywords, e.g., possessions, are also possible. A combination of several semantic keywords can specify some high-level semantics. For example, a touchdown highlight could be represented by a long play to the end zone followed by a field goal. Another kind of high-level semantics in this work is the game flow, which tells the audience how the game semantically evolves and is therefore of great interest to viewers.
This approach supports not only high-level semantics detection but also customized events of interest, and it increases the usability and interactivity of video data.

Figure 3.1: The three-layer semantic space: low-level visual features (color, landmarks, motion), mid-level semantic keywords (play types, camera views, possessions), and high-level semantics (scored, non-scored, turnover). Stage 1 infers the mid-level semantic keywords; Stage 2 infers the high-level game flow.

3.1 Semantic Space

As shown in Fig. 3.1, we advocate the concept of a semantic space that is formed by a set of rules and used to facilitate the understanding of video content and user queries at different levels. We specify a three-layer semantic space, where mid-level semantic keywords represent high-level semantics, such as highlights and game flow, and mid-level semantic keywords are in turn specified by relevant visual features extracted from the video data. It is our belief that inferring mid-level semantic keywords from low-level visual features is more reliable and feasible than directly inferring high-level semantics from low-level features. Our main reasoning is that mid-level semantic keywords have relatively stable and repeated patterns, and high-level semantics should be inferred from mid-level semantic keywords due to their inherent relationship. For example, in American football games, high-level semantics include highlights, e.g., a touchdown (TD), the extra point after a TD, field goals, turnovers, etc., as well as any event of interest that could be customized by the user. Directly detecting them may be challenging due to their complex nature and diverse variability. On the other hand, some mid-level semantic keywords can be useful to represent and detect high-level semantics. Similar video mining paradigms can be found in the recent literature [11, 55].
It is our attempt to provide a unified machine learning paradigm in which semantic video analysis is formulated as two statistical inference problems: we first infer mid-level semantic keywords from low-level visual features, and we then infer high-level semantics from the mid-level semantic keywords.

3.2 Low-level Visual Features

As the first layer of the semantic space, relevant low-level visual features are needed to specify the mid-level semantic keywords, i.e., "camera views", "play types", and "possessions". Due to their different natures, we need two completely different sets of visual features. In Fig. 3.2(a), we illustrate the two major patterns of camera motion, i.e., panning and tilting. In a football game, different types of play can be roughly distinguished by the motion patterns of the camera, which help us identify how each play unfolds in terms of range and height. In Fig. 3.2(b), we illustrate typical scenes for the four different camera views, which demonstrate the visual distinctions between them. When watching American football, a viewer can easily identify the camera view of each shot from the yard numbers along the sideline of the field. However, it is very hard for a computer vision algorithm to automatically detect and recognize the yard numbers in a video shot due to two facts: (1) the locations of the yard numbers are arbitrary in a frame, and they may not even appear in many frames; and (2) the significant camera motion and object occlusion make number recognition very difficult.

Figure 3.2: Visual features in football video: (a) play type analysis (panning and tilting); (b) camera view analysis (left, central, right, and end-zone camera views, with landmarks such as the end zone, banner and logo, field, audience, and yard lines).

Therefore, we need more robust and accessible visual features to characterize the latent states of the proposed generative video model.
In this research, we identify the spatial color distribution and the angle of the yard lines as relevant features. For example, in the central view, there is usually a logo of the host team at the field center that has a strong contrast with the green color of the playing field, and all yard lines are almost vertically oriented. In Chapter 4, we will first discuss how to extract salient visual features, including the color distribution and the angle of the yard lines, and then we show the correlation between the camera views and the extracted features.

3.3 Mid-level Keywords

The main challenge of video mining is the semantic gap between understandable concepts and computable features. We believe that an explicit specification of the underlying semantic structures is effective in bridging the semantic gap for sports video mining. Therefore, we define a set of mid-level semantic keywords for sports video representation, i.e., camera views and play types, which capture the rudimentary semantics in a game and each of which is specified by a set of semantic units (states) governed by certain dynamics.1 For example, the camera views include the central, left, right, and end-zone views, and the play types cover long/short plays, kicks, and field goals, as shown in Fig. 3.3, where we also show the dynamics within each semantic structure. There could be several other mid-level semantic keywords, e.g., possession (offense/defense). These semantic keywords will be used as the building blocks for high-level semantic analysis. We expect three major advantages of explicit semantic modeling. First, we can fully take advantage of the available semantics and prior knowledge about the rules of a sports game, which makes the video mining problem well defined. Second, it provides both structure analysis and event analysis, which are complementary in nature for semantic video analysis.
Third, it not only supports highlight detection but is also able to deliver customized events of interest, increasing the usability and interactivity of video data.

3.4 High-level Semantics

When watching a sports game, audiences are interested in the evolution of the game. Based upon the game status and events of interest, we first select the most interesting event, touchdowns, as the highlight we intend to detect; we then organize the various processes of the game into a semantic model structured by the game flow, as shown in Fig. 3.4.

1 We assume a video sequence is composed of a set of consecutive play shots with all breaks and commercials removed.

Figure 3.3: Two semantic keywords and their dynamics in American football video, i.e., camera views (top: left, central, right, and end-zone views) and play types (bottom: long plays, short plays, kicks, and field goals).

The game flow in a football game generates three types of drives: the scored drive, the non-scored drive, and the turnover drive. They are different but correlated in a statistical manner. The game flow provides a balanced choice for exploring the semantic structure in sports video at different levels, i.e., drives are synthesized from generic semantic units such as camera view, play type, and possession. The game flow provides a higher-level semantic representation that reveals and supports further analysis carried out by users with different interests.

3.5 Two Inference Problems

Similar video mining paradigms can be found in the recent literature [1, 37]. We want to provide a unified machine learning paradigm in which semantic video analysis can be systematically formulated as two statistical inference problems.
The first inference problem is from low-level visual features to mid-level semantic keywords, and the second is from mid-level semantic keywords to high-level semantics.

Figure 3.4: A partial game flow in an American football game that contains three drives: a scored drive, a non-scored drive, and a turnover drive; each drive includes a series of annotated plays.

• Inference problem 1: Due to the nature of mid-level semantic keywords, generative models, such as HMMs or their variants, are more suitable for the first inference problem, where some prior knowledge is available as well as useful. Specifically, the dynamic model directly reflects the transitions of recurrent temporal patterns (i.e., mid-level semantic keywords), and the observation model represents the different mid-level structures by distinct low-level features.

• Inference problem 2: High-level semantics of interest that are immediately useful to users are relatively less well defined by low-level features, mainly due to their inherent uncertainty and ambiguity in the video data. On the other hand, they can be better specified by mid-level structures. However, a more flexible and general statistical association between states and observations is needed. Therefore, we invoke discriminative models, such as CRFs, for the second inference problem, since they support versatile statistical modeling structures.

There are three major advantages to this framework. First, we can fully take advantage of the available semantics and prior knowledge about a sports game, which makes the video mining problem well structured and formulated. Second, it provides both structure analysis at the mid-level and event analysis at the high-level, which are complementary in nature. Third, it not only supports highlight detection but can also deliver customized events of interest and increase the usability and interactivity of video data.
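The two-stage inference structure can be sketched as a toy skeleton. Everything here is an illustrative placeholder, not the dissertation's trained models: stage 1 stands in for generative (HMM-style) decoding by picking the keyword with the highest emission score per shot, and stage 2 stands in for discriminative (CRF-style) labeling with a trivial rule on the keyword sequence.

```python
def infer_keywords(shot_features, emission):
    """Stage 1 (toy): map each shot's low-level feature to the mid-level
    keyword with the highest emission score, standing in for HMM decoding."""
    return [max(emission, key=lambda k: emission[k](f)) for f in shot_features]

def infer_high_level(keywords):
    """Stage 2 (toy): map a keyword sequence to a high-level label,
    standing in for CRF-based labeling of the game flow."""
    if "end_zone_view" in keywords:
        return "scored_drive"
    return "non_scored_drive"

# Hypothetical emission scores keyed on a 1-D "yard-line angle" feature.
emission = {
    "central_view": lambda a: -abs(a - 90),   # near-vertical yard lines
    "end_zone_view": lambda a: -abs(a - 0),   # near-horizontal yard lines
}
keywords = infer_keywords([85, 10], emission)
flow = infer_high_level(keywords)
```

The point of the skeleton is the interface: stage 2 never touches the low-level features, only the keyword sequence produced by stage 1, which is exactly the layering argued for in the semantic space.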
CHAPTER 4
LOW-LEVEL VISUAL FEATURES

4.1 Background

We define relevant visual features for the mid-level semantic keywords. We use the color distribution and yard-line angles to characterize the camera view [56], and the camera motion for the play type [57] and possessions. Using these features, video mining is formulated as an inference problem where we want to infer the mid-level semantic structures from the visual features.

• Color distribution: The color distribution varies from view to view. For example, in the central view, there is usually a logo of the host team at the field center that has a strong contrast against the green color of the playing field, while in the left or right view, the area close to the end zone usually exhibits a dominant non-green color. In our work, we use the concept of dominant color [58] to estimate the spatial color distribution by dividing a frame image into multiple overlapped regions (three regions horizontally and vertically) and computing the ratios of the dominant color between different regions.

• Yard-line angle: The angle of the yard lines can also reflect the camera view. For example, in the central view, all yard lines show a vertical pattern, while they show certain orientations in the other camera views. To obtain the yard lines, we can simply use Canny edge detection followed by the Hough transform to detect the yard lines in the region of the playing field, and then we can compute their angles.

• Camera motion: There are two main types of camera motion in a football video, i.e., panning and tilting, which effectively characterize different play types. For example, a long play is usually associated with a strong panning effect, while a short play normally shows a weak panning effect. We chose the optical flow based method [59] to compute the two kinds of camera motion between adjacent frames. Also, the frame indices are included as a temporal feature.
4.2 Color/Landmarks Based Visual Features

In this work, we use the concept of dominant color to estimate the spatial color distribution by dividing a frame image into multiple regions and computing the ratios of the dominant color between different regions. Specifically, we use the robust dominant color region detection algorithm in [58] to extract the dominant color region, i.e., the playing field. Fig. 4.1 shows some results of dominant color extraction. To obtain the yard lines, we can simply use Canny edge detection and the Hough transform to detect the yard lines in the region of the playing field, as shown in Fig. 4.2.

Figure 4.1: Playing field detection results (left, center, right, top, and bottom regions).

Figure 4.2: Yard line detection results (four example angles).

Based on the detected playing field and yard lines, we extract a 6-D feature vector composed of the following features:

• Ratio of the dominant color region: R_f = W_f / W;

• Ratio difference of the dominant color between the left/right and center parts: R_d = (W_left − W_center + W_right − W_center) / W;

• Ratio difference of the dominant color between the left and right parts: R_lr = (W_left − W_right) / W;

• Ratio difference of the dominant color between the top and bottom parts: R_tb = (W_top − W_bottom) / W;

• Average angle of all yard lines: D_ave = (1/L) Σ_{l=1}^{L} θ_l;

• Angle difference between the first and last frames in a shot: D_shot = θ_last − θ_first;

where W is the total number of pixels in a frame, W_f is the total number of pixels in the dominant color region, and W_left, W_right, W_top, W_bottom, and W_center are the pixel counts of the dominant color region in the left, right, top, bottom, and central parts, respectively. In addition, θ_l is the angle of the l-th detected line, L is the number of detected lines, and θ_last and θ_first are the average angles in the last and first frames of a shot, respectively.
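The 6-D feature vector above can be assembled as follows. This is a minimal sketch: the region pixel counts and line angles are assumed to come from the dominant-color and Canny/Hough steps, which are not reproduced here, and the argument names are illustrative shorthand for the quantities defined in the bullets.

```python
def shot_features(W, Wf, Wl, Wc, Wr, Wt, Wb, angles, first_avg, last_avg):
    """Assemble the 6-D color/landmark feature vector
    [Rf, Rd, Rlr, Rtb, Dave, Dshot] from pixel counts of the
    dominant-color region and the detected yard-line angles."""
    Rf = Wf / W                     # dominant-color ratio
    Rd = (Wl - Wc + Wr - Wc) / W    # left/right vs. center difference
    Rlr = (Wl - Wr) / W             # left vs. right difference
    Rtb = (Wt - Wb) / W             # top vs. bottom difference
    Dave = sum(angles) / len(angles)  # average yard-line angle
    Dshot = last_avg - first_avg      # angle change across the shot
    return [Rf, Rd, Rlr, Rtb, Dave, Dshot]
```

A vertical-yard-line central view would give D_ave near 90 degrees with balanced left/right ratios, while a left or right view skews both R_lr and D_ave, which is what lets the downstream HMM separate the camera views.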
4.3 Motion Based Visual Features

Play types largely depend on the pattern of camera motion. There are mainly two types of camera motion, panning and tilting, as shown in Fig. 3.2(a), and most play types can be effectively characterized by these two kinds of camera motion. For example, in a long run from right to left, the camera motion could be a short right-panning followed by a long left-panning, which is different from the motion trend of a short run, i.e., a short right-panning followed by a short left-panning. To specify camera motion, we choose the optical flow based method [59] to qualitatively compute the panning and tilting parameters between two adjacent frames. Also, the frame index is included as an additional temporal feature to distinguish long plays from short plays. Therefore, we compose a 3-D camera motion feature vector for every two adjacent frames in each shot, which is employed to identify the play type of each shot.

4.4 Feature Evaluation

To quantitatively evaluate the saliency of the visual features, we manually performed camera view based shot classification for one football video whose ground-truth state sequence is denoted as S^g_{1:T}; we then employ the information gain [60, 27] to evaluate each feature's contribution to HMM-based classification. Given feature z ∈ {1, 2, ..., 6}, the model Γ_z is learned and the optimal state sequence S^z_{1:T} can be obtained. The information gain is computed as

Info(S^g_{1:T}, S^z_{1:T}) = H(P_{S^z}) − Σ_j P_{S^g}(j) · H(P_{S^z|S^g=j}),   (4.1)

where H is the entropy function, and

P_{S^g}(k) = #(S^g_t = k, t = 1, ..., T) / T,
P_{S^z}(k) = #(S^z_t = k, t = 1, ..., T) / T,
P_{S^z|S^g}(k, j) = #(S^z_t = k and S^g_t = j, t = 1, ..., T) / #(S^g_t = j, t = 1, ..., T),

where # is the counting operator. After ranking the six features according to Info(S^g_{1:T}, S^z_{1:T}) in Table 4.1, we select the top four features to construct the observation vector O_t of the HMM, i.e., D_ave, R_f, D_shot, and R_lr.
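The information-gain criterion of Eq. (4.1) can be computed directly from the two state sequences. The sketch below follows the equation term by term (entropy of the decoded-state distribution minus the ground-truth-weighted conditional entropies); entropies are in nats, and the sequences are plain Python lists standing in for S^g_{1:T} and S^z_{1:T}.

```python
import math
from collections import Counter

def entropy(p):
    """Entropy (nats) of a probability distribution given as a dict."""
    return -sum(v * math.log(v) for v in p.values() if v > 0)

def info_gain(Sg, Sz):
    """Information gain of a decoded state sequence Sz with respect to
    the ground-truth sequence Sg, as in Eq. (4.1):
    H(P_Sz) - sum_j P_Sg(j) * H(P_{Sz|Sg=j})."""
    T = len(Sg)
    Pz = {k: c / T for k, c in Counter(Sz).items()}
    gain = entropy(Pz)
    for j, cj in Counter(Sg).items():
        cond = Counter(z for g, z in zip(Sg, Sz) if g == j)
        Pcond = {k: c / cj for k, c in cond.items()}
        gain -= (cj / T) * entropy(Pcond)
    return gain
```

A feature whose decoded sequence perfectly tracks the ground truth attains the maximum gain H(P_Sz), while a feature whose decoding is unrelated to the ground truth scores near zero, which is the basis for the ranking in Table 4.1.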
As mentioned in the previous section, six candidate visual features are extracted to compose the model observations.

Table 4.1: Ranking of features by information gain (I.G.)

Index    | z = 1 | z = 2 | z = 3  | z = 4 | z = 5 | z = 6
Feature  | D_ave | R_f   | D_shot | R_lr  | R_d   | R_tb
I.G.     | 0.62  | 0.41  | 0.37   | 0.35  | 0.16  | 0.11

In the HMM, the observation should be shot-based, but in the SHMM we need frame-based observations. Therefore, the selected features differ slightly between the two models due to the different observation measurements. Following the feature evaluation method in [61], we choose four salient visual features to construct the observation vector in the HMM, i.e., O_t, including D_ave, R_f, D_shot, and R_lr. Because the feature D_shot is designed for the shot-based approach, in the SHMM we employ the feature R_d, i.e., the ratio difference of dominant color between the left/right and center parts, instead of D_shot. In addition, we again use the optical flow based method to compute the panning and tilting parameters between two adjacent frames, and the frame indices are included as an additional temporal feature.

In Fig. 4.3, we compare the distribution of D_ave, i.e., the average angle of all yard lines, in each camera view between shot-based and frame-based observations. We find that frame-wise features not only provide evidence about the statistical character of the observation but also offer richer and smoother information about the observation with respect to different semantic meanings compared with shot-based features. Therefore, we can establish a more powerful mapping, e.g., the SHMM, as the generative model between observations and latent states to discover the underlying semantic structure of the video data.
Figure 4.3: Distributions of shot-based (top) and frame-based (bottom) D_ave with respect to four camera views (left, center, right, and end zone).

CHAPTER 5 HMMS FOR KEYWORDS DETECTION

5.1 Background HMMs are among the most popular machine learning tools for video mining. A typical HMM, as shown in Fig. 5.1, is a statistical model where the underlying system is assumed to be a Markov process with unknown parameters, including a state transition matrix and a probabilistic observation model. The former captures the underlying state dynamics, and the latter characterizes the observations pertaining to the hidden states, either as probability density functions (continuous observations) or probability mass functions (discrete observations). There are three canonical problems for HMMs [62]:
• Evaluation: given the model parameters, compute the probability of a particular observation sequence. This problem is solved by the forward-backward algorithm.
• Learning: given an observation sequence, discover the model parameters, including the state transition and output probabilities. This problem is solved by the Baum-Welch algorithm, an instance of the Expectation Maximization (EM) algorithm.
• Decoding: given the model parameters, find the most likely sequence of hidden states that could have generated a given observation sequence. This problem is solved by the Viterbi algorithm.
In this work, we assume that a video sequence is composed of a set of pre-segmented consecutive shots, each of which was captured from one of four camera views and corresponds to one of four play types. In the following, we first introduce the basic HMM-based approach [61], in which we infer mid-level semantic keywords from low-level visual features. This algorithm serves as the reference for our subsequent improvements.
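As a toy illustration of the evaluation problem, a forward pass can be written in a few lines. The sketch below uses a discrete emission table for brevity; the models in this chapter use Gaussian/GMM emissions, so this is only a schematic with our own naming.

```python
import numpy as np

def forward_likelihood(pi, A, B, obs):
    """Evaluation problem: P(O_{1:T} | model) via the forward algorithm.

    pi:  (K,) initial state probabilities
    A:   (K,K) transition matrix, A[k, j] = p(S_t = j | S_{t-1} = k)
    B:   (K,M) discrete emission probabilities
    obs: list of observed symbol indices
    """
    alpha = pi * B[:, obs[0]]           # initialize with the first symbol
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # propagate and absorb next symbol
    return float(alpha.sum())
```

For a single-state model emitting each of two symbols with probability 0.5, a two-symbol sequence has likelihood 0.25, which is a quick sanity check of the recursion.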
Specifically, we propose three new HMM-based approaches, called the embedded HMM-GMM [57], the segmental HMM (SHMM) [56], and the multi-channel segmental HMM (MCSHMM) [63], respectively.

5.2 Traditional HMMs We first discuss the implementation of a basic HMM, shown in Fig. 5.1, for semantic video analysis that follows our problem formulation and provides a benchmark for the following studies. According to our problem formulation, we use the hidden state of the HMM to encode the mid-level semantic keyword at the shot level directly. Correspondingly, we define two HMMs: one for camera view analysis, where yard line angles and color distributions are used as the observation, and the other for play type analysis, where camera motion is used as the observation. It is worth noting that the hidden state is defined at the shot level, while the observations are at the frame level. Traditional HMMs cannot handle variable-length observations, and an easy treatment to obtain a shot-wise feature is to average all frame-wise observations in a shot.

5.2.1 Model Specifications Given a video sequence of T shots, there are four latent states that correspond to four camera views or play types, i.e., {S_t = k | t = 1, ..., T; k = 1, 2, 3, 4}. By the Markov assumption, we have a state transition matrix, {a_{k,j} = p(S_t = j | S_{t-1} = k) | k, j = 1, 2, 3, 4}, that governs the state dynamics over time, as shown in Fig. 5.1.

Figure 5.1: Hidden Markov Models (HMMs).

The observations, {O_t | t = 1, ..., T}, i.e., the relevant visual features extracted from all shots, are assumed to be drawn from an emission function associated with the latent states. Commonly used emission functions include the Gaussian and the Gaussian mixture model (GMM).
Then we can define two kinds of observation models, either

p(O_t | S_t = k) = N(O_t; μ_k, Σ_k),   (5.1)

where μ_k and Σ_k are the mean and covariance matrix of the Gaussian for latent state k, or

p(O_t | S_t = k) = Σ_{n=1}^{N} α_{nk} N(O_t; μ_{nk}, Σ_{nk}),   (5.2)

where {α_{nk}, μ_{nk}, Σ_{nk} | n = 1, ..., N} parameterizes the N-order GMM for latent state k, and Σ_{n=1}^{N} α_{nk} = 1. Obviously, the Gaussian emission function is a special case of the GMM emission with N = 1. Now the key issue is: can the aforementioned emission functions adequately characterize the feature densities?

5.2.2 Learning and Inference Given a series of T shot-wise observations {O_t | t = 1, ..., T}, the HMM with GMM emission can be parameterized by Γ = {S, π_k, a_{k,j}, α_{nk}, μ_{nk}, Σ_{nk} | k = 1, ..., 4; n = 1, ..., N}:
• S = k, {k = 1, ..., 4}: the hidden states, representing four camera views or play types;
• π_k = p(S_1 = k): the initial state probability vector, representing the initial probabilities of the four camera views or four play types;
• A = {a_{k,j} = p(S_t = j | S_{t-1} = k) | t = 1, ..., T; k, j = 1, ..., 4}: the state transition probabilities between states (camera views or play types) k and j;
• p(O_t | S_t = k) = Σ_{n=1}^{N} α_{nk} N(O_t; μ_{nk}, Σ_{nk}): the N-order GMM emission function of the hidden states, which reduces to the Gaussian emission when N = 1.
Then we use the Expectation Maximization (EM) algorithm [64] to obtain the maximum likelihood parameter estimate of this HMM:

Γ* = argmax_Γ P(O_{1:T} | Γ),   (5.3)

where

P(O_{1:T} | Γ) = Σ_{S_{1:T}} π_{S_1} Π_{t=1}^{T} p(S_t | S_{t-1}) p(O_t | S_t).

After EM training, we use the Viterbi decoding algorithm to estimate the optimal state sequence S*_{1:T}, which corresponds to the camera view or play type classification results for all video shots O_{1:T}. On the one hand, S*_{1:T} indicates the location of each play. On the other hand, the state transition between two adjacent shots may indicate certain plays, such as advancing the ball or a change of possession.
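The decoding step can be sketched as a standard Viterbi pass over log-probabilities. In the sketch below, `log_B[t, k]` stands for the emission log-likelihood log p(O_t | S_t = k) from (5.1) or (5.2); this is a generic sketch with our own naming, not the exact implementation used in the experiments.

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Most likely state sequence given log initial probs (K,), log
    transition matrix (K,K) with log_A[k, j] = log p(j | k), and per-shot
    emission log-likelihoods log_B of shape (T, K)."""
    T, K = log_B.shape
    delta = log_pi + log_B[0]
    back = np.zeros((T, K), int)
    for t in range(1, T):
        scores = delta[:, None] + log_A        # scores[k, j]: from k to j
        back[t] = scores.argmax(axis=0)        # best predecessor of each j
        delta = scores.max(axis=0) + log_B[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):              # trace back pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

With a sticky two-state transition matrix and emissions that strongly favor state 0 for the first two shots and state 1 for the third, the decoded sequence switches exactly once, as expected.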
In practice, we use K-means clustering to initialize the emission functions of the HMM prior to EM training, and N = 3 was found to be a reasonable choice for the GMM emission function. For the prior probability π_k, we assume the football game always starts from the central camera view with a kick. We also initialize the state transition matrix A, an important parameter of the HMM, by computing the frequency of actual state transitions and averaging over a few games.

5.2.3 Experimental Results Based on the selected features, we tested our algorithm on nine 30-minute football videos. After K-means initialization, we use the EM algorithm for HMM training with two different emissions: Gaussian (HMM(1)) and GMM (HMM(2)). Finally, we apply Viterbi decoding to obtain the optimal state sequence. The experimental results are shown in Table 5.2. The performance of HMM(2) is much better than that of HMM(1), mainly because the GMM can better characterize the densities of the visual features than a single Gaussian. However, the major limitation of this approach is that each shot is represented by an averaged feature vector that is less representative and informative. This fact motivates us to improve the observation model of the HMM, i.e., to exploit the frame-wise features and the temporal dependency across frames within a shot.

In Table 5.1, we pick two video sequences to show the classification result for each camera view. Most errors come from the misclassification of left or right camera views as the central camera view. This is mainly because the central camera usually covers the major part of the field, and the number of shots captured by the central camera is usually larger than that of the others.

Table 5.1: Single-shot classification accuracy (%) on camera views

Data    | Left  | Central | Right | End zone
Video A | 72.5  | 73.4    | 61.54 | 100
Video B | 65.79 | 78.95   | 70    | 100

Then we evaluate the experimental results on both camera view and play type, and Table
5.2 shows the result. The result shows that the GMM observation model outperforms the Gaussian model in terms of generalization capability.

Table 5.2: Shot-based classification results of HMM-based algorithms for nine 30-min NFL videos.

Test Videos          | Semantics   | HMM(1) [61] | HMM(2) [61]
Video1 (156 shots)   | Play Type   | 43.59%      | 60.90%
                     | Camera View | 55.77%      | 72.44%
Video2 (163 shots)   | Play Type   | 49.07%      | 61.34%
                     | Camera View | 61.35%      | 67.48%
Video3 (167 shots)   | Play Type   | 44.91%      | 49.10%
                     | Camera View | 53.29%      | 58.68%
Video4 (168 shots)   | Play Type   | 58.33%      | 64.67%
                     | Camera View | 64.88%      | 70.24%
Video5 (168 shots)   | Play Type   | 50.59%      | 61.90%
                     | Camera View | 56.55%      | 65.48%
Video6 (170 shots)   | Play Type   | 54.11%      | 60.59%
                     | Camera View | 52.35%      | 64.12%
Video7 (170 shots)   | Play Type   | 64.70%      | 71.17%
                     | Camera View | 68.24%      | 72.35%
Video8 (171 shots)   | Play Type   | 56.14%      | 65.50%
                     | Camera View | 62.57%      | 75.44%
Video9 (173 shots)   | Play Type   | 63.01%      | 67.05%
                     | Camera View | 66.07%      | 68.21%
Average Accuracy     |             | 57.44%      | 62.62%

Figure 5.2: A new generative model.

5.3 Embedded HMM-GMMs As mentioned before, the latent states, i.e., the camera views and play types, are defined at the shot level, while the observations are extracted from all frames in a shot. Averaging the frame-wise features in a shot to obtain a shot-wise feature inevitably reduces the discriminability of the observations in the HMM. Also, shots have different lengths, i.e., the number of frames varies from shot to shot. In order to take advantage of the rich statistics of frame-wise visual features and to handle variable-length observations, we have proposed a new HMM that is embedded with a two-layer observation model, as shown in Fig. 5.2 [57].

5.3.1 Model Specifications Now we want to develop a generative model for camera view and play type based video modeling.
However, the latent state, i.e., the camera view or play type, is shot-based, while the observation, i.e., the color distribution, landmarks, and camera motion across two adjacent frames, is frame-based, which makes it hard to use traditional HMMs, where each latent state is associated with only one observation. This constraint discards the intra-shot statistical properties among the frames, which are very important for camera view and play type analysis. Inspired by the recursive estimation algorithms in [65], we assume there is a virtual observation C between the shot-based latent states and the frame-based observations. When generating a video sequence at shot t, a virtual observation, i.e., a shot-based GMM of camera view or play type k, is first drawn; then a set of frame-based observations {o_{t,l} | l = 1, ..., n_t} is generated from that GMM. Therefore, we define the emission function of the virtual observation as

p(o_{t,l} | C_t, S_t = k) = Σ_{m=1}^{M} α^k_m N(o_{t,l}; μ^k_m, Σ^k_m),   (5.4)

where Π_k = {α^k_m, μ^k_m, Σ^k_m} is the parameter set of the M-order GMM and Σ_{m=1}^{M} α^k_m = 1. Because the virtual observations are introduced artificially, the density function p(O_t | S_t = k) is not directly accessible; therefore, we approximate it by the average of the frame-wise log-likelihoods, as in (5.5):

log p(O_{t,1:n_t} | S_t = k) = (1 / n_t) Σ_{l=1}^{n_t} log p(o_{t,l} | C_t, S_t = k).   (5.5)

5.3.2 Learning and Inference As shown above, we embed the GMM into the HMM to accommodate both the shot-based latent states (i.e., camera views and play types) and the frame-based observations (i.e., color distribution, landmarks, and camera motion), as well as the intra-shot properties of the video data. The new embedded HMM-GMM model is able to characterize intra-shot motion features and inter-shot play transitions simultaneously. Given a series of observations O, the likelihood of the proposed model is

P(O | Γ) = Σ_{S_{1:T}} π_{S_1} Π_{t=1}^{T} p(S_t | S_{t-1}) p(O_t | S_t),   (5.6)

where Γ = {π_k, a_{k,j}, Π_k | k, j = 1, ..., 4}.
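Eq. (5.5) amounts to averaging frame-wise GMM log-likelihoods over a shot. A minimal 1-D sketch (scalar case, our own function names) is:

```python
import numpy as np

def gmm_logpdf(x, weights, means, variances):
    """Log density of a scalar observation under a 1-D GMM, cf. Eq. (5.4)."""
    comp = -0.5 * ((x - means) ** 2 / variances
                   + np.log(2 * np.pi * variances))
    return float(np.log(np.sum(weights * np.exp(comp))))

def shot_loglik(frames, weights, means, variances):
    """Eq. (5.5): shot-level log-likelihood as the average of the
    frame-wise GMM log-likelihoods for one candidate state k."""
    return float(np.mean([gmm_logpdf(o, weights, means, variances)
                          for o in frames]))
```

Because the average is used rather than the sum, shots of different lengths are compared on the same per-frame scale, which is what lets this emission plug into the standard HMM machinery.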
Then, we use the Expectation Maximization (EM) algorithm to obtain the parameter estimates as follows. Specifically, we define the posterior probability of each shot and the joint probability of every two adjacent shots as

γ_k(t) = p(S_t = k | O, Γ),   (5.7)
ξ_{k,j}(t) = p(S_t = k, S_{t+1} = j | O, Γ).   (5.8)

Due to the embedded two-layer structure, estimating Γ requires updating Π_k, which involves multiple GMMs to accommodate the intra-shot statistics of the video data. Instead of involving all observations in the forward-backward algorithm, we use a Monte Carlo-like sampling method, as shown in Algorithm 1, to construct a training data pool for Π_k, i.e., X_{1:K}, by sampling from O according to (5.7). The training data pools for all play types are then represented as in (5.9):

X_{1:K} = {x^k_{t,w} | k = 1, ..., 4; t = 1, ..., T; w = 1, ..., W},   (5.9)

where W is the number of samples drawn in each shot. After sampling, we follow Algorithm 2 to update the GMM for each play. In each EM iteration, the responsibility of component m is

p(m | x^k_{t,w}, Π_k) = α^k_m p_m(x^k_{t,w} | Π_k) / Σ_m α^k_m p_m(x^k_{t,w} | Π_k),   (5.10)

and the GMM Π_k is updated as follows:

α^k_m = (1 / |X_k|) Σ_t Σ_w p(m | x^k_{t,w}, Π_k),   (5.11)
μ^k_m = Σ_t Σ_w x^k_{t,w} p(m | x^k_{t,w}, Π_k) / Σ_t Σ_w p(m | x^k_{t,w}, Π_k),   (5.12)
Σ^k_m = Σ_t Σ_w p(m | x^k_{t,w}, Π_k)(x^k_{t,w} − μ^k_m)(x^k_{t,w} − μ^k_m)^T / Σ_t Σ_w p(m | x^k_{t,w}, Π_k).   (5.13)

Finally, the remaining components of Γ are obtained by maximizing the expected likelihood as

π_k = γ_k(1),  a_{k,j} = Σ_t ξ_{k,j}(t) / Σ_t γ_k(t).   (5.14)

The complete training algorithm is shown in Algorithm 3. After training, we use the Viterbi decoding algorithm to estimate the play type sequence S* in the football video.
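One iteration of the GMM update in (5.10)–(5.13) can be sketched for the scalar case as follows; here `X` stands for the pooled per-state training data X_k, with the shot/sample double index flattened into one axis (a simplified sketch, not the dissertation's implementation).

```python
import numpy as np

def gmm_em_step(X, weights, means, variances):
    """One EM iteration for a 1-D GMM on pooled samples X.

    E-step computes the responsibilities of Eq. (5.10); the M-step applies
    the weight/mean/variance updates of Eqs. (5.11)-(5.13), scalar case.
    """
    X = np.asarray(X, float)[:, None]                      # (S, 1)
    comp = (weights * np.exp(-0.5 * (X - means) ** 2 / variances)
            / np.sqrt(2 * np.pi * variances))              # (S, M)
    resp = comp / comp.sum(axis=1, keepdims=True)          # Eq. (5.10)
    Nm = resp.sum(axis=0)
    new_w = Nm / len(X)                                    # Eq. (5.11)
    new_mu = (resp * X).sum(axis=0) / Nm                   # Eq. (5.12)
    new_var = (resp * (X - new_mu) ** 2).sum(axis=0) / Nm  # Eq. (5.13)
    return new_w, new_mu, new_var
```

With a single component the responsibilities are all one, so the update collapses to the sample mean and (biased) sample variance, which is a convenient correctness check.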
Algorithm 1: Constructing the new training data
Input: O, T, K, basic sample number per shot B
Output: new training data pools {X_{1:K}}
for k ← 1 to K do
  for t ← 1 to T do
    // find the sampling number W for this shot
    if n_t > B then
      if n_t · γ_k(t) > B then W = B else W = B · γ_k(t)
    else
      if n_t > γ_k(t) · B then W = B · γ_k(t) else W = B
    uniformly sample W vectors in the shot
  end
  construct the training data pool X_k
end

Algorithm 2: EM for the GMM
Input: iteration number I, parameter set Π^0, training data X_{1:K} ⇐ Algorithm 1
Output: new parameter set {Π̂_k | k = 1, 2, 3, 4}
for k ← 1 to K do
  for i ← 1 to I do
    E-step: compute p(m | x^k_{t,w}, Π_k) ⇐ (5.10)
    M-step: compute α̂^k_m ⇐ (5.11); μ̂^k_m ⇐ (5.12); Σ̂^k_m ⇐ (5.13)
  end
end

Algorithm 3: EM for the embedded HMM-GMM
Input: iteration number I, parameter set Γ^0
Output: new parameter set Γ̂
for i ← 1 to I do
  E-step: compute γ_k(t) ⇐ (5.7); ξ_{k,j}(t) ⇐ (5.8)
  M-step: compute π̂_k, â_{k,j} ⇐ (5.14); α̂^k_m, μ̂^k_m, Σ̂^k_m ⇐ Algorithm 2
end

5.3.3 Experimental Results We have tested the proposed generative model on nine 30-minute football videos. First, we implement the GMM-based algorithm, where each of the four plays or camera views is characterized by a 3rd-order GMM using the frame-based observations (camera motion and frame indices) in a shot. Then we examine the embedded HMM-GMM (HMM(3)) algorithm, which is initialized by the previously trained GMM for each play. The Viterbi decoding algorithm is used to find the optimal state sequence, i.e., the play type for each shot. The comparison between the four methods is shown in Table 5.3, where we can see that the embedded HMM-GMM greatly improves the mid-level structure analysis results compared with the GMM and the HMMs with Gaussian or Gaussian mixture emissions.
This is mainly because the proposed embedded HMM-GMM can characterize both intra-shot motion features and inter-shot play transitions via the GMM embedded into the HMM, enabling more accurate play analysis. There are two straightforward ways that may further improve the classification performance under the same framework: one is to add more salient features to the observations of the latent states; the other is to involve adaptive GMM order estimation in Algorithm 2. However, there are still two limitations of the embedded HMM-GMM. One is that it assumes the correlated frame-wise observations in a shot are independent; the other is that the learning of the first-layer GMMs does not fully use all available frame-wise features, due to the sampling method used to avoid the training bias introduced by variable-length shots. Still, the promising results from this method motivate us to further improve the observation model in the HMM with better generality and capability.

5.4 Segmental HMMs (SHMMs) Although the embedded HMM-GMM provides a more effective observation model than the traditional HMM, it still assumes that all frame-wise features in a shot are independent. This assumption ignores the temporal dependency across frames in a shot. Also, it uses only part of the observations for training. Therefore, we invoke a segmental HMM (SHMM) to address this problem, which handles variable-length observations in a more analytical way [56]. The SHMM was first introduced for speech recognition, where a segmental observation model characterizes variations of the speech signal, as shown in Fig. 5.3 [28]. Instead of each hidden state generating one observation as in the traditional HMM, each hidden state of the SHMM emits a sequence of observations, called a segment. In the SHMM, all observations in a given segment are assumed to be independent of the observations belonging to other segments.
In addition, within each segment, all observations are conditionally independent given the mean of that segment. Thus a closed-form likelihood function is attainable that can capture the rich statistics of frame-wise features in a shot and can accommodate variable-length shots.

Table 5.3: Shot-based classification results of four algorithms for nine 30-min NFL videos.

Test Videos          | Semantics   | Supervised GMM | HMM(1) [61] | HMM(2) [61] | HMM(3) [57]
Video1 (156 shots)   | Play Type   | 26.92% | 43.59% | 60.90% | 77.56%
                     | Camera View | 35.90% | 55.77% | 72.44% | 76.28%
Video2 (163 shots)   | Play Type   | 30.67% | 49.07% | 61.34% | 70.56%
                     | Camera View | 41.10% | 61.35% | 67.48% | 79.14%
Video3 (167 shots)   | Play Type   | 23.35% | 44.91% | 49.10% | 58.08%
                     | Camera View | 37.12% | 53.29% | 58.68% | 61.08%
Video4 (168 shots)   | Play Type   | 36.90% | 58.33% | 64.67% | 70.66%
                     | Camera View | 47.61% | 64.88% | 70.24% | 73.21%
Video5 (168 shots)   | Play Type   | 29.17% | 50.59% | 61.90% | 72.02%
                     | Camera View | 40.47% | 56.55% | 65.48% | 73.81%
Video6 (170 shots)   | Play Type   | 38.25% | 54.11% | 60.59% | 70.59%
                     | Camera View | 41.17% | 52.35% | 64.12% | 70.59%
Video7 (170 shots)   | Play Type   | 37.05% | 64.70% | 71.17% | 70.59%
                     | Camera View | 52.35% | 68.24% | 72.35% | 75.29%
Video8 (171 shots)   | Play Type   | 40.93% | 56.14% | 65.50% | 72.51%
                     | Camera View | 50.88% | 62.57% | 75.44% | 77.78%
Video9 (173 shots)   | Play Type   | 39.88% | 63.01% | 67.05% | 69.36%
                     | Camera View | 43.93% | 66.07% | 68.21% | 71.68%
Average Accuracy     |             | 43.59% | 57.44% | 62.62% | 71.85%

Figure 5.3: The segmental hidden Markov model (SHMM).

5.4.1 Model Specifications In the traditional HMM-based method, due to the mismatch between shot-level states and frame-level features, we have to adopt the average feature of a shot as the observation corresponding to the shot-based hidden state, which ignores both the statistical dependency and the variety among frames. Therefore, we employ the SHMM instead. In video analysis, we regard a segment as a shot and the observations in a segment as the frames in that shot.
Therefore, we need to consider two kinds of independence: first, segments are independent given the states; second, all observations in a segment are conditionally independent given the mean of the segment. Under these assumptions, we can establish a statistical relationship between multiple observations and a single semantic structure. An SHMM of M states is characterized by the parameter set Θ = {π_m, a_{m,n}, μ_{μ,m}, Σ_{μ,m}, Σ_m | m, n = 1, ..., M}, where π_m is the initial probability and a_{m,n} the transition probability; μ_{μ,m} and Σ_{μ,m} parameterize the Gaussian prior of the segment mean for state m, and Σ_m is the within-segment observation variance. Given a shot at time t with n_t observations, i.e., O_t = {O_{t,k} | k = 1, ..., n_t}, the conditional likelihood of O_t is defined as

p(O_t | S_t = m, Θ) = ∫ p(μ | S_t = m, Θ) p(O_{t,1:n_t} | μ, S_t = m, Θ) dμ.   (5.15)

Under the conditional independence assumption, we have

p(O_t | S_t = m, Θ) = ∫ p(μ | S_t = m, Θ) Π_{k=1}^{n_t} p(O_{t,k} | μ, S_t = m, Θ) dμ,   (5.16)

where p(O_{t,k} | μ, S_t = m, Θ) = N(O_{t,k}; μ, Σ_m) and p(μ | S_t = m, Θ) = N(μ; μ_{μ,m}, Σ_{μ,m}). Compared with traditional HMMs, the new term is the Gaussian prior on the mean μ, originally introduced to capture the characteristics of a segment of a speech signal, i.e., at the phone or model level, which varies with different speakers or stress conditions [28]. If we set Σ_{μ,m} = 0, (5.16) reduces to

p(O_t | S_t = m, Θ) = Π_{k=1}^{n_t} p(O_{t,k} | S_t = m, Θ),   (5.17)

where p(O_{t,k} | S_t = m, Θ) = N(O_{t,k}; μ_{μ,m}, Σ_m), which is equivalent to the Gaussian emission. This is then similar to our previously proposed embedded HMM-GMM [57], which involves a two-layer observation model. In practice, the Gaussian prior on μ was found useful for capturing the temporal dependency of all frame-wise observations in a shot, as shown by the significant improvement of the SHMM over the traditional HMMs on play-based and view-based shot classification [56].
5.4.2 Learning and Inference This density function describes the probability distribution of a semantic unit that generates multiple observations, i.e., the distribution of the visual features for a specific camera view within a shot of length n_t. Following the derivation in [28], let

Ō_t = Σ_{k=1}^{n_t} O_{t,k},   (5.18)
(Σ_{t,m})^{-1} = (Σ_{μ,m})^{-1} + n_t (Σ_m)^{-1},   (5.19)
μ_{t,m} = (Σ_{t,m} ((Σ_{μ,m})^{-1} (μ_{μ,m})' + (Σ_m)^{-1} (Ō_t)'))'.   (5.20)

Then, taking the logarithm, we obtain

log p(O_{t,1:n_t} | S_t = m, Θ) = log(R_m^{n_t} R_{μ,m} / R_{t,m}) − (1/2) μ_{μ,m} Σ_{μ,m}^{-1} (μ_{μ,m})' − (1/2) Σ_{k=1}^{n_t} O_{t,k} Σ_m^{-1} (O_{t,k})' + (1/2) μ_{t,m} Σ_{t,m}^{-1} (μ_{t,m})',   (5.21)

where

R_m = 1 / ((2π)^{n/2} |Σ_m|^{1/2}),   (5.22)
R_{μ,m} = 1 / ((2π)^{n/2} |Σ_{μ,m}|^{1/2}),   (5.23)
R_{t,m} = 1 / ((2π)^{n/2} |Σ_{t,m}|^{1/2}).   (5.24)

Then we can employ the forward-backward algorithm to calculate the total probability summed over all possible paths. As the size of each segment, i.e., the length of each shot, is known, we can represent the forward and backward processes as (5.25)–(5.28), rather than considering all possible segment lengths as in [28]:

α_{m,t} = p(O_1, ..., O_t, S_t = m | Θ),   (5.25)
β_{m,t} = p(O_{t+1}, ..., O_T | S_t = m, Θ),   (5.26)
γ_{m,t} = α_{m,t} β_{m,t} / Σ_{m=1}^{M} α_{m,t} β_{m,t},   (5.27)
ξ_{m,n}(t) = α_{m,t} a_{m,n} p(O_{t+1} | S_{t+1} = n, Θ) β_{n,t+1} / Σ_{m,n} α_{m,t} a_{m,n} p(O_{t+1} | S_{t+1} = n, Θ) β_{n,t+1}.   (5.28)

Then we complete the parameter re-estimation as follows:

μ̂_{μ,m} = Σ_{t=1}^{T} γ_{m,t} Ō_t / Σ_{t=1}^{T} γ_{m,t} n_t,   (5.29)
Σ̂_{μ,m} = [Σ_{t=1}^{T} γ_{m,t} (n_t μ̂_{μ,m} − Ō_t)^2 / n_t^2] / [Σ_{t=1}^{T} γ_{m,t}],   (5.30)
Σ̂_m = [Σ_{t=1}^{T} γ_{m,t} (Σ_{k=1}^{n_t} (O_{t,k})^2 − (Ō_t)^2 / n_t)] / [Σ_{t=1}^{T} γ_{m,t} (n_t − 1)],   (5.31)
â_{k,j} = Σ_t ξ_{k,j}(t) / Σ_t γ_{k,t},   (5.32)
π̂_k = γ_{k,1}.   (5.33)

Compared with the HMM, the SHMM only adds one additional variance parameter to each state, which does not increase the computational load of training by much. After obtaining the estimated parameter set, we can use the Viterbi algorithm to estimate the optimal state sequence S*_{1:T}, as stated in the previous section.
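The closed-form segment likelihood of (5.18)–(5.21) is easy to verify numerically in the scalar case: for n_t = 1 it must reduce to the marginal N(O; μ_{μ,m}, Σ_{μ,m} + Σ_m). A minimal sketch, with our own variable names:

```python
import numpy as np

def shmm_segment_loglik(obs, mu_prior, var_prior, var_obs):
    """Closed-form segment log-likelihood of one SHMM state (scalar case),
    integrating out the segment mean mu ~ N(mu_prior, var_prior) with
    observations O_k ~ N(mu, var_obs); cf. Eqs. (5.18)-(5.21)."""
    obs = np.asarray(obs, float)
    n = len(obs)
    s = obs.sum()                                              # Eq. (5.18)
    var_post = 1.0 / (1.0 / var_prior + n / var_obs)           # Eq. (5.19)
    mu_post = var_post * (mu_prior / var_prior + s / var_obs)  # Eq. (5.20)
    # Eq. (5.21): normalizer log(R_m^n R_mu / R_t) plus the quadratic terms
    log_R = -0.5 * (n * np.log(2 * np.pi * var_obs)
                    + np.log(2 * np.pi * var_prior)
                    - np.log(2 * np.pi * var_post))
    quad = (-0.5 * mu_prior ** 2 / var_prior
            - 0.5 * np.sum(obs ** 2) / var_obs
            + 0.5 * mu_post ** 2 / var_post)
    return float(log_R + quad)
```

With mu_prior = 0, var_prior = var_obs = 1 and a single observation at 1.0, the result agrees with log N(1; 0, 2), confirming the completion-of-squares in (5.21).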
Algorithm 1: EM for the SHMM
Input: iteration number W, parameter set Θ^0, video data O = {O_{t,k} | t = 1, ..., T; k = 1, ..., n_t}
Output: new parameter set Θ̂
for w ← 1 to W do
  for m ← 1 to M do
    E-step: p(O_{t,1:n_t} | S_t = m, Θ) ⇐ (5.21); α_{m,t}, β_{m,t}, γ_{m,t}, ξ_{m,n}(t) ⇐ (5.25)–(5.28)
    M-step: μ̂_{μ,m}, Σ̂_{μ,m}, Σ̂_m, â_{m,n}, π̂_m ⇐ (5.29)–(5.33)
  end
end

5.4.3 Experimental Results

Classification Results

Our experiments again use the nine 30-minute football videos. Five different approaches are tested on the video data: the GMM, the HMM with Gaussian emission, the HMM with GMM emission, the embedded HMM-GMM, and the SHMM. Table 5.4 shows the shot classification results, i.e., the overall classification accuracy of the different approaches. For the SHMM-based method, we initialize the model parameters Θ with the training result of the HMM with Gaussian emission, because the HMM provides a coarse estimate of the statistical properties of each shot. We initialize Σ_m by multiplying Σ_{μ,m} by a small scalar, because, compared with Σ_{μ,m}, Σ_m only specifies a small variation range of the observations. Then we run the proposed EM algorithm to obtain the optimized model parameters. After training, we employ Viterbi decoding again to get the state sequence, e.g., the camera view sequence.

5.5 Discussions From the experimental results in Table 5.4, we can see the performance of each proposed algorithm. The boost in performance mainly comes from the capability to capture the diversity of the statistical properties of each shot; e.g., the SHMM not only regards the mean of each shot as a continuous variable but also represents the variance by another statistical model, which better captures the statistics of each shot. Moreover, the quantity of observations is another important factor.
Rich observations provide more chances of finding enough realizations of each shot that share the same statistical property.

Table 5.4: Shot-based classification results of five algorithms for nine 30-min NFL videos.

Test Videos          | Semantics   | Supervised GMM | HMM(1) [61] | HMM(2) [61] | HMM(3) [57] | SHMM [56]
Video1 (156 shots)   | Play Type   | 26.92% | 43.59% | 60.90% | 77.56% | 80.13%
                     | Camera View | 35.90% | 55.77% | 72.44% | 76.28% | 79.49%
Video2 (163 shots)   | Play Type   | 30.67% | 49.07% | 61.34% | 70.56% | 77.91%
                     | Camera View | 41.10% | 61.35% | 67.48% | 79.14% | 84.05%
Video3 (167 shots)   | Play Type   | 23.35% | 44.91% | 49.10% | 58.08% | 68.86%
                     | Camera View | 37.12% | 53.29% | 58.68% | 61.08% | 74.85%
Video4 (168 shots)   | Play Type   | 36.90% | 58.33% | 64.67% | 70.66% | 77.98%
                     | Camera View | 47.61% | 64.88% | 70.24% | 73.21% | 79.17%
Video5 (168 shots)   | Play Type   | 29.17% | 50.59% | 61.90% | 72.02% | 74.40%
                     | Camera View | 40.47% | 56.55% | 65.48% | 73.81% | 73.21%
Video6 (170 shots)   | Play Type   | 38.25% | 54.11% | 60.59% | 70.59% | 80.59%
                     | Camera View | 41.17% | 52.35% | 64.12% | 70.59% | 80.00%
Video7 (170 shots)   | Play Type   | 37.05% | 64.70% | 71.17% | 70.59% | 75.88%
                     | Camera View | 52.35% | 68.24% | 72.35% | 75.29% | 81.18%
Video8 (171 shots)   | Play Type   | 40.93% | 56.14% | 65.50% | 72.51% | 78.95%
                     | Camera View | 50.88% | 62.57% | 75.44% | 77.78% | 80.11%
Video9 (173 shots)   | Play Type   | 39.88% | 63.01% | 67.05% | 69.36% | 75.14%
                     | Camera View | 43.93% | 66.07% | 68.21% | 71.68% | 77.46%
Average Accuracy     |             | 43.59% | 57.44% | 62.62% | 71.85% | 77.42%

CHAPTER 6 MULTI-CHANNEL SHMM FOR KEYWORDS DETECTION

6.1 Background This work is based on our previous research [61, 57, 56] and motivated by recent advances in HMM theory and applications. We propose a new HMM, the Multi-Channel Segmental Hidden Markov Model (MCSHMM), which incorporates the advantages of recently extended HMMs such as the CHMM, HHMM, and SHMM. Specifically, the proposed MCSHMM features a hybrid hierarchical-parallel structure for dynamic modeling among hidden states, and it is also equipped with the SHMM for observation modeling.
The objective of this new model is to enhance semantic video analysis by characterizing the rich statistics of frame-wise features for shot-wise state inference and by capturing the interaction between multiple semantic structures. Moreover, as the modeling complexity grows, model training becomes more challenging and even problematic. Therefore, we integrate structure learning and parameter learning in one framework by using the entropic prior and parameter extinction proposed in [32], which optimize the model structure and model parameters simultaneously.

All the HMM-based approaches we have discussed so far can only detect one semantic structure at a time. Usually, a sports video contains multiple mid-level semantic keywords in parallel, such as play types and camera views. More interestingly, there are interactions among them. Specifically, it was observed that camera views and play types are closely related during a game. For example, after a shot of the central view accompanied by a short play, the camera view in the next shot is likely to remain the same; while after a long play, the next camera view may switch to the left or right camera view.

Figure 6.1: In the MCSHMM, the first layer includes two SHMMs, and the second layer captures the interaction between the two channels.

In this work, we are interested in exploring the interaction between multiple semantic structures with the aim of improving the overall mining performance. It is tempting to invoke the Coupled HMM (CHMM) of [22], which supports multiple Markov channels simultaneously. However, the CHMM usually assumes a relatively strong interaction between the two Markov chains, which may overweight or even corrupt the Markovian property within each chain if the assumption does not hold, as observed in our practice.
Therefore, we advance a new multi-channel SHMM (MCSHMM) that involves two parallel SHMMs and a two-layer hierarchical Markovian structure, as shown in Fig. 6.1. In the MCSHMM, on the one hand, the two SHMMs are equipped with the powerful segmental observation model to mine the individual semantic structures; on the other hand, the MCSHMM can explore multiple semantic structures with a hybrid parallel-hierarchical dynamic model.

6.2 Model Specifications The unique feature of the MCSHMM is the integration of both parallel and hierarchical structures in the dynamic model and the incorporation of a segmental observation model, which gives the MCSHMM more flexibility, capacity, and functionality than the CHMM, SHMM, and HHMM. From the viewpoint of generative models, both the dynamic model (among hidden states) and the observation model of the MCSHMM have a two-layer structure, which greatly enhances its capability of learning and inference. Specifically, at the first layer of the dynamic model, S = {S^(j)_t | t = 1, ..., T; j = 1, 2} denotes the state sequences of the two channels, where S^(j)_t is the state of shot t in channel j; at the second layer of the dynamic model, F = {F_t = (S^(1)_t, S^(2)_t) | t = 1, ..., T} represents the second-layer state sequence, where each state consists of the two current first-layer states. At the observation layer, O^(j)_t = {O^(j)_{t,1:n_t} | t = 1, ..., T; j = 1, 2} denotes the observations of shot t with n_t frames in channel j. In this research, all shots are pre-segmented, and n_t is known.
Therefore, the MCSHMM's parameter set Θ = {Π, A, Ω} includes the following components:
• Initial probabilities:

Π = {P(S^(1)_1), P(S^(2)_1), P(F_1 | S^(1)_1, S^(2)_1)};   (6.1)

• Transition probabilities:

A = {A_w | w = 1, 2, 3},   (6.2)

where

A_1 = {P(S^(1)_t = m | S^(1)_{t-1} = n, F_{t-1} = l)},   (6.3)
A_2 = {P(S^(2)_t = m | S^(2)_{t-1} = n, F_{t-1} = l)},   (6.4)
A_3 = {P(F_t = l | S^(1)_t = m, S^(2)_t = n)};   (6.5)

• Observation density functions:

p(O^(j)_t | S^(j)_t = m, Ω) = ∫ N(μ; μ^(j)_{μ,m}, Σ^(j)_{μ,m}) Π_{k=1}^{n_t} N(O^(j)_{t,k}; μ, Σ^(j)_m) dμ,   (6.6)

where Ω = {μ^(j)_{μ,m}, Σ^(j)_{μ,m}, Σ^(j)_m | j = 1, 2; m = 1, ..., N^(j)} specifies the two segmental models, N^(j) is the number of states in channel j (in this work, N^(1) = N^(2) = 4), and O^(j)_{t,k} denotes the observation of frame k in shot t and channel j.

The Gaussian prior on the mean μ captures the temporal dependency among the observations in a shot by assuming conditional independence given μ. In addition, the closed-form likelihood function well represents the variable-length observations of different shots. However, the integration in (6.6) has to be approximated during EM learning [28]. Given the dual-channel observations of T shots, O = {O^(j)_t | t = 1, ..., T; j = 1, 2}, the joint likelihood is defined as

p(S, F, O | Θ) = P(S^(1)_1) P(S^(2)_1) P(F_1 | S^(1)_1, S^(2)_1) Π_{t=1}^{T} P(F_t | S^(1)_t, S^(2)_t) Π_{t=2}^{T} Π_{j=1}^{2} P(S^(j)_t | S^(j)_{t-1}, F_{t-1}) Π_{t=1}^{T} Π_{j=1}^{2} p(O^(j)_t | S^(j)_t).

The Expectation Maximization (EM) algorithm is often used for learning a DBN/HMM such as the MCSHMM; it finds the optimal parameters by maximizing the likelihood function through iterative Expectation (E-step) and Maximization (M-step). However, direct maximum likelihood (ML) learning of the MCSHMM could be problematic due to the complex state space specified by S and F, which leads to a large set of model parameters, such as A in (6.2). In addition, we expect that the useful state space could be much smaller than the one covering all possible configurations.
Therefore, we hope to find an effective learning algorithm that can optimize the parameter set and the model structure (the state space and state dynamics) simultaneously.

6.3 Learning and Inference

As mentioned before, model learning has two aspects: structure learning and parameter learning. The former aims at finding a compact and effective model structure by simplifying the state space and reducing the parameter set, while the latter optimizes the model parameters given a predefined model structure. Going one step further, we propose a new unsupervised learning algorithm in which the two learning processes are unified into one framework, so that the model structure and parameters can be optimized simultaneously [27, 32]. In this work, we predefine a coarse model structure that includes every possible configuration of the semantic structure, and then apply the ideas of the entropic prior and parameter extinction proposed in [32], which lead to a maximum a posteriori (MAP) estimator.

According to (6.6), for each segment in each channel the likelihood can be evaluated in closed form; its logarithm involves the following auxiliary parameters:

$$\bar{O}_t^{(j)} = \sum_{k=1}^{n_t} O_{t,k}^{(j)},\qquad (6.7)$$
$$(\Sigma_{t,m}^{(j)})^{-1} = (\Sigma_{\mu,m}^{(j)})^{-1} + n_t (\Sigma_m^{(j)})^{-1},\qquad (6.8)$$
$$\mu_{t,m}^{(j)} = \left(\Sigma_{t,m}^{(j)}\left((\Sigma_{\mu,m}^{(j)})^{-1} (\mu_{\mu,m}^{(j)})' + (\Sigma_m^{(j)})^{-1} (\bar{O}_t^{(j)})'\right)\right)',\qquad (6.9)$$
$$R_m^{(j)} = \frac{1}{(2\pi)^{n/2} |\Sigma_m^{(j)}|^{1/2}},\qquad (6.10)$$
$$R_{\mu,m}^{(j)} = \frac{1}{(2\pi)^{n/2} |\Sigma_{\mu,m}^{(j)}|^{1/2}},\qquad (6.11)$$
$$R_{t,m}^{(j)} = \frac{1}{(2\pi)^{n/2} |\Sigma_{t,m}^{(j)}|^{1/2}}.\qquad (6.12)$$

The entropic prior is a bias toward a compact model with good determinism. In our work, this MAP-based estimator is incorporated into the M-step of the EM algorithm, where it encourages a maximally structured and minimally ambiguous model. This is accomplished by trimming the weakly supported parameters and states.
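The bias introduced by the entropic prior can be seen with a tiny numerical check (our own illustration): a near-deterministic transition row receives a higher (less negative) log-prior $\sum P \log P$ than a uniform one, which is exactly what pushes the M-step toward compact, decisive transition matrices:

```python
import math

def log_entropic_prior(rows):
    # log P_e up to a constant: sum of P*log(P) over all transition entries
    total = 0.0
    for row in rows:
        total += sum(p * math.log(p) for p in row if p > 0.0)
    return total

near_deterministic = [[0.97, 0.01, 0.01, 0.01]]
uniform = [[0.25, 0.25, 0.25, 0.25]]

# The near-deterministic row (low entropy) gets the larger log-prior.
print(log_entropic_prior(near_deterministic), log_entropic_prior(uniform))
```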
In our work, the MAP estimator focuses on $A$ defined in (6.2), which has many possible state transitions due to the large number of possible states of $F$ ($N^{(1)} \times N^{(2)} = 16$ here). In other words, we want to simplify $A$ by keeping only the important state transitions, which effectively reduces the number of useful states in $F$ and balances the incoming and outgoing transitions between the two layers. Consequently, the MAP-based EM estimator finds the optimal parameters by

$$\Theta^* = \arg\max_{\Theta} P_e(\Theta \mid O),\qquad (6.13)$$

where

$$P_e(\Theta \mid O) \propto P(O \mid \Theta) P_e(\Theta) = \left(\sum_{S,F} p(S, F, O \mid \Theta)\right) P_e(\Theta),\qquad (6.14)$$

where $p(S, F, O \mid \Theta)$ is the joint likelihood given above and $P_e(\Theta)$ is the entropic prior of the model corresponding to the parameter set $\Theta$, which in this work depends on $A$ as

$$P_e(\Theta) \propto \exp\left(\sum_w \sum_p \sum_q P_{p,q}^{w} \log P_{p,q}^{w}\right),\qquad (6.15)$$

where $P_{p,q}^{w}$ denotes a transition probability in $A_w$. Accordingly, in the M-step of the EM algorithm, we update the transition probabilities by maximizing the entropic posterior,

$$A^* = \arg\max_{A} \left\{\log P_e(\Theta \mid O) + \sum_w \sum_p \lambda_p^{w} \left(\sum_q P_{p,q}^{w} - 1\right)\right\},\qquad (6.16)$$

where $\lambda_p^{w}$ is the Lagrange multiplier that ensures $\sum_q P_{p,q}^{w} = 1$. Using an optimization technique similar to the one discussed in [32], the MAP estimate is obtained by setting the derivative of the log-posterior in (6.16) to zero:

$$\frac{\partial}{\partial P_{p,q}^{w}} \left\{\log P_e(\Theta \mid O) + \sum_w \sum_p \lambda_p^{w} \left(\sum_q P_{p,q}^{w} - 1\right)\right\} = 0.\qquad (6.17)$$

Then we obtain

$$\frac{\omega_{p,q}^{w}}{P_{p,q}^{w}} + \log P_{p,q}^{w} + 1 + \lambda_p^{w} = 0,\qquad (6.18)$$

where

$$\omega_{p,q}^{w} = p(S, F \mid O, \Theta).\qquad (6.19)$$

This equation can be solved with the Lambert W function, which is the inverse of

$$f(w) = w e^{w},\qquad (6.20)$$

where $e^{w}$ is the natural exponential function and $w$ is any complex number.
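The W-based solution of the stationarity condition (6.18), i.e., (6.29), can be checked numerically. A small self-contained sketch (our own; the values of $\omega$ and $\lambda$ are arbitrary, chosen so the W argument lies in the principal branch's real range $(-1/e, 0)$):

```python
import math

def lambert_w0(z, iters=60):
    # Principal branch of the Lambert W function (w * e^w = z) via Newton's method
    w = 0.0
    for _ in range(iters):
        ew = math.exp(w)
        w -= (w * ew - z) / (ew * (w + 1.0))
    return w

# Illustrative values only: omega plays the role of the posterior weight in (6.18)
# and lam the Lagrange multiplier; -omega * e^(1 + lam) must lie in (-1/e, 0).
omega, lam = 0.05, 0.0
P = -omega / lambert_w0(-omega * math.exp(1.0 + lam))

# Verify the stationarity condition (6.18): omega/P + log(P) + 1 + lam = 0
residual = omega / P + math.log(P) + 1.0 + lam
print(P, residual)
```

In practice one would use a library implementation such as SciPy's `scipy.special.lambertw` rather than this hand-rolled Newton iteration.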
For any complex number $z$, we have

$$z = W(z) e^{W(z)}.\qquad (6.21)$$

Taking logarithms,

$$\log W(z) + W(z) = \log z,\qquad (6.22)$$

which is equivalent to

$$-\log W(z) - W(z) + \log z = 0.\qquad (6.23)$$

If we set $z = e^{x}$, we obtain

$$-\log W(e^{x}) - W(e^{x}) + x = 0.\qquad (6.24)$$

For any complex number $y$, this can be rewritten as

$$\log \frac{y}{W(e^{x})} - \frac{y}{\,y / W(e^{x})\,} + x - \log y = 0.\qquad (6.25)$$

If we let

$$x = 1 + \lambda_p^{w} + \log y\qquad (6.26)$$

and

$$y = -\omega_{p,q}^{w},\qquad (6.27)$$

then (6.25) becomes

$$\log \frac{-\omega_{p,q}^{w}}{W(e^{1 + \lambda_p^{w} + \log(-\omega_{p,q}^{w})})} + \frac{\omega_{p,q}^{w}}{\,-\omega_{p,q}^{w} / W(e^{1 + \lambda_p^{w} + \log(-\omega_{p,q}^{w})})\,} + 1 + \lambda_p^{w} = 0.\qquad (6.28)$$

Comparing (6.28) with (6.18) term by term, we can read off the solution for $P_{p,q}^{w}$:

$$P_{p,q}^{w} = \frac{-\omega_{p,q}^{w}}{W(-\omega_{p,q}^{w}\, e^{1 + \lambda_p^{w}})}.\qquad (6.29)$$

After $A$ is optimized, the other parameters in $\Theta$ are also updated in the M-step accordingly. We resort to the junction tree algorithm in [18] to implement the MAP-based EM algorithm. The junction tree is an auxiliary data structure that converts a directed acyclic graph (DAG), such as the MCSHMM, into an undirected graph by eliminating cycles, so that belief propagation can be performed effectively on the modified graph for inference and learning. By applying the forward-backward algorithm, we obtain

$$\alpha_{m,t}^{(j)} = p(O_1^{(j)}, \dots, O_t^{(j)},\ S_t^{(j)} = m \mid \Theta),\qquad (6.30)$$
$$\beta_{m,t}^{(j)} = p(O_{t+1}^{(j)}, \dots, O_T^{(j)} \mid S_t^{(j)} = m,\ \Theta),\qquad (6.31)$$
$$\gamma_{m,t}^{(j)} = \frac{\alpha_{m,t}^{(j)} \beta_{m,t}^{(j)}}{\sum_{m=1}^{N^{(j)}} \alpha_{m,t}^{(j)} \beta_{m,t}^{(j)}}.\qquad (6.32)$$

Then we obtain the parameter updates for each channel as follows:

$$\hat{\mu}_{\mu,m}^{(j)} = \frac{\sum_{t=1}^{T} \gamma_{m,t}^{(j)}\, \bar{O}_t^{(j)} / n_t}{\sum_{t=1}^{T} \gamma_{m,t}^{(j)}},\qquad (6.33)$$
$$\hat{\Sigma}_{\mu,m}^{(j)} = \frac{\sum_{t=1}^{T} \gamma_{m,t}^{(j)}\, (n_t \hat{\mu}_{\mu,m}^{(j)} - \bar{O}_t^{(j)})^2 / n_t^2}{\sum_{t=1}^{T} \gamma_{m,t}^{(j)}},\qquad (6.34)$$
$$\hat{\Sigma}_m^{(j)} = \frac{\sum_{t=1}^{T} \gamma_{m,t}^{(j)} \left(\sum_{k=1}^{n_t} (O_{t,k}^{(j)})^2 - (\bar{O}_t^{(j)})^2 / n_t\right)}{\sum_{t=1}^{T} \gamma_{m,t}^{(j)} (n_t - 1)}.\qquad (6.35)$$

Consequently, this learning process strengthens the important states in $F$ and drives the weakly supported ones toward zero. From the transitions of small probability, the states in $F$ that are seldom visited can be identified and eventually eliminated.
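The forward-backward quantities (6.30)-(6.32) can be sketched for a plain discrete-observation HMM (a deliberate simplification of the segmental case, with made-up numbers of our own):

```python
import numpy as np

def forward_backward(pi, A, B, obs):
    # alpha, beta, gamma as in (6.30)-(6.32) for a discrete-observation HMM.
    # pi: (K,) initial probs; A: (K,K) transitions; B: (K,M) emissions;
    # obs: list of observed symbol indices.
    T, K = len(obs), len(pi)
    alpha = np.zeros((T, K))
    beta = np.zeros((T, K))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)   # normalize per time step

pi = np.array([0.6, 0.4])
A = np.array([[0.8, 0.2], [0.3, 0.7]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
gamma = forward_backward(pi, A, B, obs=[0, 0, 1, 1])
print(gamma)   # row t holds the posteriors gamma_{m,t}; each row sums to 1
```

In the MCSHMM the per-shot emission term is the segmental density (6.6) rather than a discrete table, but the recursions have the same shape.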
Hereby the state space is optimized by balancing the state transitions between the two layers. There are therefore two trimming processes: transition trimming and state trimming. For the MCSHMM, the transition probabilities are obtained in the E-step of the EM algorithm. To find a transition that can be trimmed, we look for a transition probability for which the gain in the entropic prior outweighs the loss in the likelihood:

$$\frac{P_e(\Theta \setminus P_{p,q}^{w})}{P_e(\Theta)} \ge \frac{P(O \mid \Theta)}{P(O \mid \Theta \setminus P_{p,q}^{w})},\qquad (6.36)$$

where $\Theta \setminus P_{p,q}^{w}$ represents the parameter set $\Theta$ without the element $P_{p,q}^{w}$. Next, we trim the state space based on the estimated/trimmed transition probabilities. The purpose is to keep the useful states in $F$, namely those visited frequently, as expressed by their incoming and exit transitions. Similar to [32], we can detect trimmable states by balancing the prior probability of all of a state's incoming and exit transitions against the probability mass that flows through it:

$$\frac{P(\Theta \setminus F_l)}{P(\Theta)} \ge \prod_{m,n,k} \left(a_{m,n,l}^{(k)}\right)^{a_{m,n,l}^{(k)}},\qquad (6.37)$$

where $k \in \{1, 2, 3\}$ and

$$a_{m,n,l}^{(1)} = P(S_t^{(1)} = m \mid S_{t-1}^{(1)} = n,\ F_{t-1} = l),\qquad (6.38)$$
$$a_{m,n,l}^{(2)} = P(S_t^{(2)} = m \mid S_{t-1}^{(2)} = n,\ F_{t-1} = l),\qquad (6.39)$$
$$a_{m,n,l}^{(3)} = P(F_t = l \mid S_t^{(1)} = m,\ S_t^{(2)} = n).\qquad (6.40)$$

The above state trimming occurs in each iteration of the M-step, and each iteration may trim some transitions/states, leading to a fast learning process. In practice, we need to decide the size of the expected state space; here we keep 6 out of the 16 states in $F$. After model learning, the Viterbi algorithm can be used to estimate the optimal state sequence for both channels at the first layer, i.e., $S$, which encodes the two semantic structures, camera views and play types. Our EM algorithm implementation is based on the Bayes Net Matlab Toolbox developed by K. P. Murphy.
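The Viterbi decoding step mentioned above can be sketched generically (discrete emissions and toy numbers of our own, not the MCSHMM's segmental emissions); for a sequence this short the result can be verified against brute-force enumeration of all state paths:

```python
import numpy as np
from itertools import product

def viterbi(pi, A, B, obs):
    # Most likely state sequence for a discrete-observation HMM (log domain).
    T, K = len(obs), len(pi)
    delta = np.log(pi) + np.log(B[:, obs[0]])
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + np.log(A)      # scores[i, j]: come from i, go to j
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(K)] + np.log(B[:, obs[t]])
    path = [int(np.argmax(delta))]               # backtrack from the best final state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

pi = np.array([0.6, 0.4])
A = np.array([[0.8, 0.2], [0.3, 0.7]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
obs = [0, 0, 1, 1, 0]
print(viterbi(pi, A, B, obs))
```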
6.4 Experimental Results

The MCSHMM was tested on nine 30-minute NFL football games (352×240, 30 fps) that had been pre-segmented into a series of consecutive play shots by removing commercials and replays. Each game has 150-170 shots, and each shot has 300-500 frames. Additionally, using the Bayes Net Matlab Toolbox¹, we generated four two-channel synthetic data sequences, each with 200 segments of variable-length observations (200-400 samples), according to MCSHMMs learned from four real videos. Using both synthetic and real data allows comprehensive algorithm validation and evaluation. Our algorithm was implemented in Matlab 7.0 and tested on a PC with a 3.2 GHz CPU and 1 GB of memory. Seven previous methods are included for comparison, as shown in Table 6.1: a supervised GMM (order 3); the HMM with Gaussian emission (HMM(1)) and the HMM with GMM emission (HMM(2)) [61]; the embedded GMM-HMM (HMM(3)) in [57]; and the SHMM in [56]. We also implemented two CHMM-based methods [22], where CHMM(1) uses a GMM and CHMM(2) a segmental observation model. The first five explore the two semantic structures (plays and views) independently and separately, while the last three estimate both jointly.

¹http://www.cs.ubc.ca/~murphyk/Software/BNT/bnt.html

Method   | State number | Observation model | Transition matrix
GMM      | 4            | 3rd-order GMM     | N/A
HMM(1)   | 4            | Gaussian          | 4×4
HMM(2)   | 4            | 3rd-order GMM     | 4×4
HMM(3)   | 4            | two-layer GMM     | 4×4
SHMM     | 4            | segmental model   | 4×4
CHMM(1)  | 4, 4         | 3rd-order GMM     | 16×16
CHMM(2)  | 4, 4         | segmental model   | 16×16
MCSHMM   | 4, 4         | segmental model   | 4×64, 4×64, 16×16

Table 6.1: The model settings for all tested methods.

Initialization is important in EM learning. We adopted a coarse-to-fine initialization strategy that uses the training result of a simpler model to initialize a more complex one.
Specifically, we first use K-means (4 classes) to obtain a coarse classification, whose result is used to initialize HMM(1); the training result of HMM(1) then initializes HMM(2), and so on. The training result of the SHMM was used to initialize the MCSHMM. Moreover, since the first shot is always in the central view, the initial probability of the central view is set to 1. We initialize the transition matrices by computing the frequencies of the different state transitions in a couple of real games.

Simultaneous structure learning and parameter learning are crucial for the application of the MCSHMM. We found that the traditional ML estimator cannot take full advantage of the MCSHMM because of its large state space (before trimming). The entropic-prior-based MAP estimator is capable of finding the optimal model structure (a trimmed state space) and the parameters jointly. It was noted in [32] that this MAP estimator can accelerate learning and rescue EM from local probability maxima. Fig. 6.2 and Table 6.2 show the experimental results on synthetic data and real videos, respectively. They clearly show that the proposed MCSHMM outperforms all the other algorithms with significant improvements.

Figure 6.2: The experimental results on synthetic data (panels: GMM, HMM(1), HMM(2), HMM(3), SHMM, CHMM(1), CHMM(2), MCSHMM).

6.5 Discussions

The SHMM performs better than the other HMMs (HMM(1), HMM(2), HMM(3)) because of the segmental observation model it uses. The improvement of the MCSHMM over the SHMM and the two CHMMs lies in the new hierarchical-parallel dynamics. CHMMs usually assume a relatively strong correlation between the two Markov chains, which does not hold in our case. The MCSHMM can effectively capture the mutual interaction between the two Markov chains by introducing the second-layer dynamics and decision-level fusion, which balance the dependency both within each channel and across the two channels.
To reveal the underlying difference between the MCSHMM and the CHMMs, we compare the learned transition matrices of both models with the ground-truth ones, as shown in Fig. 6.4.

Test Videos        | Semantics   | GMM    | HMM(1) [61] | HMM(2) [61] | HMM(3) [57] | CHMM(1) [22] | SHMM [56] | CHMM(2) [22] | MCSHMM
Video1 (156 shots) | Play Type   | 26.92% | 43.59% | 60.90% | 77.56% | 41.03% | 80.13% | 78.85% | 82.05%
                   | Camera View | 35.90% | 55.77% | 72.44% | 76.28% | 58.33% | 79.49% | 81.41% | 84.62%
Video2 (163 shots) | Play Type   | 30.67% | 49.07% | 61.34% | 70.56% | 53.37% | 77.91% | 76.69% | 81.59%
                   | Camera View | 41.10% | 61.35% | 67.48% | 79.14% | 65.03% | 84.05% | 80.98% | 85.28%
Video3 (167 shots) | Play Type   | 23.35% | 44.91% | 49.10% | 58.08% | 52.10% | 68.86% | 70.06% | 75.45%
                   | Camera View | 37.12% | 53.29% | 58.68% | 61.08% | 58.08% | 74.85% | 77.25% | 81.44%
Video4 (168 shots) | Play Type   | 36.90% | 58.33% | 64.67% | 70.66% | 67.86% | 77.98% | 69.62% | 82.74%
                   | Camera View | 47.61% | 64.88% | 70.24% | 73.21% | 69.05% | 79.17% | 75.60% | 84.52%
Video5 (168 shots) | Play Type   | 29.17% | 50.59% | 61.90% | 72.02% | 70.24% | 74.40% | 70.83% | 80.95%
                   | Camera View | 40.47% | 56.55% | 65.48% | 73.81% | 60.71% | 73.21% | 64.88% | 77.38%
Video6 (170 shots) | Play Type   | 38.25% | 54.11% | 60.59% | 70.59% | 76.47% | 80.59% | 70.59% | 84.71%
                   | Camera View | 41.17% | 52.35% | 64.12% | 70.59% | 74.70% | 80.00% | 71.18% | 84.11%
Video7 (170 shots) | Play Type   | 37.05% | 64.70% | 71.17% | 70.59% | 72.35% | 75.88% | 76.47% | 83.53%
                   | Camera View | 52.35% | 68.24% | 72.35% | 75.29% | 67.06% | 81.18% | 71.76% | 87.06%
Video8 (171 shots) | Play Type   | 40.93% | 56.14% | 65.50% | 72.51% | 68.82% | 78.95% | 82.94% | 84.80%
                   | Camera View | 50.88% | 62.57% | 75.44% | 77.78% | 76.02% | 80.11% | 81.18% | 88.30%
Video9 (173 shots) | Play Type   | 39.88% | 63.01% | 67.05% | 69.36% | 67.63% | 75.14% | 69.36% | 82.08%
                   | Camera View | 43.93% | 66.07% | 68.21% | 71.68% | 69.94% | 77.46% | 73.41% | 84.97%
Average Accuracy   |             | 43.59% | 57.44% | 62.62% | 71.85% | 63.54% | 77.42% | 75.08% | 82.92%

Table 6.2: Shot-based classification results of the eight algorithms for the nine 30-minute NFL videos. (GMM is the supervised baseline.)

We can see the sparse nature of the three transition matrices in the MCSHMM, and the learning results are very close to the ground truth (Fig. 6.4(a)(b)(c)).
However, the learned transition matrix of the CHMM does not match the ground truth (Fig. 6.4(d)), possibly due to the large state space. On the other hand, the current MCSHMM implementation (without program optimization) has the highest computational load (about 50 minutes per video), while the other methods take between 2 and 20 minutes. This is mainly due to the large number of initial states at the second layer (most of which are trimmed during training) and the complexity of the segmental observation model. More efficient learning schemes are needed for fast state trimming. It is also possible to extend the MCSHMM to cases involving more than two channels. The major challenge will be how to condense the initial state space effectively and efficiently; a possible strategy could be "incremental growing" rather than "progressive trimming".

The proposed MCSHMM could be applied to other field-based sports in which the two mid-level semantic keywords, play types and camera views, can be well specified. After extracting useful visual features, which may be composed of color distribution, camera motion, or useful visual landmarks, we can invoke the MCSHMM to estimate play types and camera views jointly; these can in turn be used to infer certain high-level semantics by involving some domain knowledge. For example, as shown in Fig. 6.3, in a soccer game we can define several views, such as the whole field, midfield, opposing half, field corner, and penalty area, as well as a few plays, such as kickoff, pass, free kick, corner kick, and shot. A goal highlight can then be specified as a shot in the penalty area followed by a kickoff in the midfield. We can have a similar semantic analysis paradigm for video mining in baseball, as shown in Fig. 6.3.
Figure 6.3: A proposed semantic analysis paradigm for sports video mining: from low-level visual features (color, motion, etc.) through mid-level semantic structures (camera views and play types) to high-level semantics (successful offense/defense, fouls, etc.).

Figure 6.4: Comparison of the state transitions between the ground truth (top) and the learning results (bottom). The first three are for the MCSHMM defined in (6.3), (6.4) and (6.5), and the last one is for a CHMM. (a) A1 (4×64); (b) A2 (4×64); (c) A3 (16×16); (d) full state transitions of a CHMM (16×16).

CHAPTER 7

CONDITIONAL RANDOM FIELDS FOR EVENT DETECTION

7.1 Background

In the previous sections, we elaborated the generative-model (i.e., HMM) based approaches for mid-level semantic keyword analysis. The HMM-based approaches naturally fit our problem formulation, in which the hidden states represent the mid-level semantic keywords (camera views or play types) defined at the shot level and the observations are the visual features extracted at the frame level. As mentioned, the mid-level semantic keywords deliver the rudimentary building blocks for high-level semantic analysis because they can represent high-level semantics (i.e., events) effectively. In particular, in addition to the play type and camera view, we introduce another mid-level structure, the possession, which indicates the team holding the ball and can easily be detected from the initial camera motion direction at the beginning of a shot. Thus, given the three kinds of mid-level semantic keywords for all shots, the second inference problem aims at detecting the interesting events, such as highlights (e.g., the touchdown plays in a football game). However, the second inference problem, from the mid-level semantic keywords to the high-level semantics, is different from the first in nature.
High-level semantics usually refer to the events or incidents that are of particular interest to the user, with the following three characteristics:

• They are rare and specific, so the training data could be insufficient;

• They have more ambiguities and uncertainties in terms of low-level visual features compared with the mid-level semantic keywords; e.g., a touchdown in a football video could occur in many different ways;

• They can be better defined or specified from mid-level semantic keywords.

Unlike the first inference problem, which is well supported by generative models, the second inference problem, also referred to as event detection in this work, is very different in nature. We use one example for discussion. As shown in Fig. 7.1, an event of interest, e.g., a touchdown, can be inferred from mid-level semantic keywords across several shots, including camera views, play types, and possessions. Event detection requires a flexible modeling structure that allows more versatile dependency relationships between random variables (both observations and states) to accommodate the nature of high-level semantics. Normally, generative models mainly focus on the temporal dependency across states (not observations). This motivates us to use discriminative models for the second inference problem, where the observations correspond to the mid-level semantic keywords and the states are associated with the high-level semantics or events of interest. The mid-level semantic keywords are frequent, repeatable, and relatively well defined; therefore, in the first inference problem, generative models are preferred because of their good generality on a large data set (low-level visual features) and their suitability for capturing repeated and stable patterns (mid-level semantic keywords).
However, the high-level semantics that are immediately useful to viewers are quite specific (or user-dependent) and can hardly be defined uniquely and deterministically from low-level visual features. As shown in Fig. 7.2, can the high-level semantics be better specified by mid-level semantic keywords, thanks to their enhanced expressiveness? The key issue, therefore, is: what kind of statistical model offers enough flexibility and capability for complex high-level semantic analysis?

Figure 7.1: The second inference problem (example: high-level semantics: touchdown; camera view sequence: right, middle, left, end zone; play type sequence: long, long, short, field goal; possession sequence: right, right, right, right).

Figure 7.2: From mid-level semantic keywords to high-level semantics.

We can make a few important observations:

• There are strong temporal dependencies across shots with respect to each mid-level structure, and an evident semantic correlation between the multiple mid-level structures in each shot;

• A touchdown play cannot be determined from a single shot; its determination requires multiple mid-level semantic structures from several shots;

• There is some uncertainty in the definition of the touchdown play, and ambiguity between the touchdown play and the field goal.

Moreover, the events of interest, such as touchdowns or field goals, are usually relatively rare and unbalanced compared with the other nonspecific events in a game. This is in clear contrast with the mid-level semantic keywords. A supervised learning method is more appropriate for high-level semantic analysis because of the very small portion of events of interest. Therefore, it is important to find an effective statistical model that addresses the above characteristics of high-level semantic analysis. In particular, we are interested in developing a discriminative-model-based approach for event detection.
The discriminative model provides a flexible and plausible way to capture the complex dependencies among multiple mid-level semantic keywords both within and across shots. As introduced above, mid-level semantic keywords are more repeatable and relatively better defined in sports video, and hence it is more reliable to explore the high-level semantics from the mid-level semantic keywords than from the low-level features. As shown in Fig. 7.1, a high-level semantic, e.g., the touchdown in American football video, usually involves multiple mid-level semantic keyword sequences with a relatively stable configuration. Intuitively, when the mid-level semantic keywords are available, we can fit them as the observations and regard the high-level semantics as the hidden states. However, the mid-level semantic keyword sequences differ from the low-level features in that they carry more complex dependencies, which break the default independence assumptions of the generative model. The discriminative model provides a feasible way to relax the strong independence assumptions and prior knowledge required in generative models. This motivates us to use discriminative models for the second inference problem.

7.2 Model Specification

Conditional random fields (CRFs) (Fig. 7.3(a)) were first introduced in [33]; they provide an undirected probabilistic model that directly computes the statistical mapping between the input (observations) and the output (states) for classification and regression. Compared with HMMs, a CRF does not involve the conditional distributions of observations, and a state variable in a CRF can be related to multiple observations in different time slices, e.g., multiple shots. Compared with an SVM, a CRF is more appropriate for capturing the temporal dependencies among both observations and states in sequential data. For example, in the case shown in Fig.
7.3(b), we use a Markov blanket [37] to represent all the feature functions involved in the current state $S_t$: the first-order state transition serves as the transition feature function, and, in particular, five consecutive shots of mid-level semantic keywords serve as the state feature functions. This number was found to be a good compromise between complexity and accuracy. A CRF model can be characterized by the parameter set

$$\Theta = (\lambda_1, \lambda_2, \dots, \lambda_n, \mu_1, \mu_2, \dots, \mu_n),\qquad (7.1)$$

which specifies the feature functions between observations and hidden states as

$$p_\Theta(S \mid O) = \frac{1}{Z(O)} \exp\left\{\sum_t \left(\sum_i \lambda_i f_i(S_{t-1}, S_t) + \sum_j \mu_j g_j(S_t, O)\right)\right\},\qquad (7.2)$$

where $Z(O)$ is a normalization over the data sequence $O$, and $f_i(\cdot)$ and $g_j(\cdot)$ are the transition feature functions and state feature functions, respectively. Via maximum likelihood (ML) estimation, we optimize the CRF by maximizing the conditional probability:

$$\Theta^* = \arg\max_\Theta p_\Theta(S \mid O).\qquad (7.3)$$

Then, for a new data sequence $O$, we predict the label sequence $S$ that maximizes $p(S \mid O, \Theta^*)$, i.e.,

$$S^* = \arg\max_S p(S \mid O, \Theta^*).\qquad (7.4)$$

Specifically, in our research, we naturally regard $O$ as the available mid-level semantic keyword sequence and $S$ as the high-level event sequence. As shown in Fig. 7.3(b), the CRF used in this work involves a first-order transition feature function and a fifth-order state feature function. We utilize the dynamic programming algorithms proposed in [66] to optimize the CRF parameters, and the classification task is then straightforward according to (7.4).

Figure 7.3: (a) The CRF model; (b) the CRF for event detection (S: touchdown; M, R, E: central, right, and end-zone camera views; L, S, G, K: long play, short play, field goal, and kick; LR, RL: possessions from left to right and from right to left).
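To make (7.2) concrete, here is a toy binary-state linear-chain CRF in Python (our own illustration; the feature weights and keyword indices are invented), where $Z(O)$ is computed by brute-force enumeration over all label sequences, so the conditional probabilities sum to one:

```python
import math
from itertools import product

def crf_score(S, O, lam, mu):
    # Unnormalized score: transition features f(S_{t-1}, S_t) plus
    # state features g(S_t, O_t), each weighted as in (7.2).
    score = sum(lam[S[t - 1]][S[t]] for t in range(1, len(S)))
    score += sum(mu[S[t]][O[t]] for t in range(len(S)))
    return math.exp(score)

def crf_posterior(S, O, lam, mu):
    # p(S | O) as in (7.2): score normalized by Z(O) over all label sequences.
    Z = sum(crf_score(list(s), O, lam, mu)
            for s in product([0, 1], repeat=len(O)))
    return crf_score(S, O, lam, mu) / Z

# States: 0 = no event, 1 = touchdown; observations: keyword indices per shot.
lam = [[0.5, -0.5], [-1.0, 1.0]]            # invented transition weights
mu = [[0.2, -0.3, 0.0], [-0.5, 0.8, 0.4]]   # invented state-feature weights
O = [0, 1, 1, 2]
p = crf_posterior([0, 1, 1, 1], O, lam, mu)
print(p)
```

Real CRF training and decoding replace the exponential enumeration of $Z(O)$ with dynamic programming over the chain, as in the gradient-based approaches of [66].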
7.3 Learning and Inference

Specifically, in our research, we naturally regard $O$ as the available mid-level semantic keyword sequences and $S$ as the corresponding high-level semantic sequences; therefore, we can formulate the second inference problem as (7.4) based on the trained CRF. The key point is how to find the parameter set $\Theta$ that maximizes the conditional probability. We represent the parameter set as

$$\Theta = (\lambda_1, \lambda_2, \dots, \lambda_n, \mu_1, \mu_2, \dots, \mu_n),\qquad (7.5)$$

in which the $\lambda_k$ and $\mu_k$ are the parameters to be estimated. Obviously, the CRF makes a direct attempt to compute the input-to-output mappings for classification and regression and eschews modeling of the underlying distributions, so more dependencies between observations can be considered in terms of features, without the strict independence assumptions that exist in generative models. In practice, what we need to do is to define possible features of the observations (mid-level semantic keywords) and the hidden states (high-level semantics) in a football video sequence, utilize dynamic programming algorithms, e.g., gradient-based approaches [66], to optimize the model parameters, and then perform the classification, which is straightforward according to (7.4). First, (7.2) can be written in log form:

$$\log p_\Theta(S \mid O) = \sum_{e \in E, k} \lambda_k f_k(e, s_e, o) + \sum_{v \in V, k} \mu_k g_k(v, s_v, o) - \log Z(o);\qquad (7.6)$$

taking the derivative with respect to $\Theta$, we obtain

$$\frac{\partial \log p_\Theta(s \mid o)}{\partial \Theta} = \frac{\partial}{\partial \Theta} \left\{\sum_{e \in E, k} \lambda_k f_k(e, s_e, o) + \sum_{v \in V, k} \mu_k g_k(v, s_v, o) - \log Z(o)\right\}.\qquad (7.7)$$

In the training process, the first two terms are easy to obtain by counting the number of 1s in the Boolean sequences $f_k$ and $g_k$; the normalization $Z(o)$ can then be obtained via the maximal cliques [67].

7.4 Experimental Results

Given the mid-level sema
Title  Probabilistic Graphic Models for Sports Video Mining: Hybrid Generative-Discriminative Approaches 
Date  2010-12-01 
Author  Ding, Yi 
Keywords  discriminative, generative, graphical model, machine learning, sports video, video mining 
Department  Electrical Engineering 
Document Type  
Full Text Type  Open Access 
Abstract  With the development of multimedia and internet technologies, there is a growing interest in video mining research, which aims to discover the knowledge existing in video data. The major challenge of video mining is how to bridge the semantic gap between low-level features and high-level semantics, which characterizes the difference between two descriptions of the video data: the computable visual descriptions and the understandable semantic interpretations. In this dissertation, we propose a new sports video mining framework in which a hybrid generative-discriminative approach is used for multilevel video semantic analysis. A three-layer semantic space is proposed, by which semantic video analysis is converted into two interrelated statistical inference procedures at different levels. The first is to infer the mid-level semantic keywords from the low-level visual features via generative models, which can serve as building blocks of high-level semantic analysis. The second is to detect high-level semantics from mid-level semantic keywords by using discriminative models, which are of direct interest to users. Specifically, HMM-based approaches and CRF-based approaches are employed in the two inference problems, respectively. In the first inference problem, to explore multiple coexistent semantic keywords jointly, we developed a multi-channel segmental hidden Markov model (MCSHMM), which integrates both hierarchical and parallel dynamic structures to offer more flexibility and capacity for capturing interactions between multiple Markov chains, and which incorporates the segmental HMM (SHMM) to deal with variable-length observations. In the second inference problem, in order to segment and recognize the high-level semantics, i.e., the game flow, simultaneously, we proposed the auxiliary segmentation conditional random fields (ASCRFs), which can both group a set of plays into different drives and handle possible missing keywords with an additional auxiliary layer. The use of hybrid generative-discriminative approaches in the two different stages proves to be effective and appropriate for multilevel semantic analysis in sports video. The experimental results from a set of American football video data demonstrate that the proposed framework offers superior results compared with other traditional approaches. 
Note  Dissertation 
Rights  © Oklahoma Agricultural and Mechanical Board of Regents 
. . . . . . . . . . . . . . . . . . . . . . . . . . . 75 7.3 Learning and Inferences . . . . . . . . . . . . . . . . . . . . . . . . . 77 7.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 7.5 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 8 SEGMENTATION CRFS FOR GAME FLOW ANALYSIS 82 8.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 8.2 Segmentation CRFs (SCRFs) for Game Flow Analysis . . . . . . . . 84 8.2.1 Model Specifications . . . . . . . . . . . . . . . . . . . . . . . 86 8.2.2 Learning and Inferences . . . . . . . . . . . . . . . . . . . . . 87 8.3 Missing Data Handling by Auxiliary SCRFs (ASCRFs) . . . . . . . . 90 8.3.1 Model specification . . . . . . . . . . . . . . . . . . . . . . . . 90 8.3.2 Learning and inferences . . . . . . . . . . . . . . . . . . . . . 93 8.3.3 Experimental results . . . . . . . . . . . . . . . . . . . . . . . 94 8.4 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 9 CONCLUSIONS 100 BIBLIOGRAPHY 102 vii LIST OF TABLES Table Page 2.1 The comparison between the discriminative and generative models . . 20 4.1 Rank of information gain(I.G.) . . . . . . . . . . . . . . . . . . . . . . 34 5.1 Single shot classification results on camera views . . . . . . . . . . . . 40 5.2 Shotbased classification results of HMMbased algorithms for nine 30 min NFL videos. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 5.3 Shotbased classification results of seven algorithms for eight 30min NFL videos. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 5.4 Shotbased classification results of eight algorithms for nine 30min NFL videos. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 6.1 The model settings for all testing methods. . . . . . . . . . . . . . . . 66 6.2 Shotbased classification results of nine algorithms for nine 30min NFL videos. . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . 68 7.1 Highlevel event detection by rulebased and CRFbased approaches. . 79 viii LIST OF FIGURES Figure Page 1.1 The increasing demands for the the video mining. . . . . . . . . . . . 1 1.2 Examples of two major categories of video data: news  script video (top); football  nonscript video (bottom). . . . . . . . . . . . . . . . 3 1.3 Physical representation of the video data. . . . . . . . . . . . . . . . . 4 1.4 The first challenge: the semantic gap. . . . . . . . . . . . . . . . . . . 5 1.5 The second challenge: a unified representation. . . . . . . . . . . . . . 5 1.6 The outline of the dissertation. . . . . . . . . . . . . . . . . . . . . . 10 2.1 Structures of discriminative models and generative models. . . . . . . 13 2.2 Generative models: variations of the DBN: Hidden Markov Model (HMM), Coupled HMM (CHMM), Factorial HMM (FHMM), and Hierarchical HMM (HHMM). The white and black nodes represent the state and observations, respectively. . . . . . . . . . . . . . . . . . . 14 2.3 Discriminative Models: variations of the conditional random field (CRF): the Conditional Random Field (CRF), the Dynamic CRF (DCRF), and the Hidden CRF (HCRF). . . . . . . . . . . . . . . . . . . . . . . . . 18 3.1 The threelayer semantic space. . . . . . . . . . . . . . . . . . . . . . 23 3.2 Visual features in football video: (a). play type analysis; (b). camera view analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.3 Two semantic keywords and their dynamics in the American football video, i.e., camera views (top) and play types (bottom). . . . . . . . . 27 ix 3.4 A partial game flow in an American football game that contains threes drives: a scored drive, a nonscored drive, and a turnover, and each drive includes a series of annotated plays. . . . . . . . . . . . . . . . . 28 4.1 Play ground detection result. . . . . . . . . . . . . . . . . . . . . . . . 31 4.2 Yard line detection result. . . . . . . . . . . . . . . . . . . . . . . . 
. 32 4.3 Distributions of shotbased (top) and framebased (bottom) Dave with respect to four camera views. . . . . . . . . . . . . . . . . . . . . . . 35 5.1 Hidden Markov Models (HMMs). . . . . . . . . . . . . . . . . . . . . 38 5.2 A new generative model . . . . . . . . . . . . . . . . . . . . . . . . . 42 5.3 The segmental hidden Markov models (SHMM). . . . . . . . . . . . . 50 6.1 In the MCSHMM, the rst layer includes two SHMMs , and the second layer captures the interaction between two channels. . . . . . . . . . . . . . . 57 6.2 The experimental results on synthetic data. . . . . . . . . . . . . . . . . 67 6.3 A proposed semantic analysis paradigm for sports video mining. . . . 70 6.4 Comparison of the state transitions between the ground truth (top) and learning results (bottom). The first three are for the MCSHMM defined in (6.3), (6.5) and (6.5), and the last one for CHMM. (a) A1(4 × 64); (b) A2(4 × 64) (c) A3(16 × 16) (d) Full state transitions of a CHMM (16×16) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 7.1 The second inference problem . . . . . . . . . . . . . . . . . . . . . . 73 7.2 From midlevel semantic keywords to highlevel semantics . . . . . . . 73 7.3 (a) CRF model; (b) CRF for event detection (S): touchdown; M,R,E: central, right and end zone camera views; L,S,G,K: long play, short play, field and kick; LR,RL: possessions from left to right, possessions from right to left. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 x 7.4 Examples of the touchdown in terms of individual midlevel semantic keywords . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 8.1 The example of a scored drive in football game flow. (The football stadium background is from NFL.com) . . . . . . . . . . . . . . . . . 86 8.2 (Left) SCRF and (Right) ASCRF. . . . . . . . . . . . . . . . . . . . . 89 8.3 The experimental results for four models. Left: the segmentation accuracy, right: the overall recognition accuracy. 
8.4 The comparison of the matching scores in different drives for two feature functions in the ASCRF. Integrity template: (a) scored, (b) turnover, (c) non-scored; similarity score: (d) scored, (e) turnover, (f) non-scored

CHAPTER 1

INTRODUCTION

1.1 Motivation

Video mining is the discovery of knowledge, patterns, and events, collectively called semantics, in video data stored in databases, data warehouses, or other online repositories [1, 2]. Its benefits range from efficient browsing and summarization of video content to facilitating video access and retrieval. As shown in Fig. 1.1, driven by the ever-increasing needs of numerous multimedia and online database applications, there is a growing demand for efficient tools that allow us to quickly find the video data and information of interest.

Figure 1.1: The increasing demands for video mining.

Just as a book offers readers two main tools, the Table of Contents (TOC) and the Index, which provide an overview and a quick way to find information of interest, one goal of video mining is to allow a viewer to access video data like reading a book. Correspondingly, there are three main research issues in video mining: summarization/abstraction [3], browsing/skimming [4], and indexing/retrieval [5, 6]. These three research aspects facilitate the analysis of video content and help us build a TOC and an Index for the video data.

According to their production and editing styles, videos can be classified into two main categories, scripted and non-scripted [7], which are usually associated with different video mining tasks. Scripted videos are produced or edited according to a predefined script or plan. News and movies are typically highly scripted videos composed of predefined segments or episodes.
Building a TOC [8], which supports efficient browsing and indexing, is better suited to scripted videos. In contrast, the events in non-scripted videos, such as meeting, sports, and surveillance videos, happen spontaneously and usually in a relatively fixed setting; for these videos, detecting highlights or events of interest is the main concern. Sports video mining has been widely studied due to its great commercial value [9, 10, 11, 4]. Although raw (non-edited) sports videos are considered non-scripted, they usually have a relatively well-defined structure (such as the field scene) or repetitive patterns (such as a certain play type), which can help us enhance the "scriptedness" of sports videos for more versatile and flexible access. In this dissertation, non-scripted video, specifically sports video, is studied using American football video as a case study. The proposed framework for sports video mining can also be applied to other types of sports video and to other non-scripted or scripted video data.

1.2 Research Goals and Challenges

Traditionally, in online video services and general multimedia applications, video data is parsed and stored as frames, shots, and video sequences according to the physical representation shown in Fig. 1.3, and then labeled manually or by predefined rules for specific types of video data. However, such labeling is insufficient or ineffective due to the difference between how humans perceive the content of the video and how the editor organizes the video content physically.

Figure 1.2: Examples of two major categories of video data: news, a scripted video (top); football, a non-scripted video (bottom).

In general, there are two kinds of video mining approaches: event-based [12, 13] and structure-based [9, 14, 15].
Event-based approaches detect events of interest or highlights, e.g., goals in a soccer game, which enhances the semantic understanding of the video content. However, this kind of approach is usually task-dependent and designed for a specific purpose. In contrast, structure-based approaches parse a long video sequence into individual segments, e.g., plays and breaks, which can serve as an overall semantic representation of the video. Although structure-based approaches provide limited semantics, they can support further high-level semantic analysis. Due to the complementary nature of events and structures, researchers have proposed new approaches that integrate both kinds in one video mining framework. For example, a mid-level representation framework was proposed for semantic sports video analysis in which both temporal structures and the event hierarchy were involved [11]; a mosaic-based generic scene representation was developed from video shots and used to explore both events and structures [1].

Our goal is to address both event detection and structure exploration in a unified multi-level video mining framework.

Figure 1.3: Physical representation of the video data as a sequence of shots (Shot 1, ..., Shot i-1, Shot i, Shot i+1, ..., Shot N).

The unified video mining framework should be capable of both mid-level keyword detection and high-level semantic analysis, and should support a low-level-to-high-level semantic description of the video, which will facilitate the discovery of knowledge, patterns, and events in the video data in order to support video summarization, browsing, and retrieval. There are two challenges: one is how to bridge the semantic gap between low-level visual features and high-level semantics; the other is how to unify both events and structures in one semantic representation framework:

• The semantic gap characterizes the difference between two types of descriptions of the same video data under different representations.
In sports video mining, the semantic gap is the difference between human-understandable semantic concepts and computable visual features. As shown in Fig. 1.4, when watching a football game, people are interested in specific moments or events, but the computer can only deal with visual features extracted from the raw video data, which makes it difficult to automatically find perceivable video content of interest from computable visual features.

• As shown in Fig. 1.5, there are two types of representations of high-level semantics for the same video data. The data can be interpreted through either semantic events, e.g., kicks and punts, or semantic structures, e.g., plays and breaks. These two types of semantic representations reflect different viewpoints but are complementary to each other. Therefore, we need a unified representation framework capable of revealing both specific semantic events and the general semantic structure in a complementary way.

Figure 1.4: The first challenge: the semantic gap (visual features vs. semantics).

Figure 1.5: The second challenge: a unified representation (structure vs. events).

1.3 Our Approaches and Contributions

Machine learning refers to algorithms capable of the autonomous acquisition and integration of data and knowledge by computers. It is considered one of the most effective approaches to video mining [4] because of its capability of converting video mining tasks into learnable mathematical problems that can be largely handled by the computer. In particular, probabilistic graphical models such as Hidden Markov Models (HMMs) and Conditional Random Fields (CRFs) have become the most popular methods for dealing with temporal series information. Their popularity is rooted in intuitively representing the dependency relationships between variables by a graph, which fits video mining tasks very well.
A variety of graphical-model-based approaches have been proposed for diverse video mining tasks. In our research, we propose a unified, probabilistic-graphical-model-based, three-layer framework for semantic video analysis, in which low-level visual features, mid-level keywords, and high-level semantics are connected by two correlated inference processes.

1.3.1 A three-layer semantic representation

We advocate a three-layer semantic space that includes low-level visual features, mid-level keywords, and high-level semantics. Low-level visual features are salient visual characteristics that can be extracted from the raw video data; mid-level keywords are a set of stable, recurrent, and well-defined semantic units that can be inferred from the low-level features; and high-level semantics, including events and structures, can be inferred from the mid-level keywords and are of direct interest to users. The semantic space is a unique framework with the following two characteristics:

• The semantic space is a unified representation of the semantic content of the video data. It provides an explicit semantic modeling framework that can represent both events and structures jointly. In addition, the mid-level keywords introduced in the semantic space can serve as building blocks of the high-level semantics, further bridging the semantic gap.

• The semantic space guides us to convert the video mining problem into two related statistical inference problems: from low-level visual features to mid-level keywords, and then from mid-level keywords to high-level semantics. Considering the distinct characteristics of the different levels, we address these two inference steps individually with different approaches, thereby reducing the complexity and uncertainty of the video mining problem.
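As a schematic illustration of this two-step decomposition, the toy sketch below chains the two inference stages together. The function names, keyword vocabulary, and lookup rules are hypothetical placeholders for exposition only; the actual stages in later chapters use HMM-based and CRF-based models.

```python
# Toy sketch of the two-stage inference pipeline described above.
# All names and rules here are hypothetical placeholders, not the
# dissertation's actual models (HMMs for stage 1, CRFs for stage 2).

def infer_keywords(features):
    # Stage 1 stand-in: low-level visual features -> mid-level keywords.
    table = {"wide_grass": "end_zone_view", "fast_pan": "long_play"}
    return [table.get(f, "unknown") for f in features]

def infer_semantics(keywords):
    # Stage 2 stand-in: mid-level keyword sequence -> high-level semantics.
    if "long_play" in keywords and "end_zone_view" in keywords:
        return "candidate_scoring_event"
    return "no_event"

features = ["fast_pan", "wide_grass"]       # extracted per shot in practice
keywords = infer_keywords(features)
print(keywords, infer_semantics(keywords))
```

The point of the decomposition is that each stage only needs to model one level of uncertainty: stage 1 never sees high-level labels, and stage 2 never sees raw features.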
1.3.2 Inferring mid-level keywords by generative graphical models

We propose a series of generative-model-based approaches, especially extended hidden Markov models, for detecting mid-level keywords; these offer more flexibility, capability, and functionality than existing HMMs for video mining. Specifically, we address the following three main technical issues:

• How to enhance the observation model in HMM-based approaches to handle variable-length frame-wise observations for shot-wise state inference?

• How to capture the interaction between multiple coexistent semantic keywords, such as the play type (what happened?), the camera view (where did it happen?), and the possession (who is holding the ball?)?

• How to optimize the model structure and learn the model parameters simultaneously?

1.3.3 Inferring high-level semantics by discriminative graphical models

We study discriminative-model-based approaches, especially conditional random fields (CRFs) and their extensions, for high-level semantic analysis of both semantic events and semantic structures. These approaches take the mid-level semantic keywords generated by the MCSHMM and detect the high-level semantics in which people are more interested. Specifically, we address the following three main technical issues:

• How to detect highlights/events given a series of consecutive mid-level keywords?

• How to jointly segment and recognize variable-length semantic structures from a set of annotated keywords?

• How to handle missing keywords during inference, i.e., when some of the mid-level keywords are incomplete, how to recover the structure accurately?
1.3.4 Our Contributions

The major advantages of our approach are the explicit semantic modeling and direct semantic computing implemented through the semantic space and graphical models, which are unique in three aspects:

• First, the three-layer semantic space, involving low-level visual features, mid-level keywords, and high-level semantics, provides a unified representation of the different levels of semantics in sports video data. More importantly, guided by the semantic space, sports video mining is converted into two inter-correlated inference problems that can be solved by probabilistic graphical models.

• Second, mid-level keyword detection is addressed by a series of HMM-based approaches. We enhance the modeling capability of traditional HMMs in two respects: the observation model and the dynamic model. On the one hand, segmental modeling techniques improve the observation model so that it can handle the variable-length observations belonging to a single keyword, a characteristic of most sports video that cannot be handled by traditional HMM-based methods. On the other hand, multi-channel segmental models provide a powerful way to improve the dynamic model by formulating the useful interactions between different semantic keyword sequences, thereby improving the recognition accuracy of each channel; this is very important for handling a complex semantic network involving multiple inter-correlated semantic sequences.

• Third, event detection and structure exploration are solved by CRF-based discriminative models, which are able to capture long-range dependencies in both the high-level semantics and the mid-level keywords. These models can also jointly segment and recognize a set of annotated keywords as a series of variable-length structures.
More importantly, the proposed models can handle missing data, i.e., incomplete keywords, by adding an auxiliary layer that learns additional contextual information during the model learning stage, which can then be used to compensate for missing keywords in the inference stage.

In this work, we choose American football video as a case study, in which keyword detection, event recognition, structure analysis, and missing-data handling are incorporated into one framework. The proposed paradigm can be flexibly extended to other sports video mining applications by incorporating a game-specific semantic space and then reformulating the two inference problems.

1.4 Outline

This dissertation is organized as shown in Fig. 1.6. In Chapter 2, we first review recent work on video mining, introducing both generative- and discriminative-model-based approaches. In Chapter 3, we introduce the concept of the semantic space, where the relevant low-level visual features, mid-level keywords, and high-level semantics are specified. Then, in Chapters 4-8, we address the video mining tasks at the three levels, respectively: how to extract informative low-level features; how to use generative-model-based approaches, i.e., HMMs, Embedded HMMs, and Segmental HMMs, to detect mid-level keywords; and how to use discriminative graphical models, i.e., CRFs and Segmentation CRFs, to recognize high-level semantics, comparing them in terms of modeling capability, efficiency, and video mining performance. Specifically, in Chapter 6, we introduce a novel algorithm called the Multi-Channel Segmental Hidden Markov Model (MCSHMM) and compare it with previous approaches. Then, in Chapter 8, we investigate a Segmentation-CRF (SCRF)-based approach for game flow analysis and propose a new approach called Auxiliary SCRFs (ASCRFs). Finally, we conclude the dissertation with a discussion of future research in Chapter 9.
Figure 1.6: The outline of the dissertation.

CHAPTER 2

LITERATURE REVIEW

Probabilistic-graphical-model-based approaches for video mining are generally classified into two categories, as shown in Fig. 2.1. One is the generative approach, which estimates the posterior probabilities of the hidden states through conditional densities (likelihoods) of the observations combined with priors. The other is the discriminative approach, which uses a parametric model to directly represent the posterior probabilities of the hidden states given the observations.

2.1 Probabilistic Graphical Models

A probabilistic graphical model is a probabilistic model in which a graph denotes the conditional dependencies between random variables. Such models are commonly used in probability theory, statistics, and machine learning [16].
Generally, a probabilistic graphical model uses a graph-based representation as the foundation for encoding a complete distribution over a multi-dimensional space; the graph is a compact, factorized representation of a set of independencies that hold in that distribution. Two branches of graphical representations of distributions are commonly used: Bayesian networks and Markov networks. Both families encompass the properties of factorization and independence, but they differ in the set of independencies they can encode and in the factorization of the distribution they induce [17]. The literature on probabilistic graphical models spans Bayesian networks, undirected Markov networks, discrete and continuous models, and extensions that deal with dynamical systems and relational data. For each class of models, it covers the three fundamental cornerstones: representation, inference, and learning, from basic concepts to advanced techniques.

For many applications of machine learning, the goal is to predict the value of a vector C given the value of a vector X of input features. In a classification problem, C represents a discrete class label, whereas in a regression problem it corresponds to one or more continuous variables. From a probabilistic perspective, the goal is to find the conditional distribution p(C|X). The most common approach is to represent this conditional distribution with a parametric model and then determine the parameters from a training set of input vectors paired with their corresponding target output vectors. The resulting conditional distribution can be used to predict C for new values of X. This is known as a discriminative approach, since the conditional distribution discriminates directly between the different values of C.
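A minimal sketch of the discriminative route just described: a parametric model of p(C=1|x) in logistic form, fit by gradient ascent on the log-likelihood of a toy one-dimensional training set. The data and learning-rate settings are illustrative assumptions, not drawn from the dissertation.

```python
import math

# Toy training set of (input, target) pairs; class 0 sits left of 0, class 1 right.
data = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (2.0, 1)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Parametric conditional model p(C=1 | x) = sigmoid(w*x + b), fit directly:
# no model of p(x) is ever built, which is the defining discriminative trait.
w, b = 0.0, 0.0
for _ in range(2000):  # gradient ascent on the log-likelihood
    gw = sum((c - sigmoid(w * x + b)) * x for x, c in data)
    gb = sum((c - sigmoid(w * x + b)) for x, c in data)
    w += 0.1 * gw
    b += 0.1 * gb

print(sigmoid(w * -2.0 + b), sigmoid(w * 2.0 + b))  # posteriors near 0 and near 1
```

Nothing in the fitted model says anything about how likely an x value is, only which class it belongs to.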
An alternative approach is to find the joint distribution p(X, C), expressed for instance as a parametric model, and then use this joint distribution to evaluate the conditional p(C|X) in order to predict C for new values of X. This is known as a generative approach, since by sampling from the joint distribution it is possible to generate synthetic examples of the feature vector X. In practice, the generalization performance of generative models is often found to be poorer than that of discriminative models, due to differences between the model and the true distribution of the data.

2.2 Generative Models

Generative models, e.g., hidden Markov models in the simplest case, as shown in Fig. 2.2, are usually preferred for large data sets because of their better generality when prior knowledge is available. They rely on a dynamic model of the state and an observation likelihood function, and the decision is made via Bayes' rule.

Figure 2.1: Structures of discriminative models and generative models (hidden states and observations).

Fig. 2.2 shows several typical generative models used for content-based and semantic-based video analysis, including the prototype hidden Markov model (HMM), the coupled HMM (CHMM), the factorial HMM (FHMM), and the hierarchical HMM (HHMM). These models involve specific state dynamics and an observation likelihood function. The expectation-maximization (EM) algorithm is usually used for model learning, and Bayesian inference is used for decision making given an observation sequence. The Dynamic Bayesian Network (DBN) [18] provides a unified probabilistic framework for representing various graphical structures with direct dependencies, and the HMM, considered the simplest DBN, has been widely used in many video analysis applications [19, 20].
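The Bayesian inference step for the simplest generative model above can be sketched with the forward algorithm of a discrete two-state HMM, which sums the joint probability over all hidden state paths. The two states (think "play" vs. "break"), the observation alphabet, and all probability values below are illustrative stand-ins, not parameters from any experiment in this dissertation.

```python
import math

pi = [0.6, 0.4]          # initial state distribution p(x_1)
A = [[0.7, 0.3],         # transition matrix a_ij = p(x_t = j | x_{t-1} = i)
     [0.4, 0.6]]
B = [[0.9, 0.1],         # emission matrix b_j(y) = p(y_t = y | x_t = j)
     [0.2, 0.8]]

def forward_loglik(obs):
    """Log-likelihood log p(y_1..y_T), summing over all hidden state paths."""
    alpha = [pi[j] * B[j][obs[0]] for j in range(2)]
    for y in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(2)) * B[j][y]
                 for j in range(2)]
    return math.log(sum(alpha))

print(forward_loglik([0, 0, 1, 1]))
```

In the EM (Baum-Welch) setting, these same forward quantities, together with their backward counterparts, provide the expected state occupancies used to re-estimate A and B.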
Recently, there have been intensive efforts to enhance HMMs for semantic video analysis, with studies mainly focusing on two issues: structuring and learning. Various extended HMMs have been proposed to further enhance the capability of handling complex temporal series. Specifically, these extended HMMs improve the original HMM in its two major components, i.e., the dynamic and observation models [21]. There are usually two types of dynamic structure for semantic video analysis: parallel and hierarchical. A parallel structure involves information fusion at either the decision level or the feature level. For example, when two parallel semantic structures are involved in a video sequence, the Coupled HMM (CHMM) [22] and the Influence Model [23] can capture the interaction between the two Markov processes via decision-level fusion.

Figure 2.2: Generative models: variations of the DBN: the Hidden Markov Model (HMM), Coupled HMM (CHMM), Factorial HMM (FHMM), and Hierarchical HMM (HHMM). X: latent states; F: switch variables; Y: observations. The white and black nodes represent the states and observations, respectively.

Differently, the Factorial HMM (FHMM) [24] includes multiple uncoupled state transition chains that share the same observation sequence (feature-level fusion) [25], or whose observation vectors are obtained by concatenating low-level features from different modalities or sources [26].
On the other hand, the hierarchical structure usually imposes a multilayer representation where semantic event analysis can be accomplished by two steps, recognition of primitives and recognition of structures. For example, the Hierarchical HMM (HHMM) is able to capture lowlevel primitives based on which we can represent some midlevel structures, e.g., plays and breaks [27]. Furthermore, some signal processing applications may desire a more effective observation model that can represent variablelength observations for state estimation. The segmental HMM (SHMM) was proposed for speech recognition that effectively handle variablelength observations by involving segmental observation models [28]. The key idea in the SHMM is a twolayer observation model that captures feature variability both within a segment and across segments. Additionally, the observation model (or called the emission function) in an HMM also plays an important role for model learning and inference. Traditional HMMs normally represent each state by a single observation either by a Gaussian or Gaussian Mixture Model (GMM). However, in some video analysis, a state may emit a variablelength observation sequence, such as a video sequence composed of a set of consecutive shots where each shot (where the state is defined) consists with many frames (where the observation is collected). How to effectively capture the rich statistics from the variablelength observations is essential to model learning and inference. An interesting segmental HMM was proposed in [28] where a segmental observation model was used to characterize the variablelength observations for speech signals. The key assumption in the SHMM is that the observations in a segment is condition 15 ally independent given the mean, resulting in a closedform likelihood function for model learning and inference. Model learning is another important issue, especially for a HMM with a complex state space or variablelength observations. 
Generally, there are two learning issues: one is model structure learning [29], and the other is model parameter learning [30]. The former tries to provide a compact and effective model representation by condensing the state space and eliminating unnecessary state dynamics. The latter aims at estimating the model parameters given a certain model structure. Usually, a generative model contains two related components, i.e., the dynamic model and the observation model. Since the structure of the generative model is mainly determined by the structure of the dynamic model, model structure learning mainly involves state space reconstruction, i.e., selection of the hidden states and their statistical relationships. When the model structure is available, model parameters can be obtained by either supervised or unsupervised learning algorithms [31]. Additionally, it is imperative to have simultaneous structure learning and parameter learning for complex DBNs to ensure their effectiveness and efficiency in practice. In [32], the concepts of entropic prior and parameter extinction were proposed to simplify and optimize the model structure by a state trimming process. Another possible way is to utilize the reverse-jump Markov Chain Monte Carlo (RJMCMC) method. In [27], hierarchical HMMs (HHMMs) were used for unsupervised sports video mining, where the RJMCMC approach is embedded in the learning of HHMMs to support model selection and feature selection.

2.3 Discriminative Models

The discriminative model based approaches, like the conditional random field (CRF) [33], the support vector machine (SVM) [34], and traditional neural networks [35], tend to directly employ the posterior probability of possible label sequences given an observation sequence for learning and classification, as shown in Fig. 2.3. The SVM approaches focus on structural risk minimization by maximizing the decision margin.
The SVM-based approach can be sensitive to noise and limited in cases where a large and complex data set is involved. However, since it directly models the decision boundary, it is efficient at separating two sets of data when the data set is small or not much prior knowledge is available. In [34], an SVM-based approach was proposed to detect events in field sports video by using audio-visual features. In [36], the SVM was used to classify video shots into certain high-level semantic concepts, such as sports, buildings, and intermediate topic concepts.

Compared with the DBN, the CRF and its variations are more expressive and direct, due to the fact that they allow more flexible statistical dependency structures on observations and states, as shown in Fig. 2.3. The traditional CRF is an undirected conditional probabilistic graphical model for segmenting and labeling sequential data. In order to handle complex dependencies with no clear prior knowledge, in [37], the authors proposed a CRF-based method that employs the Markov blanket to specify the dependency among observations and states. To label sequential data in an interactive way, in [], the authors proposed the dynamic CRF (DCRF) for part-of-speech tagging and noun-phrase segmentation. Due to its unique structure, the DCRF is able to accomplish several labeling tasks at once by sharing information between them. For example, in [38], objects and scenes in the video sequence are jointly labeled successfully by integrating the interaction and dynamic information between them via the DCRF. To handle long-range dependencies as well as latent structure, in [39], the authors proposed the hidden CRF (HCRF) for gesture recognition. Different from the traditional CRF, where a label is assigned to a single observation, the HCRF incorporates a subsequence of hidden state variables in the CRF to assign a label to an entire observation sequence.
The proposed HCRF extends the spatial relationship in the CRF in a joint spatial-temporal way, where the labels of individual observations are optimized and incorporated into a sequence classifier for the recognition task.

Figure 2.3: Discriminative models: variations of the conditional random field (CRF): the Conditional Random Field (CRF), the Dynamic CRF (DCRF), and the Hidden CRF (HCRF). (X and Z: latent states; F: substates; Y: observations.)

CRFs have been widely used in video mining research, e.g., highlight detection [37], human tracking and analysis [40], video browsing [41], and personalized content-based retrieval [42]. More specifically, a Segmentation CRF (SCRF) was proposed in [43] to capture the long-range interaction in both observations and semantics for protein segmentation and recognition. That problem is very similar to video mining research, which involves both segmentation and recognition tasks simultaneously.

2.4 Comparisons

In Fig. 2.1 and Table 2.1, we show some key differences between the discriminative models and the generative models. In our recent research, we are interested in developing new HMMs that offer more capacity, functionality, and flexibility than existing HMMs. The new HMMs should be able to take advantage of explicit semantic modeling, where mid-level semantic keywords defined at the shot level directly correspond to the latent states and visual features extracted at the frame level are used as the observations. We expect the new HMMs to be able to accommodate rich statistics of frame-wise observations for shot-wise state estimation. Also, we want to explore the mutual interaction between two semantic keywords by integrating two Markov processes into the same inference process. Moreover, as the modeling complexity goes up, model training becomes more challenging and even problematic.
We need a more powerful learning algorithm that can optimize the model structure and learn the model parameters at the same time. Although recently proposed HMMs can deal with these issues individually, we want to develop an integrated approach that is able to address all of the above issues in one framework. In addition, when mid-level keywords are ready, we are interested in developing a discriminative model based approach that supports high-level semantics detection. The high-level semantics are quite specific and can hardly be defined uniquely from low-level features. The discriminative model provides a feasible way to relax the strong independence assumptions and prior knowledge requirements that are not available for high-level semantics detection.

2.5 Applications in Video Mining

As aforementioned, sports video mining research can be roughly divided into two groups, structure-based approaches [4, 27] and event-based approaches [12, 37]. Structure-based approaches usually attempt to parse a long video sequence into individual segments by detecting scene changes or the boundaries between camera shots in a video stream. For example, in [15], a soccer video sequence is segmented into plays and breaks, and in [44], the authors classify a video sequence into plays and replays. Both works focused on the detection of general semantic keywords in the video data and treat them as the basic semantic components. Event-based approaches aim to summarize the video data by highlights or specific events of interest.
Event detection in sports video can be done at both the object level [45, 46, 47] and the scene level [48, 49, 50].

Table 2.1: The comparison between the discriminative and generative models

Properties          | Generative Models                        | Discriminative Models
Graphical modeling  | Directed model                           | Undirected model
Model inference     | Estimate a distribution over all         | No direct attempt to model the
                    | variables, i.e., input and output,       | underlying distribution; only
                    | represented by a joint probability       | compute input-to-output mappings
                    | distribution                             | by conditional probability
Usage of prior      | Can parametrically constrain the         | No explicit expression of prior
knowledge           | distribution by giving prior             | knowledge about the distribution
                    | distributions                            | in the model
Training data       | Model is not sensitive to a lack of      | More training data is needed
                    | training data                            |
Model extendability | Versatile, since in the joint            | Task dependent; often lacks
                    | distribution space we can insert         | flexible modeling tools and
                    | knowledge about relationships between    | methods for prior knowledge
                    | variables, invariants, independencies,   |
                    | prior distributions, and so forth        |

Object-level event detection approaches usually associate semantic events with the appearance or behavior of some specific objects. For example, in [46], the authors employed object-based features such as the trajectories of the ball and players to detect goals in a soccer game. However, most semantic events are typically defined on complex collections of multiple objects. Also, in some cases, the object of interest may not always be available. Instead of focusing on objects, scene-level event detection algorithms, e.g., [48], utilize various visual cues in the video data, including color distribution, specific landmarks, camera motion, and even caption information, to detect semantic events. Usually, feature extraction at the scene level is more robust and efficient compared with that at the object level.
There are two major kinds of approaches for event detection: rule-based approaches [51, 52] and statistical model based approaches [50, 34]. The major advantages of rule-based methods are that they are usually goal-oriented and can support specific video search tasks effectively. In [51], camera motion is used to classify each play shot into one of seven different plays. However, in general, the rule-based approaches may not be effective at dealing with the uncertainty and ambiguity of the visual features for high-level semantic analysis, and their task-specificity also limits their flexibility and extensibility. In contrast, the statistical model based approaches are usually more flexible and general for discovering underlying structures or events in video data, such as recurrent events or highlights [50]. Recently, two types of statistical models have been intensively studied for semantic video analysis, the generative models [27] and the discriminative models [34, 37]. Also, some hybrid approaches that combine both generative and discriminative models have been advanced, either to enhance data classification [53] or to detect semantic events in video data [54]. Compared with rule-based approaches, statistical model based approaches have great potential to offer better capability, flexibility, and generality in handling a large volume of video data with high complexity and a diverse nature.

Recently, an interesting video understanding framework was proposed that uses an AND-OR graph to detect the storyline from annotated video sequences [55]. This method was mainly designed to handle single-link causal relationships due to the nature of the AND-OR graph. However, we want to consider complex contextual relationships that may require multi-link undirected dependencies among multiple semantic keywords. Our goal is to deliver a complete game flow to summarize the game that also supports both event-based and structure-based video understanding.
CHAPTER 3

PROBLEM FORMULATION

We believe that explicit semantic modeling and direct semantic computing are effective to bridge the semantic gap for sports video mining. There are two methodologies to address this issue, i.e., data-driven bottom-up methods and concept-driven top-down approaches. A balanced interaction between both methods should be encouraged to provide a generic representation for semantic video analysis. Specifically, we argue that the top-down concept-guided semantic representation plays a critical role in sports video mining and can be supported by data-driven bottom-up methods in an inference framework. Therefore, we build a three-layer semantic space that can help us understand the semantic structure of the video data. The semantic space contains low-level visual features, mid-level keywords, and high-level semantics. The low-level visual features, e.g., color, landmarks, motion, etc., are extracted from the raw video data to support the mid-level semantic keywords. The mid-level semantic keywords are rudimentary semantic building blocks. These building blocks should be frequent, repeatable, and relatively well-defined, and their states can be governed by certain dynamics, e.g., camera views and play types, which are two common structures in most field-based sports. For example, in American football videos, we can define four camera views (central, left, right, and end-zone views) and four play types (the long play, short play, kick, and field goal play), as shown in Fig. 3.3. Other keywords, e.g., possessions, are also possible. A combination of several semantic keywords can specify some high-level semantics. For example, a touchdown highlight could be represented by a long play to the end zone followed by a field goal. Another high-level semantic in this work refers to the game flow, which conveys the semantic evolution of the game and is of great interest to audiences.
This approach supports not only high-level semantics detection but also customized events of interest, and it increases the usability and interactivity of video data.

Figure 3.1: The three-layer semantic space: low-level visual features (color, landmarks, motion); mid-level semantic keywords (play types, camera views, possessions); high-level semantics (scored, non-scored, turnover). Stage 1 infers the mid-level semantic keywords; Stage 2 infers the high-level game flow.

3.1 Semantic Space

As shown in Fig. 3.1, we want to advocate the concept of a semantic space that is formed by a set of rules and used to facilitate the understanding of video content and user queries at different levels. We specify a three-layer semantic space, where we want to use mid-level semantic keywords to represent high-level semantics, such as highlights and game flow, and mid-level semantic keywords are specified by relevant visual features extracted from the video data. It is our belief that the exploration of mid-level semantic keywords from low-level visual features is more reliable and feasible than directly inferring high-level semantics from low-level features. Our main reasoning is that mid-level semantic keywords have relatively stable and repeated patterns, and high-level semantics should be inferred from mid-level semantic keywords due to their inherent relationship. For example, in American football games, high-level semantics include highlights, e.g., a touchdown (TD), the extra point after a TD, a field goal, turnovers, etc., as well as any event of interest that could be customized by the user. Directly detecting them may be challenging due to their complex nature and diverse variability. On the other hand, some mid-level semantic keywords can be useful to represent and detect high-level semantics. Similar video mining paradigms can be found in recent literature [11, 55].
It is our attempt to provide a unified machine learning paradigm where semantic video analysis is formulated as two statistical inference problems, i.e., we want to infer mid-level semantic keywords from low-level visual features, so that we can further infer high-level semantics from mid-level semantic keywords.

3.2 Low-level Visual Features

As the first layer of the semantic space, relevant low-level visual features are needed to specify the mid-level semantic keywords, i.e., "camera views", "play types", and "possessions". Due to their different natures, we need two completely different sets of visual features. In Fig. 3.2(a), we illustrate the two major patterns of camera motion, i.e., panning and tilting. In a football game, different types of play can be roughly distinguished by the motion patterns of the camera, which help us identify how each play unfolds in terms of range and height. In Fig. 3.2(b), we illustrate typical scenes for the four different camera views, which demonstrate the visual distinction between them. When watching American football, a viewer can easily identify the camera view of each shot from the yard numbers along the sideline of the field. However, it is very hard for a computer vision algorithm to automatically detect and recognize the yard numbers from the video shot due to two facts: (1) the locations of yard numbers are arbitrary in a frame, or they may not even appear in many frames, and (2) the significant camera motion and object occlusion make number recognition very difficult. Therefore, we need more robust and accessible visual features to characterize the latent states of the proposed generative video model.

Figure 3.2: Visual features in football video: (a) play type analysis (panning and tilting); (b) camera view analysis (left, central, right, and end zone camera views, with landmarks such as the end zone, banner and logo, field, audience, and yard lines).
In this research, we identify the spatial color distribution and the angle of yard lines as relevant features. For example, in the center field, there is usually a logo of the host team in the field center that has a strong contrast with the green color of the playing field, and all yard lines are almost vertically oriented. In Chapter 4, we will first discuss how to extract salient visual features, including the color distribution and the angle of yard lines, and then we show the correlation between the camera views and the extracted features.

3.3 Mid-level Keywords

The main challenge of video mining is the semantic gap between understandable concepts and computable features. We believe that an explicit specification of the underlying semantic structures is effective to bridge the semantic gap for sports video mining. Therefore, we define a set of mid-level semantic keywords for sports video representation, camera views and play types, which can capture the rudimentary semantics in a game and each of which is specified by a set of semantic units (states) governed by certain dynamics.¹ For example, camera views include the central, left, right, and end-zone views, and play types cover the long/short plays, kick, and field goal, as shown in Fig. 3.3, where we also show the dynamics within each semantic structure. There could be several other mid-level semantic keywords, e.g., possession (offense/defense). These semantic keywords will be used as the building blocks for high-level semantic analysis. We expect three major advantages of explicit semantic modeling. First, we can fully take advantage of the available semantics and prior knowledge about the rules of a sports game, which makes the video mining problem well defined. Second, it provides both structure analysis and event analysis, which are complementary in nature for semantic video analysis.
Third, it not only supports highlight detection but also is able to deliver customized events of interest, increasing the usability and interactivity of video data.

3.4 High-level Semantics

When watching a sports game, audiences are interested in the evolution of the game. Based upon the game status and events of interest, we first select the most interesting event, touchdowns, as the highlights we intend to detect; then we cast the various processes of the game into a semantic model that is structured by the game flow, as shown in Fig. 3.4.

¹ We assume a video sequence is composed of a set of consecutive play shots with all breaks and commercials removed.

Figure 3.3: Two semantic keywords and their dynamics in American football video, i.e., camera views (top: left, central, right, and end-zone views) and play types (bottom: long plays, short plays, kicks, and field goals).

The game flow in a football game generates three types of drives: the scored drive, the non-scored drive, and the turnover drive. They are different but are correlated in a statistical manner. The game flow provides a balanced choice for exploring the semantic structure in sports video at different levels, i.e., drives are synthesized from generic semantic units like camera view, play type, and possession. The game flow provides a higher-level semantic representation that reveals and supports further analysis carried out by users with different interests.

3.5 Two Inference Problems

Similar video mining paradigms can be found in the recent literature [1, 37]. We want to provide a unified machine learning paradigm where semantic video analysis can be systematically formulated as two statistical inference problems.
The first one is from the low-level visual features to the mid-level semantic keywords, and the other is from the mid-level semantic keywords to the high-level semantics.

Figure 3.4: A partial game flow in an American football game that contains three drives: a scored drive, a non-scored drive, and a turnover drive; each drive includes a series of annotated plays.

• Inference problem 1: Due to the nature of mid-level semantic keywords, generative models, such as HMMs or their variants, are more suitable for the first inference problem, where some prior knowledge is available as well as useful. Specifically, the dynamic model directly reflects the transitions of recurrent temporal patterns (i.e., mid-level semantic keywords), and the observation model represents the different mid-level structures by distinct low-level features.

• Inference problem 2: High-level semantics of interest that are immediately useful for users are relatively less well-defined by low-level features, mainly due to their uncertain and ambiguous nature in the video data. On the other hand, they can be better specified by mid-level structures. However, a more flexible and general statistical association between states and observations is needed. Therefore, we invoke discriminative models, such as CRFs, for the second inference problem, which support versatile statistical modeling structures.

There are three major advantages in this framework. First, we can fully take advantage of the available semantics and prior knowledge about a sports game, which makes the video mining problem well structured and formulated. Second, it provides both structure analysis at the mid-level and event analysis at the high-level, which are complementary in nature. Third, it not only supports highlight detection but also can deliver customized events of interest and increase the usability and interactivity of video data.
CHAPTER 4

LOW-LEVEL VISUAL FEATURES

4.1 Background

We defined relevant visual features for the mid-level semantic keywords. We used the color distribution and yard line angles to characterize the camera view [56], and the camera motion for the play type [57] and possessions. By using these features, video mining is formulated as an inference problem where we want to infer the mid-level semantic structures from the visual features.

• Color distribution: The color distribution varies from view to view. For example, in the center view, there is usually a logo of the host team in the field center that has a strong contrast against the green color of the playing field, while in the left or right view, the area close to the end zone usually exhibits a dominant non-green color. In our work, we use the concept of dominant color [58] to estimate the spatial color distribution by dividing a frame image into multiple overlapped regions (three regions horizontally and vertically) and computing the ratios of dominant color between different regions.

• Yard line angle: The angle of the yard lines can also reflect the camera view. For example, in the central view, all yard lines show a vertical pattern, while they show certain orientations in the other camera views. To obtain the yard lines, we can simply use Canny edge detection followed by the Hough transform to detect the yard lines in the region of the playing field, and then we can compute their angles.

• Camera motion: There are two main types of camera motion in a football video, i.e., panning and tilting, which effectively characterize different play types. For example, a long play is usually associated with a strong panning effect, while a short play is normally reflected by a weak panning effect. We chose the optical flow based method [59] to compute the two kinds of camera motion between two adjacent frames. Also, the frame indices are included as a temporal feature.
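The yard line angle feature above relies on the Hough transform. The following is a minimal pure-NumPy sketch of the Hough voting step on an already-binarized edge map; the Canny stage and the actual implementation used in this work are omitted, and the function name and one-degree discretization are assumptions for illustration only.

```python
import numpy as np

def dominant_line_angle(edges, n_theta=180):
    """Minimal Hough-transform sketch: every edge pixel votes for the
    (rho, theta) pairs of lines passing through it; the theta (in degrees)
    of the strongest line is returned. `edges` is a binary edge map
    (Canny output in the full pipeline)."""
    ys, xs = np.nonzero(edges)
    thetas = np.deg2rad(np.arange(n_theta))   # 0..179 degrees
    diag = int(np.ceil(np.hypot(*edges.shape)))
    acc = np.zeros((2 * diag + 1, n_theta), dtype=int)
    for x, y in zip(xs, ys):
        # line parameterization: rho = x*cos(theta) + y*sin(theta)
        rhos = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int) + diag
        acc[rhos, np.arange(n_theta)] += 1
    _, best_theta = np.unravel_index(np.argmax(acc), acc.shape)
    return int(best_theta)   # 0 => vertical line (x = const), 90 => horizontal
```

In this parameterization, a vertical yard line (as in the central view) yields an angle of 0 degrees, while oblique lines in the other views yield nonzero angles.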
4.2 Color/Landmark Based Visual Features

In this work, we use the concept of dominant color to estimate the spatial color distribution by dividing a frame image into multiple regions and computing the ratios of dominant color between different regions. Specifically, we use the robust dominant color region detection algorithm [58] to extract the dominant color region, i.e., the playing field. Fig. 4.1 shows some results of dominant color extraction. To obtain the yard lines, we can simply use Canny edge detection and the Hough transform to detect the yard lines in the region of the playing field, as shown in Fig. 4.2.

Figure 4.1: Playing field detection results (left, center, right, top, and bottom regions).

Figure 4.2: Yard line detection results (detected angles).

Based on the detected playing field and yard lines, we extract a 6-D feature vector composed of the following features:

• Ratio of the dominant color region: R_f = W_f / W;

• Ratio difference of dominant color between the left/right and center parts: R_d = ((W_left − W_center) + (W_right − W_center)) / W;

• Ratio difference of dominant color between the left and right parts: R_lr = (W_left − W_right) / W;

• Ratio difference of dominant color between the top and bottom parts: R_tb = (W_top − W_bottom) / W;

• Average angle of all yard lines: D_ave = (Σ_{l=1}^{L} θ_l) / L;

• Angle difference between the first and last frames in a shot: D_shot = θ_last − θ_first;

where W is the total number of pixels in a frame, W_f is the total number of pixels in the dominant color region, and W_left, W_right, W_top, W_bottom, W_center are the numbers of dominant-color pixels in the left, right, top, bottom, and central parts, respectively. In addition, θ_l is the angle of the l-th detected line, L is the number of detected lines, and θ_last and θ_first are the average angles in the last and first frames of a shot, respectively.
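The four color-ratio features above can be sketched as follows from a binary dominant-color mask. Since the exact overlapping-region layout is not reproduced here, the non-overlapping vertical thirds and horizontal halves below are illustrative assumptions, as is the function name.

```python
import numpy as np

def color_ratio_features(mask):
    """Compute the four dominant-color ratio features R_f, R_d, R_lr, R_tb
    from a binary mask of the dominant (playing field) color. Region
    boundaries (vertical thirds, horizontal halves) are illustrative; the
    dissertation uses overlapping regions."""
    h, w = mask.shape
    W = mask.size                      # total pixels in the frame
    Wf = mask.sum()                    # dominant-color pixels
    left = mask[:, : w // 3].sum()
    center = mask[:, w // 3 : 2 * w // 3].sum()
    right = mask[:, 2 * w // 3 :].sum()
    top = mask[: h // 2, :].sum()
    bottom = mask[h // 2 :, :].sum()
    Rf = Wf / W
    Rd = ((left - center) + (right - center)) / W
    Rlr = (left - right) / W
    Rtb = (top - bottom) / W
    return Rf, Rd, Rlr, Rtb
```

For a frame that is entirely playing field, R_f is 1 and the three difference ratios vanish; a frame whose left side is field and right side is crowd yields a large positive R_lr, as expected for a left camera view.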
4.3 Motion Based Visual Features

Play types largely depend on the pattern of camera motion. There are mainly two types of camera motion, panning and tilting, as shown in Fig. 3.2(a). Most play types can be effectively characterized by these two kinds of camera motion. For example, in a long run from right to left, the camera motion could be a short right panning followed by a long left panning, which is different from the motion trend of a short run, i.e., a short right panning followed by a short left panning. To specify the camera motion, we choose the optical flow based method [59] to quantitatively compute the panning and tilting parameters between two adjacent frames. Also, the frame indices are included as an additional temporal feature to distinguish long plays from short plays. Therefore, we compose a 3-D camera motion feature vector for every two adjacent frames in each shot, which will be employed to identify the play type of each shot.

4.4 Feature Evaluation

To quantitatively evaluate the saliency of the visual features, we manually performed camera view based shot classification for one football video whose ground truth state sequence is denoted as S^g_{1:T}; then we employ the information gain [60, 27] to evaluate each feature's contribution to HMM-based classification. Given feature z ∈ {1, 2, ..., 6}, the model Γ_z is learned and the optimal state sequence S^z_{1:T} can be obtained. The information gain is computed as

Info(S^g_{1:T}, S^z_{1:T}) = H(P_{S^z}) − Σ_j P_{S^g}(j) · H(P_{S^z|S^g=j}),   (4.1)

where H is the entropy function, and

P_{S^g}(k) = #(S^g_t = k, t = 1, ..., T) / T,
P_{S^z}(k) = #(S^z_t = k, t = 1, ..., T) / T,
P_{S^z|S^g}(k, j) = #(S^z_t = k and S^g_t = j, t = 1, ..., T) / #(S^g_t = j, t = 1, ..., T),

where # is the counting operator. After ranking the six features according to Info(S^g_{1:T}, S^z_{1:T}) in Table 4.1, we select the top four features to construct the observation vector of the HMM, i.e., O_t, including D_ave, R_f, D_shot, and R_lr.
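Eq. (4.1) can be computed directly from the two label sequences. A minimal sketch follows; the function names are assumptions, and entropies are measured in bits.

```python
import numpy as np
from collections import Counter

def entropy(probs):
    """Shannon entropy (in bits) of a probability vector; zero terms skipped."""
    p = np.asarray([v for v in probs if v > 0], dtype=float)
    return float(-(p * np.log2(p)).sum())

def info_gain(s_g, s_z):
    """Information gain of Eq. (4.1): H(P_{S^z}) minus the entropy of S^z
    conditioned on each ground-truth label j, weighted by P_{S^g}(j)."""
    T = len(s_g)
    p_z = [c / T for c in Counter(s_z).values()]
    gain = entropy(p_z)
    for j, cnt in Counter(s_g).items():
        z_given_j = [z for g, z in zip(s_g, s_z) if g == j]
        p_zj = [c / len(z_given_j) for c in Counter(z_given_j).values()]
        gain -= (cnt / T) * entropy(p_zj)
    return gain
```

A decoded sequence that perfectly matches the ground truth attains the full entropy of the ground-truth labels, while a decoded sequence independent of the ground truth scores near zero, which is why higher-ranked features in Table 4.1 are more informative.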
As mentioned in the previous section, six candidate visual features are extracted to compose the model observations.

Table 4.1: Rank of information gain (I.G.)

Index    z = 1   z = 2   z = 3   z = 4   z = 5   z = 6
Feature  D_ave   R_f     D_shot  R_lr    R_d     R_tb
I.G.     0.62    0.41    0.37    0.35    0.16    0.11

In the HMM, the observations are shot based, but in the SHMM, we need to employ frame-based observations. Therefore, the selected features are slightly different in each model due to the variation of the observation measurement. Following the feature evaluation method mentioned in [61], we choose four salient visual features to construct the observation vector of the HMM, i.e., O_t, including D_ave, R_f, D_shot, and R_lr. Because the feature D_shot is designed for the shot-based approach, in the SHMM we employ the feature R_d, i.e., the ratio difference of dominant color between the left/right and center parts, instead of D_shot. In addition, we also choose the optical flow based method to quantitatively compute the panning and tilting parameters between two adjacent frames, and the frame indices are included as an additional temporal feature.

As shown in Fig. 4.3, we compare the distribution of D_ave, i.e., the average angle of all yard lines, in each camera view between shot-based and frame-based observations. We find that frame-wise features not only provide evidence about the statistical character of the observations but also offer richer and smoother information about the observations with respect to different semantic meanings, compared with shot-based features. Therefore, we can establish a more powerful mapping relationship, e.g., the SHMM, as the generative model between observations and latent states to discover the underlying semantic structure in the video data.
Figure 4.3: Distributions of shot-based (top) and frame-based (bottom) D_ave with respect to the four camera views.

CHAPTER 5

HMMS FOR KEYWORDS DETECTION

5.1 Background

HMMs are the most popular machine learning tool for video mining. A typical HMM, as shown in Fig. 5.1, is a statistical model where the underlying system is assumed to be a Markov process with unknown parameters, including a state transition matrix and a probabilistic observation model. The former captures the underlying state dynamics, and the latter characterizes observations pertaining to the hidden states, either as probability density functions (continuous observations) or probability mass functions (discrete observations). There are three canonical problems for HMMs [62]:

• Evaluation: given the model parameters, compute the probability of a particular observation sequence. This problem is solved by the forward-backward algorithm.

• Learning: given an observation sequence, discover the model parameters, which include the set of state transition and output probabilities. This problem is solved by the Baum-Welch algorithm, i.e., the Expectation Maximization (EM) algorithm.

• Decoding: given the model parameters, find the most likely sequence of hidden states that could have generated a given observation sequence. This problem is solved by the Viterbi algorithm.

In this work, we assume that a video sequence is composed of a set of pre-segmented consecutive shots, each of which was captured from one of four camera views and corresponds to one of four play types. In the following, we will first introduce the basic HMM-based approach [61], where we want to infer mid-level semantic keywords from low-level visual features. This algorithm will serve as the reference for our future improvements.
Specifically, we propose three new HMM-based approaches, namely the embedded HMM-GMM [57], the segmental HMM (SHMM) [56], and the multi-channel segmental HMM (MCSHMM) [63].

5.2 Traditional HMMs

We first discuss the implementation of a basic HMM, shown in Fig. 5.1, for semantic video analysis; it follows our problem formulation and provides a benchmark for the following studies. According to our problem formulation, we use the hidden state of the HMM to encode the mid-level semantic keyword directly at the shot level. Correspondingly, we define two HMMs: one for camera view analysis, where yard line angles and color distributions serve as the observation, and the other for play type analysis, where camera motion serves as the observation. It is worth noting that the hidden state is defined at the shot level, while the observations are at the frame level. Traditional HMMs cannot handle variable-length observations, and an easy treatment to obtain the shot-wise feature is to average all frame-wise observations in a shot.

5.2.1 Model Specifications

Given a video sequence of T shots, there are four latent states that correspond to the four camera views or play types, i.e., {S_t = k | t = 1,...,T; k = 1,2,3,4}. By the Markov assumption we have a state transition matrix, {a_{k,j} = p(S_t = j | S_{t-1} = k) | k, j = 1,2,3,4}, that governs the state dynamics over time, as shown in Fig. 5.1.

Figure 5.1: Hidden Markov Models (HMMs).

Observations {O_t | t = 1,...,T}, i.e., relevant visual features extracted from all shots, are assumed to be drawn from an emission function associated with the latent states. Commonly used emission functions include the Gaussian and the Gaussian mixture model (GMM).
Then we can define two kinds of observation models as

p(O_t | S_t = k) = N(O_t; mu_k, Sigma_k),   (5.1)

where mu_k and Sigma_k are the mean and covariance matrix of the Gaussian for latent state k, or

p(O_t | S_t = k) = Sum_{n=1}^{N} alpha_n N(O_t; mu_{nk}, Sigma_{nk}),   (5.2)

where {alpha_n, mu_{nk}, Sigma_{nk} | n = 1,...,N} parameterizes the N-order GMM for latent state k, and Sum_{n=1}^{N} alpha_n = 1. Obviously, the Gaussian emission function is a special case of the GMM emission with N = 1. Now the key issue is: can the aforementioned emission functions characterize the feature densities well?

5.2.2 Learning and Inference

Given a series of T shot observations {O_t | t = 1,...,T}, the HMM with GMM emission can be parameterized by Gamma = {S, pi_k, a_{k,j}, alpha_n, mu_{nk}, Sigma_{nk} | k = 1,...,4; n = 1,...,N}:
• S = k, {k = 1,...,4}: the hidden states represent the four camera views or play types;
• pi_k = p(S_1 = k): the initial state probabilities of the four camera views or four play types;
• A = {a_{k,j} = p(S_t = j | S_{t-1} = k) | t = 1,...,T; k, j = 1,...,4}: the state transition probabilities between states (camera views or play types) k and j;
• p(O_t | S_t = k) = Sum_{n=1}^{N} alpha_n N(O_t; mu_{nk}, Sigma_{nk}): the N-order GMM emission function of the hidden states, which reduces to the Gaussian emission when N = 1.
Then we use the Expectation-Maximization (EM) algorithm [64] to obtain the maximum likelihood parameter estimate of this HMM:

Gamma* = argmax_Gamma P(O_{1:T} | Gamma),   (5.3)

where P(O_{1:T} | Gamma) = Sum_{S_{1:T}} pi_1 Prod_{t=1}^{T} p(S_t | S_{t-1}) p(O_t | S_t = k). After EM training, we use the Viterbi decoding algorithm to estimate the optimal state sequence S*_{1:T}, which corresponds to the camera view or play type classification results for all video shots O_{1:T}. On the one hand, S*_{1:T} indicates the location of each play; on the other hand, the state transition between two adjacent shots may indicate certain plays, such as advancing the ball or a change of possession.
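The Viterbi decoding step above can be sketched in a few lines of NumPy. This is a generic log-domain Viterbi pass over shot-wise emission log-likelihoods; the two-state toy model below is hypothetical, used only to illustrate the backtracking.

```python
import numpy as np

def viterbi(pi, A, log_b):
    """Decoding problem: most likely state sequence S*_{1:T}.

    pi    : (K,) initial probabilities
    A     : (K, K) transition matrix
    log_b : (T, K) emission log-likelihoods
    Returns the MAP state sequence as a list of state indices.
    """
    T, K = log_b.shape
    delta = np.log(pi) + log_b[0]
    psi = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + np.log(A)    # scores[k, j]: from k to j
        psi[t] = scores.argmax(axis=0)         # best predecessor of each j
        delta = scores.max(axis=0) + log_b[t]
    # Backtrack from the best final state
    states = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        states.append(int(psi[t, states[-1]]))
    return states[::-1]

# Sticky two-state chain whose emissions favor state 0 then state 1
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1],
              [0.1, 0.9]])
log_b = np.log(np.array([[0.8, 0.2],
                         [0.8, 0.2],
                         [0.2, 0.8],
                         [0.2, 0.8]]))
path = viterbi(pi, A, log_b)
```

With the sticky transition matrix, the decoder switches state exactly once, even though a per-frame argmax would make the same choice here; the transition term matters when emissions are noisy.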
In practice, we use K-means clustering to initialize the emission functions of the HMM prior to EM training, and N = 3 was found to be a reasonable order for the GMM emission function. For the prior probability pi_k, we assume the football game always starts from the central camera view with a kick. We also initialize the state transition matrix A, an important parameter of the HMM, by computing the frequency of actual state transitions averaged over a few games.

5.2.3 Experimental Results

Based on the selected features, we tested our algorithm on nine 30-minute football videos. After K-means initialization, we use the EM algorithm for HMM training with two different emissions: Gaussian (HMM(1)) and GMM (HMM(2)). Finally, we apply Viterbi decoding to obtain the optimal state sequence. The experimental results are shown in Table 5.2. The performance of HMM(2) is much better than that of HMM(1), mainly because the GMM characterizes the densities of the visual features better than a single Gaussian. However, the major limitation of this approach is that each shot is represented by an averaged feature vector that is less representative and informative. This fact motivates us to improve the observation model of the HMM by exploring frame-wise features and the temporal dependency across frames in a shot. In Table 5.1, we pick two video sequences to show the classification results for each camera view. Most errors come from misclassifying the left or right camera view as the central camera view, mainly because the central camera usually covers the major part of the field, so the number of shots captured by the central camera is usually larger than that of the others.

Data      Left     Central   Right    End zone
Video A   72.5     73.4      61.54    100
Video B   65.79    78.95     70       100

Table 5.1: Single-shot classification results (%) on camera views.

Then we evaluate the experimental results on both camera view and play type, and Table
5.2 shows the result. The result shows that the GMM observation model outperforms the Gaussian model in terms of generalization capability.

Test Videos            Semantics      HMM(1) [61]   HMM(2) [61]
Video1 (156 shots)     Play Type      43.59%        60.90%
                       Camera View    55.77%        72.44%
Video2 (163 shots)     Play Type      49.07%        61.34%
                       Camera View    61.35%        67.48%
Video3 (167 shots)     Play Type      44.91%        49.10%
                       Camera View    53.29%        58.68%
Video4 (168 shots)     Play Type      58.33%        64.67%
                       Camera View    64.88%        70.24%
Video5 (168 shots)     Play Type      50.59%        61.90%
                       Camera View    56.55%        65.48%
Video6 (170 shots)     Play Type      54.11%        60.59%
                       Camera View    52.35%        64.12%
Video7 (170 shots)     Play Type      64.70%        71.17%
                       Camera View    68.24%        72.35%
Video8 (171 shots)     Play Type      56.14%        65.50%
                       Camera View    62.57%        75.44%
Video9 (173 shots)     Play Type      63.01%        67.05%
                       Camera View    66.07%        68.21%
Average Accuracy                      57.44%        62.62%

Table 5.2: Shot-based classification results of HMM-based algorithms for nine 30-min NFL videos.

Figure 5.2: A new generative model.

5.3 Embedded HMM-GMMs

As mentioned before, the latent states, i.e., the camera views and the play types, are defined at the shot level, while the observations are extracted from all frames in a shot. Averaging the frame-wise features in a shot to obtain a shot-wise feature inevitably reduces the discriminability of the observations in the HMM. Also, shots have different lengths, i.e., the number of frames varies from shot to shot. To take advantage of the rich statistics of frame-wise visual features and to handle variable-length observations, we have proposed a new HMM that is embedded with a two-layer observation model, as shown in Fig. 5.2 [57].

5.3.1 Model Specifications

We now develop a generative model for camera-view and play-type based video modeling.
However, the latent state, i.e., the camera view and play type, is shot-based, whereas the observations, i.e., the color distribution, landmarks, and camera motion across two adjacent frames, are frame-based, which makes it hard to use traditional HMMs, where each latent state is associated with only one observation. This constraint discards the intra-shot statistical properties among frames, which are very important for camera view and play type analysis. Inspired by the recursive estimation algorithms in [65], we assume there is a virtual observation C between the shot-based latent states and the frame-based observations. As shown in Fig. 5.2, when generating a video sequence at shot t, a virtual observation, i.e., a shot-based GMM of camera view or play type k, is first drawn; then a set of frame-based observations {o_{t,l} | l = 1,...,n_t} is generated from the GMM. Therefore, we can define the emission function of the virtual observation as:

p(o_{t,l} | C_t, S_t = k) = Sum_{m=1}^{M} alpha_{km} N(o_{t,l}; mu_{km}, Sigma_{km}),   (5.4)

where Pi_k = {alpha_{km}, mu_{km}, Sigma_{km}} is the parameter set of the M-order GMM and Sum_{m=1}^{M} alpha_{km} = 1. Because the virtual observations are introduced artificially, the density p(O | S_t = k) is not directly accessible; instead, we approximate it by the average frame-wise log-likelihood:

log p(O_{t,1:n_t} | S_t = k) = (1/n_t) Sum_{l=1}^{n_t} log p(o_{t,l} | C_t, S_t = k).   (5.5)

5.3.2 Learning and Inference

As shown above, we embed the GMM into the HMM to accommodate both the shot-based latent states (i.e., camera views and play types) and the frame-based observations (i.e., color distribution, landmarks, and camera motion), as well as the intra-shot properties of the video data. The new embedded HMM-GMM model is able to characterize intra-shot motion features and inter-shot play transitions simultaneously. Given a series of observations O, the likelihood of the proposed model is:

P(O | Gamma) = Sum_S pi_k Prod_{t=1}^{T} p(S_t | S_{t-1}) p(O_t | S_t = k),   (5.6)

where Gamma = {pi_k, a_{k,j}, Pi_k | k, j = 1,...,4}.
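A minimal sketch of the shot-level score in (5.5): each frame is evaluated under an M-order GMM as in (5.4), and the frame log-likelihoods are averaged so that shots of different lengths n_t remain comparable. SciPy's multivariate normal is used for the component densities; all parameter values below are placeholders.

```python
import numpy as np
from scipy.stats import multivariate_normal

def shot_log_likelihood(frames, weights, means, covs):
    """Per-shot score of Eq. (5.5): the average frame-wise GMM
    log-likelihood, which normalizes away the variable shot length n_t.

    frames  : (n_t, d) frame-wise observation vectors o_{t,l}
    weights : (M,) mixture weights alpha_{km}, summing to 1
    means   : (M, d) component means mu_{km}
    covs    : (M, d, d) component covariances Sigma_{km}
    """
    # log p(o | C, S = k) for every frame, via log-sum-exp over components
    comp = np.stack([np.log(w) + multivariate_normal.logpdf(frames, m, c)
                     for w, m, c in zip(weights, means, covs)])
    frame_ll = np.logaddexp.reduce(comp, axis=0)   # shape (n_t,)
    return frame_ll.mean()
```

A shot is then assigned whichever state's GMM yields the highest average score, and the same quantity plugs into the HMM emission term during decoding.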
Then, we use the Expectation-Maximization (EM) algorithm to estimate the parameters as follows. Specifically, we define the posterior probability of each shot and the joint posterior of every two adjacent shots as

gamma_k(t) = p(S_t = k | O, Gamma),   (5.7)

xi_{k,j}(t) = p(S_t = k, S_{t+1} = j | O, Gamma).   (5.8)

Due to the embedded two-layer structure, estimating Gamma requires updating Pi_k, which involves multiple GMMs to accommodate the intra-shot statistics of the video data. Instead of involving all observations in the forward-backward algorithm, we use a Monte Carlo-like sampling method, as shown in Algorithm 1, to construct a training data pool for Pi_k, i.e., X_{1:k}, by sampling from O according to (5.7). The training data pools for all play types are represented by

X_{1:k} = {x^k_{t,w} | k = 1,...,4; t = 1,...,T; w = 1,...,W},   (5.9)

where W is the number of samples drawn in each shot. After sampling, we follow Algorithm 2 to update the GMM for each play. In this EM iteration, the responsibility of each component is

p(m | x^k_{t,w}, Pi_k) = alpha_{km} p_m(x^k_{t,w} | Pi_k) / Sum_m alpha_{km} p_m(x^k_{t,w} | Pi_k),   (5.10)

and the GMM Pi_k is updated as follows:

alpha_{km} = (1/|X_k|) Sum_t Sum_w p(m | x^k_{t,w}, Pi_k),   (5.11)

mu_{km} = Sum_t Sum_w x^k_{t,w} p(m | x^k_{t,w}, Pi_k) / Sum_t Sum_w p(m | x^k_{t,w}, Pi_k),   (5.12)

Sigma_{km} = Sum_t Sum_w p(m | x^k_{t,w}, Pi_k)(x^k_{t,w} - mu_{km})(x^k_{t,w} - mu_{km})' / Sum_t Sum_w p(m | x^k_{t,w}, Pi_k).   (5.13)

Finally, Gamma is obtained by maximizing the expected likelihood with the virtual observations considered, as

pi_k = gamma_k(1)  and  a_{k,j} = Sum_t xi_{k,j}(t) / Sum_t gamma_k(t).   (5.14)

The complete training algorithm is shown in Algorithm 3. After training, we use the Viterbi decoding algorithm to estimate the play type sequence S* in the football video.
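One iteration of the GMM update in Algorithm 2 can be written compactly by mirroring (5.10)-(5.13) on a pooled sample set. This is a NumPy/SciPy sketch with hypothetical shapes, not the exact training code; responsibilities are computed in the log domain for stability.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em_step(X, weights, means, covs):
    """One EM iteration for an M-component GMM on pooled samples X.

    X       : (N, d) pooled frame-wise samples x^k_{t,w}
    weights : (M,) mixture weights
    means   : (M, d) component means
    covs    : (M, d, d) component covariances
    """
    N = len(X)
    # E-step, Eq. (5.10): responsibilities p(m | x)
    log_r = np.stack([np.log(w) + multivariate_normal.logpdf(X, mu, c)
                      for w, mu, c in zip(weights, means, covs)], axis=1)
    log_r -= np.logaddexp.reduce(log_r, axis=1, keepdims=True)
    r = np.exp(log_r)                       # (N, M)
    nk = r.sum(axis=0)                      # effective component counts
    # M-step, Eqs. (5.11)-(5.13)
    new_w = nk / N
    new_mu = (r.T @ X) / nk[:, None]
    new_cov = np.stack([
        ((X - new_mu[m]).T * r[:, m]) @ (X - new_mu[m]) / nk[m]
        for m in range(len(weights))])
    return new_w, new_mu, new_cov
```

Iterating this step to convergence for each state k, on the pool produced by Algorithm 1, is exactly the inner loop of Algorithm 2.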
Algorithm 1: Constructing the new training data
Input : O, T, K, basic sample number per shot B
Output: New training data pools {X_{1:k}}
for k ← 1 to K do
    for t ← 1 to T do
        Find the sampling number W for each shot:
        if n_t > B then
            if n_t * gamma_k(t) > B then W = B; else W = B * gamma_k(t);
        else
            if n_t > gamma_k(t) * B then W = B * gamma_k(t); else W = B;
        Uniformly sample W vectors in the shot;
    end
    Construct the training data pool X_k;
end

Algorithm 2: EM for the GMM
Input : Iteration number I, parameter set Pi_0, training data X_{1:K} ⇐ Algorithm 1
Output: New parameter set {Pi_hat_k | k = 1, 2, 3, 4}
for k ← 1 to K do
    for i ← 1 to I do
        E-step: compute p(m | x^k_{t,w}, Pi_k) ⇐ (5.10);
        M-step: compute alpha_hat_{km} ⇐ (5.11); mu_hat_{km} ⇐ (5.12); Sigma_hat_{km} ⇐ (5.13);
    end
end

Algorithm 3: EM for the embedded HMM-GMM
Input : Iteration number I, parameter set Gamma_0
Output: New parameter set Gamma_hat
for i ← 1 to I do
    E-step: compute gamma_k(t) ⇐ (5.7); xi_{k,j}(t) ⇐ (5.8);
    M-step: compute pi_hat_k, a_hat_{k,j} ⇐ (5.14); alpha_hat_{km}, mu_hat_{km}, Sigma_hat_{km} ⇐ Algorithm 2;
end

5.3.3 Experimental Results

We tested the proposed generative model on nine 30-minute football videos. First, we implement the GMM-based algorithm, where each of the four play types or camera views is characterized by a 3-order GMM using frame-based observations (camera motions and frame indices) in a shot. Then we examine the embedded HMM-GMM (HMM(3)) algorithm, which is initialized by the previously trained GMM for each play. The Viterbi decoding algorithm is used to find the optimal state sequence, i.e., the play type for each shot. The comparison between the four methods is shown in Table 5.3, where we can see that the embedded HMM-GMM greatly improves the mid-level structure analysis results compared with the GMM and the HMMs with Gaussian or Gaussian-mixture emission.
This is mainly because the proposed embedded HMM-GMM characterizes both intra-shot motion features and inter-shot play transitions via the GMM embedded into the HMM, enabling more accurate play analysis. There are two straightforward ways that may further improve the classification performance under the same framework: one is to add more salient features to the observations of the latent states; the other is to involve adaptive GMM order estimation in Algorithm 2. However, there are still two limitations of the embedded HMM-GMM. One is that it assumes the correlated frame-wise observations in one shot are independent; the other is that the learning of the first-layer GMMs does not fully use all available frame-wise features, due to the sampling method that is used to avoid the training bias introduced by variable-length shots. Still, the promising results from this method motivate us to further improve the observation model of the HMM with better generality and capability.

5.4 Segmental HMMs (SHMMs)

Although the embedded HMM-GMM provides a more effective observation model than the traditional HMM, it still assumes that all frame-wise features in a shot are independent. This assumption ignores the temporal dependency across frames in a shot. Also, it uses only part of the observations for training. Therefore, we invoke a segmental HMM (SHMM) to attack this problem; it handles variable-length observations in a more analytical way [56]. The SHMM was first introduced for speech recognition, where a segmental observation model characterizes variations of the speech signal, as shown in Fig. 5.3 [28]. Instead of each hidden state generating one observation, as in the traditional HMM, each hidden state of the SHMM can emit a sequence of observations, called a segment. In the SHMM, all observations in a given segment are assumed to be independent of the observations belonging to other segments.
In addition, within each segment all observations are conditionally independent given the mean of that segment. Thus a closed-form likelihood function is attainable that can capture the rich statistics of frame-wise features in a shot and accommodate variable-length shots.

Test Videos            Semantics      Supervised GMM   HMM(1) [61]   HMM(2) [61]   HMM(3) [57]
Video1 (156 shots)     Play Type      26.92%           43.59%        60.90%        77.56%
                       Camera View    35.90%           55.77%        72.44%        76.28%
Video2 (163 shots)     Play Type      30.67%           49.07%        61.34%        70.56%
                       Camera View    41.10%           61.35%        67.48%        79.14%
Video3 (167 shots)     Play Type      23.35%           44.91%        49.10%        58.08%
                       Camera View    37.12%           53.29%        58.68%        61.08%
Video4 (168 shots)     Play Type      36.90%           58.33%        64.67%        70.66%
                       Camera View    47.61%           64.88%        70.24%        73.21%
Video5 (168 shots)     Play Type      29.17%           50.59%        61.90%        72.02%
                       Camera View    40.47%           56.55%        65.48%        73.81%
Video6 (170 shots)     Play Type      38.25%           54.11%        60.59%        70.59%
                       Camera View    41.17%           52.35%        64.12%        70.59%
Video7 (170 shots)     Play Type      37.05%           64.70%        71.17%        70.59%
                       Camera View    52.35%           68.24%        72.35%        75.29%
Video8 (171 shots)     Play Type      40.93%           56.14%        65.50%        72.51%
                       Camera View    50.88%           62.57%        75.44%        77.78%
Video9 (173 shots)     Play Type      39.88%           63.01%        67.05%        69.36%
                       Camera View    43.93%           66.07%        68.21%        71.68%
Average Accuracy                      43.59%           57.44%        62.62%        71.85%

Table 5.3: Shot-based classification results of four algorithms for nine 30-min NFL videos.

Figure 5.3: The segmental hidden Markov model (SHMM).

5.4.1 Model Specifications

In the traditional HMM-based method, due to the mismatch between shot-level states and frame-level features, we have to use the average feature in a shot as the observation corresponding to the shot-based hidden state, which ignores both the statistical dependency and the variety among frames. Therefore we employ the SHMM instead. In video analysis, we can regard the segment as the shot and the observations in a segment as the frames of that shot.
Therefore, we need to consider two kinds of independence: first, segments are independent given the states; second, all observations in a segment are independent given the mean of the segment. Under these assumptions, we can establish a statistical relationship between multiple observations and a single semantic structure. An SHMM of M states can be characterized by Theta = {pi_m, a_{m,n}, mu_{mu,m}, Sigma_{mu,m}, Sigma_m | m, n = 1,...,M}, where pi_m is the initial probability and a_{m,n} the transition probability; mu_{mu,m} and Sigma_{mu,m} characterize the Gaussian distribution of the segment mean for state m, and Sigma_m the within-segment observation variance. Given a shot at time t with n_t observations, i.e., O_t = {O_{t,k} | k = 1,...,n_t}, the conditional likelihood of O_t is defined as:

p(O_t | S_t = m, Theta) = Int p(mu | S_t = m, Theta) p(O_{t,1:n_t} | mu, S_t = m, Theta) d mu.   (5.15)

Under the conditional independence assumption, we have:

p(O_t | S_t = m, Theta) = Int p(mu | S_t = m, Theta) Prod_{k=1}^{n_t} p(O_{t,k} | mu, S_t = m, Theta) d mu,   (5.16)

where p(O_{t,k} | mu, S_t = m, Theta) = N(O_{t,k}; mu, Sigma_m) and p(mu | S_t = m, Theta) = N(mu; mu_{mu,m}, Sigma_{mu,m}). Compared with traditional HMMs, the new term is the Gaussian prior of the mean mu, originally introduced to capture the characteristics of a segment of the speech signal, i.e., at the phone or model level, which varies across speakers and stress conditions [28]. If we set Sigma_{mu,m} = 0, (5.16) reduces to:

p(O_t | S_t = m, Theta) = Prod_{k=1}^{n_t} p(O_{t,k} | S_t = m, Theta),   (5.17)

where p(O_{t,k} | S_t = m, Theta) = N(O_{t,k}; mu_{mu,m}, Sigma_m), which is equivalent to the Gaussian emission. This is then similar to our previously proposed embedded HMM-GMM [57], which involves a two-layer observation model. In practice, the Gaussian prior of mu was found useful for capturing the temporal dependency of all frame-wise observations in a shot, as shown by the significant improvement of the SHMM over the traditional HMMs on play-based and view-based shot classification [56].
5.4.2 Learning and Inference

This density function describes the probability distribution of a semantic unit that generates multiple observations, i.e., the visual feature distribution of a specific camera view within a shot of length n_t. Following the description in [28], let

nu_t = Sum_{k=1}^{n_t} O_{t,k},   (5.18)

(Sigma_{t,m})^{-1} = (Sigma_{mu,m})^{-1} + n_t (Sigma_m)^{-1},   (5.19)

mu_{t,m} = (Sigma_{t,m} ((Sigma_{mu,m})^{-1} (mu_{mu,m})' + (Sigma_m)^{-1} (nu_t)'))'.   (5.20)

Then, in the log domain, we obtain

log p(O_{t,1:n_t} | S_t = m, Theta) = log(R_m^{n_t} R_{mu,m} / R_{t,m}) - (1/2) mu_{mu,m} Sigma_{mu,m}^{-1} (mu_{mu,m})' - (1/2) Sum_{k=1}^{n_t} O_{t,k} Sigma_m^{-1} (O_{t,k})' + (1/2) mu_{t,m} Sigma_{t,m}^{-1} (mu_{t,m})',   (5.21)

where

R_m = 1 / ((2 pi)^{d/2} |Sigma_m|^{1/2}),   (5.22)

R_{mu,m} = 1 / ((2 pi)^{d/2} |Sigma_{mu,m}|^{1/2}),   (5.23)

R_{t,m} = 1 / ((2 pi)^{d/2} |Sigma_{t,m}|^{1/2}),   (5.24)

with d the observation dimension. Then we can employ the forward-backward algorithm to calculate the total probability summed over all possible paths. As the size of each segment, i.e., the length of each shot, is known, we can represent the forward and backward processes as (5.25)-(5.28), rather than considering all possible segment lengths as in [28]:

alpha_{m,t} = p(O_1,...,O_t, S_t = m | Theta),   (5.25)

beta_{m,t} = p(O_{t+1},...,O_T | S_t = m, Theta),   (5.26)

gamma_{m,t} = alpha_{m,t} beta_{m,t} / Sum_{m=1}^{M} alpha_{m,t} beta_{m,t},   (5.27)

xi_{m,n}(t) = alpha_{m,t} a_{m,n} p(O_{t+1} | S_{t+1} = n, Theta) beta_{n,t+1} / p(O_{1:T} | Theta).   (5.28)

Then we can complete the parameter re-estimation as follows:

mu_hat_{mu,m} = Sum_{t=1}^{T} gamma_{m,t} (nu_t / n_t) / Sum_{t=1}^{T} gamma_{m,t},   (5.29)

Sigma_hat_{mu,m} = Sum_{t=1}^{T} gamma_{m,t} (n_t mu_hat_{mu,m} - nu_t)^2 / n_t^2 / Sum_{t=1}^{T} gamma_{m,t},   (5.30)

Sigma_hat_m = Sum_{t=1}^{T} gamma_{m,t} (Sum_{k=1}^{n_t} (O_{t,k})^2 - (nu_t)^2 / n_t) / Sum_{t=1}^{T} gamma_{m,t} (n_t - 1),   (5.31)

a_hat_{m,n} = Sum_t xi_{m,n}(t) / Sum_t gamma_{m,t},   (5.32)

pi_hat_m = gamma_{m,1}.   (5.33)

Compared with HMMs, we only add one additional covariance parameter per state in the SHMM, which does not increase the computational load of training too much. After obtaining the estimated parameter set, we can use the Viterbi algorithm to estimate the optimal state sequence S*_{1:T}, as stated in the previous section.
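For the scalar-observation case, the integral in (5.16) also has a simple equivalent closed form: marginalizing the segment mean mu ~ N(mu_m, var_mu) leaves the frames jointly Gaussian with a rank-one covariance term. The sketch below uses this algebraic identity rather than the R-term expansion of (5.21); it is illustrative only, with placeholder values.

```python
import numpy as np
from scipy.stats import multivariate_normal

def segment_log_likelihood(obs, mu_m, var_mu, var_o):
    """Closed form for the SHMM segment likelihood of Eq. (5.16),
    scalar-observation case.  With mu ~ N(mu_m, var_mu) and
    o_k | mu ~ N(mu, var_o), integrating mu out gives the joint Gaussian
        O_t ~ N(mu_m * 1, var_o * I + var_mu * 1 1^T).
    """
    n = len(obs)
    cov = var_o * np.eye(n) + var_mu * np.ones((n, n))
    return multivariate_normal.logpdf(obs, mu_m * np.ones(n), cov)
```

Setting var_mu = 0 removes the shared-mean coupling and recovers the independent Gaussian emission of (5.17), which is the sense in which the segmental model strictly generalizes the traditional HMM.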
Algorithm 1: EM for the SHMM
Input : Iteration number W, parameter set Theta_0, video data O = {O_{t,k} | t = 1,...,T; k = 1,...,n_t}
Output: New parameter set {Theta_hat_m | m = 1, 2, 3, 4}
for w ← 1 to W do
    for m ← 1 to M do
        E-step: p(O_{t,1:n_t} | S_t = m, Theta) ⇐ (5.21); alpha_{m,t}, beta_{m,t}, gamma_{m,t}, xi_{m,n}(t) ⇐ (5.25)-(5.28);
        M-step: mu_hat_{mu,m}, Sigma_hat_{mu,m}, Sigma_hat_m, a_hat_{m,n}, pi_hat_m ⇐ (5.29)-(5.33);
    end
end

5.4.3 Experimental Results

Classification Results

Our experiments again use the nine 30-minute football videos. Five different approaches are tested on the video data: the GMM, the HMM with Gaussian emission, the HMM with GMM emission, the embedded HMM-GMM, and the SHMM. Table 5.4 shows the shot classification results, i.e., the overall classification accuracy of the different approaches. For the SHMM-based method, we initialize the model parameters Theta with the training result of the HMM with Gaussian emission, because the HMM provides a coarse estimate of the statistical properties of each shot, which serves as a reasonable initialization of the SHMM. We initialize Sigma_m by multiplying Sigma_{mu,m} with a small scalar, because, compared with Sigma_{mu,m}, Sigma_m specifies only a small variation range of the observations. Then we perform the proposed EM algorithm to obtain the optimized model parameters. After training, we employ Viterbi decoding again to obtain the state sequence, i.e., the camera view sequence.

5.5 Discussion

From the experimental results shown in Table 5.4, we can see the performance of each proposed algorithm. The boost in performance mainly comes from the capability to capture the diversity of the statistical properties within each shot; e.g., the SHMM not only regards the mean of each shot as a continuous random variable but also represents the variance by another statistical model, which better captures the statistics of each shot. Moreover, the quantity of observations is another important factor.
Rich observations make it more likely that we find enough realizations of each shot sharing the same statistical property.

Test Videos            Semantics      Supervised GMM   HMM(1) [61]   HMM(2) [61]   HMM(3) [57]   SHMM [56]
Video1 (156 shots)     Play Type      26.92%           43.59%        60.90%        77.56%        80.13%
                       Camera View    35.90%           55.77%        72.44%        76.28%        79.49%
Video2 (163 shots)     Play Type      30.67%           49.07%        61.34%        70.56%        77.91%
                       Camera View    41.10%           61.35%        67.48%        79.14%        84.05%
Video3 (167 shots)     Play Type      23.35%           44.91%        49.10%        58.08%        68.86%
                       Camera View    37.12%           53.29%        58.68%        61.08%        74.85%
Video4 (168 shots)     Play Type      36.90%           58.33%        64.67%        70.66%        77.98%
                       Camera View    47.61%           64.88%        70.24%        73.21%        79.17%
Video5 (168 shots)     Play Type      29.17%           50.59%        61.90%        72.02%        74.40%
                       Camera View    40.47%           56.55%        65.48%        73.81%        73.21%
Video6 (170 shots)     Play Type      38.25%           54.11%        60.59%        70.59%        80.59%
                       Camera View    41.17%           52.35%        64.12%        70.59%        80.00%
Video7 (170 shots)     Play Type      37.05%           64.70%        71.17%        70.59%        75.88%
                       Camera View    52.35%           68.24%        72.35%        75.29%        81.18%
Video8 (171 shots)     Play Type      40.93%           56.14%        65.50%        72.51%        78.95%
                       Camera View    50.88%           62.57%        75.44%        77.78%        80.11%
Video9 (173 shots)     Play Type      39.88%           63.01%        67.05%        69.36%        75.14%
                       Camera View    43.93%           66.07%        68.21%        71.68%        77.46%
Average Accuracy                      43.59%           57.44%        62.62%        71.85%        77.42%

Table 5.4: Shot-based classification results of five algorithms for nine 30-min NFL videos.

CHAPTER 6
MULTICHANNEL SHMM FOR KEYWORDS DETECTION

6.1 Background

This work is based on our previous research [61, 57, 56] and motivated by recent advances in HMM theory and applications. We propose a new HMM, the Multi-Channel Segmental Hidden Markov Model (MCSHMM), which incorporates the advantages of recently extended HMMs such as the CHMM, the HHMM, and the SHMM. Specifically, the proposed MCSHMM features a hybrid hierarchical-parallel structure for dynamic modeling among hidden states, and it is equipped with the SHMM for observation modeling.
The objective of this new model is to enhance semantic video analysis by characterizing the rich statistics of frame-wise features for shot-wise state inference and by capturing the interaction between multiple semantic structures. Moreover, as modeling complexity goes up, model training becomes more challenging and even problematic. Therefore we integrate structure learning and parameter learning in one framework, using the entropic prior and parameter extinction proposed in [32], which optimize the model structure and model parameters simultaneously.

All HMM-based approaches discussed so far can detect only one semantic structure at a time. Usually, a sports video contains multiple mid-level semantic keywords in parallel, such as play types and camera views. More interestingly, there are interactions among them. Specifically, we observed that camera views and play types are closely related during the game. For example, after a shot of the central view accompanied by a short play, the camera view in the next shot is likely to remain the same; after a long play, the camera is likely to switch to another view, either the right or the left one.

Figure 6.1: In the MCSHMM, the first layer includes two SHMMs, and the second layer captures the interaction between the two channels.

In this work, we are interested in exploring the interaction between multiple semantic structures with the aim of improving the overall mining performance. It is tempting to invoke the coupled HMM (CHMM) of [22], which supports multiple Markov channels simultaneously. However, the CHMM usually assumes a relatively strong interaction between the two Markov chains, which may overweight or even corrupt the Markovian property within each chain when the assumption does not hold, as observed in our practice.
Therefore, we advance a new multi-channel SHMM (MCSHMM) that involves two parallel SHMMs and a two-layer hierarchical Markovian structure, as shown in Fig. 6.1. In the MCSHMM, on the one hand, the two SHMMs are equipped with the powerful segmental observation model to mine the individual semantic structures; on the other hand, the MCSHMM explores multiple semantic structures with a hybrid parallel-hierarchical dynamic model.

6.2 Model Specifications

The unique feature of the MCSHMM is the integration of both parallel and hierarchical structures in the dynamic model and the incorporation of a segmental observation model, which give the MCSHMM more flexibility, capacity, and functionality than the CHMM, SHMM, and HHMM. From the viewpoint of generative models, both the dynamic model (among hidden states) and the observation model of the MCSHMM have a two-layer structure, which greatly enhances its capability of learning and inference. Specifically, at the first layer of the dynamic model, S = {S^(j)_t | t = 1,...,T; j = 1,2} denotes the state sequences of the two channels, where S^(j)_t is the state of shot t in channel j; at the second layer, F = {F_t = (S^(1)_t, S^(2)_t) | t = 1,...,T} represents the second-layer state sequence, where each state consists of the two current first-layer states. At the observation layer, O^(j)_t = {O^(j)_{t,1:n_t} | t = 1,...,T; j = 1,2} denotes the observations of shot t with n_t frames in channel j. In this research, all shots are pre-segmented, and n_t is known.
Therefore, the MCSHMM's parameter set Theta = {Pi, A, Omega} includes the following components:
• Initial probabilities:

Pi = {P(S^(1)_1), P(S^(2)_1), P(F_1 | S^(1)_1, S^(2)_1)};   (6.1)

• Transition probabilities:

A = {A_w | w = 1, 2, 3},   (6.2)

where

A_1 = {P(S^(1)_t = m | S^(1)_{t-1} = n, F_{t-1} = l)},   (6.3)

A_2 = {P(S^(2)_t = m | S^(2)_{t-1} = n, F_{t-1} = l)},   (6.4)

A_3 = {P(F_t = l | S^(1)_t = m, S^(2)_t = n)};   (6.5)

• Observation density functions:

p(O^(j)_t | S^(j)_t = m, Omega) = Int N(mu; mu^(j)_{mu,m}, Sigma^(j)_{mu,m}) Prod_{k=1}^{n_t} N(O^(j)_{t,k}; mu, Sigma^(j)_m) d mu,   (6.6)

where Omega = {mu^(j)_{mu,m}, Sigma^(j)_{mu,m}, Sigma^(j)_m | j = 1, 2; m = 1,...,N^(j)} specifies the two segmental models, N^(j) is the number of states in channel j (in this paper, N^(1) = N^(2) = 4), and O^(j)_{t,k} denotes the observation of frame k in shot t and channel j. The Gaussian prior of the mean mu is used to capture the temporal dependency among observations in a shot by assuming conditional independence given mu. In addition, the closed-form likelihood function represents the variable-length observations of different shots well. However, the integration in (6.6) has to be approximated during EM learning [28]. Given the dual-channel observations of T shots, O = {O^(j)_t | t = 1,...,T; j = 1,2}, the joint likelihood is defined as:

p(S, F, O | Theta) = P(S^(1)_1) P(S^(2)_1) P(F_1 | S^(1)_1, S^(2)_1) Prod_{t=1}^{T} P(F_t | S^(1)_t, S^(2)_t) Prod_{t=2}^{T} Prod_{j=1}^{2} P(S^(j)_t | S^(j)_{t-1}, F_{t-1}) Prod_{t=1}^{T} Prod_{j=1}^{2} p(O^(j)_t | S^(j)_t).

The Expectation-Maximization (EM) algorithm is often used for learning a DBN/HMM such as the MCSHMM; it finds the optimal parameters by maximizing the joint likelihood above through iterative Expectation (the E-step) and Maximization (the M-step). However, direct maximum likelihood (ML) learning of the MCSHMM can be problematic due to the complex state space specified by S and F, which leads to a large set of model parameters, such as A in (6.2). In addition, we expect that the useful state space is much smaller than the one covering all possible cases.
Therefore, we hope to find an effective learning algorithm that can optimize the parameter set as well as the model structure (the state space and state dynamics) simultaneously.

6.3 Learning and Inference

As mentioned before, there are two aspects of model learning: structure learning and parameter learning. The former aims at finding a compact and effective model structure by simplifying the state space and reducing the parameter set, while the latter optimizes the model parameters given a predefined model structure. Going further, we propose a new unsupervised learning algorithm in which the two learning processes are unified into one framework, so that the model structure and the parameters can be optimized simultaneously [27, 32]. In this work, we predefine a coarse model structure that includes every possible configuration of the semantic structures, and then use the ideas of the entropic prior and parameter extinction proposed in [32], which result in a maximum a posteriori (MAP) estimator. According to (6.6), for each segment in each channel we can obtain the likelihood via the log-likelihood of (5.21), with the auxiliary quantities defined as follows:

nu^(j)_t = Sum_{k=1}^{n_t} O^(j)_{t,k},   (6.7)

(Sigma^(j)_{t,m})^{-1} = (Sigma^(j)_{mu,m})^{-1} + n_t (Sigma^(j)_m)^{-1},   (6.8)

mu^(j)_{t,m} = (Sigma^(j)_{t,m} ((Sigma^(j)_{mu,m})^{-1} (mu^(j)_{mu,m})' + (Sigma^(j)_m)^{-1} (nu^(j)_t)'))',   (6.9)

R^(j)_m = 1 / ((2 pi)^{d/2} |Sigma^(j)_m|^{1/2}),   (6.10)

R^(j)_{mu,m} = 1 / ((2 pi)^{d/2} |Sigma^(j)_{mu,m}|^{1/2}),   (6.11)

R^(j)_{t,m} = 1 / ((2 pi)^{d/2} |Sigma^(j)_{t,m}|^{1/2}).   (6.12)

The entropic prior biases learning toward a compact model with good determinism. In our work, this MAP-based estimator is incorporated into the M-step of the EM algorithm, which encourages a maximally structured and minimally ambiguous model. This is accomplished by trimming the weakly supported parameters and states, leading to a compact model with good determinism.
In our work, the MAP estimator focuses on A defined in (6.2), which has many possible state transitions due to the large number of possible states of F (N^(1) x N^(2) = 16 here). In other words, we want to simplify A by keeping only the important state transitions, which effectively reduces the number of useful states in F and balances the incoming and outgoing transitions between the two layers. Consequently, the MAP-based EM estimator finds the optimal parameters by

Theta* = arg max_Theta P_e(Theta | O),   (6.13)

where

P_e(Theta | O) ∝ P(O | Theta) P_e(Theta) = (Sum_{S,F} p(S, F, O | Theta)) P_e(Theta),   (6.14)

where p(S, F, O | Theta) is the joint likelihood defined in Section 6.2 and P_e(Theta) is the entropic prior of the model, which in this work depends on A as

P_e(Theta) ∝ exp(Sum_w Sum_p Sum_q P^w_{p,q} log P^w_{p,q}),   (6.15)

where P^w_{p,q} denotes a transition probability in A_w. Accordingly, in the M-step of the EM algorithm, we update the transition probabilities by maximizing the entropic posterior,

A* = arg max_A {log P_e(Theta | O) + Sum_w Sum_p lambda^w_p (Sum_q P^w_{p,q} - 1)},   (6.16)

where lambda^w_p is the Lagrange multiplier ensuring Sum_q P^w_{p,q} = 1. Using an optimization technique similar to that discussed in [32], the MAP estimate of Theta can be obtained by setting the derivative of the log-posterior in (6.16) to zero:

d{log P_e(Theta | O) + Sum_w Sum_p lambda^w_p (Sum_q P^w_{p,q} - 1)} / d P^w_{p,q} = 0.   (6.17)

Then we obtain

omega^w_{p,q} / P^w_{p,q} + log P^w_{p,q} + 1 + lambda^w_p = 0,   (6.18)

where omega^w_{p,q} denotes the expected count of the transition (p -> q) under the posterior p(S, F | O, Theta).   (6.19)

This equation can be solved with the Lambert W function, the inverse function of

f(w) = w e^w,   (6.20)

where e^w is the natural exponential function and w is any complex number.
For any complex number z, we have

z = W(z) e^{W(z)}.   (6.21)

Hence

\log W(z) + W(z) = \log z,   (6.22)

which is equivalent to

-\log W(z) - W(z) + \log z = 0.   (6.23)

Setting z = e^x, we obtain

-\log W(e^x) - W(e^x) + x = 0.   (6.24)

For any complex number y,

\log \frac{y}{W(e^x)} - \frac{y}{y / W(e^x)} + x - \log y = 0.   (6.25)

If we let

x = 1 + \lambda^w_p + \log y   (6.26)

and

y = -\omega^w_{p,q},   (6.27)

then (6.25) becomes

\log \frac{-\omega^w_{p,q}}{W(e^{1+\lambda^w_p+\log(-\omega^w_{p,q})})} + \frac{\omega^w_{p,q}}{-\omega^w_{p,q} / W(e^{1+\lambda^w_p+\log(-\omega^w_{p,q})})} + 1 + \lambda^w_p = 0.   (6.28)

Comparing (6.28) with (6.18), we can read off the solution for P^w_{p,q}:

P^w_{p,q} = \frac{-\omega^w_{p,q}}{W(-\omega^w_{p,q} e^{1+\lambda^w_p})}.   (6.29)

After A is optimized, the other parameters in \Theta are updated correspondingly in the M-step. We resort to the junction tree algorithm in [18] to implement the MAP-based EM algorithm. The junction tree is an auxiliary data structure that converts a directed acyclic graph (DAG), such as the MCSHMM, into an undirected graph by eliminating cycles, so that belief propagation can be performed effectively on the modified graph for inference and learning. By applying the forward-backward algorithm, we obtain

\alpha^{(j)}_{m,t} = p(O^{(j)}_1, \ldots, O^{(j)}_t, S^{(j)}_t = m^{(j)} \,|\, \Theta),   (6.30)

\beta^{(j)}_{m,t} = p(O^{(j)}_{t+1}, \ldots, O^{(j)}_T \,|\, S^{(j)}_t = m^{(j)}, \Theta),   (6.31)

\gamma^{(j)}_{m,t} = \frac{\alpha^{(j)}_{m,t} \beta^{(j)}_{m,t}}{\sum_{m^{(j)}=1}^{K} \alpha^{(j)}_{m,t} \beta^{(j)}_{m,t}}.   (6.32)

Then the parameter set for each channel is updated as follows:

\hat{\mu}^{(j)}_{0,m} = \frac{\sum_{t=1}^T \gamma^{(j)}_{m,t} \bar{O}^{(j)}_t / n^{(j)}_t}{\sum_{t=1}^T \gamma^{(j)}_{m,t}},   (6.33)

\hat{\Sigma}^{(j)}_{0,m} = \frac{\sum_{t=1}^T \gamma^{(j)}_{m,t} (n_t \hat{\mu}^{(j)}_{0,m} - \bar{O}^{(j)}_t)^2 / n_t^2}{\sum_{t=1}^T \gamma^{(j)}_{m,t}},   (6.34)

\hat{\Sigma}^{(j)}_m = \frac{\sum_{t=1}^T \gamma^{(j)}_{m,t} \left( \sum_{k=1}^{n^{(j)}_t} (O^{(j)}_{t,k})^2 - (\bar{O}^{(j)}_t)^2 / n^{(j)}_t \right)}{\sum_{t=1}^T \gamma^{(j)}_{m,t} (n^{(j)}_t - 1)}.   (6.35)

Consequently, this learning process enhances the important states in F and drives the weakly supported ones toward zero. From the transitions of small probability, the states in F that are seldom visited can be found and eventually eliminated.
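The fixed point (6.29) is easy to evaluate numerically. Below is a minimal sketch for one row of a transition matrix, assuming SciPy's `lambertw` on the k = −1 branch, expected transition counts ω whose total comfortably exceeds one, and a bisection search for the Lagrange multiplier; the function and variable names are mine.

```python
import numpy as np
from scipy.special import lambertw

def entropic_map_row(omega, iters=200):
    """MAP estimate of one transition-probability row under the entropic
    prior: solves omega/P + log P + 1 + lam = 0 per entry (cf. (6.18)) via
    P = -omega / W(-omega * exp(1 + lam)) (cf. (6.29)), with the Lagrange
    multiplier lam found by bisection so the row sums to one.
    Assumes omega > 0 elementwise and sum(omega) > 1."""
    omega = np.asarray(omega, dtype=float)

    def row(lam):
        z = -omega * np.exp(1.0 + lam)
        return -omega / lambertw(z, k=-1).real

    # The k = -1 branch needs z >= -1/e, i.e. lam <= -2 - log(max(omega)).
    hi = -2.0 - np.log(omega.max()) - 1e-9
    lo = hi - 60.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if row(mid).sum() > 1.0:   # row mass grows with lam on this branch
            hi = mid
        else:
            lo = mid
    return row(0.5 * (lo + hi))
```

Relative to the ML estimate ω / Σω, the solution concentrates mass on the dominant transition and shrinks the weak ones, which is exactly the pressure that drives weakly supported transitions toward zero.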
Hereby the state space is optimized by balancing the state transitions between the two layers. There are thus two trimming processes: transition trimming and state trimming. For the MCSHMM, the transition probabilities are obtained in the E-step of the EM algorithm. To find a transition that can be trimmed, we look for a transition probability for which the gain in the entropic prior outweighs the loss in the likelihood:

\frac{P_e(\Theta \setminus P^w_{p,q})}{P_e(\Theta)} \ge \frac{P(O|\Theta)}{P(O|\Theta \setminus P^w_{p,q})},   (6.36)

where \Theta \setminus P^w_{p,q} denotes the parameter set \Theta without the element P^w_{p,q}. Next we trim the state space based on the estimated/trimmed transition probabilities. The purpose is to keep the useful states in F, i.e., those visited frequently, as expressed by their incoming and exit transitions. Similar to [32], we can detect trimmable states by balancing the prior probability of all incoming and exit transitions of a state against the probability mass that flows through it:

\frac{P(\Theta \setminus F_l)}{P(\Theta)} \ge \prod_{m,n,k} (a^{(k)}_{m,n,l})^{a^{(k)}_{m,n,l}},   (6.37)

where k ∈ {1, 2, 3} and

a^{(1)}_{m,n,l} = P(S^{(1)}_t = m \,|\, S^{(1)}_{t-1} = n, F_{t-1} = l),   (6.38)

a^{(2)}_{m,n,l} = P(S^{(2)}_t = m \,|\, S^{(2)}_{t-1} = n, F_{t-1} = l),   (6.39)

a^{(3)}_{m,n,l} = P(F_t = l \,|\, S^{(1)}_t = m, S^{(2)}_t = n).   (6.40)

The state trimming occurs in each iteration of the M-step, and each iteration may have some transitions/states trimmed, leading to a fast learning process. In practice, we need to decide the size of the expected state space; here we keep 6 out of 16 states in F. After model learning, the Viterbi algorithm can be used to estimate the optimal state sequence for both channels at the first layer, i.e., S, which encodes the two semantic structures, camera views and play types. Our EM algorithm implementation is based on the Bayes Net Toolbox for Matlab developed by K. P. Murphy.
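The Viterbi decoding step mentioned above can be sketched generically in the log domain. This is a standard single-chain decoder, not the exact two-channel MCSHMM decoder (which runs over the joint state space); names are mine.

```python
import numpy as np

def viterbi(log_init, log_trans, log_obs):
    """Most likely state path for a discrete-state chain.
    log_init:  (K,)   log initial state probabilities
    log_trans: (K, K) log transition matrix, rows = from-state
    log_obs:   (T, K) per-frame log-likelihood of each state
    """
    T, K = log_obs.shape
    delta = log_init + log_obs[0]          # best score ending in each state
    psi = np.zeros((T, K), dtype=int)      # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_trans        # (from, to)
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_obs[t]
    path = np.empty(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 1, 0, -1):          # backtrack
        path[t - 1] = psi[t, path[t]]
    return path
```

With sticky transitions and emissions that favor state 0 then state 1, the decoder recovers the single switch rather than chattering between states.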
6.4 Experimental Results

The MCSHMM was tested on nine 30-minute NFL football games (352×240, 30 fps) that were pre-segmented into a series of consecutive play shots by removing commercials and replays. Each game has 150-170 shots, and each shot has 300-500 frames. Additionally, using the Bayes Net Matlab Toolbox¹, we generated four two-channel synthetic data sequences, each of which has 200 segments of variable-length observations (200-400 samples), according to MCSHMMs learned from four real videos. Using both synthetic and real data allows a comprehensive validation and evaluation of the algorithm. Our algorithm was implemented in Matlab 7.0 and tested on a PC with a 3.2 GHz CPU and 1 GB of memory.

Seven previous methods are involved for comparison, as shown in Table 6.1, including a supervised GMM (order 3), the HMM with Gaussian emission (HMM(1)) and the HMM with GMM emission (HMM(2)) [61], the embedded GMM-HMM (HMM(3)) in [57], and the SHMM in [56]. We also implemented two CHMM-based methods [22], where CHMM(1) and CHMM(2) use a GMM and a segmental observation model, respectively. The first five explore the two semantic structures (plays and views) independently and separately, while the last three estimate both jointly.

¹http://www.cs.ubc.ca/~murphyk/Software/BNT/bnt.html

Methods   State number  Observation model  Transition matrix
GMM       4             3rd-order GMM      N/A
HMM(1)    4             Gaussian           4 × 4
HMM(2)    4             3rd-order GMM      4 × 4
HMM(3)    4             two-layer GMM      4 × 4
SHMM      4             segmental model    4 × 4
CHMM(1)   4, 4          3rd-order GMM      16 × 16
CHMM(2)   4, 4          segmental model    16 × 16
MCSHMM    4, 4          segmental model    4 × 4, 24 × 4 × 2

Table 6.1: The model settings for all testing methods.

The initialization is important in EM learning. We adopted a coarse-to-fine initialization strategy that uses the training result of a simpler model to initialize a more complex one.
Specifically, we first use K-means (4 classes) to obtain a coarse classification; this result is used to initialize HMM(1), whose training result is in turn used to initialize HMM(2), and so on. The training result of the SHMM was used to initialize the MCSHMM. Moreover, since the first shot is always in the central view, the initial probability of the central view is set to 1. We initialize the transition matrices by computing the frequencies of the different state transitions in a couple of real games.

Simultaneous structure learning and parameter learning are crucial for the application of the MCSHMM. We found that the traditional ML estimator cannot take full advantage of the MCSHMM due to its large state space (before trimming). The entropic-prior-based MAP estimator is capable of finding the optimal model structure (a trimmed state space) and the parameters jointly. It was mentioned in [32] that this MAP estimator can accelerate learning and rescue EM from local probability maxima.

Figure 6.2: The experimental results on synthetic data (GMM, HMM(1), HMM(2), HMM(3), SHMM, CHMM(1), CHMM(2), MCSHMM).

Fig. 6.2 and Table 6.2 show the experimental results on both synthetic data and real videos. They clearly show that the proposed MCSHMM outperforms all the other algorithms with significant improvements.

6.5 Discussions

The reason that the SHMM is better than the other HMMs (HMM(1), HMM(2), HMM(3)) is the segmental observation model it uses. The improvement of the MCSHMM over the SHMM and the two CHMMs lies in the new hierarchical-parallel dynamics. CHMMs usually assume a relatively strong correlation between the two Markov chains, which is not true in our case. The MCSHMM can effectively capture the mutual interaction between the two Markov chains by introducing the second-layer dynamics and decision-level fusion, which balance the dependency both within each channel and across the two channels.
Test Videos        Semantics    GMM     HMM(1)  HMM(2)  HMM(3)  CHMM(1)  SHMM    CHMM(2)  MCSHMM
                                (sup.)  [61]    [61]    [57]    [22]     [56]    [22]
Video1 (156 shots) Play Type    26.92%  43.59%  60.90%  77.56%  41.03%   80.13%  78.85%   82.05%
                   Camera View  35.90%  55.77%  72.44%  76.28%  58.33%   79.49%  81.41%   84.62%
Video2 (163 shots) Play Type    30.67%  49.07%  61.34%  70.56%  53.37%   77.91%  76.69%   81.59%
                   Camera View  41.10%  61.35%  67.48%  79.14%  65.03%   84.05%  80.98%   85.28%
Video3 (167 shots) Play Type    23.35%  44.91%  49.10%  58.08%  52.10%   68.86%  70.06%   75.45%
                   Camera View  37.12%  53.29%  58.68%  61.08%  58.08%   74.85%  77.25%   81.44%
Video4 (168 shots) Play Type    36.90%  58.33%  64.67%  70.66%  67.86%   77.98%  69.62%   82.74%
                   Camera View  47.61%  64.88%  70.24%  73.21%  69.05%   79.17%  75.60%   84.52%
Video5 (168 shots) Play Type    29.17%  50.59%  61.90%  72.02%  70.24%   74.40%  70.83%   80.95%
                   Camera View  40.47%  56.55%  65.48%  73.81%  60.71%   73.21%  64.88%   77.38%
Video6 (170 shots) Play Type    38.25%  54.11%  60.59%  70.59%  76.47%   80.59%  70.59%   84.71%
                   Camera View  41.17%  52.35%  64.12%  70.59%  74.70%   80.00%  71.18%   84.11%
Video7 (170 shots) Play Type    37.05%  64.70%  71.17%  70.59%  72.35%   75.88%  76.47%   83.53%
                   Camera View  52.35%  68.24%  72.35%  75.29%  67.06%   81.18%  71.76%   87.06%
Video8 (171 shots) Play Type    40.93%  56.14%  65.50%  72.51%  68.82%   78.95%  82.94%   84.80%
                   Camera View  50.88%  62.57%  75.44%  77.78%  76.02%   80.11%  81.18%   88.30%
Video9 (173 shots) Play Type    39.88%  63.01%  67.05%  69.36%  67.63%   75.14%  69.36%   82.08%
                   Camera View  43.93%  66.07%  68.21%  71.68%  69.94%   77.46%  73.41%   84.97%
Average Accuracy                43.59%  57.44%  62.62%  71.85%  63.54%   77.42%  75.08%   82.92%

Table 6.2: Shot-based classification results of nine algorithms for nine 30-min NFL videos.

To reveal the underlying difference between the MCSHMM and the CHMMs, we compare the learned transition matrices of both models with the ground-truth ones, as shown in Fig. 6.4. We can see the sparse nature of the three transition matrices in the MCSHMM, and the learning results are very close to the ground-truth ones (Fig. 6.4(a)(b)(c)).
However, the learned transition matrix of the CHMM does not match the ground-truth one (Fig. 6.4(d)), possibly due to the large state space. On the other hand, the current MCSHMM implementation (without program optimization) has the highest computational load (about 50 minutes for one video), while the other methods take between 2 and 20 minutes. This is mainly due to the large number of initial states at the second layer (most of which are trimmed during training) and the complexity of the segmental observation model. More efficient learning schemes are needed for fast state trimming.

It is also possible to extend the MCSHMM to cases where more than two channels are involved. The major challenge will be how to condense the initial state space effectively and efficiently. A possible strategy could be "incremental growing" rather than "progressive trimming".

The proposed MCSHMM could be applied to other field-based sports where the two mid-level semantic keywords, play types and camera views, can be well specified. After extracting useful visual features, which may be composed of color distributions, camera motion, or useful visual landmarks, we can invoke the MCSHMM to estimate play types and camera views jointly, which can further be used to infer certain high-level semantics by incorporating some domain knowledge. For example, as shown in Fig. 6.3, in a soccer game we can define several views, such as the whole field, midfield, opposing half, field corner, and penalty area, as well as a few plays, such as kick-off, pass, free kick, corner kick, and shoot. A goal highlight can then be specified as a shoot in the penalty area followed by a kick-off in the midfield. We can have a similar semantic analysis paradigm for video mining in baseball games, as shown in Fig. 6.3.
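As a concrete illustration of this rule-based use of mid-level keywords, the goal-highlight rule just stated can be matched directly on the (camera view, play type) sequence. The keyword strings below are illustrative, not the dissertation's actual label set.

```python
def find_goal_highlights(shots):
    """Return indices i where shot i is a shoot in the penalty area
    followed by a kick-off in the midfield (the soccer rule above).
    `shots` is a list of (camera_view, play_type) keyword pairs."""
    hits = []
    for i in range(len(shots) - 1):
        if (shots[i] == ("penalty-area", "shoot")
                and shots[i + 1] == ("midfield", "kick-off")):
            hits.append(i)
    return hits
```

In a full system, the keyword pairs would come from the MCSHMM's decoded state sequences rather than being hand-labeled.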
Figure 6.3: A proposed semantic analysis paradigm for sports video mining, from low-level visual features (color, motion, etc.) through mid-level semantic structures — camera views (e.g., midfield, opposing half, penalty area in soccer; front, remote-back, close-back in baseball) and play types (e.g., long pass, short pass, goal; pitch, throw, base) — to high-level semantics (successful offense/defense, fouls, etc.).

Figure 6.4: Comparison of the state transitions between the ground truth (top) and the learning results (bottom). The first three are for the MCSHMM defined in (6.3), (6.4), and (6.5), and the last one is for the CHMM. (a) A1 (4×64); (b) A2 (4×64); (c) A3 (16×16); (d) full state transitions of a CHMM (16×16).

CHAPTER 7

CONDITIONAL RANDOM FIELDS FOR EVENT DETECTION

7.1 Background

In previous sections, we elaborated the generative-model (i.e., HMM) based approaches for mid-level semantic keyword analysis. The HMM-based approaches naturally fit our problem formulation, where the hidden states represent the mid-level semantic keywords (camera views or play types) defined at the shot level and the observations are the visual features extracted at the frame level. As mentioned before, the mid-level semantic keywords deliver the rudimentary building blocks for high-level semantic analysis, because they can represent high-level semantics (i.e., events) effectively. In particular, in addition to the play type and camera view, we introduce another mid-level structure, the possession, which indicates the team holding the ball and can easily be detected from the initial camera motion direction at the beginning of a shot. Thus, given the three kinds of mid-level semantic keywords for all shots, the second inference problem aims at detecting interesting events, such as highlights (e.g., the touchdown plays in a football game). However, the second inference problem, i.e., going from the mid-level semantic keywords to the high-level semantics, is different in nature from the first one.
High-level semantics usually refers to events or incidents that are of particular interest to the user, with the following three characteristics:

• They are rare and specific, so the training data could be insufficient;

• They carry more ambiguity and uncertainty in terms of low-level visual features than the mid-level semantic keywords do, e.g., a touchdown in a football video can occur in many different ways;

• They can be better defined or specified in terms of mid-level semantic keywords.

Unlike the first inference problem, which is well supported by generative models, the second inference problem, also referred to as event detection in this work, is very different in nature. For example, as shown in Fig. 7.1, an event of interest, e.g., a touchdown, can be inferred from mid-level semantic keywords across several shots, including camera views, play types, and possessions. Event detection requires a flexible modeling structure that allows more versatile dependency relationships between random variables (both observations and states) to accommodate the nature of high-level semantics. Normally, generative models focus mainly on the temporal dependency across states (not observations). This motivates us to use discriminative models for the second inference problem, where the observations correspond to the mid-level semantic keywords and the states are associated with the high-level semantics or events of interest. The mid-level semantic keywords are frequent, repeatable, and relatively well defined; therefore, in the first inference problem, generative models are preferred due to their good generality on a large data set (low-level visual features) and their suitability for capturing repeated and stable patterns (mid-level semantic keywords).
However, the high-level semantics that are immediately useful to viewers are quite specific (or user-dependent) and can hardly be defined uniquely and deterministically from low-level visual features. As shown in Fig. 7.2, can the high-level semantics be better specified by mid-level semantic keywords, owing to their enhanced expressiveness? The key issue is therefore: what kind of statistical model offers enough flexibility and capability for complex high-level semantic analysis? We can make a few important observations.

Figure 7.1: The second inference problem. High-level semantics: touchdown. Camera view sequence: right, middle, left, end zone. Play type sequence: long, long, short, field goal. Possession sequence: right, right, right, right.

Figure 7.2: From mid-level semantic keywords to high-level semantics.

• There are strong temporal dependencies across shots with respect to each mid-level structure, and an evident semantic correlation between the multiple mid-level structures within each shot;

• A touchdown play cannot be determined from a single shot; its determination requires multiple mid-level semantic structures from several shots;

• There is some uncertainty in the definition of a touchdown play, and ambiguity between the touchdown play and the field goal.

Moreover, the events of interest, such as touchdowns or field goals, are usually relatively rare and unbalanced compared with the other, non-specific events in a game. This is in clear contrast with the mid-level semantic keywords. A supervised learning method is more appropriate for high-level semantic analysis because the events of interest make up a very small portion of the data. Therefore, it is important to find an effective statistical model that addresses the above characteristics of high-level semantic analysis. In particular, we are interested in developing a discriminative-model-based approach for event detection.
The discriminative model provides a flexible and plausible way to capture the complex dependencies among multiple mid-level semantic keywords, both within and across shots. As introduced above, mid-level semantic keywords are more repeatable and relatively better defined in sports video, and hence it is more reliable to infer the high-level semantics from the mid-level semantic keywords than from the low-level features. As shown in Fig. 7.1, a high-level semantic, e.g., a touchdown in American football video, usually spans multiple mid-level semantic keyword sequences with a relatively stable configuration. Intuitively, when the mid-level semantic keywords are available, we can take them as the observations and regard the high-level semantics as the hidden states. However, the mid-level semantic keyword sequences differ from the low-level features in that they carry more complex dependencies, which invalidates the default assumptions of the generative model. The discriminative model provides a feasible way to relax the strong independence assumptions and the prior knowledge required by generative models. This motivates us to use discriminative models for the second inference problem.

7.2 Model Specification

Conditional random fields (CRFs) (Fig. 7.3(a)) were first introduced in [33]; they provide an undirected probabilistic model that directly computes the statistical mapping between the input (observations) and the output (states) for classification and regression. Compared with HMMs, a CRF does not involve the conditional distributions of the observations, and a state variable in a CRF can be related to multiple observations in different time slices, e.g., multiple shots. Compared with SVMs, a CRF is more appropriate for capturing the temporal dependencies among both observations and states in sequential data. For example, in the case shown in Fig.
7.3(b), we use a Markov blanket [37] to represent all the feature functions involved in the current state St: the first-order state transition serves as the transition feature function, and five consecutive shots, i.e., mid-level semantic keywords, serve as state feature functions. This number was found to be a good compromise between complexity and accuracy. A CRF model can be characterized by the parameter set

\Theta = (\lambda_1, \lambda_2, \ldots, \lambda_n, \mu_1, \mu_2, \ldots, \mu_n),   (7.1)

which specifies the feature functions between the observations and the hidden states as

p_\Theta(S|O) = \frac{1}{Z(O)} \exp\left\{ \sum_t \left( \sum_i \lambda_i f_i(S_{t-1}, S_t) + \sum_j \mu_j g_j(S_t, O) \right) \right\},   (7.2)

where Z(O) is a normalization over the data sequence O, and f_i(·) and g_j(·) are the transition feature functions and state feature functions, respectively.

Figure 7.3: (a) CRF model; (b) CRF for event detection. S: touchdown; M, R, E: central, right, and end-zone camera views; L, S, G, K: long play, short play, field goal, and kick; LR, RL: possessions from left to right and from right to left.

Via maximum likelihood (ML) estimation, we can optimize the CRF by maximizing the conditional probability:

\Theta^* = \arg\max_\Theta p_\Theta(S|O).   (7.3)

Then, for a new data sequence O, we can predict the labeling S that maximizes p(S|O, \Theta^*):

S^* = \arg\max_S p(S|O, \Theta^*).   (7.4)

Specifically, in our research, we naturally regard O as the available mid-level semantic keyword sequence and S as the high-level event sequence. As shown in Fig. 7.3(b), the CRF used in this work involves a first-order transition feature function and a fifth-order state feature function. We utilize the dynamic programming algorithms proposed in [66] to optimize the CRF parameters; the classification task is then straightforward according to (7.4).
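A minimal numerical sketch of (7.2)-(7.4) for a linear-chain CRF follows, with the weighted feature sums precomputed as score tables rather than explicit f_i/g_j; for clarity, Z(O) is computed by brute-force enumeration over all label sequences (the forward algorithm computes the same quantity in O(T K²)). Names are mine.

```python
import numpy as np
from itertools import product

def crf_log_prob(state_seq, trans_scores, obs_scores):
    """log p(S|O) for a linear-chain CRF, per (7.2):
    trans_scores[a, b] = sum_i lambda_i * f_i(a, b)
    obs_scores[t, a]   = sum_j mu_j * g_j(a, O) at position t
    Z(O) is the sum of exp(score) over all K^T label sequences."""
    T, K = obs_scores.shape

    def score(seq):
        s = obs_scores[0, seq[0]]
        for t in range(1, T):
            s += trans_scores[seq[t - 1], seq[t]] + obs_scores[t, seq[t]]
        return s

    log_z = np.logaddexp.reduce(
        [score(s) for s in product(range(K), repeat=T)])
    return score(state_seq) - log_z
```

Decoding per (7.4) is then just the arg-max of `crf_log_prob` over label sequences, which for chains is done efficiently by Viterbi-style dynamic programming rather than enumeration.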
7.3 Learning and Inference

Specifically, in our research, we naturally regard O as the available mid-level semantic keyword sequences and S as the corresponding high-level semantic sequences; therefore, we can formulate the second inference problem as Eq. (7.4) based on the trained CRF. The key point now is how to find the parameter set Θ that maximizes the conditional probability. We represent the parameter set as

\Theta = (\lambda_1, \lambda_2, \ldots, \lambda_n, \mu_1, \mu_2, \ldots, \mu_n),   (7.5)

in which the \lambda_k and \mu_k are the parameters to be estimated. Clearly, the CRF directly computes the input-to-output mapping for classification and regression and eschews modeling the underlying distributions, so more dependencies between observations can be considered as features, without the strict independence assumptions of generative models. In practice, what we need to do is to define possible features of the observations (mid-level semantic keywords) and the hidden states (high-level semantics) in a football video sequence, then utilize dynamic programming algorithms, e.g., gradient-based approaches [66], to optimize the model parameters; the classification task is then straightforward according to Eq. (7.4). First, (7.2) can be written in log form:

\log p_\Theta(S|O) = \sum_{e \in E, k} \lambda_k f_k(e, S|_e, O) + \sum_{v \in V, k} \mu_k g_k(v, S|_v, O) - \log Z(O),   (7.6)

and taking the derivative with respect to \Theta gives

\frac{\partial \log p_\Theta(S|O)}{\partial \Theta} = \frac{\partial}{\partial \Theta} \left\{ \sum_{e \in E, k} \lambda_k f_k(e, S|_e, O) + \sum_{v \in V, k} \mu_k g_k(v, S|_v, O) - \log Z(O) \right\}.   (7.7)

In the training process, the first two terms are easy to obtain by counting the number of ones in the Boolean sequences f_k and g_k; the normalization Z(O) can then be obtained from the maximal cliques [67].

7.4 Experimental Results

Given the mid-level sema


