`
`OPEN ACCESS
`
`algorithms
`
`ISSN 1999-4893
`www.mdpi.com/journal/algorithms
`
`Article
`Recognizing Human Activities from Sensors Using Hidden
`Markov Models Constructed by Feature Selection Techniques
`
Rodrigo Cilla *, Miguel A. Patricio, Jesús García, Antonio Berlanga and Jose M. Molina

Computer Science Department, Universidad Carlos III de Madrid, Avda. de la Universidad Carlos III, 22, Colmenarejo, Spain
E-mails: mpatrici@inf.uc3m.es, jgherrer@inf.uc3m.es, aberlan@ia.uc3m.es, molina@ia.uc3m.es
* Author to whom correspondence should be addressed; E-mail: rcilla@inf.uc3m.es
`
`Received: 28 November 2008; in revised form: 2 February 2009 / Accepted: 16 February 2009 /
`Published: 21 February 2009
`
Abstract: In this paper a method for selecting features for Human Activity Recognition from
sensors is presented. Starting from a large feature set that may describe the activities to
recognize, Best First Search and Genetic Algorithms are employed to select the feature subset
that maximizes the accuracy of a Hidden Markov Model generated from the subset. A
comparison of the proposed techniques is presented to demonstrate their performance
building Hidden Markov Models to classify different human activities using video sensors.
`
`Keywords: Computer vision; Human Activity Recognition; Feature Selection; Hidden
`Markov Models
`
`1.
`
`Introduction
`
Sensors allow computers to perceive the world that surrounds them. By the use of sensors, like
thermostats or photocells, computers can measure the temperature or lighting conditions of a room. In
recent years, the deployment of multisensor networks has become popular due to the drop in sensor
prices. These networks can include different kinds of sensors, some more complex, like cameras, indoor
location systems (ILS) or microphones. By the use of sensor networks, computer systems can make
more accurate decisions due to the richer information about the environment that they have.
A common task in sensor processing is the recognition of situations in the perceived environment. If
these situations to recognize are known a priori, i.e., there is a set of labels that describes each situation,
`
`1
`
`APPLE 1019
`
`
`
`Algorithms 2009, 2
`
`283
`
the problem can be tackled as a classification task. Then, the problem to solve is to find a model relating
the data from sensor readings with situation labels. Some sensor data may be too noisy, and other data
may not provide any information for label prediction. Relevant sensor data have to be identified so the
model is built from them, leaving irrelevant data out.
An application where the use of sensor networks can be useful is Human Activity Recognition, i.e., the
understanding by the computer of what humans are doing. This field has received increasing attention in
recent years, due to its promising applications in surveillance, human computer interaction
and ambient intelligence, and the interest that governments and commercial organizations have placed in
the area. Human Activity Recognition systems can be integrated with existing systems such as the one proposed
by Corchado et al. [1] to monitor Alzheimer patients. If a patient falls to the floor, the system can alert a
nurse to attend the accident. Human Activity Recognition systems can also be integrated with the system
proposed by Pavón et al. [2], detecting forbidden activities being performed and alerting security staff.
Human activity recognition may be considered a classification task, and human activity recognizers
can be created using supervised learning. A set of activities to recognize has to be defined a priori.
Different observations about the activities to recognize are extracted using different sensors. Then, the
problem to solve is to find the function that best relates observations to activity labels. Data from sensors
are not free from noise, so relevant attributes have to be identified before training the activity recognizer.
Better results are expected when this previous step is performed.
The sensors that provide the most information for Human Activity Recognition are video cameras.
Works in activity recognition from video can be divided into two groups [3]: (1) those centered on
small duration activities (i.e., walking, running, ...); and (2) those that deal with large duration activities
(i.e., leaving a place, going to the living room, ...). The former are centered on choosing good features for
activity recognition, whereas the latter usually tackle the temporal structure in the classifier.
Regarding small duration activities, Neural Networks, Neuro-Fuzzy systems and C4.5 have been
successfully employed [4]. Also, time delay neural networks have been used to recognize the case where the
activities are hand gestures [5]. In [6] it is shown how to perform feature extraction, feature selection and
classification using a simple Bayesian classifier to solve the small duration activity recognition problem.
In their approach, they use ground truth data of people bounding boxes from the CAVIAR∗ dataset instead of
a blob tracker output. Perez et al. [7] use different time averaged speed measures to classify the activities
present in the CAVIAR dataset using HMMs.
Robertson et al. [3] use trajectory and velocity data concatenated over five frames, together with blob optical
flow, to decide which small duration activity is currently being performed. This small duration activity is then introduced
into a Hidden Markov Model (HMM) to refine the decision using temporal context.
To classify large duration activities, Brand et al. [8] have used HMMs to model activities performed
in an office. They use different spatial measures taken from the blob bounding ellipse. The HMM
approach has been extended in [9] to model activities involving two agents, using a Coupled HMM. They
use different velocity measures as features, all of them invariant with respect to camera rotations
and translations, providing camera independent recognizers.
HMMs are recognized as an effective technique for activity classification, because they offer dynamic
time warping, have clear Bayesian semantics and well-understood training algorithms. In this
paper a method to adapt them, selecting the best features in the observation space, is defined, trying to
∗http://homepages.inf.ed.ac.uk/rbf/CAVIAR/
`
`2
`
`
`
`Algorithms 2009, 2
`
`284
`
maximize their accuracy for classifying small duration activities. HMMs will be incrementally built using
both heuristic search and genetic algorithms. A comparison of the performance obtained using these
algorithms for HMM construction will be shown. Our approach differs from that proposed by Ribeiro et
al. in several aspects: (1) blob tracker output is used instead of ground truth data; (2) the foreground mask
of the blob tracker is used to extract features; (3) the classifier used for the recognition of activities is an
HMM; (4) we use different temporal window sizes; and (5) we use different search methods for feature
selection.
This paper is organized as follows: in Section 2, an overview of the activity recognition system where
the feature selection is going to be performed is given; in Section 3, the feature selection algorithms
used to build HMMs are presented; in Section 4, experiments selecting good features for human
activity recognition are performed and results are discussed; finally, in Section 5, conclusions of this
work are presented and future lines are discussed.
`
2. Definition of the Activity Recognition Problem
`
`2.1. Functional description
`
Figure 1. Overview of the activity recognition process from multiple sensors. [Diagram: each
sensor (Sensor 0, Sensor 1, Sensor 2, ...) feeds a Sensor Processing stage followed by a Feature
extraction stage; all feature streams are combined by an Activity Recognition module that outputs
the activity label.]
`
In the proposed human activity recognition architecture (see Figure 1), the objective is to select from
a set of activity labels A = {a1, . . . , aN} the one that is most likely according to the signal p (t)
retrieved by the sensors at each time step. Sensors could be video cameras, thermal cameras or indoor
localization systems, depending on the available resources and the conditions of the environment. In the case
of imaging sensors, a frame is obtained at each time step, p (t) ≡ I (t, x, y). Using p (t), the system needs
to extract the position and size bi (t) of each human i in the environment. When using imaging sensors,
bi (t) may be obtained using a blob tracker [10]. A blob tracker takes an image sequence and provides
the location of each object i moving in the scene, maintaining temporal coherence of objects between
frames. Human tracking is a hard task and, despite the advances of the last years, tracking methods
still provide a noisy output, caused in part by scene illumination changes or by the complexity of object
movements [11]. Using p (t) and bi (t), a set of additional features ζi (t) may have to be extracted to be
`
`3
`
`
`
`Algorithms 2009, 2
`
`285
`
able to differentiate between the activities to be recognized. Finally, using ζi (t), a function Λ (ζi (t)),
Λ : ζ → A, is used to decide which is the activity being performed.
Activity recognition solves the correspondence:

qit = arg max_{ak} P (ζi (t) , ζi (t − 1) , . . . , ζi (1) | ak)    (1)

where qit is the activity that maximizes the probability of the observations ζi (t), ζi (t − 1), . . . , ζi (1) at
instant t by individual i.
`
`Figure 2. The three steps to create an activity recognizer
`
To build an activity recognition system, three processes have to be performed (see Figure 2). First,
an extensive feature set Zi (t) has to be extracted from bi (t) and I (t, x, y). In a second step, the subset
ζi (t) has to be selected from Zi (t), choosing the subset with the most predictive power for A. The last
problem to solve is to select the best classifier architecture for activity recognition.
`
`2.2. Feature extraction for activity representation using video sensors
`
The features that can be used for activity representation depend on the type of sensor inputs to be
processed. The most common sensors used for Human Activity Recognition are video cameras, so all the
features presented here are computed from image sequences.
The objective of this section is to present a large set of features that characterizes the activities to classify
as completely as possible. The input to the feature extraction process includes the original frame I (t, x, y),
its foreground mask F g (t, x, y), and the blob bounding box b (t) given by a blob tracker, which represents
the human (see Figure 3). Using this information, different types of features are going to be derived.
`
`4
`
`
`
`Algorithms 2009, 2
`
`286
`
Most of the features presented here have been proposed in [6]. The first group of extracted features
comes from the blob bounding box properties and their time evolution, taking first order derivatives and
time averaged derivatives over a temporal window of T frames, as shown in Table 1.
`
Figure 3. Inputs for the feature extraction process: (a) original frame; (b) foreground mask;
(c) bounding box.
`
Table 1. First feature group: blob bounding box properties

#        Feature name                    Definition

Bounding box properties at time t
1        Position                        p(t) = (x(t), y(t))
2        Size                            size(t) = (w(t), h(t))
3        Velocity                        v(t) = ∂p(t)/∂t = (∂x(t)/∂t, ∂y(t)/∂t)
4        Size derivative                 ∂size(t)/∂t = (∂w(t)/∂t, ∂h(t)/∂t)
5        Speed                           s(t) = ‖v(t)‖ = ((∂x(t)/∂t)² + (∂y(t)/∂t)²)^(1/2)
6        Area                            a(t) = w(t) · h(t)

Properties averaged over T frames
7        Mean speed                      s̄_T(t) = (1/T) Σ_{i=t−T+1..t} s(i)
8..10    Mean velocity norm              ‖v̄_T(t)‖, computed by 3 different methods:
                                         ‖v̄_T(t)‖_1 = ‖(1/T) Σ_{i=t−T..t−1} (p(t) − p(i)) / (t − i)‖
                                         ‖v̄_T(t)‖_2 = ‖p(t) − p(t−T+1)‖ / T
                                         ‖v̄_T(t)‖_3 = linear least squares fitting
11..13   Speed/velocity ratio            R_{v,T,i}(t) = ‖v̄_T(t)‖_i / s̄_T(t),  i = 1, 2, 3

Second order moments
14       Speed variance                  σ²_{s,T}(t)_1 = (1/(T−1)) Σ_{i=t−T+1..t} (s(i) − s̄_T(t))²
15..17   Speed variance (centered)       σ²_{s,T}(t)_{1+j} = (1/(T−1)) Σ_{i=t−T+1..t} (s(i) − ‖v̄_T(t)‖_j)²,  j = 1, 2, 3
-        Velocity covariance             Σ_{v,T}(t)_1 = (1/(T−1)) Σ_{i=t−T+1..t} v(i) v(i)ᵀ
-        Velocity covariance (centered)  Σ_{v,T}(t)_{1+j} = (1/(T−1)) Σ_{i=t−T+1..t} (v(i) − v̄_T(t)_j)(v(i) − v̄_T(t)_j)ᵀ,  j = 1, 2, 3
18..21   Trace                           tr Σ_{v,T}(t)_i,  i = 1, 2, 3, 4
22..25   Eigenvalue ratio                R_{Σv,T}(t)_i = λ_min / λ_max of Σ_{v,T}(t)_i,  i = 1, 2, 3, 4
`
The second feature group consists of the seven Hu invariant moments [12] {hu1, hu2, hu3, hu4, hu5, hu6, hu7}
of the foreground mask enclosed by the blob bounding box. These seven moments are shape descriptors
invariant under translation, rotation and scaling. The Hu moments are numbered from 26 to 32.
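For concreteness, the core of the first feature group can be sketched in a few lines of NumPy. This is a hedged illustration: `bbox_features`, its derivative scheme and the simple sliding-window average are our own choices, not the paper's implementation.

```python
import numpy as np

def bbox_features(boxes, T=3):
    """Sketch of features 1-7 of Table 1 from a sequence of blob
    bounding boxes. `boxes` holds one (x, y, w, h) row per frame;
    T is the temporal window size."""
    x, y, w, h = boxes.astype(float).T
    vx, vy = np.gradient(x), np.gradient(y)   # velocity components
    dw, dh = np.gradient(w), np.gradient(h)   # size derivative
    speed = np.hypot(vx, vy)                  # s(t) = ||v(t)||
    area = w * h
    # Mean speed over a sliding window of T frames.
    mean_speed = np.convolve(speed, np.ones(T) / T, mode="same")
    return np.column_stack([x, y, w, h, vx, vy, dw, dh,
                            speed, area, mean_speed])
```

A blob moving 3 pixels right and 4 pixels down per frame, for instance, yields a constant speed feature of 5 pixels/frame.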
`
`5
`
`
`
`Algorithms 2009, 2
`
`287
`
Table 2. Third and fourth groups: optical flow measures. The # I column gives the feature
number when computed from I (t, x, y), and # Fg when computed from F g (t, x, y).

# I       # Fg      Feature name              Definition
-         -         target pixels             P(t) = {(x, y) ∈ boundingbox(t)},  N = |P(t)|
-         -         optical flow              F(t, x, y) = (fx(t, x, y), fy(t, x, y)),  (x, y) ∈ P(t)

Instantaneous
33, 34    37, 38    mean flow                 F̄(t)_1 = (1/N) Σ_{(x,y)∈P(t)} F(t, x, y)
-         -         mean flow (centered)      F̄(t)_2 = F̄(t)_1 − v(t)
-         -         flow norm                 f(t, x, y)_1 = ‖F(t, x, y)‖
-         -         flow norm (centered)      f(t, x, y)_2 = ‖F(t, x, y) − v(t)‖
41        42        motion energy             f̄(t)_i = (1/N) Σ_{(x,y)∈P(t)} f(t, x, y)_i²,  i = 1, 2
-         -         flow covariance           Σ_F(t)_1 = (1/N) Σ_{(x,y)∈P(t)} F(t, x, y) F(t, x, y)ᵀ
-         -         flow cov. (centered)      Σ_F(t)_2 = (1/N) Σ_{(x,y)∈P(t)} (F(t, x, y) − F̄(t)_1)(F(t, x, y) − F̄(t)_1)ᵀ
35, 36    39, 40    eigenvalue ratio          R_{ΣF}(t)_i = λ_min / λ_max of Σ_F(t)_i,  i = 1, 2

Time averaged over T frames
43..46    47..50    mean flow                 F̄(t)_{i+2} = (1/T) Σ_{j=t−T+1..t} F̄(j)_i,  i = 1, 2
-         -         motion energy             f̄_T(t) = (1/T) Σ_{j=t−T+1..t} f̄(j)_1

Temporal second order moments
-         -         covariance                Σ_{F̄,T}(t)_i = (1/T) Σ_{j=t−T+1..t} (F̄(j)_i − F̄_T(t)_i)(F̄(j)_i − F̄_T(t)_i)ᵀ,  i = 1, 2, 3, 4
51..54    55..58    trace / eigenvalue ratio  tr Σ_{F̄,T}(t)_i and R_{Σ F̄,T}(t)_i = λ_min / λ_max,  i = 1, 2, 3, 4
`
The third and fourth feature groups have been extracted from optical flow measures over the enclosed
blob bounding box of I (t, x, y) and F g (t, x, y), respectively. Optical flow is a measure of the motion of each of
the pixels of a given frame with respect to a previous one. In our work, it is obtained using the Lucas-Kanade
method [13]. Again, some properties are taken from the time averaged measures of different
optical flow properties, using a temporal window of size T .
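The instantaneous measures of Table 2 reduce to simple averages over the blob pixels. A hedged NumPy sketch follows; the dense flow field is assumed given (in the paper it comes from a Lucas-Kanade implementation), and the function name is ours.

```python
import numpy as np

def flow_measures(flow):
    """Instantaneous optical flow measures over the blob pixels.
    `flow` has shape (H, W, 2), holding the per-pixel (fx, fy)
    vectors inside the bounding box."""
    F = flow.reshape(-1, 2).astype(float)
    n = len(F)
    mean_flow = F.mean(axis=0)                        # mean flow
    energy = (np.linalg.norm(F, axis=1) ** 2).mean()  # motion energy
    centered = F - mean_flow
    cov = centered.T @ centered / n                   # flow covariance
    eig = np.linalg.eigvalsh(cov)                     # ascending order
    ratio = eig[0] / eig[-1] if eig[-1] > 0 else 0.0  # eigenvalue ratio
    return mean_flow, energy, cov, ratio
```

A uniform flow field, for example, has zero centered covariance and therefore a zero eigenvalue ratio, while an isotropically moving blob pushes the ratio toward one.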
`
`2.3. Feature selection for activity recognition problems
`
Feature selection can be described as a single objective optimization problem where, given a set of
features F , a subset f∗ ⊆ F has to be determined such that:

J (f∗, E) = max_{f⊆F} J (f, E)    (2)

where J : F × Ψ → R is a utility function that measures how good a feature subset f ⊆ F is for
classifying a set of points E ∈ Ψ, Ψ being the space of training examples. In a supervised context, where
the class of the points in E is known, the goodness of the subset f can be interpreted as the accuracy
of a classifier built using the attributes in f, while in an unsupervised context the goodness of f can be
interpreted as the quality of the generated clusters.
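The optimization in Equation (2) can be made concrete with a brute-force search over all subsets, feasible only for tiny feature sets. The utility J below is a toy stand-in for illustration, not the HMM wrapper defined later.

```python
from itertools import combinations

def exhaustive_select(features, J):
    """Brute-force maximization of Equation (2): evaluate the
    utility J on every non-empty subset of `features`."""
    best, best_score = set(), float("-inf")
    for r in range(1, len(features) + 1):
        for f in combinations(features, r):
            score = J(f)
            if score > best_score:
                best, best_score = set(f), score
    return best, best_score

# Toy utility: reward features 0 and 2, penalize subset size.
J = lambda f: len({0, 2} & set(f)) - 0.1 * len(f)
subset, score = exhaustive_select(range(4), J)   # -> {0, 2}, score 1.8
```

With N features this loop visits 2^N − 1 subsets, which is exactly why the suboptimal search methods of Section 3.2 are needed.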
The need for feature selection comes from the fact that available training instances usually contain
highly correlated and/or noise-dominated attributes. These attributes have to be identified and removed,
because they do not provide useful information for classification and their use only increases the
classification cost, possibly even dropping the classification accuracy.
The feature selection problem has been widely studied in the literature. A taxonomy of the existing
techniques can be established according to the evaluation criteria used to score the candidate feature
subsets [14]: (1) wrapper methods, where a classifier is trained to evaluate the predictive accuracy of the
`
`6
`
`
`
`Algorithms 2009, 2
`
`288
`
candidate subset; and (2) filter methods, where a measure of the predictive power of the feature subset
is calculated, without training any classifier. Filter methods are quicker and more general than wrapper
methods, but the optimality of the obtained feature subsets cannot be ensured even when using exhaustive
search methods.
`
`2.4. Classifiers for activity recognition
`
To solve general classification problems, different algorithms have been proposed, both supervised and
unsupervised [15]. Supervised methods build class models using labelled data, whereas unsupervised
methods try to discover the structure of the data. Examples of supervised methods are kNN, Neural
Networks and Classification Trees. Examples of unsupervised methods are clustering techniques. In
Human Activity Recognition, the most accepted classification techniques are Hidden Markov Models
(HMMs), because activities have temporal structure and HMMs can deal with this kind of process.
However, other methods have been successfully used [4]: Neural Networks, Neuro-Fuzzy systems, C4.5,
etc.
Consider a system that can be described using a set of N different states, where random transitions are
produced over time according to a given probability distribution for each state. The state of the system
at each moment depends on the states it was in at previous moments. This kind of stochastic process
is called a "Markov Model". Additionally, if the present state of the system cannot be observed, i.e., it
can only be measured through an effect that it produces, the system is called a "Hidden Markov Model".
A Hidden Markov Model λ [16] is defined by the tuple λ ≡ (S, M, A, B, Π) where:

• S = {S0, . . . , SN} is the set of hidden states of the model. The state at time t is denoted by qt.

• M is the set of observation symbols of the model.

• A = {aij} is the state transition probability distribution:

    aij = p (qt+1 = Sj | qt = Si)    (3)

• B = {bj} is the observation symbol probability distribution in state j:

    bj (v) = p (v | qt = Sj)    (4)

• Π = {πj} is the initial state probability distribution:

    πj = p (q0 = Sj)    (5)
`
One of the most powerful features of HMMs is that they can model both large duration and small duration
activities. To model large duration activities, an HMM is defined for each activity. Given a sequence
O = o1 . . . ok of observations to classify as a large duration activity, and a set of HMM activity models
Λ = {λ1 . . . λn}, the sequence is associated with the HMM with the highest probability of generating O:

arg max_i P (λi | O)    (6)

P (O | λi) can be estimated using the Forward-Backward algorithm [16].
`
`7
`
`
`
`Algorithms 2009, 2
`
`289
`
When the recognition targets are small duration activities, only one HMM is used, with one state per
activity to classify. Given a sequence O = o1 . . . ok of observations to classify as small duration activities,
the problem to solve is the association of each observation with the state most likely to have generated it:

qt = arg max_i P (Si | ot, qt−1)    (7)

The current state of the HMM can be estimated using the Viterbi algorithm [16]. The
Viterbi algorithm computes the most likely state sequence that has generated the observed output. The
algorithm proceeds as follows:
`
1. Initialization:

    δ1 (i) = πi bi (o1),    1 ≤ i ≤ N    (8)

    ψ1 (i) = 0,    1 ≤ i ≤ N    (9)

2. Recursion:

    δt (j) = max_{1≤i≤N} [δt−1 (i) aij] bj (ot),    2 ≤ t ≤ K, 1 ≤ j ≤ N    (10)

    ψt (j) = arg max_{1≤i≤N} [δt−1 (i) aij],    2 ≤ t ≤ K, 1 ≤ j ≤ N    (11)

3. Finalization:

    p∗ = max_{1≤i≤N} [δK (i)]    (12)

    q∗K = arg max_{1≤i≤N} [δK (i)]    (13)

4. Path backtracking:

    q∗t = ψt+1 (q∗t+1),    t = K − 1, K − 2, . . . , 1    (14)

The algorithm gives us the optimal state sequence stored in vector q∗; p∗ is the probability of q∗
generating O.
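The four steps translate directly to code. A minimal log-space sketch for discrete emissions follows (continuous Gaussian emissions, as used later in Section 3.1, would only change how bj (ot) is evaluated):

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Steps 1-4 above for a discrete-emission HMM, in log space.
    Returns the optimal state vector q* and log p*."""
    K, N = len(obs), len(pi)
    delta = np.log(pi) + np.log(B[:, obs[0]])      # 1. initialization
    psi = np.zeros((K, N), dtype=int)
    for t in range(1, K):                          # 2. recursion
        scores = delta[:, None] + np.log(A)        # scores[i, j]
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + np.log(B[:, obs[t]])
    q = np.empty(K, dtype=int)
    q[K - 1] = delta.argmax()                      # 3. finalization
    for t in range(K - 2, -1, -1):                 # 4. path backtracking
        q[t] = psi[t + 1, q[t + 1]]
    return q, delta.max()
```

Working in log space replaces the products of Equations (8)-(11) with sums, which keeps long sequences from underflowing.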
`
`3. Feature selection for HMMs using Heuristic and Genetic Algorithms
`
Discriminative features have to be used in any classification problem to obtain good results. Also,
HMMs have been widely used for activity classification, because activities can be easily represented
using this abstraction. It is therefore of interest to study how HMMs can be built using feature
selection techniques, because it is not clear that some commonly used features [7] provide the necessary
separability of the activities.
`
`
`8
`
`
`
`Algorithms 2009, 2
`
`3.1. Feature subset evaluation
`
`290
`
To evaluate the utility of a given feature subset, an HMM wrapper is going to be used. The utility
of the feature subset will be the accuracy of the HMM for classifying some small duration activity
sequences. Given a set of training sequences O = {O1 . . . ON}, Oi = {oi1 . . . oik}, containing m
different activities, and their corresponding set of hidden states A = {A1 . . . AN}, Ai = {ai1 . . . aik},
the estimation of HMM parameters is straightforward. The HMM will have m hidden states, one per
activity. To estimate the initial state probability distribution, the frequency with which each state appears first
in the training sequences has to be computed. The state transition probability distribution is
computed using the frequency of each state transition along the training sequences. To simplify
the estimation of parameters even further, the observation symbol probability distribution is assumed to
be a single multivariate Gaussian distribution. Other types of distributions, like Gaussian Mixtures or
KDE [17], could be used, but the construction of the HMM would be harder, and this is not important for the
objective of this work. The parameters of each state emission distribution to be estimated are the mean
vector and covariance matrix.
Once the Hidden Markov Model has been estimated, a test sequence O = {o1, . . . , oK} can be
classified using the Viterbi algorithm. The obtained state sequence q∗ is compared to ground truth data, and
the accuracy of the model for predicting each hidden state is taken as the subset utility.
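The estimation described above is a matter of counting plus per-state moment estimation. A sketch under the single-Gaussian assumption (function and variable names are ours; the covariance regularization is an added safeguard, not from the paper):

```python
import numpy as np

def estimate_hmm(sequences, states, m):
    """Supervised HMM estimation: counting for pi and A, per-state
    Gaussian moments for the emissions. sequences[i]: (k_i, d)
    feature array; states[i]: (k_i,) labels in 0..m-1."""
    d = sequences[0].shape[1]
    pi, A = np.zeros(m), np.zeros((m, m))
    for st in states:
        pi[st[0]] += 1                       # initial-state counts
        for a, b in zip(st[:-1], st[1:]):
            A[a, b] += 1                     # transition counts
    pi /= pi.sum()
    rows = A.sum(axis=1, keepdims=True)
    A = np.divide(A, rows, out=np.full_like(A, 1.0 / m), where=rows > 0)
    X, S = np.vstack(sequences), np.concatenate(states)
    means = np.array([X[S == j].mean(axis=0) for j in range(m)])
    covs = np.array([np.cov(X[S == j].T) + 1e-6 * np.eye(d)
                     for j in range(m)])     # regularized covariances
    return pi, A, means, covs
```

States never observed as sources of a transition get a uniform outgoing row, a small smoothing choice that keeps the resulting matrix stochastic.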
`
`3.2. Searching the feature space
`
Given a set of N features, the number of possible feature subsets to evaluate is 2N. As N grows, the
feature selection problem becomes intractable and exhaustive search methods become impractical. To
avoid this curse of dimensionality, suboptimal search methods have to be employed. In this paper two
different methods are proposed: Best First Search and Genetic Search.
Best First Search (BFS) [18] is a tree search method that explores the search tree expanding the best
node at each level. In this case, the best node is the one that has the best accuracy on its level. The
pseudocode is shown in Algorithm 1. This search method does not guarantee finding the optimal solution, only a
local optimum, because when selecting only the best child, the path to the optimal solution may be left
unexplored.
`
Algorithm 1 Best First Search
Open ← {root}
while Open ≠ ∅ do
    Pick the best node of Open
    Open ← ∅
    Generate its children and introduce them into Open
end while
`
To search the feature subset space using Best First Search, the root node of the tree will be the empty
feature set. The successors of a node are the nodes that contain the same features as their parent plus
one new feature that the parent does not have. The search finishes when the accuracy of every child is equal
to or smaller than the accuracy of its parent.
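Combined with the wrapper of Section 3.1, this search reduces to greedy forward selection. A sketch with a generic `evaluate` callback standing in for the HMM wrapper accuracy (the helper names are ours):

```python
def best_first_select(n_features, evaluate):
    """Grow the feature subset by the single best feature until no
    child improves on its parent. `evaluate(subset)` returns the
    wrapper accuracy of a candidate subset."""
    current, current_score = frozenset(), float("-inf")
    while True:
        children = [current | {f} for f in range(n_features)
                    if f not in current]
        if not children:
            return current, current_score
        scored = [(evaluate(c), c) for c in children]
        best_score, best_child = max(scored, key=lambda p: p[0])
        if best_score <= current_score:
            return current, current_score
        current, current_score = best_child, best_score
```

Each round evaluates at most N candidate subsets, so the whole search needs O(N²) wrapper evaluations instead of the 2^N of exhaustive search, at the price of only reaching a local optimum.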
`
`9
`
`
`
`Algorithms 2009, 2
`
`291
`
Genetic Algorithms (GAs) [19] are powerful search techniques inspired by the theory of Darwinian
evolution. Populations of individuals, where each one represents a different solution to the problem, are
evolved using crossover and mutation operators. The pseudocode of a simple genetic algorithm is shown
in Algorithm 2. Again, this search method does not guarantee finding the optimal solution, and even if it is
found, there is no way to ensure that it is the global optimum.
`
Algorithm 2 Genetic Algorithm
Population ← init
while stop condition is not satisfied do
    Evaluate(Population)
    NextPopulation ← selection(Population)
    NextPopulation ← reproduction(Population)
    Population ← replacement(Population, NextPopulation)
end while
Solution ← best(Population)
`
The individual used to explore the feature subset space is a bit string of length
N, where N is the total number of given features. If the ith bit is set to 1, the ith feature is
selected, whereas if it is set to 0, the ith feature is not selected.
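A compact sketch of the bit-string GA with tournament selection follows. The population size, generation budget and operators shown here are illustrative defaults, not the exact configuration used in the experiments (Section 4.1 gives those values); only the one-gene initialization and the 1/N mutation rate mirror the paper.

```python
import random

def genetic_select(n_bits, fitness, pop_size=30, generations=40,
                   tournament=3, seed=0):
    """Evolve bit-string feature masks maximizing `fitness(bits)`:
    tournament selection, one-point crossover, bit-flip mutation
    at rate 1/n_bits."""
    rng = random.Random(seed)
    pop = []
    for _ in range(pop_size):                 # one random gene on
        ind = [False] * n_bits
        ind[rng.randrange(n_bits)] = True
        pop.append(ind)
    best = max(pop, key=fitness)
    for _ in range(generations):
        nxt = []
        while len(nxt) < pop_size:
            a = max(rng.sample(pop, tournament), key=fitness)
            b = max(rng.sample(pop, tournament), key=fitness)
            cut = rng.randrange(1, n_bits)    # one-point crossover
            child = [g if rng.random() >= 1.0 / n_bits else not g
                     for g in a[:cut] + b[cut:]]
            nxt.append(child)
        pop = nxt
        gen_best = max(pop, key=fitness)
        if fitness(gen_best) > fitness(best):
            best = gen_best
    return best
```

With the paper's wrapper, `fitness` would be the cross-validated HMM accuracy of the subset encoded by the mask.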
`
`4. Experimental validation
`
`4.1. Experimental setup
`
A video has been recorded at 25 fps in a corridor using a Sony SSC-DC598P video surveillance
camera located about 3 meters above the ground. An actor has performed six different small duration
activities: walking, stopped, bending, falling, lying down and rising (see Figure 4). A blob tracker has
been used to extract the actor bounding box on each frame where the actor appears. Using the bounding
box, all the features presented in Section 2.2 have been extracted, using temporal windows from 3 to
25 frames, giving 780 different features per frame. The blob tracker has divided the original video into 11
video sequences, where the activity performed on each frame has been manually annotated. The
content of each sequence can be seen in Table 3.
The blob tracker has produced some erroneous outputs, where the bounding box does not contain the
actor for some frames (see Figure 5). These errors may be produced by scene illumination changes, or
by the initialization of a false track that is later associated with the actor. In these cases, the frame has been
annotated with the special activity error. It is considered important to have that label in order to
detect this type of false alarm when the system is working.
An HMM is going to be built for each set of small duration activities to be recognized, using both BFS
and GA. To evaluate the utility of a candidate feature subset, leave-one-out cross-validation (LOOCV)
of the generated HMM is performed with the sequences employed in each experiment.
LOOCV consists of generating the model with all the sequences except one and testing the model with the
held-out sequence. This process is repeated for every sequence employed. With this training strategy,
possible overfitting is minimized, because the model has not been built with the sequence used to
measure its quality.
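The procedure can be sketched generically; `train` and `accuracy` below stand in for the HMM estimation of Section 3.1 and Viterbi-based scoring, and the function name is ours.

```python
def loocv_accuracy(sequences, train, accuracy):
    """Leave-one-sequence-out cross-validation: each sequence is
    held out once, the model is trained on the rest, and the mean
    held-out accuracy is returned."""
    scores = []
    for i, held_out in enumerate(sequences):
        rest = sequences[:i] + sequences[i + 1:]
        model = train(rest)
        scores.append(accuracy(model, held_out))
    return sum(scores) / len(scores)
```

Holding out whole sequences rather than individual frames matters here, because frames within a sequence are strongly correlated and frame-level splits would leak information into the training folds.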
`
`10
`
`
`
`Algorithms 2009, 2
`
`292
`
Figure 4. Examples of the activities to recognize. First row, walking. Second row, stopped.
Third row, bending. Fourth row, falling. Fifth row, lying down. Last row, rising.
`
`11
`
`
`
`Algorithms 2009, 2
`
`293
`
Table 3. Content of the sequences used. W stands for walking, S for stopped, F for falling,
D for lying down, R for rising, L for bending and E for error. The number of frames each
activity lasts is shown in parentheses.

Sequence   Content
1          W(220), S(214), W(113), E(197), S(161), W(221)
2          W(107), S(172), W(116), S(268), W(12)
3          W(50), S(134), W(112), S(202), W(224), S(181), W(145), S(239), E(5), S(30), W(269), S(190), W(250), E(55)
4          W(56), S(131), W(215), F(27), D(211), R(64), W(174), F(57), D(171), R(97), W(115), S(69), W(156), F(68), D(58), R(61)
5          W(139), F(57), D(79), R(88), W(130), F(54), D(172), R(80), W(176), F(52), D(52), R(82), W(236), E(22)
6          L(281), W(283), L(208), W(213), L(173), W(156), L(168), W(167), E(23)
7          W(62), L(134), W(146), L(43), W(222), L(157), W(158), L(110), W(80), L(180), W(133), L(168), W(28), E(18)
8          W(733), E(10)
9          W(760), E(48)
10         E(155), W(271), E(307)
11         W(65), E(1)
`
`Figure 5. Example of the error activity. Bounding box (in green) does not contain the actor.
`
`12
`
`
`
`Algorithms 2009, 2
`
`294
`
`1
`780. The
`GA has been employed using a configuration of 780 individuals and a mutation rate of
`selection technique employed has been tournament selection of 3 individuals. The population of the
`genetic algorithm is initialized with one random gen set to true. As the length of the individual is
`780 and the population 780, it is expected to have each bit set to true in the population at least one time.
`Population evolution has stopped when the fitness value of the best individual has been without changing
`for 10 generations. The algorithm has been run 20 times for each experiment, because GAs are random
`search techniques and different results may be achieved in different trials. The result shown is the best
`one found.
`
`4.2. Experiment I: three activities
`
A first attempt to solve the problem is made using a set containing sequences 1,
2 and 3. It contains three different activities: stopped, walking and error. Results obtained with GA
and BFS can be seen in Table 4. The relative confusion matrix for the solution obtained by BFS is shown
in Table 5 and the relative confusion matrix for the solution obtained by GA is shown in Table 6. Relative
confusion matrices show the frequency of prediction errors per class. The HMM induced can be seen in
Figure 6.
`
Table 4. Features found by Best First Search (BFS) and Genetic Algorithm (GA) for
classifying three activities

Algorithm   Accuracy   Selection
BFS         0.8699     R_{Σ F̄,3}(t, I)_2, R_{Σ F̄,5}(t, F)_1, R_{Σ F̄,3}(t, F)_1, R_{Σ F̄,4}(t, F)_1, f̄_8(t, F)_1
GA          0.9046     ∂y(t)/∂t, f̄(t, F), R_{Σ F̄,4}(t, I)_1, R_{Σ F̄,3}(t, F)
`
The solution found by GA is better than the solution found by BFS according to the global accuracy.
Also, a look at the confusion matrices of both classifiers (Tables 5 and 6) reveals that the solution
found by GA is better than the solution found by BFS for the accuracy of all the classes to predict. In
both cases, the hardest class to predict is the error class, perhaps because it is the class with the fewest examples.
The solution obtained by the BFS algorithm includes different measures of the optical flow. The solution
obtained by GA includes the velocity along the y axis and different optical flow measures.
`
Table 5. Relative confusion matrix for the HMM generated using BFS to classify the
activities walking, stopped and error

class \ prediction   error    walking   stopped
error                0.5869   0.4161    0
walking              0.0505   0.8984    0.0511
stopped              0.0531   0.0796    0.8673
`
`13
`
`
`
`Algorithms 2009, 2
`
`295
`
`Figure 6. Structure of the HMM used in experiment 1
`
Table 6. Relative confusion matrix for the HMM generated using GA to classify the
activities walking, stopped and error

class \ prediction   error    walking   stopped
error                0.7521   0.0427    0.2051
walking              0.0238   0.9260    0.0502
stopped              0.0828   0.0251    0.8921
`
`14
`
`
`
`Algorithms 2009, 2
`
`4.3. Experiment II: six activities
`
`296
`
Two more sequences have been added to the set previously used: X and Y. Three new activities are
going to be classified: falling, rising, and lying down. Results obtained with GA and BFS can be seen in
Table 7.
`
Table 7. Features found by Best First Search (BFS) and Genetic Algorithm (GA) for
classifying six activities

Algorithm   Accuracy   Selection
BFS         0.7051     f̄_8(t, F); R_{v,6,1}(t), R_{v,6,2}(t), tr Σ_{F̄,25}(t)_2, R_{Σ s,6}(t)_2, R_{Σ F̄,3}(t, F)_3
GA          0.7779     hu1, R_{Σ s,4}(t)_2, R_{v,9,1}(t), R_{v,11,1}(t), R_{v,14,1}(t), f̄_17(t, F), tr Σ_{F̄,22}(t, F)_3
`
Table 8. Relative confusion matrix for the HMM generated using BFS to classify the
activities walking, stopped, falling, lying down, rising and error

class \ prediction   error    walking   stopped   falling   lying down   rising
error                0.2760   0.5082    0.0820    0.0984    0.0027       0.0328
walking              0.0098   0.8134    0.0225    0.0640    0.0060       0.0843
stopped              0.0050   0.1267    0.8092    0.0056    0.0195       0.0340
falling              0.0957   0.0319    0.0106    0.6596    0.2021       0
lying down           0.0687   0.0102    0.3163    0.0037    0.5793       0.0219
rising               0.1588   0.5829    0.0118    0         0.0024       0.2441
`
`Figure 7. Structure of the HMM used in experiment 2
`
`Again, the solution found by GA is better than the solution found by BFS if the global accuracy of
`the HMM i