Ji et al.

(10) Patent No.: US 8,081,209 B2
(45) Date of Patent: Dec. 20, 2011
`
(54) METHOD AND SYSTEM OF SPARSE CODE BASED OBJECT CLASSIFICATION WITH SENSOR FUSION

(75) Inventors: Zhengping Ji, Lansing, MI (US); Danil V. Prokhorov, Canton, MI (US)

(73) Assignee: Toyota Motor Engineering & Manufacturing North America, Inc., Erlanger, KY (US)
`
`
(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b).
`
OTHER PUBLICATIONS

Optimal In-Place Learning and the Lobe Component Analysis; Juyang Weng and Nan Zhang.
`
(21) Appl. No.: 12/147,001

(22) Filed: Jun. 26, 2008

(65) Prior Publication Data

    US 2009/0322871 A1    Dec. 31, 2009
`
(51) Int. Cl.
    H04N 7/18     (2006.01)
    G06F 15/16    (2006.01)
(52) U.S. Cl. ........................................ 348/115
(58) Field of Classification Search .......... 348/113-118
    See application file for complete search history.

(56) References Cited
`
U.S. PATENT DOCUMENTS

5,963,653 A       10/1999   McNary et al.
6,834,232 B1      12/2004   Malhotra
6,856,873 B2       2/2005   Breed et al.
6,862,537 B2       3/2005   Skrbina et al.
6,885,968 B2       4/2005   Breed et al.
6,889,171 B2       5/2005   Skrbina et al.
7,049,945 B2       5/2006   Breed et al.
7,069,130 B2       6/2006   Yopp
7,085,637 B2       8/2006   Breed et al.
7,110,880 B2       9/2006   Breed et al.
7,202,776 B2       4/2007   Breed
7,209,221 B2       4/2007   Breed et al.
7,343,265 B2 *     3/2008   Andarawis et al. ......... 702/188
7,359,782 B2 *     4/2008   Breed ...................... 701/45
2005/0038573 A1    2/2005   Goudy
2005/0273212 A1   12/2005   Hougen
2007/0064242 A1 *  3/2007   Childers ................... 356/601
2007/0233353 A1   10/2007   Kade
2007/0282506 A1   12/2007   Breed et al.
2008/0046150 A1 *  2/2008   Breed ...................... 701/45
2010/0148940 A1 *  6/2010   Gelvin et al. .............. 340/286.02

* Cited by examiner
`
Primary Examiner - Zami Maung
(74) Attorney, Agent, or Firm - Gifford, Krass, Sprinkle, Anderson & Citkowski, P.C.

(57) ABSTRACT

A system and method for object classification based upon the fusion of a radar system and a natural imaging device using sparse code representation. The radar system provides a means of detecting the presence of an object within a predetermined path of a vehicle. Detected objects are then fused with the image gathered by the camera and then isolated in an attention window. The attention window is then transformed into a sparse code representation of the object. The sparse code representation is then compared with known sparse code representations of various objects. Each known sparse code representation is given a predetermined variance, and subsequent sparse-code-represented objects falling within said variance will be classified as such. The system and method also include an associative learning algorithm wherein classified sparse code representations are stored and used to help classify subsequent sparse code representations.

4 Claims, 5 Drawing Sheets
`
`
`
[Representative drawing: the network layers - Layer 0 (Sparse Representation), Layer 1 (Recognition), Layer 2 (Association), and Layer 3 (Classes), with the class outputs "Vehicle" and "Non-Vehicle".]
`
`
`
`
[Representative drawing (FIG. 1): radar(s) 14, with Kalman filtering producing a target point 20; a camera with camera calibration; projection; window extraction of the image; sparse coding; and the MILN.]
`
`
[Sheet 1, FIGS. 1-2: the operational architecture of the system 10, and the system development interface showing the camera image with attention windows 32, a top-down view of the radar returns (scale in meters), and controls for the image number, grid display, and training process.]
`
`
`
`
[Sheet 2, FIGS. 3-4: FIG. 3 shows the projection from the world-center coordinate system (X, Y, Z) to the camera-center coordinate system (Xc, Yc, Oc), mapping the maximum width and height of a vehicle to the projected size on the image plane 28; FIG. 4 shows the generation of orientation-selective filters 26 from randomly selected subpatches of natural images.]
`
`
`
`
[Sheet 3, FIGS. 5-6: window images (56x56) and their transformation into the sparse representation (36x431).]
`
`
`
`
[Sheet 4, FIG. 7: the placement of neurons on Layer 0 (Sparse Representation), Layer 1 (Recognition), Layer 2 (Association), and Layer 3 (Classes), with the class outputs "Vehicle" and "Non-Vehicle".]
`
`
`
`
[Sheet 5, FIG. 8: the method 12 for object classification.

Step 1: Create an attention window for each object detected that is associated with a radar return.
Step 2: Generate an orientation-selective filter of the image of each attention window.
Step 3: Transform each orientation-selective filter into a sparse code representation.
Step 4: Classify each sparse code representation.
Step 5: Store each classified sparse code representation in a database.
Step 6: Compare each stored classified sparse code representation with subsequent sparse code representations of images of objects captured in said attention windows to classify said subsequent images.]
`
`
`
`METHOD AND SYSTEM OF SPARSE CODE
`BASED OBJECT CLASSIFICATION WITH
`SENSOR FUSION
`
`BACKGROUND OF THE INVENTION
`
`1. Field of the Invention
A system and method for object classification based upon the fusion of a radar system and a natural imaging device, and a sparse code representation of an identified object.

2. Description of the Prior Art

Sensor fusion object classification systems are known and well documented. Such systems gather information from an active sensor and a passive sensor and associate the two sets of data to provide the user with information relating to the data, such as whether a detected object is a vehicle or a non-vehicle. Such an association is commonly referred to as fusion and is referred to as such herein. In operation, fusion relates the natural image captured by the passive sensor to the detection of an object by the active sensor. Specifically, an active sensor such as a radar system will be paired with a passive sensor such as a video camera, and the objects detected by the radar will be mapped to the video image taken by the video camera. The fusion of such data may be done using algorithms which map the radar return to the video image. The fused data may then be further processed for relevant information, such as object detection and classification, using some form of visual graphic imaging interpretation. However, visual graphic imaging interpretation requires sufficient memory to store the visual graphic data and sufficient processing speed to interpret the visual data in a timely manner. For example, U.S. Pat. No. 6,834,232 to Malhotra teaches the use of a multiple-sensor data fusion architecture to reduce the amount of image processing by processing only selected areas of an image frame, as determined in response to information from electromagnetic sensors. Each selected area is given a centroid, and the center of reflection for each detected object is identified. A set of vectors is determined between the centers of reflection and the centroid. The differences between the centers of reflection and the centroids are used to classify objects. However, Malhotra does not teach the use of orientation-selective filters for object classification.

U.S. Pat. No. 6,889,171 to Skrbina et al. discloses a system fusing radar returns with visual camera imaging to obtain environmental information associated with the vehicle, such as object classification. Specifically, a radar is paired with a camera, and the information received from each is time-tagged and fused to provide the user with data relating to object classification, relative velocity, and the like. This system requires the data to be processed through an elaborate and complicated algorithm and thus requires a processor with the ability to process a tremendous amount of data in a relatively short period of time in order to provide the user with usable data.

U.S. Pat. No. 7,209,221 to Breed et al. discloses a method of obtaining information regarding a vehicle blind spot using an infrared emitting device. Specifically, the method uses a trained pattern recognition technique or a neural network to identify a detected object. However, Breed et al. is dependent upon the trained pattern recognition technique, whereby the number of patterns and processes may place a huge burden on the system.

Accordingly, it is desirable to have a system for object classification which does not require the processing capabilities of the prior art and which can refine and improve its classification over time. One form of object recognition and classification is known as sparse code representation. It is understood that sparse coding is how the human visual system efficiently codes images. This form of object recognition produces a limited response to any given stimulus, thereby reducing the processing requirements of the prior art systems. Furthermore, the use of sparse code recognition allows the system to be integrated with current systems having embedded radar and camera fusion capabilities.
`
SUMMARY OF THE INVENTION AND ADVANTAGES

A system and method for object classification based upon the fusion of a radar system and a natural imaging device using sparse code representation is provided. Specifically, a radar system is paired with a camera. The radar system provides a means of detecting the presence of an object within a predetermined path of a vehicle. Objects identified within this predetermined path are then fused with the image gathered by the camera, and the detected image is provided to the user in separate windows, referred to hereafter as "attention windows." These attention windows are the natural camera image of the environment surrounding the radar return. The attention window is then transformed into a sparse code representation of the object. The sparse code representation is then compared with known sparse code representations of various objects. Each known sparse code representation is given a predetermined variance, and subsequent sparse code representations falling within said variance will be classified as such. For instance, a known sparse code representation having a value of Y will be given a variance of Y +/- V, and subsequent sparse code representations of an image of an object falling within Y +/- V will be classified as the object with Y. The system and method also include an associative learning algorithm wherein classified sparse code representations are stored and used to help classify subsequent sparse code representations, thus providing the system with the capabilities of associative learning.
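For illustration only, the Y +/- V variance gate described above can be sketched in Python as follows; the example codes, labels, and tolerance V are hypothetical values and form no part of the claimed invention:

import numpy as np

# Illustrative sketch of the Y +/- V variance gate described above.
# The known codes, labels, and tolerance V below are hypothetical.
def classify_by_variance(code, known_codes, labels, V):
    # Return the label of the first known sparse code Y such that the
    # candidate code falls within Y +/- V (elementwise); None otherwise.
    for Y, label in zip(known_codes, labels):
        if np.all(np.abs(code - Y) <= V):
            return label
    return None

known = [np.array([0.9, 0.1, 0.4]), np.array([0.1, 0.8, 0.2])]
labels = ["vehicle", "non-vehicle"]
print(classify_by_variance(np.array([0.85, 0.15, 0.38]), known, labels, V=0.1))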
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
Other advantages of the present invention will be readily appreciated, as the same becomes better understood by reference to the following detailed description when considered in connection with the accompanying drawings, wherein:

FIG. 1 is a schematic of the operational architecture of the system;
FIG. 2 is a schematic of the system development interface;
FIG. 3 shows the projection from the actual vehicle size to the window size in the image plane;
FIG. 4 shows the generation of orientation-selective filters using Lobe Component Analysis;
FIG. 5 shows the arrangement of local receptive fields over the attention window;
FIG. 6 shows the transformation of the natural image into a sparse code representation;
FIG. 7 shows the placement of neurons on different levels in the network; and
FIG. 8 shows the method for object classification using sparse code representation.
`
`DETAILED DESCRIPTION OF THE INVENTION
`
With reference to FIG. 1, a system 10 and method 12 for object classification based upon the fusion of a radar system 14 and a natural imaging device 16 using sparse code representation 18 is provided. Specifically, the system 10 fuses the radar return 20 with the associated camera imaging 22, and the
fused data 24 is then transformed into a sparse code representation 18 and classified. The camera image of the radar return 20 is isolated, extracted, and filtered with the help of orientation-selective filters 26. Processing of the local receptive fields embedded onto the extracted image by the orientation-selective filters 26 results in a sparse code for the image. The sparse code representation 18 of the object on the image is compared to known sparse code representations 18 of a particular object class. The processing and comparison of any subsequent sparse code representations 18 of images is done in an associative learning framework.

Thus, the video image of the object detected by the radar system 14 is transformed to a sparse code representation 18, wherein the sparse code representation 18 of the detected object is used for higher-level processing such as recognition or categorization. A variety of experimental and theoretical studies indicate that the human visual system efficiently codes retinal activation using sparse coding; thus, for any given stimulus there are only a few active responses. Accordingly, a sparse coding representation is more independent than other forms of imaging data. This feature allows object learning and associative learning using sparse code representation 18 to become a compositional problem and improves the memory capacity of associative memories. For illustrative purposes, the system 10 described will classify an object as being either a vehicle or a non-vehicle. However, it is anticipated that the object classification system 10 is capable of making other sub-classifications. For example, if an object is classified as a non-vehicle, the system 10 may further classify the object as being a human, or an animal such as a deer; likewise, if the system 10 classifies an object as being a vehicle, the system 10 can further classify the object as being a motorcycle, a bike, or an SUV.

In the first preferred embodiment of the system 10 for object classification, a radar system 14 is paired with a video camera for use in an automobile. Thus, the specifications and characteristics of the radar system 14 must be suitable for detecting objects within the customary driving environment of an automobile. The specifications of a suitable radar system 14 include a range between 2 and 150 m, with a tolerance of either +/-5% or +/-1.0 m; an angle of at least 15 degrees, with a tolerance of either +/-0.3 degrees or +/-0.1 m; and a speed of +/-56 m/s, with a tolerance of +/-0.75 m/s. An example of a radar system 14 having the qualities described above is the F10 mm-wave radar. The characteristics and specifications of the video camera paired with such a radar system 14 include a refresh rate of 15 Hz, a field of view of 45 degrees, and a resolution of 320x240 pixels. An example of a video camera system 16 having the qualities described above is a Mobileye camera system.

In operation, the radar system 14 provides a return for objects detected within 15 degrees and out to 150 meters of the path of the vehicle. The radar return 20 can be processed temporally to determine the relative speed of each object. As the object classification system 10 is directed at classifying objects within the path of the vehicle, radar returns 20 outside of a predetermined parameter will be discarded. This has the benefit of reducing the subsequent computational load of the system 10, thereby increasing the efficiency of said system 10. For instance, radar returns 20 more than eight meters to the right or left of the vehicle may be discarded, as the object is considered out of the vehicle's path. However, it is understood that the parameters for discarding radar returns 20 disclosed above are for illustrative purposes only and are not limiting to the disclosure presented herein.
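For illustration only, such a gating step may be sketched as follows; the tuple layout of a radar return 20 is an assumption, and the eight-meter bound is the illustrative figure given above:

# Discard radar returns 20 whose lateral offset exceeds the illustrative
# 8 m bound; each return is assumed to be (x_lateral_m, range_m, speed_mps).
def gate_returns(returns, max_lateral_m=8.0):
    return [r for r in returns if abs(r[0]) <= max_lateral_m]

returns = [(1.2, 45.0, -3.1), (9.5, 60.0, 0.0), (-7.8, 120.0, 2.4)]
print(gate_returns(returns))  # keeps the first and third returns only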
The radar returns 20 are fused with the real-time video image provided by the video camera system 16. Specifically, the radar returns 20 are projected onto an image system reference plane using a perspective mapping transformation 28. Thus, a processor 30 is provided for fusing the radar return 20 with the real-time visual of the natural imaging device 16 using an algorithm which performs the necessary perspective mapping transformation 28. The perspective mapping transformation 28 is performed using calibration data that contain the intrinsic and extrinsic parameters of each camera. An example of such a perspective mapping transformation 28 can be found in the aforementioned U.S. Pat. No. 6,889,171 to Skrbina et al.
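The transformation itself is not set forth herein; the following Python sketch shows a generic pinhole-camera projection under assumed calibration data, where the intrinsic matrix K and the extrinsic rotation R and translation t are illustrative values rather than parameters of the disclosed system:

import numpy as np

# Project a radar return given in world coordinates onto the image plane
# using assumed calibration data: intrinsics K, extrinsics R and t.
def project_to_image(p_world, K, R, t):
    p_cam = R @ p_world + t        # world frame -> camera frame
    uvw = K @ p_cam                # camera frame -> homogeneous pixel coords
    return uvw[:2] / uvw[2]        # perspective divide -> (u, v)

K = np.array([[320.0, 0.0, 160.0],    # illustrative focal lengths and
              [0.0, 320.0, 120.0],    # principal point for a 320x240 image
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.array([0.0, 1.2, 0.0])    # assumed camera pose
print(project_to_image(np.array([2.0, 0.0, 40.0]), K, R, t))  # ~(176.0, 129.6)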
With reference to FIG. 2, attention windows 32 are created for each radar return 20. The attention windows 32 are the video images of the radar return 20 and will be given a predetermined size in which to display the detected object. Specifically, as each radar return 20 is fused with the video camera image, the environment surrounding the radar return 20 is shown in an isolated view referred to as an attention window 32. The video image of each attention window 32 correlates to the expected height and width of the detected object. Accordingly, if the system 10 is concerned with only classifying the object as either a vehicle or a non-vehicle, the video image presented in each attention window 32 is designed to the expected height of a vehicle. However, as stated above, the system 10 can be used to provide a broad range of classifications, and thus the dimensions of the video images contained within the attention window 32 will be tailored accordingly.

With reference to FIG. 3, the attention windows 32 provide an extracted image of the detected object from the video camera. As the video images are designed to capture the predetermined area surrounding the object as described above, each video image may differ in size from one another depending upon the distance between the detected object and the vehicle. These images are normalized within the attention window 32 to prevent the video images from being deformed. FIG. 2 shows normalized images within their respective attention windows 32, wherein the pixel intensities of each attention window 32 unoccupied by an image are set to zero.
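A sketch of this normalization is given below for illustration, assuming the fixed 56x56 window dimension indicated by FIG. 6 and zero intensity for unoccupied pixels; nearest-neighbor resampling is used only to keep the sketch self-contained:

import numpy as np

# Scale an extracted image to fit a 56x56 attention window 32 without
# deforming it; pixels unoccupied by the image are set to zero intensity.
def normalize_window(img, size=56):
    h, w = img.shape
    s = size / max(h, w)                          # uniform scale, no distortion
    nh, nw = max(1, round(h * s)), max(1, round(w * s))
    rows = (np.arange(nh) / s).astype(int).clip(0, h - 1)
    cols = (np.arange(nw) / s).astype(int).clip(0, w - 1)
    resized = img[np.ix_(rows, cols)]             # nearest-neighbor resample
    window = np.zeros((size, size), dtype=img.dtype)
    r0, c0 = (size - nh) // 2, (size - nw) // 2   # center the resized image
    window[r0:r0 + nh, c0:c0 + nw] = resized
    return window

print(normalize_window(np.ones((30, 90))).shape)  # (56, 56)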
With reference now to FIG. 4, the generation of orientation-selective filters 26 is provided. In the preferred embodiment, the generation of orientation-selective filters 26 is achieved using Lobe Component Analysis (hereafter "LCA"). Specifically, neuron layers are developed from the natural images using LCA, wherein the neurons with update totals less than a predetermined amount are discarded and the neurons with an update total of a predetermined amount or more (winning neurons) are retained. The retained (winning) neurons are then used as orientation-selective filters 26.

As shown in FIG. 6, a pool of orientation-selective filters 26 may be developed to provide a specimen against which subsequent classification of objects may be made. The developed filter pool is comprised of winning neurons as derived from the video image of an example of an object as captured within each attention window 32. For instance, the developed filter pool may consist of winning neurons from hundreds of natural images that have been processed through the LCA. The developed filter pool is then embedded into the system 10 to provide the system 10 with a basis for the subsequent generation of sparse codes for object classification.

Before images are processed by the LCA (FIG. 4), they are whitened. Whitening of images is a well-known statistical preprocessing procedure which is intended to decorrelate images. In the preferred embodiment, whitening is achieved by selecting a predetermined area x of each attention window 32 from a predetermined number N of random locations (the randomly selected patches in FIG. 4), where the predetermined area x is further defined by an area of pixels d, and
d = P_length × P_width. The attention window 32 is thereby related in matrix form, whereby x is the attention window 32 and x = {x_1, x_2, . . . , x_N}. Each x_i is further represented in vector form as x_i = (x_i,1, x_i,2, . . . , x_i,d). The whitening matrix W is generated by taking the matrix of principal components V = {v_1, v_2, . . . , v_k} and dividing each by its standard deviation (the square root of the variance). The matrix D is a diagonal matrix whose element at row and column i is 1/√λ_i, where λ_i is the eigenvalue of v_i. Then, W = VD. To obtain a whitened attention window 32 Y, multiply the response vector(s) X by W: Y = WX = VDX. However, the image can be transformed into a sparse code representation 18 without whitening.
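For illustration, the construction W = VD may be sketched numerically as follows; the patch matrix X and the number k of retained components are assumed, and the whitened responses are formed as W-transpose times X so that their covariance is approximately the identity:

import numpy as np

# Build the whitening matrix W = VD from the principal components of a
# set of image patches X (one patch per column), per the text above.
def whitening_matrix(X, k):
    Xc = X - X.mean(axis=1, keepdims=True)     # center the patches
    C = Xc @ Xc.T / (Xc.shape[1] - 1)          # pixel covariance matrix
    lam, V = np.linalg.eigh(C)                 # eigen-decomposition
    idx = np.argsort(lam)[::-1][:k]            # keep the top-k components
    lam, V = lam[idx], V[:, idx]
    D = np.diag(1.0 / np.sqrt(lam + 1e-8))     # D_ii = 1/sqrt(lambda_i); small
    return V @ D                               # epsilon guards tiny eigenvalues

X = np.random.rand(256, 1000)                  # 1000 patches of 16x16 = 256 pixels
W = whitening_matrix(X, k=50)
Y = W.T @ (X - X.mean(axis=1, keepdims=True))  # whitened responses
print(np.round(np.cov(Y)[:3, :3], 3))          # approximately the identity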
Upon obtaining the whitened attention window 32 Y, Y is further processed to develop neuron layers through the LCA algorithm. The LCA algorithm incrementally updates c corresponding neurons, represented by the column vectors v_1(t), v_2(t), . . . , v_c(t), from the whitened input samples y(1), y(2), . . . , y(t), where each y(t) is a column vector extracted from the matrix Y. At time t, the output of the layer is the response vector z(t) = (z_1(t), z_2(t), . . . , z_c(t)). The LCA algorithm z(t) = LCA(y(t)) is as follows:
`
ALGORITHM 1

Lobe Component Analysis

1: Sequentially initialize c cells using the first c observations:
   v_i(t) = y(t), and set the cell-update age n(t) = 1, for t = 1, 2, . . . , c.
2: for t = c + 1, c + 2, . . . do
3:   Compute the output (response) for all neurons:
4:   for 1 ≤ i ≤ c do
5:     Compute the response:
         z_i(t) = g_i( y(t) · v_i(t - 1) / ( ||v_i(t - 1)|| ||y(t)|| ) ),
       where g_i is a sigmoidal function.
6:     Simulating lateral inhibition, decide the winner:
         j = arg max_{1 ≤ i ≤ c} { z_i(t) }.
7:     Update only the winner neuron v_j using its temporally scheduled
       plasticity:
         v_j(t) = w_1 v_j(t - 1) + w_2 z_j y(t),
       where the scheduled plasticity is determined by its two age-dependent
       weights:
         w_1 = ( n(j) - 1 - μ(n(j)) ) / n(j),   w_2 = ( 1 + μ(n(j)) ) / n(j),
       with w_1 + w_2 ≡ 1. Update the number of hits (cell age) n(j) only for
       the winner: n(j) ← n(j) + 1. Plasticity parameters t1 = 20, t2 = 200,
       c = 2, and r = 2000 are used in our implementation.
8:     All other neurons keep their ages and weights unchanged:
       for all 1 ≤ i ≤ c, i ≠ j, v_i(t) = v_i(t - 1).
     end for
   end for
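A compact Python sketch of Algorithm 1 follows for illustration. The amnesic function μ(n) is not written out above; the piecewise form used here, parameterized by t1, t2, c, and r, is assumed from the cited Weng and Zhang publication on Lobe Component Analysis:

import numpy as np

def mu(n, t1=20.0, t2=200.0, c=2.0, r=2000.0):
    # Assumed amnesic average function with the parameters listed in step 7.
    if n < t1:
        return 0.0
    if n < t2:
        return c * (n - t1) / (t2 - t1)
    return c + (n - t2) / r

def lca(Y, c_cells):
    # Algorithm 1: incrementally derive c lobe components (columns of V)
    # from whitened samples given as the columns of Y.
    d, T = Y.shape
    V = Y[:, :c_cells].copy()          # step 1: init cells from first c samples
    n = np.ones(c_cells)               # cell-update ages
    for t in range(c_cells, T):
        y = Y[:, t]
        pre = (y @ V) / (np.linalg.norm(V, axis=0) * np.linalg.norm(y) + 1e-12)
        z = 1.0 / (1.0 + np.exp(-pre))           # step 5: sigmoidal response
        j = int(np.argmax(z))                    # step 6: lateral inhibition
        w1 = (n[j] - 1.0 - mu(n[j])) / n[j]      # step 7: scheduled plasticity
        w2 = (1.0 + mu(n[j])) / n[j]             # w1 + w2 == 1
        V[:, j] = w1 * V[:, j] + w2 * z[j] * y   # update the winner only
        n[j] += 1.0                              # step 8: others unchanged
    return V, n

V, n = lca(np.random.randn(256, 5000), c_cells=100)
print(V.shape)   # neurons with the most wins would be kept as filters 26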
`
FIG. 4 shows the result when using LCA on neurons ("c") and whitened input samples ("n"). Each neuron's weight vector is identical in dimension to the input, and thus is able to be displayed in each grid as a row-and-column image by applying a de-whitening process, which is achieved by restoring the original input vector using the following equation: x = VD⁻¹x. In FIG. 4, the lobe component with the most wins is at the top left of the image grid, progressing through each row to the lobe component with the least wins, which is located at the bottom right. As stated before, the neurons having at least a predetermined amount of wins (update totals) were kept because said neurons showed a localized orientation pattern. Accordingly, the retained winning neurons are designated as orientation-selective filters 26. Thus, as real-world image samples are collected and processed through the above-described LCA, the winning neurons are kept, designated as orientation-selective filters 26, and used for further processing. The benefit of using orientation-selective filters 26 is that said orientation-selective filters 26 eliminate reliance upon problem-domain experts and traditional image processing methods 12.
With reference to FIG. 5, square receptive fields are applied over the entire attention window 32 such that some of the receptive fields overlap each other. Each receptive field has a predetermined area defined by its pixel length and width. In the preferred embodiment, these receptive fields are 16x16 pixels in the attention window 32 image plane. FIG. 5 shows each receptive field staggered from the others by 8 pixels, in an overlapping relationship. In this manner, a plurality of local receptive fields are held within the attention window 32 plane.
With reference to FIG. 6, the orientation-selective filters 26 may be transformed into a sparse code representation 18 S by taking the set of orientation-selective filters 26 F, transposed, and multiplying it by a non-negative matrix L, where L represents the local receptive fields, as shown in the following equation: S = F^T × L. Each receptive field is represented by a stacked vector having a correlating pixel length and width. Accordingly, in the preferred embodiment, the stacked vector dimension is 256 (16×16). Thus, a non-negative matrix L is provided, where L = {l_1, l_2, . . . , l_m} and m is the number of local receptive fields. The set of orientation-selective filters 26 is represented as F = {f_1, f_2, . . . , f_c'}, where c' represents the number of neurons retained from the win selection process, and each f_i is a non-negative orientation-selective filter 26 from neuron i represented as f_i = (f_i,1, f_i,2, . . . , f_i,d), where d = 16×16.
As represented by the equation for S above, the orientation-selective filters 26 are multiplied by the local receptive fields to generate the sparse representation of the image. The matrix S is non-negative and, by operation of the formula S = F^T × L, is reshaped into a vector having a total dimension of c' × m, where c' represents the total number of neurons kept from the winning process of FIG. 4 and m represents the total number of local receptive fields. Thus, the matrix S maps the raw pixel representation of the attention window 32 to a higher-dimensional, sparsely encoded space, which leads to a sparse code representation 18 of the input as opposed to the standard pixel representation of the image. The sparse code representation 18 has the advantage of achieving better efficiency, in both recognition rate and learning speed, than classification methods 12 based upon the original pixel input or image-based classification systems 10 using image recognition.
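The mapping S = F^T × L may be sketched as follows for illustration, using the dimensions recited above (431 retained filters and 36 receptive fields of 16×16 pixels over a 56×56 window); the 8-pixel stride is inferred from those dimensions, and the random matrix F is a stand-in for the learned filter pool:

import numpy as np

# Map a 56x56 attention window 32 to its sparse representation S = F^T x L,
# where L stacks the 16x16 local receptive fields as 256-dim columns and
# F holds the retained non-negative orientation-selective filters 26.
def sparse_code(window, F, patch=16, stride=8):
    size = window.shape[0]
    cols = []
    for r in range(0, size - patch + 1, stride):
        for c in range(0, size - patch + 1, stride):
            cols.append(window[r:r + patch, c:c + patch].reshape(-1))
    L = np.stack(cols, axis=1)    # 256 x m, with m = 36 for a 56x56 window
    S = F.T @ L                   # c' x m matrix of filter responses
    return S.reshape(-1)          # reshaped to a (c' * m)-dimensional vector

F = np.abs(np.random.randn(256, 431))     # stand-in for 431 learned filters
s = sparse_code(np.random.rand(56, 56), F)
print(s.shape)                            # (15516,) = 431 * 36 dimensions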
An associative learning framework, such as a Multi-layer In-place Learning Network (MILN), a Neural Network (NN), Incremental Support Vector Machines (I-SVM), or Incremental Hierarchical Discriminant Regression (IHDR), is then used to classify the sparse code representation 18 of the input. The associative learning framework utilizes both incremental and online learning methods 12 to recognize and classify inputs. Accordingly, the use of a developed filter pool enhances the
system 10 due to the various known properties of objects provided. However, the advantage of the associative learning framework is that the system 10 can make subsequent object classifications based upon a developed filter pool having only one sparse code representation 18 of each object within a particular classification. For example, where the system 10 only makes two object classifications, vehicle or non-vehicle, a sparse code representation 18 of a vehicle and of a non-vehicle will allow the system 10 to make subsequent object classifications based upon the two known sparse code representations. In the preferred embodiment, the MILN is used, as it provides better performance.

A comparative study of four different classification methods was conducted, where each method was used in an open-ended autonomous development setting in which an efficient (memory-controlled), real-time (incremental and timely), and extendable (the number of classes can increase) architecture is desired. Tables showing the results of the study are provided for reference. The four classification methods are: K-Nearest Neighbor ("NN"), with k = 1 and using an L1 distance metric, for baseline performance; Incremental SVM ("I-SVM"); Incremental Hierarchical Discriminant Regression ("IHDR"); and the MILN as discussed herein. A linear kernel for I-SVM was used for the high-dimensional problems. A summary of the results of the comparative study using two different input types is provided in Tables 1 and 2: non-transformed "pixel" space, having input dimensions of 56×56; and sparse-coded space, having dimensions of 36×431, produced by the MILN Layer 1. Table 1 shows the average performance of the learning methods over 10-fold cross validation for pixel inputs, and Table 2 shows the average performance of the learning methods over 10-fold cross validation for sparse coded inputs.

The study shows that the Nearest Neighbor method performed fairly well but was prohibitively slow. IHDR performed faster than NN, but required a lot of memory, as IHDR automatically develops an overlaying tree structure that organizes and clusters data. Furthermore, while IHDR does allow sample merging, it saves every training sample and thus does not use memory efficiently. I-SVM performed well with both types of input and uses the least memory, in terms of the number of support vectors automatically determined by the data, but its training time is the worst. Another potential problem with I-SVM is its lack of extendability to situations where the same data may later be expanded from the original two to more than two classes. As general-purpose regressors, IHDR and MILN are readily extendable. IHDR, however, uses too much memory and does not represent information as efficiently and selectively, as shown in Tables 1 and 2. The MILN allows the system to focus analysis on sub-parts of the attention window to improve generalization (e.g., recognizing a vehicle as a combination of a license plate, a rear window, and two tail lights). Table 2 shows that the overall accuracy of object classification is higher using MILN with sparse coded inputs.
`
`
TABLE 1

AVERAGE PERFORMANCE & COMPARISON OF LEARNING METHODS OVER 10-FOLD CROSS VALIDATION FOR PIXEL INPUTS

Learning   "Overall"        "Vehicle"        "Other Objects"   Training Time     Test Time          Final # Storage
Method     Accuracy         Accuracy         Accuracy          Per Sample        Per Sample         Elements

NN         93.89 ± 1.84%    94.32 ± 1.42%    93.38 ± 2.04%     N/A               432.0 ± 20.3 ms    621
I-SVM      94.38 ± 2.24%    97.08 ± 1.01%    92.10 ± 6.08%     134.3 ± 0.4 ms    2.2 ± 0.1 ms       44.5 ± 2.3
I-HDR      95.87 ± 1.02%    96.36 ± 0.74%    95.62 ± 2.84%     2.7 ± 0.4 ms      4.7 ± 0.6 ms       689
MILN       94.58 ± 2.34%    97.12 ± 1.60%    91.20 ± 5.31%     8.8 ± 0.6 ms      8.8 ± 0.6 ms       100
`
TABLE 2

AVERAGE PERFORMANCE & COMPARISON OF LEARNING METHODS OVER 10-FOLD CROSS VALIDATION FOR SPARSE CODED INPUTS

Learning   "Overall"        "Vehicle"        "Other Objects"   Training Time     Test Time           Final # Storage
Method     Accuracy         Accuracy         Accuracy          Per Sample        Per Sample          Elements

NN         94.32 ± 1.24%    95.43 ± 1.02%    91.28 ± 1.86%     N/A               2186.5 ± 52.5 ms    621
I-SVM      96.79 ± 1.17%    97.23 ± 1.20%    99.40 ± 3.27%     324.5 ± 22.4 ms   7.6 ± 0.3 ms        45.2 ± 2.6
I-HDR      96.54 ± 1.83%    96.79 ± 1.04%    96.31 ± 2.05%     12.2 ± 1.3 ms     21.5 ± 1.7 ms       689
MILN       97.14 ± 1.27%    97.93 ± 1.63%    95.46 ± 2.54%     109.1 ± 3.2 ms    42.6 ± 0.4 ms       100
`
`
The generated sparse code representations 18 are processed through the MILN. The advantage of a MILN is that there are no global rules for learning, such as the minimization of mean square error for a pre-collected set of inputs and outputs. Thus, with MILN, each neuron learns on its own, as a self-contained entity, using its own internal mechanisms. As shown in Algorithm 2, a top-down input is provided. Initially, the top-down input is an external signal that is labeled as being in a particular classification. Subsequent signals gathered from the system's sensors are then compared with the initial signal for classification. The mechanisms contained in the MILN, along with each neuron's stimulation, affect the neuron's features over time, thus enabling a learning process. As shown in FIG. 7, the MILN includes two layers to perform learning: a first layer which recognizes the sparse code representation 18, and a second layer which associates the recognized layer with the classes. Obviously, additional layers may be provided for further classifying the input in each class into sub-classes. In operation, the inputs are processed through the following algorithm, which classifies the sparse code representations 18 into predetermined categories:
`
`
`
ALGORITHM 2

MILN

1: For l = 1, . . . , L - 1, set the output at layer l at time t = 0 to be