Machine Learning for Detection and Diagnosis of Disease

Paul Sajda
Department of Biomedical Engineering, Columbia University, New York, NY 10027; email: ps629@columbia.edu

Annu. Rev. Biomed. Eng. 2006. 8:537–65

First published online as a Review in Advance on April 17, 2006

The Annual Review of Biomedical Engineering is online at bioeng.annualreviews.org

doi: 10.1146/annurev.bioeng.8.061505.095802

Copyright © 2006 by Annual Reviews. All rights reserved

1523-9829/06/0815-0537$20.00
Key Words
blind source separation, support vector machine, Bayesian network, medical imaging, computational biology
Abstract
Machine learning offers a principled approach for developing sophisticated, automatic, and objective algorithms for analysis of high-dimensional and multimodal biomedical data. This review focuses on several advances in the state of the art that have shown promise in improving detection, diagnosis, and therapeutic monitoring of disease. Key to this advancement has been the development of a more in-depth understanding and theoretical analysis of critical issues related to algorithmic construction and learning theory. These include trade-offs for maximizing generalization performance, use of physically realistic constraints, and incorporation of prior knowledge and uncertainty. The review describes recent developments in machine learning, focusing on supervised and unsupervised linear methods and Bayesian inference, which have made significant impact in the detection and diagnosis of disease in biomedicine. We describe the different methodologies and, for each, provide examples of their application to specific domains in biomedical diagnostics.
INTRODUCTION

Machine learning, a subdiscipline in the field of artificial intelligence (AI), focuses on algorithms capable of learning and/or adapting their structure (e.g., parameters) based on a set of observed data, with adaptation done by optimizing over an objective or cost function. Machine learning and statistical pattern recognition have been the subject of tremendous interest in the biomedical community because they offer promise for improving the sensitivity and/or specificity of detection and diagnosis of disease, while at the same time increasing the objectivity of the decision-making process. However, despite their early promise, these methodologies have so far achieved only limited clinical utility, perhaps the most notable exception being the use of such methods in mammographic screening (1, 2). The potential impact of, and need for, machine learning is perhaps greater than ever, given the dramatic increase in the medical data being collected, the new detection and diagnostic modalities being developed, the complexity of the data types, and the importance of multimodal analysis. In all of these cases, machine learning can provide new tools for interpreting the high-dimensional and complex datasets with which the clinician is confronted.
Much of the original excitement for the application of machine learning to biomedicine came from the development of artificial neural networks (ANNs) (e.g., see 3), which were often proclaimed to be "loosely" modeled after computation in the brain. Although in most cases such claims of brain-like computation were largely unjustified, one of the interesting properties of ANNs was that they were shown to be capable of approximating any arbitrary function through the process of learning (also called training) a set of parameters in a connected network of simple nonlinear units. Such an approach mapped well to many problems in medical image and signal analysis, and stood in contrast to medical expert systems such as Mycin (4) and INTERNIST (5), which were in fact very difficult and time consuming to construct and were based on a set of rules and prior knowledge. Problematic with ANNs, however, is the difficulty of understanding how such networks construct the desired function and thus how to interpret the results. Such methods are therefore often used as a "black box," with the ANN producing a mapping from input (e.g., medical data) to output (e.g., diagnosis) but without a clear understanding of the underlying mapping function. This can be particularly problematic in clinical medicine, where one must also consider merging the interpretation of the computer system with that of the clinician because, in most cases, computer analysis systems are seen as adjunctive.
As the field of machine learning has matured, greater effort has gone into developing a deeper understanding of the theoretical basis of the various algorithmic approaches. In fact, a major difference between machine learning and statistics is that machine learning is concerned with theoretical issues such as computational complexity, computability, and generalization, and is in many respects a marriage of applied mathematics and computer science.
An area in machine learning research receiving considerable attention is the further development and analysis of linear methods for supervised and unsupervised feature extraction and pattern classification. Linear methods are attractive in that their decision strategies are easier to analyze and interpret than those of nonlinear classification and regression functions, such as those constructed by ANNs. In addition, a linear model can often be shown to be consistent, at least to first order, with underlying physical processes such as image formation or signal acquisition. Finally, linear methods tend to be computationally efficient and can be trained online and in real time.

Biomarkers: anatomic, physiologic, biochemical, or molecular parameters associated with the presence and severity of specific disease states.
Particularly important for biomedical applications has been the development of methods for explicitly incorporating prior knowledge and uncertainty into the decision-making process. This has led to principled methods based on Bayesian inference, which are well suited to incorporating disparate sources of noisy measurements and uncertain prior knowledge into the diagnostic process.

This review describes recent developments in machine learning, focusing on supervised and unsupervised linear methods and Bayesian inference, which have made significant impact in the detection and diagnosis of disease in biomedicine. We describe the different methodologies and, for each, provide examples of their application to specific domains in biomedical diagnostics.
BLIND SOURCE SEPARATION

Two important roles for machine learning are (a) extraction of salient structure in the data that is more informative than the raw data itself (the feature extraction problem) and (b) inference of underlying organized class structure (the classification problem). Although strictly speaking the two are not easily separable into distinct problems, we treat them as such and describe the state of the art of linear methods for both. In this section we focus on unsupervised methods and their application to recovering clinically significant biomarkers.
Linear Mixing

There are many cases in which one is interested in separating, or factorizing, a set of observed data into two or more matrices. Standard methods for such factorization include singular value decomposition (SVD) and principal component analysis (PCA) (6). These methods have been shown to satisfy specific optimality criteria; PCA, for example, is optimal in terms of minimum reconstruction error under the constraint of orthogonal basis vectors. However, in many cases these criteria are not consistent with the underlying signal/image-formation process, and the resulting matrices have little physical relevance. More recently, several groups have developed methods for decomposing a data matrix into two matrices in which the underlying optimality criteria and constraints yield more physically meaningful results (7–14).

Assume a set of observations is the result of a linear combination of latent sources. Such linear mixing is quite common in signal and image acquisition/formation, at least to a first approximation, and is consistent with underlying physical mixing processes ranging from electroencephalography (15) to acoustics (16). Given X as a matrix of observations (M rows by N columns), the linear mixing equation is

X = AS,   (1)
where A is the set of mixing coefficients and S is a matrix of sources. Depending on the modality, the columns of X and S are the coordinate system in which the data is represented (i.e., time, space, wavelength, frequency, etc.). The challenge is to recover both A and S simultaneously given only the observations X. This problem is often termed blind source separation (BSS) because the underlying sources are not directly observed and the mixing matrix is not known. BSS methods have been applied to many fundamental problems in signal recovery and deconvolution (17). Most methods that have been developed attempt to learn an unmixing matrix W, which when applied to the data X yields an estimate of the underlying sources (up to a scaling and permutation),

Ŝ = WX.   (2)
Consider the case when one assumes the rows of S (i.e., the source vectors) are random variables that are statistically independent. This implies that the joint distribution of the sources factors,

P(s_1, …, s_L) = P(s_1)P(s_2)⋯P(s_L),   (3)

where L indicates the number of underlying sources (each s_i being a row of S) and P(·) is the probability density function. In most cases L is not known and represents a hyperparameter that must be set or inferred. BSS methods that exploit statistical independence in their optimality criteria are termed independent component analysis (ICA) (see 18 for a review). Several approaches have been developed to recover independent sources, the methods distinguished largely by the objective function they employ, e.g., maximum likelihood (19), maximum a posteriori (9), information maximization (20), entropy estimation (21), and mean-field methods (22). In the case of time series, or other types of ordered data, one can also exploit other statistical criteria, such as nonstationarity, and utilize simultaneous decorrelation (16, 23–25). Parra & Sajda (15) formulate the problem of BSS as one of solving a generalized eigenvalue problem, where one of the matrices is the covariance matrix of the observations and the other is chosen based on the underlying statistical assumptions on the sources. This view unifies various approaches to simultaneous decorrelation and ICA, together with PCA and supervised methods such as common spatial patterns (CSP) (26).
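
To make this generalized eigenvalue formulation concrete, the sketch below performs simultaneous decorrelation of zero-mean, time-ordered observations. Using a symmetrized time-lagged covariance as the second matrix is one particular choice, in the spirit of the simultaneous-decorrelation methods cited above; it is an assumption of this sketch, not a prescription from the review.

    import numpy as np
    from scipy.linalg import eigh

    def bss_generalized_eig(X, lag=1):
        """BSS as a generalized eigenvalue problem, R_lag W = R_0 W Lambda.

        X: (channels x samples) zero-mean, time-ordered observations.
        Returns the unmixing matrix W (rows are unmixing vectors) and
        the source estimates S_hat = W X (Equation 2).
        """
        T = X.shape[1]
        R0 = X @ X.T / T                              # covariance of the observations
        R1 = X[:, :-lag] @ X[:, lag:].T / (T - lag)   # time-lagged covariance
        R1 = 0.5 * (R1 + R1.T)                        # symmetrize
        _, V = eigh(R1, R0)                           # generalized eigenvectors
        W = V.T
        return W, W @ X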
The attractive property of these decomposition methods is that the recovered components often result in a natural basis for the data, in particular if one considers some general properties of natural signals. For example, the marginal statistics of many natural signals (or filtered versions of the signals) are highly non-Gaussian (27, 28). Since, by the central limit theorem, linear mixtures of non-Gaussian random variables will result in marginal statistics that are more closely Gaussian, recovering the independent components captures the generative or natural axes of the mixing process.
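
As a concrete illustration of ICA-based recovery, the sketch below mixes two synthetic non-Gaussian sources according to Equation 1 and unmixes them with scikit-learn's FastICA. The sources, the mixing matrix, and the library choice are illustrative assumptions, not taken from the review.

    import numpy as np
    from sklearn.decomposition import FastICA

    rng = np.random.default_rng(0)
    n_samples = 2000
    # Two non-Gaussian sources (rows of S): a uniform and a Laplacian signal.
    S = np.vstack([rng.uniform(-1, 1, n_samples),
                   rng.laplace(0, 1, n_samples)])
    A = np.array([[1.0, 0.6],
                  [0.4, 1.0]])        # mixing matrix
    X = A @ S                         # observations, X = AS (Equation 1)

    # FastICA expects samples in rows, hence the transposes.
    ica = FastICA(n_components=2, random_state=0)
    S_hat = ica.fit_transform(X.T).T  # estimates, up to scaling and permutation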
Nonnegative Matrix Factorization

One particularly useful method for factoring the data matrix X under very general and physically realistic constraints is the nonnegative matrix factorization (NMF) algorithm (7). The basic idea of the NMF algorithm is to construct a gradient descent over an objective function that optimizes A and S and, by appropriately choosing gradient step sizes, to convert an additive update into a multiplicative one. For example, assuming Gaussian noise, one can formulate the problem of recovering A and S in Equation 1 as a maximum likelihood estimation,

A_ML, S_ML = argmax_{A,S} p(X | A, S) = argmax_{A,S} (1/(√(2π)σ)) exp(−‖X − AS‖² / 2σ²),   (4)

subject to: A ≥ 0, S ≥ 0,

where σ is the standard deviation of the Gaussian noise and AS its mean. Maximizing the likelihood is equivalent to minimizing the negative log-likelihood, and Equation 4 can be written as

A_ML, S_ML = argmin_{A,S} (−log p(X | A, S)) = argmin_{A,S} ‖X − AS‖²,   (5)

subject to: A ≥ 0, S ≥ 0.

One can compute the gradients of the negative log-likelihood and construct additive update rules for A and S,

A_im ← A_im + δ_im [(XSᵀ)_im − (ASSᵀ)_im]
S_mλ ← S_mλ + η_mλ [(AᵀX)_mλ − (AᵀAS)_mλ].   (6)

Note that there are two free parameters, the step sizes of the updates. Lee & Seung (29) have shown that by appropriately choosing the step sizes, δ_im = A_im / (ASSᵀ)_im and η_mλ = S_mλ / (AᵀAS)_mλ, the additive update rules can be reformulated as multiplicative update rules, with X = AS being a fixed point. The multiplicative update rules for A and S therefore become

A_im ← A_im (XSᵀ)_im / (ASSᵀ)_im
S_mλ ← S_mλ (AᵀX)_mλ / (AᵀAS)_mλ,   (7)

where convergence of these update rules is guaranteed (29). By formulating the updates as multiplicative rules in Equation 7, we can ensure nonnegative A and S, given that both are initialized to be nonnegative and the observations X are nonnegative.
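
A minimal NumPy sketch of the multiplicative updates in Equation 7 is given below; the small constant eps added to each denominator to guard against division by zero is an implementation detail assumed here, not part of the update rules as published.

    import numpy as np

    def nmf(X, M, n_iter=500, eps=1e-9, seed=0):
        """Factor nonnegative X (N x L) as A (N x M) times S (M x L)
        using the Lee & Seung multiplicative updates (Equation 7)."""
        rng = np.random.default_rng(seed)
        N, L = X.shape
        A = rng.random((N, M))                    # nonnegative initialization
        S = rng.random((M, L))
        for _ in range(n_iter):
            A *= (X @ S.T) / (A @ S @ S.T + eps)  # Equation 7, update for A
            S *= (A.T @ X) / (A.T @ A @ S + eps)  # Equation 7, update for S
        return A, S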
An intuitive understanding of NMF can be developed via geometrical considerations. The manifold of possible solutions specified by the linear mixing equation and the nonnegativity constraints represents an M-dimensional polygonal cone spanned by the M rows of S. The nonnegativity constraints require that the row vectors of S, representing the edges of the cone, lie in the positive quadrant of the L-dimensional space, and that the points defined by the rows of the observations X fall within that polygonal cone. The aim of maximum likelihood is to find cone edge vectors that tightly envelope the observed data points. Figure 1 illustrates this interpretation, which is sometimes referred to as a conic encoder (30).

Figure 1
Geometrical interpretation of NMF. The axes represent two dimensions of the high-dimensional space of the observations. Spans of the recovered sources (s1 and s2) are shown as dashed magenta vectors. The recovered sources are constrained to lie in the positive hyper-quadrant and tightly envelope the observed data, forming a cone (pink region). Points that fall outside of the cone contribute to the error. An analogous picture can be drawn for the basis vectors A = {a_1 … a_M}.
The basic NMF algorithm has been modified in several ways, including adding a sparsity constraint on the sources (31), weighted NMF (32), and constrained NMF (11) (see below). The utility of the NMF algorithm for recovering physically meaningful sources has been demonstrated in a number of application domains, including image classification (33), document classification (34), and separation of audio streams (35), as well as in biomedical applications such as analysis of positron emission tomography (PET) (36) and microarray analysis of gene expression (37, 38). Below, we describe two examples, both using nuclear magnetic resonance (NMR) data, where such methods are able to recover signatures of disease and toxicity.
Recovering Spectral Signatures of Brain Cancer

In vivo magnetic resonance spectroscopic imaging (MRSI) allows noninvasive characterization and quantification of molecular markers of potentially high clinical utility for improving detection, identification, and treatment of a variety of diseases, most notably brain cancers (39). MRSI acquires high-frequency-resolution MR spectra across a volume of tissue with common nuclei, including 1H (proton), 13C (carbon), 19F (fluorine), and 31P (phosphorus). Machine learning approaches for integrating MRSI with structural MRI have been shown to have potential for improving the assessment of brain tumors (40).
In MRSI, each tissue type can be viewed as having a characteristic spectral profile related to its biochemical composition. In brain tumors, for example, 1H MRSI has shown that metabolites are heterogeneously distributed and that, in a given voxel, multiple metabolites and tissue types may be present (41). The observed spectra are therefore a combination of different constituent spectra. Because the signal measured in MRSI is the response to a coherent stimulation of the entire tissue, the amplitudes of different coherent resonators are additive. The overall gain with which a tissue type contributes is proportional to its abundance/concentration in each voxel. As a result, we can explain the observations using the linear mixing equation (Equation 1). Because we interpret A as abundance/concentration, we can assume the matrix to be nonnegative. In addition, because the constituent spectra S represent amplitudes of resonances, in theory the smallest resonance amplitude is zero, corresponding to the absence of resonance at a given band (ignoring cases of negative peaks, such as in J-modulation). Figure 2 illustrates the spectral unmixing problem.
Interpretation of MRSI data is challenging, specifically for traditional peak-quantifying techniques (42, 43): A typical dataset consists of hundreds of highly correlated spectra having low signal-to-noise ratio (SNR), with peaks that are numerous and overlapping. This has created the need for approaches that can analyze the entire dataset simultaneously, taking advantage of the relationships among the spectra to improve the quality of the analysis. Such approaches are particularly useful for spectra with low SNR because they utilize the collective power of the data. Several BSS approaches have been developed to simultaneously exploit the statistical structure of an MRSI dataset by factorizing Equation 1. For example, ICA (44), second-order blind identification (SOBI) (45), and Bayesian spectral decomposition (8) have all been applied to MRSI datasets to decompose observed spectra into interpretable components.

Figure 2
The spectral unmixing problem. Spectra from multiple voxels, for example from MRSI and represented in the rows of X, are simultaneously analyzed and decomposed into constituent spectra S and the corresponding intensity distributions A. The extracted constituent spectra are identified by comparing them to known spectra of individual molecules. In most cases, the number of rows in S, M, is much less than the number of rows, N, in X (M ≪ N), i.e., there is a dimensionality reduction in the decomposition. Unidentified spectral components are considered residual noise N. Their corresponding magnitudes quantify the modeling error, which can be directly compared to the modeling error of alternative parametric estimation procedures.

Figure 3
cNMF separation of 1H CSI human brain data into clinically significant biomarkers and their corresponding spatial distributions. (a) Spectrum indicative of normal brain tissue: low choline (CHO), high creatine (CR), and high N-acetyl-aspartate (NAA). (b) Spectrum indicating high-grade malignant tumor tissue: highly elevated CHO, low CR, almost no NAA, and LAC (lactic acid). (c) Spectrum indicating residual lipids.
Constrained NMF (cNMF) (11, 12) is a very efficient version of NMF for recovering biomarkers of brain cancer in MRSI. The algorithm enables nonnegative factorization even for noisy observations, which may result in observed spectra having negative values. cNMF includes a positivity constraint, forcing negative values in the recovered spectral sources and abundance/concentration distributions to be approximately zero. Figure 3 illustrates an example of spectral sources and their corresponding concentrations recovered using cNMF for 1H MRSI data from human brain. In this example, the method recovers biomarkers of high-grade malignant tumor as well as the spatial distribution of their concentration. One advantage over other decomposition approaches that have been used in NMR, for example those based on Monte Carlo sampling (8), is that cNMF is computationally efficient and can be used in near real time, while a patient is in the MR scanner.
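
A sketch of how such a positivity constraint can be grafted onto the multiplicative updates given earlier is shown below. Clipping negative entries to a small positive floor after each iteration is an assumption made here for illustration; the published cNMF algorithm (11, 12) may differ in detail.

    import numpy as np

    def cnmf(X, M, n_iter=500, floor=1e-9, seed=0):
        """cNMF-style sketch: X is noisy and may contain negative values."""
        rng = np.random.default_rng(seed)
        A = rng.random((X.shape[0], M))
        S = rng.random((M, X.shape[1]))
        for _ in range(n_iter):
            A *= (X @ S.T) / (A @ S @ S.T + floor)
            S *= (A.T @ X) / (A.T @ A @ S + floor)
            # Noise in X can push entries negative; force them back to
            # (approximately) zero, as the cNMF positivity constraint requires.
            np.maximum(A, floor, out=A)
            np.maximum(S, floor, out=S)
        return A, S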
Extraction of Metabolic Markers of Toxicity

Metabolomics [sometimes referred to as metabonomics (46)] quantitatively measures the dynamic metabolic response of living systems to pathophysiological stimuli or genetic modification. Metabolomic analysis of biofluids based on high-resolution MRS and chemometric methods is valuable in characterizing the biochemical response to toxicity (47). Interpretation of high-resolution 1H biofluid NMR spectral datasets is challenging, specifically for traditional peak-quantifying techniques: A typical dataset consists of at least tens of highly correlated spectra, with thousands of partially overlapping peaks arising from hundreds of endogenous molecules. This has created the need for approaches that can analyze the entire dataset simultaneously to discriminate between different combinations of metabolites, including their dynamic changes.
PCA is widely used for analyzing metabolomic NMR datasets (48, 49). Although it is a reasonable approach for preprocessing NMR datasets (50), the PCA decomposition does not lead to physically realizable spectral biomarkers. Physically realistic decompositions are useful not only for visualization but also for classification of metabolic patterns using machine learning and domain knowledge (51).
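
For comparison with the NMF sketch given earlier, a PCA decomposition of a spectra matrix takes only a few lines; the random matrix standing in for preprocessed NMR spectra is, of course, illustrative.

    import numpy as np
    from sklearn.decomposition import PCA

    # Stand-in for an (n_spectra x n_frequency_bins) matrix of preprocessed spectra.
    X = np.abs(np.random.default_rng(0).normal(size=(60, 512)))

    pca = PCA(n_components=4)
    scores = pca.fit_transform(X)   # per-spectrum coefficients of the first four PCs
    components = pca.components_    # orthogonal components; may contain negative peaks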
Figure 4 illustrates NMF applied to 1H NMR spectra of urine from Han Wistar rats in a hydrazine toxicity experiment. Samples were collected from control rats and from rats treated with three different doses of hydrazine (75, 90, and 120 mg/kg) over a period of 150 h (52). Preprocessing, including normalization of the data, has been described elsewhere (53). The NMF algorithm requires about 300 s (Intel Pentium 4, 1.2 GHz) to obtain the recovered spectral sources, orders of magnitude faster than other decomposition methods yielding similar results (53). The magnitudes in each dose group, as a function of time, are shown in Figure 5a, with the identified spectral patterns in Figure 4a. NMF was run 100 times (100 independent initializations), with Figure 4a showing the mean results (solid lines) and the variation across runs (dashed lines). The small variance demonstrates the robustness and fidelity of NMF in spectral pattern recovery.
The association of the four spectral patterns with the hydrazine treatment is clear. In control rats, the first (filled diamonds) and second (filled upright triangles) spectral sources maintain an almost constant high level, while the third (inverted triangles) and fourth (open circles) are very low. Thus, the first spectral source (Krebs cycle intermediates: citrate and succinate) and second spectral source (2-oxoglutarate) are related to the normal patterns, while the third and fourth (2-aminoadipic acid, taurine, and creatine) are related to hydrazine. Indeed, in the treated animals, the normal patterns decrease in response to hydrazine and recover after 36 h, while the other two exhibit reciprocal behavior during the course of the experiment. The data from the 120 mg/kg dose indicate no sign of recovery at 56 h, at which point the animal was sacrificed.
A visual comparison of the spectral sources recovered using NMF with the first principal components recovered using PCA is shown in Figure 4a,b. The PCA components do not represent physically realizable spectra and do not appear to be biomarkers of the metabolic status of the animals. This is further illustrated by applying K-means clustering (54) to the amplitudes in the matrix A to classify the metabolic status (normal versus abnormal) of the rats as a function of time. The results of clustering the samples into two clusters, normal and abnormal, using the cNMF components are shown in Figure 5b (top), from which we can see that the control rats are clearly separated from those that were treated. Both the initial measurements (0 h), taken prior to hydrazine administration, and the later data points (after 104 h) for the treated rats are correctly assigned to the normal cluster. These samples have NMR spectra very similar to those from untreated animals and in fact correspond to time points at which the manifest toxic effect of hydrazine has been nearly eliminated by biologic recovery. Figure 5b (bottom) shows the classification results using the coefficients of the first four PCs. These results are clearly less accurate than those in Figure 5b (top) because some of the time points for the control rats are classified into the abnormal group. We see that a source recovery method that imposes physically realistic constraints improves classification because it connects the recovered sources, quantitatively, with the biological end-point measurements. The approach shows promise for understanding complex metabolic responses to disease, pharmaceuticals, and toxins.

Figure 4
(a) Spectral sources, recovered using NMF, indicative of biomarkers for normal metabolic function (blue) and hydrazine toxicity (red). Solid lines are the mean results and the dashed lines are the mean ±2σ. (b) Components recovered using PCA. Note that the patterns are not physically realizable spectra because they have negative peaks.

Figure 5
(a) Time-dependent concentration of the spectral biomarkers recovered using NMF, shown for the control and the 75, 90, and 120 mg/kg dose groups over the course of the experiment. The filled diamonds and filled upright triangles are associated with the normal patterns (blue), and the inverted triangles and open circles are associated with the aberrant patterns (red); symbols correspond to the biomarkers in Figure 4a. Analysis of the time-dependent concentrations shows the effect of, and in most cases (except the 120 mg/kg dose) recovery from, the hydrazine. (b) K-means cluster analysis applied to the amplitudes of the NMF patterns and to the first four principal components. (Top) Concentration profiles recovered via NMF enable correct clustering into normal and abnormal classes. The samples corresponding to the control rats and those collected before hydrazine administration, as well as more than 104 h after hydrazine administration for the treated rats, are assigned to the normal cluster; the remaining samples collected in the experiment are correctly assigned to the abnormal cluster. (Bottom) K-means clustering on the first four principal components. Classification is less accurate than with the NMF-recovered biomarkers, as is evident, for example, from the misclassification of some of the time points for the controls.
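
A hedged sketch of this clustering step is shown below; the matrix A of biomarker concentrations would in practice come from the NMF factorization X ≈ AS, and the random stand-in here is purely illustrative.

    import numpy as np
    from sklearn.cluster import KMeans

    # Stand-in for the (n_samples x M) concentration matrix A, one row per spectrum.
    A = np.abs(np.random.default_rng(0).normal(size=(120, 4)))

    # Two clusters: normal versus abnormal metabolic status.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(A)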
SUPPORT VECTOR MACHINES

The unsupervised learning decompositions discussed in the previous section can be considered methods for constructing descriptive representations of the observed data. An alternative is to construct discriminative representations using supervised learning, namely representations that are constructed to maximize the difference between underlying classes in the data. The most common is linear discrimination. The linear discriminant function can be defined as

f(x) = wᵀx + w₀,   (8)

and can be seen as defining a hyperplane that maps from the space of the data Dⁿ to a space of classes Cᵐ, where in most cases m ≪ n. In binary classification, m = 1, and classification is typically done by taking the sign of f(x). An observation x is mapped into the space of (binary) classes via the weight vector w and the bias w₀. The bias can be absorbed into the weight vector, in which case the result is termed an augmented weight vector (54). The challenge is to learn, using supervised methods, a weight vector and bias that minimize classification error and, more specifically, maximize generalization performance. An illustration of a discriminant function is given in Figure 6a. There are potentially many ways to place a discrimination boundary, i.e., many values of the weights and bias will minimize the classification error. The question thus becomes "Which boundary is the best?" Support vector machines directly address this question.
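
A small sketch of Equation 8 and of the augmented-weight-vector trick follows; the data and weights are arbitrary illustrative values.

    import numpy as np

    def predict(X, w, w0):
        """Equation 8: classify by the sign of f(x) = w'x + w0."""
        return np.sign(X @ w + w0)

    def predict_augmented(X, w_aug):
        """The same classifier with the bias absorbed into the weight vector."""
        X_aug = np.hstack([X, np.ones((X.shape[0], 1))])  # append a constant-1 feature
        return np.sign(X_aug @ w_aug)

    X = np.array([[2.0, 1.0], [-1.0, -2.0]])
    w, w0 = np.array([1.0, 1.0]), -0.5
    assert (predict(X, w, w0) == predict_augmented(X, np.append(w, w0))).all()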
Hyperplanes and Maximum Margin

A support vector machine (SVM) (see 55–57 for detailed tutorials) is a linear discriminant that separates data into classes using a hyperplane with maximum margin. Specifically, the discriminant function can be defined using the inner product,

f(y) = wᵀy,   (9)

where y is the result of applying a nonlinear transformation to the data, i.e., yᵢ = φ(xᵢ), and classification is done by taking the sign of f(y). The rationale behind the nonlinear transform is to map the data into a high-dimensional space in which the transformed data are linearly separable and can thus be divided by a hyperplane. In practice, this transformation is accomplished using the "kernel trick" (58), which enables dot products to be replaced by nonlinear kernel functions, i.e., an integral transformation of the function f of the form (Tf)(y) = ∫_a^b k(x, y) f(x) dx, with the function k(x, y) being the kernel. Much of the current research in the field is focused on developing kernels useful for specific applications and problem domains, where the choice of kernel function embeds some prior knowledge about the problem.

Figure 6
Hyperplanes and maximum margin. (a) Two-dimensional scatter plot for a two-class-labeled dataset. The data can be separated by an infinite number of hyperplanes, three of which are shown (f₁, f₂, f₃). (b) Illustration of the hyperplane that maximizes the margin (m). This hyperplane is completely specified by the support vectors, the example data lying at the margins.

Bias-variance dilemma: a classic trade-off encountered in machine learning, in which one must balance the bias introduced by restricting the complexity of the model against the estimation accuracy, or variance, of the parameters. The expected generalization error is a combination of the bias and the variance; the best model minimizes these two terms simultaneously.
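
To make the maximum-margin and kernel ideas concrete, a brief scikit-learn sketch on synthetic two-class data is given below; the RBF kernel is one illustrative choice of how prior knowledge can be embedded, not a recommendation drawn from the review.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    # Two Gaussian clouds, one per class.
    X = np.vstack([rng.normal(-1.0, 0.5, (50, 2)),
                   rng.normal(1.0, 0.5, (50, 2))])
    y = np.repeat([0, 1], 50)

    clf = SVC(kernel="rbf", C=1.0).fit(X, y)
    # The separating hyperplane (in the kernel-induced feature space) is
    # specified entirely by the support vectors at the margin.
    print(len(clf.support_vectors_), "support vectors")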
