Machine Learning for Detection and Diagnosis of Disease

Paul Sajda
Department of Biomedical Engineering, Columbia University, New York, NY 10027;
email: ps629@columbia.edu

Annu. Rev. Biomed. Eng. 2006. 8:537–65

First published online as a Review in Advance on April 17, 2006

The Annual Review of Biomedical Engineering is online at bioeng.annualreviews.org

doi: 10.1146/annurev.bioeng.8.061505.095802

Copyright © 2006 by Annual Reviews. All rights reserved

1523-9829/06/0815-0537$20.00
`
`Key Words
blind source separation, support vector machine, Bayesian network,
`medical imaging, computational biology
`
`Abstract
Machine learning offers a principled approach for developing sophisticated, automatic, and objective algorithms for analysis of high-dimensional and multimodal biomedical data. This review focuses on several advances in the state of the art that have shown promise in improving detection, diagnosis, and therapeutic monitoring of disease. Key in this advancement has been the development of a more in-depth understanding and theoretical analysis of critical issues related to algorithmic construction and learning theory. These include trade-offs for maximizing generalization performance, use of physically realistic constraints, and incorporation of prior knowledge and uncertainty. The review describes recent developments in machine learning, focusing on supervised and unsupervised linear methods and Bayesian inference, which have made a significant impact on the detection and diagnosis of disease in biomedicine. We describe the different methodologies and, for each, provide examples of their application to specific domains in biomedical diagnostics.
`
`537
`
`by Moscow State University - Scientific Library of Lomonosov on 05/22/13. For personal use only.
`
`Annu. Rev. Biomed. Eng. 2006.8:537-565. Downloaded from www.annualreviews.org
`
`APPLE 1042
`
`1
`
`
`
`INTRODUCTION
`Machine learning, a subdiscipline in the field of artificial intelligence (AI), focuses on
`algorithms capable of learning and/or adapting their structure (e.g., parameters) based
`on a set of observed data, with adaptation done by optimizing over an objective or cost
`function. Machine learning and statistical pattern recognition have been the subject
`of tremendous interest in the biomedical community because they offer promise for
`improving the sensitivity and/or specificity of detection and diagnosis of disease, while
`at the same time increasing objectivity of the decision-making process. However, the
early promise of these methodologies has translated into only limited clinical utility, the most notable example being the use of such methods for mammographic screening (1, 2). The potential impact of, and need for, machine learning is perhaps greater than ever given the dramatic increase in medical data being collected, new detection and diagnostic modalities being developed, and the complexity of the data types and the importance of multimodal analysis. In all of these cases, machine learning
`can provide new tools for interpreting the high-dimensional and complex datasets
`with which the clinician is confronted.
`Much of the original excitement for the application of machine learning to
`biomedicine originated from the development of artificial neural networks (ANNs)
`(e.g., see 3), which were often proclaimed to be “loosely” modeled after computation
`in the brain. Although in most cases such claims for brain-like computation were
`largely unjustified, one of the interesting properties of ANNs was that they were
`shown to be capable of approximating any arbitrary function through the process of
`learning (also called training) a set of parameters in a connected network of simple
`nonlinear units. Such an approach mapped well to many problems in medical image
`and signal analysis and was in contrast to medical expert systems such as Mycin (4)
`and INTERNIST (5), which, in fact, were very difficult and time consuming to con-
struct and were based on a set of rules and prior knowledge. A problem with ANNs, however, is the difficulty of understanding how such networks construct the desired function and thus how to interpret the results. As a consequence, such methods are often used as
`a “black box,” with the ANN producing a mapping from input (e.g., medical data) to
`output (e.g., diagnosis) but without a clear understanding of the underlying mapping
`function. This can be particularly problematic in clinical medicine when one must
`also consider merging the interpretation of the computer system with that of the
`clinician because, in most cases, computer analysis systems are seen as adjunctive.
`As the field of machine learning has matured, greater effort has gone into de-
`veloping a deeper understanding of the theoretical basis of the various algorithmic
`approaches. In fact, a major difference between machine learning and statistics is that
`machine learning is concerned with theoretical issues such as computational complex-
`ity, computability, and generalization and is in many respects a marriage of applied
`mathematics and computer science.
`An area in machine learning research receiving considerable attention is the fur-
`ther development and analysis of linear methods for supervised and unsupervised
`feature extraction and pattern classification. Linear methods are attractive in that
`their decision strategies are easier to analyze and interpret relative to nonlinear
`
`538
`
`Sajda
`
`by Moscow State University - Scientific Library of Lomonosov on 05/22/13. For personal use only.
`
`Annu. Rev. Biomed. Eng. 2006.8:537-565. Downloaded from www.annualreviews.org
`
`2
`
`
`
classification and regression functions, for example, those constructed by ANNs. In addition, a linear model can often be shown to be consistent, at least to first order, with underlying physical processes, such as image formation or signal acquisition. Finally, linear methods tend to be computationally efficient and can be trained online and in real time.

Biomarkers: anatomic, physiologic, biochemical, or molecular parameters associated with the presence and severity of specific disease states.
`Particularly important for biomedical applications has been the development
`of methods for explicitly incorporating prior knowledge and uncertainty into the
`decision-making process. This has led to principled methods based on Bayesian infer-
`ence, which are well suited for incorporating disparate sources of noisy measurements
`and uncertain prior knowledge into the diagnostic process.
This review describes recent developments in machine learning, focusing on supervised and unsupervised linear methods and Bayesian inference, which have made a significant impact on the detection and diagnosis of disease in biomedicine. We describe the different methodologies and, for each, provide examples of their application to specific domains in biomedical diagnostics.
`
`BLIND SOURCE SEPARATION
`Two important roles for machine learning are (a) extraction of salient structure in the
`data that is more informative than the raw data itself (the feature extraction problem)
`and (b) inferring underlying organized class structure (the classification problem).
Although, strictly speaking, the two are not easily separated into distinct problems, we treat them as such and describe the state of the art of linear methods for both.
`In this section we focus on unsupervised methods and application of such methods
`for recovering clinically significant biomarkers.
`
`Linear Mixing
`There are many cases in which one is interested in separating, or factorizing, a set
`of observed data into two or more matrices. Standard methods for such factorization
`include singular value decomposition (SVD) and principal component analysis (PCA)
(6). These methods have been shown to satisfy specific optimality criteria; for example, PCA is optimal in terms of minimum reconstruction error under the constraint of orthogonal basis vectors. However, in many cases these criteria are not consistent
`with the underlying signal/image-formation process and the resultant matrices have
`little physical relevance. More recently, several groups have developed methods for
`decomposing a data matrix into two matrices in which the underlying optimality
`criteria and constraints yield more physically meaningful results (7–14).
`Assume a set of observations is the result of a linear combination of latent sources.
`Such a linear mixing is quite common in signal and image acquisition/formation,
at least to a first approximation, and is consistent with the underlying physical mixing processes, ranging from electroencephalography (15) to acoustics (16). Given X as a matrix of observations (M rows by N columns), the linear mixing equation is

X = AS,   (1)
`
`
`
`
`where A is the set of mixing coefficients and S is a matrix of sources. Depending
`on the modality, the columns of X and S are the coordinate system in which the
`data is represented (i.e., time, space, wavelength, frequency, etc.). The challenge is to
`recover both A and S simultaneously given only the observations X. This problem
`is often termed blind source separation (BSS) because the underlying sources are
`not directly observed and the mixing matrix is not known. BSS methods have been
`applied to many fundamental problems in signal recovery and deconvolution (17).
`Most methods that have been developed attempt to learn an unmixing matrix W,
`which when applied to the data X yields an estimate of the underlying sources (up to
`a scaling and permutation),
`
Ŝ = WX.   (2)
`Consider the case when one assumes the rows of S (i.e., the source vectors) are random
`variables that are statistically independent. This implies that the joint distribution of
the sources factors,

P(s_1, . . . , s_L) = P(s_1)P(s_2) . . . P(s_L),   (3)

where L indicates the number of underlying sources (with each s_i a row in S), and
`P(.) is the probability density function. In most cases L is not known and represents
`a hyperparameter that must be set or inferred. BSS methods that exploit statistical
`independence in their optimality criteria are termed independent component analysis
`(ICA) (see 18 for review). Several approaches have been developed to recover inde-
`pendent sources, the methods distinguished largely by the objective function they
`employ, e.g., maximum likelihood (19), maximum a posteriori (9), information max-
`imization (20), entropy estimation (21), and mean-field methods (22). In the case of
time series, or other types of ordered data, one can also exploit statistical criteria such as nonstationarity and utilize simultaneous decorrelation (16, 23–25). Parra
`& Sajda (15) formulate the problem of BSS as one of solving a generalized eigenvalue
`problem, where one of the matrices is the covariance matrix of the observations and
`the other is chosen based on the underlying statistical assumptions on the sources.
`This view unifies various approaches in simultaneous decorrelation and ICA, together
`with PCA and supervised methods such as common spatial patterns (CSP) (26).
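This generalized eigenvalue view can be sketched in a few lines (an illustrative sketch, not the authors' implementation; this variant pairs the observation covariance with a time-lagged covariance, exploiting the temporal structure mentioned above, and all function and variable names are mine):

    import numpy as np
    from scipy.linalg import eig

    def bss_generalized_eig(X, lag=1):
        # Rows of X are channels, columns are time samples.
        X = X - X.mean(axis=1, keepdims=True)
        R0 = X @ X.T / X.shape[1]                     # covariance of observations
        R1 = X[:, lag:] @ X[:, :-lag].T / (X.shape[1] - lag)
        R1 = (R1 + R1.T) / 2                          # symmetrized lagged covariance
        # Solve R1 w = lambda R0 w; the generalized eigenvectors form the rows of W.
        _, V = eig(R1, R0)
        W = np.real(V).T
        return W, W @ X                               # unmixing matrix, source estimates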
An attractive property of these decomposition methods is that the recovered components often form a natural basis for the data, in particular if one considers
`some general properties of natural signals. For example, the marginal statistics of
`many natural signals (or filtered versions of the signals) are highly non-Gaussian (27,
`28). Since, by the central limit theorem, linear mixtures of non-Gaussian random
`variables will result in marginal statistics that are more closely Gaussian, recovering
`the independent components captures the generative or natural axes of the mixing
`process.
`
`Nonnegative Matrix Factorization
`One particularly useful method for factoring the data matrix X under very general
`and physically realistic constraints is the nonnegative matrix factorization (NMF)
`
`540
`
`Sajda
`
`by Moscow State University - Scientific Library of Lomonosov on 05/22/13. For personal use only.
`
`Annu. Rev. Biomed. Eng. 2006.8:537-565. Downloaded from www.annualreviews.org
`
`4
`
`
`
algorithm (7). The basic idea of the NMF algorithm is to construct a gradient descent over an objective function that optimizes A and S and, by appropriately choosing gradient step sizes, to convert an additive update into a multiplicative one. For example, assuming Gaussian noise, one can formulate the problem of recovering A and S in Equation 1 as a maximum likelihood estimation,

A_ML, S_ML = argmax_{A,S} p(X | A, S)
           = argmax_{A,S} (1/(√(2π)σ)) e^(−‖X−AS‖²/(2σ²))   (4)
subject to: A ≥ 0, S ≥ 0,

where σ is the standard deviation of the Gaussian noise and AS its mean. Maximizing the likelihood is equivalent to minimizing the negative log-likelihood, and Equation 4 can be written as

A_ML, S_ML = argmin_{A,S} (−log p(X | A, S))
           = argmin_{A,S} ‖X − AS‖²   (5)
subject to: A ≥ 0, S ≥ 0.

One can compute the gradients of the negative log-likelihood function and construct the additive update rules for A and S,

A_im ← A_im + δ_im [(XSᵀ)_im − (ASSᵀ)_im]
S_mλ ← S_mλ + η_mλ [(AᵀX)_mλ − (AᵀAS)_mλ].   (6)

Note that there are two free parameters, the step sizes of the updates. Lee & Seung (29) have shown that by appropriately choosing the step sizes, δ_im = A_im/(ASSᵀ)_im and η_mλ = S_mλ/(AᵀAS)_mλ, the additive update rules can be reformulated as multiplicative update rules, with X = AS being a fixed point. The multiplicative update rules for A and S therefore become

A_im ← A_im (XSᵀ)_im/(ASSᵀ)_im
S_mλ ← S_mλ (AᵀX)_mλ/(AᵀAS)_mλ,   (7)

where convergence of these update rules is guaranteed (29). By formulating the updates as the multiplicative rules in Equation 7, we can ensure nonnegative A and S, given that both are initialized to be nonnegative and the observations, X, are nonnegative.
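In code, the multiplicative updates of Equation 7 amount to a few lines (a minimal sketch, assuming nonnegative observations X and a user-chosen number of sources M; the small eps guarding the denominators is a standard numerical safeguard, not part of the derivation above):

    import numpy as np

    def nmf(X, M, n_iter=500, eps=1e-9, seed=0):
        rng = np.random.default_rng(seed)
        A = rng.random((X.shape[0], M))           # nonnegative initialization of A
        S = rng.random((M, X.shape[1]))           # nonnegative initialization of S
        for _ in range(n_iter):
            A *= (X @ S.T) / (A @ S @ S.T + eps)  # Equation 7, update for A
            S *= (A.T @ X) / (A.T @ A @ S + eps)  # Equation 7, update for S
        return A, S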
An intuitive understanding of NMF can be developed via geometrical considerations. The manifold of possible solutions specified by the linear mixing equation and nonnegativity constraints is an M-dimensional polygonal cone spanned by the M rows of S. Nonnegativity constraints require that the row vectors of S,
`
`
`
`
`Figure 1
`Geometrical interpretation of NMF. The axes represent two dimensions of the
`high-dimensional space of the observations. Spans of the recovered sources (s1 and s2) are
`shown as dashed magenta vectors. The recovered sources are constrained to lie in the positive
hyper-quadrant and tightly envelop the observed data, forming a cone (pink region). Points that fall outside of the cone contribute to the error. An analogous picture can be drawn for the basis vectors A = {a_1, . . . , a_M}.
`
representing the edges of the cone, lie in the positive quadrant of the L-dimensional space of the points defined by the rows of the observations X, which must fall within that polygonal cone. The aim of maximum likelihood is to find cone edge vectors that tightly envelop the observed L-points. Figure 1 illustrates this interpretation, which is sometimes referred to as a conic encoder (30).
`The basic NMF algorithm has been modified in several ways, including adding a
`sparsity constraint on the sources (31), weighted NMF (32), and constrained NMF
`(11) (see below). The utility of the NMF algorithm for recovering physically mean-
`ingful sources has been demonstrated in a number of application domains, including
`image classification (33), document classification (34), and separation of audio streams
`(35), as well as biomedical applications such as analysis of positron emission tomog-
`raphy (PET) (36) and microarray analysis of gene expression (37, 38). Below, we
`describe two examples, both using nuclear magnetic resonance (NMR) data, where
`such methods are able to recover signatures of disease and toxicity.
`
`Recovering Spectral Signatures of Brain Cancer
In vivo magnetic resonance spectroscopic imaging (MRSI) allows noninvasive charac-
`terization and quantification of molecular markers of potentially high clinical utility
`for improving detection, identification, and treatment for a variety of diseases, most
notably brain cancers (39). MRSI acquires MR spectra at high frequency resolution across a volume of tissue; common nuclei include 1H (proton), 13C (carbon), 19F (fluorine), and 31P (phosphorus). Machine learning approaches for integrating
`MRSI with structural MRI have been shown to have potential for improving the
`assessment of brain tumors (40).
`
`542
`
`Sajda
`
`by Moscow State University - Scientific Library of Lomonosov on 05/22/13. For personal use only.
`
`Annu. Rev. Biomed. Eng. 2006.8:537-565. Downloaded from www.annualreviews.org
`
`6
`
`
`
`In MRSI, each tissue type can be viewed as having a characteristic spectral pro-
`file related to its biochemical composition. In brain tumors, for example, 1H MRSI
`has shown that metabolites are heterogeneously distributed and, in a given voxel,
`multiple metabolites and tissue types may be present (41). The observed spectra are
`therefore a combination of different constituent spectra. Because the signal measured
`in MRSI is the response to a coherent stimulation of the entire tissue, the ampli-
`tudes of different coherent resonators are additive. The overall gain with which a
`tissue type contributes is proportional to its abundance/concentration in each voxel.
`As a result, we can explain observations using the linear mixing equation (Equa-
`tion 1). Because we interpret A as abundance/concentration, we can assume the
`matrix to be nonnegative. In addition, because the constituent spectra S represent
`amplitudes of resonances, in theory, the smallest resonance amplitude is zero, cor-
`responding to the absence of resonance at a given band (where we ignore cases of
`negative peaks such as in J-modulation). Figure 2 illustrates the spectral unmixing
`problem.
Interpretation of MRSI data is challenging, particularly for traditional peak-
`quantifying techniques (42, 43): A typical dataset consists of hundreds of highly cor-
`related spectra, having low signal-to-noise ratio (SNR) with peaks that are numerous
`and overlapping. This has created the need for approaches that can analyze the entire
`dataset simultaneously, taking advantage of the relationships among the spectra to
`improve the quality of the analysis. Such approaches are particularly useful for spectra
`
[Figure 2 schematic: X = AS + N, where A is the N × M matrix of mixing coefficients (M ≪ N), S contains the constituent spectra, and N is the residual noise.]
`
`Figure 2
`The spectral unmixing problem. Spectra from multiple voxels, for example, from MRSI and
`represented in the rows of X, are simultaneously analyzed and decomposed into constituent
`spectra S and the corresponding intensity distributions A. The extracted constituent spectra
`are identified by comparing them to known spectra of individual molecules. In most cases, the
number of rows in S, M, is much less than the number of rows, N, in X—i.e., there is a
`dimensionality reduction in the decomposition. Unidentified spectral components are
`considered residual noise N. Their corresponding magnitudes quantify the modeling error,
`which can be directly compared to the modeling error of alternative parametric estimation
`procedures.
`
`
`
`
`Figure 3
`cNMF separation of 1H CSI human brain data into clinically significant biomarkers and their
`corresponding spatial distributions. (a) Spectrum indicative of normal brain tissue: low choline
`(CHO), high creatine (CR), and high N-acetyl-aspartate (NAA). (b) Spectrum indicating
`high-grade malignant tumor tissue: highly elevated CHO, low CR, almost no NAA, and LAC
`(lactic acid). (c) Spectrum indicating residual lipids.
`
`with low SNR as they utilize the collective power of the data. Several BSS approaches
`have been developed to simultaneously exploit the statistical structure of an MRSI
dataset by factorizing Equation 1. For example, ICA (44), second-order blind identification (SOBI) (45), and Bayesian spectral decomposition (8) have all been applied to MRSI datasets to decompose observed spectra into interpretable components.

Constrained NMF (cNMF) is a very efficient version of NMF for recovering
`biomarkers of brain cancer in MRSI (11, 12). The algorithm enables nonnegative
`factorization even for noisy observations, which may result in observed spectra hav-
`ing negative values. cNMF includes a positivity constraint, forcing negative values
`in the recovered spectral sources and abundance/concentration distributions to be
`approximately zero. Figure 3 illustrates an example of spectral sources and their cor-
`responding concentrations recovered using cNMF for 1H MRSI data from human
`brain. In this example, the method recovers biomarkers of high-grade malignant tu-
`mor as well as the spatial distribution of their concentration. One of the advantages
`over other decomposition approaches that have been used in NMR, for example,
`those based on Monte Carlo sampling (8), is that cNMF is computationally efficient
`and can be used in near real time, when a patient is in the MR scanner.
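A sketch of how such a positivity constraint can be grafted onto the multiplicative updates given earlier (this is one plausible reading of the idea, not the published cNMF implementation): after each update, entries driven negative by noisy observations are clipped to approximately zero.

    import numpy as np

    def cnmf(X, M, n_iter=500, eps=1e-9, seed=0):
        rng = np.random.default_rng(seed)
        A = rng.random((X.shape[0], M))
        S = rng.random((M, X.shape[1]))
        for _ in range(n_iter):
            A *= (X @ S.T) / (A @ S @ S.T + eps)
            S *= (A.T @ X) / (A.T @ A @ S + eps)
            # Noisy X may contain negative values, so an update can push
            # entries negative; force them to be approximately zero.
            np.maximum(A, eps, out=A)
            np.maximum(S, eps, out=S)
        return A, S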
`
`544
`
`Sajda
`
`by Moscow State University - Scientific Library of Lomonosov on 05/22/13. For personal use only.
`
`Annu. Rev. Biomed. Eng. 2006.8:537-565. Downloaded from www.annualreviews.org
`
`8
`
`
`
`Extraction of Metabolic Markers of Toxicity
`Metabolomics [sometimes referred to as metabonomics (46)] quantitatively measures
`the dynamic metabolic response of living systems to pathophysiological stimuli or
genetic modification. Metabolomic analysis of biofluids based on high-resolution MRS and chemometric methods is valuable in characterizing the biochemical response to toxicity (47). Interpretation of a high-resolution 1H biofluid NMR dataset is challenging, particularly for traditional peak-quantifying techniques: A typ-
`ical dataset consists of at least tens of highly correlated spectra, with thousands of
`partially overlapping peaks arising from hundreds of endogenous molecules. This
`has created the need for approaches that can analyze the entire dataset simultane-
`ously for discriminating between different combinations of metabolites, including
`their dynamic changes.
`PCA is widely used for analyzing metabolomic NMR datasets (48, 49). Although
`a reasonable approach for preprocessing NMR datasets (50), the PCA decomposition
does not lead to physically realizable spectral biomarkers. Physically realistic decompositions are useful not only for visualization but also for classification of metabolic patterns using machine learning and domain knowledge (51).
`Figure 4 illustrates NMF applied to 1H NMR spectra of urine from Han Wistar
`rats in a hydrazine toxicity experiment. Samples were collected from control rats
`and those treated with three different doses of hydrazine (75, 90, 120 mg/kg) over
`a period of 150 h (52). Preprocessing, including normalization of the data, has been
described elsewhere (53). The NMF algorithm requires about 300 s (Intel Pentium 4, 1.2 GHz) to obtain the recovered spectral sources, orders of magnitude faster than
`other decomposition methods yielding similar results (53). The magnitudes in each
`dose-group, as a function of time, are shown in Figure 5a, with the identified spectral
`patterns in Figure 4a. NMF was run 100 times (100 independent initializations), with
`Figure 4a showing the mean results (solid lines) and variation across runs (dashed lines).
`The small variance demonstrates the robustness and fidelity of the NMF in spectral
`pattern recovery.
The association of the four spectral patterns with the hydrazine treatment is clear. In control rats, the first (filled diamonds) and second (filled upright triangles) spectral sources maintain an almost constant high level, while the third (inverted triangles) and fourth (open circles) are very low. Thus, the first spectral source (Krebs cycle interme-
`diates: citrate and succinate) and second spectral source (2-oxoglutarate) are related
`to the normal patterns, while the third and fourth (2-aminoadipic acid, taurine and
`creatine) are related to hydrazine. Indeed, in the treated animals, the normal pat-
`terns decrease in response to hydrazine and recover after 36 h, while the other two
`exhibit reciprocal behaviors during the course of the experiment. The data from the
`120 mg/kg dose indicates no sign of recovery at 56 h, at which point the animal was
`sacrificed.
`A visual comparison of the spectral sources recovered using NMF with the first
`principal components recovered using PCA is shown in Figure 4a,b. The PCA
`components do not represent physically realizable spectra and do not appear to
`be biomarkers of the metabolic status of the animals. This is further illustrated by
`
`
`
`
`Figure 4
`(a) Spectral sources, recovered using NMF, indicative of biomarkers for normal metabolic
function (blue) and hydrazine toxicity (red). Solid lines are the mean results and the dashed lines are the mean ±2σ. (b) Components recovered using PCA. Note that the patterns are not
`physically realizable spectra because they have negative peaks.
`
`546
`
`Sajda
`
`by Moscow State University - Scientific Library of Lomonosov on 05/22/13. For personal use only.
`
`Annu. Rev. Biomed. Eng. 2006.8:537-565. Downloaded from www.annualreviews.org
`
`10
`
`
`
[Figure 5 panels: (a) concentration of each recovered spectral biomarker versus time (h) for the control, 75 mg/kg, 90 mg/kg, and 120 mg/kg groups; (b) normal/abnormal cluster assignment versus time (h) for the same groups.]
`
`Figure 5
`(a) Time-dependent concentration of the spectral biomarkers recovered using NMF. The
filled diamonds and filled upright triangles are associated with the normal patterns (blue), and the inverted triangles and open circles are associated with aberrant patterns (red)—symbols correspond to biomarkers in Figure 4a. Analysis of the time-dependent concentrations shows the toxic effect and, in most cases (except the 120 mg/kg dose), recovery from the hydrazine. (b) K-means cluster analysis applied to the amplitudes of the NMF patterns and the first four principal components. (Top) Concentration profiles recovered via NMF enable correct clustering into normal and abnormal classes. The samples corresponding to the control rats and those collected before hydrazine administration, as well as more than 104 h after hydrazine administration for the treated rats, are assigned to the normal cluster, and the other samples collected in the experiment are correctly assigned to the abnormal cluster. (Bottom) K-means clustering on the first four principal components. Classification is less accurate than when using NMF-recovered biomarkers—e.g., as evidenced by the misclassification of some of the time points for controls.
`
`applying K-means clustering (54) to the amplitudes in the matrix A to classify the
`metabolic status (normal versus abnormal) of the rats as a function of time. The re-
`sults for clustering the samples into two clusters, normal and abnormal, using cNMF
`components are shown in Figure 5b (top), from which we can see that the control
`rats are clearly separated from those that are treated. Both the initial measurements
`
`
`
`
`(0 h), taken prior to hydrazine administration, and the later data points (after 104 h)
`for the treated rats are correctly assigned to the normal cluster. These samples have
NMR spectra very similar to those from untreated animals and in fact correspond to time points at which the toxic effect of hydrazine has been largely reversed by biologic recovery. Figure 5b (bottom) shows the classification results using the coefficients of the first four PCs. Clearly, these results are less plausible than those in Figure 5b (top) because some of the time points for the control rats are classified into the abnormal group.
`ically realistic constraints improves classification because it connects the recovered
`sources, quantitatively, with the biological end-point measurements. The approach
`shows promise for understanding complex metabolic responses of disease, pharma-
`ceuticals, and toxins.
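The overall pipeline—factorize the spectra, then cluster the amplitude (concentration) profiles—can be sketched with off-the-shelf tools (scikit-learn stands in here for the implementations used in the studies above; the data and array shapes are synthetic placeholders):

    import numpy as np
    from sklearn.decomposition import NMF
    from sklearn.cluster import KMeans

    # X: one nonnegative 1H NMR spectrum per row (synthetic stand-in).
    X = np.abs(np.random.default_rng(0).normal(size=(60, 1024)))

    nmf = NMF(n_components=4, init="random", random_state=0, max_iter=500)
    A = nmf.fit_transform(X)        # amplitudes/concentrations (matrix A)
    S = nmf.components_             # recovered spectral sources (matrix S)

    # Two-cluster K-means on the amplitude profiles: normal vs. abnormal.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(A)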
`
`SUPPORT VECTOR MACHINES
`The unsupervised learning decompositions discussed in the previous section can be
`considered methods for constructing descriptive representations of the observed data.
`An alternative is to construct discriminative representations using supervised learn-
`ing, namely representations that are constructed to maximize the difference between
`underlying classes in the data. The most common is linear discrimination. The linear
discriminant function can be defined as

f(x) = wᵀx + w₀,   (8)

and can be seen as defining a hyperplane that maps from the space of the data Dⁿ to a space of classes Cᵐ, where in most cases m ≪ n. In binary classification, m = 1, and classification is typically done by taking the sign of f(x). An observation x is mapped into the space of (binary) classes via the weight vector w and bias w₀. The bias can be absorbed into the weight vector, in which case it is termed an augmented weight vector (54). The challenge is to learn, using supervised methods, the weight vector and bias that yield minimum classification error and, more specifically, maximize generalization performance. An illustration of a discriminant function is
`given in Figure 6a. We can see that there are potentially many ways in which to place
`a discrimination boundary—i.e., many values for the weights and bias will minimize
`the classification error. The question thus becomes “Which boundary is the best?”
`Support vector machines directly address this question.
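As a small sketch of Equation 8 and of absorbing the bias into an augmented weight vector (all numerical values invented for illustration):

    import numpy as np

    def f(w, w0, x):
        return w @ x + w0                  # Equation 8; classify via sign(f(x))

    w, w0, x = np.array([0.4, -1.2]), 0.3, np.array([1.5, 0.2])

    # Augmented form: append w0 to w and 1 to x, so the bias is absorbed
    # and f(x) becomes a single inner product.
    w_aug, x_aug = np.append(w, w0), np.append(x, 1.0)
    assert np.isclose(f(w, w0, x), w_aug @ x_aug)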
`
`(8)
`
`Hyperplanes and Maximum Margin
`A support vector machine (SVM) (see 55–57 for detailed tutorials) is a linear dis-
criminant that separates data into classes using a maximum-margin hyperplane.
`Specifically, the discriminant function can be defined using the inner product,
f(y) = wᵀy,   (9)

where y is the result of applying a nonlinear transformation to the data—i.e., y_i = φ(x_i)—and classification is done by taking the sign of f(y).
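For illustration, a maximum-margin linear classifier trained on synthetic two-class data (a sketch using scikit-learn, which the review does not itself reference; swapping kernel="linear" for a nonlinear kernel such as "rbf" invokes the kernel trick discussed below):

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-1.5, 1.0, (50, 2)),    # class -1 examples
                   rng.normal(+1.5, 1.0, (50, 2))])   # class +1 examples
    y = np.array([-1] * 50 + [1] * 50)

    clf = SVC(kernel="linear", C=1.0).fit(X, y)
    w, w0 = clf.coef_[0], clf.intercept_[0]   # hyperplane parameters
    pred = np.sign(X @ w + w0)                # classification by the sign of f
    # clf.support_vectors_ holds the margin examples (Figure 6b) that
    # completely specify the hyperplane.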
`
`548
`
`Sajda
`
`by Moscow State University - Scientific Library of Lomonosov on 05/22/13. For personal use only.
`
`Annu. Rev. Biomed. Eng. 2006.8:537-565. Downloaded from www.annualreviews.org
`
`12
`
`
`
`
`Figure 6
`Hyperplanes and maximum margin. (a) Two-dimensional scatter plot for a two-class-labeled
`dataset. The data can be separated by an infinite number of hyperplanes, three of which are
`shown ( f1, f2, f3). (b) Illustration of the hyperplane that maximizes the margin (m). This
hyperplane is completely specified by the support vectors, the example data at the margins.
`
Bias-variance dilemma: a classic tradeoff encountered in machine learning whereby one must balance the bias introduced by restricting the complexity of the model against the estimation accuracy, or variance, of the parameters. The expected generalization error is a combination of the bias and variance, and thus the best model simultaneously minimizes these two.

The rationale behind the nonlinear transform is to map the data into a high-dimensional space in which the transformed data is linearly separable and thus divided by a hyperplane. In practice, this transformation is accomplished using the “kernel trick” (58), which enables dot products to be replaced by nonlinear kernel functions—i.e., an integral transformation of the function f of the form (Tf)(y) = ∫_a^b k(x, y) f(x) dx, with the function k(x, y) being the kernel. Much of the current research in the field is focused on developing kernels useful for specific applications and problem domains, where the choice of kernel function embeds some prior knowledge about the problem (e.