`
`https://doi.org/10.1038/s41586-019-0879-y
`
`Commonality despite exceptional diversity in the
`baseline human antibody repertoire
`
Bryan Briney (1,2,3,4,5)*, Anne Inderbitzin (1,6), Collin Joyce (1,2,3,4) & Dennis R. Burton (1,2,4,5,7)*
`
In principle, humans can produce an antibody response to any
non-self-antigen molecule in the appropriate context. This flexibility is
achieved by the presence of a large repertoire of naive antibodies, the
diversity of which is expanded by somatic hypermutation following
antigen exposure [1]. The diversity of the naive antibody repertoire in
humans is estimated to be at least 10^12 unique antibodies [2]. Because
the number of peripheral blood B cells in a healthy adult human is
on the order of 5 × 10^9, the circulating B cell population samples
only a small fraction of this diversity. Full-scale analyses of human
antibody repertoires have been prohibitively difficult, primarily
owing to their massive size. The amount of information encoded
by all of the rearranged antibody and T cell receptor genes in one
person (the ‘genome’ of the adaptive immune system) exceeds the
size of the human genome by more than four orders of magnitude.
Furthermore, because much of the B lymphocyte population is
localized in organs or tissues that cannot be comprehensively
sampled from living subjects, human repertoire studies have
focused on circulating B cells [3]. Here we examine the circulating B
cell populations of ten human subjects and present what is, to our
knowledge, the largest single collection of adaptive immune receptor
sequences described to date, comprising almost 3 billion antibody
heavy-chain sequences. This dataset enables genetic study of the
baseline human antibody repertoire at an unprecedented depth
and granularity, which reveals largely unique repertoires for each
individual studied, a subpopulation of universally shared antibody
clonotypes, and an exceptional overall diversity of the antibody
repertoire.

Eighteen sequencing libraries were generated for each of ten subjects
(Extended Data Fig. 1). These libraries yielded 2.90 × 10^9 raw reads.
After annotation [4], which included duplicate removal using unique
molecular identifiers [5], we obtained 3.64 × 10^8 productive antibody
sequences (Extended Data Table 1).

Amplification was reproducible, with similar gene usage between
replicates (Fig. 1a, Extended Data Fig. 2). The frequencies of
IgM-encoding (0.62–0.94) and IgG-encoding (0.06–0.38) sequences were
consistent with the expected frequency of circulating B cells that
express these isotypes [6] (Fig. 1b). Although V-gene, J-gene and CDRH3
length distributions were similar between subjects (Fig. 1c, e, f),
differences were large enough that individual repertoires could
conceivably be distinguished using only these features. We reduced
sequence subsamples to the frequency distributions of V-gene, J-gene and
CDRH3 length, and quantified similarity using the Morisita–Horn
similarity index [7,8]. Subject repertoires were clearly distinguishable
using as few as 10^4 sequences (Fig. 1d, Extended Data Fig. 4) and did
not cluster by age, gender or ethnicity (Fig. 1g). The IgG+ repertoires
were least similar, suggesting that the unique immunological histories of
subjects are a substantial contributor to repertoire individuality
(Fig. 1h). A one-versus-rest support-vector-machine classifier trained on
V-gene, J-gene and CDRH3 length data from 5 of the 6 biological replicates
from each subject accurately assigned the remaining replicate using
test or training datasets of as few as 500 sequences from each replicate
(Fig. 1i).
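
The Morisita–Horn comparison described above can be illustrated with a
minimal Python sketch. It assumes that each annotated sequence is a
dictionary with the hypothetical keys "v_gene", "j_gene" and "cdr3_aa",
and that the three features are counted jointly as one tuple. The text
does not say whether the features were combined or compared separately,
so that choice is an assumption, and this is not the authors' code.

```python
from collections import Counter


def feature_counts(sequences):
    """Count (V gene, J gene, CDRH3 length) tuples in annotated sequences.

    The dictionary keys are hypothetical field names chosen for this sketch.
    """
    return Counter(
        (s["v_gene"], s["j_gene"], len(s["cdr3_aa"])) for s in sequences
    )


def morisita_horn(counts_a, counts_b):
    """Morisita-Horn similarity: 0 for disjoint samples, 1 for equal proportions."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("both samples must contain at least one sequence")
    # Sum of count products over features present in both samples.
    cross = sum(n * counts_b[f] for f, n in counts_a.items() if f in counts_b)
    # Simpson-type dominance term for each sample.
    dominance_a = sum(n * n for n in counts_a.values()) / total_a ** 2
    dominance_b = sum(n * n for n in counts_b.values()) / total_b ** 2
    return 2 * cross / ((dominance_a + dominance_b) * total_a * total_b)


# Toy input with made-up sequences (not data from the study).
sample_1 = [
    {"v_gene": "IGHV3-23", "j_gene": "IGHJ4", "cdr3_aa": "ARDYW"},
    {"v_gene": "IGHV1-69", "j_gene": "IGHJ6", "cdr3_aa": "ARGGYYYGMDV"},
]
sample_2 = [
    {"v_gene": "IGHV3-23", "j_gene": "IGHJ4", "cdr3_aa": "ARDFW"},
    {"v_gene": "IGHV4-34", "j_gene": "IGHJ5", "cdr3_aa": "ARWFDP"},
]
similarity = morisita_horn(feature_counts(sample_1), feature_counts(sample_2))
```

Because the index depends only on proportions, two samples of very
different depth can be compared directly, which is convenient when
subsampling repertoires to a range of sequence counts.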
To estimate repertoire diversity and minimize the effects of sequencing
and amplification error, we first considered clonotype diversity. An
antibody clonotype is a collection of sequences using the same V and J
genes, and encoding an identical CDRH3 amino acid sequence [9]. For
`each subject, all sequences from each biological replicate were collapsed
`into a set of unique clonotypes. Any clonotypes that were repeatedly
`observed after pooling de-duplicated biological replicates must be
`derived from different cells, which provides a straightforward means
`of quantifying multiple occurrence. For clarity, clonotypes or sequences
`present in multiple biological replicates from a single subject will be
`referred to as ‘repeatedly observed’, whereas clonotypes or sequences
`found in multiple subjects will be referred to as ‘shared’.
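
The clonotype definition and the distinction between ‘repeatedly
observed’ and ‘shared’ clonotypes can be made concrete with a short
Python sketch. The record keys ("v_gene", "j_gene", "cdr3_aa") and the
function names are hypothetical, chosen only for illustration; the
sketch assumes that replicates have already been de-duplicated.

```python
from collections import Counter


def clonotype(seq):
    """Clonotype key: same V gene, same J gene, identical CDRH3 amino acids."""
    return (seq["v_gene"], seq["j_gene"], seq["cdr3_aa"])


def unique_clonotypes(replicate):
    """Collapse one biological replicate to its set of unique clonotypes."""
    return {clonotype(seq) for seq in replicate}


def repeatedly_observed(replicates):
    """Clonotypes present in two or more replicates of a single subject."""
    seen = Counter()
    for replicate in replicates:
        seen.update(unique_clonotypes(replicate))
    return {c for c, n in seen.items() if n >= 2}


def shared(subjects):
    """Clonotypes found in two or more subjects.

    `subjects` maps a subject identifier to that subject's list of replicates.
    """
    seen = Counter()
    for replicates in subjects.values():
        # Pool each subject's replicates so a subject counts at most once.
        pooled = set().union(*(unique_clonotypes(r) for r in replicates))
        seen.update(pooled)
    return {c for c, n in seen.items() if n >= 2}


# Toy records (made up for illustration).
a = {"v_gene": "IGHV3-23", "j_gene": "IGHJ4", "cdr3_aa": "ARDYW"}
b = {"v_gene": "IGHV1-69", "j_gene": "IGHJ6", "cdr3_aa": "ARGGW"}
c = {"v_gene": "IGHV4-34", "j_gene": "IGHJ5", "cdr3_aa": "ARWFDP"}
subject_1 = [[a, a, b], [a], [c]]
subject_2 = [[b], [b]]
```

Counting a clonotype once per replicate, rather than once per read, is
what lets a repeat observation be attributed to different cells.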
Rarefaction curves indicated a low frequency of repeatedly observed
clonotypes, which is supported by capture–recapture sampling
(3.9–11.7% recapture; Fig. 2a, Extended Data Fig. 6). To estimate
repertoire diversity, we selected two estimators: Chao 2 and Recon.
Chao 2 is a non-parametric estimator that uses repeat occurrence
data from multiple samples to estimate species richness [10]. Recon uses
maximum likelihood to estimate species richness, assuming only that
the overall size of the repertoire is large (relative to sampling depth)
and well-mixed [11]. These estimates represent the total diversity that
the humoral immune system is capable of generating. Accordingly, these
estimates may greatly exceed the actual number of B cells present in
a single individual at any one time. The estimators produced similar
estimates of clonotype diversity for each subject, with identical rank
order (Fig. 2b). Recon consistently estimated about twofold greater
repertoire diversity (2 × 10^7–1 × 10^9) than Chao 2 (1 × 10^7–5 × 10^8),
consistent with reports that Chao 2 underestimates richness for samples
with a non-negligible frequency of rare species [12,13]. Pooling unique
clonotypes from multiple subjects enabled us to estimate cohort-wide
diversity (Fig. 2c). Chao 2 (5 × 10^9) and Recon (5 × 10^9) produced
nearly identical estimates for the complete ten-subject pool. Estimates
of cohort-wide clonotype diversity exceed individual subject estimates
by less than two orders of magnitude, which suggests a relatively high
frequency of shared clonotypes. We next sought to estimate the
sequence diversity for each individual, again using both the Chao 2
and Recon estimators (Fig. 2d). As expected, the estimates for
sequences were substantially higher than for clonotypes, with Chao 2
(2 × 10^8–2 × 10^9) and Recon (1 × 10^8–2 × 10^9) producing comparable
estimates for each subject. Unlike the cohort-wide clonotype estimates,
Recon estimated much lower cohort-wide sequence diversity
(1 × 10^10) than Chao 2 (1 × 10^11; Fig. 2e). The light-chain repertoire
is estimated to be approximately four orders of magnitude less diverse
than the heavy-chain repertoire (Extended Data Fig. 7) and pairing
of heavy and light chains is approximately random [14], which produces
a total paired-sequence diversity estimate of 10^16 to 10^18. The most
commonly cited estimate of antibody repertoire diversity, 10^12 unique
sequences [2], considers only the unmutated naive repertoire. As such,
`
(1) Department of Immunology and Microbiology, The Scripps Research Institute, La Jolla, CA, USA.
(2) Center for HIV/AIDS Vaccine Immunology and Immunogen Discovery, The Scripps Research Institute, La Jolla, CA, USA.
(3) Center for Viral Systems Biology, The Scripps Research Institute, La Jolla, CA, USA.
(4) IAVI Neutralizing Antibody Center, The Scripps Research Institute, La Jolla, CA, USA.
(5) Human Vaccines Project, New York, NY, USA.
(6) Division of Infectious Diseases and Hospital Epidemiology, University Hospital Zurich, University of Zurich, Zurich, Switzerland.
(7) Ragon Institute of MGH, MIT and Harvard, Cambridge, MA, USA.
*e-mail: briney@scripps.edu; burton@scripps.edu
`
`© 2019 Springer Nature Limited. All rights reserved.
`
`N A T U R E | www.nature.com/nature
`
`Lassen - Exhibit 1016, p. 1
`
`
`
RESEARCH LETTER

[Fig. 1 graphic not reproduced. Panels a–i. Recoverable labels:
a, VJ frequency in biological replicates of subject 326650 (IgM),
r^2 = 0.9978; b, frequency of IgM and IgG sequences (means 0.84 and 0.16);
c, frequency versus CDRH3 length (AA); d, Morisita–Horn similarity versus
sequence count for subject 316188 against each other subject; e, f, IGHV
and IGHJ gene use by subject; g, clustered distance matrix of subjects;
h, Morisita–Horn similarity by comparison type (intra, inter) for all
sequences, IgM (<2 mutations), IgM (2+ mutations) and IgG; i, mean ROC AUC
versus sequence count, with the 500-sequence threshold marked.
Subject identifiers: 316188, 326650, 326651, 326713, 326737, 326780,
326797, 326907, 327059, D103.]
`
`Fig. 1 | Uniqueness of the repertoires of individual subjects.
`a, Frequency comparison of V and J combinations in biological replicates
`from subject 326650. V and J combinations are coloured according to
`the V gene used. b, Sequence frequency by antibody isotype. Subjects
`are coloured as in c. Each point represents a single biological replicate.
`Mean of all samples is indicated for each isotype. c, CDRH3 length
`distribution for each subject. CDRH3 lengths were determined using
`the Immunogenetics (IMGT) numbering scheme. AA, amino acids.
`d, Morisita–Horn similarity of pairwise comparisons between subject
`316188 and each of the other subjects. Lines indicate mean similarity
`of 20 bootstrap samplings, and shaded areas indicate 95% confidence
`intervals. Data from subject 316188 are representative; plots for all other
`subjects can be found in Extended Data Fig. 4. e, f, V gene (e) and J gene
`(f) use by subject. Increased colour intensity indicates higher frequency.
Subjects are coloured as in c. g, Clustered distance matrix of subjects,
using pairwise Morisita–Horn similarity of V-gene, J-gene and CDRH3
`length as the distance measure. Distance matrix was computed using
`single-linkage clustering (Euclidean distance metric). Subject colours
`are as in c. A dendrogram representation of the distance matrix is also
`shown on the left side of the distance matrix. h, Comparison of intra- and
`inter-subject similarity in V-gene, J-gene and CDRH3 length, using all
`sequences, IgM sequences with fewer than two nucleotide mutations, IgM
`sequences with two or more mutations, or IgG sequences. Points represent
`individual intra- or inter-subject comparisons. Box plots show the median
`line and span the 25th–75th percentile, with whiskers indicating the 95%
`confidence interval. i, Mean receiver operating characteristic (ROC) area
`under the curve (AUC) for a one-versus-rest support-vector-machine
`classifier. The ROC AUC does not drop below 1.0 for any subject when the
`test or training datasets include ≥ 500 sequences each; this 500-sequence
`threshold is indicated with a dashed vertical line.
`
our sequence diversity estimates, which include both the naive and
memory sequences, are not directly comparable to this previous
estimate. Clonotype diversity estimates (which incorporate only V- and
J-gene assignments, and the CDRH3 amino acid sequence) minimize
the influence of somatic hypermutation, and are more suitable for
comparison with previous estimates of naive repertoire diversity. The
cohort-wide paired clonotype diversity using either estimator, under
the same assumptions regarding light-chain diversity and random
pairing, is estimated at 3 × 10^15, over three orders of magnitude
greater than previously estimated for the naive repertoire.
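
The paired-diversity figures above follow from simple arithmetic, which
the Python sketch below reproduces. It rests on one reading of the text
that is an assumption rather than a documented calculation: light-chain
diversity is taken to be exactly four orders of magnitude below the
corresponding heavy-chain estimate, and random pairing is taken to mean
that paired diversity is the product of the two. The function name is
illustrative.

```python
def paired_diversity(heavy_chain_diversity, orders_below=4):
    """Paired heavy/light diversity under random pairing.

    Assumes light-chain diversity is `orders_below` orders of magnitude
    smaller than the heavy-chain diversity passed in.
    """
    light_chain_diversity = heavy_chain_diversity // 10 ** orders_below
    return heavy_chain_diversity * light_chain_diversity


# Cohort-wide heavy-chain estimates quoted in the text.
cohort_clonotypes = 5 * 10 ** 9        # clonotypes, either estimator
cohort_sequences_recon = 10 ** 10      # sequences, Recon
cohort_sequences_chao2 = 10 ** 11      # sequences, Chao 2

paired_clonotypes = paired_diversity(cohort_clonotypes)
paired_sequences = (
    paired_diversity(cohort_sequences_recon),
    paired_diversity(cohort_sequences_chao2),
)
```

Under these assumptions the clonotype product is 2.5 × 10^15, which
rounds to the quoted 3 × 10^15, and the sequence products span 10^16 to
10^18, matching the range given for total paired-sequence diversity.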
`Although it is known that convergent antibodies may arise from
`different individuals in response to immunological exposure, and
`a low frequency of CDRH3 sharing has previously been observed
in healthy adult repertoires [9,15], the overall prevalence of repertoire
`sharing is unknown. For each combination of two or more subjects,
`we computed the frequency of shared clonotypes (Fig. 3a). Pairs of
`
`
[Fig. 2 graphic not reproduced. Panels a–e. Recoverable labels:
a, unique clonotypes (fraction) versus observed clonotypes (fraction),
with inset, by subject; b, diversity estimate (clonotypes) versus
subsample fraction, with per-estimator maxima (Chao 2, Recon);
c, diversity estimate (clonotypes) versus number of subjects;
d, diversity estimate (sequences) versus subsample fraction, with
per-estimator maxima; e, diversity estimate (sequences) versus number
of subjects.]
`
`Fig. 2 | Clonotype and sequence diversity amongst the 10 subjects.
`a, Clonotype rarefaction curves for each subject. Lines represent the mean
`of 10 independent samplings, with the exception of the 1.0 fraction (which
`was sampled once). The dashed line represents a perfectly diverse sample.
`Inset is a close-up of the ends of the rarefaction curves. b, Estimates of
`total repertoire diversity per clonotype were computed for increasingly
`large fractions of the clonotype repertoire of each subject. Each line
`represents the mean of 10 random subsamplings without replacement
`(except for the 1.0 fraction). Chao 2 (C) estimates are shown in solid lines,
`Recon (R) estimates are shown in dashed lines. Subject colours are as in
`a. Maximum diversity (1.0 fraction of each subject) for each estimator is
`shown in the right panel. c, Overall cross-subject clonotype diversity of
`each possible combination of one or more subjects. The Chao 2 estimate
is a solid line and the Recon estimate is a dashed line. Shaded regions
indicate 95% confidence intervals. The confidence intervals in c are
`for different groupings of subjects, not for the estimators themselves.
`d, Estimates of total sequence repertoire diversity were computed for
`increasingly large fractions of the sequence repertoire of each subject.
`Each line represents the mean of 10 random subsamplings without
`replacement (except for the 1.0 fraction, for which only a single calculation
`was made). Chao 2 estimates are shown in solid lines, Recon estimates
`are shown in dashed lines. Subject colours are as in a. Maximum diversity
`(1.0 fraction of each subject repertoire) for each estimator is shown in the
`right panel. e, Overall cross-subject nucleotide sequence diversity of each
`possible combination of one or more subjects. The Chao 2 estimate is a
`solid line and the Recon estimate is a dashed line. Shaded regions indicate
`95% confidence intervals. Confidence intervals are as in c.
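
For readers unfamiliar with the Chao 2 estimator behind the solid lines
in Fig. 2b–e, the following Python sketch applies its standard textbook
form to incidence data, where each sample is the set of unique
clonotypes from one biological replicate. The function name and the toy
clonotype labels are illustrative, the fallback used when no clonotype
occurs in exactly two samples is the common bias-corrected variant, and
nothing here is taken from the authors' code.

```python
from collections import Counter


def chao2(samples):
    """Chao 2 richness estimate from incidence data.

    `samples` is a list of collections of species (here, clonotypes);
    only presence or absence within each sample is used.
    """
    m = len(samples)
    if m < 2:
        raise ValueError("Chao 2 needs at least two samples")
    incidence = Counter()
    for sample in samples:
        incidence.update(set(sample))  # count each species once per sample
    s_obs = len(incidence)
    q1 = sum(1 for n in incidence.values() if n == 1)  # seen in one sample
    q2 = sum(1 for n in incidence.values() if n == 2)  # seen in two samples
    scale = (m - 1) / m
    if q2 > 0:
        return s_obs + scale * q1 * q1 / (2 * q2)
    # Bias-corrected form, used when no species occurs in exactly two samples.
    return s_obs + scale * q1 * (q1 - 1) / 2


# Toy replicates with made-up clonotype labels.
replicates = [{"a", "b", "c"}, {"a", "d"}, {"a", "b", "e"}]
estimate = chao2(replicates)
```

In the toy input, five clonotypes are observed, three of them in one
replicate only and one in exactly two, so the estimate exceeds the
observed count. A large excess of singly observed clonotypes is what
drives such estimates far above the number actually sequenced.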
`
subjects shared, on average, 0.95% of their respective clonotypes, and
0.022% of clonotypes were shared by all ten subjects. We next used two
approaches to quantify the expected frequency of clonotype sharing by
chance. Hypergeometric distributions, based on cohort-wide clonotype
diversity (Chao 2) and the number of unique clonotypes for each
subject, indicated a low likelihood that the observed sharing was due to
chance (probability 8.8 × 10^−6; the Bonferroni-corrected P = 0.05
threshold is 1.1 × 10^−3). We
also generated synthetic antibody sequences using IGoR [16] to determine
`
[Fig. 3 graphic not reproduced. Panels a–k. Recoverable labels:
a, Venn diagram of shared clonotype percentages (0.022% shared by all
ten subjects); b, shared clonotype frequency versus number of shared
subjects, and probability of observed clonotype sharing frequency with
the Bonferroni-corrected P = 0.05 threshold marked, for observed,
synthetic (default) and synthetic (subject-specific) sets;
c, d, frequency and cumulative frequency versus CDRH3 length (IMGT);
e, Shannon entropy versus CDRH3 length (IMGT) for unshared, shared,
unshared (synthetic) and shared (synthetic) clonotypes; f, g, sequence
logos by CDRH3 position (torso and head); h, relative abundance of
acidic, basic, hydrophobic and polar residues; i, mutations versus
number of observations; j, mutations versus number of subjects;
k, nucleotide mutations versus number of subjects (2 to 10), with group
sizes of 215,799, 9,390, 2,202, 765, 275, 114, 58, 22 and 3 sequences.]
`
`Fig. 3 | Shared clonotypes and sequences amongst the 10 subjects.
`a, Venn diagram of shared clonotype frequency. b, Shared clonotype
`frequency between subject groups. Points represent different group
`combinations. Observed sequences (black), synthetic sequences generated
`with IGoR’s default model (red) and sequences generated with subject-
`specific models (blue) are shown. c, Distribution of CDRH3 lengths for
`clonotypes found in one biological replicate (top) or all six biological
`replicates (bottom). CDRH3 length is defined using IMGT numbering.
`The colour key legend is split to maintain legibility; data for all subjects
`are present in both plots. d, Distribution of CDRH3 length for unshared
`clonotypes (top) or clonotypes shared by the majority of subjects (bottom).
`Observed sequences (black), default model (red) and subject-specific
`model (blue) synthetic sequences are shown. e, Per position Shannon
`entropy of the CDRH3 head regions of unshared (solid) or majority-
`shared (dashed) clonotypes. Points indicate the mean, whiskers indicate
`the 95% confidence interval, and lines represent the linear best fit.
`
`f, g, Sequence logos of the CDRH3s encoded by observed unshared
`clonotypes, observed majority-shared clonotypes and synthetic majority-
`shared clonotypes of length 8 (f) or 13 (g). Head-region amino acid
`colouring: polar amino acids (GSTYCQN) are green; basic amino acids
`(KRH) are blue; acidic amino acids (DE) are red; and hydrophobic amino
`acids (AVLIPWFM) are black. All torso residues are grey. h, Relative
`abundance of amino acid properties in the CDRH3s of majority-shared
`clonotypes. Abundances are normalized to the frequency in unshared
`clonotypes. i, Nucleotide mutations for singly observed or repeatedly
`observed clonotypes. Coloured lines indicate the mean for each subject;
`dashed black line indicates the mean of all subjects. j, Nucleotide
`mutations for shared or unshared clonotypes. Coloured lines indicate the
`mean for each subject; dashed black line indicates the mean of all subjects.
`k, Mutation frequency of nucleotide sequences shared by two or more
`subjects. Points indicate mean mutation frequency. The number of unique
`nucleotide sequences in each shared group is shown.
`
the expected frequency of clonotype sharing due to coincident V(D)J
recombination. Synthetic sequence sets were generated using three
different recombination models: (1) IGoR’s default model, inferred
from unproductive antibody rearrangements and thus focused only
on parameters related to V(D)J recombination; (2) subject-specific
recombination models inferred from unmutated sequences from each
subject; and (3) a combined-subject recombination model inferred
from a pool of unmutated sequences drawn from all subjects. For each
model, 10 batches of 10^8 sequences were generated, for a total of
3 billion synthetic sequences. In the sequence sets generated with IGoR’s
default model, clonotype sharing was sevenfold lower than in human
repertoires (0.0032%; Fig. 3b), which indicates that coincident V(D)J
recombination alone is not sufficient to explain the observed sharing.
The subject-derived synthetic sequence sets showed much more sharing
(0.1% and 0.16% for the subject-specific and combined-subject models,
respectively; Fig. 3b, Extended Data Fig. 8). In
addition to containing information about V(D)J recombination, the
subject-derived models also implicitly encode information about the
selection processes involved in B cell development. The increased
frequency of clonotype sharing in subject-derived synthetic datasets
indicates that the sieving effect of B cell development produces naive
repertoires that are more similar than recombination alone would
be expected to produce. Combined with our observation that
naive-enriched repertoires are more similar to each other than are
class-switched repertoires (Fig. 1h), a model emerges in which individual
repertoires are very dissimilar after V(D)J recombination, are
homogenized during B cell development and become increasingly
individualized following differential responses to immunological exposure.
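
The hypergeometric argument for chance sharing, mentioned earlier in
this paragraph, can be sketched in Python. The sketch assumes a
simplified null model in which two subjects each draw their unique
clonotypes uniformly at random, without replacement, from one common
pool whose size is the cohort-wide diversity estimate. The function
names and the toy numbers are illustrative, and the authors' exact
procedure may differ.

```python
from math import exp, lgamma


def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k), via log-gamma."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def hypergeom_pmf(k, pool, marked, draws):
    """P(overlap == k): `draws` items taken from `pool`, `marked` of interest."""
    lowest = max(0, draws + marked - pool)
    highest = min(marked, draws)
    if k < lowest or k > highest:
        return 0.0
    log_p = (
        log_choose(marked, k)
        + log_choose(pool - marked, draws - k)
        - log_choose(pool, draws)
    )
    return exp(log_p)


def prob_at_least(k, pool, marked, draws):
    """P(overlap >= k) under the same model."""
    highest = min(marked, draws)
    return sum(
        hypergeom_pmf(i, pool, marked, draws)
        for i in range(max(k, 0), highest + 1)
    )


def expected_overlap(pool, marked, draws):
    """Mean number of clonotypes shared by chance."""
    return draws * marked / pool


# Toy numbers: a pool of 1,000 clonotypes, subjects with 50 and 40 of them.
chance_mean = expected_overlap(1000, 50, 40)
chance_tail = prob_at_least(10, 1000, 50, 40)
```

Working in log space keeps the binomial coefficients from overflowing,
which matters once the pool size approaches realistic repertoire
estimates. For pools that large, summing the tail term by term becomes
slow, and an approximation would be needed.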
`The length distributions of CDRH3s in unique and repeatedly
`observed clonotypes were similar, whereas short CDRH3s were much
`more common in shared clonotypes (Fig. 3c, d). The skew towards
`short CDRH3s in the shared population is probably due to the
`increased probability of similar recombination events among shorter
`CDRH3s. By contrast, repeatedly observed clonotypes are more often
`the result of clonal expansion, as evidenced by their increased mutation
`
`frequency (Fig. 3i). Shared nucleotide sequences showed a strong
`inverse relationship between mutation frequency and the number of
`shared subjects (Fig. 3k); almost all sequences shared by four or more
`subjects were unmutated. Thus, although coincident recombination
`infrequently produces identical antibody sequences, the likelihood of
`coincident recombination being linked to an identical set of somatic
`mutations is exceptionally low.
Antibody CDRH3s can be divided into two primary regions: the
framework-proximal ‘torso’ and the more-variable ‘head’ [17,18]. When
comparing size-matched samples of shared and unshared clonotypes,
we noted less diversity in the head regions of shared clonotypes.
Furthermore, head-region diversity in shared clonotypes was inversely
related to length of CDRH3, which is a relationship that is not seen in
unshared clonotypes or synthetic repertoires (Fig. 3e). This inverse
relationship, along with the skewed distribution of CDRH3 lengths
in shared clonotypes (Fig. 3d), indicates that two distinct processes
shape the shared clonotype population. The shortest shared CDRH3s
encode head-region diversity similar to that of unshared CDRH3s and
synthetic CDRH3s of the same length (Fig. 3f). Thus, short CDRH3s
are probably shared primarily owing to their lower CDRH3 diversity
and concomitantly higher likelihood of independent generation by
coincident recombination. By contrast, longer shared CDRH3s are
less diverse than unshared or shared synthetic populations (Fig. 3g),
and more commonly encode head regions that are enriched in polar,
uncharged residues and lack hydrophobic residues (Fig. 3h). This
implies the existence of a mechanism by which these shared
clonotypes are selected or enriched after recombination, on the basis of
the biochemical properties of their CDRH3 regions.
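
Head-region diversity of the kind plotted in Fig. 3e can be measured as
per-position Shannon entropy. The Python sketch below shows one way to
compute it for CDRH3s of equal length. The torso boundaries are passed
in explicitly because the text does not state them, the entropy is
reported in bits without any normalization, and the function names and
toy sequences are illustrative rather than taken from the study.

```python
from collections import Counter
from math import log2


def head_region(cdr3, torso_start, torso_end):
    """Remove `torso_start` leading and `torso_end` trailing torso residues."""
    if torso_start + torso_end >= len(cdr3):
        raise ValueError("CDRH3 is too short to contain a head region")
    return cdr3[torso_start:len(cdr3) - torso_end]


def positional_entropy(sequences):
    """Shannon entropy (bits) at each position of equal-length sequences."""
    if not sequences:
        raise ValueError("at least one sequence is required")
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must all have the same length")
    total = len(sequences)
    entropies = []
    for position in range(length):
        counts = Counter(s[position] for s in sequences)
        entropies.append(
            sum(-(n / total) * log2(n / total) for n in counts.values())
        )
    return entropies


# Toy CDRH3s (made up), with two torso residues assumed at each end.
cdr3s = ["ARDGYW", "ARDSYW", "ARDGYW", "ARDTYW"]
heads = [head_region(s, torso_start=2, torso_end=2) for s in cdr3s]
entropy_by_position = positional_entropy(heads)
```

A position where every sequence carries the same residue has zero
entropy, so lower values across the head region correspond to the
reduced diversity described for longer shared CDRH3s. Comparing groups
fairly requires size-matched samples, as the text notes, because
observed entropy grows with sample size.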
In summary, sequencing the circulating B cell population of ten
individuals at unprecedented depth has revealed repertoires that are
highly individualized and extremely diverse. We estimate cohort-wide
repertoire diversity of approximately 5 × 10^9 unique heavy-chain
clonotypes, and as many as 1 × 10^11 unique heavy-chain sequences.
This indicates that the paired antibody diversity available to the
circulating repertoire is very large, perhaps in the region of
10^16–10^18 unique antibody sequences. Despite this enormous diversity,
clonotypes are shared more frequently than would be expected from
coincident V(D)J recombination. Furthermore, we found that clonotype
sharing is probably driven primarily by selection processes related
to early B cell development rather than by convergent responses
to common antigens. The possible clinical and diagnostic applications
of sequencing the adaptive-immune repertoire are myriad; however,
much work remains to be done before these applications can
be implemented. The results described here are confined to circulating
B cells, which represent a minority of the total B cell population. The
repertoires of circulating and tissue-resident B cells are known to
differ [19], and these differences may influence overall repertoire
diversity and sharing. Furthermore, we have studied only ten individuals
from a limited age range (18–30 years) and geographical region at a single
time point. Much larger cohorts, representing diverse ethnicities,
geographies and ages, will be required to capture the true
population-wide repertoire diversity. Nevertheless, large-scale
sequencing of the human adaptive-immune repertoire holds immense potential.
Our use of high-level antibody-feature frequencies to differentiate
repertoires raises the possibility of identifying and classifying
discrete repertoire perturbations associated with autoimmune disease
and chronic infection. Furthermore, because the repertoire of
adaptive-immune receptors encodes a comprehensive record of an
individual’s immunological encounters, leveraging large-scale sequencing
of adaptive-immune receptors represents an appealing strategy for
diagnosing infection or deconvoluting infection histories. Finally, the
individuality of the baseline repertoire of each subject suggests that
the personalization of vaccine delivery and therapeutic intervention
may produce substantial benefits in the treatment and prevention of
infectious diseases.
`
`Online content
`Any methods, additional references, Nature Research reporting summaries, source
`data, statements of data availability and associated accession codes are available at
`https://doi.org/10.1038/s41586-019-0879-y.
`
`Received: 19 September 2017; Accepted: 22 November 2018;
`Published online xx xx xxxx.
`
1. Rajewsky, K. Clonal selection and learning in the antibody system. Nature 381, 751–758 (1996).
2. Alberts, B. et al. The Generation of Antibody Diversity (Garland Science, New York, 2002).
3. Boyd, S. D. & Crowe, J. E. Jr. Deep sequencing and human antibody repertoire analysis. Curr. Opin. Immunol. 40, 103–109 (2016).
4. Briney, B. & Burton, D. Massively scalable genetic analysis of antibody repertoires. Preprint at https://www.biorxiv.org/content/early/2018/10/19/447813 (2018).
5. Briney, B., Le, K., Zhu, J. & Burton, D. R. Clonify: unseeded antibody lineage assignment from next-generation sequencing data. Sci. Rep. 6, 23901 (2016).
6. Morbach, H., Eichhorn, E. M., Liese, J. G. & Girschick, H. J. Reference values for B cell subpopulations from infancy to adulthood. Clin. Exp. Immunol. 162, 271–279 (2010).
7. Morisita, M. Measuring of the dispersion of individuals and analysis of the distributional patterns. Mem. Fac. Sci. Kyushu Univ. Ser. E 2, 5–235 (1959).
8. Horn, H. S. Measurement of ‘overlap’ in comparative ecological studies. Am. Nat. 100, 419–424 (1966).
9. Setliff, I. et al. Multi-donor longitudinal antibody repertoire sequencing reveals the existence of public antibody clonotypes in HIV-1 infection. Cell Host Microbe 23, 845–854 (2018).
10. Chao, A. Estimating the population size for capture–recapture data with unequal catchability. Biometrics 43, 783–791 (1987).
11. Kaplinsky, J. & Arnaout, R. Robust estimates of overall immune-repertoire diversity from high-throughput measurements on samples. Nat. Commun. 7, 11881 (2016).
12. Chao, A. & Chiu, C.-H. Nonparametric Estimation and Comparison of Species Richness https://doi.org/10.1002/9780470015902.a0026329 (John Wiley & Sons, 2016).
13. Eren, M. I., Chao, A., Hwang, W.-H. & Colwell, R. K. Estimating the richness of a population when the maximum number of classes is fixed: a nonparametric solution to an archaeological problem. PLoS ONE 7, e34179 (2012).
14. DeKosky, B. J. et al. In-depth determination and analysis of the human paired heavy- and light-chain antibody repertoire. Nat. Med. 21, 86–91 (2015).
15. Arnaout, R. et al. High-resolution description of antibody heavy-chain repertoires in humans. PLoS ONE 6, e22365 (2011).
16. Marcou, Q., Mora, T. & Walczak, A. M. High-throughput immune repertoire analysis with IGoR. Nat. Commun. 9, 561 (2018).
17. Morea, V., Tramontano, A., Rustici, M., Chothia, C. & Lesk, A. M. Conformations of the third hypervariable region in the VH domain of immunoglobulins. J. Mol. Biol. 275, 269–294 (1998).
18. Finn, J. A. et al. Improving loop modeling of the antibody complementarity-determining region 3 using knowledge-based restraints. PLoS ONE 11, e0154811 (2016).
19. Briney, B. S., Willis, J. R., Finn, J. A., McKinney, B. A. & Crowe, J. E. Jr. Tissue-specific expressed antibody variable gene repertoires. PLoS ONE 9, e100839 (2014).
`
`Acknowledgements The authors thank all of the study subjects for their
`participation and the Genomic Services Laboratory at the HudsonAlpha
`Institute for Biotechnology for their sequencing expertise. This work was
`supported by the National Institute of Allergy and Infectious Diseases (Center
`for HIV/AIDS Vaccine Immunology and Immunogen Discovery, UM1AI100663
`(D.R.B.); Center for Viral Systems Biology, U19AI135995 (B.B.)), the
`International AIDS Vaccine Initiative (IAVI) through the Neutralizing Antibody
`Consortium SFP1849 (D.R.B.), and the Ragon Institute of MGH, MIT and
`Harvard (D.R.B.).
`
`Author contributions B.B. and D.R.B. planned and designed the experiments.
`B.B., A.I. and C.J. performed experiments. B.B. analysed data. B.B. and D.R.B.
`wrote the manuscript. All authors contributed to manuscript revisions.
`
`Competing interests The authors declare no competing interests.
`
`Additional information
Extended data is available for this paper at https://doi.org/10.1038/s41586-019-0879-y.
Supplementary information is available for this paper at https://doi.org/10.1038/s41586-019-0879-y.
Reprints and permissions information is available at http://www.nature.com/reprints.
`Correspondence and requests for materials should be addressed to B.B. or
`D.R.B.
`Publisher’s note: Springer Nature remains neutral with regard to jurisdictional
`claims in published maps and institutional affiliations.
`
`METHODS
`No statistical methods were used to predetermine sample size. The experiments
`were not randomized and investigators were not blinded to allocation during
`experiments and outcome assessment.
`Leukapheresis samples. Full leukopaks (three blood volumes) were obtained from
`ten human subjects (Hemacare). Samples were collected at Hemacare’s Southern
`California donor centre. Sample collection was performed under a protocol
`approved by the Institutional Research Boards of Scripps Research and Hemacare.
`Informed consent was obtained from each subject. All subjects were healthy, HIV-
`negative adults betwe