The Database for Aggregate Analysis of ClinicalTrials.gov (AACT) and Subsequent Regrouping by Clinical Specialty
`
`Asba Tasneem1*, Laura Aberle1, Hari Ananth1, Swati Chakraborty1, Karen Chiswell1, Brian J. McCourt1,
`Ricardo Pietrobon1,2
`
`1 Duke Clinical Research Institute, Durham, North Carolina, United States of America, 2 Department of Surgery, Duke University School of Medicine, Durham, North
`Carolina, United States of America
`
`Abstract
`
`Background: The ClinicalTrials.gov registry provides information regarding characteristics of past, current, and planned
`clinical studies to patients, clinicians, and researchers; in addition, registry data are available for bulk download. However,
`issues related to data structure, nomenclature, and changes in data collection over time present challenges to the
`aggregate analysis and interpretation of these data in general and to the analysis of trials according to clinical specialty in
`particular. Improving usability of these data could enhance the utility of ClinicalTrials.gov as a research resource.
`
`Methods/Principal Results: The purpose of our project was twofold. First, we sought to extend the usability of
`ClinicalTrials.gov for research purposes by developing a database for aggregate analysis of ClinicalTrials.gov (AACT) that
`contains data from the 96,346 clinical trials registered as of September 27, 2010. Second, we developed and validated a
`methodology for annotating studies by clinical specialty, using a custom taxonomy employing Medical Subject Heading
`(MeSH) terms applied by an NLM algorithm, as well as MeSH terms and other disease condition terms provided by study
`sponsors. Clinical specialists reviewed and annotated MeSH and non-MeSH disease condition terms, and an algorithm was
`created to classify studies into clinical specialties based on both MeSH and non-MeSH annotations. False positives and false
`negatives were evaluated by comparing algorithmic classification with manual classification for three specialties.
`
`Conclusions/Significance: The resulting AACT database features study design attributes parsed into discrete fields,
`integrated metadata, and an integrated MeSH thesaurus, and is available for download as Oracle extracts (.dmp file and text
`format). This publicly-accessible dataset will facilitate analysis of studies and permit detailed characterization and analysis of
`the U.S. clinical trials enterprise as a whole. In addition, the methodology we present for creating specialty datasets may
`facilitate other efforts to analyze studies by specialty groups.
`
`Citation: Tasneem A, Aberle L, Ananth H, Chakraborty S, Chiswell K, et al. (2012) The Database for Aggregate Analysis of ClinicalTrials.gov (AACT) and Subsequent
`Regrouping by Clinical Specialty. PLoS ONE 7(3): e33677. doi:10.1371/journal.pone.0033677
`
`Editor: Joel Joseph Gagnier, University of Michigan, United States of America
`
`Received October 14, 2011; Accepted February 14, 2012; Published March 16, 2012
Copyright: © 2012 Tasneem et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits
`unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
`
`Funding: Financial support for this work was provided by cooperative agreement U19 FD003800 awarded by the U.S. Food and Drug Administration to Duke
`University in support of the Clinical Trials Transformation Initiative. The funders had no role in study design, data collection and analysis, decision to publish, or
`preparation of the manuscript.
`
`Competing Interests: The authors have declared that no competing interests exist.
`
`* E-mail: asba.tasneem@duke.edu
`
`Introduction
`
ClinicalTrials.gov (www.ClinicalTrials.gov) is a registry of human clinical research studies. It is hosted by the National
`Library of Medicine (NLM) at the National Institutes of Health
(NIH) in collaboration with the U.S. Food and Drug Administration (FDA). As mandated by federal law [1], ClinicalTrials.gov
`provides a central resource for information about clinical trials; in
`addition, it increases the public visibility of such research. The
registry currently contains over 100,000 research studies conducted in more than 170 countries and is widely used both by medical
`professionals and the public. New research studies are being
`submitted to the registry by their respective sponsors (or sponsors’
`designees) at a rate of approximately 350 per week [2]. Due to
`legislative [1] and institutional [3] requirements enacted in the
`latter half of
`the previous decade, compliance with registry
`obligations is assumed to be high for U.S. drug and device trials,
`
`and the consistency, quality, and maintenance of registry data
`have improved with increased use [4]. However, the registry has
`not been optimized for the analysis of aggregate data, and a
`systematic effort to create and maintain a database for this purpose
`has not previously been undertaken.
`In November 2007, the FDA and Duke University announced
`the formation of a public-private partnership to improve the
`quality and efficiency of clinical trials. This collaboration of more
`than 60 organizations and government agencies was convened by
`Duke University under a memorandum of understanding with
`FDA, and is now known as the Clinical Trials Transformation
Initiative (CTTI) [5]. CTTI leaders recognized that ClinicalTrials.gov represented a promising source for benchmarking the
`state of the clinical trials enterprise, as the registry contains studies
`from the full range of sponsoring organizations. Increasing the
`usability of ClinicalTrials.gov data may therefore facilitate
`systematic evaluation of clinical studies aimed at building the
`
knowledge base needed to inform medical practice and prevention.
`As data have accumulated in ClinicalTrials.gov, users have
`increasingly sought capabilities
`that would allow aggregated
`descriptive characterization of
`the national research portfolio;
`however, access and data usability issues, including data format
`and design, present obstacles. A number of related initiatives,
`including the Ontology of Clinical Research (OCRe) [6], Human
`Studies Database (HSDB) [7], CDISC Protocol Representation
`Model [8], and LinkedCT [9] projects, are addressing ontological
`annotations, large-scale data mining, data representation format,
`and external association of these data, respectively. The results of
`this project are complementary to these initiatives and are
`expected to collectively advance this area of study as a whole.
`In this article, we report on CTTI’s efforts to prepare and
`maintain a publicly accessible analysis dataset derived from
`ClinicalTrials.gov content—the database for aggregate analysis
`of ClinicalTrials.gov (AACT). We also discuss efforts to extend the
`
`utility of the analysis dataset by means of an associated clinical
`specialty taxonomy designed to support research policy analyses.
`
`Methods
`
`1. Creation of the AACT
`Key design features of AACT include 1) the capacity to extend
`the dataset by parsing existing data; 2) linking to additional data
resources, such as the Medical Subject Headings (MeSH) thesaurus; and 3) integrated metadata. A framework for extensions
`allows entire studies or individual fields to be associated with new
`data resources while preserving provenance. In addition,
`the
`integrated data dictionary developed for this project facilitates
`browsing and analysis of ClinicalTrials.gov and AACT metadata.
`Finally,
`the database incorporates a flexible design that can
`accommodate future developments, such as coding biospecimen
`type, sponsors, and OCRe annotations. Figure 1 shows key
`enhancements achieved by building the AACT.
`
`Figure 1. A schematic representation of the database for Aggregate Analysis of ClinicalTrials.Gov (AACT) with its key
`enhancements.
`doi:10.1371/journal.pone.0033677.g001
`
`PLoS ONE | www.plosone.org
`
`2
`
`March 2012 | Volume 7 |
`
`Issue 3 | e33677
`
`MPI EXHIBIT 1065 PAGE 2
`
`
`
`Database for Aggregate Analysis of CT.gov
`
`1.1. Data Sources. A dataset comprising 96,346 clinical
`studies was downloaded from ClinicalTrials.gov in XML format
`on September 27, 2010. We chose ClinicalTrials.gov for our study
`because it is the largest database of its kind and because it covers
`the full range of clinical conditions, includes a broad group of trial
`sponsors [10], and has a regulatory mandate [1]. The date of
`download was chosen to coincide with the anniversary of the
`enactment of the FDA Amendments Act (FDAAA) 3 years earlier,
`which mandated the registration of certain trials of FDA-regulated
`drugs, biologics, and devices [1].
We downloaded the 2010 MeSH thesaurus (http://www.nlm.nih.gov/mesh/2010/download/termscon.html) and merged it
`with the AACT database, where it was used as a lookup table to
`locate corresponding tree numbers, referred to as MeSH IDs, for all
`MeSH terms associated with each clinical trial in ClinicalTrials.
`gov. Persons or organizations who submit studies to the registry
`are requested to provide the condition and keyword data elements as
`MeSH terms.
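As an illustration of how the merged thesaurus serves as a lookup table, the sketch below retrieves tree numbers (MeSH IDs) for the MeSH terms attached to each study. The table and column names (CONDITION_BROWSE, MESH_THESAURUS, NCT_ID, MESH_TERM, TREE_NUMBER) are simplified assumptions rather than the exact AACT schema.

-- Sketch (assumed names): look up tree numbers ("MeSH IDs") for the MeSH
-- terms associated with each study by joining against the merged thesaurus.
SELECT cb.nct_id,
       cb.mesh_term,
       mt.tree_number AS mesh_id
  FROM condition_browse cb
  JOIN mesh_thesaurus mt
    ON UPPER(mt.mesh_term) = UPPER(cb.mesh_term)
 ORDER BY cb.nct_id, mt.tree_number;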
`1.2. Data Model. ClinicalTrials.gov data element definitions,
`xsd specifications for registry data submission, and downloaded
`
`study XML files were used to represent data specifications for the
`downloaded data. A physical data model was designed using
`Enterprise Architect (Sparx Systems Pty Ltd, Creswick, Victoria,
`Australia); this model depicted data tables and their data columns,
`as well as relationships between and among tables. An optimal
`structure was achieved through normalization, which was used to
`organize data efficiently, eliminate redundancy, and ensure logical
`data dependencies by storing only related data within a given table
`[11]. The database (Figure 2) was normalized to the Second
`Normal Form (2NF), a set of criteria designed to prevent logical
`inconsistencies while reducing data redundancy [12].
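As a simplified illustration of this normalized layout (assumed, abbreviated names rather than the actual AACT DDL depicted in Figure 2), a study is stored once and its repeating condition terms are held in a child table keyed to the study identifier:

-- Simplified sketch (assumed names, not the AACT DDL): repeating condition
-- terms live in a child table that references the study, rather than being
-- stored as repeated columns on the study record.
CREATE TABLE studies (
  nct_id      VARCHAR2(11) PRIMARY KEY,
  brief_title VARCHAR2(300),
  study_type  VARCHAR2(30)
);

CREATE TABLE conditions (
  nct_id         VARCHAR2(11) NOT NULL REFERENCES studies (nct_id),
  condition_name VARCHAR2(500)
);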
`We assigned data type and length of data elements based on
`patterns observed for each data element in the downloaded XML
`files. Whenever possible, we followed guidelines provided in
`ClinicalTrials.gov’s draft Protocol Data Element Definitions [13]
`when assigning lengths to given data elements. Data were housed
`in Oracle RDBMS, version 11.1 g (Oracle Corporation, Redwood
`Shores, California, USA). Enterprise Architect 7.1 was used for
database design, and additional transformation rules were documented as extract-transform-load (ETL) specifications.
`
`Figure 2. High-level Entity-Relationship Diagram (ERD) for AACT.
`doi:10.1371/journal.pone.0033677.g002
`
`
`
`
Table 1. Escape characters and replacements.

[Two-column table pairing each escape character encountered in the XML files (e.g., ’, ", &, ,) with the character reference used to replace it.]

doi:10.1371/journal.pone.0033677.t001
`
PL/SQL packages were developed that used Oracle's inbuilt DBMS_LOB package to read the input XML files and load the data into the designed tables (a simplified sketch appears below). Quality control and
`operational support processes were developed using standard
`SQL queries through Toad for Data Analysts (Quest Software,
`Aliso Viejo, CA, USA) and Cognos ReportNet (CRN)
`(IBM
`Corporation, Armonk, NY, USA). We extended the core data
`model to accommodate both data management and data curation
`purposes. Error log tables and indexes were created for testing,
debugging, and performance enhancement. Manual user acceptance testing was performed by randomly selecting five studies per
`data element (from a total of 109 data elements) from the AACT
`database. The values associated with each data element were
`tested for correctness and completeness by comparing them with
the original source data from downloaded XML files. We also
`created integrated data dictionary tables as reference tables
`holding explicit data element definitions and system metadata
`(Tables S1 and S2).
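A highly simplified sketch of the loading step is shown below. It assumes a staging table (XML_STAGING) holding each downloaded file as a CLOB and uses XMLTABLE to pull a few representative fields; the element paths follow the registry's public XML layout, while the table and column names are assumptions. The production ETL was implemented as PL/SQL packages with DBMS_LOB file reads, error-log tables, and the full set of 109 data elements, none of which is shown here.

-- Simplified sketch (assumed names): extract a few fields from study XML
-- already staged as CLOBs and insert them into the relational tables.
INSERT INTO studies (nct_id, brief_title, study_type)
SELECT x.nct_id, x.brief_title, x.study_type
  FROM xml_staging s,
       XMLTABLE('/clinical_study'
                PASSING XMLTYPE(s.xml_clob)
                COLUMNS
                  nct_id      VARCHAR2(11)  PATH 'id_info/nct_id',
                  brief_title VARCHAR2(300) PATH 'brief_title',
                  study_type  VARCHAR2(30)  PATH 'study_type') x;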
`During the course of database development, the NLM made
`several new data elements available for public download, some of
`which included information about the FDA (e.g., Section 801
`clinical
`trials,
`studies with FDA-regulated interventions, and
`expanded-access studies). In addition to these, MeSH condition
`and intervention terms generated by the NLM algorithm were also
`made available for public download.
`In XML files downloaded from ClinicalTrials.gov, the single
`data element Study Design contains a string of concatenated values
`for various different components of a study design, such as primary
`purpose,
`interventional model, observational model, allocation,
`endpoint classification, time perspective, and masking. While this
`format is well-suited for supporting information retrieval, it does
not readily accommodate aggregate data analysis of the components within the Study Design data element. For this reason, data
`from Study Design was parsed into its components and stored in a
`separate table called DESIGNS. Additional data elements (Design
`Name and Design Value) were created to store all components of
`study design and their respective enumerated values. Values
`related to masking/blinding (e.g., Single; Double-Blind) were further
`parsed into their components, along with the list of corresponding
masking subjects (Participant, Investigator, Outcome Assessor, and Caregiver).
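A minimal PL/SQL sketch of this parsing step follows. The DESIGNS table and its Design Name/Design Value columns correspond to the structure described above, but the sample string, the NCT number, and the simple comma/colon splitting are illustrative only; values containing embedded commas (such as the parenthesized masking subject list) need additional handling.

-- Simplified sketch: split a concatenated Study Design string into
-- "Design Name"/"Design Value" pairs and store one row per pair in DESIGNS.
-- The sample string and NCT number are placeholders.
DECLARE
  v_design VARCHAR2(4000) :=
    'Allocation: Randomized, Masking: Double-Blind, Primary Purpose: Treatment';
  v_pair   VARCHAR2(4000);
  i        PLS_INTEGER := 1;
BEGIN
  LOOP
    v_pair := REGEXP_SUBSTR(v_design, '[^,]+', 1, i);
    EXIT WHEN v_pair IS NULL;
    INSERT INTO designs (nct_id, design_name, design_value)
    VALUES ('NCT00000000',
            TRIM(REGEXP_SUBSTR(v_pair, '^[^:]+')),
            TRIM(REGEXP_SUBSTR(v_pair, '[^:]+$')));
    i := i + 1;
  END LOOP;
  COMMIT;
END;
/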
Figure 3. Percentage of interventional studies with complete data by registration year for selected data elements.
doi:10.1371/journal.pone.0033677.g003

Several challenges were encountered while loading the database, including foreign characters embedded in XML files with most of the data elements; these had to be replaced with character references (see Table 1 for examples).
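As a single illustrative example (Table 1 lists the characters actually handled), a right single quotation mark can be rewritten as its numeric character reference before loading; the staging table and column names below are assumptions.

-- Illustrative only: rewrite one special character as its numeric character
-- reference (&#8217; is the reference for the right single quotation mark).
UPDATE xml_staging
   SET xml_clob = REPLACE(xml_clob, '’', '&#8217;');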
`Other circumstances that prompted several database design
`iterations included the facts that the maximum length for each
`data element noted by ClinicalTrials.gov’s May 2010 Protocol
`Data Element Definitions document was not always consistent
`with the complete dataset, and one-to-one or one-to-many
`
`relationships between or among data elements were not obvious
`in the XML data type definition from ClinicalTrials.gov.
`1.3. Quality Assessment. Of the 96,346 studies downloaded
`from ClinicalTrials.gov in September 2010, a total of 79,413
`(82.4%) were interventional (i.e., a study in which an investigator
`following a protocol assigns research participants
`to receive
`specific interventions, as opposed to an observational study),
`
Figure 4. An overview of the methodology and process of developing clinical specialty datasets. The INTERVENTIONS, CONDITIONS, and KEYWORDS tables consist of disease condition terms provided by data submitters that include both MeSH and non-MeSH terms. The INTERVENTION_BROWSE and CONDITION_BROWSE tables are populated by MeSH terms generated by the NLM algorithm. (a) Process illustrating how MeSH terms are created in ClinicalTrials.gov; the tables and data shown here do not represent the entire ClinicalTrials.gov database. (b) Process illustrating the annotation and validation of disease conditions. (c) Process illustrating the creation of specialty datasets.
`doi:10.1371/journal.pone.0033677.g004
`
`
`
`
`Table 2. MeSH Subject Headings, 2010—Diseases.
`
`Bacterial Infections and Mycoses [C01]
`
`Virus Diseases [C02]
`
`Parasitic Diseases [C03]
`
`Neoplasms [C04]
`
`Musculoskeletal Diseases [C05]
`
`Digestive System Diseases [C06]
`
`Stomatognathic Diseases [C07]
`
`Respiratory Tract Diseases [C08]
`
`Otorhinolaryngologic Diseases [C09]
`
`Nervous System Diseases [C10]
`
`Eye Diseases [C11]
`
`Male Urogenital Diseases [C12]
`
`Female Urogenital Diseases and Pregnancy Complications [C13]
`
`Cardiovascular Diseases [C14]
`
`Hemic and Lymphatic Diseases [C15]
`
`Congenital, Hereditary, and Neonatal Diseases and Abnormalities [C16]
`
`Skin and Connective Tissue Diseases [C17]
`
`Nutritional and Metabolic Diseases [C18]
`
`Endocrine System Diseases [C19]
`
`Immune System Diseases [C20]
`
`Disorders of Environmental Origin [C21]
`
`Animal Diseases [C22]
`
`Pathological Conditions, Signs and Symptoms [C23]
`
`Available at: http://www.nlm.nih.gov/mesh/trees.html
`
`doi:10.1371/journal.pone.0033677.t002
`
`16,506 (17.1%) were observational, 107 (0.1%) were expanded-
`access, and 320 had no information about the study type. We
`analyzed selected data elements in interventional studies for
`completeness of data (e.g., a null value in the data element) and
`observed a trend toward increasing completeness of data over
`time. This trend appears to have been notably affected by two
`milestones in the history of ClinicalTrials.gov. In September 2004,
the International Committee of Medical Journal Editors (ICMJE)
`published a policy requiring registration of interventional trials as
`a condition of publication [3]. The ICMJE requirements took
`effect in September 2005, which may account for the increase in
`completeness for some data elements in 2005 (Figure 3).
`In September 2007, the FDAAA [1] made the registration of
`interventional studies mandatory. This requirement took effect in
December 2007 and may further account for increases in the completeness of data elements in the ClinicalTrials.gov dataset.

Table 3. Frequency of intermediate terms and top node terms that did not match annotations of lower-level terms.

Specialty        n/N (%)
Cardiology       172/5264 (3.3%)
Oncology         284/5264 (5.4%)
Mental health    93/5264 (1.8%)

n = number of intermediate- and top-node MeSH terms for a given specialty that do not match the annotations of their lower-level terms. N = total number of intermediate- and top-node MeSH terms.
doi:10.1371/journal.pone.0033677.t003

In Figure 3, the data elements "data monitoring committee" and "number of arms" were not available at the time that earlier studies were registered. It is important to note that the presence of these data elements for studies pre-dating December 2007 reflects later updates performed by data providers.
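The completeness trend summarized in Figure 3 can be reproduced from AACT with a query of roughly the following form; the column names (e.g., a registration-date field and the "number of arms" element) are assumptions rather than the exact AACT identifiers.

-- Sketch (assumed names): per registration year, the percentage of
-- interventional studies with a non-null value for a selected data element.
SELECT EXTRACT(YEAR FROM s.first_received_date) AS registration_year,
       ROUND(100 * COUNT(s.number_of_arms) / COUNT(*), 1) AS pct_complete
  FROM studies s
 WHERE s.study_type = 'Interventional'
 GROUP BY EXTRACT(YEAR FROM s.first_received_date)
 ORDER BY registration_year;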
1.4. Changes in ClinicalTrials.gov's Protocol Data Element Definitions. The ClinicalTrials.gov Protocol Data
`Element Definitions (PDED) have evolved since the database was
`first launched. Although references containing individual protocol
`data element definitions are provided for submitters with each
release of the definitions document, there is no single document that tracks changes to all data elements for review as data specifications. These changes include changing enumerated values for a
`data element, revising a data element definition, making a
`particular data element publicly available,
`introducing a new
`data element, and entirely deleting a data element. However, more
`rigorous submission rules imposed by mandating organizations
`(e.g., NLM, FDA), such as those required by the FDAAA and
`ClinicalTrials.gov, appear to have had the greatest impact on the
`completeness of data.
`Changes to a data element play a significant role in the analysis
`of study data. As we examined each data element’s history, we
`noted that between September 2004 and July 2005 (a period
`spanning 3 releases of the PDED), and again in December 2007,
`the data element requirements were not documented in the
`definitions document. Other inconsistencies were also noted and
`later confirmed (Personal communication, Dr. Deborah Zarin and
`Mr. Nicholas Ide, February 18, 2011).
`1.5. A Public Resource. The AACT can be downloaded as
Oracle extracts (.dmp file and text format output; available at https://www.trialstransformation.org/projects/improving-the-public-interface-for-use-of-aggregate-data-in-clinicaltrials.gov/aact-database-for-aggregate-analysis-of-clinicaltrials.gov). Additional documents are
`available to assist users in interpreting the data. The high-level data
`dictionary and a comprehensive data dictionary noted previously are
`included in the dataset file. The comprehensive data dictionary
`contains seven sections: 1) current variables, 2) enumerations, 3)
`constraints, 4) record counts, 5) database schema, 6) comprehensive
`change history, and 7) variable history dates. This document provides
`definitions, derivation of terms, data model structure and references,
`NLM and FDAAA requirements, and historical information for each
`data element in ClinicalTrials.gov to facilitate understanding of when
`variables were added, modified, or discontinued. The high-level data
`dictionary provides a summary view of the variables contained in the
`AACT database.
`
`2. A Methodology to Regroup Studies in
`ClinicalTrials.Gov by Specialty
ClinicalTrials.gov contains studies from multiple clinical domains. While the AACT database facilitates the aggregate
`analysis of the entire dataset, it does not in itself support analysis
`within specific specialty domains. We therefore developed a
`methodology to re-group studies from ClinicalTrials.gov by
`clinical specialties as designated by the Department of Health
`and Human Services [14]. In doing so, we relied on MeSH
`condition terms and free-text disease condition terms associated
`with each study in the ClinicalTrials.gov database—a method
`that can be used to develop other specialized datasets for
`analysis.
`2.1. Use of MeSH Terminology in the ClinicalTrials.gov
`Database. Data submitters (study sponsors or their designees)
`are requested to provide Condition and Keywords data as MeSH
terms when registering a study.

Figure 5. MeSH trees for acromegaly. Source: 2010 online MeSH thesaurus (available: http://www.nlm.nih.gov/cgi/mesh/2010/MB_cgi).
doi:10.1371/journal.pone.0033677.g005

Additionally, an NLM algorithm also evaluates studies and applies MeSH terms according to the
`following steps: 1) study records are checked for the presence of a
`MeSH term,
`including synonyms and lexical variations; 2)
`weighted scores are computed for all matches, with exact
`matches, lexical variations, and synonyms receiving descending
`proportional weight; 3) very common terms are excluded to avoid
`confounding; 4)
`location by data element
`is considered and
`weighted in the term scoring process; and 5) terms with scores
`exceeding the cutoff value are applied to the respective studies.
`(Note that the output from steps 1 and 2 is used for both condition
`and intervention annotations; the field weights are different for
`each and divert
`terms into the target annotation type.) This
`method does not consider the natural-language context
`for
`matched terms or ontologically related concepts that would add
specificity. Neither the terms supplied by data submitters nor those applied by the NLM algorithm are associated with a particular MeSH hierarchy. The resulting annotated MeSH terms are visible on
`the ClinicalTrials.gov website and populated in the condition_browse
`and intervention_browse fields in the downloaded XML file for each
`study. Figure 4 illustrates how MeSH terms are created in the
`ClinicalTrials.gov database.
`
`2.2. MeSH Disease Conditions Annotation. Condition
`and intervention terms in the MeSH thesaurus are arrayed in
`hierarchical branching structures, called trees; each branching
point is referred to as a node. Node levels range from 1 (highest) to 12 (lowest) in the 2010 version of the MeSH thesaurus. For
`example, one high-level category that we used to classify studies by
`clinical specialty was Diseases. In the 2010 MeSH thesaurus, this
`category contains 23 subcategories (Table 2).
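Tree numbers encode their position in the hierarchy, so every descendant of a subcategory shares its prefix; for example, all terms under Cardiovascular Diseases [C14] (see Table 2) can be located as sketched below, with the thesaurus table and column names assumed.

-- Sketch (assumed names): all MeSH terms anywhere under the
-- Cardiovascular Diseases [C14] subtree, found by tree-number prefix.
SELECT mesh_term, tree_number
  FROM mesh_thesaurus
 WHERE tree_number = 'C14'
    OR tree_number LIKE 'C14.%';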
`In order to create specialty datasets from the larger AACT
`dataset, we selected four high-level MeSH nodes
`from the
`thesaurus to serve as an initial basis for identifying studies by
`clinical specialty. Reviewers with relevant subject matter expertise
`annotated MeSH terms from the following high-level nodes: 1)
`Diseases; 2) Analytical, Diagnostic and Therapeutic Techniques and
`Equipment; 3) Psychiatry and Psychology; and 4) Phenomena and Processes.
A total of 18,491 MeSH IDs associated with 9,031 MeSH terms
`were reviewed and annotated by clinical specialists belonging to
`one of the 13 clinical specialties and five sub-specialties, which
`were selected on the basis of availability of faculty representation
`and volunteers at Duke, as well as intention to analyze subsets of
`data by clinical specialty. Participating specialty annotations
`
`PLoS ONE | www.plosone.org
`
`7
`
`March 2012 | Volume 7 |
`
`Issue 3 | e33677
`
`MPI EXHIBIT 1065 PAGE 7
`
`
`
`Database for Aggregate Analysis of CT.gov
`
`Figure 6. Rules for deciding whether a given study belongs to a given specialty.
`doi:10.1371/journal.pone.0033677.g006
`
`included cardiology, dermatology, endocrinology, gastroenterol-
`ogy,
`immunology/ rheumatology,
`infectious diseases, mental
`health, nephrology, neurology, oncology, otolaryngology, pulmo-
`nary medicine, reproductive medicine, while subspecialty anno-
`tations included peripheral vascular disease, peripheral arterial
`disease, diabetes,
`thyroid disease, and bone disease. The
`association of terms with clinical specialties was performed in
`the context of the anticipated analysis of the data subset for
`respective specialties. The results of this extension to the AACT
`database,
`including specialty tags, will be shared in future
`publications.
`2.3. Validation of Inconsistently Annotated MeSH Terms
`and Limitations of Using the MeSH Hierarchy. A term
`occurring at a particular node ‘‘node x’’ (parent) may have several
`branches (children) at node x+1 that provide a finer classification
`of the node-x term. Clinical specialists were advised to review the
`hierarchy of an individual MeSH term during the annotation
`process. Annotated MeSH descriptors were programmatically
`reviewed for hierarchical inconsistencies in order to maintain the
`logical relationship between parent and child MeSH descriptors.
`
Table 4. Number of studies reviewed by each set of clinician reviewers.

Reviewer A ID    Reviewer B ID    Studies reviewed (n)
Clinician 1      Clinician 2      200
Clinician 1      Clinician 3      400*
Clinician 4      Clinician 5      200
Clinician 6      Clinician 7      200

*The combination of Clinician 1 ("A") and Clinician 3 ("B") together reviewed 2 batches of studies.
doi:10.1371/journal.pone.0033677.t004
`
`Tag validity was evaluated by a process based on annotation rules.
`In general, selection or negation of a parent MeSH term should
`match with all subsequent child MeSH terms below that node.
`Hierarchical inconsistencies in MeSH annotations were flagged
`and accepted after further review and confirmation by clinical
specialists. The anticipated inconsistency of the MeSH hierarchical structure with clinical specialty groupings was confirmed in the validation process. Table 3 shows the frequency of parent terms that did not match with annotations for their child terms.
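One way to express such a check is a self-join on tree-number prefixes that flags parent and descendant terms whose specialty tags disagree; the annotation table and column names below are assumptions, and the actual process also included clinician review of the flagged terms.

-- Sketch (assumed names): flag parent/descendant MeSH annotations whose
-- specialty tags disagree, using tree-number prefixes to find descendants.
SELECT p.mesh_term      AS parent_term,
       c.mesh_term      AS child_term,
       p.specialty_tag  AS parent_tag,
       c.specialty_tag  AS child_tag
  FROM mesh_annotations p
  JOIN mesh_annotations c
    ON c.tree_number LIKE p.tree_number || '.%'
   AND c.specialty = p.specialty
 WHERE p.specialty_tag <> c.specialty_tag;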
`Further, a term might appear within more than one tree. For
`example, the MeSH term Acromegaly appears as part of multiple
`trees within the topmost MeSH hierarchical category of Diseases
(Figure 5). Depending on its hierarchical location, its context could fall under Musculoskeletal Diseases, Nervous System Diseases, or Endocrine System Diseases. Unfortunately, there currently is no way to differentiate among different tree numbers (MeSH IDs) for the
`same MeSH term. If a study contained the term Acromegaly, the
`three associated MeSH IDs could have conflicting tags (e.g., No,
`No, Yes) for a given specialty. This might result in erroneously
`including this study in a particular specialty dataset. As an
`additional validation check, all MeSH terms that had conflicting
`tags, as in the example above, were flagged and allowed to be
`adjudicated by clinical specialists.
`Tagging was summarized by MeSH term. For a given MeSH
`term, if all MeSH IDs had a Y tag (‘‘yes’’ or ‘‘true’’), then the
MeSH term was given a Y; if all MeSH IDs had an N tag (‘‘no’’ or ‘‘false’’), then the MeSH term was given an N tag; and if there
`was a mix of Y and N tags the term was given an A tag
`(‘‘ambiguous’’).
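This summarization can be written as a grouped query over the per-tree-number tags (assumed table and column names):

-- Sketch (assumed names): collapse the Y/N tags recorded for each tree
-- number (MeSH ID) into a single tag per MeSH term and specialty:
-- all Y -> Y, all N -> N, mixed -> A ("ambiguous").
SELECT mesh_term,
       specialty,
       CASE
         WHEN MIN(specialty_tag) = 'Y' THEN 'Y'   -- every MeSH ID tagged Y
         WHEN MAX(specialty_tag) = 'N' THEN 'N'   -- every MeSH ID tagged N
         ELSE 'A'                                 -- mix of Y and N
       END AS term_tag
  FROM mesh_annotations
 GROUP BY mesh_term, specialty;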
`2.4. Free-text Disease Conditions (non-MeSH condition
`terms): Annotation and Validation.
`In order to ascertain the
`condition being investigated in a given study, we also used the free-
`text condition terms provided by data submitters. These terms are
`
`
Table 5. Contingency table for identifying misclassification errors.

                   Algorithm
Manual review      Yes (Y)    No (N)    Ambiguous    Unclassified    Total
Yes (Y)            A          B         G            H               A+B+G+H
No (N)             C          D         I            J               C+D+I+J
Unknown            E          F         K            L               E+F+K+L
Total              A+C+E      B+D+F     G+I+K        H+J+L           T

The overall misclassification error rate divides the total number of errors by the total number of studies reviewed. The false-positive rate was determined using two methods: in the first, the false-positive rate was calculated among studies classified as N by manual review; in the second, the false-positive rate was calculated among studies classified as Y by the algorithm. The false-negative rate was evaluated in similar fashion: by dividing the number of false negatives by the number of studies classified as Y by manual review, or by the number of studies classified as N by the algorithm.
doi:10.1371/journal.pone.0033677.t005
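Using the cell labels above, and counting as errors only the disagreements between the algorithm's Y/N calls and the manual Y/N calls (an assumption about how the total number of errors is tallied), these definitions can be written as:

\[
\text{overall error rate} = \frac{B + C}{T}, \qquad
\text{FP}_{\text{manual N}} = \frac{C}{C + D + I + J}, \qquad
\text{FP}_{\text{algorithm Y}} = \frac{C}{A + C + E},
\]
\[
\text{FN}_{\text{manual Y}} = \frac{B}{A + B + G + H}, \qquad
\text{FN}_{\text{algorithm N}} = \frac{B}{B + D + F},
\]

where C denotes false positives (algorithm Y, manual N) and B denotes false negatives (algorithm N, manual Y).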
`
`visible on the ClinicalTrials.gov website and populated in the
`Condition field in the downloaded XML file for each study. Non-
`MeSH condition terms that appeared in five or more studies were
`also selected for specialty classification from interventional studies
`registered after September 27, 2007 (n = 40,970). These terms
`were reviewed by two independent clinicians from each relevant
`specialty; disagreements were adjudicated by a third independent
`reviewer.
`We elected to use both MeSH and non-MeSH disease condition
`terms for the following reasons: first, over 10% of studies do not
`have condition_browse mesh_terms; second, common terms may be
`excluded from the condition_browse mesh_terms annotation; and
`third, because of
`the potential
`for duplication or mismatch
`described above, reliance on indexing by MeSH term alone does
`
`not suffice for re-grouping studies in ClinicalTrials.gov by clinical
`specialty.
`2.5. Algorithm for Classifying Clinical Discipline. We
`used a combination of rules representing disease conditions and
`MeSH terms for classifying clinical specialty within interventional
`studies. We only included trials registered with ClinicalTrials.gov
`after September 27, 2007. The final
`list of annotated disease
`condition terms (MeSH and free-text) was used as a lookup table
`to create study datasets for individual specialties.
`For each specialty, studies were grouped according to the
`following rules (Figure 6):
`Group 1: Include a study in this group if any of its MeSH terms
`from the CONDITION_BROWSE table or condition terms were
`annotated with a Y (‘‘yes’’ or ‘‘true’’) for the specialty.
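A sketch of the Group 1 rule as a query is shown below; SPECIALTY_ANNOTATIONS stands in for the annotated lookup table of MeSH and free-text condition terms, the remaining table and column names are likewise assumptions, and only this first rule is illustrated.

-- Sketch (assumed names): Group 1 for a given specialty contains every study
-- with at least one condition term (MeSH or free-text) annotated Y for it.
SELECT DISTINCT t.nct_id
  FROM (SELECT nct_id, mesh_term AS term FROM condition_browse
        UNION ALL
        SELECT nct_id, condition_name FROM conditions) t
  JOIN specialty_annotations a
    ON UPPER(a.term) = UPPER(t.term)
 WHERE a.specialty = 'Cardiology'
   AND a.tag = 'Y';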
`
Table 6. Classification of studies: algorithmically vs. manually.

CARDIOLOGY
                        Manual review
Algorithm        N        Y        Unknown    Total
N                836      21       1          858
Y                18       72       0          90
Ambiguous        1        0        0          1
Unclassified     49       2        0          51
Total            904      95       1          1,000

ONCOLOGY
                        Manual review
Algorithm        N        Y        Unknown    Total
N                700      7        0          707
Y                4        237      0          241
Ambiguous        1        0        0          1
Unclassified     49       2        0          51
Total            754      246      0          1,000

MENTAL HEALTH
                        Manual review
Algorithm        N        Y        Unknown    Total
N                838      10       0          848
Y                21       72       0          93
Ambiguous        8        0        0          8
Unclassified     51       0        0          51
Total            918      82       0          1,000

doi:10.1371/journal.pone.0033677.t006
`
`
Table 7. Comparison between manual classification and algorithmic classification for cardiology, oncology, and mental health.

                                                       Cardiology    Oncology    Mental Health
% Specialty by manual review                           9.5%          24.6%       8.2%
% Specialty by algorithm*                              9.5%          25.4%       9.9%
False positives†
  Among studies classified as N by manual review       2.0%          0.5%        2.3%
  Among studies classified as Y by algorithm           20.0%         1.7%        22.6%
False negatives†
  Among studies classified as Y by manual review       22.1%         2.8%        12.2%
  Among studies classified as N by algorithm           2.4%          1.0%        1.2%
Overall incorrectly classified studies                 4.2%          1.2%        3.3%
Overall ambiguous studies                              0.1%          0.1%        0.8%
Overall unclassified studies                           5.1%          5.1%        5.1%

*Excluding unclassified & ambiguous from denominator.
†Studies that were incorrectly included in a given specialty (e.g. non-cardiology