The Database for Aggregate Analysis of ClinicalTrials.gov (AACT) and Subsequent Regrouping by Clinical Specialty
`
`Asba Tasneem1*, Laura Aberle1, Hari Ananth1, Swati Chakraborty1, Karen Chiswell1, Brian J. McCourt1,
`Ricardo Pietrobon1,2
`
`1 Duke Clinical Research Institute, Durham, North Carolina, United States of America, 2 Department of Surgery, Duke University School of Medicine, Durham, North
`Carolina, United States of America
`
`Abstract
`
`Background: The ClinicalTrials.gov registry provides information regarding characteristics of past, current, and planned
`clinical studies to patients, clinicians, and researchers; in addition, registry data are available for bulk download. However,
`issues related to data structure, nomenclature, and changes in data collection over time present challenges to the
`aggregate analysis and interpretation of these data in general and to the analysis of trials according to clinical specialty in
`particular. Improving usability of these data could enhance the utility of ClinicalTrials.gov as a research resource.
`
`Methods/Principal Results: The purpose of our project was twofold. First, we sought to extend the usability of
`ClinicalTrials.gov for research purposes by developing a database for aggregate analysis of ClinicalTrials.gov (AACT) that
`contains data from the 96,346 clinical trials registered as of September 27, 2010. Second, we developed and validated a
`methodology for annotating studies by clinical specialty, using a custom taxonomy employing Medical Subject Heading
`(MeSH) terms applied by an NLM algorithm, as well as MeSH terms and other disease condition terms provided by study
`sponsors. Clinical specialists reviewed and annotated MeSH and non-MeSH disease condition terms, and an algorithm was
`created to classify studies into clinical specialties based on both MeSH and non-MeSH annotations. False positives and false
`negatives were evaluated by comparing algorithmic classification with manual classification for three specialties.
`
`Conclusions/Significance: The resulting AACT database features study design attributes parsed into discrete fields,
`integrated metadata, and an integrated MeSH thesaurus, and is available for download as Oracle extracts (.dmp file and text
`format). This publicly-accessible dataset will facilitate analysis of studies and permit detailed characterization and analysis of
`the U.S. clinical trials enterprise as a whole. In addition, the methodology we present for creating specialty datasets may
`facilitate other efforts to analyze studies by specialty groups.
`
`Citation: Tasneem A, Aberle L, Ananth H, Chakraborty S, Chiswell K, et al. (2012) The Database for Aggregate Analysis of ClinicalTrials.gov (AACT) and Subsequent
`Regrouping by Clinical Specialty. PLoS ONE 7(3): e33677. doi:10.1371/journal.pone.0033677
`
`Editor: Joel Joseph Gagnier, University of Michigan, United States of America
`
`Received October 14, 2011; Accepted February 14, 2012; Published March 16, 2012
Copyright: © 2012 Tasneem et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits
`unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
`
`Funding: Financial support for this work was provided by cooperative agreement U19 FD003800 awarded by the U.S. Food and Drug Administration to Duke
`University in support of the Clinical Trials Transformation Initiative. The funders had no role in study design, data collection and analysis, decision to publish, or
`preparation of the manuscript.
`
`Competing Interests: The authors have declared that no competing interests exist.
`
`* E-mail: asba.tasneem@duke.edu
`
`Introduction
`
ClinicalTrials.gov (www.ClinicalTrials.gov) is a registry of human clinical research studies. It is hosted by the National
`Library of Medicine (NLM) at the National Institutes of Health
(NIH) in collaboration with the U.S. Food and Drug Administration (FDA). As mandated by federal law [1], ClinicalTrials.gov
`provides a central resource for information about clinical trials; in
`addition, it increases the public visibility of such research. The
registry currently contains over 100,000 research studies conducted in more than 170 countries and is widely used both by medical
`professionals and the public. New research studies are being
`submitted to the registry by their respective sponsors (or sponsors’
`designees) at a rate of approximately 350 per week [2]. Due to
`legislative [1] and institutional [3] requirements enacted in the
`latter half of
`the previous decade, compliance with registry
`obligations is assumed to be high for U.S. drug and device trials,
`
`and the consistency, quality, and maintenance of registry data
`have improved with increased use [4]. However, the registry has
`not been optimized for the analysis of aggregate data, and a
`systematic effort to create and maintain a database for this purpose
`has not previously been undertaken.
`In November 2007, the FDA and Duke University announced
`the formation of a public-private partnership to improve the
`quality and efficiency of clinical trials. This collaboration of more
`than 60 organizations and government agencies was convened by
`Duke University under a memorandum of understanding with
`FDA, and is now known as the Clinical Trials Transformation
Initiative (CTTI) [5]. CTTI leaders recognized that ClinicalTrials.gov represented a promising source for benchmarking the
`state of the clinical trials enterprise, as the registry contains studies
`from the full range of sponsoring organizations. Increasing the
`usability of ClinicalTrials.gov data may therefore facilitate
`systematic evaluation of clinical studies aimed at building the
`
knowledge base needed to inform medical practice and prevention.
`As data have accumulated in ClinicalTrials.gov, users have
`increasingly sought capabilities
`that would allow aggregated
`descriptive characterization of
`the national research portfolio;
`however, access and data usability issues, including data format
`and design, present obstacles. A number of related initiatives,
`including the Ontology of Clinical Research (OCRe) [6], Human
`Studies Database (HSDB) [7], CDISC Protocol Representation
`Model [8], and LinkedCT [9] projects, are addressing ontological
`annotations, large-scale data mining, data representation format,
`and external association of these data, respectively. The results of
`this project are complementary to these initiatives and are
`expected to collectively advance this area of study as a whole.
`In this article, we report on CTTI’s efforts to prepare and
`maintain a publicly accessible analysis dataset derived from
`ClinicalTrials.gov content—the database for aggregate analysis
`of ClinicalTrials.gov (AACT). We also discuss efforts to extend the
`
`utility of the analysis dataset by means of an associated clinical
`specialty taxonomy designed to support research policy analyses.
`
`Methods
`
`1. Creation of the AACT
`Key design features of AACT include 1) the capacity to extend
`the dataset by parsing existing data; 2) linking to additional data
resources, such as the Medical Subject Headings (MeSH) thesaurus; and 3) integrated metadata. A framework for extensions
`allows entire studies or individual fields to be associated with new
`data resources while preserving provenance. In addition,
`the
`integrated data dictionary developed for this project facilitates
`browsing and analysis of ClinicalTrials.gov and AACT metadata.
`Finally,
`the database incorporates a flexible design that can
`accommodate future developments, such as coding biospecimen
`type, sponsors, and OCRe annotations. Figure 1 shows key
`enhancements achieved by building the AACT.
`
`Figure 1. A schematic representation of the database for Aggregate Analysis of ClinicalTrials.Gov (AACT) with its key
`enhancements.
`doi:10.1371/journal.pone.0033677.g001
`
`PLoS ONE | www.plosone.org
`
`2
`
`March 2012 | Volume 7 |
`
`Issue 3 | e33677
`
`MPI EXHIBIT 1065 PAGE 2
`
`
`
`Database for Aggregate Analysis of CT.gov
`
`1.1. Data Sources. A dataset comprising 96,346 clinical
`studies was downloaded from ClinicalTrials.gov in XML format
`on September 27, 2010. We chose ClinicalTrials.gov for our study
`because it is the largest database of its kind and because it covers
`the full range of clinical conditions, includes a broad group of trial
`sponsors [10], and has a regulatory mandate [1]. The date of
`download was chosen to coincide with the anniversary of the
`enactment of the FDA Amendments Act (FDAAA) 3 years earlier,
`which mandated the registration of certain trials of FDA-regulated
`drugs, biologics, and devices [1].
We downloaded the 2010 MeSH thesaurus (http://www.nlm.nih.gov/mesh/2010/download/termscon.html) and merged it
`with the AACT database, where it was used as a lookup table to
`locate corresponding tree numbers, referred to as MeSH IDs, for all
`MeSH terms associated with each clinical trial in ClinicalTrials.
`gov. Persons or organizations who submit studies to the registry
`are requested to provide the condition and keyword data elements as
`MeSH terms.
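As an illustration of how the merged thesaurus serves as a lookup table, the sketch below retrieves tree numbers (MeSH IDs) for the MeSH terms attached to each study. The table and column names (CONDITION_BROWSE, MESH_THESAURUS, NCT_ID, MESH_TERM, TREE_NUMBER) are simplified assumptions rather than the exact AACT schema.

-- Sketch (assumed names): look up tree numbers ("MeSH IDs") for the MeSH
-- terms associated with each study by joining against the merged thesaurus.
SELECT cb.nct_id,
       cb.mesh_term,
       mt.tree_number AS mesh_id
  FROM condition_browse cb
  JOIN mesh_thesaurus mt
    ON UPPER(mt.mesh_term) = UPPER(cb.mesh_term)
 ORDER BY cb.nct_id, mt.tree_number;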
`1.2. Data Model. ClinicalTrials.gov data element definitions,
`xsd specifications for registry data submission, and downloaded
`
`study XML files were used to represent data specifications for the
`downloaded data. A physical data model was designed using
`Enterprise Architect (Sparx Systems Pty Ltd, Creswick, Victoria,
`Australia); this model depicted data tables and their data columns,
`as well as relationships between and among tables. An optimal
`structure was achieved through normalization, which was used to
`organize data efficiently, eliminate redundancy, and ensure logical
`data dependencies by storing only related data within a given table
`[11]. The database (Figure 2) was normalized to the Second
`Normal Form (2NF), a set of criteria designed to prevent logical
`inconsistencies while reducing data redundancy [12].
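As a simplified illustration of this normalized layout (assumed, abbreviated names rather than the actual AACT DDL depicted in Figure 2), a study is stored once and its repeating condition terms are held in a child table keyed to the study identifier:

-- Simplified sketch (assumed names, not the AACT DDL): repeating condition
-- terms live in a child table that references the study, rather than being
-- stored as repeated columns on the study record.
CREATE TABLE studies (
  nct_id      VARCHAR2(11) PRIMARY KEY,
  brief_title VARCHAR2(300),
  study_type  VARCHAR2(30)
);

CREATE TABLE conditions (
  nct_id         VARCHAR2(11) NOT NULL REFERENCES studies (nct_id),
  condition_name VARCHAR2(500)
);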
`We assigned data type and length of data elements based on
`patterns observed for each data element in the downloaded XML
`files. Whenever possible, we followed guidelines provided in
`ClinicalTrials.gov’s draft Protocol Data Element Definitions [13]
`when assigning lengths to given data elements. Data were housed
`in Oracle RDBMS, version 11.1 g (Oracle Corporation, Redwood
`Shores, California, USA). Enterprise Architect 7.1 was used for
database design, and additional transformation rules were documented as extract-transform-load (ETL) specifications.
`
`Figure 2. High-level Entity-Relationship Diagram (ERD) for AACT.
`doi:10.1371/journal.pone.0033677.g002
`
`
`
`
Table 1. Escape characters and replacements.

[Two-column table pairing each escape character encountered in the XML files (e.g., ’, ", &, ,) with the character reference used to replace it.]

doi:10.1371/journal.pone.0033677.t001
`
PL/SQL packages were developed that used Oracle's inbuilt DBMS_LOB package to read the input XML files and load the data into the designed tables (a simplified sketch appears below). Quality control and
`operational support processes were developed using standard
`SQL queries through Toad for Data Analysts (Quest Software,
`Aliso Viejo, CA, USA) and Cognos ReportNet (CRN)
`(IBM
`Corporation, Armonk, NY, USA). We extended the core data
`model to accommodate both data management and data curation
`purposes. Error log tables and indexes were created for testing,
debugging, and performance enhancement. Manual user acceptance testing was performed by randomly selecting five studies per
`data element (from a total of 109 data elements) from the AACT
`database. The values associated with each data element were
`tested for correctness and completeness by comparing them with
the original source data from downloaded XML files. We also
`created integrated data dictionary tables as reference tables
`holding explicit data element definitions and system metadata
`(Tables S1 and S2).
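A highly simplified sketch of the loading step is shown below. It assumes a staging table (XML_STAGING) holding each downloaded file as a CLOB and uses XMLTABLE to pull a few representative fields; the element paths follow the registry's public XML layout, while the table and column names are assumptions. The production ETL was implemented as PL/SQL packages with DBMS_LOB file reads, error-log tables, and the full set of 109 data elements, none of which is shown here.

-- Simplified sketch (assumed names): extract a few fields from study XML
-- already staged as CLOBs and insert them into the relational tables.
INSERT INTO studies (nct_id, brief_title, study_type)
SELECT x.nct_id, x.brief_title, x.study_type
  FROM xml_staging s,
       XMLTABLE('/clinical_study'
                PASSING XMLTYPE(s.xml_clob)
                COLUMNS
                  nct_id      VARCHAR2(11)  PATH 'id_info/nct_id',
                  brief_title VARCHAR2(300) PATH 'brief_title',
                  study_type  VARCHAR2(30)  PATH 'study_type') x;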
`During the course of database development, the NLM made
`several new data elements available for public download, some of
`which included information about the FDA (e.g., Section 801
`clinical
`trials,
`studies with FDA-regulated interventions, and
`expanded-access studies). In addition to these, MeSH condition
`and intervention terms generated by the NLM algorithm were also
`made available for public download.
`In XML files downloaded from ClinicalTrials.gov, the single
`data element Study Design contains a string of concatenated values
`for various different components of a study design, such as primary
`purpose,
`interventional model, observational model, allocation,
`endpoint classification, time perspective, and masking. While this
`format is well-suited for supporting information retrieval, it does
not readily accommodate aggregate data analysis of the components within the Study Design data element. For this reason, data
`from Study Design was parsed into its components and stored in a
`separate table called DESIGNS. Additional data elements (Design
`Name and Design Value) were created to store all components of
`study design and their respective enumerated values. Values
`related to masking/blinding (e.g., Single; Double-Blind) were further
`parsed into their components, along with the list of corresponding
masking subjects (Participant, Investigator, Outcome Assessor, and Caregiver).
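A minimal PL/SQL sketch of this parsing step follows. The DESIGNS table and its Design Name/Design Value columns correspond to the structure described above, but the sample string, the NCT number, and the simple comma/colon splitting are illustrative only; values containing embedded commas (such as the parenthesized masking subject list) need additional handling.

-- Simplified sketch: split a concatenated Study Design string into
-- "Design Name"/"Design Value" pairs and store one row per pair in DESIGNS.
-- The sample string and NCT number are placeholders.
DECLARE
  v_design VARCHAR2(4000) :=
    'Allocation: Randomized, Masking: Double-Blind, Primary Purpose: Treatment';
  v_pair   VARCHAR2(4000);
  i        PLS_INTEGER := 1;
BEGIN
  LOOP
    v_pair := REGEXP_SUBSTR(v_design, '[^,]+', 1, i);
    EXIT WHEN v_pair IS NULL;
    INSERT INTO designs (nct_id, design_name, design_value)
    VALUES ('NCT00000000',
            TRIM(REGEXP_SUBSTR(v_pair, '^[^:]+')),
            TRIM(REGEXP_SUBSTR(v_pair, '[^:]+$')));
    i := i + 1;
  END LOOP;
  COMMIT;
END;
/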
Figure 3. Percentage of interventional studies with complete data by registration year for selected data elements.
doi:10.1371/journal.pone.0033677.g003

Several challenges were encountered while loading the database, including foreign characters embedded in XML files with most of the data elements; these had to be replaced with character references (see Table 1 for examples).
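As a single illustrative example (Table 1 lists the characters actually handled), a right single quotation mark can be rewritten as its numeric character reference before loading; the staging table and column names below are assumptions.

-- Illustrative only: rewrite one special character as its numeric character
-- reference (&#8217; is the reference for the right single quotation mark).
UPDATE xml_staging
   SET xml_clob = REPLACE(xml_clob, '’', '&#8217;');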
`Other circumstances that prompted several database design
`iterations included the facts that the maximum length for each
`data element noted by ClinicalTrials.gov’s May 2010 Protocol
`Data Element Definitions document was not always consistent
`with the complete dataset, and one-to-one or one-to-many
`
`relationships between or among data elements were not obvious
`in the XML data type definition from ClinicalTrials.gov.
`1.3. Quality Assessment. Of the 96,346 studies downloaded
`from ClinicalTrials.gov in September 2010, a total of 79,413
`(82.4%) were interventional (i.e., a study in which an investigator
`following a protocol assigns research participants
`to receive
`specific interventions, as opposed to an observational study),
`
Figure 4. An overview of the methodology and process of developing clinical specialty datasets. The INTERVENTIONS, CONDITIONS, and KEYWORDS tables consist of disease condition terms provided by data submitters that include both MeSH and non-MeSH terms. The INTERVENTION_BROWSE and CONDITION_BROWSE tables are populated by MeSH terms generated by the NLM algorithm. (a) Process illustrating how MeSH terms are created in ClinicalTrials.gov; the tables and data shown here do not represent the entire ClinicalTrials.gov database. (b) Process illustrating the annotation and validation of disease conditions. (c) Process illustrating the creation of specialty datasets.
`doi:10.1371/journal.pone.0033677.g004
`
`
`
`
`Table 2. MeSH Subject Headings, 2010—Diseases.
`
`Bacterial Infections and Mycoses [C01]
`
`Virus Diseases [C02]
`
`Parasitic Diseases [C03]
`
`Neoplasms [C04]
`
`Musculoskeletal Diseases [C05]
`
`Digestive System Diseases [C06]
`
`Stomatognathic Diseases [C07]
`
`Respiratory Tract Diseases [C08]
`
`Otorhinolaryngologic Diseases [C09]
`
`Nervous System Diseases [C10]
`
`Eye Diseases [C11]
`
`Male Urogenital Diseases [C12]
`
`Female Urogenital Diseases and Pregnancy Complications [C13]
`
`Cardiovascular Diseases [C14]
`
`Hemic and Lymphatic Diseases [C15]
`
`Congenital, Hereditary, and Neonatal Diseases and Abnormalities [C16]
`
`Skin and Connective Tissue Diseases [C17]
`
`Nutritional and Metabolic Diseases [C18]
`
`Endocrine System Diseases [C19]
`
`Immune System Diseases [C20]
`
`Disorders of Environmental Origin [C21]
`
`Animal Diseases [C22]
`
`Pathological Conditions, Signs and Symptoms [C23]
`
`Available at: http://www.nlm.nih.gov/mesh/trees.html
`
`doi:10.1371/journal.pone.0033677.t002
`
`16,506 (17.1%) were observational, 107 (0.1%) were expanded-
`access, and 320 had no information about the study type. We
`analyzed selected data elements in interventional studies for
`completeness of data (e.g., a null value in the data element) and
`observed a trend toward increasing completeness of data over
`time. This trend appears to have been notably affected by two
`milestones in the history of ClinicalTrials.gov. In September 2004,
the International Committee of Medical Journal Editors (ICMJE)
`published a policy requiring registration of interventional trials as
`a condition of publication [3]. The ICMJE requirements took
`effect in September 2005, which may account for the increase in
`completeness for some data elements in 2005 (Figure 3).
`In September 2007, the FDAAA [1] made the registration of
`interventional studies mandatory. This requirement took effect in
December 2007 and may further account for increases in the completeness of data elements in the ClinicalTrials.gov dataset.

Table 3. Frequency of intermediate terms and top node terms that did not match annotations of lower-level terms.

Specialty        n/N (%)
Cardiology       172/5264 (3.3%)
Oncology         284/5264 (5.4%)
Mental health    93/5264 (1.8%)

n = number of intermediate- and top-node MeSH terms for a given specialty that do not match the annotations of their lower-level terms. N = total number of intermediate- and top-node MeSH terms.
doi:10.1371/journal.pone.0033677.t003

In Figure 3, the data elements "data monitoring committee" and "number of arms" were not available at the time that earlier studies were registered. It is important to note that the presence of these data elements for studies pre-dating December 2007 reflects later updates performed by data providers.
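The completeness trend summarized in Figure 3 can be reproduced from AACT with a query of roughly the following form; the column names (e.g., a registration-date field and the "number of arms" element) are assumptions rather than the exact AACT identifiers.

-- Sketch (assumed names): per registration year, the percentage of
-- interventional studies with a non-null value for a selected data element.
SELECT EXTRACT(YEAR FROM s.first_received_date) AS registration_year,
       ROUND(100 * COUNT(s.number_of_arms) / COUNT(*), 1) AS pct_complete
  FROM studies s
 WHERE s.study_type = 'Interventional'
 GROUP BY EXTRACT(YEAR FROM s.first_received_date)
 ORDER BY registration_year;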
1.4. Changes in ClinicalTrials.gov's Protocol Data Element Definitions. The ClinicalTrials.gov Protocol Data
`Element Definitions (PDED) have evolved since the database was
`first launched. Although references containing individual protocol
`data element definitions are provided for submitters with each
release of the definitions document, there is no single document that tracks changes to all data elements for review as data specifications. These changes include changing enumerated values for a
`data element, revising a data element definition, making a
`particular data element publicly available,
`introducing a new
`data element, and entirely deleting a data element. However, more
`rigorous submission rules imposed by mandating organizations
`(e.g., NLM, FDA), such as those required by the FDAAA and
`ClinicalTrials.gov, appear to have had the greatest impact on the
`completeness of data.
`Changes to a data element play a significant role in the analysis
`of study data. As we examined each data element’s history, we
`noted that between September 2004 and July 2005 (a period
`spanning 3 releases of the PDED), and again in December 2007,
`the data element requirements were not documented in the
`definitions document. Other inconsistencies were also noted and
`later confirmed (Personal communication, Dr. Deborah Zarin and
`Mr. Nicholas Ide, February 18, 2011).
`1.5. A Public Resource. The AACT can be downloaded as
Oracle extracts (.dmp file and text format output; available at https://www.trialstransformation.org/projects/improving-the-public-interface-for-use-of-aggregate-data-in-clinicaltrials.gov/aact-database-for-aggregate-analysis-of-clinicaltrials.gov). Additional documents are
`available to assist users in interpreting the data. The high-level data
`dictionary and a comprehensive data dictionary noted previously are
`included in the dataset file. The comprehensive data dictionary
`contains seven sections: 1) current variables, 2) enumerations, 3)
`constraints, 4) record counts, 5) database schema, 6) comprehensive
`change history, and 7) variable history dates. This document provides
`definitions, derivation of terms, data model structure and references,
`NLM and FDAAA requirements, and historical information for each
`data element in ClinicalTrials.gov to facilitate understanding of when
`variables were added, modified, or discontinued. The high-level data
`dictionary provides a summary view of the variables contained in the
`AACT database.
`
`2. A Methodology to Regroup Studies in
`ClinicalTrials.Gov by Specialty
ClinicalTrials.gov contains studies from multiple clinical domains. While the AACT database facilitates the aggregate
`analysis of the entire dataset, it does not in itself support analysis
`within specific specialty domains. We therefore developed a
`methodology to re-group studies from ClinicalTrials.gov by
`clinical specialties as designated by the Department of Health
`and Human Services [14]. In doing so, we relied on MeSH
`condition terms and free-text disease condition terms associated
`with each study in the ClinicalTrials.gov database—a method
`that can be used to develop other specialized datasets for
`analysis.
`2.1. Use of MeSH Terminology in the ClinicalTrials.gov
`Database. Data submitters (study sponsors or their designees)
`are requested to provide Condition and Keywords data as MeSH
terms when registering a study.

Figure 5. MeSH trees for acromegaly. Source: 2010 online MeSH thesaurus (available: http://www.nlm.nih.gov/cgi/mesh/2010/MB_cgi).
doi:10.1371/journal.pone.0033677.g005

Additionally, an NLM algorithm also evaluates studies and applies MeSH terms according to the
`following steps: 1) study records are checked for the presence of a
`MeSH term,
`including synonyms and lexical variations; 2)
`weighted scores are computed for all matches, with exact
`matches, lexical variations, and synonyms receiving descending
`proportional weight; 3) very common terms are excluded to avoid
`confounding; 4)
`location by data element
`is considered and
`weighted in the term scoring process; and 5) terms with scores
`exceeding the cutoff value are applied to the respective studies.
`(Note that the output from steps 1 and 2 is used for both condition
`and intervention annotations; the field weights are different for
`each and divert
`terms into the target annotation type.) This
`method does not consider the natural-language context
`for
`matched terms or ontologically related concepts that would add
specificity. Neither the terms supplied by data submitters nor those applied by the NLM algorithm are associated with a particular MeSH hierarchy. The resulting annotated MeSH terms are visible on
`the ClinicalTrials.gov website and populated in the condition_browse
`and intervention_browse fields in the downloaded XML file for each
`study. Figure 4 illustrates how MeSH terms are created in the
`ClinicalTrials.gov database.
`
`2.2. MeSH Disease Conditions Annotation. Condition
`and intervention terms in the MeSH thesaurus are arrayed in
`hierarchical branching structures, called trees; each branching
point is referred to as a node. Node levels range from 1 (highest) to 12 (lowest) in the 2010 version of the MeSH thesaurus. For
`example, one high-level category that we used to classify studies by
`clinical specialty was Diseases. In the 2010 MeSH thesaurus, this
`category contains 23 subcategories (Table 2).
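Tree numbers encode their position in the hierarchy, so every descendant of a subcategory shares its prefix; for example, all terms under Cardiovascular Diseases [C14] (see Table 2) can be located as sketched below, with the thesaurus table and column names assumed.

-- Sketch (assumed names): all MeSH terms anywhere under the
-- Cardiovascular Diseases [C14] subtree, found by tree-number prefix.
SELECT mesh_term, tree_number
  FROM mesh_thesaurus
 WHERE tree_number = 'C14'
    OR tree_number LIKE 'C14.%';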
`In order to create specialty datasets from the larger AACT
`dataset, we selected four high-level MeSH nodes
`from the
`thesaurus to serve as an initial basis for identifying studies by
`clinical specialty. Reviewers with relevant subject matter expertise
`annotated MeSH terms from the following high-level nodes: 1)
`Diseases; 2) Analytical, Diagnostic and Therapeutic Techniques and
`Equipment; 3) Psychiatry and Psychology; and 4) Phenomena and Processes.
A total of 18,491 MeSH IDs associated with 9,031 MeSH terms
`were reviewed and annotated by clinical specialists belonging to
`one of the 13 clinical specialties and five sub-specialties, which
`were selected on the basis of availability of faculty representation
`and volunteers at Duke, as well as intention to analyze subsets of
`data by clinical specialty. Participating specialty annotations
`
`PLoS ONE | www.plosone.org
`
`7
`
`March 2012 | Volume 7 |
`
`Issue 3 | e33677
`
`MPI EXHIBIT 1065 PAGE 7
`
`
`
`Database for Aggregate Analysis of CT.gov
`
`Figure 6. Rules for deciding whether a given study belongs to a given specialty.
`doi:10.1371/journal.pone.0033677.g006
`
`included cardiology, dermatology, endocrinology, gastroenterol-
`ogy,
`immunology/ rheumatology,
`infectious diseases, mental
`health, nephrology, neurology, oncology, otolaryngology, pulmo-
`nary medicine, reproductive medicine, while subspecialty anno-
`tations included peripheral vascular disease, peripheral arterial
`disease, diabetes,
`thyroid disease, and bone disease. The
`association of terms with clinical specialties was performed in
`the context of the anticipated analysis of the data subset for
`respective specialties. The results of this extension to the AACT
`database,
`including specialty tags, will be shared in future
`publications.
`2.3. Validation of Inconsistently Annotated MeSH Terms
`and Limitations of Using the MeSH Hierarchy. A term
`occurring at a particular node ‘‘node x’’ (parent) may have several
`branches (children) at node x+1 that provide a finer classification
`of the node-x term. Clinical specialists were advised to review the
`hierarchy of an individual MeSH term during the annotation
`process. Annotated MeSH descriptors were programmatically
`reviewed for hierarchical inconsistencies in order to maintain the
`logical relationship between parent and child MeSH descriptors.
`
Table 4. Number of studies reviewed by each set of clinician reviewers.

Reviewer A ID    Reviewer B ID    Studies reviewed (n)
Clinician 1      Clinician 2      200
Clinician 1      Clinician 3      400*
Clinician 4      Clinician 5      200
Clinician 6      Clinician 7      200

*The combination of Clinician 1 ("A") and Clinician 3 ("B") together reviewed 2 batches of studies.
doi:10.1371/journal.pone.0033677.t004
`
`Tag validity was evaluated by a process based on annotation rules.
`In general, selection or negation of a parent MeSH term should
`match with all subsequent child MeSH terms below that node.
`Hierarchical inconsistencies in MeSH annotations were flagged
`and accepted after further review and confirmation by clinical
specialists. The anticipated inconsistency of the MeSH hierarchical structure with clinical specialty groupings was confirmed in the validation process. Table 3 shows the frequency of parent terms that did not match with annotations for their child terms.
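One way to express such a check is a self-join on tree-number prefixes that flags parent and descendant terms whose specialty tags disagree; the annotation table and column names below are assumptions, and the actual process also included clinician review of the flagged terms.

-- Sketch (assumed names): flag parent/descendant MeSH annotations whose
-- specialty tags disagree, using tree-number prefixes to find descendants.
SELECT p.mesh_term      AS parent_term,
       c.mesh_term      AS child_term,
       p.specialty_tag  AS parent_tag,
       c.specialty_tag  AS child_tag
  FROM mesh_annotations p
  JOIN mesh_annotations c
    ON c.tree_number LIKE p.tree_number || '.%'
   AND c.specialty = p.specialty
 WHERE p.specialty_tag <> c.specialty_tag;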
`Further, a term might appear within more than one tree. For
`example, the MeSH term Acromegaly appears as part of multiple
`trees within the topmost MeSH hierarchical category of Diseases
(Figure 5). Depending on its hierarchical location, its context could fall under Musculoskeletal Diseases, Nervous System Diseases, or Endocrine System Diseases. Unfortunately, there currently is no way to differentiate among different tree numbers (MeSH IDs) for the
`same MeSH term. If a study contained the term Acromegaly, the
`three associated MeSH IDs could have conflicting tags (e.g., No,
`No, Yes) for a given specialty. This might result in erroneously
`including this study in a particular specialty dataset. As an
`additional validation check, all MeSH terms that had conflicting
`tags, as in the example above, were flagged and allowed to be
`adjudicated by clinical specialists.
`Tagging was summarized by MeSH term. For a given MeSH
`term, if all MeSH IDs had a Y tag (‘‘yes’’ or ‘‘true’’), then the
MeSH term was given a Y; if all MeSH IDs had an N tag (‘‘no’’ or ‘‘false’’), then the MeSH term was given an N tag; and if there
`was a mix of Y and N tags the term was given an A tag
`(‘‘ambiguous’’).
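This summarization can be written as a grouped query over the per-tree-number tags (assumed table and column names):

-- Sketch (assumed names): collapse the Y/N tags recorded for each tree
-- number (MeSH ID) into a single tag per MeSH term and specialty:
-- all Y -> Y, all N -> N, mixed -> A ("ambiguous").
SELECT mesh_term,
       specialty,
       CASE
         WHEN MIN(specialty_tag) = 'Y' THEN 'Y'   -- every MeSH ID tagged Y
         WHEN MAX(specialty_tag) = 'N' THEN 'N'   -- every MeSH ID tagged N
         ELSE 'A'                                 -- mix of Y and N
       END AS term_tag
  FROM mesh_annotations
 GROUP BY mesh_term, specialty;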
`2.4. Free-text Disease Conditions (non-MeSH condition
`terms): Annotation and Validation.
`In order to ascertain the
`condition being investigated in a given study, we also used the free-
`text condition terms provided by data submitters. These terms are
`
`
Table 5. Contingency table for identifying misclassification errors.

                   Algorithm
Manual review      Yes (Y)    No (N)    Ambiguous    Unclassified    Total
Yes (Y)            A          B         G            H               A+B+G+H
No (N)             C          D         I            J               C+D+I+J
Unknown            E          F         K            L               E+F+K+L
Total              A+C+E      B+D+F     G+I+K        H+J+L           T

The overall misclassification error rate divides the total number of errors by the total number of studies reviewed. The false-positive rate was determined using two methods: in the first, the false-positive rate was calculated among studies classified as N by manual review; in the second, the false-positive rate was calculated among studies classified as Y by the algorithm. The false-negative rate was evaluated in similar fashion: by dividing the number of false negatives by the number of studies classified as Y by manual review, or by the number of studies classified as N by the algorithm.
doi:10.1371/journal.pone.0033677.t005
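Using the cell labels above, and counting as errors only the disagreements between the algorithm's Y/N calls and the manual Y/N calls (an assumption about how the total number of errors is tallied), these definitions can be written as:

\[
\text{overall error rate} = \frac{B + C}{T}, \qquad
\text{FP}_{\text{manual N}} = \frac{C}{C + D + I + J}, \qquad
\text{FP}_{\text{algorithm Y}} = \frac{C}{A + C + E},
\]
\[
\text{FN}_{\text{manual Y}} = \frac{B}{A + B + G + H}, \qquad
\text{FN}_{\text{algorithm N}} = \frac{B}{B + D + F},
\]

where C denotes false positives (algorithm Y, manual N) and B denotes false negatives (algorithm N, manual Y).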
`
`visible on the ClinicalTrials.gov website and populated in the
`Condition field in the downloaded XML file for each study. Non-
`MeSH condition terms that appeared in five or more studies were
`also selected for specialty classification from interventional studies
`registered after September 27, 2007 (n = 40,970). These terms
`were reviewed by two independent clinicians from each relevant
`specialty; disagreements were adjudicated by a third independent
`reviewer.
`We elected to use both MeSH and non-MeSH disease condition
`terms for the following reasons: first, over 10% of studies do not
`have condition_browse mesh_terms; second, common terms may be
`excluded from the condition_browse mesh_terms annotation; and
`third, because of
`the potential
`for duplication or mismatch
`described above, reliance on indexing by MeSH term alone does
`
`not suffice for re-grouping studies in ClinicalTrials.gov by clinical
`specialty.
`2.5. Algorithm for Classifying Clinical Discipline. We
`used a combination of rules representing disease conditions and
`MeSH terms for classifying clinical specialty within interventional
`studies. We only included trials registered with ClinicalTrials.gov
`after September 27, 2007. The final
`list of annotated disease
`condition terms (MeSH and free-text) was used as a lookup table
`to create study datasets for individual specialties.
`For each specialty, studies were grouped according to the
`following rules (Figure 6):
`Group 1: Include a study in this group if any of its MeSH terms
`from the CONDITION_BROWSE table or condition terms were
`annotated with a Y (‘‘yes’’ or ‘‘true’’) for the specialty.
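A sketch of the Group 1 rule as a query is shown below; SPECIALTY_ANNOTATIONS stands in for the annotated lookup table of MeSH and free-text condition terms, the remaining table and column names are likewise assumptions, and only this first rule is illustrated.

-- Sketch (assumed names): Group 1 for a given specialty contains every study
-- with at least one condition term (MeSH or free-text) annotated Y for it.
SELECT DISTINCT t.nct_id
  FROM (SELECT nct_id, mesh_term AS term FROM condition_browse
        UNION ALL
        SELECT nct_id, condition_name FROM conditions) t
  JOIN specialty_annotations a
    ON UPPER(a.term) = UPPER(t.term)
 WHERE a.specialty = 'Cardiology'
   AND a.tag = 'Y';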
`
Table 6. Classification of studies: algorithmically vs. manually.

CARDIOLOGY
                        Manual review
Algorithm        N        Y        Unknown    Total
N                836      21       1          858
Y                18       72       0          90
Ambiguous        1        0        0          1
Unclassified     49       2        0          51
Total            904      95       1          1,000

ONCOLOGY
                        Manual review
Algorithm        N        Y        Unknown    Total
N                700      7        0          707
Y                4        237      0          241
Ambiguous        1        0        0          1
Unclassified     49       2        0          51
Total            754      246      0          1,000

MENTAL HEALTH
                        Manual review
Algorithm        N        Y        Unknown    Total
N                838      10       0          848
Y                21       72       0          93
Ambiguous        8        0        0          8
Unclassified     51       0        0          51
Total            918      82       0          1,000

doi:10.1371/journal.pone.0033677.t006
`
`
Table 7. Comparison between manual classification and algorithmic classification for cardiology, oncology, and mental health.

                                                       Cardiology    Oncology    Mental Health
% Specialty by manual review                           9.5%          24.6%       8.2%
% Specialty by algorithm*                              9.5%          25.4%       9.9%
False positives†
  Among studies classified as N by manual review       2.0%          0.5%        2.3%
  Among studies classified as Y by algorithm           20.0%         1.7%        22.6%
False negatives†
  Among studies classified as Y by manual review       22.1%         2.8%        12.2%
  Among studies classified as N by algorithm           2.4%          1.0%        1.2%
Overall incorrectly classified studies                 4.2%          1.2%        3.3%
Overall ambiguous studies                              0.1%          0.1%        0.8%
Overall unclassified studies                           5.1%          5.1%        5.1%

*Excluding unclassified & ambiguous from denominator.
†Studies that were incorrectly included in a given specialty (e.g. non-cardiology