`
`The Database for Aggregate Analysis of ClinicalTrials.gov
`(AACT) and Subsequent Regrouping by Clinical Specialty
`
`Asba Tasneem1*, Laura Aberle1, Hari Ananth1, Swati Chakraborty1, Karen Chiswell1, Brian J. McCourt1,
`Ricardo Pietrobon1,2
`
`1 Duke Clinical Research Institute, Durham, North Carolina, United States of America, 2 Department of Surgery, Duke University School of Medicine, Durham, North
`Carolina, United States of America
`
`Abstract
`
`Background: The ClinicalTrials.gov registry provides information regarding characteristics of past, current, and planned
`clinical studies to patients, clinicians, and researchers; in addition, registry data are available for bulk download. However,
`issues related to data structure, nomenclature, and changes in data collection over time present challenges to the
`aggregate analysis and interpretation of these data in general and to the analysis of trials according to clinical specialty in
`particular. Improving usability of these data could enhance the utility of ClinicalTrials.gov as a research resource.
`
`Methods/Principal Results: The purpose of our project was twofold. First, we sought to extend the usability of
`ClinicalTrials.gov for research purposes by developing a database for aggregate analysis of ClinicalTrials.gov (AACT) that
`contains data from the 96,346 clinical trials registered as of September 27, 2010. Second, we developed and validated a
`methodology for annotating studies by clinical specialty, using a custom taxonomy employing Medical Subject Heading
`(MeSH) terms applied by an NLM algorithm, as well as MeSH terms and other disease condition terms provided by study
`sponsors. Clinical specialists reviewed and annotated MeSH and non-MeSH disease condition terms, and an algorithm was
`created to classify studies into clinical specialties based on both MeSH and non-MeSH annotations. False positives and false
`negatives were evaluated by comparing algorithmic classification with manual classification for three specialties.
`
`Conclusions/Significance: The resulting AACT database features study design attributes parsed into discrete fields,
`integrated metadata, and an integrated MeSH thesaurus, and is available for download as Oracle extracts (.dmp file and text
`format). This publicly-accessible dataset will facilitate analysis of studies and permit detailed characterization and analysis of
`the U.S. clinical trials enterprise as a whole. In addition, the methodology we present for creating specialty datasets may
`facilitate other efforts to analyze studies by specialty groups.
`
`Citation: Tasneem A, Aberle L, Ananth H, Chakraborty S, Chiswell K, et al. (2012) The Database for Aggregate Analysis of ClinicalTrials.gov (AACT) and Subsequent
`Regrouping by Clinical Specialty. PLoS ONE 7(3): e33677. doi:10.1371/journal.pone.0033677
`
`Editor: Joel Joseph Gagnier, University of Michigan, United States of America
`
`Received October 14, 2011; Accepted February 14, 2012; Published March 16, 2012
Copyright: © 2012 Tasneem et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits
`unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
`
`Funding: Financial support for this work was provided by cooperative agreement U19 FD003800 awarded by the U.S. Food and Drug Administration to Duke
`University in support of the Clinical Trials Transformation Initiative. The funders had no role in study design, data collection and analysis, decision to publish, or
`preparation of the manuscript.
`
`Competing Interests: The authors have declared that no competing interests exist.
`
`* E-mail: asba.tasneem@duke.edu
`
`Introduction
`
ClinicalTrials.gov (www.ClinicalTrials.gov) is a registry of
human clinical research studies. It is hosted by the National
Library of Medicine (NLM) at the National Institutes of Health
(NIH) in collaboration with the U.S. Food and Drug Administration
(FDA). As mandated by federal law [1], ClinicalTrials.gov
provides a central resource for information about clinical trials; in
addition, it increases the public visibility of such research. The
registry currently contains over 100,000 research studies conducted
in more than 170 countries and is widely used both by medical
professionals and the public. New research studies are being
submitted to the registry by their respective sponsors (or sponsors’
designees) at a rate of approximately 350 per week [2]. Due to
legislative [1] and institutional [3] requirements enacted in the
latter half of the previous decade, compliance with registry
obligations is assumed to be high for U.S. drug and device trials,
and the consistency, quality, and maintenance of registry data
have improved with increased use [4]. However, the registry has
not been optimized for the analysis of aggregate data, and a
systematic effort to create and maintain a database for this purpose
has not previously been undertaken.
`In November 2007, the FDA and Duke University announced
`the formation of a public-private partnership to improve the
`quality and efficiency of clinical trials. This collaboration of more
`than 60 organizations and government agencies was convened by
`Duke University under a memorandum of understanding with
`FDA, and is now known as the Clinical Trials Transformation
Initiative (CTTI) [5]. CTTI leaders recognized that ClinicalTrials.gov
represented a promising source for benchmarking the
`state of the clinical trials enterprise, as the registry contains studies
`from the full range of sponsoring organizations. Increasing the
`usability of ClinicalTrials.gov data may therefore facilitate
`systematic evaluation of clinical studies aimed at building the
`
`
`PLoS ONE | www.plosone.org
`
`1
`
`March 2012 | Volume 7 |
`
`Issue 3 | e33677
`
`MPI EXHIBIT 1065 PAGE 1
`
`Apotex v. Novo - IPR2024-00631
`Petitioner Apotex Exhibit 1065-0001
`
`
`
`Database for Aggregate Analysis of CT.gov
`
knowledge base needed to inform medical practice and prevention.
As data have accumulated in ClinicalTrials.gov, users have
increasingly sought capabilities that would allow aggregated
descriptive characterization of the national research portfolio;
however, access and data usability issues, including data format
and design, present obstacles. A number of related initiatives,
including the Ontology of Clinical Research (OCRe) [6], Human
Studies Database (HSDB) [7], CDISC Protocol Representation
Model [8], and LinkedCT [9] projects, are addressing ontological
annotations, large-scale data mining, data representation format,
and external association of these data, respectively. The results of
this project are complementary to these initiatives and, together
with them, are expected to advance this area of study as a whole.
In this article, we report on CTTI’s efforts to prepare and
maintain a publicly accessible analysis dataset derived from
ClinicalTrials.gov content: the database for aggregate analysis
of ClinicalTrials.gov (AACT). We also discuss efforts to extend the
utility of the analysis dataset by means of an associated clinical
specialty taxonomy designed to support research policy analyses.
`
`Methods
`
`1. Creation of the AACT
Key design features of AACT include 1) the capacity to extend
the dataset by parsing existing data; 2) links to additional data
resources, such as the Medical Subject Headings (MeSH)
thesaurus; and 3) integrated metadata. A framework for extensions
allows entire studies or individual fields to be associated with new
data resources while preserving provenance. In addition, the
integrated data dictionary developed for this project facilitates
browsing and analysis of ClinicalTrials.gov and AACT metadata.
Finally, the database incorporates a flexible design that can
accommodate future developments, such as coding biospecimen
type, sponsors, and OCRe annotations. Figure 1 shows key
enhancements achieved by building the AACT.
`
[Figure 1 schematic. Components shown: metadata tables (CURRENT_VARIABLES, ENUMERATIONS, VARIABLE_HISTORY_DATES); the DESIGNS table with parsed Study Design (Primary Purpose, Masking, Intervention Model, Allocation, Endpoint Classification, Control, Observational Model, Time Perspective); AACT uses (aggregate analysis; customized queries; comparative data analysis; direct import into Oracle, SAS, etc., excluding specialty data sets); MeSH disease conditions annotated by clinicians; the MESH_REPORTING table for MeSH annotation validation; the MESH_SPECIALTY table with annotated MeSH conditions for each specialty (e.g., Cardiology, Oncology, Mental Health); the NON_MESH_SPECIALTY table with annotated free-text disease conditions for each specialty; and initial and final specialty data sets, with manual review.]
`
`Figure 1. A schematic representation of the database for Aggregate Analysis of ClinicalTrials.Gov (AACT) with its key
`enhancements.
`doi:10.1371/journal.pone.0033677.g001
`
`
`1.1. Data Sources. A dataset comprising 96,346 clinical
`studies was downloaded from ClinicalTrials.gov in XML format
`on September 27, 2010. We chose ClinicalTrials.gov for our study
`because it is the largest database of its kind and because it covers
`the full range of clinical conditions, includes a broad group of trial
`sponsors [10], and has a regulatory mandate [1]. The date of
`download was chosen to coincide with the anniversary of the
`enactment of the FDA Amendments Act (FDAAA) 3 years earlier,
`which mandated the registration of certain trials of FDA-regulated
`drugs, biologics, and devices [1].
We downloaded the 2010 MeSH thesaurus
(http://www.nlm.nih.gov/mesh/2010/download/termscon.html) and merged it
with the AACT database, where it was used as a lookup table to
locate corresponding tree numbers, referred to as MeSH IDs, for all
MeSH terms associated with each clinical trial in ClinicalTrials.gov.
Persons or organizations who submit studies to the registry
are requested to provide the condition and keyword data elements as
MeSH terms.
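The lookup described above can be pictured with a minimal Python sketch. It is an illustration only, not the project's Oracle implementation: the function names and the in-memory row list are hypothetical, and the tree numbers are the acromegaly and gigantism entries shown in Figure 5.

```python
from collections import defaultdict

# Hypothetical excerpt of (MeSH term, tree number) rows; the tree numbers
# are taken from the 2010 MeSH trees reproduced in Figure 5.
MESH_ROWS = [
    ("Acromegaly", "C05.116.132.082"),
    ("Acromegaly", "C10.228.140.617.738.250.100"),
    ("Acromegaly", "C19.700.355.179"),
    ("Gigantism", "C05.116.132.479"),
    ("Gigantism", "C19.700.355.528"),
]


def build_lookup(rows):
    """Map each lower-cased MeSH term to the list of its tree numbers (MeSH IDs)."""
    lookup = defaultdict(list)
    for term, tree_number in rows:
        lookup[term.lower()].append(tree_number)
    return dict(lookup)


def mesh_ids_for(term, lookup):
    """Return every tree number for a term, or an empty list for a non-MeSH term."""
    return lookup.get(term.strip().lower(), [])
```

A single term can map to several MeSH IDs because the same term may sit in more than one tree, which is why the lookup returns a list rather than one value.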
1.2. Data Model. ClinicalTrials.gov data element definitions,
XSD specifications for registry data submission, and downloaded
study XML files were used to represent data specifications for the
downloaded data. A physical data model was designed using
Enterprise Architect (Sparx Systems Pty Ltd, Creswick, Victoria,
Australia); this model depicted data tables and their data columns,
as well as relationships between and among tables. An optimal
structure was achieved through normalization, which was used to
organize data efficiently, eliminate redundancy, and ensure logical
data dependencies by storing only related data within a given table
[11]. The database (Figure 2) was normalized to the Second
Normal Form (2NF), a set of criteria designed to prevent logical
inconsistencies while reducing data redundancy [12].
We assigned data type and length of data elements based on
patterns observed for each data element in the downloaded XML
files. Whenever possible, we followed guidelines provided in
ClinicalTrials.gov’s draft Protocol Data Element Definitions [13]
when assigning lengths to given data elements. Data were housed
in Oracle RDBMS, version 11.1 g (Oracle Corporation, Redwood
Shores, California, USA). Enterprise Architect 7.1 was used for
database design, and additional transformation rules were
documented as extract-transform-load (ETL) specifications. PL/
`
[Figure 2 diagram. Entities shown: SPONSORS, MESH THESAURUS, MESH TREES, LINKS, INTERVENTION BROWSE, CONDITION BROWSE, OUTCOMES, CLINICAL_STUDY, DESIGNS, LOCATIONS, PERSONS, LOCATION CONTACT, FACILITIES, ADDRESSES, and the metadata tables CURRENT VARIABLES and ENUMERATIONS.]
`
`Figure 2. High-level Entity-Relationship Diagram (ERD) for AACT.
`doi:10.1371/journal.pone.0033677.g002
`
`
`
`
`Table 1. Escape characters and replacements.
`
[Table 1 body, two columns: Escape character | Replacement. Legible escape characters: ’ " & ,]
`
`doi:10.1371/journal.pone.0033677.t001
`
SQL packages were developed that used Oracle’s built-in DBMS_LOB
package to read the input XML files and load the data into the
designed tables. Quality control and
operational support processes were developed using standard
SQL queries through Toad for Data Analysts (Quest Software,
Aliso Viejo, CA, USA) and Cognos ReportNet (CRN) (IBM
Corporation, Armonk, NY, USA). We extended the core data
model to accommodate both data management and data curation
purposes. Error log tables and indexes were created for testing,
debugging, and performance enhancement. Manual user acceptance
testing was performed by randomly selecting five studies per
data element (from a total of 109 data elements) from the AACT
database. The values associated with each data element were
tested for correctness and completeness by comparing them with
the original source data from downloaded XML files. We also
created integrated data dictionary tables as reference tables
holding explicit data element definitions and system metadata
(Tables S1 and S2).
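As a rough illustration of the acceptance-testing step (five randomly selected studies per data element, each compared against the source XML), the following Python sketch may help. The function names, the dictionary layout, and the fixed random seed are assumptions made for this example; the original testing was performed manually.

```python
import random


def sample_for_acceptance(study_ids_by_element, per_element=5, seed=0):
    """Pick up to `per_element` random study IDs for every data element."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    plan = {}
    for element, study_ids in study_ids_by_element.items():
        ids = sorted(study_ids)
        plan[element] = rng.sample(ids, min(per_element, len(ids)))
    return plan


def mismatches(plan, database_values, source_values):
    """List (element, study_id) pairs whose loaded value differs from the source value."""
    return [
        (element, study_id)
        for element, ids in plan.items()
        for study_id in ids
        if database_values.get((element, study_id)) != source_values.get((element, study_id))
    ]
```

With 109 data elements and five studies each, a plan of this kind would cover at most 545 element-study pairs.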
During the course of database development, the NLM made
several new data elements available for public download, some of
which included information about the FDA (e.g., Section 801
clinical trials, studies with FDA-regulated interventions, and
expanded-access studies). In addition, MeSH condition
and intervention terms generated by the NLM algorithm were
made available for public download.
In XML files downloaded from ClinicalTrials.gov, the single
data element Study Design contains a string of concatenated values
for the different components of a study design, such as primary
purpose, interventional model, observational model, allocation,
endpoint classification, time perspective, and masking. While this
format is well-suited for supporting information retrieval, it does
not readily accommodate aggregate data analysis of the components
within the Study Design data element. For this reason, data
from Study Design were parsed into their components and stored in a
separate table called DESIGNS. Additional data elements (Design
Name and Design Value) were created to store all components of
study design and their respective enumerated values. Values
related to masking/blinding (e.g., Single; Double-Blind) were further
parsed into their components, along with the list of corresponding
masking subjects (Participant, Investigator, Outcome Assessor, and
Caregiver).
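The parsing step can be sketched in Python as follows. This is a minimal illustration, not the PL/SQL used for AACT. It assumes the concatenated string has the form "Name: Value, Name: Value, ..." with masking subjects in parentheses; that layout, the sample string in the usage note, and the function names are assumptions made for the example.

```python
import re

# Split on commas that are not inside a parenthesised list of masking subjects.
_TOP_LEVEL_COMMA = re.compile(r",\s*(?![^()]*\))")
# A masking value is a level, optionally followed by "(subject, subject, ...)".
_MASKING = re.compile(r"^(?P<level>[^()]+?)\s*(?:\((?P<subjects>[^()]*)\))?$")


def parse_study_design(study_design):
    """Return (Design Name, Design Value) rows from one concatenated string."""
    rows = []
    for part in _TOP_LEVEL_COMMA.split(study_design.strip()):
        if not part:
            continue
        name, _, value = part.partition(":")
        rows.append((name.strip(), value.strip()))
    return rows


def parse_masking(value):
    """Split a masking value into its level and its list of masking subjects."""
    match = _MASKING.match(value.strip())
    if match is None:  # unexpected layout: keep the raw value, no subjects
        return value.strip(), []
    subjects = match.group("subjects")
    roles = [s.strip() for s in subjects.split(",")] if subjects else []
    return match.group("level").strip(), roles
```

For example, `parse_study_design("Allocation: Randomized, Masking: Double-Blind (Participant, Investigator), Primary Purpose: Treatment")` yields three rows, and passing the Masking value to `parse_masking` separates the level `Double-Blind` from the subjects `Participant` and `Investigator`.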
Several challenges were encountered while loading the
database, including foreign characters embedded in XML files
`
[Figure 3 line chart. x-axis: Year Study Registered with ClinicalTrials.gov; y-axis: percentage scale from 0 to 100. Annotated events: ICMJE Trials Registration Policy; FDAAA law. Series: Data Monitoring Committee?, Study classification, Number of arms, Intervention model, Allocation, Masking, Enrollment, Gender, Lead sponsor. Legend notes mark elements that may be required by FDAAA, elements of which at least one is required by FDAAA, elements required by FDAAA, and elements required by FDAAA and ClinicalTrials.gov.]
`
`Figure 3. Percentage of interventional studies with complete data by registration year for selected data elements.
`doi:10.1371/journal.pone.0033677.g003
`
`
`with most of the data elements; these had to be replaced with
`character references (see Table 1 for examples).
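One way to picture this replacement is Python's built-in `xmlcharrefreplace` error handler, which turns every non-ASCII character into a numeric character reference. This is only a sketch of the general technique; the paper does not state which tool performed the replacement, and the specific mappings actually used are those listed in Table 1. The function name is hypothetical.

```python
def to_character_references(text):
    """Replace each non-ASCII character with a numeric XML character reference."""
    # ASCII text passes through unchanged; e.g. U+00F6 becomes "&#246;".
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")
```

For instance, `to_character_references("Sjögren")` returns `Sj&#246;gren`, which can be stored and later rendered without depending on the database character set.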
Several database design iterations were also prompted by two
further issues: the maximum length for each
data element noted in ClinicalTrials.gov’s May 2010 Protocol
Data Element Definitions document was not always consistent
with the complete dataset, and one-to-one or one-to-many
relationships between or among data elements were not obvious
in the XML data type definition from ClinicalTrials.gov.
1.3. Quality Assessment. Of the 96,346 studies downloaded
from ClinicalTrials.gov in September 2010, a total of 79,413
(82.4%) were interventional (i.e., a study in which an investigator
following a protocol assigns research participants to receive
specific interventions, as opposed to an observational study),
[Figure 4 flow diagram. Panel a: Stored disease conditions provided by submitters and MeSH terms generated by NLM algorithm (a user registers a study in ClinicalTrials.gov; other study content, e.g., protocol and criteria, is also shown). Panel b: Annotating disease conditions (annotation of MeSH and non-MeSH disease condition terms by clinical specialists, followed by confirmation of the annotations). Panel c: Creation of specialty datasets (from confirmed annotations).]
`
Figure 4. An overview of methodology and process of developing clinical specialty datasets. The INTERVENTIONS, CONDITIONS, and
KEYWORDS tables consist of disease condition terms provided by data submitters that include both MeSH and non-MeSH terms. The
INTERVENTION_BROWSE and CONDITION_BROWSE tables are populated by MeSH terms generated by the NLM algorithm. (a) Process illustrating how
MeSH terms are created in ClinicalTrials.gov. Tables and data shown here do not represent the entire ClinicalTrials.gov database. (b) Process illustrating
the annotation and validation of disease conditions. (c) Process illustrating the creation of specialty datasets.
`doi:10.1371/journal.pone.0033677.g004
`
`
`
`
`Table 2. MeSH Subject Headings, 2010—Diseases.
`
`Bacterial Infections and Mycoses [C01]
`
`Virus Diseases [C02]
`
`Parasitic Diseases [C03]
`
`Neoplasms [C04]
`
`Musculoskeletal Diseases [C05]
`
`Digestive System Diseases [C06]
`
`Stomatognathic Diseases [C07]
`
`Respiratory Tract Diseases [C08]
`
`Otorhinolaryngologic Diseases [C09]
`
`Nervous System Diseases [C10]
`
`Eye Diseases [C11]
`
`Male Urogenital Diseases [C12]
`
`Female Urogenital Diseases and Pregnancy Complications [C13]
`
`Cardiovascular Diseases [C14]
`
`Hemic and Lymphatic Diseases [C15]
`
`Congenital, Hereditary, and Neonatal Diseases and Abnormalities [C16]
`
`Skin and Connective Tissue Diseases [C17]
`
`Nutritional and Metabolic Diseases [C18]
`
`Endocrine System Diseases [C19]
`
`Immune System Diseases [C20]
`
`Disorders of Environmental Origin [C21]
`
`Animal Diseases [C22]
`
`Pathological Conditions, Signs and Symptoms [C23]
`
`Available at: http://www.nlm.nih.gov/mesh/trees.html
`
`doi:10.1371/journal.pone.0033677.t002
`
16,506 (17.1%) were observational, 107 (0.1%) were expanded-access,
and 320 had no information about the study type. We
analyzed selected data elements in interventional studies for
completeness of data (a null value in a data element, for example,
counted as incomplete) and
observed a trend toward increasing completeness of data over
time. This trend appears to have been notably affected by two
milestones in the history of ClinicalTrials.gov. In September 2004,
the International Committee of Medical Journal Editors (ICMJE)
published a policy requiring registration of interventional trials as
a condition of publication [3]. The ICMJE requirements took
effect in September 2005, which may account for the increase in
completeness for some data elements in 2005 (Figure 3).
In September 2007, the FDAAA [1] made the registration of
interventional studies mandatory. This requirement took effect in
December 2007 and may further account for increases in the
Table 3. Frequency of intermediate terms and top node
terms that did not match annotations of lower-level terms.

Specialty        n/N (%)
Cardiology       172/5264 (3.3%)
Oncology         284/5264 (5.4%)
Mental health    93/5264 (1.8%)
`
`
completeness of data elements in the ClinicalTrials.gov dataset. In
Figure 3, the data elements “data monitoring committee” and
“number of arms” were not available at the time that earlier
studies were registered. It is important to note that the presence of
these data elements for studies pre-dating December 2007 reflects
later updates performed by data providers.
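The completeness measure plotted in Figure 3 can be sketched as the share of studies, per registration year, that hold a non-null value for a given data element. The Python below is an illustration under assumptions: the record layout, the `registration_year` key, and the treatment of empty strings as missing are hypothetical choices, not the paper's stated rules.

```python
from collections import defaultdict


def completeness_by_year(studies, element):
    """Percentage of studies per registration year with a value for `element`."""
    totals = defaultdict(int)
    complete = defaultdict(int)
    for study in studies:
        year = study["registration_year"]
        totals[year] += 1
        value = study.get(element)
        if value is not None and value != "":  # null or empty counts as incomplete
            complete[year] += 1
    return {
        year: round(100.0 * complete[year] / totals[year], 1)
        for year in sorted(totals)
    }
```

Because data providers can update records later, a value present for a study registered before December 2007 may have been entered after that date, so the per-year figures describe the state of the records at download time.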
1.4. Changes in ClinicalTrials.gov’s Protocol Data
Element Definitions. The ClinicalTrials.gov Protocol Data
Element Definitions (PDED) have evolved since the database was
first launched. Although references containing individual protocol
data element definitions are provided for submitters with each
release of the definitions document, there is no document that
tracks changes for all data elements for review as data
specifications. Such changes include changing enumerated values for a
data element, revising a data element definition, making a
particular data element publicly available, introducing a new
data element, and entirely deleting a data element. However, more
rigorous submission rules imposed by mandating organizations
(e.g., NLM, FDA), such as those required by the FDAAA and
ClinicalTrials.gov, appear to have had the greatest impact on the
completeness of data.
`Changes to a data element play a significant role in the analysis
`of study data. As we examined each data element’s history, we
`noted that between September 2004 and July 2005 (a period
`spanning 3 releases of the PDED), and again in December 2007,
`the data element requirements were not documented in the
`definitions document. Other inconsistencies were also noted and
`later confirmed (Personal communication, Dr. Deborah Zarin and
`Mr. Nicholas Ide, February 18, 2011).
`1.5. A Public Resource. The AACT can be downloaded as
`Oracle extracts (.dmp file and text format output; available at
https://www.trialstransformation.org/projects/improving-the-public-interface-for-use-of-aggregate-data-in-clinicaltrials.gov/aact-database-for-aggregate-analysis-of-clinicaltrials.gov).
Additional documents are
`available to assist users in interpreting the data. The high-level data
`dictionary and a comprehensive data dictionary noted previously are
`included in the dataset file. The comprehensive data dictionary
`contains seven sections: 1) current variables, 2) enumerations, 3)
`constraints, 4) record counts, 5) database schema, 6) comprehensive
`change history, and 7) variable history dates. This document provides
`definitions, derivation of terms, data model structure and references,
`NLM and FDAAA requirements, and historical information for each
`data element in ClinicalTrials.gov to facilitate understanding of when
`variables were added, modified, or discontinued. The high-level data
`dictionary provides a summary view of the variables contained in the
`AACT database.
`
`2. A Methodology to Regroup Studies in
`ClinicalTrials.Gov by Specialty
ClinicalTrials.gov contains studies from multiple clinical
domains. While the AACT database facilitates the aggregate
`analysis of the entire dataset, it does not in itself support analysis
`within specific specialty domains. We therefore developed a
`methodology to re-group studies from ClinicalTrials.gov by
`clinical specialties as designated by the Department of Health
`and Human Services [14]. In doing so, we relied on MeSH
`condition terms and free-text disease condition terms associated
`with each study in the ClinicalTrials.gov database—a method
`that can be used to develop other specialized datasets for
`analysis.
`2.1. Use of MeSH Terminology in the ClinicalTrials.gov
`Database. Data submitters (study sponsors or their designees)
`are requested to provide Condition and Keywords data as MeSH
`terms when registering a study. Additionally, an NLM algorithm
`
`n = number of intermediate- and top-node MeSH terms for a given specialty
`that do not match the annotations of their lower-level terms. N = total number
`of intermediate- and top-node MeSH terms.
`doi:10.1371/journal.pone.0033677.t003
`
`
Musculoskeletal Diseases [C05]
  Bone Diseases [C05.116]
    Bone Diseases, Endocrine [C05.116.132]
      ► Acromegaly [C05.116.132.082]
      Congenital Hypothyroidism [C05.116.132.256]
      Dwarfism, Pituitary [C05.116.132.358]
      Gigantism [C05.116.132.479]
      Osteitis Fibrosa Cystica [C05.116.132.684]

Nervous System Diseases [C10]
  Central Nervous System Diseases [C10.228]
    Brain Diseases [C10.228.140]
      Hypothalamic Diseases [C10.228.140.617]
        Pituitary Diseases [C10.228.140.617.738]
          Hyperpituitarism [C10.228.140.617.738.250]
            ► Acromegaly [C10.228.140.617.738.250.100]
            Hyperprolactinemia [C10.228.140.617.738.250.450]
            Pituitary ACTH Hypersecretion [C10.228.140.617.738.250.725]

Endocrine System Diseases [C19]
  Pituitary Diseases [C19.700]
    Hyperpituitarism [C19.700.355]
      ► Acromegaly [C19.700.355.179]
      Gigantism [C19.700.355.528]
      Hyperprolactinemia [C19.700.355.600]
      Pituitary ACTH Hypersecretion [C19.700.355.800]
`
`Figure 5. MeSH trees for acromegaly. Source: 2010 online MeSH thesaurus (available: http://www.nlm.nih.gov/cgi/mesh/2010/MB_cgi).
`doi:10.1371/journal.pone.0033677.g005
`
also evaluates studies and applies MeSH terms according to the
following steps: 1) study records are checked for the presence of a
MeSH term, including synonyms and lexical variations; 2)
weighted scores are computed for all matches, with exact
matches, lexical variations, and synonyms receiving descending
proportional weight; 3) very common terms are excluded to avoid
confounding; 4) location by data element is considered and
weighted in the term scoring process; and 5) terms with scores
exceeding the cutoff value are applied to the respective studies.
(Note that the output from steps 1 and 2 is used for both condition
and intervention annotations; the field weights are different for
each and divert terms into the target annotation type.) This
method does not consider the natural-language context for
matched terms or ontologically related concepts that would add
specificity. Neither data submitters nor the NLM
algorithm attempts to associate a term with a particular MeSH
hierarchy. The resulting annotated MeSH terms are visible on
the ClinicalTrials.gov website and populated in the condition_browse
and intervention_browse fields in the downloaded XML file for each
study. Figure 4 illustrates how MeSH terms are created in the
ClinicalTrials.gov database.
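The five steps above can be sketched in Python. This is not the NLM algorithm itself, whose weights, common-term list, and cutoff are not given in this article. Every numeric weight, the excluded-term set, the field names, the cutoff, and the choice to keep the best score per term are hypothetical values chosen only to make the flow concrete.

```python
# All constants below are hypothetical placeholders, not NLM's actual values.
MATCH_WEIGHTS = {"exact": 1.0, "lexical": 0.75, "synonym": 0.5}  # step 2
FIELD_WEIGHTS = {"condition": 1.0, "title": 0.8, "keyword": 0.6, "description": 0.3}  # step 4
COMMON_TERMS = {"disease", "syndrome"}  # step 3
CUTOFF = 0.7  # step 5


def score_terms(matches):
    """Score matched terms; `matches` holds (mesh_term, match_type, field) tuples from step 1."""
    scores = {}
    for term, match_type, field in matches:
        if term.lower() in COMMON_TERMS:
            continue  # very common terms are excluded
        score = MATCH_WEIGHTS[match_type] * FIELD_WEIGHTS[field]
        scores[term] = max(scores.get(term, 0.0), score)  # assumption: best match wins
    return scores


def apply_terms(matches, cutoff=CUTOFF):
    """Return the terms whose score exceeds the cutoff, in alphabetical order."""
    return sorted(t for t, s in score_terms(matches).items() if s > cutoff)
```

Using a different `FIELD_WEIGHTS` table for conditions and for interventions would mirror the note that the same matches feed both annotation types while the field weights divert terms to one or the other.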
`
2.2. MeSH Disease Conditions Annotation. Condition
and intervention terms in the MeSH thesaurus are arrayed in
hierarchical branching structures, called trees; each branching
point is referred to as a node. Nodes range from 1 (highest level) to
12 (lowest level) in the 2010 version of the MeSH thesaurus. For
example, one high-level category that we used to classify studies by
clinical specialty was Diseases. In the 2010 MeSH thesaurus, this
category contains 23 subcategories (Table 2).
In order to create specialty datasets from the larger AACT
dataset, we selected four high-level MeSH nodes from the
thesaurus to serve as an initial basis for identifying studies by
clinical specialty. Reviewers with relevant subject matter expertise
annotated MeSH terms from the following high-level nodes: 1)
Diseases; 2) Analytical, Diagnostic and Therapeutic Techniques and
Equipment; 3) Psychiatry and Psychology; and 4) Phenomena and Processes.
A total of 18,491 MeSH IDs associated with 9,031 MeSH terms
were reviewed and annotated by clinical specialists belonging to
one of the 13 clinical specialties and five sub-specialties, which
were selected on the basis of availability of faculty representation
and volunteers at Duke, as well as intention to analyze subsets of
data by clinical specialty. Participating specialty annotations
`
`
[Figure 6 flowchart, applied to each study (e.g., NCT00zzzzzz) for a given Specialty X. Decision boxes and outcomes: 1) Do any condition_browse or condition terms have a Y tag for Specialty X? If yes, assign Y for Specialty X (GROUP 1); 2) Do any condition_browse or condition terms have ambiguous tags for Specialty X? If yes, possibly assign to Specialty X (GROUP 2); 3) Do all condition_browse terms and all condition terms have an N tag for Specialty X? If yes, assign N for Specialty X (GROUP 3); 4) Do any condition_browse terms or any condition terms have an N tag for Specialty X? If yes, possibly assign N for Specialty X (GROUP 4); if no, unclassified for Specialty X (GROUP 5). Study details may then be reviewed, if desired, with related studies included and unrelated studies excluded to form the final subset for the Specialty X manuscript.]
`
`Figure 6. Rules for deciding whether a given study belongs to a given specialty.
`doi:10.1371/journal.pone.0033677.g006
`
included cardiology, dermatology, endocrinology, gastroenterology,
immunology/rheumatology, infectious diseases, mental
health, nephrology, neurology, oncology, otolaryngology, pulmonary
medicine, and reproductive medicine, while subspecialty
annotations included peripheral vascular disease, peripheral arterial
disease, diabetes, thyroid disease, and bone disease. The
association of terms with clinical specialties was performed in
the context of the anticipated analysis of the data subset for
respective specialties. The results of this extension to the AACT
database, including specialty tags, will be shared in future
publications.
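The decision rules in Figure 6 for assigning a study to a specialty can be sketched as a short Python function. The ordering of the questions (any Y, then any ambiguous, then all N, then any N) is inferred from the figure and should be read as an assumption, as should the tag encoding used here: "Y", "N", "A" for ambiguous, and None for a term with no tag for the specialty.

```python
def classify_study(tags):
    """Return the Figure 6 group (1 to 5) for one study and one specialty.

    `tags` holds one entry per condition or condition_browse term:
    "Y", "N", "A" (ambiguous), or None (no tag). This encoding is hypothetical.
    """
    if any(tag == "Y" for tag in tags):
        return 1  # assign Y for the specialty
    if any(tag == "A" for tag in tags):
        return 2  # possibly assign to the specialty
    if tags and all(tag == "N" for tag in tags):
        return 3  # assign N for the specialty
    if any(tag == "N" for tag in tags):
        return 4  # possibly assign N for the specialty
    return 5  # unclassified for the specialty
```

Under this reading, GROUP 4 captures studies in which some terms are tagged N and the rest carry no tag, while GROUP 5 captures studies with no tagged terms at all.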
`2.3. Validation of Inconsistently Annotated MeSH Terms
`and Limitations of Using the MeSH Hierarchy. A term
`occurring at a particular node ‘‘node x’’ (parent) may have several
`branches (children) at node x+1 that provide a finer classification
`of the node-x term. Clinical specialists were advised to review the
`hierarchy of an individual MeSH term during the annotation
`process. Annotated MeSH descriptors were programmatically
`reviewed for hierarchical inconsistencies in order to maintain the
`logical relationship between parent and child MeSH descriptors.
`
Table 4. Number of studies reviewed by each set of clinician
reviewers.

Reviewer A ID    Reviewer B ID    Studies reviewed (n)
Clinician 1      Clinician 2      200
Clinician 1      Clinician 3      400*
Clinician 4      Clinician 5      200
Clinician 6      Clinician 7      200
`
`*The combination of Clinician 1 (‘‘A’’) and Clinician 3 (‘‘B’’) together reviewed 2
`batches of studies.
`doi:10.1371/journal.pone.0033677.t004
`
Tag validity was evaluated by a process based on annotation rules.
In general, selection or negation of a parent MeSH term should
match with all subsequent child MeSH terms below that node.
Hierarchical inconsistencies in MeSH annotations were flagged
and accepted after further review and confirmation by clinical
specialists. The anticipated inconsistency of the MeSH
hierarchical structure with clinical specialty groupings was
confirmed in the validation process. Table 3 shows the
frequency of parent terms that did not match with annotations
for their child terms.
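The programmatic check for parent and child mismatches can be sketched in Python by deriving each parent from the MeSH tree number. This is a minimal illustration, not the project's code. The function names are hypothetical, and the Y/N tags in the usage note are invented for the example rather than taken from the clinicians' annotations.

```python
def parent_tree_number(tree_number):
    """Return the tree number one level up, or None for a top-level node."""
    head, sep, _ = tree_number.rpartition(".")
    return head if sep else None


def find_inconsistencies(annotations):
    """Flag (parent, child) pairs whose tags differ for one specialty.

    `annotations` maps a MeSH tree number to its tag ("Y" or "N").
    """
    flagged = []
    for tree_number, tag in sorted(annotations.items()):
        parent = parent_tree_number(tree_number)
        if parent in annotations and annotations[parent] != tag:
            flagged.append((parent, tree_number))
    return flagged
```

For example, with `{"C19.700": "Y", "C19.700.355": "Y", "C19.700.355.179": "N"}` the function flags only the pair `("C19.700.355", "C19.700.355.179")`, which a specialist could then confirm or correct.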
`Further, a term might appear within more than one tree. For
`example, the MeSH term Acromegaly appears as part of multiple
`trees within the topmost MeSH hierarchical category of Diseases
`(Figure 5).
Depending on its hierarchical location, its context could fall
under Musculoskeletal Diseases, Nervous System Diseases