(12) INTERNATIONAL APPLICATION PUBLISHED UNDER THE PATENT COOPERATION TREATY (PCT)
`
`
`(10) International Publication Number
WO 2017/214574 A1

WIPO | PCT
`
`(19) World Intellectual Property
`Organization
`International Bureau
`
`(43) International Publication Date
`14 December 2017 (14.12.2017)
`
`(51) International Patent Classification:
`G06F 17/560 (2006.01)
G06F 19/28 (2011.01)
G06F 19/22 (2011.01)
C12N 15/10 (2006.01)
`
`(21) International Application Number:
`
`PCT/US2017/036868
`
`(72) Inventor: DIGGANS, James; 1324 Cordilleras Avenue,
`San Carlos, California 94070 (US).
`
`(74) Agent: HARBURGER, David; WILSON SONSINI
`GOODRICH & ROSATI, 650 Page Mill Road, Palo Alto,
`California 94304 (US).
`
(22) International Filing Date: 09 June 2017 (09.06.2017)

(25) Filing Language: English

(26) Publication Language: English
`
(30) Priority Data:
62/348,786    10 June 2016 (10.06.2016)      US
62/375,858    16 August 2016 (16.08.2016)    US
`
`(71) Applicant: TWIST BIOSCIENCE CORPORATION
`[US/US]; 455 Mission Bay Boulevard South, Suite 545, San
`Francisco, California 94158 (US).
`
`(81) Designated States (unless otherwise indicated, for every
`kind of national protection available): AE, AG, AL, AM,
AO, AT, AU, AZ, BA, BB, BG, BH, BN, BR, BW, BY, BZ,
CA, CH, CL, CN, CO, CR, CU, CZ, DE, DJ, DK, DM, DO,
DZ, EC, EE, EG, ES, FI, GB, GD, GE, GH, GM, GT, HN,
HR, HU, ID, IL, IN, IR, IS, JO, JP, KE, KG, KH, KN, KP,
KR, KW, KZ, LA, LC, LK, LR, LS, LU, LY, MA, MD, ME,
MG, MK, MN, MW, MX, MY, MZ, NA, NG, NL, NO, NZ,
OM, PA, PE, PG, PH, PL, PT, QA, RO, RS, RU, RW, SA,
SC, SD, SE, SG, SK, SL, SM, ST, SV, SY, TH, TJ, TM, TN,
TR, TT, TZ, UA, UG, US, UZ, VC, VN, ZA, ZM, ZW.
`
`(54) Title: SYSTEMS AND METHODS FOR AUTOMATED ANNOTATION AND SCREENING OF BIOLOGICAL SEQUENCES
`
`
`
(57) Abstract: The present disclosure describes software tools for effective biosecurity based on community knowledge and participation. Annotation tools described herein provide assistance to the synthetic biology community to track emerging science on the link
between individual proteins and negative outcomes. Screening tools described herein enable the community to broaden both interest in
and effective practice of biosecurity so that practitioners and biological sequence or construct providers are empowered to evaluate the
safety of order requests rather than waiting until synthesis or even expression. In addition, screening tools described herein provide
for screening of polynucleotides across the same or multiple orders for sequences associated with harmful biological sequences from
a reference database.

[FIG. 3A: flow chart with elements labeled Query File, Protein database, Blast Report, Restricted lists, and Screen Report]

[Continued on next page]
`
`

`

`
`(84) Designated States (unless otherwise indicated, for every
`kind of regional protection available): ARIPO (BW, GH,
`GM, KE, LR, LS, MW, MZ, NA, RW, SD, SL, ST, SZ, TZ,
UG, ZM, ZW), Eurasian (AM, AZ, BY, KG, KZ, RU, TJ,
TM), European (AL, AT, BE, BG, CH, CY, CZ, DE, DK,
EE, ES, FI, FR, GB, GR, HR, HU, IE, IS, IT, LT, LU, LV,
`MC, MK, MT, NL, NO, PL, PT, RO, RS, SE, SI, SK, SM,
`TR), OAPI (BF, BJ, CF, CG, CI, CM, GA, GN, GQ, GW,
`KM, ML, MR, NE, SN, TD, TG).
`
`Declarations under Rule 4.17:
`
— as to the applicant's entitlement to apply for and be granted a
`patent (Rule 4.17(ii))
`— as to the applicant's entitlement to claim the priority of the
`earlier application (Rule 4.17(iii))
`Published:
`
`— with international search report (Art. 21(3))
`
`

`

`WO 2017/214574
`
`PCT/US2017/036868
`
`SYSTEMS AND METHODS FOR AUTOMATED ANNOTATION AND SCREENING OF
`
`BIOLOGICAL SEQUENCES
`
`CROSS-REFERENCE
`
`[0001] This application claims the benefit of U.S. provisional patent application number
`
`62/348,786 filed on June 10, 2016 and U.S. provisional patent application number 62/375,858 filed
`
`on August 16, 2016, each of which is incorporated by reference in its entirety.
`
`BACKGROUND
`
`[0002] The growth rate in our collective knowledge about individual proteins and biological
`
`systems capable of posing potential threats to public safety and/or the environment is tremendous.
`
`This knowledge, however, is widely distributed across diverse research communities, institutions
`
and even journals. There is a lack of a centralized information source focused on annotating the
`
`potential for a given protein to cause harm and in what context this harm can arise. Thus, new
`
`systems and methods are necessary to address the challenge.
`
BRIEF SUMMARY

[0003] Provided herein are computerized systems for providing enhanced polynucleotide synthesis
comprising a server for hosting a database, wherein the database is adapted for representing a list of
harmful biological sequences; a network connection; and a computer readable medium comprising
instructions for a general purpose computer, wherein said computerized system is configured for
operating in a method of: 1) receiving one or more design instructions, wherein the design
instructions comprise a plurality of biological sequences, wherein each of the biological sequences
is no more than 500 bases in length, and wherein the plurality of biological sequences comprise a
nucleic acid or amino acid sequence; 2) automatically determining whether at least two biological
sequences of the plurality of biological sequences collectively correspond to at least 20% of a
harmful biological sequence in the database; and 3) automatically generating an alert if at least 20%
of the harmful biological sequence is detected. Further provided herein are computerized systems
further comprising wherein if no alert is generated, then one or more sequences are synthesized.
Further provided herein are computerized systems further comprising receiving instructions for
changing the at least two biological sequences of the plurality of biological sequences
corresponding to at least 20% of the harmful biological sequence to remove the harmful biological
sequence. Further provided herein are computerized systems wherein the plurality of received
design instructions are received at one or more time points. Further provided herein are
computerized systems wherein the plurality of received design instructions are from 3 or more
`
`different sources. Further provided herein are computerized systems wherein the plurality of
`
`received design instructions are from 5 or more different sources. Further provided herein are
`
`computerized systems wherein the plurality of received design instructions are from 10 or more
`
`different sources. Further provided herein are computerized systems wherein the one or more
`
biological sequences are each no more than 200 bases in length. Further provided herein are
`
`computerized systems wherein the one or more biological sequences are each no more than 100
`
`bases in length. Further provided herein are computerized systems wherein the one or more
`
`biological sequences are each no more than 50 bases in length. Further provided herein are
`
`computerized systems wherein the one or more biological sequences are each no more than 20
`
`bases in length.
`
`[0004] Provided herein are methods for providing enhanced polynucleotide synthesis comprising:
`
`1) receiving one or more design instructions, wherein the design instructions comprise a plurality of
`
biological sequences, wherein each of the biological sequences is no more than 500 bases in length,
and wherein the plurality of biological sequences comprise a nucleic acid or amino acid sequence;
2) automatically determining whether at least two biological sequences of the plurality of biological
sequences collectively correspond to at least 20% of a harmful biological sequence in a database;
`
`and 3) automatically generating an alert if at least 20% of the harmful biological sequence is
`
`detected. Further provided herein are methods further comprising wherein if no alert is generated,
`
`the one or more sequences are synthesized. Further provided herein are methods further
`
`comprising receiving instructions for changing the at least two biological sequences of the plurality
`
`of biological sequences corresponding to at least 20% of the harmful biological sequence to remove
`
`the harmful biological sequence.
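
The collective determination in step 2 can be illustrated with a minimal Python sketch. It assumes that each received sequence has already been aligned to a harmful reference sequence and reduced to a half-open interval (start, end) on that reference; the alignment step itself is not shown. The names `merge_intervals`, `collective_coverage` and `should_alert` are hypothetical and do not come from the disclosure; only the 20% threshold is taken from the text above.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on the reference sequence


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals so no base is counted twice."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def collective_coverage(intervals: Iterable[Interval], reference_length: int) -> float:
    """Fraction of the reference covered by all fragments taken together."""
    if reference_length <= 0:
        raise ValueError("reference_length must be positive")
    covered = sum(end - start for start, end in merge_intervals(intervals))
    return covered / reference_length


def should_alert(intervals: Iterable[Interval], reference_length: int,
                 threshold: float = 0.20) -> bool:
    """True when the fragments collectively reach the coverage threshold."""
    return collective_coverage(intervals, reference_length) >= threshold


# Three fragments that are each below 20% of a 1000-base reference.
fragments = [(0, 80), (60, 150), (400, 460)]
print(merge_intervals(fragments))            # [(0, 150), (400, 460)]
print(collective_coverage(fragments, 1000))  # 0.21
print(should_alert(fragments, 1000))         # True
```

In this example no single fragment covers more than 9% of the reference, yet together they cover 21%, which is the situation the collective check is meant to catch.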
`
`[0005] Provided herein are computerized systems for providing enhanced polynucleotide synthesis
`
comprising a server for hosting a database, wherein the database is adapted for representing a list of
`
`sequences; a network connection; and a computer readable medium comprising instructions for a
`
`general purpose computer, wherein said computerized system is configured for operating in a
`
`method of: 1) receiving one or more design instructions, wherein the design instructions comprise a
`
`plurality of biological sequences, wherein the plurality of biological sequences is a vector
`
sequence, and a plurality of additional insert sequences; 2) automatically determining whether the
`
`vector and at least one of the plurality of insert sequences collectively corresponds to at least 20%
`
`of a harmful biological sequence in the database; and 3) automatically generating an alert if at least
`
`20% of the harmful biological sequence is detected. Further provided herein are computerized
`
`systems wherein the biological sequences are obtained from sequencing a physical nucleic acid
`
sample. Further provided herein are computerized systems further comprising wherein if no alert is
`
`
`generated, the one or more biological sequences are synthesized. Further provided herein are
`
`computerized systems further comprising receiving instructions for changing the vector and the at
`
`least one of the plurality of insert sequences corresponding to at least 20% of the harmful biological
`
`sequence to remove the harmful biological sequence. Further provided herein are computerized
`
`systems for providing enhanced polynucleotide synthesis wherein the plurality of received design
`
`instructions are received at one or more time points. Further provided herein are computerized
`
`systems wherein the plurality of received design instructions are received from different sources.
`
`Further provided herein are computerized systems wherein the plurality of received design
`
`instructions are from 3 or more different sources. Further provided herein are computerized systems
`
`wherein the plurality of received design instructions are from 5 or more different sources. Further
`
`provided herein are computerized systems wherein the plurality of received design instructions are
`
`from 10 or more different sources. Further provided herein are computerized systems wherein the
`
`one or more biological sequences are each no more than 200 bases in length. Further provided
`
`herein are computerized systems wherein the one or more biological sequences are each no more
`
than 100 bases in length. Further provided herein are computerized systems wherein the one or
`
`more biological sequences are each no more than 50 bases in length. Further provided herein are
`
`computerized systems wherein the one or more biological sequences are each no more than 20
`
`bases in length.
`
`[0006] Provided herein are methods for providing enhanced polynucleotide synthesis comprising:
`
`1) receiving one or more design instructions, wherein the design instructions comprise a plurality of
`
`biological sequences, wherein the plurality of biological sequences is a vector sequence, and a
`
`plurality of additional insert sequences; 2) automatically determining whether the vector and at
`
`least one of the plurality of insert sequences collectively corresponds to at least 20% of a harmful
`
`biological sequence in the database; and
`
`3) automatically generating an alert if at least 20% of the harmful biological sequence is detected.
`
`Further provided herein are methods wherein the biological sequences are obtained from
`
`sequencing a physical nucleic acid or protein sample. Further provided herein are methods further
`
`comprising wherein if no alert is generated, the one or more biological sequences are synthesized.
`
Further provided herein are methods further comprising receiving instructions for changing the vector and the at least
one of the plurality of insert sequences corresponding to at least 20% of the harmful biological
`
`sequence to remove the harmful biological sequence.
`
`INCORPORATION BY REFERENCE
`
[0007] All publications, patents, and patent applications mentioned in this specification are herein
`
`incorporated by reference to the same extent as if each individual publication, patent, or patent
`
`application was specifically and individually indicated to be incorporated by reference in their
`
`entirety.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`[0008] The technical features of the present disclosure are set forth with particularity in the
`
`appended claims. A better understanding of the features and advantages of the present disclosure
`
`will be obtained by reference to the following detailed description that sets forth illustrative
`
embodiments, in which the principles of the disclosure are utilized, and the accompanying
drawings, as follows.
`
[0009] FIG. 1 illustrates a user interface which includes a protein sequence and associated species,
host, pathogen, route to harm, outcome and protein type information. Also included are a sequence
accession number, a listing of identical proteins, links to a database with sequence records, and
`
`links to similar proteins.
`
`[0010] FIG. 2 illustrates a user interface which includes a partial listing of protein variants and an
`
`exemplary protein, “Hemagglutinin Neuraminidase-Newcastle Disease virus.”
`
`[0011] FIG. 3A depicts a flow chart including information from a query file, a protein database, a
`
`blast report, restricted lists (harmful sequence lists) and screen report.
`
`[0012] FIG. 3B depicts a flow chart which includes various forms of input (nucleic acid material,
`
nucleic acid or protein sequence), decision making (restricted list, unrestricted list, expert review),
`
`and output (issuing alerts).
`
[0013] FIG. 4 illustrates a user interface which includes lists of databases for searching in a screen.
Columns for role, type, name, description, date added and active state are included.
`
`[0014] FIG. 5 illustrates a user interface which includes a sequence submission screen. Form
`
entries for name, database, description and FASTA file, and a “Submit” button are included. The
database form has a drop-down column that appears upon click with subcategories, including
`
`“Seqshield,” “nr” and “Personal Database.”
`
[0015] FIG. 6 illustrates a user interface which includes a summary of screening status.

[0016] FIG. 7 illustrates a user interface which includes a pull-down menu for selection of
“Unreviewed,” “Of concern,” or “No concern” sequences screened.

[0017] FIG. 8 illustrates a computing system.

[0018] FIG. 9 illustrates a computer system.
`
`[0019] FIG. 10 is a block diagram illustrating an architecture of a computer system.
`
`[0020] FIG. 11 is a diagram demonstrating a network configured to incorporate a plurality of
`
`computer systems, a plurality of cell phones and personal data assistants, and Network Attached
`
`Storage (NAS).
`
`
[0021] FIG. 12 is a block diagram of a multiprocessor computer system using a shared virtual
`
`address memory space.
`
`DETAILED DESCRIPTION
`
`[0022] With the rapid growth in design capability in synthetic biology, it is now possible to create
`
`large numbers of constructs often using a heavily mutated sequence that does not directly resemble
`
the reference sequence from which it was originally derived. At the same time, scientific advances
in the understanding of the processes behind pathogenicity (in a variety of hosts and biological
contexts) are rapidly creating new knowledge of protein sequences that, in context-dependent ways,
can cause harm to human beings, specific plants or animals, or to the environment more broadly.
`
`[0023] Ethical, responsible synthetic biologists may unwittingly create constructs capable of
`
`causing harm, but be unable to predict or understand that capability prior to instantiating synthetic
`
designs in living systems. As predicting function from primary sequence alone is not feasible, these
scientists would be well-served by having access to 1) a repository of metadata on what sequences
can cause harm along with regulatory status and 2) an effective screening system for checking
DNA or protein sequences against that metadata and alerting the user to any potential concern. In
addition, a screening system capable of addressing these needs must itself be amenable to
automation so as to fit seamlessly into high-throughput design/build/test workflows. The present
disclosure provides for software tools to address both the lack of publicly available gene-level
metadata on pathogenicity as well as the lack of open source tools for effective screening.
`
`[0024] Definitions
`
[0025] While various embodiments have been shown and described herein, it will be obvious to
those skilled in the art that such embodiments are provided by way of example only. Numerous
variations, changes, and substitutions may occur to those skilled in the art without departing from
the devices, systems and methods disclosed herein. It should be understood that various alternatives to
`
`the embodiments described herein may be employed.
`
`[0026] Unless otherwise defined, all technical terms used herein have the same meaning as
`
commonly understood by one of ordinary skill in the art to which this disclosure belongs. As used
in this specification and the appended claims, the singular forms “a,” “an,” and “the” include plural
`
`references unless the context clearly dictates otherwise. Any reference to “or” herein is intended to
`
`encompass “and/or” unless otherwise stated.
`
`[0027] Unless specifically stated or obvious from context, as used herein, the term "about" in
`
reference to a number or range of numbers is understood to mean the stated number and numbers
+/- 10% thereof, or 10% below the lower listed limit and 10% above the higher listed limit for the
values listed for a range.
`
`
`Sequence annotation
`
`[0028] Knowledge about the capacity of any single sequence to cause some type of harm may be
`
`extremely distributed. Individual communities of researchers focus on widely varying aspects of
`
`pathogenicity including the ability of organisms to infiltrate host cells, hijack host cellular
`
machinery, hide from the host immune system and even to enhance the host immune response.
`
`Exemplary harmful biological sequences include those that encode for a pathogenic sequence, such
`
`as those which are harmful and from viral, bacterial, or parasitic origins. Harmful biological
`
sequences may include mutant forms of wild-type sequences which are known to have pathogenic
`
`effects. Harmful biological sequences include sequences that produce harmful sequence products
`
`after transcription or translation, or act as precursors to harmful sequence products. Harmful
`
`biological sequences include sequences that encode for harmful proteins.
`
[0029] Among other facets, the present disclosure provides for a MediaWiki-based user interface
`
`that allows a user to submit sequences along with tag-based annotation of roles in pathogenicity.
`
`Users may be encouraged to submit several tags for each sequence to describe the general patterns
`
`of harm associated with a given sequence modeled as:
`
`Host + Context = Outcome + Level of Concern
`
[0030] The present system may take a tag-based approach so as not to impose a priori a single
controlled vocabulary. The collection of tags resulting from community annotation could form the
`
`basis of such a controlled vocabulary over the longer term.
`
[0031] As each sequence is uploaded, users may be asked to add tags in each of four categories.
`
`Tagging ‘Host’ and ‘Level of Concern’ are mandatory; adding tags for ‘Context’ and ‘Outcome’
`
`are optional given the additional complexity and domain knowledge required.
`
[0032] As an example, a sequence encoding the toxin ricin might be tagged by a user as:

Host = [illegible in source]
Context = ingestion, inhalation
Outcome = fever, cough, respiratory failure, death
Level of Concern = Extreme
`
`[0033] The goal is accumulation of metadata over time more than universal completeness. The
`
`system is centrally hosted and offers the entire set of curated sequences (or subsets based on queries
`
by tag) for download as FASTA for use in screening.
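
One way to picture the tag model and the tag-filtered FASTA download is the following Python sketch. It is an assumption-laden illustration, not the disclosed implementation: the class `AnnotatedSequence`, the function `to_fasta`, the header layout and the placeholder accession and sequence are all hypothetical. The only rules taken from the text are that 'Host' and 'Level of Concern' tags are mandatory while 'Context' and 'Outcome' are optional.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional, Set


@dataclass
class AnnotatedSequence:
    """A curated sequence with tags in the four categories described above."""
    accession: str
    sequence: str
    host: List[str]                                    # mandatory
    level_of_concern: str                              # mandatory
    context: List[str] = field(default_factory=list)   # optional
    outcome: List[str] = field(default_factory=list)   # optional

    def __post_init__(self) -> None:
        if not self.host:
            raise ValueError("at least one 'Host' tag is required")
        if not self.level_of_concern:
            raise ValueError("a 'Level of Concern' tag is required")

    def tags(self) -> Set[str]:
        """All tags attached to this record, across the four categories."""
        return {*self.host, *self.context, *self.outcome, self.level_of_concern}


def to_fasta(records: Iterable[AnnotatedSequence],
             tag: Optional[str] = None, width: int = 60) -> str:
    """Export records (optionally only those carrying `tag`) as FASTA text."""
    lines: List[str] = []
    for rec in records:
        if tag is not None and tag not in rec.tags():
            continue
        # Hypothetical header layout: accession plus the level of concern.
        lines.append(f">{rec.accession} concern={rec.level_of_concern}")
        lines.extend(rec.sequence[i:i + width]
                     for i in range(0, len(rec.sequence), width))
    return "\n".join(lines) + ("\n" if lines else "")


# Placeholder record: the accession and residues are made up for illustration.
record = AnnotatedSequence(
    accession="EXAMPLE_0001",
    sequence="ACDEFGHIKL" * 8,
    host=["mammal"],
    level_of_concern="low",
    context=["ingestion"],
)
print(to_fasta([record], tag="ingestion"))
```

Keeping the tags as free-form strings mirrors the stated choice not to impose a controlled vocabulary up front.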
`
[0034] Provided herein are methods for sequence annotation wherein a database receives a listing
`
`of characteristics associated with a biological sequence or biological construct (e.g., nucleotide
`
`sequence or protein sequence). Exemplary characteristics include, without limitation: nucleic acid
`
`sequence, protein sequence, protein name, strain source, link to sequence database (e.g., NCBI),
`
sequence database accession number, identical sequences (protein or nucleic acid), similar
sequences (protein or nucleic acid), disease type (e.g., virus, bacterium, or fungus), host information
(e.g., humans, mammals, birds, insects), context or route of harmful interaction (e.g., ingestion,
inhalation), and level of concern. Also provided herein is a user interface which presents each
characteristic or a link to additional information of such characteristics. See FIG. 1. In some
cases, viral sequences for a particular strain are selected. For example, FIG. 2 illustrates a portion
of 679 available strains of Hemagglutinin Neuraminidase-Newcastle Disease virus for annotation.
`
`[0035] Exemplary species include animal species. “Animals” as used herein includes, without
`
`limitation, mammals, marsupials, birds, insects, arthropods, amphibians and reptiles. Exemplary
`
`mammals include, without limitation, sheep, cattle, goats, pigs, rabbits, hares, deer, goats, mice,
`
rats, bats, and possums, and the like. Exemplary disease types include pathogens from the
following classes: viruses, bacteria, fungi and other harmful pathogens. Exemplary viruses
having harmful expression products include, without limitation, Marburg virus, Ebola virus,
Hantavirus, bird flu (e.g., H5N1 strain), Lassa virus, Junin virus, Crimea-Congo fever, Machupo
virus, Kyasanur Forest Virus, Dengue fever, and Chikungunya virus. Exemplary bacteria having
harmful expression products include, without limitation, Multi-Resistant Staphylococcus aureus
(MRSA), E. coli, listeriosis, salmonella, gonococcus, streptococcus and staphylococcus.
Exemplary fungi having harmful expression products include, without limitation, Amanita arocheae,
Amanita bisporigera, Amanita exitialis, Amanita magnivelaris, Amanita ocreata, Amanita verna,
Clitocybe dealbata, Cortinarius gentilis, and Lepiota brunneoincarnata. Exemplary routes to harm include,
without limitation, ingestion, inhalation, skin contact, and sexual transmission. Exemplary
outcomes include, without limitation, fever, headache, nausea, dizziness, and diarrhea. Exemplary
`
`protein databases include US National Library of Medicine National Institutes of Health protein
`
`and gene databases. Exemplary levels of disease concern include low, medium, high, and extreme.
`
`[0036] Provided herein are methods for basic curation, such as identifying a sequence associated
`
with a query by organism name and/or taxon. Once identified, a sequence annotation may
`
`optionally be updated and, optionally, recategorized for a particular descriptive feature. Sequences
`
`identified are further available for downloading in a singular or batch format, optionally with
`
FASTA formatting.
`
`[0037] Data quality and public participation can both be concerns associated with publicly
`
available databases. To maximize immediate utility, the disclosed system may carry out an initial
`
`curation process adding many pathogenic proteins to the database in an attempt to include most
`
`
`potentially regulated sequences or other sequences known to be harmful. The system may curate an
`
`“unrestricted” list of NCBI GI identifiers corresponding to genes that may be considered harmless.
`
`That unrestricted list may be also open to curation.
`
[0038] A scheme of CAPTCHA may be used to prevent bot-driven curation and require user
`
`registration before creating or editing pages. GI identifiers may be periodically verified (for
`
`existence), and records may be tagged for human review on failure. Users can also flag records to
`
`request community or administrator review.
`
`[0039] The present disclosure provides for systems and methods that annotate and/or screen at least
`
`one biological sequence. In some instances, the biological sequence is a nucleic acid sequence. The
`
`nucleic acid sequence may comprise 1; 10; 100; 200; 300; 400; 500; 600; 700; 800; 900; 1,000;
`
`2,000; 5,000; 7,000; 10,000, or more nucleic acid residues. In some instances, the nucleic acid
`
`sequence comprises between 100 and 500 nucleic acid residues. In some instances, the nucleic acid
`
`sequence comprises between 50 and 1000 nucleic acid residues. In some instances, the nucleic acid
`
sequence comprises between 20 and 200 nucleic acid residues. In some instances, the nucleic acid
sequence comprises 200 residues. In some instances, the biological sequence may be DNA or
RNA. The biological sequence
may comprise adenine (A), cytosine (C), guanine (G), thymine (T), or uracil (U). In some
instances, the biological sequence is a protein sequence. The protein may comprise 1; 10; 100;
200; 300; 400; 500; 600; 700; 800; 900; 1,000; 2,000 or more amino acids. In some instances, the
protein sequence comprises between 100 and 300 amino acids. In some instances, the protein
sequence comprises between 50 and 500 amino acids. In some instances, the protein sequence
comprises between 10 and 200 amino acids. In some instances, the protein sequence comprises
60 amino acids. In some instances, nucleic acid fragments of no more than 2, 5, 10, 20, 50, 100, or
`
`200 residues are assembled in-silico into a nucleic acid sequence. In some instances, nucleic acid
`
`fragments are obtained from one or more sources, or one or more orders from the same source.
`
`Screening tool
`
`[0040] Constructing a screening system capable of determining whether a given sequence poses a
`
`biosecurity risk may include a degree of investment in time and expertise not available to all
`
synthetic biologists or even to all synthetic biology companies. Even assuming one has access to a
database of dangerous sequences, basic parameterization of an aligner and result processing
(including culling alignment counts to similar regions so as not to hide homology to shorter
`
`regions) may include domain expertise.
`
[0041] An illustrative workflow is provided in FIG. 3A. Referring to FIG. 3A, a processor receives
`
`a query file containing biological sequence information, and is also in communication with a
`
`
protein database having identified sequence information. A BLAST report is generated listing the
same and similar sequences identified as associated with the queried biological sequence, in part or
in whole. The BLAST report is then queried against databases containing sequence annotations identifying
sequences associated with harmful biological sequences (protein or nucleic acids), also referred to
`
`as “restricted” lists. A screen report is generated in the form of a user interface which summarizes
`
`the results of these processes.
`
`[0042] An illustrative logic workflow is provided in FIG. 3B. Referring to FIG. 3B, a data input
`
`source such as physical nucleic acid or protein material (which can be sequenced), a nucleic acid
`
`sequence (which can be translated into a protein sequence), or a protein sequence can be evaluated
`
using an algorithm which searches one or more databases to determine if it is on a restricted list.
Exemplary algorithms include, but are not limited to, BLAST, DIAMOND, Smith-Waterman, or
other algorithms for comparing sequence information. Sequences found to be on the restricted list
are further evaluated against an unrestricted list that comprises known false positives. If no false
positive is identified, the sequence is subjected to expert review. If the sequence is found to be
non-harmful, it is placed on the unrestricted list to prevent further identification of said sequence as a
`
`false positive. If the sequence is found to be harmful, an output alert is generated. In some
`
`instances, the non-harmful sequence is synthesized. In some instances, the sequence is modified to
`
`remove the harmful sequence. In some instances, the modified sequence is re-screened. In some
`
`instances, this process is repeated iteratively until a modified non-harmful sequence is found. In
`
some instances, the modified non-harmful sequence is synthesized.
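
The decision logic of FIG. 3B, as described in this paragraph, can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the restricted and unrestricted lists are modeled as plain sets of sequence identifiers, the database search (e.g., BLAST) that would decide restricted-list membership is not shown, and expert review is stood in for by a caller-supplied function. The names `Decision` and `screen_sequence` are hypothetical.

```python
from enum import Enum
from typing import Callable, Set


class Decision(Enum):
    CLEARED = "cleared"   # no alert; the sequence may proceed
    ALERT = "alert"       # an output alert is generated


def screen_sequence(seq_id: str,
                    restricted: Set[str],
                    unrestricted: Set[str],
                    expert_says_harmful: Callable[[str], bool]) -> Decision:
    """Apply the restricted list, unrestricted list, then expert review."""
    # Step 1: not on any restricted list, so there is nothing to flag.
    if seq_id not in restricted:
        return Decision.CLEARED
    # Step 2: a known false positive already recorded on the unrestricted list.
    if seq_id in unrestricted:
        return Decision.CLEARED
    # Step 3: no known false positive, so the hit goes to expert review.
    if expert_says_harmful(seq_id):
        return Decision.ALERT
    # Reviewed as non-harmful: remember it so it is not flagged again.
    unrestricted.add(seq_id)
    return Decision.CLEARED


restricted_ids = {"seq_a", "seq_b"}
known_false_positives: Set[str] = set()

print(screen_sequence("seq_c", restricted_ids, known_false_positives, lambda s: True))
print(screen_sequence("seq_a", restricted_ids, known_false_positives, lambda s: True))
print(screen_sequence("seq_b", restricted_ids, known_false_positives, lambda s: False))
print(sorted(known_false_positives))
```

Because a sequence cleared by review is added to the unrestricted set, a later screen of the same identifier returns without consulting the reviewer again.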
`
[0043] Referring to FIG. 4, a user interface displays restricted lists available for selection for the
screening process. Referring to FIG. 5, an illustrative user interface displays a “Submit a screen”
submission form. The form allows for selection of screening against open database(s), e.g., a
collection of publicly available information, or screening against a personal database, which may
be based on non-publicly available selection criteria. The submission form also allows for
selection of a biological sequence file for uploading.
`
[0044] Referring to FIG. 6, an illustrative user interface displays a summary of biosecurity screens
`
`conducted, with status information, sequences screened, review status, concern or no concern
`
`status, date of sequence addition, and a link to viewing the BLAST result. Referring to FIG. 7, an
`
illustrative user interface displays a summary of lists accessed during a screen, sequences screened,
`
`and harmful sequence (restricted) assignments for a sequence.
`
`[0045] The technologies disclosed herein may comprise a Python-based reference implementation
`
`of a screening system. Given a query nucleotide sequence, the system may compare the sequence
`
`

`

`
(e.g., via BLAST) to the set of protein sequences derived from the annotated collection produced
`
`by the interface discussed in the previous section.
`
`[0046] Results may be filtered by the degree of homology, E-score and alignment length. Passing
`
hits may be summarized by the distribution of tags associated with those sequences and the regions
`
`of the query found problematic. Links to the originating database entries may be provided so that
`
`users can follow-up in more detail. In compliance with pre-defined guidance, some examples show
`
`that the algorithm is 100% sensitive and reports can be downloaded for archival use. Screening
`
short (e.g., less than about 200 bases) sequences may result in a large number of false positive
`
`findings. Effective screening of shorter polynucleotide sequences may include an algorithmic
`
`approach.
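
The filtering step described above might look like the following Python sketch. It assumes the aligner output is in BLAST's default tabular layout (`-outfmt 6`), whose twelve tab-separated columns are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue and bitscore. The names `Hit`, `parse_tabular` and `passing_hits`, and the threshold values, are hypothetical; the disclosure does not state specific cutoffs.

```python
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    percent_identity: float
    alignment_length: int
    query_start: int
    query_end: int
    evalue: float


def parse_tabular(lines: Iterable[str]) -> List[Hit]:
    """Parse BLAST default tabular (-outfmt 6) lines into Hit records."""
    hits: List[Hit] = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        f = line.split("\t")
        hits.append(Hit(f[0], f[1], float(f[2]), int(f[3]),
                        int(f[6]), int(f[7]), float(f[10])))
    return hits


def passing_hits(hits: Iterable[Hit],
                 min_identity: float = 80.0,   # assumed cutoff, percent
                 max_evalue: float = 1e-5,     # assumed cutoff
                 min_length: int = 50) -> List[Hit]:
    """Keep hits that satisfy all three filters: homology, E-value, length."""
    return [h for h in hits
            if h.percent_identity >= min_identity
            and h.evalue <= max_evalue
            and h.alignment_length >= min_length]


# Two made-up report lines: one strong hit and one weak, short hit.
report = [
    "q1\tsubjA\t95.0\t120\t6\t0\t1\t120\t10\t129\t1e-50\t220",
    "q1\tsubjB\t45.0\t30\t15\t1\t5\t34\t2\t31\t0.5\t18",
]
for hit in passing_hits(parse_tabular(report)):
    # The query coordinates identify the region of the query found problematic.
    print(hit.subject, hit.query_start, hit.query_end)
```

The retained query coordinates are what a report would use to show which regions of the submitted sequence matched.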
`
[0047] The screening system may sit atop a database and include a RESTful application
programming interface (API) for screen request submission and result retrieval as well as a
`
`graphical user interface. The application may be installed and operate on a laptop computer, and
`
`scale reasonably well to high-throughput use via API calls.
`
`Cumulative Biological sequence or construct Screening
`
`[0048] It is possible to obtain fragments of biological sequences or constructs that when
`
individually screened will not result in identification of a harmful sequence, especially if the
biological sequences or constructs are obtained through multiple sources and at multiple time
points. In some instances, the source may be a customer. For example, a
substantial portion of the genome of any of the select agent-regulated bacteria or viruses may be
obtained in smaller pieces, and then assembled into a harmful biological sequence or construct. To
address this, in some instances a background process runs after each request is received, which queries a
database for all previous orders from that biological sequence or construct requesting source and
collects records of any segments with high homology to any harmful biological sequences or
constructs. This ensures evaluation and alerting even if those segments were insufficient to trigger
formal alerting or denial of possession during the individual order. In some instances, these
high-homology segments are represented as intervals on the genome of the select agent of concern and
`
`then the un
