`
`
`(10) International Publication Number
WO 2017/214574 A1

WIPO | PCT
`
`(19) World Intellectual Property
`Organization
`International Bureau
`
`(43) International Publication Date
`14 December 2017 (14.12.2017)
`
`(51) International Patent Classification:
`G06F 17/560 (2006.01)
G06F 19/28 (2011.01)
G06F 19/22 (2011.01)
C12N 15/10 (2006.01)
`
(21) International Application Number: PCT/US2017/036868

(22) International Filing Date: 09 June 2017 (09.06.2017)

(25) Filing Language: English

(26) Publication Language: English

(30) Priority Data:
62/348,786    10 June 2016 (10.06.2016)    US
62/375,858    16 August 2016 (16.08.2016)    US

(71) Applicant: TWIST BIOSCIENCE CORPORATION [US/US]; 455 Mission Bay Boulevard South, Suite 545, San Francisco, California 94158 (US).

(72) Inventor: DIGGANS, James; 1324 Cordilleras Avenue, San Carlos, California 94070 (US).

(74) Agent: HARBURGER, David; WILSON SONSINI GOODRICH & ROSATI, 650 Page Mill Road, Palo Alto, California 94304 (US).
`
(81) Designated States (unless otherwise indicated, for every kind of national protection available): AE, AG, AL, AM, AO, AT, AU, AZ, BA, BB, BG, BH, BN, BR, BW, BY, BZ, CA, CH, CL, CN, CO, CR, CU, CZ, DE, DJ, DK, DM, DO, DZ, EC, EE, EG, ES, FI, GB, GD, GE, GH, GM, GT, HN, HR, HU, ID, IL, IN, IR, IS, JO, JP, KE, KG, KH, KN, KP, KR, KW, KZ, LA, LC, LK, LR, LS, LU, LY, MA, MD, ME, MG, MK, MN, MW, MX, MY, MZ, NA, NG, NL, NO, NZ, OM, PA, PE, PG, PH, PL, PT, QA, RO, RS, RU, RW, SA, SC, SD, SE, SG, SK, SL, SM, ST, SV, SY, TH, TJ, TM, TN, TR, TT, TZ, UA, UG, US, UZ, VC, VN, ZA, ZM, ZW.
`
`(54) Title: SYSTEMS AND METHODS FOR AUTOMATED ANNOTATION AND SCREENING OF BIOLOGICAL SEQUENCES
`
`
`
(57) Abstract: The present disclosure describes software tools for effective biosecurity based on community knowledge and participation. Annotation tools described herein provide assistance to the synthetic biology community to track emerging science on the link between individual proteins and negative outcomes. Screening tools described herein enable the community to broaden both interest in and effective practice of biosecurity so that practitioners and biological sequence or construct providers are empowered to evaluate the safety of order requests rather than waiting until synthesis or even expression. In addition, screening tools described herein provide for screening of polynucleotides across the same or multiple orders for sequences associated with harmful biological sequences from a reference database.

[FIG. 3A: flow chart linking Query File, Protein database, Blast Report, Restricted lists, and Screen]
`
`
`
`
(84) Designated States (unless otherwise indicated, for every kind of regional protection available): ARIPO (BW, GH, GM, KE, LR, LS, MW, MZ, NA, RW, SD, SL, ST, SZ, TZ, UG, ZM, ZW), Eurasian (AM, AZ, BY, KG, KZ, RU, TJ, TM), European (AL, AT, BE, BG, CH, CY, CZ, DE, DK, EE, ES, FI, FR, GB, GR, HR, HU, IE, IS, IT, LT, LU, LV, MC, MK, MT, NL, NO, PL, PT, RO, RS, SE, SI, SK, SM, TR), OAPI (BF, BJ, CF, CG, CI, CM, GA, GN, GQ, GW, KM, ML, MR, NE, SN, TD, TG).
`
Declarations under Rule 4.17:

- as to applicant's entitlement to apply for and be granted a patent (Rule 4.17(ii))
- as to the applicant's entitlement to claim the priority of the earlier application (Rule 4.17(iii))

Published:

- with international search report (Art. 21(3))
`
`
`
`WO 2017/214574
`
`PCT/US2017/036868
`
SYSTEMS AND METHODS FOR AUTOMATED ANNOTATION AND SCREENING OF BIOLOGICAL SEQUENCES
`
`CROSS-REFERENCE
`
`[0001] This application claims the benefit of U.S. provisional patent application number
`
`62/348,786 filed on June 10, 2016 and U.S. provisional patent application number 62/375,858 filed
`
`on August 16, 2016, each of which is incorporated by reference in its entirety.
`
`BACKGROUND
`
[0002] The growth rate in our collective knowledge about individual proteins and biological systems capable of posing potential threats to public safety and/or the environment is tremendous. This knowledge, however, is widely distributed across diverse research communities, institutions and even journals. There is a lack of a centralized information source focused on annotating the potential for a given protein to cause harm and the context in which this harm can arise. Thus, new systems and methods are necessary to address the challenge.
`
BRIEF SUMMARY

[0003] Provided herein are computerized systems for providing enhanced polynucleotide synthesis comprising a server for hosting a database, wherein the database is adapted for representing a list of harmful biological sequences; a network connection; and a computer readable medium comprising instructions for a general purpose computer, wherein said computerized system is configured for operating in a method of: 1) receiving one or more design instructions, wherein the design instructions comprise a plurality of biological sequences, wherein each of the biological sequences is no more than 500 bases in length, and wherein the plurality of biological sequences comprise a nucleic acid or amino acid sequence; 2) automatically determining whether at least two biological sequences of the plurality of biological sequences collectively correspond to at least 20% of a harmful biological sequence in the database; and 3) automatically generating an alert if at least 20% of the harmful biological sequence is detected. Further provided herein are computerized systems further comprising wherein if no alert is generated, then one or more sequences are synthesized. Further provided herein are computerized systems further comprising receiving instructions for changing the at least two biological sequences of the plurality of biological sequences corresponding to at least 20% of the harmful biological sequence to remove the harmful biological sequence. Further provided herein are computerized systems wherein the plurality of received design instructions are received at one or more time points. Further provided herein are computerized systems wherein the plurality of received design instructions are from 3 or more different sources. Further provided herein are computerized systems wherein the plurality of received design instructions are from 5 or more different sources. Further provided herein are computerized systems wherein the plurality of received design instructions are from 10 or more different sources. Further provided herein are computerized systems wherein the one or more biological sequences are each no more than 200 bases in length. Further provided herein are computerized systems wherein the one or more biological sequences are each no more than 100 bases in length. Further provided herein are computerized systems wherein the one or more biological sequences are each no more than 50 bases in length. Further provided herein are computerized systems wherein the one or more biological sequences are each no more than 20 bases in length.
`
[0004] Provided herein are methods for providing enhanced polynucleotide synthesis comprising: 1) receiving one or more design instructions, wherein the design instructions comprise a plurality of biological sequences, wherein each of the biological sequences is no more than 500 bases in length, and wherein the plurality of biological sequences comprise a nucleic acid or amino acid sequence; 2) automatically determining whether at least two biological sequences of the plurality of biological sequences collectively correspond to at least 20% of a harmful biological sequence in a database; and 3) automatically generating an alert if at least 20% of the harmful biological sequence is detected. Further provided herein are methods further comprising wherein if no alert is generated, the one or more sequences are synthesized. Further provided herein are methods further comprising receiving instructions for changing the at least two biological sequences of the plurality of biological sequences corresponding to at least 20% of the harmful biological sequence to remove the harmful biological sequence.
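The collective-coverage determination of step 2) can be illustrated with a minimal sketch. It assumes exact substring matching in place of the alignment-based homology search an actual screen would use, and the function names (`find_all`, `merge_intervals`, `covered_fraction`, `needs_alert`) are hypothetical. Only the 20% threshold and the 500-base limit come from the text.

```python
from typing import Iterable, List, Tuple


def find_all(haystack: str, needle: str) -> List[int]:
    """Return every start index at which needle occurs in haystack."""
    starts, i = [], haystack.find(needle)
    while i != -1:
        starts.append(i)
        i = haystack.find(needle, i + 1)
    return starts


def merge_intervals(intervals: Iterable[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Merge overlapping or touching half-open intervals [start, end)."""
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_fraction(fragments: Iterable[str], harmful: str) -> float:
    """Fraction of `harmful` collectively covered by exact fragment matches."""
    if not harmful:
        raise ValueError("harmful reference sequence must be non-empty")
    intervals = [
        (start, start + len(frag))
        for frag in fragments
        if frag
        for start in find_all(harmful, frag)
    ]
    covered = sum(end - start for start, end in merge_intervals(intervals))
    return covered / len(harmful)


def needs_alert(fragments: Iterable[str], harmful: str,
                threshold: float = 0.20, max_len: int = 500) -> bool:
    """True if the fragments together cover at least `threshold` of `harmful`."""
    fragments = list(fragments)
    if any(len(f) > max_len for f in fragments):
        raise ValueError("fragment exceeds the maximum screened length")
    return covered_fraction(fragments, harmful) >= threshold
```

Merging the matched intervals before summing keeps overlapping fragments from being counted twice, which matters when the same region is ordered more than once.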
`
[0005] Provided herein are computerized systems for providing enhanced polynucleotide synthesis comprising a server for hosting a database, wherein the database is adapted for representing a list of sequences; a network connection; and a computer readable medium comprising instructions for a general purpose computer, wherein said computerized system is configured for operating in a method of: 1) receiving one or more design instructions, wherein the design instructions comprise a plurality of biological sequences, wherein the plurality of biological sequences is a vector sequence, and a plurality of additional insert sequences; 2) automatically determining whether the vector and at least one of the plurality of insert sequences collectively corresponds to at least 20% of a harmful biological sequence in the database; and 3) automatically generating an alert if at least 20% of the harmful biological sequence is detected. Further provided herein are computerized systems wherein the biological sequences are obtained from sequencing a physical nucleic acid sample. Further provided herein are computerized systems further comprising wherein if no alert is generated, the one or more biological sequences are synthesized. Further provided herein are computerized systems further comprising receiving instructions for changing the vector and the at least one of the plurality of insert sequences corresponding to at least 20% of the harmful biological sequence to remove the harmful biological sequence. Further provided herein are computerized systems for providing enhanced polynucleotide synthesis wherein the plurality of received design instructions are received at one or more time points. Further provided herein are computerized systems wherein the plurality of received design instructions are received from different sources. Further provided herein are computerized systems wherein the plurality of received design instructions are from 3 or more different sources. Further provided herein are computerized systems wherein the plurality of received design instructions are from 5 or more different sources. Further provided herein are computerized systems wherein the plurality of received design instructions are from 10 or more different sources. Further provided herein are computerized systems wherein the one or more biological sequences are each no more than 200 bases in length. Further provided herein are computerized systems wherein the one or more biological sequences are each no more than 100 bases in length. Further provided herein are computerized systems wherein the one or more biological sequences are each no more than 50 bases in length. Further provided herein are computerized systems wherein the one or more biological sequences are each no more than 20 bases in length.
`
[0006] Provided herein are methods for providing enhanced polynucleotide synthesis comprising: 1) receiving one or more design instructions, wherein the design instructions comprise a plurality of biological sequences, wherein the plurality of biological sequences is a vector sequence, and a plurality of additional insert sequences; 2) automatically determining whether the vector and at least one of the plurality of insert sequences collectively corresponds to at least 20% of a harmful biological sequence in the database; and 3) automatically generating an alert if at least 20% of the harmful biological sequence is detected. Further provided herein are methods wherein the biological sequences are obtained from sequencing a physical nucleic acid or protein sample. Further provided herein are methods further comprising wherein if no alert is generated, the one or more biological sequences are synthesized. Further provided herein are methods further comprising receiving instructions for changing the vector and the at least one of the plurality of insert sequences corresponding to at least 20% of the harmful biological sequence to remove the harmful biological sequence.
`
`INCORPORATION BY REFERENCE
`
[0007] All publications, patents, and patent applications mentioned in this specification are herein incorporated by reference to the same extent as if each individual publication, patent, or patent application was specifically and individually indicated to be incorporated by reference in their entirety.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
[0008] The technical features of the present disclosure are set forth with particularity in the appended claims. A better understanding of the features and advantages of the present disclosure will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the disclosure are utilized, and the accompanying drawings, which are described below.
`
[0009] FIG. 1 illustrates a user interface which includes a protein sequence and associated species, host, pathogen, route to harm, outcome and protein type information. Also included are a sequence accession number, a listing of identical proteins, links to a database with sequence records, and links to similar proteins.

[0010] FIG. 2 illustrates a user interface which includes a partial listing of protein variants and an exemplary protein, “Hemagglutinin Neuraminidase-Newcastle Disease virus.”

[0011] FIG. 3A depicts a flow chart including information from a query file, a protein database, a BLAST report, restricted lists (harmful sequence lists) and a screen report.

[0012] FIG. 3B depicts a flow chart which includes various forms of input (nucleic acid material, nucleic acid or protein sequence), decision making (restricted list, unrestricted list, expert review), and output (issuing alerts).

[0013] FIG. 4 illustrates a user interface which includes lists of databases for searching in a screen. Columns for role, type, name, description, date added and active state are included.

[0014] FIG. 5 illustrates a user interface which includes a sequence submission screen. Form entries for name, database, description and FASTA file, and a “Submit” button are included. The database form has a drop-down column that appears upon click with subcategories, including “Seqshield,” “nr” and “Personal Database.”

[0015] FIG. 6 illustrates a user interface which includes a summary of screening status.

[0016] FIG. 7 illustrates a user interface which includes a pull-down menu for selection of “Unreviewed,” “Of concern,” or “No concern” sequences screened.

[0017] FIG. 8 illustrates a computing system.

[0018] FIG. 9 illustrates a computer system.
`
`[0019] FIG. 10 is a block diagram illustrating an architecture of a computer system.
`
`[0020] FIG. 11 is a diagram demonstrating a network configured to incorporate a plurality of
`
`computer systems, a plurality of cell phones and personal data assistants, and Network Attached
`
`Storage (NAS).
`
[0021] FIG. 12 is a block diagram of a multiprocessor computer system using a shared virtual address memory space.
`
`DETAILED DESCRIPTION
`
[0022] With the rapid growth in design capability in synthetic biology, it is now possible to create large numbers of constructs often using a heavily mutated sequence that does not directly resemble the reference sequence from which it was originally derived. At the same time, scientific advances in the understanding of the processes behind pathogenicity (in a variety of hosts and biological contexts) are rapidly creating new knowledge of protein sequences that, in context-dependent ways, can cause harm to human beings, specific plants or animals, or to the environment more broadly.
`
[0023] Ethical, responsible synthetic biologists may unwittingly create constructs capable of causing harm, but be unable to predict or understand that capability prior to instantiating synthetic designs in living systems. As predicting function from primary sequence alone is not feasible, these scientists would be well-served by having access to 1) a repository of metadata on what sequences can cause harm along with regulatory status and 2) an effective screening system for checking DNA or protein sequences against that metadata and alerting the user to any potential concern. In addition, a screening system capable of addressing these needs must itself be amenable to automation so as to fit seamlessly into high-throughput design/build/test workflows. The present disclosure provides for software tools to address both the lack of publicly available gene-level metadata on pathogenicity and the lack of open source tools for effective screening.
`
`[0024] Definitions
`
[0025] While various embodiments have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous variations, changes, and substitutions may occur to those skilled in the art without departing from the devices, systems and methods disclosed herein. It should be understood that various alternatives to the embodiments described herein may be employed.
`
[0026] Unless otherwise defined, all technical terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. As used in this specification and the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise. Any reference to “or” herein is intended to encompass “and/or” unless otherwise stated.
`
[0027] Unless specifically stated or obvious from context, as used herein, the term "about" in reference to a number or range of numbers is understood to mean the stated number and numbers +/- 10% thereof, or 10% below the lower listed limit and 10% above the higher listed limit for the values listed for a range.
`
`Sequence annotation
`
[0028] Knowledge about the capacity of any single sequence to cause some type of harm may be extremely distributed. Individual communities of researchers focus on widely varying aspects of pathogenicity including the ability of organisms to infiltrate host cells, hijack host cellular machinery, hide from the host immune system and even to enhance the host immune response. Exemplary harmful biological sequences include those that encode for a pathogenic sequence, such as those which are harmful and from viral, bacterial, or parasitic origins. Harmful biological sequences may include mutant forms of wild-type sequences which are known to have pathogenic effects. Harmful biological sequences include sequences that produce harmful sequence products after transcription or translation, or act as precursors to harmful sequence products. Harmful biological sequences include sequences that encode for harmful proteins.
`
[0029] Among other facets, the present disclosure provides for a MediaWiki-based user interface that allows a user to submit sequences along with tag-based annotation of roles in pathogenicity. Users may be encouraged to submit several tags for each sequence to describe the general patterns of harm associated with a given sequence modeled as:
`
`Host + Context = Outcome + Level of Concern
`
[0030] The present system may take a tag-based approach so as not to impose a single controlled vocabulary a priori. The collection of tags resulting from community annotation could form the basis of such a controlled vocabulary over the longer term.

[0031] As each sequence is uploaded, users may be asked to add tags in each of four categories. Tags for ‘Host’ and ‘Level of Concern’ are mandatory; tags for ‘Context’ and ‘Outcome’ are optional given the additional complexity and domain knowledge required.
`
[0032] As an example, a sequence encoding the toxin ricin might be tagged by a user as:

Host = [illegible]
Context = ingestion, inhalation
Outcome = fever, cough, respiratory failure, death
Level of Concern = Extreme
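The four-category tag model can be sketched as a small record type that enforces the mandatory ‘Host’ and ‘Level of Concern’ tags. This is an illustrative assumption about how such a record might look, not the disclosed implementation: the class name `Annotation`, its field names, and the `"human"` host value in the usage lines are hypothetical, while the four concern levels are those listed in the description.

```python
from dataclasses import dataclass, field
from typing import List

# Levels named in the description: low, medium, high, and extreme.
LEVELS_OF_CONCERN = ("low", "medium", "high", "extreme")


@dataclass
class Annotation:
    """Tag set for one sequence: Host + Context = Outcome + Level of Concern."""
    sequence_id: str
    host: List[str]                                    # mandatory
    level_of_concern: str                              # mandatory
    context: List[str] = field(default_factory=list)   # optional
    outcome: List[str] = field(default_factory=list)   # optional

    def __post_init__(self) -> None:
        if not self.host:
            raise ValueError("at least one 'Host' tag is required")
        level = self.level_of_concern.strip().lower()
        if level not in LEVELS_OF_CONCERN:
            raise ValueError(f"unknown level of concern: {self.level_of_concern!r}")
        self.level_of_concern = level  # store in normalized form


# Usage: the host value below is a placeholder chosen for illustration only.
example = Annotation(
    sequence_id="ricin-example",
    host=["human"],
    level_of_concern="Extreme",
    context=["ingestion", "inhalation"],
    outcome=["fever", "cough", "respiratory failure", "death"],
)
```

Free-text tag lists are kept for host, context and outcome so that no controlled vocabulary is imposed; only the concern level is checked against a fixed set.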
`
[0033] The goal is accumulation of metadata over time more than universal completeness. The system is centrally hosted and offers the entire set of curated sequences (or subsets based on queries by tag) for download as FASTA for use in screening.
`
[0034] Provided herein are methods for sequence annotation wherein a database receives a listing of characteristics associated with a biological sequence or biological construct (e.g., nucleotide sequence or protein sequence). Exemplary characteristics include, without limitation: nucleic acid sequence, protein sequence, protein name, strain source, link to sequence database (e.g., NCBI), sequence database accession number, identical sequences (protein or nucleic acid), similar sequences (protein or nucleic acid), disease type (e.g., virus, bacterium, or fungi), host information (e.g., humans, mammals, birds, insects), context or route of harmful interaction (e.g., ingestion, inhalation), and level of concern. Also provided herein is a user interface which presents each characteristic or a link to additional information of such characteristics. See FIG. 1. In some cases, viral sequences for a particular strain are selected. For example, FIG. 2 illustrates a portion of 679 available strains of Hemagglutinin Neuraminidase-Newcastle Disease virus for annotation.
`
[0035] Exemplary species include animal species. “Animals” as used herein includes, without limitation, mammals, marsupials, birds, insects, arthropods, amphibians and reptiles. Exemplary mammals include, without limitation, sheep, cattle, goats, pigs, rabbits, hares, deer, mice, rats, bats, and possums, and the like. Exemplary disease types include pathogens from the following classes: viruses, bacteria, fungi and other harmful pathogens. Exemplary viruses having harmful expression products include, without limitation, Marburg virus, Ebola virus, Hantavirus, bird flu (e.g., H5N1 strain), Lassa virus, Junin virus, Crimea-Congo fever, Machupo virus, Kyasanur Forest Virus, Dengue fever, and Chikungunya virus. Exemplary bacteria having harmful expression products include, without limitation, Multi-Resistant Staphylococcus aureus (MRSA), E. coli, listeriosis, salmonella, gonococcus, streptococcus and staphylococcus. Exemplary fungi having harmful expression products include, without limitation, Amanita arocheae, Amanita bisporigera, Amanita exitialis, Amanita magnivelaris, Amanita ocreata, Amanita verna, Clitocybe dealbata, Cortinarius gentilis, and Lepiota brunneoincarnata. Exemplary routes to harm include, without limitation, ingestion, inhalation, skin contact, and sexual transmission. Exemplary outcomes include, without limitation, fever, headache, nausea, dizziness, and diarrhea. Exemplary protein databases include US National Library of Medicine National Institutes of Health protein and gene databases. Exemplary levels of disease concern include low, medium, high, and extreme.
`
[0036] Provided herein are methods for basic curation, such as identifying a sequence associated with a query by organism name and/or taxon. Once identified, a sequence annotation may optionally be updated and, optionally, recategorized for a particular descriptive feature. Sequences identified are further available for downloading in a singular or batch format, optionally with FASTA formatting.
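A minimal sketch of the curation query and batch download follows. It assumes records are held as plain dictionaries with hypothetical keys (`id`, `organism`, `taxon`, `sequence`); the functions `find_records` and `to_fasta` are illustrative names, and the sequences shown are arbitrary placeholders rather than real database entries.

```python
from typing import Dict, Iterable, List, Optional

Record = Dict[str, str]


def find_records(records: Iterable[Record],
                 organism: Optional[str] = None,
                 taxon: Optional[str] = None) -> List[Record]:
    """Return records matching the organism name and/or taxon (case-insensitive)."""
    def matches(rec: Record) -> bool:
        if organism is not None and rec["organism"].lower() != organism.lower():
            return False
        if taxon is not None and rec["taxon"].lower() != taxon.lower():
            return False
        return True
    return [rec for rec in records if matches(rec)]


def to_fasta(records: Iterable[Record], width: int = 60) -> str:
    """Serialize records as FASTA: a '>' header line, then wrapped sequence lines."""
    lines: List[str] = []
    for rec in records:
        lines.append(f">{rec['id']} {rec['organism']}")
        seq = rec["sequence"]
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "".join(line + "\n" for line in lines)


# Usage with placeholder data (identifiers and sequences are made up).
catalog = [
    {"id": "X1", "organism": "Example virus", "taxon": "virus", "sequence": "MKTAYIAKQR"},
    {"id": "X2", "organism": "Example bacterium", "taxon": "bacterium", "sequence": "MSDNELLK"},
]
viral_fasta = to_fasta(find_records(catalog, taxon="virus"), width=4)
```

Passing a single record gives the singular download, and passing a filtered list gives the batch form.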
`
[0037] Data quality and public participation can both be concerns associated with publicly available databases. To maximize immediate utility, the disclosed system may carry out an initial curation process adding many pathogenic proteins to the database in an attempt to include most potentially regulated sequences or other sequences known to be harmful. The system may curate an “unrestricted” list of NCBI GI identifiers corresponding to genes that may be considered harmless. That unrestricted list may also be open to curation.
`
[0038] A scheme of CAPTCHA may be used to prevent bot-driven curation and require user registration before creating or editing pages. GI identifiers may be periodically verified (for existence), and records may be tagged for human review on failure. Users can also flag records to request community or administrator review.
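The periodic verification step can be sketched as a pass over the stored records that tags any record whose identifier no longer resolves. The lookup itself is left as an injected callable, `exists`, because the text does not say how existence is checked; that callable, the record keys (`gi`, `needs_review`), and the function name are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List


def verify_identifiers(records: Iterable[Dict[str, object]],
                       exists: Callable[[str], bool]) -> List[str]:
    """Tag records whose GI identifier fails the existence check.

    Returns the identifiers that were tagged for human review.
    """
    flagged: List[str] = []
    for rec in records:
        gi = str(rec["gi"])
        if not exists(gi):
            rec["needs_review"] = True  # picked up later by a human reviewer
            flagged.append(gi)
    return flagged


# Usage: a set of known identifiers stands in for a real database lookup.
known = {"111", "222"}
stored = [{"gi": "111"}, {"gi": "999"}]
to_review = verify_identifiers(stored, lambda gi: gi in known)
```

Keeping the lookup behind a callable lets the same pass run against a local mirror in tests and a remote service in production.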
`
[0039] The present disclosure provides for systems and methods that annotate and/or screen at least one biological sequence. In some instances, the biological sequence is a nucleic acid sequence. The nucleic acid sequence may comprise 1; 10; 100; 200; 300; 400; 500; 600; 700; 800; 900; 1,000; 2,000; 5,000; 7,000; 10,000, or more nucleic acid residues. In some instances, the nucleic acid sequence comprises between 100 and 500 nucleic acid residues. In some instances, the nucleic acid sequence comprises between 50 and 1000 nucleic acid residues. In some instances, the nucleic acid sequence comprises between 20 and 200 nucleic acid residues. In some instances, the nucleic acid sequence comprises 200 residues. In some instances, the biological sequence may be DNA or RNA. The biological sequence may comprise adenine (A), cytosine (C), guanine (G), thymine (T), or uracil (U). In some instances, the biological sequence is a protein sequence. The protein may comprise 1; 10; 100; 200; 300; 400; 500; 600; 700; 800; 900; 1,000; 2,000 or more amino acids. In some instances, the protein sequence comprises between 100 and 300 amino acids. In some instances, the protein sequence comprises between 50 and 500 amino acids. In some instances, the protein sequence comprises between 10 and 200 amino acids. In some instances, the protein sequence comprises 60 amino acids. In some instances, nucleic acid fragments of no more than 2, 5, 10, 20, 50, 100, or 200 residues are assembled in silico into a nucleic acid sequence. In some instances, nucleic acid fragments are obtained from one or more sources, or one or more orders from the same source.
`
`Screening tool
`
[0040] Constructing a screening system capable of determining whether a given sequence poses a biosecurity risk may require a degree of investment in time and expertise not available to all synthetic biologists or even to all synthetic biology companies. Even assuming one has access to a database of dangerous sequences, basic parameterization of an aligner and result processing (including culling alignment counts to similar regions so as not to hide homology to shorter regions) may require domain expertise.
`
[0041] An illustrative workflow is provided in FIG. 3A. Referring to FIG. 3A, a processor receives a query file containing biological sequence information, and is also in communication with a protein database having identified sequence information. A BLAST report is generated listing the same and similar sequences identified as associated with the queried biological sequence, in part or in whole. The BLAST report is then queried against databases containing sequence annotations identifying sequences associated with harmful biological sequences (protein or nucleic acids), also referred to as “restricted” lists. A screen report is generated in the form of a user interface which summarizes the results of these processes.
`
[0042] An illustrative logic workflow is provided in FIG. 3B. Referring to FIG. 3B, a data input source such as physical nucleic acid or protein material (which can be sequenced), a nucleic acid sequence (which can be translated into a protein sequence), or a protein sequence can be evaluated using an algorithm which searches one or more databases to determine if it is on a restricted list. Exemplary algorithms include, but are not limited to, BLAST, DIAMOND, Smith-Waterman, or other algorithms for comparing sequence information. Sequences found to be on the restricted list are further evaluated against an unrestricted list that comprises known false positives. If no false positive is identified, the sequence is subjected to expert review. If the sequence is found to be non-harmful, it is placed on the unrestricted list to prevent further identification of said sequence as a false positive. If the sequence is found to be harmful, an output alert is generated. In some instances, the non-harmful sequence is synthesized. In some instances, the sequence is modified to remove the harmful sequence. In some instances, the modified sequence is re-screened. In some instances, this process is repeated iteratively until a modified non-harmful sequence is found. In some instances, the modified non-harmful sequence is synthesized.
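The decision logic of FIG. 3B can be sketched as a short triage function. This is a simplified reading of the workflow under stated assumptions: list membership is modeled as set membership on a sequence identifier (the homology search that produces restricted-list hits is out of scope here), expert review is an injected callable, and the names `Decision`, `triage` and `expert_says_harmful` are hypothetical.

```python
from enum import Enum
from typing import Callable, Set


class Decision(Enum):
    CLEAR = "clear"  # no restricted hit, known false positive, or cleared by review
    ALERT = "alert"  # expert review judged the sequence harmful


def triage(seq_id: str,
           restricted_hits: Set[str],
           unrestricted: Set[str],
           expert_says_harmful: Callable[[str], bool]) -> Decision:
    """Apply the restricted list, then the unrestricted list, then expert review."""
    if seq_id not in restricted_hits:
        return Decision.CLEAR
    if seq_id in unrestricted:
        return Decision.CLEAR          # already recorded as a false positive
    if expert_says_harmful(seq_id):
        return Decision.ALERT
    unrestricted.add(seq_id)           # remember the false positive for next time
    return Decision.CLEAR
```

Because a cleared sequence is added to the unrestricted set, a repeat submission of the same sequence skips expert review, which is the stated purpose of that list.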
`
[0043] Referring to FIG. 4, a user interface displays restricted lists available for selection for the screening process. Referring to FIG. 5, an illustrative user interface displays a “Submit a screen” submission form. The form allows for selection of screening against open database(s), e.g., a collection of publicly available information, or screening against a personal database, which may be based on non-publicly available selection criteria. The submission form also allows for selection of a biological sequence file for uploading.
`
[0044] Referring to FIG. 6, an illustrative user interface displays a summary of biosecurity screens conducted, with status information, sequences screened, review status, concern or no concern status, date of sequence addition, and a link to viewing the BLAST result. Referring to FIG. 7, an illustrative user interface displays a summary of lists accessed during a screen, sequences screened, and harmful sequence (restricted) assignments for a sequence.
`
[0045] The technologies disclosed herein may comprise a Python-based reference implementation of a screening system. Given a query nucleotide sequence, the system may compare the sequence (e.g., via BLAST) to the set of protein sequences derived from the annotated collection produced by the interface discussed in the previous section.
`
[0046] Results may be filtered by the degree of homology, E-score and alignment length. Passing hits may be summarized by the distribution of tags associated with those sequences and the regions of the query found problematic. Links to the originating database entries may be provided so that users can follow up in more detail. In compliance with pre-defined guidance, some examples show that the algorithm is 100% sensitive, and reports can be downloaded for archival use. Screening short (e.g., less than about 200 bases) sequences may result in a large number of false positive findings. Effective screening of shorter polynucleotide sequences may include an algorithmic approach.
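The filtering and summarization steps can be sketched as follows. The `Hit` record and its field names are hypothetical, and the default cut-offs (80% identity, E-score of 1e-5, 50 aligned positions) are placeholders chosen only for illustration; the text does not specify any thresholds.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Hit:
    subject_id: str
    percent_identity: float   # degree of homology, 0-100
    evalue: float             # E-score reported by the aligner
    align_len: int            # alignment length
    query_start: int
    query_end: int
    tags: Tuple[str, ...] = ()


def passing_hits(hits: Iterable[Hit],
                 min_identity: float = 80.0,
                 max_evalue: float = 1e-5,
                 min_align_len: int = 50) -> List[Hit]:
    """Keep hits that meet all three cut-offs (placeholder defaults)."""
    return [
        h for h in hits
        if h.percent_identity >= min_identity
        and h.evalue <= max_evalue
        and h.align_len >= min_align_len
    ]


def summarize(hits: Iterable[Hit]) -> Tuple[Counter, List[Tuple[int, int]]]:
    """Return the tag distribution and the sorted, de-duplicated query regions."""
    hits = list(hits)
    tag_counts = Counter(tag for h in hits for tag in h.tags)
    regions = sorted({(h.query_start, h.query_end) for h in hits})
    return tag_counts, regions
```

A screen report could then list `tag_counts.most_common()` alongside the flagged query regions, with each `subject_id` linked back to its database entry.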
`
[0047] The screening system may sit atop a database and include a RESTful application programming interface (API) for screen request submission and result retrieval as well as a graphical user interface. The application may be installed and operate on a laptop computer, and scale reasonably well to high-throughput use via API calls.
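As a rough illustration of submission and retrieval over a RESTful API, the sketch below only builds the HTTP requests using Python's standard library and does not send them. The base URL, the `/screens` paths and the JSON field names are assumptions made for this example; the text does not define the actual routes or payloads. The four fields mirror the submission form entries (name, database, description, FASTA file).

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/api"  # hypothetical address of a local install


def build_screen_request(name: str, database: str, fasta_text: str,
                         description: str = "") -> urllib.request.Request:
    """Build (but do not send) a POST request that submits a screen."""
    payload = {
        "name": name,
        "database": database,
        "description": description,
        "fasta": fasta_text,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/screens",                 # hypothetical route
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_result_request(screen_id: str) -> urllib.request.Request:
    """Build (but do not send) a GET request that retrieves a screen result."""
    return urllib.request.Request(
        url=f"{BASE_URL}/screens/{screen_id}",     # hypothetical route
        method="GET",
    )
```

In a high-throughput workflow each request would be passed to `urllib.request.urlopen` (or an equivalent HTTP client) from an automated design/build/test pipeline.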
`
`Cumulative Biological sequence or construct Screening
`
[0048] It is possible to obtain fragments of biological sequences or constructs that, when individually screened, will not result in identification of a harmful sequence, especially if the biological sequences or constructs are obtained through multiple sources and at multiple time points. In some instances, the source may be a customer. For example, a substantial portion of the genome of any of the select agent-regulated bacteria or viruses may be obtained in smaller pieces and then assembled into a harmful biological sequence or construct. To address this, in some instances a background process runs after each request is received, querying a database for all previous orders from the source requesting that biological sequence or construct and collecting records of any segments with high homology to any harmful biological sequences or constructs. This ensures evaluation and alerting even if those segments were insufficient to trigger formal alerting or denial of possession during the individual order. In some instances, these high-homology segments are represented as intervals on the genome of the select agent of concern and then the un