`WO 2017/214574
`[0001] This application claims the benefit of U.S. provisional patent application number
`62/348,786 filed on June 10, 2016 and U.S. provisional patent application number 62/375,858 filed
`on August 16, 2016, each of which is incorporated by reference in its entirety.
`[0002] The growth rate in our collective knowledge about individual proteins and biological
`systems capable of posing potential threats to public safety and/or the environment is tremendous.
`This knowledge, however, is widely distributed across diverse research communities, institutions
`and even journals. There is no centralized information source focused on annotating the
`potential for a given protein to cause harm and in what context this harm can arise. Thus, new
`systems and methods are necessary to address the challenge.
`[0003] Provided herein are computerized systems for providing enhanced polynucleotide synthesis
`comprising a server for hosting a database, wherein the database is adapted for representing a list of
`harmful biological sequences, a network connection, and a computer readable medium comprising
`instructions for a general purpose computer, wherein said computerized system is configured for
`operating in a method of: 1) receiving one or more design instructions, wherein the design
`instructions comprise a plurality of biological sequences, wherein each of the biological sequences
`is no more than 500 bases in length, and wherein the plurality of biological sequences comprise a
`nucleic acid or amino acid sequence; 2) automatically determining whether at least two biological
`sequences of the plurality of biological sequences collectively correspond to at least 20% of a
`harmful biological sequence in the database; and 3) automatically generating an alert if at least 20%
`of the harmful biological sequence is detected. Further provided herein are computerized systems
`further comprising wherein if no alert is generated, then one or more sequences are synthesized.
`Further provided herein are computerized systems further comprising receiving instructions for
`changing the at least two biological sequences of the plurality of biological sequences
`corresponding to at least 20% of the harmful biological sequence to remove the harmful biological
`sequence. Further provided herein are computerized systems wherein the plurality of received
`design instructions are received at one or more time points. Further provided herein are
`computerized systems wherein the plurality of received design instructions are from 3 or more
`different sources. Further provided herein are computerized systems wherein the plurality of
`received design instructions are from 5 or more different sources. Further provided herein are
`computerized systems wherein the plurality of received design instructions are from 10 or more
`different sources. Further provided herein are computerized systems wherein the one or more
`biological sequences are each no more than 200 bases in length. Further provided herein are
`computerized systems wherein the one or more biological sequences are each no more than 100
`bases in length. Further provided herein are computerized systems wherein the one or more
`biological sequences are each no more than 50 bases in length. Further provided herein are
`computerized systems wherein the one or more biological sequences are each no more than 20
`bases in length.
`[0004] Provided herein are methods for providing enhanced polynucleotide synthesis comprising:
`1) receiving one or more design instructions, wherein the design instructions comprise a plurality of
`biological sequences, wherein each of the biological sequences is no more than 500 bases in length,
`and wherein the plurality of biological sequences comprise a nucleic acid or amino acid sequence;
`2) automatically determining whether at least two biological sequences of the plurality of biological
`sequences collectively correspond to at least 20% of a harmful biological sequence in a database;
`and 3) automatically generating an alert if at least 20% of the harmful biological sequence is
`detected. Further provided herein are methods further comprising wherein if no alert is generated,
`the one or more sequences are synthesized. Further provided herein are methods further
`comprising receiving instructions for changing the at least two biological sequences of the plurality
`of biological sequences corresponding to at least 20% of the harmful biological sequence to remove
`the harmful biological sequence.
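The collective-coverage test of step 2) can be illustrated with a minimal Python sketch. It assumes that each short sequence has already been aligned to a harmful reference by an external aligner and is represented only by the half-open interval it covers on that reference. The function names, the interval representation and the example numbers are hypothetical and are not taken from this disclosure; only the 20% threshold comes from the text above.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on the harmful reference


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or adjacent intervals so shared bases count once."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def collective_coverage(intervals: Iterable[Interval], reference_length: int) -> float:
    """Fraction of the reference covered by all fragments taken together."""
    if reference_length <= 0:
        raise ValueError("reference_length must be positive")
    covered = sum(end - start for start, end in merge_intervals(intervals))
    return covered / reference_length


def should_alert(intervals: List[Interval], reference_length: int,
                 threshold: float = 0.20) -> bool:
    """Alert when at least two fragments collectively reach the threshold."""
    return (len(intervals) >= 2
            and collective_coverage(intervals, reference_length) >= threshold)


# Hypothetical example: three short fragments mapped onto a 1,000-base reference.
fragments = [(0, 90), (60, 150), (400, 470)]
print(collective_coverage(fragments, 1000))
print(should_alert(fragments, 1000))
```

In this example no single fragment covers more than 9% of the reference, yet the merged intervals cover 22% of it, so the combined order would trigger an alert under the assumed representation.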
`[0005] Provided herein are computerized systems for providing enhanced polynucleotide synthesis
`comprising a server for hosting a database, wherein the database is adapted for representing a list of
`sequences; a network connection; and a computer readable medium comprising instructions for a
`general purpose computer, wherein said computerized system is configured for operating in a
`method of: 1) receiving one or more design instructions, wherein the design instructions comprise a
`plurality of biological sequences, wherein the plurality of biological sequences is a vector
`sequence, and a plurality of additional insert sequences, 2) automatically determining whether the
`vector and at least one of the plurality of insert sequences collectively corresponds to at least 20%
`of a harmful biological sequence in the database; and 3) automatically generating an alert if at least
`20% of the harmful biological sequence is detected. Further provided herein are computerized
`systems wherein the biological sequences are obtained from sequencing a physical nucleic acid
`sample. Further provided herein are computerized systems further comprising wherein if no alert is
`generated, the one or more biological sequences are synthesized. Further provided herein are
`computerized systems further comprising receiving instructions for changing the vector and the at
`least one of the plurality of insert sequences corresponding to at least 20% of the harmful biological
`sequence to remove the harmful biological sequence. Further provided herein are computerized
`systems for providing enhanced polynucleotide synthesis wherein the plurality of received design
`instructions are received at one or more time points. Further provided herein are computerized
`systems wherein the plurality of received design instructions are received from different sources.
`Further provided herein are computerized systems wherein the plurality of received design
`instructions are from 3 or more different sources. Further provided herein are computerized systems
`wherein the plurality of received design instructions are from 5 or more different sources. Further
`provided herein are computerized systems wherein the plurality of received design instructions are
`from 10 or more different sources. Further provided herein are computerized systems wherein the
`one or more biological sequences are each no more than 200 bases in length. Further provided
`herein are computerized systems wherein the one or more biological sequences are each no more
`than 100 bases in length. Further provided herein are computerized systems wherein the one or
`more biological sequences are each no more than 50 bases in length. Further provided herein are
`computerized systems wherein the one or more biological sequences are each no more than 20
`bases in length.
`[0006] Provided herein are methods for providing enhanced polynucleotide synthesis comprising:
`1) receiving one or more design instructions, wherein the design instructions comprise a plurality of
`biological sequences, wherein the plurality of biological sequences is a vector sequence, and a
`plurality of additional insert sequences, 2) automatically determining whether the vector and at
`least one of the plurality of insert sequences collectively corresponds to at least 20% of a harmful
`biological sequence in the database; and
`3) automatically generating an alert if at least 20% of the harmful biological sequence is detected.
`Further provided herein are methods wherein the biological sequences are obtained from
`sequencing a physical nucleic acid or protein sample. Further provided herein are methods further
`comprising wherein if no alert is generated, the one or more biological sequences are synthesized.
`Further provided herein are methods receiving instructions for changing the vector and the at least
`one of the plurality of insert sequences corresponding to at least 20% of the harmful biological
`sequence to remove the harmful biological sequence.
`[0007] All publications, patents, and patent applications mentioned in this specification are herein
`incorporated by reference to the same extent as if each individual publication, patent, or patent
`application was specifically and individually indicated to be incorporated by reference in their entirety.
`[0008] The technical features of the present disclosure are set forth with particularity in the
`appended claims. A better understanding of the features and advantages of the present disclosure
`will be obtained by reference to the following detailed description that sets forth illustrative
`embodiments, in which the principles of the disclosure are utilized, and the accompanying
`drawings, which are described below.
`[0009] FIG. 1 illustrates a user interface which includes a protein sequence and associated species,
`host, pathogen, route to harm, outcome and protein type information. Also included are sequence
`accession number, a listing of identical proteins, links to a database with sequence records, and
`links to similar proteins.
`[0010] FIG. 2 illustrates a user interface which includes a partial listing of protein variants and an
`exemplary protein, “Hemagglutinin Neuraminidase-Newcastle Disease virus.”
`[0011] FIG. 3A depicts a flow chart including information from a query file, a protein database, a
`blast report, restricted lists (harmful sequence lists) and screen report.
`[0012] FIG. 3B depicts a flow chart which includes various forms of input (nucleic acid material,
`nucleic acid or protein sequence), decision making (restricted list, unrestricted list, expert review),
`and output (issuing alerts).
`[0013] FIG. 4 illustrates a user interface which includes lists of databases for searching in a screen.
`Columns for role, type, name, description, date added and active state are included.
`[0014] FIG. 5 illustrates a user interface which includes a sequence submission screen. Form
`entries for name, database, description and FASTA file, and a “Submit” button are included. The
`database form has a drop-down column that appears upon click with subcategories, including
`“Seqshield,” “nr” and “Personal Database.”
`[0015] FIG. 6 illustrates a user interface which includes a summary of screening status.
`[0016] FIG. 7 illustrates a user interface which includes a pull-down menu for selection of
`“Unreviewed,” “Of concern,” or “No concern” sequences screened.
`[0017] FIG. 8 illustrates a computing system.
`[0018] FIG. 9 illustrates a computer system.
`[0019] FIG. 10 is a block diagram illustrating an architecture of a computer system.
`[0020] FIG. 11 is a diagram demonstrating a network configured to incorporate a plurality of
`computer systems, a plurality of cell phones and personal data assistants, and Network Attached
`Storage (NAS).
`[0021] FIG. 12 is a block diagram of a multiprocessor computer system using a shared virtual
`address memory space.
`[0022] With the rapid growth in design capability in synthetic biology, it is now possible to create
`large numbers of constructs often using a heavily mutated sequence that does not directly resemble
`the reference sequence from which it was originally derived. At the same time, scientific advances
`in the understanding of the processes behind pathogenicity (in a variety of hosts and biological
`contexts) are rapidly creating new knowledge of protein sequences that, in context-dependent ways,
`can cause harm to human beings, specific plants or animals, or to the environment more broadly.
`[0023] Ethical, responsible synthetic biologists may unwittingly create constructs capable of
`causing harm, but be unable to predict or understand that capability prior to instantiating synthetic
`designs in living systems. As predicting function from primary sequence alone is not feasible, these
`scientists would be well-served by having access to 1) a repository of metadata on what sequences
`can cause harm along with regulatory status and 2) an effective screening system for checking
`DNA or protein sequences against that metadata and alerting the user to any potential concern. In
`addition, a screening system capable of addressing these needs must itself be amenable to
`automation so as to fit seamlessly into high-throughput design/build/test workflows. The present
`disclosure provides for software tools to address both the lack of publicly available gene-level
`metadata on pathogenicity as well as the lack of open source tools for effective screening.
`[0024] Definitions
`[0025] While various embodiments have been shown and described herein, it will be obvious to
`those skilled in the art that such embodiments are provided by way of example only. Numerous
`variations, changes, and substitutions may occur to those skilled in the art without departing from
`devices, systems and methods disclosed herein. It should be understood that various alternatives to
`the embodiments described herein may be employed.
`[0026] Unless otherwise defined, all technical terms used herein have the same meaning as
`commonly understood by one of ordinary skill in the art to which this disclosure belongs. As used
`in this specification and the appended claims, the singular forms “a,” “an,” and “the” include plural
`references unless the context clearly dictates otherwise. Any reference to “or” herein is intended to
`encompass “and/or” unless otherwise stated.
`[0027] Unless specifically stated or obvious from context, as used herein, the term "about" in
`reference to a number or range of numbers is understood to mean the stated number and numbers
`+/- 10% thereof, or 10% below the lower listed limit and 10% above the higher listed limit for the
`values listed for a range.
`Sequence annotation
`[0028] Knowledge about the capacity of any single sequence to cause some type of harm may be
`extremely distributed. Individual communities of researchers focus on widely varying aspects of
`pathogenicity including the ability of organisms to infiltrate host cells, hijack host cellular
`machinery, hide from the host immune system and even to enhance the host immune response.
`Exemplary harmful biological sequences include those that encode for a pathogenic sequence, such
`as those which are harmful and from viral, bacterial, or parasitic origins. Harmful biological
`sequences may include mutant forms of wild-type sequences which are known to have pathogenic
`effects. Harmful biological sequences include sequences that produce harmful sequence products
`after transcription or translation, or act as precursors to harmful sequence products. Harmful
`biological sequences include sequences that encode for harmful proteins.
`[0029] Among other facets, the present disclosure provides for a Mediawiki-based user interface
`that allows a user to submit sequences along with tag-based annotation of roles in pathogenicity.
`Users may be encouraged to submit several tags for each sequence to describe the general patterns
`of harm associated with a given sequence modeled as:
`Host + Context : Outcome + Level of Concern
`[0030] The present system may take a tag-based approach so as not a priori to impose a single
`controlled vocabulary. The collection of tags resulting from community annotation could form the
`basis of such a controlled vocabulary over the longer term.
`[0031] As each sequence is uploaded, users may be asked to add tags in each of four categories.
`Tagging ‘Host’ and ‘Level of Concern’ is mandatory; adding tags for ‘Context’ and ‘Outcome’
`is optional given the additional complexity and domain knowledge required.
`[0032] As an example, a sequence encoding the toxin ricin might be tagged by a user as:
`Context: ingestion, inhalation
`Outcome: fever, cough, respiratory failure, death
`Level of Concern:
`[0033] The goal is accumulation of metadata over time more than universal completeness. The
`system is centrally hosted and offers the entire set of curated sequences (or subsets based on queries
`by tag) for download as FASTA for use in screening.
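The tag model described above (mandatory ‘Host’ and ‘Level of Concern’ tags, optional ‘Context’ and ‘Outcome’ tags, and FASTA download of subsets selected by tag) could be represented roughly as follows. This is a sketch under stated assumptions: the class name, field names, FASTA header layout and placeholder data are all hypothetical, and the disclosure does not specify a storage schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AnnotatedSequence:
    """One curated entry: a sequence plus tags in the four categories."""
    identifier: str
    sequence: str
    host: List[str]                                    # mandatory
    level_of_concern: List[str]                        # mandatory
    context: List[str] = field(default_factory=list)   # optional
    outcome: List[str] = field(default_factory=list)   # optional

    def __post_init__(self) -> None:
        # Reject entries that lack either of the two mandatory tag categories.
        if not self.host or not self.level_of_concern:
            raise ValueError("'Host' and 'Level of Concern' tags are mandatory")

    def to_fasta(self, width: int = 60) -> str:
        """Render the entry as a FASTA record with the mandatory tags in the header."""
        header = (f">{self.identifier} host={','.join(self.host)} "
                  f"concern={','.join(self.level_of_concern)}")
        lines = [self.sequence[i:i + width]
                 for i in range(0, len(self.sequence), width)]
        return "\n".join([header, *lines])


def export_by_tag(entries: List[AnnotatedSequence], host_tag: str) -> str:
    """Return FASTA text for every entry carrying the given host tag."""
    return "\n".join(e.to_fasta() for e in entries if host_tag in e.host)


# Placeholder data only; the identifier, tags and residues are invented.
entry = AnnotatedSequence("example_1", "MKT" * 25, host=["human"],
                          level_of_concern=["high"], context=["ingestion"])
print(export_by_tag([entry], "human"))
```

Free-text tag lists are used here, rather than enumerations, to mirror the stated choice not to impose a controlled vocabulary up front.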
`[0034] Provided herein are methods for sequence annotation wherein a database receives a listing
`of characteristics associated with a biological sequence or biological construct (e.g., nucleotide
`sequence or protein sequence). Exemplary characteristics include, without limitation: nucleic acid
`sequence, protein sequence, protein name, strain source, link to sequence database (e.g., NCBI),
`sequence database accession number, identical sequences (protein or nucleic acid), similar
`sequences (protein or nucleic acid), disease type (e.g., virus, bacterium, or fungi), host information
`(e.g., humans, mammals, birds, insects), context or route of harmful interaction (e.g., ingestion,
`inhalation), and level of concern. Also provided herein is a user interface which presents each
`characteristic or a link to additional information of such characteristics. See FIG. 1. In some
`cases, viral sequences for a particular strain are selected. For example, FIG. 2 illustrates a portion
`of 679 available strains of Hemagglutinin Neuraminidase-Newcastle Disease virus for annotation.
`[0035] Exemplary species include animal species. “Animals” as used herein includes, without
`limitation, mammals, marsupials, birds, insects, arthropods, amphibians and reptiles. Exemplary
`mammals include, without limitation, sheep, cattle, goats, pigs, rabbits, hares, deer, mice,
`rats, bats, and possums, and the like. Exemplary disease types include pathogens from the
`following classes: viruses, bacteria, fungi and other harmful pathogens. Exemplary viruses
`having harmful expression products include, without limitation, Marburg virus, Ebola virus,
`Hantavirus, bird flu (e.g., H5N1 strain), Lassa virus, Junin virus, Crimea-Congo fever, Machupo
`virus, Kyasanur Forest Virus, Dengue fever, and Chikungunya virus. Exemplary bacteria having
`harmful expression products include, without limitation, methicillin-resistant Staphylococcus aureus
`(MRSA), E. coli, Listeria, salmonella, gonococcus, streptococcus and staphylococcus.
`Exemplary fungi having harmful expression products include, without limitation, Amanita arocheae,
`Amanita bisporigera, Amanita exitialis, Amanita magnivelaris, Amanita ocreata, Amanita verna,
`Clitocybe dealbata, Cortinarius gentilis, and Lepiota brunneoincarnata. Exemplary routes to harm include,
`without limitation, ingestion, inhalation, skin contact, and sexual transmission. Exemplary
`outcomes include, without limitation, fever, headache, nausea, dizziness, and diarrhea. Exemplary
`protein databases include U.S. National Library of Medicine National Institutes of Health protein
`and gene databases. Exemplary levels of disease concern include low, medium, high, and extreme.
`[0036] Provided herein are methods for basic curation, such as identifying a sequence associated
`with a query by organism name and/or taxon. Once identified, a sequence annotation may
`optionally be updated and, optionally, recategorized for a particular descriptive feature. Sequences
`identified are further available for downloading in a singular or batch format, optionally with
`FASTA formatting.
`[0037] Data quality and public participation can both be concerns associated with publicly
`available databases. To maximize immediate utility, the disclosed system may carry out an initial
`curation process adding many pathogenic proteins to the database in an attempt to include most
`potentially regulated sequences or other sequences known to be harmful. The system may curate an
`“unrestricted” list of NCBI GI identifiers corresponding to genes that may be considered harmless.
`That unrestricted list may also be open to curation.
`[0038] A CAPTCHA scheme may be used to prevent bot-driven curation, and user registration may be
`required before creating or editing pages. GI identifiers may be periodically verified (for
`existence), and records may be tagged for human review on failure. Users can also flag records to
`request community or administrator review.
`[0039] The present disclosure provides for systems and methods that annotate and/or screen at least
`one biological sequence. In some instances, the biological sequence is a nucleic acid sequence. The
`nucleic acid sequence may comprise 1; 10; 100; 200; 300; 400; 500; 600; 700; 800; 900; 1,000;
`2,000; 5,000; 7,000; 10,000, or more nucleic acid residues. In some instances, the nucleic acid
`sequence comprises between 100 and 500 nucleic acid residues. In some instances, the nucleic acid
`sequence comprises between 50 and 1000 nucleic acid residues. In some instances, the nucleic acid
`sequence comprises between 20 and 200 nucleic acid residues. In some instances, the nucleic acid
`sequence comprises 200 residues. In some instances, the biological sequence may be DNA or
`RNA. The biological sequence
`may comprise adenine (A), cytosine (C), guanine (G), thymine (T), or uracil (U). In some
`instances, the biological sequence is a protein sequence. The protein may comprise 1; 10; 100;
`200; 300; 400; 500; 600; 700; 800; 900; 1,000; 2,000 or more amino acids. In some instances, the
`protein sequence comprises between 100 and 300 amino acids. In some instances, the protein
`sequence comprises between 50 and 500 amino acids. In some instances, the protein sequence
`comprises between 10 and 200 amino acids. In some instances, the protein sequence comprises
`60 amino acids. In some instances, nucleic acid fragments of no more than 2, 5, 10, 20, 50, 100, or
`200 residues are assembled in silico into a nucleic acid sequence. In some instances, nucleic acid
`fragments are obtained from one or more sources, or one or more orders from the same source.
`Screening tool
`[0040] Constructing a screening system capable of determining whether a given sequence poses a
`biosecurity risk may include a degree of investment in time and expertise not available to all
`synthetic biologists or even to all synthetic biology companies. Even assuming one has access to a
`database of dangerous sequences, basic parameterization of an aligner and result processing
`(including culling alignment counts to similar regions so as not to hide homology to shorter
`regions) may include domain expertise.
`[0041] An illustrative workflow is provided in FIG. 3A. Referring to FIG. 3A, a processor receives
`a query file containing biological sequence information, and is also in communication with a
`protein database having identified sequence information. A BLAST report is generated listing the
`identical and similar sequences identified for the queried biological sequence, in part or in
`whole. The BLAST report is then compared against databases containing sequence annotations identifying
`sequences associated with harmful biological sequences (protein or nucleic acids), also referred to
`as “restricted” lists. A screen report is generated in the form of a user interface which summarizes
`the results of these processes.
`[0042] An illustrative logic workflow is provided in FIG. 3B. Referring to FIG. 3B, a data input
`source such as physical nucleic acid or protein material (which can be sequenced), a nucleic acid
`sequence (which can be translated into a protein sequence), or a protein sequence can be evaluated
`using an algorithm which searches one or more databases to determine if it is on a restricted list.
`Exemplary algorithms include, but are not limited to, BLAST, DIAMOND, Smith-Waterman, or
`other algorithms for comparing sequence information. Sequences found to be on the restricted list
`are further evaluated against an unrestricted list that comprises known false positives. If no false
`positive is identified, the sequence is subjected to expert review. If the sequence is found to be
`non-harmful, it is placed on the unrestricted list to prevent further identification of said sequence as a
`false positive. If the sequence is found to be harmful, an output alert is generated. In some
`instances, the non-harmful sequence is synthesized. In some instances, the sequence is modified to
`remove the harmful sequence. In some instances, the modified sequence is re-screened. In some
`instances, this process is repeated iteratively until a modified non-harmful sequence is found. In
`some instances, the modified non-harmful sequence is synthesized.
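The decision logic of FIG. 3B, as described above, can be sketched in Python. The sketch makes several simplifying assumptions that are not part of this disclosure: list membership is modeled as a lookup of a hit identifier in a set (a real system would decide membership through sequence alignment), expert review is modeled as a caller-supplied function, and all names are hypothetical.

```python
from enum import Enum
from typing import Callable, Set


class Decision(Enum):
    CLEARED = "cleared"   # no restricted hit, or a known false positive
    ALERT = "alert"       # judged harmful; an alert is issued


def screen_sequence(seq_id: str,
                    restricted: Set[str],
                    unrestricted: Set[str],
                    expert_review: Callable[[str], bool]) -> Decision:
    """Apply the restricted list, then the unrestricted list, then expert review.

    expert_review(seq_id) returns True when the reviewer judges the hit harmful.
    """
    if seq_id not in restricted:
        return Decision.CLEARED
    if seq_id in unrestricted:
        return Decision.CLEARED        # already recorded as a false positive
    if expert_review(seq_id):
        return Decision.ALERT
    unrestricted.add(seq_id)           # record it so it is not flagged again
    return Decision.CLEARED


# Hypothetical identifiers; the reviewer stub judges every hit non-harmful.
restricted_list = {"hit_A", "hit_B"}
unrestricted_list = {"hit_B"}
print(screen_sequence("hit_A", restricted_list, unrestricted_list, lambda _id: False))
print(sorted(unrestricted_list))
```

The iterative modify-and-rescreen loop described above would call `screen_sequence` again on each modified design until it returns `Decision.CLEARED`.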
`[0043] Referring to FIG. 4, a user interface displays restricted lists available for selection for the
`screening process. Referring to FIG. 5, an illustrative user interface displays a “Submit a screen”
`submission form. The form allows for selection of screening against open database(s), e.g., a
`collection of publicly available information, or screening against a personal database, which may
`be based on non-publicly available selection criteria. The submission form also allows for
`selection of a biological sequence file for uploading.
`[0044] Referring to FIG. 6, an illustrative user interface displays a summary of biosecurity screens
`conducted, with status information, sequences screened, review status, concern or no concern
`status, date of sequence addition, and a link to viewing the BLAST result. Referring to FIG. 7, an
`illustrative user interface displays a summary of lists accessed during a screen, sequences screened,
`and harmful sequence (restricted) assignments for a sequence.
`[0045] The technologies disclosed herein may comprise a Python-based reference implementation
`of a screening system. Given a query nucleotide sequence, the system may compare the sequence
`(e.g., via BLAST) to the set of protein sequences derived from the annotated collection produced
`by the interface discussed in the previous section.
`[0046] Results may be filtered by the degree of homology, E-score and alignment length. Passing
`hits may be summarized by the distribution of tags associated with those sequences and the regions
`of the query found problematic. Links to the originating database entries may be provided so that
`users can follow up in more detail. In compliance with predefined guidance, some examples show
`that the algorithm is 100% sensitive and reports can be downloaded for archival use. Screening
`short (e.g., less than about 200 bases) sequences may result in a large number of false positive
`findings. Effective screening of shorter polynucleotide sequences may include an algorithmic
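Filtering results by degree of homology, E-value and alignment length could look like the Python sketch below. It assumes the aligner output is in the standard 12-column BLAST tabular layout (query, subject, percent identity, alignment length, mismatches, gap opens, query start and end, subject start and end, E-value, bit score). The threshold values, function names and example rows are placeholders chosen for illustration; the disclosure does not state specific cutoffs.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    percent_identity: float
    alignment_length: int
    evalue: float


def parse_tabular(lines: Iterable[str]) -> Iterator[Hit]:
    """Parse BLAST tabular rows given in the default 12-column order."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.rstrip("\n").split("\t")
        yield Hit(cols[0], cols[1], float(cols[2]), int(cols[3]), float(cols[10]))


def passing_hits(hits: Iterable[Hit],
                 min_identity: float = 80.0,
                 max_evalue: float = 1e-5,
                 min_length: int = 50) -> List[Hit]:
    """Keep hits that meet all three thresholds (values are placeholders)."""
    return [h for h in hits
            if h.percent_identity >= min_identity
            and h.evalue <= max_evalue
            and h.alignment_length >= min_length]


# Invented rows for illustration only.
rows = [
    "q1\tentry_7\t92.5\t120\t9\t0\t1\t120\t30\t149\t1e-40\t210",
    "q1\tentry_9\t61.0\t35\t13\t1\t5\t39\t2\t36\t0.8\t22.1",
]
kept = passing_hits(parse_tabular(rows))
print([h.subject for h in kept])
```

Passing hits could then be grouped by the tags attached to each subject entry to produce the summary described above.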
`[0047] The screening system may sit atop a database and include a RESTful application
`programming interface (API) for screen request submission and result retrieval as well as a
`graphical user interface. The application may be installed and operate on a laptop computer, and
`scale reasonably well to high-throughput use via API calls.
`Cumulative biological sequence or construct screening
`[0048] It is possible to obtain fragments of biological sequences or constructs that when
`individually screened will not result in identification of a harmful sequence, especially if the
`biological sequences or constructs are obtained through multiple sources and at multiple time
`points. In some instances, the source may be a customer. For example, accumulation of a
`substantial portion of the genome of any of the select agent-regulated bacteria or viruses may be
`obtained in smaller pieces, and then assembled into a harmful biological sequence or construct. To
`address this, in some instances a background process runs after each request is received and queries a
`database for all previous orders from that biological sequence or construct requesting source and
`collects records of any segments with high homology to any harmful biological sequences or
`constructs. This ensures evaluation and alerting even if those segments were insufficient t

