P3: Evolutionary Origin, Fixation and Functions of de Novo Protein Coding Genes

With every newly sequenced genome, a couple of hundred predicted genes remain ''orphans'' because computational methods cannot assign them any orthologs, even in closely related and well-annotated species. Presumably many of these (lineage-specific) genes are transcribed and sometimes translated, and their proteins are functional and adaptive, at least under some (possibly unknown) conditions. De novo emergence not only runs against the current belief that most novel genes emerge from old ones; it is also difficult to reconcile with a biophysical perspective. Functional reading frames emerging from previously non-coding matter must be considered extremely unlikely: their products would most likely be disordered, aggregate and thus be deleterious or, at least, be purged for purely energetic reasons. So, where do new coding genes actually come from, how do they function and how is their (potentially detrimental) expression regulated?

a) New gene emergence through gene duplication. b) De novo gene emergence from previously non-coding DNA: first an ORF is gained, followed by translation (or vice versa), leading to a proto-gene (red square). Over time, this proto-gene can develop into a gene encoding a fully functional protein.

We ask where novel protein-coding genes come from and how genomic novelties and rearrangements trigger adaptation and spur developmental transitions. Using comparative genomics and biophysical analyses (computational and experimental), we test their properties and functions. We found that most genetic novelty comes from novel domains, but many completely new reading frames also emerge, e.g. across the insect tree, at an estimated rate of 500 new genes in the wake of each speciation event. The former process, the emergence of novel domains, has been termed ''grow slow and moult'' because some novel domains later lose their initially stabilising parent protein and become independent and amenable to further rearrangements. We concentrate on some major transitions which happened during the evolution of extant life forms: signalling across multicellular organisms, placentation in mammals, the emergence of holometabolism in insects and the onset and reversal of ageing.

Furthermore, to catch novel genes "in the act" of emergence, we investigate genomes not only across species but also across populations, using closely related sister species as outgroups. Using gene and domain prediction programmes, we identify novel ORFs, determine their expression (RNA-seq) and, if necessary, confirm them, e.g. with PCR (long-read and primer walking) and qPCR. We are currently screening several systems (populations of fish, mice, flies and humans) to achieve good genomic coverage for detecting possible recent emergence events and to reconstruct ancestral sequences. These sequences can then be tested for their genetic origin, and their structural and biophysical properties can be investigated with the help of TSA, CD, NMR and phage display experiments. Additionally, we aim to compare the behaviour of the predicted ancestral de novo gene with that of the existing one in in vitro Drosophila experiments.
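As a rough illustration of the ORF-detection step described above, the following Python sketch scans a nucleotide sequence for open reading frames on both strands. It is a naive stand-in, not the gene and domain prediction programmes actually used in the project (which are not named here). The function names, the `min_codons` threshold and the restriction to ATG start codons are assumptions made for this example.

```python
# Naive ORF scan (illustrative only): ATG ... stop, standard genetic code assumed.
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=10):
    """Return (start, end, frame) tuples for ORFs on the given strand.

    start is 0-based, end is exclusive and includes the stop codon.
    min_codons (hypothetical threshold) counts codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == START_CODON:
                start = pos  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


def find_orfs_both_strands(seq, min_codons=10):
    """Scan both strands; minus-strand coordinates refer to the reverse complement."""
    fwd = [("+",) + orf for orf in find_orfs(seq, min_codons)]
    rev = [("-",) + orf for orf in find_orfs(reverse_complement(seq), min_codons)]
    return fwd + rev


example = "CCATGAAATTTGGGTAACC"
print(find_orfs_both_strands(example, min_codons=3))
```

In practice, candidate ORFs found this way would still need to be compared against outgroup genomes and expression data before being considered novel; that filtering is not shown here.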

People: Andreas Lange, Anna Grandchamp, Bharat Ravi, Brennen Heames, Daniel Dowling, Hanna Kuß

Conceptual schema of how new protein-coding genes become functional and eventually fixed: myriads of novel transcripts (small grey hills in sequence space) emerge (a) and most of them disappear again (see changes at the bottom (b)). Over time, fitness requirements may change (indicated by the rising transparent planes) and populations are funneled through bottlenecks (e.g. during speciation events, indicated by the middle plane). A handful of transcripts become full-blown genes featuring UTRs and translation, and may code for (typically short) proteins (grey, rear left). These are then added to the long-term existing 'canonical' set of protein families (indicated by blue, green and red hills), which may comprise many (blue) or rather few (green) protein sequences. Note that, depending on the criterion (e.g. non-aggregating, minimal stability) for a protein in the cellular context, sequence space can be more or less populated. Accordingly, mutational paths which would be required to reach another fitness peak may vary in length.


Funding: Leibniz Gemeinschaft (2013 -- 2016); Horizon 2020 Research and Innovation Framework Programme No. 722610 (2017 -- 2021); Volkswagen Stiftung (2021 -- 2026)

a) Ancestral reconstruction of Goddard: the existing D. melanogaster protein (red), its direct ancestor (light green) and the most recent common ancestor (dark green); structure predictions were done with QUARK. b) Molecular dynamics (MD) structure with mapped representative RMSF (root mean square fluctuation), showing less flexible (blue), moderately flexible (green/yellow) and highly flexible (red) regions. Both the central and N-terminal helices remain stably folded compared to the rest of the protein. c) The circular dichroism (CD) spectrum of Goddard indicates a flexible or distorted helix, as was observed in the MD simulations.
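For readers unfamiliar with RMSF, the quantity mapped in panel b) can be sketched as follows: for each atom, it is the square root of the time-averaged squared deviation from that atom's mean position. The snippet below is a minimal pure-Python illustration under the assumption that all frames have already been superimposed on a common reference; the function name and the toy trajectory are hypothetical and do not reproduce the analysis behind the figure.

```python
import math


def rmsf(trajectory):
    """Per-atom root mean square fluctuation.

    trajectory: list of frames, each frame a list of (x, y, z) tuples,
    one per atom. Frames are assumed to be pre-aligned to a reference.
    """
    n_frames = len(trajectory)
    n_atoms = len(trajectory[0])
    values = []
    for i in range(n_atoms):
        # Mean position of atom i over all frames
        mean = [sum(frame[i][k] for frame in trajectory) / n_frames
                for k in range(3)]
        # Mean squared deviation from that mean position
        msd = sum(
            sum((frame[i][k] - mean[k]) ** 2 for k in range(3))
            for frame in trajectory
        ) / n_frames
        values.append(math.sqrt(msd))
    return values


# Toy two-atom, two-frame trajectory: atom 0 is fixed, atom 1 moves along x.
toy = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
]
print(rmsf(toy))
```

High per-residue values computed this way correspond to the flexible regions of a structure, low values to the rigid ones.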

Related Publications

  • Lange A, Patel PH, Heames B, Damry AM, Saenger T, Jackson CJ, Findlay GD, Bornberg-Bauer E; Structural and functional characterization of a putative de novo evolved gene essential for male fertility in Drosophila; Nature Communications, 2021, 12:1667
  • Bornberg-Bauer E, Hlouchova K, Lange A; Structure and function of naturally evolved de novo proteins; Curr Opin Struct Biol, 2021
  • Bornberg-Bauer E, Heames B; Becoming a de novo gene; Nature Ecology & Evolution, 2019
  • Schmitz JF, Ullrich K, Bornberg-Bauer E; Incipient de novo genes can evolve from "frozen accidents" which escaped rapid transcript turnover; Nature Ecology & Evolution, 2018
  • Klasberg S, Bitard-Feildel T, Callebaut I, Bornberg-Bauer E; Origins and structural properties of novel and de novo protein domains during insect evolution; FEBS Journal, 2018
  • Schmitz JF, Bornberg-Bauer E; Fact or fiction: updates on how protein-coding genes might emerge de novo from previously non-coding DNA; F1000Research, 2017
  • Gubala et al.; The goddard and saturn genes are essential for Drosophila male fertility and may have arisen de novo; Molecular Biology and Evolution, 2017
  • Bornberg-Bauer E et al.; Emergence of de novo proteins from ''dark genomic matter'' by ''grow slow and moult''; Biochemical Society Transactions, 2015
  • Bitard-Feildel T et al.; Detection of orphan domains in Drosophila using hydrophobic cluster analysis; Biochimie, 2015
  • Wissler L et al.; Mechanisms and dynamics of orphan gene emergence in insect genomes; Genome Biol Evol, 2013
  • Feulner P et al.; Genome-wide patterns of standing genetic variation in a natural marine population of three-spined sticklebacks; Molecular Ecology, 2012
  • Chain F et al.; Extensive copy-number variation of young genes across stickleback populations; PLoS Genet, 2014
  • Bornberg-Bauer E et al.; How do new proteins arise?; Curr Opin Struct Biol, 2010

Techniques employed: Computational: comparative genomics; differential GO analysis; biophysical predictions (disorder, secondary structure, hydrophobic clusters); ancestral reconstruction and phylogenies; mutational effects on stability (FoldX, Rosetta). Experimental: deep sequencing; qPCR; antibody staining; cloning, (over-)expression and purification; expression quantification by SDS-PAGE; E. coli autodisplay (Jose); in situ hybridisation; CD; stability measurements; in-cell NMR (Selenko); pull-down assays (Ivarsson); in vitro expression of ancestral de novo genes (Findlay)