Heames, B., Schmitz, J. and Bornberg-Bauer, E.
A continuum of evolving de novo genes underlies protein-coding novelty in Drosophila
Journal of Molecular Evolution, 2019

[Login to Download]


Orphan genes, lacking detectable homologs in outgroup species, typically represent 10-30% of eukaryotic genomes. Efforts to find the source of these young genes indicate that de novo emergence from non-coding DNA may in part explain their prevalence. Here, we investigate the roots of orphan gene emergence in the Drosophila genus. Across the annotated proteomes of twelve species, we find 6297 orphan genes within 4953 taxon-specific clusters of orthologs. By inferring the ancestral DNA as non-coding for 2467 (39.2%) of these genes, we describe for the first time how de novo emergence contributes to the abundance of clade-specific Drosophila genes. In support of their having functional roles, we show that de novo genes have robust expression, translational support, and appear to be under selective constraint. However, the distinct nucleotide sequences of de novo genes, which have characteristics intermediate between intergenic regions and conserved genes, reflect their recent birth from non-coding DNA. We find that de novo genes encode more disordered proteins than both older genes and intergenic regions, but suggest that this is a consequence of their high GC-content. Together, our results suggest that gene emergence from non-coding DNA provides an abundant source of material for the evolution of new proteins. Following gene birth, gradual evolution over large evolutionary timescales moulds sequence properties towards those of conserved genes, resulting in a continuum of properties whose starting points depend on the nucleotide sequences of the initial pool of novel genes.