Heames, B., Schmitz, J. and Bornberg-Bauer, E.
A continuum of evolving de novo genes underlies protein-coding novelty in Drosophila
Journal of Molecular Evolution, 2019

Orphan genes, lacking detectable homologs in outgroup species, typically represent 10-30% of annotated genes. Efforts to find the source of these young genes indicate that de novo emergence from non-coding DNA may in part explain their prevalence. Here, we investigate the roots of orphan gene emergence in the Drosophila clade. Across the annotated proteomes of twelve species, we find 7011 orphan genes within 5363 taxon-specific clusters of orthologs. By inferring the ancestral DNA as non-coding for 1246 (21%) of these genes, we show for the first time how de novo emergence contributes to the abundance of clade-specific Drosophila genes. In support of their having functional roles, de novo genes are just as likely as conserved proteins to have translational evidence and are under selective constraint. However, the distinct nucleotide sequences of de novo genes, which have characteristics intermediate between intergenic regions and conserved genes, reflect their recent birth from non-coding DNA. We find that de novo genes encode more disordered proteins than both older genes and intergenic regions, but conclude that this is a consequence of their high GC-content. Together, our results suggest that gene emergence from non-coding DNA provides an abundant source of material for the evolution of new proteins. Following gene birth, gradual evolution over large evolutionary timescales moulds sequence properties towards those of conserved genes, resulting in a continuum of properties whose starting points depend on the nucleotide sequences of the initial pool of novel genes.