In addition to known genes, much of the human genome is transcribed into RNA. Chance formation of novel open reading frames (ORFs) can lead to the translation of myriad proteins. Some of these ORFs may yield adaptive de novo proteins. Because these sequences share minimal similarity with pre-existing proteins, they can have vastly different properties. However, widespread translation of non-coding DNA can also produce hazardous protein molecules, which may fold incorrectly and form toxic aggregates. How de novo proteins emerge from potentially toxic raw materials, and what influences their long-term survival, is unknown. Here, using transcriptomic data from human and five other primates, we generate a set of transcribed human ORFs at six conservation levels to investigate which properties influence the early emergence and long-term retention of these ORFs. We find that novel human-restricted ORFs are preferentially located on GC-rich, gene-dense chromosomes, suggesting that their emergence is linked to pre-existing genes. Sequence properties such as intrinsic structural disorder and aggregation propensity, which have been proposed to play a role in the survival of de novo genes, remain relatively unchanged over time. GC-rich sequences code for proteins with lower aggregation propensities, suggesting that genomic regions amenable to the generation of novel transcribed ORFs are concomitantly less likely to produce ORFs that code for toxic proteins. This may explain how the cell can tolerate the abundant creation of new, potentially toxic ORFs. Our data indicate that selection does not shape the protein structures of these ORFs and that their long-term survival is largely stochastic.
The data used in this paper are available at http://doi.org/10.5281/zenodo.4048343