New Definitions of a "Gene"
COMMENTARY: Allen MacNeill
RPM at Evolgen has a new post about evolving definitions of a "gene". Here are my thoughts on the subject:
For years I have been teaching my students that a gene is a segment of DNA that codes for a single RNA molecule with a complementary sequence, regardless of whether that RNA molecule is translated or not. This definition takes into account the genes for the various rRNAs and tRNAs, which are not translated, and also other forms of non-translated RNA that have recently been discovered. By this definition, genes that code for mRNAs that are actually translated are distinguished as "structural genes," using terminology that was first developed to describe the Jacob-Monod model of the lactose operon. Using this same terminology, the gene that codes for the lactose repressor protein is a "regulatory gene," insofar as the repressor does not function in an "extrinsic" biochemical pathway, but rather participates in the regulation of other structural genes.
However, the distinction between "structural" and "regulatory" genes outlined above is insufficient to describe the various kinds of genetically significant DNA sequences now known. For example, it does not include regions of the DNA to which protein regulators bind, but which are not themselves transcribed. It also does not distinguish between sequences whose RNA transcripts are translated into proteins (either enzymes or repressor/regulator proteins) and those that are transcribed into RNA but never translated (such as rRNA, tRNA, and the newer non-translated RNAs).
Given the foregoing, it appears to me that there are four (possibly five) functionally different kinds of DNA coding sequences:
(1) translatable sequences: those DNA sequences that are both transcribed into mRNA and later translated into proteins, regardless of function (these can be further subdivided into proteins that participate in non-DNA related biochemical pathways and those that directly regulate DNA, but those seem to me to be classifications of the proteins, not the DNA sequences that code for them);
(2) transcribable sequences: those DNA sequences that are transcribed into RNA (e.g. rRNA, tRNA, etc.), but are not later translated into proteins/polypeptide chains. Again, what the RNAs do after being transcribed is not a function of the DNA, but rather of the RNAs, and therefore should not really be used to classify DNA coding sequences;
(3) binding sequences: those DNA sequences that are neither transcribed into RNA nor translated into protein, but which function as binding sites for regulatory molecules such as repressor proteins, homeotic gene products, etc. While such sequences do not code for the production of a transcribed or translated gene product, they still participate in the regulation of other genes by serving as regulatory binding sites; and
(4) non-binding sequences: those DNA sequences that are not transcribed into RNA, are not translated into protein, and do not function as binding sites for regulatory molecules. Such sequences would include highly repetitive sequences, tandem repeats, "spacer DNA", pseudogenes, retroviral and transposon inserts (both "dead" and potentially "alive"), etc. This last category could be further subdivided into "functional" non-coding/non-binding DNA sequences versus "non-functional/parasitic" non-coding/non-binding DNA sequences, depending on whether they arise as part of the functional architecture of the DNA (primarily of eukaryotes), or whether they arise as side-effects of the action of parasitic genetic elements, such as retroviruses or transposons.
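The four categories above amount to a small decision procedure over three yes/no properties of a DNA sequence: is it transcribed, is its transcript translated, and does it serve as a binding site for regulatory molecules? Here is a minimal Python sketch of that reading. The names SequenceKind and classify are hypothetical, invented for illustration, and the sketch assumes that transcription takes precedence, so a sequence that is transcribed and also bound by regulators lands in category (1) or (2). The possible subdivision of category (4) is left out.

```python
from enum import Enum


class SequenceKind(Enum):
    """The four categories from the list above (names are illustrative)."""
    TRANSLATABLE = 1   # transcribed into mRNA and translated into protein
    TRANSCRIBABLE = 2  # transcribed into RNA but never translated
    BINDING = 3        # not transcribed; binds regulatory molecules
    NON_BINDING = 4    # not transcribed and not a regulatory binding site


def classify(transcribed: bool, translated: bool, binds_regulators: bool) -> SequenceKind:
    """Assign a DNA sequence to one of the four categories.

    Assumption: transcription is checked first, so binds_regulators only
    matters for sequences that are not transcribed.
    """
    if translated and not transcribed:
        # Translation requires an RNA transcript, so this combination is invalid.
        raise ValueError("a sequence cannot be translated without being transcribed")
    if transcribed:
        return SequenceKind.TRANSLATABLE if translated else SequenceKind.TRANSCRIBABLE
    return SequenceKind.BINDING if binds_regulators else SequenceKind.NON_BINDING


# Illustrative inputs: a structural gene, a tRNA gene, a repressor binding
# site, and a tandem repeat.
examples = {
    "structural gene": classify(True, True, False),
    "tRNA gene": classify(True, False, False),
    "repressor binding site": classify(False, False, True),
    "tandem repeat": classify(False, False, False),
}
```

Note that the function looks only at properties of the DNA sequence itself, in keeping with the point made in (1) and (2) that what the gene product does afterward should not be used to classify the sequence.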
There may be other categories of DNA sequences that have other functions, but right now I can't think of any. Therefore, this is how I intend to teach the concept of a "gene" to my students at Cornell from now on.
So much for the Beadle/Tatum "one gene, one enzyme" model, eh? And the classical Mendelian definition of "one gene, one phenotypic trait" is no longer viable either...