Misinterpreting Phylogenetic Trees? -- Phosphoseryl-tRNA Pathway of Cys Synthesis is a Late Energy Optimization and not a Feature of LUCA

Jens · December 21, 2012

The phosphoseryl-tRNA pathway of Cys synthesis is a late energy optimization of the methanogens and not a feature of LUCA or

-- How Paralogs Bias Phylogenetic Trees --

Full text (7 pages) with all the references and details

---Start--of--Note--------------------------------------------------------------------

During my literature research on the origin of life, I often found people claiming that phylogenetic trees prove an early origin of some feature. In many cases this looked very strange to me. Here I have a case in which, at least from my point of view, the evidence is clearly against the interpretation of the phylogenetic tree. Please read through the text provided as a link to a PDF document and comment if you see errors in my argumentation or have additional information bearing on this hypothesis.

---End--of--Note-------------------------------------------------------------------------

This document shows that the phosphoseryl-tRNA pathway of cysteine synthesis is clearly a late energy optimization of the methanogens and not a feature of the last universal common ancestor (LUCA), as currently claimed. Chapter VI contains a (hopefully) easy-to-read explanation, without mathematics, of how paralogs bias phylogenetic trees (and systematically lead to these wrong claims, which have not been questioned so far). I hope the biology community will take this into account more often in future, and that a more critical approach to "sure" mathematical results (based on false, unstated assumptions) becomes more common (see ‘Amino Acid Usage Cannot Prove Hyperthermophile Origin’ in ‘Early Life’ for another case).

Pyrrolysine

[5,6] In contrast to selenocysteine, pyrrolysine (Pyl) is found primarily in some methanogenic Archaea and in only a very few anaerobic Bacteria. There is a claim [7] that pyrrolysine is very ancient and predates LUCA, based on a calculated phylogenetic tree (itself based on a high-quality sequence alignment obtained through 3D structures). But the biological data indicate that it is a late change to the genetic code, which happened in one branch of Archaea and was then horizontally transferred to the few Bacteria living in the same habitats, for the following reasons:

In those methanogenic Archaea the stop codon UAG has been reassigned to code for Pyl, in contrast to the bacterium Desulfitobacterium hafniense in which UAG still has the meaning of a stop codon in most proteins [8]. This is the exact behavior to be expected in case of a late horizontal gene transfer to a predecessor of Desulfitobacterium hafniense and not in the case of inheritance.
‘The high conservation of the Pyl gene cluster and the small number of organisms that utilize Pyl suggest its relatively recent origin’ as stated by [8].

Phospho-Seryl-tRNA Pathway of Cys Synthesis

There is an analogous claim by O’Donoghue [9] that the charging of tRNA^Cys via O-phospho-seryl-tRNA^Cys (Sep-tRNA^Cys) evolved before LUCA. Instead, it looks much more likely that the Sep-tRNA^Cys pathway evolved together with the first methanogens (and not before LUCA), for the following reasons:

The genes of this pathway occur only in a very restricted group of methanogenic Archaea and are not widespread across Archaea, Bacteria and Eucarya. The close evolutionary relationship between the methanogenic genes and the Sep-tRNA^Cys pathway was already remarked on by O’Donoghue [9] himself.
This pathway saves 2 ATP compared to the O-acetyl-serine pathway and at least 4 ATP compared to the cystathionine pathway (see Figure 4 and legend). So it is not a relic of an old, sub-optimal way to charge tRNAs but rather an energetic optimization introduced at the time of the first methanogenic Archaea.
The price for these energy savings is that the Sep-tRNA^Cys pathway runs into thermodynamic problems as soon as the H₂S concentration is too low, because in all the other pathways the reaction step incorporating H₂S is based on an acetyl ester, which is more energy-rich than a phosphoester. This means the Sep-tRNA^Cys pathway is only suited to environments with high sulfide concentrations. Methanogenic Archaea do live in such environments, and the only non-methanogenic archaeon with the Sep-tRNA^Cys pathway (Archaeoglobus fulgidus) is itself a sulfide producer (and still has most of the methanogenic genes).
This is perfectly in line with the finding that the O-acetyl-serine pathway is inhibited by H₂S [10] in the archaeon Methanosarcina thermophila, a result that was surprising [10] and could not be explained so far. It means the O-acetyl-serine pathway is downregulated under exactly those conditions in which the energetically more favorable Sep-tRNA^Cys pathway can work efficiently. This also explains why it makes perfect sense, at least for some methanogens, to keep both pathways.
Even though a difference of 2 ATP is huge (methanogenesis as a whole is just 1 ATP better than acetogenesis), you might argue that it is not much, since cysteine synthesis is not part of primary energy metabolism. However, there is recent evidence that the methanogenic Archaea using this new pathway need twice as much Cys as other organisms [11], which points to the importance of sulfur metabolism for methanogens.

Conclusion

How can it be that in both cases (pyrrolysine and the Sep-tRNA^Cys pathway) only the phylogenetic tree predictions do not fit into the picture? This is, I think, because all phylogenetic tree prediction methods are systematically biased towards placing every split between paralogs before LUCA, even and especially if there is no phylogenetic information left any more (see page 3, chapter Details VI).

(for explanation follow the link to the Details VI chapter.)

overtone · December 21, 2012

I'm not competent to critique the analysis there, but my own reading leaves the impression that early evolutionary phylogenetic trees in general, and especially any involving the archaea, are not taken as given by most people. The common approach seems to be one of uncertainty and recognition of uncertainty.

Is there a consensus out there, an assumed one, that this undermines to anyone's great surprise?

Jens · December 22, 2012

It does not seem to be obvious to many:

At least, scientists can publish such results (misinterpreted, to my understanding) in PNAS:

Reference [9]: O’Donoghue P, Sethi A, Woese CR, Luthey-Schulten ZA (2005) The evolutionary history of Cys-tRNA^Cys formation. Proc Natl Acad Sci U S A 102: 19003-19008. doi:10.1073/pnas.0509617102

I agree with you that there seems to be a consensus that phylogenetic trees of whole species (as opposed to trees of individual proteins or rRNAs, which is what this is about) do not all show the same pattern, and that in some cases there really is uncertainty.

Arete · December 23, 2012

Hmm, that the sequences used to infer organismal evolutionary history are orthologous is a fundamental assumption of a phylogenetic analysis. If you're trying to infer whole-organism evolution from a tree generated from paralogous sequences, it goes without question that your phylogenetic tree is wrong.

If you're trying to determine the relationships amongst genes from a gene family, rather than whole organism, it makes sense to build a phylogeny of paralogs, however.

Jens · December 25, 2012

Agreed.

Phylogeny of paralogs makes sense. But you should be very careful with the perceived order of the branching points.

Yes, being orthologous is a fundamental assumption. But people tend to forget this in practice.

Here is an example where it is claimed that the phosphoseryl pathway is a feature left over from LUCA, even though, looking at the real biological situation, all the pieces of the puzzle are available (at least from my point of view you could hardly ask for more evidence) to indicate that it is a late optimization.

Maybe I am simplifying too much, but this is what I want to say with regard to method:

If there is an old relationship to investigate (e.g. older than 800 million years), you cannot deduce anything from the non-conserved, non-functional regions, because they are randomized.
If you include paralogs, the outcome will always be that the split happened before LUCA (even though that is actually not the case). At the very least, the tree will always be biased towards predicting the split earlier than it actually was.
Even for orthologous genes there is a risk of this issue, because function covers not only the main catalytic activity but also additional regulatory roles, which are often much less well understood. This means that two proteins which are orthologous with respect to their main function can actually behave like paralogs with respect to some regulatory function (different regulation in other species, binding to different proteins or RNAs, ...).

Jens · December 27, 2012

I had probably better post chapter VI here in full to explain the method point:

Figure 1: Evolution of paralogous proteins in different species. The different paralogous proteins are indicated as A, B and Z. The different species are indicated by different color (black, blue, magenta and green). Time flows from bottom to top. Horizontal distance means distance in similarity. Protein A and B are functionally closely related, protein Z is functionally more distant. The filled black circle B indicates the point in time in which a copy from A was created and evolved into closely related function B. The filled blue circle Z indicates the point in time in which a copy from A was created and evolved into more distant function Z. The open colored circles indicate the origin of the respective species. Lines are going back and forth horizontally to indicate the more or less random sequence changes in the non-conserved areas of the protein.

Figure 2: Predicted functional tree versus actual evolution. Taking the exact same situation as in Figure 1, the assumption is that all non-conserved regions are fully randomized by mutations back and forth and that the sequences of all conserved regions are fully determined by selection pressure (and so no longer contain any historical information either). The orange tree shows what every tree calculation method will produce. Even though there is zero phylogenetic information left, a tree with high-confidence branching points will still be predicted. However, the tree only shows the functional relationship and will always place any branching of paralogous proteins before LUCA, no matter what actually happened. Note that the order in time of the branching of B and Z is predicted completely wrong by the orange tree, and that the order of the branching of Z and the appearance of the different species is also predicted completely wrong.

Figure 3: Selection-driven tree versus mutation-driven tree. The orange tree is the same 100% selection-driven tree as in Figure 2. The black tree is a 100% mutation-driven tree, which shows how the tree looks if all the phylogenetic information is still present. However, this only works perfectly if only the non-functional regions (which accumulate random surviving mutations) are taken into account and the time scale is short, so that mutations back to the same sequence can be ignored. In all other cases you will obtain a tree which is a combination of both trees.

Explanation in detail:

Phylogenetic tree prediction methods have to assume that differences in sequence occur randomly, since they simply apply mathematics and ignore biological function. They also have to assume that multiple mutations at the same site within the same branch can be ignored, since these leave no trace in the existing data (at least unless the number of surviving species used as input is much higher than the average number of mutations per site). The first assumption means that phylogenetic tree prediction methods treat evolution as consisting only of mutation and ignore selection. It is false for proteins which share the same origin but no longer share the same function (paralogous proteins). The second assumption is false for proteins which appeared during the early diversification of Archaea, Bacteria and Eukarya, or even earlier. If proteins are ancient enough, all the sites which are not under high functional selection pressure get randomized by many mutations forward and back, so that the sequence of mutations, and hence the actual evolutionary relationship, can no longer be traced from the current sequences (see Figure 1). In contrast, the sites which are under high functional selection pressure still stay related. This might be enough for orthologous proteins (those which share origin and function) to still show remnants of a phylogenetic relationship in those conserved areas. However, for paralogous proteins those sites differ systematically because of the different functional selection pressure. In the given example (see chapter Details I) this means that a copy of Phe-tRNA^Phe synthetase changed its function and adapted to bind Pyl instead of Phe and tRNA^Pyl instead of tRNA^Phe. This means that if paralogous proteins are old enough (e.g. dating from the origin of methanogenic Archaea), you risk obtaining not a mutation-driven phylogenetic tree but a selection-driven functional relationship tree (see Figure 2). This also explains the apparent paradox of why it seems possible to obtain clear and highly resolved phylogenetic information about an event “clearly” before LUCA, even though the much more recent speciation events within Archaea, Bacteria and Eukarya are often harder to resolve.
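
The randomization of unconstrained sites can be illustrated with a small sketch. It is not taken from the full text: it assumes the simplest possible substitution model (all 20 amino acids equally likely, equal rates at every site, no selection), and the function name is hypothetical. Under that assumption, the expected fraction of differing sites between two sequences flattens out near 95% as substitutions accumulate, so very different divergence times become indistinguishable:

```python
import math


def expected_fraction_different(subs_per_site: float, n_states: int = 20) -> float:
    """Expected fraction of sites that differ between two sequences.

    Assumption (not from the original text): an equal-rate model in which
    every site mutates independently and each of the n_states residues is
    equally likely to replace another. subs_per_site is the average number
    of substitutions per site separating the two sequences.
    """
    k = n_states
    return (k - 1) / k * (1.0 - math.exp(-k * subs_per_site / (k - 1)))


if __name__ == "__main__":
    # Print how the observable difference saturates as substitutions pile up.
    for d in (0.1, 0.5, 1.0, 2.0, 5.0, 10.0):
        p = expected_fraction_different(d)
        print(f"{d:5.1f} substitutions/site -> {p:.3f} of sites differ")
```

Once the curve is flat, additional substitutions no longer change the observed difference, which is the sense in which such sites carry no recoverable historical information.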

Of course, the bias only works in the direction of showing paralog branching points as systematically deeper than they are (see Figure 3), never as more recent. So it is possible to deduce from calculated trees that a paralog appeared more recently than a species branching point, but not the reverse. Even trees of orthologous proteins might be biased because of hidden functional differences, such as binding to different proteins or RNAs, or binding to different regulatory molecules because of a different metabolism. True orthologs do not only share the main biological function but also need to share the same regulatory features (which are very often not known or verified).

To obtain trees which are more phylogenetic and less functional, you should at least remove all amino acid positions that are more than 95% conserved in the paralog from the comparison, since they are functionally driven, and compare only the rest. If the branching point moves up in the tree for the reduced set, this is a strong hint that the tree from the complete data set was severely functionally biased. If the reduced set shows much more random behavior (e.g. the species tree fits the canonical tree much less well than with the complete set), this is a hint that there is actually very little phylogenetic information left and that only the highly conserved regions mutated slowly enough to still be traceable. However, exactly those regions are functionally biased. In this case it is impossible to deduce any phylogenetic information about the paralog.
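
The filtering step proposed here could look roughly like the following sketch. It is only an illustration under assumptions: the alignment is a plain dictionary of equal-length aligned strings with '-' for gaps, the function names and the toy sequences are hypothetical, and the tree building itself is left to whatever program is normally used.

```python
from collections import Counter


def conserved_columns(sequences, threshold=0.95):
    """Return the indices of alignment columns whose most common residue
    exceeds `threshold` within `sequences` (gaps, written '-', are ignored)."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all aligned sequences must have the same length")
    conserved = set()
    for i in range(length):
        residues = [s[i] for s in sequences if s[i] != "-"]
        if not residues:
            continue  # a gap-only column is not counted as conserved
        top_count = Counter(residues).most_common(1)[0][1]
        if top_count / len(residues) > threshold:
            conserved.add(i)
    return conserved


def drop_columns(alignment, columns):
    """Return a copy of `alignment` (name -> aligned sequence) without `columns`."""
    return {
        name: "".join(c for i, c in enumerate(seq) if i not in columns)
        for name, seq in alignment.items()
    }


# Hypothetical toy alignment: protein A and its paralog Z in two species.
alignment = {
    "A_species1": "MKTAYL",
    "A_species2": "MKSAYV",
    "Z_species1": "MRTGWL",
    "Z_species2": "MRSGWI",
}
# Conservation is measured within the paralog only, then applied to all rows.
paralog_z = [seq for name, seq in alignment.items() if name.startswith("Z_")]
reduced = drop_columns(alignment, conserved_columns(paralog_z))
print(reduced)
```

The reduced alignment would then be fed to the same tree method as the complete one, and the position of the paralog branching point compared between the two trees.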

Arete · December 27, 2012

It's still highly unclear what you're trying to do - it seems you're trying to resolve deep nodes in a tree of highly divergent organisms, but from your figures, it appears you're constructing a phylogenetic tree from paralogous sequences.

This fundamentally violates many assumptions underlying any of the tree building methods used. A) The method assumes a common origin for gene copies A, B and Z - which is a flawed assumption as distinct elements of a genome have a number of distinct origins. I would assume a high likelihood that, e.g., copy A from the green species and copy B from the black would be incorrectly inferred as being more closely related than the correct copies of each gene - again, generating a tree of paralogs from separate individuals will be uninformative with regard to organismal diversity. B) If you're trying to construct a tree representative of whole organismal diversity, it's another fundamental assumption that the genes you're looking at are not under significant selection, so of course severely violating that assumption will, again, give you an incorrect tree. A tree as I perceive you to be describing it would be wholly uninformative for any biological hypothesis, and showing that it's incorrect is unsurprising.

If you were trying to infer whole organismal ancestry, what you'd actually want to do is construct 3 (or preferably more) separate gene trees, which were not paralogs, and summarize the gene trees using a species tree method. To investigate the evolutionary history of 3 gene families, you'd construct individual tree sets of each paralog from a single individual, and compare independent tree sets. I think ultraconserved genomic elements and a phylogenomic species tree approach would be far more realistic than the approach you seem to have presented here, though it has been shown that even using entire genomes, deep polytomies sometimes remain unresolved.

You might find these papers helpful:

http://sws.bu.edu/msoren/Carstens.pdf

http://sysbio.oxfordjournals.org/content/61/5/835.short

http://mbe.oxfordjournals.org/content/25/2/402.short

http://genome.cshlp.org/content/22/4/746.short

Edited December 27, 2012 by Arete

Jens · December 29, 2012

Thanks for your input and discussion!

It's still highly unclear what you're trying to do - it seems you're trying to resolve deep nodes in a tree of highly divergent organisms, but from your figures, it appears you're constructing a phylogenetic tree from paralogous sequences.

This fundamentally violates many assumptions underlying any of the tree building methods used.

I fully agree. But this is what has been done and claimed in PNAS in the following paper:

http://www.pnas.org/content/102/52/19003.full.pdf+html

The Evolutionary Relationship Between PheRS and SepRS. Our analysis further demonstrates that SepRS and the associated indirect aminoacylation pathway for tRNACys are truly ancient, already present at the time of the LUCAS.

And the same applies in a similar case for pyrrolysine. In other papers this wrong conclusion is then used as an argument for how the genetic code evolved (this is how I came to the topic).

So you confirm my assumption that the conclusion by O'Donoghue cannot be drawn, since they constructed a phylogenetic tree out of paralogous sequences and drew conclusions about the time of branching. This was my main point.

B) If you're trying to construct a tree representative of whole organismal diversity

This is not what I am trying to do.

Side question: from your point of view, is the following statement right?

If a phylogenetic tree containing a mixture of orthologs and paralogs is constructed (even though you should not do it), the branching point of the paralogs is biased in such a way that it appears to be earlier than it actually was (but never later). This is especially true for very ancient trees.

Jens · December 30, 2012

http://sws.bu.edu/msoren/Carstens.pdf

http://mbe.oxfordjournals.org/content/25/2/402.short

http://genome.cshlp.org/content/22/4/746.short

I will read the three of your links that are publicly accessible. I just need some time for it.

(...and will try to adapt to the technical language used for better understanding between us....)

Jens · December 31, 2012

A) The method assumes a common origin for gene copies A, B and Z - which is a flawed assumption as distinct elements of a genome have a number of distinct origins

I do not understand your comment "flawed assumption". Do you really think the different genes have always been there? (At its extreme, this would basically mean that the first living organism already had all the genes of a human.) Since you obviously do not mean this, we are probably talking past each other (most likely because I am talking about a 3000-million-year time frame). So I will try to rephrase what I show in Figure 1:

- It is a comparison of 10 homologous protein sequences (and not a whole genome).

- There are 4 different species (green, black, magenta, blue).

- There are 3 different protein functions: A, B, Z (so A, B and Z are paralogous to each other).

- Z is a protein only present in species blue and green.

- The different species branched very anciently (so from Archaea, Bacteria, Eukarya).

- The beginning of the tree shows LUCA (the last universal common ancestor of all life)

- In the beginning LUCA only had one protein (protein A)

Of course LUCA did not already have all the genes. Some of the genes evolved later. Since today there are still plenty of protein families (families of paralogs) in which the individual genes show clear homology even though they have different functions, you have to assume that those paralogs were initially created by copying a gene. Or what is your proposal for how this should have happened?

I would assume a high likelihood that, e.g., copy A from the green species and copy B from the black would be incorrectly inferred as being more closely related than the correct copies of each gene

I am not sure I understand what you mean.

Protein B has a different function than protein A. For this reason there are highly conserved regions in protein B which differ from those in protein A. This is why all copies of protein B (in the 4 different species) are closer to each other than to protein A.

Maybe we can discuss it using Figure 2:

I wanted to illustrate that comparing paralogs means that every tree you obtain also partially (or fully, depending on the time frame) contains the functional distance tree and systematically shows the branching points of paralogs as too early.

This might be trivial for method experts like you, but it is very often not taken into consideration in publications about individual genes or individual gene families (especially since there are many hidden paralogs, because in most cases we do not know the regulatory features of the proteins compared).
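
The effect can be illustrated with a toy model. This is only a sketch under loud assumptions: the motifs, sequence lengths and names are hypothetical, the function-specific sites are taken to be completely fixed by selection, and all other sites are drawn at random, so the four sequences share no history at all. Even so, a simple distance measure groups them by function:

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical function-specific motifs: fully fixed by selection, and
# different at every position between the two functions.
MOTIF = {"A": "ACDEFGHIKL" * 3, "B": "MNPQRSTVWY" * 3}
N_FREE_SITES = 70  # unconstrained sites, assumed fully randomized


def make_protein(function, rng):
    """Build one sequence: the motif for its function plus random free sites."""
    free = "".join(rng.choice(AMINO_ACIDS) for _ in range(N_FREE_SITES))
    return MOTIF[function] + free


def p_distance(a, b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


rng = random.Random(0)  # fixed seed so the run is repeatable
proteins = {
    (function, species): make_protein(function, rng)
    for function in "AB"
    for species in ("species1", "species2")
}

# Same function in different species versus different function in one species.
same_function = p_distance(proteins[("A", "species1")], proteins[("A", "species2")])
same_species = p_distance(proteins[("A", "species1")], proteins[("B", "species1")])
print(f"same function, different species: {same_function:.2f}")
print(f"different function, same species: {same_species:.2f}")
```

Any distance-based grouping of these sequences would put the split between A and B at the root, although nothing in the model says when that split happened.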

http://mbe.oxfordjournals.org/content/25/2/402.short

I have read through this reference first (really interesting!):

(As you have already mentioned it is about a complete phylogeny of thousands of DNA sequences and not about a single protein family)

It is consistent with the view that the molecular clock is obviously not constant (in conserved regions) but is, of course, highly sensitive to changes in function. So here, too, partial paralogs of the ultraconserved elements have most likely been compared, since the authors suggest that their function changed during evolution from fish to amphibians. We just do not know the function well enough to even decide whether the homolog is a paralog or an ortholog. (This is actually quite a typical case.)

The only other sequences that show a similarly slow evolutionary rate are ribosomal RNA sequences, which have strict structure–function constraints and multilateral interactions with other RNAs and proteins. There is some evidence that UCEs may be transcribed.

I think this indicates that the UCEs, like rRNA, are part of multilateral interactions in very large complexes similar to spliceosomes, translation initiation complexes or ribosomes, and (as in ribosomes, spliceosomes and initiation complexes) that every change has a broad effect on multiple proteins. The latter is clear, since the paper suggests that they are switches which have a broad effect in embryogenesis. Both reasons make many tiny individual mutations lethal and the likelihood of changes very low. RNA is still important today. If only the DNA sequence were the active part of the UCEs, they would be less conserved (less complex 3D interactions).

Edited December 31, 2012 by Jens
