Bigger is better: the largest phylogenetic tree reconstructed.
GenBank, the standard database for genetic information maintained by the National Center for Biotechnology Information, has been accumulating DNA sequences for nearly three decades now. Since its creation in the early 1980s, it has become the de facto repository for genetic information: genetic data must now be submitted to GenBank for a paper to be accepted for publication. Most of the accumulated sequence data come from many “local” taxonomic studies, each targeting a particular group of organisms and a relatively small but well-known collection of genes. Its contents now span hundreds of genes across all of life’s domains. So, what would happen if you were to take all the sequence information contained in GenBank and analyze it phylogenetically in a single, one-step study? Well, that is what Pablo A. Goloboff and coworkers just did, and the results were published in last week’s online early edition of Cladistics, the international journal of the Willi Hennig Society.
The phylogenetic analysis comprises an astonishing 73,060 terminal eukaryotic taxa, 9535 molecular characters and, for good measure, 604 morphological characters. It is therefore the largest phylogenetic analysis published to date, almost six times larger than the former world record. Such a feat presented many technical challenges. The logistics required automating every step of the analysis via computer scripts: retrieving and sorting thousands of GenBank entries, aligning the sequences to construct the data matrix, performing the actual searches for the optimal solutions, and interpreting the mammoth-size phylogenetic trees. The crux of the analysis, the search for the optimal phylogenetic trees, was done with the powerful parsimony program TNT running in parallel on three multi-processor computers for 2.5 months.
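To make “optimal” concrete: under parsimony, a tree is scored by the minimum number of character-state changes it requires, and the search looks for the trees with the lowest score. The sketch below scores a single character on a small rooted binary tree using Fitch's algorithm. It is only an illustration of the criterion, not TNT's implementation; the function name, the nested-tuple tree format, and the toy data are all assumptions made for this example.

```python
def fitch_length(tree, states):
    """Minimum number of state changes for one character on a rooted binary tree.

    tree   -- nested 2-tuples with leaf names (strings) at the tips (hypothetical format)
    states -- dict mapping each leaf name to its observed character state
    """
    def visit(node):
        # A leaf contributes its observed state and no changes.
        if isinstance(node, str):
            return {states[node]}, 0
        left, right = node
        left_set, left_cost = visit(left)
        right_set, right_cost = visit(right)
        shared = left_set & right_set
        if shared:
            # The children agree on at least one state: no extra change needed.
            return shared, left_cost + right_cost
        # The children disagree: take the union and count one change.
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]


# Toy data: one nucleotide position scored for four taxa.
states = {"A": "G", "B": "G", "C": "T", "D": "T"}
tree_1 = (("A", "B"), ("C", "D"))  # groups taxa that share a state
tree_2 = (("A", "C"), ("B", "D"))  # splits them apart

print(fitch_length(tree_1, states))  # 1
print(fitch_length(tree_2, states))  # 2
```

Here tree_1 needs one change and tree_2 needs two, so tree_1 is the more parsimonious. A real analysis sums such lengths over every character in the matrix and has to search among an enormous number of candidate trees.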

Fig. 1. Pruned strict consensus tree for the combined data set (seven trees, 1879 taxa excluded). The bar shows the span of 5000 species.
The resulting phylogeny recovers most traditional taxonomic groups. This is interesting for various reasons. First, as noted above, our understanding of the tree of life is the result of many taxonomically localized efforts that have been informally pasted together.[1] This is the first time a phylogeny has been reconstructed from scratch, letting the data speak for themselves, unconstrained, without assuming a priori that certain evolutionary relationships must be true. Second, it shows that the data contain enough historical information that the optimal solution is not a complete mess or a largely unresolved answer (consider that there are 9 × 10^345,593 possible trees for the number of terminals included). Third, it shows that we currently have the capacity, in terms of both software and hardware, to carry out such a large analysis. And last, but related to the previous two points, it shows that parsimony methods for phylogenetic reconstruction are up to the task. The latter point is worth noting because early simulations, based on just a few taxa (a grand total of four, actually), scared systematists into thinking that parsimony methods may result in erroneous reconstructions. Later studies using real data and a much larger collection of species have shown that this is not the case, and this 73,060-taxon analysis serves as the largest of these test cases.
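The number of possible trees quoted above can be checked with a few lines of arithmetic. The sketch below assumes that the figure counts unrooted, fully bifurcating trees, of which there are (2n − 5)!! for n terminals; that assumption is mine, and the function names are invented for this example. Because the number is far too large to hold as a float, the code works with logarithms.

```python
import math


def log10_unrooted_binary_trees(n_taxa):
    """log10 of (2n - 5)!!, the number of unrooted, fully bifurcating trees for n taxa."""
    if n_taxa < 3:
        raise ValueError("need at least three taxa")
    k = n_taxa - 2
    # Identity: (2n - 5)!! = (2k)! / (2**k * k!) with k = n - 2.
    # lgamma(x + 1) gives ln(x!) without building the huge integer.
    ln_count = math.lgamma(2 * k + 1) - k * math.log(2) - math.lgamma(k + 1)
    return ln_count / math.log(10)


def as_scientific(log10_value):
    """Split a log10 value into (mantissa, exponent) for scientific notation."""
    exponent = math.floor(log10_value)
    mantissa = 10 ** (log10_value - exponent)
    return mantissa, exponent


mantissa, exponent = as_scientific(log10_unrooted_binary_trees(73060))
print(f"about {mantissa:.1f} x 10^{exponent}")
```

For 73,060 terminals this gives a mantissa that rounds to 9 and an exponent of 345,593. For comparison, four taxa allow only three unrooted trees and five taxa allow fifteen.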
The authors are no strangers to the computer implementation of phylogenetic methods. James S. Farris is a pioneer in the field who developed the algorithmic foundations and produced some of the first phylogenetic programs in the late 1960s. Back then, the character information for each taxon to be analyzed was contained in a punch card, and a random addition sequence for tree construction meant that the set of cards was shuffled by hand before being fed into the terminal connected to the mainframe. Likewise, Pablo A. Goloboff has been responsible for many of the rapid search techniques developed from the 1990s to the present, which seek to cover the searchable tree space in a fast and efficient way.
It seems that, for phylogenetics, the only limit that remains is the availability of data.
References and notes
Goloboff, P., Catalano, S., Mirande, J. M., Szumik, C., Arias, J. S., Källersjö, M., & Farris, J. (2009). Phylogenetic analysis of 73 060 taxa corroborates major eukaryotic groups. Cladistics. DOI: 10.1111/j.1096-0031.2009.00255.x
1. Only more recently have “supertree” methods been developed, which seek to construct a large phylogeny from the consensus of multiple small, partially overlapping trees, following a more precise set of rules.
7 Comments to Bigger is better: the largest phylogenetic tree reconstructed.
May 4, 2009
That branch between Mammalia and Lepidosauria is intriguing. Does it represent the turtles?
“This is the first time a phylogeny has been reconstructed from scratch, letting the data speak unconstrained for itself without assuming that certain evolutionary relationships must be true a priori.”
Er… no. What do you mean? I don’t get it. In every phylogenetic analysis, the only assumption is that the ingroup is monophyletic with respect to the outgroup; and this assumption was made here, too.
“In every phylogenetic analysis, the only assumption is that the ingroup is monophyletic with respect to the outgroup; and this assumption was made here, too.”
You are right, ingroup monophyly with respect to the outgroup is the minimal assumption required to add directionality (rooting) to the tree. It is thus an inescapable parameter, common to all phylogenetic analyses, including this one.
Still, the tree by Goloboff and coworkers can be seen as the result of a global, unconstrained analysis when compared against the numerous local efforts produced over the years that form the multiple pieces of the Tree of Life puzzle as we know it.
May 6, 2009
Yes, this clade is composed of the 189 turtles we analyzed, forming the sister group of Lepidosauria + Archosauria (the untagged sister group of Aves is the crocs).
July 30, 2009
Would it be possible to publish the final tree online to make further investigation possible? It would interest me very much to look at the inferred Arthropod relationships.
Gunnar, I don’t have access to a high-resolution version of this tree, but you should definitely request such a figure from one of the original authors of the paper at his blog.
May 3, 2009