Our lab is broadly interested in using computational methods to answer important questions in biology. Our focus is on using the very large genomic datasets that are rapidly becoming common as the cost of sequencing drops to answer questions in evolutionary biology. What interests us is not only the computational challenges that handling large-scale datasets creates, but also the under-appreciated problem of accuracy: methods developed and tested on small datasets are not always highly accurate when applied to large-scale data.

The research of our lab includes several elements:

  • Algorithm development
  • Implementation (e.g., programming) and (occasional) optimization
  • Statistics, probability, and some machine learning
  • (Big) data analysis

We often collaborate with international teams of biologists, medical researchers, computer scientists, and statisticians on large-scale data-driven projects. We have developed multiple scalable and highly accurate algorithms in each of these fields. The tools that implement these algorithms are listed on the Software page, and the corresponding papers are listed on the Publications page.

Research Topics

1. Phylogenetics

Our main specialization is in reconstructing and utilizing phylogenetic trees. A phylogeny is a tree that shows a reconstruction of how related species have evolved from a common ancestor through evolutionary time. Reconstructing phylogenies requires highly sophisticated computational methods. Phylogenetic reconstruction has many facets, some of which we have addressed in the past.

Phylogenomics:

Phylogenomics, as we use the term, refers to phylogenetic studies that use a large number of genes sampled from across the genome to reconstruct the evolutionary tree. The evolutionary histories of individual genes (i.e., gene trees) can be different from each other and from relationships between species as a whole (i.e., the species tree). When such discordances are due to a prevalent biological process called Incomplete Lineage Sorting (ILS), the species tree is statistically identifiable from the distribution of gene trees. Large-scale phylogenomic studies, made possible only recently, are important not only because more data leads to more statistical power, but also because they enable us to study gene tree distributions.
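The identifiability claim above can be illustrated in the simplest case of three species. The following is a minimal sketch, not part of any of our tools: it assumes the standard multispecies coalescent result for a rooted triplet, where a species tree ((A,B),C) with internal branch length t (in coalescent units) produces the matching gene tree topology with probability 1 - (2/3)e^(-t) and each of the two discordant topologies with probability (1/3)e^(-t). The function names and the chosen values of t are hypothetical and used only for illustration.

```python
import math
import random
from collections import Counter


def triplet_probabilities(t):
    """Gene tree topology probabilities for the species tree ((A,B),C).

    t is the internal branch length in coalescent units (an assumed input).
    """
    discordant = math.exp(-t) / 3.0
    return {
        "((A,B),C)": 1.0 - 2.0 * discordant,  # matches the species tree
        "((A,C),B)": discordant,
        "((B,C),A)": discordant,
    }


def simulate_gene_trees(t, n_genes, seed=0):
    """Draw n_genes gene tree topologies from the triplet distribution."""
    rng = random.Random(seed)
    probs = triplet_probabilities(t)
    topologies = list(probs)
    weights = [probs[topology] for topology in topologies]
    return Counter(rng.choices(topologies, weights=weights, k=n_genes))


def most_frequent_topology(counts):
    """Estimate the species tree as the most frequent gene tree topology."""
    return max(counts, key=counts.get)


if __name__ == "__main__":
    counts = simulate_gene_trees(t=0.5, n_genes=5000)
    print(counts)
    print("Estimated species tree:", most_frequent_topology(counts))
```

In this sketch, even when a large share of gene trees disagree with the species tree, the matching topology remains the most probable one, so its frequency across many genes points to the species tree. Shorter branches (smaller t) bring the three probabilities closer together, which is one reason a large number of genes helps.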

  • NSF grant 1565862: We will study gene tree estimation in a scenario where the true species tree is known, but discordances due to ILS are likely.

  • ASTRAL: a tool for estimating species trees from gene trees; ASTRAL improved both the accuracy and scalability of species tree estimation from gene trees compared to the state-of-the-art. It has been used by a rapidly growing number of biological studies. In 2016, we published a paper (number 32 in Publications) that added a feature to ASTRAL for estimating branch support and branch length.

  • Statistical binning: improves the accuracy of gene tree estimation.

  • We have been involved in two of the largest phylogenomic projects to date: the avian phylogenomics project, which sequenced the genomes of 48 birds, and the 1KP project, which sequenced the transcriptomes of 103 plant species. We first developed ASTRAL and statistical binning to analyze the data generated by these two projects.

Alignment/tree co-estimation

See the MSA section for more on this.

2. Metagenomics

Metagenomics is the study of whole communities of micro-organisms. The main challenge in metagenomics is identifying the unknown taxonomic composition of the community, given millions of fragmentary sequences. We have developed SEPP, a new algorithm that combines the idea of ensembles of HMMs and existing phylogenetic tools to place fragmentary metagenomic sequences on a reference phylogeny. We have also extended SEPP to a new tool called TIPP that estimates taxonomic profiles for metagenomic datasets. TIPP accounts for the uncertainty inherent in various steps of the phylogenetic placement for fragmentary data.

3. HIV

We have ongoing projects on various aspects of HIV, including understanding its spread and its integration into the human genome. This work is in collaboration with the CFAR center at UCSD.

4. Multiple Sequence Alignment

Before a set of related molecular sequences can be analyzed for various purposes, they first need to be aligned so that letters with a common origin (called homologous) are lined up. Many formulations of this problem, mostly NP-complete, have been studied, and accurate multiple sequence alignment (MSA) estimation for a few hundred sequences has been made feasible by various heuristics. However, accurately aligning thousands to millions of sequences has remained challenging. Partly in response to the needs of the 1KP project (where some gene families have more than 100,000 sequences), we have developed two new MSA methods, PASTA and UPP, both of which have been able to produce highly accurate alignments for up to a million sequences.
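To make concrete what "lining up" homologous letters means, here is a minimal sketch of the classic Needleman-Wunsch dynamic program for aligning just two sequences. It is not how PASTA or UPP work, and it does not scale to the dataset sizes discussed above. The scoring values (match, mismatch, gap) are arbitrary assumptions chosen for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align sequences a and b; return (row_a, row_b, score).

    Scoring parameters are illustrative assumptions, not tuned values.
    """
    n, m = len(a), len(b)

    def substitution(i, j):
        # Score for aligning a[i-1] against b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + substitution(i, j),  # align two letters
                score[i - 1][j] + gap,                     # gap in b
                score[i][j - 1] + gap,                     # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + substitution(i, j):
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(row_a)), "".join(reversed(row_b)), score[n][m]


if __name__ == "__main__":
    row_a, row_b, total = needleman_wunsch("ACGT", "AGT")
    print(row_a)
    print(row_b)
    print("score:", total)
```

The table filled in by this sketch has one cell per pair of positions, so its cost grows with the product of the two sequence lengths. Exact extensions of this idea to many sequences become impractical quickly, which is why methods for large datasets rely on heuristics.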

Research outcome

The main outcomes of these projects are the publications that describe the new algorithms and apply them to real biological datasets, together with the tools that implement these algorithms and are available for public use. In addition, our work has been covered in the press, and we have presented it at various venues.

Our work in press

Highlights and invited talks

  • INFORMS, 2015, Philadelphia, “Reconstruction of species histories using genomic data”.
  • RECOMB, 2015, Warsaw, “Statistical binning enables an accurate coalescent-based estimation of the avian tree”.
  • IPAM, 2015, Los Angeles, “Ultra-large multiple sequence alignments”.