NIH Research Festival
Inferred orthology (i.e., homology via speciation) among genes is commonly used to predict gene product function. Orthology is also a key consideration when classifying genes coherently and consistently across taxa, but the granularity of current prediction tools is too coarse to resolve clusters of orthologs (i.e., orthogroups) within specific gene families. As a result, classification is generally left to the discretion of individual curators who manually inspect gene trees. Here, we present a method that improves granularity and automates classification. This work extends Markov clustering (MCL), a popular method for identifying clusters in all-by-all similarity graphs. Current MCL-based ortholog clustering tools rely on the BLASTP local alignment algorithm to create similarity metrics, but BLASTP discards information when sequences are very similar or very dissimilar, which limits the ability of MCL to resolve orthogroups. Global alignment methods, in contrast, can generate more information-rich metrics between known homologs. Current MCL-based approaches also depend on user-specified parameters that control the final groupings. Default values are conventionally used because it is impossible to know the ‘correct’ parameters for a dataset beforehand. Because defaults are unlikely to be optimal for any particular dataset, we have implemented a novel scoring system that facilitates dynamic optimization of MCL parameters. A final weakness of MCL is its assumption that rates of evolution are homogeneous among groups. This assumption is unrealistic, but we can overcome it by recursively decomposing predicted orthogroups with further rounds of dynamic MCL. Our new method is called ‘Recursive Dynamic MCL’ and has been implemented as an open source Python project (https://github.com/biologyguy/RD-MCL).
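For readers unfamiliar with MCL, the following is a minimal sketch of the basic algorithm only: alternating expansion and inflation on a column-stochastic similarity matrix. It is not the RD-MCL implementation and does not reproduce its scoring system, parameter optimization, or recursion. The function names, the toy similarity values, the self-loop weight of 1.0, and the thresholds are all assumptions made for illustration. The `inflation` argument stands in for the kind of user-specified parameter that controls how fine the final groupings are.

```python
def normalize_columns(matrix):
    """Scale each column so it sums to 1 (column-stochastic)."""
    n = len(matrix)
    sums = [sum(matrix[r][c] for r in range(n)) for c in range(n)]
    return [[matrix[r][c] / sums[c] if sums[c] else 0.0 for c in range(n)]
            for r in range(n)]


def expand(matrix):
    """Expansion step: square the matrix (two-step random walks)."""
    n = len(matrix)
    return [[sum(matrix[i][k] * matrix[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]


def inflate(matrix, inflation):
    """Inflation step: raise entries to a power, then renormalize columns."""
    return normalize_columns([[x ** inflation for x in row] for row in matrix])


def markov_cluster(similarity, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster a symmetric similarity matrix (hypothetical helper).

    Returns a sorted list of clusters, each a sorted list of node indices.
    """
    n = len(similarity)
    # Copy the input and add self-loops (weight 1.0 is an assumption here).
    m = [[float(x) for x in row] for row in similarity]
    for i in range(n):
        m[i][i] = 1.0
    m = normalize_columns(m)

    for _ in range(max_iter):
        nxt = inflate(expand(m), inflation)
        change = max(abs(nxt[r][c] - m[r][c])
                     for r in range(n) for c in range(n))
        m = nxt
        if change < tol:  # stop once the matrix no longer changes
            break

    # Each row that retains weight is an attractor; the columns it still
    # reaches form one cluster. Duplicate clusters are collapsed by the set.
    clusters = set()
    for row in m:
        members = frozenset(c for c, x in enumerate(row) if x > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Toy all-by-all similarity matrix (invented values): sequences 0-2 are
# highly similar to each other, as are 3-4, with weak similarity between.
toy_similarity = [
    [1.0, 0.9, 0.8, 0.1, 0.0],
    [0.9, 1.0, 0.7, 0.0, 0.1],
    [0.8, 0.7, 1.0, 0.1, 0.0],
    [0.1, 0.0, 0.1, 1.0, 0.9],
    [0.0, 0.1, 0.0, 0.9, 1.0],
]

if __name__ == "__main__":
    print(markov_cluster(toy_similarity, inflation=2.0))
```

On this toy input, the two densely connected groups should separate into distinct clusters. Rerunning with a different `inflation` value is a simple way to see how a single fixed parameter shapes the groupings, which is the limitation that dynamic parameter optimization is meant to address.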
Scientific Focus Area: Computational Biology
This page was last updated on Friday, March 26, 2021