Almost three years ago, my friend George Sutphin and I were out for a hike, chatting about orthologous genes. George studies the genetics of aging in mice at The Jackson Lab. But since mice live so long, he also studies aging in C. elegans. Thus, George finds himself looking back and forth between genomes trying to find the mouse gene that “corresponds” to a given worm gene. In this case, “corresponds” means “is an ortholog of”. A lot of ortholog mapping has been worked out by bioinformaticians using sophisticated statistical models of sequence data, but there are still a number of tough edge cases, and not all methods agree. On our fateful hike, George told me about a meta-strategy that was being used in the worm community, which was to pool multiple predictors by simple voting. George was in the process of building the biggest, baddest meta-tool on the market. I had to get in on the action.
Voting for orthologs assumes that if many methods call a gene an ortholog, then it is more likely to be an ortholog than if only one method does. This seemed perfectly sensible to me, but I had the following benign thought: If you have examples of orthologous gene pairs and non-orthologous gene pairs, then you could use machine learning to learn the difference between their respective voting patterns. George and I agreed it would be sensible to try and, long story short, it worked! I am extremely pleased to bring you the fruits of that labor, our PLOS Computational Biology paper WORMHOLE: Novel Least Diverged Ortholog Prediction through Machine Learning.
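For the curious, here is a toy sketch of the difference between simple voting and learning from voting patterns. It is not the method from the paper: the four predictors, the gene pairs, and the labels are all made up, and a hand-rolled logistic regression stands in as the simplest possible learner. The data are deliberately rigged so that one hypothetical predictor is reliable and the other three are noisy, which is exactly the situation where counting votes goes wrong.

```python
import math

# Hypothetical toy data (not from the paper): each row holds the votes
# (1 = "ortholog") of four made-up predictors for one gene pair.
# By construction, predictor 0 is reliable and predictors 1-3 are noisy.
VOTES = [
    (1, 0, 0, 0), (1, 1, 0, 0), (1, 0, 1, 0), (1, 1, 1, 1),                 # true orthologs
    (0, 1, 1, 0), (0, 1, 0, 1), (0, 0, 1, 1), (0, 0, 0, 0), (0, 1, 1, 1),   # non-orthologs
]
LABELS = [1, 1, 1, 1, 0, 0, 0, 0, 0]


def vote_predict(votes, threshold=2):
    """Simple voting: call an ortholog if at least `threshold` predictors agree."""
    return int(sum(votes) >= threshold)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=0.5, epochs=5000):
    """Fit per-predictor weights and a bias by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def learned_predict(votes, w, b):
    """Learned rule: a weighted vote, with weights fit from labeled examples."""
    return int(b + sum(wj * xj for wj, xj in zip(w, votes)) > 0)


def accuracy(preds, labels):
    return sum(p == t for p, t in zip(preds, labels)) / len(labels)


w, b = train_logistic(VOTES, LABELS)
vote_acc = accuracy([vote_predict(v) for v in VOTES], LABELS)
learned_acc = accuracy([learned_predict(v, w, b) for v in VOTES], LABELS)
print(f"simple voting: {vote_acc:.2f}, learned weights: {learned_acc:.2f}")
```

On this rigged example, counting votes gets 4 of the 9 pairs right, while the learned weights separate all 9, because the learner figures out which predictor to trust. Both numbers are measured on the same pairs used for fitting, so they only illustrate the idea; a real comparison would score held-out gene pairs.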
Three years on, it feels good to have this finally in print. From the bottom of my heart, I wish you a happy WORMHOLE day!