BIOLOGY-BASEL, vol.10, no.9, 2021 (SCI-Expanded)
Simple Summary: Viral sequence variation can expand the host repertoire, enhance infection ability, and/or prevent the build-up of long-term specific immunity by the host. The study of viral diversity is thus critical to understanding sequence change and its implications for intervention strategies. Typically, these studies are performed using alignment-dependent approaches. However, such approaches become limited as sequence diversity increases. Herein, we present an alignment-free algorithm, implemented as a publicly available tool, UNIQmin, to determine the effective viral sequence diversity at any rank of the viral taxonomic lineage. UNIQmin generates a minimal set for a given sequence dataset of interest and is applicable to big data, with reasonable time performance. The minimal set is the smallest possible number of unique sequences required to represent the peptidome diversity (the pool of distinct peptides of a specific length) exhibited by a non-redundant dataset. This compression is achieved by removing unique sequences that do not contribute effectively to the peptidome diversity pool. The utility of UNIQmin was demonstrated for the species Dengue virus, genus Flavivirus, family Flaviviridae, and superkingdom Viruses. The concept of a minimal set is generic and thus possibly applicable to both genomic and proteomic data of non-viral, pathogenic microorganisms.
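The minimal-set idea described above can be illustrated with a small Python sketch. This is not UNIQmin's implementation, which the summary does not detail. It is a generic greedy set-cover heuristic, so it yields a small covering set that is not guaranteed to be the smallest possible. The function names `kmers` and `greedy_minimal_set` and the toy sequences are hypothetical, and the peptide length `k` is left as a parameter because the summary does not state a value.

```python
def kmers(seq, k):
    """Return the set of overlapping peptides of length k found in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def greedy_minimal_set(sequences, k):
    """Greedy approximation of a minimal set (illustrative only).

    Repeatedly picks the sequence that covers the most peptides not yet
    covered, until every distinct peptide in the dataset is represented.
    """
    # Drop exact duplicates first, keeping the order of first appearance,
    # to mimic working from a non-redundant dataset.
    unique = list(dict.fromkeys(sequences))
    kmer_sets = {s: kmers(s, k) for s in unique}

    # The peptidome: all distinct k-mers across the dataset.
    uncovered = set()
    for peptides in kmer_sets.values():
        uncovered |= peptides

    chosen = []
    while uncovered:
        # Sequence contributing the most still-uncovered peptides.
        best = max(unique, key=lambda s: len(kmer_sets[s] & uncovered))
        chosen.append(best)
        uncovered -= kmer_sets[best]
    return chosen


# Toy data (made up): "ABCD" and "BCDE" add no peptides beyond "ABCDE".
toy = ["ABCDE", "ABCD", "BCDE", "ABCDE", "XYZW"]
minimal = greedy_minimal_set(toy, k=3)
```

In this toy run, the sequences whose peptides are all contained in a longer sequence are dropped, while the selected set still carries every distinct 3-mer of the input. Sequences shorter than `k` contribute no peptides and are never selected.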