An Alignment-Independent Approach for the Study of Viral Sequence Diversity at Any Given Rank of Taxonomy Lineage


Creative Commons License

Chong L. C., Lim W. L., Ban K. H. K., Khan A. M.

BIOLOGY-BASEL, vol.10, no.9, 2021 (SCI-Expanded)

  • Publication Type: Article
  • Volume: 10 Issue: 9
  • Publication Date: 2021
  • Doi Number: 10.3390/biology10090853
  • Journal Name: BIOLOGY-BASEL
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Academic Search Premier, BIOSIS, CAB Abstracts, Veterinary Science Database, Directory of Open Access Journals
  • Keywords: minimal set, alignment independent, alignment-free, sequence diversity, proteome, virus, UNIQmin
  • Bezmialem Vakıf University Affiliated: No

Abstract

Simple Summary: Viral sequence variation can expand the host repertoire, enhance infection ability, and/or prevent the host from building up long-term specific immunity. The study of viral diversity is thus critical to understanding sequence change and its implications for intervention strategies. Typically, these studies are performed using alignment-dependent approaches. However, such approaches become limited as sequence diversity increases. Herein, we present an alignment-free algorithm, implemented as a publicly available tool, UNIQmin, to determine the effective viral sequence diversity at any rank of the viral taxonomy lineage. UNIQmin generates a minimal set for a given sequence dataset of interest and is applicable to big data, with reasonable time performance. The minimal set is the smallest possible number of unique sequences required to represent the peptidome diversity (the pool of distinct peptides of a specific length) exhibited by a non-redundant dataset. This compression is achieved by removing unique sequences that do not contribute effectively to the peptidome diversity pool. The utility of UNIQmin was demonstrated for the species Dengue virus, genus Flavivirus, family Flaviviridae, and superkingdom Viruses. The concept of a minimal set is generic and thus possibly applicable to both genomic and proteomic data of non-viral, pathogenic microorganisms.
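
The minimal-set idea described in the abstract can be illustrated with a short sketch. The code below is not the UNIQmin implementation; it is a simple greedy set-cover heuristic, written under the assumption that peptidome diversity means the set of all overlapping peptides of length k. The function names `peptides` and `greedy_minimal_set` and the default k = 9 are hypothetical choices made for this illustration. A greedy cover preserves every distinct peptide but is not guaranteed to yield the smallest possible set.

```python
def peptides(seq, k=9):
    """Return the set of overlapping k-mers in a sequence.

    The default k is an illustrative assumption, not a value taken
    from the abstract. Sequences shorter than k yield an empty set.
    """
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def greedy_minimal_set(sequences, k=9):
    """Pick sequences that together cover every distinct k-mer.

    Greedy set-cover heuristic (an assumption of this sketch, not the
    published algorithm): repeatedly keep the sequence that contributes
    the most k-mers not yet covered.
    """
    # Build the non-redundant dataset, preserving input order.
    unique = list(dict.fromkeys(sequences))
    kmers = {s: peptides(s, k) for s in unique}

    # The peptidome diversity pool: all distinct k-mers in the dataset.
    uncovered = set().union(*kmers.values())

    chosen = []
    while uncovered:
        # Ties are resolved in favour of the earliest sequence.
        best = max(unique, key=lambda s: len(kmers[s] & uncovered))
        chosen.append(best)
        uncovered -= kmers[best]
    return chosen


if __name__ == "__main__":
    toy = ["ABCDEF", "ABCD", "CDEF", "XYZW", "ABCD"]
    print(greedy_minimal_set(toy, k=3))
```

In the toy dataset, "ABCD" and "CDEF" contribute no 3-mers beyond those already present in "ABCDEF", so they are dropped, while "XYZW" is retained because its peptides occur nowhere else.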