Vector quantization of amino acids: Analysis of the HIV V3 loop region

A. B. Olshen*, P. C. Cosman, A. G. Rodrigo, P. J. Bickel, R. A. Olshen

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review


Abstract

This paper concerns techniques for clustering sequences, such as those of nucleic or amino acids. Our application is the definition of viral subtypes of HIV on the basis of similarities among amino acid sequences of the V3 loop region of the envelope (env) gene. The techniques introduced here could apply with virtually no change to other HIV genes, as well as to other problems and to data not necessarily of viral origin. As applied to quantitative data, these algorithms have been widely used in engineering to compress images and speech. They are known as vector quantization and involve a mapping from a large number of possible inputs to a much smaller number of outputs. Many implementations exist for choosing the set of possible outputs and the mapping, in particular those that go by the name generalized Lloyd or k-means. Each attempts to maximize similarity among the inputs that map to any single output or, alternatively, to minimize some measure of distortion between input and output. Here, two standard types of vector quantization are brought to bear on the problem of clustering V3 loop amino acid sequences. The results of this clustering are compared with those of the well-known UPGMA algorithm, the unweighted pair group method with arithmetic averages.
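To make the generalized Lloyd iteration named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not say which distortion measure or centroid rule the paper uses, so Hamming distance and a per-position most-common-residue centroid are assumptions chosen for illustration. The five-residue strings are invented toy data, not real V3 loop sequences, and the function names are hypothetical.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def nearest(seq, codebook):
    """Index of the codeword with the smallest distortion (first index wins ties)."""
    return min(range(len(codebook)), key=lambda j: hamming(seq, codebook[j]))

def centroid(members):
    """Per-position most common residue; ties are broken alphabetically."""
    out = []
    for column in zip(*members):
        counts = Counter(column)
        best = max(counts.values())
        out.append(min(r for r, c in counts.items() if c == best))
    return "".join(out)

def lloyd(seqs, codebook, max_iter=50):
    """Alternate nearest-codeword assignment and centroid update until labels stabilize."""
    codebook = list(codebook)
    labels = None
    for _ in range(max_iter):
        new_labels = [nearest(s, codebook) for s in seqs]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(len(codebook)):
            members = [s for s, l in zip(seqs, labels) if l == j]
            if members:  # an empty cell keeps its previous codeword
                codebook[j] = centroid(members)
    distortion = sum(hamming(s, codebook[l]) for s, l in zip(seqs, labels))
    return codebook, labels, distortion

# Invented toy sequences (assumption: aligned and of equal length).
toy_seqs = ["CTRPN", "CTRPS", "CTRPN", "CIRPG", "CIKPG", "CIRPG"]
# Assumed initialization: two of the inputs serve as the starting codebook.
codebook, labels, distortion = lloyd(toy_seqs, [toy_seqs[0], toy_seqs[3]])
print(codebook, labels, distortion)
```

Each pass maps every input to its closest output and then recomputes each output from the inputs assigned to it, so the total distortion cannot increase from one pass to the next. The result depends on the initial codebook, which is why practical uses typically try several initializations.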

Original language: English
Pages (from-to): 277-298
Number of pages: 22
Journal: Journal of Statistical Planning and Inference
Volume: 130
Issue number: 1-2
Publication status: Published - 1 Mar 2005
Externally published: Yes
