Abstract
Background
Human immunodeficiency virus type 1 (HIV1) infects cells by means of ligandreceptor interactions. This lentivirus uses the CD4 receptor in conjunction with a chemokine coreceptor, either CXCR4 or CCR5, to enter a target cell. HIV1 is characterized by high sequence variability. Nonetheless, within this extensive variability, certain features must be conserved to define functions and phenotypes. The determination of coreceptor usage of HIV1, from its protein envelope sequence, falls into a wellstudied machine learning problem known as classification. The support vector machine (SVM), with string kernels, has proven to be very efficient for dealing with a wide class of classification problems ranging from text categorization to protein homology detection. In this paper, we investigate how the SVM can predict HIV1 coreceptor usage when it is equipped with an appropriate string kernel.
Results
Three string kernels were compared. Accuracies of 96.35% (CCR5) 94.80% (CXCR4) and 95.15% (CCR5 and CXCR4) were achieved with the SVM equipped with the distant segments kernel on a test set of 1425 examples with a classifier built on a training set of 1425 examples. Our datasets are built with Los Alamos National Laboratory HIV Databases sequences. A web server is available at http://genome.ulaval.ca/hivdskernel webcite.
Conclusion
We examined string kernels that have been used successfully for protein homology detection and propose a new one that we call the distant segments kernel. We also show how to extract the most relevant features for HIV1 coreceptor usage. The SVM with the distant segments kernel is currently the best method described.
Background
The HIV1 genome contains 9 genes. One of the genes, the env gene, codes for 2 envelope proteins named gp41 and gp120. The gp120 envelope protein must bind to a CD4 receptor and a coreceptor prior to cell infection by HIV1. Two coreceptors can be used by HIV1: the CCR5 (chemokine receptor 5) and the CXCR4 (chemokine receptor 4). Some viruses are only capable of using the CCR5 coreceptor. Other viruses can only use the CXCR4 coreceptor. Finally, some HIV1 viruses are capable of using both of these coreceptors. The pathology of a strain of HIV1 is partly a function of the coreceptor usage [1]. The faster CD4+ cell depletion caused by CXCR4using viruses [2] makes the accurate prediction of coreceptor usage medically warranted. Specific regions of the HIV1 external envelope protein, named hypervariable regions, contribute to the turnover of variants from a phenotype to another [3]. HIV1 tropisms (R5, X4, R5X4) are often (but not always) defined in the following way. R5 viruses are those that can use only the CCR5 coreceptor and X4 viruses are those that can use only the CXCR4 coreceptor. R5X4 viruses, called dualtropic viruses, can use both coreceptors. Tropism switch occurs during progression towards AIDS. Recently, it has been shown that R5 and X4 viruses modulate differentially host gene expression [4].
Computeraided prediction
The simplest method used for HIV1 coreceptor usage prediction is known as the charge rule [5,6]. It relies only on the charge of residues at positions 11 and 25 within the V3 loop aligned against a consensus. The V3 loop is the third highly variable loop in the retroviral envelope protein gp120. Nonetheless, other positions are also important since the removal of these positions gave predictors with comparable (but weaker) performance to those that were trained with these positions present [1]. Other studies [712] also outlined the importance of other positions and proposed machine learning algorithms, such as the random forest [11] and the support vector machine (SVM) with structural descriptors [10], to built better predictors (than the charge rule). Available predictors (through webservers) of HIV1 coreceptor usage are enumerated in [13].
An accuracy of 91.56% for the task of predicting the CXCR4 usage was obtained by [10]. Their method, based on structural descriptors of the V3 loop, employed a single dataset containing 432 sequences without indels and required the multiple alignment of all V3 sequences. However, such a prior alignment before learning might remove information present in the sequences which is relevant to the coreceptor usage task. Furthermore, a prior multiple alignment done on all the data invalidates the crossvalidation method since the testing set in each fold has been used for the construction of the tested classifier. Another drawback of having an alignmentbased method is that sequences having too many indels (when compared to a consensus sequence) are discarded to prevent the multiple alignment from yielding an unacceptable amount of gaps. In this paper, we present a method for predicting the coreceptor usage of HIV1 which does not perform any multiple alignment prior to learning.
The SVM [14] has proven to be very effective at generating classifiers having good generalization (i.e., having high predicting accuracy). In particular, [1] have obtained a significantly improved predictor (in comparison with the charge rule) with an SVM equipped with a linear kernel. However, the linear kernel is not suited for sequence classification since it does not provide a natural measure of dissimilarity between sequences. Moreover, a SVM with a linear kernel can only use sequences that are exactly of the same length. Consequently, [1] aligned all HIV1 V3 loop sequences with respect to a consensus. No such alignment was performed in our experiments. In contrast, string kernels [15] do not suffer from these deficiencies and have been explicitly designed to deal with strings and sequences of varying lengths. Furthermore, they have been successfully used for protein homology detection [16] – a classification problem which is closely related to the one treated in this paper.
Consequently, we have investigated the performance of the SVM, equipped with the appropriate string kernel, at predicting the coreceptor used by HIV1 as a function of its protein envelope sequence (the V3 loop). We have compared two string kernels used for protein homology detection, namely the blended spectrum kernel [15,17] and the local alignment kernel [16], to a newly proposed string kernel, that we called the distant segments (DS) kernel.
Applications
Bioinformatic methods for predicting HIV phenotypes have been tested in different situations and the concordance is high [1821].
As described in [18], current bioinformatics programs are underestimating the use of CXCR4 by dualtropic viruses in the brain. In [19], a concordance rate of 91% was obtained between genotypic and phenotypic assays in a clinical setting of 103 patients. In [20], the authors showed that the SVM with a linear kernel achieves a concordance of 86.5% with the Trofile assay and a concordance of 79.7% with the TRT assay. Recombinant assays (Trofile and TRT) are described in [20].
Further improvements in available HIV classifiers could presumably allow the replacement of in vitro phenotypic assays by a combination of sequencing and machine learning to determine the coreceptor usage. DNA sequencing is cheap, machine learning technologies are very accurate whereas phenotypic assays are laborintensive and take weeks to produce readouts [13]. Thus, the next generation of bioinformatics programs for the prediction of coreceptor usage promises major improvements in clinical settings.
Methods
We used the SVM to predict the coreceptor usage of HIV1 as a function of its protein envelope sequence. The SVM is a discriminative learning algorithm used for binary classification problems. For these problems, we are given a training set of examples, where each example is labelled as being either positive or negative. In our case, each example is a string s of amino acids. When the binary classification task consists of predicting the usage of CCR5, the label of string s is +1 if s is the V3 loop of the protein envelope sequence of a HIV1 virion that uses the CCR5 coreceptor, and 1 otherwise. The same method applies for the prediction of the CXCR4 coreceptor usage. When the binary classification task consists of predicting the capability of utilizing CCR5 and CXCR4 coreceptors, the label of string s is +1 if s is the V3 loop of the protein envelope sequence of a HIV1 virion that uses both the CCR5 and CXCR4 coreceptors, and 1 if it is a virion that does not use CCR5 or does not use CXCR4.
Given a training set of binary labelled examples, each generated according to a fixed (but unknown) distribution D, the task of the learning algorithm is to produce a classifier f which will be as accurate as possible at predicting the correct class y of a test string s generated according to D (i.e., the same distribution that generated the training set). More precisely, if f (s) denotes the output of classifier f on input string s, then the task of the learner is to find f that minimizes the probability of error . A classifier f achieving a low probability of error is said to generalize well (on examples that are not in the training set).
To achieve its task, the learning algorithm (or learner) does not have access to the unknown distribution D, but only to a limited set of training examples, each generated according to D. It is still unknown exactly what is best for the learner to optimize on the training set, but the learning strategy used by the SVM currently provides the best empirical results for many practical binary classification tasks. Given a training set of labelled examples, the learning strategy used by the SVM consists at finding a softmargin hyperplane [14,22], in a feature space of high dimensionality, that achieves the appropriate tradeoff between the number of training errors and the magnitude of the separating margin realized on the training examples that are correctly classified (see, for example, [15]).
In our case, the SVM is used to classify strings of amino acids. The feature space, upon which the separating hyperplane is built, is defined by a mapping from each possible string s to a highdimensional vector ϕ (s). For example, in the case of the blended spectrum kernel [15], each component ϕ_{α }(s) is the frequency of occurrence in s of a specific substring α that we call a segment. The whole vector ϕ (s) is the collection of all these frequencies for each possible segment of at most p symbols. Consequently, vector ϕ (s) has components for an alphabet Σ containing Σ symbols. If w denotes the normal vector of the separating hyperplane, and b its bias (which is related to the distance that the hyperplane has from the origin), then the output f (s) of the SVM classifier, on input string s, is given by
where sgn(a) = +1 if a > 0 and 1 otherwise, and where ⟨w, ϕ (s)⟩ denotes the inner product between vectors w and ϕ (s). We have ⟨w, ϕ (s)⟩ = for ddimensional vectors. The normal vector w is often called the discriminant or the weight vector.
Learning in spaces of large dimensionality
Constructing a separating hyperplane in spaces of very large dimensionality has potentially two serious drawbacks. The first drawback concerns the obvious danger of overfitting. Indeed, with so many degrees of freedom for a vector w having more components than the number of training examples, there may exist many different w having a high probability of error while making very few training errors. However, several theoretical results [15,22] indicate that overfitting is unlikely to occur when a large separating margin is found on the (numerous) correctly classified examples – thus giving theoretical support to the learning strategy used by the SVM.
The second potential drawback concerns the computational cost of using very high dimensional feature vectors ϕ (s_{1}), ϕ (s_{2}),..., ϕ(s_{m}) of training examples. As we now demonstrate, this drawback can elegantly be avoided by using kernels instead of feature vectors. The basic idea consists of representing the discriminant w as a linear combination of the feature vectors of the training examples. More precisely, given a training set {(s_{1}, y_{1}), (s_{2}, y_{2}),..., (s_{m}, y_{m})} and a mapping ϕ (·), we write . The set {α_{1},..., α_{m}} is called the dual representation of the (primal) weight vector w. Consequently, the inner product ⟨w, ϕ (s)⟩, used for computing the output of an SVM classifier, becomes
where defines the kernel function associated with the feature map ϕ (·). With the dual representation, the SVM classifier is entirely described in terms of the training examples s_{i }having a nonzero value for α_{i}. These examples are called support vectors. The socalled "kernel trick" consists of using k (s, t) without explicitly computing ⟨ϕ (s), ϕ (t)⟩ – a computationally prohibitive task for feature vectors of very large dimensionality. This is possible for many feature maps ϕ (·). Consider again, for example, the blended spectrum (BS) kernel where each component ϕ_{α }(s) is the frequency of occurrence of a segment α in string s (for all words of at most p characters of an alphabet Σ). In this case, instead of performing multiplications to compute explicitly ⟨ϕ (s), ϕ (t)⟩, we can compute, for each position i in string s and each position j in string t, the number of consecutive symbols that matches in s and t. We use the bigOh notation to provide an upper bound to the running time of algorithms. Let T (n) denote the execution time of an algorithm on an input of size n. We say that T (n) is in O (g (n)) if and only if there exists a constant c and a critical n_{0 }such that T (n) ≤ cg (n) for all n ≥ n_{0}. The blended spectrum kernel requires at most O (p·s·t) time for each string pair (s, t) – an enormous improvement over the Ω (Σ^{p}) time required for the explicit computation of the inner product between a pair of feature vectors. In fact, there exists an algorithm [15] for computing the blended spectrum kernel in O (p·max (s, t)) time.
The distant segments kernel
The blended spectrum kernel is interesting because it contains all the information concerning the population of segments that are present in a string of symbols without considering their relative positions. Here, we propose the distant segments (DS) kernel that, in some sense, extends the BS kernel to include (relative) positional information of segments in a string of symbols.
If one considers the frequencies of all possible segment distances inside a string as its features, then a precise comparison can be done between any pair of strings. Remote protein homology can be detected using distances between polypeptide segments [23]. For any string s of amino acids, these authors used explicitly a feature vector ϕ (s) where each component ϕ_{d, α, α' }(s) denotes the number of times the (polypeptide) segment α' is located at distance d (in units of symbols) following the (polypeptide) segment α. They have restricted themselves to the case where α and α' have the same length p, with p ≤ 3. Since the distance d is measured from the first symbol in α to the first symbol in α', the d = 0 components of ϕ (s), i.e., ϕ_{0,α,α' }(s), are nonzero only for α = α' and represent the number of occurrences of segment α in string s. Consequently, this feature vector strictly includes all the components of the feature vector associated with the BS kernel but is limited to segments of size p (for p ≤ 3). By working with the explicit feature vectors, these authors were able to obtain easily the components of the discriminant vector w that are largest in magnitude and, consequently, are the most relevant for the binary classification task. However, the memory requirement of their algorithm increases exponentialy in p. Not surprisingly, only the results for p ≤ 3 were reported by [23].
Despite these limitations, the results of [23] clearly show the relevance of having features representing the frequency of occurrences of pairs of segments that are separated by some distance for protein remote homology detection. Hence, we propose in this section the distance segments (DS) kernel that potentially includes all the features considered by [23] without limiting ourselves to p ≤ 3 and to the case where the words (or segments) have to be of the same length. Indeed, we find no obvious biological motivation for these restrictions. Also, as we will show, there is no loss of interpretability of the results by using a kernel instead of the feature vectors. In particular, we can easily obtain the most significant components of the discriminant w by using a kernel. We will show that the time and space required for computing the kernel matrix and obtaining the most significant components of the discriminant w are bounded polynomially in terms of all the relevant parameters.
Consider a protein as a string of symbols from the alphabet Σ of amino acids. Σ* represents the set of all finite strings (including the empty string). For μ ∈ Σ*, μ denotes the length of the string μ. Throughout the paper, s, t, α, μ and ν will denote strings of Σ*, whereas θ and δ will be lengths of such strings. Moreover, μ ν will denote the concatenation of μ and ν. The DS kernel is based on the following set. Given a string s, let be the set of all the occurrences of substrings of length δ that are beginning by segment α and ending by segment α'. More precisely,
Note that the substring length δ is related to the distance d of [23] by δ = d + α' where d = α + ν when α and α' do not overlap. Note also that, in contrast with [23], we may have α ≠ α'. Moreover, the segments α and α' never overlap since μανα' μ' equals to the whole string s and 0 ≤ ν. We have made this choice because it appeared biologically more plausible to have a distance ranging from the end of the first segment to the beginning of the second segment. Nevertheless, we will see shortly that we can include the possibility of overlap between segments with a very minor modification of the kernel.
The DS kernel is defined by the following inner product
Hence, the kernel is computed for a fixed maximum value θ_{m }of segment sizes and a fixed maximum value δ_{m }of substring length. Note that, the number of strings of size θ of Σ* grows exponentially with respect to θ. Fortunately, we are able to avoid this potentially devastating combinatorial explosion in our computation of . Figure 1 shows the pseudocode of the algorithm. In the pseudocode, s [i] denotes the symbol located at position i in the string s (with i ∈ {1, 2,..., s}). Moreover, for any integers i, j, denotes if 0 ≤ i ≤ j, and 0 otherwise. Admittedly, it is certainly not clear that the algorithm of Figure 1 actually computes the value of given by Equation 2. Hence, a proof of correctness of this algorithm is presented at the appendix (located after the conclusion). The worstcase running time is easy to obtain because the algorithm is essentially composed of three imbricated loops: one for j_{s }∈ {0,..., s1}, one for j_{t }∈ {0,..., t1}, and one for i ∈ {1,..., min(s, t, δ_{m})}. The time complexity is therefore in O (s·t·min(s, t, δ_{m})).
Figure 1. The algorithm for computing .
Note that the definition of the DSkernel can be easily modified in order to accept overlaps between α and α'. Indeed, when overlaps are permitted, they can only occur when both α and α' start and end in {j_{s }+ i_{0},..., j_{s }+ i_{1}1}. The number of elements of for which i_{2r }≤ δ <i_{2r+1 }is thus the same for all values of r, including r = 0. Consequently, the algorithm to compute the DS kernel, when overlaps are permitted, is the same as the one in Figure 1 except that we need the replace the last two lines of the FOR loop, involved in the computation of c, by the single line:
Similar simple modifications can be performed for the more restrictive case of α = α'.
Extracting the discriminant vector with the distant segments kernel
We now show how to extract (with reasonable time and space resources) the components of the discriminant w that are nonzero. Recall that when the SVM contains l support vectors {(s_{1}, y_{1}),..., (s_{l}, y_{l})}. Recall also that each feature ϕ_{δ, α, α' }(s_{i}) is identified by a triplet (δ, α, α'), with δ ≥ α + α'. Hence, to obtain the nonzero valued components of w, we first obtain the nonzero valued features ϕ_{δ, α, α' }(s_{i}) from each support vector (with Algorithm EXTRACTFEATURES of Figure 2) and then collect and merge every feature of each support vector by multiplying each of them by α_{i}y_{i }(with Algorithm EXTRACTDISCRIMINANT of Figure 3).
Figure 2. The algorithm for extracting the features of a string s into a Map. Here, s (i : j) denotes the substring of s starting at position i and ending at position j.
Figure 3. The algorithm for merging every feature from the set S = {(s_{1}, y_{1}), (s_{2}, y_{2}), ..., (s_{l}, y_{l})} of all support vectors into a Map representing the discriminant w.
We transform each support vector ϕ (s_{i}) into a Map of features. Each Map key is an identifier for a (δ, α, α') having ϕ_{δ, α, α' }(s_{i}) > 0. The Map value is given by ϕ_{δ, α, α'}(s_{i}) for each key.
The worstcase access time for an AVLtreeMap of n elements is O (log n). Hence, from Figure 2, the time complexity of extracting all the (nonzero valued) features of a support vector is in . Moreover, since the total number of features inserted to the Map by the algorithm EXTRACTDISCRIMINANT is at most , the time complexity of extracting all the nonzero valued components of w is in .
SVM
We have used a publicly available SVM software, named SVM^{light }[24], for predicting the coreceptor usage. Learning SVM classifier requires to choose the right tradeoff between training accuracy and the magnitude of the separating margin on the correctly classified examples. This tradeoff is generally captured by a socalled softmargin hyperparameter C.
The learner must choose the value of C from the training set only – the testing set must be used only for estimating the performance of the final classifier. We have used the (wellknown) 10fold crossvalidation method (on the training set) to determine the best value of C and the best values of the kernel hyperparameters (that we describe below). Once the values of all the hyperparameters were found, we used these values to train the final SVM classifier on the whole training set.
Selected metrics
The testing of the final SVM classifier was done according to several metrics. Let P and N denote, respectively, the number of positive examples and the number of negative examples in the test set. Let TP, the number of "true positives", denote the number of positive testing examples that are classified (by the SVM) as positive. A similar definition applies to TN, the number of "true negatives". Let FP, the number of "false positives", denote the number of negative testing examples that are classified as positive. A similar definition applied to FN, the number of "false negatives". To quantify the "fitness" of the final SVM classifier, we have computed the accuracy, which is (TP+TN)/(P+T), the sensitivity, which is TP/P, and the specificity, which is TN/N. Finally, for those who cannot decide how much to weight the cost of a false positive, in comparison with a false negative, we have computed the "area under the ROC curve" as described by [25].
Unlike the other metrics, the accuracy (which is 1 – the testing error) has the advantage of having very tight confidence intervals that can be computed straightforwardly from the binomial tail inversion, as described by [26]. We have used this method to find if whether or not the observed difference of testing accuracy (between two classifiers) was statistically significant. We have reported the results only when a statistically significant difference was observed with a 90% confidence level.
Selected string kernels
One of the kernel used was the blended spectrum (BS) kernel that we have described above. Recall that the feature space, for this kernel, is the count of all kmers with 1 ≤ k ≤ p. Hence p is the sole hyperparameter of this kernel.
We have also used the local alignment (LA) kernel [16] which can be thought of as a softmax version of the SmithWaterman local alignment algorithm for pair of sequences. Indeed, instead of considering the alignment that maximizes the SmithWaterman (SW) score, the LA kernel considers every local alignment with a Gibbs distribution that depends on its SW score. Unfortunately, the LA kernel has too many hyperparameters precluding their optimization by crossvalidation. Hence, a number of choices were made based on the results of [16]. Namely, the alignment parameters were set to (BLOSUM 62, e = 11, d = 1) and the empirical kernel map of the LA kernel was used. The hyperparameter β was the only one that was adjusted by crossvalidation.
Of course, the proposed distant segments (DS) kernel was also tested. The θ_{m }hyperparameter was set to δ_{m }to avoid the limitation of segment length. Hence, δ_{m }was the sole hyperparameter for this kernel that was optimized by crossvalidation.
Other interesting kernels, not considered here because they yielded inferior results (according to [16], and [23]) on the remote protein homology detection problem, include the mismatch kernel [27] and the pairwise kernel [28].
Datasets
The V3 loop sequence and coreceptor usage of HIV1 samples were retrieved from Los Alamos National Laboratory HIV Databases http://www.hiv.lanl.gov/ webcite through available online forms.
Every sample had a unique GENBANK identifier. Sequences containing #, $ or * were eliminated from the dataset. The signification of these symbols was reported by Brian Foley of Los Alamos National Laboratory (personal communication). The # character indicates that the codon could not be translated, either because it had a gap character in it (a frameshifting deletion in the virus RNA), or an ambiguity code (such as R for purine). The $ and * symbols represent a stop codon in the RNA sequence. TAA, TGA or TAG are stop codons. The dataset was first shuffled and then splitted halfhalf, yielding a training and a testing set. The decision to shuffle the dataset was made to increase the probability that both the training and testing examples are obtained from the same distribution. The decision to use half of the dataset for testing was made in order to obtain tight confidence intervals for accuracy.
Samples having the same V3 loop sequence and a different coreceptor usage label are called contradictions. Contradictions were kept in the datasets to have prediction performances that take into account the biological reality of dual tropism for which frontiers are not well defined.
Statistics were compiled for the coreceptor usage distribution, the count of contradictions, the amount of samples in each clades and the distribution of the V3 loop length.
Results
Here we report statistics on our datasets, namely the distribution, contradictions, subtypes and the varying lengths. We also show the results of our classifiers on the HIV1 coreceptor usage prediction task, a brief summary of existing methods and an analysis of the discriminant vector with the distant segments kernel.
Statistics
In Table 1 is reported the distribution of coreceptor usages in the datasets created from Los Alamos National Laboratory HIV Databases data. In the training set, there are 1225 CCR5utilizing samples (85.9%), 375 CXCR4utilizing samples (26.3%) and 175 CCR5andCXCR4utilizing samples (12.2%). The distribution is approximatly the same in the test set. There are contradictions (entries with the same V3 sequence and a different coreceptor usage) in all classes of our datasets. A majority of viruses can use CCR5 in our datasets.
Table 1. Datasets. Contradictions are in parenthesis.
In Table 2, the count is reported regarding HIV1 subtypes, also known as genetic clades. HIV1 subtype B is the most numerous in our datasets. The clade information is not an attribute that we provided to our classifiers, we only built our method on the primary structure of the V3 loop. Therefore, our method is independant of the clades.
Table 2. HIV1 subtypes.
The V3 loops have variable lengths among the virions of a population. In our dataset (Table 3), the majority of sequences has exactly 36 residues, although the length ranges from 31 to 40.
Table 3. Sequence length distribution. The minimum length is 31 residues and the maximum length is 40 residues.
Coreceptor usage predictions
Classification results on the three different tasks (CCR5, CXCR4, CCR5andCXCR4) are presented in Table 4 for three different kernels.
For the CCR5usage prediction task, the SVM classifier achieved a testing accuracy of 96.63%, 96.42%, and 96.35%, respectively, for the BS, LA, and DS kernels. By using the binomial tail inversion method of [26], we find no statistically significant difference, with 90% confidence, between kernels.
For the CXCR4usage prediction task, the SVM classifier achieved a testing accuracy of 93.68%, 92.21%, and 94.80%, respectively, for the BS, LA, and DS kernels. By using the binomial tail inversion method of [26], we find that the difference is statistically significant, with 90% confidence, for the DS versus the LA kernel.
For the CCR5andCXCR4usage task, the SVM classifier achieved a testing accuracy of 94.38%, 92.28 %, and 95.15%, respectively, for the BS, LA, and DS kernels. Again, we find that the difference is statistically significant, with 90% confidence, for the DS versus the LA kernel.
Overall, all the tested string kernels perform well on the CCR5 task, but the DS kernel is significantly better than the LA kernel (with 90% confidence) for the CXCR4 and CCR5andCXCR4 tasks. For these two prediction tasks, the performance of the BS kernel was closer to the one obtained for the DS kernel than the one obtained for the LA kernel.
Classification with the perfect deterministic classifier
Also present in Table 4 are the results of the perfect deterministic classifier. This classifier is the deterministic classifier achieving the highest possible accuracy on the test set. For any input string s in a testing set T, the perfect determinist classifier (h*) returns the most frequently encountered class label for string s in T. Hence, the accuracy on T of h* is an overall measure of the amount of contradictions that are present in T. There are no contradictions in T if and only if the testing accuracy of h* is 100%. As shown in Table 4, there is a significant amount of contradictions in the test set T. These results indicate that any deterministic classifier cannot achieve an accuracy greater than 99.15%, 98.66% and 97.96%, respectively for the CCR5, CXCR4, and CCR5andCXCR4 coreceptor usage tasks.
Discriminative power
To determine if a SVM classifier equipped with the distant segments (DS) kernel had enough discriminative power to achieve the accuracy of perfect determinist classifier, we trained the SVM, equipped with the DS kernel, on the testing set. From the results of Table 4, we conclude that the SVM equipped with the DS kernel possess sufficient discriminative power since it achieved (almost) the same accuracy as the perfect deterministic classifier for all three tasks. Hence, the fact that the SVM with the DS kernel does not achieve the same accuracy as the perfect determinist classifier when it is obtained from the training set (as indicated in Table 4) is not due to a lack of discriminative power from the part of the learner.
Discriminant vectors
The discriminant vector that maximizes the softmargin has (almost always) many nonzero valued components which can be extracted by the algorithm of Figure 3. We examine which components of the discriminant vector have the largest absolute magnitude. These components give weight to the most relevant features for a given classification task. In Figure 4, we describe the most relevant features for each tasks. Only the 20 most significant features are shown.
Figure 4. Features (20 are shown) with highest and lowest weights for each coreceptor usage prediction task.
A subset of positiveweighted features shown for CCR5utilizing viruses are also in the negativeweighted features shown for CXCR4utilizing viruses. Furthermore, a subset of positiveweighted features shown for CXCR4utilizing viruses are also in the negativeweighted features reported for CCR5utilizing viruses. Thus, CCR5 and CXCR4 discriminant models are complementary. However, since 3 tropisms exist (R5, X4 and R5X4), features contributing to CCR5andCXCR4 should also include some of the features contributing to CCR5 and some of the features contributing to CXCR4. Among shown positiveweighted features for CCR5andCXCR4, there are features that also contribute to CXCR4 ([8, R, R], [13, R, T], [9, R, R]). On another hand, this is not the case for CCR5. However, only the twenty most relevant features have been shown and there are many more features, with similar weights, that contribute to the discriminant vector. In fact, the classifiers that we have obtained depend on a very large number of features (instead of a very small subset of relevant features).
Discussion
The proposed HIV1 coreceptorusage prediction tool achieved very high accuracy in comparison with other existing prediction methods. In view of the results of Pillai et al, we have shown that the SVM classification accuracy can be greatly improved with the usage of a string kernel. Surprisingly, the local alignment (LA) kernel, which makes an explicit use of biologicallymotivated scoring matrices (such as BLOSUM 62), turns out to be outperformed by the blended spectrum (BS) and the distant segments (DS) kernels which do not try to exploit any concept of similarity between residues but rely, instead, on a very large set of easilyinterpretable features. Thus, a weightedmajority vote over a very high number of simple features constitutes a very productive approach, that is both sensitive and specific to what it is trained for, and applies well in the field of viral phenotype prediction.
Comparison with available bioinformatic methods
In Table 5, we show a summary of the available methods. The simplest method (the charge rule) has an accuracy of 87.45%. Thus, the charge rule is the worst method presented in table 5. The SVM with string kernels is the only approach without multiple alignments. Therefore, V3 sequences with many indels can be used with our method, but not with the other. These other methods were not directly tested here with our datasets because they all rely on multiple alignments. The purpose of those alignments is to produce a consensus and to yield transformed sequences having all the same length. As indicated by the size of the training set in those methods, sequences having larger indels were discarded, thus making these datasets smaller. Most of the methods rely on crossvalidation to perform quality assessment but, as we have mentioned, this is problematic when multiple alignments are performed prior to learning, since, in these cases, the testing set in each fold is used for the construction of the tested classifier. It is also important to mention that the various methods presented in Table 5 do not produce predictors for the same coreceptor usage task. Indeed, the definition of X4 viruses is not always the same: some authors refer to it as CXCR4only while other use it as CXCR4utilizing. It is thus unfeasible to assess the fitness of these approaches, which are twisted by crossvalidation, multiple alignments and heterogeneous dataset composition.
Table 5. Available methods. The results column contains the metric and what the classifier is predicting.
The work by Lamers and colleagues [12] is the first development in HIV1 coreceptor usage prediction regarding dualtropic viruses. Using evolved neural networks, an accuracy of 75.50% was achieved on a training set of 149 sequences with the crossvalidation method. However, the SVM equipped with the distant segments kernel reached an accuracy of 95.15% on a large test set (1425 sequences) in our experiments. Thus, our SVM outperforms the neural network described by Lamers and colleagues [12] for the prediction of dualtropic viruses.
Los Alamos National Laboratory HIV Databases
Although we used only the Los Alamos National Laboratory HIV Databases as our source of sequence information, it is notable that this data provider represents a metaresource, fetching bioinformation from databases around the planet, namely GenBank (USA, http://www.ncbi.nlm.nih.gov/Genbank/ webcite), EMBL (Europe, http://www.ebi.ac.uk/embl/ webcite) and DDBJ (Japan, http://www.ddbj.nig.ac.jp/ webcite). Researchers cannot directly send their HIV sequences to LANL, but it is clear that this approach makes this database less likely to contain errors.
Noise
The primary cause of contradictions (e.g. a sequence having two or more phenotypes) remains uncharacterized. It may be due to a particular mix, to some extent, of virion envelope attributes (regions other than the V3) and of the host cell receptor counterparts. As genotypic assays, based on bioinformatics prediction software, rely on sequencing technologies, they are likely to play a more important role in clinical settings as sequencing cost drops. Nextgeneration sequencing platforms promise a radical improvement on the throughput and more affordable prices. Meanwhile, effective algorithmic methods with proven statistical significance must be developed. Bioinformatics practitioners have to innovate by creating new algorithms to deal with large datasets and need to take into consideration sequencing errors and noise in phenotypic assay readouts. Consequently, we investigated the use of statistical machine learning algorithms, such as the SVM, whose robustness against noise has been observed in many classification tasks. The high accuracy results we have obtained here indicate that this is also the case for the task of predicting the coreceptor usage of HIV1. While it remains uncertain whether or not other components of the HIV1 envelope contribute to the predictability of the viral phenotype, we have shown that the V3 loop alone produces very acceptable outputs despite the presence of a small amount of noise.
Web server
To allow HIV researchers to test our method on the web, we have implemented a web server for the HIV1 coreceptor usage prediction. The address of this web server is http://genome.ulaval.ca/hivdskernel webcite. In this setting, one has to submit fastaformatted V3 sequences in a web form. Then, using the dual representation of the SVM with the distant segments kernel, the software predicts the coreceptor usage of each submitted viral sequence. Those predictions are characterized by high accuracy (according to results in Table 4). Source codes for the web server and for a commandline backend are available in additional file 1.
Additional file 1. Source code and data. Web server, classifiers, discriminant vectors and data sets.
Format: ZIP Size: 9MB Download file
Conclusion
To our knowledge, this is the first paper that investigates the use of string kernels for predicting the coreceptor usage of HIV1. Our contributions include a novel string kernel (the distant segments kernel), a SVM predictor for HIV1 coreceptor usage with the identification of the most relevant features and stateoftheart results on accuracy, specificity, sensitivity and receiver operating characteristic. As suggested, string kernels outperform all published algorithms for HIV1 coreceptor usage prediction. Large margin classifiers and string kernels promise improvements in drug selection, namely CCR5 coreceptor inhibitors and CXCR4 coreceptor inhibitors, in clinical settings. Since the binding of an envelope protein to a receptor/coreceptor prior to infection is not specific to HIV1, one could extend this work to other diseases. Furthermore, most ligand interactions could be analyzed in such a fashion. Detailed features in primary structures (DNA or protein sequences) can be elucidated with the proposed bioinformatic method. Although we have exposed that even the perfect algorithm (entitled "Perfect determinist classifier") can not reach faultless outcomes, we have also empirically demonstrated that our algorithms are very competitive (more than 96% with distant segments for CCR5). It is thus feasible to apply kernel methods based on features in primary structures to compare sequence information in the perspective of predicting a phenotype. The distant segments kernel has broad applicability in HIV research such as drug resistance, coreceptor usage (as shown in this paper), immune escape, and other viral phenotypes.
Appendix
Proof of the correctness of the distant segments kernel
We now prove that the algorithm DISTANTSEGMENTSKERNEL (s, t, δ_{m}, θ _{m}) does, indeed, compute as defined by Equation 2.
Proof. For each pair (j_{s}, j_{t}) such that s [j_{s }+ 1] = t [j_{t }+ 1] and each δ ≥ 2, let us define to be the set of all triples (δ, (μ_{s}, α, ν_{s}, α', ), (μ_{t}, α, ν_{t}, α', )) such that
Clearly, is the sum of all the values of  over all the possible pairs (j_{s}, j_{t}). Moreover, it is easy to see that  can be computed only from the knowledge of the set of indices i ∈ {1,..., n} satisfying property P (i) : = s [j_{s }+ i] = t [j_{t }+ i]. Note that when the test of the first WHILE loop is performed, the value of i is such that P (i) is valid but not P (i  1). Moreover, in the second WHILE loop, P (i) remains valid, except for the last test. Thus, s [j_{s }+ i] = t [j_{t }+ i] if and only if i_{2r }≤ i <i_{2r+1 }for some r ∈ {0,..., k}. This, in turn, implies that each element of must be such that i_{2r }≤ δ <i_{2r+1 }for some r. To obtain the result, it is therefore sufficient to prove that both of theses properties hold.
P1 The number of elements of , for which i_{0 }<δ <i_{1}, is given by
P2 For r ∈ {1,..., k}, the number of elements of , for which i_{2r }≤ δ <i_{2r+1}, is given by
To prove these properties, we will use the fact that, if 1 ≤ i ≤ k, then counts the number of sequences ⟨a_{1},..., a_{i}⟩ satisfying 1 ≤ a_{1 }<a_{2 }< ⋯ <a_{i }≤ k where {a_{1}, a_{2}, ..., a_{i}} are i string positions. Moreover, for any substring u of s, let us denote by b_{u }the starting position of u in s, and by e_{u }the first position of s after the substring u (if s ends with u, choose e_{u }= s + 1).
We first prove P2. Fix an r. Since i_{0 }= 1, we have that any ssubstring α of length ≤ θ_{m }with b_{α }= j_{s }+ i_{0 }and e_{α }≤ j_{s }+ i_{1 }together with any ssubstring α' of length ≤ θ_{m }such that
will give rise to exactly one element of with i_{2r }≤ δ <i_{2r+1}. Conversely, each element such that i_{2r }≤ δ <i_{2r+1 }will have an α and an α' with these properties. Since α has to start at j_{s }+ i_{0}, it is easy to see that the number of such possible α is exactly min(θ_{m}, i_{1 } i_{0}). Thus let us show that the number of possible α' is exactly .
Since l_{r }gives the number of positions from j_{s }+ i_{2r }to j_{s }+ i_{2r+1 }inclusively, counts all the possible, choices of b_{α' }and e_{α' }with j_{s }+ i_{2r }≤ b_{α' }<e_{α' }≤ j_{s }+ i_{2r+1}. Thus counts the number of possible strings α' of all possible lengths (including lengths > θ_{m}). On another hand, the number of α' having a length > θ_{m}. is equal to . Indeed, if l_{r } θ_{m }< 2, = 0 as wanted, and otherwise, there is a onetoone correspondence between the set of all sequences ⟨a_{1}, a_{2}⟩ such that 1 ≤ a_{1 }<a_{2 }≤ l_{r } θ_{m }and the set of all α' of length > θ_{m}. The correspondence is obtained by putting b_{α' }= i_{2r }+ a_{1 } 1 and e_{α' }= i_{2r }+ θ_{m }+ a_{2 } 1.
The proof for P1 is similar to the one for P2 except that we have to consider the fact that both α and α' start and end in {j_{s }+ i_{0},..., j_{s }+ i_{1 } 1}. Since no overlap is allowed and b_{α }= j_{s }+ i_{0}, we must have
Since l_{0 }gives the number of positions from j_{s }+ i_{0 }to j_{s }+ i_{1 }inclusively, counts all the possible choices of α and α' for all possible lengths. Recall that if l_{0 }< 3, which can only occur if i_{1 }= i_{0 }+ 1, we have that , as wanted.
On another hand, counts all the possible choices of α of length > θ_{m }and of α' of arbitrary length. This set of possible choices is non empty only if l_{0 } θ_{m }≥ 3 and, then, the onetoone correspondence between a sequence ⟨a_{1}, a_{2}, a_{3}⟩ such that 1 ≤ a_{1 }<a_{2 }<a_{3 }≤ l_{0 } θ_{m }and the values of ⟨e_{α}, b_{α'}, e_{α'}⟩ is
Similarly, counts all the possible choices of α' of length > θ_{m }and of α of arbitrary length, the correspondence being e_{α } 1 = j_{s }+ a_{1}, b_{α' }= j_{s }+ a_{2 }and e_{α' }= j_{s }+ θ_{m }+ a_{3}. Finally, counts all the possible choices of α and α', both of length > θ_{m}. In the cases where such possible choices exist (i.e., if l_{0 } 2θ_{m }≥ 3), the correspondence is e_{α } 1 = j_{s }+ θ_{m }+ a_{1}, b_{α' }= j_{s }+ θ_{m }+ a_{2 }and e_{α' }= j_{s }+ 2θ_{m }+ a_{3}. Then, property P1 immediately follows from the inclusionexclusion argument. □
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
SB, MM, FL and JC drafted the manuscript. FL wrote the proof for the distant segments kernel. SB performed experiments. SB, MM, FL and JC approved the manuscript.
Acknowledgements
This project was funded by the Canadian Institutes of Health Research and by the Natural Sciences and Engineering Research Council of Canada (MM, 122405 and FL, 262067). JC is the holder of Canada Research Chair in Medical Genomics.
References

Pillai S, Good B, Richman D, Corbeil J: A new perspective on V3 phenotype prediction.
AIDS Res Hum Retroviruses 2003, 19:145149. PubMed Abstract

Richman D, Bozzette S: The impact of the syncytiuminducing phenotype of human immunodeficiency virus on disease progression.
J Infect Dis 1994, 169:968974. PubMed Abstract

Zhang L, Robertson P, Holmes EC, Cleland A, Leigh Brown A, Simmonds P: Selection for specific V3 sequences on transmission of human immunodeficiency virus.
J Virol 1993, 67:334556. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Sirois M, Robitaille L, Sasik R, Estaquier J, Fortin J, Corbeil J: R5 and X4 HIV viruses differentially modulate host gene expression in resting CD4+ T cells.
AIDS Res Hum Retroviruses 2008, 24:485493. PubMed Abstract  Publisher Full Text

Milich L, Margolin B, Swanstrom R: V3 loop of the human immunodeficiency virus type 1 Env protein: interpreting sequence variability.
J Virol 1993, 67:56235634. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Fouchier R, Groenink M, Kootstra N, Tersmette M, Huisman H, Miedema F, Schuitemaker H: Phenotypeassociated sequence variation in the third variable domain of the human immunodeficiency virus type 1 gp120 molecule.
J Virol 1992, 66:31833187. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Resch W, Hoffman N, Swanstrom R: Improved success of phenotype prediction of the human immunodeficiency virus type 1 from envelope variable loop 3 sequence using neural networks.
Virology 2001, 288:5162. PubMed Abstract  Publisher Full Text

Jensen M, Li F, van 't Wout A, Nickle D, Shriner D, He H, McLaughlin S, Shankarappa R, Margolick J, Mullins J: Improved coreceptor usage prediction and genotypic monitoring of R5toX4 transition by motif analysis of human immunodeficiency virus type 1 env V3 loop sequences.
J Virol 2003, 77:1337613388. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Jensen M, Coetzer M, van 't Wout A, Morris L, Mullins J: A reliable phenotype predictor for human immunodeficiency virus type 1 subtype C based on envelope V3 sequences.
J Virol 2006, 80:46984704. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Sander O, Sing T, Sommer I, Low A, Cheung P, Harrigan P, Lengauer T, Domingues F: Structural descriptors of gp120 V3 loop for the prediction of HIV1 coreceptor usage.
PLoS Comput Biol 2007, 3:e58. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Xu S, Huang X, Xu H, Zhang C: Improved prediction of coreceptor usage and phenotype of HIV1 based on combined features of V3 loop sequence using random forest.
J Microbiol 2007, 45:441446. PubMed Abstract  Publisher Full Text

Lamers S, Salemi M, McGrath M, Fogel G: Prediction of R5, X4, and R5X4 HIV1 coreceptor usage with evolved neural networks.
IEEE/ACM Trans Comput Biol Bioinform 2008, 5:291300. PubMed Abstract  Publisher Full Text

Lengauer T, Sander O, Sierra S, Thielen A, Kaiser R: Bioinformatics prediction of HIV coreceptor usage.
Nat Biotechnol 2007, 25:14071410. PubMed Abstract  Publisher Full Text

ShaweTaylor J, Cristianini N: Kernel Methods for Pattern Analysis. Cambridge University Press; 2004.

Saigo H, Vert J, Ueda N, Akutsu T: Protein homology detection using string alignment kernels.
Bioinformatics 2004, 20:16821689. PubMed Abstract  Publisher Full Text

Leslie C, Eskin E, Noble W: The spectrum kernel: a string kernel for SVM protein classification.
Pac Symp Biocomput 2002, 564575. PubMed Abstract

Mefford M, Gorry P, Kunstman K, Wolinsky S, Gabuzda D: Bioinformatic prediction programs underestimate the frequency of CXCR4 usage by R5X4 HIV type 1 in brain and other tissues.
AIDS Res Hum Retroviruses 2008, 24:12151220. PubMed Abstract  Publisher Full Text

Raymond S, Delobel P, Mavigner M, Cazabat M, Souyris C, SandresSauné K, Cuzin L, Marchou B, Massip P, Izopet J: Correlation between genotypic predictions based on V3 sequences and phenotypic determination of HIV1 tropism.
AIDS 2008, 22:F1116. PubMed Abstract

Skrabal K, Low A, Dong W, Sing T, Cheung P, Mammano F, Harrigan P: Determining human immunodeficiency virus coreceptor use in a clinical setting: degree of correlation between two phenotypic assays and a bioinformatic model.
J Clin Microbiol 2007, 45:279284. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Sing T, Low A, Beerenwinkel N, Sander O, Cheung P, Domingues F, Büch J, Däumer M, Kaiser R, Lengauer T, Harrigan P: Predicting HIV coreceptor usage on the basis of genetic and clinical covariates.
Antivir Ther (Lond) 2007, 12:10971106. PubMed Abstract

Vapnik V: Statistical learning Theory. New York: Wiley; 1998.

Lingner T, Meinicke P: Remote homology detection based on oligomer distances.
Bioinformatics 2006, 22:22242231. PubMed Abstract  Publisher Full Text

Joachims T: Making largeScale SVM Learning Practical. In Advances in Kernel Methods – Support Vector Learning. Edited by Scholkopf B, Burges C, Smola A. MIT Press; 1999.

Gribskov M, Robinson N: Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching.
Comput Chem 1996, 20:2533. PubMed Abstract  Publisher Full Text

Langford J: Tutorial on practical prediction theory for classification.

Leslie C, Eskin E, Cohen A, Weston J, Noble W: Mismatch string kernels for discriminative protein classification.
Bioinformatics 2004, 20:467476. PubMed Abstract  Publisher Full Text

Liao L, Noble W: Combining pairwise sequence similarity and support vector machines for remote protein homology detection.
Proceedings of the Sixth Annual Conference on Research in Computational Molecular Biology 2002, 225232.