Table 1

Comparison of subtype assignments (jpHMM results versus current database assignment that is based on the original literature)

                                AG set                    BC set
Num of sequences                Full length (world)       Full length (world)             Fragments (Asia)
                                N = 140                   N = 509                         N = 4413
Database subtype                A     G     02    AG      B     C     07    08    BC      B     C     07    08    BC
Num of sequences                72    12    48    8       152   334   7     4     12      3133  1048  17    171   44
Num of problematic sequences¹   1     0     2     0       15    12    0     0     3       0     0     0     0     0
Num of discordant sequences²    0     0     1     0       2     0     0     0     2       24    6     6     102   27

                                BF set
Num of sequences                Full length (world)                         Fragments (S. America)
                                N = 220                                     N = 4153
Database subtype                B     F     12    17    28    29    BF      B     F     12    17    28    29    BF
Num of sequences                152   12    11    2     3     4     36      3070  242   261   0     0     0     580
Num of problematic sequences¹   15    0     0     0     0     0     0       0     0     0     0     0     0     0
Num of discordant sequences²    2     2     6     2     1     1     1       74    19    31    0     0     0     107

1. Problematic sequences are those that could not be unequivocally assigned. They meet one of the following criteria: (a) they contain an unusually high proportion of the IUPAC code N (defined as > 100 consecutive Ns, or > 7% N for sequences shorter than 1000 nt, > 5% N for sequences of 1000-2999 nt, or > 3% N for sequences of 3000 nt or longer); (b) they contain an artifactual deletion of > 100 nt.

2. Classification of the sequences was compared between the database assignments (of which the majority were extracted from the literature) and the jpHMM predictions.
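The thresholds in the first footnote can be written as a small check. The following Python sketch is illustrative only and is not the authors' code: the function names are hypothetical, sequence length is assumed to mean the ungapped nucleotide count, and the size of any artifactual deletion is assumed to be supplied by the caller, since it cannot be determined from an unaligned sequence alone.

```python
def longest_n_run(seq: str) -> int:
    """Return the length of the longest run of consecutive N characters."""
    longest = current = 0
    for base in seq.upper():
        current = current + 1 if base == "N" else 0
        longest = max(longest, current)
    return longest


def n_fraction_limit(length: int) -> float:
    """Maximum tolerated fraction of N for a sequence of the given length (nt)."""
    if length < 1000:
        return 0.07
    if length < 3000:
        return 0.05
    return 0.03


def is_problematic(seq: str, artifactual_deletion_nt: int = 0) -> bool:
    """Apply the footnote's criteria to one sequence.

    artifactual_deletion_nt is a hypothetical input: the length of the
    largest artifactual deletion, determined elsewhere (assumption).
    """
    s = seq.upper()
    if not s:
        return False  # assumption: an empty record is simply not evaluated
    # N-content criterion, part 1: more than 100 consecutive Ns
    if longest_n_run(s) > 100:
        return True
    # N-content criterion, part 2: length-dependent percentage of N
    if s.count("N") / len(s) > n_fraction_limit(len(s)):
        return True
    # Deletion criterion: artifactual deletion of more than 100 nt
    return artifactual_deletion_nt > 100
```

For example, a fragment with 6.25% N is below the 7% limit at 800 nt but above the 5% limit at 1600 nt, so the same N density is flagged only for the longer sequence.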

Zhang et al. Retrovirology 2010 7:25   doi:10.1186/1742-4690-7-25
