DNA barcodes identify Central Asian Colias butterflies (Lepidoptera, Pieridae)

Abstract A majority of the known Colias species (Lepidoptera: Pieridae, Coliadinae) occur in the mountainous regions of Central Asia, vast areas that are hard to access, so that knowledge of many species remains limited for lack of extensive sampling. Two gene regions, the mitochondrial COI ‘barcode’ region and the nuclear ribosomal protein RpS2 gene region, were used to explore the utility of these DNA markers for species identification. A comprehensive sampling of COI barcodes for Central Asian Colias butterflies showed that the barcodes facilitated identification of most of the included species. Phylogenetic reconstruction based on parsimony and Neighbour-Joining recovered most species as monophyletic entities. For the RpS2 gene region, species-specific sequences were registered for some of the included Colias spp. Nevertheless, this gene region was not deemed useful as an additional molecular ‘barcode’. A parsimony analysis of the combined COI and RpS2 data did not support the current subgeneric classification based on morphological characteristics.


Introduction
The use of a standardized gene region, i.e. a 650 bp fragment of the 5'-region of the mitochondrial cytochrome c oxidase subunit I (hereafter COI), as a DNA barcode (Hebert et al. 2003) to facilitate identification of biological specimens, as well as to call attention to possible new species, has generated a steadily increasing number of DNA barcoding studies of invertebrates (Taylor and Harris 2012), and particularly of Lepidoptera (see www.lepbarcoding.org). While the utility of DNA barcoding as an investigative tool has gained much support, a number of problems remain related to the use of a single DNA sequence as a taxon barcode. Several studies on Lepidoptera have shown that species may be polymorphic and/or share haplotypes (Nice et al. 2002, Wahlberg et al. 2003, Elias et al. 2007, Schmidt and Sperling 2008), so that identifications may become less reliable. Additionally, it has been shown that incomplete lineage sorting or mitochondrial introgression could obscure the delimitation of closely related taxa (Tautz et al. 2003). Using one or a few specimens as representatives of a species provides little information about intraspecific variation, particularly for widely distributed species (e.g. Funk and Omland 2003, Seberg et al. 2003, Sperling 2003).

The genus Colias
The butterfly genus Colias Fabricius, 1807 belongs to the family Pieridae (subfamily Coliadinae) and comprises about 85 species. Most of its species have a limited distribution in the Arctic and Alpine regions of the Holarctic realm, but two species occur in the Afrotropical and seven are known from the Neotropical regions (Verhulst 2000). A few species are widely distributed and common, such as the Palaearctic C. erate (Esper, 1805) and C. croceus (Geoffroy, 1785), and the Nearctic C. eurytheme Boisduval, 1852 and C. philodice Godart, 1819. As a consequence, these taxa are frequently used in ethological, ecological and genetic research (e.g. Pollock et al. 1998, Wang and Porter 2004, Porter and Levin 2010). Colias erate and C. croceus are a species pair where only typical specimens can be reliably distinguished morphologically, and members of these species are known to frequently hybridize (e.g. Dinca et al. 2011 and references therein). Lukhtanov et al. (2009) indicated that mitochondrial introgression was a likely explanation for the shared barcodes they registered between these sympatric taxa. The Nearctic taxa C. eurytheme and C. philodice are broadly sympatric sister species that hybridize frequently and that likely share a significant portion of their genomes through introgression (e.g. Wang and Porter 2004, Porter and Levin 2010). Verhulst (2000) illustrated hybrid individuals of six species of Colias from the Palaearctic region, including C. croceus.
The Central Asian mountainous regions harbour nearly half of all Colias species. The distribution, ecology and taxonomy are still incompletely documented for most of these species, mainly due to their remote occurrences (Verhulst 2000). Central Asian Colias species occurring in remote mountainous areas that are hard to access have been far less studied than their North American or European congeners. An important part of the older material that exists in museum collections worldwide (e.g. from Tibet) originates from early collecting expeditions in the late 19th and early 20th centuries. Important material was, however, also collected within the former Soviet Union during the 20th century. Fieldwork in Central Asia has subsequently become less complicated, and thus new material is again available for research. As a result, new species such as Colias aegidii Verhulst, 1990 and Colias adelaidae Verhulst, 1991 have been described, as well as a number of new subspecies. Despite an increasing research effort on Central Asian Colias species, there are as yet no published studies on their phylogenetic relationships.
The first contribution to the species classification of Colias was made by Berger (1986), who used a few morphological characters to establish a comprehensive subgeneric classification, comprising the subgenera Colias Fabricius, 1807, Neocolias Berger, 1986, Eucolias Berger, 1986, Eriocolias Watson, 1895, Palaeocolias Berger, 1986, Similicolias Berger, 1986, Scalidoneura Butler, 1869 and Paracolias Berger, 1986. Later, Ferris (1993) used 84, mainly morphological, characters to reconstruct a phylogeny of all North American Colias species known at that time, which was the first species phylogeny within the genus Colias. The first contribution to the knowledge of the molecular phylogenetic relationships of the North American Colias species was made by Pollock et al. (1998), who studied a number of Colias species using a 333 bp sequence fragment of the mtDNA COI gene. They found some small differences between species classified in the subgenera Neocolias and Eriocolias, thus supporting Berger's (1986) separation of Neocolias from Eriocolias. Pollock et al. (1998) also noted that even though Colias is a speciose genus, this was not mirrored in the COI sequence diversity. Wheat and Watt (2008) studied the molecular phylogenetic relationships of North American Colias taxa using mitochondrial gene sequences (ribosomal 12S and 16S rRNA, Leu2 and Val tRNA and COI + II). Their results showed that the COI sequences only allowed identification of some of the taxa supported by the full data set used in their study. The results of their study further suggested that species radiations within Colias are young compared with those of related pierid butterflies, since molecular divergences among species were small. Based on molecular data, Brunton (1998) studied the phylogenetic relationships of the 12 Colias species occurring in Europe. He recovered three monophyletic groups largely corresponding to geographical distributions. He concluded that the Scandinavian species appeared to be the oldest in Europe, sharing a common ancestor with Colias species from the USA. According to Brunton (1998), the European Colias species radiated from Scandinavia to the rest of Europe, forming an eastern clade and a western clade. As with Pollock et al. (1998), the results did not agree with Berger's (1986) subgeneric classification.
The aim of the present study was to test the usefulness of COI barcodes for species identification of a broad representation of Central Asian Colias species, including nine Colias species overlapping with Lukhtanov et al.'s (2009) study, and 19 species not previously barcoded. In addition, we wanted to elucidate the informativeness of the RpS2 gene region that Wahlberg and Wheat (2008) found informative for lepidopteran phylogenetic relationships. We tested the nuclear ribosomal protein gene RpS2 as a potential complementary barcode region for Colias and for use in a combined analysis with COI for testing the current subgeneric classification of the species in the present study. We also contrasted our COI barcodes against a larger set of COI barcodes of Colias taxa available from GenBank (GB).

Study area and taxon sampling
This study includes material from the mountain regions of Kirgizistan, Tadzhikistan, northern Afghanistan, northern Pakistan and India (e.g. mountain ranges Tian Shan, Hindu Kush, Karakorum, Himalaya) and the mountain regions in the Chinese provinces Qinghai, Gansu, Sichuan, Yunnan and the autonomous regions Tibet and Xinjiang Uygur. The Colias fauna of these Central Asian regions comprises about 34 species (Verhulst 2000), while the species number for Central Asia in the broad sense is over 40.
The taxon sampling aimed to cover as many of the Colias species from this area as possible. Additionally, a few Colias species occurring in adjacent territories (e.g. Buryatia) were also available for molecular study. Whenever possible, several individuals of each species were analysed to assess intraspecific variation. The available specimens used for molecular study consisted of a total of 56 adult specimens covering 27 species of Central Asian Colias and two Colias species from adjacent territories (Table 1). The specimens are preserved as DNA voucher specimens and labelled accordingly, to be deposited in the collections of the Zoological Museum of the Finnish Museum of Natural History, Helsinki, Finland (MZH) (DNA voucher specimens MZH_JL1-JL71). Species identifications were verified by JL based on easily recognizable diagnostic characters using the monograph by Verhulst (2000), while the taxonomy follows Grieshuber and Lamas (2007). Additionally, we used 35 COI barcode sequences (17 species) of Palaearctic Colias species obtained from GB, as listed in Table 2.

Laboratory methods
Total genomic DNA was extracted from 2-5 legs of dried, pinned butterfly specimens using the NucleoSpin® Tissue Kit (Macherey-Nagel), according to the manufacturer's protocols, and resuspended in 50 µl ultrapure water.
The primer pair LCO-1490 (5'-GGTCAACAAATCATAAAGATATTGG-3') and HCO-2198 (5'-TAAACTTCAGGGTGACCAAAAAATCA-3') (Folmer et al. 1994) was used to amplify a ca. 650 bp fragment of the mitochondrial COI gene. The polymerase chain reactions (PCR) were run under the following parameters: initial heating at 95 °C for 2 min, followed by 30 cycles of 94 °C for 30 s, 49 °C for 30 s and 72 °C for 2 min, and a final extension at 72 °C for 7 min. The primer pair RpS2 nF (5'-ATCWCGYGGTGGYGATAGAG-3') and RpS2 nR (5'-ATGRGGCTTKCCRATCTTGT-3') (Wahlberg and Wheat 2008) was used to amplify a ca. 400 bp fragment of the nuclear RpS2 gene. The PCR followed the cycling profile described in Wahlberg and Wheat (2008): initial heating at 95 °C for 7 min, 40 cycles of 95 °C for 30 s, 50 °C for 30 s and 72 °C for 2 min, and a final extension at 72 °C for 10 min. Sequencing of the double-stranded PCR product was carried out on an ABI PRISM® 377 Automated Sequencer (Applied Biosystems) following the manufacturer's recommendations. All PCR primers were used for sequencing. Sequences were inspected and edited using Sequence Navigator® (Applied Biosystems).
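The RpS2 primers contain IUPAC ambiguity codes (W, Y, R, K), so each primer name stands for a small pool of oligonucleotides. The following minimal Python sketch, which is an illustration and not part of the original protocol, expands a degenerate primer into the unambiguous sequences it encodes. The function name expand_primer is hypothetical; the code table is the standard IUPAC one.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}

def expand_primer(primer):
    """Return every unambiguous sequence encoded by a degenerate primer.

    Hypothetical helper for illustration only.
    """
    choices = [IUPAC[base] for base in primer.upper()]
    return ["".join(combo) for combo in product(*choices)]

# RpS2 nF as quoted in the text: one W and two Y positions.
rps2_nf = "ATCWCGYGGTGGYGATAGAG"
variants = expand_primer(rps2_nf)
print(len(variants))  # 2 * 2 * 2 = 8 variants
```

The Folmer primers contain no ambiguity codes, so the same function returns a single sequence for each of them.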

Sequence analysis
We analysed and clustered our sequence data using parsimony and Neighbour-Joining (NJ) of K2P-distances. We used parsimony and NJ for our newly generated COI sequence dataset, NJ for RpS2 sequences, parsimony for the concatenated COI and RpS2 sequences, and, finally, NJ for the combined COI sequences generated in this study and those in GB. All trees were rooted using Papilio glaucus (family Papilionidae) and Aporia crataegi (Pieridae, subfamily Pierinae) as outgroup taxa. Parsimony analysis was performed using NONA (Goloboff 1999), spawned with the aid of Winclada (Nixon 2002), using a heuristic search algorithm with 1000 random addition replicates (mult*1000), holding 10 trees per round (hold/10), max trees set to 10,000 and applying TBR branch swapping. All base positions were treated as equally weighted characters. Nodal support was assessed with bootstrap resampling (1000 replicates) using Winclada (Nixon 2002). MEGA5 (Tamura et al. 2011) was used for NJ clustering using 1000 bootstrap replicates. The Kimura 2-parameter model was used for NJ clustering of the COI sequences, while the Tamura-Nei model with gamma distributed rates was chosen for the RpS2 sequences.
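For readers unfamiliar with the Kimura 2-parameter (K2P) distance, the sketch below shows how it is computed for one pair of aligned sequences. With P the proportion of transitional differences and Q the proportion of transversional differences, the distance is d = -0.5 ln((1 - 2P - Q) sqrt(1 - 2Q)). This is an illustrative stand-alone implementation, not the MEGA5 code used in the study. The function name k2p_distance and the choice to skip sites with gaps or ambiguous bases are assumptions.

```python
import math

PURINES = {"A", "G"}

def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance between two aligned DNA sequences.

    Sites where either sequence has a gap or ambiguous base are skipped
    (an assumption of this sketch).
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to equal length")
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        # Purine<->purine or pyrimidine<->pyrimidine is a transition.
        if (a in PURINES) == (b in PURINES):
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    term1 = 1 - 2 * p - q
    term2 = 1 - 2 * q
    if term1 <= 0 or term2 <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(term1 * math.sqrt(term2))

# Toy example: one transition (A<->G) in ten sites.
print(round(k2p_distance("AAAAAAAAAA", "GAAAAAAAAA"), 4))
```

At the shallow divergences reported for Colias, the corrected distance stays close to the uncorrected proportion of differing sites.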

Sequences
We obtained a 643 bp COI barcode for 56 Colias specimens, and a 409 bp fragment of RpS2 was obtained for 49 specimens (Table 1). A+T content of the COI sequences was 69.22%, and of the RpS2 45.0%. There were 115 parsimony informative sites for COI and 39 for RpS2.
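The two summary statistics reported here can be illustrated with a short sketch. A site is parsimony informative when at least two character states each occur in at least two sequences. The helper names and the toy alignment below are hypothetical and are not the study's data; how the published A+T percentage was averaged over specimens is not stated, so the function simply reports the value for one sequence.

```python
from collections import Counter

def at_content(seq):
    """Percentage of A and T among the unambiguous bases of one sequence."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    return 100.0 * sum(b in "AT" for b in bases) / len(bases)

def parsimony_informative_sites(alignment):
    """Return 0-based column indices that are parsimony informative.

    A column qualifies when at least two different bases each occur
    in at least two sequences; gaps and ambiguity codes are ignored.
    """
    informative = []
    for i, column in enumerate(zip(*alignment)):
        counts = Counter(b for b in column if b in "ACGT")
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            informative.append(i)
    return informative

# Hypothetical toy alignment of four sequences.
toy = ["AAGT", "AAGC", "ATAT", "ATAT"]
print(parsimony_informative_sites(toy))  # columns 1 and 2 qualify
print(at_content("ATGC"))                # 50.0
```

In the toy alignment the last column varies but is not informative, because the C occurs in only one sequence.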
Uncorrected pairwise divergences between ingroup taxa ranged between 1.09 and 4.09% (mean 2.77%) for COI and 0.0-1.7% (mean 1.0%) for RpS2. GenBank accession numbers are given in Table 1. Intraspecific uncorrected distances were up to 1.09% (in C. thisoa) for COI, with specimens of most species differing by less than 4 nucleotide changes.
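As a rough guide to the scale of these values, 7 differences over a 643 bp alignment correspond to about 1.09%, assuming all 643 positions are compared. The sketch below shows one way to compute uncorrected (p) distances and to separate intraspecific from interspecific comparisons. The function names, the input layout and the toy sequences are hypothetical.

```python
from itertools import combinations

def p_distance(seq1, seq2):
    """Uncorrected distance in percent over sites unambiguous in both sequences."""
    pairs = [(a, b) for a, b in zip(seq1.upper(), seq2.upper())
             if a in "ACGT" and b in "ACGT"]
    return 100.0 * sum(a != b for a, b in pairs) / len(pairs)

def split_distances(samples):
    """Split all pairwise p-distances into intra- and interspecific lists.

    samples maps a sample label to a (species, sequence) tuple.
    """
    intra, inter = [], []
    for (_, (sp1, s1)), (_, (sp2, s2)) in combinations(samples.items(), 2):
        d = p_distance(s1, s2)
        (intra if sp1 == sp2 else inter).append(d)
    return intra, inter

# Hypothetical toy data: two samples of species A, one of species B.
toy = {
    "s1": ("A", "ACGTACGTAC"),
    "s2": ("A", "ACGTACGTAT"),
    "s3": ("B", "TCGTACGTAT"),
}
intra, inter = split_distances(toy)
print(max(intra), min(inter), max(inter))

# Arithmetic check for the text: 7 differences in 643 bp.
print(round(100 * 7 / 643, 2))  # 1.09
```

Comparing the largest intraspecific value with the smallest interspecific value is a simple way to see whether a barcoding gap exists in a dataset.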

Identification: COI vs. RpS2
The parsimony analysis of the new COI sequences yielded four equally parsimonious trees (CI = 0.59, RI = 0.75), the strict consensus tree of which is presented in Figure 1. The NJ tree is presented in Figure 2.
The majority of the species could be identified with COI alone, as no COI haplotypes were shared between species. Both parsimony and NJ trees recovered 25 (out of 28) species as monophyletic groups (Figures 1-2). Neither Colias cocandica nor C. nebulosa formed monophyletic entities, as their sequences were scattered over various parts of the trees. The two samples of C. tyche were not recovered as sister taxa, since sample MZH_JL5 appeared as the sister taxon of C. heos. The overall topologies of the parsimony and NJ trees were identical, except for the placement of C. thrasibulus. Parsimony placed the taxon as sister to a clade of five taxa (Figure 1), while NJ placed it as sister to C. romanovi (Figure 2). The external morphology of C. thrasibulus is rather different from that of C. romanovi, while some similarities can be found between C. thrasibulus and C. nina, C. ladakensis, C. tibetana and C. cocandica (Figure 1).
Only 17 of the 39 parsimony informative sites of RpS2 were variable among the 49 ingroup members. NJ recovered only a few species as separate lineages due to the shallow divergences (Figure 3). The information content of this gene region is best interpreted as a character-based diagnostic table, as suggested by DeSalle et al. (2005). This gene region yielded species-specific (diagnostic) haplotypes for 11 species out of 33 (Table 3).
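The notion of a species-specific haplotype can be made concrete with a small sketch: a species counts as diagnosable when none of its sequences is identical to a sequence of another species. This is a simplified illustration under that assumption, not the procedure behind Table 3. The function name and the placeholder species labels and sequences are hypothetical.

```python
from collections import defaultdict

def diagnostic_species(samples):
    """Find species whose haplotypes are not shared with any other species.

    samples is a list of (species, aligned_sequence) tuples.
    Returns (set of diagnosable species, dict of shared haplotypes).
    """
    owners = defaultdict(set)
    for species, seq in samples:
        owners[seq.upper()].add(species)
    # Haplotypes found in more than one species.
    shared = {hap: spp for hap, spp in owners.items() if len(spp) > 1}
    tainted = set().union(*shared.values())
    all_species = {species for species, _ in samples}
    return all_species - tainted, shared

# Hypothetical toy data: sp_A and sp_B share a haplotype, sp_C does not.
toy = [
    ("sp_A", "ACGT"),
    ("sp_B", "ACGT"),
    ("sp_C", "ACGA"),
    ("sp_C", "ACGC"),
]
diagnosable, shared = diagnostic_species(toy)
print(sorted(diagnosable))  # ['sp_C']
print(shared)
```

A species with several haplotypes, like sp_C in the toy data, is still diagnosable as long as none of them occurs in another species.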

Analysis of the concatenated COI + RpS2 data
The parsimony analysis of COI + RpS2 yielded nine trees of length 560 steps (CI = 0.63, RI = 0.72), the strict consensus tree of which is shown in Figure 4. Colias cocandica, C. nebulosa and C. tyche were not monophyletic and C. thrasibulus had the same position as in the COI cladogram (Figure 1).
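Whether a species is recovered as monophyletic on a rooted tree can be checked mechanically: its samples must form exactly the leaf set of some clade. The sketch below assumes a tree written as nested Python tuples with sample labels as leaves. This representation, the function names and the toy labels are all hypothetical, chosen only to illustrate the test.

```python
def clade_leaf_sets(tree):
    """Return (leaves of this subtree, leaf sets of every clade in it)."""
    if isinstance(tree, str):
        leaf = frozenset([tree])
        return leaf, [leaf]
    leaves = frozenset()
    clades = []
    for child in tree:
        child_leaves, child_clades = clade_leaf_sets(child)
        leaves |= child_leaves
        clades.extend(child_clades)
    clades.append(leaves)
    return leaves, clades

def is_monophyletic(tree, members):
    """True if the given samples form exactly one clade of the rooted tree."""
    _, clades = clade_leaf_sets(tree)
    return frozenset(members) in clades

# Hypothetical toy tree: samples a1 and a2 group together,
# while b1 and b2 fall in different parts of the tree.
toy_tree = ((("a1", "a2"), "b1"), ("c1", ("b2", "c2")))
print(is_monophyletic(toy_tree, {"a1", "a2"}))  # True
print(is_monophyletic(toy_tree, {"b1", "b2"}))  # False
```

On a strict consensus tree, samples of one species that sit in an unresolved polytomy together with other taxa fail this test, since no clade contains them alone.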

Analysis of all the COI sequences
The strict consensus cladogram for all the available COI data resolved the taxa in the same positions as in the tree of the new COI sequences only. For ten species of the present study, sequences were also available from GB. Sequences of most species clustered together as monophyletic entities, except for C. nebulosa, C. cocandica, C. tyche and C. regia. For C. regia, the GB sequence (GB accession no. FJ663427) did not cluster together with our sequences. The GB barcodes of C. erate and C. croceus were shared by these two taxa.
Neither the Himalayan and south Tibetan adjacent mountain Colias fauna (berylla, ladakensis, nina, stoliczkana, thrasibulus, tibetana), nor the east Tibetan, Qinghai, Gansu and Sichuan species aggregates (adelaidae, grumi, lada, montium, nebulosa, sifanica, wanda) were resolved as species clusters similar to the Tian Shan, Pamir and Hindukush species. Several COI haplotypes were noted for a few species, even among specimens obtained from the same locality (e.g. C. staudingeri and C. thisoa). Taxa not resolved as monophyletic clusters were the species C. cocandica and C. nebulosa. All the included subspecies of C. cocandica (C. c. cocandica, C. c. pljutshtshi and C. c. hinducucia) showed distinct COI sequences, with cocandica cocandica as the most different.
Lukhtanov et al. (2009) tested the utility of COI barcodes for Central Asian butterflies by sampling specimens from a considerable geographical range. They observed that this substantially increased intraspecific variation, reducing the interspecific divergences ("barcoding gap"), but that this did not hamper species identification. The present study shows that most Colias taxa form monophyletic entities that can be identified with COI data alone. The RpS2 gene region showed identical sequences in cocandica pljutshtshi and cocandica hinducucia (Table 3, Figure 3), differing by only three nucleotides from cocandica cocandica. Based on the molecular data, the recognition of these subspecies is not, or only weakly, supported.

Barcoding
The fact that the three C. nebulosa samples were scattered over different parts of the COI tree might result from laboratory contamination through carry-over between samples. The C. nebulosa samples were collected on the same day and in the same place. C. nebulosa is morphologically distinct from other Colias species, which excludes possible misidentification. The RpS2 data, however, could point to two morphologically cryptic species in sympatry (samples MZH_JL24 vs. MZH_JL9 and MZH_JL26), so that the different COI barcodes might represent numts, despite no apparent 'signs' (no indels). This discrepancy between morphology and DNA sequence data emphasises the necessity of using multiple samples to detect such challenging issues.
Even though C. cocandica and C. nebulosa did not form monophyletic groups, our results show that COI barcodes are useful for (1) identifying Palaearctic and Central Asian Colias, (2) pointing to a possible cryptic species, and (3) highlighting the necessity to further investigate the question of the subspecific rank of C. cocandica cocandica.
The utility of RpS2 as a species barcode for Colias spp. is clearly more limited, since e.g. C. heos, C. lada, C. nina, C. thisoa of the subgenus Eriocolias and C. tyche (subgenus Eucolias) have identical sequences (Table 3, Figure 3). Still, RpS2 yielded species-specific (diagnostic) haplotypes for 11 species of the subgenus Eriocolias and for C. hyale (subgenus Colias s.str.).

Congruence with traditional classification: analysis of concatenated COI + RpS2
The strict consensus tree was more resolved than either of the trees resulting from separate analyses of the gene regions (Figure 4).
Although the concatenated data did not resolve the phylogenetic relationships among all Colias species, some observations can be made. The majority of the species confined to the adjacent Tian Shan, Pamir and Hindukush mountain ranges form a well-supported clade. This includes C. eogene, C. regia, C. romanovi, C. marcopolo, C. staudingeri, C. christophi, C. alpherakii and C. wiskotti. Yet, C. sieversi, which also occurs in these mountain ranges (Peter I and Khozratishoh mountains), was not included in this clade. C. sieversi is morphologically most similar to C. alpherakii, thus showing another case of disagreement between morphological and DNA sequence data. C. thisoa, too, lives in the aforementioned mountain ranges, but it has a wider distribution, stretching from Turkey to the Altai Mountains. A third taxon, C. c. cocandica, is considered closely related to C. tamerlana (e.g. Verhulst 2000), a species occurring in southern Siberia and Mongolia. Thus, the origin of C. thisoa and C. c. cocandica may differ from that of the species confined to the Tian Shan, Pamir and Hindukush mountain ranges. One sample of C. cocandica (MZH_JL43) was placed within this "mountain" clade, while the other two samples appeared as sister taxa to the Himalayan species C. ladakensis. As with C. sieversi, our DNA data disagree with the morphological characters, but it should be noted that this clade is not well supported. Conversely, two morphologically similar Himalayan species, viz. C. nina and C. ladakensis, were assigned to different clades. In the COI + RpS2 tree they were placed in different, more encompassing species clusters (Figure 4); in the COI NJ tree they were joined with C. c. pljutshtshi and C. c. hinducucia (Figure 2); and the COI cladogram resolved these taxa together with C. adelaidae, C. tibetana, C. c. pljutshtshi and C. c. hinducucia (Figure 1).
The analyses did not support the monophyly of the subgenera Eucolias and Eriocolias sensu Berger (1986). The Eucolias species C. tyche was not resolved as a separate monophyletic lineage, but was placed within Eriocolias. This is congruent with the results of Pollock et al. (1998) and Brunton (1998). Only the subgenus Colias, here represented by C. hyale, is supported as a distinct lineage, placed as sister to all other Colias spp.

Barcodes of Palaearctic Colias spp.
The parsimony (Figure 5) and NJ analyses (Figure 6) of the larger matrix of Palaearctic COI barcodes (total COI) recovered the same species clusters, but some of the species show different placements (e.g. C. thisoa, C. christophi). This is not surprising, as all internal nodes are very shallow. The samples of C. tyche and C. hyperborea show very low sequence divergence; morphologically these taxa are different, and they largely share the same distribution area. An example of species that share the same distribution and exhibit clear morphological similarities, and which were resolved as sister species in both analyses, is C. wiskotti and C. alpherakii. Identification of Palaearctic Colias based on COI barcodes is in most cases possible, since shared haplotypes were recorded only for C. erate and C. croceus.
Intraspecific variation is notable between some of the recognized subspecies, both among our own samples and those downloaded from GB. The intraspecific variation can partly be explained by morphologically clearly distinct subspecies, such as those of C. wiskotti, or by specimens from widely different localities, such as the different specimens of C. hyale (sample FJ663418 from Russia, FJ663421 from Kazakhstan, HQ004297 from Romania and MZH_JL35 and MZH_JL44 from SW Transbaikalia). However, notable intraspecific variation also occurs within populations, such as C. thisoa aeolides, with all samples originating from the same locality and date, but the limited sampling prevents conclusions on the reasons for this. It is apparent that the understanding of intraspecific variability of the COI barcode for Colias is presently very limited.
The combined COI data of our sequences and sequences downloaded from GB include species belonging to one additional subgenus, Neocolias, represented by C. myrmidone and C. erate. Only the subgenus Colias, represented by C. hyale, is well supported as a distinct lineage. Yet, one specimen of C. hyale (FJ663419) clustered together with C. erate (Neocolias) and C. croceus (Eriocolias). The other subgenera were not resolved as clades according to the present classification, in agreement with our results for the combined analysis.
Our findings generally support COI as a species-specific barcode for Colias, but we also highlight the necessity of including multiple individuals per species in molecular barcoding studies. Problematic 'cases' of widely divergent barcodes or conflicting morphological and molecular 'signals' are found in most if not all barcoding studies, and this study is no exception.