PCR primers for 30 novel gene regions in the nuclear genomes of Lepidoptera

Abstract We report primer pairs for 30 new gene regions in the nuclear genomes of Lepidoptera that can be amplified using a standard PCR protocol. The new primers were tested across diverse Lepidoptera, including nonditrysians and a wide selection of ditrysians. These new gene regions give a total of 11,043 bp of DNA sequence data and they show similar variability to traditionally used nuclear gene regions in studies of Lepidoptera. We feel that a PCR-based approach still has its place in molecular systematic studies of Lepidoptera, particularly at the intrafamilial level, and our new set of primers now provides a route to generating phylogenomic datasets using traditional methods.


Introduction
Post-Sanger sequencing technologies have opened up vast possibilities for acquiring molecular data for inferring phylogenetic relationships among taxa using 100s to 1000s of loci (Lemmon and Lemmon 2013), from whole genome sequences (e.g. Jarvis et al. 2014), to whole transcriptome sequences (e.g. Misof et al. 2014), to the targeted capture of conserved regions in genomes (e.g. Prum et al. 2015). However, these approaches require high quality DNA or RNA extracted from samples that are fresh or have been stored appropriately. Unfortunately, this quality requirement fails to capitalize on the wealth of material collected and cataloged in museums around the world. Recognizing this, many researchers are currently attempting to develop protocols to allow the extraction and sequencing of large amounts of DNA from old museum samples (e.g. Timmermans et al. 2016), but these methods are primarily limited to mitochondrial DNA.
For the past two decades, the standard protocol in insect molecular systematics has been to extract genomic DNA from one or two legs of dried individuals, often several years old, generally yielding very low concentrations of DNA. Today, millions of such genomic DNA extracts exist, each taken from suboptimally stored specimens, generated by individual researchers and large facilities such as the Canadian Centre for DNA Barcoding. These extracts have been used to PCR amplify specific gene regions, followed by Sanger sequencing. This standard approach has traditionally been restricted to fewer than 10 gene regions due to the lack of universal primers for more regions. Given this extensive DNA resource and the inability of the aforementioned methods to be easily applied to them, here we present an approach for using these extracts in the pursuit of phylogenomic insights.
As DNA sequencing technologies continue to evolve, the molecular systematist must judiciously choose which tools are best suited to the questions they wish to address. While genome scale data are certainly useful, such data are expensive, difficult to analyze and ultimately only a small fraction is utilized. Perhaps most importantly, such large scale datasets are likely only necessary for resolving deeper evolutionary events, such as the relationships among orders of insects (Misof et al. 2014) or superfamilies of e.g. Lepidoptera Kawahara and Breinholt 2014). In contrast, datasets on the order of tens of genes have been highly useful for resolving relationships at the intrafamilial level, as has been repeatedly shown for e.g. lepidopteran families (Wahlberg et al. 2009(Wahlberg et al. , 2014Kaila et al. 2011;Kawahara et al. 2011;Sihvonen et al. 2011;Zahiri et al. 2011Zahiri et al. , 2012Zwick et al. 2011;Regier et al. 2012aRegier et al. , 2012bRota and Wahlberg 2012;Sohn et al. 2013). Thus, datasets generated with PCR-based methods have been and continue to be very insightful. However, in many such studies, some nodes are poorly supported with the scale of data available and more sequence data is needed. But, while it would be very interesting to sequence e.g. transcriptomes for the same species sampled, financial and practical constraints preclude such attempts. Rather, what is most likely to help resolve many of these ambiguities in a cost effective and timely fashion are more high quality loci that can be amplified with PCR across a range of DNA quality.
Whole genome sequences can now be used to search for suitable gene regions for primer design (e.g. Wahlberg and Wheat 2008). Such suitable gene regions are considered to be protein-coding genes that are single copy and have an exon that is longer than 500 bp. Long exons are needed as intron lengths can vary thousand fold between taxa, sometimes even between close relatives (Zhang and Hewitt 2003). Protein-coding genes are also preferred for inferring phylogenetic relationships as their alignments are generally unambiguous and conserved regions can be found for primer design.
Here we design and test PCR primers for long exon regions of single copy, proteincoding genes across Lepidoptera based on publicly available whole genome sequences of the order. The new gene regions are shown to be phylogenetically informative for Lepidoptera and can be used to complement the eight gene regions that have become standard in Lepidoptera phylogenetics (Wahlberg and Wheat 2008).

Material and methods
Single copy, protein-coding genes with exons longer than 500 bp were found while manually curating the set of genes listed in Misof et al. (2014) that were pulled out of eight publicly available Lepidoptera genomes using tblastn: Bombyx mori (NCBI accession GCA_000151625), Plutella xylostella (GCA_000330985), Manduca sexta (GCA_000262585), Danaus plexippus (GCA_000235995), Heliconius melpomene (GCA_000313835), Melitaea cinxia (GCA_000716385), Chilo suppressalis (GCA_000636095), and Spodoptera frugiperda (GCA_000753635). Sequences from all eight genomes were then used to design universal primers. We used the Python library primer-designer v0.2.0 (Peña 2015) to submit batches of FASTA alignments to the website primer4clades (Contreras-Moreira et al. 2009) in order to retrieve candidate primers for each gene sequence. Primer selection was based on high quality and amplicon length between 200 and 500 bp.
As in Wahlberg and Wheat (2008), universal tails were added to all primers to facilitate sequencing. Primers were aliquoted to a standard concentration of 10 μM for use. Primers were tested on a set of 24 species of Lepidoptera that represent major lineages within the order (Table 1). The DNA extracts of these specimens were previously used in the study by Mutanen et al. (2010). They mainly come from small amounts of tissue (such as legs) preserved in 100% EtOH (details of preservation and extraction methods can be found in Mutanen et al. 2010). The PCR reactions for all samples were done using the MyTaq TM HS Red Mix (Bioline) in a final volume of 12.5 μl per sample. For each reaction we used 4 μl of MQ-H 2 O, 6.25 μl of 2x MyTaq HS Red Mix, 0.625 μl of both forward and reverse primers and 1 μl of extracted DNA. All primers were tested with a standard thermal cycling profile of 95 °C for 7 minutes, then 40 cycles of 94 °C for 30 seconds, 55 °C for 30 seconds, 72 °C for 2 minutes, with a final extension period of 72 °C for 10 minutes. A standard cycling profile was chosen to simplify procedures and allow for large scale testing. No optimization of PCR reactions was attempted, as the goal was to find primers that work under exactly the same conditions, allowing the efficient processing of large numbers of samples in the laboratory without having to keep track of specific protocols for specific primer pairs. Success of PCR was visualized on agarose gels and successful PCR products were cleaned enzymatically and sent to Macrogen Europe (Amsterdam) for Sanger sequencing.
Sequences were trimmed of primer sequences and aligned by eye with reference to amino acid sequence in BioEdit 7 (Hall 1999) using the sequence from Bombyx mori as a reference for each gene. Aligned sequences were stored and curated using VoSeq (Peña and Malm 2012).

Results
We selected a total of 48 gene regions (see Supplementary material for alignments) for primer design, of which 30 successfully amplified (Suppl. material 1 and 2) with the designed primers (Table 3). Only two gene regions were successfully amplified from all 24 test samples (ArgKin and DDX23), but the majority were successfully amplified from 20 or more samples ( Table 2). The least successful gene region was LeuZip, which was sequenced from only 9 samples. No samples amplified all 30 gene regions (Table  1); the average number of successful gene regions was about 23. The least successful sample was Micropterix (11 out of 30 gene regions sequenced), which is not surprising as the primer design was based on ditrysian species, while Micropterix is likely to be the sister group to all the rest of Lepidoptera Regier et al. 2015).
The new gene regions give a total of 11,043 bp of data. The average amplicon length is 368 bp (ranging from 178 to 729 bp).

Gene Primer AFG3a_F
TAATACGACTCACTATAGGGTGTGAAGAAGCTAAGatwgaratyatggartt AFG3a_R ATTAACCCTCACTAAAGGGTGTTGTTGTATTAAAAccrtccatytchac AFG3b_F TAATACGACTCACTATAGGGTGCTCAAGACGACCtdaaraaratmac The variability in the new gene regions appears to be similar to the widely used nuclear gene regions reported in Wahlberg and Wheat (2008) (Table 2). The base content across most of the fragments is fairly even, with some of them having a small AT bias (e.g., Exp1, CycY, and TIF3Cb), but none having a larger percentage than for example CAD, one of the widely used gene regions (Wahlberg and Wheat 2008), which has 64.1% of As and Ts. The number of parsimony informative sites ranges from 30.2% in Ca-ATPase to a little over 48% in MMP41 and Ssu72. For comparison, the range of parsimony informative sites in the Wahlberg and Wheat (2008) gene regions is almost the same, from 27.2% (EF1alpha) to 48.5% (wingless).

Discussion
We report here primers for 30 new nuclear gene regions that can be used to complement existing molecular data for Lepidoptera systematics. Our primers were designed to amplify gene regions across the entire taxonomic array of Lepidoptera and to work on relatively degraded material by amplifying less than 500 bp segments of the genome. Many of these primers are being used successfully in our laboratory for projects on e.g. the nymphalid subfamily Limenitidinae (Dhungel and Wahlberg in prep.), the families Geometridae (Brehm et al. in prep.), Choreutidae (Rota et al. in prep.), Limacodidae (Dupont et al. in prep.) and Riodinidae (Seraphim et al. in prep.). The phylogenetic utility of the used gene regions will be reported in more detail in the forthcoming papers: in summary, they are providing similar resolution as the standard gene regions reported in Wahlberg and Wheat (2008) in preliminary maximum likelihood analyses with RAxML that have been conducted.
We would like to stress that the gene regions described here should be seen as complementary to the standard gene regions (Wahlberg and Wheat 2008) and could be used in the event that more data is needed. Potential users should consult Suppl. material 1 to see which primers worked for taxa they are interested in. Based on our experiences in the laboratory, we would recommend that researchers consider using primers for AFG3a, ArgKin, Ca-ATPase, DDX23, MMP41, MPP2, Nex9, PolII, ProSup, PSb, SSU72, UDPG6DH, and WD40, as these tend to amplify consistently, especially across Ditrysia.
More specifically, it seems that several fragments are not very suitable for nonditrysians (none of the three exemplars that we used amplified AFG3b, CHITINASE, KRR1, NC, SARAH, VPS4) and the utility of several other fragments for these groups needs to be further tested (Ca2, GLYP, MPP2, NEX9, POLII, TIF3CB, and UD-PG6DH amplified in only one of the nonditrysians tested). On the other hand, 21 fragments amplified in four or more of the six exemplars of Macroheterocera (the exceptions being LeuZip, which amplified in only one of them, and TIF3Cb and VPS4, which amplified in three out of six). The situation is more complex across the lower ditrysians and apoditrysians, which can be expected since these groups are quite divergent (Mutanen et al. 2010;Regier et al. 2013). For these groups our recommendation is to try CHITINASE and MK6 in addition to the above-mentioned fragments that appear to work across Lepidoptera.
In this study, we have used traditional Sanger sequencing to acquire the DNA sequences. However, almost all of the amplicons are short enough to be multiplexed and sequenced on a NextGen sequencing platform, such as Illumina. The advantages would be quick generation of a large number of sequences for a large number of samples. On the other hand, many systematists do not have access to NextGen sequencers, or the bioinformatics knowhow to process the raw data into useable formats, in which case the traditional PCR-based Sanger sequencing approach is still appropriate.
The approach we have used is highly conservative, as we sought to find primer pairs that work under standard conditions. It would thus be possible to design primers for the 18 gene regions that did not work under our strict criteria, but would work under different conditions. It is also possible to design primers that would amplify a longer segment of DNA, although such primer pairs would require fresh samples with little degradation of genomic DNA. It would also be possible to find more gene regions with exon lengths more than 500 bp, although a PCR-based approach becomes less and less efficient as the number of reactions grows. It is quite likely that datasets comprising up to 20 gene regions are sufficient for most phylogenetic studies within families (Zwick et al. 2011). For more difficult phylogenetic problems, NextGen sequencing approaches are recommended.