Diagnostic survey of Malagasy Nesomyrmex species-groups and revision of hafahafa group species via morphology based cluster delimitation protocol

Abstract Madagascar and its surrounding islands are among the world’s greatest biodiversity hotspots, harboring predominantly endemic and threatened communities meriting special attention from biodiversity scientists. Building on the considerable efforts in recent years to inventory the Malagasy ant fauna, the myrmicine genus Nesomyrmex is reviewed and (1) subdivided into four major groups based on salient morphological features corroborated by numeric morphology: angulatus-, hafahafa-, madecassus- and sikorai-groups, and (2) the hafahafa species-group endemic to Madagascar is revised. Diversity within hafahafa species-group was assessed via hypothesis-free nest-centroid-clustering combined with gap statistic to assess the number of clusters and to determine the most probable boundaries between them. This combination of methods provides a highly automatized, objective species delineation protocol based on continuous morphometric data. Delimitations of clusters recognized by these exploratory analyses were tested via confirmatory Linear Discriminant Analysis. These results suggest the existence of four morphologically distinct species, Nesomyrmex capricornis sp. n., Nesomyrmex hafahafa sp. n., Nesomyrmex medusus sp. n. and Nesomyrmex spinosus sp. n.; all are described and an identification key for their worker castes using morphometric data is provided. Two members of the newly outlined hafahafa species-group, Nesomyrmex hafahafa sp. n., Nesomyrmex medusus sp. n., are distributed along the southeastern coast Madagascar and occupy rather large ranges, but two other species, Nesomyrmex capricornis sp. n. and Nesomyrmex spinosus sp. n., are only known to occur in small and isolated forest, highlighting the importance of small forest patches for conserving arthropod diversity.


Introduction
The Malagasy zoogeographical region, i.e. Madagascar and surrounding islands (Bolton 1994), is considered one of the world's hottest biodiversity hotspots (Myers et al. 2000) and harbors a unique and threatened biota (Ganzhorn et al. 2001). The recently recognized global biodiversity crisis has highlighted the need to explore the flora and fauna of tropical areas, where biodiversity remains largely unexplored, and is enduring the fastest rate of environmental transformation. Thanks to intensive ant systematic research in Madagascar over the last decade (e.g. Fisher 2009, Blaimer and Fisher 2013, Yoshimura and Fisher 2012, Hita-Garcia and Fisher 2014 our knowledge of Malagasy myrmecofauna has increased considerably, supporting earlier assumptions about the extreme species diversity of the region. However, questions of diversity, rate of endemism, and connections to the African continent for several genera such as Malagasy Nesomyrmex have never been the subject of focused research. To date, only four valid Nesomyrmex species have been recorded to occur in Madagascar (Mbanyana and Robertson 2008), Based on the recent inventories of Fisher and team, this paper reassesses the Nesomyrmex fauna and describes the species from one species group.
A novel approach was used to facilitate species delimitations using multivariate morphometric analyses. Morphological diversity is assessed via NC-clustering ). This exploratory data analysis technique has proved efficient at pattern recognition within large and complex datasets , Guillem et al. 2014, Wachter et al. 2015. The estimation of the optimal number of clusters representing species within a morphological dataset is determined via gap statistic algorithm (Tibshirani et al. 2001). This algorithm helps to find statistically supported number of groups in normally distributed data such as continuous morphometric data based on intra-cluster variance. The combination of NC-clustering and gap statistic offers a highly automated, hypothesis-free protocol producing a statistically calculated goodness of clustering measure that minimizes opportunities for subjective interpretation.
In the present paper, the Malagasy Nesomyrmex fauna is subdivided into four clearly delimited species groups diagnosed here and a key to the species groups is provided. The first step of the current project, to inventory the entire Malagasy Nesomyrmex fauna, will involve providing a detailed description of the diversity of the Nesomyrmex hafahafa species-group. The three pairs of dorsal spines (pronotal spines, propodeal spines and antero-dorsal spines on petiolar node) makes the appearance of this group extremely unique; no similar species group has been found either in the Malagasy region or on the African continent. Multivariate evaluation of morphological data has revealed that the unique-looking N. hafahafa species-group comprises four welloutlined clusters, or species, that are endemic to Madagascar. The four new species outlined, N. capricornis sp. n., N. hafahafa sp. n., N. medusus sp. n., and N. spinosus sp. n., are described here based on worker caste, and both a key that includes both a numeric identification tool that helps readers to resolve the most problematic cases and a traditional character based key. Distribution maps are also provided. Our research has also revealed that two of the four species, N. capricornis sp. n. and N. spinosus sp. n., occur in small, highly isolated forests, leaving them at a high risk of extinction from continuing environmental destruction or climatic changes.

Material and methods
In the present study, 21 continuous morphometric traits were recorded in 177 worker individuals belonging to 100 nest samples collected in the Malagasy region (Table 1). The material is deposited in the California Academy of Sciences (CAS), San Francisco, USA. The full list of non-type material morphometrically examined in this revision is listed in Table 1 with unique specimen identifiers (e.g. CASENT0460666). Designation of type material with detailed label information is given in relevant sections type material investigated for each taxon. All images and specimens used in this study are available online on AntWeb (http://www.antweb.org). Images are linked to their specimens via their unique specimen code affixed to each pin (CASENT0002660). Online specimen identifiers follow this format: http://www.antweb.org/specimen/ CASENT0002660.
Digital color montage images were created using a JVC KY-F75 digital camera and Syncroscopy Auto-Montage software (version 5.0), or a Leica DFC 425 camera in combination with the Leica Application Suite software (version 3.8). Distribution maps were generated by using QGIS 2.4.0 software (QGIS Development Team 2014).
The measurements were taken with a Leica MZ 12.5 stereomicroscope equipped with an ocular micrometer at a magnification of 100×. Measurements and indices are presented as arithmetic means with minimum and maximum values in parentheses. Body size dimensions are expressed in µm. Due to the abundance of worker individuals in contrast to the limited number of queen and male specimens available the present revision is based on worker caste only. Worker-based revision is further facilitated by the fact that name-bearing type specimens of the vast majority of existing ant taxa were designated from worker caste. All measurements were made by the first author. For the definition of morphometric characters, earlier protocols (Schlick-Steiner et al. 2006, Seifert 2006, Seifert and Csősz 2015 were considered. Explanations and abbreviations for measured characters are as follows:

CL
Maximum cephalic length in median line. The head must be carefully tilted to the position providing the true maximum. Excavations of hind vertex and/or clypeus reduce CL (Fig. 1).

CW
Maximum width of the head including compound eyes (Fig. 1).

CWb
Maximum width of head capsule without the compound eyes. Measured just posterior of the eyes (Fig. 1).

Cdep
Antero-median clypeal depression. Maximum depth of the median clypeal depression on its anterior contour line as it appears in fronto-dorsal view.

EL
Maximum diameter of the compound eye.

FRS
Frontal carina distance. Distance of the frontal carinae immediately caudal of the posterior intersection points between frontal carinae and the torular lamellae. If these dorsal lamellae do not laterally surpass the frontal carinae, the deepest point of scape corner pits may be taken as reference line. These pits take up the inner corner of scape base when the scape is fully directed caudally and produces a dark triangular shadow in the lateral frontal lobes immediately posterior to the dorsal lamellae of the scape joint capsule (Fig. 2).

ML (Weber length)
Mesosoma length from caudalmost point of propodeal lobe to transition point between anterior pronotal slope and anterior pronotal shield (preferentially measured in lateral view; if the transition point is not well defined, use dorsal view and take the centre of the dark-shaded borderline between pronotal slope and pronotal shield as anterior reference point). In gynes: length from caudalmost point of propodeal lobe to the most distant point of steep anterior pronotal face (Fig. 3). MPST Maximum distance from the center of the propodeal spiracle to the posteroventral corner of the ventrolateral margin of the metapleuron (Fig. 4).

MW
Mesosoma width. In workers MW is defined as the longest width of the pronotum in dorsal view excluding the pronotal spines (Fig. 5).

NOL
Length of the petiolar node. Measured in lateral view from the centre of petiolar spiracle to dorso-caudal corner of caudal cylinder. Do not erroneously take as the reference point the dorso-caudal corner of the helcium, which is sometimes visible (Fig. 4).

NSTI
Apical distance of the anterodorsal spines on the petiolar node in dorsal view; if spine tips are rounded or thick take the centers of spine tips as reference points (Fig. 6).

PEL
Diagonal petiolar length in lateral view; measured from anterior corner of subpetiolar process to dorso-caudal corner of caudal cylinder (Fig. 3).

PEW
Maximum width of petiole in dorsal view. Nodal spines are not considered (Fig. 5). PoOC Postocular distance. Use a cross-scaled ocular micrometer and adjust the head to the measuring position of CL. Caudal measuring point: median occipital margin; frontal measuring point: median head at the level of the posterior eye margin (Fig. 1).

PPL
Postpetiole length. The longest anatomical line that is perpendicular to the posterior margin of the postpetiole and is between the posterior postpetiolar margin and the anterior postpetiolar margin (Fig. 4).

PSTI
Apical distance of pronotal spines in dorsal view; if spine tips are rounded or thick take the centers of spine tips as reference points (Fig. 5).

SL
Scape length. Maximum straight line scape length excluding the articular condyle. SPBA Minimum propodeal spine distance. The smallest distance of the lateral margins of the propodeal spines at their base. This should be measured in dorsofrontal view, since the wider parts of the ventral propodeum do not interfere with the measurement in this position. If the lateral margins of propodeal spines diverge continuously from the tip to the base, a smallest distance at base is not defined. In this case, SPBA is measured at the level of the bottom of the interspinal meniscus (Fig. 6).

SPST
Propodeal spine length. Distance between the centre of propodeal spiracle and spine tip. The spiracle centre refers to the midpoint defined by the outer cuticular ring but not to the centre of real spiracle opening that may be positioned eccentrically (Fig. 4).

SPTI
Apical propodeal spine distance. The distance of propodeal spine tips in dorsal view; if spine tips are rounded or truncated, the centres of spine tips are taken as reference points (Fig. 6).
Taxonomic nomenclature, OTU concepts and natural language (NL) phenotypes were compiled in mx (http://purl.org/NET/mx-database). Taxonomic history and descriptions of taxonomic treatments were rendered from this software. Hymenopteraspecific terminology of morphological statements used in descriptions, identification key, and diagnoses are mapped to classes in phenotype-relevant ontologies (Hymenoptera Anatomy Ontology (HAO) (Yoder et al. 2010) via a URI table (Table 2); see Seltmann et al. (2012), Mikó et al. (2014) for more information about this approach.
In verbal descriptions of taxa based on external morphological traits, recent taxonomic papers Csősz 2015) were considered. Definitions of surface sculpturing are linked to Harris (1979). Body size is given in µm, means of morphometric ratios as well as minimum and maximum values are given in parentheses with up to three digits. Estimated inclination of pilosity and cuticular spines is given in degrees. Definitions of species-groups as well as descriptions of species are surveyed in alphabetic order.

Statistical analyses of continuous morphometric data
Hypothesis formation by exploratory analyses. Our hypothesis of the number of clusters and classification of samples was formulated by an exploratory data analysis technique, NC-clustering ) using continuous morphometric data. NC-clustering searches for discontinuities in data, sorting all similar cases into the same cluster by transforming morphological differences between nest samples into a distance matrix in a linear discriminant space. The linear discriminant scores for each nest sample are displayed in a dendrogram within Euclidean space via UPGMA (Unweighted Pair Group Method with Arithmetic Mean) distance method. This method is able to tackle large datasets with high dimensionality , Guillem et al. 2014, Wachter et al. 2015, providing readily inferable patterns even for a high number of clusters. A bootstrap version of cluster analysis was applied to evaluate how consistently the same clusters appear with a sub-sampled dataset by running 1000 iterations (method = "average", method.dist = "euclidean", nboot = 1000) using package pvclust (Suzuki and Shimodaira 2014). Package pvclust returns two type of p values: the Approximately Unbiased P-value (AU) is computed by multiscale bootstrap resampling, and the raw Bootstrap Probabilities (BP) that is calculated before statistical adjustments by normal bootstrap resampling. Table 2. URI table for morphometric characters and Hymenoptera-specific terminology of morphological statements used in descriptions, identification key, and diagnoses are mapped to classes in phenotype-relevant ontologies.

Label Class genus differentia definition
Comments uri CL maximum cephalic length in median view The median anatomical line that extends between the posterior margin of the cranium and the distal margin of the clypeus in frontal view. The maximum cephalic length in median view is not equivalent to the maximum cephalic size that extends between the posterior cranial margin and the distal clypeal line. The head must be carefully tilted to the position with the true maximum. Excavations of hind vertex and/or clypeus reduce CL (Fig. 1A).
http://purl.obolibrary.org/ obo/HAO_0002331 CW head width The anatomical line that is the longest horizontal diameter of the cranium in frontal view. The head width is the largest distance between the lateral margins of the compound eyes measured in frontal view (Fig. 1A).
http://purl.obolibrary.org/ obo/HAO_0002268 CWb dorsal head width The anatomical line between the intersections of the cranium contour line and dorsal head line in frontal view. The dorsal head width is the maximum width of head capsule without the compound eyes that is measured just posterior of the eyes in frontal view (Fig. 1A).     http://purl.obolibrary.org/ obo/HAO_0002345 SL scape length The proximodistal anatomical line of the scapal area distal to the radicle. Maximum straight line scape length excluding the radicle (Fig. 1A).
http://purl.obolibrary.org/ obo/HAO_0002346 SPBA minimum spine distance The shortest anatomical line between the lateral margins of the propodeal spines. This should be measured in anterodorsal view, since the wider parts of the ventral propodeum do not interfere with the measurement in this position. If the lateral margins of spines diverge continuously from the tip to the base, a smallest distance at base is not defined. In this case, SPBA is measured at the level of the bottom of the interspinal meniscus (Fig. 1C).
http://purl.obolibrary.org/ obo/HAO_0002347 SPST spine length The anatomical line between the center of the propodeal spiracle and the distal end of the propodeal spine. Spine length. Distance between the centre of propodeal stigma and spine tip. The stigma centre refers to the midpoint defined by the outer cuticular ring but not to the centre of real stigma opening that may be positioned eccentrically (Fig. 1F).
http://purl.obolibrary.org/ obo/HAO_0002348 SPTI apical spine distance The anatomical line between the distal ends of the propodeal spines. If spine tips are rounded or truncated, the centres of spine tips are taken as reference points (Fig. 1C The optimal number of clusters was determined via gap statistic using gap criterion introduced by Tibshirani et al. (2001). The gap statistic is a standard method for determining the number of clusters in a set of data (Mohajer et al. 2010). It clusters the observed data, varying the number of clusters and computes the corresponding within-cluster dispersion (i.e. the sum of the squared distances between the observations and the center of the cluster). For each number of clusters the gap statistic compares the standardized within-cluster dispersion to its expectation under an appropriate null reference distribution (i.e. each observation is assumed to fall in a single cluster). The optimal number of clusters is the value for which the observed within-cluster dispersion falls the farthest below this reference curve (Tibshirani et al. 2001).
Hypothesis testing by confirmatory LDA. To increase the reliability of species delimitation, hypotheses on clusters and classifications of cases via two exploratory processes were tested by a confirmative LDA. Classification hypotheses were imposed for all samples congruently classified by exploratory methods while wild-card settings (i.e. no prior hypothesis imposed on its classification) were given to samples that were incongruently classified by the two methods. The confirmative LDA was run as an iterative process to achieve the lowest number of characters necessary to achieve the desired level (>97%) of classification success (Seifert 2014).  (Donisthorpe, 1946) madecassus (Forel, 1892) sikorai group retusispinosus (Forel, 1892) sikorai (Emery, 1896)

I. Definitions and diagnoses of groups
Key to species-groups 1 Anterodorsal spines on petiolar node present (Fig. 7)

Multivariate Analyses of Numeric Morphology
Four clusters were revealed by gap statistic (Fig. 12) to be the most parsimonious solution corroborating the evaluation of the NC-clustering dendrogram (Fig. 13). The grouping hypotheses generated by hypothesis-free exploratory analyses is confirmed by Linear Discriminant Analysis (LDA) with 99.4% classification success. This pattern is also supported by the examination of external morphological traits (e.g. shape of petiolar node, length and deviation of anterodorsal spines on petiolar node), hence the four clusters can be defined as morphospecies based on descriptive morphology. The distinctive morphology of these species permits considerable character reduction, so that the four taxa can be separated based on the combination of four continuous morphometric traits (FRS, NSTI, PSTI and SPST see Table 3) with 99.4% classification success (Fig. 14). Synopses of species were defined based on multivariate analyses of morphological traits: Nesomyrmex capricornis sp. n., Nesomyrmex hafahafa sp. n., Nesomyrmex medusus sp. n., Nesomyrmex spinosus sp. n. Coefficients of linear discriminants of LD1 and LD2 help to place every additional sample in the discriminant space illustrated in Fig. 14. These placements were calculated using the four most discriminative characters. The morphometric data are in micrometer. Classification functions based on linear discriminants LD1 and LD2 are as follows: LD1= -(0.0324×PEL)+(0.0121×SPST)-(0.0023×PSTI)+(0.0281×NSTI) +1.6 LD2= +(0.0336×PEL)+(0.0258×SPST)-(0.0328×PSTI)+(0.0049×NSTI)-2.9 Discriminant scores (LD1, LD2) obtained here can either be compared to the values given in Table 3, or can also be used as coordinates in Fig. 14, if relevant scores are fitted on axes LD1 and LD2, and the position of every new sample can be readily identified visually.
Though all species defined in this revisionary work proved to be highly separable via descriptive morphology, or by using simple indices, the application of classification functions LD1 and LD2 provides a foolproof, numeric morphology-based identification tool when decisions based on conventional diagnostic traits fail. Gap statistic for dataset of hafahafa species-group. Four-cluster solution is highly supported by the elbow at 4 components by the dispersion curve (left) and by the peak at cluster number four by the gap curve (right). Number of clusters in the data (X axis), the total within-cluster dispersion for each evaluated partition (Y axis for the left plot) and the vector of length Kmax giving the Gap statistic for each evaluated partition (Y axix for the right plot) is illustrated.

Description of the species in the Nesomyrmex hafahafa species-group
In this section, four new species of the N. hafahafa species-group are described, and a key to these species is provided. Diagnoses are given in the key, the basic statistics of body size ratios are given in Table 4 for each species. The biogeography of the hafahafa group is detailed in the discussion. The diagnoses and a key to the four Malagasy Nesomyrmex species groups (angulatus-group, hafahafa-group, madecassus-group and sikorai-group) defined here are followed by the descriptions of species belonging to the hafahafa group.

Key to the species of hafahafa group
The species of the Nesomyrmex hafahafa group differ in body ratios. The following dichotomous identification key for the worker caste was generated based on ratios of morphological features that allow quick identification. Minimum and maximum values for each character is given in parentheses. The reliability of all characters has been tested and calculated classification success was always higher than 95% for each node. Where classification error was detected (i.e. the range of a given trait overlaps between two species) a percentile range 5-95% was also provided in brackets.

1
Propodeal spine very short (Fig. 15). 2 Bases of anterodorsal petiolar spines enclose a triangular truncate area on the dorsum of petiolar node delineated by a rim (Fig. 16). In dorsal view, anterodorsal petiolar spines distantly surpassing lateral margin of petiole ( In dorsal view, distance between tips of anterodorsal petiolar spines longer than petiole width, spines surpassing lateral margins of petiole (Fig. 17).    The list of 21 non-type individuals belonging to 14 nest samples of other material investigated is given in Table 1  Etymology. This species is named for the shape of the anterodorsal spines on the petiolar node, which resemble goat horns.
Distribution. This species is known to occur in small, highly isolated forests (Toliara, Forêt Mahavelo and Parc National d'Andohahela, Forêt de Manantalinjo) in the southern part of Madagascar (Fig. 13).  Table 4 Etymology. This Malagasy word "hafahafa" means weird, and refers to the unusual morphology of this species.

Nesomyrmex medusus
The list of 54 non-type individuals belonging to 28 nest samples of other material investigated is given in Table 1.   Distribution. This species is known to occur in small, highly isolated forests (Réserve Privé Berenty, Forêt d'Anjapolo and Parc National d'Andohahela, Forêt d'Ambohibory) in the southern part of Madagascar (Fig. 13).

Discussion
In this paper we placed the Malagasy Nesomyrmex fauna into four species-groups delimited based on morphological features corroborated by morphometric data (see definition and diagnoses of groups). The within-group diversity of one of these new groups, Nesomyrmex hafahafa group, was revealed by an enhanced hypothesisfree approach. The exploratory NC-clustering  technique was combined with a gap statistic (Tibshirani et al. 2001) in order to address the central problem of taxonomic workflow on estimating the number of optimal clusters (i.e. how many species).
A gap statistic algorithm (function 'gap') implemented in the package clus-terGenomics  was employed to determine the optimal number of cluster within data that were transformed into discriminant space by the NC-clustering and recursive partitioning (function 'part') assigned observations (i.e. specimens, or samples) into partitions. Gap statistic is a global method, determines the number of clusters based on gap criterion described by Tibshirani et al. (2001), while recursive partitioning searches for sub-clusters by running 'gap' recursively .
Our research demonstrates that combination of NC-clustering with gap statistics and recursive partitioning algorithms performs well in distinguishing partitions in the present data based on morphological distances among nest sample means. Four-cluster hypothesis was returned by both gap statistic (Fig. 12) and recursive partitioning (Fig.  14) as the most parsimonious solution for the diversity of the hafahafa-group. This classification was confirmed by multiple lines of evidence. The error rate between the exploratory procedure and the results of the confirmatory Linear Discriminant Analysis was 0.6%. Moreover the pattern recognized by the exploratory process was also corroborated by both the examination of diagnostic morphological traits (e.g. shape of petiolar node, length and deviation of anterodorsal spines on petiolar node) and the known biogeographic patterns (Fig. 14).
We highlight the importance and advantages of the combination of NC-clustering with algorithms to statistically infer gaps and create array of clusters. This protocol also has the potential at accelerate and improve taxonomic decision making process considerably by enabling taxonomists to objectively interpret results based on quantitative morphometric data even in a largely underexplored or poorly understood group such as the Malagasy genus Nesomyrmex.
Combination of these approaches allows researchers to recognize cryptic species, but also prevent users from inferring overly diverse pattern in the data. A taxonomist without long-term training in a given group can evaluate new specimens and potential new species by repeating the analysis with measurements from new specimens. This method is best included with an integrated approach that includes conventional morphological characters, biogeography, ecology or molecular data.