Corresponding author: Thomas M. Donegan (
Academic editor: Y. Mutafchiev
Existing models for assigning species, subspecies, or no taxonomic rank to populations which are geographically separated from one another were analyzed. This was done by subjecting over 3,000 pairwise comparisons of vocal or biometric data based on birds to a variety of statistical tests that have been proposed as measures of differentiation. One current model which aims to test diagnosability (
Donegan TM (2018) What is a species? A new universal method to measure differentiation and assess the taxonomic rank of allopatric populations, using continuous variables. ZooKeys 757: 1–67.
This paper aims to help address the “allopatric problem” when determining species rank in taxonomic science. Humans have categorized populations into named groups since the dawn of known civilization (
Sympatric species, which occur together in the same place during the breeding season but do not successfully interbreed to any material extent, are demonstrably real. With enough data and persistence, it is usually possible to determine whether or not sympatric populations interbreed regularly and whether they produce fertile offspring (
A traditionally more difficult problem, and the focus of this paper, is that of “allopatric” (
The subjectivity involved in comparing allopatric species and the rise of molecular science have doubtless encouraged the development of a multitude of different species criteria or concepts. As noted by
Whilst statistical and mathematical techniques to analyze molecular data have been a rich field for methodological advancement, the same cannot be said for the study of real-world variables. Supportable statistical schemes for assessing between-population differentiation are noteworthy principally by their absence. Those schemes which have been proposed are either widely criticized, only applicable to particular taxonomic groups, or vague.
This paper will concentrate on the traditional currency of taxonomy: continuous variables such as those based on measurement of specimens, whether in the museum or in the field. Many researchers and advanced amateurs do not have a molecular laboratory available and few genera have been exhaustively sampled in a way that includes multiple individuals at population level. In contrast, vocal and biometric data are easy to collate, accessible to many and cheaper to analyze. A wide variety of other ‘real world’ organism characters are capable of measurement as continuous variables. For vocalizations, lengths or acoustic frequencies of notes can be measured using sonograms, for example. Coloration can be measured using spectrometry. Noncontinuous or discrete variables, e.g., presence or absence of a particular character and molecular markers, can be analyzed best using cladistics and other phylogenetic tools and are not covered here in detail.
When species rank is assessed across a taxonomic group as a whole, consistency is a virtue. Under a biological species concept-based approach, attaining such consistency will require a determination of which allopatric populations have differentiated to the same extent as related sympatrics and which have not. Those that have so differentiated are species; those that have not are, at most, subspecies. Unfortunately, consistency is not attained in current classifications, especially as regards more diverse tropical faunas. This is generally due to discrepancies in available data, the regularity of different genera being revised and differences in approaches by regional committees or textbook authorities to studies using different taxonomic methods (e.g., molecular vs. morphological) (
Neither of these two splits is problematic from a phylogenetic species concept or “enthusiastic splitter” perspective in isolation; and further studies could give stronger support to these treatments. However, based on my experience of working with birds in the Neotropics, the benchmark applied to these situations would result in the specific recognition of probably several thousands of current subspecies or unnamed taxa occurring in that region.
The
In light of the difficulties with scoring “systems” and other developments, Halley et al. (2017) have argued for a return to monophyly and essentially
Over the last 20 years, I have been studying the taxonomy of birds in Colombia using biometric data (from mist-netting and museums) and sound recordings. This resulted in the production of a large amount of data relevant to studying differentiation. It has become apparent to me that steps might be taken towards resolving some of these seemingly intractable fundamental disagreements, by developing an objective and agreeable basis, grounded in scientific method, statistics, the analysis of large data sets and traditional biological species concept thinking, that could be used better, more consistently and more rationally to assess the rank of allopatric populations. Ultimately, the aim of this study is to attempt definitively to provide a robust, objective and universal method to address the centuries-old question (unresolved since
In the present study, I took a large data set that had been developed for purposes of various particular taxonomic studies of birds (citations below) and used this to road-test proposed and possible alternative statistical tests for measuring differentiation or diagnosis, with the intention of studying outcomes of tests in order to inform recommendations.
I compiled vocal and biometric data from multiple studies, including of representatives of the three major assemblages of birds: non-passerines (three families), suboscine passerines (four families), and oscine passerines (two families) (citations in Tables
Summary information on the vocal studies used in the analysis.
Order: Family  Genus  No. taxa / populations  No. spp. before review  No. spp. after review  No. continuous vocal variables  No. Pairwise tests omitted  Pairwise comparisons  Sample sizes (mean ± s.d.) (min–max)  Reference 


2  1  2  2  0  2  22.0 ± 4.6 (18–26) 



2  1  1  10  0  10  15.7 ± 1.5 (14–18)  Donegan and Avendaño (2015)  

2  1  2  5  0  5  5.5 ± 1.6 (4–7) 



8  4  5  26  114  614  42.7 ± 49.2 (3–179) 



10  1  2  14  224  406  18.2 ± 12.9 (3–63) 



8  3  3  12  0  336  23.0 ± 13.8 (4–57) 



2  1  1  7  0  7  14.9 ± 2.2 (12–17) 



4  1  4  18  64  44  39.1 ± 41.5 (3–146) 



13  3  6  19  558  924  25.5 ± 19.0 (2–78) 









Summary information on the biometric studies used in the analysis.
Order: Family  Genus  No. taxa / populations  No. spp. before review  No. spp. after review  No. continuous biometric variables  No. Pairwise tests omitted  Pairwise comparisons  Sample sizes (mean ± s.d.) (min.–max.)  Reference 


2  1  1  4  0  4  9.6 ± 3.2(6–13)  Donegan and Avendaño (2015)  

7  4  5  5  18  87  21.1 ± 19.5(2–65) 



11  1  3  6  49  281  12.4 ± 9.3(3–37) 



8  4  3  5  31  109  8.7 ± 7.4(2–24) 



2  1  1  5  0  5  4.9 ± 2.3(3–9) 



10  2  2  5  39  186  25.1 ± 34.7(4–214) 



9  3  5  5  30  150  15.4 ± 10.9(2–42) 









Vocal variables always included measures of maximum acoustic frequency, length, number of notes and speed. In some studies, change in pace, minimum frequencies, frequencies of particular notes, note bandwidth, changes in acoustic frequency and position of peaks or troughs of frequency within a vocalization, or any of the same measures for particular parts of vocalizations, were also measured. In each study, the variables under study were designed so as to document as fully as possible observed subjective differences between populations. Biometric variables were in all cases wing, tail, tarsus and bill length and mass, except for
Pairwise comparisons were undertaken on a matrix basis of each population against each other population. Some pairwise tests were omitted due to lack of data for a particular population, i.e., where there were
The data set was not designed for the study of statistical tests used in taxonomy, since this study had not been conceived at the time of data collection. The choice of taxonomic groups was not based only on studies which include among their components sympatric pairs (cf.
Several statistical tests were applied multiple times on a pairwise basis using a Microsoft Excel spreadsheet devised by the author for rapid assessment of multiple pairwise statistical tests across multiple populations. This spreadsheet is being published on the author’s
First, the entire data set was subjected to various proposed tests of species or subspecies rank. In the formulae used below,
LEVEL 1: Welch’s
When applying tests of statistical significance across multiple variables for the same pair, there is a risk of so-called “type 1” errors occurring. If testing for
LEVEL 2: a ‘50%/95%’ test, following one of
(
LEVEL 3: The traditional ‘75% / 99+%’ test for subspecies (
(
(
LEVEL 4: diagnosability based on non-overlap of recorded values (the first part of
LEVEL 5: ‘Full’ diagnosability (where sample means are four average
(
Figure
Graphical depiction of datasets which satisfy the Level 1–5 statistical tests addressed in this study.
These five tests were applied to 2348 population/variable combinations for voice and 822 population/variable combinations for biometrics. A population/variable combination is one comparison between two populations for a single variable. For example, in the
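The counting of population/variable combinations can be illustrated with a short sketch. It assumes that every population is compared against every other population once per variable, and that omitted tests are simply subtracted; the function name and placeholder population labels are hypothetical and not part of the original spreadsheet.

```python
from itertools import combinations


def count_pairwise_comparisons(populations, n_variables, n_omitted=0):
    """Count population/variable combinations for a full pairwise matrix.

    Assumption: each unordered pair of populations is compared once per
    variable, and tests omitted for lack of data are subtracted afterwards.
    """
    pairs = list(combinations(populations, 2))
    return len(pairs) * n_variables - n_omitted


# Hypothetical labels; only the number of populations matters here.
populations = [f"population_{i}" for i in range(10)]
print(count_pairwise_comparisons(populations, n_variables=14, n_omitted=224))
```

With 10 populations, 14 variables and 224 omitted tests, this gives 406 comparisons, which agrees with the corresponding row of the vocal summary table.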
Recorded test satisfaction outcomes and the mapping of such outcomes to diagnosis groupings.
Outcome  Meaning  Grouping 

0  None of the tests are met.  No diagnosis 
1  Statistically significant difference between means but no tests of diagnosis are met and data overlap.  Statistical significance 
14  Statistically significant difference between means and data show no overlap but no tests of diagnosis are met.  
12  Statistically significant difference between means but diagnosis only up to 50% and data overlap.  50% differentiation 
124  Statistically significant difference between means, diagnosis up to 50% and data show no overlap.  
123  Statistically significant difference between means and diagnosis at both 50% and 75% levels but data overlap.  75% differentiation 
1234  Statistically significant difference between means and diagnosis at both 50% and 75% levels and data do not overlap  
12345  Statistically significant difference between means and diagnosis at 50%, 75% and 95% levels and data do not overlap  95% differentiation 
1235  Statistically significant difference between means and diagnosis at 50%, 75% and 95% levels but data overlap.  
1245  Statistically significant difference between means and diagnosis at 50% and 95% levels and data do not overlap but 75% test is not met.  
125  Statistically significant difference between means and diagnosis at 50% and 95% levels but 75% test is not met and data overlap.  
2  No statistically significant difference between means, but 50% diagnosis test is met.  Possible false results 
2345  No statistically significant difference between means, but 50%, 75% and 95% diagnosis tests are met.  
24  No statistically significant difference between means, but 50% diagnosis test is met and data do not overlap.  
245  No statistically significant difference between means, but 50% and 95% diagnosis tests are met and data do not overlap.  
25  No statistically significant difference between means and data overlap, but 50% and 95% diagnosis tests are met.  
4  Data do not overlap but no other statistical tests are met 
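The mapping from outcome codes to groupings in the table above appears to follow a simple precedence rule, which can be sketched as follows. The rule is inferred from the table rows rather than stated by the author, and the function name is hypothetical.

```python
def diagnosis_grouping(outcome_code):
    """Map an outcome code such as "124" to its diagnosis grouping.

    Assumption (inferred from the table): the highest diagnosis level passed
    decides the grouping, Level 4 (non-overlap) never changes the grouping,
    and any outcome lacking Level 1 is a possible false result.
    """
    passed = {int(ch) for ch in str(outcome_code) if ch != "0"}
    if not passed:
        return "No diagnosis"
    if 1 not in passed:
        return "Possible false results"
    if 5 in passed:
        return "95% differentiation"
    if 3 in passed:
        return "75% differentiation"
    if 2 in passed:
        return "50% differentiation"
    return "Statistical significance"


for code in ["0", "14", "124", "1234", "125", "2345"]:
    print(code, "->", diagnosis_grouping(code))
```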
Certain minor methodological changes were undertaken here as compared to some of the underlying studies on which this paper is based: (i) where a single population had only one data point, it was excluded here from analyses, since only “Level 4” tests can be applied where degrees of freedom are 0 and this paper sought to compare outcomes for all comparisons; (ii) for the number of notes in the call for
The second part of this study aimed to measure effect sizes four different ways, in order to inform appropriate benchmarks for measuring or scoring differentiation. The impacts of using pooled standard deviations (as per
Effect sizes were first calculated using the following formula:
$$\text{effect size} = \frac{|\bar{x}_1 - \bar{x}_2|}{(s_1 + s_2)/2}$$
This uses an arithmetic mean of the standard deviations of the two populations to measure the difference between the means of the same two populations.
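As a minimal sketch of this calculation, the following function divides the absolute difference between two sample means by the arithmetic mean of the two sample standard deviations. The function name and the example measurements are hypothetical, and the use of the sample (n − 1) standard deviation is an assumption.

```python
from statistics import mean, stdev


def bare_unpooled_effect_size(sample_1, sample_2):
    """Difference between means in units of the mean of the two SDs.

    Assumption: standard deviations are sample standard deviations (n - 1).
    """
    mean_sd = (stdev(sample_1) + stdev(sample_2)) / 2
    return abs(mean(sample_1) - mean(sample_2)) / mean_sd


# Hypothetical measurements for two populations (not data from the study).
population_a = [10.0, 12.0, 14.0]
population_b = [20.0, 22.0, 24.0]
print(bare_unpooled_effect_size(population_a, population_b))
```

Here both standard deviations equal 2 and the means differ by 10, so the two populations are 5 average standard deviations apart.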
A control was applied using
(
This measures the distance between the means of two populations in terms of numbers of
To illustrate the impact of this correction versus the results from using bare unpooled effect sizes, the maximum acoustic frequency in the “slow song” in Santa Marta Warbler
Effect sizes using a pooled standard deviation, or Cohen’s
Cohen’s
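$$d = \frac{|\bar{x}_1 - \bar{x}_2|}{s_{\text{pooled}}}$$
or, in full:
$$d = \frac{|\bar{x}_1 - \bar{x}_2|}{\sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}}$$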
This was the measure of effect size used by
Bare pooled effect sizes were subjected to an equivalent control for sample size (as for bare unpooled effect sizes), but using
Cohen’s
Where
or, in full:
(
These four measures of effect size were calculated for each population/variable combination and each outcome was then placed into two sets of buckets. First, in order to obtain a general picture of effect size magnitude in taxonomic studies, population/variable combinations were placed into a set of buckets divided at intervals of 2 effect sizes (i.e., at approximately 50% differentiation): 0–2, 2–4, 4–6, etc. A second set of buckets was based on
To compare the outcomes achieved using the four different measures of effect size and analyses of Levels 1–5, plots were produced between several of the outcomes. Spearman’s rank correlation coefficient was calculated as between statistical significance and effect size outcomes, based on the entire vocal and biometric data sets, so as to examine the interrelation between the outcomes of applying different measures of differentiation.
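For readers unfamiliar with the statistic, Spearman's rank correlation is the Pearson correlation of the ranks of the two variables. The sketch below is a standard-library illustration only and is not the software used in the study; the function names and the example values are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    mean_x, mean_y = mean(rank_x), mean(rank_y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rank_x, rank_y))
    var_x = sum((a - mean_x) ** 2 for a in rank_x)
    var_y = sum((b - mean_y) ** 2 for b in rank_y)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical p-values and effect sizes: smaller p pairs with larger effect.
p_values = [0.5, 0.2, 0.04, 0.001]
effect_sizes = [0.3, 0.9, 2.1, 4.5]
print(spearman_rho(p_values, effect_sizes))
```

In this toy example the ranking of p-values is exactly the reverse of the ranking of effect sizes, so the coefficient is −1.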
Tables
Effect of applying different “Type 1 error” corrections on the vocal data set. The tests are ordered (A–G) from least to most conservative corrections.






TOTALS  [ 
[ 
[ 



10  26  14  12  7  19  18  2  5  


0.05  0.05  0.05  0.05  0.05  0.05  0.05  0.05  0.05  
Passed  2  406  323  236  4  240  1211  32  2  2 
Total  10  614  406  336  7  343  1716  44  2  5 
% passed  


0.00512  0.00639  0.00730  0.00851  0.00730  0.00730  0  0.0253  0.0102  
Passed  2  357  293  197  3  200  1052  25  2  1 
Total  10  614  406  336  7  343  1716  44  2  5 
%  


0.005  0.00625  0.00714  0.00833  0.00714  0.00714286  0.01  0.025  0.01  
Passed  2  357  293  197  3  200  1052  25  2  1 
Total  10  614  406  336  7  343  1716  44  2  5 
% 




0.00512  0.00197  0.00366  0.00427  0.00730  0.00730  0.00285  0.0253  0.0102  
Passed  2  321  267  186  3  189  968  20  2  1 
Total  10  614  406  336  7  343  1716  44  2  5 
%  


0.005  0.00192  0.00357  0.00417  0.00714  0.00714  0.00278  0.025  0.01  
Passed  2  321  266  185  3  188  965  20  2  1 
Total  10  614  406  336  7  343  1716  44  2  5 
%  


0.00366  0.00165  0.00256  0.00301  0.00427  0.00427  
Passed  2  317  260  179  3  182  943  
Total  10  614  406  336  7  343  1716  
%  


0.00357  0.00161  0.0025  0.00294  0.00417  0.00417  
Passed  2  317  260  178  3  181  941  
Total  10  614  406  336  7  343  1716  
% 
Effect of applying different “Type 1 error” corrections on the biometric data set. The tests are ordered (A–E) from least to most conservative corrections.






TOTALS  [ 



4  5  6  5  5  5  5  


0.05  0.05  0.05  0.05  0.05  0.05  0.05  
Passed  0  46  142  45  3  66 

88 
Total  4  87  281  109  5  150 

186 
% passed  


0.0127  0.0102  0.00851  0.0102  0.0102  0.0102  0.0102  
Passed  0  31  108  35  2  50 

66 
Total  4  87  281  109  5  150 

186 
%  


0.0125  0.01  0.00833  0.01  0.01  0.01  0.01  
Passed  0  31  108  35  2  50 

66 
Total  4  87  281  109  5  150 

186 
%  


0.00366  0.00165  0.00256  0.00301  0.00427  0.00213  
Passed  0  33  92  31  2  40 


Total  4  87  281  109  5  150 


%  


0.00357  0.00161  0.0025  0.00294  0.00417  0.00208  
Passed  0  33  92  31  2  40 


Total  4  87  281  109  5  150 


% 
In the biometrics study, lower levels of statistically significant differentiation were found than for voice. More comparisons were non-significant (52.5%) than significant, even prior to applying any type 1 corrections. Applying type 1 corrections eliminated a further 12–16% of outcomes. Fewer than 5% of these eliminations resulted from treating voice and biometrics together; the bulk resulted from applying Bonferroni on the biometric data set itself. Dunn–Šidák corrections had no impact compared to using Bonferroni.
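The near-identical behaviour of the two corrections follows from how close their per-test thresholds are. The following sketch uses the standard textbook definitions of the Bonferroni and Dunn–Šidák corrections; the function names are hypothetical and this is not the author's spreadsheet implementation.

```python
def bonferroni_alpha(family_alpha, n_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    return family_alpha / n_tests


def sidak_alpha(family_alpha, n_tests):
    """Per-test significance threshold under the Dunn-Sidak correction."""
    return 1 - (1 - family_alpha) ** (1 / n_tests)


# A family of 10 variables tested at an overall alpha of 0.05.
print(round(bonferroni_alpha(0.05, 10), 5))
print(round(sidak_alpha(0.05, 10), 5))
```

For a family of 10 variables, the thresholds are 0.005 (Bonferroni) and approximately 0.00512 (Dunn–Šidák), so only p-values falling in the narrow gap between the two could ever be treated differently.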
Tables
Outcomes of pairwise comparisons for vocal characters, placed into the different categories recovered by testing statistical tests of Levels 1 through 5. 1 = statistically significant, 2 = 50% diagnosis, 3=75% diagnosis, 4 = actual value diagnosis, 5 = 95% diagnosis (as detailed further in methods). See Table
Voice: levels passed  None  1  14  12  124  123  1234  12345  1235  1245  125  2  2345  24  245  4 


1  1  

8  2  

3  1  1  

284  38  53  35  23  9  4  109  11  2  37  2  5  2  

118  92  1  47  27  7  5  71  3  12  1  2  12  8  

148  82  37  17  5  7  36  1  1  2  

4  3  

22  7  7  3  1  2  1  1  

504  188  3  73  53  7  6  47  7  2  0  1  11  0  22  


















Outcomes of pairwise comparisons for biometric characters, placed into the different categories recovered by testing statistical tests of Levels 1 through 5. 1 = statistically significant, 2 = 50% diagnosis, 3 = 75% diagnosis, 4 = actual value diagnosis, 5 = 95% diagnosis (as detailed further in methods). See Table
Biometrics: levels passed  None  1  14  12  124  123  1234  12345  1235  1245  125  2  2345  24  245  4 


4  

25  24  9  1  4  24  

166  17  1  15  23  1  5  35  3  7  8  

47  6  4  8  12  2  3  27  

3  1  1  

120  43  12  6  1  4  

97  27  1  10  9  1  2  1  2  


















Outcomes of pairwise comparisons using Levels analysis, for voice, by grouping. See Table
Voice: Taxon  Pairwise statistical tests (/5)  No diff.  Poss. false results  Signif. Only  50%  75%  95% 


2  0  0  1  1  0  0 

10  8  0  2  0  0  0 

5  3  1  0  1  0  0 

614  284  9  91  58  13  159 

406  118  22  93  74  12  87 

336  148  3  82  54  12  37 

7  4  0  0  3  0  0 

44  22  2  7  10  1  2 

924  504  34  191  126  13  56 








OVERALL %  
% (comparable) 
Outcomes of pairwise comparisons using Levels analysis, for biometrics, by grouping. See Table
Biometrics: Taxon  Pairwise statistical tests (/5)  No diff.  Poss. false results  Signif. only  50%  75%  95% 


4  4  0  0  0  0  0 

87  25  24  24  10  0  4 

281  166  15  18  38  6  38 

109  47  27  10  20  2  3 

5  3  0  0  1  0  1 

186  120  0  43  18  1  4 

150  97  3  28  19  1  2 








OVERALL %  
% (comparable) 
As foreshadowed in the type 1 error analysis (Tables
Levels 1–5 were generally ordered from least to most exacting in terms of difficulty to pass. However, several examples of “outliers” were uncovered, where more liberal test outcomes were apparently “skipped”, e.g.: (i) only statistical significance and non-overlap (1&4); (ii) statistical significance with 50% and non-overlap but not 75% diagnosis (124); (iii) all tests being passed except non-overlap (123&5); (iv) all tests including 95% diagnosis being passed, but excluding 75% diagnosis (124&5); (v) full statistical diagnosis and 50% and 95% diagnosis being met but neither 75% nor non-overlap (12&5); and (vi) combinations skipping statistical significance altogether, but passing other tests (all outcomes starting with 2 or 4). These outcomes are all statistically plausible, including as a result of the values of
In terms of specific findings for birds, biometric data were less informative than vocal data with “possibly false results” also being more frequent for biometric comparisons.
Results for effect sizes divided into buckets of 2
Results of the effect size study for voice, partitioning the data into 2 effect size intervals. The top two tables are based upon actual standard deviations for each set of data subjected to pairwise comparison. The lower two tables are based on pooled standard deviation data. In each case, “bare” effect sizes are shown first (above). The second and fourth tables use “controlled effect sizes” for the relevant pooling approach, calculated by taking into account
Bare Unpooled Effect Sizes (Voice)  














1  1  

10  

3  2  

361  119  55  33  11  14  8  6  6  1  0 

208  91  47  16  10  7  3  14  6  3  1 

227  66  30  7  6  

4  3  

27  14  2  0  1  

654  168  47  29  20  4  1  1  0  0  0 





























1  1  

10  

4  1  

375  114  53  32  14  15  0  7  3  1  

219  88  43  25  14  5  6  4  1  0  1 

230  69  27  7  3  

4  3  

29  12  3  

717  151  34  14  6  2  





























1  1  

10  

3  2  

371  130  48  19  17  10  9  4  6  0  0 

222  98  33  16  6  9  7  6  3  2  4 

232  58  36  5  5  

4  3  

28  13  2  0  0  0  0  1  

687  151  40  16  11  8  5  2  1  1  2 





























1  1  

10  

3  2  

374  127  49  21  19  6  8  4  6  

226  99  31  17  8  8  6  5  2  2  2 

233  58  35  5  5  

4  3  

28  13  2  0  0  0  1  

694  153  31  22  8  7  4  1  1  1  2 













Results of the effects size study for biometrics, partitioning the data into 2 effect size intervals. The top two tables are based upon actual standard deviations for each set of data subjected to pairwise comparison. The lower two tables are based on pooled standard deviation data. In each case, “bare” effect sizes are shown first (above). The second and fourth tables use “controlled effect sizes” for the relevant pooling approach, calculated by taking into account















4  

58  25  4  

170  63  21  14  5  3  1  2  0  2  

67  33  7  1  1  

2  2  0  1  

154  26  5  0  1  

116  27  7  





























4  

73  10  4  

192  51  20  12  3  0  0  1  2  

84  22  2  1  

3  1  1  

163  19  3  1  

127  21  2  





























4  

57  28  2  

177  54  23  14  3  5  2  2  1  

67  36  6  

2  2  1  

158  23  4  0  1  

127  17  6  





























4  

58  27  2  

181  54  21  11  8  3  1  1  1  

75  30  4  

3  2  

161  21  3  0  1  

128  17  5  













A good portion (15–22%) of outcomes fell into the 2–4 effect sizes category, which, when using controlled unpooled effect sizes, corresponds to Level 2 in Tables
Table
Changes in effect size categories resulting from increasingly more conservative tests of effect size being applied. This table is based upon changes between the categories in Tables












Bare Unpooled –> Bare Pooled  +63  8  22  29  9  +2  +9  8  2  1  +5 
Bare Pooled –> Controlled Pooled  +15  0  11  +9  +1  6  2  3  1  0  2 
Controlled Pooled –> Controlled Unpooled  +16  17  +12  +13  3  +1  13  +1  5  2  3 

+ 
 
 
 
 
 
 
 
 
 

As percentage of total  + 
 
 
 
 
 
 
 
 
 













Bare Unpooled –> Bare Pooled  +21  16  2  2  3  +2  +1  0  +1  2  0 
Bare Pooled –> Controlled Pooled  +18  9  7  3  +5  2  1  1  0  0  0 
Controlled Pooled –> Controlled Unpooled  +36  27  3  +3  6  3  1  0  +1  0  0 

+ 
 
 
 
 
 
 
 
+ 
 

As percentage of total  + 
 
 
 
 
 
 
 
+ 
 
Results for effect sizes divided into
Results of the effects size study for voice using









0  1  1  

3  7  

0  3  2  

70  291  152  66  35 

34  174  119  45  34 

31  196  83  26  

0  4  3  

4  23  14  3  

96  558  200  64  6 

















0  1  1  

3  7  

0  4  1  

73  302  144  69  26 

37  182  121  49  17 

33  197  85  21  

0  4  3  

4  25  12  3  

113  604  171  34  2 

















0  1  1  

3  7  

0  3  2  

71  300  157  57  29 

35  187  118  35  31 

31  201  78  26  

0  4  3  

4  24  14  2  

105  582  181  37  19 

















0  1  1  

3  7  

0  3  2  

71  303  158  58  24 

36  190  115  40  25 

30  203  79  24  

0  4  3  

4  24  14  2  

109  585  177  37  16 







Results of the effects size study for biometrics using









1  3  

5  53  27  2  

24  146  70  33  8 

7  60  40  2  

0  2  2  1  

22  132  28  4  

19  97  32  2  

















1  3  

8  65  14  

29  163  61  25  3 

17  67  24  1  

0  3  2  

24  139  21  2  

28  99  23  

















1  3  

6  51  28  2  

26  151  63  31  10 

8  59  40  2  

0  2  3  

23  135  24  4  

20  97  21  12  

















1  3  

7  51  28  1  

30  151  63  31  6 

11  64  32  2  

0  2  3  

23  138  22  3  

21  107  22  







Changes between category (Table
Changes in






Bare Unpooled –> Bare Pooled  +11  +52  20  47  +4 
Bare Pooled –> Controlled Pooled  +4  +11  5  +4  14 
Controlled Pooled –> Controlled Unpooled  +10  +6  11  +15  20 

+ 
+ 
 
 
 
As percentage of total   
 
 







Bare Unpooled –> Bare Pooled  +6  +5  20  +7  +2 
Bare Pooled –> Controlled Pooled  +9  +18  9  14  4 
Controlled Pooled –> Controlled Unpooled  +14  +23  25  9  3 

+ 
+ 
 
 
 
As percentage of total   
 
 
Tables
Scattergraphs showing the effects of applying different corrections of effect size on the entire vocal data set. Each axis shows effect size, measured in a different way.
Scattergraphs showing the effects of applying different corrections of effect size on the entire biometric data set. Each axis shows effect size, measured in a different way.
Spearman’s rank correlation coefficients as between the results of five statistical tests carried out on the vocal data set. Confidence for the values stated were given in PAST as zero or less than
Tests conducted on vocal data set  Bare Unpooled Effect Sizes  Controlled Unpooled Effect Sizes  Bare Pooled Effect Sizes  Controlled Pooled Effect Sizes 


0.863  0.910  0.849  0.859 

0.987  0.988  0.908  

0.894  0.980  

0.999 
Spearman’s rank correlation coefficients as between the results of five statistical tests carried out on the biometric data set. Confidence for the values stated were given in PAST as zero or less than
Tests conducted on biometric data set  Bare Unpooled Effect Sizes  Controlled Unpooled Effect Sizes  Bare Pooled Effect Sizes  Controlled Pooled Effect Sizes 


0.851  0.951  0.831  0.857 

0.936  0.990  0.983  

0.920  0.936  

0.994 
Changes in actual effect sizes resulting from changes between different methods of measuring effect sizes in the format actual mean ± standard deviation (minimum–maximum) [mean using absolute values ± standard deviation using absolute values], for voice. The figure in each cell demonstrates the outcomes of subtracting the effect size in the columns from the effect sizes in the rows, for each data point studied.
Tests conducted on vocal data set  Controlled Unpooled Effect Sizes  Bare Pooled Effect Sizes  Controlled Pooled Effect Sizes 


0.35 ± 1.15(0.20–24.26)[0.35 ± 1.15]  0.15 ± 0.91(11.73–7.13)[0.36 ± 0.84]  0.24 ± 1.07(11.32–19.28)[0.41 ± 1.02] 

0.19 ± 1.55(20.70–4.40)[0.45 ± 1.50]  0.11 ± 1.36(20.26–5.63)[0.40 ± 1.31]  

0.08 ± 0.46(0.27–15.40)[0.09 ± 0.46] 
Changes in actual effect sizes resulting from changes between different methods of measuring effect sizes in the format actual mean ± standard deviation (minimum–maximum) [mean using absolute values ± standard deviation using absolute values] for biometrics. The figure in each cell demonstrates the outcomes of subtracting the effect size in the columns from the effect sizes in the rows, for each data point studied.
Tests conducted on biometric data set  Controlled Unpooled Effect Sizes  Bare Pooled Effect Sizes  Controlled Pooled Effect Sizes 


0.41 ± 0.68(0.01–5.61)[0.41 ± 0.68]  0.09 ± 0.44(2.17–5.34)[0.19 ± 0.40]  0.21 ± 0.53(1.97–5.77)[0.26 ± 0.51] 

0.32 ± 0.73(6.31–2.36)[0.39 ± 0.69]  0.20 ± 0.57(4.4–2.78)[0.30 ± 0.53]  

0.12 ± 0.28(0.01–2.84)[0.12 ± 0.28] 
The largest shift was observed between bare unpooled effect sizes versus controlled unpooled effect sizes, where a 3.9% (voice) or 9.1% (biometrics) increase in the number of outcomes in the lowest category of differentiation (0–2) was observed.
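Shifts of this kind come from counting how many effect sizes fall into each interval of 2 under each measure and comparing the counts. A minimal sketch of that bucketing follows; the function names and example values are hypothetical, and the treatment of boundary values (a value of exactly 2 is placed in the 2–4 bucket) is an assumption, since the source does not state it.

```python
from collections import Counter


def bucket_label(effect_size, width=2):
    """Label the half-open interval [lower, lower + width) holding the value.

    Assumption: boundary values fall into the upper bucket (2.0 -> "2-4").
    """
    lower = int(effect_size // width) * width
    return f"{lower}-{lower + width}"


def bucket_counts(effect_sizes, width=2):
    """Count how many effect sizes fall into each bucket."""
    return Counter(bucket_label(es, width) for es in effect_sizes)


# Hypothetical effect sizes for the same comparisons under two measures.
bare = [0.6, 1.9, 2.3, 4.1]
controlled = [0.5, 1.6, 1.9, 3.8]
print(bucket_counts(bare))
print(bucket_counts(controlled))
```

Comparing the two counters bucket by bucket gives the kind of category changes summarized in the tables of changes above.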
The impact of using bare pooled versus bare unpooled effect sizes is illustrated in Figures
The overall magnitude of reduction of effect size measurements between bare pooled effect sizes and controlled pooled effect sizes was moderate. Degrees of freedom for pooled standard deviation are higher (the sum of the two samples’ sample sizes minus 2) than when using unpooled methods (where each sample is treated separately), resulting in lower
Although these overall trends were observed, the impact of applying differing methods of measurement of effect sizes on actual pairwise comparisons was not uniform (see Figures
Statistical significance presented a weak negative correlation with most effect size measurements, but was most closely correlated with controlled unpooled effect sizes. In the case of biometrics, there was a strong negative correlation with controlled unpooled effect sizes (Tables
The variability between particular scores using different effect size measures is defined further in Tables
The relationship between each measurement of effect size and statistical significance is explored in Tables
Effect sizes under the four models studied here, grouped into the three “zones” of statistical significance illustrated in Figures
Statistical significance  Bare unpooled effect sizes  Bare pooled effect sizes  Controlled pooled effect sizes  Controlled unpooled effect sizes 


0.60 ± 1.31(0.00–11.56)  0.38 ± 0.33(0.00–2.21)  0.71 ± 2.13(0.00–22.92)  0.67 ± 2.02(0.00–22.47) 

1.82 ± 2.69(0.33–18.22)  1.29 ± 1.24(0.33–8.47)  1.82 ± 3.00(0.26–21.60)  1.67 ± 2.61(0.26–20.14) 

3.69 ± 3.30(0.36–45.33)  3.32 ± 2.73(0.37–21.07)  3.32 ± 3.05(0.37–41.45)  3.22 ± 2.81(0.37–26.05) 
Effect sizes under the four models studied here, grouped into the three “zones” of statistical significance illustrated in Figures
Statistical significance  Bare unpooled effect sizes  Bare pooled effect sizes  Controlled pooled effect sizes  Controlled unpooled effect sizes 


0.72 ± 0.70(0.00–3.97)  0.42 ± 0.29(0.00–1.95)  0.71 ± 0.72(0.00–3.92)  0.63 ± 0.60(0.00–3.41) 

1.45 ± 0.86(0.44–3.98)  1.06 ± 0.41(0.44–2.44)  1.37 ± 0.96(0.40–6.01)  1.26 ± 0.84(0.40–5.81) 

3.37 ± 2.64(0.61–19.80)  2.79 ± 2.07(0.61–17.09)  3.15 ± 2.44(0.58–16.04)  2.97 ± 2.23(0.58–15.57) 
The dataset studied here exhibits comparable overall levels of variation to
Several broader aspects of the results can be explained by considering the number of standard deviations’ difference required to satisfy various models (Figure
The overall lower differentiation levels in biometrics can in part be explained due to lower sample size (see Tables
The outcomes of using pooled versus unpooled and bare versus controlled effect sizes are substantial across the data set as a whole and can be drastic in individual cases (Tables
The distinction between using pooled versus unpooled standard deviations in taxonomy has passed with barely any discussion in ornithological taxonomic literature.
Usage of pooled standard deviations, as a matter of statistical methodology, should only be undertaken where the standard deviations of the two populations under comparison can be assumed to be equal. This does not necessarily mean that measured standard deviations of the two populations must be equal, or even close to one another, since these will usually differ for two measured populations as a result of the sampling. However, it must be reasonable to make this assumption in order to apply this method. The pooling formula attributes greater weight to the standard deviation of the population with higher sample size and produces a “weighted average” standard deviation which is closer to that of the population of which there is a larger sample. Degrees of freedom for the pooled standard deviation are greater due to summing those of the two separate populations. In practice, in taxonomy, we will usually have no idea as to whether or not the standard deviations of two populations under comparison are equal or not. Special care should be adopted in using pooled standard deviations where estimated population sizes, molecular or geographical attributes of the two populations vary greatly. For example, comparing an isolated, very small montane population with low intrapopulation molecular variation versus a very widespread lowland population which is known to exhibit substantial clinal variation and has higher intrapopulation molecular variation would be inappropriate, since assumptions underlying the usage of pooled standard deviations are likely not just to be unknown but incorrect. The greater correlation between statistical significance and controlled unpooled effect sizes (Tables
There are still likely to be “use cases” for pooled standard deviations to measure effect sizes in taxonomy.
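The weighting behaviour of the pooling formula can be illustrated with a minimal sketch. The pooled standard deviation below is the standard textbook formula; the unweighted mean of the two standard deviations is used here only as an assumed stand-in for an unpooled denominator, and the function names and sample values are hypothetical.

```python
import math

def pooled_sd(s1, n1, s2, n2):
    """Standard pooled SD: variances weighted by degrees of freedom."""
    return math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

def mean_sd(s1, s2):
    """Unweighted mean of the two SDs (assumed stand-in for 'unpooled')."""
    return (s1 + s2) / 2

def bare_effect_size(mean1, mean2, sd):
    """Difference between means expressed in units of the chosen SD."""
    return abs(mean1 - mean2) / sd

# Hypothetical case: a small montane sample (n = 5, SD = 1.0) compared with
# a widespread lowland sample (n = 100, SD = 3.0), means differing by 4.0.
s_pooled = pooled_sd(1.0, 5, 3.0, 100)   # pulled towards the larger sample
s_mean = mean_sd(1.0, 3.0)               # treats both populations equally

es_pooled = bare_effect_size(10.0, 14.0, s_pooled)
es_unpooled = bare_effect_size(10.0, 14.0, s_mean)
print(round(s_pooled, 2), s_mean, round(es_pooled, 2), es_unpooled)
```

With these hypothetical inputs the pooled standard deviation sits close to that of the larger sample, so the same difference in means yields a smaller effect size than with the unweighted denominator.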
Most papers concerning the application of statistical tests for determining the taxonomic rank of allopatric populations have noted that statistical significance is not a good measure, due to its potential for liberal satisfaction by increasing sample size, its failure to indicate higher levels of differentiation or false positives when sampling from different parts of a geographical cline; and then move quickly on to discuss better tests (
Although large sample sizes are cited as the basis to reject statistical significance as a useful measure in taxonomy, such problems are rarely faced by taxonomists. Having too few specimens (whether in museums or measured in the field) or sound recordings is likely a more material problem. For new species descriptions in birds published between 1935 and 2009, 332 of 477 (70%) were based on 0–5 specimens, only one was based on >100 specimens and the mean number of specimens was 6 (
The classic test of statistical significance between two populations of data is the Student’s
Although the
The
The
In the field of medicine, the outcome of tests of statistical significance is widely understood and accepted to be just a first phase in demonstrating an interesting result. A variety of different approaches exist in medical science which must also be passed to show
That all said, statistical significance can be a tougher test than some proposed measures of differentiation. Instances were found here of pairwise comparisons passing
Scattergraphs of controlled unpooled effect size (
Introducing the Dunn–Šidák correction (as opposed to the simpler but overly conservative Bonferroni correction) had a virtually negligible effect (Tables
Bonferroni (and Dunn–Šidák) corrections are appropriately applied to “families” of variables. A middle ground of treating voice and biometrics as separate “families” is recommended based on this study. This could be criticized, since certain aspects of voice and biometrics can be linked (e.g.,
Diagnosability was considered to be the most frequently applied criterion to assess rank in a review of over 1000 taxonomic revisions (
In this study, 14.5% of the vocal data set and 6.3% of the biometric data set passed the
There is no consensus as to whether any differentiation below diagnosability for particular characters ought to be recognized in taxonomy.
Originally, 75% diagnosis tests for subspecies used one of two approaches: (i) a 75%/75% test (i.e., 75% of population 1 is diagnosable from 75% of population 2); or (ii) a 75%/99+% test (i.e., 75% of population 1 is diagnosable from essentially all of population 2).
The application of sharp, seemingly arbitrary tests such as these to classify normally distributed data into scored segments is not unique to taxonomy. Similar hard boundaries are rife in most education and examination systems. In UK universities, a student scoring 60.1% or 69.9% in an examination will be given the same award (an upper second class degree), but a student attaining 70.1% will get a different award, a first class degree. This is despite the students scoring 69.9% and 70.1% having attained far more similar levels of achievement to one another than those scoring 60.1% and 69.9%. Whilst any cutoff may be criticized as arbitrarily generous or harsh to outcomes falling close to the line on either side, the application of cutoffs is something that humans tend to do in their quest to categorize things. Where cutoffs are applied, the validity of a cutoff is best judged upon: (a) differentiation of a meaningful number of outcomes; and (b) the setting of boundaries at statistically, mathematically, or biologically meaningful positions.
In this data set, with very large numbers of pairwise comparisons, necessarily many individual cases fall very close to each of the cutoff boundaries proposed by previous models for attributing taxonomic significance, whether at or below diagnosability. Two populations differing by 95% using
In contrast to the 75% (Level 3) test, 50%/95% differentiation (Level 2) measures a mathematically relevant point of differentiation, when the mean of one population moves outside the normal distribution of the other. It also signifies the point at which a population has moved halfway towards diagnosability. The number of pairwise comparisons meeting the Level 2 test (but not falling in other buckets) was material but not enormous. Only 30.6% of vocal outcomes and 20.6% of biometric outcomes passed this test at all (Tables
In addition to the diagnosis formula for Level 5,
Traditional interpretations of effect sizes may be appropriately used in other fields but are inappropriate for taxonomic study. It should be borne in mind that the traditional subjective descriptors for effect sizes starting at 0.2 have been developed largely in the fields of social and behavioral science (Cohen 1998,
Overall, this study suggests that: (i)
Solution A:
1 point: Level 1 statistical significance only.
2 points: Level 1 plus Level 2 50% diagnosability.
3 points: Level 1 plus Level 5 full diagnosability.
4 points: Level 1 plus a new measure of a “species and a half” worth of diagnosability (equivalent to 6 controlled effect sizes).
Solution B: would use more proportionate scoring, eliminate the weighting for statistical significance and allow only three scores:
1.5 points: Level 1 plus Level 2 (2 controlled effect sizes).
3 points: Level 1 plus Level 5 (4 controlled effect sizes).
4.5 points: Level 1 plus 6 controlled effect sizes.
Solution C: would abandon these various cutoffs and instead use controlled unpooled effect sizes, calibrated by a scale factor such that no difference = 0 and full diagnosability = 3 and capped at a score of 4.
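The three solutions can be compared with a short sketch. The thresholds of 2, 4 and 6 controlled effect sizes and the point values are taken from the lists above. The function names, the treatment of thresholds as inclusive, and the omission of any significance filter from Solution C are assumptions made for illustration only.

```python
def score_solution_a(effect_size, significant):
    """Solution A: stepped scores of 1, 2, 3 or 4 points."""
    if not significant:
        return 0.0
    if effect_size >= 6:
        return 4.0   # "species and a half"
    if effect_size >= 4:
        return 3.0   # Level 5, full diagnosability
    if effect_size >= 2:
        return 2.0   # Level 2, 50% diagnosability
    return 1.0       # Level 1 only

def score_solution_b(effect_size, significant):
    """Solution B: proportionate steps, no points for significance alone."""
    if not significant:
        return 0.0
    if effect_size >= 6:
        return 4.5
    if effect_size >= 4:
        return 3.0
    if effect_size >= 2:
        return 1.5
    return 0.0

def score_solution_c(effect_size):
    """Solution C: continuous score, 4 effect sizes = 3 points, capped at 4."""
    return min(4.0, 0.75 * abs(effect_size))

# Two hypothetical outcomes lying either side of the diagnosability cutoff.
for es in (3.9, 4.1):
    print(es, score_solution_a(es, True), score_solution_b(es, True),
          round(score_solution_c(es), 3))
```

Under Solutions A and B the two near-identical outcomes receive different scores, whereas under Solution C they receive almost the same score.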
As has been argued elsewhere (e.g.,
For the reasons above, the Level 4 nonoverlap test should be abandoned from this framework in order to positively score the 2.5% of vocal outcomes which were diagnosable to 95% but actually overlapped due to very large sample sizes (Table
A disadvantage of the
Logarithmic plot of the same data as in Figure
Graph showing relationship between sample size (
Graph illustrating the “hard cutoff” approaches of
One could adapt the
∑[
≤
∑[
Such a hard-edged statistical framework would go beyond the recommendations of
The
There is a relatively simple solution to this shortcoming: with continuous data, to move away from models which attribute cutoffs and instead to apply precise scoring under a system which only uses a hard cutoff at the very final point of determining species or subspecies rank. Such an approach was effectively attempted in a recent study (e.g.,
Immediately prior to going to press,
In this and the next section, a new, universal measure of differentiation is developed. It is potentially usable in any taxonomic group where continuous variables are studied and in other contexts to measure effect sizes.
Step 1: identify a comparison group.
For an assessment of the rank of allopatric populations, this method compares: (i) two sympatric and closely related populations which are demonstrably good species and broadly accepted as such (Species 1 and Species 2) as well as (ii) two allopatric populations under study (Population 3 and Population 4). Ideally, Species 1 and Species 2 should also be sister taxa or be known or suspected to be very closely related through molecular studies, such that they represent a good benchmark. However, this may not always be known for certain. Preferably, Species 1 and 2 and Populations 3 and 4 should all be congeneric, but this might not be possible and they might be merely a good example from the same family or order, depending on how speciose the relevant higher-level taxonomy is. Either (but not both) of Population 3 or Population 4 might be the same as Species 1 or Species 2 or they may all be different populations.
Step 2: collect data for relevant variables using continuous measurements.
It is critical to ensure a fair identification of variables, which adequately and honestly document the maximum possible observed variations between all populations (i.e., not just the allopatric pair, but also the sympatric pair). Variables differentiating sympatric Species 1 and 2 should not be overlooked, even if more time is spent studying allopatric Populations 3 and 4. Returning to the theme of taxonomic significance and not simply statistical significance, it is important that the variables under study are likely to be taxonomically relevant. Field experience or knowledge of the organisms concerned is important to avoid splits or lumps being published based on statistical tests applied to inappropriately selected variables.
Unlike multivariate statistics, the technique presented here does not require each data set to have the same measures from the same individuals. This means that a biometric data set based on museum specimens can be combined with a vocal data set based on a different set of individuals and with different sampling, so that data from all possible sources can be collated. The broadest possible geographical and numerical sampling is important (e.g.,
Step 3: undertake pairwise comparisons using controlled unpooled effect sizes.
The following formula should be applied to measure controlled unpooled effect sizes on a pairwise basis, separately for each population/variable combination under study, e.g., for Species 1 and Species 2:
(
Step 4: exclude all the statistically insignificant data.
Comparisons showing no statistical significance should be eliminated and scored as 0. This process needs conducting separately for each population/variable combination under study: a variable might be scored as zero as between Species 1 and Species 2, but may be scored positively as between Population 3 and Population 4. Bonferroni correction is applied here, in order to keep the formula simple and due to the near-nil impact of using less conservative “type 1” error corrections. It is recommended that different sets or “families” of data (biometric, vocal, colorimetric) are treated separately for purposes of determining the appropriate Bonferroni correction. Other more complex “type 1 error” corrections such as Dunn–Šidák should be considered for situations where very large numbers of variables are compared. The exclusion of statistically insignificant data results in the following modification to the effect size formula above, e.g., for Species 1 and Species 2:
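A minimal sketch of the significance filter in this step follows. It assumes a significance level of 0.05 and a single family of 26 vocal variables; both are illustrative assumptions, as are the function names and the use of a strict inequality. The Dunn–Šidák threshold is computed alongside the Bonferroni threshold for comparison.

```python
def bonferroni_alpha(alpha, n_variables):
    """Per-variable threshold under Bonferroni correction."""
    return alpha / n_variables

def sidak_alpha(alpha, n_variables):
    """Per-variable threshold under Dunn-Sidak correction."""
    return 1 - (1 - alpha) ** (1 / n_variables)

def filtered_score(effect_size, p_value, threshold):
    """Keep the effect size if significant, otherwise score it as 0."""
    return effect_size if p_value < threshold else 0.0

ALPHA = 0.05        # assumed significance level
N_VARIABLES = 26    # assumed size of the vocal "family"

bonf = bonferroni_alpha(ALPHA, N_VARIABLES)
sidak = sidak_alpha(ALPHA, N_VARIABLES)
print(f"Bonferroni: {bonf:.6f}  Dunn-Sidak: {sidak:.6f}")

# Two hypothetical comparisons: (effect size, p value)
print(filtered_score(0.37, 0.0015, bonf))   # retained
print(filtered_score(0.91, 0.0027, bonf))   # eliminated, scored 0
```

The two thresholds differ only in the fifth decimal place under these assumptions, so a given comparison will rarely change outcome between the two corrections.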
Step 5: add up all the results of the above calculations (using a Euclidean approach).
It would be simple then to add up all the effect sizes, as follows, and see whether Species 1 vs Species 2 or Population 3 vs Population 4 had the better score. This would apply the formula:
∑ [
≤
∑ [
However, this would be suboptimal statistically. Applying such a formula would reflect the underlying conceptual approach of existing systems to rank allopatric populations (including
Justification for Euclidean summation, using a univariate/bivariate example.
(
then Pythagorean principles result in the following calculation of distance between points
√[(
And this can be simplified to:
√(∑ (
This approach cannot perfectly be applied to a series of effect size measures based on multiple pairwise comparisons, in that such data are not necessarily linked to one another as a set of corresponding coordinates. However, assuming that the variables studied are independent, it is valid to measure distance this way. Independence of variables can be verified through correlation tests and promoted in variable selection by seeking to capture the maximum possible observed variation efficiently.
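The difference between simple and Euclidean summation can be sketched as follows, treating each retained effect size as a distance along an independent axis. The function names are hypothetical and the inputs are illustrative values rather than measured data.

```python
import math

def linear_score(scores):
    """Simple addition of effect sizes."""
    return sum(scores)

def euclidean_score(scores):
    """Square root of the sum of squared effect sizes."""
    return math.sqrt(sum(s * s for s in scores))

# Three characters, each at full diagnosability (4 effect sizes).
three_diagnosable = [4.0, 4.0, 4.0]
# One very strongly differentiated character plus one moderate one.
one_large_one_moderate = [10.0, 5.0]

print(linear_score(three_diagnosable), round(euclidean_score(three_diagnosable), 2))
print(linear_score(one_large_one_moderate), round(euclidean_score(one_large_one_moderate), 2))
```

Simple addition ranks the three diagnosable characters (12) below the two-character case (15) by a wide margin, whereas Euclidean summation (about 6.93 against 11.18) gives proportionately more weight to a single large difference and less to the accumulation of several smaller ones.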
Each controlled unpooled effect size (that has not been eliminated to zero using the statistical significance filter) can be considered to represent the equivalent of a distance 
In studies using continuous variables, allopatric populations should be ranked as species if they show equal to or greater variation than that shown between closely related sympatric species (
√(∑ [
≤
√(∑ [
Where:
Species 1 and Species 2 are two sympatric species that are closely related to one another (preferably known to be sisters) and which are related to Population 3 and Population 4.
Population 3 and Population 4 are two allopatric populations whose rank is being determined.
1, 2, 3, and 4 refer to relevant data for Species 1, Species 2, Population 3 and Population 4 respectively.
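The comparison described above can be sketched in a few lines, assuming that each list already holds controlled unpooled effect sizes with statistically insignificant variables set to 0. The function names and all values are hypothetical.

```python
import math

def euclidean_score(scores):
    """Square root of the sum of squared effect sizes."""
    return math.sqrt(sum(s * s for s in scores))

def ranks_as_species(sympatric_scores, allopatric_scores):
    """True if the allopatric pair differs at least as much as the benchmark."""
    return euclidean_score(sympatric_scores) <= euclidean_score(allopatric_scores)

# Hypothetical filtered effect sizes for the same four variables.
species_1_vs_2 = [1.4, 0.0, 2.1, 1.0]         # sympatric benchmark pair
population_3_vs_4 = [4.5, 0.0, 0.0, 3.3]      # allopatric pair under study

print(round(euclidean_score(species_1_vs_2), 2))
print(round(euclidean_score(population_3_vs_4), 2))
print(ranks_as_species(species_1_vs_2, population_3_vs_4))
```

The two lists need not be the same length or cover the same variables, since each pair is scored separately before the comparison is made.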
Because this formula is not simple to calculate, a spreadsheet is being published alongside this paper on the author’s
In some ways, when one looks carefully at the outcomes of different statistical tests undertaken here, the formula is a statement of the obvious. The statistics underlying it are basic. It merely relies upon the good aspects of long-established statistical methods for comparing continuous variables of previous authors (e.g.,
As regards
Some important recommendations should be borne in mind when using this method, in addition to those set out above under “Steps”:
(i)
(ii)
(iii)
(iv)
(v)
Table
Worked-through example of the new formula proposed herein, assessing the rank of two allopatric
Variable  

Controlled unpooled effect size  p  Score  Controlled unpooled effect size  p  Score  


No. of notes  1.41  5.2 × 10^{-29}  1.41  4.54  6.92 × 10^{-33}  4.54 
Song length  0.37  0.0015  0.37  0.71  0.0022  0 
Song speed  2.05  3.9 × 10^{-49}  2.05  7.65  1.03 × 10^{-89}  7.65 
Max. acoustic frequency second note  1.03  2.7 × 10^{-16}  1.03  3.30  7.20 × 10^{-18}  3.30 
Max. acoustic frequency of last note  1.31  3.9 × 10^{-23}  1.31  2.88  2.03 × 10^{-15}  2.88 
Change in acoustic frequency  0.52  3.7 × 10^{-5}  0.52  1.36  2.62 × 10^{-8}  1.36 
Position of peak of frequency  4.75  4.5 × 10^{-74}  4.75  0.08  0.76  0 
Position of trough in frequency  3.83  9.5 × 10^{-55}  3.83  0.03  0.91  0 


Call length  0.91  0.00270  0  0.99  0.040  0 
Maximum acoustic frequency  0.99  0.00103  0.99  0.37  0.16  0 


No. of notes  0.11  0.770  0  0.01  0.99  0 
Song length  0.06  0.880  0  0.22  0.57  0 
Song speed  0.21  0.494  0  0.21  0.56  0 
Max. acoustic frequency  0.31  0.371  0  1.71  0.00049  1.71 
Min. acoustic frequency  0.19  0.529  0  2.29  0.00013  2.29 
Change in acoustic frequency  0.24  0.447  0  0.45  0.25  0 
Position of peak of frequency  0.49  0.396  0  0.65  0.12  0 
Position of trough in frequency  0.02  0.946  0  0.12  0.74  0 


No. of notes  0.78  0.0073  0  5.52  2.41 × 10^{-15}  5.52 
Song length  0.60  0.034  0  1.00  0.023  0 
Song speed  1.18  7.07 × 10^{-5}  1.18  7.00  1.50 × 10^{-19}  7.00 
Max. acoustic frequency second note  0.61  0.031  0  1.62  0.0034  0 
Max. acoustic frequency of last note  1.14  0.00020  1.14  2.61  3.86 × 10^{-7}  2.61 
Change in acoustic frequency  0.23  0.40  0  0.71  0.18  0 
Position of peak of frequency  0.04  0.89  0  0.96  0.11  0 
Position of trough in frequency  0.27  0.33  0  0.47  0.41  0 



Table
Full scores across the









3.56*  12.22  11.23  28.00  18.15  14.60 

13.75  15.64  19.51  21.49  15.50  

4.08*  23.74  21.13  17.85  

28.45  28.15  24.39  






Biometric scores are shown in Table
Scores across the
Taxon 








0*  3.02  1.72  3.35  [5.36]  5.96 

2.49  0  3.01  [5.13]  5.72  

1.64*  2.22  [3.01]  5.07  

2.54  [4.59]  5.35  

[ 



[ 
As above, although a universal
There are however some parameters and examples available from the case studies (Table
Scores of examples from the data set which are both (i) sister species (or relevant sympatric subspecies of sister species) as shown by molecular studies; and (ii) sympatric, to show ranges of scores. Also presented are examples of scores passing other authors’ species tests. Note the
Sympatric pair or proposed score for species rank  Type of data  Score 

Voice  7.14  
Voice + biometrics = total  9.16 + 0 = 9.16  
Voice + biometrics = total  10.59 + 0 = 10.59  
Voice + biometrics = total  8.79 + 0 = 8.79  
Voice + biometrics = total  7.90 + 8.01 = 15.91  


Basis for 
Voice: diagnosability of three characters = 3 × 4 SD  6.92 
A basis for 
Voice or biometrics: 1 × 10 
11.18 
A basis for 
Voice and biometrics: 1 × 10 
10.20 
A basis for 
Voice and biometrics: 1 × 10 
10.01 
A basis for 
Voice and biometrics: 2 × 5 
7.07 
A basis for 
Voice and biometrics: 1 × 5 
5.74 
A basis for 
Voice and biometrics: 1 × 5 
5.39 
A basis for 
Voice and biometrics: 3 × 2 
3.47 
As regards subspecies or
Among the unnamed populations in the study group, only the “Apurímac south” population of
Probably the most difficult taxonomic decision in this series of papers was that of how to rank
The
The method proposed here involves no universal score for species rank. However, it would still be interesting to see how other pairwise situations involving sympatric sister species measure up under this system, and then possibly to revisit the philosophy underlying
Thanks to Michael Patten, Mort Isler, and George Sangster for helpful comments on this paper. The inclusion of a Bonferroni correction in the Level 1 test was suggested by F. Gary Stiles (in litt. 2007). Michael Patten made various insightful comments which resulted in addition of all the analyses and methods involving pooled standard deviation data and the various additional “type 1” corrections. Some of the methods underlying this paper were developed in papers published together with Jorge Avendaño, Paul Salaman, and others. Thanks as always to Blanca and Lucas for their love, good cheer, and encouragement.