Research Article
Corresponding author: L. Lee Grismer (lgrismer@lasierra.edu)
Academic editor: Minh Duc Le
© 2025 L. Lee Grismer.
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Grismer LL (2025) Introducing multiple factor analysis (MFA) as a diagnostic taxonomic tool complementing principal component analysis (PCA). ZooKeys 1248: 93-109. https://doi.org/10.3897/zookeys.1248.159516
Multiple factor analysis (MFA) is introduced as a diagnostic tool for taxonomy and discussed using examples from the herpetological literature. Its methodology and output are compared and contrasted with those of the more often used principal component analysis (PCA). The most significant difference between MFA and PCA is that the former can more appropriately integrate numeric (meristic and/or morphometric) and categorical characters (e.g., big-small, blue-red, striped-banded, keeled-smooth, etc.) in the analysis, thus creating a nearly total-evidence morphological output. MFA emphasizes the diagnostic utility of categorical characters in a statistically defensible landscape as opposed to their often-anecdotal treatment or complete omission in species diagnoses, usually owing to their variability. PCA is most informative when only a single numeric data type (e.g., morphometric or meristic) is analyzed. Using PCA to analyze different data types separately and comparing the results, one can determine which data type and which of its variables (traits/characters) bear most heavily on the differentiation among the operational taxonomic units (OTUs [i.e., populations or species]) and, in some cases, their biological significance. If more than one data type is used in a PCA, the output may be biased by the data type with the largest amount of variation or statistical variance. Also discussed is the necessity of using a non-parametric permutational multivariate analysis of variance (PERMANOVA), or a similar analysis, as a robust, statistically defensible method for assessing the significance of OTU plot positions as opposed to subjective visual interpretations.
Diagnosis, herpetology, meristic data, morphometric data, multivariate statistics, statistical defensibility, taxonomy
Even before the development of modern biology, morphological characters have been the evidential foundation upon which species were compared and diagnosed. Despite this longstanding convention, however, morphological diagnoses are rarely the result of rigorous statistical evaluation—quite notably so for herpetological taxonomy (see
Overall, these analyses can augment a taxonomist’s ability to construct statistically defensible classifications based on both multivariate and standard univariate analyses (e.g., Student’s t-test, Welch’s t-test, Mann-Whitney U test, various types of ANOVAs). This review is not intended to be an exhaustive overview of the multitude of currently available multivariate methods in this increasingly complex and growing landscape (see
Multivariate ordination analyses summarize and visualize complex datasets by arranging data points (i.e., characters/traits/variables/metrics/morphometrics) along a few key axes (components or dimensions) that represent underlying gradients or patterns in the data. In general, multivariate analyses are as exploratory as they are inferential. They can provide researchers with initial hypotheses and predictions as to how individual samples (specimens) may cluster together in morphospace (i.e., their position in a scatter plot) based on a set of input characters (usually morphology, but genetic, physiological, ecological, and other data may also be used) and, more specific to taxonomy, how these OTU clusters may align with their phylogenetic delimitation. The most commonly used ordination procedure in herpetological taxonomy is PCA. Like other multivariate analyses, PCA reduces the dimensionality (i.e., number of variables/characters) of large complex datasets while graphically arranging the most signal-rich data along a few independent axes (i.e., components) that, as noted above, recover the majority of the underlying variation in the dataset. In other words, it provides the best signal-to-noise ratio (i.e., informative data-to-uninformative data ratio) for descriptive interpretations (
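As a minimal sketch of what this dimensionality reduction involves, the Python example below (standard library only) standardizes a small matrix of hypothetical meristic counts and extracts the first principal component by power iteration on the resulting correlation matrix. The data values, the function names, and the choice of power iteration are assumptions made for illustration; they are not taken from the article, and dedicated statistical software would normally be used.

```python
import math

def standardize(rows):
    """Center each column to mean 0 and scale it to unit sample variance."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (n - 1))
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, sds)] for r in rows]

def covariance(rows):
    """Sample covariance matrix of already-centered data."""
    n, p = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(p)]
            for i in range(p)]

def first_component(cov, iters=500):
    """Leading eigenvalue and unit eigenvector of a symmetric matrix."""
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * cov[i][j] * v[j]
                     for i in range(p) for j in range(p))
    return eigenvalue, v

# Hypothetical scale counts: six specimens (rows) by three characters (columns).
data = [
    [9, 30, 19],
    [10, 32, 20],
    [11, 34, 21],
    [12, 36, 22],
    [13, 37, 24],
    [14, 39, 25],
]

z = standardize(data)
corr = covariance(z)                      # covariance of standardized data
eigval, loadings = first_component(corr)  # variance and direction of PC1
scores = [sum(a * b for a, b in zip(row, loadings)) for row in z]
explained = eigval / len(corr)            # share of total variance on PC1

print("PC1 loadings:", [round(x, 3) for x in loadings])
print("Proportion of variance on PC1:", round(explained, 3))
```

Here `explained` is the share of the total variance carried by the first component, `loadings` indicates how strongly each character contributes to it, and `scores` gives each specimen's position along that axis.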
A notable advantage of multivariate analyses over univariate analyses is that in some datasets, there may be no statistically significant mean differences among OTUs in character-by-character analyses (e.g., Student’s t-test, Welch’s t-test, Mann-Whitney U test, various types of ANOVAs). However, multivariate analyses may recover statistically significant differences among OTU plots because all the character variation in the dataset is analyzed simultaneously, and as such, the data points (i.e., the specimens) will cluster based on their overall similarity to one another. Another notable advantage of multivariate analyses is their ability to detect complex patterns in shape/size change, as well as patterns based on the interaction or combination of traits that univariate analyses cannot recover. These too can have diagnostic value. When morphospatial clustering (plots) aligns with OTUs (or MOTUs [molecular operational taxonomic units] for molecular data) in a phylogeny, it represents a robust inference for testable taxonomic hypotheses.
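The point about combinations of traits can be illustrated with a small Python sketch. In the hypothetical data below, two OTUs overlap in the range of each character taken on its own, yet separate completely along a combined axis (character 2 minus character 1). The values are invented for illustration, and simple range overlap is used only as a stand-in for a formal statistical test.

```python
# Hypothetical (character 1, character 2) values for five specimens per OTU.
otu_a = [(1.0, 2.1), (2.0, 3.0), (3.0, 4.2), (4.0, 4.9), (5.0, 6.1)]
otu_b = [(1.5, 0.4), (2.5, 1.6), (3.5, 2.4), (4.5, 3.6), (5.5, 4.5)]

def value_range(values):
    """Return the (minimum, maximum) of a list of values."""
    return min(values), max(values)

def ranges_overlap(r1, r2):
    """True if two closed intervals share at least one point."""
    return r1[0] <= r2[1] and r2[0] <= r1[1]

# Character-by-character comparison: both characters overlap between OTUs.
char1_overlap = ranges_overlap(value_range([x for x, _ in otu_a]),
                               value_range([x for x, _ in otu_b]))
char2_overlap = ranges_overlap(value_range([y for _, y in otu_a]),
                               value_range([y for _, y in otu_b]))

# Combined axis: the difference between the two characters.
combo_a = [y - x for x, y in otu_a]
combo_b = [y - x for x, y in otu_b]
combo_overlap = ranges_overlap(value_range(combo_a), value_range(combo_b))

print("Character 1 ranges overlap:", char1_overlap)
print("Character 2 ranges overlap:", char2_overlap)
print("Combined-axis ranges overlap:", combo_overlap)
```

An ordination finds such discriminating combinations from the data rather than requiring them to be specified in advance, which is why it can separate OTUs that no single character distinguishes.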
As noted, PCA is one of the most widely used multivariate analyses in herpetological taxonomy, as well as being one of the oldest methodologies of ordination analyses (
PCA can use both size-corrected morphometric (see below) and meristic data (i.e., numeric data [numbers]) but not raw untransformed categorical data. However, combining morphometric and meristic data types in the same dataset should be avoided (see below). There are even arguments for not using discrete data (e.g., meristic [scale counts], transformed categorical data, etc.) in a PCA as opposed to continuous data (e.g., morphometric, physiological, call frequency, etc.) because discrete data do not fulfill the assumptions of linearity (
PCA is useful in that separate analyses of different data types can be used to infer how each data type explains variation among OTUs independently of other data types. This is important if one is interested in how different OTUs are shaped (i.e., morphometrically related) with respect to one another, what characters are the principal drivers of the different shapes, and how these shapes may or may not correlate with some other aspects of the OTU’s biology. For example,
A–C. Morphometric data among allopatric populations of Cyrtodactylus kampingpoiensis showing only two pairs of populations bearing significant differences (shaded cells in the PERMANOVA analysis in C) and the wide degree of overlap among all populations in the discriminant analysis of principal components (DAPC) in B. D–F. Meristic data among allopatric populations of C. kampingpoiensis showing five pairs of populations bearing significant differences (shaded cells in the PERMANOVA analysis in F) and very little overlap among the populations in the DAPC in E. Circles outlined in white are centroids. Centroids in B and E are where plot lines converge. Reproduced and modified from
A. PCA biplot illustrating the degree of correlation of large eyes, long snouts, long heads, and long limbs among the granite-cave-dwelling ecomorphs of Cyrtodactylus eisenmanae and C. grismeri compared to the general scansorial ecomorphs C. condorensis and C. leegrismeri; B. DAPC plot showing the separation of the granite-cave-dwelling and general scansorial ecomorphs. Illustrations in A are of C. eisenmanae (left) and C. grismeri (right). AXG = axilla-groin length, PH = pelvic height, HW = head width, PW = pelvic width, HD = head depth, HDW = hindlimb width, FLW = forelimb width, FLL = forelimb length, HL = head length, HDL = hindlimb length, SNT = snout length, and ED = eyeball diameter. Reproduced and modified from
A discriminant analysis of principal components (DAPC)— part of the adegenet R package (
MFA is a global, unsupervised, multivariate analysis whose goal is to integrate different kinds of variables/traits/characters (referred to here as data types) describing the same observation. MFA can include categorical (e.g., big or small, blue or red, striped or banded, keeled or smooth, present or absent, etc.) and numeric data (continuous and discrete) in a single dataset (
MFA works in two phases (Table
Sample MFA dataset composed of size-corrected morphometric (green), meristic (blue), and three data types of categorical characters (coded yellow, orange, and gray). Each color-coded set of columns (data type) is analyzed separately in the first phase of an MFA, using separate PCAs for green and blue and separate multiple correspondence analyses (MCAs) for yellow, orange, and gray (see Fig.
| Species | Snout-vent length | Head length | Head width | Head depth | Axilla-groin length | Forelimb length | Hindlimb length | |
| speciesA | 1.897077003 | 1.049100026 | 1.13665543 | 1.514108961 | 1.323680148 | 1.139453736 | 0.905969562 | |
| speciesA | 1.851869601 | 1.085979234 | 1.160260645 | 1.548274112 | 1.31547634 | 1.153272842 | 0.963635997 | |
| speciesA | 1.853698212 | 1.057721181 | 1.152014555 | 1.52439554 | 1.320371821 | 1.135465944 | 0.931935749 | |
| speciesB | 1.900913068 | 1.05211498 | 1.152180577 | 1.522084325 | 1.326336748 | 1.152507762 | 0.971406547 | |
| speciesB | 1.891537458 | 1.075794174 | 1.131102333 | 1.511341139 | 1.322321054 | 1.153800315 | 0.895991691 | |
| speciesB | 1.851258349 | 1.040460898 | 1.151603126 | 1.52431887 | 1.311669644 | 1.166228157 | 0.902068728 | |
| speciesC | 1.72427587 | 1.058852622 | 1.135780471 | 1.514570835 | 1.325793134 | 1.15545368 | 0.93677663 | |
| speciesC | 1.897627091 | 1.062530419 | 1.130211672 | 1.518379877 | 1.325162144 | 1.174979452 | 0.961562967 | |
| speciesC | 1.851258349 | 1.032385245 | 1.172353038 | 1.508095446 | 1.325510884 | 1.148379582 | 0.933746817 | |
| Species | Supralabials | Infralabials | 4th toe lamellae | Ventrals | Pattern | Color | Size | Tubercles |
| speciesA | 9 | 9 | 32 | 20 | spotted | red | big | absent |
| speciesA | 11 | 9 | 37 | 21 | spotted | red | big | absent |
| speciesA | 10 | 10 | 34 | 20 | spotted | red | big | absent |
| speciesB | 11 | 9 | 34 | 19 | banded | blue | big | present |
| speciesB | 11 | 9 | 35 | 20 | banded | blue | big | present |
| speciesB | 10 | 10 | 31 | 20 | banded | blue | big | present |
| speciesC | 10 | 9 | 30 | 20 | banded | green | small | absent |
| speciesC | 11 | 10 | 34 | 21 | banded | green | small | absent |
| speciesC | 10 | 10 | 37 | 20 | banded | green | small | absent |
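The two-phase logic can be sketched in Python for the numeric data types of a dataset laid out like the sample table. The sketch assumes the standard MFA weighting, in which each data type is analyzed on its own and then divided by the square root of its first eigenvalue before a global analysis of the combined table. The values are hypothetical and the function names are illustrative. Categorical data types, which would pass through an MCA in the first phase, are left out to keep the example short.

```python
import math

def standardize(rows):
    """Center each column to mean 0 and scale it to unit sample variance."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (n - 1))
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, sds)] for r in rows]

def leading_eigenvalue(rows, iters=1000):
    """Largest eigenvalue of the sample covariance of centered rows,
    estimated by power iteration."""
    n, p = len(rows), len(rows[0])
    cov = [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(p)]
           for i in range(p)]
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return sum(v[i] * cov[i][j] * v[j] for i in range(p) for j in range(p))

# Hypothetical data for six specimens (same row order in both data types).
morphometric = [[1.09, 1.17, 0.96], [1.08, 1.16, 0.95], [1.06, 1.15, 0.93],
                [1.05, 1.14, 0.91], [1.04, 1.13, 0.90], [1.02, 1.11, 0.88]]
meristic = [[11, 37], [11, 35], [10, 34], [10, 34], [9, 32], [10, 31]]

# Phase 1: analyze each data type separately and record its first eigenvalue.
groups = {"morphometric": standardize(morphometric),
          "meristic": standardize(meristic)}
first_eigs = {name: leading_eigenvalue(z) for name, z in groups.items()}

# Phase 2: divide each data type by the square root of its first eigenvalue
# so that no data type dominates, then analyze the combined table.
weighted = {name: [[v / math.sqrt(first_eigs[name]) for v in row] for row in z]
            for name, z in groups.items()}
combined = [a + b for a, b in zip(weighted["morphometric"], weighted["meristic"])]
global_eig = leading_eigenvalue(combined)

print("First eigenvalue per data type:", first_eigs)
print("First eigenvalue of the global analysis:", round(global_eig, 3))
```

After weighting, the leading axis of every data type has a variance of 1, so the first global eigenvalue lies between 1 and the number of data types. A value near the upper bound indicates that the data types share a common main axis of variation.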
There are notable advantages that MFA has over PCA. The first is its ability to integrate different data types into the same analysis. This makes good biological sense because different data types represent different biological systems and, as such, are presumably under different selection pressures (or some may be under no selection pressure at all). Thus, they could be evolving independently but collectively contribute to the overall evolutionary trajectory of the OTU. Running independent analyses on the different data types in the first phase and then normalizing them in the second phase is analogous to using a partition scheme to model the evolution of different genes in a concatenated genetic dataset (or perhaps codons within the same gene), whereas using a single data type in a PCA is analogous to running an unpartitioned analysis. Another advantage concerns categorical data, which many species diagnoses omit because a particular character state may not occur in all individuals. For example, in species A, 70% of the individuals may be spotted while 30% are banded, and vice versa for species B. This does not prevent MFA from using these characters because each individual is coded accordingly (in this case, “spotted” or “banded”). Because MFA is an unsupervised analysis, each individual is treated independently of every other individual, and individuals will ultimately cluster (plot) with the individuals to which they are most similar. In this way, categorical data can be jointly analyzed statistically to inform a multivariate diagnosis. Additionally, categorical data do not have to be binary but can be multistate. For example, species A may have blue spots on the head, the head spots of species B may be green, and those of species C may be red, and these states can also occur in varying frequencies.
Treating categorical characters in this manner emphasizes their diagnostic utility in a statistically defensible landscape as opposed to their often-anecdotal treatment or even omission.
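A short Python sketch shows one way the per-individual coding of categorical characters might look. Each specimen receives an indicator (0/1) column for every character state, so a character that is polymorphic within an OTU is still usable. The specimens, character names, and the 70%/30% split follow the hypothetical example in the text and are not real data; the indicator layout is an assumption about how such data are commonly prepared for a correspondence-type analysis.

```python
from collections import Counter

# Hypothetical specimens: (OTU label, {character: state}).
specimens = (
    [("speciesA", {"pattern": "spotted", "head_spots": "blue"})] * 7
    + [("speciesA", {"pattern": "banded", "head_spots": "blue"})] * 3
    + [("speciesB", {"pattern": "banded", "head_spots": "green"})] * 7
    + [("speciesB", {"pattern": "spotted", "head_spots": "green"})] * 3
)

def indicator_matrix(records):
    """Code every specimen as 0/1 for each (character, state) combination."""
    columns = sorted({(ch, st) for _, traits in records
                      for ch, st in traits.items()})
    rows = [[1 if traits.get(ch) == st else 0 for ch, st in columns]
            for _, traits in records]
    return columns, rows

def state_frequencies(records, character):
    """Frequency of each state of one character within each OTU."""
    counts = {}
    for otu, traits in records:
        counts.setdefault(otu, Counter())[traits[character]] += 1
    return {otu: {st: n / sum(c.values()) for st, n in c.items()}
            for otu, c in counts.items()}

columns, rows = indicator_matrix(specimens)
freqs = state_frequencies(specimens, "pattern")

print("Indicator columns:", columns)
print("First specimen coded as:", rows[0])
print("Pattern frequencies by OTU:", freqs)
```

Because the OTU label is not part of the coded matrix, a banded individual of species A is simply a banded individual in the analysis, consistent with the unsupervised treatment described above.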
Another advantage of MFA—as noted above—is that each data type is normalized separately and then combined with the other data types so one cannot overleverage the other. When using a mixed dataset of meristic and morphometric characters in a PCA, the different data types are standardized or scaled simultaneously. This treats meristic and morphometric data as a single biological system, which they clearly are not. For example, body parts change allometrically with growth and age, whereas scale counts remain the same throughout life. In fact,
Lastly, MFA automatically handles missing data (i.e., empty cells in the data matrix) when using the MFA() function in FactoMineR (
In general, MFA has greater discriminating power than PCA, largely because it incorporates different data types, including categorical characters. These characters are usually invariable within OTUs, or at least lack the range of variation (after normalization) found in numeric characters, and are a valuable source of additional informative character data. For example,
A. PCA of the species of the Cyrtodactylus intermedius group based on meristic data. Black circles outlined in white within the plots are centroids; B. MFA of the same species of the C. intermedius group based on meristic, size-corrected morphometric, and categorical characters showing greater discrimination among the plots. White squares within the plots are centroids. Reproduced and modified from
Scatter plots of an east Asian clade of skinks from the genus Scincella from the Ryukyu Archipelago, Japan and Taiwan. A. PCA based on meristic data. B. PCA based on size-corrected morphometric data. C. PCA based on a combined meristic and size-corrected morphometric dataset. Circles outlined in white are centroids. D. MFA based on meristic, size-corrected morphometric, and categorical (color pattern) characters. White squares within the plots are centroids. Data for all analyses come from
Objective interpretations of scatterplots are difficult (
One method of objective interpretation is a non-parametric permutational multivariate analysis of variance (PERMANOVA) (
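The permutation logic behind a PERMANOVA can be sketched in a few lines of Python. The example computes a pseudo-F statistic from squared Euclidean distances among specimens and compares it with the values obtained after repeatedly shuffling the OTU labels. The ordination scores, the number of permutations, the random seed, and the function names are assumptions for illustration only; published analyses would normally rely on an established statistical package, and pairwise comparisons among more than two OTUs are not shown.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two specimens."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def pseudo_f(points, labels):
    """Pseudo-F: among-group versus within-group sums of squared distances."""
    n = len(points)
    groups = sorted(set(labels))
    ss_total = sum(sq_dist(points[i], points[j])
                   for i in range(n) for j in range(i + 1, n)) / n
    ss_within = 0.0
    for g in groups:
        idx = [i for i, lab in enumerate(labels) if lab == g]
        ss_within += sum(sq_dist(points[i], points[j])
                         for k, i in enumerate(idx) for j in idx[k + 1:]) / len(idx)
    ss_among = ss_total - ss_within
    a = len(groups)
    return (ss_among / (a - 1)) / (ss_within / (n - a))

def permanova(points, labels, n_perm=999, seed=1):
    """Observed pseudo-F and its permutation p-value."""
    rng = random.Random(seed)
    observed = pseudo_f(points, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between position and OTU
        if pseudo_f(points, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical scores on two ordination axes for five specimens per OTU.
points = [(1.0, 1.2), (1.3, 0.9), (0.8, 1.1), (1.1, 1.4), (0.9, 0.8),
          (3.0, 3.1), (3.2, 2.8), (2.9, 3.3), (3.4, 3.0), (2.7, 2.9)]
labels = ["A"] * 5 + ["B"] * 5

f_obs, p_value = permanova(points, labels)
print("pseudo-F:", round(f_obs, 2), "p-value:", p_value)
```

A small p-value indicates that the observed separation between the OTU clusters is rarely matched when specimens are assigned to OTUs at random, which replaces a visual judgment of cluster overlap with a reproducible test.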
In my opinion, a statistically defensible species diagnosis is the sine qua non of any diagnosis and this is not simply academic. Such species are paramount for understanding biodiversity, ecological monitoring, informing conservation efforts, effectively and efficiently managing ecosystems, evaluating species-management plans, medical science, and much more (see
Multivariate diagnoses are not without some epistemological challenges in that they do not recover characters traditionally used in dichotomous taxonomic keys or that are normally included in a list of diagnostic traits (this is often true for single characters with overlapping ranges between or among species that have significantly different means). However, this is a methodological challenge that does not bear on the ontological properties of the species. Among the standard list of written diagnostic characters in the diagnosis section of a species description, it can simply be noted that “species A sp. nov. differs statistically (p = some value) in morphospace from other species”, in the same way that characters bearing significantly different mean values can be listed in the diagnosis section.
Another issue for all statistical analyses is very small sample sizes (N < 3); with N = 3, at least a legitimate range, mean, and standard deviation can be calculated. Small sample sizes are not necessarily an insurmountable problem in many cases. In the absence of a recent phylogeny, if the numeric value of an individual character from an underrepresented population falls “well outside” the range of that of well-sampled putatively close relatives, then that is a solid foundation for building a testable hypothetical diagnosis, to be followed by the acquisition of additional specimens. The same holds for a multivariate approach. If the underrepresented population falls “well outside” the clusters of well-sampled putative close relatives, then that is an added layer of evidence that aligns with the previously mentioned raw data. If the multivariate analysis is an MFA that includes categorical characters that are invariant (or only “slightly” variable) in the other sampled taxa, then the diagnostic hypothesis becomes even more robust, which is another reason categorical characters should be folded into all statistical operations when possible. Therefore, small sample sizes should not necessarily be dismissed from statistical procedures. In fact, I would opine all the more that if the sample size for one OTU is small, statistical evaluation becomes even more necessary because it can at least inform us where the individuals of the poorly represented OTU plot with respect to the other closely related species, and this can be the basis for a testable hypothesis (Fig.
MFA of the Sphenomorphus stellatus complex after
In an era of bioinformatics with increasingly complex data-rich and analytically intensive reconstructions of molecular phylogenies and species delimitation methods, it is ironic that the methodology (or lack thereof) for morphologically diagnosing species has not progressed much beyond the 1800s.
I wish to thank Anita Malhotra, Bryan L. Stuart, Jesse L. Grismer, Evan S. H. Quah, Minh D. Le, Kin O. Chan, Vinh Q. Luu, and Aniruddha Datta-Roy for numerous discussions (sometimes heated) and comments on early drafts of the manuscript that greatly improved its scope and content. Any expressed opinions in this publication are the author’s.
The author has declared that no competing interests exist.
No ethical statement was reported.
No use of AI was reported.
The author funded all research.
The author conceptualized the study, wrote the manuscript, and constructed all the figures in their sourced references.
L. Lee Grismer https://orcid.org/0000-0001-8422-3698
All of the data that support the findings of this study are available in the main text.