---
title: "Compute and Interpret Quality of Functional Spaces"
author: "Sebastien Villeger"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Compute and Interpret Quality of Functional Spaces}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

# About this tutorial
This tutorial illustrates how to compute and interpret quality of functional spaces using `mFD`, with a special emphasis on how functional dendrograms and functional spaces with a low dimension could distort original trait-based distances. This tutorial also explains why square-rooting the original distances before computing PCoA may be misleading.
# 1. Tutorial's data
**DATA** The dataset used to illustrate this tutorial is *a fruit dataset* based on 25 types of fruits. Each fruit is characterized by 5 traits summarized in the following table:

| Trait name | Trait measurement | Trait type | Number of classes | Classes code | Unit |
|:------------:|:------------------:|:------------:|:-------------------:|:---------------:|:-----:|
| Size | Maximal diameter | Ordinal | 5 | 0-1 ; 1-3 ; 3-5 ; 5-10 ; 10-20 | cm |
| Plant | Growth form | Categorical | 4 | tree ; shrub ; vine ; forb | NA |
| Climate | Climatic niche | Ordinal | 3 | temperate ; subtropical ; tropical | NA |
| Seed | Seed type | Ordinal | 3 | none ; pip ; pit | NA |
| Sugar | Sugar | Continuous | NA | NA | g/kg |
**NOTE** This dataset is a subset of the one used in the [mFD: General Workflow](https://cmlmagneville.github.io/mFD/articles/mFD_general_workflow.html) tutorial, restricted to the non-fuzzy traits.
The dataframe gathering species traits looks as follows:

```{r}
data("fruits_traits", package = "mFD")

# keep only the first 5 columns, i.e. drop the fuzzy traits:
fruits_traits <- fruits_traits[1:5]

# display the table:
knitr::kable(head(fruits_traits),
  caption = "Species x traits dataframe based on *fruits* dataset")
```
Thus, this dataset contains 5 traits: 3 ordinal (Size, Climate, Seed), 1 categorical (Plant) and 1 continuous (Sugar):

```{r, echo = FALSE}
summary(fruits_traits)
```
These traits are summed up in the following dataframe (details: [mFD: General Workflow](https://cmlmagneville.github.io/mFD/articles/mFD_general_workflow.html) tutorial):
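```{r}
# one row per trait: its name and its type
# ("O" = ordinal, "N" = nominal, "Q" = quantitative):
fruits_traits_cat <- data.frame(names(fruits_traits),
                                c("O", "N", "O", "O", "Q"))
colnames(fruits_traits_cat) <- c("trait_name", "trait_type")
fruits_traits_cat
```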
# 2. Compute trait-based distance between species
First, trait-based distance between species should be computed using `mFD::funct.dist()`. Here we use Gower distance.
**USAGE**

```{r}
# compute trait-based distances:
dist_fruits <- mFD::funct.dist(
  sp_tr         = fruits_traits,
  tr_cat        = fruits_traits_cat,
  metric        = "gower",
  scale_euclid  = "noscale",
  ordinal_var   = "classic",
  weight_type   = "equal",
  stop_if_NA    = TRUE)

# sum up the distance matrix:
summary(as.matrix(dist_fruits))
```
The Gower distances range from < 0.01 to 0.790. For instance, Gower distances between blackberry and 3 other fruits are:
```{r}
# retrieve fruits names:
ex_blackberry <- c("blackberry", "currant", "cherry", "banana")

# get the distance matrix only for these species:
round(as.matrix(dist_fruits)[ex_blackberry, ex_blackberry], 2)
```
Those observed differences in values are intuitively related to trait values of these 4 species:
```{r}
fruits_traits[ex_blackberry, ]
```
Indeed:

* blackberry shares 3 trait values with currant, and these 2 species have close values for the other 2 traits, which explains the low distance (< 0.1)
* blackberry shares 2 traits with cherry (size and climate) and differs only slightly for seed type (by 1 level), but is quite different in terms of plant type and sugar content, hence a Gower distance around 0.5
* blackberry is maximally different from banana for ordinal traits, differs for the categorical trait, and sugar content differs by a factor of 2.5, hence a high Gower distance (> 0.75, close to the maximum of 0.79 observed in the matrix).
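To make this reasoning concrete, here is a minimal base-R sketch of how a Gower distance between two species can be assembled trait by trait. The helper `gower_pair()` and all trait values and ranges are hypothetical (they are not taken from `fruits_traits`), and ordinal traits are assumed to be handled as ranks divided by the rank range, which is our reading of `ordinal_var = "classic"`.

```r
# Hypothetical helper: Gower distance between two species.
# Each trait contributes a partial distance in [0, 1]; the Gower distance
# is their mean (equal weights are assumed here).
gower_pair <- function(ord_a, ord_b, ord_range,
                       cat_a, cat_b,
                       num_a, num_b, num_range) {
  d_ord <- abs(ord_a - ord_b) / ord_range  # ordinal: rank difference / rank range
  d_cat <- as.numeric(cat_a != cat_b)      # categorical: 0 if equal, 1 otherwise
  d_num <- abs(num_a - num_b) / num_range  # continuous: difference / range
  mean(c(d_ord, d_cat, d_num))
}

# Made-up species: 3 ordinal traits (given as ranks), 1 categorical trait
# and 1 continuous trait
d_ab <- gower_pair(
  ord_a = c(1, 1, 2), ord_b = c(5, 3, 2), ord_range = c(4, 2, 2),
  cat_a = "shrub",    cat_b = "tree",
  num_a = 50,         num_b = 125,        num_range = 150)
d_ab
```

In this made-up case the two species are maximally different for 2 ordinal traits and for the categorical trait, identical for the third ordinal trait, and half the range apart for the continuous trait, so the partial distances are 1, 1, 0, 1 and 0.5.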
# 3. Compute functional space, quality metrics and plot them
## 3.1. Compute functional spaces and associated quality metrics
We now compute functional spaces with 1 to 9 dimensions based on a PCoA, as well as a UPGMA dendrogram, using the `mFD::quality.fspaces()` function. We also compute 4 quality metrics (*i.e.* all combinations of deviation weighting and distance scaling) (details: [mFD General Workflow](https://cmlmagneville.github.io/mFD/articles/mFD_general_workflow.html) tutorial, **step 4.1**).
**USAGE**

```{r, warning = FALSE}
# use quality.fspaces function to compute quality metrics:
quality_fspaces_fruits <- mFD::quality.fspaces(
  sp_dist             = dist_fruits,
  fdendro             = "average",
  maxdim_pcoa         = 9,
  deviation_weighting = c("absolute", "squared"),
  fdist_scaling       = c(TRUE, FALSE))

# display the table gathering quality metrics:
quality_fspaces_fruits$"quality_fspaces"

# retrieve the functional space associated with minimal quality metric:
apply(quality_fspaces_fruits$quality_fspaces, 2, which.min)
```
The best space (with the minimum deviation between trait-based distance and space-based distance) is the 4D space according to all indices. Then, using the output of `mFD::quality.fspaces()`, we plot the quality metrics of each space:

```{r, fig.height = 7, fig.width = 12, fig.align = "center", warning = FALSE}
library("magrittr")

quality_fspaces_fruits$"quality_fspaces" %>%
  tibble::as_tibble(rownames = "Funct.space") %>%
  tidyr::pivot_longer(cols = !Funct.space, names_to = "quality_metric",
                      values_to = "Quality") %>%
  ggplot2::ggplot(ggplot2::aes(x = Funct.space, y = Quality,
                               color = quality_metric,
                               shape = quality_metric)) +
  ggplot2::geom_point()
```
**NB** The higher the value of a metric, the larger the deviations between trait-based and space-based distances between species, hence the lower the quality of the functional space.

We can notice here that:

* the inaccuracy of the dendrogram (shown on the right) is much higher than that of spaces with at least 3 dimensions
* the ranking of spaces is only slightly affected by the quality metric, with higher values here for indices based on squared deviations
* scaling distances increases the inaccuracy of the dendrogram

As FD indices will eventually be computed on coordinates in the space (hence on raw distances), we will hereafter consider only the mean absolute deviation metric.
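The four quality metrics can be sketched in base R as follows. This is a simplified illustration on made-up distances, not the code used by `mFD::quality.fspaces()`. In particular, scaling is assumed here to mean rescaling space-based distances so that their maximum equals the maximum trait-based distance, and the names `mad_fun()` and `rmsd_fun()` are hypothetical.

```r
# Made-up trait-based and space-based distances for 6 species pairs
d_trait <- c(0.10, 0.25, 0.40, 0.55, 0.70, 0.80)
d_space <- c(0.05, 0.30, 0.35, 0.60, 0.60, 0.70)

# Mean absolute deviation and root of mean squared deviation
mad_fun  <- function(x, y) mean(abs(x - y))
rmsd_fun <- function(x, y) sqrt(mean((x - y)^2))

# Assumed scaling: both sets of distances get the same maximum
d_space_scaled <- d_space / max(d_space) * max(d_trait)

quality_toy <- c(
  mad         = mad_fun(d_trait, d_space),
  rmsd        = rmsd_fun(d_trait, d_space),
  mad_scaled  = mad_fun(d_trait, d_space_scaled),
  rmsd_scaled = rmsd_fun(d_trait, d_space_scaled))
round(quality_toy, 3)
```

For any set of deviations, the root of the mean squared deviation is at least as large as the mean absolute deviation, which is why indices based on squared deviations take higher values.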
The raw and absolute deviation of distances for only dendrogram and 2, 3, 4D spaces are plotted below thanks to the `mFD::quality.fspaces.plot()` function:
**USAGE**

```{r, fig.height = 7, fig.width = 12, fig.align = "center", warning = FALSE}
# plot deviations for the dendrogram and the 2D, 3D and 4D PCoA spaces:
mFD::quality.fspaces.plot(
  fspaces_quality = quality_fspaces_fruits,
  quality_metric  = "mad",
  fspaces_plot    = c("tree_average", "pcoa_2d", "pcoa_3d", "pcoa_4d"))
```
2D and 3D spaces distort distances (hence the high deviations, see top row) because some species pairs are closer in those spaces than their trait values warrant. In the 4D space, most species pairs are accurately represented (absolute deviation < 0.1).
## 3.2. Focus on dendrograms
**NB** **Many of the pairwise distances on the dendrogram deviate by more than 0.3 from the trait-based distances** (top-left panel of the above figure). In particular, some of the highest distances on the dendrogram correspond to pairs of species with actually close trait values (Gower distance < 0.3). The dichotomous nature of the dendrogram implies that many species pairs have the same distance: all species pairs located on different sides of the tree root share the same maximal distance.
For instance, let's consider the 3 fruits: lemon, lime and cherry:
```{r}
# get fruits traits:
fruits_traits[c("cherry", "lime", "lemon"), ]
```
The 2 Citrus fruits have similar trait values and differ from the cherry. Now let's have a look at their pairwise distances: Gower distance on trait values, Euclidean distance in the 4 dimensions PCoA space and cophenetic distance on the UPGMA dendrogram.
```{r, warning = FALSE}
# pairwise distances (Gower, 4D PCoA, dendrogram) between the 3 fruits:
quality_fspaces_fruits$"details_fspaces"$"pairsp_fspaces_dist" %>%
  dplyr::filter(sp.x %in% c("cherry", "lime", "lemon") &
                sp.y %in% c("cherry", "lime", "lemon")) %>%
  dplyr::select(sp.x, sp.y, tr, pcoa_4d, tree_average) %>%
  dplyr::mutate(dplyr::across(where(is.numeric), ~ round(.x, 2)))
```
As expected given trait values, the Gower distance between lime and lemon is 2.75 times lower (0.44 / 0.16 = 2.75) than the distance between each of them and cherry. Euclidean distances in the 4D space (`pcoa_4d`) are very similar to those Gower distances, with only a slight overestimation. Meanwhile, on the UPGMA dendrogram, lime is as distant from lemon as from cherry, and lemon is even closer to cherry than to lime. This illustrates the usual bias of **dendrograms, which overestimate the distance between some pairs of species that actually have similar trait values**.

Now let's have a look at the distance between pineapple and other fruits:
```{r, warning = FALSE, fig.height = 7, fig.width = 12, fig.align = "center"}
quality_fspaces_fruits$"details_fspaces"$"pairsp_fspaces_dist" %>%
  # keep only the pairs involving pineapple:
  dplyr::filter(sp.x %in% c("pineapple") | sp.y %in% c("pineapple")) %>%
  # name each pair after the fruit that is not pineapple:
  dplyr::mutate(fruit = stringr::str_replace_all(
    string = paste0(sp.x, "", sp.y),
    pattern = "pineapple", replacement = "")) %>%
  dplyr::select(fruit, Gower_distance = tr,
                Cophenetic_distance = tree_average) %>%
  ggplot2::ggplot(ggplot2::aes(x = Gower_distance,
                               y = Cophenetic_distance,
                               label = fruit)) +
  ggplot2::geom_point(size = 1) +
  ggplot2::geom_text(size = 2, nudge_y = 0.08, check_overlap = TRUE) +
  ggplot2::geom_abline(slope = 1, intercept = 0) +
  ggplot2::scale_x_continuous(limits = c(0, 1)) +
  ggplot2::scale_y_continuous(limits = c(0, 1))
```
The cophenetic distance on the dendrogram between pineapple and all species but banana is 0.53, while the trait-based Gower distance with those 23 fruits varies more than two-fold, from 0.32 (water melon) to 0.73 (currant). This homogenization of distances is due to the ultrametricity of the dendrogram, *i.e.* a species is at the same distance from all species that are not on the same main branch (*i.e.* a branch descending from the root). Let's plot the UPGMA dendrogram:
```{r, warning = FALSE, fig.height = 7, fig.width = 12, fig.align = "center"}
quality_fspaces_fruits$"details_fspaces"$"dendro" %>%
  as.dendrogram() %>%
  dendextend::plot_horiz.dendrogram(side = TRUE)
```
We notice that pineapple is in the 'outer' group with other tropical fruits and that lime is as 'close' to cherry as to lemon.

## 3.3. Focus on the effect of square-rooting distance matrix before computing PCoA
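The homogenization caused by ultrametricity can be reproduced on a tiny example with base R only. The sketch below uses made-up distances between 5 hypothetical species (`a` to `e`), builds a UPGMA tree with `stats::hclust(method = "average")` and compares the input distances with the cophenetic distances.

```r
# Made-up distances between 5 species
d_toy <- as.dist(matrix(
  c(0.0, 0.1, 0.5, 0.6, 0.7,
    0.1, 0.0, 0.4, 0.7, 0.8,
    0.5, 0.4, 0.0, 0.3, 0.5,
    0.6, 0.7, 0.3, 0.0, 0.2,
    0.7, 0.8, 0.5, 0.2, 0.0),
  nrow = 5, dimnames = list(letters[1:5], letters[1:5])))

# UPGMA = hierarchical clustering with average linkage
tree_toy <- stats::hclust(d_toy, method = "average")

# Cophenetic distance = height of the node joining two species
coph_toy <- stats::cophenetic(tree_toy)

# Species "a" has different input distances to c, d and e ...
as.matrix(d_toy)["a", c("c", "d", "e")]
# ... but a single cophenetic distance to all of them
as.matrix(coph_toy)["a", c("c", "d", "e")]

# A tree with n tips has n - 1 nodes, hence at most n - 1 distinct distances
length(unique(round(as.vector(coph_toy), 10)))
```

Here `a` and `b` sit on one side of the root and `c`, `d` and `e` on the other, so every pair spanning the root receives the same cophenetic distance (the average of the 6 corresponding input distances), whatever the original trait-based distances were.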
A known 'issue' associated with the Gower metric applied to non-continuous traits is that distance matrix is not Euclidean. Let's have a look:
```{r}
# check whether the distance matrix is Euclidean:
quality_fspaces_fruits$"details_trdist"$"trdist_euclidean"
```

It is `FALSE` in the fruit case. This is actually intuitive because the Gower metric is binary for categorical traits (see example below).
Applying PCoA to a non-Euclidean distance eventually leads to PC axes with negative eigenvalues. Those axes are meaningless and removed by default by the `ape::pcoa()` function used in the `mFD::quality.fspaces()` function.
```{r}
# retrieve eigenvalues:
quality_fspaces_fruits$"details_fspaces"$"pc_eigenvalues"
```
Here, PCoA on the 25 fruits species described with 5 traits produced 9 PC axes with positive eigenvalues.
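To see where negative eigenvalues come from, here is a minimal base-R sketch of the algebra underlying PCoA (double-centring of the squared distances followed by an eigen-decomposition), applied to a made-up non-Euclidean distance matrix. The helper `pcoa_eigen()` is hypothetical and only illustrates the principle; `ape::pcoa()` does more than this.

```r
# Made-up distances between 4 points: a, b and c are 2 apart from each other
# and d is 1.1 away from each of them. The triangle inequality holds, but no
# Euclidean configuration exists: the centre of an equilateral triangle of
# side 2 lies about 1.155 from its corners, so d is "too close" to all three.
d_mat <- matrix(2, 4, 4, dimnames = list(letters[1:4], letters[1:4]))
d_mat[4, ] <- 1.1
d_mat[, 4] <- 1.1
diag(d_mat) <- 0

# Hypothetical helper: eigenvalues of the double-centred matrix -0.5 * D^2
pcoa_eigen <- function(d) {
  n <- nrow(d)
  J <- diag(n) - matrix(1 / n, n, n)  # centring matrix
  eigen(-0.5 * J %*% d^2 %*% J, symmetric = TRUE)$values
}

# Two positive eigenvalues, one null and one negative
pcoa_eigen(d_mat)
```

A distance matrix is Euclidean only if none of these eigenvalues is negative, so the last value reveals that these 4 points cannot be placed in any Euclidean space without distorting their distances.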
To deal with the non-Euclidean 'issue', it **has been recommended to square-root the Gower distance matrix before computing the PCoA**. However as Gower distance is by definition between 0 and 1, and as for 0 < x < 1, sqrt(x) > x: **this transformation means that all square-root distances are higher than raw distances and the difference between raw and square-root distances varies non-linearly with raw distances**.
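A few made-up values are enough to illustrate both effects (systematic inflation and non-linearity) in base R:

```r
# Made-up raw distances spanning the (0, 1) interval
d_raw  <- c(0.05, 0.25, 0.50, 0.75, 0.95)
d_sqrt <- sqrt(d_raw)

# Inflation is always positive but not constant: it peaks at d = 0.25
data.frame(d_raw,
           d_sqrt    = round(d_sqrt, 3),
           inflation = round(d_sqrt - d_raw, 3))

# A 10-fold contrast between two raw distances shrinks to sqrt(10)
ratio_raw  <- 0.50 / 0.05
ratio_sqrt <- sqrt(0.50) / sqrt(0.05)
c(ratio_raw = ratio_raw, ratio_sqrt = ratio_sqrt)
```

More generally, the ratio between two square-rooted distances is the square root of the ratio between the raw distances, so contrasts between species pairs are always compressed.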
If we look at raw and square-rooted Gower distance between blackberry and 3 other species:
```{r}
# raw and square-rooted Gower distances between the 4 example fruits:
quality_fspaces_fruits$"details_fspaces"$"pairsp_fspaces_dist" %>%
  dplyr::select(sp.x, sp.y, Gower = tr) %>%
  dplyr::mutate(sqrt_Gower = sqrt(Gower)) %>%
  dplyr::filter(sp.x %in% ex_blackberry & sp.y %in% ex_blackberry) %>%
  dplyr::mutate(dplyr::across(where(is.numeric), ~ round(.x, 2)))
```
Raw Gower distance between blackberry and banana is almost twice as high as the distance between blackberry and cherry, and 10 times higher than the distance between blackberry and currant. After the square-root transformation, these ratios shrink to about 1.4 and 3, respectively: the ratio of two square-rooted distances is the square root of the ratio of the raw distances, because the square-root function is steepest close to 0 and thus inflates the smallest distances the most. If we apply `mFD::quality.fspaces()` to the square root of Gower distance:
```{r}
# compute quality metrics with square-root transformed distances:
quality_fspaces_fruits_sqrtgower <- mFD::quality.fspaces(
  sp_dist             = sqrt(dist_fruits),
  fdendro             = NULL,
  maxdim_pcoa         = 24,
  deviation_weighting = "absolute",
  fdist_scaling       = FALSE)

# check whether the distance matrix is Euclidean:
quality_fspaces_fruits_sqrtgower$"details_trdist"$"trdist_euclidean"
# input distance is now Euclidean

# get mean Absolute Deviation:
quality_fspaces_fruits_sqrtgower$"quality_fspaces"
```
The inaccuracy (measured with *mAD* (mean absolute deviation) metric) decreases with the number of axes down to 0.
But do not forget that the input used here is the square-root of Gower distance. So let's compare deviation between trait-based Gower distance and Euclidean distance in the 24D PCoA space:
```{r, fig.height = 7, fig.width = 12, fig.align = "center"}
quality_fspaces_fruits$"details_fspaces"$"pairsp_fspaces_dist" %>%
  dplyr::select(sp.x, sp.y, Gower_distance = tr) %>%
  # add Euclidean distances in the 24D space built on sqrt(Gower):
  dplyr::mutate(Eucli_dist_24D_sqrt =
    quality_fspaces_fruits_sqrtgower$"details_fspaces"$"pairsp_fspaces_dist"$"pcoa_24d") %>%
  ggplot2::ggplot(ggplot2::aes(x = Gower_distance,
                               y = Eucli_dist_24D_sqrt)) +
  ggplot2::geom_point(size = 1) +
  ggplot2::geom_abline(slope = 1, intercept = 0) +
  ggplot2::scale_x_continuous(limits = c(0, 1)) +
  ggplot2::scale_y_continuous(limits = c(0, 1))
```
As expected, the ranking of distances is 'perfectly' kept but with a square-root shape above the 1:1 line.
If we now compute the actual *mAD* between Gower and Euclidean distances in this apparently perfect 24D space:
```{r, echo = FALSE}
mean(abs(
  quality_fspaces_fruits$"details_fspaces"$"pairsp_fspaces_dist"$"tr" -
  quality_fspaces_fruits_sqrtgower$"details_fspaces"$"pairsp_fspaces_dist"$"pcoa_24d"))
```
We notice that mAD = 0.212: inaccuracy is much higher than that of the worst PCoA space and of the dendrogram computed on the raw Gower distance matrix, which does represent the actual differences in trait values.

So to **sum up**:

* non-continuous traits often make Gower distances between species non-Euclidean
* PCoA on such Gower distances could lead to PCoA axes with negative eigenvalues, but the remaining axes can still accurately represent Gower distances (for more details see [Maire _et al._ (2015)](https://onlinelibrary.wiley.com/doi/full/10.1111/geb.12299) and especially *Figure 2*)
* square-root transforming Gower distances apparently increases the efficiency of the PCoA (no more negative eigenvalues), but Euclidean distances in this space are a square-root biased representation of the trait-based distances, which are the key features to account for when computing FD
**_NOTE:_** If you are not convinced about Gower being both intuitive but non-Euclidean consider the following simple case of 8 species described with 3 categorical traits (2 modalities each), so there are 8 unique combinations of trait values and the 28 species pairs are sharing 0, 1, or 2 trait values:
```{r}
# create a new dataset:
sp_tr <- data.frame(
  tra = factor(c(LETTERS[1:2], LETTERS[1:2], LETTERS[1:2], LETTERS[1:2])),
  trb = factor(c(rep("M", 4), rep("N", 4))),
  trc = factor(c(rep("X", 2), rep("Y", 4), rep("X", 2)))
)
row.names(sp_tr) <- paste0("sp", 1:8)
sp_tr

# compute Gower distance between all pairs of species:
dist_gower <- cluster::daisy(sp_tr, metric = "gower")
round(dist_gower, 2)
```

There are thus only 3 distance values (0.33, 0.67 or 1), corresponding to species pairs sharing 2, 1 or 0 trait values, respectively.

```{r, echo = FALSE}
# square-root transformation of Gower distance
gower_sqrt <- sqrt(dist_gower)
round(gower_sqrt, 2)
```

After applying the square-root transformation, the 3-fold difference (1/0.33) in Gower distance between pairs of species sharing no trait value and pairs of species sharing 2 trait values becomes less than 2-fold (1/0.58, about 1.7). Thus, applying the square-root transformation to Gower distance decreases the magnitude of variation in trait-based distance, by increasing the distances between the most similar species.
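The non-Euclidean nature of this simple case can also be checked with base R only. The sketch below rebuilds the same 8 species so that it is self-contained, computes Gower distance as the proportion of traits that differ (which is what Gower distance reduces to when all traits are categorical and equally weighted), and inspects the eigenvalues used by PCoA. The helper `pcoa_eigen_toy()` is hypothetical and is not part of `mFD`.

```r
# Same 8 species as above, stored as a character matrix
tr_mat <- cbind(
  tra = rep(c("A", "B"), times = 4),
  trb = rep(c("M", "N"), each = 4),
  trc = c("X", "X", "Y", "Y", "Y", "Y", "X", "X"))
rownames(tr_mat) <- paste0("sp", 1:8)

# Gower distance for categorical traits = proportion of traits that differ
n_sp <- nrow(tr_mat)
gower_toy <- matrix(0, n_sp, n_sp, dimnames = list(rownames(tr_mat),
                                                   rownames(tr_mat)))
for (i in seq_len(n_sp)) {
  for (j in seq_len(n_sp)) {
    gower_toy[i, j] <- mean(tr_mat[i, ] != tr_mat[j, ])
  }
}
sort(unique(round(gower_toy[upper.tri(gower_toy)], 2)))

# Hypothetical helper: eigenvalues of the double-centred matrix -0.5 * D^2
pcoa_eigen_toy <- function(d) {
  n <- nrow(d)
  J <- diag(n) - matrix(1 / n, n, n)
  eigen(-0.5 * J %*% d^2 %*% J, symmetric = TRUE)$values
}

round(pcoa_eigen_toy(gower_toy), 3)        # some eigenvalues are negative
round(pcoa_eigen_toy(sqrt(gower_toy)), 3)  # none is negative (up to rounding)
```

The raw Gower distances cannot be Euclidean here: for instance sp2 and sp3 are both exactly halfway between sp1 and sp4 (0.33 from each, while sp1 and sp4 are 0.67 apart), so in a Euclidean space they would have to coincide, yet they are 0.67 apart.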
# References

- Maire _et al._ (2015) How many dimensions are needed to accurately assess functional diversity? A pragmatic approach for assessing the quality of functional spaces. _Global Ecology and Biogeography_, **24**, 728-740.