Ordered categorical data are commonplace in ecology when quantitative measurement of a variable of interest is not feasible or too costly. Examples include field estimation of the size of individuals, body condition and relative abundances of species.

Cumulative link models are a powerful class of models for analyzing such data since observations are treated as categorical, the ordered nature is exploited and the flexible regression framework allows in-depth analyses.

Today, we will use the ordinal package in R to analyse how size and diet determine mobility in reef fishes.

The data

is available here

fish = read.csv("reeffish.csv")
summary(fish)
##    Fish.code       Mobility          Size                        Diet    
##  Min.   :  16   Min.   :1.000   Min.   :  2.00   Macroinvertebrates:346  
##  1st Qu.:2519   1st Qu.:2.000   1st Qu.:  7.00   Microinvertebrates:257  
##  Median :3880   Median :2.000   Median : 12.00   Zooplancton       :243  
##  Mean   :3630   Mean   :2.489   Mean   : 20.06   Necton            :198  
##  3rd Qu.:4864   3rd Qu.:3.000   3rd Qu.: 22.50   Microalgae        :136  
##  Max.   :6574   Max.   :4.000   Max.   :280.00   Coral             : 38  
##                                                  (Other)           : 33  
##            Family   
##  LABRIDAE     :115  
##  POMACENTRIDAE:112  
##  GOBIIDAE     :103  
##  APOGONIDAE   : 67  
##  SERRANIDAE   : 65  
##  ACANTHURIDAE : 42  
##  (Other)      :747

The mobility of 1251 species of coral reef fishes from New Caledonia was recorded as an ordinal variable with 4 possible values (data courtesy of M. Kulbicki):

  1. strictly site-attached
  2. roams on contiguous habitat patches
  3. reef scale movement
  4. vagile, can move between reefs

we want to model mobility as a function of diet and average size.

Fitting models

We set Mobility as an ordered factor and run a first model with Diet as an explanatory variable

fish$Mobility=as.ordered(fish$Mobility)
library(ordinal)
fm1 = clm(Mobility~Diet, data=fish)
summary(fm1)
## formula: Mobility ~ Diet
## data:    fish
## 
##  link  threshold nobs logLik   AIC     niter max.grad cond.H 
##  logit flexible  1251 -1474.92 2969.84 6(0)  5.82e-13 5.3e+02
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## DietDetritus             3.8402     0.6730   5.706 1.15e-08 ***
## DietMacroalgae           2.4707     0.4830   5.115 3.14e-07 ***
## DietMacroinvertebrates   1.8921     0.3250   5.823 5.79e-09 ***
## DietMicroalgae           0.8007     0.3514   2.279   0.0227 *  
## DietMicroinvertebrates   0.5418     0.3269   1.658   0.0974 .  
## DietNecton               2.4272     0.3384   7.172 7.41e-13 ***
## DietZooplancton          1.3731     0.3329   4.124 3.72e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Threshold coefficients:
##     Estimate Std. Error z value
## 1|2  -0.8001     0.3056  -2.618
## 2|3   1.8344     0.3119   5.882
## 3|4   3.0767     0.3180   9.676

The summary provides basic information about the model fit. There are two coefficient tables: one for the regression variables and one for the thresholds. By default the logit link is used, and the parameters can be interpreted in terms of odd ratios (more on that soon). The summary also provides Wald based p-values for tests of the parameters being zero. Here Diet is a categorical variable and by default the contrasts are set to contr.treatment. Therefore, significance indicates whether the coefficient differs significantly from that of the base level (DietCoral). To test for significance of Diet as a whole, we can use drop1 or anova.

drop1(fm1, test = "Chisq")
## Single term deletions
## 
## Model:
## Mobility ~ Diet
##        Df    AIC    LRT  Pr(>Chi)    
## <none>    2969.8                     
## Diet    7 3143.2 187.35 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Diet is highly significant, not a surprise compared to an intercept-only model.

Interpretation

When the default logit link is used, the cumulative link model is known as proportional odds model, because a unit change in the linear predictors correpond to a proportional change in the odds of the event \(Y \leq j\). We can plot how the log-odds change with the predictors:

xlim = c(-0.1, max(fm1$beta)*1.1)
ylim = c(fm1$Theta[1]-max(fm1$beta), fm1$Theta[3])*1.1
plot(0,0,xlim=xlim, ylim=ylim, type="n", ylab="log(Odds)", xlab="", xaxt = "n")
axis(side = 1, at = c(0,fm1$beta),labels = levels(fish$Diet), las=2)
abline(a = fm1$Theta[1], b=-1, col=1)
abline(a = fm1$Theta[2], b=-1, col=2)
abline(a = fm1$Theta[3], b=-1, col=3)
abline(v=c(0,fm1$beta))

The theta parameters determine the intercept of the 3 parallel lines for DietCoral, while the betas parameters determine the position of the Diet categories on the x-axis. The assumption of proportion odds corresponds to the fact that the lines are parallel.

We now plot the same graph on the probability scale instead of the log(odds) scale and add the observed proportions of categories of mobility for each diet.

# plot on probability scale
inv.logit=function(x) exp(x)/(1+exp(x))
xlim = c(-0.1, max(fm1$beta)*1.1)
ylim = c(0,1)
plot(0,0,xlim=xlim, ylim=ylim, type="n", ylab=expression(italic(P(Y[i] <= j))), xlab="", xaxt = "n")
axis(side = 1, at = c(0,fm1$beta),labels = levels(fish$Diet), las=2)
xs = seq(xlim[1], xlim[2], length.out=100)
lines(xs, inv.logit(fm1$Theta[1] - xs), col=1)
lines(xs, inv.logit(fm1$Theta[2] - xs), col=2)
lines(xs, inv.logit(fm1$Theta[3] - xs), col=3)
abline(v=c(0,fm1$beta))
abline(h=0, lty="dashed")
abline(h=1, lty="dashed")

# plot observed values
freqtable = table(fish$Mobility, fish$Diet)
cumprobtable = apply(freqtable,2,function(col) cumsum(col)/sum(col))[-4,]
points(rep(c(0,fm1$beta), rep(3,length(c(0,fm1$beta)))), cumprobtable, col=c(1:3), pch=20)

The model does a good job at capturing the variation in Mobility. Note, however, that the proportional odds assumption is a poor fit of macro-algae eaters. Additional diagnostics can be performed by examining residuals plots. Unfortunately, ordinal has no built-in capability for residuals. Here we will our own function to compute Dunn-Smyth residuals and assess model fit.

clm.residuals = function(obj) {
  preds = predict(obj, type="cum.prob")
  residuals = runif(length(preds$cprob1))*(preds$cprob1-preds$cprob2) + preds$cprob2
  return(qnorm(residuals))
}
plot(clm.residuals(fm1) ~ Diet, data=fish, xlab="Diet", ylab="Dunn-Smyth residuals", main="residuals")

Adding a continuous predictor

adding predictors and testing for significance is straightforward with ordinal and similar to linear models. Now we proceed by adding Size as a predictor and testing for an interaction betweeen size and diet. But first let’s plot the data:

hist(fish$Size, main="Frequency distribution of fish size", xlab="Size")

Size is very skewed, which suggest that a log-transformation might be appropriate

hist(log(fish$Size), main="Frequency distribution of log-transformed fish size", xlab="log(Size)")

This is better, let’s have a look at the relationship between mobility and size:

plot(Mobility~log(Size), data=fish)

There is a strong relationship between Size and Mobility.

fm2 = clm(Mobility ~ Diet + log(Size), data=fish)
summary(fm2)
## formula: Mobility ~ Diet + log(Size)
## data:    fish
## 
##  link  threshold nobs logLik   AIC     niter max.grad cond.H 
##  logit flexible  1251 -1309.65 2641.29 6(0)  9.64e-11 2.8e+03
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## DietDetritus            3.71576    0.68580   5.418 6.02e-08 ***
## DietMacroalgae          1.49070    0.49987   2.982  0.00286 ** 
## DietMacroinvertebrates  1.40004    0.33223   4.214 2.51e-05 ***
## DietMicroalgae          0.67651    0.35327   1.915  0.05549 .  
## DietMicroinvertebrates  1.36033    0.33772   4.028 5.63e-05 ***
## DietNecton              1.10339    0.35363   3.120  0.00181 ** 
## DietZooplancton         1.88550    0.34188   5.515 3.48e-08 ***
## log(Size)               1.54873    0.09293  16.666  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Threshold coefficients:
##     Estimate Std. Error z value
## 1|2   2.5821     0.3715    6.95
## 2|3   5.6451     0.3985   14.17
## 3|4   7.2086     0.4191   17.20
anova(fm1, fm2)
## Likelihood ratio tests of cumulative link models:
##  
##     formula:                    link: threshold:
## fm1 Mobility ~ Diet             logit flexible  
## fm2 Mobility ~ Diet + log(Size) logit flexible  
## 
##     no.par    AIC  logLik LR.stat df Pr(>Chisq)    
## fm1     10 2969.8 -1474.9                          
## fm2     11 2641.3 -1309.6  330.54  1  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The likelihood ratio test strongly supports the inclusion of Size as a predictor. We can plot the residuals to check for departure from the linearity assumption.

plot(clm.residuals(fm2) ~ log(Size), data=fish, xlab="log(Size)", ylab="Dunn-Smyth residuals", main="residuals")