We used R to analyze all examples in Chapter 13. We put the code here so that you can too.
Also see our lab on comparing two means using R.
Data transformations:
marine$lnBiomassRatio <- log(marine$biomassRatio)
Quantile plots:
qqnorm(marine$biomassRatio, datax = TRUE)
Mann-Whitney U-test (R uses the equivalent Wilcoxon rank-sum test):
wilcox.test(timeToMating ~ feedingStatus, data = cannibalism)
We’ll use the ggplot2
add on package to make graphs, and dplyr
to calculate summary statistics by group.
library(ggplot2)
library(dplyr)
The normal quantile plot, Shapiro-Wilk test of normality, and the log transformation, investigating the ratio of biomass between marine reserves and non-reserve control areas.
marine <- read.csv(url("https://whitlockschluter3e.zoology.ubc.ca/Data/chapter13/chap13e1MarineReserve.csv"), stringsAsFactors = FALSE)
head(marine)
## biomassRatio
## 1 1.34
## 2 1.96
## 3 2.49
## 4 1.27
## 5 1.19
## 6 1.15
A histogram visualizes the frequency distribution of biomass ratio (Figure 13.1-4).
hist(marine$biomassRatio, right = FALSE, col = "firebrick",
las = 1, xlab = "Biomass ratio", main = "")
or
ggplot(marine, aes(biomassRatio)) +
geom_histogram(fill = "firebrick", col = "black", binwidth = 0.5,
boundary = 0, closed = "left") +
labs(x = "Biomass ratio", y = "Frequency") +
theme_classic()
The line in the normal quantile plot of Figure 13.1-4 passes through the first and 3rd quartiles of the data.
qqnorm(marine$biomassRatio, col = "firebrick", pch = 16, las = 1,
datax = TRUE, xlab = "Biomass ratio")
qqline(marine$biomassRatio, datax = TRUE, lwd = 2)
A test of the goodness of fit of the normal distribution to data.
shapiro.test(marine$biomassRatio)
##
## Shapiro-Wilk normality test
##
## data: marine$biomassRatio
## W = 0.81751, p-value = 8.851e-05
The function log()
takes the natural logarithm of all the elements of a vector or variable. The following command saves the result as a new variable in the original data frame, marine
.
marine$lnBiomassRatio <- log(marine$biomassRatio)
A histogram of the data after log-transformation of marine biomass ratio (Figure 13.3-2).
hist(marine$lnBiomassRatio, right = FALSE, col = "firebrick",
las = 1, xlab = "Log biomass ratio", main = "",
breaks = seq(-0.25, 1.5, by = 0.25))
or
ggplot(marine, aes(lnBiomassRatio)) +
geom_histogram(fill = "firebrick", col = "black", binwidth = 0.25,
boundary = 0, closed = "left") +
labs(x = "Log biomass ratio", y = "Frequency") +
theme_classic()
Confidence interval for the mean of the log-transformed data.
t.test(marine$lnBiomassRatio)$conf.int
## [1] 0.3470180 0.6112365
## attr(,"conf.level")
## [1] 0.95
Transform the lower and upper limits of the confidence interval back to the original untransformed scale (exp()
is the inverse of the natural log function).
exp( t.test(marine$lnBiomassRatio)$conf.int )
## [1] 1.414842 1.842708
## attr(,"conf.level")
## [1] 0.95
The sign test, comparing the numbers of species in 25 pairs of closely related insect taxa.
conflict <- read.csv(url("https://whitlockschluter3e.zoology.ubc.ca/Data/chapter13/chap13e4SexualConflict.csv"), stringsAsFactors = FALSE)
head(conflict)
## taxonPair nSpeciesMultipleMating nSpeciesSingleMating difference
## 1 A 53 10 43
## 2 B 73 120 -47
## 3 C 228 74 154
## 4 D 353 289 64
## 5 E 157 30 127
## 6 F 300 4 296
A histogram of the difference in numbers of species (Figure 13.4-1).
hist(conflict$difference, right = FALSE, col = "firebrick",
breaks = 50, xlab = "Difference in species number", las = 1)
Count up the frequency of differences that are below, equal to, and above zero.
The command below creates a new variable in the data frame called conflictZero
to indicate whether each difference is below, equal to, or above 0.
conflictZero <- cut(conflict$difference, breaks = c(-Inf, 0, 1, Inf),
labels = c("below","equal","above"), right = FALSE)
table(conflictZero)
## conflictZero
## below equal above
## 7 0 18
The sign test is just a binomial test. The output includes a confidence interval for the proportion using the Clopper-Pearson method, which isn’t covered in the book.
binom.test(7, n = 25, p = 0.5)
##
## Exact binomial test
##
## data: 7 and 25
## number of successes = 7, number of trials = 25, p-value = 0.04329
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.1207167 0.4938768
## sample estimates:
## probability of success
## 0.28
**The Wilcoxon rank-sum test (equivalent to the Mann-Whitney U*-test) comparing times to mating (in hours) of starved and fed female sagebrush crickets. We also apply the permutation test** to the same data*.
cannibalism <- read.csv(url("https://whitlockschluter3e.zoology.ubc.ca/Data/chapter13/chap13e5SagebrushCrickets.csv"), stringsAsFactors = FALSE)
head(cannibalism)
## feedingStatus timeToMating
## 1 starved 1.9
## 2 starved 2.1
## 3 starved 3.8
## 4 starved 9.0
## 5 starved 9.6
## 6 starved 13.0
The following commands produce Figure 13.5-1. We begin by ordering the categories of the feeding status variable so that “starved” comes before “fed” in the histogram.
cannibalism$feedingStatus <- factor(cannibalism$feedingStatus,
levels = c("starved", "fed"))
ggplot(cannibalism, aes(x = timeToMating)) +
geom_histogram(fill = "firebrick", col = "black", binwidth = 20,
boundary = 0, closed = "left") +
facet_wrap( ~ feedingStatus, ncol = 1, scales = "free_y") +
labs(x = "Time to mating (hours)", y = "Frequency") +
theme_classic()
This test, available in R, is equivalent to the Mann-Whitney U-test.
wilcox.test(timeToMating ~ feedingStatus, data = cannibalism)
##
## Wilcoxon rank sum test
##
## data: timeToMating by feedingStatus
## W = 55, p-value = 0.3607
## alternative hypothesis: true location shift is not equal to 0
Permutation test of the difference between mean time to mating of starved and fed crickets.
Begin by calculating the observed difference between means (starved minus fed). The difference is -18.25734 in this data set.
cricketMeans <- summarize(group_by(cannibalism, feedingStatus),
timeToMating = mean(timeToMating, na.rm = TRUE))
cricketMeans <- as.data.frame(cricketMeans)
cricketMeans
## feedingStatus timeToMating
## 1 starved 17.72727
## 2 fed 35.98462
diffMeans <- cricketMeans$timeToMating[1] - cricketMeans$timeToMating[2]
diffMeans
## [1] -18.25734
Decide on the number of permutations. We use 10K here.
nPerm <- 10000
Create a loop to permute the data many times (the number of times determined by nperm
). In the following loop, i
is a counter that climbs from 1 to nPerm
by 1. In each iteration, the permuted difference is saved in the object permResult
.
permResult <- vector() # initializes
for(i in 1:nPerm){
# step 1: permute the times to mating
cannibalism$permSample <- sample(cannibalism$timeToMating, replace = FALSE)
# step 2: calculate difference betweeen means
permSampleMean <- as.data.frame(summarize(group_by(cannibalism, feedingStatus),
permMean = mean(permSample, na.rm = TRUE)))
permResult[i] <- permSampleMean$permMean[1] - permSampleMean$permMean[2]
}
Make a histogram of the permuted differences (Figure 13.8-1).
hist(permResult, right = FALSE, breaks = 100)
or
ggplot(data.frame(permResult), aes(permResult)) +
geom_histogram(fill = "white", col = "black", binwidth = 1,
boundary = 0, closed = "left") +
labs(x = "Permuted difference in treatment means (hours)", y = "Frequency") +
theme_classic()
Use the null distribution to calculate an approximate P-value. This is the twice the proportion of the permuted means that fall below the observed difference in means, diffMeans
(-18.25734 in this example). The following code calculates the number of permuted means falling below diffMeans
.
sum(as.numeric(permResult <= diffMeans))
## [1] 649
These commands obtain the fraction of permuted means falling below diffMeans
.
sum(as.numeric(permResult <= diffMeans)) / nPerm
## [1] 0.0649
Finally, multiply by 2 to get the P-value for a two-sided test. The value won’t be identical to the one in the book, or to the value you obtain, because 10,000 iterations is not large enough for extreme accuracy.
2 * ( sum(as.numeric(permResult <= diffMeans)) / nPerm )
## [1] 0.1298