We used R to analyze all examples in Chapter 2. We put the code here so that you can too.
Also see our lab on graphics using R.
Read data from a comma-separated (.csv) file into a data frame.
locustData <- read.csv(chap02f1_2locustSerotonin.csv, stringsAsFactors = FALSE)
Inspect data in a data frame.
head(locustData)
nrow(locustData)
Strip chart relating a numeric variable to a categorical variable:
ggplot(data = locustData, aes(x = treatmentTime, y = serotoninLevel)) +
geom_jitter()
Frequency table:
table(tigerData$activity)
Bar graph for count data:
ggplot(data = educationSpending, aes(x = as.character(year),
y = spendingPerStudent)) +
geom_bar(stat = "identity")
Histogram for a numeric variable:
ggplot(data = zikaData, aes(x = biparietalDiamMm)) +
geom_histogram()
Scatter plot showing association between two numeric variables:
ggplot(data = guppyFatherSonData, aes(x = fatherOrnamentation,
y = sonAttractiveness)) +
geom_point()
We’ll use the ggplot2
add on package to draw some of the plots. It needs to be loaded only once per session.
library(ggplot2)
Strip chart of serotonin levels in the central nervous system of desert locusts that were experimentally crowded for 0 (the control group), 1, and 2 hours.
The read.csv()
command grabs the data from a file on the book web site and stores in the data frame locustData
.
locustData <- read.csv(url("https://whitlockschluter3e.zoology.ubc.ca/Data/chapter02/chap02f1_2locustSerotonin.csv"), stringsAsFactors = FALSE)
To check that the data were read correctly and determine the number of cases (rows) in the data.
head(locustData)
## serotoninLevel treatmentTime
## 1 5.3 0
## 2 4.6 0
## 3 4.5 0
## 4 4.3 0
## 5 4.2 0
## 6 3.6 0
nrow(locustData)
## [1] 30
The ggplot()
code stitches together multiple components here shown on different lines: the data and variables; the strip chart itself (with options specifying color and size of dots and width of strips); the axis labels, and the theme. We convert treatmentTime
to a character variable for best results.
meanLevel <- tapply(locustData$serotoninLevel, locustData$treatmentTime,mean)
ggplot(data = locustData, aes(x = as.character(treatmentTime), y = serotoninLevel)) +
geom_jitter(color = "firebrick", size = 3, width = 0.15) +
labs(x = "Treatment time (hours)", y = "Serotonin (pmoles)") +
theme_classic()
A bar graph of education spending per student in different years in British Columbia.
educationSpending <- read.csv(url("https://whitlockschluter3e.zoology.ubc.ca/Data/chapter02/chap02f1_3EducationSpending.csv"), stringsAsFactors = FALSE)
head(educationSpending)
## year spendingPerStudent
## 1 1998 5844
## 2 1999 5983
## 3 2000 6216
## 4 2001 6328
## 5 2002 6455
## 6 2003 6529
In this example, the actual bar heights are provided in the data frame, so we use stat = "identity"
. We convert year to a categorial variable for best results.
ggplot(data = educationSpending, aes(x = as.character(year), y = spendingPerStudent)) +
geom_bar(stat = "identity", fill = "firebrick") +
labs(x = "", y = "Education spending ($ per student)") +
theme_classic()
Frequency table and bar graph showing activities of 88 people at the time they were killed by tigers near Chitwan National Park, Nepal, from 1979 to 2006.
We store the data into a data frame named tigerData
.
tigerData <- read.csv(url("https://whitlockschluter3e.zoology.ubc.ca/Data/chapter02/chap02e2aDeathsFromTigers.csv"), stringsAsFactors = FALSE)
head(tigerData)
## person activity
## 1 1 Disturbing tiger kill
## 2 2 Forest products
## 3 3 Grass/fodder
## 4 4 Fuelwood/timber
## 5 5 Grass/fodder
## 6 6 Forest products
This is a basic frequency table. R will arrange the categories in alphabetical order by default. Use sort()
to order categories from most to least frequent. The last step, data.frame()
arranges the frequencies in a column.
tigerTable <- table(tigerData$activity)
tigerTable <- sort(tigerTable, decreasing = TRUE)
data.frame(tigerTable, row.names = 1)
## Freq
## Grass/fodder 44
## Forest products 11
## Fishing 8
## Herding 7
## Disturbing tiger kill 5
## Fuelwood/timber 5
## Sleeping in house 3
## Walking 3
## Toilet 2
We can instead use factor()
to make a new activity variable whose categories (“levels”) are ordered by frequency.
tigerTable <- table(tigerData$activity)
tigerData$activity_ordered <- factor(tigerData$activity,
levels = names(sort(tigerTable, decreasing = TRUE)) )
data.frame(table(tigerData$activity_ordered), row.names = 1)
## Freq
## Grass/fodder 44
## Forest products 11
## Fishing 8
## Herding 7
## Disturbing tiger kill 5
## Fuelwood/timber 5
## Sleeping in house 3
## Walking 3
## Toilet 2
In this example, ggplot()
tallies the frequency in each category from the data using stat = "count"
. We’ll use the new, ordered activity variable created in the previous step. We also add an option to rotate the axis labels so they fit on the page.
ggplot(data = tigerData, aes(x = activity_ordered)) +
geom_bar(stat = "count", fill = "firebrick") +
labs(x = "Activity", y = "Frequency") +
theme_classic() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Frequency table and histogram illustrating the frequency distribution of head diameters of fetuses of pregnant women infected with the Zika virus.
zikaData <- read.csv(url("https://whitlockschluter3e.zoology.ubc.ca/Data/chapter02/chap02e2bZikaBiparietalDiameter.csv"), stringsAsFactors = FALSE)
head(zikaData)
## gestationalAgeWks biparietalDiamMm
## 1 34.1 61
## 2 32.9 69
## 3 35.6 70
## 4 35.8 82
## 5 34.4 80
## 6 33.3 80
To make a frequency table for a numeric variable we need to place observations into bins. THe variables head width ranges from 61 to 92.
range(zikaData$biparietalDiamMm)
## [1] 61 92
So let’s make bins of width 5 from 60 to 95. The option right = FALSE
ensures that abundance value 70 (for example) is counted in the 70-75 bin rather than in the 65-70 bin.
zikaTable <- table(cut(zikaData$biparietalDiamMm,
breaks = seq(00, 95, by=5), right = FALSE))
data.frame(zikaTable, row.names = 1)
## Freq
## [0,5) 0
## [5,10) 0
## [10,15) 0
## [15,20) 0
## [20,25) 0
## [25,30) 0
## [30,35) 0
## [35,40) 0
## [40,45) 0
## [45,50) 0
## [50,55) 0
## [55,60) 0
## [60,65) 1
## [65,70) 1
## [70,75) 1
## [75,80) 0
## [80,85) 13
## [85,90) 20
## [90,95) 4
A quick histogram in base R can be obtained as follows
hist(zikaData$biparietalDiamMm, right = FALSE, col = "firebrick")
Or use the ggplot()
method. The option boundary = 0
puts bars between the bin cut points. The option closed = left
ensures that abundance value 70 (for example) is counted in the 70-75 bin rather than in the 65-70 bin.
ggplot(data = zikaData, aes(x = biparietalDiamMm)) +
geom_histogram(fill = "firebrick", col = "black", binwidth = 5,
boundary = 0, closed = "left") +
labs(x = "Head diamater (mm)", y = "Frequency") +
theme_classic()
Histograms of body mass of 228 female sockeye salmon sampled from Pick Creek in Alaska. The same data are plotted for three different interval widths: 0.1 kg, 0.3 kg, and 0.5 kg.
salmonSizeData <- read.csv("https://whitlockschluter3e.zoology.ubc.ca/Data/chapter02/chap02f2_5SalmonBodySize.csv", stringsAsFactors = FALSE)
head(salmonSizeData)
## year sex oceanAgeYears lengthMm massKg
## 1 1996 FALSE 3 513 3.090
## 2 1996 FALSE 3 513 2.909
## 3 1996 FALSE 3 525 3.056
## 4 1996 FALSE 3 501 2.690
## 5 1996 FALSE 3 513 2.876
## 6 1996 FALSE 3 501 2.978
Comparing the effect of setting bin width to 0.1, 0.3, and 0.5 kg.
ggplot(data = salmonSizeData, aes(x = massKg)) +
geom_histogram(fill = "firebrick", col = "black", binwidth = 0.1,
boundary = 0, closed = "left") +
labs(x = "Body mass (kg)", y = "Frequency") +
theme_classic()
ggplot(data = salmonSizeData, aes(x = massKg)) +
geom_histogram(fill = "firebrick", col = "black", binwidth = 0.3,
boundary = 0, closed = "left") +
labs(x = "Body mass (kg)", y = "Frequency") +
theme_classic()
ggplot(data = salmonSizeData, aes(x = massKg)) +
geom_histogram(fill = "firebrick", col = "black", binwidth = 0.5,
boundary = 0, closed = "left") +
labs(x = "Body mass (kg)", y = "Frequency") +
theme_classic()
Contingency table, grouped bar plot, and mosaic plot showing the association between egg removal treatment and incidence of malaria in female great tits.
birdMalariaData <- read.csv(url("https://whitlockschluter3e.zoology.ubc.ca/Data/chapter02/chap02e3aBirdMalaria.csv"), stringsAsFactors = FALSE)
head(birdMalariaData)
## bird treatment response
## 1 1 Control Malaria
## 2 2 Control Malaria
## 3 3 Control Malaria
## 4 4 Control Malaria
## 5 5 Control Malaria
## 6 6 Control Malaria
Use table()
with the names of two categorical variables as arguments, beginning with the response variable.
birdMalariaTable <- table(birdMalariaData$response, birdMalariaData$treatment)
birdMalariaTable
##
## Control Egg removal
## Malaria 7 15
## No Malaria 28 15
To add row and column totals to the table, use the following.
addmargins(birdMalariaTable, margin = c(1,2), FUN = sum, quiet = TRUE)
##
## Control Egg removal sum
## Malaria 7 15 22
## No Malaria 28 15 43
## sum 35 30 65
In a grouped bar graph, ggplot()
uses the argument fill
for the response variable (here named response
). The argument position
controls the placement of bars. scale_fill_manual
is optional but it allows control of the colors used.
ggplot(birdMalariaData, aes(x = treatment, fill = response)) +
geom_bar(stat = "count", position = position_dodge2(preserve="single")) +
scale_fill_manual(values=c("firebrick", "goldenrod1")) +
labs(x = "Treatment", y = "Frequency") +
theme_classic()
Mosaic plots are easiest in base R rather than ggplot()
. The t()
command flips (transposes) the table to ensure that the explanatory variable is along the horizontal axis.
mosaicplot( t(birdMalariaTable), col = c("firebrick", "goldenrod1"),
sub = "Treatment", ylab = "Relative frequency", main = "")
Scatter plot of the relationship between the ornamentation of male guppies and the average attractiveness of their sons.
guppyFatherSonData <- read.csv("https://whitlockschluter3e.zoology.ubc.ca/Data/chapter02/chap02f3_3GuppyFatherSonAttractiveness.csv", stringsAsFactors = FALSE)
head(guppyFatherSonData)
## fatherOrnamentation sonAttractiveness
## 1 0.35 -0.32
## 2 0.03 -0.03
## 3 0.14 0.11
## 4 0.10 0.28
## 5 0.22 0.31
## 6 0.23 0.18
A quick scatter plot can be made simply with base R’s plot()
command. The tilde (~) relates the response variable sonAttractiveness
to the explanatory variable fatherOrnamentation
.
plot(sonAttractiveness ~ fatherOrnamentation, data = guppyFatherSonData,
pch = 16, col = "firebrick")
In ggplot()
, the commands for a scatter plot are as follows.
ggplot(data = guppyFatherSonData, aes(x = fatherOrnamentation, y = sonAttractiveness)) +
geom_point(size = 3, col = "firebrick") +
labs(x = "Father's ornamentation", y = "Son's attractiveness") +
theme_classic()
Strip chart, violin plot, and multiple histograms showing hemoglobin concentration in men living at high altitude in three different parts of the world and in a sea level population (USA).
hemoglobinData <- read.csv(url("https://whitlockschluter3e.zoology.ubc.ca/Data/chapter02/chap02e3bHumanHemoglobinElevation.csv"), stringsAsFactors = FALSE)
head(hemoglobinData)
## id hemoglobin population
## 1 US.Sea.level1 10.40 USA
## 2 US.Sea.level2 11.20 USA
## 3 US.Sea.level3 11.70 USA
## 4 US.Sea.level4 11.80 USA
## 5 US.Sea.level5 11.90 USA
## 6 US.Sea.level6 12.05 USA
table(hemoglobinData$population)
##
## Andes Ethiopia Tibet USA
## 71 128 59 1704
ggplot(data = hemoglobinData, aes(x = population, y = hemoglobin)) +
geom_jitter(color = "firebrick", shape = 1, size = 3, width = 0.15) +
labs(x = "Male population", y = "Hemoglobin concentration (g/dL)") +
theme_classic()
A violin plot using ggplot()
, like that in Figure 2.3-4.
ggplot(data = hemoglobinData, aes(y = hemoglobin, x = population)) +
geom_violin(fill = "goldenrod1", col = "black") +
labs(x = "Male population", y = "Hemoglobin concentration (g/dL)") +
stat_summary(fun.y = mean, geom = "point", color = "black") +
theme_classic()
Shows the association between hemoglobin and population using stacked histograms all having the same x-axis (Figure 2.3-5). The facet_wrap()
function indicates the groups (in this case populations), with the tilde “~” indicating that they represent another explanatory variable.
ggplot(data = hemoglobinData, aes(x = hemoglobin)) +
geom_histogram(fill = "firebrick", binwidth = 1, col = "black",
boundary = 0, closed = "left") +
labs(x = "Hemoglobin concentration (g/dL)", y = "Frequency") +
theme_classic() +
facet_wrap( ~ population, ncol = 1, scales = "free_y", strip.position = "right")
Line graph showing confirmed cases of measles in England and Wales from 1995 to 2011. Annual counts are quarterly.
measlesData <- read.csv("https://whitlockschluter3e.zoology.ubc.ca/Data/chapter02/chap02f4_1MeaslesOutbreaks.csv", stringsAsFactors = FALSE)
head(measlesData)
## year quarter confirmedCases yearByQuarter
## 1 2011 4th 136 2011.88
## 2 2011 3rd 154 2011.62
## 3 2011 2nd 346 2011.38
## 4 2011 1st 151 2011.12
## 5 2010 4th 31 2010.88
## 6 2010 3rd 134 2010.62
The graph in this example consists of both points and lines, and so both geom_line()
and geom_point()
are included.
ggplot(data = measlesData, aes(x = yearByQuarter, y = confirmedCases)) +
geom_line() +
geom_point(color = "firebrick") +
labs(x = "Year", y = "Number of new measles cases") +
theme_classic()