Let’s start by loading the tidyverse package:
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.1.1 ✓ dplyr 1.0.5
## ✓ tidyr 1.1.0 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
For the purposes of this session, we will be using data from Bushong & Jaeger (2019), an experiment on speech perception. In this experiment, participants hear sentences like “I noticed a [?]ent in the fender…”, where the [?] is a sound manipulated to range from sounding more like /d/ to more like /t/, with ambiguous values in the middle, by changing the value of an acoustic variable called VOT (voice onset time). We also manipulate a later word in the sentence to bias towards either a /d/ interpretation (e.g., “fender” should make you more likely to think the earlier word was “dent”) or a /t/ interpretation (e.g., “campgrounds”). Finally, we manipulate how far away this biasing context word appears (“short” = 3 syllables after the target word, “long” = 6-9 syllables after). After listening to the sentence, the participant indicates whether they thought the word was “dent” or “tent” (the key dependent variable). We also collect their reaction time on this response.
This figure gives a conceptual overview of the manipulated variables:
I’ve shared the dataset with you as a .RDS file, a special format for storing R data frames. We can load the dataset by using the readRDS() function:
d <- readRDS("data_preprocessed.RDS")
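If you want to see what the .RDS format buys you, here is a minimal sketch of a save-and-load round trip. The toy data frame and the temporary file are made up for illustration; they are not part of the real dataset.

```r
# Toy data frame standing in for the real dataset (hypothetical values)
toy <- data.frame(
  subject   = c(1, 1, 2),
  context   = factor(c("tent", "dent", "tent")),
  VOT       = c(10, 50, 85),
  respond_t = c(0, 1, 1)
)

# Write it to a temporary .RDS file, then read it back in
path <- tempfile(fileext = ".RDS")
saveRDS(toy, path)
toy_loaded <- readRDS(path)

# The round trip preserves column types (e.g., context is still a factor)
identical(toy, toy_loaded)
class(toy_loaded$context)
```

Unlike a .csv file, the data frame comes back with its column types intact, so there is no need to re-declare factors after loading.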
The functions View() and head() will be your best friends for taking a look at what is in your data frame. head(), for example, shows the first 6 rows of your data frame:
head(d)
## subject Trial distance context VOT sFrame RT respond_t
## 1 1 0 short tent 50 10 6074 1
## 2 1 1 long tent 50 10 4895 1
## 3 1 2 long tent 30 4 8908 0
## 4 1 3 short tent 35 6 5815 0
## 5 1 4 long dent 50 9 5206 1
## 6 1 5 long dent 50 10 4771 1
To view an individual column, we can subset our data using the $ operator. For example, to see the first 6 values of the RT column, I would use the command:
head(d$RT)
## [1] 6074 4895 8908 5815 5206 4771
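The $ operator is one of several equivalent ways to pull a column out of a data frame. Here is a small sketch on a made-up data frame (toy is hypothetical, with values borrowed from the output above):

```r
# Toy data frame (hypothetical), with integer RTs like the real data
toy <- data.frame(RT = c(6074L, 4895L, 8908L), VOT = c(50, 50, 30))

toy$RT        # column as a vector, using $
toy[["RT"]]   # same vector, with the column name given as a string
toy[, "RT"]   # same again, using [row, column] indexing
```

The [[ ]] form is handy when the column name is stored in a variable, which $ cannot handle.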
Here are a few other functions I commonly use to inspect data:
- names() will return the names of all columns in your data frame
- class() (using a specific column as an input) will tell you what data type that column is
- summary() will give some basic descriptive statistics of your columns
- unique() will return all unique values of a variable when you give a single data frame column as input
Here’s how the output of each of those looks:
names(d) # get column names
## [1] "subject" "Trial" "distance" "context" "VOT" "sFrame"
## [7] "RT" "respond_t"
class(d$RT) # what data type is RT?
## [1] "integer"
summary(d)
## subject Trial distance context VOT
## Min. : 1.00 Min. : 0.00 long :5040 dent:5040 Min. :10.00
## 1st Qu.: 21.75 1st Qu.: 41.75 short:5040 tent:5040 1st Qu.:30.00
## Median : 50.50 Median : 83.50 Median :37.50
## Mean : 56.13 Mean : 83.50 Mean :41.67
## 3rd Qu.: 86.25 3rd Qu.:125.25 3rd Qu.:50.00
## Max. :120.00 Max. :167.00 Max. :85.00
## sFrame RT respond_t
## Min. : 1.00 Min. : 3572 Min. :0.0000
## 1st Qu.: 5.75 1st Qu.: 4963 1st Qu.:0.0000
## Median :10.50 Median : 5426 Median :0.0000
## Mean :10.85 Mean : 6105 Mean :0.3254
## 3rd Qu.:16.25 3rd Qu.: 6050 3rd Qu.:1.0000
## Max. :21.00 Max. :455273 Max. :1.0000
unique(d$VOT) # how many levels of the VOT variable are there?
## [1] 50 30 35 40 10 85
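Note that unique() returns the values themselves, in order of first appearance. To actually count the levels, or to see how many observations fall at each one, a few base R helpers are useful. A sketch on a made-up vector (toy_vot is hypothetical):

```r
# Toy VOT values (hypothetical), in the order they might appear in the data
toy_vot <- c(50, 50, 30, 35, 50, 50, 40, 10, 85)

sort(unique(toy_vot))     # the distinct values, sorted: 10 30 35 40 50 85
length(unique(toy_vot))   # how many distinct values there are: 6
table(toy_vot)            # number of observations at each value
```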
ggplot2
tidyverse contains the package ggplot2, which uses the “grammar of graphics” framework for data visualization. This works essentially as a layering system: we start with a base layer, a ggplot() call, which creates the basic template (at minimum, the data and the variables that will be on our x- and y-axes) from which we will work. The first argument of ggplot() will be our data frame d; then, we need to give a second argument, a call to aes() (“aesthetics”), specifying our x and y axes. Let’s say that what we eventually want to do is create a plot showing the proportion of /t/ responses (y axis) by VOT (x axis) and context word (we’ll get to that later). This is how we would create our base layer:
ggplot(d, aes(x = VOT, y = respond_t))
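One way to convince yourself that this really is a layering system is to peek at the plot object. The sketch below uses a made-up data frame (toy is hypothetical) and inspects the layers component of the plot, which is an internal detail I am assuming stays accessible in your version of ggplot2:

```r
library(ggplot2)

# Toy data (hypothetical): two responses at each of three VOT values
toy <- data.frame(
  VOT       = c(10, 10, 50, 50, 85, 85),
  respond_t = c(0, 0, 0, 1, 1, 1)
)

# Base layer only: axes are set up, but nothing is drawn yet
p_base <- ggplot(toy, aes(x = VOT, y = respond_t))
length(p_base$layers)

# Adding a geom with + stacks one layer on top of the base
p_points <- p_base + geom_point()
length(p_points$layers)
```

If you print p_base, you get an empty panel with labeled axes; the points only appear once a geom has been added.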
Now what we want to do is add geom objects to our plot, creating our actual visualization. For this example, let’s use geom_point(), which will create a point at each VOT value.
Our data is in its raw form, meaning that our respond_t variable is a bunch of 0’s and 1’s. We want to transform that into a proportion by taking the mean of the column at each VOT value; the ggplot2 function stat_summary() allows us to do just that! We need to give stat_summary() a couple of different arguments:
- fun: the function we want to apply (in this case, mean())
- geom: the geom we want (in this case, “point”)
Let’s add it to our base layer:
ggplot(d, aes(x = VOT, y = respond_t)) +
stat_summary(fun = mean, geom = "point")
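To see what stat_summary() is doing behind the scenes, we can compute the same means by hand and compare them with what the plot actually draws. This is a sketch on made-up data (toy is hypothetical); layer_data() is the ggplot2 helper that returns the data computed for a layer.

```r
library(ggplot2)

# Toy data (hypothetical): 0/1 responses at three VOT values
toy <- data.frame(
  VOT       = c(10, 10, 50, 50, 85, 85),
  respond_t = c(0, 0, 0, 1, 1, 1)
)

# Mean of respond_t at each VOT value, computed by hand
by_hand <- aggregate(respond_t ~ VOT, data = toy, FUN = mean)

# The same summary as computed inside the plot
p <- ggplot(toy, aes(x = VOT, y = respond_t)) +
  stat_summary(fun = mean, geom = "point")
from_plot <- layer_data(p)
from_plot <- from_plot[order(from_plot$x), ]

by_hand$respond_t   # proportions computed by hand
from_plot$y         # proportions plotted by stat_summary()
```

Because the responses are coded 0/1, the mean at each VOT value is exactly the proportion of /t/ responses.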
To make this visualization better, we may want to add error bars to our points! It turns out there are two built-in geoms for just this purpose: geom_pointrange() and geom_errorbar(). Since we are already using points, let’s use geom_pointrange(). To use it in conjunction with stat_summary(), we will need a function that computes both the mean and a measure of uncertainty. I like the function mean_cl_boot(): this computes the mean and a 95% confidence interval using a bootstrap method (it requires the Hmisc package to be installed). Here’s how we would add that to our plot:
ggplot(d, aes(x = VOT, y = respond_t)) +
stat_summary(fun.data = mean_cl_boot, geom = "pointrange") # notice here I have to use the argument fun.data instead of fun
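The reason for fun.data rather than fun is that mean_cl_boot() returns a small data frame instead of a single number. Here is a sketch on a made-up response vector (toy_responses is hypothetical); it assumes the Hmisc package is installed, since mean_cl_boot() relies on it.

```r
library(ggplot2)  # mean_cl_boot() also needs the Hmisc package installed

set.seed(1)  # the bootstrap is random, so fix the seed for reproducibility

# Toy 0/1 responses (hypothetical): six /t/ responses out of ten
toy_responses <- c(0, 0, 1, 1, 1, 0, 1, 1, 0, 1)

ci <- mean_cl_boot(toy_responses)
ci          # one-row data frame with columns y, ymin, ymax
names(ci)
```

geom_pointrange() uses y for the point and ymin/ymax for the ends of the bar, which is why the two fit together without any extra work.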
Let’s return to our original visualization goal: to plot /t/ responses by both VOT and context word. One good way to do that would be to plot points in different colors that correspond to the different context word conditions. The way that we can do this is to specify an additional aes() argument in our original ggplot() call, color:
ggplot(d, aes(x = VOT, y = respond_t, color = context)) +
stat_summary(fun.data = mean_cl_boot, geom = "pointrange") # notice here I have to use the argument fun.data instead of fun
The basic plotting defaults are a tad bit ugly. Here are a few issues right off the bat:
Fortunately, there is a massive selection of functions we can add to our plot to correct these issues!
There is a family of functions all starting with scale_color that allow us to change the color of our points. Here are a few that I like:
- scale_color_manual() allows you to manually enter which colors you would like your points (or other geoms) to be. You can specify colors with RGB values, hex codes, or the built-in names of R colors. A comprehensive list of R color names can be found here: http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf
- scale_color_brewer() uses the R Color Brewer system. You have three types of color scheme: sequential (a gradient from light to dark for one color), qualitative (easily distinguishable colors), and diverging (a gradient from one color to its opposite). You can find a list of all the palettes here: https://r-graph-gallery.com/38-rcolorbrewers-palettes.html. This is a great option because several of the palettes, including the qualitative palette Dark2, are designed to be color-blind friendly!
Here’s an example of our plot with the Color Brewer palette Dark2:
ggplot(d, aes(x = VOT, y = respond_t, color = context)) +
stat_summary(fun.data = mean_cl_boot, geom = "pointrange") +
scale_color_brewer(type = "qual", palette = "Dark2")
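For comparison, here is a minimal sketch of the scale_color_manual() route on made-up data (toy is hypothetical, as are the particular colors I picked). Naming the values by factor level is my own habit here; it makes it explicit which color goes with which condition.

```r
library(ggplot2)

# Toy data (hypothetical): three VOT values per context condition
toy <- data.frame(
  VOT       = c(10, 50, 85, 10, 50, 85),
  respond_t = c(0, 0.4, 0.9, 0.1, 0.6, 1),
  context   = factor(rep(c("dent", "tent"), each = 3))
)

# One hex code and one built-in R color name, matched to the factor levels
p <- ggplot(toy, aes(x = VOT, y = respond_t, color = context)) +
  geom_point() +
  scale_color_manual(values = c(dent = "#1B9E77", tent = "darkorange"))

# Colors actually assigned to the points
unique(layer_data(p)$colour)
```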
The functions xlab() and ylab() allow us to input our own text labels for our axes, like so:
ggplot(d, aes(x = VOT, y = respond_t, color = context)) +
stat_summary(fun.data = mean_cl_boot, geom = "pointrange") +
scale_color_brewer(type = "qual", palette = "Dark2") +
xlab("VOT (ms)") +
ylab("Proportion /t/ responses")
In the same function I used to create my point colors, I can also use the name and labels arguments to change the name of the legend and the labels, respectively:
ggplot(d, aes(x = VOT, y = respond_t, color = context)) +
stat_summary(fun.data = mean_cl_boot, geom = "pointrange") +
scale_color_brewer(type = "qual", palette = "Dark2", name = "Context bias", labels = c("dent-biasing", "tent-biasing")) +
xlab("VOT (ms)") +
ylab("Proportion /t/ responses")
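One thing to watch out for: the labels are applied in the order of the factor levels, not in the order the values happen to appear in the data. It is worth checking that order before relabeling. A sketch on a made-up factor (toy_context is hypothetical):

```r
# Toy context column (hypothetical); "tent" appears first in the data
toy_context <- factor(c("tent", "tent", "dent", "tent", "dent"))

# Levels are alphabetical by default, so "dent" comes first
levels(toy_context)
```

Since the levels here are "dent" then "tent", the labels vector should list the dent label first, as in the plot code above.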
theme()
Literally everything you can imagine about your plots is customizable, and much of this is done with the theme() function, which you can add to your ggplot(). There are some built-in themes; my favorites are theme_classic() and theme_bw(), which in my opinion are much more readable than the default gray-background plot. Here’s an example of theme_bw():
ggplot(d, aes(x = VOT, y = respond_t, color = context)) +
stat_summary(fun.data = mean_cl_boot, geom = "pointrange") +
scale_color_brewer(type = "qual", palette = "Dark2", name = "Context bias", labels = c("dent-biasing", "tent-biasing")) +
xlab("VOT (ms)") +
ylab("Proportion /t/ responses") +
theme_bw()
But the most customizable option is to add your own theme elements. There are approximately 80 million theme objects, which you can view by running the help function on theme (?theme). I find this walkthrough to be quite helpful: https://henrywang.nl/ggplot2-theme-elements-demonstration/.
Here’s how theme objects work on a basic level. Each theme object has its own associated element type. For example, anything that deals with text will be specified by an element_text() function call, which takes arguments like size, color, etc. Here are all the element types:
- element_text: text elements (axis labels, axis values, legend titles, etc.)
- element_rect: box elements (like the border of the plot). Takes arguments like fill and color (outline)
- element_line: linear elements (like gridlines in the plot). Takes arguments like color and size (thickness of line; called linewidth in ggplot2 3.4.0 and later)
- element_blank: this will remove an element. E.g., if you don’t want an axis label and you want that space to be taken away completely, you can assign the axis label element to element_blank().
Here are a few theme objects I find myself frequently editing:
- axis.text, axis.title, legend.text, legend.title: Text associated with the axes and legend (value labels and title, respectively). You can make these more specific by adding which axis you would like to change (e.g., axis.text.x)
- panel.background and plot.background: Outline & fill of the plot. plot.background deals with the entirety of the plot, while panel.background is just the area within the axes of your plot.
- panel.grid: Great for changing how obvious or subtle your grid lines are. You can make them lighter to make them less obtrusive (I like the color “grey95”; it’s practically white but still somewhat visible)
ggplot(d, aes(x = VOT, y = respond_t, color = context)) +
stat_summary(fun.data = mean_cl_boot, geom = "pointrange") +
scale_color_brewer(type = "qual", palette = "Dark2", name = "Context bias", labels = c("dent-biasing", "tent-biasing")) +
xlab("VOT (ms)") +
ylab("Proportion /t/ responses") +
theme(axis.text = element_text(size = 18),
axis.title = element_text(size = 18),
legend.text = element_text(size = 18),
legend.title = element_text(size = 18),
panel.grid = element_line(color = "grey95"),
panel.background = element_rect(color = "black", fill = NA)) # fill = NA creates a transparent background, very good for putting on slides with a non-white background color!
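Since these theme() calls get long, it can be convenient to store one in a variable and reuse it across plots. This is a sketch of that idea, not something the plots above require; slide_theme and the toy data are hypothetical names of my own.

```r
library(ggplot2)

# A reusable bundle of theme settings (hypothetical name)
slide_theme <- theme(
  axis.text  = element_text(size = 18),
  axis.title = element_text(size = 18),
  panel.grid = element_line(color = "grey95")
)

# Toy data (hypothetical)
toy <- data.frame(VOT = c(10, 50, 85), respond_t = c(0.05, 0.5, 0.95))

# Built-in theme first, custom tweaks second
p <- ggplot(toy, aes(x = VOT, y = respond_t)) +
  geom_point() +
  theme_bw() +
  slide_theme
```

Order matters here: a complete theme like theme_bw() replaces any theme settings added before it, so custom tweaks should come after it.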
Let’s imagine that I want to see the VOT & context effect broken up by an additional manipulation in the experiment: distance. In this experiment, I manipulated how far away the biasing context word occurred (“short” distance = 3 syllables later, “long” distance = 6-9 syllables later). We can show this additional variable by using facets.
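As a warm-up, here is a bare-bones sketch of faceting on made-up data (toy and its values are hypothetical). It only shows the mechanics of splitting one plot into panels; the styling is left for the exercise.

```r
library(ggplot2)

# Toy data (hypothetical): every combination of VOT, context, and distance
toy <- expand.grid(
  VOT      = c(10, 50, 85),
  context  = c("dent", "tent"),
  distance = c("short", "long")
)
toy$respond_t <- c(0.05, 0.40, 0.90, 0.10, 0.60, 0.95,
                   0.05, 0.45, 0.90, 0.10, 0.55, 0.95)

# One panel per level of distance
p <- ggplot(toy, aes(x = VOT, y = respond_t, color = context)) +
  geom_point() +
  facet_wrap(~ distance)

# Number of panels in the built plot
length(unique(layer_data(p)$PANEL))
```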
Try to replicate this plot:
Here is the series of steps you’ll need to take:
- [...] distance variable (hint: check out the facet_wrap() function)
- [...] strip.background and strip.text theme objects)
Let’s say that I want to see how the effect of context bias changes over the course of the experiment. Try to replicate this plot:
Here is the series of steps you’ll need to take:
- [...] respond_t for each Trial using a point geom.
- [...] geom_smooth() function and its associated method options).
- [...] theme_classic().
Try to replicate this plot:
First, I’m going to remove some RT outliers. Run this line of code to do that, and use d2 as your plotting data:
d2 <- subset(d, RT < 10000) # removes all RTs above 10000 (10 seconds)
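Whenever I exclude data like this, I like to check how much was actually removed. A sketch on made-up reaction times (toy is hypothetical; the 10000 ms cutoff is the one used above):

```r
# Toy data (hypothetical): five trials, two with implausibly long RTs
toy <- data.frame(subject = 1:5, RT = c(6074, 4895, 455273, 5815, 12000))

toy2 <- subset(toy, RT < 10000)

nrow(toy) - nrow(toy2)   # number of trials removed: 2
mean(toy$RT >= 10000)    # proportion of trials excluded: 0.4
```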
Here is the series of steps you’ll need to take:
- [...] RT. Hint: take a look at the ggplot cheat sheet or the geom_histogram() help docs
- [...] face argument to element_text())