Let’s start by loading the tidyverse package:

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.1.1     ✓ dplyr   1.0.5
## ✓ tidyr   1.1.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

For the purposes of this session, we will be using data from Bushong & Jaeger (2019), an experiment on speech perception. In this experiment, participants hear sentences like “I noticed a [?]ent in the fender…”, where the [?] is a sound manipulated to range from more /t/-like to more /d/-like, with ambiguous steps in between, by changing the value of an acoustic variable called VOT (voice onset time). We also manipulate a later word in the sentence to bias towards either a /d/ interpretation (e.g., “fender” should make you more likely to think the earlier word was “dent”) or a /t/ interpretation (e.g., “campgrounds”). Finally, we manipulate how far away this biasing context word appears (“short” = 3 syllables after the target word, “long” = 6-9 syllables after). After listening to the sentence, the participant indicates whether they thought the word was “dent” or “tent” (the key dependent variable). We also collect their reaction time on this response.

This figure gives a conceptual overview of the manipulated variables:

I’ve shared the dataset with you as a .RDS file, a format that stores a single R object (here, a data frame) with its column types intact. We can load the dataset by using the readRDS() function:

d <- readRDS("data_preprocessed.RDS")
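To see what this format does, here is a minimal sketch of a round trip. The toy data frame and the temporary file are made up for illustration and are not part of the shared dataset: saveRDS() writes one R object to disk, and readRDS() reads it back.

```r
# Toy data frame standing in for the real dataset (hypothetical values)
toy <- data.frame(
  subject = c(1L, 1L, 2L),
  context = factor(c("tent", "dent", "tent")),
  RT      = c(6074L, 4895L, 5206L)
)

# Write the object to a temporary .RDS file, then read it back
path <- tempfile(fileext = ".RDS")
saveRDS(toy, path)
toy_reloaded <- readRDS(path)

# Check that the reloaded object matches the original,
# including the factor and integer columns
identical(toy, toy_reloaded)
```

Unlike a .csv file, nothing has to be re-guessed on load: factors stay factors and integers stay integers.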

Inspecting Data

The functions View() and head() will be your best friends for taking a look to see what is in your data frame. head(), for example, shows the first 6 rows of your data frame:

head(d)
##   subject Trial distance context VOT sFrame   RT respond_t
## 1       1     0    short    tent  50     10 6074         1
## 2       1     1     long    tent  50     10 4895         1
## 3       1     2     long    tent  30      4 8908         0
## 4       1     3    short    tent  35      6 5815         0
## 5       1     4     long    dent  50      9 5206         1
## 6       1     5     long    dent  50     10 4771         1

To view an individual column, we can subset our data using the $ operator. For example, to see the first 6 values of the RT column, I would use the command:

head(d$RT)
## [1] 6074 4895 8908 5815 5206 4771

Here are a few other functions I commonly use to inspect data: names(), class(), summary(), and unique(). Here’s how the output of each of those looks:

names(d) # get column names
## [1] "subject"   "Trial"     "distance"  "context"   "VOT"       "sFrame"   
## [7] "RT"        "respond_t"
class(d$RT) # what data type is RT?
## [1] "integer"
summary(d)
##     subject           Trial         distance    context          VOT       
##  Min.   :  1.00   Min.   :  0.00   long :5040   dent:5040   Min.   :10.00  
##  1st Qu.: 21.75   1st Qu.: 41.75   short:5040   tent:5040   1st Qu.:30.00  
##  Median : 50.50   Median : 83.50                            Median :37.50  
##  Mean   : 56.13   Mean   : 83.50                            Mean   :41.67  
##  3rd Qu.: 86.25   3rd Qu.:125.25                            3rd Qu.:50.00  
##  Max.   :120.00   Max.   :167.00                            Max.   :85.00  
##      sFrame            RT           respond_t     
##  Min.   : 1.00   Min.   :  3572   Min.   :0.0000  
##  1st Qu.: 5.75   1st Qu.:  4963   1st Qu.:0.0000  
##  Median :10.50   Median :  5426   Median :0.0000  
##  Mean   :10.85   Mean   :  6105   Mean   :0.3254  
##  3rd Qu.:16.25   3rd Qu.:  6050   3rd Qu.:1.0000  
##  Max.   :21.00   Max.   :455273   Max.   :1.0000
unique(d$VOT) # which distinct values does the VOT variable take?
## [1] 50 30 35 40 10 85
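A few more base R helpers can answer follow-up questions about a column. The sketch below uses a small made-up data frame (toy is hypothetical, not the real d) to count the distinct values, tabulate how often each occurs, and print a compact overview of the column types.

```r
# Toy data frame with made-up values, standing in for d
toy <- data.frame(
  VOT     = c(50, 30, 35, 40, 10, 85, 50, 30),
  context = factor(c("tent", "tent", "dent", "dent",
                     "tent", "dent", "dent", "tent"))
)

n_vot      <- length(unique(toy$VOT))  # number of distinct VOT values
vot_counts <- table(toy$VOT)           # number of rows at each VOT value

n_vot
vot_counts
str(toy)  # compact overview: one line per column with its type
```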

Data Visualization using ggplot2

The tidyverse includes the package ggplot2, which uses the “grammar of graphics” framework for data visualization. This works essentially as a layering system: we start with a base layer, a ggplot() call, which creates the basic template (at minimum, the data and the variables that will go on our x- and y-axes) from which we will work.

The first argument of ggplot() is our data frame d; the second is a call to aes() (“aesthetics”) that maps variables to our x- and y-axes. Let’s say that what we eventually want is a plot showing the proportion of /t/ responses (y-axis) by VOT (x-axis) and context word (we’ll get to that later). This is how we would create our base layer:

ggplot(d, aes(x = VOT, y = respond_t))

Now we want to add geom objects to our plot to create the actual visualization. For this example, let’s use points (the geom behind geom_point()), drawing one point at each VOT value.

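Our data is in its raw form, meaning that our respond_t variable is a bunch of 0’s and 1’s. We want to transform that into a proportion by taking the mean of the column at each VOT value; the ggplot2 function stat_summary() allows us to do just that! We need to give stat_summary() a couple of arguments: fun, the function used to summarize the y values at each x value (here, mean), and geom, the geom used to draw the result (here, "point").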

Let’s add it to our base layer:

ggplot(d, aes(x = VOT, y = respond_t)) +
  stat_summary(fun = mean, geom = "point")
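To make it concrete what stat_summary() is computing, here is a sketch that does the same summary by hand on a small made-up data frame (toy and its values are hypothetical). Taking the mean of the 0/1 column within each VOT value gives the proportion of 1’s, and plotting those pre-computed means with geom_point() should draw the same points.

```r
library(ggplot2)

# Toy 0/1 responses at three VOT values (made-up data)
toy <- data.frame(
  VOT       = c(10, 10, 10, 50, 50, 50, 85, 85, 85),
  respond_t = c(0, 0, 0, 0, 1, 1, 1, 1, 1)
)

# By hand: the mean of respond_t within each VOT value
toy_means <- aggregate(respond_t ~ VOT, data = toy, FUN = mean)

# Plot the pre-computed means directly ...
p_manual <- ggplot(toy_means, aes(x = VOT, y = respond_t)) +
  geom_point()

# ... or let stat_summary() compute them from the raw rows
p_summary <- ggplot(toy, aes(x = VOT, y = respond_t)) +
  stat_summary(fun = mean, geom = "point")
```

The stat_summary() route is usually more convenient because the raw data frame never has to be reshaped before plotting.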

To make this visualization better, we may want to add error bars to our points! There are two built-in geoms for just this purpose: geom_pointrange() and geom_errorbar(). Since we are already using points, let’s use geom_pointrange(). To use it in conjunction with stat_summary(), we need a function that computes both the mean and a measure of uncertainty. I like the function mean_cl_boot(), which computes the mean and a 95% confidence interval using a bootstrap method (it relies on the Hmisc package, so that needs to be installed). Here’s how we would add that to our plot:

ggplot(d, aes(x = VOT, y = respond_t)) +
  stat_summary(fun.data = mean_cl_boot, geom = "pointrange") # notice here I have to use the argument fun.data instead of fun
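For intuition about what a bootstrap confidence interval is, here is a rough base R sketch of a percentile bootstrap on made-up responses. This illustrates the general idea only; the exact procedure and defaults used by mean_cl_boot() may differ.

```r
set.seed(1)  # make the resampling reproducible

# Hypothetical 0/1 responses at a single VOT value
responses <- c(0, 1, 1, 0, 1, 1, 1, 0, 1, 1)

# Resample the responses with replacement many times, keeping each mean
boot_means <- replicate(1000, mean(sample(responses, replace = TRUE)))

# The middle 95% of the resampled means approximates the confidence interval
ci <- quantile(boot_means, probs = c(0.025, 0.975))

c(mean = mean(responses), lower = ci[[1]], upper = ci[[2]])
```

The three values returned at the end (a center, a lower bound, and an upper bound) are the kind of summary that geom_pointrange() draws as a point with a vertical line through it.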

Let’s return to our original visualization goal: to plot /t/ responses by both VOT and context word. One good way to do that is to plot points in different colors corresponding to the different context word conditions. We can do this by adding one more argument, color, to the aes() call in our original ggplot() call:

ggplot(d, aes(x = VOT, y = respond_t, color = context)) +
  stat_summary(fun.data = mean_cl_boot, geom = "pointrange") # notice here I have to use the argument fun.data instead of fun
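Mapping color to a discrete variable also splits the data into groups, so the summary is computed separately for each combination of VOT and context. As a sketch of the equivalent calculation done by hand (on made-up toy data, not the real d):

```r
# Toy data: two VOT values crossed with two context conditions (made up)
toy <- data.frame(
  VOT       = c(10, 10, 10, 10, 50, 50, 50, 50),
  context   = factor(c("dent", "dent", "tent", "tent",
                       "dent", "dent", "tent", "tent")),
  respond_t = c(0, 0, 0, 1, 0, 1, 1, 1)
)

# One mean per VOT-by-context cell, i.e. one plotted point per cell
cell_means <- aggregate(respond_t ~ VOT + context, data = toy, FUN = mean)
cell_means
```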

Making Your Plots Prettier

The basic plotting defaults are a tad bit ugly. Here are a few issues right off the bat:

Fortunately, there is a massive selection of functions we can add to our plot to correct these issues!

Geom Colors

There is a family of functions all starting with scale_color that allow us to change the color of our points. Here are a few that I like:

  • scale_color_manual() allows you to manually enter which colors you would like your points (or other geoms) to be. You can specify colors as hex codes (for example, "#1B9E77") or with the built-in names of R colors. A comprehensive list of R color names can be found here: http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf
  • scale_color_brewer() uses the ColorBrewer palettes. You have three color scheme options: sequential (a light-to-dark gradient of one color), qualitative (easily distinguishable colors), and diverging (a gradient from one color to its opposite). You can find a list of all the palettes here: https://r-graph-gallery.com/38-rcolorbrewers-palettes.html. This is a great option because many of the palettes are color-blind friendly; running RColorBrewer::display.brewer.all(colorblindFriendly = TRUE) shows which ones.
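Here is a minimal sketch of scale_color_manual(), using a made-up toy data frame in place of d. The two colors are arbitrary choices for illustration; naming the elements of values ties each color to a specific level of the variable rather than relying on level order.

```r
library(ggplot2)

# Toy summary data standing in for d (hypothetical values)
toy <- data.frame(
  VOT       = c(10, 50, 85, 10, 50, 85),
  respond_t = c(0.05, 0.40, 0.90, 0.10, 0.65, 0.95),
  context   = factor(c("dent", "dent", "dent", "tent", "tent", "tent"))
)

p_manual_colors <- ggplot(toy, aes(x = VOT, y = respond_t, color = context)) +
  geom_point() +
  # one hex code and one built-in R color name, matched to levels by name
  scale_color_manual(values = c(dent = "#1B9E77", tent = "darkorange"))
```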

Here’s an example of our plot with the ColorBrewer palette Dark2:

ggplot(d, aes(x = VOT, y = respond_t, color = context)) +
  stat_summary(fun.data = mean_cl_boot, geom = "pointrange") +
  scale_color_brewer(type = "qual", palette = "Dark2")

Axis & Legend Labeling

The functions xlab() and ylab() allow us to input our own text labels for our axes, like so:

ggplot(d, aes(x = VOT, y = respond_t, color = context)) +
  stat_summary(fun.data = mean_cl_boot, geom = "pointrange") +
  scale_color_brewer(type = "qual", palette = "Dark2") +
  xlab("VOT (ms)") +
  ylab("Proportion /t/ responses")

In the same function I used to create my point colors, I can also use the name and labels arguments to change the name of the legend and the labels, respectively:

ggplot(d, aes(x = VOT, y = respond_t, color = context)) +
  stat_summary(fun.data = mean_cl_boot, geom = "pointrange") +
  scale_color_brewer(type = "qual", palette = "Dark2", name = "Context bias", labels = c("dent-biasing", "tent-biasing")) +
  xlab("VOT (ms)") +
  ylab("Proportion /t/ responses")
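As an alternative sketch, the labs() function can set the axis titles and the legend title in a single call. The toy data frame below is made up for illustration. Note that labs() only changes titles; relabeling the legend entries themselves still needs the labels argument of the scale function.

```r
library(ggplot2)

# Toy summary data standing in for d (hypothetical values)
toy <- data.frame(
  VOT       = c(10, 50, 85, 10, 50, 85),
  respond_t = c(0.05, 0.40, 0.90, 0.10, 0.65, 0.95),
  context   = factor(c("dent", "dent", "dent", "tent", "tent", "tent"))
)

p_labeled <- ggplot(toy, aes(x = VOT, y = respond_t, color = context)) +
  geom_point() +
  labs(x = "VOT (ms)",
       y = "Proportion /t/ responses",
       color = "Context bias")  # title of the color legend
```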

Plot Customization with theme()

Just about everything you can imagine about your plots is customizable, and much of this is done with the theme() function, which you can add to your ggplot(). There are some built-in themes; my favorites are theme_classic() and theme_bw(), which in my opinion are much more readable than the default gray-background plot. Here’s an example of theme_bw():

ggplot(d, aes(x = VOT, y = respond_t, color = context)) +
  stat_summary(fun.data = mean_cl_boot, geom = "pointrange") +
  scale_color_brewer(type = "qual", palette = "Dark2", name = "Context bias", labels = c("dent-biasing", "tent-biasing")) +
  xlab("VOT (ms)") +
  ylab("Proportion /t/ responses") +
  theme_bw()
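The built-in themes also take a base_size argument, which scales all of the plot text at once. A small sketch on made-up toy data (the value 18 is just an example):

```r
library(ggplot2)

# Toy summary data standing in for d (hypothetical values)
toy <- data.frame(
  VOT       = c(10, 50, 85),
  respond_t = c(0.05, 0.60, 0.95)
)

p_big_text <- ggplot(toy, aes(x = VOT, y = respond_t)) +
  geom_point() +
  theme_bw(base_size = 18)  # larger text throughout, e.g. for slides
```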

But the most customizable option is to add your own theme elements. There are approximately 80 million theme objects, which you can view by running the help function on theme (?theme). I find this walkthrough to be quite helpful: https://henrywang.nl/ggplot2-theme-elements-demonstration/.

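Here’s how theme objects work on a basic level. Each theme object has its own associated element type. For example, anything that deals with text will be specified by an element_text() function call, which takes arguments like size, color, etc. Here are the main element types: element_text() for text, element_line() for lines, element_rect() for borders and backgrounds, and element_blank(), which draws nothing and so removes the element.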

Here are a few theme objects I find myself frequently editing:

ggplot(d, aes(x = VOT, y = respond_t, color = context)) +
  stat_summary(fun.data = mean_cl_boot, geom = "pointrange") +
  scale_color_brewer(type = "qual", palette = "Dark2", name = "Context bias", labels = c("dent-biasing", "tent-biasing")) +
  xlab("VOT (ms)") +
  ylab("Proportion /t/ responses") +
  theme(axis.text = element_text(size = 18),
        axis.title = element_text(size = 18),
        legend.text = element_text(size = 18),
        legend.title = element_text(size = 18),
        panel.grid = element_line(color = "grey95"),
        panel.background = element_rect(color = "black", fill = NA)) # fill = NA creates a transparent panel background, very good for putting on slides with a non-white background color!
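Since a theme() call is just an object, one option is to store your preferred settings once and add them to every plot. A minimal sketch, assuming a made-up toy data frame and a hypothetical name my_theme:

```r
library(ggplot2)

# Store frequently used theme settings in one object (my_theme is a made-up name)
my_theme <- theme(
  axis.text        = element_text(size = 18),
  axis.title       = element_text(size = 18),
  panel.grid       = element_line(color = "grey95"),
  panel.background = element_rect(color = "black", fill = NA)
)

# Toy summary data standing in for d (hypothetical values)
toy <- data.frame(
  VOT       = c(10, 50, 85),
  respond_t = c(0.05, 0.60, 0.95)
)

# Add the stored theme to a plot like any other layer
p_themed <- ggplot(toy, aes(x = VOT, y = respond_t)) +
  geom_point() +
  my_theme
```

This keeps the styling consistent across figures and means a change to the text size only has to be made in one place.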

Your Turn!

Problem 1: Facets

Let’s imagine that I want to see the VOT & context effect broken up by an additional manipulation in the experiment: distance. In this experiment, I manipulated how far away the biasing context word occurred (“short” distance = 3 syllables later, “long” distance = 6-9 syllables later). We can show this additional variable by using facets.

Try to replicate this plot:

Here is the series of steps you’ll need to take:

  1. Use our point plot above as a ‘base’.
  2. Facet the plot by the distance variable (hint: check out the facet_wrap() function)
  3. Change the facet headers to have a white background and black border, and text size 18 to match our axes and legend (hint: look at the strip.background and strip.text theme objects)

Problem 2: Trial Effects

Let’s say that I want to see how the effect of context bias changes over the course of the experiment. Try to replicate this plot:

Here is the series of steps you’ll need to take:

  1. Plot the mean of respond_t for each Trial using a point geom.
  2. Show a linear fit line (hint: check out the geom_smooth() function and its associated method options).
  3. Change axis & legend labels.
  4. Use the built-in theme theme_classic().

Problem 3: Histograms

Try to replicate this plot:

First, I’m going to remove some RT outliers. Run this line of code to do that, and use d2 as your plotting data:

d2 <- subset(d, RT < 10000) # keeps only RTs below 10000 ms (10 seconds)

Here is the series of steps you’ll need to take:

  1. Create a histogram of RT. Hint: take a look at the ggplot cheat sheet or geom_histogram() help docs
  2. Change the histogram bar colors to a lighter gray and make the outline of the bars black.
  3. Change the plot’s background color to white, remove grid lines, and make a black panel outline. Also, change text to bold (hint: check out the face argument to element_text())