Lab 1: Data Visualization

Let’s start by loading the tidyverse package:

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.1.1     ✓ dplyr   1.0.5
## ✓ tidyr   1.1.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

For this session, we will use data from Bushong & Jaeger (2019), an experiment on speech perception. In this experiment, participants hear sentences like “I noticed a [?]ent in the fender…”, where [?] is a sound manipulated to range from more /d/-like to more /t/-like, with ambiguous values in the middle, by changing the value of an acoustic variable called VOT (voice onset time). We also manipulate a later word in the sentence to bias towards a /d/ interpretation (e.g., “fender” should make you more likely to think the earlier word was “dent”) or a /t/ interpretation (e.g., “campgrounds”). Finally, we manipulate how far away this biasing context word appears (“short” = 3 syllables after the target word, “long” = 6-9 syllables after). After listening to the sentence, the participant indicates whether they thought the word was “dent” or “tent” (the key dependent variable). We also collect their reaction time on this response.

This figure gives a conceptual overview of the manipulated variables:

I’ve shared the dataset with you as a .RDS file, a format that stores a single R object (here, a data frame). We can load the dataset with the readRDS() function:

d <- readRDS("data_preprocessed.RDS")
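As a minimal sketch of how the .RDS format behaves, the following writes a small made-up data frame to a temporary file with saveRDS() and reads it back with readRDS(). The data frame toy and the temporary path are hypothetical stand-ins, not the lab dataset.

```r
# Hypothetical toy data frame (not the lab dataset)
toy <- data.frame(
  subject = c(1L, 1L, 2L),
  context = factor(c("tent", "dent", "tent")),
  RT = c(6074L, 4895L, 8908L)
)

# Write the object to a temporary .RDS file, then read it back
path <- tempfile(fileext = ".RDS")
saveRDS(toy, path)
toy_reloaded <- readRDS(path)

# Check that the reloaded object matches the original,
# including column types such as the factor
identical(toy, toy_reloaded)
class(toy_reloaded$context)
```

Unlike a .csv file, an .RDS file keeps R-specific information such as factor levels, so nothing has to be re-specified after loading.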

Inspecting Data

The functions View() and head() will be your best friends for seeing what is in your data frame. View() opens the whole data frame in a spreadsheet-style viewer, while head() prints the first 6 rows:

head(d)

##   subject Trial distance context VOT sFrame   RT respond_t
## 1       1     0    short    tent  50     10 6074         1
## 2       1     1     long    tent  50     10 4895         1
## 3       1     2     long    tent  30      4 8908         0
## 4       1     3    short    tent  35      6 5815         0
## 5       1     4     long    dent  50      9 5206         1
## 6       1     5     long    dent  50     10 4771         1

To view an individual column, we can extract it from our data frame using the $ operator. For example, to see the first 6 values of the RT column, I would use the command:

head(d$RT)

## [1] 6074 4895 8908 5815 5206 4771

Here are a few other functions I commonly use to inspect data:

names() will return the names of all columns in your data frame
class() (using a specific column as an input) will tell you what data type a specific column in your data frame is.
summary() will give some basic descriptive statistics of your columns
unique() will return all unique values of a variable when you give a single data frame column as input

Here’s how the output of each of those looks:

names(d) # get column names

## [1] "subject"   "Trial"     "distance"  "context"   "VOT"       "sFrame"   
## [7] "RT"        "respond_t"

class(d$RT) # what data type is RT?

## [1] "integer"

summary(d)

##     subject           Trial         distance    context          VOT       
##  Min.   :  1.00   Min.   :  0.00   long :5040   dent:5040   Min.   :10.00  
##  1st Qu.: 21.75   1st Qu.: 41.75   short:5040   tent:5040   1st Qu.:30.00  
##  Median : 50.50   Median : 83.50                            Median :37.50  
##  Mean   : 56.13   Mean   : 83.50                            Mean   :41.67  
##  3rd Qu.: 86.25   3rd Qu.:125.25                            3rd Qu.:50.00  
##  Max.   :120.00   Max.   :167.00                            Max.   :85.00  
##      sFrame            RT           respond_t     
##  Min.   : 1.00   Min.   :  3572   Min.   :0.0000  
##  1st Qu.: 5.75   1st Qu.:  4963   1st Qu.:0.0000  
##  Median :10.50   Median :  5426   Median :0.0000  
##  Mean   :10.85   Mean   :  6105   Mean   :0.3254  
##  3rd Qu.:16.25   3rd Qu.:  6050   3rd Qu.:1.0000  
##  Max.   :21.00   Max.   :455273   Max.   :1.0000

unique(d$VOT) # which distinct values of the VOT variable are there?

## [1] 50 30 35 40 10 85

Data Visualization using `ggplot2`

The tidyverse includes the package ggplot2, which uses the “grammar of graphics” framework for data visualization. This works essentially as a layering system: we start with a base layer, a ggplot() call, which creates the basic template (at minimum, the data and the variables that will go on our x- and y-axes) from which we will work. The first argument of ggplot() will be our data frame d; the second is a call to aes() (“aesthetics”) specifying which variables map to our x and y axes. Let’s say that what we eventually want is a plot showing the proportion of /t/ responses (y axis) by VOT (x axis) and context word (we’ll get to that later). This is how we would create our base layer:

ggplot(d, aes(x = VOT, y = respond_t))
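To make the layering idea concrete, here is a small sketch that stores the base layer in a variable and then adds a geom to it. The data frame toy is made up for illustration, and inspecting the layers element of a plot object relies on an internal detail of ggplot2 that is assumed here rather than guaranteed.

```r
library(ggplot2)

# Hypothetical toy data with the same column names as the lab data
toy <- data.frame(
  VOT = c(10, 10, 50, 50, 85, 85),
  respond_t = c(0, 0, 0, 1, 1, 1)
)

# Base layer only: data and aesthetic mappings, no geoms yet,
# so printing it would show empty axes
base <- ggplot(toy, aes(x = VOT, y = respond_t))
length(base$layers)  # number of layers on the base plot

# Adding a geom with + returns a new plot with one more layer;
# the original base object is left unchanged
with_points <- base + geom_point()
length(with_points$layers)
```

Storing the base layer in a variable like this can be handy when you want to try several geoms on the same data and mappings.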

Now we want to add geom objects to our plot, creating our actual visualization. For this example, let’s use a point geom, which will draw a point at each VOT value.

Our data is in its raw form, meaning that our respond_t variable is a bunch of 0’s and 1’s. We want to turn that into a proportion by taking the mean of respond_t at each VOT value; the ggplot2 function stat_summary() allows us to do just that! We need to give stat_summary() a couple of arguments:

fun: the function we want to apply (in this case, mean())
geom: the geom we want (in this case, “point”)

Let’s add it to our base layer:

ggplot(d, aes(x = VOT, y = respond_t)) +
  stat_summary(fun = mean, geom = "point")
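To see what stat_summary(fun = mean) is computing behind the scenes, here is a rough base R equivalent on made-up data. The data frame toy is hypothetical; the point is only that the mean of a 0/1 column within each VOT value is the proportion of 1s.

```r
# Hypothetical toy responses (1 = "tent", 0 = "dent"), not the lab data
toy <- data.frame(
  VOT = c(10, 10, 10, 10, 50, 50, 50, 50),
  respond_t = c(0, 0, 0, 1, 1, 1, 0, 1)
)

# The mean of a 0/1 column is the proportion of 1s,
# computed here separately for each VOT value
prop_t <- aggregate(respond_t ~ VOT, data = toy, FUN = mean)
prop_t
```

Each row of prop_t corresponds to one point in the plot: the x position is the VOT value and the y position is the proportion of /t/ responses.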

To improve this visualization, we may want to add error bars to our points! There are two built-in geoms for just this purpose: geom_pointrange() and geom_errorbar(). Since we are already using points, let’s use geom_pointrange(). To use it with stat_summary(), we need a function that computes both the mean and a measure of uncertainty. I like the function mean_cl_boot(), which computes the mean and a 95% confidence interval using a bootstrap method (it relies on the Hmisc package, so that package needs to be installed). Here’s how we would add that to our plot:

ggplot(d, aes(x = VOT, y = respond_t)) +
  stat_summary(fun.data = mean_cl_boot, geom = "pointrange") # notice here I have to use the argument fun.data instead of fun
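If the bootstrap is unfamiliar, the sketch below shows the general idea by hand in base R: resample the responses with replacement many times and take the middle 95% of the resampled means. This is an illustration of the technique on made-up data, not the exact implementation behind mean_cl_boot(); the vector responses and the choice of 1000 resamples are assumptions.

```r
set.seed(1)  # make the resampling reproducible

# Hypothetical 0/1 responses at a single VOT value (not the lab data)
responses <- c(0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1)

# Resample the responses with replacement many times,
# recording the mean of each resample
boot_means <- replicate(
  1000,
  mean(sample(responses, size = length(responses), replace = TRUE))
)

# Percentile interval: the middle 95% of the bootstrapped means
ci <- quantile(boot_means, probs = c(0.025, 0.975))

# The three values a pointrange needs: point, lower end, upper end
c(y = mean(responses), ymin = ci[[1]], ymax = ci[[2]])
```

A function passed to fun.data has to return these three values (y, ymin, ymax) for each x value, which is why the plain mean function cannot be used there.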

Let’s return to our original visualization goal: to plot /t/ responses by both VOT and context word. One good way to do that is to plot points in different colors corresponding to the context word conditions. We can do this by adding another aesthetic, color, to the aes() in our original ggplot() call:

ggplot(d, aes(x = VOT, y = respond_t, color = context)) +
  stat_summary(fun.data = mean_cl_boot, geom = "pointrange") # color = context in aes() above gives each context its own color

Making Your Plots Prettier

The basic plotting defaults are a tad bit ugly. Here are a few issues right off the bat:

The axes are labeled with our variable names, which might not be very understandable for our eventual reader
The default colors are ugly (my own personal opinion lol)
The axis text is a bit small and difficult to read
The default background of gray with white gridlines can make some plots difficult to read

Fortunately, there is a massive selection of functions we can add to our plot to correct these issues!

Geom Colors

There is a family of functions, all starting with scale_color_, that allow us to change the color of our points. Here are a few that I like:

scale_color_manual() allows you to manually enter which colors you would like your points (or other geoms) to be. You can specify colors with RGB values (e.g., via rgb()), hex codes, or the built-in names of R colors. A comprehensive list of R color names can be found here: http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf
scale_color_brewer() uses the ColorBrewer palettes. You have three color scheme options: sequential (gives a gradient from light to dark for one color), qualitative (gives easily distinguishable colors), and diverging (gradient from one color to its opposite). You can find a list of all the palettes here: https://r-graph-gallery.com/38-rcolorbrewers-palettes.html. This is a great option because many of the palettes, including qualitative ones such as Dark2, are designed to be color-blind friendly!

Here’s an example of our plot with the color brewer palette Dark2:

ggplot(d, aes(x = VOT, y = respond_t, color = context)) +
  stat_summary(fun.data = mean_cl_boot, geom = "pointrange") +
  scale_color_brewer(type = "qual", palette = "Dark2")
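For comparison, here is a minimal sketch of the scale_color_manual() option on made-up data. The data frame toy and the two colors chosen are purely illustrative assumptions.

```r
library(ggplot2)

# Hypothetical, already-summarized toy data (not the lab dataset)
toy <- data.frame(
  VOT = c(10, 50, 85, 10, 50, 85),
  respond_t = c(0.1, 0.5, 0.9, 0.2, 0.7, 1.0),
  context = factor(rep(c("dent", "tent"), each = 3))
)

p <- ggplot(toy, aes(x = VOT, y = respond_t, color = context)) +
  geom_point() +
  # A named vector ties each color to a specific factor level;
  # colors can be R color names or hex codes
  scale_color_manual(values = c(dent = "darkorange", tent = "#1B9E77"))
p
```

Naming the elements of values is optional, but it protects against colors silently swapping if the order of the factor levels ever changes.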

Axis & Legend Labeling

The functions xlab() and ylab() allow us to input our own text labels for our axes, like so:

ggplot(d, aes(x = VOT, y = respond_t, color = context)) +
  stat_summary(fun.data = mean_cl_boot, geom = "pointrange") +
  scale_color_brewer(type = "qual", palette = "Dark2") +
  xlab("VOT (ms)") +
  ylab("Proportion /t/ responses")

In the same function I used to create my point colors, I can also use the name and labels arguments to change the name of the legend and the labels, respectively:

ggplot(d, aes(x = VOT, y = respond_t, color = context)) +
  stat_summary(fun.data = mean_cl_boot, geom = "pointrange") +
  scale_color_brewer(type = "qual", palette = "Dark2", name = "Context bias", labels = c("dent-biasing", "tent-biasing")) +
  xlab("VOT (ms)") +
  ylab("Proportion /t/ responses")

Plot Customization with `theme()`

Literally everything you can imagine about your plots is customizable, and much of this is done with the theme() function, which you can add to your ggplot() call. There are also some built-in themes; my favorites are theme_classic() and theme_bw(), which in my opinion are much more readable than the default gray-background plot. Here’s an example of theme_bw():

ggplot(d, aes(x = VOT, y = respond_t, color = context)) +
  stat_summary(fun.data = mean_cl_boot, geom = "pointrange") +
  scale_color_brewer(type = "qual", palette = "Dark2", name = "Context bias", labels = c("dent-biasing", "tent-biasing")) +
  xlab("VOT (ms)") +
  ylab("Proportion /t/ responses") +
  theme_bw()

But the most customizable option is to add your own theme elements. There are approximately 80 million theme objects, which you can view by running the help function on theme (?theme). I find this walkthrough to be quite helpful: https://henrywang.nl/ggplot2-theme-elements-demonstration/.

Here’s how theme objects work on a basic level. Each theme object has its own associated element type. For example, anything that deals with text is specified by an element_text() call, which takes arguments like size, color, etc. Here are the main element types:

element_text: text elements (axis labels, axis values, legend titles, etc.)
element_rect: box elements (like the border of the plot, etc.). Takes arguments like fill, color (outline)
element_line: linear elements (like gridlines in the plot). Takes arguments like color, size (thickness of line; this argument is called linewidth in ggplot2 3.4.0 and later)
element_blank: this will remove an element. E.g. if you don’t want an axis label and you want that space to be taken away completely, you can assign the axis label element to element_blank().

Here are a few theme objects I find myself frequently editing:

axis.text, axis.title, legend.text, legend.title: Text associated with the axes and legend (value labels and title, respectively). You can make these more specific by adding which axis you would like to change (e.g., axis.text.x)
panel.background and plot.background: Outline & fill of the plot. plot.background deals with the entirety of the plot, while panel.background is just the area within the axes of your plot.
panel.grid: Great for changing how obvious or subtle your grid lines are. You can make them lighter so they are less obtrusive (I like the color “grey95”; it’s practically white but still somewhat visible)

ggplot(d, aes(x = VOT, y = respond_t, color = context)) +
  stat_summary(fun.data = mean_cl_boot, geom = "pointrange") +
  scale_color_brewer(type = "qual", palette = "Dark2", name = "Context bias", labels = c("dent-biasing", "tent-biasing")) +
  xlab("VOT (ms)") +
  ylab("Proportion /t/ responses") +
  theme(axis.text = element_text(size = 18),
        axis.title = element_text(size = 18),
        legend.text = element_text(size = 18),
        legend.title = element_text(size = 18),
        panel.grid = element_line(color = "grey95"),
        panel.background = element_rect(color = "black", fill = NA)) # fill = NA makes the panel background transparent, very good for putting on slides with a non-white background color!

Your Turn!

Problem 2: Trial Effects

Let’s say that I want to see how the effect of context bias changes over the course of the experiment. Try to replicate this plot:

Here is the series of steps you’ll need to take:

Plot the mean of respond_t for each Trial using a point geom.
Show a linear fit line (hint: check out the geom_smooth() function and its associated method options).
Change axis & legend labels.
Use the built-in theme theme_classic().

Problem 3: Histograms

Try to replicate this plot:

First, I’m going to remove some RT outliers. Run this line of code to do that, and use d2 as your plotting data:

d2 <- subset(d, RT < 10000) # keeps only trials with RTs under 10000 (10 seconds)
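It is often worth checking how many rows a filter like this drops. The sketch below does that on a made-up data frame; toy and its values are hypothetical, and the same two checks can be run on d and d2.

```r
# Hypothetical toy data with a few extreme RTs (not the lab dataset)
toy <- data.frame(
  subject = 1:4,
  RT = c(4895, 10000, 455273, 6074)
)

# Keep only rows where the condition is TRUE
toy2 <- subset(toy, RT < 10000)

nrow(toy) - nrow(toy2)  # number of trials dropped
max(toy2$RT)            # largest RT that remains
```

Note that the comparison is strict, so a trial with an RT of exactly 10000 is dropped as well.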

Here is the series of steps you’ll need to take:

Create a histogram of RT. Hint: take a look at the ggplot cheat sheet or geom_histogram() help docs
Change the histogram bar colors to a lighter gray and make the outline of the bars black.
Change the plot’s background color to white, remove grid lines, and make a black panel outline. Also, change text to bold (hint: check out the face argument to element_text())

Wednesday Bushong

May 24, 2022
