Part 1: Data Visualization

Let’s start by loading the tidyverse package:

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.1.1     ✓ dplyr   1.0.5
## ✓ tidyr   1.1.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

For the purposes of this session, we will be using data from Bushong & Jaeger (2019), an experiment on speech perception. In this experiment, participants hear sentences like “I noticed a [?]ent in the fender…”, where the [?] is a sound manipulated, by changing the value of an acoustic variable called VOT (voice onset time), to range from sounding more like /t/, to more like /d/, to somewhere in between. We also manipulate a later word in the sentence to bias towards either a /d/ interpretation (e.g., “fender” should make you more likely to think the earlier word was “dent”) or a /t/ interpretation (e.g., “campgrounds”). Finally, we manipulate how far away this biasing context word appears (“short” = 3 syllables after the target word, “long” = 6-9 syllables after). After listening to the sentence, the participant indicates whether they thought the word was “dent” or “tent” (our key dependent variable). We also record their reaction time on this response.

This figure gives a conceptual overview of the manipulated variables:

I’ve shared the dataset with you as a .RDS file, a special format for storing R data frames. We can load the dataset by using the readRDS() function:

d <- readRDS("data_preprocessed.RDS")

Inspecting Data

The functions View() and head() will be your best friends for taking a quick look at what is in your data frame. head(), for example, shows the first 6 rows of your data frame:

head(d)
##   subject Trial distance context VOT sFrame   RT respond_t
## 1       1     0    short    tent  50     10 6074         1
## 2       1     1     long    tent  50     10 4895         1
## 3       1     2     long    tent  30      4 8908         0
## 4       1     3    short    tent  35      6 5815         0
## 5       1     4     long    dent  50      9 5206         1
## 6       1     5     long    dent  50     10 4771         1

To view an individual column, we can subset our data using the $ operator. For example, to see the first 6 values of the RT column, I would use the command:

head(d$RT)
## [1] 6074 4895 8908 5815 5206 4771

Here are a few other functions I commonly use to inspect data:

  • names() will return the names of all columns in your data frame
  • class() (using a specific column as an input) will tell you what data type a specific column in your data frame is.
  • summary() will give some basic descriptive statistics of your columns
  • unique() will return all unique values of a variable when you give a single data frame column as input

Here’s how the output of each of those looks:

names(d) # get column names
## [1] "subject"   "Trial"     "distance"  "context"   "VOT"       "sFrame"   
## [7] "RT"        "respond_t"
class(d$RT) # what data type is RT?
## [1] "integer"
summary(d)
##     subject           Trial         distance    context          VOT       
##  Min.   :  1.00   Min.   :  0.00   long :5040   dent:5040   Min.   :10.00  
##  1st Qu.: 21.75   1st Qu.: 41.75   short:5040   tent:5040   1st Qu.:30.00  
##  Median : 50.50   Median : 83.50                            Median :37.50  
##  Mean   : 56.13   Mean   : 83.50                            Mean   :41.67  
##  3rd Qu.: 86.25   3rd Qu.:125.25                            3rd Qu.:50.00  
##  Max.   :120.00   Max.   :167.00                            Max.   :85.00  
##      sFrame            RT           respond_t     
##  Min.   : 1.00   Min.   :  3572   Min.   :0.0000  
##  1st Qu.: 5.75   1st Qu.:  4963   1st Qu.:0.0000  
##  Median :10.50   Median :  5426   Median :0.0000  
##  Mean   :10.85   Mean   :  6105   Mean   :0.3254  
##  3rd Qu.:16.25   3rd Qu.:  6050   3rd Qu.:1.0000  
##  Max.   :21.00   Max.   :455273   Max.   :1.0000
unique(d$VOT) # what are the unique values of the VOT variable?
## [1] 50 30 35 40 10 85

Data Visualization using ggplot2

tidyverse contains the library ggplot2 which uses the “grammar of graphics” framework for data visualization. This works essentially as a layering system: we start with a base layer of a ggplot() call, which creates the basic template (at minimum, the data and variables that will be on our x- and y-axes) from which we will work. The first argument of ggplot() will be our data frame d; then, we need to give a second argument called aes() (“aesthetics”) specifying our x and y axes. Let’s say that what we eventually want to do is create a plot showing the proportion of /t/ responses (y axis) by VOT (x axis) and context word (we’ll get to that later). This is how we would create our base layer:

ggplot(d, aes(x = VOT, y = respond_t))

Now what we want to do is add geom objects to our plot, creating our actual visualization. For this example, let’s use geom_point(), which will create a point at each VOT value.

Our data is in its raw form, meaning that our respond_t variable is a bunch of 0’s and 1’s. We want to transform that into a proportion by taking the mean of the column; the ggplot2 function stat_summary() allows us to do just that! We need to give stat_summary() a couple different arguments:

  • fun: the function we want to apply (in this case, mean())
  • geom: the geom we want (in this case, “point”)

Let’s add it to our base layer:

ggplot(d, aes(x = VOT, y = respond_t)) +
  stat_summary(fun = mean, geom = "point")

To make this visualization better, we may want to add error bars to our points! Turns out there are two built-in geoms for just this purpose: geom_pointrange() and geom_errorbar(). Since we are already using points, let’s use geom_pointrange(). To use it in conjunction with stat_summary(), we will need a function that computes both the mean and a measure of uncertainty. I like the function mean_cl_boot(): it computes the mean and a 95% confidence interval using a bootstrap method (note that it relies on the Hmisc package, so you may need to install that first). Here’s how we would add that to our plot:

ggplot(d, aes(x = VOT, y = respond_t)) +
  stat_summary(fun.data = mean_cl_boot, geom = "pointrange") # notice here I have to use the argument fun.data instead of fun

Let’s return to our original visualization goal: to plot /t/ responses by both VOT and context word. One good way to do that would be to plot points in different colors that correspond to the different context word conditions. The way that we can do this is to specify an additional aes() argument in our original ggplot() call: color:

ggplot(d, aes(x = VOT, y = respond_t, color = context)) +
  stat_summary(fun.data = mean_cl_boot, geom = "pointrange")

Making Your Plots Prettier

The basic plotting defaults are a tad bit ugly. Here are a few issues right off the bat:

  • The axes are labeled with our variable names, which might not be very understandable for our eventual reader
  • The default colors are ugly (my own personal opinion lol)
  • The axis text is a bit small and difficult to read
  • The default background of gray with white gridlines can make some plots difficult to read

Fortunately, there is a massive selection of functions we can add to our plot to correct these issues!

Geom Colors

There is a family of functions all starting with scale_color that allow us to change the color of our points. Here are a few that I like:

  • scale_color_manual() allows you to manually enter which colors you would like your points (or other geoms) to be; see the sketch just after this list. You can specify colors with RGB values, hex codes, or the built-in names of R colors. A comprehensive list of R color names can be found here: http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf
  • scale_color_brewer() uses the R Color Brewer system. You have three color scheme options: sequential (a light-to-dark gradient of one color), qualitative (easily distinguishable colors), and diverging (a gradient between two contrasting colors). You can find a list of all the palettes here: https://r-graph-gallery.com/38-rcolorbrewers-palettes.html. This is a great option, especially the qualitative palettes, since many of them are designed to be color-blind friendly!
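
For example, here’s a minimal sketch of scale_color_manual() using two arbitrary built-in R color names (any valid color names, hex codes, or RGB values would work):

ggplot(d, aes(x = VOT, y = respond_t, color = context)) +
  stat_summary(fun.data = mean_cl_boot, geom = "pointrange") +
  scale_color_manual(values = c("steelblue", "firebrick")) # one color per level of context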

Here’s an example of our plot with the Color Brewer palette Dark2:

ggplot(d, aes(x = VOT, y = respond_t, color = context)) +
  stat_summary(fun.data = mean_cl_boot, geom = "pointrange") +
  scale_color_brewer(type = "qual", palette = "Dark2")

Axis & Legend Labeling

The functions xlab() and ylab() allow us to input our own text labels for our axes, like so:

ggplot(d, aes(x = VOT, y = respond_t, color = context)) +
  stat_summary(fun.data = mean_cl_boot, geom = "pointrange") +
  scale_color_brewer(type = "qual", palette = "Dark2") +
  xlab("VOT (ms)") +
  ylab("Proportion /t/ responses")

In the same function I used to create my point colors, I can also use the name and labels arguments to change the name of the legend and the labels, respectively:

ggplot(d, aes(x = VOT, y = respond_t, color = context)) +
  stat_summary(fun.data = mean_cl_boot, geom = "pointrange") +
  scale_color_brewer(type = "qual", palette = "Dark2", name = "Context bias", labels = c("dent-biasing", "tent-biasing")) +
  xlab("VOT (ms)") +
  ylab("Proportion /t/ responses")

Plot Customization with theme()

Literally everything you can imagine about your plots is customizable, and much of this is done with the theme() function you can add to your ggplot(). There are some built-in themes; my favorites are theme_classic() and theme_bw(), which in my opinion are much more readable than the default gray-background plot. Here’s an example of theme_bw():

ggplot(d, aes(x = VOT, y = respond_t, color = context)) +
  stat_summary(fun.data = mean_cl_boot, geom = "pointrange") +
  scale_color_brewer(type = "qual", palette = "Dark2", name = "Context bias", labels = c("dent-biasing", "tent-biasing")) +
  xlab("VOT (ms)") +
  ylab("Proportion /t/ responses") +
  theme_bw()

But the most customizable option is to add your own theme elements. There are approximately 80 million theme objects, which you can view by running the help function on theme (?theme). I find this walkthrough to be quite helpful: https://henrywang.nl/ggplot2-theme-elements-demonstration/.

Here’s how theme objects work on a basic level. Each theme object has its own associated element type – for example, anything that deals with text will be specified by an element_text() function call which takes arguments like size, color, etc. Here are all element types:

  • element_text: text elements (axis labels, axis values, legend titles, etc.)
  • element_rect: box elements (like the border of the plot, etc.). Takes arguments like fill, color (outline)
  • element_line: linear elements (like gridlines in the plot). Takes arguments like color, size (thickness of line)
  • element_blank: this will remove an element entirely. E.g., if you don’t want an axis label and you want that space to be taken away completely, you can assign the axis label element to element_blank() (see the short example below).
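
For instance, here’s a minimal sketch (building on the plot we’ve been working with) that uses element_blank() to drop the legend title:

ggplot(d, aes(x = VOT, y = respond_t, color = context)) +
  stat_summary(fun.data = mean_cl_boot, geom = "pointrange") +
  theme(legend.title = element_blank()) # removes the legend title and the space it took up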

Here are a few theme objects I find myself frequently editing:

  • axis.text, axis.title, legend.text, legend.title: Text associated with the axes and legend (value labels and title, respectively). You can make these more specific by adding which axis you would like to change (e.g., axis.text.x)
  • panel.background and plot.background: Outline & fill of the plot. plot.background deals with the entirety of the plot, while panel.background is just the area within the axes of your plot.
  • panel.grid: Great for changing how obvious or subtle your grid lines are. You can make them lighter to make them more unobtrusive (I like the color “grey95”, it’s practically white but still somewhat visible)

Here’s our plot with several of these theme elements adjusted:

ggplot(d, aes(x = VOT, y = respond_t, color = context)) +
  stat_summary(fun.data = mean_cl_boot, geom = "pointrange") +
  scale_color_brewer(type = "qual", palette = "Dark2", name = "Context bias", labels = c("dent-biasing", "tent-biasing")) +
  xlab("VOT (ms)") +
  ylab("Proportion /t/ responses") +
  theme(axis.text = element_text(size = 18),
        axis.title = element_text(size = 18),
        legend.text = element_text(size = 18),
        legend.title = element_text(size = 18),
        panel.grid = element_line(color = "grey95"),
        panel.background = element_rect(color = "black", fill = NA)) # fill = NA makes the panel background transparent, which is very good for putting on slides with a non-white background color!

Your Turn!

Problem 1: Facets

Let’s imagine that I want to see the VOT & context effect broken up by an additional manipulation in the experiment: distance. In this experiment, I manipulated how far away the biasing context word occurred (“short” distance = 3 syllables later, “long” distance = 6-9 syllables later). We can show this additional variable by using facets.

Try to replicate this plot:

Here is the series of steps you’ll need to take:

  1. Use our point plot above as a ‘base’.
  2. Facet the plot by the distance variable (hint: check out the facet_wrap() function)
  3. Change the facet headers to have a white background and black border, and text size 18 to match our axes and legend (hint: look at the strip.background and strip.text theme objects)

Problem 2: Trial Effects

Let’s say that I want to see how the effect of context bias changes over the course of the experiment. Try to replicate this plot:

Here is the series of steps you’ll need to take:

  1. Plot the mean of respond_t for each Trial using a point geom.
  2. Show a linear fit line (hint: check out the geom_smooth() function and its associated method options).
  3. Change axis & legend labels.
  4. Use the built-in theme theme_classic().

Problem 3: Histograms

Try to replicate this plot:

First, I’m going to remove some RT outliers. Run this line of code to do that, and use d2 as your plotting data:

d2 <- subset(d, RT < 10000) # removes all RTs above 10000 (10 seconds)

Here is the series of steps you’ll need to take:

  1. Create a histogram of RT. Hint: take a look at the ggplot cheat sheet or geom_histogram() help docs
  2. Change the histogram bar colors to a lighter gray and make the outline of the bars black.
  3. Change the plot’s background color to white, remove grid lines, and make a black panel outline. Also, change text to bold (hint: check out the face argument to element_text())

Part 2: R Basics & Data Transformation

R Basics

You can type expressions directly into the console for evaluation, like so:

1 + 1
## [1] 2

If you want to save a computation or value, you can assign it to a variable using the <- operator:

x <- 1 + 1

To view the contents of a variable, simply type in its name:

x
## [1] 2

We frequently need to use logical operators to evaluate expressions. Here are the most common ones:

1 == 1 # equal  to
## [1] TRUE
1 != 2 # not equal to
## [1] TRUE
1 == 1 & 1 == 2 # and
## [1] FALSE
1 == 1 | 1 == 2 # INCLUSIVE or
## [1] TRUE
xor(1 == 1, 1 != 2) # EXCLUSIVE or (checks if one and only one is true)
## [1] FALSE

We can create vectors using c() (mnemonic: “combine” or “concatenate”). To index a particular value of a vector, we use square brackets [].

myvec <- c(1, 2, 3, 4)
myvec
## [1] 1 2 3 4
myvec[1]
## [1] 1
myvec[1:3]
## [1] 1 2 3
myvec[c(1, 3)] # return first and third element of vector 
## [1] 1 3
myvec[myvec == 1] # return all of the elements in the vector that have a value of "1" 
## [1] 1

Here are some typical descriptive statistics functions:

mean(myvec)
## [1] 2.5
sd(myvec)
## [1] 1.290994
range(myvec)
## [1] 1 4
min(myvec)
## [1] 1
max(myvec)
## [1] 4

Matrices are one step up in complexity from vectors in that we can have multiple rows/columns. We index in a similar way, though now we need to refer to both row (first index) and column (second index):

mymat <- matrix(c(1:10, 21:30), ncol = 2)
mymat
##       [,1] [,2]
##  [1,]    1   21
##  [2,]    2   22
##  [3,]    3   23
##  [4,]    4   24
##  [5,]    5   25
##  [6,]    6   26
##  [7,]    7   27
##  [8,]    8   28
##  [9,]    9   29
## [10,]   10   30
mymat[3, 2]
## [1] 23
mymat[3, ] # leaving an index blank means "all", so this will return all columns of the third row
## [1]  3 23
mean(mymat[, 2]) # mean of the second column
## [1] 25.5
mymat[mymat[, 1] == 5, ] # all the rows which have a "5" in the first column
## [1]  5 25

Matrices are not great for data analysis, however – all of the columns must have the same data type, whereas it is ideal for us to be able to have a mix of types (character, number, etc.). This is where the data.frame object comes in. These objects are like matrices, but each column can have its own data type. We’ve already seen an example of data frames and basic data frame operations above with our data visualization! Now let’s see more complex data transformation functions…
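
Before we do, here’s a quick illustration of a data frame with a different data type in each column (the values here are just made up):

mydf <- data.frame(subject = c(1, 2, 3),            # numeric column
                   condition = c("a", "b", "a"),    # character column
                   correct = c(TRUE, FALSE, TRUE))  # logical column
mydf$condition # columns are accessed with $, just like with d above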

Basic Data Transformations

Grouping & Summarizing

The tidyverse library contains many useful functions for transforming our data into more functional/usable formats. For example, let’s say that I want to find the mean /t/-responses for each context condition in my experiment. I can do that by using the filter() or subset() functions (they’re more or less interchangeable):

t_context <- subset(d, context == "tent")
mean_t_context <- mean(t_context$respond_t)
d_context <- subset(d, context == "dent")
mean_d_context <- mean(d_context$respond_t)

But this quickly becomes very inefficient! It’s much more efficient to use group_by(), which will group my data frame by any variables I specify. I can then use summarise() to compute summary statistics for each level of the grouping variable:

means_by_context <- d %>% # %>% is known as the pipe operator; it passes the object on its left as the first argument to the function on the next line
  group_by(context) %>%
  summarise(mean_t = mean(respond_t))
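
group_by() also accepts multiple variables at once. As a sketch of the same pattern, here’s how you could get the mean /t/ responses for every combination of context and VOT:

means_by_context_vot <- d %>%
  group_by(context, VOT) %>%           # one group per context x VOT combination
  summarise(mean_t = mean(respond_t),
            n = n())                   # n() counts the rows in each group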

Mutating

In base R, I can add a new column to my data frame by simply creating a new column using the $ operator:

d$RT_zscore <- scale(d$RT) # the scale() function z-scores values

The tidyverse function mutate() performs the same function:

d <- d %>%
  mutate(RT_zscore2 = scale(RT))

The mutate() function is more powerful, however, since you can pair it with group_by(). I don’t have a great example for this particular dataset so this will be a bit contrived, but bear with me. Let’s say that I want to use subjects’ mean reaction times as a predictor of their categorization responses. I don’t want to summarize my data because I want each individual response to be preserved, but I do want to calculate a summary value over a particular grouping variable. I can do this by using group_by(subject) and then creating a new column for each subject’s mean RT:

d <- d %>%
  group_by(subject) %>%
  mutate(mean_rt = mean(RT))

Now I have a summary value but with the same size of dataset as before!
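
One caveat worth knowing: after a grouped mutate(), the data frame stays grouped by subject, which can silently affect later summaries. It’s usually safest to drop the grouping once you’re done:

d <- d %>%
  ungroup() # removes the subject grouping so later operations treat the data frame as a whole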

Your Turn!

Problem 1: Outlier Removal

Create a new data frame which removes all data points where the RT is more than 3 standard deviations above or below the mean.

Problem 2: Plotting Subject Means

Create a new data frame where you compute mean /t/ responses grouped by subject, VOT, and context. Then, recreate our point plot from the beginning of the lab using subject means rather than raw respond_t as the y variable.

Part 3: Linear Regression

Every statistical model in R works in a similar way. We specify a formula of the form:

outcome variable ~ predictor variables

and our dataset. Let’s say, for example, that I simply want to do a t-test predicting RT from distance. We can do this using the t.test() function like so:

myttest <- t.test(RT ~ distance, d)
myttest
## 
##  Welch Two Sample t-test
## 
## data:  RT by distance
## t = 1.3248, df = 9501.8, p-value = 0.1853
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -95.20318 492.23413
## sample estimates:
##  mean in group long mean in group short 
##            6204.288            6005.773

When we get to anything more complex than a t-test, the model object output also becomes more complex. For the most part, we will want to use the summary() function on these model objects; this will provide us with the most relevant statistical information. Let’s look at an example of an ANOVA using the aov() function:

myanova <- aov(RT ~ distance, d)
summary(myanova)
##                Df    Sum Sq  Mean Sq F value Pr(>F)
## distance        1 9.931e+07 99309154   1.755  0.185
## Residuals   10078 5.702e+11 56579196

Linear regression is just as simple to implement! I will give the same formula, but now I use the lm() function (standing for ‘linear model’):

mylinreg <- lm(RT ~ distance,  d)
summary(mylinreg)
## 
## Call:
## lm(formula = RT ~ distance, data = d)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -2434  -1140   -675    -59 449267 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     6204.3      106.0  58.557   <2e-16 ***
## distanceshort   -198.5      149.8  -1.325    0.185    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7522 on 10078 degrees of freedom
## Multiple R-squared:  0.0001741,  Adjusted R-squared:  7.492e-05 
## F-statistic: 1.755 on 1 and 10078 DF,  p-value: 0.1853

To add more predictors, I simply use the + and * operators. + adds another predictor as a main effect, and * adds a main effect and interaction. Here is a linear regression with distance and VOT as main effects with no interactions:

maineffects <- lm(RT ~ VOT + distance,  d)
summary(maineffects)
## 
## Call:
## lm(formula = RT ~ VOT + distance, data = d)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -2479  -1138   -677    -55 449242 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6293.220    172.882  36.402   <2e-16 ***
## VOT             -2.134      3.279  -0.651    0.515    
## distanceshort -198.515    149.844  -1.325    0.185    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7522 on 10077 degrees of freedom
## Multiple R-squared:  0.0002162,  Adjusted R-squared:  1.775e-05 
## F-statistic: 1.089 on 2 and 10077 DF,  p-value: 0.3364

Here is the model with the interaction:

interactionmodel <- lm(RT ~ VOT * distance,  d)
summary(interactionmodel)
## 
## Call:
## lm(formula = RT ~ VOT * distance, data = d)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -2488  -1139   -676    -56 449246 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       6305.1464   220.3501  28.614   <2e-16 ***
## VOT                 -2.4206     4.6368  -0.522    0.602    
## distanceshort     -222.3689   311.6221  -0.714    0.476    
## VOT:distanceshort    0.5725     6.5574   0.087    0.930    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7522 on 10076 degrees of freedom
## Multiple R-squared:  0.0002169,  Adjusted R-squared:  -8.073e-05 
## F-statistic: 0.7288 on 3 and 10076 DF,  p-value: 0.5347

Let’s talk about interpretation for a moment. Remember that we are fitting a model of the form:

y = intercept + beta1 x VOT + beta2 x distance + beta3 x (VOT x distance)

In the presence of an interaction, to interpret each of our coefficients we need to set all of the other predictors to zero. So what does an intercept of 6305 mean? It is the estimated average reaction time when VOT = 0 & distance = 0. In this particular instance, this is a hypothetical value because there are no trials with VOT = 0 in the experiment (the lowest VOT value is 10). And what exactly is distance = 0? Well, R assigns values to categorical variables using what is called “dummy coding” by default. For a categorical variable with 2 levels, the first level alphabetically is assigned 0, and the second is assigned 1. So “distance = long” is “distance = 0”. So, our intercept is the estimated reaction time for distance = long and VOT = 0.
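
You can check the coding R is using directly. Assuming distance is stored as a factor (which the summary() output earlier suggests), contrasts() shows which level gets the 0 and which gets the 1:

contrasts(d$distance) # dummy coding: "long" is the reference level (0), "short" is coded as 1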

The VOT main effect needs to be interpreted in a similar way: with a value of -2.4, this means that the estimated change in RT for each 1-unit increase in VOT is -2.4, specifically in the distance = long condition. Similarly, the distance main effect of -222.3 is the estimated distance effect at VOT = 0.

The interaction coefficient is now a description of how each main effect changes with a unit increase in the other. That is to say, the interaction of .5 means that the effect of VOT is .5 higher in the distance = short condition than the distance = long condition. I.e., the estimate of the VOT effect is -2.4 in the long distance condition, and -1.9 in the short distance condition. Alternatively, we can frame it in terms of the distance effect: the distance effect is -222.3 at VOT = 0, and increases by .5 for every unit increase in VOT. So, for example, the estimated distance effect at VOT = 10 would be -222.3 + (.5 x 10) = -217.3, and so on.
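
As a quick sanity check on that arithmetic, you can pull the fitted coefficients straight out of the model object and recombine them yourself:

b <- coef(interactionmodel)
b["VOT"]                                          # VOT slope when distance = long (the reference level)
b["VOT"] + b["VOT:distanceshort"]                 # VOT slope when distance = short
b["distanceshort"] + b["VOT:distanceshort"] * 10  # estimated distance effect at VOT = 10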

Notice how different this interpretation is from an ANOVA! Next week, we will see how we can re-code variables in order to give them an ANOVA-like interpretation.

Your Turn!

Fit a linear regression predicting RT from distance and context, including the interaction.

  • How would you interpret each coefficient?

  • What is the predicted value for the distance = short and context = dent condition?

  • What is the predicted value for the distance = long and context = tent condition?