Transitioning from SPSS to R

Introduction

For many of us, SPSS has been something we’ve used for statistical analysis since early in our experience with statistics. Running tests in SPSS might be second-nature to you, and there are lots of resources online for how to do things you may be unclear about. The comfort of SPSS can make it daunting to explore other options; however, SPSS has limitations, and you may have stumbled upon an interesting statistical test that can be run in R and not in SPSS. Alternatively, you may have decided that SPSS licenses are very expensive and inconvenient, and you’re interested in other options. I have gradually been transitioning from SPSS to R, so I’m hoping to share some of what I’ve learned that might ease your transition.

In my opinion, the biggest shift one makes in the move from SPSS to R is a change in how you think about your data and your analysis. If you are used to using the SPSS to GUI and clicky navigation to move around and explore your data, you’ll notice that won’t be as possible in R. Once you get over this hurdle, the flexibility and power of R is sure to win you over!

Packages

SPSS is a one-piece statistical package. All the functionalities and features available to you are included in the installation. R has many useful base functions, but there are tons of packages available from CRAN that allow you to do more things or do them better. To install a package from CRAN, use the function install.packages: for example, install.packages("ggplot2"). In order to use packages, you need to load them into R, which makes their functionalities available. Do this using library. Here are the packages we’ll be using today.

library(psych)
library(tidyverse)
library(ggplot2)
library(haven)
library(car)
library(Hmisc)
library(moments)
library(flextable)
library(broom)
select <- dplyr::select
summarize <- dplyr::summarize
path <- getwd()

R Basics

In R, a single ‘line’ of data is called a vector - a row or column could be a vector. The most common higher dimensions of data you’ll use are matrices (if they’re all one variable type eg. numeric), lists (bundles of different kinds of information) or data.frames (like the spreadsheets you’d use in Excel or SPSS). These 3 simple base R functions are some of your best friends - you will use them often! Especially c() (combine) - you’ll use this on its own and within functions. To assign a piece of information to a name (eg. to make a vector named “item1”), use the <- operator with the Label on the left and the Contents on the right (by convention). <- and = are not the same thing in R!

# makes a vector
item1 <- c(1, 2, 3, 5)
item2 <- c(3, 6, 7, 2)

# binds these by column - ie. these are variable vectors
cbind(item1, item2)
##      item1 item2
## [1,]     1     3
## [2,]     2     6
## [3,]     3     7
## [4,]     5     2

# binds these by row - ie. these are observation vectors
rbind(item1, item2)
##       [,1] [,2] [,3] [,4]
## item1    1    2    3    5
## item2    3    6    7    2

Setting Up Your Data

We’re going to be using one of the default datasets available in R through ggplot2, diamonds. We’ll start by loading in that data. You would usually need to import your own data. You can do this using read_csv from readr (e.g. my_data <- read_csv(/myfolder/my_data.csv)). A helpful tip is to use path <- getwd() at the beginning of your .R or .Rmd - this saves your current path. As long as you keep your .R or .Rmd in the same place as the files you’re looking for, you can easily input the path of a file at anytime using file.path(path, "filename.csv").

data(diamonds)
diamonds
## # A tibble: 53,940 x 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
##  2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
##  3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
##  4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
##  9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
## 10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39
## # ... with 53,930 more rows

If you’d like to look a the data in R Studio’s GUI, you can do this by typing View(diamonds) into the console. Coming from SPSS, this is probably the view of your data that you’re familiar with. In R, there are also other ways to explore the data.

Navigating Your Data in R

str is a useful base R function that gives an overview of your data, including the number of observations and variables, the classes of your data frame (more on this later), and previews the data by showing you each variable (and its class) and the first few observations. glimpse is a similar function from dplyr - the major difference here is it just shows the data a bit more cleanly (shows exactly as many observations as will fit in your current window size). head previews the first 6 rows of your data, and tail previews the last 6 rows. You can also customize the number of rows to display: e.g. head(diamonds, 10). names will output the names of your columns, which are usually your variables.

str(diamonds)
## Classes 'tbl_df', 'tbl' and 'data.frame':    53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
head(diamonds)
## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
tail(diamonds)
## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.72 Premium   D     SI1      62.7    59  2757  5.69  5.73  3.58
## 2  0.72 Ideal     D     SI1      60.8    57  2757  5.75  5.76  3.5 
## 3  0.72 Good      D     SI1      63.1    55  2757  5.69  5.75  3.61
## 4  0.7  Very Good D     SI1      62.8    60  2757  5.66  5.68  3.56
## 5  0.86 Premium   H     SI2      61      58  2757  6.15  6.12  3.74
## 6  0.75 Ideal     D     SI2      62.2    55  2757  5.83  5.87  3.64
names(diamonds)
##  [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"   "price"  
##  [8] "x"       "y"       "z"
glimpse(diamonds)
## Observations: 53,940
## Variables: 10
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, ...
## $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very G...
## $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, ...
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI...
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, ...
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54...
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339,...
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, ...
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, ...
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, ...

There are also multiple ways to call specific rows (observations) and columns (variables). You can call row/column indices in square brackets (e.g. dataframe[row,column]) or using a $ and the variable name (dataframe$variable). You always need to tell R where to find a particular variable or observation. I’ve wrapped these all in head for convenience, because 54,000 is a lot of diamonds!

# 7th column (variable)
head(diamonds[,7])
## # A tibble: 6 x 1
##   price
##   <int>
## 1   326
## 2   326
## 3   327
## 4   334
## 5   335
## 6   336

# Also the 7th column - defaults to column if there's no comma!
head(diamonds[7])
## # A tibble: 6 x 1
##   price
##   <int>
## 1   326
## 2   326
## 3   327
## 4   334
## 5   335
## 6   336

# 7th row (observation)
head(diamonds[7,])
## # A tibble: 1 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47

# Value of the 3rd variable for the 50th observation
head(diamonds[50, 3])
## # A tibble: 1 x 1
##   color
##   <ord>
## 1 H

# List of all the price values
head(diamonds$price)
## [1] 326 326 327 334 335 336

dplyr also gives us great tools for selecting particular rows and variables.

# Grab the first 20 rows of diamonds for convenience
short_diamonds <- diamonds[1:20,]

# Select only the physical dimensions
short_diamonds %>% select(x, y, z)
## # A tibble: 20 x 3
##        x     y     z
##    <dbl> <dbl> <dbl>
##  1  3.95  3.98  2.43
##  2  3.89  3.84  2.31
##  3  4.05  4.07  2.31
##  4  4.2   4.23  2.63
##  5  4.34  4.35  2.75
##  6  3.94  3.96  2.48
##  7  3.95  3.98  2.47
##  8  4.07  4.11  2.53
##  9  3.87  3.78  2.49
## 10  4     4.05  2.39
## 11  4.25  4.28  2.73
## 12  3.93  3.9   2.46
## 13  3.88  3.84  2.33
## 14  4.35  4.37  2.71
## 15  3.79  3.75  2.27
## 16  4.38  4.42  2.68
## 17  4.31  4.34  2.68
## 18  4.23  4.29  2.7 
## 19  4.23  4.26  2.71
## 20  4.21  4.27  2.66

# Select everything but the physical dimensions
short_diamonds %>% select(carat:price)
## # A tibble: 20 x 7
##    carat cut       color clarity depth table price
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int>
##  1 0.23  Ideal     E     SI2      61.5    55   326
##  2 0.21  Premium   E     SI1      59.8    61   326
##  3 0.23  Good      E     VS1      56.9    65   327
##  4 0.290 Premium   I     VS2      62.4    58   334
##  5 0.31  Good      J     SI2      63.3    58   335
##  6 0.24  Very Good J     VVS2     62.8    57   336
##  7 0.24  Very Good I     VVS1     62.3    57   336
##  8 0.26  Very Good H     SI1      61.9    55   337
##  9 0.22  Fair      E     VS2      65.1    61   337
## 10 0.23  Very Good H     VS1      59.4    61   338
## 11 0.3   Good      J     SI1      64      55   339
## 12 0.23  Ideal     J     VS1      62.8    56   340
## 13 0.22  Premium   F     SI1      60.4    61   342
## 14 0.31  Ideal     J     SI2      62.2    54   344
## 15 0.2   Premium   E     SI2      60.2    62   345
## 16 0.32  Premium   E     I1       60.9    58   345
## 17 0.3   Ideal     I     SI2      62      54   348
## 18 0.3   Good      J     SI1      63.4    54   351
## 19 0.3   Good      J     SI1      63.8    56   351
## 20 0.3   Very Good J     SI1      62.7    59   351

## OR
short_diamonds %>% select(-x, -y, -z)
## # A tibble: 20 x 7
##    carat cut       color clarity depth table price
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int>
##  1 0.23  Ideal     E     SI2      61.5    55   326
##  2 0.21  Premium   E     SI1      59.8    61   326
##  3 0.23  Good      E     VS1      56.9    65   327
##  4 0.290 Premium   I     VS2      62.4    58   334
##  5 0.31  Good      J     SI2      63.3    58   335
##  6 0.24  Very Good J     VVS2     62.8    57   336
##  7 0.24  Very Good I     VVS1     62.3    57   336
##  8 0.26  Very Good H     SI1      61.9    55   337
##  9 0.22  Fair      E     VS2      65.1    61   337
## 10 0.23  Very Good H     VS1      59.4    61   338
## 11 0.3   Good      J     SI1      64      55   339
## 12 0.23  Ideal     J     VS1      62.8    56   340
## 13 0.22  Premium   F     SI1      60.4    61   342
## 14 0.31  Ideal     J     SI2      62.2    54   344
## 15 0.2   Premium   E     SI2      60.2    62   345
## 16 0.32  Premium   E     I1       60.9    58   345
## 17 0.3   Ideal     I     SI2      62      54   348
## 18 0.3   Good      J     SI1      63.4    54   351
## 19 0.3   Good      J     SI1      63.8    56   351
## 20 0.3   Very Good J     SI1      62.7    59   351

# Filter to get only observations with a Premium cut
short_diamonds %>% filter(cut == "Premium")
## # A tibble: 5 x 10
##   carat cut     color clarity depth table price     x     y     z
##   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.21  Premium E     SI1      59.8    61   326  3.89  3.84  2.31
## 2 0.290 Premium I     VS2      62.4    58   334  4.2   4.23  2.63
## 3 0.22  Premium F     SI1      60.4    61   342  3.88  3.84  2.33
## 4 0.2   Premium E     SI2      60.2    62   345  3.79  3.75  2.27
## 5 0.32  Premium E     I1       60.9    58   345  4.38  4.42  2.68

# Combine filter and select to see only the physical dimensions
# of the observations with a Premium cut
short_diamonds %>% filter(cut == "Premium") %>%
  select(-(carat:price))
## # A tibble: 5 x 3
##       x     y     z
##   <dbl> <dbl> <dbl>
## 1  3.89  3.84  2.31
## 2  4.2   4.23  2.63
## 3  3.88  3.84  2.33
## 4  3.79  3.75  2.27
## 5  4.38  4.42  2.68

# Make sure you do this in the right order!
# The following would throw an error because you've already
# selected out 'cut' by the time you asked filter to use it!

# short_diamonds %>% select(-(carat:price)) %>%
#   filter(cut == "Premium")

Exercise 1: Indexing

Can you find 2 different ways to find the price of the 10,000th diamond?

You might also want to sort the rows of your table based on the values of particular variables. You can do this with arrange and desc from dplyr. Sorting by an ordinal factor will sort in order of the levels of the variable. Other variables will sort in alphanumeric order.

# Ascending
short_diamonds %>% arrange(cut)
## # A tibble: 20 x 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
##  2 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
##  3 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  4 0.3   Good      J     SI1      64      55   339  4.25  4.28  2.73
##  5 0.3   Good      J     SI1      63.4    54   351  4.23  4.29  2.7 
##  6 0.3   Good      J     SI1      63.8    56   351  4.23  4.26  2.71
##  7 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  8 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  9 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
## 10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39
## 11 0.3   Very Good J     SI1      62.7    59   351  4.21  4.27  2.66
## 12 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 13 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 14 0.22  Premium   F     SI1      60.4    61   342  3.88  3.84  2.33
## 15 0.2   Premium   E     SI2      60.2    62   345  3.79  3.75  2.27
## 16 0.32  Premium   E     I1       60.9    58   345  4.38  4.42  2.68
## 17 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 18 0.23  Ideal     J     VS1      62.8    56   340  3.93  3.9   2.46
## 19 0.31  Ideal     J     SI2      62.2    54   344  4.35  4.37  2.71
## 20 0.3   Ideal     I     SI2      62      54   348  4.31  4.34  2.68

# Descending
short_diamonds %>% arrange(desc(cut))
## # A tibble: 20 x 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
##  2 0.23  Ideal     J     VS1      62.8    56   340  3.93  3.9   2.46
##  3 0.31  Ideal     J     SI2      62.2    54   344  4.35  4.37  2.71
##  4 0.3   Ideal     I     SI2      62      54   348  4.31  4.34  2.68
##  5 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
##  6 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  7 0.22  Premium   F     SI1      60.4    61   342  3.88  3.84  2.33
##  8 0.2   Premium   E     SI2      60.2    62   345  3.79  3.75  2.27
##  9 0.32  Premium   E     I1       60.9    58   345  4.38  4.42  2.68
## 10 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
## 11 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
## 12 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
## 13 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39
## 14 0.3   Very Good J     SI1      62.7    59   351  4.21  4.27  2.66
## 15 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 16 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 17 0.3   Good      J     SI1      64      55   339  4.25  4.28  2.73
## 18 0.3   Good      J     SI1      63.4    54   351  4.23  4.29  2.7 
## 19 0.3   Good      J     SI1      63.8    56   351  4.23  4.26  2.71
## 20 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49

# Custom Order
short_diamonds %>%
  arrange(match(cut, c("Premium", "Ideal", "Very Good", "Fair", "Good")))
## # A tibble: 20 x 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
##  2 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  3 0.22  Premium   F     SI1      60.4    61   342  3.88  3.84  2.33
##  4 0.2   Premium   E     SI2      60.2    62   345  3.79  3.75  2.27
##  5 0.32  Premium   E     I1       60.9    58   345  4.38  4.42  2.68
##  6 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
##  7 0.23  Ideal     J     VS1      62.8    56   340  3.93  3.9   2.46
##  8 0.31  Ideal     J     SI2      62.2    54   344  4.35  4.37  2.71
##  9 0.3   Ideal     I     SI2      62      54   348  4.31  4.34  2.68
## 10 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
## 11 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
## 12 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
## 13 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39
## 14 0.3   Very Good J     SI1      62.7    59   351  4.21  4.27  2.66
## 15 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
## 16 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 17 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 18 0.3   Good      J     SI1      64      55   339  4.25  4.28  2.73
## 19 0.3   Good      J     SI1      63.4    54   351  4.23  4.29  2.7 
## 20 0.3   Good      J     SI1      63.8    56   351  4.23  4.26  2.71

Looks great… but what is this data trying to show us? If we want to find out what this dataset contains, we can look up the documentation by typing ??diamonds into the console. In the documentation, we find that diamonds is a dataset containing prices and attributes of almost 54,000 diamonds. The variables are:

price: price in US dollars
carat: weight of the diamond
cut: quality of the cut
colour: diamond colour (J being the worst, D being the best)
clarity:ranging from I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)
physical dimensions in mm: x (length), y (width), z (depth)
depth: depth percentage calculated based on physical dimensions
table: width of top of the diamond relative to the widest point (flat part at the top)

Because ggplot2 gave us this data out of the goodness of its heart, it’s very neat! All the variables already have their appropriate classes - and we also saw this when we called str. You can check a particular variable by calling class. Also, we don’t see in my variable summary what the levels of cut are. We can check this by calling levels.

class(diamonds$cut)
## [1] "ordered" "factor"
levels(diamonds$cut)
## [1] "Fair"      "Good"      "Very Good" "Premium"   "Ideal"

We see that cut is an ordered factor, and we know based on our knowledge of the data that diamond cut is an ordinal variable, so this is what we expect. The levels of the data seem correct, with no obvious errors, and they appear to go in a logical order.

Intro to Wrangling

When you are loading in your own data, it rarely will be so neat! It’s important that all variables are classed correctly in order for appropriate calculation and visualization down the line. In SPSS, you would make these kinds of changes in Variable View. In order to play in parallel between R and SPSS today, I’m going to start by outputting our diamonds data to a .csv.

write_csv(diamonds, file.path(path, "diamonds.csv"))

When I import this into SPSS straight from the .csv, here’s what I get:

We know that all our categorical variables are ordinal in this dataset, so we change those by clicking in the Measure column.

You’ll notice the Values column is empty - this is important because some tests in SPSS will not allow variables with strings, so you need to set up key-value pairs. For example, for cut, you could assign a number of 1 to Fair, 2 to Good, etc. You can do this manually in SPSS - but R can help us out! We’ll come back to this.

When you first import data into R, variables tend to be chr - character (string) if they have any letters, or num, int, or dbl (numerical, integer, or double) if they are all numbers. You will need to change your variables’ classes to fit what you know them to be. Categorical variables in R are called factors, and these can be unordered (nominal) or ordered (ordinal).

In this block of code, I’m going to create a messy version of diamonds so we can practice cleaning it up. If you’re looking at the .Rmd, no peeking! It’ll give away how you fix it.

Take a look at messy_diamonds with str. What’s wrong with it?

depth is a character vector, but should be numeric
color is a character vector, but should be an ordered factor
price is factored, but it should be integer

## DEPTH
# Try putting depth back to numeric by coercing it with as.numeric
head(as.numeric(messy_diamonds$depth))
## [1] 61.5 59.8 56.9 62.4 63.3 62.8

# The warning 'NAs introduced by coercion' indicates that there's a 
# non numerical value somewhere - let's find it to understand 
# what's wrong before we erase it!
# I'm asking R to tell me which value is NA when I coerced it
which(is.na(as.numeric(messy_diamonds$depth)))
## [1] 90

# Let's look at what the depth of diamond #90 is
messy_diamonds[90,5]
## # A tibble: 1 x 1
##   depth
##   <chr>
## 1 a

# Oops! Looks like a data-entry error. If we had another place the data
# was entered (such as an original output document or paper questionnaire), 
# we could check it and assign the correct value. 
# I'm going to cheat and use the original diamonds to find out!
diamonds[90,5]
## # A tibble: 1 x 1
##   depth
##   <dbl>
## 1  62.9

# Originally, this value was 62.9 - we'll go ahead and assign that now.
messy_diamonds[90,5] <- 62.9

# If you didn't have a way to check the original value, you probably
# have to leave it as a missing value. R will automatically 
# turn this character into a NA - which is what the warning was telling us.

# Let's go ahead and coerce depth to numeric now that we've checked it.
messy_diamonds$depth <- as.numeric(messy_diamonds$depth)

## COLOR
# Color should be a factor! It needs to be ordered. If this were our data, 
# we'd know what the order should be. In this case, let's check 
# the original diamonds to find out. 

levels(diamonds$color)
## [1] "D" "E" "F" "G" "H" "I" "J"

messy_diamonds$color <- factor(messy_diamonds$color, ordered = TRUE, levels = c("D", "E", "F", "G", "H", "I", "J"))

## PRICE
# Price should be integer, but it's a factor right now.
# If we were to call as.integer right away, it would convert 
# them based on the factor levels! To avoid this, 
# we'll wrap it with as.character first. 
messy_diamonds$price <- as.integer(as.character(messy_diamonds$price))

# No warnings! Let's call str to make sure things look right. 
str(messy_diamonds)
## Classes 'tbl_df', 'tbl' and 'data.frame':    53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
str(diamonds)
## Classes 'tbl_df', 'tbl' and 'data.frame':    53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

Exercise 2: Cleaning

No peeking on the Rmd!

I’ve done the same thing again - messed up a perfectly good dataset and called it cars. This comes from the dataset mtcars - you can search its documentation by typing ?mtcars into the console. Make sure the data ends up clean based on your expectations from the documentation.

Now that we’ve gotten things cleaned up, we’re ready to go! Remember how I said that we’d come back to an easier way to get data set up in SPSS? The package haven is our best friend. Using write_sav, we can create an SPSS data file (.sav) with variable types and values set-up nicely!

write_sav(diamonds, file.path(path, "diamonds.sav"))

Take a look at the Variable View of diamonds.sav in SPSS.

The categorical values now have numerical values corresponding to their original factor levels, and everything is appropriately classed! Take a look at the Data View for the two files to compare.

Straight from the CSV:

Through Haven:

This is great, because many of us don’t immediately abandon our old tools when we pick up new ones. You might still want to use SPSS for some things, but R can still help you manage your data. Alternatively, it’s helpful if you have collaborators who use SPSS.

More Tips for Setting Up Your Data

The ‘Missing’ column in Variable View of SPSS is a helpful way to quickly identify variables with missing data. In R, you can quickly find out if there are any missing data points using is.na and a couple friends.

testvariable <- c(1, 2, 3, 5, 3, 1, 3, 3, 4, 5, 6, NA, 3, 3, 4, 7, 8, 1, 3, NA, 3, 4, 6)
any(is.na(testvariable))
## [1] TRUE
which(is.na(testvariable))
## [1] 12 20

You can use Values (key-value pairs) and Labels in SPSS to store information about your variables. In R, you can give your values names (similar to key-value pairs), and you can make a data dictionary to store detailed information that you want access to. dataMeta provides some helpful tools for making data dictionaries, but you can scrape by with base R. Here are some simple examples of ways to label data and store additional information about variables. You could of course import these by reading in a .csv (read_csv through readr is my go-to) or other spreadsheet - you don’t have to type these things directly into R.

var1 <- c(1, 2, 3, 3, 4, 5, 3, 2, 1, 2)
var2 <- c(4, 5, 3, 5, 5, 4, 2, 1, 2, 3)
var3 <- c(1, 1, 2, 4, 5, 3, 3, 3, 2, 2)
df <- data.frame(var1, var2, var3)

# Labeling individual values
df$var1 <- factor(var1, levels = c("1", "2", "3", "4", "5"), labels = c("Poor", "Fair", "Good", "Very Good", "Excellent"), ordered = TRUE)

## All 3 of these questions have the same levels/labels so for convenience
## we'll define vectors that store this info once
q_labels <- c("Poor", "Fair", "Good", "Very Good", "Excellent")
q_levels <- c("1", "2", "3", "4", "5")

df$var1 <- factor(var1, levels = q_levels, labels = q_labels, ordered = TRUE)
df$var2 <- factor(var2, levels = q_levels, labels = q_labels, ordered = TRUE)
df$var3 <- factor(var3, levels = q_levels, labels = q_labels, ordered = TRUE)

## Note: if you use an ordered factor but want to access numerical 
## features (ie. to calculate things like mean, min or max) you can
## wrap it in as.numeric - does not matter if factor is labeled as characters
mean(as.numeric(df$var1))
## [1] 2.6

# Naming observations can be helpful to keep track
rownames(df) <- c("PD01", "YC01", "PD02", "PD03", "PD04", "YC02", "OC01", "OC02", "YC03", "OC03")

# Renaming the variables
colnames(df) <- c("delivery", "value", "meal")

# If you want to put these into their own column (helpful if tidying):
rownames_to_column(df)
##    rowname  delivery     value      meal
## 1     PD01      Poor Very Good      Poor
## 2     YC01      Fair Excellent      Poor
## 3     PD02      Good      Good      Fair
## 4     PD03      Good Excellent Very Good
## 5     PD04 Very Good Excellent Excellent
## 6     YC02 Excellent Very Good      Good
## 7     OC01      Good      Fair      Good
## 8     OC02      Fair      Poor      Good
## 9     YC03      Poor      Fair      Fair
## 10    OC03      Fair      Good      Fair

# Quick and dirty data dictionary with question details and levels
## Because the levels are consistent, I've only specified them once
## and R will multiply it to fill the dataframe. If you have different 
## details for each question, make sure you input them all separately. 
## This is a good thing to do in your data entry/data prep. 
question <- c("How was your delivery?", "Value for cost?", "How was your meal?")
level <- c("1 = Poor, 2 = Fair, 3 = Good, 4 = Very Good, 5  = Excellent")
data_dict <- data.frame(colnames(df), question, level)

Importing, cleaning, and wrangling your data is a huge topic, but there are tons of resources to help you with this!

Running Tests

What sample to do??

Running statistical tests in R is a bit different from running them in SPSS, but thankfully Google is your best friend here! Let’s work through a sample analysis on our diamonds data.

Let’s say I wanted to do a t-test to compare the average prices of Ideal and Premium diamonds.

To set this up in SPSS, I would select an Indepedendent Samples t-test from the Compare Means menu. This is what my setup window would look like - I know that Ideal and Premium are coded as “4” and “5” by clicking on the Values in Data View.

And this would be my output:

In R, to run this t-test I’m going to use the function t.test. However, SPSS runs Levene’s test to check for homogeneity of variance as part of the t-test. We’ll have to do that ourselves, using leveneTest from the car package. We’ll also need to filter in order to just have Ideal and Premium diamonds.

Seeing that Levene’s test is significant, our assumption of homogeneity of variance is not upheld, and we should use a Welch t-test, rather than a Student’s t-test. This is the default in R’s t.test.

diamonds_subset <- diamonds %>% filter(cut == "Ideal" | cut == "Premium")

leveneTest(diamonds_subset$price, diamonds_subset$cut, center = mean)
## Levene's Test for Homogeneity of Variance (center = mean)
##          Df F value Pr(>F)    
## group     1     350 <2e-16 ***
##       35340                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Welch - the one we're using
t.test(diamonds_subset$price ~ diamonds_subset$cut, var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  diamonds_subset$price by diamonds_subset$cut
## t = 24.9, df = 26600, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1038.1 1215.3
## sample estimates:
## mean in group Premium   mean in group Ideal 
##                4584.3                3457.5

# Student - if Levene's hadn't been true
t.test(diamonds_subset$price ~ diamonds_subset$cut, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  diamonds_subset$price by diamonds_subset$cut
## t = 25.7, df = 35300, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1040.6 1212.8
## sample estimates:
## mean in group Premium   mean in group Ideal 
##                4584.3                3457.5

The output from our t-test includes almost all of the same information as the SPSS output. All we’re missing is mean difference and the standard error of the difference. Mean difference is relatively easy to calculate - standard error of the difference is a bit more complex but here it is! Once again, these funky formulas are coming from the SPSS algorithm manual. In most cases, you won’t need to get a specific value that comes from the SPSS output - but this is just to provide an example of how to match things up.

# Mean difference - the same for Student and Welch tests 
ideal <- filter(diamonds, diamonds$cut == "Ideal")
premium <- filter(diamonds, diamonds$cut == "Premium")
meandiff <- abs(mean(ideal$price) - mean(premium$price))

n_ideal <- sum(!is.na(ideal$price))
n_premium <- sum(!is.na(premium$price))

# Standard error of the difference: unpooled - for Welch's t-test 
SE_diff_unpooled <- sqrt((var(ideal$price)/n_ideal) + (var(premium$price)/n_premium))

# Standard error of the difference: pooled variance - for Student's t-test
pooled_variance <- ((n_ideal-1)*var(ideal$price) + (n_premium-1)*var(premium$price))/(n_ideal+n_premium - 2)
SE_diff_pooled <- sqrt(pooled_variance)*sqrt(1/n_ideal+1/n_premium)

Creating Summary Documents

Once you’ve finished your beautiful work in R, you need a way to output your documents! A few ways you can do this are:

Spreadsheets: e.g. outputting to a .csv with readr::write_csv - Syntax: write_csv(dataFrame, file.path(path, “results.csv”))
Images: e.g. outputting a plot to a .png with ggplot2::ggsave - Syntax: ggsave(plotName, file = “plot.png”)
R Markdown: Knitting to HTML, PDF, or Word

My current favourite package for formatting summary documents in R Markdown is flextable, which allows you to output to HTML, PDF and Word with ease. It also offers significant customizability. If you don’t use Word documents, kableExtra is another fabulous option, and a bit simpler!

The first thing you might want to include in a summary document is some descriptives. Calculating them is easy to do using dplyr’s summarize. Here’s a quick sample - there’s more on this in the Explore section of the Resources. Once you’ve got your summary, you can use flextable to prep it for output. flextable takes a data.frame as its input.

diamonds %>%
  summarize(n = n(), n_missing = sum(is.na(price)), mean = mean(price), sd = sd(price), min = min(price), max = max(price)) %>%
  mutate(stderror = sd/sqrt(n)) %>%
  mutate(confint_lower = mean-2*stderror) %>%
  mutate(confint_upper = mean+2*stderror)
## # A tibble: 1 x 9
##       n n_missing  mean    sd   min   max stderror confint_lower
##   <int>     <int> <dbl> <dbl> <dbl> <dbl>    <dbl>         <dbl>
## 1 53940         0 3933. 3989.   326 18823     17.2         3898.
## # ... with 1 more variable: confint_upper <dbl>

summary <- diamonds %>%
  summarise_each(vars = c(carat, depth, table, price, x, y, z), funs(mean, sd, min, max, median, IQR)) 

sum_ft <- flextable(summary)

# Subset tables by stat
flextable(select(summary, contains("mean")))

carat_mean	depth_mean	table_mean	price_mean	x_mean	y_mean	z_mean
0.798	61.749	57.457	3932.800	5.731	5.735	3.539

flextable(select(summary, contains("sd")))

carat_sd	depth_sd	table_sd	price_sd	x_sd	y_sd	z_sd
0.474	1.433	2.234	3989.440	1.122	1.142	0.706


# Subset tables by variable
flextable(select(summary, contains("price")))

price_mean	price_sd	price_min	price_max	price_median	price_IQR
3932.800	3989.440	326.000	18823.000	2401.000	4374.250


# Playing with formatting
sum_ft <- flextable(summary) %>%
  theme_zebra %>%
  font(part = "header", fontname = "Cambria") %>%
  bold(part = "header") %>%
  color(part = "header", color = "blue") %>%
  fontsize(part = "all", size = 16)

Let’s say we want to output our t-test output from above. Output from statistical tests is often a little cluttered - broom::tidy can clean that up for us.

output <- tidy(t.test(diamonds_subset$price ~ diamonds_subset$cut, var.equal = FALSE)) %>%
  select(-estimate)

# Very basic
flextable(output)

estimate1	estimate2	statistic	p.value	parameter	conf.low	conf.high	method	alternative
4584.258	3457.542	24.918	0.000	26552.158	1038.088	1215.344	Welch Two Sample t-test	two.sided


# How exciting!
out_ft <- flextable(output) %>%
  set_header_labels(estimate1 = "Premium Mean", estimate2 = "Ideal Mean", statistic = "Statistic", p.value = "p", parameter = "df", conf.low = "Conf. Int. Lower", conf.high = "Conf. Int. High", method = "Method", alternative = "Alternative") %>%
  bold(part = "header") %>%
  fontsize(part = "header", size = 16) %>%
  fontsize(part = "body", size = 14) %>%
  bg(part = "body", bg = "#E9E9E9") 

out_ft

Premium Mean	Ideal Mean	Statistic	p	df	Conf. Int. Lower	Conf. Int. High	Method	Alternative
4584.258	3457.542	24.918	0.000	26552.158	1038.088	1215.344	Welch Two Sample t-test	two.sided

Great! So we’ve got a couple tables we’d like to add to our R Markdown file. I think a good beginner strategy for getting easy summary documents is to work in a .Rmd (rather than a .R) and just put all your rough work in a code chunk with include = FALSE. That way, when you knit your document, nothing in your code chunk will be included - only things you call outside (such as with in-line code or with another chunk). Here’s an example of what I mean.

Here’s what the sample below looks like in R Markdown:

Summary Document Sample

The data included a sample of 53940 diamonds. The average price among the data was 3932.79972 USD. Descriptive statistics of diamond prices are below:

N	N Missing	Mean	SD	Minimum	Maximum	Standard Error	Conf. Int. Lower	Conf. Int. Upper
53940	0	3932.800	3989.440	326.000	18823.000	17.177	3898.445	3967.154

I was interested in the relationship between a diamond’s cut and its price at the two highest levels of cut (Ideal and Premium). I ran an independent samples t-test to compare the average price in these two groups. Testing for homogeneity of variance with Levene’s test indicated a F-statistic of 349.572 and a p-value of < .001. As a result, a Welch t-test was conducted - results are below.

Premium Mean	Ideal Mean	Statistic	p	df	Conf. Int. Lower	Conf. Int. Upper	Method	Alternative
4584.258	3457.542	24.918	0.000	26552.158	1038.088	1215.344	Welch Two Sample t-test	two.sided

Exercise 3: Summary Stats

Use what you’ve learned to do the following:

Calculate the mean and SD of price
Calculate the mean and SD of price among diamonds with a cut of Ideal
Turn the second one into a flextable
Bold the body of the table

Tips

Be careful when running things in your console. Code you run from your .R or .Rmd will stay when you save/close it, but work done in the console will disappear when you close R or if your R session ever needs to restart. I usually use the console for experimentation, and then write things that work/that I’m keeping in my .R or .Rmd file.
Keep your .R or .Rmd in the same place as the files you use (Also check out R Projects!). Set up your directory at the beginning of your document by assigning it to a variable: e.g. path <- getwd(). You can then use the function file.path to paste together your directory with a file you’re reading (importing) or writing (exporting): e.g. read_csv(file.path(path, "test.csv")). The advantage here is that file.path will take care of the Mac/PC/Other OS differences in slash defaults.
Assign results of a statistical test to an object (e.g. t_results <- t.test(df$var1 ~ df$var2))) because it makes it easier access individuals elements (using the $) or tidy/export the output.

Resources

If you’re stuck, you can reach out to me - I probably won’t have an answer for you off-hand, but I’m certainly willing to look into things with you! Email: dcushni@uwo.ca

More Samples of R’s Output vs. SPSS Output

Correlation

Bivariate correlation between carat and price:

# Base R - very simple, only gives the r value
cor(diamonds$carat, diamonds$price)
## [1] 0.92159
cor(diamonds$carat, diamonds$price, method = "spearman")
## [1] 0.96288

# from Hmisc - also works on a matrix; includes p-values
rcorr(diamonds$carat, diamonds$price)
##      x    y
## x 1.00 0.92
## y 0.92 1.00
## 
## n= 53940 
## 
## 
## P
##   x  y 
## x     0
## y  0
rcorr(diamonds$carat, diamonds$price, type = "spearman")
##      x    y
## x 1.00 0.96
## y 0.96 1.00
## 
## n= 53940 
## 
## 
## P
##   x  y 
## x     0
## y  0
rcorr(as.matrix(select(diamonds, -cut, -color, -clarity)))
##       carat depth table price     x     y    z
## carat  1.00  0.03  0.18  0.92  0.98  0.95 0.95
## depth  0.03  1.00 -0.30 -0.01 -0.03 -0.03 0.09
## table  0.18 -0.30  1.00  0.13  0.20  0.18 0.15
## price  0.92 -0.01  0.13  1.00  0.88  0.87 0.86
## x      0.98 -0.03  0.20  0.88  1.00  0.97 0.97
## y      0.95 -0.03  0.18  0.87  0.97  1.00 0.95
## z      0.95  0.09  0.15  0.86  0.97  0.95 1.00
## 
## n= 53940 
## 
## 
## P
##       carat  depth  table  price  x      y      z     
## carat        0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## depth 0.0000        0.0000 0.0134 0.0000 0.0000 0.0000
## table 0.0000 0.0000        0.0000 0.0000 0.0000 0.0000
## price 0.0000 0.0134 0.0000        0.0000 0.0000 0.0000
## x     0.0000 0.0000 0.0000 0.0000        0.0000 0.0000
## y     0.0000 0.0000 0.0000 0.0000 0.0000        0.0000
## z     0.0000 0.0000 0.0000 0.0000 0.0000 0.0000

Regression

Predicting price as a function of carat:

price_carat <- lm(diamonds$price ~ diamonds$carat)
summary(price_carat)
## 
## Call:
## lm(formula = diamonds$price ~ diamonds$carat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -18585   -805    -19    537  12732 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -2256.4       13.1    -173   <2e-16 ***
## diamonds$carat   7756.4       14.1     551   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1550 on 53938 degrees of freedom
## Multiple R-squared:  0.849,  Adjusted R-squared:  0.849 
## F-statistic: 3.04e+05 on 1 and 53938 DF,  p-value: <2e-16

ANOVA

Note: If you want your outputted values to match SPSS, you need to use the same contrasts as SPSS. This is discussed here. To change your contrasts, do the following.

options(contrasts = c("contr.sum", "contr.poly"))

One-Way ANOVA comparing mean price between all diamond cuts:

cut_price <- lm(price ~ cut, data = diamonds)
cut_price_aov <- aov(cut_price)
summary(cut_price_aov)
##                Df       Sum Sq    Mean Sq F value Pr(>F)    
## cut             4  11041745359 2760436340     176 <2e-16 ***
## Residuals   53935 847431390159   15712087                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Post-hoc with Bonferroni Correction
pairwise.t.test(diamonds$price, diamonds$cut, p.adj = "bonferroni")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  diamonds$price and diamonds$cut 
## 
##           Fair    Good            Very Good Premium
## Good      0.002   -               -         -      
## Very Good 0.003   1.000           -         -      
## Premium   0.308   < 2e-16         < 2e-16   -      
## Ideal     < 2e-16 0.0000000000006 < 2e-16   < 2e-16
## 
## P value adjustment method: bonferroni

Two-Way ANOVA comparing mean price as a function of diamond cut and clarity:

cut_clarity_price <- lm(price ~ cut*clarity, data = diamonds)
cut_clarity_price_aov <- aov(cut_clarity_price)
summary(cut_clarity_price_aov)
##                Df       Sum Sq    Mean Sq F value Pr(>F)    
## cut             4  11041745359 2760436340  180.16 <2e-16 ***
## clarity         7  18909811326 2701401618  176.30 <2e-16 ***
## cut:clarity    28   2646560835   94520030    6.17 <2e-16 ***
## Residuals   53900 825875017997   15322357                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Post-hoc with Bonferroni correction
# This print-out would be very long - showing you cut element as a sample
cut_clarity_price_posthoc <- TukeyHSD(cut_clarity_price_aov)
cut_clarity_price_posthoc$cut
##                        diff       lwr       upr                p adj
## Good-Fair          -429.893  -736.573  -123.214 0.001240718749947023
## Very Good-Fair     -376.998  -660.282   -93.714 0.002624399706012026
## Premium-Fair        225.500   -55.713   506.713 0.184389469834228326
## Ideal-Fair         -901.216 -1177.085  -625.347 0.000000000000035749
## Very Good-Good       52.895  -127.867   233.658 0.931207257351990947
## Premium-Good        655.393   477.894   832.892 0.000000000000055400
## Ideal-Good         -471.322  -640.228  -302.417 0.000000000000301092
## Premium-Very Good   602.498   469.444   735.552 0.000000000000000000
## Ideal-Very Good    -524.218  -645.571  -402.865 0.000000000000000000
## Ideal-Premium     -1126.716 -1243.151 -1010.280 0.000000000000000000

Explore

Running SPSS Explore on Price (limited screenshots because this is long output!):

Descriptives

Histograms
Boxplots

Note: Kurtosis is calculated differently between SPSS and R. Different kurtosis calculation methods are discussed here.

# Simple investigation of price

diamonds %>%
  summarize(n = n(), n_missing = sum(is.na(price)), mean = mean(price, na.rm = TRUE), sd = sd(price, na.rm = TRUE), min = min(price, na.rm = TRUE), max = max(price, na.rm = TRUE), median = median (price, na.rm = TRUE), IQR = IQR(price, na.rm = TRUE), variance = var(price, na.rm = TRUE), trimmed = mean(price, na.rm = TRUE, trim = 0.05), skewness = skewness(price, na.rm = TRUE), kurtosis = kurtosis(price, na.rm = TRUE)) %>%
  mutate(range = max - min) %>%
  mutate(stderror = sd/sqrt(n)) %>%
  mutate(confint_lower = mean-2*stderror) %>%
  mutate(confint_upper = mean+2*stderror)
## # A tibble: 1 x 16
##       n n_missing  mean    sd   min   max median   IQR variance trimmed
##   <int>     <int> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>    <dbl>   <dbl>
## 1 53940         0 3933. 3989.   326 18823   2401 4374.   1.59e7   3471.
## # ... with 6 more variables: skewness <dbl>, kurtosis <dbl>, range <dbl>,
## #   stderror <dbl>, confint_lower <dbl>, confint_upper <dbl>

hist(diamonds$price)

boxplot(diamonds$price)


# Price grouped by cut

diamonds %>%
  group_by(cut) %>%
  summarize(n = n(), n_missing = sum(is.na(price)), mean = mean(price, na.rm = TRUE), sd = sd(price, na.rm = TRUE), min = min(price, na.rm = TRUE), max = max(price, na.rm = TRUE), median = median (price, na.rm = TRUE), IQR = IQR(price, na.rm = TRUE), variance = var(price, na.rm = TRUE), trimmed = mean(price, na.rm = TRUE, trim = 0.05), skewness = skewness(diamonds$price, na.rm = TRUE), kurtosis = kurtosis(diamonds$price, na.rm = TRUE)) %>%
  mutate(range = max - min) %>%
  mutate(stderror = sd/sqrt(n)) %>%
  mutate(confint_lower = mean-2*stderror) %>%
  mutate(confint_upper = mean+2*stderror)
## # A tibble: 5 x 17
##   cut       n n_missing  mean    sd   min   max median   IQR variance
##   <ord> <int>     <int> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>    <dbl>
## 1 Fair   1610         0 4359. 3560.   337 18574  3282  3155.   1.27e7
## 2 Good   4906         0 3929. 3682.   327 18788  3050. 3883    1.36e7
## 3 Very~ 12082         0 3982. 3936.   336 18818  2648  4461.   1.55e7
## 4 Prem~ 13791         0 4584. 4349.   326 18823  3185  5250    1.89e7
## 5 Ideal 21551         0 3458. 3808.   326 18806  1810  3800.   1.45e7
## # ... with 7 more variables: trimmed <dbl>, skewness <dbl>,
## #   kurtosis <dbl>, range <dbl>, stderror <dbl>, confint_lower <dbl>,
## #   confint_upper <dbl>

ggplot(diamonds, aes(x = price)) + 
  geom_histogram(color = "black", fill = "white") +
  facet_grid(cut ~.)


ggplot(diamonds, aes(x = cut, y = price)) + 
  geom_boxplot(color = "black", fill = "white")

Questionnaires?

Working with questionnaire data offers unique challenges. For those of you who are into that, here are some quick tips on how to manage this kind of data.

We’re going to use the bfi dataset - documentation can tell you lots about it! The bfi documentation actually includes a sample of how to use an item key and score negatively keyed items - so this section is us working through the provided sample. In SPSS, you would negatively key items using Transform.

Briefly, bfi is a set of 25 personality self-report items from 2800 subjects, plus demogrpahics on gender, education, and age. The 25 items are oranized around 5 factors: (A) Agreeableness, (C) Conscientiousness, (E) Extraversion, (N) Neuroticism, (O) Openness. As is common with questionnaire data, some of these items are negatively keyed (ie. high scores on the question represent low levels of the dimension measured). This is easy to do in R! To make a keys.list, define vectors for each factor (ie. Agreeableness) and list the included questions by name. Negatively keyed variables are prefaced with a -.

data(bfi)
glimpse(bfi)
## Observations: 2,800
## Variables: 28
## $ A1        <int> 2, 2, 5, 4, 2, 6, 2, 4, 4, 2, 4, 2, 5, 5, 4, 4, 4, 5...
## $ A2        <int> 4, 4, 4, 4, 3, 6, 5, 3, 3, 5, 4, 5, 5, 5, 5, 3, 6, 5...
## $ A3        <int> 3, 5, 5, 6, 3, 5, 5, 1, 6, 6, 5, 5, 5, 5, 2, 6, 6, 5...
## $ A4        <int> 4, 2, 4, 5, 4, 6, 3, 5, 3, 6, 6, 5, 6, 6, 2, 6, 2, 4...
## $ A5        <int> 4, 5, 4, 5, 5, 5, 5, 1, 3, 5, 5, 5, 4, 6, 1, 3, 5, 5...
## $ C1        <int> 2, 5, 4, 4, 4, 6, 5, 3, 6, 6, 4, 5, 5, 4, 5, 5, 4, 5...
## $ C2        <int> 3, 4, 5, 4, 4, 6, 4, 2, 6, 5, 3, 4, 4, 4, 5, 5, 4, 5...
## $ C3        <int> 3, 4, 4, 3, 5, 6, 4, 4, 3, 6, 5, 5, 3, 4, 5, 5, 4, 5...
## $ C4        <int> 4, 3, 2, 5, 3, 1, 2, 2, 4, 2, 3, 4, 2, 2, 2, 3, 4, 4...
## $ C5        <int> 4, 4, 5, 5, 2, 3, 3, 4, 5, 1, 2, 5, 2, 1, 2, 5, 4, 3...
## $ E1        <int> 3, 1, 2, 5, 2, 2, 4, 3, 5, 2, 1, 3, 3, 2, 3, 1, 1, 2...
## $ E2        <int> 3, 1, 4, 3, 2, 1, 3, 6, 3, 2, 3, 3, 3, 2, 4, 1, 2, 2...
## $ E3        <int> 3, 6, 4, 4, 5, 6, 4, 4, NA, 4, 2, 4, 3, 4, 3, 6, 5, ...
## $ E4        <int> 4, 4, 4, 4, 4, 5, 5, 2, 4, 5, 5, 5, 2, 6, 6, 6, 5, 6...
## $ E5        <int> 4, 3, 5, 4, 5, 6, 5, 1, 3, 5, 4, 4, 4, 5, 5, 4, 5, 6...
## $ N1        <int> 3, 3, 4, 2, 2, 3, 1, 6, 5, 5, 3, 4, 1, 1, 2, 4, 4, 6...
## $ N2        <int> 4, 3, 5, 5, 3, 5, 2, 3, 5, 5, 3, 5, 2, 1, 4, 5, 4, 5...
## $ N3        <int> 2, 3, 4, 2, 4, 2, 2, 2, 2, 5, 4, 3, 2, 1, 2, 4, 4, 5...
## $ N4        <int> 2, 5, 2, 4, 4, 2, 1, 6, 3, 2, 2, 2, 2, 2, 2, 5, 4, 4...
## $ N5        <int> 3, 5, 3, 1, 3, 3, 1, 4, 3, 4, 3, NA, 2, 1, 3, 5, 5, ...
## $ O1        <int> 3, 4, 4, 3, 3, 4, 5, 3, 6, 5, 5, 4, 4, 5, 5, 6, 5, 5...
## $ O2        <int> 6, 2, 2, 3, 3, 3, 2, 2, 6, 1, 3, 6, 2, 3, 2, 6, 1, 1...
## $ O3        <int> 3, 4, 5, 4, 4, 5, 5, 4, 6, 5, 5, 4, 4, 4, 5, 6, 5, 4...
## $ O4        <int> 4, 3, 5, 3, 3, 6, 6, 5, 6, 5, 6, 5, 5, 4, 5, 3, 6, 5...
## $ O5        <int> 3, 3, 2, 5, 3, 1, 1, 3, 1, 2, 3, 4, 2, 4, 5, 2, 3, 4...
## $ gender    <int> 1, 2, 2, 2, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1...
## $ education <int> NA, NA, NA, NA, NA, 3, NA, 2, 1, NA, 1, NA, NA, NA, ...
## $ age       <int> 16, 18, 17, 17, 17, 21, 18, 19, 19, 17, 21, 16, 16, ...

keys.list <- list(agree=c("-A1","A2","A3","A4","A5"),conscientious=c("C1","C2","C3","-C4","-C5"),
extraversion=c("-E1","-E2","E3","E4","E5"),neuroticism=c("N1","N2","N3","N4","N5"),
openness = c("O1","-O2","O3","O4","-O5")) 

# scoreItems gives a ton of info about items! 
scores <- scoreItems(keys.list,bfi,min=1,max=6) # Need to specify the minimum and maximum values
scores
## Call: scoreItems(keys = keys.list, items = bfi, min = 1, max = 6)
## 
## (Unstandardized) Alpha:
##       agree conscientious extraversion neuroticism openness
## alpha   0.7          0.72         0.76        0.81      0.6
## 
## Standard errors of unstandardized Alpha:
##       agree conscientious extraversion neuroticism openness
## ASE   0.014         0.014        0.013       0.011    0.017
## 
## Average item correlation:
##           agree conscientious extraversion neuroticism openness
## average.r  0.32          0.34         0.39        0.46     0.23
## 
## Median item correlation:
##         agree conscientious  extraversion   neuroticism      openness 
##          0.34          0.34          0.38          0.41          0.22 
## 
##  Guttman 6* reliability: 
##          agree conscientious extraversion neuroticism openness
## Lambda.6   0.7          0.72         0.76        0.81      0.6
## 
## Signal/Noise based upon av.r : 
##              agree conscientious extraversion neuroticism openness
## Signal/Noise   2.3           2.6          3.2         4.3      1.5
## 
## Scale intercorrelations corrected for attenuation 
##  raw correlations below the diagonal, alpha on the diagonal 
##  corrected correlations above the diagonal:
##               agree conscientious extraversion neuroticism openness
## agree          0.70          0.36         0.63      -0.245     0.23
## conscientious  0.26          0.72         0.35      -0.305     0.30
## extraversion   0.46          0.26         0.76      -0.284     0.32
## neuroticism   -0.18         -0.23        -0.22       0.812    -0.12
## openness       0.15          0.19         0.22      -0.086     0.60
## 
##  In order to see the item by scale loadings and frequency counts of the data
##  print with the short option = FALSE

# Call str() on scores to see all the pieces of information you can access
# individually! The *scores* element of scores includes the re-keyed items 
# Summarise from dplyr can help you investigate these scores.  
keyed_scores <- data.frame(scores$scores)

keyed_scores %>%
  summarise_all(mean)
##    agree conscientious extraversion neuroticism openness
## 1 4.6531        4.2694        4.146       3.162    4.591

# If you wanted to group by a variable (such as gender) - remember that 
# we dropped demographics when we used keys.list. We can get those back!
# According to the documentation, males = 1 and females =2
keyed_scores <- cbind(gender = bfi$gender, keyed_scores)

keyed_scores %>%
  group_by(gender) %>%
  summarise_all(mean)
## # A tibble: 2 x 6
##   gender agree conscientious extraversion neuroticism openness
##    <int> <dbl>         <dbl>        <dbl>       <dbl>    <dbl>
## 1      1  4.39          4.14         3.99        2.95     4.66
## 2      2  4.78          4.33         4.22        3.27     4.56

# Using dplyr::select to help you subset to look at individual pieces of the data
open <- bfi %>%
  select(starts_with("O"))

Dummy coding is useful and not exclusive to questionnaire data. This is very simple in R. Let’s say we wanted to look at the relationship between education and openness.

# Factor the variable first mostly to re-label nicely
factored_edu <- factor(bfi$education, labels = c("someHS", "doneHS", "someCollege", "doneCollege", "graduate"), ordered = TRUE, exclude = NA)

edu <- dummy.code(factored_edu)

# Your dummy coded variables are now an island!
# Bind these dummy coded education variables to the openness df
# we made earlier with dplyr::select. 
open_edu <- cbind(edu, open)

Transitioning from SPSS to R - R-Ladies London

Daryn Cushnie-Sparrow

November 29, 2018