Introduction

For many of us, SPSS has been something we’ve used for statistical analysis since early in our experience with statistics. Running tests in SPSS might be second-nature to you, and there are lots of resources online for how to do things you may be unclear about. The comfort of SPSS can make it daunting to explore other options; however, SPSS has limitations, and you may have stumbled upon an interesting statistical test that can be run in R and not in SPSS. Alternatively, you may have decided that SPSS licenses are very expensive and inconvenient, and you’re interested in other options. I have gradually been transitioning from SPSS to R, so I’m hoping to share some of what I’ve learned that might ease your transition.

In my opinion, the biggest shift one makes in the move from SPSS to R is a change in how you think about your data and your analysis. If you are used to using the SPSS to GUI and clicky navigation to move around and explore your data, you’ll notice that won’t be as possible in R. Once you get over this hurdle, the flexibility and power of R is sure to win you over!

Packages

SPSS is a one-piece statistical package. All the functionalities and features available to you are included in the installation. R has many useful base functions, but there are tons of packages available from CRAN that allow you to do more things or do them better. To install a package from CRAN, use the function install.packages: for example, install.packages("ggplot2"). In order to use packages, you need to load them into R, which makes their functionalities available. Do this using library. Here are the packages we’ll be using today.

library(psych)
library(tidyverse)
library(ggplot2)
library(haven)
library(car)
library(Hmisc)
library(moments)
library(flextable)
library(broom)
select <- dplyr::select
summarize <- dplyr::summarize
path <- getwd()

R Basics

In R, a single ‘line’ of data is called a vector - a row or column could be a vector. The most common higher dimensions of data you’ll use are matrices (if they’re all one variable type eg. numeric), lists (bundles of different kinds of information) or data.frames (like the spreadsheets you’d use in Excel or SPSS). These 3 simple base R functions are some of your best friends - you will use them often! Especially c() (combine) - you’ll use this on its own and within functions. To assign a piece of information to a name (eg. to make a vector named “item1”), use the <- operator with the Label on the left and the Contents on the right (by convention). <- and = are not the same thing in R!

# makes a vector
item1 <- c(1, 2, 3, 5)
item2 <- c(3, 6, 7, 2)

# binds these by column - ie. these are variable vectors
cbind(item1, item2)
##      item1 item2
## [1,]     1     3
## [2,]     2     6
## [3,]     3     7
## [4,]     5     2

# binds these by row - ie. these are observation vectors
rbind(item1, item2)
##       [,1] [,2] [,3] [,4]
## item1    1    2    3    5
## item2    3    6    7    2

Setting Up Your Data

We’re going to be using one of the default datasets available in R through ggplot2, diamonds. We’ll start by loading in that data. You would usually need to import your own data. You can do this using read_csv from readr (e.g. my_data <- read_csv(/myfolder/my_data.csv)). A helpful tip is to use path <- getwd() at the beginning of your .R or .Rmd - this saves your current path. As long as you keep your .R or .Rmd in the same place as the files you’re looking for, you can easily input the path of a file at anytime using file.path(path, "filename.csv").

data(diamonds)
diamonds
## # A tibble: 53,940 x 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
##  2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
##  3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
##  4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
##  9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
## 10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39
## # ... with 53,930 more rows

If you’d like to look a the data in R Studio’s GUI, you can do this by typing View(diamonds) into the console. Coming from SPSS, this is probably the view of your data that you’re familiar with. In R, there are also other ways to explore the data.

Intro to Wrangling

When you are loading in your own data, it rarely will be so neat! It’s important that all variables are classed correctly in order for appropriate calculation and visualization down the line. In SPSS, you would make these kinds of changes in Variable View. In order to play in parallel between R and SPSS today, I’m going to start by outputting our diamonds data to a .csv.

write_csv(diamonds, file.path(path, "diamonds.csv"))

When I import this into SPSS straight from the .csv, here’s what I get:

We know that all our categorical variables are ordinal in this dataset, so we change those by clicking in the Measure column.

You’ll notice the Values column is empty - this is important because some tests in SPSS will not allow variables with strings, so you need to set up key-value pairs. For example, for cut, you could assign a number of 1 to Fair, 2 to Good, etc. You can do this manually in SPSS - but R can help us out! We’ll come back to this.

When you first import data into R, variables tend to be chr - character (string) if they have any letters, or num, int, or dbl (numerical, integer, or double) if they are all numbers. You will need to change your variables’ classes to fit what you know them to be. Categorical variables in R are called factors, and these can be unordered (nominal) or ordered (ordinal).

In this block of code, I’m going to create a messy version of diamonds so we can practice cleaning it up. If you’re looking at the .Rmd, no peeking! It’ll give away how you fix it.

Take a look at messy_diamonds with str. What’s wrong with it?

  • depth is a character vector, but should be numeric
  • color is a character vector, but should be an ordered factor
  • price is factored, but it should be integer
## DEPTH
# Try putting depth back to numeric by coercing it with as.numeric
head(as.numeric(messy_diamonds$depth))
## [1] 61.5 59.8 56.9 62.4 63.3 62.8

# The warning 'NAs introduced by coercion' indicates that there's a 
# non numerical value somewhere - let's find it to understand 
# what's wrong before we erase it!
# I'm asking R to tell me which value is NA when I coerced it
which(is.na(as.numeric(messy_diamonds$depth)))
## [1] 90

# Let's look at what the depth of diamond #90 is
messy_diamonds[90,5]
## # A tibble: 1 x 1
##   depth
##   <chr>
## 1 a

# Oops! Looks like a data-entry error. If we had another place the data
# was entered (such as an original output document or paper questionnaire), 
# we could check it and assign the correct value. 
# I'm going to cheat and use the original diamonds to find out!
diamonds[90,5]
## # A tibble: 1 x 1
##   depth
##   <dbl>
## 1  62.9

# Originally, this value was 62.9 - we'll go ahead and assign that now.
messy_diamonds[90,5] <- 62.9

# If you didn't have a way to check the original value, you probably
# have to leave it as a missing value. R will automatically 
# turn this character into a NA - which is what the warning was telling us.

# Let's go ahead and coerce depth to numeric now that we've checked it.
messy_diamonds$depth <- as.numeric(messy_diamonds$depth)

## COLOR
# Color should be a factor! It needs to be ordered. If this were our data, 
# we'd know what the order should be. In this case, let's check 
# the original diamonds to find out. 

levels(diamonds$color)
## [1] "D" "E" "F" "G" "H" "I" "J"

messy_diamonds$color <- factor(messy_diamonds$color, ordered = TRUE, levels = c("D", "E", "F", "G", "H", "I", "J"))

## PRICE
# Price should be integer, but it's a factor right now.
# If we were to call as.integer right away, it would convert 
# them based on the factor levels! To avoid this, 
# we'll wrap it with as.character first. 
messy_diamonds$price <- as.integer(as.character(messy_diamonds$price))

# No warnings! Let's call str to make sure things look right. 
str(messy_diamonds)
## Classes 'tbl_df', 'tbl' and 'data.frame':    53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
str(diamonds)
## Classes 'tbl_df', 'tbl' and 'data.frame':    53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

Exercise 2: Cleaning

No peeking on the Rmd!

I’ve done the same thing again - messed up a perfectly good dataset and called it cars. This comes from the dataset mtcars - you can search its documentation by typing ?mtcars into the console. Make sure the data ends up clean based on your expectations from the documentation.


Now that we’ve gotten things cleaned up, we’re ready to go! Remember how I said that we’d come back to an easier way to get data set up in SPSS? The package haven is our best friend. Using write_sav, we can create an SPSS data file (.sav) with variable types and values set-up nicely!

write_sav(diamonds, file.path(path, "diamonds.sav"))

Take a look at the Variable View of diamonds.sav in SPSS.

The categorical values now have numerical values corresponding to their original factor levels, and everything is appropriately classed! Take a look at the Data View for the two files to compare.

Straight from the CSV:

Through Haven:

This is great, because many of us don’t immediately abandon our old tools when we pick up new ones. You might still want to use SPSS for some things, but R can still help you manage your data. Alternatively, it’s helpful if you have collaborators who use SPSS.

More Tips for Setting Up Your Data

The ‘Missing’ column in Variable View of SPSS is a helpful way to quickly identify variables with missing data. In R, you can quickly find out if there are any missing data points using is.na and a couple friends.

testvariable <- c(1, 2, 3, 5, 3, 1, 3, 3, 4, 5, 6, NA, 3, 3, 4, 7, 8, 1, 3, NA, 3, 4, 6)
any(is.na(testvariable))
## [1] TRUE
which(is.na(testvariable))
## [1] 12 20

You can use Values (key-value pairs) and Labels in SPSS to store information about your variables. In R, you can give your values names (similar to key-value pairs), and you can make a data dictionary to store detailed information that you want access to. dataMeta provides some helpful tools for making data dictionaries, but you can scrape by with base R. Here are some simple examples of ways to label data and store additional information about variables. You could of course import these by reading in a .csv (read_csv through readr is my go-to) or other spreadsheet - you don’t have to type these things directly into R.

var1 <- c(1, 2, 3, 3, 4, 5, 3, 2, 1, 2)
var2 <- c(4, 5, 3, 5, 5, 4, 2, 1, 2, 3)
var3 <- c(1, 1, 2, 4, 5, 3, 3, 3, 2, 2)
df <- data.frame(var1, var2, var3)

# Labeling individual values
df$var1 <- factor(var1, levels = c("1", "2", "3", "4", "5"), labels = c("Poor", "Fair", "Good", "Very Good", "Excellent"), ordered = TRUE)

## All 3 of these questions have the same levels/labels so for convenience
## we'll define vectors that store this info once
q_labels <- c("Poor", "Fair", "Good", "Very Good", "Excellent")
q_levels <- c("1", "2", "3", "4", "5")

df$var1 <- factor(var1, levels = q_levels, labels = q_labels, ordered = TRUE)
df$var2 <- factor(var2, levels = q_levels, labels = q_labels, ordered = TRUE)
df$var3 <- factor(var3, levels = q_levels, labels = q_labels, ordered = TRUE)

## Note: if you use an ordered factor but want to access numerical 
## features (ie. to calculate things like mean, min or max) you can
## wrap it in as.numeric - does not matter if factor is labeled as characters
mean(as.numeric(df$var1))
## [1] 2.6

# Naming observations can be helpful to keep track
rownames(df) <- c("PD01", "YC01", "PD02", "PD03", "PD04", "YC02", "OC01", "OC02", "YC03", "OC03")

# Renaming the variables
colnames(df) <- c("delivery", "value", "meal")

# If you want to put these into their own column (helpful if tidying):
rownames_to_column(df)
##    rowname  delivery     value      meal
## 1     PD01      Poor Very Good      Poor
## 2     YC01      Fair Excellent      Poor
## 3     PD02      Good      Good      Fair
## 4     PD03      Good Excellent Very Good
## 5     PD04 Very Good Excellent Excellent
## 6     YC02 Excellent Very Good      Good
## 7     OC01      Good      Fair      Good
## 8     OC02      Fair      Poor      Good
## 9     YC03      Poor      Fair      Fair
## 10    OC03      Fair      Good      Fair

# Quick and dirty data dictionary with question details and levels
## Because the levels are consistent, I've only specified them once
## and R will multiply it to fill the dataframe. If you have different 
## details for each question, make sure you input them all separately. 
## This is a good thing to do in your data entry/data prep. 
question <- c("How was your delivery?", "Value for cost?", "How was your meal?")
level <- c("1 = Poor, 2 = Fair, 3 = Good, 4 = Very Good, 5  = Excellent")
data_dict <- data.frame(colnames(df), question, level)

Importing, cleaning, and wrangling your data is a huge topic, but there are tons of resources to help you with this!

Running Tests

What sample to do??

Running statistical tests in R is a bit different from running them in SPSS, but thankfully Google is your best friend here! Let’s work through a sample analysis on our diamonds data.

Let’s say I wanted to do a t-test to compare the average prices of Ideal and Premium diamonds.

To set this up in SPSS, I would select an Indepedendent Samples t-test from the Compare Means menu. This is what my setup window would look like - I know that Ideal and Premium are coded as “4” and “5” by clicking on the Values in Data View.

And this would be my output:

In R, to run this t-test I’m going to use the function t.test. However, SPSS runs Levene’s test to check for homogeneity of variance as part of the t-test. We’ll have to do that ourselves, using leveneTest from the car package. We’ll also need to filter in order to just have Ideal and Premium diamonds.

Seeing that Levene’s test is significant, our assumption of homogeneity of variance is not upheld, and we should use a Welch t-test, rather than a Student’s t-test. This is the default in R’s t.test.

diamonds_subset <- diamonds %>% filter(cut == "Ideal" | cut == "Premium")

leveneTest(diamonds_subset$price, diamonds_subset$cut, center = mean)
## Levene's Test for Homogeneity of Variance (center = mean)
##          Df F value Pr(>F)    
## group     1     350 <2e-16 ***
##       35340                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Welch - the one we're using
t.test(diamonds_subset$price ~ diamonds_subset$cut, var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  diamonds_subset$price by diamonds_subset$cut
## t = 24.9, df = 26600, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1038.1 1215.3
## sample estimates:
## mean in group Premium   mean in group Ideal 
##                4584.3                3457.5

# Student - if Levene's hadn't been true
t.test(diamonds_subset$price ~ diamonds_subset$cut, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  diamonds_subset$price by diamonds_subset$cut
## t = 25.7, df = 35300, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1040.6 1212.8
## sample estimates:
## mean in group Premium   mean in group Ideal 
##                4584.3                3457.5

The output from our t-test includes almost all of the same information as the SPSS output. All we’re missing is mean difference and the standard error of the difference. Mean difference is relatively easy to calculate - standard error of the difference is a bit more complex but here it is! Once again, these funky formulas are coming from the SPSS algorithm manual. In most cases, you won’t need to get a specific value that comes from the SPSS output - but this is just to provide an example of how to match things up.

# Mean difference - the same for Student and Welch tests 
ideal <- filter(diamonds, diamonds$cut == "Ideal")
premium <- filter(diamonds, diamonds$cut == "Premium")
meandiff <- abs(mean(ideal$price) - mean(premium$price))

n_ideal <- sum(!is.na(ideal$price))
n_premium <- sum(!is.na(premium$price))

# Standard error of the difference: unpooled - for Welch's t-test 
SE_diff_unpooled <- sqrt((var(ideal$price)/n_ideal) + (var(premium$price)/n_premium))

# Standard error of the difference: pooled variance - for Student's t-test
pooled_variance <- ((n_ideal-1)*var(ideal$price) + (n_premium-1)*var(premium$price))/(n_ideal+n_premium - 2)
SE_diff_pooled <- sqrt(pooled_variance)*sqrt(1/n_ideal+1/n_premium)

Creating Summary Documents

Once you’ve finished your beautiful work in R, you need a way to output your documents! A few ways you can do this are:

  • Spreadsheets: e.g. outputting to a .csv with readr::write_csv - Syntax: write_csv(dataFrame, file.path(path, “results.csv”))
  • Images: e.g. outputting a plot to a .png with ggplot2::ggsave - Syntax: ggsave(plotName, file = “plot.png”)
  • R Markdown: Knitting to HTML, PDF, or Word

My current favourite package for formatting summary documents in R Markdown is flextable, which allows you to output to HTML, PDF and Word with ease. It also offers significant customizability. If you don’t use Word documents, kableExtra is another fabulous option, and a bit simpler!

The first thing you might want to include in a summary document is some descriptives. Calculating them is easy to do using dplyr’s summarize. Here’s a quick sample - there’s more on this in the Explore section of the Resources. Once you’ve got your summary, you can use flextable to prep it for output. flextable takes a data.frame as its input.

diamonds %>%
  summarize(n = n(), n_missing = sum(is.na(price)), mean = mean(price), sd = sd(price), min = min(price), max = max(price)) %>%
  mutate(stderror = sd/sqrt(n)) %>%
  mutate(confint_lower = mean-2*stderror) %>%
  mutate(confint_upper = mean+2*stderror)
## # A tibble: 1 x 9
##       n n_missing  mean    sd   min   max stderror confint_lower
##   <int>     <int> <dbl> <dbl> <dbl> <dbl>    <dbl>         <dbl>
## 1 53940         0 3933. 3989.   326 18823     17.2         3898.
## # ... with 1 more variable: confint_upper <dbl>

summary <- diamonds %>%
  summarise_each(vars = c(carat, depth, table, price, x, y, z), funs(mean, sd, min, max, median, IQR)) 

sum_ft <- flextable(summary)

# Subset tables by stat
flextable(select(summary, contains("mean")))

carat_mean

depth_mean

table_mean

price_mean

x_mean

y_mean

z_mean

0.798

61.749

57.457

3932.800

5.731

5.735

3.539

flextable(select(summary, contains("sd")))

carat_sd

depth_sd

table_sd

price_sd

x_sd

y_sd

z_sd

0.474

1.433

2.234

3989.440

1.122

1.142

0.706


# Subset tables by variable
flextable(select(summary, contains("price")))

price_mean

price_sd

price_min

price_max

price_median

price_IQR

3932.800

3989.440

326.000

18823.000

2401.000

4374.250


# Playing with formatting
sum_ft <- flextable(summary) %>%
  theme_zebra %>%
  font(part = "header", fontname = "Cambria") %>%
  bold(part = "header") %>%
  color(part = "header", color = "blue") %>%
  fontsize(part = "all", size = 16)

Let’s say we want to output our t-test output from above. Output from statistical tests is often a little cluttered - broom::tidy can clean that up for us.

output <- tidy(t.test(diamonds_subset$price ~ diamonds_subset$cut, var.equal = FALSE)) %>%
  select(-estimate)

# Very basic
flextable(output) 

estimate1

estimate2

statistic

p.value

parameter

conf.low

conf.high

method

alternative

4584.258

3457.542

24.918

0.000

26552.158

1038.088

1215.344

Welch Two Sample t-test

two.sided


# How exciting!
out_ft <- flextable(output) %>%
  set_header_labels(estimate1 = "Premium Mean", estimate2 = "Ideal Mean", statistic = "Statistic", p.value = "p", parameter = "df", conf.low = "Conf. Int. Lower", conf.high = "Conf. Int. High", method = "Method", alternative = "Alternative") %>%
  bold(part = "header") %>%
  fontsize(part = "header", size = 16) %>%
  fontsize(part = "body", size = 14) %>%
  bg(part = "body", bg = "#E9E9E9") 

out_ft

Premium Mean

Ideal Mean

Statistic

p

df

Conf. Int. Lower

Conf. Int. High

Method

Alternative

4584.258

3457.542

24.918

0.000

26552.158

1038.088

1215.344

Welch Two Sample t-test

two.sided

Great! So we’ve got a couple tables we’d like to add to our R Markdown file. I think a good beginner strategy for getting easy summary documents is to work in a .Rmd (rather than a .R) and just put all your rough work in a code chunk with include = FALSE. That way, when you knit your document, nothing in your code chunk will be included - only things you call outside (such as with in-line code or with another chunk). Here’s an example of what I mean.

Here’s what the sample below looks like in R Markdown:

Summary Document Sample

The data included a sample of 53940 diamonds. The average price among the data was 3932.79972 USD. Descriptive statistics of diamond prices are below:

N

N Missing

Mean

SD

Minimum

Maximum

Standard Error

Conf. Int. Lower

Conf. Int. Upper

53940

0

3932.800

3989.440

326.000

18823.000

17.177

3898.445

3967.154

I was interested in the relationship between a diamond’s cut and its price at the two highest levels of cut (Ideal and Premium). I ran an independent samples t-test to compare the average price in these two groups. Testing for homogeneity of variance with Levene’s test indicated a F-statistic of 349.572 and a p-value of < .001. As a result, a Welch t-test was conducted - results are below.

Premium Mean

Ideal Mean

Statistic

p

df

Conf. Int. Lower

Conf. Int. Upper

Method

Alternative

4584.258

3457.542

24.918

0.000

26552.158

1038.088

1215.344

Welch Two Sample t-test

two.sided


Exercise 3: Summary Stats

Use what you’ve learned to do the following:

  • Calculate the mean and SD of price
  • Calculate the mean and SD of price among diamonds with a cut of Ideal
  • Turn the second one into a flextable
  • Bold the body of the table

Tips

Resources

If you’re stuck, you can reach out to me - I probably won’t have an answer for you off-hand, but I’m certainly willing to look into things with you! Email: dcushni@uwo.ca

More Samples of R’s Output vs. SPSS Output

Correlation

Bivariate correlation between carat and price:

# Base R - very simple, only gives the r value
cor(diamonds$carat, diamonds$price)
## [1] 0.92159
cor(diamonds$carat, diamonds$price, method = "spearman")
## [1] 0.96288

# from Hmisc - also works on a matrix; includes p-values
rcorr(diamonds$carat, diamonds$price)
##      x    y
## x 1.00 0.92
## y 0.92 1.00
## 
## n= 53940 
## 
## 
## P
##   x  y 
## x     0
## y  0
rcorr(diamonds$carat, diamonds$price, type = "spearman")
##      x    y
## x 1.00 0.96
## y 0.96 1.00
## 
## n= 53940 
## 
## 
## P
##   x  y 
## x     0
## y  0
rcorr(as.matrix(select(diamonds, -cut, -color, -clarity)))
##       carat depth table price     x     y    z
## carat  1.00  0.03  0.18  0.92  0.98  0.95 0.95
## depth  0.03  1.00 -0.30 -0.01 -0.03 -0.03 0.09
## table  0.18 -0.30  1.00  0.13  0.20  0.18 0.15
## price  0.92 -0.01  0.13  1.00  0.88  0.87 0.86
## x      0.98 -0.03  0.20  0.88  1.00  0.97 0.97
## y      0.95 -0.03  0.18  0.87  0.97  1.00 0.95
## z      0.95  0.09  0.15  0.86  0.97  0.95 1.00
## 
## n= 53940 
## 
## 
## P
##       carat  depth  table  price  x      y      z     
## carat        0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## depth 0.0000        0.0000 0.0134 0.0000 0.0000 0.0000
## table 0.0000 0.0000        0.0000 0.0000 0.0000 0.0000
## price 0.0000 0.0134 0.0000        0.0000 0.0000 0.0000
## x     0.0000 0.0000 0.0000 0.0000        0.0000 0.0000
## y     0.0000 0.0000 0.0000 0.0000 0.0000        0.0000
## z     0.0000 0.0000 0.0000 0.0000 0.0000 0.0000

Regression

Predicting price as a function of carat:

price_carat <- lm(diamonds$price ~ diamonds$carat)
summary(price_carat)
## 
## Call:
## lm(formula = diamonds$price ~ diamonds$carat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -18585   -805    -19    537  12732 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -2256.4       13.1    -173   <2e-16 ***
## diamonds$carat   7756.4       14.1     551   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1550 on 53938 degrees of freedom
## Multiple R-squared:  0.849,  Adjusted R-squared:  0.849 
## F-statistic: 3.04e+05 on 1 and 53938 DF,  p-value: <2e-16

ANOVA

Note: If you want your outputted values to match SPSS, you need to use the same contrasts as SPSS. This is discussed here. To change your contrasts, do the following.

options(contrasts = c("contr.sum", "contr.poly"))

One-Way ANOVA comparing mean price between all diamond cuts:

cut_price <- lm(price ~ cut, data = diamonds)
cut_price_aov <- aov(cut_price)
summary(cut_price_aov)
##                Df       Sum Sq    Mean Sq F value Pr(>F)    
## cut             4  11041745359 2760436340     176 <2e-16 ***
## Residuals   53935 847431390159   15712087                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Post-hoc with Bonferroni Correction
pairwise.t.test(diamonds$price, diamonds$cut, p.adj = "bonferroni")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  diamonds$price and diamonds$cut 
## 
##           Fair    Good            Very Good Premium
## Good      0.002   -               -         -      
## Very Good 0.003   1.000           -         -      
## Premium   0.308   < 2e-16         < 2e-16   -      
## Ideal     < 2e-16 0.0000000000006 < 2e-16   < 2e-16
## 
## P value adjustment method: bonferroni

Two-Way ANOVA comparing mean price as a function of diamond cut and clarity:

cut_clarity_price <- lm(price ~ cut*clarity, data = diamonds)
cut_clarity_price_aov <- aov(cut_clarity_price)
summary(cut_clarity_price_aov)
##                Df       Sum Sq    Mean Sq F value Pr(>F)    
## cut             4  11041745359 2760436340  180.16 <2e-16 ***
## clarity         7  18909811326 2701401618  176.30 <2e-16 ***
## cut:clarity    28   2646560835   94520030    6.17 <2e-16 ***
## Residuals   53900 825875017997   15322357                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Post-hoc with Bonferroni correction
# This print-out would be very long - showing you cut element as a sample
cut_clarity_price_posthoc <- TukeyHSD(cut_clarity_price_aov)
cut_clarity_price_posthoc$cut
##                        diff       lwr       upr                p adj
## Good-Fair          -429.893  -736.573  -123.214 0.001240718749947023
## Very Good-Fair     -376.998  -660.282   -93.714 0.002624399706012026
## Premium-Fair        225.500   -55.713   506.713 0.184389469834228326
## Ideal-Fair         -901.216 -1177.085  -625.347 0.000000000000035749
## Very Good-Good       52.895  -127.867   233.658 0.931207257351990947
## Premium-Good        655.393   477.894   832.892 0.000000000000055400
## Ideal-Good         -471.322  -640.228  -302.417 0.000000000000301092
## Premium-Very Good   602.498   469.444   735.552 0.000000000000000000
## Ideal-Very Good    -524.218  -645.571  -402.865 0.000000000000000000
## Ideal-Premium     -1126.716 -1243.151 -1010.280 0.000000000000000000

Explore

Running SPSS Explore on Price (limited screenshots because this is long output!):

  • Descriptives

  • Histograms
  • Boxplots

Note: Kurtosis is calculated differently between SPSS and R. Different kurtosis calculation methods are discussed here.

# Simple investigation of price

diamonds %>%
  summarize(n = n(), n_missing = sum(is.na(price)), mean = mean(price, na.rm = TRUE), sd = sd(price, na.rm = TRUE), min = min(price, na.rm = TRUE), max = max(price, na.rm = TRUE), median = median (price, na.rm = TRUE), IQR = IQR(price, na.rm = TRUE), variance = var(price, na.rm = TRUE), trimmed = mean(price, na.rm = TRUE, trim = 0.05), skewness = skewness(price, na.rm = TRUE), kurtosis = kurtosis(price, na.rm = TRUE)) %>%
  mutate(range = max - min) %>%
  mutate(stderror = sd/sqrt(n)) %>%
  mutate(confint_lower = mean-2*stderror) %>%
  mutate(confint_upper = mean+2*stderror)
## # A tibble: 1 x 16
##       n n_missing  mean    sd   min   max median   IQR variance trimmed
##   <int>     <int> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>    <dbl>   <dbl>
## 1 53940         0 3933. 3989.   326 18823   2401 4374.   1.59e7   3471.
## # ... with 6 more variables: skewness <dbl>, kurtosis <dbl>, range <dbl>,
## #   stderror <dbl>, confint_lower <dbl>, confint_upper <dbl>

hist(diamonds$price)

boxplot(diamonds$price)


# Price grouped by cut

diamonds %>%
  group_by(cut) %>%
  summarize(n = n(), n_missing = sum(is.na(price)), mean = mean(price, na.rm = TRUE), sd = sd(price, na.rm = TRUE), min = min(price, na.rm = TRUE), max = max(price, na.rm = TRUE), median = median (price, na.rm = TRUE), IQR = IQR(price, na.rm = TRUE), variance = var(price, na.rm = TRUE), trimmed = mean(price, na.rm = TRUE, trim = 0.05), skewness = skewness(diamonds$price, na.rm = TRUE), kurtosis = kurtosis(diamonds$price, na.rm = TRUE)) %>%
  mutate(range = max - min) %>%
  mutate(stderror = sd/sqrt(n)) %>%
  mutate(confint_lower = mean-2*stderror) %>%
  mutate(confint_upper = mean+2*stderror)
## # A tibble: 5 x 17
##   cut       n n_missing  mean    sd   min   max median   IQR variance
##   <ord> <int>     <int> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>    <dbl>
## 1 Fair   1610         0 4359. 3560.   337 18574  3282  3155.   1.27e7
## 2 Good   4906         0 3929. 3682.   327 18788  3050. 3883    1.36e7
## 3 Very~ 12082         0 3982. 3936.   336 18818  2648  4461.   1.55e7
## 4 Prem~ 13791         0 4584. 4349.   326 18823  3185  5250    1.89e7
## 5 Ideal 21551         0 3458. 3808.   326 18806  1810  3800.   1.45e7
## # ... with 7 more variables: trimmed <dbl>, skewness <dbl>,
## #   kurtosis <dbl>, range <dbl>, stderror <dbl>, confint_lower <dbl>,
## #   confint_upper <dbl>

ggplot(diamonds, aes(x = price)) + 
  geom_histogram(color = "black", fill = "white") +
  facet_grid(cut ~.)


ggplot(diamonds, aes(x = cut, y = price)) + 
  geom_boxplot(color = "black", fill = "white")

Questionnaires?

Working with questionnaire data offers unique challenges. For those of you who are into that, here are some quick tips on how to manage this kind of data.

We’re going to use the bfi dataset - documentation can tell you lots about it! The bfi documentation actually includes a sample of how to use an item key and score negatively keyed items - so this section is us working through the provided sample. In SPSS, you would negatively key items using Transform.

Briefly, bfi is a set of 25 personality self-report items from 2800 subjects, plus demogrpahics on gender, education, and age. The 25 items are oranized around 5 factors: (A) Agreeableness, (C) Conscientiousness, (E) Extraversion, (N) Neuroticism, (O) Openness. As is common with questionnaire data, some of these items are negatively keyed (ie. high scores on the question represent low levels of the dimension measured). This is easy to do in R! To make a keys.list, define vectors for each factor (ie. Agreeableness) and list the included questions by name. Negatively keyed variables are prefaced with a -.

data(bfi)
glimpse(bfi)
## Observations: 2,800
## Variables: 28
## $ A1        <int> 2, 2, 5, 4, 2, 6, 2, 4, 4, 2, 4, 2, 5, 5, 4, 4, 4, 5...
## $ A2        <int> 4, 4, 4, 4, 3, 6, 5, 3, 3, 5, 4, 5, 5, 5, 5, 3, 6, 5...
## $ A3        <int> 3, 5, 5, 6, 3, 5, 5, 1, 6, 6, 5, 5, 5, 5, 2, 6, 6, 5...
## $ A4        <int> 4, 2, 4, 5, 4, 6, 3, 5, 3, 6, 6, 5, 6, 6, 2, 6, 2, 4...
## $ A5        <int> 4, 5, 4, 5, 5, 5, 5, 1, 3, 5, 5, 5, 4, 6, 1, 3, 5, 5...
## $ C1        <int> 2, 5, 4, 4, 4, 6, 5, 3, 6, 6, 4, 5, 5, 4, 5, 5, 4, 5...
## $ C2        <int> 3, 4, 5, 4, 4, 6, 4, 2, 6, 5, 3, 4, 4, 4, 5, 5, 4, 5...
## $ C3        <int> 3, 4, 4, 3, 5, 6, 4, 4, 3, 6, 5, 5, 3, 4, 5, 5, 4, 5...
## $ C4        <int> 4, 3, 2, 5, 3, 1, 2, 2, 4, 2, 3, 4, 2, 2, 2, 3, 4, 4...
## $ C5        <int> 4, 4, 5, 5, 2, 3, 3, 4, 5, 1, 2, 5, 2, 1, 2, 5, 4, 3...
## $ E1        <int> 3, 1, 2, 5, 2, 2, 4, 3, 5, 2, 1, 3, 3, 2, 3, 1, 1, 2...
## $ E2        <int> 3, 1, 4, 3, 2, 1, 3, 6, 3, 2, 3, 3, 3, 2, 4, 1, 2, 2...
## $ E3        <int> 3, 6, 4, 4, 5, 6, 4, 4, NA, 4, 2, 4, 3, 4, 3, 6, 5, ...
## $ E4        <int> 4, 4, 4, 4, 4, 5, 5, 2, 4, 5, 5, 5, 2, 6, 6, 6, 5, 6...
## $ E5        <int> 4, 3, 5, 4, 5, 6, 5, 1, 3, 5, 4, 4, 4, 5, 5, 4, 5, 6...
## $ N1        <int> 3, 3, 4, 2, 2, 3, 1, 6, 5, 5, 3, 4, 1, 1, 2, 4, 4, 6...
## $ N2        <int> 4, 3, 5, 5, 3, 5, 2, 3, 5, 5, 3, 5, 2, 1, 4, 5, 4, 5...
## $ N3        <int> 2, 3, 4, 2, 4, 2, 2, 2, 2, 5, 4, 3, 2, 1, 2, 4, 4, 5...
## $ N4        <int> 2, 5, 2, 4, 4, 2, 1, 6, 3, 2, 2, 2, 2, 2, 2, 5, 4, 4...
## $ N5        <int> 3, 5, 3, 1, 3, 3, 1, 4, 3, 4, 3, NA, 2, 1, 3, 5, 5, ...
## $ O1        <int> 3, 4, 4, 3, 3, 4, 5, 3, 6, 5, 5, 4, 4, 5, 5, 6, 5, 5...
## $ O2        <int> 6, 2, 2, 3, 3, 3, 2, 2, 6, 1, 3, 6, 2, 3, 2, 6, 1, 1...
## $ O3        <int> 3, 4, 5, 4, 4, 5, 5, 4, 6, 5, 5, 4, 4, 4, 5, 6, 5, 4...
## $ O4        <int> 4, 3, 5, 3, 3, 6, 6, 5, 6, 5, 6, 5, 5, 4, 5, 3, 6, 5...
## $ O5        <int> 3, 3, 2, 5, 3, 1, 1, 3, 1, 2, 3, 4, 2, 4, 5, 2, 3, 4...
## $ gender    <int> 1, 2, 2, 2, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1...
## $ education <int> NA, NA, NA, NA, NA, 3, NA, 2, 1, NA, 1, NA, NA, NA, ...
## $ age       <int> 16, 18, 17, 17, 17, 21, 18, 19, 19, 17, 21, 16, 16, ...

keys.list <- list(agree=c("-A1","A2","A3","A4","A5"),conscientious=c("C1","C2","C3","-C4","-C5"),
extraversion=c("-E1","-E2","E3","E4","E5"),neuroticism=c("N1","N2","N3","N4","N5"),
openness = c("O1","-O2","O3","O4","-O5")) 

# scoreItems gives a ton of info about items! 
scores <- scoreItems(keys.list,bfi,min=1,max=6) # Need to specify the minimum and maximum values
scores
## Call: scoreItems(keys = keys.list, items = bfi, min = 1, max = 6)
## 
## (Unstandardized) Alpha:
##       agree conscientious extraversion neuroticism openness
## alpha   0.7          0.72         0.76        0.81      0.6
## 
## Standard errors of unstandardized Alpha:
##       agree conscientious extraversion neuroticism openness
## ASE   0.014         0.014        0.013       0.011    0.017
## 
## Average item correlation:
##           agree conscientious extraversion neuroticism openness
## average.r  0.32          0.34         0.39        0.46     0.23
## 
## Median item correlation:
##         agree conscientious  extraversion   neuroticism      openness 
##          0.34          0.34          0.38          0.41          0.22 
## 
##  Guttman 6* reliability: 
##          agree conscientious extraversion neuroticism openness
## Lambda.6   0.7          0.72         0.76        0.81      0.6
## 
## Signal/Noise based upon av.r : 
##              agree conscientious extraversion neuroticism openness
## Signal/Noise   2.3           2.6          3.2         4.3      1.5
## 
## Scale intercorrelations corrected for attenuation 
##  raw correlations below the diagonal, alpha on the diagonal 
##  corrected correlations above the diagonal:
##               agree conscientious extraversion neuroticism openness
## agree          0.70          0.36         0.63      -0.245     0.23
## conscientious  0.26          0.72         0.35      -0.305     0.30
## extraversion   0.46          0.26         0.76      -0.284     0.32
## neuroticism   -0.18         -0.23        -0.22       0.812    -0.12
## openness       0.15          0.19         0.22      -0.086     0.60
## 
##  In order to see the item by scale loadings and frequency counts of the data
##  print with the short option = FALSE

# Call str() on scores to see all the pieces of information you can access
# individually! The *scores* element of scores includes the re-keyed items 
# Summarise from dplyr can help you investigate these scores.  
keyed_scores <- data.frame(scores$scores)

keyed_scores %>%
  summarise_all(mean)
##    agree conscientious extraversion neuroticism openness
## 1 4.6531        4.2694        4.146       3.162    4.591

# If you wanted to group by a variable (such as gender) - remember that 
# we dropped demographics when we used keys.list. We can get those back!
# According to the documentation, males = 1 and females =2
keyed_scores <- cbind(gender = bfi$gender, keyed_scores)

keyed_scores %>%
  group_by(gender) %>%
  summarise_all(mean)
## # A tibble: 2 x 6
##   gender agree conscientious extraversion neuroticism openness
##    <int> <dbl>         <dbl>        <dbl>       <dbl>    <dbl>
## 1      1  4.39          4.14         3.99        2.95     4.66
## 2      2  4.78          4.33         4.22        3.27     4.56

# Using dplyr::select to help you subset to look at individual pieces of the data
open <- bfi %>%
  select(starts_with("O"))

Dummy coding is useful and not exclusive to questionnaire data. This is very simple in R. Let’s say we wanted to look at the relationship between education and openness.

# Factor the variable first mostly to re-label nicely
factored_edu <- factor(bfi$education, labels = c("someHS", "doneHS", "someCollege", "doneCollege", "graduate"), ordered = TRUE, exclude = NA)

edu <- dummy.code(factored_edu)

# Your dummy coded variables are now an island!
# Bind these dummy coded education variables to the openness df
# we made earlier with dplyr::select. 
open_edu <- cbind(edu, open)