For many of us, SPSS has been something we’ve used for statistical analysis since early in our experience with statistics. Running tests in SPSS might be second-nature to you, and there are lots of resources online for how to do things you may be unclear about. The comfort of SPSS can make it daunting to explore other options; however, SPSS has limitations, and you may have stumbled upon an interesting statistical test that can be run in R and not in SPSS. Alternatively, you may have decided that SPSS licenses are very expensive and inconvenient, and you’re interested in other options. I have gradually been transitioning from SPSS to R, so I’m hoping to share some of what I’ve learned that might ease your transition.
In my opinion, the biggest shift one makes in the move from SPSS to R is a change in how you think about your data and your analysis. If you are used to using the SPSS to GUI and clicky navigation to move around and explore your data, you’ll notice that won’t be as possible in R. Once you get over this hurdle, the flexibility and power of R is sure to win you over!
SPSS is a one-piece statistical package. All the functionalities and features available to you are included in the installation. R has many useful base functions, but there are tons of packages available from CRAN that allow you to do more things or do them better. To install a package from CRAN, use the function install.packages
: for example, install.packages("ggplot2")
. In order to use packages, you need to load them into R, which makes their functionalities available. Do this using library
. Here are the packages we’ll be using today.
library(psych)
library(tidyverse)
library(ggplot2)
library(haven)
library(car)
library(Hmisc)
library(moments)
library(flextable)
library(broom)
select <- dplyr::select
summarize <- dplyr::summarize
path <- getwd()
In R, a single ‘line’ of data is called a vector - a row or column could be a vector. The most common higher dimensions of data you’ll use are matrices (if they’re all one variable type eg. numeric), lists (bundles of different kinds of information) or data.frames (like the spreadsheets you’d use in Excel or SPSS). These 3 simple base R functions are some of your best friends - you will use them often! Especially c()
(combine) - you’ll use this on its own and within functions. To assign a piece of information to a name (eg. to make a vector named “item1”), use the <- operator with the Label on the left and the Contents on the right (by convention). <- and = are not the same thing in R!
# makes a vector
item1 <- c(1, 2, 3, 5)
item2 <- c(3, 6, 7, 2)
# binds these by column - ie. these are variable vectors
cbind(item1, item2)
## item1 item2
## [1,] 1 3
## [2,] 2 6
## [3,] 3 7
## [4,] 5 2
# binds these by row - ie. these are observation vectors
rbind(item1, item2)
## [,1] [,2] [,3] [,4]
## item1 1 2 3 5
## item2 3 6 7 2
We’re going to be using one of the default datasets available in R through ggplot2, diamonds
. We’ll start by loading in that data. You would usually need to import your own data. You can do this using read_csv
from readr (e.g. my_data <- read_csv(/myfolder/my_data.csv)
). A helpful tip is to use path <- getwd()
at the beginning of your .R or .Rmd - this saves your current path. As long as you keep your .R or .Rmd in the same place as the files you’re looking for, you can easily input the path of a file at anytime using file.path(path, "filename.csv")
.
data(diamonds)
diamonds
## # A tibble: 53,940 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
## # ... with 53,930 more rows
If you’d like to look a the data in R Studio’s GUI, you can do this by typing View(diamonds)
into the console. Coming from SPSS, this is probably the view of your data that you’re familiar with. In R, there are also other ways to explore the data.
When you are loading in your own data, it rarely will be so neat! It’s important that all variables are classed correctly in order for appropriate calculation and visualization down the line. In SPSS, you would make these kinds of changes in Variable View. In order to play in parallel between R and SPSS today, I’m going to start by outputting our diamonds
data to a .csv.
write_csv(diamonds, file.path(path, "diamonds.csv"))
When I import this into SPSS straight from the .csv, here’s what I get:
We know that all our categorical variables are ordinal in this dataset, so we change those by clicking in the Measure column.
You’ll notice the Values column is empty - this is important because some tests in SPSS will not allow variables with strings, so you need to set up key-value pairs. For example, for cut, you could assign a number of 1 to Fair, 2 to Good, etc. You can do this manually in SPSS - but R can help us out! We’ll come back to this.
When you first import data into R, variables tend to be chr - character (string) if they have any letters, or num, int, or dbl (numerical, integer, or double) if they are all numbers. You will need to change your variables’ classes to fit what you know them to be. Categorical variables in R are called factors, and these can be unordered (nominal) or ordered (ordinal).
In this block of code, I’m going to create a messy version of diamonds
so we can practice cleaning it up. If you’re looking at the .Rmd, no peeking! It’ll give away how you fix it.
Take a look at messy_diamonds
with str
. What’s wrong with it?
## DEPTH
# Try putting depth back to numeric by coercing it with as.numeric
head(as.numeric(messy_diamonds$depth))
## [1] 61.5 59.8 56.9 62.4 63.3 62.8
# The warning 'NAs introduced by coercion' indicates that there's a
# non numerical value somewhere - let's find it to understand
# what's wrong before we erase it!
# I'm asking R to tell me which value is NA when I coerced it
which(is.na(as.numeric(messy_diamonds$depth)))
## [1] 90
# Let's look at what the depth of diamond #90 is
messy_diamonds[90,5]
## # A tibble: 1 x 1
## depth
## <chr>
## 1 a
# Oops! Looks like a data-entry error. If we had another place the data
# was entered (such as an original output document or paper questionnaire),
# we could check it and assign the correct value.
# I'm going to cheat and use the original diamonds to find out!
diamonds[90,5]
## # A tibble: 1 x 1
## depth
## <dbl>
## 1 62.9
# Originally, this value was 62.9 - we'll go ahead and assign that now.
messy_diamonds[90,5] <- 62.9
# If you didn't have a way to check the original value, you probably
# have to leave it as a missing value. R will automatically
# turn this character into a NA - which is what the warning was telling us.
# Let's go ahead and coerce depth to numeric now that we've checked it.
messy_diamonds$depth <- as.numeric(messy_diamonds$depth)
## COLOR
# Color should be a factor! It needs to be ordered. If this were our data,
# we'd know what the order should be. In this case, let's check
# the original diamonds to find out.
levels(diamonds$color)
## [1] "D" "E" "F" "G" "H" "I" "J"
messy_diamonds$color <- factor(messy_diamonds$color, ordered = TRUE, levels = c("D", "E", "F", "G", "H", "I", "J"))
## PRICE
# Price should be integer, but it's a factor right now.
# If we were to call as.integer right away, it would convert
# them based on the factor levels! To avoid this,
# we'll wrap it with as.character first.
messy_diamonds$price <- as.integer(as.character(messy_diamonds$price))
# No warnings! Let's call str to make sure things look right.
str(messy_diamonds)
## Classes 'tbl_df', 'tbl' and 'data.frame': 53940 obs. of 10 variables:
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
str(diamonds)
## Classes 'tbl_df', 'tbl' and 'data.frame': 53940 obs. of 10 variables:
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
No peeking on the Rmd!
I’ve done the same thing again - messed up a perfectly good dataset and called it cars
. This comes from the dataset mtcars
- you can search its documentation by typing ?mtcars
into the console. Make sure the data ends up clean based on your expectations from the documentation.
Now that we’ve gotten things cleaned up, we’re ready to go! Remember how I said that we’d come back to an easier way to get data set up in SPSS? The package haven is our best friend. Using write_sav
, we can create an SPSS data file (.sav) with variable types and values set-up nicely!
write_sav(diamonds, file.path(path, "diamonds.sav"))
Take a look at the Variable View of diamonds.sav in SPSS.
The categorical values now have numerical values corresponding to their original factor levels, and everything is appropriately classed! Take a look at the Data View for the two files to compare.
Straight from the CSV:
Through Haven:
This is great, because many of us don’t immediately abandon our old tools when we pick up new ones. You might still want to use SPSS for some things, but R can still help you manage your data. Alternatively, it’s helpful if you have collaborators who use SPSS.
The ‘Missing’ column in Variable View of SPSS is a helpful way to quickly identify variables with missing data. In R, you can quickly find out if there are any missing data points using is.na
and a couple friends.
testvariable <- c(1, 2, 3, 5, 3, 1, 3, 3, 4, 5, 6, NA, 3, 3, 4, 7, 8, 1, 3, NA, 3, 4, 6)
any(is.na(testvariable))
## [1] TRUE
which(is.na(testvariable))
## [1] 12 20
You can use Values (key-value pairs) and Labels in SPSS to store information about your variables. In R, you can give your values names (similar to key-value pairs), and you can make a data dictionary to store detailed information that you want access to. dataMeta provides some helpful tools for making data dictionaries, but you can scrape by with base R. Here are some simple examples of ways to label data and store additional information about variables. You could of course import these by reading in a .csv (read_csv
through readr is my go-to) or other spreadsheet - you don’t have to type these things directly into R.
var1 <- c(1, 2, 3, 3, 4, 5, 3, 2, 1, 2)
var2 <- c(4, 5, 3, 5, 5, 4, 2, 1, 2, 3)
var3 <- c(1, 1, 2, 4, 5, 3, 3, 3, 2, 2)
df <- data.frame(var1, var2, var3)
# Labeling individual values
df$var1 <- factor(var1, levels = c("1", "2", "3", "4", "5"), labels = c("Poor", "Fair", "Good", "Very Good", "Excellent"), ordered = TRUE)
## All 3 of these questions have the same levels/labels so for convenience
## we'll define vectors that store this info once
q_labels <- c("Poor", "Fair", "Good", "Very Good", "Excellent")
q_levels <- c("1", "2", "3", "4", "5")
df$var1 <- factor(var1, levels = q_levels, labels = q_labels, ordered = TRUE)
df$var2 <- factor(var2, levels = q_levels, labels = q_labels, ordered = TRUE)
df$var3 <- factor(var3, levels = q_levels, labels = q_labels, ordered = TRUE)
## Note: if you use an ordered factor but want to access numerical
## features (ie. to calculate things like mean, min or max) you can
## wrap it in as.numeric - does not matter if factor is labeled as characters
mean(as.numeric(df$var1))
## [1] 2.6
# Naming observations can be helpful to keep track
rownames(df) <- c("PD01", "YC01", "PD02", "PD03", "PD04", "YC02", "OC01", "OC02", "YC03", "OC03")
# Renaming the variables
colnames(df) <- c("delivery", "value", "meal")
# If you want to put these into their own column (helpful if tidying):
rownames_to_column(df)
## rowname delivery value meal
## 1 PD01 Poor Very Good Poor
## 2 YC01 Fair Excellent Poor
## 3 PD02 Good Good Fair
## 4 PD03 Good Excellent Very Good
## 5 PD04 Very Good Excellent Excellent
## 6 YC02 Excellent Very Good Good
## 7 OC01 Good Fair Good
## 8 OC02 Fair Poor Good
## 9 YC03 Poor Fair Fair
## 10 OC03 Fair Good Fair
# Quick and dirty data dictionary with question details and levels
## Because the levels are consistent, I've only specified them once
## and R will multiply it to fill the dataframe. If you have different
## details for each question, make sure you input them all separately.
## This is a good thing to do in your data entry/data prep.
question <- c("How was your delivery?", "Value for cost?", "How was your meal?")
level <- c("1 = Poor, 2 = Fair, 3 = Good, 4 = Very Good, 5 = Excellent")
data_dict <- data.frame(colnames(df), question, level)
Importing, cleaning, and wrangling your data is a huge topic, but there are tons of resources to help you with this!
What sample to do??
Running statistical tests in R is a bit different from running them in SPSS, but thankfully Google is your best friend here! Let’s work through a sample analysis on our diamonds
data.
Let’s say I wanted to do a t-test to compare the average prices of Ideal and Premium diamonds.
To set this up in SPSS, I would select an Indepedendent Samples t-test from the Compare Means menu. This is what my setup window would look like - I know that Ideal and Premium are coded as “4” and “5” by clicking on the Values in Data View.
And this would be my output:
In R, to run this t-test I’m going to use the function t.test
. However, SPSS runs Levene’s test to check for homogeneity of variance as part of the t-test. We’ll have to do that ourselves, using leveneTest
from the car package. We’ll also need to filter in order to just have Ideal and Premium diamonds.
Seeing that Levene’s test is significant, our assumption of homogeneity of variance is not upheld, and we should use a Welch t-test, rather than a Student’s t-test. This is the default in R’s t.test
.
diamonds_subset <- diamonds %>% filter(cut == "Ideal" | cut == "Premium")
leveneTest(diamonds_subset$price, diamonds_subset$cut, center = mean)
## Levene's Test for Homogeneity of Variance (center = mean)
## Df F value Pr(>F)
## group 1 350 <2e-16 ***
## 35340
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Welch - the one we're using
t.test(diamonds_subset$price ~ diamonds_subset$cut, var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: diamonds_subset$price by diamonds_subset$cut
## t = 24.9, df = 26600, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1038.1 1215.3
## sample estimates:
## mean in group Premium mean in group Ideal
## 4584.3 3457.5
# Student - if Levene's hadn't been true
t.test(diamonds_subset$price ~ diamonds_subset$cut, var.equal = TRUE)
##
## Two Sample t-test
##
## data: diamonds_subset$price by diamonds_subset$cut
## t = 25.7, df = 35300, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1040.6 1212.8
## sample estimates:
## mean in group Premium mean in group Ideal
## 4584.3 3457.5
The output from our t-test includes almost all of the same information as the SPSS output. All we’re missing is mean difference and the standard error of the difference. Mean difference is relatively easy to calculate - standard error of the difference is a bit more complex but here it is! Once again, these funky formulas are coming from the SPSS algorithm manual. In most cases, you won’t need to get a specific value that comes from the SPSS output - but this is just to provide an example of how to match things up.
# Mean difference - the same for Student and Welch tests
ideal <- filter(diamonds, diamonds$cut == "Ideal")
premium <- filter(diamonds, diamonds$cut == "Premium")
meandiff <- abs(mean(ideal$price) - mean(premium$price))
n_ideal <- sum(!is.na(ideal$price))
n_premium <- sum(!is.na(premium$price))
# Standard error of the difference: unpooled - for Welch's t-test
SE_diff_unpooled <- sqrt((var(ideal$price)/n_ideal) + (var(premium$price)/n_premium))
# Standard error of the difference: pooled variance - for Student's t-test
pooled_variance <- ((n_ideal-1)*var(ideal$price) + (n_premium-1)*var(premium$price))/(n_ideal+n_premium - 2)
SE_diff_pooled <- sqrt(pooled_variance)*sqrt(1/n_ideal+1/n_premium)
Once you’ve finished your beautiful work in R, you need a way to output your documents! A few ways you can do this are:
readr::write_csv
- Syntax: write_csv(dataFrame, file.path(path, “results.csv”))ggplot2::ggsave
- Syntax: ggsave(plotName, file = “plot.png”)My current favourite package for formatting summary documents in R Markdown is flextable, which allows you to output to HTML, PDF and Word with ease. It also offers significant customizability. If you don’t use Word documents, kableExtra is another fabulous option, and a bit simpler!
The first thing you might want to include in a summary document is some descriptives. Calculating them is easy to do using dplyr’s summarize
. Here’s a quick sample - there’s more on this in the Explore section of the Resources. Once you’ve got your summary, you can use flextable to prep it for output. flextable takes a data.frame as its input.
diamonds %>%
summarize(n = n(), n_missing = sum(is.na(price)), mean = mean(price), sd = sd(price), min = min(price), max = max(price)) %>%
mutate(stderror = sd/sqrt(n)) %>%
mutate(confint_lower = mean-2*stderror) %>%
mutate(confint_upper = mean+2*stderror)
## # A tibble: 1 x 9
## n n_missing mean sd min max stderror confint_lower
## <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 53940 0 3933. 3989. 326 18823 17.2 3898.
## # ... with 1 more variable: confint_upper <dbl>
summary <- diamonds %>%
summarise_each(vars = c(carat, depth, table, price, x, y, z), funs(mean, sd, min, max, median, IQR))
sum_ft <- flextable(summary)
# Subset tables by stat
flextable(select(summary, contains("mean")))
carat_mean | depth_mean | table_mean | price_mean | x_mean | y_mean | z_mean |
0.798 | 61.749 | 57.457 | 3932.800 | 5.731 | 5.735 | 3.539 |
flextable(select(summary, contains("sd")))
carat_sd | depth_sd | table_sd | price_sd | x_sd | y_sd | z_sd |
0.474 | 1.433 | 2.234 | 3989.440 | 1.122 | 1.142 | 0.706 |
# Subset tables by variable
flextable(select(summary, contains("price")))
price_mean | price_sd | price_min | price_max | price_median | price_IQR |
3932.800 | 3989.440 | 326.000 | 18823.000 | 2401.000 | 4374.250 |
# Playing with formatting
sum_ft <- flextable(summary) %>%
theme_zebra %>%
font(part = "header", fontname = "Cambria") %>%
bold(part = "header") %>%
color(part = "header", color = "blue") %>%
fontsize(part = "all", size = 16)
Let’s say we want to output our t-test output from above. Output from statistical tests is often a little cluttered - broom::tidy
can clean that up for us.
output <- tidy(t.test(diamonds_subset$price ~ diamonds_subset$cut, var.equal = FALSE)) %>%
select(-estimate)
# Very basic
flextable(output)
estimate1 | estimate2 | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
4584.258 | 3457.542 | 24.918 | 0.000 | 26552.158 | 1038.088 | 1215.344 | Welch Two Sample t-test | two.sided |
# How exciting!
out_ft <- flextable(output) %>%
set_header_labels(estimate1 = "Premium Mean", estimate2 = "Ideal Mean", statistic = "Statistic", p.value = "p", parameter = "df", conf.low = "Conf. Int. Lower", conf.high = "Conf. Int. High", method = "Method", alternative = "Alternative") %>%
bold(part = "header") %>%
fontsize(part = "header", size = 16) %>%
fontsize(part = "body", size = 14) %>%
bg(part = "body", bg = "#E9E9E9")
out_ft
Premium Mean | Ideal Mean | Statistic | p | df | Conf. Int. Lower | Conf. Int. High | Method | Alternative |
4584.258 | 3457.542 | 24.918 | 0.000 | 26552.158 | 1038.088 | 1215.344 | Welch Two Sample t-test | two.sided |
Great! So we’ve got a couple tables we’d like to add to our R Markdown file. I think a good beginner strategy for getting easy summary documents is to work in a .Rmd (rather than a .R) and just put all your rough work in a code chunk with include = FALSE
. That way, when you knit your document, nothing in your code chunk will be included - only things you call outside (such as with in-line code or with another chunk). Here’s an example of what I mean.
Here’s what the sample below looks like in R Markdown:
The data included a sample of 53940 diamonds. The average price among the data was 3932.79972 USD. Descriptive statistics of diamond prices are below:
N | N Missing | Mean | SD | Minimum | Maximum | Standard Error | Conf. Int. Lower | Conf. Int. Upper |
53940 | 0 | 3932.800 | 3989.440 | 326.000 | 18823.000 | 17.177 | 3898.445 | 3967.154 |
I was interested in the relationship between a diamond’s cut and its price at the two highest levels of cut (Ideal and Premium). I ran an independent samples t-test to compare the average price in these two groups. Testing for homogeneity of variance with Levene’s test indicated a F-statistic of 349.572 and a p-value of < .001. As a result, a Welch t-test was conducted - results are below.
Premium Mean | Ideal Mean | Statistic | p | df | Conf. Int. Lower | Conf. Int. Upper | Method | Alternative |
4584.258 | 3457.542 | 24.918 | 0.000 | 26552.158 | 1038.088 | 1215.344 | Welch Two Sample t-test | two.sided |
Use what you’ve learned to do the following:
path <- getwd()
. You can then use the function file.path
to paste together your directory with a file you’re reading (importing) or writing (exporting): e.g. read_csv(file.path(path, "test.csv"))
. The advantage here is that file.path
will take care of the Mac/PC/Other OS differences in slash defaults.t_results <- t.test(df$var1 ~ df$var2))
) because it makes it easier access individuals elements (using the $
) or tidy/export the output.If you’re stuck, you can reach out to me - I probably won’t have an answer for you off-hand, but I’m certainly willing to look into things with you! Email: dcushni@uwo.ca
Bivariate correlation between carat and price:
# Base R - very simple, only gives the r value
cor(diamonds$carat, diamonds$price)
## [1] 0.92159
cor(diamonds$carat, diamonds$price, method = "spearman")
## [1] 0.96288
# from Hmisc - also works on a matrix; includes p-values
rcorr(diamonds$carat, diamonds$price)
## x y
## x 1.00 0.92
## y 0.92 1.00
##
## n= 53940
##
##
## P
## x y
## x 0
## y 0
rcorr(diamonds$carat, diamonds$price, type = "spearman")
## x y
## x 1.00 0.96
## y 0.96 1.00
##
## n= 53940
##
##
## P
## x y
## x 0
## y 0
rcorr(as.matrix(select(diamonds, -cut, -color, -clarity)))
## carat depth table price x y z
## carat 1.00 0.03 0.18 0.92 0.98 0.95 0.95
## depth 0.03 1.00 -0.30 -0.01 -0.03 -0.03 0.09
## table 0.18 -0.30 1.00 0.13 0.20 0.18 0.15
## price 0.92 -0.01 0.13 1.00 0.88 0.87 0.86
## x 0.98 -0.03 0.20 0.88 1.00 0.97 0.97
## y 0.95 -0.03 0.18 0.87 0.97 1.00 0.95
## z 0.95 0.09 0.15 0.86 0.97 0.95 1.00
##
## n= 53940
##
##
## P
## carat depth table price x y z
## carat 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## depth 0.0000 0.0000 0.0134 0.0000 0.0000 0.0000
## table 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## price 0.0000 0.0134 0.0000 0.0000 0.0000 0.0000
## x 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## y 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## z 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
Predicting price as a function of carat:
price_carat <- lm(diamonds$price ~ diamonds$carat)
summary(price_carat)
##
## Call:
## lm(formula = diamonds$price ~ diamonds$carat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18585 -805 -19 537 12732
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2256.4 13.1 -173 <2e-16 ***
## diamonds$carat 7756.4 14.1 551 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1550 on 53938 degrees of freedom
## Multiple R-squared: 0.849, Adjusted R-squared: 0.849
## F-statistic: 3.04e+05 on 1 and 53938 DF, p-value: <2e-16
Note: If you want your outputted values to match SPSS, you need to use the same contrasts as SPSS. This is discussed here. To change your contrasts, do the following.
options(contrasts = c("contr.sum", "contr.poly"))
One-Way ANOVA comparing mean price between all diamond cuts:
cut_price <- lm(price ~ cut, data = diamonds)
cut_price_aov <- aov(cut_price)
summary(cut_price_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## cut 4 11041745359 2760436340 176 <2e-16 ***
## Residuals 53935 847431390159 15712087
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Post-hoc with Bonferroni Correction
pairwise.t.test(diamonds$price, diamonds$cut, p.adj = "bonferroni")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: diamonds$price and diamonds$cut
##
## Fair Good Very Good Premium
## Good 0.002 - - -
## Very Good 0.003 1.000 - -
## Premium 0.308 < 2e-16 < 2e-16 -
## Ideal < 2e-16 0.0000000000006 < 2e-16 < 2e-16
##
## P value adjustment method: bonferroni
Two-Way ANOVA comparing mean price as a function of diamond cut and clarity:
cut_clarity_price <- lm(price ~ cut*clarity, data = diamonds)
cut_clarity_price_aov <- aov(cut_clarity_price)
summary(cut_clarity_price_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## cut 4 11041745359 2760436340 180.16 <2e-16 ***
## clarity 7 18909811326 2701401618 176.30 <2e-16 ***
## cut:clarity 28 2646560835 94520030 6.17 <2e-16 ***
## Residuals 53900 825875017997 15322357
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Post-hoc with Bonferroni correction
# This print-out would be very long - showing you cut element as a sample
cut_clarity_price_posthoc <- TukeyHSD(cut_clarity_price_aov)
cut_clarity_price_posthoc$cut
## diff lwr upr p adj
## Good-Fair -429.893 -736.573 -123.214 0.001240718749947023
## Very Good-Fair -376.998 -660.282 -93.714 0.002624399706012026
## Premium-Fair 225.500 -55.713 506.713 0.184389469834228326
## Ideal-Fair -901.216 -1177.085 -625.347 0.000000000000035749
## Very Good-Good 52.895 -127.867 233.658 0.931207257351990947
## Premium-Good 655.393 477.894 832.892 0.000000000000055400
## Ideal-Good -471.322 -640.228 -302.417 0.000000000000301092
## Premium-Very Good 602.498 469.444 735.552 0.000000000000000000
## Ideal-Very Good -524.218 -645.571 -402.865 0.000000000000000000
## Ideal-Premium -1126.716 -1243.151 -1010.280 0.000000000000000000
Running SPSS Explore on Price (limited screenshots because this is long output!):
Note: Kurtosis is calculated differently between SPSS and R. Different kurtosis calculation methods are discussed here.
# Simple investigation of price
diamonds %>%
summarize(n = n(), n_missing = sum(is.na(price)), mean = mean(price, na.rm = TRUE), sd = sd(price, na.rm = TRUE), min = min(price, na.rm = TRUE), max = max(price, na.rm = TRUE), median = median (price, na.rm = TRUE), IQR = IQR(price, na.rm = TRUE), variance = var(price, na.rm = TRUE), trimmed = mean(price, na.rm = TRUE, trim = 0.05), skewness = skewness(price, na.rm = TRUE), kurtosis = kurtosis(price, na.rm = TRUE)) %>%
mutate(range = max - min) %>%
mutate(stderror = sd/sqrt(n)) %>%
mutate(confint_lower = mean-2*stderror) %>%
mutate(confint_upper = mean+2*stderror)
## # A tibble: 1 x 16
## n n_missing mean sd min max median IQR variance trimmed
## <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 53940 0 3933. 3989. 326 18823 2401 4374. 1.59e7 3471.
## # ... with 6 more variables: skewness <dbl>, kurtosis <dbl>, range <dbl>,
## # stderror <dbl>, confint_lower <dbl>, confint_upper <dbl>
hist(diamonds$price)
boxplot(diamonds$price)
# Price grouped by cut
diamonds %>%
group_by(cut) %>%
summarize(n = n(), n_missing = sum(is.na(price)), mean = mean(price, na.rm = TRUE), sd = sd(price, na.rm = TRUE), min = min(price, na.rm = TRUE), max = max(price, na.rm = TRUE), median = median (price, na.rm = TRUE), IQR = IQR(price, na.rm = TRUE), variance = var(price, na.rm = TRUE), trimmed = mean(price, na.rm = TRUE, trim = 0.05), skewness = skewness(diamonds$price, na.rm = TRUE), kurtosis = kurtosis(diamonds$price, na.rm = TRUE)) %>%
mutate(range = max - min) %>%
mutate(stderror = sd/sqrt(n)) %>%
mutate(confint_lower = mean-2*stderror) %>%
mutate(confint_upper = mean+2*stderror)
## # A tibble: 5 x 17
## cut n n_missing mean sd min max median IQR variance
## <ord> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Fair 1610 0 4359. 3560. 337 18574 3282 3155. 1.27e7
## 2 Good 4906 0 3929. 3682. 327 18788 3050. 3883 1.36e7
## 3 Very~ 12082 0 3982. 3936. 336 18818 2648 4461. 1.55e7
## 4 Prem~ 13791 0 4584. 4349. 326 18823 3185 5250 1.89e7
## 5 Ideal 21551 0 3458. 3808. 326 18806 1810 3800. 1.45e7
## # ... with 7 more variables: trimmed <dbl>, skewness <dbl>,
## # kurtosis <dbl>, range <dbl>, stderror <dbl>, confint_lower <dbl>,
## # confint_upper <dbl>
ggplot(diamonds, aes(x = price)) +
geom_histogram(color = "black", fill = "white") +
facet_grid(cut ~.)
ggplot(diamonds, aes(x = cut, y = price)) +
geom_boxplot(color = "black", fill = "white")
Working with questionnaire data offers unique challenges. For those of you who are into that, here are some quick tips on how to manage this kind of data.
We’re going to use the bfi
dataset - documentation can tell you lots about it! The bfi
documentation actually includes a sample of how to use an item key and score negatively keyed items - so this section is us working through the provided sample. In SPSS, you would negatively key items using Transform.
Briefly, bfi
is a set of 25 personality self-report items from 2800 subjects, plus demogrpahics on gender, education, and age. The 25 items are oranized around 5 factors: (A) Agreeableness, (C) Conscientiousness, (E) Extraversion, (N) Neuroticism, (O) Openness. As is common with questionnaire data, some of these items are negatively keyed (ie. high scores on the question represent low levels of the dimension measured). This is easy to do in R! To make a keys.list, define vectors for each factor (ie. Agreeableness) and list the included questions by name. Negatively keyed variables are prefaced with a -.
data(bfi)
glimpse(bfi)
## Observations: 2,800
## Variables: 28
## $ A1 <int> 2, 2, 5, 4, 2, 6, 2, 4, 4, 2, 4, 2, 5, 5, 4, 4, 4, 5...
## $ A2 <int> 4, 4, 4, 4, 3, 6, 5, 3, 3, 5, 4, 5, 5, 5, 5, 3, 6, 5...
## $ A3 <int> 3, 5, 5, 6, 3, 5, 5, 1, 6, 6, 5, 5, 5, 5, 2, 6, 6, 5...
## $ A4 <int> 4, 2, 4, 5, 4, 6, 3, 5, 3, 6, 6, 5, 6, 6, 2, 6, 2, 4...
## $ A5 <int> 4, 5, 4, 5, 5, 5, 5, 1, 3, 5, 5, 5, 4, 6, 1, 3, 5, 5...
## $ C1 <int> 2, 5, 4, 4, 4, 6, 5, 3, 6, 6, 4, 5, 5, 4, 5, 5, 4, 5...
## $ C2 <int> 3, 4, 5, 4, 4, 6, 4, 2, 6, 5, 3, 4, 4, 4, 5, 5, 4, 5...
## $ C3 <int> 3, 4, 4, 3, 5, 6, 4, 4, 3, 6, 5, 5, 3, 4, 5, 5, 4, 5...
## $ C4 <int> 4, 3, 2, 5, 3, 1, 2, 2, 4, 2, 3, 4, 2, 2, 2, 3, 4, 4...
## $ C5 <int> 4, 4, 5, 5, 2, 3, 3, 4, 5, 1, 2, 5, 2, 1, 2, 5, 4, 3...
## $ E1 <int> 3, 1, 2, 5, 2, 2, 4, 3, 5, 2, 1, 3, 3, 2, 3, 1, 1, 2...
## $ E2 <int> 3, 1, 4, 3, 2, 1, 3, 6, 3, 2, 3, 3, 3, 2, 4, 1, 2, 2...
## $ E3 <int> 3, 6, 4, 4, 5, 6, 4, 4, NA, 4, 2, 4, 3, 4, 3, 6, 5, ...
## $ E4 <int> 4, 4, 4, 4, 4, 5, 5, 2, 4, 5, 5, 5, 2, 6, 6, 6, 5, 6...
## $ E5 <int> 4, 3, 5, 4, 5, 6, 5, 1, 3, 5, 4, 4, 4, 5, 5, 4, 5, 6...
## $ N1 <int> 3, 3, 4, 2, 2, 3, 1, 6, 5, 5, 3, 4, 1, 1, 2, 4, 4, 6...
## $ N2 <int> 4, 3, 5, 5, 3, 5, 2, 3, 5, 5, 3, 5, 2, 1, 4, 5, 4, 5...
## $ N3 <int> 2, 3, 4, 2, 4, 2, 2, 2, 2, 5, 4, 3, 2, 1, 2, 4, 4, 5...
## $ N4 <int> 2, 5, 2, 4, 4, 2, 1, 6, 3, 2, 2, 2, 2, 2, 2, 5, 4, 4...
## $ N5 <int> 3, 5, 3, 1, 3, 3, 1, 4, 3, 4, 3, NA, 2, 1, 3, 5, 5, ...
## $ O1 <int> 3, 4, 4, 3, 3, 4, 5, 3, 6, 5, 5, 4, 4, 5, 5, 6, 5, 5...
## $ O2 <int> 6, 2, 2, 3, 3, 3, 2, 2, 6, 1, 3, 6, 2, 3, 2, 6, 1, 1...
## $ O3 <int> 3, 4, 5, 4, 4, 5, 5, 4, 6, 5, 5, 4, 4, 4, 5, 6, 5, 4...
## $ O4 <int> 4, 3, 5, 3, 3, 6, 6, 5, 6, 5, 6, 5, 5, 4, 5, 3, 6, 5...
## $ O5 <int> 3, 3, 2, 5, 3, 1, 1, 3, 1, 2, 3, 4, 2, 4, 5, 2, 3, 4...
## $ gender <int> 1, 2, 2, 2, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1...
## $ education <int> NA, NA, NA, NA, NA, 3, NA, 2, 1, NA, 1, NA, NA, NA, ...
## $ age <int> 16, 18, 17, 17, 17, 21, 18, 19, 19, 17, 21, 16, 16, ...
keys.list <- list(agree=c("-A1","A2","A3","A4","A5"),conscientious=c("C1","C2","C3","-C4","-C5"),
extraversion=c("-E1","-E2","E3","E4","E5"),neuroticism=c("N1","N2","N3","N4","N5"),
openness = c("O1","-O2","O3","O4","-O5"))
# scoreItems gives a ton of info about items!
scores <- scoreItems(keys.list,bfi,min=1,max=6) # Need to specify the minimum and maximum values
scores
## Call: scoreItems(keys = keys.list, items = bfi, min = 1, max = 6)
##
## (Unstandardized) Alpha:
## agree conscientious extraversion neuroticism openness
## alpha 0.7 0.72 0.76 0.81 0.6
##
## Standard errors of unstandardized Alpha:
## agree conscientious extraversion neuroticism openness
## ASE 0.014 0.014 0.013 0.011 0.017
##
## Average item correlation:
## agree conscientious extraversion neuroticism openness
## average.r 0.32 0.34 0.39 0.46 0.23
##
## Median item correlation:
## agree conscientious extraversion neuroticism openness
## 0.34 0.34 0.38 0.41 0.22
##
## Guttman 6* reliability:
## agree conscientious extraversion neuroticism openness
## Lambda.6 0.7 0.72 0.76 0.81 0.6
##
## Signal/Noise based upon av.r :
## agree conscientious extraversion neuroticism openness
## Signal/Noise 2.3 2.6 3.2 4.3 1.5
##
## Scale intercorrelations corrected for attenuation
## raw correlations below the diagonal, alpha on the diagonal
## corrected correlations above the diagonal:
## agree conscientious extraversion neuroticism openness
## agree 0.70 0.36 0.63 -0.245 0.23
## conscientious 0.26 0.72 0.35 -0.305 0.30
## extraversion 0.46 0.26 0.76 -0.284 0.32
## neuroticism -0.18 -0.23 -0.22 0.812 -0.12
## openness 0.15 0.19 0.22 -0.086 0.60
##
## In order to see the item by scale loadings and frequency counts of the data
## print with the short option = FALSE
# Call str() on scores to see all the pieces of information you can access
# individually! The *scores* element of scores includes the re-keyed items
# Summarise from dplyr can help you investigate these scores.
keyed_scores <- data.frame(scores$scores)
keyed_scores %>%
summarise_all(mean)
## agree conscientious extraversion neuroticism openness
## 1 4.6531 4.2694 4.146 3.162 4.591
# If you wanted to group by a variable (such as gender) - remember that
# we dropped demographics when we used keys.list. We can get those back!
# According to the documentation, males = 1 and females =2
keyed_scores <- cbind(gender = bfi$gender, keyed_scores)
keyed_scores %>%
group_by(gender) %>%
summarise_all(mean)
## # A tibble: 2 x 6
## gender agree conscientious extraversion neuroticism openness
## <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 4.39 4.14 3.99 2.95 4.66
## 2 2 4.78 4.33 4.22 3.27 4.56
# Using dplyr::select to help you subset to look at individual pieces of the data
open <- bfi %>%
select(starts_with("O"))
Dummy coding is useful and not exclusive to questionnaire data. This is very simple in R. Let’s say we wanted to look at the relationship between education and openness.
# Factor the variable first mostly to re-label nicely
factored_edu <- factor(bfi$education, labels = c("someHS", "doneHS", "someCollege", "doneCollege", "graduate"), ordered = TRUE, exclude = NA)
edu <- dummy.code(factored_edu)
# Your dummy coded variables are now an island!
# Bind these dummy coded education variables to the openness df
# we made earlier with dplyr::select.
open_edu <- cbind(edu, open)
Thanks for reading/listening!