2.1 SPSS, SAS and Stata

SPSS, SAS and Stata don’t provide public documentation of their file formats, so we rely on a few good eggs reverse engineering the file formats to be able to read these dirrectly into R. The two most used R packages for accessing these datasets are haven and foreign.

2.1.1 haven

Haven enables R to read and write various data formats used by other statistical packages by wrapping the fantastic ReadStat C library written by Evan Miller. Haven is part of the tidyverse.

— https://haven.tidyverse.org/

Pros

Labelled datasets!
Good support for recent file formats, and good translation of data types into appropriate R classes.
Supports writing as well as reading.

Cons

For the non-tidyverse fans - deeply embedded in the tidyverse way of doing things.
Somewhat stable, but still has the occasional breaking change.

Examples

The example below reads the GSS dataset in both SPSS and Stata formats.

Note the inclusion of the user_na = TRUE for reading SPSS files. By default read_sav() converts user tagged NA values to NA in R - setting user_na = TRUE retains these values. We’ll go into more detail on working with tagged missing values in section 3.1.2.

install.packages("haven")
library(haven)

gss <- read_sav("data/gss/GSS2018.sav", user_na = TRUE)

gss <- read_stata("data/gss/GSS2018.dta")

2.1.2 foreign

Reading and writing data stored by some versions of ‘Epi Info’, ‘Minitab’, ‘S’, ‘SAS’, ‘SPSS’, ‘Stata’, ‘Systat’, ‘Weka’, and for reading and writing some ‘dBase’ files.

— https://cran.r-project.org/package=foreign

Pros

Extremely stable.
Supported and developed by the R Core Team.
Supports additional file formats not supported by haven.

Cons

Splits long character variables into 255 character variables.
Inconsistent support for newer file formats (e.g. no support for Stata after version 12).
Difficult to use categorical labels without converting to factors.

Examples

The example below reads the GSS dataset in both SPSS and Stata formats.

install.packages("foreign")
library(foreign)

gss <- read.spss("data/gss/GSS2018.sav", use.value.labels = FALSE) %>%
  as_tibble()

gss <- read.dta("data/gss/GSS2018.dta", convert.factors = FALSE) %>%
  as_tibble()