2.1 SPSS, SAS and Stata

SPSS, SAS and Stata don’t provide public documentation of their file formats, so we rely on a few good eggs who have reverse engineered those formats to be able to read the files directly into R. The two most widely used R packages for accessing these datasets are haven and foreign.

2.1.1 haven

Haven enables R to read and write various data formats used by other statistical packages by wrapping the fantastic ReadStat C library written by Evan Miller. Haven is part of the tidyverse.

https://haven.tidyverse.org/

Pros

  • Labelled datasets!

  • Good support for recent file formats, and good translation of data types into appropriate R classes.

  • Supports writing as well as reading.

Cons

  • For the non-tidyverse fans - deeply embedded in the tidyverse way of doing things.

  • Somewhat stable, but still has the occasional breaking change.

Examples

The example below reads the GSS dataset in both SPSS and Stata formats.

Note the user_na = TRUE argument when reading the SPSS file. By default, read_sav() converts user-defined missing values to NA in R; setting user_na = TRUE retains the original values. We’ll go into more detail on working with tagged missing values in section 3.1.2.

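install.packages("haven")  # only needed once
library(haven)

# SPSS: keep user-defined missing values instead of converting them to NA
gss <- read_sav("data/gss/GSS2018.sav", user_na = TRUE)

# Stata: the same dataset, read from the .dta version
gss <- read_stata("data/gss/GSS2018.dta")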
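
To see what user_na changes without the GSS file to hand, the sketch below builds a tiny SPSS file of its own and reads it back both ways. The variable q1, the code 9 for "Refused" and the temporary file are all made up for illustration; labelled_spss(), write_sav(), read_sav() and as_factor() are haven functions, and tibble is installed alongside haven.

```r
library(haven)

# Hypothetical survey item: 1 = Yes, 2 = No, 9 = Refused.
# The code 9 is declared as a user-defined missing value.
q1 <- labelled_spss(
  c(1, 2, 9, 1),
  labels = c(Yes = 1, No = 2, Refused = 9),
  na_values = 9,
  label = "Made-up example question"
)
toy <- tibble::tibble(id = c(1, 2, 3, 4), q1 = q1)

# Write the toy data to a temporary .sav file (haven can write as well as read)
path <- tempfile(fileext = ".sav")
write_sav(toy, path)

# Default: user-defined missing values are converted to NA
default <- read_sav(path)

# user_na = TRUE: the original code is kept and flagged as missing
tagged <- read_sav(path, user_na = TRUE)

sum(is.na(default$q1))  # count of values that became NA
unclass(tagged$q1)      # underlying codes, with the label attributes
as_factor(tagged$q1)    # value labels converted to an R factor
```

Reading the same file twice like this is only a way of comparing the two behaviours side by side; in practice you would pick one setting depending on whether the reason for missingness matters to your analysis.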

2.1.2 foreign

The foreign package reads and writes data stored by some versions of ‘Epi Info’, ‘Minitab’, ‘S’, ‘SAS’, ‘SPSS’, ‘Stata’, ‘Systat’ and ‘Weka’, and also reads and writes some ‘dBase’ files.

https://cran.r-project.org/package=foreign

Pros

  • Extremely stable.

  • Supported and developed by the R Core Team.

  • Supports additional file formats not supported by haven.

Cons

  • Splits long character variables into multiple 255-character variables.

  • Inconsistent support for newer file formats (e.g. no support for Stata after version 12).

  • Difficult to use categorical labels without converting to factors.

Examples

The example below reads the GSS dataset in both SPSS and Stata formats.

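install.packages("foreign")  # only needed once
library(foreign)
library(magrittr)  # provides %>%
library(tibble)    # provides as_tibble()

# SPSS: keep the raw codes rather than converting labelled values to factors
gss <- read.spss("data/gss/GSS2018.sav", use.value.labels = FALSE) %>%
  as_tibble()

# Stata: likewise, keep the raw codes
gss <- read.dta("data/gss/GSS2018.dta", convert.factors = FALSE) %>%
  as_tibble()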
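
The effect of convert.factors can be seen on a small self-made Stata file, as in the sketch below. The data frame toy, its columns and the temporary file are invented for illustration; write.dta() and read.dta() are foreign functions.

```r
library(foreign)

# Hypothetical data with one categorical variable
toy <- data.frame(
  id = 1:3,
  answer = factor(c("Yes", "No", "Yes"), levels = c("Yes", "No"))
)

# Write to a temporary .dta file; factors are stored as codes plus value labels
path <- tempfile(fileext = ".dta")
write.dta(toy, path)

# Default: labelled variables come back as factors
as_factors <- read.dta(path)

# convert.factors = FALSE: the underlying numeric codes are returned instead
as_codes <- read.dta(path, convert.factors = FALSE)

class(as_factors$answer)
class(as_codes$answer)
```

With convert.factors = FALSE the labels are no longer attached to the column itself, which is the difficulty noted in the cons above: you have to look them up separately in the attributes of the returned data frame.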