2.1 SPSS, SAS and Stata
SPSS, SAS and Stata don’t provide public documentation of their file formats, so we rely on a few good eggs who have reverse engineered those formats to be able to read these files directly into R. The two most widely used R packages for accessing these datasets are haven and foreign.
2.1.1 haven
Haven enables R to read and write various data formats used by other statistical packages by wrapping the fantastic ReadStat C library written by Evan Miller. Haven is part of the tidyverse.
Pros
Labelled datasets!
Good support for recent file formats, and good translation of data types into appropriate R classes.
Supports writing as well as reading.
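To make the first point concrete, here is a minimal sketch of what a labelled variable looks like in haven. The vector, its codes and its labels are invented for illustration and are not taken from any dataset used in this section.

```r
library(haven)

# A hypothetical survey variable: numeric codes with value labels attached
sex <- labelled(
  c(1, 2, 2, 1),
  labels = c(Male = 1, Female = 2),
  label = "Sex of respondent"
)

# The underlying codes and the labels travel together
print(sex)

# Convert to a factor when the labels are wanted as categories
sex_factor <- as_factor(sex)

# Or drop the labels and keep only the numeric codes
sex_codes <- zap_labels(sex)
```

Keeping codes and labels together means the choice between factor and numeric can be deferred until analysis time.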
Cons
For the non-tidyverse fans - deeply embedded in the tidyverse way of doing things.
Somewhat stable, but still has the occasional breaking change.
Examples
The example below reads the GSS dataset in both SPSS and Stata formats.
Note the inclusion of user_na = TRUE when reading SPSS files. By default read_sav() converts user-tagged missing values to NA in R - setting user_na = TRUE retains these values. We’ll go into more detail on working with tagged missing values in section 3.1.2.
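The GSS files are not bundled with this text, so the sketch below uses a small stand-in dataset written to temporary files to show the effect of user_na. The variable names, codes and labels are hypothetical; only read_sav(), read_dta(), write_sav(), write_dta() and labelled_spss() come from haven.

```r
library(haven)

# Hypothetical stand-in for a survey variable: 9 is a user-defined missing value
survey <- data.frame(id = 1:4)
survey$happy <- labelled_spss(
  c(1, 2, 9, 1),
  labels = c("Very happy" = 1, "Pretty happy" = 2, "No answer" = 9),
  na_values = 9
)

# Write the data to a temporary SPSS file so there is something to read back
sav_path <- tempfile(fileext = ".sav")
write_sav(survey, sav_path)

# Default behaviour: the user-defined missing value 9 is read as NA
spss_default <- read_sav(sav_path)

# With user_na = TRUE the code 9 is kept and flagged as a missing value
spss_user_na <- read_sav(sav_path, user_na = TRUE)

# Compare the raw values stored under each approach
unclass(spss_default$happy)
unclass(spss_user_na$happy)

# Stata files are read the same way, using read_dta()
dta_path <- tempfile(fileext = ".dta")
write_dta(data.frame(id = 1:4, age = c(23, 35, 47, 59)), dta_path)
stata_data <- read_dta(dta_path)
```

Comparing the raw values with unclass() rather than is.na() matters here, because is.na() also reports user-defined missing values as missing.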
2.1.2 foreign
The foreign package reads and writes data stored by some versions of ‘Epi Info’, ‘Minitab’, ‘S’, ‘SAS’, ‘SPSS’, ‘Stata’, ‘Systat’ and ‘Weka’, and reads and writes some ‘dBase’ files.
Pros
Extremely stable.
Supported and developed by the R Core Team.
Supports additional file formats not supported by haven.
Cons
Splits long character variables into multiple variables of up to 255 characters each.
Inconsistent support for newer file formats (e.g. no support for Stata after version 12).
Difficult to use categorical labels without converting to factors.
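As a rough illustration of the last point, the sketch below round-trips a small invented data frame through a Stata file with foreign. The data and variable names are hypothetical; write.dta(), read.dta() and the convert.factors argument are part of foreign.

```r
library(foreign)

# Hypothetical data with one categorical variable
people <- data.frame(
  id = 1:4,
  sex = factor(c("Male", "Female", "Female", "Male"))
)

# write.dta() writes an older Stata format by default
dta_path <- tempfile(fileext = ".dta")
write.dta(people, dta_path)

# Default: value labels are converted to factor levels
as_factors <- read.dta(dta_path)

# Keep the underlying codes instead of converting to factors
as_codes <- read.dta(dta_path, convert.factors = FALSE)

# The labels are now detached from the column and stored as an attribute
attr(as_codes, "label.table")
```

With convert.factors = FALSE the codes and their labels have to be matched up by hand, which is the inconvenience that haven's labelled class avoids.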