4.2 dbplyr

dbplyr is an extension package for dplyr that allows us to access database tables as if they were R data frames. This essentially means we can access a database without directly bulding SQL queries in our code, which can make working with database tables much cleaner and easier to follow.

There are some simple examples below to give you the gist, but for a more detailed rundown we’d recommend the Introduction to dbplyr vignette.

4.2.1 Accessing the database

Since these examples are intended for public consumption we can’t use a “real” external database. Fortunately, since DBI is a generic frontend we can use any backend for demonstration purposes.

First, we’ll create an SQLite database in memory from the gss dataset. Note that this will lose all the label metadata, since SQL tables do not support attributes or extended R classes.

If you’re using a different type of database, the dbConnect() call will look a bit different - this function specifies the driver and connection parameters for the database we’re connecting to.

gss <- haven::read_sav("data/gss/GSS2018.sav", user_na = TRUE)

# Connect to a temporary SQLite database in memory
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Write the gss dataset to a new table "gss" in the database
DBI::dbListTables(con)
#> character(0)
DBI::dbWriteTable(con, "gss", gss)
DBI::dbListTables(con)
#> [1] "gss"

dplyr allows us to reference a database table as a tibble using the tbl() function. Once we’re connected to a database, it’s simple to create a reference to a SQL table with the tbl() function, using this database connection and the name of the table.

Since a DBI connection (our con object) is used in the tbl() function, dplyr knows to use the database functionality provided by dbplyr and DBI to access this data.

library(dplyr, warn.conflicts = FALSE)

gss_db <- tbl(con, "gss")
gss_db
#> # Source:   table<gss> [?? x 1,065]
#> # Database: sqlite 3.33.0 [:memory:]
#>    ABANY ABDEFECT ABFELEGL ABHELP1 ABHELP2 ABHELP3 ABHELP4 ABHLTH ABINSPAY
#>    <dbl>    <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>    <dbl>
#>  1     2        1        0       1       1       1       1      1        1
#>  2     1        1        3       2       2       2       2      1        2
#>  3     0        0        0       1       2       1       1      0        2
#>  4     0        0        1       1       1       1       1      0        1
#>  5     2        1        0       2       2       2       1      1        2
#>  6     1        1        1       1       1       1       1      1        1
#>  7     1        1        3       1       2       1       1      1        1
#>  8     2        1        0       1       2       1       1      1        9
#>  9     0        0        3       1       1       1       1      0        1
#> 10     0        0        0       1       2       2       1      0        2
#> # … with more rows, and 1,056 more variables: ABMEDGOV1 <dbl>, ABMEDGOV2 <dbl>,
#> #   ABMELEGL <dbl>, ABMORAL <dbl>, ABNOMORE <dbl>, ABPOOR <dbl>, ABPOORW <dbl>,
#> #   ABRAPE <dbl>, ABSINGLE <dbl>, ABSTATE1 <dbl>, ABSTATE2 <dbl>,
#> #   ACQNTSEX <dbl>, ACTSSOC <dbl>, ADMINCONSENT <dbl>, ADULTS <dbl>,
#> #   ADVFRONT <dbl>, AFFRMACT <dbl>, AFRAIDOF <dbl>, AFTERLIF <dbl>, AGE <dbl>,
#> #   AGED <dbl>, AGEKDBRN <dbl>, ANCESTRS <dbl>, ARTHRTIS <dbl>, ASTROLGY <dbl>,
#> #   ASTROSCI <dbl>, ATHEISTS <dbl>, ATTEND <dbl>, ATTEND12 <dbl>,
#> #   ATTENDMA <dbl>, ATTENDPA <dbl>, AWAY1 <dbl>, AWAY11 <dbl>, AWAY2 <dbl>,
#> #   AWAY3 <dbl>, AWAY4 <dbl>, AWAY5 <dbl>, AWAY6 <dbl>, AWAY7 <dbl>,
#> #   BABIES <dbl>, BACKPAIN <dbl>, BALLOT <dbl>, BALNEG <dbl>, BALPOS <dbl>,
#> #   BEFAIR <dbl>, BETRLANG <dbl>, BIBLE <dbl>, BIGBANG <dbl>, BIGBANG1 <dbl>,
#> #   BIGBANG2 <dbl>, BIRD <dbl>, BIRDB4 <dbl>, BORN <dbl>, BOYORGRL <dbl>,
#> #   BREAKDWN <dbl>, BUDDHSTS <dbl>, BUYESOP <dbl>, BUYVALUE <dbl>,
#> #   CANTRUST <dbl>, CAPPUN <dbl>, CAT <dbl>, CATB4 <dbl>, CHARACTR <dbl>,
#> #   CHEMGEN <dbl>, CHILDS <dbl>, CHLDIDEL <dbl>, CHRISTNS <dbl>,
#> #   CHURHPOW <dbl>, CLASS <dbl>, CLERGVTE <dbl>, CLOSETO1 <dbl>,
#> #   CLOSETO2 <dbl>, CLOSETO3 <dbl>, CLOSETO4 <dbl>, CLOSETO5 <dbl>,
#> #   CNTCTFAM <dbl>, CNTCTFRD <dbl>, CNTCTKID <dbl>, CNTCTPAR <dbl>,
#> #   CNTCTSIB <dbl>, CODEG <dbl>, CODEN <dbl>, COEDUC <dbl>, COEVWORK <dbl>,
#> #   COFUND <dbl>, COHORT <dbl>, COHRS1 <dbl>, COHRS2 <dbl>, COIND10 <dbl>,
#> #   COISCO08 <dbl>, COJew <dbl>, COLATH <dbl>, COLCOM <dbl>, COLDEG1 <dbl>,
#> #   COLHOMO <dbl>, COLMIL <dbl>, COLMSLM <dbl>, COLRAC <dbl>, COLSCI <dbl>,
#> #   COLSCINM <dbl>, …

On first glance gss_db acts and looks like a tibble, but it’s actually quite different. You’ll notice in the example above that the data source is listed in the header, but with an unknown (??) number of rows:

# Source:   table<gss> [?? x 1,065]
# Database: sqlite 3.33.0 [:memory:]

The gss_db object is not actually a data frame, but a database connection that pretends to be one. If we look at the class list we’ll see that the original gss dataset is a data frame underneath the tibble, but our database connection is something else.

class(gss)
#> [1] "tbl_df"     "tbl"        "data.frame"

class(gss_db)
#> [1] "tbl_SQLiteConnection" "tbl_dbi"              "tbl_sql"             
#> [4] "tbl_lazy"             "tbl"

So although our gss_db object looks like a data frame, it isn’t one. Operations that you can run on a normal data frame won’t necessarily work on a database tibble.

table(gss$INCOME)
#> 
#>    1    2    3    4    5    6    7    8    9   10   11   12   13   98 
#>   37   28   19   14   14    9   13   42  142   81  140 1444  274   91

table(gss_db$INCOME)
#> < table of extent 0 >

4.2.2 Using dplyr verbs

Although we can’t use this object exactly like a normal data frame, dplyr is “database aware” via dbplyr. This means that we can use the standard dplyr verbs (select(), mutate(), group_by(), summarise(), etc.) on our gss_db object to perform processing on the SQL table.

Instead of immediately processing anything, dplyr builds the operations into a SQL query that will be run on the remote database. We can view the underlying SQL query using show_query().

income <- gss_db %>%
  group_by(HELPSICK) %>%
  summarise(REALINC_AVG = mean(REALINC, na.rm = TRUE))

show_query(income)
#> <SQL>
#> SELECT `HELPSICK`, AVG(`REALINC`) AS `REALINC_AVG`
#> FROM `gss`
#> GROUP BY `HELPSICK`

At this point we still haven’t really done any processing. Similar to the gss_db object, the income object we just created isn’t actually a data frame, but a “lazy query” that stores definitional data prior to processing. When we print the object, it presents a preview of the results.

To run the query we call the collect() function, which returns the results as a standard local tibble.

# Note that this prints as a "lazy query"
income
#> # Source:   lazy query [?? x 2]
#> # Database: sqlite 3.33.0 [:memory:]
#>   HELPSICK REALINC_AVG
#>      <dbl>       <dbl>
#> 1        0      30993.
#> 2        1      25532.
#> 3        2      35434.
#> 4        3      32190.
#> 5        4      41244.
#> 6        5      30856.
#> 7        8      17339.
#> 8        9      26389.

collect(income)
#> # A tibble: 8 x 2
#>   HELPSICK REALINC_AVG
#>      <dbl>       <dbl>
#> 1        0      30993.
#> 2        1      25532.
#> 3        2      35434.
#> 4        3      32190.
#> 5        4      41244.
#> 6        5      30856.
#> 7        8      17339.
#> 8        9      26389.

Not all functions can be translated to SQL, and not all processing can be done using dplyr verbs. By using the collect() function in a pipe we can easily combine remote and local processing in a familiar R native way.

This is particularly powerful when working with large datasets that may not be possible to load into memory in R - every operation before the collect() is processed by the SQL database, so operations like selecting, filtering and aggregating can take advantage of the SQL infrastructure.

library(tidyr)

gss_db %>%
  group_by(HELPSICK, HEALTH) %>%
  summarise(N = n()) %>%
  collect() %>%
  pivot_wider(names_from = "HEALTH", names_sort = TRUE, values_from = "N")
#> # A tibble: 8 x 7
#> # Groups:   HELPSICK [8]
#>   HELPSICK   `0`   `1`   `2`   `3`   `4`   `8`
#>      <dbl> <int> <int> <int> <int> <int> <int>
#> 1        0    NA   177   391   171    45     1
#> 2        1   250    52   132    73    20     1
#> 3        2   138    44    79    40     5    NA
#> 4        3   242    56   101    44     3     2
#> 5        4    62    17    34    12     4    NA
#> 6        5    63     9    28    10     7    NA
#> 7        8    19     4     6     3    NA     1
#> 8        9    NA    NA    NA     2    NA    NA

4.2.3 Disclaimer

There’s one big caveat to this approach - it’s not always possible to translate functions to SQL code, and some SQL backends are more robust than others. For simple operations you shouldn’t have any problems, but we highly recommend reading the dbplyr vignettes for a better understanding of how this translation works.

If you’re having strange issues preforming certain operations, using the show_query() command to see what dbplyr is actually trying to do is the best first debugging step.