* The `max_results` argument has been deprecated in favor of `n_max`, which reflects what we actually do with this number and is consistent with the `n_max` argument used elsewhere.
* `page_size` is no longer fixed and, instead, is determined empirically. Users are strongly recommended to let bigrquery select `page_size` automatically, unless there's a specific reason to do otherwise.
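A minimal sketch of the updated call, with hypothetical project, dataset, and table names; `n_max` caps the rows retrieved while `page_size` is left for bigrquery to determine:

```r
library(bigrquery)

tb <- bq_table("my-project", "my_dataset", "my_table")  # hypothetical identifiers

# Let bigrquery choose page_size; cap the download at 10,000 rows
df <- bq_table_download(tb, n_max = 10000)
```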
* `collect.tbl_BigQueryConnection()` honours the `bigint` field found in a connection object created with `DBI::dbConnect()` and passes `bigint` along to `bq_table_download()`. This improves support for 64-bit integers when reading BigQuery tables with dplyr syntax (@zoews, #439, #437).
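For example, a sketch with hypothetical connection details, showing the `bigint` choice made at connection time carrying through `collect()`:

```r
library(DBI)
library(dplyr)

con <- dbConnect(
  bigrquery::bigquery(),
  project = "my-project",  # hypothetical
  dataset = "my_dataset",  # hypothetical
  bigint  = "integer64"    # ask for 64-bit integers
)

# collect() now passes bigint = "integer64" through to bq_table_download()
ids <- tbl(con, "my_table") %>%
  select(id) %>%
  collect()
```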
* When used with dbplyr >= 2.0.0, ambiguous variables in joins get the suffixes `_x` and `_y` (instead of `.x` and `.y`, which don't work with BigQuery) (#403).
* `bq_table_download()` works once again with large row counts (@gjuggler, #395). Google's API has stopped accepting `startIndex` parameters in scientific notation, which is how R formats large values (> 1e5) by default.
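The underlying gotcha, as a small base-R illustration (not bigrquery code): coercing a large double to character falls back to scientific notation, so the index must be formatted in plain notation instead:

```r
as.character(1e5)
#> [1] "1e+05"   # scientific notation: rejected by the API

format(1e5, scientific = FALSE)
#> [1] "100000"  # plain notation: accepted
```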
* If a `bq_perform_*()` job fails, you now see all the errors, not just the first (#355).
* `bq_perform_query()` can now execute parameterised queries with parameters of `ARRAY` type (@byapparov, #303). Vectors of length > 1 are automatically converted to `ARRAY` type; use `bq_param_array()` to be explicit.
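A minimal sketch, with a hypothetical billing project and table; the `query`, `billing`, and `parameters` arguments are part of `bq_perform_query()`'s interface:

```r
library(bigrquery)

# Pass an ARRAY parameter explicitly with bq_param_array()
job <- bq_perform_query(
  query = "SELECT * FROM my_dataset.events WHERE id IN UNNEST(@ids)",
  billing = "my-project",  # hypothetical billing project
  parameters = list(ids = bq_param_array(c(1L, 2L, 3L)))
)
```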
bigrquery's auth functionality now comes from the gargle package, which provides R infrastructure to work with Google APIs in general. The same transition is underway in several other packages, such as googledrive. This will make user interfaces more consistent and makes two new token flows available in bigrquery.
* Where to learn more: the documentation for `bq_auth()` covers all that most users need.
* OAuth2 tokens are now cached at the user level, by default, instead of in `.httr-oauth` in the current project. The default OAuth app has also changed. This means you will need to re-authorize bigrquery (i.e. get a new token). You may want to delete any vestigial `.httr-oauth` files lying around your bigrquery projects.
* The OAuth2 token key-value store now incorporates the associated Google user when indexing, which makes it easier to switch between Google identities.
* `bq_user()` is a new function that reveals the email of the user associated with the current token.
* If you previously used `set_service_token()` to use a service account token, it still works, but you'll get a deprecation warning. Switch over to `bq_auth(path = "/path/to/your/service-account.json")`. Several other functions are similarly soft-deprecated.
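A brief sketch of the recommended replacement, using a placeholder path:

```r
library(bigrquery)

# Service account auth, replacing the deprecated set_service_token()
bq_auth(path = "/path/to/your/service-account.json")

# Check which identity the current token belongs to
bq_user()
```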
* `bq_table_download()` and the `DBI::dbConnect()` method now have a `bigint` argument which governs how BigQuery integer columns are imported into R. As before, the default is `bigint = "integer"`. You can set `bigint = "integer64"` to import BigQuery integer columns as `bit64::integer64` columns in R, which allows for values outside the 32-bit integer range (±2147483647) (@rasmusab, #94).
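A short sketch with hypothetical identifiers, showing both places `bigint` can be supplied:

```r
library(bigrquery)

# At download time
tb <- bq_table("my-project", "my_dataset", "my_table")  # hypothetical
df <- bq_table_download(tb, bigint = "integer64")

# Or once, at connection time
con <- DBI::dbConnect(
  bigquery(),
  project = "my-project",
  dataset = "my_dataset",
  bigint  = "integer64"
)
```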
* Jobs now print their ids while running (#252).
* `bq_perform_upload()` will only autodetect a schema if the table does not already exist.
* Unparseable date-times now return `NA` (#285).
The system for downloading data from BigQuery into R has been rewritten from the ground up to give considerable improvements in performance and flexibility:
* The two steps, downloading and parsing, now happen in sequence, rather than interleaved. This means that you'll now see two progress bars: one for downloading JSON from BigQuery and one for parsing that JSON into a data frame.
* Downloads now occur in parallel, using up to 6 simultaneous connections by default.
* The parsing code has been rewritten in C++. As well as considerably improving performance, this also adds support for nested (record/struct) and repeated (array) columns (#145). These columns will yield list-columns in the following forms:
  * Repeated values become list-columns containing vectors.
  * Nested values become list-columns containing named lists.
  * Repeated nested values become list-columns containing data frames.
* Results are now returned as tibbles, not data frames, because the base print method does not handle list-columns well.
I can now download the first million rows of `publicdata.samples.natality` in about a minute. This data frame is about 170 MB in BigQuery and 140 MB in R; a minute to download this much data seems reasonable to me. The bottleneck for loading BigQuery data is now parsing BigQuery's JSON format. I don't see any obvious way to make this faster, as I'm already using the fastest C++ JSON parser, RapidJSON. If this is still too slow for you (i.e. you're downloading GBs of data), see `?bq_table_download` for an alternative approach.
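As a rough sketch of that benchmark using today's argument names (the release itself predates the `max_results` to `n_max` rename noted above):

```r
library(bigrquery)

tb <- bq_table("publicdata", "samples", "natality")

# Roughly a minute for the first million rows on the author's setup
df <- bq_table_download(tb, n_max = 1e6)
```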
* `dbConnect()` now allows `dataset` to be omitted; this is natural when you want to use tables from multiple datasets.
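One way to spell the multi-dataset pattern, with hypothetical project, dataset, and table names; tables are then referenced by qualified `dataset.table` names:

```r
library(DBI)

con <- dbConnect(bigrquery::bigquery(), project = "my-project")  # no dataset

orders <- dplyr::tbl(con, "sales.orders")      # qualified dataset.table name
events <- dplyr::tbl(con, "analytics.events")  # a different dataset
```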
The low-level API has been completely overhauled to make it easier to use. The primary motivation was to make bigrquery development more enjoyable for me, but it should also be helpful to you when you need to go outside the features provided by the higher-level DBI and dplyr interfaces. The old API has been soft-deprecated: it will continue to work, but no further development will occur (including bug fixes). It will be formally deprecated in the next version, and then removed in the version after that.
* New constructor functions, including `bq_field()` and `bq_fields()`, create S3 objects corresponding to important BigQuery objects (#150). These are paired with `as_` coercion functions and used throughout the new API.
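A small sketch of the constructors and an `as_` partner; the schema itself is made up:

```r
library(bigrquery)

# Build a schema by hand
flds <- bq_fields(list(
  bq_field("name", "STRING"),
  bq_field("age", "INTEGER")
))

# Or coerce one from an existing data frame
flds2 <- as_bq_fields(data.frame(name = character(), age = integer()))
```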
* Easier local testing: new helpers such as `bq_test_dataset()` make it easier to run bigrquery tests locally. To run the tests yourself, you need to create a BigQuery project and then follow the instructions in the package documentation.
* More efficient data transfer: the new API makes extensive use of the `fields` query parameter, ensuring that functions only download data that they actually use (#153).
* The dplyr interface can work with literal SQL once more (#218).
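For reference, literal SQL can be handed to the dplyr interface by wrapping it in `dplyr::sql()`; connection details here are hypothetical:

```r
library(DBI)
library(dplyr)

con <- dbConnect(bigrquery::bigquery(), project = "my-project", dataset = "my_dataset")

# Use raw SQL as a table source via dplyr::sql()
res <- tbl(con, sql("SELECT state, COUNT(*) AS n FROM my_table GROUP BY state"))
```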
* Request error messages now include the "reason", which can contain useful information for debugging (#209).
* dplyr support has been updated to require dplyr 0.7.0 and use dbplyr. This means that you can now more naturally work directly with DBI connections. dplyr now also uses modern BigQuery SQL, which supports a broader set of translations. Along the way I've also fixed some SQL generation bugs (#48).
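A sketch of the DBI-first workflow this enables, with hypothetical connection details; dbplyr translates the verbs into BigQuery SQL:

```r
library(DBI)
library(dplyr)

con <- dbConnect(
  bigrquery::bigquery(),
  project = "my-project",
  dataset = "my_dataset"
)

tbl(con, "my_table") %>%
  filter(year >= 2000) %>%  # translated to BigQuery SQL by dbplyr
  count(year)
```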
* The DBI driver gets a new name: `bigquery()`.
* `insert_table()` allows you to insert empty tables into a dataset.
* All POST requests (inserts, updates, copies and `query_exec()`) now take `...`. This allows you to add arbitrary additional data to the request body, making it possible to use parts of the BigQuery API that are otherwise not exposed (#149).
* `snake_case` argument names are automatically converted to `camelCase`, so you can stick consistently to snake case in your R code.
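Putting this and the previous bullet together, a hedged sketch using the old API's `query_exec()`; `maximum_billing_tier` is an assumed example of an extra body field (mapping to BigQuery's `maximumBillingTier` query setting), and the project id is hypothetical:

```r
library(bigrquery)

# Extra arguments passed via ... are added to the request body,
# with snake_case names converted to camelCase automatically.
df <- query_exec(
  "SELECT year, COUNT(*) AS n FROM [publicdata:samples.natality] GROUP BY year",
  project = "my-project",
  maximum_billing_tier = 2  # becomes maximumBillingTier in the API call
)
```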
* Full support for DATE, TIME, and DATETIME types (#128).
* All bigrquery requests now have a custom user agent that specifies the versions of bigrquery and httr that are used (#151).
* `insert_upload_job()` now sends data in newline-delimited JSON instead of CSV (#97). This should be considerably faster and avoids character encoding issues (#45).
* POSIXlt columns are now also correctly coerced to TIMESTAMPs (#98).
* `query_exec()` should be considerably faster, because profiling revealed that ~40% of its time was spent in a single line inside a function that parses BigQuery's JSON into an R data frame. I replaced the slow R code with a faster C function.
* `wait_for()` now reports the query's total bytes billed, which is more accurate because it takes into account caching and other factors.
* Compatible with latest httr.
* Added support for API keys via the `BIGRQUERY_API_KEY` environment variable (#49).