Enable bulk loading or gracious failing if certain indicators are not available. #35

Open
SebKrantz opened this issue Jul 28, 2021 · 2 comments

SebKrantz commented Jul 28, 2021

Hello, thanks a lot for this package and the comprehensive access to World Bank statistics it provides. I am using this package in part to download large volumes of indicators to populate a local macroeconomic database for a particular country. I have two issues: (1) loading many indicators sequentially is quite slow, and (2) if a particular indicator is not available for my country, the whole API call fails and everything that was loaded before is lost. For example, I wanted to load all 4269 indicators in the Education Statistics database:

library(wbstats)
indlist <- wb_indicators()
ind <- subset(indlist, source == "Education Statistics")[[1]]
WB_EDU <- wb_data(indicator = ind, country = "KEN", start_date = 1960, end_date = 2021)

This fails after a while with "Error: World Bank API request failed for indicator LO.EGRA.NCWPM.DAG.2GRD", and the rest of my query is lost.

I know this probably means some extra work, but in the medium term, an option to bulk load all indicators from a specific source and/or to skip indicators that cannot be loaded (i.e. using tryCatch around the API calls and displaying a warning without terminating the request) would be great. Both could also be enabled through additional arguments to wb_data.
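To illustrate the tryCatch idea above, here is a minimal sketch, not wbstats functionality. `load_indicators_safely` and `fake_fetch` are hypothetical names introduced for this example; the fetcher is a stand-in so the sketch runs without network access.

```r
# Hypothetical helper (not part of wbstats): call `fetch_one` once per
# indicator ID, turning errors into warnings instead of aborting the loop.
load_indicators_safely <- function(indicators, fetch_one) {
  results <- list()
  failed <- character(0)
  for (id in indicators) {
    out <- tryCatch(fetch_one(id), error = function(e) {
      warning("Skipping indicator ", id, ": ", conditionMessage(e), call. = FALSE)
      NULL
    })
    if (is.null(out)) {
      failed <- c(failed, id)   # remember what could not be loaded
    } else {
      results[[id]] <- out      # keep everything that did load
    }
  }
  list(data = results, failed = failed)
}

# Stand-in fetcher for demonstration only: fails for one made-up ID.
fake_fetch <- function(id) {
  if (id == "BAD.ID") stop("World Bank API request failed for indicator ", id)
  data.frame(indicator = id, year = 2000:2001, value = c(1, 2))
}

res <- suppressWarnings(load_indicators_safely(c("A", "BAD.ID", "B"), fake_fetch))
names(res$data)  # IDs that loaded
res$failed       # IDs that were skipped
```

With the real package, the fetcher could presumably be something like `function(id) wb_data(indicator = id, country = "KEN", start_date = 1960, end_date = 2021)`. One request per indicator addresses the lost-progress problem but not the speed problem, so it is a workaround rather than the bulk loading requested here.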

Contributor

jpiburn commented Jul 28, 2021

Hi,

Thank you for opening this issue. This is a good idea and something I have been meaning to integrate for a while. I will keep this issue updated with progress.

Author

SebKrantz commented Jul 28, 2021

Hi, thanks for the response. I just wrote a function to do bulk loading by source; maybe it is of help. Is there any way to strip down the JSON query to reduce the amount of information downloaded for each data value?

wbstats::wbsources()

library(jsonlite)
library(data.table)

WBAPI <- function(country = "KEN", sourceID = 12, series = "all", wide = FALSE, per_page = 100000L, custom_query = NULL) {
  if(length(custom_query)) {
    x <- fromJSON(custom_query) 
  } else {
    per_page <- as.integer(per_page)
    # Build the query URL once; format() keeps per_page out of scientific notation (e.g. "1e+05")
    base_url <- paste0("http://api.worldbank.org/v2/sources/", as.integer(sourceID), "/country/", 
                       as.character(country), "/series/", as.character(series), "?format=json&per_page=", 
                       format(per_page, scientific = FALSE), "&page=")
    x <- fromJSON(paste0(base_url, 1L))
    if(x$total > per_page) {
      iter <- ceiling(x$total / per_page) - 1L # number of additional pages after the first
      for(i in seq_len(iter)) {
        x_i <- tryCatch(fromJSON(paste0(base_url, i + 1L)), error = function(e) {
                          warning("Could not complete downloading, stopping at page ", i + 1L)
                          NULL})
        if(is.null(x_i)) break # keep the pages downloaded so far instead of duplicating them
        x$source$data <- rbind(x$source$data, x_i$source$data)
      }
    }
  }
  d <- x$source$data
  cc <- which(!is.na(d$value)) # keep only observations with a value
  # Flatten one observation: drop the first column of its 'variable' table, unlist the rest, append the value
  fld <- function(y, z) c(as.vector(unlist(.subset(y, -1L), use.names = FALSE), "list"), list(z))
  res <-  rbindlist(Map(fld, d$variable[cc], as.vector(d$value[cc], "list")))
  names(res) <- c("iso3c", "indicator", "yr", "country", "label", "year", "value")
  res$year <- as.integer(res$year)
  res$value <- as.numeric(res$value)
  setcolorder(res, c("iso3c", "country", "year", "yr"))
  setorder(res, iso3c, indicator, year)
  if(!wide) return(res) # long format: one row per country, indicator and year
  un <- which(!duplicated(res$indicator))
  lab <- res$label[un]
  ind <- res$indicator[un]
  res$label <- NULL
  res <- dcast(res, ... ~ indicator, value.var = "value")
  if(!identical(names(res)[-(1:4)], ind)) warning("indicator mismatch")
  oldClass(res) <- NULL # to speed up for loop
  for(i in 5:length(res)) attr(res[[i]], "label") <- lab[i-4L] # setting labels
  oldClass(res) <- c("data.table", "data.frame")
  setDT(res)
  return(res)
}
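
For reference, the paging logic the function relies on can be sketched in isolation. `wb_source_url` and `n_pages` are hypothetical helpers written for this illustration, not wbstats functions; the URL layout is copied from the function above, and nothing here touches the network.

```r
# Hypothetical helper: build the request URL for one page of a source query.
# format(..., scientific = FALSE) avoids strings such as "1e+05" in the URL.
wb_source_url <- function(country, sourceID, series = "all",
                          per_page = 100000L, page = 1L) {
  paste0("http://api.worldbank.org/v2/sources/", as.integer(sourceID),
         "/country/", country, "/series/", series,
         "?format=json&per_page=", format(as.integer(per_page), scientific = FALSE),
         "&page=", as.integer(page))
}

# Hypothetical helper: total number of pages needed for `total` records.
n_pages <- function(total, per_page) as.integer(ceiling(total / per_page))

url2 <- wb_source_url("KEN", 12, page = 2L)
pages_a <- n_pages(250000, 100000)  # partial last page still counts
pages_b <- n_pages(200000, 100000)  # exact multiple needs no extra page
```

Using `ceiling()` rather than `floor()` for the page count avoids both a missed final page and a request for an empty page when the total is an exact multiple of `per_page`.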
