diff --git a/DESCRIPTION b/DESCRIPTION
index 9e59232..b3acef7 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -1,7 +1,7 @@
 Package: eurlex
 Type: Package
 Title: Retrieve Data on European Union Law
-Version: 0.3.5
+Version: 0.3.6
 Authors@R: c(person(given = "Michal",
            family = "Ovadek",
            role = c("aut", "cre", "cph"),
@@ -10,7 +10,6 @@ Authors@R: c(person(given = "Michal",
 Description: Access to data on European Union laws and court decisions made
     easy with pre-defined 'SPARQL' queries and 'GET' requests.
 License: GPL-3
 Encoding: UTF-8
-LazyData: true
 Depends: R (>= 3.4.0)
 Imports:
@@ -19,6 +18,7 @@ Imports:
   xml2,
   tidyr,
   httr,
+  curl,
   rvest,
   rlang,
   stringr,
diff --git a/NEWS.md b/NEWS.md
index 99b3bda..bdef761 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -1,3 +1,14 @@
+# eurlex 0.3.6
+
+## Major changes
+
+- `elx_run_query()` now fails gracefully in the presence of internet/server problems
+- `elx_fetch_data()` now automatically fixes URLs with parentheses (e.g. "32019H1115(01)" used to fail)
+
+## Minor changes
+
+- minor fixes to the vignette
+
 # eurlex 0.3.5
 
 ## Major changes
diff --git a/R/elx_curia_list.R b/R/elx_curia_list.R
index 0de6f4c..d8a7024 100644
--- a/R/elx_curia_list.R
+++ b/R/elx_curia_list.R
@@ -8,7 +8,8 @@
 #' @param parse If `TRUE`, references to cases and appeals are parsed out from `case_info` into separate columns
 #' @return
 #' A data frame containing case identifiers and information as character columns. Where the case id
-#' contains a hyperlink to Eur-Lex, the CELEX identifier is retrieved as well.
+#' contains a hyperlink to Eur-Lex, the CELEX identifier is retrieved as well. Note that hyperlinks
+#' to Eur-Lex are no longer included for more recent cases.
 #' @importFrom rlang .data
 #' @export
 #' @examples
diff --git a/R/elx_fetch_data.R b/R/elx_fetch_data.R
index bd45ad8..f205e5e 100644
--- a/R/elx_fetch_data.R
+++ b/R/elx_fetch_data.R
@@ -22,6 +22,19 @@ elx_fetch_data <- function(url, type = c("title","text","ids"),
 
   language <- paste(language_1,", ",language_2,";q=0.8, ",language_3,";q=0.7", sep = "")
 
+  if (stringr::str_detect(url,"celex")){
+
+    clx <- stringr::str_extract(url, "(?<=celex\\/).*") %>%
+      stringr::str_replace_all("\\(","%28") %>%
+      stringr::str_replace_all("\\)","%29") %>%
+      stringr::str_replace_all("\\/","%2F")
+
+    url <- paste("http://publications.europa.eu/resource/celex/",
+                 clx,
+                 sep = "")
+
+  }
+
   if (type == "title"){
 
     response <- httr::GET(url=url,
diff --git a/R/elx_run_query.R b/R/elx_run_query.R
index 277283f..18873f5 100644
--- a/R/elx_run_query.R
+++ b/R/elx_run_query.R
@@ -20,9 +20,7 @@ elx_run_query <- function(query = "", endpoint = "http://publications.europa.eu/
 
   curlready <- paste(endpoint,"?query=",gsub("\\+","%2B", utils::URLencode(query, reserved = TRUE)), sep = "")
 
-  sparql_response <- httr::GET(url = curlready,
-                               httr::add_headers('Accept' = 'application/sparql-results+xml')
-  )
+  sparql_response <- graceful_http(curlready)
 
   sparql_response_parsed <- sparql_response %>%
     elx_parse_xml()
@@ -31,3 +29,49 @@
 
 }
 
+#' Fail http call gracefully
+#'
+#' @importFrom rlang .data
+#'
+#' @noRd
+#'
+
+graceful_http <- function(remote_file) {
+
+  try_GET <- function(x, ...) {
+    tryCatch(
+      httr::GET(url = x,
+                #httr::timeout(1000000000),
+                httr::add_headers('Accept' = 'application/sparql-results+xml')),
+      error = function(e) conditionMessage(e),
+      warning = function(w) conditionMessage(w)
+    )
+  }
+
+  is_response <- function(x) {
+    class(x) == "response"
+  }
+
+  # First check internet connection
+  if (!curl::has_internet()) {
+    message("No internet connection.")
+    return(invisible(NULL))
+  }
+
+  # Then try for timeout problems
+  resp <- try_GET(remote_file)
+  if (!is_response(resp)) {
+    message(resp)
+    return(invisible(NULL))
+  }
+
+  # Then stop if status > 400
+  if (httr::http_error(resp)) {
+    httr::message_for_status(resp)
+    return(invisible(NULL))
+  }
+
+  return(resp)
+
+}
+
diff --git a/README.md b/README.md
index 51d079c..0781d5a 100644
--- a/README.md
+++ b/README.md
@@ -42,6 +42,12 @@ Please consider contributing to the maintanance and development of the package b
 
 ## Latest changes
 
+### eurlex 0.3.6
+
+- `elx_run_query()` now fails gracefully in the presence of internet/server problems
+- `elx_fetch_data()` now automatically fixes URLs with parentheses (e.g. "32019H1115(01)" used to fail)
+- minor fixes to the vignette
+
 ### eurlex 0.3.5
 
 - it is now possible to select all resource types available with `elx_make_query(resource_type = "any")`. Since there are nearly 1 million CELEX codes, use with discretion and expect long execution times
diff --git a/doc/eurlexpkg.html b/doc/eurlexpkg.html
index 495b333..5a049e3 100644
--- a/doc/eurlexpkg.html
+++ b/doc/eurlexpkg.html
@@ -453,14 +453,14 @@
elx_run_query(): Execute SPARQL queries

as_tibble(results)
-#> # A tibble: 4,317 x 3
+#> # A tibble: 4,335 x 3
#> work type celex
#> <chr> <chr> <chr>
#> 1 http://publications.europa.eu/resourc~ http://publications.europa.eu/~ 31979L~
#> 2 http://publications.europa.eu/resourc~ http://publications.europa.eu/~ 31989L~
#> 3 http://publications.europa.eu/resourc~ http://publications.europa.eu/~ 31984L~
#> 4 http://publications.europa.eu/resourc~ http://publications.europa.eu/~ 31966L~
-#> # ... with 4,313 more rows
The function outputs a data.frame where each column corresponds to one of the requested variables, while the rows accumulate observations of the resource type satisfying the query criteria. Naturally, the more data is to be returned, the longer the execution time, varying from a few seconds to several minutes, depending also on your connection.
The first column always contains the unique URI of a “work” (legislative act or court judgment) which identifies each resource in Cellar. Several human-readable identifiers are normally associated with each “work” but the most useful one is CELEX, retrieved by default.2
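The CELEX code itself carries information that can be put to use directly. As a hypothetical sketch (assuming a `results` data frame with a `celex` column as shown above), the four digits after the leading sector digit encode the year, so acts can be tallied by year without any further queries:

```r
# Hypothetical sketch, not part of the vignette: tally acts by year,
# reading the year from positions 2-5 of the CELEX code (e.g. "31979L0112" -> "1979").
library(dplyr)

results %>%
  mutate(year = substr(celex, 2, 5)) %>%
  count(year, sort = TRUE)
```

The same trick works for the document-type letter (e.g. "L" for directives) if a finer breakdown is needed.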
One column you should always pay attention to is type (as in resource_type). The URIs contained there reflect the FILTER argument in the SPARQL query, which is manually pre-specified. All resources are indexed as being of one type or another. For example, when retrieving directives, the results will also include delegated directives, which might not be desirable, depending on your needs. You can filter results by type to make the necessary adjustments. The queries are expansive by default, in the spirit of erring on the side of over-inclusiveness rather than the opposite.
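A minimal sketch of such a filter follows; the exact substrings matched in the type URI (e.g. "DIR_DEL" for delegated directives, "DIR_IMPL" for implementing directives) are assumptions that should be checked against the values actually returned by your query:

```r
# Hypothetical sketch: drop delegated and implementing directives by
# inspecting the resource-type URI in the 'type' column.
library(dplyr)

dirs_only <- results %>%
  filter(!grepl("DIR_DEL|DIR_IMPL", type))
```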
Directives naturally become outdated with time, which makes it all the more interesting to see which older acts are still surviving.
dirs %>%
filter(!is.na(force)) %>%
@@ -591,7 +591,7 @@ Application
theme(axis.text.y = element_blank(),
axis.line.y = element_blank(),
axis.ticks.y = element_blank())
We want to know a bit more about the directives from 1970s that are still in force today. Their titles could give us a clue.
dirs_1970_title <- dirs %>%
filter(between(as.Date(date), as.Date("1970-01-01"), as.Date("1980-01-01")),
@@ -600,14 +600,14 @@ Application
as_tibble()
print(dirs_1970_title)
-#> # A tibble: 70 x 6
+#> # A tibble: 67 x 6
#> work type celex date force title
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 http://publications~ http://publicatio~ 31975~ 1975~ true Council Directive ~
#> 2 http://publications~ http://publicatio~ 31977~ 1977~ true First Commission D~
#> 3 http://publications~ http://publicatio~ 31977~ 1977~ true Council Directive ~
#> 4 http://publications~ http://publicatio~ 31973~ 1973~ true Council Directive ~
-#> # ... with 66 more rows
I will use the tidytext package to get a quick idea of what the legislation is about.
library(tidytext)
library(wordcloud)
@@ -619,7 +619,7 @@ Application
filter(!grepl("\\d", word)) %>%
bind_tf_idf(word, celex, n) %>%
with(wordcloud(word, tf_idf, max.words = 40, scale = c(1.8,0.1)))
I use term-frequency inverse-document frequency (tf-idf) to weight the importance of the words in the wordcloud. If we used pure frequencies, the wordcloud would largely consist of words conveying little meaning (“the”, “and”, …).
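To see why tf-idf pushes such ubiquitous words out of the wordcloud, here is a toy illustration with made-up counts (not data from the vignette):

```r
# A word present in every document has idf = log(2/2) = 0, so its tf-idf
# is zero no matter how frequent it is; document-specific words get weight.
library(tidytext)
library(tibble)

toy <- tribble(
  ~celex, ~word,     ~n,
  "doc1", "council", 2,
  "doc1", "milk",    1,
  "doc2", "council", 1,
  "doc2", "cheese",  3
)

bind_tf_idf(toy, word, celex, n)
# "council" appears in both documents and so receives tf_idf = 0,
# while "milk" and "cheese" receive positive weights
```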
This is an extremely basic application of the eurlex package. Much more sophisticated methods can be used to analyse both the content and metadata of European Union legislation. If the package is useful for your research, please consider citing the accompanying paper.4