Commit ee7d333 (1 parent: bc3c355)
Showing 16 changed files with 405 additions and 28 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -5,3 +5,4 @@ | |
^\.github$ | ||
^codecov\.yml$ | ||
^cran-comments\.md$ | ||
^CRAN-SUBMISSION$ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
Version: 0.1.0 | ||
Date: 2025-01-10 13:37:31 UTC | ||
SHA: bc3c35539d6f13a5cedf8322c879d3f8dc77bb6f |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,7 +1,7 @@ | ||
Package: urlparse | ||
Type: Package | ||
Title: Fast Simple URL Parser | ||
-Version: 0.1.0 | ||
+Version: 0.1.9999 | ||
Authors@R: | ||
person("Dyfan", "Jones", , "[email protected]", role = c("aut", "cre")) | ||
Description: A fast and simple 'URL' parser package for 'R'. This package provides | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,7 @@ | ||
# urlparse 0.1.9999 | ||
|
||
* New function `url_parse_v2` vectorises URL parsing. | ||
|
||
# urlparse 0.1.0 | ||
|
||
* Initial CRAN submission. |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -59,6 +59,37 @@ url_parse <- function(url) { | |
.Call('_urlparse_url_parse', PACKAGE = 'urlparse', url) | ||
} | ||
|
||
#' @title Parses a vector of URLs into a data frame. | ||
#' @description Parses a vector of URLs into their respective components. | ||
#' It returns a data.frame where each row represents a URL, | ||
#' and each column represents a specific component of the URL | ||
#' such as the scheme, user, password, host, port, path, raw path, | ||
#' raw query, and fragment. | ||
#' @param url A vector of strings, where each string is a URL to be parsed. | ||
#' @return A data frame with the following columns: | ||
#' - href: The original URL. | ||
#' - scheme: The scheme component of the URL (e.g., "http", "https"). | ||
#' - user: The user component of the URL. | ||
#' - password: The password component of the URL. | ||
#' - host: The host component of the URL. | ||
#' - port: The port component of the URL. | ||
#' - path: The decoded path component of the URL. | ||
#' - raw_path: The raw path component of the URL. | ||
#' - raw_query: The raw query component of the URL. | ||
#' - fragment: The fragment component of the URL. | ||
#' @examples | ||
#' library(urlparse) | ||
#' urls <- c("https://user:[email protected]:8080/path/to/resource?query=example#fragment", | ||
#' "http://www.test.com") | ||
#' url_parse_v2(urls) | ||
#' | ||
#' @export | ||
#' @useDynLib urlparse _urlparse_url_parse_v2 | ||
#' @importFrom Rcpp evalCpp | ||
url_parse_v2 <- function(url) { | ||
.Call('_urlparse_url_parse_v2', PACKAGE = 'urlparse', url) | ||
} | ||
|
||
#' @title Builds a URL string from its components. | ||
#' | ||
#' @param url_components A list containing the components of the URL: scheme, host, port, path, query, and fragment. | ||
|
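As a rough illustration of the return shape documented for `url_parse_v2` (one row per URL, one column per component), the base-R sketch below splits a character vector of URLs into a data frame. It is not the package's implementation, which calls compiled code through `.Call`. `parse_urls_sketch` is a hypothetical helper that relies on the generic regular expression from RFC 3986, Appendix B, and it leaves the authority unsplit instead of separating user, password, host, and port.

```r
# Hypothetical helper, not part of urlparse: splits URLs with the generic
# regular expression given in RFC 3986, Appendix B.
parse_urls_sketch <- function(url) {
  pattern <- "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?"
  # One character vector per URL: the full match followed by 9 groups.
  parts <- regmatches(url, regexec(pattern, url))
  pick <- function(i) vapply(parts, function(p) p[i], character(1))
  data.frame(
    href      = url,
    scheme    = pick(3),   # group 2
    authority = pick(5),   # group 4 (user, password, host, port unsplit)
    raw_path  = pick(6),   # group 5
    raw_query = pick(8),   # group 7
    fragment  = pick(10),  # group 9
    stringsAsFactors = FALSE
  )
}

urls <- c(
  "https://example.com:8080/path/to/resource?query=example#fragment",
  "http://www.test.com"
)
parsed <- parse_urls_sketch(urls)
print(parsed)
```

Components that are absent from a URL come back as empty strings, which mirrors the blank cells in the package's printed output.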
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -67,6 +67,8 @@ quote("foo = bar + 5", safe = "+") | |
url_encoder("foo = bar + 5", safe = "+") | ||
``` | ||
|
||
Modify a `url` by piping through the `set_*` functions or by using the standalone `url_modify` function. | ||
|
||
```{r url_modify} | ||
url <- "http://example.com" | ||
|
@@ -94,7 +96,6 @@ bench::mark( | |
) | ||
``` | ||
|
||
|
||
## Benchmark: | ||
|
||
```{r, echo = FALSE} | ||
|
@@ -121,6 +122,41 @@ show_relative(bm) | |
ggplot2::autoplot(bm) | ||
``` | ||
|
||
Since `urlparse` v0.1.9999 you can use the vectorised URL parser `url_parse_v2`: | ||
```{r benchmark_vectorise} | ||
urls <- c( | ||
"https://www.example.com", | ||
"https://www.google.com/maps/place/Pennsylvania+Station/@40.7519848,-74.0015045,14.7z/data=!4m5!3m4!1s0x89c259ae15b2adcb:0x7955420634fd7eba!8m2!3d40.750568!4d-73.993519", | ||
"https://user_1:password_1@example.org:8080/dir/../api?q=1#frag", | ||
"https://user:password@example.com", | ||
"https://www.example.com:8080/search%3D1%2B3", | ||
"https://www.google.co.jp/search?q=\u30c9\u30a4\u30c4", | ||
"https://www.example.com:8080?var1=foo&var2=ba%20r&var3=baz+larry", | ||
"https://user:password@example.com:8080", | ||
"https://user:password@example.com", | ||
"https://user@example.com:8080", | ||
"https://user@example.com" | ||
) | ||
(bm <- bench::mark( | ||
urlparse = lapply(urls, urlparse::url_parse), | ||
urlparse_v2 = urlparse::url_parse_v2(urls), | ||
httr2 = lapply(urls, httr2::url_parse), | ||
curl = lapply(urls, curl::curl_parse_url), | ||
urltools = urltools::url_parse(urls), | ||
check = FALSE | ||
)) | ||
show_relative(bm) | ||
ggplot2::autoplot(bm) | ||
``` | ||
|
||
Note: `url_parse_v2` returns the parsed URLs as a `data.frame`, which is similar to the behaviour of `urltools` and `adaR`: | ||
|
||
```{r url_parse_v2} | ||
urlparse::url_parse_v2(urls) | ||
``` | ||
|
||
### Encoding URL: | ||
|
||
Note: `urltools` encode special characters to lower case hex i.e.: "?" -> "%3f" instead of "%3F" | ||
|
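The note on hex case matters mainly when encoded strings are compared as text: `%3f` and `%3F` decode to the same character but are not identical strings. The base-R sketch below shows this and one way to normalise escapes to upper case before comparing. `normalise_hex` is a hypothetical helper introduced only for this illustration; it is not part of `urlparse` or `urltools`.

```r
# Base R emits upper-case hex when percent-encoding.
upper <- utils::URLencode("?", reserved = TRUE)
lower <- tolower(upper)

# Both spellings decode to the same character ...
same_char <- identical(utils::URLdecode(upper), utils::URLdecode(lower))
# ... but they are different strings.
same_text <- identical(upper, lower)

# Hypothetical helper: upper-case only the percent escapes, leaving the
# rest of the string untouched.
normalise_hex <- function(x) {
  m <- gregexpr("%[0-9A-Fa-f]{2}", x)
  regmatches(x, m) <- lapply(regmatches(x, m), toupper)
  x
}

normalised <- normalise_hex("a%3fb%2fc")
print(c(upper = upper, lower = lower, normalised = normalised))
```

Normalising before a string comparison avoids reporting a mismatch between encoders that differ only in hex case.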
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -97,6 +97,9 @@ url_encoder("foo = bar + 5", safe = "+") | |
#> [1] "foo%20%3D%20bar%20+%205" | ||
``` | ||
|
||
Modify a `url` by piping through the `set_*` functions or by using the | ||
standalone `url_modify` function. | ||
|
||
``` r | ||
|
||
url <- "http://example.com" | ||
|
@@ -128,8 +131,8 @@ bench::mark( | |
#> # A tibble: 2 × 6 | ||
#> expression min median `itr/sec` mem_alloc `gc/sec` | ||
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> | ||
-#> 1 piping 5.29µs 5.86µs 169576. 0B 0 | ||
-#> 2 single_function 1.64µs 1.8µs 507863. 0B 0 | ||
+#> 1 piping 5.37µs 5.9µs 167991. 0B 0 | ||
+#> 2 single_function 1.64µs 1.8µs 519161. 0B 0 | ||
``` | ||
|
||
## Benchmark: | ||
|
@@ -148,26 +151,142 @@ url <- "https://user:[email protected]:8000/path?query=1#fragment" | |
#> # A tibble: 4 × 6 | ||
#> expression min median `itr/sec` mem_alloc `gc/sec` | ||
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> | ||
-#> 1 urlparse 1.68µs 1.84µs 503156. 0B 0 | ||
-#> 2 httr2 64.86µs 68.59µs 14312. 560.9KB 17.4 | ||
-#> 3 curl 27.22µs 28.54µs 34390. 48.78KB 13.8 | ||
-#> 4 urltools 124.35µs 129.03µs 7604. 2.17MB 20.9 | ||
+#> 1 urlparse 1.72µs 1.89µs 494724. 0B 0 | ||
+#> 2 httr2 64.98µs 68.59µs 14019. 560.9KB 17.5 | ||
+#> 3 curl 27.27µs 28.54µs 34199. 48.78KB 13.7 | ||
+#> 4 urltools 124.52µs 130.13µs 7460. 2.17MB 20.9 | ||
|
||
show_relative(bm) | ||
#> # A tibble: 4 × 6 | ||
#> expression min median `itr/sec` mem_alloc `gc/sec` | ||
#> <bch:expr> <dbl> <dbl> <dbl> <dbl> <dbl> | ||
-#> 1 urlparse 1 1 66.2 NaN NaN | ||
-#> 2 httr2 38.6 37.2 1.88 Inf Inf | ||
-#> 3 curl 16.2 15.5 4.52 Inf Inf | ||
-#> 4 urltools 74.0 69.9 1 Inf Inf | ||
+#> 1 urlparse 1 1 66.3 NaN NaN | ||
+#> 2 httr2 37.7 36.4 1.88 Inf Inf | ||
+#> 3 curl 15.8 15.1 4.58 Inf Inf | ||
+#> 4 urltools 72.3 69.0 1 Inf Inf | ||
|
||
ggplot2::autoplot(bm) | ||
#> Loading required namespace: tidyr | ||
``` | ||
|
||
<img src="man/figures/README-benchmark-1.png" width="100%" /> | ||
|
||
Since `urlparse` v0.1.9999 you can use the vectorised URL parser | ||
`url_parse_v2`: | ||
|
||
``` r | ||
urls <- c( | ||
"https://www.example.com", | ||
"https://www.google.com/maps/place/Pennsylvania+Station/@40.7519848,-74.0015045,14.7z/data=!4m5!3m4!1s0x89c259ae15b2adcb:0x7955420634fd7eba!8m2!3d40.750568!4d-73.993519", | ||
"https://user_1:password_1@example.org:8080/dir/../api?q=1#frag", | ||
"https://user:password@example.com", | ||
"https://www.example.com:8080/search%3D1%2B3", | ||
"https://www.google.co.jp/search?q=\u30c9\u30a4\u30c4", | ||
"https://www.example.com:8080?var1=foo&var2=ba%20r&var3=baz+larry", | ||
"https://user:password@example.com:8080", | ||
"https://user:password@example.com", | ||
"https://user@example.com:8080", | ||
"https://user@example.com" | ||
) | ||
(bm <- bench::mark( | ||
urlparse = lapply(urls, urlparse::url_parse), | ||
urlparse_v2 = urlparse::url_parse_v2(urls), | ||
httr2 = lapply(urls, httr2::url_parse), | ||
curl = lapply(urls, curl::curl_parse_url), | ||
urltools = urltools::url_parse(urls), | ||
check = FALSE | ||
)) | ||
#> # A tibble: 5 × 6 | ||
#> expression min median `itr/sec` mem_alloc `gc/sec` | ||
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> | ||
#> 1 urlparse 19.9µs 21.4µs 44755. 200B 17.9 | ||
#> 2 urlparse_v2 10.5µs 11.2µs 87440. 4.3KB 8.74 | ||
#> 3 httr2 452.8µs 473.1µs 2060. 23.6KB 21.3 | ||
#> 4 curl 190.5µs 201.4µs 4881. 0B 9.64 | ||
#> 5 urltools 130.3µs 136.7µs 7196. 0B 12.3 | ||
|
||
show_relative(bm) | ||
#> # A tibble: 5 × 6 | ||
#> expression min median `itr/sec` mem_alloc `gc/sec` | ||
#> <bch:expr> <dbl> <dbl> <dbl> <dbl> <dbl> | ||
#> 1 urlparse 1.89 1.92 21.7 Inf 2.05 | ||
#> 2 urlparse_v2 1 1 42.4 Inf 1 | ||
#> 3 httr2 43.0 42.3 1 Inf 2.44 | ||
#> 4 curl 18.1 18.0 2.37 NaN 1.10 | ||
#> 5 urltools 12.4 12.2 3.49 NaN 1.41 | ||
|
||
ggplot2::autoplot(bm) | ||
``` | ||
|
||
<img src="man/figures/README-benchmark_vectorise-1.png" width="100%" /> | ||
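The relative table can be read as each timing divided by the fastest timing in its column. The sketch below reproduces that arithmetic for the median column, using the rounded medians printed above. `show_relative` itself is defined elsewhere in the README and is not shown in this diff, so treating it as a divide-by-minimum rescaling is an assumption based on the `1` entries in its output.

```r
# Median timings in microseconds, copied from the bench::mark output above.
median_us <- c(
  urlparse    = 21.4,
  urlparse_v2 = 11.2,
  httr2       = 473.1,
  curl        = 201.4,
  urltools    = 136.7
)

# Express each timing as a multiple of the fastest one.
relative <- median_us / min(median_us)
print(round(relative, 2))
```

Results computed this way can differ slightly in the last digit from the table, because the table is derived from unrounded timings.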
|
||
Note: `url_parse_v2` returns the parsed URLs as a `data.frame`, which is | ||
similar to the behaviour of `urltools` and `adaR`: | ||
|
||
``` r | ||
urlparse::url_parse_v2(urls) | ||
#> href | ||
#> 1 https://www.example.com | ||
#> 2 https://www.google.com/maps/place/Pennsylvania+Station/@40.7519848,-74.0015045,14.7z/data=!4m5!3m4!1s0x89c259ae15b2adcb:0x7955420634fd7eba!8m2!3d40.750568!4d-73.993519 | ||
#> 3 https://user_1:password_1@example.org:8080/dir/../api?q=1#frag | ||
#> 4 https://user:password@example.com | ||
#> 5 https://www.example.com:8080/search%3D1%2B3 | ||
#> 6 https://www.google.co.jp/search?q=ドイツ | ||
#> 7 https://www.example.com:8080?var1=foo&var2=ba%20r&var3=baz+larry | ||
#> 8 https://user:password@example.com:8080 | ||
#> 9 https://user:password@example.com | ||
#> 10 https://user@example.com:8080 | ||
#> 11 https://user@example.com | ||
#> scheme user password host port | ||
#> 1 https www.example.com | ||
#> 2 https www.google.com | ||
#> 3 https user_1 password_1 example.org 8080 | ||
#> 4 https user password example.com | ||
#> 5 https www.example.com 8080 | ||
#> 6 https www.google.co.jp | ||
#> 7 https www.example.com 8080 | ||
#> 8 https user password example.com 8080 | ||
#> 9 https user password example.com | ||
#> 10 https user example.com 8080 | ||
#> 11 https user example.com | ||
#> path | ||
#> 1 | ||
#> 2 /maps/place/Pennsylvania+Station/@40.7519848,-74.0015045,14.7z/data=!4m5!3m4!1s0x89c259ae15b2adcb:0x7955420634fd7eba!8m2!3d40.750568!4d-73.993519 | ||
#> 3 /dir/../api | ||
#> 4 | ||
#> 5 /search=1+3 | ||
#> 6 /search | ||
#> 7 | ||
#> 8 | ||
#> 9 | ||
#> 10 | ||
#> 11 | ||
#> raw_path | ||
#> 1 | ||
#> 2 /maps/place/Pennsylvania+Station/@40.7519848,-74.0015045,14.7z/data=!4m5!3m4!1s0x89c259ae15b2adcb:0x7955420634fd7eba!8m2!3d40.750568!4d-73.993519 | ||
#> 3 | ||
#> 4 | ||
#> 5 /search%3D1%2B3 | ||
#> 6 | ||
#> 7 | ||
#> 8 | ||
#> 9 | ||
#> 10 | ||
#> 11 | ||
#> raw_query fragment | ||
#> 1 | ||
#> 2 | ||
#> 3 q=1 frag | ||
#> 4 | ||
#> 5 | ||
#> 6 q=%E3%83%89%E3%82%A4%E3%83%84 | ||
#> 7 var1=foo&var2=ba%20r&var3=baz%2Blarry | ||
#> 8 | ||
#> 9 | ||
#> 10 | ||
#> 11 | ||
``` | ||
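One practical consequence of a data frame result is that ordinary subsetting and tabulation apply to the parsed components directly. The sketch below uses a hand-built stand-in for a few columns of the output above, so that it runs without the package installed. Storing the columns as character, with empty strings for missing components, is an assumption based on the printed output.

```r
# Hand-built stand-in for part of the url_parse_v2() result shown above.
# Column types are assumed to be character.
parsed <- data.frame(
  host = c("www.example.com", "example.org", "example.com", "example.com"),
  port = c("", "8080", "8080", ""),
  user = c("", "user_1", "user", "user"),
  stringsAsFactors = FALSE
)

# Keep only the URLs that specify an explicit port.
with_port <- parsed[parsed$port != "", ]

# Count how many URLs point at each host.
hosts <- table(parsed$host)

print(with_port)
print(hosts)
```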
|
||
### Encoding URL: | ||
|
||
Note: `urltools` encode special characters to lower case hex i.e.: “?” | ||
|
@@ -185,19 +304,19 @@ string <- "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-._~`!@ | |
#> # A tibble: 4 × 6 | ||
#> expression min median `itr/sec` mem_alloc `gc/sec` | ||
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> | ||
-#> 1 urlparse 1.48µs 1.56µs 623378. 208B 0 | ||
-#> 2 curl 2.3µs 2.42µs 399842. 3.06KB 0 | ||
-#> 3 urltools 2.42µs 2.67µs 370964. 2.48KB 0 | ||
-#> 4 base 79.09µs 83.15µs 11703. 28.59KB 8.24 | ||
+#> 1 urlparse 1.52µs 1.64µs 598950. 208B 0 | ||
+#> 2 curl 2.3µs 2.42µs 407439. 3.06KB 0 | ||
+#> 3 urltools 2.38µs 2.62µs 376010. 2.48KB 0 | ||
+#> 4 base 78.76µs 81.88µs 12031. 28.59KB 8.19 | ||
|
||
show_relative(bm) | ||
#> # A tibble: 4 × 6 | ||
#> expression min median `itr/sec` mem_alloc `gc/sec` | ||
#> <bch:expr> <dbl> <dbl> <dbl> <dbl> <dbl> | ||
-#> 1 urlparse 1 1 53.3 1 NaN | ||
-#> 2 curl 1.56 1.55 34.2 15.0 NaN | ||
-#> 3 urltools 1.64 1.71 31.7 12.2 NaN | ||
-#> 4 base 53.6 53.4 1 141. Inf | ||
+#> 1 urlparse 1 1 49.8 1 NaN | ||
+#> 2 curl 1.51 1.47 33.9 15.0 NaN | ||
+#> 3 urltools 1.57 1.60 31.3 12.2 NaN | ||
+#> 4 base 51.9 49.9 1 141. Inf | ||
|
||
ggplot2::autoplot(bm) | ||
``` | ||
|
@@ -218,19 +337,19 @@ url <- paste0(sample(strsplit(string, "")[[1]], 1e4, replace = TRUE), collapse = | |
#> # A tibble: 4 × 6 | ||
#> expression min median `itr/sec` mem_alloc `gc/sec` | ||
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> | ||
-#> 1 urlparse 86.06µs 87.41µs 11291. 15.8KB 0 | ||
-#> 2 curl 92.95µs 94.26µs 10209. 0B 0 | ||
-#> 3 urltools 238.7µs 244.16µs 3950. 15.8KB 0 | ||
-#> 4 base 6.72ms 6.84ms 141. 333.2KB 9.91 | ||
+#> 1 urlparse 85.36µs 86.55µs 11420. 15.7KB 0 | ||
+#> 2 curl 92.05µs 93.69µs 10521. 0B 0 | ||
+#> 3 urltools 244.03µs 245.55µs 4047. 15.7KB 0 | ||
+#> 4 base 6.57ms 6.73ms 142. 332.2KB 9.84 | ||
|
||
show_relative(bm) | ||
#> # A tibble: 4 × 6 | ||
#> expression min median `itr/sec` mem_alloc `gc/sec` | ||
#> <bch:expr> <dbl> <dbl> <dbl> <dbl> <dbl> | ||
-#> 1 urlparse 1 1 80.2 Inf NaN | ||
-#> 2 curl 1.08 1.08 72.5 NaN NaN | ||
-#> 3 urltools 2.77 2.79 28.1 Inf NaN | ||
-#> 4 base 78.1 78.2 1 Inf Inf | ||
+#> 1 urlparse 1 1 80.6 Inf NaN | ||
+#> 2 curl 1.08 1.08 74.2 NaN NaN | ||
+#> 3 urltools 2.86 2.84 28.6 Inf NaN | ||
+#> 4 base 77.0 77.8 1 Inf Inf | ||
|
||
ggplot2::autoplot(bm) | ||
``` | ||
|