add vectorise url_parse_v2 method
DyfanJones committed Jan 13, 2025
1 parent bc3c355 commit ee7d333
Showing 16 changed files with 405 additions and 28 deletions.
1 change: 1 addition & 0 deletions .Rbuildignore
Original file line number Diff line number Diff line change
@@ -5,3 +5,4 @@
^\.github$
^codecov\.yml$
^cran-comments\.md$
^CRAN-SUBMISSION$
3 changes: 3 additions & 0 deletions CRAN-SUBMISSION
@@ -0,0 +1,3 @@
Version: 0.1.0
Date: 2025-01-10 13:37:31 UTC
SHA: bc3c35539d6f13a5cedf8322c879d3f8dc77bb6f
2 changes: 1 addition & 1 deletion DESCRIPTION
@@ -1,7 +1,7 @@
Package: urlparse
Type: Package
Title: Fast Simple URL Parser
-Version: 0.1.0
+Version: 0.1.9999
Authors@R:
person("Dyfan", "Jones", , "[email protected]", role = c("aut", "cre"))
Description: A fast and simple 'URL' parser package for 'R'. This package provides
2 changes: 2 additions & 0 deletions NAMESPACE
@@ -13,6 +13,7 @@ export(url_decoder)
export(url_encoder)
export(url_modify)
export(url_parse)
export(url_parse_v2)
importFrom(Rcpp,evalCpp)
useDynLib(urlparse,"_urlparse_set_fragment")
useDynLib(urlparse,"_urlparse_set_host")
@@ -24,3 +25,4 @@ useDynLib(urlparse,"_urlparse_url_decoder")
useDynLib(urlparse,"_urlparse_url_encoder")
useDynLib(urlparse,"_urlparse_url_modify")
useDynLib(urlparse,"_urlparse_url_parse")
useDynLib(urlparse,"_urlparse_url_parse_v2")
4 changes: 4 additions & 0 deletions NEWS.md
@@ -1,3 +1,7 @@
# urlparse 0.1.9999

* New function `url_parse_v2` to vectorise URL parsing.

# urlparse 0.1.0

* Initial CRAN submission.
31 changes: 31 additions & 0 deletions R/RcppExports.R
@@ -59,6 +59,37 @@ url_parse <- function(url) {
.Call('_urlparse_url_parse', PACKAGE = 'urlparse', url)
}

#' @title Parses a vector of URLs into a data frame.
#' @description Parses a vector of URLs into their respective components.
#' It returns a data.frame where each row represents a URL,
#' and each column represents a specific component of the URL
#' such as the scheme, user, password, host, port, path, raw path,
#' raw query, and fragment.
#' @param url A vector of strings, where each string is a URL to be parsed.
#' @return A data frame with the following columns:
#' - href: The original URL.
#' - scheme: The scheme component of the URL (e.g., "http", "https").
#' - user: The user component of the URL.
#' - password: The password component of the URL.
#' - host: The host component of the URL.
#' - port: The port component of the URL.
#' - path: The decoded path component of the URL.
#' - raw_path: The raw path component of the URL.
#' - raw_query: The raw query component of the URL.
#' - fragment: The fragment component of the URL.
#' @examples
#' library(urlparse)
#' urls <- c("https://user:[email protected]:8080/path/to/resource?query=example#fragment",
#' "http://www.test.com")
#' url_parse_v2(urls)
#'
#' @export
#' @useDynLib urlparse _urlparse_url_parse_v2
#' @importFrom Rcpp evalCpp
url_parse_v2 <- function(url) {
.Call('_urlparse_url_parse_v2', PACKAGE = 'urlparse', url)
}

#' @title Builds a URL string from its components.
#'
#' @param url_components A list containing the components of the URL: scheme, host, port, path, query, and fragment.
38 changes: 37 additions & 1 deletion README.Rmd
@@ -67,6 +67,8 @@ quote("foo = bar + 5", safe = "+")
url_encoder("foo = bar + 5", safe = "+")
```

Modify a URL by piping it through the `set_*` functions or by calling the standalone `url_modify` function.

```{r url_modify}
url <- "http://example.com"
@@ -94,7 +96,6 @@ bench::mark(
)
```


## Benchmark:

```{r, echo = FALSE}
@@ -121,6 +122,41 @@ show_relative(bm)
ggplot2::autoplot(bm)
```

Since `urlparse` v0.1.9999 you can use the vectorised URL parser `url_parse_v2`:
```{r benchmark_vectorise}
urls <- c(
"https://www.example.com",
"https://www.google.com/maps/place/Pennsylvania+Station/@40.7519848,-74.0015045,14.7z/data=!4m5!3m4!1s0x89c259ae15b2adcb:0x7955420634fd7eba!8m2!3d40.750568!4d-73.993519",
"https://user_1:password_1@example.org:8080/dir/../api?q=1#frag",
"https://user:password@example.com",
"https://www.example.com:8080/search%3D1%2B3",
"https://www.google.co.jp/search?q=\u30c9\u30a4\u30c4",
"https://www.example.com:8080?var1=foo&var2=ba%20r&var3=baz+larry",
"https://user:password@example.com:8080",
"https://user:password@example.com",
"https://user@example.com:8080",
"https://user@example.com"
)
(bm <- bench::mark(
urlparse = lapply(urls, urlparse::url_parse),
urlparse_v2 = urlparse::url_parse_v2(urls),
httr2 = lapply(urls, httr2::url_parse),
curl = lapply(urls, curl::curl_parse_url),
urltools = urltools::url_parse(urls),
check = FALSE
))
show_relative(bm)
ggplot2::autoplot(bm)
```

Note: `url_parse_v2` returns the parsed URLs as a `data.frame`, similar to the behaviour of `urltools` and `adaR`:

```{r url_parse_v2}
urlparse::url_parse_v2(urls)
```
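
To give a feel for what "one row per URL" means, here is a minimal base R sketch of a vectorised splitter. It is not how `url_parse_v2` is implemented (the package calls compiled code through Rcpp); it uses the generic regular expression from RFC 3986, Appendix B. The helper name `parse_urls_sketch` and its reduced column set (a single `authority` column instead of separate user, password, host and port) are assumptions made purely for illustration.

```r
# Generic URI-splitting regular expression from RFC 3986, Appendix B.
# Groups: 2 = scheme, 4 = authority, 5 = path, 7 = query, 9 = fragment.
rfc3986 <- "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?"

# Hypothetical helper (not part of urlparse): splits every URL in one pass.
parse_urls_sketch <- function(urls) {
  # sub() is vectorised over `urls`; an unmatched group yields "".
  pick <- function(group) sub(rfc3986, paste0("\\", group), urls, perl = TRUE)
  data.frame(
    href = urls,
    scheme = pick(2),
    authority = pick(4),
    raw_path = pick(5),
    raw_query = pick(7),
    fragment = pick(9),
    stringsAsFactors = FALSE
  )
}

parse_urls_sketch(c(
  "https://www.example.com:8080/search%3D1%2B3",
  "http://www.test.com"
))
```

The sketch does no decoding or validation, so it should be read only as an illustration of the return shape, not as a substitute for the package function.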

### Encoding URL:

Note: `urltools` encodes special characters to lower-case hex, i.e. "?" -> "%3f" instead of "%3F"
171 changes: 145 additions & 26 deletions README.md
@@ -97,6 +97,9 @@ url_encoder("foo = bar + 5", safe = "+")
#> [1] "foo%20%3D%20bar%20+%205"
```

Modify a URL by piping it through the `set_*` functions or by calling
the standalone `url_modify` function.

``` r

url <- "http://example.com"
@@ -128,8 +131,8 @@ bench::mark(
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
-#> 1 piping 5.29µs 5.86µs 169576. 0B 0
-#> 2 single_function 1.64µs 1.8µs 507863. 0B 0
+#> 1 piping 5.37µs 5.9µs 167991. 0B 0
+#> 2 single_function 1.64µs 1.8µs 519161. 0B 0
```

## Benchmark:
@@ -148,26 +151,142 @@ url <- "https://user:[email protected]:8000/path?query=1#fragment"
#> # A tibble: 4 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
-#> 1 urlparse 1.68µs 1.84µs 503156. 0B 0
-#> 2 httr2 64.86µs 68.59µs 14312. 560.9KB 17.4
-#> 3 curl 27.22µs 28.54µs 34390. 48.78KB 13.8
-#> 4 urltools 124.35µs 129.03µs 7604. 2.17MB 20.9
+#> 1 urlparse 1.72µs 1.89µs 494724. 0B 0
+#> 2 httr2 64.98µs 68.59µs 14019. 560.9KB 17.5
+#> 3 curl 27.27µs 28.54µs 34199. 48.78KB 13.7
+#> 4 urltools 124.52µs 130.13µs 7460. 2.17MB 20.9

show_relative(bm)
#> # A tibble: 4 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <dbl> <dbl> <dbl> <dbl> <dbl>
-#> 1 urlparse 1 1 66.2 NaN NaN
-#> 2 httr2 38.6 37.2 1.88 Inf Inf
-#> 3 curl 16.2 15.5 4.52 Inf Inf
-#> 4 urltools 74.0 69.9 1 Inf Inf
+#> 1 urlparse 1 1 66.3 NaN NaN
+#> 2 httr2 37.7 36.4 1.88 Inf Inf
+#> 3 curl 15.8 15.1 4.58 Inf Inf
+#> 4 urltools 72.3 69.0 1 Inf Inf

ggplot2::autoplot(bm)
#> Loading required namespace: tidyr
```

<img src="man/figures/README-benchmark-1.png" width="100%" />

Since `urlparse` v0.1.9999 you can use the vectorised URL parser
`url_parse_v2`:

``` r
urls <- c(
"https://www.example.com",
"https://www.google.com/maps/place/Pennsylvania+Station/@40.7519848,-74.0015045,14.7z/data=!4m5!3m4!1s0x89c259ae15b2adcb:0x7955420634fd7eba!8m2!3d40.750568!4d-73.993519",
"https://user_1:password_1@example.org:8080/dir/../api?q=1#frag",
"https://user:password@example.com",
"https://www.example.com:8080/search%3D1%2B3",
"https://www.google.co.jp/search?q=\u30c9\u30a4\u30c4",
"https://www.example.com:8080?var1=foo&var2=ba%20r&var3=baz+larry",
"https://user:password@example.com:8080",
"https://user:password@example.com",
"https://user@example.com:8080",
"https://user@example.com"
)
(bm <- bench::mark(
urlparse = lapply(urls, urlparse::url_parse),
urlparse_v2 = urlparse::url_parse_v2(urls),
httr2 = lapply(urls, httr2::url_parse),
curl = lapply(urls, curl::curl_parse_url),
urltools = urltools::url_parse(urls),
check = FALSE
))
#> # A tibble: 5 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 urlparse 19.9µs 21.4µs 44755. 200B 17.9
#> 2 urlparse_v2 10.5µs 11.2µs 87440. 4.3KB 8.74
#> 3 httr2 452.8µs 473.1µs 2060. 23.6KB 21.3
#> 4 curl 190.5µs 201.4µs 4881. 0B 9.64
#> 5 urltools 130.3µs 136.7µs 7196. 0B 12.3

show_relative(bm)
#> # A tibble: 5 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 urlparse 1.89 1.92 21.7 Inf 2.05
#> 2 urlparse_v2 1 1 42.4 Inf 1
#> 3 httr2 43.0 42.3 1 Inf 2.44
#> 4 curl 18.1 18.0 2.37 NaN 1.10
#> 5 urltools 12.4 12.2 3.49 NaN 1.41

ggplot2::autoplot(bm)
```

<img src="man/figures/README-benchmark_vectorise-1.png" width="100%" />

Note: `url_parse_v2` returns the parsed URLs as a `data.frame`, similar
to the behaviour of `urltools` and `adaR`:

``` r
urlparse::url_parse_v2(urls)
#> href
#> 1 https://www.example.com
#> 2 https://www.google.com/maps/place/Pennsylvania+Station/@40.7519848,-74.0015045,14.7z/data=!4m5!3m4!1s0x89c259ae15b2adcb:0x7955420634fd7eba!8m2!3d40.750568!4d-73.993519
#> 3 https://user_1:password_1@example.org:8080/dir/../api?q=1#frag
#> 4 https://user:password@example.com
#> 5 https://www.example.com:8080/search%3D1%2B3
#> 6 https://www.google.co.jp/search?q=ドイツ
#> 7 https://www.example.com:8080?var1=foo&var2=ba%20r&var3=baz+larry
#> 8 https://user:password@example.com:8080
#> 9 https://user:password@example.com
#> 10 https://user@example.com:8080
#> 11 https://user@example.com
#> scheme user password host port
#> 1 https www.example.com
#> 2 https www.google.com
#> 3 https user_1 password_1 example.org 8080
#> 4 https user password example.com
#> 5 https www.example.com 8080
#> 6 https www.google.co.jp
#> 7 https www.example.com 8080
#> 8 https user password example.com 8080
#> 9 https user password example.com
#> 10 https user example.com 8080
#> 11 https user example.com
#> path
#> 1
#> 2 /maps/place/Pennsylvania+Station/@40.7519848,-74.0015045,14.7z/data=!4m5!3m4!1s0x89c259ae15b2adcb:0x7955420634fd7eba!8m2!3d40.750568!4d-73.993519
#> 3 /dir/../api
#> 4
#> 5 /search=1+3
#> 6 /search
#> 7
#> 8
#> 9
#> 10
#> 11
#> raw_path
#> 1
#> 2 /maps/place/Pennsylvania+Station/@40.7519848,-74.0015045,14.7z/data=!4m5!3m4!1s0x89c259ae15b2adcb:0x7955420634fd7eba!8m2!3d40.750568!4d-73.993519
#> 3
#> 4
#> 5 /search%3D1%2B3
#> 6
#> 7
#> 8
#> 9
#> 10
#> 11
#> raw_query fragment
#> 1
#> 2
#> 3 q=1 frag
#> 4
#> 5
#> 6 q=%E3%83%89%E3%82%A4%E3%83%84
#> 7 var1=foo&var2=ba%20r&var3=baz%2Blarry
#> 8
#> 9
#> 10
#> 11
```
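
Because the result is an ordinary `data.frame`, base R subsetting applies directly. The sketch below rebuilds three rows by hand from the printed output above rather than calling the package, and it assumes (as the blank cells in that output suggest, though the source does not state it) that missing components are empty strings in character columns.

```r
# Three rows copied by hand from the printed output above (subset of columns).
parsed <- data.frame(
  scheme = c("https", "https", "https"),
  host = c("www.example.com", "example.org", "www.google.co.jp"),
  port = c("", "8080", ""),
  raw_query = c("", "q=1", "q=%E3%83%89%E3%82%A4%E3%83%84"),
  stringsAsFactors = FALSE
)

# Keep only the URLs that carry an explicit port.
with_port <- parsed[parsed$port != "", ]

# Count how many URLs were seen per host.
host_counts <- table(parsed$host)
```

With a list-returning parser the same filtering would need an extra step to combine the per-URL results first, which is the main convenience of the tabular return value.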

### Encoding URL:

Note: `urltools` encodes special characters to lower-case hex, i.e. “?”
@@ -185,19 +304,19 @@ string <- "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-._~`!@
#> # A tibble: 4 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
-#> 1 urlparse 1.48µs 1.56µs 623378. 208B 0
-#> 2 curl 2.3µs 2.42µs 399842. 3.06KB 0
-#> 3 urltools 2.42µs 2.67µs 370964. 2.48KB 0
-#> 4 base 79.09µs 83.15µs 11703. 28.59KB 8.24
+#> 1 urlparse 1.52µs 1.64µs 598950. 208B 0
+#> 2 curl 2.3µs 2.42µs 407439. 3.06KB 0
+#> 3 urltools 2.38µs 2.62µs 376010. 2.48KB 0
+#> 4 base 78.76µs 81.88µs 12031. 28.59KB 8.19

show_relative(bm)
#> # A tibble: 4 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <dbl> <dbl> <dbl> <dbl> <dbl>
-#> 1 urlparse 1 1 53.3 1 NaN
-#> 2 curl 1.56 1.55 34.2 15.0 NaN
-#> 3 urltools 1.64 1.71 31.7 12.2 NaN
-#> 4 base 53.6 53.4 1 141. Inf
+#> 1 urlparse 1 1 49.8 1 NaN
+#> 2 curl 1.51 1.47 33.9 15.0 NaN
+#> 3 urltools 1.57 1.60 31.3 12.2 NaN
+#> 4 base 51.9 49.9 1 141. Inf

ggplot2::autoplot(bm)
```
@@ -218,19 +337,19 @@ url <- paste0(sample(strsplit(string, "")[[1]], 1e4, replace = TRUE), collapse =
#> # A tibble: 4 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
-#> 1 urlparse 86.06µs 87.41µs 11291. 15.8KB 0
-#> 2 curl 92.95µs 94.26µs 10209. 0B 0
-#> 3 urltools 238.7µs 244.16µs 3950. 15.8KB 0
-#> 4 base 6.72ms 6.84ms 141. 333.2KB 9.91
+#> 1 urlparse 85.36µs 86.55µs 11420. 15.7KB 0
+#> 2 curl 92.05µs 93.69µs 10521. 0B 0
+#> 3 urltools 244.03µs 245.55µs 4047. 15.7KB 0
+#> 4 base 6.57ms 6.73ms 142. 332.2KB 9.84

show_relative(bm)
#> # A tibble: 4 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <dbl> <dbl> <dbl> <dbl> <dbl>
-#> 1 urlparse 1 1 80.2 Inf NaN
-#> 2 curl 1.08 1.08 72.5 NaN NaN
-#> 3 urltools 2.77 2.79 28.1 Inf NaN
-#> 4 base 78.1 78.2 1 Inf Inf
+#> 1 urlparse 1 1 80.6 Inf NaN
+#> 2 curl 1.08 1.08 74.2 NaN NaN
+#> 3 urltools 2.86 2.84 28.6 Inf NaN
+#> 4 base 77.0 77.8 1 Inf Inf

ggplot2::autoplot(bm)
```
Binary file modified man/figures/README-benchmark-1.png
Binary file modified man/figures/README-benchmark_encode_large-1.png
Binary file modified man/figures/README-benchmark_encode_small-1.png
Binary file added man/figures/README-benchmark_vectorise-1.png