update geocoding
JosiahParry committed Jun 5, 2024
1 parent f2c0d3a commit 15d89d2
Showing 3 changed files with 91 additions and 0 deletions.
15 changes: 15 additions & 0 deletions _freeze/docs/geocode/bulk-geocoding/execute-results/html.json
@@ -0,0 +1,15 @@
{
"hash": "f14351616d67d8381a1cf4cc5f0b03a4",
"result": {
"engine": "knitr",
"markdown": "---\ntitle: \"Bulk geocoding\"\n---\n\n\n\n\nBulk geocoding capabilities are provided via the `geocode_addresses()` function in `{arcgisgeocode}`. Rather geocoding a single address and returning match candidates, the bulk geocoding capabilities take many addresses and geocode them all at once returning a single location per address. \n\nUsing the bulk geocoding capabilities can result in incurring a cost. See more about [geocoding pricing](https://developers.arcgis.com/documentation/mapping-apis-and-services/geocoding/services/geocoding-service/#pricing).\n\n\nIn this example, you will geocode restaurant addresses in Boston, MA collected by the [Boston Area Research Initiative (BARI)](https://cssh.northeastern.edu/bari/). The data is originally from their [data portal](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/DMWCBT).\n\n# Step 1. Authenticate\n\nIn order to utilize the bulk geocoding capabilities of the ArcGIS World Geocoder, you must first authenticate using `{arcgisutils}`. In this example, we are using user-based authentication via `auth_user()`. You may choose a different authentication function if it works better for you. \n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(arcgisutils)\nlibrary(arcgisgeocode)\n\nset_arc_token(auth_user())\n```\n:::\n\n\n# Step 2. Prepare the data \n\nSimilar to using `find_address_candidates()` the geocoding results return an ID that can be used to join back onto the original dataset. First, you will read in the dataset from a filepath using `readr::read_csv()` and then create a unique identifier with `dplyr::mutate()` and `dplyr::row_number()`. 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Boston Yelp addresses\n# Source: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/DMWCBT\nfp <- \"https://analysis-1.maps.arcgis.com/sharing/rest/content/items/0423768816b343b69d9a425b82351912/data\"\n\nlibrary(dplyr)\nrestaurants <- readr::read_csv(fp) |>\n mutate(id = row_number())\n\nrestaurants\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2,664 × 28\n restaurant_name restaurant_ID restaurant_address restaurant_tag rating price\n <chr> <dbl> <chr> <chr> <dbl> <chr>\n 1 100% Delicias 2 635 Hyde Park Ave… Latin America… 2 $$ \n 2 100% Delicias E… 3 660A Centre St,Ja… Dominican,Emp… 4 <NA> \n 3 107 4 107 Salem St,Bost… Restaurants, NA <NA> \n 4 140 Supper Club 6 138 St James Ave,… Diners, 5 <NA> \n 5 163 Vietnamese … 7 66 Harrison Ave,B… Vietnamese,Co… 3.5 $ \n 6 180 Cafe 8 23 Edinboro St,Bo… Cafes, 4 <NA> \n 7 180 Restaurant … 9 174 Lincoln St,Bo… Restaurants, NA <NA> \n 8 224 Boston Stre… 11 224 Boston St,Dor… American (New… 4 $$ \n 9 24 Hour Pizza D… 12 686 Morton St,Bos… Pizza, 1 $$$$ \n10 2Twenty2 13 222 Friend St,Bos… Asian Fusion,… 3 <NA> \n# ℹ 2,654 more rows\n# ℹ 22 more variables: review_number <dbl>, unique_reviewer <dbl>,\n# reviews_Jan_19 <dbl>, reviews_Feb_19 <dbl>, reviews_Mar_19 <dbl>,\n# reviews_Apr_19 <dbl>, reviews_May_19 <dbl>, reviews_Jun_19 <dbl>,\n# reviews_Jul_19 <dbl>, reviews_Aug_19 <dbl>, reviews_Jan_20 <dbl>,\n# reviews_Feb_20 <dbl>, reviews_Mar_20 <dbl>, reviews_Apr_20 <dbl>,\n# reviews_May_20 <dbl>, reviews_Jun_20 <dbl>, reviews_Jul_20 <dbl>, …\n```\n\n\n:::\n:::\n\n\n# Step 3. Geocode addresses\n\nThe restaurant addresses are contained in the `restaurant_address` column. 
Pass this column into the `single_line` argument of `geocode_addresses()` and store the results in `geocoded`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ngeocoded <- geocode_addresses(\n single_line = restaurants[[\"restaurant_address\"]]\n)\n\n# preview the first 10 columns\nglimpse(geocoded[, 1:10])\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nRows: 2,664\nColumns: 11\n$ result_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…\n$ loc_name <chr> \"World\", \"World\", \"World\", \"World\", \"World\", \"World\", \"Wor…\n$ status <chr> \"M\", \"M\", \"M\", \"M\", \"M\", \"M\", \"M\", \"M\", \"M\", \"M\", \"M\", \"M\"…\n$ score <dbl> 100.00, 100.00, 100.00, 100.00, 100.00, 100.00, 100.00, 10…\n$ match_addr <chr> \"635 Hyde Park Avenue, Roslindale, Massachusetts, 02131\", …\n$ long_label <chr> \"635 Hyde Park Avenue, Roslindale, MA, 02131, USA\", \"660A …\n$ short_label <chr> \"635 Hyde Park Avenue\", \"660A Centre Street\", \"107\", \"138 …\n$ addr_type <chr> \"PointAddress\", \"PointAddress\", \"POI\", \"PointAddress\", \"Po…\n$ type_field <chr> NA, NA, \"Bank\", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…\n$ place_name <chr> NA, NA, \"107\", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…\n$ geometry <POINT [°]> POINT (-71.11936 42.27857), POINT (-71.11386 42.3128…\n```\n\n\n:::\n:::\n\n\n:::{.callout-tip}\nYou can use `dplyr::reframe()` to geocode these addresses in a dplyr-friendly way. \n:::\n\n# Step 4. Join the results\n\nIn the previous step you geocoded the addresses and returned a data frame containing the location information. More likely than not, it would be helpful to have the locations joined onto the original dataset. You can do this by using `dplyr::left_join()` and joining on the `id` column you created and the `result_id` from the geocoding results. 
\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\njoined_addresses <- left_join(\n restaurants,\n geocoded,\n by = c(\"id\" = \"result_id\")\n)\n\ndplyr::glimpse(joined_addresses)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nRows: 2,664\nColumns: 87\n$ restaurant_name <chr> \"100% Delicias\", \"100% Delicias Express\", \"107…\n$ restaurant_ID <dbl> 2, 3, 4, 6, 7, 8, 9, 11, 12, 13, 16, 17, 18, 2…\n$ restaurant_address <chr> \"635 Hyde Park Ave,Roslindale, MA 02131,\", \"66…\n$ restaurant_tag <chr> \"Latin American,Dominican,\", \"Dominican,Empana…\n$ rating <dbl> 2.0, 4.0, NA, 5.0, 3.5, 4.0, NA, 4.0, 1.0, 3.0…\n$ price <chr> \"$$\", NA, NA, NA, \"$\", NA, NA, \"$$\", \"$$$$\", N…\n$ review_number <dbl> 37, 26, 0, 1, 335, 8, 0, 248, 31, 63, 10, 232,…\n$ unique_reviewer <dbl> 34, 25, 0, 1, 335, 8, 0, 248, 31, 63, 10, 232,…\n$ reviews_Jan_19 <dbl> 0, 1, 0, 0, 0, 0, 0, 1, 0, 8, 0, 1, 7, 0, 1, 0…\n$ reviews_Feb_19 <dbl> 1, 2, 0, 0, 0, 0, 0, 4, 0, 3, 0, 0, 2, 0, 0, 0…\n$ reviews_Mar_19 <dbl> 1, 3, 0, 0, 0, 1, 0, 5, 1, 2, 0, 0, 3, 0, 2, 0…\n$ reviews_Apr_19 <dbl> 0, 3, 0, 0, 1, 0, 0, 3, 0, 4, 0, 3, 5, 0, 0, 0…\n$ reviews_May_19 <dbl> 2, 1, 0, 0, 1, 0, 0, 1, 0, 2, 0, 0, 6, 0, 0, 0…\n$ reviews_Jun_19 <dbl> 0, 0, 0, 0, 1, 0, 0, 1, 0, 4, 0, 1, 3, 0, 0, 0…\n$ reviews_Jul_19 <dbl> 0, 1, 0, 0, 3, 1, 0, 4, 1, 0, 4, 0, 3, 0, 2, 0…\n$ reviews_Aug_19 <dbl> 0, 7, 0, 0, 0, 0, 0, 3, 0, 7, 3, 0, 0, 0, 0, 0…\n$ reviews_Jan_20 <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 5, 1, 0, 0…\n$ reviews_Feb_20 <dbl> 0, 1, 0, 0, 1, 0, 0, 2, 0, 2, 1, 3, 8, 6, 0, 0…\n$ reviews_Mar_20 <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 6, 0, 0…\n$ reviews_Apr_20 <dbl> 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 2, 0, 0…\n$ reviews_May_20 <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0…\n$ reviews_Jun_20 <dbl> 0, 0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 6, 0, 0…\n$ reviews_Jul_20 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 3, 0…\n$ reviews_Aug_20 <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 4, 1, 0…\n$ 
restaurant_neighborhood <chr> \"Roslindale\", \"Jamaica Plain\", \"Boston\", \"Bost…\n$ GIS_ID <dbl> 1806741000, 1901410000, 302366000, 401087000, …\n$ CT_ID_10 <dbl> 25025140400, 25025120400, 25025030400, 2502501…\n$ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…\n$ loc_name <chr> \"World\", \"World\", \"World\", \"World\", \"World\", \"…\n$ status <chr> \"M\", \"M\", \"M\", \"M\", \"M\", \"M\", \"M\", \"M\", \"M\", \"…\n$ score <dbl> 100.00, 100.00, 100.00, 100.00, 100.00, 100.00…\n$ match_addr <chr> \"635 Hyde Park Avenue, Roslindale, Massachuset…\n$ long_label <chr> \"635 Hyde Park Avenue, Roslindale, MA, 02131, …\n$ short_label <chr> \"635 Hyde Park Avenue\", \"660A Centre Street\", …\n$ addr_type <chr> \"PointAddress\", \"PointAddress\", \"POI\", \"PointA…\n$ type_field <chr> NA, NA, \"Bank\", NA, NA, NA, NA, NA, NA, NA, NA…\n$ place_name <chr> NA, NA, \"107\", NA, NA, NA, NA, NA, NA, NA, NA,…\n$ place_addr <chr> \"635 Hyde Park Avenue, Roslindale, Massachuset…\n$ phone <chr> NA, NA, \"(617) 227-6236\", NA, NA, NA, NA, NA, …\n$ url <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…\n$ rank <dbl> 20, 20, 19, 20, 20, 20, 20, 20, 20, 20, 20, 20…\n$ add_bldg <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…\n$ add_num <chr> \"635\", \"660A\", \"107\", \"138\", \"66\", \"23\", \"174\"…\n$ add_num_from <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…\n$ add_num_to <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…\n$ add_range <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…\n$ side <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…\n$ st_pre_dir <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…\n$ st_pre_type <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…\n$ st_name <chr> \"Hyde Park\", \"Centre\", \"Salem\", \"Saint James\",…\n$ st_type <chr> \"Avenue\", \"Street\", \"St\", \"Avenue\", \"Avenue\", …\n$ st_dir <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…\n$ bldg_type <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA…\n$ bldg_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…\n$ level_type <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…\n$ level_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…\n$ unit_type <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…\n$ unit_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…\n$ sub_addr <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…\n$ st_addr <chr> \"635 Hyde Park Avenue\", \"660A Centre Street\", …\n$ block <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…\n$ sector <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…\n$ nbrhd <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…\n$ district <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…\n$ city <chr> \"Roslindale\", \"Jamaica Plain\", \"Boston\", \"Bost…\n$ metro_area <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…\n$ subregion <chr> \"Suffolk County\", \"Suffolk County\", \"Suffolk C…\n$ region <chr> \"Massachusetts\", \"Massachusetts\", \"Massachuset…\n$ region_abbr <chr> \"MA\", \"MA\", \"MA\", \"MA\", \"MA\", \"MA\", \"MA\", \"MA\"…\n$ territory <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…\n$ zone <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…\n$ postal <chr> \"02131\", \"02130\", \"02113\", \"02116\", \"02111\", \"…\n$ postal_ext <chr> \"4723\", NA, NA, \"5071\", \"1907\", \"2131\", \"2404\"…\n$ country <chr> \"USA\", \"USA\", \"USA\", \"USA\", \"USA\", \"USA\", \"USA…\n$ cntry_name <chr> \"United States\", \"United States\", \"United Stat…\n$ lang_code <chr> \"ENG\", \"ENG\", \"ENG\", \"ENG\", \"ENG\", \"ENG\", \"ENG…\n$ distance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…\n$ x <dbl> -71.11936, -71.11386, -71.05537, -71.07624, -7…\n$ y <dbl> 42.27857, 42.31285, 42.36419, 42.34923, 42.351…\n$ display_x <dbl> -71.11936, -71.11386, -71.05537, -71.07624, -7…\n$ display_y <dbl> 42.27857, 42.31285, 42.36419, 42.34923, 42.351…\n$ xmin <dbl> -71.12036, -71.11486, -71.05637, -71.07724, -7…\n$ xmax 
<dbl> -71.11836, -71.11286, -71.05437, -71.07524, -7…\n$ ymin <dbl> 42.27757, 42.31185, 42.36319, 42.34823, 42.350…\n$ ymax <dbl> 42.27957, 42.31385, 42.36519, 42.35023, 42.352…\n$ ex_info <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…\n$ geometry <POINT [°]> POINT (-71.11936 42.27857), POINT (-71.1…\n```\n\n\n:::\n:::",
"supporting": [],
"filters": [
"rmarkdown/pagebreak.lua"
],
"includes": {},
"engineDependencies": {},
"preserve": {},
"postProcess": true
}
}
2 changes: 2 additions & 0 deletions _quarto.yml
@@ -52,6 +52,8 @@ website:
contents:
- docs/geocode/overview.qmd
- docs/geocode/forward-geocoding.qmd
- href: docs/geocode/bulk-geocoding.qmd
text: "Bulk Geocoding"
- section: Places
74 changes: 74 additions & 0 deletions docs/geocode/bulk-geocoding.qmd
@@ -0,0 +1,74 @@
---
title: "Bulk geocoding"
---

```{r include=FALSE}
knitr::opts_chunk$set(message = FALSE)
```

Bulk geocoding capabilities are provided via the `geocode_addresses()` function in `{arcgisgeocode}`. Rather than geocoding a single address and returning match candidates, bulk geocoding takes many addresses and geocodes them all at once, returning a single location per address.

Bulk geocoding may incur a cost. See more about [geocoding pricing](https://developers.arcgis.com/documentation/mapping-apis-and-services/geocoding/services/geocoding-service/#pricing).


In this example, you will geocode restaurant addresses in Boston, MA collected by the [Boston Area Research Initiative (BARI)](https://cssh.northeastern.edu/bari/). The data is originally from their [data portal](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/DMWCBT).

# Step 1. Authenticate

To use the bulk geocoding capabilities of the ArcGIS World Geocoder, you must first authenticate using `{arcgisutils}`. This example uses user-based authentication via `auth_user()`. You may choose a different authentication function if it works better for you.


```{r message=FALSE}
library(arcgisutils)
library(arcgisgeocode)
set_arc_token(auth_user())
```

# Step 2. Prepare the data

As with `find_address_candidates()`, the geocoding results include an ID that can be used to join back onto the original dataset. First, you will read in the dataset from a URL using `readr::read_csv()`, then create a unique identifier with `dplyr::mutate()` and `dplyr::row_number()`.

```{r message=FALSE}
# Boston Yelp addresses
# Source: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/DMWCBT
fp <- "https://analysis-1.maps.arcgis.com/sharing/rest/content/items/0423768816b343b69d9a425b82351912/data"

library(dplyr)
restaurants <- readr::read_csv(fp) |>
  mutate(id = row_number())

restaurants
```

# Step 3. Geocode addresses

The restaurant addresses are contained in the `restaurant_address` column. Pass this column into the `single_line` argument of `geocode_addresses()` and store the results in `geocoded`.

```{r message=FALSE}
geocoded <- geocode_addresses(
  single_line = restaurants[["restaurant_address"]]
)

# preview the first 10 columns (the sticky sf geometry column is also kept)
glimpse(geocoded[, 1:10])
```

:::{.callout-tip}
You can call `geocode_addresses()` inside `dplyr::reframe()` to geocode these addresses in a dplyr-friendly way.
:::
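
The `reframe()` pattern mentioned in the tip can be sketched as follows. To keep the sketch runnable without authentication or geocoding credits, it uses `mock_geocode()`, a hypothetical stand-in that only mimics the shape of the geocoding output (one row per address plus a `result_id`); the toy `addresses` table and the `match_addr` values are likewise made up for illustration.

```r
library(dplyr)

# Hypothetical stand-in for geocode_addresses(): returns one row per input
# address with a `result_id`. It does NOT call any geocoding service.
mock_geocode <- function(single_line) {
  tibble(
    result_id = seq_along(single_line),
    match_addr = toupper(single_line)
  )
}

# Toy input resembling the `restaurants` table
addresses <- tibble(
  id = 1:3,
  restaurant_address = c("635 Hyde Park Ave", "660A Centre St", "107 Salem St")
)

# reframe() unpacks an unnamed data frame result into columns
geocoded_mock <- addresses |>
  reframe(mock_geocode(single_line = restaurant_address))

geocoded_mock
```

Swapping `geocode_addresses()` in for the stand-in should follow the same pattern, though that variant has not been run here. Note that on ungrouped data `reframe()` returns only the newly created columns, so the join in the next step is still needed to bring the results back onto the original data.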

# Step 4. Join the results

In the previous step, you geocoded the addresses and got back a data frame containing the location information. In most cases, it is helpful to have the locations joined onto the original dataset. You can do this with `dplyr::left_join()`, matching the `id` column you created to the `result_id` column in the geocoding results.


```{r}
joined_addresses <- left_join(
  restaurants,
  geocoded,
  by = c("id" = "result_id")
)

dplyr::glimpse(joined_addresses)
```
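
Once the results are joined, it can be useful to check match quality using the `status` and `score` columns shown in the output. The following is a minimal sketch on made-up data: `joined_mock` is a hypothetical stand-in for a few rows of `joined_addresses`, and the score threshold of 90 is an arbitrary choice for illustration. It assumes the usual ArcGIS status codes, where `M` is a match, `T` is a tie, and `U` is unmatched.

```r
library(dplyr)

# Hypothetical stand-in for a few rows of `joined_addresses`
joined_mock <- tibble(
  id = 1:4,
  restaurant_name = c("A", "B", "C", "D"),
  status = c("M", "M", "T", "U"),
  score = c(100, 92.5, 85, NA)
)

# Tally the results by match status
status_counts <- count(joined_mock, status)

# Keep only confident matches; the threshold of 90 is arbitrary
good_matches <- filter(joined_mock, status == "M", score >= 90)

status_counts
good_matches
```

The same two calls could be applied to `joined_addresses` directly; a suitable threshold depends on how much positional error the analysis can tolerate.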