Database creation pipeline (#11)
* Add preprocessing script

For the hive-partitioned dataset

* first version of database creation

also update disease search terms cols

* new script

* Update main.py

* Update main.py

* add log output for skipping

* format preproc

* add postproc, more efficient storage

* several fixes in main script, begin postproc

* larger postproc chunk size

* Update postproc.py

* update readme, big updates for main and postproc

- better input data
- renamed database_flat
- more complex query with optional proximity
- renamed "municipality" to "location"

* Update .gitignore

* Create location_search_terms.xlsx

* switch to much faster iteration

it is less memory-efficient

* add query amsterdam all diseases all years

note: we need word boundaries in disease mentions! tuberculosis == "tering", which is a common Dutch word ending ("spijsvertering", "godslastering")

* query for amsterdam, dordrecht, and groningen

* fix character error in Groningen regex

* add word boundaries to disease search terms

* move initialresults to archive

* move maps to archive

in preparation for splitting out analyses into a separate git repo

* move two more scripts to archive

* add more scripts to archive, update readme

* Correct path in api harvest for api key

* update readme for analysis split

* try turning off unicode support for (much) faster performance

* Update .gitignore

* fix postproc uncertainty non-coverage

* move to csv for search terms

* Update query_db.py

* add github release badge
vankesteren authored Dec 17, 2024
1 parent 0409b08 commit 0937ede
Showing 101 changed files with 1,504 additions and 48 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -163,6 +163,7 @@ cython_debug/
delpher_api/keys.txt
harvest_delpher_api/keys.txt
harvest_delpher_api/apikey.txt
src/harvest_delpher_api/apikey.txt

# uv lockfile
uv.lock
177 changes: 162 additions & 15 deletions README.md
@@ -1,27 +1,28 @@
# Disease database
[![Project Status: WIP – Initial development is in progress, but there has not yet been a stable, usable release suitable for the public.](https://www.repostatus.org/badges/latest/wip.svg)](https://www.repostatus.org/#wip)
[![GitHub Release](https://img.shields.io/github/v/release/sodascience/disease_database?include_prereleases)](https://github.com/sodascience/disease_database/releases/latest)

Code to create a historical disease database (19th-20th century) for municipalities in the Netherlands.

![Cholera in the Netherlands](img/cholera_1864_1868.png)

## Preparation

This project uses [pyproject.toml](pyproject.toml) to handle its dependencies. You can install them using pip like so:

```sh
pip install .
```

However, we recommend using [uv](https://github.com/astral-sh/uv) to manage the environment. First, install uv, then clone / download this repo, then run:

```sh
uv sync
```

This will automatically install the right Python version, create a virtual environment, and install the required packages. If you choose not to use `uv`, you can replace `uv run` in the code examples in this repo with `python`.

Note: on macOS, if you encounter `error: command 'cmake' failed: No such file or directory`, you need to install [cmake](https://cmake.org/download/) first (`brew install cmake`). Similarly, you may have to install `apache-arrow` separately as well (e.g., `brew install apache-arrow`).

Once these dependency issues are solved, run `uv sync` one more time.
@@ -44,28 +45,174 @@ This results in two kinds of polars dataframes saved in parquet format under `pr

Before you run the following script, make sure to put all the Delpher zip files under `raw_data/open_archive`.

```sh
uv run src/process_open_archive/extract_article_data.py
uv run src/process_open_archive/extract_meta_data.py
```

Then, run

```sh
uv run src/process_open_archive/combine_and_chunk.py
```
to join all the available datasets and create a yearly-chunked series of parquet files in the folder `processed_data/combined`.

## Data harvesting (1880-1940)
After 1880, the data is not public and can only be obtained through the Delpher API:

1. Obtain an API key (which looks like this `df2e02aa-8504-4af2-b3d9-64d107f4479a`) from the Royal Library / the Delpher maintainers, then put the API key in the file `src/harvest_delpher_api/apikey.txt`.
2. Harvest the data following the README in the Delpher API folder: [src/harvest_delpher_api/README.md](./src/harvest_delpher_api/README.md)

## Database creation
After the data has been harvested and processed from 1830-1940, the folder `processed_data/combined` should now be filled with `.parquet` files. The first record looks like this:

```py
import polars as pl
pl.scan_parquet("processed_data/combined/*.parquet").head(1).collect().glimpse()
```

```
$ newspaper_id <str> 'ddd:010041217:mpeg21'
$ article_id <str> 'ddd:010041217:mpeg21:a0001'
$ article_subject <str> 'artikel'
$ article_title <str> None
$ article_text <str> 'De GOUVERNEUR der PROVINCIE GELDERLAND ...'
$ newspaper_name <str> 'Arnhemsche courant'
$ newspaper_location <str> 'Arnhem'
$ newspaper_date <date> 1830-01-02
$ newspaper_years_digitalised <str> '1814 t/m 1850'
$ newspaper_years_issued <str> '1814-2001'
$ newspaper_language <str> 'nl'
$ newspaper_temporal <str> 'Dag'
$ newspaper_publisher <str> 'C.A. Thieme'
$ newspaper_spatial <str> 'Regionaal/lokaal'
```

### Step 1: pre-processing / re-partitioning
To make our data processing much faster, we will now process all these files into a hive-partitioned parquet folder, with subfolders for each year. This is done using the following command:

```sh
uv run src/create_database/preproc.py
```

After this, the folder `processed_data/partitioned` will contain differently organized parquet files that hold exactly the same information.
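
Because the year is encoded in the folder structure, scans can skip irrelevant partitions entirely. A minimal sketch of what this enables (assuming the partition column is named `year`; the filter value is illustrative):

```py
import polars as pl

# The `year` column is read from the hive folder names (year=1830/, year=1831/, ...)
lf = pl.scan_parquet("processed_data/partitioned/**/*.parquet", hive_partitioning=True)

# This filter is pushed down to the partition level, so only one year's files are read
print(lf.filter(pl.col("year") == 1848).select(pl.len()).collect())
```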

### Step 2: database computation

> NB: from this step onwards, we ran this on a Linux (Ubuntu) machine with >200 cores and 1 TB of memory.

The next step is to create the actual database we are interested in. There are three inputs for this:

| Input | Description |
| :---- | :---------- |
| `raw_data/manual_input/disease_search_terms.xlsx` | Contains a list of diseases and their regex search definitions |
| `raw_data/manual_input/location_search_terms.xlsx` | Contains a list of locations and their regex search definitions |
| `processed_data/partitioned/**/*.parquet` | Contains the texts of all articles from 1830-1940 |

The following command will take these inputs, perform the regex searches and output (many) `.parquet` files to `processed_data/database_flat`. On our big machine, this takes about 12 hours.

```sh
uv run src/create_database/main.py
```

It may be better to run this in the background without hangups:

```sh
nohup uv run src/create_database/main.py &
```
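
Conceptually, `main.py` counts, per location and month, how many articles match the location regex and how many of those also match the disease regex. Below is a minimal sketch of that core matching step, not the actual implementation: the location regex is illustrative, the disease regex is a case-insensitive variant of the cholera pattern from the search-term table, and `month` is assumed to be derived from `newspaper_date`.

```py
import polars as pl

# One example pair; the real script loops over all locations and diseases
location_re = r"(?i)\bamsterdam\b"
disease_re = r"(?i)\b(choler\w*|krim\s?koorts)\b"

lf = pl.scan_parquet("processed_data/partitioned/**/*.parquet", hive_partitioning=True)
counts = (
    lf.filter(pl.col("article_text").str.contains(location_re))
    .with_columns(month=pl.col("newspaper_date").dt.month())
    .group_by("year", "month")
    .agg(
        n_location=pl.len(),  # articles mentioning the location
        n_both=pl.col("article_text").str.contains(disease_re).sum(),  # of those, articles also mentioning the disease
    )
    .sort("year", "month")
)
print(counts.collect())
```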

The resulting data looks approximately like this:

```py
import polars as pl
pl.scan_parquet("processed_data/database_flat/*.parquet").head().collect()
```

```
shape: (5, 8)
┌──────┬───────┬────────────┬────────┬────────────┬─────────┬───────────────┬─────────┐
│ year ┆ month ┆ n_location ┆ n_both ┆ location ┆ cbscode ┆ amsterdamcode ┆ disease │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ i8 ┆ u32 ┆ u32 ┆ str ┆ i32 ┆ i32 ┆ str │
╞══════╪═══════╪════════════╪════════╪════════════╪═════════╪═══════════════╪═════════╡
│ 1834 ┆ 6 ┆ 1 ┆ 0 ┆ Aagtekerke ┆ 1000 ┆ 10531 ┆ typhus │
│ 1833 ┆ 12 ┆ 3 ┆ 0 ┆ Aagtekerke ┆ 1000 ┆ 10531 ┆ typhus │
│ 1834 ┆ 9 ┆ 1 ┆ 0 ┆ Aagtekerke ┆ 1000 ┆ 10531 ┆ typhus │
│ 1832 ┆ 5 ┆ 1 ┆ 0 ┆ Aagtekerke ┆ 1000 ┆ 10531 ┆ typhus │
│ 1831 ┆ 4 ┆ 2 ┆ 0 ┆ Aagtekerke ┆ 1000 ┆ 10531 ┆ typhus │
└──────┴───────┴────────────┴────────┴────────────┴─────────┴───────────────┴─────────┘
```

In this format, the column `n_location` counts the articles that mention the location / municipality in that year and month, and the column `n_both` counts how many of those articles also mention the disease.

### Step 3: post-processing

The last step is to organise the data (e.g., sorting by date), compute the normalized mentions, and add uncertainty intervals (through the [Jeffreys interval](https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Jeffreys_interval)):

```sh
uv run src/create_database/postproc.py
```
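
For reference: `normalized_mentions` is `n_both / n_location`, and the Jeffreys interval takes quantiles of a Beta(n_both + ½, n_location − n_both + ½) distribution. A minimal sketch of that computation — `scipy` is an assumption here; `postproc.py` may implement it differently:

```py
from scipy.stats import beta

def jeffreys_interval(n_both: int, n_location: int, level: float = 0.95) -> tuple[float, float]:
    """Jeffreys interval for the proportion n_both / n_location."""
    a, b = n_both + 0.5, n_location - n_both + 0.5
    # Standard Jeffreys convention: pin the bounds at 0 and 1 in the edge cases
    lower = 0.0 if n_both == 0 else float(beta.ppf((1 - level) / 2, a, b))
    upper = 1.0 if n_both == n_location else float(beta.ppf(1 - (1 - level) / 2, a, b))
    return lower, upper

# Reproduces the Aalsmeer row shown below: 0 disease mentions in 6 articles
print(jeffreys_interval(0, 6))  # -> (0.0, ~0.33)
```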

The resulting data folder `processed_data/database` looks like this:

```
database/
├── disease=cholera/
│ └── 00000000.parquet
├── disease=diphteria/
│ └── 00000000.parquet
├── disease=dysentery/
│ └── 00000000.parquet
├── disease=influenza/
│ └── 00000000.parquet
├── disease=malaria/
│ └── 00000000.parquet
├── disease=measles/
│ └── 00000000.parquet
├── disease=scarletfever/
│ └── 00000000.parquet
├── disease=smallpox/
│ └── 00000000.parquet
├── disease=tuberculosis/
│ └── 00000000.parquet
└── disease=typhus/
    └── 00000000.parquet
```

Now, for example, the typhus mentions in 1835 look like this:
```py
import polars as pl
lf = pl.scan_parquet("processed_data/database/**/*.parquet")
lf.filter(pl.col("disease") == "typhus", pl.col("year") == 1838).head().collect()
```
```
┌─────────┬──────┬───────┬───────────────┬─────────┬───────────────┬─────────────────────┬───────┬──────────┬────────────┬────────┐
│ disease ┆ year ┆ month ┆ location ┆ cbscode ┆ amsterdamcode ┆ normalized_mentions ┆ lower ┆ upper ┆ n_location ┆ n_both │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i32 ┆ i8 ┆ str ┆ i32 ┆ i32 ┆ f64 ┆ f64 ┆ f64 ┆ u32 ┆ u32 │
╞═════════╪══════╪═══════╪═══════════════╪═════════╪═══════════════╪═════════════════════╪═══════╪══════════╪════════════╪════════╡
│ typhus ┆ 1835 ┆ 1 ┆ Aalsmeer ┆ 358 ┆ 11264 ┆ 0.0 ┆ 0.0 ┆ 0.330389 ┆ 6 ┆ 0 │
│ typhus ┆ 1835 ┆ 1 ┆ Aalst ┆ 1001 ┆ 11423 ┆ 0.0 ┆ 0.0 ┆ 0.444763 ┆ 4 ┆ 0 │
│ typhus ┆ 1835 ┆ 1 ┆ Aalten ┆ 197 ┆ 11046 ┆ 0.0 ┆ 0.0 ┆ 0.853254 ┆ 1 ┆ 0 │
│ typhus ┆ 1835 ┆ 1 ┆ Aarlanderveen ┆ 1002 ┆ 11242 ┆ 0.0 ┆ 0.0 ┆ 0.330389 ┆ 6 ┆ 0 │
│ typhus ┆ 1835 ┆ 1 ┆ Aduard ┆ 2 ┆ 10999 ┆ 0.0 ┆ 0.0 ┆ 0.262217 ┆ 8 ┆ 0 │
│ typhus ┆ 1835 ┆ 1 ┆ Akersloot ┆ 360 ┆ 10346 ┆ 0.0 ┆ 0.0 ┆ 0.666822 ┆ 2 ┆ 0 │
│ typhus ┆ 1835 ┆ 1 ┆ Alblasserdam ┆ 482 ┆ 11327 ┆ 0.0 ┆ 0.0 ┆ 0.666822 ┆ 2 ┆ 0 │
│ typhus ┆ 1835 ┆ 1 ┆ Alkmaar ┆ 361 ┆ 10527 ┆ 0.0 ┆ 0.0 ┆ 0.045246 ┆ 54 ┆ 0 │
│ typhus ┆ 1835 ┆ 1 ┆ Alphen ┆ 1008 ┆ 10517 ┆ 0.0 ┆ 0.0 ┆ 0.11147 ┆ 21 ┆ 0 │
│ typhus ┆ 1835 ┆ 1 ┆ Ambt Delden ┆ 142 ┆ 11400 ┆ 0.0 ┆ 0.0 ┆ 0.444763 ┆ 4 ┆ 0 │
└─────────┴──────┴───────┴───────────────┴─────────┴───────────────┴─────────────────────┴───────┴──────────┴────────────┴────────┘
```


## Data analysis
For a basic analysis after the database has been created, take a look at the file `src/analysis/query_db.py`.

![](img/all_diseases_three_cities.png)

For more in-depth analysis and usage scripts, take a look at our analysis repository: [disease_database_analysis](https://github.com/sodascience/disease_database_analysis).
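
As a minimal sketch of such an analysis (assuming `matplotlib` is installed; the disease and location values are illustrative):

```py
import matplotlib.pyplot as plt
import polars as pl

df = (
    pl.scan_parquet("processed_data/database/**/*.parquet")
    .filter(pl.col("disease") == "cholera", pl.col("location") == "Amsterdam")
    .sort("year", "month")
    .collect()
)

# Fractional years on the x-axis, Jeffreys interval as an uncertainty band
x = (df["year"] + (df["month"] - 1) / 12).to_numpy()
plt.fill_between(x, df["lower"].to_numpy(), df["upper"].to_numpy(), alpha=0.3, label="95% interval")
plt.plot(x, df["normalized_mentions"].to_numpy(), label="normalized mentions")
plt.xlabel("Year")
plt.ylabel("Proportion of articles")
plt.title("Cholera mentions in Amsterdam")
plt.legend()
plt.show()
```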


## Contact
<img src="./img/soda_logo.png" alt="SoDa logo" width="250px"/>
20 files renamed without changes.
Binary file added img/all_diseases_three_cities.png
Binary file added img/amsterdam_all.png
File renamed without changes
4 changes: 3 additions & 1 deletion raw_data/manual_input/.gitignore
@@ -5,4 +5,6 @@
# whitelist files that can be uploaded
!query_names.xlsx
!disease_search_terms.xlsx
!location_search_terms.xlsx
!disease_search_terms.csv
!location_search_terms.csv
11 changes: 11 additions & 0 deletions raw_data/manual_input/disease_search_terms.csv
@@ -0,0 +1,11 @@
Label,Disease,Type ,Regex
Typhus,Typhoid fever; Paratyphoid fever,Food- and water-borne infectious diseases ,\b(ty(ph|f)(us|euz\w*)|febris\s?typhoidea|kwaadaardige\s?koorts)\b
Dysentery,Diarrhoea; Dysentery; Acute diseases of the digestive system,Food- and water-borne infectious diseases ,\b(diarrhoea|dysenter\w*|rood\s?loop|buik\s?loop|bloed\s?gang)\b
Cholera,Cholera (including: Asiatic cholera; Cholera nostras) ,Food- and water-borne infectious diseases ,\b(choler\w*|krim\s?koorts)\b
Smallpox,Smallpox,Airborne infectious diseases,\b(pokken|variola)\b
ScarletFever,Scarlet fever,Airborne infectious diseases,\b(rood\s?vonk|scarlatina|scharlaken\s?koorts)\b
Measles,Measles,Airborne infectious diseases,\b(mazelen|rood\s?ziekte|rubeola|rubella)\b
Tuberculosis,"Respiratory tuberculosis (incl: Tuberculosis of the lung and larynx, haemoptysis)",Airborne infectious diseases,\b(tering|verteringsziekte)\b
Diphteria,Croup; Diphtheria,Airborne infectious diseases,\b((c|k)roup|angina\s?diphtheri\w*|diphtheri\w*|difteritis)\b
Influenza,Acute respiratory disease (including influenza),Airborne infectious diseases,\b(griep|influenza)\b
Malaria,Malaria (including: intermittent fever; pernicious fever),Other infectious diseases (mixed aetiology),\b(malaria|moeras\s?koorts|polder\s?koorts)\b
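
The `\b` word boundaries in these patterns matter: as the commit history notes, `tering` (tuberculosis) is also a common Dutch word ending, so without boundaries the search would match unrelated words such as "spijsvertering". A quick illustration:

```py
import re

pattern = re.compile(r"\b(tering|verteringsziekte)\b")
print(bool(pattern.search("de tering heerst in de stad")))  # True: standalone word
print(bool(pattern.search("een goede spijsvertering")))     # False: \b blocks the suffix match
```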
Binary file modified raw_data/manual_input/disease_search_terms.xlsx
