Database creation pipeline (#11)
* Add preprocessing script

For the hive-partitioned dataset

* first version of database creation

also update disease search terms cols

* new script

* Update main.py

* Update main.py

* add log output for skipping

* format preproc

* add postproc, more efficient storage

* several fixes in main script, begin postproc

* larger postproc chunk size

* Update postproc.py

* update readme, big updates for main and postproc

- better input data
- renamed database_flat
- more complex query with optional proximity
- renamed "municipality" to "location"

* Update .gitignore

* Create location_search_terms.xlsx

* switch to much faster iteration

it is less memory-efficient

* add query amsterdam all diseases all years

note: we need word boundaries in disease mentions! tuberculosis == "tering", which is a common Dutch word ending ("spijsvertering", "godslastering")

* query for amsterdam, dordrecht, and groningen

* fix character error in Groningen regex

* add word boundaries to disease search terms

* move initialresults to archive

* move maps to archive

in preparation for splitting out analyses into a separate git repo

* move two more scripts to archive

* add more scripts to archive, update readme

* Correct path in api harvest for api key

* update readme for analysis split

* try turning off unicode support for (much) faster performance

* Update .gitignore

* fix postproc uncertainty non-coverage

* move to csv for search terms

* Update query_db.py

* add github release badge
vankesteren authored Dec 17, 2024
1 parent 0409b08 commit 0937ede
Showing 101 changed files with 1,504 additions and 48 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -163,6 +163,7 @@ cython_debug/
delpher_api/keys.txt
harvest_delpher_api/keys.txt
harvest_delpher_api/apikey.txt
src/harvest_delpher_api/apikey.txt

# uv lockfile
uv.lock
177 changes: 162 additions & 15 deletions README.md
@@ -1,27 +1,28 @@
# Disease database
[![Project Status: WIP – Initial development is in progress, but there has not yet been a stable, usable release suitable for the public.](https://www.repostatus.org/badges/latest/wip.svg)](https://www.repostatus.org/#wip)
[![GitHub Release](https://img.shields.io/github/v/release/sodascience/disease_database?include_prereleases)](https://github.com/sodascience/disease_database/releases/latest)

Code to create a historical disease database (19th-20th century) for municipalities in the Netherlands.

![Cholera in the Netherlands](img/cholera_1864_1868.png)

## Preparation

This project uses [pyproject.toml](pyproject.toml) to handle its dependencies. You can install them using pip like so:

```sh
pip install .
```

However, we recommend using [uv](https://github.com/astral-sh/uv) to manage the environment. First, install uv, then clone / download this repo, then run:

```sh
uv sync
```

This will automatically install the right Python version, create a virtual environment, and install the required packages. If you choose not to use `uv`, you can replace `uv run` in the code examples in this repo with `python`.

Note: on macOS, if you encounter `error: command 'cmake' failed: No such file or directory`, you need to install [cmake](https://cmake.org/download/) first (`brew install cmake`). Similarly, you may have to install `apache-arrow` separately as well (e.g., `brew install apache-arrow`).

Once these dependency issues are solved, run `uv sync` one more time.
@@ -44,28 +45,174 @@ This results in two kinds of polars dataframes saved in parquet format under `pr

Before you run the following script, make sure to put all the Delpher zip files under `raw_data/open_archive`.

```sh
uv run src/process_open_archive/extract_article_data.py
uv run src/process_open_archive/extract_meta_data.py
```

Then, run

```sh
uv run src/process_open_archive/combine_and_chunk.py
```
to join all the available datasets and create a yearly-chunked series of parquet files in the folder `processed_data/combined`.

## Data harvesting (1880-1940)
After 1880, the data is not public and can only be obtained through the Delpher API:

1. Obtain an API key (which looks like this `df2e02aa-8504-4af2-b3d9-64d107f4479a`) from the Royal Library / the Delpher maintainers, then put the API key in the file `src/harvest_delpher_api/apikey.txt`.
2. Harvest the data following the README in the Delpher API folder: [src/harvest_delpher_api/README.md](./src/harvest_delpher_api/README.md)

## Database creation
After the data has been harvested and processed from 1830-1940, the folder `processed_data/combined` should now be filled with `.parquet` files. The first record looks like this:

```py
import polars as pl
pl.scan_parquet("processed_data/combined/*.parquet").head(1).collect().glimpse()
```

```
$ newspaper_id <str> 'ddd:010041217:mpeg21'
$ article_id <str> 'ddd:010041217:mpeg21:a0001'
$ article_subject <str> 'artikel'
$ article_title <str> None
$ article_text <str> 'De GOUVERNEUR der PROVINCIE GELDERLAND ...'
$ newspaper_name <str> 'Arnhemsche courant'
$ newspaper_location <str> 'Arnhem'
$ newspaper_date <date> 1830-01-02
$ newspaper_years_digitalised <str> '1814 t/m 1850'
$ newspaper_years_issued <str> '1814-2001'
$ newspaper_language <str> 'nl'
$ newspaper_temporal <str> 'Dag'
$ newspaper_publisher <str> 'C.A. Thieme'
$ newspaper_spatial <str> 'Regionaal/lokaal'
```

### Step 1: pre-processing / re-partitioning
To make our data processing much faster, we will now process all these files into a hive-partitioned parquet folder, with subfolders for each year. This is done using the following command:

```sh
uv run src/create_database/preproc.py
```

After this, the folder `processed_data/partitioned` will contain differently organized parquet files that hold exactly the same information.
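
Because the year is encoded in the folder structure, scans can skip irrelevant partitions entirely. A minimal sketch of what this enables (assuming the partition column is named `year`; the filter value is illustrative):

```py
import polars as pl

# The `year` column is read from the hive folder names (year=1830/, year=1831/, ...)
lf = pl.scan_parquet("processed_data/partitioned/**/*.parquet", hive_partitioning=True)

# This filter is pushed down to the partition level, so only one year's files are read
print(lf.filter(pl.col("year") == 1848).select(pl.len()).collect())
```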

### Step 2: database computation

> NB: from this step onwards, we ran this on a Linux (Ubuntu) machine with >200 cores and 1 TB of memory.

The next step is to create the actual database we are interested in. There are three inputs for this:

| Input | Description |
| :---- | :---------- |
| `raw_data/manual_input/disease_search_terms.xlsx` | Contains a list of diseases and their regex search definitions |
| `raw_data/manual_input/location_search_terms.xlsx` | Contains a list of locations and their regex search definitions |
| `processed_data/partitioned/**/*.parquet` | Contains the texts of all articles from 1830-1940 |

The following command will take these inputs, perform the regex searches and output (many) `.parquet` files to `processed_data/database_flat`. On our big machine, this takes about 12 hours.

```sh
uv run src/create_database/main.py
```

It may be better to run this in the background without hangups:

```sh
nohup uv run src/create_database/main.py &
```
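
Conceptually, `main.py` counts, per location and month, how many articles match the location regex and how many of those also match the disease regex. Below is a minimal sketch of that core matching step, not the actual implementation: the location regex is illustrative, the disease regex is a case-insensitive variant of the cholera pattern from the search-term table, and `month` is assumed to be derived from `newspaper_date`.

```py
import polars as pl

# One example pair; the real script loops over all locations and diseases
location_re = r"(?i)\bamsterdam\b"
disease_re = r"(?i)\b(choler\w*|krim\s?koorts)\b"

lf = pl.scan_parquet("processed_data/partitioned/**/*.parquet", hive_partitioning=True)
counts = (
    lf.filter(pl.col("article_text").str.contains(location_re))
    .with_columns(month=pl.col("newspaper_date").dt.month())
    .group_by("year", "month")
    .agg(
        n_location=pl.len(),  # articles mentioning the location
        n_both=pl.col("article_text").str.contains(disease_re).sum(),  # of those, articles also mentioning the disease
    )
    .sort("year", "month")
)
print(counts.collect())
```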

The resulting data looks approximately like this:

```py
import polars as pl
pl.scan_parquet("processed_data/database_flat/*.parquet").head().collect()
```

```
shape: (5, 8)
┌──────┬───────┬────────────┬────────┬────────────┬─────────┬───────────────┬─────────┐
│ year ┆ month ┆ n_location ┆ n_both ┆ location ┆ cbscode ┆ amsterdamcode ┆ disease │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ i8 ┆ u32 ┆ u32 ┆ str ┆ i32 ┆ i32 ┆ str │
╞══════╪═══════╪════════════╪════════╪════════════╪═════════╪═══════════════╪═════════╡
│ 1834 ┆ 6 ┆ 1 ┆ 0 ┆ Aagtekerke ┆ 1000 ┆ 10531 ┆ typhus │
│ 1833 ┆ 12 ┆ 3 ┆ 0 ┆ Aagtekerke ┆ 1000 ┆ 10531 ┆ typhus │
│ 1834 ┆ 9 ┆ 1 ┆ 0 ┆ Aagtekerke ┆ 1000 ┆ 10531 ┆ typhus │
│ 1832 ┆ 5 ┆ 1 ┆ 0 ┆ Aagtekerke ┆ 1000 ┆ 10531 ┆ typhus │
│ 1831 ┆ 4 ┆ 2 ┆ 0 ┆ Aagtekerke ┆ 1000 ┆ 10531 ┆ typhus │
└──────┴───────┴────────────┴────────┴────────────┴─────────┴───────────────┴─────────┘
```

In this format, the column `n_location` counts the articles that mention the location / municipality in that year and month, and the column `n_both` counts how many of those articles also mention the disease.

### Step 3: post-processing

The last step is to organise the data (e.g., sorting by date), compute the normalized mentions, and add uncertainty intervals (through the [Jeffreys interval](https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Jeffreys_interval)):

```sh
uv run src/create_database/postproc.py
```
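
For reference: `normalized_mentions` is `n_both / n_location`, and the Jeffreys interval takes quantiles of a Beta(n_both + ½, n_location − n_both + ½) distribution. A minimal sketch of that computation — `scipy` is an assumption here; `postproc.py` may implement it differently:

```py
from scipy.stats import beta

def jeffreys_interval(n_both: int, n_location: int, level: float = 0.95) -> tuple[float, float]:
    """Jeffreys interval for the proportion n_both / n_location."""
    a, b = n_both + 0.5, n_location - n_both + 0.5
    # Standard Jeffreys convention: pin the bounds at 0 and 1 in the edge cases
    lower = 0.0 if n_both == 0 else float(beta.ppf((1 - level) / 2, a, b))
    upper = 1.0 if n_both == n_location else float(beta.ppf(1 - (1 - level) / 2, a, b))
    return lower, upper

# Reproduces the Aalsmeer row shown below: 0 disease mentions in 6 articles
print(jeffreys_interval(0, 6))  # -> (0.0, ~0.33)
```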

The resulting data folder `processed_data/database` looks like this:

```
database/
├── disease=cholera/
│ └── 00000000.parquet
├── disease=diphteria/
│ └── 00000000.parquet
├── disease=dysentery/
│ └── 00000000.parquet
├── disease=influenza/
│ └── 00000000.parquet
├── disease=malaria/
│ └── 00000000.parquet
├── disease=measles/
│ └── 00000000.parquet
├── disease=scarletfever/
│ └── 00000000.parquet
├── disease=smallpox/
│ └── 00000000.parquet
├── disease=tuberculosis/
│ └── 00000000.parquet
└── disease=typhus/
    └── 00000000.parquet
```

Now, for example, the typhus mentions in 1835 look like this:
```py
import polars as pl
lf = pl.scan_parquet("processed_data/database/**/*.parquet")
lf.filter(pl.col("disease") == "typhus", pl.col("year") == 1838).head().collect()
```
```
┌─────────┬──────┬───────┬───────────────┬─────────┬───────────────┬─────────────────────┬───────┬──────────┬────────────┬────────┐
│ disease ┆ year ┆ month ┆ location ┆ cbscode ┆ amsterdamcode ┆ normalized_mentions ┆ lower ┆ upper ┆ n_location ┆ n_both │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i32 ┆ i8 ┆ str ┆ i32 ┆ i32 ┆ f64 ┆ f64 ┆ f64 ┆ u32 ┆ u32 │
╞═════════╪══════╪═══════╪═══════════════╪═════════╪═══════════════╪═════════════════════╪═══════╪══════════╪════════════╪════════╡
│ typhus ┆ 1835 ┆ 1 ┆ Aalsmeer ┆ 358 ┆ 11264 ┆ 0.0 ┆ 0.0 ┆ 0.330389 ┆ 6 ┆ 0 │
│ typhus ┆ 1835 ┆ 1 ┆ Aalst ┆ 1001 ┆ 11423 ┆ 0.0 ┆ 0.0 ┆ 0.444763 ┆ 4 ┆ 0 │
│ typhus ┆ 1835 ┆ 1 ┆ Aalten ┆ 197 ┆ 11046 ┆ 0.0 ┆ 0.0 ┆ 0.853254 ┆ 1 ┆ 0 │
│ typhus ┆ 1835 ┆ 1 ┆ Aarlanderveen ┆ 1002 ┆ 11242 ┆ 0.0 ┆ 0.0 ┆ 0.330389 ┆ 6 ┆ 0 │
│ typhus ┆ 1835 ┆ 1 ┆ Aduard ┆ 2 ┆ 10999 ┆ 0.0 ┆ 0.0 ┆ 0.262217 ┆ 8 ┆ 0 │
│ typhus ┆ 1835 ┆ 1 ┆ Akersloot ┆ 360 ┆ 10346 ┆ 0.0 ┆ 0.0 ┆ 0.666822 ┆ 2 ┆ 0 │
│ typhus ┆ 1835 ┆ 1 ┆ Alblasserdam ┆ 482 ┆ 11327 ┆ 0.0 ┆ 0.0 ┆ 0.666822 ┆ 2 ┆ 0 │
│ typhus ┆ 1835 ┆ 1 ┆ Alkmaar ┆ 361 ┆ 10527 ┆ 0.0 ┆ 0.0 ┆ 0.045246 ┆ 54 ┆ 0 │
│ typhus ┆ 1835 ┆ 1 ┆ Alphen ┆ 1008 ┆ 10517 ┆ 0.0 ┆ 0.0 ┆ 0.11147 ┆ 21 ┆ 0 │
│ typhus ┆ 1835 ┆ 1 ┆ Ambt Delden ┆ 142 ┆ 11400 ┆ 0.0 ┆ 0.0 ┆ 0.444763 ┆ 4 ┆ 0 │
└─────────┴──────┴───────┴───────────────┴─────────┴───────────────┴─────────────────────┴───────┴──────────┴────────────┴────────┘
```


## Data analysis
For a basic analysis after the database has been created, take a look at the file `src/analysis/query_db.py`.

![](img/all_diseases_three_cities.png)

For more in-depth analysis and usage scripts, take a look at our analysis repository: [disease_database_analysis](https://github.com/sodascience/disease_database_analysis).
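
As a minimal sketch of such an analysis (assuming `matplotlib` is installed; the disease and location values are illustrative):

```py
import matplotlib.pyplot as plt
import polars as pl

df = (
    pl.scan_parquet("processed_data/database/**/*.parquet")
    .filter(pl.col("disease") == "cholera", pl.col("location") == "Amsterdam")
    .sort("year", "month")
    .collect()
)

# Fractional years on the x-axis, Jeffreys interval as an uncertainty band
x = (df["year"] + (df["month"] - 1) / 12).to_numpy()
plt.fill_between(x, df["lower"].to_numpy(), df["upper"].to_numpy(), alpha=0.3, label="95% interval")
plt.plot(x, df["normalized_mentions"].to_numpy(), label="normalized mentions")
plt.xlabel("Year")
plt.ylabel("Proportion of articles")
plt.title("Cholera mentions in Amsterdam")
plt.legend()
plt.show()
```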


## Contact
<img src="./img/soda_logo.png" alt="SoDa logo" width="250px"/>
20 files renamed without changes.
Binary file added img/all_diseases_three_cities.png
Binary file added img/amsterdam_all.png
File renamed without changes
4 changes: 3 additions & 1 deletion raw_data/manual_input/.gitignore
@@ -5,4 +5,6 @@
# whitelist files that can be uploaded
!query_names.xlsx
!disease_search_terms.xlsx
!location_search_terms.xlsx
!disease_search_terms.csv
!location_search_terms.csv
11 changes: 11 additions & 0 deletions raw_data/manual_input/disease_search_terms.csv
@@ -0,0 +1,11 @@
Label,Disease,Type ,Regex
Typhus,Typhoid fever; Paratyphoid fever,Food- and water-borne infectious diseases ,\b(ty(ph|f)(us|euz\w*)|febris\s?typhoidea|kwaadaardige\s?koorts)\b
Dysentery,Diarrhoea; Dysentery; Acute diseases of the digestive system,Food- and water-borne infectious diseases ,\b(diarrhoea|dysenter\w*|rood\s?loop|buik\s?loop|bloed\s?gang)\b
Cholera,Cholera (including: Asiatic cholera; Cholera nostras) ,Food- and water-borne infectious diseases ,\b(choler\w*|krim\s?koorts)\b
Smallpox,Smallpox,Airborne infectious diseases,\b(pokken|variola)\b
ScarletFever,Scarlet fever,Airborne infectious diseases,\b(rood\s?vonk|scarlatina|scharlaken\s?koorts)\b
Measles,Measles,Airborne infectious diseases,\b(mazelen|rood\s?ziekte|rubeola|rubella)\b
Tuberculosis,"Respiratory tuberculosis (incl: Tuberculosis of the lung and larynx, haemoptysis)",Airborne infectious diseases,\b(tering|verteringsziekte)\b
Diphteria,Croup; Diphtheria,Airborne infectious diseases,\b((c|k)roup|angina\s?diphtheri\w*|diphtheri\w*|difteritis)\b
Influenza,Acute respiratory disease (including influenza),Airborne infectious diseases,\b(griep|influenza)\b
Malaria,Malaria (including: intermittent fever; pernicious fever),Other infectious diseases (mixed aetiology),\b(malaria|moeras\s?koorts|polder\s?koorts)\b
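
The `\b` word boundaries in these patterns matter: as the commit history notes, `tering` (tuberculosis) is also a common Dutch word ending, so without boundaries the search would match unrelated words such as "spijsvertering". A quick illustration:

```py
import re

pattern = re.compile(r"\b(tering|verteringsziekte)\b")
print(bool(pattern.search("de tering heerst in de stad")))  # True: standalone word
print(bool(pattern.search("een goede spijsvertering")))     # False: \b blocks the suffix match
```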
Binary file modified raw_data/manual_input/disease_search_terms.xlsx
