
PSM table generation #150

Merged: 17 commits into main on Dec 5, 2023
Conversation

dogversioning (Contributor) commented Nov 30, 2023

This PR adds the research done as part of the suicidality studies around propensity score matching (PSM) as a standard statistical tool in the library.

It attempts to deal solely with the PSM logistics, not the CLI/study logistics. Here's what it's trying to do:

  • Provides a new table builder extension for PSM
  • Provides PSM-specific jinja templates
  • Creates a toml input format for configuring PSM jobs
  • Ancillarily, provides a conftest fixture for running tests with duckdb

It does not attempt to do the following, to limit cognitive load; these will follow in later PRs:

  • Run PSM from a manifest
  • Metadata table persistence for managing multiple PSM tables
  • CLI commands around optionally running/cleaning PSM tables
  • PSM date filtering on cohorts, which needs some more definition
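
For background, the matching step at the heart of PSM can be sketched in a few lines of plain Python. This is an illustrative greedy nearest-neighbor matcher over precomputed propensity scores, not the Library's implementation; all names here (`greedy_match`, `caliper`) are hypothetical.

```python
def greedy_match(treated: dict, control: dict, caliper: float = 0.05) -> list:
    """Pair each treated id with the closest unmatched control id.

    `treated` and `control` map record id -> propensity score. Controls
    are consumed as they are matched; pairs whose score gap exceeds the
    caliper are dropped rather than force-matched.
    """
    available = dict(control)
    pairs = []
    for t_id, t_score in sorted(treated.items(), key=lambda kv: kv[1]):
        if not available:
            break
        c_id = min(available, key=lambda cid: abs(available[cid] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]
    return pairs


pairs = greedy_match(
    {"t1": 0.30, "t2": 0.52},
    {"c1": 0.29, "c2": 0.55, "c3": 0.90},
)
# t1 matches c1 (gap 0.01), t2 matches c2 (gap 0.03); c3 is left unmatched
```

Real implementations layer on covariate balance checks and configurable match ratios, which is where a toml-driven configuration earns its keep.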

Checklist

  • Consider if documentation (like in docs/) needs to be updated
  • Consider if tests should be added
  • Run pylint if you're making changes beyond adding studies
  • Update template repo if there are changes to study configuration

cumulus_library/base_table_builder.py (outdated; resolved)
Comment on lines +58 to +65
@abc.abstractmethod
def pandas_cursor(self) -> DatabaseCursor:
"""Returns a connection to the backing database optimized for dataframes

If your database does not provide an optimized cursor, this should function the
same as a vanilla cursor.
"""

Contributor Author (dogversioning):
So this is the change to the DB class I was mentioning, and I'm hoping that this comment explains why it's in here the way it is, but to be a bit more verbose about this: pyathena has a method that dramatically improves query execution when it's looking to return a dataframe - something about how they handle chunking under the hood. So, in context, when I'm passing a cursor to a method, I sometimes elect to specifically hand one of these pandas cursors off.

I did this while testing the PSM code (where the cursor is the entrypoint - we *could* rewrite table builders to take a Connection rather than a Cursor, but that's a big refactor by itself and this is already pretty gross), and in the future manifest-parsing hook to come as a follow-on PR, I'm planning on specifying the pandas cursor for PSM invocation. The DuckDB version just returns a regular cursor.
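
As an illustration of the hand-off pattern described above (purely a sketch, not the repo's code): a thin wrapper can forward normal cursor calls while adding an `as_pandas`-style method for backends that lack one natively. The class and the fake cursor below are hypothetical.

```python
class PandasCursorWrapper:
    """Hypothetical wrapper: forwards PEP 249 calls, adds as_pandas()."""

    def __init__(self, cursor, to_dataframe):
        self._cursor = cursor
        self._to_dataframe = to_dataframe  # backend-specific conversion hook

    def __getattr__(self, name):
        # execute(), fetchall(), etc. pass straight through to the real cursor
        return getattr(self._cursor, name)

    def as_pandas(self):
        return self._to_dataframe(self._cursor)


class _FakeCursor:
    def execute(self, sql):
        self.last_sql = sql

    def fetchall(self):
        return [(1,), (2,)]


wrapped = PandasCursorWrapper(_FakeCursor(), lambda cur: cur.fetchall())
wrapped.execute("select 1")   # forwarded to the inner cursor
frame = wrapped.as_pandas()   # uses the injected conversion
```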

Contributor:

Yeah I'm fine with this change based on the constraint of "Cursor is the interface, not DatabaseBackend/Connection". Some thoughts around it though:

  • I'd like to see as_pandas added to the Cursor protocol we have, so that consumers of Library know it's contractually available. (See below for some commentary on this.)
  • I'd like to see execute_as_pandas dropped -- I only added that to avoid the need for extending cursors like this. But now we could simplify that interface.
  • The solution of creating an alias for as_pandas in the duckdb returned cursor is fine, but gives me pause because clever monkey-patching can be taken too far. 😄 If this setup gets more complicated, I might vote for a DuckCursor wrapper object that does similar kind of translations needed in future.
  • We really now have two kinds of Cursors - those for which as_pandas is available and those for which it isn't. What happens on a PyAthena normal cursor if you call as_pandas?
    • For our purposes, maybe AthenaDatabaseBackend should create a wrapper AthenaCursor object that throws an exception if you try to call as_pandas on the wrong cursor object.
    • Or even better probably, have two different Cursor protocols. One pandas-powered and one that isn't. That way method signatures would be clear about which cursor they expect to be handed. (if that is always clear?)
    • You could also add Cursor wrappers and a method like .get_database_backend() or something to give access to parent objects without introducing two different kinds of Cursors. But that's a little clunky in its own way. But may feel less clunky.
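
The "two different Cursor protocols" idea from the bullets above could look something like this; a sketch with illustrative names, using `typing.Protocol` from the stdlib:

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class DatabaseCursor(Protocol):
    """Minimal PEP 249-ish cursor surface."""

    def execute(self, sql: str) -> None: ...
    def fetchall(self) -> list: ...


@runtime_checkable
class PandasCursor(DatabaseCursor, Protocol):
    """A cursor that can also hand back a dataframe."""

    def as_pandas(self): ...


class PlainCursor:  # satisfies DatabaseCursor but not PandasCursor
    def execute(self, sql):
        pass

    def fetchall(self):
        return []
```

Method signatures could then declare which flavor they expect, so calling `as_pandas` on the wrong cursor becomes a type-checker error rather than a runtime surprise.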

Contributor Author (dogversioning):

Honestly, I think I like the idea of refactoring one way or another to get these more in line; I'm just trying not to do it as part of this PR, for complexity reasons. We can maybe natter about the shape? Some options, pulling on some of these threads:

  • I don't hate making a database connection the atomic unit, but it is probably going to touch the most things
  • as_pandas is, apparently, available as a util method that can be called on a pyathena cursor, so we could switch to that and keep the cursor space down to one per db. That might slot better into the execute_as_pandas paradigm
  • I think generally a PEP 249 cursor has a reference back to its connection, so maybe it's not the end of the world to have it get the database backend, though I think that's my least favorite of these
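
The second bullet (one cursor per db, with conversion handled in one place) could be sketched as a single dispatch helper. The cursor classes below are stand-in fakes for illustration, not pyathena or duckdb objects:

```python
def rows_or_frame(cursor):
    """Convert a finished query using whatever interface the cursor offers."""
    if hasattr(cursor, "as_pandas"):   # pyathena-style pandas cursor
        return cursor.as_pandas()
    if hasattr(cursor, "df"):          # duckdb-style dataframe accessor
        return cursor.df()
    return cursor.fetchall()           # plain PEP 249 fallback


class _FakeAthenaCursor:
    def as_pandas(self):
        return "frame"


class _FakePlainCursor:
    def fetchall(self):
        return [(1,)]


athena_result = rows_or_frame(_FakeAthenaCursor())
plain_result = rows_or_frame(_FakePlainCursor())
```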

cumulus_library/databases.py (resolved)
cumulus_library/statistics/psm.py (outdated; resolved)
cumulus_library/template_sql/ctas.sql.jinja (resolved)
tests/conftest.py (resolved)
@dogversioning marked this pull request as ready for review December 4, 2023 18:46

pyproject.toml (resolved)
cumulus_library/statistics/psm.py (outdated; resolved)
cumulus_library/statistics/psm.py (resolved)
cumulus_mhg_dev_db (outdated; resolved)
cumulus_library/template_sql/statistics/psm_templates.py (outdated; resolved)
tests/conftest.py (outdated; resolved)
def mock_db():
    """Provides a DuckDatabaseBackend for local testing"""
    data_dir = f"{Path(__file__).parent}/test_data/duckdb_data"
    with tempfile.TemporaryDirectory() as tmpdir:
Contributor:

nit: since you're only using this dir for one file, you could also call NamedTemporaryFile()

Contributor Author (dogversioning):

OK, I tried this, and the tempfiles were not behaving quite as gracefully w.r.t. yield/cleanup as tmpdirs. A tmpdir also provides a "write out some logs while you're at it" option, so I'm electing to leave this as is.
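
The yield/cleanup behavior being relied on can be sketched with a plain generator fixture (illustrative names, not the repo's actual conftest): the directory outlives the yield, leaving room for logs next to the database file, and everything is removed together on teardown.

```python
import os
import tempfile


def mock_db_dir():
    """Directory-scoped fixture: room for a db file plus extra artifacts."""
    with tempfile.TemporaryDirectory() as tmpdir:
        # anything else written into tmpdir (logs, exports) rides along
        yield os.path.join(tmpdir, "duckdb.db")
    # context manager removes the directory and its contents on teardown


gen = mock_db_dir()
db_path = next(gen)  # fixture is "live"; the parent directory exists
parent_exists = os.path.isdir(os.path.dirname(db_path))
gen.close()          # teardown: GeneratorExit unwinds the with-block
cleaned_up = not os.path.isdir(os.path.dirname(db_path))
```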

Contributor:

nit: Since this is for tests, removing all the template comments would let the whole config be visible at a glance. (And if you had actual comments for viewers, like about why the sample sizes are small or why some other default was changed, they wouldn't be lost in the noise.)

edit: Ah, like the no_optional file below 😄

Contributor Author (dogversioning):

On the one hand, I hear you.

On the other hand, there is no other place for these to live right now; there is no instance in core where this module would be used. I have been thinking about a stats_example study for this and any future things, but right now, for documentation purposes, other comments point here.

When the other half of the workflow is done, maybe I can move this to a markdown doc or something? But I want to keep it here for the time being.

Contributor Author (dogversioning):

decided to start stubbing out the docs, so I moved the commentary there and cleaned this up.

@dogversioning merged commit a12b334 into main on Dec 5, 2023
3 checks passed
@dogversioning deleted the mg/psm branch December 5, 2023 18:11