
PSM table generation #150

Merged: 17 commits into main on Dec 5, 2023
Conversation

dogversioning (Contributor) commented Nov 30, 2023

This PR adds the research done as part of the suicidality studies around propensity score matching (PSM) as a standard statistical tool in the library.

It attempts to deal solely with the PSM logistics, not the CLI/study logistics. Here's what it's trying to do:

  • Provides a new table builder extension for PSM
  • Provides PSM-specific jinja templates
  • Creates a toml input format for configuring PSM jobs
  • Ancillarily, provides a conftest fixture for running tests with duckdb

It does not attempt to do the following, to limit cognitive load; these will follow in later PRs:

  • Run PSM from a manifest
  • Metadata table persistence for managing multiple PSM tables
  • CLI commands around optionally running/cleaning PSM tables
  • PSM date filtering on cohorts, which needs some more definition
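
For background, the matching step at the heart of PSM can be sketched in a few lines of plain Python. This is an illustrative greedy nearest-neighbor matcher over precomputed propensity scores, not the Library's implementation; all names here (`greedy_match`, `caliper`) are hypothetical.

```python
def greedy_match(treated: dict, control: dict, caliper: float = 0.05) -> list:
    """Pair each treated id with the closest unmatched control id.

    `treated` and `control` map record id -> propensity score. Controls
    are consumed as they are matched; pairs whose score gap exceeds the
    caliper are dropped rather than force-matched.
    """
    available = dict(control)
    pairs = []
    for t_id, t_score in sorted(treated.items(), key=lambda kv: kv[1]):
        if not available:
            break
        c_id = min(available, key=lambda cid: abs(available[cid] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]
    return pairs


pairs = greedy_match(
    {"t1": 0.30, "t2": 0.52},
    {"c1": 0.29, "c2": 0.55, "c3": 0.90},
)
# t1 matches c1 (gap 0.01), t2 matches c2 (gap 0.03); c3 is left unmatched
```

Real implementations layer on covariate balance checks and configurable match ratios, which is where a toml-driven configuration earns its keep.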

Checklist

  • Consider if documentation (like in docs/) needs to be updated
  • Consider if tests should be added
  • Run pylint if you're making changes beyond adding studies
  • Update template repo if there are changes to study configuration

cumulus_library/base_table_builder.py (outdated; resolved)
Comment on lines +58 to +65
@abc.abstractmethod
def pandas_cursor(self) -> DatabaseCursor:
"""Returns a connection to the backing database optimized for dataframes

If your database does not provide an optimized cursor, this should function the
same as a vanilla cursor.
"""

Contributor Author (dogversioning):
So this is the change to the DB class I was mentioning, and I'm hoping that this comment explains why it's in here the way it is, but to be a bit more verbose about this: pyathena has a method that dramatically improves query execution when it's looking to return a dataframe - something about how they handle chunking under the hood. So, in context, when I'm passing a cursor to a method, I sometimes elect to specifically hand one of these pandas cursors off.

I did this while testing the PSM code (where the cursor is the entrypoint - we *could* rewrite table builders to take a Connection rather than a Cursor, but that's a big refactor by itself and this is already pretty gross), and in the future manifest-parsing hook to come as a follow-on PR, I'm planning on specifying the pandas cursor for PSM invocation. The DuckDB version just returns a regular cursor.
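
As an illustration of the hand-off pattern described above (purely a sketch, not the repo's code): a thin wrapper can forward normal cursor calls while adding an `as_pandas`-style method for backends that lack one natively. The class and the fake cursor below are hypothetical.

```python
class PandasCursorWrapper:
    """Hypothetical wrapper: forwards PEP 249 calls, adds as_pandas()."""

    def __init__(self, cursor, to_dataframe):
        self._cursor = cursor
        self._to_dataframe = to_dataframe  # backend-specific conversion hook

    def __getattr__(self, name):
        # execute(), fetchall(), etc. pass straight through to the real cursor
        return getattr(self._cursor, name)

    def as_pandas(self):
        return self._to_dataframe(self._cursor)


class _FakeCursor:
    def execute(self, sql):
        self.last_sql = sql

    def fetchall(self):
        return [(1,), (2,)]


wrapped = PandasCursorWrapper(_FakeCursor(), lambda cur: cur.fetchall())
wrapped.execute("select 1")   # forwarded to the inner cursor
frame = wrapped.as_pandas()   # uses the injected conversion
```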

Contributor:

Yeah I'm fine with this change based on the constraint of "Cursor is the interface, not DatabaseBackend/Connection". Some thoughts around it though:

  • I'd like to see as_pandas added to the Cursor protocol we have, so that consumers of Library know it's contractually available. (See below for some commentary on this.)
  • I'd like to see execute_as_pandas dropped -- I only added that to avoid the need for extending cursors like this. But now we could simplify that interface.
  • The solution of creating an alias for as_pandas in the duckdb returned cursor is fine, but gives me pause because clever monkey-patching can be taken too far. 😄 If this setup gets more complicated, I might vote for a DuckCursor wrapper object that does similar kind of translations needed in future.
  • We really now have two kinds of Cursors - those for which as_pandas is available and those for which it isn't. What happens on a PyAthena normal cursor if you call as_pandas?
    • For our purposes, maybe AthenaDatabaseBackend should create a wrapper AthenaCursor object that throws an exception if you try to call as_pandas on the wrong cursor object.
    • Or even better probably, have two different Cursor protocols. One pandas-powered and one that isn't. That way method signatures would be clear about which cursor they expect to be handed. (if that is always clear?)
    • You could also add Cursor wrappers and a method like .get_database_backend() or something to give access to parent objects without introducing two different kinds of Cursors. But that's a little clunky in its own way. But may feel less clunky.
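
The "two different Cursor protocols" idea from the bullets above could look something like this; a sketch with illustrative names, using `typing.Protocol` from the stdlib:

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class DatabaseCursor(Protocol):
    """Minimal PEP 249-ish cursor surface."""

    def execute(self, sql: str) -> None: ...
    def fetchall(self) -> list: ...


@runtime_checkable
class PandasCursor(DatabaseCursor, Protocol):
    """A cursor that can also hand back a dataframe."""

    def as_pandas(self): ...


class PlainCursor:  # satisfies DatabaseCursor but not PandasCursor
    def execute(self, sql):
        pass

    def fetchall(self):
        return []
```

Method signatures could then declare which flavor they expect, so calling `as_pandas` on the wrong cursor becomes a type-checker error rather than a runtime surprise.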

Contributor Author (dogversioning):

Honestly, I think I like the idea of refactoring one way or another to get these more in line; I'm just trying not to do it as part of this PR, for complexity reasons. We can maybe natter about the shape? Some options, pulling on some of these threads:

  • I don't hate making a database connection the atomic unit, but it is probably going to touch the most things
  • as_pandas is, apparently, available as a util method that can be called on a pyathena cursor, so we could switch to that and keep the cursor space down to one per db. That might slot better into the execute_as_pandas paradigm
  • I think generally a PEP 249 cursor has a reference back to its connection, so maybe it's not the end of the world to have it get the database backend, though I think that's my least favorite of these
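
The second bullet (one cursor per db, with conversion handled in one place) could be sketched as a single dispatch helper. The cursor classes below are stand-in fakes for illustration, not pyathena or duckdb objects:

```python
def rows_or_frame(cursor):
    """Convert a finished query using whatever interface the cursor offers."""
    if hasattr(cursor, "as_pandas"):   # pyathena-style pandas cursor
        return cursor.as_pandas()
    if hasattr(cursor, "df"):          # duckdb-style dataframe accessor
        return cursor.df()
    return cursor.fetchall()           # plain PEP 249 fallback


class _FakeAthenaCursor:
    def as_pandas(self):
        return "frame"


class _FakePlainCursor:
    def fetchall(self):
        return [(1,)]


athena_result = rows_or_frame(_FakeAthenaCursor())
plain_result = rows_or_frame(_FakePlainCursor())
```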

cumulus_library/databases.py (resolved)
cumulus_library/statistics/psm.py (outdated; resolved)
cumulus_library/template_sql/ctas.sql.jinja (resolved)
tests/conftest.py (resolved)
@dogversioning marked this pull request as ready for review December 4, 2023 18:46

pyproject.toml (resolved)
cumulus_library/statistics/psm.py (outdated; resolved)
cumulus_library/statistics/psm.py (resolved)
cumulus_mhg_dev_db (outdated; resolved)
cumulus_library/template_sql/statistics/psm_templates.py (outdated; resolved)
tests/conftest.py (outdated; resolved)
def mock_db():
    """Provides a DuckDatabaseBackend for local testing"""
    data_dir = f"{Path(__file__).parent}/test_data/duckdb_data"
    with tempfile.TemporaryDirectory() as tmpdir:
Contributor:

nit: since you're only using this dir for one file, you could also call NamedTemporaryFile()

Contributor Author (dogversioning):

OK, I tried this, and the tempfiles were not behaving quite as gracefully w.r.t. yield/cleanup as tmpdirs. A tmpdir also provides a "write out some logs while you're at it" option, so I'm electing to leave this as is.
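
The yield/cleanup behavior being relied on can be sketched with a plain generator fixture (illustrative names, not the repo's actual conftest): the directory outlives the yield, leaving room for logs next to the database file, and everything is removed together on teardown.

```python
import os
import tempfile


def mock_db_dir():
    """Directory-scoped fixture: room for a db file plus extra artifacts."""
    with tempfile.TemporaryDirectory() as tmpdir:
        # anything else written into tmpdir (logs, exports) rides along
        yield os.path.join(tmpdir, "duckdb.db")
    # context manager removes the directory and its contents on teardown


gen = mock_db_dir()
db_path = next(gen)  # fixture is "live"; the parent directory exists
parent_exists = os.path.isdir(os.path.dirname(db_path))
gen.close()          # teardown: GeneratorExit unwinds the with-block
cleaned_up = not os.path.isdir(os.path.dirname(db_path))
```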

Contributor:

nit: Since this is for tests, removing all the template comments would let the whole config be visible at a glance. (And if you had actual comments for viewers, like about why the sample sizes are small or why some other default was changed, they wouldn't be lost in the noise.)

edit: Ah, like the no_optional file below 😄

Contributor Author (dogversioning):

On the one hand, I hear you.

On the other hand, there is no other place for these to live right now; there is no instance in core where this module would be used. I have been thinking about a stats_example study for this and any future things, but right now, for documentation purposes, other comments point here.

When the other half of the workflow is done, maybe I can move this to a markdown doc or something? But I want to keep it here for the time being.

Contributor Author (dogversioning):

decided to start stubbing out the docs, so I moved the commentary there and cleaned this up.

@dogversioning merged commit a12b334 into main on Dec 5, 2023
3 checks passed
@dogversioning deleted the mg/psm branch December 5, 2023 18:11