-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PSM table generation #150
PSM table generation #150
Conversation
87d067a
to
9b67de7
Compare
c64ba50
to
fb745ff
Compare
@abc.abstractmethod | ||
def pandas_cursor(self) -> DatabaseCursor: | ||
"""Returns a connection to the backing database optimized for dataframes | ||
|
||
If your database does not provide an optimized cursor, this should function the | ||
same as a vanilla cursor. | ||
""" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is the change to the DB class I was mentioning, and I'm hoping that this comment explains why it's in here the way it is, but to be a bit more verbose about this: pyathena has a method that dramatically improves query execution when it's looking to return a dataframe - something about how they handle chunking under the hood. So, in context, when I'm passing a cursor to a method, I sometimes elect to specifically hand one of these pandas cursors off.
I did this while testing the PSM code (where the cursor is the entrypoint - we :could: rewrite table builders to take a Connection rather than a Cursor, but that's a big refactor by itself and this is already pretty gross), and in the future manifest parsing hook for this to come as a followon PR, I'm planning on specifying the pandas cursor for PSM invocation. The DuckDB version just returns a regular cursor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I'm fine with this change based on the constraint of "Cursor is the interface, not DatabaseBackend/Connection". Some thoughts around it though:
- I'd like to see
as_pandas
added to the Cursor protocol we have, so that consumers of Library know it's contractually available. (See below for some commentary on this.) - I'd like to see
execute_as_pandas
dropped -- I only added that to avoid the need for extending cursors like this. But now we could simplify that interface. - The solution of creating an alias for
as_pandas
in the duckdb returned cursor is fine, but gives me pause because clever monkey-patching can be taken too far. 😄 If this setup gets more complicated, I might vote for a DuckCursor wrapper object that does similar kind of translations needed in future. - We really now have two kinds of Cursors - those for which
as_pandas
is available and those for which it isn't. What happens on a PyAthena normal cursor if you callas_pandas
?- For our purposes, maybe AthenaDatabaseBackend should create a wrapper AthenaCursor object that throws an exception if you try to call
as_pandas
on the wrong cursor object. - Or even better probably, have two different Cursor protocols. One pandas-powered and one that isn't. That way method signatures would be clear about which cursor they expect to be handed. (if that is always clear?)
- You could also add Cursor wrappers and a method like
.get_database_backend()
or something to give access to parent objects without introducing two different kinds of Cursors. But that's a little clunky in its own way. But may feel less clunky.
- For our purposes, maybe AthenaDatabaseBackend should create a wrapper AthenaCursor object that throws an exception if you try to call
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
honestly - i think i like the idea of refactoring one way or another to get these more in line, i'm just trying to not do it as part of this PR for complexity reasons - we can maybe natter about the shape? some options, pulling on some of these threads:
- I don't hate making a database connection the atomic unit, but it is probably going to touch the most things
as_pandas
is, apparently, available as a util method that can be called on a pyathena cursor, so we could switch to that and keep the cursor space down to one per db. that might slot better into theexecute_as_pandas
paradigm- I think genereally a PEP cursor has a reference back to its connection, so maybe it's not the end of the world to have it get the database backend, though i think that's my least favorite of these.
@abc.abstractmethod | ||
def pandas_cursor(self) -> DatabaseCursor: | ||
"""Returns a connection to the backing database optimized for dataframes | ||
|
||
If your database does not provide an optimized cursor, this should function the | ||
same as a vanilla cursor. | ||
""" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I'm fine with this change based on the constraint of "Cursor is the interface, not DatabaseBackend/Connection". Some thoughts around it though:
- I'd like to see
as_pandas
added to the Cursor protocol we have, so that consumers of Library know it's contractually available. (See below for some commentary on this.) - I'd like to see
execute_as_pandas
dropped -- I only added that to avoid the need for extending cursors like this. But now we could simplify that interface. - The solution of creating an alias for
as_pandas
in the duckdb returned cursor is fine, but gives me pause because clever monkey-patching can be taken too far. 😄 If this setup gets more complicated, I might vote for a DuckCursor wrapper object that does similar kind of translations needed in future. - We really now have two kinds of Cursors - those for which
as_pandas
is available and those for which it isn't. What happens on a PyAthena normal cursor if you callas_pandas
?- For our purposes, maybe AthenaDatabaseBackend should create a wrapper AthenaCursor object that throws an exception if you try to call
as_pandas
on the wrong cursor object. - Or even better probably, have two different Cursor protocols. One pandas-powered and one that isn't. That way method signatures would be clear about which cursor they expect to be handed. (if that is always clear?)
- You could also add Cursor wrappers and a method like
.get_database_backend()
or something to give access to parent objects without introducing two different kinds of Cursors. But that's a little clunky in its own way. But may feel less clunky.
- For our purposes, maybe AthenaDatabaseBackend should create a wrapper AthenaCursor object that throws an exception if you try to call
def mock_db(): | ||
"""Provides a DuckDatabaseBackend for local testing""" | ||
data_dir = f"{Path(__file__).parent}/test_data/duckdb_data" | ||
with tempfile.TemporaryDirectory() as tmpdir: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: since you're only using this dir for one file, you could also call NamedTemporaryFile()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok i tried this and the tempfiles were not behaving quite as gracefully w.r.t. yield/cleanup as tmpdirs, and it also provides a 'write out some logs while you're at it' option, so i'm electing to leave this as is.
tests/test_data/psm/psm_config.toml
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Since this is for tests, removing all the template comments would let the whole config be visible at one look. (and if you had actual comments for viewers, like about why the sample sizes are small or some other default was changed, it wouldn't be lost in the noise)
edit: Ah, like the no_optional
file below 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on the one hand, i hear you.
on the other hand, there is no other place for these to live right now - there is no instance in core where this module would be used. I have been thinking about a stats_example
study for this any any future things, but right now, for documentation purposes, other comments point here.
When the other half of the workflow is done, maybe i can move this to a markdown doc or something? but i want to keep it here, for the time being.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
decided to start stubbing out the docs, so I moved the commentary there and cleaned this up.
This PR attempts to add in the research done as part of the suicidality studies around propensity score matching as a standard statistical tool in the library.
This attempts sole to deal with the PSM logistics, but not the cli/study logistics. So here's what it's trying to do:
It does not attempt to do the following, for cognitive load reasons, which will follow on later:
Checklist
docs/
) needs to be updated