Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PSM table generation #150
PSM table generation #150
Changes from 8 commits
9b67de7
fb745ff
5d7ef49
c391b4f
4f270e1
64096b3
eac38ae
af4a27b
0a477ba
e684d8f
86c70bc
561f728
53458a2
1f09718
6274b80
5ab07b9
aa81d3b
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is the change to the DB class I was mentioning, and I'm hoping that this comment explains why it's in here the way it is, but to be a bit more verbose about this: pyathena has a method that dramatically improves query execution when it's looking to return a dataframe - something about how they handle chunking under the hood. So, in context, when I'm passing a cursor to a method, I sometimes elect to specifically hand one of these pandas cursors off.
I did this while testing the PSM code (where the cursor is the entrypoint - we :could: rewrite table builders to take a Connection rather than a Cursor, but that's a big refactor by itself and this is already pretty gross), and in the future manifest parsing hook for this to come as a followon PR, I'm planning on specifying the pandas cursor for PSM invocation. The DuckDB version just returns a regular cursor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I'm fine with this change based on the constraint of "Cursor is the interface, not DatabaseBackend/Connection". Some thoughts around it though:
as_pandas
added to the Cursor protocol we have, so that consumers of Library know it's contractually available. (See below for some commentary on this.)execute_as_pandas
dropped -- I only added that to avoid the need for extending cursors like this. But now we could simplify that interface.as_pandas
in the duckdb returned cursor is fine, but gives me pause because clever monkey-patching can be taken too far. 😄 If this setup gets more complicated, I might vote for a DuckCursor wrapper object that does similar kind of translations needed in future.as_pandas
is available and those for which it isn't. What happens on a PyAthena normal cursor if you callas_pandas
?as_pandas
on the wrong cursor object..get_database_backend()
or something to give access to parent objects without introducing two different kinds of Cursors. But that's a little clunky in its own way. But may feel less clunky.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
honestly - i think i like the idea of refactoring one way or another to get these more in line, i'm just trying to not do it as part of this PR for complexity reasons - we can maybe natter about the shape? some options, pulling on some of these threads:
as_pandas
is, apparently, available as a util method that can be called on a pyathena cursor, so we could switch to that and keep the cursor space down to one per db. that might slot better into theexecute_as_pandas
paradigmThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this still valid? As a comment to a non-Matt, I'm not sure it's clear to me when that would be true. I would currently guess this is about unit tests. (Same comment above the call to
knn_matched
later)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok no this is about the PsmPy behavior vs valid statistical sampling techniques, which i'm intending to deal with in the followon PR, though you can talk me into it now; this was more about niceness for a reviewer.
replace=True
allows the same record to be sampled multiple times if its below the floor value of 20. The floor value is set in PsmPy in a way which cannot be overridden; This is probably desirable from a math perspective but becomes harder when dealing with very small input sets. So this comes back to the side convo we had about data size - i could try to solve that here by touching ids and re-inserting our existing test data 3 or 4 times. But this value should be toggled before we ask anyone to use this.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tried to make this comment more sensible for lottery/bus reasons
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A block like this could use a comment about why it's doing what it's doing. My rough attempt: you're replacing each column with a dummy version of that column. But even after reading the pandas docs on
get_dummies
, I'm not 💯 on what that means.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah i'll add something - this is converting to a 1-hot encoding for all values of that column, basically pivoting the column values to new column headers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated with some explanitory text