Preserving column names when transformer requires multiple columns as input #174
Comments
I believe this should be possible if the transformer you use implements some interface to get the name of the resulting features. See https://github.com/scikit-learn-contrib/sklearn-pandas/blob/master/sklearn_pandas/dataframe_mapper.py#L40. Perhaps you can extend Marking as "good first issue" to come up with an example of this that works.
@hildeweerts I would say that this kind of transformer is one of the most challenging to insert into a pipeline. The code we have now doesn't support this kind of flexibility, so I guess the simplest way is to track the changes manually. Not really convenient, but I believe there is no other way right now. As @dukebody proposed, it could be something like (I guess we need to pick

    import pandas as pd
    import numpy as np
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn_pandas import DataFrameMapper

    class TrackingSelectKBest(SelectKBest):
        """SelectKBest that remembers the indices of the columns it keeps."""

        def fit(self, X, y=None):
            super().fit(X, y)
            # Rank (column index, score) pairs from highest to lowest score.
            scores = sorted(
                enumerate(self.scores_),
                key=lambda pair: pair[1],
                reverse=True)
            # Keep the indices of the k best-scoring columns.
            self.best_columns_ = [i for i, score in scores[:self.k]]
            return self

    def main():
        data = pd.DataFrame({
            'pet': ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish'],
            'children': [4., 6, 3, 3, 2, 3, 5, 4],
            'salary': [90., 24, 44, 27, 32, 59, 36, 27]})
        selector = TrackingSelectKBest(chi2, k=1)
        columns = np.array(['children', 'salary'])
        m = DataFrameMapper([(columns, selector)])
        m.fit_transform(data[columns], data['pet'])
        # Map the tracked indices back to the original column names.
        print(columns[selector.best_columns_])

    if __name__ == '__main__':
        main()

The snippet above should print:

Probably there are other ways to achieve this but
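A possible alternative to subclassing, sketched here with plain scikit-learn and the same data as above: `SelectKBest` already exposes `get_support(indices=True)`, which returns the indices of the retained columns after fitting. This sketch leaves `DataFrameMapper` out entirely and assumes you can reach the fitted selector object; whether that fits a given pipeline is a judgment call.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Same 'children' / 'salary' / 'pet' values as in the snippet above.
columns = np.array(['children', 'salary'])
X = np.array([
    [4., 90.], [6., 24.], [3., 44.], [3., 27.],
    [2., 32.], [3., 59.], [5., 36.], [4., 27.],
])
y = ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish']

selector = SelectKBest(chi2, k=1).fit(X, y)

# get_support(indices=True) gives the positions of the kept columns,
# which can be used to index the array of original names.
selected = columns[selector.get_support(indices=True)].tolist()
print(selected)
```

The indices refer to column positions in the array passed to `fit`, so the `columns` array has to be in the same order as the data.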
@dukebody Any updates on this? It sounds like a really good idea to implement.
Sklearn 1.0 estimator API has better support for

    mapper_fs = DataFrameMapper([(['children','salary'], SelectKBest(chi2, k=2))], input_df=True)
    mapper_fs.fit_transform(data[['children','salary']], data['pet'])
    print(mapper_fs.transformed_names_)
    ['children_salary_children', 'children_salary_salary']

Where

Same applies to, e.g.,

    import pandas as pd
    from sklearn_pandas import DataFrameMapper
    from sklearn.preprocessing import OneHotEncoder

    df = pd.DataFrame({
        'col': [0, 0, 1, 1, 2, 3, 0],
        'target': [0, 0, 1, 1, 2, 3, 0]
    })
    mapper = DataFrameMapper([(['col', 'target'], OneHotEncoder())], df_out=True)
    transformed = mapper.fit_transform(df)
    print(mapper.transformed_names_)
    ['col_target_x0_0', 'col_target_x0_1', 'col_target_x0_2', 'col_target_x0_3', 'col_target_x1_0',
     'col_target_x1_1', 'col_target_x1_2', 'col_target_x1_3']

vs

    mapper = DataFrameMapper([(['col', 'target'], OneHotEncoder())], input_df=True, df_out=True)
    transformed = mapper.fit_transform(df)
    print(mapper.transformed_names_)
    ['col_target_col_0', 'col_target_col_1', 'col_target_col_2', 'col_target_col_3', 'col_target_target_0', 'col_target_target_1', 'col_target_target_2', 'col_target_target_3']

Would it be possible to take advantage of sklearn's capabilities and improve the handling at
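In the outputs quoted above, each transformed name is the joined input column names followed by `_` and the transformer's own feature name. Until the mapper handles this itself, that prefix can be stripped after the fact. A minimal sketch, assuming the naming pattern is exactly the one shown in those outputs; `strip_group_prefix` is a hypothetical helper, not part of sklearn-pandas.

```python
def strip_group_prefix(columns, transformed_names):
    """Drop the '<col1>_<col2>_' prefix from each transformed name.

    Assumes the prefix is the input columns joined with '_', as in the
    outputs quoted in this thread. Names without the prefix are kept as-is.
    """
    prefix = "_".join(columns) + "_"
    return [
        name[len(prefix):] if name.startswith(prefix) else name
        for name in transformed_names
    ]


names = ['children_salary_children', 'children_salary_salary']
print(strip_group_prefix(['children', 'salary'], names))
# prints ['children', 'salary']
```

This is plain string handling, so it would break if a transformer's own feature names happened to start with the same prefix.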
I was wondering whether it is possible to preserve column names when using a transformer that requires multiple columns of the dataframe. I'll try to illustrate what I mean with an example.
Which outputs ['children_salary'], whereas I would expect just ['salary']. This makes it impossible to keep track of which columns were dropped by the SelectKBest transformer. Is there currently a way to solve this problem?