Preserving column names when transformer requires multiple columns as input #174

Open
hildeweerts opened this issue Oct 2, 2018 · 4 comments · May be fixed by #248

Comments

@hildeweerts

hildeweerts commented Oct 2, 2018

I was wondering whether it is possible to preserve column names when using a transformer that requires multiple columns of the dataframe. I'll try to illustrate what I mean with an example.

import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn_pandas import DataFrameMapper

data = pd.DataFrame({
    'pet':      ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish'],
    'children': [4., 6, 3, 3, 2, 3, 5, 4],
    'salary':   [90., 24, 44, 27, 32, 59, 36, 27]})

mapper_fs = DataFrameMapper([(['children','salary'], SelectKBest(chi2, k=2))])
mapper_fs.fit_transform(data[['children','salary']], data['pet'])
print(mapper_fs.transformed_names_)

Which outputs ['children_salary'], whereas I would expect just ['salary']. This makes it impossible to keep track of which columns were dropped by the SelectKBest transformer. Is there currently a way to solve this problem?

@dukebody
Collaborator

I believe this should be possible if the transformer you use implements some interface to get the names of the resulting features. See https://github.com/scikit-learn-contrib/sklearn-pandas/blob/master/sklearn_pandas/dataframe_mapper.py#L40.

Perhaps you can extend SelectKBest to provide this interface?

Marking as "good first issue" to come up with an example of this that works.
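
A minimal sketch of what such an extension might look like, assuming the interface in question is a `get_feature_names()` method on the transformer (an assumption here, not confirmed in this thread). `NamedSelectKBest` and its `input_names` parameter are hypothetical names made up for illustration; the caller has to pass the column names in, since the selector itself only sees arrays. How `DataFrameMapper` would combine these names with its own prefix is not shown.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2, f_classif


class NamedSelectKBest(SelectKBest):
    """Hypothetical SelectKBest that remembers the names of its input columns."""

    def __init__(self, score_func=f_classif, k=10, input_names=None):
        super().__init__(score_func=score_func, k=k)
        # Stored unchanged so that get_params() and clone() keep working.
        self.input_names = input_names

    def get_feature_names(self):
        # get_support() returns a boolean mask over the input columns.
        names = np.asarray(self.input_names)
        return names[self.get_support()].tolist()


data = pd.DataFrame({
    'pet':      ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish'],
    'children': [4., 6, 3, 3, 2, 3, 5, 4],
    'salary':   [90., 24, 44, 27, 32, 59, 36, 27]})

columns = ['children', 'salary']
selector = NamedSelectKBest(chi2, k=1, input_names=columns)
selector.fit(data[columns].to_numpy(), data['pet'])

selected = selector.get_feature_names()
dropped = [name for name in columns if name not in selected]
print(selected, dropped)
```

With `k=1` the selector keeps a single column, so `dropped` lists whatever was filtered out, which is the information the original question was after.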

@devforfu
Collaborator

@hildeweerts I would say that this kind of transformer is one of the most challenging to insert into a pipeline. The code we have now doesn't support this kind of flexibility, so I guess the simplest way to do it is to track the changes manually. Not really convenient, but I believe there is no other way right now.

As @dukebody proposed, it could be something like (I guess we need to pick k=1 if we want to choose the best column between two):

import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn_pandas import DataFrameMapper

class TrackingSelectKBest(SelectKBest):
    def fit(self, X, y=None):
        super().fit(X, y)
        # Rank the input column indices by score, highest first.
        scores = sorted([
            (i, score) for i, score in enumerate(self.scores_)],
            key=lambda pair: pair[1],
            reverse=True)
        # Remember the indices of the k best-scoring columns.
        self.best_columns_ = [i for i, score in scores[:self.k]]
        return self

def main():
    data = pd.DataFrame({
        'pet':      ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish'],
        'children': [4., 6, 3, 3, 2, 3, 5, 4],
        'salary':   [90., 24, 44, 27, 32, 59, 36, 27]})
    selector = TrackingSelectKBest(chi2, k=1)
    columns = np.array(['children', 'salary'])
    m = DataFrameMapper([(columns, selector)])
    m.fit_transform(data[columns], data['pet'])
    print(columns[selector.best_columns_])

if __name__ == '__main__':
    main()

The snippet above should print:

['salary']

There are probably other ways to achieve this, but scikit-learn transformers take numpy arrays without access to the original data frame column names, so it is not possible to derive these names from the transformer itself.

@Aashit-Sharma

@dukebody Any updates on this? It sounds like a really good idea to implement.

@falcaopetri

The sklearn 1.0 estimator API has better support for feature names. For example, using DataFrameMapper's input_df=True allows us to get:

mapper_fs = DataFrameMapper([(['children','salary'], SelectKBest(chi2, k=2))], input_df=True)
mapper_fs.fit_transform(data[['children','salary']], data['pet'])
print(mapper_fs.transformed_names_)
['children_salary_children', 'children_salary_salary']

Here the children_salary prefix was added by sklearn-pandas.

Same applies to, e.g., OneHotEncoder:

import pandas as pd
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    'col': [0, 0, 1, 1, 2, 3, 0],
    'target': [0, 0, 1, 1, 2, 3, 0]
})
mapper = DataFrameMapper([(['col', 'target'], OneHotEncoder())], df_out=True)
transformed = mapper.fit_transform(df)
print(mapper.transformed_names_)
['col_target_x0_0', 'col_target_x0_1', 'col_target_x0_2', 'col_target_x0_3', 'col_target_x1_0', 
 'col_target_x1_1', 'col_target_x1_2', 'col_target_x1_3']

vs

mapper = DataFrameMapper([(['col', 'target'], OneHotEncoder())], input_df=True, df_out=True)
transformed = mapper.fit_transform(df)
print(mapper.transformed_names_)
['col_target_col_0', 'col_target_col_1', 'col_target_col_2', 'col_target_col_3', 'col_target_target_0', 'col_target_target_1', 'col_target_target_2', 'col_target_target_3']

Would it be possible to take advantage of sklearn's capabilities and improve the name handling in DataFrameMapper.get_names?

@ragrawal ragrawal linked a pull request Oct 18, 2021 that will close this issue

5 participants