Preserving column names when transformer requires multiple columns as input #174

hildeweerts · 2018-10-02T08:18:01Z

I was wondering whether it is possible to preserve column names when using a transformer that requires multiple columns of the dataframe. I'll try to illustrate what I mean with an example.

import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn_pandas import DataFrameMapper

data = pd.DataFrame({
    'pet':      ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish'],
    'children': [4., 6, 3, 3, 2, 3, 5, 4],
    'salary':   [90., 24, 44, 27, 32, 59, 36, 27]})

mapper_fs = DataFrameMapper([(['children','salary'], SelectKBest(chi2, k=2))])
mapper_fs.fit_transform(data[['children','salary']], data['pet'])
print(mapper_fs.transformed_names_)

This outputs ['children_salary'], whereas I would expect just ['salary']. That makes it impossible to keep track of which columns were dropped by the SelectKBest transformer. Is there currently a way to solve this problem?


dukebody · 2018-10-17T18:06:21Z

I believe this should be possible if the transformer you use implements some interface to get the name of the resulting features. See https://github.com/scikit-learn-contrib/sklearn-pandas/blob/master/sklearn_pandas/dataframe_mapper.py#L40.

Perhaps you can extend SelectKBest to provide this interface?

Marking as "good first issue" to come up with an example of this that works.
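
As a starting point, here is a minimal sketch of what such an extension might look like. The class name NamedSelectKBest, the input_names parameter, and the choice of get_feature_names as the method name are all assumptions made for illustration; whether DataFrameMapper picks the method up for a given output shape has not been verified here. The selection itself relies only on SelectKBest.get_support().

```python
from sklearn.feature_selection import SelectKBest, chi2


class NamedSelectKBest(SelectKBest):
    """Hypothetical SelectKBest that reports the names of the columns it keeps."""

    def __init__(self, score_func=chi2, k=10, input_names=None):
        super().__init__(score_func=score_func, k=k)
        # Names of the input columns, in the order they are passed to fit().
        # This parameter is an assumption of the sketch, not a sklearn option.
        self.input_names = input_names

    def get_feature_names(self):
        # get_support() returns a boolean mask over the input columns.
        mask = self.get_support()
        return [name for name, keep in zip(self.input_names, mask) if keep]


# Same numbers as in the issue, as a plain list of [children, salary] rows.
X = [[4., 90.], [6., 24.], [3., 44.], [3., 27.],
     [2., 32.], [3., 59.], [5., 36.], [4., 27.]]
y = ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish']

selector = NamedSelectKBest(chi2, k=1, input_names=['children', 'salary'])
selector.fit(X, y)
kept = selector.get_feature_names()
print(kept)
```

Passing the names through the constructor keeps the estimator usable on plain arrays, at the cost of having to repeat the column list when building the mapper.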

devforfu · 2019-01-29T09:08:59Z

@hildeweerts I would say that this kind of transformer is one of the most challenging to insert into a pipeline. The code we have now doesn't support this kind of flexibility, so I guess the simplest way is to track the changes manually. Not really convenient, but I believe there is no other way right now.

As @dukebody proposed, it could be something like (I guess we need to pick k=1 if we want to choose the best column between two):

import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn_pandas import DataFrameMapper

class TrackingSelectKBest(SelectKBest):
    def fit(self, X, y=None):
        super().fit(X, y)
        # Rank column indices by score, best first, and keep the top k.
        scores = sorted(
            enumerate(self.scores_),
            key=lambda pair: pair[1],
            reverse=True)
        self.best_columns_ = [i for i, score in scores[:self.k]]
        return self

def main():
    data = pd.DataFrame({
        'pet':      ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish'],
        'children': [4., 6, 3, 3, 2, 3, 5, 4],
        'salary':   [90., 24, 44, 27, 32, 59, 36, 27]})
    selector = TrackingSelectKBest(chi2, k=1)
    columns = np.array(['children', 'salary'])
    m = DataFrameMapper([(columns, selector)])
    m.fit_transform(data[columns], data['pet'])
    print(columns[selector.best_columns_])

if __name__ == '__main__':
    main()

The snippet above should print:

['salary']

There are probably other ways to achieve this, but scikit-learn transformers take numpy arrays without access to the original data frame column names, so the names cannot be derived from the transformer alone.
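
For comparison, the manual tracking can also be sketched without subclassing, by applying the boolean mask from get_support() to a column list kept outside the transformer. This is only an illustration of the idea; the variable names selected and dropped are made up for the example, and the selector is fitted directly rather than through DataFrameMapper.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

data = pd.DataFrame({
    'pet':      ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish'],
    'children': [4., 6, 3, 3, 2, 3, 5, 4],
    'salary':   [90., 24, 44, 27, 32, 59, 36, 27]})

columns = ['children', 'salary']
selector = SelectKBest(chi2, k=1)

# The selector only sees a bare array here, so it cannot know the names.
selector.fit(data[columns].to_numpy(), data['pet'])

# get_support() yields one boolean per input column, in input order.
mask = selector.get_support()
selected = [c for c, keep in zip(columns, mask) if keep]
dropped = [c for c, keep in zip(columns, mask) if not keep]
print(selected, dropped)
```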

Aashit-Sharma · 2020-02-24T06:17:11Z

@dukebody Any updates on this? It sounds like a really good idea to implement.

falcaopetri · 2021-10-17T18:33:12Z

The scikit-learn 1.0 estimator API has better support for feature names. For example, using DataFrameMapper's input_df=True allows us to get:

mapper_fs = DataFrameMapper([(['children','salary'], SelectKBest(chi2, k=2))], input_df=True)
mapper_fs.fit_transform(data[['children','salary']], data['pet'])
print(mapper_fs.transformed_names_)
['children_salary_children', 'children_salary_salary']

Where children_salary was added by sklearn-pandas.

Same applies to, e.g., OneHotEncoder:

import pandas as pd
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    'col': [0, 0, 1, 1, 2, 3, 0],
    'target': [0, 0, 1, 1, 2, 3, 0]
})
mapper = DataFrameMapper([(['col', 'target'], OneHotEncoder())], df_out=True)
transformed = mapper.fit_transform(df)
print(mapper.transformed_names_)
['col_target_x0_0', 'col_target_x0_1', 'col_target_x0_2', 'col_target_x0_3', 'col_target_x1_0', 
 'col_target_x1_1', 'col_target_x1_2', 'col_target_x1_3']

vs

mapper = DataFrameMapper([(['col', 'target'], OneHotEncoder())], input_df=True, df_out=True)
transformed = mapper.fit_transform(df)
print(mapper.transformed_names_)
['col_target_col_0', 'col_target_col_1', 'col_target_col_2', 'col_target_col_3', 'col_target_target_0', 'col_target_target_1', 'col_target_target_2', 'col_target_target_3']

Would it be possible to take advantage of scikit-learn's capabilities and improve the handling in DataFrameMapper.get_names?

dukebody added the good first issue label Oct 17, 2018

ragrawal linked a pull request Oct 18, 2021 that will close this issue

Use new transformer.get_feature_names_out function #248

Open

falcaopetri linked a pull request Oct 19, 2021 that will close this issue

Use new transformer.get_feature_names_out function #248

Open
