Feature Column or Keras Preprocessing Layer
There are two options for feature engineering in TensorFlow: the feature column API and Keras preprocessing layers (for numeric inputs and categorical inputs).
In the data analysis and transform design, we proposed some transform functions to extend the `COLUMN` syntax. We will generate the Python code for feature engineering from the `COLUMN` clause. This document discusses which API the generated code should be built upon: feature columns or Keras preprocessing layers.
The motivation section of the RFC named Keras Category Inputs shows that the community plans to develop Keras preprocessing layers to replace the feature column API. These layers will be released in TF2.2.
The RFC mentions three pain points of feature columns. The following points are copied from the RFC:
* Users have to define both feature columns and Keras Inputs for the model, resulting in code duplication and deviation from DRY (Do not repeat yourself) principle. See this [Github issue](https://github.com/tensorflow/tensorflow/issues/27416).
* Users with large dimension categorical inputs will incur large memory footprint and computation cost, if wrapped with indicator column through `tf.keras.layers.DenseFeatures`.
* Currently there is no way to correctly feed Keras linear model or dense layer with multivalent categorical inputs or weighted categorical inputs.
- Users have to define both feature columns and Keras Inputs for the model, resulting in code duplication and deviation from DRY (Do not repeat yourself) principle. See this Github issue.
In the next code snippet, we use feature columns to transform features for a DNN model, and we have to use the feature names (e.g. "color", "frequencies") to define both the feature columns and the `tf.keras.Input` objects. What's more, some feature columns are derived from other feature columns, such as `indicator_column` in the next code snippet, and we don't need to create a `tf.keras.Input` for them.
Code Snippet 1
import numpy as np
import tensorflow as tf

color_column = tf.feature_column.categorical_column_with_vocabulary_list(
    'color', ['R', 'G', 'B'], dtype=tf.string, default_value=-1
)
color_input = tf.keras.Input(name='color', shape=(1,), dtype=tf.string)

weighted_column = tf.feature_column.weighted_categorical_column(
    categorical_column=color_column, weight_feature_key='frequencies'
)
frequencies_input = tf.keras.Input(name='frequencies', shape=(1,), dtype=tf.float32)

indicator_column = tf.feature_column.indicator_column(weighted_column)

inputs = {
    'color': color_input,
    'frequencies': frequencies_input
}
feature_layer = tf.keras.layers.DenseFeatures(indicator_column)
feature_value = feature_layer(inputs)
dense = tf.keras.layers.Dense(1)(feature_value)
model = tf.keras.Model(inputs=inputs, outputs=dense)
model.compile(optimizer='sgd', loss='mse')

x = {
    'color': tf.constant([['R'],['G'],['B']]),
    'frequencies': tf.constant([[0.11],[0.23],[0.87]])
}
y = tf.constant([[1], [0], [0]])
model.fit(x, y, epochs=5)
- Users with large dimension categorical inputs will incur large memory footprint and computation cost, if wrapped with indicator column through `tf.keras.layers.DenseFeatures`.
`indicator_column` can wrap any categorical column, including `crossed_column`, to produce a multi-hot representation of the given column. However, storing the multi-hot representation as a dense matrix incurs a large memory footprint.
Code Snippet 2
color_column = tf.feature_column.categorical_column_with_vocabulary_list(
    'color', ['R', 'G', 'B'], dtype=tf.string, default_value=-1
)
color_one_hot = tf.feature_column.indicator_column(color_column)
If the values of `color` are `[['R'],['G'],['B']]`, the output of `indicator_column` is
np.array([
    [1,0,0],
    [0,1,0],
    [0,0,1]
])
The output will be very sparse if the number of entries in `vocabulary_list` is very large.
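To make the sparsity concrete, here is a minimal pure-Python sketch, not the TensorFlow implementation. The helper names `multi_hot` and `density` are hypothetical, and the sketch assumes that an out-of-vocabulary value (id `-1` in Code Snippet 2) yields an all-zero row.

```python
def multi_hot(rows, vocabulary):
    """Return one dense row of length len(vocabulary) per example."""
    index = {token: i for i, token in enumerate(vocabulary)}
    output = []
    for values in rows:
        dense_row = [0] * len(vocabulary)
        for value in values:
            position = index.get(value, -1)
            if position >= 0:  # assumption: unknown values are ignored
                dense_row[position] = 1
        output.append(dense_row)
    return output


def density(matrix):
    """Fraction of non-zero cells in a dense matrix."""
    cells = sum(len(row) for row in matrix)
    non_zero = sum(1 for row in matrix for cell in row if cell != 0)
    return non_zero / cells


small = multi_hot([["R"], ["G"], ["B"]], ["R", "G", "B"])

# A hypothetical 10,000-entry vocabulary with one value per example.
large_vocabulary = ["token_%d" % i for i in range(10000)]
large = multi_hot([["token_7"], ["token_42"]], large_vocabulary)
```

With one value per example, the density is one over the vocabulary size, so the dense matrix grows linearly with the vocabulary while the number of non-zero cells stays fixed.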
- Currently there is no way to correctly feed Keras linear model or dense layer with multivalent categorical inputs or weighted categorical inputs.
For example, in NLP we may represent text documents as a collection of word frequencies. If we want to feed a Keras linear model or dense layer with weighted categorical inputs, we have to use `indicator_column` or `embedding_column` to wrap `weighted_categorical_column` for `DenseFeatures`.
Code Snippet 1 shows an example with weighted categorical inputs. We use `indicator_column` to wrap `weighted_categorical_column` before feeding the Keras linear model, because `DenseFeatures` cannot accept the `weighted_column` directly. However, this solution suffers from the second problem above.
Another way is to use `embedding_column` instead of `indicator_column` to wrap `weighted_categorical_column` for `DenseFeatures`, which avoids the second problem above.
Code Snippet 3
import numpy as np
import tensorflow as tf

color_column = tf.feature_column.categorical_column_with_vocabulary_list(
    'color', ['R', 'G', 'B'], dtype=tf.string, default_value=-1
)
weighted_column = tf.feature_column.weighted_categorical_column(
    categorical_column=color_column, weight_feature_key='frequencies'
)
embedding_column = tf.feature_column.embedding_column(
    weighted_column, dimension=1
)

inputs = {
    'color': tf.keras.Input(name='color', shape=(1,), dtype=tf.string),
    'frequencies': tf.keras.Input(name='frequencies', shape=(1,), dtype=tf.float32)
}
feature_layer = tf.keras.layers.DenseFeatures(embedding_column)
feature_value = feature_layer(inputs)
model = tf.keras.Model(inputs=inputs, outputs=feature_value)
In this code snippet, we specify `dimension=1` to keep the same logic as `tf.keras.layers.Dense(1)(feature_value)` in Code Snippet 1. However, if we specify an activation function for `Dense`, like `tf.keras.layers.Dense(1, activation="relu")(feature_value)` in Code Snippet 1, we must apply the activation function ourselves to the output of `DenseFeatures` with `embedding_column`, which may be tedious.
Code Snippet 4
feature_layer = tf.keras.layers.DenseFeatures(embedding_column)
feature_value = feature_layer(inputs)
relu = tf.keras.activations.relu(feature_value)
model = tf.keras.Model(inputs=inputs, outputs=relu)
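The relationship between the indicator path and the embedding path can be checked with a small pure-Python sketch. It assumes a `Dense(1)` layer without a bias and an embedding lookup that sums the weighted rows (a `sum` combiner). `tf.feature_column.embedding_column` defaults to `combiner='mean'`, so this illustrates the idea rather than reproducing Code Snippet 3 exactly. All names and numbers below are hypothetical.

```python
VOCABULARY = ["R", "G", "B"]
# One trainable scalar per vocabulary entry: the Dense(1) kernel on the
# indicator path, or the embedding table (dimension=1) on the embedding path.
PARAMETERS = [0.5, -1.0, 2.0]


def dense_on_weighted_indicator(tokens, weights):
    """Build the full weighted multi-hot vector, then apply a bias-free Dense(1)."""
    multi_hot = [0.0] * len(VOCABULARY)
    for token, weight in zip(tokens, weights):
        multi_hot[VOCABULARY.index(token)] += weight
    return sum(x * w for x, w in zip(multi_hot, PARAMETERS))


def weighted_embedding_sum(tokens, weights):
    """Look up only the rows that are present and sum them by weight."""
    return sum(
        weight * PARAMETERS[VOCABULARY.index(token)]
        for token, weight in zip(tokens, weights)
    )


def relu(value):
    """Activation applied as a separate step, as in Code Snippet 4."""
    return max(0.0, value)


example_tokens, example_weights = ["R", "B"], [0.25, 0.5]
dense_out = dense_on_weighted_indicator(example_tokens, example_weights)
embedding_out = weighted_embedding_sum(example_tokens, example_weights)
```

Under these assumptions both paths compute the same weighted sum, but the embedding path never materializes a vector as long as the vocabulary, which is why it sidesteps the memory problem. The activation still has to be applied as an extra step.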
- DNN
- Wide And Deep
- DeepFM
- Add a new concat_column for the `CONCAT` transform function.
- The built-in preprocessing layers will be released in TF2.2. For TF versions < 2.2, we will implement layers with the same definitions. For TF versions >= 2.2, we will use the built-in layers directly.
- The built-in layers won't cover the `CONCAT` function. We will provide this layer in the ElasticDL pip package.
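
The version rule above could be expressed as a small helper. This is only a sketch: the function name `use_builtin_preprocessing_layers` is hypothetical, and in practice the version string would presumably come from `tf.__version__`.

```python
import re


def use_builtin_preprocessing_layers(tf_version):
    """Return True when the given TensorFlow version is 2.2 or newer."""
    match = re.match(r"(\d+)\.(\d+)", tf_version)
    if match is None:
        raise ValueError("Unrecognized version string: %r" % tf_version)
    major, minor = int(match.group(1)), int(match.group(2))
    # Compare as integer tuples so that "2.10" sorts after "2.2".
    return (major, minor) >= (2, 2)
```

Comparing integer tuples instead of raw strings avoids treating a version such as "2.10" as older than "2.2".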