Add CLI function for SMOTETomek #463
base: master
Changes from all commits
bb2ff19
e751e94
e6dc779
b8c4993
@@ -332,6 +332,16 @@ class KerasRegressorMetrics(str, Enum):
    mae = "mae"


class SMOTETomekSamplingStrategy(str, Enum):
    """Sampling strategies available for SMOTETomek."""

    minority = "minority"
    not_minority = "not minority"
    not_majority = "not majority"
    all = "all"
    auto = "auto"


INPUT_FILE_OPTION = Annotated[
    Path,
    typer.Option(

@@ -3026,6 +3036,49 @@ def gamma_overlay_cli(input_rasters: INPUT_FILES_ARGUMENT, output_raster: OUTPUT
# WOFE
# TODO

# --- TRAINING DATA TOOLS ---


# BALANCE SMOTETOMEK
@app.command()
def balance_data_cli(
    input_rasters: INPUT_FILES_ARGUMENT,
    input_labels: INPUT_FILE_OPTION,
    output_raster: OUTPUT_FILE_OPTION,
    output_labels: OUTPUT_FILE_OPTION,
    sampling_strategy_literal: Annotated[SMOTETomekSamplingStrategy, typer.Option()] = SMOTETomekSamplingStrategy.auto,
    sampling_strategy_float: Optional[float] = None,

Comment on lines +3049 to +3050:
We need to come up with more descriptive names for these, I think. For users it will be confusing what is a "sampling strategy float" and they might not know what "literal" means in this context.

    random_state: Optional[int] = None,
):
    """Resample feature data using SMOTETomek.

    Parameter sampling_strategy_float will override sampling_strategy_literal if given.
    """

Comment on lines +3053 to +3056:
A small nitpick: if the docstring is multiline, the """ should be on a blank line.

    from eis_toolkit.prediction.machine_learning_general import prepare_data_for_ml
    from eis_toolkit.training_data_tools.class_balancing import balance_SMOTETomek

    X, y, profile, _ = prepare_data_for_ml(input_rasters, input_labels)
    typer.echo("Progress: 30%")

    if sampling_strategy_float is not None:
        sampling_strategy = sampling_strategy_float
    else:
        sampling_strategy = sampling_strategy_literal

    X_res, y_res = balance_SMOTETomek(X, y, sampling_strategy, random_state)
    typer.echo("Progress 80%")

    with rasterio.open(output_raster, "w", **profile) as dst:
        dst.write(X_res, 1)

    with rasterio.open(output_labels, "w", **profile) as dst:
        dst.write(y_res, 1)
    typer.echo("Progress: 100%")
    typer.echo(
        f"Balancing data completed, writing resampled feature data to {output_raster} \
        and corresponding labels to {output_labels}."
    )


# --- TRANSFORMATIONS ---

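Regarding the naming comment on lines +3049 to +3050: one possible direction is to resolve the two options in a small helper, so the precedence rule lives in one place. This is only a sketch, not the PR's code. The names `SamplingStrategyName`, `resolve_sampling_strategy`, `name` and `ratio` are hypothetical, and returning the enum's plain string value (rather than the enum member) is a conservative assumption about what the downstream function expects.

```python
from enum import Enum
from typing import Optional, Union


class SamplingStrategyName(str, Enum):
    """Named strategies (hypothetical rename of SMOTETomekSamplingStrategy)."""

    minority = "minority"
    not_minority = "not minority"
    not_majority = "not majority"
    all = "all"
    auto = "auto"


def resolve_sampling_strategy(
    name: SamplingStrategyName = SamplingStrategyName.auto,
    ratio: Optional[float] = None,
) -> Union[str, float]:
    """Return the ratio if one was given, otherwise the named strategy as a plain string."""
    if ratio is not None:
        # A numeric ratio takes precedence, mirroring the docstring in the diff.
        return ratio
    return name.value


print(resolve_sampling_strategy())                                    # named strategy only
print(resolve_sampling_strategy(SamplingStrategyName.minority, 0.5))  # ratio overrides the name
```

The CLI function could then expose the two inputs under whatever user-facing option names are agreed on and pass the resolved value on unchanged.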
The parameter length check might need to be changed to this form, at least it was used in another file accepting DataFrames too:
x_size = X.index.size if isinstance(X, pd.DataFrame) else X.shape[0]

Short notice @nmaarnio: I'll have a look into this by Monday or even Friday, depending on different things.

Hi @nmaarnio, I tried to get into this, however, not sure if I do understand the complete context. In my view, it is sufficient to only use If you have a numpy raster stack which originates from a list of 10 rasters, you will have the number of rasters on The underlying question here is: Where does the

Example:
# Imports
import rasterio
import numpy as np
import pandas as pd
from imblearn.combine import SMOTETomek

# Load layers (file_list: paths of the input rasters, assumed to be defined)
input_layers = []
for file in file_list:
    # Load
    layer_raster = rasterio.open(file)
    layer_array = layer_raster.read(1)
    layer_array = np.where(layer_array == layer_raster.nodata, np.nan, layer_array)
    # Make it ML compatible (x*y rows, one column)
    layer_array = layer_array.reshape(-1, 1)
    # Collect the column so it can be stacked below
    input_layers.append(layer_array)

# Stacking
layer_stack = np.column_stack(input_layers)
# layer_stack.shape[0]: number of pixels
# layer_stack.shape[1]: number of layers

# Labels (labels_file: path of the labels raster, assumed to be defined)
labels_raster = rasterio.open(labels_file)
labels_array = labels_raster.read(1)
labels_array = np.where(labels_array == labels_raster.nodata, np.nan, labels_array)
# ML-compatibility (all rows in one column)
labels_array = labels_array.reshape(-1, 1)

# Append labels (could be done in one run with the loop above, but better for readability)
raster_stack = np.column_stack([layer_stack, labels_array])

# Remove any rows containing NaN values (assigned back so the split below uses the cleaned rows)
raster_stack = raster_stack[~np.isnan(raster_stack).any(axis=1)]

You can easily convert this one to a To further propagate this stuff into the

# Split
X, y = raster_stack[:, :-1], raster_stack[:, -1]
# Example for sampling
X_smote_tomek, y_smote_tomek = SMOTETomek(sampling_strategy=0.5).fit_resample(X, y)

To my knowledge, all of the sampling packages and helpers expect The goal before putting any data into the sampler is to have the right data (cleaned) in the right format (shape). I'm not aware of how the data got prepared before, but if you used Another, general command Thus:
Instead, propagate the resulting arrays under the hood and directly feed [any supervised method] with these - no intermediate saving, since the spatial constraints cannot be kept. If you want to save those, save it as

Did that help to clarify some points?

Hi @msmiyels and thanks for the comments! I'll get back to this later more in depth, but your last point was what we were looking for most. We were not sure if there is some accepted logic to "squeeze in" the extra synthetic data produced by this tool, but now I understand that there is not. I gather that we should not offer this as a separate tool for the user, but rather parameterize the supervised model training tools so that they could run this under the hood (?)
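
The "run this under the hood" idea could look roughly like the sketch below: the training step takes an optional resampler and applies it to the in-memory arrays just before fitting, without writing anything to disk. All names here (`fit_with_optional_balancing`, `duplicate_minority`, `count_labels`) are hypothetical. The stand-in resampler only duplicates minority rows so that the example runs without extra dependencies; it is not SMOTETomek.

```python
from collections import Counter
from typing import Callable, Optional, Sequence, Tuple

# Any callable taking (X, y) and returning resampled (X, y).
Resampler = Callable[[Sequence, Sequence], Tuple[Sequence, Sequence]]


def fit_with_optional_balancing(X, y, fit, resampler: Optional[Resampler] = None):
    """Fit a model, optionally balancing the training samples first."""
    if resampler is not None:
        # Resampled data stays in memory and goes straight to the model.
        X, y = resampler(X, y)
    return fit(X, y)


def duplicate_minority(X, y):
    """Stand-in resampler for the demo: repeat minority rows until classes are even."""
    counts = Counter(y)
    majority_n = max(counts.values())
    X_res, y_res = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for i in range(majority_n - n):
            X_res.append(rows[i % len(rows)])
            y_res.append(label)
    return X_res, y_res


def count_labels(X, y):
    """Hypothetical 'fit' that just reports the class counts it was given."""
    return dict(Counter(y))


X = [[0.1], [0.2], [0.3], [0.9]]
y = [0, 0, 0, 1]
raw_counts = fit_with_optional_balancing(X, y, count_labels)
balanced_counts = fit_with_optional_balancing(X, y, count_labels, duplicate_minority)
print(raw_counts, balanced_counts)
```

With imbalanced-learn installed, the `fit_resample` method of a `SMOTETomek` instance has the same `(X, y) -> (X_res, y_res)` shape and could be passed as `resampler`.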

I don't think we can return the stacked rasters as one. We have to return individual rasters that correspond to the input rasters, like in the unify_rasters CLI function. However, as I was digging into this, I learned that the balancing process changes the data size and spatial integrity of the data (idk if this is a concern). I think we need to think about this more and consult someone who knows about this method in this context.
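
The size concern can be illustrated with a small check. This is a sketch on toy data, and the helper names are hypothetical. Rows only map back to pixels while each row still has a source pixel index; once rows are added or removed, the counts no longer line up with the grid. A mismatch is enough to rule out a pixel-for-pixel write, while equal counts alone would not prove that rows still correspond to pixels.

```python
import math


def valid_pixel_indices(rows):
    """Return the flat pixel indices of rows that contain no NaN."""
    return [i for i, row in enumerate(rows) if not any(math.isnan(v) for v in row)]


def row_count_matches_pixels(n_resampled_rows, pixel_indices):
    """Necessary (not sufficient) condition for writing rows back to their pixels."""
    return n_resampled_rows == len(pixel_indices)


# A 2x2 raster flattened to 4 pixels: one feature value plus a label per row.
# Pixel 2 is nodata and would be dropped before resampling.
rows = [[0.1, 0.0], [0.4, 0.0], [math.nan, 0.0], [0.9, 1.0]]
kept = valid_pixel_indices(rows)

# Suppose a resampler returned 4 rows, i.e. one synthetic sample was added.
print(kept, row_count_matches_pixels(4, kept))
```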