Add CLI function for SMOTETomek #463

53 changes: 53 additions & 0 deletions eis_toolkit/cli.py
Expand Up @@ -332,6 +332,16 @@ class KerasRegressorMetrics(str, Enum):
mae = "mae"


class SMOTETomekSamplingStrategy(str, Enum):
"""Sampling strategies available for SMOTETomek."""

minority = "minority"
not_minority = "not minority"
not_majority = "not majority"
all = "all"
auto = "auto"


INPUT_FILE_OPTION = Annotated[
Path,
typer.Option(
Expand Down Expand Up @@ -3026,6 +3036,49 @@ def gamma_overlay_cli(input_rasters: INPUT_FILES_ARGUMENT, output_raster: OUTPUT
# WOFE
# TODO

# --- TRAINING DATA TOOLS ---


# BALANCE SMOTETOMEK
@app.command()
def balance_data_cli(
input_rasters: INPUT_FILES_ARGUMENT,
input_labels: INPUT_FILE_OPTION,
output_raster: OUTPUT_FILE_OPTION,
Collaborator

I don't think we can return the stacked rasters as one. We have to return individual rasters that correspond to the input rasters, like in the unify_rasters CLI function. However, as I was digging into this, I learned that the balancing process changes the data size and spatial integrity of the data (idk if this is a concern). I think we need to think about this more and consult someone who knows about this method in this context.

output_labels: OUTPUT_FILE_OPTION,
sampling_strategy_literal: Annotated[SMOTETomekSamplingStrategy, typer.Option()] = SMOTETomekSamplingStrategy.auto,
sampling_strategy_float: Optional[float] = None,
Comment on lines +3049 to +3050
Collaborator

We need to come up with more descriptive names for these, I think. For users it will be confusing what is a "sampling strategy float" and they might not know what "literal" means in this context.
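
One possible direction for the naming concern, as a rough sketch only: expose one option that selects classes and one that sets a ratio, and resolve them in a small helper. The names `SamplingTarget`, `target_classes`, `minority_ratio` and `resolve_sampling_strategy` are hypothetical and not part of eis_toolkit; the override rule (a given ratio wins) is taken from the CLI docstring in this diff.

```python
from enum import Enum
from typing import Optional, Union


class SamplingTarget(str, Enum):
    """Hypothetical replacement name for the class-selection option."""

    minority = "minority"
    not_minority = "not minority"
    not_majority = "not majority"
    all = "all"
    auto = "auto"


def resolve_sampling_strategy(
    target_classes: SamplingTarget = SamplingTarget.auto,
    minority_ratio: Optional[float] = None,
) -> Union[float, str]:
    """Return the single value to pass on as `sampling_strategy`.

    A given ratio takes precedence over the class selection.
    """
    if minority_ratio is not None:
        return minority_ratio
    # Hand over the plain string rather than the Enum member.
    return target_classes.value


print(resolve_sampling_strategy())                              # auto
print(resolve_sampling_strategy(SamplingTarget.minority, 0.5))  # 0.5
```

With names along these lines, the help text can say "which classes to resample" and "desired minority-to-majority ratio" instead of "literal" and "float".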

random_state: Optional[int] = None,
):
"""Resample feature data using SMOTETomek.

Parameter sampling_strategy_float will override sampling_strategy_literal if given.
"""
Comment on lines +3053 to +3056
Collaborator

A small nitpick: in a multiline docstring, the opening """ should be on its own line.

Suggested change
"""Resample feature data using SMOTETomek.
Parameter sampling_strategy_float will override sampling_strategy_literal if given.
"""
"""
Resample feature data using SMOTETomek.
Parameter sampling_strategy_float will override sampling_strategy_literal if given.
"""

from eis_toolkit.prediction.machine_learning_general import prepare_data_for_ml
from eis_toolkit.training_data_tools.class_balancing import balance_SMOTETomek

X, y, profile, _ = prepare_data_for_ml(input_rasters, input_labels)
typer.echo("Progress: 30%")

if sampling_strategy_float is not None:
sampling_strategy = sampling_strategy_float
else:
sampling_strategy = sampling_strategy_literal

X_res, y_res = balance_SMOTETomek(X, y, sampling_strategy, random_state)
typer.echo("Progress: 80%")

with rasterio.open(output_raster, "w", **profile) as dst:
dst.write(X_res, 1)

with rasterio.open(output_labels, "w", **profile) as dst:
dst.write(y_res, 1)
typer.echo("Progress: 100%")
typer.echo(
f"Balancing data completed, writing resampled feature data to {output_raster} \
and corresponding labels to {output_labels}."
)


# --- TRANSFORMATIONS ---

19 changes: 11 additions & 8 deletions eis_toolkit/training_data_tools/class_balancing.py
Collaborator

The parameter length check might need to be changed to this form, at least it was used in another file accepting DataFrames too:

x_size = X.index.size if isinstance(X, pd.DataFrame) else X.shape[0]
if x_size != y.size:
    raise NonMatchingParameterLengthsException(f"X and y must have the same length: {x_size} != {y.size}.")
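
As a hedged alternative to the isinstance branch, the check could rely on `len()`, which returns the number of rows for a pandas DataFrame or Series, a NumPy array, and a plain list alike. The helper name `check_matching_lengths` is hypothetical, and the exception class below is only a local stand-in for the one in `eis_toolkit.exceptions` so that the sketch runs on its own.

```python
class NonMatchingParameterLengthsException(Exception):
    """Local stand-in for the exception defined in eis_toolkit.exceptions."""


def check_matching_lengths(X, y) -> int:
    """Raise if X and y differ in number of samples; return the sample count."""
    n_samples = len(X)  # number of rows for DataFrame, ndarray and list
    if n_samples != len(y):
        raise NonMatchingParameterLengthsException(
            f"X and y must have the same length: {n_samples} != {len(y)}."
        )
    return n_samples


# Plain lists are used here only to keep the example dependency-free.
n_ok = check_matching_lengths([[1.0, 2.0], [3.0, 4.0]], [0, 1])

try:
    check_matching_lengths([[1.0, 2.0], [3.0, 4.0]], [0, 1, 1])
    mismatch_detected = False
except NonMatchingParameterLengthsException:
    mismatch_detected = True
```

This assumes y is one-dimensional or has one row per sample; a label array shaped (1, n) would need reshaping first.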

Collaborator

Short notice @nmaarnio: I'll have a look into this by Monday or even Friday, depending on a few other things.

Collaborator

@msmiyels msmiyels Dec 10, 2024

Hi @nmaarnio, I tried to get into this; however, I'm not sure I understand the complete context.

In my view, it is sufficient to use only X.shape[0] for both DataFrames and NumPy arrays. Behind the scenes, pandas 🐼 uses NumPy.

If you have a NumPy raster stack that originates from a list of 10 rasters, you will have the number of rows (pixels) in shape[0] and the number of rasters in shape[1].

The underlying question here is: where does the pd.DataFrame come from ❓ Maybe the following example (illustrating the complete workflow idea) helps:

Example:

# Imports
import rasterio
import numpy as np
import pandas as pd

from imblearn.combine import SMOTETomek

# file_list (paths of the evidence rasters) and labels_file (path of the
# label raster) are assumed to be defined beforehand

# Load layers
input_layers = []
for file in file_list:
    # Load (cast to float so that nodata can be replaced with NaN)
    with rasterio.open(file) as layer_raster:
        layer_array = layer_raster.read(1).astype(float)
        layer_array = np.where(layer_array == layer_raster.nodata, np.nan, layer_array)

    # Make it ML compatible (x*y rows, one column) and collect it
    input_layers.append(layer_array.reshape(-1, 1))

# Stacking
layer_stack = np.column_stack(input_layers)

# layer_stack.shape[0]: number of pixels
# layer_stack.shape[1]: number of layers

# Labels
with rasterio.open(labels_file) as labels_raster:
    labels_array = labels_raster.read(1).astype(float)
    labels_array = np.where(labels_array == labels_raster.nodata, np.nan, labels_array)

# ML-compatibility (all rows in one column)
labels_array = labels_array.reshape(-1, 1)

# Append labels (could be done in one run with the loop above, but better for readability)
raster_stack = np.column_stack([layer_stack, labels_array])

# Remove any rows containing NaN values (the result has to be assigned)
raster_stack = raster_stack[~np.isnan(raster_stack).any(axis=1)]

You can easily convert this one to a pd.DataFrame with pd.DataFrame(raster_stack). The first index (0) will be the number of rows and the second (1) the number of layers + one label column.

To further propagate this stuff into the sampling, split it again into X and y

# Split
X, y = raster_stack[:, :-1], raster_stack[:, -1]

# Example for sampling
X_smote_tomek, y_smote_tomek = SMOTETomek(sampling_strategy=0.5).fit_resample(X, y)

To my knowledge, all of the sampling packages and helpers expect X and y to be in shape rows, columns, with rows = spatial_x * spatial_y and columns = number of evidence layers (+ attached labels, if so)

The goal before putting any data into the sampler is to have the right (cleaned) data in the right format (shape). I'm not aware of how the data was prepared before, but if you used .reshape(1, -1), the stacking would need to be done with np.vstack and the result transposed (otherwise rows and columns would be in the wrong order and the indices swapped as well).
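
To make the shape remark concrete, here is a small sketch with made-up 2x3 arrays standing in for rasters (the values and layer count are arbitrary assumptions). It shows that stacking column vectors with `np.column_stack` and stacking row vectors with `np.vstack` followed by a transpose give the same samples-by-layers matrix.

```python
import numpy as np

# Three tiny 2x3 "rasters" standing in for evidence layers.
layers = [np.arange(6, dtype=float).reshape(2, 3) + 10 * i for i in range(3)]

# Variant 1: each layer becomes one column of length rows * cols.
as_columns = np.column_stack([layer.reshape(-1, 1) for layer in layers])

# Variant 2: each layer becomes one row, so the stack is layers x pixels
# and has to be transposed before it is handed to a sampler.
as_rows = np.vstack([layer.reshape(1, -1) for layer in layers])

print(as_columns.shape)  # (6, 3): pixels x layers
print(as_rows.shape)     # (3, 6): layers x pixels
print(np.array_equal(as_columns, as_rows.T))  # True
```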

Another, general comment
I would not recommend trying to save any of the sampled data as rasters. Any sampling will modify the data and increase/decrease the amount of data used for training by creating some kind of virtual samples (or by dropping), but without a spatial relation.

Thus:

  • the number of rows will change, and the spatial_x and spatial_y will not be the same (other dimensions)
  • you cannot generally overlay the sampled outputs with the original data

Instead, pass the resulting arrays on under the hood and feed them directly to [any supervised method], with no intermediate saving, since the spatial constraints cannot be kept. If you want to save them, write a .csv or .feather file (binary, and much faster and smaller) and load that file again as input.
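
A minimal sketch of the tabular route, using only the standard library `csv` module. The function names, column names and sample values are illustrative assumptions; in practice X and y would be the arrays returned by the resampling step, and a DataFrame-based writer would work just as well.

```python
import csv
import tempfile
from pathlib import Path


def save_samples_csv(path, X, y, feature_names):
    """Write one row per sample: feature columns followed by a label column."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([*feature_names, "label"])
        for features, label in zip(X, y):
            writer.writerow([*features, label])


def load_samples_csv(path):
    """Read the table back as (X, y) lists of floats."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        rows = [[float(value) for value in row] for row in reader]
    return [row[:-1] for row in rows], [row[-1] for row in rows]


# Made-up resampled data: three samples, two evidence layers.
X_res = [[0.1, 2.0], [0.4, 1.5], [0.25, 1.75]]
y_res = [0, 1, 1]

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "resampled.csv"
    save_samples_csv(csv_path, X_res, y_res, ["layer_1", "layer_2"])
    X_back, y_back = load_samples_csv(csv_path)
```

The table has no georeferencing at all, which matches the point above: the rows are training samples, not pixels.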

Did that help to clarify some points?

Collaborator

Hi @msmiyels and thanks for the comments! I'll get back to this in more depth later, but your last point was what we were looking for most. We were not sure whether there is some accepted logic to "squeeze in" the extra synthetic data produced by this tool, but now I understand that there is not. I gather that we should not offer this as a separate tool for the user, but rather parameterize the supervised model training tools so that they can run this under the hood (?)
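
A rough sketch of what "under the hood" could look like, assuming the training tools accept an optional balancing callable. `train_model`, `duplicate_minority` and the `fit` argument are hypothetical names for illustration; the toy balancer only duplicates minority samples and is not SMOTETomek.

```python
import random
from collections import Counter
from typing import Callable, Optional


def train_model(X, y, fit: Callable, balancer: Optional[Callable] = None):
    """Fit a model, optionally resampling the training data first.

    The resampled data stays in memory and is never written back as rasters.
    """
    if balancer is not None:
        X, y = balancer(X, y)
    return fit(X, y)


def duplicate_minority(X, y, seed=0):
    """Toy balancer: duplicate random minority samples until classes are even."""
    rng = random.Random(seed)
    counts = Counter(y)
    majority_size = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(majority_size - count):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


X = [[0.0], [0.1], [0.2], [0.9]]
y = [0, 0, 0, 1]

# The "model" here just reports the class counts it was trained on.
class_counts = train_model(X, y, fit=lambda X, y: Counter(y), balancer=duplicate_minority)
unbalanced_size = train_model(X, y, fit=lambda X, y: len(y))
```

In the toolkit, the balancer slot could presumably be filled with something like `lambda X, y: balance_SMOTETomek(X, y, sampling_strategy, random_state)`, driven by parameters on the training CLI functions.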

@@ -1,7 +1,7 @@
import numpy as np
import pandas as pd
from beartype import beartype
from beartype.typing import Optional, Union
from beartype.typing import Literal, Optional, Union
from imblearn.combine import SMOTETomek

from eis_toolkit.exceptions import NonMatchingParameterLengthsException
Expand All @@ -11,24 +11,27 @@
def balance_SMOTETomek(
X: Union[pd.DataFrame, np.ndarray],
y: Union[pd.Series, np.ndarray],
sampling_strategy: Union[float, str, dict] = "auto",
sampling_strategy: Union[float, Literal["minority", "not minority", "not majority", "all", "auto"], dict] = "auto",
random_state: Optional[int] = None,
) -> tuple[Union[pd.DataFrame, np.ndarray], Union[pd.Series, np.ndarray]]:
"""Balances the classes of input dataset using SMOTETomek resampling method.
"""
Balances the classes of input dataset using SMOTETomek resampling method.

For more information about Imblearn SMOTETomek read the documentation here:
https://imbalanced-learn.org/stable/references/generated/imblearn.combine.SMOTETomek.html.

Args:
X: The feature matrix (input data as a DataFrame).
y: The target labels corresponding to the feature matrix.
X: Input feature data to be sampled.
y: Target labels corresponding to the input features.
sampling_strategy: Parameter controlling how to perform the resampling.
If float, specifies the ratio of samples in minority class to samples of majority class,
if str, specifies classes to be resampled ("minority", "not minority", "not majority", "all", "auto"),
if dict, the keys should be targeted classes and values the desired number of samples for the class.
Defaults to "auto", which will resample all classes except the majority class.
random_state: Parameter controlling randomization of the algorithm. Can be given a seed (number).
Defaults to None, which randomizes the seed.
random_state: Seed for random number generation. Defaults to None.

Returns:
Resampled feature matrix and target labels.
Resampled feature data and target labels.

Raises:
NonMatchingParameterLengthsException: If X and y have different length.
5 changes: 3 additions & 2 deletions tests/training_data_tools/class_balancing_test.py
@@ -1,5 +1,6 @@
import numpy as np
import pytest
from beartype.roar import BeartypeCallHintParamViolation
from sklearn.datasets import make_classification

from eis_toolkit.exceptions import NonMatchingParameterLengthsException
Expand Down Expand Up @@ -37,6 +38,6 @@ def test_invalid_label_length():


def test_invalid_sampling_strategy():
"""Test that invalid value for sampling strategy raises the correct exception (generated by imblearn)."""
with pytest.raises(ValueError):
"""Test that invalid value for sampling strategy raises the correct exception."""
with pytest.raises(BeartypeCallHintParamViolation):
balance_SMOTETomek(X, y, sampling_strategy="invalid_strategy")