
[BUG] BaggingClassifier not working with WEASEL on multivariate data #6427

Closed
marcopeix opened this issue May 15, 2024 · 11 comments · Fixed by #6429
Labels: bug (Something isn't working), module:classification (classification module: time series classification)
marcopeix commented May 15, 2024

Describe the bug

In the documentation, it says

if n_features=1, BaggingClassifier turns a univariate classifier into a multivariate classifier, because slices seen by estimator are all univariate. This can be used to give a univariate classifier multivariate capabilities.
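The slicing behaviour described in the documentation can be illustrated with plain numpy (a minimal sketch with hypothetical shapes; sktime panels are laid out as (instances, channels, timepoints)):

```python
import numpy as np

# a toy multivariate panel: 4 instances, 3 channels, 5 time points
X = np.random.default_rng(0).normal(size=(4, 3, 5))

# with n_features=1, each bagged estimator sees a single-channel slice,
# so a univariate classifier never receives multivariate input
channel = 1  # one randomly chosen channel per estimator
X_slice = X[:, [channel], :]
print(X_slice.shape)  # (4, 1, 5)
```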

When used with WEASEL, I still get the error:

ValueError: Data seen by WEASEL instance has multivariate series, but this WEASEL instance cannot handle multivariate series. Calls with multivariate series may result in error or unreliable results.

To Reproduce
I am using this dataset:

from sktime.datasets import load_japanese_vowels

The data is preprocessed and turned into a numpy 3D array.

Code for classification:

from sktime.classification.ensemble import BaggingClassifier
from sktime.classification.dictionary_based import WEASEL

base_clf = WEASEL(alphabet_size=3, random_state=42)

clf = BaggingClassifier(base_clf, n_estimators=11, n_features=1, random_state=42)

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

Expected behavior
The model trains and makes predictions

Versions
sktime v.0.27.0

System: python: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:27:34) [MSC v.1937 64 bit (AMD64)] executable: D:\Anaconda\envs\sktime\python.exe machine: Windows-10-10.0.19045-SP0

Python dependencies:
pip: 24.0
sktime: 0.27.0
sklearn: 1.4.1.post1
skbase: 0.7.5
numpy: 1.26.4
scipy: 1.12.0
pandas: 2.1.4
matplotlib: 3.8.3
joblib: 1.3.2
numba: 0.59.0
statsmodels: None
pmdarima: None
statsforecast: None
tsfresh: None
tslearn: None
torch: None
tensorflow: None
tensorflow_probability: None

@marcopeix marcopeix added the bug Something isn't working label May 15, 2024
marcopeix (Author) commented:

In fact, running the exact example from the documentation (same code, same dataset) gives an accuracy of 0%.

from sklearn.metrics import accuracy_score

from sktime.classification.ensemble import BaggingClassifier
from sktime.classification.kernel_based import RocketClassifier
from sktime.datasets import load_unit_test

X_train, y_train = load_unit_test(split="train")
X_test, y_test = load_unit_test(split="test")

clf = BaggingClassifier(
    RocketClassifier(num_kernels=100),
    n_estimators=10,
)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

clf_accuracy = round(accuracy_score(y_test, y_pred), 2) * 100

print(f"Accuracy: {clf_accuracy}%")
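The 0% accuracy can be reproduced without sktime at all: if predictions come back as integer codes while the true labels are strings, an element-wise comparison never finds a match (a minimal sketch of the mismatch, not the sktime code path):

```python
# labels as loaded from the dataset: strings
y_true = ["1", "2", "1", "2"]
# integer codes as returned by the buggy predict
y_pred = [0, 1, 0, 1]

# "1" == 0 is always False, so no prediction ever counts as correct
matches = sum(t == p for t, p in zip(y_true, y_pred))
print(matches / len(y_true))  # 0.0
```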

@fkiraly fkiraly added the module:classification classification module: time series classification label May 15, 2024
@fkiraly fkiraly added this to Needs triage & validation in Bugfixing via automation May 15, 2024

fkiraly commented May 15, 2024

I cannot reproduce your error.

Can you kindly give full code with all imports, and report your versions shown by show_versions?

I get a different - expected - error message:
Data seen by WEASEL instance has unequal length series, but this WEASEL instance cannot handle unequal length series. Calls with unequal length series may result in error or unreliable results.

Attempted reproduction on Windows, Python 3.11, current main.

marcopeix (Author) commented:

This is what is returned by show_versions:

System: python: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:27:34) [MSC v.1937 64 bit (AMD64)] executable: D:\Anaconda\envs\sktime\python.exe machine: Windows-10-10.0.19045-SP0

Python dependencies:
pip: 24.0
sktime: 0.27.0
sklearn: 1.4.1.post1
skbase: 0.7.5
numpy: 1.26.4
scipy: 1.12.0
pandas: 2.1.4
matplotlib: 3.8.3
joblib: 1.3.2
numba: 0.59.0
statsmodels: None
pmdarima: None
statsforecast: None
tsfresh: None
tslearn: None
torch: None
tensorflow: None
tensorflow_probability: None

marcopeix (Author) commented:

Full code:

import warnings
warnings.filterwarnings('ignore')

import time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.metrics import accuracy_score

from sktime.datasets import load_japanese_vowels

X_train, y_train = load_japanese_vowels(split='train', return_type=None)
X_test, y_test = load_japanese_vowels(split='test', return_type=None)

max_length = 29

def pad_series(x):
    if len(x) < max_length:
        return np.pad(x, (0, max_length - len(x)), 'constant', constant_values=(0,))
    return x.values[:max_length]

X_train_padded = X_train.applymap(pad_series)
X_test_padded = X_test.applymap(pad_series)

X_train_arrays = [np.stack(row) for _, row in X_train_padded.iterrows()]
X_test_arrays = [np.stack(row) for _, row in X_test_padded.iterrows()]

X_train = np.stack(X_train_arrays, axis=0)
X_test = np.stack(X_test_arrays, axis=0)

from sktime.classification.ensemble import BaggingClassifier
from sktime.classification.dictionary_based import WEASEL

base_clf = WEASEL(alphabet_size=3, random_state=42)

clf = BaggingClassifier(base_clf, n_estimators=11, n_features=1, random_state=42)

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

clf_accuracy = round(accuracy_score(y_test, y_pred),2)*100

print(f'Accuracy: {clf_accuracy}%')
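As an aside, the padding step in the script above can be exercised in isolation (a minimal sketch using plain numpy arrays in place of the nested pandas cells, so `np.asarray` stands in for `.values`):

```python
import numpy as np

max_length = 29

def pad_series(x):
    # zero-pad a 1D array up to max_length, or truncate it to max_length
    if len(x) < max_length:
        return np.pad(x, (0, max_length - len(x)), "constant", constant_values=(0,))
    return np.asarray(x)[:max_length]

short = np.arange(5, dtype=float)   # shorter than max_length: gets padded
long = np.ones(40)                  # longer than max_length: gets truncated
print(len(pad_series(short)), len(pad_series(long)))  # 29 29
```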


fkiraly commented May 15, 2024

The second example is a confirmed bug: something is going wrong with the encoding of the classes, as the predictions are integers while the input labels are objects (strings).

If you do

clf_accuracy = round(accuracy_score(y_test.astype(int) - 1, y_pred), 2) * 100

print(f"Accuracy: {clf_accuracy}%")

this gives an accuracy of 91%, but of course it should work like that out of the box.


fkiraly commented May 15, 2024

Tests were not covering the <U1 output type for y; added a test here to see what is going on: #6428

This is strange, since I remember the example working.


fkiraly commented May 15, 2024

A fix for the first issue is here: #6429

This does not fix the problem with the labels though.


fkiraly commented May 15, 2024

The 2nd problem is due to the default behaviour of _predict, and is common to all classifiers that do not have a custom implementation. The default should convert integers back into class labels.
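The missing conversion can be sketched in isolation (a hedged sketch of the idea, not the actual `BaseClassifier` code; the variable names are hypothetical):

```python
import numpy as np

# a classifier memorizes the classes it saw in fit ...
classes_ = np.asarray(["bad", "good"])   # hypothetical labels seen in fit
# ... but the default _predict returned raw integer codes
y_pred_int = np.asarray([1, 0, 1, 1])

# the missing step: map integer codes back through the memorized classes
y_pred = classes_[y_pred_int]
print(y_pred.tolist())  # ['good', 'bad', 'good', 'good']
```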


fkiraly commented May 15, 2024

Fixed here: #6430

This should fix both bugs reported here.

Testing and review appreciated.

@fkiraly fkiraly moved this from Needs triage & validation to Under review in Bugfixing May 15, 2024
@fkiraly fkiraly changed the title [BUG] - Bagging Classifier not working with WEASEL on multivariate data [BUG] BaggingClassifier not working with WEASEL on multivariate data May 16, 2024
marcopeix (Author) commented:

Fix #6429 is tested and it works. I can use WEASEL with BaggingClassifier. Fix #6430 also works, and no need to do the small hack. Thanks for your help! I hope this gets merged soon so I can get back to working from main!

Cheers!


fkiraly commented May 20, 2024

You're welcome!

ETA for release is within the coming week.

fkiraly added a commit that referenced this issue May 22, 2024
Fixes #6427.

The problem was that `_predict_proba` simply forgot to subset the columns of `X`.

The bug was not detected as the base classifier used in tests was
`DummyClassifier`, which ignores `X`, and hence misses the failures in
column subsetting of `X` entirely.

To cover the bug, the base classifier used in the tests was replaced by `SummaryClassifier`.
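The forgotten subsetting can be sketched with plain numpy (a hedged sketch of the idea, not the actual sktime implementation; the per-estimator channel choices are hypothetical):

```python
import numpy as np

# each bagged estimator is fitted on a column (channel) subset of X, so
# predict-time methods must apply the same subset before calling the
# estimator; the buggy _predict_proba passed the full X through instead
rng = np.random.default_rng(42)
X = rng.normal(size=(8, 3, 12))   # panel: (instances, channels, timepoints)
col_subsets = [[0], [2], [1]]     # hypothetical per-estimator channel choices

sliced_shapes = [X[:, cols, :].shape for cols in col_subsets]
print(sliced_shapes)  # [(8, 1, 12), (8, 1, 12), (8, 1, 12)]
```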
Bugfixing automation moved this from Under review to Fixed/resolved May 22, 2024
fkiraly added a commit that referenced this issue May 22, 2024
…ys, even if `fit` `y` was not integer (#6430)

This fixes one of the bugs reported in
#6427, namely the default
`_predict` in `BaseClassifier` always returning integer labels, even if
the original labels were not integers.

This would cause all classifiers that did not have a custom `_predict`
implemented - a few composites, among them `BaggingClassifier` - to
always predict integers, even if the `y` seen in `fit` was of another
type.

The fix is simple, adding a missing application of the memorized
integer-to-class dictionary.

Test coverage is through #6428.
fkiraly added a commit that referenced this issue May 22, 2024
…e type and labels (#6428)

This PR extends the suite test `test_classifier_on_unit_test_data` to
test that `y` of object or str dtype leads to correct labels on
`predict` outputs.

Covers second bug in #6427,
namely the example returning integer labels instead of string labels -
but does not fix the bug.

It is hence expected that the test will detect bug #6427.

Depends on the following PRs, which fix newly covered bugs:

* #6430
* #6432