
[BUG] BaggingClassifier not working with WEASEL on multivariate data #6427

Closed
marcopeix opened this issue May 15, 2024 · 11 comments · Fixed by #6429
Labels: bug (Something isn't working), module:classification (classification module: time series classification)
marcopeix commented May 15, 2024

Describe the bug

In the documentation, it says

if n_features=1, BaggingClassifier turns a univariate classifier into a multivariate classifier, because slices seen by estimator are all univariate. This can be used to give a univariate classifier multivariate capabilities.
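The slicing behaviour described in the documentation can be illustrated with plain numpy (a minimal sketch with hypothetical shapes; sktime panels are laid out as (instances, channels, timepoints)):

```python
import numpy as np

# a toy multivariate panel: 4 instances, 3 channels, 5 time points
X = np.random.default_rng(0).normal(size=(4, 3, 5))

# with n_features=1, each bagged estimator sees a single-channel slice,
# so a univariate classifier never receives multivariate input
channel = 1  # one randomly chosen channel per estimator
X_slice = X[:, [channel], :]
print(X_slice.shape)  # (4, 1, 5)
```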

When used with WEASEL, I still get the error:

ValueError: Data seen by WEASEL instance has multivariate series, but this WEASEL instance cannot handle multivariate series. Calls with multivariate series may result in error or unreliable results.

To Reproduce
I am using this dataset:

from sktime.datasets import load_japanese_vowels

The data is preprocessed and turned into a numpy 3D array.

Code for classification:

from sktime.classification.ensemble import BaggingClassifier
from sktime.classification.dictionary_based import WEASEL

base_clf = WEASEL(alphabet_size=3, random_state=42)

clf = BaggingClassifier(base_clf, n_estimators=11, n_features=1, random_state=42)

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

Expected behavior
The model trains and makes predictions

Versions
sktime v.0.27.0

System: python: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:27:34) [MSC v.1937 64 bit (AMD64)] executable: D:\Anaconda\envs\sktime\python.exe machine: Windows-10-10.0.19045-SP0

Python dependencies:
pip: 24.0
sktime: 0.27.0
sklearn: 1.4.1.post1
skbase: 0.7.5
numpy: 1.26.4
scipy: 1.12.0
pandas: 2.1.4
matplotlib: 3.8.3
joblib: 1.3.2
numba: 0.59.0
statsmodels: None
pmdarima: None
statsforecast: None
tsfresh: None
tslearn: None
torch: None
tensorflow: None
tensorflow_probability: None

@marcopeix marcopeix added the bug Something isn't working label May 15, 2024
marcopeix (Author) commented:

In fact, running the exact example from the documentation (same code, same dataset) gives an accuracy of 0%.

from sklearn.metrics import accuracy_score

from sktime.classification.ensemble import BaggingClassifier
from sktime.classification.kernel_based import RocketClassifier
from sktime.datasets import load_unit_test

X_train, y_train = load_unit_test(split="train")
X_test, y_test = load_unit_test(split="test")

clf = BaggingClassifier(
    RocketClassifier(num_kernels=100),
    n_estimators=10,
)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

clf_accuracy = round(accuracy_score(y_test, y_pred), 2) * 100

print(f"Accuracy: {clf_accuracy}%")
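The 0% accuracy can be reproduced without sktime at all: if predictions come back as integer codes while the true labels are strings, an element-wise comparison never finds a match (a minimal sketch of the mismatch, not the sktime code path):

```python
# labels as loaded from the dataset: strings
y_true = ["1", "2", "1", "2"]
# integer codes as returned by the buggy predict
y_pred = [0, 1, 0, 1]

# "1" == 0 is always False, so no prediction ever counts as correct
matches = sum(t == p for t, p in zip(y_true, y_pred))
print(matches / len(y_true))  # 0.0
```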

@fkiraly fkiraly added the module:classification classification module: time series classification label May 15, 2024
@fkiraly fkiraly added this to Needs triage & validation in Bugfixing via automation May 15, 2024

fkiraly commented May 15, 2024

I cannot reproduce your error.

Can you kindly give full code with all imports, and report your versions shown by show_versions?

I get a different - expected - error message:
Data seen by WEASEL instance has unequal length series, but this WEASEL instance cannot handle unequal length series. Calls with unequal length series may result in error or unreliable results.

Attempted reproduction on Windows, Python 3.11, current main.

marcopeix (Author) commented:

This is what is returned by show_versions:

System: python: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:27:34) [MSC v.1937 64 bit (AMD64)] executable: D:\Anaconda\envs\sktime\python.exe machine: Windows-10-10.0.19045-SP0

Python dependencies:
pip: 24.0
sktime: 0.27.0
sklearn: 1.4.1.post1
skbase: 0.7.5
numpy: 1.26.4
scipy: 1.12.0
pandas: 2.1.4
matplotlib: 3.8.3
joblib: 1.3.2
numba: 0.59.0
statsmodels: None
pmdarima: None
statsforecast: None
tsfresh: None
tslearn: None
torch: None
tensorflow: None
tensorflow_probability: None

marcopeix (Author) commented:

Full code:

import warnings
warnings.filterwarnings('ignore')

import time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.metrics import accuracy_score

from sktime.datasets import load_japanese_vowels

X_train, y_train = load_japanese_vowels(split='train', return_type=None)
X_test, y_test = load_japanese_vowels(split='test', return_type=None)

max_length = 29

def pad_series(x):
    if len(x) < max_length:
        return np.pad(x, (0, max_length - len(x)), 'constant', constant_values=(0,))
    return x.values[:max_length]

X_train_padded = X_train.applymap(pad_series)
X_test_padded = X_test.applymap(pad_series)

X_train_arrays = [np.stack(row) for _, row in X_train_padded.iterrows()]
X_test_arrays = [np.stack(row) for _, row in X_test_padded.iterrows()]

X_train = np.stack(X_train_arrays, axis=0)
X_test = np.stack(X_test_arrays, axis=0)

from sktime.classification.ensemble import BaggingClassifier
from sktime.classification.dictionary_based import WEASEL

base_clf = WEASEL(alphabet_size=3, random_state=42)

clf = BaggingClassifier(base_clf, n_estimators=11, n_features=1, random_state=42)

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

clf_accuracy = round(accuracy_score(y_test, y_pred),2)*100

print(f'Accuracy: {clf_accuracy}%')
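As an aside, the padding step in the script above can be exercised in isolation (a minimal sketch using plain numpy arrays in place of the nested pandas cells, so `np.asarray` stands in for `.values`):

```python
import numpy as np

max_length = 29

def pad_series(x):
    # zero-pad a 1D array up to max_length, or truncate it to max_length
    if len(x) < max_length:
        return np.pad(x, (0, max_length - len(x)), "constant", constant_values=(0,))
    return np.asarray(x)[:max_length]

short = np.arange(5, dtype=float)   # shorter than max_length: gets padded
long = np.ones(40)                  # longer than max_length: gets truncated
print(len(pad_series(short)), len(pad_series(long)))  # 29 29
```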


fkiraly commented May 15, 2024

The second example is a confirmed bug: something is going wrong with the encoding of the classes, as the predictions are integers while the input labels are objects (strings).

If you do

clf_accuracy = round(accuracy_score(y_test.astype(int) - 1, y_pred), 2) * 100

print(f"Accuracy: {clf_accuracy}%")

this gives an accuracy of 91%, but of course it should work like that out of the box.


fkiraly commented May 15, 2024

Tests were not covering the <U1 output type for y; added a test here to see what is going on: #6428

This is strange, since I remember the example working.


fkiraly commented May 15, 2024

A fix for the first issue is here: #6429

This does not fix the problem with the labels though.


fkiraly commented May 15, 2024

The 2nd problem is due to the default behaviour of _predict, and is common to all classifiers that do not have a custom implementation. The default should convert integers back into class labels.
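The missing conversion can be sketched in isolation (a hedged sketch of the idea, not the actual `BaseClassifier` code; the variable names are hypothetical):

```python
import numpy as np

# a classifier memorizes the classes it saw in fit ...
classes_ = np.asarray(["bad", "good"])   # hypothetical labels seen in fit
# ... but the default _predict returned raw integer codes
y_pred_int = np.asarray([1, 0, 1, 1])

# the missing step: map integer codes back through the memorized classes
y_pred = classes_[y_pred_int]
print(y_pred.tolist())  # ['good', 'bad', 'good', 'good']
```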


fkiraly commented May 15, 2024

Fixed here: #6430

This should fix both bugs reported here.

Testing and review appreciated.

@fkiraly fkiraly moved this from Needs triage & validation to Under review in Bugfixing May 15, 2024
@fkiraly fkiraly changed the title [BUG] - Bagging Classifier not working with WEASEL on multivariate data [BUG] BaggingClassifier not working with WEASEL on multivariate data May 16, 2024
marcopeix (Author) commented:

Fix #6429 is tested and it works. I can use WEASEL with BaggingClassifier. Fix #6430 also works, and no need to do the small hack. Thanks for your help! I hope this gets merged soon so I can get back to working from main!

Cheers!


fkiraly commented May 20, 2024

You're welcome!

ETA for release is within the coming week.

fkiraly added a commit that referenced this issue May 22, 2024
Fixes #6427.

The problem was that `_predict_proba` simply forgot to subset the columns of `X`.

The bug was not detected as the base classifier used in tests was
`DummyClassifier`, which ignores `X`, and hence misses the failures in
column subsetting of `X` entirely.

To cover the bug, the base classifier used in the tests was replaced by `SummaryClassifier`.
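The forgotten subsetting can be sketched with plain numpy (a hedged sketch of the idea, not the actual sktime implementation; the per-estimator channel choices are hypothetical):

```python
import numpy as np

# each bagged estimator is fitted on a column (channel) subset of X, so
# predict-time methods must apply the same subset before calling the
# estimator; the buggy _predict_proba passed the full X through instead
rng = np.random.default_rng(42)
X = rng.normal(size=(8, 3, 12))   # panel: (instances, channels, timepoints)
col_subsets = [[0], [2], [1]]     # hypothetical per-estimator channel choices

sliced_shapes = [X[:, cols, :].shape for cols in col_subsets]
print(sliced_shapes)  # [(8, 1, 12), (8, 1, 12), (8, 1, 12)]
```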
Bugfixing automation moved this from Under review to Fixed/resolved May 22, 2024
fkiraly added a commit that referenced this issue May 22, 2024
…ys, even if `fit` `y` was not integer (#6430)

This fixes one of the bugs reported in
#6427, namely the default
`_predict` in `BaseClassifier` always returning integer labels, even if
the original labels were not integers.

This would cause all classifiers that did not have a custom `_predict`
implemented - a few composites, among them `BaggingClassifier` - to
always predict integers, even if the `y` seen in `fit` was of another
type.

The fix is simple, adding a missing application of the memorized
integer-to-class dictionary.

Test coverage is through #6428.
fkiraly added a commit that referenced this issue May 22, 2024
…e type and labels (#6428)

This PR extends the suite test `test_classifier_on_unit_test_data` to
test that `y` of object or str dtype leads to correct labels on
`predict` outputs.

Covers second bug in #6427,
namely the example returning integer labels instead of string labels -
but does not fix the bug.

It is hence expected that the test will detect bug #6427.

Depends on the following PRs, which fix newly covered bugs:

* #6430
* #6432