[BUG] BaggingClassifier not working with WEASEL on multivariate data #6427
In fact, running the example from the documentation (exact same code, same dataset), we get an accuracy of 0%.

```python
from sklearn.metrics import accuracy_score

from sktime.classification.ensemble import BaggingClassifier
from sktime.classification.kernel_based import RocketClassifier
from sktime.datasets import load_unit_test

X_train, y_train = load_unit_test(split="train")
X_test, y_test = load_unit_test(split="test")

clf = BaggingClassifier(
    RocketClassifier(num_kernels=100),
    n_estimators=10,
)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

clf_accuracy = round(accuracy_score(y_test, y_pred), 2) * 100
print(f"Accuracy: {clf_accuracy}%")
```
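For context on how an accuracy of exactly 0% can arise: if the ensemble returns integer position codes while `y_test` holds string labels, an elementwise comparison can never match. A minimal numpy illustration, with hypothetical labels (not the actual dataset):

```python
import numpy as np

# illustrative labels only; the point is the type mismatch, not the data
y_test = np.array(["1", "2", "1", "2"], dtype=object)  # string class labels
y_pred = np.array([0, 1, 0, 1])                        # integer position codes

# naive elementwise comparison: "1" == 0 is always False
matches = np.array([t == p for t, p in zip(y_test, y_pred)])
print(matches.mean())  # 0.0, although predictions are correct up to relabeling

# decoding the codes through the class order memorized in fit recovers 100%
classes_ = np.array(["1", "2"], dtype=object)
decoded = classes_[y_pred]
print((y_test == decoded).mean())  # 1.0
```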
I cannot reproduce your error. Can you kindly give the full code with all imports, and report your versions as shown by `show_versions()`? I get a different (expected) error message. Attempted reproduction on Windows, Python 3.11, current version.
This is what is returned by show_versions:

System:
    python: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:27:34) [MSC v.1937 64 bit (AMD64)]
    executable: D:\Anaconda\envs\sktime\python.exe
    machine: Windows-10-10.0.19045-SP0

Python dependencies:
Full code:

```python
import warnings
warnings.filterwarnings("ignore")

import numpy as np
from sklearn.metrics import accuracy_score

from sktime.datasets import load_japanese_vowels

X_train, y_train = load_japanese_vowels(split="train", return_type=None)
X_test, y_test = load_japanese_vowels(split="test", return_type=None)

# pad or truncate every series to a common length so the panel can be
# stacked into a numpy 3D array
max_length = 29

def pad_series(x):
    if len(x) < max_length:
        return np.pad(x, (0, max_length - len(x)), "constant", constant_values=(0,))
    return x.values[:max_length]

X_train_padded = X_train.applymap(pad_series)
X_test_padded = X_test.applymap(pad_series)

X_train_arrays = [np.stack(row) for _, row in X_train_padded.iterrows()]
X_test_arrays = [np.stack(row) for _, row in X_test_padded.iterrows()]
X_train = np.stack(X_train_arrays, axis=0)
X_test = np.stack(X_test_arrays, axis=0)

from sktime.classification.dictionary_based import WEASEL
from sktime.classification.ensemble import BaggingClassifier

base_clf = WEASEL(alphabet_size=3, random_state=42)
clf = BaggingClassifier(base_clf, n_estimators=11, n_features=1, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

clf_accuracy = round(accuracy_score(y_test, y_pred), 2) * 100
print(f"Accuracy: {clf_accuracy}%")
```
The second example is a confirmed bug: something is going on with the encoding of the classes, as the predictions are int while the inputs are object labels. If you do

```python
clf_accuracy = round(accuracy_score(y_test.astype(int) - 1, y_pred), 2) * 100
print(f"Accuracy: {clf_accuracy}%")
```

this gives an accuracy of 91%, but of course it should work like that out of the box.
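The same workaround can also be applied in the other direction, decoding `y_pred` instead of casting `y_test`. This sketch assumes (hypothetically) that the integer codes index into the sorted training classes, as produced by `np.unique`:

```python
import numpy as np

# toy stand-ins; in the report, y_train holds the Japanese-vowels str labels
y_train = np.array(["1", "2", "3", "2", "1"], dtype=object)
y_pred = np.array([0, 2, 1])  # integer codes returned by the buggy predict

classes_ = np.unique(y_train)      # array(['1', '2', '3'], dtype=object)
y_pred_labels = classes_[y_pred]   # map codes back to the original labels
print(y_pred_labels)               # ['1' '3' '2']
```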
Tests were not covering this case. This is strange, since I remember the example working.
A fix for the first issue is here: #6429. This does not fix the problem with the labels, though.
The 2nd problem is due to the default behaviour of `_predict` in `BaseClassifier`.
Fixed here: #6430. This should fix both bugs reported here. Testing and review appreciated.
You're welcome! ETA for release is within the coming week.
Fixes #6427. The problem was that `_predict_proba` simply forgot to subset the columns of `X`. The bug was not detected because the base classifier used in the tests was `DummyClassifier`, which ignores `X`, and hence misses failures in column subsetting of `X` entirely. To cover the bug, the classifier was replaced by `SummaryClassifier`.
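As a rough sketch of the invariant the fix restores (illustrative numpy only, not sktime's implementation): each bagged estimator is fitted on a variable subset of the 3D panel, and `_predict_proba` must apply the same subset before calling the wrapped estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5, 20))  # (instances, variables, timepoints)

# each ensemble member draws its own variable subset at fit time
n_features = 2
col_subsets = [rng.choice(5, size=n_features, replace=False) for _ in range(3)]
fit_views = [X[:, cols, :] for cols in col_subsets]

# predict/_predict_proba must re-apply the same subsets; forgetting this
# hands all 5 variables to an estimator that was trained on only 2
predict_views = [X[:, cols, :] for cols in col_subsets]

for f, p in zip(fit_views, predict_views):
    assert f.shape == p.shape == (8, 2, 20)
print(fit_views[0].shape)  # (8, 2, 20)
```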
…ys, even if `fit` `y` was not integer (#6430). This fixes one of the bugs reported in #6427, namely the default `_predict` in `BaseClassifier` always returning integer labels, even if the original labels were not integers. This would cause all classifiers without a custom `_predict` implementation (a few composites, among them `BaggingClassifier`) to always predict integers, even if the `y` seen in `fit` was of another type. The fix is simple: adding a missing application of the memorized integer-to-class dictionary. Test coverage is through #6428.
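A minimal sketch of the bug pattern described above (hypothetical class, not sktime's actual code): a default `_predict` derived from `_predict_proba` must map argmax indices back to the class labels memorized in `fit`, otherwise it returns raw integers.

```python
import numpy as np

class SketchClassifier:
    def fit(self, X, y):
        self._classes = np.unique(y)  # memorized label order
        return self

    def _predict_proba(self, X):
        # placeholder probabilities: always most confident in class index 1
        proba = np.zeros((len(X), len(self._classes)))
        proba[:, 1] = 1.0
        return proba

    def predict_buggy(self, X):
        # bug: returns integer positions, not the original labels
        return self._predict_proba(X).argmax(axis=1)

    def predict_fixed(self, X):
        # fix: apply the memorized integer-to-class mapping
        return self._classes[self._predict_proba(X).argmax(axis=1)]

clf = SketchClassifier().fit(None, np.array(["a", "b"]))
X = [0, 1, 2]
print(clf.predict_buggy(X))  # [1 1 1]  (integers)
print(clf.predict_fixed(X))  # ['b' 'b' 'b']  (original labels)
```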
…e type and labels (#6428). This PR extends the suite test `test_classifier_on_unit_test_data` to check that `y` of object or str dtype leads to correct labels in `predict` outputs. Covers the second bug in #6427, namely the example returning integer labels instead of string labels, but does not fix the bug. It is hence expected that the test will detect bug #6427. Depends on the following PRs which fix newly covered bugs: #6430, #6432.
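The idea behind the extended check can be sketched as follows (names are illustrative, not sktime's suite code): string labels seen in `fit` must round-trip through `predict` unchanged, demonstrated here with a plain sklearn classifier.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

def check_label_roundtrip(clf, X_train, y_train, X_test):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # fails if predict returns integer codes instead of the fitted labels
    assert set(y_pred) <= set(y_train), "predict returned labels not seen in fit"
    return y_pred

X = np.zeros((5, 3))
y = np.array(["1", "1", "1", "2", "2"])  # labels of str dtype
y_pred = check_label_roundtrip(DummyClassifier(strategy="most_frequent"), X, y, X)
print(y_pred)  # ['1' '1' '1' '1' '1']
```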
Describe the bug
In the documentation, it says
When used with WEASEL, I still get the error:
ValueError: Data seen by WEASEL instance has multivariate series, but this WEASEL instance cannot handle multivariate series. Calls with multivariate series may result in error or unreliable results.
To Reproduce
I am using this dataset:
from sktime.datasets import load_japanese_vowels
The data is preprocessed and turned into a numpy 3D array.
Code for classification:
Expected behavior
The model trains and makes predictions
Versions
sktime v.0.27.0
Python dependencies:
pip: 24.0
sktime: 0.27.0
sklearn: 1.4.1.post1
skbase: 0.7.5
numpy: 1.26.4
scipy: 1.12.0
pandas: 2.1.4
matplotlib: 3.8.3
joblib: 1.3.2
numba: 0.59.0
statsmodels: None
pmdarima: None
statsforecast: None
tsfresh: None
tslearn: None
torch: None
tensorflow: None
tensorflow_probability: None