import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from frouros.callbacks.batch import PermutationTestOnBatchData
from frouros.detectors.data_drift import MMD

Multivariate detector

The following example shows how to use the MMD [1] multivariate detector on the breast cancer dataset provided by scikit-learn.
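For intuition about what an MMD-based detector measures, the following is a minimal NumPy sketch of the biased estimate of the squared maximum mean discrepancy from [1] with a Gaussian (RBF) kernel. It is not the library's implementation: the function names `rbf_kernel` and `mmd2_biased`, the bandwidth `sigma`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def rbf_kernel(a, b, sigma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of a and b."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clamp tiny negative values caused by floating-point error.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))


def mmd2_biased(x, y, sigma=1.0):
    """Biased (V-statistic) estimate of the squared MMD between samples x and y."""
    k_xx = rbf_kernel(x, x, sigma)
    k_yy = rbf_kernel(y, y, sigma)
    k_xy = rbf_kernel(x, y, sigma)
    return k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean()


# Synthetic data (hypothetical): two samples from the same distribution
# and one sample whose mean is shifted by 1.0 in every feature.
rng = np.random.default_rng(31)
same_a = rng.normal(size=(200, 5))
same_b = rng.normal(size=(200, 5))
shifted = rng.normal(loc=1.0, size=(200, 5))

mmd_same = mmd2_biased(same_a, same_b, sigma=2.0)
mmd_shifted = mmd2_biased(same_a, shifted, sigma=2.0)
print(f"same distribution:    {mmd_same:.4f}")
print(f"shifted distribution: {mmd_shifted:.4f}")
```

The statistic stays close to zero when both samples come from the same distribution and grows when they differ, which is what makes it usable as a drift score.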

np.random.seed(seed=31)

X, y = load_breast_cancer(return_X_y=True)

Since this is a small dataset and the only objective is to show how to integrate a multivariate detector, the data is simply split in half.

idx_split = int(X.shape[0] * 0.5)
X_train, y_train = X[:idx_split], y[:idx_split]
X_test, y_test = X[idx_split:], y[idx_split:]

The significance level will be \(\alpha = 0.01\).

alpha = 0.01

Create and fit an MMD detector using the training dataset.

detector = MMD(
    callbacks=[
        PermutationTestOnBatchData(
            num_permutations=1000,
            random_state=31,
            num_jobs=-1,
            name="permutation_test",
            verbose=False,
        ),
    ],
)
_ = detector.fit(X=X_train)
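The permutation-test callback is what turns the distance computed by the detector into a p-value. The general idea can be sketched as follows, under stated assumptions: `permutation_p_value` and `mean_distance` are hypothetical helpers, the toy statistic is a distance between sample means rather than MMD, and the `+1` correction is one common convention that may differ from what the library does.

```python
import numpy as np


def mean_distance(x, y):
    """Toy test statistic: Euclidean distance between the sample means."""
    return float(np.linalg.norm(x.mean(axis=0) - y.mean(axis=0)))


def permutation_p_value(x, y, statistic, num_permutations=1000, seed=31):
    """Estimate a p-value by repeatedly shuffling the pooled samples."""
    rng = np.random.default_rng(seed)
    observed = statistic(x, y)
    pooled = np.concatenate([x, y], axis=0)
    n = x.shape[0]
    count = 0
    for _ in range(num_permutations):
        perm = rng.permutation(pooled.shape[0])
        # Re-split the shuffled pool into two pseudo-samples of the original sizes.
        stat = statistic(pooled[perm[:n]], pooled[perm[n:]])
        count += stat >= observed
    # The +1 terms count the observed split itself, so the estimate is never 0.
    return (count + 1) / (num_permutations + 1)


# Synthetic data (hypothetical): a reference sample and a mean-shifted sample.
rng = np.random.default_rng(0)
reference = rng.normal(size=(100, 3))
shifted = rng.normal(loc=1.0, size=(100, 3))

p_shifted = permutation_p_value(reference, shifted, mean_distance)
print(f"p-value (shifted): {p_shifted:.4f}")
```

If the two samples come from the same distribution, shuffling the pooled rows should produce statistics comparable to the observed one, so a small fraction of permutations reaching the observed value is evidence of drift.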

Fit a logistic regression model on the training/reference dataset.

pipeline = Pipeline(
    steps=[
        ("scale", StandardScaler()),
        ("model", LogisticRegression(random_state=31)),
    ],
)
pipeline.fit(X=X_train, y=y_train)

Pipeline(steps=[('scale', StandardScaler()),
                ('model', LogisticRegression(random_state=31))])

In addition to obtaining predictions for the test data by calling the predict method, we use the detector to compare the reference data with the test data and determine whether drift is occurring.

y_pred = pipeline.predict(X=X_test)
p_value = detector.compare(X=X_test)[1]["permutation_test"]["p_value"]
print(f"p-value: {round(p_value, 4)}")
if p_value < alpha:
    print("Data drift detected")
else:
    print("No data drift detected")
print(f"Accuracy: {round(accuracy_score(y_test, y_pred), 4)}")

p-value: 0.152
No data drift detected
Accuracy: 0.9719
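Because the same decision rule is applied every time a p-value is compared with the significance level, it can be convenient to wrap it in a small helper. The function below is a hypothetical convenience, not part of the library; it only restates the comparison of the p-value against \(\alpha\).

```python
def report_drift(p_value, alpha=0.01):
    """Return a message for the drift decision at significance level alpha."""
    if p_value < alpha:
        return "Data drift detected"
    return "No data drift detected"


# Using the p-value reported above.
message = report_drift(0.152)
print(message)
```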

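As the above results show, no data drift was detected. To see how the detector reacts to a change, we can simulate data drift by adding zero-mean Gaussian noise to the test data, scaled per feature by that feature's standard deviation, as shown below: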

X_test_noise = X_test + np.random.normal(loc=0, scale=X_test.std(axis=0), size=X_test.shape)
y_pred = pipeline.predict(X=X_test_noise)
p_value = detector.compare(X=X_test_noise)[1]["permutation_test"]["p_value"]
print(f"p-value: {round(p_value, 4)}")
if p_value < alpha:
    print("Data drift detected")
else:
    print("No data drift detected")
print(f"Accuracy: {round(accuracy_score(y_test, y_pred), 4)}")

p-value: 0.0
Data drift detected
Accuracy: 0.9263

Data drift has been detected, and the model’s performance has suffered: accuracy drops from 0.9719 to 0.9263.
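To get a sense of how strong this perturbation is: adding independent zero-mean noise whose standard deviation matches that of each feature roughly doubles the feature's variance, so its standard deviation grows by a factor of about \(\sqrt{2}\). The sketch below checks this on synthetic data; the array `x` is a hypothetical stand-in for the test set, not the breast cancer data.

```python
import numpy as np

rng = np.random.default_rng(31)
# Synthetic stand-in for the test set: 5,000 rows, 4 features with different scales.
x = rng.normal(loc=10.0, scale=[1.0, 5.0, 0.1, 20.0], size=(5000, 4))

# Same construction as in the example: zero-mean noise scaled per feature.
x_noise = x + rng.normal(loc=0.0, scale=x.std(axis=0), size=x.shape)

# Ratio of standard deviations after and before adding the noise.
ratio = x_noise.std(axis=0) / x.std(axis=0)
print(np.round(ratio, 2))  # each entry should be close to sqrt(2), about 1.41
```

A change of this size in every feature is large relative to the original spread of the data, which helps explain why a two-sample test flags it.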

[1] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(25):723–773, 2012. URL: http://jmlr.org/papers/v13/gretton12a.html.