import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

from frouros.detectors.data_drift import KSTest

Univariate detector

The following example shows the use of the Kolmogorov-Smirnov test [1] univariate detector on a synthetic dataset composed of 3 informative features and 2 non-informative (useless for the model) features.

np.random.seed(seed=31)

X, y = make_classification(
    n_samples=10000,
    n_features=5,
    n_informative=3,
    n_redundant=0,
    n_repeated=0,
    n_classes=2,
    scale=[10, 0.1, 5, 15, 1],
    # shuffle=False because shuffling would also reorder the features
    # (we want to keep the original feature order)
    shuffle=False,
    random_state=31,
)
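
Since shuffle=False preserves the feature order, the per-feature standard deviations should roughly reflect the scale argument above. A quick sanity check (optional, not part of the original example):

print(X.std(axis=0))  # roughly [10, 0.1, 5, 15, 1], up to sampling noise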

Randomly shuffle the data rows and split the data into train (70%) and test (30%) sets.

idxs = np.arange(X.shape[0])
np.random.shuffle(idxs)
X, y = X[idxs], y[idxs]

idx_split = int(X.shape[0] * 0.7)
X_train, y_train, X_test, y_test = (
    X[:idx_split],
    y[:idx_split],
    X[idx_split:],
    y[idx_split:],
)
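
Equivalently, scikit-learn's train_test_split could perform the shuffle and split in a single step (shown as an alternative, not the code used in this example):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=31
)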

The significance level will be \(\alpha = 0.01\).

alpha = 0.01
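
The decision rule is the classic two-sample Kolmogorov-Smirnov test: drift is flagged when the p-value is at or below \(\alpha\). The same check can be sketched with SciPy's ks_2samp (illustrative only; no claim is made that this is exactly how the frouros detector is implemented internally):

from scipy.stats import ks_2samp

# Two samples drawn from the same distribution: the p-value should
# usually exceed alpha, so no drift is flagged.
rng = np.random.default_rng(31)
statistic, p_value = ks_2samp(rng.normal(size=1000), rng.normal(size=1000))
print(p_value <= alpha)  # False in most runs: no drift detected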

Create and fit a Kolmogorov-Smirnov test detector for each feature using the training dataset.

detectors = []
for i in range(X_train.shape[1]):
    detector = KSTest()
    _ = detector.fit(X=X_train[:, i])
    detectors.append(detector)
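
Since the Kolmogorov-Smirnov test is univariate, one detector is fitted per feature. An optional assertion makes that explicit:

assert len(detectors) == X_train.shape[1]  # one detector per feature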

Fit a decision tree on the training/reference dataset.

model = DecisionTreeClassifier(random_state=31)
model.fit(X=X_train, y=y_train)
DecisionTreeClassifier(random_state=31)
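
As an optional baseline (not part of the original example), the accuracy on the reference data can be checked before evaluating on the test set; an unpruned decision tree is expected to score close to 1.0 on its own training data:

y_pred_train = model.predict(X=X_train)
print(f"Train accuracy: {round(accuracy_score(y_train, y_pred_train), 4)}")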

In addition to obtaining predictions for the test data by calling the model's predict method, each detector compares its reference data with the corresponding test feature to determine whether drift is occurring.

y_pred = model.predict(X=X_test)
for i, detector in enumerate(detectors):
    print(f"Feature {i+1}:")
    p_value = detector.compare(X=X_test[:, i])[0].p_value
    print(f"\tp-value: {round(p_value, 4)}")
    if p_value <= alpha:
        print("\tData drift detected\n")
    else:
        print("\tNo data drift detected\n")
print(f"Accuracy: {round(accuracy_score(y_test, y_pred), 4)}")
Feature 1:
	p-value: 0.1606
	No data drift detected

Feature 2:
	p-value: 0.5984
	No data drift detected

Feature 3:
	p-value: 0.0637
	No data drift detected

Feature 4:
	p-value: 0.2359
	No data drift detected

Feature 5:
	p-value: 0.8064
	No data drift detected

Accuracy: 0.9277
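
The compare-and-report loop above is repeated twice below. A small helper (hypothetical, not part of frouros) could encapsulate it; the following cells keep the explicit loop to mirror the original example:

def report_drift(detectors, X, alpha):
    """Print the KS p-value and drift verdict for each feature."""
    for i, detector in enumerate(detectors):
        p_value = detector.compare(X=X[:, i])[0].p_value
        drift = "Data drift detected" if p_value <= alpha else "No data drift detected"
        print(f"Feature {i+1}:\n\tp-value: {round(p_value, 4)}\n\t{drift}\n")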

Noise on informative features

To simulate how data drift can end up degrading the model's performance, we add noise to two of the three informative features, as shown below:

X_test_noise = X_test.copy()
X_test_noise[:, :2] = X_test_noise[:, :2] + np.random.normal(
    loc=0, scale=X_test_noise[:, :2].std(axis=0), size=X_test_noise[:, :2].shape
)  # Add noise to features 1 and 2 (both informative)
y_pred = model.predict(X=X_test_noise)
for i, detector in enumerate(detectors):
    print(f"Feature {i}:")
    p_value = detector.compare(X=X_test_noise[:, i])[0].p_value
    print(f"\tp-value: {round(p_value, 4)}")
    if p_value <= alpha:
        print("\tData drift detected\n")
    else:
        print("\tNo data drift detected\n")
print(f"Accuracy: {round(accuracy_score(y_test, y_pred), 4)}")
Feature 1:
	p-value: 0.0
	Data drift detected

Feature 2:
	p-value: 0.0
	Data drift detected

Feature 3:
	p-value: 0.0637
	No data drift detected

Feature 4:
	p-value: 0.2359
	No data drift detected

Feature 5:
	p-value: 0.8064
	No data drift detected

Accuracy: 0.6353

Data drift has been detected in two of the three informative features. This has led to a significant drop in accuracy, degrading the model's performance.

Noise on non-informative features

On the other hand, if we add noise to the non-informative features (which should not matter to the model), we expect to see data drift in those features, but the model's performance should not decrease significantly, since the affected features are irrelevant to the model.

X_test_noise = X_test.copy()
X_test_noise[:, 3:] = X_test_noise[:, 3:] + np.random.normal(
    loc=0, scale=X_test_noise[:, 3:].std(axis=0), size=X_test_noise[:, 3:].shape
)  # Add noise to features 4 and 5 (both non-informative)
y_pred = model.predict(X=X_test_noise)
for i, detector in enumerate(detectors):
    print(f"Feature {i}:")
    p_value = detector.compare(X=X_test_noise[:, i])[0].p_value
    print(f"\tp-value: {round(p_value, 4)}")
    if p_value <= alpha:
        print("\tData drift detected\n")
    else:
        print("\tNo data drift detected\n")
print(f"Accuracy: {round(accuracy_score(y_test, y_pred), 4)}")
Feature 1:
	p-value: 0.1606
	No data drift detected

Feature 2:
	p-value: 0.5984
	No data drift detected

Feature 3:
	p-value: 0.0637
	No data drift detected

Feature 4:
	p-value: 0.0
	Data drift detected

Feature 5:
	p-value: 0.0
	Data drift detected

Accuracy: 0.928

We can see that data drift has occurred in the two non-informative features, leaving the model's performance unaffected.
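
One optional way to corroborate this (not in the original example) is to inspect the fitted tree's impurity-based feature importances, which should be concentrated on the first three (informative) features:

print(model.feature_importances_.round(4))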

[1]

Frank J. Massey Jr. The Kolmogorov-Smirnov test for goodness of fit. Journal of the American Statistical Association, 46(253):68–78, 1951.