import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from frouros.detectors.data_drift import KSTest
Univariate detector
The following example shows the use of the Kolmogorov-Smirnov test [1] univariate detector on a synthetic dataset composed of 3 informative features and 2 non-informative/useless features for the model.
np.random.seed(seed=31)
X, y = make_classification(
    n_samples=10000,
    n_features=5,
    n_informative=3,
    n_redundant=0,
    n_repeated=0,
    n_classes=2,
    scale=[10, 0.1, 5, 15, 1],
    # shuffle=False because shuffling would also reorder the feature columns
    # (we want the informative features to remain in the first positions)
    shuffle=False,
    random_state=31,
)
Randomly shuffle the data rows and split the data into train (70%) and test (30%) sets.
idxs = np.arange(X.shape[0])
np.random.shuffle(idxs)
X, y = X[idxs], y[idxs]
idx_split = int(X.shape[0] * 0.7)
X_train, y_train, X_test, y_test = (
    X[:idx_split],
    y[:idx_split],
    X[idx_split:],
    y[idx_split:],
)
The significance level will be \(\alpha = 0.01\).
alpha = 0.01
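For reference, the two-sample Kolmogorov-Smirnov statistic is the largest absolute difference between the empirical distribution functions of the reference sample and the test sample; drift is flagged when the associated p-value falls at or below \(\alpha\):

\[D_{n,m} = \sup_x \left| F_{\mathrm{ref},n}(x) - F_{\mathrm{test},m}(x) \right|\]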
Create and fit a Kolmogorov-Smirnov test detector for each feature using the training dataset.
detectors = []
for i in range(X_train.shape[1]):
    detector = KSTest()
    _ = detector.fit(X=X_train[:, i])
    detectors.append(detector)
Fit a decision tree on the training/reference dataset.
model = DecisionTreeClassifier(random_state=31)
model.fit(X=X_train, y=y_train)
DecisionTreeClassifier(random_state=31)
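As a quick sanity check (not part of the original example), the fitted tree's feature_importances_ should concentrate on the first three columns, confirming that the last two features contribute little to the model:

# The informative features (columns 0-2) should dominate the importances
print(model.feature_importances_.round(3))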
In addition to obtaining the predictions for the test data by calling the predict method, each detector compares the reference data with the test data to determine whether drift is occurring in its corresponding feature.
y_pred = model.predict(X=X_test)
for i, detector in enumerate(detectors):
print(f"Feature {i+1}:")
p_value = detector.compare(X=X_test[:, i])[0].p_value
print(f"\tp-value: {round(p_value, 4)}")
if p_value <= alpha:
print("\tData drift detected\n")
else:
print("\tNo data drift detected\n")
print(f"Accuracy: {round(accuracy_score(y_test, y_pred), 4)}")
Feature 1:
    p-value: 0.1606
    No data drift detected

Feature 2:
    p-value: 0.5984
    No data drift detected

Feature 3:
    p-value: 0.0637
    No data drift detected

Feature 4:
    p-value: 0.2359
    No data drift detected

Feature 5:
    p-value: 0.8064
    No data drift detected
Accuracy: 0.9277
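Note that five tests are run at once, which inflates the chance of a false alarm across features; a common mitigation, not applied in this example, is a Bonferroni correction that divides \(\alpha\) by the number of tests. A minimal sketch:

# Bonferroni correction: test each feature at alpha / n_features
alpha_corrected = alpha / X_train.shape[1]  # 0.01 / 5 = 0.002
for i, detector in enumerate(detectors):
    p_value = detector.compare(X=X_test[:, i])[0].p_value
    print(f"Feature {i+1}: drift={p_value <= alpha_corrected}")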
Noise on informative features
To simulate how data drift can end up degrading the model's performance, we apply some noise to two of the three informative features, as shown below:
X_test_noise = X_test.copy()
X_test_noise[:, :2] = X_test_noise[:, :2] + np.random.normal(
    loc=0, scale=X_test_noise[:, :2].std(axis=0), size=X_test_noise[:, :2].shape
)  # Add noise to features 1 and 2 (both informative)
y_pred = model.predict(X=X_test_noise)
for i, detector in enumerate(detectors):
print(f"Feature {i}:")
p_value = detector.compare(X=X_test_noise[:, i])[0].p_value
print(f"\tp-value: {round(p_value, 4)}")
if p_value <= alpha:
print("\tData drift detected\n")
else:
print("\tNo data drift detected\n")
print(f"Accuracy: {round(accuracy_score(y_test, y_pred), 4)}")
Feature 1:
    p-value: 0.0
    Data drift detected

Feature 2:
    p-value: 0.0
    Data drift detected

Feature 3:
    p-value: 0.0637
    No data drift detected

Feature 4:
    p-value: 0.2359
    No data drift detected

Feature 5:
    p-value: 0.8064
    No data drift detected
Accuracy: 0.6353
Data drift has been detected in two of the three informative features. This has led to a significant drop in accuracy, resulting in a degradation of the model's performance.
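This is expected: adding independent zero-mean Gaussian noise whose standard deviation equals the feature's own standard deviation \(\sigma\) doubles the variance of the observed values, a change in distribution that the Kolmogorov-Smirnov test readily picks up:

\[\mathrm{Var}(X + \varepsilon) = \sigma^2 + \sigma^2 = 2\sigma^2\]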
Noise on non-informative features
On the other hand, if we apply some noise to the non-informative features (which should not be important for the model), we expect to see data drift in these features, but the model's performance should not decrease significantly, since the features affected by the noise are irrelevant to it.
X_test_noise = X_test.copy()
X_test_noise[:, 3:] = X_test_noise[:, 3:] + np.random.normal(
    loc=0, scale=X_test_noise[:, 3:].std(axis=0), size=X_test_noise[:, 3:].shape
)  # Add noise to features 4 and 5 (both non-informative)
y_pred = model.predict(X=X_test_noise)
for i, detector in enumerate(detectors):
print(f"Feature {i}:")
p_value = detector.compare(X=X_test_noise[:, i])[0].p_value
print(f"\tp-value: {round(p_value, 4)}")
if p_value <= alpha:
print("\tData drift detected\n")
else:
print("\tNo data drift detected\n")
print(f"Accuracy: {round(accuracy_score(y_test, y_pred), 4)}")
Feature 1:
    p-value: 0.1606
    No data drift detected

Feature 2:
    p-value: 0.5984
    No data drift detected

Feature 3:
    p-value: 0.0637
    No data drift detected

Feature 4:
    p-value: 0.0
    Data drift detected

Feature 5:
    p-value: 0.0
    Data drift detected
Accuracy: 0.928
We can see that data drift has occurred in the two non-informative features, leaving the model's performance unaffected.
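In practice, the per-feature checks above can be wrapped in a small helper for reuse; the following is a minimal sketch (the function name and signature are our own, not part of frouros):

# Hypothetical convenience wrapper combining drift detection and accuracy
def drift_report(model, detectors, X_new, y_new, alpha):
    """Return the indices (1-based) of drifted features and the model accuracy."""
    drifted = [
        i + 1
        for i, detector in enumerate(detectors)
        if detector.compare(X=X_new[:, i])[0].p_value <= alpha
    ]
    accuracy = accuracy_score(y_new, model.predict(X=X_new))
    return drifted, accuracy

drifted_features, accuracy = drift_report(model, detectors, X_test_noise, y_test, alpha)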
[1] Frank J. Massey Jr. The Kolmogorov-Smirnov test for goodness of fit. Journal of the American Statistical Association, 46(253):68–78, 1951.