
Python - Kolmogorov-Smirnov Test

These days I had the chance to meet participants of a mathematical conference on the highest mountain of the Czech Republic. It inspired me to run a few Kolmogorov-Smirnov tests in Python. The KS test is one of the general goodness of fit tests: it measures how well sample data matches a hypothesized distribution by comparing cumulative distribution functions. The test statistic follows the Kolmogorov distribution; Kolmogorov complexity, sketched below, is a separate concept named after the same mathematician. Other goodness of fit and related hypothesis tests include, e.g., the Chi-Square test, Student's t test, Fisher's exact test, ANOVA, Kruskal-Wallis, Pearson, Spearman, …


Kolmogorov complexity — not computable in general: there is no single function that returns the complexity of an arbitrary string, picture or system. A compressible string is one that a compression program can shorten by some number of symbols C; a string that cannot be shortened by even one symbol is said to be incompressible.
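
As a rough illustration (my own sketch; compressed length is only an upper-bound proxy, not true Kolmogorov complexity), an off-the-shelf compressor shows the difference between a regular and an incompressible string:

import os
import zlib

# a highly regular string compresses well; random bytes barely compress at all
regular = b"ab" * 5000        # 10,000 bytes of pure repetition
random_ = os.urandom(10000)   # 10,000 bytes from the OS entropy source
for name, s in [("regular", regular), ("random", random_)]:
    print(f"{name}: {len(s)} bytes -> {len(zlib.compress(s, 9))} compressed")
# the random string is (almost) incompressible: compressed size ~ original size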


Measuring the complexity of networks — one approach is to measure the entropy of network invariants, such as adjacency matrices or degree sequences, though entropy and all entropy-based measures have several known weaknesses.
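
For example, the Shannon entropy of the degree sequence can be computed directly from the adjacency matrix; a minimal sketch using a made-up four-node graph:

import numpy as np

# adjacency matrix of a small undirected example graph (assumed for illustration)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 0]])
degrees = A.sum(axis=0)                    # degree sequence: [2, 3, 2, 1]
_, counts = np.unique(degrees, return_counts=True)
p = counts / counts.sum()                  # empirical degree distribution
H = -np.sum(p * np.log2(p))                # Shannon entropy in bits
print(f"degree entropy: {H:.3f} bits")     # 1.500 bits for this graph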


Complexity versus entropy — Shannon entropy represents the average complexity over all strings emitted by a random source, whereas Kolmogorov complexity is the complexity of one particular string.
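
To illustrate the distinction (again using compressed length as a crude stand-in for complexity): strings typical of a biased coin compress to roughly the entropy rate times their length, while one particular atypical string, such as all zeros, is far simpler than the average:

import zlib
import numpy as np

rng = np.random.default_rng(0)
p = 0.9  # biased coin: P(0) = 0.9
H = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
print(f"entropy rate: {H:.3f} bits/symbol")

typical = bytes(rng.choice([0, 1], size=10000, p=[p, 1 - p]).tolist())
simple = bytes(10000)  # all zeros: very low complexity, atypical for the source
for name, s in [("typical", typical), ("all-zeros", simple)]:
    print(f"{name}: compressed to {len(zlib.compress(s, 9))} bytes")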


KS Test — a goodness of fit test. For the most accurate CDF/SF/PDF/PPF/ISF computations, SciPy recommends using the stats.kstwobign distribution rather than the lower-level special functions.
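
A quick sketch of the equivalence: scipy.special.kolmogorov is the survival function of the limiting KS distribution, so it agrees with kstwobign.sf:

import numpy as np
from scipy.special import kolmogorov
from scipy.stats import kstwobign

y = np.array([0.5, 1.0, 1.5])
print(kolmogorov(y))      # survival function via the special function
print(kstwobign.sf(y))    # same values via the recommended distribution object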


CDF — Cumulative Distribution Function (Empirical CDF — built from the observed sample; Target CDF — the hypothesized distribution the sample is tested against).
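
A minimal ECDF sketch (a helper of my own, not from any library): the empirical CDF at a point is the fraction of sample values at or below it.

import numpy as np

def ecdf(sample, points):
    """Right-continuous empirical CDF: fraction of sample values <= each point."""
    sample = np.sort(sample)
    return np.searchsorted(sample, points, side='right') / len(sample)

data = np.array([1.0, 2.0, 2.0, 3.0])
print(ecdf(data, [0.5, 2.0, 3.5]))   # -> [0.   0.75 1.  ]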


scipy.special.kolmogorov — complementary cumulative distribution function (CCDF, i.e. the survival function) of the Kolmogorov distribution.


scipy.special.kolmogi — inverse survival function of the Kolmogorov distribution. Returns y such that kolmogorov(y) == p.



1. One-sample Kolmogorov-Smirnov test


First, import the required libraries:


from numpy.random import seed
from numpy.random import poisson
from scipy.stats import kstest


Since the p-value is less than .05, we reject the null hypothesis: we have sufficient evidence to say that the sample data does not come from a normal distribution. This is expected, because the sample was generated with the poisson() function, so the values follow a Poisson distribution.


# set seed to make this example reproducible
seed(0)

# generate a dataset of 100 values that follow a Poisson distribution with mean=5
data = poisson(5, 100)

# perform Kolmogorov-Smirnov test against the normal distribution
kstest(data, 'norm')


KstestResult(statistic=0.9072498680518208, pvalue=1.0908062873170218e-103)


2. Two-sample Kolmogorov-Smirnov test


First, import the required libraries:


from numpy.random import seed
from numpy.random import randn
from numpy.random import lognormal
from scipy.stats import ks_2samp


Since the p-value is less than .05, we reject the null hypothesis: we have sufficient evidence to say that the two samples do not come from the same distribution. The first sample is drawn from the standard normal distribution, the second from a lognormal distribution.


# set seed to make this example reproducible
seed(0)

# generate two datasets
data1 = randn(100)
data2 = lognormal(3, 1, 100)

# perform two-sample Kolmogorov-Smirnov test
ks_2samp(data1, data2)


Ks_2sampResult(statistic=0.99, pvalue=4.417521386399011e-57)


3. scipy.special.kolmogorov


First, import the required libraries:

from scipy.special import kolmogorov
from scipy.stats import kstwobign
import numpy as np


kolmogorov([0, 0.5, 1.0])


array([1. , 0.96394524, 0.26999967])


Compare a sample of size 1000 drawn from a Laplace(0, 1) distribution against the target distribution, a Normal(0, 1) distribution.


from scipy.stats import norm, laplace

rng = np.random.default_rng()
n = 1000
lap01 = laplace(0, 1)
x = np.sort(lap01.rvs(n, random_state=rng))
np.mean(x), np.std(x)

(-0.00591602532125853, 1.365355645380573)



Generate the empirical CDF and the KS statistic Dn.


target = norm(0, 1)  # target distribution: Normal with mean 0, stddev 1
cdfs = target.cdf(x)  # target CDF evaluated at the sorted sample points
ecdfs = np.arange(n + 1, dtype=float) / n  # ECDF levels 0/n, 1/n, ..., n/n

# gaps below and above the target CDF at each sample point
gaps = np.column_stack([cdfs - ecdfs[:n], ecdfs[1:] - cdfs])
Dn = np.max(gaps)  # KS statistic: the largest gap
Kn = np.sqrt(n) * Dn
print('Dn=%f, sqrt(n)*Dn=%f' % (Dn, Kn))


Dn=0.054133, sqrt(n)*Dn=1.711848
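
As a cross-check (my addition, reusing x from the snippet above), scipy.stats.kstest computes the same statistic and a p-value directly:

from scipy.stats import kstest

res = kstest(x, 'norm')           # one-sample KS test against N(0, 1)
print(res.statistic, res.pvalue)  # statistic agrees with the hand-computed Dn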



Print results:



print(chr(10).join(['For a sample of size n drawn from a N(0, 1) distribution:',
    ' the approximate Kolmogorov probability that sqrt(n)*Dn>=%f is %f' % (Kn, kolmogorov(Kn)),
    ' the approximate Kolmogorov probability that sqrt(n)*Dn<=%f is %f' % (Kn, kstwobign.cdf(Kn))]))

For a sample of size n drawn from a N(0, 1) distribution:

the approximate Kolmogorov probability that sqrt(n)*Dn>=1.711848 is 0.005698

the approximate Kolmogorov probability that sqrt(n)*Dn<=1.711848 is 0.994302



Plot the Empirical CDF against the target N(0, 1) CDF.


import matplotlib.pyplot as plt

# empirical CDF as a step function (prepend -3 so the plot starts at height 0)
plt.step(np.concatenate([[-3], x]), ecdfs, where='post', label='Empirical CDF', color='grey')

# mark the largest gaps below and above the target CDF
iminus, iplus = np.argmax(gaps, axis=0)
plt.vlines([x[iminus]], ecdfs[iminus], cdfs[iminus], color='black', linestyle='dashed', lw=4)
plt.vlines([x[iplus]], cdfs[iplus], ecdfs[iplus+1], color='black', linestyle='dashed', lw=4)

# target CDF
x3 = np.linspace(-3, 3, 100)
plt.plot(x3, target.cdf(x3), label='CDF for N(0, 1)', color='lightgrey')

plt.ylim([0, 1]); plt.grid(True); plt.legend()
plt.show()






4. scipy.special.kolmogi


We can evaluate the inverse survival function of the Kolmogorov distribution used above.


from scipy.special import kolmogi


kolmogi([0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0])


array([       inf, 1.22384787, 1.01918472, 0.82757356, 0.67644769,
       0.57117327, 0.        ])
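
One practical use (a sketch, valid under the large-n asymptotic distribution): invert the survival function to obtain the critical value of Dn at a chosen significance level.

import numpy as np
from scipy.special import kolmogi

n = 1000
alpha = 0.05
# reject H0 at level alpha when sqrt(n) * Dn exceeds kolmogi(alpha)
Dn_crit = kolmogi(alpha) / np.sqrt(n)
print(f"reject H0 at {alpha:.0%} when Dn > {Dn_crit:.5f}")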




References:


https://github.com/MLWave/koolmogorov

https://towardsdatascience.com/face-recognition-through-kolmogorov-complexity-16ac5542235b

https://en.wikipedia.org/wiki/Kolmogorov_complexity

https://en.wikipedia.org/wiki/Andrey_Kolmogorov

https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.kolmogi.html

https://www.i-programmer.info/programming/theory/13793-programmers-guide-to-theory-kolmogorov-complexity.html?start=1

https://en.wikipedia.org/wiki/Kolmogorov–Smirnov_test

https://www.quora.com/What-is-the-relationship-between-Kolmogorov-complexity-and-Shannon-entropy

https://www.youtube.com/watch?v=KyB13PD-UME

http://www.neilconway.org/talks/kolmogorov.pdf

http://people.cs.uchicago.edu/~fortnow/papers/quaderni.pdf

https://www.youtube.com/watch?v=QkwPf3fcxBs

https://www.statology.org/kolmogorov-smirnov-test-python/

https://towardsdatascience.com/comparing-sample-distributions-with-the-kolmogorov-smirnov-ks-test-a2292ad6fee5

https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.kolmogorov.html

