Class KolmogorovSmirnovTest

java.lang.Object
org.apache.commons.statistics.inference.KolmogorovSmirnovTest

public final class KolmogorovSmirnovTest extends Object
Implements the Kolmogorov-Smirnov (K-S) test for equality of continuous distributions.

The one-sample test uses a D statistic based on the maximum deviation of the empirical distribution of sample data points from the distribution expected under the null hypothesis.

The two-sample test uses a D statistic based on the maximum deviation of the two empirical distributions of sample data points. The two-sample tests evaluate the null hypothesis that the two samples x and y come from the same underlying distribution.

References:

  1. Marsaglia, G., Tsang, W. W., & Wang, J. (2003). Evaluating Kolmogorov's Distribution. Journal of Statistical Software, 8(18), 1–4.
  2. Simard, R., & L’Ecuyer, P. (2011). Computing the Two-Sided Kolmogorov-Smirnov Distribution. Journal of Statistical Software, 39(11), 1–18.
  3. Sekhon, J. S. (2011). Multivariate and Propensity Score Matching Software with Automated Balance Optimization: The Matching package for R. Journal of Statistical Software, 42(7), 1–52.
  4. Viehmann, T. (2021). Numerically more stable computation of the p-values for the two-sample Kolmogorov-Smirnov test. arXiv:2102.08037
  5. Hodges, J. L. (1958). The significance probability of the Smirnov two-sample test. Arkiv för Matematik, 3(5), 469–486.

Note that [1] contains an error in computing h, refer to MATH-437 for details.

Since:
1.1
Method Details

    • withDefaults

      Return an instance using the default options.

      Returns:
      default instance
    • with

      Return an instance with the configured alternative hypothesis.
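      As a minimal configuration sketch: the AlternativeHypothesis enum is named in the test descriptions below, but the exact constant names (TWO_SIDED, GREATER_THAN, LESS_THAN) are assumptions following Commons naming conventions.

```java
import org.apache.commons.statistics.inference.AlternativeHypothesis;
import org.apache.commons.statistics.inference.KolmogorovSmirnovTest;

public class KsAlternativeConfig {
    public static void main(String[] args) {
        // Instances are immutable; each with(...) call returns a
        // new instance with the updated option.
        KolmogorovSmirnovTest oneSided = KolmogorovSmirnovTest.withDefaults()
            .with(AlternativeHypothesis.GREATER_THAN);
    }
}
```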
      Parameters:
      v - Value.
      Returns:
      an instance
    • with

      Return an instance with the configured p-value method.

      For the one-sample two-sided test, Kolmogorov's asymptotic approximation can be used; otherwise the p-value uses the distribution of the D statistic.

      For the two-sample test, the exact p-value can be computed for small sample sizes; otherwise the p-value falls back to the asymptotic approximation. Alternatively a p-value can be estimated from the combined distribution of the samples. This requires a source of randomness.

      Parameters:
      v - Value.
      Returns:
      an instance
    • with

      Return an instance with the configured inequality.

      Computes the p-value for the two-sample test as \(P(D_{n,m} > d)\) if strict; otherwise \(P(D_{n,m} \ge d)\), where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic, either the two-sided \(D_{n,m}\) or one-sided \(D_{n,m}^+\) or \(D_{n,m}^-\).
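      For example, to request the strict inequality \(P(D_{n,m} > d)\). The Inequality parameter is described in the two-sample test below; the enum name and its STRICT constant are assumptions.

```java
import org.apache.commons.statistics.inference.Inequality;
import org.apache.commons.statistics.inference.KolmogorovSmirnovTest;

public class KsInequalityConfig {
    public static void main(String[] args) {
        // Strict inequality: p = P(D > d) rather than P(D >= d).
        // The difference is the probability mass of the observed value d.
        KolmogorovSmirnovTest strict = KolmogorovSmirnovTest.withDefaults()
            .with(Inequality.STRICT);
    }
}
```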

      Parameters:
      v - Value.
      Returns:
      an instance
    • with

      public KolmogorovSmirnovTest with(org.apache.commons.rng.UniformRandomProvider v)
      Return an instance with the configured source of randomness.

      Applies to the two-sample test when the p-value method is set to ESTIMATE. The randomness is used for sampling of the combined distribution.

      There is no default source of randomness. If the randomness is not set then estimation is not possible and an IllegalStateException will be raised in the two-sample test.

      Parameters:
      v - Value.
      Returns:
      an instance
    • withIterations

      Return an instance with the configured number of iterations.

      Applies to the two-sample test when the p-value method is set to ESTIMATE. This is the number of sampling iterations used to estimate the p-value. The p-value is a fraction using the iterations as the denominator. The number of significant digits in the p-value is upper bounded by log10(iterations); small p-values have fewer significant digits. A large number of iterations is recommended when using a small critical value to reject the null hypothesis.
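      Putting the estimation options together, a sketch of configuring the ESTIMATE method. The PValueMethod enum name is an assumption; RandomSource is from Commons RNG.

```java
import org.apache.commons.rng.simple.RandomSource;
import org.apache.commons.statistics.inference.KolmogorovSmirnovTest;
import org.apache.commons.statistics.inference.PValueMethod;

public class KsEstimateConfig {
    public static void main(String[] args) {
        // ESTIMATE requires a source of randomness; 100000 iterations bounds
        // the p-value precision to roughly log10(100000) = 5 significant digits.
        KolmogorovSmirnovTest test = KolmogorovSmirnovTest.withDefaults()
            .with(PValueMethod.ESTIMATE)
            .with(RandomSource.XO_RO_SHI_RO_128_PP.create())
            .withIterations(100_000);
    }
}
```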

      Parameters:
      v - Value.
      Returns:
      an instance
      Throws:
      IllegalArgumentException - if the number of iterations is not strictly positive
    • statistic

      public double statistic(double[] x, DoubleUnaryOperator cdf)
      Computes the one-sample Kolmogorov-Smirnov test statistic.
      • two-sided: \(D_n=\sup_x |F_n(x)-F(x)|\)
      • greater: \(D_n^+=\sup_x (F_n(x)-F(x))\)
      • less: \(D_n^-=\sup_x (F(x)-F_n(x))\)

      where \(F\) is the cumulative distribution function (CDF) of the reference distribution, \(n\) is the length of x and \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x.

      The cumulative distribution function should map a real value x to a probability in [0, 1]. To use a reference distribution the CDF can be passed using a method reference:

       UniformContinuousDistribution dist = UniformContinuousDistribution.of(0, 1);
       UniformRandomProvider rng = RandomSource.KISS.create(123);
       double[] x = dist.sampler(rng).samples(100);
       double d = KolmogorovSmirnovTest.withDefaults().statistic(x, dist::cumulativeProbability);
       
      Parameters:
      x - Sample being evaluated.
      cdf - Reference cumulative distribution function.
      Returns:
      Kolmogorov-Smirnov statistic
      Throws:
      IllegalArgumentException - if data does not have length at least 2; or contains NaN values.
    • statistic

      public double statistic(double[] x, double[] y)
      Computes the two-sample Kolmogorov-Smirnov test statistic.
      • two-sided: \(D_{n,m}=\sup_x |F_n(x)-F_m(x)|\)
      • greater: \(D_{n,m}^+=\sup_x (F_n(x)-F_m(x))\)
      • less: \(D_{n,m}^-=\sup_x (F_m(x)-F_n(x))\)

      where \(n\) is the length of x, \(m\) is the length of y, \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x and \(F_m\) is the empirical distribution that puts mass \(1/m\) at each of the values in y.
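      A sketch of computing the two-sample statistic for samples drawn from different distributions, reusing the sampler pattern from the one-sample example above; NormalDistribution.of is assumed to follow the same factory convention as UniformContinuousDistribution.of.

```java
import org.apache.commons.rng.UniformRandomProvider;
import org.apache.commons.rng.simple.RandomSource;
import org.apache.commons.statistics.distribution.NormalDistribution;
import org.apache.commons.statistics.distribution.UniformContinuousDistribution;
import org.apache.commons.statistics.inference.KolmogorovSmirnovTest;

public class KsTwoSampleStatistic {
    public static void main(String[] args) {
        UniformRandomProvider rng = RandomSource.KISS.create(123);
        double[] x = UniformContinuousDistribution.of(0, 1).sampler(rng).samples(100);
        double[] y = NormalDistribution.of(0.5, 0.25).sampler(rng).samples(80);
        // D lies in [0, 1]; larger values indicate greater divergence
        // between the two empirical distributions.
        double d = KolmogorovSmirnovTest.withDefaults().statistic(x, y);
        System.out.println(d);
    }
}
```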

      Parameters:
      x - First sample.
      y - Second sample.
      Returns:
      Kolmogorov-Smirnov statistic
      Throws:
      IllegalArgumentException - if either x or y does not have length at least 2; or contain NaN values.
    • test

      Performs a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x conforms to the given cumulative distribution function (CDF).

      The test is defined by the AlternativeHypothesis:

      • Two-sided evaluates the null hypothesis that the two distributions are identical, \(F_n(i) = F(i)\) for all \( i \); the alternative is that they are not identical. The statistic is \( \max(D_n^+, D_n^-) \) and the sign of \( D \) is provided.
      • Greater evaluates the null hypothesis that \(F_n(i) \le F(i)\) for all \( i \); the alternative is \(F_n(i) > F(i)\) for at least one \( i \). The statistic is \( D_n^+ \).
      • Less evaluates the null hypothesis that \(F_n(i) \ge F(i)\) for all \( i \); the alternative is \(F_n(i) < F(i)\) for at least one \( i \). The statistic is \( D_n^- \).

      The p-value method defaults to exact. The one-sided p-value uses Smirnov's stable formula:

      \[ P(D_n^+ \ge x) = x \sum_{j=0}^{\lfloor n(1-x) \rfloor} \binom{n}{j} \left(\frac{j}{n} + x\right)^{j-1} \left(1-x-\frac{j}{n} \right)^{n-j} \]

      The two-sided p-value is computed using methods described in Simard & L’Ecuyer (2011). The two-sided test supports an asymptotic p-value using Kolmogorov's formula:

      \[ \lim_{n\to\infty} P(\sqrt{n}D_n > z) = 2 \sum_{i=1}^\infty (-1)^{i-1} e^{-2 i^2 z^2} \]
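      Extending the statistic example above to the full test, a hedged sketch of reading the result; the OneResult nested class and the getStatistic()/getPValue() accessor names are assumptions based on the TwoResult type named in the two-sample test below.

```java
import org.apache.commons.rng.UniformRandomProvider;
import org.apache.commons.rng.simple.RandomSource;
import org.apache.commons.statistics.distribution.UniformContinuousDistribution;
import org.apache.commons.statistics.inference.KolmogorovSmirnovTest;

public class KsOneSampleTest {
    public static void main(String[] args) {
        UniformContinuousDistribution dist = UniformContinuousDistribution.of(0, 1);
        UniformRandomProvider rng = RandomSource.KISS.create(123);
        double[] x = dist.sampler(rng).samples(100);
        // Null hypothesis: x is drawn from the uniform reference distribution
        KolmogorovSmirnovTest.OneResult r =
            KolmogorovSmirnovTest.withDefaults().test(x, dist::cumulativeProbability);
        double d = r.getStatistic();   // assumed accessor name
        double p = r.getPValue();      // assumed accessor name
        System.out.println(d + " " + p);
    }
}
```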

      Parameters:
      x - Sample being evaluated.
      cdf - Reference cumulative distribution function.
      Returns:
      test result
      Throws:
      IllegalArgumentException - if data does not have length at least 2; or contains NaN values.
    • test

      public KolmogorovSmirnovTest.TwoResult test(double[] x, double[] y)
      Performs a two-sample Kolmogorov-Smirnov test on samples x and y. Tests the empirical distributions \(F_n\) and \(F_m\), where \(n\) is the length of x, \(m\) is the length of y, \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x and \(F_m\) is the empirical distribution that puts mass \(1/m\) at each of the values in y.

      The test is defined by the AlternativeHypothesis:

      • Two-sided evaluates the null hypothesis that the two distributions are identical, \(F_n(i) = F_m(i)\) for all \( i \); the alternative is that they are not identical. The statistic is \( \max(D_n^+, D_n^-) \) and the sign of \( D \) is provided.
      • Greater evaluates the null hypothesis that \(F_n(i) \le F_m(i)\) for all \( i \); the alternative is \(F_n(i) > F_m(i)\) for at least one \( i \). The statistic is \( D_n^+ \).
      • Less evaluates the null hypothesis that \(F_n(i) \ge F_m(i)\) for all \( i \); the alternative is \(F_n(i) < F_m(i)\) for at least one \( i \). The statistic is \( D_n^- \).

      If the p-value method is AUTO, an exact p-value computation is attempted if both sample sizes are less than 10000, using the methods presented in Viehmann (2021) and Hodges (1958); otherwise an asymptotic p-value is returned. The two-sided p-value is \(\overline{F}(d, \sqrt{mn / (m + n)})\) where \(\overline{F}\) is the complementary cumulative distribution function of the two-sided one-sample Kolmogorov-Smirnov distribution. The one-sided p-value uses an approximation from Hodges (1958) Eq 5.3.

      \(D_{n,m}\) has a discrete distribution. This makes the p-value associated with the null hypothesis \(H_0 : D_{n,m} > d \) differ from \(H_0 : D_{n,m} \ge d \) by the mass of the observed value \(d\). These can be distinguished using an Inequality parameter. This is ignored for large samples.

      If the data contains ties there is no defined ordering in the tied region to use for the difference between the two empirical distributions. Each ordering of the tied region may create a different D statistic, so the set of possible orderings generates a distribution for the D value. In this case the tied region is traversed entirely and the effect on the D value is evaluated at the end of the tied region. This is the path with the least change on the D statistic. The path with the greatest change on the D statistic is also computed as the upper bound on D. If these two values differ then the tied region is known to generate a distribution for the D statistic, and the p-value is an overestimate for the cases with a larger D statistic. The presence of any significant tied regions is returned in the result.

      If the p-value method is ESTIMATE then the p-value is estimated by repeated sampling of the joint distribution of x and y. The p-value is the frequency with which a sample creates a D statistic greater than or equal to (or strictly greater than, for a strict inequality) the observed value. In this case a source of randomness must be configured or an IllegalStateException will be raised. The p-value for the upper bound on D is not estimated and is set to NaN. This estimation procedure is not affected by ties in the data and is increasingly robust for larger datasets. The method is modeled after ks.boot in the R Matching package (Sekhon (2011)).
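      As an illustrative sketch, a two-sample test using the default options; the TwoResult accessor names getStatistic()/getPValue() are assumptions, and NormalDistribution.of is assumed to follow the same factory convention as the distributions in the one-sample example above.

```java
import org.apache.commons.rng.UniformRandomProvider;
import org.apache.commons.rng.simple.RandomSource;
import org.apache.commons.statistics.distribution.NormalDistribution;
import org.apache.commons.statistics.inference.KolmogorovSmirnovTest;

public class KsTwoSampleTest {
    public static void main(String[] args) {
        UniformRandomProvider rng = RandomSource.KISS.create(456);
        // Both samples from the same distribution: the null hypothesis holds
        double[] x = NormalDistribution.of(0, 1).sampler(rng).samples(200);
        double[] y = NormalDistribution.of(0, 1).sampler(rng).samples(150);
        KolmogorovSmirnovTest.TwoResult r =
            KolmogorovSmirnovTest.withDefaults().test(x, y);
        double p = r.getPValue();   // assumed accessor name
        System.out.println(p);
    }
}
```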

      Parameters:
      x - First sample.
      y - Second sample.
      Returns:
      test result
      Throws:
      IllegalArgumentException - if either x or y does not have length at least 2; or contain NaN values.
      IllegalStateException - if the p-value method is ESTIMATE and there is no source of randomness.