Class KolmogorovSmirnovTest
- java.lang.Object
-
- org.apache.commons.statistics.inference.KolmogorovSmirnovTest
-
public final class KolmogorovSmirnovTest extends java.lang.Object
Implements the Kolmogorov-Smirnov (K-S) test for equality of continuous distributions.
The one-sample test uses a D statistic based on the maximum deviation of the empirical distribution of sample data points from the distribution expected under the null hypothesis.
The two-sample test uses a D statistic based on the maximum deviation between the two empirical distributions of sample data points. The two-sample test evaluates the null hypothesis that the two samples x and y come from the same underlying distribution.
References:
- [1] Marsaglia, G., Tsang, W. W., & Wang, J. (2003). Evaluating Kolmogorov's Distribution. Journal of Statistical Software, 8(18), 1–4.
- [2] Simard, R., & L’Ecuyer, P. (2011). Computing the Two-Sided Kolmogorov-Smirnov Distribution. Journal of Statistical Software, 39(11), 1–18.
- [3] Sekhon, J. S. (2011). Multivariate and Propensity Score Matching Software with Automated Balance Optimization: The Matching package for R. Journal of Statistical Software, 42(7), 1–52.
- [4] Viehmann, T. (2021). Numerically more stable computation of the p-values for the two-sample Kolmogorov-Smirnov test. arXiv:2102.08037.
- [5] Hodges, J. L. (1958). The significance probability of the Smirnov two-sample test. Arkiv för Matematik, 3(5), 469-486.
Note that [1] contains an error in computing h; refer to MATH-437 for details.
- Since:
- 1.1
- See Also:
- Kolmogorov-Smirnov (K-S) test (Wikipedia)
-
-
Nested Class Summary
Nested Classes (Modifier and Type, Class: Description)
- static class KolmogorovSmirnovTest.OneResult: Result for the one-sample Kolmogorov-Smirnov test.
- static class KolmogorovSmirnovTest.TwoResult: Result for the two-sample Kolmogorov-Smirnov test.
-
Field Summary
Fields (Modifier and Type, Field: Description)
- private AlternativeHypothesis alternative: Alternative hypothesis.
- private static KolmogorovSmirnovTest DEFAULT: Default instance.
- private static long[] IGNORED_D: Placeholder to use for the two-sample ties D array when the value can be ignored.
- private static int[] IGNORED_SIGN: Placeholder to use for the two-sample sign array when the value can be ignored.
- private int iterations: Number of iterations.
- private static int LARGE_SAMPLE: When the largest sample size exceeds this value, 2-sample test AUTO p-value uses an asymptotic distribution to compute the p-value.
- private static int MAX_ARRAY_SIZE: Maximum length of an array.
- private static int MAX_FACTORIAL: Maximum finite factorial.
- private static long MAX_LCM_TWO_SAMPLE_EXACT_P: The maximum least common multiple (lcm) to attempt the exact p-value computation.
- private PValueMethod pValueMethod: Method to compute the p-value.
- private org.apache.commons.rng.UniformRandomProvider rng: Source of randomness.
- private static java.lang.String SAMPLE_1_NAME: Name for sample 1.
- private static java.lang.String SAMPLE_2_NAME: Name for sample 2.
- private boolean strictInequality: Use a strict inequality for the two-sample exact p-value.
-
Constructor Summary
Constructors (Modifier, Constructor)
- private KolmogorovSmirnovTest(AlternativeHypothesis alternative, PValueMethod method, boolean strict, org.apache.commons.rng.UniformRandomProvider rng, int iterations)
-
Method Summary
Methods (Modifier and Type, Method: Description)
- private static double binom(int n, int k): Compute the binomial coefficient binom(n, k).
- private static int checkArrayLength(double[] array): Verifies that array has length at least 2.
- private static double computeD(long dnm, int n, int m, int gcd): Compute the D statistic from the integral D value.
- private long computeIntegralKolmogorovSmirnovStatistic(double[] x, double[] y, int[] sign, long[] tiesD): Computes the two-sample Kolmogorov-Smirnov test statistic.
- private double computeStatistic(double[] x, java.util.function.DoubleUnaryOperator cdf, int[] sign): Computes the magnitude of the one-sample Kolmogorov-Smirnov test statistic.
- private static java.util.function.DoubleSupplier createSampler(double[] x, double[] y, org.apache.commons.rng.UniformRandomProvider rng): Creates a sampler to sample randomly from the combined distribution of the two samples.
- (package private) static java.util.function.DoubleSupplier createSampler(double[] x, double[] y, org.apache.commons.rng.UniformRandomProvider rng, int maxArraySize): Creates a sampler to sample randomly from the combined distribution of the two samples.
- private double estimateP(double[] x, double[] y, long dnm): Estimates the p-value of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution.
- private static double[] sort(double[] x, java.lang.String name): Sort the input array.
- double statistic(double[] x, double[] y): Computes the two-sample Kolmogorov-Smirnov test statistic.
- double statistic(double[] x, java.util.function.DoubleUnaryOperator cdf): Computes the one-sample Kolmogorov-Smirnov test statistic.
- KolmogorovSmirnovTest.TwoResult test(double[] x, double[] y): Performs a two-sample Kolmogorov-Smirnov test on samples x and y.
- KolmogorovSmirnovTest.OneResult test(double[] x, java.util.function.DoubleUnaryOperator cdf): Performs a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x conforms to the cumulative distribution function (cdf).
- private static boolean testIntegralKolmogorovSmirnovStatistic(double[] x, double[] y, long plus, long minus): Test if the two-sample integral Kolmogorov-Smirnov statistic is strictly greater than the specified values for D+ and D-.
- (package private) static double twoSampleApproximateP(double d, int n, int m, boolean twoSided): Uses the Kolmogorov-Smirnov distribution to approximate \(P(D_{n,m} > d)\) where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic.
- (package private) static double twoSampleExactP(long dnm, int n, int m, int gcd, boolean strict, boolean twoSided): Computes \(P(D_{n,m} > d)\) if strict is true; otherwise \(P(D_{n,m} \ge d)\), where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic, either the two-sided \(D_{n,m}\) or one-sided \(D_{n,m}^+\).
- private static double twoSampleOneSidedPOutside(long d, int n, int m, int gcd): Computes \(P(D_{n,m}^+ \ge d)\), where \(D_{n,m}^+\) is the 2-sample one-sided Kolmogorov-Smirnov statistic.
- private static double twoSampleOneSidedPOutsideSquare(long d, int n): Computes \(P(D_{n,n}^+ \ge d)\), where \(D_{n,n}^+\) is the 2-sample one-sided Kolmogorov-Smirnov statistic.
- private double twoSampleP(long dnm, int n, int m, int gcd, double d, boolean exact): Computes \(P(D_{n,m} > d)\) for the 2-sample Kolmogorov-Smirnov statistic.
- private static double twoSampleTwoSidedPOutsideSquare(long d, int n): Computes \(P(D_{n,n} \ge d)\), where \(D_{n,n}\) is the 2-sample two-sided Kolmogorov-Smirnov statistic.
- private static double twoSampleTwoSidedPStabilizedInner(long d, int n, int m, int gcd): Computes \(P(D_{n,m} \ge d)\), where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic.
- KolmogorovSmirnovTest with(org.apache.commons.rng.UniformRandomProvider v): Return an instance with the configured source of randomness.
- KolmogorovSmirnovTest with(AlternativeHypothesis v): Return an instance with the configured alternative hypothesis.
- KolmogorovSmirnovTest with(Inequality v): Return an instance with the configured inequality.
- KolmogorovSmirnovTest with(PValueMethod v): Return an instance with the configured p-value method.
- static KolmogorovSmirnovTest withDefaults(): Return an instance using the default options.
- KolmogorovSmirnovTest withIterations(int v): Return an instance with the configured number of iterations.
-
-
-
Field Detail
-
SAMPLE_1_NAME
private static final java.lang.String SAMPLE_1_NAME
Name for sample 1.
- See Also:
- Constant Field Values
-
SAMPLE_2_NAME
private static final java.lang.String SAMPLE_2_NAME
Name for sample 2.
- See Also:
- Constant Field Values
-
LARGE_SAMPLE
private static final int LARGE_SAMPLE
When the largest sample size exceeds this value, 2-sample test AUTO p-value uses an asymptotic distribution to compute the p-value.
- See Also:
- Constant Field Values
-
MAX_FACTORIAL
private static final int MAX_FACTORIAL
Maximum finite factorial.
- See Also:
- Constant Field Values
-
MAX_ARRAY_SIZE
private static final int MAX_ARRAY_SIZE
Maximum length of an array. This is used to determine if two arrays can be concatenated to create a sampler from the joint distribution. The limit is copied from the limit of java.util.ArrayList.
- See Also:
- Constant Field Values
-
MAX_LCM_TWO_SAMPLE_EXACT_P
private static final long MAX_LCM_TWO_SAMPLE_EXACT_P
The maximum least common multiple (lcm) to attempt the exact p-value computation. The integral d value is in [0, n*m] in steps of the greatest common divisor (gcd), thus lcm = n*m/gcd is the number of possible different p-values. Some methods have a lower limit due to computation limits. This should be larger than LARGE_SAMPLE^2 so all AUTO p-values attempt an exact computation, i.e. at least 10000^2 ~ 2^26.58.
- See Also:
- Constant Field Values
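The relationship lcm = n*m/gcd mentioned above can be illustrated with a minimal sketch. The class and method names below are hypothetical and are not part of the library; the code only shows the arithmetic.

```java
/** Illustrative sketch only; names are hypothetical and not part of Commons Statistics. */
final class LcmSketch {
    private LcmSketch() {}

    /** Greatest common divisor of two positive integers (Euclidean algorithm). */
    static int gcd(int n, int m) {
        while (m != 0) {
            int t = n % m;
            n = m;
            m = t;
        }
        return n;
    }

    /** Least common multiple n*m/gcd, evaluated in long arithmetic to avoid int overflow. */
    static long lcm(int n, int m) {
        return (long) n / gcd(n, m) * m;
    }
}
```

For example, sample sizes of 4 and 6 give gcd = 2 and lcm = 12, so the integral statistic moves through [0, 24] in steps of 2.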
-
IGNORED_SIGN
private static final int[] IGNORED_SIGN
Placeholder to use for the two-sample sign array when the value can be ignored.
-
IGNORED_D
private static final long[] IGNORED_D
Placeholder to use for the two-sample ties D array when the value can be ignored.
-
DEFAULT
private static final KolmogorovSmirnovTest DEFAULT
Default instance.
-
alternative
private final AlternativeHypothesis alternative
Alternative hypothesis.
-
pValueMethod
private final PValueMethod pValueMethod
Method to compute the p-value.
-
strictInequality
private final boolean strictInequality
Use a strict inequality for the two-sample exact p-value.
-
rng
private final org.apache.commons.rng.UniformRandomProvider rng
Source of randomness.
-
iterations
private final int iterations
Number of iterations.
-
-
Constructor Detail
-
KolmogorovSmirnovTest
private KolmogorovSmirnovTest(AlternativeHypothesis alternative, PValueMethod method, boolean strict, org.apache.commons.rng.UniformRandomProvider rng, int iterations)
- Parameters:
alternative - Alternative hypothesis.
method - P-value method.
strict - true to use a strict inequality.
rng - Source of randomness.
iterations - Number of iterations.
-
-
Method Detail
-
withDefaults
public static KolmogorovSmirnovTest withDefaults()
Return an instance using the default options.
- Returns:
- default instance
-
with
public KolmogorovSmirnovTest with(AlternativeHypothesis v)
Return an instance with the configured alternative hypothesis.
- Parameters:
v - Value.
- Returns:
- an instance
-
with
public KolmogorovSmirnovTest with(PValueMethod v)
Return an instance with the configured p-value method.
For the one-sample two-sided test Kolmogorov's asymptotic approximation can be used; otherwise the p-value uses the distribution of the D statistic.
For the two-sample test the exact p-value can be computed for small sample sizes; otherwise the p-value resorts to the asymptotic approximation. Alternatively a p-value can be estimated from the combined distribution of the samples. This requires a source of randomness.
- Parameters:
v - Value.
- Returns:
- an instance
- See Also:
with(UniformRandomProvider)
-
with
public KolmogorovSmirnovTest with(Inequality v)
Return an instance with the configured inequality.
Computes the p-value for the two-sample test as \(P(D_{n,m} > d)\) if strict; otherwise \(P(D_{n,m} \ge d)\), where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic, either the two-sided \(D_{n,m}\) or one-sided \(D_{n,m}^+\) or \(D_{n,m}^-\).
- Parameters:
v - Value.
- Returns:
- an instance
-
with
public KolmogorovSmirnovTest with(org.apache.commons.rng.UniformRandomProvider v)
Return an instance with the configured source of randomness.
Applies to the two-sample test when the p-value method is set to ESTIMATE. The randomness is used for sampling of the combined distribution.
There is no default source of randomness. If the randomness is not set then estimation is not possible and an IllegalStateException will be raised in the two-sample test.
- Parameters:
v - Value.
- Returns:
- an instance
- See Also:
with(PValueMethod)
-
withIterations
public KolmogorovSmirnovTest withIterations(int v)
Return an instance with the configured number of iterations.
Applies to the two-sample test when the p-value method is set to ESTIMATE. This is the number of sampling iterations used to estimate the p-value. The p-value is a fraction using the iterations as the denominator. The number of significant digits in the p-value is bounded above by log10(iterations); small p-values have fewer significant digits. A large number of iterations is recommended when using a small critical value to reject the null hypothesis.
- Parameters:
v - Value.
- Returns:
- an instance
- Throws:
java.lang.IllegalArgumentException - if the number of iterations is not strictly positive
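Because the estimated p-value is a count divided by the number of iterations, its resolution is fixed by that denominator. The following minimal sketch illustrates the point; the class and method names are hypothetical and the code is not the library implementation.

```java
/** Illustrative sketch only; names are hypothetical and not part of Commons Statistics. */
final class EstimateResolutionSketch {
    private EstimateResolutionSketch() {}

    /** Smallest non-zero p-value that an estimate with the given iterations can report. */
    static double smallestNonZeroP(int iterations) {
        if (iterations <= 0) {
            throw new IllegalArgumentException("iterations must be strictly positive: " + iterations);
        }
        return 1.0 / iterations;
    }

    /** Upper bound on the number of significant decimal digits in the estimate. */
    static double maxSignificantDigits(int iterations) {
        return Math.log10(iterations);
    }
}
```

With 1000 iterations the estimate cannot distinguish p-values below 0.001, so rejecting at a critical value of that order would need substantially more iterations.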
-
statistic
public double statistic(double[] x, java.util.function.DoubleUnaryOperator cdf)
Computes the one-sample Kolmogorov-Smirnov test statistic.
- two-sided: \(D_n=\sup_x |F_n(x)-F(x)|\)
- greater: \(D_n^+=\sup_x (F_n(x)-F(x))\)
- less: \(D_n^-=\sup_x (F(x)-F_n(x))\)
where \(F\) is the cumulative distribution function (cdf), \(n\) is the length of x and \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x.
The cumulative distribution function should map a real value x to a probability in [0, 1]. To use a reference distribution the CDF can be passed using a method reference:
    UniformContinuousDistribution dist = UniformContinuousDistribution.of(0, 1);
    UniformRandomProvider rng = RandomSource.KISS.create(123);
    // samples(long) returns a stream; collect it to an array
    double[] x = dist.sampler(rng).samples(100).toArray();
    double d = KolmogorovSmirnovTest.withDefaults().statistic(x, dist::cumulativeProbability);
- Parameters:
x - Sample being evaluated.
cdf - Reference cumulative distribution function.
- Returns:
- Kolmogorov-Smirnov statistic
- Throws:
java.lang.IllegalArgumentException - if x does not have length at least 2; or contains NaN values.
- See Also:
test(double[], DoubleUnaryOperator)
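The three one-sample statistics defined above can be evaluated directly from the sorted sample, since \(F_n\) jumps from \(i/n\) to \((i+1)/n\) at the i-th order statistic (0-based). The following is a minimal sketch under that observation; the class name is hypothetical, input validation is omitted, and it is not the library implementation.

```java
import java.util.Arrays;
import java.util.function.DoubleUnaryOperator;

/** Illustrative sketch only; names are hypothetical and not part of Commons Statistics. */
final class OneSampleDSketch {
    private OneSampleDSketch() {}

    /**
     * Returns {D+, D-, D} for the sample against the reference CDF.
     * Assumes x is non-empty and contains no NaN values.
     */
    static double[] statistics(double[] x, DoubleUnaryOperator cdf) {
        double[] sorted = x.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int n = sorted.length;
        double dPlus = 0;
        double dMinus = 0;
        for (int i = 0; i < n; i++) {
            double f = cdf.applyAsDouble(sorted[i]);
            // Just after sorted[i] the empirical CDF is (i+1)/n; just before it is i/n.
            dPlus = Math.max(dPlus, (i + 1.0) / n - f);
            dMinus = Math.max(dMinus, f - (double) i / n);
        }
        return new double[] {dPlus, dMinus, Math.max(dPlus, dMinus)};
    }
}
```

For the sample {0.1, 0.4, 0.7} tested against the uniform distribution on [0, 1] (CDF v -> v), this gives D+ = 0.3, D- = 0.1 and a two-sided D of 0.3.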
-
statistic
public double statistic(double[] x, double[] y)
Computes the two-sample Kolmogorov-Smirnov test statistic.
- two-sided: \(D_{n,m}=\sup_x |F_n(x)-F_m(x)|\)
- greater: \(D_{n,m}^+=\sup_x (F_n(x)-F_m(x))\)
- less: \(D_{n,m}^-=\sup_x (F_m(x)-F_n(x))\)
where \(n\) is the length of x, \(m\) is the length of y, \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x and \(F_m\) is the empirical distribution that puts mass \(1/m\) at each of the values in y.
- Parameters:
x - First sample.
y - Second sample.
- Returns:
- Kolmogorov-Smirnov statistic
- Throws:
java.lang.IllegalArgumentException - if either x or y does not have length at least 2; or contains NaN values.
- See Also:
test(double[], double[])
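The two-sample definitions above can be sketched as a single merge pass over the two sorted samples, comparing the empirical distributions after each distinct value. The class name is hypothetical, the floating-point differences are used for clarity only, and this is not the library implementation (which works with an integral statistic).

```java
import java.util.Arrays;

/** Illustrative sketch only; names are hypothetical and not part of Commons Statistics. */
final class TwoSampleDSketch {
    private TwoSampleDSketch() {}

    /**
     * Returns {D+, D-, D} where D+ = sup (F_n - F_m) and D- = sup (F_m - F_n).
     * Assumes both arrays are non-empty and contain no NaN values.
     */
    static double[] statistics(double[] x, double[] y) {
        double[] a = x.clone();
        double[] b = y.clone();
        Arrays.sort(a);
        Arrays.sort(b);
        int n = a.length;
        int m = b.length;
        int i = 0;
        int j = 0;
        double dPlus = 0;
        double dMinus = 0;
        while (i < n || j < m) {
            // Next distinct value in the merged order.
            double v = (j == m || (i < n && a[i] <= b[j])) ? a[i] : b[j];
            // Step past every copy of v in both samples before comparing.
            while (i < n && a[i] == v) {
                i++;
            }
            while (j < m && b[j] == v) {
                j++;
            }
            double diff = (double) i / n - (double) j / m;
            dPlus = Math.max(dPlus, diff);
            dMinus = Math.max(dMinus, -diff);
        }
        return new double[] {dPlus, dMinus, Math.max(dPlus, dMinus)};
    }
}
```

Completely separated samples such as {1, 2} and {3, 4} give D+ = 1 and D- = 0; swapping the arguments swaps the two one-sided values.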
-
test
public KolmogorovSmirnovTest.OneResult test(double[] x, java.util.function.DoubleUnaryOperator cdf)
Performs a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x conforms to the cumulative distribution function (cdf).
The test is defined by the AlternativeHypothesis:
- Two-sided evaluates the null hypothesis that the two distributions are identical, \(F_n(i) = F(i)\) for all \( i \); the alternative is that they are not identical. The statistic is \( \max(D_n^+, D_n^-) \) and the sign of \( D \) is provided.
- Greater evaluates the null hypothesis that \(F_n(i) \le F(i)\) for all \( i \); the alternative is \(F_n(i) > F(i)\) for at least one \( i \). The statistic is \( D_n^+ \).
- Less evaluates the null hypothesis that \(F_n(i) \ge F(i)\) for all \( i \); the alternative is \(F_n(i) < F(i)\) for at least one \( i \). The statistic is \( D_n^- \).
The p-value method defaults to exact. The one-sided p-value uses Smirnov's stable formula:
\[ P(D_n^+ \ge x) = x \sum_{j=0}^{\lfloor n(1-x) \rfloor} \binom{n}{j} \left(\frac{j}{n} + x\right)^{j-1} \left(1-x-\frac{j}{n} \right)^{n-j} \]
The two-sided p-value is computed using methods described in Simard & L’Ecuyer (2011). The two-sided test supports an asymptotic p-value using Kolmogorov's formula:
\[ \lim_{n\to\infty} P(\sqrt{n}D_n > z) = 2 \sum_{i=1}^\infty (-1)^{i-1} e^{-2 i^2 z^2} \]
- Parameters:
x - Sample being evaluated.
cdf - Reference cumulative distribution function.
- Returns:
- test result
- Throws:
java.lang.IllegalArgumentException - if x does not have length at least 2; or contains NaN values.
- See Also:
statistic(double[], DoubleUnaryOperator)
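Smirnov's one-sided formula above can be evaluated term by term for small n. The following is a naive sketch: the class name is hypothetical, the binomial coefficient is accumulated in double precision (so it is only suitable for modest n), and it makes no claim to match the numerical safeguards of the library implementation.

```java
/** Illustrative sketch only; names are hypothetical and not part of Commons Statistics. */
final class SmirnovOneSidedSketch {
    private SmirnovOneSidedSketch() {}

    /** Naive evaluation of P(D_n^+ >= x) using Smirnov's formula; intended for small n. */
    static double pGreaterOrEqual(int n, double x) {
        if (x <= 0) {
            return 1;
        }
        if (x >= 1) {
            return 0;
        }
        int jMax = (int) Math.floor(n * (1 - x));
        double sum = 0;
        double binom = 1; // C(n, 0)
        for (int j = 0; j <= jMax; j++) {
            double a = (double) j / n + x;
            // Clamp at zero to guard against rounding making the base slightly negative.
            double b = Math.max(0, 1 - x - (double) j / n);
            sum += binom * Math.pow(a, j - 1) * Math.pow(b, n - j);
            binom = binom * (n - j) / (j + 1); // C(n, j + 1)
        }
        return Math.min(1, x * sum);
    }
}
```

Two hand-checkable cases: for n = 1 the formula reduces to 1 - x, and for n = 2, x = 0.5 it gives 0.25, the probability that both uniform variates fall at or below 0.5.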
-
test
public KolmogorovSmirnovTest.TwoResult test(double[] x, double[] y)
Performs a two-sample Kolmogorov-Smirnov test on samples x and y. Test the empirical distributions \(F_n\) and \(F_m\) where \(n\) is the length of x, \(m\) is the length of y, \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x and \(F_m\) is the empirical distribution that puts mass \(1/m\) at each of the values in y.
The test is defined by the AlternativeHypothesis:
- Two-sided evaluates the null hypothesis that the two distributions are identical, \(F_n(i) = F_m(i)\) for all \( i \); the alternative is that they are not identical. The statistic is \( \max(D_{n,m}^+, D_{n,m}^-) \) and the sign of \( D \) is provided.
- Greater evaluates the null hypothesis that \(F_n(i) \le F_m(i)\) for all \( i \); the alternative is \(F_n(i) > F_m(i)\) for at least one \( i \). The statistic is \( D_{n,m}^+ \).
- Less evaluates the null hypothesis that \(F_n(i) \ge F_m(i)\) for all \( i \); the alternative is \(F_n(i) < F_m(i)\) for at least one \( i \). The statistic is \( D_{n,m}^- \).
If the p-value method is auto, then an exact p computation is attempted if both sample sizes are less than 10000 using the methods presented in Viehmann (2021) and Hodges (1958); otherwise an asymptotic p-value is returned. The two-sided p-value is \(\overline{F}(d, \sqrt{mn / (m + n)})\) where \(\overline{F}\) is the complementary cumulative distribution function of the two-sided one-sample Kolmogorov-Smirnov distribution. The one-sided p-value uses an approximation from Hodges (1958) Eq 5.3.
\(D_{n,m}\) has a discrete distribution. This makes the p-value associated with the null hypothesis \(H_0 : D_{n,m} \gt d \) differ from \(H_0 : D_{n,m} \ge d \) by the mass of the observed value \(d\). These can be distinguished using an Inequality parameter. This is ignored for large samples.
If the data contains ties there is no defined ordering in the tied region to use for the difference between the two empirical distributions. Each ordering of the tied region may create a different D statistic. All possible orderings generate a distribution for the D value. In this case the tied region is traversed entirely and the effect on the D value evaluated at the end of the tied region. This is the path with the least change on the D statistic. The path with the greatest change on the D statistic is also computed as the upper bound on D. If these two values are different then the tied region is known to generate a distribution for the D statistic and the p-value is an overestimate for the cases with a larger D statistic. The presence of any significant tied regions is returned in the result.
If the p-value method is ESTIMATE then the p-value is estimated by repeated sampling of the joint distribution of x and y. The p-value is the frequency that a sample creates a D statistic greater than or equal to (or greater than for strict inequality) the observed value. In this case a source of randomness must be configured or an IllegalStateException will be raised. The p-value for the upper bound on D will not be estimated and is set to NaN. This estimation procedure is not affected by ties in the data and is increasingly robust for larger datasets. The method is modeled after ks.boot in the R Matching package (Sekhon (2011)).
- Parameters:
x - First sample.
y - Second sample.
- Returns:
- test result
- Throws:
java.lang.IllegalArgumentException - if either x or y does not have length at least 2; or contains NaN values.
java.lang.IllegalStateException - if the p-value method is ESTIMATE and there is no source of randomness.
- See Also:
statistic(double[], double[])
-
estimateP
private double estimateP(double[] x, double[] y, long dnm)
Estimates the p-value of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution.
This method will destructively modify the input arrays (via a sort).
This method estimates the p-value by repeatedly sampling sets of size x.length and y.length from the empirical distribution of the combined sample. The memory requirement is an array copy of each of the input arguments.
When using strict inequality, this is equivalent to the algorithm implemented in the R function ks.boot as described in Sekhon (2011) Journal of Statistical Software, 42(7), 1–52 [3].
- Parameters:
x - First sample.
y - Second sample.
dnm - Integral D statistic.
- Returns:
- p-value
- Throws:
java.lang.IllegalStateException - if the source of randomness is null.
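The resampling procedure described above can be sketched as follows. This is a simplified illustration, not the library implementation: the class and method names are hypothetical, java.util.Random stands in for the UniformRandomProvider used by the library, the combined sample is held in a single array, and only the two-sided statistic is considered.

```java
import java.util.Arrays;
import java.util.Random;

/** Illustrative sketch only; names are hypothetical and not part of Commons Statistics. */
final class BootstrapPSketch {
    private BootstrapPSketch() {}

    /** Two-sided integral statistic in [0, n*m]; sorts both arrays in place. */
    static long integralD(double[] x, double[] y) {
        Arrays.sort(x);
        Arrays.sort(y);
        int n = x.length;
        int m = y.length;
        int i = 0;
        int j = 0;
        long d = 0;
        long maxAbs = 0;
        while (i < n || j < m) {
            double v = (j == m || (i < n && x[i] <= y[j])) ? x[i] : y[j];
            while (i < n && x[i] == v) {
                d += m; // each x value raises n*m*(F_n - F_m) by m
                i++;
            }
            while (j < m && y[j] == v) {
                d -= n; // each y value lowers it by n
                j++;
            }
            maxAbs = Math.max(maxAbs, Math.abs(d));
        }
        return maxAbs;
    }

    /** Fraction of resamples whose statistic is >= (or > if strict) the observed one. */
    static double estimateP(double[] x, double[] y, int iterations, boolean strict, Random rng) {
        long observed = integralD(x.clone(), y.clone());
        double[] pool = new double[x.length + y.length];
        System.arraycopy(x, 0, pool, 0, x.length);
        System.arraycopy(y, 0, pool, x.length, y.length);
        double[] bx = new double[x.length];
        double[] by = new double[y.length];
        int count = 0;
        for (int it = 0; it < iterations; it++) {
            // Sample with replacement from the combined sample.
            for (int k = 0; k < bx.length; k++) {
                bx[k] = pool[rng.nextInt(pool.length)];
            }
            for (int k = 0; k < by.length; k++) {
                by[k] = pool[rng.nextInt(pool.length)];
            }
            long d = integralD(bx, by);
            if (strict ? d > observed : d >= observed) {
                count++;
            }
        }
        return (double) count / iterations;
    }
}
```

Two boundary cases are deterministic: identical samples have an observed statistic of 0, so the non-strict estimate is 1; fully separated samples have the maximum statistic n*m, so the strict estimate is 0.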
-
computeStatistic
private double computeStatistic(double[] x, java.util.function.DoubleUnaryOperator cdf, int[] sign)
Computes the magnitude of the one-sample Kolmogorov-Smirnov test statistic. The sign of the statistic is optionally returned in sign. For the two-sided case the sign is 0 if the magnitude of D+ and D- was equal; otherwise it indicates which D had the larger magnitude.
- Parameters:
x - Sample being evaluated.
cdf - Reference cumulative distribution function.
sign - Sign of the statistic (non-zero length).
- Returns:
- Kolmogorov-Smirnov statistic
- Throws:
java.lang.IllegalArgumentException - if x does not have length at least 2; or contains NaN values.
-
computeIntegralKolmogorovSmirnovStatistic
private long computeIntegralKolmogorovSmirnovStatistic(double[] x, double[] y, int[] sign, long[] tiesD)
Computes the two-sample Kolmogorov-Smirnov test statistic. The statistic is integral and has a value in [0, n*m]. Not all values are possible; the smallest increment is the greatest common divisor of n and m.
This method will destructively modify the input arrays (via a sort).
The sign of the statistic is returned in sign. For the two-sided case the sign is 0 if the magnitude of D+ and D- was equal; otherwise it indicates which D had the larger magnitude. If the two-sided statistic is zero the two arrays are identical, or are 'identical' data of a single value (sample sizes may be different), or have a sequence of ties of 'identical' data with a net zero effect on the D statistic e.g.
    [1,2,3] vs [1,2,3]
    [0,0,0,0] vs [0,0,0]
    [0,0,0,0,1,1,1,1] vs [0,0,0,1,1,1]
This method detects ties in the input data. Not all ties will invalidate the returned statistic. Ties between the data can be interpreted as if the values were different but within machine epsilon. In this case the path through the tie region is not known. All paths through the tie region end at the same point. This method will compute the most extreme path through each tie region and the D statistic for these paths. If the ties D statistic is a larger magnitude than the returned D statistic then at least one tie region lies at a point on the full path that could result in a different statistic in the absence of ties. This signals the p-value computed using the returned D statistic is one of many possible p-values given the data and how ties are resolved. Note: The tiesD value may be smaller than the returned D statistic as it is not set to the maximum of D or tiesD. The only result of interest is when tiesD > D due to a tie region that can change the output D. On output tiesD[0] != 0 if there were ties between samples and tiesD[1] = D of the most extreme path through the ties.
- Parameters:
x - First sample (destructively modified by sort).
y - Second sample (destructively modified by sort).
sign - Sign of the statistic (non-zero length).
tiesD - Integral statistic for the most extreme path through any ties (length at least 2).
- Returns:
- integral Kolmogorov-Smirnov statistic
- Throws:
java.lang.IllegalArgumentException - if either x or y contains NaN values.
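One possible reading of the tie handling described above is sketched below for the two-sided case. Each x value moves the integral path up by m and each y value moves it down by n; a tied region is evaluated at its end point, and the most extreme excursion inside the region is obtained by taking all of its x values first or all of its y values first. The class name and the returned array layout {D, tiesD} are assumptions made for illustration and differ from the sign and tiesD output arrays of the actual method.

```java
import java.util.Arrays;

/** Illustrative sketch only; names are hypothetical and not part of Commons Statistics. */
final class IntegralDTiesSketch {
    private IntegralDTiesSketch() {}

    /**
     * Returns {D, tiesD}: the two-sided integral statistic evaluated at the end of each
     * tied region, and the largest magnitude reachable inside any region tied between
     * the samples (0 if there are no such ties). Assumes no NaN values.
     */
    static long[] integralDWithTies(double[] x, double[] y) {
        double[] a = x.clone();
        double[] b = y.clone();
        Arrays.sort(a);
        Arrays.sort(b);
        int n = a.length;
        int m = b.length;
        int i = 0;
        int j = 0;
        long d = 0;
        long maxAbs = 0;
        long tiesD = 0;
        while (i < n || j < m) {
            double v = (j == m || (i < n && a[i] <= b[j])) ? a[i] : b[j];
            long before = d;
            long up = 0;
            long down = 0;
            while (i < n && a[i] == v) {
                up += m;
                i++;
            }
            while (j < m && b[j] == v) {
                down += n;
                j++;
            }
            d = before + up - down; // all orderings of the region end here
            maxAbs = Math.max(maxAbs, Math.abs(d));
            if (up != 0 && down != 0) {
                // Tie between samples: extreme paths take all x first or all y first.
                long extreme = Math.max(Math.abs(before + up), Math.abs(before - down));
                tiesD = Math.max(tiesD, extreme);
            }
        }
        return new long[] {maxAbs, tiesD};
    }
}
```

For [0,0,0,0] vs [0,0,0] this gives D = 0 with tiesD = 12 (the full range n*m), which signals that the ordering inside the tied region could change the statistic.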
-
testIntegralKolmogorovSmirnovStatistic
private static boolean testIntegralKolmogorovSmirnovStatistic(double[] x, double[] y, long plus, long minus)
Test if the two-sample integral Kolmogorov-Smirnov statistic is strictly greater than the specified values for D+ and D-. Note that D- should have a negative sign to impose an inclusive lower bound.
This method will destructively modify the input arrays (via a sort).
For a two-sided alternative hypothesis plus and minus should have the same magnitude with opposite signs.
For a one-sided alternative hypothesis the value of plus or minus should have the expected value of the statistic, and the opposite D should have the maximum or minimum long value. The minus should be negatively signed:
- greater: plus = D, minus = Long.MIN_VALUE
- less: minus = -D, plus = Long.MAX_VALUE
Note: This method has not been specialized for the one-sided case. Specialization would eliminate a conditional branch for d > Long.MAX_VALUE or d < Long.MIN_VALUE. Since these branches are never possible in the one-sided case this should be efficiently chosen by branch prediction in a processor pipeline.
- Parameters:
x - First sample (destructively modified by sort; must not contain NaN).
y - Second sample (destructively modified by sort; must not contain NaN).
plus - Limit on D+ (inclusive upper bound).
minus - Limit on D- (inclusive lower bound).
- Returns:
- true if the D value exceeds the provided limits
-
createSampler
private static java.util.function.DoubleSupplier createSampler(double[] x, double[] y, org.apache.commons.rng.UniformRandomProvider rng)
Creates a sampler to sample randomly from the combined distribution of the two samples.
- Parameters:
x - First sample.
y - Second sample.
rng - Source of randomness.
- Returns:
- the sampler
-
createSampler
static java.util.function.DoubleSupplier createSampler(double[] x, double[] y, org.apache.commons.rng.UniformRandomProvider rng, int maxArraySize)
Creates a sampler to sample randomly from the combined distribution of the two samples. This will copy the input data for the sampler.
- Parameters:
x - First sample.
y - Second sample.
rng - Source of randomness.
maxArraySize - Maximum size of a single array.
- Returns:
- the sampler
-
computeD
private static double computeD(long dnm, int n, int m, int gcd)
Compute the D statistic from the integral D value.
- Parameters:
dnm - Integral D-statistic value (in [0, n*m]).
n - First sample size.
m - Second sample size.
gcd - Greatest common divisor of n and m.
- Returns:
- D-statistic value (in [0, 1]).
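The conversion from the integral value to D is the ratio dnm / (n*m). A minimal sketch, assuming the gcd is used only to reduce both terms before the division (an assumption about intent, not a description of the actual implementation), could look like this; the class name is hypothetical.

```java
/** Illustrative sketch only; names are hypothetical and not part of Commons Statistics. */
final class ComputeDSketch {
    private ComputeDSketch() {}

    /** Converts the integral statistic in [0, n*m] to D in [0, 1]. */
    static double computeD(long dnm, int n, int m, int gcd) {
        long steps = dnm / gcd;        // exact: dnm is a multiple of gcd
        long lcm = (long) n / gcd * m; // n*m/gcd without int overflow
        return (double) steps / lcm;
    }
}
```

For n = 4, m = 6 (gcd = 2) an integral value of 12 maps to D = 0.5 and the maximum value 24 maps to D = 1.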
-
twoSampleP
private double twoSampleP(long dnm, int n, int m, int gcd, double d, boolean exact)
Computes \(P(D_{n,m} > d)\) for the 2-sample Kolmogorov-Smirnov statistic.
- Parameters:
dnm - Integral D-statistic value (in [0, n*m]).
n - First sample size.
m - Second sample size.
gcd - Greatest common divisor of n and m.
d - D-statistic value (in [0, 1]).
exact - whether to compute the exact probability; otherwise approximate.
- Returns:
- probability
- See Also:
twoSampleExactP(long, int, int, int, boolean, boolean), twoSampleApproximateP(double, int, int, boolean)
-
twoSampleExactP
static double twoSampleExactP(long dnm, int n, int m, int gcd, boolean strict, boolean twoSided)
Computes \(P(D_{n,m} > d)\) if strict is true; otherwise \(P(D_{n,m} \ge d)\), where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic, either the two-sided \(D_{n,m}\) or one-sided \(D_{n,m}^+\). See statistic(double[], double[]) for the definition of \(D_{n,m}\).
The returned probability is exact. If the value cannot be computed this returns -1.
Note: This requires the greatest common divisor of n and m. The integral D statistic in the range [0, n*m] is separated by increments of the gcd. The method will only compute p-values for valid values of D by calculating for D/gcd. Strict inequality is performed using the next valid value for D.
- Parameters:
dnm - Integral D-statistic value (in [0, n*m]).
n - First sample size.
m - Second sample size.
gcd - Greatest common divisor of n and m.
strict - whether or not the probability to compute is expressed as a strict inequality.
twoSided - whether D refers to D or D+.
- Returns:
- probability that a randomly selected m-n partition of m + n generates D greater than (resp. greater than or equal to) d (or -1)
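For very small samples the exact probability can be checked by brute force: enumerate every m-n partition of m + n, walk the corresponding path, and count the partitions whose two-sided integral statistic exceeds (or reaches) the observed value. The sketch below does this by enumeration only. It is not the method of Viehmann (2021) or Hodges (1958) used by the library, the class name is hypothetical, and it assumes no ties.

```java
/** Illustrative sketch only; names are hypothetical and not part of Commons Statistics. */
final class ExactTwoSamplePSketch {
    private ExactTwoSamplePSketch() {}

    /**
     * Brute-force P(D > dnm) if strict, else P(D >= dnm), for the two-sided integral
     * statistic. Enumerates all partitions, so it is limited to n + m <= 20.
     */
    static double exactP(long dnm, int n, int m, boolean strict) {
        int total = n + m;
        if (n < 1 || m < 1 || total > 20) {
            throw new IllegalArgumentException("sketch supports 2 <= n + m <= 20");
        }
        long count = 0;
        long paths = 0;
        for (int mask = 0; mask < (1 << total); mask++) {
            if (Integer.bitCount(mask) != n) {
                continue; // not a valid assignment of n positions to x
            }
            paths++;
            long d = 0;
            long maxAbs = 0;
            for (int k = 0; k < total; k++) {
                // A set bit is an x value (+m); a clear bit is a y value (-n).
                d += ((mask >>> k) & 1) == 1 ? m : -n;
                maxAbs = Math.max(maxAbs, Math.abs(d));
            }
            if (strict ? maxAbs > dnm : maxAbs >= dnm) {
                count++;
            }
        }
        return (double) count / paths;
    }
}
```

With n = m = 2 there are 6 partitions, of which 2 reach the maximum integral value 4. Hence P(D >= 4) = 1/3 while P(D > 4) = 0, which shows how the strict and non-strict probabilities differ by the mass at the observed value.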
-
twoSampleTwoSidedPStabilizedInner
private static double twoSampleTwoSidedPStabilizedInner(long d, int n, int m, int gcd)
Computes \(P(D_{n,m} \ge d)\), where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic.
The returned probability is exact, implemented using the stabilized inner method presented in Viehmann (2021).
This is optimized for m <= n. If m > n an index-out-of-bounds exception can occur.
- Parameters:
d - Integral D-statistic value (in [2, lcm]).
n - Larger sample size.
m - Smaller sample size.
gcd - Greatest common divisor of n and m.
- Returns:
- probability that a randomly selected m-n partition of m + n generates \(D_{n,m}\) greater than or equal to d
-
twoSampleOneSidedPOutside
private static double twoSampleOneSidedPOutside(long d, int n, int m, int gcd)
Computes \(P(D_{n,m}^+ \ge d)\), where \(D_{n,m}^+\) is the 2-sample one-sided Kolmogorov-Smirnov statistic.
The returned probability is exact, implemented using the outer method presented in Hodges (1958).
This method will fail-fast and return -1 if the computation of the numbers of paths overflows.
- Parameters:
d - Integral D-statistic value (in [0, lcm]).
n - First sample size.
m - Second sample size.
gcd - Greatest common divisor of n and m.
- Returns:
- probability that a randomly selected m-n partition of m + n generates \(D_{n,m}^+\) greater than or equal to d
-
twoSampleOneSidedPOutsideSquare
private static double twoSampleOneSidedPOutsideSquare(long d, int n)
Computes \(P(D_{n,n}^+ \ge d)\), where \(D_{n,n}^+\) is the 2-sample one-sided Kolmogorov-Smirnov statistic.
The returned probability is exact, implemented using the outer method presented in Hodges (1958).
- Parameters:
d - Integral D-statistic value (in [1, lcm]).
n - Sample size.
- Returns:
- probability that a randomly selected n-n partition of n + n generates \(D_{n,n}^+\) greater than or equal to d
-
twoSampleTwoSidedPOutsideSquare
private static double twoSampleTwoSidedPOutsideSquare(long d, int n)
Computes \(P(D_{n,n} \ge d)\), where \(D_{n,n}\) is the 2-sample two-sided Kolmogorov-Smirnov statistic.
The returned probability is exact, implemented using the outer method presented in Hodges (1958).
- Parameters:
d - Integral D-statistic value (in [1, n]).
n - Sample size.
- Returns:
- probability that a randomly selected n-n partition of n + n generates \(D_{n,n}\) greater than or equal to d
-
binom
private static double binom(int n, int k)
Compute the binomial coefficient binom(n, k).
- Parameters:
n - N.
k - K.
- Returns:
- binom(n, k)
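A binomial coefficient returned as a double can be sketched with the multiplicative formula, where every intermediate value is itself a binomial coefficient. This is an illustration with a hypothetical class name; the actual method may use a different approach (for example cached factorials).

```java
/** Illustrative sketch only; names are hypothetical and not part of Commons Statistics. */
final class BinomSketch {
    private BinomSketch() {}

    /** Computes binom(n, k) as a double; returns 0 when k is outside [0, n]. */
    static double binom(int n, int k) {
        if (k < 0 || k > n) {
            return 0;
        }
        int kk = Math.min(k, n - k); // use symmetry to shorten the loop
        double result = 1;
        for (int i = 1; i <= kk; i++) {
            // After this step result == binom(n - kk + i, i), an integer.
            result = result * (n - kk + i) / i;
        }
        return result;
    }
}
```

The result is exact while the intermediate products stay below 2^53; beyond that it is a floating-point approximation.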
-
twoSampleApproximateP
static double twoSampleApproximateP(double d, int n, int m, boolean twoSided)
Uses the Kolmogorov-Smirnov distribution to approximate \(P(D_{n,m} > d)\) where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic. See statistic(double[], double[]) for the definition of \(D_{n,m}\).
Specifically, what is returned is \(1 - CDF(d, \sqrt{mn / (m + n)})\) where CDF is the cumulative distribution function of the two-sided one-sample Kolmogorov-Smirnov distribution.
- Parameters:
d - D-statistic value.
n - First sample size.
m - Second sample size.
twoSided - True to compute the two-sided p-value; else one-sided.
- Returns:
- approximate probability that a randomly selected m-n partition of m + n generates \(D_{n,m}\) greater than d
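The role of the effective sample size \(\sqrt{mn / (m + n)}\) can be illustrated with a large-sample sketch for the two-sided case. Note the substitution: instead of the finite-sample one-sample Kolmogorov-Smirnov distribution described above, this sketch uses Kolmogorov's limiting series \(2 \sum_{i \ge 1} (-1)^{i-1} e^{-2 i^2 z^2}\) evaluated at \(z = d \sqrt{mn / (m + n)}\). The class and method names are hypothetical.

```java
/** Illustrative sketch only; names are hypothetical and not part of Commons Statistics. */
final class TwoSampleAsymptoticSketch {
    private TwoSampleAsymptoticSketch() {}

    /** Kolmogorov's limiting survival function 2 * sum_{i>=1} (-1)^(i-1) exp(-2 i^2 z^2). */
    static double kolmogorovSf(double z) {
        if (z < 0.05) {
            return 1; // the limiting CDF is far below double precision here
        }
        double sum = 0;
        for (int i = 1; i <= 200; i++) {
            double term = Math.exp(-2.0 * i * i * z * z);
            sum += (i % 2 == 1) ? term : -term;
            if (term < 1e-18) {
                break; // remaining terms cannot change the sum
            }
        }
        return Math.min(1, Math.max(0, 2 * sum));
    }

    /** Large-sample two-sided approximation of P(D_{n,m} > d). */
    static double twoSidedP(double d, int n, int m) {
        double z = d * Math.sqrt((double) n * m / ((double) n + m));
        return kolmogorovSf(z);
    }
}
```

For n = m = 200 the scale factor is 10, so d = 0.136 corresponds to z = 1.36, the familiar 5% critical value of the limiting distribution.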
-
checkArrayLength
private static int checkArrayLength(double[] array)
Verifies that array has length at least 2.
- Parameters:
array - Array to test.
- Returns:
- the length
- Throws:
java.lang.IllegalArgumentException - if array is too short
-
sort
private static double[] sort(double[] x, java.lang.String name)
Sort the input array. Throws an exception if NaN values are present. It is assumed the array is non-zero length.
- Parameters:
x - Input array.
name - Name of the array.
- Returns:
- a reference to the input (sorted) array
- Throws:
java.lang.IllegalArgumentException - if x contains NaN values.
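One way to combine the sort with the NaN check relies on the ordering used by java.util.Arrays.sort(double[]), which places NaN after every other value, so only the last element needs to be inspected. The sketch below shows that idea; the class name is hypothetical and the library may perform the check differently.

```java
import java.util.Arrays;

/** Illustrative sketch only; names are hypothetical and not part of Commons Statistics. */
final class SortSketch {
    private SortSketch() {}

    /** Sorts x in place and rejects NaN values; assumes x has non-zero length. */
    static double[] sort(double[] x, String name) {
        Arrays.sort(x);
        // Arrays.sort orders NaN above all other values, including positive infinity.
        if (Double.isNaN(x[x.length - 1])) {
            throw new IllegalArgumentException(name + " contains NaN");
        }
        return x;
    }
}
```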
-
-