Class CHDMinimalPerfectHashFunction<T>
- All Implemented Interfaces:
it.unimi.dsi.fastutil.Function<T,Long>, it.unimi.dsi.fastutil.objects.Object2LongFunction<T>, it.unimi.dsi.fastutil.Size64, Serializable, Function<T, Long>, ToLongFunction<T>
Given a list of keys without duplicates, the builder of this class finds a
minimal perfect hash function for the list. Subsequent calls to the getLong(Object)
method will return a distinct number for each key in the list. For keys out of the list, the
resulting number is not specified. In some (rare) cases it might be possible to establish that a
key was not in the original list, and in that case -1 will be returned; by signing the
function (see below), you can guarantee with a prescribed probability that -1 will be returned on
keys not in the original list. The class can then be saved by serialisation and reused later.
This class uses a chunked hash store to
provide highly scalable construction. Note that at construction time you can
pass a
it.unimi.dsi.sux4j.io.ChunkedHashStore containing the keys (associated with any value); however,
if the store is rebuilt because of a
DuplicateException it will be
rebuilt associating with each key its ordinal position.
The memory requirements for the algorithm we use are ≈2 bits per key for
load factor equal to one and λ = 5. Thus, this class can use ≈10% less memory than a
GOVMinimalPerfectHashFunction.
However, its construction time is an order of magnitude larger, and query time is about 50% slower. Different tradeoffs between construction time, query time and space can be obtained by tweaking the load factor and the parameter λ (see the paper below for their exact meaning).
For convenience, this class provides a main method that reads from standard input a (possibly
gzip'd) sequence of newline-separated strings, and writes a serialised minimal
perfect hash function for the given list.
Signing
Optionally, it is possible to sign the minimal perfect
hash function. A w-bit signature will be associated with each key, so that
getLong(Object) will return -1 on strings that are not in the original key set. As
usual, false positives are possible with probability 2-w.
How it Works
The technique used is described by Djamal Belazzougui, Fabiano C. Botelho and Martin
Dietzfelbinger in “Hash, displace and compress”, Algorithms - ESA 2009, LNCS
5757, pages 682−693, 2009. However, with respect to the algorithm described in the paper,
this implementation is much more scalable, as it uses a
ChunkedHashStore to split the generation of large key sets into
generation of smaller functions for each chunk (of size approximately
216<T>).
- Since:
- 3.2.0
- Author:
- Sebastiano Vigna
- See Also:
-
Nested Class Summary
Nested Classes -
Field Summary
FieldsModifier and TypeFieldDescriptionprotected final EliasFanoLongBigListThe displacement coefficients.protected final longThe seed used to generate the initial hash triple.static final intThe logarithm of the desired chunk size.protected final longThe number of keys.protected final SparseRankThe sparse ranking structure containing the unused entries.static final longprotected final longThe mask to compare signatures, or zero for no signatures.protected final it.unimi.dsi.fastutil.longs.LongBigListThe signatures.protected final it.unimi.dsi.bits.TransformationStrategy<? super T> The transformation strategy.Fields inherited from class it.unimi.dsi.fastutil.objects.AbstractObject2LongFunction
defRetValue -
Constructor Summary
ConstructorsModifierConstructorDescriptionprotectedCHDMinimalPerfectHashFunction(Iterable<? extends T> keys, it.unimi.dsi.bits.TransformationStrategy<? super T> transform, int lambda, double loadFactor, int signatureWidth, File tempDir, ChunkedHashStore<T> chunkedHashStore) Creates a new CHD minimal perfect hash function for the given keys. -
Method Summary
Methods inherited from class AbstractHashFunction
containsKey, sizeMethods inherited from class it.unimi.dsi.fastutil.objects.AbstractObject2LongFunction
defaultReturnValue, defaultReturnValueMethods inherited from class Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, waitMethods inherited from interface it.unimi.dsi.fastutil.Function
apply, clearMethods inherited from interface it.unimi.dsi.fastutil.objects.Object2LongFunction
andThen, andThenByte, andThenChar, andThenDouble, andThenFloat, andThenInt, andThenLong, andThenObject, andThenReference, andThenShort, applyAsLong, composeByte, composeChar, composeDouble, composeFloat, composeInt, composeLong, composeObject, composeReference, composeShort, get, getOrDefault, getOrDefault, put, put, remove, removeLong
-
Field Details
-
serialVersionUID
public static final long serialVersionUID- See Also:
-
LOG2_CHUNK_SIZE
public static final int LOG2_CHUNK_SIZEThe logarithm of the desired chunk size.- See Also:
-
n
protected final long nThe number of keys. -
globalSeed
protected final long globalSeedThe seed used to generate the initial hash triple. -
transform
The transformation strategy. -
coefficients
The displacement coefficients. -
rank
The sparse ranking structure containing the unused entries. -
signatureMask
protected final long signatureMaskThe mask to compare signatures, or zero for no signatures. -
signatures
protected final it.unimi.dsi.fastutil.longs.LongBigList signaturesThe signatures.
-
-
Constructor Details
-
CHDMinimalPerfectHashFunction
protected CHDMinimalPerfectHashFunction(Iterable<? extends T> keys, it.unimi.dsi.bits.TransformationStrategy<? super T> transform, int lambda, double loadFactor, int signatureWidth, File tempDir, ChunkedHashStore<T> chunkedHashStore) throws IOException Creates a new CHD minimal perfect hash function for the given keys.- Parameters:
keys- the keys to hash, ornull.transform- a transformation strategy for the keys.lambda- the average bucket size.loadFactor- the load factor.signatureWidth- a signature width, or 0 for no signature.tempDir- a temporary directory for the store files, ornullfor the standard temporary directory.chunkedHashStore- a chunked hash store containing the keys, ornull; the store can be unchecked, but in this casekeysandtransformmust be non-null.- Throws:
IOException
-
-
Method Details
-
numBits
public long numBits()Returns the number of bits used by this structure.- Returns:
- the number of bits used by this structure.
-
getLong
-
size64
public long size64()- Specified by:
size64in interfaceit.unimi.dsi.fastutil.Size64- Overrides:
size64in classAbstractHashFunction<T>
-
main
public static void main(String[] arg) throws NoSuchMethodException, IOException, com.martiansoftware.jsap.JSAPException - Throws:
NoSuchMethodExceptionIOExceptioncom.martiansoftware.jsap.JSAPException
-