Class GV4CompressedFunction<T>
- java.lang.Object
-
- it.unimi.dsi.fastutil.objects.AbstractObject2LongFunction<T>
-
- it.unimi.dsi.sux4j.mph.GV4CompressedFunction<T>
-
- All Implemented Interfaces:
it.unimi.dsi.fastutil.Function<T,java.lang.Long>,it.unimi.dsi.fastutil.objects.Object2LongFunction<T>,it.unimi.dsi.fastutil.Size64,java.io.Serializable,java.util.function.Function<T,java.lang.Long>,java.util.function.ToLongFunction<T>
public class GV4CompressedFunction<T> extends it.unimi.dsi.fastutil.objects.AbstractObject2LongFunction<T> implements java.io.Serializable, it.unimi.dsi.fastutil.Size64An immutable function stored in a compressed form.Instances of this class store a function from keys to values. Keys are provided by an iterable object (whose iterators must return elements in a consistent order), whereas values are provided by a
LongIterable. If you do not specify values, each key will be assigned its rank (e.g., its position in iteration order starting from zero).The values must have a skewed distribution: some values must be much more frequent than others. In that case, this data structure uses much less space than, say, a
GOV4Functionbecause it is able to use, for each key, a number of bits close to the empirical entropy of the value list (with an additional ≈3%). It is slower than aGV3CompressedFunction, and it takes more time to build, but it uses less space.For convenience, this class provides a main method that reads from standard input a (possibly
gzip'd) sequence of newline-separated strings, and writes a serialised function mapping each element of the list to its position, or to a given list of values.Building a function
This class provides a great amount of flexibility when creating a new function; such flexibility is exposed through the builder. To exploit the various possibilities, you must understand some details of the construction.
In a first phase, we build a
BucketedHashStorecontaining hashes of the keys. By default, the store will associate each hash with the rank of the key. If you specify values, the store will associate with each hash the corresponding value.However, if you further require an indirect construction the store will associate again each hash with the rank of the corresponding key, and access randomly the values (which must be either a
LongListor aLongBigList). Indirect construction is useful only in complex, multi-layer hashes (such as anLcpMonotoneMinimalPerfectHashFunction) in which we want to reuse a checkedBucketedHashStore. Storing values in theBucketedHashStoreis extremely scalable because the values must just be aLongIterablethat will be scanned sequentially during the store construction. On the other hand, if you have already a store that associates ordinal positions, and you want to build a new function for which aLongListorLongBigListof values needs little space (e.g., because it is described implicitly), you can opt for an indirect construction using the already built store.Note that if you specify a store it will be used before building a new one (possibly because of a
DuplicateException), with obvious benefits in terms of performance. If the store is not checked, and aDuplicateExceptionis thrown, the constructor will try to rebuild the store, but this requires, of course, that the keys, and possibly the values, are available. Note that it is your responsibility to pass a correct store.Multithreading
This implementation is multithreaded: each bucket returned by the
BucketedHashStoreis processed independently. By default, this class usesRuntime.availableProcessors()parallel threads, but by default no more than 4. If you wish to set a specific number of threads, you can do so through the system property "it.unimi.dsi.sux4j.mph.threads".Implementation Details
The detail of the data structure can be found in “Engineering Compressed Functions”, by Marco Genuzio and Sebastiano Vigna, 2017. The theoretical basis for the construction is described by Jóhannes B. Hreinsson, Morten Krøyer, and Rasmus Pagh in “Storing a compressed function with constant time access”, Algorithms - ESA 2009, 17th Annual European Symposium, 2009, pages 730−741. Each output value is represented by a codeword from a prefix-free code (by default, a length-limited Huffman code). We generate a random 4-regular linear system on F2, where the known term of the k-th equation is a bit in a bit array. When we read the bit array at four suitable positions depending on an input key, we can recover the codeword representing the output value.
- Since:
- 4.2.0
- Author:
- Sebastiano Vigna, Marco Genuzio
- See Also:
GV3CompressedFunction, Serialized Form
-
-
Nested Class Summary
Nested Classes Modifier and Type Class Description static classGV4CompressedFunction.Builder<T>
-
Field Summary
Fields Modifier and Type Field Description static intBUCKET_SIZEThe expected bucket size.protected it.unimi.dsi.bits.BitVectordataA bit vector storing the main data array.protected Codec.DecoderdecoderThe decoder that will be used to yield output values.static doubleDELTAstatic intDELTA_TIMES_256protected intescapedSymbolLengthCodec.Decoder.escapedSymbolLength()fromdecoder, cached.protected intescapeLengthCodec.Decoder.escapeLength()fromdecoder, cached.protected intglobalMaxCodewordLengthLength of longest codewordprotected longglobalSeedThe seed used to generate the initial signature.protected longnThe number of keys.static java.lang.StringNUMBER_OF_THREADS_PROPERTYThe system property used to set the number of parallel threads.protected static intOFFSET_BITSprotected long[]offsetAndSeedprotected static intSEED_BITSprotected it.unimi.dsi.bits.TransformationStrategy<? super T>transformThe transformation strategy to turn objects of typeTinto bit vectors.
-
Constructor Summary
Constructors Modifier Constructor Description protectedGV4CompressedFunction(java.lang.Iterable<? extends T> keys, it.unimi.dsi.bits.TransformationStrategy<? super T> transform, it.unimi.dsi.fastutil.longs.LongIterable values, boolean indirect, java.io.File tempDir, BucketedHashStore<T> bucketedHashStore, Codec codec)Creates a new function for the given keys and values.
-
Method Summary
All Methods Static Methods Instance Methods Concrete Methods Deprecated Methods Modifier and Type Method Description booleancontainsKey(java.lang.Object o)voiddump(java.lang.String file)longgetLong(java.lang.Object o)static voidmain(java.lang.String[] arg)longnumBits()Returns the number of bits used by this structure.intsize()Deprecated.longsize64()Returns the number of keys in the function domain.-
Methods inherited from class it.unimi.dsi.fastutil.objects.AbstractObject2LongFunction
defaultReturnValue, defaultReturnValue
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface it.unimi.dsi.fastutil.objects.Object2LongFunction
andThen, andThenByte, andThenChar, andThenDouble, andThenFloat, andThenInt, andThenLong, andThenObject, andThenReference, andThenShort, applyAsLong, composeByte, composeChar, composeDouble, composeFloat, composeInt, composeLong, composeObject, composeReference, composeShort, get, getOrDefault, getOrDefault, put, put, remove, removeLong
-
-
-
-
Field Detail
-
SEED_BITS
protected static final int SEED_BITS
- See Also:
- Constant Field Values
-
OFFSET_BITS
protected static final int OFFSET_BITS
- See Also:
- Constant Field Values
-
NUMBER_OF_THREADS_PROPERTY
public static final java.lang.String NUMBER_OF_THREADS_PROPERTY
The system property used to set the number of parallel threads.- See Also:
- Constant Field Values
-
DELTA
public static final double DELTA
- See Also:
- Constant Field Values
-
DELTA_TIMES_256
public static final int DELTA_TIMES_256
-
BUCKET_SIZE
public static final int BUCKET_SIZE
The expected bucket size.- See Also:
- Constant Field Values
-
n
protected final long n
The number of keys.
-
globalMaxCodewordLength
protected final int globalMaxCodewordLength
Length of longest codeword
-
globalSeed
protected long globalSeed
The seed used to generate the initial signature.
-
offsetAndSeed
protected final long[] offsetAndSeed
-
data
protected final it.unimi.dsi.bits.BitVector data
A bit vector storing the main data array.
-
transform
protected final it.unimi.dsi.bits.TransformationStrategy<? super T> transform
The transformation strategy to turn objects of typeTinto bit vectors.
-
decoder
protected final Codec.Decoder decoder
The decoder that will be used to yield output values.
-
escapeLength
protected final int escapeLength
Codec.Decoder.escapeLength()fromdecoder, cached.
-
escapedSymbolLength
protected final int escapedSymbolLength
Codec.Decoder.escapedSymbolLength()fromdecoder, cached.
-
-
Constructor Detail
-
GV4CompressedFunction
protected GV4CompressedFunction(java.lang.Iterable<? extends T> keys, it.unimi.dsi.bits.TransformationStrategy<? super T> transform, it.unimi.dsi.fastutil.longs.LongIterable values, boolean indirect, java.io.File tempDir, BucketedHashStore<T> bucketedHashStore, Codec codec) throws java.io.IOException
Creates a new function for the given keys and values.- Parameters:
keys- the keys in the domain of the function, ornull.transform- a transformation strategy for the keys.values- values to be assigned to each element, in the same order of the iterator returned bykeys; ifnull, the assigned value will the ordinal number of each element.indirect- if true,bucketedHashStorecontains ordinal positions, andvaluesis aLongIterablethat must be accessed to retrieve the actual values.tempDir- a temporary directory for the store files, ornullfor the standard temporary directory.bucketedHashStore- a bucketed hash store containing the keys associated with their values and counting value frequencies, ornull; the store can be unchecked, but in this casekeysandtransformmust be non-null.codec- theCodecused to encode values.- Throws:
java.io.IOException
-
-
Method Detail
-
getLong
public long getLong(java.lang.Object o)
- Specified by:
getLongin interfaceit.unimi.dsi.fastutil.objects.Object2LongFunction<T>
-
size64
public long size64()
Returns the number of keys in the function domain.- Specified by:
size64in interfaceit.unimi.dsi.fastutil.Size64- Returns:
- the number of the keys in the function domain.
-
size
@Deprecated public int size()
Deprecated.- Specified by:
sizein interfaceit.unimi.dsi.fastutil.Function<T,java.lang.Long>- Specified by:
sizein interfaceit.unimi.dsi.fastutil.Size64
-
numBits
public long numBits()
Returns the number of bits used by this structure.- Returns:
- the number of bits used by this structure.
-
containsKey
public boolean containsKey(java.lang.Object o)
- Specified by:
containsKeyin interfaceit.unimi.dsi.fastutil.Function<T,java.lang.Long>
-
dump
public void dump(java.lang.String file) throws java.io.IOException- Throws:
java.io.IOException
-
main
public static void main(java.lang.String[] arg) throws java.lang.NoSuchMethodException, java.io.IOException, com.martiansoftware.jsap.JSAPException- Throws:
java.lang.NoSuchMethodExceptionjava.io.IOExceptioncom.martiansoftware.jsap.JSAPException
-
-