Package morfologik.fsa.builders
Class FSABuilder
java.lang.Object
morfologik.fsa.builders.FSABuilder
Fast, memory-conservative finite state automaton builder, returning an
in-memory
FSA that is a tradeoff between construction speed and
memory consumption. Use serializers to compress the returned automaton into
more compact form.- See Also:
-
Nested Class Summary
Nested ClassesModifier and TypeClassDescriptionstatic enumDebug and information constants. -
Field Summary
FieldsModifier and TypeFieldDescriptionprivate int[]States on the "active path" (still mutable).private intCurrent length of the active path.private static final intInternal serialized FSA buffer expand ratio.private final intInternal serialized FSA buffer expand ratio.private intAn epsilon state.private int[]Hash set of state addresses inserialized, hashed byhash(int, int).private intNumber of entries currently stored inhashSet.private TreeMap<FSABuilder.InfoEntry, Object> Information about the automaton and its compilation.static final Comparator<byte[]> A comparator comparing full byte arrays.private static final intMaximum number of labels from a single state.private static final intA megabyte.private int[]The next offset at which an arc will be added to the given state onactivePath.private byte[]Previous sequence added to the automaton inadd(byte[], int, int).private intprevioussequence's length, used in assertions only.private intRoot state.private intNumber of serialization buffer reallocations.private byte[]Holds serialized and mutable states.private intNumber of bytes already taken inserialized. -
Constructor Summary
Constructors -
Method Summary
Modifier and TypeMethodDescriptionvoidadd(byte[] sequence, int start, int len) Add a single sequence of bytes to the FSA.private intallocateState(int labels) Allocate space for a state with the given number of outgoing labels.static FSAbuild(byte[][] input) Build a minimal, deterministic automaton from a sorted list of byte sequences.static FSABuild a minimal, deterministic automaton from an iterable list of byte sequences.private intcommonPrefix(byte[] sequence, int start, int len) private static intcompare(byte[] s1, int start1, int lens1, byte[] s2, int start2, int lens2) Lexicographic order of input sequences.complete()private booleanequivalent(int start1, int start2, int len) Returntrueif two regions inserializedare identical.private voidexpandActivePath(int size) Append a new mutable state to the active path.private voidReallocate and rehash the hash set.private voidExpand internal buffers for the next state.private intfreezeState(int activePathIndex) Freeze a state: try to find an equivalent state in the interned states dictionary first, if found, return it, otherwise, serialize the mutable state atactivePathIndexand return it.private bytegetArcLabel(int arc) Get label's arc.private intgetArcTarget(int arc) Returns the address of an arc.getInfo()private inthash(int start, int byteCount) Hash code of a fragment ofserializedarray.private booleanisArcFinal(int arc) Is this arc final?private booleanisArcLast(int arc) Is this arc the state's last?private intserialize(int activePathIndex) Serialize a given state on the active path.private voidsetArcTarget(int arc, int state) Fills the target state address of an arc.private booleansetPrevious(byte[] sequence, int start, int length) Copycurrentinto an internal buffer.private intstateLength(int state) The total length of the serialized state data (all arcs).
-
Field Details
-
MB
private static final int MBA megabyte.- See Also:
-
BUFFER_GROWTH_SIZE
private static final int BUFFER_GROWTH_SIZEInternal serialized FSA buffer expand ratio.- See Also:
-
MAX_LABELS
private static final int MAX_LABELSMaximum number of labels from a single state.- See Also:
-
LEXICAL_ORDERING
A comparator comparing full byte arrays. Unsigned byte comparisons ('C'-locale). -
bufferGrowthSize
private final int bufferGrowthSizeInternal serialized FSA buffer expand ratio. -
serialized
private byte[] serializedHolds serialized and mutable states. Each state is a sequential list of arcs, the last arc is marked with.invalid reference
#BIT_ARC_LAST -
size
private int sizeNumber of bytes already taken inserialized. Start from 1 to keep 0 a sentinel value (for the hash set and final state). -
activePath
private int[] activePathStates on the "active path" (still mutable). Values are addresses of each state's first arc. -
activePathLen
private int activePathLenCurrent length of the active path. -
nextArcOffset
private int[] nextArcOffsetThe next offset at which an arc will be added to the given state onactivePath. -
root
private int rootRoot state. If negative, the automaton has been built already and cannot be extended. -
epsilon
private int epsilonAn epsilon state. The first and only arc of this state points either to the root or to the terminal state, indicating an empty automaton. -
hashSet
private int[] hashSetHash set of state addresses inserialized, hashed byhash(int, int). Zero reserved for an unoccupied slot. -
hashSize
private int hashSizeNumber of entries currently stored inhashSet. -
previous
private byte[] previousPrevious sequence added to the automaton inadd(byte[], int, int). Used in assertions only. -
info
Information about the automaton and its compilation. -
previousLength
private int previousLengthprevioussequence's length, used in assertions only. -
serializationBufferReallocations
private int serializationBufferReallocationsNumber of serialization buffer reallocations.
-
-
Constructor Details
-
FSABuilder
public FSABuilder() -
FSABuilder
public FSABuilder(int bufferGrowthSize) - Parameters:
bufferGrowthSize- Buffer growth size (in bytes) when constructing the automaton.
-
-
Method Details
-
add
public void add(byte[] sequence, int start, int len) Add a single sequence of bytes to the FSA. The input must be lexicographically greater than any previously added sequence.- Parameters:
sequence- The array holding input sequence of bytes.start- Starting offset (inclusive)len- Length of the input sequence (at least 1 byte).
-
complete
- Returns:
- Finalizes the construction of the automaton and returns it.
-
build
Build a minimal, deterministic automaton from a sorted list of byte sequences.- Parameters:
input- Input sequences to build automaton from.- Returns:
- Returns the automaton encoding all input sequences.
-
build
Build a minimal, deterministic automaton from an iterable list of byte sequences.- Parameters:
input- Input sequences to build automaton from.- Returns:
- Returns the automaton encoding all input sequences.
-
getInfo
- Returns:
- Returns various statistics concerning the FSA and its compilation.
- See Also:
-
isArcLast
private boolean isArcLast(int arc) Is this arc the state's last? -
isArcFinal
private boolean isArcFinal(int arc) Is this arc final? -
getArcLabel
private byte getArcLabel(int arc) Get label's arc. -
setArcTarget
private void setArcTarget(int arc, int state) Fills the target state address of an arc. -
getArcTarget
private int getArcTarget(int arc) Returns the address of an arc. -
commonPrefix
private int commonPrefix(byte[] sequence, int start, int len) - Returns:
- The number of common prefix characters with the previous sequence.
-
freezeState
private int freezeState(int activePathIndex) Freeze a state: try to find an equivalent state in the interned states dictionary first, if found, return it, otherwise, serialize the mutable state atactivePathIndexand return it. -
expandAndRehash
private void expandAndRehash()Reallocate and rehash the hash set. -
stateLength
private int stateLength(int state) The total length of the serialized state data (all arcs). -
equivalent
private boolean equivalent(int start1, int start2, int len) Returntrueif two regions inserializedare identical. -
serialize
private int serialize(int activePathIndex) Serialize a given state on the active path. -
hash
private int hash(int start, int byteCount) Hash code of a fragment ofserializedarray. -
expandActivePath
private void expandActivePath(int size) Append a new mutable state to the active path. -
expandBuffers
private void expandBuffers()Expand internal buffers for the next state. -
allocateState
private int allocateState(int labels) Allocate space for a state with the given number of outgoing labels.- Returns:
- state offset
-
setPrevious
private boolean setPrevious(byte[] sequence, int start, int length) Copycurrentinto an internal buffer. -
compare
private static int compare(byte[] s1, int start1, int lens1, byte[] s2, int start2, int lens2) Lexicographic order of input sequences. By default, consistent with the "C" sort (absolute value of bytes, 0-255).
-