Package morfologik.fsa
Class CFSA
- java.lang.Object
-
- morfologik.fsa.FSA
-
- morfologik.fsa.CFSA
-
- All Implemented Interfaces:
java.lang.Iterable<java.nio.ByteBuffer>
public final class CFSA extends FSA
CFSA (Compact Finite State Automaton) binary format implementation. This is a slightly reorganized version ofFSA5offering smaller automata size at some (minor) performance penalty.Note: Serialize to
CFSA2for new code.The encoding of automaton body is as follows.
---- FSA header (standard) Byte Description +-+-+-+-+-+-+-+-+\ 0 | | | | | | | | | +------ '\' +-+-+-+-+-+-+-+-+/ +-+-+-+-+-+-+-+-+\ 1 | | | | | | | | | +------ 'f' +-+-+-+-+-+-+-+-+/ +-+-+-+-+-+-+-+-+\ 2 | | | | | | | | | +------ 's' +-+-+-+-+-+-+-+-+/ +-+-+-+-+-+-+-+-+\ 3 | | | | | | | | | +------ 'a' +-+-+-+-+-+-+-+-+/ +-+-+-+-+-+-+-+-+\ 4 | | | | | | | | | +------ version (fixed 0xc5) +-+-+-+-+-+-+-+-+/ +-+-+-+-+-+-+-+-+\ 5 | | | | | | | | | +------ filler character +-+-+-+-+-+-+-+-+/ +-+-+-+-+-+-+-+-+\ 6 | | | | | | | | | +------ annot character +-+-+-+-+-+-+-+-+/ +-+-+-+-+-+-+-+-+\ 7 |C|C|C|C|G|G|G|G| +------ C - node data size (ctl), G - address size (gotoLength) +-+-+-+-+-+-+-+-+/ +-+-+-+-+-+-+-+-+\ 8-32 | | | | | | | | | +------ labels mapped for type (1) of arc encoding. : : : : : : : : : | +-+-+-+-+-+-+-+-+/ ---- Start of a node; only if automaton was compiled with NUMBERS option. Byte +-+-+-+-+-+-+-+-+\ 0 | | | | | | | | | \ LSB +-+-+-+-+-+-+-+-+ + 1 | | | | | | | | | | number of strings recognized +-+-+-+-+-+-+-+-+ +----- by the automaton starting : : : : : : : : : | from this node. +-+-+-+-+-+-+-+-+ + ctl-1 | | | | | | | | | / MSB +-+-+-+-+-+-+-+-+/ ---- A vector of node's arcs. Conditional format, depending on flags. 1) NEXT bit set, mapped arc label. +--------------- arc's label mapped in M bits if M's field value > 0 | +------------- node pointed to is next | | +----------- the last arc of the node _______| | | +--------- the arc is final / | | | | +-+-+-+-+-+-+-+-+\ 0 |M|M|M|M|M|1|L|F| +------ flags + (M) index of the mapped label. +-+-+-+-+-+-+-+-+/ 2) NEXT bit set, label separate. +--------------- arc's label stored separately (M's field is zero). | +------------- node pointed to is next | | +----------- the last arc of the node | | | +--------- the arc is final | | | | +-+-+-+-+-+-+-+-+\ 0 |0|0|0|0|0|1|L|F| +------ flags +-+-+-+-+-+-+-+-+/ +-+-+-+-+-+-+-+-+\ 1 | | | | | | | | | +------ label +-+-+-+-+-+-+-+-+/ 3) NEXT bit not set. Full arc. +------------- node pointed to is next | +----------- the last arc of the node | | +--------- the arc is final | | | +-+-+-+-+-+-+-+-+\ 0 |A|A|A|A|A|0|L|F| +------ flags + (A) address field, lower bits +-+-+-+-+-+-+-+-+/ +-+-+-+-+-+-+-+-+\ 1 | | | | | | | | | +------ label +-+-+-+-+-+-+-+-+/ : : : : : : : : : +-+-+-+-+-+-+-+-+\ gtl-1 |A|A|A|A|A|A|A|A| +------ address, continuation (MSB) +-+-+-+-+-+-+-+-+/
-
-
Field Summary
Fields Modifier and Type Field Description byte[]arcsAn array of bytes with the internal representation of the automaton.static intBIT_FINAL_ARCBitmask indicating that an arc corresponds to the last character of a sequence available when building the automaton.static intBIT_LAST_ARCBitmask indicating that an arc is the last one of the node's list and the following one belongs to another node.static intBIT_TARGET_NEXTBitmask indicating that the target node of this arc follows it in the compressed automaton structure (no goto field).private java.util.Set<FSAFlags>flagsFlags for this automaton version.intgtlNumber of bytes each address takes in full, expanded form (goto length).byte[]labelMappingLabel mapping for arcs of type (1) (see class documentation).intnodeDataLengthThe length of the node header structure (if the automaton was compiled withNUMBERSoption).static byteVERSIONAutomaton header version value.
-
Constructor Summary
Constructors Constructor Description CFSA(java.io.InputStream stream)Creates a new automaton, reading it from a file in FSA format, version 5.
-
Method Summary
All Methods Instance Methods Concrete Methods Modifier and Type Method Description intgetArc(int node, byte label)bytegetArcLabel(int arc)(package private) intgetDestinationNodeOffset(int arc)Returns the address of the node pointed to by this arc.intgetEndNode(int arc)intgetFirstArc(int node)java.util.Set<FSAFlags>getFlags()intgetNextArc(int arc)intgetRightLanguageCount(int node)intgetRootNode()Returns the start node of this automaton.booleanisArcFinal(int arc)booleanisArcLast(int arc)Returnstrueif this arc hasNEXTbit set.booleanisArcTerminal(int arc)booleanisLabelCompressed(int arc)booleanisNextSet(int arc)private intskipArc(int offset)Read the arc's layout and skip as many bytes, as needed, to skip it.-
Methods inherited from class morfologik.fsa.FSA
getArcCount, getSequences, getSequences, iterator, read, read, readRemaining, visitAllStates, visitInPostOrder, visitInPostOrder, visitInPreOrder, visitInPreOrder
-
-
-
-
Field Detail
-
VERSION
public static final byte VERSION
Automaton header version value.- See Also:
- Constant Field Values
-
BIT_FINAL_ARC
public static final int BIT_FINAL_ARC
Bitmask indicating that an arc corresponds to the last character of a sequence available when building the automaton.- See Also:
- Constant Field Values
-
BIT_LAST_ARC
public static final int BIT_LAST_ARC
Bitmask indicating that an arc is the last one of the node's list and the following one belongs to another node.- See Also:
- Constant Field Values
-
BIT_TARGET_NEXT
public static final int BIT_TARGET_NEXT
Bitmask indicating that the target node of this arc follows it in the compressed automaton structure (no goto field).- See Also:
- Constant Field Values
-
arcs
public byte[] arcs
An array of bytes with the internal representation of the automaton. Please see the documentation of this class for more information on how this structure is organized.
-
nodeDataLength
public final int nodeDataLength
The length of the node header structure (if the automaton was compiled withNUMBERSoption). Otherwise zero.
-
flags
private final java.util.Set<FSAFlags> flags
Flags for this automaton version.
-
gtl
public final int gtl
Number of bytes each address takes in full, expanded form (goto length).
-
labelMapping
public final byte[] labelMapping
Label mapping for arcs of type (1) (see class documentation). The array is indexed by mapped label's value and contains the original label.
-
-
Method Detail
-
getRootNode
public int getRootNode()
Returns the start node of this automaton. May return0if the start node is also an end node.- Specified by:
getRootNodein classFSA- Returns:
- Returns the identifier of the root node of this automaton. Returns 0 if the start node is also the end node (the automaton is empty).
-
getFirstArc
public final int getFirstArc(int node)
- Specified by:
getFirstArcin classFSA- Parameters:
node- Identifier of the node.- Returns:
- Returns the identifier of the first arc leaving
nodeor 0 if the node has no outgoing arcs.
-
getNextArc
public final int getNextArc(int arc)
- Specified by:
getNextArcin classFSA- Parameters:
arc- The arc's identifier.- Returns:
- Returns the identifier of the next arc after
arcand leavingnode. Zero is returned if no more arcs are available for the node.
-
getArc
public int getArc(int node, byte label)
-
getEndNode
public int getEndNode(int arc)
- Specified by:
getEndNodein classFSA- Parameters:
arc- The arc's identifier.- Returns:
- Return the end node pointed to by a given
arc. Terminal arcs (those that point to a terminal state) have no end node representation and throw a runtime exception.
-
getArcLabel
public byte getArcLabel(int arc)
- Specified by:
getArcLabelin classFSA- Parameters:
arc- The arc's identifier.- Returns:
- Return the label associated with a given
arc.
-
getRightLanguageCount
public int getRightLanguageCount(int node)
- Overrides:
getRightLanguageCountin classFSA- Parameters:
node- Identifier of the node.- Returns:
- Returns the number of sequences reachable from the given state if
the automaton was compiled with
FSAFlags.NUMBERS. The size of the right language of the state, in other words.
-
isArcFinal
public boolean isArcFinal(int arc)
- Specified by:
isArcFinalin classFSA- Parameters:
arc- The arc's identifier.- Returns:
- Returns
trueif the destination node at the end of thisarccorresponds to an input sequence created when building this automaton.
-
isArcTerminal
public boolean isArcTerminal(int arc)
- Specified by:
isArcTerminalin classFSA- Parameters:
arc- The arc's identifier.- Returns:
- Returns
trueif thisarcdoes not have a terminating node (@linkFSA.getEndNode(int)will throw an exception). ImpliesFSA.isArcFinal(int).
-
isArcLast
public boolean isArcLast(int arc)
Returnstrueif this arc hasNEXTbit set.- Parameters:
arc- The node's arc identifier.- Returns:
- Returns true if the argument is the last arc of a node.
- See Also:
BIT_LAST_ARC
-
isNextSet
public boolean isNextSet(int arc)
- Parameters:
arc- The node's arc identifier.- Returns:
- Returns true if
BIT_TARGET_NEXTis set for this arc. - See Also:
BIT_TARGET_NEXT
-
isLabelCompressed
public boolean isLabelCompressed(int arc)
- Parameters:
arc- The node's arc identifier.- Returns:
- Returns
trueif the label is compressed inside flags byte.
-
getFlags
public java.util.Set<FSAFlags> getFlags()
For this automaton version, an additional
FSAFlags.NUMBERSflag may be set to indicate the automaton contains extra fields for each node.
-
getDestinationNodeOffset
final int getDestinationNodeOffset(int arc)
Returns the address of the node pointed to by this arc.
-
skipArc
private int skipArc(int offset)
Read the arc's layout and skip as many bytes, as needed, to skip it.
-
-