Class DictionaryLookup
java.lang.Object
morfologik.stemming.DictionaryLookup
-
Field Summary
Fields
Modifier and Type, Field: Description
private ByteBuffer byteBuffer: Internal reusable buffer for encoding words into byte arrays using encoder.
private CharBuffer charBuffer: Internal reusable buffer for encoding words into byte arrays using encoder.
private final CharsetDecoder decoder: Charset decoder for the FSA.
private final Dictionary dictionary: The Dictionary this lookup is using.
private final DictionaryMetadata dictionaryMetadata: Features of the compiled dictionary.
private final CharsetEncoder encoder: Charset encoder for the FSA.
private static final int EXPAND_SIZE: Expand buffers and arrays by this constant.
private final ByteSequenceIterator finalStatesIterator: An iterator for walking along the final states of fsa.
private WordData[] forms: Private internal array of reusable word data objects.
private final ArrayViewList<WordData> formsList: A "view" over an array implementing
private final FSA fsa: The FSA we are using.
private final FSATraversal matcher: An FSA used for lookups.
private final MatchResult matchResult: Reusable match result.
private final int rootNode: FSA's root node.
private final char separatorChar
private final ISequenceEncoder sequenceEncoder
-
Constructor Summary
Constructors
Constructor: Description
DictionaryLookup(Dictionary dictionary): Creates a new object of this class using the given FSA for word lookups and encoding for converting characters to bytes.
-
Method Summary
Modifier and Type, Method: Description
static String applyReplacements(CharSequence word, LinkedHashMap<String, String> replacements): Apply partial string replacements from a given map.
Dictionary getDictionary()
char getSeparatorChar()
Iterator<WordData> iterator(): Return an iterator over all WordData entries available in the embedded Dictionary.
List<WordData> lookup(CharSequence word): Searches the automaton for a symbol sequence equal to word, followed by a separator.
Methods inherited from class Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface Iterable
forEach, spliterator
-
Field Details
-
matcher
An FSA used for lookups. -
finalStatesIterator
An iterator for walking along the final states of fsa. -
rootNode
private final int rootNode
FSA's root node. -
EXPAND_SIZE
private static final int EXPAND_SIZE
Expand buffers and arrays by this constant.
-
forms
Private internal array of reusable word data objects. -
formsList
A "view" over an array implementing -
dictionaryMetadata
-
encoder
Charset encoder for the FSA. -
decoder
Charset decoder for the FSA. -
fsa
The FSA we are using. -
separatorChar
private final char separatorChar
-
byteBuffer
Internal reusable buffer for encoding words into byte arrays using encoder. -
charBuffer
Internal reusable buffer for encoding words into byte arrays using encoder. -
matchResult
Reusable match result. -
dictionary
The Dictionary this lookup is using. -
sequenceEncoder
-
-
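The byteBuffer, charBuffer, encoder and EXPAND_SIZE fields above suggest a pattern of encoding each word into a reusable buffer that grows by a fixed step when it is too small. The following is a minimal, standard-library-only sketch of that pattern, not the class's actual code: the class name ReusableEncodeSketch, the value 10 for EXPAND_SIZE, and the UTF-8 charset are all assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in illustrating a reusable, growable encoding buffer.
final class ReusableEncodeSketch {
  // Assumed growth step; the real constant's value is not given in this page.
  private static final int EXPAND_SIZE = 10;

  // Assumed charset; a real dictionary would supply its own encoder.
  private final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
  private ByteBuffer byteBuffer = ByteBuffer.allocate(0);

  /**
   * Encodes the word into the shared buffer and returns it ready for reading.
   * The returned buffer is reused, so it must be consumed before the next call.
   */
  ByteBuffer encode(CharSequence word) {
    CharBuffer chars = CharBuffer.wrap(word);
    byteBuffer.clear();
    encoder.reset();
    while (true) {
      CoderResult result = encoder.encode(chars, byteBuffer, true);
      if (result.isUnderflow()) {
        result = encoder.flush(byteBuffer);
      }
      if (result.isUnderflow()) {
        break; // all input consumed and flushed
      }
      if (result.isOverflow()) {
        grow(); // not enough room: enlarge by EXPAND_SIZE and retry
        continue;
      }
      throw new IllegalArgumentException("Cannot encode: " + word);
    }
    byteBuffer.flip();
    return byteBuffer;
  }

  // Allocates a larger buffer and copies over the bytes written so far.
  private void grow() {
    ByteBuffer larger = ByteBuffer.allocate(byteBuffer.capacity() + EXPAND_SIZE);
    byteBuffer.flip();
    larger.put(byteBuffer);
    byteBuffer = larger;
  }

  public static void main(String[] args) {
    ReusableEncodeSketch sketch = new ReusableEncodeSketch();
    // Four characters, three of them two bytes each in UTF-8.
    System.out.println(sketch.encode("\u017C\u00F3\u0142w").remaining()); // prints 7
  }
}
```

Reusing one buffer avoids allocating a new byte array per lookup, at the cost that callers must not hold on to the returned buffer across calls.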
Constructor Details
-
DictionaryLookup
Creates a new object of this class using the given dictionary's FSA for word lookups and its encoding for converting characters to bytes.
Parameters:
dictionary - The dictionary to use for lookups.
Throws:
IllegalArgumentException - if the FSA's root node cannot be acquired (the dictionary is empty).
-
-
Method Details
-
lookup
Searches the automaton for a symbol sequence equal to word, followed by a separator. The result is a stem (decompressed according to the dictionary's specification) and optional tag data. -
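To make the "word, followed by a separator" idea concrete, here is a toy model that uses only the standard library. It is not the FSA-based implementation: entries are kept as plain strings in a sorted set, stems are stored uncompressed, and the class name SeparatorLookupSketch, the '+' separator, and the {stem, tag} array result are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical toy model: entries are "inflected+stem" or "inflected+stem+tag".
final class SeparatorLookupSketch {
  static final char SEPARATOR = '+'; // assumed separator character

  private final TreeSet<String> entries = new TreeSet<>();

  void add(String inflected, String stem, String tag) {
    String entry = inflected + SEPARATOR + stem;
    if (tag != null) {
      entry += SEPARATOR + tag;
    }
    entries.add(entry);
  }

  /** Returns one {stem, tag} pair per matching entry; tag is null when absent. */
  List<String[]> lookup(CharSequence word) {
    // Only entries that continue with the separator right after the word match,
    // so "cat" does not match the entry for "cats".
    String prefix = word.toString() + SEPARATOR;
    List<String[]> result = new ArrayList<>();
    for (String entry : entries.tailSet(prefix)) {
      if (!entry.startsWith(prefix)) {
        break; // sorted order: no further entry can share the prefix
      }
      String rest = entry.substring(prefix.length());
      int sep = rest.indexOf(SEPARATOR);
      String stem = sep < 0 ? rest : rest.substring(0, sep);
      String tag = sep < 0 ? null : rest.substring(sep + 1);
      result.add(new String[] {stem, tag});
    }
    return result;
  }

  public static void main(String[] args) {
    SeparatorLookupSketch dict = new SeparatorLookupSketch();
    dict.add("cats", "cat", "NNS");
    dict.add("cat", "cat", "NN");
    dict.add("ran", "run", null);
    for (String[] hit : dict.lookup("cats")) {
      System.out.println(hit[0] + " / " + hit[1]); // prints cat / NNS
    }
  }
}
```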
applyReplacements
public static String applyReplacements(CharSequence word, LinkedHashMap<String, String> replacements)
Apply partial string replacements from a given map. Useful if the word needs to be normalized somehow (e.g., ligatures, apostrophes and such).
Parameters:
word - The word to apply replacements to.
replacements - A map of replacements (from -> to).
Returns:
new string with all replacements applied.
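As an illustration of the contract described above, the sketch below applies each from -> to pair to every occurrence in the word, in the map's iteration order. It is a standard-library stand-in and not necessarily how the library implements the method; the class and method names (ReplacementsSketch.applyAll) and the assumption that pairs are applied in insertion order are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in showing partial string replacements driven by a map.
final class ReplacementsSketch {
  /** Replaces every occurrence of each key with its value, in map order. */
  static String applyAll(CharSequence word, LinkedHashMap<String, String> replacements) {
    StringBuilder sb = new StringBuilder(word);
    for (Map.Entry<String, String> e : replacements.entrySet()) {
      String from = e.getKey();
      String to = e.getValue();
      if (from.isEmpty()) {
        continue; // an empty key would match everywhere and never terminate
      }
      int index = sb.indexOf(from);
      while (index != -1) {
        sb.replace(index, index + from.length(), to);
        // Resume after the inserted text so it is not scanned again.
        index = sb.indexOf(from, index + to.length());
      }
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    LinkedHashMap<String, String> replacements = new LinkedHashMap<>();
    replacements.put("\u2019", "'");  // typographic apostrophe to ASCII apostrophe
    replacements.put("\uFB01", "fi"); // "fi" ligature to two letters
    System.out.println(applyAll("\uFB01sh\u2019s", replacements)); // prints fish's
  }
}
```

With the real class, the equivalent call would be the static DictionaryLookup.applyReplacements(word, replacements) documented above.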
-
iterator
-
getDictionary
Returns:
the Dictionary used by this object.
-
getSeparatorChar
public char getSeparatorChar()
Returns:
the logical separator character splitting the inflected form, lemma correction token and a tag. Note that this character is a best-effort conversion from a byte in DictionaryMetadata.separator and may not be valid in the target encoding (although this is highly unlikely).
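The "best-effort conversion from a byte" can be pictured as decoding a single byte with the dictionary's charset and checking that exactly one character comes out. The sketch below shows one way to do that with the standard library. It is an assumption about the general technique, not the library's code, and the names SeparatorCharSketch and toSeparatorChar are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: converts a single separator byte into a char.
final class SeparatorCharSketch {
  static char toSeparatorChar(byte separator, Charset charset) {
    try {
      CharBuffer decoded = charset.newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(new byte[] {separator}));
      if (decoded.remaining() != 1) {
        throw new IllegalArgumentException(
            "Separator byte does not map to exactly one character.");
      }
      return decoded.get();
    } catch (CharacterCodingException e) {
      // The byte alone is not a valid sequence in this encoding.
      throw new IllegalArgumentException(
          "Separator byte is not valid in " + charset.name(), e);
    }
  }

  public static void main(String[] args) {
    System.out.println(toSeparatorChar((byte) '+', StandardCharsets.UTF_8)); // prints +
  }
}
```

An ASCII separator such as '+' decodes identically in most common encodings, which is consistent with the note that an invalid result is highly unlikely.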
-