Package morfologik.stemming
Class DictionaryLookup
- java.lang.Object
  - morfologik.stemming.DictionaryLookup
-
-
Field Summary
Fields
Modifier and Type | Field | Description
private java.nio.ByteBuffer | byteBuffer | Internal reusable buffer for encoding words into byte arrays using encoder.
private java.nio.CharBuffer | charBuffer | Internal reusable buffer for encoding words into byte arrays using encoder.
private java.nio.charset.CharsetDecoder | decoder | Charset decoder for the FSA.
private Dictionary | dictionary | The Dictionary this lookup is using.
private DictionaryMetadata | dictionaryMetadata | Features of the compiled dictionary.
private java.nio.charset.CharsetEncoder | encoder | Charset encoder for the FSA.
private static int | EXPAND_SIZE | Expand buffers and arrays by this constant.
private ByteSequenceIterator | finalStatesIterator | An iterator for walking along the final states of fsa.
private WordData[] | forms | Private internal array of reusable word data objects.
private ArrayViewList<WordData> | formsList | A "view" over an array implementing
private FSA | fsa | The FSA we are using.
private FSATraversal | matcher | An FSA used for lookups.
private MatchResult | matchResult | Reusable match result.
private int | rootNode | FSA's root node.
private char | separatorChar |
private ISequenceEncoder | sequenceEncoder |
-
Constructor Summary
Constructors
Constructor | Description
DictionaryLookup(Dictionary dictionary) | Creates a new object of this class using the given FSA for word lookups and encoding for converting characters to bytes.
-
Method Summary
Modifier and Type | Method | Description
static java.lang.String | applyReplacements(java.lang.CharSequence word, java.util.LinkedHashMap<java.lang.String,java.lang.String> replacements) | Apply partial string replacements from a given map.
Dictionary | getDictionary() |
char | getSeparatorChar() |
java.util.Iterator<WordData> | iterator() | Return an iterator over all WordData entries available in the embedded Dictionary.
java.util.List<WordData> | lookup(java.lang.CharSequence word) | Searches the automaton for a symbol sequence equal to word, followed by a separator.
-
-
-
Field Detail
-
matcher
private final FSATraversal matcher
An FSA used for lookups.
-
finalStatesIterator
private final ByteSequenceIterator finalStatesIterator
An iterator for walking along the final states of fsa.
-
rootNode
private final int rootNode
FSA's root node.
-
EXPAND_SIZE
private static final int EXPAND_SIZE
Expand buffers and arrays by this constant.
- See Also:
- Constant Field Values
-
forms
private WordData[] forms
Private internal array of reusable word data objects.
-
formsList
private final ArrayViewList<WordData> formsList
A "view" over an array implementing
-
dictionaryMetadata
private final DictionaryMetadata dictionaryMetadata
Features of the compiled dictionary.
- See Also:
DictionaryMetadata
-
encoder
private final java.nio.charset.CharsetEncoder encoder
Charset encoder for the FSA.
-
decoder
private final java.nio.charset.CharsetDecoder decoder
Charset decoder for the FSA.
-
fsa
private final FSA fsa
The FSA we are using.
-
separatorChar
private final char separatorChar
- See Also:
getSeparatorChar()
-
byteBuffer
private java.nio.ByteBuffer byteBuffer
Internal reusable buffer for encoding words into byte arrays using encoder.
-
charBuffer
private java.nio.CharBuffer charBuffer
Internal reusable buffer for encoding words into byte arrays using encoder.
-
matchResult
private final MatchResult matchResult
Reusable match result.
-
dictionary
private final Dictionary dictionary
The Dictionary this lookup is using.
-
sequenceEncoder
private final ISequenceEncoder sequenceEncoder
-
-
Constructor Detail
-
DictionaryLookup
public DictionaryLookup(Dictionary dictionary) throws java.lang.IllegalArgumentException
Creates a new object of this class using the FSA of the given dictionary for word lookups and its encoding for converting characters to bytes.
- Parameters:
dictionary - The dictionary to use for lookups.
- Throws:
java.lang.IllegalArgumentException - if the FSA's root node cannot be acquired (dictionary is empty).
-
-
Method Detail
-
lookup
public java.util.List<WordData> lookup(java.lang.CharSequence word)
Searches the automaton for a symbol sequence equal to word, followed by a separator. The result is a stem (decompressed according to the dictionary's specification) and optional tag data.
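As a rough illustration of the result layout described above: once word and the separator have been matched, the rest of an entry can be read as an encoded stem, optionally followed by another separator and tag data. The sketch below works on already-decoded strings for readability. The names EntryRemainderSketch and splitRemainder, the '+' separator and the sample entry are hypothetical and not part of morfologik; decompressing the stem is deliberately left out.

```java
class EntryRemainderSketch {
    /**
     * Splits the part of an entry that follows "word + separator" into
     * { encodedStem, tag }. The tag is null when no further separator is present.
     * Hypothetical helper for illustration only; not a morfologik API.
     */
    static String[] splitRemainder(String remainder, char separator) {
        int sepPos = remainder.indexOf(separator);
        if (sepPos < 0) {
            // No second separator: the whole remainder is the encoded stem.
            return new String[] { remainder, null };
        }
        return new String[] {
            remainder.substring(0, sepPos),   // encoded stem
            remainder.substring(sepPos + 1)   // tag data
        };
    }

    public static void main(String[] args) {
        String[] withTag = splitRemainder("Ab+noun:pl", '+');
        System.out.println(withTag[0] + " / " + withTag[1]);

        String[] withoutTag = splitRemainder("Ab", '+');
        System.out.println(withoutTag[0] + " / " + withoutTag[1]);
    }
}
```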
-
applyReplacements
public static java.lang.String applyReplacements(java.lang.CharSequence word, java.util.LinkedHashMap<java.lang.String,java.lang.String> replacements)
Apply partial string replacements from a given map. Useful if the word needs to be normalized somehow (e.g., ligatures, apostrophes and such).
- Parameters:
word - The word to apply replacements to.
replacements - A map of replacements (from->to).
- Returns:
- A new string with all replacements applied.
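One way such ordered partial replacements could work is sketched below using only the standard library. This is not the library's implementation: the class ReplacementSketch and the method replaceAllInOrder are hypothetical names, and the sketch assumes that entries are applied in the map's insertion order and that text produced by one replacement is not rescanned for the same key.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ReplacementSketch {
    /** Applies each from->to pair in insertion order to every occurrence in the word. */
    static String replaceAllInOrder(CharSequence word, LinkedHashMap<String, String> replacements) {
        StringBuilder sb = new StringBuilder(word);
        for (Map.Entry<String, String> e : replacements.entrySet()) {
            String from = e.getKey();
            String to = e.getValue();
            if (from.isEmpty()) {
                continue; // an empty key would match everywhere; skip it
            }
            int index = sb.indexOf(from);
            while (index != -1) {
                sb.replace(index, index + from.length(), to);
                // Continue searching after the inserted text to avoid rescanning it.
                index = sb.indexOf(from, index + to.length());
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        LinkedHashMap<String, String> replacements = new LinkedHashMap<>();
        replacements.put("\uFB01", "fi"); // LATIN SMALL LIGATURE FI
        replacements.put("\u2019", "'");  // RIGHT SINGLE QUOTATION MARK
        System.out.println(replaceAllInOrder("\uFB01sh\u2019n", replacements));
    }
}
```

Because a LinkedHashMap iterates in insertion order, later entries in this sketch see the output of earlier ones, so the order in which pairs are added can change the result.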
-
iterator
public java.util.Iterator<WordData> iterator()
Return an iterator over all WordData entries available in the embedded Dictionary.
- Specified by:
iterator in interface java.lang.Iterable<WordData>
-
getDictionary
public Dictionary getDictionary()
- Returns:
- The Dictionary used by this object.
-
getSeparatorChar
public char getSeparatorChar()
- Returns:
- The logical separator character splitting the inflected form, lemma correction token and a tag. Note that this character is a best-effort conversion from a byte in DictionaryMetadata.separator and may not be valid in the target encoding (although this is highly unlikely).
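To show why a byte-to-char conversion of the separator is only best-effort, the sketch below decodes a single byte with a given charset using standard java.nio.charset classes. It is an assumption about how such a conversion might be done, not the library's code; SeparatorSketch and byteToChar are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SeparatorSketch {
    /** Decodes one separator byte into a char, failing if it is not a single valid character. */
    static char byteToChar(byte separator, Charset charset) {
        try {
            // A fresh decoder reports malformed or unmappable input as an exception.
            CharBuffer decoded = charset.newDecoder()
                .decode(ByteBuffer.wrap(new byte[] { separator }));
            if (decoded.remaining() != 1) {
                throw new IllegalArgumentException("Separator byte does not map to exactly one char.");
            }
            return decoded.get();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("Separator byte is not valid in " + charset, e);
        }
    }

    public static void main(String[] args) {
        // An ASCII byte decodes the same way in most common encodings.
        System.out.println(byteToChar((byte) '+', StandardCharsets.UTF_8));
        // A byte above 0x7F is a full character in ISO-8859-1 but an incomplete sequence in UTF-8.
        System.out.println(byteToChar((byte) 0xE9, StandardCharsets.ISO_8859_1));
    }
}
```

In this sketch, a separator byte in the ASCII range converts cleanly, while a byte above 0x7F may fail in a multi-byte encoding such as UTF-8.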
-
-