Class Speller
java.lang.Object
morfologik.speller.Speller
Finds spelling suggestions. Implements K. Oflazer's algorithm as described
in: Oflazer, Kemal. 1996.
"Error-Tolerant Finite-State Recognition with Applications to Morphological Analysis and Spelling Correction."
Computational Linguistics 22 (1): 73–89.
See Jan Daciuk's s_fsa package.
-
Nested Class Summary
Nested Classes
final class Speller.CandidateData
    Used to sort candidates according to edit distance, and possibly according to their frequency in the future.
-
Field Summary
Fields
private ByteBuffer byteBuffer
    Internal reusable buffer for encoding words into byte arrays using encoder.
private char[] candidate
private int candLen
private CharBuffer charBuffer
    Internal reusable buffer for encoding words into byte arrays using encoder.
private boolean containsSeparators
private final CharsetDecoder decoder
    Charset decoder for the FSA.
private final DictionaryMetadata dictionaryMetadata
    Features of the compiled dictionary.
private final int editDistance
private int effectEditDistance
private final CharsetEncoder encoder
    Charset encoder for the FSA.
private final ByteSequenceIterator finalStatesIterator
    An iterator for walking along the final states of fsa.
(package private) static final int FIRST_RANGE_CODE
(package private) static final int FREQ_RANGES
private final FSA fsa
    The FSA we are using.
private final HMatrix hMatrix
private final FSATraversal matcher
    An FSA used for lookups.
private final MatchResult matchResult
    Reusable match result.
private static final int MAX_RECURSION_LEVEL
static final int MAX_WORD_LENGTH
    Maximum length of the word to be checked.
private static final int MIN_WORD_LENGTH
private final int rootNode
    FSA's root node.
(package private) static final int UPPER_SEARCH_LIMIT
private int wordLen
private char[] wordProcessed
-
Constructor Summary
Constructors
Speller(Dictionary dictionary)
Speller(Dictionary dictionary, int editDistance)
-
Method Summary
private void addReplacement(List<Speller.CandidateData> candidates, String replacement)
private boolean areEqual(char x, char y)
private ByteBuffer charSequenceToBytes
(package private) static boolean containsNoDigit(String s)
    Checks whether a string contains a digit.
boolean convertsCase()
    Used to determine whether the dictionary supports case conversions.
private void createReplacementsMaps()
int cuted(int depth, int wordIndex, int candIndex)
    Calculates cut-off edit distance.
int ed(int i, int j, int wordIndex, int candIndex)
    Calculates edit distance.
private void findRepl(List<Speller.CandidateData> candidates, int depth, int node, byte[] prevBytes, int wordIndex, int candIndex)
findReplacementCandidates(String word)
    Find and return suggestions by using K. Oflazer's algorithm.
private ArrayList<Speller.CandidateData> findReplacementCandidates(String word, boolean evenIfWordInDictionary)
findReplacements(String word)
    Find suggestions by using K. Oflazer's algorithm.
findSimilarWordCandidates(String word)
    Find similar words even if the original word is a correct word that exists in the dictionary.
findSimilarWords(String word)
getAllReplacements(String str, int fromIndex, int level)
final int getCandLen()
final int getEffectiveED()
int getFrequency(CharSequence word)
    Get the frequency value for a word form.
final int getWordLen()
private CharSequence initialUppercase(String wordToCheck)
(package private) boolean isAllUppercase(String str)
    Returns true if str is made up of all-uppercase characters (ignoring characters for which no upper-/lowercase distinction exists).
(package private) static boolean isAlphabetic(int codePoint)
    Copy-paste of Character.isAlphabetic() (needed as we require only Java 1.6).
private boolean isArcNotTerminal(int arc, int candIndex)
private boolean isBeforeSeparator(int arc)
boolean isCamelCase(String str)
private boolean isEndOfCandidate(int arc, int wordIndex)
boolean isInDictionary(CharSequence word)
    Test whether the word is found in the dictionary.
boolean isMisspelled(String word)
    Checks whether the word is misspelled, by performing a series of checks according to properties of the dictionary.
(package private) boolean isMixedCase(String str)
(package private) boolean isNotAllLowercase(String str)
    Returns true if str is not made up entirely of lowercase characters (ignoring characters for which no upper-/lowercase distinction exists).
(package private) boolean isNotCapitalizedWord(String str)
(package private) static boolean isNotEmpty(String str)
    Helper method to replace calls to "".equals().
private int matchAnyToOne(int wordIndex, int candIndex)
private int matchAnyToTwo(int wordIndex, int candIndex)
private static int min(int a, int b, int c)
replaceRunOnWordCandidates(String original)
    Propose suggestions for misspelled run-on words.
replaceRunOnWords(String original)
    Propose suggestions for misspelled run-on words.
(package private) void setWordAndCandidate(String word, String candidate)
    Sets up the word and candidate.
-
Field Details
-
MAX_WORD_LENGTH
public static final int MAX_WORD_LENGTH
Maximum length of the word to be checked.
-
FREQ_RANGES
static final int FREQ_RANGES
-
FIRST_RANGE_CODE
static final int FIRST_RANGE_CODE
-
UPPER_SEARCH_LIMIT
static final int UPPER_SEARCH_LIMIT
-
MIN_WORD_LENGTH
private static final int MIN_WORD_LENGTH
-
MAX_RECURSION_LEVEL
private static final int MAX_RECURSION_LEVEL
-
editDistance
private final int editDistance -
effectEditDistance
private int effectEditDistance -
hMatrix
private final HMatrix hMatrix
-
candidate
private char[] candidate -
candLen
private int candLen -
wordLen
private int wordLen -
wordProcessed
private char[] wordProcessed -
replacementsAnyToOne
-
replacementsAnyToTwo
-
replacementsTheRest
-
containsSeparators
private boolean containsSeparators -
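byteBuffer
private ByteBuffer byteBuffer
Internal reusable buffer for encoding words into byte arrays using encoder.
-
charBuffer
private CharBuffer charBuffer
Internal reusable buffer for encoding words into byte arrays using encoder.
-
matchResult
private final MatchResult matchResult
Reusable match result.
-
dictionaryMetadata
private final DictionaryMetadata dictionaryMetadata
Features of the compiled dictionary.
-
encoder
private final CharsetEncoder encoder
Charset encoder for the FSA.
-
decoder
private final CharsetDecoder decoder
Charset decoder for the FSA.
-
matcher
private final FSATraversal matcher
An FSA used for lookups.
-
rootNode
private final int rootNode
FSA's root node.
-
fsa
private final FSA fsa
The FSA we are using.
-
finalStatesIterator
private final ByteSequenceIterator finalStatesIterator
An iterator for walking along the final states of fsa.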
-
-
Constructor Details
-
Speller
public Speller(Dictionary dictionary)
-
Speller
public Speller(Dictionary dictionary, int editDistance)
-
-
Method Details
-
createReplacementsMaps
private void createReplacementsMaps() -
charSequenceToBytes
Throws:
UnmappableInputException
-
isMisspelled
public boolean isMisspelled(String word)
Checks whether the word is misspelled, by performing a series of checks according to properties of the dictionary.
If the flag fsa.dict.speller.ignore-punctuation is set, then all non-alphabetic characters are considered to be correctly spelled.
If the flag fsa.dict.speller.ignore-numbers is set, then all words containing decimal digits are considered to be correctly spelled.
If the flag fsa.dict.speller.ignore-camel-case is set, then all CamelCase words are considered to be correctly spelled.
If the flag fsa.dict.speller.ignore-all-uppercase is set, then all alphabetic words composed of only uppercase characters are considered to be correctly spelled.
Otherwise, the word is checked in the dictionary. If the lookup fails and the dictionary does not perform any case conversions (as set by the fsa.dict.speller.convert-case flag), then the method returns true, that is, the word is reported as misspelled. If case conversions are enabled, it is checked whether a non-mixed-case word is found in the dictionary in its lowercase version and, for all-uppercase words, whether the word is found in the dictionary with an initial uppercase letter.
Parameters:
word - the word to be checked
Returns:
true if the word is misspelled
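The order of these checks can be illustrated with a small, self-contained sketch. It is not the Speller implementation: a plain Set stands in for the FSA dictionary, the boolean fields are assumed stand-ins for the fsa.dict.speller.* flags, the case helpers use simplified definitions, and all names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for illustration only; not the Speller implementation.
final class MisspellingCheckSketch {
    // Assumed stand-ins for the fsa.dict.speller.* dictionary flags.
    boolean ignorePunctuation, ignoreNumbers, ignoreCamelCase, ignoreAllUppercase, convertCase;
    private final Set<String> dictionary;

    MisspellingCheckSketch(String... words) {
        dictionary = new HashSet<>(Arrays.asList(words));
    }

    boolean isMisspelled(String word) {
        if (word.isEmpty()) return false; // assumption: an empty word is never reported
        // Assumption: "non-alphabetic character" means a one-character, non-letter word.
        boolean alphabetic = word.length() != 1 || Character.isLetter(word.charAt(0));
        if (ignorePunctuation && !alphabetic) return false;
        if (ignoreNumbers && containsDigit(word)) return false;
        if (ignoreCamelCase && isCamelCase(word)) return false;
        if (ignoreAllUppercase && alphabetic && isAllUppercase(word)) return false;
        if (dictionary.contains(word)) return false;
        if (!convertCase) return true;      // no case fallbacks: the word is misspelled
        if (isMixedCase(word)) return true; // fallbacks apply to non-mixed-case words only
        if (dictionary.contains(word.toLowerCase(Locale.ROOT))) return false;
        return !(isAllUppercase(word) && dictionary.contains(initialUppercase(word)));
    }

    private static boolean containsDigit(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isDigit(s.charAt(i))) return true;
        }
        return false;
    }

    // True if no character is lowercase (caseless characters are ignored).
    private static boolean isAllUppercase(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isLowerCase(s.charAt(i))) return false;
        }
        return true;
    }

    // True if no character is uppercase (caseless characters are ignored).
    private static boolean isAllLowercase(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isUpperCase(s.charAt(i))) return false;
        }
        return true;
    }

    private static boolean isCapitalized(String s) {
        return Character.isUpperCase(s.charAt(0)) && isAllLowercase(s.substring(1));
    }

    private static boolean isMixedCase(String s) {
        return !isAllUppercase(s) && !isAllLowercase(s) && !isCapitalized(s);
    }

    // Simplified rule: uppercase, then lowercase, then at least one more uppercase letter.
    private static boolean isCamelCase(String s) {
        return s.length() > 2 && Character.isUpperCase(s.charAt(0))
                && Character.isLowerCase(s.charAt(1)) && !isAllLowercase(s.substring(2));
    }

    private static String initialUppercase(String s) {
        return s.substring(0, 1) + s.substring(1).toLowerCase(Locale.ROOT);
    }
}
```

With a dictionary holding "hello" and "Paris" and convertCase enabled, "Hello", "HELLO" and "PARIS" are accepted through the case fallbacks, while "helo" and "hELLo" are reported as misspelled.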
-
initialUppercase
private CharSequence initialUppercase(String wordToCheck)
-
isInDictionary
public boolean isInDictionary(CharSequence word)
Test whether the word is found in the dictionary.
Parameters:
word - the word to be tested
Returns:
True if it is found.
-
getFrequency
public int getFrequency(CharSequence word)
Get the frequency value for a word form. It is taken from the first entry with this word form.
Parameters:
word - the word to be tested
Returns:
frequency value in the range 0..FREQ_RANGES-1 (0: least frequent).
-
replaceRunOnWordCandidates
Propose suggestions for misspelled run-on words. This algorithm is inspired by spell.cc in the s_fsa package by Jan Daciuk.
Parameters:
original - The original misspelled word.
Returns:
The list of suggested pairs, as CandidateData with space-concatenated strings.
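The basic idea of run-on suggestions can be sketched as follows. This is a simplified illustration, not the Speller code: a plain Set stands in for the FSA dictionary, no case handling or distance scoring is done, the result is a list of strings rather than CandidateData, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only.
final class RunOnWordsSketch {
    // Tries every split point; keeps the split when both halves are known words.
    static List<String> splitRunOn(String original, Set<String> dictionary) {
        List<String> suggestions = new ArrayList<>();
        for (int i = 1; i < original.length(); i++) {
            String left = original.substring(0, i);
            String right = original.substring(i);
            if (dictionary.contains(left) && dictionary.contains(right)) {
                suggestions.add(left + " " + right); // space-concatenated pair
            }
        }
        return suggestions;
    }
}
```

Given a dictionary containing "spell" and "checker", the input "spellchecker" yields the single suggestion "spell checker".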
-
replaceRunOnWords
Propose suggestions for misspelled run-on words.
-
addReplacement
private void addReplacement(List<Speller.CandidateData> candidates, String replacement)
-
findSimilarWordCandidates
Find similar words even if the original word is a correct word that exists in the dictionary.
Parameters:
word - The original word.
Returns:
A list of suggested candidate replacements.
-
findSimilarWords
-
findReplacements
Find suggestions by using K. Oflazer's algorithm.
-
findReplacementCandidates
Find and return suggestions by using K. Oflazer's algorithm. See Jan Daciuk's s_fsa package, spell.cc for further explanation. This method is identical to findReplacements(String), but returns candidate terms with their edit distance scores.
Parameters:
word - The original misspelled word.
Returns:
A list of suggested candidate replacements.
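To illustrate what "candidate terms with their edit distance scores" can look like, here is a minimal sketch of a scored candidate that sorts by distance. It is an assumption-based stand-in, not Speller.CandidateData: the class name, fields and the rank helper are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical scored candidate; ordered by ascending edit distance.
final class ScoredCandidate implements Comparable<ScoredCandidate> {
    final String word;
    final int distance;

    ScoredCandidate(String word, int distance) {
        this.word = word;
        this.distance = distance;
    }

    @Override
    public int compareTo(ScoredCandidate other) {
        return Integer.compare(distance, other.distance);
    }

    // Returns only the words, closest first. Collections.sort is stable,
    // so candidates with equal distance keep their original order.
    static List<String> rank(List<ScoredCandidate> candidates) {
        List<ScoredCandidate> sorted = new ArrayList<>(candidates);
        Collections.sort(sorted);
        List<String> words = new ArrayList<>();
        for (ScoredCandidate c : sorted) {
            words.add(c.word);
        }
        return words;
    }
}
```

A caller that only needs plain strings can drop the scores after sorting, which mirrors the relationship described between this method and findReplacements(String).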
-
findReplacementCandidates
private ArrayList<Speller.CandidateData> findReplacementCandidates(String word, boolean evenIfWordInDictionary) -
findRepl
private void findRepl(List<Speller.CandidateData> candidates, int depth, int node, byte[] prevBytes, int wordIndex, int candIndex) -
isArcNotTerminal
private boolean isArcNotTerminal(int arc, int candIndex) -
isEndOfCandidate
private boolean isEndOfCandidate(int arc, int wordIndex) -
isBeforeSeparator
private boolean isBeforeSeparator(int arc) -
ed
public int ed(int i, int j, int wordIndex, int candIndex)
Calculates edit distance.
Parameters:
i - length of first word (here: misspelled) - 1
j - length of second word (here: candidate) - 1
wordIndex - (TODO: javadoc?)
candIndex - (TODO: javadoc?)
Returns:
Edit distance between the two words. Remarks: See Oflazer.
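The recurrence behind this method can be sketched in standalone form. The following is a minimal illustration, assuming the definition from Oflazer's paper, in which insertion, deletion, substitution and transposition of two adjacent characters each cost 1. It fills a full table over two strings instead of using the index parameters and H matrix of Speller, and the class and method names are hypothetical.

```java
// Hypothetical standalone sketch; not the Speller.ed implementation.
final class EditDistanceSketch {
    static int editDistance(String x, String y) {
        int m = x.length();
        int n = y.length();
        int[][] d = new int[m + 1][n + 1];
        for (int i = 0; i <= m; i++) d[i][0] = i; // delete all of x's prefix
        for (int j = 0; j <= n; j++) d[0][j] = j; // insert all of y's prefix
        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                char a = x.charAt(i - 1);
                char b = y.charAt(j - 1);
                if (a == b) {
                    // Last characters match: no extra cost.
                    d[i][j] = d[i - 1][j - 1];
                } else if (i > 1 && j > 1 && a == y.charAt(j - 2) && x.charAt(i - 2) == b) {
                    // Last two characters are swapped: transposition, insertion or deletion.
                    d[i][j] = 1 + Math.min(d[i - 2][j - 2], Math.min(d[i][j - 1], d[i - 1][j]));
                } else {
                    // Substitution, insertion or deletion.
                    d[i][j] = 1 + Math.min(d[i - 1][j - 1], Math.min(d[i][j - 1], d[i - 1][j]));
                }
            }
        }
        return d[m][n];
    }
}
```

Under this definition "recieve" and "receive" are at distance 1, because the swapped "ie" counts as a single transposition rather than two substitutions.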
-
areEqual
private boolean areEqual(char x, char y) -
cuted
public int cuted(int depth, int wordIndex, int candIndex)
Calculates cut-off edit distance.
Parameters:
depth - current length of candidates.
wordIndex - (TODO: javadoc?)
candIndex - (TODO: javadoc?)
Returns:
Cut-off edit distance. Remarks: See Oflazer.
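As a rough illustration of the cut-off idea, the sketch below assumes the definition from Oflazer's paper: for a candidate prefix of length n, a word of length m and a threshold t, the cut-off distance is the minimum edit distance between the candidate prefix and the word prefixes whose lengths lie between max(1, n - t) and min(m, n + t). It works on plain strings rather than the internal buffers of Speller, caps the result at t + 1, and uses hypothetical names throughout.

```java
// Hypothetical standalone sketch; not the Speller.cuted implementation.
final class CutOffDistanceSketch {
    // Edit distance with adjacent transpositions, each operation costing 1.
    static int ed(String x, String y) {
        int m = x.length();
        int n = y.length();
        int[][] d = new int[m + 1][n + 1];
        for (int i = 0; i <= m; i++) d[i][0] = i;
        for (int j = 0; j <= n; j++) d[0][j] = j;
        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                char a = x.charAt(i - 1);
                char b = y.charAt(j - 1);
                if (a == b) {
                    d[i][j] = d[i - 1][j - 1];
                } else {
                    boolean swapped = i > 1 && j > 1 && a == y.charAt(j - 2) && x.charAt(i - 2) == b;
                    int diagonal = swapped ? d[i - 2][j - 2] : d[i - 1][j - 1];
                    d[i][j] = 1 + Math.min(diagonal, Math.min(d[i][j - 1], d[i - 1][j]));
                }
            }
        }
        return d[m][n];
    }

    // Minimum distance between candidatePrefix and the word prefixes of length
    // max(1, n - t) .. min(m, n + t). Returns t + 1 when nothing is within t.
    static int cutOff(String word, String candidatePrefix, int t) {
        int m = word.length();
        int n = candidatePrefix.length();
        int lower = Math.max(1, n - t);
        int upper = Math.min(m, n + t);
        int best = t + 1;
        for (int i = lower; i <= upper; i++) {
            best = Math.min(best, ed(word.substring(0, i), candidatePrefix));
        }
        return best;
    }
}
```

A search can abandon a candidate prefix as soon as this value exceeds t, since no extension of that prefix can come back within the threshold.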
-
matchAnyToOne
private int matchAnyToOne(int wordIndex, int candIndex) -
matchAnyToTwo
private int matchAnyToTwo(int wordIndex, int candIndex) -
min
private static int min(int a, int b, int c) -
isAlphabetic
static boolean isAlphabetic(int codePoint)
Copy-paste of Character.isAlphabetic() (needed as we require only Java 1.6).
Parameters:
codePoint - The input character.
Returns:
True if the character is a Unicode alphabetic character.
-
containsNoDigit
static boolean containsNoDigit(String s)
Checks whether a string contains a digit. Used for ignoring words with numbers.
Parameters:
s - Word to be checked.
Returns:
True if there is no digit inside the word.
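A check of this kind can be written in a few lines. The sketch below assumes, as the method name suggests, that the result is true when the word contains no digit; the class name is hypothetical and this is not the Speller code.

```java
// Hypothetical standalone sketch.
final class DigitCheckSketch {
    // Returns true only if none of the characters is a digit.
    static boolean containsNoDigit(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isDigit(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }
}
```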
-
isAllUppercase
boolean isAllUppercase(String str)
Returns true if str is made up of all-uppercase characters (ignoring characters for which no upper-/lowercase distinction exists).
-
isNotAllLowercase
boolean isNotAllLowercase(String str)
Returns true if str is not made up entirely of lowercase characters (ignoring characters for which no upper-/lowercase distinction exists). -
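One way to ignore characters without an upper-/lowercase distinction is to test only for the presence of the opposite case. The following sketch takes that approach; it is an illustration under that assumption, with hypothetical class name, and not the Speller code.

```java
// Hypothetical standalone sketch of the two case predicates.
final class CasePredicatesSketch {
    // True if no character is lowercase; digits, dashes and other
    // caseless characters never affect the result.
    static boolean isAllUppercase(String str) {
        for (int i = 0; i < str.length(); i++) {
            if (Character.isLowerCase(str.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // True if at least one character is uppercase.
    static boolean isNotAllLowercase(String str) {
        for (int i = 0; i < str.length(); i++) {
            if (Character.isUpperCase(str.charAt(i))) {
                return true;
            }
        }
        return false;
    }
}
```

With these definitions "NASA-2024" counts as all-uppercase and "abc-123" counts as all-lowercase, because the dash and the digits are skipped.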
isNotCapitalizedWord
boolean isNotCapitalizedWord(String str)
Parameters:
str - input string
-
isNotEmpty
static boolean isNotEmpty(String str)
Helper method to replace calls to "".equals().
Parameters:
str - String to check
Returns:
true if the string is neither null nor empty
-
isMixedCase
boolean isMixedCase(String str)
Parameters:
str - input string
Returns:
True if str is MixedCase.
-
isCamelCase
public boolean isCamelCase(String str)
Parameters:
str - The string to check.
Returns:
True if str is CamelCase. Note that German compounds with a dash (like "Waschmaschinen-Test") are also considered camel case by this method.
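The exact rule is not spelled out here, so the sketch below uses an assumed, simplified definition: an uppercase first character, a lowercase second character, and at least one further uppercase character. It reproduces the behaviour noted for dashed German compounds, but it is not the Speller implementation, and the class name is hypothetical.

```java
// Hypothetical standalone sketch with an assumed definition of camel case.
final class CamelCaseSketch {
    static boolean isCamelCase(String str) {
        if (str == null || str.length() < 3) {
            return false;
        }
        if (!Character.isUpperCase(str.charAt(0)) || !Character.isLowerCase(str.charAt(1))) {
            return false;
        }
        // Any later uppercase character makes the word camel case,
        // regardless of dashes or other caseless characters in between.
        for (int i = 2; i < str.length(); i++) {
            if (Character.isUpperCase(str.charAt(i))) {
                return true;
            }
        }
        return false;
    }
}
```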
-
convertsCase
public boolean convertsCase()
Used to determine whether the dictionary supports case conversions.
Returns:
true if the dictionary supports case conversions.
Since:
1.9
-
getAllReplacements
Parameters:
str - The string to find the replacements for.
fromIndex - The index from which replacements are found.
level - The recursion level. The search stops if level is > MAX_RECURSION_LEVEL.
Returns:
A list of all possible replacements of the given string str.
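A recursive expansion of this kind might look like the sketch below. It is a simplified illustration, not the Speller code: the replacement rules are passed in as a plain map, MAX_LEVEL is an assumed limit standing in for MAX_RECURSION_LEVEL, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical standalone sketch of recursive replacement expansion.
final class ReplacementSketch {
    static final int MAX_LEVEL = 6; // assumed recursion limit

    static List<String> allReplacements(String str, int fromIndex, int level,
                                        Map<String, List<String>> rules) {
        List<String> result = new ArrayList<>();
        if (level > MAX_LEVEL) { // stop searching at some point
            result.add(str);
            return result;
        }
        // Find the rule key that occurs earliest at or after fromIndex.
        int index = -1;
        String key = null;
        for (String k : rules.keySet()) {
            int at = str.indexOf(k, fromIndex);
            if (at >= 0 && (index < 0 || at < index)) {
                index = at;
                key = k;
            }
        }
        if (key == null) { // nothing left to replace
            result.add(str);
            return result;
        }
        // Branch 1: leave this occurrence unchanged and continue after it.
        result.addAll(allReplacements(str, index + key.length(), level + 1, rules));
        // Branch 2: apply each replacement and continue after the inserted text.
        for (String rep : rules.get(key)) {
            String replaced = str.substring(0, index) + rep + str.substring(index + key.length());
            result.addAll(allReplacements(replaced, index + rep.length(), level + 1, rules));
        }
        return result;
    }
}
```

With a single rule mapping "f" to "ph", the call allReplacements("foto", 0, 0, rules) returns "foto" and "photo"; each further occurrence of a key doubles the number of variants, which is why a recursion limit is useful.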
-
setWordAndCandidate
void setWordAndCandidate(String word, String candidate)
Sets up the word and candidate.
-
getWordLen
public final int getWordLen() -
getCandLen
public final int getCandLen() -
getEffectiveED
public final int getEffectiveED()
-