Class MultiWordChunker2
- java.lang.Object
- org.languagetool.tagging.disambiguation.AbstractDisambiguator
- org.languagetool.tagging.disambiguation.MultiWordChunker2
- All Implemented Interfaces: Disambiguator
public class MultiWordChunker2 extends AbstractDisambiguator
Multiword tagger-chunker. Note: currently does not support:
- overlapping tagging (first matching multiword entry wins)
-
-
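The "first matching multiword entry wins" rule can be illustrated with a minimal sketch. It is a simplified model and not the LanguageTool implementation: tokens are plain strings, entries are token lists, and the names FirstMatchSketch, findSpans and startsAt are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model: tokens are plain strings and multiword
// entries are lists of tokens. This is not the LanguageTool implementation.
class FirstMatchSketch {

  // Returns {start, endExclusive} spans of matched multiwords, scanning left to right.
  static List<int[]> findSpans(List<String> tokens, List<List<String>> entries) {
    List<int[]> spans = new ArrayList<>();
    int i = 0;
    while (i < tokens.size()) {
      int matchedLength = 0;
      for (List<String> entry : entries) {
        if (startsAt(tokens, i, entry)) {
          matchedLength = entry.size(); // first matching entry wins
          break;
        }
      }
      if (matchedLength > 0) {
        spans.add(new int[] {i, i + matchedLength});
        i += matchedLength; // matched tokens are consumed, so spans never overlap
      } else {
        i++;
      }
    }
    return spans;
  }

  // True if every token of the entry appears consecutively beginning at index start.
  static boolean startsAt(List<String> tokens, int start, List<String> entry) {
    if (start + entry.size() > tokens.size()) {
      return false;
    }
    for (int k = 0; k < entry.size(); k++) {
      if (!tokens.get(start + k).equals(entry.get(k))) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    List<String> tokens = List.of("in", "New", "York", "City", "today");
    List<List<String>> entries = List.of(List.of("New", "York"), List.of("York", "City"));
    for (int[] span : findSpans(tokens, entries)) {
      System.out.println(tokens.subList(span[0], span[1])); // prints [New, York]
    }
  }
}
```

In this model "York City" is never tagged, because "York" was already consumed by the earlier "New York" match.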
Nested Class Summary
Nested Classes
Modifier and Type, Class, Description
private static class MultiWordChunker2.MultiWordEntry
-
Field Summary
Fields
Modifier and Type, Field, Description
private boolean allowFirstCapitalized
private java.lang.String filename
private boolean removeOtherReadings
private java.lang.String tagFormat
private java.util.Map<java.lang.String,java.util.List<MultiWordChunker2.MultiWordEntry>> tokenToPosTagMap
private static java.lang.String WRAP_TAG
-
Constructor Summary
Constructors
Constructor, Description
MultiWordChunker2(java.lang.String filename)
MultiWordChunker2(java.lang.String filename, boolean allowFirstCapitalized)
-
Method Summary
All Methods Instance Methods Concrete Methods
Modifier and Type, Method, Description
AnalyzedSentence disambiguate(AnalyzedSentence input)
  Implements multiword POS tags, e.g., <ELLIPSIS> for ellipsis (...) start, and </ELLIPSIS> for ellipsis end.
private MultiWordChunker2.MultiWordEntry findMultiwordEntry(AnalyzedTokenReadings[] inputTokens, int startingPosition, java.util.List<MultiWordChunker2.MultiWordEntry> multiwordItems)
protected java.lang.String formatPosTag(java.lang.String posTag, int position, int multiwordLength)
  Override this method to format the POS tag differently.
private boolean isMatching(AnalyzedTokenReadings[] inputTokens, int startingPosition, MultiWordChunker2.MultiWordEntry multiWordEntry)
private void lazyInit()
private java.util.List<java.lang.String> loadWords(java.io.InputStream stream)
protected boolean matches(java.lang.String matchText, AnalyzedTokenReadings inputTokens)
protected AnalyzedTokenReadings prepareNewReading(java.lang.String tokens, java.lang.String tok, AnalyzedTokenReadings token, java.lang.String tag)
private AnalyzedTokenReadings setAndAnnotate(AnalyzedTokenReadings oldReading, AnalyzedToken newReading)
void setRemoveOtherReadings(boolean removeOtherReadings)
void setWrapTag(boolean wrapTag)
-
Methods inherited from class org.languagetool.tagging.disambiguation.AbstractDisambiguator
preDisambiguate
-
-
-
-
Field Detail
-
WRAP_TAG
private static final java.lang.String WRAP_TAG
- See Also:
- Constant Field Values
-
filename
private final java.lang.String filename
-
allowFirstCapitalized
private final boolean allowFirstCapitalized
-
removeOtherReadings
private boolean removeOtherReadings
-
tagFormat
private java.lang.String tagFormat
-
tokenToPosTagMap
private java.util.Map<java.lang.String,java.util.List<MultiWordChunker2.MultiWordEntry>> tokenToPosTagMap
-
-
Constructor Detail
-
MultiWordChunker2
public MultiWordChunker2(java.lang.String filename)
- Parameters:
filename - text file with multiwords and tags
-
MultiWordChunker2
public MultiWordChunker2(java.lang.String filename, boolean allowFirstCapitalized)
- Parameters:
filename - text file with multiwords and tags
allowFirstCapitalized - if set to true, the first word of the multiword can be capitalized
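As a rough illustration of what allowFirstCapitalized permits, the sketch below compares a surface form against a multiword entry. The helper matchesEntry and its class are hypothetical, and the real class may implement this differently (for example when loading the file); only the documented effect is modeled.

```java
// Hypothetical helper, not part of LanguageTool: models only the documented
// effect of allowFirstCapitalized on matching.
class CapitalizedFirstWordSketch {

  // Returns true if text equals the entry, or (when allowed) the entry with
  // only its first character upper-cased.
  static boolean matchesEntry(String text, String entry, boolean allowFirstCapitalized) {
    if (text.equals(entry)) {
      return true;
    }
    if (!allowFirstCapitalized || entry.isEmpty()) {
      return false;
    }
    String capitalized = Character.toUpperCase(entry.charAt(0)) + entry.substring(1);
    return text.equals(capitalized);
  }

  public static void main(String[] args) {
    System.out.println(matchesEntry("In spite of", "in spite of", true));  // prints true
    System.out.println(matchesEntry("In spite of", "in spite of", false)); // prints false
  }
}
```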
-
-
Method Detail
-
setRemoveOtherReadings
public void setRemoveOtherReadings(boolean removeOtherReadings)
- Parameters:
removeOtherReadings - if true and a multiword matches, other readings will be removed
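The effect of this flag can be sketched on a simplified model in which a token's readings are plain POS tag strings. The method applyMultiwordTag is hypothetical; the real class works on AnalyzedTokenReadings, and whether the multiword reading is appended last is an assumption made here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model: a token's readings are represented as POS tag strings.
class RemoveOtherReadingsSketch {

  // Returns the readings of a token after a multiword match.
  static List<String> applyMultiwordTag(List<String> existing, String multiwordTag,
                                        boolean removeOtherReadings) {
    List<String> result = new ArrayList<>();
    if (!removeOtherReadings) {
      result.addAll(existing); // keep the previous readings alongside the new one
    }
    result.add(multiwordTag);  // assumption: the multiword reading is added last
    return result;
  }

  public static void main(String[] args) {
    List<String> readings = List.of("NN", "VB");
    System.out.println(applyMultiwordTag(readings, "<MW>", false)); // prints [NN, VB, <MW>]
    System.out.println(applyMultiwordTag(readings, "<MW>", true));  // prints [<MW>]
  }
}
```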
-
setWrapTag
public void setWrapTag(boolean wrapTag)
- Parameters:
wrapTag - if true, the tag will be wrapped with < and >
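A minimal sketch of how such wrapping could work is shown below. The value "<%s>" for WRAP_TAG and the use of String.format are assumptions, since this page does not show the constant's value; the class WrapTagSketch and its format method are stand-ins, not the real class.

```java
// Stand-in class modeling only the wrap-tag switch. The constant value and
// the use of String.format are assumptions, not taken from the documentation.
class WrapTagSketch {
  private static final String WRAP_TAG = "<%s>"; // assumed format string

  private String tagFormat; // null means "leave the tag unchanged"

  void setWrapTag(boolean wrapTag) {
    tagFormat = wrapTag ? WRAP_TAG : null;
  }

  // Returns the tag wrapped in < and > when wrapping is enabled.
  String format(String posTag) {
    return tagFormat != null ? String.format(tagFormat, posTag) : posTag;
  }

  public static void main(String[] args) {
    WrapTagSketch sketch = new WrapTagSketch();
    System.out.println(sketch.format("ELLIPSIS")); // prints ELLIPSIS
    sketch.setWrapTag(true);
    System.out.println(sketch.format("ELLIPSIS")); // prints <ELLIPSIS>
  }
}
```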
-
formatPosTag
protected java.lang.String formatPosTag(java.lang.String posTag, int position, int multiwordLength)
Override this method to format the POS tag differently.
- Parameters:
posTag - POS tag for the multiword
position - position of the token in the multiword
- Returns:
- the formatted POS tag for the multiword
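An override might look like the following sketch. To keep it self-contained, BaseChunker is a stand-in for the real class and models only this hook; PositionalChunker, tagsFor, the zero-based position, and the start/end tag scheme are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class OverrideFormatSketch {

  // Stand-in for the real chunker; only the overridable hook is modeled.
  static class BaseChunker {
    protected String formatPosTag(String posTag, int position, int multiwordLength) {
      return posTag;
    }
  }

  // Hypothetical subclass: marks the first and last token of a multiword.
  // Assumes position is zero-based, which the documentation does not state.
  static class PositionalChunker extends BaseChunker {
    @Override
    protected String formatPosTag(String posTag, int position, int multiwordLength) {
      if (position == 0) {
        return "<" + posTag + ">";
      }
      if (position == multiwordLength - 1) {
        return "</" + posTag + ">";
      }
      return posTag;
    }
  }

  // Collects the formatted tag for every token position of one multiword.
  static List<String> tagsFor(BaseChunker chunker, String posTag, int multiwordLength) {
    List<String> tags = new ArrayList<>();
    for (int position = 0; position < multiwordLength; position++) {
      tags.add(chunker.formatPosTag(posTag, position, multiwordLength));
    }
    return tags;
  }

  public static void main(String[] args) {
    // prints [<ELLIPSIS>, ELLIPSIS, </ELLIPSIS>]
    System.out.println(tagsFor(new PositionalChunker(), "ELLIPSIS", 3));
  }
}
```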
-
lazyInit
private void lazyInit()
-
disambiguate
public AnalyzedSentence disambiguate(AnalyzedSentence input)
Implements multiword POS tags, e.g., <ELLIPSIS> for ellipsis (...) start, and </ELLIPSIS> for ellipsis end.
- Parameters:
input - the tokens to be chunked
- Returns:
- AnalyzedSentence with additional markers
-
findMultiwordEntry
private MultiWordChunker2.MultiWordEntry findMultiwordEntry(AnalyzedTokenReadings[] inputTokens, int startingPosition, java.util.List<MultiWordChunker2.MultiWordEntry> multiwordItems)
-
isMatching
private boolean isMatching(AnalyzedTokenReadings[] inputTokens, int startingPosition, MultiWordChunker2.MultiWordEntry multiWordEntry)
-
matches
protected boolean matches(java.lang.String matchText, AnalyzedTokenReadings inputTokens)
-
prepareNewReading
protected AnalyzedTokenReadings prepareNewReading(java.lang.String tokens, java.lang.String tok, AnalyzedTokenReadings token, java.lang.String tag)
-
setAndAnnotate
private AnalyzedTokenReadings setAndAnnotate(AnalyzedTokenReadings oldReading, AnalyzedToken newReading)
-
loadWords
private java.util.List<java.lang.String> loadWords(java.io.InputStream stream)
-
-