Package org.languagetool.chunking
Class EnglishChunker
- java.lang.Object
-
- org.languagetool.chunking.EnglishChunker
-
- All Implemented Interfaces:
org.languagetool.chunking.Chunker
public class EnglishChunker extends java.lang.Object implements org.languagetool.chunking.Chunker
OpenNLP-based chunker. Also uses the OpenNLP tokenizer and POS tagger and maps the result to our own tokens (we have our own tokenizer), as far as trivially possible.
- Since:
- 2.3
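The description above implies three stages run in order: tokenize, POS-tag, chunk. The following is a minimal sketch of that ordering only, not the actual implementation of this class. `ChunkPipeline` and its functional-interface fields are hypothetical stand-ins; in OpenNLP the corresponding calls are `TokenizerME.tokenize(String)`, `POSTaggerME.tag(String[])` and `ChunkerME.chunk(String[], String[])`.

```java
import java.util.function.BiFunction;
import java.util.function.Function;

/** Hypothetical three-stage pipeline: tokenize, then POS-tag, then chunk. */
final class ChunkPipeline {
    private final Function<String, String[]> tokenizer;
    private final Function<String[], String[]> posTagger;
    private final BiFunction<String[], String[], String[]> chunker;

    ChunkPipeline(Function<String, String[]> tokenizer,
                  Function<String[], String[]> posTagger,
                  BiFunction<String[], String[], String[]> chunker) {
        this.tokenizer = tokenizer;
        this.posTagger = posTagger;
        this.chunker = chunker;
    }

    /** Returns one chunk tag per token produced by the tokenizer stage. */
    String[] chunkTagsFor(String sentence) {
        String[] tokens = tokenizer.apply(sentence);
        String[] posTags = posTagger.apply(tokens);
        if (posTags.length != tokens.length) {
            throw new IllegalStateException("expected one POS tag per token");
        }
        // The chunker needs both the tokens and their POS tags.
        String[] chunkTags = chunker.apply(tokens, posTags);
        if (chunkTags.length != tokens.length) {
            throw new IllegalStateException("expected one chunk tag per token");
        }
        return chunkTags;
    }
}
```

Passing the stages in as functions keeps the sketch independent of any model files; a real caller would wrap the OpenNLP objects instead of toy lambdas.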
-
-
Field Summary
Fields (Modifier and Type, Field, Description)
- private static java.lang.String CHUNKER_MODEL
- private static opennlp.tools.chunker.ChunkerModel chunkerModel
- private EnglishChunkFilter chunkFilter
- private static java.lang.String POS_TAGGER_MODEL
- private static opennlp.tools.postag.POSModel posModel
- private static java.lang.String TOKENIZER_MODEL
- private static opennlp.tools.tokenize.TokenizerModel tokenModel
  This needs to be static to save memory: as Language.LANGUAGES is static, any language that is once created there will never be released.
-
Constructor Summary
Constructors (Constructor, Description)
- EnglishChunker()
-
Method Summary
Methods (Modifier and Type, Method, Description)
- void addChunkTags(java.util.List<org.languagetool.AnalyzedTokenReadings> tokenReadings)
- private void assignChunksToReadings(java.util.List<ChunkTaggedToken> chunkTaggedTokens)
- private java.lang.String[] chunk(java.lang.String[] tokens, java.lang.String[] posTags)
- private @Nullable org.languagetool.AnalyzedTokenReadings getAnalyzedTokenReadingsFor(int startPos, int endPos, java.util.List<org.languagetool.AnalyzedTokenReadings> tokenReadings)
- private java.util.List<ChunkTaggedToken> getChunkTagsForReadings(java.util.List<org.languagetool.AnalyzedTokenReadings> tokenReadings)
- private java.lang.String getSentence(java.util.List<org.languagetool.AnalyzedTokenReadings> sentenceTokens)
- private java.util.List<ChunkTaggedToken> getTokensWithTokenReadings(java.util.List<org.languagetool.AnalyzedTokenReadings> tokenReadings, java.lang.String[] tokens, java.lang.String[] chunkTags)
- private java.lang.String[] posTag(java.lang.String[] tokens)
- (package private) java.lang.String[] tokenize(java.lang.String sentence)
-
-
-
Field Detail
-
TOKENIZER_MODEL
private static final java.lang.String TOKENIZER_MODEL
- See Also:
- Constant Field Values
-
POS_TAGGER_MODEL
private static final java.lang.String POS_TAGGER_MODEL
- See Also:
- Constant Field Values
-
CHUNKER_MODEL
private static final java.lang.String CHUNKER_MODEL
- See Also:
- Constant Field Values
-
tokenModel
private static volatile opennlp.tools.tokenize.TokenizerModel tokenModel
This needs to be static to save memory: as Language.LANGUAGES is static, any language once created there is never released. As English has several variants, we would otherwise have as many posModels etc. as we have variants, which would be a huge waste of memory.
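The memory argument above can be illustrated with a small sketch. It assumes the shared model is loaded lazily with double-checked locking, which is a common reason for declaring such a field `static volatile`; this page does not show how the class actually initializes its models. `ExpensiveModel`, `SharedModelChunker` and the resource path are hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a large model such as a POS or chunker model. */
final class ExpensiveModel {
    static final AtomicInteger LOAD_COUNT = new AtomicInteger();
    final String resourcePath;

    ExpensiveModel(String resourcePath) {
        this.resourcePath = resourcePath;
        LOAD_COUNT.incrementAndGet(); // counts how often a model is really built
    }
}

/** Hypothetical chunker-like class: all instances share one lazily loaded model. */
final class SharedModelChunker {
    // static: one model per JVM instead of one per language variant instance.
    // volatile: needed so double-checked locking publishes the model safely.
    private static volatile ExpensiveModel model;

    SharedModelChunker() {
        if (model == null) {
            synchronized (SharedModelChunker.class) {
                if (model == null) {
                    // Hypothetical path, for illustration only.
                    model = new ExpensiveModel("/hypothetical/model.bin");
                }
            }
        }
    }

    ExpensiveModel model() {
        return model;
    }
}
```

With this layout, creating one chunker per English variant builds the model once; with an instance field, each variant would hold its own copy for the lifetime of the JVM.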
-
posModel
private static volatile opennlp.tools.postag.POSModel posModel
-
chunkerModel
private static volatile opennlp.tools.chunker.ChunkerModel chunkerModel
-
chunkFilter
private final EnglishChunkFilter chunkFilter
-
-
Method Detail
-
addChunkTags
public void addChunkTags(java.util.List<org.languagetool.AnalyzedTokenReadings> tokenReadings)
- Specified by:
addChunkTags in interface org.languagetool.chunking.Chunker
-
getChunkTagsForReadings
private java.util.List<ChunkTaggedToken> getChunkTagsForReadings(java.util.List<org.languagetool.AnalyzedTokenReadings> tokenReadings)
-
tokenize
java.lang.String[] tokenize(java.lang.String sentence)
-
posTag
private java.lang.String[] posTag(java.lang.String[] tokens)
-
chunk
private java.lang.String[] chunk(java.lang.String[] tokens, java.lang.String[] posTags)
-
getTokensWithTokenReadings
private java.util.List<ChunkTaggedToken> getTokensWithTokenReadings(java.util.List<org.languagetool.AnalyzedTokenReadings> tokenReadings, java.lang.String[] tokens, java.lang.String[] chunkTags)
-
assignChunksToReadings
private void assignChunksToReadings(java.util.List<ChunkTaggedToken> chunkTaggedTokens)
-
getSentence
private java.lang.String getSentence(java.util.List<org.languagetool.AnalyzedTokenReadings> sentenceTokens)
-
getAnalyzedTokenReadingsFor
@Nullable private org.languagetool.AnalyzedTokenReadings getAnalyzedTokenReadingsFor(int startPos, int endPos, java.util.List<org.languagetool.AnalyzedTokenReadings> tokenReadings)
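The signature above, together with the class description ("maps the result to our own tokens ... as far as trivially possible"), suggests a lookup by character offsets that may find nothing. The sketch below shows one way such a mapping could work, under the assumption that a chunk tag is copied only when a token from the other tokenizer covers exactly the same span as one of our own tokens. `OwnToken` and `ChunkTagMapper` are hypothetical simplifications, not classes of this package.

```java
import java.util.List;

/** Hypothetical simplified token: its text and start offset within the sentence. */
final class OwnToken {
    final String text;
    final int startPos;
    String chunkTag; // stays null if no chunk tag could be mapped

    OwnToken(String text, int startPos) {
        this.text = text;
        this.startPos = startPos;
    }

    int endPos() {
        return startPos + text.length();
    }
}

final class ChunkTagMapper {
    /** Returns the token covering exactly [startPos, endPos), or null if there is none. */
    static OwnToken findToken(int startPos, int endPos, List<OwnToken> tokens) {
        for (OwnToken token : tokens) {
            if (token.startPos == startPos && token.endPos() == endPos) {
                return token;
            }
        }
        return null;
    }

    /** Copies chunk tags onto own tokens wherever both tokenizations agree on the span. */
    static void assign(String sentence, String[] foreignTokens, String[] chunkTags,
                       List<OwnToken> ownTokens) {
        int pos = 0;
        for (int i = 0; i < foreignTokens.length; i++) {
            int start = sentence.indexOf(foreignTokens[i], pos);
            if (start < 0) {
                continue; // token text not found in the sentence: skip it
            }
            int end = start + foreignTokens[i].length();
            OwnToken match = findToken(start, end, ownTokens);
            if (match != null) {
                match.chunkTag = chunkTags[i];
            }
            pos = end;
        }
    }
}
```

For a sentence like "I can't go", a tokenizer that splits "ca" and "n't" disagrees with one that splits "can", "'" and "t", so under this assumption only "I" and "go" would receive chunk tags.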
-
-