Package org.languagetool.tokenizers.en
Class EnglishWordTokenizer
- java.lang.Object
  - org.languagetool.tokenizers.WordTokenizer
    - org.languagetool.tokenizers.en.EnglishWordTokenizer
-
- All Implemented Interfaces:
org.languagetool.tokenizers.Tokenizer
public class EnglishWordTokenizer extends org.languagetool.tokenizers.WordTokenizer
- Since:
2.5
-
Field Summary
Fields
- private static java.lang.String[] EXCEPTION_REPLACEMENT
- private static java.lang.String[] EXCEPTIONS
-
Constructor Summary
Constructors
- EnglishWordTokenizer()
-
Method Summary
Methods
- java.lang.String getTokenizingCharacters()
- java.util.List<java.lang.String> tokenize(java.lang.String text): Tokenizes text.
-
Method Detail
-
getTokenizingCharacters
public java.lang.String getTokenizingCharacters()
- Overrides:
getTokenizingCharacters in class org.languagetool.tokenizers.WordTokenizer
-
tokenize
public java.util.List<java.lang.String> tokenize(java.lang.String text)
Tokenizes text. The English tokenizer differs from the standard one in two respects:
- it does not treat a hyphen as part of the word when the hyphen is at the end of the word;
- it includes the n-dash (en dash, U+2013) as a tokenizing character, because English uses it without surrounding whitespace.
- Specified by:
tokenize in interface org.languagetool.tokenizers.Tokenizer
- Overrides:
tokenize in class org.languagetool.tokenizers.WordTokenizer
- Parameters:
text - String of words to tokenize.
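The two differences described for tokenize can be illustrated with a small stand-alone sketch. This is not the LanguageTool implementation: the class name TokenizeSketch, the DELIMS string, and the trailing-hyphen rule as written here are assumptions made for illustration only, and the real tokenizing characters come from getTokenizingCharacters().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical stand-in for illustration; not part of LanguageTool.
class TokenizeSketch {

    // Assumed delimiter set: whitespace, some punctuation, and the n-dash (U+2013).
    // The plain hyphen is deliberately not a delimiter, so "post-war" stays one token.
    private static final String DELIMS = " \t\n.,;:!?()\"\u2013";

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // returnDelims = true: each delimiter character is returned as its own token.
        StringTokenizer st = new StringTokenizer(text, DELIMS, true);
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (token.length() > 1 && token.endsWith("-")) {
                // A hyphen at the end of a word is split off as a separate token.
                tokens.add(token.substring(0, token.length() - 1));
                tokens.add("-");
            } else {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "pre-" ends in a hyphen, "post-war" contains one, "1990\u20132000" uses an n-dash.
        System.out.println(tokenize("pre- and post-war, 1990\u20132000"));
    }
}
```

In this sketch "pre-" is split into "pre" and "-", "post-war" is kept whole, and the n-dash separates "1990" from "2000" even though no whitespace surrounds it.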