Package com.kohlschutter.boilerpipe.util
Class UnicodeTokenizer
java.lang.Object
com.kohlschutter.boilerpipe.util.UnicodeTokenizer
Tokenizes text according to Unicode word boundaries and strips off non-word characters.
Field Summary
Fields
PAT_WORD_BOUNDARY
PAT_NOT_WORD_BOUNDARY
Constructor Summary
Constructors
UnicodeTokenizer()
Method Summary
Modifier and Type: static String[]
Method: tokenize(CharSequence text)
Description: Tokenizes the text and returns an array of tokens.
Field Details
PAT_WORD_BOUNDARY
PAT_NOT_WORD_BOUNDARY
Constructor Details
UnicodeTokenizer
public UnicodeTokenizer()
Method Details
tokenize
public static String[] tokenize(CharSequence text)
Tokenizes the text and returns an array of tokens.
Parameters:
text - The text
Returns:
The tokens
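The general idea of splitting text at word boundaries and discarding non-word characters can be illustrated with a small standalone sketch. This is not the UnicodeTokenizer implementation: the class name WordBoundaryTokenizerSketch and the pattern NON_WORD are hypothetical, and the sketch makes the simplifying assumption that every run of characters other than Unicode letters and digits separates two tokens. The real class may treat punctuation inside words differently.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of boilerpipe.
final class WordBoundaryTokenizerSketch {

    // Assumption: a "non-word" run is one or more characters that are
    // neither Unicode letters (\p{L}) nor Unicode digits (\p{N}).
    private static final Pattern NON_WORD = Pattern.compile("[^\\p{L}\\p{N}]+");

    // Replaces every non-word run with a single space, trims the result,
    // and splits on spaces. Returns an empty array if no word characters remain.
    static String[] tokenize(CharSequence text) {
        String normalized = NON_WORD.matcher(text).replaceAll(" ").trim();
        if (normalized.isEmpty()) {
            return new String[0];
        }
        return normalized.split(" ");
    }

    public static void main(String[] args) {
        String[] tokens = tokenize("Hello, world! Tokenize 2 things.");
        // prints [Hello, world, Tokenize, 2, things]
        System.out.println(Arrays.toString(tokens));
    }
}
```

Because \p{L} and \p{N} are Unicode categories, the sketch keeps non-ASCII letters and digits inside tokens. With boilerpipe on the classpath, the documented entry point is the static call UnicodeTokenizer.tokenize(text), which returns a String[] in the same way.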