Class UnicodeTokenizer

java.lang.Object
com.kohlschutter.boilerpipe.util.UnicodeTokenizer

public class UnicodeTokenizer extends Object
Tokenizes text according to Unicode word boundaries and strips off non-word characters.
  • Field Details

    • PAT_WORD_BOUNDARY

      private static final Pattern PAT_WORD_BOUNDARY
    • PAT_NOT_WORD_BOUNDARY

      private static final Pattern PAT_NOT_WORD_BOUNDARY
  • Constructor Details

    • UnicodeTokenizer

      public UnicodeTokenizer()
  • Method Details

    • tokenize

      public static String[] tokenize(CharSequence text)
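      Tokenizes the given text at Unicode word boundaries and returns the resulting tokens as an array.
      Parameters:
      text - The text to tokenize
      Returns:
      The tokens, in the order in which they occur in the text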
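
The page above does not spell out the tokenization rules, so the following is only a minimal sketch of the general idea, not the library's implementation. It assumes that a token is a maximal run of Unicode letters or digits and that everything else is stripped. The class name `TokenizerSketch` and the pattern `WORD` are hypothetical; only the `tokenize(CharSequence)` signature mirrors the documented method.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in, not part of boilerpipe.
final class TokenizerSketch {

    // Assumption: a "word" is a maximal run of Unicode letters (\p{L}) or digits (\p{N}).
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}]+");

    private TokenizerSketch() {
        // Static utility; no instances needed.
    }

    // Same shape as the documented method: text in, array of tokens out.
    static String[] tokenize(CharSequence text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = WORD.matcher(text);
        // Collect each run of word characters; punctuation and whitespace are skipped.
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens.toArray(new String[0]);
    }
}
```

Under these assumptions, `TokenizerSketch.tokenize("Hello, world!")` yields the two tokens `Hello` and `world`. The real `UnicodeTokenizer.tokenize` may treat punctuation differently (for example, by keeping it attached to neighboring words), so check its actual output before relying on any particular behavior.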