Class BretonWordTokenizer

java.lang.Object
org.languagetool.tokenizers.WordTokenizer
org.languagetool.tokenizers.br.BretonWordTokenizer
All Implemented Interfaces:
org.languagetool.tokenizers.Tokenizer

public class BretonWordTokenizer extends org.languagetool.tokenizers.WordTokenizer
  • Constructor Details

    • BretonWordTokenizer

      public BretonWordTokenizer()
  • Method Details

    • tokenize

      public List<String> tokenize(String text)
      Tokenizes like WordTokenizer, except that "c’h" is not split. "C’h" is treated as a single letter in Breton (a trigraph) and occurs in many words, so the tokenizer must keep it intact. Elided forms such as "n’eo" are split into exactly two tokens: "n’" + "eo".
      Specified by:
      tokenize in interface org.languagetool.tokenizers.Tokenizer
      Overrides:
      tokenize in class org.languagetool.tokenizers.WordTokenizer
      Parameters:
      text - Text to tokenize
      Returns:
      List of tokens. Note: the special string ##BR_APOS## is used as a placeholder for apostrophes during tokenization.
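
The behaviour described for tokenize can be illustrated with a small standalone sketch. It is not the LanguageTool source: the class name BretonTokenizeSketch, both regular expressions, and the decision to drop whitespace are assumptions made for illustration, and only the typographic apostrophe (U+2019) is handled. It shows the placeholder idea from the note above: the apostrophe inside "c’h" is hidden behind ##BR_APOS##, the text is split, and the apostrophe is restored in each token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; not the LanguageTool implementation.
final class BretonTokenizeSketch {

    // Placeholder string mentioned in the Javadoc above.
    private static final String APOS = "##BR_APOS##";

    // Assumption: only the typographic apostrophe U+2019 is recognised.
    private static final Pattern TRIGRAPH =
        Pattern.compile("([cC])\u2019([hH])");

    // A token is either a word (letters and placeholders), optionally ending
    // in an apostrophe that is directly followed by a letter, or any single
    // non-whitespace character. Whitespace is skipped (a simplification).
    private static final Pattern TOKEN = Pattern.compile(
        "(?:\\p{L}|" + Pattern.quote(APOS) + ")+(?:\u2019(?=\\p{L}))?|\\S");

    static List<String> tokenize(String text) {
        // 1. Hide the apostrophe of every "c’h" so it is not taken for an elision.
        String hidden = TRIGRAPH.matcher(text).replaceAll("$1" + APOS + "$2");

        // 2. Split; "n’eo" becomes "n’" + "eo" because a word may end in an
        //    apostrophe only when a letter follows it.
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(hidden);
        while (m.find()) {
            // 3. Restore the apostrophe inside each token.
            tokens.add(m.group().replace(APOS, "\u2019"));
        }
        return tokens;
    }
}
```

With this sketch, BretonTokenizeSketch.tokenize("N’eo ket ur marc’h.") yields the tokens "N’", "eo", "ket", "ur", "marc’h" and ".". An input that already contains the literal placeholder string would be altered, which is one reason to treat this only as a sketch.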