Class CompoundCharacterTokenizer

java.lang.Object
org.apache.fontbox.ttf.gsub.CompoundCharacterTokenizer

public class CompoundCharacterTokenizer extends Object
Takes text containing compound glyphs to substitute and splits it into chunks: the parts that should be substituted and the parts that can be processed normally.
  • Field Details

  • Constructor Details

    • CompoundCharacterTokenizer

      public CompoundCharacterTokenizer(Set<String> compoundWords)
      Constructor. Calls getRegexFromTokens, which returns a string like (_79_99_)|(_80_99_)|(_92_99_), and compiles it into the pattern assigned to regexExpression. See the code in GlyphArraySplitterRegexImpl for how these strings are created.

      It is assumed the compound words are sorted in descending order of length.

      Parameters:
      compoundWords - A set of strings like _79_99_, _80_99_ or _92_99_.
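
The following is a minimal sketch of how such an alternation regex could be built from the compound words. It is not the class's actual source. The names CompoundRegexSketch, buildRegex and firstMatch are hypothetical and exist only for this illustration. It also shows why the descending-length ordering matters: Java regex alternation takes the first alternative that matches, so a shorter word listed first would shadow a longer word that starts with it.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.StringJoiner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CompoundRegexSketch
{
    // Hypothetical helper: joins the words into "(a)|(b)|(c)", keeping iteration order.
    static String buildRegex(Set<String> compoundWords)
    {
        StringJoiner joiner = new StringJoiner(")|(", "(", ")");
        compoundWords.forEach(joiner::add);
        return joiner.toString();
    }

    // Hypothetical helper: returns the first match of regex in text, or null if none.
    static String firstMatch(String regex, String text)
    {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args)
    {
        // LinkedHashSet preserves insertion order, so the caller controls the ordering.
        Set<String> words = new LinkedHashSet<>(Arrays.asList("_79_99_", "_80_99_", "_92_99_"));
        System.out.println(buildRegex(words)); // (_79_99_)|(_80_99_)|(_92_99_)

        // Longest word first: the longer compound wins.
        Set<String> sorted = new LinkedHashSet<>(Arrays.asList("_79_99_80_", "_79_99_"));
        System.out.println(firstMatch(buildRegex(sorted), "_66_79_99_80_")); // _79_99_80_

        // Shortest word first: the shorter compound shadows the longer one.
        Set<String> unsorted = new LinkedHashSet<>(Arrays.asList("_79_99_", "_79_99_80_"));
        System.out.println(firstMatch(buildRegex(unsorted), "_66_79_99_80_")); // _79_99_
    }
}
```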
    • CompoundCharacterTokenizer

      public CompoundCharacterTokenizer(Pattern pattern)
  • Method Details

    • validateCompoundWords

      private void validateCompoundWords(Set<String> compoundWords)
      Validates the compound words. They should not be null or empty, and each should start and end with the GLYPH_ID_SEPARATOR.
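
A check of this kind might look like the sketch below. Several points are assumptions rather than documented behavior: that GLYPH_ID_SEPARATOR is "_" (suggested by the example strings on this page), that "null or empty" applies to the set as well as to each word, and that a failed check raises IllegalArgumentException. The class name CompoundWordValidationSketch is hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class CompoundWordValidationSketch
{
    // Assumption: the separator is "_", as in the example strings like _79_99_.
    static final String GLYPH_ID_SEPARATOR = "_";

    // Assumption: invalid input is reported with IllegalArgumentException.
    static void validate(Set<String> compoundWords)
    {
        if (compoundWords == null || compoundWords.isEmpty())
        {
            throw new IllegalArgumentException("Compound words must not be null or empty");
        }
        for (String word : compoundWords)
        {
            if (word == null || word.isEmpty()
                    || !word.startsWith(GLYPH_ID_SEPARATOR)
                    || !word.endsWith(GLYPH_ID_SEPARATOR))
            {
                throw new IllegalArgumentException(
                        "Compound word must start and end with " + GLYPH_ID_SEPARATOR + ": " + word);
            }
        }
    }

    public static void main(String[] args)
    {
        // Passes: every word is wrapped in separators.
        validate(new HashSet<>(Arrays.asList("_79_99_", "_80_99_")));

        try
        {
            // Fails: the leading separator is missing.
            validate(new HashSet<>(Arrays.asList("79_99_")));
        }
        catch (IllegalArgumentException e)
        {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```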
    • tokenize

      public List<String> tokenize(String text)
      Splits a string into tokens.
      Parameters:
      text - A string like "_66_71_71_74_79_70_"
      Returns:
      A list of tokens like "_66_", "_71_71_", "74_79_70_". The "_" is sometimes missing at the beginning or end of a token; the caller has to clean this up.
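
The sketch below illustrates one way a regex-based split of this kind can work, and why a separator can go missing at a token boundary: the separators on both sides of a match belong to the matched compound word, so the neighbouring tokens lose them. It is an illustration under stated assumptions, not the class's actual code. The names TokenizeSketch, tokenize and parseGlyphIds are hypothetical, and parseGlyphIds is only one possible way for a caller to clean the tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenizeSketch
{
    // Splits text into alternating unmatched chunks and matched compound words.
    static List<String> tokenize(Pattern pattern, String text)
    {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        int endOfPreviousMatch = 0;
        while (matcher.find())
        {
            // Text between the previous match and this one is processed normally.
            String before = text.substring(endOfPreviousMatch, matcher.start());
            if (!before.isEmpty())
            {
                tokens.add(before);
            }
            tokens.add(matcher.group());
            endOfPreviousMatch = matcher.end();
        }
        String tail = text.substring(endOfPreviousMatch);
        if (!tail.isEmpty())
        {
            tokens.add(tail);
        }
        return tokens;
    }

    // Hypothetical caller-side cleanup: ignores missing or extra separators.
    static List<Integer> parseGlyphIds(String token)
    {
        List<Integer> glyphIds = new ArrayList<>();
        for (String part : token.split("_"))
        {
            if (!part.isEmpty())
            {
                glyphIds.add(Integer.parseInt(part));
            }
        }
        return glyphIds;
    }

    public static void main(String[] args)
    {
        Pattern pattern = Pattern.compile("(_71_71_)");
        List<String> tokens = tokenize(pattern, "_66_71_71_74_79_70_");
        System.out.println(tokens); // [_66, _71_71_, 74_79_70_]
        for (String token : tokens)
        {
            System.out.println(parseGlyphIds(token)); // [66], then [71, 71], then [74, 79, 70]
        }
    }
}
```

In this sketch the first token comes out as "_66" and the last as "74_79_70_", because the match "_71_71_" consumes the separators on both sides.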
    • getRegexFromTokens

      private String getRegexFromTokens(Set<String> compoundWords)