Package org.languagetool.tokenizers
Class WordTokenizer
java.lang.Object
  org.languagetool.tokenizers.WordTokenizer
All Implemented Interfaces: Tokenizer
public class WordTokenizer extends java.lang.Object implements Tokenizer
Tokenizes a sentence into words. Punctuation and whitespace get their own tokens. The tokenizer is a fairly simple character-based one, but it recognizes URLs and keeps each one in a single token, provided the URL is fully specified, including a protocol (like http://foobar.org).
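The character-based splitting described above can be illustrated with a minimal sketch. It is not the LanguageTool implementation: the class name SimpleWordTokenizerSketch and the short DELIMITERS string are assumptions made for illustration, and the sketch does not join URLs or e-mail addresses back into single tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical sketch class, not part of LanguageTool.
public class SimpleWordTokenizerSketch {

  // Assumed, deliberately small delimiter set; the real tokenizer's
  // TOKENIZING_CHARACTERS constant is private and not shown on this page.
  private static final String DELIMITERS = " \t\n.,;:!?()\"";

  // Splits text so that every delimiter character becomes its own token.
  public static List<String> tokenize(String text) {
    List<String> tokens = new ArrayList<>();
    // returnDelims = true: delimiters are returned as one-character tokens
    StringTokenizer st = new StringTokenizer(text, DELIMITERS, true);
    while (st.hasMoreTokens()) {
      tokens.add(st.nextToken());
    }
    return tokens;
  }

  public static void main(String[] args) {
    // Prints each token on its own line, wrapped in brackets so that
    // whitespace tokens remain visible.
    for (String token : tokenize("Hello, world!")) {
      System.out.println("[" + token + "]");
    }
  }
}
```

In this sketch the comma, the space, and the exclamation mark each come back as separate tokens, which mirrors the statement that punctuation and whitespace get their own tokens.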
-
-
Field Summary
Fields
Modifier and Type                                Field                   Description
private static java.util.regex.Pattern           DOMAIN_CHARS
private static java.util.regex.Pattern           E_MAIL
private static java.util.regex.Pattern           NO_PROTOCOL_URL
private static java.util.List<java.lang.String>  PROTOCOLS
private static java.lang.String                  TOKENIZING_CHARACTERS
private static java.util.regex.Pattern           URL_CHARS
-
Constructor Summary
Constructors
Constructor      Description
WordTokenizer()
-
Method Summary
Methods
Modifier and Type                           Method          Description
static java.util.List<java.lang.String>     getProtocols()  Get the protocols that the tokenizer knows about.
java.lang.String                            getTokenizingCharacters()
static boolean                              isEMail(java.lang.String token)
private boolean                             isProtocol(java.lang.String token)
static boolean                              isUrl(java.lang.String token)
protected java.util.List<java.lang.String>  joinEMails(java.util.List<java.lang.String> list)
protected java.util.List<java.lang.String>  joinEMailsAndUrls(java.util.List<java.lang.String> list)
protected java.util.List<java.lang.String>  joinUrls(java.util.List<java.lang.String> l)
java.util.List<java.lang.String>            tokenize(java.lang.String text)
private boolean                             urlEndsAt(int i, java.util.List<java.lang.String> l, java.lang.String urlQuote)
private boolean                             urlStartsAt(int i, java.util.List<java.lang.String> l)
-
-
-
Field Detail
-
PROTOCOLS
private static final java.util.List<java.lang.String> PROTOCOLS
-
URL_CHARS
private static final java.util.regex.Pattern URL_CHARS
-
DOMAIN_CHARS
private static final java.util.regex.Pattern DOMAIN_CHARS
-
NO_PROTOCOL_URL
private static final java.util.regex.Pattern NO_PROTOCOL_URL
-
E_MAIL
private static final java.util.regex.Pattern E_MAIL
-
TOKENIZING_CHARACTERS
private static final java.lang.String TOKENIZING_CHARACTERS
- See Also:
- Constant Field Values
-
-
Method Detail
-
getProtocols
public static java.util.List<java.lang.String> getProtocols()
Get the protocols that the tokenizer knows about.
- Returns:
- currently http, https, and ftp
- Since:
- 2.1
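As a rough illustration of how a list of known protocols could be used to spot a fully specified URL, here is a minimal sketch. It is an assumption, not the library's logic: the class ProtocolCheckSketch and the method startsWithKnownProtocol are hypothetical names, and the protocol list is hard-coded from the return value documented above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch class, not part of LanguageTool.
public class ProtocolCheckSketch {

  // The protocols documented as the current return value of getProtocols().
  private static final List<String> PROTOCOLS = Arrays.asList("http", "https", "ftp");

  // Returns true if the token begins with a known protocol followed by "://".
  // This is an assumed check for illustration only.
  public static boolean startsWithKnownProtocol(String token) {
    for (String protocol : PROTOCOLS) {
      if (token.startsWith(protocol + "://")) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    System.out.println(startsWithKnownProtocol("http://foobar.org"));
    System.out.println(startsWithKnownProtocol("foobar.org"));
  }
}
```

Under this check, http://foobar.org qualifies while a bare foobar.org does not, matching the class description's requirement that a URL include a protocol to be kept in one token.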
-
isUrl
public static boolean isUrl(java.lang.String token)
- Since:
- 3.0
-
isEMail
public static boolean isEMail(java.lang.String token)
- Since:
- 3.5
-
tokenize
public java.util.List<java.lang.String> tokenize(java.lang.String text)
-
getTokenizingCharacters
public java.lang.String getTokenizingCharacters()
- Returns:
- The string containing the characters used by the tokenizer to tokenize words.
- Since:
- 2.5
-
joinEMailsAndUrls
protected java.util.List<java.lang.String> joinEMailsAndUrls(java.util.List<java.lang.String> list)
-
joinEMails
protected java.util.List<java.lang.String> joinEMails(java.util.List<java.lang.String> list)
- Since:
- 3.5
-
joinUrls
protected java.util.List<java.lang.String> joinUrls(java.util.List<java.lang.String> l)
-
urlStartsAt
private boolean urlStartsAt(int i, java.util.List<java.lang.String> l)
-
isProtocol
private boolean isProtocol(java.lang.String token)
-
urlEndsAt
private boolean urlEndsAt(int i, java.util.List<java.lang.String> l, java.lang.String urlQuote)
-
-