Package edu.berkeley.nlp.lm.io
Class KneserNeyLmReaderCallback<W>
- java.lang.Object
-
- edu.berkeley.nlp.lm.io.KneserNeyLmReaderCallback<W>
-
- Type Parameters:
W-
- All Implemented Interfaces:
ArrayEncodedNgramLanguageModel<W>,LmReader<ProbBackoffPair,ArpaLmReaderCallback<ProbBackoffPair>>,LmReaderCallback<LongRef>,NgramOrderedLmReaderCallback<LongRef>,NgramLanguageModel<W>,java.io.Serializable
public class KneserNeyLmReaderCallback<W> extends java.lang.Object implements NgramOrderedLmReaderCallback<LongRef>, LmReader<ProbBackoffPair,ArpaLmReaderCallback<ProbBackoffPair>>, ArrayEncodedNgramLanguageModel<W>, java.io.Serializable
Class for producing a Kneser-Ney language model in ARPA format from raw text. Confusingly, this class is both an LmReaderCallback (called from TextReader, which reads plain text) and an LmReader, which "reads" counts, produces Kneser-Ney probabilities and backoffs, and passes them on to an ArpaLmReaderCallback.
- Author:
- adampauls
- See Also:
- Serialized Form
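The two roles named in the class description (a callback that receives n-grams, then a reader that emits computed values to another callback) can be sketched with a toy class. Everything here is hypothetical: `DualRoleSketch`, `ProbCallback`, and `CountingThenEmitting` are illustration-only names, not part of this package, and the emitted value is a plain relative frequency rather than a Kneser-Ney estimate.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class DualRoleSketch {

    /** Hypothetical sink for finished values; stands in for an ARPA-style callback. */
    interface ProbCallback {
        void call(String ngram, double prob);
    }

    /** Acts as a callback while counting, then as a reader when parse() is invoked. */
    static final class CountingThenEmitting {
        private final Map<String, Long> counts = new LinkedHashMap<>();
        private long total = 0;

        /** Callback role: invoked once per n-gram occurrence by whatever reads the text. */
        void call(String ngram) {
            counts.merge(ngram, 1L, Long::sum);
            total++;
        }

        /** Reader role: turns the stored counts into values and passes them on. */
        void parse(ProbCallback callback) {
            for (Map.Entry<String, Long> e : counts.entrySet()) {
                // Plain relative frequency, NOT Kneser-Ney; it only illustrates the data flow.
                callback.call(e.getKey(), (double) e.getValue() / total);
            }
        }
    }

    public static void main(String[] args) {
        CountingThenEmitting lm = new CountingThenEmitting();
        for (String w : "a b a".split(" ")) {
            lm.call(w);
        }
        lm.parse((ngram, prob) -> System.out.println(ngram + "\t" + prob));
    }
}
```

The point of the sketch is only the ordering: all `call` invocations happen first, and `parse` runs once counting is complete.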
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from interface edu.berkeley.nlp.lm.ArrayEncodedNgramLanguageModel
ArrayEncodedNgramLanguageModel.DefaultImplementations
-
Nested classes/interfaces inherited from interface edu.berkeley.nlp.lm.NgramLanguageModel
NgramLanguageModel.StaticMethods
-
-
Field Summary
Fields: Modifier and Type, Field - Description
protected static float DEFAULT_DISCOUNT
protected int lmOrder
protected HashNgramMap<KneserNeyCountValueContainer.KneserNeyCounts> ngrams
protected ConfigOptions opts
protected static long serialVersionUID
protected int startIndex
protected WordIndexer<W> wordIndexer - This array represents the discount used for each ngram order.
-
Constructor Summary
Constructors: Constructor - Description
KneserNeyLmReaderCallback(WordIndexer<W> wordIndexer, int maxOrder)
KneserNeyLmReaderCallback(WordIndexer<W> wordIndexer, int maxOrder, ConfigOptions opts)
-
Method Summary
Methods: Modifier and Type, Method - Description
void addNgram(int[] ngram, int startPos, int endPos, LongRef value, java.lang.String words, boolean justLastWord, long[][] scratch)
void call(int[] ngram, int startPos, int endPos, LongRef value, java.lang.String words) - Called for each n-gram
void call(W[] ngram, LongRef value)
void callJustLast(W[] ngram, LongRef value, long[][] scratch)
void cleanup() - Called once all reading is done.
static double[] defaultDiscounts()
static double[] defaultMinCounts()
protected float getDiscountForOrder(int ngramOrder)
protected float getHighestOrderProb(int[] ngram, int startPos, int endPos)
int getLmOrder() - Maximum size of n-grams stored by the model.
float getLogProb(int[] ngram) - Equivalent to getLogProb(ngram, 0, ngram.length)
float getLogProb(int[] ngram, int startPos, int endPos) - Calculate language model score of an n-gram.
float getLogProb(java.util.List<W> ngram) - Scores an n-gram.
protected float getLowerOrderBackoff(int[] ngram, int startPos, int endPos)
protected float getLowerOrderProb(int[] ngram, int startPos, int endPos)
long getTotalSize()
WordIndexer<W> getWordIndexer() - Each LM must have a WordIndexer which assigns integer IDs to each word W in the language.
void handleNgramOrderFinished(int order) - Called when all n-grams of a given order are finished
void handleNgramOrderStarted(int order) - Called when n-grams of a given order are started
protected float interpolateProb(int[] ngram, int startPos, int endPos)
void parse(ArpaLmReaderCallback<ProbBackoffPair> callback)
float scoreSentence(java.util.List<W> sentence) - Scores a complete sentence, taking appropriate care with the start- and end-of-sentence symbols.
void setOovWordLogProb(float logProb) - Sets the (log) probability for an OOV word.
-
-
-
Field Detail
-
serialVersionUID
protected static final long serialVersionUID
- See Also:
- Constant Field Values
-
DEFAULT_DISCOUNT
protected static final float DEFAULT_DISCOUNT
- See Also:
- Constant Field Values
-
lmOrder
protected final int lmOrder
-
wordIndexer
protected final WordIndexer<W> wordIndexer
This array represents the discount used for each ngram order. The original Kneser-Ney discounting (-ukndiscount) uses one discounting constant for each N-gram order. These constants are estimated as D = n1 / (n1 + 2*n2), where n1 and n2 are the total numbers of N-grams with exactly one and exactly two counts, respectively. For simplicity, this code uses a constant discount of 0.75 for every order. However, other discounts can be specified.
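The estimate D = n1 / (n1 + 2*n2) quoted above is simple to compute once the count-of-counts are known. The following is a minimal sketch of that arithmetic only, not this class's implementation: the class name `KneserNeyDiscountSketch`, the method `estimateDiscount`, and the fallback to 0.75 when both counts are zero are assumptions made for illustration.

```java
final class KneserNeyDiscountSketch {

    /** The constant per-order default mentioned in the text above. */
    static final double DEFAULT_DISCOUNT = 0.75;

    /**
     * Estimates D = n1 / (n1 + 2 * n2), where n1 and n2 are the numbers of
     * n-grams of one order that were seen exactly once and exactly twice.
     * Falling back to the default when the ratio is undefined is an assumption.
     */
    static double estimateDiscount(long n1, long n2) {
        long denominator = n1 + 2 * n2;
        if (denominator == 0) {
            return DEFAULT_DISCOUNT;
        }
        return (double) n1 / denominator;
    }

    public static void main(String[] args) {
        // 600 singletons and 200 doubletons: 600 / (600 + 400)
        System.out.println(estimateDiscount(600, 200));
        // No singletons or doubletons: the fallback constant is returned
        System.out.println(estimateDiscount(0, 0));
    }
}
```

Because n2 is never negative, the estimate always lies between 0 and 1.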
-
ngrams
protected final HashNgramMap<KneserNeyCountValueContainer.KneserNeyCounts> ngrams
-
opts
protected final ConfigOptions opts
-
startIndex
protected final int startIndex
-
-
Constructor Detail
-
KneserNeyLmReaderCallback
public KneserNeyLmReaderCallback(WordIndexer<W> wordIndexer, int maxOrder)
- Parameters:
wordIndexer-
maxOrder-
inputIsSentences- If true, input n-grams are assumed to be sentences, and all sub-ngrams of up to order maxOrder are added. If false, input n-grams are assumed to be atomic.
-
KneserNeyLmReaderCallback
public KneserNeyLmReaderCallback(WordIndexer<W> wordIndexer, int maxOrder, ConfigOptions opts)
-
-
Method Detail
-
call
public void call(int[] ngram, int startPos, int endPos, LongRef value, java.lang.String words)
Description copied from interface: LmReaderCallback
Called for each n-gram
- Specified by:
call in interface LmReaderCallback<W>
- Parameters:
ngram- The integer representation of the words as given by the provided WordIndexer
value- The value of the n-gram
words- The string representation of the n-gram (space separated)
-
addNgram
public void addNgram(int[] ngram, int startPos, int endPos, LongRef value, java.lang.String words, boolean justLastWord, long[][] scratch)
- Parameters:
ngram-
startPos-
endPos-
value-
words-
-
interpolateProb
protected float interpolateProb(int[] ngram, int startPos, int endPos)
-
getHighestOrderProb
protected float getHighestOrderProb(int[] ngram, int startPos, int endPos)
-
getLowerOrderProb
protected float getLowerOrderProb(int[] ngram, int startPos, int endPos)
-
getLowerOrderBackoff
protected float getLowerOrderBackoff(int[] ngram, int startPos, int endPos)
-
getDiscountForOrder
protected float getDiscountForOrder(int ngramOrder)
-
cleanup
public void cleanup()
Description copied from interface: LmReaderCallback
Called once all reading is done.
- Specified by:
cleanup in interface LmReaderCallback<W>
-
defaultDiscounts
public static double[] defaultDiscounts()
-
defaultMinCounts
public static double[] defaultMinCounts()
-
parse
public void parse(ArpaLmReaderCallback<ProbBackoffPair> callback)
- Specified by:
parse in interface LmReader<ProbBackoffPair,ArpaLmReaderCallback<ProbBackoffPair>>
-
getWordIndexer
public WordIndexer<W> getWordIndexer()
Description copied from interface: NgramLanguageModel
Each LM must have a WordIndexer which assigns integer IDs to each word W in the language.
- Specified by:
getWordIndexer in interface NgramLanguageModel<W>
- Returns:
-
handleNgramOrderFinished
public void handleNgramOrderFinished(int order)
Description copied from interface: NgramOrderedLmReaderCallback
Called when all n-grams of a given order are finished
- Specified by:
handleNgramOrderFinished in interface NgramOrderedLmReaderCallback<W>
-
handleNgramOrderStarted
public void handleNgramOrderStarted(int order)
Description copied from interface: NgramOrderedLmReaderCallback
Called when n-grams of a given order are started
- Specified by:
handleNgramOrderStarted in interface NgramOrderedLmReaderCallback<W>
-
getLmOrder
public int getLmOrder()
Description copied from interface: NgramLanguageModel
Maximum size of n-grams stored by the model.
- Specified by:
getLmOrder in interface NgramLanguageModel<W>
- Returns:
-
scoreSentence
public float scoreSentence(java.util.List<W> sentence)
Description copied from interface: NgramLanguageModel
Scores a complete sentence, taking appropriate care with the start- and end-of-sentence symbols. This is a convenience method and will generally be inefficient.
- Specified by:
scoreSentence in interface NgramLanguageModel<W>
- Returns:
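One common way to take care of the sentence boundaries is to pad the sentence with a start and an end symbol and sum the log-probabilities of each word given at most lmOrder - 1 words of context. The sketch below illustrates that idea under stated assumptions and is not this library's implementation: `NgramScorer` is a hypothetical stand-in for a model, and the literal symbols `<s>` and `</s>` are assumed (in practice they would come from the WordIndexer).

```java
import java.util.ArrayList;
import java.util.List;

final class SentenceScoringSketch {

    /** Hypothetical stand-in for an n-gram scorer; not the library's interface. */
    interface NgramScorer {
        float logProb(List<String> ngram);
    }

    static float scoreSentence(List<String> sentence, int lmOrder, NgramScorer scorer) {
        // Pad with the (assumed) boundary symbols.
        List<String> padded = new ArrayList<>();
        padded.add("<s>");
        padded.addAll(sentence);
        padded.add("</s>");

        float total = 0.0f;
        // Index 0 is the start symbol: it serves as context but is never predicted itself.
        for (int end = 2; end <= padded.size(); end++) {
            int start = Math.max(0, end - lmOrder);
            total += scorer.logProb(padded.subList(start, end));
        }
        return total;
    }

    public static void main(String[] args) {
        // Toy scorer: every n-gram gets log-prob -1, so the result is minus
        // the number of predicted tokens (the words plus the end symbol).
        NgramScorer toy = ngram -> -1.0f;
        System.out.println(scoreSentence(List.of("a", "b"), 3, toy));
    }
}
```

Each loop iteration scores one predicted token, so a sentence of n words produces n + 1 scorer calls.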
-
getLogProb
public float getLogProb(java.util.List<W> ngram)
Description copied from interface: NgramLanguageModel
Scores an n-gram. This is a convenience method and will generally be relatively inefficient. More efficient versions are available in ArrayEncodedNgramLanguageModel.getLogProb(int[], int, int) and ContextEncodedNgramLanguageModel.getLogProb(long, int, int, edu.berkeley.nlp.lm.ContextEncodedNgramLanguageModel.LmContextInfo).
- Specified by:
getLogProb in interface NgramLanguageModel<W>
-
getLogProb
public float getLogProb(int[] ngram, int startPos, int endPos)
Description copied from interface: ArrayEncodedNgramLanguageModel
Calculate language model score of an n-gram. Warning: if you pass in an n-gram of length greater than getLmOrder(), this call will silently ignore the extra words of context. In other words, if you pass in a 5-gram (endPos-startPos == 5) to a 3-gram model, it will only score the words from startPos + 2 to endPos.
- Specified by:
getLogProb in interface ArrayEncodedNgramLanguageModel<W>
- Parameters:
ngram- array of words in integer representation
startPos- start of the portion of the array to be read
endPos- end of the portion of the array to be read.
- Returns:
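The truncation described in the warning above amounts to clamping the start of the span so that at most getLmOrder() words remain. A minimal sketch of that index arithmetic follows; `NgramWindowSketch` and `effectiveStart` are hypothetical names, and treating endPos as exclusive is an assumption inferred from the documented equivalence getLogProb(ngram) = getLogProb(ngram, 0, ngram.length).

```java
final class NgramWindowSketch {

    /**
     * Returns the first index that would actually be used when the span
     * [startPos, endPos) is scored by a model of order lmOrder.
     * endPos is treated as exclusive (assumption).
     */
    static int effectiveStart(int startPos, int endPos, int lmOrder) {
        return Math.max(startPos, endPos - lmOrder);
    }

    public static void main(String[] args) {
        // A 5-gram passed to a 3-gram model: the first two words are ignored.
        System.out.println(effectiveStart(0, 5, 3));
        // A bigram passed to a 3-gram model: nothing is dropped.
        System.out.println(effectiveStart(4, 6, 3));
    }
}
```

Callers who need to know how much context was really used can apply the same clamp before calling the model.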
-
getLogProb
public float getLogProb(int[] ngram)
Description copied from interface: ArrayEncodedNgramLanguageModel
Equivalent to getLogProb(ngram, 0, ngram.length)
- Specified by:
getLogProb in interface ArrayEncodedNgramLanguageModel<W>
- See Also:
ArrayEncodedNgramLanguageModel.getLogProb(int[], int, int)
-
getTotalSize
public long getTotalSize()
-
setOovWordLogProb
public void setOovWordLogProb(float logProb)
Description copied from interface: NgramLanguageModel
Sets the (log) probability for an OOV word. Note that this is in general different from the log prob of the unk tag probability.
- Specified by:
setOovWordLogProb in interface NgramLanguageModel<W>
-
-