How NGramLab Works
Each concept used in the pipeline, explained without jargon. Useful before your presentation.
What is a corpus?
A corpus is a body of text: newspaper articles, a novel, Wikipedia pages, tweets, or any other written source. The model learns its statistics: which words appear, and which words tend to follow which other words.
In NGramLab you can paste your own text, upload a .txt file, or extract text from a website.
What is tokenization?
Computers do not understand sentences directly. Tokenization turns raw text into a list of discrete units called tokens.
"I love NLP." → ["i", "love", "nlp"]
NGramLab can insert sentence-boundary tokens <s> and </s>. Rare words outside the vocabulary cap are replaced with <UNK>.
What is an n-gram?
An n-gram is a contiguous sequence of n tokens. We use n-grams to estimate how likely the next word is given the words before it.
['the']
['the','quick']
['the','quick','brown']
['the','quick','brown','fox']
A 4-gram model conditions on the previous 3 words to predict the fourth. Its probability is estimated by counting:
P(w4 | w1 w2 w3) ≈ Count(w1 w2 w3 w4) / Count(w1 w2 w3)
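The counting estimate above can be tried on a toy sentence. The sketch below is illustrative only: `ngrams` and `mle_probability` are hypothetical helper names, and the function recounts the text on every call, which is fine for a demonstration but far too slow for a real corpus.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def mle_probability(tokens, context, word):
    """Count(context + word) / Count(context) for a non-empty context tuple.

    Returns None when the context never occurs, because the ratio is undefined.
    """
    context = tuple(context)
    full_counts = Counter(ngrams(tokens, len(context) + 1))
    context_counts = Counter(ngrams(tokens, len(context)))
    if context_counts[context] == 0:
        return None
    return full_counts[context + (word,)] / context_counts[context]

tokens = "the quick brown fox jumps over the quick brown dog".split()
# "the quick brown" occurs twice and is followed by "fox" once.
print(mle_probability(tokens, ("the", "quick", "brown"), "fox"))  # 0.5
```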
What is backoff?
When the 4-gram you need was never seen in training, the count is zero. Backoff falls back to shorter contexts: trigram, then bigram, then unigram.
4-gram → α·trigram → α²·bigram → α³·add-1 unigram
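LM1 uses Katz-style backoff with a fixed weight α = 0.4. (A single fixed α is the simplification often called "stupid backoff"; full Katz backoff derives a separate weight for each context from discounted counts.) Each fallback step multiplies the estimate by α, so an estimate taken from a shorter, less specific context counts for less than one taken from the full context. The final unigram step uses add-1 smoothing so that every word in the vocabulary gets a probability above zero.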
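The fallback chain can be sketched as a single scoring function. This is a minimal sketch assuming a fixed α and a count table keyed by token tuples; `build_counts` and `backoff_score` are hypothetical names, not NGramLab's API. Because α is fixed, the results are best read as scores: they are not guaranteed to sum to exactly 1 over the vocabulary.

```python
from collections import Counter

def build_counts(tokens, max_n=4):
    """Count every n-gram from order 1 up to max_n, keyed by tuple."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def backoff_score(counts, context, word, total_tokens, vocab_size, alpha=0.4):
    """4-gram -> alpha * trigram -> alpha^2 * bigram -> alpha^3 * add-1 unigram."""
    context = tuple(context)[-3:]   # a 4-gram model uses at most 3 words of history
    penalty = 1.0
    while context:
        seen = counts[context + (word,)]
        if seen > 0:
            return penalty * seen / counts[context]
        context = context[1:]       # drop the oldest word and try a shorter context
        penalty *= alpha
    # Final fallback: add-1 smoothed unigram, always greater than zero.
    return penalty * (counts[(word,)] + 1) / (total_tokens + vocab_size)

tokens = "the quick brown fox jumps over the quick brown dog".split()
counts = build_counts(tokens)
vocab_size = len(set(tokens))
print(backoff_score(counts, ("the", "quick", "brown"), "fox", len(tokens), vocab_size))
print(backoff_score(counts, ("the", "quick", "brown"), "jumps", len(tokens), vocab_size))
```

The first call finds the 4-gram directly. The second never sees "jumps" after any suffix of the context, so it falls all the way through to the discounted unigram.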
What is interpolation?
Instead of choosing one n-gram order, interpolation mixes all orders with weights that sum to 1.
P(w | h) = λ1·P1(w) + λ2·P2(w | h) + λ3·P3(w | h) + λ4·P4(w | h)
Here h is the history (the preceding words), P1 through P4 are the unigram through 4-gram estimates, and λ1 + λ2 + λ3 + λ4 = 1. A heavier 4-gram weight trusts longer context more. A heavier unigram weight leans on overall word frequency. The Tuning page searches for a good balance.
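As a small worked example of the mixing formula, the sketch below combines four per-order estimates with four weights. The function name `interpolate` and the numbers are made up for illustration; they are not values NGramLab produces.

```python
def interpolate(order_probs, lambdas):
    """Weighted sum of per-order estimates, ordered unigram first, 4-gram last."""
    if len(order_probs) != len(lambdas):
        raise ValueError("need exactly one weight per n-gram order")
    if abs(sum(lambdas) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(lam * p for lam, p in zip(lambdas, order_probs))

# Hypothetical estimates for one word: the 4-gram was never seen (0.0),
# but the shorter contexts still contribute.
order_probs = [0.05, 0.10, 0.30, 0.0]
lambdas = [0.1, 0.2, 0.3, 0.4]
print(interpolate(order_probs, lambdas))  # about 0.115
```

Even though the 4-gram estimate is zero, the mixture stays positive because the lower orders carry some of the weight.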
What is smoothing?
Smoothing keeps probabilities away from exactly zero by adding a small constant k to every count.
P(w | h) = (Count(h, w) + k) / (Count(h) + k * V)
V is the vocabulary size. NGramLab applies add-k smoothing inside each n-gram term of the interpolation, so unseen 4-grams can still contribute a non-zero probability.
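The add-k formula is short enough to check by hand. The sketch below uses a hypothetical helper, `add_k_probability`, with k = 0.1 chosen only for illustration, and confirms that the smoothed probabilities for one context still sum to 1 across the vocabulary.

```python
def add_k_probability(count_hw, count_h, vocab_size, k=0.1):
    """(Count(h, w) + k) / (Count(h) + k * V)."""
    return (count_hw + k) / (count_h + k * vocab_size)

# Toy context seen twice: followed once by "fox" and once by "dog".
vocab = ["the", "quick", "brown", "fox", "jumps", "over", "dog"]
followers = {"fox": 1, "dog": 1}
count_h = sum(followers.values())

probs = {w: add_k_probability(followers.get(w, 0), count_h, len(vocab)) for w in vocab}
print(probs["fox"])         # seen word: (1 + 0.1) / (2 + 0.7)
print(probs["jumps"])       # unseen word: small but not zero
print(sum(probs.values()))  # the distribution still sums to 1 (up to rounding)
```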
What is perplexity?
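Perplexity measures how surprised the model is by held-out text. It is the exponential of the negative average log-probability the model assigns to each token of the test sequence.
PP(W) = exp( - (1 / N) * sum log P(w_i | context) )
N is the number of tokens in the test sequence, and the sum runs over all of them. Lower is better. Perplexity becomes infinite if any token is assigned probability zero, because log 0 is minus infinity; this is why LM1 can become infinite while LM2 stays finite.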
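The perplexity formula can be sketched directly from a list of per-token probabilities. The function name `perplexity` and the choice to return infinity explicitly for a zero probability are assumptions made for this example.

```python
import math

def perplexity(probabilities):
    """exp of the negative average log-probability over the test tokens."""
    if not probabilities:
        raise ValueError("need at least one token probability")
    if any(p <= 0.0 for p in probabilities):
        # log(0) is undefined, so a single zero makes perplexity infinite.
        return math.inf
    avg_log = sum(math.log(p) for p in probabilities) / len(probabilities)
    return math.exp(-avg_log)

# A model that gives every token probability 1/4 is as uncertain as a
# uniform choice among 4 words, so its perplexity is about 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
print(perplexity([0.5, 0.0, 0.5]))  # inf
```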
Temperature & sampling
When generating, the model can choose words in several ways:
- Greedy - always pick the top word. Predictable but repetitive.
- Weighted random - sample proportional to probability. More natural variation.
- Top-K - keep only the K most likely words, then sample from that smaller set.
Temperature rescales the distribution before sampling. t < 1 sharpens it toward the most likely words; t > 1 flattens it, which makes the output more varied.
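The three strategies and the temperature step can be sketched as follows. This is an illustration under stated assumptions: temperature is applied by raising each probability to the power 1/t and renormalizing (equivalent to dividing log-probabilities by t), which is the common convention but is not confirmed as NGramLab's exact method. All function names are hypothetical.

```python
import math
import random

def apply_temperature(probs, t):
    """Rescale as p ** (1 / t) and renormalize; t must be greater than 0."""
    scaled = {w: math.exp(math.log(p) / t) for w, p in probs.items() if p > 0}
    total = sum(scaled.values())
    return {w: s / total for w, s in scaled.items()}

def top_k(probs, k):
    """Keep the k most likely words and renormalize over that smaller set."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {w: p / total for w, p in kept}

def greedy(probs):
    """Always pick the single most likely word."""
    return max(probs, key=probs.get)

def sample(probs, rng=random):
    """Weighted random choice, proportional to probability."""
    words = list(probs)
    return rng.choices(words, weights=[probs[w] for w in words], k=1)[0]

probs = {"fox": 0.6, "dog": 0.3, "cat": 0.1}
print(greedy(probs))                                 # fox
print(apply_temperature(probs, 0.5))                 # sharper: "fox" gains mass
print(apply_temperature(probs, 2.0))                 # flatter: "fox" loses mass
print(sample(top_k(probs, 2), random.Random(0)))     # "fox" or "dog", never "cat"
```

Passing a seeded `random.Random` instance makes a sampled run repeatable, which helps when comparing settings side by side.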