How NGramLab Works

Each concept used in the pipeline, explained without jargon. Useful before your presentation.

What is a corpus?

Foundation

A corpus is a body of text: newspaper articles, a novel, Wikipedia pages, tweets, or any other written source. The model learns the statistics of that text: which words appear, and which words tend to follow which others.

In NGramLab you can paste your own text, upload a .txt file, or extract text from a website.

What is tokenization?

Preprocessing

Computers do not understand sentences directly. Tokenization turns raw text into a list of discrete units called tokens. In the example below, the text is also lowercased and the punctuation is dropped.

Input

"I love NLP."

Output

["i", "love", "nlp"]

NGramLab can insert sentence-boundary tokens <s> and </s>. Rare words outside the vocabulary cap are replaced with <UNK>.
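A minimal sketch of these preprocessing steps, to make them concrete. The function names, the regular expression, and the max_vocab parameter are illustrative assumptions, not NGramLab's actual code.

```python
import re
from collections import Counter

def tokenize(text, add_boundaries=False):
    """Lowercase the text and keep runs of letters, digits and apostrophes."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    if add_boundaries:
        # Optional sentence-boundary markers around the token list.
        tokens = ["<s>"] + tokens + ["</s>"]
    return tokens

def apply_vocab_cap(tokens, max_vocab):
    """Keep the max_vocab most frequent tokens; map all others to <UNK>."""
    keep = {word for word, _ in Counter(tokens).most_common(max_vocab)}
    return [t if t in keep else "<UNK>" for t in tokens]

print(tokenize("I love NLP."))                       # ['i', 'love', 'nlp']
print(tokenize("I love NLP.", add_boundaries=True))  # ['<s>', 'i', 'love', 'nlp', '</s>']
print(apply_vocab_cap(["a", "a", "b"], max_vocab=1)) # ['a', 'a', '<UNK>']
```

This sketch treats the whole input as one sentence; a real pipeline would split sentences before adding the boundary tokens.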

What is an n-gram?

Core idea

An n-gram is a contiguous sequence of n tokens. We use n-grams to estimate how likely the next word is given the words before it.

unigram

['the']

bigram

['the','quick']

trigram

['the','quick','brown']

4-gram

['the','quick','brown','fox']
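The examples above can be produced by sliding a window of n tokens over the text. A short sketch, with ngrams as a hypothetical helper name:

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["the", "quick", "brown", "fox"]
print(ngrams(tokens, 2))  # [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
print(ngrams(tokens, 4))  # [('the', 'quick', 'brown', 'fox')]
```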

A 4-gram model conditions on the previous 3 words to predict the fourth. The probability of that fourth word is estimated by counting:

P(w4 | w1 w2 w3) ≈ Count(w1 w2 w3 w4) / Count(w1 w2 w3)
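As a worked example of the counting estimate, here is a small sketch on a toy corpus. The helper names and the toy sentence are made up for illustration.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count every contiguous run of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def mle_prob(tokens, context, word):
    """Count(context + word) / Count(context); 0.0 if the context was never seen."""
    context = tuple(context)
    n = len(context) + 1
    joint = count_ngrams(tokens, n)[context + (word,)]
    seen = count_ngrams(tokens, n - 1)[context]
    return joint / seen if seen else 0.0

tokens = "the quick brown fox saw the quick brown dog".split()
# "the quick brown" occurs twice and is followed by "fox" once.
print(mle_prob(tokens, ["the", "quick", "brown"], "fox"))  # 0.5
# "cat" never follows this context, so the estimate is zero.
print(mle_prob(tokens, ["the", "quick", "brown"], "cat"))  # 0.0
```

The zero in the second call is the problem that backoff and smoothing are meant to address.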

What is backoff?

LM1

When the 4-gram you need was never seen in training, its count is zero and the counting estimate gives it probability zero. Backoff then falls back to shorter contexts: trigram, then bigram, then unigram.

4-gram → α·trigram → α²·bigram → α³·add-1 unigram

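LM1 uses backoff with a fixed weight (α = 0.4). A fixed weight like this is usually called stupid backoff; full Katz backoff instead discounts the counts and computes a separate backoff weight for each context. Each fallback step multiplies the estimate by α, so a match found only in a shorter context counts for less than a match in a longer one. The final unigram uses add-1 smoothing to keep every estimate positive. Because α is fixed, the results are scores that are not guaranteed to sum to 1.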
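The fallback chain can be sketched as follows. This is a simplified illustration under stated assumptions (hypothetical function names, a toy corpus, α = 0.4 applied once per fallback step), not LM1's actual implementation.

```python
from collections import Counter

ALPHA = 0.4  # fixed per-step discount, as described above

def build_counts(tokens, max_n=4):
    """Map each order n to a Counter of its n-grams."""
    return {
        n: Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        for n in range(1, max_n + 1)
    }

def backoff_score(counts, context, word, max_n=4, alpha=ALPHA):
    """Use the longest context that was seen with `word`, discounted per step."""
    keep = max_n - 1
    context = tuple(context)[-keep:] if keep > 0 else ()
    penalty = 1.0
    while context:
        n = len(context) + 1
        joint = counts[n][context + (word,)]
        if joint > 0:
            return penalty * joint / counts[n - 1][context]
        context = context[1:]   # drop the oldest word
        penalty *= alpha        # one more fallback step
    # Last resort: add-1 smoothed unigram, which is never zero.
    total = sum(counts[1].values())
    vocab = len(counts[1])
    return penalty * (counts[1][(word,)] + 1) / (total + vocab)

counts = build_counts("the quick brown fox saw the quick brown dog".split())
print(backoff_score(counts, ["the", "quick", "brown"], "fox"))  # 4-gram seen: 0.5
print(backoff_score(counts, ["a", "quick", "brown"], "dog"))    # trigram seen: 0.2
print(backoff_score(counts, ["saw", "the", "quick"], "fox"))    # unigram only, about 0.0085
```

In the last call nothing longer than the unigram matches, so the score is α³ times the add-1 unigram estimate (1 + 1) / (9 + 6).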

What is interpolation?

LM2

Instead of choosing one n-gram order, interpolation mixes all orders with weights that sum to 1.

P(w | h) = λ1·P1(w) + λ2·P2(w | h) + λ3·P3(w | h) + λ4·P4(w | h)

Here Pn is the n-gram estimate, which uses only the last n - 1 words of the history h. A heavier 4-gram weight trusts longer context more; a heavier unigram weight leans on overall word frequency. The Tuning page searches for a good balance.
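To make the mixing concrete, here is a small sketch of the weighted sum. The probabilities and weights are invented numbers for illustration, not values from NGramLab.

```python
def interpolate(order_probs, lambdas):
    """Weighted sum of per-order estimates; the weights must sum to 1."""
    if len(order_probs) != len(lambdas):
        raise ValueError("need one weight per n-gram order")
    if abs(sum(lambdas) - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    return sum(lam * p for lam, p in zip(lambdas, order_probs))

# Unigram, bigram, trigram and 4-gram estimates for the same next word.
probs = [0.02, 0.10, 0.25, 0.0]   # the 4-gram was never seen
lambdas = [0.1, 0.2, 0.3, 0.4]
print(interpolate(probs, lambdas))  # approximately 0.097
```

Even though the 4-gram estimate is zero, the lower orders keep the mixed probability above zero.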

What is smoothing?

Add-k

Smoothing keeps probabilities away from exactly zero by adding a small constant k to every count.

P(w | h) = (Count(h, w) + k) / (Count(h) + k * V)

V is the vocabulary size. NGramLab applies add-k smoothing inside each n-gram term of the interpolation, so unseen 4-grams can still contribute a non-zero probability.
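The formula translates directly into code. A minimal sketch, with made-up counts and k = 0.5 chosen only for easy arithmetic:

```python
def add_k_prob(joint_count, context_count, vocab_size, k):
    """(Count(h, w) + k) / (Count(h) + k * V)"""
    return (joint_count + k) / (context_count + k * vocab_size)

# Context seen 2 times, vocabulary of 6 words, k = 0.5.
print(add_k_prob(0, 2, 6, k=0.5))  # unseen word: 0.5 / 5 = 0.1
print(add_k_prob(1, 2, 6, k=0.5))  # word seen once: 1.5 / 5 = 0.3
```

With these counts, two words at 0.3 and four unseen words at 0.1 add up to 1, so the smoothed values still form a proper distribution.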

What is perplexity?

Evaluation

Perplexity measures how surprised the model is by held-out text. It is the exponential of the average negative log-probability per token that the model assigns to the test sequence.

PP(W) = exp( - (1 / N) * sum log P(w_i | context) )

Lower is better; N is the number of tokens scored. A single zero probability makes the sum of logs diverge, so perplexity becomes infinite. That is why LM1 can become infinite while LM2, which smooths every term, stays finite.
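A short sketch of the calculation, taking the per-token probabilities as given. Returning infinity when a zero appears is a choice made for this illustration.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability over the scored tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    if any(p <= 0.0 for p in token_probs):
        return math.inf  # log(0) is undefined; treat as infinite surprise
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)

print(perplexity([0.25, 0.25, 0.25, 0.25]))  # approximately 4
print(perplexity([0.5, 0.0, 0.5]))           # inf
```

A perplexity of about 4 means the model is, on average, as uncertain as if it were choosing evenly among 4 words.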

Temperature & sampling

Generation

When generating, the model can choose words in several ways:

Greedy - always pick the top word. Predictable but repetitive.
Weighted random - sample proportional to probability. More natural variation.
Top-K - keep only the K most likely words, then sample from that smaller set.

Temperature rescales the distribution before sampling. t < 1 sharpens it toward the most likely words; t > 1 flattens it, so less likely words are chosen more often.
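The three strategies and the temperature step can be sketched as below. The function names and the toy distribution are hypothetical, and raising each probability to the power 1/t is one common way to apply temperature; NGramLab's exact formula may differ.

```python
import random

def apply_temperature(probs, t):
    """Raise each probability to 1/t, then renormalize so the values sum to 1."""
    scaled = {w: p ** (1.0 / t) for w, p in probs.items()}
    total = sum(scaled.values())
    return {w: s / total for w, s in scaled.items()}

def top_k(probs, k):
    """Keep the k most likely words and renormalize."""
    best = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in best)
    return {w: p / total for w, p in best}

def greedy(probs):
    """Always pick the single most likely word."""
    return max(probs, key=probs.get)

def weighted_sample(probs, rng=random):
    """Pick a word at random, proportional to its probability."""
    words = list(probs)
    return rng.choices(words, weights=[probs[w] for w in words], k=1)[0]

probs = {"cat": 0.6, "dog": 0.3, "eel": 0.1}
print(greedy(probs))                              # cat
print(apply_temperature(probs, 0.5))              # "cat" rises above 0.6
print(apply_temperature(probs, 2.0))              # "cat" falls below 0.6
print(weighted_sample(top_k(probs, 2)))           # "cat" or "dog", never "eel"
```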
