NGramLab
Experiment Report
Interactive 4-Gram Language Model Demo. Generated 5/13/2026, 7:21:33 AM.
1. Corpus Summary
2. Preprocessing
| Setting | Value |
|---|---|
| Lowercase | true |
| Remove extra spaces | true |
| Keep punctuation | false |
| Sentence boundaries | true |
| Vocabulary limit | 1000 |
| Special tokens used | <s>, </s>, <UNK> |
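
The settings above can be read as a small pipeline. The following is a minimal sketch under assumptions, not NGramLab's actual code: the function name `preprocess`, the sentence-splitting rule, the single `<s>` pad per sentence, and the choice not to count special tokens toward the vocabulary limit are all hypothetical.

```python
import re
from collections import Counter

def preprocess(text, vocab_limit=1000):
    """Hypothetical pipeline mirroring the preprocessing settings."""
    sentences = []
    # Assumed boundary rule: split on runs of '.', '!' or '?'.
    for raw in re.split(r"[.!?]+", text):
        s = raw.lower()                      # Lowercase: true
        s = re.sub(r"[^\w\s]", "", s)        # Keep punctuation: false
        s = re.sub(r"\s+", " ", s).strip()   # Remove extra spaces: true
        if s:
            sentences.append(s.split(" "))
    # Keep the most frequent word types, up to the vocabulary limit.
    counts = Counter(w for sent in sentences for w in sent)
    vocab = {w for w, _ in counts.most_common(vocab_limit)}
    # Sentence boundaries: true. Out-of-vocabulary words become <UNK>.
    return [["<s>"] + [w if w in vocab else "<UNK>" for w in sent] + ["</s>"]
            for sent in sentences]
```

Calling `preprocess("The  cat sat. The dog, too!")` yields two token lists, each wrapped in `<s>` and `</s>`; lowering `vocab_limit` replaces the rarer words with `<UNK>`.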
3. Dataset Split
| Split | Token count |
|---|---|
| Train | 0 |
| Validation | 0 |
| Test | 0 |
4. Model Design
LM1: Backoff Language Model
LM1 is an unsmoothed 4-gram model: it estimates probabilities from raw counts and backs off to the next lower order whenever the higher-order n-gram is unseen in training. The fallback order is 4-gram, then trigram, then bigram, then unigram.
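
The fallback order can be sketched as follows. This is an illustration under assumptions rather than the tool's implementation: `count_ngrams` and `backoff_prob` are hypothetical names, the estimate at each order is taken to be a plain relative frequency, and no backoff weights are applied, so the scores are not guaranteed to sum to 1 over the vocabulary.

```python
from collections import Counter

def count_ngrams(tokens, max_n=4):
    """Count every n-gram up to max_n, plus how often each context is followed by a word."""
    ngrams, contexts = Counter(), Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            ngrams[gram] += 1
            contexts[gram[:-1]] += 1   # the empty context () counts all tokens
    return ngrams, contexts

def backoff_prob(word, history, ngrams, contexts, max_n=4):
    """Return the relative frequency at the highest order where the n-gram was seen."""
    for n in range(max_n, 0, -1):      # 4-gram, trigram, bigram, unigram
        if len(history) < n - 1:
            continue                   # not enough history for this order
        context = tuple(history[-(n - 1):]) if n > 1 else ()
        gram = context + (word,)
        if ngrams[gram] > 0:
            return ngrams[gram] / contexts[context]
    return 0.0                         # unseen at every order, including unigram

tokens = ["<s>", "the", "cat", "sat", "on", "the", "mat", "</s>"]
ngrams, contexts = count_ngrams(tokens)
print(backoff_prob("sat", ["<s>", "the", "cat"], ngrams, contexts))  # 1.0 (4-gram seen)
print(backoff_prob("mat", ["sat", "a", "the"], ngrams, contexts))    # 0.5 (falls to bigram)
print(backoff_prob("the", ["x", "y", "z"], ngrams, contexts))        # 0.25 (falls to unigram)
print(backoff_prob("dog", ["on", "the", "mat"], ngrams, contexts))   # 0.0 (never seen)
```

The last call shows the failure case: with no smoothing, a word that is missing even at the unigram level receives probability zero.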
LM2: Linear Interpolation
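LM2 linearly interpolates all four n-gram orders, each estimated with add-k smoothing. The interpolated probability is P(w | h) = lambda1 P1(w | h) + lambda2 P2(w | h) + lambda3 P3(w | h) + lambda4 P4(w | h), where the weights lambda1 through lambda4 are non-negative and must sum to 1 for the result to be a valid probability distribution.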
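
A minimal sketch of this mixture follows. Several details are assumptions not stated in the report: that P1 is the unigram estimate and P4 the 4-gram estimate, the default weights `(0.1, 0.2, 0.3, 0.4)`, the value `k=0.1`, and the helper names. The history is assumed to contain at least three tokens.

```python
from collections import Counter

def count_ngrams(tokens, max_n=4):
    """Count every n-gram up to max_n, plus how often each context is followed by a word."""
    ngrams, contexts = Counter(), Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            ngrams[gram] += 1
            contexts[gram[:-1]] += 1
    return ngrams, contexts

def addk_prob(word, context, ngrams, contexts, vocab_size, k):
    """Add-k estimate: (count(context, word) + k) / (count(context) + k * V)."""
    return (ngrams[context + (word,)] + k) / (contexts[context] + k * vocab_size)

def interpolated_prob(word, history, ngrams, contexts, vocab_size,
                      lambdas=(0.1, 0.2, 0.3, 0.4), k=0.1):
    """Weighted sum of add-k estimates; lambdas[0] is assumed to weight the unigram."""
    if abs(sum(lambdas) - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    prob = 0.0
    for n, lam in enumerate(lambdas, start=1):
        context = tuple(history[-(n - 1):]) if n > 1 else ()
        prob += lam * addk_prob(word, context, ngrams, contexts, vocab_size, k)
    return prob

tokens = ["<s>", "the", "cat", "sat", "on", "the", "mat", "</s>"]
ngrams, contexts = count_ngrams(tokens)
vocab = sorted(set(tokens))
# A context never seen in training still gives a strictly positive probability.
print(interpolated_prob("mat", ["cat", "sat", "sat"], ngrams, contexts, len(vocab)) > 0)  # True
```

Because each add-k estimate sums to 1 over the vocabulary and the weights sum to 1, the mixture is itself a proper distribution over the vocabulary for any history.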
6. Test-Set Perplexity
No models evaluated yet.
7. Generated Examples
No generations recorded yet.
8. Conclusion
The interpolation model (LM2) with add-k smoothing yields a finite, well-defined test-set perplexity even when 4-grams in the test split are unseen in training, because lower-order probabilities are always mixed in and zero counts are smoothed. The unsmoothed backoff model (LM1) is simpler but fragile: a single test token whose n-gram has zero counts at every order, down to the unigram, receives probability zero, which makes the perplexity infinite. For a small, in-domain corpus, LM2 is the more reliable choice.
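
The effect of a single zero probability on perplexity can be illustrated with a short sketch. It assumes the usual definition, the exponential of the average negative log-probability per token; the function name `perplexity` and the input values are made up for illustration and are not results from this experiment.

```python
import math

def perplexity(probs):
    """Perplexity of a sequence given the model probability of each token."""
    if not probs:
        raise ValueError("need at least one token probability")
    if any(p <= 0.0 for p in probs):
        return math.inf   # one zero-probability token makes perplexity infinite
    avg_neg_log = -sum(math.log(p) for p in probs) / len(probs)
    return math.exp(avg_neg_log)

print(perplexity([0.25, 0.25, 0.25, 0.25]))  # about 4.0
print(perplexity([0.5, 0.5, 0.0]))           # inf
```

A smoothed model never enters the second case, since every token receives a strictly positive probability.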