You need a split dataset before evaluation. Go to split.
Test perplexity
Lower perplexity = better next-token predictions on the held-out test split.
| Model | Method | Smoothing | Perplexity | Status |
|---|---|---|---|---|
| LM1 | Backoff | None | — | not run |
| LM2 | Interpolation | add-k (k=0.1) | — | not run |
Comparison
Lower is better.
Run evaluation
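Perplexity is computed in log space, which avoids the numerical underflow that comes from multiplying many small probabilities:
exp(-1/N · Σ log P(wᵢ | h)), summed over all 4-gram windows in the test set, where N is the number of windows and h is the three-token history preceding wᵢ.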
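The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code this page runs: the names `perplexity`, `prob`, and `order` are hypothetical, and `prob(word, history)` stands in for whichever model (LM1 or LM2) supplies P(wᵢ | h). It assumes N counts the 4-gram windows, and it returns infinity if the model assigns zero probability to any test token.

```python
import math
from typing import Callable, Sequence, Tuple

# Hypothetical signature: prob(word, history) -> P(word | history)
ProbFn = Callable[[str, Tuple[str, ...]], float]


def perplexity(tokens: Sequence[str], prob: ProbFn, order: int = 4) -> float:
    """Perplexity over all `order`-gram windows, accumulated in log space."""
    log_sum = 0.0
    n_windows = 0
    # Each window is (order - 1) history tokens followed by the predicted token.
    for i in range(order - 1, len(tokens)):
        history = tuple(tokens[i - order + 1:i])
        p = prob(tokens[i], history)
        if p <= 0.0:
            # An unsmoothed model can give zero probability; perplexity is then unbounded.
            return math.inf
        log_sum += math.log(p)
        n_windows += 1
    if n_windows == 0:
        raise ValueError("test sequence is shorter than one n-gram window")
    return math.exp(-log_sum / n_windows)


if __name__ == "__main__":
    # Toy check: a uniform model over 8 words should score a perplexity of about 8.
    toy_tokens = "a b c d e f".split()
    print(perplexity(toy_tokens, lambda word, history: 1 / 8))
```

Summing logs rather than multiplying probabilities keeps the running total in a safe floating-point range even for long test sets. The zero-probability branch is also why a model without smoothing may report an infinite score on unseen n-grams.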