
Experiment Report

A clean, copy-pasteable summary of every choice you made and every number that came out. Suitable for academic submission.

The report is incomplete. Make sure you've preprocessed your corpus, split the dataset, and trained at least one model.

Export

Download or copy the report in your preferred format.

For PDF: use your browser's print dialog and choose "Save as PDF". The page is print-styled.

NGramLab

Experiment Report

Interactive 4-Gram Language Model Demo. Generated 5/13/2026, 7:21:33 AM.

v1

1. Corpus Summary

Characters: 0
Tokens: 0
Sentences: 0
Vocabulary: 0

2. Preprocessing

Lowercase: true
Remove extra spaces: true
Keep punctuation: false
Sentence boundaries: true
Vocabulary limit: 1000
Tokens used: <s>, </s>, <UNK>
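The settings above could be applied roughly as in the following sketch. The function name `preprocess`, the sentence-splitting rule (split on `.`, `!`, `?`), and the choice to leave the three special tokens out of the vocabulary limit are all assumptions made for illustration; NGramLab's actual implementation may differ.

```python
import re
from collections import Counter

def preprocess(text, vocab_limit=1000):
    """Tokenize text under the listed settings (assumed semantics, hypothetical helper)."""
    text = text.lower()                                   # Lowercase: true
    # Assumption: sentences end at ., ! or ?; split before punctuation is dropped.
    raw_sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    sentences = []
    for s in raw_sentences:
        s = re.sub(r"[^\w\s]", " ", s)                    # Keep punctuation: false
        words = s.split()                                 # Remove extra spaces: true
        if words:
            sentences.append(["<s>"] + words + ["</s>"])  # Sentence boundaries: true
    # Keep the most frequent word types up to the limit; everything else becomes <UNK>.
    counts = Counter(w for sent in sentences for w in sent
                     if w not in ("<s>", "</s>"))
    vocab = {w for w, _ in counts.most_common(vocab_limit)}
    vocab |= {"<s>", "</s>", "<UNK>"}
    tokenized = [[w if w in vocab else "<UNK>" for w in sent] for sent in sentences]
    return tokenized, vocab

sentences, vocab = preprocess("The cat sat.  The dog sat!", vocab_limit=2)
print(sentences)
```

With a limit of 2, only the two most frequent words survive and the rarer ones are mapped to `<UNK>`, which is the behavior the vocabulary limit is meant to produce.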

3. Dataset Split

Train tokens: 0
Validation tokens: 0
Test tokens: 0

4. Model Design

LM1: Backoff Language Model

LM1 is an unsmoothed 4-gram model that backs off to the next lower order whenever the higher-order n-gram is unseen in training. The backoff order is 4-gram → trigram → bigram → unigram.
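As a minimal sketch of that backoff order, the code below returns the maximum-likelihood estimate from the highest order whose n-gram was seen. The helper names `count_ngrams` and `backoff_prob` are hypothetical, and using the (n-1)-gram count as the denominator is an assumption; the tool's own counting may differ.

```python
from collections import Counter

def count_ngrams(tokens, max_order=4):
    """counts[n] maps each n-gram tuple to its frequency, for n = 1..max_order."""
    counts = {n: Counter() for n in range(1, max_order + 1)}
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[n][tuple(tokens[i:i + n])] += 1
    return counts

def backoff_prob(word, history, counts, max_order=4):
    """Unsmoothed estimate from the highest order whose n-gram was seen."""
    total = sum(counts[1].values())
    for n in range(max_order, 0, -1):
        context = tuple(history[-(n - 1):]) if n > 1 else ()
        if len(context) < n - 1:
            continue  # history too short for this order
        ngram = context + (word,)
        if counts[n][ngram] > 0:
            # Assumption: the context count is the count of the (n-1)-gram.
            denom = counts[n - 1][context] if n > 1 else total
            return counts[n][ngram] / denom
    return 0.0  # unseen at every order, including unigram

tokens = "<s> a b c </s> <s> a b d </s>".split()
counts = count_ngrams(tokens)
print(backoff_prob("c", ["<s>", "a", "b"], counts))   # 4-gram was seen
print(backoff_prob("c", ["x", "y", "b"], counts))     # falls back to the bigram
print(backoff_prob("zzz", ["<s>", "a", "b"], counts)) # unseen everywhere
```

Because nothing is discounted, the estimates across a context do not form a proper probability distribution, and a word unseen at every order still receives zero.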

LM2: Linear Interpolation

LM2 combines all four n-gram orders, each smoothed with add-k. The interpolation is P(w | h) = λ₁P₁ + λ₂P₂ + λ₃P₃ + λ₄P₄, where Pₙ is the add-k-smoothed n-gram probability and the weights λ₁…λ₄ sum to 1.
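The interpolation can be sketched as follows. The function names, the default weights `(0.1, 0.2, 0.3, 0.4)`, and `k=0.1` are placeholder assumptions, not values taken from the tool. Indexing the weights so that the first applies to the unigram estimate is also an assumption.

```python
from collections import Counter

def ngram_counts(tokens, max_order=4):
    """counts[n] maps each n-gram tuple to its frequency."""
    counts = {n: Counter() for n in range(1, max_order + 1)}
    for n in counts:
        for i in range(len(tokens) - n + 1):
            counts[n][tuple(tokens[i:i + n])] += 1
    return counts

def add_k_prob(word, context, counts, vocab_size, k):
    """Add-k estimate: (c(context, word) + k) / (c(context) + k * V)."""
    n = len(context) + 1
    numerator = counts[n][tuple(context) + (word,)] + k
    ctx_count = counts[n - 1][tuple(context)] if n > 1 else sum(counts[1].values())
    return numerator / (ctx_count + k * vocab_size)

def interpolated_prob(word, history, counts, vocab_size,
                      lambdas=(0.1, 0.2, 0.3, 0.4), k=0.1):
    """Weighted sum of add-k unigram..4-gram estimates.

    Assumes len(history) >= 3 (for example, padded with <s>).
    """
    assert abs(sum(lambdas) - 1.0) < 1e-9, "weights must sum to 1"
    prob = 0.0
    for n, lam in enumerate(lambdas, start=1):
        context = history[len(history) - (n - 1):] if n > 1 else []
        prob += lam * add_k_prob(word, context, counts, vocab_size, k)
    return prob

tokens = "<s> a b c </s> <s> a b d </s>".split()
counts = ngram_counts(tokens)
vocab = sorted(set(tokens))
# "</s>" never follows "<s> a b" in training, yet its probability is positive.
print(interpolated_prob("</s>", ["<s>", "a", "b"], counts, len(vocab)))
```

Since k > 0 adds mass to every count, each component is strictly positive, so the mixture is too.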

6. Test-Set Perplexity

No models evaluated yet.

7. Generated Examples

No generations recorded yet.

8. Conclusion

On the held-out test set, the interpolation model (LM2) with add-k smoothing produces a finite, well-defined perplexity even when 4-grams in the test split are unseen in training, because lower-order probabilities are always mixed in and zero counts are smoothed. The unsmoothed backoff model (LM1) is simpler but fragile: a single test token that is unseen at every order, from 4-gram down to unigram, receives probability zero, which makes the test-set perplexity infinite. For a small, in-domain corpus, LM2 is the more reliable choice.
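To illustrate why one zero-probability token is enough to break the metric, here is a minimal perplexity sketch. It uses the standard definition, the exponential of the average negative log-probability per token; the function name `perplexity` and the example probabilities are made up for illustration.

```python
import math

def perplexity(probs):
    """Perplexity of a token sequence given each token's model probability."""
    if any(p == 0.0 for p in probs):
        # log(0) is undefined; the sequence is impossible under the model.
        return math.inf
    avg_neg_log = -sum(math.log(p) for p in probs) / len(probs)
    return math.exp(avg_neg_log)

smoothed = [0.25, 0.25, 0.25, 0.25]   # every token gets some probability
unsmoothed = [0.5, 0.0, 0.5]          # one token unseen at every order
print(perplexity(smoothed))
print(perplexity(unsmoothed))
```

The first call yields a finite value close to 4, while the second returns infinity regardless of how well the other tokens are predicted.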

Raw markdown

The exact content the "Export Markdown" button writes to disk.

# NGramLab — Experiment Report

_Generated 2026-05-13T07:21:33.874Z_

## 1. Corpus Summary

- Characters: **0**
- Tokens: **0**
- Sentences: **0**
- Vocabulary size: **0**

## 2. Preprocessing

- Lowercase: true
- Keep punctuation: false
- Sentence boundaries: true
- Vocab cap: 1000
- Unknown token: `<UNK>`

## 3. Dataset Split

- Train tokens: **0**
- Validation tokens: **0**
- Test tokens: **0**

## 4. LM1 — Backoff 4-Gram

An unsmoothed 4-gram language model that falls back to lower orders (trigram → bigram → unigram) when the higher-order n-gram is unseen in training. Useful as a baseline but vulnerable to zero probabilities on rare contexts.

## 5. LM2 — Interpolation with Add-k Smoothing

A linear interpolation of unigram, bigram, trigram, and 4-gram probabilities, each smoothed with add-k. The mixing weights λ₁…λ₄ sum to 1 and the smoothing constant k > 0 keeps every probability strictly positive.

## 6. Test Perplexity

| Model | Method | Smoothing | Perplexity |
|-------|--------|-----------|------------|

## 8. Conclusion