Step 4 of 8

Train models

Build the count tables and train both 4-gram language models.

This step requires a preprocessed and split corpus. Split the corpus first.

N-gram counts

Computed from the training split. Switch tabs to see each order.

Counts will appear once a corpus is split.
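As a rough illustration of what the count tables hold, here is a minimal sketch that tallies n-grams of orders 1 through 4 from a tokenized training split. The function name `count_ngrams`, the toy sentences, and the padding convention (three `<s>` start symbols and one `</s>` end symbol per sentence) are assumptions made for the example, not details taken from this tool.

```python
from collections import Counter

def count_ngrams(sentences, max_order=4):
    """Return {order: Counter of n-gram tuples} for orders 1..max_order."""
    counts = {n: Counter() for n in range(1, max_order + 1)}
    for tokens in sentences:
        # Assumed padding: max_order - 1 start symbols and one end symbol.
        padded = ["<s>"] * (max_order - 1) + list(tokens) + ["</s>"]
        for n in range(1, max_order + 1):
            # Slide a window of width n over the padded sentence.
            for i in range(len(padded) - n + 1):
                counts[n][tuple(padded[i:i + n])] += 1
    return counts

# Toy training split (hypothetical data).
train = [["the", "cat", "sat"], ["the", "cat", "ran"]]
counts = count_ngrams(train)

for order in sorted(counts):
    print(order, "-gram types:", len(counts[order]))
print("Count(the cat) =", counts[2][("the", "cat")])
```

Keeping one `Counter` per order mirrors the per-order tabs: each table can be inspected on its own, and a context count for order n is simply a lookup in the order n-1 table.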

LM1

Backoff 4-gram

Try 4-gram → trigram → bigram → unigram. No smoothing above the unigram level: higher orders use raw relative frequencies.

if Count(w₋₃ w₋₂ w₋₁ wᵢ) > 0
   → P = Count(w₋₃ w₋₂ w₋₁ wᵢ) / Count(w₋₃ w₋₂ w₋₁)
else if Count(w₋₂ w₋₁ wᵢ) > 0   → α · P_tri
else if Count(w₋₁ wᵢ) > 0       → α² · P_bi
else                            → α³ · add-1 P_uni  (α = 0.4)

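Each backoff step multiplies the lower-order estimate by α = 0.4, so a unigram fallback is scaled by α³ = 0.064. The add-1 smoothed unigram floor gives every word, including unseen ones, a nonzero probability, which keeps perplexity finite.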
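The backoff rule above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the tool's actual implementation: `build_counts`, `backoff_score`, the toy corpus, the padding symbols, and the choice of vocabulary size (number of distinct unigram types) are all hypothetical. A fixed α with undiscounted relative frequencies resembles the scheme often called "stupid backoff", so the returned values are best read as scores that are not guaranteed to sum to one.

```python
from collections import Counter

ALPHA = 0.4  # backoff weight per step, as stated in the text

def build_counts(sentences, max_order=4):
    """Count n-grams of orders 1..max_order (assumed <s>/</s> padding)."""
    counts = {n: Counter() for n in range(1, max_order + 1)}
    for tokens in sentences:
        padded = ["<s>"] * (max_order - 1) + list(tokens) + ["</s>"]
        for n in range(1, max_order + 1):
            for i in range(len(padded) - n + 1):
                counts[n][tuple(padded[i:i + n])] += 1
    return counts

def backoff_score(word, context, counts, vocab_size, alpha=ALPHA):
    """Score `word` given up to three preceding tokens."""
    context = tuple(context)[-3:]
    penalty = 1.0
    # Try the longest context first, then drop its oldest token.
    while context:
        ngram = context + (word,)
        order = len(ngram)
        if counts[order][ngram] > 0:
            # Unsmoothed relative frequency at this order.
            return penalty * counts[order][ngram] / counts[order - 1][context]
        context = context[1:]
        penalty *= alpha  # one more backoff step
    # Unigram floor with add-1 smoothing, so the score is never zero.
    total = sum(counts[1].values())
    return penalty * (counts[1][(word,)] + 1) / (total + vocab_size)

train = [["the", "cat", "sat"], ["the", "cat", "ran"]]  # hypothetical data
counts = build_counts(train)
V = len(counts[1])  # assumed vocabulary size: distinct unigram types

seen = backoff_score("sat", ("<s>", "the", "cat"), counts, V)    # 4-gram hit
backed = backoff_score("ran", ("a", "the", "cat"), counts, V)    # trigram hit
unseen = backoff_score("dog", ("<s>", "the", "cat"), counts, V)  # unigram floor
print(seen, backed, unseen)
```

The three calls exercise the three kinds of outcome: a direct 4-gram match, a match after one backoff step, and a fall through to the smoothed unigram floor for a word that never appears in training.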

LM2

Interpolation + add-k

P = λ₁·P₁ + λ₂·P₂ + λ₃·P₃ + λ₄·P₄, where Pₙ is the order-n estimate (P₁ unigram through P₄ 4-gram), with add-k smoothing on every term.

λ₁ · unigram: 0.05
λ₂ · bigram: 0.15
λ₃ · trigram: 0.30
λ₄ · fourgram: 0.50
λ sum: 1.000
Smoothing k: 0.100
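To make the interpolation concrete, the following sketch combines add-k estimates of orders 1 through 4 using the weights and k listed above. It assumes the common add-k form (Count(context, w) + k) / (Count(context) + k·V); the helper names (`build_counts`, `add_k_prob`, `interpolated_prob`), the toy corpus, the padding symbols, and taking V as the number of distinct unigram types are assumptions for illustration only.

```python
from collections import Counter

LAMBDAS = (0.05, 0.15, 0.30, 0.50)  # unigram .. fourgram weights from the text
K = 0.1                             # add-k smoothing constant from the text

def build_counts(sentences, max_order=4):
    """Count n-grams of orders 1..max_order (assumed <s>/</s> padding)."""
    counts = {n: Counter() for n in range(1, max_order + 1)}
    for tokens in sentences:
        padded = ["<s>"] * (max_order - 1) + list(tokens) + ["</s>"]
        for n in range(1, max_order + 1):
            for i in range(len(padded) - n + 1):
                counts[n][tuple(padded[i:i + n])] += 1
    return counts

def add_k_prob(word, context, counts, vocab_size, k=K):
    """Add-k estimate of P(word | context) for a context of 0..3 tokens."""
    context = tuple(context)
    ngram = context + (word,)
    order = len(ngram)
    if order == 1:
        denom = sum(counts[1].values())    # total token count
    else:
        denom = counts[order - 1][context]  # context count
    return (counts[order][ngram] + k) / (denom + k * vocab_size)

def interpolated_prob(word, context, counts, vocab_size, lambdas=LAMBDAS, k=K):
    """Weighted sum of the add-k estimates of orders 1..4."""
    context = tuple(context)
    if len(context) != 3:
        raise ValueError("expected exactly three context tokens")
    total = 0.0
    for n, lam in enumerate(lambdas, start=1):
        ctx = context[3 - (n - 1):]  # last n-1 tokens; empty for the unigram
        total += lam * add_k_prob(word, ctx, counts, vocab_size, k)
    return total

train = [["the", "cat", "sat"], ["the", "cat", "ran"]]  # hypothetical data
counts = build_counts(train)
V = len(counts[1])  # assumed vocabulary size: distinct unigram types

ctx = ("<s>", "the", "cat")
p_sat = interpolated_prob("sat", ctx, counts, V)
p_all = sum(interpolated_prob(w[0], ctx, counts, V) for w in counts[1])
print(p_sat, p_all)
```

Because the weights sum to 1 and each add-k term is itself normalized over the vocabulary for this context, the interpolated values over all vocabulary words should add up to approximately 1, which is a useful sanity check when changing the λ values or k.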