Need a preprocessed and split corpus first. Split it now.
N-gram counts
Computed from the training split. Switch tabs to see each order.
Counts will appear once a corpus is split.
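A minimal sketch of how the per-order counts could be gathered from a training split. It assumes the split is already tokenized into lists of strings and that no sentence-boundary padding is applied; the function name `ngram_counts` and the toy corpus are hypothetical, not taken from the page.

```python
from collections import Counter

def ngram_counts(sentences, max_order=4):
    """Return {order: Counter of n-gram tuples} for orders 1..max_order."""
    counts = {n: Counter() for n in range(1, max_order + 1)}
    for tokens in sentences:
        for n in range(1, max_order + 1):
            # Slide a window of width n over the sentence.
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts

# Toy training split (illustrative only).
train = [["the", "cat", "sat", "down"], ["the", "cat", "ran", "away"]]
counts = ngram_counts(train)
```

Each order gets its own table, which mirrors the one-tab-per-order view described above.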
LM1
Backoff 4-gram
Try 4-gram → trigram → bigram → unigram. No smoothing above the unigram level.
if Count(w₋₃ w₋₂ w₋₁ wᵢ) > 0 → P = Count(...) / Count(w₋₃ w₋₂ w₋₁)
else fall back to trigram → α · P_tri
else fall back to bigram → α² · P_bi
else fall back to unigram → α³ · add-1 P_uni
(α = 0.4)
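Each backoff step multiplies the lower-order estimate by α = 0.4, so a score that falls all the way back to the unigram is scaled by α³ = 0.064. The add-1 smoothed unigram floor gives every word, including unseen ones, a nonzero probability, which keeps perplexity finite.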
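The backoff rule can be sketched as follows. This is an illustration under stated assumptions, not the page's actual implementation: counts are stored as one `Counter` per order, no sentence padding is used, and `vocab_size` (distinct training tokens plus one slot for unknown words) is an assumed choice for the add-1 denominator. The names `ngram_counts` and `backoff_prob` are hypothetical.

```python
from collections import Counter

ALPHA = 0.4  # per-step backoff factor from the text

def ngram_counts(sentences, max_order=4):
    """Return {order: Counter of n-gram tuples} for orders 1..max_order."""
    counts = {n: Counter() for n in range(1, max_order + 1)}
    for tokens in sentences:
        for n in range(1, max_order + 1):
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts

def backoff_prob(word, context, counts, vocab_size, alpha=ALPHA):
    """Score `word` given up to 3 preceding tokens, backing off by order."""
    context = tuple(context)[-3:]
    penalty = 1.0
    while context:
        order = len(context) + 1
        full = counts[order][context + (word,)]
        if full > 0:
            # Unsmoothed relative frequency at the highest matching order.
            return penalty * full / counts[order - 1][context]
        context = context[1:]  # drop the oldest token
        penalty *= alpha       # one more backoff step
    # Unigram floor with add-1 smoothing (vocab_size is an assumption).
    total = sum(counts[1].values())
    return penalty * (counts[1][(word,)] + 1) / (total + vocab_size)

train = [["the", "cat", "sat", "down"], ["the", "cat", "ran", "away"]]
counts = ngram_counts(train)
V = len(counts[1]) + 1  # assumed: distinct tokens plus one unknown slot

p_seen = backoff_prob("down", ("the", "cat", "sat"), counts, V)
p_tri = backoff_prob("sat", ("x", "the", "cat"), counts, V)
p_uni = backoff_prob("away", ("the", "cat", "sat"), counts, V)
p_oov = backoff_prob("dog", ("the", "cat", "sat"), counts, V)
```

Because the higher orders are unsmoothed and the α factors are not renormalized, the returned values behave as scores rather than a distribution that sums to one over the vocabulary.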
LM2
Interpolation + add-k
λ₁·P₁ + λ₂·P₂ + λ₃·P₃ + λ₄·P₄ with add-k smoothing on every term.
λ1 · unigram: 0.05
λ2 · bigram: 0.15
λ3 · trigram: 0.30
λ4 · fourgram: 0.50
λ sum: 1.000
Smoothing k: 0.100
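A minimal sketch of the interpolated estimate with the weights and k listed above. It assumes each term is the add-k estimate (Count(context, w) + k) / (Count(context) + k·V), that exactly three context tokens are supplied, and that V is the number of distinct training tokens plus one unknown slot; the function names and the toy corpus are hypothetical.

```python
from collections import Counter

LAMBDAS = (0.05, 0.15, 0.30, 0.50)  # unigram .. fourgram weights from the text
K = 0.1                             # add-k smoothing constant from the text

def ngram_counts(sentences, max_order=4):
    """Return {order: Counter of n-gram tuples} for orders 1..max_order."""
    counts = {n: Counter() for n in range(1, max_order + 1)}
    for tokens in sentences:
        for n in range(1, max_order + 1):
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts

def addk_prob(word, context, counts, vocab_size, k=K):
    """Add-k estimate of `word` given a context of length order - 1."""
    order = len(context) + 1
    numerator = counts[order][context + (word,)] + k
    if context:
        context_count = counts[order - 1][context]
    else:
        context_count = sum(counts[1].values())  # total tokens for the unigram
    return numerator / (context_count + k * vocab_size)

def interpolated_prob(word, context, counts, vocab_size, lambdas=LAMBDAS, k=K):
    """Weighted sum of add-k estimates for orders 1..4."""
    context = tuple(context)
    if len(context) != 3:
        raise ValueError("this sketch expects exactly 3 context tokens")
    total = 0.0
    for n, lam in enumerate(lambdas, start=1):
        ctx = context[3 - (n - 1):]  # last n-1 tokens; empty for the unigram
        total += lam * addk_prob(word, ctx, counts, vocab_size, k)
    return total

train = [["the", "cat", "sat", "down"], ["the", "cat", "ran", "away"]]
counts = ngram_counts(train)
V = len(counts[1]) + 1  # assumed: distinct tokens plus one unknown slot

p_seen = interpolated_prob("down", ("the", "cat", "sat"), counts, V)
p_unseen = interpolated_prob("dog", ("the", "cat", "sat"), counts, V)
```

Since every term is smoothed, an unseen word still receives a nonzero probability at each order, so no backoff branch is needed here.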