Step 3 of 8

Dataset split

Reserve a slice of the token stream for validation (tuning) and testing (final evaluation).

You need to preprocess the corpus before splitting it. Go to preprocessing.

Ratios

Sliders must sum to 100%. The default is the conventional 70 / 10 / 20.

train70%
validation10%
test20%
Total100%

Distribution

How tokens are allocated.

Training
Validation
Test
Tip: Sentences are shuffled with a fixed seed, then validation/test words outside the training vocabulary become<UNK>.