# SentencePiece Training Guide
Complete guide to training SentencePiece models.
## Training workflow
### Step 1: Prepare corpus
```bash
# Plain text file, one sentence per line (recommended)
cat corpus.txt
# Hello world
# This is a test
# SentencePiece is language-independent
# Raw text also works; the trainer treats each line as one sentence
```
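If you are assembling the corpus in Python rather than from existing files, a minimal sketch (the sentence list is illustrative):
```python
# Hypothetical: write an in-memory list of sentences to corpus.txt,
# one sentence per line, as the trainer expects.
sentences = [
    'Hello world',
    'This is a test',
    'SentencePiece is language-independent',
]
with open('corpus.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(sentences) + '\n')
```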
### Step 2: Train model
**Command-line**:
```bash
spm_train \
  --input=corpus.txt \
  --model_prefix=m \
  --vocab_size=8000 \
  --model_type=unigram \
  --character_coverage=0.9995
```
**Python API**:
```python
import sentencepiece as spm
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='m',
    vocab_size=8000,
    model_type='unigram'
)
```
**Output**: `m.model` (binary), `m.vocab` (text vocabulary)
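The `.vocab` file is a human-readable TSV of pieces and scores; for unigram models the score is the piece's log probability. A quick peek:
```python
# Each line of m.vocab is "<piece>\t<score>". Print the first entries:
with open('m.vocab', encoding='utf-8') as f:
    for _ in range(5):
        print(f.readline().rstrip('\n'))
```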
### Step 3: Load and use
```python
sp = spm.SentencePieceProcessor(model_file='m.model')
pieces = sp.encode('Test sentence', out_type=str)
```
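Encoding round-trips cleanly through ids, which is a quick sanity check after training (continuing with the `sp` object from above):
```python
ids = sp.encode('Test sentence')   # default out_type is int
print(ids)                         # id values depend on the trained model
print(sp.decode(ids))              # -> 'Test sentence'
```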
## Training parameters
### Core parameters
```python
spm.SentencePieceTrainer.train(
    # Required
    input='corpus.txt',              # Input corpus
    model_prefix='output',           # Output file prefix
    vocab_size=8000,                 # Target vocabulary size
    # Algorithm
    model_type='unigram',            # 'unigram', 'bpe', 'char', 'word'
    # Coverage
    character_coverage=0.9995,       # 0.9995 for most languages, 1.0 for CJK
    # Normalization
    normalization_rule_name='nmt_nfkc',  # 'nmt_nfkc', 'nfkc', 'identity'
    # Performance
    num_threads=16,                  # Training threads
    input_sentence_size=10000000     # Max sentences to load
)
```
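A quick way to verify the result is to load the model and check a few properties (a sketch using the `output` prefix from the call above):
```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='output.model')
print(sp.get_piece_size())  # 8000, matching vocab_size
print(sp.id_to_piece(0))    # '<unk>' with the default special-token ids
```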
### Special tokens
```python
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='m',
    vocab_size=32000,
    # Control symbols: reserved pieces that never appear in raw text
    # (example symbol names; choose ones that suit your model)
    control_symbols=['<sep>', '<cls>', '<mask>'],
    # User-defined symbols (never split during encoding)
    user_defined_symbols=['[MASK]', '[SEP]', '[CLS]'],
    # Special token pieces (these are the library defaults)
    unk_piece='<unk>',
    bos_piece='<s>',
    eos_piece='</s>',
    pad_piece='<pad>',
    # Special token IDs
    unk_id=0,
    bos_id=1,
    eos_id=2,
    pad_id=3
)
```
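After training, the special-token layout can be verified directly; user-defined symbols should survive encoding as single pieces (a sketch):
```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='m.model')
print(sp.unk_id(), sp.bos_id(), sp.eos_id(), sp.pad_id())  # 0 1 2 3
print(sp.piece_to_id('[MASK]'))                # fixed id assigned at training
print(sp.encode('a [MASK] b', out_type=str))   # '[MASK]' stays one piece
```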
### Advanced options
```python
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='m',
    vocab_size=32000,
    # Byte fallback (encode unknown characters as byte pieces)
    byte_fallback=True,
    # Digit handling
    split_digits=True,                 # Split digits into individual tokens
    # Script splitting
    split_by_unicode_script=True,      # Split at Unicode script boundaries
    split_by_whitespace=True,          # Split on whitespace
    # Length constraints
    max_sentencepiece_length=16,       # Max characters per piece
    # Training size
    input_sentence_size=10000000,      # Max sentences to load
    shuffle_input_sentence=True,       # Shuffle before sampling
    # Seed vocabulary
    seed_sentencepiece_size=1000000    # Initial candidate piece count
)
```
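The effect of `byte_fallback` is easy to observe: characters absent from the vocabulary are emitted as byte pieces rather than collapsing to `<unk>` (illustrative output):
```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='m.model')
# U+1F642 is four UTF-8 bytes, so it falls back to four byte pieces:
print(sp.encode('🙂', out_type=str))
# e.g. ['▁', '<0xF0>', '<0x9F>', '<0x99>', '<0x82>']
```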
## Training from Python iterator
```python
import sentencepiece as spm
from datasets import load_dataset
# Load dataset
dataset = load_dataset('wikitext', 'wikitext-103-raw-v1', split='train')
# Create iterator
def corpus_iterator():
    for example in dataset:
        if example['text'].strip():
            yield example['text']
# Train from iterator
spm.SentencePieceTrainer.train(
    sentence_iterator=corpus_iterator(),
    model_prefix='wiki',
    vocab_size=32000,
    model_type='unigram'
)
```
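The iterator-based API also accepts a `model_writer`, so the trained model can be kept in memory instead of written to disk:
```python
import io

# Train into an in-memory buffer rather than writing m.model to disk.
model = io.BytesIO()
spm.SentencePieceTrainer.train(
    sentence_iterator=corpus_iterator(),
    model_writer=model,           # receives the serialized model
    vocab_size=32000,
    model_type='unigram'
)
sp = spm.SentencePieceProcessor(model_proto=model.getvalue())
```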
## Model types
### BPE
```python
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='m_bpe',
    model_type='bpe',
    vocab_size=16000
)
```
**Training time**: ~10-15 min for 1GB corpus
### Unigram (recommended)
```python
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='m',
    model_type='unigram',
    vocab_size=8000
)
```
**Training time**: ~30-40 min for 1GB corpus
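A practical advantage of unigram models is subword regularization: at encode time you can sample among candidate segmentations instead of always taking the single best one, which is often used as data augmentation when training downstream models:
```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='m.model')
for _ in range(3):
    # nbest_size=-1 samples from all candidates; alpha smooths the distribution
    print(sp.encode('New York', out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))
```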
## Character coverage
### English/European (0.9995)
```python
spm.SentencePieceTrainer.train(
    input='en_corpus.txt',
    model_prefix='en',
    character_coverage=0.9995  # Cover 99.95% of characters
)
```
Covers: a-z, A-Z, punctuation, common accents
### CJK (1.0)
```python
spm.SentencePieceTrainer.train(
    input='zh_corpus.txt',
    model_prefix='zh',
    character_coverage=1.0  # Cover ALL characters
)
```
Required for: Chinese, Japanese, Korean
### Multilingual (0.9995-1.0)
```python
spm.SentencePieceTrainer.train(
    input='multilingual_corpus.txt',
    model_prefix='multi',
    character_coverage=0.9995  # Balance coverage vs. vocab size
)
```
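To pick a value empirically, you can measure how many distinct characters a given coverage keeps in your corpus (a standalone sketch, assuming `corpus.txt` from Step 1):
```python
from collections import Counter

# Count character frequencies, then find how many distinct characters
# are needed to reach 99.95% cumulative coverage.
counts = Counter()
with open('corpus.txt', encoding='utf-8') as f:
    for line in f:
        counts.update(line.rstrip('\n'))
total = sum(counts.values())
covered = kept = 0
for _, n in counts.most_common():
    covered += n
    kept += 1
    if covered / total >= 0.9995:
        break
print(f'{kept} of {len(counts)} distinct characters reach 99.95% coverage')
```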
## Vocabulary size selection
| Task | Vocab Size | Rationale |
|------|------------|-----------|
| English monolingual | 16k-32k | Standard range for single-language models |
| Multilingual | 32k-250k | Must cover pieces from many languages and scripts |
| CJK | 32k-100k | Large base character inventories |
| Code | 16k-32k | Token distribution similar to English |
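To compare candidate sizes empirically, a common metric is average pieces per sentence on held-out text (lower means more compact encodings; the model filenames below are hypothetical):
```python
import sentencepiece as spm

def avg_pieces(model_file, sample_file):
    # Average number of pieces per non-empty line in the sample.
    sp = spm.SentencePieceProcessor(model_file=model_file)
    with open(sample_file, encoding='utf-8') as f:
        lines = [l for l in f.read().splitlines() if l.strip()]
    return sum(len(sp.encode(l)) for l in lines) / max(len(lines), 1)

print(avg_pieces('m_8k.model', 'heldout.txt'))
print(avg_pieces('m_32k.model', 'heldout.txt'))
```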
## Normalization rules
### nmt_nfkc (recommended)
```python
normalization_rule_name='nmt_nfkc'
```
- NFKC Unicode normalization
- Additional whitespace normalization tailored to NMT
- **Recommended for most tasks**
### identity (no normalization)
```python
normalization_rule_name='identity'
```
- Preserves input exactly
- Use for code, case-sensitive tasks
### nfkc (standard Unicode)
```python
normalization_rule_name='nfkc'
```
- Standard Unicode NFKC normalization
- Omits the extra NMT-specific whitespace rules of nmt_nfkc
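Plain NFKC behavior can be previewed with Python's `unicodedata`; this approximates the NFKC-based rules (nmt_nfkc layers whitespace cleanup on top):
```python
import unicodedata

print(unicodedata.normalize('NFKC', 'Ｈｅｌｌｏ'))  # full-width -> 'Hello'
print(unicodedata.normalize('NFKC', 'ﬁle'))        # ligature ﬁ -> 'file'
```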
## Performance optimization
### Multi-threading
```python
spm.SentencePieceTrainer.train(
    input='large_corpus.txt',
    model_prefix='m',
    num_threads=32  # Use all available cores
)
```
**Speedup**: ~4-8× with 16+ cores
### Sampling input
```python
spm.SentencePieceTrainer.train(
    input='huge_corpus.txt',
    model_prefix='m',
    input_sentence_size=10000000,  # Sample 10M sentences
    shuffle_input_sentence=True
)
```
**For very large corpora** (>10GB)
### Extremely large corpus
```python
spm.SentencePieceTrainer.train(
    input='massive_corpus.txt',
    model_prefix='m',
    train_extremely_large_corpus=True,  # Enable for >10GB corpora
    input_sentence_size=100000000
)
```
## Best practices
1. **Use Unigram for most tasks** - Better for multilingual
2. **Set character_coverage=1.0 for CJK** - Required for full coverage
3. **Use nmt_nfkc normalization** - Works well for most cases
4. **Add user_defined_symbols for special tokens** - BERT-style tokens
5. **Enable byte_fallback for robustness** - Handles emojis/rare chars
6. **Start with vocab_size=32000** - Good default for most tasks
7. **Use multi-threading** - Speeds up training significantly
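Putting these together, a reasonable starting configuration might look like this (a sketch; tune `vocab_size` and `character_coverage` for your corpus and scripts):
```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='m',
    model_type='unigram',                  # practice 1
    vocab_size=32000,                      # practice 6
    character_coverage=0.9995,             # 1.0 for CJK (practice 2)
    normalization_rule_name='nmt_nfkc',    # practice 3
    user_defined_symbols=['[MASK]', '[SEP]', '[CLS]'],  # practice 4
    byte_fallback=True,                    # practice 5
    num_threads=16                         # practice 7
)
```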
Source: claude-code-templates (MIT).