# Training a Custom Tokenizer
A complete guide to training a tokenizer from scratch.
## Training Workflow
### Step 1: Choose a Tokenization Algorithm
**Decision tree**:
- **GPT-style models** → BPE
- **BERT-style models** → WordPiece
- **Multilingual / no word boundaries** → Unigram
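BPE, the usual choice for GPT-style models, builds its vocabulary by repeatedly merging the most frequent adjacent symbol pair. A pure-Python sketch of a single merge step over a toy word-frequency table (illustrative only, not the library's implementation):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word frequencies, each word pre-split into characters.
words = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("l", "o", "g"): 3}
pair = most_frequent_pair(words)   # ("l", "o"), seen 10 times
words = merge_pair(words, pair)    # "lo" is now a single symbol
```

A real trainer repeats this loop until `vocab_size` merges have been learned.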
### Step 2: Prepare Training Data
```python
# Option 1: read from files
files = ["train.txt", "validation.txt"]

# Option 2: read from a Python list
texts = [
    "This is the first sentence.",
    "This is the second sentence.",
    # ... more texts
]

# Option 3: read from a dataset iterator
from datasets import load_dataset

dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i:i + batch_size]["text"]
```
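The slicing pattern in option 3 can be sanity-checked against a plain list standing in for the dataset (a stdlib-only sketch; the list here is a made-up stand-in):

```python
# Stand-in corpus mimicking dataset["text"].
corpus = [f"sentence {i}" for i in range(10)]

def batch_iterator(batch_size=4):
    # Same slicing pattern as above, applied to a plain list.
    for i in range(0, len(corpus), batch_size):
        yield corpus[i:i + batch_size]

batches = list(batch_iterator())
# 10 items in batches of 4 -> batch sizes 4, 4, 2
```

Because the iterator yields batches lazily, the full corpus never needs to be materialized in memory at once.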
### Step 3: Initialize the Tokenizer
**BPE example**:
```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.decoders import ByteLevel as ByteLevelDecoder

tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = ByteLevel()
tokenizer.decoder = ByteLevelDecoder()

trainer = BpeTrainer(
    vocab_size=50000,
    min_frequency=2,
    special_tokens=["<s>", "</s>"],  # BOS/EOS; adjust to your model's conventions
    show_progress=True
)
```
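ByteLevel pre-tokenization rests on a fixed table mapping each of the 256 byte values to a printable character, so any input is covered by a 256-symbol base alphabet and no unknown token is needed. A pure-Python sketch of the GPT-2-style table (illustrative; the library builds this internally):

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character:
    printable ASCII/Latin-1 bytes map to themselves, the rest are shifted
    into an unused codepoint range starting at 256."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\xa1"), ord("\xac") + 1))
          + list(range(ord("\xae"), ord("\xff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

table = bytes_to_unicode()
# A space (byte 32) is not printable, so it maps to a shifted character.
encoded = "".join(table[b] for b in "hi there".encode("utf-8"))
```

This is why byte-level BPE vocabularies show `Ġ` where a leading space was: it is the shifted character for byte 32.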
**WordPiece example**:
```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.normalizers import BertNormalizer
from tokenizers.pre_tokenizers import BertPreTokenizer

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = BertNormalizer(lowercase=True)
tokenizer.pre_tokenizer = BertPreTokenizer()

trainer = WordPieceTrainer(
    vocab_size=30522,
    min_frequency=2,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    continuing_subword_prefix="##",
    show_progress=True
)
```
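At encoding time, WordPiece segments each word greedily, always taking the longest vocabulary match and marking word-internal pieces with the `##` prefix configured above. A minimal sketch over a toy vocabulary (illustrative only, not the library's implementation):

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]", prefix="##"):
    """Greedy longest-match-first segmentation of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = prefix + piece  # word-internal pieces carry the prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # any uncoverable word collapses to [UNK]
        tokens.append(match)
        start = end
    return tokens

# Toy vocabulary, invented for illustration.
vocab = {"un", "##aff", "##able", "##a", "##ff"}
result = wordpiece_tokenize("unaffable", vocab)
```

Note the all-or-nothing behavior: a single uncoverable span maps the whole word to `[UNK]`, which is why WordPiece needs the `unk_token` set on the model.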
**Unigram example**:
```python
from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer

tokenizer = Tokenizer(Unigram())

trainer = UnigramTrainer(
    vocab_size=8000,
    special_tokens=["<unk>", "<s>", "</s>", "<pad>"],  # typical SentencePiece-style tokens
    unk_token="<unk>",
    show_progress=True
)
```
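Unlike BPE's fixed merge order, Unigram scores every possible segmentation and keeps the most probable one. A minimal Viterbi decoder over a toy vocabulary (the log-probabilities here are invented for illustration; the trainer learns real ones):

```python
import math

def viterbi_segment(text, logprobs):
    """Return the segmentation of `text` with the highest total
    log-probability under a unigram vocabulary."""
    n = len(text)
    best = [-math.inf] * (n + 1)   # best score for text[:i]
    best[0] = 0.0
    back = [0] * (n + 1)           # start index of the winning last piece
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in logprobs and best[start] + logprobs[piece] > best[end]:
                best[end] = best[start] + logprobs[piece]
                back[end] = start
    if best[n] == -math.inf:
        return None                # no segmentation covers the text
    pieces, i = [], n
    while i > 0:                   # walk the backpointers to recover pieces
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1]

# Toy vocabulary with made-up log-probabilities.
logprobs = {"to": -2.0, "ken": -2.5, "token": -3.0, "s": -4.0, "tokens": -8.0}
# "token" + "s" scores -7.0, beating "tokens" (-8.0) and "to"+"ken"+"s" (-8.5).
segmentation = viterbi_segment("tokens", logprobs)
```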
### Step 4: Train
```python
# Train from files
tokenizer.train(files=files, trainer=trainer)

# Train from an iterator (recommended for large datasets)
tokenizer.train_from_iterator(
    batch_iterator(),
    trainer=trainer,
    length=len(dataset)  # optional, used for the progress bar
)
```