[ PROMPT_NODE_22986 ]
sentence-transformers
[ SKILL_DOCUMENTATION ]
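# Sentence Transformers - State-of-the-Art Embedding Models
A Python framework for sentence and text embeddings, built on Transformers.
## When to Use Sentence Transformers
**Good fit for:**
- High-quality embeddings for RAG
- Semantic similarity scoring and search
- Text clustering and classification
- Multilingual embeddings (100+ languages supported)
- Running embeddings locally (no API required)
- A cost-effective alternative to OpenAI embeddings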
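**Metrics**:
- **15,700+ GitHub stars**
- **5000+ pretrained models**
- **100+ languages** supported
- Built on PyTorch/Transformers
**Alternatives**:
- **OpenAI Embeddings**: requires API calls; very high quality
- **Instructor**: task-specific instructions
- **Cohere Embed**: managed service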
## Quick Start
### Installation
```bash
pip install sentence-transformers
```
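To confirm that the install succeeded, a small check along these lines can help. It is a sketch that uses only the standard library; the helper name `installed_version` is hypothetical and not part of the package.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# The pip distribution is named "sentence-transformers";
# the importable module is "sentence_transformers".
version = installed_version("sentence-transformers")
print(version if version else "sentence-transformers is not installed")
```

Returning `None` instead of raising lets a setup script decide whether to install the package or stop.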
### Basic Usage
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Load the model
model = SentenceTransformer('all-MiniLM-L6-v2')
# Generate embeddings
sentences = [
    "This is an example sentence",
    "Each sentence is converted to a vector"
]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (2, 384)
# Cosine similarity
similarity = cos_sim(embeddings[0], embeddings[1])
print(f"Similarity: {similarity.item():.4f}")
```
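The score reported by `cos_sim` is the cosine of the angle between two embedding vectors. The following is a minimal pure-Python sketch of that formula, not the library's implementation; the function name `cosine_similarity` and the toy vectors are illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real 384-dimensional embeddings.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(round(same_direction, 4), round(orthogonal, 4))
```

Vectors pointing the same way score close to 1 regardless of their length, and unrelated (orthogonal) vectors score 0, which is why the measure suits comparing embeddings.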
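## Popular Models
### General-Purpose Models
```python
from sentence_transformers import SentenceTransformer

# Fast, good quality (384 dimensions)
model = SentenceTransformer('all-MiniLM-L6-v2')
# Higher quality (768 dimensions)
model = SentenceTransformer('all-mpnet-base-v2')
# Largest of the three (1024 dimensions, slower)
model = SentenceTransformer('all-roberta-large-v1')
```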
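Embedding dimension also determines how much space a stored corpus takes. The rough estimate below assumes dense float32 vectors (4 bytes per value) and ignores any index overhead; the helper name `embedding_storage_bytes` and the one-million-vector corpus are illustrative assumptions.

```python
BYTES_PER_FLOAT32 = 4


def embedding_storage_bytes(num_vectors: int, dimensions: int) -> int:
    """Raw size of a dense float32 embedding matrix, ignoring index overhead."""
    return num_vectors * dimensions * BYTES_PER_FLOAT32


# Dimensions come from the model list above; the corpus size is illustrative.
for name, dims in [("all-MiniLM-L6-v2", 384),
                   ("all-mpnet-base-v2", 768),
                   ("all-roberta-large-v1", 1024)]:
    size_gb = embedding_storage_bytes(1_000_000, dims) / 1e9
    print(f"{name}: {size_gb:.2f} GB per million vectors")
```

Under these assumptions, moving from 384 to 1024 dimensions raises raw storage from about 1.5 GB to about 4.1 GB per million vectors, so the smaller model is often the practical default for large corpora.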
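### Multilingual Models
```python
# 50+ languages, smaller and faster (384 dimensions)
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
# 50+ languages, higher quality (768 dimensions)
model = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')
```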
### Domain-Specific Models
```python
# Legal domain
model = SentenceTransformer('nlpaueb/legal-bert-base-uncased')
# Scientific papers
model = SentenceTransformer('allenai/specter')
# Code
model = SentenceTransformer('microsoft/codebert-base')
```
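Checkpoints such as `microsoft/codebert-base` are plain Transformer models rather than models trained specifically for sentence embeddings. When such a checkpoint is loaded, the library typically adds a mean-pooling layer on top and emits a warning, so similarity quality may differ from purpose-built models. The sketch below illustrates mean pooling itself in pure Python; it is not the library's code, and `mean_pool` and the toy vectors are hypothetical.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dims = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dims)]


# Three real tokens plus one padding position, with toy 2-dimensional vectors.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]
print(mean_pool(tokens, mask))  # [3.0, 4.0]
```

The padding vector is excluded by the mask, so the sentence vector is the average of the three real tokens only.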
## Semantic Search
python
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')
# Corpus
corpus = [
    "Python is a programming language",
    "Machine learning uses algorithms",
    "Neural networks are powerful"
]
# Encode the corpus
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
# Query
query = "What is Python?"
query_embedding = model.encode(query, convert_to