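# LangChain RAG Guide
A complete guide to Retrieval-Augmented Generation (RAG) with LangChain.
## What Is RAG?
**RAG (Retrieval-Augmented Generation)** combines two steps:
1. **Retrieval**: find relevant documents in a knowledge base
2. **Generation**: an LLM uses the retrieved context to produce an answer
**Benefits**:
- Fewer hallucinations, because answers are grounded in retrieved text
- Access to up-to-date information
- Domain-specific knowledge
- Source citations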
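
The two steps above can be sketched without any framework. The following is a minimal illustration, not LangChain's implementation: `retrieve` ranks documents by word overlap (a toy stand-in for vector similarity), and `build_prompt` assembles the context a generator LLM would receive. The knowledge base, function names, and prompt wording are all assumptions made for this example.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, knowledge_base: list[str], k: int = 2) -> list[str]:
    """Return the k documents sharing the most words with the query.

    Word overlap is a toy stand-in for embedding similarity.
    """
    query_words = tokenize(query)
    ranked = sorted(
        knowledge_base,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, context_docs: list[str]) -> str:
    """Assemble the prompt a generator LLM would receive (no LLM call here)."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Hypothetical knowledge base, for illustration only.
KNOWLEDGE_BASE = [
    "LangChain provides document loaders for PDF and web pages.",
    "Text splitters break long documents into overlapping chunks.",
    "Bananas are rich in potassium.",
]

question = "How do text splitters handle long documents?"
top_docs = retrieve(question, KNOWLEDGE_BASE, k=1)
prompt = build_prompt(question, top_docs)
```

In a real pipeline the overlap score would be replaced by embedding similarity against a vector store, and `prompt` would be passed to an LLM.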
## RAG Workflow Components
### 1. Document Loading
```python
from langchain_community.document_loaders import (
    WebBaseLoader,
    PyPDFLoader,
    TextLoader,
    DirectoryLoader,
    CSVLoader,
    UnstructuredMarkdownLoader,
)

# Web page
loader = WebBaseLoader("https://docs.python.org/3/tutorial/")
docs = loader.load()

# Single PDF file
loader = PyPDFLoader("paper.pdf")
docs = loader.load()

# Multiple PDFs (recursive glob over a directory)
loader = DirectoryLoader("./papers/", glob="**/*.pdf", loader_cls=PyPDFLoader)
docs = loader.load()

# Plain text file
loader = TextLoader("data.txt")
docs = loader.load()

# CSV (one document per row)
loader = CSVLoader("data.csv")
docs = loader.load()

# Markdown
loader = UnstructuredMarkdownLoader("README.md")
docs = loader.load()
```
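
Each loader above returns a list of `Document` objects that pair text (`page_content`) with a `metadata` dictionary. To show the shape of that data without installing any loader dependencies, here is a small stand-in written with the standard library. `SimpleDocument` and `load_text_file` are hypothetical names for this sketch, not LangChain classes.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in mirroring the text-plus-metadata shape."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path: str, encoding: str = "utf-8") -> list[SimpleDocument]:
    """Read one file and wrap it in a single document record."""
    text = Path(path).read_text(encoding=encoding)
    return [SimpleDocument(page_content=text, metadata={"source": path})]


# Write a throwaway file so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = str(Path(tmp_dir) / "data.txt")
    Path(sample_path).write_text(
        "RAG combines retrieval and generation.", encoding="utf-8"
    )
    loaded = load_text_file(sample_path)
```

Keeping the source path in `metadata` is what later makes source citations possible.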
### 2. Text Splitting
```python
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    CharacterTextSplitter,
    TokenTextSplitter,
)

# Recommended: recursive splitting (tries several separators in order)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # characters per chunk (measured by length_function)
    chunk_overlap=200,    # characters shared between neighbouring chunks
    length_function=len,
    separators=["\n\n", "\n", " ", ""],
)
splits = text_splitter.split_documents(docs)

# Token-based splitting (for exact token limits)
text_splitter = TokenTextSplitter(
    chunk_size=512,       # tokens per chunk
    chunk_overlap=50,
)

# Character-based splitting (simple, single separator)
text_splitter = CharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separator="\n\n",
)
```
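
The idea behind recursive splitting is to prefer coarse boundaries (paragraphs) and fall back to finer ones (lines, words, characters) only for pieces that are still too long. The sketch below illustrates that idea in plain Python. It is a simplified approximation, not the algorithm used by `RecursiveCharacterTextSplitter`: it ignores overlap, drops empty pieces, and assumes separators are ordered from longest to shortest. The function name `recursive_split` and the sample text are made up for this example.

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split text into pieces of at most chunk_size characters.

    Tries the first separator; pieces that are still too long are split
    again with the remaining separators. An empty separator means
    "cut at fixed character positions".
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, remaining = separators[0], separators[1:]
    pieces: list[str] = []
    for part in text.split(separator):
        pieces.extend(recursive_split(part, chunk_size, remaining))

    # Greedily merge neighbouring pieces back together while they fit.
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample_text = (
    "First paragraph here.\n\n"
    "Second paragraph is a bit longer than the first one.\n\n"
    "Third."
)
chunks = recursive_split(sample_text, chunk_size=30, separators=["\n\n", "\n", " ", ""])
```

Here the first paragraph fits in one chunk, while the second exceeds 30 characters and is split at a word boundary rather than mid-word.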
**Chunk size recommendations**:
- **Short answers**: 256-512 tokens
- **General Q&A**: 512-1024 tokens (recommended)
- **Long context**: 1024-2048 tokens
- **Overlap**: 10-20% of chunk_size
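
To make the overlap guideline concrete, the sketch below derives an overlap from a ratio and cuts a token sequence into windows that share that many tokens. Placeholder strings stand in for real tokens, which is an assumption for illustration; an actual tokenizer will count differently. The helper name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(
    tokens: list[str], chunk_size: int, overlap_ratio: float = 0.2
) -> list[list[str]]:
    """Cut tokens into windows of chunk_size that overlap by a fixed ratio."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")

    overlap = int(chunk_size * overlap_ratio)  # e.g. 20% of 10 -> 2 tokens
    step = chunk_size - overlap                # how far each window advances

    windows: list[list[str]] = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return windows


# 25 placeholder tokens, windows of 10 with 20% overlap (2 tokens).
tokens = [f"t{i}" for i in range(25)]
windows = split_with_overlap(tokens, chunk_size=10, overlap_ratio=0.2)
```

Each window starts 8 tokens after the previous one, so the last two tokens of one window reappear at the start of the next. That repetition is what keeps a sentence cut at a boundary retrievable from either side.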
### 3. Embeddings
```python
from langchain_openai import OpenAIEmbeddings
from langchain_community.embeddings import (
    HuggingFaceEmbeddings,
    CohereEmbeddings,
)

# OpenAI (fast, high quality)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# HuggingFace (free, runs locally)
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)

# Cohere
embeddings = CohereEmbeddi
```