sglang

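# SGLang

A high-performance serving framework for LLMs and VLMs, with RadixAttention for automatic prefix caching.

## When to use SGLang

**Use SGLang when you:**

- Need structured output (JSON, regular expressions, grammars)
- Build agents with repeated prefixes (system prompts, tools)
- Run agent workflows with function calling
- Have multi-turn conversations that share context
- Need faster JSON decoding (up to 3x faster than standard decoding)

**Use vLLM when you:**

- Do simple text generation with no structure requirements
- Do not need prefix caching
- Need a mature, widely tested production system

**Use TensorRT-LLM when you:**

- Need the lowest single-request latency (no batching)
- Deploy on NVIDIA hardware only
- Need FP8/INT4 quantization on H100

## Quick start

### Installation

```bash
# Install with pip (recommended)
pip install "sglang[all]"

# With FlashInfer (faster; CUDA 11.8/12.1).
# Install it in a separate step: -i replaces the package index,
# so sglang itself would not be found on the FlashInfer index.
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/

# Install from source
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"
```

### Launch the server

```bash
# Basic server (Llama 3-8B)
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000

# RadixAttention (automatic prefix caching) is enabled by default,
# so no extra flag is needed; pass --disable-radix-cache to turn it off
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000

# Multi-GPU (tensor parallelism)
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-70B-Instruct --tp 4 --port 30000
```

### Basic inference

```python
import sglang as sgl

# Point the frontend at the local SGLang server started above
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

# Simple generation
@sgl.function
def simple_gen(s, question):
    s += "Q: " + question + "\n"
    s += "A:" + sgl.gen("answer", max_tokens=100)

# Run
state = simple_gen.run(question="What is the capital of France?")
print(state["answer"])
# Output: "The capital of France is Paris."
```

### Structured JSON output

```python
import sglang as sgl

@sgl.function
def extract_person(s, text):
    s += f"Extract person information from: {text}\n"
    s += "Output JSON:\n"

    # Constrained JSON generation: the regex restricts what the model can emit
    s += sgl.gen(
        "json_output",
        max_tokens=200,
        regex=r'\{"name": "[^"]+", "age": \d+, "occupation": "[^"]+"\}'
    )

# Run
state = extract_person.run(
    text="John Smith is a 35-year-old software engineer."
)
print(state["json_output"])
# Output: {"name": "John Smith", "ag
```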
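
A constraint regex can be sanity-checked offline with Python's `re` module before it is passed to `sgl.gen`. The sketch below assumes the pattern from the structured JSON output example, with the braces escaped and `\d+` for the age. The helper name `matches_constraint` and both sample strings are hypothetical, written by hand for illustration; nothing here calls SGLang or a model.

```python
import json
import re

# Assumed pattern: a flat JSON object with a string name,
# an integer age, and a string occupation.
PERSON_PATTERN = r'\{"name": "[^"]+", "age": \d+, "occupation": "[^"]+"\}'


def matches_constraint(candidate: str) -> bool:
    """Return True if the whole string satisfies the constraint regex."""
    return re.fullmatch(PERSON_PATTERN, candidate) is not None


# Hypothetical sample strings (not model output).
good = '{"name": "Jane Doe", "age": 42, "occupation": "pilot"}'
bad = '{"name": "Jane Doe", "age": "forty-two", "occupation": "pilot"}'

print(matches_constraint(good))   # True: every field has the expected shape
print(matches_constraint(bad))    # False: age is a quoted string, not digits
print(json.loads(good)["age"])    # 42: a matching string also parses as JSON
```

Checking the pattern this way catches unescaped braces or a missing `\d` early, before a server is involved.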

Source: claude-code-templates (MIT). The Chinese translation this page is based on was AI-generated.

BAGUA AI