server-deployment
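# Server Deployment Patterns
## Contents
- Docker deployment
- Kubernetes deployment
- Nginx load balancing
- Multi-node distributed serving
- Production configuration examples
- Health checks and monitoring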
## Docker Deployment
**Basic Dockerfile**:
dockerfile
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04
RUN apt-get update && apt-get install -y python3-pip
RUN pip install vllm
EXPOSE 8000
# Exec-form CMD split across lines needs a backslash at the end of each continued line
CMD ["vllm", "serve", "meta-llama/Llama-3-8B-Instruct", \
     "--host", "0.0.0.0", "--port", "8000", \
     "--gpu-memory-utilization", "0.9"]
**Build and run**:
bash
docker build -t vllm-server .
docker run --gpus all -p 8000:8000 vllm-server
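
Once the container is running, the server can be exercised over HTTP. The following is a minimal sketch, assuming the server exposes vLLM's OpenAI-compatible `/v1/completions` route on the port published above. `build_completion_request` and `complete` are hypothetical helper names, and the prompt and `max_tokens` values are illustrative.

```python
import json
import urllib.request

# Assumption: the port published by `docker run -p 8000:8000`.
BASE_URL = "http://localhost:8000"
# Model name passed to `vllm serve` in the Dockerfile above.
MODEL = "meta-llama/Llama-3-8B-Instruct"


def build_completion_request(prompt: str, max_tokens: int = 32) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible completions route."""
    payload = {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the first generated text (needs a running server)."""
    with urllib.request.urlopen(build_completion_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

With the container up, `complete("Hello, my name is")` should return the generated continuation. Building the request separately from sending it keeps the payload easy to inspect without a live server.
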
**Docker Compose** (with metrics monitoring):
yaml
version: '3.8'
services:
  vllm:
    image: vllm/vllm-openai:latest
    command: >
      --model meta-llama/Llama-3-8B-Instruct
      --gpu-memory-utilization 0.9
      --enable-metrics
      --metrics-port 9090
    ports:
      - "8000:8000"
      - "9090:9090"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
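
Metrics endpoints of this kind normally serve the Prometheus text exposition format. As a rough sketch of how a scrape could be inspected by hand, the snippet below parses that format into a dictionary. The function name `parse_prometheus_text` and the sample payload are illustrative assumptions, not output captured from a real server.

```python
def parse_prometheus_text(text: str) -> dict[str, float]:
    """Parse Prometheus text exposition lines into {series: value}.

    Simplified: skips comments, ignores optional timestamps, and assumes
    label values contain no spaces.
    """
    samples: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line, or a HELP/TYPE comment
        parts = line.split()
        if len(parts) < 2:
            continue  # malformed line: no value present
        samples[parts[0]] = float(parts[1])
    return samples


# Hypothetical sample payload, for illustration only.
SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="demo"} 3.0
vllm:num_requests_waiting{model_name="demo"} 1.0
"""

metrics = parse_prometheus_text(SAMPLE)
print(metrics['vllm:num_requests_running{model_name="demo"}'])  # prints 3.0
```

For anything beyond a quick check, a Prometheus server scraping the endpoint is the more usual choice; the hand-rolled parser is only meant to show what the payload looks like.
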
## Kubernetes Deployment
**Deployment manifest**:
yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - "--model=meta-llama/Llama-3-8B-Instruct"
        - "--gpu-memory-utilization=0.9"
        - "--enable-prefix-caching"
        resources:
          limits:
            nvidia.com/gpu: 1
        ports:
        - containerPort: 8000
          name: http
        - containerPort: 9090
          name: metrics
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 30
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm
  ports:
  - port: 8000
    targetPort: 8000
    name: http
  - port: 9090
    targetPort: 9090
    name: metrics
  type: LoadBalancer
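
The readiness probe in the manifest polls `/health` after an initial delay and then at a fixed period. The same wait-until-healthy logic can be useful in deployment scripts. Below is a minimal sketch, assuming `/health` answers with HTTP 200 once the server is ready. `http_health_check` and `wait_until_healthy` are hypothetical helpers, and the `check` and `sleep` callables are injected so the loop can be exercised without a live server.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_health_check(url: str = "http://localhost:8000/health") -> bool:
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all count as unhealthy.
        return False


def wait_until_healthy(
    check: Callable[[], bool] = http_health_check,
    initial_delay: float = 30.0,
    period: float = 10.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `check` after `initial_delay`, then every `period` seconds."""
    sleep(initial_delay)
    for attempt in range(max_attempts):
        if check():
            return True
        if attempt < max_attempts - 1:
            sleep(period)  # no sleep after the final failed attempt
    return False
```

Calling `wait_until_healthy(initial_delay=30, period=10)` mirrors the readiness probe timings in the manifest; the liveness probe would correspond to `initial_delay=60, period=30`.
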
## Nginx Load Balancing
**Nginx configuration**:
nginx
upstream vllm_backend {
    least_conn;  # route each request to the server with the fewest active connections
server localhost