
clip

[ SKILL_DOCUMENTATION ]

# CLIP - Contrastive Language-Image Pre-Training

A model developed by OpenAI that understands images through natural language.

## When to use CLIP

**Use cases:**
- Zero-shot image classification (no training data required)
- Image-text similarity and matching
- Semantic image search
- Content moderation (detecting NSFW or violent content)
- Visual question answering
- Cross-modal retrieval (image→text, text→image)

**Metrics:**
- **25,300+ GitHub stars**
- Trained on 400 million image-text pairs
- Zero-shot accuracy on ImageNet on par with ResNet-50
- MIT license

**Alternatives:**
- **BLIP-2**: better image captioning
- **LLaVA**: vision-language chat
- **Segment Anything**: image segmentation

## Quick start

### Installation

```bash
pip install git+https://github.com/openai/CLIP.git
pip install torch torchvision ftfy regex tqdm
```

### Zero-shot classification

```python
import torch
import clip
from PIL import Image

# Load the model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Load the image and add a batch dimension
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)

# Define the candidate labels
labels = ["a dog", "a cat", "a bird", "a car"]
text = clip.tokenize(labels).to(device)

# Compute similarity
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # Scaled cosine similarity between the image and each label
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

# Print the results
for label, prob in zip(labels, probs[0]):
    print(f"{label}: {prob:.2%}")
```

## Available models

```python
# Models (sorted by size)
models = [
    "RN50",      # ResNet-50
    "RN101",     # ResNet-101
    "ViT-B/32",  # Vision Transformer (recommended)
    "ViT-B/16",  # Better quality, slower
    "ViT-L/14",  # Best quality, slowest
]

model, preprocess = clip.load("ViT-B/32")
```

| Model | Parameters | Speed | Quality |
|-------|------------|-------|---------|
| RN50 | 102M | Fast | Good |
| ViT-B/32 | 151M | Medium | Better |
| ViT-L/14 | 428M | Slow | Best |

## Image-text similarity

```python
# Compute embeddings
image_features = model.encode_image(image)
text_features = model.encode_text(text)

# Normalize to unit length
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)

# Cosine similarity (.item() assumes exactly one image and one text)
similarity = (image_features @ text_features.T).item()
print(f"Si
```
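To make the zero-shot scoring step less opaque, here is a minimal sketch in plain Python of what happens after the embeddings are computed: normalize, take cosine similarities, scale, and apply a softmax. The three-dimensional vectors are invented for illustration and are not real CLIP outputs. The helper names (`l2_normalize`, `cosine_similarity`, `zero_shot_probs`) are hypothetical, and the `logit_scale` default of 100 is an assumption based on the scaling used in the OpenAI CLIP README example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def zero_shot_probs(image_vec, text_vecs, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and several texts."""
    logits = [logit_scale * cosine_similarity(image_vec, t) for t in text_vecs]
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - max_logit) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy embeddings, made up for this sketch (real CLIP embeddings are much larger)
labels = ["a dog", "a cat", "a bird", "a car"]
image_embedding = [0.9, 0.1, 0.2]
text_embeddings = [
    [1.0, 0.0, 0.1],   # "a dog"
    [0.1, 1.0, 0.0],   # "a cat"
    [0.0, 0.2, 1.0],   # "a bird"
    [-1.0, 0.3, 0.1],  # "a car"
]

probs = zero_shot_probs(image_embedding, text_embeddings)
best_label = labels[probs.index(max(probs))]

for label, prob in zip(labels, probs):
    print(f"{label}: {prob:.2%}")
print(f"Best match: {best_label}")
```

The large scale factor sharpens the softmax, so small differences in cosine similarity turn into a confident prediction. The same ranking logic applies to semantic image search: score one text embedding against many image embeddings and sort by similarity.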

Source: claude-code-templates (MIT). The Chinese translation was generated by AI. See About Us for details.

BAGUA AI