Multimodal Learning

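Multimodal learning refers to enabling AI systems to understand and process information from several perceptual modalities at once, such as images, text, audio, and video. It is widely regarded as an important step toward artificial general intelligence (AGI).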

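Development History

Key Milestones

Year	Milestone	Significance
2021	CLIP	Contrastive language-image pre-training; zero-shot visual recognition
2022	Flamingo	Few-shot vision-language understanding
2023	BLIP-2	Efficient multimodal pre-training; introduces the Q-Former
2023	LLaVA	Open-source visual instruction tuning paradigm
2023	GPT-4V	Commercial deployment of a multimodal large language model
2023	Gemini	Natively multimodal architecture
2024	LLaVA-NeXT	High-resolution, multi-image, and video support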

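Paradigm Shift

Traditional visual recognition               Multimodal learning
┌───────────────────────────────┐            ┌─────────────────────────────────────────┐
│ Fixed-category classification │     →      │ Open vocabulary / free-text description │
│ Supervised training           │            │ Zero-shot / few-shot learning           │
│ Single-modality input         │            │ Joint multimodal understanding          │
└───────────────────────────────┘            └─────────────────────────────────────────┘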

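Core Tasks

Vision-Language Understanding

Task	Description	Examples
Image captioning	Generate a text description of an image	COCO Captions, NoCaps
Visual question answering	Answer questions about an image	VQA, OK-VQA
Image-text matching	Decide whether an image-text pair matches	CLIP contrastive learning
Visual reasoning	Complex multi-step visual reasoning	Multimodal reasoning benchmarks, e.g. CLEVR, NLVR2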

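Vision-Language Generation

Task	Description	Representative models
Text-to-image generation	Generate an image from a text description	DALL-E, Stable Diffusion
Visual dialogue	Multi-turn dialogue over images and text	GPT-4V, LLaVA
Region captioning	Generate a description for a specific image region	Kosmos series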

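Architecture Categories

Dual-Encoder Architecture

Images and text are processed by independent encoders, and their outputs are aligned in a shared representation space only at the end: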

# Simplified sketch
import torch.nn as nn

class DualEncoder(nn.Module):
    def __init__(self, vision_encoder, text_encoder):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. ViT
        self.text_encoder = text_encoder      # e.g. Transformer

    def forward(self, images, texts):
        # Each modality is encoded independently; there is no cross-modal interaction here
        image_features = self.vision_encoder(images)
        text_features = self.text_encoder(texts)
        return image_features, text_features

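Representative models: CLIP, ALIGN

Characteristics:

Well suited to retrieval tasks
High training efficiency
Relatively weak cross-modal interaction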
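
The alignment in a shared representation space is typically learned with a symmetric contrastive (InfoNCE) objective, as in CLIP. The following is a minimal, dependency-free sketch of that loss on toy vectors, not CLIP's actual implementation. The function names are hypothetical, and the temperature of 0.07 is an assumed fixed value (CLIP initialises its learnable temperature at that value).

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against the index of the correct column."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def symmetric_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Average of the image-to-text and text-to-image InfoNCE losses.

    Row i of image_feats and row i of text_feats are assumed to be a matched pair.
    """
    imgs = [l2_normalize(v) for v in image_feats]
    txts = [l2_normalize(v) for v in text_feats]
    n = len(imgs)
    # logits[i][j] = cosine similarity of image i and text j, divided by temperature
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
              for img in imgs]
    loss_i2t = sum(cross_entropy_row(logits[i], i) for i in range(n)) / n
    # Transpose so that each text is scored against all images
    logits_t = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = sum(cross_entropy_row(logits_t[j], j) for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Toy batch: in the aligned case, pair i points in a similar direction
image_feats = [[1.0, 0.1], [0.1, 1.0]]
aligned_text = [[0.9, 0.0], [0.0, 0.8]]
shuffled_text = [[0.0, 0.8], [0.9, 0.0]]

aligned_loss = symmetric_contrastive_loss(image_feats, aligned_text)
shuffled_loss = symmetric_contrastive_loss(image_feats, shuffled_text)
print(aligned_loss < shuffled_loss)
```

The loss is low when each image is most similar to its own caption and high when the pairs are mismatched, which is what pulls the two encoders into a common space.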

Fusion-Encoder Architecture

The modalities interact and are fused at an early stage:

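# Early-fusion sketch
import torch.nn as nn

class FusionEncoder(nn.Module):
    def __init__(self, d_model=512, num_heads=8):  # illustrative defaults
        super().__init__()
        # Cross-modal attention; batch_first=True expects (batch, seq, d_model)
        self.cross_attention = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, vision_features, text_features):
        # Bidirectional interaction; this sketch shares one attention module for both directions
        # Vision queries attend to text keys/values (the module returns output and weights)
        vision_out, _ = self.cross_attention(vision_features, text_features, text_features)
        # Text queries attend to vision keys/values
        text_out, _ = self.cross_attention(text_features, vision_features, vision_features)
        return vision_out, text_out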

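Representative models: Flamingo, PaLI-X

Characteristics:

Well suited to complex reasoning tasks
Thorough cross-modal interaction
Higher computational cost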
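
To make the cross-modal interaction concrete, here is a hedged, dependency-free sketch of what one direction of cross-attention computes: each text token takes a weighted average of image patch features, weighted by query-key similarity. It is single-head and omits the learned projections of a real attention layer; the function names and toy features are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections.

    Each query (e.g. a text token) returns a weighted average of the values
    (e.g. image patches), weighted by query-key similarity.
    """
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical toy features: 2 text tokens attend over 3 image patches
text_tokens = [[4.0, 0.0], [0.0, 4.0]]
image_patches = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

fused_text, attn = cross_attention(text_tokens, image_patches, image_patches)
for weights in attn:
    print([round(w, 3) for w in weights])
```

Every query is scored against every key, so the cost grows with the product of the two sequence lengths, which is one source of the higher computational cost listed above.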

Decoder Architecture

Visual information is fed to the language model as additional context:

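# LLaVA-style architecture (simplified sketch)
import torch
import torch.nn as nn

class VLMDecoder(nn.Module):
    def __init__(self, vision_encoder, projector, llm):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. CLIP ViT
        self.projector = projector            # linear projection
        self.llm = llm                        # e.g. Vicuna/LLaMA

    def forward(self, images, input_ids):
        # Visual encoding: (batch, num_patches, d_vision)
        image_features = self.vision_encoder(images)
        # Project into the language embedding space: (batch, num_patches, d_llm)
        image_tokens = self.projector(image_features)
        # Embed the text tokens and prepend the image tokens as extra context
        text_embeds = self.llm.get_input_embeddings()(input_ids)
        inputs_embeds = torch.cat([image_tokens, text_embeds], dim=1)
        # Assumes a Hugging Face style LLM that accepts inputs_embeds
        outputs = self.llm(inputs_embeds=inputs_embeds)
        return outputs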

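Representative models: LLaVA, MiniGPT-4, InstructBLIP

Characteristics:

Inherits the capabilities of the LLM
Well suited to instruction following
Relatively simple to train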
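
In decoder-style models the projected image tokens occupy positions in the LLM input sequence alongside the text tokens. The sketch below shows one common way to arrange this, replacing a placeholder position in the prompt with the image tokens, and how many patch tokens a ViT yields for a given resolution. It is an illustration under stated assumptions: `IMAGE_TOKEN_ID`, `splice_image_tokens`, and `num_patch_tokens` are hypothetical names, and the embeddings are one-dimensional toy values.

```python
IMAGE_TOKEN_ID = -200  # hypothetical placeholder id marking where the image goes

def num_patch_tokens(image_size, patch_size):
    """Number of patch tokens a ViT produces for a square image (no class token)."""
    per_side = image_size // patch_size
    return per_side * per_side

def splice_image_tokens(input_ids, text_embeds, image_embeds,
                        image_token_id=IMAGE_TOKEN_ID):
    """Replace the single placeholder position with the projected image tokens.

    input_ids:    token ids, containing exactly one image_token_id
    text_embeds:  one embedding per entry of input_ids
    image_embeds: projected visual tokens, already in the LLM embedding space
    """
    if input_ids.count(image_token_id) != 1:
        raise ValueError("expected exactly one image placeholder")
    pos = input_ids.index(image_token_id)
    return text_embeds[:pos] + image_embeds + text_embeds[pos + 1:]

# Toy example with 1-dimensional "embeddings" so the ordering is easy to read
input_ids = [1, 15, IMAGE_TOKEN_ID, 42, 7]
text_embeds = [[0.1], [0.2], [0.0], [0.4], [0.5]]
image_embeds = [[9.0], [9.1], [9.2]]

sequence = splice_image_tokens(input_ids, text_embeds, image_embeds)
print(len(sequence))              # 4 text positions + 3 image tokens
print(num_patch_tokens(336, 14))  # patch tokens for a 336 px image with 14 px patches
```

Because every image token consumes a position in the context window, higher input resolutions lengthen the sequence the LLM has to process.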

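Relation to Existing Content

This chapter is closely related to the following topics:

Related topic	Connection
Transformer and attention	Vision Transformers and multi-head attention in VLMs
Contrastive learning	The InfoNCE loss used to train CLIP
MoE	Sparse experts in multimodal models
PEFT	Efficient fine-tuning techniques for multimodal models
LoRA	Parameter-efficient fine-tuning of multimodal LLMs
LSTM	Sequence modeling in multimodal settings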

Application Scenarios

Visual Search

# CLIP zero-shot image classification example
from PIL import Image
import torch
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Candidate classes, phrased as text prompts
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
image = Image.open("test.jpg")

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # shape: (1, number of labels)
probs = logits_per_image.softmax(dim=1)
print(f"Predicted class: {labels[probs.argmax().item()]}")
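
The same embeddings support search rather than classification: encode the gallery images once, then rank them against a text query by cosine similarity. The sketch below uses hypothetical hand-written 3-dimensional vectors in place of real embeddings, and `search` is an illustrative helper, not a library function. In practice the vectors would come from the dual encoder, for example via `get_image_features` and `get_text_features` on the `CLIPModel` used above.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_embedding, gallery, top_k=2):
    """Return the top_k (name, score) pairs ranked by cosine similarity."""
    scored = [(name, cosine_similarity(query_embedding, emb))
              for name, emb in gallery.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical precomputed image embeddings (in practice from the image encoder)
gallery = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.1, 0.9, 0.0],
    "car.jpg": [0.0, 0.1, 0.9],
}
# Hypothetical embedding of the text query "a photo of a cat"
query = [1.0, 0.2, 0.0]

results = search(query, gallery)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

Since the gallery embeddings do not depend on the query, they can be computed offline and indexed, which is why dual encoders suit retrieval.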

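Multimodal Agents

Modern AI agents need to understand multimodal input:

Interpreting visual feedback
Extracting information from charts
Analyzing video content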

