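A Survey of Multimodal Models

This survey covers the architecture and capabilities of mainstream multimodal large language models (MLLMs), including closed-source commercial models and major open-source developments.

Commercial Multimodal Models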

GPT-4V

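GPT-4V (GPT-4 with Vision) is the multimodal model OpenAI released in September 2023. At launch it was one of the strongest commercial vision-language models.

Capability overview

Capability	What it covers
Image understanding	Recognizing objects, scenes, text, and faces
Visual reasoning	Spatial relations, physical regularities, causal inference
Document understanding	Tables, charts, screenshots, handwriting
Multi-image processing	Comparing images and analyzing relations between them
Visual dialogue	Multi-turn conversations mixing images and text

Technical speculation

OpenAI has not published GPT-4V's architecture in detail. The sketch below reflects guesses from community analysis and related papers, not a confirmed design: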

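# Speculated GPT-4V architecture (illustrative only; none of this is confirmed by OpenAI)
class GPT4VArchitecture:
    """
    Architecture guessed from community analysis and related papers.
    """

    def __init__(self):
        # Vision encoder: possibly an improved ViT
        self.vision_encoder = "Enhanced ViT with native resolution support"

        # Modality fusion: possibly early or mid fusion
        self.fusion = "Unified transformer with multimodal tokens"

        # Language model: the GPT-4 core
        self.language_model = "GPT-4 (larger version)"

        # Key features
        self.features = {
            "resolution": "Variable resolution, up to high-res",
            "aspect_ratio": "Flexible, native support",
            "context": "Extended context window",
            "languages": "Multilingual visual understanding"
        }

    def process_image(self, image):
        # 1. Possible preprocessing: smart cropping and scaling into regions
        regions = self.preprocess_image(image)

        # 2. Visual encoding, one region at a time
        encoded_regions = [self.encode_region(r) for r in regions]

        # 3. Combine into one token sequence, which is then fused with the text tokens
        return self.combine_regions(encoded_regions)

    def preprocess_image(self, image):
        """
        GPT-4V may preprocess images adaptively:
        - detect several regions of interest in the image
        - encode the regions separately
        - preserve the original aspect ratio
        """
        return self.detect_regions(image)

    # The three hooks below are placeholders: their real behavior is unknown.
    def detect_regions(self, image):
        raise NotImplementedError

    def encode_region(self, region):
        raise NotImplementedError

    def combine_regions(self, encoded_regions):
        raise NotImplementedError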
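
One public data point about resolution handling is how the API bills image inputs. The sketch below estimates the token cost of an image under the tiling rule described in OpenAI's vision guide from the GPT-4V period: fit the image inside 2048 x 2048, shrink the shorter side to at most 768 px, then charge 170 tokens per 512 px tile plus a base of 85. The constants are assumptions taken from that guide and may have changed since; the function name `estimate_image_tokens` is made up for this example.

```python
import math

def estimate_image_tokens(width, height, detail="high"):
    """Estimate the token cost of one image under the assumed tiling rule."""
    if detail == "low":
        # Low-detail mode is billed at a flat base cost
        return 85
    # Step 1: scale down (never up) to fit inside a 2048 x 2048 square
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    # Step 2: scale down (never up) so the shorter side is at most 768 px
    scale = min(1.0, 768 / min(width, height))
    width, height = width * scale, height * scale
    # Step 3: count the 512 px tiles needed to cover the image
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 170 * tiles + 85

square = estimate_image_tokens(1024, 1024)   # 4 tiles
tall = estimate_image_tokens(2048, 4096)     # 6 tiles
```

Under these assumptions the cost depends on the tile count rather than the raw pixel count, so cropping an image to the relevant region before upload can reduce cost.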

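Use cases

# GPT-4V API call example (sketch; assumes the openai>=1.0 Python SDK and OPENAI_API_KEY in the environment)
from openai import OpenAI

client = OpenAI()

def use_gpt4v(image_url, question):
    # image_url must be an http(s) URL or a base64 data URL, not a local file path
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # preview model name from the GPT-4V launch; it may no longer be served
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": question}
                ]
            }
        ],
        max_tokens=1024
    )
    return response.choices[0].message.content

# Example questions
questions = [
    "Describe what this image shows",
    "What data does the table in the image contain?",
    "What is wrong with this code?",
    "Compare these two images: what is the same and what differs?"
]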
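
The `image_url` field accepts a web URL or a base64 data URL, so a local file has to be encoded before it can be sent. A minimal sketch of that step, using only the standard library; `to_data_url` is a hypothetical helper name, and the media type is inferred from the file name rather than from the bytes themselves.

```python
import base64
import mimetypes

def to_data_url(image_bytes, filename):
    """Build a base64 data URL from raw image bytes and a file name."""
    media_type, _ = mimetypes.guess_type(filename)
    if media_type is None or not media_type.startswith("image/"):
        raise ValueError(f"cannot infer an image type from {filename!r}")
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{media_type};base64,{encoded}"

# Toy bytes stand in for real image data
example_url = to_data_url(b"abc", "chart.png")
```

In practice the bytes would come from `open(path, "rb").read()`, and the resulting string goes in the `url` field of the `image_url` content part.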

Gemini

Google DeepMind's Gemini, released in December 2023, is the best-known example of a natively multimodal architecture.¹

Architecture design

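class GeminiArchitecture:
    """
    Gemini's natively multimodal design (schematic; Google has not published these internals)
    """

    def __init__(self, mm_tokenizer):
        # Natively multimodal Transformer,
        # described as trained jointly across modalities rather than bolted onto a text model
        self.model = "Multimodal Transformer"

        # Hypothetical tokenizer object exposing tokenize_image/text/audio/video
        self.mm_tokenizer = mm_tokenizer

        # Supported modalities
        self.modalities = {
            "text": True,
            "vision": True,
            "audio": True,
            "video": True
        }

        # Key innovations
        self.innovations = {
            "tokenizer": "Unified multimodal tokenizer",
            "attention": "Cross-modal attention from the start",
            "pretraining": "Joint multimodal pretraining"
        }

    def encode_multimodal(self, inputs):
        """
        Handle inputs from several modalities in one pass
        """
        encoded = {}

        if "image" in inputs:
            encoded["vision"] = self.mm_tokenizer.tokenize_image(inputs["image"])

        if "text" in inputs:
            encoded["text"] = self.mm_tokenizer.tokenize_text(inputs["text"])

        if "audio" in inputs:
            encoded["audio"] = self.mm_tokenizer.tokenize_audio(inputs["audio"])

        if "video" in inputs:
            encoded["video"] = self.mm_tokenizer.tokenize_video(inputs["video"])

        # Concatenate the tokens of all modalities
        return self.concat_multimodal_tokens(encoded)

    def concat_multimodal_tokens(self, encoded):
        # Flatten the per-modality token lists into one sequence, in insertion order
        return [token for tokens in encoded.values() for token in tokens]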
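
When tokens from several modalities share one sequence, the model usually also needs to know which modality each position came from. The sketch below shows one simple way to keep that information: a parallel list of modality ids built while the segments are flattened. This is an illustration of the general idea, not Gemini's actual scheme; `MODALITY_IDS` and `build_sequence` are names invented for this example.

```python
MODALITY_IDS = {"text": 0, "image": 1, "audio": 2, "video": 3}

def build_sequence(segments):
    """Flatten (modality, tokens) segments into a token list and a parallel modality-id list."""
    tokens, modality_ids = [], []
    for modality, segment_tokens in segments:
        tokens.extend(segment_tokens)
        modality_ids.extend([MODALITY_IDS[modality]] * len(segment_tokens))
    return tokens, modality_ids

# Interleaved input: text, then an image, then more text
tokens, modality_ids = build_sequence([
    ("text", [5, 6]),
    ("image", [900, 901, 902]),
    ("text", [7]),
])
```

Keeping segments in their original order preserves interleaving, which matters for prompts that refer to "the image above".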

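Gemini family

Model	Scale	Characteristics
Gemini Ultra	Largest	Most capable; aimed at complex reasoning
Gemini Pro	Mid-sized	Balances capability and efficiency
Gemini Nano	Small	On-device deployment

New in Gemini 2.0/2.5

The 2024-2025 Gemini updates added further capabilities:

Very long context: more than 1 million tokens
Native tool use: built-in code execution and search
Native multimodal output: generates text and images together
Native audio understanding: no separate ASR model required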

Claude 3 (Anthropic)

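The Claude 3 models generate text only, but all three tiers accept image input and show strong visual understanding:

Haiku: fast and lightweight
Sonnet: balanced performance
Opus: strongest reasoning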

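# Claude 3 Vision API example (assumes the anthropic Python SDK and ANTHROPIC_API_KEY in the environment)
import base64

import anthropic

def load_image_base64(image_path):
    # Read the file and return its contents as a base64 string
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")

def use_claude_vision(image_path, prompt):
    response = anthropic.Anthropic().messages.create(
        model="claude-3-opus-20240229",
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/jpeg",  # must match the actual file type
                            "data": load_image_base64(image_path)
                        }
                    },
                    {"type": "text", "text": prompt}
                ]
            }
        ]
    )
    # content is a list of blocks; the first one holds the text reply
    return response.content[0].text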
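
The Messages API returns `content` as a list of typed blocks rather than a single string, so extracting the reply text takes one more step. A minimal sketch, assuming each block exposes `type` and `text` attributes as in the anthropic Python SDK; `join_text_blocks` is a name made up here, and the stand-in blocks are plain `SimpleNamespace` objects, not SDK types.

```python
from types import SimpleNamespace

def join_text_blocks(blocks):
    """Concatenate the text of every block whose type is "text", skipping other block types."""
    return "".join(block.text for block in blocks if block.type == "text")

# Stand-in blocks for illustration only
example_blocks = [
    SimpleNamespace(type="text", text="A cat "),
    SimpleNamespace(type="text", text="on a sofa."),
]
reply = join_text_blocks(example_blocks)
```

Filtering on `type` keeps the helper safe if a response ever contains blocks that carry no text.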

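Open-source multimodal ecosystem

Mainstream open-source models

Model	Organization	Characteristics
LLaVA	UW-Madison / Microsoft Research	Open-source pioneer; visual instruction tuning
MiniGPT-4	KAUST	Efficient alignment; trains only a projection layer
InstructBLIP	Salesforce	Unified multi-task instruction tuning
Qwen-VL	Alibaba	Strong Chinese-language support
InternVL	Shanghai AI Laboratory (OpenGVLab)	Large-scale vision encoder
DeepSeek-VL	DeepSeek	Efficient architecture

MiniGPT-4

MiniGPT-4 is an important early open-source multimodal model: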

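import torch
import torch.nn as nn

class MiniGPT4(nn.Module):
    def __init__(self, visual_encoder, llm, qformer_dim=768, llm_dim=4096):
        super().__init__()

        # Frozen visual encoder, passed in by the caller
        # (in the paper: an EVA-CLIP ViT followed by the BLIP-2 Q-Former)
        self.visual_encoder = visual_encoder
        for param in self.visual_encoder.parameters():
            param.requires_grad = False

        # Trainable projection (the core idea: minimal alignment with a single linear layer)
        # Q-Former output width -> LLM hidden size; the default sizes are illustrative
        self.projection = nn.Linear(qformer_dim, llm_dim)

        # Frozen LLM, passed in by the caller (Vicuna in the paper)
        self.llm = llm
        for param in self.llm.parameters():
            param.requires_grad = False

    def forward(self, image, text_embeds):
        # Visual encoding: (batch, num_queries, qformer_dim)
        visual_features = self.visual_encoder(image)

        # Project into the LLM embedding space
        visual_tokens = self.projection(visual_features)

        # Insert into the text sequence; here the visual tokens are simply placed
        # in front of the already-embedded text: [USER]<image><visual_tokens>...
        combined_input = torch.cat([visual_tokens, text_embeds], dim=1)

        # LLM generation
        return self.llm(combined_input)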
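
Freezing both the visual encoder and the LLM leaves very few weights to train. The arithmetic below gives a rough sense of scale, assuming a single linear projection from a 768-wide Q-Former output to a 4096-wide LLM and a round 7 billion frozen LLM parameters. All three figures are assumptions for illustration, the frozen vision encoder is left out of the total, and `linear_param_count` is a helper invented here.

```python
def linear_param_count(in_features, out_features, bias=True):
    """Number of trainable parameters in one fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)

# Assumed sizes, for illustration only
projection_params = linear_param_count(768, 4096)
frozen_llm_params = 7_000_000_000

# Share of all weights that actually receive gradients
trainable_fraction = projection_params / (projection_params + frozen_llm_params)
```

With these numbers the projection holds about 3.1 million parameters, well under 0.1% of the total, which is why this style of alignment is cheap to train.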

Qwen-VL

Alibaba's Qwen-VL series performs strongly on Chinese-language multimodal tasks:

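import torch.nn as nn

class QwenVL(nn.Module):
    """Schematic only: the three components are placeholders passed in by the caller."""

    def __init__(self, vision_encoder, dynamic_aggregation, language_model):
        super().__init__()

        # Large-scale vision encoder
        self.vision_encoder = vision_encoder

        # Dynamic-resolution handling
        self.dynamic_aggregation = dynamic_aggregation

        # Multilingual LLM
        self.language_model = language_model

        # Projection from vision features to the LLM hidden size (illustrative dimensions)
        self.mm_projector = nn.Linear(1024, 3584)

    def process_image(self, image):
        # Smart tiling: split the image into patches according to its resolution
        patches = self.dynamic_aggregation.split_image(image)

        # Encode each patch separately
        features = [self.vision_encoder(p) for p in patches]

        # Merge, then project into the LLM embedding space
        merged = self.dynamic_aggregation.merge_features(features)
        return self.mm_projector(merged)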
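
With dynamic resolution, the number of visual tokens grows with image size instead of being fixed. The sketch below estimates that count under assumptions drawn from the Qwen2-VL description: 14 px patches, with each 2 x 2 group of neighboring patch tokens merged into one. Rounding each side to a multiple of 28 px is a simplification made here, special start and end tokens are not counted, and `visual_token_count` is a hypothetical name.

```python
def visual_token_count(height, width, patch_size=14, merge_size=2):
    """Estimate visual tokens for an image under patch splitting plus 2 x 2 token merging."""
    unit = patch_size * merge_size  # each merged token covers unit x unit pixels
    # Round each side to the nearest multiple of the unit, with a floor of one unit
    h = max(unit, round(height / unit) * unit)
    w = max(unit, round(width / unit) * unit)
    patches = (h // patch_size) * (w // patch_size)
    return patches // (merge_size ** 2)

small = visual_token_count(224, 224)
large = visual_token_count(448, 672)
```

Doubling both sides of an image roughly quadruples the token count, so resolution limits double as a way to control context usage.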

InternVL

The InternVL series emphasizes scaling up the vision encoder:

import torch.nn as nn

# InternVisionModel, PixelLevelAlignment and InternLM are placeholder names for
# components that must be defined elsewhere; the sizes below are illustrative.
class InternVL(nn.Module):
    def __init__(self, model_size="large"):
        super().__init__()

        # Progressive scaling strategy
        config = {
            "small": {"vision_dim": 1024, "llm_dim": 2048},
            "base": {"vision_dim": 1024, "llm_dim": 4096},
            "large": {"vision_dim": 2048, "llm_dim": 4096},
            "xl": {"vision_dim": 2048, "llm_dim": 7168}
        }[model_size]

        # Large-scale vision encoder
        self.vision_encoder = InternVisionModel(
            hidden_size=config["vision_dim"],
            num_layers=48,  # deeper than ViT-L (24 layers)
            num_heads=16
        )

        # Pixel-level interaction
        self.pixel_alignment = PixelLevelAlignment()

        # LLM
        self.language_model = InternLM(config["llm_dim"])

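Evaluation benchmarks

Mainstream multimodal benchmarks

Benchmark	Task type	Notes
MME	Comprehensive perception and cognition	14 subtasks
MMBench	Multi-dimensional ability	Multiple-choice format
MMMU	College-level multimodal understanding	Scientific figures and charts
SEED-Bench	Semantic understanding	12 dimensions
MathVista	Visual mathematical reasoning	Chart and figure problems
ChartQA	Chart question answering	Data reasoning

MME benchmark in detail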

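# MME evaluation framework
class MMEEvaluator:
    def __init__(self, model):
        self.model = model

        # Perception tasks (10)
        self.perception_tasks = [
            "existence", "count", "position", "color", "posters",
            "celebrity", "scene", "landmark", "artwork", "OCR"
        ]

        # Cognition tasks (4)
        self.cognition_tasks = [
            "commonsense", "numerical_calculation",
            "text_translation", "code_reasoning"
        ]

    def evaluate_task(self, task_data):
        # task_data: list of (image, question, answer) samples;
        # model.answer is a placeholder for whatever inference call the model exposes
        correct = sum(
            self.model.answer(image, question) == answer
            for image, question, answer in task_data
        )
        return correct / len(task_data)

    def evaluate(self, test_dataset):
        # test_dataset: dict mapping each task name to its list of samples
        results = {}

        for task in self.perception_tasks + self.cognition_tasks:
            results[task] = self.evaluate_task(test_dataset[task])

        # Total score: sum of the per-task accuracies
        results["total"] = sum(results.values())
        return results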
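
MME's published scoring is slightly stricter than plain accuracy. Each image comes with two yes/no questions, and a subtask's score is the sum of accuracy (per question) and accuracy+ (per image, counted only when both questions are right), giving a maximum of 200 per subtask. A minimal sketch of that rule follows; the record format and the name `mme_subtask_score` are assumptions made for this example.

```python
from collections import defaultdict

def mme_subtask_score(records):
    """Score one subtask from (image_id, is_correct) records, two records per image."""
    per_image = defaultdict(list)
    for image_id, is_correct in records:
        per_image[image_id].append(bool(is_correct))

    # accuracy: share of individual questions answered correctly
    answers = [flag for flags in per_image.values() for flag in flags]
    accuracy = 100.0 * sum(answers) / len(answers)

    # accuracy+: share of images whose questions were all answered correctly
    accuracy_plus = 100.0 * sum(all(flags) for flags in per_image.values()) / len(per_image)

    return accuracy + accuracy_plus

score = mme_subtask_score([
    ("img_a", True), ("img_a", True),
    ("img_b", True), ("img_b", False),
])
```

The accuracy+ term penalizes a model that answers "yes" to everything, since such a model gets exactly one of each image's two questions wrong.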

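Challenges and future directions

Current challenges

1. Hallucination

Multimodal models are prone to "visual hallucination", describing things that are not in the image: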

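# Hallucination example (model and image are assumed to exist already)
prompt = "Describe the objects in the image"
response = model.generate(prompt, image)
# Possible hallucination: the description mentions content that is not in the image

# Mitigation
def reduce_hallucination(response, image, verifier, extract_nouns, verify_entity_in_image):
    # verifier, extract_nouns and verify_entity_in_image are placeholders supplied by the caller:
    # a CLIP-style image-text model, a noun extractor, and an entity checker
    # 1. Post-hoc verification: embed the image with a vision-language model
    image_features = verifier.encode_image(image)

    # 2. Extract the entities the response describes
    entities = extract_nouns(response)

    # 3. Check whether each entity is actually in the image
    for entity in entities:
        if not verify_entity_in_image(entity, image_features):
            response = response.replace(entity, "[unverified]")

    return response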
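
The entity check can be as simple as comparing embeddings. A minimal sketch, assuming the entity text and the image have already been embedded into the same space by a CLIP-style model: the entity counts as supported when the cosine similarity clears a threshold. The threshold value of 0.25 is an arbitrary placeholder that would need tuning, and the function names are invented for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def entity_is_supported(entity_embedding, image_embedding, threshold=0.25):
    """Treat an entity as present in the image when its embedding is close enough."""
    return cosine_similarity(entity_embedding, image_embedding) >= threshold

# Toy 2-d embeddings stand in for real model outputs
aligned = entity_is_supported([1.0, 0.0], [1.0, 0.0])
orthogonal = entity_is_supported([1.0, 0.0], [0.0, 1.0])
```

A global image embedding tends to miss small objects, so a check like this is better suited to flagging candidates for review than to rewriting the response automatically.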

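2. Resolution limits

Standard ViT encoders use a fixed input size, typically 224x224 or 336x336
High-resolution images need special handling
Directions: dynamic resolution, windowed attention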

# High-resolution processing strategy
class HighResProcessor:
    @staticmethod
    def split_and_encode(image, model, grid_size=2):
        """Split a high-resolution PIL image into grid_size x grid_size sub-images and encode each."""
        h, w = image.height, image.width
        sub_h, sub_w = h // grid_size, w // grid_size

        patches = []
        for i in range(grid_size):
            for j in range(grid_size):
                # PIL crop box: (left, upper, right, lower)
                patch = image.crop((
                    j * sub_w, i * sub_h,
                    (j + 1) * sub_w, (i + 1) * sub_h
                ))
                patches.append(model.encode_image(patch))

        # Global feature from the whole image
        global_feat = model.encode_image(image)

        return patches, global_feat
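
Integer division in a grid split drops the leftover pixels whenever the image size is not a multiple of the grid size. The sketch below computes crop boxes from rounded cut points instead, so the boxes cover the whole image with no gaps or overlaps. The boxes use the (left, upper, right, lower) order that PIL's `crop` expects; `grid_boxes` is a helper name made up here.

```python
def grid_boxes(width, height, grid_size=2):
    """Return crop boxes, row by row, that tile the full image."""
    # Cut points along each axis, from 0 to the full extent
    xs = [round(j * width / grid_size) for j in range(grid_size + 1)]
    ys = [round(i * height / grid_size) for i in range(grid_size + 1)]
    return [
        (xs[j], ys[i], xs[j + 1], ys[i + 1])
        for i in range(grid_size)
        for j in range(grid_size)
    ]

boxes = grid_boxes(5, 5, grid_size=2)
covered = sum((right - left) * (lower - upper) for left, upper, right, lower in boxes)
```

The sub-images may differ in size by a pixel, which is harmless when the encoder resizes each crop to its own input size anyway.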

3. Modality bias

Models may lean too heavily on the text
Visual information is underused
Better balance between modalities is needed
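
One simple way to measure text over-reliance is a "blind" baseline: run the same questions with and without the image and compare accuracy. If the text-only score is close to the full score, the benchmark or the model is leaning on language priors. The sketch below assumes the three answer lists are aligned by question; `text_only_reliance` is a hypothetical name.

```python
def text_only_reliance(answers_with_image, answers_without_image, gold):
    """Compare accuracy with and without the image on the same questions."""
    n = len(gold)
    acc_with = sum(a == g for a, g in zip(answers_with_image, gold)) / n
    acc_without = sum(a == g for a, g in zip(answers_without_image, gold)) / n
    return {
        "with_image": acc_with,
        "text_only": acc_without,
        # How much accuracy the image actually contributes
        "visual_gain": acc_with - acc_without,
    }

report = text_only_reliance(
    answers_with_image=["a", "b", "c", "x"],
    answers_without_image=["a", "x", "x", "x"],
    gold=["a", "b", "c", "d"],
)
```

A small `visual_gain` suggests the questions can be answered from text alone and are a weak test of visual understanding.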

Future directions

1. Video understanding

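import torch

class VideoLLM:
    """Multimodal video understanding (schematic)"""

    def __init__(self, frame_encoder, temporal_model, llm):
        # All three components are placeholders supplied by the caller
        self.frame_encoder = frame_encoder
        self.temporal_model = temporal_model
        self.llm = llm

    def fuse(self, temporal_features, query_tokens):
        # Simplest possible fusion: put the video tokens in front of the query tokens
        return torch.cat([temporal_features, query_tokens], dim=0)

    def understand_video(self, video_frames, query_tokens):
        # 1. Encode frame by frame
        frame_features = torch.stack([self.frame_encoder(f) for f in video_frames])

        # 2. Temporal modeling
        temporal_features = self.temporal_model(frame_features)

        # 3. Fuse with the (already embedded) query
        combined = self.fuse(temporal_features, query_tokens)

        # 4. Generate the answer
        return self.llm.generate(combined)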
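
Encoding every frame is rarely affordable, so video pipelines usually sample a fixed number of frames first. A minimal sketch of uniform sampling, which takes the middle frame of each of `num_samples` equal segments; the function name and the choice of segment midpoints are assumptions for illustration, not a method from any specific model.

```python
def sample_frame_indices(total_frames, num_samples):
    """Pick frame indices spread evenly across a video."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= total_frames:
        # Short video: keep every frame
        return list(range(total_frames))
    step = total_frames / num_samples
    # Midpoint of each equal-length segment
    return [int((i + 0.5) * step) for i in range(num_samples)]

indices = sample_frame_indices(100, 4)
```

Uniform sampling can miss brief events; it is a baseline that content-aware sampling schemes are typically compared against.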

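2. 3D and point-cloud understanding

Direction	Application
PointLLM	Point-cloud captioning and question answering
3D-VLA	3D scene understanding
Embodied AI	Robot perception

3. Native multimodal output

Future multimodal models should be able to:

Output text and images together
Generate video and audio
Close the perception-action loop end to end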

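class NativeMultimodalOutput:
    """Native multimodal output (schematic)"""

    def __init__(self, multimodal_vocab, vqvae):
        # Shared vocabulary containing text tokens and image tokens (placeholder object)
        self.multimodal_vocab = multimodal_vocab
        # Image decoder that turns image tokens back into pixels (placeholder object)
        self.vqvae = vqvae

    def generate(self, query, output_modality="text+image"):
        # Default both outputs to None so a single-modality request still returns cleanly
        text_output, image = None, None

        if "text" in output_modality:
            text_output = self.generate_text(query)

        if "image" in output_modality:
            image_tokens = self.generate_image_tokens(query)
            image = self.vqvae.decode(image_tokens)

        return {"text": text_output, "image": image}

    # Placeholders for the actual decoding loops
    def generate_text(self, query):
        raise NotImplementedError

    def generate_image_tokens(self, query):
        raise NotImplementedError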
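
A shared vocabulary is commonly built by appending image codebook entries after the text tokens, so that one id space covers both. The sketch below shows how a generated id sequence could then be routed: ids below the text vocabulary size are text, and the rest map back to image codebook indices. The offset layout and the name `split_by_modality` are assumptions for illustration.

```python
def split_by_modality(token_ids, text_vocab_size):
    """Split generated ids into text token ids and image codebook indices."""
    text_ids, image_codes = [], []
    for token_id in token_ids:
        if token_id < text_vocab_size:
            text_ids.append(token_id)
        else:
            # Image ids sit after the text vocabulary; subtract the offset
            image_codes.append(token_id - text_vocab_size)
    return text_ids, image_codes

text_ids, image_codes = split_by_modality([3, 10, 1005, 1000, 7], text_vocab_size=1000)
```

The image codes would then go to an image decoder such as a VQ-VAE, while the text ids go through the ordinary detokenizer.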

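Connections to existing content

Related topic	Relevance
LLaVA	Open-source multimodal model architecture
CLIP	Foundation for vision encoders
MoE	Sparse mixture-of-experts architecture, as used in Gemini 1.5
PEFT	Efficient fine-tuning of multimodal models
LoRA	Parameter-efficient fine-tuning of open-source models

References

¹ Gemini Team, Google. Gemini: A Family of Highly Capable Multimodal Models. 2023.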
