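Genie Series: Google DeepMind World Models

Overview

Genie is a family of world models developed by Google DeepMind. From Genie 1 (2024) to Genie 3 (2025), the series marks significant progress in generative video world models. Its core innovation is learning latent actions from unlabeled video, which lets the model infer the interactions taking place in a video and expose them as controls during generation.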

┌──────────────────────────────────────────────────────────────────────┐
│                        Genie series timeline                         │
│                                                                      │
│  Genie 1 (2024)  ──▶  2D game worlds, latent action learning         │
│                                                                      │
│  Genie 2 (2024)  ──▶  3D environments, large-scale video pretraining │
│                                                                      │
│  Genie 3 (2025)  ──▶  real-time, 720p/24fps, promptable world events │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘

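1. Genie 1: A Latent-Action World Model

1.1 Core Idea

The core contribution of Genie 1 is learning a latent action space from unlabeled video. Conventional approaches require an action label for every frame transition; Genie 1 instead discovers latent actions on its own through self-supervised learning.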
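
To make the idea concrete, here is a minimal toy sketch of assigning a discrete latent action to each frame transition. It is not Genie's method: the codebook is hand-written (a real model learns it together with an encoder), the "frames" are 2-D positions rather than images, and the names `nearest_code`, `infer_latent_actions`, and `CODEBOOK` are hypothetical.

```python
# Toy illustration: label each frame transition with the nearest entry of a
# small "latent action" codebook. A real model learns the encoder and the
# codebook; here the codebook is hand-written and frames are 2-D positions.

def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector`."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


def infer_latent_actions(frames, codebook):
    """Map T frames to T-1 discrete latent action indices."""
    actions = []
    for prev, nxt in zip(frames, frames[1:]):
        delta = [b - a for a, b in zip(prev, nxt)]  # what changed between frames
        actions.append(nearest_code(delta, codebook))
    return actions


# Hypothetical codebook: 0 = stay, 1 = right, 2 = left, 3 = up
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (-1.0, 0.0), (0.0, 1.0)]

# A character position over five frames, with no action labels attached
frames = [(0.0, 0.0), (0.9, 0.1), (2.0, 0.0), (2.0, 1.1), (2.0, 1.0)]
latent_actions = infer_latent_actions(frames, CODEBOOK)
print(latent_actions)  # [1, 1, 3, 0]
```

Five frames give four transitions, so the output has one action index per transition. No labels were needed: the index comes purely from how consecutive frames differ.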

1.2 Architecture

┌────────────────────────────────────────────────────────────────────┐
│                        Genie 1 architecture                        │
│                                                                    │
│  Input video frame sequence                                        │
│           │                                                        │
│           ▼                                                        │
│  ┌──────────────────┐                                              │
│  │ Video Tokenizer  │  ──▶ discrete video token sequence           │
│  │ (ST-ViViT)       │                                              │
│  └────────┬─────────┘                                              │
│           │                                                        │
│           ▼                                                        │
│  ┌──────────────────┐                                              │
│  │ Dynamics Model   │  ──▶ latent actions + next-frame prediction  │
│  │ (Transformer)    │                                              │
│  └────────┬─────────┘                                              │
│           │                                                        │
│           ▼                                                        │
│  ┌──────────────────┐                                              │
│  │ Action Head      │  ──▶ latent action prediction                │
│  └──────────────────┘                                              │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘

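1.3 Key Components

Video Tokenizer

Genie 1 uses ST-ViViT, a VQ-VAE built from spatiotemporal transformers with causal attention along the time axis, as its video tokenizer. It compresses a sequence of video frames into discrete tokens:

import torch.nn as nn

# CausalVideoEncoder, VectorQuantizer and VideoDecoder are placeholder
# modules assumed to be defined elsewhere; this is a structural sketch.


class CausalVideoTokenizer(nn.Module):
    """
    Video tokenizer for Genie 1,
    based on a causal transformer architecture.
    """
    def __init__(self, config):
        super().__init__()
        self.image_size = config.image_size  # e.g. 240x135
        self.patch_size = config.patch_size  # e.g. 16
        self.num_frames = config.num_frames  # e.g. 16
        self.encoder = CausalVideoEncoder(config)
        self.quantizer = VectorQuantizer(
            codebook_size=8192,
            embedding_dim=32
        )
        self.decoder = VideoDecoder(config)

    def encode(self, video_frames):
        """
        Encode video frames into discrete tokens.
        """
        # Causal along the time axis: frame t only sees frames <= t
        encoded = self.encoder(video_frames)  # [B, T, H*W, D]
        quantized, indices = self.quantizer(encoded)
        return quantized, indices

    def decode(self, indices):
        """
        Reconstruct video from token indices.
        """
        # Map indices back to codebook vectors (assumes the placeholder
        # VectorQuantizer exposes a `lookup` method)
        embedded = self.quantizer.lookup(indices)
        reconstructed = self.decoder(embedded)
        return reconstructed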
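
As a rough worked example of how much a patch-based tokenizer shortens the sequence, the sketch below counts tokens using the example values from the config comments above (240x135 frames, patch size 16, 16 frames). It assumes partial patches at the border are padded to a full patch, which the source does not specify; `tokens_per_clip` is a hypothetical helper.

```python
import math


def tokens_per_clip(width, height, patch_size, num_frames):
    """Count discrete tokens when each frame is split into square patches."""
    cols = math.ceil(width / patch_size)   # partial patches are padded (assumption)
    rows = math.ceil(height / patch_size)
    per_frame = cols * rows
    return per_frame, per_frame * num_frames


per_frame, per_clip = tokens_per_clip(width=240, height=135, patch_size=16, num_frames=16)
print(per_frame, per_clip)  # 135 2160
```

Each frame becomes a 15 x 9 grid of 135 tokens, so a 16-frame clip is 2160 tokens, compared with 240 * 135 * 3 = 97,200 raw color values per frame.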

Dynamics Model

The dynamics model uses a causal transformer to predict the next frame:

import torch.nn as nn

# CausalSelfAttention, CrossAttention and FeedForward are placeholder
# modules assumed to be defined elsewhere.


class DynamicsModel(nn.Module):
    """
    Dynamics model: next-frame prediction conditioned on latent actions.
    """
    def __init__(self, config):
        super().__init__()
        self.hidden_dim = config.hidden_dim
        self.num_layers = config.num_layers  # e.g. 6

        # Self-attention over the video token sequence
        self.video_attention = nn.ModuleList([
            CausalSelfAttention(config)
            for _ in range(config.num_layers)
        ])

        # Cross-attention from video tokens to latent actions
        self.action_cross_attention = nn.ModuleList([
            CrossAttention(config)
            for _ in range(config.num_layers)
        ])

        self.ffn = nn.ModuleList([
            FeedForward(config)
            for _ in range(config.num_layers)
        ])

        # One LayerNorm per sub-layer so normalization parameters are not shared
        self.norms = nn.ModuleList([
            nn.LayerNorm(config.hidden_dim)
            for _ in range(3 * config.num_layers)
        ])

    def forward(self, video_tokens, latent_actions):
        """
        video_tokens:   [B, T-1, H*W, D] tokens of frames 0 .. T-2
        latent_actions: [B, T-1, D] one latent action per transition
        returns:        [B, T-1, H*W, D] features for frames 1 .. T-1
        """
        x = video_tokens
        for i in range(self.num_layers):
            # Causal self-attention
            x = self.norms[3 * i](x + self.video_attention[i](x))

            # Cross-attention: inject latent action information
            x = self.norms[3 * i + 1](
                x + self.action_cross_attention[i](x, latent_actions)
            )

            # Feed-forward network
            x = self.norms[3 * i + 2](x + self.ffn[i](x))

        return x
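
The "causal" part of the dynamics model means that the representation of step t may only depend on steps up to t. The sketch below shows that constraint on scalar features. Uniform averaging stands in for learned attention weights, and `causal_mask` and `masked_average` are hypothetical helpers, so this illustrates the masking pattern rather than an attention implementation.

```python
def causal_mask(num_steps):
    """mask[i][j] is True when step i may attend to step j (j <= i)."""
    return [[j <= i for j in range(num_steps)] for i in range(num_steps)]


def masked_average(values, mask_row):
    """Average only the values that the mask row makes visible."""
    visible = [v for v, keep in zip(values, mask_row) if keep]
    return sum(visible) / len(visible)


mask = causal_mask(4)
frame_features = [1.0, 2.0, 3.0, 4.0]  # one scalar feature per time step
context = [masked_average(frame_features, row) for row in mask]
print(context)  # [1.0, 1.5, 2.0, 2.5]
```

Changing the last feature leaves the first three context values untouched, which is the property that lets such a model generate frames one step at a time.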

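Action Head

The action head predicts a latent action for each time step. The Genie paper calls its counterpart the latent action model (LAM) and uses a small discrete codebook of 8 latent actions:

import torch.nn as nn


class ActionHead(nn.Module):
    """
    Action head: predicts a latent action for every time step.
    """
    def __init__(self, config):
        super().__init__()
        self.hidden_dim = config.hidden_dim
        self.action_dim = config.action_dim  # latent action dimension

        self.net = nn.Sequential(
            nn.Linear(self.hidden_dim, self.hidden_dim),
            nn.ReLU(),
            nn.Linear(self.hidden_dim, self.action_dim)
        )

    def forward(self, video_tokens):
        """
        video_tokens: [B, T-1, H*W, D]
        returns:      [B, T-1, action_dim]
        """
        # Average the spatial tokens of each step, then predict one action per step
        pooled = video_tokens.mean(dim=2)  # [B, T-1, D]
        actions = self.net(pooled)
        return actions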

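1.4 Training Objective

Genie 1 is trained with a multi-task objective:

import torch.nn.functional as F


def genie_loss(model, batch):
    """
    Training loss for Genie 1.
    """
    video_frames = batch['frames']  # [B, T, C, H, W]
    # Latent actions that condition the dynamics model, [B, T-1, D].
    # Assumption: they are supplied with the batch.
    actions = batch['actions']

    # 1. Video reconstruction loss
    quantized, indices = model.video_tokenizer.encode(video_frames)
    reconstructed = model.video_tokenizer.decode(indices)
    recon_loss = F.mse_loss(reconstructed, video_frames)

    # 2. Dynamics prediction loss: frames 0 .. T-2 predict frames 1 .. T-1
    dynamics_output = model.dynamics(quantized[:, :-1], actions)
    # decode_from_features is assumed to map features back to pixels
    pred_frames = model.video_tokenizer.decode_from_features(dynamics_output)
    dyn_loss = F.mse_loss(pred_frames, video_frames[:, 1:])

    # 3. Latent action prediction loss (auxiliary; can be dropped).
    # Requires action_dim == D so prediction and target shapes match.
    pred_actions = model.action_head(dynamics_output)
    action_loss = F.mse_loss(pred_actions, actions)

    # Total loss (the action term is down-weighted)
    total_loss = recon_loss + dyn_loss + 0.1 * action_loss

    return total_loss, {
        'recon_loss': recon_loss,
        'dyn_loss': dyn_loss,
        'action_loss': action_loss
    }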
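
A small worked example shows how the 0.1 weight keeps the action term from dominating the total. The numbers are made up for illustration, and `mse` and `total_loss` are hypothetical pure-Python stand-ins for the tensor operations above.

```python
def mse(pred, target):
    """Mean squared error over two equal-length lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def total_loss(recon, dyn, action, action_weight=0.1):
    """Weighted sum matching recon + dyn + 0.1 * action."""
    return recon + dyn + action_weight * action


recon = mse([0.5, 0.5], [1.0, 0.0])   # (0.25 + 0.25) / 2 = 0.25
dyn = mse([0.0, 2.0], [0.0, 1.0])     # (0.0 + 1.0) / 2 = 0.5
action = mse([1.0], [0.0])            # 1.0
loss = total_loss(recon, dyn, action)
print(round(loss, 6))  # 0.85
```

Although the action error is the largest of the three, it contributes only 0.1 of the 0.85 total.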

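1.5 Experimental Results

Task	Genie 1 performance	Notes
2D game generation	Highly controllable	Characters can be steered through latent actions
Robot videos	Generalizes well	Extends to actions not seen during training
Simulated environments	Physically consistent	Preserves simple physical regularities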

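2. Genie 2: Large-Scale 3D Environment Generation

2.1 Main Improvements

Genie 2 is a major upgrade over Genie 1:

Feature	Genie 1	Genie 2
Generated content	2D games	Playable 3D environments
Training video	Internet videos of 2D platformer games, plus robot videos	Large-scale video dataset
Model size	11B parameters	Not publicly disclosed
Action control	Latent actions	Action labels + latent actions
Generation quality	Basic	High fidelity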

2.2 Architecture Upgrades

┌─────────────────────────────────────────────────────────────┐
│                    Genie 2 architecture                     │
│                                                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │             Large-scale video pretraining             │  │
│  └───────────────────────────────────────────────────────┘  │
│                              │                              │
│                              ▼                              │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  3D consistency modeling                              │  │
│  │  • Depth awareness                                    │  │
│  │  • Camera motion                                      │  │
│  │  • 3D structure inference                             │  │
│  └───────────────────────────────────────────────────────┘  │
│                              │                              │
│                              ▼                              │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  Controllable generation                              │  │
│  │  • Text prompts                                       │  │
│  │  • Action control                                     │  │
│  │  • Initial-frame conditioning                         │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                             │
└─────────────────────────────────────────────────────────────┘

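2.3 3D Awareness

Genie 2 adds 3D awareness, so the environments it generates stay spatially consistent:

# DepthEstimator, SceneGraphBuilder, MultiViewConsistencyLoss, VideoEncoder
# and DepthFusion are placeholder components assumed to be defined elsewhere.


class Genie2_3DAware:
    """
    3D-awareness module for Genie 2.
    """
    def __init__(self):
        # Depth estimation network
        self.depth_estimator = DepthEstimator()

        # 3D scene graph construction
        self.scene_graph_builder = SceneGraphBuilder()

        # 3D consistency loss
        self.consistency_loss = MultiViewConsistencyLoss()

        # RGB encoder and depth fusion used by encode_with_depth
        self.encoder = VideoEncoder()
        self.fuse_3d = DepthFusion()

    def encode_with_depth(self, frames):
        """
        Encode frames and estimate their depth.
        """
        # Depth estimation
        depth_maps = self.depth_estimator(frames)

        # Encode RGB
        video_tokens = self.encoder(frames)

        # Fuse in the 3D information
        enhanced_tokens = self.fuse_3d(video_tokens, depth_maps)

        return enhanced_tokens, depth_maps

    def ensure_consistency(self, generated_frames):
        """
        Score the 3D consistency of generated frames.
        """
        # Estimate depth for the generated frames
        gen_depth = self.depth_estimator(generated_frames)

        # Multi-view consistency check
        consistency_score = self.consistency_loss(
            generated_frames, gen_depth
        )

        return consistency_score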
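
One simple way to probe spatial consistency is to look away from a view, return to it, and measure how much the image changed. The sketch below does this on tiny grayscale frames. It is an illustrative metric chosen for this example, not the multi-view loss named above and not a published Genie 2 evaluation; `mean_abs_diff` and `revisit_consistency` are hypothetical helpers.

```python
def mean_abs_diff(frame_a, frame_b):
    """Average per-pixel absolute difference between two equal-size frames."""
    total = sum(
        abs(a - b)
        for row_a, row_b in zip(frame_a, frame_b)
        for a, b in zip(row_a, row_b)
    )
    count = sum(len(row) for row in frame_a)
    return total / count


def revisit_consistency(frames, first_visit, second_visit):
    """Lower is better: 0.0 means the revisited view is pixel-identical."""
    return mean_abs_diff(frames[first_visit], frames[second_visit])


# 2x2 grayscale frames: view A, a different view B, then view A again
view_a = [[10, 20], [30, 40]]
view_b = [[200, 210], [220, 230]]
view_a_again = [[10, 22], [30, 38]]
frames = [view_a, view_b, view_a_again]

score = revisit_consistency(frames, 0, 2)
print(score)  # 1.0
```

A small score on the revisit (1.0 here) next to a large score between unrelated views (190.0 between A and B) suggests the scene was remembered rather than regenerated.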

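2.4 Conditional Generation

Genie 2 supports several kinds of conditional generation:

class Genie2ConditionalGeneration:
    """
    Conditional generation for Genie 2.
    """
    def __init__(self, text_encoder, image_encoder, action_encoder,
                 generative_model, dynamics_model, sample_action):
        # Components are injected by the caller; each is assumed to be callable
        self.text_encoder = text_encoder
        self.image_encoder = image_encoder
        self.action_encoder = action_encoder
        self.generative_model = generative_model
        self.dynamics_model = dynamics_model
        self.sample_action = sample_action

    def generate_from_text(self, prompt):
        """
        Generate from a text prompt.
        """
        # Encode the text
        text_features = self.text_encoder(prompt)

        # Conditional generation
        video = self.generative_model(condition=text_features)
        return video

    def generate_from_image(self, init_image, action_sequence):
        """
        Generate from an initial frame plus an action sequence.
        """
        # Encode the image
        image_features = self.image_encoder(init_image)

        # Encode the actions
        action_features = self.action_encoder(action_sequence)

        # Conditional generation
        video = self.generative_model(
            init=image_features,
            actions=action_features
        )
        return video

    def generate_playable(self, prompt, num_actions=100):
        """
        Generate a playable interactive environment.
        """
        # 1. Generate the initial environment
        init_frames = self.generate_from_text(prompt)

        # 2. Roll the world forward one action at a time.
        # Actions may come from a user or an AI agent; here they are sampled.
        action_responses = []
        current_frames = init_frames

        for _ in range(num_actions):
            action = self.sample_action(current_frames)
            next_frames = self.dynamics_model(current_frames, action)
            action_responses.append((action, next_frames))
            current_frames = next_frames

        return {
            'init_frames': init_frames,
            'action_responses': action_responses
        }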
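
The loop in `generate_playable` is an autoregressive rollout: each output becomes the input of the next step. The sketch below runs the same loop with a hand-written stand-in for the learned dynamics model, an agent moving on a 5x5 grid. `toy_dynamics` and `rollout` are hypothetical names introduced only for this example.

```python
def toy_dynamics(state, action):
    """Stand-in for a learned dynamics model: move an agent on a 5x5 grid."""
    moves = {
        "left": (-1, 0), "right": (1, 0),
        "up": (0, -1), "down": (0, 1), "stay": (0, 0),
    }
    dx, dy = moves[action]
    x = min(4, max(0, state[0] + dx))  # clamp to the grid
    y = min(4, max(0, state[1] + dy))
    return (x, y)


def rollout(initial_state, actions, dynamics):
    """Autoregressive rollout: each predicted state feeds the next step."""
    trajectory = [initial_state]
    for action in actions:
        trajectory.append(dynamics(trajectory[-1], action))
    return trajectory


trajectory = rollout(
    (0, 0), ["right", "right", "down", "left", "up", "up"], toy_dynamics
)
print(trajectory)
# [(0, 0), (1, 0), (2, 0), (2, 1), (1, 1), (1, 0), (1, 0)]
```

The final "up" has no effect because the agent is already on the top row. In a learned model there is no such exact rule, so errors can accumulate over long rollouts, which is one reason consistency over time is hard.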

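3. Genie 3: A Real-Time Interactive World Model

3.1 Core Breakthroughs

Genie 3 builds on Genie 2 and is described by DeepMind as the first real-time interactive general-purpose world model:

Feature	Genie 2	Genie 3
Resolution	Varies	720p
Frame rate	Not real-time	24 fps, real-time
Interaction latency	Seconds	< 100 ms
Memory	~10 s	~1 min
Consistency	Basic	Highly consistent
Controllability	Text + actions	Multimodal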

3.2 Real-Time Architecture

┌──────────────────────────────────────────────────────────────┐
│                Genie 3 real-time architecture                │
│                                                              │
│  Input processing (< 10 ms)                                  │
│  ┌────────────────────────────────────────────────────────┐  │
│  │  • Text prompt encoding                                │  │
│  │  • Action input handling (keyboard/mouse)              │  │
│  │  • History frame compression                           │  │
│  └────────────────────────────────────────────────────────┘  │
│                              │                               │
│                              ▼                               │
│  Generation (< 30 ms)                                        │
│  ┌────────────────────────────────────────────────────────┐  │
│  │  • Parallel frame generation                           │  │
│  │  • Latent action prediction                            │  │
│  │  • Next-frame rendering                                │  │
│  └────────────────────────────────────────────────────────┘  │
│                              │                               │
│                              ▼                               │
│  Output rendering (< 10 ms)                                  │
│  ┌────────────────────────────────────────────────────────┐  │
│  │  • 720p RGB rendering                                  │  │
│  │  • Audio sync (if any)                                 │  │
│  └────────────────────────────────────────────────────────┘  │
│                                                              │
│  Total latency: < 50 ms if the stages run serially.          │
│  24 fps allows ~41.7 ms per frame, so stages must overlap.   │
│                                                              │
└──────────────────────────────────────────────────────────────┘
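
The stage budgets in the diagram (10, 30, and 10 ms) can be checked against the 24 fps target with a little arithmetic. The sketch assumes each stage takes its full budget and compares running the stages one after another with overlapping them in a pipeline; the function names are hypothetical.

```python
def frame_budget_ms(fps):
    """Time available per frame at a given frame rate."""
    return 1000.0 / fps


def serial_fps(stage_ms):
    """Throughput when every frame passes through all stages before the next starts."""
    return 1000.0 / sum(stage_ms)


def pipelined_fps(stage_ms):
    """Throughput when stages overlap: bounded by the slowest stage."""
    return 1000.0 / max(stage_ms)


stages = [10.0, 30.0, 10.0]       # input, generation, output (ms)
budget = frame_budget_ms(24)      # about 41.67 ms per frame
serial = serial_fps(stages)       # 20.0 fps
pipelined = pipelined_fps(stages) # about 33.3 fps

print(round(budget, 2), serial, round(pipelined, 1))  # 41.67 20.0 33.3
```

Run serially, a 50 ms frame time gives only 20 fps, below the target. With the stages overlapped, throughput is set by the 30 ms generation stage and clears 24 fps, while the input-to-output latency of a single frame stays at 50 ms.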

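3.3 Key Techniques

3.3.1 Efficient Latent Action Representation

Genie 3 improves how latent actions are represented and predicted:

# ActionMemory, ActionPredictor and ActionMapping are placeholder
# components assumed to be defined elsewhere.


class Genie3LatentAction:
    """
    Latent action module for Genie 3.
    """
    def __init__(self):
        self.action_dim = 32  # latent action dimension
        self.fps = 24
        self.memory_length = self.fps * 60  # 1440 steps = 1 minute at 24 fps

        # Action memory
        self.action_memory = ActionMemory(
            capacity=self.memory_length,
            dim=self.action_dim
        )

        # Action predictor
        self.action_predictor = ActionPredictor(
            hidden_dim=1024,
            num_layers=4
        )

        # Maps user input (e.g. key presses) into the latent action space
        self.action_mapping = ActionMapping(dim=self.action_dim)

    def predict_and_store(self, video_features, past_actions):
        """
        Predict the latent action and update the memory.
        """
        # Predict the latent action for the current frame
        predicted_action = self.action_predictor(
            video_features, past_actions
        )

        # Store it in memory
        self.action_memory.push(predicted_action)

        return predicted_action

    def get_controllable_action(self, user_action):
        """
        Map a user action into the latent action space.
        """
        # User action (e.g. keyboard input) -> latent action
        latent_action = self.action_mapping(user_action)
        return latent_action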
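
A fixed-horizon memory like the one above can be sketched with a bounded deque: once it is full, each new entry evicts the oldest. The class below is a hypothetical stand-in for `ActionMemory`, sized from the stated frame rate and a one-minute horizon (24 * 60 = 1440 entries).

```python
from collections import deque


class RollingActionMemory:
    """Keeps only the most recent `fps * seconds` actions."""

    def __init__(self, fps=24, seconds=60):
        self.capacity = fps * seconds
        self._buffer = deque(maxlen=self.capacity)  # old entries drop automatically

    def push(self, action):
        self._buffer.append(action)

    def recent(self, n):
        """Return the last n actions, oldest first."""
        if n <= 0:
            return []
        return list(self._buffer)[-n:]

    def __len__(self):
        return len(self._buffer)


memory = RollingActionMemory(fps=24, seconds=60)
for step in range(2000):  # push more steps than the memory can hold
    memory.push(step)

print(memory.capacity, len(memory), memory.recent(3))
# 1440 1440 [1997, 1998, 1999]
```

After 2000 pushes, the oldest retained entry is step 560, so anything more than a minute old is no longer available to condition on.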

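3.3.2 Promptable World Events

Genie 3 introduces **promptable world events**, which let the user change the generated environment on the fly:

# EventInjector and EventParser are placeholder components assumed to be
# defined elsewhere.


class PromptableWorldEvents:
    """
    Promptable world events module,
    a core innovation of Genie 3.
    """
    def __init__(self):
        self.event_types = [
            'weather_change',
            'object_add',
            'character_add',
            'time_of_day',
            'scene_transition',
            'physics_modifier'
        ]

        # Event injector
        self.event_injector = EventInjector()

        # Turns natural-language descriptions into structured events
        self.event_parser = EventParser()

    def apply_event(self, current_state, event):
        """
        Apply a world event given as free text or as a structured dict.
        """
        # Parse free-text descriptions; structured events pass through
        if isinstance(event, str):
            event = self.parse_event(event)

        # Build the event condition
        event_condition = self.create_condition(event)

        # Inject the event into the generation process
        modified_state = self.event_injector.inject(
            current_state, event_condition
        )

        return modified_state

    def parse_event(self, description):
        """
        Parse a natural-language event description.
        """
        # Use an LLM or a dedicated model to extract the event type and parameters
        parsed = self.event_parser(description)
        return parsed

    def create_condition(self, event):
        """
        Validate the event and package it as a generation condition.
        """
        if event['type'] not in self.event_types:
            raise ValueError(f"unknown event type: {event['type']}")
        return event

    # Example events
    def weather_change(self, state, weather_type):
        """
        Change the weather.
        weather_type: 'sunny', 'rainy', 'snowy', 'foggy'
        """
        return self.apply_event(state, {
            'type': 'weather_change',
            'params': {'weather': weather_type}
        })

    def add_object(self, state, object_type, position):
        """
        Add an object to the scene.
        """
        return self.apply_event(state, {
            'type': 'object_add',
            'params': {
                'object': object_type,
                'position': position
            }
        })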
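
To show the text-to-event step end to end, here is a minimal rule-based sketch. It is only a stand-in: a system like the one described would condition a generative model on the prompt, whereas this example matches keywords and edits a plain dictionary. `WEATHER_WORDS`, `parse_event`, and `apply_event` are hypothetical names for this illustration.

```python
WEATHER_WORDS = {"rain": "rainy", "snow": "snowy", "fog": "foggy", "sun": "sunny"}


def parse_event(description):
    """Turn a short text prompt into a structured event (keyword rules only)."""
    text = description.lower()
    # Check "add ..." first so object names containing weather words still work
    if text.startswith("add "):
        return {"type": "object_add", "params": {"object": text[4:].strip()}}
    for keyword, weather in WEATHER_WORDS.items():
        if keyword in text:
            return {"type": "weather_change", "params": {"weather": weather}}
    return {"type": "unknown", "params": {"text": description}}


def apply_event(state, event):
    """Return a new state with the event applied; the input is not mutated."""
    new_state = dict(state)
    if event["type"] == "weather_change":
        new_state["weather"] = event["params"]["weather"]
    elif event["type"] == "object_add":
        new_state["objects"] = state.get("objects", []) + [event["params"]["object"]]
    return new_state


state = {"weather": "sunny", "objects": []}
rainy = apply_event(state, parse_event("Make it start to rain"))
with_item = apply_event(rainy, parse_event("add a red umbrella"))
print(with_item)
# {'weather': 'rainy', 'objects': ['a red umbrella']}
```

Returning a new state on each event keeps the earlier states intact, which makes it easy to compare the world before and after a prompt.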

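3.4 Performance Comparison

Metric	Genie 1	Genie 2	Genie 3
Parameters	11B	Not disclosed	Not disclosed
Resolution	240×135	Varies	720p
Frame rate	~1 fps	~5 fps	24 fps
Interactive	Partially	Yes	Yes
Real-time control	No	Limited	Full
Promptable events	No	No	Yes
Memory	Immediate	~10 s	~1 min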

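3.5 Applications

3.5.1 Game Development

Use case: game designers use Genie 3 to prototype game worlds quickly.

Workflow:
1. The designer enters a text description: "a medieval town with a market, a castle, and a forest"
2. Genie 3 generates an initial 3D environment
3. The designer can:
   - Walk around and explore the environment
   - Add or modify elements through prompts
   - Test game mechanics
4. Hand the result off as a reference for production game assets

3.5.2 Training Data Generation

Use case: generating unlimited training environments for embodied AI.

Advantages:
1. Unlimited variety: each prompt yields a different environment
2. Safety: training happens in virtual environments
3. Controllability: scene parameters can be controlled precisely
4. Scalability: environments are generated on demand

3.5.3 Education and Training

Use case: immersive education and training.

Examples:
- Historical reenactment: the Roman Colosseum
- Science experiment simulation: chemical reactions
- Vocational training: surgery simulation
- Safety training: disaster response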

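4. Summary of the Technical Evolution

4.1 Architecture Evolution

┌─────────────────────────────────────────────────────────────────────────────┐
│                        Genie architecture evolution                         │
│                                                                             │
│  Genie 1                   Genie 2                   Genie 3                │
│  ┌───────────────────┐     ┌───────────────────┐     ┌───────────────────┐  │
│  │ ST-ViViT tokens   │     │ Large-scale video │     │ Fast inference    │  │
│  │ + transformer     │ ──▶ │ + 3D awareness    │ ──▶ │ + event system    │  │
│  │ + action head     │     │ + conditioning    │     │ + long memory     │  │
│  └───────────────────┘     └───────────────────┘     └───────────────────┘  │
│                                                                             │
│  Params: 11B               Params: undisclosed       Params: undisclosed    │
│  Core: latent actions      Core: 3D generation       Core: real-time        │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘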

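4.2 Key Innovations

Version	Core innovation	Paper / reference
Genie 1	Latent action learning	arXiv:2402.15391
Genie 2	Large-scale 3D environment generation	DeepMind Blog
Genie 3	Real-time interaction + promptable events	DeepMind Blog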

4.3 Open-Source Status

As of May 2026:

Model	Status	Link
Genie 1	Partially open source	GitHub
Genie 2	Restricted research access	Apply for access
Genie 3	Research preview	Research Preview

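5. Comparison with Other World Models

5.1 Comparison with the Dreamer Series

Feature	Genie	Dreamer
Training data	Video (unlabeled)	Interaction data (with rewards)
Goal	World generation	Decision optimization
Control	Latent actions / text	Policy network
Interactivity	User-controlled	Autonomous agent
Real-time	Supported by Genie 3	Limited

5.2 Comparison with Video Generation Models

Feature	Genie	Sora / Gen-2
Action control	Yes	Partial
Interactivity	Yes	No
Latent actions	Yes	No
Promptable events	Yes (Genie 3)	No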

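6. Future Outlook

6.1 Near-Term Development

- Longer memory horizons
- Higher resolution and frame rates
- Richer object interactions

6.2 Long-Term Vision

The long-term vision for the Genie series:
1. Build a general-purpose world simulator
2. Support arbitrary kinds of interaction
3. Achieve faithful simulation of the physical world
4. Serve as a foundational component of AGI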
