视频扩散世界模型

概述

视频扩散世界模型是当前世界模型研究的重要方向。与传统的自回归模型不同，扩散模型在视频生成质量、物理一致性和长程依赖建模方面展现出独特优势。

┌─────────────────────────────────────────────────────────────────┐
│                  视频生成方法对比                                   │
│                                                                   │
│  ┌─────────────────────┐    ┌─────────────────────┐            │
│  │    自回归模型       │    │    扩散模型          │            │
│  ├─────────────────────┤    ├─────────────────────┤            │
│  │ • 高效采样         │    │ • 高质量生成        │            │
│  │ • 长程依赖         │    │ • 多模态条件        │            │
│  │ • 动作控制         │    │ • 物理一致性        │            │
│  │ • 离散token         │    │ • 连续表示          │            │
│  └─────────────────────┘    └─────────────────────┘            │
│              │                          │                        │
│              └──────────┬───────────────┘                        │
│                         ▼                                        │
│              ┌─────────────────────┐                            │
│              │    混合架构          │                            │
│              ├─────────────────────┤                            │
│              │ • 扩散+自回归       │                            │
│              │ • 高效+高质量       │                            │
│              └─────────────────────┘                            │
│                                                                   │
└─────────────────────────────────────────────────────────────────┘

1. Vid2World：视频扩散到交互世界模型

1.1 核心思想

Vid2World是由NVIDIA等机构提出的将预训练视频扩散模型转换为交互世界模型的方法。其核心洞察是：预训练视频扩散模型已经学习了丰富的物理规律，只需要在训练和架构上进行适当改造，就能将其用于交互式世界模型。

1.2 架构设计

class Vid2WorldArchitecture:
    """
    Vid2World 架构
    将视频扩散模型转换为交互世界模型
    """
    def __init__(self, pretrained_video_diffusion):
        # 预训练视频扩散模型
        self.diffusion_model = pretrained_video_diffusion
        
        # 动作编码器
        self.action_encoder = ActionEncoder()
        
        # 状态编码器
        self.state_encoder = StateEncoder()
        
        # 自回归生成控制器
        self.autoregressive_controller = AutoregressiveController()
    
    def convert_to_world_model(self, training_data):
        """
        将视频扩散模型转换为世界模型
        关键步骤：因果化 + 动作条件
        """
        # 1. 因果化架构
        self._causalize_architecture()
        
        # 2. 添加动作条件
        self._add_action_conditioning()
        
        # 3. 自回归训练
        self._train_autoregressive(training_data)
    
    def _causalize_architecture(self):
        """
        因果化架构
        确保模型只能访问过去的信息
        """
        # 将双向注意力改为因果注意力
        for layer in self.diffusion_model.transformer_layers:
            layer.self_attention = CausalSelfAttention(
                original_attention=layer.self_attention
            )
        
        # 确保时间维度是因果的
        self.diffusion_model.temporal_modules = [
            CausalTemporalModule(m)
            for m in self.diffusion_model.temporal_modules
        ]
    
    def _add_action_conditioning(self):
        """
        添加动作条件
        使模型能够根据动作生成未来
        """
        # 添加动作嵌入层
        self.action_embedding = nn.Embedding(
            num_actions,
            action_dim
        )
        
        # 修改条件注入方式
        for layer in self.diffusion_model.transformer_layers:
            layer.add_action_projection(
                ActionProjection(
                    action_dim=action_dim,
                    hidden_dim=layer.hidden_dim
                )
            )

1.3 因果化训练

class Vid2WorldTraining:
    """
    Vid2World 训练
    将视频扩散模型适配为世界模型
    """
    
    def __init__(self, model):
        self.model = model
        self.diffusion_scheduler = DDPMScheduler()
    
    def causal_denoising_loss(self, video, actions, timesteps):
        """
        因果去噪损失
        关键：只预测给定动作后的未来
        """
        batch_size = video.shape[0]
        
        # 添加噪声
        noise = torch.randn_like(video)
        noisy_video = self.add_noise(video, timesteps, noise)
        
        # 编码动作条件
        action_features = self.model.action_encoder(actions)
        
        # 编码当前状态
        state_features = self.model.state_encoder(video[:, :-1])
        
        # 去噪预测
        noise_pred = self.model(
            noisy_video,
            timesteps,
            action_features=action_features,
            state_features=state_features
        )
        
        # 计算损失
        loss = F.mse_loss(noise_pred, noise)
        
        return loss
    
    def autoregressive_training_step(self, batch):
        """
        自回归训练步骤
        """
        video, actions = batch
        
        total_loss = 0.0
        
        # 自回归生成所有帧
        current_video = video[:, :1]  # 初始帧
        
        for t in range(1, video.shape[1]):
            # 预测第t帧
            target_frame = video[:, t]
            action = actions[:, t-1]
            
            # 生成预测
            pred_frame = self.model.generate(
                current_video,
                action
            )
            
            # 计算损失
            loss = F.mse_loss(pred_frame, target_frame)
            total_loss += loss
            
            # 更新当前状态（可选：使用真实帧或预测帧）
            # 使用真实帧可以获得更稳定的训练
            current_video = torch.cat([current_video, target_frame.unsqueeze(1)], dim=1)
        
        return total_loss / (video.shape[1] - 1)

1.4 动作引导生成

class Vid2WorldGeneration:
    """
    Vid2World 交互式生成
    """
    
    def generate_with_actions(self, init_video, action_sequence):
        """
        基于动作序列生成视频
        """
        generated_frames = [init_video[0]]
        current_state = init_video
        
        for action in action_sequence:
            # 编码当前状态
            state_features = self.model.state_encoder(current_state)
            
            # 编码动作
            action_features = self.model.action_encoder(action)
            
            # 生成下一帧
            next_frame = self.model.ddim_sample(
                current_state,
                action_features
            )
            
            generated_frames.append(next_frame)
            current_state = torch.cat([
                current_state[:, 1:],
                next_frame.unsqueeze(1)
            ], dim=1)
        
        return torch.stack(generated_frames, dim=1)
    
    def interactive_generation(self, init_video):
        """
        交互式生成
        实时响应用户动作
        """
        current_state = init_video
        generated_video = [init_video[0]]
        
        while True:
            # 获取用户动作
            action = get_user_action()
            
            if action is None:  # 用户停止
                break
            
            # 生成下一帧
            next_frame = self.model.generate(current_state, action)
            
            generated_video.append(next_frame)
            
            # 更新状态
            current_state = self.update_state(current_state, next_frame)
        
        return torch.stack(generated_video, dim=1)

2. iVideoGPT：可扩展交互视频生成

2.1 核心思想

iVideoGPT是由清华大学提出的可扩展交互视频生成框架，它将多模态信号（视觉、动作、奖励）整合为token序列，实现next-token预测的交互式世界模型。

┌─────────────────────────────────────────────────────────────────┐
│                    iVideoGPT 核心架构                             │
│                                                                   │
│  输入信号                                                       │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐                        │
│  │ 视觉观测  │ │   动作    │ │   奖励   │                        │
│  │ RGB图像   │ │ 交互动作  │ │ 奖励信号 │                        │
│  └────┬─────┘ └────┬─────┘ └────┬─────┘                        │
│       │             │             │                              │
│       ▼             ▼             ▼                              │
│  ┌─────────────────────────────────────┐                        │
│  │         Token化模块                  │                        │
│  │  • 视觉token (ViT)                  │                        │
│  │  • 动作token (Embedding)            │                        │
│  │  • 奖励token (Embedding)            │                        │
│  └─────────────────────────────────────┘                        │
│                      │                                          │
│                      ▼                                          │
│  ┌─────────────────────────────────────┐                        │
│  │    自回归Transformer                 │                        │
│  │    Next-Token 预测                  │                        │
│  └─────────────────────────────────────┘                        │
│                      │                                          │
│                      ▼                                          │
│  ┌─────────────────────────────────────┐                        │
│  │         输出信号                     │                        │
│  │  • 下一视觉token → 图像            │                        │
│  │  • 奖励预测                         │                        │
│  └─────────────────────────────────────┘                        │
│                                                                   │
└─────────────────────────────────────────────────────────────────┘

2.2 压缩Token化

class iVideoGPTTokenizer:
    """
    iVideoGPT 的压缩token化
    关键创新：高效离散化高维视觉观测
    """
    
    def __init__(self):
        # 视觉编码器
        self.vision_encoder = VisionTransformer(
            image_size=224,
            patch_size=16,
            embed_dim=768
        )
        
        # 可学习的量化器
        self.quantizer = VectorQuantizer(
            codebook_size=8192,
            embedding_dim=256,
            commitment_loss_weight=0.25
        )
        
        # 视觉解码器
        self.vision_decoder = VisionDecoder()
    
    def compress_visual_tokens(self, images):
        """
        压缩视觉token
        将高维图像压缩为离散token
        """
        # 编码
        features = self.vision_encoder(images)  # [B, H*W, D]
        
        # 量化
        quantized, indices = self.quantizer(features)
        
        # 返回离散token和量化特征
        return {
            'indices': indices,  # 用于自回归生成
            'quantized': quantized,  # 用于解码
            'codebook_usage': self.compute_codebook_usage(indices)
        }
    
    def decode_tokens(self, tokens):
        """
        从token解码为图像
        """
        # 嵌入token
        embedded = self.quantizer.embedding(tokens)
        
        # 解码
        images = self.vision_decoder(embedded)
        
        return images

2.3 多模态序列建模

class iVideoGPTSequenceModeling:
    """
    iVideoGPT 多模态序列建模
    """
    
    def __init__(self):
        # 视觉tokenizer
        self.visual_tokenizer = iVideoGPTTokenizer()
        
        # 动作编码器
        self.action_encoder = nn.Embedding(
            num_actions, action_dim
        )
        
        # 奖励编码器
        self.reward_encoder = nn.Embedding(
            num_rewards, reward_dim
        )
        
        # 模态嵌入（区分不同模态）
        self.modality_embeddings = nn.ModuleDict({
            'vision': nn.Embedding(1, embed_dim),
            'action': nn.Embedding(1, embed_dim),
            'reward': nn.Embedding(1, embed_dim)
        })
        
        # 自回归Transformer
        self.transformer = AutoregressiveTransformer(
            vocab_size=codebook_size,
            embed_dim=embed_dim,
            num_layers=num_layers,
            num_heads=num_heads
        )
    
    def build_multimodal_sequence(self, batch):
        """
        构建多模态序列
        格式: [V, A, R, V, A, R, ...]
        """
        sequences = []
        
        for episode in batch.episodes:
            seq = []
            
            for t in range(len(episode)):
                # 视觉token
                visual_tokens = self.visual_tokenizer.compress(
                    episode.observations[t]
                )
                
                # 动作token
                action_emb = self.action_encoder(
                    episode.actions[t]
                )
                
                # 奖励token
                reward_emb = self.reward_encoder(
                    episode.rewards[t]
                )
                
                # 添加模态嵌入
                visual_tokens = visual_tokens + self.modality_embeddings['vision'](0)
                action_emb = action_emb + self.modality_embeddings['action'](0)
                reward_emb = reward_emb + self.modality_embeddings['reward'](0)
                
                # 交织构建序列
                seq.extend([visual_tokens, action_emb, reward_emb])
            
            sequences.append(torch.stack(seq))
        
        return sequences
    
    def next_token_prediction(self, sequence):
        """
        Next-token预测
        """
        # 预测下一个视觉token
        logits = self.transformer(sequence)
        
        # 视觉token的预测
        vision_logits = logits[:, -3]  # 倒数第三个是视觉token
        
        return vision_logits

2.4 下游任务应用

class iVideoGPTDownstream:
    """
    iVideoGPT 下游任务
    """
    
    def action_conditioned_prediction(self, model, observation, action):
        """
        动作条件视频预测
        世界模型的核心功能
        """
        # Tokenize观测
        obs_tokens = model.visual_tokenizer.compress(observation)
        
        # 构建序列
        sequence = [
            obs_tokens,
            model.action_encoder(action),
            model.reward_encoder(0)  # 预测奖励
        ]
        
        # 预测
        predictions = model.transformer(sequence)
        
        # 解码预测的视觉token
        next_visual_tokens = predictions['vision']
        predicted_image = model.visual_tokenizer.decode(next_visual_tokens)
        
        return {
            'next_observation': predicted_image,
            'predicted_reward': predictions['reward']
        }
    
    def visual_planning(self, model, init_observation, goal_description, horizon=10):
        """
        视觉规划
        在潜在空间中进行规划
        """
        # 使用模型进行想象rollout
        # 采样多个动作序列
        best_plan = None
        best_score = float('-inf')
        
        for _ in range(num_candidates):
            actions = sample_random_actions(horizon)
            
            # Rollout
            trajectory = []
            current_obs = init_observation
            
            for action in actions:
                pred = self.action_conditioned_prediction(
                    model, current_obs, action
                )
                trajectory.append({
                    'observation': pred['next_observation'],
                    'reward': pred['predicted_reward']
                })
                current_obs = pred['next_observation']
            
            # 评估轨迹
            score = evaluate_trajectory(trajectory, goal_description)
            
            if score > best_score:
                best_score = score
                best_plan = actions
        
        return best_plan, best_score
    
    def model_based_rl(self, model, env, num_episodes=100):
        """
        基于模型的强化学习
        """
        policy = PPOPolicy()
        
        for episode in range(num_episodes):
            # 收集真实经验
            real_experience = []
            obs = env.reset()
            
            for step in range(max_steps):
                action = policy(obs)
                next_obs, reward, done, info = env.step(action)
                
                real_experience.append({
                    'obs': obs,
                    'action': action,
                    'reward': reward,
                    'next_obs': next_obs
                })
                
                # 使用世界模型进行想象更新
                if len(real_experience) > batch_size:
                    # 从世界模型采样想象数据
                    imagined_batch = sample_from_world_model(
                        model, real_experience
                    )
                    
                    # 更新策略
                    policy.update(real_experience, imagined_batch)
                
                obs = next_obs
                if done:
                    break
        
        return policy

3. Long-Context SSM视频世界模型

3.1 核心思想

Long-Context SSM世界模型通过状态空间模型（SSM）扩展视频世界模型的时间记忆，解决传统Transformer在长视频中计算复杂度高的问题。

┌─────────────────────────────────────────────────────────────────┐
│                    SSM vs Transformer 对比                         │
│                                                                   │
│  Transformer                                                     │
│  ┌─────────────────────────────────────────────────────────┐  │
│  │ Attention: O(T²)  ───▶  长序列计算爆炸                 │  │
│  │ KV Cache: O(T)   ───▶  内存占用线性增长               │  │
│  └─────────────────────────────────────────────────────────┘  │
│                                                                   │
│  SSM (Mamba)                                                    │
│  ┌─────────────────────────────────────────────────────────┐  │
│  │ 选择性机制: O(T)  ───▶  常数时间步更新                 │  │
│  │ 并行扫描: O(T log T)  ───▶  高效训练                   │  │
│  │ 线性复杂度  ───▶  支持超长序列                        │  │
│  └─────────────────────────────────────────────────────────┘  │
│                                                                   │
└─────────────────────────────────────────────────────────────────┘

3.2 架构设计

class LongContextSSMWorldModel:
    """
    长上下文SSM世界模型
    结合SSM和注意力机制的混合架构
    """
    
    def __init__(self):
        # SSM层（处理时间依赖）
        self.ssm_layers = nn.ModuleList([
            MambaBlock(
                d_model=embed_dim,
                d_state=state_dim,
                expand=expand_factor
            )
            for _ in range(num_ssm_layers)
        ])
        
        # 局部注意力（保持空间一致性）
        self.local_attention = LocalAttention(
            window_size=window_size,
            num_heads=num_heads
        )
        
        # 视频编码器
        self.video_encoder = VideoEncoder()
        
        # 视频解码器
        self.video_decoder = VideoDecoder()
    
    def forward(self, video_frames, actions=None):
        """
        前向传播
        关键：SSM处理时间，注意力保持空间一致性
        """
        batch_size, num_frames, channels, height, width = video_frames.shape
        
        # 编码视频帧
        features = self.video_encoder(video_frames)  # [B, T, H, W, D]
        
        # 逐层处理
        for layer in self.ssm_layers:
            # SSM时间建模
            features = self.ssm_block(features)
            
            # 局部注意力（可选）
            features = self.local_attention(features)
        
        # 解码
        reconstructed = self.video_decoder(features)
        
        return reconstructed
    
    def ssm_block(self, features):
        """
        SSM块
        选择性状态空间建模
        """
        # 展平空间维度
        B, T, H, W, D = features.shape
        x = features.view(B, T, H*W, D)
        
        # SSM处理
        for ssm_layer in self.ssm_layers:
            x = ssm_layer(x)
        
        # 恢复空间维度
        x = x.view(B, T, H, W, D)
        
        return x

3.3 长程记忆机制

class LongTermMemoryMechanism:
    """
    长程记忆机制
    在SSM中实现长时记忆
    """
    
    def __init__(self, memory_size=1000):
        self.memory_size = memory_size
        
        # 记忆存储
        self.memory_keys = None
        self.memory_values = None
        
        # 记忆更新
        self.memory_update_fn = MemoryUpdateFunction()
    
    def update_memory(self, new_features):
        """
        更新记忆
        """
        if self.memory_keys is None:
            # 初始化记忆
            self.memory_keys = new_features
            self.memory_values = new_features
        else:
            # 选择性更新
            important_features = self.select_important(new_features)
            
            # 追加到记忆
            self.memory_keys = torch.cat([
                self.memory_keys[:, -self.memory_size:],
                important_features
            ], dim=1)
            
            self.memory_values = self.memory_keys
    
    def retrieve_memory(self, query):
        """
        检索记忆
        """
        # 计算相似度
        similarities = torch.matmul(
            query, 
            self.memory_keys.transpose(-2, -1)
        )
        
        # 加权检索
        attention_weights = F.softmax(similarities, dim=-1)
        retrieved = torch.matmul(
            attention_weights,
            self.memory_values
        )
        
        return retrieved, attention_weights
    
    def select_important(self, features):
        """
        选择重要特征
        减少记忆更新时的冗余
        """
        # 基于变化检测
        if self.memory_values is not None:
            changes = torch.norm(
                features - self.memory_values[:, -1:],
                dim=-1
            )
            
            # 选择变化最大的
            _, top_k_indices = torch.topk(
                changes.mean(dim=0),
                k=min(features.shape[1], 10)
            )
            
            return features[:, top_k_indices]
        
        return features

3.4 与扩散模型结合

class DiffusionSSMWorldModel:
    """
    扩散+SSM混合世界模型
    结合扩散的生成质量和SSM的效率
    """
    
    def __init__(self):
        # SSM主干
        self.ssm_backbone = SSMBackbone()
        
        # 扩散调度器
        self.noise_scheduler = DDPMScheduler()
        
        # 条件编码器
        self.condition_encoder = ConditionEncoder()
    
    def generate(self, init_video, action, num_steps=50):
        """
        生成视频
        """
        # 编码条件
        condition = self.condition_encoder(init_video, action)
        
        # 初始化噪声
        noise = torch.randn_like(init_video)
        noisy = noise
        
        # 迭代去噪
        for t in tqdm(self.noise_scheduler.timesteps):
            # SSM处理
            noise_pred = self.ssm_backbone(noisy, condition, t)
            
            # 去噪步骤
            noisy = self.noise_scheduler.step(noise_pred, t, noisy)
        
        return noisy
    
    def train_step(self, batch):
        """
        训练步骤
        """
        video, actions = batch
        
        # 采样时间步
        t = torch.randint(0, self.noise_scheduler.num_steps, (batch_size,))
        
        # 添加噪声
        noise = torch.randn_like(video)
        noisy_video = self.noise_scheduler.add_noise(video, t, noise)
        
        # 编码条件
        condition = self.condition_encoder(video[:, :-1], actions)
        
        # 预测噪声
        noise_pred = self.ssm_backbone(noisy_video, condition, t)
        
        # 损失
        loss = F.mse_loss(noise_pred, noise)
        
        return loss

4. 复合误差缓解

4.1 复合误差问题

视频生成世界模型面临的关键挑战是复合误差（Compounding Error）：随着时间推移，小的预测误差会累积，导致长期预测质量下降。

┌─────────────────────────────────────────────────────────────────┐
│                    复合误差示意图                                  │
│                                                                   │
│  时间步:     0      1      2      3      4      5             │
│             │      │      │      │      │      │              │
│             ▼      ▼      ▼      ▼      ▼      ▼              │
│  真实:    ┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐               │
│           │ 真 │ │ 真 │ │ 真 │ │ 真 │ │ 真 │ │ 真 │               │
│           └───┘ └───┘ └───┘ └───┘ └───┘ └───┘               │
│             │      │      │      │      │      │              │
│             ▼      ▼      ▼      ▼      ▼      ▼              │
│  预测:    ┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐               │
│           │ ✓ │ │ △ │ │ ◇ │ │ ☆ │ │ ★ │ │ ✗ │               │
│           └───┘ └───┘ └───┘ └───┘ └───┘ └───┘               │
│                   ↑      ↑      ↑      ↑      ↑               │
│                   └──────┴──────┴──────┴──────┘               │
│                      误差逐渐累积                              │
│                                                                   │
│  图例: ✓正确  △小误差  ◇中误差  ☆大误差  ★严重误差  ✗失败      │
│                                                                   │
└─────────────────────────────────────────────────────────────────┘

4.2 缓解方法

4.2.1 计划性重采样

class PlannedResampling:
    """
    计划性重采样
    定期使用真实观测校正预测
    """
    
    def __init__(self, model, resample_interval=5):
        self.model = model
        self.resample_interval = resample_interval
    
    def generate_with_resampling(self, init_video, actions, observations=None):
        """
        带重采样的生成
        """
        generated = [init_video[0]]
        current_state = init_video
        
        for t in range(len(actions)):
            # 生成预测
            next_pred = self.model(current_state, actions[t])
            
            # 如果有真实观测且达到重采样间隔
            if observations is not None and t % self.resample_interval == 0:
                # 使用真实观测校正
                next_frame = observations[t]
            else:
                # 使用预测
                next_frame = next_pred
            
            generated.append(next_frame)
            
            # 更新状态
            current_state = self.update_state(current_state, next_frame)
        
        return torch.stack(generated)

4.2.2 KV缓存压缩

class KVCachingForWorldModels:
    """
    KV缓存压缩
    在长视频生成中保持高效
    """
    
    def __init__(self):
        self.kv_cache = None
        self.compression_ratio = 4
    
    def compress_cache(self, keys, values):
        """
        压缩KV缓存
        减少内存占用
        """
        # 分块压缩
        chunk_size = len(keys) // self.compression_ratio
        
        compressed_keys = []
        compressed_values = []
        
        for i in range(0, len(keys), chunk_size):
            chunk_keys = keys[i:i+chunk_size]
            chunk_values = values[i:i+chunk_size]
            
            # 压缩：平均池化
            compressed_keys.append(chunk_keys.mean(dim=0, keepdim=True))
            compressed_values.append(chunk_values.mean(dim=0, keepdim=True))
        
        return torch.cat(compressed_keys), torch.cat(compressed_values)
    
    def retrieve_compressed(self, query, compressed_keys, compressed_values):
        """
        从压缩缓存中检索
        """
        # 计算注意力
        attention = torch.matmul(query, compressed_keys.transpose(-2, -1))
        attention = F.softmax(attention, dim=-1)
        
        # 检索
        retrieved = torch.matmul(attention, compressed_values)
        
        return retrieved

5. 技术对比总结

5.1 模型对比

模型	生成方式	长时记忆	动作控制	计算效率	开源
Vid2World	扩散+自回归	有限	✅	中等	✅
iVideoGPT	自回归	✅	✅	高	✅
Long-Context SSM	SSM+注意力	✅	部分	高	部分
Genie 3	自回归+扩散	✅	✅	中等	部分

5.2 优缺点分析

方法	优点	缺点
纯扩散	高质量生成	推理慢，动作控制难
纯自回归	动作控制好，推理快	长序列质量下降
SSM	长上下文高效	表达力有限
混合	兼顾两者	复杂度高

6. 未来发展方向

6.1 技术趋势

更长的上下文：支持分钟级视频生成
更好的物理一致性：物理先验与扩散结合
多模态融合：视觉+语言+动作+触觉
实时生成：边缘设备部署

6.2 应用前景

视频扩散世界模型的未来应用：

1. 机器人仿真
   • 无限数据生成
   • 安全训练环境
   • 多样化场景

2. 自动驾驶
   • 边缘案例生成
   • 场景重建
   • 决策验证

3. 游戏开发
   • 动态世界生成
   • NPC行为生成
   • 实时游戏体验

4. 教育培训
   • 沉浸式模拟
   • 技能训练
   • 场景重现

Metaphor

探索

视频扩散世界模型

视频扩散世界模型

概述

1. Vid2World：视频扩散到交互世界模型

1.1 核心思想

1.2 架构设计

1.3 因果化训练

1.4 动作引导生成

2. iVideoGPT：可扩展交互视频生成

2.1 核心思想

2.2 压缩Token化

2.3 多模态序列建模

2.4 下游任务应用

3. Long-Context SSM视频世界模型

3.1 核心思想

3.2 架构设计

3.3 长程记忆机制

3.4 与扩散模型结合

4. 复合误差缓解

4.1 复合误差问题

4.2 缓解方法

4.2.1 计划性重采样

4.2.2 KV缓存压缩

5. 技术对比总结

5.1 模型对比

5.2 优缺点分析

6. 未来发展方向

6.1 技术趋势

6.2 应用前景

参考文献

相关主题

关系图谱

目录

反向链接