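Hyperbolic Foundation Models

Overview

In 2025, hyperbolic deep learning entered the foundation-model era. Researchers have begun extending the advantages of hyperbolic space to large language models (LLMs), multimodal models, and scientific foundation models. This article surveys these recent developments.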

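The HyperCore Framework

Background and Motivation

HyperCore (He et al., 2025) is a core framework for hyperbolic foundation models proposed at Yale University. It aims to provide a unified hyperbolic deep learning infrastructure across modalities.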

Core Design

HyperCore provides three core modules:

1. Hyperbolic data processing

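import torch

# Simplified sketch of a data-processing helper for the Poincaré ball model.
# Convention: the curvature is -c with c > 0, so the ball has radius 1/sqrt(c).

class HyperbolicDataModule:
    """HyperCore-style data-processing module"""

    def __init__(self, curvature=1.0, manifold_type="poincare"):
        self.curvature = curvature
        self.manifold_type = manifold_type

    def embed_to_hyperbolic(self, x):
        """Embed Euclidean features into hyperbolic space"""
        # Exponential map at the origin: tanh(sqrt(c)|x|) * x / (sqrt(c)|x|)
        sqrt_c = self.curvature ** 0.5
        x_norm = torch.norm(x, dim=-1, keepdim=True).clamp(min=1e-10)
        return torch.tanh(sqrt_c * x_norm) * x / (sqrt_c * x_norm)

    def project_to_ball(self, x):
        """Project points into the interior of the ball"""
        # Stay strictly inside the ball of radius 1/sqrt(c)
        max_norm = 0.99 / self.curvature ** 0.5
        norm = torch.norm(x, dim=-1, keepdim=True).clamp(min=1e-10)
        return x * torch.clamp(norm, max=max_norm) / norm

    def sample_uniform(self, n, dim):
        """Sample n initial points near the origin of the Poincaré ball.

        Despite the name, the points come from a small Gaussian and are not
        uniform with respect to the hyperbolic volume measure.
        """
        x = torch.randn(n, dim) * 0.1  # initialize in a small region
        return self.project_to_ball(x)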

2. Hyperbolic transformation layers

import torch
import torch.nn as nn
import torch.nn.functional as F

from hypercore.layers import HyperbolicLinear, HyperbolicAttention

# Note: RiemannianLayerNorm is not defined in this excerpt; it is assumed to be
# available with the signature RiemannianLayerNorm(d_model, c).


class HyperbolicTransformerBlock(nn.Module):
    """Hyperbolic Transformer block"""

    def __init__(self, d_model, n_heads, c=1.0, dropout=0.1):
        super().__init__()
        self.c = c

        # Hyperbolic self-attention
        self.attention = HyperbolicMultiHeadAttention(d_model, n_heads, c)

        # Hyperbolic feed-forward network
        self.ffn = nn.Sequential(
            HyperbolicLinear(d_model, d_model * 4, c),
            nn.GELU(),
            HyperbolicLinear(d_model * 4, d_model, c)
        )

        # Layer normalization (Riemannian version)
        self.norm1 = RiemannianLayerNorm(d_model, c)
        self.norm2 = RiemannianLayerNorm(d_model, c)

        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Self-attention + residual. The plain "+" is a Euclidean residual;
        # a fully hyperbolic block would use Möbius addition instead.
        x = x + self.dropout(self.attention(self.norm1(x), mask))
        # FFN + residual
        x = x + self.dropout(self.ffn(self.norm2(x)))
        return x


class HyperbolicMultiHeadAttention(nn.Module):
    """Hyperbolic multi-head attention"""

    def __init__(self, d_model, n_heads, c=1.0):
        super().__init__()
        assert d_model % n_heads == 0

        self.d_k = d_model // n_heads
        self.n_heads = n_heads
        self.c = c

        # Q, K, V and output projections
        self.W_q = HyperbolicLinear(d_model, d_model, c)
        self.W_k = HyperbolicLinear(d_model, d_model, c)
        self.W_v = HyperbolicLinear(d_model, d_model, c)
        self.W_o = HyperbolicLinear(d_model, d_model, c)

    def forward(self, x, mask=None):
        batch_size = x.size(0)

        # Q, K, V projections
        Q = self.W_q(x)
        K = self.W_k(x)
        V = self.W_v(x)

        # Split into heads: [batch, n_heads, seq_len, d_k]
        Q = Q.view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        K = K.view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        V = V.view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)

        # Attention scores: scaled dot product, treating Q and K as
        # tangent-space coordinates
        scores = torch.matmul(Q, K.transpose(-2, -1)) / (self.d_k ** 0.5)

        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)

        attn = F.softmax(scores, dim=-1)

        # Weighted aggregation, then merge heads back to [batch, seq_len, d_model]
        x = torch.matmul(attn, V)
        x = x.transpose(1, 2).contiguous().view(batch_size, -1, self.n_heads * self.d_k)

        return self.W_o(x)
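The residual connections in the block above add tensors with an ordinary Euclidean sum, which can push points outside the ball. One common alternative on the Poincaré ball is Möbius addition. The following is a minimal plain-Python sketch under the curvature -c convention; the function name `mobius_add` is illustrative and is not taken from HyperCore.

```python
def mobius_add(x, y, c=1.0):
    """Möbius addition of x and y on the Poincaré ball with curvature -c."""
    xy = sum(a * b for a, b in zip(x, y))  # inner product <x, y>
    x2 = sum(a * a for a in x)             # |x|^2
    y2 = sum(b * b for b in y)             # |y|^2
    coef_x = 1 + 2 * c * xy + c * y2
    coef_y = 1 - c * x2
    denom = 1 + 2 * c * xy + c * c * x2 * y2
    return [(coef_x * a + coef_y * b) / denom for a, b in zip(x, y)]

x = [0.6, 0.0]
y = [0.6, 0.0]
euclidean_sum = [a + b for a, b in zip(x, y)]  # first coordinate 1.2: outside the unit ball
mobius_sum = mobius_add(x, y)                  # first coordinate 1.2 / 1.36: inside the unit ball
```

For collinear points with c = 1 the result reduces to (a + b) / (1 + ab), the same rule as relativistic velocity addition, so the sum never reaches the boundary.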

3. Optimizer support

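from hypercore.optim import RiemannianAdam, RiemannianSGD

# Use a Riemannian optimizer; `model` is any nn.Module built from the layers above.
# Keyword arguments are shown as given in this example; consult the HyperCore
# documentation for the exact signature.
optimizer = RiemannianAdam(
    model.parameters(),
    lr=1e-4,
    curvature=1.0,  # initial curvature (described as learnable)
    warmup_steps=1000
)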
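To illustrate what a Riemannian optimizer does differently from its Euclidean counterpart, here is a minimal plain-Python sketch of one Riemannian SGD step on the Poincaré ball. It rescales the Euclidean gradient by the inverse metric and then clips the result back into the ball (a simple retraction). This is a generic sketch, not the implementation behind `RiemannianSGD` or `RiemannianAdam`; the function name and the `eps` margin are assumptions.

```python
import math

def riemannian_sgd_step(x, euclid_grad, lr=0.1, c=1.0, eps=1e-5):
    """One Riemannian SGD step on the Poincaré ball with curvature -c."""
    x2 = sum(a * a for a in x)
    # The conformal factor is 2 / (1 - c|x|^2), so the Riemannian gradient is
    # the Euclidean gradient multiplied by ((1 - c|x|^2) / 2)^2.
    scale = ((1 - c * x2) / 2) ** 2
    new_x = [a - lr * scale * g for a, g in zip(x, euclid_grad)]
    # Retraction: clip back inside the ball of radius 1/sqrt(c) if needed.
    max_norm = (1 - eps) / math.sqrt(c)
    norm = math.sqrt(sum(a * a for a in new_x))
    if norm > max_norm:
        new_x = [a * max_norm / norm for a in new_x]
    return new_x

near_origin = riemannian_sgd_step([0.0, 0.0], [1.0, 0.0])    # moves by 0.1 * 0.25
near_boundary = riemannian_sgd_step([0.9, 0.0], [1.0, 0.0])  # moves by 0.1 * 0.009025
```

The same Euclidean gradient produces a much smaller Euclidean step near the boundary, which is what keeps parameters from overshooting out of the ball.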

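Experimental Results

Task	Dataset	Euclidean baseline	HyperCore	Relative improvement
Node classification	WordNet	78.3%	82.1%	+4.8%
Link prediction	FB15k-237	MRR: 0.34	MRR: 0.41	+21%
Hierarchical classification	Eukaryote	65.2%	71.8%	+10%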

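HELM: Hyperbolic Large Language Models

Background

HELM (HypErbolic Large Language Models, He et al., 2025) is presented as the first family of large language models that operates entirely in hyperbolic space, which lets it better model the semantic hierarchy of language.

Core Innovations

1. Hyperbolic token embeddings

Conventional approaches embed tokens in Euclidean space and then move them to hyperbolic space through the exponential map.

HELM instead proposes native hyperbolic embeddings: token embeddings are learned directly in hyperbolic space.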

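import torch
import torch.nn as nn


class HyperbolicEmbedding(nn.Module):
    """Native hyperbolic token embedding"""

    def __init__(self, vocab_size, embedding_dim, c=1.0):
        super().__init__()
        self.c = c

        # Initialize the embeddings directly in the Poincaré ball,
        # rescaled so every initial point lies well inside the ball
        embeddings = self._sample_valid_embeddings(vocab_size, embedding_dim)
        self.embeddings = nn.Parameter(embeddings)

    def _sample_valid_embeddings(self, n, d):
        """Sample initial embeddings inside a safe region of the ball"""
        x = torch.randn(n, d) * 0.1  # small-variance initialization
        # Rescale any row whose norm exceeds half the ball radius 1/sqrt(c).
        # Rejecting such rows instead would almost never succeed for large d,
        # because the norm of x grows like 0.1 * sqrt(d).
        max_norm = 0.5 / self.c ** 0.5
        norm = x.norm(dim=-1, keepdim=True).clamp(min=1e-10)
        return x * torch.clamp(max_norm / norm, max=1.0)

    def forward(self, token_ids):
        embeddings = self.embeddings[token_ids]
        # Project into the ball (guards against numerical blow-up during training)
        return self._project(embeddings)

    def _project(self, x):
        # Keep points strictly inside the ball of radius 1/sqrt(c)
        max_norm = 0.99 / self.c ** 0.5
        norm = torch.norm(x, dim=-1, keepdim=True).clamp(min=1e-10)
        return x * torch.clamp(norm, max=max_norm) / norm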
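Hyperbolic embeddings are often motivated by how distance behaves near the boundary of the ball: points with large norm are far from everything else, which leaves room for the many leaves of a hierarchy. The sketch below computes the Poincaré distance in plain Python under the curvature -c convention. It is a generic illustration; the text does not say which distance function HELM itself uses.

```python
import math

def poincare_distance(x, y, c=1.0):
    """Geodesic distance on the Poincaré ball with curvature -c."""
    diff2 = sum((a - b) ** 2 for a, b in zip(x, y))
    x2 = sum(a * a for a in x)
    y2 = sum(b * b for b in y)
    arg = 1 + 2 * c * diff2 / ((1 - c * x2) * (1 - c * y2))
    return math.acosh(arg) / math.sqrt(c)

origin = [0.0, 0.0]
d_mid = poincare_distance(origin, [0.5, 0.0])    # equals ln 3
d_edge = poincare_distance(origin, [0.99, 0.0])  # equals ln 199
```

Moving from Euclidean norm 0.5 to 0.99 less than doubles the Euclidean distance from the origin but multiplies the hyperbolic distance by almost five.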

2. Mixture of Curvature Experts

HELM's core innovation is the mixture-of-curvature-experts mechanism, in which different "experts" operate at different curvatures:

import torch
import torch.nn as nn

from hypercore.layers import HyperbolicAttention, HyperbolicLinear


class MixtureOfCurvatureExperts(nn.Module):
    """Mixture of curvature experts"""

    def __init__(self, d_model, n_curvatures=4, n_experts=8, c_range=(0.5, 2.0)):
        super().__init__()

        self.n_curvatures = n_curvatures
        self.n_experts = n_experts

        # Curvature values, evenly spaced over c_range
        # (registered as a buffer so they move with the module across devices)
        self.register_buffer(
            "curvatures", torch.linspace(c_range[0], c_range[1], n_curvatures)
        )

        # Expert networks: one per curvature, each operating in its own curvature space
        self.experts = nn.ModuleList([
            nn.ModuleDict({
                'attention': HyperbolicAttention(d_model, d_model, c=float(c)),
                'ffn': nn.Sequential(
                    HyperbolicLinear(d_model, d_model * 4, c=float(c)),
                    nn.GELU(),
                    HyperbolicLinear(d_model * 4, d_model, c=float(c))
                )
            })
            for c in self.curvatures
        ])

        # Curvature router (decides how much weight each curvature receives)
        self.curvature_router = nn.Sequential(
            nn.Linear(d_model, n_curvatures),
            nn.Softmax(dim=-1)
        )

        # Expert router (decides which expert to use); defined here but not
        # used by the simplified forward pass below
        self.expert_router = nn.Sequential(
            nn.Linear(d_model, n_experts * n_curvatures),
            nn.Softmax(dim=-1)
        )

    def forward(self, x):
        # Route to curvature spaces using the mean over the sequence dimension
        curve_weights = self.curvature_router(x.mean(dim=1))  # [batch, n_curvatures]

        # Dense weighted aggregation over all curvature experts
        outputs = []
        for i, expert_dict in enumerate(self.experts):
            # Output of this curvature's expert
            out_attn = expert_dict['attention'](x)
            expert_out = expert_dict['ffn'](out_attn)

            # Weight assigned to this curvature, broadcast over sequence and features
            w = curve_weights[:, i].unsqueeze(-1).unsqueeze(-1)
            outputs.append(w * expert_out)

        return sum(outputs)
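The forward pass above evaluates every curvature expert and mixes all outputs. Mixture-of-experts models more commonly route sparsely, sending each input to only the top-k experts. The sketch below shows that gating step in plain Python. Whether HELM routes this way is not stated in the text above, so treat the function name `top_k_gate` and the choice k = 2 as assumptions.

```python
import math

def top_k_gate(logits, k=2):
    """Softmax over the k largest logits; every other expert gets weight 0."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - m) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]

# Router logits for four curvature experts; only the two largest survive.
weights = top_k_gate([2.0, 1.0, 0.1, -1.0], k=2)
```

Only experts with non-zero weight need to be evaluated, which is where the compute savings of sparse routing come from.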

Performance Comparison

Model	WikiTree-10	Hierarchy-100	SemEval	Average
LLaMA-2 7B	52.3%	68.1%	71.2%	63.9%
HELM-7B	61.8%	74.3%	73.5%	69.9%
Relative improvement	+18%	+9%	+3%	+9.4%

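Hierarchy Understanding Tests

Input: hierarchy completion
"apple → fruit → ___ → organism"

LLaMA-2: "plant" (skips the correct level)
HELM: "angiosperm" (correctly identifies the deeper level)

Input: hierarchy ordering
"Order the following words from the highest level to the lowest: cat, British Shorthair, animal, mammal"

LLaMA-2: animal > mammal > cat > British Shorthair ✓
HELM: animal > mammal > cat > British Shorthair ✓
Accuracy: LLaMA-2 82% vs. HELM 95%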

Cartan Networks

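Theoretical Foundations

Cartan Networks (Milanesio et al., 2025) build on Lie group theory and the Cartan connection to provide a more elegant framework for hyperbolic neural networks.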

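Core Idea

The Cartan connection $\omega$ defines a notion of "parallel transport" on a Lie group $G$:

$$\omega : TG \to \mathfrak{g}$$

where $\mathfrak{g}$ is the corresponding Lie algebra.

A Cartan network layer:

$$h_{\text{new}} = \exp\big(\omega(W \cdot \log(h))\big)$$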

Implementation Notes

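import torch
import torch.nn as nn


class CartanLayer(nn.Module):
    """Layer based on the Cartan connection"""

    def __init__(self, group, representation_dim):
        super().__init__()
        self.group = group  # e.g. SO(p,q) or SL(n,R); must provide log() and exp()
        self.rep_dim = representation_dim

        # Weight matrix
        self.W = nn.Parameter(torch.randn(representation_dim, representation_dim))

    def forward(self, h):
        # 1. Logarithmic map to the Lie algebra
        log_h = self.group.log(h)

        # 2. Linear transformation
        transformed = log_h @ self.W

        # 3. Apply the Cartan connection (exponential map back to the group)
        new_h = self.group.exp(transformed)

        return new_h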
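As a concrete toy instance of the log, transform, exp pattern, the sketch below uses the rotation group SO(2), whose Lie algebra is one-dimensional (a single angle). It is written in plain Python and makes two simplifying assumptions: the weight is a scalar `w`, and the connection is taken to be the identity. None of the function names come from the Cartan Networks paper.

```python
import math

def so2_exp(theta):
    """Exponential map so(2) -> SO(2): an angle becomes a 2x2 rotation matrix."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def so2_log(R):
    """Logarithmic map SO(2) -> so(2): a rotation matrix becomes an angle in (-pi, pi]."""
    return math.atan2(R[1][0], R[0][0])

def cartan_step(R, w):
    """h_new = exp(w * log(h)), with the connection taken as the identity."""
    return so2_exp(w * so2_log(R))

h = so2_exp(0.3)               # rotation by 0.3 radians
h_new = cartan_step(h, w=2.0)  # rotation by 0.6 radians
```

Because the transformation is applied in the Lie algebra and mapped back with the exponential, the output is a valid rotation matrix for any value of the weight.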

HypLoRA: Hyperbolic Parameter-Efficient Fine-Tuning

Background

HypLoRA (NeurIPS 2025 Spotlight) is a parameter-efficient fine-tuning method for hyperbolic LLMs.

Core Design

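import torch
import torch.nn as nn


class HypLoRALayer(nn.Module):
    """Hyperbolic LoRA layer"""

    def __init__(self, d_model, rank=4, c=1.0):
        super().__init__()
        self.c = c
        self.sqrt_c = c ** 0.5  # plain float; torch.sqrt would require a tensor
        self.rank = rank

        # Low-rank factors A and B; B starts at zero so the initial update is zero
        self.lora_A = nn.Parameter(torch.randn(d_model, rank) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(rank, d_model))

        # Scaling factor
        self.scaling = 1.0

    def forward(self, h, base_h):
        """
        h: current hidden state
        base_h: output of the base model (a point in the Poincaré ball)
        """
        # Compute the LoRA update in the tangent space at the origin
        base_log = self._log_map(base_h)

        # Hyperbolic low-rank update
        lora_update = h @ self.lora_A @ self.lora_B

        # Updated tangent-space vector
        updated_log = base_log + self.scaling * lora_update

        # Exponential map back to hyperbolic space
        return self._exp_map(updated_log)

    def _log_map(self, x):
        """Logarithmic map at the origin (ball to tangent space); inverse of _exp_map"""
        norm = torch.norm(x, dim=-1, keepdim=True).clamp(min=1e-10)
        # Clamp the atanh argument below 1 to avoid infinities at the boundary
        arg = (self.sqrt_c * norm).clamp(max=1 - 1e-5)
        return torch.atanh(arg) / (self.sqrt_c * norm) * x

    def _exp_map(self, v):
        """Exponential map at the origin (tangent space to ball)"""
        norm = torch.norm(v, dim=-1, keepdim=True).clamp(min=1e-10)
        return torch.tanh(self.sqrt_c * norm) / (self.sqrt_c * norm) * v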

Fine-Tuning Results

Method	Trainable parameters	WikiTree-10	Memory savings (by trainable-parameter count)
Full fine-tuning	7B	71.2%	1x
Euclidean LoRA	4M	68.5%	~1700x
HypLoRA	4M	70.8%	~1700x
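The "4M" and "~1700x" figures can be sanity-checked with a short calculation. The sketch below counts LoRA parameters for one plausible configuration: hidden size 4096, rank 4, and adapters on four attention projections in each of 32 layers. These settings are assumptions chosen for illustration, not values reported for HypLoRA.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters in one LoRA adapter: A is d_in x rank, B is rank x d_out."""
    return rank * (d_in + d_out)

d_model, rank = 4096, 4            # assumed hidden size and LoRA rank
n_layers, mats_per_layer = 32, 4   # assumed: four attention projections in each of 32 layers

per_matrix = lora_params(d_model, d_model, rank)  # 32,768 parameters per adapted matrix
total = per_matrix * n_layers * mats_per_layer    # 4,194,304 parameters, about 4M
ratio = 7e9 / total                               # about 1669, i.e. roughly 1700x fewer
```

Under these assumptions the adapter holds about 4.2 million parameters, which is consistent with the table's 4M and a reduction of roughly 1700x relative to 7B.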

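Future Outlook

Near-term developments (2025-2026)

Larger hyperbolic LLMs: hyperbolic language models at the ten-billion-parameter scale
Multimodal hyperbolic models: joint image-text embeddings
Adaptive curvature: adjusting curvature automatically based on the data

Long-term vision

Unified geometric framework: unifying Euclidean, hyperbolic, and spherical spaces
Theoretical advances: a deeper understanding of the inductive biases of hyperbolic learning
Domain-specific foundation models: hyperbolic foundation models for biology, chemistry, and physics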
