Neural Neural Scaling Laws (NeuNeu)

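Overview

Neural Neural Scaling Laws (NeuNeu)¹ is a framework proposed in 2026 that recasts downstream task performance prediction as a time-series extrapolation problem. Unlike traditional parametric scaling laws (such as Kaplan and Chinchilla), NeuNeu takes a data-driven approach: it assumes no particular functional form and learns scaling patterns directly from token-level validation losses.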

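Background: limitations of traditional scaling laws

Kaplan scaling law

The classic scaling law proposed by Kaplan et al.:

$L(N) \approx \left(\frac{N_{0}}{N}\right)^{\alpha_{N}}$

where $L$ is the language model's validation loss, $N$ is the number of model parameters, $N_{0}$ is a fitted constant, and $\alpha_{N}$ is the power-law exponent.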
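
A small numeric sketch can make the power law concrete. The constants `N0` and `ALPHA_N` below are hypothetical values chosen only for illustration, not fitted values from any paper. The point is that doubling $N$ always multiplies the loss by the same factor, $2^{-\alpha_{N}}$.

```python
def kaplan_loss(n_params: float, n0: float, alpha_n: float) -> float:
    """Power-law loss L(N) = (N0 / N) ** alpha_N."""
    return (n0 / n_params) ** alpha_n


# Hypothetical constants, for illustration only.
N0, ALPHA_N = 1e13, 0.08

loss_1b = kaplan_loss(1e9, N0, ALPHA_N)
loss_2b = kaplan_loss(2e9, N0, ALPHA_N)

# Doubling N scales the loss by 2 ** -alpha_N, independent of N itself.
ratio = loss_2b / loss_1b
print(f"L(1B)={loss_1b:.3f}  L(2B)={loss_2b:.3f}  ratio={ratio:.4f}")
```

Because the ratio does not depend on $N$, a power law is a straight line on a log-log plot, which is exactly the functional-form assumption NeuNeu avoids.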

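Chinchilla scaling law

The compute-optimal allocation scaling law proposed by the Chinchilla team:

$L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$

where $D$ is the number of training tokens, $E$ is the irreducible loss, and $A$, $B$, $\alpha$, $\beta$ are fitted constants.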
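
To see how this form trades off parameters against data, the sketch below evaluates it for two allocations with the same product $N \cdot D$ (used here only as a rough proxy for compute). The coefficients in `COEFFS` are hypothetical placeholders, not the published fit.

```python
def chinchilla_loss(n_params, n_tokens, E, A, B, alpha, beta):
    """Parametric loss L(N, D) = E + A / N**alpha + B / D**beta."""
    return E + A / n_params ** alpha + B / n_tokens ** beta


# Hypothetical coefficients, for illustration only.
COEFFS = dict(E=1.7, A=400.0, B=400.0, alpha=0.34, beta=0.28)

# Two allocations with the same N * D = 2e19.
balanced = chinchilla_loss(1e9, 2e10, **COEFFS)  # smaller model, more tokens
skewed = chinchilla_loss(4e9, 5e9, **COEFFS)     # larger model, fewer tokens

print(f"balanced={balanced:.3f}  skewed={skewed:.3f}")
```

Under these made-up coefficients the data-heavier allocation reaches the lower loss; with other coefficients the ordering can flip, which is why the constants must be fitted per setup.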

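Common problems with traditional methods

Problem	Description
Assumes a specific functional form	A power-law or log-linear relationship must be assumed in advance
Predicts pretraining loss only	Cannot directly predict downstream task accuracy
Single scaling relationship	Usually considers only one variable (model size or data size)
Point prediction	Outputs a single predicted value with no uncertainty estimate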

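NeuNeu core framework

Core idea

NeuNeu's core insight: treat scaling-law prediction as a time-series extrapolation problem.

Training data construction:
├── Collect historical training runs
├── Extract token-level validation losses from each run
└── Model them as a time series indexed by compute

Prediction pipeline:
├── Input: aggregated validation loss curve
└── Output: quantile predictions of downstream task performance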
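
The data-construction steps above can be sketched as follows. The checkpoint tuples and the helper `to_loss_curve` are hypothetical, and the mean is only one possible aggregation; the source does not say which aggregate is used.

```python
from statistics import mean

# Hypothetical checkpoints from one run: (tokens processed, token-level losses).
# They are deliberately out of order to show the sorting step.
checkpoints = [
    (1_000_000, [3.9, 4.1, 4.0, 4.2]),
    (4_000_000, [3.1, 3.3, 3.0, 3.4]),
    (2_000_000, [3.5, 3.7, 3.6, 3.6]),
]


def to_loss_curve(checkpoints):
    """Order checkpoints by compute and reduce each to (compute, mean loss)."""
    ordered = sorted(checkpoints, key=lambda c: c[0])
    return [(compute, mean(losses)) for compute, losses in ordered]


curve = to_loss_curve(checkpoints)
for compute, loss in curve:
    print(f"{compute:>9,d} tokens  mean loss {loss:.2f}")
```

The result is a sequence indexed by compute, which is the "time axis" that the extrapolation problem is posed on.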

Model architecture

NeuNeu uses a two-branch Transformer architecture:

┌───────────────────────────────────────────────────────────────┐
│                      NeuNeu Architecture                      │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  Token losses (256K tokens)                                   │
│         │                                                     │
│         ▼                                                     │
│  ┌─────────────┐                                              │
│  │ CNN branch  │  → soft prompt (learnable)                   │
│  └─────────────┘                                              │
│         │                                                     │
│         ▼                                                     │
│  ┌─────────────────────────────────────────────┐              │
│  │               Prompt encoder                │              │
│  │       [BOS; prompt; context; target]        │              │
│  └─────────────────────────────────────────────┘              │
│         │                                                     │
│         ▼                                                     │
│  ┌─────────────────────────────────────────────┐              │
│  │       Transformer (8 layers, 8 heads)       │              │
│  │                                             │              │
│  │   • Self-attention: sequence modeling       │              │
│  │   • Feed-forward: feature extraction        │              │
│  │   • LayerNorm + GELU activation             │              │
│  └─────────────────────────────────────────────┘              │
│         │                                                     │
│         ▼                                                     │
│  ┌─────────────────────────────────────────────┐              │
│  │                 Output head                 │              │
│  │  → quantiles q0.1, q0.25, q0.5, q0.75, q0.9 │              │
│  └─────────────────────────────────────────────┘              │
│                                                               │
└───────────────────────────────────────────────────────────────┘

Key mathematical formulas

1. Logistic scaling law (baseline for comparison)

$y_{t}^{(i)} = f(\bar{\ell}_{t}; a, k, L_{0}, b) = \frac{a}{1 + e^{-k(\bar{\ell}_{t} - L_{0})}} + b$

where:

$y_{t}^{(i)}$: performance of run $i$ at time $t$
$\bar{\ell}_{t}$: aggregated validation loss
$a, k, L_{0}, b$: learnable parameters
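
A short sketch of the logistic baseline may help in reading the parameters. The values in `params` are hypothetical, and `loss_mid` is just a code-friendly name for $L_{0}$. With $k < 0$, accuracy rises as the loss falls, $b$ acts as the floor, and $a + b$ as the ceiling.

```python
import math


def logistic_accuracy(mean_loss, a, k, loss_mid, b):
    """y = a / (1 + exp(-k * (mean_loss - loss_mid))) + b."""
    return a / (1.0 + math.exp(-k * (mean_loss - loss_mid))) + b


# Hypothetical parameters; k < 0 so that lower loss means higher accuracy.
params = dict(a=0.6, k=-4.0, loss_mid=3.0, b=0.25)

for loss in (4.0, 3.0, 2.0):
    print(f"mean loss {loss:.1f} -> accuracy {logistic_accuracy(loss, **params):.3f}")
```

At `mean_loss == loss_mid` the curve sits exactly halfway between floor and ceiling, at $a/2 + b$.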

2. NeuNeu quantile output

NeuNeu outputs five quantile predictions to capture predictive uncertainty:

$\hat{Q} = [\hat{q}_{0.1}, \hat{q}_{0.25}, \hat{q}_{0.5}, \hat{q}_{0.75}, \hat{q}_{0.9}]$

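3. Pinball loss (quantile regression loss)

$L_{\text{pinball}} = \sum_{\tau \in \{0.1, 0.25, 0.5, 0.75, 0.9\}} \begin{cases} \tau (a - \hat{q}_{\tau}) & \text{if } a \geq \hat{q}_{\tau} \\ (1 - \tau)(\hat{q}_{\tau} - a) & \text{if } a < \hat{q}_{\tau} \end{cases}$

where $a$ is the observed downstream accuracy and $\hat{q}_{\tau}$ is the predicted $\tau$-quantile.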
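
The pinball loss is asymmetric: for a high quantile such as $\tau = 0.9$, predicting too low is penalized nine times more than predicting too high by the same amount. A minimal pure-Python sketch (function names are illustrative):

```python
QUANTILES = (0.1, 0.25, 0.5, 0.75, 0.9)


def pinball(actual, predicted, tau):
    """Pinball loss for one quantile level tau; always non-negative."""
    diff = actual - predicted
    return max(tau * diff, (tau - 1.0) * diff)


def pinball_total(actual, predicted_quantiles):
    """Sum of pinball losses over the five quantile levels."""
    return sum(
        pinball(actual, q, tau) for q, tau in zip(predicted_quantiles, QUANTILES)
    )


under = pinball(0.60, 0.50, tau=0.9)  # prediction below the outcome
over = pinball(0.60, 0.70, tau=0.9)   # prediction above the outcome
print(f"under={under:.2f}  over={over:.2f}")
```

The `max` form is equivalent to the two-case definition: whichever branch applies is the non-negative one.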

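Training data construction

Token-level validation loss extraction

Extracted from each training run:
├── Validation loss sequence over 256K tokens
├── Compute marker (tokens processed)
└── Corresponding downstream task accuracy

Data format:
{
    "token_losses": [l_1, l_2, ..., l_256K],
    "compute": C,
    "downstream_accuracy": acc
}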
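
One way to carry this record in code is a small dataclass with basic validation. The class name `RunRecord` and the assumption that accuracy is stored as a fraction in [0, 1] are illustrative choices, not part of the source.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunRecord:
    """One checkpoint of one training run, mirroring the fields above."""

    token_losses: List[float]    # per-token validation losses
    compute: float               # tokens processed so far
    downstream_accuracy: float   # assumed here to be a fraction in [0, 1]

    def __post_init__(self):
        if not self.token_losses:
            raise ValueError("token_losses must not be empty")
        if not 0.0 <= self.downstream_accuracy <= 1.0:
            raise ValueError("downstream_accuracy must lie in [0, 1]")


record = RunRecord(token_losses=[3.2, 2.9, 3.1], compute=2e9, downstream_accuracy=0.41)
print(record.compute, len(record.token_losses))
```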

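Context window design

NeuNeu uses a variable-length context window:

Minimum window: 4 data points
Maximum window: 32 data points
Predictions can therefore be made at different stages of training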
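
The window limits above can be turned into training examples roughly as follows. This sketch assumes the target is simply the next observed point after the context; the actual prediction horizon is not specified in the source, and `context_target_pairs` is a hypothetical helper.

```python
MIN_CONTEXT, MAX_CONTEXT = 4, 32


def context_target_pairs(accuracies):
    """Build (context, target) pairs with 4 to 32 context points each.

    Assumption: the target is the point immediately after the context.
    """
    pairs = []
    for end in range(MIN_CONTEXT, len(accuracies)):
        start = max(0, end - MAX_CONTEXT)  # keep at most the last 32 points
        pairs.append((accuracies[start:end], accuracies[end]))
    return pairs


series = [0.30 + 0.01 * i for i in range(40)]  # synthetic accuracy curve
pairs = context_target_pairs(series)
print(len(pairs), len(pairs[0][0]), len(pairs[-1][0]))
```

Early pairs have short contexts (prediction from little evidence) and late pairs are capped at 32 points, so one run yields examples from several training stages.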

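Experimental results

Main performance comparison

Method	MAE (%)	Kendall τ	Spearman ρ
Logistic (baseline)	3.29	0.633	0.792
Neural (no prompt)	2.65	0.704	0.854
NeuNeu (final)	2.04	0.756	0.892

Improvement: MAE is reduced by 38% relative to the logistic baseline (3.29% → 2.04%).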
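
The 38% figure is a relative reduction, which the following sketch recomputes from the two MAE values in the table. The helper names are illustrative.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute gap between predictions and observations."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


reduction = relative_reduction(3.29, 2.04)  # MAE values from the table
print(f"{reduction:.1%}")  # prints 38.0%
```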

Model selection ability

Method	Ranking accuracy
Logistic	63.3%
Neural	69.4%
NeuNeu	75.6%
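
The source does not define how ranking accuracy is computed. One common reading, assumed here, is pairwise ranking accuracy: the fraction of model pairs whose predicted order matches their true order. The function below is a sketch under that assumption.

```python
from itertools import combinations


def pairwise_ranking_accuracy(predicted, actual):
    """Fraction of pairs ordered the same way by prediction and ground truth.

    Pairs that are tied in the ground truth are skipped.
    """
    correct = total = 0
    for i, j in combinations(range(len(actual)), 2):
        if actual[i] == actual[j]:
            continue
        total += 1
        if (predicted[i] - predicted[j]) * (actual[i] - actual[j]) > 0:
            correct += 1
    return correct / total if total else float("nan")


# Three hypothetical models: the predictor orders two of the three pairs correctly.
accuracy = pairwise_ranking_accuracy([0.5, 0.6, 0.7], [0.4, 0.8, 0.6])
print(f"{accuracy:.3f}")
```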

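Zero-shot generalization

NeuNeu is reported to generalize zero-shot along three axes:

Generalization type	Description	Performance retained
Task	Trained on coding tasks, tested on math tasks	✓
Model family	Trained on GPT-2-style models, tested on LLaMA-style models	✓
Scale	Trained on ≤7B, tested on 13B	✓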

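Calibration analysis

Quantile interval	Nominal coverage	Empirical coverage
10%-90%	80%	~75%
25%-75%	50%	~48%

Both intervals cover slightly less than their nominal level, so the predicted intervals are mildly too narrow.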
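
Empirical coverage is the share of held-out outcomes that fall inside the predicted interval. A minimal sketch with made-up numbers (the data and the function name are illustrative):

```python
def empirical_coverage(lower, upper, actual):
    """Fraction of observations with lower <= actual <= upper."""
    inside = sum(lo <= a <= hi for lo, hi, a in zip(lower, upper, actual))
    return inside / len(actual)


# Hypothetical q0.1 / q0.9 predictions and observed accuracies for four runs.
q10 = [0.40, 0.52, 0.61, 0.70]
q90 = [0.50, 0.60, 0.69, 0.78]
observed = [0.45, 0.58, 0.72, 0.71]

coverage = empirical_coverage(q10, q90, observed)
print(f"{coverage:.0%} of outcomes inside the nominal 80% interval")
```

A well-calibrated predictor would give a value close to the nominal 80% on a large test set; values below it indicate overconfident intervals.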

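Comparison with traditional methods

Dimension	Kaplan/Chinchilla	NeuNeu
Parameterization	Hand-specified functional form	Learned from data
Input	Aggregated validation loss	Token-level loss distribution
Output	Point prediction	Quantile distribution
Uncertainty	None	Yes (quantiles)
Downstream tasks	Requires an extra conversion step	Predicted directly
Flexibility	Limited by assumptions	No functional-form assumption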

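Practical implications

1. Resource allocation

NeuNeu can help researchers predict model performance before committing to a full training run, in order to:

Decide on the best model size
Estimate the amount of training data required
Optimize the training budget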

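2. Model selection

Through its quantile predictions, NeuNeu provides prediction intervals that help:

Identify potentially anomalous architectures
Assess the stability of an architecture
Make more robust decisions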
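
As an illustration of what a more robust decision could look like, the sketch below compares two hypothetical candidates by their median forecast and by their pessimistic (lower-quantile) forecast. The candidate names, numbers, and selection rules are assumptions made for this example, not a procedure from the source.

```python
# Hypothetical predicted (q0.1, q0.5, q0.9) final accuracies per candidate.
candidates = {
    "arch_a": (0.58, 0.66, 0.70),
    "arch_b": (0.50, 0.68, 0.80),
}


def pick_by_median(cands):
    """Choose the candidate with the highest median forecast."""
    return max(cands, key=lambda name: cands[name][1])


def pick_by_lower_quantile(cands):
    """Risk-averse choice: highest q0.1 forecast."""
    return max(cands, key=lambda name: cands[name][0])


def interval_width(quantiles):
    """Width of the q0.1 to q0.9 interval; a wide interval flags instability."""
    return quantiles[2] - quantiles[0]


print(pick_by_median(candidates), pick_by_lower_quantile(candidates))
```

Here the two rules disagree: `arch_b` has the better median but a much wider interval, so a risk-averse team might still prefer `arch_a`.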

3. Early stopping

Because it extrapolates time series, NeuNeu can:

Predict final performance
Judge whether to continue training
Avoid wasted compute
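
One possible early-stopping rule built on quantile forecasts is sketched below. The thresholds and the three-way outcome are assumptions for illustration; the source does not prescribe a specific rule.

```python
def early_stop_decision(q_low, q_high, target_accuracy):
    """Classify a run from its forecast interval for final accuracy.

    q_low / q_high: predicted lower and upper quantiles (e.g. q0.1 and q0.9).
    """
    if q_high < target_accuracy:
        return "stop"        # even the optimistic forecast misses the target
    if q_low >= target_accuracy:
        return "on_track"    # even the pessimistic forecast meets the target
    return "continue"        # the interval straddles the target: undecided


print(early_stop_decision(0.55, 0.62, target_accuracy=0.70))
print(early_stop_decision(0.60, 0.75, target_accuracy=0.70))
print(early_stop_decision(0.72, 0.80, target_accuracy=0.70))
```

Since the reported intervals slightly under-cover, a cautious user might widen them or require the "stop" condition to hold at several consecutive checkpoints before terminating a run.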

Code implementation

PyTorch sketch

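import torch
import torch.nn as nn


class NeuNeu(nn.Module):
    """Illustrative sketch of the architecture above, not the reference implementation."""

    def __init__(self, quantiles=(0.1, 0.25, 0.5, 0.75, 0.9),
                 d_model=128, n_prompt=16, max_seq_len=32):
        super().__init__()
        # Stored as a buffer so it follows the module across devices
        self.register_buffer("quantiles", torch.tensor(quantiles))

        # CNN branch: compresses the token-level loss sequence into
        # n_prompt soft-prompt vectors of width d_model
        self.cnn = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv1d(64, d_model, kernel_size=3, padding=1),
            nn.AdaptiveAvgPool1d(n_prompt),
        )

        # Learnable BOS token and target placeholder token
        self.bos = nn.Parameter(torch.randn(1, 1, d_model))
        self.target_token = nn.Parameter(torch.randn(1, 1, d_model))

        # Embeds each scalar context accuracy; learned positions keep the order
        self.context_proj = nn.Linear(1, d_model)
        self.pos_emb = nn.Embedding(max_seq_len, d_model)

        # Transformer encoder (8 layers, 8 heads, GELU)
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(
                d_model=d_model, nhead=8, dim_feedforward=512,
                activation="gelu", batch_first=True,
            ),
            num_layers=8,
        )

        # Quantile output head
        self.quantile_head = nn.Linear(d_model, len(quantiles))

    def forward(self, token_losses, context_acc, target_acc=None):
        # token_losses: [B, n_tokens] token-level validation losses
        # context_acc:  [B, n_ctx] past downstream accuracies, n_ctx <= max_seq_len
        # target_acc:   [B] accuracy to predict (training only)
        batch, n_ctx = context_acc.shape

        # CNN branch -> soft prompt: [B, d_model, n_prompt] -> [B, n_prompt, d_model]
        prompt = self.cnn(token_losses.unsqueeze(1)).transpose(1, 2)

        # Context embeddings with positions: [B, n_ctx, d_model]
        positions = torch.arange(n_ctx, device=context_acc.device)
        context = self.context_proj(context_acc.unsqueeze(-1)) + self.pos_emb(positions)

        # Input sequence: [BOS; prompt; context; target]
        bos = self.bos.expand(batch, -1, -1)
        target = self.target_token.expand(batch, -1, -1)
        seq = torch.cat([bos, prompt, context, target], dim=1)

        # Encode and read the output at the target position
        enc = self.transformer(seq)
        quantile_pred = self.quantile_head(enc[:, -1, :])  # [B, n_quantiles]

        # Return the loss as well when a target is provided
        if target_acc is not None:
            return quantile_pred, self.pinball_loss(quantile_pred, target_acc)
        return quantile_pred

    def pinball_loss(self, pred, target):
        # diff > 0 means the prediction is below the observed value
        diff = target.unsqueeze(-1) - pred  # [B, n_quantiles]
        loss = torch.maximum(self.quantiles * diff, (self.quantiles - 1) * diff)
        return loss.mean()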

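Limitations

Limitation	Description
Training data requirements	Requires a large amount of historical training-run data
Task specificity	May generalize poorly to some tasks
Scale extrapolation	Extrapolation uncertainty remains
Compute overhead	Transformer inference adds some overhead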

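Future directions

Multi-task learning: joint training across multiple tasks
Active learning: selectively acquiring training data
Bayesian NeuNeu: uncertainty quantification
Cross-modal extension: application to multimodal models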

References

¹ NeuNeu: Neural Neural Scaling Laws - Predicting Downstream Task Performance from Token-level Validation Losses. arXiv:2601.19831 (2026)
