PIE框架：跨层Transcoder的高效电路发现

1. 问题背景

1.1 现有电路发现的挑战

传统电路发现方法面临严峻的计算挑战：

问题	描述	影响
穷举搜索	对均匀采样的组件进行解释	对不相关组件浪费计算
高成本	解释成本随组件数线性增长	无法扩展到大型模型
不精确	无法识别关键跨层特征	电路不完整

1.2 关键洞察

核心问题：并非所有组件对目标行为都重要，但现有方法对所有组件一视同仁。

PIE解决方案：先剪枝，再解释，最后评估。

2. PIE框架详解

2.1 框架概述

PIE代表Prune, Interpret, Evaluate，是一种CLT（Cross-Layer Transcoder）原生的端到端剪枝框架：

┌──────────────────────────────────────────────────────────────┐
│                        PIE Framework                          │
├──────────────────────────────────────────────────────────────┤
│                                                               │
│   输入: 模型 + 目标行为                                        │
│      │                                                       │
│      ▼                                                       │
│   ┌───────────────┐                                         │
│   │    Prune     │ ←── 剪枝优先于解释                       │
│   │  (特征剪枝)   │                                         │
│   └───────────────┘                                         │
│      │                                                       │
│      ▼                                                       │
│   ┌───────────────┐                                         │
│   │   Interpret   │ ←── 仅对剪枝后的特征进行解释             │
│   │  (自动解释)   │                                         │
│   └───────────────┘                                         │
│      │                                                       │
│      ▼                                                       │
│   ┌───────────────┐                                         │
│   │   Evaluate    │ ←── 行为保真度评估                       │
│   │  (解释评估)   │                                         │
│   └───────────────┘                                         │
│      │                                                       │
│      ▼                                                       │
│   输出: 高效、可信的电路                                       │
└──────────────────────────────────────────────────────────────┘

2.2 三阶段详解

2.2.1 Prune阶段

目标：识别对目标行为最重要的特征

策略：利用行为相关性进行剪枝，而非均匀采样

def prune_features(model, target_behavior, budget_k):
    """
    Prune to top-k features based on behavioral relevance.
    
    Args:
        model: Neural network with CLT
        target_behavior: Definition of target behavior
        budget_k: Number of features to keep
    
    Returns:
        pruned_features: Set of important features
    """
    # Compute behavioral importance for all features
    feature_importances = {}
    
    for feature in model.clt_features:
        # Measure importance via activation patching
        importance = measure_behavioral_importance(
            model, feature, target_behavior
        )
        feature_importances[feature] = importance
    
    # Select top-k features
    sorted_features = sorted(
        feature_importances.items(),
        key=lambda x: x[1],
        reverse=True
    )
    
    pruned_features = {f for f, _ in sorted_features[:budget_k]}
    
    return pruned_features

2.2.2 Interpret阶段

目标：解释剪枝后特征的功能

方法：特征归因修补（FAP）

def interpret_feature(feature, model, context):
    """
    Interpret a CLT feature using Feature Attribution Patching.
    """
    # Get gradient with respect to feature
    output = model(context)
    output.backward()
    
    # Compute FAP score
    # Aggregates gradient-weighted write contributions
    fap_score = compute_fap_score(feature, model)
    
    return fap_score
 
 
def compute_fap_score(feature, model):
    """
    Compute Feature Attribution Patching score.
    
    FAP = Σ (∂output / ∂feature_write) × feature_value
    """
    grad = feature.gradient  # From backward pass
    value = feature.value
    
    # Gradient-weighted contribution
    fap = (grad * value).sum()
    
    return fap.item()

2.2.3 Evaluate阶段

目标：评估解释的质量

指标：KL散度行为保持 + FADE风格解释质量

def evaluate_interpretation(
    original_circuit,
    interpreted_circuit,
    test_behaviors
):
    """
    Evaluate interpretation quality.
    """
    # Behavioral fidelity: KL divergence
    kl_divergences = []
    
    for behavior in test_behaviors:
        # Sample responses
        original_response = evaluate_behavior(
            original_circuit, behavior
        )
        interpreted_response = evaluate_behavior(
            interpreted_circuit, behavior
        )
        
        # Compute KL divergence
        kl = F.kl_div(
            original_response.log(),
            interpreted_response,
            reduction='batchmean'
        )
        kl_divergences.append(kl)
    
    avg_kl = np.mean(kl_divergences)
    
    return {
        'behavioral_fidelity': 1.0 - min(avg_kl, 1.0),  # Convert to accuracy-like metric
        'kl_divergences': kl_divergences
    }

3. FAP-Synergy方法

3.1 协同感知重排序

FAP-Synergy引入系统性的协同感知重排序：

def fap_synergy_reranking(
    features,
    behaviors,
    model,
    k: int
):
    """
    Synergy-aware reranking for strict budget constraints.
    
    Args:
        features: Candidate features
        behaviors: Set of target behaviors
        model: Neural network model
        k: Budget constraint
    
    Returns:
        selected_features: Optimally selected k features
    """
    # Compute individual FAP scores
    individual_scores = {
        f: compute_fap_score(f, model)
        for f in features
    }
    
    # Compute synergy scores
    synergy_matrix = {}
    for f1, f2 in combinations(features, 2):
        synergy = compute_synergy(f1, f2, behaviors, model)
        synergy_matrix[(f1, f2)] = synergy
    
    # Greedy selection with synergy
    selected = []
    remaining = set(features)
    
    while len(selected) < k:
        best_feature = None
        best_score = float('-inf')
        
        for f in remaining:
            # Individual score
            score = individual_scores[f]
            
            # Synergy bonus with selected features
            synergy_bonus = sum(
                synergy_matrix.get((f, s), 0) + synergy_matrix.get((s, f), 0)
                for s in selected
            ) / max(len(selected), 1)
            
            # Combined score
            combined_score = score + synergy_bonus
            
            if combined_score > best_score:
                best_score = combined_score
                best_feature = f
        
        selected.append(best_feature)
        remaining.remove(best_feature)
    
    return set(selected)
 
 
def compute_synergy(f1, f2, behaviors, model):
    """
    Compute synergy score between two features.
    """
    # Joint importance
    joint_importance = measure_behavioral_importance(
        model, {f1, f2}, behaviors
    )
    
    # Individual importances
    ind1 = measure_behavioral_importance(model, {f1}, behaviors)
    ind2 = measure_behavioral_importance(model, {f2}, behaviors)
    
    # Synergy = joint - sum of individuals
    synergy = joint_importance - (ind1 + ind2)
    
    return max(synergy, 0)  # Only positive synergy

3.2 预算约束下的操作机制

┌─────────────────────────────────────────────────────────────┐
│          不同预算下的最优策略                                 │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  宽松预算 (K ≥ 200):                                       │
│  ┌─────────────────────────────────────────────────────┐   │
│  │ 基础FAP和adapted baselines表现稳健                    │   │
│  │ 策略: 直接选择top-k                                   │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
│  严格预算 (K < 100):                                       │
│  ┌─────────────────────────────────────────────────────┐   │
│  │ FAP-Synergy表现最优                                   │   │
│  │ 策略: 协同感知重排序                                  │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
│  "有效预算"优势:                                          │
│  FAP-Synergy在K=50达到基线在K=75的行为保真度              │
│  → 节省33%的解释成本                                      │
│                                                             │
└─────────────────────────────────────────────────────────────┘

4. 实验结果

4.1 IOI任务上的表现

方法	K=50	K=100	K=200	K=400	K=800
基准FAP	72.3%	81.2%	87.5%	91.2%	94.1%
FAP-Synergy	78.9%	85.3%	89.1%	92.1%	94.5%

4.2 Doc-String任务

方法	K=50	K=100	K=200
基准FAP	68.5%	76.2%	82.4%
FAP-Synergy	74.1%	80.3%	84.1%

4.3 效率对比

方法	行为保真度	解释成本
基线 (K=75)	78.9%	75 units
FAP-Synergy (K=50)	78.9%	50 units
节省	-	33%

5. 框架优势

5.1 剪枝优先范式

传统方法	PIE方法
先解释所有	先剪枝关键特征
O(N) 解释成本	O(K) 解释成本，K << N
不精确	高精度

5.2 协同感知

识别互补特征：选择协同工作的特征
避免冗余：减少重叠特征的重复解释
预算效率：在严格预算下表现更优

6. 实践指南

6.1 使用流程

def pie_circuit_discovery(model, target_behavior, budget_k):
    """
    Complete PIE circuit discovery workflow.
    """
    # Stage 1: Prune
    print("Stage 1: Pruning features...")
    pruned_features = prune_features(model, target_behavior, budget_k)
    print(f"Pruned to {len(pruned_features)} features")
    
    # Stage 2: Interpret
    print("Stage 2: Interpreting features...")
    interpretations = {}
    for feature in pruned_features:
        interpretations[feature] = interpret_feature(feature, model, target_behavior)
    
    # Optional: FAP-Synergy reranking
    if budget_k < 100:
        print("Applying FAP-Synergy reranking...")
        pruned_features = fap_synergy_reranking(
            pruned_features,
            target_behavior.behaviors,
            model,
            budget_k
        )
    
    # Stage 3: Evaluate
    print("Stage 3: Evaluating interpretation...")
    results = evaluate_interpretation(
        model,
        pruned_features,
        interpretations,
        target_behavior.test_set
    )
    
    return {
        'features': pruned_features,
        'interpretations': interpretations,
        'evaluation': results
    }

Metaphor

探索

PIE框架跨层Transcoder的电路发现

PIE框架：跨层Transcoder的高效电路发现

1. 问题背景

1.1 现有电路发现的挑战

1.2 关键洞察

2. PIE框架详解

2.1 框架概述

2.2 三阶段详解

2.2.1 Prune阶段

2.2.2 Interpret阶段

2.2.3 Evaluate阶段

3. FAP-Synergy方法

3.1 协同感知重排序

3.2 预算约束下的操作机制

4. 实验结果

4.1 IOI任务上的表现

4.2 Doc-String任务

4.3 效率对比

5. 框架优势

5.1 剪枝优先范式

5.2 协同感知

6. 实践指南

6.1 使用流程

6.2 注意事项

7. 总结与展望

7.1 核心贡献

7.2 实用性

7.3 未来方向

参考资料

关系图谱

目录

反向链接

Metaphor

探索

PIE框架 跨层Transcoder的电路发现

PIE框架：跨层Transcoder的高效电路发现

1. 问题背景

1.1 现有电路发现的挑战

1.2 关键洞察

2. PIE框架详解

2.1 框架概述

2.2 三阶段详解

2.2.1 Prune阶段

2.2.2 Interpret阶段

2.2.3 Evaluate阶段

3. FAP-Synergy方法

3.1 协同感知重排序

3.2 预算约束下的操作机制

4. 实验结果

4.1 IOI任务上的表现

4.2 Doc-String任务

4.3 效率对比

5. 框架优势

5.1 剪枝优先范式

5.2 协同感知

6. 实践指南

6.1 使用流程

6.2 注意事项

7. 总结与展望

7.1 核心贡献

7.2 实用性

7.3 未来方向

参考资料

关系图谱

目录

反向链接

PIE框架跨层Transcoder的电路发现