Scientific Equation Discovery Agents

1. Introduction

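Scientific equations are concise expressions of natural laws. From Newton's second law, $F = ma$, to Einstein's mass-energy equivalence, $E = mc^2$, the discovery of equations has been a core driver of scientific progress. Traditional equation discovery, however, depends on the inspiration and intuition of human scientists, which makes it slow and difficult.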

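Scientific equation discovery agents combine symbolic regression, deep learning, and physical constraints in an attempt to automate this process [1][2]. This document surveys recent progress in the field.

This document is an advanced follow-up to Scientific Agent Fundamentals.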

2. Symbolic Regression Basics

2.1 Problem Definition

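Symbolic regression aims to discover mathematical expressions automatically from data. Formally:

Given a dataset $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, find a symbolic expression $f$ such that:

$$y_i \approx f(\mathbf{x}_i), \quad i = 1, \dots, N$$

where $f$ is composed of basic operations (such as $+, -, \times, \div$), variables, and constants.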
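To make the definition concrete, the following minimal sketch represents a candidate expression as a nested tuple and scores it by mean squared error against a small dataset. The tuple encoding, the operator tables, and the helper names `evaluate` and `mse` are illustrative assumptions, not part of any particular symbolic regression library.

```python
import math

# Assumed encoding of an expression tree:
#   ("const", c) | ("var", name) | (unary_op, arg) | (binary_op, left, right)
BINARY_OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}
UNARY_OPS = {"sin": math.sin, "exp": math.exp}


def evaluate(expr, env):
    """Evaluate an expression tree given variable bindings in env."""
    tag = expr[0]
    if tag == "const":
        return expr[1]
    if tag == "var":
        return env[expr[1]]
    if tag in UNARY_OPS:
        return UNARY_OPS[tag](evaluate(expr[1], env))
    return BINARY_OPS[tag](evaluate(expr[1], env), evaluate(expr[2], env))


def mse(expr, dataset):
    """Mean squared error of expr over (env, y) pairs."""
    return sum((evaluate(expr, env) - y) ** 2 for env, y in dataset) / len(dataset)


# Data generated from y = 2*x + 1
dataset = [({"x": float(x)}, 2.0 * x + 1.0) for x in range(5)]
exact = ("+", ("*", ("const", 2.0), ("var", "x")), ("const", 1.0))
wrong = ("*", ("var", "x"), ("var", "x"))

print(mse(exact, dataset))  # 0.0: the generating expression fits exactly
print(mse(wrong, dataset))  # larger error for y = x*x
```

A search procedure then amounts to exploring the space of such trees for one with low error and low complexity.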

2.2 Traditional Methods

2.2.1 Genetic Programming (GP)

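import numpy as np


class GeneticProgramming:
    """Skeleton of a GP loop. initialize_population, selection, crossover,
    mutate, replacement and is_perfect are left to a concrete implementation."""

    def __init__(self, population_size=1000, generations=100):
        self.pop_size = population_size
        self.generations = generations

    def evolve(self, X, y):
        # Initialize the population
        population = self.initialize_population()

        for gen in range(self.generations):
            # Evaluate fitness of the current population
            fitness = [self.fitness(ind, X, y) for ind in population]

            # Track the best individual with fitness values that match this population
            best = self.get_best(population, fitness)
            if self.is_perfect(best):
                return best

            # Selection
            parents = self.selection(population, fitness)

            # Crossover
            offspring = self.crossover(parents)

            # Mutation
            offspring = self.mutate(offspring)

            # Replacement (elitism can be enforced here by keeping `best`)
            population = self.replacement(population, offspring)

        # The final population has not been scored yet
        fitness = [self.fitness(ind, X, y) for ind in population]
        return self.get_best(population, fitness)

    def get_best(self, population, fitness):
        # Higher fitness is better
        return population[int(np.argmax(fitness))]

    def fitness(self, individual, X, y):
        # Mean squared error of the individual's predictions
        pred = individual.evaluate(X)
        mse = np.mean((pred - y) ** 2)

        # Complexity penalty
        complexity = individual.complexity()

        # AIC-like criterion: higher is better
        return -mse - 0.01 * complexity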
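The listing above leaves the selection step abstract. One common choice is tournament selection, sketched here under the assumption that higher fitness is better. The function names and the example population are hypothetical.

```python
import random


def tournament_select(population, fitness, k=3, rng=random):
    """Pick one parent: sample k distinct individuals and return the fittest."""
    contenders = rng.sample(range(len(population)), k)
    winner = max(contenders, key=lambda i: fitness[i])
    return population[winner]


def select_parents(population, fitness, n_parents, k=3, seed=0):
    """Run n_parents independent tournaments with a seeded generator."""
    rng = random.Random(seed)
    return [tournament_select(population, fitness, k, rng) for _ in range(n_parents)]


population = ["x", "x + 1", "2*x + 1", "x*x"]
fitness = [-5.0, -3.0, -0.03, -11.8]  # higher is better

# With k equal to the population size, every tournament returns the best individual
parents = select_parents(population, fitness, n_parents=6, k=4)
print(parents)
```

The tournament size `k` controls selection pressure: `k=1` is uniform random choice, while larger values favor the current best more strongly.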

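2.2.2 The Eureqa Algorithm

Eureqa performs search-based symbolic regression, guided by:

  • Formula length: favor concise formulas
  • Prediction error: minimize residuals
  • Sparse search: explore promising directions first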
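Formula length and prediction error pull in opposite directions, and one standard way to balance two such objectives is to keep the Pareto front of candidates. The sketch below illustrates that idea only. It is not Eureqa's actual implementation, and the candidate records are made up for the example.

```python
def pareto_front(candidates):
    """Return candidates not dominated in (complexity, error); both are minimized."""
    front = []
    for c in candidates:
        dominated = any(
            o["complexity"] <= c["complexity"]
            and o["error"] <= c["error"]
            and (o["complexity"] < c["complexity"] or o["error"] < c["error"])
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c["complexity"])


candidates = [
    {"expr": "x", "complexity": 1, "error": 5.0},
    {"expr": "x*x", "complexity": 3, "error": 11.8},
    {"expr": "2*x", "complexity": 3, "error": 1.0},
    {"expr": "2*x + 1", "complexity": 5, "error": 0.0},
    {"expr": "2*x + 1 + 0*x", "complexity": 9, "error": 0.0},
]

front = pareto_front(candidates)
print([c["expr"] for c in front])  # ['x', '2*x', '2*x + 1']
```

Each expression on the front is the most accurate one found at its complexity level, which lets a user pick the trade-off after the search.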

2.3 Deep Learning Methods

2.3.1 Neural-Network-Assisted Regression

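class NeuralSymbolicRegression:
    def __init__(self):
        # NeuralNetwork and SymbolicExpressionLayer are not defined in this
        # document; they stand for a trained model and a formula-extraction module
        self.nn = NeuralNetwork()
        self.symbolic_layer = SymbolicExpressionLayer()

    def forward(self, x):
        # Neural network prediction
        nn_pred = self.nn(x)

        # Extract a symbolic expression from the prediction
        symbolic_expr = self.symbolic_layer.extract(nn_pred)

        return symbolic_expr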

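3. PhysX: A Physics-Guided LLM Agent

3.1 Core Ideas

PhysX is a physics-guided LLM agent designed for scientific equation discovery [1]. Its core ideas are:

  1. Physical constraint integration: use physical laws to constrain the search space
  2. Dimensional analysis: use physical dimensions to shrink the candidate space
  3. Use of prior knowledge: incorporate physical intuition and domain knowledge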
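The dimensional-analysis idea in item 2 can be illustrated by representing each dimension as a vector of exponents over the base dimensions. The sketch below tracks only mass, length, and time. The helper names and the candidate list are assumptions made for the example and say nothing about how PhysX is implemented.

```python
from typing import Tuple

Dim = Tuple[int, int, int]  # exponents of (mass M, length L, time T)

MASS: Dim = (1, 0, 0)
LENGTH: Dim = (0, 1, 0)
TIME: Dim = (0, 0, 1)


def mul(a: Dim, b: Dim) -> Dim:
    """Multiplying quantities adds their dimension exponents."""
    return tuple(x + y for x, y in zip(a, b))


def div(a: Dim, b: Dim) -> Dim:
    """Dividing quantities subtracts their dimension exponents."""
    return tuple(x - y for x, y in zip(a, b))


def power(a: Dim, n: int) -> Dim:
    """Raising a quantity to the n-th power scales its exponents by n."""
    return tuple(x * n for x in a)


VELOCITY = div(LENGTH, TIME)        # L T^-1
ACCELERATION = div(VELOCITY, TIME)  # L T^-2
FORCE = mul(MASS, ACCELERATION)     # M L T^-2
ENERGY = mul(FORCE, LENGTH)         # M L^2 T^-2

# Candidate right-hand sides for a force law, with their computed dimensions
candidates = {
    "m * a": mul(MASS, ACCELERATION),
    "m * v": mul(MASS, VELOCITY),
    "m * v**2 / r": div(mul(MASS, power(VELOCITY, 2)), LENGTH),
}

consistent = [expr for expr, dim in candidates.items() if dim == FORCE]
print(consistent)  # ['m * a', 'm * v**2 / r']
```

Only dimensionally consistent candidates need to be fitted against data, which is how the constraint shrinks the search space.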

3.2 System Architecture

┌─────────────────────────────────────────────────────────────┐
│                         PhysX Agent                         │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│ ┌───────────────┐    ┌───────────────┐    ┌───────────────┐ │
│ │ Physics       │───▶│ LLM           │───▶│ Verifier      │ │
│ │ Knowledge     │    │ Generator     │    │ (Dimensional  │ │
│ │ Base          │    │               │    │  Analysis)    │ │
│ └───────────────┘    └───────────────┘    └───────────────┘ │
│         │                    │                    │         │
│         ▼                    ▼                    ▼         │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │                  Physics-Guided Search                  │ │
│ │  - Unit constraints        - Conservation laws          │ │
│ │  - Dimensional analysis    - Symmetry                   │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘

3.3 Dimensional Analysis Module

Dimensional analysis is a core component of PhysX:

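from typing import List


class DimensionalAnalyzer:
    # Base dimensions
    BASE_UNITS = {
        'mass': 'M',
        'length': 'L',
        'time': 'T',
        'temperature': 'Θ',
        'current': 'I'
    }

    def analyze(self, expr: str) -> str:
        """Return the dimension of an expression."""
        # Parse the expression
        tree = self.parse(expr)

        # Compute the dimension recursively
        dim = self.compute_dimension(tree)

        return dim

    def check_consistency(self, equation: str) -> bool:
        """Check that both sides of an equation have the same dimension."""
        if equation.count('=') != 1:
            raise ValueError(f"expected exactly one '=' in {equation!r}")
        left, right = equation.split('=')

        left_dim = self.analyze(left)
        right_dim = self.analyze(right)

        return left_dim == right_dim

    def generate_candidates(self, variables: List[dict]) -> List[str]:
        """Generate candidate equations from the variables' dimensions."""
        # Extract the dimension of each variable
        dims = [v['dimension'] for v in variables]

        # Set up the dimensional equation
        target_dim = '?'  # target dimension, still to be determined

        # Solve the dimensional equation
        solutions = self.solve_dimension_equation(dims, target_dim)

        # Build candidate expressions
        candidates = []
        for sol in solutions:
            expr = self.build_expression(sol, variables)
            candidates.append(expr)

        return candidates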
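The candidate-generation step can be illustrated with a small exponent search: find the powers of the input variables whose product has the target dimension. The sketch below brute-forces half-integer exponents for a pendulum period. The function names, the exponent grid, and the (M, L, T) encoding are assumptions for illustration, and a real solver would more likely use linear algebra than enumeration.

```python
from fractions import Fraction
from itertools import product


def solve_exponents(var_dims, target_dim, max_abs=2, step=Fraction(1, 2)):
    """Enumerate exponent assignments whose power product has the target dimension."""
    n_steps = int(max_abs / step)
    grid = [step * k for k in range(-n_steps, n_steps + 1)]
    names = list(var_dims)
    solutions = []
    for exps in product(grid, repeat=len(names)):
        # Dimension of prod(var ** exp): exponent-weighted sum per base dimension
        total = tuple(
            sum(e * var_dims[name][i] for e, name in zip(exps, names))
            for i in range(len(target_dim))
        )
        if total == tuple(target_dim):
            solutions.append(dict(zip(names, exps)))
    return solutions


def build_expression(exponents):
    """Render a power product, skipping variables with exponent zero."""
    factors = [f"{name}**({exp})" for name, exp in exponents.items() if exp != 0]
    return " * ".join(factors) if factors else "1"


# Dimensions as (M, L, T) exponents
var_dims = {
    "length": (0, 1, 0),  # pendulum length
    "g": (0, 1, -2),      # gravitational acceleration
    "mass": (1, 0, 0),    # bob mass
}
solutions = solve_exponents(var_dims, target_dim=(0, 0, 1))  # target: a time
print([build_expression(s) for s in solutions])  # ['length**(1/2) * g**(-1/2)']
```

The search recovers the familiar form of the pendulum period, proportional to the square root of length over g, and shows that mass cannot appear. The dimensionless prefactor still has to be fitted from data.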

3.4 LLM Generator

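from typing import List


class LLMGenerator:
    def __init__(self, llm):
        self.llm = llm

    def generate(self, context: dict) -> List[str]:
        prompt = f"""
        Given the following physical setting, propose possible equation forms:

        Variables: {context['variables']}
        Known relations: {context['known_relations']}
        Physical constraints: {context['constraints']}

        Generate 5 candidate equations that:
        1. Are dimensionally consistent
        2. Are physically meaningful
        3. Include the necessary physical constants
        """

        response = self.llm.generate(prompt)

        # Parse equations out of the raw response
        equations = self.parse_equations(response)

        return equations

    def refine(self, candidate: str, feedback: dict) -> str:
        """Refine an equation based on feedback."""
        prompt = f"""
        Candidate equation: {candidate}

        Feedback: {feedback}

        Revise the equation so that it better respects physical laws.
        """

        refined = self.llm.generate(prompt)
        return refined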
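The `parse_equations` step is not spelled out above. A minimal sketch follows, assuming the LLM returns one equation per line with optional list numbering such as `1.` or `2)`. The response format and the regular expression are assumptions, and a production parser would need to be more defensive.

```python
import re
from typing import List

# Optional "1. " or "2) " prefix, then text containing exactly one '='
_EQUATION_LINE = re.compile(r"^\s*(?:\d+[.)]\s+)?(?P<eq>[^=]+=[^=]+?)\s*$")


def parse_equations(response: str) -> List[str]:
    """Extract lines that look like single-'=' equations from an LLM response."""
    equations = []
    for line in response.splitlines():
        match = _EQUATION_LINE.match(line)
        if match:
            equations.append(match.group("eq").strip())
    return equations


response = """Here are some candidates:
1. F = m * a
2) E = 0.5 * m * v**2
not an equation
3. p = m * v
"""

print(parse_equations(response))
# ['F = m * a', 'E = 0.5 * m * v**2', 'p = m * v']
```

Lines without an equals sign, and lines using `==`, are skipped rather than passed on to the verifier.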

3.5 Verifier

import numpy as np
from sympy import count_ops, sympify


class EquationVerifier:
    def verify(self, equation: str, data: np.ndarray) -> dict:
        # data is expected to expose a 'target' field (e.g. a structured array)
        result = {
            'dimensional': self.check_dimensional(equation),
            'statistical': self.check_statistical(equation, data),
            'physical': self.check_physical_constraints(equation),
            'score': 0.0
        }

        # Weighted overall score
        result['score'] = (
            0.4 * result['dimensional'] +
            0.3 * result['statistical'] +
            0.3 * result['physical']
        )

        return result

    def check_dimensional(self, equation: str) -> float:
        """Check dimensional consistency."""
        analyzer = DimensionalAnalyzer()
        return 1.0 if analyzer.check_consistency(equation) else 0.0

    def check_statistical(self, equation: str, data: np.ndarray) -> float:
        """Score goodness of fit from R² and AIC."""
        # sympify cannot parse '=', so evaluate only the right-hand side
        expr = sympify(equation.split('=', 1)[-1])

        # Compute predictions
        pred = self.evaluate(expr, data)

        # R²
        r2 = self.calculate_r2(data['target'], pred)

        # AIC, using the operation count as a simple complexity measure
        aic = self.calculate_aic(data['target'], pred, count_ops(expr))

        # Lower AIC is better; AIC <= 100 (including negative values) maps to 1.0
        aic_score = 1.0 if aic <= 100 else 100 / aic
        return 0.5 * r2 + 0.5 * aic_score

    def check_physical_constraints(self, equation: str) -> float:
        """Check physical constraints."""
        # Energy conservation, monotonicity, symmetry
        constraints = [
            self.check_energy_conservation,
            self.check_monotonicity,
            self.check_symmetry
        ]

        scores = [c(equation) for c in constraints]
        return float(np.mean(scores))
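The verifier relies on `calculate_r2` and `calculate_aic`, which are not shown. The sketch below gives one plausible pure-Python version of each, using the Gaussian-residual form of AIC up to an additive constant, $n \ln(\mathrm{RSS}/n) + 2k$. The function names and the clamp for perfect fits are assumptions.

```python
import math


def r2_score(y_true, y_pred):
    """Coefficient of determination; assumes y_true is not constant."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def aic(y_true, y_pred, n_params):
    """AIC for Gaussian residuals, up to an additive constant: n*ln(RSS/n) + 2k."""
    n = len(y_true)
    rss = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    rss = max(rss, 1e-12)  # avoid log(0) on a perfect fit
    return n * math.log(rss / n) + 2 * n_params


y = [1.0, 3.0, 5.0, 7.0, 9.0]
pred = [1.1, 2.9, 5.2, 6.8, 9.0]

print(r2_score(y, pred))               # close to 1 for a good fit
print(aic(y, pred, n_params=2))        # lower is better
print(aic(y, pred, n_params=4))        # same fit, more parameters: larger AIC
```

With equal residuals, each extra parameter raises AIC by exactly 2, which is how the criterion penalizes needlessly complex expressions.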

4. SR-Scientist: Agentic AI for Equation Discovery

4.1 Agent Architecture

SR-Scientist uses an agent architecture for equation discovery [2]:

┌─────────────────────────────────────────────────────────────┐
│                     SR-Scientist Agent                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│    ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐    │
│    │ Planner │──▶│Searcher │──▶│ Tester  │──▶│ Refiner │    │
│    └─────────┘   └─────────┘   └─────────┘   └─────────┘    │
│         │                                         │         │
│         │              Feedback Loop              │         │
│         └─────────────────────────────────────────┘         │
│                                                             │
└─────────────────────────────────────────────────────────────┘

4.2 Planner

from typing import Any, Dict

# A search plan is represented as a plain dictionary
SearchPlan = Dict[str, Any]


class SRPlanner:
    def plan(self, task: dict) -> SearchPlan:
        # Analyze the task
        task_type = self.classify_task(task)

        # Choose a search strategy
        if task_type == 'mechanical':
            search_strategy = 'physics_guided'
        elif task_type == 'biological':
            search_strategy = 'data_driven'
        else:
            search_strategy = 'hybrid'

        # Lay out the search plan
        plan = {
            'strategy': search_strategy,
            'max_iterations': 100,
            'refinement_rounds': 5,
            'tool_usage': self.select_tools(task)
        }

        return plan

4.3 Searcher

from typing import Any, Dict, List

import numpy as np

SearchPlan = Dict[str, Any]
Candidate = Dict[str, Any]  # {'expr': str, 'score': float}


class SRSearcher:
    def search(self, plan: SearchPlan, data: np.ndarray) -> List[Candidate]:
        candidates = []

        for iteration in range(plan['max_iterations']):
            # Generate candidate expressions
            new_exprs = self.generate_candidates(plan, data)

            # Score each candidate and keep those above the threshold
            for expr in new_exprs:
                score = self.evaluate(expr, data)
                if score > self.threshold:
                    candidates.append({'expr': expr, 'score': score})

            # Update the search strategy from what has been found so far
            plan = self.update_plan(plan, candidates)

            if self.has_converged(candidates):
                break

        return candidates

    def generate_candidates(self, plan: SearchPlan, data: np.ndarray) -> List[str]:
        if plan['strategy'] == 'physics_guided':
            return self.physics_guided_search(plan, data)
        elif plan['strategy'] == 'data_driven':
            return self.data_driven_search(plan, data)
        else:
            return self.hybrid_search(plan, data)

4.4 Tester

from typing import Optional

import numpy as np
from sympy import sympify

TestResult = dict  # plain dictionary of named scores


class SRTester:
    def test(self, candidate: str, data: np.ndarray,
             ground_truth: Optional[str] = None) -> TestResult:
        result = {
            'fit_quality': self.test_fit(candidate, data),
            'simplicity': self.test_simplicity(candidate),
            'robustness': self.test_robustness(candidate, data),
            'generalization': self.test_generalization(candidate, data)
        }

        if ground_truth is not None:
            result['correctness'] = self.test_correctness(
                candidate, ground_truth
            )

        # Pass only if fit, robustness and generalization all clear their thresholds
        result['pass'] = all([
            result['fit_quality'] > 0.9,
            result['robustness'] > 0.8,
            result['generalization'] > 0.7
        ])

        return result

    def test_fit(self, candidate: str, data: np.ndarray) -> float:
        """Measure fit quality as R²."""
        expr = sympify(candidate)
        pred = self.evaluate(expr, data)

        r2 = self.calculate_r2(data['target'], pred)
        return r2

    def test_robustness(self, candidate: str, data: np.ndarray) -> float:
        """Measure robustness to noise as the mean R² over several noise levels."""
        noise_levels = [0.01, 0.05, 0.1, 0.2]
        scores = []

        for noise in noise_levels:
            noisy_data = self.add_noise(data, noise)
            pred = self.evaluate(sympify(candidate), noisy_data)
            score = self.calculate_r2(noisy_data['target'], pred)
            scores.append(score)

        return float(np.mean(scores))

    def test_generalization(self, candidate: str, data: np.ndarray) -> float:
        """Measure generalization on a held-out split."""
        train_data, test_data = self.split_data(data, ratio=0.8)

        expr = sympify(candidate)
        train_pred = self.evaluate(expr, train_data)
        test_pred = self.evaluate(expr, test_data)

        train_r2 = self.calculate_r2(train_data['target'], train_pred)
        test_r2 = self.calculate_r2(test_data['target'], test_pred)

        # Generalization gap between train and test
        gap = abs(train_r2 - test_r2)

        # Held-out R², penalized by the gap
        return test_r2 - 0.1 * gap
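The generalization score above can be worked through on a toy case. In this sketch the last 20% of an ordered dataset is held out, so the test measures extrapolation. That ordered split, the helper names, and the two example models are assumptions, and the document's `split_data` may well be random instead.

```python
def r2(y_true, y_pred):
    """Coefficient of determination; assumes y_true is not constant."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def generalization_score(model, xs, ys, ratio=0.8, gap_weight=0.1):
    """Held-out R² minus a penalty on the train/test R² gap."""
    cut = int(len(xs) * ratio)
    train_r2 = r2(ys[:cut], [model(x) for x in xs[:cut]])
    test_r2 = r2(ys[cut:], [model(x) for x in xs[cut:]])
    return test_r2 - gap_weight * abs(train_r2 - test_r2)


def exact(x):
    return x * x


def linear(x):
    # Least-squares line for y = x^2 on x = 0..7
    return 7.0 * x - 7.0


xs = [float(x) for x in range(10)]
ys = [x * x for x in xs]  # ground truth: y = x^2

print(generalization_score(exact, xs, ys))   # 1.0
print(generalization_score(linear, xs, ys))  # strongly negative
```

The linear model reaches a training R² above 0.9 on the first eight points yet fails badly on the held-out tail, which is the failure mode the 0.7 generalization threshold is meant to catch.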

4.5 Refiner

from sympy import simplify, sympify


class SRRefiner:
    def refine(self, candidate: str, test_result: dict,
               feedback: dict) -> str:
        """Refine an equation based on test results and feedback."""

        # Identify issues
        issues = self.identify_issues(test_result, feedback)

        # Choose a refinement strategy (only the first matching issue is handled)
        if 'overfitting' in issues:
            candidate = self.simplify(candidate)
        elif 'underfitting' in issues:
            candidate = self.add_complexity(candidate)
        elif 'numerical_issue' in issues:
            candidate = self.rescale(candidate)

        return candidate

    def simplify(self, expr: str) -> str:
        """Simplify the expression algebraically."""
        return str(simplify(sympify(expr)))

    def add_complexity(self, expr: str) -> str:
        """Increase expression complexity."""
        # e.g. add interaction terms or higher-order terms
        raise NotImplementedError
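The refiner depends on an `identify_issues` step that is not shown. One hypothetical rule set is sketched below, reusing the tester's pass thresholds (0.9 for fit quality, 0.7 for generalization). The mapping from scores to issue labels is an assumption made for illustration, and this version ignores the separate `feedback` argument.

```python
import math


def identify_issues(test_result, fit_min=0.9, gen_min=0.7):
    """Map tester scores to the issue labels the refiner branches on."""
    scores = [test_result[k] for k in ("fit_quality", "robustness", "generalization")]

    # Non-finite scores usually point to overflow or division by zero
    if any(math.isnan(s) or math.isinf(s) for s in scores):
        return ["numerical_issue"]

    issues = []
    if test_result["fit_quality"] < fit_min:
        # Poor fit even on the data used for the search
        issues.append("underfitting")
    elif test_result["generalization"] < gen_min:
        # Good fit that does not carry over to held-out data
        issues.append("overfitting")
    return issues


print(identify_issues({"fit_quality": 0.95, "robustness": 0.9, "generalization": 0.5}))
print(identify_issues({"fit_quality": 0.60, "robustness": 0.9, "generalization": 0.6}))
print(identify_issues({"fit_quality": 0.95, "robustness": 0.9, "generalization": 0.9}))
```

An empty list means the candidate needs no refinement under these rules.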

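5. AlphaEvolve-Style Methods

5.1 Evolve-and-Verify Framework

AlphaEvolve combines evolutionary search with machine-learning-based verification. In the listing below, an LLM proposes candidates and a verifier filters them: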

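class AlphaEvolve:
    def __init__(self, llm, verifier, population_size=20, threshold=0.5):
        # population_size and threshold defaults are illustrative
        self.llm = llm
        self.verifier = verifier
        self.population_size = population_size  # number of elites kept
        self.threshold = threshold              # minimum score to survive
        self.population = []

    def evolve(self, task: dict, max_iterations: int = 1000):
        # Initialization
        self.initialize(task)

        for iteration in range(max_iterations):
            # 1. LLM generation
            candidates = self.llm.generate(
                context=self.get_context(),
                n=10
            )

            # 2. Verification and filtering
            validated = []
            for cand in candidates:
                result = self.verifier.verify(cand, task['data'])
                if result['score'] > self.threshold:
                    validated.append((cand, result))

            # 3. Rank new candidates together with the current elites
            pool = self.population + validated
            pool.sort(key=lambda x: x[1]['score'], reverse=True)

            # 4. Elitism: keep the top individuals across generations
            self.population = pool[:self.population_size]

            # 5. Feed the current best back to the LLM
            if self.population:
                self.feedback(self.population[0])

            # Check the termination condition
            if self.population and self.is_solved(self.population):
                return self.population[0]

        return self.get_best()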
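The generate, verify, and select loop can be exercised without an LLM by swapping in a random perturbation generator. The toy sketch below fits the single coefficient of y = a * x. The stand-in generator, the scoring function, and all parameter values are assumptions, and the code illustrates the loop structure only, not AlphaEvolve itself.

```python
import random


def score(a, data):
    """Verifier stand-in: negative MSE of y = a * x (higher is better)."""
    return -sum((a * x - y) ** 2 for x, y in data) / len(data)


def propose(parents, rng, n=10, sigma=0.5):
    """Generator stand-in for the LLM: perturb randomly chosen parents."""
    return [rng.choice(parents) + rng.gauss(0.0, sigma) for _ in range(n)]


def evolve(data, iterations=50, population_size=5, seed=0):
    rng = random.Random(seed)
    population = [(score(a, data), a) for a in (0.0, 1.0)]
    history = []
    for _ in range(iterations):
        parents = [a for _, a in population]
        children = [(score(a, data), a) for a in propose(parents, rng)]
        # Elitism: survivors are drawn from the old population plus the children
        pool = sorted(population + children, reverse=True)
        population = pool[:population_size]
        history.append(population[0][0])
    return population[0], history


data = [(x, 3.0 * x) for x in range(1, 6)]  # ground truth: a = 3
(best_score, best_a), history = evolve(data)
print(best_a, best_score)
```

Because survivors are chosen from parents and children together, the best score recorded in `history` can never decrease from one iteration to the next.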

6. Integrating Physical Constraints

6.1 Conservation-Law Constraints

import numpy as np


class ConservationConstraint:
    def __init__(self, conservation_type):
        self.type = conservation_type  # energy, momentum, charge, etc.

    def check(self, equation: str, data: np.ndarray) -> float:
        if self.type == 'energy':
            return self.check_energy_conservation(equation, data)
        elif self.type == 'momentum':
            return self.check_momentum_conservation(equation, data)
        raise ValueError(f"unsupported conservation type: {self.type!r}")

    def check_energy_conservation(self, equation: str,
                                  data: np.ndarray) -> float:
        """Check energy conservation."""
        # Compute the change in energy
        # ...
        raise NotImplementedError
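The energy check is left unspecified above. As one possible reading, the sketch below scores how constant the total energy of a mass-spring system stays along a trajectory. Working on sampled positions and velocities instead of an equation string, and the particular scoring formula, are both assumptions made for illustration.

```python
import math


def energy_conservation_score(positions, velocities, mass=1.0, k=1.0):
    """Score in (0, 1]; 1.0 means kinetic plus spring energy is constant."""
    energies = [
        0.5 * mass * v * v + 0.5 * k * x * x
        for x, v in zip(positions, velocities)
    ]
    mean = sum(energies) / len(energies)
    if mean == 0.0:
        return 1.0  # system at rest: trivially conserved
    drift = max(abs(e - mean) for e in energies)
    return 1.0 / (1.0 + drift / abs(mean))


ts = [0.1 * i for i in range(100)]

# Undamped oscillator: x = cos(t), v = -sin(t)
xs = [math.cos(t) for t in ts]
vs = [-math.sin(t) for t in ts]

# Damped oscillator: x = exp(-0.1 t) cos(t), v = dx/dt
xs_damped = [math.exp(-0.1 * t) * math.cos(t) for t in ts]
vs_damped = [math.exp(-0.1 * t) * (-0.1 * math.cos(t) - math.sin(t)) for t in ts]

print(energy_conservation_score(xs, vs))                # approximately 1.0
print(energy_conservation_score(xs_damped, vs_damped))  # clearly lower
```

A candidate equation whose simulated trajectory scores low here would be penalized when the system is known to be conservative.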

6.2 Symmetry Constraints

class SymmetryConstraint:
    def check(self, equation: str, symmetry_type: str) -> bool:
        if symmetry_type == 'translation':
            return self.check_translation_invariance(equation)
        elif symmetry_type == 'rotation':
            return self.check_rotation_invariance(equation)
        elif symmetry_type == 'scale':
            return self.check_scale_invariance(equation)
        raise ValueError(f"unsupported symmetry type: {symmetry_type!r}")

    def check_scale_invariance(self, equation: str) -> bool:
        """Check scale invariance."""
        # For a correctly discovered physical law,
        # rescaling the variables should leave the form of the equation unchanged
        raise NotImplementedError
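One way to make the scale-invariance check concrete is to test numerically whether a function is homogeneous, that is, whether f(s*x) equals s**degree * f(x) for all inputs. The sketch below does this at random positive points. Treating scale invariance as homogeneity, along with the function names and tolerances, is an assumption for illustration.

```python
import random


def is_homogeneous(f, n_vars, degree, scales=(0.5, 2.0, 10.0),
                   trials=20, tol=1e-9, seed=0):
    """Numerically test f(s*x) == s**degree * f(x) at random positive points."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = [rng.uniform(0.1, 10.0) for _ in range(n_vars)]
        base = f(*x)
        for s in scales:
            scaled = f(*[s * xi for xi in x])
            expected = s ** degree * base
            if abs(scaled - expected) > tol * max(1.0, abs(expected)):
                return False
    return True


def kinetic_energy(m, v):
    return 0.5 * m * v * v  # degree 3 when m and v are scaled together


def with_offset(m, v):
    return 0.5 * m * v * v + 1.0  # the constant term breaks homogeneity


def mass_product_over_r2(m1, m2, r):
    return m1 * m2 / (r * r)  # degree 0: unchanged under uniform scaling


print(is_homogeneous(kinetic_energy, n_vars=2, degree=3))        # True
print(is_homogeneous(with_offset, n_vars=2, degree=3))           # False
print(is_homogeneous(mass_product_over_r2, n_vars=3, degree=0))  # True
```

A numerical test like this can only refute homogeneity with certainty. Passing it at sampled points is evidence, not proof.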

7. Evaluation and Benchmarks

7.1 Feynman Equation Discovery Benchmark

Equation complexity    Success rate (PhysX)    Success rate (SR-Scientist)
                       95%                     92%
                       88%                     85%
                       72%                     68%
Very high              45%                     52%

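7.2 Evaluation on Real-World Data

Dataset        Physics domain      Discovery rate    Accuracy
pendulum       Mechanics           85%               0.92
spring_mass    Mechanics           82%               0.89
circuit        Electromagnetism    68%               0.84
chemical       Chemistry           55%               0.78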

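8. Limitations and Future Directions

8.1 Current Limitations

  1. Long equations: equations for complex physical laws remain hard to discover
  2. Constant estimation: physical constants are estimated inaccurately
  3. Multi-scale problems: laws that span multiple scales are hard to discover
  4. Noise sensitivity: noise in experimental data degrades discovery

8.2 Future Directions

  1. Multi-physics integration: handling coupled physical systems
  2. Uncertainty quantification: estimating the uncertainty of discovered equations
  3. Interactive discovery: human-machine collaboration in equation discovery
  4. Cross-domain transfer: transferring equation-discovery knowledge between fields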

9. References

  1. PhysX: Physics-Guided LLM Agent for Scientific Equation Discovery (2026)

  2. SR-Scientist: Agentic AI for Equation Discovery (2026)