6. Surrogate Models in Bgolearn#

Warning

Important: all data must be pandas DataFrames/Series, not numpy arrays!

  • data_matrix: a pd.DataFrame with column names

  • Measured_response: a pd.Series

  • virtual_samples: a pd.DataFrame with the same column names

Passing numpy arrays raises: AttributeError: 'numpy.ndarray' object has no attribute 'columns'
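
If your data starts out as numpy arrays, wrap it first. A minimal sketch (the column names here are illustrative):

import numpy as np
import pandas as pd

X_np = np.array([[2.0, 1.2, 0.5], [3.5, 0.8, 0.7]])
y_np = np.array([250.0, 280.0])

# Wrap the arrays as pandas objects with named columns before calling fit()
data_matrix = pd.DataFrame(X_np, columns=['Cu', 'Mg', 'Si'])
measured_response = pd.Series(y_np)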

Note

This page explains the different surrogate models available in Bgolearn and how to choose the right one for your optimization problem.

6.1. Overview#

Surrogate models (also called metamodels) are the core of Bayesian optimization. They approximate the expensive objective function and provide the uncertainty estimates that guide the optimization process.
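
For example, the expected improvement (EI) acquisition used throughout the examples on this page, in its usual form, scores each candidate $x$ by combining the surrogate's predictive mean $\mu(x)$ and standard deviation $\sigma(x)$ (for maximization, with current best observation $f^{*}$):

$$
\mathrm{EI}(x) = \sigma(x)\left[z\,\Phi(z) + \phi(z)\right],
\qquad z = \frac{\mu(x) - f^{*}}{\sigma(x)},
$$

where $\Phi$ and $\phi$ are the standard normal CDF and PDF. Models with poor uncertainty estimates (see the comparison below) make this exploration/exploitation trade-off less reliable.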

Bgolearn supports several surrogate models, each with different strengths and use cases:

  1. Gaussian Process (GP) - the default and most widely used

  2. Random Forest (RF) - well suited to discrete/categorical features

  3. Support Vector Regression (SVR) - robust to noise

  4. Multi-Layer Perceptron (MLP) - a neural-network approach

  5. AdaBoost - an ensemble method

6.2. Gaussian Process (GaussianProcess)#

6.2.1. Theory#

Gaussian Processes are the canonical surrogate for Bayesian optimization: within a single principled framework they provide both the predictions and the uncertainty estimates the search requires.
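
Concretely, for training data $(X, y)$ with kernel $k$ and noise variance $\sigma_n^2$, the GP posterior at a new point $x_*$ has a closed form:

$$
\mu(x_*) = k_*^{\top}\left(K + \sigma_n^2 I\right)^{-1} y,
\qquad
\sigma^2(x_*) = k(x_*, x_*) - k_*^{\top}\left(K + \sigma_n^2 I\right)^{-1} k_*,
$$

where $K$ is the kernel matrix over $X$ and $k_*$ holds the kernel values between $x_*$ and the training points. Inverting $K + \sigma_n^2 I$ is the source of the O(n³) cost noted under limitations below.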

from Bgolearn import BGOsampling

# Use Gaussian Process (default)
opt = BGOsampling.Bgolearn()
model = opt.fit(
    data_matrix=data_matrix,
    Measured_response=measured_response,
    virtual_samples=virtual_samples,
    Classifier='GaussianProcess'  # Explicit specification
)

6.2.2. Advantages#

GP Advantages

  • Uncertainty quantification: provides predictive uncertainty

  • Theoretical foundation: a well-established Bayesian framework

  • Smooth interpolation: well suited to continuous functions

  • Few hyperparameters: relatively easy to tune

  • Global optimization: good at locating global optima

6.2.3. Limitations#

GP Limitations

  • Computational cost: scales as O(n³) in the number of training points

  • Smoothness assumption: assumes the underlying function is smooth

  • High-dimensional challenges: struggles with many features (>20)

  • Categorical features: handles discrete variables poorly

6.2.4. Best Use Cases#

  • Continuous optimization problems

  • Smooth objective functions

  • Small to medium datasets (<1000 samples)

  • When uncertainty matters

  • Materials property optimization

6.2.5. Example: Alloy Optimization with GP#

import numpy as np
import pandas as pd
from Bgolearn import BGOsampling

# Alloy composition optimization - use pandas DataFrame
data_matrix = pd.DataFrame([
    [2.0, 1.2, 0.5],  # Cu, Mg, Si
    [3.5, 0.8, 0.7],
    [1.8, 1.5, 0.3],
    [4.2, 0.9, 0.8]
], columns=['Cu', 'Mg', 'Si'])

strength_values = pd.Series([250, 280, 240, 290])  # MPa

virtual_samples = pd.DataFrame([
    [2.5, 1.0, 0.6],
    [3.0, 1.3, 0.4],
    [3.8, 0.9, 0.8]
], columns=['Cu', 'Mg', 'Si'])

# GP optimization
opt = BGOsampling.Bgolearn()
model = opt.fit(
    data_matrix=data_matrix,  # DataFrame
    Measured_response=strength_values,  # Series
    virtual_samples=virtual_samples,  # DataFrame
    Classifier='GaussianProcess',
    CV_test=2,  # 2-fold cross-validation
    Normalize=True
)

print("GP optimization completed!")

6.3. Random Forest (RandomForest)#

6.3.1. Theory#

Random Forest builds many decision trees and averages their predictions. It is particularly good at handling discrete features and nonlinear relationships.
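
For regression, the forest's prediction is the average over its $T$ trees,

$$
\hat{f}(x) = \frac{1}{T}\sum_{t=1}^{T} h_t(x),
$$

where each tree $h_t$ is grown on a bootstrap sample with random feature subsets. The spread across trees gives only a rough uncertainty proxy, which is why RF's uncertainty quantification is weaker than a GP's.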

# Use Random Forest
model = opt.fit(
    data_matrix=data_matrix,
    Measured_response=measured_response,
    virtual_samples=virtual_samples,
    Classifier='RandomForest'
)

6.3.2. Advantages#

RF Advantages

  • Handles discrete features: a natural fit for categorical variables

  • Nonlinear relationships: captures complex patterns

  • Robust to outliers: less sensitive to noisy data

  • Fast training: efficient on large datasets

  • Feature importance: provides feature rankings

  • Assumption-free: does not assume a smooth function

6.3.3. Limitations#

RF Limitations

  • Limited uncertainty: weak uncertainty quantification

  • Overfitting risk: can overfit small datasets

  • Discontinuous: produces step-like predictions

  • Hyperparameter tuning: many parameters to optimize

6.3.4. Best Use Cases#

  • Discrete/categorical features

  • Large datasets (>1000 samples)

  • Non-smooth functions

  • When robustness matters

  • Mixed variable types

6.3.5. Example: Processing Parameter Optimization#

# Processing parameters with discrete levels - use pandas DataFrame
processing_data = pd.DataFrame([
    [450, 2, 1],    # Temperature, Time, Atmosphere (1=N2, 2=Ar, 3=Air)
    [500, 4, 2],
    [550, 6, 3],
    [480, 3, 1]
], columns=['Temperature', 'Time', 'Atmosphere'])

hardness_values = pd.Series([180, 220, 250, 200])

virtual_processing = pd.DataFrame([
    [475, 3.5, 1],
    [525, 4.5, 2],
    [490, 2.5, 3]
], columns=['Temperature', 'Time', 'Atmosphere'])

# Random Forest for mixed variables
model = opt.fit(
    data_matrix=processing_data,  # DataFrame
    Measured_response=hardness_values,  # Series
    virtual_samples=virtual_processing,  # DataFrame
    Classifier='RandomForest',
    CV_test='LOOCV'  # Leave-one-out cross-validation
)

print("Random Forest optimization completed!")

6.4. Support Vector Regression (SVR)#

6.4.1. Theory#

SVR applies support vector machines to regression, seeking a function that deviates from the targets by at most ε while remaining as flat as possible.
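
In its basic form, SVR solves

$$
\min_{w,\,b} \; \frac{1}{2}\lVert w\rVert^2
\quad \text{subject to} \quad
\left|\,y_i - \left(w^{\top}\phi(x_i) + b\right)\right| \le \varepsilon \;\;\forall i,
$$

where $\phi$ is the (implicit) kernel feature map; the soft-margin version adds slack variables with a penalty $C$. Errors smaller than $\varepsilon$ are simply ignored, which is what makes SVR robust to moderate noise.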

# Use SVR
model = opt.fit(
    data_matrix=data_matrix,
    Measured_response=measured_response,
    virtual_samples=virtual_samples,
    Classifier='SVR'
)

6.4.2. Advantages#

SVR Advantages

  • Robust to noise: handles noisy data well

  • High dimensions: works well with many features

  • Kernel flexibility: a choice of different kernels

  • Sparse solution: uses only the support vectors

  • Regularization: built-in protection against overfitting

6.4.3. Limitations#

SVR Limitations

  • Parameter sensitivity: requires careful tuning

  • No uncertainty: provides no predictive uncertainty

  • Kernel choice: picking the right kernel is critical

  • Computational cost: can be slow on large datasets

6.4.4. Best Use Cases#

  • Noisy data

  • High-dimensional problems

  • When robustness is critical

  • Nonlinear relationships

6.4.5. Example: High-Dimensional Optimization#

# High-dimensional alloy with many elements - use pandas DataFrame
element_names = ['Al', 'Cu', 'Mg', 'Si', 'Fe', 'Mn', 'Cr', 'Zn']
high_dim_data = pd.DataFrame(
    np.random.random((20, 8)),  # 8 alloying elements
    columns=element_names
)
high_dim_response = pd.Series(np.random.random(20) * 100 + 200)

high_dim_virtual = pd.DataFrame(
    np.random.random((50, 8)),
    columns=element_names
)

# SVR for high-dimensional problem
model = opt.fit(
    data_matrix=high_dim_data,  # DataFrame
    Measured_response=high_dim_response,  # Series
    virtual_samples=high_dim_virtual,  # DataFrame
    Classifier='SVR',
    Normalize=True
)

print("SVR optimization completed!")

6.5. Multi-Layer Perceptron (MLP)#

6.5.1. Theory#

An MLP is a neural network with one or more hidden layers that can approximate complex nonlinear functions.
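
Formally, an $L$-layer MLP computes a composition of affine maps and nonlinearities,

$$
f(x) = W_L\,\sigma\!\left(\cdots\,\sigma\!\left(W_1 x + b_1\right)\cdots\right) + b_L,
$$

where $\sigma$ is an activation function such as ReLU or tanh, and the weights $W_i, b_i$ are fit by gradient descent on the training loss.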

# Use MLP
model = opt.fit(
    data_matrix=data_matrix,
    Measured_response=measured_response,
    virtual_samples=virtual_samples,
    Classifier='MLP'
)

6.5.2. Advantages#

MLP Advantages

  • Universal approximator: can approximate any continuous function

  • Nonlinear modeling: well suited to complex relationships

  • Scalable: handles large datasets

  • Flexible architecture: customizable network structure

6.5.3. Limitations#

MLP Limitations

  • Data hungry: requires plenty of training data

  • Hyperparameter tuning: many parameters to optimize

  • Overfitting risk: prone to overfitting on small datasets

  • No uncertainty: provides no uncertainty estimates

  • Training instability: sensitive to initialization

6.5.4. Best Use Cases#

  • Large datasets (>500 samples)

  • Complex nonlinear relationships

  • When plenty of data is available

  • Pattern-recognition tasks

6.6. AdaBoost#

6.6.1. Theory#

AdaBoost (Adaptive Boosting) combines multiple weak learners into a strong predictor.
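
The final predictor is a weighted sum of the weak learners,

$$
F(x) = \sum_{t=1}^{T} \alpha_t\, h_t(x),
$$

where each $h_t$ is trained with sample weights that emphasize the points earlier learners handled poorly, and $\alpha_t$ grows with $h_t$'s accuracy.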

# Use AdaBoost
model = opt.fit(
    data_matrix=data_matrix,
    Measured_response=measured_response,
    virtual_samples=virtual_samples,
    Classifier='AdaBoost'
)

6.6.2. Advantages#

AdaBoost Advantages

  • Ensemble method: combines multiple models

  • Adaptive: focuses on hard samples

  • Reduced bias: often improves prediction accuracy

  • Versatile: works with different base learners

6.6.3. Limitations#

AdaBoost Limitations

  • Noise sensitivity: may overfit noisy data

  • Computational cost: slower than a single model

  • Parameter tuning: requires careful tuning

  • Limited uncertainty: weak uncertainty quantification

6.7. Model Selection Guide#

6.7.1. Decision Tree#

Start here
    ↓
Do you need uncertainty estimates?
  ↓ Yes                      ↓ No
Is the data smooth?      What type of features?
  ↓ Yes    ↓ No            ↓ Discrete        ↓ Continuous
  GP       Random Forest   Random Forest     Large dataset (>500)?
                                               ↓ Yes    ↓ No
                                               MLP      SVR
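
The same logic can be packaged as a small helper. This is a hypothetical convenience function, not part of Bgolearn; it simply returns a Classifier string following one reading of the flowchart above:

def choose_classifier(n_samples, needs_uncertainty=False,
                      has_discrete_features=False, smooth_response=True):
    """Suggest a Bgolearn Classifier string from coarse problem traits."""
    if needs_uncertainty:
        # Uncertainty required: GP for smooth data, otherwise Random Forest
        return 'GaussianProcess' if smooth_response else 'RandomForest'
    if has_discrete_features:
        return 'RandomForest'
    # Continuous features, no uncertainty requirement
    return 'MLP' if n_samples > 500 else 'SVR'

print(choose_classifier(n_samples=40, needs_uncertainty=True))  # GaussianProcess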

6.7.2. Detailed Selection Criteria#

Table 6.1 Model selection guide#

Criterion           Gaussian Process   Random Forest   SVR          MLP          AdaBoost
Data size           <1000              >500            >500         >500         >500
Feature type        Continuous         Mixed           Continuous   Continuous   Mixed
Uncertainty         Excellent          Poor            None         None         Poor
Noise tolerance     Moderate           Good            Good         Moderate     Poor
Training speed      Slow               Fast            Moderate     Moderate     Slow
Prediction speed    Moderate           Fast            Fast         Fast         Moderate

6.8. Practical Examples#

6.8.1. Model Comparison#

# Create test data for model comparison
data_matrix = pd.DataFrame([
    [2.0, 1.2, 0.5],
    [3.5, 0.8, 0.7],
    [1.8, 1.5, 0.3],
    [4.2, 0.9, 0.8],
    [2.8, 1.1, 0.6]
], columns=['Cu', 'Mg', 'Si'])

measured_response = pd.Series([250, 280, 240, 290, 265])

virtual_samples = pd.DataFrame([
    [2.5, 1.0, 0.6],
    [3.0, 1.3, 0.4],
    [3.8, 0.9, 0.8]
], columns=['Cu', 'Mg', 'Si'])

# Compare all models on the same problem
models = ['GaussianProcess', 'RandomForest', 'SVR', 'MLP', 'AdaBoost']
results = {}

for model_name in models:
    print(f"Testing {model_name}...")

    try:
        model = opt.fit(
            data_matrix=data_matrix,  # DataFrame
            Measured_response=measured_response,  # Series
            virtual_samples=virtual_samples,  # DataFrame
            Classifier=model_name,
            CV_test=5,  # 5-fold cross-validation
            Normalize=True
        )
        
        # Get results using EI
        ei_values, recommended_points = model.EI()
        recommended_point = recommended_points[0]

        # Cross-validation score, if the fitted model exposes one
        cv_score = getattr(model, 'cv_score_', 0.0)

        results[model_name] = {
            'recommended_point': recommended_point,
            'ei_max': np.max(ei_values),
            'cv_score': cv_score,
            'success': True
        }

        print(f"  Success: EI max = {np.max(ei_values):.3f}")
        
    except Exception as e:
        print(f"  Failed: {str(e)}")
        results[model_name] = {'success': False, 'error': str(e)}

# Display comparison
print("\nModel Performance Comparison:")
print("-" * 50)
for name, result in results.items():
    if result['success']:
        print(f"{name:15s}: Success - EI max = {result['ei_max']:.3f}")
    else:
        print(f"{name:15s}: Failed - {result.get('error', 'Unknown error')}")

6.8.2. Hyperparameter Tuning#

# Hyperparameter tuning in Bgolearn
import numpy as np
import pandas as pd
from Bgolearn import BGOsampling

# Create test data
data_matrix = pd.DataFrame([
    [2.0, 1.2, 0.5, 450],  # Cu, Mg, Si, Temperature
    [3.5, 0.8, 0.7, 500],
    [1.8, 1.5, 0.3, 480],
    [4.2, 0.9, 0.8, 520],
    [2.8, 1.1, 0.6, 490],
    [3.2, 1.3, 0.4, 510],
    [2.5, 0.9, 0.7, 470],
    [3.8, 1.0, 0.5, 530]
], columns=['Cu', 'Mg', 'Si', 'Temperature'])

strength_values = pd.Series([250, 280, 240, 290, 265, 275, 245, 295])

virtual_samples = pd.DataFrame([
    [2.5, 1.0, 0.6, 485],
    [3.0, 1.3, 0.4, 505],
    [3.8, 0.9, 0.8, 515]
], columns=['Cu', 'Mg', 'Si', 'Temperature'])

# Hyperparameter tuning: compare different cross-validation settings
cv_settings = [3, 5, 10, 'LOOCV']
normalization_settings = [True, False]

best_score = -np.inf
best_config = None
results = {}

opt = BGOsampling.Bgolearn()

print("Starting hyperparameter tuning...")
print("=" * 50)

for cv_test in cv_settings:
    for normalize in normalization_settings:
        config_name = f"CV_{cv_test}_Norm_{normalize}"
        print(f"Testing configuration: {config_name}")

        try:
            model = opt.fit(
                data_matrix=data_matrix,
                Measured_response=strength_values,
                virtual_samples=virtual_samples,
                Classifier='GaussianProcess',
                CV_test=cv_test,
                Normalize=normalize,
                seed=42  # Ensure reproducibility
            )

            # Get EI values as performance metric
            ei_values, recommended_points = model.EI()
            max_ei = np.max(ei_values)

            # Calculate prediction quality
            predicted_mean = model.virtual_samples_mean
            predicted_std = model.virtual_samples_std

            # Combined score: EI max + inverse of prediction uncertainty
            uncertainty_score = 1.0 / (np.mean(predicted_std) + 1e-6)
            combined_score = max_ei + 0.1 * uncertainty_score

            results[config_name] = {
                'max_ei': max_ei,
                'mean_uncertainty': np.mean(predicted_std),
                'combined_score': combined_score,
                'recommended_point': recommended_points[0],
                'success': True
            }

            if combined_score > best_score:
                best_score = combined_score
                best_config = config_name

            print(f"  Success - EI max: {max_ei:.3f}, Combined score: {combined_score:.3f}")

        except Exception as e:
            print(f"  Failed: {str(e)}")
            results[config_name] = {'success': False, 'error': str(e)}

# Display tuning results
print(f"\n Hyperparameter tuning results:")
print("=" * 50)
print(f"{'Configuration':<20} {'EI Max':<10} {'Mean Uncertainty':<15} {'Combined Score':<15}")
print("-" * 60)

for config, result in results.items():
    if result['success']:
        print(f"{config:<20} {result['max_ei']:<10.3f} {result['mean_uncertainty']:<15.3f} {result['combined_score']:<15.3f}")
    else:
        print(f"{config:<20} {'Failed':<10} {'-':<15} {'-':<15}")

print(f"\n Best configuration: {best_config}")
print(f" Best score: {best_score:.3f}")

if best_config and results[best_config]['success']:
    best_point = results[best_config]['recommended_point']
    print(f" Next experiment point recommended by best configuration:")
    print(f"   Cu: {best_point[0]:.2f}%, Mg: {best_point[1]:.2f}%, Si: {best_point[2]:.2f}%, T: {best_point[3]:.0f}K")

6.9. Advanced Topics#

6.9.1. Ensemble Methods#

Ensemble methods can be built on top of Bgolearn by combining multiple surrogate models for improved performance:

import numpy as np
import pandas as pd
from Bgolearn import BGOsampling

# Create test data
data_matrix = pd.DataFrame([
    [2.0, 1.2, 0.5],
    [3.5, 0.8, 0.7],
    [1.8, 1.5, 0.3],
    [4.2, 0.9, 0.8],
    [2.8, 1.1, 0.6],
    [3.2, 1.3, 0.4]
], columns=['Cu', 'Mg', 'Si'])

strength_values = pd.Series([250, 280, 240, 290, 265, 275])

virtual_samples = pd.DataFrame([
    [2.5, 1.0, 0.6],
    [3.0, 1.3, 0.4],
    [3.8, 0.9, 0.8],
    [2.2, 1.4, 0.5],
    [3.6, 0.7, 0.6]
], columns=['Cu', 'Mg', 'Si'])

# Define ensemble method class
class BgolearnEnsemble:
    """Bgolearn ensemble method implementation"""

    def __init__(self, model_types=['GaussianProcess', 'RandomForest', 'SVR']):
        self.model_types = model_types
        self.models = {}
        self.weights = None

    def fit(self, data_matrix, measured_response, virtual_samples, **kwargs):
        """Train all models"""
        print(" Training ensemble models...")

        opt = BGOsampling.Bgolearn()

        for model_type in self.model_types:
            print(f"  Training {model_type}...")
            try:
                model = opt.fit(
                    data_matrix=data_matrix,
                    Measured_response=measured_response,
                    virtual_samples=virtual_samples,
                    Classifier=model_type,
                    CV_test=5,
                    Normalize=True,
                    **kwargs
                )
                self.models[model_type] = model
                print(f"  {model_type} training successful")
            except Exception as e:
                print(f"  {model_type} training failed: {e}")

        # Calculate model weights (based on EI performance)
        self._calculate_weights()

    def _calculate_weights(self):
        """Calculate model weights based on EI performance"""
        if not self.models:
            return

        ei_scores = {}
        for name, model in self.models.items():
            try:
                ei_values, _ = model.EI()
                ei_scores[name] = np.max(ei_values)
            except Exception:
                ei_scores[name] = 0.0

        # Normalize weights
        total_score = sum(ei_scores.values())
        if total_score > 0:
            self.weights = {name: score/total_score for name, score in ei_scores.items()}
        else:
            # Equal weights
            self.weights = {name: 1.0/len(self.models) for name in self.models.keys()}

        print(f"Model weights: {self.weights}")

    def ensemble_EI(self):
        """Ensemble expected improvement"""
        if not self.models:
            raise ValueError("No trained models available")

        ensemble_ei = None
        ensemble_points = None

        for name, model in self.models.items():
            try:
                ei_values, points = model.EI()
                weight = self.weights.get(name, 0.0)

                if ensemble_ei is None:
                    ensemble_ei = weight * ei_values
                    ensemble_points = points
                else:
                    ensemble_ei += weight * ei_values

            except Exception as e:
                print(f" {name} EI calculation failed: {e}")

        # Find the point corresponding to maximum EI
        if ensemble_ei is not None:
            max_idx = np.argmax(ensemble_ei)
            best_point = ensemble_points[max_idx:max_idx+1]  # Maintain dimensions
            return ensemble_ei, best_point
        else:
            raise ValueError("All model EI calculations failed")

    def ensemble_predictions(self):
        """Ensemble predictions"""
        predictions = {}
        uncertainties = {}

        for name, model in self.models.items():
            try:
                pred_mean = model.virtual_samples_mean
                pred_std = model.virtual_samples_std

                predictions[name] = pred_mean
                uncertainties[name] = pred_std

            except Exception as e:
                print(f" {name} prediction failed: {e}")

        if not predictions:
            raise ValueError("All model predictions failed")

        # Weighted average predictions
        ensemble_mean = np.zeros_like(list(predictions.values())[0])
        ensemble_std = np.zeros_like(list(uncertainties.values())[0])

        for name, pred in predictions.items():
            weight = self.weights.get(name, 0.0)
            ensemble_mean += weight * pred
            ensemble_std += weight * uncertainties[name]

        return ensemble_mean, ensemble_std

    def get_model_comparison(self):
        """Get model comparison results"""
        comparison = {}

        for name, model in self.models.items():
            try:
                ei_values, points = model.EI()
                pred_mean = model.virtual_samples_mean
                pred_std = model.virtual_samples_std

                comparison[name] = {
                    'max_ei': np.max(ei_values),
                    'mean_prediction': np.mean(pred_mean),
                    'mean_uncertainty': np.mean(pred_std),
                    'weight': self.weights.get(name, 0.0),
                    'recommended_point': points[0]
                }
            except Exception as e:
                comparison[name] = {'error': str(e)}

        return comparison

# Using ensemble methods
print("Starting ensemble method demonstration")
print("=" * 50)

# Create and train ensemble model
ensemble = BgolearnEnsemble(
    model_types=['GaussianProcess', 'RandomForest', 'SVR']
)

ensemble.fit(
    data_matrix=data_matrix,
    measured_response=strength_values,
    virtual_samples=virtual_samples,
    seed=42
)

# Get ensemble predictions
print("\n Ensemble prediction results:")
try:
    ensemble_mean, ensemble_std = ensemble.ensemble_predictions()
    print(f"Ensemble prediction mean: {ensemble_mean[:3]}")  # Show first 3
    print(f"Ensemble prediction std: {ensemble_std[:3]}")
except Exception as e:
    print(f"Ensemble prediction failed: {e}")

# Get ensemble EI
print("\n Ensemble Expected Improvement:")
try:
    ensemble_ei, ensemble_point = ensemble.ensemble_EI()
    max_ei_idx = np.argmax(ensemble_ei)
    print(f"Maximum ensemble EI value: {ensemble_ei[max_ei_idx]:.3f}")
    print(f"Recommended next experiment point: Cu={ensemble_point[0][0]:.2f}, Mg={ensemble_point[0][1]:.2f}, Si={ensemble_point[0][2]:.2f}")
except Exception as e:
    print(f"Ensemble EI calculation failed: {e}")

# Model comparison
print("\n Model performance comparison:")
comparison = ensemble.get_model_comparison()
print(f"{'Model':<15} {'Max EI':<10} {'Weight':<8} {'Mean Uncertainty':<15}")
print("-" * 55)

for name, metrics in comparison.items():
    if 'error' not in metrics:
        print(f"{name:<15} {metrics['max_ei']:<10.3f} {metrics['weight']:<8.3f} {metrics['mean_uncertainty']:<15.3f}")
    else:
        print(f"{name:<15} {'Failed':<10} {'-':<8} {'-':<15}")

print("\n Ensemble method advantages:")
print("  - Combines strengths of multiple models")
print("  - Improves prediction stability")
print("  - Reduces single model bias")
print("  - Automatic weight allocation")

6.9.2. Transfer Learning#

Reuse knowledge from a related, previously studied problem. Bgolearn has no built-in transfer-learning API; the sketch below shows one simple approach, pooling source and target observations before fitting:

# Conceptual transfer learning: warm-start the target problem by pooling data
def transfer_learning(source_X, source_y, target_X, target_y, virtual_samples):
    """Combine source and target observations, then fit a fresh surrogate."""
    combined_X = pd.concat([source_X, target_X], ignore_index=True)
    combined_y = pd.concat([source_y, target_y], ignore_index=True)
    opt = BGOsampling.Bgolearn()
    return opt.fit(data_matrix=combined_X,
                   Measured_response=combined_y,
                   virtual_samples=virtual_samples)

6.9.3. Online Learning#

Update the model as new measurements arrive. Bgolearn models are refit rather than updated incrementally, so a simple sketch appends the new observation and retrains:

# Conceptual online learning: append the new observation and refit
def online_update(data_matrix, measured_response, virtual_samples, new_x, new_y):
    """Add one new observation (new_x: one-row DataFrame) and retrain."""
    data_matrix = pd.concat([data_matrix, new_x], ignore_index=True)
    measured_response = pd.concat(
        [measured_response, pd.Series([new_y])], ignore_index=True)
    opt = BGOsampling.Bgolearn()
    return opt.fit(data_matrix=data_matrix,
                   Measured_response=measured_response,
                   virtual_samples=virtual_samples)

6.10. Troubleshooting#

6.10.1. Common Issues#

  1. Poor CV scores

    • Try a different model

    • Check data quality

    • Add more training data

    • Adjust normalization

  2. Slow training

    • Use Random Forest for large datasets

    • Reduce the virtual space size

    • Consider SVR for high dimensions

  3. Overfitting

    • Use cross-validation

    • Reduce model complexity

    • Add more training data

  4. Poor uncertainty estimates

    • Use a Gaussian Process

    • Increase bootstrap iterations

    • Check model assumptions

6.10.2. Performance Optimization#

# Tips for better performance
optimization_tips = {
    "Data preprocessing": "Normalize features, remove outliers",
    "Model selection": "Start with GP, try RF for discrete features",
    "Hyperparameters": "Use cross-validation for tuning",
    "Computational": "Reduce virtual space size for speed",
    "Validation": "Use CV_test=10 or 'LOOCV' for validation"
}

6.11. Next Steps#

See also

For more on surrogate modeling:

  • Rasmussen, C.E. “Gaussian Processes for Machine Learning”

  • Forrester, A. “Engineering Design via Surrogate Modelling”

  • Queipo, N.V. “Surrogate-based Analysis and Optimization”