Metrics Guide

Erasus provides 26+ metrics to evaluate unlearning quality across four dimensions: forgetting, utility, privacy, and efficiency.

Forgetting Quality

Measures how well the model has forgotten the target data:

  • Accuracy — Classification accuracy on forget vs retain sets. After unlearning, forget accuracy should drop while retain accuracy stays high.

  • MIA (Membership Inference Attack) — Trains a shadow model to predict whether a sample was part of the training data. An attack AUC close to 0.5 (chance level) indicates successful unlearning.

  • KL Divergence — Measures the shift between the output distributions of the unlearned model and a retrained-from-scratch model. Lower values mean the unlearned model behaves more like the retrained one.

  • Extraction Attack — Tests if memorised data can be extracted from the model after unlearning.
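The two numeric criteria above (attack AUC near 0.5, low KL divergence to a retrained model) can be illustrated with a minimal, standard-library sketch. This is not the Erasus implementation: the helper names `softmax`, `kl_divergence`, and `attack_auc` are hypothetical, and the attack scores are assumed to come from some membership classifier that you have already run.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def attack_auc(member_scores, nonmember_scores):
    """Probability that a random member outscores a random non-member.

    Ties count as 0.5, so an attack that cannot tell the two groups
    apart lands at 0.5 (chance level).
    """
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


# Hypothetical output distributions of the unlearned and retrained models
unlearned = softmax([2.0, 1.0, 0.1])
retrained = softmax([1.9, 1.1, 0.1])
print(f"KL(unlearned || retrained) = {kl_divergence(unlearned, retrained):.4f}")

# Hypothetical attack scores on forget samples vs. never-seen samples
print(f"attack AUC = {attack_auc([0.52, 0.48, 0.50], [0.49, 0.51, 0.50]):.2f}")
```

The pairwise AUC loop is quadratic in the number of samples, which is fine for illustration; a rank-based computation would be preferable on large evaluation sets.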

Model Utility

Measures preservation of useful model capabilities:

  • BLEU — Machine translation / text generation quality

  • ROUGE — Summarisation quality (ROUGE-N, ROUGE-L)

  • CLIP Score — Image-text alignment quality

  • Inception Score — Image generation quality / diversity

  • Downstream Tasks — Performance on held-out evaluation tasks
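To make the overlap-based text metrics more concrete, here is a rough sketch of ROUGE-N recall using whitespace tokenisation. The function names are hypothetical and this is a simplified variant, not the scoring code Erasus uses; production ROUGE implementations usually add stemming and report precision and F-measure as well.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also appear in the candidate.

    Counts are clipped, so a repeated n-gram in the reference is only
    matched as many times as it occurs in the candidate.
    """
    cand_counts = Counter(ngrams(candidate.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    if not ref_counts:
        return 0.0
    overlap = sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
    return overlap / sum(ref_counts.values())


reference = "the cat sat on the mat"
print(rouge_n_recall("the cat sat", reference, n=1))
print(rouge_n_recall("the cat sat", reference, n=2))
```

Comparing such a score before and after unlearning on retained data gives a simple view of how much utility was lost.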

Privacy

Measures privacy guarantees, both formal and empirical:

  • Epsilon-Delta — (ε, δ)-differential privacy accounting

  • Privacy Audit — Empirical privacy leakage estimation

  • Differential Privacy — DP-SGD compliance checking
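As an illustration of what (ε, δ) accounting involves, the sketch below applies the basic sequential composition theorem: running k mechanisms that are each (ε_i, δ_i)-differentially private yields a combined guarantee of (Σε_i, Σδ_i). This is the loosest standard bound and is shown only as an example; the function names are hypothetical, and the accountant Erasus actually uses is not documented here and may be tighter.

```python
def compose_basic(steps):
    """Basic sequential composition of (epsilon, delta) guarantees.

    `steps` is an iterable of (epsilon, delta) pairs, one per mechanism.
    Returns the summed (epsilon, delta) upper bound.
    """
    steps = list(steps)
    total_eps = sum(eps for eps, _ in steps)
    total_delta = sum(delta for _, delta in steps)
    return total_eps, total_delta


def within_budget(steps, eps_budget, delta_budget):
    """True if the composed guarantee stays inside the given budget."""
    total_eps, total_delta = compose_basic(steps)
    return total_eps <= eps_budget and total_delta <= delta_budget


# Four hypothetical unlearning updates, each (0.5, 1e-6)-DP
updates = [(0.5, 1e-6)] * 4
print(compose_basic(updates))
print(within_budget(updates, eps_budget=3.0, delta_budget=1e-5))
```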

Efficiency

Measures computational cost:

  • Time Complexity — Wall-clock time for unlearning

  • Memory Usage — Peak GPU/CPU memory

  • Speedup — Ratio of retraining-from-scratch cost to unlearning cost (higher is better)

  • FLOPs — Floating point operations count
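The wall-clock, memory, and speedup measurements can be approximated with the standard library alone. The sketch below is an assumption-laden illustration rather than the Erasus implementation: `measure` and `speedup` are hypothetical helpers, and `tracemalloc` only tracks Python heap allocations on the CPU, so it does not capture GPU memory.

```python
import time
import tracemalloc


def measure(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds, peak_bytes).

    peak_bytes is the peak Python heap allocation observed by
    tracemalloc while fn was running (CPU side only).
    """
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


def speedup(retrain_seconds, unlearn_seconds):
    """How many times faster unlearning is than retraining from scratch."""
    if unlearn_seconds <= 0:
        raise ValueError("unlearn_seconds must be positive")
    return retrain_seconds / unlearn_seconds


# Stand-in workload; replace with your own unlearning call
_, seconds, peak_bytes = measure(sorted, list(range(100_000, 0, -1)))
print(f"time: {seconds:.4f}s, peak memory: {peak_bytes / 1024:.1f} KiB")
print(f"speedup: {speedup(retrain_seconds=120.0, unlearn_seconds=6.0):.1f}x")
```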

Using MetricSuite

from erasus.metrics.metric_suite import MetricSuite

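# `model`, `forget_loader` and `retain_loader` are assumed to be defined already
# (a trained model and data loaders for the forget and retain sets).
# Run specific metrics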
suite = MetricSuite(["accuracy", "mia", "kl_divergence"])
results = suite.run(model, forget_loader, retain_loader)

# Print scalar results (entries that are not floats are skipped)
for name, value in results.items():
    if isinstance(value, float):
        print(f"  {name}: {value:.4f}")

Benchmark Runner

For comprehensive benchmarks with statistical tests and visualisation:

from erasus.metrics.benchmarks import BenchmarkRunner

runner = BenchmarkRunner(
    strategies=["gradient_ascent", "scrub", "fisher_forgetting"],
    metrics=["accuracy", "mia"],
    n_runs=3,
)
results = runner.run(model, forget_loader, retain_loader)
runner.export_latex(results, "benchmark_table.tex")
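When a benchmark is repeated over several runs (`n_runs=3` above), results are usually reported as a mean with a spread. The structure of the object returned by `runner.run` is not documented here, so the sketch below works on a hypothetical list of per-run dictionaries and uses a hypothetical helper, `summarise_runs`, to show the aggregation step only.

```python
from statistics import mean, stdev


def summarise_runs(runs):
    """Aggregate per-run metric dicts into {metric: (mean, std)}.

    `runs` is a list with one {metric_name: float} dict per run.
    The sample standard deviation is used; a single run gets 0.0.
    """
    summary = {}
    for name in runs[0]:
        values = [run[name] for run in runs]
        spread = stdev(values) if len(values) > 1 else 0.0
        summary[name] = (mean(values), spread)
    return summary


# Hypothetical results from three runs of one strategy
runs = [
    {"accuracy": 0.90, "mia": 0.52},
    {"accuracy": 0.92, "mia": 0.50},
    {"accuracy": 0.94, "mia": 0.51},
]
for name, (avg, std) in summarise_runs(runs).items():
    print(f"  {name}: {avg:.4f} ± {std:.4f}")
```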