Experiments & Fitness Evaluation

The self-evolution system in PRX uses controlled experiments and fitness evaluation to measure whether a proposed change actually improves agent performance. Every evolution proposal above layer L1 (that is, L2 and L3) is tested through an A/B experiment before it is permanently adopted.

Overview

The experiment system provides:

  • A/B testing -- run control and treatment variants side by side
  • Fitness scoring -- quantify agent performance with a composite score
  • Statistical validation -- ensure improvements are significant, not random noise
  • Automatic convergence -- promote the winner and retire the loser when results are conclusive

Experiment Lifecycle

┌──────────┐    ┌──────────┐    ┌──────────┐    ┌───────────┐
│  Create  │───►│  Run     │───►│ Evaluate │───►│ Converge  │
│          │    │          │    │          │    │           │
│ Define   │    │ Split    │    │ Compare  │    │ Promote   │
│ variants │    │ traffic  │    │ fitness  │    │ or reject │
└──────────┘    └──────────┘    └──────────┘    └───────────┘

1. Create

An experiment is created when the evolution pipeline generates a proposal:

  • A control variant representing the current configuration
  • A treatment variant representing the proposed change
  • Experiment parameters: duration, sample size, traffic split

2. Run

During the experiment, sessions are assigned to variants:

  • Sessions are assigned randomly based on the traffic split ratio
  • Each session runs entirely under one variant (no mid-session switching)
  • Both variants are monitored for the same set of fitness metrics
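The assignment rules above can be sketched as follows. This is a minimal illustration, not PRX's actual implementation: the function name `assign_variant` and the hash-based bucketing are assumptions. Hashing is used here in place of a random draw so that a given session always lands in the same variant, which is one way to satisfy the "no mid-session switching" rule.

```python
import hashlib


def assign_variant(session_id: str, experiment_id: str, traffic_split: float = 0.5) -> str:
    """Return "treatment" or "control" for a session (hypothetical helper).

    traffic_split is the fraction of sessions sent to the treatment variant,
    matching the meaning given in the configuration reference.
    """
    if not 0.0 <= traffic_split <= 1.0:
        raise ValueError("traffic_split must be within 0.0--1.0")
    # Hash the experiment and session IDs into a stable pseudo-random bucket.
    digest = hashlib.sha256(f"{experiment_id}:{session_id}".encode("utf-8")).digest()
    # The first 6 bytes give an integer below 2**48, so bucket is in [0, 1).
    bucket = int.from_bytes(digest[:6], "big") / 2**48
    return "treatment" if bucket < traffic_split else "control"
```

Because the bucket depends only on the two IDs, calling the function again for the same session returns the same variant, and including the experiment ID keeps assignments independent across experiments.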

3. Evaluate

After the minimum duration or sample size is reached:

  • Fitness scores are computed for both variants
  • Statistical significance is tested (default: 95% confidence)
  • Effect size is calculated to measure practical significance

4. Converge

Based on evaluation results:

  • Treatment wins -- the proposed change is promoted to the default configuration
  • Control wins -- the proposed change is rejected; the control remains
  • Inconclusive -- the experiment is extended or the change is deferred
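The three outcomes can be expressed as a small decision function. The sketch below is an assumption about how the thresholds combine, not the documented algorithm: `decide_winner` is a hypothetical name, `min_effect_size` is treated as an absolute difference in composite fitness, and "control wins" is taken to mean a significant result in which the treatment is not better than the control.

```python
from typing import Optional


def decide_winner(
    control_fitness: float,
    treatment_fitness: float,
    p_value: float,
    confidence_level: float = 0.95,
    min_effect_size: float = 0.02,
) -> Optional[str]:
    """Return "treatment", "control", or None when the result is inconclusive."""
    alpha = 1.0 - confidence_level
    if p_value >= alpha:
        return None  # not statistically significant: extend or defer
    delta = treatment_fitness - control_fitness
    if delta >= min_effect_size:
        return "treatment"  # significant and large enough to matter
    if delta <= 0.0:
        return "control"  # significant, and the treatment is not better
    return None  # significant but below the practical threshold


# The two examples from the Experiment Examples section:
print(decide_winner(0.72, 0.75, p_value=0.03))  # treatment
print(decide_winner(0.74, 0.73, p_value=0.42))  # None (inconclusive)
```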

Configuration

```toml
[self_evolution.experiments]
enabled = true
default_duration_hours = 168       # 1 week default
min_sample_size = 100              # minimum sessions per variant
traffic_split = 0.5                # 50/50 split between control and treatment
confidence_level = 0.95            # 95% statistical confidence required
min_effect_size = 0.02             # minimum 2% improvement to accept

[self_evolution.experiments.auto_converge]
enabled = true
check_interval_hours = 24          # evaluate results every 24 hours
max_duration_hours = 720           # force convergence after 30 days
```

Configuration Reference

| Field | Type | Default | Description |
|---|---|---|---|
| `enabled` | `bool` | `true` | Enable or disable the experiment system |
| `default_duration_hours` | `u64` | `168` | Default experiment duration in hours (1 week) |
| `min_sample_size` | `usize` | `100` | Minimum sessions per variant before evaluation |
| `traffic_split` | `f64` | `0.5` | Fraction of sessions assigned to the treatment variant (0.0--1.0) |
| `confidence_level` | `f64` | `0.95` | Required statistical confidence level |
| `min_effect_size` | `f64` | `0.02` | Minimum fitness improvement (fraction) to accept the treatment |
| `auto_converge.enabled` | `bool` | `true` | Automatically promote/reject when results are conclusive |
| `auto_converge.check_interval_hours` | `u64` | `24` | How often to check experiment results |
| `auto_converge.max_duration_hours` | `u64` | `720` | Force convergence after this duration (30 days default) |
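As a rough illustration of the constraints implied by the reference above, the following sketch checks an experiment configuration that has already been loaded into a dictionary. The function `validate_experiment_config` is hypothetical, and the rule that `max_duration_hours` should not be shorter than `default_duration_hours` is an assumption rather than documented behavior.

```python
from typing import Any, Dict, List

# Defaults copied from the configuration reference.
DEFAULTS: Dict[str, Any] = {
    "enabled": True,
    "default_duration_hours": 168,
    "min_sample_size": 100,
    "traffic_split": 0.5,
    "confidence_level": 0.95,
    "min_effect_size": 0.02,
    "auto_converge": {"enabled": True, "check_interval_hours": 24, "max_duration_hours": 720},
}


def validate_experiment_config(cfg: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means no issue was found."""
    merged = {**DEFAULTS, **cfg}
    auto = {**DEFAULTS["auto_converge"], **cfg.get("auto_converge", {})}
    problems: List[str] = []
    if not 0.0 <= merged["traffic_split"] <= 1.0:
        problems.append("traffic_split must be within 0.0--1.0")
    if not 0.0 < merged["confidence_level"] < 1.0:
        problems.append("confidence_level must be between 0 and 1 (exclusive)")
    if merged["min_effect_size"] < 0.0:
        problems.append("min_effect_size must not be negative")
    if merged["min_sample_size"] < 1:
        problems.append("min_sample_size must be at least 1")
    if merged["default_duration_hours"] < 1:
        problems.append("default_duration_hours must be at least 1")
    if auto["check_interval_hours"] < 1:
        problems.append("auto_converge.check_interval_hours must be at least 1")
    # Assumption: forcing convergence before the default duration ends is a mistake.
    if auto["max_duration_hours"] < merged["default_duration_hours"]:
        problems.append("auto_converge.max_duration_hours is shorter than default_duration_hours")
    return problems
```

Calling `validate_experiment_config({})` checks the defaults alone, which pass every rule in this sketch.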

Experiment Record Structure

Each experiment is tracked as a structured record:

| Field | Type | Description |
|---|---|---|
| `experiment_id` | `String` | Unique identifier (UUIDv7) |
| `decision_id` | `String` | Link to the originating decision |
| `layer` | `Layer` | Evolution layer: L1, L2, or L3 |
| `status` | `Status` | running, evaluating, converged, cancelled |
| `created_at` | `DateTime<Utc>` | When the experiment was created |
| `converged_at` | `Option<DateTime<Utc>>` | When the experiment concluded |
| `control` | `Variant` | Description of the control variant |
| `treatment` | `Variant` | Description of the treatment variant |
| `control_sessions` | `usize` | Number of sessions assigned to control |
| `treatment_sessions` | `usize` | Number of sessions assigned to treatment |
| `control_fitness` | `FitnessScore` | Aggregate fitness for the control variant |
| `treatment_fitness` | `FitnessScore` | Aggregate fitness for the treatment variant |
| `p_value` | `Option<f64>` | Statistical significance (lower = more significant) |
| `winner` | `Option<String>` | "control", "treatment", or null if inconclusive |
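For readers who want to work with exported records in a script, the structure above might be mirrored roughly as follows. This is a simplified sketch, not the PRX type definition: `Variant` and `FitnessScore` are reduced to a string and a float, the defaults are assumptions, and `has_min_samples` is a hypothetical helper.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ExperimentRecord:
    experiment_id: str                 # UUIDv7 in the source table
    decision_id: str                   # link to the originating decision
    layer: str                         # "L1", "L2", or "L3"
    control: str                       # Variant, simplified to a description
    treatment: str                     # Variant, simplified to a description
    created_at: datetime
    status: str = "running"            # running, evaluating, converged, cancelled
    converged_at: Optional[datetime] = None
    control_sessions: int = 0
    treatment_sessions: int = 0
    control_fitness: float = 0.0       # FitnessScore, simplified to a float
    treatment_fitness: float = 0.0
    p_value: Optional[float] = None
    winner: Optional[str] = None       # "control", "treatment", or None

    def has_min_samples(self, min_sample_size: int = 100) -> bool:
        """True once both variants have reached the minimum session count."""
        return min(self.control_sessions, self.treatment_sessions) >= min_sample_size


record = ExperimentRecord(
    experiment_id="example-id",
    decision_id="example-decision",
    layer="L2",
    control="current system prompt",
    treatment="refined system prompt",
    created_at=datetime.now(timezone.utc),
)
```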

Fitness Evaluation

Fitness scoring quantifies agent performance across multiple dimensions. The composite fitness score is used to compare experiment variants and track evolution progress over time.

Fitness Dimensions

| Dimension | Weight | Description | How Measured |
|---|---|---|---|
| `response_relevance` | 0.30 | How relevant agent responses are to user queries | LLM-as-judge scoring |
| `task_completion` | 0.25 | Fraction of tasks completed successfully | Tool call success rate |
| `response_latency` | 0.15 | Time from user message to first response token | Percentile-based (p50, p95) |
| `token_efficiency` | 0.10 | Tokens consumed per successful task | Lower is better |
| `memory_precision` | 0.10 | Relevance of recalled memories | Recall relevance scoring |
| `user_satisfaction` | 0.10 | Explicit user feedback signals | Thumbs up/down, corrections |

Composite Score

The composite fitness score is a weighted sum:

fitness = sum(dimension_score * dimension_weight)

Each dimension is normalized to a 0.0--1.0 range before weighting. Because the weights must sum to 1.0, the composite score also falls in the 0.0--1.0 range, where higher is better.
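The weighted sum can be reproduced in a few lines. The sketch below assumes the per-dimension scores are already normalized; `composite_fitness` is a hypothetical name, and the sample scores are borrowed from the Example Fitness Output section of this page.

```python
# Default weights from the Fitness Dimensions table.
WEIGHTS = {
    "response_relevance": 0.30,
    "task_completion": 0.25,
    "response_latency": 0.15,
    "token_efficiency": 0.10,
    "memory_precision": 0.10,
    "user_satisfaction": 0.10,
}


def composite_fitness(scores: dict, weights: dict = WEIGHTS) -> float:
    """Weighted sum of normalized dimension scores."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same dimensions")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    for name, score in scores.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{name} must be normalized to 0.0--1.0")
    return sum(scores[name] * weights[name] for name in weights)


scores = {
    "response_relevance": 0.82,
    "task_completion": 0.78,
    "response_latency": 0.69,
    "token_efficiency": 0.65,
    "memory_precision": 0.71,
    "user_satisfaction": 0.60,
}
print(f"{composite_fitness(scores):.2f}")  # 0.74
```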

Fitness Configuration

```toml
[self_evolution.fitness]
evaluation_window_hours = 24       # aggregate metrics over this window
min_sessions_for_score = 10        # require at least 10 sessions for a valid score

[self_evolution.fitness.weights]
response_relevance = 0.30
task_completion = 0.25
response_latency = 0.15
token_efficiency = 0.10
memory_precision = 0.10
user_satisfaction = 0.10

[self_evolution.fitness.thresholds]
minimum_acceptable = 0.50          # fitness below this triggers an alert
regression_delta = 0.05            # fitness drop > 5% triggers rollback
```

Fitness Configuration Reference

| Field | Type | Default | Description |
|---|---|---|---|
| `evaluation_window_hours` | `u64` | `24` | Time window for aggregating fitness metrics |
| `min_sessions_for_score` | `usize` | `10` | Minimum sessions needed to compute a valid score |
| `weights.*` | `f64` | (see table above) | Weight for each fitness dimension (must sum to 1.0) |
| `thresholds.minimum_acceptable` | `f64` | `0.50` | Alert threshold for low fitness |
| `thresholds.regression_delta` | `f64` | `0.05` | Maximum fitness drop before automatic rollback |
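The two thresholds could be applied as shown below. This is only a sketch: `check_fitness` is a hypothetical helper, and it assumes `regression_delta` is an absolute drop in composite fitness measured against a previous baseline score, which this page does not state explicitly.

```python
from typing import List


def check_fitness(
    current: float,
    baseline: float,
    minimum_acceptable: float = 0.50,
    regression_delta: float = 0.05,
) -> List[str]:
    """Return the actions triggered by the current fitness score."""
    actions: List[str] = []
    if current < minimum_acceptable:
        actions.append("alert")  # fitness fell below the alert threshold
    if baseline - current > regression_delta:
        actions.append("rollback")  # drop exceeds the allowed regression
    return actions


print(check_fitness(current=0.74, baseline=0.71))  # []
print(check_fitness(current=0.60, baseline=0.70))  # ['rollback']
```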

CLI Commands

```bash
# List active experiments
prx evolution experiments --status running

# View a specific experiment
prx evolution experiments --id <experiment_id>

# View experiment results with fitness breakdown
prx evolution experiments --id <experiment_id> --details

# Cancel a running experiment (reverts to control)
prx evolution experiments cancel <experiment_id>

# View current fitness score
prx evolution fitness

# View fitness history over time
prx evolution fitness --history --last 30d

# View fitness breakdown by dimension
prx evolution fitness --breakdown
```

Example Fitness Output

Current Fitness Score: 0.74

Dimension            Score   Weight  Contribution
response_relevance   0.82    0.30    0.246
task_completion      0.78    0.25    0.195
response_latency     0.69    0.15    0.104
token_efficiency     0.65    0.10    0.065
memory_precision     0.71    0.10    0.071
user_satisfaction    0.60    0.10    0.060

Trend (last 7 days): +0.03 (improving)

Experiment Examples

L2 Prompt Optimization

A typical L2 experiment tests a system prompt change:

  • Control: current system prompt (320 tokens)
  • Treatment: refined system prompt (272 tokens, 15% shorter)
  • Hypothesis: shorter prompt frees context window, improving response relevance
  • Duration: 7 days, 100 sessions per variant
  • Result: treatment fitness 0.75 vs control 0.72 (p = 0.03), treatment promoted

L3 Strategy Change

An L3 experiment tests a routing policy change:

  • Control: route all coding tasks to Claude Opus
  • Treatment: route simple coding tasks to Claude Sonnet, complex to Opus
  • Hypothesis: cost-efficient routing without quality loss
  • Duration: 14 days, 200 sessions per variant
  • Result: treatment fitness 0.73 vs control 0.74 (p = 0.42), inconclusive -- experiment extended

Statistical Methods

The experiment system uses the following statistical methods:

  • Two-sample t-test for comparing mean fitness scores between variants
  • Mann-Whitney U test as a non-parametric alternative when fitness distributions are skewed
  • Bonferroni correction when multiple fitness dimensions are compared simultaneously
  • Sequential analysis with alpha-spending to allow early stopping when results are clearly significant
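To make the first and third methods concrete, here is a minimal sketch using only the Python standard library. It is not the PRX implementation. It uses Welch's variant of the two-sample t-test (which does not assume equal variances) and approximates the p-value with a normal distribution, an approximation that is reasonable at the default minimum of 100 sessions per variant but too optimistic for small samples. The function names are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance
from typing import Sequence, Tuple


def welch_t_test(control: Sequence[float], treatment: Sequence[float]) -> Tuple[float, float]:
    """Return (t statistic, two-sided p-value) for per-session fitness samples."""
    n1, n2 = len(control), len(treatment)
    if n1 < 2 or n2 < 2:
        raise ValueError("each variant needs at least two sessions")
    # Standard error of the difference in means, without pooling variances.
    std_err = math.sqrt(variance(control) / n1 + variance(treatment) / n2)
    if std_err == 0.0:
        raise ValueError("both samples have zero variance")
    t_stat = (mean(treatment) - mean(control)) / std_err
    # Normal approximation to the t distribution (assumes large samples).
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))
    return t_stat, p_value


def bonferroni_alpha(confidence_level: float, num_comparisons: int) -> float:
    """Per-comparison significance level after Bonferroni correction."""
    if num_comparisons < 1:
        raise ValueError("num_comparisons must be at least 1")
    return (1.0 - confidence_level) / num_comparisons
```

With the default 95% confidence level and all six fitness dimensions compared at once, `bonferroni_alpha(0.95, 6)` yields a per-dimension threshold of roughly 0.0083 instead of 0.05.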

Limitations

  • Experiments require sufficient session volume; low-traffic deployments may take weeks to reach significance
  • User satisfaction signals depend on explicit feedback, which may be sparse
  • LLM-as-judge scoring for response relevance adds latency and cost to the evaluation pipeline
  • Only one experiment can run per evolution layer at a time to avoid confounding
  • Fitness scores are relative to the specific deployment; they are not comparable across different PRX instances

Released under the Apache-2.0 License.