
Experiments & Fitness Evaluation

The self-evolution system in PRX uses controlled experiments and fitness evaluation to measure whether proposed changes actually improve agent performance. Every evolution proposal above L1 is tested via an A/B experiment before permanent adoption.

Overview

The experiment system provides:

  • A/B testing -- run control and treatment variants side by side
  • Fitness scoring -- quantify agent performance with a composite score
  • Statistical validation -- ensure that improvements are significant, not random noise
  • Automatic convergence -- promote the winner and retire the loser when results are conclusive

Experiment Lifecycle

```
┌──────────┐    ┌──────────┐    ┌──────────┐    ┌───────────┐
│  Create  │───►│  Run     │───►│ Evaluate │───►│ Converge  │
│          │    │          │    │          │    │           │
│ Define   │    │ Split    │    │ Compare  │    │ Promote   │
│ variants │    │ traffic  │    │ fitness  │    │ or reject │
└──────────┘    └──────────┘    └──────────┘    └───────────┘
```

1. Create

An experiment is created when the evolution pipeline generates a proposal:

  • A control variant representing the current configuration
  • A treatment variant representing the proposed change
  • Experiment parameters: duration, sample size, traffic split

2. Run

During the experiment, sessions are assigned to variants:

  • Sessions are assigned randomly based on the traffic split ratio
  • Each session runs entirely under one variant (no mid-session switching)
  • Both variants are monitored for the same set of fitness metrics
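
The no-mid-session-switching rule above is commonly implemented by making the variant a pure function of the session id: hashing the id yields a stable pseudo-random bucket, so a session always lands in the same variant. This is a minimal sketch assuming a `traffic_split` fraction as in the configuration below; the function names and the hashing choice are assumptions, not PRX's actual mechanism.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Variant {
    Control,
    Treatment,
}

/// Assign a session to a variant by hashing its id, so the assignment is
/// stable for the session's whole lifetime (no mid-session switching).
fn assign_variant(session_id: &str, traffic_split: f64) -> Variant {
    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};
    let mut hasher = DefaultHasher::new();
    session_id.hash(&mut hasher);
    // Map the hash onto [0.0, 1.0) and compare against the split ratio.
    let bucket = (hasher.finish() % 10_000) as f64 / 10_000.0;
    if bucket < traffic_split {
        Variant::Treatment
    } else {
        Variant::Control
    }
}

fn main() {
    // The same session id always lands in the same variant.
    let v = assign_variant("session-42", 0.5);
    assert_eq!(v, assign_variant("session-42", 0.5));
    println!("session-42 -> {:?}", v);
}
```

Hash-based bucketing is a common alternative to per-session coin flips precisely because reassignment is impossible: the variant depends only on the session id and the split ratio.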

3. Evaluate

After the minimum duration or sample size is reached:

  • Fitness scores are computed for both variants
  • Statistical significance is tested (default: 95% confidence)
  • Effect size is calculated to measure practical significance

4. Converge

Based on evaluation results:

  • Treatment wins -- the proposed change is promoted to the default configuration
  • Control wins -- the proposed change is rejected; the control remains
  • Inconclusive -- the experiment is extended or the change is deferred
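
The three outcomes above can be sketched as a single decision function, assuming the `confidence_level` and `min_effect_size` thresholds from the configuration below; the names and shape are illustrative, not PRX's actual API.

```rust
#[derive(Debug, PartialEq)]
enum Outcome {
    PromoteTreatment, // treatment wins: adopt the change
    KeepControl,      // control wins: reject the change
    Extend,           // inconclusive: keep the experiment running
}

fn converge(
    control_fitness: f64,
    treatment_fitness: f64,
    p_value: f64,
    confidence_level: f64, // e.g. 0.95
    min_effect_size: f64,  // e.g. 0.02
) -> Outcome {
    // Significant only if p is below the alpha implied by the configured
    // confidence level (0.95 -> alpha = 0.05).
    let significant = p_value < 1.0 - confidence_level;
    let effect = treatment_fitness - control_fitness;
    if !significant {
        Outcome::Extend
    } else if effect >= min_effect_size {
        Outcome::PromoteTreatment
    } else {
        Outcome::KeepControl
    }
}

fn main() {
    // The L2 prompt-optimization example later on this page:
    // treatment 0.75 vs control 0.72 at p = 0.03 -> promote.
    assert_eq!(converge(0.72, 0.75, 0.03, 0.95, 0.02), Outcome::PromoteTreatment);
    // The L3 routing example: p = 0.42 -> inconclusive, extend.
    assert_eq!(converge(0.74, 0.73, 0.42, 0.95, 0.02), Outcome::Extend);
}
```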

Configuration

```toml
[self_evolution.experiments]
enabled = true
default_duration_hours = 168       # 1 week default
min_sample_size = 100              # minimum sessions per variant
traffic_split = 0.5                # 50/50 split between control and treatment
confidence_level = 0.95            # 95% statistical confidence required
min_effect_size = 0.02             # minimum 2% improvement to accept

[self_evolution.experiments.auto_converge]
enabled = true
check_interval_hours = 24          # evaluate results every 24 hours
max_duration_hours = 720           # force convergence after 30 days
```

Configuration Reference

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `enabled` | `bool` | `true` | Enable or disable the experiment system |
| `default_duration_hours` | `u64` | `168` | Default experiment duration in hours (1 week) |
| `min_sample_size` | `usize` | `100` | Minimum sessions per variant before evaluation |
| `traffic_split` | `f64` | `0.5` | Fraction of sessions assigned to the treatment variant (0.0--1.0) |
| `confidence_level` | `f64` | `0.95` | Required statistical confidence level |
| `min_effect_size` | `f64` | `0.02` | Minimum fitness improvement (fraction) to accept the treatment |
| `auto_converge.enabled` | `bool` | `true` | Automatically promote/reject when results are conclusive |
| `auto_converge.check_interval_hours` | `u64` | `24` | How often to check experiment results |
| `auto_converge.max_duration_hours` | `u64` | `720` | Force convergence after this duration (30 days default) |

Experiment Record Structure

Each experiment is tracked as a structured record:

| Field | Type | Description |
| --- | --- | --- |
| `experiment_id` | `String` | Unique identifier (UUIDv7) |
| `decision_id` | `String` | Link to the originating decision |
| `layer` | `Layer` | Evolution layer: L1, L2, or L3 |
| `status` | `Status` | `running`, `evaluating`, `converged`, `cancelled` |
| `created_at` | `DateTime<Utc>` | When the experiment was created |
| `converged_at` | `Option<DateTime<Utc>>` | When the experiment concluded |
| `control` | `Variant` | Description of the control variant |
| `treatment` | `Variant` | Description of the treatment variant |
| `control_sessions` | `usize` | Number of sessions assigned to control |
| `treatment_sessions` | `usize` | Number of sessions assigned to treatment |
| `control_fitness` | `FitnessScore` | Aggregate fitness for the control variant |
| `treatment_fitness` | `FitnessScore` | Aggregate fitness for the treatment variant |
| `p_value` | `Option<f64>` | Statistical significance (lower = more significant) |
| `winner` | `Option<String>` | `"control"`, `"treatment"`, or null if inconclusive |
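
Since the table's types are Rust types, the record presumably maps onto a struct along the following lines. `Layer`, `Status`, `Variant`, and `FitnessScore` are stubbed minimally here, and the `DateTime<Utc>` timestamps (chrono types) are replaced with unix-second integers so the sketch compiles standalone; treat this as an illustration, not PRX's actual definition.

```rust
#[derive(Debug)]
enum Layer { L1, L2, L3 }

#[derive(Debug)]
enum Status { Running, Evaluating, Converged, Cancelled }

#[derive(Debug)]
struct Variant { description: String }

// Composite score in 0.0..=1.0 (stub for the real FitnessScore type).
type FitnessScore = f64;

#[derive(Debug)]
struct Experiment {
    experiment_id: String,     // UUIDv7
    decision_id: String,       // link to the originating decision
    layer: Layer,
    status: Status,
    created_at: i64,           // stand-in for DateTime<Utc>
    converged_at: Option<i64>, // None while running
    control: Variant,
    treatment: Variant,
    control_sessions: usize,
    treatment_sessions: usize,
    control_fitness: FitnessScore,
    treatment_fitness: FitnessScore,
    p_value: Option<f64>,      // None until evaluated
    winner: Option<String>,    // "control" / "treatment" / None
}

fn main() {
    let exp = Experiment {
        experiment_id: "exp-0001".to_string(),
        decision_id: "dec-0001".to_string(),
        layer: Layer::L2,
        status: Status::Running,
        created_at: 1_700_000_000,
        converged_at: None,
        control: Variant { description: "current config".to_string() },
        treatment: Variant { description: "proposed change".to_string() },
        control_sessions: 0,
        treatment_sessions: 0,
        control_fitness: 0.0,
        treatment_fitness: 0.0,
        p_value: None,
        winner: None,
    };
    println!("{:?}", exp.status);
}
```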

Fitness Evaluation

Fitness scoring quantifies agent performance across multiple dimensions. The composite fitness score is used to compare experiment variants and track evolution progress over time.

Fitness Dimensions

| Dimension | Weight | Description | How Measured |
| --- | --- | --- | --- |
| `response_relevance` | 0.30 | How relevant agent responses are to user queries | LLM-as-judge scoring |
| `task_completion` | 0.25 | Fraction of tasks completed successfully | Tool call success rate |
| `response_latency` | 0.15 | Time from user message to first response token | Percentile-based (p50, p95) |
| `token_efficiency` | 0.10 | Tokens consumed per successful task | Lower is better |
| `memory_precision` | 0.10 | Relevance of recalled memories | Recall relevance scoring |
| `user_satisfaction` | 0.10 | Explicit user feedback signals | Thumbs up/down, corrections |

Composite Score

The composite fitness score is a weighted sum:

```
fitness = sum(dimension_score * dimension_weight)
```

Each dimension is normalized to a 0.0--1.0 range before weighting. The composite score is therefore also in the 0.0--1.0 range, where higher is better.
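
The weighted sum can be made concrete with the numbers from the example fitness output later on this page; the function itself is a sketch, not PRX's implementation.

```rust
/// Composite fitness: a weighted sum over (score, weight) pairs.
/// Scores are assumed normalized to 0.0..=1.0; weights sum to 1.0.
fn composite_fitness(dimensions: &[(f64, f64)]) -> f64 {
    dimensions.iter().map(|(score, weight)| score * weight).sum()
}

fn main() {
    let dims = [
        (0.82, 0.30), // response_relevance
        (0.78, 0.25), // task_completion
        (0.69, 0.15), // response_latency
        (0.65, 0.10), // token_efficiency
        (0.71, 0.10), // memory_precision
        (0.60, 0.10), // user_satisfaction
    ];
    // 0.246 + 0.195 + 0.1035 + 0.065 + 0.071 + 0.060 = 0.7405, i.e. 0.74
    println!("fitness = {:.2}", composite_fitness(&dims));
}
```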

Fitness Configuration

```toml
[self_evolution.fitness]
evaluation_window_hours = 24       # aggregate metrics over this window
min_sessions_for_score = 10        # require at least 10 sessions for a valid score

[self_evolution.fitness.weights]
response_relevance = 0.30
task_completion = 0.25
response_latency = 0.15
token_efficiency = 0.10
memory_precision = 0.10
user_satisfaction = 0.10

[self_evolution.fitness.thresholds]
minimum_acceptable = 0.50          # fitness below this triggers an alert
regression_delta = 0.05            # fitness drop > 5% triggers rollback
```

Fitness Configuration Reference

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `evaluation_window_hours` | `u64` | `24` | Time window for aggregating fitness metrics |
| `min_sessions_for_score` | `usize` | `10` | Minimum sessions needed to compute a valid score |
| `weights.*` | `f64` | (see table above) | Weight for each fitness dimension (must sum to 1.0) |
| `thresholds.minimum_acceptable` | `f64` | `0.50` | Alert threshold for low fitness |
| `thresholds.regression_delta` | `f64` | `0.05` | Maximum fitness drop before automatic rollback |
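
The two thresholds can be sketched as simple predicates. Treating `regression_delta` as an absolute drop in the 0.0--1.0 score is an assumption here (the "5%" in the configuration comment could also be read as a relative drop), and the function names are illustrative.

```rust
/// Fitness below the configured floor triggers an alert.
fn is_below_minimum(fitness: f64, minimum_acceptable: f64) -> bool {
    fitness < minimum_acceptable
}

/// An absolute fitness drop larger than the delta triggers automatic
/// rollback (assumption: delta is absolute, not relative).
fn is_regression(previous: f64, current: f64, regression_delta: f64) -> bool {
    previous - current > regression_delta
}

fn main() {
    assert!(!is_below_minimum(0.74, 0.50));
    // A drop from 0.74 to 0.68 (0.06) exceeds the 0.05 delta -> rollback.
    assert!(is_regression(0.74, 0.68, 0.05));
    println!("threshold checks passed");
}
```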

CLI Commands

```bash
# List active experiments
prx evolution experiments --status running

# View a specific experiment
prx evolution experiments --id <experiment_id>

# View experiment results with fitness breakdown
prx evolution experiments --id <experiment_id> --details

# Cancel a running experiment (reverts to control)
prx evolution experiments cancel <experiment_id>

# View current fitness score
prx evolution fitness

# View fitness history over time
prx evolution fitness --history --last 30d

# View fitness breakdown by dimension
prx evolution fitness --breakdown
```

Example Fitness Output

```
Current Fitness Score: 0.74

Dimension            Score   Weight  Contribution
response_relevance   0.82    0.30    0.246
task_completion      0.78    0.25    0.195
response_latency     0.69    0.15    0.104
token_efficiency     0.65    0.10    0.065
memory_precision     0.71    0.10    0.071
user_satisfaction    0.60    0.10    0.060

Trend (last 7 days): +0.03 (improving)
```

Experiment Examples

L2 Prompt Optimization

A typical L2 experiment tests a system prompt change:

  • Control: current system prompt (320 tokens)
  • Treatment: refined system prompt (272 tokens, 15% shorter)
  • Hypothesis: shorter prompt frees context window, improving response relevance
  • Duration: 7 days, 100 sessions per variant
  • Result: treatment fitness 0.75 vs control 0.72 (p = 0.03), treatment promoted

L3 Strategy Change

An L3 experiment tests a routing policy change:

  • Control: route all coding tasks to Claude Opus
  • Treatment: route simple coding tasks to Claude Sonnet, complex to Opus
  • Hypothesis: cost-efficient routing without quality loss
  • Duration: 14 days, 200 sessions per variant
  • Result: treatment fitness 0.73 vs control 0.74 (p = 0.42), inconclusive -- experiment extended

Statistical Methods

The experiment system uses the following statistical methods:

  • Two-sample t-test for comparing mean fitness scores between variants
  • Mann-Whitney U test as a non-parametric alternative when fitness distributions are skewed
  • Bonferroni correction when multiple fitness dimensions are compared simultaneously
  • Sequential analysis with alpha-spending to allow early stopping when results are clearly significant
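
As a sketch of the first method: Welch's two-sample t statistic, with a large-sample normal approximation for the 95% threshold (reasonable at the configured minimum of 100 sessions per variant). This is illustrative, not PRX's implementation; a production version would use the t distribution with Welch-Satterthwaite degrees of freedom.

```rust
/// Sample mean and sample variance (n - 1 denominator).
fn mean_and_var(xs: &[f64]) -> (f64, f64) {
    let n = xs.len() as f64;
    let mean = xs.iter().sum::<f64>() / n;
    let var = xs.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
    (mean, var)
}

/// Welch's t statistic for unequal variances and sample sizes.
fn welch_t(control: &[f64], treatment: &[f64]) -> f64 {
    let (mc, vc) = mean_and_var(control);
    let (mt, vt) = mean_and_var(treatment);
    let se = (vc / control.len() as f64 + vt / treatment.len() as f64).sqrt();
    (mt - mc) / se
}

/// Two-sided test at 95% confidence using the z critical value 1.96
/// (large-sample approximation of the t distribution).
fn significant_at_95(t: f64) -> bool {
    t.abs() > 1.96
}

fn main() {
    // Synthetic per-session fitness: control around 0.72, treatment
    // around 0.75, both with small spread.
    let control: Vec<f64> =
        (0..100).map(|i| 0.72 + if i % 2 == 0 { 0.01 } else { -0.01 }).collect();
    let treatment: Vec<f64> =
        (0..100).map(|i| 0.75 + if i % 2 == 0 { 0.01 } else { -0.01 }).collect();
    let t = welch_t(&control, &treatment);
    println!("t = {:.2}, significant = {}", t, significant_at_95(t));
}
```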

Limitations

  • Experiments require sufficient session volume; low-traffic deployments may take weeks to reach significance
  • User satisfaction signals depend on explicit feedback, which may be sparse
  • LLM-as-judge scoring for response relevance adds latency and cost to the evaluation pipeline
  • Only one experiment can run per evolution layer at a time, to avoid confounding
  • Fitness scores are relative to the specific deployment; they are not comparable across different PRX instances


Released under the Apache-2.0 License.