
Experiments & Fitness Evaluation

The self-evolution system in PRX uses controlled experiments and fitness evaluation to measure whether proposed changes actually improve agent performance. Every evolution proposal above L1 is tested via an A/B experiment before permanent adoption.

Overview

The experiment system provides:

  • A/B testing -- run control and treatment variants side by side
  • Fitness scoring -- quantify agent performance with a composite score
  • Statistical validation -- ensure that improvements are significant, not random noise
  • Automatic convergence -- promote the winner and retire the loser when results are conclusive

Experiment Lifecycle

```
┌──────────┐    ┌──────────┐    ┌──────────┐    ┌───────────┐
│  Create  │───►│  Run     │───►│ Evaluate │───►│ Converge  │
│          │    │          │    │          │    │           │
│ Define   │    │ Split    │    │ Compare  │    │ Promote   │
│ variants │    │ traffic  │    │ fitness  │    │ or reject │
└──────────┘    └──────────┘    └──────────┘    └───────────┘
```

1. Create

An experiment is created when the evolution pipeline generates a proposal:

  • A control variant representing the current configuration
  • A treatment variant representing the proposed change
  • Experiment parameters: duration, sample size, traffic split

2. Run

During the experiment, sessions are assigned to variants:

  • Sessions are assigned randomly based on the traffic split ratio
  • Each session runs entirely under one variant (no mid-session switching)
  • Both variants are monitored for the same set of fitness metrics
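
The no-mid-session-switching rule above is commonly implemented by making the variant a pure function of the session id: hashing the id yields a stable pseudo-random bucket, so a session always lands in the same variant. This is a minimal sketch assuming a `traffic_split` fraction as in the configuration below; the function names and the hashing choice are assumptions, not PRX's actual mechanism.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Variant {
    Control,
    Treatment,
}

/// Assign a session to a variant by hashing its id, so the assignment is
/// stable for the session's whole lifetime (no mid-session switching).
fn assign_variant(session_id: &str, traffic_split: f64) -> Variant {
    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};
    let mut hasher = DefaultHasher::new();
    session_id.hash(&mut hasher);
    // Map the hash onto [0.0, 1.0) and compare against the split ratio.
    let bucket = (hasher.finish() % 10_000) as f64 / 10_000.0;
    if bucket < traffic_split {
        Variant::Treatment
    } else {
        Variant::Control
    }
}

fn main() {
    // The same session id always lands in the same variant.
    let v = assign_variant("session-42", 0.5);
    assert_eq!(v, assign_variant("session-42", 0.5));
    println!("session-42 -> {:?}", v);
}
```

Hash-based bucketing is a common alternative to per-session coin flips precisely because reassignment is impossible: the variant depends only on the session id and the split ratio.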

3. Evaluate

After the minimum duration or sample size is reached:

  • Fitness scores are computed for both variants
  • Statistical significance is tested (default: 95% confidence)
  • Effect size is calculated to measure practical significance

4. Converge

Based on evaluation results:

  • Treatment wins -- the proposed change is promoted to the default configuration
  • Control wins -- the proposed change is rejected; the control remains
  • Inconclusive -- the experiment is extended or the change is deferred
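
The three outcomes above can be sketched as a single decision function, assuming the `confidence_level` and `min_effect_size` thresholds from the configuration below; the names and shape are illustrative, not PRX's actual API.

```rust
#[derive(Debug, PartialEq)]
enum Outcome {
    PromoteTreatment, // treatment wins: adopt the change
    KeepControl,      // control wins: reject the change
    Extend,           // inconclusive: keep the experiment running
}

fn converge(
    control_fitness: f64,
    treatment_fitness: f64,
    p_value: f64,
    confidence_level: f64, // e.g. 0.95
    min_effect_size: f64,  // e.g. 0.02
) -> Outcome {
    // Significant only if p is below the alpha implied by the configured
    // confidence level (0.95 -> alpha = 0.05).
    let significant = p_value < 1.0 - confidence_level;
    let effect = treatment_fitness - control_fitness;
    if !significant {
        Outcome::Extend
    } else if effect >= min_effect_size {
        Outcome::PromoteTreatment
    } else {
        Outcome::KeepControl
    }
}

fn main() {
    // The L2 prompt-optimization example later on this page:
    // treatment 0.75 vs control 0.72 at p = 0.03 -> promote.
    assert_eq!(converge(0.72, 0.75, 0.03, 0.95, 0.02), Outcome::PromoteTreatment);
    // The L3 routing example: p = 0.42 -> inconclusive, extend.
    assert_eq!(converge(0.74, 0.73, 0.42, 0.95, 0.02), Outcome::Extend);
}
```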

Configuration

```toml
[self_evolution.experiments]
enabled = true
default_duration_hours = 168       # 1 week default
min_sample_size = 100              # minimum sessions per variant
traffic_split = 0.5                # 50/50 split between control and treatment
confidence_level = 0.95            # 95% statistical confidence required
min_effect_size = 0.02             # minimum 2% improvement to accept

[self_evolution.experiments.auto_converge]
enabled = true
check_interval_hours = 24          # evaluate results every 24 hours
max_duration_hours = 720           # force convergence after 30 days
```

Configuration Reference

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `enabled` | `bool` | `true` | Enable or disable the experiment system |
| `default_duration_hours` | `u64` | `168` | Default experiment duration in hours (1 week) |
| `min_sample_size` | `usize` | `100` | Minimum sessions per variant before evaluation |
| `traffic_split` | `f64` | `0.5` | Fraction of sessions assigned to the treatment variant (0.0--1.0) |
| `confidence_level` | `f64` | `0.95` | Required statistical confidence level |
| `min_effect_size` | `f64` | `0.02` | Minimum fitness improvement (fraction) to accept the treatment |
| `auto_converge.enabled` | `bool` | `true` | Automatically promote/reject when results are conclusive |
| `auto_converge.check_interval_hours` | `u64` | `24` | How often to check experiment results |
| `auto_converge.max_duration_hours` | `u64` | `720` | Force convergence after this duration (30 days default) |

Experiment Record Structure

Each experiment is tracked as a structured record:

| Field | Type | Description |
| --- | --- | --- |
| `experiment_id` | `String` | Unique identifier (UUIDv7) |
| `decision_id` | `String` | Link to the originating decision |
| `layer` | `Layer` | Evolution layer: L1, L2, or L3 |
| `status` | `Status` | `running`, `evaluating`, `converged`, `cancelled` |
| `created_at` | `DateTime<Utc>` | When the experiment was created |
| `converged_at` | `Option<DateTime<Utc>>` | When the experiment concluded |
| `control` | `Variant` | Description of the control variant |
| `treatment` | `Variant` | Description of the treatment variant |
| `control_sessions` | `usize` | Number of sessions assigned to control |
| `treatment_sessions` | `usize` | Number of sessions assigned to treatment |
| `control_fitness` | `FitnessScore` | Aggregate fitness for the control variant |
| `treatment_fitness` | `FitnessScore` | Aggregate fitness for the treatment variant |
| `p_value` | `Option<f64>` | Statistical significance (lower = more significant) |
| `winner` | `Option<String>` | `"control"`, `"treatment"`, or null if inconclusive |
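
Since the table's types are Rust types, the record presumably maps onto a struct along the following lines. `Layer`, `Status`, `Variant`, and `FitnessScore` are stubbed minimally here, and the `DateTime<Utc>` timestamps (chrono types) are replaced with unix-second integers so the sketch compiles standalone; treat this as an illustration, not PRX's actual definition.

```rust
#[derive(Debug)]
enum Layer { L1, L2, L3 }

#[derive(Debug)]
enum Status { Running, Evaluating, Converged, Cancelled }

#[derive(Debug)]
struct Variant { description: String }

// Composite score in 0.0..=1.0 (stub for the real FitnessScore type).
type FitnessScore = f64;

#[derive(Debug)]
struct Experiment {
    experiment_id: String,     // UUIDv7
    decision_id: String,       // link to the originating decision
    layer: Layer,
    status: Status,
    created_at: i64,           // stand-in for DateTime<Utc>
    converged_at: Option<i64>, // None while running
    control: Variant,
    treatment: Variant,
    control_sessions: usize,
    treatment_sessions: usize,
    control_fitness: FitnessScore,
    treatment_fitness: FitnessScore,
    p_value: Option<f64>,      // None until evaluated
    winner: Option<String>,    // "control" / "treatment" / None
}

fn main() {
    let exp = Experiment {
        experiment_id: "exp-0001".to_string(),
        decision_id: "dec-0001".to_string(),
        layer: Layer::L2,
        status: Status::Running,
        created_at: 1_700_000_000,
        converged_at: None,
        control: Variant { description: "current config".to_string() },
        treatment: Variant { description: "proposed change".to_string() },
        control_sessions: 0,
        treatment_sessions: 0,
        control_fitness: 0.0,
        treatment_fitness: 0.0,
        p_value: None,
        winner: None,
    };
    println!("{:?}", exp.status);
}
```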

Fitness Evaluation

Fitness scoring quantifies agent performance across multiple dimensions. The composite fitness score is used to compare experiment variants and track evolution progress over time.

Fitness Dimensions

| Dimension | Weight | Description | How Measured |
| --- | --- | --- | --- |
| `response_relevance` | 0.30 | How relevant agent responses are to user queries | LLM-as-judge scoring |
| `task_completion` | 0.25 | Fraction of tasks completed successfully | Tool call success rate |
| `response_latency` | 0.15 | Time from user message to first response token | Percentile-based (p50, p95) |
| `token_efficiency` | 0.10 | Tokens consumed per successful task | Lower is better |
| `memory_precision` | 0.10 | Relevance of recalled memories | Recall relevance scoring |
| `user_satisfaction` | 0.10 | Explicit user feedback signals | Thumbs up/down, corrections |

Composite Score

The composite fitness score is a weighted sum:

```
fitness = sum(dimension_score * dimension_weight)
```

Each dimension is normalized to a 0.0--1.0 range before weighting. The composite score is therefore also in the 0.0--1.0 range, where higher is better.
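
The weighted sum can be made concrete with the numbers from the example fitness output later on this page; the function itself is a sketch, not PRX's implementation.

```rust
/// Composite fitness: a weighted sum over (score, weight) pairs.
/// Scores are assumed normalized to 0.0..=1.0; weights sum to 1.0.
fn composite_fitness(dimensions: &[(f64, f64)]) -> f64 {
    dimensions.iter().map(|(score, weight)| score * weight).sum()
}

fn main() {
    let dims = [
        (0.82, 0.30), // response_relevance
        (0.78, 0.25), // task_completion
        (0.69, 0.15), // response_latency
        (0.65, 0.10), // token_efficiency
        (0.71, 0.10), // memory_precision
        (0.60, 0.10), // user_satisfaction
    ];
    // 0.246 + 0.195 + 0.1035 + 0.065 + 0.071 + 0.060 = 0.7405, i.e. 0.74
    println!("fitness = {:.2}", composite_fitness(&dims));
}
```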

Fitness Configuration

```toml
[self_evolution.fitness]
evaluation_window_hours = 24       # aggregate metrics over this window
min_sessions_for_score = 10        # require at least 10 sessions for a valid score

[self_evolution.fitness.weights]
response_relevance = 0.30
task_completion = 0.25
response_latency = 0.15
token_efficiency = 0.10
memory_precision = 0.10
user_satisfaction = 0.10

[self_evolution.fitness.thresholds]
minimum_acceptable = 0.50          # fitness below this triggers an alert
regression_delta = 0.05            # fitness drop > 5% triggers rollback
```

Fitness Configuration Reference

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `evaluation_window_hours` | `u64` | `24` | Time window for aggregating fitness metrics |
| `min_sessions_for_score` | `usize` | `10` | Minimum sessions needed to compute a valid score |
| `weights.*` | `f64` | (see table above) | Weight for each fitness dimension (must sum to 1.0) |
| `thresholds.minimum_acceptable` | `f64` | `0.50` | Alert threshold for low fitness |
| `thresholds.regression_delta` | `f64` | `0.05` | Maximum fitness drop before automatic rollback |
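
The two thresholds can be sketched as simple predicates. Treating `regression_delta` as an absolute drop in the 0.0--1.0 score is an assumption here (the "5%" in the configuration comment could also be read as a relative drop), and the function names are illustrative.

```rust
/// Fitness below the configured floor triggers an alert.
fn is_below_minimum(fitness: f64, minimum_acceptable: f64) -> bool {
    fitness < minimum_acceptable
}

/// An absolute fitness drop larger than the delta triggers automatic
/// rollback (assumption: delta is absolute, not relative).
fn is_regression(previous: f64, current: f64, regression_delta: f64) -> bool {
    previous - current > regression_delta
}

fn main() {
    assert!(!is_below_minimum(0.74, 0.50));
    // A drop from 0.74 to 0.68 (0.06) exceeds the 0.05 delta -> rollback.
    assert!(is_regression(0.74, 0.68, 0.05));
    println!("threshold checks passed");
}
```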

CLI Commands

```bash
# List active experiments
prx evolution experiments --status running

# View a specific experiment
prx evolution experiments --id <experiment_id>

# View experiment results with fitness breakdown
prx evolution experiments --id <experiment_id> --details

# Cancel a running experiment (reverts to control)
prx evolution experiments cancel <experiment_id>

# View current fitness score
prx evolution fitness

# View fitness history over time
prx evolution fitness --history --last 30d

# View fitness breakdown by dimension
prx evolution fitness --breakdown
```

Example Fitness Output

```
Current Fitness Score: 0.74

Dimension            Score   Weight  Contribution
response_relevance   0.82    0.30    0.246
task_completion      0.78    0.25    0.195
response_latency     0.69    0.15    0.104
token_efficiency     0.65    0.10    0.065
memory_precision     0.71    0.10    0.071
user_satisfaction    0.60    0.10    0.060

Trend (last 7 days): +0.03 (improving)
```

Experiment Examples

L2 Prompt Optimization

A typical L2 experiment tests a system prompt change:

  • Control: current system prompt (320 tokens)
  • Treatment: refined system prompt (272 tokens, 15% shorter)
  • Hypothesis: shorter prompt frees context window, improving response relevance
  • Duration: 7 days, 100 sessions per variant
  • Result: treatment fitness 0.75 vs control 0.72 (p = 0.03), treatment promoted

L3 Strategy Change

An L3 experiment tests a routing policy change:

  • Control: route all coding tasks to Claude Opus
  • Treatment: route simple coding tasks to Claude Sonnet, complex to Opus
  • Hypothesis: cost-efficient routing without quality loss
  • Duration: 14 days, 200 sessions per variant
  • Result: treatment fitness 0.73 vs control 0.74 (p = 0.42), inconclusive -- experiment extended

Statistical Methods

The experiment system uses the following statistical methods:

  • Two-sample t-test for comparing mean fitness scores between variants
  • Mann-Whitney U test as a non-parametric alternative when fitness distributions are skewed
  • Bonferroni correction when multiple fitness dimensions are compared simultaneously
  • Sequential analysis with alpha-spending to allow early stopping when results are clearly significant
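
As a sketch of the first method: Welch's two-sample t statistic, with a large-sample normal approximation for the 95% threshold (reasonable at the configured minimum of 100 sessions per variant). This is illustrative, not PRX's implementation; a production version would use the t distribution with Welch-Satterthwaite degrees of freedom.

```rust
/// Sample mean and sample variance (n - 1 denominator).
fn mean_and_var(xs: &[f64]) -> (f64, f64) {
    let n = xs.len() as f64;
    let mean = xs.iter().sum::<f64>() / n;
    let var = xs.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
    (mean, var)
}

/// Welch's t statistic for unequal variances and sample sizes.
fn welch_t(control: &[f64], treatment: &[f64]) -> f64 {
    let (mc, vc) = mean_and_var(control);
    let (mt, vt) = mean_and_var(treatment);
    let se = (vc / control.len() as f64 + vt / treatment.len() as f64).sqrt();
    (mt - mc) / se
}

/// Two-sided test at 95% confidence using the z critical value 1.96
/// (large-sample approximation of the t distribution).
fn significant_at_95(t: f64) -> bool {
    t.abs() > 1.96
}

fn main() {
    // Synthetic per-session fitness: control around 0.72, treatment
    // around 0.75, both with small spread.
    let control: Vec<f64> =
        (0..100).map(|i| 0.72 + if i % 2 == 0 { 0.01 } else { -0.01 }).collect();
    let treatment: Vec<f64> =
        (0..100).map(|i| 0.75 + if i % 2 == 0 { 0.01 } else { -0.01 }).collect();
    let t = welch_t(&control, &treatment);
    println!("t = {:.2}, significant = {}", t, significant_at_95(t));
}
```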

Limitations

  • Experiments require sufficient session volume; low-traffic deployments may take weeks to reach significance
  • User satisfaction signals depend on explicit feedback, which may be sparse
  • LLM-as-judge scoring for response relevance adds latency and cost to the evaluation pipeline
  • Only one experiment can run per evolution layer at a time, to avoid confounding
  • Fitness scores are relative to the specific deployment; they are not comparable across different PRX instances


Released under the Apache-2.0 License.