Vector Search & Text Processing
PRX includes a text processing pipeline that powers semantic memory retrieval. This pipeline handles text chunking, vector embedding, topic extraction, and content filtering -- transforming raw conversation text into searchable, organized memory entries.
Architecture
The text processing pipeline consists of four stages, each configurable independently:
Raw Text
│
▼
┌──────────┐    ┌───────────┐    ┌───────────┐    ┌──────────┐
│ Chunker  │───►│ Embedder  │───►│   Topic   │───►│  Filter  │
│          │    │           │    │ Extractor │    │          │
└──────────┘    └───────────┘    └───────────┘    └──────────┘
 Split text      Vectorize        Classify        Decide if
 into chunks     each chunk       by topic        worth saving
Vector Search
Vector search enables semantic similarity retrieval -- finding memories that are conceptually related to a query even when the exact words differ.
How It Works
- Indexing -- each memory chunk is embedded into a dense vector (e.g., 768 dimensions)
- Storage -- vectors are stored in a vector index (sqlite-vec, pgvector, or in-memory)
- Query -- the search query is embedded using the same model
- Retrieval -- the index returns the top-K vectors by cosine similarity
- Reranking -- optionally, results are reranked using a cross-encoder for higher precision
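The indexing-to-retrieval flow above can be sketched in a few lines. This is a minimal illustration, not PRX's implementation: a plain dict stands in for the vector index, and the tiny 3-dimensional vectors stand in for real 768-dimensional embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(query_vec, index, top_k=10, threshold=0.5):
    """Score every stored vector against the query, drop scores below the
    threshold, and return the top-K (score, key) pairs, highest first."""
    scored = [(cosine_similarity(query_vec, vec), key)
              for key, vec in index.items()]
    scored = [(s, k) for s, k in scored if s >= threshold]
    return sorted(scored, reverse=True)[:top_k]

# Toy index: two stored memory chunks.
index = {
    "rust-memo":  [0.9, 0.1, 0.0],
    "lunch-note": [0.0, 0.2, 0.9],
}
print(search([1.0, 0.0, 0.0], index, top_k=1))
```

The similarity_threshold prunes weak matches before the top-K cut, so a query can legitimately return fewer than top_k results.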
Configuration
[memory.vector]
enabled = true
index_type = "sqlite-vec" # "sqlite-vec", "pgvector", or "memory"
similarity_metric = "cosine" # "cosine", "dot_product", or "euclidean"
top_k = 10
similarity_threshold = 0.5
rerank = false
rerank_model = "cross-encoder/ms-marco-MiniLM-L-6-v2"

Index Types
| Index Type | Storage | Persistence | Best For |
|---|---|---|---|
| sqlite-vec | Local file | Yes | Single-user, local deployments |
| pgvector | PostgreSQL | Yes | Multi-user, production deployments |
| memory | In-process | No (session only) | Testing and ephemeral sessions |
Configuration Reference
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | true | Enable or disable vector search |
| index_type | String | "sqlite-vec" | Vector index backend |
| similarity_metric | String | "cosine" | Distance metric for similarity comparison |
| top_k | usize | 10 | Number of results to return per query |
| similarity_threshold | f64 | 0.5 | Minimum similarity score (0.0--1.0) to include in results |
| rerank | bool | false | Enable cross-encoder reranking for improved precision |
| rerank_model | String | "" | Cross-encoder model name (only used when rerank = true) |
| ef_search | usize | 64 | HNSW search parameter (higher = more accurate, slower) |
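The three similarity_metric options differ in how they score a pair of vectors. A minimal sketch of each (not PRX code; note that euclidean is a distance, so smaller means closer, while the other two are similarities where larger means closer):

```python
import math

def dot_product(a, b):
    """Raw dot product; sensitive to vector magnitude."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Dot product of the normalized vectors; ignores magnitude."""
    na = math.sqrt(dot_product(a, a))
    nb = math.sqrt(dot_product(b, b))
    return dot_product(a, b) / (na * nb) if na and nb else 0.0

def euclidean(a, b):
    """Straight-line distance; 0.0 means identical vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a, b = [1.0, 2.0], [2.0, 4.0]
print(cosine(a, b))       # parallel vectors score 1.0
print(dot_product(a, b))
print(euclidean(a, b))
```

Cosine is the usual default for text embeddings because many embedding models produce vectors whose magnitude carries little meaning.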
Text Chunking
Before embedding, long texts must be split into smaller pieces. PRX provides two chunking strategies: token-aware and semantic.
Token-Aware Chunking
Token-aware chunking splits text at token boundaries to guarantee that each chunk fits within the embedding model's context window. It respects word and sentence boundaries to avoid cutting mid-word.
[memory.chunker]
strategy = "token"
max_tokens = 512
overlap_tokens = 64
tokenizer = "cl100k_base" # OpenAI-compatible tokenizer

The algorithm:
- Tokenize the input text using the configured tokenizer
- Split into chunks of at most max_tokens tokens
- Each chunk overlaps with the previous by overlap_tokens to preserve context at boundaries
- Chunk boundaries are adjusted to align with sentence or paragraph endings where possible
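The core sliding-window step can be sketched as follows. This is an illustration only: it operates on an already-tokenized sequence (integers stand in for token IDs), assumes overlap_tokens < max_tokens, and omits the sentence/paragraph boundary alignment the real chunker performs.

```python
def chunk_tokens(tokens, max_tokens=512, overlap_tokens=64):
    """Split a token sequence into windows of at most max_tokens, where
    consecutive windows share overlap_tokens tokens at the seam."""
    step = max_tokens - overlap_tokens  # assumes overlap < max
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # last window already covers the tail
    return chunks

# Tiny parameters so the overlap is visible.
print(chunk_tokens(list(range(10)), max_tokens=4, overlap_tokens=2))
```

Each chunk repeats the last two "tokens" of its predecessor, so context that straddles a boundary is embedded twice rather than lost.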
Semantic Chunking
Semantic chunking uses embedding similarity to find natural topic boundaries in the text. Instead of splitting at fixed token counts, it detects where the topic shifts.
[memory.chunker]
strategy = "semantic"
max_tokens = 1024
min_tokens = 64
breakpoint_threshold = 0.3

The algorithm:
- Split the text into sentences
- Compute embeddings for each sentence
- Calculate cosine similarity between consecutive sentences
- When similarity drops below breakpoint_threshold, insert a chunk boundary
- Merge small chunks (below min_tokens) with adjacent chunks
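The breakpoint-detection steps above can be sketched like this. It is a simplified illustration: the embedder is a toy lookup table rather than a real model, and the final min_tokens merge pass is omitted.

```python
import math

def semantic_chunks(sentences, embed, breakpoint_threshold=0.3):
    """Group consecutive sentences, starting a new chunk whenever the
    cosine similarity between neighbouring sentences drops below the
    threshold (i.e., the topic appears to shift)."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    chunks, current = [], [sentences[0]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cos(embed(prev), embed(cur)) < breakpoint_threshold:
            chunks.append(current)  # topic shift: close the chunk
            current = []
        current.append(cur)
    chunks.append(current)
    return chunks

# Toy embedder: sentences on the same topic point in the same direction.
vecs = {"Rust is fast.": [1.0, 0.0],
        "It compiles to native code.": [0.9, 0.1],
        "Lunch is at noon.": [0.0, 1.0]}
print(semantic_chunks(list(vecs), vecs.get))
```

The first two sentences stay together (similarity ≈ 0.99), while the off-topic third sentence starts a new chunk (similarity ≈ 0.11 < 0.3).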
Chunking Configuration Reference
| Field | Type | Default | Description |
|---|---|---|---|
| strategy | String | "token" | Chunking strategy: "token" or "semantic" |
| max_tokens | usize | 512 | Maximum tokens per chunk |
| overlap_tokens | usize | 64 | Overlap between consecutive chunks (token strategy only) |
| tokenizer | String | "cl100k_base" | Tokenizer name for token counting |
| min_tokens | usize | 64 | Minimum tokens per chunk (semantic strategy only) |
| breakpoint_threshold | f64 | 0.3 | Similarity drop threshold for topic boundaries (semantic strategy only) |
Choosing a Strategy
| Criterion | Token-Aware | Semantic |
|---|---|---|
| Speed | Fast (no embedding calls during chunking) | Slower (requires per-sentence embedding) |
| Quality | Good for uniform content | Better for multi-topic documents |
| Predictability | Consistent chunk sizes | Variable chunk sizes |
| Use case | Chat logs, short messages | Long documents, meeting notes |
Topic Extraction
PRX automatically extracts topics from memory entries to organize them into categories. Topics improve retrieval by enabling filtered search within specific domains.
How It Works
- After chunking, each chunk is analyzed for topic keywords and semantic content
- The topic extractor assigns one or more topic labels from a configurable taxonomy
- Topics are stored alongside the memory entry as metadata
- During recall, queries can optionally filter by topic to narrow results
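The label-assignment step can be sketched as a simple ranking over classifier confidences. This is an illustration, not PRX's extractor; the score dictionary stands in for whatever per-topic confidences the real classifier produces.

```python
def assign_topics(scores, max_topics_per_entry=3, min_confidence=0.6):
    """Keep topic labels whose confidence clears min_confidence, capped
    at max_topics_per_entry, highest confidence first."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [label for label, conf in ranked
            if conf >= min_confidence][:max_topics_per_entry]

# Hypothetical classifier output for one memory chunk.
scores = {"coding": 0.91, "debugging": 0.74, "planning": 0.41}
print(assign_topics(scores))  # ['coding', 'debugging']
```

"planning" is dropped because its confidence (0.41) falls below min_confidence, even though the entry has room for a third label.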
Configuration
[memory.topics]
enabled = true
max_topics_per_entry = 3
taxonomy = "auto" # "auto", "fixed", or "hybrid"
custom_topics = [] # only used when taxonomy = "fixed" or "hybrid"
min_confidence = 0.6

Taxonomy Modes
| Mode | Description |
|---|---|
| auto | Topics are generated dynamically from the content. New topics are created as needed. |
| fixed | Only topics from custom_topics are assigned. Content that does not match any topic is left uncategorized. |
| hybrid | Prefers custom_topics but creates new topics when content does not match any existing label. |
Topic Configuration Reference
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | true | Enable or disable topic extraction |
| max_topics_per_entry | usize | 3 | Maximum topic labels per memory entry |
| taxonomy | String | "auto" | Taxonomy mode: "auto", "fixed", or "hybrid" |
| custom_topics | [String] | [] | Custom topic labels for fixed/hybrid taxonomies |
| min_confidence | f64 | 0.6 | Minimum confidence score (0.0--1.0) to assign a topic |
Content Filtering
Not every message is worth saving to long-term memory. The content filter applies autosave heuristics to decide which content should be persisted and which should be discarded.
Autosave Heuristics
The filter evaluates each candidate memory entry against several criteria:
| Heuristic | Description | Weight |
|---|---|---|
| Information density | Ratio of unique tokens to total tokens. Low-density text (e.g., "ok", "thanks") is filtered out | High |
| Novelty | Similarity to existing memories. Content too similar to what is already stored is skipped | High |
| Relevance | Semantic similarity to the user's known interests and active topics | Medium |
| Actionability | Presence of action items, decisions, or commitments (e.g., "I will...", "let's do...") | Medium |
| Recency bias | Recent context is weighted higher for short-term relevance | Low |
A composite score is computed as a weighted sum. Entries scoring below the autosave_threshold are not persisted.
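The weighted sum can be sketched as below. The specific weight values are hypothetical, chosen only to mirror the High/Medium/Low tiers in the table above; per-heuristic scores are assumed to be normalized to [0, 1].

```python
# Hypothetical weights echoing the High/Medium/Low tiers; they sum to
# 1.0 so the composite stays in [0, 1].
WEIGHTS = {"density": 0.3, "novelty": 0.3, "relevance": 0.15,
           "actionability": 0.15, "recency": 0.1}

def composite_score(scores, weights=WEIGHTS):
    """Weighted sum of heuristic scores; comparable to autosave_threshold."""
    return sum(weights[name] * scores.get(name, 0.0) for name in weights)

# One candidate entry: dense and novel, but with no action items.
scores = {"density": 0.8, "novelty": 0.9, "relevance": 0.5,
          "actionability": 0.0, "recency": 0.6}
print(composite_score(scores) >= 0.4)  # above autosave_threshold?
```

With these numbers the composite is 0.645, so the entry clears the default autosave_threshold of 0.4 and is persisted.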
Configuration
[memory.filter]
enabled = true
autosave_threshold = 0.4
novelty_threshold = 0.85 # skip if >85% similar to existing memory
min_length = 20 # skip entries shorter than 20 characters
max_length = 10000 # truncate entries longer than 10,000 characters
exclude_patterns = [
"^(ok|thanks|got it|sure)$",
"^\\s*$",
]

Filter Configuration Reference
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | true | Enable or disable content filtering |
| autosave_threshold | f64 | 0.4 | Minimum composite score (0.0--1.0) to persist a memory |
| novelty_threshold | f64 | 0.85 | Maximum similarity to existing memories before deduplication |
| min_length | usize | 20 | Minimum character length for a memory entry |
| max_length | usize | 10000 | Maximum character length (longer entries are truncated) |
| exclude_patterns | [String] | [] | Regex patterns for content that should never be saved |
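How exclude_patterns behaves can be sketched with Python's re module, using the two patterns from the configuration example above. Treating matches as case-insensitive and stripping surrounding whitespace are assumptions of this sketch, not documented PRX behavior.

```python
import re

# The two patterns from the [memory.filter] example above.
EXCLUDE_PATTERNS = [r"^(ok|thanks|got it|sure)$", r"^\s*$"]

def is_excluded(text, patterns=EXCLUDE_PATTERNS):
    """True if the entry matches any exclude pattern.
    Assumption: whitespace is stripped and matching ignores case."""
    return any(re.fullmatch(p, text.strip(), re.IGNORECASE)
               for p in patterns)

print(is_excluded("Thanks"))                     # True: filler reply
print(is_excluded("Let's switch to pgvector."))  # False: worth scoring
```

Entries matched here are rejected outright; everything else proceeds to the heuristic scoring described above.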
Full Pipeline Example
A complete configuration combining all four stages:
[memory]
backend = "embeddings"
[memory.embeddings]
provider = "ollama"
model = "nomic-embed-text"
dimension = 768
[memory.vector]
enabled = true
index_type = "sqlite-vec"
top_k = 10
similarity_threshold = 0.5
[memory.chunker]
strategy = "semantic"
max_tokens = 1024
min_tokens = 64
breakpoint_threshold = 0.3
[memory.topics]
enabled = true
taxonomy = "hybrid"
custom_topics = ["coding", "architecture", "debugging", "planning"]
[memory.filter]
enabled = true
autosave_threshold = 0.4
novelty_threshold = 0.85

See Also
- Memory System Overview
- Embeddings Backend -- embedding provider configuration
- SQLite Backend -- local storage for sqlite-vec index
- PostgreSQL Backend -- storage for pgvector index
- Memory Hygiene -- compaction and cleanup strategies