Multimodal Content Handling
PRX prend en charge le contenu multimodal -- images, audio, et video -- a travers ses channels et LLM fournisseurs. The multimodal subsystem handles content type detection, format transcoding, size enforcement, et capability negotiation between channels et fournisseurs.
Apercu
Lorsqu'un utilisateur envoie a media attachment (photo, voice message, document) via un channel, the multimodal pipeline:
- Detects the content type using magic bytes and file extension
- Validates the content against size and format constraints
- Transcodes the content si le target fournisseur ne fait pas support the source format
- Dispatches the content to le LLM fournisseur as part of the conversation context
- Handles media in la reponse if le fournisseur genere images or audio
Channel Input Provider Output
│ │
▼ ▼
┌──────────────┐ ┌──────────────┐
│ Content Type │ │ Response │
│ Detection │ │ Media │
└──────┬───────┘ └──────┬───────┘
│ │
▼ ▼
┌──────────────┐ ┌──────────────┐
│ Validation │ │ Transcoding │
│ & Limits │ │ (if needed) │
└──────┬───────┘ └──────┬───────┘
│ │
▼ ▼
┌──────────────┐ ┌──────────────┐
│ Transcoding │ │ Channel │
│ (if needed) │ │ Delivery │
└──────┬───────┘ └──────────────┘
│
▼
┌──────────────┐
│ Provider │
│ Dispatch │
└──────────────┘Supported Content Types
Images
| Format | Detection | Send to Provider | Receive from Provider |
|---|---|---|---|
| JPEG | Magic bytes FF D8 FF | Oui | Oui |
| PNG | Magic bytes 89 50 4E 47 | Oui | Oui |
| GIF | Magic bytes 47 49 46 | Oui (first frame) | Non |
| WebP | RIFF header + WEBP | Oui | Oui |
| BMP | Magic bytes 42 4D | Transcoded to PNG | Non |
| TIFF | Magic bytes 49 49 or 4D 4D | Transcoded to PNG | Non |
| SVG | XML detection | Rasterized to PNG | Non |
Audio
| Format | Detection | Transcription | Provider Input |
|---|---|---|---|
| OGG/Opus | OGG header | Oui (via STT) | Transcribed text |
| MP3 | ID3/sync header | Oui (via STT) | Transcribed text |
| WAV | RIFF + WAVE | Oui (via STT) | Transcribed text |
| M4A/AAC | ftyp box | Oui (via STT) | Transcribed text |
| WebM | EBML header | Oui (via STT) | Transcribed text |
Video
| Format | Detection | Processing |
|---|---|---|
| MP4 | ftyp box | Extract keyframes + audio track |
| WebM | EBML header | Extract keyframes + audio track |
| MOV | ftyp box | Extract keyframes + audio track |
Les fichiers video sont decomposes en images cles et une piste audio. Les images cles sont envoyees comme images et the audio is transcribed.
Content Type Detection
Detection uses a two-pass approach:
- Magic bytes -- the first 16 bytes of le fichier are checked against known signatures
- File extension -- if magic bytes are inconclusive, le fichier extension is used comme un fallback
- MIME type header -- for content received via HTTP, the
Content-Typeheader is consulted
The detection result determine which processing pipeline gere le content.
Configuration
[multimodal]
enabled = true
[multimodal.images]
max_size_bytes = 20_971_520 # 20 MB
max_resolution = "4096x4096" # maximum width x height
auto_resize = true # resize images exceeding max_resolution
resize_quality = 85 # JPEG quality for resized images (1-100)
strip_exif = true # remove EXIF metadata for privacy
[multimodal.audio]
max_size_bytes = 26_214_400 # 25 MB
max_duration_secs = 300 # 5 minutes
stt_provider = "whisper" # "whisper", "deepgram", or "provider" (use LLM provider's STT)
stt_model = "whisper-1"
stt_language = "auto" # "auto" for language detection, or ISO 639-1 code
[multimodal.video]
max_size_bytes = 104_857_600 # 100 MB
max_duration_secs = 120 # 2 minutes
keyframe_interval_secs = 5 # extract one keyframe every 5 seconds
max_keyframes = 20 # maximum keyframes to extract
extract_audio = true # transcribe audio trackConfiguration Reference
Images
| Champ | Type | Defaut | Description |
|---|---|---|---|
max_size_bytes | u64 | 20971520 | Maximum image file size (20 MB) |
max_resolution | String | "4096x4096" | Maximum image dimensions (WxH) |
auto_resize | bool | true | Automatically resize oversized images |
resize_quality | u8 | 85 | JPEG quality for resized images (1--100) |
strip_exif | bool | true | Remove EXIF metadata from images |
Audio
| Champ | Type | Defaut | Description |
|---|---|---|---|
max_size_bytes | u64 | 26214400 | Maximum audio file size (25 MB) |
max_duration_secs | u64 | 300 | Maximum audio duration (5 minutes) |
stt_fournisseur | String | "whisper" | Speech-to-text fournisseur |
stt_model | String | "whisper-1" | STT model name |
stt_language | String | "auto" | Language hint for transcription |
Video
| Champ | Type | Defaut | Description |
|---|---|---|---|
max_size_bytes | u64 | 104857600 | Maximum video file size (100 MB) |
max_duration_secs | u64 | 120 | Maximum video duration (2 minutes) |
keyframe_interval_secs | u64 | 5 | Seconds between extracted keyframes |
max_keyframes | usize | 20 | Maximum number of keyframes to extract |
extract_audio | bool | true | Transcribe the video's audio track |
Provider Capabilities
Nont all LLM fournisseurs support the same media types. PRX negotiates capabilities automatically:
| Provider | Image Input | Image Output | Audio Input | Native Multimodal |
|---|---|---|---|---|
| Anthropic (Claude) | Oui | Non | Non (transcribe first) | Oui (vision) |
| OpenAI (GPT-4o) | Oui | Oui (DALL-E) | Oui (Whisper) | Oui |
| Google (Gemini) | Oui | Oui (Imagen) | Oui | Oui |
| Ollama (LLaVA) | Oui | Non | Non | Oui (vision) |
| AWS Bedrock | Varies by model | Varies | Non | Varies |
When a fournisseur ne fait pas support a media type natively, PRX applies fallback processing:
- Image not supported -- l'image est decrite a l'aide d'un modele capable de vision, and the description is sent as text
- Audio not supported -- audio is transcribed en utilisant le configured STT fournisseur, and the transcript is sent as text
- Video not supported -- keyframes and audio transcript sont envoyes comme un composite message
Channel Media Limites
Each channel imposes its own file size and format restrictions:
| Channel | Max Upload | Max Download | Supported Formats |
|---|---|---|---|
| Telegram | 50 MB | 20 MB | Images, audio, video, documents |
| Discord | 25 MB (free) | 25 MB | Images, audio, video, documents |
| 16 MB (media) | 16 MB | JPEG, PNG, MP3, MP4, PDF | |
| 20 MB | 20 MB | Images, audio, documents | |
| DingTalk | 20 MB | 20 MB | Images, audio, documents |
| Lark | 25 MB | 25 MB | Images, audio, video, documents |
| Matrix | Homeserver dependent | Homeserver dependent | All common formats |
| 25 MB (typical) | 25 MB | All via MIME attachments | |
| CLI | Filesystem limit | Filesystem limit | All formats |
PRX applique the channel's limits before attempting to send a response. Si un generated image ou file exceeds the channel limit, it is compressed ou un download link is provided instead.
Transcoding Pipeline
When format conversion is needed, PRX utilise les elements suivants transcoding pipeline:
- Image transcoding -- gere par the
imagecrate (pure Rust, no external dependencies) - Audio transcoding -- gere par FFmpeg if installed, otherwise falls back to native decoders for common formats
- Video keyframe extraction -- necessite FFmpeg
FFmpeg Detection
PRX automatiquement detects FFmpeg au demarrage:
prx doctor multimodalOutput:
Multimodal Support:
Images: OK (native)
Audio transcoding: OK (ffmpeg 6.1 detected)
Video processing: OK (ffmpeg 6.1 detected)
STT provider: OK (whisper-1 via OpenAI)If FFmpeg is pas installed, audio transcoding et video processing are limited to nativement supported formats.
Limiteations
- Video processing necessite FFmpeg to be installed on le systeme
- Large media files may significantly increase LLM token usage (especially multiple keyframes)
- Some fournisseurs charge additional fees for vision/multimodal API calls
- Real-time audio streaming (live voice conversation) is not yet supported
- Generated images from fournisseurs (DALL-E, Imagen) sont soumis a le fournisseur's content policy
- SVG rasterization uses a basic renderer; complex SVGs ne peut pas render accurately
Voir aussi Pages
- Agent Runtime -- how media content flows through la boucle de l'agent
- Channels Overview -- channel-specific media handling
- Providers Overview -- fournisseur multimodal capabilities
- Embeddings Backend -- embedding models for memory