Multimodal Content Handling

PRX supports multimodal content -- images, audio, and video -- across its channels and LLM providers. The multimodal subsystem handles content type detection, format transcoding, size enforcement, and capability negotiation between channels and providers.

Overview

When a user sends a media attachment (photo, voice message, document) through a channel, the multimodal pipeline:

  1. Detects the content type using magic bytes and file extension
  2. Validates the content against size and format constraints
  3. Transcodes the content if the target provider does not support the source format
  4. Dispatches the content to the LLM provider as part of the conversation context
  5. Handles media in the response if the provider generates images or audio

Channel Input                    Provider Output
  │                                  │
  ▼                                  ▼
┌──────────────┐              ┌──────────────┐
│ Content Type │              │ Response     │
│ Detection    │              │ Media        │
└──────┬───────┘              └──────┬───────┘
       │                             │
       ▼                             ▼
┌──────────────┐              ┌──────────────┐
│ Validation   │              │ Transcoding  │
│ & Limits     │              │ (if needed)  │
└──────┬───────┘              └──────┬───────┘
       │                             │
       ▼                             ▼
┌──────────────┐              ┌──────────────┐
│ Transcoding  │              │ Channel      │
│ (if needed)  │              │ Delivery     │
└──────┬───────┘              └──────────────┘
       │
       ▼
┌──────────────┐
│ Provider     │
│ Dispatch     │
└──────────────┘

Supported Content Types

Images

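| Format | Detection | Send to Provider | Receive from Provider |
|--------|-----------|------------------|-----------------------|
| JPEG | Magic bytes FF D8 FF | Yes | Yes |
| PNG | Magic bytes 89 50 4E 47 | Yes | Yes |
| GIF | Magic bytes 47 49 46 | Yes (first frame) | No |
| WebP | RIFF header + WEBP | Yes | Yes |
| BMP | Magic bytes 42 4D | Transcoded to PNG | No |
| TIFF | Magic bytes 49 49 or 4D 4D | Transcoded to PNG | No |
| SVG | XML detection | Rasterized to PNG | No |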
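
The send and receive columns of the image table can be read as a small lookup. The sketch below is only an illustration of that table: the `ImageFormat` and `SendPlan` types and both function names are hypothetical and are not taken from the PRX source.

```rust
/// Image formats listed in the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ImageFormat {
    Jpeg,
    Png,
    Gif,
    WebP,
    Bmp,
    Tiff,
    Svg,
}

/// What happens to an image before it is sent to a provider
/// (hypothetical names for the "Send to Provider" column).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SendPlan {
    PassThrough,
    FirstFrameOnly,
    TranscodeToPng,
    RasterizeToPng,
}

/// Map a detected format to its "Send to Provider" handling.
pub fn send_plan(format: ImageFormat) -> SendPlan {
    match format {
        ImageFormat::Jpeg | ImageFormat::Png | ImageFormat::WebP => SendPlan::PassThrough,
        ImageFormat::Gif => SendPlan::FirstFrameOnly,
        ImageFormat::Bmp | ImageFormat::Tiff => SendPlan::TranscodeToPng,
        ImageFormat::Svg => SendPlan::RasterizeToPng,
    }
}

/// Formats marked "Yes" in the "Receive from Provider" column.
pub fn can_receive(format: ImageFormat) -> bool {
    matches!(
        format,
        ImageFormat::Jpeg | ImageFormat::Png | ImageFormat::WebP
    )
}
```

Because the `match` is exhaustive, adding a new variant to `ImageFormat` forces a decision about how it is sent.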

Audio

| Format | Detection | Transcription | Provider Input |
|--------|-----------|---------------|----------------|
| OGG/Opus | OGG header | Yes (via STT) | Transcribed text |
| MP3 | ID3/sync header | Yes (via STT) | Transcribed text |
| WAV | RIFF + WAVE | Yes (via STT) | Transcribed text |
| M4A/AAC | ftyp box | Yes (via STT) | Transcribed text |
| WebM | EBML header | Yes (via STT) | Transcribed text |

Video

| Format | Detection | Processing |
|--------|-----------|------------|
| MP4 | ftyp box | Extract keyframes + audio track |
| WebM | EBML header | Extract keyframes + audio track |
| MOV | ftyp box | Extract keyframes + audio track |

Video files are decomposed into keyframes and an audio track. The keyframes are sent as images and the audio is transcribed.

Content Type Detection

Detection uses a two-pass approach, plus a third source for content received over HTTP:

  1. Magic bytes -- the first 16 bytes of the file are checked against known signatures
  2. File extension -- if the magic bytes are inconclusive, the file extension is used as a fallback
  3. MIME type header -- for content received via HTTP, the Content-Type header is consulted

The detection result determines which processing pipeline handles the content.
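
The lookup order described above can be sketched as follows. This is a minimal illustration, not PRX's actual implementation: the type and function names are hypothetical, only a few formats are covered, and treating the Content-Type header as the last resort is an assumption (the text does not say how it ranks against the other two sources).

```rust
/// A subset of the media kinds from the tables above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MediaKind {
    Jpeg,
    Png,
    Gif,
    WebP,
    Bmp,
    Wav,
    Ogg,
}

/// Pass 1: compare the leading bytes against known signatures.
pub fn sniff_magic(bytes: &[u8]) -> Option<MediaKind> {
    // Only the first 16 bytes are inspected.
    let head = &bytes[..bytes.len().min(16)];
    if head.starts_with(&[0xFF, 0xD8, 0xFF]) {
        return Some(MediaKind::Jpeg);
    }
    if head.starts_with(&[0x89, 0x50, 0x4E, 0x47]) {
        return Some(MediaKind::Png);
    }
    if head.starts_with(b"GIF") {
        return Some(MediaKind::Gif);
    }
    if head.starts_with(b"OggS") {
        return Some(MediaKind::Ogg);
    }
    if head.starts_with(b"BM") {
        return Some(MediaKind::Bmp);
    }
    // RIFF containers carry their subtype at byte offset 8.
    if head.len() >= 12 && head.starts_with(b"RIFF") {
        let tag = &head[8..12];
        if tag == b"WEBP" {
            return Some(MediaKind::WebP);
        }
        if tag == b"WAVE" {
            return Some(MediaKind::Wav);
        }
    }
    None
}

/// Pass 2: fall back to the file extension (case-insensitive).
pub fn from_extension(file_name: &str) -> Option<MediaKind> {
    let ext = file_name.rsplit_once('.')?.1.to_ascii_lowercase();
    match ext.as_str() {
        "jpg" | "jpeg" => Some(MediaKind::Jpeg),
        "png" => Some(MediaKind::Png),
        "gif" => Some(MediaKind::Gif),
        "webp" => Some(MediaKind::WebP),
        "bmp" => Some(MediaKind::Bmp),
        "wav" => Some(MediaKind::Wav),
        "ogg" | "opus" => Some(MediaKind::Ogg),
        _ => None,
    }
}

/// Extra source for HTTP content: the Content-Type header value.
pub fn from_mime(content_type: &str) -> Option<MediaKind> {
    // Drop parameters such as "; charset=binary".
    let essence = content_type.split(';').next()?.trim().to_ascii_lowercase();
    match essence.as_str() {
        "image/jpeg" => Some(MediaKind::Jpeg),
        "image/png" => Some(MediaKind::Png),
        "image/gif" => Some(MediaKind::Gif),
        "image/webp" => Some(MediaKind::WebP),
        "image/bmp" => Some(MediaKind::Bmp),
        "audio/wav" | "audio/x-wav" => Some(MediaKind::Wav),
        "audio/ogg" => Some(MediaKind::Ogg),
        _ => None,
    }
}

/// Try each source in turn; later sources run only if earlier ones fail.
pub fn detect(
    bytes: &[u8],
    file_name: Option<&str>,
    content_type: Option<&str>,
) -> Option<MediaKind> {
    sniff_magic(bytes)
        .or_else(|| file_name.and_then(from_extension))
        .or_else(|| content_type.and_then(from_mime))
}
```

Trusting the bytes first means a mislabeled upload, for example a PNG saved as `photo.jpg`, is still routed to the pipeline that matches its real content.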

Configuration

toml
[multimodal]
enabled = true

[multimodal.images]
max_size_bytes = 20_971_520      # 20 MB
max_resolution = "4096x4096"     # maximum width x height
auto_resize = true               # resize images exceeding max_resolution
resize_quality = 85              # JPEG quality for resized images (1-100)
strip_exif = true                # remove EXIF metadata for privacy

[multimodal.audio]
max_size_bytes = 26_214_400      # 25 MB
max_duration_secs = 300          # 5 minutes
stt_provider = "whisper"         # "whisper", "deepgram", or "provider" (use LLM provider's STT)
stt_model = "whisper-1"
stt_language = "auto"            # "auto" for language detection, or ISO 639-1 code

[multimodal.video]
max_size_bytes = 104_857_600     # 100 MB
max_duration_secs = 120          # 2 minutes
keyframe_interval_secs = 5       # extract one keyframe every 5 seconds
max_keyframes = 20               # maximum keyframes to extract
extract_audio = true             # transcribe audio track

Configuration Reference

Images

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| max_size_bytes | u64 | 20971520 | Maximum image file size (20 MB) |
| max_resolution | String | "4096x4096" | Maximum image dimensions (WxH) |
| auto_resize | bool | true | Automatically resize oversized images |
| resize_quality | u8 | 85 | JPEG quality for resized images (1-100) |
| strip_exif | bool | true | Remove EXIF metadata from images |
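
To make the interaction between `max_size_bytes`, `max_resolution`, and `auto_resize` concrete, here is a hedged sketch of an image check. The struct, enum, and function names are hypothetical. The sketch also assumes that resizing preserves the aspect ratio and that an oversized image is rejected when `auto_resize` is off; the page states neither.

```rust
/// Limits mirroring the `[multimodal.images]` defaults (hypothetical struct).
#[derive(Debug, Clone, Copy)]
pub struct ImageLimits {
    pub max_size_bytes: u64,
    pub max_resolution: (u32, u32),
    pub auto_resize: bool,
}

impl Default for ImageLimits {
    fn default() -> Self {
        Self {
            max_size_bytes: 20_971_520, // 20 MB
            max_resolution: (4096, 4096),
            auto_resize: true,
        }
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ImageCheck {
    Accept,
    Resize { width: u32, height: u32 },
    Reject(&'static str),
}

/// Parse a "WIDTHxHEIGHT" string such as "4096x4096".
pub fn parse_resolution(s: &str) -> Option<(u32, u32)> {
    let (w, h) = s.split_once('x')?;
    Some((w.trim().parse::<u32>().ok()?, h.trim().parse::<u32>().ok()?))
}

/// Largest size that fits inside `max` while keeping the aspect ratio.
pub fn fit_within(size: (u32, u32), max: (u32, u32)) -> (u32, u32) {
    let (w, h) = size;
    let (max_w, max_h) = max;
    if w == 0 || h == 0 || (w <= max_w && h <= max_h) {
        return size;
    }
    // Compare w/max_w with h/max_h in integers to find the limiting axis.
    let (w64, h64) = (w as u64, h as u64);
    let (mw, mh) = (max_w as u64, max_h as u64);
    if w64 * mh >= h64 * mw {
        // Width is the limiting axis.
        (max_w, (h64 * mw / w64).max(1) as u32)
    } else {
        // Height is the limiting axis.
        ((w64 * mh / h64).max(1) as u32, max_h)
    }
}

/// Apply the size limit first, then the resolution limit.
pub fn check_image(size_bytes: u64, dims: (u32, u32), limits: &ImageLimits) -> ImageCheck {
    if size_bytes > limits.max_size_bytes {
        return ImageCheck::Reject("file exceeds max_size_bytes");
    }
    let (max_w, max_h) = limits.max_resolution;
    if dims.0 <= max_w && dims.1 <= max_h {
        return ImageCheck::Accept;
    }
    if limits.auto_resize {
        let (width, height) = fit_within(dims, limits.max_resolution);
        ImageCheck::Resize { width, height }
    } else {
        // Assumption: without auto_resize an oversized image is refused.
        ImageCheck::Reject("image exceeds max_resolution")
    }
}
```

With the defaults, an 8192x4096 image under 20 MB would be scaled to 4096x2048 in this sketch.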

Audio

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| max_size_bytes | u64 | 26214400 | Maximum audio file size (25 MB) |
| max_duration_secs | u64 | 300 | Maximum audio duration (5 minutes) |
| stt_provider | String | "whisper" | Speech-to-text provider |
| stt_model | String | "whisper-1" | STT model name |
| stt_language | String | "auto" | Language hint for transcription |

Video

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| max_size_bytes | u64 | 104857600 | Maximum video file size (100 MB) |
| max_duration_secs | u64 | 120 | Maximum video duration (2 minutes) |
| keyframe_interval_secs | u64 | 5 | Seconds between extracted keyframes |
| max_keyframes | usize | 20 | Maximum number of keyframes to extract |
| extract_audio | bool | true | Transcribe the video's audio track |
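
The way `keyframe_interval_secs` and `max_keyframes` interact is easiest to see with numbers. The function below is a sketch under two assumptions that the page does not confirm: sampling starts at 0 seconds, and the cap simply truncates the list rather than re-spacing the frames. The function name is hypothetical.

```rust
/// Timestamps (in seconds) at which keyframes would be sampled.
pub fn keyframe_timestamps(
    duration_secs: u64,
    interval_secs: u64,
    max_keyframes: usize,
) -> Vec<u64> {
    // A zero interval would never advance, so treat it as "no keyframes".
    if interval_secs == 0 {
        return Vec::new();
    }
    (0..duration_secs)
        .step_by(interval_secs as usize)
        .take(max_keyframes) // assumption: the cap truncates the tail
        .collect()
}
```

With the defaults, a 120-second video yields 24 candidate timestamps (0, 5, ..., 115). The cap of 20 keeps 0 through 95, so under this truncation assumption the timestamps from 100 to 115 are dropped. Raising `keyframe_interval_secs` or `max_keyframes` changes that trade-off.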

Provider Capabilities

Not all LLM providers support the same media types. PRX negotiates capabilities automatically:

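| Provider | Image Input | Image Output | Audio Input | Native Multimodal |
|----------|-------------|--------------|-------------|-------------------|
| Anthropic (Claude) | Yes | No | No (transcribe first) | Yes (vision) |
| OpenAI (GPT-4o) | Yes | Yes (DALL-E) | Yes (Whisper) | Yes |
| Google (Gemini) | Yes | Yes (Imagen) | Yes | Yes |
| Ollama (LLaVA) | Yes | No | No | Yes (vision) |
| AWS Bedrock | Varies by model | Varies | No | Varies |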

When a provider does not support a media type natively, PRX applies fallback processing:

  • Image not supported -- the image is described using a vision-capable model, and the description is sent as text
  • Audio not supported -- audio is transcribed using the configured STT provider, and the transcript is sent as text
  • Video not supported -- keyframes and the audio transcript are sent as a composite message
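
The fallback rules above amount to a decision per media type. The following sketch shows one way to express them. All names are hypothetical, and the `video_input` flag in particular is an assumption, since the capability table has no video column.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Media {
    Image,
    Audio,
    Video,
}

/// Capability flags for one provider (hypothetical struct).
#[derive(Debug, Clone, Copy)]
pub struct ProviderCaps {
    pub image_input: bool,
    pub audio_input: bool,
    pub video_input: bool, // assumption: not listed in the table above
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Delivery {
    /// The provider accepts the media as-is.
    Native,
    /// Describe the image with a vision-capable model, send the text.
    DescribeWithVisionModel,
    /// Transcribe with the configured STT provider, send the text.
    TranscribeWithStt,
    /// Send keyframes plus the audio transcript as one composite message.
    KeyframesPlusTranscript,
}

/// Pick native delivery when supported, otherwise the documented fallback.
pub fn plan_delivery(media: Media, caps: ProviderCaps) -> Delivery {
    let native = match media {
        Media::Image => caps.image_input,
        Media::Audio => caps.audio_input,
        Media::Video => caps.video_input,
    };
    if native {
        return Delivery::Native;
    }
    match media {
        Media::Image => Delivery::DescribeWithVisionModel,
        Media::Audio => Delivery::TranscribeWithStt,
        Media::Video => Delivery::KeyframesPlusTranscript,
    }
}
```

For a provider with image input but no audio input, as in the Anthropic row, an image would go through natively while a voice message would be transcribed first.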

Channel Media Limits

Each channel imposes its own file size and format restrictions:

| Channel | Max Upload | Max Download | Supported Formats |
|---------|------------|--------------|-------------------|
| Telegram | 50 MB | 20 MB | Images, audio, video, documents |
| Discord | 25 MB (free) | 25 MB | Images, audio, video, documents |
| WhatsApp | 16 MB (media) | 16 MB | JPEG, PNG, MP3, MP4, PDF |
| QQ | 20 MB | 20 MB | Images, audio, documents |
| DingTalk | 20 MB | 20 MB | Images, audio, documents |
| Lark | 25 MB | 25 MB | Images, audio, video, documents |
| Matrix | Homeserver dependent | Homeserver dependent | All common formats |
| Email | 25 MB (typical) | 25 MB | All via MIME attachments |
| CLI | Filesystem limit | Filesystem limit | All formats |

PRX enforces the channel's limits before attempting to send a response. If a generated image or file exceeds the channel limit, it is compressed or a download link is provided instead.
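
As an illustration of that check, the sketch below encodes the upload column of the table and picks an outcome. The channel identifier strings, type names, and function names are hypothetical. How PRX chooses between compressing and linking is not documented here, so the `compressible` flag is purely an assumption. Sizes use 1 MB = 1,048,576 bytes, matching the byte values in the configuration section.

```rust
const MB: u64 = 1024 * 1024;

/// Upload limit per channel, taken from the table above.
/// Returns None where the table gives no fixed number (Matrix, CLI).
pub fn max_upload_bytes(channel: &str) -> Option<u64> {
    match channel {
        "telegram" => Some(50 * MB),
        "discord" | "lark" | "email" => Some(25 * MB),
        "qq" | "dingtalk" => Some(20 * MB),
        "whatsapp" => Some(16 * MB),
        _ => None,
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outgoing {
    SendAsIs,
    Compress,
    DownloadLink,
}

/// Decide how to deliver a generated file of `size_bytes` to `channel`.
pub fn plan_outgoing(channel: &str, size_bytes: u64, compressible: bool) -> Outgoing {
    match max_upload_bytes(channel) {
        Some(limit) if size_bytes > limit => {
            // Assumption: compress when possible, otherwise hand out a link.
            if compressible {
                Outgoing::Compress
            } else {
                Outgoing::DownloadLink
            }
        }
        // Within the limit, or no fixed limit known for this channel.
        _ => Outgoing::SendAsIs,
    }
}
```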

Transcoding Pipeline

When format conversion is needed, PRX uses the following transcoding pipeline:

  1. Image transcoding -- handled by the image crate (pure Rust, no external dependencies)
  2. Audio transcoding -- handled by FFmpeg if installed, otherwise falls back to native decoders for common formats
  3. Video keyframe extraction -- requires FFmpeg

FFmpeg Detection

PRX automatically detects FFmpeg at startup. To see what was detected, run:

bash
prx doctor multimodal

Output:

Multimodal Support:
  Images: OK (native)
  Audio transcoding: OK (ffmpeg 6.1 detected)
  Video processing: OK (ffmpeg 6.1 detected)
  STT provider: OK (whisper-1 via OpenAI)

If FFmpeg is not installed, audio transcoding and video processing are limited to natively supported formats.
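
For readers curious how such a probe can work, here is a minimal sketch that runs `ffmpeg -version` and reads the version from the first output line. It is not PRX's detection code, and both function names are hypothetical. It assumes the usual first-line layout `ffmpeg version <version> ...`, which some distribution builds decorate with extra suffixes.

```rust
use std::process::Command;

/// Extract the version token from the first line of `ffmpeg -version`,
/// which normally looks like "ffmpeg version 6.1 Copyright (c) ...".
pub fn parse_ffmpeg_version(first_line: &str) -> Option<&str> {
    let mut words = first_line.split_whitespace();
    if words.next()? != "ffmpeg" || words.next()? != "version" {
        return None;
    }
    words.next()
}

/// Run `ffmpeg -version` and return the reported version, or None when
/// the binary is not on PATH or exits with an error.
pub fn detect_ffmpeg() -> Option<String> {
    let output = Command::new("ffmpeg").arg("-version").output().ok()?;
    if !output.status.success() {
        return None;
    }
    let stdout = String::from_utf8_lossy(&output.stdout);
    parse_ffmpeg_version(stdout.lines().next()?).map(String::from)
}
```

A caller could treat `None` as the signal to restrict audio and video handling to natively supported formats.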

Limitations

  • Video processing requires FFmpeg to be installed on the system
  • Large media files may significantly increase LLM token usage (especially multiple keyframes)
  • Some providers charge additional fees for vision/multimodal API calls
  • Real-time audio streaming (live voice conversation) is not yet supported
  • Generated images from providers (DALL-E, Imagen) are subject to the provider's content policy
  • SVG rasterization uses a basic renderer; complex SVGs may not render accurately

Released under the Apache-2.0 License.