
Multimodal Content Handling

PRX supports multimodal content -- images, audio, and video -- across its channels and LLM providers. The multimodal subsystem handles content type detection, format transcoding, size enforcement, and capability negotiation between channels and providers.

Overview

When a user sends a media attachment (photo, voice message, document) through a channel, the multimodal pipeline:

  1. Detects the content type using magic bytes and file extension
  2. Validates the content against size and format constraints
  3. Transcodes the content if the target provider does not support the source format
  4. Dispatches the content to the LLM provider as part of the conversation context
  5. Handles media in the response if the provider generates images or audio
Channel Input                    Provider Output
  │                                  │
  ▼                                  ▼
┌──────────────┐              ┌──────────────┐
│ Content Type │              │ Response     │
│ Detection    │              │ Media        │
└──────┬───────┘              └──────┬───────┘
       │                             │
       ▼                             ▼
┌──────────────┐              ┌──────────────┐
│ Validation   │              │ Transcoding  │
│ & Limits     │              │ (if needed)  │
└──────┬───────┘              └──────┬───────┘
       │                             │
       ▼                             ▼
┌──────────────┐              ┌──────────────┐
│ Transcoding  │              │ Channel      │
│ (if needed)  │              │ Delivery     │
└──────┬───────┘              └──────────────┘
       │
       ▼
┌──────────────┐
│ Provider     │
│ Dispatch     │
└──────────────┘

Supported Content Types

Images

| Format | Detection | Send to Provider | Receive from Provider |
| --- | --- | --- | --- |
| JPEG | Magic bytes FF D8 FF | Yes | Yes |
| PNG | Magic bytes 89 50 4E 47 | Yes | Yes |
| GIF | Magic bytes 47 49 46 | Yes (first frame) | No |
| WebP | RIFF header + WEBP | Yes | Yes |
| BMP | Magic bytes 42 4D | Transcoded to PNG | No |
| TIFF | Magic bytes 49 49 or 4D 4D | Transcoded to PNG | No |
| SVG | XML detection | Rasterized to PNG | No |

Audio

| Format | Detection | Transcription | Provider Input |
| --- | --- | --- | --- |
| OGG/Opus | OGG header | Yes (via STT) | Transcribed text |
| MP3 | ID3/sync header | Yes (via STT) | Transcribed text |
| WAV | RIFF + WAVE | Yes (via STT) | Transcribed text |
| M4A/AAC | ftyp box | Yes (via STT) | Transcribed text |
| WebM | EBML header | Yes (via STT) | Transcribed text |

Video

| Format | Detection | Processing |
| --- | --- | --- |
| MP4 | ftyp box | Extract keyframes + audio track |
| WebM | EBML header | Extract keyframes + audio track |
| MOV | ftyp box | Extract keyframes + audio track |

Video files are decomposed into keyframe images and an audio track. The keyframes are sent as images and the audio is transcribed.

Content Type Detection

Detection consults up to three sources, in order:

  1. Magic bytes -- the first 16 bytes of the file are checked against known signatures
  2. File extension -- if magic bytes are inconclusive, the file extension is used as a fallback
  3. MIME type header -- for content received via HTTP, the Content-Type header is consulted

The detection result determines which processing pipeline handles the content.
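The lookup order above can be sketched in a few lines of Rust. This is an illustrative sketch, not PRX's actual implementation: the `Kind` enum and the `sniff_magic`, `from_extension`, `from_mime`, and `detect` functions are hypothetical names, only a subset of the supported formats is covered, and the sketch assumes the MIME header is consulted last, as the list order suggests.

```rust
/// Content kinds this sketch recognises (hypothetical, not PRX's API).
#[derive(Debug, PartialEq, Clone, Copy)]
enum Kind { Jpeg, Png, Gif, WebP, Bmp, Wav, Ogg }

/// Step 1: compare the leading bytes against known signatures.
fn sniff_magic(bytes: &[u8]) -> Option<Kind> {
    // Only the first 16 bytes are inspected.
    let head = &bytes[..bytes.len().min(16)];
    if head.starts_with(&[0xFF, 0xD8, 0xFF]) { return Some(Kind::Jpeg); }
    if head.starts_with(&[0x89, 0x50, 0x4E, 0x47]) { return Some(Kind::Png); }
    if head.starts_with(b"GIF") { return Some(Kind::Gif); }
    if head.starts_with(b"BM") { return Some(Kind::Bmp); }
    if head.starts_with(b"OggS") { return Some(Kind::Ogg); }
    // RIFF containers carry their sub-type at byte offset 8.
    if head.len() >= 12 && head.starts_with(b"RIFF") {
        if &head[8..12] == b"WEBP" { return Some(Kind::WebP); }
        if &head[8..12] == b"WAVE" { return Some(Kind::Wav); }
    }
    None
}

/// Step 2: fall back to the file extension (case-insensitive).
fn from_extension(name: &str) -> Option<Kind> {
    let (_, ext) = name.rsplit_once('.')?;
    match ext.to_ascii_lowercase().as_str() {
        "jpg" | "jpeg" => Some(Kind::Jpeg),
        "png" => Some(Kind::Png),
        "gif" => Some(Kind::Gif),
        "webp" => Some(Kind::WebP),
        "bmp" => Some(Kind::Bmp),
        "wav" => Some(Kind::Wav),
        "ogg" | "opus" => Some(Kind::Ogg),
        _ => None,
    }
}

/// Step 3: consult an HTTP Content-Type value, ignoring any parameters.
fn from_mime(mime: &str) -> Option<Kind> {
    let essence = mime.split(';').next()?.trim().to_ascii_lowercase();
    match essence.as_str() {
        "image/jpeg" => Some(Kind::Jpeg),
        "image/png" => Some(Kind::Png),
        "image/gif" => Some(Kind::Gif),
        "image/webp" => Some(Kind::WebP),
        "image/bmp" => Some(Kind::Bmp),
        "audio/wav" | "audio/x-wav" => Some(Kind::Wav),
        "audio/ogg" => Some(Kind::Ogg),
        _ => None,
    }
}

/// Try each source in turn and return the first conclusive answer.
fn detect(bytes: &[u8], file_name: Option<&str>, content_type: Option<&str>) -> Option<Kind> {
    sniff_magic(bytes)
        .or_else(|| file_name.and_then(from_extension))
        .or_else(|| content_type.and_then(from_mime))
}
```

Checking magic bytes first means a mislabeled upload (for example a PNG named `photo.jpg`) is still routed by its real content; the extension and header are only trusted when the bytes are inconclusive.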

Configuration

```toml
[multimodal]
enabled = true

[multimodal.images]
max_size_bytes = 20_971_520      # 20 MB
max_resolution = "4096x4096"     # maximum width x height
auto_resize = true               # resize images exceeding max_resolution
resize_quality = 85              # JPEG quality for resized images (1-100)
strip_exif = true                # remove EXIF metadata for privacy

[multimodal.audio]
max_size_bytes = 26_214_400      # 25 MB
max_duration_secs = 300          # 5 minutes
stt_provider = "whisper"         # "whisper", "deepgram", or "provider" (use LLM provider's STT)
stt_model = "whisper-1"
stt_language = "auto"            # "auto" for language detection, or ISO 639-1 code

[multimodal.video]
max_size_bytes = 104_857_600     # 100 MB
max_duration_secs = 120          # 2 minutes
keyframe_interval_secs = 5       # extract one keyframe every 5 seconds
max_keyframes = 20               # maximum keyframes to extract
extract_audio = true             # transcribe audio track
```

Configuration Reference

Images

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| max_size_bytes | u64 | 20971520 | Maximum image file size (20 MB) |
| max_resolution | String | "4096x4096" | Maximum image dimensions (WxH) |
| auto_resize | bool | true | Automatically resize oversized images |
| resize_quality | u8 | 85 | JPEG quality for resized images (1--100) |
| strip_exif | bool | true | Remove EXIF metadata from images |
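To make the interaction between max_resolution and auto_resize concrete, here is a minimal sketch of how target dimensions could be computed. It assumes resizing preserves the aspect ratio and never upscales; the source does not specify either, and `parse_resolution` and `fit_within` are hypothetical helper names.

```rust
/// Parse a "WIDTHxHEIGHT" string such as "4096x4096".
fn parse_resolution(s: &str) -> Option<(u32, u32)> {
    let (w, h) = s.split_once('x')?;
    let w: u32 = w.trim().parse().ok()?;
    let h: u32 = h.trim().parse().ok()?;
    Some((w, h))
}

/// Scale (width, height) down to fit inside `max`, keeping the aspect ratio.
/// Images already within bounds are returned unchanged (assumption: no upscaling).
fn fit_within(width: u32, height: u32, max: (u32, u32)) -> (u32, u32) {
    if width <= max.0 && height <= max.1 {
        return (width, height);
    }
    // Use the tighter of the two scale factors so that both sides fit.
    let scale = f64::min(max.0 as f64 / width as f64, max.1 as f64 / height as f64);
    let w = ((width as f64 * scale).round() as u32).max(1);
    let h = ((height as f64 * scale).round() as u32).max(1);
    (w, h)
}
```

Under these assumptions, an 8000x6000 photo checked against the default limit is scaled by 4096 / 8000 = 0.512 and becomes 4096x3072.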

Audio

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| max_size_bytes | u64 | 26214400 | Maximum audio file size (25 MB) |
| max_duration_secs | u64 | 300 | Maximum audio duration (5 minutes) |
| stt_provider | String | "whisper" | Speech-to-text provider |
| stt_model | String | "whisper-1" | STT model name |
| stt_language | String | "auto" | Language hint for transcription |
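The two numeric limits lend themselves to a simple pre-flight check before any audio is sent for transcription. The sketch below is illustrative only: `AudioLimits`, `Rejection`, and `check_audio` are hypothetical names, and treating both limits as inclusive is an assumption.

```rust
/// Size and duration limits, defaulting to the values in the table above.
struct AudioLimits {
    max_size_bytes: u64,
    max_duration_secs: u64,
}

impl Default for AudioLimits {
    fn default() -> Self {
        Self { max_size_bytes: 26_214_400, max_duration_secs: 300 }
    }
}

/// Why a file was rejected, with the offending value and the limit it broke.
#[derive(Debug, PartialEq)]
enum Rejection {
    TooLarge { size: u64, limit: u64 },
    TooLong { secs: u64, limit: u64 },
}

/// Check size first (cheap), then duration. Limits are inclusive (assumption).
fn check_audio(size_bytes: u64, duration_secs: u64, limits: &AudioLimits) -> Result<(), Rejection> {
    if size_bytes > limits.max_size_bytes {
        return Err(Rejection::TooLarge { size: size_bytes, limit: limits.max_size_bytes });
    }
    if duration_secs > limits.max_duration_secs {
        return Err(Rejection::TooLong { secs: duration_secs, limit: limits.max_duration_secs });
    }
    Ok(())
}
```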

Video

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| max_size_bytes | u64 | 104857600 | Maximum video file size (100 MB) |
| max_duration_secs | u64 | 120 | Maximum video duration (2 minutes) |
| keyframe_interval_secs | u64 | 5 | Seconds between extracted keyframes |
| max_keyframes | usize | 20 | Maximum number of keyframes to extract |
| extract_audio | bool | true | Transcribe the video's audio track |
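The defaults interact: a 120-second video sampled every 5 seconds yields 24 candidate frames, which is more than max_keyframes allows. The sketch below shows one way the cap could apply. It assumes the first keyframe is taken at 0 seconds and that frames beyond the cap are dropped rather than re-spaced; neither detail is stated in the source, and `keyframe_timestamps` is a hypothetical name.

```rust
/// Timestamps (in seconds) at which keyframes would be extracted.
/// Assumptions: sampling starts at t = 0, and once `max_keyframes` is
/// reached the remaining candidates are discarded.
fn keyframe_timestamps(duration_secs: u64, interval_secs: u64, max_keyframes: usize) -> Vec<u64> {
    // A zero interval would never advance, so treat it as "no keyframes".
    if interval_secs == 0 {
        return Vec::new();
    }
    (0..duration_secs)
        .step_by(interval_secs as usize)
        .take(max_keyframes)
        .collect()
}
```

With the defaults, `keyframe_timestamps(120, 5, 20)` returns 20 timestamps from 0 to 95, so under these assumptions the last 25 seconds of a maximum-length video would not be sampled. Raising keyframe_interval_secs to 6 would cover the full duration within the same cap.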

Provider Capabilities

Not all LLM providers support the same media types. PRX negotiates capabilities automatically:

| Provider | Image Input | Image Output | Audio Input | Native Multimodal |
| --- | --- | --- | --- | --- |
| Anthropic (Claude) | Yes | No | No (transcribe first) | Yes (vision) |
| OpenAI (GPT-4o) | Yes | Yes (DALL-E) | Yes (Whisper) | Yes |
| Google (Gemini) | Yes | Yes (Imagen) | Yes | Yes |
| Ollama (LLaVA) | Yes | No | No | Yes (vision) |
| AWS Bedrock | Varies by model | Varies | No | Varies |

When a provider does not support a media type natively, PRX applies fallback processing:

  • Image not supported -- image is described using a vision-capable model, and the description is sent as text
  • Audio not supported -- audio is transcribed using the configured STT provider, and the transcript is sent as text
  • Video not supported -- keyframes and audio transcript are sent as a composite message
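The fallback rules above amount to a small decision table. The following sketch expresses them as a match; all type and function names (`Media`, `Capabilities`, `Plan`, `plan_for`) are hypothetical, and the `video_input` flag is an assumption, since the capability table has no video column.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Media { Image, Audio, Video }

/// What a provider accepts natively (hypothetical shape).
struct Capabilities {
    image_input: bool,
    audio_input: bool,
    video_input: bool, // assumption: not listed in the capability table
}

/// How a piece of media reaches the provider.
#[derive(Debug, PartialEq)]
enum Plan {
    SendNative,
    DescribeWithVisionModel, // image -> text description
    TranscribeWithStt,       // audio -> transcript
    KeyframesPlusTranscript, // video -> composite message
}

/// Pick native delivery when supported, otherwise the matching fallback.
fn plan_for(media: Media, caps: &Capabilities) -> Plan {
    match media {
        Media::Image if caps.image_input => Plan::SendNative,
        Media::Image => Plan::DescribeWithVisionModel,
        Media::Audio if caps.audio_input => Plan::SendNative,
        Media::Audio => Plan::TranscribeWithStt,
        Media::Video if caps.video_input => Plan::SendNative,
        Media::Video => Plan::KeyframesPlusTranscript,
    }
}
```

For a provider that accepts images but not audio, such as the Anthropic row in the table, an image maps to `Plan::SendNative` and a voice message maps to `Plan::TranscribeWithStt`.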

Channel Media Limits

Each channel imposes its own file size and format restrictions:

| Channel | Max Upload | Max Download | Supported Formats |
| --- | --- | --- | --- |
| Telegram | 50 MB | 20 MB | Images, audio, video, documents |
| Discord | 25 MB (free) | 25 MB | Images, audio, video, documents |
| WhatsApp | 16 MB (media) | 16 MB | JPEG, PNG, MP3, MP4, PDF |
| QQ | 20 MB | 20 MB | Images, audio, documents |
| DingTalk | 20 MB | 20 MB | Images, audio, documents |
| Lark | 25 MB | 25 MB | Images, audio, video, documents |
| Matrix | Homeserver dependent | Homeserver dependent | All common formats |
| Email | 25 MB (typical) | 25 MB | All via MIME attachments |
| CLI | Filesystem limit | Filesystem limit | All formats |

PRX enforces the channel's limits before attempting to send a response. If a generated image or file exceeds the channel limit, it is compressed or a download link is provided instead.
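A hedged sketch of that outbound decision follows. The source does not say how PRX chooses between compressing and linking, so this sketch assumes a size estimate for the compressed file is available and falls back to a link when the estimate still exceeds the limit. `Delivery`, `choose_delivery`, and the `MB` constant (taken as 1,048,576 bytes, matching the configuration defaults) are hypothetical.

```rust
/// One megabyte as used in the configuration defaults (20 MB = 20_971_520 bytes).
const MB: u64 = 1024 * 1024;

#[derive(Debug, PartialEq)]
enum Delivery {
    SendAsIs,
    Compress,
    DownloadLink,
}

/// Decide how to deliver a generated file to a channel.
/// `compressed_estimate` is the expected size after compression, if the
/// format can be compressed at all (assumption: the caller can estimate it).
fn choose_delivery(size_bytes: u64, channel_limit_bytes: u64, compressed_estimate: Option<u64>) -> Delivery {
    if size_bytes <= channel_limit_bytes {
        return Delivery::SendAsIs;
    }
    match compressed_estimate {
        // Compression is only worthwhile if the result fits the channel.
        Some(est) if est <= channel_limit_bytes => Delivery::Compress,
        _ => Delivery::DownloadLink,
    }
}
```

For a 16 MB WhatsApp limit, a 20 MB image expected to compress to 12 MB maps to `Delivery::Compress`, while one expected to stay at 18 MB maps to `Delivery::DownloadLink`.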

Transcoding Pipeline

When format conversion is needed, PRX uses the following transcoding pipeline:

  1. Image transcoding -- handled by the image crate (pure Rust, no external dependencies)
  2. Audio transcoding -- handled by FFmpeg if it is installed; otherwise PRX falls back to native decoders for common formats
  3. Video keyframe extraction -- requires FFmpeg

FFmpeg Detection

PRX automatically detects FFmpeg at startup. To check what was detected, run:

```bash
prx doctor multimodal
```

Output:

Multimodal Support:
  Images: OK (native)
  Audio transcoding: OK (ffmpeg 6.1 detected)
  Video processing: OK (ffmpeg 6.1 detected)
  STT provider: OK (whisper-1 via OpenAI)

If FFmpeg is not installed, audio transcoding is limited to natively supported formats and video processing is unavailable.
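For readers curious how such a probe can work, the sketch below runs `ffmpeg -version` and reads the version token from the first output line. It is an assumption about the general technique, not a description of PRX's internals; `parse_ffmpeg_version` and `detect_ffmpeg` are hypothetical names.

```rust
use std::process::Command;

/// Extract the version token from the first line of `ffmpeg -version` output,
/// e.g. "ffmpeg version 6.1 Copyright ..." yields "6.1".
fn parse_ffmpeg_version(output: &str) -> Option<String> {
    let first = output.lines().next()?;
    let mut words = first.split_whitespace();
    if words.next()? != "ffmpeg" || words.next()? != "version" {
        return None;
    }
    words.next().map(str::to_owned)
}

/// Run `ffmpeg -version` from PATH. Returns None if the binary is missing,
/// exits with a failure status, or prints an unexpected banner.
fn detect_ffmpeg() -> Option<String> {
    let out = Command::new("ffmpeg").arg("-version").output().ok()?;
    if !out.status.success() {
        return None;
    }
    parse_ffmpeg_version(&String::from_utf8_lossy(&out.stdout))
}
```

Keeping the parsing separate from the process call makes the version logic testable on machines where FFmpeg is not installed.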

Limitations

  • Video processing requires FFmpeg to be installed on the system
  • Large media files may significantly increase LLM token usage, especially videos that yield many keyframes
  • Some providers charge additional fees for vision/multimodal API calls
  • Real-time audio streaming (live voice conversation) is not yet supported
  • Generated images from providers (DALL-E, Imagen) are subject to the provider's content policy
  • SVG rasterization uses a basic renderer; complex SVGs may not render accurately

Released under the Apache-2.0 License.