Multimodal Content Handling

PRX supports multimodal content -- images, audio, and video -- across its channels and LLM providers. The multimodal subsystem handles content type detection, format transcoding, size enforcement, and capability negotiation between channels and providers.

Overview

When a user sends a media attachment (photo, voice message, document) through a channel, the multimodal pipeline:

Detects the content type using magic bytes and file extension
Validates the content against size and format constraints
Transcodes the content if the target provider does not support the source format
Dispatches the content to the LLM provider as part of the conversation context
Handles media in the response if the provider generates images or audio

Channel Input                    Provider Output
  │                                  │
  ▼                                  ▼
┌──────────────┐              ┌──────────────┐
│ Content Type │              │ Response     │
│ Detection    │              │ Media        │
└──────┬───────┘              └──────┬───────┘
       │                             │
       ▼                             ▼
┌──────────────┐              ┌──────────────┐
│ Validation   │              │ Transcoding  │
│ & Limits     │              │ (if needed)  │
└──────┬───────┘              └──────┬───────┘
       │                             │
       ▼                             ▼
┌──────────────┐              ┌──────────────┐
│ Transcoding  │              │ Channel      │
│ (if needed)  │              │ Delivery     │
└──────┬───────┘              └──────────────┘
       │
       ▼
┌──────────────┐
│ Provider     │
│ Dispatch     │
└──────────────┘

Supported Content Types

Images

Format	Detection	Send to Provider	Receive from Provider
JPEG	Magic bytes `FF D8 FF`	Yes	Yes
PNG	Magic bytes `89 50 4E 47`	Yes	Yes
GIF	Magic bytes `47 49 46`	Yes (first frame)	No
WebP	RIFF header + `WEBP`	Yes	Yes
BMP	Magic bytes `42 4D`	Transcoded to PNG	No
TIFF	Magic bytes `49 49` or `4D 4D`	Transcoded to PNG	No
SVG	XML detection	Rasterized to PNG	No

Audio

Format	Detection	Transcription	Provider Input
OGG/Opus	OGG header	Yes (via STT)	Transcribed text
MP3	ID3/sync header	Yes (via STT)	Transcribed text
WAV	RIFF + `WAVE`	Yes (via STT)	Transcribed text
M4A/AAC	ftyp box	Yes (via STT)	Transcribed text
WebM	EBML header	Yes (via STT)	Transcribed text

Video

Format	Detection	Processing
MP4	ftyp box	Extract keyframes + audio track
WebM	EBML header	Extract keyframes + audio track
MOV	ftyp box	Extract keyframes + audio track

Video files are decomposed into keyframe images and an audio track. The keyframes are sent as images and the audio is transcribed.
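The keyframe sampling described above can be sketched as follows. This is a minimal illustration, not PRX's implementation: the function name is hypothetical, and it assumes sampling starts at 0 seconds and proceeds at a fixed interval. The parameters mirror the `keyframe_interval_secs` and `max_keyframes` settings.

```rust
/// Hypothetical helper: timestamps (in seconds) at which keyframes would be sampled.
/// Assumption: sampling starts at t = 0 and uses a fixed interval.
pub fn keyframe_timestamps(
    duration_secs: u64,
    interval_secs: u64,
    max_keyframes: usize,
) -> Vec<u64> {
    if interval_secs == 0 {
        // A zero interval would never advance; treat it as "no keyframes".
        return Vec::new();
    }
    (0..duration_secs)
        .step_by(interval_secs as usize)
        .take(max_keyframes) // enforce the keyframe cap
        .collect()
}
```

Under these assumptions, a 120-second video sampled every 5 seconds has 24 candidate timestamps, so a cap of 20 would stop sampling at the 95-second mark. A larger interval may be preferable for long clips.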

Content Type Detection

Detection draws on up to three signals:

Magic bytes -- the first 16 bytes of the file are checked against known signatures
File extension -- if magic bytes are inconclusive, the file extension is used as a fallback
MIME type header -- for content received via HTTP, the Content-Type header is consulted

The detection result determines which processing pipeline handles the content.
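A minimal sketch of this fallback chain, using only the standard library. The `Kind` enum, the function names, and the choice to consult the Content-Type header last are assumptions made for illustration; only a subset of the signatures listed in the tables is covered.

```rust
/// Illustrative subset of media kinds (not PRX's actual type).
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Kind { Jpeg, Png, Gif, WebP, Bmp, Wav, Ogg }

/// Check the first 16 bytes against known signatures.
pub fn sniff_magic(bytes: &[u8]) -> Option<Kind> {
    let head = &bytes[..bytes.len().min(16)];
    if head.starts_with(&[0xFF, 0xD8, 0xFF]) { return Some(Kind::Jpeg); }
    if head.starts_with(&[0x89, 0x50, 0x4E, 0x47]) { return Some(Kind::Png); }
    if head.starts_with(&[0x47, 0x49, 0x46]) { return Some(Kind::Gif); }
    if head.starts_with(b"OggS") { return Some(Kind::Ogg); }
    // RIFF containers carry their sub-type at bytes 8..12.
    if head.len() >= 12 && head.starts_with(b"RIFF") {
        if &head[8..12] == b"WEBP" { return Some(Kind::WebP); }
        if &head[8..12] == b"WAVE" { return Some(Kind::Wav); }
    }
    if head.starts_with(&[0x42, 0x4D]) { return Some(Kind::Bmp); }
    None
}

/// Fall back to the file extension (case-insensitive).
pub fn from_extension(name: &str) -> Option<Kind> {
    let ext = name.rsplit_once('.')?.1.to_ascii_lowercase();
    match ext.as_str() {
        "jpg" | "jpeg" => Some(Kind::Jpeg),
        "png" => Some(Kind::Png),
        "gif" => Some(Kind::Gif),
        "webp" => Some(Kind::WebP),
        "bmp" => Some(Kind::Bmp),
        "wav" => Some(Kind::Wav),
        "ogg" | "oga" => Some(Kind::Ogg),
        _ => None,
    }
}

/// Fall back to an HTTP Content-Type header, ignoring parameters.
pub fn from_mime(content_type: &str) -> Option<Kind> {
    let essence = content_type.split(';').next()?.trim().to_ascii_lowercase();
    match essence.as_str() {
        "image/jpeg" => Some(Kind::Jpeg),
        "image/png" => Some(Kind::Png),
        "image/gif" => Some(Kind::Gif),
        "image/webp" => Some(Kind::WebP),
        "image/bmp" => Some(Kind::Bmp),
        "audio/wav" | "audio/x-wav" | "audio/wave" => Some(Kind::Wav),
        "audio/ogg" => Some(Kind::Ogg),
        _ => None,
    }
}

/// Try magic bytes first, then the extension, then the MIME header (assumed order).
pub fn detect(bytes: &[u8], file_name: Option<&str>, content_type: Option<&str>) -> Option<Kind> {
    sniff_magic(bytes)
        .or_else(|| file_name.and_then(from_extension))
        .or_else(|| content_type.and_then(from_mime))
}
```

Because magic bytes are checked first, a WebP file that was renamed to `x.wav` would still be classified as WebP in this sketch.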

Configuration

```toml
[multimodal]
enabled = true

[multimodal.images]
max_size_bytes = 20_971_520      # 20 MB
max_resolution = "4096x4096"     # maximum width x height
auto_resize = true               # resize images exceeding max_resolution
resize_quality = 85              # JPEG quality for resized images (1-100)
strip_exif = true                # remove EXIF metadata for privacy

[multimodal.audio]
max_size_bytes = 26_214_400      # 25 MB
max_duration_secs = 300          # 5 minutes
stt_provider = "whisper"         # "whisper", "deepgram", or "provider" (use LLM provider's STT)
stt_model = "whisper-1"
stt_language = "auto"            # "auto" for language detection, or ISO 639-1 code

[multimodal.video]
max_size_bytes = 104_857_600     # 100 MB
max_duration_secs = 120          # 2 minutes
keyframe_interval_secs = 5       # extract one keyframe every 5 seconds
max_keyframes = 20               # maximum keyframes to extract
extract_audio = true             # transcribe audio track
```

Configuration Reference

Images

Field	Type	Default	Description
`max_size_bytes`	`u64`	`20971520`	Maximum image file size (20 MB)
`max_resolution`	`String`	`"4096x4096"`	Maximum image dimensions (WxH)
`auto_resize`	`bool`	`true`	Automatically resize oversized images
`resize_quality`	`u8`	`85`	JPEG quality for resized images (1--100)
`strip_exif`	`bool`	`true`	Remove EXIF metadata from images
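To illustrate how `max_resolution` and `auto_resize` might interact, the sketch below parses a `"WxH"` string and computes target dimensions for an oversized image. It assumes the aspect ratio is preserved when resizing, which the reference above does not state; both function names are hypothetical.

```rust
/// Parse a "WIDTHxHEIGHT" string such as "4096x4096".
pub fn parse_resolution(s: &str) -> Option<(u32, u32)> {
    let (w, h) = s.split_once('x')?;
    Some((w.trim().parse().ok()?, h.trim().parse().ok()?))
}

/// Scale (w, h) down to fit within (max_w, max_h).
/// Assumption: aspect ratio is preserved; images within bounds are unchanged.
pub fn fit_within(w: u32, h: u32, max_w: u32, max_h: u32) -> (u32, u32) {
    if w == 0 || h == 0 || (w <= max_w && h <= max_h) {
        return (w, h);
    }
    // Widen to u64 so the cross-multiplication cannot overflow.
    let (w64, h64, mw, mh) = (w as u64, h as u64, max_w as u64, max_h as u64);
    if w64 * mh >= h64 * mw {
        // Width is the limiting dimension.
        (max_w, ((h64 * mw) / w64).max(1) as u32)
    } else {
        // Height is the limiting dimension.
        (((w64 * mh) / h64).max(1) as u32, max_h)
    }
}
```

With the default limit of `"4096x4096"`, an 8192x4096 image would be scaled to 4096x2048 under this scheme.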

Audio

Field	Type	Default	Description
`max_size_bytes`	`u64`	`26214400`	Maximum audio file size (25 MB)
`max_duration_secs`	`u64`	`300`	Maximum audio duration (5 minutes)
`stt_provider`	`String`	`"whisper"`	Speech-to-text provider
`stt_model`	`String`	`"whisper-1"`	STT model name
`stt_language`	`String`	`"auto"`	Language hint for transcription
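The size and duration limits could be enforced with a check along these lines. The struct, error type, and function are illustrative names rather than PRX's API, and treating the limits as inclusive is an assumption.

```rust
/// Illustrative mirror of the audio limits in the table above.
pub struct AudioLimits {
    pub max_size_bytes: u64,
    pub max_duration_secs: u64,
}

impl Default for AudioLimits {
    fn default() -> Self {
        // Defaults taken from the configuration reference.
        Self { max_size_bytes: 26_214_400, max_duration_secs: 300 }
    }
}

#[derive(Debug, PartialEq)]
pub enum LimitError {
    TooLarge { size: u64, max: u64 },
    TooLong { secs: u64, max: u64 },
}

/// Reject audio that exceeds either limit (limits assumed inclusive).
pub fn validate_audio(
    size_bytes: u64,
    duration_secs: u64,
    limits: &AudioLimits,
) -> Result<(), LimitError> {
    if size_bytes > limits.max_size_bytes {
        return Err(LimitError::TooLarge { size: size_bytes, max: limits.max_size_bytes });
    }
    if duration_secs > limits.max_duration_secs {
        return Err(LimitError::TooLong { secs: duration_secs, max: limits.max_duration_secs });
    }
    Ok(())
}
```

Returning a structured error rather than a boolean lets the caller tell the user which limit was exceeded.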

Video

Field	Type	Default	Description
`max_size_bytes`	`u64`	`104857600`	Maximum video file size (100 MB)
`max_duration_secs`	`u64`	`120`	Maximum video duration (2 minutes)
`keyframe_interval_secs`	`u64`	`5`	Seconds between extracted keyframes
`max_keyframes`	`usize`	`20`	Maximum number of keyframes to extract
`extract_audio`	`bool`	`true`	Transcribe the video's audio track

Provider Capabilities

Not all LLM providers support the same media types. PRX negotiates capabilities automatically:

Provider	Image Input	Image Output	Audio Input	Native Multimodal
Anthropic (Claude)	Yes	No	No (transcribe first)	Yes (vision)
OpenAI (GPT-4o)	Yes	Yes (DALL-E)	Yes (Whisper)	Yes
Google (Gemini)	Yes	Yes (Imagen)	Yes	Yes
Ollama (LLaVA)	Yes	No	No	Yes (vision)
AWS Bedrock	Varies by model	Varies	No	Varies

When a provider does not support a media type natively, PRX applies fallback processing:

Image not supported -- image is described using a vision-capable model, and the description is sent as text
Audio not supported -- audio is transcribed using the configured STT provider, and the transcript is sent as text
Video not supported -- keyframes and audio transcript are sent as a composite message
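The fallback rules above amount to a small decision table. A hedged sketch, with hypothetical type names and a `video_input` flag that does not appear in the capability table and is assumed here for completeness:

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Media { Image, Audio, Video }

/// What the sketch decides to do with a piece of media.
#[derive(Debug, PartialEq)]
pub enum Plan {
    SendNative,
    DescribeWithVisionModel,
    TranscribeWithStt,
    KeyframesPlusTranscript,
}

/// Illustrative capability flags for one provider.
pub struct Caps {
    pub image_input: bool,
    pub audio_input: bool,
    pub video_input: bool, // assumption: not a column in the table above
}

/// Send natively when supported, otherwise apply the matching fallback.
pub fn plan(media: Media, caps: &Caps) -> Plan {
    let native = match media {
        Media::Image => caps.image_input,
        Media::Audio => caps.audio_input,
        Media::Video => caps.video_input,
    };
    if native {
        return Plan::SendNative;
    }
    match media {
        Media::Image => Plan::DescribeWithVisionModel,
        Media::Audio => Plan::TranscribeWithStt,
        Media::Video => Plan::KeyframesPlusTranscript,
    }
}
```

For a provider with image input but no audio input, this sketch sends images natively and routes voice messages through STT.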

Channel Media Limits

Each channel imposes its own file size and format restrictions:

Channel	Max Upload	Max Download	Supported Formats
Telegram	50 MB	20 MB	Images, audio, video, documents
Discord	25 MB (free)	25 MB	Images, audio, video, documents
WhatsApp	16 MB (media)	16 MB	JPEG, PNG, MP3, MP4, PDF
QQ	20 MB	20 MB	Images, audio, documents
DingTalk	20 MB	20 MB	Images, audio, documents
Lark	25 MB	25 MB	Images, audio, video, documents
Matrix	Homeserver dependent	Homeserver dependent	All common formats
Email	25 MB (typical)	25 MB	All via MIME attachments
CLI	Filesystem limit	Filesystem limit	All formats

PRX enforces the channel's limits before attempting to send a response. If a generated image or file exceeds the channel limit, it is compressed or a download link is provided instead.
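One way to express this outbound check is shown below. The text does not say how PRX chooses between compression and a download link, so the rule used here (compress when the content is compressible, otherwise link) is an assumption, as are all names.

```rust
#[derive(Debug, PartialEq)]
pub enum Delivery {
    SendAsIs,
    CompressFirst,
    DownloadLink,
}

pub const MB: u64 = 1024 * 1024;

/// Decide how to deliver a response attachment given the channel's upload limit.
/// Assumption: compressible content (e.g. images) is compressed, anything else is linked.
pub fn choose_delivery(size_bytes: u64, channel_max_bytes: u64, compressible: bool) -> Delivery {
    if size_bytes <= channel_max_bytes {
        Delivery::SendAsIs
    } else if compressible {
        Delivery::CompressFirst
    } else {
        Delivery::DownloadLink
    }
}
```

A real implementation would also need to re-check the size after compression and fall back to a link if the result is still too large.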

Transcoding Pipeline

When format conversion is needed, PRX uses the following transcoding pipeline:

Image transcoding -- handled by the image crate (pure Rust, no external dependencies)
Audio transcoding -- handled by FFmpeg if installed, otherwise falls back to native decoders for common formats
Video keyframe extraction -- requires FFmpeg

FFmpeg Detection

PRX automatically detects FFmpeg at startup. The detection result can be inspected with:

```bash
prx doctor multimodal
```

Output:

```
Multimodal Support:
  Images: OK (native)
  Audio transcoding: OK (ffmpeg 6.1 detected)
  Video processing: OK (ffmpeg 6.1 detected)
  STT provider: OK (whisper-1 via OpenAI)
```

If FFmpeg is not installed, audio transcoding and video processing are limited to natively supported formats.
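A startup probe of this kind can be approximated by running `ffmpeg -version` and reading the first line of its output. The sketch below is an assumption about how such detection could work, not a description of PRX's internals; both function names are hypothetical.

```rust
use std::process::Command;

/// Extract the version token from the first line of `ffmpeg -version` output,
/// e.g. "ffmpeg version 6.1 Copyright (c) ..." yields "6.1".
pub fn parse_ffmpeg_version(output: &str) -> Option<&str> {
    let mut words = output.lines().next()?.split_whitespace();
    match (words.next()?, words.next()?) {
        ("ffmpeg", "version") => words.next(),
        _ => None,
    }
}

/// Run `ffmpeg -version`; returns None if the binary is missing or exits with an error.
pub fn detect_ffmpeg() -> Option<String> {
    let out = Command::new("ffmpeg").arg("-version").output().ok()?;
    if !out.status.success() {
        return None;
    }
    let text = String::from_utf8_lossy(&out.stdout);
    parse_ffmpeg_version(&text).map(str::to_owned)
}
```

Keeping the parser separate from the process call makes the version extraction testable on machines where FFmpeg is not installed.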

Limitations

Video processing requires FFmpeg to be installed on the system
Large media files can significantly increase LLM token usage, especially videos that produce many keyframes
Some providers charge additional fees for vision/multimodal API calls
Real-time audio streaming (live voice conversation) is not yet supported
Generated images from providers (DALL-E, Imagen) are subject to the provider's content policy
SVG rasterization uses a basic renderer; complex SVGs may not render accurately

Related Pages

Agent Runtime -- how media content flows through the agent loop
Channels Overview -- channel-specific media handling
Providers Overview -- provider multimodal capabilities
Embeddings Backend -- embedding models for memory
