Multimodal

Models that accept and produce more than one type of content (text, image, audio, video).

Multimodal models can take in text plus images, audio, or video and reason across them. Practical use cases include document understanding (PDFs with tables and diagrams), screen-shot analysis, image classification with structured output, and audio transcription with summary. Multimodal capabilities expand the set of workflows that can be redesigned AI-native.

When it matters

When the workflow input includes images, audio, video, or screenshots — not just text. Increasingly common in support, claims processing, content moderation, and design workflows.

Real example

A claims-processing workflow where customers upload photos of damage. Multimodal Claude analyzes the image alongside the text description, extracts damage category and severity, and routes to the appropriate adjuster queue.

KPIs to watch

Multimodal task accuracy on labelled test set (image+text → correct routing, >85%), latency overhead per image (<2s typical), cost vs text-only (typically 2-4×).

Related terms

LLM (Large Language Model)

A transformer-based model trained on language data to predict and generate text.

Foundation model

A large model pre-trained on broad data, then adapted to many downstream tasks.

Context window

The maximum number of tokens a model can process in a single request.

Frontier model

The leading-edge foundation models with the highest reasoning, coding, and multimodal capabilities.

See it in action

We use this every week

Send a short brief and we'll walk you through how Multimodal shows up in a real engagement we're running. We reply within one business day.

Start a project →