Defined term
Multimodal
Models that accept, and in some cases produce, more than one type of content (text, images, audio, video).
Multimodal models can take in text plus images, audio, or video and reason across them. Practical use cases include document understanding (PDFs with tables and diagrams), screenshot analysis, image classification with structured output, and audio transcription with a summary. Multimodal capabilities expand the set of workflows that can be redesigned as AI-native.
When it matters
When the workflow input includes images, audio, video, or screenshots — not just text. Increasingly common in support, claims processing, content moderation, and design workflows.
Real example
A claims-processing workflow where customers upload photos of damage. Multimodal Claude analyzes the image alongside the text description, extracts damage category and severity, and routes to the appropriate adjuster queue.
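The routing step in this example can be sketched in a few lines of Python. This is a minimal illustration, not the workflow's actual implementation: the queue names, the two-level severity scale, the prompt wording, and the `call_model` function (a stand-in for whichever multimodal API is used) are all assumptions made for the sketch.

```python
import json
from dataclasses import dataclass
from typing import Callable

# Hypothetical queue names and severity levels, chosen for illustration only.
QUEUES = {
    ("vehicle", "high"): "auto-senior-adjusters",
    ("vehicle", "low"): "auto-fast-track",
    ("property", "high"): "property-senior-adjusters",
    ("property", "low"): "property-fast-track",
}
FALLBACK_QUEUE = "manual-review"

# Hypothetical prompt asking the model for structured (JSON) output.
PROMPT = (
    "Given the customer's description and the attached photo, reply with JSON "
    'only: {"category": "vehicle" | "property", "severity": "low" | "high"}.\n'
    "Description: "
)


@dataclass
class Claim:
    description: str  # free-text description typed by the customer
    photo: bytes      # raw bytes of the uploaded damage photo


def route_claim(claim: Claim, call_model: Callable[[str, bytes], str]) -> str:
    """Return the adjuster queue for a claim.

    `call_model` is a hypothetical wrapper around a multimodal model: it takes
    a text prompt and image bytes and returns the model's reply as a string.
    """
    raw = call_model(PROMPT + claim.description, claim.photo)
    try:
        result = json.loads(raw)
        key = (result["category"], result["severity"])
        return QUEUES.get(key, FALLBACK_QUEUE)
    except (json.JSONDecodeError, KeyError, TypeError):
        # Malformed or unexpected model output goes to a human.
        return FALLBACK_QUEUE
```

Passing the model call in as a function keeps the routing logic testable without network access, and sending unparseable or unexpected replies to a manual-review queue avoids misrouting a claim silently.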
KPIs to watch
Multimodal task accuracy on a labelled test set (image + text → correct routing; target >85%), latency overhead per image (typically <2 s), and cost relative to text-only (typically 2-4×).
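These three KPIs reduce to simple arithmetic once the measurements are logged. The sketch below assumes you already have predicted and expected queue labels for a test set, per-request latencies in seconds, and per-request costs; the function names are hypothetical, and the thresholds in `meets_targets` mirror the figures quoted above.

```python
from statistics import mean
from typing import Sequence


def routing_accuracy(predicted: Sequence[str], expected: Sequence[str]) -> float:
    """Fraction of test cases routed to the expected queue."""
    if not expected or len(predicted) != len(expected):
        raise ValueError("need two non-empty sequences of equal length")
    return sum(p == e for p, e in zip(predicted, expected)) / len(expected)


def latency_overhead(multimodal_s: Sequence[float], text_only_s: Sequence[float]) -> float:
    """Mean extra seconds per request when an image is attached."""
    return mean(multimodal_s) - mean(text_only_s)


def cost_ratio(multimodal_cost: float, text_only_cost: float) -> float:
    """Cost of a multimodal request as a multiple of a text-only request."""
    return multimodal_cost / text_only_cost


def meets_targets(accuracy: float, overhead_s: float) -> bool:
    """Check the accuracy (>85%) and latency (<2 s) figures quoted above."""
    return accuracy > 0.85 and overhead_s < 2.0
```

The cost ratio is reported rather than thresholded here, since the 2-4× figure describes a typical range, not a pass/fail target.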
Related terms
LLM (Large Language Model)
A transformer-based model trained on language data to predict and generate text.
Foundation model
A large model pre-trained on broad data, then adapted to many downstream tasks.
Context window
The maximum number of tokens a model can process in a single request.
Frontier model
The leading-edge foundation models with the highest reasoning, coding, and multimodal capabilities.
See it in action
We use this every week
Book a 30-min call and we'll walk you through how Multimodal shows up in a real engagement we're running.