Key takeaways
- OCR and vision encoders are part of your trust boundary—test them.
- Indirect injection lives in slides and scans, not just chat text.
- Agent tools triggered from visual context need the same least-privilege discipline as APIs.
Multi-modal models fuse text, images, audio, and sometimes video into a single interface. That convenience expands the attack surface: adversaries can hide instructions in pixels, poison visual context, or chain modalities to bypass filters that only inspect one channel. Industry and academic work through 2025 documented a steady rise in cross-modal jailbreaks and indirect injection via documents that mix media types.
Why modality multiplies risk
Text-only guardrails rarely account for the fact that an image caption, OCR stream, or audio transcript will be concatenated into the same context window as the user message. Attackers embed payloads where text classifiers do not look: barcodes in screenshots, barely visible overlay text, or adversarial patches that confuse vision encoders while appearing benign to humans.
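One way to make "barely visible overlay text" testable is to compare the contrast between a text region and its background. The sketch below uses the WCAG 2.x relative-luminance and contrast-ratio formulas. It assumes your OCR or layout step can report an approximate foreground and background colour for each text region, which many pipelines do not expose; the threshold value is a hypothetical starting point, not a recommended setting.

```python
def _linearize(channel: int) -> float:
    """Convert an 8-bit sRGB channel to linear light (WCAG 2.x formula)."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb):
    """Relative luminance of an (R, G, B) tuple with 0-255 channels."""
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg, bg):
    """WCAG contrast ratio, from 1.0 (identical) to 21.0 (black on white)."""
    lighter, darker = sorted(
        (relative_luminance(fg), relative_luminance(bg)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)


# Hypothetical threshold: text below this ratio is hard for a human
# reviewer to see, yet may still be picked up by OCR.
LOW_CONTRAST_THRESHOLD = 1.5


def is_suspiciously_faint(fg, bg, threshold=LOW_CONTRAST_THRESHOLD):
    """Flag text regions a human would likely miss."""
    return contrast_ratio(fg, bg) < threshold
```

A flagged region is a signal for review or stricter handling, not proof of an attack; watermarks and decorative text can trigger it too.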
Emerging vector families
1. Visual prompt injection
Instructions embedded in images (slides, memes, scanned PDFs) tell the model to ignore policies or exfiltrate context. Because many pipelines run OCR or vision-to-text before the LLM, the payload becomes indistinguishable from “user content.”
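As a rough illustration, extracted text can be screened for instruction-like phrasing before it reaches the model. The patterns below are illustrative assumptions, not a vetted corpus, and keyword matching of this kind is easy to evade; treat it as one cheap signal alongside other controls. The function name `flag_extracted_text` is hypothetical.

```python
import re

# Illustrative patterns only; a real deployment would maintain and
# test a much broader, regularly refreshed set.
SUSPICIOUS_PATTERNS = [
    re.compile(
        r"\bignore\s+(all\s+|any\s+)?(previous|prior|above)\s+"
        r"(instructions|rules|policies)\b",
        re.IGNORECASE,
    ),
    re.compile(
        r"\b(disregard|override)\s+(the\s+|your\s+)?"
        r"(system\s+prompt|policy|policies|instructions)\b",
        re.IGNORECASE,
    ),
    re.compile(
        r"\b(reveal|print|send|exfiltrate)\b.{0,40}"
        r"\b(system\s+prompt|api\s+key|credentials|conversation)\b",
        re.IGNORECASE,
    ),
]


def flag_extracted_text(text: str) -> list[str]:
    """Return the patterns that match OCR or vision-to-text output."""
    # OCR often breaks phrases across lines, so collapse whitespace first.
    normalized = " ".join(text.split())
    return [p.pattern for p in SUSPICIOUS_PATTERNS if p.search(normalized)]
```

For example, `flag_extracted_text("Q3 revenue grew 4%")` returns an empty list, while text containing "ignore all previous instructions" returns at least one matching pattern.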
2. Cross-modal confusion
Models may privilege one modality over another. A safe textual question paired with a misleading diagram can steer reasoning (e.g., fake UI elements implying a trusted action). Red teams should vary which modality carries the malicious intent.
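Varying the carrier modality can be made systematic with a small test matrix. The sketch below only enumerates cases; rendering a payload into an actual image or audio clip and scoring the model response are left out. The modality list, the `CrossModalCase` type, and the payload identifiers are all hypothetical.

```python
from dataclasses import dataclass
from itertools import product

# Assumed set of input channels; adjust to what your model accepts.
MODALITIES = ("text", "image", "audio")


@dataclass(frozen=True)
class CrossModalCase:
    carrier: str     # modality that carries the malicious payload
    payload_id: str  # identifier of a payload in your red-team corpus
    benign: tuple    # modalities that carry only benign content


def build_matrix(payload_ids, modalities=MODALITIES):
    """Pair every payload with every modality as the sole carrier."""
    return [
        CrossModalCase(
            carrier=m,
            payload_id=p,
            benign=tuple(x for x in modalities if x != m),
        )
        for p, m in product(payload_ids, modalities)
    ]
```

With two payloads and three modalities this yields six cases, so a filter that inspects only one channel is exercised against payloads arriving through each of the others.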
3. Tool and agent bridges
When vision models trigger tools (search, code execution, ticket creation), a single malicious frame can cascade into privileged actions. This maps directly to LLM06:2025 Excessive Agency in the OWASP Top 10 for LLM Applications when tool scopes are too broad.
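A minimal sketch of scoping tools by input provenance follows. It assumes the orchestrator conservatively records every source present in the context when a tool call is proposed, which is itself a design decision; the tool names and the scope table are hypothetical.

```python
from enum import Enum


class Provenance(Enum):
    USER_TEXT = "user_text"
    EXTRACTED = "extracted"  # OCR, vision captions, audio transcripts


# Hypothetical scopes: content derived from media may only trigger
# read-only tools.
ALLOWED_TOOLS = {
    Provenance.USER_TEXT: {"search", "create_ticket", "run_code"},
    Provenance.EXTRACTED: {"search"},
}


def authorize_tool_call(tool: str, influences: set[Provenance]) -> bool:
    """Allow a tool only if every source in context permits it."""
    if not influences:
        return False  # unknown provenance is denied by default
    return all(tool in ALLOWED_TOOLS.get(p, set()) for p in influences)
```

Under these assumptions, `run_code` is allowed when only typed user text is in context and denied as soon as extracted content is present, so a single malicious frame cannot reach the privileged tool on its own.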
Defensive patterns that work
- Modality-aware policy: Apply allow-lists for image sources; treat OCR output as untrusted data and screen it in isolation before merging it into prompts.
- Structured handoffs: Label provenance (“text from user” vs “text extracted from uploaded file”) so downstream policies can differ.
- Continuous testing: Extend red-team libraries beyond ASCII—include image-based injection corpora and refresh after model upgrades.
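The structured-handoff idea can be sketched as a small data structure that carries provenance alongside each piece of text. The `Segment` type, the source labels, and the JSON layout are assumptions for illustration; labelling does not by itself stop a model from following embedded instructions, but it gives downstream policies something to act on.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    source: str    # hypothetical labels: "user_text", "extracted_file"
    trusted: bool  # whether instructions in this segment may be followed
    content: str


def from_user(text: str) -> Segment:
    return Segment(source="user_text", trusted=True, content=text)


def from_extraction(text: str) -> Segment:
    # Text pulled from uploads (OCR, captions, transcripts) is data only.
    return Segment(source="extracted_file", trusted=False, content=text)


def build_prompt_payload(segments) -> str:
    """Serialize segments so provenance survives the handoff.

    JSON encoding escapes quotes and newlines, so extracted text cannot
    close its own field and masquerade as a trusted segment.
    """
    return json.dumps(
        [
            {"source": s.source, "trusted": s.trusted, "content": s.content}
            for s in segments
        ],
        ensure_ascii=False,
    )
```

A downstream filter can then parse the payload and apply stricter rules to any segment where `trusted` is false, for example refusing tool calls or stripping instruction-like sentences.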