Vision, audio, video
Last updated Jun 18, 2026
Multimodal LLM releases
Large language model releases with multimodal capabilities, including vision-language, audio, video, image-generation, and document-understanding models.
62 models
Kimi K2.7 Code
Available: Moonshot's open coding-focused agentic model built on K2.6, with native vision/video input, forced thinking mode, and stronger long-horizon software-engineering performance.
MiniMax-M3
Available: Native multimodal MiniMax model with a one-million-token context, sparse attention, and agentic coding/cowork positioning.
GPT-5.6
Preview: OpenAI's mid-2026 flagship, headlined by an industry-leading 1.5M-token context window and long-horizon agentic tool use.
Claude Fable 5
Withdrawn: The public, guardrailed sibling of Mythos and Anthropic's most capable widely-released model, built for long-horizon agentic work. Launched June 9, 2026 across the Claude API, AWS, and Microsoft Foundry, then pulled three days later under a US government export-control directive barring access by foreign nationals.
Claude Opus 4.8
Available: Anthropic's most capable model, with strengthened agentic and long-running task performance.
Gemini 3.5 Pro
Preview: Announced at Google I/O 2026; emphasizes deep multimodal reasoning over a 2M-token context.
Grok 4.3
Available: xAI's agentic flagship with a 1M-token context and aggressive API pricing.
Gemma 4 31B
Available: Google DeepMind's Gemma 4 advanced-reasoning open model for personal computers, part of the April 2026 Gemma 4 family.
Kimi K2.6
Available: Moonshot's open native multimodal agentic model for long-horizon coding, visual interface generation, and autonomous tool orchestration.
GPT-5.4
Available: Workhorse GPT-5 release with a dedicated Thinking mode; widely deployed across ChatGPT and the API.
Qwen3.5-397B
Available: Native vision-language MoE supporting 201 languages with a 1M-token context.
Gemini 3.1 Pro
Available: Generally available multimodal flagship with native tool use and a 2M-token context.
Claude Opus 4.6
Available: Introduced genuinely autonomous multi-file coding and stronger computer use.
Kimi K2.5
Available: Open multimodal Kimi model that adds native visual agentic intelligence, instant and thinking modes, and agent-swarm workflows on top of the K2 base.
GLM-4.6V
Available: Open 106B-class vision-language model with native multimodal function calling for visual agents.
Mistral Large 3
Available: Mistral's largest open-weight MoE, aimed at frontier reasoning while remaining self-hostable.
Gemma 3 27B
Available: Google's open multimodal model: 128K context, 140+ languages, runs on a single GPU.
GLM-4.5V
Available: Vision-language GLM based on GLM-4.5-Air, covering image, video, document, grounding, and GUI-agent tasks.
Grok 4
Deprecated: xAI's fourth-generation Grok line, predating the later 4.x API updates tracked elsewhere in this catalog.
ERNIE-4.5-VL-424B-A47B
Available: Baidu's largest ERNIE 4.5 vision-language MoE, supporting text, image, and video inputs with thinking and non-thinking modes.
Kimi-VL-A3B-Thinking-2506
Available: Updated MIT-licensed Kimi-VL reasoning model with better multimodal reasoning, video understanding, high-resolution perception, and lower thinking-token use.
Claude Opus 4
Deprecated: First Claude 4 Opus model, positioned for long-running agentic and coding work before the 4.x point releases.
Kimi-Audio-7B-Instruct
Available: Open audio foundation model for audio understanding, generation, speech recognition, audio QA, captioning, and speech conversation.
Kimi-VL-A3B-Instruct
Available: Efficient MIT-licensed vision-language MoE for OCR, image/video understanding, long documents, and OS-style agent tasks.
OpenAI o3
Available: Reasoning model released alongside o4-mini with tool use, image reasoning, and stronger agentic problem solving.
GPT-4.1
Deprecated: API model family focused on coding, instruction following, and one-million-token long-context work.
Llama 4 Maverick
Available: Meta's flagship open-weight MoE; highest MMLU among open models at release.
Llama 4 Scout
Available: Efficient open-weight MoE designed for very long context on modest hardware.
Qwen2.5-Omni-7B
Available: Local omni-modal Qwen model that supports text, image, audio, video, and speech generation in a 7B package.
Gemini 2.5 Pro
Deprecated: Reasoning-focused Gemini 2.5 model that made thinking a core part of Google's flagship model line.
Mistral Small 3.1
Available: Apache-licensed Small update adding vision and a 128K context window to the efficient 24B line.
Claude 3.7 Sonnet
Retired: Anthropic's first hybrid-reasoning Sonnet. Shut down May 11, 2026 as the 4.x line matured.
Grok 3
Deprecated: xAI's third-generation model family, introduced with stronger reasoning, search, and coding modes.
Qwen2.5-VL-72B
Available: Vision-language Qwen2.5 model for image, document, video, and agentic visual grounding tasks.
Doubao-1.5-pro
Available: Doubao 1.5 Pro update positioned for stronger multimodal, reasoning, and agentic work in Volcano Engine.
Kimi k1.5
Available: Moonshot's multimodal reinforcement-learning reasoning model, reported as matching OpenAI o1 on math, coding, and multimodal reasoning.
MiniMax-01
Available: Open MiniMax generation with MiniMax-Text-01 and MiniMax-VL-01 long-context models.
Step-2
Available: Second-generation StepFun foundation model line with larger-scale multimodal and reasoning ambitions.
Gemini 2.0 Flash
Deprecated: First Gemini 2.0 release, built for native multimodal input/output, tool use, and agentic product integrations.
OpenAI o1
Deprecated: General release of OpenAI's o1 reasoning model with stronger deliberative reasoning and multimodal ChatGPT integration.
Amazon Nova Pro
Available: AWS-native multimodal model with a 300K context; size and architecture undisclosed.
Amazon Nova Lite
Available: Lower-cost multimodal Nova understanding model for text, image, and video inputs.
Claude 3.5 Haiku
Deprecated: Fast, lower-cost Claude 3.5 model for latency-sensitive coding, tool-use, and customer-facing workloads.
Llama 3.2 90B Vision
Available: First Llama family release with native vision models, alongside smaller edge-oriented 1B and 3B text models.
Molmo 72B
Available: Open multimodal model family trained for strong image understanding, pointing, and visual grounding.
Pixtral 12B
Available: Mistral's first open multimodal model, adding image understanding to a Mistral text backbone.
Grok-2
Retired: Second-generation Grok release with Grok-2 and Grok-2 mini for chat, coding, reasoning, and image-enabled product experiences.
MiniCPM-V 2.6
Available: 8B vision-language model for local image, multi-image, OCR, and video understanding, with llama.cpp and Ollama support.
Claude 3.5 Sonnet
Retired: Major Sonnet upgrade that became Anthropic's default high-intelligence workhorse for coding, writing, and visual reasoning.
GPT-4o
Retired: The 2024 omni-modal model that defined a generation of assistants. Deprecated in Feb 2026 and fully retired across ChatGPT on April 3, 2026.
Falcon 2 11B
Available: Falcon 2 generation, including text and vision-language 11B models under a permissive TII license.
Step-1V
Available: StepFun's first major vision-language model, released after the Step-1 language model.
Claude 3 Opus
Deprecated: Highest-capability Claude 3 model, launched alongside Sonnet and Haiku as part of Anthropic's first major vision-capable Claude family.
Gemini 1.5 Pro
Deprecated: Gemini generation that introduced production-scale long context, eventually expanding to a two-million-token window.
GLM-4
Available: Zhipu's GLM-4 flagship generation, launched as the successor to ChatGLM3 with stronger tool use and multimodal variants.
Gemini 1.0 Ultra
Deprecated: Google's first natively multimodal Gemini flagship, since superseded by the 1.5/2/3 lines.
GPT-4 Turbo
Deprecated: Lower-cost GPT-4 generation with a 128K context window, introduced at OpenAI DevDay.
ERNIE 4.0
Available: Baidu's fourth-generation ERNIE flagship, announced with stronger understanding, generation, reasoning, and memory.
LLaVA 1.5 13B
Available: Open vision-language assistant and one of the most widely run early local multimodal models.
EXAONE 2.0
Retired: Second EXAONE generation, improving bilingual Korean-English performance and enterprise deployment options.
GPT-4
Deprecated: The model that brought reliable multi-step reasoning to the mainstream; size never disclosed.
EXAONE 1.0
Retired: LG AI Research's first EXAONE foundation model generation, introduced as a large multimodal expert AI.