LVMs
This page provides information on the Large Vision Models (LVMs) that are available in the Prediction Guard API.
These multimodal models accept both text and images as input and are served through the /chat/completions
endpoint.
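As a rough sketch of how an image-plus-text request to this endpoint might look (the base URL, Bearer auth header, and OpenAI-style message schema here are assumptions; check the API reference for the exact details):

```python
import requests

# Assumed endpoint and placeholder key; substitute your actual
# Prediction Guard URL and API key.
API_URL = "https://api.predictionguard.com/chat/completions"
API_KEY = "<YOUR_API_KEY>"

# OpenAI-style multimodal message: one text part and one image part.
payload = {
    "model": "Qwen2.5-VL-7B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.jpg"},
                },
            ],
        }
    ],
}

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
)
response.raise_for_status()

# Assumes an OpenAI-style response body with a choices list.
print(response.json()["choices"][0]["message"]["content"])
```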
Models
Model Descriptions
Qwen2.5-VL-7B-Instruct
This is a powerful vision-language model designed for complex multimodal understanding and instruction following. It excels at image, video, and document comprehension while maintaining strong capabilities in reasoning and code generation.
Type: Chat
Use Case: Vision-Language Understanding, Instruction Following, Tool Use
Prompt Format: ChatML
https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct
Qwen2.5-VL is the latest generation in the Qwen series of vision-language models, offering major improvements in structured visual understanding, agentic tool use, and long-context video reasoning. This 7B instruction-tuned variant is optimized for following user intent across both textual and visual inputs. Key enhancements include advanced multimodal reasoning, dynamic tool use, visual localization, and the ability to generate structured outputs from documents, forms, and images. It also introduces robust capabilities for understanding and summarizing long videos, including videos over an hour in length, with temporal alignment and event capture.

Qwen2.5-VL-7B-Instruct is built on an optimized ViT-based vision encoder with enhancements such as SwiGLU and RMSNorm, plus dynamic resolution sampling for efficient video comprehension. The model supports reliable JSON outputs and coordinates for grounded visual tasks, making it suitable for enterprise and applied AI use cases across finance, commerce, and logistics.
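As an illustration of the structured-output capability, here is a hedged sketch of prompting the model for grounded JSON. The endpoint details follow the same assumptions as the sketch above, and the bbox schema in the prompt is illustrative, not a documented API contract:

```python
import json
import requests

API_URL = "https://api.predictionguard.com/chat/completions"  # assumed base URL
API_KEY = "<YOUR_API_KEY>"  # placeholder

# Illustrative prompt: ask for a single JSON object with pixel coordinates.
prompt = (
    "Locate the signature field on this form. Reply with only a JSON object "
    'of the form {"label": "<name>", "bbox": [x1, y1, x2, y2]}.'
)

payload = {
    "model": "Qwen2.5-VL-7B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/form.png"},
                },
            ],
        }
    ],
}

resp = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
)
resp.raise_for_status()

# Parse the model's JSON reply; production code should guard against
# malformed or non-JSON output from the model.
detection = json.loads(resp.json()["choices"][0]["message"]["content"])
print(detection["label"], detection["bbox"])
```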