LVMs
This page provides information on the Large Vision Models (LVMs) that are available in the Prediction Guard API.
These multimodal models accept both text and images as input and are served through the /chat/completions
endpoint.
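As a rough sketch of how an image-plus-text request to this endpoint might look (the base URL, Bearer auth header, and OpenAI-style message schema here are assumptions; check the API reference for the exact details):

```python
import requests

# Assumed endpoint and placeholder key; substitute your actual
# Prediction Guard URL and API key.
API_URL = "https://api.predictionguard.com/chat/completions"
API_KEY = "<YOUR_API_KEY>"

# OpenAI-style multimodal message: one text part and one image part.
payload = {
    "model": "Qwen2.5-VL-7B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.jpg"},
                },
            ],
        }
    ],
}

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
)
response.raise_for_status()

# Assumes an OpenAI-style response body with a choices list.
print(response.json()["choices"][0]["message"]["content"])
```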
Models
Model Descriptions
Qwen2.5-VL-7B-Instruct
This is a powerful vision-language model designed for complex multimodal understanding and instruction following. It excels at image, video, and document comprehension while maintaining strong capabilities in reasoning and code generation.
Type: Chat
Use Case: Vision-Language Understanding, Instruction Following, Tool Use
Prompt Format: ChatML
https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct
Qwen2.5-VL is the latest generation in the Qwen series of vision-language models, offering major improvements in structured visual understanding, agentic tool use, and long-context video reasoning. This 7B instruction-tuned variant is optimized for following user intent across both textual and visual inputs. Key enhancements include advanced multimodal reasoning, dynamic tool use, visual localization, and the ability to generate structured outputs from documents, forms, and images. It also introduces robust capabilities for understanding and summarizing long videos, including videos over an hour in length, with temporal alignment and event capture.

Qwen2.5-VL-7B-Instruct is built on an optimized ViT-based vision encoder with enhancements such as SwiGLU and RMSNorm, plus dynamic resolution sampling for efficient video comprehension. The model supports reliable JSON outputs and coordinates for grounded visual tasks, making it suitable for enterprise and applied AI use cases across finance, commerce, and logistics.
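As an illustration of the structured-output capability, here is a hedged sketch of prompting the model for grounded JSON. The endpoint details follow the same assumptions as the sketch above, and the bbox schema in the prompt is illustrative, not a documented API contract:

```python
import json
import requests

API_URL = "https://api.predictionguard.com/chat/completions"  # assumed base URL
API_KEY = "<YOUR_API_KEY>"  # placeholder

# Illustrative prompt: ask for a single JSON object with pixel coordinates.
prompt = (
    "Locate the signature field on this form. Reply with only a JSON object "
    'of the form {"label": "<name>", "bbox": [x1, y1, x2, y2]}.'
)

payload = {
    "model": "Qwen2.5-VL-7B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/form.png"},
                },
            ],
        }
    ],
}

resp = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
)
resp.raise_for_status()

# Parse the model's JSON reply; production code should guard against
# malformed or non-JSON output from the model.
detection = json.loads(resp.json()["choices"][0]["message"]["content"])
print(detection["label"], detection["bbox"])
```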