LLMs
This page provides information on the Large Language Models (LLMs) that are available in the Prediction Guard API.
These models are designed for text inference and are used in the /completions and /chat/completions endpoints.
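As a quick orientation, the sketch below sends a chat request to the /chat/completions endpoint with the Python requests library. It assumes the base URL https://api.predictionguard.com, Bearer-token authentication via a PREDICTIONGUARD_API_KEY environment variable, and an OpenAI-style response shape; consult the API reference for the exact authentication scheme and response fields.

```python
import os
import requests

# Assumed base URL and Bearer-token auth; verify against the API reference.
API_URL = "https://api.predictionguard.com/chat/completions"
API_KEY = os.environ["PREDICTIONGUARD_API_KEY"]

payload = {
    "model": "Hermes-3-Llama-3.1-70B",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the benefits of long-context models."},
    ],
    "max_tokens": 256,
    "temperature": 0.7,
}

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=60,
)
response.raise_for_status()

# An OpenAI-style response schema is assumed here; adjust if it differs.
print(response.json()["choices"][0]["message"]["content"])
```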
Models
Model Descriptions
Hermes-3-Llama-3.1-70B
This is a general-use model that excels at reasoning and multi-turn conversations, with an improved focus on longer context lengths. This allows for more accuracy and recall in tasks that require a longer context window, and it improves on the previous Hermes and Llama lines of models.
Type: Chat
Use Case: Instruction Following or Chat-Like Applications
Prompt Format: ChatML
https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-70B
Hermes 3 is a generalist language model with many improvements over Hermes 2, including advanced agentic capabilities, much better roleplaying, reasoning, multi-turn conversation, long context coherence, and improvements across the board.
The ethos of the Hermes series of models is focused on aligning LLMs to the user, with powerful steering capabilities and control given to the end user.
The Hermes 3 series builds and expands on the Hermes 2 set of capabilities, including more powerful and reliable function calling and structured output capabilities, generalist assistant capabilities, and improved code generation skills.
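Because Hermes 3 uses the ChatML prompt format, a raw prompt for the /completions endpoint wraps each turn in <|im_start|> / <|im_end|> markers. The sketch below illustrates that formatting; the endpoint URL, auth header, and response shape are assumptions carried over from the example above, and the /chat/completions endpoint typically applies this template for you.

```python
import os
import requests

# ChatML wraps each turn in <|im_start|>{role} ... <|im_end|> markers.
prompt = (
    "<|im_start|>system\n"
    "You are Hermes 3, a helpful and precise assistant.<|im_end|>\n"
    "<|im_start|>user\n"
    "Explain function calling in one paragraph.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

# Assumed endpoint and auth scheme; see the API reference for details.
response = requests.post(
    "https://api.predictionguard.com/completions",
    headers={"Authorization": f"Bearer {os.environ['PREDICTIONGUARD_API_KEY']}"},
    json={"model": "Hermes-3-Llama-3.1-70B", "prompt": prompt, "max_tokens": 200},
    timeout=60,
)
print(response.json()["choices"][0]["text"])
```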
gpt-oss-120b
gpt-oss-120b is OpenAI’s open-weight model designed for powerful reasoning, agentic tasks, and versatile developer use cases. This model features configurable reasoning effort and full chain-of-thought capabilities.
Type: Chat/Reasoning
Use Case: Instruction following or chat-like applications with reasoning
Prompt Format: ChatML
https://huggingface.co/openai/gpt-oss-120b
gpt-oss-120b is part of OpenAI’s gpt-oss series of open-weight models designed for powerful reasoning, agentic tasks, and versatile developer use cases. The model features:
- Configurable reasoning effort: Easily adjust the reasoning effort (low, medium, high) based on your specific use case and latency needs.
- Full chain-of-thought: Gain complete access to the model’s reasoning process, facilitating easier debugging and increased trust in outputs.
- Agentic capabilities: Use the model’s native capabilities for function calling, web browsing, Python code execution, and Structured Outputs.
- Apache 2.0 license: Build freely without copyleft restrictions or patent risk—ideal for experimentation, customization, and commercial deployment.
- MXFP4 quantization: The model was post-trained with MXFP4 quantization of the MoE weights, making it run efficiently on a single 80GB GPU (like NVIDIA H100 or AMD MI300X).
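As a rough illustration of configurable reasoning effort, the sketch below requests high effort through the system message, which is how the gpt-oss chat template commonly exposes this setting. Whether the API also accepts a dedicated request parameter for reasoning effort is an assumption to check against the API reference; endpoint and auth details follow the earlier examples.

```python
import os
import requests

# Reasoning effort (low / medium / high) is assumed to be set via the system
# message, following the gpt-oss chat template; verify the exact mechanism
# in the API reference.
payload = {
    "model": "gpt-oss-120b",
    "messages": [
        {"role": "system", "content": "Reasoning: high"},
        {"role": "user", "content": "Plan the steps to migrate a REST API to gRPC."},
    ],
    "max_tokens": 512,
}

response = requests.post(
    "https://api.predictionguard.com/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['PREDICTIONGUARD_API_KEY']}"},
    json=payload,
    timeout=120,
)
print(response.json()["choices"][0]["message"]["content"])
```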
Qwen2.5-Coder-14B-Instruct
Qwen2.5-Coder is the latest series of code-specific Qwen large language models (formerly known as CodeQwen). It is designed to enhance code generation, reasoning, and fixing, making it a powerful tool for developers.
Type: Code Generation
Use Case: Generating computer code or answering tech questions
Prompt Format: ChatML
https://huggingface.co/Qwen/Qwen2.5-Coder-14B-Instruct
Qwen2.5-Coder builds on the strong foundation of Qwen2.5, scaling up training tokens to 5.5 trillion and incorporating diverse data sources such as source code, text-code grounding, and synthetic data.
Key Improvements Over CodeQwen1.5
- Enhanced Code Abilities: Significant improvements in code generation, reasoning, and fixing.
- State-of-the-Art Performance: Qwen2.5-Coder-32B achieves coding performance on par with GPT-4o.
- Real-World Application Support: Designed to power Code Agents, with strengths in coding, mathematics, and general reasoning.
Qwen2.5-Coder-14B-Instruct balances model size and performance, offering strong coding capabilities with instruction tuning for better usability.
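As a brief illustration of a code-generation request, the sketch below asks Qwen2.5-Coder-14B-Instruct to write a small function via the chat endpoint; the URL, auth, and response shape follow the same assumptions as the earlier examples.

```python
import os
import requests

payload = {
    "model": "Qwen2.5-Coder-14B-Instruct",
    "messages": [
        {"role": "system", "content": "You are an expert Python programmer."},
        {"role": "user", "content": "Write a function that parses an ISO 8601 date string."},
    ],
    "max_tokens": 400,
    "temperature": 0.2,  # a lower temperature tends to help for code generation
}

response = requests.post(
    "https://api.predictionguard.com/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['PREDICTIONGUARD_API_KEY']}"},
    json=payload,
    timeout=60,
)
print(response.json()["choices"][0]["message"]["content"])
```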
Hermes-3-Llama-3.1-8B
This is a general-use model that excels at reasoning and multi-turn conversations, with an improved focus on longer context lengths. This allows for more accuracy and recall in tasks that require a longer context window, and it improves on the previous Hermes and Llama lines of models.
Type: Chat
Use Case: Instruction Following or Chat-Like Applications
Prompt Format: ChatML
https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-8B
Hermes 3 is a generalist language model with many improvements over Hermes 2, including advanced agentic capabilities, much better roleplaying, reasoning, multi-turn conversation, long context coherence, and improvements across the board.
The ethos of the Hermes series of models is focused on aligning LLMs to the user, with powerful steering capabilities and control given to the end user.
The Hermes 3 series builds and expands on the Hermes 2 set of capabilities, including more powerful and reliable function calling and structured output capabilities, generalist assistant capabilities, and improved code generation skills.
neural-chat-7b-v3-3
A compact 7B parameter chat model from Intel, fine-tuned for conversational applications.
Type: Chat
Use Case: Instruction Following or Chat-Like Applications
Prompt Format: Neural Chat
https://huggingface.co/Intel/neural-chat-7b-v3-3
This model is a 7B parameter LLM fine-tuned from Intel/neural-chat-7b-v3-1 on the meta-math/MetaMathQA dataset using the Intel Gaudi 2 processor. The model was aligned using the Direct Preference Optimization (DPO) method with Intel/orca_dpo_pairs. Intel/neural-chat-7b-v3-1 was originally fine-tuned from mistralai/Mistral-7B-v0.1. For more information, refer to the blog post
The Practice of Supervised Fine-tuning and Direct Preference Optimization on Intel Gaudi2
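Since this model uses the Neural Chat prompt format rather than ChatML, a raw /completions prompt is structured with ### System:, ### User:, and ### Assistant: headers, as described on the Intel model card. The sketch below shows that formatting under the same endpoint and auth assumptions as the earlier examples.

```python
import os
import requests

# Neural Chat formatting (per the Intel model card): sections are introduced
# by "### System:", "### User:", and "### Assistant:" headers.
prompt = (
    "### System:\n"
    "You are a helpful assistant.\n"
    "### User:\n"
    "What is the capital of France?\n"
    "### Assistant:\n"
)

# Assumed endpoint and auth scheme; see the API reference for details.
response = requests.post(
    "https://api.predictionguard.com/completions",
    headers={"Authorization": f"Bearer {os.environ['PREDICTIONGUARD_API_KEY']}"},
    json={"model": "neural-chat-7b-v3-3", "prompt": prompt, "max_tokens": 100},
    timeout=60,
)
print(response.json()["choices"][0]["text"])
```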