LLMs

This page provides information on the Large Language Models (LLMs) that are available in the Prediction Guard API. These models are designed for text inference, and are used in the /completions and /chat/completions endpoints.
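As a sketch of how these models are invoked, the request body for the /chat/completions endpoint can be assembled as below. The field names follow the OpenAI-style chat schema these endpoints are modeled on, and the commented endpoint URL and auth header are assumptions — consult the Prediction Guard API reference for the authoritative contract.

```python
import json

# Build a request body for the /chat/completions endpoint.
# Field names ("model", "messages", "max_tokens") follow the common
# OpenAI-style schema; treat them as assumptions and verify against
# the official Prediction Guard API reference.
def build_chat_request(model, user_message, system_message=None, max_tokens=256):
    messages = []
    if system_message:
        messages.append({"role": "system", "content": system_message})
    messages.append({"role": "user", "content": user_message})
    return {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
    }

payload = build_chat_request(
    "Hermes-3-Llama-3.1-8B",
    "Summarize the benefits of open-weight LLMs in two sentences.",
    system_message="You are a concise assistant.",
)
print(json.dumps(payload, indent=2))

# Sending the request (hypothetical URL and auth header — check the docs):
# import urllib.request
# req = urllib.request.Request(
#     "https://api.predictionguard.com/chat/completions",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json",
#              "Authorization": "Bearer <YOUR_API_KEY>"},
# )
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```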

Models

| Model Name | Type | Use Case | Prompt Format | Context Length | More Info |
| --- | --- | --- | --- | --- | --- |
| Hermes-3-Llama-3.1-70B | Chat | Instruction following or chat-like applications | ChatML | 20480 | link |
| gpt-oss-120b | Chat/Reasoning | Instruction following or chat-like applications with reasoning | ChatML | 64000 | link |
| Qwen2.5-Coder-14B-Instruct | Code Generation | Generating computer code or answering tech questions | ChatML | 20480 | link |
| Hermes-3-Llama-3.1-8B | Chat | Instruction following or chat-like applications | ChatML | 32768 | link |
| neural-chat-7b-v3-3 | Chat | Instruction following or chat-like applications | Neural Chat | 32768 | link |

Model Descriptions

Hermes-3-Llama-3.1-70B

This is a general use model that excels at reasoning and multi-turn conversations, with an improved focus on longer context lengths. This allows for more accuracy and recall in areas that require a longer context window, along with being an improved version of the previous Hermes and Llama line of models.

Type: Chat
Use Case: Instruction Following or Chat-Like Applications
Prompt Format: ChatML

https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-70B

Hermes 3 is a generalist language model with many improvements over Hermes 2, including advanced agentic capabilities, much better roleplaying, reasoning, multi-turn conversation, long context coherence, and improvements across the board.

The ethos of the Hermes series of models is focused on aligning LLMs to the user, with powerful steering capabilities and control given to the end user.

The Hermes 3 series builds and expands on the Hermes 2 set of capabilities, including more powerful and reliable function calling and structured output capabilities, generalist assistant capabilities, and improved code generation skills.
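The ChatML prompt format used by the Hermes models wraps each conversation turn in `<|im_start|>`/`<|im_end|>` tokens. The API applies this template server-side, so the following rendering sketch is only illustrative:

```python
# Render a list of chat messages in ChatML, the prompt format used by
# the Hermes models. Each turn is delimited by <|im_start|>/<|im_end|>.
def to_chatml(messages):
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
    # Leave the assistant turn open so the model continues from here.
    parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)

prompt = to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
])
print(prompt)
```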

gpt-oss-120b

gpt-oss-120b is OpenAI’s open-weight model designed for powerful reasoning, agentic tasks, and versatile developer use cases. This model features configurable reasoning effort and full chain-of-thought capabilities.

Type: Chat/Reasoning
Use Case: Instruction following or chat-like applications with reasoning
Prompt Format: ChatML

https://huggingface.co/openai/gpt-oss-120b

gpt-oss-120b is part of OpenAI’s gpt-oss series of open-weight models designed for powerful reasoning, agentic tasks, and versatile developer use cases. The model features:

  • Configurable reasoning effort: Easily adjust the reasoning effort (low, medium, high) based on your specific use case and latency needs.
  • Full chain-of-thought: Gain complete access to the model’s reasoning process, facilitating easier debugging and increased trust in outputs.
  • Agentic capabilities: Use the model’s native capabilities for function calling, web browsing, Python code execution, and Structured Outputs.
  • Apache 2.0 license: Build freely without copyleft restrictions or patent risk—ideal for experimentation, customization, and commercial deployment.
  • MXFP4 quantization: The model was post-trained with MXFP4 quantization of the MoE weights, making it run efficiently on a single 80GB GPU (like NVIDIA H100 or AMD MI300X).
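The configurable reasoning effort described above can be selected per request. In this sketch the `reasoning_effort` field name is an assumption based on how OpenAI-compatible APIs commonly expose this knob; verify the exact parameter name against the Prediction Guard API reference.

```python
import json

# Hypothetical request body for gpt-oss-120b with configurable reasoning
# effort: "low" trades reasoning depth for latency, "high" does the opposite.
def build_reasoning_request(prompt, effort="medium"):
    assert effort in ("low", "medium", "high")  # the three documented levels
    return {
        "model": "gpt-oss-120b",
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": effort,  # assumed field name; check the API docs
        "max_tokens": 1024,
    }

payload = build_reasoning_request(
    "Plan a 3-step migration from REST to gRPC.", effort="high"
)
print(json.dumps(payload, indent=2))
```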

Qwen2.5-Coder-14B-Instruct

Qwen2.5-Coder is the latest series of code-specific Qwen large language models (formerly known as CodeQwen). It is designed to enhance code generation, reasoning, and fixing, making it a powerful tool for developers.

Type: Code Generation
Use Case: Generating computer code or answering tech questions
Prompt Format: ChatML

https://huggingface.co/Qwen/Qwen2.5-Coder-14B-Instruct

Qwen2.5-Coder builds on the strong foundation of Qwen2.5, scaling up training tokens to 5.5 trillion and incorporating diverse data sources such as source code, text-code grounding, and synthetic data.

Key Improvements Over CodeQwen1.5

  • Enhanced Code Abilities: Significant improvements in code generation, reasoning, and fixing.
  • State-of-the-Art Performance: Qwen2.5-Coder-32B achieves coding performance on par with GPT-4o.
  • Real-World Application Support: Designed to power Code Agents, with strengths in coding, mathematics, and general reasoning.

Qwen2.5-Coder-14B-Instruct balances model size and performance, offering strong coding capabilities with instruction tuning for better usability.
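For code generation, a low temperature and a system prompt that pins the output to code tend to give more deterministic results. A sketch of such a request follows; the field names are assumptions based on OpenAI-style chat schemas, so check the Prediction Guard API reference for the exact contract.

```python
import json

# Sketch of a code-generation request for Qwen2.5-Coder-14B-Instruct.
# A low temperature keeps the generated code deterministic; the field
# names here are assumed, not confirmed.
payload = {
    "model": "Qwen2.5-Coder-14B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a coding assistant. Reply with code only."},
        {"role": "user", "content": "Write a Python function that reverses a linked list."},
    ],
    "temperature": 0.1,
    "max_tokens": 512,
}
print(json.dumps(payload, indent=2))
```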

Hermes-3-Llama-3.1-8B

This is a general use model that excels at reasoning and multi-turn conversations, with an improved focus on longer context lengths. This allows for more accuracy and recall in areas that require a longer context window, along with being an improved version of the previous Hermes and Llama line of models.

Type: Chat
Use Case: Instruction Following or Chat-Like Applications
Prompt Format: ChatML

https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-8B

Hermes 3 is a generalist language model with many improvements over Hermes 2, including advanced agentic capabilities, much better roleplaying, reasoning, multi-turn conversation, long context coherence, and improvements across the board.

The ethos of the Hermes series of models is focused on aligning LLMs to the user, with powerful steering capabilities and control given to the end user.

The Hermes 3 series builds and expands on the Hermes 2 set of capabilities, including more powerful and reliable function calling and structured output capabilities, generalist assistant capabilities, and improved code generation skills.

neural-chat-7b-v3-3

A fine-tuned 7B parameter chat model from Intel, designed for instruction following and conversational applications.

Type: Chat
Use Case: Instruction Following or Chat-Like Applications
Prompt Format: Neural Chat

https://huggingface.co/Intel/neural-chat-7b-v3-3

This model is a 7B parameter LLM fine-tuned on the Intel Gaudi 2 processor from Intel/neural-chat-7b-v3-1 on the meta-math/MetaMathQA dataset. The model was aligned using the Direct Preference Optimization (DPO) method with Intel/orca_dpo_pairs. Intel/neural-chat-7b-v3-1 was originally fine-tuned from mistralai/Mistral-7B-v0.1. For more information, refer to the blog post The Practice of Supervised Fine-tuning and Direct Preference Optimization on Intel Gaudi2.
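Unlike the other models on this page, neural-chat-7b-v3-3 uses the Neural Chat prompt format. The rendering below mirrors the template on the Intel model card; the API applies it server-side, so this is only illustrative.

```python
# Render a prompt in the Neural Chat format used by neural-chat-7b-v3-3,
# which separates turns with "### System:", "### User:", and "### Assistant:".
def to_neural_chat(system, user):
    return f"### System:\n{system}\n### User:\n{user}\n### Assistant:\n"

prompt = to_neural_chat(
    "You are a helpful assistant.",
    "Explain gradient descent in one sentence.",
)
print(prompt)
```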