LLMs

This page provides information on the Large Language Models (LLMs) that are available in the Prediction Guard API. These models are designed for text inference, and are used in the /completions and /chat/completions endpoints.
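As a sketch of how these models are invoked, the request body for the /chat/completions endpoint can be assembled as below. The field names follow the OpenAI-style chat schema these endpoints are modeled on, and the commented endpoint URL and auth header are assumptions — consult the Prediction Guard API reference for the authoritative contract.

```python
import json

# Build a request body for the /chat/completions endpoint.
# Field names ("model", "messages", "max_tokens") follow the common
# OpenAI-style schema; treat them as assumptions and verify against
# the official Prediction Guard API reference.
def build_chat_request(model, user_message, system_message=None, max_tokens=256):
    messages = []
    if system_message:
        messages.append({"role": "system", "content": system_message})
    messages.append({"role": "user", "content": user_message})
    return {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
    }

payload = build_chat_request(
    "Hermes-3-Llama-3.1-8B",
    "Summarize the benefits of open-weight LLMs in two sentences.",
    system_message="You are a concise assistant.",
)
print(json.dumps(payload, indent=2))

# Sending the request (hypothetical URL and auth header — check the docs):
# import urllib.request
# req = urllib.request.Request(
#     "https://api.predictionguard.com/chat/completions",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json",
#              "Authorization": "Bearer <YOUR_API_KEY>"},
# )
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```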

Models

| Model Name | Type | Use Case | Prompt Format | Context Length | More Info |
| --- | --- | --- | --- | --- | --- |
| Hermes-3-Llama-3.1-70B | Chat | Instruction following or chat-like applications | ChatML | 20480 | link |
| gpt-oss-120b | Chat/Reasoning | Instruction following or chat-like applications with reasoning | ChatML | 64000 | link |
| Qwen2.5-Coder-14B-Instruct | Code Generation | Generating computer code or answering tech questions | ChatML | 20480 | link |
| Hermes-3-Llama-3.1-8B | Chat | Instruction following or chat-like applications | ChatML | 32768 | link |
| neural-chat-7b-v3-3 | Chat | Instruction following or chat-like applications | Neural Chat | 32768 | link |

Model Descriptions

Hermes-3-Llama-3.1-70B

This is a general use model that excels at reasoning and multi-turn conversations, with an improved focus on longer context lengths. This allows for more accuracy and recall in areas that require a longer context window, along with being an improved version of the previous Hermes and Llama line of models.

Type: Chat
Use Case: Instruction Following or Chat-Like Applications
Prompt Format: ChatML

https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-70B

Hermes 3 is a generalist language model with many improvements over Hermes 2, including advanced agentic capabilities, much better roleplaying, reasoning, multi-turn conversation, long context coherence, and improvements across the board.

The ethos of the Hermes series of models is focused on aligning LLMs to the user, with powerful steering capabilities and control given to the end user.

The Hermes 3 series builds and expands on the Hermes 2 set of capabilities, including more powerful and reliable function calling and structured output capabilities, generalist assistant capabilities, and improved code generation skills.
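The ChatML prompt format used by the Hermes models wraps each conversation turn in `<|im_start|>`/`<|im_end|>` tokens. The API applies this template server-side, so the following rendering sketch is only illustrative:

```python
# Render a list of chat messages in ChatML, the prompt format used by
# the Hermes models. Each turn is delimited by <|im_start|>/<|im_end|>.
def to_chatml(messages):
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
    # Leave the assistant turn open so the model continues from here.
    parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)

prompt = to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
])
print(prompt)
```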

gpt-oss-120b

gpt-oss-120b is OpenAI’s open-weight model designed for powerful reasoning, agentic tasks, and versatile developer use cases. This model features configurable reasoning effort and full chain-of-thought capabilities.

Type: Chat/Reasoning
Use Case: Instruction following or chat-like applications with reasoning
Prompt Format: ChatML

https://huggingface.co/openai/gpt-oss-120b

gpt-oss-120b is part of OpenAI’s gpt-oss series of open-weight models designed for powerful reasoning, agentic tasks, and versatile developer use cases. The model features:

  • Configurable reasoning effort: Easily adjust the reasoning effort (low, medium, high) based on your specific use case and latency needs.
  • Full chain-of-thought: Gain complete access to the model’s reasoning process, facilitating easier debugging and increased trust in outputs.
  • Agentic capabilities: Use the model’s native capabilities for function calling, web browsing, Python code execution, and Structured Outputs.
  • Apache 2.0 license: Build freely without copyleft restrictions or patent risk—ideal for experimentation, customization, and commercial deployment.
  • MXFP4 quantization: The model was post-trained with MXFP4 quantization of the MoE weights, making it run efficiently on a single 80GB GPU (like NVIDIA H100 or AMD MI300X).
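The configurable reasoning effort described above can be selected per request. In this sketch the `reasoning_effort` field name is an assumption based on how OpenAI-compatible APIs commonly expose this knob; verify the exact parameter name against the Prediction Guard API reference.

```python
import json

# Hypothetical request body for gpt-oss-120b with configurable reasoning
# effort: "low" trades reasoning depth for latency, "high" does the opposite.
def build_reasoning_request(prompt, effort="medium"):
    assert effort in ("low", "medium", "high")  # the three documented levels
    return {
        "model": "gpt-oss-120b",
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": effort,  # assumed field name; check the API docs
        "max_tokens": 1024,
    }

payload = build_reasoning_request(
    "Plan a 3-step migration from REST to gRPC.", effort="high"
)
print(json.dumps(payload, indent=2))
```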

Qwen2.5-Coder-14B-Instruct

Qwen2.5-Coder is the latest series of code-specific Qwen large language models (formerly known as CodeQwen). It is designed to enhance code generation, reasoning, and fixing, making it a powerful tool for developers.

Type: Code Generation
Use Case: Generating computer code or answering tech questions
Prompt Format: ChatML

https://huggingface.co/Qwen/Qwen2.5-Coder-14B-Instruct

Qwen2.5-Coder builds on the strong foundation of Qwen2.5, scaling up training tokens to 5.5 trillion and incorporating diverse data sources such as source code, text-code grounding, and synthetic data.

Key Improvements Over CodeQwen1.5

  • Enhanced Code Abilities: Significant improvements in code generation, reasoning, and fixing.
  • State-of-the-Art Performance: Qwen2.5-Coder-32B achieves coding performance on par with GPT-4o.
  • Real-World Application Support: Designed to power Code Agents, with strengths in coding, mathematics, and general reasoning.

Qwen2.5-Coder-14B-Instruct balances model size and performance, offering strong coding capabilities with instruction tuning for better usability.
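For code generation, a low temperature and a system prompt that pins the output to code tend to give more deterministic results. A sketch of such a request follows; the field names are assumptions based on OpenAI-style chat schemas, so check the Prediction Guard API reference for the exact contract.

```python
import json

# Sketch of a code-generation request for Qwen2.5-Coder-14B-Instruct.
# A low temperature keeps the generated code deterministic; the field
# names here are assumed, not confirmed.
payload = {
    "model": "Qwen2.5-Coder-14B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a coding assistant. Reply with code only."},
        {"role": "user", "content": "Write a Python function that reverses a linked list."},
    ],
    "temperature": 0.1,
    "max_tokens": 512,
}
print(json.dumps(payload, indent=2))
```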

Hermes-3-Llama-3.1-8B

This is a general use model that excels at reasoning and multi-turn conversations, with an improved focus on longer context lengths. This allows for more accuracy and recall in areas that require a longer context window, along with being an improved version of the previous Hermes and Llama line of models.

Type: Chat
Use Case: Instruction Following or Chat-Like Applications
Prompt Format: ChatML

https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-8B

Hermes 3 is a generalist language model with many improvements over Hermes 2, including advanced agentic capabilities, much better roleplaying, reasoning, multi-turn conversation, long context coherence, and improvements across the board.

The ethos of the Hermes series of models is focused on aligning LLMs to the user, with powerful steering capabilities and control given to the end user.

The Hermes 3 series builds and expands on the Hermes 2 set of capabilities, including more powerful and reliable function calling and structured output capabilities, generalist assistant capabilities, and improved code generation skills.

neural-chat-7b-v3-3

A fine-tuned 7B parameter chat model from Intel, designed for instruction following and conversational applications.

Type: Chat
Use Case: Instruction Following or Chat-Like Applications
Prompt Format: Neural Chat

https://huggingface.co/Intel/neural-chat-7b-v3-3

This model is a 7B parameter LLM fine-tuned on the Intel Gaudi 2 processor from Intel/neural-chat-7b-v3-1 on the meta-math/MetaMathQA dataset. The model was aligned using the Direct Preference Optimization (DPO) method with Intel/orca_dpo_pairs. Intel/neural-chat-7b-v3-1 was originally fine-tuned from mistralai/Mistral-7B-v0.1. For more information, refer to the blog post The Practice of Supervised Fine-tuning and Direct Preference Optimization on Intel Gaudi2.
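Unlike the other models on this page, neural-chat-7b-v3-3 uses the Neural Chat prompt format. The rendering below mirrors the template on the Intel model card; the API applies it server-side, so this is only illustrative.

```python
# Render a prompt in the Neural Chat format used by neural-chat-7b-v3-3,
# which separates turns with "### System:", "### User:", and "### Assistant:".
def to_neural_chat(system, user):
    return f"### System:\n{system}\n### User:\n{user}\n### Assistant:\n"

prompt = to_neural_chat(
    "You are a helpful assistant.",
    "Explain gradient descent in one sentence.",
)
print(prompt)
```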