LLMs | Prediction Guard

This page provides information on the Large Language Models (LLMs) that are available in the Prediction Guard API. These models are designed for text inference, and are used in the /completions and /chat/completions endpoints.

Models

Model Name	Type	Use Case	Prompt Format	Context Length	More Info
Hermes-3-Llama-3.1-70B	Chat	Instruction following or chat-like applications	ChatML	20480	link
DeepSeek-R1-Distill-Qwen-32B	Chat/Reasoning	Instruction following or chat-like applications with reasoning	ChatML	32768	link
Qwen2.5-Coder-14B-Instruct	Code Generation	Generating computer code or answering tech questions	ChatML	32768	link
Hermes-3-Llama-3.1-8B	Chat	Instruction following or chat-like applications	ChatML	32768	link
neural-chat-7b-v3-3	Chat	Instruction following or chat-like applications	Neural Chat	32768	link

Model Descriptions

Hermes-3-Llama-3.1-70B

This is a general use model that excels at reasoning and multi-turn conversations, with an improved focus on longer context lengths. This allows for more accuracy and recall in areas that require a longer context window, along with being an improved version of the previous Hermes and Llama line of models.

Type: Chat
Use Case: Instruction Following or Chat-Like Applications
Prompt Format: ChatML

https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-70B

Hermes 3 is a generalist language model with many improvements over Hermes 2, including advanced agentic capabilities, much better roleplaying, reasoning, multi-turn conversation, long context coherence, and improvements across the board.

The ethos of the Hermes series of models is focused on aligning LLMs to the user, with powerful steering capabilities and control given to the end user.

The Hermes 3 series builds and expands on the Hermes 2 set of capabilities, including more powerful and reliable function calling and structured output capabilities, generalist assistant capabilities, and improved code generation skills.

DeepSeek-R1-Distill-Qwen-32B

DeepSeek-R1-Distill-Qwen-32B is a distilled reasoning model derived from DeepSeek-R1 achieving state-of-the-art performance for dense models.

Type: Chat/Reasoning
Use Case: Instruction following or chat-like applications with reasoning
Prompt Format: ChatML

https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B

DeepSeek-R1 is a first-generation reasoning model developed through large-scale reinforcement learning (RL).
The process involved training DeepSeek-R1-Zero purely with RL, without supervised fine-tuning (SFT)
allowing it to develop emergent reasoning behaviors such as self-verification, reflection, and generating long chain-of-thought (CoT) reasoning steps. To improve readability, coherence, and overall performance, DeepSeek-R1 was further trained with cold-start data before RL.

DeepSeek-R1’s reasoning capabilities were distilled into smaller models, significantly improving
their performance compared to models trained from scratch with RL.
Several dense models were fine-tuned using DeepSeek-R1’s reasoning data, leading to state-of-the-art
performance in benchmarks.
Open-source checkpoints are available in multiple sizes: 1.5B, 7B, 8B, 14B, 32B, and 70B,
based on Qwen2.5 and Llama3 series.

Qwen2.5-Coder-14B-Instruct

Qwen2.5-Coder is the latest series of code-specific Qwen large language models (formerly known as CodeQwen). It is designed to enhance code generation, reasoning, and fixing, making it a powerful tool for developers.

Type: Code Generation Use Case: Generating computer code or answering tech questions Prompt Format: ChatML

https://huggingface.co/Qwen/Qwen2.5-Coder-14B-Instruct

Qwen2.5-Coder builds on the strong foundation of Qwen2.5, scaling up training tokens
to 5.5 trillion, incorporating diverse data sources such as source code, text-code grounding
and synthetic data.

Key Improvements Over CodeQwen1.5

Enhanced Code Abilities: Significant improvements in code generation, reasoning, and fixing.
State-of-the-Art Performance: Qwen2.5-Coder-32B achieves coding performance on par with GPT-4o.
Real-World Application Support: Designed to power Code Agents, with strengths in coding,
mathematics, and general reasoning.

Qwen2.5-Coder-14B-Instruct balances model size and performance, offering strong coding capabilities
with instruction tuning for better usability.

Hermes-3-Llama-3.1-8B

Type: Chat
Use Case: Instruction Following or Chat-Like Applications
Prompt Format: ChatML

https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-8B

The ethos of the Hermes series of models is focused on aligning LLMs to the user, with powerful steering capabilities and control given to the end user.

neural-chat-7b-v3-3

A revolutionary AI model for performing digital conversations.

Type: Chat
Use Case: Instruction Following or Chat-Like Applications
Prompt Format: Neural Chat

https://huggingface.co/Intel/neural-chat-7b-v3-3

This model is a fine-tuned 7B parameter LLM on the Intel Gaudi 2 processor from the Intel/neural-chat-7b-v3-1 on the meta-math/MetaMathQA dataset. The model was aligned using the Direct Performance Optimization (DPO) method with Intel/orca_dpo_pairs. The Intel/neural-chat-7b-v3-1 was originally fine-tuned from mistralai/Mistral-7B-v-0.1. For more information, refer to the blog

The Practice of Supervised Fine-tuning and Direct Preference Optimization on Intel Gaudi2