Accessing LLMs
(Run this example in Google Colab here)
Prompting is the process of providing a partial, usually textual, input to a model. As we discussed in the last chapter, the model then uses its parameterized data transformations to find a probable completion or output that matches the prompt.
To run any prompt through a model, we first need to establish how we will access generative AI models and perform inference. The landscape of generative AI models varies enormously in size, access patterns, licensing, and so on. However, a common theme is accessing LLMs through a REST API, which is either:
- Provided by a third party service (OpenAI, Anthropic, Cohere, etc.)
- Self-hosted in your own infrastructure or in an account you control with a model hosting provider (Replicate, Baseten, etc.)
- Self-hosted using a DIY model serving API (Flask, FastAPI, etc.)
We will use Prediction Guard to call open-access LLMs (like Mistral, Llama 3, DeepSeek, etc.) via a standardized, OpenAI-like API. This will allow us to explore the full range of available LLMs and illustrate how companies can access a wide variety of models beyond the GPT family.
In order to “prompt” an LLM via Prediction Guard (and eventually engineer prompts), you can use any of the following SDKs: Python, Go, Rust, JS, and HTTP.
We will use Python to show an example:
You will need to install the Prediction Guard client into your Python environment.
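For example, assuming the client library is published on PyPI under the name predictionguard, it can be installed with pip:

```bash
$ pip install predictionguard
```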
Now import PredictionGuard, set up your API key, and create the client.
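A minimal sketch of that setup, assuming the package exposes a PredictionGuard client class and reads the key from an environment variable (the variable name shown here is an assumption; check the SDK docs for your installed version):

```python
import os

from predictionguard import PredictionGuard

# Provide your Prediction Guard API key. If the key is already exported
# in your shell, you can skip this line. (Variable name is assumed.)
os.environ["PREDICTIONGUARD_API_KEY"] = "<your api key>"

# Create the client, which we will reuse for all completion requests.
client = PredictionGuard()
```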
Generating text with one of these models is then just a single request for a “Completion” (note that chat completions are also supported). Here we will call the Hermes-2-Pro-Llama-3-8B model and try to have it autocomplete a joke.
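A sketch of that request, assuming the client exposes an OpenAI-style completions.create method that returns a dict-like response (the model name comes from the text above; the prompt text and max_tokens value are illustrative):

```python
import json

# Ask the model to autocomplete a joke by giving it a partial prompt.
response = client.completions.create(
    model="Hermes-2-Pro-Llama-3-8B",
    prompt="The best joke I know is: ",
    max_tokens=100,
)

# Pretty-print the raw JSON response returned by the API.
print(json.dumps(response, sort_keys=True, indent=4))
```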
You can find out more about the available models in the docs.
The completions call should result in something similar to the following JSON output, which includes the completion.
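The exact fields vary by API version, but an OpenAI-style completions response generally has a shape like the following; the values here are placeholders, not real output:

```json
{
    "choices": [
        {
            "index": 0,
            "text": "<the model's completion of the joke>"
        }
    ],
    "created": 1715634049,
    "id": "cmpl-<generated id>",
    "object": "text_completion"
}
```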
Using the SDKs
You can also try these examples using the other official SDKs (Go, Rust, JS, and HTTP), which are linked in the Prediction Guard docs.