Accessing LLMs

(Run this example in Google Colab here)

Prompting is the process of providing a partial, usually text, input to a model. As we discussed in the last chapter, models will then use their parameterized data transformations to find a probable completion or output that matches the prompt.

To run any prompt through a model, we need to set a foundation for how we will access generative AI models and perform inference. There is a huge variety in the landscape of generative AI models in terms of size, access patterns, licensing, etc. However, a common theme is the usage of LLMs through a REST API, which is either:

  • Provided by a third party service (OpenAI, Anthropic, Cohere, etc.)
  • Self-hosted in your own infrastructure or in an account you control with a model hosting provider (Replicate, Baseten, etc.)
  • Self-hosted using a DIY model serving API (Flask, FastAPI, etc.)

We will use Prediction Guard to call open access LLMs (like Mistral, Llama3, Deepseek, etc.) via a standardized OpenAI-like API. This will allow us to explore the full range of LLMs available. Further, it will illustrate how companies can access a wide range of models (outside of the GPT family).

In order to “prompt” an LLM via Prediction Guard (and eventually engineer prompts), you can use any of the following SDKs: Python, Go, Rust, JS, and HTTP.

We will use Python to show an example:

You will need to install Prediction Guard into your Python environment.

copy
$$ pip install predictionguard

Now import PredictionGuard, setup your API Key, and create the client.

copy
1import os
2
3from predictionguard import PredictionGuard
4
5# Set your Prediction Guard token as an environmental variable.
6os.environ["PREDICTIONGUARD_API_KEY"] = "<api key>"
7
8client = PredictionGuard()

Accessing LLMs

Generating text with one of these models is then just single request for a “Completion” (note, we also support chat completions). Here we will call the Hermes-2-Pro-Llama-3-8B model and try to have it autocomplete a joke.

You can find out more about the available Models in the docs.

copy
1response = client.completions.create(model="Hermes-2-Pro-Llama-3-8B",
2 prompt="The best joke I know is: ")
3
4print(json.dumps(
5 response,
6 sort_keys=True,
7 indent=4,
8 separators=(',', ': ')
9))

The completions call should result in something similar to the following JSON output which includes the completion.

copy
1{
2 "choices": [
3 {
4 "index": 0,
5 "model": "Hermes-2-Pro-Llama-3-8B",
6 "status": "success",
7 "text": "2/1\n```\n2/1\n```\nIf you didn't understand, I'll explain it further. For a given denominator, 1/1 is the fraction that has the closest numerator to the greatest common multiple of the numerator and the denominator, because when reducing a fraction to its simplest terms, any common factors are canceled out, and the greatest common factor of the numerator and denominator is usually the best numerator, however in this case the numerator and denominator are 1 which have no"
8 }
9 ],
10 "created": 1720018377,
11 "id": "cmpl-7yX6KVwvUTPPqUM7H2Z4KNadDgEhI",
12 "object": "text_completion"
13}

Using The SDKs

You can also try these examples using the other official SDKs:

Python, Go, Rust, JS, HTTP