Using LLMs

Accessing LLMs

(Run this example in Google Colab here)

Prompting is the process of providing a partial, usually text, input to a model. As we discussed in the last chapter, models will then use their parameterized data transformations to find a probable completion or output that matches the prompt.

To run any prompt through a model, we need to set a foundation for how we will access generative AI models and perform inference. There is a huge variety in the landscape of generative AI models in terms of size, access patterns, licensing, etc. However, a common theme is the usage of LLMs through a REST API, which is either:

  • Provided by a third party service (OpenAI, Anthropic, Cohere, etc.)
  • Self-hosted in your own infrastructure or in an account you control with a model hosting provider (Replicate, Baseten, etc.)
  • Self-hosted using a DIY model serving API (Flask, FastAPI, etc.)

We will use Prediction Guard to call open access LLMs (like Mistral, Llama 2, WizardCoder, etc.) via a standardized OpenAI-like API. This will allow us to explore the full range of LLMs available. Further, it will illustrate how companies can access a wide range of models (outside of the GPT family).
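To make the "OpenAI-like REST API" idea concrete, here is a minimal sketch of how a completion request to such an API is typically assembled. The base URL `https://api.example.com`, the `/completions` path, and the Bearer-token header are assumptions borrowed from the general OpenAI convention, not details confirmed by this guide. The request is only built here, not sent.

```python
import json
import urllib.request


def build_completion_request(base_url, api_key, model, prompt):
    """Build (but do not send) an OpenAI-style completion request.

    The "/completions" path and Bearer auth scheme are assumed conventions;
    check your provider's API reference for the real values.
    """
    payload = {"model": model, "prompt": prompt}
    return urllib.request.Request(
        url=f"{base_url}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


# Hypothetical host and key, for illustration only.
request = build_completion_request(
    "https://api.example.com", "test-key", "Neural-Chat-7B", "The best joke I know is: "
)
print(request.get_method(), request.full_url)
```

Whichever of the three hosting options above you choose, the client-side shape of the call stays roughly the same: a JSON body naming a model and a prompt, sent with a POST request.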

To “prompt” an LLM via Prediction Guard (and eventually engineer prompts), you can use any of the official SDKs (Python, Go, Rust, or JS) or call the HTTP API directly.

We will use Python for the examples here. First, install Prediction Guard into your Python environment:

$ pip install predictionguard

Now import PredictionGuard, set up your API key, and create the client.

import os

from predictionguard import PredictionGuard

# Set your Prediction Guard API key as an environment variable.
os.environ["PREDICTIONGUARD_API_KEY"] = "<api key>"

client = PredictionGuard()
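In practice you may prefer to export the key in your shell rather than hard-code it in a script. The following is a minimal sketch of a guard that fails early with a clear message when the variable is missing or still holds the placeholder. The helper name `require_env` is hypothetical and is not part of the Prediction Guard SDK.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise if it is unset.

    Hypothetical helper for illustration; not part of the Prediction Guard SDK.
    """
    value = os.environ.get(name, "").strip()
    if not value or value == "<api key>":
        raise RuntimeError(f"{name} is not set; export it before creating the client")
    return value
```

You could call `require_env("PREDICTIONGUARD_API_KEY")` just before `PredictionGuard()` so that a missing key surfaces immediately instead of as a failed request later.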

Generating text with one of these models is then just a single request for a “Completion” (note that we also support chat completions). Here we will call the Neural-Chat-7B model and try to have it autocomplete a joke.

You can find out more about the available Models in the docs.

import json

response = client.completions.create(
    model="Neural-Chat-7B",
    prompt="The best joke I know is: "
)

# Pretty-print the full response.
print(json.dumps(
    response,
    sort_keys=True,
    indent=4,
    separators=(',', ': ')
))

The completions call should return JSON similar to the following, which includes the completion text.

{
    "id":"cmpl-hUb28aOve3iF5lLlwkai6YmzZQer6",
    "object":"text_completion",
    "created":1717692267,
    "choices":[
        {
            "text":"\n\nA man walks into a bar and says to the bartender, \"If I show you something really weird, will you give me a free drink?\" The bartender, being intrigued, says, \"Sure, I'll give it a look.\" The man reaches into his pocket and pulls out a tiny horse. The bartender is astonished and gives the man a free drink. The man then puts the horse back into his pocket.\n\nThe next day, the same man walks back into the bar and says to the bartender, \"If I show you something even weirder than yesterday and you give me a free drink, will you do it again?\" The bartender, somewhat reluctantly, says, \"Okay, I guess you can show it to me.\" The man reaches into his pocket, pulls out the same tiny horse, and opens the door to reveal the entire bar inside the horse.\n\nThe bartender faints.",
            "index":0,
            "status":"success",
            "model":"Neural-Chat-7B"
        }
    ]
}
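Usually you want only the generated text rather than the whole payload. Based on the response shape shown above, the text lives under `choices[0]["text"]`. Here is a minimal sketch using a shortened copy of that response. The helper name `first_completion_text` is hypothetical, and the sketch assumes the response behaves like a plain Python dict.

```python
# Shortened copy of the response shown above (the joke text is truncated here).
sample_response = {
    "id": "cmpl-hUb28aOve3iF5lLlwkai6YmzZQer6",
    "object": "text_completion",
    "created": 1717692267,
    "choices": [
        {
            "text": "\n\nA man walks into a bar and says to the bartender, ...",
            "index": 0,
            "status": "success",
            "model": "Neural-Chat-7B",
        }
    ],
}


def first_completion_text(response):
    """Return the text of the first choice, with surrounding whitespace removed."""
    choices = response.get("choices", [])
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["text"].strip()


print(first_completion_text(sample_response))
```

Stripping the text removes the leading newlines that the model emitted before the joke.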

Using The SDKs

You can also try these examples using the other official SDKs:

Python, Go, Rust, JS, HTTP