Using LLMs


The Streaming API sends a response incrementally while it is still being generated. With the stream option enabled, your application can start processing each part of the response as soon as it arrives instead of waiting for the complete response. This is especially useful for applications that need partial output immediately, such as chat interfaces.

Immediate Access: Receive parts of the data as they are generated, which can be useful for displaying real-time results or processing large volumes of data.

Efficiency: Improve the responsiveness of applications by handling data as it arrives, which can be particularly beneficial in time-sensitive scenarios.

We will use Python to show an example:

Dependencies and Imports

You will need to install Prediction Guard into your Python environment.

$ pip install predictionguard

Now import PredictionGuard, set up your API key, and create the client.

import os

from predictionguard import PredictionGuard

# Set your Prediction Guard token as an environment variable.
os.environ["PREDICTIONGUARD_API_KEY"] = "<api key>"

client = PredictionGuard()
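The snippet above relies on the PREDICTIONGUARD_API_KEY environment variable being set before the client is created. As a minimal sketch, a small check can fail early with a readable message when the key is missing or still holds the placeholder. The helper name require_api_key is hypothetical and is not part of the Prediction Guard SDK.

```python
import os

# Variable name taken from the setup snippet above.
API_KEY_VAR = "PREDICTIONGUARD_API_KEY"


def require_api_key(env=None):
    """Return the API key from the environment, or raise a clear error.

    Hypothetical helper for illustration; not part of the SDK.
    """
    env = os.environ if env is None else env
    key = env.get(API_KEY_VAR, "").strip()
    # Treat an empty value or the documentation placeholder as "not set".
    if not key or key == "<api key>":
        raise RuntimeError(f"Set {API_KEY_VAR} before creating the client.")
    return key
```

Calling require_api_key() just before PredictionGuard() surfaces a configuration problem at startup rather than on the first request.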

How To Use The Streaming API

To use the streaming capability, set the stream parameter to True in your API request. Below is an example using the Neural-Chat-7B model:

messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant that provides clever and sometimes funny responses."
    },
    {
        "role": "user",
        "content": "What's up!"
    },
    {
        "role": "assistant",
        "content": "Well, technically vertically out from the center of the earth."
    },
    {
        "role": "user",
        "content": "Haha. Good one."
    }
]

# With stream=True the call yields response chunks as they are generated.
for res in client.chat.completions.create(
    model="Neural-Chat-7B",
    messages=messages,
    max_tokens=500,
    temperature=0.1,
    stream=True
):
    # Use 'end' parameter in print function to avoid new lines.
    print(res["data"]["choices"][0]["delta"]["content"], end='')
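Applications often need the complete reply as well as the live output, for example to append it to the messages list for the next turn. The sketch below collects streamed fragments into one string while printing them as they arrive. It assumes each chunk has the shape used in the example above (res["data"]["choices"][0]["delta"]["content"]). The helpers iter_text, collect_stream, and fake_stream are hypothetical names, fake_stream stands in for the real client call so the sketch runs offline, and the handling of a chunk without "content" is an assumption rather than documented behavior.

```python
from typing import Any, Dict, Iterable, Iterator


def iter_text(chunks: Iterable[Dict[str, Any]]) -> Iterator[str]:
    """Yield the text fragment carried by each streamed chunk."""
    for chunk in chunks:
        delta = chunk["data"]["choices"][0]["delta"]
        # Assumption: a chunk may omit "content", so skip empty fragments.
        text = delta.get("content") or ""
        if text:
            yield text


def collect_stream(chunks: Iterable[Dict[str, Any]], show: bool = True) -> str:
    """Print fragments as they arrive and return the full reply."""
    parts = []
    for text in iter_text(chunks):
        if show:
            # flush=True makes each fragment visible immediately.
            print(text, end="", flush=True)
        parts.append(text)
    if show:
        print()
    return "".join(parts)


def fake_stream() -> Iterator[Dict[str, Any]]:
    """Hypothetical stand-in for a streaming API call."""
    for piece in ["Glad ", "you ", "liked ", "it!"]:
        yield {"data": {"choices": [{"delta": {"content": piece}}]}}
    # A trailing chunk with no content, to exercise the fallback.
    yield {"data": {"choices": [{"delta": {}}]}}


reply = collect_stream(fake_stream())
```

With a real client, the generator returned by the streaming request would be passed to collect_stream in place of fake_stream().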

Using The SDKs

You can also try these examples using the other official SDKs:

Python, Go, Rust, JS, HTTP