Chaining and Retrieval
(Run this example in Google Colab here)
We’ve actually already seen how it can be useful to “chain” various LLM operations together (see other notebooks under Using LLMs). In the Hinglish chat example, we chained together a response generation and a machine translation using LLMs.
As you solve problems with LLMs, do NOT always think about your task as a single prompt. Decompose your problem into multiple steps. Just as programming uses multiple functions, classes, etc., an LLM integration gives you a new kind of reasoning engine that you can “program” in a multi-step, conditional, control-flow fashion.
Further, enterprise LLM applications need reliability, trust, and consistency. Because LLMs only predict probable text, they have no understanding of or connection to reality. This produces hallucinations that can be part of a coherent text block but factually (or otherwise) wrong. To deal with this, we need to ground our LLM operations with external data.
We will use Python to show an example:
Dependencies and Imports
You will need to install Prediction Guard, LangChain, LanceDB, and a few more dependencies in your Python environment.
Now import PredictionGuard and the other dependencies, set up your API key, and create the client.
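A minimal setup sketch, assuming the current predictionguard Python SDK, an API key in the PREDICTIONGUARD_API_KEY environment variable, and a placeholder model name (adjust all of these to match your environment; LangChain import paths can also vary by version):

```python
# Requires: pip install predictionguard langchain lancedb sentence-transformers pandas
import os

import lancedb
from langchain.prompts import PromptTemplate
from predictionguard import PredictionGuard
from sentence_transformers import SentenceTransformer

# The client reads the API key from the environment (assumed variable name).
os.environ["PREDICTIONGUARD_API_KEY"] = "<your Prediction Guard API key>"
client = PredictionGuard()

# Placeholder model name -- substitute any chat model listed in the Prediction Guard docs.
MODEL = "Hermes-2-Pro-Llama-3-8B"
```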
Chaining
Let’s say that we are trying to create a response to a user and we want our LLM to follow a variety of rules. We could try to encode all of these instructions into a single prompt. However, as we accumulate more and more instructions, the prompt becomes harder and harder for the LLM to follow. Think about an LLM like a child or a high school intern. We want to make things as clear and easy as possible, and complicated instructions don’t do that.
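To illustrate, a single “do everything” prompt might look like the sketch below. The rules, the prompt wording, and the dict-style indexing of the response are assumptions, not the canonical example:

```python
# One prompt that tries to carry many rules at once -- hard for the model to follow.
complicated_prompt = """Respond to the user's message.
- If the user asks for code, generate a code snippet and nothing else.
- If the user asks a factual question, answer only from the provided context.
- If the context does not contain the answer, say you don't know.
- Otherwise, respond with friendly chit-chat.

Context: {context}

Message: {message}

Response: """

result = client.chat.completions.create(
    model=MODEL,
    messages=[{
        "role": "user",
        "content": complicated_prompt.format(
            context="Domino's gift cards are available in any amount between $5 and $100.",
            message="How are you doing today?",
        ),
    }],
)
print(result["choices"][0]["message"]["content"])
```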
When we run this, at least sometimes, we get bad output because of the complicated instructions:
Rather than trying to handle everything in one call to the LLM, let’s decompose our logic into multiple calls that are each simple. We will also add in some non-LLM logic. The chain of processing is as follows (with a code sketch after the list):
- Prompt 1 - Determine if the message is a request for code generation.
- Prompt 2 - A Q&A prompt to answer based on informational context.
- Prompt 3 - A general chat template for when there isn’t an informational question being asked.
- Prompt 4 - A code generation prompt.
- Question detector - A non-LLM-based detection of whether an input is a question or not.
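A sketch of these pieces is below. The prompt wording, the model name, and the rule-based question detector are all assumptions (the detector could just as well be a small classifier); only the overall routing pattern matters:

```python
# Prompt 1: detect whether the message is a code generation request.
code_detect_template = PromptTemplate(
    input_variables=["query"],
    template="Answer yes or no. Is the following message asking for code to be generated?\n\nMessage: {query}\n\nAnswer: ",
)

# Prompt 2: Q&A grounded in supplied informational context.
qa_template = PromptTemplate(
    input_variables=["context", "query"],
    template="Answer the question using only the context below. If the answer is not in the context, say \"I don't know\".\n\nContext: {context}\n\nQuestion: {query}\n\nAnswer: ",
)

# Prompt 3: general chat when no informational question is asked.
chat_template = PromptTemplate(
    input_variables=["query"],
    template="You are a friendly assistant. Respond conversationally to the message below.\n\nMessage: {query}\n\nResponse: ",
)

# Prompt 4: code generation.
code_template = PromptTemplate(
    input_variables=["query"],
    template="Write code that satisfies the following request. Return only the code.\n\nRequest: {query}\n\nCode: ",
)

# Non-LLM question detector (a simplified rule-based stand-in).
def is_question(text: str) -> bool:
    starters = ("who", "what", "when", "where", "why", "how", "is", "are", "can", "does", "do")
    text = text.strip().lower()
    return text.endswith("?") or text.startswith(starters)

def response_chain(query: str, context: str = "") -> str:
    """Route a user message through the appropriate prompt and return a response."""
    # Step 1: ask the LLM whether this is a code generation request.
    detect = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": code_detect_template.format(query=query)}],
    )
    wants_code = detect["choices"][0]["message"]["content"].strip().lower()

    # Step 2: pick the right prompt based on the detection results.
    if wants_code.startswith("yes"):
        prompt = code_template.format(query=query)
    elif is_question(query) and context:
        prompt = qa_template.format(context=context, query=query)
    else:
        prompt = chat_template.format(query=query)

    # Step 3: generate the final response.
    result = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return result["choices"][0]["message"]["content"]
```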
Now we can supply the relevant context and options to our response chain and see what we get back:
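For example, with a placeholder context string standing in for the gift card text used earlier:

```python
context = (
    "Domino's gift cards are great for any person and any occasion. "
    "They are available in any amount between $5 and $100."
)  # placeholder context text

print(response_chain("How much can I put on a gift card?", context=context))
print(response_chain("Write a Python function that reverses a string."))
```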
This should respond with something similar to:
External Knowledge In Prompts, Grounding
We’ve actually already seen external knowledge within our prompts. In the question and answer example, the context that we pasted in was a copy of phrasing on the Domino’s website. This “grounds” the prompt with external knowledge that is current and factual.
The answer returned from this prompting is grounded in the external knowledge we inserted, so we aren’t relying on the LLM to come up with the answer on its own, based only on its training data.
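To make this concrete, here is a minimal sketch of a grounded call, reusing the Q&A template from above (the context snippet is a placeholder for the real website text):

```python
grounded_prompt = qa_template.format(
    context="Domino's gift cards are available in any amount between $5 and $100.",  # placeholder external text
    query="What is the maximum amount I can put on a gift card?",
)

result = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": grounded_prompt}],
)
print(result["choices"][0]["message"]["content"])
```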
Retrieval Augmented Generation (RAG)
Retrieval-augmented generation (RAG) is an innovative approach that merges the capabilities of large-scale retrieval systems with sequence-to-sequence models to enhance their performance in generating detailed and contextually relevant responses. Instead of relying solely on the knowledge contained within the model’s parameters, RAG allows the model to dynamically retrieve and integrate information from an external database or a set of documents during the generation process. By doing so, it provides a bridge between the vast knowledge stored in external sources and the powerful generation abilities of neural models, enabling more informed, diverse, and context-aware outputs in tasks like question answering, dialogue systems, and more.
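To follow along, you need some reference text to retrieve from. The sketch below pulls the Wikipedia article on the Linux kernel and strips the HTML very roughly; the URL and the cleanup approach are assumptions, and any sizable document would work just as well:

```python
import re
import urllib.request

# Assumed source URL -- substitute any page or text file you want to retrieve from.
url = "https://en.wikipedia.org/wiki/Linux_kernel"
request = urllib.request.Request(url, headers={"User-Agent": "rag-example"})

with urllib.request.urlopen(request) as response:
    raw_html = response.read().decode("utf-8")

# Very rough HTML-to-text cleanup, just for illustration.
text = re.sub(r"<[^>]+>", " ", raw_html)
text = re.sub(r"\s+", " ", text).strip()

print(text[:500])
```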
This is the text that we will be referencing in our RAG system. I mean, who doesn’t want to know more about the Linux kernel? The above code should print out something like the following, which is the text on that website:
Let’s clean things up a bit and split it into smaller chunks (that will fit into our LLM prompts):
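One way to split the text is with LangChain’s `CharacterTextSplitter` (the chunk size and overlap below are arbitrary choices, and the import path may differ across LangChain versions):

```python
from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(separator=" ", chunk_size=700, chunk_overlap=50)
chunks = splitter.split_text(text)

print(len(chunks), "chunks")
print(chunks[0])
```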
Our reference “chunks” for retrieval look like the following:
We will now do a bit more cleanup and “embed” these chunks to store them in a Vector Database.
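A sketch of the embedding and storage step, using a local sentence-transformers model and a local LanceDB table (the embedding model, table name, and storage path are assumptions; Prediction Guard’s hosted embeddings could be used instead):

```python
# Embed each chunk locally (assumed embedding model).
embedder = SentenceTransformer("all-MiniLM-L6-v2")

records = [
    {"text": chunk, "vector": embedder.encode(chunk).tolist()}
    for chunk in chunks
]

# Store the embedded chunks in a local LanceDB table.
db = lancedb.connect(".lancedb")
table = db.create_table("linux_kernel", data=records, mode="overwrite")
```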
We now have:
- Downloaded our reference data (for eventual retrieval)
- Split that reference data into appropriately sized chunks for injection into our prompts
- Embedded those chunks (such that we have a vector that can be used for matching)
- Stored the vectors into the Vector Database (LanceDB in this case)
We can now try matching to text chunks in the database:
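For example, embedding a user query with the same model and searching the table:

```python
query = "What is the Linux kernel?"
query_vector = embedder.encode(query).tolist()

# Return the three closest chunks, ranked by vector distance.
results = table.search(query_vector).limit(3).to_pandas()
print(results[["text", "_distance"]])
```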
This will give a dataframe with a ranking of relevant text chunks by a “distance” metric. The lower the distance, the more semantically relevant the chunk is to the user query.
Now we can create a function that will return an answer to a user query based on the RAG methodology:
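A sketch of such a function, reusing the retrieval table and Q&A template from above (prompt wording and model name remain assumptions):

```python
def rag_answer(query: str, k: int = 3) -> str:
    """Retrieve the most relevant chunks and answer the query grounded in them."""
    # Retrieve the top-k most relevant chunks from LanceDB.
    query_vector = embedder.encode(query).tolist()
    results = table.search(query_vector).limit(k).to_pandas()
    retrieved_context = "\n\n".join(results["text"].tolist())

    # Ground the LLM's answer in the retrieved context.
    prompt = qa_template.format(context=retrieved_context, query=query)
    result = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return result["choices"][0]["message"]["content"]

print(rag_answer("What is the Linux kernel?"))
```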
This will return something similar to:
Using The SDKs
You can also try these examples using the other official SDKs: