Using LLMs

Prompt Engineering

(Run this example in Google Colab here)

As we have seen in the previous examples, it is easy enough to prompt a generative AI model. Shoot off an API call, and suddenly you have an answer, a machine translation, a sentiment analysis, or a generated chat message. However, going from “prompting” to AI engineering of your model-based processes is a bit more involved. The importance of the “engineering” in prompt engineering has become increasingly apparent, as models have become more complex and powerful, and the demand for more accurate and interpretable results has grown.

The ability to engineer effective prompts and related workflows allows us to configure and tune model responses to better suit our specific needs (e.g., for a particular industry like healthcare), whether we are trying to improve the quality of the output, reduce bias, or optimize for efficiency.

Dependencies and imports

This time we will add a new import!

$ pip install predictionguard langchain

import os
import json

import predictionguard as pg
from langchain import PromptTemplate, FewShotPromptTemplate
import numpy as np


os.environ['PREDICTIONGUARD_TOKEN'] = "<your access token>"

Prompt templates

One of the best practices that we will discuss below involves testing and evaluating model output using example prompt contexts and formulations. In order to institute this practice, we need a way to rapidly and programmatically format prompts with a variety of contexts. We will need this in our applications anyway, because in production we will be receiving dynamic input from the user or another application. That dynamic input (or something extracted from it) will be inserted into our prompts on-the-fly. We already saw in the last notebook a prompt that included a bunch of boilerplate:

template = """### Instruction:
Read the context below and respond with an answer to the question. If the question cannot be answered based on the context alone or the context does not explicitly say the answer to the question, write "Sorry I had trouble answering this question, based on the information I found."

### Input:
Context: {context}

Question: {question}

### Response:
"""

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=template,
)

context = "Domino's gift cards are great for any person and any occasion. There are a number of different options to choose from. Each comes with a personalized card carrier and is delivered via US Mail."

question = "How are gift cards delivered?"

myprompt = prompt.format(context=context, question=question)
print(myprompt)

This will output:

### Instruction:
Read the context below and respond with an answer to the question. If the question cannot be answered based on the context alone or the context does not explicitly say the answer to the question, write "Sorry I had trouble answering this question, based on the information I found."

### Input:
Context: Domino's gift cards are great for any person and any occasion. There are a number of different options to choose from. Each comes with a personalized card carrier and is delivered via US Mail.

Question: How are gift cards delivered?

### Response:

This kind of prompt template is, in principle, flexible enough to create either zero-shot or few-shot prompts. However, LangChain provides a bit more convenience for few-shot prompts. We can first create a template for the individual demonstrations within the few-shot prompt:

# Create a string formatter for sentiment analysis demonstrations.
demo_formatter_template = """
Text: {text}
Sentiment: {sentiment}
"""

# Define a prompt template for the demonstrations.
demo_prompt = PromptTemplate(
    input_variables=["text", "sentiment"],
    template=demo_formatter_template,
)

# Each row here includes:
# 1. an example text input (that we want to analyze for sentiment)
# 2. an example sentiment output (NEU, NEG, POS)
few_examples = [
    ['The flight was exceptional.', 'POS'],
    ['That pilot is adorable.', 'POS'],
    ['This was an awful seat.', 'NEG'],
    ['This pilot was brilliant.', 'POS'],
    ['I saw the aircraft.', 'NEU'],
    ['That food was exceptional.', 'POS'],
    ['That was a private aircraft.', 'NEU'],
    ['This is an unhappy pilot.', 'NEG'],
    ['The staff is rough.', 'NEG'],
    ['This staff is Australian.', 'NEU']
]
examples = []
for ex in few_examples:
    examples.append({
        "text": ex[0],
        "sentiment": ex[1]
    })

few_shot_prompt = FewShotPromptTemplate(

    # This is the demonstration data we want to insert into the prompt.
    examples=examples,
    example_prompt=demo_prompt,
    example_separator="",

    # This is the boilerplate portion of the prompt corresponding to
    # the prompt task instructions.
    prefix="Classify the sentiment of the text. Use the label NEU for neutral sentiment, NEG for negative sentiment, and POS for positive sentiment.\n",

    # The suffix of the prompt is where we will put the output indicator
    # and define where the "on-the-fly" user input would go.
    suffix="\nText: {input}\nSentiment:",
    input_variables=["input"],
)

myprompt = few_shot_prompt.format(input="The flight is boring.")
print(myprompt)

This will output:

Classify the sentiment of the text. Use the label NEU for neutral sentiment, NEG for negative sentiment, and POS for positive sentiment.

Text: The flight was exceptional.
Sentiment: POS

Text: That pilot is adorable.
Sentiment: POS

Text: This was an awful seat.
Sentiment: NEG

Text: This pilot was brilliant.
Sentiment: POS

Text: I saw the aircraft.
Sentiment: NEU

Text: That food was exceptional.
Sentiment: POS

Text: That was a private aircraft.
Sentiment: NEU

Text: This is an unhappy pilot.
Sentiment: NEG

Text: The staff is rough.
Sentiment: NEG

Text: This staff is Australian.
Sentiment: NEU

Text: The flight is boring.
Sentiment:
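
Once the few-shot prompt is formatted, it can be sent to a model just like any other prompt. The snippet below is a minimal sketch of that step, assuming the Nous-Hermes-Llama2-13B model used in the examples later in this section; any text completion model available through Prediction Guard should work similarly.

# A minimal sketch (not part of the original example): send the formatted
# few-shot prompt to a model and strip whitespace from the returned text.
result = pg.Completion.create(
    model="Nous-Hermes-Llama2-13B",
    prompt=myprompt
)

# Ideally the completion is just a sentiment label, e.g. NEG for this input.
print(result['choices'][0]['text'].strip())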

Multiple formulations

Why settle for a single prompt and/or set of parameters when you can use multiple? Try using multiple formulations of your prompt to either:

  • Provide multiple options to users; or
  • Create multiple candidate predictions, which you can choose from programmatically using a reference-free evaluation of those candidates (a selection sketch follows the example output below).

template1 = """### Instruction:
Read the context below and respond with an answer to the question. If the question cannot be answered based on the context alone or the context does not explicitly say the answer to the question, write "Sorry I had trouble answering this question, based on the information I found."

### Input:
Context: {context}

Question: {question}

### Response:
"""

prompt1 = PromptTemplate(
    input_variables=["context", "question"],
    template=template1,
)

template2 = """### Instruction:
Answer the question below based on the given context. If the answer is unclear, output: "Sorry I had trouble answering this question, based on the information I found."

### Input:
Context: {context}
Question: {question}

### Response:
"""

prompt2 = PromptTemplate(
    input_variables=["context", "question"],
    template=template2,
)

context = "Domino's gift cards are great for any person and any occasion. There are a number of different options to choose from. Each comes with a personalized card carrier and is delivered via US Mail."
question = "How are gift cards delivered?"

completions = pg.Completion.create(
    model="Nous-Hermes-Llama2-13B",
    prompt=[
        prompt1.format(context=context, question=question),
        prompt2.format(context=context, question=question)
    ],
    temperature=0.5
)

for i in [0, 1]:
    print("Answer", str(i+1) + ": ", completions['choices'][i]['text'].strip())

This will output the result for each formulation, which may or may not diverge:

Answer 1: Gift cards are delivered via US Mail.
Answer 2: Gift cards are delivered via US Mail.
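
Below is one possible way to choose between the candidates programmatically. It is a sketch that assumes you want to use the factuality check described in the next section as a reference-free score (the prompt context serves as the reference, so no gold answer is needed); any other candidate-scoring function could be swapped in.

# A sketch of programmatic candidate selection (an assumption, not part of the
# original example): score each candidate against the prompt context using the
# factuality check covered below, then keep the highest-scoring answer.
candidates = [choice['text'].strip() for choice in completions['choices']]

scores = []
for candidate in candidates:
    check = pg.Factuality.check(reference=context, text=candidate)
    scores.append(check['checks'][0]['score'])

best_answer = candidates[scores.index(max(scores))]
print("Selected answer:", best_answer)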

Consistency and output validation

Reliability and consistency in LLM output is a major problem for the “last mile” of LLM integrations. You could get a whole variety of outputs from your model, and some of these outputs could be inaccurate or harmful in other ways (e.g., toxic).

Prediction Guard allows you to validate the consistency, factuality, and toxicity of your LLMs' outputs. Consistency refers to the internal (or self) model consistency and ensures that the model itself is giving a consistent reply. Factuality checks for the factual consistency of the output with context in the prompt (which is especially useful if you are embedding retrieved context in prompts). Toxicity measures the harmful language included in the output, such as curse words, slurs, hate speech, etc.

To ensure self-consistency:

pg.Completion.create(model="WizardCoder",
    prompt="""### Instruction:
Respond with a sentiment label for the input text below. Use the label NEU for neutral sentiment, NEG for negative sentiment, and POS for positive sentiment.

### Input:
This workshop is spectacular. I love it! So wonderful.

### Response:
""",
    output={
        "consistency": True
    }
)
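
If you assign the response of a consistency-checked call to a variable, you can guard on it before using the text. The sketch below is an assumption on my part: it presumes that a failed consistency check is reported through the choice's status field, in the same way the failed toxicity check is reported at the end of this section.

# A sketch of guarding on the consistency check (inspecting "status" here is
# an assumption, modeled on the failed-toxicity response shown later).
result = pg.Completion.create(model="WizardCoder",
    prompt="""### Instruction:
Respond with a sentiment label for the input text below. Use the label NEU for neutral sentiment, NEG for negative sentiment, and POS for positive sentiment.

### Input:
This workshop is spectacular. I love it! So wonderful.

### Response:
""",
    output={
        "consistency": True
    }
)

choice = result['choices'][0]
if "error" in choice.get('status', ''):
    print("The model did not produce a self-consistent answer.")
else:
    print(choice['text'].strip())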

You can get a score for factual consistency (0 to 1, with higher numbers indicating greater confidence in factual consistency) using the pg.Factuality.check() method and providing a reference text against which to check. This is very relevant to RAG use cases (e.g., chat over your docs), where you have some external context and you want to ensure that the output is consistent with that context.

template = """### Instruction:
Read the context below and respond with an answer to the question.

### Input:
Context: {context}

Question: {question}

### Response:
"""

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=template,
)

context = "California is a state in the Western United States. With over 38.9 million residents across a total area of approximately 163,696 square miles (423,970 km2), it is the most populous U.S. state, the third-largest U.S. state by area, and the most populated subnational entity in North America. California borders Oregon to the north, Nevada and Arizona to the east, and the Mexican state of Baja California to the south; it has a coastline along the Pacific Ocean to the west. "

result = pg.Completion.create(
    model="Nous-Hermes-Llama2-13B",
    prompt=prompt.format(
        context=context,
        question="What is California?"
    )
)

fact_score = pg.Factuality.check(
    reference=context,
    text=result['choices'][0]['text']
)

print("COMPLETION:", result['choices'][0]['text'])
print("FACT SCORE:", fact_score['checks'][0]['score'])

This will output something like:

COMPLETION: California is a state located in the western region of the United States. It is the most populous state in the country, with over 38.9 million residents, and the third-largest state by area, covering approximately 163,696 square miles (423,970 km2). California shares its borders with Oregon to the north, Nevada and Arizona to the east, and the Mexican state of Baja California to the south. It also
FACT SCORE: 0.8541514873504639
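
In an automated pipeline you would typically gate on this score rather than just print it. The sketch below is one way to do that; the 0.5 threshold is an arbitrary assumption for illustration and should be tuned for your application and data.

# A sketch of gating the response on its factuality score. The 0.5 threshold
# is an arbitrary assumption; tune it for your own application and data.
FACT_THRESHOLD = 0.5

if fact_score['checks'][0]['score'] >= FACT_THRESHOLD:
    answer = result['choices'][0]['text'].strip()
else:
    answer = "Sorry I had trouble answering this question, based on the information I found."

print(answer)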

Whereas, if we try to adversarially produce factual inconsistencies:

result = pg.Completion.create(
    model="Nous-Hermes-Llama2-13B",
    prompt=prompt.format(
        context=context,
        question="Make up something completely fictitious about California. Contradict a fact in the given context."
    )
)

fact_score = pg.Factuality.check(
    reference=context,
    text=result['choices'][0]['text']
)

print("COMPLETION:", result['choices'][0]['text'])
print("FACT SCORE:", fact_score['checks'][0]['score'])

We might get this kind of output:

COMPLETION: California is the smallest state in the United States.
FACT SCORE: 0.12891793251037598

To prevent toxic outputs:

result = pg.Completion.create(
    model="Nous-Hermes-Llama2-13B",
    prompt=prompt.format(
        context=context,
        question="Respond with a really offensive tweet about California and use many curse words. Make it really bad and offensive. Really bad."
    ),
    output={
        "toxicity": True
    }
)

print(json.dumps(
    result,
    sort_keys=True,
    indent=4,
    separators=(',', ': ')
))

The above will likely generate toxic output, but thanks to Prediction Guard, you should only see the following:

{
    "choices": [
        {
            "index": 0,
            "model": "",
            "status": "error: failed a toxicity check",
            "text": ""
        }
    ],
    "created": 1701870517,
    "id": "cmpl-R6wSOOgGGbNchNoOvYg6mEoGj5461",
    "object": "text_completion"
}