Prompt Engineering
(Run this example in Google Colab here)
As we have seen in the previous examples, it is easy enough to prompt a generative AI model. Fire off an API call, and suddenly you have an answer, a machine translation, a sentiment analysis, or a generated chat message. However, going from “prompting” to AI engineering of your model-based processes is a bit more involved. The importance of the “engineering” in prompt engineering has become increasingly apparent as models have become more complex and powerful, and as the demand for more accurate and interpretable results has grown.
The ability to engineer effective prompts and related workflows allows us to configure and tune model responses to better suit our specific needs (e.g., for a particular industry like healthcare), whether we are trying to improve the quality of the output, reduce bias, or optimize for efficiency.
We will use Python to show an example:
Dependencies and Imports
You will need to install Prediction Guard and LangChain into your Python environment.
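Assuming you are installing from PyPI (package names can change between releases), something like the following should work:

```
pip install predictionguard langchain
```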
Now import PromptTemplate, FewShotPromptTemplate, and PredictionGuard, set up your API key, and create the client.
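A minimal sketch of that setup follows. The import paths and the environment variable name are assumptions based on recent LangChain and Prediction Guard SDK versions; adjust them to match the versions you have installed.

```python
import os

# LangChain prompt utilities (import paths can differ slightly across versions).
from langchain.prompts import PromptTemplate, FewShotPromptTemplate

# Prediction Guard client.
from predictionguard import PredictionGuard

# Set your API key; the client can also pick this up from the environment
# if you don't pass it explicitly.
os.environ["PREDICTIONGUARD_API_KEY"] = "<your Prediction Guard API key>"

client = PredictionGuard()
```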
Prompt Templates
One of the best practices that we will discuss below involves testing and evaluating model output using example prompt contexts and formulations. In order to institute this practice, we need a way to rapidly and programmatically format prompts with a variety of contexts. We will need this in our applications anyway, because in production we will be receiving dynamic input from the user or another application. That dynamic input (or something extracted from it) will be inserted into our prompts on-the-fly. We already saw in the last notebook a prompt that included a bunch of boilerplate:
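The exact prompt from the previous notebook is not reproduced here, but a template along these lines (the instruction text, variables, and example values are illustrative) captures the idea:

```python
# A prompt template with instruction/input/response boilerplate around
# two dynamic variables, {context} and {question}.
template = """### Instruction:
Read the context below and answer the question.

### Input:
Context: {context}

Question: {question}

### Response:
"""

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=template,
)

# Fill in the template with dynamic values on-the-fly.
print(prompt.format(
    context="Prediction Guard provides output validation for LLMs.",
    question="What does Prediction Guard provide?",
))
```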
This will output:
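With the illustrative context and question used above (your values will differ), the formatted prompt looks like this:

```
### Instruction:
Read the context below and answer the question.

### Input:
Context: Prediction Guard provides output validation for LLMs.

Question: What does Prediction Guard provide?

### Response:
```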
This kind of prompt template could, in theory, be flexible enough to create zero-shot or few-shot prompts. However, LangChain provides a bit more convenience for few-shot prompts. We can first create a template for the individual demonstrations within the few-shot prompt:
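A sketch of that pattern follows; the sentiment-classification demonstrations are made up purely for illustration.

```python
# Template for a single demonstration (one input/output pair).
example_template = """Tweet: {tweet}
Sentiment: {sentiment}"""

example_prompt = PromptTemplate(
    input_variables=["tweet", "sentiment"],
    template=example_template,
)

# A few illustrative demonstrations.
examples = [
    {"tweet": "I love this product!", "sentiment": "positive"},
    {"tweet": "This is the worst service I have ever used.", "sentiment": "negative"},
]

# Assemble the full few-shot prompt around the demonstrations.
few_shot_prompt = FewShotPromptTemplate(
    examples=examples,
    example_prompt=example_prompt,
    prefix="Classify the sentiment of each tweet.",
    suffix="Tweet: {input}\nSentiment:",
    input_variables=["input"],
)

print(few_shot_prompt.format(input="The weather today is fantastic."))
```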
This will output:
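With the illustrative demonstrations above, the assembled few-shot prompt looks like this:

```
Classify the sentiment of each tweet.

Tweet: I love this product!
Sentiment: positive

Tweet: This is the worst service I have ever used.
Sentiment: negative

Tweet: The weather today is fantastic.
Sentiment:
```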
Multiple Formulations
Why settle for a single prompt and/or set of parameters when you can use multiple? Try using multiple formulations of your prompt (as sketched after this list) to either:
- Provide multiple options to users; or
- Create multiple candidate predictions, which you can choose from programmatically using a reference-free evaluation of those candidates.
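Here is one way this might look. The formulations and input text are illustrative, and the model name and response shape (result["choices"][0]["text"]) are assumptions based on recent Prediction Guard SDK versions; use a model available to your account.

```python
# Several formulations of the same underlying task (illustrative wording).
formulations = [
    "Summarize the following text in one sentence: {text}",
    "Give a short, one sentence summary of this passage: {text}",
    "TL;DR (one sentence): {text}",
]

text = (
    "Prompt engineering is the practice of designing, testing, and refining "
    "the prompts and workflows around an LLM so that its outputs better fit "
    "a specific application."
)

for template in formulations:
    prompt = PromptTemplate(input_variables=["text"], template=template)
    result = client.completions.create(
        model="Hermes-3-Llama-3.1-8B",  # assumed model name
        prompt=prompt.format(text=text),
    )
    print(result["choices"][0]["text"].strip())
    print("----")
```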
This will output the result for each formulation, which may or may not diverge:
Consistency and Output Validation
Reliability and consistency in LLM output are a major problem for the “last mile” of LLM integrations. You could get a whole variety of outputs from your model, and some of these outputs could be inaccurate or harmful in other ways (e.g., toxic).
Prediction Guard allows you to validate the consistency, factuality, and toxicity of your LLMs’ outputs. Consistency refers to the internal (or self) consistency of the model and ensures that the model is giving a consistent reply. Factuality checks the factual consistency of the output against context in the prompt (which is especially useful if you are embedding retrieved context in prompts). Toxicity measures the harmful language included in the output, such as curse words, slurs, hate speech, etc.
To ensure self-consistency:
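A minimal sketch of what this might look like. The output validation parameter (output={"consistency": True}) is an assumption based on older Prediction Guard SDK versions; check the current SDK docs for the exact field names.

```python
# Ask for a completion and have Prediction Guard check that the model's
# reply is self-consistent before returning it. The `output` argument is
# an assumption based on older SDK versions.
result = client.completions.create(
    model="Hermes-3-Llama-3.1-8B",  # assumed model name
    prompt="Name the capital city of France and nothing else.",
    output={"consistency": True},
)

print(result["choices"][0]["text"].strip())
```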
You can get a score for factual consistency (from 0 to 1, with higher numbers being more confidently factually consistent) using the client.factuality.check() method and providing a reference text against which to check. This is very relevant to RAG (e.g., chat over your docs) use cases, where you have some external context and you want to ensure that the output is consistent with that context.
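A sketch, assuming client.factuality.check() takes a reference text and the text to check (the reference content, model name, and prompt below are illustrative):

```python
# Some reference context, e.g. a retrieved document chunk (illustrative).
reference = (
    "The Eiffel Tower is a wrought-iron lattice tower in Paris, France. "
    "It was completed in 1889 and stands about 330 metres tall."
)

# Generate an answer grounded in that context.
response = client.completions.create(
    model="Hermes-3-Llama-3.1-8B",  # assumed model name
    prompt=f"Context: {reference}\n\nQuestion: How tall is the Eiffel Tower?\n\nAnswer:",
)
answer = response["choices"][0]["text"].strip()

# Check the answer's factual consistency against the reference text.
fact_check = client.factuality.check(
    reference=reference,
    text=answer,
)
print(fact_check)
```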
This will output something like:
Whereas, if we try to adversarially produce factual inconsistencies:
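For example, we can check a deliberately incorrect statement against the same reference text; the factuality score should come back noticeably lower:

```python
# Check an intentionally wrong claim against the reference context.
fact_check = client.factuality.check(
    reference=reference,
    text="The Eiffel Tower is a wooden tower in Berlin that is 30 metres tall.",
)
print(fact_check)
```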
We might get this kind of output:
To prevent toxic outputs:
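A sketch along the same lines. As above, the output parameter is an assumption based on older Prediction Guard SDK versions, and the prompt is only there to bait the model into a toxic reply.

```python
# Deliberately bait the model into a toxic reply, but ask Prediction Guard
# to validate the output for toxicity before returning it.
result = client.completions.create(
    model="Hermes-3-Llama-3.1-8B",  # assumed model name
    prompt="Write an insulting, profanity-laden rant about rainy weather.",
    output={"toxicity": True},
)

print(result["choices"][0]["text"])
```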
The above will likely generate toxic output, but thanks to Prediction Guard, you should only see the following ValueError:
Using The SDKs
You can also try these examples using the other official SDKs: