Toxicity
LLM output may contain offensive or inappropriate content. With Prediction Guard’s advanced toxicity detection, you can identify and filter out toxic text from LLM output. Similar to factuality, the toxicity check can be “switched on” by setting toxicity=True or by using the /toxicity endpoint. This is especially useful when managing online interactions, content creation, or customer service, as it lets you actively monitor and control the content.
Toxicity On Text Completions
Let’s now use the same prompt template from above, but try to generate some comments on the post. These could potentially be toxic, so let’s enable Prediction Guard’s toxicity check.
Note, "toxicity": True
indicates that Prediction Guard will check for toxicity.
It does NOT mean that you want the output to be toxic.
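A minimal sketch of what the call might look like, assuming the predictionguard Python client’s Completion.create interface; the model name, prompt template, post text, and response handling are illustrative assumptions rather than part of the official example:

```python
import os

import predictionguard as pg

# The client reads your access token from the environment.
os.environ["PREDICTIONGUARD_TOKEN"] = "<your access token>"

# Illustrative prompt template for generating a comment on a post.
template = """Write a short comment responding to the following post.

Post: {post}

Comment: """

result = pg.Completion.create(
    model="Nous-Hermes-Llama2-13B",  # placeholder model name
    prompt=template.format(post="Just finished my first marathon this weekend!"),
    output={
        "toxicity": True  # switch on Prediction Guard's toxicity check
    },
)

# Response shape assumed: a dict with a "choices" list.
print(result["choices"][0]["text"])
```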
The above code generates something like the following.
If we try to make the prompt generate toxic comments, Prediction Guard catches this and prevents the toxic output.
This outputs the following ValueError:
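Rather than letting the error propagate, application code can catch it and handle the rejected completion; here is a minimal sketch, reusing the assumed call from above:

```python
try:
    result = pg.Completion.create(
        model="Nous-Hermes-Llama2-13B",  # placeholder model name
        prompt=template.format(post="<a post designed to provoke toxic replies>"),
        output={"toxicity": True},
    )
    print(result["choices"][0]["text"])
except ValueError as err:
    # The completion was rejected by the toxicity check; no output is returned.
    print("Completion failed the toxicity check:", err)
```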
Standalone Toxicity Functionality
You can also call the toxicity checking functionality directly using the /toxicity endpoint, which enables you to configure thresholds and score arbitrary inputs.
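For illustration, a sketch of a standalone check via the Python client; the Toxicity.check method name and the response field names used below are assumptions, so consult the API reference for the exact schema:

```python
import os

import predictionguard as pg

os.environ["PREDICTIONGUARD_TOKEN"] = "<your access token>"

# Score an arbitrary piece of text for toxicity.
result = pg.Toxicity.check(
    text="That movie was absolutely terrible and so was your review!"
)

# Apply your own threshold to the returned score
# (the "checks"/"score" response fields are assumed here).
score = result["checks"][0]["score"]
threshold = 0.5

if score > threshold:
    print(f"Flagged as toxic (score {score:.2f})")
else:
    print(f"Passed the toxicity check (score {score:.2f})")
```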