Data Chat With LLMs

(Run this example in Google Colab here)

Large Language Models (LLMs) like ‘deepseek-coder-6.7B-instruct’ have demonstrated impressive capabilities for understanding natural language and generating SQL. We can leverage these skills for data analysis by having them automatically generate SQL queries against known database structures, and then rephrase the SQL results using state-of-the-art text/chat completion models like ‘neural-chat-7b-v3-3’ to produce well-written answers to user questions.

Unlike code generation interfaces that attempt to produce executable code from scratch, our approach focuses strictly on generating industry-standard SQL from plain English questions. This provides two major benefits:

  • SQL is a well-established language supported across environments, avoiding the need to execute less secure auto-generated code.

  • Mapping natural language questions to SQL over known schemas is more robust than attempting to generate arbitrary code for unfamiliar data structures.

By combining a language model's understanding of questions with a defined database schema, the system can translate natural language queries into precise SQL for fast and reliable data analysis. This makes surfacing insights more accessible than manual SQL writing or open-ended code generation.
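For example, against a hypothetical customer table with an active flag, a question like "How many of our customers are currently active?" might be translated into a single precise query:

SELECT COUNT(*) FROM customer WHERE active = 1;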

For this demo we have selected a public dataset from Kaggle - Jobs and Salaries in Data Science (Find the dataset here)

Installation and Setup

  • Install the Python SDK with pip install predictionguard (see the install cell below for the full set of libraries used in this walkthrough)

  • Get a Prediction Guard API key (as described here) and set it as the environment variable PREDICTIONGUARD_API_KEY.
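If you are working in a notebook (such as the Colab linked above), you can install everything from a single cell. The package list below is an assumption based on the imports used in this walkthrough:

! pip install predictionguard langchain sentence-transformers lancedb html2text pandas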

Setup

First, import the necessary libraries.

import time
import os
import re
import urllib

import html2text
import sqlite3
from langchain import PromptTemplate
from sentence_transformers import SentenceTransformer
import lancedb
from lancedb.embeddings import with_embeddings
from predictionguard import PredictionGuard
import pandas as pd
from getpass import getpass

Authenticate to Prediction Guard API

Now set up your API key and create the client.

# Set your Prediction Guard API key as an environment variable.
os.environ["PREDICTIONGUARD_API_KEY"] = "<api key>"

client = PredictionGuard()

Create a SQLite Database

We will create a local SQLite database for this example, but a similar approach could be used with any remote Postgres, MySQL, etc. database. We will load an example movie rental database called Sakila. Sakila models a database for a chain of video rental stores.

It contains a wealth of information about:

  • movie titles, actors, genres, etc.
  • what stores have what films in inventory
  • transactions and payments
  • customers and staff

# Pull the example database
! git clone https://github.com/bradleygrant/sakila-sqlite3.git

Establish connection to the SQLite DB

Now connect to the database.

db_path = 'sakila-sqlite3/sakila_master.db'
conn = sqlite3.connect(db_path)

Query Execution

Next, define a helper function for executing SQL queries. After having a look at the data, make any cleaning or preprocessing decisions that may be needed.

def execute_sql_query(query):
    cursor = conn.cursor()
    cursor.execute(query)
    result = cursor.fetchall()
    cursor.close()
    return result

# Try querying the database
results = execute_sql_query("SELECT * FROM customer LIMIT 3;")
results

Here is the expected output.

[(1,
  1,
  'MARY',
  'SMITH',
  'MARY.SMITH@sakilacustomer.org',
  5,
  '1',
  '2006-02-14 22:04:36.000',
  '2020-12-23 07:15:11'),
 (2,
  1,
  'PATRICIA',
  'JOHNSON',
  'PATRICIA.JOHNSON@sakilacustomer.org',
  6,
  '1',
  '2006-02-14 22:04:36.000',
  '2020-12-23 07:15:11'),
 (3,
  1,
  'LINDA',
  'WILLIAMS',
  'LINDA.WILLIAMS@sakilacustomer.org',
  7,
  '1',
  '2006-02-14 22:04:36.000',
  '2020-12-23 07:15:11')]

Simple Approach With Static Schema Information

Generally, our approach to SQL generation involves asking the LLM to generate a relevant SQL query and injecting the schema information into the prompt for context. The problem in this case is that the schema information is quite long. In reality, production DBs might have hundreds of tables, views, etc.

Adding all of this schema information to the prompt can create issues with:

  • Model context windows
  • Model performance

As such, one “naive” thing we could try is generating a summary of the schema information that fits into the context window of the given model. This will only scale to a certain point, and it may produce odd results when relevant context is missing. However, it might be enough for your use case.

Prepare Descriptive Static Schema Information

schema_description = []
query = "SELECT name FROM sqlite_master WHERE type='table';"

# Assuming 'conn' is your SQLite connection object and has been defined earlier
cursor = conn.cursor()
cursor.execute(query)
tables = cursor.fetchall()

for (table,) in tables:

    # Use double quotes around the table name to avoid syntax errors with reserved keywords
    cursor.execute(f'PRAGMA table_info("{table}")')
    columns = cursor.fetchall()
    column_descriptions = ", ".join([f"{col[1]}" for col in columns])
    schema_description.append(f"- {table}: includes {column_descriptions}")

cursor.close()
static_schema_description = "\n".join(schema_description)
print(static_schema_description)

The schema description for this example dataset will look like this:

- actor: includes actor_id, first_name, last_name, last_update
- country: includes country_id, country, last_update
- city: includes city_id, city, country_id, last_update
- address: includes address_id, address, address2, district, city_id, postal_code, phone, last_update
- language: includes language_id, name, last_update
- category: includes category_id, name, last_update
- customer: includes customer_id, store_id, first_name, last_name, email, address_id, active, create_date, last_update
- film: includes film_id, title, description, release_year, language_id, original_language_id, rental_duration, rental_rate, length, replacement_cost, rating, special_features, last_update
- film_actor: includes actor_id, film_id, last_update
- film_category: includes film_id, category_id, last_update
- film_text: includes film_id, title, description
- inventory: includes inventory_id, film_id, store_id, last_update
- staff: includes staff_id, first_name, last_name, address_id, picture, email, store_id, active, username, password, last_update
- store: includes store_id, manager_staff_id, address_id, last_update
- payment: includes payment_id, customer_id, staff_id, rental_id, amount, payment_date, last_update
- rental: includes rental_id, rental_date, inventory_id, customer_id, return_date, staff_id, last_update
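Because this entire description is injected into the prompt, it is worth a rough check that it fits comfortably within the model's context window. Here is a minimal sketch; the ~4 characters per token figure is a crude assumption, not an exact tokenizer:

# Crude sanity check: assume roughly 4 characters per token (a rough heuristic, not exact).
approx_tokens = len(static_schema_description) // 4
print(f"Schema description is roughly {approx_tokens} tokens")

# If this approaches the model's context limit, trim the description
# (e.g., drop rarely queried tables) or use the retrieval approach described later.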

Prompt Templates

Define prompt templates for generating SQL queries and chatbot responses using Prediction Guard.

sql_template = """You are a SQL expert and you only generate SQL queries which are executable. You provide no extra explanations.
You respond with a SQL query that answers the user question in the below instruction by querying a database with the following tables:
{schema_description}

### Instruction:
User question: \"{question}\"

### Response:
"""

sql_prompt = PromptTemplate(template=sql_template, input_variables=["question", "schema_description"])

qa_template = """### System:
You are a data chatbot who answers user questions.
To answer these questions you will need the data provided, which is the result of executing the given SQL query.
Give a short and crisp response that answers the question.
Don't add any notes or any extra information after your response.

### User:
Question: {question}

SQL Query: {sql_query}

Data: {data}

### Assistant:
"""

qa_prompt = PromptTemplate(template=qa_template, input_variables=["question", "sql_query", "data"])

Create Some Utilities To Generate And Refine The SQL Query

Generate SQL queries from user questions using Prediction Guard, then extract and refine the generated queries before executing them.

def generate_sql_query(question, injected_schema):

    prompt_filled = sql_prompt.format(question=question, schema_description=injected_schema)

    try:
        result = client.completions.create(
            model="deepseek-coder-6.7b-instruct",
            prompt=prompt_filled,
            max_tokens=300,
            temperature=0.1
        )
        sql_query = result["choices"][0]["text"]
        return sql_query

    except Exception as e:
        return None


def extract_and_refine_sql_query(sql_query):

    # Nothing to refine if generation failed
    if not sql_query:
        return ""

    # Extract the SQL query using a regular expression
    match = re.search(r"(SELECT.*?);", sql_query, re.DOTALL)
    if match:

        refined_query = match.group(1)

        # Check for and remove any text after a colon
        colon_index = refined_query.find(':')
        if colon_index != -1:
            refined_query = refined_query[:colon_index]

        # Ensure the query ends with a semicolon
        if not refined_query.endswith(';'):
            refined_query += ';'
        return refined_query

    else:
        return ""


def get_answer_from_sql(question, injected_schema):
    sql_query = generate_sql_query(question, injected_schema)
    sql_query = extract_and_refine_sql_query(sql_query)

    try:
        cursor = conn.cursor()
        cursor.execute(sql_query)
        result = cursor.fetchall()
        cursor.close()
        return result, sql_query

    except sqlite3.Error as e:
        print(f"Error executing SQL query: {e}")
        # Return a pair so callers can still unpack (data, sql_query)
        return "There was an error executing the SQL query.", sql_query
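To see what the refinement step does, you can run it on a hypothetical raw completion (the text below is made up purely for illustration):

# Hypothetical raw model output with explanatory text around the query.
raw_output = """Sure! Here is a query that answers the question.

SELECT title FROM film LIMIT 5;

This returns the first five film titles."""

print(extract_and_refine_sql_query(raw_output))
# -> SELECT title FROM film LIMIT 5;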

Return Natural Language Responses

Generate responses to user questions based on SQL query results.

def get_answer(question, data, sql_query):

    prompt_filled = qa_prompt.format(question=question, data=data, sql_query=sql_query)

    # Respond to the user
    output = client.completions.create(
        model="neural-chat-7b-v3-3",
        prompt=prompt_filled,
        max_tokens=200,
        temperature=0.1
    )
    completion = output['choices'][0]['text']

    return completion

Test It Out

Let’s test out the demo.

question = "What are the three most rented movies?"

print('Question:')
print('------------------------')
print(question)
print('')

context, sql_query = get_answer_from_sql(question, static_schema_description)
print('Generated SQL Query:')
print('------------------------')
print(sql_query)
print('')
print('SQL result:')
print('------------------------')
print(context)
print('')

answer = get_answer(question, context, sql_query)
print('Generated NL answer:')
print('------------------------')
print(answer)

Output:


More Advanced Retrieval Approach With Dynamic Schema Information

The simple approach above starts to break down if your database:

  • has many tables

  • has tables with many columns

  • includes fields with “unexpected” formats for values

  • includes columns with non-semantically meaningful names, etc.

In these cases, we need to go beyond the naive SQL generation method. We will integrate a vector database to store schema information along with data dictionary descriptions of tables and columns. The column descriptions will also include example field values for extra context.

We will retrieve the relevant information to answer a question on-the-fly and inject it into the prompt. We will also include “special instructions” in the prompt to deal with database quirks.
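The full retrieval implementation is in the notebook linked below, but a minimal sketch of the idea, using the lancedb and sentence_transformers libraries imported earlier, looks roughly like this (the embedding model, table name, and top-k value are illustrative assumptions, and the older with_embeddings helper is assumed to be available):

# Embed each table description so we can retrieve only the relevant ones per question.
model = SentenceTransformer("all-MiniLM-L6-v2")

def embed(batch):
    return [model.encode(text) for text in batch]

# One row per table description; data dictionary text and example values could be appended here.
schema_df = pd.DataFrame({"text": schema_description})
db = lancedb.connect(".lancedb")
table = db.create_table("schema", data=with_embeddings(embed, schema_df), mode="overwrite")

# At question time, retrieve the most relevant table descriptions
# and inject only those into the SQL generation prompt.
question = "What are the three most rented movies?"
hits = table.search(embed([question])[0]).limit(5).to_df()
injected_schema = "\n".join(hits["text"].tolist())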

You can work through this code and adapt it to your own use cases:

(Run this example in Google Colab here)

For An Interactive UI

In this case we use a Streamlit-based web application to create an appealing chat interface.
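Here is a minimal sketch of what such a chat loop might look like, assuming Streamlit's chat elements and the helper functions defined above (the complete application lives in the repo linked below):

import streamlit as st

st.title("Data Chat")

question = st.chat_input("Ask a question about the data")
if question:
    with st.chat_message("user"):
        st.write(question)

    # Generate SQL, execute it, then phrase the result as a natural language answer.
    data, sql_query = get_answer_from_sql(question, static_schema_description)
    answer = get_answer(question, data, sql_query)

    with st.chat_message("assistant"):
        st.code(sql_query, language="sql")
        st.write(answer)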

The complete application code can likewise be adapted to your own use cases:

(Find the relevant code and details in our GitHub repo here)

Chatbot in action

Conclusion

This document outlines the structure of an application designed for interactive data analysis through a chat interface. It leverages several Python libraries and techniques, including vector embeddings with LanceDB, SQL query execution against a SQLite database, and dynamic response generation using LangChain prompt templates and Prediction Guard.