Using LLMs for Data Analysis and SQL Query Generation

(Run this example in Google Colab here)

Large language models (LLMs) like ‘deepseek-coder-6.7B-instruct’ have demonstrated impressive capabilities for understanding natural language and generating SQL. We can leverage these skills for data analysis by having them automatically generate SQL queries against known database structures. And then rephrase these sql outputs using state of the art text/chat completion models like ‘Neural-Chat-7B’ to get well written answers to user questions.

Unlike code generation interfaces that attempt to produce executable code from scratch, our approach focuses strictly on generating industry-standard SQL from plain English questions. This provides two major benefits:

SQL is a well-established language supported across environments, avoiding the need to execute less secure auto-generated code.
Mapping natural language questions to SQL over known schemas is more robust than attempting to generate arbitrary code for unfamiliar data structures.

By combining language model understanding of questions with a defined database schema, the system can translate simple natural language queries into precise SQL for fast and reliable data analysis. This makes surfacing insights more accessible compared to manual SQL writing or hopelessly broad code generation.

For this demo we have selecteed a public dataset from Kaggle - Jobs and Salaries in Data Science (Find the dataset here)

Installation and Setup

Install the Python SDK with pip install predictionguard
Get a Prediction Guard access token (as described here) and set it as the environment variable PREDICTIONGUARD_TOKEN.

Setup

First, import the necessary libraries:

1 import time
2 import os
3 import re
4 import urllib
5 
6 import html2text
7 import sqlite3
8 from langchain import PromptTemplate
9 from sentence_transformers import SentenceTransformer
10 import lancedb
11 from lancedb.embeddings import with_embeddings
12 import predictionguard as pg
13 import pandas as pd
14 from getpass import getpass

Authenticate to Prediction Guard API

1 pg_access_token = getpass('Enter your Prediction Guard access token: ')
2 os.environ['PREDICTIONGUARD_TOKEN'] = pg_access_token

Create a SQLite Database

We will create a local SQLite database for this example, but a similar approach could be used with any remote Postgres, MySQL, etc. database. We will load an example movie rental database called Sakila. Sakila models a database for a chain of video rental stores. It contains a vast amount of information about:

movie titles actors, genres, etc. what stores have what films in inventory transactions and payments customers staff

1 # Pull the example database
2 ! git clone https://github.com/bradleygrant/sakila-sqlite3.git

Establish connection to the SQLite DB

1 db_path = 'sakila-sqlite3/sakila_master.db'
2 conn = sqlite3.connect(db_path)

Query Execution

After having a look at the dataset , make data cleaning/preprocessing decisions if needed.

1 def execute_sql_query(query):
2   cursor = conn.cursor()
3   cursor.execute(query)
4   result = cursor.fetchall()
5   cursor.close()
6   return result
7 
8 # Try querying the database
9 # Try querying the database
10 results = execute_sql_query("SELECT * FROM customer LIMIT 3;")
11 results

Output :

1 [(1,
2   1,
3   'MARY',
4   'SMITH',
5   'MARY.SMITH@sakilacustomer.org',
6   5,
7   '1',
8   '2006-02-14 22:04:36.000',
9   '2020-12-23 07:15:11'),
10  (2,
11   1,
12   'PATRICIA',
13   'JOHNSON',
14   'PATRICIA.JOHNSON@sakilacustomer.org',
15   6,
16   '1',
17   '2006-02-14 22:04:36.000',
18   '2020-12-23 07:15:11'),
19  (3,
20   1,
21   'LINDA',
22   'WILLIAMS',
23   'LINDA.WILLIAMS@sakilacustomer.org',
24   7,
25   '1',
26   '2006-02-14 22:04:36.000',
27   '2020-12-23 07:15:11')]

Simple approach with static schema information

Generally, our approach to SQL generation involves asking the LLM to generate a relevant SQL query and injecting the schema information in the prompt for context. The problem in this case is that the schema information is quite long. In reality, production DBs might have 100’s of tables, views, etc. All of this schema information addeded into the prompt creates issues with:

Model context windows Model performance As such, one “naive” thing we could try is generating a summary of the schema information that fits into the context window of the given model. This will only scale to a certain point, and it may introduce weirdness because of lack of relevant context. However, it might be enough for your use case.

Prepare descriptive static schema information

1 schema_description = []
2 query = "SELECT name FROM sqlite_master WHERE type='table';"
3 
4 # Assuming 'conn' is your SQLite connection object and has been defined earlier
5 cursor = conn.cursor()
6 cursor.execute(query)
7 tables = cursor.fetchall()
8 
9 for (table,) in tables:
10 
11     # Use double quotes around the table name to avoid syntax error with reserved keywords
12     cursor.execute(f'PRAGMA table_info("{table}")')
13     columns = cursor.fetchall()
14     column_descriptions = ", ".join([f"{col[1]}" for col in columns])
15     schema_description.append(f"- {table}: includes {column_descriptions}")
16 
17 cursor.close()
18 static_schema_description = "\n".join(schema_description)
19 print(static_schema_description)

Schema for this example dataset will look like this :

1 - actor: includes actor_id, first_name, last_name, last_update
2 - country: includes country_id, country, last_update
3 - city: includes city_id, city, country_id, last_update
4 - address: includes address_id, address, address2, district, city_id, postal_code, phone, last_update
5 - language: includes language_id, name, last_update
6 - category: includes category_id, name, last_update
7 - customer: includes customer_id, store_id, first_name, last_name, email, address_id, active, create_date, last_update
8 - film: includes film_id, title, description, release_year, language_id, original_language_id, rental_duration, rental_rate, length, replacement_cost, rating, special_features, last_update
9 - film_actor: includes actor_id, film_id, last_update
10 - film_category: includes film_id, category_id, last_update
11 - film_text: includes film_id, title, description
12 - inventory: includes inventory_id, film_id, store_id, last_update
13 - staff: includes staff_id, first_name, last_name, address_id, picture, email, store_id, active, username, password, last_update
14 - store: includes store_id, manager_staff_id, address_id, last_update
15 - payment: includes payment_id, customer_id, staff_id, rental_id, amount, payment_date, last_update
16 - rental: includes rental_id, rental_date, inventory_id, customer_id, return_date, staff_id, last_update

Prompt Templates

Define prompt templates for generating SQL queries and chatbot responses using Prediction Guard:

1 sql_template = """You are a SQL expert and you only generate SQL queries which are executable. You provide no extra explanations.
2 You respond with a SQL query that answers the user question in the below instruction by querying a database with the following tables:
3 {schema_description}
4 
5 ### Instruction:
6 User question: \"{question}\"
7 
8 ### Response:
9 """
10 sql_prompt = PromptTemplate(template=sql_template, input_variables=["question", "schema_description"])
11 
12 qa_template = """### System:
13 You are a data chatbot who answers user questions.
14 To answer these questions you will need the data provided, which is a result of executed the given SQL query.
15 Give a short and crisp response that answers the answer.
16 Don't add any notes or any extra information after your response.
17 
18 ### User:
19 Question: {question}
20 
21 SQL Query: {sql_query}
22 
23 Data: {data}
24 
25 ### Assistant:
26 """
27 qa_prompt = PromptTemplate(template=qa_template,input_variables=["question", "sql_query", "data"])

Create some utilities to generate and refine the SQL query

Generate SQL queries based on user questions using PredictionGuard and process the queries:

1 def generate_sql_query(question, injected_schema):
2 
3   prompt_filled = sql_prompt.format(question=question, schema_description=injected_schema)
4 
5   try:
6       result = pg.Completion.create(
7           model="deepseek-coder-6.7b-instruct",
8           prompt=prompt_filled,
9           max_tokens=300,
10           temperature=0.1
11       )
12       sql_query = result["choices"][0]["text"]
13       return sql_query
14 
15   except Exception as e:
16       return None
17 
18 
19 def extract_and_refine_sql_query(sql_query):
20 
21   # Extract SQL query using a regular expression
22   match = re.search(r"(SELECT.*?);", sql_query, re.DOTALL)
23   if match:
24 
25       refined_query = match.group(1)
26 
27       # Check for and remove any text after a colon
28       colon_index = refined_query.find(':')
29       if colon_index != -1:
30           refined_query = refined_query[:colon_index]
31 
32       # Ensure the query ends with a semicolon
33       if not refined_query.endswith(';'):
34           refined_query += ';'
35       return refined_query
36 
37   else:
38       return ""
39 
40 def get_answer_from_sql(question, injected_schema):
41     sql_query = generate_sql_query(question, injected_schema)
42     sql_query = extract_and_refine_sql_query(sql_query)
43 
44     try:
45         cursor = conn.cursor()
46         cursor.execute(sql_query)
47         result = cursor.fetchall()
48         cursor.close()
49         return result, sql_query
50 
51     except sqlite3.Error as e:
52         print(f"Error executing SQL query: {e}")
53         return "There was an error executing the SQL query."

Return natural language responses

Generate responses to user questions based on SQL query results:

1 def get_answer(question, data, sql_query):
2 
3   prompt_filled = qa_prompt.format(question=question, data=data, sql_query=sql_query)
4 
5   # Respond to the user
6   output = pg.Completion.create(
7       model="Neural-Chat-7B",
8       prompt=prompt_filled,
9       max_tokens=200,
10       temperature=0.1
11   )
12   completion = output['choices'][0]['text']
13 
14   return completion

Test it out

1 question = "What are the three most rented movies?"
2 
3 print('Question:')
4 print('------------------------')
5 print(question)
6 print('')
7 
8 context, sql_query = get_answer_from_sql(question, static_schema_description)
9 print('Generated SQL Query:')
10 print('------------------------')
11 print(sql_query)
12 print('')
13 print('SQL result:')
14 print('------------------------')
15 print(context)
16 print('')
17 
18 # Convert context and answer to string if they are not already
19 answer = get_answer(question, context, sql_query)
20 context_str = ', '.join([str(item) for item in context]) if isinstance(context, list) else str(context)
21 answer_str = str(answer)
22 print('Generate NL answer:')
23 print('------------------------')
24 print(answer)

Output :

More advanced retrieval approach with dynamic schema information

Assuming that your database fits one of the following scenarios:

-has many tables

-has tables with many columns

-includes fields with “unexpected” formats for values

-includes columns with non-semantically meaningful names etc.

We need to go beyond the simple, naive SQL generation method. We will need to integrate a vector database to store schema information along with data dictionary descriptions of tables and columns. The column description will also include example field values for extra context.

We will retrieve the relevant information to answer a question on-the-fly and inject it into the prompt. We will also include “special instructions” in the prompt to deal with database quirks.

You can go through this code to replicate for your various use cases : (Run this example in Google Colab here)

For an interactive UI

In this case we will be using a streamlit based web application to create an appealing chat interface.

You can go through this code to replicate for your various use cases : (Find relavent codes and details for this in our github repohere)

Chatbot in action :

Conclusion

This document outlines the structure of an application designed for interactive data analysis through a chat interface. It leverages several advanced Python libraries and techniques, including vector embeddings with LanceDB, executing SQL queries on pandas dataframes, and generating dynamic responses using LangChain and PredictionGuard.

Getting started

Models

Using LLMs

Process LLM Input

Validating LLM Output

Guides

Reference

Support

Data Chat with LLMs