Data Chat With LLMs
(Run this example in Google Colab here)
Large Language Models (LLMs) like ‘deepseek-coder-6.7B-instruct’ have demonstrated impressive capabilities for understanding natural language and generating SQL. We can leverage these skills for data analysis by having them automatically generate SQL queries against known database structures. We can then rephrase the SQL outputs using state-of-the-art text/chat completion models like ‘neural-chat-7b-v3-3’ to get well-written answers to user questions.
Unlike code generation interfaces that attempt to produce executable code from scratch, our approach focuses strictly on generating industry-standard SQL from plain English questions. This provides two major benefits:
- SQL is a well-established language supported across environments, avoiding the need to execute less secure auto-generated code.
- Mapping natural language questions to SQL over known schemas is more robust than attempting to generate arbitrary code for unfamiliar data structures.
By combining a language model's understanding of questions with a defined database schema, the system can translate simple natural language queries into precise SQL for fast and reliable data analysis. This makes surfacing insights more accessible compared to manual SQL writing or open-ended code generation.
For this demo we have selected a public dataset from Kaggle - Jobs and Salaries in Data Science (Find the dataset here).
Installation and Setup
- Install the Python SDK with pip install predictionguard
- Get a Prediction Guard API key (as described here) and set it as the environment variable PREDICTIONGUARD_API_KEY.
Setup
First, import the necessary libraries.
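A minimal set of imports for this walkthrough might look like the following (the exact list depends on which parts of the example you run):

```python
import os
import sqlite3

import pandas as pd
from langchain.prompts import PromptTemplate
from predictionguard import PredictionGuard
```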
Authenticate to Prediction Guard API
Now set up your API key and create the client.
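As a sketch, assuming the predictionguard SDK reads the PREDICTIONGUARD_API_KEY environment variable:

```python
# Set the key here only if it isn't already exported in your environment.
os.environ["PREDICTIONGUARD_API_KEY"] = "<your Prediction Guard API key>"

# The client picks up PREDICTIONGUARD_API_KEY from the environment.
client = PredictionGuard()
```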
Create a SQLite Database
We will create a local SQLite database for this example, but a similar approach could be used with any remote Postgres, MySQL, etc. database. We will load an example movie rental database called Sakila. Sakila models a database for a chain of video rental stores.
It contains information about:
- movie titles, actors, genres, etc.
- what stores have what films in inventory
- transactions and payments
- customers and staff
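One way to build the local database is to execute a SQLite-compatible Sakila dump into a new database file. The file names below are hypothetical; point them at wherever you downloaded the dump:

```python
# Create (or open) the local database file and load the Sakila dump into it.
conn = sqlite3.connect("sakila.db")
with open("sakila.sql") as f:
    conn.executescript(f.read())
conn.commit()
conn.close()
```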
Establish connection to the SQLite DB
Now connect to the database.
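For example:

```python
# Re-open a connection to the local database created above.
conn = sqlite3.connect("sakila.db")

# Sanity check: list the tables that were loaded.
print(pd.read_sql_query(
    "SELECT name FROM sqlite_master WHERE type = 'table'", conn
))
```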
Query Execution
After having a look at the dataset, make any data cleaning/preprocessing decisions that are needed.
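A small helper keeps query execution in one place (the table and column names below are Sakila's; adjust them for your own data):

```python
def run_query(sql: str) -> pd.DataFrame:
    """Execute a SQL query against the open connection, returning a dataframe."""
    return pd.read_sql_query(sql, conn)

# Peek at a few rows to decide whether any cleaning is needed.
print(run_query("SELECT title, release_year, rating FROM film LIMIT 5"))
```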
Here is the expected output.
Simple Approach With Static Schema Information
Generally, our approach to SQL generation involves asking the LLM to generate a relevant SQL query and injecting the schema information into the prompt for context. The problem in this case is that the schema information is quite long. In reality, production DBs might have hundreds of tables, views, etc.
All of this schema information added into the prompt can create issues with:
- Model context windows
- Model performance
As such, one “naive” thing we could try is generating a summary of the schema information that fits into the context window of the given model. This will only scale to a certain point, and it may introduce odd behavior due to a lack of relevant context. However, it might be enough for your use case.
Prepare Descriptive Static Schema Information
The schema for this example dataset will look like this.
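For instance, a hand-written description covering a couple of Sakila tables might look like this (abbreviated and illustrative; extend it to the tables you need):

```python
schema_description = """
Table film: one row per movie.
  - film_id (INTEGER): primary key
  - title (TEXT): movie title
  - release_year (INTEGER): year of release
  - rental_rate (NUMERIC): cost to rent the film

Table payment: one row per customer payment.
  - payment_id (INTEGER): primary key
  - customer_id (INTEGER): references the customer table
  - amount (NUMERIC): payment amount in dollars
  - payment_date (TEXT): timestamp of the payment
"""
```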
Prompt Templates
Define prompt templates for generating SQL queries and chatbot responses using Prediction Guard.
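A sketch of the two templates using LangChain's PromptTemplate (the wording and prompt format here are illustrative, not the exact prompts from the notebook):

```python
sql_prompt = PromptTemplate(
    input_variables=["schema", "question"],
    template="""### Instruction:
You are a SQL expert. Given the database schema below, write a single
SQLite query that answers the user's question. Return only the SQL.

Schema:
{schema}

Question: {question}

### Response:
""",
)

answer_prompt = PromptTemplate(
    input_variables=["question", "result"],
    template="""Using only the query result below, answer the user's
question in one or two well-written sentences.

Question: {question}
Query result:
{result}

Answer:""",
)
```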
Create Some Utilities To Generate And Refine The SQL Query
Generate SQL queries based on user questions using PredictionGuard and process the queries.
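A minimal sketch, assuming the completions endpoint returns an OpenAI-style choices list (check the response shape for your SDK version):

```python
def generate_sql(question: str) -> str:
    """Ask the code model for a SQL query, then clean up its output."""
    result = client.completions.create(
        model="deepseek-coder-6.7b-instruct",
        prompt=sql_prompt.format(schema=schema_description, question=question),
        max_tokens=200,
        temperature=0.1,
    )
    sql = result["choices"][0]["text"].strip()
    # Models sometimes wrap queries in markdown fences; strip any backticks.
    return sql.replace("`", "").removeprefix("sql").strip()
```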
Return Natural Language Responses
Generate responses to user questions based on SQL query results.
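A companion helper runs the generated SQL and asks the chat model to phrase the result (again a sketch, under the same response-shape assumption):

```python
def answer_question(question: str) -> str:
    """Generate SQL, execute it, and phrase the result in natural language."""
    sql = generate_sql(question)
    rows = run_query(sql)
    result = client.completions.create(
        model="neural-chat-7b-v3-3",
        prompt=answer_prompt.format(
            question=question, result=rows.to_string(index=False)
        ),
        max_tokens=200,
        temperature=0.1,
    )
    return result["choices"][0]["text"].strip()
```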
Test It Out
Let’s test out the demo.
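For example, with a hypothetical question against the Sakila data:

```python
question = "Which five films have been rented the most?"
print(answer_question(question))
```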
Output:
More Advanced Retrieval Approach With Dynamic Schema Information
Assuming that your database fits one of the following scenarios:
- has many tables
- has tables with many columns
- includes fields with “unexpected” formats for values
- includes columns with non-semantically meaningful names, etc.
We need to go beyond the simple, naive SQL generation method. We will need to integrate a vector database to store schema information along with data dictionary descriptions of tables and columns. The column description will also include example field values for extra context.
We will retrieve the relevant information to answer a question on-the-fly and inject it into the prompt. We will also include “special instructions” in the prompt to deal with database quirks.
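A minimal sketch of the retrieval side, assuming LanceDB as the vector store and a sentence-transformers model for embeddings (the document texts below are illustrative):

```python
import lancedb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# One document per table/column: name, data dictionary description,
# and example field values for extra context.
schema_docs = [
    {"text": "Table payment, column amount: payment amount in dollars, e.g. 2.99, 4.99"},
    {"text": "Table film, column rating: MPAA rating, e.g. 'PG', 'R', 'NC-17'"},
    {"text": "Table rental, column rental_date: timestamp when the rental started"},
]
for doc in schema_docs:
    doc["vector"] = embedder.encode(doc["text"]).tolist()

db = lancedb.connect("./schema_index")
schema_table = db.create_table("schema_docs", data=schema_docs)

def retrieve_schema(question: str, k: int = 3) -> str:
    """Return only the schema snippets most relevant to the question."""
    query_vec = embedder.encode(question).tolist()
    hits = schema_table.search(query_vec).limit(k).to_list()
    return "\n".join(hit["text"] for hit in hits)
```

The retrieved snippets (plus any "special instructions") then take the place of the static schema_description in the SQL generation prompt.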
You can go through this code and replicate it for your various use cases:
(Run this example in Google Colab here)
For An Interactive UI
In this case, we will use a Streamlit-based web application to create an appealing chat interface.
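A minimal sketch of such an interface using Streamlit's chat elements, reusing the answer_question helper from above (save it as app.py and launch it with streamlit run app.py):

```python
import streamlit as st

st.title("Data Chat")

# Keep the conversation across Streamlit reruns.
if "history" not in st.session_state:
    st.session_state.history = []

# Replay the conversation so far.
for role, text in st.session_state.history:
    st.chat_message(role).write(text)

# Handle a new question from the user.
if question := st.chat_input("Ask a question about the data"):
    st.chat_message("user").write(question)
    answer = answer_question(question)  # helper defined earlier
    st.chat_message("assistant").write(answer)
    st.session_state.history += [("user", question), ("assistant", answer)]
```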
You can go through this code and replicate it for your various use cases:
(Find the relevant code and details for this in our GitHub repo here)
Chatbot in action
Conclusion
This document outlines the structure of an application designed for interactive data analysis through a chat interface. It leverages several advanced Python libraries and techniques, including vector embeddings with LanceDB, executing SQL queries on pandas dataframes, and generating dynamic responses using LangChain and PredictionGuard.