Artificial Intelligence News

Reinventing the data experience: Use generative AI and modern data architectures to unlock insights


Implementing a modern data architecture provides a scalable method for integrating data from disparate sources. By organizing data by business domain rather than infrastructure, each domain can choose the tools that suit its needs. Organizations can maximize the value of their modern data architectures with generative AI solutions while continuously innovating.

Natural language capabilities allow non-technical users to query data via conversational English instead of complex SQL. However, realizing the full benefits requires overcoming several challenges. AI models and languages ​​must identify appropriate data sources, generate effective SQL queries, and produce coherent responses with embedded results at scale. They also need a user interface for natural language questions.

Overall, implementing modern data architectures and generative AI techniques with AWS is a promising approach for gathering and disseminating critical insights from a wide variety of data at enterprise scale. The newest offering for generative AI from AWS is Amazon Bedrock, which is a fully managed service and the easiest way to build and scale generative AI applications with a base model. AWS also offers a base model through Amazon SageMaker JumpStart as an Amazon SageMaker endpoint. The combination of a large language model (LLM), including the ease of integration that Amazon Bedrock offers, and a scalable, domain-oriented data infrastructure positions it as a smart method for leveraging the abundant information stored in various analytical databases and data lakes.

In this post, we feature scenarios where companies have implemented modern data architectures with data residing in multiple databases and APIs such as legal data on Amazon Simple Storage Service (Amazon S3), human resources on Amazon Relational Database Service (Amazon RDS), sales and marketing on Amazon Redshift, financial market data on third-party data warehouse solutions Snowflake, and product data as API. This implementation aims to increase the business analytics productivity of enterprises, product owners, and business domain experts. All of this is achieved through the use of generative AI in this domain’s mesh architecture, which allows companies to achieve their business goals more efficiently. This solution has the option to include an LLM from JumpStart as a SageMaker endpoint as well as third-party models. We provide a medium for enterprise users to ask fact-based questions without basic knowledge of data pipelines, thereby reducing the complexities of writing simple to complex SQL queries.

Solution overview

Modern data architectures on AWS apply artificial intelligence and natural language processing to query multiple analytical databases. Using services such as Amazon Redshift, Amazon RDS, Snowflake, Amazon Athena, and AWS Glue, create scalable solutions to integrate data from multiple sources. Use LangChaina powerful library for working with LLM, including base models from Amazon Bedrock and JumpStart in Amazon SageMaker Studio notebooks, a system built into which users can ask business questions in natural English and receive answers with data pulled from relevant databases.

The following diagram illustrates the architecture.

The hybrid architecture uses multiple databases and an LLM, with underlying models from Amazon Bedrock and JumpStart for data source identification, SQL generation, and output text generation.

The following diagram illustrates specific workflow steps for our solution.

The steps are as follows:

  1. A business user provides an English question prompt.
  2. The AWS Glue crawler is scheduled to run at frequent intervals to extract metadata from the database and create table definitions in the AWS Glue Data Catalog. The Data Catalog is the input to Sequence Chain 1 (see previous diagram).
  3. LangChain, a tool for working with LLM and instructions, is used in Studio notebooks. LangChain requires an LLM to be defined. As part of Sequence Chain 1, Data Catalog prompts and metadata are passed to the LLM, hosted on the SageMaker endpoint, to identify the relevant databases and tables using the LangChain.
  4. Fast and identifiable databases and tables are passed to Chain Sequence 2.
  5. LangChain establishes a connection to the database and executes SQL queries to get the results.
  6. The results are passed on to the LLM to generate an English answer with the data.
  7. Users receive English answers to their prompts, querying data from different databases.

The following sections describe some of the key steps with associated code. To dive deeper into the solutions and code for all the steps shown here, see GitHub repos. The following diagram shows the sequence of steps to follow:


You can use any database that is compatible with it SQLAlchemy to generate responses from LLM and LangChain. However, this database must have metadata registered in the AWS Glue Data Catalog. Additionally, you must have access to the LLM via a JumpStart key or API.

Connect to database using SQLAlchemy

LangChain uses SQLAlchemy to connect to SQL databases. We initialize the SQLDatabase LangChain function by creating an engine and establishing a connection for each data source. Here’s an example of how to connect to an Amazon Aurora MySQL Compatible Edition serverless database and include only the employees table:

#connect to AWS Aurora MySQL
cluster_arn = <cluster_arn>
secret_arn = <secret_arn>
  connect_args=dict(aurora_cluster_arn=cluster_arn, secret_arn=secret_arn))
dbrds = SQLDatabase(engine_rds, include_tables=('employees'))

Next, we create a prompt that is used by Chain Sequence 1 to identify the database and table names based on user queries.

Generate dynamic command templates

We use the AWS Glue Data Catalog, which is designed to store and manage metadata information, to identify data sources for user queries and create requests for Sequence 1, as detailed in the following steps:

  1. We created a Data Catalog by crawling metadata from multiple data sources using the JDBC connection used in the demonstration.
  2. With the Boto3 library, we create a consolidated Data Catalog view of multiple data sources. Here is an example of how to get the employee table metadata from the Data Catalog for the Aurora MySQL database:
 #retrieve metadata from glue data catalog
  glue_tables_rds = glue_client.get_tables(DatabaseName=<database_name>, MaxResults=1000)
    for table in glue_tables_rds('TableList'):
        for column in table('StorageDescriptor')('Columns'):

The Merged Data Catalog has details about the data source, such as schema, table names, and column names. The following is an example of aggregated Data Catalog output:


  1. We pass the consolidated Data Catalog to the prompt template and define the prompt used by LangChain:
prompt_template = """
From the table below, find the database (in column database) which will contain the data (in corresponding column_names) to answer the question {query} \n
"""+glue_catalog +""" Give your answer as database == \n Also,give your answer as database.table =="""

Sequence Chain 1: Detect source metadata for user queries using LangChain and LLM

We pass the prompt template generated in the previous step to the prompt, along with user queries to the LangChain model, to find the best data source to answer questions. LangChain uses our selected LLM model to detect source metadata.

Use the following code to use an LLM from JumpStart or a third party model:

#define your LLM model here
llm = <LLM>
#pass prompt template and user query to the prompt
PROMPT = PromptTemplate(template=prompt_template, input_variables=("query"))
# define llm chain
llm_chain = LLMChain(prompt=PROMPT, llm=llm)
#run the query and save to generated texts
generated_texts =

The resulting text contains information such as the database and table names that run the user’s queries. For example, for a user query “Name all employees with birthdays this month”, generated_text have information database == rdsmysql And database.table == rdsmysql.employees.

Next, we pass the human resources domain details, Aurora MySQL database, and employees table to Chain Sequence 2.

Chained Sequence 2: Retrieve responses from data sources to answer user queries

Next, we run the LangChain SQL database chain to convert text to SQL and implicitly run the resulting SQL against the database to fetch database results in easy-to-read language.

We start by defining a prompt template that instructs the LLM to generate SQL in the syntactically correct dialect and then run it against the database:

_DEFAULT_TEMPLATE = """Given an input question, first create a syntactically correct {dialect} query to run, then look at the results of the query and return the answer.
Only use the following tables:
If someone asks for the sales, they really mean the tickit.sales table.
Question: {input}"""
#define the prompt
PROMPT = PromptTemplate( input_variables=("input", "table_info", "dialect"), template=_DEFAULT_TEMPLATE)

Lastly, we pass the LLM, database connection, and prompt to the SQL database chain and run the SQL query:

db_chain = SQLDatabaseChain.from_llm(llm, db, prompt=PROMPT)

For example, for a user query “Name all employees with birthdays this month”, the response would be as follows:

Question: Name all employees with birth date this month

SELECT * FROM employees WHERE MONTH(birth_date) = MONTH(CURRENT_DATE());

User Response:
The employees with birthdays this month are:
Christian Koblick
Tzvetan Zielinski


Once you’ve run a modern data architecture with generative AI, be sure to clean up any resources you’re not going to use. Shut down and delete the databases in use (Amazon Redshift, Amazon RDS, Snowflake). Additionally, clear data in Amazon S3 and terminate all Studio notebook instances to prevent further charges. If you used JumpStart to deploy LLM as a SageMaker real-time endpoint, delete the endpoint via the console or SageMaker Studio.


In this post, we integrate modern data architectures with generative AI and LLM within SageMaker. This solution uses various text-to-text foundation models from JumpStart as well as third-party models. This hybrid approach identifies the data source, writes an SQL query, and returns a response with the query results. It uses Amazon Redshift, Amazon RDS, Snowflake, and LLM. To enhance the solution, you can add more databases, UI for English queries, quick techniques and data tools. This can be a smart, unified way to gain insights from multiple data stores. For a deeper dive into the solutions and code featured in this post, see GitHub repos . See also Amazon Bedrock for use cases on generative AI, foundational models, and large language models.


Example prompts

Domain Database/API Remind SQL (Produced by LLM) Output
sales & Marketing Amazon RedShift How many ticket sales are there? SELECT COUNT(*) AS total_sales
FROM tickit.sales;
There are 172,456 ticket sales.
sales & Marketing Amazon RedShift What was the total ticket sales commission in 2008? SELECT SUM(commission) AS total_commission
FROM tickit.sales
WHERE EXTRACT(YEAR FROM saletime) = 2008
The total commission for ticket sales in the year 2008 was $16,614,814.65.
Law S3 How many scams were there in 2023? SELECT count(*)
FROM claims
WHERE extract(year from write_time) = 2023 AND fraud = 1;
There were 164 fraud claims in 2023.
Law S3 How many policies have been claimed this year? SELECT count(*)
FROM claims;
There were 5000 claims made this year.
Human Resources MySQL Amazon Aurora List all employees with birth dates this month SELECT * FROM employees
The employees with birthdays this month are:
Christian Koblick
Tzvetan Zielinski
Kazuhito Cappelletti
Yinghua Dredge
Human Resources MySQL Amazon Aurora How many employees were employed before 1990? SELECT COUNT(*) AS 'Number of employees hired before 1990'
FROM employees
WHERE hire_date < '1990-01-01'
29 employees were hired before 1990.
Finance and Investment Snowflake Which stocks performed the best and worst in May 2013? SELECT name, MAX(close) AS max_close, MIN(close) AS min_close
FROM all_stocks_5yr
WHERE date BETWEEN '2013-05-01' AND '2013-05-31'
ORDER BY max_close DESC, min_close ASC
The stock that performed the best in May 2013 was AnySock1 (ASTOCK1) with a maximum closing price of $842.50. The stock that performed the worst was AnySock2 (ASTOCK2) with a minimum closing price of $3.22.
Finance and Investment Snowflake What was the average volume of shares traded in July 2013? SELECT AVG(volume) AS average_volume
FROM all_stocks_5yr
WHERE date BETWEEN '2013-07-01' AND '2013-07-31'
The average volume of stocks traded in July 2013 was 4,374,177
Product – Weather FIRE What’s the current weather like in New York City in degrees Fahrenheit?

About the Author

Navneet Tuteja is a Data Specialist at Amazon Web Services. Prior to joining AWS, Navneet worked as a facilitator for organizations looking to modernize their data architecture and deploy comprehensive AI/ML solutions. He holds an engineering degree from Thapar University, as well as a master’s degree in statistics from Texas A&M University.

Sovik Kumar Nat is an AI/ML solution architect with AWS. He has extensive experience designing end-to-end machine learning and business analytics solutions in finance, operations, marketing, healthcare, supply chain management and IoT. Sovik has published articles and holds patents in monitoring ML models. He holds dual master’s degrees from the University of South Florida, University of Fribourg, Switzerland, and a bachelor’s degree from the Indian Institute of Technology, Kharagpur. Outside of work, Sovik enjoys traveling, taking the ferry and watching movies.


Source link

Related Articles

Back to top button