Implementing a modern data architecture provides a scalable method for integrating data from disparate sources. By organizing data by business domain rather than infrastructure, each domain can choose the tools that suit its needs. Organizations can maximize the value of their modern data architectures with generative AI solutions while continuously innovating.
Natural language capabilities allow non-technical users to query data in conversational English rather than complex SQL. However, realizing the full benefits requires overcoming several challenges. The AI models must identify the appropriate data sources, generate effective SQL queries, and produce coherent responses with embedded results at scale. They also need a user interface for natural language questions.
Overall, implementing modern data architectures and generative AI techniques with AWS is a promising approach for gathering and disseminating key insights from a wide variety of data at enterprise scale. The newest AWS offering for generative AI is Amazon Bedrock, a fully managed service and the easiest way to build and scale generative AI applications with foundation models. AWS also offers foundation models through Amazon SageMaker JumpStart, deployable as SageMaker endpoints. The combination of a large language model (LLM), with the ease of integration that Amazon Bedrock offers, and a scalable, domain-oriented data infrastructure positions this as a smart method for tapping the abundant information held in various analytics databases and data lakes.
In this post, we feature a scenario where a company has implemented a modern data architecture with data residing in multiple databases and APIs: legal data on Amazon Simple Storage Service (Amazon S3), human resources data on Amazon Relational Database Service (Amazon RDS), sales and marketing data on Amazon Redshift, financial market data in the third-party data warehouse Snowflake, and product data exposed as an API. This implementation aims to raise the business analytics productivity of enterprise users, product owners, and business domain experts. It achieves this through the use of generative AI on this data mesh architecture, allowing companies to reach their business goals more efficiently. The solution can use an LLM from JumpStart as a SageMaker endpoint as well as third-party models. We provide a medium for enterprise users to ask fact-based questions without needing knowledge of the underlying data pipelines, thereby removing the complexity of writing simple to complex SQL queries.
Modern data architectures on AWS apply artificial intelligence and natural language processing to query multiple analytics databases. Using services such as Amazon Redshift, Amazon RDS, Snowflake, Amazon Athena, and AWS Glue, we create scalable solutions to integrate data from multiple sources. With LangChain, a powerful library for working with LLMs, including foundation models from Amazon Bedrock and JumpStart, in Amazon SageMaker Studio notebooks, we build a system in which users can ask a business question in natural English and receive an answer with data drawn from the relevant databases.
The following diagram illustrates the architecture.
The hybrid architecture uses multiple databases and an LLM, with underlying models from Amazon Bedrock and JumpStart for data source identification, SQL generation, and output text generation.
The following diagram illustrates specific workflow steps for our solution.
The steps are as follows:
- A business user provides an English question prompt.
- The AWS Glue crawler is scheduled to run at frequent intervals to extract metadata from the databases and create table definitions in the AWS Glue Data Catalog. The Data Catalog is the input to Chain Sequence 1 (see the preceding diagram).
- LangChain, a tool for working with LLMs and prompts, is used in Studio notebooks. LangChain requires an LLM to be defined. As part of Chain Sequence 1, the prompt and Data Catalog metadata are passed to an LLM, hosted on a SageMaker endpoint, to identify the relevant database and table using LangChain.
- The prompt and the identified database and table are passed to Chain Sequence 2.
- LangChain establishes a connection to the database and executes SQL queries to get the results.
- The results are passed on to the LLM to generate an English answer with the data.
- Users receive English answers to their prompts, querying data from different databases.
The following sections describe some of the key steps with associated code. To dive deeper into the solution and code for all the steps shown here, refer to the GitHub repo. The following diagram shows the sequence of steps to follow:
You can use any database that is compatible with SQLAlchemy to generate responses with the LLM and LangChain. However, the database must have its metadata registered in the AWS Glue Data Catalog. Additionally, you need access to an LLM through either JumpStart or API keys.
Connect to database using SQLAlchemy
LangChain uses SQLAlchemy to connect to SQL databases. We initialize LangChain's SQLDatabase class by creating an engine and establishing a connection for each data source. Here's an example of how to connect to an Amazon Aurora MySQL-Compatible Edition serverless database and include only the employees table:
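The following is a minimal sketch of this connection step. Because no live Aurora cluster is available here, it uses an in-memory SQLite database as a stand-in; the commented-out URI shows the shape an Aurora MySQL connection string would take (host, port, and credentials are placeholders, not values from this post).

```python
from sqlalchemy import create_engine, inspect, text

# In the post's setup this URI would point at the Aurora MySQL cluster, e.g.:
#   mysql+pymysql://user:password@<cluster-endpoint>:3306/hr
# Here we use an in-memory SQLite database so the sketch runs anywhere.
engine = create_engine("sqlite:///:memory:")

# Create a toy employees table in place of the real HR schema
with engine.begin() as conn:
    conn.execute(text(
        "CREATE TABLE employees (emp_no INTEGER, birth_date TEXT, first_name TEXT)"
    ))

# LangChain's SQLDatabase wraps an engine like this one; include_tables
# restricts the chain to a single table:
#   db = SQLDatabase(engine, include_tables=["employees"])
tables = inspect(engine).get_table_names()
print(tables)  # ['employees']
```

Restricting the connection to the tables a chain actually needs (via `include_tables`) keeps the schema passed to the LLM small and the generated SQL focused.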
Next, we create a prompt that is used by Chain Sequence 1 to identify the database and table names based on user queries.
Generate dynamic command templates
We use the AWS Glue Data Catalog, which is designed to store and manage metadata, to identify data sources for a user query and build the prompt for Chain Sequence 1, as detailed in the following steps:
- We build the Data Catalog by crawling the metadata of multiple data sources using JDBC connections, as used in the demonstration.
- With the Boto3 library, we create a consolidated Data Catalog view of multiple data sources. Here is an example of how to get the employee table metadata from the Data Catalog for the Aurora MySQL database:
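The sketch below illustrates this consolidation step. A live call would fetch the metadata with boto3 (`glue.get_table(...)`, shown in comments); to keep the example self-contained, a response is hard-coded in the same shape that the AWS Glue `GetTable` API returns. The helper function name and the catalog line format are illustrative, not the post's exact code.

```python
# In the live setup this metadata would come from boto3, e.g.:
#   import boto3
#   glue = boto3.client("glue")
#   resp = glue.get_table(DatabaseName="rdsmysql", Name="employees")
# Below, a response with the same shape is hard-coded instead.
glue_response = {
    "Table": {
        "Name": "employees",
        "DatabaseName": "rdsmysql",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "emp_no", "Type": "int"},
                {"Name": "birth_date", "Type": "date"},
                {"Name": "first_name", "Type": "varchar"},
            ]
        },
    }
}

def describe_table(resp):
    """Render one Glue table definition as a line of the consolidated catalog."""
    table = resp["Table"]
    cols = ", ".join(c["Name"] for c in table["StorageDescriptor"]["Columns"])
    return f"{table['DatabaseName']}.{table['Name']}: columns [{cols}]"

print(describe_table(glue_response))
# rdsmysql.employees: columns [emp_no, birth_date, first_name]
```

Running `describe_table` over every table returned for each crawled database yields the consolidated, multi-source catalog view described above.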
The consolidated Data Catalog has details about the data sources, such as schema, table names, and column names. The following is a sample of the consolidated Data Catalog output:
- We pass the consolidated Data Catalog to the prompt template and define the prompt used by LangChain:
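A sketch of such a prompt template follows. The wording and variable names (`catalog`, `query`) are illustrative assumptions, not the post's exact prompt; with LangChain, the same template would be wrapped as `PromptTemplate(input_variables=["catalog", "query"], template=TEMPLATE)`.

```python
# Hypothetical Chain Sequence 1 prompt: given the consolidated catalog,
# ask the LLM to name the best data source in database.table form.
TEMPLATE = """Given the data catalog below, return the database and table
best suited to answer the user's question, in the form database.table.

Data catalog:
{catalog}

Question: {query}
Answer:"""

catalog = "rdsmysql.employees: columns [emp_no, birth_date, first_name]"
prompt = TEMPLATE.format(
    catalog=catalog,
    query="Name all employees with birthdays this month",
)
print(prompt)
```

Because the catalog is injected at format time, the same template serves every query; only the user question changes per invocation.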
Chain Sequence 1: Detect source metadata for the user query using LangChain and an LLM
We pass the prompt template generated in the previous step, along with the user query, to the LangChain model to find the best data source to answer the question. LangChain uses the LLM model of our choice to detect source metadata.
Use the following code to use an LLM from JumpStart or a third-party model:
The generated text contains information such as the database and table names against which to run the user's query. For example, for the user query "Name all employees with birthdays this month", generated_text contains the information database == rdsmysql and database.table == rdsmysql.employees.
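Extracting that `database.table` pair from the model's free-text answer can be sketched as follows. The function name and regex are assumptions for illustration; real model output may need more robust parsing than a single pattern match.

```python
import re

def parse_source(generated_text: str):
    """Extract the database and table identified by the LLM.

    Assumes the model answers in the database.table form described above,
    e.g. 'rdsmysql.employees'.
    """
    match = re.search(r"([A-Za-z_]\w*)\.([A-Za-z_]\w*)", generated_text)
    if not match:
        return None
    return {"database": match.group(1), "table": match.group(2)}

print(parse_source("rdsmysql.employees"))
# {'database': 'rdsmysql', 'table': 'employees'}
```

The parsed pair is exactly what Chain Sequence 2 needs to open the right connection and scope the SQL generation to one table.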
Next, we pass the human resources domain details, Aurora MySQL database, and employees table to Chain Sequence 2.
Chain Sequence 2: Retrieve responses from the data sources to answer the user query
Next, we run the LangChain SQL database chain to convert the text to SQL and implicitly run the generated SQL against the database to retrieve the results in a simple, readable language.
We start by defining a prompt template that instructs the LLM to generate SQL in a syntactically correct dialect and then run it against the database:
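A sketch of such a template is shown below, modeled on the variables LangChain's SQL chains expose (`dialect`, `table_info`, `input`); the wording is illustrative, not LangChain's built-in prompt.

```python
# Hypothetical text-to-SQL prompt for Chain Sequence 2.
SQL_TEMPLATE = """You are a {dialect} expert. Given the table definitions
below, write a syntactically correct {dialect} query that answers the
question, and return only the SQL.

{table_info}

Question: {input}
SQLQuery:"""

sql_prompt = SQL_TEMPLATE.format(
    dialect="mysql",
    table_info=(
        "CREATE TABLE employees "
        "(emp_no INT, birth_date DATE, first_name VARCHAR(14))"
    ),
    input="Name all employees with birthdays this month",
)
print(sql_prompt)
```

Passing the dialect explicitly matters here because the same chain serves Redshift, MySQL, and Snowflake sources, whose SQL syntax differs in places such as date functions.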
Lastly, we pass the LLM, database connection, and prompt to the SQL database chain and run the SQL query:
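The end-to-end shape of this step can be sketched with a stubbed LLM and SQLite standing in for Aurora MySQL, since no model endpoint or cluster is available here. In the post's setup this is LangChain's SQL database chain (roughly `db_chain = SQLDatabaseChain.from_llm(llm, db, prompt=PROMPT)` followed by `db_chain.run(query)`); the employee names below are made-up sample data.

```python
import sqlite3

# SQLite stand-in for the Aurora MySQL employees table, with sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (first_name TEXT, birth_date TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Sumant", "1990-06-14"), ("Rosalie", "1985-02-03")],
)

def stub_llm(question: str) -> str:
    # Stands in for the LLM's text-to-SQL step; the month ('06') is
    # hard-coded for the demo rather than inferred from the question.
    return (
        "SELECT first_name FROM employees "
        "WHERE strftime('%m', birth_date) = '06'"
    )

# Generate SQL, run it, and wrap the rows in an English answer.
sql = stub_llm("Name all employees with birthdays this month")
rows = [r[0] for r in conn.execute(sql)]
answer = f"Employees with birthdays this month: {', '.join(rows)}"
print(answer)
# Employees with birthdays this month: Sumant
```

The real chain performs the same three moves, generate SQL, execute it, and verbalize the rows, with the LLM handling both the first and last steps.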
For example, for a user query “Name all employees with birthdays this month”, the response would be as follows:
After you run the modern data architecture with generative AI, make sure to clean up any resources you're not going to use. Shut down and delete the databases used (Amazon Redshift, Amazon RDS, Snowflake). In addition, delete the data in Amazon S3 and stop any Studio notebook instances to prevent further charges. If you used JumpStart to deploy an LLM as a SageMaker real-time endpoint, delete the endpoint through either the SageMaker console or Studio.
In this post, we integrated a modern data architecture with generative AI and LLMs within SageMaker. This solution uses various text-to-text foundation models from JumpStart as well as third-party models. This hybrid approach identifies the data source, writes the SQL query, and returns a response with the query results. It uses Amazon Redshift, Amazon RDS, Snowflake, and LLMs. To enhance the solution, you could add more databases, a UI for English queries, prompt engineering techniques, and data tools. This can be a smart, unified way to gain insights from multiple data stores. For a deeper dive into the solution and code featured in this post, see the GitHub repo. Also refer to Amazon Bedrock for use cases on generative AI, foundation models, and large language models.
| Domain | Database/API | Prompt | SQL (generated by the LLM) | Output |
|---|---|---|---|---|
| Sales & Marketing | Amazon Redshift | How many ticket sales are there? | | |
| Sales & Marketing | Amazon Redshift | What was the total ticket sales commission in 2008? | | |
| Legal | Amazon S3 | How many frauds were there in 2023? | | |
| Legal | Amazon S3 | How many policies were claimed this year? | | |
| Human Resources | Amazon Aurora MySQL | List all employees with birth dates this month | | |
| Human Resources | Amazon Aurora MySQL | How many employees were employed before 1990? | | |
| Finance and Investments | Snowflake | Which stocks performed the best and worst in May 2013? | | |
| Finance and Investments | Snowflake | What was the average volume of shares traded in July 2013? | | |
| Product – Weather | API | What's the current weather like in New York City in degrees Fahrenheit? | | |
About the Authors
Navneet Tuteja is a Data Specialist at Amazon Web Services. Prior to joining AWS, Navneet worked as a facilitator for organizations looking to modernize their data architecture and deploy comprehensive AI/ML solutions. He holds an engineering degree from Thapar University, as well as a master’s degree in statistics from Texas A&M University.
Sovik Kumar Nat is an AI/ML solution architect with AWS. He has extensive experience designing end-to-end machine learning and business analytics solutions in finance, operations, marketing, healthcare, supply chain management and IoT. Sovik has published articles and holds patents in monitoring ML models. He holds dual master’s degrees from the University of South Florida, University of Fribourg, Switzerland, and a bachelor’s degree from the Indian Institute of Technology, Kharagpur. Outside of work, Sovik enjoys traveling, taking the ferry and watching movies.