Data Engineering for Generative AI Applications on AWS

Sun, Sep 29, 2024


Raghuveer Varahagiri


Not all Generative AI applications are made the same. Let us explore the wide variety of scenarios we encounter and a structured framework for approaching the data handling necessary to support those use cases.

With the explosion of Generative AI use cases, the importance of good data engineering and judicious handling of various types of data has come to the forefront. To build a successful GenAI solution, it is essential that the right data is available in the right format, at the right time (without undue latency).

Amazon Bedrock brings powerful tools to developers and enables the creation of rich and complex Generative AI applications for all types of use cases. This can be combined with the equally powerful and mature data and analytics tooling that the AWS ecosystem provides to build a comprehensive solution that marries the latest in GenAI with robust underlying data specific to your use case.

Across industry domains and functions, one common use case we are seeing is that of chatbots. We use it as our running example: a public-facing chatbot published by a hypothetical airport and targeting travellers. This allows us to examine the nuances of the various scenarios that need to be addressed in the architecture and design of the overall solution.

Approach

Before we dive into the details, here is a useful framework to guide the end-to-end process of architecting the data engineering for Generative AI applications on AWS. I believe most of this discussion applies equally well to building solutions on other platforms, but we will use specific AWS services and components in our architecture examples.

Types of Data

Through the lens of a GenAI chatbot and the types of questions that a user might typically throw at our bot, we can classify those questions along at least two dimensions: whether the data required to address the user query is static or dynamic, and whether it is generic or domain-specific.

A useful way to classify these into a manageable set of categories is to look at each combination of these dimensions and at how one might handle the data pipelines and engineering required to support it.
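The two dimensions can be sketched as a small grid in Python. This is only an illustration under stated assumptions: the names `Freshness`, `Scope`, `Category`, and `describe`, the example airport questions, and the handling notes are all hypothetical and not taken from any AWS API or from a prescribed design.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Freshness(Enum):
    """Does the answer depend on data that changes over time?"""
    STATIC = "static"
    DYNAMIC = "dynamic"


class Scope(Enum):
    """Is the data general knowledge or specific to our airport?"""
    GENERIC = "generic"
    DOMAIN_SPECIFIC = "domain-specific"


@dataclass(frozen=True)
class Category:
    freshness: Freshness
    scope: Scope


# Assumed example questions and handling notes, for illustration only.
EXAMPLES = {
    Category(Freshness.STATIC, Scope.GENERIC): (
        "What should I pack for a long-haul flight?",
        "may be answerable from the model's general knowledge",
    ),
    Category(Freshness.STATIC, Scope.DOMAIN_SPECIFIC): (
        "Where is the lost-and-found desk in this airport?",
        "likely needs curated airport documents made available to the bot",
    ),
    Category(Freshness.DYNAMIC, Scope.GENERIC): (
        "What is the weather at my destination right now?",
        "likely needs a lookup against an external, frequently updated source",
    ),
    Category(Freshness.DYNAMIC, Scope.DOMAIN_SPECIFIC): (
        "Is my flight departing on time?",
        "likely needs fresh operational data delivered with low latency",
    ),
}

# Every combination of the two dimensions: four categories in total.
ALL_CATEGORIES = [Category(f, s) for f, s in product(Freshness, Scope)]


def describe(category: Category) -> str:
    """Render one cell of the grid as a single readable line."""
    question, handling = EXAMPLES[category]
    return (
        f"{category.freshness.value} / {category.scope.value}: "
        f"{question!r} -> {handling}"
    )


if __name__ == "__main__":
    for category in ALL_CATEGORIES:
        print(describe(category))
```

Listing one representative question per cell makes it easier to check that every combination has been considered before any pipeline design begins.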

You can further identify the specific areas where heavy data engineering would be necessary.

Beyond these two dimensions, there are a number of others along which one may classify the questions. But it is wise to keep this manageable and choose only what has a material impact on your application’s functionality and architecture.

Scenarios and Architectures

Now let us take these example questions and arrange them along a scale, from what is trivial all the way to what is disallowed.

Conclusion

The following are some takeaways from this exercise. They enable a comprehensive review and analysis of the functionality desired from the GenAI application (a chatbot in this case) and the application of the right filters to arrive at a well-adapted architecture that provides the right data foundation for a successful outcome for the overall application.