The AWS AI stack reconfigured
In December 2024, on a stage in Las Vegas, AWS renamed its most important AI product.
The original Amazon SageMaker, the service that had defined cloud machine learning for seven years, became SageMaker AI. The name SageMaker, reassigned, now belonged to something larger: a unified data-and-AI platform bringing analytics, data engineering, machine learning, and generative AI development under one umbrella.
If you were not paying close attention, the rename felt like a footnote. It was not. It was a hyperscaler admitting that the conceptual architecture it had been teaching the industry since 2017 no longer fit what it actually sold.
This is a story about how AWS got there, and why the three-layer ML stack diagram you learned in 2018 is now outdated.
2017: the diagram that worked
When Andy Jassy introduced SageMaker at re:Invent 2017, AWS gave the industry a mental model that would shape cloud AI thinking for the next half-decade. Machine learning came in three layers: frameworks and infrastructure at the bottom, platform services in the middle, and application services on top.

The genius of the framing was not the boxes. It was the choice. A developer could enter the stack at any layer: call an API for instant intelligence, build a custom model on the platform, or drop down to bare infrastructure when maximum control mattered.
SageMaker collapsed what had been a weeks-long workflow of provisioning notebooks, configuring training infrastructure, and stitching together deployment systems into something a developer could do in an afternoon.
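For concreteness, here is roughly what that afternoon looks like with the SageMaker Python SDK. This is a minimal sketch, not a tutorial: the training script, S3 path, and IAM role below are placeholders, and framework versions drift over time.

```python
# Minimal sketch of the post-2017 workflow via the SageMaker Python SDK.
# train.py, the S3 path, and the IAM role ARN are all placeholders.
from sagemaker.pytorch import PyTorch

role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder

# Managed training: SageMaker provisions the instance, runs the script,
# uploads model artifacts to S3, and tears the instance down.
estimator = PyTorch(
    entry_point="train.py",       # your training script (hypothetical)
    role=role,
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    framework_version="2.1",
    py_version="py310",
)
estimator.fit({"train": "s3://my-bucket/train/"})  # placeholder S3 path

# Managed hosting: one call stands up a real-time HTTPS endpoint.
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
)
```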
It was a real architectural insight. It was also the last time the diagram cleanly described what AWS sold.
Pivot 1: SageMaker engulfs the lifecycle
For the next five years, AWS poured services into the middle layer. Ground Truth brought managed data labeling. Studio made SageMaker feel like an integrated ML development environment. Autopilot, Experiments, Debugger, Processing, and Model Monitor filled in the operational gaps around repeatable model development.
Then Data Wrangler and Feature Store admitted what practitioners already knew: data preparation and reusable data features were not side quests. They were most of the work. Pipelines brought CI/CD logic to ML workflows. Clarify turned responsible AI from a slide into a product surface. Canvas quietly expanded the audience from data scientists to business analysts.
By re:Invent 2022, the announcements had shifted toward governance: Role Manager for access, Model Cards for documentation, and Model Dashboard for central visibility. The implicit customer question had changed from “How do we build a model?” to “How do we govern fifty of them across a regulated business?”
It is tempting to read these years as incremental product expansion. They were more than that. SageMaker was reluctantly acknowledging, one feature at a time, that the actual job of running ML in an enterprise was bigger and messier than the original build-train-deploy trinity suggested.
Underneath all of it, AWS was running a longer game. The 2015 Annapurna Labs acquisition later surfaced as Inferentia for inference and Trainium for training. AWS had concluded early that the binding constraint on the next decade of AI would not only be model architecture. It would be the unit economics of training and inference at scale.
Pivot 2: Bedrock appears as a new pillar
ChatGPT’s release in November 2022 did not just change the AI conversation. It created a strategic emergency for every cloud provider.
In April 2023, AWS announced Amazon Bedrock. By September it was generally available. By re:Invent 2023, AWS had added Knowledge Bases, Agents for Bedrock, Guardrails, model evaluation, Amazon Q, and SageMaker HyperPod.
The interesting choice was not that AWS shipped a generative AI platform. Everyone shipped one in 2023. The interesting choice was where it lived in the stack.
AWS could have grafted foundation models onto SageMaker as another tab in the existing middle layer. Instead, it built Bedrock as a separate platform pillar with its own vocabulary, pricing model, console experience, and abstraction. SageMaker abstracted the machinery of model building, training, and deployment. Bedrock abstracted access to models you did not train and might never own.
This was an explicit bet on model choice: that customers would value the ability to choose among Amazon models, Anthropic models, Meta models, Cohere models, and others more than they would value a single privileged model path. In 2023, this looked like AWS hedging because it did not have a flagship frontier model of its own. In 2026, it looks more like AWS turning portability into the product.
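The portability claim is concrete at the API level. Here is a minimal sketch using boto3 and the Bedrock Converse API: the request shape stays constant while only the model ID changes. The model IDs shown are illustrative; exact identifiers and regional availability shift over time.

```python
# Sketch of Bedrock's model-choice bet: one request shape, many providers.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def ask(model_id: str, prompt: str) -> str:
    # The Converse API normalizes request/response shapes across providers.
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]

# Swapping providers is a one-string change, not a rewrite.
for model_id in [
    "amazon.nova-lite-v1:0",
    "anthropic.claude-3-5-sonnet-20240620-v1:0",
    "meta.llama3-70b-instruct-v1:0",
]:
    print(ask(model_id, "Summarize this incident report in one sentence."))
```

The design choice worth noticing is that the unit of switching is a string, which is what makes evaluation harnesses, fallback chains, and cost arbitrage across models cheap to build.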
The conceptual stack diagram now had a problem. The three-layer cake had a fourth pillar bolted onto its side, and nobody had updated the slides.
Pivot 3: the diagram quietly dies
re:Invent 2024 is where the old diagram really came apart.
Three things mattered at that show, and they were really one thing.
First, the rename. The existing SageMaker became SageMaker AI. The next-generation SageMaker became a unified platform for data, analytics, and AI, including SageMaker Unified Studio, SageMaker Lakehouse, and data and AI governance capabilities. The thesis was simple: AI quality is data quality, and AWS was done pretending those were separate platform stories.
Second, Amazon Nova. AWS shipped its own foundation model family in Bedrock: Nova Micro, Nova Lite, Nova Pro, Nova Canvas, and Nova Reel. After two years of being seen as the cloud without a flagship first-party model, AWS now had one. Nova on Bedrock, alongside third-party models, completed the model-choice bet rather than replacing it.
Third, the silicon thesis became more visible. HyperPod, Trainium, UltraClusters, and large-scale training infrastructure moved from background economics to front-stage strategy. The infrastructure layer was no longer just “where the models run.” It was part of the AI product story.
Step back and look at what 2024 actually announced. The middle layer had absorbed the data layer. The model layer had gained first-party foundation models. The silicon layer had become a competitive moat. The original three-layer diagram could not accommodate all of this without becoming illegible.
So AWS did not merely update the diagram. It moved on from it.
Pivot 4: a new top of the stack
By 2025, the word that kept appearing was not model, training, or even data. It was agent.
Amazon Bedrock AgentCore gave AWS a managed runtime and operating layer for AI agents: session isolation, memory, identity, tool integration, observability, gateway capabilities, browser use, and code interpretation. Later updates added stronger evaluation and policy controls.
This matters because agents are not just prompt-response applications. They reason, call tools, maintain context, interact with enterprise systems, and run over longer task horizons. That requires a new class of runtime, governance, and observability primitives.
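To see why, it helps to look at the loop underneath an agent. The sketch below writes it by hand against the Bedrock Converse API; AgentCore's pitch is running this loop with real session isolation, memory, and observability instead of a while-loop in a script. The tool, its stub implementation, and the model ID here are hypothetical.

```python
# Hand-rolled version of the loop an agent runtime manages at scale:
# the model reasons, requests a tool, we execute it, and feed the result back.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
MODEL_ID = "amazon.nova-pro-v1:0"  # illustrative; any tool-capable model

TOOLS = {"tools": [{
    "toolSpec": {
        "name": "get_order_status",  # hypothetical enterprise tool
        "description": "Look up the shipping status of an order by ID.",
        "inputSchema": {"json": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        }},
    }
}]}

def get_order_status(order_id: str) -> dict:
    return {"order_id": order_id, "status": "in transit"}  # stubbed system call

messages = [{"role": "user", "content": [{"text": "Where is order 4817?"}]}]

while True:
    response = bedrock.converse(
        modelId=MODEL_ID, messages=messages, toolConfig=TOOLS,
    )
    message = response["output"]["message"]
    messages.append(message)
    if response["stopReason"] != "tool_use":
        break  # the model produced a final answer
    results = []
    for block in message["content"]:
        if "toolUse" in block:
            use = block["toolUse"]
            output = get_order_status(**use["input"])  # dispatch (one tool here)
            results.append({"toolResult": {
                "toolUseId": use["toolUseId"],
                "content": [{"json": output}],
            }})
    messages.append({"role": "user", "content": results})

print(message["content"][0]["text"])
```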
Nova also moved forward. Nova 2 introduced reasoning-oriented models in Bedrock, while Nova Forge pushed customization deeper into the model development lifecycle. Instead of only fine-tuning the surface of a completed model, customers could work from Nova checkpoints and blend proprietary data into earlier training phases.
The infrastructure layer kept moving too. Trainium3-powered EC2 Trn3 UltraServers and AWS AI Factories signaled that accelerator economics, sovereignty, and deployment locality were all becoming part of the AI platform conversation.
The important change is not any single product. It is the implicit shift in who the customer is. The original 2017 diagram assumed a human at the top: a developer calling Rekognition, a data scientist building a model, or an infrastructure engineer optimizing a GPU instance. Agentic systems blur that. The “user” at the top of the stack is increasingly software acting on behalf of a human who is not watching every step.
That is what kills the old diagram for good. The diagram described abstraction levels for human consumers. What AWS is selling now is closer to a continuous loop: data flows into models, models drive agents, agents act on systems, systems generate more data, and that data feeds back.
The layers are still there. They are just no longer stacked neatly.
Five things the chronology shows
1. The center of gravity keeps moving up. AWS moved from training infrastructure, to managed ML, to MLOps, to foundation-model application development, to agent runtimes. Each layer commoditizes parts of the one beneath it.
2. The silicon thesis was the long game. Inferentia and Trainium were not side products. They were AWS preparing for a world where the cost of inference and training would become a first-order platform concern.
3. AWS bet on model choice. Bedrock’s strategic claim is not just “use generative AI on AWS.” It is “use many models through one managed platform.” That portability became more valuable as model markets moved quickly.
4. Generative AI did not replace classical ML. SageMaker did not disappear. It became a deeper subsystem for teams that still need data preparation, training, tuning, deployment, monitoring, governance, and customization.
5. The naming became a liability. Bedrock, AgentCore, Q, Q Developer, Q Business, Nova, Nova Forge, SageMaker, SageMaker AI, SageMaker Unified Studio, SageMaker Lakehouse, SageMaker HyperPod: a reasonably informed customer can be forgiven for not knowing which name is a product, feature, platform, assistant, or umbrella brand.
The diagram problem is also a vocabulary problem. The vocabulary problem can become a sales problem.
What is actually next
The three-layer stack diagram had a job: it taught a generation of enterprises that machine learning was not just an algorithm. It was a workflow, an infrastructure problem, and an organizational discipline. That job is done.
What comes next is not a cleaner diagram. It is a quieter realization that the diagram metaphor itself no longer fits. The interesting question for the next wave of AWS AI announcements is not what AWS will add to the stack. It is whether AWS can keep the abstractions stable enough to sell while what it is actually building is a recursive loop in which its own services increasingly become each other’s customers.
Eight years ago, AWS taught us to think in layers. The layers worked. Then they kept absorbing each other until there was nothing clean left on either side of the boundary lines.
The stack did not just get more complicated. It collapsed into a system.
What replaces it does not have a settled name yet. That may be the most honest signal anyone has about where this is going.