<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="https://feeds.feedblitz.com/feedblitz_rss.xslt"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	 xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
<channel>
	<title>Computer Architecture Today</title>
	<atom:link href="https://www.sigarch.org/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.sigarch.org</link>
	<description>Informing the broad computing community about current activities, advances and future directions in computer architecture.</description>
	<lastBuildDate>Fri, 10 Apr 2026 14:00:43 -0400</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	
<image>
	<url>https://www.sigarch.org/wp-content/uploads/2017/03/logo_rgb.png</url>
	<title>Computer Architecture Today</title>
	<link>https://www.sigarch.org</link>
</image> 
<site xmlns="com-wordpress:feed-additions:1">125883397</site>
<meta xmlns="http://www.w3.org/1999/xhtml" name="robots" content="noindex" />
<item>
<feedburner:origLink>https://www.sigarch.org/computer-architectures-alphazero-moment-is-here/</feedburner:origLink>
		<title>Computer Architecture&#8217;s AlphaZero Moment is Here</title>
		<link>https://feeds.feedblitz.com/~/953617784/0/sigarch-cat~Computer-Architectures-AlphaZero-Moment-is-Here/</link>
		<comments>https://feeds.feedblitz.com/~/953617784/0/sigarch-cat~Computer-Architectures-AlphaZero-Moment-is-Here/#comments</comments>
		<pubDate>Fri, 10 Apr 2026 14:00:43 +0000</pubDate>
		<dc:creator><![CDATA[Karu Sankaralingam]]></dc:creator>
		<category><![CDATA[ACM SIGARCH]]></category>
		<category><![CDATA[Architecture]]></category>
		<category><![CDATA[Evaluation]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<guid isPermaLink="false">https://www.sigarch.org/?p=102162</guid>
		<description><![CDATA[<div><img width="300" height="187" src="https://www.sigarch.org/wp-content/uploads/2026/04/Gemini_Generated_Image_yndetsyndetsynde-1-300x187.png" class="attachment-medium size-medium wp-post-image" alt="" style="max-width:100% !important;height:auto !important;margin-bottom:15px;margin-left:15px;float:right;"  fetchpriority="high" /></div>For decades, we have designed chips in fundamentally the same way: human intuition applied to a vanishingly small slice of an impossibly large design space. That paradigm worked when Moore&#8217;s Law was lifting everything. We could afford to be wrong. We could afford to miss the best design. Process scaling would close the gap. That [&#8230;]]]>
</description>
				<content:encoded><![CDATA[<div><img width="300" height="187" src="https://www.sigarch.org/wp-content/uploads/2026/04/Gemini_Generated_Image_yndetsyndetsynde-1-300x187.png" class="attachment-medium size-medium wp-post-image" alt="" style="margin-bottom:15px;margin-left:15px;float:right;" decoding="async" /></div><div>
<div></div>
<div>For decades, we have designed chips in fundamentally the same way: human intuition applied to a vanishingly small slice of an impossibly large design space. That paradigm worked when Moore&#8217;s Law was lifting everything. We could afford to be wrong. We could afford to miss the best design. Process scaling would close the gap.</div>
<div></div>
<div>That world is over. In a recent position paper — <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/2604.03312">&#8220;Computer Architecture&#8217;s AlphaZero Moment: Automated Discovery in an Encircled World&#8221;</a> — I argue that we are at an inflection point. Not a gradual shift, but a structural break in how architecture must be practiced.</div>
<div></div>
<h3>From Idea Scarcity to Evaluation Scarcity</h3>
<div>The central claim is simple, but uncomfortable:</div>
<div></div>
<div><em>Computer architecture is no longer bottlenecked by ideas. It is bottlenecked by evaluation and telemetry.</em></div>
<div></div>
<div>For decades, the field has implicitly assumed that ideas are scarce — that the role of the architect is to generate the one clever mechanism worth exploring. Everything else follows. But recent evidence suggests the opposite. With modern large language models and agentic pipelines, hundreds of viable architectural ideas can be generated per day, thousands of candidate designs can be evaluated per week, and design cycles can compress from months to weeks.</div>
<div></div>
<div>This is not speculative. We built a system called the <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://github.com/VerticalResearchGroup/Gauntlet">Gauntlet</a> and tested it on 85 papers from ISCA 2025 and HPCA 2026 — largely outside the model&#8217;s training data. Across 475 independent runs, it produced viable architectural mechanisms 95% of the time: independently re-deriving authors&#8217; exact solutions in 48% of cases, and proposing valid alternatives the authors never considered in another 50%. Each took 10–20 minutes. This flips a foundational assumption of the field. If ideas are abundant, then the limiting factor is no longer creativity — it is <strong>which ideas we can evaluate, validate, and trust</strong>. The full corpus of problem statements and the Gauntlet&#8217;s solutions is available at this <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://pages.cs.wisc.edu/~karu/ArchAlphaZero/zero-arch/html/">link</a>.</div>
</div>
<div></div>
<div>
<h4>1. Evaluation is the new bottleneck</h4>
<p>We are moving from a world where the question was &#8220;Can we come up with a good idea?&#8221; to one where the question becomes &#8220;Can we evaluate 10,000 ideas fast enough to find the best one?&#8221; This elevates simulation infrastructure, analytical modeling, and verification into the central problems of the field. The &#8220;PhD student for three months&#8221; implementation bottleneck is already eroding — our system built first-principles performance models from papers in under 20 minutes. What replaces it is a race to build faster, more accurate, and more scalable evaluation pipelines.</p>
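<p>As a concrete, deliberately toy illustration of what a fast evaluation pipeline can look like, the sketch below ranks hypothetical design points with a roofline-style analytical model. The peak-throughput numbers and the two-parameter design encoding are invented for illustration; this is not the modeling the Gauntlet actually performs.</p>

```python
def roofline_time(flops, bytes_moved, peak_flops, peak_bw):
    """Roofline-style lower bound on kernel runtime (seconds): the
    kernel is limited by whichever is slower, compute or memory traffic."""
    return max(flops / peak_flops, bytes_moved / peak_bw)

def rank_designs(kernel, designs):
    """Order candidate designs (peak FLOP/s, peak bytes/s) by modeled runtime."""
    flops, bytes_moved = kernel
    return sorted(designs, key=lambda d: roofline_time(flops, bytes_moved, *d))

# Toy kernel: 1 GFLOP of work moving 100 MB of data.
kernel = (1e9, 1e8)
candidates = [(1e12, 1e10), (5e11, 1e11), (2e12, 5e9)]
best = rank_designs(kernel, candidates)[0]   # bandwidth-rich design wins here
```

<p>The point is not the model&#8217;s fidelity but its cost: a closed-form estimate like this can triage thousands of candidates before a single cycle-level simulation runs.</p>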
<h4>2. The telemetry divide</h4>
<div>If evaluation becomes central, then <strong>ground truth becomes everything. </strong>Over time, access to closed-loop deployment telemetry — real workloads, real performance counters, real system behavior at scale, and in low-level depth — may matter as much as architectural insight itself. This creates a risk of structural divide. Academic research, long dependent on proxy benchmarks, could drift further from production reality unless we collectively rethink how we share and access workload data.</div>
<h4>3. The end of the old boundary</h4>
<div>The traditional separation between &#8220;chip company&#8221; and &#8220;cloud provider&#8221; begins to dissolve. Automated architecture requires three tightly coupled capabilities: deployment (to generate telemetry), infrastructure (to evaluate designs at scale), and silicon expertise (to realize designs physically). No single traditional player owns all three. The result is convergence — either through vertical integration or new hybrid ecosystems.</div>
<h3>The Deeper Claim</h3>
<div>The more provocative claim is not about tools — it is about limits. Human-driven architecture is becoming structurally outmatched by the scale of the design space. This is not a statement about human ability. It is about combinatorics. The architectural search space — spanning parametric and structural choices — is effectively unbounded. Humans sample an infinitesimal fraction of it. That was acceptable in an era of abundance. It is not acceptable in an era where architectural efficiency is the primary lever for progress. The analogy to <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/1712.01815">AlphaZero </a> is not rhetorical. It is structural: when search, evaluation, and feedback loops become fast enough, intuition gives way to systematic exploration.</div>
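<p>The combinatorics are easy to make concrete. A minimal sketch, with invented knob names and values, counts only a handful of parametric choices for a single cache hierarchy:</p>

```python
from math import prod

# Hypothetical knobs for one cache hierarchy; names and values are
# illustrative only, and a real design space adds structural choices
# (pipeline organization, interconnect topology) on top of these.
design_space = {
    "l1_size_kb": [16, 32, 64, 128],
    "l1_assoc":   [2, 4, 8],
    "l2_size_kb": [256, 512, 1024, 2048],
    "l2_assoc":   [4, 8, 16],
    "line_bytes": [32, 64, 128],
    "prefetcher": ["none", "stride", "stream"],
}

# Every knob multiplies the space: six knobs already yield 1,296 points.
n_configs = prod(len(v) for v in design_space.values())
```

<p>Each additional knob multiplies the count, and structural choices blow it up further, which is why human sampling covers an infinitesimal fraction of the space.</p>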
<h3>What This Means for Research — and Teaching</h3>
<div>If this framing is even partially correct, it forces a rethinking of what it means to &#8220;do&#8221; computer architecture research. Several shifts seem likely. If machines can generate many viable solutions, identifying the <em>right problem</em> becomes the scarce intellectual act. Evaluation frameworks, modeling techniques, and telemetry integration may matter more than individual architectural ideas. And the reliance on fixed benchmark suites becomes increasingly fragile in a world driven by dynamic, evolving workloads.</div>
<div></div>
<div>The full paper includes a set of predictions and my opinions on how I see this playing out. This extends to how we teach. Do we still emphasize canonical microarchitectures, or shift toward trade-off reasoning, evaluation frameworks, and interpreting machine-generated designs? What does it mean to train a researcher when idea generation itself is becoming automated?</div>
<h3>A Call for Collaboration</h3>
<div>This is not a settled direction — it is a hypothesis that needs to be stress-tested by the community. If this resonates (or if you think it is completely wrong), I would love to engage on: new models for teaching architecture, shared evaluation infrastructure and artifacts, privacy-preserving approaches to workload telemetry, and workshops focused on problem formulation rather than solution novelty. If this is even half right, we may need to rethink our identity as a field. Let&#8217;s debate it.</div>
<div></div>
<div><strong>About the author:</strong> Karthikeyan Sankaralingam is Principal Research Scientist at NVIDIA and Professor at UW-Madison.</div>
<div></div>
<div>Disclaimer: These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.</div>
</div>
]]>
</content:encoded>
			<wfw:commentRss>https://feeds.feedblitz.com/~/953617784/0/sigarch-cat~Computer-Architectures-AlphaZero-Moment-is-Here/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	<post-id xmlns="com-wordpress:feed-additions:1">102162</post-id></item>
<item>
<feedburner:origLink>https://www.sigarch.org/spilling-the-neural-tea-a-journey-down-the-side-channel/</feedburner:origLink>
		<title>Spilling the Neural Tea: A Journey Down the Side-Channel</title>
		<link>https://feeds.feedblitz.com/~/953386850/0/sigarch-cat~Spilling-the-Neural-Tea-A-Journey-Down-the-SideChannel/</link>
		<comments>https://feeds.feedblitz.com/~/953386850/0/sigarch-cat~Spilling-the-Neural-Tea-A-Journey-Down-the-SideChannel/#respond</comments>
		<pubDate>Mon, 06 Apr 2026 15:37:22 +0000</pubDate>
		<dc:creator><![CDATA[Adnan Rakin]]></dc:creator>
		<category><![CDATA[ACM SIGARCH]]></category>
		<category><![CDATA[deep neural networks]]></category>
		<category><![CDATA[Security]]></category>
		<category><![CDATA[side-channels]]></category>
		<guid isPermaLink="false">https://www.sigarch.org/?p=101933</guid>
		<description><![CDATA[<div><img width="300" height="164" src="https://www.sigarch.org/wp-content/uploads/2026/04/Gemini_Generated_Image_ptypq1ptypq1ptyp-300x164.png" class="attachment-medium size-medium wp-post-image" alt="" style="max-width:100% !important;height:auto !important;margin-bottom:15px;margin-left:15px;float:right;"  loading="lazy" /></div>Years ago, I came across three pioneering works (CSI-NN, Cache Telepathy, and DeepSniffer) in the field of reverse engineering neural networks that inspired my journey into side-channel attacks to uncover the secrets of modern Deep Neural Networks (DNNs). Fast forward to today, and there has been significant exploitation of side-channel attacks to discover the secrets [&#8230;]]]>
</description>
				<content:encoded><![CDATA[<div><img width="300" height="164" src="https://www.sigarch.org/wp-content/uploads/2026/04/Gemini_Generated_Image_ptypq1ptypq1ptyp-300x164.png" class="attachment-medium size-medium wp-post-image" alt="" style="margin-bottom:15px;margin-left:15px;float:right;" decoding="async" loading="lazy" /></div><p><span style="font-weight: 400;">Years ago, I came across three pioneering works (</span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.usenix.org/conference/usenixsecurity19/presentation/batina"><span style="font-weight: 400;">CSI-NN</span></a><span style="font-weight: 400;">, </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://iacoma.cs.uiuc.edu/iacoma-papers/usenix20.pdf"><span style="font-weight: 400;">Cache Telepathy</span></a><span style="font-weight: 400;">, and </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://dl.acm.org/doi/10.1145/3373376.3378460"><span style="font-weight: 400;">DeepSniffer</span></a><span style="font-weight: 400;">) in the field of reverse engineering neural networks that inspired my journey into side-channel attacks to uncover the secrets of modern Deep Neural Networks (DNNs). Fast forward to today, and there has been significant exploitation of side-channel attacks to discover the secrets of neural networks. It&#8217;s a good time to provide an overview of where we stand, the outlook for the future, and the challenges ahead.</span></p>
<p><b>Motivation: </b><span style="font-weight: 400;">Let&#8217;s take a step back and first understand why we care about secrets in deep learning models. It boils down to two fundamental challenges associated with deep learning: i) financial and ii) security and privacy. In general, DNNs are intellectual property (IP): they are the product of years of research, implementation, and investment in compute, and they entail significant training costs (time, energy, and labor), making them a valuable asset for their owners. As a rough estimate, OpenAI&#8217;s </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://openai.com/index/gpt-4-research/"><span style="font-weight: 400;">GPT-4</span></a><span style="font-weight: 400;"> cost more than $100 million to train, and its </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://openai.com/gpt-5/"><span style="font-weight: 400;">GPT-5</span></a><span style="font-weight: 400;"> model is expected to be more than 5x as expensive (</span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.forbes.com/sites/katharinabuchholz/2024/08/23/the-extreme-cost-of-training-ai-models/"><span style="font-weight: 400;">Cost of Training GPT</span></a><span style="font-weight: 400;">). I do not know about you, but if I spent $100 million on something, I would care about protecting it. The second challenge is that a model&#8217;s secrets give an adversary white-box knowledge, which is extremely powerful in security and privacy settings. Any adversary who knows a target victim&#8217;s model architecture (e.g., model type, layer sequence, and layer count) and weights, a setting formally defined as “white-box,” can launch powerful security attacks (adversarial attacks) and privacy attacks (model inversion and membership inference attacks). 
As highlighted in Figure 1, the attacker’s final objective in the DNN reverse-engineering attack is to gain white-box privileges either to steal IP for financial gain or to launch subsequent attacks.</span></p>
<p><i><span style="font-weight: 400;">In summary, in security and privacy research, defining the threat model is the first step towards any exploitation, and the underlying assumption is often that a reverse-engineering attack has successfully uncovered the model architecture, weights, and other hyperparameters.</span></i></p>
<p><b>Attack Objectives: </b><span style="font-weight: 400;">By now, we have established that an attacker&#8217;s goal is to uncover two key properties of a victim DNN: its architecture and its parameters. However, this goal is oversimplified and can be misleading. To see why, consider a deep neural network as a function of x, denoted f(x). If the attacker wants to recover the exact victim model, the objective is for the stolen model to be identical to the original f(x), which is practically impossible for large-scale DNNs, even with existing side-channel attacks or access to the exact victim dataset. A more practical and plausible goal is functional equivalence: if the stolen function is some different g(x), all the attacker really needs is for the two functions to produce identical outputs, i.e., f(x) = g(x), for the inputs x that are of the attacker&#8217;s interest. Achieving functional equivalence thus means recovering a model architecture as close as possible to the victim architecture&#8217;s topology. On the weight side, even if an attacker cannot extract the exact weights, they must aim for a weight-space solution that captures the victim model&#8217;s functionality.</span></p>
<p><i><span style="font-weight: 400;">In summary, to steal a copy of the victim model/function, an attacker must identify the victim model architecture. In modern deep learning, where most practical applications use some version of a DNN model from an existing pool (e.g., GPT, Llama), recovering the architecture often boils down to detecting the model&#8217;s topology. Once the architecture is revealed, the attacker must recover the model parameters/weights, which is often a challenging part of the attack. Then again, as we discussed earlier, exact model recovery can be challenging, but achieving functional equivalence is a modest objective. Most importantly, to achieve functional equivalence, the attacker may not need to reveal the exact numerical weights; rather, gradually recovering coarse-grained information (e.g., weight sparsity, quantization pattern, weight distribution) is often sufficient.</span></i></p>
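<p>Operationally, functional equivalence is a statement about agreement on inputs the attacker cares about, not about matching weights. The toy sketch below, with threshold classifiers standing in for DNNs and all names and numbers invented, makes that distinction explicit:</p>

```python
import random

def agreement_rate(f, g, inputs):
    """Fraction of inputs on which the stolen model g reproduces the
    victim model f's decision (top-1 agreement)."""
    return sum(f(x) == g(x) for x in inputs) / len(inputs)

def f(x):   # victim: a toy threshold classifier standing in for a DNN
    return int(x > 0.50)

def g(x):   # "stolen" substitute: a different parameter, a similar function
    return int(x > 0.52)

random.seed(0)
xs = [random.random() for _ in range(10_000)]
rate = agreement_rate(f, g, xs)   # close to 1.0 although f and g differ
```

<p>Here g shares no exact parameter with f, yet agrees with it on the vast majority of random inputs; that level of equivalence, on the inputs of interest, is what a practical attacker targets.</p>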
<p><img loading="lazy" decoding="async" class="alignnone wp-image-101935" src="https://www.sigarch.org/wp-content/uploads/2026/04/image7.png" alt="" width="884" height="483" /></p>
<p>Figure 1: Spectrum of attack threats characterized by attacker’s knowledge: Black-Box (No Knowledge), Grey-Box (Partial Knowledge, e.g., architecture), and White-box (Complete knowledge of model architecture and weights), the ultimate goal of reverse-engineering (AI-generated).</p>
<p><b>Attack Techniques and Capabilities.</b> <span style="font-weight: 400;">The two popular families of side-channel attack, physical and microarchitectural, correspond to two different threat-model settings. On edge and embedded devices, the physical side channel is the dominant threat, and several works (</span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.usenix.org/conference/usenixsecurity19/presentation/batina"><span style="font-weight: 400;">CSI-NN</span></a><span style="font-weight: 400;">, </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.usenix.org/conference/usenixsecurity25/presentation/horvath"><span style="font-weight: 400;">BarraCUDA</span></a><span style="font-weight: 400;">) have shown that it is possible to recover the model architecture and weights of simple neural networks. Microarchitectural side channels, in contrast, are the natural choice for resource-sharing cloud environments, where users can upload and run code colocated with a victim (e.g., </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://aws.amazon.com/sagemaker/"><span style="font-weight: 400;">Amazon SageMaker</span></a><span style="font-weight: 400;"> and </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://cloud.google.com/ml-engine/docs/technical-overview"><span style="font-weight: 400;">Google ML Engine</span></a><span style="font-weight: 400;">). 
Microarchitectural attacks have been successful in recovering model architecture across the board using </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.usenix.org/conference/usenixsecurity20/presentation/yan"><span style="font-weight: 400;">cache timing channels</span></a><span style="font-weight: 400;">, </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://dl.acm.org/doi/10.1145/3373376.3378460"><span style="font-weight: 400;">memory access patterns</span></a><span style="font-weight: 400;">, and </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://ieeexplore.ieee.org/document/9153424/"><span style="font-weight: 400;">GPU context switching</span></a><span style="font-weight: 400;">. I acknowledge that there are many ways to recover DNN model weights, including </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/tramer"><span style="font-weight: 400;">learning-based approaches</span></a><span style="font-weight: 400;"> and </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.usenix.org/conference/usenixsecurity20/presentation/jagielski"><span style="font-weight: 400;">mathematical recovery</span></a><span style="font-weight: 400;"> techniques. In this blog post, I focus on side-channel attacks. At the same time, learning-based approaches can work as a complementary approach with side-channel attacks once the architecture information has already been leaked. </span></p>
<p><i><span style="font-weight: 400;">In summary, while side-channel attacks have been successful in leaking model architecture information, as the scale of modern DNNs, e.g., LLM weights, continues to reach new heights of billions, none of the existing side channels can scalably and predictably recover model parameter information. A common workaround would be to support these methods with a learning approach, assuming an attacker has a partial training set, which may not be practical, even in a resource-sharing environment where data remains private.</span></i></p>
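<p>To make architecture recovery concrete, here is a hedged toy of the insight behind Cache Telepathy: a blocked GEMM executes a predictable number of block iterations per loop, so trip counts observed through a cache channel bound the matrix (and hence layer) dimensions. The block size and trip counts below are invented for illustration, and the sketch assumes dimensions that divide the block evenly.</p>

```python
def infer_gemm_dims(iters_m, iters_n, iters_k, block=64):
    """Toy Cache-Telepathy-style inference: a blocked GEMM runs
    ceil(M/block) x ceil(N/block) x ceil(K/block) block iterations,
    so leaked per-loop trip counts pin each dimension down to within
    one block (exactly, when the dimension divides the block)."""
    return iters_m * block, iters_n * block, iters_k * block

# A victim layer computing a 512 x 1024 x 768 GEMM with 64-wide
# blocks shows trip counts of 8, 16, and 12 in the three loops.
m, n, k = infer_gemm_dims(8, 16, 12)
```

<p>Layer dimensions recovered this way constrain the architecture search enough that the remaining topology can often be enumerated.</p>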
<p><b><i>Future Challenges and Opportunities: </i></b></p>
<p><b>What is the future of architecture-recovery attacks, given the success of existing side channels?</b></p>
<p><i><span style="font-weight: 400;">As the next wave of vision and language architectures emerges, it presents new challenges and opportunities for the microarchitectural side-channel community. These models rely on modern compute support that accelerates inference (e.g., tensor cores), and each new GPU generation may leave new traces of side-channel information. Newer compute platforms and their associated architectural support therefore demand new innovation in side-channel capabilities for recovering model architecture. Architecture recovery remains essential: without it, model parameter recovery is of little use. Moreover, as LLMs emerge as the dominant model class, the question is not just about recovering weights or architecture; leaking other components, such as the KV cache in a multi-tenant setting, can lead to </span></i><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.ndss-symposium.org/wp-content/uploads/2025-1772-paper.pdf"><i><span style="font-weight: 400;">privacy leakage</span></i></a><i><span style="font-weight: 400;">.</span></i></p>
<p><b>Can a microarchitectural side channel alone ever be sufficient to recover model weight information? </b></p>
<p><span style="font-weight: 400;">The sheer scale of modern models poses an even greater challenge for recovering weights, making direct recovery an ambitious, perhaps impossible, goal; instead, we should focus on functional equivalence. Toward that end, weight recovery methods can serve as stepping stones that augment learning-based recovery. </span></p>
<p><i><span style="font-weight: 400;">Complete weight recovery using a side channel at the scale of LLMs or even a smaller vision model may be too ambitious. Instead, attacks should focus on coarse-grained information about weights, such as model sparsity levels, quantization mechanisms, weight sign recovery, and other optimization techniques. The key idea is to achieve functional equivalence by first recovering coarse-grained information, which is sufficient to support other learning-based recovery. It is time to work towards an achievable target: recovering this statistical weight-level knowledge and studying how critical its role is in improving subsequent attacks. As models and their computation units are increasingly optimized, leaking information such as sparsity levels or bit-widths will become more feasible by detecting optimized paths through side-channel leakage.</span></i></p>
<p><span style="font-weight: 400;">Finally, an attack is never the end goal. We probe attacks from every angle so we can study them before any attacker ever thinks about them. The endgame is always to develop subsequent defenses, which I leave for another discussion.</span></p>
<p><strong>About the author: </strong></p>
<p><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.binghamton.edu/computer-science/people/profile.html?id=arakin"><span style="font-weight: 400;">Adnan Siraj Rakin</span></a><span style="font-weight: 400;"> is an Assistant Professor in the School of Computing at Binghamton University. He received his Master&#8217;s (2021) and PhD (2022) from Arizona State University. He works on emerging security and privacy challenges in modern AI </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.usenix.org/conference/usenixsecurity21/presentation/rakin"><span style="font-weight: 400;">systems</span></a><span style="font-weight: 400;"> and </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://openaccess.thecvf.com/content/ICCV2023/papers/Ahmed_SSDA_Secure_Source-Free_Domain_Adaptation_ICCV_2023_paper.pdf"><span style="font-weight: 400;">algorithms</span></a><span style="font-weight: 400;">. His paper on</span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://ieeexplore.ieee.org/document/9833743/"><span style="font-weight: 400;"> DNN model weight recovery</span></a><span style="font-weight: 400;"> was selected as a Top Pick in Hardware and Embedded Security in 2024.</span></p>
<p class="disclaim"><strong>Disclaimer:</strong> <em>These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.</em></p>
]]>
</content:encoded>
			<wfw:commentRss>https://feeds.feedblitz.com/~/953386850/0/sigarch-cat~Spilling-the-Neural-Tea-A-Journey-Down-the-SideChannel/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	<post-id xmlns="com-wordpress:feed-additions:1">101933</post-id></item>
<item>
<feedburner:origLink>https://www.sigarch.org/to-sparsify-or-to-quantize-a-hardware-architecture-view/</feedburner:origLink>
		<title>To Sparsify or To Quantize: A Hardware Architecture View</title>
		<link>https://feeds.feedblitz.com/~/950020673/0/sigarch-cat~To-Sparsify-or-To-Quantize-A-Hardware-Architecture-View/</link>
		<comments>https://feeds.feedblitz.com/~/950020673/0/sigarch-cat~To-Sparsify-or-To-Quantize-A-Hardware-Architecture-View/#respond</comments>
		<pubDate>Thu, 12 Mar 2026 15:00:43 +0000</pubDate>
		<dc:creator><![CDATA[Sai Srivatsa Bhamidipati]]></dc:creator>
		<category><![CDATA[ACM SIGARCH]]></category>
		<category><![CDATA[Accelerators]]></category>
		<category><![CDATA[deep neural networks]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<guid isPermaLink="false">https://www.sigarch.org/?p=100754</guid>
		<description><![CDATA[<div><img width="300" height="164" src="https://www.sigarch.org/wp-content/uploads/2026/03/Blog-Image-2-Picsart-AiImageEnhancer-300x164.png" class="attachment-medium size-medium wp-post-image" alt="" style="max-width:100% !important;height:auto !important;margin-bottom:15px;margin-left:15px;float:right;"  loading="lazy" /></div>The debate of sparsity versus quantization has made its rounds in the ML optimization community for many years. Now, with the Generative AI revolution, the debate is intensifying. While these might both seem like simple mathematical approximations to an AI researcher, for a hardware architect, they present fundamentally different sets of challenges. Many architects in [&#8230;]]]>
</description>
				<content:encoded><![CDATA[<div><img width="300" height="164" src="https://www.sigarch.org/wp-content/uploads/2026/03/Blog-Image-2-Picsart-AiImageEnhancer-300x164.png" class="attachment-medium size-medium wp-post-image" alt="" style="margin-bottom:15px;margin-left:15px;float:right;" decoding="async" loading="lazy" /></div><p><span style="font-weight: 400;">The debate of sparsity versus quantization has made its rounds in the ML optimization community for many years. Now, with the Generative AI revolution, the debate is intensifying. While these might both seem like simple mathematical approximations to </span>an AI researcher, for a hardware architect, they present fundamentally different sets of challenges. Many architects in the AI hardware space are deeply familiar with watching the scale tip from one side to the other, constantly searching for a pragmatic balance. Let&#8217;s look at both techniques, unpack the architectural challenges they introduce, and explore whether a &#8220;best of both worlds&#8221; scenario is truly possible (Spoiler: It depends).</p>
<p><i><span style="font-weight: 400;">Note: We will only be looking at compute-bound workloads, which traditionally rely on dense compute units such as tensor cores or MXUs. We will set aside memory-bound workloads for now, as they introduce their own distinct set of tradeoffs for sparsity and quantization.</span></i></p>
<h2><b>Sparsity</b></h2>
<p><span style="font-weight: 400;">The core idea of sparsity is beautifully simple: if a neural network weight is zero (or close enough to it), just don&#8217;t do the math. Theoretically, pruning can save massive amounts of compute and memory bandwidth.</span></p>
<p><b>The Architecture Challenge: The Chaos of Unstructured Data</b></p>
<p><span style="font-weight: 400;">The holy grail of this approach is fine-grained, unstructured sparsity. It offers a high level of achievable compression through pruning, but results in a completely random distribution of zero elements. Traditional dense hardware </span><i><span style="font-weight: 400;">hates</span></i><span style="font-weight: 400;"> this. Randomness leads to irregular memory accesses, unpredictable load balancing across cores, and terrible cache utilization. High-performance SIMD units end up starving while the memory controller plays hopscotch trying to fetch the next non-zero value. To architect around this, pioneering unstructured sparse accelerators—such as</span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/1602.01528"> <span style="font-weight: 400;">EIE</span></a><span style="font-weight: 400;"> and</span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/1708.04485"> <span style="font-weight: 400;">SCNN</span></a><span style="font-weight: 400;">—had to rely heavily on complex routing logic, specialized crossbars, and deep queues just to keep the compute units fed, often trading compute area for routing overhead.</span></p>
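<p>To make that irregularity concrete, here is a minimal Python sketch (our illustration; EIE and SCNN use their own on-chip encodings) of a compressed sparse row (CSR) matrix-vector product. The data-dependent gather <code>x[cols[k]]</code> is exactly the access pattern that starves dense SIMD pipelines:</p>

```python
# Minimal CSR (compressed sparse row) mat-vec sketch: illustrative only,
# not the actual EIE/SCNN format. Note the data-dependent index gathers.

def csr_from_dense(m):
    """Compress a dense matrix, keeping only non-zeros plus index metadata."""
    vals, cols, row_ptr = [], [], [0]
    for row in m:
        for j, v in enumerate(row):
            if v != 0:
                vals.append(v)
                cols.append(j)
        row_ptr.append(len(vals))
    return vals, cols, row_ptr

def csr_matvec(vals, cols, row_ptr, x):
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += vals[k] * x[cols[k]]   # irregular gather: x[cols[k]]
        y.append(acc)
    return y

m = [[0, 2, 0, 0],
     [1, 0, 0, 3],
     [0, 0, 0, 0]]
vals, cols, row_ptr = csr_from_dense(m)
print(csr_matvec(vals, cols, row_ptr, [1, 1, 1, 1]))  # [2, 4, 0]
```

<p>Every multiply depends on where the non-zeros happened to land, which is precisely what the routing logic in these accelerators has to absorb.</p>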
<p><b>The Compromise: Structured and Coarse-Grained Sparsity</b></p>
<p><span style="font-weight: 400;">To tame this chaos, the industry shifted toward structured compromises. The widely adopted</span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://developer.nvidia.com/blog/accelerating-inference-with-sparsity-using-ampere-and-tensorrt/"> <span style="font-weight: 400;">N:M sparsity</span></a><span style="font-weight: 400;"> (popularized by NVIDIA&#8217;s Ampere architecture) forces exactly N non-zero elements in every block of M. This provides a predictable load-balancing mechanism where the hardware can perfectly schedule memory fetches and compute.</span></p>
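<p>As a toy illustration (ours, not NVIDIA&#8217;s bit-level encoding), 2:4 sparsity can be sketched in a few lines: each block of four weights keeps its two largest-magnitude values plus small per-value position metadata, so every fetch and multiply can be scheduled in advance:</p>

```python
# Toy 2:4 structured sparsity (illustrative, not the hardware bit format):
# every block of M=4 weights keeps exactly N=2 values plus their indices.

def compress_2_4(w):
    assert len(w) % 4 == 0
    vals, idx = [], []
    for b in range(0, len(w), 4):
        block = w[b:b + 4]
        # keep the two largest-magnitude entries (prune the rest)
        keep = sorted(range(4), key=lambda j: -abs(block[j]))[:2]
        for j in sorted(keep):
            vals.append(block[j])
            idx.append(j)          # 2-bit index metadata per kept value
    return vals, idx

def sparse_dot(vals, idx, x):
    """Dot product touching only kept values; the schedule is fully regular."""
    acc = 0.0
    for k, (v, j) in enumerate(zip(vals, idx)):
        block = k // 2                 # two survivors per block of four
        acc += v * x[4 * block + j]
    return acc

w = [0.1, -3.0, 0.2, 2.0,   1.5, 0.0, -0.3, 0.05]
vals, idx = compress_2_4(w)
y = sparse_dot(vals, idx, [1.0] * 8)   # about 0.2
```

<p>The index metadata is the &#8220;tax&#8221; mentioned below: 2 bits per kept value, in exchange for a perfectly predictable fetch pattern.</p>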
<p><span style="font-weight: 400;">More recently, to tackle the quadratic memory bottleneck of long-context LLMs, we&#8217;ve seen a surge in modern </span><i><span style="font-weight: 400;">sparse attention mechanisms</span></i><span style="font-weight: 400;"> that leverage block sparsity. Techniques like</span> <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://github.com/mit-han-lab/Block-Sparse-Attention"><i><span style="font-weight: 400;">Block-Sparse Attention</span></i></a><span style="font-weight: 400;"> and </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/2003.05997"><span style="font-weight: 400;">Routing Transformers</span></a><span style="font-weight: 400;"> enforce sparsity at the chunk or tile level. Instead of picking individual tokens, they route computation to contiguous blocks of tokens, allowing standard dense matrix multiplication engines to skip entire chunks while maintaining high MXU utilization and contiguous memory access. Other approaches, like</span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/2309.17453"> <span style="font-weight: 400;">StreamingLLM</span></a><span style="font-weight: 400;">, evict older tokens entirely, retaining only local context and specific &#8220;heavy hitter&#8221; sink tokens.</span></p>
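<p>A simplified sketch (ours) of the tile-level idea: the mask is consulted once per pair of blocks, so an entire score tile is either computed densely or skipped wholesale, keeping memory accesses contiguous:</p>

```python
# Tile-level sparsity sketch (illustrative): mask[bi][bj] == 1 means
# "compute score tile (bi, bj) densely"; 0 skips the whole tile.

def block_sparse_scores(q, k, mask, B):
    n, d = len(q), len(q[0])
    s = [[0.0] * len(k) for _ in range(n)]
    for bi, row in enumerate(mask):
        for bj, keep in enumerate(row):
            if not keep:
                continue                            # skip an entire tile
            for i in range(bi * B, (bi + 1) * B):   # dense work inside it
                for j in range(bj * B, (bj + 1) * B):
                    s[i][j] = sum(q[i][t] * k[j][t] for t in range(d))
    return s

q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
mask = [[1, 0],        # a block-causal-style mask over 2x2 tiles
        [1, 1]]
scores = block_sparse_scores(q, q, mask, B=2)
```

<p>The per-tile decision is what lets a dense matmul engine stay fully utilized on the tiles it does execute.</p>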
<p><span style="font-weight: 400;">The trade-off across these methods is clear: we exchange theoretical maximum efficiency for hardware-friendly predictability, paying a &#8220;tax&#8221; in metadata storage (index matrices), specialized multiplexing logic, and the persistent algorithmic risk of dropping contextually vital information.</span></p>
<h2><b>Quantization</b></h2>
<p><span style="font-weight: 400;">While sparsity aims to compute </span><i><span style="font-weight: 400;">less</span></i><span style="font-weight: 400;">, quantization aims to compute </span><i><span style="font-weight: 400;">smaller</span></i><span style="font-weight: 400;">. Shrinking datatypes from 32-bit floats (FP32) to INT8, or embracing emerging standards like the</span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf"> <span style="font-weight: 400;">OCP Microscaling Formats (MX) Specification</span></a><span style="font-weight: 400;"> (such as MXFP8 E4M3 and E5M2), acts as an immediate multiplier for memory bandwidth and capacity. But the frontier has pushed much further than 8-bit. Recent advancements in extreme quantization, such as</span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/2402.17764"> <span style="font-weight: 400;">BitNet b1.58</span></a><span style="font-weight: 400;"> (1-bit LLMs using ternary weights of {-1, 0, 1}) and 2-bit quantization schemes (like</span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/2210.17323"> <span style="font-weight: 400;">GPTQ</span></a><span style="font-weight: 400;"> or <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/2307.13304">QuIP</a>), demonstrate that large language models can maintain remarkable accuracy even when weights are squeezed to their absolute theoretical limits.</span></p>
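<p>For readers less familiar with the mechanics, here is a minimal symmetric-quantization sketch in Python (our illustration; real formats such as MXFP8 add shared block exponents and other details): one scale maps a tensor into the INT8 range, and dequantization multiplies it back:</p>

```python
# Symmetric INT8 quantization sketch (illustrative): a single scale per
# tensor; real schemes use per-channel/per-group scales and richer formats.

def quantize_int8(xs):
    scale = max(abs(v) for v in xs) / 127.0 or 1.0   # avoid scale == 0
    q = [max(-127, min(127, round(v / scale))) for v in xs]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.5, -1.0, 0.25, 0.75]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# worst-case rounding error is about half a quantization step (scale / 2)
```

<p>The entire accuracy battle in lower-bit schemes is about how, and at what granularity, that <code>scale</code> is chosen.</p>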
<p><b>The Architecture Challenge: The Tyranny of Metadata and Scaling Factors</b></p>
<p><span style="font-weight: 400;">From an architecture perspective, the challenge of extreme quantization isn&#8217;t just the math—it&#8217;s the metadata. To maintain accuracy at 4-bit, 2-bit, or sub-integer levels, algorithms demand fine-grained control, requiring per-channel, per-group, or even per-token dynamic scaling factors. Every time we shrink the primary datapath, the relative hardware overhead of managing these scaling factors skyrockets. The quantization algorithm itself also becomes more fine-grained, dynamic, and complex. We are forced to add extra logic and even high-precision accumulators (often FP16 or FP32) just to handle the on-the-fly de-quantization and accumulation. We aggressively optimize the MAC (Multiply-Accumulate) units, only to trade those gains for the overhead of scaling-factor handling and support for a potentially new dynamic quantization scheme, which can outweigh the savings.</span></p>
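<p>A small sketch (our illustration) makes the metadata tax visible: with INT4 weights and a group size of four, every group carries its own scale, and the accumulator must apply those scales in higher precision:</p>

```python
# Per-group quantized dot product sketch (illustrative): INT4 weights,
# one scale per group of G; accumulation stays in float, and each group's
# partial sum is rescaled by its own factor on the fly.

G = 4

def quantize_groups(w, bits=4):
    qmax = 2 ** (bits - 1) - 1            # 7 for INT4
    qs, scales = [], []
    for g in range(0, len(w), G):
        grp = w[g:g + G]
        s = max(abs(v) for v in grp) / qmax or 1.0
        qs.extend(max(-qmax, min(qmax, round(v / s))) for v in grp)
        scales.append(s)                  # metadata: one scale per group
    return qs, scales

def gdot(qs, scales, x):
    acc = 0.0                             # high-precision accumulator
    for gi, g in enumerate(range(0, len(qs), G)):
        part = sum(qs[g + j] * x[g + j] for j in range(G))  # int * float
        acc += scales[gi] * part          # de-quantize per group
    return acc

w = [0.7, -0.7, 0.3, 0.0,  0.1, 0.04, -0.1, 0.02]
qs, scales = quantize_groups(w)           # 8 INT4 values + 2 scale factors
y = gdot(qs, scales, [1.0] * 8)
```

<p>Shrink the group size, or make the scales per-token and dynamic, and the scale-handling logic grows while the 4-bit multipliers stay tiny: that is the skyrocketing relative overhead described above.</p>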
<p><b>The Compromise: Algorithmic Offloading</b></p>
<p><span style="font-weight: 400;">To fix this without blowing up the complexity and area budget, the community relies on algorithmic co-design. Techniques like</span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/2211.10438"> <span style="font-weight: 400;">SmoothQuant</span></a><span style="font-weight: 400;"> effectively migrate the quantization difficulty offline, mathematically shifting the dynamic range from spiky, hard-to-predict activations into the statically known weights. Similarly,</span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/2306.00978"> <span style="font-weight: 400;">AWQ (Activation-aware Weight Quantization)</span></a><span style="font-weight: 400;"> identifies and protects a small fraction of &#8220;salient&#8221; weights to maintain accuracy without requiring complex, dynamic mixed-precision hardware pipelines. By absorbing the complexity into offline mathematics, these techniques allow the hardware to run mostly uniform, low-precision datatypes.</span></p>
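<p>The core trick behind this migration can be shown in a few lines. This is our simplification of SmoothQuant&#8217;s per-channel formulation (the paper balances activation and weight ranges; here we use only activation statistics): for any positive per-channel factor s, dividing activation channel j by s[j] and multiplying weight row j by s[j] leaves the product unchanged, while shrinking activation outliers offline:</p>

```python
# SmoothQuant-style migration sketch (our simplification): the product
# X @ W is preserved exactly, but the outlier range moves from the
# activations into the statically known weights.

def matmul(a, b):
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def smooth(x, w, alpha=0.5):
    # smoothing factor per input channel, from calibration activations
    s = [max(abs(row[j]) for row in x) ** alpha or 1.0
         for j in range(len(w))]
    x_s = [[row[j] / s[j] for j in range(len(row))] for row in x]
    w_s = [[s[j] * v for v in w[j]] for j in range(len(w))]
    return x_s, w_s

x = [[10.0, 0.1], [8.0, -0.2]]   # channel 0 carries the outliers
w = [[0.3, -0.5], [1.0, 2.0]]
x_s, w_s = smooth(x, w)          # same product, tamer activations
```

<p>After smoothing, both operands quantize well with uniform low-precision formats, which is exactly what keeps the hardware simple.</p>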
<p><span style="font-weight: 400;">However, much like the routing tax in sparsity, this algorithmic offloading comes with some compromises. These methods heavily rely on static, offline calibration datasets. If a model encounters out-of-distribution data in production (a different language, an unusual coding syntax, or an unexpected prompt structure), the statically determined scaling factors can fail, leading to outlier clipping and catastrophic accuracy collapse. Furthermore, relying on offline preprocessing creates a rigid deployment pipeline that prevents the model from adapting to extreme activation spikes on the fly.</span></p>
<h2><b>Is there a &#8220;best of both worlds&#8221;?</b></h2>
<p><span style="font-weight: 400;">So, knowing these trade-offs, do we sparsify or do we quantize? Many years ago, the </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/1510.00149"><span style="font-weight: 400;">Deep Compression</span></a><span style="font-weight: 400;"> paper proved we could do both. But today, pulling this off at the scale of a 70-billion parameter LLM is incredibly difficult. It suffers from the classic hardware optimization catch-22 (see </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.sigarch.org/dont-put-all-your-tensors-in-one-basket-hardware-lottery/"><span style="font-weight: 400;">All in on Matmul?</span></a><span style="font-weight: 400;">): </span><i><span style="font-weight: 400;">No one uses a new piece of hardware because it’s not supported by software, and it’s not supported by software because no one’s using it.</span></i></p>
<p><span style="font-weight: 400;">So what&#8217;s the path forward for hardware architects? In my opinion, the following:</span></p>
<ul>
<li style="font-weight: 400;"><b>Deep Hardware-Software Co-design:</b><span style="font-weight: 400;"> The days of throwing a generic matrix-multiplication engine at a model are over. We need to work directly with AI researchers so that when they design a new pruning threshold or a novel sub-byte data type, the hardware already has a streamlined, fast path for the metadata.</span></li>
<li style="font-weight: 400;"><b>Generalized Compression Abstractions:</b><span style="font-weight: 400;"> Historically, we have designed accelerators that are either &#8220;good at sparsity&#8221; (with complex routing networks) or &#8220;good at quantization&#8221; (with mixed-precision MACs). Moving forward, we need to view these not as orthogonal features, but as a unified spectrum of compression. Architectures must be designed to dynamically adapt—perhaps fluidly dropping structurally sparse blocks during a memory-bound decode phase, while leaning on extreme sub-byte quantization during a compute-heavy prefill phase—potentially even sharing the same underlying logic.</span></li>
<li style="font-weight: 400;"><b>Balance Efficiency and Programmability:</b><span style="font-weight: 400;"> As explored in the &#8220;All in on MatMul?&#8221; post, we need to keep our hardware flexible. Over-fitting to today&#8217;s specific sparsity pattern or quantization trick risks being trapped in a local minimum. We must maintain enough programmability to enable future algorithm discovery and break free from the catch-22.</span></li>
</ul>
<p><span style="font-weight: 400;">Notable research along this path includes </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/pdf/2405.20935"><span style="font-weight: 400;">Effective interplay between sparsity and quantization</span></a><span style="font-weight: 400;">, which shows that the two techniques are not orthogonal and characterizes how they interact, and the </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.cs.toronto.edu/~mmozaffari/compression-trinity/index.html"><span style="font-weight: 400;">Compression Trinity</span></a><span style="font-weight: 400;"> work, which surveys techniques spanning sparsity, quantization, and low-rank approximation to build a holistic view of the optimization space across the stack.</span></p>
<p><span style="font-weight: 400;">Ultimately, as alluded to before, there is no single silver bullet, and like all open architecture problems, the answer is always &#8220;it depends&#8221;.  But in the era of Generative AI, it depends on whether we view sparsity and quantization as competing alternatives or as pieces of the same puzzle. Perhaps it’s time we stop asking which one is better, and start designing architectures flexible enough to embrace the realities of both.</span></p>
<h3><b>About the Author:</b></h3>
<p><span style="font-weight: 400;">Sai Srivatsa Bhamidipati is a Senior Silicon Architect at Google working on the Google Tensor TPU in the Pixel phones. His primary focus is on efficient and scalable compute for Generative AI on the Tensor TPU.</span></p>
<h3><b>Author’s Disclaimer:</b></h3>
<p><span style="font-weight: 400;">Portions of this post were edited with the assistance of AI models. Some references, notes, and images were also compiled using AI tools. The content represents the opinions of the author and does not necessarily represent the views, policies, or positions of Google or its affiliates.</span></p>
<p class="disclaim"><strong>Disclaimer:</strong> <em>These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.</em></p>
]]>
</content:encoded>
			<wfw:commentRss>https://feeds.feedblitz.com/~/950020673/0/sigarch-cat~To-Sparsify-or-To-Quantize-A-Hardware-Architecture-View/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	<post-id xmlns="com-wordpress:feed-additions:1">100754</post-id></item>
<item>
<feedburner:origLink>https://www.sigarch.org/from-the-editors-desk-2026-edition/</feedburner:origLink>
		<title>From the Editor&#8217;s Desk &#8211; 2026 Edition</title>
		<link>https://feeds.feedblitz.com/~/944524166/0/sigarch-cat~From-the-Editors-Desk-Edition/</link>
		<comments>https://feeds.feedblitz.com/~/944524166/0/sigarch-cat~From-the-Editors-Desk-Edition/#respond</comments>
		<pubDate>Tue, 03 Feb 2026 20:19:39 +0000</pubDate>
		<dc:creator><![CDATA[Dmitry Ponomarev]]></dc:creator>
		<category><![CDATA[ACM SIGARCH]]></category>
		<category><![CDATA[Blog]]></category>
		<category><![CDATA[Editorial]]></category>
		<guid isPermaLink="false">https://www.sigarch.org/?p=96844</guid>
		<description><![CDATA[<div><img width="300" xheight="169" src="https://www.sigarch.org/wp-content/uploads/2026/02/AdobeStock_862939397-300x169.jpeg" class="attachment-medium size-medium wp-post-image" alt="" style="max-width:100% !important;height:auto !important;margin-bottom:15px;margin-left:15px;float:right;"  loading="lazy" /></div>As we close the book on 2025, Computer Architecture Today has seen another successful year of community engagement. We published 29 posts covering a wide spectrum of topics—from datacenter energy-efficiency to the evolving debate on LLMs in peer review, alongside trip reports from our major conferences. I want to thank all our authors for their [&#8230;]]]>
</description>
				<content:encoded><![CDATA[<div><img width="300" height="169" src="https://www.sigarch.org/wp-content/uploads/2026/02/AdobeStock_862939397-300x169.jpeg" class="attachment-medium size-medium wp-post-image" alt="" style="margin-bottom:15px;margin-left:15px;float:right;" decoding="async" loading="lazy" /></div><p>As we close the book on 2025, <i>Computer Architecture Today</i> has seen another successful year of community engagement. We published 29 posts covering a wide spectrum of topics—from datacenter energy-efficiency to the evolving debate on LLMs in peer review, alongside trip reports from our major conferences. I want to thank all our authors for their insights, with special appreciation for those who contributed multiple times.</p>
<p>Over the last year, we shifted our editorial model, moving from a roster of set contributors to a more flexible, open-submission approach. We also re-established our conference trip reports, highlighting top architecture venues.</p>
<p>The blog thrives on new voices, and our door is always open. We are actively looking for:</p>
<ul>
<li>
<p><b>New Ideas:</b> If you have a topic in mind, please propose it using <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.sigarch.org/contribute/propose-a-blog-post-topic/">this link</a> or email me directly.</p>
</li>
<li>
<p><b>Trip Reports:</b> Planning to attend a conference? Volunteer to share your experience.</p>
</li>
<li>
<p><b>Event Summaries:</b> Organizers of workshops or tutorials are welcome to publicize their events through summary posts.</p>
</li>
<li>
<p><b>Industry Perspectives:</b> We would like to hear from our industry colleagues about their take on the future landscape of computer architecture.</p>
</li>
</ul>
<p>Finally, as AI tools proliferate, the conversation around their role in our paper reviewing process is far from over. I look forward to seeing more of that debate here.</p>
<p>Here’s to the new advances in Computer Architecture in 2026!</p>
<p class="disclaim"><strong>Disclaimer:</strong> <em>These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.</em></p>
]]>
</content:encoded>
			<wfw:commentRss>https://feeds.feedblitz.com/~/944524166/0/sigarch-cat~From-the-Editors-Desk-Edition/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	<post-id xmlns="com-wordpress:feed-additions:1">96844</post-id></item>
<item>
<feedburner:origLink>https://www.sigarch.org/multi-agent-memory-from-a-computer-architecture-perspective-visions-and-challenges-ahead/</feedburner:origLink>
		<title>Multi-Agent Memory from a Computer Architecture Perspective: Visions and Challenges Ahead</title>
		<link>https://feeds.feedblitz.com/~/940946942/0/sigarch-cat~MultiAgent-Memory-from-a-Computer-Architecture-Perspective-Visions-and-Challenges-Ahead/</link>
		<comments>https://feeds.feedblitz.com/~/940946942/0/sigarch-cat~MultiAgent-Memory-from-a-Computer-Architecture-Perspective-Visions-and-Challenges-Ahead/#respond</comments>
		<pubDate>Tue, 20 Jan 2026 15:19:11 +0000</pubDate>
		<dc:creator><![CDATA[Zhongming Yu and Jishen Zhao]]></dc:creator>
		<category><![CDATA[ACM SIGARCH]]></category>
		<category><![CDATA[Agents]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[Memory Consistency]]></category>
		<category><![CDATA[Memory Hierarchy]]></category>
		<guid isPermaLink="false">https://www.sigarch.org/?p=97929</guid>
		<description><![CDATA[<div><img width="300" xheight="200" src="https://www.sigarch.org/wp-content/uploads/2026/01/title-300x200.png" class="attachment-medium size-medium wp-post-image" alt="" style="max-width:100% !important;height:auto !important;margin-bottom:15px;margin-left:15px;float:right;"  loading="lazy" /></div>Large language model (LLM) agents are quickly moving from “single agent” to *multi-agent systems*: tool-using agents, planner-orchestrator, debate teams, specialized sub-agents that collaborate to solve tasks. At the same time, the *context* these agents must operate within is becoming more complex: longer histories, multiple modalities, structured traces, and customized environments. This combination creates a bottleneck [&#8230;]]]>
</description>
				<content:encoded><![CDATA[<div><img width="300" height="200" src="https://www.sigarch.org/wp-content/uploads/2026/01/title-300x200.png" class="attachment-medium size-medium wp-post-image" alt="" style="margin-bottom:15px;margin-left:15px;float:right;" decoding="async" loading="lazy" /></div><p>Large language model (LLM) agents are quickly moving from “single agent” to <em>multi-agent systems</em>: tool-using agents, planner-orchestrator pipelines, debate teams, specialized sub-agents that collaborate to solve tasks. At the same time, the <em>context</em> these agents must operate within is becoming more complex: longer histories, multiple modalities, structured traces, and customized environments. This combination creates a bottleneck that looks surprisingly familiar to computer architects: memory.</p>
<p>In computer systems, performance and scalability are often limited not by compute, but by memory hierarchy, bandwidth, and consistency. Multi-agent systems are heading toward the same wall — except their “memory” is not raw bytes, but semantic context used for reasoning. After spending the past two years building various LLM multi-agent frameworks (e.g., <strong><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://github.com/fishmingyu/OrcaLoca">OrcaLoca</a></strong> for software issue localization, <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://stable-lab.github.io/MAGE/"><strong>MAGE</strong></a> for RTL design, <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://github.com/stable-lab/Pro-V"><strong>Pro-V</strong></a> for RTL verification, and <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://pettingllms-ai.github.io/"><strong>PettingLLMs</strong></a> enabling RL training on multiple LLM agents), we would like to share what we have learned from that experience through the lens of a computer architect. This blog frames multi-agent memory as a <strong>computer architecture problem</strong>, proposes a simple architecture-inspired model, and highlights the key challenges and protocol gaps that define the road ahead.</p>
<p>While our perspectives are still preliminary and evolving, we hope they serve as a starting point to ignite a broader conversation.</p>
<hr />
<h2>Multi-Agent Memory Systems in Growing Complex Contexts</h2>
<h3>Why memory matters: Context is changing</h3>
<ul>
<li><strong>Longer context windows:</strong> Long-context evaluation suites like <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/2404.06654"><strong>RULER</strong></a> and <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://longbench2.github.io/"><strong>LongBench</strong></a> show that &#8220;real&#8221; long-context ability involves more than simple retrieval — it includes multi-hop tracing, aggregation, and sustained reasoning as length scales.</li>
<li><strong>Multi-modal inputs:</strong> Benchmarks such as <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://mmmu-benchmark.github.io/"><strong>MMMU</strong></a> (static images: charts, diagrams, tables) and <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://video-mme.github.io/"><strong>VideoMME</strong></a> (videos with audio and subtitles) demonstrate that models must handle diverse visual modalities alongside text, extending beyond single-modality processing.</li>
<li><strong>Structured data &amp; traces:</strong> Text-to-SQL (e.g., <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://spider2-sql.github.io/"><strong>Spider</strong></a>, <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://bird-bench.github.io/"><strong>BIRD</strong></a>) highlight that agents increasingly operate over structured, executable data — database schemas and generated SQL queries — rather than only raw chat history.</li>
<li><strong>Customized environments:</strong> In <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.swebench.com/SWE-bench/guides/evaluation/"><strong>SWE-bench</strong></a> and <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://multi-swe-bench.github.io/#/"><strong>Multi-SWE-bench</strong></a>, models are evaluated by applying patches to real repositories and running tests in containerized (Docker) environments, making &#8220;environment state + execution&#8221; part of the memory problem. Similarly, <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://webarena.dev/"><strong>WebArena</strong></a> and <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://os-world.github.io/"><strong>OSWorld</strong></a> provide realistic, reproducible interactive environments that stress long-horizon state tracking and grounded actions.</li>
</ul>
<p><strong>Bottom line:</strong> Context is no longer a static prompt — it&#8217;s a dynamic, multi-format, partially persistent memory system.</p>
<p><img loading="lazy" decoding="async" class="alignnone size-full wp-image-97940" src="https://www.sigarch.org/wp-content/uploads/2026/01/motivation.jpg" alt="" width="11766" height="6266" /></p>
<hr />
<h2>Basic Prototypes: Shared vs. Distributed Agent Memory</h2>
<p>Before we talk about “hierarchies,” it helps to name the two simplest prototypes, which mirror classical memory systems.</p>
<h3>1) Shared Memory</h3>
<p>All agents access a shared memory pool (e.g., a shared vector store, shared document database).</p>
<ul>
<li><strong>Pros:</strong> Easy to share knowledge; fast reuse.</li>
<li><strong>Cons:</strong> Requires <strong>coherence support</strong>. Without coordination, agents overwrite each other, read stale info, or rely on inconsistent versions of shared facts.</li>
</ul>
<h3>2) Distributed Memory</h3>
<p>Each agent owns local memory (local scratchpad, local cache, local long-term store) and shares via synchronization.</p>
<ul>
<li><strong>Pros:</strong> Isolation by default; more scalable; fewer contention issues.</li>
<li><strong>Cons:</strong> Needs explicit <strong>synchronization</strong>; state divergence becomes common unless carefully managed.</li>
</ul>
<p>Most real systems sit somewhere in between: local working memory plus selectively shared artifacts.</p>
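<p>The contrast can be sketched in a few lines of Python (our illustration; the names and version scheme are ours, not any framework&#8217;s API): a shared store needs versioning so agents can detect stale reads, while distributed stores need an explicit synchronization step before anything becomes visible:</p>

```python
# Toy contrast (illustrative): shared memory needs staleness detection;
# distributed memory needs explicit, selective synchronization.

class SharedStore:
    def __init__(self):
        self.data, self.version = {}, {}
    def write(self, key, value):
        self.version[key] = self.version.get(key, 0) + 1
        self.data[key] = value
    def read(self, key):
        return self.data[key], self.version[key]   # version = staleness check

class AgentLocalStore:
    def __init__(self):
        self.local = {}
    def sync_from(self, other, keys):
        """Explicit synchronization: pull only the selected artifacts."""
        for k in keys:
            self.local[k] = other.local[k]

# Shared: agent B can detect that its earlier read of "plan" is stale.
shared = SharedStore()
shared.write("plan", "draft-1")
_, seen = shared.read("plan")
shared.write("plan", "draft-2")            # agent A overwrites
_, now = shared.read("plan")

# Distributed: nothing is visible until an explicit sync.
a, b = AgentLocalStore(), AgentLocalStore()
a.local["trace"] = "tool-call log"
b.sync_from(a, ["trace"])
```

<p>Real deployments typically combine the two: local scratchpads plus a versioned shared store for selectively published artifacts.</p>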
<hr />
<h2>An Agent Memory Architecture Inspired by Modern Computer Architecture Design</h2>
<p>Computer architecture teaches a practical lesson: you don’t build “one memory.” You build a <strong>memory hierarchy</strong> with different layers optimized for latency, bandwidth, capacity, and persistence.</p>
<p>A useful mapping for agents will be:</p>
<h3>Agent I/O Layer</h3>
<p><strong>What it is:</strong> Interfaces that ingest and emit information.</p>
<ul>
<li>Audio/speech</li>
<li>Text documents</li>
<li>Images</li>
<li>Network calls/web data</li>
</ul>
<p><strong>Analogy:</strong> Devices and I/O subsystems feeding the CPU.</p>
<h3>Agent Cache Layer</h3>
<p><strong>What it is:</strong> Fast, limited-capacity memory optimized for immediate reasoning.</p>
<ul>
<li>Compressed context</li>
<li>Recent trajectories and tool calls</li>
<li>Short-term latent storage (e.g., KV cache, embeddings of recent steps)</li>
</ul>
<p><strong>Analogy:</strong> CPU caches (L1/L2/L3): small, fast, and constantly refreshed.</p>
<h3>Agent Memory Layer</h3>
<p><strong>What it is:</strong> Large-capacity, slower memory optimized for retrieval and persistence.</p>
<ul>
<li>Full dialogue history</li>
<li>External knowledge databases (vector DBs, graph DBs, document stores)</li>
<li>Long-term latent storage</li>
</ul>
<p><strong>Analogy:</strong> Main memory + storage hierarchy.</p>
<p>This framing emphasizes a key principle: <strong>Agent performance is an end-to-end data movement problem</strong>. Even if the model is powerful, if relevant information is stuck in the wrong layer (or never loaded), reasoning accuracy and efficiency degrade.</p>
<p>And just like in hardware, caching is not optional: agent memory benefits from explicit I/O and caching layers to stay efficient and scalable.</p>
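<p>A minimal sketch of the hierarchy (our illustration; class and key names are invented for the example): a small &#8220;agent cache&#8221; of recent artifacts sits in front of a large, slower &#8220;agent memory&#8221; layer, with a miss forcing data movement from below:</p>

```python
# Agent memory hierarchy sketch (illustrative): a tiny LRU cache layer
# in front of a large memory layer; misses count as "slow" data movement.

from collections import OrderedDict

class AgentMemoryHierarchy:
    def __init__(self, cache_size=2):
        self.cache_size = cache_size
        self.cache = OrderedDict()   # fast, tiny: recent context artifacts
        self.memory = {}             # large, slow: full histories, KBs
        self.misses = 0

    def store(self, key, artifact):
        self.memory[key] = artifact  # persist in the memory layer

    def fetch(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)        # hit: refresh recency
            return self.cache[key]
        self.misses += 1                       # miss: costly data movement
        artifact = self.memory[key]
        self.cache[key] = artifact
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)     # evict least-recently-used
        return artifact

h = AgentMemoryHierarchy(cache_size=2)
for k in ("history", "schema", "trace"):
    h.store(k, k + "-artifact")
h.fetch("history"); h.fetch("schema")   # two cold misses
h.fetch("history")                      # hit in the cache layer
h.fetch("trace")                        # miss; evicts "schema"
```

<p>The miss counter is the point: reasoning quality and latency both degrade when relevant artifacts keep living in the wrong layer.</p>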
<p><img loading="lazy" decoding="async" class="alignnone size-full wp-image-97939" src="https://www.sigarch.org/wp-content/uploads/2026/01/Memprotocol.jpg" alt="" width="22116" height="14550" /></p>
<hr />
<h2>Protocol Extensions for Multi-Agent Scenarios</h2>
<p>Architecture layers need <em>protocols</em>. In multi-agent settings, protocols determine what can be shared, how fast, and under what rules.</p>
<p>Today, many agent frameworks rely on <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://blog.modelcontextprotocol.io/"><strong>MCP</strong> (Model Context Protocol)</a> as a connectivity layer. Agents registered via MCP can connect and communicate, but inter-agent bandwidth remains limited by message-passing. MCP largely uses JSON-RPC, so it’s best viewed as a protocol for <strong>agent context I/O</strong>: request/response, tool invocation, and structured messages.</p>
<p>That’s necessary — but not sufficient.</p>
<h3>Missing Piece 1: Agent Cache Sharing Protocol</h3>
<p>Many recent studies, such as <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/2411.02820"><strong>DroidSpeak</strong></a> and <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/2510.03215"><strong>Cache-to-Cache</strong></a>, have explored KV cache sharing between LLMs. However, we still lack a principled and unified protocol for sharing <em>cached artifacts</em> across agents.</p>
<p><strong>Goal:</strong> Enable one agent’s cached results to be transformed and reused by other agents.</p>
<p>In architecture terms, this is like enabling cache transfers or shared cache behavior — except the payload is semantic and may require transformation before reuse.</p>
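<p>One possible shape for such a protocol, sketched in Python (entirely our invention; no such standard exists): artifacts are published with provenance, and a consumer must go through a registered per-pair transform, since cached state is rarely reusable verbatim:</p>

```python
# Cache-sharing protocol sketch (hypothetical): a producer publishes a
# cached artifact; consumers fetch it through a registered transform
# that adapts the artifact before reuse.

def publish(cache, key, artifact, producer):
    cache[key] = {"artifact": artifact, "producer": producer}

def fetch_as(cache, key, consumer, transforms):
    entry = cache[key]
    fn = transforms.get((entry["producer"], consumer))
    if fn is None:
        raise LookupError("no transform registered for this agent pair")
    return fn(entry["artifact"])     # adapt the artifact before reuse

# "planner" caches a long trace; "coder" only needs the last steps.
# (Truncation here stands in for re-encoding, e.g., of a KV cache.)
transforms = {("planner", "coder"): lambda trace: trace[-2:]}
cache = {}
publish(cache, "repo-trace", ["clone", "build", "test", "patch"], "planner")
shared_view = fetch_as(cache, "repo-trace", "coder", transforms)
```

<p>The open question is what the standard set of transforms should be when the payload is a KV cache or an embedding rather than text.</p>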
<h3>Missing Piece 2: Agent Memory Access Protocol</h3>
<p>Although frameworks like <strong><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://docs.letta.com/">Letta</a></strong> and <strong><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://mem0.ai/">Mem0</a></strong> support shared state within agent memory, a protocol that defines how agents read and write each other&#8217;s memory is still missing.</p>
<p><strong>Goal:</strong> Define memory access semantics: permissions, scope, and granularity.</p>
<p>Key questions:</p>
<ul>
<li>Can Agent B read Agent A’s long-term memory, or only shared memory?</li>
<li>Is access read-only, append-only, or read-write?</li>
<li>What is the unit of access: a document, a chunk, a key-value record, a “thought,” a trace segment?</li>
<li>Can we support “agent RDMA”-like patterns: low-latency direct access to remote memory without expensive message-level serialization?</li>
</ul>
<p>Without a memory access protocol, inter-agent collaboration is forced into slow, high-level message passing, which wastes bandwidth and loses structure.</p>
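<p>The questions above can be made concrete with a small sketch (our invention, not an existing standard): every cross-agent access is checked against a grant table keyed by (owner, requester, region), with read-only and append-only modes:</p>

```python
# Memory access protocol sketch (hypothetical): permissions, scope, and
# access mode are explicit; an agent always has full access to its own
# memory, while cross-agent access requires a grant.

class MemoryAccessProtocol:
    def __init__(self):
        self.store = {}        # (owner, region) -> list of records
        self.grants = {}       # (owner, requester, region) -> mode

    def grant(self, owner, requester, region, mode):
        assert mode in ("ro", "append", "rw")
        self.grants[(owner, requester, region)] = mode

    def read(self, owner, requester, region):
        mode = self.grants.get((owner, requester, region))
        if owner != requester and mode is None:
            raise PermissionError("no grant")
        return list(self.store.get((owner, region), []))

    def append(self, owner, requester, region, record):
        mode = self.grants.get((owner, requester, region))
        if owner != requester and mode not in ("append", "rw"):
            raise PermissionError("read-only or no grant")
        self.store.setdefault((owner, region), []).append(record)

p = MemoryAccessProtocol()
p.append("A", "A", "long_term", "fact-1")   # A writes its own memory
p.grant("A", "B", "long_term", "ro")        # B may read, not write
```

<p>The unit of access here is a record; a real protocol would also have to settle granularity (chunk, trace segment, &#8220;thought&#8221;) and the low-latency transport underneath.</p>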
<hr />
<h2>The Next Frontier: Multi-Agent Memory Consistency</h2>
<p>The largest conceptual gap is <strong>consistency</strong>. The goal of <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://doi.org/10.1109/2.546611"><strong>memory consistency</strong></a> in computer architecture and systems design is to define constraints on the order of reads and writes to memory addresses. Consistency models (e.g., sequential consistency, TSO, and release consistency) clarify what behaviors programmers can rely on.</p>
<p>For agent memory, the goal shifts: It’s not about bytes at an address, but about maintaining a <strong>coherent semantic context</strong> that supports correct reasoning and coordination.</p>
<p><img loading="lazy" decoding="async" class="alignnone size-full wp-image-97941" src="https://www.sigarch.org/wp-content/uploads/2026/01/memory-consistency-comparison.jpg" alt="" width="20184" height="8150" /></p>
<h3>Why Agent Consistency Is Harder</h3>
<ul>
<li>The “state” is not a scalar value; it’s a <em>plan</em>, a <em>summary</em>, a <em>retrieval result</em>, a <em>tool trace</em>.</li>
<li>Writes are not deterministic; they may be speculative or wrong.</li>
<li>Conflicts aren’t simple write-write conflicts — they&#8217;re semantic contradictions.</li>
<li>Freshness depends on the environment state (repo version, API results, and permissions).</li>
</ul>
<h3>What a Multi-Agent Memory Consistency Layer Might Need</h3>
<p>A practical direction is to define consistency around the <em>artifacts agents actually share</em> — cached evidence, tool traces, plans, and long-term records — across both <strong>shared</strong> and <strong>distributed</strong> memory setups (often a hybrid: local caches + shared store). The layer should expose a <strong>consistency model</strong> (e.g., session, causal, eventual semantic, and stronger guarantees for “committed” outputs), provide richer <strong>communication primitives</strong> than plain message passing, and include <strong>conflict-resolution policies</strong> (source ranking, timestamps, consensus, and optional human intervention for high-stakes conflicts).</p>
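<p>As one hypothetical illustration of such a conflict-resolution policy (every name here is invented for this sketch, not an existing system), a consistency layer might rank semantically conflicting records by commit status, then source trustworthiness, then recency, escalating unresolved ties for human review:</p>

```python
# Hypothetical conflict-resolution sketch; names are invented for illustration.
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass(frozen=True)
class Record:
    content: str
    source_rank: int  # lower = more trusted (e.g., tool output over model guess)
    timestamp: float  # environment time of the write
    committed: bool   # "committed" outputs get stronger guarantees

def resolve(candidates: List[Record],
            escalate: Optional[Callable[[List[Record]], Record]] = None) -> Record:
    """Pick a winner among semantically conflicting records:
    committed beats uncommitted, then lower source_rank, then recency.
    `escalate` models human intervention when top candidates still tie."""
    def key(r: Record):
        return (not r.committed, r.source_rank, -r.timestamp)
    ranked = sorted(candidates, key=key)
    top = ranked[0]
    ties = [r for r in ranked if key(r) == key(top)]
    if len(ties) > 1 and escalate is not None:
        return escalate(ties)
    return top
```

<p>The point of the sketch is not the specific ordering but that the policy is explicit and inspectable, unlike today&#8217;s implicit &#8220;last prompt wins&#8221; behavior.</p>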
<p>Research on this is still rare, but it is likely to become foundational — much like coherence and consistency were for multiprocessors.</p>
<h2>Conclusion</h2>
<p>Many agent memory systems today resemble <strong>human memory</strong> — informal, redundant, and hard to control — leaving a large opportunity for computer architecture researchers to rethink what “memory” should mean for agents <strong>at scale</strong>. To move from ad-hoc prompting to reliable multi-agent systems, we need <strong>better memory hierarchies</strong>, <strong>explicit protocols</strong> for cache sharing and memory access, and <strong>principled consistency models</strong> that keep shared context coherent.</p>
<h2>Acknowledgement</h2>
<p>We sincerely thank Wentao Ni, Hejia Zhang, Mingrui Yin, Jiaying Yang, and Yujie Zhao for their invaluable contributions through brainstorming, discussions, data collection, and survey work over the past few months. This article would not have been possible without their dedicated efforts.</p>
<p><b>About the authors:</b></p>
<p><i>Zhongming Yu is a PhD student in the </i><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://cse.ucsd.edu/"><i>Computer Science and Engineering Department</i></a><i> at</i><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~www.ucsd.edu/"><i> University of California, San Diego</i></a><i>. His research interests are in combining machine learning and computer systems, with a special focus on LLM agent systems for machine learning systems, evolving ML and systems, and autonomous software engineering. </i></p>
<p><i>Jishen Zhao is a Professor in the</i><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://cse.ucsd.edu/"><i> Computer Science and Engineering Department</i></a><i> at</i><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~www.ucsd.edu/"><i> University of California, San Diego</i></a><i>. Her research spans the boundaries of computer architecture, system software, and machine learning, with an emphasis on memory systems, machine learning and systems codesign, and system support for smart applications.</i></p>
<p class="disclaim"><strong>Disclaimer:</strong> <em>These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.</em></p><Img align="left" border="0" height="1" width="1" alt="" style="border:0;float:left;margin:0;padding:0;width:1px!important;height:1px!important;" hspace="0" src="https://feeds.feedblitz.com/~/i/940946942/0/sigarch-cat">
]]>
</content:encoded>
			<wfw:commentRss>https://feeds.feedblitz.com/~/940946942/0/sigarch-cat~MultiAgent-Memory-from-a-Computer-Architecture-Perspective-Visions-and-Challenges-Ahead/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	<post-id xmlns="com-wordpress:feed-additions:1">97929</post-id></item>
<item>
<feedburner:origLink>https://www.sigarch.org/pipeorgan-modeling-memory-bandwidth-bound-executions-for-ai-and-beyond/</feedburner:origLink>
		<title>PipeOrgan: Modeling Memory-Bandwidth-Bound Executions for AI and Beyond</title>
		<link>https://feeds.feedblitz.com/~/940049756/0/sigarch-cat~PipeOrgan-Modeling-MemoryBandwidthBound-Executions-for-AI-and-Beyond/</link>
		<comments>https://feeds.feedblitz.com/~/940049756/0/sigarch-cat~PipeOrgan-Modeling-MemoryBandwidthBound-Executions-for-AI-and-Beyond/#respond</comments>
		<pubDate>Mon, 12 Jan 2026 15:00:20 +0000</pubDate>
		<dc:creator><![CDATA[Mark D. Hill]]></dc:creator>
		<category><![CDATA[ACM SIGARCH]]></category>
		<category><![CDATA[Accelerators]]></category>
		<category><![CDATA[Memory]]></category>
		<category><![CDATA[Modelling]]></category>
		<guid isPermaLink="false">https://www.sigarch.org/?p=97568</guid>
		<description><![CDATA[<div><img width="300" xheight="200" src="https://www.sigarch.org/wp-content/uploads/2026/01/SIGARCH_PipeOrgan_via_ChatGPT_2026_01_05-300x200.png" class="attachment-medium size-medium wp-post-image" alt="" style="max-width:100% !important;height:auto !important;margin-bottom:15px;margin-left:15px;float:right;"  loading="lazy" /></div>TL;DR: Latency-tolerant architectures, e.g., GPUs, increasingly use memory/storage hierarchies, e.g., for KV Caches to speed Large-Language Model AI inference. To aid codesign of such workloads and architectures, we develop the simple PipeOrgan analytic model for bandwidth-bound workloads running on memory/storage hierarchies.  Background For three reasons, memory bandwidth, more than latency, limits AI inference performance. First, [&#8230;]]]>
</description>
				<content:encoded><![CDATA[<div><img width="300" height="200" src="https://www.sigarch.org/wp-content/uploads/2026/01/SIGARCH_PipeOrgan_via_ChatGPT_2026_01_05-300x200.png" class="attachment-medium size-medium wp-post-image" alt="" style="margin-bottom:15px;margin-left:15px;float:right;" decoding="async" loading="lazy" /></div><p><i><span style="font-weight: 400;">TL;DR: Latency-tolerant architectures, e.g., GPUs, increasingly use memory/storage hierarchies, e.g., for KV Caches to speed Large-Language Model AI inference. To aid codesign of such workloads and architectures, we develop the simple PipeOrgan analytic model for bandwidth-bound workloads running on memory/storage hierarchies. </span></i></p>
<h3><b>Background</b></h3>
<p><span style="font-weight: 400;">For three reasons, memory bandwidth, more than latency, limits AI inference performance. First, AI inference uses latency-tolerant compute engines, such as GPUs. Second, it principally uses hardware memory hierarchies to store a data structure called a <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://huggingface.co/blog/not-lain/kv-caching">Key-Value (KV) Cache</a> that holds information from recent queries to reduce redundant computation. With <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/2309.06180">PagedAttention</a>, each KV Cache fetch obtains one or more multi-megabyte blocks (often called pages) that require substantial bandwidth to complete. Third, inference&#8217;s “decode” phase is memory-bound due to low arithmetic intensity, putting great pressure on memory bandwidth.</span></p>
<p><span style="font-weight: 400;">Traditional CPU memory/storage hierarchies are shaped by increasing latency, but designing hierarchies for AI workloads requires focusing on decreasing bandwidth. Since AI software is flexible, codesigning software and hardware is essential. </span></p>
<p><span style="font-weight: 400;">To provide intuition and a first answer to the above questions, we next contribute the simple <em>PipeOrgan</em> analytic model for optimizing bandwidth-bound workloads running on a memory hierarchy with many parallel <em>pipes</em> from memories to compute. The PipeOrgan model shows that husbanding and providing bandwidth is important for AI software and hardware. Analytic models have long provided computing intuition, e.g., <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://en.wikipedia.org/wiki/Amdahl%27s_law">Amdahl&#8217;s Law</a>, <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://en.wikipedia.org/wiki/Iron_law_of_processor_performance">Iron Law</a>, and <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://en.wikipedia.org/wiki/Roofline_model">Roofline</a>.</span></p>
<p><img loading="lazy" decoding="async" class=" wp-image-97780 aligncenter" src="https://www.sigarch.org/wp-content/uploads/2026/01/figure1.png" alt="" width="448" height="306" /></p>
<p><img loading="lazy" decoding="async" class="wp-image-97787 aligncenter" src="https://www.sigarch.org/wp-content/uploads/2026/01/figure2-1.png" alt="" width="798" height="350" /></p>
<h3><b>Example System with Two Parallel Memories</b></h3>
<p><span style="font-weight: 400;">Let’s start simple. Consider the hardware depicted in Figure 1 with High Bandwidth Memory (HBM) with bandwidth 16 TB/s </span><b>in parallel with</b><span style="font-weight: 400;"> an LPDDR memory with bandwidth 0.5 TB/s. Assume for now that there are no transfers between memories, e.g., to cache. </span></p>
<p><span style="font-weight: 400;">Using the PipeOrgan math from the next section, Figure 2’s blue line shows how system performance changes depending on what percentage of data comes from LPDDR memory. (The orange line comes later when we add caching.) </span>Performance is highest when LPDDR provides exactly 3% of the data <span style="font-weight: 400;">(arrow 1)</span>, which matches its 3% bandwidth <span style="font-weight: 400;">(0.5/(16.0+0.5))</span>. At this point, both LPDDR and HBM memories finish transferring data at the same time, so they act as co-bottlenecks and the system runs at peak efficiency.</p>
<p>When less than 3% of data is from LPDDR (left of the peak), <span style="font-weight: 400;">HBM finishes last and limits performance. When LPDDR sources more than 3% (right of the peak), it is</span> the bottleneck. LPDDR might have to source more data because <span style="font-weight: 400;">HBM&#8217;s limited capacity, currently 48-64GB per stack, may prevent it from sourcing its share (97%). If so, </span><span style="font-weight: 400;">performance drops quickly: 4% from LPDDR gives 76% of peak (arrow 2), and 20% yields just 15% (arrow 3).</span></p>
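<p>The numbers behind the arrows can be reproduced with a few lines of Python. This is a sketch of the two-memory model just described, not released code:</p>

```python
def perf_fraction(p, b_hbm=16.0, b_lpddr=0.5):
    """Fraction of peak performance when fraction p of the workload's
    data is sourced from LPDDR and the rest from HBM (no caching)."""
    peak = b_hbm + b_lpddr                      # both pipes co-bottlenecked
    perf = min(b_hbm / (1.0 - p), b_lpddr / p)  # slowest pipe limits throughput
    return perf / peak

# Arrow 1: p = 0.5/16.5 (~3%) reaches 100% of peak.
# Arrows 2 and 3: p = 4% gives ~76% of peak; p = 20% gives only ~15%.
```
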
<p><span style="font-weight: 400;">However, future AI systems will feature <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/pdf/2407.00079">multiple memory and storage levels</a>, using HBM, LPDDR, host DDR, pooled DDR, and attached or pooled FLASH storage</span><span style="font-weight: 400;">.</span></p>
<p><img loading="lazy" decoding="async" class=" wp-image-97782 aligncenter" src="https://www.sigarch.org/wp-content/uploads/2026/01/figure3.png" alt="" width="570" height="314" /></p>
<h3><b>PipeOrgan Model of Systems with N Parallel Memories</b></h3>
<p><span style="font-weight: 400;">The above result generalizes to an N-level memory/storage hierarchy with each level feeding compute in parallel. Optimal performance is achieved when all parallel memories complete a workload phase simultaneously, leading to this PipeOrgan principle:</span></p>
<p><b><i>Memory-bandwidth-bound workloads perform best when data is sourced from each memory level in proportion to its bandwidth.</i></b></p>
<p><strong>Proof: </strong></p>
<ol>
<li>Let each memory provide bandwidth b_i TB/s in parallel for total bandwidth B = b_1 + … + b_N.</li>
<li><span style="font-weight: 400;">For a workload, let each source d_i bytes in parallel for total data transferred D = d_1 + … + d_N.</span></li>
<li><span style="font-weight: 400;">By assumption, the workload is limited by data transfer time with compute hidden.</span></li>
<li><span style="font-weight: 400;">Time for each memory to finish its data transfer is d_i/b_i  = TB/(TB/s) = seconds.</span></li>
<li><span style="font-weight: 400;">Workload Time is the maximum of all memories finishing: MAX [d_1/b_1, …, d_N/b_N].</span></li>
<li><span style="font-weight: 400;">Workload Performance = 1/ Time = MIN[b_1/d_1, …, b_N/d_N].</span></li>
<li><span style="font-weight: 400;">Set each d_i = (D/B)*b_i, i.e., proportional to its bandwidth b_i.</span></li>
<li><span style="font-weight: 400;">Performance = MIN[b_1/((D/B)*b_1), …, b_N/((D/B)*b_N)].</span></li>
<li><span style="font-weight: 400;">Performance = MIN[(B/D), …, (B/D)] = B/D and Time = 1/Performance = D/B. </span></li>
</ol>
<p><span style="font-weight: 400;">This makes sense: PipeOrgan shows that best performance occurs when one moves all the data using all the bandwidth with no bandwidth idling.</span></p>
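<p>The proof steps above translate directly into code. A minimal sketch (bandwidths in TB/s, data in TB, so times come out in seconds):</p>

```python
def pipeorgan_time(bandwidths, data):
    """Step 5: workload time is the slowest pipe, MAX over i of d_i / b_i."""
    return max(d / b for b, d in zip(bandwidths, data))

def proportional_split(bandwidths, total_data):
    """Step 7: source d_i = (D/B) * b_i from each memory level."""
    total_bw = sum(bandwidths)
    return [total_data * b / total_bw for b in bandwidths]
```

<p>For example, with levels of 16.0, 0.5, and 0.1 TB/s (B = 16.6 TB/s) and D = 33.2 TB, the proportional split finishes in D/B = 2.0 s, and any other split takes longer.</p>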
<p><img loading="lazy" decoding="async" class="wp-image-97783 aligncenter" src="https://www.sigarch.org/wp-content/uploads/2026/01/figure4.png" alt="" width="555" height="263" /></p>
<h3><b>But Caching Is Critical</b></h3>
<p><span style="font-weight: 400;">The PipeOrgan version above assumes all data goes directly to compute, without transfers among memories. In reality, systems move data from lower- to higher-bandwidth memories, caching it for reuse. For a two-level system (see Figure 4), assume the entire fraction of the workload’s data used from LPDDR is first transferred to HBM for caching (orange arrow). Let the data used from LPDDR be f*D where f ranges from 0 to 1.</span></p>
<ul>
<li><span style="font-weight: 400;">Performance with caching = MIN[(b_1/D)/(f+1), b_2/(f*D)] = MIN[limited by HBM BW, limited by LPDDR BW].</span></li>
</ul>
<p><span style="font-weight: 400;">Figure 2 shows an orange curve for caching that is hidden under the original blue curve when more than 3% of data is sourced from LPDDR. At more than 3% from LPDDR, performance&#8211;without and with caching&#8211;is limited by the time to transfer needed data with the same limited LPDDR bandwidth.</span></p>
<p><span style="font-weight: 400;">While it might look like caching </span><span style="font-weight: 400;">doesn&#8217;t matter, caching is actually important. </span><span style="font-weight: 400;">This is because caching can greatly shift a workload’s x-axis operating point. For example, sourcing 20% of data from LPDDR yields 15% of peak performance (arrow 3). If LPDDR data is cached in HBM and reused five times, then–as the orange dashed arrow shows–only 4% comes from LPDDR and performance gets boosted to 76% of peak—a ~5x improvement (arrow 2).</span></p>
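<p>The ~5x jump can be checked numerically with a sketch of the two-level caching variant, where reuse shrinks the effective LPDDR traffic:</p>

```python
def cached_perf(f, reuse=1, b_hbm=16.0, b_lpddr=0.5, D=1.0):
    """Two-level caching model: fraction f of the workload's data lives in
    LPDDR and is staged through HBM; each staged byte is reused `reuse`
    times, shrinking effective LPDDR traffic to f/reuse of total data D."""
    f_eff = f / reuse
    # HBM both serves all D to compute and absorbs f_eff*D from LPDDR.
    return min(b_hbm / ((1.0 + f_eff) * D), b_lpddr / (f_eff * D))
```

<p>With 20% of data from LPDDR, five-fold reuse lifts performance from 2.5 to 12.5 (15% to 76% of the 16.5 peak), the ~5x gain shown by the orange dashed arrow.</p>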
<p><span style="font-weight: 400;">Consequently, caching remains critical. Moreover, PipeOrgan and its N-parallel-memory principle also apply to bandwidth-bound workloads once caching&#8217;s more complex information flows are accounted for.</span></p>
<h3><b>Implications, Limitations and Future Work</b></h3>
<p><span style="font-weight: 400;">Statistician George Box famously said, “</span><i><span style="font-weight: 400;">Essentially, all models are wrong, but some are useful.</span></i><span style="font-weight: 400;">” </span></p>
<p><span style="font-weight: 400;">We conjecture that the PipeOrgan model is useful for AI codesign, especially in the early stages and with software people having less hardware understanding. </span><b>Its key implication is that bandwidth-bound workloads must carefully manage bandwidth from larger, slower memories and storage. </b><span style="font-weight: 400;">While vast data can be stored statically, dynamic use from low-bandwidth memories should remain modest.</span></p>
<p><span style="font-weight: 400;">Three PipeOrgan limitations motivate future work. First, most workloads aren’t bandwidth bound throughout, and PipeOrgan doesn’t address other phases. Modeling these requires more parameters, increasing accuracy but also complexity.</span></p>
<p><span style="font-weight: 400;">Second, the caching model variant only covers two memory levels and always transfers data first to the higher-bandwidth level before use. Future work should extend this to N memory levels and more advanced caching policies. Modeling the many options for caching may be challenging.</span></p>
<p><span style="font-weight: 400;">Third, PipeOrgan may need to be extended for systems that do some processing in or near the memories themselves rather than moving all data to a segregated compute unit.</span></p>
<p><i><span style="font-weight: 400;"><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.cs.princeton.edu/courses/archive/fall13/cos375/Burks.pdf">Burks, Goldstine, &amp; von Neumann, 1946</a>: We are therefore forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible.</span></i></p>
<p><span style="font-weight: 400;">In sum, after eight decades of memory hierarchies focused mostly on latency, we are now at the exciting early stages of codesigning bandwidth-focused memory/storage hierarchies for more flexible AI software.</span></p>
<p><b>About the Author:</b><span style="font-weight: 400;"><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://pages.cs.wisc.edu/~markhill/"> Mark D. Hill</a> is John P. Morgridge Professor and Gene M. Amdahl Professor Emeritus of Computer Sciences at the University of Wisconsin-Madison and a consultant to industry. He initiated the PipeOrgan model while consulting for Microsoft and was given permission to release it. He is a fellow of AAAS, ACM, and IEEE, as well as a recipient of the 2019 Eckert-Mauchly Award.</span></p>
<p class="disclaim"><strong>Disclaimer:</strong> <em>These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.</em></p><Img align="left" border="0" height="1" width="1" alt="" style="border:0;float:left;margin:0;padding:0;width:1px!important;height:1px!important;" hspace="0" src="https://feeds.feedblitz.com/~/i/940049756/0/sigarch-cat">
]]>
</content:encoded>
			<wfw:commentRss>https://feeds.feedblitz.com/~/940049756/0/sigarch-cat~PipeOrgan-Modeling-MemoryBandwidthBound-Executions-for-AI-and-Beyond/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	<post-id xmlns="com-wordpress:feed-additions:1">97568</post-id></item>
<item>
<feedburner:origLink>https://www.sigarch.org/in-memoriam-remembering-mike-flynn/</feedburner:origLink>
		<title>In Memoriam: Remembering Mike Flynn</title>
		<link>https://feeds.feedblitz.com/~/939763391/0/sigarch-cat~In-Memoriam-Remembering-Mike-Flynn/</link>
		<comments>https://feeds.feedblitz.com/~/939763391/0/sigarch-cat~In-Memoriam-Remembering-Mike-Flynn/#respond</comments>
		<pubDate>Tue, 06 Jan 2026 21:00:08 +0000</pubDate>
		<dc:creator><![CDATA[Ruby B. Lee, Charlie Neuhauser, Timothy M. Pinkston]]></dc:creator>
		<category><![CDATA[ACM SIGARCH]]></category>
		<category><![CDATA[Memoriam]]></category>
		<guid isPermaLink="false">https://www.sigarch.org/?p=97727</guid>
		<description><![CDATA[<div><img width="257" xheight="300" src="https://www.sigarch.org/wp-content/uploads/2026/01/Picture1.jpg" class="attachment-medium size-medium wp-post-image" alt="" style="max-width:100% !important;height:auto !important;margin-bottom:15px;margin-left:15px;float:right;"  loading="lazy" /></div>Michael J. Flynn is a widely respected contributor—indeed a giant—in the field of Computer Architecture.  He made highly significant and impactful contributions throughout his career, both in industry and in academia.  Sadly, he passed away peacefully December 24, 2025, having lived a long and full life. Born May 20, 1934, in New York, NY, Flynn [&#8230;]]]>
</description>
				<content:encoded><![CDATA[<div><img width="257" height="300" src="https://www.sigarch.org/wp-content/uploads/2026/01/Picture1.jpg" class="attachment-medium size-medium wp-post-image" alt="" style="margin-bottom:15px;margin-left:15px;float:right;" decoding="async" loading="lazy" /></div><p style="font-weight: 400;">Michael J. Flynn is a widely respected contributor—indeed a <em>giant</em>—in the field of Computer Architecture.  He made highly significant and impactful contributions throughout his career, both in industry and in academia.  Sadly, he passed away peacefully December 24, 2025, having lived a long and full life.</p>
<p style="font-weight: 400;">Born May 20, 1934, in New York, NY, Flynn earned his Bachelor’s, Master’s, and Ph.D. degrees in Electrical Engineering from Manhattan College (1955), Syracuse University (1960), and Purdue University (1961), respectively, and he received an honorary Doctor of Science degree from the University of Dublin (1998).  After ten years as a design engineer and project manager at IBM (1955-65, in Endicott and Poughkeepsie, NY), he became a member of the faculty at the University of Illinois at Chicago (1965-1966), Northwestern University (1966-1970), and Johns Hopkins University (1970-1975) before joining Stanford University in 1975 as Professor of Electrical Engineering.  He taught internationally, in Ireland, other places in Europe, Singapore, and Japan.</p>
<p style="font-weight: 400;">As a young project manager at IBM, Flynn was responsible for the design of the well-known <em>IBM System 360 (Models 91/92/95 series)</em>, the first computer to implement the sophisticated Tomasulo algorithm, along with many other groundbreaking high-performance architectural techniques.  As the first family of general-purpose computer mainframes that featured <em>architectural compatibility</em> for both commercial and scientific applications, the System 360 is widely recognized as revolutionizing computing during that time—and in many ways persisting even today.  Indeed, many of the high-performance computing techniques developed by Flynn and his IBM colleagues are used throughout the industry today, having migrated from barn-sized mainframes to finger-nail sized microprocessor chips.  Flynn also was the first to shed light on the performance potential and limitations of parallel computers with what’s become known as <em>Flynn’s classification </em>(or <em>Flynn’s taxonomy</em>), a pioneering framework for categorizing parallelism in computer architectures based on the number of simultaneous instruction streams and data streams they handle, e.g., SISD, SIMD, MISD, and MIMD.  His original taxonomy is still used widely today, with various extensions derived from it, to distinguish between different kinds of parallel processor computer systems.</p>
<p style="font-weight: 400;">In 1972, together with some colleagues from IBM, Flynn co-founded Palyn Associates which provided consulting services in the field of high-performance computer architecture and design.  For more than 30 years, he and his colleagues advised nearly every major computer company in Japan, Europe and the United States, including IBM, CDC, Fujitsu, Hitachi, Honeywell Bull, and ICL.  Later, he played a prominent role in Maxeler, products of which made use of advanced dataflow techniques to provide high performance processing for specific applications, such as automated trading.  As a renowned professor at Stanford until his retirement in 1999 and transition to emeritus status, Flynn made seminal contributions to instruction set architecture (ISA), computer arithmetic, advanced floating-point design, multimedia, parallel processors and interconnects, emulation, and performance evaluation, to name a few.  He (co-)authored several textbooks, including <u>Introduction to Arithmetic for Digital Systems Designers</u>, <u>Computer Architecture: Pipelined and Parallel Processor Design</u>, and <u>Advanced Computer Arithmetic Design</u>. An IEEE Fellow, ACM Fellow, and Fellow of the Institution of Engineers of Ireland, Flynn received numerous other honors and awards for his impactful technical contributions, including the ACM/IEEE Eckert-Mauchly Award (1992), IEEE Computer Society’s (CS) Harry Goode Memorial Award and Medal (1995), the Tesla Award and Medal from the International Tesla Society in Belgrade (1998), IEEE CS Charles Babbage Award, IEEE CS Computer Pioneer Award (2015, his acceptance speech video is <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.youtube.com/watch?v=xAhRYUPSZKM">here</a>), and many others.</p>
<p style="font-weight: 400;">Notably, when the field of computer architecture was still in its infancy more than fifty years ago, Flynn founded the IEEE CS Technical Committee on Computer Architecture (TCCA) and ACM’s Special Interest Group on Computer Architecture (SIGARCH); he also started the ACM/IEEE International Symposium on Computer Architecture (ISCA), co-sponsored by both, which is among the most prestigious flagship computer architecture conferences in the world.  At ISCA’s 50<sup>th</sup> anniversary conference at FCRC 2023, Flynn was invited to give a “<a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://u.pcloud.link/publink/show?code=XZro5EVZFqb2N6FIwjuqKDkbaDonqJzVoXzX">50 Year Retrospective Lecture</a>” and was given an honorary plaque with these words inscribed: <em>&#8220;In recognition, with tremendous gratitude, of your lifetime dedication and leadership to the computer architecture community on this the 50<sup>th</sup> anniversary of your founding of ISCA, SIGARCH, and TCCA.&#8221;</em></p>
<p style="font-weight: 400;">Even more than his impressive technical contributions, which are many, Flynn is remembered fondly by the many dozens of doctoral graduate students he advised—for his unending kindness, wealth of wisdom, caring tutelage, gentle encouragement, constant motivation, and enduring support, especially when most needed.  He treated each and every student as if they were a member of his own family, and he was viewed by them not only as their academic “father,” but referred to affectionately as “the Great Man.”  Many of his former mentees returned to Stanford several times each year for luncheons to enjoy his company and reminisce about exciting times working with him in tackling some of the most compelling technical issues of the day.</p>
<p style="font-weight: 400;">Flynn was an equally generous mentor to his junior faculty colleagues, helping them establish their careers and providing sage advice as they made their way.  Kunle Olukotun attests to this: <em>“Meeting Mike Flynn near the end of my Ph.D. at the University of Michigan changed the trajectory of my career.  At the time, I was firmly on a path toward industry, but Mike believed that I could be a strong academic, and he encouraged me to apply to Stanford.  Mike saw something in me that I did not yet see in myself, and that confidence made an enduring difference. Once I arrived at Stanford, Mike served as my mentor. He helped me navigate the academic waters with thoughtful and wise advice, provided opportunities to showcase my research, and supported me through nominations for awards and professional recognition.  I am deeply grateful to Mike for all he did to help establish my career, and for the role he played in the success of so many other junior colleagues whom he mentored with the same generosity and vision.  I am deeply saddened by his passing.”</em>  Similar sentiments are echoed by Bill Dally, who shares the following: <em>“I first met Mike as a graduate student at Stanford in 1980.  I was awed by his accomplishments and his understanding of parallel computing. He kindled my interest in parallel computing which launched me on a very successful career.  Later, when I came to Stanford as a faculty member in 1997, I found Mike to be a great source of advice about Stanford, being a faculty member, research strategy, and many other topics.  I am deeply saddened to hear of Mike&#8217;s passing.  
He will be greatly missed.”</em>  Another of his faculty colleagues at Stanford, Christos Kozyrakis, recalls the following: <em>“One of the most memorable moments of my early teaching years was hosting him in class to discuss the Flynn taxonomy of computer architecture—a special experience for both the students and myself and a vivid reminder of the lasting impact of his work.”</em>  Indeed, Mike Flynn was highly respected and revered by fellow colleagues all throughout his professional career.  Solemnly noted by John L. Hennessy, <em>“Mike was the person who hired me at Stanford, gave me some of my first research funding, jointly published an early paper with me, and gave me my first consulting opportunity.  Sadly, his passing marks the end of an important era in computing: Mike was the last of the great System 360 pioneers—Gene Amdahl, Bob Evans, Fred Brooks, Eric Bloch, Gerry Blaauw, and Robert Tomasulo—all are now gone.”</em></p>
<p style="font-weight: 400;">He was a wonderful human being.</p>
<p style="font-weight: 400;">Professor Michael J. Flynn will be sorely missed by his loving family as well as by his extended academic family and all those whose lives he has indelibly touched over his blessed ninety-one plus years.  May he rest blissfully in peace, and may his venerable legacy be inspirational and long lasting.  Fittingly, through Mike Flynn’s final public words to all of us in the computer architecture community in his ISCA 50<sup>th</sup> Anniversary Lecture, he exhorted us all by saying: <em>“Now it’s your turn!”</em></p>
<p style="font-weight: 400;"><em><strong>About the Authors:</strong> </em></p>
<p><strong>Ruby B. Lee</strong> is the Forest G. Hamrick Professor Emeritus in the ECE department at Princeton University, and was chief architect at Hewlett-Packard in Silicon Valley before that. She is a Fellow of the IEEE, ACM and the American Academy of Arts and Sciences, and recipient of awards such as the Most Influential Paper award in 20 years at ISCA 2025 and the Test of Time award at the ACSAC 2024 security conference. Her research combines cyber security, computer architecture and deep learning, including secure processor and cache architectures, attacks and defenses, low-cost AI and multimedia.</p>
<p><strong>Charlie Neuhauser</strong> is now retired after more than 50 years in the field of computer design and analysis.  During the latter half of his career, he provided technical insight to attorneys and companies in the area of intellectual property.  He is currently the registration chair for the IEEE Hot Chips Symposium.</p>
<p><strong>Timothy M. Pinkston</strong> is the George Pfleger Chaired Professor of Electrical and Computer Engineering at the University of Southern California and also is a Vice Dean in USC’s Viterbi School of Engineering.  A Fellow of AAAS, ACM, and IEEE, and recipient of the ACM SIGARCH Alan D. Berenbaum Distinguished Service Award, Timothy’s research contributions mainly are in the area of interconnection networks and efficient data movement in parallel computing systems.</p>
<p style="font-weight: 400;">All three authors are former Ph.D. students of Mike Flynn at Stanford (Lee and Pinkston) and Johns Hopkins (Neuhauser).</p>
<p class="disclaim"><strong>Disclaimer:</strong> <em>These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.</em></p><Img align="left" border="0" height="1" width="1" alt="" style="border:0;float:left;margin:0;padding:0;width:1px!important;height:1px!important;" hspace="0" src="https://feeds.feedblitz.com/~/i/939763391/0/sigarch-cat">
]]>
</content:encoded>
			<wfw:commentRss>https://feeds.feedblitz.com/~/939763391/0/sigarch-cat~In-Memoriam-Remembering-Mike-Flynn/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	<post-id xmlns="com-wordpress:feed-additions:1">97727</post-id></item>
<item>
<feedburner:origLink>https://www.sigarch.org/microarchitectural-modeling-in-the-era-of-accelerator-rich-systems-and-computing-at-scale/</feedburner:origLink>
		<title>Microarchitectural Modeling in the Era of Accelerator-Rich Systems and Computing at Scale</title>
		<link>https://feeds.feedblitz.com/~/932315468/0/sigarch-cat~Microarchitectural-Modeling-in-the-Era-of-AcceleratorRich-Systems-and-Computing-at-Scale/</link>
		<comments>https://feeds.feedblitz.com/~/932315468/0/sigarch-cat~Microarchitectural-Modeling-in-the-Era-of-AcceleratorRich-Systems-and-Computing-at-Scale/#respond</comments>
		<pubDate>Mon, 08 Dec 2025 15:00:26 +0000</pubDate>
		<dc:creator><![CDATA[Dimitris Gizopoulos]]></dc:creator>
		<category><![CDATA[ACM SIGARCH]]></category>
		<category><![CDATA[AI accelerator]]></category>
		<category><![CDATA[Microprocessor]]></category>
		<category><![CDATA[Modeling]]></category>
		<category><![CDATA[Reliability]]></category>
		<category><![CDATA[Simulators]]></category>
		<guid isPermaLink="false">https://www.sigarch.org/?p=96032</guid>
		<description><![CDATA[<div><img width="300" xheight="168" src="https://www.sigarch.org/wp-content/uploads/2025/12/AdobeStock_1023668818-300x168.jpeg" class="attachment-medium size-medium wp-post-image" alt="" style="max-width:100% !important;height:auto !important;margin-bottom:15px;margin-left:15px;float:right;"  loading="lazy" /></div>Microarchitecture simulators have been conceived and implemented to be valuable tools for the design of computing chips of all types (SimpleScalar, gem5, SMTSIM, Sniper, Qflex, Scarab, GPGPU-sim, Accel-Sim, Multi2Sim, NaviSim, SCALE-sim, gem5-Salam, TAO, PyTorchSim – the list is neither historically complete nor updated). In essence, microarchitecture simulators have an “impossible” objective: to model and measure [&#8230;]]]>
</description>
				<content:encoded><![CDATA[<div><img width="300" height="168" src="https://www.sigarch.org/wp-content/uploads/2025/12/AdobeStock_1023668818-300x168.jpeg" class="attachment-medium size-medium wp-post-image" alt="" style="margin-bottom:15px;margin-left:15px;float:right;" decoding="async" loading="lazy" /></div><p><strong>Microarchitecture simulators</strong> have been conceived and implemented to be valuable tools for the design of computing chips of all types (<a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://ieeexplore.ieee.org/document/982917">SimpleScalar</a>,<a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://dl.acm.org/doi/10.1145/2024716.2024718"> gem5</a>, <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://cseweb.ucsd.edu/~tullsen/smtsim.html">SMTSIM</a>, <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://github.com/snipersim/snipersim">Sniper</a>,<a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://infoscience.epfl.ch/entities/publication/a829a51b-bc59-445e-9777-57ee89c83bce"> Qflex</a>,<a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://github.com/litz-lab/scarab"> Scarab</a>,<a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://ieeexplore.ieee.org/document/4919648"> GPGPU-sim</a>,<a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://ieeexplore.ieee.org/document/9138922"> Accel-Sim</a>,<a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~www.multi2sim.org/"> Multi2Sim</a>, <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://dl.acm.org/doi/abs/10.1145/3559009.3569666">NaviSim</a>,<a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://ieeexplore.ieee.org/document/9238602"> SCALE-sim</a>,<a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://ieeexplore.ieee.org/document/9251937"> gem5-Salam</a>,<a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://dl.acm.org/doi/10.1145/3656012"> TAO</a>, <a 
href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://dl.acm.org/doi/full/10.1145/3725843.3756045">PyTorchSim</a> – <em>the list is neither historically complete nor up to date</em>). In essence, microarchitecture simulators have an “impossible” objective: to model and measure the properties of a chip (and of a complete system built on it) both quickly and accurately; a contradiction in terms. They approach this goal by operating at an abstraction layer that allows fast exploration of a large space of options while still remaining relatively close to the actual hardware they model, at least at the cycle level. The objective is the same across different computing units: CPUs, GPUs, DSAs (Domain-Specific Accelerators), and, of course, AIAs (AI accelerators).</p>
<p>Trace-based simulation, event-driven simulation, and, more recently, machine-learning-based simulation approaches are all employed for the same purpose: to explore and rank-order the designers’ ideas about the microarchitecture in terms of important system properties; primarily performance, but also power and energy consumption, reliability, and security. Faced with hundreds or thousands of different microarchitecture and software combinations, architects and designers are suffocating. Microarchitecture simulation narrows this space down to a handful of configurations that can subsequently be analyzed at lower (finer) abstraction layers.</p>
<p>This short blog post aims to quantify, in admittedly simplistic terms, the challenges facing microarchitecture simulators in today’s landscape.</p>
<h4><b>The Simulation Throughput Bet</b></h4>
<p>Assuming that simulation at any abstraction layer can be parallelized similarly, we compare single simulation runs at each level. A single workload run on a simulated or real CPU, GPU, or AIA has a throughput of approximately (IPS = instructions/operations per second):</p>
<ul>
<li>&lt; 1 Kilo-IPS when simulated at the gate level</li>
<li>10 – 50 Kilo-IPS when simulated at the register-transfer level (boosted to 5 – 10 Mega-IPS when<a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://fires.im/"> FPGA-accelerated simulation</a> is employed)</li>
<li>0.3 – 1 Mega-IPS when simulated at the microarchitecture level</li>
<li>1 – 3 Giga-IPS when running on real silicon</li>
</ul>
<p>(<em>Absolute numbers may vary depending on the host system running simulations, but relative differences are close to the above</em>.)</p>
<p>The closest representation of the real physical system in the list is the gate level (or, descending further, the transistor level). Using the approximate numbers above, a short workload that runs for 10 seconds on final silicon needs about 1 year of gate-level simulation. The same run at the microarchitecture level takes less than a week!</p>
<p>Let’s now assume a simple design space exploration case, which only involves:</p>
<ul>
<li>10 workloads of about the above short duration each</li>
<li>20 different microarchitecture points (counts, sizes, and organizations of registers, buffers, queues, caches, arithmetic units)</li>
<li>5 compilation options</li>
</ul>
<p>Exploring this design space using gate-level simulation (bravely assuming the entire workloads can run at this level) would require 1000 years of simulation time. If 1000 servers were available for simulation, this time would shrink to only (!) 1 year. At the microarchitecture level, it is again a matter of only a few days!</p>
<p>The space that architects and designers need to explore does not consist of only 10 workloads, 20 microarchitecture points, and 5 compilation options. Microarchitecture-level simulation reduces years of simulation studies for design exploration down to days.</p>
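<p>The back-of-the-envelope arithmetic above can be reproduced in a few lines. The throughput values below are rough midpoints of the ranges quoted earlier (and are illustrative only, so the resulting figures land in the same ballpark as, rather than exactly on, the numbers in the text):</p>

```python
# Back-of-the-envelope simulation-time estimates (illustrative numbers only).
SECONDS_PER_YEAR = 365 * 24 * 3600

# Approximate throughputs in instructions per second (from the list above)
ips = {
    "gate level": 1e3,           # < 1 Kilo-IPS
    "RTL": 30e3,                 # 10-50 Kilo-IPS
    "microarchitecture": 0.5e6,  # 0.3-1 Mega-IPS
    "silicon": 2e9,              # 1-3 Giga-IPS
}

# A "short" workload: 10 seconds of execution on real silicon
instructions = 10 * ips["silicon"]  # 2e10 instructions

for level, rate in ips.items():
    t = instructions / rate  # seconds of simulation (or execution) time
    print(f"{level:18s}: {t / 86400:10.2f} days")

# Design-space exploration: 10 workloads x 20 microarch points x 5 compiler options
configs = 10 * 20 * 5
gate_years = configs * instructions / ips["gate level"] / SECONDS_PER_YEAR
print(f"gate-level DSE: ~{gate_years:.0f} years on one server, "
      f"~{gate_years / 1000:.2f} years on 1000 servers")
```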
<h4><b>Do Accelerator-rich Designs Make a Microarchitecture Simulator’s Life Easier?</b></h4>
<p>Domain-specific accelerators (DSAs), and in particular artificial intelligence accelerators (AIAs), are very fast and energy-efficient at the few tasks they are dedicated to serve. Does the small number of tasks mean that the design space to explore is smaller, or at least more manageable, than for CPUs or GPUs? Most probably not. The AI accelerator design space is very large because of the rapidly evolving and expanding set of ML algorithms. As a result, accelerator chip designs often undergo major changes. For example, in systolic-array-based AIAs, at least the following design parameters must be evaluated in simulation at design time:</p>
<ul>
<li>Dimensions of the systolic array</li>
<li>Data flow of the systolic array (input, output, weight stationary)</li>
<li>Data type of processing elements (short or long integers or floating-point numbers)</li>
<li>Level of memory hierarchy the AIA is connected to</li>
</ul>
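<p>Even these four knobs alone multiply quickly. The sketch below enumerates a hypothetical grid of candidate values (the specific values are placeholders for illustration, not taken from any real AIA):</p>

```python
from itertools import product

# Hypothetical candidate values for the four systolic-array knobs listed above
array_dims    = [(16, 16), (32, 32), (64, 64), (128, 128), (256, 256)]
dataflows     = ["input-stationary", "output-stationary", "weight-stationary"]
data_types    = ["int8", "int16", "fp16", "fp32"]
memory_levels = ["L2-attached", "LLC-attached", "DRAM-attached"]

design_points = list(product(array_dims, dataflows, data_types, memory_levels))
print(len(design_points), "design points from just four knobs")
# Each point still has to be simulated against every target ML workload,
# and the grid itself shifts with every new generation of ML/AI algorithms.
```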
<p>Because of the rapid changes in the ML/AI algorithms used by AIAs for training and inference, their “useful” life span is, and will likely remain, much shorter than that of established general-purpose CPU and GPU architectures, where the microarchitecture knobs are also plentiful but well understood. Therefore, the effort put into an AIA design may see a much shorter useful production time compared to a CPU or a GPU, and a new design space exploration phase will need to be performed for each new generation of AIAs designed for the next generation of ML/AI algorithms.</p>
<h4><b>Beyond Performance Exploration – Resilience at Scale</b></h4>
<p>Microarchitecture simulators have been employed for resilience analysis of<a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://ieeexplore.ieee.org/document/7482075"> CPUs</a> and<a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://ieeexplore.ieee.org/document/7482077"> GPUs</a> for almost a decade now. Recently, they have been proven very effective in Silent Data Corruption (<a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://ieeexplore.ieee.org/document/11014262">SDCs</a>) research to automatically generate<a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://ieeexplore.ieee.org/document/10609719"> functional test programs for CPUs</a>, to <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://ieeexplore.ieee.org/abstract/document/10946774">demystify the actual rate</a> of <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.sigarch.org/sdcs-a-b-c/">SDCs</a> generated by CPUs at datacenter scale,<a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://ieeexplore.ieee.org/document/11117132"> accurately analyze SDCs for AIAs</a> and contribute to<a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~micro58.tutorial.di.uoa.gr/"> decision making for fault protection</a> of AIA-based systems.</p>
<p>Resilience (and SDCs) analysis for large-scale AI systems (recently<a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.opencompute.org/documents/sdc-in-ai-ocp-whitepaper-final-pdf"> openly recognized</a> by the OCP group of companies and the <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.computer.org/digital-library/magazines/mi/cfp-silent-data-corruptions">research community</a> as a critical problem) adds extra dimensions to the design space exploration arena (these dimensions exist for CPUs and GPUs resilience analysis too):</p>
<ul>
<li>Type of silicon defects to analyze (fault models)</li>
<li>Methods to enhance the resilience of the systems (protection)</li>
<li>Scale of AI systems</li>
</ul>
<p>For modeling defects and faults, the main challenge is enhancing microarchitecture simulators for AIAs with mechanisms that accurately represent the root causes and behaviors of silicon defects (such as transient, permanent, or delay faults) and the physical conditions that excite them (temperature, voltage droops, etc.).</p>
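<p>As a toy illustration of the fault models above: microarchitecture-level fault injection typically flips bits in simulated state and then classifies the outcome as masked, detected, or a silent data corruption. The minimal sketch below is hypothetical and not tied to any particular simulator:</p>

```python
import random

def inject_transient_fault(value: int, width: int = 64) -> int:
    """Model a transient (soft-error) fault as a single-bit flip
    in a `width`-bit register value."""
    return value ^ (1 << random.randrange(width))

def classify(golden: int, faulty_output: int, detected: bool) -> str:
    """Rough outcome taxonomy commonly used in resilience studies."""
    if faulty_output == golden:
        return "masked"    # the fault had no architecturally visible effect
    if detected:
        return "detected"  # e.g. caught by parity, ECC, or a software checker
    return "SDC"           # silent data corruption: wrong result, no alarm

random.seed(42)
golden = 0x1234
faulty = inject_transient_fault(golden)
print(classify(golden, faulty, detected=False))
```

<p>A real campaign repeats this over many injection sites, cycles, and workloads, which is precisely why it multiplies the simulation budget discussed earlier.</p>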
<p>Protection schemes represent yet another design knob. When a redundancy technique (in space, time, or information) is employed to detect and correct hardware faults, it alters the design of the microarchitecture, the software, or both, adding overheads in terms of area, power, and performance. Applying the simple calculations presented above in this context points to a further expansion of the design space that the simulator is called to explore.</p>
<p>The increasing scale of datacenters and HPC clusters (datacenters for “usual” or ML/AI workloads, HPC systems for scientific computing or ML/AI workloads) exacerbates the problem of device variability: chips that are designed to be identical but operate with variations. The scale of deployment of AIA chips, CPUs, and GPUs, coupled with the diversity of workloads, increases the number of failure mechanisms and scenarios that can occur in the field but were never imagined at chip design or manufacturing time. Yet another dimension to analyze.</p>
<p>Microarchitecture modeling and simulation for joint performance and resilience exploration is now an integral part of ambitious development activities around the globe. For one, the<a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://dare-riscv.eu/"> DARE project</a>, Europe’s most ambitious endeavor to date in HPC and AI computing, relies heavily on microarchitectural simulation for all three computing engines it builds (a high-performance general-purpose microprocessor, an aggressive vector processor, and an AI inference engine) to make design decisions for performance, power, and resilience. The expected FIT (failures-in-time) and SDC rates when the designed chips are deployed at scale will be estimated using microarchitecture simulation, and protection schemes will be diligently implemented.</p>
<h4><b>The Need for Validation</b></h4>
<p>Like any abstraction, microarchitecture-level modeling is constantly questioned on the validity of its findings versus simulation at the more detailed layers it abstracts away (there are examples of gem5 validation for <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://ieeexplore.ieee.org/document/6844457">performance measurements</a> and <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://ieeexplore.ieee.org/document/8809532">reliability measurements</a> against physical systems and experiments). Validation of simulators is among the most important and useful results expected by the computer architecture and systems research community as our computing systems grow in complexity and scale. Simulators are extremely important, but we need to continuously tune them to model, as accurately as possible, the properties of physical computing systems and the behaviors of modern software stacks.</p>
<p><b>About the Author: </b>Dimitris Gizopoulos is Professor of Computer Architecture at the University of Athens. His research team (Computer Architecture Lab) focuses on modeling, evaluating, and improving the performance, dependability, and energy-efficiency of computing systems based on CPUs, GPUs, and AIAs.</p>
<p class="disclaim"><strong>Disclaimer:</strong> <em>These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.</em></p><Img align="left" border="0" height="1" width="1" alt="" style="border:0;float:left;margin:0;padding:0;width:1px!important;height:1px!important;" hspace="0" src="https://feeds.feedblitz.com/~/i/932315468/0/sigarch-cat">
]]>
</content:encoded>
			<wfw:commentRss>https://feeds.feedblitz.com/~/932315468/0/sigarch-cat~Microarchitectural-Modeling-in-the-Era-of-AcceleratorRich-Systems-and-Computing-at-Scale/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	<post-id xmlns="com-wordpress:feed-additions:1">96032</post-id></item>
<item>
<feedburner:origLink>https://www.sigarch.org/the-hitchhikers-guide-to-coherent-fabrics-5-programming-rules-for-cxl-nvlink-and-infinityfabric/</feedburner:origLink>
		<title>The Hitchhiker&#8217;s Guide to Coherent Fabrics: 5 Programming Rules for CXL, NVLink, and InfinityFabric</title>
		<link>https://feeds.feedblitz.com/~/930748850/0/sigarch-cat~The-Hitchhikers-Guide-to-Coherent-Fabrics-Programming-Rules-for-CXL-NVLink-and-InfinityFabric/</link>
		<comments>https://feeds.feedblitz.com/~/930748850/0/sigarch-cat~The-Hitchhikers-Guide-to-Coherent-Fabrics-Programming-Rules-for-CXL-NVLink-and-InfinityFabric/#respond</comments>
		<pubDate>Mon, 01 Dec 2025 14:59:17 +0000</pubDate>
		<dc:creator><![CDATA[Zixuan Wang, Suyash Mahar, Luyi Li, Jangseon Park, Jinpyo Kim, Theodore Michailidis, Yue Pan, Mingyao Shen, Tajana Rosing, Dean Tullsen, Steven Swanson, Jishen Zhao]]></dc:creator>
		<category><![CDATA[ACM SIGARCH]]></category>
		<category><![CDATA[AlphaFold]]></category>
		<category><![CDATA[CXL]]></category>
		<category><![CDATA[Fabric]]></category>
		<category><![CDATA[Heterogeneous Systems]]></category>
		<category><![CDATA[Memory]]></category>
		<category><![CDATA[Profiling]]></category>
		<guid isPermaLink="false">https://www.sigarch.org/?p=95299</guid>
		<description><![CDATA[<div><img width="300" xheight="153" src="https://www.sigarch.org/wp-content/uploads/2025/11/image7-300x153.png" class="attachment-medium size-medium wp-post-image" alt="" style="max-width:100% !important;height:auto !important;margin-bottom:15px;margin-left:15px;float:right;"  loading="lazy" /></div>This is the second article in the series, following our first blog in Dec 2023: https://www.sigarch.org/tuning-the-symphony-of-heterogeneous-memory-systems/ Modern applications are increasingly memory hungry. Applications like Large-Language Models (LLM), in-memory databases, and data analytics platforms often demand more memory bandwidth and capacity than what a standard server CPU can provide. This leads to the development of coherent [&#8230;]]]>
</description>
				<content:encoded><![CDATA[<div><img width="300" height="153" src="https://www.sigarch.org/wp-content/uploads/2025/11/image7-300x153.png" class="attachment-medium size-medium wp-post-image" alt="" style="margin-bottom:15px;margin-left:15px;float:right;" decoding="async" loading="lazy" /></div><blockquote><p>This is the second article in the series, following our first blog in Dec 2023:
<br>
<a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.sigarch.org/tuning-the-symphony-of-heterogeneous-memory-systems/">https://www.sigarch.org/tuning-the-symphony-of-heterogeneous-memory-systems/</a></p></blockquote>
<p>Modern applications are increasingly memory hungry. Applications like Large Language Models (LLMs), in-memory databases, and data analytics platforms often demand more memory bandwidth and capacity than a standard server CPU can provide. This has led to the development of coherent fabrics that interconnect more memory with cache-coherence support, so that workloads can benefit from large memory capacity ideally with little or no code modification.</p>
<p>But is it truly trivial for workloads to adopt such heterogeneous memory systems? Are there hidden caveats behind the pipe dreams of future memory systems? And if one decides to build such a heterogeneous memory system, what are the important factors to consider before deploying a cluster of such systems?</p>
<p>To answer these questions, <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/2411.02814">we studied a wide range of coherent fabrics</a>, namely Compute Express Link (CXL), NVLink Chip-to-Chip (NVLink-C2C), and AMD’s InfinityFabric, in terms of architectural reverse engineering, performance characterization, and performance implications for emerging workloads.</p>
<p>This work was made possible by 8 PhD students from 4 UCSD research groups working over 1.5 years, with broad industrial collaboration. Together we measured 13 server systems, across 3 CPU vendors spanning a wide range of CPU generations, 3 types of coherent fabric links, 5 device vendors, and multiple system configurations (including local vs. remote NUMA, Sub-NUMA Clustering mode, interleaved mode, etc.).</p>
<p><img loading="lazy" decoding="async" class="alignnone size-full wp-image-95303" src="https://www.sigarch.org/wp-content/uploads/2025/11/image1.png" alt="" width="1734" height="630" srcset="https://www.sigarch.org/wp-content/uploads/2025/11/image1.png 1734w, https://www.sigarch.org/wp-content/uploads/2025/11/image1-1280x465.png 1280w, https://www.sigarch.org/wp-content/uploads/2025/11/image1-980x356.png 980w, https://www.sigarch.org/wp-content/uploads/2025/11/image1-480x174.png 480w" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) and (max-width: 980px) 980px, (min-width: 981px) and (max-width: 1280px) 1280px, (min-width: 1281px) 1734px, 100vw" /></p>
<p>We have <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/2411.02814">shared our detailed observations on arXiv</a> and <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://github.com/awesome-cxl/heimdall">open-sourced our benchmarking suite on GitHub</a>, which runs on all the above-mentioned systems, even non-CXL systems like the NVIDIA GH200. This blog article focuses on CXL, which has recently attracted broad attention. Please refer to our paper for more details on the other types of coherent fabrics.</p>
<h2>1. Compute Express Link (CXL)</h2>
<p><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://computeexpresslink.org/">CXL today is widely available</a>, with the latest server CPUs from Intel and AMD supporting memory expanders from various vendors. But given these options, one needs to decide whether CXL is needed, and how to build a CXL system.</p>
<h3>1.1 Who Should Use CXL?</h3>
<p>The core value propositions of CXL-based memory expansion are:</p>
<ol>
<li><b>Massive Capacity Expansion:</b> CXL allows for the addition of hundreds of gigabytes, or even terabytes, of extra memory through the server&#8217;s PCIe slots.</li>
<li><b>Targeted Bandwidth Expansion:</b> In scenarios where the main memory channels are saturated but PCIe lanes sit idle, CXL can increase the total system bandwidth. And CXL is superior to directly-attached DIMMs in terms of bandwidth per CPU pin. This is particularly beneficial for workloads that are not acutely sensitive to latency.</li>
</ol>
<p>If any of these are currently limiting your workload, then CXL might be worth looking into. <b>And the rest of this article may help you with more detailed suggestions. </b></p>
<h3>1.2 How to architect a CXL-based Machine?</h3>
<p>Before you start to spec your machine with CXL, there are a few things to consider:</p>
<h4>1.2.1 Latency Tax</h4>
<p>The first principle of CXL is that it is slower than local DRAM, with accesses being 50% to 300% slower. Our measurements show that while a typical local DIMM access takes around 100 ns, an access to a modern ASIC-based CXL device falls in the 200-300 ns range. Early-generation FPGA-based CXL prototypes are even slower, with latencies around 400 ns.</p>
<p>CXL carries a high latency tax; however, the alternatives are even worse. To put CXL’s latency in context, here is a comparison of different memory/storage technologies:</p>
<table style="height: 238px;" width="696">
<tbody>
<tr>
<td><b>Metric</b></td>
<td><b>Local DRAM (DIMM)</b></td>
<td><b>Local CXL Memory (ASIC)</b></td>
<td><b>NVMe SSD</b></td>
</tr>
<tr>
<td><b>Typical Latency</b></td>
<td>Lowest
<br>
(~80-120 ns)</td>
<td>Medium
<br>
(~200-300 ns)</td>
<td>Highest
<br>
(10,000+ ns)</td>
</tr>
<tr>
<td><b>Peak Bandwidth</b></td>
<td>Highest
<br>
(200+ GB/s)</td>
<td>Medium
<br>
(~30 GB/s per device with 16 lanes)</td>
<td>Lower
<br>
(~7-14 GB/s)</td>
</tr>
<tr>
<td><b>Max Capacity</b></td>
<td>Limited by DIMM slots</td>
<td>High (Terabytes)</td>
<td>Highest (Many TBs)</td>
</tr>
<tr>
<td><b>Recommended Use Case</b></td>
<td>Hot Data, Performance</td>
<td>Warm Data, Capacity Tier</td>
<td>Cold Data, Storage</td>
</tr>
</tbody>
</table>
<p><b>Rule of thumb: It is important to remember that CXL is a new tier of memory, not a DRAM replacement!</b> <b>Don&#8217;t think of CXL as a way to get more fast memory. Think of it as a way to get faster access to massive capacity.</b></p>
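<p>In practice, this rule of thumb translates into hotness-based data placement across the three tiers in the table above. Below is a toy sketch of such a policy; the access-rate thresholds are invented for illustration and would be tuned per workload in any real system:</p>

```python
def place(page_accesses_per_sec: float) -> str:
    """Toy hot/warm/cold placement policy across the tiers in the table above.
    Thresholds are illustrative, not measured."""
    if page_accesses_per_sec > 1000:
        return "local DRAM"  # hot data: latency-critical working set
    if page_accesses_per_sec > 10:
        return "CXL memory"  # warm data: capacity tier
    return "NVMe SSD"        # cold data: storage

# A frequently touched page, a warm page, and a nearly idle page
print(place(5000), place(100), place(0.1))
```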
<h4>1.2.2 Bandwidth: Scalable but with Limits</h4>
<p>A single CXL memory expander using 8 CXL 2.0 lanes (each CXL lane takes one PCIe lane) provides up to 32 GiB/s of additional memory bandwidth. Modern AMD servers such as AMD Turin provide up to 64 CXL lanes, offering up to 250 GiB/s of additional memory bandwidth.</p>
<p>In practice, we measured about 25-30 GiB/s of memory bandwidth from a single 8-lane CXL memory expander on real systems. Thus a CPU with 64 CXL lanes can support 200~240 GiB/s of CXL bandwidth.</p>
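<p>These figures follow from simple per-lane arithmetic: a CXL 2.0 lane rides on a PCIe 5.0 lane at roughly 4 GiB/s, and the ~80% efficiency factor below is an approximation chosen to match the measured 25-30 GiB/s per x8 device quoted above:</p>

```python
GIB_PER_LANE = 4.0          # CXL 2.0 over PCIe 5.0: ~4 GiB/s per lane (raw)
MEASURED_EFFICIENCY = 0.80  # rough measured-to-theoretical ratio (assumption)

def cxl_bandwidth(lanes: int, measured: bool = False) -> float:
    """Estimate aggregate CXL bandwidth in GiB/s for a given lane count."""
    bw = lanes * GIB_PER_LANE
    return bw * MEASURED_EFFICIENCY if measured else bw

print(cxl_bandwidth(8))                   # one x8 expander, theoretical peak
print(cxl_bandwidth(8, measured=True))    # closer to what real systems deliver
print(cxl_bandwidth(64, measured=True))   # a 64-lane CPU fully populated
```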
<h2>2. The 5  Essential Rules of CXL Programming</h2>
<h3>Rule #1: Pin Your Workloads, Especially on Earlier Intel CPUs</h3>
<p><img loading="lazy" decoding="async" class="wp-image-95322 aligncenter" src="https://www.sigarch.org/wp-content/uploads/2025/11/image2-1.png" alt="" width="440" height="219" /></p>
<p>To avoid limiting the workloads to a fraction of the available Last-Level Cache (LLC), Intel CPU users should strongly consider pinning workloads that access CXL memory to the local socket using tools like <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://github.com/numactl/numactl">numactl</a>.</p>
<p>The reason is that on the tested Intel <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.intel.com/content/www/us/en/products/docs/processors/xeon-accelerated/4th-gen-xeon-scalable-processors.html">Sapphire Rapids</a> (SPR) and <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.intel.com/content/www/us/en/products/docs/processors/xeon/5th-gen-xeon-scalable-processors.html">Emerald Rapids</a> (EMR) CPUs, an application accessing CXL memory remotely is restricted to only a fraction of its local CPU&#8217;s LLC which is as little as 1/8th of the total cache on SPR and 1/4th on EMR. In contrast, remote access to standard DIMM memory can utilize the full LLC. However, the tested <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.amd.com/en/products/processors/server/epyc/4th-generation-architecture.html">AMD Zen4</a> (we only have a single socket <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.amd.com/en/products/processors/server/epyc/9005-series.html">Zen5</a> so cannot verify whether Zen5 is affected) and the new Intel <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.intel.com/content/www/us/en/content-details/845771/intel-xeon-6-processor-family-product-brief.html">Granite Rapids</a> (GNR) systems did not exhibit this behavior, showing symmetric cache utilization for both local and remote CXL access. For system architects, it means services must be designed with an explicit awareness of this physical topology to avoid catastrophic performance degradation.</p>
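<p>A minimal way to apply this rule is to wrap the workload in numactl, binding both where it runs and where it allocates. The helper below merely assembles such an invocation; the node numbers and workload path are placeholders (on real systems, check which NUMA node the CXL memory appears as, e.g. via <code>numactl -H</code>):</p>

```python
def numactl_pin(workload_argv, cpu_node=0, mem_nodes=(0,)):
    """Build a numactl command that pins execution to `cpu_node` and
    restricts allocations to `mem_nodes` (e.g. the local socket's DRAM
    node plus its locally attached CXL node)."""
    return [
        "numactl",
        f"--cpunodebind={cpu_node}",
        "--membind=" + ",".join(str(n) for n in mem_nodes),
        *workload_argv,
    ]

# Pin a hypothetical workload to socket 0, allowing DRAM (node 0) and CXL (node 2)
print(" ".join(numactl_pin(["./memory_hungry_app"], cpu_node=0, mem_nodes=(0, 2))))
```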
<table>
<tbody>
<tr>
<td><img loading="lazy" decoding="async" class="alignnone wp-image-95307" src="https://www.sigarch.org/wp-content/uploads/2025/11/image5.png" alt="" width="315" height="339" /></td>
<td><img loading="lazy" decoding="async" class="alignnone wp-image-95310" src="https://www.sigarch.org/wp-content/uploads/2025/11/image8.png" alt="" width="316" height="339" /></td>
</tr>
<tr>
<td>
<p style="text-align: center;">Zen4-1-ASIC-CXL-1: DRAM bandwidth.</p>
</td>
<td>
<p style="text-align: center;">Zen4-1-ASIC-CXL-1: CXL bandwidth.</p>
</td>
</tr>
<tr>
<td colspan="2">
<p style="text-align: center;">Running two threads of bandwidth test on different cores, using Zen4-1 with ASIC-CXL-1 (refer to Table 1 for hardware details).</p>
</td>
</tr>
</tbody>
</table>
<p>Workload pinning should also take chiplet performance into consideration. As illustrated in the above figure, chiplet architecture impacts the bandwidth of accessing not just DRAM but also CXL; accesses originating from cores within the same chiplet group have a lower bandwidth. While the above example is from Zen4, other chiplet-based CPUs generally have similar performance characteristics.</p>
<h3>Rule #2: Asymmetric Read/Write Performance</h3>
<p>We observed a critical performance asymmetry on the tested AMD platforms. While load (read) bandwidth scaled with thread count, the store (write) bandwidth for ASIC CXL devices remained flat and low regardless of the number of threads. This indicates a potential &#8220;performance asymmetry&#8221; for certain workloads on these platforms: a very small number of cores is enough to reach peak store bandwidth, whereas load bandwidth keeps scaling with more cores.</p>
<table>
<tbody>
<tr>
<td><img loading="lazy" decoding="async" class="alignnone wp-image-95312" src="https://www.sigarch.org/wp-content/uploads/2025/11/image10-e1764109518176.png" alt="" width="340" height="199" /></td>
<td><img loading="lazy" decoding="async" class="alignnone wp-image-95306" src="https://www.sigarch.org/wp-content/uploads/2025/11/image4-e1764109565249.png" alt="" width="340" height="200" /></td>
</tr>
<tr>
<td>
<p style="text-align: center;">Store performance DIMMs vs CXL memory expander.</p>
</td>
<td>
<p style="text-align: center;">Load performance DIMMs vs CXL memory expander.</p>
</td>
</tr>
<tr>
<td colspan="2">
<table>
<tbody>
<tr>
<td colspan="2">
<p style="text-align: center;">Load and store bandwidth scaling for Intel EMR using an ASIC-based CXL memory expander. Store bandwidth saturates with far fewer cores than load bandwidth.</p>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<h3>Rule #3: Introducing CXL Memory Reduces the Overall Memory Access Latency</h3>
<table>
<tbody>
<tr>
<td><img loading="lazy" decoding="async" class="alignnone wp-image-95305" src="https://www.sigarch.org/wp-content/uploads/2025/11/image3.png" alt="" width="362" height="219" /></p>
<p style="text-align: center;">Average load access latency scaling with threads.</p>
</td>
<td><img loading="lazy" decoding="async" class="alignnone wp-image-95311" src="https://www.sigarch.org/wp-content/uploads/2025/11/image9.png" alt="" width="338" height="205" /></p>
<p style="text-align: center;">Total load bandwidth scaling with threads.</p>
</td>
</tr>
<tr>
<td colspan="2">
<p style="text-align: center;">Latency and bandwidth scaling with threads shows that DIMMs+CXL not only increased the total available memory bandwidth, but also decreased the average memory access latency.</p>
</td>
</tr>
</tbody>
</table>
<p>Adding CXL memory alongside DDR DIMMs lowers overall system memory latency even though CXL memory itself is slower. This happens because the extra bandwidth from CXL prevents DRAM channels from saturating, reducing queuing delays and shortening average access time.</p>
<p>In experiments on an Intel SPR system with two CXL memory expanders on the same socket, we found:</p>
<ul>
<li><b>Latency improvement:</b> Once CXL devices were active, total memory latency dropped compared to using only DIMMs. E.g., the “Local-DIMM + 1xCXL” configuration has lower read latencies than local DIMMs alone.</li>
<li><b>Bandwidth extension:</b> A single CXL device added roughly 17 GiB/s of bandwidth, while two devices offered about 25 GiB/s. This is below their combined theoretical 50 GiB/s peak, indicating partial utilization.</li>
</ul>
<p>On the remote socket, CXL memory did not increase bandwidth, likely due to UPI saturation, but it still reduced latency when used alongside DIMMs.</p>
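<p>One practical way to exploit the extra bandwidth is to interleave pages across the DIMM and CXL tiers in proportion to their bandwidths, similar in spirit to Linux&#8217;s weighted-interleave memory policy. A minimal sketch of deriving integer interleave weights; the local-DDR bandwidth figure is an illustrative assumption, while the ~25 GiB/s CXL figure comes from the measurements above:</p>

```python
from math import gcd

def interleave_weights(bw_gibs):
    """Integer page-interleave weights proportional to per-tier bandwidth."""
    scaled = [round(b * 10) for b in bw_gibs]  # keep one decimal digit of precision
    g = 0
    for s in scaled:
        g = gcd(g, s)                          # reduce to the smallest integer ratio
    return [s // g for s in scaled]

# Illustrative: a local DDR tier assumed at ~200 GiB/s vs the ~25 GiB/s
# we observed from two CXL expanders.
weights = interleave_weights([200.0, 25.0])
print(weights)  # -> [8, 1]: place 8 pages on DDR for every 1 page on CXL
```

<p>A ratio like this keeps most traffic on the fast tier while still draining enough load onto CXL to avoid saturating the DDR channels.</p>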
<h3>Rule #4: Which CPU Architecture to Pick?</h3>
<p>The bandwidth and latency of CXL memory expanders are significantly influenced by the CPU microarchitecture. We find that AMD CPUs can generally saturate the CXL device bandwidth, while Intel’s earlier generations (SPR and EMR) are sub-optimal; the recent GNR generation reaches parity with AMD.</p>
<p>Intel and AMD CPUs demonstrate similar bandwidth when accessing local memory. However, Intel&#8217;s SPR and EMR processors exhibit lower bandwidth for remote memory access than for local access, while the latest GNR generation fixes this issue. In terms of latency, Intel CPUs generally offer lower latency than AMD processors.</p>
<p>Both CPU architectures exhibit a common phenomenon: latency increases dramatically as bandwidth utilization approaches its maximum. This is the typical trade-off during memory system saturation and an important performance consideration when deploying CXL memory expander solutions.</p>
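<p>Latency curves like these are typically measured with a dependent-load (pointer-chasing) microbenchmark, where each access must complete before the next address is known, defeating out-of-order execution and hardware prefetching. A minimal Python sketch of the idea; production measurements use C with working sets far larger than the last-level cache:</p>

```python
import random
import time

def make_chain(n):
    # Build a single random cycle so every load depends on the previous one
    order = list(range(n))
    random.shuffle(order)
    chain = [0] * n
    for a, b in zip(order, order[1:] + order[:1]):
        chain[a] = b
    return chain

def chase_ns(chain, steps):
    idx = 0
    t0 = time.perf_counter_ns()
    for _ in range(steps):
        idx = chain[idx]   # dependent load: the next index comes from this access
    return (time.perf_counter_ns() - t0) / steps

chain = make_chain(1 << 20)  # ~1M entries; in C, size this well beyond the LLC
print(f"{chase_ns(chain, 100_000):.1f} ns per dependent access")
```

<p>Running several such chasers in parallel while a separate set of threads streams bandwidth is how the latency-under-load curves above are obtained.</p>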
<h3>Rule #5: Capacity Expansion Enables Complex AI-based Scientific Discovery Like AlphaFold3</h3>
<p>CXL memory is tempting for many memory-hungry workloads; among these, we find AI-based scientific workloads an especially interesting use case. Such workloads consume a large amount of memory capacity and require adequate bandwidth, while the end-to-end execution time is dominated by CPU-side operations rather than the GPU. CXL is a drop-in solution for them.</p>
<p><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://cseweb.ucsd.edu/~jzhao/files/kim-iiswc2025.pdf">In our experiments with AlphaFold3 accessing CPU DIMM memory</a>: although most inputs fit within system memory capacity, heavy inputs, including RNA, require hundreds of gigabytes. On DIMM-only systems these inputs fail with out-of-memory errors, but with CXL memory expanders they complete successfully. Although CXL introduces additional latency, its impact was secondary; capacity expansion was the decisive factor in enabling these workloads.</p>
<p><img loading="lazy" decoding="async" class="wp-image-95308 aligncenter" src="https://www.sigarch.org/wp-content/uploads/2025/11/image6.png" alt="" width="488" height="331" srcset="https://www.sigarch.org/wp-content/uploads/2025/11/image6.png 488w, https://www.sigarch.org/wp-content/uploads/2025/11/image6-480x325.png 480w" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) 488px, 100vw" /></p>
<p>This is a clear example of a broader pattern: scientific workloads are often deployed on HPC systems already tuned for expected requirements, yet unusually large inputs can break those assumptions. In such situations, CXL memory provides a flexible alternative by expanding capacity on demand without rebuilding servers or modifying applications.</p>
<h2>Conclusions</h2>
<p>We are entering the era of heterogeneous computing, where the memory system is getting more heterogeneous with the addition of coherent fabrics like CXL. Our research revealed that such systems have unconventional performance characteristics, and programmers have to carefully consider such characteristics to achieve optimal performance.</p>
<h2>Acknowledgement</h2>
<p>This work was supported by the PRISM and ACE centers, two of the seven centers in JUMP 2.0, a Semiconductor Research Corporation (SRC) program. We would also like to acknowledge Microsoft, Giga Computing, Samsung Memory Research Center, and National Research Platform for providing access and support for various servers.</p>
<p><b>About the authors:</b></p>
<p><i>Zixuan Wang got his PhD from the Computer</i><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://cse.ucsd.edu/"><i> Science and Engineering Department</i></a><i> at</i><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~www.ucsd.edu/"><i> University of California, San Diego</i></a><i>. His research spans across architecture and system, with a focus on heterogeneous memory system performance and security.</i></p>
<p><i>Suyash Mahar got his PhD from the </i><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://cse.ucsd.edu/"><i>Computer Science and Engineering Department</i></a><i> at</i><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~www.ucsd.edu/"><i> University of California, San Diego</i></a><i>. His research focuses on cloud computing, with an emphasis on memory and storage systems.</i></p>
<p><i>Luyi Li is a PhD student in the </i><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://cse.ucsd.edu/"><i>Computer Science and Engineering Department</i></a><i> at</i><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~www.ucsd.edu/"><i> University of California, San Diego</i></a><i>. His research focuses on exploring vulnerabilities, designing protections and optimizing performance for CPU architecture and memory systems.</i></p>
<p><i>Jangseon Park is a PhD student in the </i><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://cse.ucsd.edu/"><i>Computer Science and Engineering Department</i></a><i> at</i><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~www.ucsd.edu/"><i> University of California, San Diego</i></a><i>. His research interests are in heterogeneous memory system architecture for AI systems with emerging memories.</i></p>
<p><i>Jinpyo Kim is a PhD student in the </i><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://cse.ucsd.edu/"><i>Computer Science and Engineering Department</i></a><i> at</i><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~www.ucsd.edu/"><i> University of California, San Diego</i></a><i>. His research focuses on memory-centric optimization and energy-efficient computing for heterogeneous AI/ML architectures.</i></p>
<p><i>Theodore Michailidis is a PhD candidate in the </i><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://cse.ucsd.edu/"><i>Computer Science and Engineering Department</i></a><i> at</i><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~www.ucsd.edu/"><i> University of California, San Diego</i></a><i>. His research interests lie at the intersection of memory, operating and datacenter systems.</i></p>
<p><i>Yue Pan is a PhD student in the </i><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://cse.ucsd.edu/"><i>Computer Science and Engineering Department</i></a><i> at</i><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~www.ucsd.edu/"><i> University of California, San Diego</i></a><i>. His research interest lies in computer architecture, memory, and system designs for high-performance applications. </i></p>
<p><i>Mingyao Shen got his PhD from the Computer</i><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://cse.ucsd.edu/"><i> Science and Engineering Department</i></a><i> at</i><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~www.ucsd.edu/"><i> University of California, San Diego</i></a><i>. His research is in storage and memory system performance. This work was done while Mingyao was with UCSD.</i></p>
<p><i>Tajana Rosing is a Fratamico Endowed Chair of </i><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://cse.ucsd.edu/"><i>Computer Science and Engineering </i></a><i>and </i><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://ece.ucsd.edu/"><i>Electrical Engineering </i></a><i>at </i><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~www.ucsd.edu/"><i>University of California, San Diego. </i></a><i>Her research spans energy-efficient computing, computer architecture, neuromorphic computing, and distributed embedded systems. She is an ACM and IEEE Fellow.</i></p>
<p><i>Dean Tullsen is a Distinguished Professor in the</i><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://cse.ucsd.edu/"><i> Computer Science and Engineering Department</i></a><i> at</i><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~www.ucsd.edu/"><i> University of California, San Diego</i></a><i>. His research focuses on computer architecture, with contributions spanning on-chip parallelism (multithreading, multicore), architectures for secure execution, software and hardware techniques for parallel speedup, low-power and energy-efficient processors, servers, datacenters, etc.</i></p>
<p><i>Steven Swanson is a Professor in the</i><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://cse.ucsd.edu/"><i> Computer Science and Engineering Department</i></a><i> at</i><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~www.ucsd.edu/"><i> University of California, San Diego</i></a><i> and holds the Halicioglu Chair in Memory Systems.  His research focuses on understanding the implications of emerging technology trends on computing systems.</i></p>
<p><i>Jishen Zhao is a Professor in the</i><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://cse.ucsd.edu/"><i> Computer Science and Engineering Department</i></a><i> at</i><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~www.ucsd.edu/"><i> University of California, San Diego</i></a><i>. Her research spans and stretches the boundary across computer architecture, system software, and machine learning, with an emphasis on memory systems, machine learning and systems codesign, and system support for smart applications.</i></p>
<p class="disclaim"><strong>Disclaimer:</strong> <em>These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.</em></p><Img align="left" border="0" height="1" width="1" alt="" style="border:0;float:left;margin:0;padding:0;width:1px!important;height:1px!important;" hspace="0" src="https://feeds.feedblitz.com/~/i/930748850/0/sigarch-cat">
]]>
</content:encoded>
			<wfw:commentRss>https://feeds.feedblitz.com/~/930748850/0/sigarch-cat~The-Hitchhikers-Guide-to-Coherent-Fabrics-Programming-Rules-for-CXL-NVLink-and-InfinityFabric/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	<post-id xmlns="com-wordpress:feed-additions:1">95299</post-id></item>
<item>
<feedburner:origLink>https://www.sigarch.org/ieee-computer-architecture-letters-cal-an-update-and-faqs/</feedburner:origLink>
		<title>IEEE Computer Architecture Letters (CAL) &#8211; An Update and FAQs</title>
		<link>https://feeds.feedblitz.com/~/928607168/0/sigarch-cat~IEEE-Computer-Architecture-Letters-CAL-An-Update-and-FAQs/</link>
		<comments>https://feeds.feedblitz.com/~/928607168/0/sigarch-cat~IEEE-Computer-Architecture-Letters-CAL-An-Update-and-FAQs/#respond</comments>
		<pubDate>Fri, 21 Nov 2025 15:00:30 +0000</pubDate>
		<dc:creator><![CDATA[Sudhanva Gurumurthi and Mattan Erez]]></dc:creator>
		<category><![CDATA[ACM SIGARCH]]></category>
		<category><![CDATA[Journal]]></category>
		<category><![CDATA[Peer-review]]></category>
		<guid isPermaLink="false">https://www.sigarch.org/?p=94674</guid>
		<description><![CDATA[<div><img width="300" xheight="200" src="https://www.sigarch.org/wp-content/uploads/2025/11/AdobeStock_616008509-300x200.jpeg" class="attachment-medium size-medium wp-post-image" alt="" style="max-width:100% !important;height:auto !important;margin-bottom:15px;margin-left:15px;float:right;"  loading="lazy" /></div>CAL has held a unique place in the computer architecture community for well over two decades as a periodical for publishing early and exciting results. CAL papers are only four pages long and undergo rigorous peer review to select those with novel ideas and/or insights that are of interest to the computer architecture community and [&#8230;]]]>
</description>
				<content:encoded><![CDATA[<div><img width="300" height="200" src="https://www.sigarch.org/wp-content/uploads/2025/11/AdobeStock_616008509-300x200.jpeg" class="attachment-medium size-medium wp-post-image" alt="" style="margin-bottom:15px;margin-left:15px;float:right;" decoding="async" loading="lazy" /></div><p><span style="font-weight: 400;">CAL has held a unique place in the computer architecture community for well over two decades as a periodical for publishing early and exciting results. CAL papers are only four pages long and undergo rigorous peer review to select those with novel ideas and/or insights that are of interest to the computer architecture community and may have high impact. Another unique attribute of CAL is that it has historically provided a shorter turnaround for reviews compared to conferences and journals. As you will see below, the turnaround time to the first decision is now typically about 30 days.</span></p>
<p><span style="font-weight: 400;">We began our EIC/AEIC terms on January 1, 2025. It has been a busy and exciting year! We grew the </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.computer.org/csdl/journal/ca/about/107318?title=Editorial%20Board&amp;periodical=IEEE%20Computer%20Architecture%20Letters"><span style="font-weight: 400;">editorial board</span></a><span style="font-weight: 400;"> to include several academic and industry experts from across the world. We worked with IEEE to update the scope of the periodical to bring it in line with major architecture conferences and made numerous process changes to improve the submission-to-publication turnaround time, building on </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.sigarch.org/save-ieee-computer-architecture-letters/"><span style="font-weight: 400;">work done by prior EICs</span></a><span style="font-weight: 400;">. We also updated CAL’s publicity strategy to encourage the use of arXiv and social media per IEEE’s guidelines and best practices.</span></p>
<p><span style="font-weight: 400;">Over the course of the year, we’ve communicated with many members of the architecture community and identified a few questions that we frequently get asked. We’re capturing those here. </span></p>
<p><b>FAQ1: Can you share any stats about how CAL is doing?  </b></p>
<p><span style="font-weight: 400;">We have been tracking two metrics at a monthly cadence:</span></p>
<ol>
<li style="font-weight: 400;"><span style="font-weight: 400;">Acceptance rate of submitted manuscripts</span></li>
<li style="font-weight: 400;"><span style="font-weight: 400;">Turnaround time from submission to first decision</span></li>
</ol>
<p><span style="font-weight: 400;">The graph below plots these metrics as an average over a 12-month window. This data was gathered from the IEEE ScholarOne Manuscripts™ EIC dashboard. The bars correspond to the acceptance rate and the line graph to the turnaround time from submission to first decision.</span></p>
<p><img loading="lazy" decoding="async" class="alignnone wp-image-94677 size-full" src="https://www.sigarch.org/wp-content/uploads/2025/11/IEEE-Computer-Architecture-Letters-Monthly-Statistics.png" alt="" width="600" height="371" srcset="https://www.sigarch.org/wp-content/uploads/2025/11/IEEE-Computer-Architecture-Letters-Monthly-Statistics.png 600w, https://www.sigarch.org/wp-content/uploads/2025/11/IEEE-Computer-Architecture-Letters-Monthly-Statistics-480x297.png 480w" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) 600px, 100vw" /></p>
<p><span style="font-weight: 400;">We can see that the acceptance rate initially dropped and has been fluctuating around 35% for the past few months. When a submission arrives in the EIC queue, we have workflows in place to check the manuscript for scope and minimum technical substance. Papers that pass these steps are then sent out for review by the Associate Editor (AE). We’ve found that most manuscripts that eventually get accepted at CAL go through one or more revision rounds (more details on this in the next FAQ). While this acceptance rate is higher than at many architecture conferences, we think it reflects CAL receiving many good papers, and reviewers being more welcoming of early-stage results and new directions of work so long as they are novel and technically sound and the authors take reviewer feedback into account through the revision rounds. </span></p>
<p><span style="font-weight: 400;">We can also see that, after an initial couple of months, the average turnaround time of manuscripts has been steadily decreasing over the course of the year, with the most recent datapoint being about 30 days. For all manuscripts submitted to CAL this year, we’ve found that approximately 85% received their first decision within 30 days and the remainder within 60 days. These stats are in line with CAL’s turnaround trends in </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://ieeexplore.ieee.org/document/5648421"><span style="font-weight: 400;">the past</span></a><span style="font-weight: 400;">. This fast turnaround is the result of the collective hard work put in by the CAL AEs, reviewers, and the IEEE support staff. Reviewing is hard work and the fast-paced nature of CAL adds a layer of expediency to the process. </span></p>
<p><span style="font-weight: 400;">IEEE sends out a quarterly report to all EICs and the VP of Publications on the turnaround times of all journals. In the most recent report that was sent out in the first week of November, we were happy to see CAL designated as a </span><b>high performer</b><span style="font-weight: 400;">! </span></p>
<p><b>FAQ 2: Why don’t you track the turnaround time from submission to publication or set any goals for it? </b></p>
<p><span style="font-weight: 400;">Once the first decision is sent out, there are many variables that impact when a paper is fully decided as an Accept or Reject. The first decision could be an Accept, a Revise&amp;Resubmit, a Minor Revision, or a Reject. It is very rare for a manuscript to receive an Accept as a first decision. Most papers that eventually get published at CAL go through a Revise&amp;Resubmit, where authors get 6 weeks to submit their revision, and/or a Minor Revision, where the authors get one week. A Revise&amp;Resubmit is analogous to a “Major Revision” at other journals and requires a complete second review round, with sufficient time given to the reviewers to evaluate the new version and enter their reviews. For a Minor Revision, the AE can, if they so choose, provide a recommendation to the EIC without seeking additional inputs from the reviewers. Given these possibilities and all the parties involved, it can take two months or more to reach a final decision. </span></p>
<p><b>FAQ 3: Does CAL guarantee a specific turnaround for the first decision?</b></p>
<p><span style="font-weight: 400;">Sorry but there are no guarantees. We routinely keep an eye on all in-flight CAL manuscripts and the review system tracks turnaround and sends out automated reminders to the AEs and reviewers. The AEs and EIC also send out personal reminders, as needed. While we’ve found all of these help move things along (as evidenced by the data shown in FAQ 1), we can’t promise or enforce any specific turnaround. </span></p>
<p><b>FAQ 4: I have a paper submitted to CAL that I’d like to expand for an upcoming conference. Is it okay to submit that paper while the CAL paper is still under review?</b></p>
<p><span style="font-weight: 400;">While CAL welcomes papers on early results that could eventually be expanded into a conference paper, </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.computer.org/publications/author-resources#concurrent-duplicate-submissions"><span style="font-weight: 400;">IEEE Computer Society rules</span></a><span style="font-weight: 400;"> disallow submitting a CAL paper that is not fully decided to a conference. CAL is an independent periodical with review timelines that are independent of other venues. Also, if the timeline of your expanded work overlaps with CAL’s overall turnaround, please consider whether the initial submission truly represents early research. We feel a good heuristic for “early” is about 12 months or more before the work would be ready for submission to a conference or Transactions-style journal. Please plan accordingly.</span></p>
<p><b>FAQ 5: How can I help CAL?</b></p>
<p><span style="font-weight: 400;">There are many ways to do this:</span></p>
<ol>
<li style="font-weight: 400;"><span style="font-weight: 400;">Please consider submitting your early research results to CAL. CAL also welcomes papers that provide novel and insightful learnings from an industry context. Please keep FAQs 3 and 4 in mind when planning a submission. </span></li>
<li style="font-weight: 400;"><span style="font-weight: 400;">If you are asked to review a paper for CAL, please agree to review or suggest alternative reviewers.  </span></li>
<li style="font-weight: 400;"><span style="font-weight: 400;">If you are an author of a paper accepted at CAL, please publicize your work.</span></li>
<li style="font-weight: 400;"><span style="font-weight: 400;">Attend the Best of CAL session at HPCA.</span></li>
</ol>
<p><span style="font-weight: 400;">We thank the computer architecture community for your support of CAL in your roles as authors, reviewers, and editorial board members! Your help has been invaluable in strengthening CAL and retaining its special place as a forum to publish early, novel, and exciting results.</span></p>
<h3><span style="font-weight: 400;">About the Authors:</span></h3>
<p><b>Sudhanva Gurumurthi</b><span style="font-weight: 400;"> is a Fellow at AMD, where he is responsible for research and advanced development in RAS. His work has impacted numerous AMD products, multiple industry standards, and external research in the field. Before joining industry, Sudhanva was an Associate Professor in the Computer Science Department at the University of Virginia. Sudhanva is the recipient of an NSF CAREER Award, a Google Focused Research Award, and is named to the ISCA Hall of Fame. Sudhanva received his BE in Computer Science and Engineering from the College of Engineering Guindy, Anna University, and his PhD in Computer Science and Engineering from Penn State.</span></p>
<p><b>Mattan Erez</b><span style="font-weight: 400;"> is a Professor in the Department of Electrical &amp; Computer Engineering at The University of Texas at Austin, where he holds the Cullen Trust for Higher Education Endowed Professorship in Engineering #7. He has received several best paper awards at international conferences and is named to the Hall of Fame of ISCA and HPCA. Mattan is the recipient of many research awards, including the NSF CAREER Award, the DOE Early Career Research Award, and the Presidential Early Career Research Award for Scientists and Engineers awarded by President Obama. Mattan received a BSc in Electrical Engineering and a BA in Physics from the Technion, Israel Institute of Technology, and his MS and PhD in Electrical Engineering from Stanford University. </span></p>
<p class="disclaim"><strong>Disclaimer:</strong> <em>These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.</em></p><Img align="left" border="0" height="1" width="1" alt="" style="border:0;float:left;margin:0;padding:0;width:1px!important;height:1px!important;" hspace="0" src="https://feeds.feedblitz.com/~/i/928607168/0/sigarch-cat">
]]>
</content:encoded>
			<wfw:commentRss>https://feeds.feedblitz.com/~/928607168/0/sigarch-cat~IEEE-Computer-Architecture-Letters-CAL-An-Update-and-FAQs/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	<post-id xmlns="com-wordpress:feed-additions:1">94674</post-id></item>
</channel></rss>

