<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="https://feeds.feedblitz.com/feedblitz_rss.xslt"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	 xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
<channel>
	<title>Computer Architecture Today</title>
	<atom:link href="https://www.sigarch.org/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.sigarch.org</link>
	<description>Informing the broad computing community about current activities, advances and future directions in computer architecture.</description>
	<lastBuildDate>Wed, 29 Apr 2026 14:00:03 -0400</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	
<image>
	<url>https://www.sigarch.org/wp-content/uploads/2017/03/logo_rgb.png</url>
	<title>Computer Architecture Today</title>
	<link>https://www.sigarch.org</link>
</image> 
<site xmlns="com-wordpress:feed-additions:1">125883397</site>
<meta xmlns="http://www.w3.org/1999/xhtml" name="robots" content="noindex" />
<item>
<feedburner:origLink>https://www.sigarch.org/an-overview-of-the-fourth-data-prefetching-championship-part-2/</feedburner:origLink>
		<title>Fourth Data Prefetching Championship: Part 2</title>
		<link>https://feeds.feedblitz.com/~/954807320/0/sigarch-cat~Fourth-Data-Prefetching-Championship-Part/</link>
		<comments>https://feeds.feedblitz.com/~/954807320/0/sigarch-cat~Fourth-Data-Prefetching-Championship-Part/#respond</comments>
		<pubDate>Wed, 29 Apr 2026 14:00:03 +0000</pubDate>
		<dc:creator><![CDATA[Digvijay Singh]]></dc:creator>
		<category><![CDATA[ACM SIGARCH]]></category>
		<category><![CDATA[Data Prefetcher]]></category>
		<category><![CDATA[Memory Wall]]></category>
		<guid isPermaLink="false">https://www.sigarch.org/?p=103014</guid>
		<description><![CDATA[<div><img width="300" xheight="187" src="https://www.sigarch.org/wp-content/uploads/2026/04/DPC-4-Part-1-300x187.png" class="attachment-medium size-medium wp-post-image" alt="DPC-4 Concept Art (Indigo)" style="max-width:100% !important;height:auto !important;margin-bottom:15px;margin-left:15px;float:right;"  fetchpriority="high" /></div>This article continues (and concludes) the discussion on the proceedings of DPC-4, covering the remaining four contestants and a summary of the trends observed in all eight prefetchers presented in the championship. Similar to Part I, we focus on how each prefetch algorithm functions, and why it is effective. Finer implementation details can be obtained [&#8230;]]]>
</description>
				<content:encoded><![CDATA[<div><img width="300" height="187" src="https://www.sigarch.org/wp-content/uploads/2026/04/DPC-4-Part-1-300x187.png" class="attachment-medium size-medium wp-post-image" alt="DPC-4 Concept Art (Indigo)" style="margin-bottom:15px;margin-left:15px;float:right;" decoding="async" /></div><p>This article continues (and concludes) the discussion on the proceedings of <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://sites.google.com/view/dpc4-2026/home?pli=1">DPC-4</a>, covering the remaining four contestants and a summary of the trends observed in all eight prefetchers presented in the championship. Similar to <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.sigarch.org/fourth-data-prefetching-championship-part-i/">Part I</a>, we focus on how each prefetch algorithm functions, and why it is effective. Finer implementation details can be obtained from the workshop <a class="WKVSfLCavKyjywFEXwZDFLAfdDosdiAqrY " href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://github.com/CMU-SAFARI/DPC4/tree/main/final-versions" target="_self" data-test-app-aware-link="">papers</a> or the source <a class="WKVSfLCavKyjywFEXwZDFLAfdDosdiAqrY " href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://github.com/CMU-SAFARI/DPC4/tree/main/submissions" target="_self" data-test-app-aware-link="">code</a>.</p>
<h3><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://github.com/CMU-SAFARI/DPC4/blob/main/final-versions/BertiGO-final.pdf"><strong>BertiGO</strong> (</a><em>Simranjit Singh, University of Murcia; Agustín Navarro Torres, University of Zaragoza; Alberto Ros, University of Murcia)</em></h3>
<h4 id="ember675" class="ember-view reader-text-block__heading-3">Motivation</h4>
<p id="ember676" class="ember-view reader-text-block__paragraph">When evaluating the baseline prefetcher configuration, the authors noted that <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://dpc3.compas.cs.stonybrook.edu/pdfs/Berti.pdf">Berti</a> frequently issues redundant prefetch requests for lines already prefetched or present in the cache. Also, using only the PC provides very limited context for pattern recognition, limiting the prediction capabilities of Berti. Furthermore, <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://dl.acm.org/doi/10.1145/3466752.3480114">Pythia</a> is found to generate a lot of useless prefetches for some workloads, which pollutes the L2 cache and wastes memory bandwidth.</p>
<h4 id="ember677" class="ember-view reader-text-block__heading-3">Idea</h4>
<ol>
<li>A Region-Based Bit-Map Filter is added: a fully associative structure that records, as a bit-vector per region, which cache lines have been prefetched or accessed. For regions tracked by the filter, a set M-th bit means that all prefetch requests for the M-th cache line inside that region are dropped (a minimal sketch follows this list).</li>
<li>In addition to using PC, the authors propose using a hash (shifted XOR) of the last 4 PCs with the current PC, to index the Berti tables with additional context.</li>
<li>Set-Dueling is added to Pythia: instead of using the default policy to issue prefetches, 5 different policies are introduced, including a No-Prefetch policy that disables Pythia. All 5 policies are enabled for a 10M-instruction tournament, at the end of which the policy with the lowest miss rate is chosen for the rest of execution.</li>
<li>An Adaptive Next Line (ANeLin) prefetcher is added at the LLC, which uses a sampling cache to track demand misses and insert next-line prefetches. A heuristic mechanism tracks useful and useless prefetches globally and per PC. ANeLin can be disabled if the ratio of useful to useless prefetches drops below a threshold.</li>
</ol>
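<p>To make the first two ideas concrete, below is a minimal, self-contained C++ sketch of a region-based bit-map filter and a shifted-XOR PC-history hash. The structure sizes, replacement policy, and names are illustrative assumptions made for this post, not the authors&#8217; actual implementation (see the BertiGO paper and source code for that).</p>
<pre><code>#include &lt;array&gt;
#include &lt;cstddef&gt;
#include &lt;cstdint&gt;

// Illustrative region filter: one bit per cache line in a 4KB region (64 lines).
// A set bit means the line was already accessed or prefetched, so further
// prefetch requests to it are dropped.
struct RegionBitmapFilter {
    struct Entry { uint64_t region_tag = 0; uint64_t bitmap = 0; bool valid = false; };
    std::array&lt;Entry, 64&gt; entries{};   // small fully associative filter (size assumed)
    std::size_t next_victim = 0;       // simple FIFO replacement (assumed)

    static uint64_t region_of(uint64_t line_addr) { return line_addr &gt;&gt; 6; }   // 64 lines per region
    static uint64_t bit_of(uint64_t line_addr)    { return 1ULL &lt;&lt; (line_addr &amp; 63); }

    Entry* find(uint64_t region) {
        for (auto&amp; e : entries)
            if (e.valid &amp;&amp; e.region_tag == region) return &amp;e;
        return nullptr;
    }

    // Record a demand access or an issued prefetch.
    void mark(uint64_t line_addr) {
        uint64_t region = region_of(line_addr);
        Entry* e = find(region);
        if (!e) {
            e = &amp;entries[next_victim];
            next_victim = (next_victim + 1) % entries.size();
            *e = {region, 0, true};
        }
        e-&gt;bitmap |= bit_of(line_addr);
    }

    // Returns true if a prefetch request for this line should be dropped.
    bool should_drop(uint64_t line_addr) {
        Entry* e = find(region_of(line_addr));
        return e &amp;&amp; (e-&gt;bitmap &amp; bit_of(line_addr));
    }
};

// Illustrative shifted-XOR hash of the current PC with the last four PCs,
// used to index the Berti tables with additional control-flow context.
uint64_t pc_context_hash(uint64_t pc, const std::array&lt;uint64_t, 4&gt;&amp; last_pcs) {
    uint64_t h = pc;
    int shift = 1;
    for (uint64_t p : last_pcs) { h ^= (p &gt;&gt; shift); shift++; }
    return h;
}
</code></pre>
<p>In this sketch, the prefetcher would call <code>should_drop()</code> before enqueuing a request and <code>mark()</code> whenever a line is demanded or prefetched; the real filter additionally has to handle region eviction and sizing against the storage budget.</p>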
<h4 id="ember679" class="ember-view reader-text-block__heading-3">Why It Works</h4>
<p id="ember680" class="ember-view reader-text-block__paragraph">Adding a Bit-Map Filter eliminates redundant and useless prefetches. Using PC history adds context from the program flow while learning memory accesses with minimal overhead. Disabling Pythia and Next Line prefetching when they do not generate enough useful prefetches solves the problem of cache pollution due to wasteful prefetching. This is especially useful in the constrained bandwidth and multicore scenarios where data and memory need to be shared judiciously for optimal performance.</p>
<h3 id="ember682" class="ember-view reader-text-block__heading-2"><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://github.com/CMU-SAFARI/DPC4/blob/main/final-versions/EDP-final.pdf"><strong>Entangling Data Prefetcher</strong> (</a><em>Agustín Navarro Torres, Universidad de Zaragoza;  Simranjit Singh, University of Murcia; Biswabandan Panda, IIT Bombay; Alberto Ros, University of Murcia)</em></h3>
<h4 id="ember684" class="ember-view reader-text-block__heading-3">Motivation</h4>
<p id="ember685" class="ember-view reader-text-block__paragraph">Comparing Berti with other state-of-art prefetchers, the authors identify a SPEC2017 workload where Berti achieves negligible performance gain over no-prefetch baseline. Profiling this trace reveals that it consists of long-reuse strides (stride accesses separated by 2K-cycle interval) and zero-strides (consecutive accesses to the same cache line). Berti cannot issue zero-delta prefetches, and even though prefetches are correctly issued for long-reuse deltas, they get evicted before the cache line gets accessed. T-SKID, a Time Skipping Prefetcher is built on top of a standard PC-Stride prefetcher, but decouples the PC that triggers a prefetch (TriggerPC) from the PC that trains the predictor(TargetPC). This allows it to prefetch long-reuse and zero stride patterns. However, the underlying stride prefetcher limits its scope to constant stride instead of complex delta patterns predicted easily by Berti.</p>
<h4 id="ember686" class="ember-view reader-text-block__heading-3">Idea</h4>
<p id="ember687" class="ember-view reader-text-block__paragraph">EDP is proposed as a VA-based L1D prefetcher. It gets trained and triggered on cache misses or prefetch hits (cache hit on a prefetched line). For every TargetPC, it records the fill latency of the demand access or prefetch request. It then searches the global PC history for the most recent PC that was observed more than (current cycle &#8211; fill latency) cycles ago – this is the TriggerPC which could have triggered a timely prefetch for TargetPC. This ‘Entangling Pair’ of PCs is added to the Entangling Table, that stores the set of TargetPCs for a given TriggerPC. EDP also looks at the address history of each TargetPC to calculate the list of timely deltas (similar to Berti) and stores them with the current address in a Delta Table indexed by TargetPC. To issue prefetches, the TriggerPC is used to obtain one or more TargetPC, which are used to obtain address and deltas for timely prefetch. The prefetches calculated in this way are passed through a Bloom Filter to drop redundant requests, and then placed in a Proxy Prefetch Queue (PPQ) where the prefetch request waits till slots open up in the demand read queue. If there is no space in the latter, prefetch requests are not issued. Pythia is implemented at L2, with a throttling mechanism at LLC that tracks each core&#8217;s requests and sets the EDP aggressiveness.</p>
<h4 id="ember688" class="ember-view reader-text-block__heading-3">Why It Works</h4>
<p id="ember689" class="ember-view reader-text-block__paragraph">Using a different PC to trigger prefetches allows EDP to successfully prefetch zero and long reuse delta patterns for its target PC. Filtering out redundant prefetches reduces contention for resources. Using a dedicated PPQ for prefetch requests prevents prefetches from competing with critical loads for resources. The LLC throttling mechanism helps evenly distribute resources in the multi-core scenario.</p>
<h3 id="ember691" class="ember-view reader-text-block__heading-2"><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://github.com/CMU-SAFARI/DPC4/blob/main/final-versions/uMAMA-final.pdf"><strong>Composite Prefetching with Bandits</strong> (</a><em>Charles Block, Pedro Palacios, Abraham Farrell, Gerasimos Gerogiannis, Josep Torrellas, University of Illinois at Urbana-Champaign)</em></h3>
<h4 id="ember693" class="ember-view reader-text-block__heading-3">Motivation</h4>
<p id="ember694" class="ember-view reader-text-block__paragraph">The authors point out that the current state-of-the-art prefetchers try to optimize low-level metrics such as accuracy, timeliness and coverage. The system performance (IPC) depends on these factors, but can have variable sensitivity to each of them depending on the workload and program phase. Furthermore, a single prefetcher is generally insufficient to deliver the best performance for a diverse set of workloads – industrial processors generally deploy a composite prefetcher consisting of multiple prefetch engines.</p>
<h4 id="ember695" class="ember-view reader-text-block__heading-3">Idea</h4>
<p id="ember696" class="ember-view reader-text-block__paragraph">A Multi-Armed Bandit is a Reinforcement Learning agent that chooses the best action (arm) to maximize the reward function value. Inspired by this, a Micro-Armed Bandit (MAB) is used to prefetch at L2C. Each ‘arm’ consists of different configurations for 5 state-of-the-art prefetchers-</p>
<ul>
<li>Next Line, Spatial Memory Streaming, Best Offset Prefetcher: Can be turned ON or OFF</li>
<li>Stride, Stream prefetchers: Degree can be tuned to control aggressiveness</li>
</ul>
<p id="ember698" class="ember-view reader-text-block__paragraph">A bloom filter is implemented to prevent issuing redundant prefetches. Each arm is used for a fixed time period (bandit step) after which the reward generated by it is evaluated by the agent. This is evaluated against the rewards generated previously to calculate which arm to use next. The total IPC of the core is used as a reward function for the MAB.</p>
<p id="ember699" class="ember-view reader-text-block__paragraph">To optimize multi-core performance, another agent called ‘µMama’ is added at the system level, using the geometric mean of IPCs across all cores as a reward function. At each timestep, it decides whether to allow the cores to pursue their independent actions, or to force them into joint actions which have a record of increasing the µMama reward.</p>
<h4 id="ember700" class="ember-view reader-text-block__heading-3">Why It Works</h4>
<p id="ember701" class="ember-view reader-text-block__paragraph">Using Reinforcement Learning to directly maximize the system performance ensures that the prefetcher dynamically re-configures itself with execution to improve IPC. The caveat is that this now becomes a search space problem &#8211; the arms of the bandit need to be diverse enough to support different kinds of workloads, in order to deliver the best performance.</p>
<h3 id="ember703" class="ember-view reader-text-block__heading-2"><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://github.com/CMU-SAFARI/DPC4/blob/main/final-versions/GBerti-final.pdf"><strong>Global Berti</strong> (</a><em>Gilead Posluns, Mark Jeffrey; University of Toronto)</em></h3>
<h4 id="ember705" class="ember-view reader-text-block__heading-3">Motivation</h4>
<p id="ember706" class="ember-view reader-text-block__paragraph">Berti is a state-of-the-art prefetcher that detects Streaming patterns, i.e., consistent delta values between accesses by the <em>same</em> PC. Practical workloads however, often exhibit Spatial patterns identified by consistent delta values between accesses by <em>different </em>PCs. In the absence of streaming patterns, prefetching based on spatial patterns could alleviate the efficacy of Berti.</p>
<h4 id="ember707" class="ember-view reader-text-block__heading-3">Idea</h4>
<p id="ember708" class="ember-view reader-text-block__paragraph">Global Berti detects spatial patterns using Berti’s existing structures – the History Table conventionally stores within a row, the addresses of all the lines accessed by a particular PC, in FIFO order. When a streaming pattern cannot be detected, local training is useless and Global Berti looks at the most recent address for all PCs to detect spatial patterns (global training). Berti’s Delta Table holds the row delta values for the same PC; Global Berti stores the global deltas (across PCs) in the same table, adding a local bit to differentiate between streaming and spatial training.</p>
<h4 id="ember709" class="ember-view reader-text-block__heading-3">Why It Works</h4>
<p id="ember710" class="ember-view reader-text-block__paragraph">By itself, Berti is quite effective at detecting and covering streaming patterns. Adding the capability to detect spatial patterns in the absence of streaming patterns increases Global Berti’s coverage and therefore, the overall performance. As expected, the highest speedup over Berti is obtained in SPEC2017 and Graph workloads that are dominated by irregular accesses which require spatial prefetching. On the other hand, AI workloads containing mostly streaming patterns see a much lesser speedup.</p>
<h3 id="ember712" class="ember-view reader-text-block__heading-2">General Trends</h3>
<p id="ember713" class="ember-view reader-text-block__paragraph">Although the major focus of almost all DPC-4 submissions is to overcome the limitations of the high-performing Berti/Pythia baseline, they highlight several key trends in data prefetching research:</p>
<ul>
<li><strong>Prefetching across Physical Page Boundaries: </strong>Issuing page-crossing prefetches is extremely useful for AI workloads since they are dominated by streaming accesses. This is leveraged by most submissions to gain an edge over the baseline prefetcher configuration.</li>
<li><strong>Preventing Redundant Prefetches: </strong>Quite a few papers also combat excessive prefetching and resource contention through advanced throttling, priority, and filtering mechanisms.</li>
<li><strong>Increased System-Level and Multi-Core Awareness:</strong> There is a growing emphasis on system-aware solutions to judiciously manage shared resources like memory bandwidth, which is constrained in high-core-count datacenters. This includes core-level fairness throttling (Emender, EDP) and global coordination agents (µMama) to dynamically adjust prefetcher configurations for optimal multi-core performance.</li>
<li><strong>Expanding Pattern Coverage for Diverse Workloads:</strong> Submissions seek to improve coverage beyond simple streaming patterns. This includes detecting spatial patterns across different PCs (Global Berti), and targeting complex patterns like long-reuse and zero-strides (EDP). The adoption of PC history (BertiGO) also provides better context for pattern recognition.</li>
<li><strong>Shift Towards Adaptive and Composite Designs:</strong> Recognizing that a  single prefetcher is insufficient for diverse workloads, the trend moves toward composite prefetchers. This is accompanied by dynamic re-configuration to select the best prefetcher setting at runtime, and adaptive heuristics to tune aggressiveness.</li>
</ul>
<h3>About the Author</h3>
<p><span style="font-weight: 400;">Digvijay Singh obtained his Bachelor’s degree from BITS Pilani and his Master’s degree from Texas A&amp;M University where he worked on data prefetching as part of the CAMSIN research group. He currently works as a Silicon Architect in Google’s mobile CPU team.</span></p>
<p class="disclaim"><strong>Disclaimer:</strong> <em>These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.</em></p><Img align="left" border="0" height="1" width="1" alt="" style="border:0;float:left;margin:0;padding:0;width:1px!important;height:1px!important;" hspace="0" src="https://feeds.feedblitz.com/~/i/954807320/0/sigarch-cat">
]]>
</content:encoded>
			<wfw:commentRss>https://feeds.feedblitz.com/~/954807320/0/sigarch-cat~Fourth-Data-Prefetching-Championship-Part/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	<post-id xmlns="com-wordpress:feed-additions:1">103014</post-id></item>
<item>
<feedburner:origLink>https://www.sigarch.org/fourth-data-prefetching-championship-part-i/</feedburner:origLink>
		<title>Fourth Data Prefetching Championship: Part I</title>
		<link>https://feeds.feedblitz.com/~/954636863/0/sigarch-cat~Fourth-Data-Prefetching-Championship-Part-I/</link>
		<comments>https://feeds.feedblitz.com/~/954636863/0/sigarch-cat~Fourth-Data-Prefetching-Championship-Part-I/#respond</comments>
		<pubDate>Mon, 27 Apr 2026 14:00:53 +0000</pubDate>
		<dc:creator><![CDATA[Digvijay Singh]]></dc:creator>
		<category><![CDATA[ACM SIGARCH]]></category>
		<category><![CDATA[Data Prefetcher]]></category>
		<category><![CDATA[Memory Wall]]></category>
		<guid isPermaLink="false">https://www.sigarch.org/?p=103010</guid>
		<description><![CDATA[<div><img width="300" xheight="187" src="https://www.sigarch.org/wp-content/uploads/2026/04/DPC-4-Part-2-300x187.png" class="attachment-medium size-medium wp-post-image" alt="DPC-4 Concept Art (Blue)" style="max-width:100% !important;height:auto !important;margin-bottom:15px;margin-left:15px;float:right;"  /></div>This article is the first in a two-part series that summarizes the key contributions of 4th Data Prefetching Championship (DPC-4), held in conjunction with the 32nd iteration of HPCA in 2026. While discussing innovative data prefetching techniques presented in this contest, we focus on the functionality of proposed algorithms and also explain why they are [&#8230;]]]>
</description>
				<content:encoded><![CDATA[<div><img width="300" height="187" src="https://www.sigarch.org/wp-content/uploads/2026/04/DPC-4-Part-2-300x187.png" class="attachment-medium size-medium wp-post-image" alt="DPC-4 Concept Art (Blue)" style="margin-bottom:15px;margin-left:15px;float:right;" decoding="async" loading="lazy" /></div><p class="ember-view reader-text-block__paragraph">This article is the first in a two-part series that summarizes the key contributions of the 4th Data Prefetching Championship (<a class="WKVSfLCavKyjywFEXwZDFLAfdDosdiAqrY " href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://sites.google.com/corp/view/dpc4-2026/home" target="_self" data-test-app-aware-link="">DPC-4</a>), held in conjunction with the 32nd iteration of <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://2026.hpca-conf.org/track/hpca-2026-main-conference">HPCA</a> in 2026. While discussing innovative data prefetching techniques presented in this contest, we focus on the functionality of the proposed algorithms and also explain why they are effective. Finer implementation details can be found in the <a class="WKVSfLCavKyjywFEXwZDFLAfdDosdiAqrY " href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://github.com/CMU-SAFARI/DPC4/tree/main/final-versions" target="_self" data-test-app-aware-link="">papers</a> or the source <a class="WKVSfLCavKyjywFEXwZDFLAfdDosdiAqrY " href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://github.com/CMU-SAFARI/DPC4/tree/main/submissions" target="_self" data-test-app-aware-link="">code</a>.</p>
<h3 id="ember612" class="ember-view reader-text-block__heading-3">Implementation Constraints</h3>
<p id="ember613" class="ember-view reader-text-block__paragraph">All prefetchers are evaluated against a baseline configuration that employs: <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://dpc3.compas.cs.stonybrook.edu/pdfs/Berti.pdf">Berti</a> prefetcher (<a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://dpc3.compas.cs.stonybrook.edu/">DPC3</a> winner) at L1D (Level-1 Data cache) and <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://dl.acm.org/doi/10.1145/3466752.3480114">Pythia</a> prefetcher at L2 (Level-2 cache). While there were no constraint on design complexity, upper limits were defined on the storage budget of the prefetchers to ensure the design was practically feasible for implementation. These limits were defined as follows: L1D Prefetcher: 32KB, L2 Prefetcher: 128KB, LLC (Last Level Cache) Prefetcher: 256KB.</p>
<h3 id="ember618" class="ember-view reader-text-block__heading-2">Keynotes</h3>
<p>The event included two keynote talks. The first keynote, titled &#8220;Is Prefetcher Research Still Alive?&#8221;, was given by <em><a id="ember621" class="ember-view" href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.linkedin.com/in/leeor-peled-125365b4/">Leeor Peled</a> </em>from Huawei. Leeor discussed the modern relevance of prefetching research, offering a pragmatic philosophy for academic researchers. He argued that the primary objective should not necessarily be to surpass &#8220;best-in-class&#8221; models – which are often the result of years of ‘engineered’ fine-tuning – but rather to introduce <strong>novel, high-potential concepts</strong> that invite further optimization. He emphasized that while an individual effort might not immediately surpass the state-of-the-art, a sufficiently &#8220;interesting&#8221; technique can evolve into a transformative solution through subsequent community-driven iteration.</p>
<p id="ember623" class="ember-view reader-text-block__paragraph">He suggested two optimizations that can be explored:</p>
<ol>
<li>Building a Semantic Prefetcher that correlates memory accesses with address generating code, i.e., a high-precision version of the Runahead Prefetcher that selectively runs only the code responsible for generating a future address.</li>
<li>Training neural networks to identify deep correlations between memory accesses, potentially unlocking the ability to predict complex, non-linear patterns that remain invisible to current heuristic-based logic.</li>
</ol>
<p id="ember625" class="ember-view reader-text-block__paragraph">The following issues can (and should) be addressed to build better prefetchers:</p>
<ul>
<li>Generalizing complex patterns, e.g. pointer chasing loads</li>
<li>Accurately choosing memory accesses with high correlation for better training</li>
<li>Prefetching to the appropriate cache level to optimize for timeliness</li>
<li>Throttling prefetches for fairness amongst multiple cores</li>
<li>Using LLMs to process memory traces instead of text sequences</li>
</ul>
<p id="ember631" class="ember-view reader-text-block__paragraph">The second keynote, titled &#8220;Data Prefetching: A Datacenter Perspective&#8221;, was given by <em><a id="ember630" class="ember-view" href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.linkedin.com/in/akanksha-j-a8336884/">Akanksha J.</a> from Google. </em> Addressing the memory bottleneck problem in modern datacenters (40% of the CPU cycles are spent idling for memory responses) Akanksha highlighted that cloud environments are characterized by massive multi-threading and incessant context switching. In these scenarios, a single thread may migrate across multiple cores, while each core rotates through a vast &#8220;plethora&#8221; of applications. The Google workloads utilized in DPC-4 are a better representation of this reality, and are primarily frontend-bound. Without a sophisticated instruction prefetcher to streamline code delivery, the underlying bottlenecks in data prefetching remain obscured and impossible to solve. She also analyzed structural failures of current prefetching solutions, identifying these primary aspects:</p>
<ol>
<li>Current design philosophy focuses on &#8220;tuning for the common case,&#8221; resulting in hard-coded heuristic values—such as fixed confidence thresholds and prefetch degrees—that are taped out into non-programmable silicon. While these &#8220;black boxes&#8221; are meticulously engineered to squeeze every drop of performance from SPEC workloads, they lack the flexibility required for the high heterogeneity of datacenter tasks. Consequently, these resource-hungry techniques often penalize cloud performance rather than enhancing it.</li>
<li>If we disable hardware prefetchers entirely and rely on software to insert prefetches, we miss out on critical opportunities to utilize valuable information about system states (coherence, timeliness, cache hits/misses) that improves prefetching. Akanksha proposed a shift towards <strong>&#8220;Software-Defined Prefetching,&#8221;</strong> a paradigm that transcends current <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.arm.com/glossary/isa">ISA</a> limitations. In this model, the software layer dynamically selects which code segments to target and determines the optimal hardware prefetcher to activate for peak accuracy. Simultaneously, the hardware leverages real-time system state data to maximize coverage.</li>
</ol>
<p id="ember633" class="ember-view reader-text-block__paragraph">Furthermore, Akanksha advocated for evaluating all prefetching techniques within constrained-bandwidth environments, arguing that such stress tests better reflect the realities of modern compute environments.</p>
<p>Now, on to prefetcher designs themselves.</p>
<h3 id="ember635" class="ember-view reader-text-block__heading-2"><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://github.com/CMU-SAFARI/DPC4/blob/main/final-versions/VIP-final.pdf"><strong>Virtual Inter-Page Prefetcher</strong></a> (<em>Ho Je Lee, Won Woo Ro; Yonsei University)</em></h3>
<h4 id="ember637" class="ember-view reader-text-block__heading-3">Motivation</h4>
<ul>
<li>Analyzing the baseline prefetcher configuration, the authors observed that the L2 Prefetcher (Pythia) is more effective than the L1 Prefetcher (Berti) in reducing Misses Per Kilo Instructions (MPKI) for the Last Level Cache (LLC).</li>
<li>Since Pythia operates in the Physical Address (PA) space, it is not feasible to let it issue prefetches across page boundaries, as incorrect physical page access poses a security risk.</li>
<li>A roofline study shows that there is significant performance to be gained when Pythia is allowed to issue page-cross prefetches in the PA space. This advantage amplifies when it is granted visibility of the Virtual Address (VA) space, preventing incorrect page accesses.</li>
</ul>
<h4 id="ember639" class="ember-view reader-text-block__heading-3">Idea</h4>
<p id="ember640" class="ember-view reader-text-block__paragraph">VIP is implemented at L1 level,  but issues prefetches to the L2. It gets trained on L1 Misses by reading the {<a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.geeksforgeeks.org/operating-systems/what-is-program-counter/">PC</a>, VA} information off the packets sent to L1 MSHR. These are written to the VIP Stride Table that calculates the observed stride for a particular PC and stores it. If a stride value is repeated, the confidence gets incremented. Otherwise it gets reset. The confidence value determines the prefetch degree.</p>
<h4 id="ember641" class="ember-view reader-text-block__heading-3">Why It Works</h4>
<p id="ember642" class="ember-view reader-text-block__paragraph">The implemented VIP configuration is a simple yet elegant solution to gain performance over the baseline by supplementing the existing Berti and Pythia prefetchers with cross-page prefetches (note that the DPC-3 version of Berti operates in the PA space and cannot issue prefetches across page boundaries). As expected, the stride prefetcher boosts AI workloads with sequential accesses of large data structures that span across pages. The typical CPU workloads such as SPEC see a moderate gain; the control-flow dominated Google workloads have a marginal slowdown since they rarely have uninterrupted streams.</p>
<h3 id="ember644" class="ember-view reader-text-block__heading-2"><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://github.com/CMU-SAFARI/DPC4/blob/main/final-versions/SPPAM-final.pdf"><strong>Signature Pattern Prediction and Access-Map Prefetcher</strong></a> (<em>Maccoy Merrell, Lei Wang, Paul Gratz, Stavros Kalafatis; Texas A&amp;M University)</em></h3>
<h4 id="ember646" class="ember-view reader-text-block__heading-3">Motivation</h4>
<p id="ember647" class="ember-view reader-text-block__paragraph">Access Map Pattern Matching (<a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://dl.acm.org/doi/10.1145/1542275.1542349">AMPM</a>) and Signature Path Prefetching (<a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://dl.acm.org/doi/10.5555/3195638.3195711">SPP</a>) are both considered state-of-the-art prefetching techniques; while SPP is sensitive to the order of memory accesses, AMPM is resistant to OoO execution. However, AMPM relies heavily on stored patterns for each region and  is unable to issue prefetches for new regions or when the observed accesses deviate from expectations. SPP excels at this and can even make predictions from its issued prefetches.</p>
<h4 id="ember648" class="ember-view reader-text-block__heading-3">Idea</h4>
<p id="ember649" class="ember-view reader-text-block__paragraph">Implemented at L2 level, a Region Table (RT) tracks all access maps (as bit-vectors) on a per-region basis. Upon a memory access, an N-bit portion from the respective access map is used to index a Pattern Table (PT). The PT outputs the most frequently occurring N-bit pattern as a prefetch candidate, which can be used to speculatively index the PT. Similar to SPP, speculative prefetching continues till the overall confidence drops below a threshold. The RT access map indicates the recently accessed cache lines and filters out redundant prefetches.</p>
<h4 id="ember650" class="ember-view reader-text-block__heading-3">Why It Works</h4>
<p id="ember651" class="ember-view reader-text-block__paragraph">The authors have identified the complementary nature of SPP and AMPM, and have combined them effectively to utilize the OoO resistance of AMPM with the Speculative mechanism of SPP. Additionally, numerous throttling mechanisms are implemented which consider pattern usefulness as well as global metrics such as DRAM bandwidth and overall usefulness to drop prefetches and set prefetch degree. SPPAM is implemented at L2C with Berti (the MICRO version which operates in the VA space) at L1D and Bingo at LLC. Similar to the previous paper, the cross-page stream information is passed to SPPAM from L1D.</p>
<h3 id="ember653" class="ember-view reader-text-block__heading-2"><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://github.com/CMU-SAFARI/DPC4/blob/main/final-versions/Emender-final.pdf"><strong>Emender</strong></a> (<em>Jiajie Chen, Tingji Zhang, Xiaoyi Liu, Xuefeng Zhang, Peng Qu, Youhui Zhang; Tsinghua University)</em></h3>
<h4 id="ember655" class="ember-view reader-text-block__heading-3">Motivation</h4>
<p id="ember656" class="ember-view reader-text-block__paragraph">An evaluation of different combinations of state-of-the-art prefetchers shows that <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://dl.acm.org/doi/10.1109/MICRO56248.2022.00072">VBerti</a> (L1D) and Pythia (L2) is the highest performing combination. Here, VBerti refers to the MICRO version of Berti that operates in the VA space, allowing it to issue page-crossing prefetches. It is observed that this optimal prefetcher combination issues too many prefetch requests that fill the prefetch queue quickly, which leads to useful prefetches getting dropped. A second-order effect of a full prefetch queue is the excessive usage of L1D to Memory bandwidth that can delay critical loads.</p>
<h4 id="ember657" class="ember-view reader-text-block__heading-3">Idea</h4>
<p id="ember658" class="ember-view reader-text-block__paragraph">Four key features are added to tackle the problem of over-prefetching in the VBerti+Pythia configuration:</p>
<ol>
<li>A Pending Target Buffer is added to sort all issued prefetches by confidence, which helps prioritize useful prefetches across different PCs.</li>
<li>A Cuckoo Filter is added, which tracks the VAs already present in the cache to prevent redundant prefetches. This structure is chosen for its O(1) query time, high accuracy, and zero false negatives.</li>
<li>A Dynamic Confidence Threshold is added, which increases with the cache miss rate, throttling low-confidence prefetches.</li>
<li>A Fairness-based Throttling scheme is implemented across cores, which tracks the useless prefetches per core at the L3 and stops the core with the most useless prefetches from prefetching (a minimal sketch of these last two mechanisms follows this list).</li>
</ol>
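<p>A compact sketch of ideas 3 and 4, with arbitrary constants: a confidence threshold that rises with the cache miss rate, and a per-core fairness throttle that mutes the core producing the most useless prefetches each epoch. The actual Emender policies, epoch lengths, and Cuckoo Filter integration are described in the paper.</p>
<pre><code>#include &lt;algorithm&gt;
#include &lt;array&gt;
#include &lt;cstddef&gt;
#include &lt;cstdint&gt;

// Idea 3: the confidence a prefetch must reach grows with the L1D miss rate,
// so low-confidence prefetches are throttled exactly when misses hurt most.
// The linear mapping and its bounds are assumptions.
double dynamic_confidence_threshold(double miss_rate,
                                    double base = 0.25, double slope = 0.5) {
    return std::min(1.0, base + slope * miss_rate);
}

// Idea 4: per-core fairness throttling at the shared L3.
template &lt;std::size_t NumCores&gt;
struct FairnessThrottle {
    std::array&lt;uint64_t, NumCores&gt; useless_prefetches{};  // prefetched-but-never-used counts
    int muted_core = -1;

    void record_useless(std::size_t core) { useless_prefetches[core]++; }

    // At the end of an epoch, stop the worst offender from prefetching.
    void end_of_epoch() {
        auto worst = std::max_element(useless_prefetches.begin(), useless_prefetches.end());
        muted_core = static_cast&lt;int&gt;(worst - useless_prefetches.begin());
        useless_prefetches.fill(0);
    }

    bool may_prefetch(std::size_t core) const {
        return static_cast&lt;int&gt;(core) != muted_core;
    }
};
</code></pre>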
<h4 id="ember661" class="ember-view reader-text-block__heading-3">Why It Works</h4>
<p id="ember662" class="ember-view reader-text-block__paragraph">The authors identify problematic areas in the baseline Berti+Pythia system and propose features to effectively address them. The best performance improvement comes from the Cuckoo Filter for single-core and Fairness Throttling for multi-core configuration. Since Emender provides the least gain for limited bandwidth configuration, it would be interesting to look at the accuracy data.</p>
<h3 id="ember664" class="ember-view reader-text-block__heading-2"><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://github.com/CMU-SAFARI/DPC4/blob/main/final-versions/sBerti-final.pdf"><strong>sBerti</strong></a> (<em>Jiapeng Zhou, Ben Chen, Kunlin Li, Yun Chen; HKUST, Guangzhou)</em></h3>
<h4 id="ember666" class="ember-view reader-text-block__heading-3">Motivation</h4>
<p id="ember667" class="ember-view reader-text-block__paragraph">When profiling the DPC4 workloads on the given baseline prefetcher configuration (Berti + Pythia), the authors observed a high L1D miss rate in the AI-ML and Google workloads. A deeper analysis of the traces indicated that most of these misses occurred when the access stream moved across the 4KB physical page boundary, which happens frequently in these workloads. The version of Berti used in the baseline does not issue prefetches across page boundaries, and thus, a stride prefetcher can help.</p>
<h4 id="ember668" class="ember-view reader-text-block__heading-3">Idea</h4>
<p id="ember669" class="ember-view reader-text-block__paragraph">A decoupled Smart Stride Prefetcher is added at L1D, which operates on the VA space and can  therefore track memory access streams across page boundaries. It is trained using a Smart Stride Table (SST), which is indexed by a hash of the PC, and subtracts the lastVA from the current VA to calculate the delta value. If the absolute value of delta is a multiple of the stored stride, the confidence is updated; this also provides resistance to out-of-order execution. Prefetches are issued if this confidence is greater than a static threshold. The lookahead is tuned via a heuristic which is incremented upon observing late prefetches and decremented by timely prefetches. A Recent Prefetch Table stores the recently issued prefetches to track their timeliness and filter duplicate prefetches between Berti and Smart Stride engines.</p>
<h4 id="ember670" class="ember-view reader-text-block__heading-3">Why It Works</h4>
<p id="ember671" class="ember-view reader-text-block__paragraph">The addition of a decoupled stride prefetcher gives sBerti the ability to issue prefetches across physical page boundaries, reducing the “Cold-start Penalty” of Berti. The heuristic based dynamic distance adjustment helps tune the aggressiveness at runtime, allowing longer lookahead for AI-ML workloads dominated by streaming accesses. The final sBerti configuration (Stride + Berti at L1D, Pythia at L2) delivers the best performance in a full bandwidth scenario, where the stride engine can prefetch further ahead.</p>
<p>We will overview the rest of the prefetchers in part 2 of this post.</p>
<h3>About the Author</h3>
<p><span style="font-weight: 400;">Digvijay Singh received his Bachelor’s degree from BITS Pilani and his Master’s degree from Texas A&amp;M University where he worked on data prefetching as part of the CAMSIN research group. He currently works as a Silicon Architect in Google’s mobile CPU team.</span></p>
<p class="disclaim"><strong>Disclaimer:</strong> <em>These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.</em></p><Img align="left" border="0" height="1" width="1" alt="" style="border:0;float:left;margin:0;padding:0;width:1px!important;height:1px!important;" hspace="0" src="https://feeds.feedblitz.com/~/i/954636863/0/sigarch-cat">
]]>
</content:encoded>
			<wfw:commentRss>https://feeds.feedblitz.com/~/954636863/0/sigarch-cat~Fourth-Data-Prefetching-Championship-Part-I/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	<post-id xmlns="com-wordpress:feed-additions:1">103010</post-id></item>
<item>
<feedburner:origLink>https://www.sigarch.org/beyond-qubits-a-systems-view-of-hybrid-cv-dv-quantum-computing/</feedburner:origLink>
		<title>Beyond Qubits: A Systems View of Hybrid CV-DV Quantum Computing</title>
		<link>https://feeds.feedblitz.com/~/954105707/0/sigarch-cat~Beyond-Qubits-A-Systems-View-of-Hybrid-CVDV-Quantum-Computing/</link>
		<comments>https://feeds.feedblitz.com/~/954105707/0/sigarch-cat~Beyond-Qubits-A-Systems-View-of-Hybrid-CVDV-Quantum-Computing/#respond</comments>
		<pubDate>Mon, 20 Apr 2026 15:31:53 +0000</pubDate>
		<dc:creator><![CDATA[Yuan Liu, Zihan Chen, Shubdeep Mohapatra, Jim Furches, Zheng (Eddy) Zhang, Huiyang Zhou]]></dc:creator>
		<category><![CDATA[ACM SIGARCH]]></category>
		<category><![CDATA[Quantum Computing]]></category>
		<guid isPermaLink="false">https://www.sigarch.org/?p=102865</guid>
		<description><![CDATA[<div><img width="300" xheight="167" src="https://www.sigarch.org/wp-content/uploads/2026/04/Picture1-300x167.png" class="attachment-medium size-medium wp-post-image" alt="" style="max-width:100% !important;height:auto !important;margin-bottom:15px;margin-left:15px;float:right;"  loading="lazy" /></div>Hybrid continuous-discrete-variable (CV-DV) quantum computing combines oscillators and qubits to tackle problems that are difficult for either model alone, from bosonic simulation to quantum error correction. At ASPLOS 2026, our tutorial introduced the foundations, compilation stack, benchmarking methods, and programming tools behind this emerging architecture model. In this blog post, we overview the key elements [&#8230;]]]>
</description>
				<content:encoded><![CDATA[<div><img width="300" height="167" src="https://www.sigarch.org/wp-content/uploads/2026/04/Picture1-300x167.png" class="attachment-medium size-medium wp-post-image" alt="" style="margin-bottom:15px;margin-left:15px;float:right;" decoding="async" loading="lazy" /></div><p><b>Hybrid continuous-discrete-variable (CV-DV) quantum computing combines oscillators and qubits to tackle problems that are difficult for either model alone, from bosonic simulation to quantum error correction. At ASPLOS 2026, our tutorial introduced the foundations, compilation stack, benchmarking methods, and programming tools behind this emerging architecture model. In this blog post, we overview the key elements of our tutorial. </b></p>
<p><span style="font-weight: 400;">Tutorial website: </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://cvdv.ncsu.edu/resources/asplos-tutorial/"><span style="font-weight: 400;">https://cvdv.ncsu.edu/resources/asplos-tutorial/</span></a></p>
<h3><b>Foundations</b></h3>
<p><span style="font-weight: 400;">We began with the foundations of hybrid CV-DV quantum computing, introducing the physical model, mathematical language, and programming abstractions behind qubit-oscillator systems. Many leading quantum platforms naturally combine qubits with oscillator modes, such as cavities, vibrational modes, or photonic fields. Rather than treating oscillators as auxiliary hardware, hybrid CV-DV computing views their large Hilbert spaces as a computational resource.</span></p>
<p><span style="font-weight: 400;">The tutorial covered core representations of CV states in both Fock space and phase space, along with the key operators and gate families that support universal CV-DV computation. A central message was that hybrid systems are not simply “qubits plus extra hardware,” but a distinct computational model with their own instruction sets, abstractions, and compilation challenges. We showed how familiar qubit concepts such as Pauli and Clifford structure extend into the oscillator setting through displacement operations, squeezing, quadratic Hamiltonians, beamsplitters, and controlled hybrid interactions.</span></p>
<p><span style="font-weight: 400;">We also discussed why this matters from a computer architecture perspective. Hybrid CV-DV systems introduce new instruction set architectures (ISAs), abstract machine models (AMMs), and compilation choices that help separate hardware details from software design. Depending on the platform and compiler stack, the same computation may be expressed in phase-space language, Fock-space language, or a mixed qubit-oscillator representation.</span></p>
<p><span style="font-weight: 400;">To ground these ideas, we highlighted emerging algorithmic primitives and applications where hybrid systems may offer advantages, including oscillator-mediated entangling gates, state-transfer protocols, Hamiltonian simulation, bosonic quantum error correction, vibronic dynamics, and sensing. We closed the session by surveying two leading implementation pathways, superconducting circuit QED and trapped-ion systems, and discussing the distinct control and connectivity tradeoffs they expose. A comprehensive tutorial on the foundations of hybrid CV-DV quantum processors is available </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://journals.aps.org/prxquantum/abstract/10.1103/4rf7-9tfx"><span style="font-weight: 400;">here</span></a><span style="font-weight: 400;">.</span></p>
<h3><b>Compilation</b></h3>
<p><span style="font-weight: 400;">We also presented Strategies and Tools to Compile CV-DV Quantum Circuits. We began by emphasizing why Hamiltonian simulation is a central application and one of the most promising directions for hybrid continuous-variable and discrete-variable (CV-DV) quantum systems. CV systems can naturally represent continuous degrees of freedom, while DV systems provide strong control and interaction structures. Together, they enable important applications in areas such as quantum chemistry and materials science. However, a key challenge lies in decomposing the time-evolution operator e^{-iHt} into a sequence of executable quantum gates. This transformation is fundamentally a compilation problem, bridging high-level quantum algorithms and low-level hardware. As such, compilers play a critical role in hybrid quantum systems.</span></p>
<p><span style="font-weight: 400;">We then focused on the dominant approach today: symbolic compilation. In particular, we discussed two early CV-DV Hamiltonian simulation compilers from </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://dl.acm.org/doi/full/10.1145/3695053.3731065"><span style="font-weight: 400;">Chen et al., ISCA’25</span></a><span style="font-weight: 400;"> and </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://ieeexplore.ieee.org/abstract/document/11250346"><span style="font-weight: 400;">Decker et al., QCE’25</span></a><span style="font-weight: 400;">. The core idea is to avoid direct matrix-based computation and instead leverage the algebraic structure of operators for rule-based decomposition. Techniques such as Trotter-Suzuki product formulas, the Baker–Campbell–Hausdorff (BCH) expansion, and bosonic commutation relations are used to gradually break down complex Hamiltonians into hardware-executable primitive gates. This process is typically implemented through rule matching and recursive rewriting, where expressions are repeatedly transformed until only supported base gates remain. While this approach avoids the exponential blowup of high-dimensional matrices, it introduces tradeoffs between approximation error and resource overhead.</span></p>
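<p><span style="font-weight: 400;">As a concrete, textbook-level illustration of the rule-based decomposition discussed above (not a snippet from either compiler): for a Hamiltonian split as H = H_1 + H_2, the first-order Trotter-Suzuki formula approximates the time evolution as e^{-iHt} &#8776; (e^{-iH_1 t/r} e^{-iH_2 t/r})^r, with an approximation error that shrinks roughly as t&#178;/r as the number of Trotter steps r grows. Each factor e^{-iH_k t/r} is then rewritten, using BCH-style identities and the bosonic commutation relations, until only hardware-supported primitive gates remain; this is where the accuracy-versus-resource tradeoff appears.</span></p>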
<p><span style="font-weight: 400;">Finally, we analyzed the limitations of current compilers and outlined future research directions. Key challenges include limited gate sets and decomposition rules, the tradeoff between accuracy and resource cost, hardware connectivity constraints, and insufficient optimization flexibility. To address these issues, we highlighted the need for improved programmability, richer native gate support, more accurate cost models, and optimizations that exploit algebraic properties such as commutativity. We also presented the </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://github.com/ruadapt/Genesis-CVDV-Compiler"><span style="font-weight: 400;">Genesis compiler</span></a><span style="font-weight: 400;"> from Chen et al., ISCA’25 as an end-to-end solution example, including typical use cases and code snippets. Genesis employs a multi-level intermediate representation (IR) and a full compilation pipeline to automatically translate Hamiltonians into limited hardware connectivity physical circuits, demonstrating a systematic and extensible compilation framework for hybrid CV-DV quantum computing.</span></p>
<h3><b>Benchmark and Circuit Simulator</b></h3>
<p><span style="font-weight: 400;">We also presented </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/2603.04398"><b>HyQBench</b></a> <span style="font-weight: 400;">by Mohapatra et al., an open-source benchmark suite implemented in Bosonic Qiskit and QuTiP. HyQBench covers eight representative hybrid circuits spanning three abstraction levels: primitives, algorithms, and applications. These include cat state generation, GKP state preparation, CV-to-DV state transfer, </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://ieeexplore.ieee.org/abstract/document/11129874"><span style="font-weight: 400;">CV-DV QFT</span></a><span style="font-weight: 400;">, </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/2501.11735"><span style="font-weight: 400;">CV-DV VQE</span></a><span style="font-weight: 400;">, CV-QAOA, Jaynes-Cummings-Hubbard (JCH) Hamiltonian simulation, and </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.nature.com/articles/s41467-025-67694-5"><span style="font-weight: 400;">Shor’s algorithm</span></a><span style="font-weight: 400;">.</span></p>
<p><span style="font-weight: 400;">One key takeaway is that hybrid architectures can reduce hardware resources dramatically for some workloads. For example, simulating a 3-site JCH model in a DV-only encoding requires 9 qubits and 393 CNOT gates, whereas a hybrid implementation uses only 3 qumodes, 3 qubits, and 8 gates. This kind of reduction highlights why benchmarking hybrid systems requires more than simply counting qubits.</span></p>
<p><span style="font-weight: 400;">To support this, we introduced a feature map tailored to hybrid systems. In addition to standard structural metrics such as gate counts, circuit depth, and qubit/qumode counts, we proposed three CV-DV-specific metrics: Wigner negativity as a proxy for non-classicality and classical simulation hardness, truncation cost to quantify population near the Fock cutoff, and maximum energy. These metrics help separate workloads with very different simulation and execution behavior. For example, JCH simulation remains relatively close to Gaussian behavior, while CV-QAOA and Shor’s algorithm exhibit higher Wigner negativity and are harder to simulate classically.</span></p>
<p><span style="font-weight: 400;">We also discussed early hardware validation. A cat-state preparation benchmark was executed on Sandia National Laboratories’ QSCOUT trapped-ion platform and achieved a fidelity of 0.71. HyQBench was further used to calibrate conditional displacement gates on the same platform, reinforcing the need for standardized benchmark suites that support both evaluation and device calibration. The full paper is available at </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/2603.04398"><span style="font-weight: 400;">https://arxiv.org/abs/2603.04398</span></a><span style="font-weight: 400;">.</span></p>
<p><span style="font-weight: 400;">To lower the barrier to entry for this area, we also developed </span><b>HyQSim</b><span style="font-weight: 400;">, a browser-based hybrid CV-DV circuit simulator that requires no installation. HyQSim supports drag-and-drop circuit construction, arbitrary Fock cutoffs, and built-in visualization through Wigner plots, Fock-state amplitudes, and Bloch sphere views. It is available at </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://cvdv.ncsu.edu/resources/simulator/"><span style="font-weight: 400;">https://cvdv.ncsu.edu/resources/simulator/</span></a><span style="font-weight: 400;">, and the code is hosted at </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://github.com/shubdeepmohapatra01/HyQSim/"><span style="font-weight: 400;">https://github.com/shubdeepmohapatra01/HyQSim/</span></a><span style="font-weight: 400;">.</span></p>
<h3><b>Programming</b></h3>
<p><span style="font-weight: 400;">Finally, we discussed programming support for hybrid CV-DV systems. Quantum programming languages and frameworks have developed many important ideas over the years, including </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://dl.acm.org/doi/10.1007/11417170_26"><span style="font-weight: 400;">linear quantum types</span></a><span style="font-weight: 400;"> for enforcing the no-cloning theorem, </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://dl.acm.org/doi/10.1145/3385412.3386007"><span style="font-weight: 400;">automatic uncomputation</span></a><span style="font-weight: 400;"> of ancilla qubits, and </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://dl.acm.org/doi/10.1145/2499370.2462177"><span style="font-weight: 400;">dynamic lifting of classical variables</span></a><span style="font-weight: 400;"> for mid-circuit measurement. Hybrid quantum computing introduces an additional requirement: heterogeneous quantum registers containing both qubits and qumodes.</span></p>
<p><span style="font-weight: 400;">To address this challenge, we developed </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/2603.10919"><b>Hybridlane</b></a><span style="font-weight: 400;">, a CV-DV quantum programming framework built on </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/1811.04968"><span style="font-weight: 400;">PennyLane</span></a><span style="font-weight: 400;">. By extending PennyLane, Hybridlane inherits a broad library of qubit algorithms, gates, and compilation routines while remaining familiar to existing users. Hybridlane tracks wire types automatically through symbolic circuit analysis and type inference, enabling scalable circuit construction, platform independence, and integration with downstream compilation flows.</span></p>
<p><span style="font-weight: 400;">The tutorial concluded with example workflows using Hybridlane. In one example, we reused an existing PennyLane quantum phase estimation template for a CV-DV Hamiltonian simulation and then lowered it through symbolic compilation to a gate sequence executable on the </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/2209.11153"><span style="font-weight: 400;">Bosonic Qiskit</span></a><span style="font-weight: 400;"> backend. In another, we demonstrated a cross-platform workflow in which a conditional displacement gate was calibrated in simulation and then compiled for execution on Sandia’s QSCOUT trapped-ion platform. Together, these examples showed how hybrid quantum software can begin to support the same define-simulate-execute workflow that has become standard in mature qubit SDKs.</span></p>
<p><span style="font-weight: 400;">We hope Hybridlane helps enable a broader ecosystem of reusable software and research for hybrid quantum computing. It is available at </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://github.com/pnnl/hybridlane"><span style="font-weight: 400;">https://github.com/pnnl/hybridlane</span></a><span style="font-weight: 400;">.</span></p>
<h3><b>Closing</b></h3>
<p><span style="font-weight: 400;">Hybrid CV-DV computing sits at the intersection of quantum hardware, computer architecture, compilation, and programming systems. We hope this tutorial helps make the area more accessible to researchers across architecture, systems, programming languages, and quantum information, and we invite readers to explore the tutorial materials, benchmarks, and tools linked above.</span></p>
<h3><strong>About the Authors</strong></h3>
<p><span style="font-weight: 400;"><strong>Yuan Liu</strong> is an Assistant Professor of Electrical &amp; Computer Engineering and Computer Science at North Carolina State University. Prior to joining the NC State faculty, he was a postdoctoral researcher at the Massachusetts Institute of Technology. His research interests lie at the intersection of quantum computing, quantum engineering, quantum algorithms/architectures and applications.</span></p>
<p><span style="font-weight: 400;"><strong>Zihan Chen</strong> is a Ph.D. student in computer systems at Rutgers University, advised by Prof. Eddy Z. Zhang. His research focuses on compiler and system-level techniques, as well as parallel computing, to enhance the efficiency, programmability, scalability, and fault tolerance of emerging quantum computing systems.</span></p>
<p><span style="font-weight: 400;"><strong>Shubdeep Mohapatra</strong> is a Ph.D. candidate in Computer Engineering at NC State University, advised by Prof. Huiyang Zhou and Prof. Yuan Liu. His research focuses on quantum error characterization, mitigation, and benchmarking, aimed at improving the reliability and fault tolerance of near-term quantum computing systems.</span></p>
<p><span style="font-weight: 400;"><strong>Jim Furches</strong> is a post-masters research associate at Pacific Northwest National Laboratory. His current research interests are in quantum benchmarking, algorithms, and quantum programming and compilation.</span></p>
<p><span style="font-weight: 400;"><strong>Zheng (Eddy) Zhang</strong> is a Professor in the Department of Computer Science at Rutgers University. Her research focuses on full-stack compiler and programming systems for quantum computing. She studies how to better coordinate quantum applications, programming languages, intermediate representations, compilation, pulse-level control, and hardware architecture to improve the performance, usability, and scalability of quantum systems.</span></p>
<p><span style="font-weight: 400;"><strong>Huiyang Zhou</strong> is a Professor of Electrical and Computer Engineering at North Carolina State University. His current research interests include GPU architecture, processor security, and quantum computing.</span></p>
<p class="disclaim"><strong>Disclaimer:</strong> <em>These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.</em></p><Img align="left" border="0" height="1" width="1" alt="" style="border:0;float:left;margin:0;padding:0;width:1px!important;height:1px!important;" hspace="0" src="https://feeds.feedblitz.com/~/i/954105707/0/sigarch-cat">
]]>
</content:encoded>
			<wfw:commentRss>https://feeds.feedblitz.com/~/954105707/0/sigarch-cat~Beyond-Qubits-A-Systems-View-of-Hybrid-CVDV-Quantum-Computing/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	<post-id xmlns="com-wordpress:feed-additions:1">102865</post-id></item>
<item>
<feedburner:origLink>https://www.sigarch.org/computer-architectures-alphazero-moment-is-here/</feedburner:origLink>
		<title>Computer Architecture&#8217;s AlphaZero Moment is Here</title>
		<link>https://feeds.feedblitz.com/~/953617784/0/sigarch-cat~Computer-Architectures-AlphaZero-Moment-is-Here/</link>
		<comments>https://feeds.feedblitz.com/~/953617784/0/sigarch-cat~Computer-Architectures-AlphaZero-Moment-is-Here/#comments</comments>
		<pubDate>Fri, 10 Apr 2026 14:00:43 +0000</pubDate>
		<dc:creator><![CDATA[Karu Sankaralingam]]></dc:creator>
		<category><![CDATA[ACM SIGARCH]]></category>
		<category><![CDATA[Architecture]]></category>
		<category><![CDATA[Evaluation]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<guid isPermaLink="false">https://www.sigarch.org/?p=102162</guid>
		<description><![CDATA[<div><img width="300" xheight="187" src="https://www.sigarch.org/wp-content/uploads/2026/04/Gemini_Generated_Image_yndetsyndetsynde-1-300x187.png" class="attachment-medium size-medium wp-post-image" alt="" style="max-width:100% !important;height:auto !important;margin-bottom:15px;margin-left:15px;float:right;"  loading="lazy" /></div>For decades, we have designed chips in fundamentally the same way: human intuition applied to a vanishingly small slice of an impossibly large design space. That paradigm worked when Moore&#8217;s Law was lifting everything. We could afford to be wrong. We could afford to miss the best design. Process scaling would close the gap. That [&#8230;]]]>
</description>
				<content:encoded><![CDATA[<div><img width="300" height="187" src="https://www.sigarch.org/wp-content/uploads/2026/04/Gemini_Generated_Image_yndetsyndetsynde-1-300x187.png" class="attachment-medium size-medium wp-post-image" alt="" style="margin-bottom:15px;margin-left:15px;float:right;" decoding="async" loading="lazy" /></div><div>
<div></div>
<div>For decades, we have designed chips in fundamentally the same way: human intuition applied to a vanishingly small slice of an impossibly large design space. That paradigm worked when Moore&#8217;s Law was lifting everything. We could afford to be wrong. We could afford to miss the best design. Process scaling would close the gap.</div>
<div></div>
<div>That world is over. In a recent position paper — <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/2604.03312">&#8220;Computer Architecture&#8217;s AlphaZero Moment: Automated Discovery in an Encircled World&#8221;</a> — I argue that we are at an inflection point. Not a gradual shift, but a structural break in how architecture must be practiced.</div>
<div></div>
<h3>From Idea Scarcity to Evaluation Scarcity</h3>
<div>The central claim is simple, but uncomfortable:</div>
<div></div>
<div><em>Computer architecture is no longer bottlenecked by ideas. It is bottlenecked by evaluation and telemetry.</em></div>
<div></div>
<div>For decades, the field has implicitly assumed that ideas are scarce — that the role of the architect is to generate the one clever mechanism worth exploring. Everything else follows. But recent evidence suggests the opposite. With modern large language models and agentic pipelines, hundreds of viable architectural ideas can be generated per day, thousands of candidate designs can be evaluated per week, and design cycles can compress from months to weeks.</div>
<div></div>
<div>This is not speculative. We built a system called the <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://github.com/VerticalResearchGroup/Gauntlet">Gauntlet</a> and tested it on 85 papers from ISCA 2025 and HPCA 2026 — largely outside the model&#8217;s training data. Across 475 independent runs, it produced viable architectural mechanisms 95% of the time: independently re-deriving authors&#8217; exact solutions in 48% of cases, and proposing valid alternatives the authors never considered in another 50%. Each took 10–20 minutes. This flips a foundational assumption of the field. If ideas are abundant, then the limiting factor is no longer creativity — it is <strong>which ideas we can evaluate, validate, and trust</strong>. This <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://pages.cs.wisc.edu/~karu/ArchAlphaZero/zero-arch/html/">link</a> contains the corpus of problem statements and Gauntlet&#8217;s solutions.</div>
</div>
<div></div>
<div>
<h4>1. Evaluation is the new bottleneck</h4>
<p>We are moving from a world where the question was &#8220;Can we come up with a good idea?&#8221; to one where the question becomes &#8220;Can we evaluate 10,000 ideas fast enough to find the best one?&#8221; This elevates simulation infrastructure, analytical modeling, and verification into the central problems of the field. The &#8220;PhD student for three months&#8221; implementation bottleneck is already eroding — our system built first-principles performance models from papers in under 20 minutes. What replaces it is a race to build faster, more accurate, and more scalable evaluation pipelines.</p>
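<p>To make &#8220;first-principles performance model&#8221; concrete, here is a toy roofline-style model; the peak-compute and bandwidth numbers below are invented for illustration and are not drawn from the Gauntlet system or the paper.</p>
<pre>
# Toy roofline-style analytical model; machine parameters are hypothetical.
def attainable_gflops(arithmetic_intensity, peak_gflops=100.0, mem_bw_gbs=50.0):
    """Throughput is capped by either peak compute or bandwidth * intensity."""
    return min(peak_gflops, mem_bw_gbs * arithmetic_intensity)

print(attainable_gflops(0.25))   # 12.5  -> bandwidth-bound kernel
print(attainable_gflops(10.0))   # 100.0 -> compute-bound kernel
</pre>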
<h4>2. The telemetry divide</h4>
<div>If evaluation becomes central, then <strong>ground truth becomes everything. </strong>Over time, access to closed-loop deployment telemetry — real workloads, real performance counters, real system behavior at scale, and in low-level depth — may matter as much as architectural insight itself. This creates a risk of structural divide. Academic research, long dependent on proxy benchmarks, could drift further from production reality unless we collectively rethink how we share and access workload data.</div>
<h4>3. The end of the old boundary</h4>
<div>The traditional separation between &#8220;chip company&#8221; and &#8220;cloud provider&#8221; begins to dissolve. Automated architecture requires three tightly coupled capabilities: deployment (to generate telemetry), infrastructure (to evaluate designs at scale), and silicon expertise (to realize designs physically). No single traditional player owns all three. The result is convergence — either through vertical integration or new hybrid ecosystems.</div>
<h3>The Deeper Claim</h3>
<div>The more provocative claim is not about tools — it is about limits. Human-driven architecture is becoming structurally outmatched by the scale of the design space. This is not a statement about human ability. It is about combinatorics. The architectural search space — spanning parametric and structural choices — is effectively unbounded. Humans sample an infinitesimal fraction of it. That was acceptable in an era of abundance. It is not acceptable in an era where architectural efficiency is the primary lever for progress. The analogy to <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/1712.01815">AlphaZero </a> is not rhetorical. It is structural: when search, evaluation, and feedback loops become fast enough, intuition gives way to systematic exploration.</div>
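<div>As a back-of-envelope illustration of that combinatorics, consider only a handful of independent design knobs; the option counts below are invented for illustration and are not taken from the paper.</div>
<pre>
# Hypothetical per-knob option counts (cache size, associativity, prefetcher, issue width, ...).
import math

choices = [10, 8, 6, 12, 16, 4, 20, 8, 5, 10]
print(math.prod(choices))   # ~2.9 billion configurations from just 10 parametric knobs
</pre>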
<h3>What This Means for Research — and Teaching</h3>
<div>If this framing is even partially correct, it forces a rethinking of what it means to &#8220;do&#8221; computer architecture research. Several shifts seem likely. If machines can generate many viable solutions, identifying the <em>right problem</em> becomes the scarce intellectual act. Evaluation frameworks, modeling techniques, and telemetry integration may matter more than individual architectural ideas. And the reliance on fixed benchmark suites becomes increasingly fragile in a world driven by dynamic, evolving workloads.</div>
<div></div>
<div>The full paper includes a set of predictions and my opinions on how I see this playing out. This extends to how we teach. Do we still emphasize canonical microarchitectures, or shift toward trade-off reasoning, evaluation frameworks, and interpreting machine-generated designs? What does it mean to train a researcher when idea generation itself is becoming automated?</div>
<h3>A Call for Collaboration</h3>
<div>This is not a settled direction — it is a hypothesis that needs to be stress-tested by the community. If this resonates (or if you think it is completely wrong), I would love to engage on: new models for teaching architecture, shared evaluation infrastructure and artifacts, privacy-preserving approaches to workload telemetry, and workshops focused on problem formulation rather than solution novelty. If this is even half right, we may need to rethink our identity as a field. Let&#8217;s debate it.</div>
<div></div>
<div><strong>About the author:</strong> Karthikeyan Sankaralingam is Principal Research Scientist at NVIDIA and Professor at UW-Madison.</div>
</div>
<p class="disclaim"><strong>Disclaimer:</strong> <em>These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.</em></p><Img align="left" border="0" height="1" width="1" alt="" style="border:0;float:left;margin:0;padding:0;width:1px!important;height:1px!important;" hspace="0" src="https://feeds.feedblitz.com/~/i/953617784/0/sigarch-cat">
]]>
</content:encoded>
			<wfw:commentRss>https://feeds.feedblitz.com/~/953617784/0/sigarch-cat~Computer-Architectures-AlphaZero-Moment-is-Here/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	<post-id xmlns="com-wordpress:feed-additions:1">102162</post-id></item>
<item>
<feedburner:origLink>https://www.sigarch.org/spilling-the-neural-tea-a-journey-down-the-side-channel/</feedburner:origLink>
		<title>Spilling the Neural Tea: A Journey Down the Side-Channel</title>
		<link>https://feeds.feedblitz.com/~/953386850/0/sigarch-cat~Spilling-the-Neural-Tea-A-Journey-Down-the-SideChannel/</link>
		<comments>https://feeds.feedblitz.com/~/953386850/0/sigarch-cat~Spilling-the-Neural-Tea-A-Journey-Down-the-SideChannel/#respond</comments>
		<pubDate>Mon, 06 Apr 2026 15:37:22 +0000</pubDate>
		<dc:creator><![CDATA[Adnan Rakin]]></dc:creator>
		<category><![CDATA[ACM SIGARCH]]></category>
		<category><![CDATA[deep neural networks]]></category>
		<category><![CDATA[Security]]></category>
		<category><![CDATA[side-channels]]></category>
		<guid isPermaLink="false">https://www.sigarch.org/?p=101933</guid>
		<description><![CDATA[<div><img width="300" xheight="164" src="https://www.sigarch.org/wp-content/uploads/2026/04/Gemini_Generated_Image_ptypq1ptypq1ptyp-300x164.png" class="attachment-medium size-medium wp-post-image" alt="" style="max-width:100% !important;height:auto !important;margin-bottom:15px;margin-left:15px;float:right;"  loading="lazy" /></div>Years ago, I came across three pioneering works (CSI-NN, Cache Telepathy, and DeepSniffer) in the field of reverse engineering neural networks that inspired my journey into side-channel attacks to uncover the secrets of modern Deep Neural Networks (DNNs). Fast forward to today, and there has been significant exploitation of side-channel attacks to discover the secrets [&#8230;]]]>
</description>
				<content:encoded><![CDATA[<div><img width="300" height="164" src="https://www.sigarch.org/wp-content/uploads/2026/04/Gemini_Generated_Image_ptypq1ptypq1ptyp-300x164.png" class="attachment-medium size-medium wp-post-image" alt="" style="margin-bottom:15px;margin-left:15px;float:right;" decoding="async" loading="lazy" /></div><p><span style="font-weight: 400;">Years ago, I came across three pioneering works (</span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.usenix.org/conference/usenixsecurity19/presentation/batina"><span style="font-weight: 400;">CSI-NN</span></a><span style="font-weight: 400;">, </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://iacoma.cs.uiuc.edu/iacoma-papers/usenix20.pdf"><span style="font-weight: 400;">Cache Telepathy</span></a><span style="font-weight: 400;">, and </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://dl.acm.org/doi/10.1145/3373376.3378460"><span style="font-weight: 400;">DeepSniffer</span></a><span style="font-weight: 400;">) in the field of reverse engineering neural networks that inspired my journey into side-channel attacks to uncover the secrets of modern Deep Neural Networks (DNNs). Fast forward to today, and there has been significant exploitation of side-channel attacks to discover the secrets of neural networks. It&#8217;s a good time to provide an overview of where we stand, the outlook for the future, and the challenges ahead.</span></p>
<p><b>Motivation: </b><span style="font-weight: 400;">Let&#8217;s take a step back and first try to understand why we care about secrets in deep learning models. It basically boils down to two fundamental challenges associated with deep learning: i) financial and ii) security and privacy challenges. In general, DNNs are intellectual property (IP), as they are products developed over years of research, implementation, and investment in computing units, and they entail significant training costs (time, energy, and labor), making them a valuable asset for their owners. Just to give a rough estimate, OpenAI’s </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://openai.com/index/gpt-4-research/"><span style="font-weight: 400;">GPT-4</span></a><span style="font-weight: 400;"> cost more than $100 million to train, and its </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://openai.com/gpt-5/"><span style="font-weight: 400;">GPT-5</span></a><span style="font-weight: 400;"> model is expected to be more than 5x as expensive (</span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.forbes.com/sites/katharinabuchholz/2024/08/23/the-extreme-cost-of-training-ai-models/"><span style="font-weight: 400;">Cost of Training GPT</span></a><span style="font-weight: 400;">). I do not know about you, but if I spent $100 million on something, I would care about protecting it. The second challenge is that a leaked model secret gives an adversary white-box knowledge, which is extremely powerful in security and privacy settings. Any adversary with knowledge of a target victim&#8217;s model architecture (e.g., model type, layer sequence, and number) and weight information, formally defined as “white-box,” can launch powerful security (adversarial attacks) and privacy threats (model inversion attacks/membership inference attacks). As highlighted in Figure 1, the attacker’s final objective in the DNN reverse-engineering attack is to gain white-box privileges either to steal IP for financial gain or to launch subsequent attacks.</span></p>
<p><i><span style="font-weight: 400;">In summary, in security and privacy research, defining the threat model is the first step towards any exploitation, and the underlying assumption is often that a reverse-engineering attack has successfully uncovered the model architecture, weights, and other hyperparameters.</span></i></p>
<p><b>Attack Objectives: </b><span style="font-weight: 400;">By now, we have established that an attacker’s goal is to uncover two key properties of a victim DNN: its architecture and its parameters. However, this is an oversimplified goal and can often be misleading. To understand this, let&#8217;s consider a deep neural network as a function of </span><span style="font-weight: 400;">x, denoted</span><span style="font-weight: 400;"> </span><span style="font-weight: 400;">f</span><span style="font-weight: 400;">(x)</span><span style="font-weight: 400;">.  If an attacker wants to recover the exact victim model, their objective is for the stolen model to be identical to the original </span><span style="font-weight: 400;">f</span><span style="font-weight: 400;">(x)</span><span style="font-weight: 400;">, which is practically impossible for large-scale DNNs, whether using existing side-channel attacks or the exact victim dataset. As a result, a more practical and plausible goal for an attacker would be to achieve functional equivalence. If the stolen function is different, such as </span><span style="font-weight: 400;">g</span><span style="font-weight: 400;">(x)</span><span style="font-weight: 400;">,</span><span style="font-weight: 400;"> then, for incentive purposes, all an attacker cares about is that these two functions produce identical output,  i.e., </span><span style="font-weight: 400;">f</span><span style="font-weight: 400;">(x)=</span> <span style="font-weight: 400;">g</span><span style="font-weight: 400;">(x)</span><span style="font-weight: 400;">, for inputs </span><span style="font-weight: 400;">x</span><span style="font-weight: 400;"> that are of the attacker&#8217;s interest. As a result, achieving functional equivalence means recovering the DNN model architecture, often as close as possible to the victim architecture&#8217;s topology. On the weight side, even if an attacker cannot extract the exact weights, they must aim for a weight-space solution that captures the victim model&#8217;s functionality.</span></p>
<p><i><span style="font-weight: 400;">In summary, to steal a copy of the victim model/function, an attacker must identify the victim model architecture. In modern deep learning, where most practical applications use some version of a DNN model from an existing pool (e.g., GPT, Llama), recovering the architecture often boils down to detecting the model&#8217;s topology. Once the architecture is revealed, the attacker must recover the model parameters/weights, which is often a challenging part of the attack. Then again, as we discussed earlier, exact model recovery can be challenging, but achieving functional equivalence is a modest objective. Most importantly, to achieve functional equivalence, the attacker may not need to reveal the exact numerical weights; rather, gradually recovering coarse-grained information (e.g., weight sparsity, quantization pattern, weight distribution) is often sufficient.</span></i></p>
<p><img loading="lazy" decoding="async" class="alignnone wp-image-101935" src="https://www.sigarch.org/wp-content/uploads/2026/04/image7.png" alt="" width="884" height="483" /></p>
<p>Figure 1: Spectrum of attack threats characterized by attacker’s knowledge: Black-Box (No Knowledge), Grey-Box (Partial Knowledge, e.g., architecture), and White-box (Complete knowledge of model architecture and weights), the ultimate goal of reverse-engineering (AI-generated).</p>
<p><b>Attack Techniques and Capabilities.</b> <span style="font-weight: 400;">The two popular types of side-channel attacks, physical and microarchitectural, are exploited under two different threat-model settings. In edge or embedded devices, the physical side channel is the dominant threat, and several works (</span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.usenix.org/conference/usenixsecurity19/presentation/batina"><span style="font-weight: 400;">CSI-NN</span></a><span style="font-weight: 400;">, </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.usenix.org/conference/usenixsecurity25/presentation/horvath"><span style="font-weight: 400;">BarraCUDA</span></a><span style="font-weight: 400;">) have shown that it is possible to recover the model architecture and weights of simple neural networks. On the other hand, microarchitectural side channels are a popular choice for resource-sharing cloud environments where users can upload and run their code in a colocated environment (e.g., </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://aws.amazon.com/sagemaker/"><span style="font-weight: 400;">Amazon SageMaker</span></a><span style="font-weight: 400;"> and </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://cloud.google.com/ml-engine/docs/technical-overview"><span style="font-weight: 400;">Google ML Engine</span></a><span style="font-weight: 400;">). Microarchitectural attacks have been successful in recovering model architecture across the board using </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.usenix.org/conference/usenixsecurity20/presentation/yan"><span style="font-weight: 400;">cache timing channels</span></a><span style="font-weight: 400;">, </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://dl.acm.org/doi/10.1145/3373376.3378460"><span style="font-weight: 400;">memory access patterns</span></a><span style="font-weight: 400;">, and </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://ieeexplore.ieee.org/document/9153424/"><span style="font-weight: 400;">GPU context switching</span></a><span style="font-weight: 400;">. I acknowledge that there are many ways to recover DNN model weights, including </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/tramer"><span style="font-weight: 400;">learning-based approaches</span></a><span style="font-weight: 400;"> and </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.usenix.org/conference/usenixsecurity20/presentation/jagielski"><span style="font-weight: 400;">mathematical recovery</span></a><span style="font-weight: 400;"> techniques. In this blog post, I focus on side-channel attacks; at the same time, learning-based approaches can complement side-channel attacks once the architecture information has been leaked.</span></p>
<p><i><span style="font-weight: 400;">In summary, while side-channel attacks have been successful in leaking model architecture information, as the scale of modern DNNs, e.g., LLM weights, continues to reach new heights of billions, none of the existing side channels can scalably and predictably recover model parameter information. A common workaround would be to support these methods with a learning approach, assuming an attacker has a partial training set, which may not be practical, even in a resource-sharing environment where data remains private.</span></i></p>
<p><b><i>Future Challenges and Opportunities: </i></b></p>
<p><b>What is the future of architecture-recovery attacks, given the success of existing side channels?</b></p>
<p><i><span style="font-weight: 400;">As the next wave of vision and language domain architectures emerges, they present new challenges and opportunities for the microarchitectural side-channel attack community. These models require modern compute support, which can accelerate their inference (e.g., tensor cores), as GPUs become more modern and newer generations may leave new traces of side-channel information. Hence, these newer compute platforms (e.g., new GPUs) and their associated architectural support demand new innovation in side-channel capabilities to recover the model architecture. We must remember that architecture recovery is essential; without it, model parameter recovery is no longer useful. Moreover, as LLMs emerge as the dominant model, the question is not just about recovering weights or architecture; leaking other components, such as KV cache in a multi-tenant setting, can lead to </span></i><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.ndss-symposium.org/wp-content/uploads/2025-1772-paper.pdf"><i><span style="font-weight: 400;">privacy leakage</span></i></a><i><span style="font-weight: 400;">. </span></i></p>
<p><b>Can a microarchitectural side channel alone ever be sufficient to recover model weight information? </b></p>
<p><span style="font-weight: 400;">The sheer scale of the modern model poses an even greater challenge for recovering weights, making direct recovery an ambitious, and even impossible, goal; instead, we should focus on functional equivalence. To achieve functional equivalence, weight recovery methods can set tiny stepping stones to augment learning-based recovery. </span></p>
<p><i><span style="font-weight: 400;">Complete weight recovery using a side channel at the scale of LLMs or even a smaller vision model may be too ambitious. Instead, the attacks should focus on coarse-grained information about weights, such as model sparsity levels, quantization mechanisms, weight sign recovery, and other optimization techniques. The key idea is to achieve functional equivalence by first recovering coarse-grained information, which is sufficient to support other learning-based recovery. It is time to work towards an achievable target: recovering this statistical weight-level knowledge and studying how critical their role is in improving subsequent attacks. As models and their computation units are increasingly optimized, leaking information such as sparsity levels or bit-widths will become more feasible by detecting optimized paths through side-channel leakage.</span></i></p>
<p><span style="font-weight: 400;">Finally, an attack is never the end goal. We probe attacks from every angle so we can study them before any attacker ever thinks about them. The endgame is always to develop subsequent defenses, which I leave for another discussion.</span></p>
<p><strong>About the author: </strong></p>
<p><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.binghamton.edu/computer-science/people/profile.html?id=arakin"><span style="font-weight: 400;">Adnan Siraj Rakin</span></a><span style="font-weight: 400;"> is an Assistant Professor at the School of Computing at Binghamton University. He received his Master&#8217;s (2021) and PhD (2022) from Arizona State University. He works on emerging security and privacy challenges in modern AI </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.usenix.org/conference/usenixsecurity21/presentation/rakin"><span style="font-weight: 400;">systems</span></a><span style="font-weight: 400;"> and </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://openaccess.thecvf.com/content/ICCV2023/papers/Ahmed_SSDA_Secure_Source-Free_Domain_Adaptation_ICCV_2023_paper.pdf"><span style="font-weight: 400;">algorithms</span></a><span style="font-weight: 400;">. His paper on</span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://ieeexplore.ieee.org/document/9833743/"><span style="font-weight: 400;"> DNN model weight recovery</span></a><span style="font-weight: 400;"> has been </span><span style="font-weight: 400;">crowned as </span><span style="font-weight: 400;">Top Picks in Hardware and Embedded Security in 2024. </span></p>
<p class="disclaim"><strong>Disclaimer:</strong> <em>These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.</em></p><Img align="left" border="0" height="1" width="1" alt="" style="border:0;float:left;margin:0;padding:0;width:1px!important;height:1px!important;" hspace="0" src="https://feeds.feedblitz.com/~/i/953386850/0/sigarch-cat">
]]>
</content:encoded>
			<wfw:commentRss>https://feeds.feedblitz.com/~/953386850/0/sigarch-cat~Spilling-the-Neural-Tea-A-Journey-Down-the-SideChannel/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	<post-id xmlns="com-wordpress:feed-additions:1">101933</post-id></item>
<item>
<feedburner:origLink>https://www.sigarch.org/to-sparsify-or-to-quantize-a-hardware-architecture-view/</feedburner:origLink>
		<title>To Sparsify or To Quantize: A Hardware Architecture View</title>
		<link>https://feeds.feedblitz.com/~/950020673/0/sigarch-cat~To-Sparsify-or-To-Quantize-A-Hardware-Architecture-View/</link>
		<comments>https://feeds.feedblitz.com/~/950020673/0/sigarch-cat~To-Sparsify-or-To-Quantize-A-Hardware-Architecture-View/#respond</comments>
		<pubDate>Thu, 12 Mar 2026 15:00:43 +0000</pubDate>
		<dc:creator><![CDATA[Sai Srivatsa Bhamidipati]]></dc:creator>
		<category><![CDATA[ACM SIGARCH]]></category>
		<category><![CDATA[Accelerators]]></category>
		<category><![CDATA[deep neural networks]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<guid isPermaLink="false">https://www.sigarch.org/?p=100754</guid>
		<description><![CDATA[<div><img width="300" xheight="164" src="https://www.sigarch.org/wp-content/uploads/2026/03/Blog-Image-2-Picsart-AiImageEnhancer-300x164.png" class="attachment-medium size-medium wp-post-image" alt="" style="max-width:100% !important;height:auto !important;margin-bottom:15px;margin-left:15px;float:right;"  loading="lazy" /></div>The debate of sparsity versus quantization has made its rounds in the ML optimization community for many years. Now, with the Generative AI revolution, the debate is intensifying. While these might both seem like simple mathematical approximations to an AI researcher, for a hardware architect, they present fundamentally different sets of challenges. Many architects in [&#8230;]]]>
</description>
				<content:encoded><![CDATA[<div><img width="300" height="164" src="https://www.sigarch.org/wp-content/uploads/2026/03/Blog-Image-2-Picsart-AiImageEnhancer-300x164.png" class="attachment-medium size-medium wp-post-image" alt="" style="margin-bottom:15px;margin-left:15px;float:right;" decoding="async" loading="lazy" /></div><p><span style="font-weight: 400;">The debate of sparsity versus quantization has made its rounds in the ML optimization community for many years. Now, with the Generative AI revolution, the debate is intensifying. While these might both seem like simple mathematical approximations to </span>an AI researcher, for a hardware architect, they present fundamentally different sets of challenges. Many architects in the AI hardware space are deeply familiar with watching the scale tip from one side to the other, constantly searching for a pragmatic balance. Let&#8217;s look at both techniques, unpack the architectural challenges they introduce, and explore whether a &#8220;best of both worlds&#8221; scenario is truly possible (Spoiler: It depends).</p>
<p><i><span style="font-weight: 400;">Note: We will only be looking at compute-bound workloads, which traditionally rely on dense compute units such as tensor cores or MXUs. We will set aside memory-bound workloads for now, as they introduce their own distinct set of tradeoffs for sparsity and quantization.</span></i></p>
<h2><b>Sparsity</b></h2>
<p><span style="font-weight: 400;">The core idea of sparsity is beautifully simple: if a neural network weight is zero (or close enough to it), just don&#8217;t do the math. Theoretically, pruning can save massive amounts of compute and memory bandwidth.</span></p>
<p><b>The Architecture Challenge: The Chaos of Unstructured Data</b></p>
<p><span style="font-weight: 400;">The golden goose of this approach is fine-grained, unstructured sparsity. It offers a high level of achievable compression through pruning, but results in a completely random distribution of zero elements. Traditional dense hardware </span><i><span style="font-weight: 400;">hates</span></i><span style="font-weight: 400;"> this. Randomness leads to irregular memory accesses, unpredictable load balancing across cores, and terrible cache utilization. High-performance SIMD units end up starving while the memory controller plays hopscotch trying to fetch the next non-zero value. To architect around this, pioneering unstructured sparse accelerators—such as</span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/1602.01528"> <span style="font-weight: 400;">EIE</span></a><span style="font-weight: 400;"> and</span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/1708.04485"> <span style="font-weight: 400;">SCNN</span></a><span style="font-weight: 400;">—had to rely heavily on complex routing logic, specialized crossbars, and deep queues just to keep the compute units fed, often trading compute area for routing overhead.</span></p>
<p><b>The Compromise: Structured and Coarse-Grained Sparsity</b></p>
<p><span style="font-weight: 400;">To tame this chaos, the industry shifted toward structured compromises. The universally embraced</span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://developer.nvidia.com/blog/accelerating-inference-with-sparsity-using-ampere-and-tensorrt/"> <span style="font-weight: 400;">N:M sparsity</span></a><span style="font-weight: 400;"> (popularized by NVIDIA&#8217;s Ampere architecture) forces exactly N non-zero elements in every block of M. This provides a predictable load-balancing mechanism where the hardware can perfectly schedule memory fetches and compute.</span></p>
<p><span style="font-weight: 400;">More recently, to tackle the quadratic memory bottleneck of long-context LLMs, we&#8217;ve seen a surge in modern </span><i><span style="font-weight: 400;">sparse attention mechanisms</span></i><span style="font-weight: 400;"> that leverage block sparsity. Techniques like</span> <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://github.com/mit-han-lab/Block-Sparse-Attention"><i><span style="font-weight: 400;">Block-Sparse Attention</span></i></a><span style="font-weight: 400;"> and </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/2003.05997"><span style="font-weight: 400;">Routing Attention</span></a><span style="font-weight: 400;"> enforce sparsity at the chunk or tile level. Instead of picking individual tokens, they route computation to contiguous blocks of tokens, allowing standard dense matrix multiplication engines to skip entire chunks while maintaining high MXU utilizations and contiguous memory access. Other approaches, like</span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/2309.17453"> <span style="font-weight: 400;">StreamingLLM</span></a><span style="font-weight: 400;">, evict older tokens entirely, retaining only local context and specific &#8220;heavy hitter&#8221; sink tokens.</span></p>
<p><span style="font-weight: 400;">The trade-off across these methods is clear: we exchange theoretical maximum efficiency for hardware-friendly predictability, paying a &#8220;tax&#8221; in metadata storage (index matrices), specialized multiplexing logic, and the persistent algorithmic risk of dropping contextually vital information.</span></p>
<h2><b>Quantization</b></h2>
<p><span style="font-weight: 400;">While sparsity aims to compute </span><i><span style="font-weight: 400;">less</span></i><span style="font-weight: 400;">, quantization aims to compute </span><i><span style="font-weight: 400;">smaller</span></i><span style="font-weight: 400;">. Shrinking datatypes from 32-bit floats (FP32) to INT8, or embracing emerging standards like the</span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf"> <span style="font-weight: 400;">OCP Microscaling Formats (MX) Specification</span></a><span style="font-weight: 400;"> (such as MXFP8 E4M3 and E5M2), acts as an immediate multiplier for memory bandwidth and capacity. But the frontier has pushed much further than 8-bit. Recent advancements in extreme quantization, such as</span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/2402.17764"> <span style="font-weight: 400;">BitNet b1.58</span></a><span style="font-weight: 400;"> (1-bit LLMs using ternary weights of {-1, 0, 1}) and 2-bit quantization schemes (like</span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/2210.17323"> <span style="font-weight: 400;">GPTQ</span></a><span style="font-weight: 400;"> or <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/2307.13304">Quip</a>), demonstrate that large language models can maintain remarkable accuracy even when weights are squeezed to their absolute theoretical limits.</span></p>
<p><b>The Architecture Challenge: The Tyranny of Metadata and Scaling Factors</b></p>
<p><span style="font-weight: 400;">From an architecture perspective, the challenge of extreme quantization isn&#8217;t just the math—it&#8217;s the metadata. To maintain accuracy at 4-bit, 2-bit, or sub-integer levels, algorithms demand fine-grained control, requiring per-channel, per-group, or even per-token dynamic scaling factors. Every time we shrink the primary datapath, the relative hardware overhead of managing these scaling factors skyrockets. Along with that, the quantization algorithm also becomes more fine grained, dynamic and complex. We are forced to add additional logic and even high-precision accumulators (often FP16 or FP32) just to handle the on-the-fly de-quantization and accumulation. We aggressively optimize the MAC (Multiply-Accumulate) units, only to trade that for the overhead of adding scaling factor handling and supporting a potentially new dynamic quantization scheme, which can outweigh the gains.</span></p>
<p><b>The Compromise: Algorithmic Offloading</b></p>
<p><span style="font-weight: 400;">To fix this without blowing up the complexity and area budget, the community relies on algorithmic co-design. Techniques like</span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/2211.10438"> <span style="font-weight: 400;">SmoothQuant</span></a><span style="font-weight: 400;"> effectively migrate the quantization difficulty offline, mathematically shifting the dynamic range from spiky, hard-to-predict activations into the statically known weights. Similarly,</span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/2306.00978"> <span style="font-weight: 400;">AWQ (Activation-aware Weight Quantization)</span></a><span style="font-weight: 400;"> identifies and protects a small fraction of &#8220;salient&#8221; weights to maintain accuracy without requiring complex, dynamic mixed-precision hardware pipelines. By absorbing the complexity into offline mathematics, these techniques allow the hardware to run mostly uniform, low-precision datatypes.</span></p>
<p><span style="font-weight: 400;">However, much like the routing tax in sparsity, this algorithmic offloading comes with some compromises. These methods heavily rely on static, offline calibration datasets. If a model encounters out-of-distribution data in production (a different language, an unusual coding syntax, or an unexpected prompt structure), the statically determined scaling factors can fail, leading to outlier clipping and catastrophic accuracy collapse. Furthermore, relying on offline preprocessing creates a rigid deployment pipeline that prevents the model from adapting to extreme activation spikes on the fly.</span></p>
<h2><b>Is there a &#8220;best of both worlds&#8221;?</b></h2>
<p><span style="font-weight: 400;">So, knowing these trade-offs, do we sparsify or do we quantize? Many years ago, the </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/1510.00149"><span style="font-weight: 400;">Deep Compression</span></a><span style="font-weight: 400;"> paper proved we could do both. But today, pulling this off at the scale of a 70-billion parameter LLM is incredibly difficult. It suffers from the classic hardware optimization catch-22 (see </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.sigarch.org/dont-put-all-your-tensors-in-one-basket-hardware-lottery/"><span style="font-weight: 400;">All in on Matmul?</span></a><span style="font-weight: 400;">) : </span><i><span style="font-weight: 400;">No one uses a new piece of hardware because it’s not supported by software, and it’s not supported by software because no one’s using it.</span></i></p>
<p><span style="font-weight: 400;">So what&#8217;s the path forward for hardware architects? In my opinion, the following:</span></p>
<ul>
<li style="font-weight: 400;"><b>Deep Hardware-Software Co-design:</b><span style="font-weight: 400;"> The days of throwing a generic matrix-multiplication engine at a model are over. We need to work directly with AI researchers so that when they design a new pruning threshold or a novel sub-byte data type, the hardware already has a streamlined, fast path for the metadata.</span></li>
<li style="font-weight: 400;"><b>Generalized Compression Abstractions:</b><span style="font-weight: 400;"> Historically, we have designed accelerators that are either &#8220;good at sparsity&#8221; (with complex routing networks) or &#8220;good at quantization&#8221; (with mixed-precision MACs). Moving forward, we need to view these not as orthogonal features, but as a unified spectrum of compression. Architectures must be designed to dynamically adapt—perhaps fluidly dropping structurally sparse blocks during a memory-bound decode phase, while leaning on extreme sub-byte quantization during a compute-heavy prefill phase—potentially even sharing the same underlying logic.</span></li>
<li style="font-weight: 400;"><b>Balance Efficiency and Programmability:</b><span style="font-weight: 400;"> As explored in the &#8220;All in on MatMul?&#8221; post, we need to keep our hardware flexible. Over-fitting to today&#8217;s specific sparsity pattern or quantization trick risks building being trapped in the local minimum. We must maintain enough programmability to enable future algorithm discovery and break free from the catch-22.</span></li>
</ul>
<p><span style="font-weight: 400;">Some notable research going along this path include </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/pdf/2405.20935"><span style="font-weight: 400;">Effective interplay between sparsity and quantization</span></a><span style="font-weight: 400;">, which proves the non-orthogonality of the two techniques and explores the interplay between them and also the </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.cs.toronto.edu/~mmozaffari/compression-trinity/index.html"><span style="font-weight: 400;">Compression Trinity</span></a><span style="font-weight: 400;"> work which takes a look at multiple techniques across sparsity, quantization and low rank approximation and tries to take a holistic view of the optimization space across the stack.</span></p>
<p><span style="font-weight: 400;">Ultimately, as alluded to before, there is no single silver bullet, and like all open architecture problems, the answer is always &#8220;it depends&#8221;.  But in the era of Generative AI, it depends on whether we view sparsity and quantization as competing alternatives or as pieces of the same puzzle. Perhaps it’s time we stop asking which one is better, and start designing architectures flexible enough to embrace the realities of both.</span></p>
<h3><b>About the Author:</b></h3>
<p><span style="font-weight: 400;">Sai Srivatsa Bhamidipati is a Senior Silicon Architect at Google working on the Google Tensor TPU in the Pixel phones. His primary focus is on efficient and scalable compute for Generative AI on the Tensor TPU.</span></p>
<h3><b>Authors’ Disclaimer:</b></h3>
<p><span style="font-weight: 400;">Portions of this post were edited with the assistance of AI models. Some references, notes and images were also compiled using AI tools. The content represents the opinions of the authors and does not necessarily represent the views, policies, or positions of Google or its affiliates.</span></p>
<p class="disclaim"><strong>Disclaimer:</strong> <em>These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.</em></p><Img align="left" border="0" height="1" width="1" alt="" style="border:0;float:left;margin:0;padding:0;width:1px!important;height:1px!important;" hspace="0" src="https://feeds.feedblitz.com/~/i/950020673/0/sigarch-cat">
]]>
</content:encoded>
			<wfw:commentRss>https://feeds.feedblitz.com/~/950020673/0/sigarch-cat~To-Sparsify-or-To-Quantize-A-Hardware-Architecture-View/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	<post-id xmlns="com-wordpress:feed-additions:1">100754</post-id></item>
<item>
<feedburner:origLink>https://www.sigarch.org/from-the-editors-desk-2026-edition/</feedburner:origLink>
		<title>From the Editor&#8217;s Desk &#8211; 2026 Edition</title>
		<link>https://feeds.feedblitz.com/~/944524166/0/sigarch-cat~From-the-Editors-Desk-Edition/</link>
		<comments>https://feeds.feedblitz.com/~/944524166/0/sigarch-cat~From-the-Editors-Desk-Edition/#respond</comments>
		<pubDate>Tue, 03 Feb 2026 20:19:39 +0000</pubDate>
		<dc:creator><![CDATA[Dmitry Ponomarev]]></dc:creator>
		<category><![CDATA[ACM SIGARCH]]></category>
		<category><![CDATA[Blog]]></category>
		<category><![CDATA[Editorial]]></category>
		<guid isPermaLink="false">https://www.sigarch.org/?p=96844</guid>
		<description><![CDATA[<div><img width="300" xheight="169" src="https://www.sigarch.org/wp-content/uploads/2026/02/AdobeStock_862939397-300x169.jpeg" class="attachment-medium size-medium wp-post-image" alt="" style="max-width:100% !important;height:auto !important;margin-bottom:15px;margin-left:15px;float:right;"  loading="lazy" /></div>As we close the book on 2025, Computer Architecture Today has seen another successful year of community engagement. We published 29 posts covering a wide spectrum of topics—from datacenter energy-efficiency to the evolving debate on LLMs in peer review, alongside trip reports from our major conferences. I want to thank all our authors for their [&#8230;]]]>
</description>
				<content:encoded><![CDATA[<div><img width="300" height="169" src="https://www.sigarch.org/wp-content/uploads/2026/02/AdobeStock_862939397-300x169.jpeg" class="attachment-medium size-medium wp-post-image" alt="" style="margin-bottom:15px;margin-left:15px;float:right;" decoding="async" loading="lazy" /></div><p>As we close the book on 2025, <i>Computer Architecture Today</i> has seen another successful year of community engagement. We published 29 posts covering a wide spectrum of topics—from datacenter energy-efficiency to the evolving debate on LLMs in peer review, alongside trip reports from our major conferences. I want to thank all our authors for their insights, with special appreciation for those who contributed multiple times.</p>
<p>Over the last year, we shifted our editorial model, moving from a roster of set contributors to a more flexible, open-submission approach. We also re-established our conference trip reports, highlighting top architecture venues.</p>
<p data-path-to-node="14,2">The blog thrives on new voices, and our door is always open. We are actively looking for:</p>
<ul data-path-to-node="14,3">
<li>
<p data-path-to-node="14,3,0,0"><b data-path-to-node="14,3,0,0" data-index-in-node="0">New Ideas:</b> If you have a topic in mind, please propose it using <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.sigarch.org/contribute/propose-a-blog-post-topic/">this link</a> or email me directly.</p>
</li>
<li>
<p data-path-to-node="14,3,1,0"><b data-path-to-node="14,3,1,0" data-index-in-node="0">Trip Reports:</b> Planning to attend a conference? Volunteer to share your experience.</p>
</li>
<li>
<p data-path-to-node="14,3,2,0"><b data-path-to-node="14,3,2,0" data-index-in-node="0">Event Summaries:</b> Organizers of workshops or tutorials are welcome to publicize their events through summary posts.</p>
</li>
<li>
<p data-path-to-node="14,3,3,0"><b data-path-to-node="14,3,3,0" data-index-in-node="0">Industry Perspectives:</b> We would like to hear from our industry colleagues about their take on the future landscape of computer architecture.</p>
</li>
</ul>
<p data-path-to-node="14,4">Finally, as AI tools proliferate, the conversation around their role in our paper reviewing process is far from over. I look forward to seeing more of that debate here.</p>
<p data-path-to-node="14,4">Here’s to the new advances in Computer Architecture in 2026!</p>
<p class="disclaim"><strong>Disclaimer:</strong> <em>These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.</em></p><Img align="left" border="0" height="1" width="1" alt="" style="border:0;float:left;margin:0;padding:0;width:1px!important;height:1px!important;" hspace="0" src="https://feeds.feedblitz.com/~/i/944524166/0/sigarch-cat">
]]>
</content:encoded>
			<wfw:commentRss>https://feeds.feedblitz.com/~/944524166/0/sigarch-cat~From-the-Editors-Desk-Edition/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	<post-id xmlns="com-wordpress:feed-additions:1">96844</post-id></item>
<item>
<feedburner:origLink>https://www.sigarch.org/multi-agent-memory-from-a-computer-architecture-perspective-visions-and-challenges-ahead/</feedburner:origLink>
		<title>Multi-Agent Memory from a Computer Architecture Perspective: Visions and Challenges Ahead</title>
		<link>https://feeds.feedblitz.com/~/940946942/0/sigarch-cat~MultiAgent-Memory-from-a-Computer-Architecture-Perspective-Visions-and-Challenges-Ahead/</link>
		<comments>https://feeds.feedblitz.com/~/940946942/0/sigarch-cat~MultiAgent-Memory-from-a-Computer-Architecture-Perspective-Visions-and-Challenges-Ahead/#respond</comments>
		<pubDate>Tue, 20 Jan 2026 15:19:11 +0000</pubDate>
		<dc:creator><![CDATA[Zhongming Yu and Jishen Zhao]]></dc:creator>
		<category><![CDATA[ACM SIGARCH]]></category>
		<category><![CDATA[Agents]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[Memory Consistency]]></category>
		<category><![CDATA[Memory Hierarchy]]></category>
		<guid isPermaLink="false">https://www.sigarch.org/?p=97929</guid>
		<description><![CDATA[<div><img width="300" xheight="200" src="https://www.sigarch.org/wp-content/uploads/2026/01/title-300x200.png" class="attachment-medium size-medium wp-post-image" alt="" style="max-width:100% !important;height:auto !important;margin-bottom:15px;margin-left:15px;float:right;"  loading="lazy" /></div>Large language model (LLM) agents are quickly moving from “single agent” to *multi-agent systems*: tool-using agents, planner-orchestrator, debate teams, specialized sub-agents that collaborate to solve tasks. At the same time, the *context* these agents must operate within is becoming more complex: longer histories, multiple modalities, structured traces, and customized environments. This combination creates a bottleneck [&#8230;]]]>
</description>
				<content:encoded><![CDATA[<div><img width="300" height="200" src="https://www.sigarch.org/wp-content/uploads/2026/01/title-300x200.png" class="attachment-medium size-medium wp-post-image" alt="" style="margin-bottom:15px;margin-left:15px;float:right;" decoding="async" loading="lazy" /></div><p>Large language model (LLM) agents are quickly moving from “single agent” to <em>multi-agent systems</em>: tool-using agents, planner-orchestrator, debate teams, specialized sub-agents that collaborate to solve tasks. At the same time, the <em>context</em> these agents must operate within is becoming more complex: longer histories, multiple modalities, structured traces, and customized environments. This combination creates a bottleneck that looks surprisingly familiar to computer architects: memory.</p>
<p>In computer systems, performance and scalability are often limited not by compute, but by memory hierarchy, bandwidth, and consistency. Multi-agent systems are heading toward the same wall — except their “memory” is not raw bytes, but semantic context used for reasoning. Having spent the past two years building various LLM multi-agent frameworks (e.g., <strong><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://github.com/fishmingyu/OrcaLoca">OrcaLoca</a> </strong>for software issue localization, <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://stable-lab.github.io/MAGE/"><strong>MAGE</strong></a> for RTL design, <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://github.com/stable-lab/Pro-V"><strong>Pro-V</strong></a> for RTL verification, and <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://pettingllms-ai.github.io/"><strong>PettingLLMs</strong> </a>enabling RL training on multiple LLM agents), we would like to share the insights we have gained, through the lens of a computer architect. This blog frames multi-agent memory as a <strong>computer architecture problem</strong>, proposes a simple architecture-inspired model, and highlights the key challenges and protocol gaps that define the road ahead.</p>
<p>While our perspectives are still preliminary and evolving, we hope they serve as a starting point to ignite a broader conversation.</p>
<hr />
<h2>Multi-Agent Memory Systems in Growing Complex Contexts</h2>
<h3>Why memory matters: Context is changing</h3>
<ul>
<li><strong>Longer context windows:</strong> Long-context evaluation suites like <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/2404.06654"><strong>RULER</strong></a> and <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://longbench2.github.io/"><strong>LongBench</strong></a> show that &#8220;real&#8221; long-context ability involves more than simple retrieval — it includes multi-hop tracing, aggregation, and sustained reasoning as length scales.</li>
<li><strong>Multi-modal inputs:</strong> Benchmarks such as <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://mmmu-benchmark.github.io/"><strong>MMMU</strong></a> (static images: charts, diagrams, tables) and <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://video-mme.github.io/"><strong>VideoMME</strong></a> (videos with audio and subtitles) demonstrate that models must handle diverse visual modalities alongside text, extending beyond single-modality processing.</li>
<li><strong>Structured data &amp; traces:</strong> Text-to-SQL benchmarks (e.g., <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://spider2-sql.github.io/"><strong>Spider</strong></a>, <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://bird-bench.github.io/"><strong>BIRD</strong></a>) highlight that agents increasingly operate over structured, executable data — database schemas and generated SQL queries — rather than only raw chat history.</li>
<li><strong>Customized environments:</strong> In <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.swebench.com/SWE-bench/guides/evaluation/"><strong>SWE-bench</strong></a> and <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://multi-swe-bench.github.io/#/"><strong>Multi-SWE-bench</strong></a>, models are evaluated by applying patches to real repositories and running tests in containerized (Docker) environments, making &#8220;environment state + execution&#8221; part of the memory problem. Similarly, <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://webarena.dev/"><strong>WebArena</strong></a> and <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://os-world.github.io/"><strong>OSWorld</strong></a> provide realistic, reproducible interactive environments that stress long-horizon state tracking and grounded actions.</li>
</ul>
<p><strong>Bottom line:</strong> Context is no longer a static prompt — it&#8217;s a dynamic, multi-format, partially persistent memory system.</p>
<p><img loading="lazy" decoding="async" class="alignnone size-full wp-image-97940" src="https://www.sigarch.org/wp-content/uploads/2026/01/motivation.jpg" alt="" width="11766" height="6266" /></p>
<hr />
<h2>Basic Prototypes: Shared vs. Distributed Agent Memory</h2>
<p>Before we talk about “hierarchies,” it helps to name the two simplest prototypes, which mirror classical memory systems.</p>
<h3>1) Shared Memory</h3>
<p>All agents access a shared memory pool (e.g., a shared vector store, shared document database).</p>
<ul>
<li><strong>Pros:</strong> Easy to share knowledge; fast reuse.</li>
<li><strong>Cons:</strong> Requires <strong>coherence support</strong>. Without coordination, agents overwrite each other, read stale info, or rely on inconsistent versions of shared facts.</li>
</ul>
<h3>2) Distributed Memory</h3>
<p>Each agent owns local memory (local scratchpad, local cache, local long-term store) and shares via synchronization.</p>
<ul>
<li><strong>Pros:</strong> Isolation by default; more scalable; fewer contention issues.</li>
<li><strong>Cons:</strong> Needs explicit <strong>synchronization</strong>; state divergence becomes common unless carefully managed.</li>
</ul>
<p>Most real systems sit somewhere in between: local working memory plus selectively shared artifacts.</p>
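<p>To make the two prototypes concrete, here is a minimal Python sketch: a shared pool with a naive version check, and per-agent local stores with explicit synchronization. The class and method names are ours for illustration, not the API of any existing framework (real systems would sit on a vector store or document database).</p>
<pre><code># Minimal sketch of the two prototypes; all names are illustrative.

class SharedMemoryPool:
    """One pool that every agent reads and writes (shared-memory prototype)."""
    def __init__(self):
        self.records = {}  # key: (value, version)

    def read(self, key):
        return self.records.get(key)  # (value, version) or None

    def write(self, key, value, expected_version=None):
        # Naive coherence check: reject writes made against a stale version.
        current = self.records.get(key)
        if expected_version is not None and current and current[1] != expected_version:
            raise RuntimeError("stale write to %r: re-read before writing" % key)
        version = current[1] + 1 if current else 1
        self.records[key] = (value, version)
        return version


class AgentLocalMemory:
    """Per-agent memory plus explicit sync (distributed-memory prototype)."""
    def __init__(self, name):
        self.name = name
        self.local = {}

    def write(self, key, value):
        self.local[key] = value

    def sync_to(self, peer, keys):
        # Explicit synchronization: push selected artifacts to a peer agent.
        for key in keys:
            peer.local[key] = self.local[key]
</code></pre>
<p>Even this toy version makes the trade-off visible: the shared pool needs the version check to avoid stale or conflicting writes, while the local stores push the burden onto explicit <code>sync_to</code> calls and accept divergence in between.</p>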
<hr />
<h2>An Agent Memory Architecture Inspired by Modern Computer Architecture Design</h2>
<p>Computer architecture teaches a practical lesson: you don’t build “one memory.” You build a <strong>memory hierarchy</strong> with different layers optimized for latency, bandwidth, capacity, and persistence.</p>
<p>A useful mapping for agents is the following:</p>
<h3>Agent I/O Layer</h3>
<p><strong>What it is:</strong> Interfaces that ingest and emit information.</p>
<ul>
<li>Audio/speech</li>
<li>Text documents</li>
<li>Images</li>
<li>Network calls/web data</li>
</ul>
<p><strong>Analogy:</strong> Devices and I/O subsystems feeding the CPU.</p>
<h3>Agent Cache Layer</h3>
<p><strong>What it is:</strong> Fast, limited-capacity memory optimized for immediate reasoning.</p>
<ul>
<li>Compressed context</li>
<li>Recent trajectories and tool calls</li>
<li>Short-term latent storage (e.g., KV cache, embeddings of recent steps)</li>
</ul>
<p><strong>Analogy:</strong> CPU caches (L1/L2/L3): small, fast, and constantly refreshed.</p>
<h3>Agent Memory Layer</h3>
<p><strong>What it is:</strong> Large-capacity, slower memory optimized for retrieval and persistence.</p>
<ul>
<li>Full dialogue history</li>
<li>External knowledge databases (vector DBs, graph DBs, document stores)</li>
<li>Long-term latent storage</li>
</ul>
<p><strong>Analogy:</strong> Main memory + storage hierarchy.</p>
<p>This framing emphasizes a key principle: <strong>Agent performance is an end-to-end data movement problem</strong>. Even if the model is powerful, if relevant information is stuck in the wrong layer (or never loaded), reasoning accuracy and efficiency degrade.</p>
<p>And just as in hardware, caching is not optional: like computer memory hierarchies, agent memory benefits from explicit I/O and caching layers to improve efficiency and scalability.</p>
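<p>Here is a minimal sketch of how the cache and memory layers compose; the interfaces are assumptions for illustration rather than a specific framework. A lookup first probes the small cache layer and only falls back to the large memory layer on a miss, promoting the result so repeated reasoning steps stay fast.</p>
<pre><code>from collections import OrderedDict

# Illustrative layers following the mapping above; not a real framework API.

class AgentCacheLayer:
    """Fast, limited-capacity context: recent steps, compressed summaries."""
    def __init__(self, capacity=8):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)  # refresh recency
            return self.entries[key]
        return None

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least-recently used


class AgentMemoryLayer:
    """Large, slower store: full history, external knowledge bases."""
    def __init__(self):
        self.store = {}

    def load(self, key):
        return self.store.get(key)

    def save(self, key, value):
        self.store[key] = value


def fetch_context(key, cache, memory):
    # End-to-end data movement: hit the cache first, otherwise pull from the
    # memory layer and promote the result into the cache for reuse.
    value = cache.get(key)
    if value is None:
        value = memory.load(key)
        if value is not None:
            cache.put(key, value)
    return value
</code></pre>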
<p><img loading="lazy" decoding="async" class="alignnone size-full wp-image-97939" src="https://www.sigarch.org/wp-content/uploads/2026/01/Memprotocol.jpg" alt="" width="22116" height="14550" /></p>
<hr />
<h2>Protocol Extensions for Multi-Agent Scenarios</h2>
<p>Architecture layers need <em>protocols</em>. In multi-agent settings, protocols determine what can be shared, how fast, and under what rules.</p>
<p>Today, many agent frameworks rely on <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://blog.modelcontextprotocol.io/"><strong>MCP</strong> (Model Context Protocol)</a> as a connectivity layer. Agents registered via MCP can connect and communicate, but inter-agent bandwidth remains limited by message-passing. MCP largely uses JSON-RPC, so it’s best viewed as a protocol for <strong>agent context I/O</strong>: request/response, tool invocation, and structured messages.</p>
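<p>For a rough sense of what this context I/O looks like, here is a JSON-RPC-style tool call of the kind MCP carries, written out as a Python dictionary. The tool name and arguments are hypothetical, and the payload is simplified for illustration rather than copied from the MCP specification; the point is that every inter-agent exchange is a serialized request/response message.</p>
<pre><code>import json

# Simplified JSON-RPC 2.0-shaped request; fields are illustrative, not a
# verbatim MCP message.
request = {
    "jsonrpc": "2.0",
    "id": 42,
    "method": "tools/call",
    "params": {
        "name": "search_codebase",  # hypothetical tool
        "arguments": {"query": "null pointer in parser"},
    },
}
print(json.dumps(request, indent=2))
</code></pre>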
<p>That’s necessary — but not sufficient.</p>
<h3>Missing Piece 1: Agent Cache Sharing Protocol</h3>
<p>Many recent studies, such as <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/2411.02820"><strong>DroidSpeak</strong></a> and <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/2510.03215"><strong>Cache-to-Cache</strong></a>, have explored KV cache sharing between LLMs. However, we still lack a principled and unified protocol for sharing <em>cached artifacts</em> across agents.</p>
<p><strong>Goal:</strong> Enable one agent’s cached results to be transformed and reused by other agents.</p>
<p>In architecture terms, this is like enabling cache transfers or shared cache behavior — except the payload is semantic and may require transformation before reuse.</p>
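<p>As a sketch of what such a protocol might expose, consider cached artifacts that carry provenance (which model produced them, in what representation), where reuse either happens directly or goes through an adapter. Everything below (class names, fields, the adapter registry) is an assumption for illustration, not an existing API.</p>
<pre><code># Sketch of a cache-sharing interface; names and fields are illustrative.

class CachedArtifact:
    def __init__(self, payload, producer_model, representation):
        self.payload = payload                # e.g., KV tensors, embeddings, a summary
        self.producer_model = producer_model  # provenance: which model produced it
        self.representation = representation  # "kv_cache", "embedding", "text_summary", ...


def share_cache_entry(artifact, consumer_model, adapters):
    """Hand a cached artifact to another agent, transforming it if needed."""
    if artifact.producer_model == consumer_model:
        return artifact.payload  # directly reusable, like a cache-to-cache transfer
    adapter = adapters.get((artifact.producer_model, consumer_model, artifact.representation))
    if adapter is None:
        raise ValueError("no adapter available; fall back to re-encoding from text")
    return adapter(artifact.payload)  # e.g., re-project KV states or re-embed
</code></pre>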
<h3>Missing Piece 2: Agent Memory Access Protocol</h3>
<p>Although frameworks like <strong><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://docs.letta.com/">Letta</a></strong> and <strong><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://mem0.ai/">Mem0</a></strong> support shared state within agent memory, a protocol that defines how agents read and write each other’s memory is still missing.</p>
<p><strong>Goal:</strong> Define memory access semantics: permissions, scope, and granularity.</p>
<p>Key questions:</p>
<ul>
<li>Can Agent B read Agent A’s long-term memory, or only shared memory?</li>
<li>Is access read-only, append-only, or read-write?</li>
<li>What is the unit of access: a document, a chunk, a key-value record, a “thought,” a trace segment?</li>
<li>Can we support “agent RDMA”-like patterns: low-latency direct access to remote memory without expensive message-level serialization?</li>
</ul>
<p>Without a memory access protocol, inter-agent collaboration is forced into slow, high-level message passing, which wastes bandwidth and loses structure.</p>
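<p>One way to make these questions concrete is to encode them as data: a lease that names the owner, the grantee, the scope, the permission, and the unit of access. The sketch below shows what such semantics could look like; the permission names and granularities are assumptions, not a proposed standard.</p>
<pre><code># Sketch of memory-access semantics; all names are illustrative.

PERMISSIONS = ("read_only", "append_only", "read_write")
GRANULARITIES = ("document", "chunk", "kv_record", "thought", "trace_segment")

class MemoryLease:
    """A grant from one agent to another: what may be touched, and how."""
    def __init__(self, owner, grantee, scope, permission, granularity):
        assert permission in PERMISSIONS and granularity in GRANULARITIES
        self.owner = owner              # e.g., "agent_A"
        self.grantee = grantee          # e.g., "agent_B"
        self.scope = scope              # e.g., "shared" or "long_term"
        self.permission = permission
        self.granularity = granularity

def check_access(lease, requester, region, op):
    """Return True if the requested operation is allowed under the lease."""
    if requester != lease.grantee or region != lease.scope:
        return False
    if op == "read":
        return True
    if op == "append":
        return lease.permission in ("append_only", "read_write")
    return lease.permission == "read_write"  # op == "write"
</code></pre>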
<hr />
<h2>The Next Frontier: Multi-Agent Memory Consistency</h2>
<p>The largest conceptual gap is <strong>consistency</strong>. The goal of <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://doi.org/10.1109/2.546611"><strong>memory consistency</strong></a> in computer architecture and systems design is to define constraints on the order of reads and writes to memory addresses. Consistency models (e.g., sequential consistency, TSO, and release consistency) clarify what behaviors programmers can rely on.</p>
<p>For agent memory, the goal shifts: It’s not about bytes at an address, but about maintaining a <strong>coherent semantic context</strong> that supports correct reasoning and coordination.</p>
<p><img loading="lazy" decoding="async" class="alignnone size-full wp-image-97941" src="https://www.sigarch.org/wp-content/uploads/2026/01/memory-consistency-comparison.jpg" alt="" width="20184" height="8150" /></p>
<h3>Why Agent Consistency Is Harder</h3>
<ul>
<li>The “state” is not a scalar value; it’s a <em>plan</em>, a <em>summary</em>, a <em>retrieval result</em>, a <em>tool trace</em>.</li>
<li>Writes are not deterministic; they may be speculative or wrong.</li>
<li>Conflicts aren’t simple write-write conflicts — they&#8217;re semantic contradictions.</li>
<li>Freshness depends on the environment state (repo version, API results, and permissions).</li>
</ul>
<h3>What a Multi-Agent Memory Consistency Layer Might Need</h3>
<p>A practical direction is to define consistency around the <em>artifacts agents actually share</em> — cached evidence, tool traces, plans, and long-term records — across both <strong>shared</strong> and <strong>distributed</strong> memory setups (often a hybrid: local caches + shared store). The layer should expose a <strong>consistency model</strong> (e.g., session, causal, eventual semantic, and stronger guarantees for “committed” outputs), provide richer <strong>communication primitives</strong> than plain message passing, and include <strong>conflict-resolution policies</strong> (source ranking, timestamps, consensus, and optional human intervention for high-stakes conflicts).</p>
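<p>To illustrate just the conflict-resolution piece, the sketch below orders two conflicting semantic records by commitment status, then source rank, then freshness, and escalates ties to consensus or a human. The record fields, ranking, and policy are invented for illustration; they are one possible policy, not a proposed model.</p>
<pre><code># Sketch of one conflict-resolution policy over shared semantic records.

SOURCE_RANK = {"executed_test": 3, "tool_output": 2, "model_claim": 1}

class SemanticRecord:
    def __init__(self, key, value, source, timestamp, committed=False):
        self.key = key
        self.value = value
        self.source = source        # where this claim came from
        self.timestamp = timestamp  # freshness, e.g., time.time()
        self.committed = committed  # "committed" outputs get stronger guarantees

def resolve(a, b):
    """Pick a winner between two conflicting records, or return None to escalate."""
    if a.committed != b.committed:  # committed outputs win outright
        return a if a.committed else b
    ra = SOURCE_RANK.get(a.source, 0)
    rb = SOURCE_RANK.get(b.source, 0)
    if ra != rb:                    # source ranking
        return a if ra > rb else b
    if a.timestamp != b.timestamp:  # freshness
        return a if a.timestamp > b.timestamp else b
    return None                     # escalate: consensus round or human review
</code></pre>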
<p>Research on this is still rare, but it is likely to become foundational — much like coherence and consistency were for multiprocessors.</p>
<h2>Conclusion</h2>
<p>Many agent memory systems today resemble <strong>human memory</strong> — informal, redundant, and hard to control — leaving a large opportunity for computer architecture researchers to rethink what “memory” should mean for agents <strong>at scale</strong>. To move from ad-hoc prompting to reliable multi-agent systems, we need <strong>better memory hierarchies</strong>, <strong>explicit protocols</strong> for cache sharing and memory access, and <strong>principled consistency models</strong> that keep shared context coherent.</p>
<h2>Acknowledgement</h2>
<p>We sincerely thank Wentao Ni, Hejia Zhang, Mingrui Yin, Jiaying Yang, and Yujie Zhao for their invaluable contributions through brainstorming, discussions, data collection, and survey work over the past few months. This article would not have been possible without their dedicated efforts.</p>
<p><b>About the authors:</b></p>
<p><i>Zhongming Yu is a PhD student in the </i><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://cse.ucsd.edu/"><i>Computer Science and Engineering Department</i></a><i> at</i><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~www.ucsd.edu/"><i> University of California, San Diego</i></a><i>. His research interests are in combining machine learning and computer systems, with a special focus on LLM agent systems for machine learning systems, evolving ML and systems, and autonomous software engineering. </i></p>
<p><i>Jishen Zhao is a Professor in the</i><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://cse.ucsd.edu/"><i> Computer Science and Engineering Department</i></a><i> at</i><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~www.ucsd.edu/"><i> University of California, San Diego</i></a><i>. Her research spans and stretches the boundary across computer architecture, system software, and machine learning, with an emphasis on memory systems, machine learning and systems codesign, and system support for smart applications.</i></p>
<p class="disclaim"><strong>Disclaimer:</strong> <em>These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.</em></p><Img align="left" border="0" height="1" width="1" alt="" style="border:0;float:left;margin:0;padding:0;width:1px!important;height:1px!important;" hspace="0" src="https://feeds.feedblitz.com/~/i/940946942/0/sigarch-cat">
]]>
</content:encoded>
			<wfw:commentRss>https://feeds.feedblitz.com/~/940946942/0/sigarch-cat~MultiAgent-Memory-from-a-Computer-Architecture-Perspective-Visions-and-Challenges-Ahead/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	<post-id xmlns="com-wordpress:feed-additions:1">97929</post-id></item>
<item>
<feedburner:origLink>https://www.sigarch.org/pipeorgan-modeling-memory-bandwidth-bound-executions-for-ai-and-beyond/</feedburner:origLink>
		<title>PipeOrgan: Modeling Memory-Bandwidth-Bound Executions for AI and Beyond</title>
		<link>https://feeds.feedblitz.com/~/940049756/0/sigarch-cat~PipeOrgan-Modeling-MemoryBandwidthBound-Executions-for-AI-and-Beyond/</link>
		<comments>https://feeds.feedblitz.com/~/940049756/0/sigarch-cat~PipeOrgan-Modeling-MemoryBandwidthBound-Executions-for-AI-and-Beyond/#respond</comments>
		<pubDate>Mon, 12 Jan 2026 15:00:20 +0000</pubDate>
		<dc:creator><![CDATA[Mark D. Hill]]></dc:creator>
		<category><![CDATA[ACM SIGARCH]]></category>
		<category><![CDATA[Accelerators]]></category>
		<category><![CDATA[Memory]]></category>
		<category><![CDATA[Modelling]]></category>
		<guid isPermaLink="false">https://www.sigarch.org/?p=97568</guid>
		<description><![CDATA[<div><img width="300" xheight="200" src="https://www.sigarch.org/wp-content/uploads/2026/01/SIGARCH_PipeOrgan_via_ChatGPT_2026_01_05-300x200.png" class="attachment-medium size-medium wp-post-image" alt="" style="max-width:100% !important;height:auto !important;margin-bottom:15px;margin-left:15px;float:right;"  loading="lazy" /></div>TL;DR: Latency-tolerant architectures, e.g., GPUs, increasingly use memory/storage hierarchies, e.g., for KV Caches to speed Large-Language Model AI inference. To aid codesign of such workloads and architectures, we develop the simple PipeOrgan analytic model for bandwidth-bound workloads running on memory/storage hierarchies.  Background For three reasons, memory bandwidth, more than latency, limits AI inference performance. First, [&#8230;]]]>
</description>
				<content:encoded><![CDATA[<div><img width="300" height="200" src="https://www.sigarch.org/wp-content/uploads/2026/01/SIGARCH_PipeOrgan_via_ChatGPT_2026_01_05-300x200.png" class="attachment-medium size-medium wp-post-image" alt="" style="margin-bottom:15px;margin-left:15px;float:right;" decoding="async" loading="lazy" /></div><p><i><span style="font-weight: 400;">TL;DR: Latency-tolerant architectures, e.g., GPUs, increasingly use memory/storage hierarchies, e.g., for KV Caches to speed Large-Language Model AI inference. To aid codesign of such workloads and architectures, we develop the simple PipeOrgan analytic model for bandwidth-bound workloads running on memory/storage hierarchies. </span></i></p>
<h3><b>Background</b></h3>
<p><span style="font-weight: 400;">For three reasons, memory bandwidth, more than latency, limits AI inference performance. First, AI inference uses latency-tolerant compute engines, such as GPUs. Second, it principally uses hardware memory hierarchies to store a data structure called a <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://huggingface.co/blog/not-lain/kv-caching">Key-Value (KV) Cache</a> that holds information from recent queries to reduce redundant computation. With <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/2309.06180">PagedAttention</a>, each KV Cache fetch obtains one or more multi-megabyte blocks (often called pages) that require substantial bandwidth to complete. Third, inference&#8217;s “decode” phase is memory-bound due to low arithmetic intensity, putting great pressure on memory bandwidth.</span></p>
<p><span style="font-weight: 400;">Traditional CPU memory/storage hierarchies are shaped by increasing latency, but designing hierarchies for AI workloads requires focusing on decreasing bandwidth. Since AI software is flexible, codesigning software and hardware is essential. </span></p>
<p><span style="font-weight: 400;">To provide intuition and first answer to the above questions, we next contribute the simple <em>PipeOrgan</em> analytic model for optimizing bandwidth-bound workloads running on a memory hierarchy with many parallel <em>pipes</em> from memories to compute. The PipeOrgan model shows that husbanding and providing bandwidth is important for AI software and hardware. Analytic models have long provided computing intuition, e.g., <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://en.wikipedia.org/wiki/Amdahl%27s_law">Amdahl&#8217;s Law</a>, <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://en.wikipedia.org/wiki/Iron_law_of_processor_performance">Iron Law</a>, and <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://en.wikipedia.org/wiki/Roofline_model">Roofline</a>.</span></p>
<p><img loading="lazy" decoding="async" class=" wp-image-97780 aligncenter" src="https://www.sigarch.org/wp-content/uploads/2026/01/figure1.png" alt="" width="448" height="306" /></p>
<p><img loading="lazy" decoding="async" class="wp-image-97787 aligncenter" src="https://www.sigarch.org/wp-content/uploads/2026/01/figure2-1.png" alt="" width="798" height="350" /></p>
<h3><b>Example System with Two Parallel Memories</b></h3>
<p><span style="font-weight: 400;">Let’s start simple. Consider the hardware depicted in Figure 1 with High Bandwidth Memory (HBM) with bandwidth 16 TB/s </span><b>in parallel with</b><span style="font-weight: 400;"> an LPDDR memory with bandwidth 0.5 TB/s. Assume for now that there are no transfers between memories, e.g., to cache. </span></p>
<p><span style="font-weight: 400;">Using the PipeOrgan math from the next section, Figure 2’s blue line shows how system performance changes depending on what percentage of data comes from LPDDR memory. (The orange line comes later when we add caching.) </span>Performance is highest when LPDDR provides exactly 3% of the data <span style="font-weight: 400;">(arrow 1)</span>, which matches its 3% bandwidth <span style="font-weight: 400;">(0.5/(16.0+0.5))</span>. At this point, both LPDDR and HBM memories finish transferring data at the same time, so they act as co-bottlenecks and the system runs at peak efficiency.</p>
<p>When less than 3% of data is from LPDDR (left of the peak), <span style="font-weight: 400;">HBM finishes last and limits performance. When LPDDR sources more than 3% (right of the peak), it is</span> the bottleneck. LPDDR might have to source more data because <span style="font-weight: 400;">HBM&#8217;s limited capacity (currently 48-64 GB per stack) may prevent it from sourcing its full share (97%). If so, </span><span style="font-weight: 400;">performance drops quickly: 4% from LPDDR gives 76% of peak (arrow 2), and 20% yields just 15% (arrow 3).</span></p>
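<p>The numbers behind arrows 1, 2, and 3 can be reproduced in a few lines. The per-memory finish times below anticipate the PipeOrgan math derived in the next section; the bandwidths are the ones assumed above.</p>
<pre><code># Reproduce Figure 2's blue curve: two memories feeding compute in parallel.
hbm_bw, lpddr_bw = 16.0, 0.5  # TB/s
total_bw = hbm_bw + lpddr_bw

def relative_performance(frac_from_lpddr, data=1.0):
    # Each memory finishes in d_i / b_i; the slower one sets workload time.
    t_hbm = (1.0 - frac_from_lpddr) * data / hbm_bw
    t_lpddr = frac_from_lpddr * data / lpddr_bw
    peak_time = data / total_bw  # at the peak, both finish together
    return peak_time / max(t_hbm, t_lpddr)

print(round(lpddr_bw / total_bw, 3))         # ~0.03: peak when LPDDR sources 3% (arrow 1)
print(round(relative_performance(0.04), 2))  # 0.76 of peak at 4% from LPDDR (arrow 2)
print(round(relative_performance(0.20), 2))  # 0.15 of peak at 20% from LPDDR (arrow 3)
</code></pre>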
<p><span style="font-weight: 400;">However, future AI systems will feature <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/pdf/2407.00079">multiple memory and storage levels</a>, using HBM, LPDDR, host DDR, pooled DDR, and attached or pooled FLASH storage</span><span style="font-weight: 400;">.</span></p>
<p><img loading="lazy" decoding="async" class=" wp-image-97782 aligncenter" src="https://www.sigarch.org/wp-content/uploads/2026/01/figure3.png" alt="" width="570" height="314" /></p>
<h3><b>PipeOrgan Model of Systems with N Parallel Memories</b></h3>
<p><span style="font-weight: 400;">The above result generalizes to an N-level memory/storage hierarchy with each level feeding compute in parallel. Optimal performance is achieved when all parallel memories complete a workload phase simultaneously, leading to this PipeOrgan principle:</span></p>
<p><b><i>Memory-bandwidth-bound workloads perform best when data is sourced from each memory level in proportion to its bandwidth.</i></b></p>
<p><strong>Proof: </strong></p>
<ol>
<li>Let each memory provide bandwidth b_i TB/s in parallel for total bandwidth B = b_1 + … + b_N.</li>
<li><span style="font-weight: 400;">For a workload, let each source d_i bytes in parallel for total data transferred D = d_1 + … + d_N.</span></li>
<li><span style="font-weight: 400;">By assumption, the workload is limited by data transfer time with compute hidden.</span></li>
<li><span style="font-weight: 400;">Time for each memory to finish its data transfer is d_i/b_i  = TB/(TB/s) = seconds.</span></li>
<li><span style="font-weight: 400;">Workload Time is the maximum of all memories finishing: MAX [d_1/b_1, …, d_N/b_N].</span></li>
<li><span style="font-weight: 400;">Workload Performance = 1/ Time = MIN[b_1/d_1, …, b_N/d_N].</span></li>
<li><span style="font-weight: 400;">Set each d_i = (D/B)*b_i = proportional to its bandwidth b_i.</span></li>
<li><span style="font-weight: 400;">Performance = MIN[b_1/((D/B)*b_1), …, b_N/((D/B)*b_N)].</span></li>
<li><span style="font-weight: 400;">Performance = MIN[(B/D), …, (B/D)] = B/D and Time = 1/Performance = D/B. </span></li>
</ol>
<p><span style="font-weight: 400;">This makes sense: PipeOrgan shows that best performance occurs when one moves all the data using all the bandwidth with no bandwidth idling.</span></p>
<p><img loading="lazy" decoding="async" class="wp-image-97783 aligncenter" src="https://www.sigarch.org/wp-content/uploads/2026/01/figure4.png" alt="" width="555" height="263" /></p>
<h3><b>But Caching Is Critical</b></h3>
<p><span style="font-weight: 400;">The PipeOrgan version above assumes all data goes directly to compute, without transfers among memories. In reality, systems move data from lower- to higher-bandwidth memories, caching it for reuse. For a two-level system (see Figure 4), assume the entire fraction of the workload’s data used from LPDDR is first transferred to HBM for caching (orange arrow). Let the data used from LPDDR be f*D where f ranges from 0 to 1.</span></p>
<ul>
<li><span style="font-weight: 400;">Performance with caching = MIN[(b_1/D)/(f+1), b_2/(f*D)] = MIN[limited by HBM BW, limited by LPDDR BW].</span></li>
</ul>
<p><span style="font-weight: 400;">Figure 2 shows an orange curve for caching that is hidden under the original blue curve when more than 3% of data is sourced from LPDDR. At more than 3% from LPDDR, performance&#8211;without and with caching&#8211;is limited by the time to transfer needed data with the same limited LPDDR bandwidth.</span></p>
<p><span style="font-weight: 400;">While it might look like caching </span><span style="font-weight: 400;">doesn&#8217;t matter, caching is actually important. </span><span style="font-weight: 400;">This is because caching can greatly shift a workload’s x-axis operating point. For example, sourcing 20% of data from LPDDR yields 15% of peak performance (arrow 3). If LPDDR data is cached in HBM and reused five times, then–as the orange dashed arrow shows–only 4% comes from LPDDR and performance gets boosted to 76% of peak—a ~5x improvement (arrow 2).</span></p>
<p><span style="font-weight: 400;">Consequently, caching remains critical. Moreover, PipeOrgan and its N parallel memory principle also applies bandwidth-bound workloads once caching&#8217;s more complex information flows are accounted for.</span></p>
<h3><b>Implications, Limitations and Future Work</b></h3>
<p><span style="font-weight: 400;">Statistician George Box famously said, “</span><i><span style="font-weight: 400;">Essentially, all models are wrong, but some are useful.</span></i><span style="font-weight: 400;">” </span></p>
<p><span style="font-weight: 400;">We conjecture that the PipeOrgan model is useful for AI codesign, especially in the early stages and with software people having less hardware understanding. </span><b>Its key implication is that bandwidth-bound workloads must carefully manage bandwidth from larger, slower memories and storage. </b><span style="font-weight: 400;">While vast data can be stored statically, dynamic use from low-bandwidth memories should remain modest.</span></p>
<p><span style="font-weight: 400;">Three PipeOrgan limitations motivate future work. First, most workloads aren’t bandwidth bound throughout, and PipeOrgan doesn’t address other phases. Modeling these requires more parameters, increasing accuracy but also complexity.</span></p>
<p><span style="font-weight: 400;">Second, the caching model variant only covers two memory levels and always transfers data first to the higher-bandwidth level before use. Future work should extend this to N memory levels and more advanced caching policies. Modeling the many options for caching may be challenging.</span></p>
<p><span style="font-weight: 400;">Third, PipeOrgan may need to be extended for systems that do some processing in or near the memories themselves rather than moving all data to a segregated compute unit.</span></p>
<p><i><span style="font-weight: 400;"><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.cs.princeton.edu/courses/archive/fall13/cos375/Burks.pdf">Burks, Goldstine, &amp; von Neumann, 1946</a>: We are therefore forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible.</span></i></p>
<p><span style="font-weight: 400;">In sum, after eight decades of memory hierarchies focused mostly on latency, we are now at the exciting early stages of codesigning bandwidth-focused memory/storage hierarchies for more flexible AI software.</span></p>
<p><b>About the Author:</b><span style="font-weight: 400;"><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://pages.cs.wisc.edu/~markhill/"> Mark D. Hill</a> is John P. Morgridge Professor and Gene M. Amdahl Professor Emeritus of Computer Sciences at the University of Wisconsin-Madison and consultant to industry. He initiated the PipeOrgan model while consulting for Microsoft and was given permission to release it. He is a fellow of AAAS, ACM, and IEEE, as well as recipient of the 2019 Eckert-Mauchly Award.</span></p>
<p class="disclaim"><strong>Disclaimer:</strong> <em>These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.</em></p><Img align="left" border="0" height="1" width="1" alt="" style="border:0;float:left;margin:0;padding:0;width:1px!important;height:1px!important;" hspace="0" src="https://feeds.feedblitz.com/~/i/940049756/0/sigarch-cat">
]]>
</content:encoded>
			<wfw:commentRss>https://feeds.feedblitz.com/~/940049756/0/sigarch-cat~PipeOrgan-Modeling-MemoryBandwidthBound-Executions-for-AI-and-Beyond/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	<post-id xmlns="com-wordpress:feed-additions:1">97568</post-id></item>
<item>
<feedburner:origLink>https://www.sigarch.org/in-memoriam-remembering-mike-flynn/</feedburner:origLink>
		<title>In Memoriam: Remembering Mike Flynn</title>
		<link>https://feeds.feedblitz.com/~/939763391/0/sigarch-cat~In-Memoriam-Remembering-Mike-Flynn/</link>
		<comments>https://feeds.feedblitz.com/~/939763391/0/sigarch-cat~In-Memoriam-Remembering-Mike-Flynn/#respond</comments>
		<pubDate>Tue, 06 Jan 2026 21:00:08 +0000</pubDate>
		<dc:creator><![CDATA[Ruby B. Lee, Charlie Neuhauser, Timothy M. Pinkston]]></dc:creator>
		<category><![CDATA[ACM SIGARCH]]></category>
		<category><![CDATA[Memoriam]]></category>
		<guid isPermaLink="false">https://www.sigarch.org/?p=97727</guid>
		<description><![CDATA[<div><img width="257" xheight="300" src="https://www.sigarch.org/wp-content/uploads/2026/01/Picture1.jpg" class="attachment-medium size-medium wp-post-image" alt="" style="max-width:100% !important;height:auto !important;margin-bottom:15px;margin-left:15px;float:right;"  loading="lazy" /></div>Michael J. Flynn is a widely respected contributor—indeed a giant—in the field of Computer Architecture.  He made highly significant and impactful contributions throughout his career, both in industry and in academia.  Sadly, he passed away peacefully December 24, 2025, having lived a long and full life. Born May 20, 1934, in New York, NY, Flynn [&#8230;]]]>
</description>
				<content:encoded><![CDATA[<div><img width="257" height="300" src="https://www.sigarch.org/wp-content/uploads/2026/01/Picture1.jpg" class="attachment-medium size-medium wp-post-image" alt="" style="margin-bottom:15px;margin-left:15px;float:right;" decoding="async" loading="lazy" /></div><p style="font-weight: 400;">Michael J. Flynn is a widely respected contributor—indeed a <em>giant</em>—in the field of Computer Architecture.  He made highly significant and impactful contributions throughout his career, both in industry and in academia.  Sadly, he passed away peacefully December 24, 2025, having lived a long and full life.</p>
<p style="font-weight: 400;">Born May 20, 1934, in New York, NY, Flynn earned his Bachelor’s, Master’s, and Ph.D. degrees in Electrical Engineering from Manhattan College (1955), Syracuse University (1960), and Purdue University (1961), respectively, and he received an honorary Doctor of Science degree from the University of Dublin (1998).  After ten years as a design engineer and project manager at IBM (1955-65, in Endicott and Poughkeepsie, NY), he became a member of the faculty at the University of Illinois at Chicago (1965-1966), Northwestern University (1966-1970), and Johns Hopkins University (1970-1975) before joining Stanford University in 1975 as Professor of Electrical Engineering.  He taught internationally, in Ireland, other places in Europe, Singapore, and Japan.</p>
<p style="font-weight: 400;">As a young project manager at IBM, Flynn was responsible for the design of the well-known <em>IBM System 360 (Models 91/92/95 series)</em>, the first computer to implement the sophisticated Tomasulo algorithm, along with many other groundbreaking high-performance architectural techniques.  As the first family of general-purpose computer mainframes that featured <em>architectural compatibility</em> for both commercial and scientific applications, the System 360 is widely recognized as revolutionizing computing during that time—and in many ways persisting even today.  Indeed, many of the high-performance computing techniques developed by Flynn and his IBM colleagues are used throughout the industry today, having migrated from barn-sized mainframes to finger-nail sized microprocessor chips.  Flynn also was the first to shed light on the performance potential and limitations of parallel computers with what’s become known as <em>Flynn’s classification </em>(or <em>Flynn’s taxonomy</em>), a pioneering framework for categorizing parallelism in computer architectures based on the number of simultaneous instruction streams and data streams they handle, e.g., SISD, SIMD, MISD, and MIMD.  His original taxonomy is still used widely today, with various extensions derived from it, to distinguish between different kinds of parallel processor computer systems.</p>
<p style="font-weight: 400;">In 1972, together with some colleagues from IBM, Flynn co-founded Palyn Associates which provided consulting services in the field of high-performance computer architecture and design.  For more than 30 years, he and his colleagues advised nearly every major computer company in Japan, Europe and the United States, including IBM, CDC, Fujitsu, Hitachi, Honeywell Bull, and ICL.  Later, he played a prominent role in Maxeler, products of which made use of advanced dataflow techniques to provide high performance processing for specific applications, such as automated trading.  As a renowned professor at Stanford until his retirement in 1999 and transition to emeritus status, Flynn made seminal contributions to instruction set architecture (ISA), computer arithmetic, advanced floating-point design, multimedia, parallel processors and interconnects, emulation, and performance evaluation, to name a few.  He (co-)authored several textbooks, including <u>Introduction to Arithmetic for Digital Systems Designers</u>, <u>Computer Architecture: Pipelined and Parallel Processor Design</u>, and <u>Advanced Computer Arithmetic Design</u>. An IEEE Fellow, ACM Fellow, and Fellow of the Institution of Engineers of Ireland, Flynn received numerous other honors and awards for his impactful technical contributions, including the ACM/IEEE Eckert-Mauchly Award (1992), IEEE Computer Society’s (CS) Harry Goode Memorial Award and Medal (1995), the Tesla Award and Medal from the International Tesla Society in Belgrade (1998), IEEE CS Charles Babbage Award, IEEE CS Computer Pioneer Award (2015, his acceptance speech video is <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.youtube.com/watch?v=xAhRYUPSZKM">here</a>), and many others.</p>
<p style="font-weight: 400;">Notably, when the field of computer architecture was still in its infancy more than fifty years ago, Flynn founded the IEEE CS Technical Committee on Computer Architecture (TCCA) and ACM’s Special Interest Group on Computer Architecture (SIGARCH); he also started the ACM/IEEE International Symposium on Computer Architecture (ISCA), co-sponsored by both, which is among the most prestigious flagship computer architecture conferences in the world.  At ISCA’s 50<sup>th</sup> anniversary conference at FCRC 2023, Flynn was invited to give a “<a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://u.pcloud.link/publink/show?code=XZro5EVZFqb2N6FIwjuqKDkbaDonqJzVoXzX">50 Year Retrospective Lecture</a>” and was given an honorary plaque with these words inscribed: <em>&#8220;In recognition, with tremendous gratitude, of your lifetime dedication and leadership to the computer architecture community on this the 50<sup>th</sup> anniversary of your founding of ISCA, SIGARCH, and TCCA.&#8221;</em></p>
<p style="font-weight: 400;">Even more than his impressive technical contributions, which are many, Flynn is remembered fondly by the many dozens of doctoral graduate students he advised—for his unending kindness, wealth of wisdom, caring tutelage, gentle encouragement, constant motivation, and enduring support, especially when most needed.  He treated each and every student as if they were a member of his own family, and he was viewed by them not only as their academic “father,” but referred to affectionately as “the Great Man.”  Many of his former mentees returned to Stanford several times each year for luncheons to enjoy his company and reminisce about exciting times working with him in tackling some of the most compelling technical issues of the day.</p>
<p style="font-weight: 400;">Flynn was an equally generous mentor to his junior faculty colleagues, helping them establish their careers and providing sage advice as they made their way.  Kunle Olukotun attests to this: <em>“Meeting Mike Flynn near the end of my Ph.D. at the University of Michigan changed the trajectory of my career.  At the time, I was firmly on a path toward industry, but Mike believed that I could be a strong academic, and he encouraged me to apply to Stanford.  Mike saw something in me that I did not yet see in myself, and that confidence made an enduring difference. Once I arrived at Stanford, Mike served as my mentor. He helped me navigate the academic waters with thoughtful and wise advice, provided opportunities to showcase my research, and supported me through nominations for awards and professional recognition.  I am deeply grateful to Mike for all he did to help establish my career, and for the role he played in the success of so many other junior colleagues whom he mentored with the same generosity and vision.  I am deeply saddened by his passing.”</em>  Similar sentiments are echoed by Bill Dally, who shares the following: <em>“I first met Mike as a graduate student at Stanford in 1980.  I was awed by his accomplishments and his understanding of parallel computing. He kindled my interest in parallel computing which launched me on a very successful career.  Later, when I came to Stanford as a faculty member in 1997, I found Mike to be a great source of advice about Stanford, being a faculty member, research strategy, and many other topics.  I am deeply saddened to hear of Mike&#8217;s passing.  He will be greatly missed.”</em>  Another of his faculty colleagues at Stanford, Christos Kozyrakis, recalls the following: <em>“One of the most memorable moments of my early teaching years was hosting him in class to discuss the Flynn taxonomy of computer architecture—a special experience for both the students and myself and a vivid reminder of the lasting impact of his work.”</em>  Indeed, Mike Flynn was highly respected and revered by fellow colleagues all throughout his professional career.  Solemnly noted by John L. Hennessy, <em>“Mike was the person who hired me at Stanford, gave me some of my first research funding, jointly published an early paper with me, and gave me my first consulting opportunity.  Sadly, his passing marks the end of an important era in computing: Mike was the last of the great System 360 pioneers—Gene Amdahl, Bob Evans, Fred Brooks, Eric Bloch, Gerry Blaauw, and Robert Tomasulo—all are now gone.”</em></p>
<p style="font-weight: 400;">He was a wonderful human being.</p>
<p style="font-weight: 400;">Professor Michael J. Flynn will be sorely missed by his loving family as well as by his extended academic family and all those whose lives he has indelibly touched over his blessed ninety-one plus years.  May he rest blissfully in peace, and may his venerable legacy be inspirational and long lasting.  Fittingly, through Mike Flynn’s final public words to all of us in the computer architecture community in his ISCA 50<sup>th</sup> Anniversary Lecture, he exhorted us all by saying: <em>“Now it’s your turn!”</em></p>
<p style="font-weight: 400;"><em><strong>About the Authors:</strong> </em></p>
<p><strong>Ruby B. Lee</strong> is the Forest G. Hamrick Professor Emeritus in the ECE department at Princeton University, and chief architect at Hewlett-Packard in Silicon valley before that. She is a Fellow of the IEEE, ACM and the American Academy of Arts and Sciences, and recipient of awards such as the most Influential Paper award in 20 years at ISCA 2025 and the Test of Time award at the ACSAC 2024 security conference. Her research combines cyber security, computer architecture and deep learning, including secure processor and cache architectures, attacks and defenses, low-cost AI and multimedia.</p>
<p><strong>Charlie Neuhauser</strong> is now retired after more than 50 years in the field of computer design and analysis.  During the latter half of his career, he provided technical insight to attorneys and companies in the area of intellectual property.  He is currently the registration chair for the IEEE Hot Chips Symposium.</p>
<p><strong>Timothy M. Pinkston</strong> is the George Pfleger Chaired Professor of Electrical and Computer Engineering at the University of Southern California and also is a Vice Dean in USC’s Viterbi School of Engineering.  A Fellow of AAAS, ACM, and IEEE, and recipient of the ACM SIGARCH Alan D. Berenbaum Distinguished Service Award, Timothy’s research contributions mainly are in the area of interconnection networks and efficient data movement in parallel computing systems.</p>
<p style="font-weight: 400;">All three authors are former  Ph.D. students of Mike Flynn at Stanford (Lee and Pinkston) and Johns Hopkins (Neuhauser).</p>
<p class="disclaim"><strong>Disclaimer:</strong> <em>These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.</em></p><Img align="left" border="0" height="1" width="1" alt="" style="border:0;float:left;margin:0;padding:0;width:1px!important;height:1px!important;" hspace="0" src="https://feeds.feedblitz.com/~/i/939763391/0/sigarch-cat">
]]>
</content:encoded>
			<wfw:commentRss>https://feeds.feedblitz.com/~/939763391/0/sigarch-cat~In-Memoriam-Remembering-Mike-Flynn/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	<post-id xmlns="com-wordpress:feed-additions:1">97727</post-id></item>
</channel></rss>

