<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="https://feeds.feedblitz.com/feedblitz_rss.xslt"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	 xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
<channel>
	<title>Computer Architecture Today</title>
	<atom:link href="https://www.sigarch.org/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.sigarch.org</link>
	<description>Informing the broad computing community about current activities, advances and future directions in computer architecture.</description>
	<lastBuildDate>Wed, 29 Apr 2026 14:00:03 -0400</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	
<image>
	<url>https://www.sigarch.org/wp-content/uploads/2017/03/logo_rgb.png</url>
	<title>Computer Architecture Today</title>
	<link>https://www.sigarch.org</link>
</image> 
<site xmlns="com-wordpress:feed-additions:1">125883397</site>
<meta xmlns="http://www.w3.org/1999/xhtml" name="robots" content="noindex" />
<item>
<feedburner:origLink>https://www.sigarch.org/an-overview-of-the-fourth-data-prefetching-championship-part-2/</feedburner:origLink>
		<title>Fourth Data Prefetching Championship: Part 2</title>
		<link>https://feeds.feedblitz.com/~/954807320/0/sigarch-cat~Fourth-Data-Prefetching-Championship-Part/</link>
		<comments>https://feeds.feedblitz.com/~/954807320/0/sigarch-cat~Fourth-Data-Prefetching-Championship-Part/#respond</comments>
		<pubDate>Wed, 29 Apr 2026 14:00:03 +0000</pubDate>
		<dc:creator><![CDATA[Digvijay Singh]]></dc:creator>
		<category><![CDATA[ACM SIGARCH]]></category>
		<category><![CDATA[Data Prefetcher]]></category>
		<category><![CDATA[Memory Wall]]></category>
		<guid isPermaLink="false">https://www.sigarch.org/?p=103014</guid>
		<description><![CDATA[<div><img width="300" xheight="187" src="https://www.sigarch.org/wp-content/uploads/2026/04/DPC-4-Part-1-300x187.png" class="attachment-medium size-medium wp-post-image" alt="DPC-4 Concept Art (Indigo)" style="max-width:100% !important;height:auto !important;margin-bottom:15px;margin-left:15px;float:right;"  fetchpriority="high" /></div>This article continues (and concludes) the discussion on the proceedings of DPC-4, covering the remaining four contestants and a summary of the trends observed in all eight prefetchers presented in the championship. Similar to Part I, we focus on how each prefetch algorithm functions, and why it is effective. Finer implementation details can be obtained [&#8230;]]]>
</description>
				<content:encoded><![CDATA[<div><img width="300" height="187" src="https://www.sigarch.org/wp-content/uploads/2026/04/DPC-4-Part-1-300x187.png" class="attachment-medium size-medium wp-post-image" alt="DPC-4 Concept Art (Indigo)" style="margin-bottom:15px;margin-left:15px;float:right;" decoding="async" /></div><p>This article continues (and concludes) the discussion on the proceedings of <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://sites.google.com/view/dpc4-2026/home?pli=1">DPC-4</a>, covering the remaining four contestants and a summary of the trends observed in all eight prefetchers presented in the championship. Similar to <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.sigarch.org/fourth-data-prefetching-championship-part-i/">Part I</a>, we focus on how each prefetch algorithm functions, and why it is effective. Finer implementation details can be obtained from the workshop <a class="WKVSfLCavKyjywFEXwZDFLAfdDosdiAqrY " href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://github.com/CMU-SAFARI/DPC4/tree/main/final-versions" target="_self" data-test-app-aware-link="">papers</a> or the source <a class="WKVSfLCavKyjywFEXwZDFLAfdDosdiAqrY " href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://github.com/CMU-SAFARI/DPC4/tree/main/submissions" target="_self" data-test-app-aware-link="">code</a>.</p>
<h3><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://github.com/CMU-SAFARI/DPC4/blob/main/final-versions/BertiGO-final.pdf"><strong>BertiGO</strong> (</a><em>Simranjit Singh, University of Murcia; Agustín Navarro Torres, University of Zaragoza; Alberto Ros, University of Murcia)</em></h3>
<h4 id="ember675" class="ember-view reader-text-block__heading-3">Motivation</h4>
<p id="ember676" class="ember-view reader-text-block__paragraph">When evaluating the baseline prefetcher configuration, the authors noted that <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://dpc3.compas.cs.stonybrook.edu/pdfs/Berti.pdf">Berti</a> frequently issues redundant prefetch requests for lines already prefetched or present in the cache. Also, using only the PC provides very limited context for pattern recognition, limiting the prediction capabilities of Berti. Furthermore, <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://dl.acm.org/doi/10.1145/3466752.3480114">Pythia</a> is found to generate a lot of useless prefetches for some workloads, which pollutes the L2 cache and wastes memory bandwidth.</p>
<h4 id="ember677" class="ember-view reader-text-block__heading-3">Idea</h4>
<ol>
<li>A Region-Based Bit-Map Filter is added: a fully associative structure that records, as a bit-vector per region, which cache lines have been prefetched or accessed. For regions tracked by the filter, a set M-th bit means that all prefetch requests for the M-th cache line inside that region are dropped (a minimal sketch follows this list).</li>
<li>In addition to using PC, the authors propose using a hash (shifted XOR) of the last 4 PCs with the current PC, to index the Berti tables with additional context.</li>
<li>Set-Dueling is added to Pythia: instead of using the default policy to issue prefetches, 5 different policies are introduced, including a No-Prefetch policy that disables Pythia. All 5 policies are enabled for a 10M-instruction tournament, at the end of which the policy with the lowest miss rate is chosen for the rest of execution.</li>
<li>An Adaptive Next Line (ANeLin) prefetcher is added at the LLC, which uses a sampling cache to track demand misses and insert next-line prefetches. A heuristic mechanism tracks useful and useless prefetches globally and per PC. ANeLin can be disabled if the ratio of useful to useless prefetches drops below a threshold.</li>
</ol>
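<p>To make the first two ideas concrete, below is a minimal, self-contained C++ sketch of a region-based bit-map filter and a shifted-XOR PC-history hash. The structure sizes, replacement policy, and names are illustrative assumptions made for this post, not the authors&#8217; actual implementation (see the BertiGO paper and source code for that).</p>
<pre><code>#include &lt;array&gt;
#include &lt;cstddef&gt;
#include &lt;cstdint&gt;

// Illustrative region filter: one bit per cache line in a 4KB region (64 lines).
// A set bit means the line was already accessed or prefetched, so further
// prefetch requests to it are dropped.
struct RegionBitmapFilter {
    struct Entry { uint64_t region_tag = 0; uint64_t bitmap = 0; bool valid = false; };
    std::array&lt;Entry, 64&gt; entries{};   // small fully associative filter (size assumed)
    std::size_t next_victim = 0;       // simple FIFO replacement (assumed)

    static uint64_t region_of(uint64_t line_addr) { return line_addr &gt;&gt; 6; }   // 64 lines per region
    static uint64_t bit_of(uint64_t line_addr)    { return 1ULL &lt;&lt; (line_addr &amp; 63); }

    Entry* find(uint64_t region) {
        for (auto&amp; e : entries)
            if (e.valid &amp;&amp; e.region_tag == region) return &amp;e;
        return nullptr;
    }

    // Record a demand access or an issued prefetch.
    void mark(uint64_t line_addr) {
        uint64_t region = region_of(line_addr);
        Entry* e = find(region);
        if (!e) {
            e = &amp;entries[next_victim];
            next_victim = (next_victim + 1) % entries.size();
            *e = {region, 0, true};
        }
        e-&gt;bitmap |= bit_of(line_addr);
    }

    // Returns true if a prefetch request for this line should be dropped.
    bool should_drop(uint64_t line_addr) {
        Entry* e = find(region_of(line_addr));
        return e &amp;&amp; (e-&gt;bitmap &amp; bit_of(line_addr));
    }
};

// Illustrative shifted-XOR hash of the current PC with the last four PCs,
// used to index the Berti tables with additional control-flow context.
uint64_t pc_context_hash(uint64_t pc, const std::array&lt;uint64_t, 4&gt;&amp; last_pcs) {
    uint64_t h = pc;
    int shift = 1;
    for (uint64_t p : last_pcs) { h ^= (p &gt;&gt; shift); shift++; }
    return h;
}
</code></pre>
<p>In this sketch, the prefetcher would call <code>should_drop()</code> before enqueuing a request and <code>mark()</code> whenever a line is demanded or prefetched; the real filter additionally has to handle region eviction and sizing against the storage budget.</p>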
<h4 id="ember679" class="ember-view reader-text-block__heading-3">Why It Works</h4>
<p id="ember680" class="ember-view reader-text-block__paragraph">Adding a Bit-Map Filter eliminates redundant and useless prefetches. Using PC history adds context from the program flow while learning memory accesses with minimal overhead. Disabling Pythia and Next Line prefetching when they do not generate enough useful prefetches solves the problem of cache pollution due to wasteful prefetching. This is especially useful in the constrained bandwidth and multicore scenarios where data and memory need to be shared judiciously for optimal performance.</p>
<h3 id="ember682" class="ember-view reader-text-block__heading-2"><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://github.com/CMU-SAFARI/DPC4/blob/main/final-versions/EDP-final.pdf"><strong>Entangling Data Prefetcher</strong> (</a><em>Agustín Navarro Torres, Universidad de Zaragoza;  Simranjit Singh, University of Murcia; Biswabandan Panda, IIT Bombay; Alberto Ros, University of Murcia)</em></h3>
<h4 id="ember684" class="ember-view reader-text-block__heading-3">Motivation</h4>
<p id="ember685" class="ember-view reader-text-block__paragraph">Comparing Berti with other state-of-art prefetchers, the authors identify a SPEC2017 workload where Berti achieves negligible performance gain over no-prefetch baseline. Profiling this trace reveals that it consists of long-reuse strides (stride accesses separated by 2K-cycle interval) and zero-strides (consecutive accesses to the same cache line). Berti cannot issue zero-delta prefetches, and even though prefetches are correctly issued for long-reuse deltas, they get evicted before the cache line gets accessed. T-SKID, a Time Skipping Prefetcher is built on top of a standard PC-Stride prefetcher, but decouples the PC that triggers a prefetch (TriggerPC) from the PC that trains the predictor(TargetPC). This allows it to prefetch long-reuse and zero stride patterns. However, the underlying stride prefetcher limits its scope to constant stride instead of complex delta patterns predicted easily by Berti.</p>
<h4 id="ember686" class="ember-view reader-text-block__heading-3">Idea</h4>
<p id="ember687" class="ember-view reader-text-block__paragraph">EDP is proposed as a VA-based L1D prefetcher. It gets trained and triggered on cache misses or prefetch hits (cache hit on a prefetched line). For every TargetPC, it records the fill latency of the demand access or prefetch request. It then searches the global PC history for the most recent PC that was observed more than (current cycle &#8211; fill latency) cycles ago – this is the TriggerPC which could have triggered a timely prefetch for TargetPC. This ‘Entangling Pair’ of PCs is added to the Entangling Table, that stores the set of TargetPCs for a given TriggerPC. EDP also looks at the address history of each TargetPC to calculate the list of timely deltas (similar to Berti) and stores them with the current address in a Delta Table indexed by TargetPC. To issue prefetches, the TriggerPC is used to obtain one or more TargetPC, which are used to obtain address and deltas for timely prefetch. The prefetches calculated in this way are passed through a Bloom Filter to drop redundant requests, and then placed in a Proxy Prefetch Queue (PPQ) where the prefetch request waits till slots open up in the demand read queue. If there is no space in the latter, prefetch requests are not issued. Pythia is implemented at L2, with a throttling mechanism at LLC that tracks each core&#8217;s requests and sets the EDP aggressiveness.</p>
<h4 id="ember688" class="ember-view reader-text-block__heading-3">Why It Works</h4>
<p id="ember689" class="ember-view reader-text-block__paragraph">Using a different PC to trigger prefetches allows EDP to successfully prefetch zero and long reuse delta patterns for its target PC. Filtering out redundant prefetches reduces contention for resources. Using a dedicated PPQ for prefetch requests prevents prefetches from competing with critical loads for resources. The LLC throttling mechanism helps evenly distribute resources in the multi-core scenario.</p>
<h3 id="ember691" class="ember-view reader-text-block__heading-2"><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://github.com/CMU-SAFARI/DPC4/blob/main/final-versions/uMAMA-final.pdf"><strong>Composite Prefetching with Bandits</strong> (</a><em>Charles Block, Pedro Palacios, Abraham Farrell, Gerasimos Gerogiannis, Josep Torrellas, University of Illinois at Urbana-Champaign)</em></h3>
<h4 id="ember693" class="ember-view reader-text-block__heading-3">Motivation</h4>
<p id="ember694" class="ember-view reader-text-block__paragraph">The authors point out that the current state-of-the-art prefetchers try to optimize low-level metrics such as accuracy, timeliness and coverage. The system performance (IPC) depends on these factors, but can have variable sensitivity to each of them depending on the workload and program phase. Furthermore, a single prefetcher is generally insufficient to deliver the best performance for a diverse set of workloads – industrial processors generally deploy a composite prefetcher consisting of multiple prefetch engines.</p>
<h4 id="ember695" class="ember-view reader-text-block__heading-3">Idea</h4>
<p id="ember696" class="ember-view reader-text-block__paragraph">A Multi-Armed Bandit is a Reinforcement Learning agent that chooses the best action (arm) to maximize the reward function value. Inspired by this, a Micro-Armed Bandit (MAB) is used to prefetch at L2C. Each ‘arm’ consists of different configurations for 5 state-of-the-art prefetchers-</p>
<ul>
<li>Next Line, Spatial Memory Streaming, Best Offset Prefetcher: Can be turned ON or OFF</li>
<li>Stride, Stream prefetchers: Degree can be tuned to control aggressiveness</li>
</ul>
<p id="ember698" class="ember-view reader-text-block__paragraph">A bloom filter is implemented to prevent issuing redundant prefetches. Each arm is used for a fixed time period (bandit step) after which the reward generated by it is evaluated by the agent. This is evaluated against the rewards generated previously to calculate which arm to use next. The total IPC of the core is used as a reward function for the MAB.</p>
<p id="ember699" class="ember-view reader-text-block__paragraph">To optimize multi-core performance, another agent called ‘µMama’ is added at the system level, using the geometric mean of IPCs across all cores as a reward function. At each timestep, it decides whether to allow the cores to pursue their independent actions, or to force them into joint actions which have a record of increasing the µMama reward.</p>
<h4 id="ember700" class="ember-view reader-text-block__heading-3">Why It Works</h4>
<p id="ember701" class="ember-view reader-text-block__paragraph">Using Reinforcement Learning to directly maximize the system performance ensures that the prefetcher dynamically re-configures itself with execution to improve IPC. The caveat is that this now becomes a search space problem &#8211; the arms of the bandit need to be diverse enough to support different kinds of workloads, in order to deliver the best performance.</p>
<h3 id="ember703" class="ember-view reader-text-block__heading-2"><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://github.com/CMU-SAFARI/DPC4/blob/main/final-versions/GBerti-final.pdf"><strong>Global Berti</strong> (</a><em>Gilead Posluns, Mark Jeffrey; University of Toronto)</em></h3>
<h4 id="ember705" class="ember-view reader-text-block__heading-3">Motivation</h4>
<p id="ember706" class="ember-view reader-text-block__paragraph">Berti is a state-of-the-art prefetcher that detects Streaming patterns, i.e., consistent delta values between accesses by the <em>same</em> PC. Practical workloads however, often exhibit Spatial patterns identified by consistent delta values between accesses by <em>different </em>PCs. In the absence of streaming patterns, prefetching based on spatial patterns could alleviate the efficacy of Berti.</p>
<h4 id="ember707" class="ember-view reader-text-block__heading-3">Idea</h4>
<p id="ember708" class="ember-view reader-text-block__paragraph">Global Berti detects spatial patterns using Berti’s existing structures – the History Table conventionally stores within a row, the addresses of all the lines accessed by a particular PC, in FIFO order. When a streaming pattern cannot be detected, local training is useless and Global Berti looks at the most recent address for all PCs to detect spatial patterns (global training). Berti’s Delta Table holds the row delta values for the same PC; Global Berti stores the global deltas (across PCs) in the same table, adding a local bit to differentiate between streaming and spatial training.</p>
<h4 id="ember709" class="ember-view reader-text-block__heading-3">Why It Works</h4>
<p id="ember710" class="ember-view reader-text-block__paragraph">By itself, Berti is quite effective at detecting and covering streaming patterns. Adding the capability to detect spatial patterns in the absence of streaming patterns increases Global Berti’s coverage and therefore, the overall performance. As expected, the highest speedup over Berti is obtained in SPEC2017 and Graph workloads that are dominated by irregular accesses which require spatial prefetching. On the other hand, AI workloads containing mostly streaming patterns see a much lesser speedup.</p>
<h3 id="ember712" class="ember-view reader-text-block__heading-2">General Trends</h3>
<p id="ember713" class="ember-view reader-text-block__paragraph">Although the major focus of almost all DPC-4 submissions is to overcome the limitations of the high-performing Berti/Pythia baseline, they highlight several key trends in data prefetching research:</p>
<ul>
<li><strong>Prefetching across Physical Page Boundaries: </strong>Issuing page-crossing prefetches is extremely useful for AI workloads since they are dominated by streaming accesses. This is leveraged by most submissions to gain an edge over the baseline prefetcher configuration.</li>
<li><strong>Preventing Redundant Prefetches: </strong>Quite a few papers also combat excessive prefetching and resource contention through advanced throttling, priority, and filtering mechanisms.</li>
<li><strong>Increased System-Level and Multi-Core Awareness:</strong> There is a growing emphasis on system-aware solutions to judiciously manage shared resources like memory bandwidth, which is constrained in high-core-count datacenters. This includes core-level fairness throttling (Emender, EDP) and global coordination agents (µMama) to dynamically adjust prefetcher configurations for optimal multi-core performance.</li>
<li><strong>Expanding Pattern Coverage for Diverse Workloads:</strong> Submissions seek to improve coverage beyond simple streaming patterns. This includes detecting spatial patterns across different PCs (Global Berti), and targeting complex patterns like long-reuse and zero-strides (EDP). The adoption of PC history (BertiGO) also provides better context for pattern recognition.</li>
<li><strong>Shift Towards Adaptive and Composite Designs:</strong> Recognizing that a  single prefetcher is insufficient for diverse workloads, the trend moves toward composite prefetchers. This is accompanied by dynamic re-configuration to select the best prefetcher setting at runtime, and adaptive heuristics to tune aggressiveness.</li>
</ul>
<h3>About the Author</h3>
<p><span style="font-weight: 400;">Digvijay Singh obtained his Bachelor’s degree from BITS Pilani and his Master’s degree from Texas A&amp;M University where he worked on data prefetching as part of the CAMSIN research group. He currently works as a Silicon Architect in Google’s mobile CPU team.</span></p>
<p class="disclaim"><strong>Disclaimer:</strong> <em>These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.</em></p><Img align="left" border="0" height="1" width="1" alt="" style="border:0;float:left;margin:0;padding:0;width:1px!important;height:1px!important;" hspace="0" src="https://feeds.feedblitz.com/~/i/954807320/0/sigarch-cat">
]]>
</content:encoded>
			<wfw:commentRss>https://feeds.feedblitz.com/~/954807320/0/sigarch-cat~Fourth-Data-Prefetching-Championship-Part/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	<post-id xmlns="com-wordpress:feed-additions:1">103014</post-id></item>
<item>
<feedburner:origLink>https://www.sigarch.org/fourth-data-prefetching-championship-part-i/</feedburner:origLink>
		<title>Fourth Data Prefetching Championship: Part I</title>
		<link>https://feeds.feedblitz.com/~/954636863/0/sigarch-cat~Fourth-Data-Prefetching-Championship-Part-I/</link>
		<comments>https://feeds.feedblitz.com/~/954636863/0/sigarch-cat~Fourth-Data-Prefetching-Championship-Part-I/#respond</comments>
		<pubDate>Mon, 27 Apr 2026 14:00:53 +0000</pubDate>
		<dc:creator><![CDATA[Digvijay Singh]]></dc:creator>
		<category><![CDATA[ACM SIGARCH]]></category>
		<category><![CDATA[Data Prefetcher]]></category>
		<category><![CDATA[Memory Wall]]></category>
		<guid isPermaLink="false">https://www.sigarch.org/?p=103010</guid>
		<description><![CDATA[<div><img width="300" xheight="187" src="https://www.sigarch.org/wp-content/uploads/2026/04/DPC-4-Part-2-300x187.png" class="attachment-medium size-medium wp-post-image" alt="DPC-4 Concept Art (Blue)" style="max-width:100% !important;height:auto !important;margin-bottom:15px;margin-left:15px;float:right;"  /></div>This article is the first in a two-part series that summarizes the key contributions of 4th Data Prefetching Championship (DPC-4), held in conjunction with the 32nd iteration of HPCA in 2026. While discussing innovative data prefetching techniques presented in this contest, we focus on the functionality of proposed algorithms and also explain why they are [&#8230;]]]>
</description>
				<content:encoded><![CDATA[<div><img width="300" height="187" src="https://www.sigarch.org/wp-content/uploads/2026/04/DPC-4-Part-2-300x187.png" class="attachment-medium size-medium wp-post-image" alt="DPC-4 Concept Art (Blue)" style="margin-bottom:15px;margin-left:15px;float:right;" decoding="async" loading="lazy" /></div><p class="ember-view reader-text-block__paragraph">This article is the first in a two-part series that summarizes the key contributions of the 4th Data Prefetching Championship (<a class="WKVSfLCavKyjywFEXwZDFLAfdDosdiAqrY " href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://sites.google.com/corp/view/dpc4-2026/home" target="_self" data-test-app-aware-link="">DPC-4</a>), held in conjunction with the 32nd iteration of <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://2026.hpca-conf.org/track/hpca-2026-main-conference">HPCA</a> in 2026. While discussing innovative data prefetching techniques presented in this contest, we focus on the functionality of the proposed algorithms and also explain why they are effective. Finer implementation details can be found in the <a class="WKVSfLCavKyjywFEXwZDFLAfdDosdiAqrY " href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://github.com/CMU-SAFARI/DPC4/tree/main/final-versions" target="_self" data-test-app-aware-link="">papers</a> or the source <a class="WKVSfLCavKyjywFEXwZDFLAfdDosdiAqrY " href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://github.com/CMU-SAFARI/DPC4/tree/main/submissions" target="_self" data-test-app-aware-link="">code</a>.</p>
<h3 id="ember612" class="ember-view reader-text-block__heading-3">Implementation Constraints</h3>
<p id="ember613" class="ember-view reader-text-block__paragraph">All prefetchers are evaluated against a baseline configuration that employs: <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://dpc3.compas.cs.stonybrook.edu/pdfs/Berti.pdf">Berti</a> prefetcher (<a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://dpc3.compas.cs.stonybrook.edu/">DPC3</a> winner) at L1D (Level-1 Data cache) and <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://dl.acm.org/doi/10.1145/3466752.3480114">Pythia</a> prefetcher at L2 (Level-2 cache). While there were no constraint on design complexity, upper limits were defined on the storage budget of the prefetchers to ensure the design was practically feasible for implementation. These limits were defined as follows: L1D Prefetcher: 32KB, L2 Prefetcher: 128KB, LLC (Last Level Cache) Prefetcher: 256KB.</p>
<h3 id="ember618" class="ember-view reader-text-block__heading-2">Keynotes</h3>
<p>The event included two keynote talks. The first keynote, titled &#8220;Is Prefetcher Research Still Alive?&#8221;, was given by <em><a id="ember621" class="ember-view" href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.linkedin.com/in/leeor-peled-125365b4/">Leeor Peled</a> </em>from Huawei. Leeor discussed the modern relevance of prefetching research, offering a pragmatic philosophy for academic researchers. He argued that the primary objective should not necessarily be to surpass &#8220;best-in-class&#8221; models – which are often the result of years of ‘engineered’ fine-tuning – but rather to introduce <strong>novel, high-potential concepts</strong> that invite further optimization. He emphasized that while an individual effort might not immediately surpass the state-of-the-art, a sufficiently &#8220;interesting&#8221; technique can evolve into a transformative solution through subsequent community-driven iteration.</p>
<p id="ember623" class="ember-view reader-text-block__paragraph">He suggested two optimizations that can be explored:</p>
<ol>
<li>Building a Semantic Prefetcher that correlates memory accesses with address generating code, i.e., a high-precision version of the Runahead Prefetcher that selectively runs only the code responsible for generating a future address.</li>
<li>Training neural networks to identify deep correlations between memory accesses, potentially unlocking the ability to predict complex, non-linear patterns that remain invisible to current heuristic-based logic.</li>
</ol>
<p id="ember625" class="ember-view reader-text-block__paragraph">The following issues can (and should) be addressed to build better prefetchers:</p>
<ul>
<li>Generalizing complex patterns, e.g. pointer chasing loads</li>
<li>Accurately choosing memory accesses with high correlation for better training</li>
<li>Prefetching to the appropriate cache level to optimize for timeliness</li>
<li>Throttling prefetches for fairness amongst multiple cores</li>
<li>Using LLMs to process memory traces instead of text sequences</li>
</ul>
<p id="ember631" class="ember-view reader-text-block__paragraph">The second keynote, titled &#8220;Data Prefetching: A Datacenter Perspective&#8221;, was given by <em><a id="ember630" class="ember-view" href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.linkedin.com/in/akanksha-j-a8336884/">Akanksha J.</a> from Google. </em> Addressing the memory bottleneck problem in modern datacenters (40% of the CPU cycles are spent idling for memory responses) Akanksha highlighted that cloud environments are characterized by massive multi-threading and incessant context switching. In these scenarios, a single thread may migrate across multiple cores, while each core rotates through a vast &#8220;plethora&#8221; of applications. The Google workloads utilized in DPC-4 are a better representation of this reality, and are primarily frontend-bound. Without a sophisticated instruction prefetcher to streamline code delivery, the underlying bottlenecks in data prefetching remain obscured and impossible to solve. She also analyzed structural failures of current prefetching solutions, identifying these primary aspects:</p>
<ol>
<li>Current design philosophy focuses on &#8220;tuning for the common case,&#8221; resulting in hard-coded heuristic values—such as fixed confidence thresholds and prefetch degrees—that are taped out into non-programmable silicon. While these &#8220;black boxes&#8221; are meticulously engineered to squeeze every drop of performance from SPEC workloads, they lack the flexibility required for the high heterogeneity of datacenter tasks. Consequently, these resource-hungry techniques often penalize cloud performance rather than enhancing it.</li>
<li>If we disable hardware prefetchers entirely and rely on software to insert prefetches, we miss out on critical opportunities to utilize valuable information about system states (coherence, timeliness, cache hits/misses) that improves prefetching. Akanksha proposed a shift towards <strong>&#8220;Software-Defined Prefetching,&#8221;</strong> a paradigm that transcends current <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.arm.com/glossary/isa">ISA</a> limitations. In this model, the software layer dynamically selects which code segments to target and determines the optimal hardware prefetcher to activate for peak accuracy. Simultaneously, the hardware leverages real-time system state data to maximize coverage.</li>
</ol>
<p id="ember633" class="ember-view reader-text-block__paragraph">Furthermore, Akanksha advocated for evaluating all prefetching techniques within constrained-bandwidth environments, arguing that such stress tests better reflect the realities of modern compute environments.</p>
<p>Now, on to prefetcher designs themselves.</p>
<h3 id="ember635" class="ember-view reader-text-block__heading-2"><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://github.com/CMU-SAFARI/DPC4/blob/main/final-versions/VIP-final.pdf"><strong>Virtual Inter-Page Prefetcher</strong></a> (<em>Ho Je Lee, Won Woo Ro; Yonsei University)</em></h3>
<h4 id="ember637" class="ember-view reader-text-block__heading-3">Motivation</h4>
<ul>
<li>Analyzing the baseline prefetcher configuration, the authors observed that the L2 Prefetcher (Pythia) is more effective than the L1 Prefetcher (Berti) in reducing Misses Per Kilo Instructions (MPKI) for the Last Level Cache (LLC).</li>
<li>Since Pythia operates in the Physical Address (PA) space, it is not feasible to let it issue prefetches across page boundaries, as incorrect physical page access poses a security risk.</li>
<li>A roofline study shows that there is significant performance to be gained when Pythia is allowed to issue page-cross prefetches in the PA space. This advantage amplifies when it is granted visibility of the Virtual Address (VA) space, preventing incorrect page accesses.</li>
</ul>
<h4 id="ember639" class="ember-view reader-text-block__heading-3">Idea</h4>
<p id="ember640" class="ember-view reader-text-block__paragraph">VIP is implemented at L1 level,  but issues prefetches to the L2. It gets trained on L1 Misses by reading the {<a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.geeksforgeeks.org/operating-systems/what-is-program-counter/">PC</a>, VA} information off the packets sent to L1 MSHR. These are written to the VIP Stride Table that calculates the observed stride for a particular PC and stores it. If a stride value is repeated, the confidence gets incremented. Otherwise it gets reset. The confidence value determines the prefetch degree.</p>
<h4 id="ember641" class="ember-view reader-text-block__heading-3">Why It Works</h4>
<p id="ember642" class="ember-view reader-text-block__paragraph">The implemented VIP configuration is a simple yet elegant solution to gain performance over the baseline by supplementing the existing Berti and Pythia prefetchers with cross-page prefetches (note that the DPC-3 version of Berti operates in the PA space and cannot issue prefetches across page boundaries). As expected, the stride prefetcher boosts AI workloads with sequential accesses of large data structures that span across pages. The typical CPU workloads such as SPEC see a moderate gain; the control-flow dominated Google workloads have a marginal slowdown since they rarely have uninterrupted streams.</p>
<h3 id="ember644" class="ember-view reader-text-block__heading-2"><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://github.com/CMU-SAFARI/DPC4/blob/main/final-versions/SPPAM-final.pdf"><strong>Signature Pattern Prediction and Access-Map Prefetcher</strong></a> (<em>Maccoy Merrell, Lei Wang, Paul Gratz, Stavros Kalafatis; Texas A&amp;M University)</em></h3>
<h4 id="ember646" class="ember-view reader-text-block__heading-3">Motivation</h4>
<p id="ember647" class="ember-view reader-text-block__paragraph">Access Map Pattern Matching (<a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://dl.acm.org/doi/10.1145/1542275.1542349">AMPM</a>) and Signature Path Prefetching (<a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://dl.acm.org/doi/10.5555/3195638.3195711">SPP</a>) are both considered state-of-the-art prefetching techniques; while SPP is sensitive to the order of memory accesses, AMPM is resistant to OoO execution. However, AMPM relies heavily on stored patterns for each region and  is unable to issue prefetches for new regions or when the observed accesses deviate from expectations. SPP excels at this and can even make predictions from its issued prefetches.</p>
<h4 id="ember648" class="ember-view reader-text-block__heading-3">Idea</h4>
<p id="ember649" class="ember-view reader-text-block__paragraph">Implemented at L2 level, a Region Table (RT) tracks all access maps (as bit-vectors) on a per-region basis. Upon a memory access, an N-bit portion from the respective access map is used to index a Pattern Table (PT). The PT outputs the most frequently occurring N-bit pattern as a prefetch candidate, which can be used to speculatively index the PT. Similar to SPP, speculative prefetching continues till the overall confidence drops below a threshold. The RT access map indicates the recently accessed cache lines and filters out redundant prefetches.</p>
<h4 id="ember650" class="ember-view reader-text-block__heading-3">Why It Works</h4>
<p id="ember651" class="ember-view reader-text-block__paragraph">The authors have identified the complementary nature of SPP and AMPM, and have combined them effectively to utilize the OoO resistance of AMPM with the Speculative mechanism of SPP. Additionally, numerous throttling mechanisms are implemented which consider pattern usefulness as well as global metrics such as DRAM bandwidth and overall usefulness to drop prefetches and set prefetch degree. SPPAM is implemented at L2C with Berti (the MICRO version which operates in the VA space) at L1D and Bingo at LLC. Similar to the previous paper, the cross-page stream information is passed to SPPAM from L1D.</p>
<h3 id="ember653" class="ember-view reader-text-block__heading-2"><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://github.com/CMU-SAFARI/DPC4/blob/main/final-versions/Emender-final.pdf"><strong>Emender</strong></a> (<em>Jiajie Chen, Tingji Zhang, Xiaoyi Liu, Xuefeng Zhang, Peng Qu, Youhui Zhang; Tsinghua University)</em></h3>
<h4 id="ember655" class="ember-view reader-text-block__heading-3">Motivation</h4>
<p id="ember656" class="ember-view reader-text-block__paragraph">An evaluation of different combinations of state-of-the-art prefetchers shows that <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://dl.acm.org/doi/10.1109/MICRO56248.2022.00072">VBerti</a> (L1D) and Pythia (L2) is the highest performing combination. Here, VBerti refers to the MICRO version of Berti that operates in the VA space, allowing it to issue page-crossing prefetches. It is observed that this optimal prefetcher combination issues too many prefetch requests that fill the prefetch queue quickly, which leads to useful prefetches getting dropped. A second-order effect of a full prefetch queue is the excessive usage of L1D to Memory bandwidth that can delay critical loads.</p>
<h4 id="ember657" class="ember-view reader-text-block__heading-3">Idea</h4>
<p id="ember658" class="ember-view reader-text-block__paragraph">Four key features are added to tackle the problem of over-prefetching in the VBerti+Pythia configuration:</p>
<ol>
<li>A Pending Target Buffer is added to sort all issued prefetches by confidence, which helps prioritize useful prefetches across different PCs.</li>
<li>A Cuckoo Filter is added, which tracks the VAs already present in the cache to prevent redundant prefetches. This structure is chosen for its O(1) query time, high accuracy, and zero false negatives.</li>
<li>A Dynamic Confidence Threshold is added, which increases with the cache miss rate, throttling low-confidence prefetches.</li>
<li>A Fairness-based Throttling scheme is implemented across cores, which tracks the useless prefetches per core at the L3 and stops the core with the most useless prefetches from prefetching (a minimal sketch of these last two mechanisms follows this list).</li>
</ol>
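<p>A compact sketch of ideas 3 and 4, with arbitrary constants: a confidence threshold that rises with the cache miss rate, and a per-core fairness throttle that mutes the core producing the most useless prefetches each epoch. The actual Emender policies, epoch lengths, and Cuckoo Filter integration are described in the paper.</p>
<pre><code>#include &lt;algorithm&gt;
#include &lt;array&gt;
#include &lt;cstddef&gt;
#include &lt;cstdint&gt;

// Idea 3: the confidence a prefetch must reach grows with the L1D miss rate,
// so low-confidence prefetches are throttled exactly when misses hurt most.
// The linear mapping and its bounds are assumptions.
double dynamic_confidence_threshold(double miss_rate,
                                    double base = 0.25, double slope = 0.5) {
    return std::min(1.0, base + slope * miss_rate);
}

// Idea 4: per-core fairness throttling at the shared L3.
template &lt;std::size_t NumCores&gt;
struct FairnessThrottle {
    std::array&lt;uint64_t, NumCores&gt; useless_prefetches{};  // prefetched-but-never-used counts
    int muted_core = -1;

    void record_useless(std::size_t core) { useless_prefetches[core]++; }

    // At the end of an epoch, stop the worst offender from prefetching.
    void end_of_epoch() {
        auto worst = std::max_element(useless_prefetches.begin(), useless_prefetches.end());
        muted_core = static_cast&lt;int&gt;(worst - useless_prefetches.begin());
        useless_prefetches.fill(0);
    }

    bool may_prefetch(std::size_t core) const {
        return static_cast&lt;int&gt;(core) != muted_core;
    }
};
</code></pre>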
<h4 id="ember661" class="ember-view reader-text-block__heading-3">Why It Works</h4>
<p id="ember662" class="ember-view reader-text-block__paragraph">The authors identify problematic areas in the baseline Berti+Pythia system and propose features to effectively address them. The best performance improvement comes from the Cuckoo Filter for single-core and Fairness Throttling for multi-core configuration. Since Emender provides the least gain for limited bandwidth configuration, it would be interesting to look at the accuracy data.</p>
<h3 id="ember664" class="ember-view reader-text-block__heading-2"><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://github.com/CMU-SAFARI/DPC4/blob/main/final-versions/sBerti-final.pdf"><strong>sBerti</strong></a> (<em>Jiapeng Zhou, Ben Chen, Kunlin Li, Yun Chen; HKUST, Guangzhou)</em></h3>
<h4 id="ember666" class="ember-view reader-text-block__heading-3">Motivation</h4>
<p id="ember667" class="ember-view reader-text-block__paragraph">When profiling the DPC4 workloads on the given baseline prefetcher configuration (Berti + Pythia), the authors observed a high L1D miss rate in the AI-ML and Google workloads. A deeper analysis of the traces indicated that most of these misses occurred when the access stream moved across the 4KB physical page boundary, which happens frequently in these workloads. The version of Berti used in the baseline does not issue prefetches across page boundaries, and thus, a stride prefetcher can help.</p>
<h4 id="ember668" class="ember-view reader-text-block__heading-3">Idea</h4>
<p id="ember669" class="ember-view reader-text-block__paragraph">A decoupled Smart Stride Prefetcher is added at L1D, which operates on the VA space and can  therefore track memory access streams across page boundaries. It is trained using a Smart Stride Table (SST), which is indexed by a hash of the PC, and subtracts the lastVA from the current VA to calculate the delta value. If the absolute value of delta is a multiple of the stored stride, the confidence is updated; this also provides resistance to out-of-order execution. Prefetches are issued if this confidence is greater than a static threshold. The lookahead is tuned via a heuristic which is incremented upon observing late prefetches and decremented by timely prefetches. A Recent Prefetch Table stores the recently issued prefetches to track their timeliness and filter duplicate prefetches between Berti and Smart Stride engines.</p>
<h4 id="ember670" class="ember-view reader-text-block__heading-3">Why It Works</h4>
<p id="ember671" class="ember-view reader-text-block__paragraph">The addition of a decoupled stride prefetcher gives sBerti the ability to issue prefetches across physical page boundaries, reducing the “Cold-start Penalty” of Berti. The heuristic based dynamic distance adjustment helps tune the aggressiveness at runtime, allowing longer lookahead for AI-ML workloads dominated by streaming accesses. The final sBerti configuration (Stride + Berti at L1D, Pythia at L2) delivers the best performance in a full bandwidth scenario, where the stride engine can prefetch further ahead.</p>
<p>We will overview the rest of the prefetchers in part 2 of this post.</p>
<h3>About the Author</h3>
<p><span style="font-weight: 400;">Digvijay Singh received his Bachelor’s degree from BITS Pilani and his Master’s degree from Texas A&amp;M University where he worked on data prefetching as part of the CAMSIN research group. He currently works as a Silicon Architect in Google’s mobile CPU team.</span></p>
<p class="disclaim"><strong>Disclaimer:</strong> <em>These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.</em></p><Img align="left" border="0" height="1" width="1" alt="" style="border:0;float:left;margin:0;padding:0;width:1px!important;height:1px!important;" hspace="0" src="https://feeds.feedblitz.com/~/i/954636863/0/sigarch-cat">
]]>
</content:encoded>
			<wfw:commentRss>https://feeds.feedblitz.com/~/954636863/0/sigarch-cat~Fourth-Data-Prefetching-Championship-Part-I/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	<post-id xmlns="com-wordpress:feed-additions:1">103010</post-id></item>
<item>
<feedburner:origLink>https://www.sigarch.org/beyond-qubits-a-systems-view-of-hybrid-cv-dv-quantum-computing/</feedburner:origLink>
		<title>Beyond Qubits: A Systems View of Hybrid CV-DV Quantum Computing</title>
		<link>https://feeds.feedblitz.com/~/954105707/0/sigarch-cat~Beyond-Qubits-A-Systems-View-of-Hybrid-CVDV-Quantum-Computing/</link>
		<comments>https://feeds.feedblitz.com/~/954105707/0/sigarch-cat~Beyond-Qubits-A-Systems-View-of-Hybrid-CVDV-Quantum-Computing/#respond</comments>
		<pubDate>Mon, 20 Apr 2026 15:31:53 +0000</pubDate>
		<dc:creator><![CDATA[Yuan Liu, Zihan Chen, Shubdeep Mohapatra, Jim Furches, Zheng (Eddy) Zhang, Huiyang Zhou]]></dc:creator>
		<category><![CDATA[ACM SIGARCH]]></category>
		<category><![CDATA[Quantum Computing]]></category>
		<guid isPermaLink="false">https://www.sigarch.org/?p=102865</guid>
		<description><![CDATA[<div><img width="300" xheight="167" src="https://www.sigarch.org/wp-content/uploads/2026/04/Picture1-300x167.png" class="attachment-medium size-medium wp-post-image" alt="" style="max-width:100% !important;height:auto !important;margin-bottom:15px;margin-left:15px;float:right;"  loading="lazy" /></div>Hybrid continuous-discrete-variable (CV-DV) quantum computing combines oscillators and qubits to tackle problems that are difficult for either model alone, from bosonic simulation to quantum error correction. At ASPLOS 2026, our tutorial introduced the foundations, compilation stack, benchmarking methods, and programming tools behind this emerging architecture model. In this blog post, we overview the key elements [&#8230;]]]>
</description>
				<content:encoded><![CDATA[<div><img width="300" height="167" src="https://www.sigarch.org/wp-content/uploads/2026/04/Picture1-300x167.png" class="attachment-medium size-medium wp-post-image" alt="" style="margin-bottom:15px;margin-left:15px;float:right;" decoding="async" loading="lazy" /></div><p><b>Hybrid continuous-discrete-variable (CV-DV) quantum computing combines oscillators and qubits to tackle problems that are difficult for either model alone, from bosonic simulation to quantum error correction. At ASPLOS 2026, our tutorial introduced the foundations, compilation stack, benchmarking methods, and programming tools behind this emerging architecture model. In this blog post, we overview the key elements of our tutorial. </b></p>
<p><span style="font-weight: 400;">Tutorial website: </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://cvdv.ncsu.edu/resources/asplos-tutorial/"><span style="font-weight: 400;">https://cvdv.ncsu.edu/resources/asplos-tutorial/</span></a></p>
<h3><b>Foundations</b></h3>
<p><span style="font-weight: 400;">We began with the foundations of hybrid CV-DV quantum computing, introducing the physical model, mathematical language, and programming abstractions behind qubit-oscillator systems. Many leading quantum platforms naturally combine qubits with oscillator modes, such as cavities, vibrational modes, or photonic fields. Rather than treating oscillators as auxiliary hardware, hybrid CV-DV computing views their large Hilbert spaces as a computational resource.</span></p>
<p><span style="font-weight: 400;">The tutorial covered core representations of CV states in both Fock space and phase space, along with the key operators and gate families that support universal CV-DV computation. A central message was that hybrid systems are not simply “qubits plus extra hardware,” but a distinct computational model with their own instruction sets, abstractions, and compilation challenges. We showed how familiar qubit concepts such as Pauli and Clifford structure extend into the oscillator setting through displacement operations, squeezing, quadratic Hamiltonians, beamsplitters, and controlled hybrid interactions.</span></p>
<p><span style="font-weight: 400;">We also discussed why this matters from a computer architecture perspective. Hybrid CV-DV systems introduce new instruction set architectures (ISAs), abstract machine models (AMMs), and compilation choices that help separate hardware details from software design. Depending on the platform and compiler stack, the same computation may be expressed in phase-space language, Fock-space language, or a mixed qubit-oscillator representation.</span></p>
<p><span style="font-weight: 400;">To ground these ideas, we highlighted emerging algorithmic primitives and applications where hybrid systems may offer advantages, including oscillator-mediated entangling gates, state-transfer protocols, Hamiltonian simulation, bosonic quantum error correction, vibronic dynamics, and sensing. We closed the session by surveying two leading implementation pathways, superconducting circuit QED and trapped-ion systems, and discussing the distinct control and connectivity tradeoffs they expose. A comprehensive tutorial on the foundations of hybrid CV-DV quantum processors is available </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://journals.aps.org/prxquantum/abstract/10.1103/4rf7-9tfx"><span style="font-weight: 400;">here</span></a><span style="font-weight: 400;">.</span></p>
<h3><b>Compilation</b></h3>
<p><span style="font-weight: 400;">We also presented Strategies and Tools to Compile CV-DV Quantum Circuits. We began by emphasizing why Hamiltonian simulation is a central application and one of the most promising directions for hybrid continuous-variable and discrete-variable (CV-DV) quantum systems. CV systems can naturally represent continuous degrees of freedom, while DV systems provide strong control and interaction structures. Together, they enable important applications in areas such as quantum chemistry and materials science. However, a key challenge lies in decomposing the time-evolution operator e^{-iHt} into a sequence of executable quantum gates. This transformation is fundamentally a compilation problem, bridging high-level quantum algorithms and low-level hardware. As such, compilers play a critical role in hybrid quantum systems.</span></p>
<p><span style="font-weight: 400;">We then focused on the dominant approach today: symbolic compilation. In particular, we discussed two early CV-DV Hamiltonian simulation compilers from </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://dl.acm.org/doi/full/10.1145/3695053.3731065"><span style="font-weight: 400;">Chen et al., ISCA’25</span></a><span style="font-weight: 400;"> and </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://ieeexplore.ieee.org/abstract/document/11250346"><span style="font-weight: 400;">Decker et al., QCE’25</span></a><span style="font-weight: 400;">. The core idea is to avoid direct matrix-based computation and instead leverage the algebraic structure of operators for rule-based decomposition. Techniques such as Trotter-Suzuki product formulas, the Baker–Campbell–Hausdorff (BCH) expansion, and bosonic commutation relations are used to gradually break down complex Hamiltonians into hardware-executable primitive gates. This process is typically implemented through rule matching and recursive rewriting, where expressions are repeatedly transformed until only supported base gates remain. While this approach avoids the exponential blowup of high-dimensional matrices, it introduces tradeoffs between approximation error and resource overhead.</span></p>
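<p><span style="font-weight: 400;">As a concrete, textbook-level illustration of the rule-based decomposition discussed above (not a snippet from either compiler): for a Hamiltonian split as H = H_1 + H_2, the first-order Trotter-Suzuki formula approximates the time evolution as e^{-iHt} &#8776; (e^{-iH_1 t/r} e^{-iH_2 t/r})^r, with an approximation error that shrinks roughly as t&#178;/r as the number of Trotter steps r grows. Each factor e^{-iH_k t/r} is then rewritten, using BCH-style identities and the bosonic commutation relations, until only hardware-supported primitive gates remain; this is where the accuracy-versus-resource tradeoff appears.</span></p>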
<p><span style="font-weight: 400;">Finally, we analyzed the limitations of current compilers and outlined future research directions. Key challenges include limited gate sets and decomposition rules, the tradeoff between accuracy and resource cost, hardware connectivity constraints, and insufficient optimization flexibility. To address these issues, we highlighted the need for improved programmability, richer native gate support, more accurate cost models, and optimizations that exploit algebraic properties such as commutativity. We also presented the </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://github.com/ruadapt/Genesis-CVDV-Compiler"><span style="font-weight: 400;">Genesis compiler</span></a><span style="font-weight: 400;"> from Chen et al., ISCA’25 as an end-to-end solution example, including typical use cases and code snippets. Genesis employs a multi-level intermediate representation (IR) and a full compilation pipeline to automatically translate Hamiltonians into limited hardware connectivity physical circuits, demonstrating a systematic and extensible compilation framework for hybrid CV-DV quantum computing.</span></p>
<h3><b>Benchmark and Circuit Simulator</b></h3>
<p><span style="font-weight: 400;">We also presented </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/2603.04398"><b>HyQBench</b></a> <span style="font-weight: 400;">by Mohapatra et al., an open-source benchmark suite implemented in Bosonic Qiskit and QuTiP. HyQBench covers eight representative hybrid circuits spanning three abstraction levels: primitives, algorithms, and applications. These include cat state generation, GKP state preparation, CV-to-DV state transfer, </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://ieeexplore.ieee.org/abstract/document/11129874"><span style="font-weight: 400;">CV-DV QFT</span></a><span style="font-weight: 400;">, </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/2501.11735"><span style="font-weight: 400;">CV-DV VQE</span></a><span style="font-weight: 400;">, CV-QAOA, Jaynes-Cummings-Hubbard (JCH) Hamiltonian simulation, and </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.nature.com/articles/s41467-025-67694-5"><span style="font-weight: 400;">Shor’s algorithm</span></a><span style="font-weight: 400;">.</span></p>
<p><span style="font-weight: 400;">One key takeaway is that hybrid architectures can reduce hardware resources dramatically for some workloads. For example, simulating a 3-site JCH model in a DV-only encoding requires 9 qubits and 393 CNOT gates, whereas a hybrid implementation uses only 3 qumodes, 3 qubits, and 8 gates. This kind of reduction highlights why benchmarking hybrid systems requires more than simply counting qubits.</span></p>
<p><span style="font-weight: 400;">To support this, we introduced a feature map tailored to hybrid systems. In addition to standard structural metrics such as gate counts, circuit depth, and qubit/qumode counts, we proposed three CV-DV-specific metrics: Wigner negativity as a proxy for non-classicality and classical simulation hardness, truncation cost to quantify population near the Fock cutoff, and maximum energy. These metrics help separate workloads with very different simulation and execution behavior. For example, JCH simulation remains relatively close to Gaussian behavior, while CV-QAOA and Shor’s algorithm exhibit higher Wigner negativity and are harder to simulate classically.</span></p>
<p><span style="font-weight: 400;">We also discussed early hardware validation. A cat-state preparation benchmark was executed on Sandia National Laboratories’ QSCOUT trapped-ion platform and achieved a fidelity of 0.71. HyQBench was further used to calibrate conditional displacement gates on the same platform, reinforcing the need for standardized benchmark suites that support both evaluation and device calibration. The full paper is available at </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/2603.04398"><span style="font-weight: 400;">https://arxiv.org/abs/2603.04398</span></a><span style="font-weight: 400;">.</span></p>
<p><span style="font-weight: 400;">To lower the barrier to entry for this area, we also developed </span><b>HyQSim</b><span style="font-weight: 400;">, a browser-based hybrid CV-DV circuit simulator that requires no installation. HyQSim supports drag-and-drop circuit construction, arbitrary Fock cutoffs, and built-in visualization through Wigner plots, Fock-state amplitudes, and Bloch sphere views. It is available at </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://cvdv.ncsu.edu/resources/simulator/"><span style="font-weight: 400;">https://cvdv.ncsu.edu/resources/simulator/</span></a><span style="font-weight: 400;">, and the code is hosted at </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://github.com/shubdeepmohapatra01/HyQSim/"><span style="font-weight: 400;">https://github.com/shubdeepmohapatra01/HyQSim/</span></a><span style="font-weight: 400;">.</span></p>
<h3><b>Programming</b></h3>
<p><span style="font-weight: 400;">Finally, we discussed programming support for hybrid CV-DV systems. Quantum programming languages and frameworks have developed many important ideas over the years, including </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://dl.acm.org/doi/10.1007/11417170_26"><span style="font-weight: 400;">linear quantum types</span></a><span style="font-weight: 400;"> for enforcing the no-cloning theorem, </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://dl.acm.org/doi/10.1145/3385412.3386007"><span style="font-weight: 400;">automatic uncomputation</span></a><span style="font-weight: 400;"> of ancilla qubits, and </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://dl.acm.org/doi/10.1145/2499370.2462177"><span style="font-weight: 400;">dynamic lifting of classical variables</span></a><span style="font-weight: 400;"> for mid-circuit measurement. Hybrid quantum computing introduces an additional requirement: heterogeneous quantum registers containing both qubits and qumodes.</span></p>
<p><span style="font-weight: 400;">To address this challenge, we developed </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/2603.10919"><b>Hybridlane</b></a><span style="font-weight: 400;">, a CV-DV quantum programming framework built on </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/1811.04968"><span style="font-weight: 400;">PennyLane</span></a><span style="font-weight: 400;">. By extending PennyLane, Hybridlane inherits a broad library of qubit algorithms, gates, and compilation routines while remaining familiar to existing users. Hybridlane tracks wire types automatically through symbolic circuit analysis and type inference, enabling scalable circuit construction, platform independence, and integration with downstream compilation flows.</span></p>
<p><span style="font-weight: 400;">The tutorial concluded with example workflows using Hybridlane. In one example, we reused an existing PennyLane quantum phase estimation template for a CV-DV Hamiltonian simulation and then lowered it through symbolic compilation to a gate sequence executable on the </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/2209.11153"><span style="font-weight: 400;">Bosonic Qiskit</span></a><span style="font-weight: 400;"> backend. In another, we demonstrated a cross-platform workflow in which a conditional displacement gate was calibrated in simulation and then compiled for execution on Sandia’s QSCOUT trapped-ion platform. Together, these examples showed how hybrid quantum software can begin to support the same define-simulate-execute workflow that has become standard in mature qubit SDKs.</span></p>
<p><span style="font-weight: 400;">We hope Hybridlane helps enable a broader ecosystem of reusable software and research for hybrid quantum computing. It is available at </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://github.com/pnnl/hybridlane"><span style="font-weight: 400;">https://github.com/pnnl/hybridlane</span></a><span style="font-weight: 400;">.</span></p>
<h3><b>Closing</b></h3>
<p><span style="font-weight: 400;">Hybrid CV-DV computing sits at the intersection of quantum hardware, computer architecture, compilation, and programming systems. We hope this tutorial helps make the area more accessible to researchers across architecture, systems, programming languages, and quantum information, and we invite readers to explore the tutorial materials, benchmarks, and tools linked above.</span></p>
<h3><strong>About the Authors</strong></h3>
<p><span style="font-weight: 400;"><strong>Yuan Liu</strong> is an Assistant Professor of Electrical &amp; Computer Engineering and Computer Science at North Carolina State University. Prior to joining the NC State faculty, he was a postdoctoral researcher at the Massachusetts Institute of Technology. His research interests lie at the intersection of quantum computing, quantum engineering, quantum algorithms/architectures and applications.</span></p>
<p><span style="font-weight: 400;"><strong>Zihan Chen</strong> is a Ph.D. student in computer systems at Rutgers University, advised by Prof. Eddy Z. Zhang. His research focuses on compiler and system-level techniques, as well as parallel computing, to enhance the efficiency, programmability, scalability, and fault tolerance of emerging quantum computing systems.</span></p>
<p><span style="font-weight: 400;"><strong>Shubdeep Mohapatra</strong> is a Ph.D. candidate in Computer Engineering at NC State University, advised by Prof. Huiyang Zhou and Prof. Yuan Liu. His research focuses on quantum error characterization, mitigation, and benchmarking, aimed at improving the reliability and fault tolerance of near-term quantum computing systems.</span></p>
<p><span style="font-weight: 400;"><strong>Jim Furches</strong> is a post-masters research associate at Pacific Northwest National Laboratory. His current research interests are in quantum benchmarking, algorithms, and quantum programming and compilation.</span></p>
<p><span style="font-weight: 400;"><strong>Zheng (Eddy) Zhang</strong> is a Professor in the Department of Computer Science at Rutgers University. Her research focuses on full-stack compiler and programming systems for quantum computing. She studies how to better coordinate quantum applications, programming languages, intermediate representations, compilation, pulse-level control, and hardware architecture to improve the performance, usability, and scalability of quantum systems.</span></p>
<p><span style="font-weight: 400;"><strong>Huiyang Zhou</strong> is a Professor of Electrical and Computer Engineering at North Carolina State University. His current research interests include GPU architecture, processor security, and quantum computing.</span></p>
<p class="disclaim"><strong>Disclaimer:</strong> <em>These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.</em></p><Img align="left" border="0" height="1" width="1" alt="" style="border:0;float:left;margin:0;padding:0;width:1px!important;height:1px!important;" hspace="0" src="https://feeds.feedblitz.com/~/i/954105707/0/sigarch-cat">
]]>
</content:encoded>
			<wfw:commentRss>https://feeds.feedblitz.com/~/954105707/0/sigarch-cat~Beyond-Qubits-A-Systems-View-of-Hybrid-CVDV-Quantum-Computing/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	<post-id xmlns="com-wordpress:feed-additions:1">102865</post-id></item>
<item>
<feedburner:origLink>https://www.sigarch.org/computer-architectures-alphazero-moment-is-here/</feedburner:origLink>
		<title>Computer Architecture&#8217;s AlphaZero Moment is Here</title>
		<link>https://feeds.feedblitz.com/~/953617784/0/sigarch-cat~Computer-Architectures-AlphaZero-Moment-is-Here/</link>
		<comments>https://feeds.feedblitz.com/~/953617784/0/sigarch-cat~Computer-Architectures-AlphaZero-Moment-is-Here/#comments</comments>
		<pubDate>Fri, 10 Apr 2026 14:00:43 +0000</pubDate>
		<dc:creator><![CDATA[Karu Sankaralingam]]></dc:creator>
		<category><![CDATA[ACM SIGARCH]]></category>
		<category><![CDATA[Architecture]]></category>
		<category><![CDATA[Evaluation]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<guid isPermaLink="false">https://www.sigarch.org/?p=102162</guid>
		<description><![CDATA[<div><img width="300" xheight="187" src="https://www.sigarch.org/wp-content/uploads/2026/04/Gemini_Generated_Image_yndetsyndetsynde-1-300x187.png" class="attachment-medium size-medium wp-post-image" alt="" style="max-width:100% !important;height:auto !important;margin-bottom:15px;margin-left:15px;float:right;"  loading="lazy" /></div>For decades, we have designed chips in fundamentally the same way: human intuition applied to a vanishingly small slice of an impossibly large design space. That paradigm worked when Moore&#8217;s Law was lifting everything. We could afford to be wrong. We could afford to miss the best design. Process scaling would close the gap. That [&#8230;]]]>
</description>
				<content:encoded><![CDATA[<div><img width="300" height="187" src="https://www.sigarch.org/wp-content/uploads/2026/04/Gemini_Generated_Image_yndetsyndetsynde-1-300x187.png" class="attachment-medium size-medium wp-post-image" alt="" style="margin-bottom:15px;margin-left:15px;float:right;" decoding="async" loading="lazy" /></div><div>
<div></div>
<div>For decades, we have designed chips in fundamentally the same way: human intuition applied to a vanishingly small slice of an impossibly large design space. That paradigm worked when Moore&#8217;s Law was lifting everything. We could afford to be wrong. We could afford to miss the best design. Process scaling would close the gap.</div>
<div></div>
<div>That world is over. In a recent position paper — <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/2604.03312">&#8220;Computer Architecture&#8217;s AlphaZero Moment: Automated Discovery in an Encircled World&#8221;</a> — I argue that we are at an inflection point. Not a gradual shift, but a structural break in how architecture must be practiced.</div>
<div></div>
<h3>From Idea Scarcity to Evaluation Scarcity</h3>
<div>The central claim is simple, but uncomfortable:</div>
<div></div>
<div><em>Computer architecture is no longer bottlenecked by ideas. It is bottlenecked by evaluation and telemetry.</em></div>
<div></div>
<div>For decades, the field has implicitly assumed that ideas are scarce — that the role of the architect is to generate the one clever mechanism worth exploring. Everything else follows. But recent evidence suggests the opposite. With modern large language models and agentic pipelines, hundreds of viable architectural ideas can be generated per day, thousands of candidate designs can be evaluated per week, and design cycles can compress from months to weeks.</div>
<div></div>
<div>This is not speculative. We built a system called the <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://github.com/VerticalResearchGroup/Gauntlet">Gauntlet</a> and tested it on 85 papers from ISCA 2025 and HPCA 2026 — largely outside the model&#8217;s training data. Across 475 independent runs, it produced viable architectural mechanisms 95% of the time: independently re-deriving authors&#8217; exact solutions in 48% of cases, and proposing valid alternatives the authors never considered in another 50%. Each took 10–20 minutes. This flips a foundational assumption of the field. If ideas are abundant, then the limiting factor is no longer creativity — it is <strong>which ideas we can evaluate, validate, and trust</strong>. This <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://pages.cs.wisc.edu/~karu/ArchAlphaZero/zero-arch/html/">link</a> contains the corpus of problem statements and Gauntlet&#8217;s solutions.</div>
</div>
<div></div>
<div>
<h4>1. Evaluation is the new bottleneck</h4>
<p>We are moving from a world where the question was &#8220;Can we come up with a good idea?&#8221; to one where the question becomes &#8220;Can we evaluate 10,000 ideas fast enough to find the best one?&#8221; This elevates simulation infrastructure, analytical modeling, and verification into the central problems of the field. The &#8220;PhD student for three months&#8221; implementation bottleneck is already eroding — our system built first-principles performance models from papers in under 20 minutes. What replaces it is a race to build faster, more accurate, and more scalable evaluation pipelines.</p>
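<p>To make &#8220;first-principles performance model&#8221; concrete, here is a toy roofline-style model; the peak-compute and bandwidth numbers below are invented for illustration and are not drawn from the Gauntlet system or the paper.</p>
<pre>
# Toy roofline-style analytical model; machine parameters are hypothetical.
def attainable_gflops(arithmetic_intensity, peak_gflops=100.0, mem_bw_gbs=50.0):
    """Throughput is capped by either peak compute or bandwidth * intensity."""
    return min(peak_gflops, mem_bw_gbs * arithmetic_intensity)

print(attainable_gflops(0.25))   # 12.5  -> bandwidth-bound kernel
print(attainable_gflops(10.0))   # 100.0 -> compute-bound kernel
</pre>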
<h4>2. The telemetry divide</h4>
<div>If evaluation becomes central, then <strong>ground truth becomes everything. </strong>Over time, access to closed-loop deployment telemetry — real workloads, real performance counters, real system behavior at scale, and in low-level depth — may matter as much as architectural insight itself. This creates a risk of structural divide. Academic research, long dependent on proxy benchmarks, could drift further from production reality unless we collectively rethink how we share and access workload data.</div>
<h4>3. The end of the old boundary</h4>
<div>The traditional separation between &#8220;chip company&#8221; and &#8220;cloud provider&#8221; begins to dissolve. Automated architecture requires three tightly coupled capabilities: deployment (to generate telemetry), infrastructure (to evaluate designs at scale), and silicon expertise (to realize designs physically). No single traditional player owns all three. The result is convergence — either through vertical integration or new hybrid ecosystems.</div>
<h3>The Deeper Claim</h3>
<div>The more provocative claim is not about tools — it is about limits. Human-driven architecture is becoming structurally outmatched by the scale of the design space. This is not a statement about human ability. It is about combinatorics. The architectural search space — spanning parametric and structural choices — is effectively unbounded. Humans sample an infinitesimal fraction of it. That was acceptable in an era of abundance. It is not acceptable in an era where architectural efficiency is the primary lever for progress. The analogy to <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/1712.01815">AlphaZero </a> is not rhetorical. It is structural: when search, evaluation, and feedback loops become fast enough, intuition gives way to systematic exploration.</div>
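<div>As a back-of-envelope illustration of that combinatorics, consider only a handful of independent design knobs; the option counts below are invented for illustration and are not taken from the paper.</div>
<pre>
# Hypothetical per-knob option counts (cache size, associativity, prefetcher, issue width, ...).
import math

choices = [10, 8, 6, 12, 16, 4, 20, 8, 5, 10]
print(math.prod(choices))   # ~2.9 billion configurations from just 10 parametric knobs
</pre>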
<h3>What This Means for Research — and Teaching</h3>
<div>If this framing is even partially correct, it forces a rethinking of what it means to &#8220;do&#8221; computer architecture research. Several shifts seem likely. If machines can generate many viable solutions, identifying the <em>right problem</em> becomes the scarce intellectual act. Evaluation frameworks, modeling techniques, and telemetry integration may matter more than individual architectural ideas. And the reliance on fixed benchmark suites becomes increasingly fragile in a world driven by dynamic, evolving workloads.</div>
<div></div>
<div>The full paper includes a set of predictions and my opinions on how I see this playing out. This extends to how we teach. Do we still emphasize canonical microarchitectures, or shift toward trade-off reasoning, evaluation frameworks, and interpreting machine-generated designs? What does it mean to train a researcher when idea generation itself is becoming automated?</div>
<h3>A Call for Collaboration</h3>
<div>This is not a settled direction — it is a hypothesis that needs to be stress-tested by the community. If this resonates (or if you think it is completely wrong), I would love to engage on: new models for teaching architecture, shared evaluation infrastructure and artifacts, privacy-preserving approaches to workload telemetry, and workshops focused on problem formulation rather than solution novelty. If this is even half right, we may need to rethink our identity as a field. Let&#8217;s debate it.</div>
<div></div>
<div><strong>About the author:</strong> Karthikeyan Sankaralingam is Principal Research Scientist at NVIDIA and Professor at UW-Madison.</div>
</div>
<p class="disclaim"><strong>Disclaimer:</strong> <em>These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.</em></p><Img align="left" border="0" height="1" width="1" alt="" style="border:0;float:left;margin:0;padding:0;width:1px!important;height:1px!important;" hspace="0" src="https://feeds.feedblitz.com/~/i/953617784/0/sigarch-cat">
]]>
</content:encoded>
			<wfw:commentRss>https://feeds.feedblitz.com/~/953617784/0/sigarch-cat~Computer-Architectures-AlphaZero-Moment-is-Here/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	<post-id xmlns="com-wordpress:feed-additions:1">102162</post-id></item>
<item>
<feedburner:origLink>https://www.sigarch.org/spilling-the-neural-tea-a-journey-down-the-side-channel/</feedburner:origLink>
		<title>Spilling the Neural Tea: A Journey Down the Side-Channel</title>
		<link>https://feeds.feedblitz.com/~/953386850/0/sigarch-cat~Spilling-the-Neural-Tea-A-Journey-Down-the-SideChannel/</link>
		<comments>https://feeds.feedblitz.com/~/953386850/0/sigarch-cat~Spilling-the-Neural-Tea-A-Journey-Down-the-SideChannel/#respond</comments>
		<pubDate>Mon, 06 Apr 2026 15:37:22 +0000</pubDate>
		<dc:creator><![CDATA[Adnan Rakin]]></dc:creator>
		<category><![CDATA[ACM SIGARCH]]></category>
		<category><![CDATA[deep neural networks]]></category>
		<category><![CDATA[Security]]></category>
		<category><![CDATA[side-channels]]></category>
		<guid isPermaLink="false">https://www.sigarch.org/?p=101933</guid>
		<description><![CDATA[<div><img width="300" xheight="164" src="https://www.sigarch.org/wp-content/uploads/2026/04/Gemini_Generated_Image_ptypq1ptypq1ptyp-300x164.png" class="attachment-medium size-medium wp-post-image" alt="" style="max-width:100% !important;height:auto !important;margin-bottom:15px;margin-left:15px;float:right;"  loading="lazy" /></div>Years ago, I came across three pioneering works (CSI-NN, Cache Telepathy, and DeepSniffer) in the field of reverse engineering neural networks that inspired my journey into side-channel attacks to uncover the secrets of modern Deep Neural Networks (DNNs). Fast forward to today, and there has been significant exploitation of side-channel attacks to discover the secrets [&#8230;]]]>
</description>
				<content:encoded><![CDATA[<div><img width="300" height="164" src="https://www.sigarch.org/wp-content/uploads/2026/04/Gemini_Generated_Image_ptypq1ptypq1ptyp-300x164.png" class="attachment-medium size-medium wp-post-image" alt="" style="margin-bottom:15px;margin-left:15px;float:right;" decoding="async" loading="lazy" /></div><p><span style="font-weight: 400;">Years ago, I came across three pioneering works (</span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.usenix.org/conference/usenixsecurity19/presentation/batina"><span style="font-weight: 400;">CSI-NN</span></a><span style="font-weight: 400;">, </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://iacoma.cs.uiuc.edu/iacoma-papers/usenix20.pdf"><span style="font-weight: 400;">Cache Telepathy</span></a><span style="font-weight: 400;">, and </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://dl.acm.org/doi/10.1145/3373376.3378460"><span style="font-weight: 400;">DeepSniffer</span></a><span style="font-weight: 400;">) in the field of reverse engineering neural networks that inspired my journey into side-channel attacks to uncover the secrets of modern Deep Neural Networks (DNNs). Fast forward to today, and there has been significant exploitation of side-channel attacks to discover the secrets of neural networks. It&#8217;s a good time to provide an overview of where we stand, the outlook for the future, and the challenges ahead.</span></p>
<p><b>Motivation: </b><span style="font-weight: 400;">Let&#8217;s take a step back and first try to understand why we care about secrets in deep learning models. It basically boils down to two fundamental challenges associated with deep learning: i) financial and ii) security and privacy challenges. In general, DNNs are intellectual property (IP), as they are products developed over years of research, implementation, and investment in computing units, and they entail significant training costs (time, energy, and labor), making them a valuable asset for their owners. Just to give a rough estimate, OpenAI’s </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://openai.com/index/gpt-4-research/"><span style="font-weight: 400;">GPT-4</span></a><span style="font-weight: 400;"> cost more than $100 million to train, and its </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://openai.com/gpt-5/"><span style="font-weight: 400;">GPT-5</span></a><span style="font-weight: 400;"> model is expected to be more than 5x as expensive (</span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.forbes.com/sites/katharinabuchholz/2024/08/23/the-extreme-cost-of-training-ai-models/"><span style="font-weight: 400;">Cost of Training GPT</span></a><span style="font-weight: 400;">). I do not know about you, but if I spent $100 million on something, I would care about protecting it. The second challenge is that a leaked model secret gives an adversary white-box knowledge, which is extremely powerful in security and privacy settings. Any adversary with knowledge of a target victim&#8217;s model architecture (e.g., model type, layer sequence, and number) and weight information, formally defined as “white-box,” can launch powerful security (adversarial attacks) and privacy threats (model inversion attacks/membership inference attacks). As highlighted in Figure 1, the attacker’s final objective in the DNN reverse-engineering attack is to gain white-box privileges either to steal IP for financial gain or to launch subsequent attacks.</span></p>
<p><i><span style="font-weight: 400;">In summary, in security and privacy research, defining the threat model is the first step towards any exploitation, and the underlying assumption is often that a reverse-engineering attack has successfully uncovered the model architecture, weights, and other hyperparameters.</span></i></p>
<p><b>Attack Objectives: </b><span style="font-weight: 400;">By now, we have established that an attacker’s goal is to uncover two key properties of a victim DNN: its architecture and its parameters. However, this is an oversimplified goal and can often be misleading. To understand this, let&#8217;s consider a deep neural network as a function of </span><span style="font-weight: 400;">x, denoted</span><span style="font-weight: 400;"> </span><span style="font-weight: 400;">f</span><span style="font-weight: 400;">(x)</span><span style="font-weight: 400;">.  If an attacker wants to recover the exact victim model, their objective is for the stolen model to be identical to the original </span><span style="font-weight: 400;">f</span><span style="font-weight: 400;">(x)</span><span style="font-weight: 400;">, which is practically impossible for large-scale DNNs, whether using existing side-channel attacks or the exact victim dataset. As a result, a more practical and plausible goal for an attacker would be to achieve functional equivalence. If the stolen function is different, such as </span><span style="font-weight: 400;">g</span><span style="font-weight: 400;">(x)</span><span style="font-weight: 400;">,</span><span style="font-weight: 400;"> then, for incentive purposes, all an attacker cares about is that these two functions produce identical output,  i.e., </span><span style="font-weight: 400;">f</span><span style="font-weight: 400;">(x)=</span> <span style="font-weight: 400;">g</span><span style="font-weight: 400;">(x)</span><span style="font-weight: 400;">, for inputs </span><span style="font-weight: 400;">x</span><span style="font-weight: 400;"> that are of the attacker&#8217;s interest. As a result, achieving functional equivalence means recovering the DNN model architecture, often as close as possible to the victim architecture&#8217;s topology. On the weight side, even if an attacker cannot extract the exact weights, they must aim for a weight-space solution that captures the victim model&#8217;s functionality.</span></p>
<p><i><span style="font-weight: 400;">In summary, to steal a copy of the victim model/function, an attacker must identify the victim model architecture. In modern deep learning, where most practical applications use some version of a DNN model from an existing pool (e.g., GPT, Llama), recovering the architecture often boils down to detecting the model&#8217;s topology. Once the architecture is revealed, the attacker must recover the model parameters/weights, which is often a challenging part of the attack. Then again, as we discussed earlier, exact model recovery can be challenging, but achieving functional equivalence is a modest objective. Most importantly, to achieve functional equivalence, the attacker may not need to reveal the exact numerical weights; rather, gradually recovering coarse-grained information (e.g., weight sparsity, quantization pattern, weight distribution) is often sufficient.</span></i></p>
<p><img loading="lazy" decoding="async" class="alignnone wp-image-101935" src="https://www.sigarch.org/wp-content/uploads/2026/04/image7.png" alt="" width="884" height="483" /></p>
<p>Figure 1: Spectrum of attack threats characterized by attacker’s knowledge: Black-Box (No Knowledge), Grey-Box (Partial Knowledge, e.g., architecture), and White-box (Complete knowledge of model architecture and weights), the ultimate goal of reverse-engineering (AI-generated).</p>
<p><b>Attack Techniques and Capabilities.</b> <span style="font-weight: 400;">The two popular types of side-channel attacks, physical and microarchitectural, are exploited under two different threat-model settings. In edge or embedded devices, the physical side channel is the dominant threat, and several works (</span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.usenix.org/conference/usenixsecurity19/presentation/batina"><span style="font-weight: 400;">CSI-NN</span></a><span style="font-weight: 400;">, </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.usenix.org/conference/usenixsecurity25/presentation/horvath"><span style="font-weight: 400;">BarraCUDA</span></a><span style="font-weight: 400;">) have shown that it is possible to recover the model architecture and weights of simple neural networks. On the other hand, microarchitectural side channels are a popular choice for resource-sharing cloud environments where users can upload and run their code in a colocated environment (e.g., </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://aws.amazon.com/sagemaker/"><span style="font-weight: 400;">Amazon SageMaker</span></a><span style="font-weight: 400;"> and </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://cloud.google.com/ml-engine/docs/technical-overview"><span style="font-weight: 400;">Google ML Engine</span></a><span style="font-weight: 400;">). Microarchitectural attacks have been successful in recovering model architecture across the board using </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.usenix.org/conference/usenixsecurity20/presentation/yan"><span style="font-weight: 400;">cache timing channels</span></a><span style="font-weight: 400;">, </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://dl.acm.org/doi/10.1145/3373376.3378460"><span style="font-weight: 400;">memory access patterns</span></a><span style="font-weight: 400;">, and </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://ieeexplore.ieee.org/document/9153424/"><span style="font-weight: 400;">GPU context switching</span></a><span style="font-weight: 400;">. I acknowledge that there are many ways to recover DNN model weights, including </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/tramer"><span style="font-weight: 400;">learning-based approaches</span></a><span style="font-weight: 400;"> and </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.usenix.org/conference/usenixsecurity20/presentation/jagielski"><span style="font-weight: 400;">mathematical recovery</span></a><span style="font-weight: 400;"> techniques. In this blog post, I focus on side-channel attacks; at the same time, learning-based approaches can complement side-channel attacks once the architecture information has been leaked.</span></p>
<p><i><span style="font-weight: 400;">In summary, while side-channel attacks have been successful in leaking model architecture information, as the scale of modern DNNs, e.g., LLM weights, continues to reach new heights of billions, none of the existing side channels can scalably and predictably recover model parameter information. A common workaround would be to support these methods with a learning approach, assuming an attacker has a partial training set, which may not be practical, even in a resource-sharing environment where data remains private.</span></i></p>
<p><b><i>Future Challenges and Opportunities: </i></b></p>
<p><b>What is the future of architecture-recovery attacks, given the success of existing side channels?</b></p>
<p><i><span style="font-weight: 400;">As the next wave of vision and language domain architectures emerges, they present new challenges and opportunities for the microarchitectural side-channel attack community. These models require modern compute support, which can accelerate their inference (e.g., tensor cores), as GPUs become more modern and newer generations may leave new traces of side-channel information. Hence, these newer compute platforms (e.g., new GPUs) and their associated architectural support demand new innovation in side-channel capabilities to recover the model architecture. We must remember that architecture recovery is essential; without it, model parameter recovery is no longer useful. Moreover, as LLMs emerge as the dominant model, the question is not just about recovering weights or architecture; leaking other components, such as KV cache in a multi-tenant setting, can lead to </span></i><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.ndss-symposium.org/wp-content/uploads/2025-1772-paper.pdf"><i><span style="font-weight: 400;">privacy leakage</span></i></a><i><span style="font-weight: 400;">. </span></i></p>
<p><b>Can a microarchitectural side channel alone ever be sufficient to recover model weight information? </b></p>
<p><span style="font-weight: 400;">The sheer scale of the modern model poses an even greater challenge for recovering weights, making direct recovery an ambitious, and even impossible, goal; instead, we should focus on functional equivalence. To achieve functional equivalence, weight recovery methods can set tiny stepping stones to augment learning-based recovery. </span></p>
<p><i><span style="font-weight: 400;">Complete weight recovery using a side channel at the scale of LLMs or even a smaller vision model may be too ambitious. Instead, the attacks should focus on coarse-grained information about weights, such as model sparsity levels, quantization mechanisms, weight sign recovery, and other optimization techniques. The key idea is to achieve functional equivalence by first recovering coarse-grained information, which is sufficient to support other learning-based recovery. It is time to work towards an achievable target: recovering this statistical weight-level knowledge and studying how critical their role is in improving subsequent attacks. As models and their computation units are increasingly optimized, leaking information such as sparsity levels or bit-widths will become more feasible by detecting optimized paths through side-channel leakage.</span></i></p>
<p><span style="font-weight: 400;">Finally, an attack is never the end goal. We probe attacks from every angle so we can study them before any attacker ever thinks about them. The endgame is always to develop subsequent defenses, which I leave for another discussion.</span></p>
<p><strong>About the author: </strong></p>
<p><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.binghamton.edu/computer-science/people/profile.html?id=arakin"><span style="font-weight: 400;">Adnan Siraj Rakin</span></a><span style="font-weight: 400;"> is an Assistant Professor at the School of Computing at Binghamton University. He received his Master&#8217;s (2021) and PhD (2022) from Arizona State University. He works on emerging security and privacy challenges in modern AI </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.usenix.org/conference/usenixsecurity21/presentation/rakin"><span style="font-weight: 400;">systems</span></a><span style="font-weight: 400;"> and </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://openaccess.thecvf.com/content/ICCV2023/papers/Ahmed_SSDA_Secure_Source-Free_Domain_Adaptation_ICCV_2023_paper.pdf"><span style="font-weight: 400;">algorithms</span></a><span style="font-weight: 400;">. His paper on</span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://ieeexplore.ieee.org/document/9833743/"><span style="font-weight: 400;"> DNN model weight recovery</span></a><span style="font-weight: 400;"> has been </span><span style="font-weight: 400;">crowned as </span><span style="font-weight: 400;">Top Picks in Hardware and Embedded Security in 2024. </span></p>
<p class="disclaim"><strong>Disclaimer:</strong> <em>These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.</em></p><Img align="left" border="0" height="1" width="1" alt="" style="border:0;float:left;margin:0;padding:0;width:1px!important;height:1px!important;" hspace="0" src="https://feeds.feedblitz.com/~/i/953386850/0/sigarch-cat">
]]>
</content:encoded>
			<wfw:commentRss>https://feeds.feedblitz.com/~/953386850/0/sigarch-cat~Spilling-the-Neural-Tea-A-Journey-Down-the-SideChannel/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	<post-id xmlns="com-wordpress:feed-additions:1">101933</post-id></item>
<item>
<feedburner:origLink>https://www.sigarch.org/to-sparsify-or-to-quantize-a-hardware-architecture-view/</feedburner:origLink>
		<title>To Sparsify or To Quantize: A Hardware Architecture View</title>
		<link>https://feeds.feedblitz.com/~/950020673/0/sigarch-cat~To-Sparsify-or-To-Quantize-A-Hardware-Architecture-View/</link>
		<comments>https://feeds.feedblitz.com/~/950020673/0/sigarch-cat~To-Sparsify-or-To-Quantize-A-Hardware-Architecture-View/#respond</comments>
		<pubDate>Thu, 12 Mar 2026 15:00:43 +0000</pubDate>
		<dc:creator><![CDATA[Sai Srivatsa Bhamidipati]]></dc:creator>
		<category><![CDATA[ACM SIGARCH]]></category>
		<category><![CDATA[Accelerators]]></category>
		<category><![CDATA[deep neural networks]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<guid isPermaLink="false">https://www.sigarch.org/?p=100754</guid>
		<description><![CDATA[<div><img width="300" xheight="164" src="https://www.sigarch.org/wp-content/uploads/2026/03/Blog-Image-2-Picsart-AiImageEnhancer-300x164.png" class="attachment-medium size-medium wp-post-image" alt="" style="max-width:100% !important;height:auto !important;margin-bottom:15px;margin-left:15px;float:right;"  loading="lazy" /></div>The debate of sparsity versus quantization has made its rounds in the ML optimization community for many years. Now, with the Generative AI revolution, the debate is intensifying. While these might both seem like simple mathematical approximations to an AI researcher, for a hardware architect, they present fundamentally different sets of challenges. Many architects in [&#8230;]]]>
</description>
				<content:encoded><![CDATA[<div><img width="300" height="164" src="https://www.sigarch.org/wp-content/uploads/2026/03/Blog-Image-2-Picsart-AiImageEnhancer-300x164.png" class="attachment-medium size-medium wp-post-image" alt="" style="margin-bottom:15px;margin-left:15px;float:right;" decoding="async" loading="lazy" /></div><p><span style="font-weight: 400;">The debate of sparsity versus quantization has made its rounds in the ML optimization community for many years. Now, with the Generative AI revolution, the debate is intensifying. While these might both seem like simple mathematical approximations to </span>an AI researcher, for a hardware architect, they present fundamentally different sets of challenges. Many architects in the AI hardware space are deeply familiar with watching the scale tip from one side to the other, constantly searching for a pragmatic balance. Let&#8217;s look at both techniques, unpack the architectural challenges they introduce, and explore whether a &#8220;best of both worlds&#8221; scenario is truly possible (Spoiler: It depends).</p>
<p><i><span style="font-weight: 400;">Note: We will only be looking at compute-bound workloads, which traditionally rely on dense compute units such as tensor cores or MXUs. We will set aside memory-bound workloads for now, as they introduce their own distinct set of tradeoffs for sparsity and quantization.</span></i></p>
<h2><b>Sparsity</b></h2>
<p><span style="font-weight: 400;">The core idea of sparsity is beautifully simple: if a neural network weight is zero (or close enough to it), just don&#8217;t do the math. Theoretically, pruning can save massive amounts of compute and memory bandwidth.</span></p>
<p><b>The Architecture Challenge: The Chaos of Unstructured Data</b></p>
<p><span style="font-weight: 400;">The golden goose of this approach is fine-grained, unstructured sparsity. It offers a high level of achievable compression through pruning, but results in a completely random distribution of zero elements. Traditional dense hardware </span><i><span style="font-weight: 400;">hates</span></i><span style="font-weight: 400;"> this. Randomness leads to irregular memory accesses, unpredictable load balancing across cores, and terrible cache utilization. High-performance SIMD units end up starving while the memory controller plays hopscotch trying to fetch the next non-zero value. To architect around this, pioneering unstructured sparse accelerators—such as</span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/1602.01528"> <span style="font-weight: 400;">EIE</span></a><span style="font-weight: 400;"> and</span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/1708.04485"> <span style="font-weight: 400;">SCNN</span></a><span style="font-weight: 400;">—had to rely heavily on complex routing logic, specialized crossbars, and deep queues just to keep the compute units fed, often trading compute area for routing overhead.</span></p>
<p><b>The Compromise: Structured and Coarse-Grained Sparsity</b></p>
<p><span style="font-weight: 400;">To tame this chaos, the industry shifted toward structured compromises. The universally embraced</span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://developer.nvidia.com/blog/accelerating-inference-with-sparsity-using-ampere-and-tensorrt/"> <span style="font-weight: 400;">N:M sparsity</span></a><span style="font-weight: 400;"> (popularized by NVIDIA&#8217;s Ampere architecture) forces exactly N non-zero elements in every block of M. This provides a predictable load-balancing mechanism where the hardware can perfectly schedule memory fetches and compute.</span></p>
<p><span style="font-weight: 400;">More recently, to tackle the quadratic memory bottleneck of long-context LLMs, we&#8217;ve seen a surge in modern </span><i><span style="font-weight: 400;">sparse attention mechanisms</span></i><span style="font-weight: 400;"> that leverage block sparsity. Techniques like</span> <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://github.com/mit-han-lab/Block-Sparse-Attention"><i><span style="font-weight: 400;">Block-Sparse Attention</span></i></a><span style="font-weight: 400;"> and </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/2003.05997"><span style="font-weight: 400;">Routing Attention</span></a><span style="font-weight: 400;"> enforce sparsity at the chunk or tile level. Instead of picking individual tokens, they route computation to contiguous blocks of tokens, allowing standard dense matrix multiplication engines to skip entire chunks while maintaining high MXU utilizations and contiguous memory access. Other approaches, like</span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/2309.17453"> <span style="font-weight: 400;">StreamingLLM</span></a><span style="font-weight: 400;">, evict older tokens entirely, retaining only local context and specific &#8220;heavy hitter&#8221; sink tokens.</span></p>
<p><span style="font-weight: 400;">The trade-off across these methods is clear: we exchange theoretical maximum efficiency for hardware-friendly predictability, paying a &#8220;tax&#8221; in metadata storage (index matrices), specialized multiplexing logic, and the persistent algorithmic risk of dropping contextually vital information.</span></p>
<h2><b>Quantization</b></h2>
<p><span style="font-weight: 400;">While sparsity aims to compute </span><i><span style="font-weight: 400;">less</span></i><span style="font-weight: 400;">, quantization aims to compute </span><i><span style="font-weight: 400;">smaller</span></i><span style="font-weight: 400;">. Shrinking datatypes from 32-bit floats (FP32) to INT8, or embracing emerging standards like the</span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf"> <span style="font-weight: 400;">OCP Microscaling Formats (MX) Specification</span></a><span style="font-weight: 400;"> (such as MXFP8 E4M3 and E5M2), acts as an immediate multiplier for memory bandwidth and capacity. But the frontier has pushed much further than 8-bit. Recent advancements in extreme quantization, such as</span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/2402.17764"> <span style="font-weight: 400;">BitNet b1.58</span></a><span style="font-weight: 400;"> (1-bit LLMs using ternary weights of {-1, 0, 1}) and 2-bit quantization schemes (like</span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/2210.17323"> <span style="font-weight: 400;">GPTQ</span></a><span style="font-weight: 400;"> or <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/2307.13304">Quip</a>), demonstrate that large language models can maintain remarkable accuracy even when weights are squeezed to their absolute theoretical limits.</span></p>
<p><b>The Architecture Challenge: The Tyranny of Metadata and Scaling Factors</b></p>
<p><span style="font-weight: 400;">From an architecture perspective, the challenge of extreme quantization isn&#8217;t just the math—it&#8217;s the metadata. To maintain accuracy at 4-bit, 2-bit, or sub-integer levels, algorithms demand fine-grained control, requiring per-channel, per-group, or even per-token dynamic scaling factors. Every time we shrink the primary datapath, the relative hardware overhead of managing these scaling factors skyrockets. Along with that, the quantization algorithm also becomes more fine grained, dynamic and complex. We are forced to add additional logic and even high-precision accumulators (often FP16 or FP32) just to handle the on-the-fly de-quantization and accumulation. We aggressively optimize the MAC (Multiply-Accumulate) units, only to trade that for the overhead of adding scaling factor handling and supporting a potentially new dynamic quantization scheme, which can outweigh the gains.</span></p>
<p><b>The Compromise: Algorithmic Offloading</b></p>
<p><span style="font-weight: 400;">To fix this without blowing up the complexity and area budget, the community relies on algorithmic co-design. Techniques like</span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/2211.10438"> <span style="font-weight: 400;">SmoothQuant</span></a><span style="font-weight: 400;"> effectively migrate the quantization difficulty offline, mathematically shifting the dynamic range from spiky, hard-to-predict activations into the statically known weights. Similarly,</span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/2306.00978"> <span style="font-weight: 400;">AWQ (Activation-aware Weight Quantization)</span></a><span style="font-weight: 400;"> identifies and protects a small fraction of &#8220;salient&#8221; weights to maintain accuracy without requiring complex, dynamic mixed-precision hardware pipelines. By absorbing the complexity into offline mathematics, these techniques allow the hardware to run mostly uniform, low-precision datatypes.</span></p>
<p><span style="font-weight: 400;">However, much like the routing tax in sparsity, this algorithmic offloading comes with some compromises. These methods heavily rely on static, offline calibration datasets. If a model encounters out-of-distribution data in production (a different language, an unusual coding syntax, or an unexpected prompt structure), the statically determined scaling factors can fail, leading to outlier clipping and catastrophic accuracy collapse. Furthermore, relying on offline preprocessing creates a rigid deployment pipeline that prevents the model from adapting to extreme activation spikes on the fly.</span></p>
<h2><b>Is there a &#8220;best of both worlds&#8221;?</b></h2>
<p><span style="font-weight: 400;">So, knowing these trade-offs, do we sparsify or do we quantize? Many years ago, the </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/1510.00149"><span style="font-weight: 400;">Deep Compression</span></a><span style="font-weight: 400;"> paper proved we could do both. But today, pulling this off at the scale of a 70-billion parameter LLM is incredibly difficult. It suffers from the classic hardware optimization catch-22 (see </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.sigarch.org/dont-put-all-your-tensors-in-one-basket-hardware-lottery/"><span style="font-weight: 400;">All in on Matmul?</span></a><span style="font-weight: 400;">) : </span><i><span style="font-weight: 400;">No one uses a new piece of hardware because it’s not supported by software, and it’s not supported by software because no one’s using it.</span></i></p>
<p><span style="font-weight: 400;">So what&#8217;s the path forward for hardware architects? In my opinion, the following:</span></p>
<ul>
<li style="font-weight: 400;"><b>Deep Hardware-Software Co-design:</b><span style="font-weight: 400;"> The days of throwing a generic matrix-multiplication engine at a model are over. We need to work directly with AI researchers so that when they design a new pruning threshold or a novel sub-byte data type, the hardware already has a streamlined, fast path for the metadata.</span></li>
<li style="font-weight: 400;"><b>Generalized Compression Abstractions:</b><span style="font-weight: 400;"> Historically, we have designed accelerators that are either &#8220;good at sparsity&#8221; (with complex routing networks) or &#8220;good at quantization&#8221; (with mixed-precision MACs). Moving forward, we need to view these not as orthogonal features, but as a unified spectrum of compression. Architectures must be designed to dynamically adapt—perhaps fluidly dropping structurally sparse blocks during a memory-bound decode phase, while leaning on extreme sub-byte quantization during a compute-heavy prefill phase—potentially even sharing the same underlying logic.</span></li>
<li style="font-weight: 400;"><b>Balance Efficiency and Programmability:</b><span style="font-weight: 400;"> As explored in the &#8220;All in on MatMul?&#8221; post, we need to keep our hardware flexible. Over-fitting to today&#8217;s specific sparsity pattern or quantization trick risks building being trapped in the local minimum. We must maintain enough programmability to enable future algorithm discovery and break free from the catch-22.</span></li>
</ul>
<p><span style="font-weight: 400;">Some notable research going along this path include </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/pdf/2405.20935"><span style="font-weight: 400;">Effective interplay between sparsity and quantization</span></a><span style="font-weight: 400;">, which proves the non-orthogonality of the two techniques and explores the interplay between them and also the </span><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.cs.toronto.edu/~mmozaffari/compression-trinity/index.html"><span style="font-weight: 400;">Compression Trinity</span></a><span style="font-weight: 400;"> work which takes a look at multiple techniques across sparsity, quantization and low rank approximation and tries to take a holistic view of the optimization space across the stack.</span></p>
<p><span style="font-weight: 400;">Ultimately, as alluded to before, there is no single silver bullet, and like all open architecture problems, the answer is always &#8220;it depends&#8221;.  But in the era of Generative AI, it depends on whether we view sparsity and quantization as competing alternatives or as pieces of the same puzzle. Perhaps it’s time we stop asking which one is better, and start designing architectures flexible enough to embrace the realities of both.</span></p>
<h3><b>About the Author:</b></h3>
<p><span style="font-weight: 400;">Sai Srivatsa Bhamidipati is a Senior Silicon Architect at Google working on the Google Tensor TPU in the Pixel phones. His primary focus is on efficient and scalable compute for Generative AI on the Tensor TPU.</span></p>
<h3><b>Authors’ Disclaimer:</b></h3>
<p><span style="font-weight: 400;">Portions of this post were edited with the assistance of AI models. Some references, notes and images were also compiled using AI tools. The content represents the opinions of the authors and does not necessarily represent the views, policies, or positions of Google or its affiliates.</span></p>
<p class="disclaim"><strong>Disclaimer:</strong> <em>These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.</em></p><Img align="left" border="0" height="1" width="1" alt="" style="border:0;float:left;margin:0;padding:0;width:1px!important;height:1px!important;" hspace="0" src="https://feeds.feedblitz.com/~/i/950020673/0/sigarch-cat">
]]>
</content:encoded>
			<wfw:commentRss>https://feeds.feedblitz.com/~/950020673/0/sigarch-cat~To-Sparsify-or-To-Quantize-A-Hardware-Architecture-View/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	<post-id xmlns="com-wordpress:feed-additions:1">100754</post-id></item>
<item>
<feedburner:origLink>https://www.sigarch.org/from-the-editors-desk-2026-edition/</feedburner:origLink>
		<title>From the Editor&#8217;s Desk &#8211; 2026 Edition</title>
		<link>https://feeds.feedblitz.com/~/944524166/0/sigarch-cat~From-the-Editors-Desk-Edition/</link>
		<comments>https://feeds.feedblitz.com/~/944524166/0/sigarch-cat~From-the-Editors-Desk-Edition/#respond</comments>
		<pubDate>Tue, 03 Feb 2026 20:19:39 +0000</pubDate>
		<dc:creator><![CDATA[Dmitry Ponomarev]]></dc:creator>
		<category><![CDATA[ACM SIGARCH]]></category>
		<category><![CDATA[Blog]]></category>
		<category><![CDATA[Editorial]]></category>
		<guid isPermaLink="false">https://www.sigarch.org/?p=96844</guid>
		<description><![CDATA[<div><img width="300" xheight="169" src="https://www.sigarch.org/wp-content/uploads/2026/02/AdobeStock_862939397-300x169.jpeg" class="attachment-medium size-medium wp-post-image" alt="" style="max-width:100% !important;height:auto !important;margin-bottom:15px;margin-left:15px;float:right;"  loading="lazy" /></div>As we close the book on 2025, Computer Architecture Today has seen another successful year of community engagement. We published 29 posts covering a wide spectrum of topics—from datacenter energy-efficiency to the evolving debate on LLMs in peer review, alongside trip reports from our major conferences. I want to thank all our authors for their [&#8230;]]]>
</description>
				<content:encoded><![CDATA[<div><img width="300" height="169" src="https://www.sigarch.org/wp-content/uploads/2026/02/AdobeStock_862939397-300x169.jpeg" class="attachment-medium size-medium wp-post-image" alt="" style="margin-bottom:15px;margin-left:15px;float:right;" decoding="async" loading="lazy" /></div><p>As we close the book on 2025, <i>Computer Architecture Today</i> has seen another successful year of community engagement. We published 29 posts covering a wide spectrum of topics—from datacenter energy-efficiency to the evolving debate on LLMs in peer review, alongside trip reports from our major conferences. I want to thank all our authors for their insights, with special appreciation for those who contributed multiple times.</p>
<p>Over the last year, we shifted our editorial model, moving from a roster of set contributors to a more flexible, open-submission approach. We also re-established our conference trip reports, highlighting top architecture venues.</p>
<p data-path-to-node="14,2">The blog thrives on new voices, and our door is always open. We are actively looking for:</p>
<ul data-path-to-node="14,3">
<li>
<p data-path-to-node="14,3,0,0"><b data-path-to-node="14,3,0,0" data-index-in-node="0">New Ideas:</b> If you have a topic in mind, please propose it using <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.sigarch.org/contribute/propose-a-blog-post-topic/">this link</a> or email me directly.</p>
</li>
<li>
<p data-path-to-node="14,3,1,0"><b data-path-to-node="14,3,1,0" data-index-in-node="0">Trip Reports:</b> Planning to attend a conference? Volunteer to share your experience.</p>
</li>
<li>
<p data-path-to-node="14,3,2,0"><b data-path-to-node="14,3,2,0" data-index-in-node="0">Event Summaries:</b> Organizers of workshops or tutorials are welcome to publicize their events through summary posts.</p>
</li>
<li>
<p data-path-to-node="14,3,3,0"><b data-path-to-node="14,3,3,0" data-index-in-node="0">Industry Perspectives:</b> We would like to hear from our industry colleagues about their take on the future landscape of computer architecture.</p>
</li>
</ul>
<p data-path-to-node="14,4">Finally, as AI tools proliferate, the conversation around their role in our paper reviewing process is far from over. I look forward to seeing more of that debate here.</p>
<p data-path-to-node="14,4">Here’s to the new advances in Computer Architecture in 2026!</p>
<p class="disclaim"><strong>Disclaimer:</strong> <em>These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.</em></p><Img align="left" border="0" height="1" width="1" alt="" style="border:0;float:left;margin:0;padding:0;width:1px!important;height:1px!important;" hspace="0" src="https://feeds.feedblitz.com/~/i/944524166/0/sigarch-cat">
]]>
</content:encoded>
			<wfw:commentRss>https://feeds.feedblitz.com/~/944524166/0/sigarch-cat~From-the-Editors-Desk-Edition/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	<post-id xmlns="com-wordpress:feed-additions:1">96844</post-id></item>
<item>
<feedburner:origLink>https://www.sigarch.org/multi-agent-memory-from-a-computer-architecture-perspective-visions-and-challenges-ahead/</feedburner:origLink>
		<title>Multi-Agent Memory from a Computer Architecture Perspective: Visions and Challenges Ahead</title>
		<link>https://feeds.feedblitz.com/~/940946942/0/sigarch-cat~MultiAgent-Memory-from-a-Computer-Architecture-Perspective-Visions-and-Challenges-Ahead/</link>
		<comments>https://feeds.feedblitz.com/~/940946942/0/sigarch-cat~MultiAgent-Memory-from-a-Computer-Architecture-Perspective-Visions-and-Challenges-Ahead/#respond</comments>
		<pubDate>Tue, 20 Jan 2026 15:19:11 +0000</pubDate>
		<dc:creator><![CDATA[Zhongming Yu and Jishen Zhao]]></dc:creator>
		<category><![CDATA[ACM SIGARCH]]></category>
		<category><![CDATA[Agents]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[Memory Consistency]]></category>
		<category><![CDATA[Memory Hierarchy]]></category>
		<guid isPermaLink="false">https://www.sigarch.org/?p=97929</guid>
		<description><![CDATA[<div><img width="300" xheight="200" src="https://www.sigarch.org/wp-content/uploads/2026/01/title-300x200.png" class="attachment-medium size-medium wp-post-image" alt="" style="max-width:100% !important;height:auto !important;margin-bottom:15px;margin-left:15px;float:right;"  loading="lazy" /></div>Large language model (LLM) agents are quickly moving from “single agent” to *multi-agent systems*: tool-using agents, planner-orchestrator, debate teams, specialized sub-agents that collaborate to solve tasks. At the same time, the *context* these agents must operate within is becoming more complex: longer histories, multiple modalities, structured traces, and customized environments. This combination creates a bottleneck [&#8230;]]]>
</description>
				<content:encoded><![CDATA[<div><img width="300" height="200" src="https://www.sigarch.org/wp-content/uploads/2026/01/title-300x200.png" class="attachment-medium size-medium wp-post-image" alt="" style="margin-bottom:15px;margin-left:15px;float:right;" decoding="async" loading="lazy" /></div><p>Large language model (LLM) agents are quickly moving from “single agent” to <em>multi-agent systems</em>: tool-using agents, planner-orchestrator, debate teams, specialized sub-agents that collaborate to solve tasks. At the same time, the <em>context</em> these agents must operate within is becoming more complex: longer histories, multiple modalities, structured traces, and customized environments. This combination creates a bottleneck that looks surprisingly familiar to computer architects: memory.</p>
<p>In computer systems, performance and scalability are often limited not by compute, but by memory hierarchy, bandwidth, and consistency. Multi-agent systems are heading toward the same wall — except their “memory” is not raw bytes, but semantic context used for reasoning. Having spent the past two years building various LLM multi-agent frameworks (e.g., <strong><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://github.com/fishmingyu/OrcaLoca">OrcaLoca</a> </strong>for software issue localization, <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://stable-lab.github.io/MAGE/"><strong>MAGE</strong></a> for RTL design, <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://github.com/stable-lab/Pro-V"><strong>Pro-V</strong></a> for RTL verification, and <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://pettingllms-ai.github.io/"><strong>PettingLLMs</strong> </a>enabling RL training on multiple LLM agents), we would like to share the insights we have gained, through the lens of a computer architect. This blog frames multi-agent memory as a <strong>computer architecture problem</strong>, proposes a simple architecture-inspired model, and highlights the key challenges and protocol gaps that define the road ahead.</p>
<p>While our perspectives are still preliminary and evolving, we hope they serve as a starting point to ignite a broader conversation.</p>
<hr />
<h2>Multi-Agent Memory Systems in Growing Complex Contexts</h2>
<h3>Why memory matters: Context is changing</h3>
<ul>
<li><strong>Longer context windows:</strong> Long-context evaluation suites like <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/2404.06654"><strong>RULER</strong></a> and <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://longbench2.github.io/"><strong>LongBench</strong></a> show that &#8220;real&#8221; long-context ability involves more than simple retrieval — it includes multi-hop tracing, aggregation, and sustained reasoning as length scales.</li>
<li><strong>Multi-modal inputs:</strong> Benchmarks such as <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://mmmu-benchmark.github.io/"><strong>MMMU</strong></a> (static images: charts, diagrams, tables) and <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://video-mme.github.io/"><strong>VideoMME</strong></a> (videos with audio and subtitles) demonstrate that models must handle diverse visual modalities alongside text, extending beyond single-modality processing.</li>
<li><strong>Structured data &amp; traces:</strong> Text-to-SQL benchmarks (e.g., <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://spider2-sql.github.io/"><strong>Spider</strong></a>, <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://bird-bench.github.io/"><strong>BIRD</strong></a>) highlight that agents increasingly operate over structured, executable data — database schemas and generated SQL queries — rather than only raw chat history.</li>
<li><strong>Customized environments:</strong> In <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.swebench.com/SWE-bench/guides/evaluation/"><strong>SWE-bench</strong></a> and <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://multi-swe-bench.github.io/#/"><strong>Multi-SWE-bench</strong></a>, models are evaluated by applying patches to real repositories and running tests in containerized (Docker) environments, making &#8220;environment state + execution&#8221; part of the memory problem. Similarly, <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://webarena.dev/"><strong>WebArena</strong></a> and <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://os-world.github.io/"><strong>OSWorld</strong></a> provide realistic, reproducible interactive environments that stress long-horizon state tracking and grounded actions.</li>
</ul>
<p><strong>Bottom line:</strong> Context is no longer a static prompt — it&#8217;s a dynamic, multi-format, partially persistent memory system.</p>
<p><img loading="lazy" decoding="async" class="alignnone size-full wp-image-97940" src="https://www.sigarch.org/wp-content/uploads/2026/01/motivation.jpg" alt="" width="11766" height="6266" /></p>
<hr />
<h2>Basic Prototypes: Shared vs. Distributed Agent Memory</h2>
<p>Before we talk about “hierarchies,” it helps to name the two simplest prototypes, which mirror classical memory systems.</p>
<h3>1) Shared Memory</h3>
<p>All agents access a shared memory pool (e.g., a shared vector store, shared document database).</p>
<ul>
<li><strong>Pros:</strong> Easy to share knowledge; fast reuse.</li>
<li><strong>Cons:</strong> Requires <strong>coherence support</strong>. Without coordination, agents overwrite each other, read stale info, or rely on inconsistent versions of shared facts.</li>
</ul>
<h3>2) Distributed Memory</h3>
<p>Each agent owns local memory (local scratchpad, local cache, local long-term store) and shares via synchronization.</p>
<ul>
<li><strong>Pros:</strong> Isolation by default; more scalable; fewer contention issues.</li>
<li><strong>Cons:</strong> Needs explicit <strong>synchronization</strong>; state divergence becomes common unless carefully managed.</li>
</ul>
<p>Most real systems sit somewhere in between: local working memory plus selectively shared artifacts.</p>
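<p>To make the two prototypes concrete, here is a minimal Python sketch: a shared pool with a naive version check, and per-agent local stores with explicit synchronization. The class and method names are ours for illustration, not the API of any existing framework (real systems would sit on a vector store or document database).</p>
<pre><code># Minimal sketch of the two prototypes; all names are illustrative.

class SharedMemoryPool:
    """One pool that every agent reads and writes (shared-memory prototype)."""
    def __init__(self):
        self.records = {}  # key: (value, version)

    def read(self, key):
        return self.records.get(key)  # (value, version) or None

    def write(self, key, value, expected_version=None):
        # Naive coherence check: reject writes made against a stale version.
        current = self.records.get(key)
        if expected_version is not None and current and current[1] != expected_version:
            raise RuntimeError("stale write to %r: re-read before writing" % key)
        version = current[1] + 1 if current else 1
        self.records[key] = (value, version)
        return version


class AgentLocalMemory:
    """Per-agent memory plus explicit sync (distributed-memory prototype)."""
    def __init__(self, name):
        self.name = name
        self.local = {}

    def write(self, key, value):
        self.local[key] = value

    def sync_to(self, peer, keys):
        # Explicit synchronization: push selected artifacts to a peer agent.
        for key in keys:
            peer.local[key] = self.local[key]
</code></pre>
<p>Even this toy version makes the trade-off visible: the shared pool needs the version check to avoid stale or conflicting writes, while the local stores push the burden onto explicit <code>sync_to</code> calls and accept divergence in between.</p>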
<hr />
<h2>An Agent Memory Architecture Inspired by Modern Computer Architecture Design</h2>
<p>Computer architecture teaches a practical lesson: you don’t build “one memory.” You build a <strong>memory hierarchy</strong> with different layers optimized for latency, bandwidth, capacity, and persistence.</p>
<p>A useful mapping for agents is the following:</p>
<h3>Agent I/O Layer</h3>
<p><strong>What it is:</strong> Interfaces that ingest and emit information.</p>
<ul>
<li>Audio/speech</li>
<li>Text documents</li>
<li>Images</li>
<li>Network calls/web data</li>
</ul>
<p><strong>Analogy:</strong> Devices and I/O subsystems feeding the CPU.</p>
<h3>Agent Cache Layer</h3>
<p><strong>What it is:</strong> Fast, limited-capacity memory optimized for immediate reasoning.</p>
<ul>
<li>Compressed context</li>
<li>Recent trajectories and tool calls</li>
<li>Short-term latent storage (e.g., KV cache, embeddings of recent steps)</li>
</ul>
<p><strong>Analogy:</strong> CPU caches (L1/L2/L3): small, fast, and constantly refreshed.</p>
<h3>Agent Memory Layer</h3>
<p><strong>What it is:</strong> Large-capacity, slower memory optimized for retrieval and persistence.</p>
<ul>
<li>Full dialogue history</li>
<li>External knowledge databases (vector DBs, graph DBs, document stores)</li>
<li>Long-term latent storage</li>
</ul>
<p><strong>Analogy:</strong> Main memory + storage hierarchy.</p>
<p>This framing emphasizes a key principle: <strong>Agent performance is an end-to-end data movement problem</strong>. Even if the model is powerful, if relevant information is stuck in the wrong layer (or never loaded), reasoning accuracy and efficiency degrade.</p>
<p>And just as in hardware, caching is not optional: like computer memory hierarchies, agent memory benefits from explicit I/O and caching layers to improve efficiency and scalability.</p>
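<p>Here is a minimal sketch of how the cache and memory layers compose; the interfaces are assumptions for illustration rather than a specific framework. A lookup first probes the small cache layer and only falls back to the large memory layer on a miss, promoting the result so repeated reasoning steps stay fast.</p>
<pre><code>from collections import OrderedDict

# Illustrative layers following the mapping above; not a real framework API.

class AgentCacheLayer:
    """Fast, limited-capacity context: recent steps, compressed summaries."""
    def __init__(self, capacity=8):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)  # refresh recency
            return self.entries[key]
        return None

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least-recently used


class AgentMemoryLayer:
    """Large, slower store: full history, external knowledge bases."""
    def __init__(self):
        self.store = {}

    def load(self, key):
        return self.store.get(key)

    def save(self, key, value):
        self.store[key] = value


def fetch_context(key, cache, memory):
    # End-to-end data movement: hit the cache first, otherwise pull from the
    # memory layer and promote the result into the cache for reuse.
    value = cache.get(key)
    if value is None:
        value = memory.load(key)
        if value is not None:
            cache.put(key, value)
    return value
</code></pre>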
<p><img loading="lazy" decoding="async" class="alignnone size-full wp-image-97939" src="https://www.sigarch.org/wp-content/uploads/2026/01/Memprotocol.jpg" alt="" width="22116" height="14550" /></p>
<hr />
<h2>Protocol Extensions for Multi-Agent Scenarios</h2>
<p>Architecture layers need <em>protocols</em>. In multi-agent settings, protocols determine what can be shared, how fast, and under what rules.</p>
<p>Today, many agent frameworks rely on <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://blog.modelcontextprotocol.io/"><strong>MCP</strong> (Model Context Protocol)</a> as a connectivity layer. Agents registered via MCP can connect and communicate, but inter-agent bandwidth remains limited by message-passing. MCP largely uses JSON-RPC, so it’s best viewed as a protocol for <strong>agent context I/O</strong>: request/response, tool invocation, and structured messages.</p>
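<p>For a rough sense of what this context I/O looks like, here is a JSON-RPC-style tool call of the kind MCP carries, written out as a Python dictionary. The tool name and arguments are hypothetical, and the payload is simplified for illustration rather than copied from the MCP specification; the point is that every inter-agent exchange is a serialized request/response message.</p>
<pre><code>import json

# Simplified JSON-RPC 2.0-shaped request; fields are illustrative, not a
# verbatim MCP message.
request = {
    "jsonrpc": "2.0",
    "id": 42,
    "method": "tools/call",
    "params": {
        "name": "search_codebase",  # hypothetical tool
        "arguments": {"query": "null pointer in parser"},
    },
}
print(json.dumps(request, indent=2))
</code></pre>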
<p>That’s necessary — but not sufficient.</p>
<h3>Missing Piece 1: Agent Cache Sharing Protocol</h3>
<p>Many recent studies, such as <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/2411.02820"><strong>DroidSpeak</strong></a> and <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/2510.03215"><strong>Cache-to-Cache</strong></a>, have explored KV cache sharing between LLMs. However, we still lack a principled and unified protocol for sharing <em>cached artifacts</em> across agents.</p>
<p><strong>Goal:</strong> Enable one agent’s cached results to be transformed and reused by other agents.</p>
<p>In architecture terms, this is like enabling cache transfers or shared cache behavior — except the payload is semantic and may require transformation before reuse.</p>
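<p>As a sketch of what such a protocol might expose, consider cached artifacts that carry provenance (which model produced them, in what representation), where reuse either happens directly or goes through an adapter. Everything below (class names, fields, the adapter registry) is an assumption for illustration, not an existing API.</p>
<pre><code># Sketch of a cache-sharing interface; names and fields are illustrative.

class CachedArtifact:
    def __init__(self, payload, producer_model, representation):
        self.payload = payload                # e.g., KV tensors, embeddings, a summary
        self.producer_model = producer_model  # provenance: which model produced it
        self.representation = representation  # "kv_cache", "embedding", "text_summary", ...


def share_cache_entry(artifact, consumer_model, adapters):
    """Hand a cached artifact to another agent, transforming it if needed."""
    if artifact.producer_model == consumer_model:
        return artifact.payload  # directly reusable, like a cache-to-cache transfer
    adapter = adapters.get((artifact.producer_model, consumer_model, artifact.representation))
    if adapter is None:
        raise ValueError("no adapter available; fall back to re-encoding from text")
    return adapter(artifact.payload)  # e.g., re-project KV states or re-embed
</code></pre>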
<h3>Missing Piece 2: Agent Memory Access Protocol</h3>
<p>Although frameworks like <strong><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://docs.letta.com/">Letta</a></strong> and <strong><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://mem0.ai/">Mem0</a></strong> support shared state within agent memory, a protocol that defines how agents read and write each other’s memory is still missing.</p>
<p><strong>Goal:</strong> Define memory access semantics: permissions, scope, and granularity.</p>
<p>Key questions:</p>
<ul>
<li>Can Agent B read Agent A’s long-term memory, or only shared memory?</li>
<li>Is access read-only, append-only, or read-write?</li>
<li>What is the unit of access: a document, a chunk, a key-value record, a “thought,” a trace segment?</li>
<li>Can we support “agent RDMA”-like patterns: low-latency direct access to remote memory without expensive message-level serialization?</li>
</ul>
<p>Without a memory access protocol, inter-agent collaboration is forced into slow, high-level message passing, which wastes bandwidth and loses structure.</p>
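<p>One way to make these questions concrete is to encode them as data: a lease that names the owner, the grantee, the scope, the permission, and the unit of access. The sketch below shows what such semantics could look like; the permission names and granularities are assumptions, not a proposed standard.</p>
<pre><code># Sketch of memory-access semantics; all names are illustrative.

PERMISSIONS = ("read_only", "append_only", "read_write")
GRANULARITIES = ("document", "chunk", "kv_record", "thought", "trace_segment")

class MemoryLease:
    """A grant from one agent to another: what may be touched, and how."""
    def __init__(self, owner, grantee, scope, permission, granularity):
        assert permission in PERMISSIONS and granularity in GRANULARITIES
        self.owner = owner              # e.g., "agent_A"
        self.grantee = grantee          # e.g., "agent_B"
        self.scope = scope              # e.g., "shared" or "long_term"
        self.permission = permission
        self.granularity = granularity

def check_access(lease, requester, region, op):
    """Return True if the requested operation is allowed under the lease."""
    if requester != lease.grantee or region != lease.scope:
        return False
    if op == "read":
        return True
    if op == "append":
        return lease.permission in ("append_only", "read_write")
    return lease.permission == "read_write"  # op == "write"
</code></pre>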
<hr />
<h2>The Next Frontier: Multi-Agent Memory Consistency</h2>
<p>The largest conceptual gap is <strong>consistency</strong>. The goal of <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://doi.org/10.1109/2.546611"><strong>memory consistency</strong></a> in computer architecture and systems design is to define constraints on the order of reads and writes to memory addresses. Consistency models (e.g., sequential consistency, TSO, and release consistency) clarify what behaviors programmers can rely on.</p>
<p>For agent memory, the goal shifts: It’s not about bytes at an address, but about maintaining a <strong>coherent semantic context</strong> that supports correct reasoning and coordination.</p>
<p><img loading="lazy" decoding="async" class="alignnone size-full wp-image-97941" src="https://www.sigarch.org/wp-content/uploads/2026/01/memory-consistency-comparison.jpg" alt="" width="20184" height="8150" /></p>
<h3>Why Agent Consistency Is Harder</h3>
<ul>
<li>The “state” is not a scalar value; it’s a <em>plan</em>, a <em>summary</em>, a <em>retrieval result</em>, a <em>tool trace</em>.</li>
<li>Writes are not deterministic; they may be speculative or wrong.</li>
<li>Conflicts aren’t simple write-write conflicts — they&#8217;re semantic contradictions.</li>
<li>Freshness depends on the environment state (repo version, API results, and permissions).</li>
</ul>
<h3>What a Multi-Agent Memory Consistency Layer Might Need</h3>
<p>A practical direction is to define consistency around the <em>artifacts agents actually share</em> — cached evidence, tool traces, plans, and long-term records — across both <strong>shared</strong> and <strong>distributed</strong> memory setups (often a hybrid: local caches + shared store). The layer should expose a <strong>consistency model</strong> (e.g., session, causal, eventual semantic, and stronger guarantees for “committed” outputs), provide richer <strong>communication primitives</strong> than plain message passing, and include <strong>conflict-resolution policies</strong> (source ranking, timestamps, consensus, and optional human intervention for high-stakes conflicts).</p>
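<p>To illustrate just the conflict-resolution piece, the sketch below orders two conflicting semantic records by commitment status, then source rank, then freshness, and escalates ties to consensus or a human. The record fields, ranking, and policy are invented for illustration; they are one possible policy, not a proposed model.</p>
<pre><code># Sketch of one conflict-resolution policy over shared semantic records.

SOURCE_RANK = {"executed_test": 3, "tool_output": 2, "model_claim": 1}

class SemanticRecord:
    def __init__(self, key, value, source, timestamp, committed=False):
        self.key = key
        self.value = value
        self.source = source        # where this claim came from
        self.timestamp = timestamp  # freshness, e.g., time.time()
        self.committed = committed  # "committed" outputs get stronger guarantees

def resolve(a, b):
    """Pick a winner between two conflicting records, or return None to escalate."""
    if a.committed != b.committed:  # committed outputs win outright
        return a if a.committed else b
    ra = SOURCE_RANK.get(a.source, 0)
    rb = SOURCE_RANK.get(b.source, 0)
    if ra != rb:                    # source ranking
        return a if ra > rb else b
    if a.timestamp != b.timestamp:  # freshness
        return a if a.timestamp > b.timestamp else b
    return None                     # escalate: consensus round or human review
</code></pre>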
<p>Research on this is still rare, but it is likely to become foundational — much like coherence and consistency were for multiprocessors.</p>
<h2>Conclusion</h2>
<p>Many agent memory systems today resemble <strong>human memory</strong> — informal, redundant, and hard to control — leaving a large opportunity for computer architecture researchers to rethink what “memory” should mean for agents <strong>at scale</strong>. To move from ad-hoc prompting to reliable multi-agent systems, we need <strong>better memory hierarchies</strong>, <strong>explicit protocols</strong> for cache sharing and memory access, and <strong>principled consistency models</strong> that keep shared context coherent.</p>
<h2>Acknowledgement</h2>
<p>We sincerely thank Wentao Ni, Hejia Zhang, Mingrui Yin, Jiaying Yang, and Yujie Zhao for their invaluable contributions through brainstorming, discussions, data collection, and survey work over the past few months. This article would not have been possible without their dedicated efforts.</p>
<p><b>About the authors:</b></p>
<p><i>Zhongming Yu is a PhD student in the </i><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://cse.ucsd.edu/"><i>Computer Science and Engineering Department</i></a><i> at</i><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~www.ucsd.edu/"><i> University of California, San Diego</i></a><i>. His research interests are in combining machine learning and computer systems, with a special focus on LLM agent systems for machine learning systems, evolving ML and systems, and autonomous software engineering. </i></p>
<p><i>Jishen Zhao is a Professor in the</i><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://cse.ucsd.edu/"><i> Computer Science and Engineering Department</i></a><i> at</i><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~www.ucsd.edu/"><i> University of California, San Diego</i></a><i>. Her research spans and stretches the boundary across computer architecture, system software, and machine learning, with an emphasis on memory systems, machine learning and systems codesign, and system support for smart applications.</i></p>
<p class="disclaim"><strong>Disclaimer:</strong> <em>These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.</em></p><Img align="left" border="0" height="1" width="1" alt="" style="border:0;float:left;margin:0;padding:0;width:1px!important;height:1px!important;" hspace="0" src="https://feeds.feedblitz.com/~/i/940946942/0/sigarch-cat">
]]>
</content:encoded>
			<wfw:commentRss>https://feeds.feedblitz.com/~/940946942/0/sigarch-cat~MultiAgent-Memory-from-a-Computer-Architecture-Perspective-Visions-and-Challenges-Ahead/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	<post-id xmlns="com-wordpress:feed-additions:1">97929</post-id></item>
<item>
<feedburner:origLink>https://www.sigarch.org/pipeorgan-modeling-memory-bandwidth-bound-executions-for-ai-and-beyond/</feedburner:origLink>
		<title>PipeOrgan: Modeling Memory-Bandwidth-Bound Executions for AI and Beyond</title>
		<link>https://feeds.feedblitz.com/~/940049756/0/sigarch-cat~PipeOrgan-Modeling-MemoryBandwidthBound-Executions-for-AI-and-Beyond/</link>
		<comments>https://feeds.feedblitz.com/~/940049756/0/sigarch-cat~PipeOrgan-Modeling-MemoryBandwidthBound-Executions-for-AI-and-Beyond/#respond</comments>
		<pubDate>Mon, 12 Jan 2026 15:00:20 +0000</pubDate>
		<dc:creator><![CDATA[Mark D. Hill]]></dc:creator>
		<category><![CDATA[ACM SIGARCH]]></category>
		<category><![CDATA[Accelerators]]></category>
		<category><![CDATA[Memory]]></category>
		<category><![CDATA[Modelling]]></category>
		<guid isPermaLink="false">https://www.sigarch.org/?p=97568</guid>
		<description><![CDATA[<div><img width="300" xheight="200" src="https://www.sigarch.org/wp-content/uploads/2026/01/SIGARCH_PipeOrgan_via_ChatGPT_2026_01_05-300x200.png" class="attachment-medium size-medium wp-post-image" alt="" style="max-width:100% !important;height:auto !important;margin-bottom:15px;margin-left:15px;float:right;"  loading="lazy" /></div>TL;DR: Latency-tolerant architectures, e.g., GPUs, increasingly use memory/storage hierarchies, e.g., for KV Caches to speed Large-Language Model AI inference. To aid codesign of such workloads and architectures, we develop the simple PipeOrgan analytic model for bandwidth-bound workloads running on memory/storage hierarchies.  Background For three reasons, memory bandwidth, more than latency, limits AI inference performance. First, [&#8230;]]]>
</description>
				<content:encoded><![CDATA[<div><img width="300" height="200" src="https://www.sigarch.org/wp-content/uploads/2026/01/SIGARCH_PipeOrgan_via_ChatGPT_2026_01_05-300x200.png" class="attachment-medium size-medium wp-post-image" alt="" style="margin-bottom:15px;margin-left:15px;float:right;" decoding="async" loading="lazy" /></div><p><i><span style="font-weight: 400;">TL;DR: Latency-tolerant architectures, e.g., GPUs, increasingly use memory/storage hierarchies, e.g., for KV Caches to speed Large-Language Model AI inference. To aid codesign of such workloads and architectures, we develop the simple PipeOrgan analytic model for bandwidth-bound workloads running on memory/storage hierarchies. </span></i></p>
<h3><b>Background</b></h3>
<p><span style="font-weight: 400;">For three reasons, memory bandwidth, more than latency, limits AI inference performance. First, AI inference uses latency-tolerant compute engines, such as GPUs. Second, it principally uses hardware memory hierarchies to store a data structure called a <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://huggingface.co/blog/not-lain/kv-caching">Key-Value (KV) Cache</a> that holds information from recent queries to reduce redundant computation. With <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/abs/2309.06180">PagedAttention</a>, each KV Cache fetch obtains one or more multi-megabyte blocks (often called pages) that require substantial bandwidth to complete. Third, inference&#8217;s “decode” phase is memory-bound due to low arithmetic intensity, putting great pressure on memory bandwidth.</span></p>
<p><span style="font-weight: 400;">Traditional CPU memory/storage hierarchies are shaped by increasing latency, but designing hierarchies for AI workloads requires focusing on decreasing bandwidth. Since AI software is flexible, codesigning software and hardware is essential. </span></p>
<p><span style="font-weight: 400;">To provide intuition and first answer to the above questions, we next contribute the simple <em>PipeOrgan</em> analytic model for optimizing bandwidth-bound workloads running on a memory hierarchy with many parallel <em>pipes</em> from memories to compute. The PipeOrgan model shows that husbanding and providing bandwidth is important for AI software and hardware. Analytic models have long provided computing intuition, e.g., <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://en.wikipedia.org/wiki/Amdahl%27s_law">Amdahl&#8217;s Law</a>, <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://en.wikipedia.org/wiki/Iron_law_of_processor_performance">Iron Law</a>, and <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://en.wikipedia.org/wiki/Roofline_model">Roofline</a>.</span></p>
<p><img loading="lazy" decoding="async" class=" wp-image-97780 aligncenter" src="https://www.sigarch.org/wp-content/uploads/2026/01/figure1.png" alt="" width="448" height="306" /></p>
<p><img loading="lazy" decoding="async" class="wp-image-97787 aligncenter" src="https://www.sigarch.org/wp-content/uploads/2026/01/figure2-1.png" alt="" width="798" height="350" /></p>
<h3><b>Example System with Two Parallel Memories</b></h3>
<p><span style="font-weight: 400;">Let’s start simple. Consider the hardware depicted in Figure 1 with High Bandwidth Memory (HBM) with bandwidth 16 TB/s </span><b>in parallel with</b><span style="font-weight: 400;"> an LPDDR memory with bandwidth 0.5 TB/s. Assume for now that there are no transfers between memories, e.g., to cache. </span></p>
<p><span style="font-weight: 400;">Using the PipeOrgan math from the next section, Figure 2’s blue line shows how system performance changes depending on what percentage of data comes from LPDDR memory. (The orange line comes later when we add caching.) </span>Performance is highest when LPDDR provides exactly 3% of the data <span style="font-weight: 400;">(arrow 1)</span>, which matches its 3% bandwidth <span style="font-weight: 400;">(0.5/(16.0+0.5))</span>. At this point, both LPDDR and HBM memories finish transferring data at the same time, so they act as co-bottlenecks and the system runs at peak efficiency.</p>
<p>When less than 3% of data is from LPDDR (left of the peak), <span style="font-weight: 400;">HBM finishes last and limits performance. When LPDDR sources more than 3% (right of the peak), it is</span> the bottleneck. LPDDR might have to source more data because <span style="font-weight: 400;">HBM&#8217;s limited capacity (currently 48-64 GB per stack) may prevent it from sourcing its full share (97%). If so, </span><span style="font-weight: 400;">performance drops quickly: 4% from LPDDR gives 76% of peak (arrow 2), and 20% yields just 15% (arrow 3).</span></p>
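<p>The numbers behind arrows 1, 2, and 3 can be reproduced in a few lines. The per-memory finish times below anticipate the PipeOrgan math derived in the next section; the bandwidths are the ones assumed above.</p>
<pre><code># Reproduce Figure 2's blue curve: two memories feeding compute in parallel.
hbm_bw, lpddr_bw = 16.0, 0.5  # TB/s
total_bw = hbm_bw + lpddr_bw

def relative_performance(frac_from_lpddr, data=1.0):
    # Each memory finishes in d_i / b_i; the slower one sets workload time.
    t_hbm = (1.0 - frac_from_lpddr) * data / hbm_bw
    t_lpddr = frac_from_lpddr * data / lpddr_bw
    peak_time = data / total_bw  # at the peak, both finish together
    return peak_time / max(t_hbm, t_lpddr)

print(round(lpddr_bw / total_bw, 3))         # ~0.03: peak when LPDDR sources 3% (arrow 1)
print(round(relative_performance(0.04), 2))  # 0.76 of peak at 4% from LPDDR (arrow 2)
print(round(relative_performance(0.20), 2))  # 0.15 of peak at 20% from LPDDR (arrow 3)
</code></pre>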
<p><span style="font-weight: 400;">However, future AI systems will feature <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://arxiv.org/pdf/2407.00079">multiple memory and storage levels</a>, using HBM, LPDDR, host DDR, pooled DDR, and attached or pooled FLASH storage</span><span style="font-weight: 400;">.</span></p>
<p><img loading="lazy" decoding="async" class=" wp-image-97782 aligncenter" src="https://www.sigarch.org/wp-content/uploads/2026/01/figure3.png" alt="" width="570" height="314" /></p>
<h3><b>PipeOrgan Model of Systems with N Parallel Memories</b></h3>
<p><span style="font-weight: 400;">The above result generalizes to an N-level memory/storage hierarchy with each level feeding compute in parallel. Optimal performance is achieved when all parallel memories complete a workload phase simultaneously, leading to this PipeOrgan principle:</span></p>
<p><b><i>Memory-bandwidth-bound workloads perform best when data is sourced from each memory level in proportion to its bandwidth.</i></b></p>
<p><strong>Proof: </strong></p>
<ol>
<li>Let each memory provide bandwidth b_i TB/s in parallel for total bandwidth B = b_1 + … + b_N.</li>
<li><span style="font-weight: 400;">For a workload, let each source d_i bytes in parallel for total data transferred D = d_1 + … + d_N.</span></li>
<li><span style="font-weight: 400;">By assumption, the workload is limited by data transfer time with compute hidden.</span></li>
<li><span style="font-weight: 400;">Time for each memory to finish its data transfer is d_i/b_i  = TB/(TB/s) = seconds.</span></li>
<li><span style="font-weight: 400;">Workload Time is the maximum of all memories finishing: MAX [d_1/b_1, …, d_N/b_N].</span></li>
<li><span style="font-weight: 400;">Workload Performance = 1/ Time = MIN[b_1/d_1, …, b_N/d_N].</span></li>
<li><span style="font-weight: 400;">Set each d_i = (D/B)*b_i = proportional to its bandwidth b_i.</span></li>
<li><span style="font-weight: 400;">Performance = MIN[b_1/((D/B)*b_1), …, b_N/((D/B)*b_N)].</span></li>
<li><span style="font-weight: 400;">Performance = MIN[(B/D), …, (B/D)] = B/D and Time = 1/Performance = D/B. </span></li>
</ol>
<p><span style="font-weight: 400;">This makes sense: PipeOrgan shows that best performance occurs when one moves all the data using all the bandwidth with no bandwidth idling.</span></p>
<p><img loading="lazy" decoding="async" class="wp-image-97783 aligncenter" src="https://www.sigarch.org/wp-content/uploads/2026/01/figure4.png" alt="" width="555" height="263" /></p>
<h3><b>But Caching Is Critical</b></h3>
<p><span style="font-weight: 400;">The PipeOrgan version above assumes all data goes directly to compute, without transfers among memories. In reality, systems move data from lower- to higher-bandwidth memories, caching it for reuse. For a two-level system (see Figure 4), assume the entire fraction of the workload’s data used from LPDDR is first transferred to HBM for caching (orange arrow). Let the data used from LPDDR be f*D where f ranges from 0 to 1.</span></p>
<ul>
<li><span style="font-weight: 400;">Performance with caching = MIN[(b_1/D)/(f+1), b_2/(f*D)] = MIN[limited by HBM BW, limited by LPDDR BW].</span></li>
</ul>
<p><span style="font-weight: 400;">Figure 2 shows an orange curve for caching that is hidden under the original blue curve when more than 3% of data is sourced from LPDDR. At more than 3% from LPDDR, performance&#8211;without and with caching&#8211;is limited by the time to transfer needed data with the same limited LPDDR bandwidth.</span></p>
<p><span style="font-weight: 400;">While it might look like caching </span><span style="font-weight: 400;">doesn&#8217;t matter, caching is actually important. </span><span style="font-weight: 400;">This is because caching can greatly shift a workload’s x-axis operating point. For example, sourcing 20% of data from LPDDR yields 15% of peak performance (arrow 3). If LPDDR data is cached in HBM and reused five times, then–as the orange dashed arrow shows–only 4% comes from LPDDR and performance gets boosted to 76% of peak—a ~5x improvement (arrow 2).</span></p>
<p><span style="font-weight: 400;">Consequently, caching remains critical. Moreover, PipeOrgan and its N parallel memory principle also applies bandwidth-bound workloads once caching&#8217;s more complex information flows are accounted for.</span></p>
<h3><b>Implications, Limitations and Future Work</b></h3>
<p><span style="font-weight: 400;">Statistician George Box famously said, “</span><i><span style="font-weight: 400;">Essentially, all models are wrong, but some are useful.</span></i><span style="font-weight: 400;">” </span></p>
<p><span style="font-weight: 400;">We conjecture that the PipeOrgan model is useful for AI codesign, especially in the early stages and with software people having less hardware understanding. </span><b>Its key implication is that bandwidth-bound workloads must carefully manage bandwidth from larger, slower memories and storage. </b><span style="font-weight: 400;">While vast data can be stored statically, dynamic use from low-bandwidth memories should remain modest.</span></p>
<p><span style="font-weight: 400;">Three PipeOrgan limitations motivate future work. First, most workloads aren’t bandwidth bound throughout, and PipeOrgan doesn’t address other phases. Modeling these requires more parameters, increasing accuracy but also complexity.</span></p>
<p><span style="font-weight: 400;">Second, the caching model variant only covers two memory levels and always transfers data first to the higher-bandwidth level before use. Future work should extend this to N memory levels and more advanced caching policies. Modeling the many options for caching may be challenging.</span></p>
<p><span style="font-weight: 400;">Third, PipeOrgan may need to be extended for systems that do some processing in or near the memories themselves rather than moving all data to a segregated compute unit.</span></p>
<p><i><span style="font-weight: 400;"><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.cs.princeton.edu/courses/archive/fall13/cos375/Burks.pdf">Burks, Goldstine, &amp; von Neumann, 1946</a>: We are therefore forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible.</span></i></p>
<p><span style="font-weight: 400;">In sum, after eight decades of memory hierarchies focused mostly on latency, we are now at the exciting early stages of codesigning bandwidth-focused memory/storage hierarchies for more flexible AI software.</span></p>
<p><b>About the Author:</b><span style="font-weight: 400;"><a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://pages.cs.wisc.edu/~markhill/"> Mark D. Hill</a> is John P. Morgridge Professor and Gene M. Amdahl Professor Emeritus of Computer Sciences at the University of Wisconsin-Madison and consultant to industry. He initiated the PipeOrgan model while consulting for Microsoft and was given permission to release it. He is a fellow of AAAS, ACM, and IEEE, as well as recipient of the 2019 Eckert-Mauchly Award.</span></p>
<p class="disclaim"><strong>Disclaimer:</strong> <em>These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.</em></p><Img align="left" border="0" height="1" width="1" alt="" style="border:0;float:left;margin:0;padding:0;width:1px!important;height:1px!important;" hspace="0" src="https://feeds.feedblitz.com/~/i/940049756/0/sigarch-cat">
]]>
</content:encoded>
			<wfw:commentRss>https://feeds.feedblitz.com/~/940049756/0/sigarch-cat~PipeOrgan-Modeling-MemoryBandwidthBound-Executions-for-AI-and-Beyond/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	<post-id xmlns="com-wordpress:feed-additions:1">97568</post-id></item>
<item>
<feedburner:origLink>https://www.sigarch.org/in-memoriam-remembering-mike-flynn/</feedburner:origLink>
		<title>In Memoriam: Remembering Mike Flynn</title>
		<link>https://feeds.feedblitz.com/~/939763391/0/sigarch-cat~In-Memoriam-Remembering-Mike-Flynn/</link>
		<comments>https://feeds.feedblitz.com/~/939763391/0/sigarch-cat~In-Memoriam-Remembering-Mike-Flynn/#respond</comments>
		<pubDate>Tue, 06 Jan 2026 21:00:08 +0000</pubDate>
		<dc:creator><![CDATA[Ruby B. Lee, Charlie Neuhauser, Timothy M. Pinkston]]></dc:creator>
		<category><![CDATA[ACM SIGARCH]]></category>
		<category><![CDATA[Memoriam]]></category>
		<guid isPermaLink="false">https://www.sigarch.org/?p=97727</guid>
		<description><![CDATA[<div><img width="257" xheight="300" src="https://www.sigarch.org/wp-content/uploads/2026/01/Picture1.jpg" class="attachment-medium size-medium wp-post-image" alt="" style="max-width:100% !important;height:auto !important;margin-bottom:15px;margin-left:15px;float:right;"  loading="lazy" /></div>Michael J. Flynn is a widely respected contributor—indeed a giant—in the field of Computer Architecture.  He made highly significant and impactful contributions throughout his career, both in industry and in academia.  Sadly, he passed away peacefully December 24, 2025, having lived a long and full life. Born May 20, 1934, in New York, NY, Flynn [&#8230;]]]>
</description>
				<content:encoded><![CDATA[<div><img width="257" height="300" src="https://www.sigarch.org/wp-content/uploads/2026/01/Picture1.jpg" class="attachment-medium size-medium wp-post-image" alt="" style="margin-bottom:15px;margin-left:15px;float:right;" decoding="async" loading="lazy" /></div><p style="font-weight: 400;">Michael J. Flynn is a widely respected contributor—indeed a <em>giant</em>—in the field of Computer Architecture.  He made highly significant and impactful contributions throughout his career, both in industry and in academia.  Sadly, he passed away peacefully December 24, 2025, having lived a long and full life.</p>
<p style="font-weight: 400;">Born May 20, 1934, in New York, NY, Flynn earned his Bachelor’s, Master’s, and Ph.D. degrees in Electrical Engineering from Manhattan College (1955), Syracuse University (1960), and Purdue University (1961), respectively, and he received an honorary Doctor of Science degree from the University of Dublin (1998).  After ten years as a design engineer and project manager at IBM (1955-65, in Endicott and Poughkeepsie, NY), he became a member of the faculty at the University of Illinois at Chicago (1965-1966), Northwestern University (1966-1970), and Johns Hopkins University (1970-1975) before joining Stanford University in 1975 as Professor of Electrical Engineering.  He taught internationally, in Ireland, other places in Europe, Singapore, and Japan.</p>
<p style="font-weight: 400;">As a young project manager at IBM, Flynn was responsible for the design of the well-known <em>IBM System 360 (Models 91/92/95 series)</em>, the first computer to implement the sophisticated Tomasulo algorithm, along with many other groundbreaking high-performance architectural techniques.  As the first family of general-purpose computer mainframes that featured <em>architectural compatibility</em> for both commercial and scientific applications, the System 360 is widely recognized as revolutionizing computing during that time—and in many ways persisting even today.  Indeed, many of the high-performance computing techniques developed by Flynn and his IBM colleagues are used throughout the industry today, having migrated from barn-sized mainframes to finger-nail sized microprocessor chips.  Flynn also was the first to shed light on the performance potential and limitations of parallel computers with what’s become known as <em>Flynn’s classification </em>(or <em>Flynn’s taxonomy</em>), a pioneering framework for categorizing parallelism in computer architectures based on the number of simultaneous instruction streams and data streams they handle, e.g., SISD, SIMD, MISD, and MIMD.  His original taxonomy is still used widely today, with various extensions derived from it, to distinguish between different kinds of parallel processor computer systems.</p>
<p style="font-weight: 400;">In 1972, together with some colleagues from IBM, Flynn co-founded Palyn Associates which provided consulting services in the field of high-performance computer architecture and design.  For more than 30 years, he and his colleagues advised nearly every major computer company in Japan, Europe and the United States, including IBM, CDC, Fujitsu, Hitachi, Honeywell Bull, and ICL.  Later, he played a prominent role in Maxeler, products of which made use of advanced dataflow techniques to provide high performance processing for specific applications, such as automated trading.  As a renowned professor at Stanford until his retirement in 1999 and transition to emeritus status, Flynn made seminal contributions to instruction set architecture (ISA), computer arithmetic, advanced floating-point design, multimedia, parallel processors and interconnects, emulation, and performance evaluation, to name a few.  He (co-)authored several textbooks, including <u>Introduction to Arithmetic for Digital Systems Designers</u>, <u>Computer Architecture: Pipelined and Parallel Processor Design</u>, and <u>Advanced Computer Arithmetic Design</u>. An IEEE Fellow, ACM Fellow, and Fellow of the Institution of Engineers of Ireland, Flynn received numerous other honors and awards for his impactful technical contributions, including the ACM/IEEE Eckert-Mauchly Award (1992), IEEE Computer Society’s (CS) Harry Goode Memorial Award and Medal (1995), the Tesla Award and Medal from the International Tesla Society in Belgrade (1998), IEEE CS Charles Babbage Award, IEEE CS Computer Pioneer Award (2015, his acceptance speech video is <a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://www.youtube.com/watch?v=xAhRYUPSZKM">here</a>), and many others.</p>
<p style="font-weight: 400;">Notably, when the field of computer architecture was still in its infancy more than fifty years ago, Flynn founded the IEEE CS Technical Committee on Computer Architecture (TCCA) and ACM’s Special Interest Group on Computer Architecture (SIGARCH); he also started the ACM/IEEE International Symposium on Computer Architecture (ISCA), co-sponsored by both, which is among the most prestigious flagship computer architecture conferences in the world.  At ISCA’s 50<sup>th</sup> anniversary conference at FCRC 2023, Flynn was invited to give a “<a href="https://feeds.feedblitz.com/~/t/0/0/sigarch-cat/~https://u.pcloud.link/publink/show?code=XZro5EVZFqb2N6FIwjuqKDkbaDonqJzVoXzX">50 Year Retrospective Lecture</a>” and was given an honorary plaque with these words inscribed: <em>&#8220;In recognition, with tremendous gratitude, of your lifetime dedication and leadership to the computer architecture community on this the 50<sup>th</sup> anniversary of your founding of ISCA, SIGARCH, and TCCA.&#8221;</em></p>
<p style="font-weight: 400;">Even more than his impressive technical contributions, which are many, Flynn is remembered fondly by the many dozens of doctoral graduate students he advised—for his unending kindness, wealth of wisdom, caring tutelage, gentle encouragement, constant motivation, and enduring support, especially when most needed.  He treated each and every student as if they were a member of his own family, and he was viewed by them not only as their academic “father,” but referred to affectionately as “the Great Man.”  Many of his former mentees returned to Stanford several times each year for luncheons to enjoy his company and reminisce about exciting times working with him in tackling some of the most compelling technical issues of the day.</p>
<p style="font-weight: 400;">Flynn was an equally generous mentor to his junior faculty colleagues, helping them establish their careers and providing sage advice as they made their way.  Kunle Olukotun attests to this: <em>“Meeting Mike Flynn near the end of my Ph.D. at the University of Michigan changed the trajectory of my career.  At the time, I was firmly on a path toward industry, but Mike believed that I could be a strong academic, and he encouraged me to apply to Stanford.  Mike saw something in me that I did not yet see in myself, and that confidence made an enduring difference. Once I arrived at Stanford, Mike served as my mentor. He helped me navigate the academic waters with thoughtful and wise advice, provided opportunities to showcase my research, and supported me through nominations for awards and professional recognition.  I am deeply grateful to Mike for all he did to help establish my career, and for the role he played in the success of so many other junior colleagues whom he mentored with the same generosity and vision.  I am deeply saddened by his passing.”</em>  Similar sentiments are echoed by Bill Dally, who shares the following: <em>“I first met Mike as a graduate student at Stanford in 1980.  I was awed by his accomplishments and his understanding of parallel computing. He kindled my interest in parallel computing which launched me on a very successful career.  Later, when I came to Stanford as a faculty member in 1997, I found Mike to be a great source of advice about Stanford, being a faculty member, research strategy, and many other topics.  I am deeply saddened to hear of Mike&#8217;s passing.  He will be greatly missed.”</em>  Another of his faculty colleagues at Stanford, Christos Kozyrakis, recalls the following: <em>“One of the most memorable moments of my early teaching years was hosting him in class to discuss the Flynn taxonomy of computer architecture—a special experience for both the students and myself and a vivid reminder of the lasting impact of his work.”</em>  Indeed, Mike Flynn was highly respected and revered by fellow colleagues all throughout his professional career.  Solemnly noted by John L. Hennessy, <em>“Mike was the person who hired me at Stanford, gave me some of my first research funding, jointly published an early paper with me, and gave me my first consulting opportunity.  Sadly, his passing marks the end of an important era in computing: Mike was the last of the great System 360 pioneers—Gene Amdahl, Bob Evans, Fred Brooks, Eric Bloch, Gerry Blaauw, and Robert Tomasulo—all are now gone.”</em></p>
<p style="font-weight: 400;">He was a wonderful human being.</p>
<p style="font-weight: 400;">Professor Michael J. Flynn will be sorely missed by his loving family as well as by his extended academic family and all those whose lives he has indelibly touched over his blessed ninety-one plus years.  May he rest blissfully in peace, and may his venerable legacy be inspirational and long lasting.  Fittingly, through Mike Flynn’s final public words to all of us in the computer architecture community in his ISCA 50<sup>th</sup> Anniversary Lecture, he exhorted us all by saying: <em>“Now it’s your turn!”</em></p>
<p style="font-weight: 400;"><em><strong>About the Authors:</strong> </em></p>
<p><strong>Ruby B. Lee</strong> is the Forest G. Hamrick Professor Emeritus in the ECE department at Princeton University, and chief architect at Hewlett-Packard in Silicon valley before that. She is a Fellow of the IEEE, ACM and the American Academy of Arts and Sciences, and recipient of awards such as the most Influential Paper award in 20 years at ISCA 2025 and the Test of Time award at the ACSAC 2024 security conference. Her research combines cyber security, computer architecture and deep learning, including secure processor and cache architectures, attacks and defenses, low-cost AI and multimedia.</p>
<p><strong>Charlie Neuhauser</strong> is now retired after more than 50 years in the field of computer design and analysis.  During the latter half of his career, he provided technical insight to attorneys and companies in the area of intellectual property.  He is currently the registration chair for the IEEE Hot Chips Symposium.</p>
<p><strong>Timothy M. Pinkston</strong> is the George Pfleger Chaired Professor of Electrical and Computer Engineering at the University of Southern California and also is a Vice Dean in USC’s Viterbi School of Engineering.  A Fellow of AAAS, ACM, and IEEE, and recipient of the ACM SIGARCH Alan D. Berenbaum Distinguished Service Award, Timothy’s research contributions mainly are in the area of interconnection networks and efficient data movement in parallel computing systems.</p>
<p style="font-weight: 400;">All three authors are former  Ph.D. students of Mike Flynn at Stanford (Lee and Pinkston) and Johns Hopkins (Neuhauser).</p>
<p class="disclaim"><strong>Disclaimer:</strong> <em>These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.</em></p><Img align="left" border="0" height="1" width="1" alt="" style="border:0;float:left;margin:0;padding:0;width:1px!important;height:1px!important;" hspace="0" src="https://feeds.feedblitz.com/~/i/939763391/0/sigarch-cat">
]]>
</content:encoded>
			<wfw:commentRss>https://feeds.feedblitz.com/~/939763391/0/sigarch-cat~In-Memoriam-Remembering-Mike-Flynn/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	<post-id xmlns="com-wordpress:feed-additions:1">97727</post-id></item>
</channel></rss>

