
Why a decades-old architecture decision is impeding the power of AI computing

  • aster.cloud
  • February 19, 2025
  • 7 minute read

AI computing has a reputation for consuming epic quantities of energy. This is partly because of the sheer volume of data being handled. Training often requires billions or trillions of pieces of information to create a model with billions of parameters. But that’s not the whole reason — it also comes down to how most computer chips are built.

Modern computer processors are quite efficient at performing the discrete computations they’re usually tasked with. Though their efficiency nosedives when they must wait for data to move back and forth between memory and compute, they’re designed to quickly switch over to work on some unrelated task. But for AI computing, almost all the tasks are interrelated, so there often isn’t much other work that can be done when the processor gets stuck waiting, said IBM Research scientist Geoffrey Burr.



In that scenario, processors hit what is called the von Neumann bottleneck, the lag that happens when data moves slower than computation. It’s the result of von Neumann architecture, found in almost every processor over the last six decades, wherein a processor’s memory and computing units are separate, connected by a bus. This setup has advantages, including flexibility, adaptability to varying workloads, and the ability to easily scale systems and upgrade components. That makes this architecture great for conventional computing, and it won’t be going away any time soon.
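One way to see why the bus becomes the limiting factor is to compare how much arithmetic a workload performs per byte it moves across that bus. The sketch below is a back-of-envelope illustration with assumed sizes and precisions, not a measurement of any particular chip:

```python
# Rough arithmetic-intensity sketch (all numbers are illustrative
# assumptions, not measurements of any specific processor).
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte moved between memory and compute."""
    return flops / bytes_moved

# A matrix-vector multiply y = W @ x with an n x n weight matrix does
# ~2*n*n FLOPs (one multiply + one add per weight), while the weights
# (n*n values, 4 bytes each in fp32) must cross the bus every time
# they cannot stay resident near the compute units.
n = 4096
flops = 2 * n * n
bytes_moved = 4 * n * n
print(arithmetic_intensity(flops, bytes_moved))  # 0.5 FLOPs per byte
```

An intensity of only half a floating-point operation per byte means the processor spends most of its time waiting on the bus rather than computing, which is the bottleneck in a nutshell.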

But for AI computing, whose operations are simple, numerous, and highly predictable, a conventional processor ends up working below its full capacity while it waits for model weights to be shuttled back and forth from memory. Scientists and engineers at IBM Research are working on new processors, like the AIU family, which use various strategies to break down the von Neumann bottleneck and supercharge AI computing.


Why does the von Neumann bottleneck exist?

The von Neumann bottleneck is named for mathematician and physicist John von Neumann, who first circulated a draft of his idea for a stored-program computer in 1945. In that paper, he described a computer with a processing unit, a control unit, memory that stored data and instructions, external storage, and input/output mechanisms. His description didn't name any specific hardware, likely to avoid security clearance issues with the US Army, for whom he was consulting. Almost no scientific discovery is made by one individual, though, and von Neumann architecture is no exception. Von Neumann's work built on that of J. Presper Eckert and John Mauchly, who invented the Electronic Numerical Integrator and Computer (ENIAC), the world's first programmable, general-purpose electronic digital computer. In the decades since that paper was written, von Neumann architecture has become the norm.

“The von Neumann architecture is quite flexible, that’s the main benefit,” said IBM Research scientist Manuel Le Gallo-Bourdeau. “That’s why it was first adopted, and that’s why it’s still the prominent architecture today.”

Discrete memory and computing units mean you can design them separately and configure them more or less any way you want. Historically, this has made it easier to design computing systems because the best components can be selected and paired, based on the application.

Even the cache memory, which is integrated into a single chip with the processor, can still be individually upgraded. “I’m sure there are implications for the processor when you make a new cache memory design, but it’s not as difficult as if they were coupled together,” Le Gallo-Bourdeau said. “They’re still separate. It allows some freedom in designing the cache separately from the processor.”


How the von Neumann bottleneck reduces efficiency

For AI computing, the von Neumann bottleneck creates a twofold efficiency problem: the number of model parameters (or weights) to move, and how far they need to move. More model weights mean larger storage, which usually means more distant storage, said IBM Research scientist Hsinyu (Sidney) Tsai. “Because the quantity of model weights is very large, you can’t afford to hold them for very long, so you need to keep discarding and reloading,” she said.

The main energy expenditure during AI runtime is on data transfers: bringing model weights back and forth between memory and compute. By comparison, the energy spent doing computations is low. In deep learning models, for example, the operations are almost all relatively simple matrix-vector multiplications. Compute still accounts for around 10% of the energy in modern AI workloads, so it isn't negligible, said Tsai. "It is just found to be no longer dominating energy consumption and latency, unlike in conventional workloads," she added.
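Those matrix-vector multiplications are easy to see in code. The minimal sketch below shows a single fully connected layer's forward pass, with arbitrary example dimensions; the point is that the dominant work is one product between static weights and the incoming activations:

```python
import numpy as np

# A single fully connected layer: the dominant work is one
# matrix-vector product between static weights and the activations.
# Dimensions here are arbitrary illustrative choices.
rng = np.random.default_rng(0)
W = rng.standard_normal((512, 768)).astype(np.float32)  # static model weights
b = np.zeros(512, dtype=np.float32)                     # bias vector
x = rng.standard_normal(768).astype(np.float32)         # input activations

y = np.maximum(W @ x + b, 0.0)  # matvec + bias + ReLU
print(y.shape)  # (512,)
```

On von Neumann hardware, every such layer requires `W` to travel from memory to the compute units before this one cheap product can run.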

About a decade ago, the von Neumann bottleneck wasn’t a significant issue because processors and memory weren’t so efficient, at least compared to the energy that was spent to transfer data, said Le Gallo-Bourdeau. But data transfer efficiency hasn’t improved as much as processing and memory have over the years, so now processors can complete their computations much more quickly, leaving them sitting idle while data moves across the von Neumann bottleneck.

The farther the memory is from the processor, the more energy it costs to move data. On a basic physical level, an electrical copper wire is charged to propagate a 1 and discharged to propagate a 0. The energy spent charging and discharging a wire is proportional to its length: the longer the wire, the more energy you spend. Longer wires also mean greater latency, since it takes more time for the charge to propagate or dissipate.
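The proportionality can be made concrete with the standard capacitor-charging relation E = ½CV², where the wire's capacitance C grows with its length. The constants below are assumed placeholder values chosen only to show the scaling, not datasheet figures for any real process:

```python
# Illustrative back-of-envelope constants (assumptions, not datasheet values):
CAP_PER_MM = 0.2e-12   # assumed wire capacitance, farads per millimetre
V_SWING = 1.0          # assumed logic voltage swing, volts

def bit_energy(length_mm, cap_per_mm=CAP_PER_MM, v=V_SWING):
    """Energy (joules) to charge a wire of the given length for one bit.
    E = 0.5 * C * V^2, with C proportional to wire length."""
    return 0.5 * cap_per_mm * length_mm * v ** 2

# Doubling the wire length doubles the per-bit energy:
print(bit_energy(10.0) / bit_energy(5.0))  # 2.0
```

Whatever the exact constants, the linear dependence on length is what makes distant memory expensive.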


Admittedly, the time and energy cost of each data transfer is low, but every time you want to propagate data through a large language model, you need to load up to billions of weights from the memory. This could mean using the DRAM from one or more other GPUs, because one GPU doesn’t have enough memory to store them all. After they’re downloaded to the processor, it performs its computations and sends the result to another memory location for further processing.
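The scale of that weight traffic is worth a rough estimate. The numbers below are assumptions for a hypothetical model (a 7-billion-parameter LLM in 16-bit precision), purely to show the order of magnitude:

```python
# Order-of-magnitude sketch; the parameter count and precision are
# assumptions for a hypothetical model, not any specific LLM.
params = 7e9          # hypothetical 7B-parameter model
bytes_per_weight = 2  # fp16: 2 bytes per weight

bytes_per_pass = params * bytes_per_weight  # weights read once per token
gb_per_token = bytes_per_pass / 1e9
print(f"{gb_per_token:.0f} GB of weight traffic per generated token")

# Generating 1,000 tokens re-reads the full weight set 1,000 times:
print(f"{gb_per_token * 1000 / 1e3:.0f} TB for 1,000 tokens")
```

Even if each individual transfer is cheap, re-reading tens of gigabytes of weights for every token is where the runtime energy goes.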

Short of eliminating the von Neumann bottleneck entirely, one solution is to close that distance. "The entire industry is working to try to improve data localization," Tsai said. IBM Research scientists recently announced such an approach: a polymer optical waveguide for co-packaged optics. This module brings the speed and bandwidth density of fiber optics to the edge of chips, supercharging their connectivity and hugely reducing model training time and energy costs.

With currently available hardware, though, the result of all these data transfers is that training an LLM can easily take months, consuming more energy than a typical US home does in that time. And AI doesn’t stop needing energy after model training. Inferencing has similar computational requirements, meaning that the von Neumann bottleneck slows it down in a similar fashion.

Figure: a. In a conventional computing system, when an operation f is performed on data D, D has to be moved into a processing unit, leading to significant costs in latency and energy. b. In in-memory computing, f(D) is performed within a computational memory unit by exploiting the physical attributes of the memory devices, obviating the need to move D to the processing unit. The computational tasks are performed within the confines of the memory array and its peripheral circuitry, albeit without deciphering the content of the individual memory elements. Both charge-based memory technologies (such as SRAM, DRAM, and flash memory) and resistance-based memory technologies (such as RRAM, PCM, and STT-MRAM) can serve as elements of such a computational memory unit. Source: Nature Nanotechnology

Getting around the bottleneck

For the most part, model weights are stationary, and AI computing is memory-centric rather than compute-heavy, said Le Gallo-Bourdeau. "You have a fixed set of synaptic weights, and you just need to propagate data through them."

This quality has enabled him and his colleagues to pursue analog in-memory computing, which integrates memory with processing, using the laws of physics to store weights. One such approach is phase-change memory (PCM), which stores model weights in the resistivity of a chalcogenide glass, changed by applying an electrical current.
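The idea behind such analog crossbars can be sketched numerically. In an idealized device, weights are stored as conductances and inputs applied as voltages, so Ohm's and Kirchhoff's laws compute the matrix-vector product in place; the simulation below is a heavily simplified model (no noise, drift, or quantization), with made-up values:

```python
import numpy as np

# Idealized analog crossbar: weights stored as conductances G (siemens),
# inputs applied as row voltages V. By Ohm's law each cell passes current
# G[i, j] * V[i], and Kirchhoff's current law sums each column, so the
# column currents I = G.T @ V ARE the matrix-vector product, computed
# where the weights live. (Real PCM adds noise, drift, and limited
# precision; all ignored in this sketch.)
rng = np.random.default_rng(1)
G = rng.uniform(0.0, 1.0, size=(4, 3))  # 4 input rows x 3 output columns
V = np.array([0.2, 0.5, 0.1, 0.9])      # input voltages on the rows

I = G.T @ V  # column currents: the matvec result, with no weight movement
print(I.shape)  # (3,)
```

Because the weights never leave the array, the dominant data transfer of the von Neumann design simply disappears for this operation.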

“This way we can reduce the energy that is spent in data transfers and mitigate the von Neumann bottleneck,” said Le Gallo-Bourdeau. In-memory computing isn’t the only way to work around the von Neumann bottleneck, though.

The AIU NorthPole is a processor that stores memory in digital SRAM, and while its memory isn't intertwined with compute in the same way as in analog chips, each of its numerous cores has access to local memory, making it an extreme example of near-memory computing. Experiments have already demonstrated the power and promise of this architecture. In recent inference tests run on a 3-billion-parameter LLM developed from IBM's Granite-8B-Code-Base model, NorthPole was 47 times faster than the next most energy-efficient GPU and 73 times more energy efficient than the next lowest-latency GPU.

It’s also important to note that models trained on von Neumann hardware can be run on non-von Neumann devices. In fact, for analog in-memory computing, it’s essential. PCM devices aren’t durable enough to have their weights changed over and over, so they’re used to deploy models that have been trained on conventional GPUs. Durability is a comparative advantage of SRAM memory in near-memory or in-memory computing, as it can be rewritten infinitely.

Why von Neumann computing isn’t going away

While von Neumann architecture creates a bottleneck for AI computing, it's perfectly suited to other applications. It causes issues in model training and inference, but it excels at processing computer graphics and other compute-heavy workloads. And when 32- or 64-bit floating-point precision is called for, the low precision of in-memory computing isn't up to the task.

“For general purpose computing, there’s really nothing more powerful than the von Neumann architecture,” said Burr. Under these circumstances, bytes are either operations or operands that are moving on a bus from a memory to a processor. “Just like an all-purpose deli where somebody might order some salami or pepperoni or this or that, but you’re able to switch between them because you have the right ingredients on hand, and you can easily make six sandwiches in a row.” Special-purpose computing, on the other hand, may involve 5,000 tuna sandwiches for one order — like AI computing as it shuttles static model weights.

Even when building their in-memory AIU chips, IBM researchers include some conventional hardware for the necessary high-precision operations.

Even as scientists and engineers work on new ways to eliminate the von Neumann bottleneck, experts agree that the future will likely include both hardware architectures, said Le Gallo-Bourdeau. “What makes sense is some mix of von Neumann and non-von Neumann processors to each handle the operations they are best at.”

Source: zedreviews.com

