
AI challenger Cerebras assembles modular supercomputer ‘Andromeda’ to accelerate large language models


Andromeda is a cluster of 16 Cerebras CS-2 AI computers joined by a dedicated fabric switch and coordinated by a memory appliance that updates the neural network’s weights. Cerebras says programming machines to run large language models is the start of a wave of clustered computing in AI.

Cerebras Systems

The current prevalence of machine learning programs that process large amounts of natural language is pushing the boundaries of computing and fueling its own supercomputer arms race.

Where once supercomputers were built solely for scientific problems, the development of AI programs known as large language models, or LLMs, is driving businesses to seek capacity on par with the world’s leading research laboratories.

For example, Nvidia, the standard-bearer in AI chips, announced in September a cloud computing service dedicated to large language models that enterprises can lease.

On Monday, Cerebras Systems, the six-year-old startup based in Sunnyvale, California, that is among the companies challenging Nvidia’s dominance, disclosed a supercomputer called Andromeda that performs on the order of a quintillion floating-point operations per second, putting it in the same class as the world’s top supercomputer, Frontier, and that the company says can dramatically speed up tasks such as LLM training beyond what thousands of GPU chips can manage.

Also: AI startup Cerebras celebrates chip win, where others have tried and failed

Unlike purpose-built supercomputers that take system makers such as Hewlett Packard Enterprise and IBM years to assemble, the Andromeda machine uses a building-block approach that makes it modular and allows it to be assembled in just a few days.

“We stood it up in three days, and what would have cost $600 million cost us less than $30 million,” said Andrew Feldman, co-founder and CEO of Cerebras.

Within 10 minutes of Andromeda being fully assembled, “we were able to demonstrate linear scaling without changing a single line of code,” Feldman said. Linear scaling means that as individual machines are added to the cluster, the time required to perform a computation decreases proportionally.

For example, scientists at the Department of Energy’s Argonne National Laboratory, working with the Andromeda machine in its early stages, cut the time it takes to train a large language model from 4.1 hours to 2.4 hours by doubling the number of machines from two to four.
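To make the scaling claim concrete, here is a minimal Python sketch that computes the speedup and scaling efficiency implied by those Argonne figures. The numbers come straight from the paragraph above; the efficiency formula is the generic strong-scaling definition, not anything Cerebras-specific.

# Rough arithmetic on the figures quoted above: going from 2 to 4 CS-2s cut
# one Argonne training run from 4.1 hours to 2.4 hours.
baseline_machines, baseline_hours = 2, 4.1
scaled_machines, scaled_hours = 4, 2.4

speedup = baseline_hours / scaled_hours              # ~1.71x faster
ideal_speedup = scaled_machines / baseline_machines  # 2x if scaling were perfectly linear
efficiency = speedup / ideal_speedup                 # ~85% of ideal

print(f"speedup: {speedup:.2f}x, scaling efficiency: {efficiency:.0%}")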

The Andromeda machine is being introduced by Cerebras’s Feldman on Monday at the SC22 conference, a gathering of supercomputing technologists taking place this week in Dallas, Texas. Scientists from Argonne are also presenting a research paper describing their use of the Cerebras machine.

Also: AI chip startup Cerebras raises $250 million in Series F round at over $4 billion valuation

The Andromeda cluster is a combination of Cerebras’s CS-2 computers, specialized AI machines about the size of a dorm refrigerator. Each CS-2’s chip, the Wafer-Scale Engine, the world’s largest semiconductor, has 850,000 parallel compute cores fed by 40 gigabytes of fast on-chip SRAM.

The Andromeda cluster gathers 16 CS-2s for a total of 13.5 million compute cores, 60% more than the Frontier system. Those millions of cores execute in parallel the matrix-multiplication linear algebra operations needed to transform the data samples at each layer of the neural network. Each CS-2 receives a slice of the neural network’s training data to work on.

The CS-2s are tied together by a special data switch Cerebras introduced last year, called Swarm-X, which connects the CS-2s to a third machine, Memory-X. Memory-X acts as a central repository for the neural “weights”, or parameters, which are broadcast to each CS-2. The result of the matrix multiplication on each CS-2 is then passed back through Swarm-X to Memory-X as a gradient update for the weights; Memory-X does the work of recomputing the weights, and the cycle begins again.
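As a rough mental model of that cycle, the sketch below mimics the pattern in plain NumPy: a central store broadcasts the current weights, each worker computes a gradient on its own shard of data, and the averaged gradients flow back to update the weights. It is a generic data-parallel illustration under those assumptions, not Cerebras’s Swarm-X or Memory-X software.

# Conceptual sketch only: a plain-NumPy illustration of the broadcast /
# gradient / update cycle described above, not Cerebras's software stack.
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data split across "workers" (standing in for CS-2s).
num_workers = 4
X = rng.normal(size=(1024, 8))
true_w = rng.normal(size=8)
y = X @ true_w + 0.01 * rng.normal(size=1024)
shards = list(zip(np.array_split(X, num_workers), np.array_split(y, num_workers)))

# Central weight store (standing in for the Memory-X role).
weights = np.zeros(8)
learning_rate = 0.1

def worker_gradient(w, x_shard, y_shard):
    """One worker's job: forward pass plus the mean-squared-error gradient on its shard."""
    residual = x_shard @ w - y_shard
    return x_shard.T @ residual / len(y_shard)

for step in range(100):
    # 1. The current weights are broadcast to every worker.
    # 2. Each worker computes a gradient on its own slice of the data.
    grads = [worker_gradient(weights, xs, ys) for xs, ys in shards]
    # 3. The gradients are gathered and the central store updates the weights.
    weights -= learning_rate * np.mean(grads, axis=0)

print("weight error after training:", np.linalg.norm(weights - true_w))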

Andromeda is assembled in building-block fashion by combining 16 Cerebras CS-2 AI computers linked together by a switch called Swarm-X, which communicates with a central computer, called Memory-X, that keeps and updates the neural network’s weights.

Cerebras Systems

The Andromeda cluster is installed as a cloud-accessible machine at Colovore, a Santa Clara, California-based hosting provider that competes in that market with Equinix.

The key to the modular design is that the CS-2 machines can be composed into a single system without the arduous parallel-programming effort usually required on a supercomputer. Up to 192 CS-2s can be ganged together, and the Cerebras software takes care of the low-level work of allocating the computation to each CS-2 and managing the flow of weights and gradients across the Swarm-X fabric.

Also: Cerebras prepares for the era of 120 trillion parameter neural networks

“Unlike a traditional supercomputer, you can submit your work as if it were a single job on a single CPU, right from a Jupyter notebook,” Feldman said. “All you have to do is specify four things: which model and which parameters; how many of the 16 CS-2s you want to use; where you want the results sent on completion; and what data you want the model to run on – that’s it, no parallel programming, no distributed-computing work.”
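As an illustration of how little a user would need to specify, here is a hypothetical sketch of a job description carrying those four pieces of information. The class, field names, and paths are invented for the example; this is not the actual Cerebras submission API.

# Hypothetical illustration only: not the real Cerebras job-submission API.
# It simply packages the four things Feldman lists into one structure.
from dataclasses import dataclass

@dataclass
class ClusterJobSpec:
    model_name: str           # which model
    model_params: dict        # and which parameters
    num_cs2_systems: int      # how many of the 16 CS-2s to use
    results_destination: str  # where to send results on completion
    dataset_path: str         # what data the model should run on

job = ClusterJobSpec(
    model_name="gpt2-style-lm",
    model_params={"num_layers": 24, "hidden_size": 2048},
    num_cs2_systems=4,
    results_destination="s3://example-bucket/andromeda-results/",  # placeholder
    dataset_path="/data/genome_sequences/",                        # placeholder
)
print(job)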

Cerebras emphasizes the ease of clustering its CS-2s, which does not require writing the awkward parallel, distributed programming code that clusters usually demand.

Cerebras Systems

Early users such as the Argonne team demonstrate that the Andromeda approach can beat supercomputers that use thousands of Nvidia GPUs at some tasks, and can even handle jobs those supercomputers cannot run because of memory limitations.

Argonne’s work is a new twist on large language models: a biological language model that predicts not combinations of words in sentences but combinations of building blocks in genetic sequences. In particular, the researchers devised a way to predict the genetic sequences of variants of SARS-CoV-2, the virus that causes COVID-19.

Using the approach of GPT-2, the large language model created by the startup OpenAI, lead author Maxim Zvyagin and colleagues built a program to predict the order of the four nucleobases in a genetic sequence: adenine (A), cytosine (C), guanine (G), and thymine (T).

By feeding the GPT-2-style program more than 110 million prokaryotic gene sequences, and then “fine-tuning” it on 1.5 million SARS-CoV-2 genomes, the team gave the program the ability to predict the mutations that have emerged in COVID-19 variants.
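The recipe is the same next-token objective used for ordinary text, just applied to a genetic alphabet. The sketch below shows the general idea of turning a nucleotide string into tokens and next-token training pairs; it is an illustration only, and the real GenSLM tokenization and training pipeline may differ.

# Illustration of the general idea only, not the GenSLM code: treat a genetic
# sequence as text and build next-token training examples from it.
vocab = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(sequence):
    """Map a nucleotide string onto integer token IDs, skipping unknown characters."""
    return [vocab[base] for base in sequence if base in vocab]

def next_token_pairs(token_ids, context_length):
    """Yield (context_window, next_token) training pairs, GPT-style."""
    for i in range(len(token_ids) - context_length):
        yield token_ids[i:i + context_length], token_ids[i + context_length]

sample = "ATGGTTTATACCCGCTTT"  # toy fragment, not a real genome
ids = encode(sample)
for window, target in list(next_token_pairs(ids, context_length=6))[:3]:
    print(window, "->", target)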

The result is a “genome-scale language model,” or GenSLM, as Zvyagin and team call their program. It can be used to surveil viruses, predicting the emergence of new COVID variants as a kind of early-warning system.

Also: Nvidia CEO Jensen Huang announces availability of GPU ‘Hopper’, cloud service for big AI language models

“We propose a system that learns to model whole-genome evolution using LLM based on observational data and enables tracking of VOCs [variants of concern] based on measures of fitness and immune escape,” they wrote.

The authors tested the GenSLM program on two supercomputers: Polaris, a cluster of more than two thousand Nvidia A100 GPUs, and Selene, a cluster of more than 4,000 A100s. Those two machines are the 14th- and 8th-fastest supercomputers in the world, respectively. They also ran the work on Andromeda to see how it would stack up.

The Andromeda system cut training time from more than a week to less than a day, they write:

[T]hese training runs usually take >1 week on dedicated GPU resources (such as Polaris@ALCF). To enable training of larger models over the entire sequence length (10,240 tokens), we used AI hardware accelerators such as the Cerebras CS-2, both in stand-alone mode and as an interconnected cluster, and obtained converging GenSLMs in less than a day.

There is a version of the GenSLM task that cannot even run on the Polaris and Selene machines, Zvyagin and colleagues write.

A language model takes as input a certain number of letters, words, or other “tokens” that are considered together as a sequence. For natural-language tasks, such as predicting the next word, a sequence of five hundred or a thousand words may be sufficient.

But genetic code, such as a nucleotide sequence, has to be considered across thousands of tokens at a time, in stretches known as “open reading frames,” the longest of which here is 10,240 tokens. Because more input tokens consume more on-chip memory, the GPUs in Polaris and Selene cannot process a 10,240-token sequence for language models beyond a certain size: the memory needed for both the weights and the encoded input exhausts the GPU’s available memory.
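A rough back-of-the-envelope estimate shows why sequence length bites so hard: the attention-score matrices of a standard transformer grow with the square of the sequence length. The layer and head counts in the sketch below are illustrative assumptions, not the actual GenSLM or GPU configurations.

# Back-of-the-envelope sketch: memory for the (seq_len x seq_len) attention
# scores in a standard transformer, per training sample. Layer and head counts
# are illustrative assumptions, not actual GenSLM settings.
def attention_score_gib(seq_len, num_layers, num_heads, bytes_per_value=2):
    """GiB needed to hold every layer's attention-score matrices for one sample."""
    per_layer = num_heads * seq_len * seq_len * bytes_per_value
    return num_layers * per_layer / 2**30

for seq_len in (1_024, 10_240):
    gib = attention_score_gib(seq_len, num_layers=24, num_heads=16)
    print(f"{seq_len:>6} tokens -> ~{gib:.1f} GiB of attention scores per sample")

With those illustrative settings, the quadratic term alone runs to tens of gigabytes at 10,240 tokens, before counting the model’s weights, which illustrates the kind of memory pressure the authors describe.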

Cerebras CEO Andrew Feldman says the market is ripe for clustered computing. “With large language models, we’re getting to the point where people want to go fast,” he says. “If we had built a big cluster a year ago, people would have been like, what? But right now, people are eager to train GPT-3 at 13 billion parameters.”

Cerebras Systems

“We note that for the larger model sizes of 2.5 billion and 25 billion neural weights, or parameters, training on the SARS-CoV-2 data with a sequence length of 10,240 was not feasible on GPU clusters due to out-of-memory errors during the attention computation,” the authors write. The Andromeda machine, however, was able to process 10,240-token sequences, thanks to the 40 gigabytes of on-chip memory in each CS-2’s chip, for models with up to 1.3 billion parameters.

According to Feldman, while the Argonne paper describes only the two- and four-node versions of Andromeda, this week’s presentation at SC22 shows how compute time continues to scale down as more machines are added. The 10.4 hours that a four-way Andromeda takes to train GenSLM on 10,240 input tokens with 1.3 billion weights drops to 2.7 hours using all 16 machines.

More than simply speed and scale, the GenSLM paper, Feldman said, suggests something profound is at play in combining biological data with language models.

“We put the entire COVID genome into that sequence window, and each gene was analyzed in the context of the whole genome,” Feldman said.

“Why is that cool? It’s cool because what we’ve learned over the last 30 years is that, like words, genes express themselves differently based on who their neighbors are.”

From a business standpoint, Feldman said, the market is ripe for the horsepower to run large language models.

“With large language models, we’re getting to the point where people want to go fast,” he says. “If we had built a big cluster a year ago, people would have been like, what? But right now, people are eager to train GPT-3 at 13 billion parameters, or GPT-NeoX, which is a 20-billion-parameter model.”

Clusters, he suggested, could be a big step forward both for parallel processing of a single job and for multi-user scenarios within an organization.

“I think there’s a market emerging where people want time on a big cluster, and they want to SSH in; they don’t want anything fancy. They just want to feed it their data and go.”
