Tenstorrent Launches Wormhole AI Processors: 466 FP8 TFLOPS at 300W
by Anton Shilov on July 19, 2024 2:30 PM EST
Tenstorrent has unveiled its next-generation Wormhole processor for AI workloads, which promises decent performance at a low price. The company currently offers two PCIe add-in cards carrying one or two Wormhole processors, as well as the TT-LoudBox and TT-QuietBox workstations aimed at software developers. Everything in today's release is aimed at developers rather than at those who will deploy Wormhole boards for commercial workloads.
“It is always rewarding to get more of our products into developer hands. Releasing development systems with our Wormhole™ card helps developers scale up and work on multi-chip AI software,” said Jim Keller, CEO of Tenstorrent. “In addition to this launch, we are excited that the tape-out and power-on for our second generation, Blackhole, is going very well.”
Each Wormhole processor packs 72 Tensix cores (each featuring five RISC-V cores and supporting various data formats) with 108 MB of SRAM, delivering 262 FP8 TFLOPS at 1 GHz within a 160W thermal design power. The single-chip Wormhole n150 card carries 12 GB of GDDR6 memory with 288 GB/s of bandwidth.
Wormhole processors scale flexibly to meet the varying needs of workloads. In a standard workstation setup with four Wormhole n300 cards, the processors can merge to function as a single unit, appearing to software as one extensive, unified network of Tensix cores. This configuration allows the accelerators to work on a single workload, be divided among four developers, or run up to eight distinct AI models simultaneously. A crucial feature of this scalability is that it operates natively, without the need for virtualization. In data center environments, Wormhole processors will scale both within one machine using PCIe and across machines using Ethernet.
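To illustrate the programming model, here is a minimal Python sketch of how a four-card mesh might be carved up. The names are hypothetical stand-ins, not Tenstorrent's actual API; the point is simply that software sees one flat grid of Tensix cores that can be partitioned without a virtualization layer.

```python
# Hypothetical sketch of the scaling model described above; none of these
# names are Tenstorrent's real API. Plain Python stands in for the driver.
N300_CORES_PER_CARD = 128  # two Wormhole chips of 64 Tensix cores each

def open_mesh(num_cards: int) -> list[int]:
    """Enumerate all Tensix cores across cards as one flat, unified grid."""
    return list(range(num_cards * N300_CORES_PER_CARD))

mesh = open_mesh(4)      # four n300 cards in one workstation
assert len(mesh) == 512  # software sees a single 512-core device

# The same pool can be carved up natively, with no virtualization:
one_workload = [mesh]                          # all cores on one job
four_devs    = [mesh[i::4] for i in range(4)]  # one 128-core slice per developer
eight_models = [mesh[i::8] for i in range(8)]  # one 64-core slice per model
print(len(one_workload[0]), len(four_devs[0]), len(eight_models[0]))  # 512 128 64
```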
From a performance standpoint, Tenstorrent's single-chip Wormhole n150 card (72 Tensix cores at 1 GHz, 108 MB of SRAM, 12 GB of GDDR6 at 288 GB/s) is capable of 262 FP8 TFLOPS at 160W, whereas the dual-chip Wormhole n300 board (128 Tensix cores at 1 GHz, 192 MB of SRAM, an aggregated 24 GB of GDDR6 at 576 GB/s) can offer up to 466 FP8 TFLOPS at 300W (according to Tom's Hardware).
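As a sanity check on those published figures, the per-core FP8 throughput works out to the same value on both boards, which is what you would expect from shared silicon:

```python
# Back-of-the-envelope check of the published specs quoted above.
n150_tflops, n150_cores = 262, 72
n300_tflops, n300_cores = 466, 128

per_core_n150 = n150_tflops / n150_cores  # ~3.64 FP8 TFLOPS per Tensix core
per_core_n300 = n300_tflops / n300_cores  # ~3.64 FP8 TFLOPS per Tensix core

# At 1 GHz, that is roughly 3,600 FP8 operations per core per clock.
print(f"{per_core_n150:.2f} vs {per_core_n300:.2f} TFLOPS per core")
```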
To put that 466 FP8 TFLOPS at 300W figure into context, let's compare it to what AI market leader Nvidia has to offer at this thermal design power. Nvidia's A100 does not support FP8, but it does support INT8, with a peak performance of 624 TOPS (1,248 TOPS with sparsity). By contrast, Nvidia's H100 supports FP8 and its peak performance is a massive 1,670 TFLOPS (3,341 TFLOPS with sparsity) at 300W, which is a big difference from Tenstorrent's Wormhole n300.
There is a big catch, though. Tenstorrent's Wormhole n150 is offered for $999, whereas the n300 is available for $1,399. By contrast, a single Nvidia H100 card can retail for $30,000, depending on quantities. Of course, we do not know whether four or eight Wormhole processors can indeed deliver the performance of a single H100, and even if they can, they will do so at a 600W or 1,200W TDP, respectively.
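A rough efficiency calculation makes the trade-off explicit. Note the assumptions: the H100's 300W figure is this article's framing (a comment below points out the card's actual TDP is 700W), and the $30,000 price is an estimate that varies with quantities.

```python
# Rough perf-per-watt and perf-per-dollar from the figures quoted above.
parts = {
    "Wormhole n300": {"fp8_tflops": 466,  "watts": 300, "usd": 1_399},
    "H100 (FP8)":    {"fp8_tflops": 1670, "watts": 300, "usd": 30_000},
}
for name, p in parts.items():
    print(f"{name}: {p['fp8_tflops'] / p['watts']:.2f} TFLOPS/W, "
          f"{p['fp8_tflops'] / p['usd'] * 1000:.0f} TFLOPS per $1,000")
# Wormhole n300: 1.55 TFLOPS/W, 333 TFLOPS per $1,000
# H100 (FP8):    5.57 TFLOPS/W,  56 TFLOPS per $1,000
```

On raw efficiency Nvidia is far ahead; on throughput per dollar the positions reverse, which is the whole pitch.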
In addition to the cards, Tenstorrent offers developers pre-built workstations with four n300 cards inside: the less expensive, actively cooled Xeon-based TT-LoudBox and the premium, liquid-cooled EPYC-powered TT-QuietBox.
Sources: Tenstorrent, Tom's Hardware
Comments
flyingpants265 - Monday, July 22, 2024 - link
This is a purely meaningless post.

Dante Verizon - Friday, July 19, 2024 - link
What is the chip's lithography? That doesn't seem efficient to me compared to the MI300/H100.

mode_13h - Friday, July 19, 2024 - link
The board-level specs are here: https://tenstorrent.com/hardware/wormhole
The chips are detailed here: https://www.semianalysis.com/p/tenstorrent-wormhol...
According to that, the chips are 12 nm.
Dante Verizon - Friday, July 19, 2024 - link
Not bad for a 12nm chip. At that price, it should appeal to some niche consumers.

nandnandnand - Friday, July 19, 2024 - link
How does it compare to an RTX 4090 for the operations both support? Because at those prices and being a PCIe card, consumers could use it in their desktops.

mode_13h - Saturday, July 20, 2024 - link
The RTX 4090 is rated at 292/330 fp16 TFLOPS (base/boost; non-sparse) @ 450W. The Tenstorrent n300s is rated at 131 fp16 TFLOPS @ 300W and sells for $1400.
Both have 24 GB of on-board GDDR6.
So, it's not really competitive, but then I guess the silicon dates back to 2021 and matched up pretty well against the RTX 3090. The main reason to go with Tenstorrent is probably as a development vehicle, in preparation for their future chips.
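Worked out per watt from those figures (a rough calculation, using the 4090's boost rating):

```python
# fp16 TFLOPS per watt, non-sparse, from the numbers quoted above.
rtx_4090 = 330 / 450  # ~0.73 TFLOPS/W at boost clocks
n300     = 131 / 300  # ~0.44 TFLOPS/W
print(f"RTX 4090: {rtx_4090:.2f} TFLOPS/W, n300: {n300:.2f} TFLOPS/W")
```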
Terry_Craig - Sunday, July 21, 2024 - link
"By contrast, Nvidia's H100 supports FP8 and its peak performance is massive 1,670 TFLOPS (3,341 TFLOPS with sparsity) at 300W, which is a big difference from Tenstorrent's Wormhole n300."H100 TDP's 700w > https://www.nvidia.com/en-us/data-center/h100/ *The FP8 numbers mentioned on the website are using sparsity.
Plus, it is worth remembering that the theoretical number is not always achieved in practice, especially when it comes to GPUs:
"I noticed in CUDA 12.1 update 1 that FP8 matrix multiples are now supported on Ada chips when using cuBLASLt. However, when I tried a benchmark on an RTX 4090 I was only able to achieve 1/2 of the rated throughput, around ~330-340 TFLOPS. My benchmark was a straightforward modification of the cuBLASLt FP8 sample 124 to use larger matrices, run more iterations and use CUDA streams. I primarily tried N = N = K = 8192, but other sizes had similar behavior. I tried this with both FP16 and FP32 output and got the same result, although I was only able to use FP32 for the compute type as this is the only supported mode in cuBLASLt right now.
My result is quite far off from the specified 660 TFLOPs in the Ada whitepaper 211 for FP8 tensor TFLOPs with FP32 accumulate. Is there a mistake in the white paper, or is there some incorrect throttling of FP8->FP32 operations going on (much like how FP16 → FP32 operations are half-rate on GeForce cards)?" >
https://forums.developer.nvidia.com/t/ada-geforce-...
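For reference, the ~330-340 TFLOPS figure falls directly out of the matmul's operation count; a quick sketch (the elapsed time per matmul below is assumed for illustration, not a measurement):

```python
# Achieved throughput for an M = N = K = 8192 FP8 matmul: 2*M*N*K FLOPs per call.
M = N = K = 8192
flops_per_matmul = 2 * M * N * K  # ~1.1e12 floating-point operations
elapsed_s = 3.3e-3                # assumed wall time per matmul (illustrative)
tflops = flops_per_matmul / elapsed_s / 1e12
print(f"~{tflops:.0f} TFLOPS, about half the 660 TFLOPS in the original whitepaper")
```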
mode_13h - Monday, July 22, 2024 - link
> Is there a mistake in the white paper, or is there some incorrect throttling of
> FP8->FP32 operations going on (much like how FP16 → FP32 operations
> are half-rate on GeForce cards)?
I had this exact thought. I doubt the excuse made by the Moderator that it's being power/thermal-limited, since the specified performance at base clocks is just about 11.5% less than boost, not half!
Rudde - Thursday, July 25, 2024 - link
Why not also quote the follow-up post that mentions Nvidia has corrected the whitepaper numbers? https://forums.developer.nvidia.com/t/fp8-fp16-acc...

SanX - Tuesday, July 30, 2024 - link
The price is good, but everything else is not even close to the four-year-old A100. A better name for it would be Ahole.