Arm's New Mali-G77 & Valhall GPU Architecture: A Major Leap
by Andrei Frumusanu on May 27, 2019 12:00 AM EST
The Mali-G77 Microarchitecture
Having covered the execution engine, which is responsible for arithmetic processing, we should note that it is only one part of the wider core design. Here Arm has kept the overall design quite similar to previous-generation GPUs, though with some important changes in several blocks.
A shader core still contains the execution engine, load/store unit with cache, attribute unit, varying unit, texture mapping unit and pixel backend, as well as various other 3D fixed function blocks.
The biggest change here is in the texture unit block, which has doubled its throughput compared to the already doubled unit that we found on the Mali-G76.
From a high-level functionality standpoint, the new TMU looks quite similar to its predecessor, however we find some very significant changes in terms of the throughput of the new design.
The design is partitioned into two “paths”, a hit-path and a miss-path, which handle accesses that hit in the texture cache and accesses that miss it, respectively. The hit-path is naturally the shorter, more latency-optimised path.
On the hit-path, the texture cache itself has been improved: it is now 32KB and is capable of 16 texels/cycle throughput. The filtering unit has also been improved and now supports one quad per cycle for bilinear texturing, or half a quad per cycle for trilinear texturing, both 2x the G76’s throughput.
Interestingly, Arm says that the new TMU occupies roughly the same area as its predecessor while still delivering this doubling of capability, which is quite a nice engineering feat.
Fundamentally, this large increase in the texturing capability of a core changes the ALU:Tex ratio of the GPU. Even though ALU capability has increased by 33%, the doubling of the TMU throughput means that we’re now back to a lower ratio, more in favour of texture throughput, whereas past GPUs focused on increasing compute performance. Arm deemed this a necessary change for workloads that are now starting to tax this aspect of GPUs more.
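To put a rough number on that shift, the small sketch below divides the two scaling factors quoted above. The function name is hypothetical, and treating "33%" as exactly 1.33x and the TMU doubling as exactly 2x is an assumption made only for illustration.

```python
def relative_ratio_change(alu_scale: float, tex_scale: float) -> float:
    """Return the new ALU:Tex ratio as a multiple of the old ratio."""
    return alu_scale / tex_scale


# Scaling factors taken from the text; exact values are an assumption.
ALU_SCALE = 1.33  # "ALU capability has increased by 33%"
TEX_SCALE = 2.0   # "doubling of the TMU throughput"

change = relative_ratio_change(ALU_SCALE, TEX_SCALE)
print(f"ALU:Tex ratio is about {change:.2f}x that of the previous generation")
```

Under these assumptions the ALU:Tex ratio drops to roughly two thirds of what it was on the previous generation.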
It’s to be noted that while the texture filtering throughput has increased, the actual pixel backend throughput has not. Here a shader core is still only able to draw out 2 pixels per clock, so we now have a 2:1 texel:pixel ratio, whereas in the past it was 1:1.
Another redesign among the shader core blocks is the new load-store cache block. Functionally it’s the same as in the past, however it has now been redesigned with more throughput in mind. Within the same area, the number of pipeline stages has been reduced by half, further reducing the latency of the core’s operation. The bandwidth has been widened to a full cacheline width, which should be a doubling over its predecessor.
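The 2:1 and 1:1 figures can be checked with a short calculation. This is only a sketch: it assumes a quad is a 2x2 block (4 texels), takes the bilinear rates from the text (one quad per cycle on the G77, hence half a quad on the G76), and uses hypothetical names throughout.

```python
TEXELS_PER_QUAD = 4  # assumption: a quad is a 2x2 block


def texel_pixel_ratio(quads_per_clock: float, pixels_per_clock: float) -> float:
    """Bilinear-filtered texels per clock divided by pixels written per clock."""
    texels_per_clock = quads_per_clock * TEXELS_PER_QUAD
    return texels_per_clock / pixels_per_clock


# Per-shader-core figures as described in the text.
g77_ratio = texel_pixel_ratio(quads_per_clock=1.0, pixels_per_clock=2)
g76_ratio = texel_pixel_ratio(quads_per_clock=0.5, pixels_per_clock=2)

print(f"G77 texel:pixel = {g77_ratio:g}:1")
print(f"G76 texel:pixel = {g76_ratio:g}:1")
```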
The actual cache is 16KB in size and 4-way set associative, and is said to be very useful for ML workloads.
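As a rough illustration of what those parameters imply, the sketch below derives the set count and address split for a 16KB, 4-way cache. The 64-byte line size is an assumption (inferred from the 64B/cycle "full cacheline" figure), and the simple power-of-two indexing scheme is a textbook model, not something Arm has confirmed for this block. All names are hypothetical.

```python
CACHE_BYTES = 16 * 1024  # from the text
WAYS = 4                 # from the text
LINE_BYTES = 64          # assumption: one cacheline is 64 bytes


def cache_geometry(cache_bytes: int, ways: int, line_bytes: int):
    """Return (number of sets, offset bits, index bits) for a set-associative cache."""
    num_sets = cache_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1  # log2 for powers of two
    index_bits = num_sets.bit_length() - 1
    return num_sets, offset_bits, index_bits


def split_address(addr: int, offset_bits: int, index_bits: int):
    """Split an address into (tag, set index, byte offset) using plain modulo indexing."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


num_sets, offset_bits, index_bits = cache_geometry(CACHE_BYTES, WAYS, LINE_BYTES)
print(num_sets, offset_bits, index_bits)
print(split_address(0x12345, offset_bits, index_bits))
```

With these assumptions the cache has 64 sets of four lines each, so any given address can live in one of four places.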
Putting all the pieces together and zooming out from a shader core to the GPU level, we again see a lot of familiarity in how Arm organises its overall block. The architecture supports scaling shader cores from 1 core to 32 cores, although the microarchitecture of the G77 currently only supports up to 16 cores. Furthermore, the current smallest design that Arm makes RTL ready for is a 7-core configuration, as the company deems customers going for smaller configurations would be better served by different IP (such as the G52, or maybe a future unannounced IP in the same range).
The L2 cache still consists of up to four slices, each from 256KB to 1MB in size. Currently, most vendors have gone with 2MB configurations and I don’t think any licensee has ever implemented 4MB. In terms of bandwidth, the L2-to-LSC bandwidth has also doubled from 32B/cycle to 64B/cycle (a full cacheline), while the external bandwidth depends on whether the vendor implements a 128-bit or 256-bit AXI interface to each of the L2 slices.
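The configuration space described above can be sketched in a few lines. The helper names are hypothetical, the range checks simply mirror the limits quoted in the text, and the bandwidth figure is bytes per cycle summed over all slices, since no clock frequency is given here.

```python
def l2_total_kb(slices: int, slice_kb: int) -> int:
    """Total L2 capacity in KB for a given slice count and per-slice size."""
    if not 1 <= slices <= 4:
        raise ValueError("the text describes up to four L2 slices")
    if not 256 <= slice_kb <= 1024:
        raise ValueError("the text describes slices of 256KB to 1MB")
    return slices * slice_kb


def axi_bytes_per_cycle(slices: int, axi_bits: int) -> int:
    """Aggregate external bandwidth in bytes/cycle, one AXI port per slice."""
    if axi_bits not in (128, 256):
        raise ValueError("the text describes 128-bit or 256-bit AXI interfaces")
    return slices * axi_bits // 8


print(l2_total_kb(4, 512))          # one way to reach a 2MB configuration
print(l2_total_kb(4, 1024))         # the 4MB maximum
print(axi_bytes_per_cycle(4, 128))  # four slices, 128-bit AXI each
print(axi_bytes_per_cycle(4, 256))  # four slices, 256-bit AXI each
```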
42 Comments
patel21 - Monday, May 27, 2019 - link
"and now Arm as well as the partner licensees just need to execute properly for users to be able to enjoy the end-results.”
This has been the biggest issue. Samsung gimps on GPU cores on all their soc's except their top tier.
Same with Mediatek, where it had chance to use a higher core gpu in its P60/P70 but it didn't.
eastcoast_pete - Monday, May 27, 2019 - link
Yes, those 2 core GPUs didn't exactly help the image of MALI architecture.
Lolimaster - Wednesday, May 29, 2019 - link
Specially on things like the horrible Galaxy A8 with a pathetic mp2
ZolaIII - Monday, May 27, 2019 - link
While Mali G77 looks very decent and competitive for the first time in a firm's history what's the use if ARM will lose 70% of the Mali GPU market share?
darkich - Monday, May 27, 2019 - link
That, and the fact that Samsung is in the process of developing their own, supposedly revolutionary GPU architecture.
Correction though.. Mali was more than competitive back in the Galaxy S1/S2 days
ZolaIII - Tuesday, May 28, 2019 - link
Correction MALI whose never ever competitive before, it laged far behind Imagion (ATI - later QC Adreno).
jackthepumpkinking6sic6 - Thursday, May 30, 2019 - link
Correction in S2, Note2, Note4, S6/Note5, and even reading blows in s8 generation the Mali gpu was stronger though brute force or longevity.
Not to mention that in almost every generation the CPU was better, overall efficiency was better, and audio was better.
It's only been the last couple years that Qualcomm really stepped up the efficiency and audio game. And unfortunately Samsung decided to start flopping those as Qualcomm makes great strides
ZolaIII - Thursday, May 30, 2019 - link
You must be joking. Get a grip on your self. The Scorpions ware somehow the letdown the regular A9 whose much better, the Krait made a established leader of Qualcomm all together the rest is a same o as it's based on little to no changed default core's. Adding an external (not on SoC) AMP/DAC has nothing with either of them. Seriously MALI whosent never even considered as worthy competitor to neither Adreno or Power VR in the past, Broadcomm VP whose at least competitive at first.
jackthepumpkinking6sic6 - Thursday, May 30, 2019 - link
You're clearly the only one that needs to get a grip. You didn't even counter what I said.
Troll again? Of course you will
ZolaIII - Friday, May 31, 2019 - link
I see you get a hold on a grip on your self & self corrected your mumbling! So Homingbird used PowerVR SGX 540 not MALI 400. S6 Had a worst battery life ever.