Tried it out: AMD GCN - Introducing the Radeon HD 7970 and HD 7950
The time has come for us to finally introduce the AMD GCN (Graphics Core Next) architecture and its two most powerful representatives, the Radeon HD 7970 and the Radeon HD 7950.
We got hold of two HD 7950s straight away, so we also had the opportunity to test CrossFireX, and we ran overclocked measurements with both cards as well. Of course, we also lined up a number of rivals against the two new big guns to see how much faster the cards are than the previous-generation GeForces and Radeons.
Before we get into the participants and the tests, we will take a closer look at the GCN architecture and go over the features of the HD 7970 and HD 7950.
Graphics Core Next
In May 2007, AMD introduced the Radeon HD 2900 XT graphics card, which was already built on a unified shader architecture. As it turned out, the design had a number of flaws, but these were almost completely eliminated by the time of the Radeon HD 4000 series, giving the company a firm foothold in the desktop graphics card market. By then it was clear that radical changes were needed. The HD 6900 “Cayman” series is considered the first step. Here, the previous 5-way superscalar processors (VLIW5) were replaced by 4-way processors (VLIW4), and Cayman was the first chip to handle multiple independent instruction streams. The other big innovation was the introduction of two “graphics engines” that doubled the triangle setup capacity - boosting tessellation performance - and the number of certain elements (Rasterizer, Hierarchical Z, Tessellator). The next stage is the subject of today's test. With the architecture called Graphics Core Next (GCN), the shader arrays working with VLIW instructions have become obsolete, replaced by so-called Compute Units (CUs). GCN makes its debut in the Radeon HD 7900 “Tahiti” family.
Interestingly, but not surprisingly, Tahiti GPUs have achieved outstanding transistor density thanks to TSMC’s 28 nm manufacturing process - packing 4.3 billion transistors into 365 square millimeters. One Compute Unit contains four SIMDs and one scalar unit. AMD’s flagship Radeon HD 7970 “Tahiti XT” works with 32 active CUs, giving a total of 2048 shader processors (each CU has four 16-wide SIMDs, that is, 64 ALUs). Given the progress of previous generations, this does not seem to be an outstanding value at first glance, but because efficiency and utilization have improved, it would be rash to draw far-reaching conclusions from this one specification. Theoretically, a CU can perform as much as a single Cayman SIMD unit. A major problem with previous generations was data dependency (consecutive instructions depending on each other's results), which caused utilization to fluctuate sharply. The GCN architecture is a step forward in this area too, because its stream processing approach eliminates the dependencies experienced before. The benefits, in keywords: scheduling, debugging, estimating expected performance, and driver development have all become radically simpler and more transparent.
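To make the data dependency problem more tangible, here is a minimal toy model in Python. It is only a sketch under our own assumptions, not AMD's compiler or hardware: the function `vliw_utilization` and its greedy packing rule are hypothetical, and each operation is assumed to depend only on earlier operations. It packs operations into VLIW4 bundles and reports how many of the issue slots end up doing useful work.

```python
def vliw_utilization(deps, width=4):
    """Toy greedy packer (hypothetical, for illustration only).

    deps[i] is the set of earlier operation indices that operation i
    depends on. An operation may join a bundle only if all of its
    dependencies finished in an earlier bundle.
    Returns the fraction of issue slots that were filled.
    """
    done = set()
    remaining = list(range(len(deps)))
    bundles = 0
    while remaining:
        bundle = []
        for op in remaining:
            if len(bundle) == width:
                break
            if deps[op] <= done:  # all inputs already computed
                bundle.append(op)
        done.update(bundle)
        remaining = [op for op in remaining if op not in done]
        bundles += 1
    return len(deps) / (bundles * width)


# Eight operations where each one needs the result of the previous one.
chain = [set()] + [{i} for i in range(7)]
# Eight operations with no dependencies at all.
independent = [set() for _ in range(8)]

print(vliw_utilization(chain))        # 0.25
print(vliw_utilization(independent))  # 1.0
```

In this model a fully dependent chain fills only one of four slots per bundle, while independent work fills them all, which is the kind of swing in utilization described above. A GCN SIMD instead runs one instruction across many work-items at a time, so a dependency chain inside a single work-item does not leave lanes idle in the same way.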
A CU not only contains four SIMD units, but also has its own scheduler, 340 KB of on-chip storage and a texturing cluster. The storage figure is the sum of the 4 × 64 KB vector registers, the Local Data Share, which is also 64 KB in size, the 4 KB scalar register, and the first-level cache with a capacity of 16 KB. In the picture above, you can see one more component that definitely deserves a mention: the so-called "Branch & Message Unit", which helps control program flow more efficiently.
With this information in hand, let’s review the key parameters of the “Tahiti XT” graphics processor again: 32 CUs (2048 shader processors, 128 SIMDs), 128 texturing units, 512 Load-Store Units, and a total of 8.2 MB of cache. That already puts things in a different light, even though we have only just started peeling back the layers.
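The arithmetic behind these figures is easy to check. The short Python sketch below only uses numbers quoted in the text; the per-CU texture and load-store unit counts are not stated in the article and are simply derived here by dividing the chip totals by the number of CUs.

```python
# Per-CU storage as listed in the text, in KB.
cu_storage_kb = {
    "vector registers (4 SIMDs x 64 KB)": 4 * 64,
    "Local Data Share": 64,
    "scalar registers": 4,
    "L1 cache": 16,
}
per_cu_kb = sum(cu_storage_kb.values())

# Chip-wide unit counts for Tahiti XT.
CUS = 32
SIMDS_PER_CU = 4
LANES_PER_SIMD = 16

shader_processors = CUS * SIMDS_PER_CU * LANES_PER_SIMD
simds = CUS * SIMDS_PER_CU

# Derived values (assumption: units are spread evenly across the CUs).
texture_units_per_cu = 128 // CUS
load_store_units_per_cu = 512 // CUS

print(per_cu_kb)                # 340
print(shader_processors)        # 2048
print(simds)                    # 128
print(texture_units_per_cu)     # 4
print(load_store_units_per_cu)  # 16
```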
Front-end
In terms of the frontend, we can see significant differences compared to the architecture of NVIDIA's GF110. Work distribution is basically not done at the level of the CUs. This task is performed by the Command Processor and the Asynchronous Compute Engine (ACE). The chip has been given two geometry engines that, in addition to the geometry assembler and the vertex assembler, also house ninth-generation tessellator units. Communication between CUs is handled by the Global Data Share (GDS), through which these units can also share data with each other. The frontend section contains two rasterizers - you can see the layout below.
ROPs and memory interface
AMD Tahiti contains 8 ROP clusters - in this respect it matches the Cayman chip. Each such “array” contains four ROP units and 16 Z samplers. It is important to mention that each cluster has been given its own cache. Another major change has taken place: the ROPs no longer have a direct connection to the memory controller. The move is intended to improve flexibility and usability, which we may see in the context of Pitcairn… ROPs can write to the 768 KB L2 cache, which in turn can be read by multiple units. The memory interface is a pleasing sight. The six 64-bit memory controllers add up to a 384-bit bus. We have just one word to add to this: finally! The default size of the video memory is 3072 MB, but in theory, 1536 MB and 6 GB are also possible.
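The backend numbers can be worked through in a few lines of Python. This is only a sketch: the ROP, Z sampler and bus figures come from the text, while the 32-bit interface per memory chip and the per-chip capacities of 128, 256 and 512 MB are our own assumptions, chosen because they reproduce the three memory sizes mentioned.

```python
ROP_CLUSTERS = 8
rops = ROP_CLUSTERS * 4        # four ROP units per cluster
z_samplers = ROP_CLUSTERS * 16 # 16 Z samplers per cluster

MEMORY_CONTROLLERS = 6
bus_width_bits = MEMORY_CONTROLLERS * 64

# Assumption: each memory chip has a 32-bit interface.
chips = bus_width_bits // 32

# Assumption: chips of 128, 256 or 512 MB each.
capacities_mb = [chips * per_chip_mb for per_chip_mb in (128, 256, 512)]

print(rops)            # 32
print(z_samplers)      # 128
print(bus_width_bits)  # 384
print(chips)           # 12
print(capacities_mb)   # [1536, 3072, 6144]
```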
We hope our readers won’t hold it against us, but at this point we would like to express our personal opinion on the backend. The relationship between Barts, which has performed really well, and the Cayman chip, which has shown modest results, suggests that the “general problem” with AMD chips is the tight ROP capacity. No progress has been made on Tahiti here either, while, with some exaggeration, pages could still be written about the chip's other innovations. The role of the ROPs is especially prominent in games, whereas in GPGPU tasks and applications they play second fiddle. It is also certain that this section consumes a large number of transistors, which of course is also reflected in the size of the chip.
AMD’s enhancements so far have largely served the needs of gamers. Now there has been a turn of at least 90 degrees, and meeting professional needs and using the GPU more widely has become a strong focus. Of course, this is not a problem, as we are basically talking about a brutal level of performance, which will certainly withstand the trials of modern games for a few years. According to rumors, not only AMD but also NVIDIA is being stingy with ROPs in Kepler.
Expanding the memory bus was a commendable step. In fact, the designers had little choice. Clocks can no longer be significantly increased, but the chip is starving for data. In our opinion, this move alone could have increased performance during games by up to 15 percent.
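As a rough illustration of why the wider bus matters when clocks are stuck, the Python sketch below computes theoretical memory bandwidth from bus width and per-pin data rate. Note the assumptions: the article does not quote memory clocks, so the 5.5 Gbit/s effective GDDR5 data rate and the 256-bit bus used for the previous-generation comparison are figures we are supplying ourselves, and the helper name `bandwidth_gb_per_s` is hypothetical.

```python
def bandwidth_gb_per_s(bus_width_bits, gbit_per_s_per_pin):
    """Theoretical peak memory bandwidth in GB/s."""
    return bus_width_bits / 8 * gbit_per_s_per_pin


# Assumption (not from the article): 5.5 Gbit/s per pin on both buses.
DATA_RATE = 5.5

previous_256_bit = bandwidth_gb_per_s(256, DATA_RATE)
tahiti_384_bit = bandwidth_gb_per_s(384, DATA_RATE)

print(previous_256_bit)                   # 176.0
print(tahiti_384_bit)                     # 264.0
print(tahiti_384_bit / previous_256_bit)  # 1.5
```

With the data rate held constant, going from 256 to 384 bits yields 50 percent more theoretical bandwidth, which is the only lever left once the memory clock cannot be raised much further.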
DirectX 11.1 and PCI Express 3.0
The PCI Express 3.0 standard raises the bandwidth of an x16 slot from 16 GB/s to roughly 32 GB/s (counting both directions), doubling the data transfer rate of PCIe 2.0. Motherboard manufacturers immediately jumped on the topic, but no matter how much they want it to, the switch doesn’t offer a significant advantage at the moment. PCIe 3.0 is an important weapon from a marketing perspective, a mandatory standard for AMD and NVIDIA, and another “money trap” for users.
DirectX 11.1, which brings minor fixes and optimizations, can start its conquest with the next Windows operating system. According to official material, we can expect native stereo 3D support and more efficient rasterization from the new API. Unfortunately, perhaps the most interesting point, which discusses how the flexibility and general usability of graphics hardware can be improved, has not been detailed.
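The doubling can be reproduced from the signalling rates. The Python sketch below assumes an x16 slot and uses the standard per-lane transfer rates and line encodings (5 GT/s with 8b/10b for PCIe 2.0, 8 GT/s with 128b/130b for PCIe 3.0); the function name `pcie_gb_per_s` is our own, and protocol overhead beyond the line encoding is ignored.

```python
def pcie_gb_per_s(gt_per_s, payload_bits, encoded_bits, lanes=16):
    """Usable bandwidth in GB/s for one direction of a PCIe link."""
    usable_gbit_per_lane = gt_per_s * payload_bits / encoded_bits
    return usable_gbit_per_lane / 8 * lanes


gen2 = pcie_gb_per_s(5.0, 8, 10)     # 8b/10b encoding
gen3 = pcie_gb_per_s(8.0, 128, 130)  # 128b/130b encoding

print(f"PCIe 2.0 x16: {gen2:.2f} GB/s per direction, {2 * gen2:.2f} GB/s total")
print(f"PCIe 3.0 x16: {gen3:.2f} GB/s per direction, {2 * gen3:.2f} GB/s total")
```

The 16 and 32 GB/s figures therefore count both directions together, and the PCIe 3.0 value is rounded up slightly from about 31.5 GB/s.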
That, broadly, is what the Graphics Core Next architecture looks like. Of course, the chip not only serves the needs of gamers, but also has room for professional tasks. Tahiti’s theoretical peak performance for double-precision calculations is 947 GFLOPS, and four times higher for single-precision floating-point operations. In addition, the memories have ECC support and the GPU is well acquainted with the DirectCompute 11.1, OpenCL 1.2 and C++ AMP APIs.
New features: ZeroCore
Power consumption is usually a taboo subject for top predators of the Radeon HD 7900's caliber, but AMD's engineers were not short of ingenuity. The idea is simple but great, though not new. If you leave your computer for a long time, but for some reason do not want to turn it off, you may want to put only the monitor into standby mode. Thanks to ZeroCore Power technology, with the display turned off, the entire GPU can be powered down and no active cooling is required in this state. The benefits are convincing: zero noise, 3 watts of power consumption. It will be an insignificant factor for many, but in four-way CrossFire systems the technique also shuts down the non-primary video cards, significantly reducing your electricity bill - although anyone thinking of such a build probably does not care much about energy efficiency.
Eyefinity 2.0
One of the interesting features of the new version is that it allows you to hold multi-monitor conference calls with multi-stream audio. The official name of the technique is Discrete Digital Multi-Point Audio (DDMA). The Radeon HD 7970 can be connected to three displays at the same time, which can receive an eight-channel audio stream. This may not be of particular interest to home users, but it is a good example of how many areas the new big gun can be used in. The Catalyst driver is also evolving, making it easier to position the taskbar, for example, and allowing you to create custom resolutions. It is worth mentioning that Full HD stereo 3D content can also be viewed in Eyefinity mode.
UVD and VCE
UVD 3.0 already offers hardware acceleration for MPEG-4 Part 2 (DivX / Xvid) and MVC content, and the Video Codec Engine (VCE) is essentially the AMD equivalent of Intel Quick Sync Video. VCE is stand-alone hardware and is designed only to speed up the encoding of H.264 video. The engine is slower than the shader processors in the graphics processor, but much more energy efficient. There are two modes available to users. In the first, only the VCE works, which in itself is faster than most CPUs. In this case, we will not experience a slowdown, and we can load the video card or the CPU with other work without any problems. The second option is hybrid mode, in which the VCE and the arithmetic-logic units of the GPU tackle the task together. This “marriage” obviously has a good effect on the encoding speed, but in that case, don’t be surprised if your favorite game switches to “slideshow” mode.
Now that we are aware of the theory and the numbers, let’s get acquainted with the three GCN models in the test!