
We tested the Bulldozer: FX-8150 and three 990FX motherboards on the test bench


Technology

The chips are made on GlobalFoundries' 32nm SHP node. For Bulldozer, the SOI process used previously has been combined with HKMG (High-K Metal Gate), familiar from Intel, which helps reduce leakage current. The architecture is designed for high clock speeds (a “speed racer”), so the model range will probably contain no products below 3 GHz. All of the processors are what used to be called Black Edition, so this is no longer marked separately.

At this point, let's take a short detour and look at the other side of the coin. The fastest quad-core Phenom II processor runs at 3.7 GHz, and the 1100T, based on the six-core Thuban chip, runs at 3.3 GHz. In comparison, the base clock of the 32nm AMD FX-8150 is close to disappointing, and only the 4.2 GHz “level” of Turbo Core is acceptable, which immediately promises a 10-15% performance surplus (or not). XbitLabs reported a year ago that Bulldozer would pass the 3.5 GHz mark, which has come true, but only after a series of delays. It seems fair to assume that there are still serious problems with the manufacturing and yield of the new chip, and that this has a significant impact on performance.

[Figure: The second integer core increases the size of the module by only 12 percent.]

The basic concept grew out of many years of experience, and rests on the following observation: on average, more than 80 percent of the operations a CPU performs are integer (fixed-point) operations. It follows that floating-point calculations play a much smaller role in the life of the “centipedes”. In the design, therefore, two integer cores are joined together; each has its own first-level cache, but they share the second-level cache and the floating-point unit. AMD calls this unit a module.

[Figure: One module]

According to internal measurements, the second integer core increases the size of the module only to a negligible extent, while in the ideal case it can raise performance by up to 80%. The data part of the first-level cache belongs to each core separately (16 KB, 4-cycle latency), but the 64 KB instruction cache is shared between the two integer cores.
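The trade-off can be put in rough numbers. The following is a minimal Python sketch that uses only the two figures quoted in the article (about 12 percent more module area, up to 80 percent more performance). The function name and the normalization to a hypothetical single-core module are illustrative assumptions, not AMD's methodology.

```python
def throughput_per_area(extra_area: float, extra_throughput: float) -> float:
    """Relative throughput per unit of die area.

    The result is normalized so that a hypothetical module with a single
    integer core scores 1.0 (an assumption made for this illustration).
    """
    return (1.0 + extra_throughput) / (1.0 + extra_area)


# Figures quoted in the article: +12% module area, up to +80% performance.
ratio = throughput_per_area(extra_area=0.12, extra_throughput=0.80)
print(f"Best-case throughput per area: {ratio:.2f}x")
```

The 80 percent figure is a best case, so the result is an upper bound on what the second core buys per unit of area, not a measured value.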

[Figure: Digging deeper]

Based on the test results, the L1 data cache is not only too small but also slow, which is not a good combination. The size of the second-level cache shared within the module is adequate, but its latency is high at 25-27 cycles. It is easy to imagine that a larger L1 cache and a faster L2 (12-15 cycles) would improve processor performance by 10-20%.

Not surprisingly, reaching the 8 MB L3 cache is not lightning-fast either (65 cycles). In summary, the Bulldozer cache system will not be the eighth wonder of the world.
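To see why these latencies matter, here is a small Python sketch of the usual average memory access time calculation. The latencies (4, about 26 and 65 cycles, and the 13-cycle alternative for L2) come from the text. The hit rates and the 200-cycle main memory latency are purely hypothetical values chosen for illustration, so the output shows the shape of the effect and is not a measurement.

```python
def average_access_cycles(levels, memory_cycles):
    """Average memory access time for a cache hierarchy, in cycles.

    levels: list of (latency_cycles, hit_rate) pairs ordered L1, L2, L3.
    Each latency is treated as the total load-to-use latency of that
    level, and each hit rate applies to the accesses that reach it.
    """
    total = 0.0
    reach = 1.0  # fraction of accesses that get as far as this level
    for latency, hit_rate in levels:
        total += reach * hit_rate * latency
        reach *= 1.0 - hit_rate
    # Whatever misses every cache level pays the main memory latency.
    return total + reach * memory_cycles


# Hit rates and memory latency below are assumptions, not measurements.
HIT_L1, HIT_L2, HIT_L3 = 0.90, 0.80, 0.50
MEMORY_CYCLES = 200

measured = average_access_cycles(
    [(4, HIT_L1), (26, HIT_L2), (65, HIT_L3)], MEMORY_CYCLES)
faster_l2 = average_access_cycles(
    [(4, HIT_L1), (13, HIT_L2), (65, HIT_L3)], MEMORY_CYCLES)

print(f"L2 at 26 cycles: {measured:.2f} cycles per access")
print(f"L2 at 13 cycles: {faster_l2:.2f} cycles per access")
```

The sketch models only access time. It does not reproduce the 10-20% performance estimate above, which also depends on how much of the latency the out-of-order core can hide.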

[Figure: In the instruction set maze]

Bulldozer currently supports the widest range of instruction sets: MMX, SSE, SSE2, SSE3, SSE4A, SSSE3, SSE4.1, SSE4.2, AVX, AES, FMA4, XOP, PCLMULQDQ and, of course, the 64-bit extension. Of the two novelties (FMA4, XOP), FMA4 is of great importance in the HPC market, while XOP offers a small advantage in multimedia applications. As far as we know, the latest version of x264 already supports the new instruction sets. Support for the outdated 3DNow! has been dropped, which we think will not cause many readers sleepless nights.
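What makes a fused multiply-add attractive for numerical work is that it computes a*b + c with a single rounding at the end instead of two. The Python sketch below emulates that behaviour with exact rational arithmetic. It is an illustration of the numerical idea only: it does not execute FMA4 instructions, and the helper name `fused_multiply_add` is made up for this example.

```python
from fractions import Fraction


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulate a*b + c with one rounding at the end, as an FMA does."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)  # exact, no rounding
    return float(exact)  # single rounding to the nearest double


a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

# Separate multiply and add: the product 1 - 2**-60 is rounded to 1.0
# before the addition, so the small term is lost.
unfused = a * b + c

# Fused: the exact result -2**-60 survives the single final rounding.
fused = fused_multiply_add(a, b, c)

print("separate multiply and add:", unfused)
print("fused multiply-add:       ", fused)
```

Beyond the accuracy benefit shown here, hardware FMA also performs the two operations in one instruction, which is where the throughput gain in HPC code comes from.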

Hardware support for x86 virtualization, the counterpart of Intel VT, is present as well. The IOMMU significantly increases virtualization performance; surprisingly, however, Intel's top models (Core i5-2500K, i7-2600K) do not support this technology, and the current Sandy Bridge-E solutions are on the same “blacklist”. Once again this is an extra compared to the direct competition, although its usefulness to the average user is questionable.

[Figure: Turbo Core in theory]

Turbo Core has also been developed further: it works with multiple clock gates and adapts even better to varying degrees of utilization. If all cores are active but the floating-point units are not currently in use, the Turbo Core 2.0 clock takes effect. The mechanism changes the core clocks dynamically as a function of load, and inactive resources, whole modules, and components inside a module can be switched off, so there would be nothing to complain about in this area. Unfortunately, the software side spoils things thoroughly.

[Figure: Practical implementation]

The Windows 7 scheduler is, to put it mildly, not the most efficient at allocating tasks, because it frequently moves tasks from one core to another. The next version of the operating system will fix the problem, and a patch for the current system is coming soon, so we will shortly get 2-10% more performance (in extreme cases 15-25 percent). Another welcome benefit is that idle power consumption can drop by 4-5 watts, because the modules can stay in their power-saving state longer.

[Figure: "Don't look a gift horse in the mouth"]

[Figure: The "transformation" during Battlefield 3]

Battlefield 3 is also a good illustration of how much some optimization helps a processor. In this game, the currently most powerful FX-series processor can match the performance of the Core i7-2600K.

The FX-series processors come in Socket AM3+ packaging and fit AMD motherboards with a 9-series chipset. The color of the socket, which is usually black, also helps with identification. To build the impressive-sounding Scorpius platform, we need an FX-series processor, a motherboard with a 9-series chipset, and a Radeon HD 6000-series video card. Bulldozer has a dual-channel memory controller that supports DDR3-1866 modules.
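As a rough guide to what the memory controller offers, the theoretical peak bandwidth can be estimated as in the Python sketch below. It assumes DDR3-1866 modules (nominally 1866 million transfers per second) on two standard 64-bit channels; real, sustained bandwidth is lower. The function name is illustrative only.

```python
def peak_bandwidth_gb_s(transfers_per_s: float, channels: int = 2,
                        bus_width_bits: int = 64) -> float:
    """Theoretical peak memory bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = bus_width_bits // 8  # 64-bit channel: 8 bytes
    return transfers_per_s * bytes_per_transfer * channels / 1e9


# DDR3-1866 in dual-channel mode, nominal transfer rate.
ddr3_1866 = peak_bandwidth_gb_s(1866e6)
print(f"Dual-channel DDR3-1866 peak: {ddr3_1866:.1f} GB/s")
```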

[Figure: AMD FX-8150 with a Phenom II X4 970 BE, from above]

In conclusion, we would like to add one more interesting point. The fact that the work done per clock (instructions per cycle) by Bulldozer-based processors has, on average, decreased somewhat compared to the predecessor has caused serious controversy. Some immediately predict the downfall of the architecture, while others list similar examples from the past. Here, as always, let us confine ourselves to the facts. Programmers today are increasingly realizing the benefits of multi-core optimization. With an 8-cylinder engine that delivers good performance overall, we rarely think about what it can do on 1 cylinder.

[Figure: AMD FX-8150 with a Phenom II X4 970 BE, from below]

The example is not the best, but it may shed light on the point. We do not claim that we will often make optimal use of eight integer cores, but in that case Turbo Core 2.0 aims for the highest possible clock (4.2 GHz). What K10.5 can reach only at the price of “blood and sweat” counts as a “base clock” here. There is also no doubt that implementing AVX, FMA and XOP has cost a significant number of transistors. The foundations of the architecture are used in several segments (server, desktop PC), so this seemed like a mandatory step, but for now we see rather little of its benefits (especially in a desktop environment).
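How much eight cores can help depends on how much of a program actually runs in parallel. A standard way to estimate this is Amdahl's law, sketched below in Python. The parallel fractions in the loop are arbitrary illustrative values, not measurements of any real application, and the model ignores Turbo Core and the resources shared inside a module.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup over one core according to Amdahl's law."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Illustrative parallel fractions only; real workloads vary widely.
for fraction in (0.0, 0.5, 0.9, 1.0):
    speedup = amdahl_speedup(fraction, cores=8)
    print(f"{fraction:.0%} parallel code on 8 cores: {speedup:.2f}x")
```

The serial part of a program is where per-clock performance and a high turbo clock matter most, which is why both sides of the debate have a point.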

[Figure: Lying in bed]

In the ideal case (FMA4 + AVX), Bulldozer is truly in its element: it delivers surprising performance and immediately puts things in a different light. According to measurements by the German site HT4U, in the C-Ray 1.1 rendering application the AMD FX-8150 finishes in 15 seconds, the same as the Intel Core i7-990X. That is exactly half the time an AMD Phenom II X6 1100T processor took to do the job. We would note in passing that we also looked at the other extreme, Super PI.
