AMD's much-anticipated next-generation Bulldozer and Bobcat CPU cores are almost here. Expectations are high: this is the first time in the seven years since the launch of the Athlon 64/K8 microarchitecture that AMD has developed a radically new design from the ground up. With every generational step, performance is expected to rise and power efficiency to improve along with it. With that goal in mind, AMD is making a two-pronged attack, targeting the mainstream desktop and server markets as well as the ultramobile, low-power processor market.
High-performance desktop and server workloads are demanding, and a new approach is needed if the chip is to deliver both better efficiency and higher output. Heavily threaded performance has been a weak spot for AMD for some time now. While its main competitor, Intel, implemented Hyper-Threading technology to address the issue, AMD opted for a more modular approach with Bulldozer.
Each Bulldozer module is armed with two independent integer clusters, each with a dedicated L1 data cache (essentially two physical cores in one Bulldozer module). The two clusters share the module's L2 cache, the floating-point scheduler, and two 128-bit FPUs that together support up to 256-bit floating-point execution. Sharing resources reduces not only power consumption but also die area, and therefore cost.
Bulldozer has a deeper pipeline that relies on improved branch prediction and prefetchers to reduce the bottleneck caused by mispredicted branches. Unlike the previous architecture, the predict and fetch pipelines are decoupled. The predictor creates a queue of future fetch addresses, and the fetch logic works through this queue, comparing each address against what is in the instruction cache. Another major change that Bulldozer brings to the table is power gating. Each module can be clocked and power gated independently, allowing unused cores to be powered off while the remaining active cores are driven up in frequency, à la Intel's Turbo Boost.
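To make the decoupled predict/fetch idea more concrete, here is a minimal toy model in Python. It is only a sketch of the general technique, not AMD's actual front end: the 32-byte block size, the queue depth of 4, and the function names `predict_addresses` and `run_front_end` are all assumptions made up for illustration.

```python
from collections import deque

BLOCK_BYTES = 32  # assumed fetch-block size, for illustration only


def predict_addresses(start, count, taken_branches):
    """Hypothetical predictor: yield `count` fetch addresses, walking forward
    block by block and following predicted-taken branches, which are given
    as a {block_address: target_address} dictionary."""
    addr = start
    for _ in range(count):
        yield addr
        addr = taken_branches.get(addr, addr + BLOCK_BYTES)


def run_front_end(start, count, taken_branches, icache_blocks, queue_depth=4):
    """Return (address, cache_hit) pairs in fetch order.

    The predictor runs ahead and fills a bounded queue of future fetch
    addresses; the fetch logic drains the queue one entry per step and
    checks each address against the instruction cache contents."""
    queue = deque()
    results = []
    predictor = predict_addresses(start, count, taken_branches)
    exhausted = False
    while not exhausted or queue:
        # Predictor side: keep producing addresses until the queue is full.
        while not exhausted and len(queue) < queue_depth:
            try:
                queue.append(next(predictor))
            except StopIteration:
                exhausted = True
        # Fetch side: consume one queued address and look it up in the cache.
        if queue:
            addr = queue.popleft()
            results.append((addr, addr in icache_blocks))
    return results
```

Because the queue sits between the two stages, the predictor can keep generating addresses while the fetch side is still working through earlier ones, which is the property the decoupling is meant to provide.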
Built on the latest 32nm silicon-on-insulator technology, Bulldozer-based chips can have up to 8 cores (each core is seen by the operating system as a logical processor), made up of four Bulldozer modules that share the L3 cache and northbridge (NB) resources. Bulldozer is also designed to switch dynamically between shared and dedicated components to maximize efficiency, and it adds support for newer x86 instruction set extensions such as SSE4.1, SSE4.2, AVX, and XOP, including 4-operand FMAC, for greater performance and flexibility. AMD projects an estimated 50% increase in throughput within the same power envelope as the current AMD Opteron 6100 Series "Magny-Cours" server processors, and an estimated average of 80% of the performance of a conventional chip multiprocessing (CMP) design within a smaller area.
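The core-count and throughput figures above reduce to simple arithmetic, sketched below in Python. The reading of the 80% figure as "one two-core module delivers about 80% of the throughput of two fully independent cores" is an assumption made for this example, and the function names are hypothetical.

```python
MODULES_PER_CHIP = 4      # from the article: four Bulldozer modules
CORES_PER_MODULE = 2      # from the article: two integer clusters per module
CMP_FRACTION = 0.80       # from the article: ~80% of CMP performance (estimate)


def logical_processors(modules=MODULES_PER_CHIP):
    """Number of cores the operating system would see."""
    return modules * CORES_PER_MODULE


def cmp_equivalent_cores(modules=MODULES_PER_CHIP):
    """Throughput expressed in 'fully independent core' equivalents,
    under the assumed interpretation of the 80% figure."""
    return modules * CORES_PER_MODULE * CMP_FRACTION


if __name__ == "__main__":
    print(logical_processors())      # cores seen by the OS
    print(cmp_equivalent_cores())    # estimated CMP-equivalent cores
```

Under that assumption, a four-module chip shows up as 8 logical processors and delivers roughly the throughput of 6.4 fully independent cores, in exchange for the area saved by sharing.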
<hr data-mce-alt="AMD Bobcat x86 Core" class="system-pagebreak" title="AMD Bobcat x86 Core" />
To take a large bite out of the ultramobile CPU market, AMD has to create a processor that is highly synthesizable and built around a very low-power design. Delivering impressive performance in a smaller area with an even smaller power draw is a tall order for any microprocessor design, and AMD intends to pounce on the consumer electronics market with Bobcat.
The layout of the Bobcat core is much simpler than Bulldozer's, but its support for out-of-order execution (full OOO instruction execution with a full OOO load/store engine) makes it much more effective than other processors in the consumer electronics market. The design also minimizes data movement and unnecessary reads. With four integer pipes and two floating-point pipes, there is more than adequate bandwidth in terms of execution units. AMD claims that Bobcat delivers roughly 90% of today's mainstream performance in less than half the silicon area.
Bobcat has a dual x86 decoder that scans up to 22 bytes and directly maps 89% of x86 instructions to a single micro-op and an additional 10% to a pair of micro-ops; the remaining, more complicated x86 instructions are microcoded. There are two independent, dual-ported integer schedulers. One feeds two arithmetic logic units (ALUs) and the other feeds two address generation units (AGUs), one for stores and one for loads. The physical register file uses maps and pointers to reduce power by decreasing data movement. The FPU has a centralized floating-point scheduler that feeds a pair of 64-bit FP execution stacks: one with an FP multiply unit that can perform two single-precision (SP) multiplies per cycle, and one with an FP add unit that can perform two SP additions per cycle. The MMX and logical units are replicated on both stacks.
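As a rough back-of-the-envelope illustration of the decode and FPU figures above, the Python sketch below estimates the average number of micro-ops per x86 instruction and the peak single-precision operations per cycle. The micro-op count for microcoded instructions and the clock speed are not given in the article, so both are left as hypothetical parameters, and the function names are made up for this example.

```python
# Decode mix quoted in the article.
SINGLE_UOP_FRACTION = 0.89   # instructions that map to one micro-op
DOUBLE_UOP_FRACTION = 0.10   # instructions that map to two micro-ops


def average_uops_per_instruction(microcoded_uops):
    """Weighted average micro-ops per x86 instruction.

    `microcoded_uops` is a hypothetical average length for the remaining
    microcoded instructions; the article does not state this number."""
    microcoded_fraction = 1.0 - SINGLE_UOP_FRACTION - DOUBLE_UOP_FRACTION
    return (SINGLE_UOP_FRACTION * 1
            + DOUBLE_UOP_FRACTION * 2
            + microcoded_fraction * microcoded_uops)


def peak_sp_flops_per_cycle(multiplies_per_cycle=2, additions_per_cycle=2):
    """Peak single-precision operations per cycle: the multiply stack and
    the add stack can each complete two SP operations per cycle."""
    return multiplies_per_cycle + additions_per_cycle


def peak_sp_gflops(clock_ghz):
    """Peak SP GFLOPS per core at an assumed clock speed in GHz."""
    return peak_sp_flops_per_cycle() * clock_ghz
```

If microcoded instructions averaged, say, four micro-ops each (an assumed value), the weighted average would come to about 1.13 micro-ops per instruction, which suggests how close the decoder gets to a one-to-one mapping. The FPU figures work out to a peak of four SP operations per cycle per core.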
If all of AMD's roads lead to Fusion, then Bobcat is the first set of bricks laid toward that goal. This marriage of GPUs and CPUs is a vision shared by both AMD and Intel. Because Bobcat is the CPU in AMD's upcoming "Ontario" processor, manufactured on TSMC’s 40nm process, the success of Fusion rests on the success of Bobcat. Targeting both the high-end server and ultra-low-power markets at the same time is a very bold and somewhat uncharacteristic move by AMD. The boldness, however, is well calculated and not without logic. If the long-term goal is for AMD's Fusion to bloom, AMD needs to plant the seeds in markets where the competitive landscape favors it and the return is immediate.