Technologies: Sandy Bridge Architecture
Sandy Bridge is the code name for the Intel microprocessor architecture that follows Nehalem, introduced in late 2008, and Westmere, the 32nm die shrink of Nehalem released in January 2010. The new architecture integrates the memory controller, graphics controller, and PCI Express controller, functions traditionally housed in the North Bridge in older designs, onto a single monolithic die. Improving on the previous design, Sandy Bridge also features the next generation of Intel Turbo Boost and Hyper-Threading Technology, Advanced Vector Extensions (AVX), a high-bandwidth, low-latency modular ring interconnect linking the cores and graphics, and a substantially upgraded graphics engine.
The Sandy Bridge Architecture
Manufactured on the 32nm process, Sandy Bridge processors are Intel's 2nd generation Core i3/i5/i7 microprocessors and require an entirely new socket (LGA-1155) and chipset (6-series), breaking compatibility with Nehalem/Westmere-based systems. Sandy Bridge also handles instructions in a new way. An "L0" micro-op cache stores instructions as they are decoded, holding up to approximately 6KB of instructions. This cache achieves a roughly 80% hit rate across most applications. Whenever the cache hits, the decoders and the L1 instruction cache are put to sleep, which saves both time and power: the instructions in the cache are already decoded, and the x86 decoders are among the most power-hungry parts of the front end. The technique is not unlike the trace cache of Intel's NetBurst architecture, but it is far more efficient, since the micro-op cache stores individual decoded instructions (micro-ops rather than macro-ops) instead of whole traces.
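The decode-bypass idea can be illustrated with a toy model (not Intel's actual organization, which is undocumented at this level of detail): a small cache keyed by instruction address that, on a hit, returns already-decoded micro-ops so the decoders never run.

```python
# Toy model of a decoded micro-op cache: on a hit the x86 decoders are
# bypassed entirely; on a miss we "decode" and fill the cache.
# All names and sizes here are illustrative, not Intel's actual design.

class UopCache:
    def __init__(self, capacity=1536):        # hypothetical entry count
        self.entries = {}
        self.capacity = capacity
        self.hits = self.misses = 0

    def fetch(self, addr):
        if addr in self.entries:              # hit: decoders stay asleep
            self.hits += 1
            return self.entries[addr]
        self.misses += 1                      # miss: run the decoders
        uops = decode_x86(addr)
        if len(self.entries) < self.capacity:
            self.entries[addr] = uops
        return uops

def decode_x86(addr):
    """Stand-in for the legacy decode pipeline (expensive in hardware)."""
    return [("uop", addr)]

# A tight loop re-executes the same addresses, so after the first
# iteration every fetch hits the micro-op cache.
cache = UopCache()
for _ in range(10):                           # 10 loop iterations
    for addr in range(0x400000, 0x400020, 4): # 8 instruction addresses
        cache.fetch(addr)
print(cache.hits, cache.misses)               # -> 72 8
```

Loop-heavy code is exactly the case the source describes: after the first pass, every fetch is served without touching the decoders.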
The branch prediction unit in Sandy Bridge has been rebuilt from the ground up. Compared to Nehalem, the branch target buffer has doubled in size. The predictor is also faster and more accurate thanks to more compact target storage and more history bits: smaller target entries waste less space and let the CPU track more targets in the same area. The scheduler has six dispatch ports: three feed execution units and three handle memory operations. Sandy Bridge has 15 execution units compared to Nehalem's 12, and boasts improved floating-point performance as well.
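To see why history bits matter, consider the textbook two-bit saturating-counter predictor, a far simpler scheme than Sandy Bridge's actual unit but enough to show the principle: extra state lets the predictor tolerate a single anomalous outcome without flipping its prediction.

```python
# Classic two-bit saturating-counter branch predictor. This is a
# textbook illustration of "history bits", not Sandy Bridge's real
# (undisclosed) predictor design.

class TwoBitPredictor:
    def __init__(self):
        self.state = 0                      # 0-1: not-taken, 2-3: taken

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

# A typical loop branch: taken 9 times, not taken once, repeated.
pattern = ([True] * 9 + [False]) * 10
p = TwoBitPredictor()
correct = 0
for taken in pattern:
    if p.predict() == taken:
        correct += 1
    p.update(taken)
print(correct, "of", len(pattern))          # -> 88 of 100
```

With two bits of history the single not-taken loop exit costs only one misprediction per iteration; a one-bit predictor would mispredict twice, on both the exit and the re-entry.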
Sandy Bridge moves to a physical register file because the out-of-order hardware would otherwise have had to grow considerably to hold the micro-ops along with their data. The new AVX instruction set adds 12 new instructions and extends the 128-bit XMM registers to 256-bit registers. The physical register file lets Intel enlarge the out-of-order buffers, delivering a higher-throughput floating-point engine and adding AVX at minimal die expense. AVX, or Advanced Vector Extensions, uses the same Single Instruction, Multiple Data (SIMD) concept introduced with MMX and carried through the Streaming SIMD Extensions. To execute 256-bit AVX operations, the floating-point units borrow the 128-bit integer SIMD datapath, minimizing the impact of AVX on execution die area.
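The SIMD idea itself is simple to sketch: a 256-bit AVX register holds eight 32-bit floats, so a single packed add performs eight scalar adds at once. The following Python model only mimics the lane arithmetic; the function name echoes the real VADDPS instruction but is otherwise illustrative.

```python
# SIMD illustration: a 256-bit AVX register holds 8 single-precision
# (32-bit) floats, so one packed add performs 8 scalar adds at once.
# This sketch models only the lane arithmetic, not real AVX hardware.

LANES = 256 // 32        # 8 float32 lanes per 256-bit register

def vaddps(a, b):
    """Model of a 256-bit packed-float add (one instruction in AVX)."""
    assert len(a) == len(b) == LANES
    return [x + y for x, y in zip(a, b)]

a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
b = [0.5] * LANES
print(vaddps(a, b))      # -> [1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5]
```

On real hardware the eight lane additions happen in parallel in one execution unit, which is where the floating-point throughput gain comes from.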
Sandy Bridge also makes the load and store address ports symmetric, doubling load bandwidth to keep pace with the improved floating-point performance. The memory unit can service three data accesses per cycle: two reads of up to 16 bytes each and one store of up to 16 bytes, while an internal sequencer handles the queued requests. There are also several integer execution improvements in Sandy Bridge, including doubled Add-with-Carry (ADC) instruction throughput and faster large multiplies, which together yield a roughly 25% speedup on existing RSA binaries.
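The RSA connection comes from multi-precision arithmetic: big integers are stored as arrays of machine words ("limbs"), and adding them is a chain of add-with-carry steps, exactly the pattern the ADC instruction executes in hardware. A sketch with hypothetical 64-bit limbs:

```python
# Multi-precision addition, the software pattern that ADC accelerates.
# Big integers are split into 64-bit "limbs"; each limb add must fold in
# the carry from the previous limb -- what ADC does in one instruction.

MASK = (1 << 64) - 1

def bignum_add(a, b):
    """Add two equal-length little-endian limb arrays."""
    out, carry = [], 0
    for x, y in zip(a, b):
        s = x + y + carry
        out.append(s & MASK)      # low 64 bits of this limb's sum
        carry = s >> 64           # carry into the next limb (0 or 1)
    return out, carry

# 128-bit example: the low limb overflows and the carry ripples up.
a = [MASK, 0]                     # value 2**64 - 1
b = [1, 0]                        # value 1
print(bignum_add(a, b))           # -> ([0, 1], 0), i.e. 2**64
```

An RSA modular-exponentiation inner loop executes millions of these carry chains, so doubling ADC throughput translates directly into the speedup the source describes.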
In the previous Nehalem design, every core had its own private path to the last-level L3 cache. That approach scales poorly as demand on the L3 grows, especially with the addition of on-die graphics and a transcoding engine. Intel's solution is a scalable on-die ring interconnect linking the cores, graphics, last-level cache, and system agent. The ring is composed of four independent rings: a data ring, a request ring, an acknowledge ring, and a snoop ring. Each ring stop can move 32 bytes of data per clock. The ring's wiring runs over the last-level cache with no area impact, and to minimize latency each access always takes the shortest path around the ring. As core count and cache size grow, cache bandwidth grows with them, making the design fully scalable. The L3 speed and latency have also changed compared to Westmere: the L3 cache now runs at the core clock, and latency drops from 36 cycles to a variable 26-31 cycles.
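The shortest-path rule is easy to make concrete: on a bidirectional ring of N stops, an access travels whichever direction covers fewer hops. The stop count and layout below are illustrative; the real ring's agent placement is fixed by the chip's floorplan.

```python
# Shortest-path hop count on a bidirectional ring of n_stops stations.
# The stop numbering here is illustrative, not Sandy Bridge's floorplan.

def ring_hops(src, dst, n_stops):
    """Hops taken when the ring always picks the shorter direction."""
    clockwise = (dst - src) % n_stops
    return min(clockwise, n_stops - clockwise)

# Example: 8 stops (cores, cache slices, graphics, system agent).
print(ring_hops(0, 3, 8))   # -> 3 (clockwise is shorter)
print(ring_hops(0, 6, 8))   # -> 2 (counter-clockwise is shorter)
```

The worst case is thus n_stops // 2 hops, which is why latency stays bounded as stops are added, and why the source describes the latency as variable rather than fixed.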
What was formerly called the "un-core" in the previous generation is now referred to as the System Agent in Sandy Bridge. The system agent houses the PCI Express 2.0 lanes, the DMI link, the display engine, and a redesigned dual-channel DDR3 memory controller. It also contains a power control unit that handles all power management and reset functions on the chip. The system agent runs at a lower clock speed than the cores and sits on an independent power plane.