Analyzing PCI Express performance in the Intel System Controller Hub US15W
Power, performance, size, features - these are just a few of the things silicon designers have to balance when developing platforms. With the introduction of the Intel Atom processor and Intel System Controller Hub (Intel SCH) US15W, which has a thermal design power of less than 2.5 W, the balance has shifted significantly toward power optimization. Scott explores how this focus affects other metrics from the perspective of PCI Express (PCIe) I/O on the Intel SCH.
Due to its focus on mobile Internet devices and netbooks, the Intel Atom processor platform (Figure 1) allocates a sizable portion of the power budget to media applications. For example, in terms of memory bandwidth consumption, graphics are given higher-priority access to this central storage pool. The processor, via the Front Side Bus (FSB), is also given more of the memory controller’s time for running generic media applications and/or performing software-based DSP functions. Other I/O devices require only low to moderate amounts of bandwidth, typically in bursts.
All the I/O devices on the Intel SCH run on two equally shared unidirectional buses collectively called the “backbone,” which has less memory bandwidth than either the integrated graphics controller or the FSB. The backbone’s bus has slightly more bandwidth than the original 32-bit, 33 MHz PCI bus could handle. But PCIe allows for higher bandwidths than that, right? To know for sure, let’s take a quick crash course on PCIe technology.
PCIe is a serialized/packetized point-to-point implementation of the original parallel PCI bus. While PCIe shares the same core acronym and software concepts as PCI, everything else about the hardware is different.
At a high level, PCIe looks like a typical network stack. It has many of the same or similar layers as found in the Open System Interconnection model for networked communications. Lower layers consist of physical transmission – the electrical/signaling aspects – and eventually progress all the way up to transaction-based packets.
The transaction layer is responsible for all of the read/write transactions (packets) implicit in data communication, specifying where data is from and where it is going. Moving further down the PCIe protocol stack, the endpoints are connected via links and lanes.
Lanes are bidirectional serial interconnects (physically, two differential pairs) of a specific bandwidth. For PCIe Gen 1, which is used in the Intel SCH US15W, this bandwidth is 2.5 Gbps in each direction of the lane, thanks to a 2.5 GHz signaling rate. With 8b/10b encoding (each byte is transmitted as a 10-bit code), the effective bandwidth calculation is easy: Dividing by 10 results in a rate of 250 MBps of bandwidth in each direction for any given lane.
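The per-lane arithmetic above can be sketched as a quick calculation (values taken from the Gen 1 figures in the text):

```python
# PCIe Gen 1 lane bandwidth: 2.5 Gbps signaling with 8b/10b encoding.
SIGNALING_RATE_BPS = 2.5e9   # raw bit rate per direction, per lane
BITS_PER_SYMBOL = 10         # 8b/10b: each byte travels as a 10-bit code

# Effective bytes per second in one direction of one lane
effective_mbps = SIGNALING_RATE_BPS / BITS_PER_SYMBOL / 1e6
print(f"{effective_mbps:.0f} MBps per direction per lane")  # 250 MBps
```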
Links comprise one or more lanes. A link can aggregate four, eight, or even 16 lanes. The Intel SCH has only two lanes, each of which forms its own link. Because each link consists of a single lane, both links are x1.
As with all other forms of communication, overhead is needed to make the data reliable and understandable. Because PCIe is serial and fairly robust, this overhead – including headers and framing – impacts the true bandwidth. Given that the maximum data payload size on the Intel SCH US15W is 128 bytes and the overhead typically consumes 20 bytes on a x1 link, the true theoretical maximum transaction bandwidth is 216 MBps, leading to 86 percent efficiency in each direction.
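The efficiency figure quoted above follows directly from the payload and overhead sizes. A minimal sketch, using the numbers from the text (128-byte maximum payload, roughly 20 bytes of header and framing per transaction):

```python
# Theoretical transaction efficiency on the Intel SCH US15W's x1 links.
LANE_BANDWIDTH_MBPS = 250   # raw per-direction lane bandwidth (Gen 1)
MAX_PAYLOAD_BYTES = 128     # maximum data payload on the US15W
OVERHEAD_BYTES = 20         # typical header + framing overhead per transaction

efficiency = MAX_PAYLOAD_BYTES / (MAX_PAYLOAD_BYTES + OVERHEAD_BYTES)
effective_mbps = LANE_BANDWIDTH_MBPS * efficiency
print(f"efficiency = {efficiency:.0%}, throughput = {effective_mbps:.0f} MBps")
# efficiency = 86%, throughput = 216 MBps
```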
To understand PCIe performance data, engineers performed tests with a custom traffic generator endpoint in each PCIe port on the Intel SCH, as illustrated in Figure 2. Test cases instructed the traffic generators to vary each link’s transaction type and the transaction payload size.
Transaction types include read, write, and idle. Writes to memory flow from right to left – from PCIe endpoint to memory. Reads are requested in the same direction as writes, but are completed by data transactions flowing from left to right (effectively two transactions). The tests examined each of these scenarios individually and added the idle case to the mix. All useful combinations of these states were enumerated for each link, and then test cases were run using two payload sizes: 64 bytes and 128 bytes.
Two different platforms – one with a Z530 processor running at 1.6 GHz and the other with a Z510 processor running at 1.1 GHz – were tested for comparison. Both used the Intel SCH US15W as the chipset, but with different memory and processor FSB frequencies per their paired processors. Test results are shown in Figure 3.
After carefully analyzing this data, the engineers were able to reconstruct a detailed illustration of the Intel SCH’s internal architecture, which has two internal backbone buses (one upstream, one downstream) connecting all the I/O devices to main memory, as shown in Figure 4. This backbone is a shared round-robin bus, likely 64 bits wide, running at a clock rate of 25 MHz for the 1.1 GHz part with its 400 megatransfers-per-second (MT/s) FSB, and 33 MHz for the 1.6 GHz part with its 533 MT/s FSB (the backbone clock appears tied to the FSB clock). Furthermore, the transaction sizes max out at 64 bytes.
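The peak backbone bandwidth implied by these inferences is easy to work out. This is a sketch under the text's assumptions (a 64-bit bus at the stated clock rates; the width was inferred from measurements, not documented):

```python
# Peak backbone bandwidth for the inferred architecture:
# a 64-bit (8-byte) bus clocked per the paired processor's FSB.
BUS_WIDTH_BYTES = 8                      # 64-bit bus (inferred)
BACKBONE_CLOCK_MHZ = {400: 25, 533: 33}  # FSB MT/s -> backbone MHz

peak_mbps = {fsb: BUS_WIDTH_BYTES * mhz
             for fsb, mhz in BACKBONE_CLOCK_MHZ.items()}
print(peak_mbps)  # {400: 200, 533: 264} MBps in each direction
```

Both figures sit just above the ~133 MBps of the original 32-bit, 33 MHz PCI bus, consistent with the earlier observation about the backbone's capacity.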
For a more detailed look at this data, see the Intel Technology Journal article.
Because the backbone is a shared bus, multiple simultaneous accesses significantly reduce short-term bandwidth (experienced as latency when the time window is a single transaction). This means that for high-bandwidth applications, any other traffic on the bus will temporarily diminish the available bandwidth by a significant amount. Random disk accesses, for example, might not change the overall bandwidth on average, but they will halve the available bandwidth for short bursts. If three devices require full bandwidth at the same time, each will only get one-third of the bandwidth in the short term given round-robin arbitration.
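The 1/N effect of round-robin arbitration can be illustrated with a toy model (the device names and arbiter details here are assumptions for illustration, not the SCH's actual arbiter):

```python
# Toy round-robin arbiter: while N requesters are active, each sees an
# equal 1/N share of the backbone's short-term bandwidth.
def short_term_share(total_mbps, active_devices):
    """Bandwidth each device sees while all listed devices are requesting."""
    return total_mbps / len(active_devices) if active_devices else 0.0

BACKBONE_MBPS = 200  # e.g., the 25 MHz / 64-bit case
print(short_term_share(BACKBONE_MBPS, ["pcie0"]))                 # 200.0
print(short_term_share(BACKBONE_MBPS, ["pcie0", "disk"]))         # 100.0
print(short_term_share(BACKBONE_MBPS, ["pcie0", "disk", "usb"]))  # ~66.7
```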
This is not the only way that bandwidth can be affected. Even when the backbone is completely open to a single device, bandwidth can be robbed at the memory controller level, as with video decoding, depicted in Figure 5 (see page 26).
In this example, the PCIe traffic generator writes data to memory. Decoding high-definition video using the Intel SCH’s hardware decoder produces regular dips in PCIe traffic. These dips occur at regular intervals that align with each frame of the video being decoded (approximately 41 ms for 24 frames-per-second video). Tests confirmed that memory bandwidth contention was the culprit.
Decoding high-definition video is fairly memory-intensive, as each uncompressed video frame is bounced to and from memory several times. Adding more CPU accesses for memory on top of this only exacerbates the problem.
Applying architecture to optimization
Putting together all of this architecture and performance information reveals what optimizations are possible. Due to the shared parallel bus nature of the backbone, PCIe hardware running on this platform performs better when conforming to certain parallel bus (think: PCI) concepts. For example, PCIe devices cannot expect constant bandwidth at all times since they must share it equally with other peripherals. Consequently, PCIe devices that can vary their requirements (throttle) to maintain an average throughput will work well. Devices that do not throttle and instead enforce overly rigid bandwidth requirements will suffer. Devices also must be able to operate under the maximum throughputs discussed earlier and not theoretical PCIe rates.
Beyond hardware, drivers can be optimized the same way as devices on a shared PCI bus. Driver tweaks can significantly improve the hardware’s ability to adapt to differing bandwidth conditions. Application software also plays a role because applications can be created to more readily minimize concurrent activities in embedded applications.
PCIe device performance can be optimized with packet size changes and memory use. Read requests should be 128 bytes or larger for maximum throughput. Writes should be 64 bytes for optimal performance, even though 128 bytes historically performed better. Because of odd differences like this, hardware flexibility is ideal for maintaining high performance across a wide range of platforms.
Even memory organization can affect performance. Aligning transactions to cache line boundaries is important and applies to many different architectures.
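A common way to honor cache-line boundaries is to round transfer addresses down (and end addresses up) to the line size. This sketch assumes a 64-byte line, matching the 64-byte backbone transactions noted earlier; the helper names are illustrative:

```python
# Align addresses to cache-line boundaries (64 bytes assumed here) so a
# transfer never straddles an extra line and wastes a backbone transaction.
CACHE_LINE = 64  # assumed line size; must be a power of two

def align_down(addr, line=CACHE_LINE):
    """Round addr down to the nearest line boundary."""
    return addr & ~(line - 1)

def align_up(addr, line=CACHE_LINE):
    """Round addr up to the nearest line boundary."""
    return (addr + line - 1) & ~(line - 1)

print(hex(align_down(0x1234)))  # 0x1200
print(hex(align_up(0x1234)))    # 0x1240
```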
All of the aforementioned topics – architecture, performance, and optimization – are vital aspects to consider when creating platforms with the Intel Atom processor and Intel SCH US15W. When developing high-performance systems with small power budgets, engineers must pay close attention to these metrics to attain the maximum benefits. By understanding this analysis of the Intel SCH, engineers can design platforms that use it as efficiently and effectively as possible.