
Beyond the z196 Pipeline


In the January/February 2012 print edition, I covered some aspects of how the IBM zEnterprise 196 (z196) processor executes instructions through its instruction pipeline. But there's more to the z196 processor than the pipeline. Here's an overview of some of the other elements, including branch prediction, special-purpose co-processors and enhanced translation look-aside buffers (TLBs). But first, let's take a brief look at the nest.

The Cache Topology

The nest is the part of the processor responsible for accessing data in memory and maintaining proper coherency of that data. It's extremely important because one of the major delays in instruction execution is the wait to access data. To reduce that delay, the nest supplements main memory with several levels of buffers, called caches, that hold some of the data closer to the instruction processor.

In computer design, closer is faster. Figure 1 shows the topology of the caches on the z10 and the z196. The upper part of the figure is a schematic view of the z10 topology. The z10 has three levels of cache: a small, low-latency cache called L1 associated with each processor; a larger but slower cache called L1.5, also private to each processor; and a much larger, higher-latency cache called L2, which is shared by all of the processors in a book. The L2 caches on each book communicate with each other and maintain cache coherency across all of the processors.

But even with these three levels of cache to buffer data and avoid going all the way out to main memory, it wasn't enough to keep the z196 pipeline rolling smoothly. The z196 has an additional level of cache, shared by the four processors on a chip and thus a chip-level cache, which can be seen in the lower part of the figure, a schematic of the z196 cache topology. The cache levels are renamed so that on the z196, L1 and L2 are the two processor-private caches, L3 is the new chip-level cache and L4 is the cache shared by all processors in a book. With the addition of the on-chip L3 cache, the z196 is much better able to keep its 5.2 GHz out-of-order pipeline fed with instructions and data.

As a further optimization, there are actually two L1 caches, one for instruction fetches and one for operand fetches. These caches are exclusive: the processor cannot maintain the same cache line in both of them. This is why there's such a large performance penalty for "self-modifying code." If a program modifies instructions that are in the instruction cache, the containing cache line must be purged from the instruction cache and loaded into the data cache, then pushed up to L2 from the data cache so it can be loaded into the instruction cache again. It doesn't even have to be instructions that are modified: any store into a 256-byte cache line held in the instruction cache initiates this process, so programmers and compilers have to be careful.

Bob Rogers worked on mainframe system software for 43 years at IBM before retiring as a Distinguished Engineer in 2012.
