ISBN: | 0-12-627230-1 |
Author: | Mike Schmit |
Publisher: | AP Professional, 1995 |
I Review and Historical Context 1 Number Systems 3 2 What is Assembly? 13 3 The 8086 Family History and Architecture 19 II 80x86 Family Background 4 8086 Architecture and Instruction Set 29 5 Writing Beginning programs 75 6 Assembly Tools 83 7 The Instruction Set Evolves: 186 to the 386 89 III Introduction to Pentium and Tools 8 80486 and Pentium 103 9 Superscalar Programming 109 10 Integer and Floating Point Pipeline Operation 119 11 Using the Pentium Optimizer Program 139 12 Timing with a Software Timer 143 IV Superscalar Pentium Programming 13 Optimization Warm-ups 153 14 String Search and Translate 163 15 Checksums and Extended Precision Addition 179 V Advanced Topics 16 Floating Point Math 195 17 Interfacing to C 211 18 Protected mode programming 231 19 Final Notes and Optimizations 255 VI PowerPC vs. Pentium 20 PowerPC vs. Pentium 273 Appendices A Instruction Set Reference 283 B Optimization Cross-reference by Instruction 317 C Optimization Guide-lines by CPU 327 D Simple Instructions for Pentium Pairing 333 E Instruction Pairing Rules for Pentium 335 F Single-byte Instructions 337 G Quick Reference for Important Instruction Timings 341 H Undocumented Pentium Registers 345 I DEBUG32 command summary 349 J Improving Performance 355 K Glossary of Terms 359 L Products Mentioned 375 Index 377
Integer and Floating Point Pipeline Operation
(Introductory material omitted)
Several independent steps must be performed by the CPU to complete even the simplest of instructions. Let's take a look at the internal operations of the CPU for a few instructions.
To load data from memory, i.e.,
mov ax, [bx]
The processor must perform the following actions:
Let's look at another instruction
add ax, bx
For this instruction the processor must:
(material omitted)
Not every instruction needs to perform all the same steps, but they certainly are very similar. What if a computer was designed like a factory assembly line? Each instruction would move from station to station as if it were on a conveyor belt. At each station a specialist would perform its job. This is how a pipeline works. It was first used by Intel's 80x86 processors on the 486. The stations, or stages are as follows
PF pre-fetch D1 instruction decode D2 address generation EX execute and cache access WB write back
Here is a brief description of what happens during each pipeline stage:
PF: Prefetch. Instructions are fetched from the cache or memory are stored in the prefetch queue.
D1: Instruction decode or Decode1. The instruction is decoded and broken into component parts, opcode and operands. An extra cycle is required for instructions that contain a prefix.
D2: Decode2 or Address Generation. The effective address of the memory operand, if present, is calculated. On the 486 an extra cycle is required if an address contains both a base and an index component or both a displacement and an immediate data value.
EX: Execute and Cache Access. The processor performs the action(s) required by the instruction including reading data from memory and storing the results in registers.
WB: Write Back. Instructions complete and data to be written to memory is sent to a write buffer. Instructions are enabled to modify the CPU state.
Instructions proceed from one stage to the next stage in one cycle, (usually). So even the fastest instruction on the 486 takes 5 cycles to complete. But wait! If you've looked at any of the CPU cycle tables you've seen that many instructions take only 1 cycle on the 486. How can this be?
The published cycle times are actually the fewest number of cycles that an instruction would take when added to a stream of other instructions. Think of it as the effective throughput, not the actual time an instruction is in the pipeline. Cars may drive off an assembly line every two minutes, but each one takes many hours to assemble. Here's an example:
stages --> PF D1 D2 EX WB cycles 1 I1 I1 starts 2 I2 I1 I2 starts 3 I3 I2 I1 I3 starts 4 I4 I3 I2 I1 5 I5 I4 I3 I2 I1 I1 completes 6 I6 I5 I4 I3 I2 I2 completes 7 I7 I6 I5 I4 I3 I3 completes (I1 = instruction #1, I2 = instruction #2, etc.)
(material omitted)
When working on the Pentium we must take into account the fact that there are now two pipelines (the U pipe and the V pipe) -- like a factory with two assembly lines or two conveyor belts. However, unlike building cars or some other product, the outputs from the two pipelines must be kept in order. Each instruction in each pipeline can modify data in memory, data in registers and the state of the CPU.
Because of this it is sometimes necessary for activities in one pipeline to be known by the other pipeline. This is done so that the end result is exactly the same as if instructions were executed sequentially in the exact same order as they were fetched.
stages --> PF D1 D2 EX WB cycles pipe 1 U I1 I1 starts V I2 I2 starts 2 U I3 I1 I3 starts V I4 I2 I4 starts 3 U I5 I3 I1 I5 starts V I6 I4 I2 I6 starts 4 U I7 I5 I3 I1 I7 starts V I8 I6 I4 I2 I8 starts 5 U I9 I7 I5 I3 I1 I1 completes V I10 I8 I6 I4 I2 I2 completes 6 U I11 I9 I7 I5 I3 I3 completes V I12 I10 I8 I6 I4 I4 completes
While progressing through the pipeline instructions may stall for various reasons. To insure proper sequencing of instruction results, instructions in the U pipe and V pipe enter and leave the D1 and D2 pipeline stages in unison.
When an instruction in one pipe causes a stall, then both pipelines stall. When an instruction in the U pipe stalls in the EX stage, the instruction in the V pipe also stalls. However, if the V pipe instruction stalls, the U pipe instruction is allowed to continue to the WB stage. The next instruction (or instruction pair) is not allowed to enter the EX stage until both previous instructions enter the WB stage.
This prevents the possibility of a V pipe instruction stalling in the
EX stage and being passed by instructions in the U pipe.
(remaining portion of chapter omitted)
"The difference between the almost-right word and the right
word is really a large matter
-- it's the difference between the lightning bug and the lightning."
Mark Twain
The "optimization" process starts early in the development cycle. It should really start even before choosing an algorithm. The requirements of the project should clearly define the performance criteria. But that rarely happens. The design includes, among other details, your choice of data structures and algorithms. Your data structures and algorithms will have the most drastic impact on performance in all but the most extreme cases.
There are many sources of information on good algorithms, including your colleagues, on-line information services and books. My favorite book for algorithms is "Algorithms" by Robert Sedgewick, Addison Wesley. I don't know how long I would have studied Quicksort before understanding it without this book. A few hours spent working with the correct algorithms will almost always payoff more than days of code twiddling -- "an ounce of design is worth a pound of debugging."
For the most part, the optimizations in this book are focused on replacing almost-right instructions with the right instructions.
In this chapter and the next we'll attempt various techniques for optimizing code that operates on strings. Keep in mind, although many routines operate on ASCII string data, this is not a requirement. In most cases, with few modifications, the string could just as easily be structures of numbers or graphics data.
You'll also notice we're going to be primarily interested in working with small loops. We'll be doing this for several reasons:
First, small loops of code are ideal for learning superscalar programming techniques.
Second, these loops are of great practical use for everyday programming.
Third, optimizing the inner-most loops in any routine usually provides the best first level of optimization of the Pentium. Optimizing code that does not run in a loop on the Pentium may not speed up the code at all.
Consider the following code I wrote in the early 80's for the 8086. This code copies an ASCIIZ string (a string of ASCII characters terminated with a null byte).
lbl: lodsb ; load a byte stosb ; store a byte or al, al ; check for a null jne lbl ; loop if not a null
An alternative to writing the code with the string instructions is to use the combination of the corresponding MOV and INC for the LODSB and STOSB (see figure 13.1b). This does not duplicate the original function exactly since STOSB uses the ES segment by default. Adding a segment override to the second MOV in figure 13.1b would do this and change the cycle counts by 1 or 2 on some CPU's.
Throughout all the examples, unless stated otherwise, I will assume the code has been rearranged to eliminate segment overrides or some initialization code has made this unnecessary. When required to perform intra-segment operations, such as a copy from one segment to another, the code will be slower.
I will defer discussing this until the end of this section. As each new CPU was introduced over the last decade, I reviewed code such as the string copy to see if it would benefit from being changed. On the 8088 through 386, string instructions are generally better or equal in performance to simple load and store instructions. On the 486, (being more RISC-like), simple load and store instructions tend to perform better.
Although string operations on the 486 do not continue to measure up in speed, they are still more compact (in this case, 6 bytes vs. 11). With the Pentium, the speed up is dramatic -- from 6 cycles to 3, a (theoretical) 100% speed up while on the 486 the speed up is 75% (14 to 8 cycles).
Notice in the figures I have indicated the number of cycles for each Pentium instruction assuming no pairing occurs. In the next column (titled: w/pairing) the cycles are given assuming pairing occurs according to the pairing rules. An instruction that executes in the V pipe will show the number of cycles beyond those required for the U pipe instruction (usually 0).
We've seen that some 80x86 string instructions are slower than executing the simple move and increment instructions, since the simple instructions can be paired and execute in a single cycle. In addition the CMP/Jcc (or TEST/Jcc) combination can be paired so this is also executed in a single cycle. See table 13.1. (We'll cover the repeat string instructions later.)
; 8088 286 386 486 Pent. w/pair bytes loop1: lodsb ; 16 5 5 5 2 2 1 stosb ; 15 3 4 5 3 3 1 or al, al ; 3 2 2 1 1 1 2 jne loop1 ; 16 8 8 3 1 0 2 -------------------------- --- 50 18 19 14 7 6 6
; 8088 286 386 486 Pent. w/pair bytes loop2: mov al, [si] ; 17 5 4 1 1 1 2 inc si ; 3 2 2 1 1 0 1 mov [di], al ; 18 4 2 1 1 1 2 inc di ; 3 2 2 1 1 0 1 cmp al, 0 ; 4 3 2 1 1 1 2 jne loop2 ; 16 9 8 3 1 0 2 -------------------------- --- 61 25 20 8 6 3 11 Notes: w/pair = cycles with pairing bytes = instruction length in bytes