Pentium^TM Processor Optimization Tools

ISBN:	0-12-627230-1
Author:	Mike Schmit
Publisher:	AP Professional, 1995

I  Review and Historical Context
   1   Number Systems                                   3
   2   What is Assembly?                                13
   3   The 8086 Family History and Architecture         19
II 80x86 Family Background
   4   8086 Architecture and Instruction Set            29
   5   Writing Beginning programs                       75
   6   Assembly Tools                                   83
   7   The Instruction Set Evolves: 186 to the 386      89
III Introduction to Pentium and Tools
   8   80486 and Pentium                                103
   9   Superscalar Programming                          109
   10  Integer and Floating Point Pipeline Operation    119
   11  Using the Pentium Optimizer Program              139
   12  Timing with a Software Timer                     143
IV Superscalar Pentium Programming
   13  Optimization Warm-ups                            153
   14  String Search and Translate                      163
   15  Checksums and Extended Precision Addition        179
V  Advanced Topics
   16  Floating Point Math                              195
   17  Interfacing to C                                 211
   18  Protected mode programming                       231
   19  Final Notes and Optimizations                    255
VI PowerPC vs. Pentium
   20  PowerPC vs. Pentium                              273

Appendices

  A  Instruction Set Reference                          283
  B  Optimization Cross-reference by Instruction        317
  C  Optimization Guide-lines by CPU                    327
  D  Simple Instructions for Pentium Pairing            333
  E  Instruction Pairing Rules for Pentium              335
  F  Single-byte Instructions                           337
  G  Quick Reference for Important Instruction Timings  341
  H  Undocumented Pentium Registers                     345
  I  DEBUG32 command summary                            349
  J  Improving Performance                              355
  K  Glossary of Terms                                  359
  L  Products Mentioned                                 375
     Index                                              377

Diskette includes:

over 100 source files (.asm and .c) from the book
PentOpt a working demo of Quantasm's PentOpt Professional program for optimizing code for the 486 and Pentium processors
DEBUG32 32-bit protected-mode debugger with DPMI support
16-bit and 32-bit libraries for software timing with the IBM PC hardware timers

The following excerpts are taken directly from the book (reformatted for the web) with permission of the copyright holder.

Sample section #1: Chapter 10

Integer and Floating Point Pipeline Operation
(Introductory material omitted)

Pipelines

Several independent steps must be performed by the CPU to complete even the simplest of instructions. Let's take a look at the internal operations of the CPU for a few instructions.

To load data from memory, i.e.,

mov ax, [bx]

The processor must perform the following actions:

fetch the instruction
decode what actions must be taken
calculate the effective address by combining: (DS * 16) + BX
read the data from memory
store the data in the AX register

Let's look at another instruction

add ax, bx

For this instruction the processor must:

fetch the instruction
decode what actions must be taken
add the two registers
store the data in the AX register

(material omitted)

Not every instruction needs to perform all the same steps, but they certainly are very similar. What if a computer was designed like a factory assembly line? Each instruction would move from station to station as if it were on a conveyor belt. At each station a specialist would perform its job. This is how a pipeline works. It was first used by Intel's 80x86 processors on the 486. The stations, or stages are as follows

        PF      pre-fetch
        D1      instruction decode
        D2      address generation
        EX      execute and cache access
        WB      write back

Here is a brief description of what happens during each pipeline stage:

PF: Prefetch. Instructions are fetched from the cache or memory are stored in the prefetch queue.

D1: Instruction decode or Decode1. The instruction is decoded and broken into component parts, opcode and operands. An extra cycle is required for instructions that contain a prefix.

D2: Decode2 or Address Generation. The effective address of the memory operand, if present, is calculated. On the 486 an extra cycle is required if an address contains both a base and an index component or both a displacement and an immediate data value.

EX: Execute and Cache Access. The processor performs the action(s) required by the instruction including reading data from memory and storing the results in registers.

WB: Write Back. Instructions complete and data to be written to memory is sent to a write buffer. Instructions are enabled to modify the CPU state.

Instructions proceed from one stage to the next stage in one cycle, (usually). So even the fastest instruction on the 486 takes 5 cycles to complete. But wait! If you've looked at any of the CPU cycle tables you've seen that many instructions take only 1 cycle on the 486. How can this be?

The published cycle times are actually the fewest number of cycles that an instruction would take when added to a stream of other instructions. Think of it as the effective throughput, not the actual time an instruction is in the pipeline. Cars may drive off an assembly line every two minutes, but each one takes many hours to assemble. Here's an example:

Table 10.2 Normal Pipeline Operation

    stages --> PF      D1      D2      EX      WB
    cycles
      1        I1                                       I1 starts
      2        I2      I1                               I2 starts
      3        I3      I2      I1                       I3 starts
      4        I4      I3      I2      I1
      5        I5      I4      I3      I2      I1       I1 completes
      6        I6      I5      I4      I3      I2       I2 completes
      7        I7      I6      I5      I4      I3       I3 completes

      (I1 = instruction #1, I2 = instruction #2, etc.)

(material omitted)

Paired Pipelines

When working on the Pentium we must take into account the fact that there are now two pipelines (the U pipe and the V pipe) -- like a factory with two assembly lines or two conveyor belts. However, unlike building cars or some other product, the outputs from the two pipelines must be kept in order. Each instruction in each pipeline can modify data in memory, data in registers and the state of the CPU.

Because of this it is sometimes necessary for activities in one pipeline to be known by the other pipeline. This is done so that the end result is exactly the same as if instructions were executed sequentially in the exact same order as they were fetched.

Table 10.5 Pentium Pipeline Operation

     stages --> PF      D1      D2      EX      WB
 cycles  pipe
      1  U     I1                                       I1 starts
         V     I2                                       I2 starts
      2  U     I3      I1                               I3 starts
         V     I4      I2                               I4 starts
      3  U     I5      I3      I1                       I5 starts
         V     I6      I4      I2                       I6 starts
      4  U     I7      I5      I3      I1               I7 starts
         V     I8      I6      I4      I2               I8 starts
      5  U     I9      I7      I5      I3      I1       I1 completes
         V     I10     I8      I6      I4      I2       I2 completes
      6  U     I11     I9      I7      I5      I3       I3 completes
         V     I12     I10     I8      I6      I4       I4 completes

While progressing through the pipeline instructions may stall for various reasons. To insure proper sequencing of instruction results, instructions in the U pipe and V pipe enter and leave the D1 and D2 pipeline stages in unison.

When an instruction in one pipe causes a stall, then both pipelines stall. When an instruction in the U pipe stalls in the EX stage, the instruction in the V pipe also stalls. However, if the V pipe instruction stalls, the U pipe instruction is allowed to continue to the WB stage. The next instruction (or instruction pair) is not allowed to enter the EX stage until both previous instructions enter the WB stage.

This prevents the possibility of a V pipe instruction stalling in the EX stage and being passed by instructions in the U pipe.
(remaining portion of chapter omitted)

Sample section #2: Chapter 13

Optimization Warm-ups

"The difference between the almost-right word and the right word is really a large matter
-- it's the difference between the lightning bug and the lightning." Mark Twain

The "optimization" process starts early in the development cycle. It should really start even before choosing an algorithm. The requirements of the project should clearly define the performance criteria. But that rarely happens. The design includes, among other details, your choice of data structures and algorithms. Your data structures and algorithms will have the most drastic impact on performance in all but the most extreme cases.

There are many sources of information on good algorithms, including your colleagues, on-line information services and books. My favorite book for algorithms is "Algorithms" by Robert Sedgewick, Addison Wesley. I don't know how long I would have studied Quicksort before understanding it without this book. A few hours spent working with the correct algorithms will almost always payoff more than days of code twiddling -- "an ounce of design is worth a pound of debugging."

For the most part, the optimizations in this book are focused on replacing almost-right instructions with the right instructions.

In this chapter and the next we'll attempt various techniques for optimizing code that operates on strings. Keep in mind, although many routines operate on ASCII string data, this is not a requirement. In most cases, with few modifications, the string could just as easily be structures of numbers or graphics data.

You'll also notice we're going to be primarily interested in working with small loops. We'll be doing this for several reasons:

First, small loops of code are ideal for learning superscalar programming techniques.

Second, these loops are of great practical use for everyday programming.

Third, optimizing the inner-most loops in any routine usually provides the best first level of optimization of the Pentium. Optimizing code that does not run in a loop on the Pentium may not speed up the code at all.

String instruction optimizations

Consider the following code I wrote in the early 80's for the 8086. This code copies an ASCIIZ string (a string of ASCII characters terminated with a null byte).

      lbl:
        lodsb                   ; load a byte
        stosb                   ; store a byte
        or      al, al          ; check for a null
        jne     lbl             ; loop if not a null

An alternative to writing the code with the string instructions is to use the combination of the corresponding MOV and INC for the LODSB and STOSB (see figure 13.1b). This does not duplicate the original function exactly since STOSB uses the ES segment by default. Adding a segment override to the second MOV in figure 13.1b would do this and change the cycle counts by 1 or 2 on some CPU's.

Throughout all the examples, unless stated otherwise, I will assume the code has been rearranged to eliminate segment overrides or some initialization code has made this unnecessary. When required to perform intra-segment operations, such as a copy from one segment to another, the code will be slower.

I will defer discussing this until the end of this section. As each new CPU was introduced over the last decade, I reviewed code such as the string copy to see if it would benefit from being changed. On the 8088 through 386, string instructions are generally better or equal in performance to simple load and store instructions. On the 486, (being more RISC-like), simple load and store instructions tend to perform better.

Although string operations on the 486 do not continue to measure up in speed, they are still more compact (in this case, 6 bytes vs. 11). With the Pentium, the speed up is dramatic -- from 6 cycles to 3, a (theoretical) 100% speed up while on the 486 the speed up is 75% (14 to 8 cycles).

Notice in the figures I have indicated the number of cycles for each Pentium instruction assuming no pairing occurs. In the next column (titled: w/pairing) the cycles are given assuming pairing occurs according to the pairing rules. An instruction that executes in the V pipe will show the number of cycles beyond those required for the U pipe instruction (usually 0).

We've seen that some 80x86 string instructions are slower than executing the simple move and increment instructions, since the simple instructions can be paired and execute in a single cycle. In addition the CMP/Jcc (or TEST/Jcc) combination can be paired so this is also executed in a single cycle. See table 13.1. (We'll cover the repeat string instructions later.)

Figure 13.1a ASCIIZ string copy

                         ; 8088  286 386 486 Pent. w/pair  bytes

   loop1:
      lodsb              ;  16    5   5   5   2    2         1
      stosb              ;  15    3   4   5   3    3         1
      or     al, al      ;   3    2   2   1   1    1         2
      jne    loop1       ;  16    8   8   3   1    0         2
                           --------------------------       ---
                            50   18  19  14   7    6         6

figure 13.1b

                         ; 8088  286 386 486 Pent. w/pair  bytes
   loop2:
      mov    al, [si]    ;  17    5   4   1   1    1         2
      inc    si          ;   3    2   2   1   1    0         1
      mov    [di], al    ;  18    4   2   1   1    1         2
      inc    di          ;   3    2   2   1   1    0         1
      cmp    al, 0       ;   4    3   2   1   1    1         2
      jne    loop2       ;  16    9   8   3   1    0         2
                           --------------------------       ---
                            61   25  20   8   6    3        11
Notes: w/pair = cycles with pairing
       bytes  = instruction length in bytes

As an exercise: further optimize the figure 13.1b code from 3 cycles per byte to 2 cycles per byte.

Home Page e-mail to Quantasm Order form Site Map

PentiumTM Processor Optimization Tools

Table of Contents