The following is a list of optimizations that may come in handy. Each one is listed alphabetically (more or less) in the first column.
The second column lists the CPU or CPUs that the optimization applies to; alternatively, it may be marked as applicable to 16-bit or 32-bit code.
The third column contains one or more replacement code sequences that are either faster or smaller (sometimes both) than the instruction in the first column. For some obscure optimizations, the action of the first-column instruction is explained instead.
The fourth column contains a description and/or examples.
instruction       CPUs      replacement or action         description/notes
------------------------------------------------------------------------------------------------
aad (imm8)        all       AL = AL+(AH*imm8)             If imm8 is omitted, 10 is used.
                            AH = 0                        AAD is almost always slower,
                                                          but only 2 bytes long.

aam (imm8)        all       AH = AL/imm8                  Same as AAD.
                            AL = AL MOD imm8

add               16-bit    lea reg, [reg+reg+disp]       Use LEA to add base + index +
                                                          displacement. Also preserves flags;
                                                          for example:
                                                              add bx, 4
                                                          can be replaced by:
                                                              lea bx, [bx+4]
                                                          when the flags must not be changed.

add               32-bit    lea reg, [reg+reg*scale+disp] Use LEA to add base + scaled index
                                                          + disp. Also preserves flags (see
                                                          the previous example). The 32-bit
                                                          form of LEA is much more powerful
                                                          than the 16-bit version because of
                                                          the scaling and because almost all
                                                          of the 8 general-purpose registers
                                                          can be used as base and index
                                                          registers.

and reg, reg      Pent      test reg, reg                 Use TEST instead of AND on the
                                                          Pentium because fewer register
                                                          conflicts will result in better
                                                          pairing.

bswap             Pent      ror eax, 16                   Pairs in U pipe, BSWAP doesn't pair.
                                                          disadvantage: modifies flags
                                                          (Not a direct replacement)

call dest1        286+      push offset dest2             When a CALL is followed by a JMP,
jmp dest2                   jmp dest1                     change the return address to the
                                                          JMP destination.

call dest1        all       jmp dest1                     When a CALL is followed by a RET,
ret                                                       the CALL can be replaced by a JMP.

cbw               386+      mov ah, 0                     When you know AL < 128 use
                                                          MOV AH, 0 for speed. But use CBW
                                                          for smaller code size.

cdq               486+      xor edx, edx                  When you know EAX is positive.
                                                          Faster, better pairing.
                                                          disadvantage: modifies flags
                  Pent      mov edx, eax                  When the EAX value could be
                            sar edx, 31                   positive or negative, because of
                                                          better pairing.

cmp mem, reg      286       cmp reg, mem                  reg, mem is 1 cycle faster

cmp reg, mem      386       cmp mem, reg                  mem, reg is 1 cycle faster

dec reg16                   lea reg16, [reg16 - 1]        Use to preserve flags; works for
                                                          BX, BP, DI, SI

dec reg32                   lea reg32, [reg32 - 1]        Use to preserve flags; works for
                                                          EAX, EBX, ECX, EDX,
                                                          EDI, ESI, EBP

div <op>          8088      shr accum, 1                  When <op> resolves to 2, use a
                                                          shift for division.
                                                          (use CL for 4, 8, etc.)

div <op>          186+      shr accum, n                  When <op> resolves to a power of 2,
                                                          use shifts for division.

enter imm16, 0    286+      push bp                       ENTER is always slower and is
                            mov bp, sp                    4 bytes in length.
                            sub sp, imm16                 If imm16 = 0 then PUSH/MOV is
                                                          smaller.
                  386+      push ebp                      32-bit
                            mov ebp, esp
                            sub esp, imm16

inc reg16                   lea reg16, [reg16 + 1]        Use to preserve flags; works for
                                                          BX, BP, DI, SI

inc reg32                   lea reg32, [reg32 + 1]        Use to preserve flags; works for
                                                          EAX, EBX, ECX, EDX,
                                                          EDI, ESI, EBP

jcxz <dest>:      486+      test cx, cx                   JCXZ is faster and smaller on
                            je <dest>:                    8088-286. On the 386 it is about
                                                          the same speed.
                  486+      test ecx, ecx                 Never use JCXZ on the 486 or
                            je <dest>:                    Pentium except for compactness.

lea reg, mem      8088-286  mov reg, OFFSET mem           MOV reg, imm is faster on 8088-286.
                                                          On the 386+ they are the same.
                                                          Note: There are many uses for LEA,
                                                          see: add, inc, dec, mov, mul

leave             486+      mov sp, bp                    LEAVE is only 1 byte long and is
                            pop bp                        faster on the 186-386. The MOV/POP
                                                          is much faster on the 486 and
                            mov esp, ebp                  Pentium.
                            pop ebp

lodsb             486+      mov al, [si]                  LODS is only 1 byte long and is
                            inc si                        faster on 8088-386, much slower on
                                                          the 486. On the Pentium the MOV/INC
                                                          or MOV/ADD instructions pair,
                                                          taking only 1 cycle.

lodsw             486+      mov ax, [si]                  see lodsb
                            add si, 2

lodsd             486+      mov eax, [esi]                see lodsb
                            add esi, 4

loop <dest>:      386+      dec cx                        LOOP is faster and smaller on
                            jnz <dest>:                   8088-286. On 386+ DEC/JNZ is much
                                                          faster. On the Pentium the DEC/JNZ
                                                          instructions pair, taking only
                                                          1 cycle.

loopd <dest>:     386+      dec ecx                       see loop
                            jnz <dest>:

loopXX <dest>:    486+      je $+5                        The 3 replacement instructions
(XX = e, ne, z              dec cx                        are much faster on the 486+.
 or nz)                     jnz <dest>:                   LOOPxx is smaller and faster on
                                                          8088-286. The speed is about the
                                                          same on the 386.

loopdXX <dest>:   486+      je $+5                        see loopXX
                            dec ecx
                            jnz <dest>:

mov reg2, reg1    286+      lea reg2, [reg1+n]            LEA is faster, smaller and
followed by:                                              preserves flags. This is a way to
inc/dec/add/sub reg2                                      do a MOV and ADD/SUB of a
                                                          constant, n.
mov acc, reg      all       xchg acc, reg                 Use XCHG for smaller code when the
                                                          final value of one of the registers
                                                          can be ignored. Note that acc =
                                                          AL, AX or EAX.

mov mem, 1        Pent      lea bx, mem                   Displacement/immediate does not
                            mov [bx], 1                   pair. LEA/MOV can be used if other
                                                          code can be placed in between to
                                                          prevent AGIs.

                            mov ax, 1                     MOV/MOV may be easier to pair.
                            mov mem, ax

mov [bx+2], 1     Pent      mov ax, 1                     Better pairing because
                            mov [bx+2], ax                displacement/immediate
                                                          instructions do not pair.
                            lea bx, [bx+2]
                            mov [bx], 1

movsb             486+      mov al, [si]                  MOVS is faster and smaller to move
                            inc si                        a single byte, word or dword on
                            mov [di], al                  the 8088-386. On the 486+ the
                            inc di                        MOV/INC method is faster.
                                                          NOTE: REP MOVS is always faster to
                                                          move a large block.

movsw             486+      mov ax, [si]                  see MOVSB
                            add si, 2
                            mov [di], ax
                            add di, 2

movsd             486+      mov eax, [esi]                see MOVSB
                            add esi, 4
                            mov [edi], eax
                            add edi, 4

movzx r16, rm8    486+      xor bx, bx                    MOVZX is faster and smaller on the
                            mov bl, al                    386. On the 486+ XOR/MOV is faster.
                                                          Possible pairing on the Pentium.
movzx r32, rm8    486+      xor ebx, ebx                  (source can be reg or mem)
                            mov bl, al                    disadvantage: modifies flags

movzx r32, rm16   486+      xor ebx, ebx
                            mov bx, ax

mul n             8088+     shl ax, cl                    Use shifts or ADDs instead of
                                                          multiply when n is a power of 2.

mul n             Pent      add ax, ax                    ADD is better than a single shift
                                                          because it pairs better.

mul               32-bit    lea                           Use LEA to multiply by 2, 3, 4, 5,
                                                          8 or 9 (the factors a single
                                                          reg + reg*scale form can produce).
                            lea eax, [eax+eax*4]          (ex: multiply EAX * 5)
                                                          LEA is better than SHL on the
                                                          Pentium because it pairs in both
                                                          pipes; SHL pairs only in the
                                                          U pipe.

or reg, reg       Pent      test reg, reg                 Better pairing because OR writes to
                                                          the register.
                                                          (This is for src = dest.)

pop mem           486+      pop reg                       Faster on 486+
                            mov mem, reg                  Better pairing on Pentium

push mem          486+      mov reg, mem                  Faster on 486
                            push reg                      Better pairing on Pentium

pushf             486+      rcr reg, 1                    To save only the carry flag, use a
                            or                            rotate (RCR or RCL) into a
                            rcl reg, 1                    register. RCR and RCL are pairable
                                                          (U pipe only) and take 1 cycle.
                                                          PUSHF is slow and not pairable.
popf              486+      rcl reg, 1                    To restore only the carry flag.
                            or                            See PUSHF.
                            rcr reg, 1

rep scasb         Pent      loop1:                        REP SCAS is faster and smaller on
                            mov al, [di]                  8088-486. Expanded code is faster
                            inc di                        on Pentium due to pairing.
                            cmp al, reg2
                            je exit
                            dec cx
                            jnz loop1
                            exit:

shl reg, 1        Pent      add reg, reg                  ADD pairs better. SHL only pairs in
                                                          the U pipe.

stosb             486+      mov [di], al                  STOS is faster and smaller on the
                            inc di                        8088-286, and the same speed on
                                                          the 386. On the 486+ the MOV/INC
stosw             486+      mov [di], ax                  is slightly faster.
                            add di, 2
                                                          REP STOS is faster on 8088-386.
stosd             486+      mov [edi], eax                MOV/INC or MOV/ADD is faster on
                            add edi, 4                    the 486+.
                                                          Note: use LEA SI, [SI+n] (or the
                                                          DI equivalent) to advance the
                                                          pointer without changing the flags.

xchg              all                                     Use xchg acc, reg to do a 1 byte
                                                          MOV when one register can be
                                                          ignored.

xchg reg1, reg2   Pent      push reg1                     Pushes and pops are 1 cycle faster
                            push reg2                     on Pentium due to pairing.
                            pop reg1                      disadvantage: uses stack
                            pop reg2

                  Pent      mov reg3, reg1                Faster and better pairing if reg3
                            mov reg1, reg2                is available.
                            mov reg2, reg3

xlatb             486+      mov bh, 0                     XLAT is faster and smaller on
                            mov bl, al                    8088-386. MOVs are faster on 486+.
                            mov al, [bx]                  Best to rearrange instructions to
                                                          prevent AGIs and get pairing on
xlatb             486+      xor ebx, ebx                  Pentium. Force the high part of
                            mov bl, al                    BX/EBX to zero outside of the loop.
                            mov al, [ebx]                 disadvantage: modifies flags
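
Several entries in the table depend on an arithmetic identity holding only under a stated condition (CDQ vs. MOV/SAR, CBW vs. MOV AH, 0, MUL vs. LEA). The following is a minimal sketch, in Python rather than assembly, that models the register results so those conditions can be checked. The helper names are hypothetical, registers are modelled as plain unsigned integers, and flags and timing are not modelled at all.

```python
# Hypothetical helper names; registers are modelled as plain unsigned ints.
MASK8, MASK32 = 0xFF, 0xFFFFFFFF


def to_signed32(x):
    """Reinterpret a 32-bit register value as a signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def cdq(eax):
    """EDX after CDQ: the sign bit of EAX copied into every bit."""
    return MASK32 if eax & 0x80000000 else 0


def cdq_via_sar(eax):
    """EDX after 'mov edx, eax' / 'sar edx, 31'."""
    return (to_signed32(eax) >> 31) & MASK32


def cbw(al):
    """AX after CBW: AL sign-extended to 16 bits."""
    return (al | 0xFF00) if al & 0x80 else al


def mov_ah_0(al):
    """AX after 'mov ah, 0': AL zero-extended to 16 bits."""
    return al & MASK8


def mul5(eax):
    """EAX after multiplying by 5, truncated to 32 bits."""
    return (eax * 5) & MASK32


def lea_times5(eax):
    """EAX after 'lea eax, [eax+eax*4]'."""
    return (eax + (eax << 2)) & MASK32


def aad(ah, al, imm8=10):
    """(AH, AL) after AAD: AL = AL + AH*imm8 (8 bits kept), AH = 0."""
    return 0, (al + ah * imm8) & MASK8


def aam(al, imm8=10):
    """(AH, AL) after AAM: AH = AL / imm8, AL = AL MOD imm8."""
    return al // imm8, al % imm8


samples = [0, 1, 5, 0x7FFFFFFF, 0x80000000, 0xDEADBEEF, MASK32]

# MOV/SAR matches CDQ for every sign; XOR EDX, EDX only when EAX >= 0.
assert all(cdq(v) == cdq_via_sar(v) for v in samples)
assert all(cdq(v) == 0 for v in samples if v < 0x80000000)

# LEA matches the multiply by 5, including 32-bit wraparound.
assert all(mul5(v) == lea_times5(v) for v in samples)

# MOV AH, 0 matches CBW only while AL < 128.
assert all(cbw(al) == mov_ah_0(al) for al in range(128))
assert cbw(0x80) != mov_ah_0(0x80)

print(aad(4, 2), aam(42))  # prints (0, 42) (4, 2)
```

The model checks only the values left in the registers. The flag side effects listed as disadvantages in the table, and all cycle counts and pairing behavior, still have to be verified against the target CPU.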