Diffstat (limited to 'lib/hipe/x86/NOTES.OPTIM')
 -rw-r--r--  lib/hipe/x86/NOTES.OPTIM | 200
 1 file changed, 200 insertions, 0 deletions
diff --git a/lib/hipe/x86/NOTES.OPTIM b/lib/hipe/x86/NOTES.OPTIM
new file mode 100644
index 0000000000..4c241cacb4
--- /dev/null
+++ b/lib/hipe/x86/NOTES.OPTIM
@@ -0,0 +1,200 @@

$Id$

Partial x86 code optimisation guide
===================================
Priority should be given to P6 and P4, then K7,
then P5, and last to K6.

Rules that are blatantly obvious or irrelevant for HiPE are
generally not listed. These include things like alignment
of basic data types, store-forwarding rules when alignment
or sizes don't match, and partial register stalls.

Intel P4
--------
The P6 4-1-1 insn decode template no longer applies.

Simple insns (add/sub/cmp/test/and/or/xor/neg/not/mov/sahf)
are twice as fast as in P6.

Shifts and "movsx" (sign-extend) are slower than in P6.

Always avoid "inc" and "dec"; use "add" and "sub" instead,
due to condition-code dependency overhead.

"fxch" is slightly more expensive than in P6, where it was free.

Use "setcc" or "cmov" to eliminate unpredictable branches.

For hot code executing out of the trace cache, alignment of
branch targets is less of an issue compared to P6.

Do use "fxch" to simulate a flat FP register file, but only
for that purpose, not for manual scheduling for parallelism.

Using "lea" is highly recommended.

Eliminate redundant loads. Use regs as much as possible.

Left shifts by up to 3 have longer latencies than the equivalent
sequence of adds.

Do utilise the addressing modes to save registers and trace
cache bandwidth.

"xor reg,reg" or "sub reg,reg" preferred over moving zero to reg.

"test reg,reg" preferred over "cmp" with zero or "and".

Avoid explicit cmp/test;jcc if the preceding insn (alu, but not
mov or lea) set the condition codes.

Load-execute alu insns (mem src) are Ok.

Add-reg-to-mem slightly better than add-mem-to-reg.

Add-reg-to-mem is better than load;add;store.

Intel P6
--------
4-1-1 instruction decoding template: can decode one semi-complex
(max 4 uops) and two simple (1 uop) insns per clock; follow a
complex insn by two simple ones, otherwise the decoders will stall.

Load-execute (mem src) alu insns are 2 uops.
Read-modify-write (mem dst) alu insns are 4 uops.

Insns longer than 7 bytes block parallel decoding.
Avoid insns longer than 7 bytes.

Lea is useful.

"movzx" is preferred for zero-extension; the xor;mov alternative
causes a partial register stall.

Use "test" instead of "cmp" with zero.

Pull address calculations into load and store insn addressing modes.

Clear a reg with "xor", not by moving zero to it.

Many alu insns set the condition codes. Replace "alu;cmp;jcc"
with "alu;jcc". This is not applicable for "mov" or "lea".
(A sketch of this rule follows the K7 notes below.)

For FP code, simulate a flat register file on the x87 stack by
using fxch to reorder it.

AMD K7
------
Select DirectPath insns. Avoid VectorPath insns due to slower decode.

Alu insns with mem src are very efficient.
Alu insns with mem dst are very efficient.

Fetches from I-cache are 16-byte aligned. Align functions and frequently
used labels at or near the start of 16-byte aligned blocks.

"movzx" preferred over "xor;mov" for zero-extension.

"push mem" preferred over "load;push reg".

"xor reg,reg" preferred over moving zero to the reg.

"test" preferred over "cmp".

"pop" insns are VectorPath. "pop mem" has latency 3, "pop reg" has
latency 4.

"push reg" and "push imm" are DirectPath, "push mem" is VectorPath.
The latency is 3 clocks.
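
Example of the "alu;jcc" rule noted for P4 and P6 above: a minimal
sketch in AT&T syntax; the label and register choices are
illustrative only, not taken from actual HiPE output.

	# before: explicit compare after the alu insn
	subl	$8, %eax
	cmpl	$0, %eax
	je	.Lzero

	# after: "sub" already set ZF, so branch on its flags directly
	subl	$8, %eax
	je	.Lzero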

Intel P5
--------
If a loop header is less than 8 bytes away from a 16-byte
boundary, align it to the 16-byte boundary.

If a return address is less than 8 bytes away from a 16-byte
boundary, align it to the 16-byte boundary.

Align function entry points to 16-byte boundaries.

Ensure that doubles are 64-bit aligned.

Data cache line size is 32 bytes. The whole line is brought
in on a read miss.

"push mem" is not pairable; loading a temp reg and pushing
the reg pairs better -- this is also faster on the 486.

No conditional move instruction.

Insns longer than 7 bytes can't go down the V-pipe or share
the insn FIFO with other insns.
Avoid insns longer than 7 bytes.

Lea is useful when it replaces several other add/shift insns.
Lea is not a good replacement for a single shl since a scaled
index requires a disp32 (or base), making the insn longer.

"movzx" is worse than the xor;mov alternative -- the opcode
prefix causes a slowdown and it is not pairable.

Use "test" instead of "cmp" with zero.

"test eax,imm" and "test reg,reg" are pairable, other forms are not.

Pull address calculations into load and store insn addressing modes.

Clear a reg with "xor", not by moving zero to it.

Many alu insns set the condition codes. Replace "alu;cmp;jcc"
with "alu;jcc". This is not applicable for "mov" or "lea".

For FP code, simulate a flat register file on the x87 stack by
using fxch to reorder it.

"neg" and "not" are not pairable. "test imm,reg" and "test imm,mem"
are not pairable. Shifts by "cl" are not pairable. Shifts by "1" or
"imm" are pairable but only execute in the U-pipe.

AMD K6
------
The insn size predecoder has a 3-byte window. Insns with both prefix
and SIB bytes cannot be short-decoded.

Use short and simple insns, including mem src alu insns.

Avoid insns longer than 7 bytes. They cannot be short-decoded.
Short-decode: max 7 bytes, max 2 uops.
Long-decode: max 11 bytes, max 4 uops.
Vector-decode: longer than 11 bytes or more than 4 uops.

Prefer read-modify-write alu insns (mem dst) over "load;op;store"
sequences, for code density and register pressure reasons.

Avoid the "(esi)" addressing mode: it forces the insn to be vector-decoded.
Use a different reg or add an explicit zero displacement.

"add reg,reg" preferred over a shl by 1; it parallelises better.

"movzx" preferred over "xor;mov" for zero-extension (see the sketch
at the end of these notes).

Moving zero to a reg preferred over "xor reg,reg", due to dependency
and condition-code overhead.

"push mem" preferred over "load;push reg" due to code density and
register pressure. (Page 64.)
Explicit moves preferred when pushing args for fn calls, due to
%esp dependencies and the possibility of random access. (Page 58.)
[hmm, these two are in conflict]

There is no penalty for a seg reg prefix unless there are multiple prefixes.

Align function entries and frequent branch targets to 16-byte boundaries.

Shifts by imm only go down one of the pipes.

"test reg,reg" preferred over "cmp" with zero.
"test reg,imm" is a long-decode insn.

No conditional move insn.
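
Example of the zero-extension rule ("movzx" preferred on P6, K7 and
K6; "xor;mov" preferred on P5): a minimal sketch in AT&T syntax; the
addressing mode and register choices are illustrative only.

	# movzx form, preferred on P6/K7/K6:
	movzbl	(%ecx), %eax

	# xor;mov form, preferred on P5, where the 0F-prefixed movzx
	# does not pair:
	xorl	%eax, %eax
	movb	(%ecx), %al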