aboutsummaryrefslogblamecommitdiffstats
path: root/erts/emulator/hipe/hipe_amd64_abi.txt
blob: 8a34bfa67fa8daec9c94d94dd5dcde946cbead6f (plain) (tree)
1
2
3
4
5


                 
 















































































































































                                                                                       
 %CopyrightBegin%
 %CopyrightEnd%



HiPE AMD64 ABI
==============
This document describes aspects of HiPE's runtime system
that are specific for the AMD64 (x86-64) architecture.

Register Usage
--------------
%rsp and %rbp are fixed and must be preserved by calls (callee-save).
%rax, %rbx, %rcx, %rdx, %rsi, %rdi, %r8, %r9, %r10, %r11, %r12, %r13, %r14
are clobbered by calls (caller-save).
%r15 is a fixed global register (unallocatable).

%rsp is the native code stack pointer, growing towards lower addresses.
%rbp (aka P) is the current process' "Process*".
%r15 (aka HP) is the current process' heap pointer. (If HP_IN_R15 is true.)

Notes:
- C/AMD64 16-byte aligns %rsp, presumably for SSE and signal handling.
  HiPE/AMD64 does not need that, so our %rsp is only 8-byte aligned.
- HiPE/x86 uses %esi for HP, but C/AMD64 uses %rsi for parameter passing,
  so HiPE/AMD64 should not use %rsi for HP.
- Using %r15 for HP requires a REX instruction prefix, but performing
  64-bit stores needs one anyway, so the only REX-prefix overhead
  occurs when incrementing or copying HP [not true (we need REX for 64
  bit add and mov too);�only overhead is when accessing floats on the
  heap /Luna].
- XXX: HiPE/x86 could just as easily use %ebx for HP. HiPE/AMD64 could use
  %rbx, but the performance impact is probably minor. Try&measure?
- XXX: Cache SP_LIMIT, HP_LIMIT, and FCALLS in registers? Try&measure.

Calling Convention
------------------
Same as in the HiPE/x86 ABI, with the following adjustments:

The first NR_ARG_REGS (a tunable parameter between 0 and 6, inclusive)
parameters are passed in %rsi, %rdx, %rcx, %r8, %r9, and %rdi.

The first return value from a function is placed in %rax, the second
(if any) is placed in %rdx.

Notes:
- Currently, NR_ARG_REGS==0.
- C BIFs expect P in C parameter register 1: %rdi. By making Erlang
  parameter registers 1-5 coincide with C parameter registers 2-6,
  our BIF wrappers can simply move P to %rdi without having to shift
  the remaining parameter registers.
- A few primop calls target C functions that do not take a P parameter.
  For these, the code generator should have a "ccall" instruction which
  passes parameters starting with %rdi instead of %rsi.
- %rdi can still be used for Erlang parameter passing. The BIF wrappers
  will push it to the C stack, but \emph{parameter \#6 would have been
  pushed anyway}, so there is no additional overhead.
- We could pass more parameters in %rax, %rbx, %r10, %r11, %r12, %r13,
  and %r14. However:
  * we may need a scratch register for distant call trampolines
  * using >6 argument registers complicates the mode-switch interface
    (needs hacks and special-case optimisations)
  * it is questionable whether using more than 6 improves performance;
    it may be better to just cache more P state in registers

Instruction Encoding / Code Model
---------------------------------
AMD64 maintains x86's limit of <= 32 bits for PC-relative offsets
in call and jmp instructions. HiPE/AMD64 handles this as follows:
- The compiler emits ordinary call/jmp instructions for
  recursive calls and tailcalls.
- The runtime system code is loaded into the low 32 bits of the
  address space. (C/AMD64 small or medium code model.) By using mmap()
  with the MAP_32BIT flag when allocating memory for code, all
  code will be in the low 32 bits of the address space, and hence
  no trampolines will be necessary.

When generating code for non-immediate literals (boxed objects in
the constants pool), the code generator should use AMD64's new
instruction for loading a 64-bit immediate into a register:
mov reg,imm with a rex prefix.

Notes:
- The loader/linker could redirect a distant call (where the offset
  does not fit in a 32-bit signed immediate) to a linker-generated
  trampoline. However, managing trampolines requires changes in the
  loaders and possibly also the object code format, since the trampoline
  must be close to the call site, which implies that code and its
  trampolines must be created as a unit. This is the better long-term
  solution, not just for AMD64 but also for SPARC32 and PowerPC,
  both of which have similar problems.
- The constants pool could also be restricted to the low 32 bits of
  the address space. However:
  * We want to move away from a single constants pool. With multiple
    areas, the address space restriction may be unrealistic.
  * Creating the address of a literal is an infrequent operation, so
    the performance impact of using 64-bit immediates should be minor.

Stack Frame Layout
Garbage Collection Interface
BIFs
Stacks and Unix Signal Handlers
-------------------------------
Same as in the HiPE/x86 ABI.


Standard C/AMD64 Calling Conventions
====================================
See <http://www.x86-64.org/abi.pdf>.

%rax, %rdx, %rcx, %rsi, %rdi, %r8, %r9, %r10, %r11 are clobbered by calls (caller-save)
%rsp, %rbp, %rbx, %r12, %r13, %r14, %r15 are preserved by calls (callee-save)
[note: %rsi and %rdi are calleR-save, not calleE-save as in the x86 ABI]
%rsp is the stack pointer (fixed). It is required that ((%rsp+8) & 15) == 0
when a function is entered. (Section 3.2.2 in the ABI document.)
%rbp is optional frame pointer or local variable
The first six integer parameters are passed in %rdi, %rsi, %rdx, %rcx, %r8, and %r9.
Remaining integer parameters are pushed right-to-left on the stack.
When calling a variadic function, %rax (%al actually) must contain an upper
bound on the number of SSE parameter registers, 0-8 inclusive.
%r10 is used for passing a function's static chain pointer.
%r11 is available for PLT code when computing the target address.
The first integer return value is put in %rax, the second (for __int128) in %rdx.
A memory return value (exact definition is complicated, but basically "large struct"),
is implemented as follows: the caller passes a pointer in %rdi as a hidden first
parameter, the callee stores the result there and returns this pointer in %rax.
The caller deallocates stacked parameters after return (addq $N, %rsp).

Windows 64-bit C Calling Conventions
====================================
See "Calling Convention for x64 64-Bit Environments" in msdn.

%rax, %rcx, %rdx, %r8, %r9, %r10, %r11 are clobbered by calls (caller-save).
%rsp, %rbp, %rbx, %rsi, %rdi, %r12, %r13, %r14, %r15 are preserved
by calls (callee-save).
[Note: %rsi and %rdi are calleE-save not calleR-save as in the Linux/Solaris ABI]
%rsp is the stack pointer (fixed). %rsp & 15 should be 0 at all times,
except at the start of a function's prologue when ((%rsp+8) & 15) == 0.
Leaf functions may leave (%rsp & 15) != 0.
The first four integer parameters are passed in %rcx, %rdx, %r8, and %r9.
Remaining integer parameters are pushed right-to-left on the stack,
starting at the fifth slot above the caller's stack pointer.
The bottom of the caller's frame must contain 4 slots where the callee
can save the four integer parameter registers, even if fewer than 4
parameters are passed in registers.
An integer return value is put in %rax. Large integers (_m128), floats,
and doubles are returned in %xmm0. Larger return values cause the caller
to pass a pointer to a result buffer in %rcx as a hidden first parameter.
The caller may deallocate stacked parameters after return (addq $N, %rsp).