Age | Commit message | Author |
|
These will not only be useful for hipe_regalloc_prepass, but also, after
the introduction of a mk_move/2 (or similar) callback, for the purpose
of range splitting.
Since the substitution needed to case over all the instructions, a new
module, hipe_x86_subst, was introduced to the x86 backend.
Due to differences in the 'jtab' field of a #jmp_switch{} between x86
and amd64, it regrettably needed to be duplicated to hipe_amd64_subst.
|
|
Now that all backends do register allocation on a CFG directly and
define the defun_to_cfg/1 callback as the identity function, it can be
removed.
|
|
As the just_as_good_as assertion was loosened with the `NowRegs >=
CheckRegs` check, it no longer verified that hipe_regalloc_prepass had
not incorrectly labeled a temp as unallocatable. We add that behaviour
back.
|
|
This is due to the improvements in hipe_temp_map, removing the need for
duplicated logic in the backends.
|
|
hipe_regalloc_prepass speeds up register allocation by spilling any temp
that is live over a call (which clobbers all registers).
In order to detect these, a new function was added to the target
interface: defines_all_alloc/1, which takes an instruction and returns a
boolean.
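The idea can be sketched as follows. This is a minimal illustration in Python, not HiPE's Erlang code: the instruction representation (dicts with 'defs', 'uses' and 'is_call') is hypothetical, and defines_all_alloc below is only a stand-in for the real target callback.

```python
def defines_all_alloc(instr):
    """Hypothetical stand-in for the target callback: True if the
    instruction clobbers every allocatable register (e.g. a call)."""
    return instr.get("is_call", False)


def temps_live_over_calls(block, live_out):
    """Walk one basic block backwards and collect the temps that are
    live across an instruction that clobbers all registers."""
    live = set(live_out)
    spill = set()
    for instr in reversed(block):
        # Temps live after the instruction, minus the ones it defines,
        # have to survive across it.
        live_through = live - set(instr["defs"])
        if defines_all_alloc(instr):
            spill |= live_through
        live = live_through | set(instr["uses"])
    return spill


block = [
    {"defs": ["t1"], "uses": []},
    {"defs": ["t2"], "uses": []},
    {"defs": ["t3"], "uses": ["t2"], "is_call": True},
    {"defs": [], "uses": ["t1", "t3"]},
]
spilled = temps_live_over_calls(block, set())
```

Here t1 is defined before the call and used after it, so it is the only spill candidate; t2 is consumed by the call and t3 is defined by it.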
|
|
* sverker/hipe-performance-o1/PR-1154:
  hipe_sparc: Minimise CFG<->linear conversions
  hipe_ppc: Minimise CFG<->linear conversions
  hipe_arm: Minimise CFG<->linear conversions
  hipe_x86: Use lea instead of move+add
  hipe_arm: Improve peephole optimiser
  hipe_arm: Be resilient to crappy RTL
  hipe_ppc: Be resilient to crappy RTL
  hipe_sparc: Be resilient to crappy RTL
  hipe: Reuse liveness info for spillmin
  hipe_x86: Minimise CFG<->linear conversions
  hipe: Fix o0 and o1
  hipe: Add o0 and o1 to tests
  hipe_rtl_binary:get_word_integer/4: Handle imms
  hipe_x86: Be resilient to crappy RTL
  hipe_x86: LSRA for SSE2
|
|
* sverker/hipe-sparc-19/PR-1148:
  Eliminate catch-all clause from two functions
  Increase the time limit used by the test suite
|
|
* maint:
  dialyzer: Increase time limit of suites
  dialyzer: Remove a check that always fails
  dialyzer: Optimize an opaque type case
|
|
Fix a mistake in commit 85f6fe3b.
Instead of using the declared opaque type, the form's type is used in
a case where the opaque type is turned into a non-opaque type. The
result is more general types (smaller Erlang terms) and faster
analyses.
|
|
Now, there will only ever be a single Linear->CFG conversion, just after
lowering from RTL, and only ever a single CFG->Linear conversion, just
before the finalise pass. Both of these now happen in hipe_sparc_main.
|
|
Now, there will only ever be a single Linear->CFG conversion, just after
lowering from RTL, and only ever a single CFG->Linear conversion, just
before the finalise pass. Both of these now happen in hipe_ppc_main.
|
|
Now, there will only ever be a single Linear->CFG conversion, just after
lowering from RTL, and only ever a single CFG->Linear conversion, just
before the finalise pass. Both of these now happen in hipe_arm_main.
|
|
This is primarily useful for heap allocations, as a two-address 'add'
can't be used to both copy the heap pointer to another register, and add
the tag.
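A minimal sketch of the rewrite, in Python rather than HiPE's Erlang. The tuple encoding ("mov", Src, Dst) / ("add", Imm, Dst) is hypothetical, and whether HiPE performs this during instruction selection or as a peephole step is not stated here; the sketch also ignores that 'add' sets flags while 'lea' does not.

```python
def fuse_move_add(instrs):
    """Replace 'mov Src, Dst; add $Imm, Dst' with 'lea Imm(Src), Dst'."""
    out = []
    i = 0
    while i < len(instrs):
        cur = instrs[i]
        nxt = instrs[i + 1] if i + 1 < len(instrs) else None
        if (nxt is not None
                and cur[0] == "mov" and nxt[0] == "add"
                and isinstance(cur[1], str)   # source is a register name
                and isinstance(nxt[1], int)   # addend is an immediate
                and nxt[2] == cur[2]):        # same destination register
            # One three-operand instruction copies and adds at once.
            out.append(("lea", (nxt[1], cur[1]), cur[2]))
            i += 2
        else:
            out.append(cur)
            i += 1
    return out


# Copy the heap pointer 'hp' to r1 and add a tag of 2.
fused = fuse_move_add([("mov", "hp", "r1"), ("add", 2, "r1"),
                       ("mov", "r1", "r2")])
```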
|
|
The ARM backend crashed if certain RTL optimisations were omitted,
preventing it from being usable at lower optimisation levels.
One of the problems was caused by shift-by-immediate-zero, which wraps
to immediate-32 with some shiftops. TODO: Someplace should be modified
to crash when these are generated so debugging further instances of this
gets easier in the future.
|
|
The PowerPC backend crashed if certain RTL optimisations were omitted,
preventing it from being usable at lower optimisation levels.
|
|
The SPARC backend crashed if certain RTL optimisations were omitted,
preventing it from being usable at lower optimisation levels.
|
|
For x86, additionally reuse liveness from float LSRA for the GP LSRA.
|
|
Most x86 passes were either linearise(pass(to_cfg(Code))) or trivially
rewritable to process a CFG. This saves a great deal of time and memory
churn when compiling large programs.
Now, there will only ever be a single Linear->CFG conversion, just after
lowering from RTL, and only ever a single CFG->Linear conversion, just
before the finalise pass. Both of these now happen in hipe_x86_main.
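The difference between the two pipeline shapes can be sketched like this. It is a toy model in Python, not the hipe_x86_main code: the "CFG" is just a wrapped list, and all function names are hypothetical.

```python
def make_converters():
    """Return toy conversion functions plus a count of how often each ran."""
    counts = {"to_cfg": 0, "linearise": 0}

    def to_cfg(code):
        counts["to_cfg"] += 1
        return {"blocks": list(code)}   # toy CFG: just wraps the list

    def linearise(cfg):
        counts["linearise"] += 1
        return list(cfg["blocks"])

    return to_cfg, linearise, counts


def old_pipeline(code, passes, to_cfg, linearise):
    # Every pass converts to a CFG and back again.
    for cfg_pass in passes:
        code = linearise(cfg_pass(to_cfg(code)))
    return code


def new_pipeline(code, passes, to_cfg, linearise):
    # Convert once, run every pass on the CFG, linearise once at the end.
    cfg = to_cfg(code)
    for cfg_pass in passes:
        cfg = cfg_pass(cfg)
    return linearise(cfg)
```

With three passes, the old shape performs three conversions in each direction, while the new shape performs one of each regardless of the number of passes.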
|
|
These options would not do anything, because they would not suppress the
'o2' in ?COMPILE_DEFAULTS. Such behaviour is added to expand_options/2.
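A sketch of the intended behaviour, in Python. The contents of COMPILE_DEFAULTS beyond 'o2', the set of level names and the function name are assumptions for illustration; the real expand_options/2 has a different signature.

```python
OPT_LEVELS = {"o0", "o1", "o2", "o3"}   # assumed set of level options
COMPILE_DEFAULTS = ["o2"]               # placeholder for ?COMPILE_DEFAULTS


def effective_options(user_opts):
    """Append the defaults, dropping the default level if the user chose one."""
    if any(opt in OPT_LEVELS for opt in user_opts):
        defaults = [d for d in COMPILE_DEFAULTS if d not in OPT_LEVELS]
    else:
        defaults = list(COMPILE_DEFAULTS)
    return list(user_opts) + defaults
```

With this rule, asking for 'o1' yields only 'o1', while options that do not name a level still get the default 'o2' appended.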
|
|
Now that x86 is no longer broken with these optimisation levels, we add
them to the test suite to ensure they do not break again.
Bump timeout to 6min since tests are run twice as many times.
The option set of o1 was changed to all optimisations that run fast on
both big and small programs. This incurs only a slight compile time
increase compared to the old set, but presumably gives a significant
improvement in the speed of the compiled code.
Change o0 register allocator to linear_scan.
|
|
Immediate arguments to get_word_integer/4 would lead to bad but
unreachable RTL being generated. We omit its generation by testing for
immediates and performing the logic at compile time.
|
|
The x86 backend crashed if certain RTL optimisations were omitted,
preventing it from being usable at lower optimisation levels.
|
|
There is little point offering LSRA for x86 if we're still going to call
hipe_graph_coloring_regalloc for the floats. In particular, all
allocators except LSRA allocate an N^2 interference matrix, making them
unusable for really large functions.
|
|
A stronger version of Dialyzer complained that some case clauses in
functions xaluop_is_shift/1 and xaluop_normalise/1 are unreachable.
These clauses are now commented out. While at it, I thought that it
would be better to eliminate the catch-all clauses in order to be
certain we properly handle all RTL instructions that are used as
inputs to these functions.
Note: The code will now crash if there are unhandled cases.
|
|
This is required on some really old SPARC machines running Solaris
that we still have access to.
|
|
Register allocation could transform something like
    fmove u32, d99
to
    fmove $rdx, 0x20($rsp)
which is an invalid instruction.
|
|
Since the link register/return address is restored before stack
arguments are stored to the frame, we must not use it to store a stack
argument. We do that by adding it to the registers clobbered by
pseudo_tailcall_prepare.
|
|
The problem was caused by shift-by-immediate-zero, which wraps to
immediate-32 with some shiftops. TODO: Someplace should be modified to
crash when these are generated so debugging further instances of this
gets easier in the future.
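The wrap-around, and the kind of check the TODO asks for, might look like this. It is a sketch in Python that assumes the classic ARM immediate-shift encoding, where a 5-bit field of 0 means 32 for lsr and asr; the function names are hypothetical and only lsl, lsr and asr are modelled.

```python
def effective_shift(shiftop, imm5):
    """Shift amount the hardware applies for a 5-bit immediate field."""
    if shiftop not in ("lsl", "lsr", "asr"):
        raise ValueError("shiftop not modelled in this sketch")
    if not 0 <= imm5 <= 31:
        raise ValueError("immediate shift field is 5 bits wide")
    if shiftop in ("lsr", "asr") and imm5 == 0:
        return 32   # field value 0 encodes a shift by 32
    return imm5


def mk_shift(shiftop, amount):
    """Hypothetical constructor that refuses the ambiguous case early."""
    if shiftop != "lsl" and amount == 0:
        raise ValueError("shift-by-zero would be encoded as shift-by-32")
    return (shiftop, amount)
```

Failing at construction time pins the error to the pass that produced the shift instead of to wrong results at run time.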
|
|
The 'array' module is highly optimised for the hipe_vectors use-case,
and seems to perform slightly better than the gb_trees implementation.
Also, we remove the completely unnecessary hipe_vectors.hrl header.
|
|
Slightly improves performance.
|
|
Also, remove unused field 'counter' from #state{}.
|
|
Profiling showed that hipe_sdi spent most of its time in updateParents,
discarding nodes that were already deleted. By introducing a delete
operation to the segment trees, we can pay this cost only once, when
deleting the node from the graph.
Instead of keeping the ranges around, we recompute the range of the node
when we delete it, since this can be done in constant time, without any
memory allocation.
Although segment trees are not designed to be modified once built,
implementing a delete operation turned out to be a simple matter of
repeating insertion, but deleting the index from, instead of consing it
on, the appropriate nodes' values (segment lists).
This optimisation drastically sped up hipe_sdi to the point of no longer
being the bottleneck in the Assembly stage.
|
|
This speeds up parentsOfChild/2 from O(n) to O(lg n + k), where k is
the number of parents found.
A new module misc/hipe_segment_trees.erl is introduced.
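The technique can be sketched as follows. This is an array-based segment tree in Python, written for illustration only; the internal representation and API of hipe_segment_trees are not shown in these messages, so every name here is an assumption. It supports a stabbing query in O(lg n + k) and a delete that repeats the insertion walk, as described in the preceding entry.

```python
class SegmentTree:
    """Stores closed integer ranges over the points 0..size-1."""

    def __init__(self, size):
        self.n = 1
        while self.n < size:
            self.n *= 2
        # nodes[1] is the root; nodes[self.n + p] is the leaf for point p.
        self.nodes = [[] for _ in range(2 * self.n)]

    def _canonical(self, lo, hi):
        """Yield the O(lg n) nodes that together cover exactly [lo, hi]."""
        lo += self.n
        hi += self.n + 1            # work with the half-open range [lo, hi)
        while lo < hi:
            if lo & 1:
                yield lo
                lo += 1
            if hi & 1:
                hi -= 1
                yield hi
            lo //= 2
            hi //= 2

    def insert(self, lo, hi, ident):
        for node in self._canonical(lo, hi):
            self.nodes[node].append(ident)

    def delete(self, lo, hi, ident):
        # Same walk as insert, but remove the identifier instead.
        for node in self._canonical(lo, hi):
            self.nodes[node].remove(ident)

    def stab(self, point):
        """Return the identifiers of all stored ranges containing point."""
        node = point + self.n
        found = []
        while node >= 1:            # walk from the leaf up to the root
            found.extend(self.nodes[node])
            node //= 2
        return found


tree = SegmentTree(8)
tree.insert(1, 5, "a")
tree.insert(3, 6, "b")
tree.insert(0, 0, "c")
```

The caller passes the range again on delete, which mirrors recomputing the range instead of storing it. Note that list.remove is linear in the length of a node's list, so this delete is only cheap when those lists are short.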
|
|
hipe_icode_bincomp:find_bs_get_integer/3 was quadratic for no good
reason. By observing that NewSuccs and Rest are always disjoint, we can
see that the worklist does not need to be a set. Furthermore, by
replacing the ordset Visited with a map, we reduce complexity to (a very
low) O(n lg n).
On cuter_binlib, this change reduced the time for hipe_icode_bincomp
from 60s to .25s. Using a gb_set for Visited gives .5s, and a sets:set
1s.
We apply the same optimisation to hipe_icode_range.
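The shape of the optimised traversal can be sketched like this. It is a generic worklist in Python, not the hipe_icode_bincomp code; the graph representation and function name are hypothetical. A Python dict is a hash table, so lookups here are expected constant time rather than the O(lg n) of an Erlang map.

```python
def reachable(succs, start):
    """Visit every node reachable from start exactly once."""
    visited = {start: True}   # keyed lookup instead of an ordered list
    worklist = [start]        # plain list: entries are never duplicated
    order = []
    while worklist:
        node = worklist.pop()
        order.append(node)
        # dict.fromkeys drops repeated successors while keeping order.
        new_succs = [s for s in dict.fromkeys(succs.get(node, ()))
                     if s not in visited]
        for s in new_succs:
            visited[s] = True
        # new_succs is disjoint from the worklist because every node is
        # marked visited at the moment it is pushed.
        worklist.extend(new_succs)
    return order


graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["a"]}
visit_order = reachable(graph, "a")
```

Marking a node as visited when it is pushed, rather than when it is popped, is what keeps the new successors disjoint from the rest of the worklist.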
|