
Faster general Trunc/Int functions


2005-03-11 07:34:36 AM
delphi85
Hi all,
The application that I am working on uses Int/Trunc extensively, and I have
read that both of these functions save, set, and then restore the FPU
Control Word around the computation. From what I have read, there is a bit
of overhead when this occurs.
So what I am thinking of doing is to set the FPU CW to the System.cwChop
value ($1F32) and calling custom Trunc/Int routines.
John Herbster mentioned that people in this newsgroup are a bunch of
helpful people (see thread "Setting the FPU Control Word on application
startup" in 'borland.public.delphi.non-technical', followed up in
'borland.public.delphi.language.delphi.general'), so I am seeking help.
I have looked into what Delphi (5Pro UP1) does and done some research
(Steve Williams from the threads mentioned), and it seems that the functions
should be something like:
function Trunc(v : Extended) : Int64;
asm
  SUB ESP,12 // Do I need this?
  FLD v
  FISTP qword ptr [ESP+4]
  POP ECX
  POP EAX
  POP EDX
end;

function Int(v : Extended) : Extended;
asm
  FLD v
  FRNDINT
  //FSTP ???
end;
Cheers,
Nick
 
 

Re:Faster general Trunc/Int functions

Hi
Have a look here for Trunc
dennishomepage.gugs-cats.dk/TruncChallenge.htm
And here for Trunc32
dennishomepage.gugs-cats.dk/Trunc32Challenge.htm
Regards
Dennis
 

Re:Faster general Trunc/Int functions

Hi
Then I need to know which precision you want?
Do you want to Trunc to Int64 or to Integer?
And which CPU do you use?
Regards
Dennis
 

Re:Faster general Trunc/Int functions

Quote
Have a look here for Trunc
All the FastCode RTL-compatible Trunc implementations
are going to be slower than a Control-Word aware implementation
as they have to do the FPU control word shuffling.
Eric
 

Re:Faster general Trunc/Int functions

Quote
SUB ESP,12 // Do I need this?
It moves the stack pointer so that integer values can be
popped back directly (the extended is placed in the stack,
and uses those 12 bytes, which are reused to place
the truncated value).
Quote
//FSTP ???
Something similar to the previous one should do it.
Note that if your values are > 0, then you can assume:
Trunc(x)=Round(x-0.5)
That might spare you from using Trunc (and thus
from altering the control word).
Eric
 

Re:Faster general Trunc/Int functions

Hi Eric
Quote
All the FastCode RTL-compatible Trunc implementations
Not SSE/SSE3 based functions.
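For illustration only, here is how a truncating conversion can avoid the control word on SSE2 hardware, sketched in C. I have not checked which instructions the FastCode entries actually use; the function name `trunc32_sse2` is hypothetical. The CVTTSD2SI instruction (exposed as the `_mm_cvttsd_si32` intrinsic) always truncates toward zero, whatever rounding mode is set:

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Truncating double -> 32-bit integer conversion that does not depend on,
   or modify, the current rounding mode. */
int32_t trunc32_sse2(double x) {
#if defined(__SSE2__)
    /* CVTTSD2SI: convert with truncation, ignoring the rounding-mode bits. */
    return _mm_cvttsd_si32(_mm_set_sd(x));
#else
    /* Portable fallback: a C cast from double to integer also truncates. */
    return (int32_t)x;
#endif
}
```

This covers only the 32-bit case and only values that fit in a 32-bit integer.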
Quote
are going to be slower than a Control-Word aware implementation
as they have to do the FPU control word shuffling.
I wanted to rip it off.
Regards
Dennis
 

Re:Faster general Trunc/Int functions

Hi
Quote
Note that if your values are > 0, then you can assume:
Trunc(x)=Round(x-0.5)
Might absolve you from the need of using Trunc (and thus
need of altering the control word).
And benefit from the fact that Round is inlined by compiler magic. A benefit
that a custom solution will not get.
Regards
Dennis
 

Re:Faster general Trunc/Int functions

Hi Eric,
Quote
Note that if your values are > 0, then you can assume:
Trunc(x)=Round(x-0.5)
That's not correct. For example:
Trunc(1) = 1
and Round(1 - 0.5) = 0
Whereas
Trunc(2) = 2
and Round(2 - 0.5) = 2
That's because Round uses banker's rounding (round to nearest, ties to even),
so an exact half does not always round down.
Regards,
Pierre
 

Re:Faster general Trunc/Int functions

Hi Dennis,
Quote
And benefit from the fact that Round is inlined by compiler magic. A benefit
that a custom solution will not get.
Round is not inlined. It calls System._Round. However, System._Round breaks
the parameter-passing rules in what is probably an attempt to speed things
up. Unfortunately it makes a complete hash of it by causing a bad
store-to-load forwarding stall in the process.
Regards,
Pierre
 

Re:Faster general Trunc/Int functions

Hi Pierre
Quote
Round is not inlined. It calls System._Round.
Correct. Just checked in D2005.
Regards
Dennis
 

Re:Faster general Trunc/Int functions

Hi Pierre
Quote
System._Round breaks
the parameter passing rules in what is probably an attempt to speed things
up... Unfortunately it makes a complete hash of it by causing a bad store to
load forward stall in the process.
But the RTL Round won in nearly all categories anyway.
How would you remove that stall?
Regards
Dennis
PS Vote for this report "Make BASM Functions Inlineable"
qc.borland.com/wc/qcmain.aspx
 

Re:Faster general Trunc/Int functions

Hi Dennis,
Quote
How would you remove that stall?
I would return the result on the stack, instead of loading the 64-bit FISTP
result into the two 32-bit halves EDX:EAX.
Unfortunately the FPU codegen of Delphi is still optimised for the 386/486.
Evidence of this can be seen in the many completely unnecessary FWAIT
instructions. The only reason you would want to use FWAIT on a "modern"
processor (80486DX onwards) is if you want to trap a possible FPU exception
at the exact address that it occurs, otherwise it just wastes CPU time.
Delphi still generates code compatible with the 80486SX with co-processor -
are there still any of those around?
It's no secret that it doesn't generate great FPU code at all for modern
processors. The code it generates is rife with these store-to-load forwarding
(STLF) stalls. The problem is: I think it will require a lot of work to fix
it... time that would be better spent giving us a 64-bit compiler :-).
Regards,
Pierre
 

Re:Faster general Trunc/Int functions

"Pierre le Riche" <XXXX@XXXXX.COM>writes
Quote
The only reason you would want to use FWAIT on a "modern" processor
(80486DX onwards) is if you want to trap a possible FPU exception at the
exact address that it occurs,
I think this is one of the main reasons they still include it.
 

Re:Faster general Trunc/Int functions

Hi
Quote
If you want to trap a possible FPU exception at the
exact address that it occurs,
I think this is one of the main reasons they still include it.
It is only needed after some instructions. Some instructions raise the flagged
exception and others do not.
The solution is to include FWAIT only after the instructions that need it.
Removing them all is a bad idea. The cost of FWAIT is very small. In
the complex number challenges it is less than our measurement error.
Regards
Dennis
 

Re:Faster general Trunc/Int functions

Quote
The cost of FWAIT is very very little. In the complex number
challenges it is less than our measurement error.
At least on Athlon it is treated like a NOP (execute latency = zero cycles),
so the only hit you take is in the instruction decoder and in wasted
memory bandwidth.
Eric