Faster MOVE() replacement

After I finally got it working, I wondered why on earth it took me so long
to figure out how to do the overlapping moves...

Below is a replacement for the TP/BP MOVE() routine. Two versions are given:
one is a replacement for the routine in the TP6 RTL (and likely TP/BP7), the
other is an assembler procedure to use if you don't have the RTL sources.

BTW, you may also want to have a look at

<http://developer.intel.com/drg/pentiumII/appnotes/813/813.htm>

where Intel itself gives timings for various moves and a C program to test
your own PC.

Here are the routines:

RTL replacement:

; *******************************************************
; *     Turbo Pascal Runtime Library Version 6.0        *
; *     Block Move Routine                              *
; *                                                     *
; *     Copyright (C) 1988,91 Borland International     *
; *     Copyright (C) 1999    Robert AH Prins           *
; *                                                     *
; *     Aligned blockmoves shamelessly adapted from     *
; *     Paul Hsieh's code on                            *
; *     <http://www.pobox.com/~qed/blockcopy.html>      *
; *******************************************************
             TITLE   MEMHMOV

             INCLUDE SE.ASM

CODE         SEGMENT BYTE PUBLIC

             ASSUME  CS:CODE

; Publics

             PUBLIC  MoveMem

; Move standard procedure

MoveMem      PROC    FAR
             MOV     BX, SP
             MOV     DX, DS            ; save TURBO-Pascal data segment
             LDS     SI, SS: [BX+10]   ; source
             LES     DI, SS: [BX+6]    ; destination
             xor     ecx, ecx
             MOV     CX, SS: [BX+4]    ; counter

; *******************************************************
; * 16-bit test, doesn't really take into account that  *
; * values of DS & ES may very well prevent overlap     *
; *******************************************************
             CMP     SI, DI
             JAE     no_overlap        ; source >= destination

overlap:     std
             add     si, cx            ; end of source + 1
             add     di, cx            ; end of destination + 1
             xor     eax, eax
             mov     ax, cx            ; eax = byte count
             sub     cx, di
             sub     cx, ax
             neg     cx
             and     cx, 3             ; cx = bytes down to a dword boundary

             dec     si                ; last byte of source
             dec     di                ; last byte of destination

             sub     eax, ecx          ; eax = bytes left after alignment
             jle     ler               ; block too small to align

             rep     movsb

             sub     si, 3
             sub     di, 3             ; align on dword

             mov     cx, ax
             and     ax, 3
             shr     cx, 2
             rep     movsd

             add     si, 3             ; add for last movsb
             add     di, 3

ler:
             add     cx, ax
             rep     movsb
             jmp     short end_move

no_overlap:  cld
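             mov    eax, ecx           ; eax = byte count
             sub    cx, di
             sub    cx, ax
             and    cx, 3              ; cx = bytes up to a dword boundary
             sub    eax, ecx           ; eax = bytes left after alignment
             jle    LEndBytes          ; block too small to align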

             rep    movsb
             mov    cx, ax
             and    ax, 3
             shr    cx, 2
             rep    movsd

LEndBytes:
             add    cx, ax
             rep    movsb

end_move:    MOV    DS, DX             ; restore TURBO-Pascal DS
             cld                       ; restore autoincrement!!!
             RET    10                 ; pop parameters and return
MoveMem      ENDP

             ALIGN  4

CODE         ENDS

             END

Pascal assembler procedure:

procedure move(var source, dest; size: word); assembler;
asm
  mov   dx, ds
  segss
  lds   si, source
  segss
  les   di, dest
  db $66; xor   cx, cx
  segss
  mov   cx, size
  cmp   si, di
  jae   @no_over

  std
  add   si, cx
  add   di, cx
  db $66; xor   ax, ax
  mov   ax, cx
  sub   cx, di
  sub   cx, ax
  neg   cx
  and   cx, 3

  dec   si
  dec   di

  db $66;  sub   ax, cx
  jle   @leo

  rep   movsb

  sub   si, 3
  sub   di, 3

  mov   cx, ax
  and   ax, 3
  shr   cx, 2
  rep;  db $66; movsw

  add   si, 3
  add   di, 3

@leo:
  add   cx, ax
  rep   movsb
  jmp   @end_move

@no_over:
  cld                       { ensure autoincrement, as in the RTL version }
  db $66; xor   ax, ax
  mov   ax, cx
  sub   cx, di
  sub   cx, ax
  and   cx, 3
  db $66; sub   ax, cx
  jle   @len

  rep   movsb
  mov   cx, ax
  and   ax, 3
  shr   cx, 2
  rep;  db $66; movsw

@len:
  add   cx, ax
  rep   movsb

@end_move:
  mov    ds, dx
  cld                       { autoincrement!!! }
end; {move}

Please note that at least a 386 is required.
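
For readers who find the assembler hard to follow, here is a minimal C sketch
of the same strategy, written as a reference model only. It is not the code
above translated line by line: the name move_model is hypothetical, the dword
step is modelled with four byte assignments instead of movsd, and flat
pointers replace the segment:offset pairs. It picks the copy direction from
the address order, moves single bytes until the destination is dword-aligned,
moves dwords, and finishes with the remaining bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Reference model (hypothetical name) of the overlap-safe, dword-aligned move.
   Parameter order follows Pascal's Move(source, dest, count). */
void move_model(const void *source, void *dest, size_t count)
{
    const unsigned char *s = source;
    unsigned char *d = dest;

    if ((uintptr_t)s >= (uintptr_t)d) {
        /* Ascending copy: safe when the source is at or above the destination. */
        size_t head = (size_t)(-(uintptr_t)d & 3u);   /* bytes up to a dword boundary */
        if (head > count) head = count;               /* block too small to align */
        count -= head;
        while (head--) *d++ = *s++;
        for (; count >= 4; count -= 4, s += 4, d += 4) {
            /* stand-in for one movsd with the direction flag clear */
            d[0] = s[0]; d[1] = s[1]; d[2] = s[2]; d[3] = s[3];
        }
        while (count--) *d++ = *s++;                  /* 0..3 tail bytes */
    } else {
        /* Descending copy: start one past the end of both blocks. */
        size_t head;
        s += count;
        d += count;
        head = (size_t)((uintptr_t)d & 3u);           /* bytes down to a dword boundary */
        if (head > count) head = count;
        count -= head;
        while (head--) *--d = *--s;
        for (; count >= 4; count -= 4) {
            /* stand-in for one movsd with the direction flag set */
            s -= 4; d -= 4;
            d[3] = s[3]; d[2] = s[2]; d[1] = s[1]; d[0] = s[0];
        }
        while (count--) *--d = *--s;                  /* 0..3 tail bytes */
    }
}
```

Calling move_model(buf, buf + 5, 40) on a single buffer takes the descending
path, which is the overlapping case that the std branch of the assembler
handles.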

Robert
--
Robert AH Prins
prin...@williscorroon.com

-----------== Posted via Deja News, The Discussion Network ==----------
http://www.dejanews.com/       Search, Read, Discuss, or Start Your Own    

 

Re:Faster MOVE() replacement


Quote
Robert AH Prins wrote:

...
Quote
> Below is a replacement for the TP/BP MOVE() routine. Two versions are given,
> one a replacement for the routine in the TP6 RTL (and likely the TP/BP7) the
> other as an assmbler procedure to be used if you don't have the RTL sources:

[ good code snipped ]

Robert! It really seems to me that you are spending a lot of effort on
improving Pascal's 16-bit (isn't it?) MOVE. But why always just in
32-bit code?
I compared your optimized MOVE with my primitive MMX MOVE, and the
results for 64000 bytes were 1.1 ms against 0.6 ms! That is 56 MB/s
against 100 MB/s.
WHY do none of YOU, who HAVE the ability to write a professional MOVE,
use these extended features? Because you think there are still CPUs
without MMX? We can support both!
And what about using 80-bit FP instructions to move memory? I once
heard that this would be possible, too, and could be even faster than
these MMX instructions with only 64 bits!
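
As a quick check of the arithmetic behind those throughput figures, here is a
small C sketch. The function name mib_per_second is hypothetical, and it
assumes that "MB" in the post means 2^20 bytes, which is the reading that
best matches the rounded numbers.

```c
/* Throughput in MiB/s (1 MiB = 1048576 bytes) for a block of `bytes`
   bytes moved in `millis` milliseconds. Hypothetical helper. */
double mib_per_second(double bytes, double millis)
{
    double seconds = millis / 1000.0;
    return bytes / seconds / (1024.0 * 1024.0);
}
```

With 64000 bytes this gives roughly 55.5 for 1.1 ms and roughly 101.7 for
0.6 ms, in line with the rounded 56 and 100 quoted in the post. With
1 MB = 10^6 bytes the figures would be about 58.2 and 106.7 instead.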

I would really like to write such MOVEs, but I'm afraid I cannot manage
all this alignment business and so on... But if anybody is
interested in cooperating to develop an optimized 64- or 80-bit MOVE,
I'd join!

regards
--
Arno Fehm (af...@bigfoot.de)

------------------------------------------------------------------------
Member of Grey Dreams Design: visit http://GreyDreams.home.pages.de !!!!
He who can destroy a thing has the real control over it. (Frank Herbert)
------------------------------------------------------------------------

Re:Faster MOVE() replacement


Quote
> I compared your optimized MOVE with my primitive MMX MOVE, and the
> results for 64000 Bytes were 1.1msec against 0.6 msec! That is 56 MB/sec
> against 100 MB/sec.
> WHY does none of YOU, who HAVE the ability to write a professional MOVE
> use these extended features? Because you think, there are still these
> without MMX? We can support both!
> And what about using 80 bit FP instructions to move memory? I once
> heared, this would be possible, too; and could be even more faster than
> these MMX instructions with only 64 bit!

Hehe, you are REALLY funny!
Have you EVER tested such an MMX move on other CPUs? I can only
guess not. There you'll see that MMX moves can even be slower.
Don't spread unusable knowledge.

Bye,
Stefan
---
please remove the P in my email address to answer me
take a look @ my homepage: http://sourcenet.home.pages.de/

Re:Faster MOVE() replacement


In article <372DD2D2.5DF86...@bigfoot.de>,
  Arno Fehm <af...@bigfoot.de> wrote:

Quote
> Robert AH Prins wrote:
> ...
> > Below is a replacement for the TP/BP MOVE() routine. Two versions are given,
> > one a replacement for the routine in the TP6 RTL (and likely the TP/BP7) the
> > other as an assmbler procedure to be used if you don't have the RTL sources:
> [ good code snipped ]

> Robert! It really seems to me, you are spending lot of effort in
> improving the 16 Bit (isn't it?) MOVE of PASCAL. But why always just in
> 32 bit code?

I've got my reasons for still using TP6...

Quote
> I compared your optimized MOVE with my primitive MMX MOVE, and the
> results for 64000 Bytes were 1.1msec against 0.6 msec! That is 56 MB/sec
> against 100 MB/sec.
> WHY does none of YOU, who HAVE the ability to write a professional MOVE
> use these extended features? Because you think, there are still these
> without MMX? We can support both!

Have you had a look at the Intel page,

<http://developer.intel.com/drg/pentiumII/appnotes/813/813.htm>

downloaded the program and run it? You will see that the advantage of MMX is
virtually nonexistent - in some cases it's worse!

Part of the text on that page is:

<quote> On systems based on the Pentium II processor and with memory
alignment on 8-byte boundaries (memory addresses evenly divisible by 8),
including alignment on 32-byte boundaries, special "fast string" microcode
will be invoked after:

rep movsb = 64 bytes (8 bytes)
rep movsw = 128 bytes (8 words)
rep movsd = 256 bytes (8 dwords)

have been transferred. The range of speedup (0-30%) of "rep movs" over moving
qwords with "movq" using MMX technology is because of evictions from the L1
and L2 caches. There may be negligible speedup using "rep movs" over "movq" if
the load into the register and the store to main memory both cause cache line
evictions. A noticeable speedup will occur if neither the load nor the store
cause any evictions.
</quote>

I could adapt my code very easily to make the destination 8-byte aligned, but
given the above sizes for the invocation of fast microcode and the size of
data structures moved, I'm not sure if I would gain anything.
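
To illustrate what moving from 4-byte to 8-byte destination alignment would
involve, here is a small C sketch of the head-byte calculation that the
"and cx, 3" steps in the routines perform, generalized to any power-of-two
boundary. Both function names are hypothetical, and flat addresses stand in
for the 16-bit offset in DI.

```c
#include <stddef.h>
#include <stdint.h>

/* Single bytes to copy before an ascending copy reaches an `align`-byte
   boundary. `align` must be a power of two (4 in the routines above). */
size_t head_bytes_forward(uintptr_t dest, size_t align)
{
    return (size_t)(-dest & (uintptr_t)(align - 1));
}

/* Single bytes to copy before a descending copy reaches a boundary.
   `dest_end` is the address one past the last destination byte. */
size_t head_bytes_backward(uintptr_t dest_end, size_t align)
{
    return (size_t)(dest_end & (uintptr_t)(align - 1));
}
```

Under these assumptions, switching the routines to 8-byte alignment would
mean changing the mask from 3 to 7 in both paths and adjusting the backward
pointer fix-ups (the "sub si, 3" and "add si, 3" pairs) to match the width
of the wide move.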

Here are the results; forgive any typos, as the Intel program doesn't
provide them as text:

Copy & Fill 4194304 bytes
MB/s: cold cache / warm cache
-----------------------------
Source & Destination Aligned
bytes (rep movsb)               69.9/104.9
words (rep movsw)              139.8/104.9
dwords (rep movsd)             139.8/139.8
qwords (FP registers)           83.9/139.8
qwords (MMX (tm) Technology)   139.8/104.9
Fill dwords (rep stosd)        209.7
Fill qwords (rep movq)         139.8

Source Unaligned, Destination Aligned
bytes (rep movsb)               83.9/104.9
words (rep movsw)               83.9/104.9
dwords (rep movsd)              83.9/135.3
qwords (FP registers)           83.9/104.9
qwords (MMX (tm) Technology)    83.9/139.8
Fill dwords (rep stosd)        209.7
Fill qwords (rep movq)         139.8

Source Aligned, Destination Unaligned
bytes (rep movsb)               35.0/ 10.5
words (rep movsw)               59.9/ 52.4
dwords (rep movsd)             104.9/ 83.9
qwords (FP registers)          139.8/ 83.9
qwords (MMX (tm) Technology)   139.8/104.9
Fill dwords (rep stosd)        139.8
Fill qwords (rep movq)         209.7

PC: 350 MHz PII 64 MB NT 4 SP3

Quote
> And what about using 80 bit FP instructions to move memory? I once
> heared, this would be possible, too; and could be even more faster than
> these MMX instructions with only 64 bit!

Yes, it is possible, but unless Intel has changed something in the FILD/FIST
instructions it's useless, as the pair cannot handle a number of bit
patterns.

Quote
> I really would like do write such MOVEs, but I'm afraid, I cannot manage
> all these aligning things and so on... But if there would be anybody
> interested in a cooperation to develop an optimized 64 or 80 bit MOVE,
> I'd join!

You've got all the aligning you want in my code; it shouldn't be too hard to
change it to 8- or 16-byte alignment instead of the 4-byte alignment I use...

Regards,

Robert
--
Robert AH Prins
prin...@williscorroon.com


Re:Faster MOVE() replacement


First: my apologies for spreading 'unusable knowledge'. But it was my
firm belief, backed by measurements, that MMX moves are faster.

I'm working on a project that makes extensive use of MMX instructions,
including MOVQ. Switching off MMX causes the frame rate to drop (e.g.
from 60 to 45), on my CPU as well as on others (like the PII). I always
thought this was mainly because of the moves, and measurements on
my CPU (P55C) agreed. Heavily. Sorry again!

Quote
Robert AH Prins wrote:
> > I compared your optimized MOVE with my primitive MMX MOVE, and the
> > results for 64000 Bytes were 1.1msec against 0.6 msec! That is 56 MB/sec
> > against 100 MB/sec.
> > WHY does none of YOU, who HAVE the ability to write a professional MOVE
> > use these extended features? Because you think, there are still these
> > without MMX? We can support both!

> Have you had a look at the Intel page,

> <http://developer.intel.com/drg/pentiumII/appnotes/813/813.htm>

Thank you for citing this old clpb thread. It was 'before my time', so
I did not know this had been discussed before.
And this is really the first time I have heard that MOVQ might even be
SLOWER...
But as far as I can read your figures below, in this case MOVQ is never
slower at moving blocks. Only sometimes equal!
...But I see the Intel page tells a different story...

Quote
> downloaded the program and run it? You will see that the advantage of MMX is
> virtually non-existant - in some cases it's worse!

May I correct you? Forgive me, but the advantage DOES exist!! On my CPU it
does! 100% sure, (nearly) 100% faster.

Quote
> > And what about using 80 bit FP instructions to move memory? I once
> > heared, this would be possible, too; and could be even more faster than
> > these MMX instructions with only 64 bit!
> Yes it is, but unless Intel has changed something in the FILD/fist
> instructions it's useless, as the pair cannot handle a number of bit
> patterns.

Ok.. let's forget about this.

Quote
> > I really would like do write such MOVEs, but I'm afraid, I cannot manage
> > all these aligning things and so on... But if there would be anybody
> > interested in a cooperation to develop an optimized 64 or 80 bit MOVE,
> > I'd join!
> You've got all the aligning you want in my code, it shouldn't be too hard to
> change it into 8/16 byte alignements in stead of the 4-bytes I use...

Well, I really was going to do so. But all this has discouraged me a
bit. If MOVQ shows no effect on 80% of all CPUs, why make the effort?

Sorry again for (indirectly) accusing you of being old-fashioned; I just
hoped I could make some improvements!

regards
--
Arno Fehm (af...@bigfoot.de)


Re:Faster MOVE() replacement


Quote
> First: My apologizes for spreading 'unuseable knowledge'. But it was my
> real and proved faith that MMX moves are faster.
> I'm working on a project, which has extensive use of MMX instruction.
> Also of MOVQ. Switching of the MMX causes the frame rates to drop (e.g.
> from 60 to 45); on my CPU as well as on others (like PII). I always
> thought, this was mainly was because of the moves. And measurements on
> my CPU (P55C) agreed. Heavily. Sorry again!

If it helps you: moving data to the screen via MMX is in most cases a lot
faster, especially on AGP cards; normal PCI cards also gain about
10%.
I am writing a graphics unit (GrafX2) that tries to take the most advantage
of nearly every situation. If you're interested in my experiences,
contact me.

Quote
> Thank you for citing this old clpb-thread. It was 'before my times', so
> I did not now of this being discussed before.
> And this is the really first time, I hear that MOVQ might even be
> SLOWER...
> But as far as I can read your figures below, for this case MOVQ is never
> slower in moving blocks. Only sometimes equal!
> ..But I see, the Intel page tells different truth...

On Intel's CPUs, MMX moves within normal
memory are faster. But on the normal P55C (AFAIK) and on non-Intel CPUs
like the one I have (K6-2 315), the caching for the FPU part is not always
efficient.

Quote
> > downloaded the program and run it? You will see that the advantage of MMX is
> > virtually non-existant - in some cases it's worse!
> May I correct? Forgive me, but the advantage IS existant!! On my CPU it
> is! 100% sure, 100% faster (nearly).

Well, my guess is that you are moving the image data to the screen. That
is, of course, faster with MMX.

Bye,
Stefan
