
Re: Anything faster then memcpy ?


2007-12-19 10:38:54 PM
cppbuilder55
Hi Bob
Bob Gonder says:
Quote

>How much is larger blocks ?

I don't know these things, but would guess somewhere
around 64 bytes you break even, and start to rock at 1k.
More than 64 bytes will be the case 99% of the
time, so I had better look into this.
Quote
The folks who know these things hang out in
borland.public.delphi.language.basm
I'll ask there then.
Thank You very much for explaining.
Kind regards
Asger
 
 

Re:Re: Anything faster then memcpy ?

Asger Joergensen wrote:
Quote
>/* duplicate line 0 to rest of bmp (vertical stripes) */
>memcpy( Bitmap->Scanline[1], Bitmap->Scanline[0],
>(Bitmap->Height-1)*(((Bitmap->Width+7)/8)*3) );

Is this possible / safe with memcpy ?
It does what it says it does:
Copy 0 to 1, then 1 to 2.
Note that 1 looks like 0 by the time it is copied.
So the effect is to copy 0 to 2, and 3, and.....
Quote
If src and dest overlap, the behavior of memcpy is undefined.
That means no checking is performed.
There are slower versions that allow for overlapped true copies.
There are corner cases where memcpy might not do what the code
expects, such as when length of line 0 is less than the move-unit
size of the function (If it moves 128 bits at a time, and the
line is 64 bits, then it might copy line 1 into the last line.)
Trying to visualize this:
memcpy( 2, 1, 8 ) via 4 digit copy
1234 5678 xxxx
memcpy4digit( 2, 1 ) copy 1234 to location 2
1123 4678 xxxx
memcpy4digit( 2+4, 1+4 ) copy 4678 to location 6
1123 4467 8xxx
via 2 digit copy
1234 5678 xxxx
memcpy2digit( 2, 1 )
1124 5678 xxxx
memcpy2digit( 2+2, 1+2 )
1122 4678 xxxx
memcpy2digit( 2+4, 1+4 )
1122 4468 xxxx
memcpy2digit( 2+6, 1+6 )
1122 4466 8xxx
Substitute "scanline" for "digit".
Point is if the copy reads more than one unit (scanline)
per cycle, it can have unintended consequences.
Since memcpy uses 32-bit copies, and the faster SSE
versions use 64- or 128-bit (maybe 256-bit someday) copies,
it will do what we expect down to a scanline length
of 11 24-bit pixels (33 bytes, just over the 32 bytes of a 256-bit copy).
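
The digit walk-through above can be reproduced with a small simulation. This is only a sketch: `chunked_forward_copy` is a hypothetical model of a routine that copies front to back in fixed-size units, not how any particular memcpy is implemented, and calling the real memcpy on overlapping ranges remains undefined behavior. Offsets are 0-based here, whereas the post counts positions from 1.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical model of a forward-copying routine: it reads `unit` characters
// into a temporary, writes them to the destination, and repeats until `count`
// characters have been copied. Works on a copy of `buf` and returns the result.
std::string chunked_forward_copy(std::string buf, std::size_t dst,
                                 std::size_t src, std::size_t count,
                                 std::size_t unit)
{
    assert(unit > 0);
    assert(dst + count <= buf.size() && src + count <= buf.size());
    for (std::size_t done = 0; done < count; done += unit) {
        const std::size_t n = std::min(unit, count - done);
        const std::string tmp = buf.substr(src + done, n); // read the whole unit first
        buf.replace(dst + done, n, tmp);                   // then write it back
    }
    return buf;
}
```

With the buffer `12345678xxxx`, copying 8 characters from offset 0 to offset 1 gives `112344678xxx` for a unit of 4, `112244668xxx` for a unit of 2, and `111111111xxx` for a unit of 1. Only the last case, where the unit is no larger than the distance between source and destination, produces the repeated pattern.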
 

Re:Re: Anything faster then memcpy ?

Bob Gonder < XXXX@XXXXX.COM >wrote:
Quote
Asger Joergensen wrote:

>>/* duplicate line 0 to rest of bmp (vertical stripes) */
>>memcpy( Bitmap->Scanline[1], Bitmap->Scanline[0],
>>(Bitmap->Height-1)*(((Bitmap->Width+7)/8)*3) );
>
>Is this possible / safe with memcpy ?

It does what it says it does:
Copy 0 to 1, then 1 to 2.
Note that 1 looks like 0 by the time it is copied.
So the effect is to copy 0 to 2, and 3, and.....
Or possibly not. You make the assumption that memcpy() works from the
start to the end of its range. That is an implementation detail. It's
also perfectly possible that the underlying code does exactly the
opposite. It's quite possible that on some architectures code that
decrements an index, and stops when the index passes zero, is more
efficient.
The specification of memcpy() is such that copying the elements in
random order would also be acceptable, though such an implementation
would be at least mildly perverse.
Quote
>If src and dest overlap, the behavior of memcpy is undefined.

That means no checking is performed.
It means it's undefined. If the writers of the C standard had thought
that your description was reasonable, it would have been included. It
wasn't. That means that your assumption is only that - an assumption. So
long as you stay on the same library, you'll *probably* be OK.
Quote
There are slower versions that allow for overlapped true copies.
Indeed - they check to see if the source and destination do overlap.
Since there are only four possible layouts, it's not hugely difficult to
work out scenarios:
a) No overlap - copy any way desired
b) Total overlap - don't copy at all (it's odd how self-assignment pops
up)
c) src>dst ... use incrementing index
d) src < dst ... use decrementing index
Again, if there are strong processor reasons to go for something more
complicated, then the library writers can, so long as the result matches
what the standard prescribes.
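
The four cases can be sketched in a few lines. This is an illustration of the idea, not the source of any real library; the name `overlap_safe_copy` is made up, and in practice the standard `memmove` already provides this guarantee. Case (a) needs no branch of its own, since either direction is correct when nothing overlaps.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Illustrative byte-wise copy that stays correct when the ranges overlap.
// Real code should simply call std::memmove.
void* overlap_safe_copy(void* dst, const void* src, std::size_t n)
{
    assert(n == 0 || (dst != nullptr && src != nullptr));
    unsigned char* d = static_cast<unsigned char*>(dst);
    const unsigned char* s = static_cast<const unsigned char*>(src);

    if (d == s || n == 0)
        return dst;                       // (b) total overlap: nothing to do

    // std::less gives a well-defined ordering even for unrelated pointers.
    if (std::less<const unsigned char*>()(d, s)) {
        for (std::size_t i = 0; i < n; ++i)
            d[i] = s[i];                  // (c) src > dst: incrementing index
    } else {
        for (std::size_t i = n; i-- > 0;)
            d[i] = s[i];                  // (d) src < dst: decrementing index
    }
    return dst;
}
```

Copying 4 bytes of `abcdefgh` from offset 0 to offset 2 this way yields `ababcdgh`, which is the "true copy" result rather than the smeared pattern a naive forward loop would give.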
Alan Bellingham
--
Team Browns
ACCU Conference 2008: 2-5 April 2008 - Oxford, UK
 


Re:Re: Anything faster then memcpy ?

Remy Lebeau (TeamB) wrote:
Quote
No, they are not. The size of each element of a scanline depends on the
pixel depth of the image. For Asger's original code to work, he would have
Windows DIBs are *always* 4 byte aligned, regardless of color format
and/or compression.
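
For reference, the 4-byte rule translates into the usual stride calculation for an uncompressed DIB. A minimal sketch; the helper name `dib_stride` is an assumption, not a Windows API:

```cpp
#include <cstddef>

// Bytes occupied by one scanline of an uncompressed DIB: the number of bits
// in the row, rounded up to a whole number of 32-bit words.
constexpr std::size_t dib_stride(std::size_t widthPixels, std::size_t bitsPerPixel)
{
    return ((widthPixels * bitsPerPixel + 31) / 32) * 4;
}
```

A 24-bit line of 101 pixels holds 303 bytes of pixel data but occupies 304 bytes, so multiplying the width by 3 only gives the line length when that product is already a multiple of 4.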
 

Re:Re: Anything faster then memcpy ?

Alan Bellingham wrote:
Quote
Bob Gonder wrote:
>Note that 1 looks like 0 by the time it is copied.
>So the effect is to copy 0 to 2, and 3, and.....

Or possibly not. You make the assumption that memcpy() works from the
start to the end of its range. That is an implementation detail.
Forgot about that one, but I know my current library, and I believe the
Fastcode as well, work in the "forward" direction (OP should view source
if not sure, and probably comment the UB for future library changes).
Quote
The specification of memcpy() is such that copying the elements in
random order would also be acceptable, though such an implementation
would be at least mildly perverse.
Not perverse...artistic!
Get out your bit-goggles.
Fire up your time dilator.
Watch the pretty patterns.
(Don't you miss the old mainframes with
their bits dancing in their bulbs?)
Quote
>>If src and dest overlap, the behavior of memcpy is undefined.
>
>That means no checking is performed.

It means it's undefined.
Which includes the possibility that it works as intended.
It is incumbent upon the programmer to ensure he knows how
his library works when he goes about using UB tricks.
And memcpy is a UB waiting to happen as it has no checks.
(I think if it had checks, it would no longer be UB?)
Quote
If the writers of the C standard had thought
that your description was reasonable, it would have been included.
If they thought that my description was reasonable,
they should be taken out and shot.
My descriptions are from working knowledge,
and often are not StandSpec.
If some idiot provider breaks my UB, I'll just have
to write my own version.
Quote
That means that your assumption is only that - an assumption. So
long as you stay on the same library, you'll *probably* be OK.
Yep, I assume I'll be using the same vendor, and they won't{*word*222}me.
 

Re:Re: Anything faster then memcpy ?

"Asger Joergensen" < XXXX@XXXXX.COM >wrote in message
Quote
If I have a bottleneck it is memcpy
You can't know that without profiling. A lot of people don't use profiling
initially, assuming one thing is a bottleneck when it is really something
else that profiling shows them when they finally try it.
The TBitmap::Scanline property is more likely to be the real bottleneck. It
is not a simple operation internally. Every time you access the property,
the underlying image data is freed and regenerated, the Row parameter
validated, and the number of bytes per scanline recalculated. Multiply all
of that by the height of the bitmap, and that is a lot of work being done.
Gambit
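
One way to act on this is to read the row pointer once and reach the other rows by pointer arithmetic. The sketch below is plain C++ with no VCL types; `replicate_first_row` and its parameters are hypothetical, and it assumes the caller obtained `row0` from a single ScanLine access and knows the signed byte distance between consecutive rows.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Copies the first row into each of the following rows.
// `stride` is the signed byte distance from one row to the next (negative
// for a bottom-up bitmap); `rowBytes` is the number of bytes to copy per row.
void replicate_first_row(unsigned char* row0, std::ptrdiff_t stride,
                         int rows, std::size_t rowBytes)
{
    // Rows must not overlap each other, so every memcpy stays well defined.
    assert(static_cast<std::size_t>(stride < 0 ? -stride : stride) >= rowBytes);
    unsigned char* dst = row0;
    for (int y = 1; y < rows; ++y) {
        dst += stride;                    // plain pointer step, no property call
        std::memcpy(dst, row0, rowBytes); // source is always the first row
    }
}
```

Because every copy reads from the first row and writes to a different row, source and destination never overlap and the result does not depend on the direction or unit size memcpy uses internally.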
 

Re:Re: Anything faster then memcpy ?

Hi Remy Lebeau (TeamB)
Remy Lebeau (TeamB) says:
Quote

"Asger Joergensen" < XXXX@XXXXX.COM >wrote in message
news: XXXX@XXXXX.COM ...

>If I have a bottleneck it is memcpy

You can't know that without profiling. A lot of people don't use profiling
initially, assuming one thing is a bottleneck when it is really something
else that profiling shows them when they finally try it.
And how do I do profiling ?
Quote
The TBitmap::Scanline property is more likely to be the real bottleneck. It
is not a simple operation internally. Every time you access the property,
the underlying image data is freed and regenerated, the Row parameter
validated, and the number of bytes per scanline recalculated. Multiply all
of that by the height of the bitmap, and that is a lot of work being done.
Isn't that only when there are multiple references to the Bitmap
or if the Bitmap isn't a DIB? (Mine are DIBs.)
But checking the source showed me that even if the image data
isn't freed, a lot of calculation is still done per line.
So I came up with the code below, which gave me 10% extra speed
on long lines (1000 pixels) and 50% on short lines (100 pixels). So once
again You were right. ~ Dam.&%¤#
I am a little worried, though, that I had to do [-(y*LineW)]. I guess
the default TBitmap is Bottom-Up.
How do I check which direction the bitmap has? TBitmapImage in
the TBitmap is private. I found the WinAPI BITMAPINFO in the help,
but I can't find the function to get it.
Thanks for Your help
Kind regards
Asger
static void __fastcall GradientFillRectH(TBmp *Bmp, TRect &Rct, TAjColor C1,
                                         TAjColor C2)
{
    int Bottom = Rct.Bottom;
    int W = Rct.Width();
    int BW = W*3;           // bytes of pixel data per line (24-bit pixels)
    int LineW = (BW+3)&~3;  // rounded up to 4 bytes; this equals the bitmap's
                            // line length only if Rct spans the full width
    int BlitStart = Rct.Top + MEMCOPY_H;
    TScanColor *ScanLineStart = static_cast<TScanColor*>(Bmp->ScanLine[Rct.Top]);
    TScanColor *ScanLine = &ScanLineStart[Rct.Left];
    GradientArray(ScanLine, W, C1, C2); // calculate the colors of the first line
    LPBYTE ByteLine = (LPBYTE)ScanLine;
    for(int y = Rct.Top+1; y < Bottom && y < BlitStart; ++y)
    {
        // Bottom-up bitmap: later lines lie at lower addresses. The offset is
        // counted from the first line (Rct.Top), not from line 0.
        LPBYTE NewByteLine = &ByteLine[-((y - Rct.Top)*LineW)];
        memcpy(NewByteLine, ByteLine, BW);
    }
    if(BlitStart < Bottom) // using BitBlt if more than 32 lines
        FillHorizontal(Bmp->Canvas->Handle, Rct.Left, Rct.Top, W, Rct.Bottom);
}
//-------------------------------------------------------------------------
 

Re:Re: Anything faster then memcpy ?

"Asger Joergensen" < XXXX@XXXXX.COM >wrote in message
Quote
And how do I do profiling ?
Use a third-party profiler, such as AQTime. The main purpose of a profiler
is to show you the actual time that elapses for each line of your code at
runtime. A profiler can analyze and calculate other metrics of your code as
well, such as the number of times a function is called and by whom, etc.
Quote
I am a little worried, though, that I had to do [-(y*LineW)].
I guess the default TBitmap is Bottom-Up.
That is controlled by the underlying HBITMAP when it is created.
Quote
How do I check which direction the bitmap has?
<snip>
I found the WinAPI BITMAPINFO in the help, but I can't find the function to
get it.
GetObject()
Gambit
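
For a DIB section, GetObject() can fill a DIBSECTION structure, and the sign of its dsBmih.biHeight field tells the direction (positive means bottom-up). A check that needs no Windows call is to compare the addresses of two adjacent rows, sketched below in plain C++. The function names are made up, and the assumption is that the two pointers come from ScanLine[0] and ScanLine[1] of the same bitmap.

```cpp
#include <cassert>
#include <cstddef>

// Signed byte distance from row 0 to row 1. Both pointers must point into
// the same pixel buffer (hypothetically ScanLine[0] and ScanLine[1]).
std::ptrdiff_t signed_stride(const void* row0, const void* row1)
{
    assert(row0 != nullptr && row1 != nullptr);
    return static_cast<const unsigned char*>(row1) -
           static_cast<const unsigned char*>(row0);
}

// Row 0 is the top line on screen. If it sits at a higher address than
// row 1, the buffer stores the bottom line first: the bitmap is bottom-up.
bool is_bottom_up(const void* row0, const void* row1)
{
    return signed_stride(row0, row1) < 0;
}
```

The value returned by `signed_stride` already carries the sign, so adding it to a row pointer steps to the next line in either layout without a special case for bottom-up bitmaps.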
 

Re:Re: Anything faster then memcpy ?

Hi Remy
Remy Lebeau (TeamB) says:
Quote

"Asger Joergensen" < XXXX@XXXXX.COM >wrote in message
news: XXXX@XXXXX.COM ...

>And how do I do profiling ?

Use a third-party profiler, such as AQTime.
Thanks Remy, I'll look into that, but there must be something
less expensive; it costs more than the builder. :(
Thanks for Your help.
Kind regards
Asger
 

Re:Re: Anything faster then memcpy ?

"Asger Joergensen" < XXXX@XXXXX.COM >wrote in message
Quote
>>And how do I do profiling ?
>
>Use a third-party profiler, such as AQTime.

Thanks Remy, I'll look into that, but there must be something
less expensive; it costs more than the builder. :(
Profiling tools are not simple products that you can just throw together and
give away. Even Borland gave up maintaining their own TurboProfiler tool
many years ago. If you want a decent profiling tool, you are going to have
to pay for it. There may be free or shareware tools out there, but I would
be doubtful of their capability. AQTime is a very good profiling tool. A
little spendy, yes; but worth the cost if you need it.
- Dennis
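
Before buying a profiler, a rough first measurement can be made by timing the suspect code by hand. It is no substitute for a real profiler, since it only measures what you already suspect. The sketch assumes a C++11 compiler for `<chrono>`; `average_microseconds` is a hypothetical helper name.

```cpp
#include <cassert>
#include <chrono>

// Runs `work` `repeats` times and returns the average wall-clock time
// per call in microseconds.
template <typename Work>
double average_microseconds(Work work, int repeats)
{
    assert(repeats > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repeats; ++i)
        work();
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::micro> total = stop - start;
    return total.count() / repeats;
}
```

Timing the fill once with a ScanLine access per row and once with a cached row pointer, each over a few thousand repeats, would show which of the two costs dominates.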
 

Re:Re: Anything faster then memcpy ?

Asger Joergensen wrote:
Quote
Is there something faster than memcpy ?
Can something be done in Assembler ?
Have you looked at the Intel IPP libraries?
www.intel.com/cd/software/products/asmo-na/eng/perflib/ipp/302910.htm
Jon
 

Re:Re: Anything faster then memcpy ?

Hi Jonathan
Jonathan Benedicto says:
Quote
Asger Joergensen wrote:
>Is there something faster than memcpy ?
>Can something be done in Assembler ?

Have you looked at the Intel IPP libraries?

www.intel.com/cd/software/products/asmo-na/eng/perflib/ipp/302910.htm
Thanks for Your reply, but I'm mostly testing and learning,
hopefully something will come out of it though..;-)
So I think IPP is a little too pricey for my project.
Kind regards
Asger