Unexplained benchmark results (not FastCode)


2006-01-12 04:08:18 AM
delphi247
I'm testing two hash functions (MD5 and Whirlpool) on two computers. The first (1) is a P4 2.4 on a Gigabyte motherboard with 800 MHz dual-channel memory, running W2K Pro. The second (2) is a Dell with a P4 Celeron 2.4 and 400 MHz (not sure) memory, running XP. The tests are run using the same compiled executable (not from the IDE). Both functions are made to hash a string consisting of 10^7 spaces. The Whirlpool implementation uses MMX (namely MOVQ and PXOR). I use RDTSC to count clocks. I get the following results:
On 1: MD5 - 24 clocks per byte, Whirlpool 155 clocks per byte
On 2: MD5 - 17 clocks per byte, Whirlpool 97 clocks per byte
Why is it so much faster on the Dell?
Sisoft Sandra reports both CPUs to be almost identical, with the exception of L2 cache size (1: 512 KB, 2: 128 KB) and Revision/Stepping (1: 2/9(9), 2: 2/9(A)). Estimated performance rating is 1: PR5304, 2: PR2632.
Other Sandra tests:
CPU Arithmetic:
1) Dhrystone 5287, Whetstone FPU/SSE2 1857/4750
2) Dhrystone 4735, Whetstone FPU/SSE2 1282/2899
CPU Multimedia:
1) Integer SSE2 11081; FP iSSE2 17106
2) Integer SSE2 9377; FP iSSE2 11577
Memory Bandwidth:
1) Integer 4152MB/s; Float iSSE2 4142MB/s
2) Integer 1299MB/s; Float iSSE2 1648MB/s
I didn't use a hand-held timer, but I can tell that the Dell actually comes up with the test results faster. Can anybody give me a clue as to why?
The same test run on a dual P3 800 MHz with SDRAM133:
MD5 - 18 clocks per byte, Whirlpool 145 clocks per byte
On Athlon 2.4(2.0) DDR266:
MD5 - 12 clocks per byte, Whirlpool 197 clocks per byte
Les.
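
For reference, a measurement like the one described above might look roughly like the sketch below. It is not the original test program: DummyHash is a hypothetical stand-in for MD5 or Whirlpool, and ReadTSC assumes a 32-bit Delphi compiler whose inline assembler knows the RDTSC mnemonic (older versions may need the raw opcode bytes instead).

```delphi
program ClocksPerByte;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Reads the CPU time-stamp counter.
// Assumption: 32-bit code, where an Int64 result is returned in EDX:EAX,
// which is exactly where RDTSC leaves its value.
function ReadTSC: Int64;
asm
  rdtsc
end;

// Hypothetical stand-in for the hash under test (NOT MD5 or Whirlpool).
// It only XORs the bytes together so that there is some work to time.
function DummyHash(const S: AnsiString): Byte;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    Result := Result xor Ord(S[I]);
end;

const
  DataSize = 10000000; // 10^7 bytes, as in the post

var
  Data: AnsiString;
  StartTicks, StopTicks: Int64;
  H: Byte;
begin
  // Build the input: a string of 10^7 spaces.
  SetLength(Data, DataSize);
  FillChar(Data[1], DataSize, Ord(' '));

  StartTicks := ReadTSC;
  H := DummyHash(Data);
  StopTicks := ReadTSC;

  Writeln(Format('%.1f clocks per byte (hash byte = %d)',
    [(StopTicks - StartTicks) / DataSize, H]));
end.
```

Note that the counter keeps running while the thread is preempted, so a figure obtained this way includes whatever else the machine was doing during the run.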
 
 

Re:Unexplained benchmark results (not FastCode)

Hi Les
How big is the dataset? 10 MB?
If it is so big that it does not fit in L1 then the L2 cache size and the
memory bandwidth makes the difference.
Best regards
Dennis Kjaer Christensen
 

Re:Unexplained benchmark results (not FastCode)

Quote
How big is the dataset? 10 MB?

If it is so big that it does not fit in L1 then the L2 cache size and the
memory bandwidth makes the difference.
The problem is that the Dell's memory is about twice as slow (according to the Sandra tests, at least), yet the hash test runs 30% faster on it.
Les.
 

Re:Unexplained benchmark results (not FastCode)

Hi Les
But it has a bigger L2?
This might be more important in this particular benchmark.
Best regards
Dennis Kjaer Christensen
 

Re:Unexplained benchmark results (not FastCode)

Quote
But it has a bigger L2?

This might be more important in this particular benchmark.
In the original post it said the Dell had less L2 (128kb against 512kb).
The only thing I can think of is that Windows XP might handle MMX more
efficiently during task switches, but it would be hard to believe the
effect would be so drastic. Otherwise some programs might be running
there, but I'd assume that has been taken proper care of by OP.
ISTM the best way to find out would be to install XP on the slower one
too.
--
The Fastcode Project: www.fastcodeproject.org/
 

Re:Unexplained benchmark results (not FastCode)

Is one of the two a laptop, or using SpeedStep or some other form of
dynamic CPU clock adjustment?
Eric
 

Re:Unexplained benchmark results (not FastCode)

Hi
Yes, I got the two PCs mixed up.
The situation is certainly weird.
Perhaps there is a problem with the benchmark.
Best regards
Dennis Kjaer Christensen
 

Re:Unexplained benchmark results (not FastCode)

Quote
The only thing I can think of is that Windows XP might handle MMX more
efficiently during task switches, but it would be hard to believe the
effect would be so drastic. Otherwise some programs might be running
there, but I'd assume that has been taken proper care of by OP.

ISTM the best way to find out would be to install XP on the slower one
too.
I was thinking about the difference in OS too, but as you put it, it is unlikely that a thread in W2K spends 30% of its time task switching (MMX notwithstanding). As soon as I get the chance to put W2K on the Dell, I will do it. It might be easier to switch the CPUs though. At first I thought that the Dell's CPU was some newer P4 model (like Prescott) which might handle MMX better, but they both report as Northwood and the first test (MD5) does not use MMX anyway.
My friend just suggested that maybe it is Hyper-Threading. The first CPU has HT enabled, so I will turn it off and see. Could it be that RDTSC adds up clocks for both virtual cores? Nah...
That actually makes me wonder: are CPU clock counts always synchronized in dual (or more) CPU systems? If not, is QueryPerformanceCounter immune to that?
Les.
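
One way to cross-check an RDTSC figure is to time the same run with QueryPerformanceCounter and compare the two. The sketch below is only an illustration of the API calls, with a placeholder loop standing in for the hash. It does not answer whether QueryPerformanceCounter is immune to unsynchronized counters on multi-CPU systems.

```delphi
program QpcTiming;
{$APPTYPE CONSOLE}

uses
  Windows, SysUtils;

var
  Freq, StartCount, StopCount: Int64;
  I, Sum: Integer;
begin
  // Counter frequency in ticks per second; the call fails if the
  // hardware has no high-resolution counter.
  if not QueryPerformanceFrequency(Freq) then
  begin
    Writeln('No high-resolution performance counter available');
    Halt(1);
  end;

  QueryPerformanceCounter(StartCount);

  // Placeholder workload (stands in for the hash being benchmarked).
  Sum := 0;
  for I := 1 to 10000000 do
    Sum := Sum + (I and 255);

  QueryPerformanceCounter(StopCount);

  // Convert the tick difference to milliseconds.
  Writeln(Format('Elapsed: %.3f ms (sum = %d)',
    [(StopCount - StartCount) * 1000 / Freq, Sum]));
end.
```

If the elapsed time multiplied by the nominal clock rate disagrees badly with the RDTSC count, that would point at the measurement rather than at the code being measured.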
 

Re:Unexplained benchmark results (not FastCode)

Quote
Is one of the two a laptop, or using SpeedStep or some other form of
dynamic CPU clock adjustment?
No laptops. I am not sure about the Dell - it is a cheap desktop system. I will take a look in the manual.
Les.
 

Re:Unexplained benchmark results (not FastCode)

Quote
My friend just suggested that maybe it is Hyper Threading. The first
CPU has HT enabled, so I will turn it off and see.
Good thinking. It might also be multiple effects (OS, HT) adding up.
Quote
Could it be that
RDTSC adds up clocks for both virtual cores? Nah...

That actually makes me wonder: Are CPU clock counts always
synchronized in dual (or more) CPU systems? if not then is
QueryPerformanceCounter immune to that?
An interesting question. The post below seems to suggest that they are
synchronized, but may not always be exactly synchronized:
softwareforums.intel.com/ids/board/message&message.id=590
--
The Fastcode Project: www.fastcodeproject.org/
 

Re:Unexplained benchmark results (not FastCode)

Quote
My friend just suggested that maybe it is Hyper Threading. The first CPU
has HT enabled, so I will turn it off and see. Could it be that RDTSC adds up
clocks for both virtual cores? Nah...
IIRC W2k doesn't like HT. It'll run fine, but I have read of a lot of cases where
performance decreased when HT was enabled. Windows 2003 Server, on the other hand, is
supposed to see an improvement with HT enabled. I am guessing a few tweaks in
the scheduler were needed.
DD
 

Re:Unexplained benchmark results (not FastCode)

It is Hyper-Threading, in combination with W2K it seems. I have tested SetThreadIdealProcessor and it does not make a difference on W2K Pro - that is, the OS still schedules the thread equally on both virtual CPUs. It does work as expected on W2K Server with the dual P3. Forcing the test to run on a specific virtual CPU by setting the affinity mask does not produce any benchmark improvement either. Not until HT is actually disabled.
Les.
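
For anyone wanting to repeat the affinity experiment, pinning the benchmark thread to one logical CPU might look something like the sketch below. The mask value 1 (logical CPU 0) and the placeholder workload are assumptions for illustration; as noted above, pinning alone made no difference on the W2K Pro machine while HT was enabled.

```delphi
program PinThread;
{$APPTYPE CONSOLE}

uses
  Windows, SysUtils;

var
  I, Sum: Integer;
begin
  // Restrict the current thread to logical CPU 0 (bit 0 of the mask).
  // SetThreadAffinityMask returns the previous mask, or 0 on failure.
  if SetThreadAffinityMask(GetCurrentThread, 1) = 0 then
  begin
    Writeln('SetThreadAffinityMask failed, error ', GetLastError);
    Halt(1);
  end;

  // Placeholder workload; the real test would call the hash here.
  Sum := 0;
  for I := 1 to 10000000 do
    Sum := Sum + (I and 255);

  Writeln('Done on CPU 0, sum = ', Sum);
end.
```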