[PLUG] surprising performance on my HNOW nodes

Tue Aug 10 23:23:01 UTC 2004

>From: Russell Senior <seniorr at aracnet.com>
> >>>>> "Elliott" == Elliott Mitchell <ehem at m5p.com> writes:
> Elliott> [...] So left with what killed the P4? First thing I'll note
> Elliott> is despite the resident size being 1370KB notice that is
> Elliott> bigger than the cache of most processors. Notably if this is
> Elliott> a late model P3 then it might have 512KB cache, if the P4 was
> Elliott> an early model it might have a mere 256KB.
> 
> The P4 has 512 KB.  The P3 has only 256 KB!

You've double-checked? Okay, so we're not seeing a cache effect here.

> Elliott> [...] If your code has a lot of irregular branches, this will
> Elliott> *kill* the P4 (no modern processor likes branches, but none
> Elliott> compare to the P4's dislike of them).
> 
> This might be it.  The main loop traverses a list of heterogeneous
> objects.  Certainly there is branching to handle different subtypes,
> and branching to decide whether computations are even needed in some
> cases.  I'll have to go back and look at that code and see if I might be
> able to smooth it out.

Removing branches will help any modern processor (most just don't suffer
from them as badly as the P4 does). Are you running an SMP kernel? Try
running your program as two processes. They'll fight for the core, but the
result may be fewer pipeline stalls, which helps the P4.
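
One way to "smooth out" a loop over heterogeneous objects, short of
removing the branches, is to group the objects by subtype first so the
dispatch branch goes the same way for long runs. Below is a minimal sketch
in C, not your actual code: the struct layout, the two subtypes, the skip
flag and the function names are all hypothetical, and it assumes the
computation does not depend on the order in which objects are visited.

```c
#include <stddef.h>

/* Hypothetical object: a subtype tag, a "no work needed" flag, a payload. */
enum kind { KIND_A = 0, KIND_B = 1, KIND_COUNT = 2 };

struct obj {
    enum kind kind;
    int skip;        /* nonzero means no computation is needed */
    double value;
};

/* Stable counting sort by subtype: dst gets every KIND_A object,
 * then every KIND_B object, each group in its original order. */
void group_by_kind(const struct obj *src, struct obj *dst, size_t n)
{
    size_t count[KIND_COUNT] = {0};
    size_t start[KIND_COUNT];
    size_t i, pos = 0;
    int k;

    for (i = 0; i < n; i++)
        count[src[i].kind]++;
    for (k = 0; k < KIND_COUNT; k++) {
        start[k] = pos;
        pos += count[k];
    }
    for (i = 0; i < n; i++)
        dst[start[src[i].kind]++] = src[i];
}

/* The main loop is unchanged; only the order of its input differs.
 * On grouped input the switch takes the same case for long runs. */
double process(const struct obj *objs, size_t n)
{
    double sum = 0.0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (objs[i].skip)
            continue;
        switch (objs[i].kind) {
        case KIND_A: sum += objs[i].value * 2.0; break;
        case KIND_B: sum += objs[i].value + 1.0; break;
        default: break;
        }
    }
    return sum;
}
```

Call group_by_kind() once (or whenever the list changes), then run
process() on the grouped copy. The skip test is still data dependent; if
it turns out to matter, the same trick applies with the flag as a second
sort key. Whether any of this pays for the extra copy is something only a
measurement on your nodes can tell you.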

> Do you know which modern processors might be better?  

Anything that isn't a P4.  :-)   Seriously, Intel Marketing told Intel
Engineering "clock speed at all costs"; what they got was a processor
that could handle high clock speeds, but had a 20-stage pipeline. They've
now revised it to a 31-stage pipeline, seeking yet higher clock speeds.
The result is what you'd expect: in nice consistent loops you get good
performance; in code with hard-to-predict branches you get killed.
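
To put a rough shape on why pipeline depth matters, here is a
back-of-the-envelope model as a small C function. It is a sketch under
loud assumptions: the function name and parameters are made up, and it
treats the misprediction penalty as roughly the pipeline depth in cycles,
which is a simplification (the real penalty depends on where in the
pipeline the branch is resolved).

```c
/* Estimated average cycles per instruction.
 *   base_cpi     cycles per instruction with perfect prediction
 *   branch_frac  fraction of instructions that are branches
 *   miss_rate    fraction of those branches that are mispredicted
 *   penalty      cycles lost per misprediction (assumed here to be
 *                about the pipeline depth) */
double est_cpi(double base_cpi, double branch_frac,
               double miss_rate, double penalty)
{
    return base_cpi + branch_frac * miss_rate * penalty;
}
```

With invented inputs, say one instruction in five a branch and one branch
in ten mispredicted, est_cpi() adds about 0.4 cycles per instruction at a
penalty of 20 and about 0.62 at a penalty of 31. The numbers are
illustrative only; the point is that the cost of each misprediction
scales with the depth while a predictable loop pays almost none of it.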

The Intel Pentium M (the processor in Centrino laptops) wasn't designed
around clock speed and doesn't have this problem. The AMD chips go for
clock speed, but not at the cost of performance; they'll be fine. PowerPC
chips are known for short pipelines; if the branches really are the
problem, they can't be beaten.

-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \   (    |         EHeM at gremlin.m5p.com PGP 8881EF59         |    )   /
  \_  \   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
    \___\_|_/82 04 A1 3C C7 B1 37 2A*E3 6E 84 DA 97 4C 40 E6\_|_/___/