topic My Parallel Sort Library and benchmarks ... in Intel® Moderncode for Parallel Architectures

My Parallel Sort Library and benchmarks ...

aminer10 — Fri, 07 May 2010 08:09:55 GMT

Hello all,

I have used my Parallel Sort Library and benchmarked an Object Pascal
program that sorts a dynamic array of 10 millions of strings.

Please look at the benchmarks in following page:

http://pages.videotron.com/aminer/parallelsort/parallelsort.htm

On four cores four threads, here is the results:

Parallel Quicksort gives 5.31x speed and 2.877 seconds
Parallel Heapsort gave me 4.72x and 7.452 seconds

Note: Parallel Quicksort is much faster in pratice than parallel heapsort
or parallel mergesort.

Sincerely
Amine Moulay Ramdane.

My Parallel Sort Library and benchmarks ...

Tudor — Fri, 07 May 2010 10:43:39 GMT

Are you sure the algorithms perform normally when there is only 1 core? It seems you are getting superlinear speedup for every configuration compared to 1 core. That seems very strange.
For example, I extracted this from your tests:
Heapsort:
2 cores - 13.382 sec
4 cores - 7.453 sec
this is normal, a bit below 2x speedup. But:
1 core - 35.188 sec
2 cores - 13.382 sec
that's almost 3x speedup for double the number of cores.

My Parallel Sort Library and benchmarks ...

aminer10 — Fri, 07 May 2010 18:54:29 GMT

Tudor wrote:
>Are you sure the algorithms perform normally when there is only 1 core? It seems you are getting >superlinear speedup for every configuration compared to 1 core. That seems very strange.

I thinkit's the 'divide and conquer' that givesthis speed up,due
to the fact that i am sorting 'parts' of the array in parallel and in
every parts of the array i am applying the quicksort or heapsort...

So, if we have 80 elements , in single threaded the average complexity
will be 80 Ln(80)this will give350.56 , but in mutlthreaded with four threads
on four coresthis will give 4(20 ln(20)) = 239 , and we are also runnning
the quicksort or heapsort in parallel on every part of the array , it's why
it gives better than 4x.

:)

Sincerely,
Amine.

My Parallel Sort Library and benchmarks ...

aminer10 — Fri, 07 May 2010 19:02:42 GMT

Hello,

Of course i am using NO synchronization between the threads...

Just a smallinterlockincrement() to know when the sort have finnished...

Amine.

My Parallel Sort Library and benchmarks ...

aminer10 — Fri, 07 May 2010 19:16:43 GMT

Tudor wrote:
>1 core - 35.188 sec
>2 cores - 13.382 sec
>that's almost 3x speedup for double the number of cores.

Don't forget that i am using an Intel Core 2 Quad , so,
the CPU-Z gave me: *2* x 4096 KBytes of Level2 cache.

So, on four cores i amsorting four chunks of the big array
with just *2* Level 2 caches :)

Amine.

My Parallel Sort Library and benchmarks ...

aminer10 — Fri, 07 May 2010 19:41:24 GMT

Tudor wrote:
>1 core - 35.188 sec
>2 cores - 13.382 sec
>that's almost 3x speedup for double the number of cores.

Forget about he complexity , cause with a big chunks there
is almost no difference in the complexity from one to two cores...

So , i think the single threaded program is less cache friendly
than the two threads... and since we have 2 xLevel 2 caches , this
explains the 3x from 1 core to two cores...

I don't have any other explaination...

Amine.