Hello,

carlomaria

Beginner

01-14-2011
02:16 AM

Cluster vs. multicore different scalability

I'm running two parallel matrix multiplication algorithms on two machines, using OpenMPI, to assess scalability.

The first machine is a cluster of 4 quad-core processors (16 CPUs available in total); the second is a Dell PC with 16 GB RAM and an Intel Core i7 processor (8 CPUs available in total).

Algorithm 1 performs multiplication as follows:

[bash]
{
    unsigned int i, j, k;
    double sum;
    // The array subscripts were lost in the forum markup; the standard
    // [i][k], [k][j], [i][j] form is assumed here.
    for (i = 0; i < A.m; i++)          // Rows
    {
        for (j = 0; j < B.n; j++)      // Cols
        {
            sum = 0;
            for (k = 0; k < A.n; k++)
                sum += A.rows[i][k] * B.rows[k][j];
            C.rows[i][j] = sum;
        }
    }
}
[/bash]

In algorithm 2 I used pointers instead to enhance speedup:

[bash]
{
    unsigned int i, j, k;
    double *c_ptr = &C.rows[0][0];
    double *b_ptr = &B.rows[0][0];
    double *a_ptr = &A.rows[0][0];
    for (i = 0; i < A.m; i++)          // Rows
    {
        for (j = 0; j < B.dim; j++)    // Cols
        {
            double sigma = 0;
            double *A_ptr = (a_ptr + i*A.dim);   // start of row i of A
            double *B_ptr = b_ptr + j*B.dim;     // start of row j of B
            for (k = 0; k < A.dim; k++)
            {
                sigma += (*A_ptr) * (*B_ptr);
                A_ptr++;
                B_ptr++;
            }
            *c_ptr++ = sigma;
        }
    }
}
[/bash]

The MPI structure and data decomposition are the same for both programs.

Algorithm 1 shows linear scalability on the cluster up to 8 processors and linear scalability up to 4 processors on the PC. Algorithm 2 shows linear scalability on the cluster up to 8 processors but is not scalable at all on the PC.

Tests were performed multiplying dense square matrices 1000x1000 and 5000x5000.
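For readers comparing runs like these, "linear scalability" can be made concrete as speedup S(p) = T(1) / T(p) and efficiency E(p) = S(p) / p. The sketch below is only an illustration: the `Scaling` struct and `scaling` function are hypothetical names, and the timings must come from your own measurements.

```cpp
#include <cassert>

// Speedup and parallel efficiency for one run (hypothetical helper type).
struct Scaling {
    double speedup;     // T(1) / T(p)
    double efficiency;  // speedup / p; 1.0 means perfectly linear scaling
};

// t1: wall-clock time with 1 process; tp: wall-clock time with p processes.
Scaling scaling(double t1, double tp, int p) {
    assert(t1 > 0.0 && tp > 0.0 && p > 0);
    const double s = t1 / tp;
    return Scaling{s, s / p};
}
```

If, for example, a run with 8 processes takes as long as a run with 4, the speedup stays the same while the efficiency halves, which is the pattern one would expect once processes start sharing physical cores.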

Does anyone know what could cause the difference? Is it the algorithm or the machine?

Are there issues with dynamic memory allocation in an MPI environment?

Thanks for your help,

Carlo Maria

4 Replies

You omit a lot of important information, such as whether your compiler swaps loops and vectorizes, and why you don't make it easier for the compiler to do so.

No doubt you're aware that HyperThreading doesn't accelerate properly coded matrix multiply, yet you seem to expect otherwise.
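One common way to "make it easier for the compiler" is to interchange the loops by hand so the innermost loop runs with unit stride. The following is a minimal sketch, assuming row-major matrices stored in flat `std::vector<double>` buffers; the function name `matmul_ikj` and this storage layout are assumptions for illustration, not the original poster's data structures.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// C = A * B for row-major matrices: A is m x n, B is n x p, C is m x p.
// With loop order i-k-j, the innermost loop walks one row of B and one row
// of C contiguously, which is generally easier for an auto-vectorizer than
// the strided column access of the textbook i-j-k order.
void matmul_ikj(const std::vector<double>& A, const std::vector<double>& B,
                std::vector<double>& C,
                std::size_t m, std::size_t n, std::size_t p) {
    assert(A.size() == m * n && B.size() == n * p);
    C.assign(m * p, 0.0);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t k = 0; k < n; ++k) {
            const double a = A[i * n + k];  // invariant in the inner loop
            for (std::size_t j = 0; j < p; ++j) {
                C[i * p + j] += a * B[k * p + j];
            }
        }
    }
}
```

Whether a given compiler performs this interchange on its own depends on the compiler version and the optimization flags, so writing it explicitly removes one unknown from the comparison.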

TimP

Black Belt

01-14-2011
09:18 AM


Hello,

carlomaria

Beginner

01-14-2011
10:38 AM

My compiler is mpic++ with default optimisation options. How is it possible to make it easier for the compiler?

I didn't expect a great acceleration, but I can't understand why the two algorithms perform so differently in terms of scalability on the PC. I attach the speedup comparison on the PC.

Regards,

Carlo Maria

mpic++ could wrap any MPI implementation and a variety of compilers. If it's g++, mpic++ -v would confirm it, and show whether you have a recent enough version to expect optimization. Then you would probably need to declare the pointers as * __restrict__ to enable auto-vectorization with -O3 -ffast-math -ftree-vectorizer-verbose=1.
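As a rough sketch of what the `__restrict__` suggestion could look like when applied to the pointer version, consider the function below. `__restrict__` is a GCC extension (also accepted by Clang), not standard C++. The function name `matmul_restrict` and the flat row-major layout are assumptions; as in Algorithm 2, the second operand is read row by row, so it is treated here as holding B transposed, which is also an assumption about the original data.

```cpp
#include <cassert>
#include <cstddef>

// c (m x p) = a (m x n) times the transpose of bt (p x n), all row-major.
// __restrict__ promises the compiler that a, bt and c never overlap, so a
// store to c cannot change values it has already loaded from a or bt.
void matmul_restrict(const double* __restrict__ a,
                     const double* __restrict__ bt,
                     double* __restrict__ c,
                     std::size_t m, std::size_t n, std::size_t p) {
    assert(a != nullptr && bt != nullptr && c != nullptr);
    for (std::size_t i = 0; i < m; ++i) {
        const double* a_row = a + i * n;
        for (std::size_t j = 0; j < p; ++j) {
            const double* b_row = bt + j * n;
            double sigma = 0.0;
            // Floating-point reduction: the compiler may only reorder these
            // additions (and so vectorize) when flags such as -ffast-math
            // permit reassociation.
            for (std::size_t k = 0; k < n; ++k) {
                sigma += a_row[k] * b_row[k];
            }
            c[i * p + j] = sigma;
        }
    }
}
```

Compiled with the flags quoted above (-O3 -ffast-math -ftree-vectorizer-verbose=1), the vectorizer report should indicate whether the inner loop was vectorized; the exact output depends on the g++ version.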

To optimize for Core i7 or Xeon 55xx, you would need a recent enough g++ to accept -mtune=barcelona. -march=corei7 isn't supported until g++ 4.6 as far as I know.

You would want to do at least minimal checking to be certain that you are using all cores in your quad core MPI runs (top could be sufficient). Of course, your multi-core scaling is more likely to be good if you don't use optimization in the compilation (and so get least performance per core).

TimP

Black Belt

01-14-2011
11:47 AM


Thank you for your answer; it was a great help. In fact, after checking, I noticed I was using g++ 4.4.

carlomaria

Beginner

01-18-2011
12:56 PM
