Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

More threads than sections?

afd_lml
Beginner


My program code is listed below:

void mv(void)
{
double vx[size1]; // size1 = 10000
double vy[size2]; // size2 = 10000
double vz[size3]; // size3 = 10000

// I must compute vx, vy, and vz separately
computeV(vx); // compute the array vx
computeV(vy); // compute the array vy
computeV(vz); // compute the array vz

// sum up vx, vy, vz
...................................
}

void computeV(double v[])
{
// Most of the computation is carried out in this function; it is a heavy CPU burden.
// For example, V is calculated by 4 fast Fourier transforms (using Intel MKL) and a matrix-vector multiplication, like this:
for (int i = 0; i < ...; i++) v[i] = ........
........................................
}


My workstation is a 24-core Xeon 7400 system. If I simply use OpenMP as in the following code, are only 3 cores busy? In other words, are 21 cores idle? What happens if the number of threads and the number of sections are different, for example more threads than sections? How can I obtain the best performance from my code?

void mv(void)
{
double vx[size1]; // size1 = 100000
double vy[size2]; // size2 = 100000
double vz[size3]; // size3 = 100000

#pragma omp parallel sections
{
#pragma omp section
computeV(vx); // compute the array vx

#pragma omp section
computeV(vy); // compute the array vy

#pragma omp section
computeV(vz); // compute the array vz
}

// sum up vx, vy, vz
...................................
}

Would anyone like to help me? Many thanks.

1 Solution
robert-reed
Valued Contributor II
Quoting - afd.lml

My program code is listed in the following:

[snip]

My workstation is a 24-core Xeon 7400 system. If I simply use OpenMP as in the following code, are only 3 cores busy? In other words, are 21 cores idle? What happens if the number of threads and the number of sections are different, for example more threads than sections? How can I obtain the best performance from my code?


As others have suggested before me, the code as written would only take advantage of three of the 24 HW threads available on your machine. Here's chapter and verse from the OpenMP 3.0 specification:

Each structured block is executed once by one of the threads in the team in the context of its implicit task.


Also previously mentioned, the natural thing to look at is whether the elements of vx, vy and vz can be computed in parallel. Perhaps that for loop cited in your post could be wrapped in an omp parallel for construct? It would require that each of the array elements could be computed independently and in any order, but the parallel for could use all 24 of your HW threads if such a computational organization is possible. If that works, I would start with the parallelization of the for loop in computeV() and skip the sections until I had the loop parallelization working.
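
A minimal sketch of the loop parallelization suggested here, assuming every element can be computed independently and in any order. The size parameter `n` and the helper `element_value` are hypothetical stand-ins: the original `computeV` takes only the array, and its loop body (the FFT and matrix-vector work) is elided and not reproduced.

```c
/* Hypothetical per-element work standing in for the real computation.
   It depends only on the index, so iterations are independent. */
static double element_value(long i)
{
    return 0.5 * (double)i + 1.0;
}

/* Fill v[0..n-1]. Each iteration writes only its own element, so the
   loop can be divided among all threads in the team.
   Assumption: the extra parameter n is not in the original signature. */
void computeV(double v[], long n)
{
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++) {
        v[i] = element_value(i);
    }
}
```

When built with OpenMP enabled (for example `-fopenmp` with GCC), the iterations are split across the available threads; without it the pragma is ignored and the loop runs serially with the same result. If the real loop body calls a library that is itself threaded, the interaction between the two levels of threading would need to be checked separately.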

4 Replies
gilthe
Beginner
hi,

Don't take my word for it (I'm a beginner myself), but the trivial 3-way parallelization you implemented is in fact limited to 3 cores.
Your function computeV can theoretically be parallelized further (especially if it contains a simple enough outermost loop), but you have to analyze the data dependencies for that and eliminate shared writes.
If it's all about performance for this specific example, also try the compiler's auto-vectorization and auto-parallelization first and see what they report and how much they gain you.

cheers,
andreas
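
One common way to eliminate a shared write is an OpenMP reduction. The sketch below is an illustration only: `sum_array` is a hypothetical function, not part of the original code, and it assumes a loop that accumulates into a single scalar.

```c
/* Sum the n elements of v.
   A plain "total += v[i]" inside a parallel loop would be a shared write;
   the reduction clause gives each thread a private partial sum and
   combines the partial sums when the loop ends. */
double sum_array(const double v[], long n)
{
    double total = 0.0;

    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < n; i++) {
        total += v[i];
    }
    return total;
}
```

Because the partial sums may be combined in a different order from run to run, floating-point results can differ in the last bits from a serial sum.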

jimdempseyatthecove
Honored Contributor III

In the above case the compiler can see that you have 3 sections and "should" be able to schedule just 3 threads. That said, your program may not always clearly expose to the compiler an appropriate number of threads to use. For this there is the num_threads(n) clause, which you can add:

#pragma omp parallel sections num_threads(3)

Also, you may (or may not) find it beneficial to request fewer threads than sections (e.g. when in nested parallel regions).

Jim Dempsey
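
A small sketch of how the team size relates to the number of sections. `run_sections` and its `runs` counter array are hypothetical names introduced for illustration. Whatever team size is requested, each section body executes exactly once: with fewer threads than sections some thread runs more than one section, and with more threads than sections the extra threads have no section to run and wait at the barrier that ends the construct.

```c
/* Run three sections on a team of at most `threads` threads and record
   how many times each section body executed. Each section writes a
   different array element, so there is no shared write. */
void run_sections(int threads, int runs[3])
{
    runs[0] = 0;
    runs[1] = 0;
    runs[2] = 0;

    #pragma omp parallel sections num_threads(threads)
    {
        #pragma omp section
        runs[0]++;

        #pragma omp section
        runs[1]++;

        #pragma omp section
        runs[2]++;
    }
}
```

Calling `run_sections(2, runs)` or `run_sections(8, runs)` leaves every counter at 1 in both cases; only the number of threads available to share the work changes.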
afd_lml
Beginner
Thank you all for your help!

Sorry for mistakenly clicking the rating.