- Intel Community
- Software Development SDKs and Libraries
- Intel® oneAPI Threading Building Blocks & Intel® Threading Building Blocks
- Problems with spin_mutex lock for matrix assembly


Manav_B_

Beginner


11-12-2012
09:43 AM

Problems with spin_mutex lock for matrix assembly

Hi,

I am using the latest release of TBB to parallelize my C++ finite element analysis application. I am having trouble with the matrix assembly part, where each thread calculates a component of the global matrix and adds it to the global matrix (i.e., each thread writes to a global variable). To facilitate this, I am using a spin_mutex lock. Unfortunately, every once in a while I miss one thread's contribution to the matrix, and the numerical computations blow up.

Following is the structure of my code. I think I am using the spin_mutex correctly, but the problem is random: sometimes it happens after two iterations, and sometimes it takes 100 or so (and everything in between). I would appreciate it if someone could comment on whether there is an error in my use of the mutex, or whether this is a known issue with a workaround.

I am using this on Mac OS 10.8.2. The hardware is: MacBook Air with 1.7 GHz Intel Core i5 with 4GB 1333 MHz DDR3 RAM.

Thanks,

Manav

Update: Found this to be an error in my code. Fixed and working fine now!

```cpp
tbb::spin_rw_mutex assembly_mutex;

class AssembleElementMatrices
{
public:
    AssembleElementMatrices(const std::vector<FESystem::Mesh::ElemBase*>& e,
                            FESystem::Numerics::VectorBase<FESystemDouble>& r,
                            FESystem::Numerics::MatrixBase<FESystemDouble>& stiff):
    elems(e),
    residual(r),
    global_stiffness_mat(stiff)
    { }

    void operator() (const tbb::blocked_range<FESystemUInt>& r) const
    {
        FESystem::Numerics::DenseMatrix<FESystemDouble> elem_mat;
        FESystem::Numerics::LocalVector<FESystemDouble> elem_vec;

        for (FESystemUInt i=r.begin(); i!=r.end(); i++)
        {
            // code to calculate elem_vec and elem_mat

            {
                tbb::spin_rw_mutex::scoped_lock my_lock(assembly_mutex, true);
                dof_map.addToGlobalVector(*(elems[i]), elem_vec, residual);             // adds elem_vec to the appropriate locations in the residual vector
                dof_map.addToGlobalMatrix(*(elems[i]), elem_mat, global_stiffness_mat); // adds elem_mat to the appropriate locations in the global_stiffness_mat matrix
            }
        }
    }

protected:
    const std::vector<FESystem::Mesh::ElemBase*>& elems;
    FESystem::Numerics::VectorBase<FESystemDouble>& residual;
    FESystem::Numerics::MatrixBase<FESystemDouble>& global_stiffness_mat;
};

void calculateQuantities()
{
    const std::vector<FESystem::Mesh::ElemBase*>& elems = mesh.getElements();
    tbb::parallel_for(tbb::blocked_range<FESystemUInt>(0, elems.size()),
                      AssembleElementMatrices(elems, residual, global_stiffness_mat));
}
```

10 Replies


>>...I think I am using the spin_mutex correctly, but the problem is random...
This thread has gone unanswered for more than two days, and I wonder if you could provide more details. If you post the source of a complete test case that reproduces your random error, somebody ( for example me ) will take a look at it. Does that make sense?
Best regards,
Sergey

SKost

Valued Contributor II


11-16-2012
05:50 PM


This could have been positioned more prominently: "Update: Found this to be an error in my code. Fixed and working fine now!"

RafSchietekat

Black Belt


11-16-2012
10:58 PM


>>..."Update: Found this to be an error in my code. Fixed and working fine now!"
That is possible. Anyway, it would be nice to hear from 'Manav B.'.

SKost

Valued Contributor II


11-17-2012
07:47 AM


Hi Sergey and Raf,
Thanks for your message. I agree that I could have made the update more prominent.
The problem in the code was that I was unintentionally using a global matrix as scratch space in each of the threads. Each thread would modify it during its sequence of computations, and those intermediate values would then be read by another thread that was expecting different numbers in the matrix.
It took a few days to identify this, but once rectified, the code works beautifully.
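The bug class described here (a shared scratch object silently reused across threads) can be sketched in a few lines. This is an editorial illustration with made-up names, using plain std::thread instead of TBB, not the original FESystem code:

```cpp
#include <mutex>
#include <thread>
#include <vector>

std::mutex global_mutex;

// Fixed pattern: each thread stages its contribution in scratch storage that
// is local to the thread body, so no other thread can overwrite it. The bug
// described above was the opposite: the scratch object was shared, so one
// thread could read another thread's half-finished intermediate values.
long assemble(int n_threads, int work_per_thread) {
    long global_sum = 0;                    // stands in for the global matrix
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([&global_sum, work_per_thread]() {
            long scratch = 0;               // thread-private scratch: no race
            for (int i = 0; i < work_per_thread; ++i)
                scratch += 1;               // stands in for the element computation
            std::lock_guard<std::mutex> lock(global_mutex);
            global_sum += scratch;          // only the final add touches shared state
        });
    }
    for (auto& w : workers) w.join();
    return global_sum;
}
```

With the scratch variable private to each thread, the mutex only needs to guard the final accumulation, which is exactly the pattern in the posted assembly code.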
Thanks,
Manav

Manav_B_

Beginner


11-17-2012
02:09 PM


Hi Manav,
>>...It took a few days to identify this, but once rectified, the code works beautifully...
I have a couple of questions:
- How big is your matrix?
- Could you provide some performance numbers? ( Please provide technical details like CPU, operating system, size of the matrix, etc )
- Did you have a chance to test matrix multiplication with TBB?
Thanks in advance.

SKost

Valued Contributor II


11-18-2012
08:55 AM


>>...Please provide technical details like CPU, operating system, size of the matrix, etc...
Sorry, I see your data:
>>...I am using this on Mac OS 10.8.2. The hardware is: MacBook Air with 1.7 GHz Intel Core i5 with 4GB 1333 MHz DDR3 RAM...
What about the size of the matrix?

SKost

Valued Contributor II


11-18-2012
08:58 AM


The matrix size in this case is 16x16. Each thread, however, owns several such matrices that are needed for the computations.
I have also used TBB to parallelize my LU decomposition solver, which I frequently use for sparse matrices of order a few hundred thousand.
I have not done a parallelization efficiency benchmark yet, but intend to finish that in the coming days.
Manav

Manav_B_

Beginner


11-19-2012
06:18 PM


Thank you for the additional details.
>>...The matrix size in this case is 16x16. Each thread, however, owns several such matrices that are needed for the computations.
I think 16x16 matrices are too small for any multithreaded processing on their own. What about the overhead related to threads ( context switches, etc. ), and how many threads do you create in total?
>>...I have not done a parallelization efficiency benchmark yet, but intend to finish that in the coming days.
Would you be able to compare the performance of the serial and multi-threaded versions? I would not be surprised to see the serial version ( if it exists ) outperform the multi-threaded one.
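The overhead concern can be made concrete with a rough sketch (an editorial illustration with hypothetical names, using std::async instead of TBB): a 16x16 multiply is about 16^3 = 4096 multiply-adds, on the order of a microsecond of work, while dispatching a task to another thread typically costs at least that much.

```cpp
#include <array>
#include <future>

using Mat = std::array<std::array<double, 16>, 16>;

// Plain serial 16x16 multiply: ~4096 multiply-adds, roughly a microsecond.
Mat multiply_serial(const Mat& a, const Mat& b) {
    Mat c{};
    for (int i = 0; i < 16; ++i)
        for (int k = 0; k < 16; ++k)
            for (int j = 0; j < 16; ++j)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}

// Splitting the row loop across two tasks is correct (the tasks write to
// disjoint rows of c), but the task-dispatch overhead can easily exceed the
// arithmetic saved, which is why a lone 16x16 multiply rarely benefits from
// threads. The win comes from parallelizing over many elements, as in the
// assembly loop above.
Mat multiply_two_tasks(const Mat& a, const Mat& b) {
    Mat c{};
    auto rows = [&](int lo, int hi) {
        for (int i = lo; i < hi; ++i)
            for (int k = 0; k < 16; ++k)
                for (int j = 0; j < 16; ++j)
                    c[i][j] += a[i][k] * b[k][j];
    };
    auto top = std::async(std::launch::async, rows, 0, 8);
    rows(8, 16);   // this thread handles the bottom half concurrently
    top.get();
    return c;
}
```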
Best regards,
Sergey

SKost

Valued Contributor II


11-20-2012
06:40 AM


Hi Sergey,
I just uploaded a PDF file that shows the speedup I was able to obtain. This was done on a Mac Pro with 2 x 3.06 GHz 6-core Intel Xeon processors and 32 GB of 1333 MHz DDR3 RAM, running OS 10.8.2. With hyperthreading, the machine sees 24 logical cores, but the speedup saturates at 12 threads (expected, I think).
I have parallelized two separate blocks of my code. The first block does matrix assembly where each thread calculates one matrix for each element assigned to it and then adds it to a global matrix. Each thread in this block owns a few matrices and vectors of dimension 16.
The second block that I have parallelized is the LU decomposition solver where I make each thread operate on a set of rows independently.
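The row-parallel LU scheme described here can be sketched as follows. This is a minimal editorial illustration (plain std::thread, dense storage, no pivoting, hypothetical names), not the FESystem solver: at each elimination step k, the updates to trailing rows k+1..n-1 are independent of one another, so they can be split across threads.

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// In-place LU factorization without pivoting (for brevity). After the call,
// the strict lower triangle of 'a' holds L (unit diagonal implied) and the
// upper triangle holds U. The trailing-row updates at each step k are
// independent, so they are divided into contiguous chunks, one per thread.
void lu_in_place(std::vector<std::vector<double>>& a, int n_threads) {
    const int n = static_cast<int>(a.size());
    for (int k = 0; k < n; ++k) {
        auto update_rows = [&a, k, n](int lo, int hi) {
            for (int i = lo; i < hi; ++i) {
                a[i][k] /= a[k][k];                 // multiplier (L entry)
                for (int j = k + 1; j < n; ++j)
                    a[i][j] -= a[i][k] * a[k][j];   // trailing submatrix update
            }
        };
        const int span  = n - (k + 1);
        const int chunk = (span + n_threads - 1) / n_threads;
        std::vector<std::thread> workers;
        for (int t = 0; t < n_threads && t * chunk < span; ++t) {
            int lo = k + 1 + t * chunk;
            int hi = std::min(n, lo + chunk);
            workers.emplace_back(update_rows, lo, hi);
        }
        for (auto& w : workers) w.join();   // barrier before the next pivot
    }
}
```

A production version would reuse a thread pool (as TBB does) rather than spawn threads at every step, and would pivot for numerical stability; the sketch only shows where the row-level independence lives.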
Manav

Manav_B_

Beginner


11-20-2012
09:19 AM


Consider using two mutexes:

```cpp
{
    tbb::spin_rw_mutex::scoped_lock my_lock(residual_mutex, true);
    dof_map.addToGlobalVector(*(elems[i]), elem_vec, residual);             // adds elem_vec to the appropriate locations in the residual vector
}
{
    tbb::spin_rw_mutex::scoped_lock my_lock(stiffness_mutex, true);
    dof_map.addToGlobalMatrix(*(elems[i]), elem_mat, global_stiffness_mat); // adds elem_mat to the appropriate locations in the global_stiffness_mat matrix
}
```

Jim Dempsey

jimdempseyatthecove

Black Belt


11-21-2012
01:43 PM
