Hi there,
I am trying to offload some computation to the MIC using a pragma, sending data addressed by a pointer p. How can I ensure that the data is aligned on the MIC after the MIC has received it? Does "__assume(p, 64)" work? I want to use intrinsics to load the data into the vector register file, which requires the data to be aligned.
Another problem: I am trying to run many threads for the calculation using "#pragma omp parallel for", and some arrays inside the loop must be thread-private and also 64-byte aligned.
I am using "_mm_malloc()" inside the loop to ensure both, but this leads to repeated and unnecessary allocation.
Thanks.
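To make the alignment question concrete, here is a minimal sketch of checking a pointer at run time before relying on aligned vector loads. It assumes C11 `aligned_alloc` as a portable stand-in for `_mm_malloc`, and the GCC/Clang builtin `__builtin_assume_aligned` as a stand-in for an Intel alignment hint; `is_aligned` and `sum_aligned` are hypothetical names, not part of any library. A hint of this kind is only a promise to the compiler and does not move or realign data, so the check is what actually guards the aligned loads. The Intel offload clauses also document an `align()` modifier, e.g. `in(p : length(n) align(64))`, which is not exercised here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: returns 1 if p is a multiple of `alignment` bytes. */
static int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}

/* Hypothetical kernel standing in for the offloaded function. */
static double sum_aligned(const double *p, size_t n)
{
    /* Verify first: the hint below does not align anything by itself. */
    assert(is_aligned(p, 64));
#if defined(__GNUC__)
    /* Assumption: GCC/Clang builtin used in place of an Intel-specific hint. */
    p = __builtin_assume_aligned(p, 64);
#endif
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        s += p[i];
    }
    return s;
}
```

A caller would allocate with `aligned_alloc(64, n * sizeof(double))` (the size must be a multiple of the alignment), fill the buffer, and pass it to `sum_aligned`.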
Could you possibly post a small sample code? Thanks.
Frances Roth (Intel) wrote:
Could you possibly post a small sample code? Thanks.
In the main function:
The data addressed by p is transferred to the MIC. The function foo is defined like this:
Thus p1 is allocated repeatedly inside the loop to make sure it is thread-private, and p1 also has to be aligned.
At the very least you should structure that more like this (which allocates once per thread, rather than once per iteration)
#pragma omp parallel
{
    long long iter; // Though does it *really* need to be 64 bits!? How many iterations do you have?
                    // 64bit indexes are likely inefficient.
    // The alignment argument is in bytes; 64 would already be enough for 512-bit vectors.
    double * p1 = (double *) _mm_malloc (sizeof(double)*1024, 512);
    #pragma omp for
    for (iter=0; iter<N; iter++)
    {
        __m512d _A;
        ... etc ...
    }
    _mm_free (p1);
}
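For reference, the once-per-thread pattern can be sketched as a complete function. This is an illustration under stated assumptions, not the original poster's code: `process_all`, `SCRATCH_LEN`, and the fill-and-sum workload are hypothetical, and C11 `aligned_alloc` is used as a portable stand-in for `_mm_malloc`. Because `p1` is declared inside the parallel region, each thread gets its own pointer and its own 64-byte-aligned buffer, allocated once rather than once per iteration.

```c
#include <stdint.h>
#include <stdlib.h>

#define SCRATCH_LEN 1024 /* hypothetical scratch size, in doubles */

/* Hypothetical workload: each iteration fills the thread's scratch buffer
 * with its index and stores the sum in out[iter].
 * Returns 0 on success, -1 if any thread failed to get an aligned buffer. */
static int process_all(double *out, long n)
{
    int failed = 0;

    #pragma omp parallel reduction(|:failed)
    {
        /* Declared inside the region, so private to each thread.
         * Assumption: aligned_alloc stands in for _mm_malloc. */
        double *p1 = aligned_alloc(64, SCRATCH_LEN * sizeof(double));
        if (p1 == NULL || ((uintptr_t)p1 % 64) != 0) {
            failed = 1;
        }

        /* Every thread must reach the worksharing loop, so a failed
         * thread skips the work instead of leaving the region early. */
        #pragma omp for
        for (long iter = 0; iter < n; iter++) {
            if (!failed) {
                double s = 0.0;
                for (int k = 0; k < SCRATCH_LEN; k++) {
                    p1[k] = (double)iter;
                }
                for (int k = 0; k < SCRATCH_LEN; k++) {
                    s += p1[k];
                }
                out[iter] = s;
            }
        }

        free(p1); /* free(NULL) is a no-op, so this is safe on failure */
    }
    return failed ? -1 : 0;
}
```

Built without OpenMP support, the pragmas are ignored and the same code runs on a single thread with a single allocation.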
James Cownie (Intel) wrote:
At the very least you should structure that more like this (which allocates once per thread, rather than once per iteration)
I really do have that many iterations. Restructuring the code helps, thanks.