To maximize parallel execution on a computer with an M3 processor that has 11 CPU cores (one hardware thread each), it is essential to distribute the workload evenly and use the available resources efficiently.
Implementation details:
Inputs: a file with the input matrix (you choose the size) and a kernel (fixed size 4x4)
The goal is to first have working sequential code for the four operations. Then, parallelize the operations that can be efficiently parallelized. Pay special attention to data races (multiple threads reading the same input) and to concurrency conflicts (multiple threads updating the same output). As a general suggestion, to achieve the best performance you should group in one thread (or in multiple threads executed on the same CPU) all the operations that work on the same input data; this avoids costly copies of the data to multiple locations.
There are multiple ways to parallelize the code. You can parallelize the single convolution, or you can parallelize across convolutions (each thread executes a 4x4 convolution). Please discuss the benefits of each solution and evaluate the performance of both.
Suggestion: when you parallelize across convolutions, keep in mind that if multiple threads take subsequent sliding convolutions they will all need the same part of the input data, thus....
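The two strategies the assignment mentions could be sketched roughly as follows, assuming row-major `float` matrices and the fixed 4x4 kernel (the function names and data layout are my own assumptions, not part of the assignment):

```c
#include <omp.h>

#define K 4  /* fixed kernel size from the assignment */

/* Strategy A: parallelize ACROSS convolutions. Each thread computes complete
   4x4 convolutions for its share of output elements; the kernel and a small
   input neighbourhood stay hot in that thread's cache. */
void conv_parallel_outer(const float *in, const float *ker, float *out,
                         int rows, int cols)
{
    #pragma omp parallel for collapse(2) schedule(static)
    for (int i = 0; i <= rows - K; i++)
        for (int j = 0; j <= cols - K; j++) {
            float acc = 0.0f;
            for (int ki = 0; ki < K; ki++)
                for (int kj = 0; kj < K; kj++)
                    acc += in[(i + ki) * cols + (j + kj)] * ker[ki * K + kj];
            out[i * cols + j] = acc;  /* each thread writes disjoint outputs */
        }
}

/* Strategy B: parallelize INSIDE a single convolution with a reduction.
   For a 4x4 kernel this is only 16 multiply-adds, so thread start-up and
   the reduction usually cost more than the work itself. */
float conv_parallel_inner(const float *in, const float *ker,
                          int cols, int i, int j)
{
    float acc = 0.0f;
    #pragma omp parallel for collapse(2) reduction(+:acc)
    for (int ki = 0; ki < K; ki++)
        for (int kj = 0; kj < K; kj++)
            acc += in[(i + ki) * cols + (j + kj)] * ker[ki * K + kj];
    return acc;
}
```

Measuring both should make the trade-off visible: strategy A amortizes the parallel-region overhead over many output elements, while strategy B pays it per output element.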
You need to create an OpenMP file with the implementation of the convolution and a main
file for testing the function. The main will:
- read the input matrix from a text file (matrix.txt) - randomly generated or static, you choose
- read the kernel from a text file (kernel.txt) - fixed 4x4 size, you choose the values
- apply convolution and save the result in a file
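Since the assignment does not pin down a file format, here is one possible sketch of the I/O helpers the main could use, assuming a hypothetical format of a "rows cols" header line followed by whitespace-separated values in row-major order:

```c
#include <stdio.h>
#include <stdlib.h>

/* Assumed format: first line "rows cols", then values in row-major order.
   Returns a malloc'd buffer, or NULL on any read error. */
float *read_matrix(const char *path, int *rows, int *cols)
{
    FILE *f = fopen(path, "r");
    if (!f) return NULL;
    if (fscanf(f, "%d %d", rows, cols) != 2) { fclose(f); return NULL; }
    float *m = malloc((size_t)(*rows) * (size_t)(*cols) * sizeof *m);
    for (int i = 0; i < *rows * *cols; i++)
        if (fscanf(f, "%f", &m[i]) != 1) { free(m); fclose(f); return NULL; }
    fclose(f);
    return m;
}

/* Writes a matrix back out in the same assumed format. */
void write_matrix(const char *path, const float *m, int rows, int cols)
{
    FILE *f = fopen(path, "w");
    if (!f) return;
    fprintf(f, "%d %d\n", rows, cols);
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++)
            fprintf(f, "%g ", m[i * cols + j]);
        fputc('\n', f);
    }
    fclose(f);
}
```

The main would then just call `read_matrix` on matrix.txt and kernel.txt, run the convolution, and `write_matrix` the result.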
You need to present a performance report showing measurements of the execution time of the sequential implementation (to simplify, just set the number of threads to 1) and of various parallel implementations (degree of parallelism, thread distribution, thread grouping, etc.). Write your considerations in a PDF document and add it to the submission.
1 - consider "zero padding", by performing the convolution on the whole input matrix, up to the last column and the last row. Please do not add 3 extra rows and 3 extra columns of zeros to the input matrix; try smarter solutions. The output matrix will have the same size as the input matrix.
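One "smarter" alternative to materializing the padded matrix, sketched here as an assumption about what the assignment intends, is virtual padding: taps that would fall on the zero border are simply skipped, since they would contribute 0 to the sum anyway:

```c
#define K 4  /* fixed kernel size */

/* "Virtual" zero padding: no extra rows/columns are allocated. Out-of-range
   kernel taps are skipped because multiplying by the implicit zero padding
   would add nothing. The output has the same size as the input. */
void conv_zero_pad(const float *in, const float *ker, float *out,
                   int rows, int cols)
{
    #pragma omp parallel for collapse(2) schedule(static)
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++) {
            float acc = 0.0f;
            for (int ki = 0; ki < K; ki++) {
                int ii = i + ki;
                if (ii >= rows) break;          /* below the last row: all zeros */
                for (int kj = 0; kj < K; kj++) {
                    int jj = j + kj;
                    if (jj >= cols) break;      /* past the last column: zeros */
                    acc += in[ii * cols + jj] * ker[ki * K + kj];
                }
            }
            out[i * cols + j] = acc;
        }
}
```

This avoids both the extra memory and the copy of the input into a padded buffer, at the cost of two bounds checks that only ever fire near the border.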
2 - consider bigger input matrix sizes and discuss if/why the performance improves.
Does anyone have any ideas?
My idea was to create 10 submatrices, each managed by a thread, with a master thread overseeing the operation. Each submatrix would be of an appropriate size: for instance, with a 100x100 input matrix, it would be divided into 10 submatrices of 52x22 so that all possible combinations are covered.
This forum is specifically for discussing problems related to the Intel oneAPI DPC++/C++ Compiler.
