Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Is it possible to write to system memory at the full available bandwidth?

Victor_D_
New Contributor I
3,698 Views

My i7-12700H laptop is supposed to support a memory bandwidth of nearly 64 GBytes/sec. When I run a parallel fill (dpl::), I get only about 16 GBytes/sec.

On a 48-core Intel Xeon 8275CL AWS node, which has 240 GBytes/sec of bandwidth, parallel fill (dpl::) reaches nearly 40 GBytes/sec.

Is it possible to get closer to the full bandwidth of system memory?

Thank you,

-Victor

13 Replies
VaishnaviV_Intel
Employee
3,676 Views

Hi,


Thanks for posting on Intel communities.


Could you please share the following details with us:

1. A sample reproducer

2. How you are measuring bandwidth usage


Thanks & Regards,

Vankudothu Vaishnavi.


VaishnaviV_Intel
Employee
3,608 Views

Hi,


We have not heard back from you. Could you please provide us with an update on your issue?


Thanks & Regards,

Vankudothu Vaishnavi.


Victor_D_
New Contributor I
3,590 Views

Here is my benchmark implementation:

#define DPL_ALGORITHMS

#ifdef DPL_ALGORITHMS
// oneDPL headers should be included before standard headers
#include <oneapi/dpl/algorithm>
#include <oneapi/dpl/execution>
#include <oneapi/dpl/iterator>
#else
#include <algorithm>
#include <execution>
#include <iterator>
#endif

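#include <chrono>  // high_resolution_clock, duration, duration_cast
#include <cstdio>  // printf
#include <iomanip>
#include <iostream>
#include <random>
#include <vector>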

#include <immintrin.h>

#include <sycl/sycl.hpp>

#include "utils.hpp"
using namespace sycl;
using namespace std;
using std::chrono::duration;
using std::chrono::duration_cast;
using std::chrono::high_resolution_clock;
using std::milli;

void print_results(const char* const tag, const vector<int>& sorted,
                   high_resolution_clock::time_point startTime,
                   high_resolution_clock::time_point endTime)
{
    printf("%s: Lowest: %d Highest: %d Time: %fms\n", tag, sorted.front(), sorted.back(),
           duration_cast<duration<double, milli>>(endTime - startTime).count());
}

void fill_benchmark()
{
    std::vector<int> data(100000000);

    auto startTime = high_resolution_clock::now();
    std::fill(std::execution::seq, data.begin(), data.end(), 42);
    auto endTime = high_resolution_clock::now();
    print_results("Serial std::fill", data, startTime, endTime);

    //startTime = high_resolution_clock::now();
    //std::fill(std::execution::unseq, data.begin(), data.end(), 42);
    //endTime = high_resolution_clock::now();
    //print_results("SIMD Fill", data, startTime, endTime);

    startTime = high_resolution_clock::now();
    std::fill(std::execution::par, data.begin(), data.end(), 42);
    endTime = high_resolution_clock::now();
    print_results("Parallel std::fill", data, startTime, endTime);

    startTime = high_resolution_clock::now();
    std::fill(std::execution::par_unseq, data.begin(), data.end(), 42);
    endTime = high_resolution_clock::now();
    print_results("Parallel SIMD std::fill", data, startTime, endTime);

#ifdef DPL_ALGORITHMS
    startTime = high_resolution_clock::now();
    std::fill(oneapi::dpl::execution::seq, data.begin(), data.end(), 42);
    endTime = high_resolution_clock::now();
    print_results("Serial dpl::fill", data, startTime, endTime);

    startTime = high_resolution_clock::now();
    std::fill(oneapi::dpl::execution::unseq, data.begin(), data.end(), 42);
    endTime = high_resolution_clock::now();
    print_results("SIMD dpl::fill", data, startTime, endTime);

    startTime = high_resolution_clock::now();
    std::fill(oneapi::dpl::execution::par, data.begin(), data.end(), 42);
    endTime = high_resolution_clock::now();
    print_results("Parallel dpl::fill", data, startTime, endTime);

    startTime = high_resolution_clock::now();
    std::fill(oneapi::dpl::execution::par_unseq, data.begin(), data.end(), 42);
    endTime = high_resolution_clock::now();
    print_results("Parallel SIMD dpl::fill", data, startTime, endTime);

    startTime = high_resolution_clock::now();
    std::fill(oneapi::dpl::execution::dpcpp_default, data.begin(), data.end(), 42);
    endTime = high_resolution_clock::now();
    print_results("Parallel DPCPP_DEFAULT dpl::fill", data, startTime, endTime);
#endif
}

int main()
{
    fill_benchmark();
    return 0;
}

 

On a 48-core AWS instance (Xeon 8275CL) this runs at 28 GBytes/sec, out of over 200 GBytes/sec available.

On a 14-core i7-12700H laptop, the performance is nearly the same.

VaishnaviV_Intel
Employee
3,559 Views

Hi,

 

Thank you for sharing your code with us. However, when we tried to run the code, we encountered a fatal error, which prevented us from proceeding.

[Screenshot: fatal error output]

 

We therefore commented out the problematic section and were then able to obtain some output.

[Screenshot: program output after commenting out the problematic section]

 

Could you please provide us with the steps you followed to execute the code? Additionally, it would be helpful if you could share the specific command line you used to run the code.

Furthermore, we would like clarification on how you are measuring bandwidth usage, so that we can investigate the issue further on our end.

 

Thank you and regards,

Vankudothu Vaishnavi.

 

Victor_D_
New Contributor I
3,550 Views

So glad you're able to compile and run the code. Nice! You compiled and ran from the command line.

I'm building and running in Visual Studio Community 2022, which uses the Intel compiler for this project.

The way I compute bandwidth for fill: the array is 100,000,000 32-bit integers, so 100,000,000 integers * 4 bytes/integer = 400,000,000 bytes. In your case, filling the array took 47.7 milliseconds, so the bandwidth is 400,000,000 bytes / 0.0477 seconds = 8.385 billion bytes/second.

I should have the code compute it.
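A minimal sketch of how the benchmark could compute the bandwidth itself, following the arithmetic above (bytes written divided by elapsed seconds). The helper names `bandwidth_gb_per_sec` and `time_fill_seconds` are hypothetical and not part of the posted benchmark, and `steady_clock` is used here in place of `high_resolution_clock` as an assumption about what suits interval timing.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: bytes moved divided by elapsed time,
// reported in GB/s where 1 GB = 10^9 bytes.
double bandwidth_gb_per_sec(std::size_t bytes, double seconds)
{
    assert(seconds > 0.0);
    return static_cast<double>(bytes) / seconds / 1e9;
}

// Hypothetical helper: times a single serial std::fill over `data`
// and returns the elapsed time in seconds.
double time_fill_seconds(std::vector<int>& data, int value)
{
    const auto start = std::chrono::steady_clock::now();
    std::fill(data.begin(), data.end(), value);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

For the numbers quoted above, `bandwidth_gb_per_sec(400000000, 0.0477)` gives roughly 8.39 GB/s. Passing `data.size() * sizeof(int)` as the byte count keeps the calculation correct if the element type or array size changes.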

-Victor

VaishnaviV_Intel
Employee
3,465 Views

Hi,


Thanks for the explanation.

Could you please elaborate more on your issue?

Are you attempting to utilize the maximum available bandwidth for parallel fill (dpl::),

i.e., std::fill(oneapi::dpl::execution::par, data.begin(), data.end(), 42)?

Also, could you please explain your purpose or intention behind adding std::fill() in your code?


Thanks & Regards,

Vankudothu Vaishnavi.


Victor_D_
New Contributor I
3,458 Views

Hi,

I'm working on other parallel algorithms which are limited by write speed to system memory. Benchmarking std::fill pointed out an issue, either with the code or with the Intel CPU architecture, where writes don't seem to use all of the available system memory bandwidth. I'm trying to get to the root cause of this performance issue. The code above is a benchmark showing that even std::fill with parallel execution can't seem to utilize all of the available bandwidth.
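One variable that could be isolated when chasing write bandwidth is the kind of store instruction used. Ordinary stores to cacheable memory typically read each cache line before overwriting it, so a fill can generate more memory traffic than the bytes it writes; non-temporal (streaming) stores are designed to bypass that. The sketch below is an x86-only illustration using SSE2 intrinsics from `<immintrin.h>`. The function name `stream_fill` is hypothetical, and whether it helps on the machines in this thread is untested here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <immintrin.h>  // x86 SSE/SSE2 intrinsics

// Hypothetical helper: fills data[0..count) with `value` using
// non-temporal 16-byte stores where alignment allows.
void stream_fill(int* data, std::size_t count, int value)
{
    assert(data != nullptr || count == 0);
    std::size_t i = 0;

    // Scalar stores until the address is 16-byte aligned,
    // which _mm_stream_si128 requires.
    while (i < count && reinterpret_cast<std::uintptr_t>(data + i) % 16 != 0)
        data[i++] = value;

    // Main loop: four 32-bit ints per streaming store.
    const __m128i v = _mm_set1_epi32(value);
    for (; i + 4 <= count; i += 4)
        _mm_stream_si128(reinterpret_cast<__m128i*>(data + i), v);

    // Scalar tail for the remaining 0..3 elements.
    for (; i < count; ++i)
        data[i] = value;

    // Order the streaming stores before any later stores.
    _mm_sfence();
}
```

To compare against the posted benchmark, each thread of a parallel loop could call this on its own chunk, for example `stream_fill(data.data() + begin, end - begin, 42)`, and the result could be timed the same way as the std::fill variants.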

-Victor

VaishnaviV_Intel
Employee
3,352 Views

Hi,


We are working on this issue internally. We will get back to you soon.


Thanks & Regards,

Vankudothu Vaishnavi.


VaishnaviV_Intel
Employee
3,285 Views

Hi,

 


We apologize for the delay in our response. We wanted to inform you that we are actively working on your issue and will provide you with an update shortly.

 

Thank you for your patience and understanding.

 


Thanks & Regards,

Vankudothu Vaishnavi


Victor_D_
New Contributor I
3,261 Views

Thank you for the update, Vankudothu. Looking forward to your findings.

-Victor

VaishnaviV_Intel
Employee
3,174 Views

Hi,

 

Thanks for your patience and understanding.

 

The fill() operation fills a given memory destination with a constant value. It is more efficient to route this operation through compute units rather than the memory subsystem. Can you please explain how you are certain that this is a test for memory bandwidth?

Also, could you please try a copy operation (using std::copy()) with an array of the same size and check whether it is faster on the same machine?
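The suggested copy test could be sketched as follows. This is a serial version under stated assumptions: the names `CopyResult` and `time_copy` are hypothetical, and the byte count includes both the bytes read from the source and the bytes written to the destination, which is one common convention for reporting copy bandwidth.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical result type: elapsed time and derived bandwidth.
struct CopyResult {
    double seconds;
    double gb_per_sec;  // 1 GB = 10^9 bytes; 0 if the interval was too short to measure
};

// Hypothetical helper: times one std::copy from src into dst.
CopyResult time_copy(const std::vector<int>& src, std::vector<int>& dst)
{
    assert(src.size() == dst.size());

    const auto start = std::chrono::steady_clock::now();
    std::copy(src.begin(), src.end(), dst.begin());
    const auto stop = std::chrono::steady_clock::now();

    const double seconds = std::chrono::duration<double>(stop - start).count();
    // Count bytes read plus bytes written.
    const double bytes = 2.0 * static_cast<double>(src.size()) * sizeof(int);
    return {seconds, seconds > 0.0 ? bytes / seconds / 1e9 : 0.0};
}
```

With two vectors of 100,000,000 ints, as in the fill benchmark, the reported figure can be set next to the fill results. An execution policy could be passed to std::copy in the same way the posted benchmark passes one to std::fill.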

 

Thanks & regards,

Vankudothu Vaishnavi.

 

VaishnaviV_Intel
Employee
3,141 Views

Hi,


We have not heard back from you. Could you please provide us with the details we asked for?


Thanks & Regards,

Vankudothu Vaishnavi.


VaishnaviV_Intel
Employee
3,036 Views

Hi,


We have not heard back from you, so we will go ahead and close this thread.

If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.


Thanks & Regards,

Vankudothu Vaishnavi.

