Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Scalable Memory Allocation using Intel TBB

Martin_T_1
Beginner

I want to allocate about 40 GB of RAM. My first try was:

#include <iostream>
#include <ctime>

int main(int argc, char** argv)
{
    unsigned long long  ARRAYSIZE = 20ULL * 1024ULL * 1024ULL * 1024ULL;
    unsigned __int16 *myBuff = new unsigned __int16[ARRAYSIZE];  // 3 GB/s: 40 GB / 13.7 s
    unsigned long long i = 0;
    const clock_t begintime = clock();
    for (i = 0; i < ARRAYSIZE; ++i){
        myBuff[i] = 0;
    }
    std::cout << "finish:  " << float(clock() - begintime) / CLOCKS_PER_SEC << std::endl;
    std::cin.get();
    return 0;
}

The memory write speed was about 3 GB/s, which was not satisfactory for my high-performance system.

So I tried Intel Cilk Plus as below:

	/*
	nworkers =  5; 8.5 s ==> 4.7 GB/s
	nworkers =  8; 8.2 s ==> 4.8 GB/s
	nworkers =  10; 9 s   ==> 4.5 GB/s
	nworkers = 32; 15 s  ==> 2.6 GB/s
	*/

#include "cilk\cilk.h"
#include "cilk\cilk_api.h"
#include <iostream>
#include <ctime>

int main(int argc, char** argv)
{
    unsigned long long  ARRAYSIZE = 20ULL * 1024ULL * 1024ULL * 1024ULL;
    unsigned __int16 *myBuff = new unsigned __int16[ARRAYSIZE];
    if (0 != __cilkrts_set_param("nworkers", "32")){
	std::cout << "Error" << std::endl;
    }
    const clock_t begintime = clock();
    cilk_for(long long j = 0; j < ARRAYSIZE; ++j){
        myBuff[j] = 0;
    }
    std::cout << "finish:  " << float(clock() - begintime) / CLOCKS_PER_SEC << std::endl;
    std::cin.get();
    return 0;
}

The results are commented above the code. As can be seen, there is a speedup for nworkers = 8,
but the larger nworkers is, the slower the loop becomes. I thought maybe it was due to locking between threads,
so I tried the scalable allocator provided by Intel TBB:

#include "tbb\task_scheduler_init.h"
#include "tbb\blocked_range.h"
#include "tbb\parallel_for.h"
#include "tbb\scalable_allocator.h"
#include "cilk\cilk.h"
#include "cilk\cilk_api.h"
#include <iostream>
#include <ctime>
// No retry loop because we assume that scalable_malloc does
// all it takes to allocate the memory, so calling it repeatedly
// will not improve the situation at all
//
// No use of std::new_handler because it cannot be done in portable
// and thread-safe way (see sidebar)
//
// We throw std::bad_alloc() when scalable_malloc returns NULL
//(we return NULL if it is a no-throw implementation)

void* operator new (size_t size) throw (std::bad_alloc)
{
	if (size == 0) size = 1;
	if (void* ptr = scalable_malloc(size))
		return ptr;
	throw std::bad_alloc();
}

void* operator new[](size_t size) throw (std::bad_alloc)
{
	return operator new (size);
}

void* operator new (size_t size, const std::nothrow_t&) throw ()
{
	if (size == 0) size = 1;
	if (void* ptr = scalable_malloc(size))
		return ptr;
	return NULL;
}

void* operator new[](size_t size, const std::nothrow_t&) throw ()
{
	return operator new (size, std::nothrow);
}

void operator delete (void* ptr) throw ()
{
	if (ptr != 0) scalable_free(ptr);
}

void operator delete[](void* ptr) throw ()
{
	operator delete (ptr);
}

void operator delete (void* ptr, const std::nothrow_t&) throw ()
{
	if (ptr != 0) scalable_free(ptr);
}

void operator delete[](void* ptr, const std::nothrow_t&) throw ()
{
	operator delete (ptr, std::nothrow);
}



int main(int argc, char** argv)
{
    unsigned long long ARRAYSIZE = 20ULL * 1024ULL * 1024ULL * 1024ULL;
    tbb::task_scheduler_init tbb_init;
    unsigned __int16 *myBuff = new unsigned __int16[ARRAYSIZE];  // goes through the overloaded operator new[]
    if (0 != __cilkrts_set_param("nworkers", "10")){
        std::cout << "Error" << std::endl;
    }
    const clock_t begintime = clock();
    cilk_for(long long j = 0; j < ARRAYSIZE; ++j){
        myBuff[j] = 0;
    }
    std::cout << "finish:  " << float(clock() - begintime) / CLOCKS_PER_SEC << std::endl;
    std::cin.get();
    return 0;
}

(The above code is adapted from the Intel TBB book by James Reinders, O'Reilly.)
But the results are almost identical to the previous try. I set the TBB_VERSION environment variable to check whether I am really using scalable_malloc; the reported information is in this picture (nworkers = 32):

https://www.dropbox.com/s/y1vril3f19mkf66/TBB_Info.png?dl=0


I would like to know what is wrong with my code. I expect a memory write speed of at least about 40 GB/s.
How should I use the scalable allocator correctly?
Can somebody please present a simple, verified example of using the scalable allocator from Intel TBB?

Environment:
Intel Xeon CPU E5-2690 0 @ 2.90 GHz (2 processors), 224 GB RAM (2 * 7 * 16 GB) DDR3 1600 MHz, Windows Server 2008 R2 Datacenter,
Microsoft Visual Studio 2013, and Intel C++ Compiler 2017.

Alexei_K_Intel
Employee

Hi Martin,

Your approach is similar to the idea of the STREAM benchmark (https://www.cs.virginia.edu/stream/). There are a lot of aspects that should be considered to achieve maximum performance, e.g. NUMA, cache organization (you cannot write a cache line if you have not read it yet), compiler optimization options, and so on. To avoid some of these problems you can try std::memset, which is expected to be highly optimized for single-thread performance.
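
For example, a minimal single-threaded baseline could look like this (a rough sketch; the 40 GB size is hard-coded and std::chrono is used for wall-clock timing instead of clock()):

#include <cstring>   // std::memset
#include <chrono>
#include <iostream>

int main()
{
    const unsigned long long ARRAYSIZE = 20ULL * 1024ULL * 1024ULL * 1024ULL;
    unsigned short *myBuff = new unsigned short[ARRAYSIZE];   // 40 GB
    const auto begin = std::chrono::steady_clock::now();
    std::memset(myBuff, 0, ARRAYSIZE * sizeof(unsigned short));
    const std::chrono::duration<double> sec = std::chrono::steady_clock::now() - begin;
    std::cout << "memset: " << 40.0 / sec.count() << " GB/s" << std::endl;
    delete[] myBuff;
    return 0;
}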

As for the allocator, it usually matters in scenarios where memory is allocated and deallocated concurrently from multiple threads. If you allocate the memory as one single block, it should not affect you.
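
For illustration, the kind of pattern the scalable allocator is designed for looks more like this (a minimal sketch): many small blocks allocated and freed concurrently, where a single lock-protected heap would become the bottleneck.

#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/scalable_allocator.h"

int main()
{
    // Many small, concurrent allocations/deallocations: here the
    // per-thread heaps of scalable_malloc avoid heap-lock contention.
    tbb::parallel_for(tbb::blocked_range<int>(0, 1000000),
        [](const tbb::blocked_range<int>& r) {
            for (int i = r.begin(); i != r.end(); ++i) {
                void* p = scalable_malloc(64);   // a small node
                scalable_free(p);
            }
        });
    return 0;
}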

Regards,
Alex

jimdempseyatthecove
Honored Contributor III

>>(you cannot write a cache line if you have not read it yet)

Unless you are using AVX-512 aligned stores, which write an entire cache line.

The scalable allocator's advantages come from a reduction in the latency of re-allocating nodes previously allocated and returned by the same thread. Think of it as each thread having its own heap, which eliminates (most) critical sections. The thread heaps are not organized like the CRTL heap; they are most efficient for allocation/reallocation of high numbers of similar-sized nodes. On NUMA systems some additional advantages can be attained ... provided the allocation, use, and deallocation are performed by the same thread. Cross-thread allocation/deallocation will lead to inefficiencies.

Additional Note 2. The first allocation within any address range of the virtual machine will typically encounter a "first touch" overhead as the touches enter pages not previously touched. Page sizes can vary; the default is likely 4 KB (but can be 2 MB or 1 GB). On a NUMA system, a first-touch page fault will (usually) allocate/map from the NUMA node of the running thread. It is the programmer's responsibility to ensure that the allocation and deallocation of a node are performed by the same thread, AND that the allocation is used either by the same thread OR by a thread on the same NUMA node.
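
A rough, Windows-specific sketch of the page-size point (illustrative only; MEM_LARGE_PAGES requires the "Lock pages in memory" privilege for the account running the process):

#include <windows.h>
#include <iostream>

int main()
{
    // Large pages must be reserved and committed in a single call.
    const SIZE_T bytes = 40ULL * 1024ULL * 1024ULL * 1024ULL;
    std::cout << "large page size: " << GetLargePageMinimum() << " bytes" << std::endl;
    void* p = VirtualAlloc(NULL, bytes,
                           MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
                           PAGE_READWRITE);
    if (p == NULL)
        std::cout << "VirtualAlloc failed, error " << GetLastError() << std::endl;
    return 0;
}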

Additional Note 3. TBB is typically affinity agnostic (and thus also NUMA agnostic). However, there are extensions to TBB that interact with affinity. The TBB implementation of affinity is (en)cumbersome IMHO.

Jim Dempsey

Alexei_K_Intel
Employee

jimdempseyatthecove wrote:

>>(you cannot write a cache line if you have not read it yet)

Unless you are using AVX-512 aligned stores, which write an entire cache line.

As far as I know, no processor in the Intel Xeon E5 family supports AVX-512.

Just out of curiosity, are you sure that is really so? Do you have any educational materials and/or a reference?

jimdempseyatthecove wrote:

The scalable allocator's advantages come from a reduction in the latency of re-allocating nodes previously allocated and returned by the same thread. ...

According to the code, there is only one allocation, and I do not understand how the allocator can help in this scenario.

jimdempseyatthecove wrote:

Additional Note 2. The first allocation within any address range of the virtual machine will typically encounter a "first touch" overhead as the touches enter pages not previously touched. ...

Additional Note 3. TBB is typically affinity agnostic (and thus also NUMA agnostic). ...

Are you sure that the first-touch policy works on Windows?

If we suppose that the first-touch policy works, how can affinity help in this scenario? There is only one parallel loop, and the OS usually tends not to move threads across CPUs during computations.

Regards,
Alex

jimdempseyatthecove
Honored Contributor III

>>Are you sure that the first-touch policy works on Windows?

I do not have a NUMA Windows system, so I cannot say with absolute certainty. That said, an observation seems to indicate it will: on the Windows versions I have used, a first allocation will usually succeed, only to be followed shortly by a page fault if the page file is too small to map the additional pages (i.e., when the page file is exhausted). IOW, mapping of physical resources is not performed until first touch.

In prior years, when I did have a NUMA Windows system (AMD Opteron), first touch of a page did appear to map it to the preferred node of the thread performing the touch. While I cannot test this today, I doubt MS would remove this behavior (or feature, if you prefer).

>>If we suppose that the first-touch policy works, how can affinity help in this scenario? There is only one parallel loop, and the OS usually tends not to move threads across CPUs during computations.

Using static scheduling of your loops will partition the loops the same way on each call. Note, it is important that the first-touching thread is preponderantly the thread that uses the page the most. This generally means (see the sketch after the note below):

a) some arbitrary thread allocates the memory
b) you initialize the memory using a statically scheduled loop across the same set of (all) threads (or all threads of the nest level) that will be using the array
c) you do not return this memory; you keep it around for reuse for the same purpose.

NOTE: Your BIOS must be set up to configure memory as NUMA. Your Windows version must support NUMA (Windows Server does; I cannot say about Windows 10 Home/Pro).
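
A minimal sketch of steps (a)-(c) using OpenMP static scheduling (illustrative only; OpenMP is used here because it provides schedule(static), which TBB's parallel_for does not expose):

void touch_pages(unsigned short* buf, long long n)
{
    // (b) first touch with a static schedule: each thread faults in the
    // pages of its own chunk, so those pages are mapped on its NUMA node
    #pragma omp parallel for schedule(static)
    for (long long i = 0; i < n; ++i)
        buf[i] = 0;
}

void reuse(unsigned short* buf, long long n)
{
    // (c) later passes use the same static partitioning, so each thread
    // touches the same (now local) pages again
    #pragma omp parallel for schedule(static)
    for (long long i = 0; i < n; ++i)
        buf[i] = 0;
}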

Jim Dempsey

Alexei_K_Intel
Employee

jimdempseyatthecove wrote:

Using static scheduling of your loops will partition the loops the same way on each call. ...

The example makes only one pass over the allocated memory. So if the first-touch policy works, then for each new page accessed, the OS will allocate and zero the page; therefore, you will never see the maximum possible memory bandwidth. If the first-touch policy does not work and memory is zeroed at allocation time, that is probably inefficient for a NUMA system. Hence, neither case depends on affinity.

The author's system definitely supports NUMA because "Windows Server 2008 R2 Datacenter" is used.

Regards,
Alex

jimdempseyatthecove
Honored Contributor III

His problem is that TBB is placement agnostic to a great extent. IOW, you have to go through some peculiar measures to split loops into partitions that are serviced by the same threads. TBB does not have static scheduling for parallel_for. The NUMA advantage can be attained when each thread manages the entirety/majority of a specific allocation; good for OOP, but not for large-array processing. For large-array processing that may benefit from NUMA, OpenMP may be a better choice.
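
For reference, the closest thing among the TBB affinity extensions is tbb::affinity_partitioner, which tries to replay the same chunk-to-thread mapping across calls over the same range; a minimal, untested sketch:

#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

// A reused affinity_partitioner remembers which thread ran which chunk
// and attempts to repeat that mapping on subsequent calls.
static tbb::affinity_partitioner ap;

void touch(unsigned short* buf, size_t n)
{
    tbb::parallel_for(tbb::blocked_range<size_t>(0, n),
        [buf](const tbb::blocked_range<size_t>& r) {
            for (size_t i = r.begin(); i != r.end(); ++i)
                buf[i] = 0;
        }, ap);
}

Whether that replay is strong enough to preserve NUMA locality is exactly what is in question here.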

Jim Dempsey

Alexei_K_Intel
Employee

The example makes only one pass over the memory. How can the placement of threads affect the performance? (Of course, we assume that the OS does not bind two or more threads to the same hardware thread.)

Regards,
Alex

Martin_T_1
Beginner

Thanks for the points mentioned by the experts. So isn't there any mistake in my code? Did I implement the parallel memory writing correctly?

Can somebody please present a simple, verified example of using the scalable allocator from Intel TBB?
