Performance issue of scalable_realloc() vs. glibc realloc()

Artur_B_ · ‎12-12-2014

Hi,

I have an application that initially allocates a 4MB block of memory and linearly grows this block using 4MB steps. glibc's realloc() handles this pattern quite well. Linking against tbbmalloc_proxy (or replacing glibc realloc() with scalable_realloc()) degrades performance very badly. The following test program demonstrates the issue:

#include <string>
#include <iostream>
#include <stdlib.h>
#include <tbb/scalable_allocator.h>

const size_t start = 4 * 1024 * 1024;
const size_t step = 4 * 1024 * 1024;
const size_t end = 1024 * 1024 * 1024;

struct libcalloc
{
	void * realloc(void * ptr, size_t sz)
	{
		return ::realloc(ptr, sz);
	}
};

struct tbballoc
{
	void * realloc(void * ptr, size_t sz)
	{
		return scalable_realloc(ptr, sz);
	}
};

template<typename alloc>
void test()
{
	char * ptr = NULL;
	for (size_t prev = 0, cur = start; cur <= end; cur += step)
	{
		ptr = (char *) alloc().realloc(ptr, cur);
		for (; prev < cur; prev++)
			ptr[prev] = prev;
	}
	alloc().realloc(ptr, 0);
}

int main(int argc, char *argv[])
{
	const std::string libc = "libc";
	const std::string tbb = "tbb";
	if (argc != 2 || (argv[1] != libc && argv[1] != tbb))
	{
		std::cout << "Usage: " << argv[0] << " " << libc << "|" << tbb << std::endl;
		return 1;
	}

	if (argv[1] == libc)
		test<libcalloc>();
	else
		test<tbballoc>();
	return 0;
}

Timing this program shows the libc version to take 2.5 seconds while the TBB version takes 56.5 seconds:

$ time ./test libc
2.15user 0.33system 0:02.49elapsed 99%CPU (0avgtext+0avgdata 1049660maxresident)k
0inputs+0outputs (0major+131749minor)pagefaults 0swaps
$ time ./test tbb
30.57user 25.81system 0:56.51elapsed 99%CPU (0avgtext+0avgdata 2092092maxresident)k
0inputs+0outputs (0major+167339minor)pagefaults 0swaps

glibc realloc() uses mremap() to achieve its performance. Quoting from the mremap man page:

mremap() uses the Linux page table scheme. mremap() changes the mapping between virtual addresses and memory pages. This can be used to implement a very efficient realloc(3).

Can we expect TBB to implement similarly efficient scalable_realloc() on Linux?

By the way: after changing linear growth by exponential glibc realloc() still wins, though the difference is much smaller in that case (2.4sec vs. 2.7sec).

Best regards,
Artur

Alexandr_K_Intel1 · ‎12-14-2014

Artur, Thanks, that looks like a useful optimization. Could you say more about using realloc in your workload? For example, are in interesting in size decreasing during realloc? I wish I had better understanding why someone need realloc despite expected “if vector became filled, let’s make it twice as big”.

jimdempseyatthecove · ‎01-02-2015

Artur,

The benefit of the scalable allocator is not achieved on the first allocation. It is experienced on the reallocation of a node that was previously allocated. Try something like:

    for(int i=0; i < 20; ++i) {
 if (argv[1] == libc)
  test<libcalloc>();
 else
  test<tbballoc>();
    }

Jim Dempsey

Artur_B_ · ‎01-03-2015

Hi Alexandr,

Alexandr Konovalov (Intel) wrote:

Could you say more about using realloc in your workload? For example, are in interesting in size decreasing during realloc?

No, in this application the buffer grows until it consumes all relevant data and is then flushed to disk.

I wish I had better understanding why someone need realloc despite expected “if vector became filled, let’s make it twice as big”.

I am not the original author of the code; I am just helping debug a performance regression when linking the application against libtbbmalloc_proxy. That said, I can speculate why one would choose such a design:

Growing the buffer exponentially instead of linearly has a bigger probability of exhausting the address space on 32-bit architectures. In this application the buffer can easily grow beyond 1GB.
This approach has worked quite well with glibc realloc(). Better than std::vector in fact as the latter cannot avoid the allocate + copy dance.

By the way, I did a micro-benchmark to see how efficient mremap() is relative to scalable_realloc():

template<typename alloc>
void test_one_big_realloc()
{
	char * ptr = (char *) alloc().realloc(NULL, 1GB);
	for (size_t i = 0; i < 1GB; i++)
		ptr = i;
	time_start();
	ptr = (char *) alloc().realloc(ptr, 2GB);
	time_print();
	alloc().realloc(ptr, 0);
}

glibc realloc() consistently completes the reallocation in 0.02s while scalable_realloc() takes 0.43s to relocate the buffer. I checked using strace that mremap() did indeed change ptr in my test:

mremap(0x7f76ba149000, 1073745920, 2147487744, MREMAP_MAYMOVE) = 0x7f763a148000

Best regards,
Artur

Artur_B_ · ‎01-03-2015

Hi Jim,

I tried your suggestion. The result is not much better:

$ time ./test libc
7.80user 6.89system 0:14.76elapsed 99%CPU (0avgtext+0avgdata 1049656maxresident)k
0inputs+0outputs (0major+2627075minor)pagefaults 0swaps
$ time ./test tbb
576.12user 519.14system 18:16.93elapsed 99%CPU (0avgtext+0avgdata 2092620maxresident)k
0inputs+0outputs (0major+3156735minor)pagefaults 0swaps

That's 15 seconds for glibc realloc() vs. 18 minutes for scalable_realloc().

Best regards,
Artur

Alexandr_K_Intel1 · ‎06-15-2015

Support for realloc via mremap was added in TBB 4.3 Update 6 (for object > 1 MB).