velvia · ‎05-03-2015

Hi,

I am working on the design of my own numerical library (with il::Vector, il::StaticVector, il::Matrix, il::StaticMatrix) with some improvements over the STL (indices are signed, il::Vector<double> is initialised to NaN in debug mode, etc.). I have been thinking for a long time about a good design for custom allocators, but I can't make up my mind. The STL way of doing it invades the type system and makes it a pain to use. The Bloomberg way of doing it uses dynamic dispatch for memory allocation and does not invade the type system. It looked like the right solution to me until I saw Chandler Carruth from Google saying that compilers can optimise away memory allocations if they can see them. This kind of optimisation is obviously lost in the Bloomberg model. It took me a while to find an example where memory allocations are optimised away, but the following code

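#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

// Builds a fresh vector of n doubles on every call (normally one heap allocation).
std::vector<double> f_val(std::size_t i, std::size_t n) {
    auto v = std::vector<double>( n );
    for (std::size_t k = 0; k < v.size(); ++k) {
        v[k] = static_cast<double>(k + i);
    }
    return v;
}

int main (int argc, char const *argv[])
{
    const auto n = std::size_t{10};
    const auto nb_loops = std::size_t{300000000};

    auto v = std::vector<double>( n, 0.0 );
    auto start_time = std::chrono::high_resolution_clock::now();
    for (std::size_t i = 0; i < nb_loops; ++i) {
        auto w = f_val(i, n);  // the allocation for w is the one that can be removed
        for (std::size_t k = 0; k < v.size(); ++k) {
            v[k] += w[k];
        }
    }
    auto end_time = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed = end_time - start_time;

    std::cout << v[0] << " " << v[n - 1] << std::endl;
    std::cout << "elapsed: " << elapsed.count() << " s" << std::endl;

    return 0;
}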

blew my mind when compiled with clang. There is only one memory allocation in the whole program: the one for v. It seems that the call to f_val is inlined, the loops are fused, and then the memory allocation for w is completely removed! Neither gcc nor icpc does this kind of optimisation. It's not really clear whether this kind of optimisation is allowed by the standard, but there is a proposal to clarify that point: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3664.html . Is there any work at Intel on doing this kind of optimisation?
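For context, the two allocator designs mentioned at the top of this post can be sketched roughly as follows. This is only an illustration under my own assumptions: TaggedAllocator, MemoryResource, CountingResource and DynVector are hypothetical names, not part of il:: or of Bloomberg's library. In the STL model the allocator is part of the container type; in the dynamic-dispatch model the container type stays the same and every allocation goes through a virtual call that the optimiser usually cannot see through.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <type_traits>
#include <vector>

// --- STL model: the allocator is a template parameter of the container ---
// Hypothetical minimal allocator that simply forwards to operator new/delete.
template <typename T>
struct TaggedAllocator {
    using value_type = T;
    TaggedAllocator() = default;
    template <typename U>
    TaggedAllocator(const TaggedAllocator<U>&) noexcept {}
    T* allocate(std::size_t n) {
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) noexcept { ::operator delete(p); }
};
template <typename T, typename U>
bool operator==(const TaggedAllocator<T>&, const TaggedAllocator<U>&) noexcept { return true; }
template <typename T, typename U>
bool operator!=(const TaggedAllocator<T>&, const TaggedAllocator<U>&) noexcept { return false; }

using StlModelVector = std::vector<double, TaggedAllocator<double>>;

// Changing the allocator changes the type: a function taking std::vector<double>&
// cannot accept a StlModelVector.
static_assert(!std::is_same<StlModelVector, std::vector<double>>::value,
              "the allocator is part of the container type");

// --- Dynamic-dispatch model: the allocator is a run-time object ---
struct MemoryResource {
    virtual ~MemoryResource() = default;
    virtual void* allocate(std::size_t bytes) = 0;
    virtual void deallocate(void* p) noexcept = 0;
};

// Hypothetical resource that counts the allocations routed through it.
struct CountingResource : MemoryResource {
    std::size_t allocations = 0;
    void* allocate(std::size_t bytes) override {
        ++allocations;
        return ::operator new(bytes);
    }
    void deallocate(void* p) noexcept override { ::operator delete(p); }
};

// Hypothetical fixed-size array of double; its type does not depend on the resource.
class DynVector {
public:
    DynVector(std::size_t n, MemoryResource& r)
        : size_(n), resource_(&r),
          data_(static_cast<double*>(r.allocate(n * sizeof(double)))) {
        for (std::size_t k = 0; k < n; ++k) data_[k] = 0.0;
    }
    ~DynVector() { resource_->deallocate(data_); }
    DynVector(const DynVector&) = delete;
    DynVector& operator=(const DynVector&) = delete;
    std::size_t size() const { return size_; }
    double& operator[](std::size_t k) {
        assert(k < size_);  // debug-mode bounds check
        return data_[k];
    }
private:
    std::size_t size_;
    MemoryResource* resource_;
    double* data_;
};
```

Two DynVector objects built on different resources have the same type, which is the convenience of the dynamic-dispatch design. The price is the virtual allocate call in the constructor, which is the kind of call the compiler is unlikely to remove.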

Anoop_M_Intel · ‎05-04-2015

Hi Velvia,

Will changing the line below

auto w = f_val(i, n);

to

auto w = std::move(f_val(i, n));

force the compiler to use the move constructor, which would avoid the extra memory allocation for "w"?

Thanks and Regards
Anoop

velvia · ‎05-05-2015

Hi Anoop,

There is a misunderstanding.

std::move is useless here, as f_val(i, n) is already an rvalue (a temporary).
The line "auto w = f_val(i, n);" already contains 2 copy elisions: the first one when returning the object and the second one when constructing w. Therefore move semantics are useless here.
What is amazing is that this line should trigger one allocation (the one made inside f_val). But with LLVM and libc++, this allocation is completely removed. There is not a single allocation happening in this loop and w is never created. This has been completely legal since N3664, which was approved for C++14, but it was unclear before. LLVM has been doing this kind of optimisation for 2 years now.
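To make the point about std::move concrete, here is a small sketch that counts constructor calls. Tracked, make_tracked and the two helper functions are hypothetical names introduced only for this illustration; make_tracked has the same shape as f_val (it builds a named local and returns it by value).

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical wrapper that records how often it is copy- or move-constructed.
struct Tracked {
    static int copies;
    static int moves;
    std::vector<double> data;
    explicit Tracked(std::size_t n) : data(n) {}
    Tracked(const Tracked& other) : data(other.data) { ++copies; }
    Tracked(Tracked&& other) noexcept : data(std::move(other.data)) { ++moves; }
};
int Tracked::copies = 0;
int Tracked::moves = 0;

// Same shape as f_val: fill a named local and return it by value.
Tracked make_tracked(std::size_t i, std::size_t n) {
    Tracked t(n);
    for (std::size_t k = 0; k < t.data.size(); ++k) {
        t.data[k] = static_cast<double>(k + i);
    }
    return t;
}

// Number of move constructions for "auto w = make_tracked(...)".
int moves_without_std_move() {
    Tracked::moves = 0;
    auto w = make_tracked(0, 10);
    assert(w.data.size() == 10);
    return Tracked::moves;
}

// Number of move constructions for "auto w = std::move(make_tracked(...))".
// std::move turns the temporary into an xvalue, which blocks copy elision
// and forces one extra move construction. Compilers may warn about this line.
int moves_with_std_move() {
    Tracked::moves = 0;
    auto w = std::move(make_tracked(0, 10));
    assert(w.data.size() == 10);
    return Tracked::moves;
}
```

With the usual copy elision, the plain version typically performs no move at all, while the std::move version performs one more move than the plain version. In both cases the heap allocation happens inside make_tracked when the vector is built, so std::move has no effect on it.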

You can check the assembly, but the "easy way" is to run the program. If it takes less than one second, the allocation has been optimised away. Neither gcc nor icpc can do that right now.
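Another way to check, besides timing and reading the assembly, is to replace the global operator new and count how often it is called. The sketch below does that. f_val is the function from my first post; g_allocations and allocations_in_loop are hypothetical names added for the illustration. As far as I understand N3664, the compiler may omit the call to a replaced operator new as well, so the counter should reflect what the optimiser actually did, but treat that as my reading rather than a guarantee.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Hypothetical global counter, incremented on every call to operator new.
static std::size_t g_allocations = 0;

// Replacement for the global allocation function: count, then use malloc.
void* operator new(std::size_t size) {
    ++g_allocations;
    if (void* p = std::malloc(size != 0 ? size : 1)) {
        return p;
    }
    throw std::bad_alloc{};
}
void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }

// Same function as in the first post.
std::vector<double> f_val(std::size_t i, std::size_t n) {
    auto v = std::vector<double>(n);
    for (std::size_t k = 0; k < v.size(); ++k) {
        v[k] = static_cast<double>(k + i);
    }
    return v;
}

// Runs the accumulation loop and returns how many calls to operator new
// were observed while it ran.
std::size_t allocations_in_loop(std::size_t nb_loops, std::size_t n,
                                std::vector<double>& v) {
    assert(v.size() == n);
    const std::size_t before = g_allocations;
    for (std::size_t i = 0; i < nb_loops; ++i) {
        auto w = f_val(i, n);
        for (std::size_t k = 0; k < v.size(); ++k) {
            v[k] += w[k];
        }
    }
    return g_allocations - before;
}
```

Without the optimisation, the returned count should be one allocation per iteration. If the compiler removes the allocation for w, the count can drop, possibly to zero, while the values accumulated in v stay the same.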

Memory allocation optimised away