Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Block Matrix Multiplication With Cilk?

Patrick_P_3
Beginner

I'm trying to tackle the same problem every HPC student gets: multiply matrices faster with as few memory accesses as possible. I've started with the dumb 6-deep nested for-loop block algorithm, but I feel like you should be able to eliminate the 3 innermost loops with Cilk array notation and take advantage of SSE/AVX. This is what I came up with, but I get a compile error from icpc saying the array bases are invalid. I've seen tons of Cilk examples with variables for array bounds, so I'm REALLY confused as to why this is invalid. Our Matrix is implemented using std::vector<std::vector<Val>>, where in our case Val is an int but could change.
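In case it helps, this is roughly what the relevant part of the class looks like (trimmed way down; the blockSize value here is just a placeholder):

#include <cstddef>
#include <vector>

class Matrix {
public:
    typedef int Val;                                    // int for now, but may change
    Matrix(std::size_t rows, std::size_t cols)
        : data(rows, std::vector<Val>(cols, 0)) {}
    Matrix operator*(const Matrix& src) const throw(std::exception);

private:
    std::vector<std::vector<Val>> data;                 // row-major: data[row][col]
    static const unsigned int blockSize = 64;           // cache-blocking tile size (placeholder)
};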

Matrix
Matrix::operator*(const Matrix& src) const throw(std::exception) {
    if (data[0].size() != src.data.size()) {
        throw std::runtime_error("Incompatible Matrix Dimensions!");
    }
    unsigned int BS = blockSize, m1x = data[0].size(), m1y = data.size(),
        m2x = src.data[0].size();

    Matrix toRet = Matrix(data.size(), src.data[0].size());  // purely canonical notation

    // There must be a way to reorder this for better ILP...
    for (unsigned int i = 0; i < m1y; i += BS) {
        unsigned int endm1y = std::min(i+BS, m1y);
        for (unsigned int j = 0; j < m1x; j += BS) {
            unsigned int endm1x = std::min(j+BS, m1x);
            for (unsigned int k = 0; k < m2x; k += BS) {
                unsigned int endm2x = std::min(k+BS, m2x);
                // should be a way to turn these next 3 for-loops into Cilk
                for (unsigned int x = i; x < endm1y; x++) {
                    /* This is somehow invalid... These are all proper "array" bases
                       Perhaps mathematically a bit odd, but at the very least this should be valid
                    toRet.data[x : endm1y].data()[j : endm1x] +=
                        (data[x : endm1y].data()[y : endm1y] *
                         src.data[k : endm1y].data()[j : endm2x]);
                    */

                    for (unsigned int y = k; y < endm2x; y++) {
                        Val sum = 0;

                        /* Also does not compile but is mathematically sound
                           Less powerful than the above statements, but it would
                           still eliminate a loop construct
                        toRet.data += (data.data() +
                                             src.data.data());
                         */

                        for (unsigned int z = j; z < endm1x; z++) {
                            sum += data[x][z] * src.data[z][y];
                        }
                        toRet.data[x][y] += sum;
                    }
                }
            }
        }
    }
    return toRet;
}

Even if my mathematical reasoning is way off (a sum of multiplications doesn't seem that difficult), I feel this is either an error in the Intel compiler or a flaw in the definition of Cilk itself. I see no reason the Cilk code above shouldn't compile at all, since everything is still a pointer to an array location. Does anyone have a resource I could look at to better optimize block multiplication (before getting OpenMP parallelism involved)? It seems like there's no nice way to avoid a load of cache misses, but I feel I should be able to eliminate a data hazard or two with regard to summing items into toRet.
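For what it's worth, this is roughly the shape I was hoping the innermost loop could collapse into. It only works on contiguous pointers, so assume srcT is a transposed copy of src (which my class doesn't keep today) and that I grab raw row pointers first. Untested, just the idea:

// Hypothetical sketch: srcT is a transposed copy of src, so column y of src
// is the contiguous row srcT.data[y]. Raw row pointers give Cilk array
// notation real array bases to work with.
const Val* arow = data[x].data();        // row x of the left matrix
const Val* brow = srcT.data[y].data();   // column y of src, stored contiguously
unsigned int len = endm1x - j;           // length of this block of the inner dimension

// One array-section multiply plus __sec_reduce_add replaces the z loop.
toRet.data[x][y] += __sec_reduce_add(arow[j:len] * brow[j:len]);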

I feel there has to be a way, at the very least, to reorder the last two or three loops, but I've spun my wheels for a couple of hours already and gotten hung up on Cilk.

TimP
Honored Contributor III

Where you have [x : endm1y], it looks like you meant i rather than x.

If your questions are about array notation indexing of containers, the Cilk forum might be the better place to ask. Going beyond the posted examples raises the question of whether you are trying to extend the definitions. There are open-source references on Cilk notation (which don't exactly match what is implemented in current compilers). My experiments with Cilk Plus(TM) have been more oriented toward the C99 combination, although some Intel marketing people say C++ is all that matters.

unsigned int induction variables may be an obstacle to optimization.

The inner loop looks like a candidate for a Cilk reducer, if your point is to maximize Cilk Plus(TM) notation to see whether there is any penalty vs. C. In my view, the presence of inner_product() in C++ makes one wonder about mixing Cilk reducers with C++.  You also have the alternatives of #pragma simd reduction (originally advertised as Cilk Plus(TM)) and #pragma omp simd reduction to assure this is optimized and that the compiler doesn't automatically renest the loops.
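Applied to your existing scalar kernel, that would look roughly like this (with signed induction variables, per the comment below); this is only a sketch of the pragma placement, not a claim about what the compiler will generate:

// Sketch: drop-in for the innermost loop of the posted operator*.
// #pragma simd reduction(+:sum) is the Intel-specific spelling of the same idea.
// Requires OpenMP 4.0 SIMD support to be enabled in the compiler.
Val sum = 0;
#pragma omp simd reduction(+:sum)
for (int z = (int)j; z < (int)endm1x; z++) {
    sum += data[x][z] * src.data[z][y];
}
toRet.data[x][y] += sum;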

If you wish to introduce a cilk_for, it seems that should be done at a higher level than your comments suggest.  Unfortunately, as far as I know, there isn't a way to restrict the Cilk workers to the number of cores and spread them out optimally.  The "composability" aspect of Cilk Plus(TM) is not so useful in a traditional HPC framework.
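By "higher level" I mean the outermost block loop rather than the inner kernel, roughly:

// Sketch: parallelize across row blocks; everything inside stays as you wrote it.
// cilk_for is declared in <cilk/cilk.h>.
cilk_for (unsigned int i = 0; i < m1y; i += BS) {
    unsigned int endm1y = std::min(i + BS, m1y);
    // ... j / k block loops and the scalar kernel, unchanged ...
}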
