Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Strange behavior for matrix multiplication

Tudor
New Contributor I
Hi there,

I wrote this simple algorithm to multiply matrices and then used OpenMP to run it in parallel. When I tested it on my Core 2 Quad Q8200 with Windows Server 2008 and 4 GB of RAM, the behavior for two 1024 x 1024 matrices was very strange.
The times for 1023 x 1023 and 1025 x 1025 are much smaller, both sequentially and in parallel, but the time for 1024 x 1024 is huge. Also, the scaling in this case is horrible: barely a 33% performance increase on 4 cores. I suspect it has something to do with the cache, but I don't know enough about this subject.

Here is the code:
[cpp]void OpenMPMatrixMultiply()
{
    int i, j, k;

#pragma omp parallel for private(j, k)
    for (i = 0; i < size1; i++)
    {
        for (j = 0; j < size3; j++)
        {
            int partial = 0;
            for (k = 0; k < size2; k++)
            {
                partial += matrix1[i][k] * matrix2[k][j];
            }
            result1[i][j] = partial;
        }
    }
}[/cpp]
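For reference, a minimal harness along these lines can reproduce the comparison (just a sketch: the flattened 1-D indexing, the omp_get_wtime() timing, and the all-ones test data are assumptions, not necessarily the setup behind the numbers above):

[cpp]#include <omp.h>
#include <stdio.h>
#include <vector>

// Multiply two n x n matrices stored as flat 1-D vectors (sketch).
static void Multiply(const std::vector<int>& a, const std::vector<int>& b,
                     std::vector<int>& c, int n)
{
#pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
        {
            int partial = 0;
            for (int k = 0; k < n; k++)
                partial += a[i * n + k] * b[k * n + j];
            c[i * n + j] = partial;
        }
}

int main()
{
    for (int n = 1023; n <= 1025; n++)   // compare the suspicious sizes
    {
        std::vector<int> a(n * n, 1), b(n * n, 1), c(n * n, 0);
        double t0 = omp_get_wtime();
        Multiply(a, b, c, n);
        printf("%d x %d: %.3f s\n", n, n, omp_get_wtime() - t0);
    }
    return 0;
}[/cpp]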

Any help would be appreciated.
jimdempseyatthecove
Honored Contributor III

I see everything is not proceeding as you have foreseen.

Try buffering your writes. Something along the lines of

[cpp]void OpenMPMatrixMultiply()
{
    int i, j, k;

#pragma omp parallel for private(j, k)
    for (i = 0; i < size1; i++)
    {
        __declspec(align(64))
        int result1Temp[size3];             // per-thread row buffer (size3 assumed to be a compile-time constant)
        for (j = 0; j < size3; j++)
        {
            int partial = 0;
            for (k = 0; k < size2; k++)
            {
                partial += matrix1[i][k] * matrix2[k][j];
            }
            result1Temp[j] = partial;       // write into the local buffer, not the shared result
        }
        for (j = 0; j < size3; j++)
        {
            result1[i][j] = result1Temp[j]; // copy the finished row out in one pass
        }
    }
}[/cpp]

Jim Dempsey

Tudor
New Contributor I

Quoting - jimdempseyatthecove

I see everything is not proceeding as you have foreseen.

Try buffering your writes. Something along the lines of

[cpp]void OpenMPMatrixMultiply()
{
    int i, j, k;

#pragma omp parallel for private(j, k)
    for (i = 0; i < size1; i++)
    {
        __declspec(align(64))
        int result1Temp[size3];             // per-thread row buffer (size3 assumed to be a compile-time constant)
        for (j = 0; j < size3; j++)
        {
            int partial = 0;
            for (k = 0; k < size2; k++)
            {
                partial += matrix1[i][k] * matrix2[k][j];
            }
            result1Temp[j] = partial;       // write into the local buffer, not the shared result
        }
        for (j = 0; j < size3; j++)
        {
            result1[i][j] = result1Temp[j]; // copy the finished row out in one pass
        }
    }
}[/cpp]

Jim Dempsey


How about merging the 1st and 3rd for statements like this:

[cpp]void OpenMPMatrixMultiply2()
{
    int i, k, line, column;
#pragma omp parallel for private(i, line, column)
    for (k = 0; k < size1 * size3; k++)
    {
        line = k / size3;
        column = k % size3;
        int partial = 0;
        for (i = 0; i < size2; i++)
        {
            partial += matrix1[line][i] * matrix2[i][column];
        }
        result[line][column] = partial;
    }
}[/cpp]

Sadly, neither of these two modifications seems to make any difference in the run time.
jimdempseyatthecove
Honored Contributor III

Is matrix a 2D array or a 1D array of pointers to 1D arrays?

The idea of the code I sent was to isolate the output of a row into a local array, then, in a separate loop block, copy the row out. The purpose was to reduce evictions due to false sharing (on writes). It may also have improved SSE small-vector usage.

Collapsing the loop would not improve upon this.

Try reducing the temp array to size3/2+1, then use two sections:
{ loop to produce a local copy of the first half of the products, copy the local copy to the first half of the output }
{ loop to produce a local copy of the second half of the products, copy the local copy to the second half of the output }

Or, if you envision larger arrays, peel off 1000 elements at a time (vary according to cache size). Your system has 4 MB of L2.
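
For example, a rough sketch of the peeled version (hypothetical code: the chunk size, the plain-int element type, and the size1/size2/size3 globals are assumptions carried over from the snippets above):

[cpp]// Assumed problem setup, matching the earlier snippets:
// extern int matrix1[size1][size2], matrix2[size2][size3], result1[size1][size3];

void OpenMPMatrixMultiplyChunked()
{
    const int CHUNK = 1000;                    // tune to the cache size
    int i, j, jj, k;

#pragma omp parallel for private(j, jj, k)
    for (i = 0; i < size1; i++)
    {
        __declspec(align(64)) int buf[CHUNK];  // per-thread write buffer

        for (jj = 0; jj < size3; jj += CHUNK)  // peel off one chunk of columns
        {
            int lim = (jj + CHUNK < size3) ? jj + CHUNK : size3;

            for (j = jj; j < lim; j++)
            {
                int partial = 0;
                for (k = 0; k < size2; k++)
                    partial += matrix1[i][k] * matrix2[k][j];
                buf[j - jj] = partial;         // local, cache-friendly writes
            }
            for (j = jj; j < lim; j++)         // copy the finished chunk out once
                result1[i][j] = buf[j - jj];
        }
    }
}[/cpp]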

Jim


Tudor
New Contributor I

Quoting - jimdempseyatthecove

Is matrix a 2D array or a 1D array of pointers to 1D arrays?

The idea of the code I sent was to isolate the output of a row into a local array, then, in a separate loop block, copy the row out. The purpose was to reduce evictions due to false sharing (on writes). It may also have improved SSE small-vector usage.

Collapsing the loop would not improve upon this.

Try reducing the temp array to size3/2+1, then use two sections:
{ loop to produce a local copy of the first half of the products, copy the local copy to the first half of the output }
{ loop to produce a local copy of the second half of the products, copy the local copy to the second half of the output }

Or, if you envision larger arrays, peel off 1000 elements at a time (vary according to cache size). Your system has 4 MB of L2.

Jim



Thank you for your advice; I will get to implementing this soon.
Any idea why this bad performance happens only for precisely 1024 x 1024 matrices?
jimdempseyatthecove
Honored Contributor III

Sounds like a false sharing cache problem.

1024 x 1024 x 4 bytes is exactly 4 MB, and 1024 x 1024 x 8 bytes is 8 MB; both are multiples of the 4 MB L2.

Try the following:

Get the size of the L2 cache (4 MB, I think).
Divide it by the number of threads to get the cache bytes available per thread.
When you allocate the thread-private arrays (assuming they are allocated roughly together):

[cpp]char* x = new char[cacheSize/nThreads]; // pad
double* Array = new double[ArraySize];   // real array
delete [] x;                             // release pad[/cpp]

Now the beginnings of the arrays will not all be aligned to the same cache sets.
(Adjust the above if you are using _aligned_malloc.)

You should get the idea. You can refine the technique later (e.g., small arrays do not require a pad when all the arrays fit inside the cache).
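
A minimal sketch of the staggering idea, done here with one shared allocation carved into per-thread slices rather than separate new calls (the 4 MB L2 size, the slice size, and the double element type are assumptions):

[cpp]#include <omp.h>
#include <vector>
#include <cstddef>

int main()
{
    const std::size_t cacheSize = 4u * 1024 * 1024;   // 4 MB shared L2 (assumed)
    const std::size_t sliceSize = 1024 * 1024;        // doubles per thread: 8 MB, a multiple of the cache size
    const int nThreads = omp_get_max_threads();
    const std::size_t skew = cacheSize / nThreads / sizeof(double);

    // One big allocation; each thread's slice starts at a different cache offset.
    std::vector<double> storage(nThreads * (sliceSize + skew));

#pragma omp parallel
    {
        int t = omp_get_thread_num();
        double* mine = &storage[t * (sliceSize + skew)]; // skewed start per thread
        for (std::size_t i = 0; i < sliceSize; i++)
            mine[i] = 0.0;                               // touch the whole slice
    }
    return 0;
}[/cpp]

Without the skew, every slice would begin at the same offset within the cache; with it, each thread's slice is staggered by cacheSize / nThreads bytes.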

Jim
TimP
Honored Contributor III
These CPUs are known for an aliasing problem in cache mapping, where cache lines spaced by 8MB (4MB on some steppings, 64KB on early P4) would evict each other. That's not what is normally known as false sharing, but the effect is likely to be similar, or worse, if the conflicts occur between threads on the same cache.
The quoted (pseudo?) code seems reasonably well organized to avoid standard false sharing. The evident problem is that each value of column refers to a different cache line, with the likelihood of thrashing the local TLB.
Without a data type specification, as touched upon earlier in the thread, I don't see how this can be viewed as more than pseudo-code, although I guess it's reasonable to assume plain ints.
Performance library implementations of matrix multiplication are ubiquitous, and would include one or another method to improve cache locality, of course with vectorization if the data type matches the parallel arithmetic instructions of the CPU.
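For illustration, a cache-blocked (tiled) variant in the spirit of what such libraries do internally; this is only a sketch, and the block size, the flattened int storage, and the zero-initialized output are assumptions:

[cpp]#include <vector>

// Blocked (tiled) multiply: c += a * b for n x n matrices in flat storage.
// Assumes c starts out zero-initialized. B is the tile edge; tune it so a few
// B x B tiles fit in cache together.
void BlockedMultiply(const std::vector<int>& a, const std::vector<int>& b,
                     std::vector<int>& c, int n)
{
    const int B = 64;
#pragma omp parallel for
    for (int ii = 0; ii < n; ii += B)                  // tile of rows of c (parallelized)
        for (int kk = 0; kk < n; kk += B)              // tile of the inner dimension
            for (int jj = 0; jj < n; jj += B)          // tile of columns of c
                for (int i = ii; i < ii + B && i < n; i++)
                    for (int k = kk; k < kk + B && k < n; k++)
                    {
                        int aik = a[i * n + k];        // reused across the whole j loop
                        for (int j = jj; j < jj + B && j < n; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
}[/cpp]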
Dmitry_Vyukov
Valued Contributor I
Quoting - Tudor
Thank you for your advice; I will get to implementing this soon.
Any idea why this bad performance happens only for precisely 1024 x 1024 matrices?

Probably it has something to do with the limited associativity of the lower levels of cache.
For example, if you have a 32k L1D$ with 2-way associativity, then you can hold in the cache at most 2 memory locations whose address p satisfies (p % 32k) == C.
The addresses of the values M[i, j] and M[i + 1, j] are separated by exactly 4k (8k for 8-byte elements) when the matrix is of size 1024, so you can hold in the cache at most 16 (8) locations from a single matrix column.
If the matrix size is 1023 or 1025, however, the values from a single column do not conflict in the "(p % 32k) == C" sense, so you can hold the whole matrix column in the cache.
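
A quick way to see the effect is to count how many distinct cache sets one matrix column touches for a 1024-int row versus a 1025-int row. A sketch, where the cache geometry (32 KB, 8-way, 64-byte lines, i.e. 64 sets) and the 4-byte element size are assumptions for illustration:

[cpp]#include <set>
#include <cstdio>

int main()
{
    const unsigned lineSize = 64, ways = 8, cacheSize = 32 * 1024;
    const unsigned numSets = cacheSize / (lineSize * ways); // 64 sets
    const unsigned elemSize = 4;                            // plain int
    const unsigned rowLengths[2] = { 1024, 1025 };

    for (int t = 0; t < 2; t++)
    {
        unsigned cols = rowLengths[t];
        std::set<unsigned> sets;
        for (unsigned row = 0; row < 1024; row++)
        {
            unsigned addr = row * cols * elemSize;        // offset of M[row][0]
            sets.insert((addr / lineSize) % numSets);     // cache set it maps to
        }
        std::printf("row length %u: one column touches %u of %u sets\n",
                    cols, (unsigned)sets.size(), numSets);
    }
    return 0;
}[/cpp]

On this model, a 1024-int row maps every element of a column to the same set, while a 1025-int row spreads the column across all 64 sets.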


jimdempseyatthecove
Honored Contributor III
Quoting - Dmitriy Vyukov

Probably it has something to do with the limited associativity of the lower levels of cache.
For example, if you have a 32k L1D$ with 2-way associativity, then you can hold in the cache at most 2 memory locations whose address p satisfies (p % 32k) == C.
The addresses of the values M[i, j] and M[i + 1, j] are separated by exactly 4k (8k for 8-byte elements) when the matrix is of size 1024, so you can hold in the cache at most 16 (8) locations from a single matrix column.
If the matrix size is 1023 or 1025, however, the values from a single column do not conflict in the "(p % 32k) == C" sense, so you can hold the whole matrix column in the cache.



Tudor,

Can you allocate larger than what is required?
IOW, set the row length (the number of columns allocated, not referenced) to an odd count (or a multiple of 3, 5, or 7).

The addresses of the values M[i, j] and M[i + 1, j] would then always have a TLB skew (it could be a stair-step, but skewed nonetheless).
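
A sketch of one way to carry the padded row length around for dynamically allocated matrices; the PaddedMatrix wrapper and its default pad of one extra column are hypothetical, not code from this thread:

[cpp]#include <vector>
#include <cstddef>

// Store an r x c matrix with a padded row stride so that M[i][j] and
// M[i+1][j] are not separated by an exact power-of-two number of bytes.
struct PaddedMatrix
{
    int rows, cols, stride;              // stride = columns allocated, cols = columns used
    std::vector<int> data;

    PaddedMatrix(int r, int c, int padCols = 1)
        : rows(r), cols(c), stride(c + padCols), data((std::size_t)r * (c + padCols)) {}

    int& at(int i, int j) { return data[(std::size_t)i * stride + j]; }
};[/cpp]

With size 1024 and padCols = 1, the row stride becomes 1025 ints (4100 bytes), so stepping down a column no longer lands in the same cache set every time.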

Jim Dempsey
Tudor
New Contributor I

Quoting - jimdempseyatthecove

Tudor,

Can you allocate larger than what is required?
IOW, set the row length (the number of columns allocated, not referenced) to an odd count (or a multiple of 3, 5, or 7).

The addresses of the values M[i, j] and M[i + 1, j] would then always have a TLB skew (it could be a stair-step, but skewed nonetheless).

Jim Dempsey

Indeed, it seems that simply allocating an odd number of columns, like size + (size % 2 + 1), solves the problem.
Also, my original program used static two-dimensional allocation, int matrix[1024][1024]. If I change it to an array of pointers, then it works fine for the mentioned sizes, but it obviously yields worse performance.
Thank you for the cache associativity lesson, that was very informative. It's funny how in college they tell you that the programmer need not worry about the low-level issues of the hardware implementation, yet even a problem as simple as matrix multiplication can run into them.
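
For concreteness, the static-array form of that fix might look like this; the single extra column comes from the size + (size % 2 + 1) formula mentioned above, and everything else is an assumption carried over from the earlier snippets:

[cpp]// Keep the 1024 x 1024 logical size but allocate 1025 columns per row.
enum { N = 1024, NPAD = N + (N % 2 + 1) };   // NPAD == 1025

static int matrix1[N][NPAD], matrix2[N][NPAD], result1[N][NPAD];

void MultiplyPaddedStatic()
{
#pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
        {
            int partial = 0;
            for (int k = 0; k < N; k++)
                partial += matrix1[i][k] * matrix2[k][j];
            result1[i][j] = partial;         // only the first N columns are ever referenced
        }
}[/cpp]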
jimdempseyatthecove
Honored Contributor III

>>It's funny how in college they tell you that the programmer need not worry about the low-level issues of the hardware implementation, yet even a problem as simple as matrix multiplication can run into them.

This is the difference between theory and practice. QED (Quite Easily Doh!)

In 10 years this may not make a difference (there will be enough circuit space to remove the n-way associativity from the cache), and all "RAM" will be as fast as L1 cache (IOW, no cache on systems).

Jim Dempsey
Dmitry_Vyukov
Valued Contributor I
Quoting - Dmitriy Vyukov
Probably it has something to do with the limited associativity of the lower levels of cache.
For example, if you have a 32k L1D$ with 2-way associativity, then you can hold in the cache at most 2 memory locations whose address p satisfies (p % 32k) == C.
The addresses of the values M[i, j] and M[i + 1, j] are separated by exactly 4k (8k for 8-byte elements) when the matrix is of size 1024, so you can hold in the cache at most 16 (8) locations from a single matrix column.
If the matrix size is 1023 or 1025, however, the values from a single column do not conflict in the "(p % 32k) == C" sense, so you can hold the whole matrix column in the cache.


There are indeed some non-linear effects. You may experiment with the following program to investigate them. It accesses N memory locations in a loop; the memory locations are separated by M bytes. I will post my output for N = 1/2/4/8/16/32/64 and M = 32768/16384/8192/4096/2048/1024 and 32896/16448/8224/4112/2056/1028.

[cpp]#include "stdafx.h"

#include <windows.h>
#include <intrin.h>
#include <stdio.h>

// Touch memory locations mem[0], mem[block_size], ..., mem[(count-1) * block_size].
template<size_t block_size, size_t count, size_t idx>
struct iter
{
    __forceinline static void run(char volatile* mem)
    {
        char val = mem[idx * block_size]; // read
        //mem[idx * block_size] = 1; // write
        //mem[idx * block_size] += 1; // read-write
        iter<block_size, count, idx + 1>::run(mem);
    }
};

template<size_t block_size, size_t count>
struct iter<block_size, count, count>
{
    __forceinline static void run(char volatile* mem)
    {
    }
};

template<size_t block_size, size_t count>
struct test
{
    static size_t const iter_count = 50000000;

    static void run(char* mem)
    {
        iter<block_size, count, 0>::run(mem); // warm up
        unsigned __int64 t0 = __rdtsc();
        body(mem);
        unsigned __int64 time = __rdtsc() - t0;
        unsigned cost = (unsigned)(time / iter_count);
        printf("location count #%u: %u\n", (unsigned)count, cost);
        test<block_size, count / 2>::run(mem);
    }

    __declspec(noinline) static void body(char volatile* mem)
    {
        for (size_t i = 0; i != iter_count / count / 4; i += 1)
        {
            iter<block_size, count, 0>::run(mem);
            iter<block_size, count, 0>::run(mem);
            iter<block_size, count, 0>::run(mem);
            iter<block_size, count, 0>::run(mem);
        }
    }
};

template<size_t block_size>
struct test<block_size, 0>
{
    static void run(char* mem)
    {
    }
};

template<size_t block_size, size_t count, bool stop>
struct suite
{
    static void run(char* mem)
    {
        printf("\nblock_size=%u\n", (unsigned)block_size);
        test<block_size, count>::run(mem);
        suite<(block_size / 2), count, (block_size / 2 < 1024)>::run(mem);
    }
};

template<size_t block_size, size_t count>
struct suite<block_size, count, true>
{
    static void run(char* mem)
    {
    }
};

int main()
{
    size_t const block_size1 = 32*1024;
    size_t const block_size2 = 32*1024 + 128;
    size_t const test_count = 64;

    char* mem = (char*)VirtualAlloc(0, test_count * block_size2, MEM_LARGE_PAGES | MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    if (mem == 0)
    {
        printf("failed to allocate large pages\n");
        mem = (char*)VirtualAlloc(0, test_count * block_size2, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
        if (mem == 0)
            return printf("failed to allocate small pages too\n");
    }

    suite<block_size1, test_count, false>::run(mem);
    suite<block_size2, test_count, false>::run(mem);
}
[/cpp]


Dmitry_Vyukov
Valued Contributor I
Quoting - Dmitriy Vyukov
There are indeed some non-linear effects. You may experiment with the following program to investigate them. It accesses N memory locations in a loop; the memory locations are separated by M bytes. I will post my output for N = 1/2/4/8/16/32/64 and M = 32768/16384/8192/4096/2048/1024 and 32896/16448/8224/4112/2056/1028.


And here is my output:

block_size=32768
location count #64: 10
location count #32: 4
location count #16: 1
location count #8: 1
location count #4: 1
location count #2: 1
location count #1: 1

block_size=16384
location count #64: 4
location count #32: 4
location count #16: 1
location count #8: 1
location count #4: 1
location count #2: 1
location count #1: 1

block_size=8192
location count #64: 5
location count #32: 4
location count #16: 1
location count #8: 1
location count #4: 1
location count #2: 1
location count #1: 1

block_size=4096
location count #64: 5
location count #32: 4
location count #16: 1
location count #8: 1
location count #4: 1
location count #2: 1
location count #1: 1

block_size=2048
location count #64: 6
location count #32: 3
location count #16: 1
location count #8: 1
location count #4: 1
location count #2: 1
location count #1: 1

block_size=1024
location count #64: 4
location count #32: 1
location count #16: 1
location count #8: 1
location count #4: 1
location count #2: 1
location count #1: 1

block_size=32896
location count #64: 1
location count #32: 1
location count #16: 1
location count #8: 1
location count #4: 1
location count #2: 1
location count #1: 1

block_size=16448
location count #64: 1
location count #32: 1
location count #16: 1
location count #8: 1
location count #4: 1
location count #2: 1
location count #1: 1

block_size=8224
location count #64: 1
location count #32: 1
location count #16: 1
location count #8: 1
location count #4: 1
location count #2: 1
location count #1: 1

block_size=4112
location count #64: 1
location count #32: 1
location count #16: 1
location count #8: 1
location count #4: 1
location count #2: 1
location count #1: 1

block_size=2056
location count #64: 1
location count #32: 1
location count #16: 1
location count #8: 1
location count #4: 1
location count #2: 1
location count #1: 1

block_size=1028
location count #64: 1
location count #32: 1
location count #16: 1
location count #8: 1
location count #4: 1
location count #2: 1
location count #1: 1
