
Optimizing matrix-vector multiplication

Ioannis_V_
Beginner

Hello everyone,

In the application I am developing there is a large matrix M of size n x m (n >> m) and a vector x of appropriate size (I will explain below what "appropriate" means). I have to perform two matrix-vector multiplications multiple times, but not with the whole matrix M. I have included a picture to help the discussion.

The application creates a binary tree, with the root node representing the whole matrix M. In the picture I assume that the matrix has n=20 rows, which are represented by the indices contained in the node. At each step the matrix is divided into two disjoint sets of rows (it is not necessary for the two sets to be of equal size). In order to keep track of this, each node contains a vector 'indices' and an integer field 'num_of_indices', which hold which rows of the original matrix belong to that node and how many such rows there are. Obviously, the matrices represented by each node are not contiguous in memory and there is no special structure to which indices belong to which node (with the exception of the root node, of course).
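
To make the layout concrete, each node looks roughly like this (a simplified sketch; only 'indices' and 'num_of_indices' are the fields I described, the child pointer names are just illustrative):

typedef struct Node {
    unsigned long *indices;         /* rows of the original M that belong to this node */
    unsigned long  num_of_indices;  /* how many rows the node holds                    */
    struct Node   *left, *right;    /* the two disjoint children                       */
} Node;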

The operation I have to perform involves the rows of M that belong to a node. More specifically, I have to compute Mi*MiT*x, i.e., the matrix Mi of node i, as defined by 'indices', is multiplied by its transpose and by the vector x (which has the appropriate size, i.e., 'num_of_indices' elements). To reduce the total number of operations I compute Mi*(MiT*x) instead, which results in two matrix-vector multiplications. This operation is repeated until the method I use converges (up to 400 times with my current data set).

All of this is performed in parallel. Each new node is created as an OpenMP task and, in order to perform the above operation, each node uses an appropriate share of the threads through OpenMP parallel and for directives. For example, nodes 1 and 2 would use 8/20 and 12/20 of the available cores of the Xeon Phi to perform their matrix-vector multiplications. After they finish, nodes 3, 4, 5 and 6 would use 5/20, 3/20, 5/20 and 7/20 of the available cores, and so on.
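
In rough pseudo-C, the recursion looks like this (only a sketch of the idea, not my actual code; 'process_node' stands for the Mi*(MiT*x) iteration described above, 'total_cores' for the number of cores of the card, and the whole thing is called from inside a parallel region):

void visit(Node *node, unsigned long total_rows, unsigned long total_cores)
{
    /* give the node a share of the cores proportional to its number of rows */
    unsigned long threads = (node->num_of_indices * total_cores) / total_rows;
    if (threads == 0) threads = 1;

    process_node(node, threads);        /* opens its own parallel region */

    if (node->left) {
        #pragma omp task
        visit(node->left, total_rows, total_cores);
    }
    if (node->right) {
        #pragma omp task
        visit(node->right, total_rows, total_cores);
    }
    #pragma omp taskwait
}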

Since the format of the matrix in each node does not follow a well-known storage format (dense, CRS, etc.), I cannot use MKL to perform the operation I need. Therefore, I have written my own code to do that, which is as follows (for the MiT*x part of the calculation):

#pragma omp for
for (i = 0; i < node->num_of_indices; i++) {
  row = &M[node->indices[i] * m]; // M is allocated with malloc() as a 1D array, so 2D coordinates are translated manually
  #pragma vector aligned nontemporal
  #pragma ivdep
  for (j = 0; j < m; j++) {
      x_2[j] += row[j] * x[i]; // x_2 accumulates MiT*x: column j of MiT times element i of x
  }
}

The code for Mi*x_2 is similar. The problem I have is that the operation Mi*(MiT*x) requires over 85% of the total execution time and I cannot seem to optimize it further. I used VTune and from the statistics collected it seems that most of the time is spent in vmovapd instructions, i.e., reading data from memory. I understand that the operations performed per data item fetched from memory are very few in a matrix-vector multiplication, but I was wondering if there are any ideas about how I could further optimize this operation on the Xeon Phi with the specific representation of matrix M in each node.

I also tried to use some kind of prefetching in the code, but failed miserably. More specifically, I tried blocking the i loop: touching a few elements in each row (stepping 64 bytes each time to move to the next cache line), touching as many rows as needed to fill the L2 cache, and then performing the required operations. This repeats as many times as necessary to cover all rows of a node. I believe I implemented it correctly, since I get correct results and verified the addresses being touched, but execution time almost doubled.

I use icc 16.0.0 and compile with -O3 -fopenmp -Wall -opt-report=5 (the optimization report indeed verifies that all inner loops are vectorized and all memory accesses are aligned).

If anyone has any ideas about this, I would appreciate his/her help.

Ioannis

Andrey_Vladimirov
New Contributor III

Hi Ioannis,

to estimate the potential for further optimization, can you measure the performance and compute the number of GFLOP/s that you are achieving?

The conversion from time to GFLOP/s is 2*2*n*m/(T*1e9), where n and m are your matrix sizes, T is the measured time in seconds, and the factor 2*2 accounts for the fact that you are doing 2 matrix-vector multiplications, each with 2*n*m FLOPs.
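
For instance, around one iteration of your two products (a sketch; n, m and the multiplication loops are whatever you already have):

double t0 = omp_get_wtime();
/* ... x_temp = MiT*x followed by x_curr = Mi*x_temp ... */
double t1 = omp_get_wtime();
double gflops = 2.0 * 2.0 * n * m / ((t1 - t0) * 1e9);   /* 2 products, 2*n*m FLOPs each */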

If sizeof(double)*n*m is much greater than L2 cache in the system, a well-optimized code should give you around 40-45 GFLOP/s on Xeon Phi or 20-25 GFLOP/s on Xeon. This is because you have a bandwidth-bound problem, and you can get around 180 GB/s memory copy bandwidth on Xeon Phi and around 100 GB/s on Xeon. Factor in 2 FLOPs per 8 bytes read, and you get (180 GB/s)/(4 B/FLOP)=45 GFLOP/s on Xeon Phi, and (100 GB/s)/(4 B/FLOP)=25 GFLOP/s on Xeon.

Post your currently measured performance, and from there you will see (i) if you are close to the optimum or quite far from it, and (ii) what additional tests you can do. For example, if you are at a few GFLOP/s, the most likely problem will be the threading model. If you are at about 50% of the optimal values, then memory traffic is more likely to blame. And if you are at the peak, there is no room for further improvement.

See also this webinar, where I talk about memory traffic optimization in dense matrix-vector multiplication: http://colfaxresearch.com/cdt-v02-optimization-of-memory-traffic/

 

Andrey

jimdempseyatthecove
Honored Contributor III

On the Xeon Phi (KNC) you need to use at least 2 threads per core, often 3, and sometimes 4. Please experiment with 2, 3, and 4 threads per core on the code you produce, following this strategy:

See if you can structure the distribution of the rows such that the threads sharing the same core use a clustered set of page table entries. Example sketch:

Your current code may distribute rows under the assumption that row placement has no effect on performance. In practice it does, with respect to potential TLB misses within a core.

Your current code may be assigning logical rows n:m to an arbitrary thread (regardless of core), rows m+1:o to a different arbitrary thread, etc. IOW a convenient programming distribution (easy to code).

What I suggest is to use KMP_AFFINITY=compact and KMP_PLACE_THREADS=60c,3t (or whatever your experimentation shows) to use 3 threads per core, as an example (experiment with 2t and 4t). In this configuration the first 3 OpenMP threads are on the first core, the next 3 on the next core, and so on.

Each core can be grossly considered as the OpenMP thread number / 3, and the HT thread within the core as the OpenMP thread number % 3.

Then for core X, HT 0 assign it row n
core X, HT 1 assign it row n+1
core X, HT 2 assign it row n+2
core X, HT 0 assign it row n+3
core X, HT1, assign it row n+4
core X, HT 2, assign it row n+5
...

IOW, each core receives a compact-in-memory block of rows, and each HT within the core interleaves over the rows of that core.
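
In code, the mapping might look something like this (just a sketch of the indexing, assuming compact placement with 3 threads per core; adapt the names to your data structures):

int tid      = omp_get_thread_num();
int nthreads = omp_get_num_threads();
int tpc      = 3;                          /* threads per core; try 2 and 4 as well */
int core     = tid / tpc;
int ht       = tid % tpc;
int ncores   = nthreads / tpc;

/* each core gets a contiguous block of rows, and the hardware threads
   of that core interleave over the rows of the block                  */
unsigned long rows_per_core = (node->num_of_indices + ncores - 1) / ncores;
unsigned long first = core * rows_per_core;
unsigned long last  = first + rows_per_core;
if (last > node->num_of_indices) last = node->num_of_indices;

for (unsigned long i = first + ht; i < last; i += tpc) {
    /* process row node->indices[i] */
}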

Jim Dempsey

TimP
Honored Contributor III

The default KNC software prefetch ought to work, with the possibility to experiment with various compile time settings of the 2 prefetch distances. You can also shut off the L2 software prefetch and fall back on the hardware prefetch plus L1 prefetch (I don't know of any relevant results to show usefulness of that). When you prefetch beyond the end of a row, it is important to have set the compact/close affinity so that those cases where multiple threads touch the same data don't result in unnecessary extra memory references.

L2 prefetch distance may interact with the number of threads used per core.  If it is truly memory bandwidth limited, you may not need more than 1 thread per core, maybe not as many as 60 cores.
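
For example, you could try compile options such as -opt-prefetch-distance=n1,n2 (distances for the L2 and L1 prefetches) or -opt-prefetch=0 to disable software prefetching, or a per-loop pragma. A sketch of the pragma form on your inner loop (the distances are only placeholders and, if I remember the syntax correctly, they are counted in loop iterations ahead):

#pragma prefetch row:1:16   /* prefetch 'row' into L2, ~16 iterations ahead */
#pragma prefetch row:0:4    /* prefetch 'row' into L1,  ~4 iterations ahead */
for (j = 0; j < m; j++) {
    x_2[j] += row[j] * x[i];
}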

There is also the possibility to experiment with the level of unroll (controlled by pragma or compile option) and unroll-and-jam (either pragma, if it takes effect, or explicit source code change).  The opt-report will report the compiler's choices of unroll and unroll-and-jam.

Did you check whether the compiler ignores the seemingly inapplicable nontemporal directive, or whether it maybe disables some software prefetch? The compiler, with the default option -opt-streaming-stores=auto, should try to figure out where nontemporal is beneficial.

Ioannis_V_
Beginner

Dear Andrey,

Firstly, let me clarify that I use double-precision numbers, which I forgot to mention in my original post.

I measured the performance of both matrix-vector operations. I used gettimeofday(), which provides microsecond accuracy; I assume that is enough for our purpose. In my case n=40000 and m=97. The best execution time for my application is currently achieved with 112 threads, which is exactly half of the maximum available (we have the 57-core version of the Xeon Phi, and 1 core is used for handling offloads). The numbers I report are for 112 threads.

The code I have shown in my previous post performs a multiplication of sizes (97x40000)x(40000x1) and achieves 21-22 GFlop/s on the Xeon Phi. This is half of the maximum you suggested in your answer. I also tried with n=100000 and m=97, without observing any significant change. Notice that the code is not the typical matrix-vector code. This is because M is stored row by row, so the columns of MT are sequential in memory, and I tried to exploit this fact.

Now the matrix (40000x97) is multiplied with the resulting vector (97x1). As you can see, the vector size is much smaller now and hence the performance reaches only 8-9 GFlop/s for this operation.

Both numbers are far from the optimum of 45 GFlop/s, so there should be room for improvement.

I haven't had the time to watch the webinar, but hopefully will do so over the weekend.

Ioannis

Ioannis_V_
Beginner

Dear Tim,

I was not aware of the possibility to control prefetch distances. I will try to experiment with that and see what happens. I should also have mentioned that the starting address of matrix M is 64-byte aligned and that each row is padded so as to make the beginning of the next row aligned on a 64-byte boundary. I did that to help vectorization but it should also solve the problem of multiple threads accessing the same cache line. I just wanted to keep things as clear as possible in my description. In reality line 3 in my code above is:

row = &M[node->indices[i] * padded_m];

in order to account for the padding.
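
Specifically, the padding just rounds the row length up to a full cache line of doubles (a sketch of the idea; the code in my application may differ slightly):

/* 64-byte cache lines hold 8 doubles, so round m up to the next multiple of 8 */
unsigned long padded_m = (m + 7UL) & ~7UL;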

I have already experimented with the unroll options/directives. It seems the compiler does a very good job in automatically determining the best values. I couldn't get any better performance.

Finally, the nontemporal directive actually gave me somewhat better performance. That's why I kept it. With -opt-streaming-stores=auto, the optimization report shows that streaming stores are generated for vector x_2.

Ioannis

Ioannis_V_
Beginner

Hello everyone,

Just in case it helps to know what happens when all threads of the Xeon Phi are used, I repeated the experiments for 224 threads and now I get 14-15 GFlop/s for the first matrix-vector multiplication and 4-5 GFlop/s for the second one.

Ioannis

jimdempseyatthecove
Honored Contributor III

What is the relationship between the contents of your indices[] in the first multiplication and in the second one?

Are they the same or different, and if different, how so?

Can you post your test code? (complete example)

Note, the posted code needs to be representative of the issue you are observing (e.g., where you get 14-15 GFlop/s for the first matrix-vector multiplication and 4-5 GFlop/s for the second one).

Jim Dempsey

Ioannis_V_
Beginner

Dear Jim,

As mentioned in my first post, I am multiplying the matrix corresponding to each node with its transpose, so the indices are the same. I cannot upload the complete code, but I am copying here the function under consideration. I hope this is enough. Let me know if you need any further details.

Ioannis

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <omp.h>

__attribute__((target(mic))) void myfunc(Node *node, double *M, unsigned long m, unsigned long padded_m, double **v, unsigned long threads)
{
	double          *x_prev, *x_curr, *x_temp, max, norm = INFINITY;
	unsigned long   i, j;

	x_prev = (double *)mmap(NULL, node->num_of_indices * sizeof(double), PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	x_curr = (double *)mmap(NULL, node->num_of_indices * sizeof(double), PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	x_temp = (double *)mmap(NULL,                    m * sizeof(double), PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);

	if ((x_prev == MAP_FAILED) || (x_curr == MAP_FAILED) || (x_temp == MAP_FAILED)) {
		printf("ERROR: Could not allocate a vector in power iteration.\n");
		exit(0);
	}

	__assume_aligned(x_prev, 64);
	__assume_aligned(x_curr, 64);
	__assume_aligned(x_temp, 64);
	__assume_aligned(node, 64);
	__assume_aligned(*v, 64);

	#pragma vector aligned
	#pragma ivdep
	for (i = 0; i < node->num_of_indices; i++) {
		x_curr[i] = 1.0;
	}

	#pragma omp parallel default(none) private(i, j) shared(x_temp, x_curr, x_prev, node, max, norm, M, m, padded_m) num_threads(threads)
	{
	__attribute__((aligned(64))) double     x_temp_local[m];   /* per-thread partial sums of MiT*x (m elements) */

	double          *temp, *M_line;
	unsigned long   iter = 0;
	struct timeval  start, end;

	__assume_aligned(M_line, 64);

	do {
		#pragma omp single
		{
		max = 0.0;

		temp   = x_prev;
		x_prev = x_curr;
		x_curr = temp;

		memset(x_temp, 0, m * sizeof(double));
		}

		memset(x_temp_local, 0, m * sizeof(double));

                /* First matrix-vector multiplication */
		gettimeofday(&start, NULL);
		#pragma omp for nowait
		for (i = 0; i < node->num_of_indices; i++) {
			M_line = &M[node->indices[i] * padded_m];
			#pragma vector aligned nontemporal
			#pragma ivdep
			for (j = 0; j < m; j++) {
				x_temp_local[j] += (M_line[j] - node->centroid[j]) * x_prev[i];
			}
		}
		gettimeofday(&end, NULL);

		printf("1. %ld %3d %f\n", node->num_of_indices, omp_get_thread_num(), (2 * m * node->num_of_indices) / (1000.0 * ((end.tv_sec - start.tv_sec) * 1000000.0 + (end.tv_usec - start.tv_usec))));

		#pragma vector aligned
		#pragma ivdep
		for (j = 0; j < m; j++) {
			#pragma omp atomic
			x_temp[j] += x_temp_local[j];
		}

		norm = 0.0;
		#pragma omp barrier

                /* Second matrix-vector multiplication */
		gettimeofday(&start, NULL);
		#pragma omp for reduction(max:max)
		for (i = 0; i < node->num_of_indices; i++) {
			x_curr[i] = 0.0;
			M_line = &M[node->indices[i] * padded_m];
			#pragma vector aligned nontemporal
			#pragma ivdep
			for (j = 0; j < m; j++) {
				x_curr[i] += (M_line[j] - node->centroid[j]) * x_temp[j];
			}

			if (fabs(x_curr[i]) > max) {
				max = fabs(x_curr[i]);
			}
		}
		gettimeofday(&end, NULL);

		printf("2. %ld %3d %f\n", node->num_of_indices, omp_get_thread_num(), (2 * m * node->num_of_indices) / (1000.0 * ((end.tv_sec - start.tv_sec) * 1000000.0 + (end.tv_usec - start.tv_usec))));

		#pragma omp for reduction(+:norm)
		for (i = 0; i < node->num_of_indices; i++) {
			x_curr[i] /= max;
			norm += (x_curr[i] - x_prev[i]) * (x_curr[i] - x_prev[i]);
		}

		iter++;

	} while (((norm / node->num_of_indices) > TOLERANCE) && (iter < MAX_ITERATIONS));
	} // #pragma omp parallel

	munmap(x_prev, node->num_of_indices * sizeof(double));
	munmap(x_temp, m * sizeof(double));

	*v = x_curr;
}

 

TimP
Honored Contributor III

I haven't been able to edit this to compilable form.

Use of shared j in parallel regions is a big problem, if the compiler doesn't optimize it away.  This would easily account for lack of normal parallel performance (and unpredictable results).

I don't know how you can use nowait reliably if it may set up a race on x_temp_local[].

Did you try Inspector or similar tools to find and resolve some race conditions?

The preference for local int for() parameters is even stronger on MIC than on host.  64-bit unsigned could be a serious obstacle on this hardware.  Comparing reports for -Qopt-report4 should give some idea of the usefulness of changing this (once you have corrected race conditions).

 

Ioannis_V_
Beginner

Dear Tim,

Thank you for taking the time to look into this, but I am not certain that I understand your comment about j being shared. It is explicitly declared as private in the parallel directive. I consistently get the same results, which are equal to the ones produced by the serial version of the code. I don't think there is a race condition in the code.

I use opt-report=5 but it doesn't mention anything about the loop variables.

Ioannis

TimP
Honored Contributor III

OK, I didn't see the private. So you could change some for loops to use int and see whether opt-report 4 or 5 predicts better vector performance.

jimdempseyatthecove
Honored Contributor III

Ioannis,

A few things stand out that may be an issue.

For your second multiply loop try

  #pragma omp for reduction(max:max)
  for (i = 0; i < node->num_of_indices; i++) {
   M_line = &M[node->indices[i] * padded_m];
   #pragma vector aligned nontemporal
   #pragma ivdep
   for (j = 0; j < m; j++) {
    x_temp_local[j] = (M_line[j] - node->centroid[j]) * x_temp[j];
   }
   double x_curr_local = 0.0;
   for (j = 0; j < m; j++) {
    x_curr_local += x_temp_local[j];
   }
   x_curr[i] = x_curr_local;
   if (fabs(x_curr_local) > max) {
    max = fabs(x_curr_local);
   }
  }

This may work better.

You still have not answered whether node->indices[i], node->indices[i+1], node->indices[i+2], ... tend to represent adjacent numbers (with occasional gaps in the sequence).

Jim Dempsey

Ioannis_V_
Beginner

Hello Jim,

I tried the code you suggested, but it actually performs a bit worse than my original code (~0.5 GFlop/s less).

I hadn't correctly understood your question regarding indices before. The answer is that in general we don't know whether they are contiguous, as this depends on the input matrix M. In the runs with my current input matrix it seems that, in general, they are not contiguous, or only a very small percentage of them are.

Ioannis

jimdempseyatthecove
Honored Contributor III

Hmm... How about helping the compiler to reduce common subexpressions with

#pragma omp parallel default(none) private(i, j) shared(x_temp, x_curr, x_prev, node, max, norm, M, m, padded_m) num_threads(threads)
{
  __attribute__((aligned(64))) double     x_temp_local[m];
  __attribute__((aligned(64))) double     x_temp_diff_local[m];
  ...
  for (j = 0; j < m; j++) {
    x_temp_diff_local[j] = M_line[j] - node->centroid[j];
    x_temp_local[j] += x_temp_diff_local[j] * x_prev[i];
  }
  ...
  for (j = 0; j < m; j++) {
    x_curr[i] += x_temp_diff_local[j] * x_temp[j];
  }
  ...

Jim Dempsey

Ioannis_V_
Beginner

Hello,

I just wanted to make an update on this post, in case someone finds it in the future and wonders what the outcome was.

Firstly, there was an error in my previous posts in how I calculated GFlop/s. Since a vector is subtracted from each line of the matrix before the multiplication, there are actually 3 floating-point operations per element. Taking this into account, the real performance I get is around 33 GFlop/s.

Andrey suggested that we can get around 180 GB/s of actual memory copy bandwidth on the Xeon Phi. I assume this is for the high-end models of the Xeon Phi. However, we have the 3120P version, which has a peak bandwidth of 240 GB/s. In order to figure out the real bandwidth I can get, I used the STREAM benchmark, and it reported around 94 GB/s for the kernels that are closest to my case (copy and add). Since the centroid vector is small (only 97 double values) we can assume that it stays in cache. Furthermore, each element of vector x_prev is loaded once and reused many times. Hence, we can assume that only elements of the matrix are actually loaded from main memory, i.e., 8 bytes are read for every 3 floating-point operations performed. That leads to 94 / (8/3) = 35 GFlop/s of maximum performance. So, I think I have actually reached the maximum performance for my problem on the Phi.
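
In code form, the estimate above is simply (just restating the arithmetic):

double stream_bw_gbs  = 94.0;   /* measured with the STREAM copy/add kernels  */
double bytes_per_elem = 8.0;    /* one double of the matrix read per element  */
double flops_per_elem = 3.0;    /* subtract centroid, multiply, accumulate    */
double gflops_ceiling = stream_bw_gbs * flops_per_elem / bytes_per_elem;   /* ~35 GFlop/s */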

However, I am not satisfied with this solution, as I think this performance is too low for what the Phi can provide. I will try to reorganize the data so that the matrix-vector operation is performed with a dense matrix. But I will open a new thread for that, since I already have a question about it.

Best regards,

Ioannis E. Venetis

McCalpinJohn
Honored Contributor III

Thanks for the update!

I have not tested bandwidth on a Xeon Phi 3120, but I would have expected a bit better than 94 GB/s.   The two things that help STREAM performance the most are:

  1. Non-temporal stores
    • Not useful here, since I think you want the intermediate results to stay in the L2 caches.
  2. Large pages
    • These are very helpful on Xeon Phi because it is the only recent Intel processor that drops software prefetches if they require a Page Table Walk.
    • On my Xeon Phi SE10P systems, I get the best results with large pages and a prefetch distance of 64 cache lines ahead (one way to request large pages is sketched after this list).
      • Note that 64 cache lines is 4KiB, so *all* of these prefetches will be dropped when running with 4KiB pages (assuming that the amount of data accessed exceeds the DTLB range of 64 entries * 4KiB/entry = 256 KiB).
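
For the mmap() allocations in the code posted earlier, one way to request large pages is to add MAP_HUGETLB (a sketch only, assuming huge pages are configured on the card; for buffers transferred by the offload runtime, the MIC_USE_2MB_BUFFERS environment variable is, as far as I recall, the relevant knob):

/* sketch: back a work array with 2 MiB pages, falling back to normal pages */
x_prev = (double *)mmap(NULL, node->num_of_indices * sizeof(double),
                        PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
if (x_prev == MAP_FAILED) {
    x_prev = (double *)mmap(NULL, node->num_of_indices * sizeof(double),
                            PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);
}

The matrix M itself is the large allocation, so that is where large pages would matter most.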

I am not sure how much better the performance should be -- the limitations on sustained bandwidth result from extremely complex and subtle interactions of many factors, many of which are not directly observable (such as activity at the Distributed Tag Directories).  But I would guess that you should be able to get closer to 120 GB/s for a read-dominated access pattern.

 

A secondary factor in Xeon Phi bandwidth is interaction of memory access streams and open pages.  

  • Performance is expected to drop if the number of simultaneous memory access streams from the cores exceeds the number of DRAM banks, since each bank will have to be opened and closed repeatedly to load the required data.
    • Each of the memory controllers has 2 GDDR5 DRAM channels, and each channel has 16 banks (4 bank groups with 4 sub-banks each).   So the Xeon Phi 3120 has 6*2*16=192 DRAM banks, while the Xeon Phi SE10P has 8*2*16=256 DRAM banks.
    • The number of load streams is slightly more important than the number of store streams (since stores are often buffered in the memory controller before being written back to DRAM), but both have an impact.
  • Example from Xeon Phi SE10P (job limited to first 60 cores):
    • Copy & Scale run at the same speed with 1 or 2 threads/core (120 or 240 total access streams on 60 cores)
      • Performance drops ~3% with 3 threads/core (360 memory access streams) 
      • Performance drops ~6% with 4 threads/core (480 memory access streams)
    • Add & Triad run best with 1 thread per core (180 memory access streams on 60 cores)
      • Performance drops ~3.5% with 2 threads/core (360 memory access streams)
      • Performance drops ~11% with 3 threads/core (540 memory access streams)
      • Performance drops ~13% with 4 threads/core (720 memory access streams)

With 192 DRAM banks, you will probably get the best memory throughput with large pages and ~3 memory access streams per core.  Of course your code is more complex and may not generate enough concurrency to completely cover the memory latency (or it might suffer from any of a number of other limitations), but I thought some of these performance comments might help with the analysis.....
