Is storing data the bottleneck?

Arthur_U_ · ‎01-09-2013

Hi,

I'm writing some example AVX code like the following:

double a[SIZE] __attribute__((aligned(32)));
double b[SIZE] __attribute__((aligned(32)));
double c[SIZE] __attribute__((aligned(32)));

srand(time(NULL));

for(int i=0; i<SIZE; i++) {
    a[i] = rand()/(double)RAND_MAX + rand()/(double)RAND_MAX * pow(10,-8);
    b[i] = rand()/(double)RAND_MAX + rand()/(double)RAND_MAX * pow(10,-8);
}
__m256d ymm0, ymm1, ymm2;

gettimeofday(&t0,NULL);
for(int i=0; i<SIZE; i+=4) {
    ymm0 = _mm256_load_pd(a+i);
    ymm1 = _mm256_load_pd(b+i);
    ymm2 = _mm256_mul_pd(ymm0, ymm1);
    _mm256_store_pd(c+i, ymm2);
}
gettimeofday(&t1,NULL);

double time1;
time1 = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec)*1.0E-6;

double sum = 0.0;
for(int i=0; i<SIZE; i++) {
    sum += c[i];
}

The resulting time1 was 6.750000e-04 sec.
That is slower than the scalar version, which took around 5.0e-04 sec.

Then I found that if I comment out the store (_mm256_store_pd(c+i, ymm2);), the loop gets much faster (time1 becomes 1.9300e-04 sec).

According to these results, I think that storing data from the ymm registers to memory is the bottleneck, but is that right?
Is there a good way to store the data without increasing the execution time?

(The actual code was attached.)
OS: Mac OS X 10.8.2
CPU: 2GHz Intel Core i7
Compiler: gcc 4.8
Compiler-options: -mavx (AVX version only)

Thanks.

Thomas_W_Intel · ‎01-31-2013

Hmm, when you comment out the writes, the compiler should be smart enough to figure out that you are not using the result, and it should completely eliminate the loop. I'm therefore surprised that the loop takes any time at all. Are you sure that your measurements are precise enough?
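One way to test a "no store" variant without letting the optimizer discard the loop is to fold the products into a value that is then written to a volatile object. The sketch below is only an illustration: the names sink and multiply_no_store are hypothetical, the loop is a scalar stand-in for the AVX loop so that it builds without -mavx, and the extra addition per element means it does not do exactly the same work as the original. Note also that the compiler options listed in the original post contain no optimization flag, in which case GCC would not be expected to remove the loop at all.

```c
#include <stddef.h>

/* A write to a volatile object is an observable side effect, so the
   compiler may not discard the value stored here. */
static volatile double sink;

/* Hypothetical "no store to c" variant: the loads and multiplies stay
   live because their results feed the accumulator, which is consumed. */
static double multiply_no_store(const double *a, const double *b, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    sink = acc;   /* consume the result */
    return acc;
}
```

Even with such a sink in place, inspecting the generated assembly remains the only sure way to see what was actually timed.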

SergeyKostrov · ‎01-31-2013

>>...Are you sure that your measurements are precise enough?
>>...
>>gettimeofday( &t0, NULL );
>>for(int i=0; i<SIZE; i+=4)
>>{
>>...
>>}
>>gettimeofday( &t1, NULL );
>>...

Even if a CRT-function gettimeofday is not the fastest, compared to RDTSC or gettime, I don't see any problems with how it is used. However, it makes sense to try other functions.
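As one example of "another function", here is a minimal sketch of a timer built on POSIX clock_gettime with CLOCK_MONOTONIC. It assumes a system that provides clock_gettime; as far as I know, Mac OS X 10.8 (the OS in the original post) does not, so treat this as an option for other platforms. The helper name monotonic_seconds is hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* ask for the clock_gettime declaration */
#endif
#include <time.h>

/* Seconds read from a monotonic clock, which is not affected by
   wall-clock adjustments between the two readings of an interval. */
static double monotonic_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1.0e-9;
}
```

An interval is then the difference of two calls, one before and one after the loop being measured.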

Bernard · ‎03-04-2014

>>>Even if a CRT-function gettimeofday is not the fastest, compared to RDTSC or gettime >>>

If I am not mistaken, gettimeofday relies on the RTC, so it is not very accurate.

McCalpinJohn · ‎03-04-2014

Some Notes:

1. CRITICAL: The code does not initialize the output array before using it in the loop. This means that the overhead of allocating physical pages for the output array falls in the timed region -- definitely not a good idea. This is consistent with seeing a big speedup if the stores are removed, since the two arrays being loaded were properly initialized before the timed region. (Of course, you also have to check the assembly code carefully to make sure that the compiler does not eliminate the loads when the stores are deleted.)

2. IMPORTANT: Pay attention to the array sizes compared to the various caches. For N=100,000 elements, the total storage used is 2,400,000 Bytes, or about 2.29 MiB -- the data that has been initialized will be in the L3 cache. Therefore the performance will be limited by the effective L3 cache bandwidth, which is only indirectly related to the choice of scalar, 128-bit vector, or 256-bit vector.

The specific timing values reported also suggest that there is a big overhead in the timed section. Moving 100,000 elements per array (2,400,000 Bytes in total) in 6.75e-4 seconds corresponds to a bandwidth of 3.56 GB/s, which is much lower than a single processor should be sustaining from L3 cache. Removing the stores changes this to reading 1,600,000 Bytes in 1.93e-4 seconds, which is 8.3 GB/s -- a much more reasonable number for L3 read bandwidth. (It is still on the slow side, but 2 GHz is a fairly slow processor, so it is not too far off.)

3. Advisory: The granularity of the "gettimeofday()" timer is 1 microsecond, and the overhead for calling that routine appears to be no more than 1 microsecond in my tests (though my processors are a bit faster). Although you can reduce this overhead by using the RDTSC instruction (which should have an overhead of about 24 cycles on the 2 GHz Core i7), it may be easier to simply repeat the inner loop 10 or 100 times to make the timing interval larger. This is also important on systems that have power-saving enabled, as they will typically drop the processor frequency down to a minimum value when the system is idle and only ramp it up when the processor gets busy. If you can't disable these power-saving features, I recommend repeating the measurement loop until you have something like a 10 second run time. In this case you might repeat the inner loop a few hundred times to get the processor "warmed up" and then repeat the loop another few hundred times and only pay attention to the timing of the last set of iterations.

4. Once problem (1) is fixed, there will probably still be a small difference in performance between scalar, packed 128-bit SSE, and packed 256-bit AVX when the data is in L3 or memory. This difference is not well understood, but it has been widely reported that on "Sandy Bridge" processors (2nd generation Core i7) using AVX instructions results in slightly lower memory bandwidth than using SSE instructions -- so it is not surprising that the L3 bandwidth is different as well. As far as I know there is not a clear public explanation of this difference in performance, but my speculation is that using AVX instructions results in a slower "ramp-up" of hardware prefetches than using scalar SSE or 128-bit packed SSE instructions. (Unfortunately one of the performance counters that would be most useful in shedding light on this issue is broken in Sandy Bridge processors, and I have not had time to attack the problem from a different direction.)
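Notes 1 to 3 can be combined into one small measurement harness. The sketch below is not the original poster's code. All helper names (prefault, multiply, elapsed_seconds, time_per_pass, bandwidth_gbs) are hypothetical. The kernel is a scalar stand-in so that the example builds without -mavx; the AVX loop would take its place. The empty asm statement is a GCC/Clang-specific assumption used to keep repeated passes from being merged or removed.

```c
#include <stddef.h>
#include <sys/time.h>

/* Note 1: write every element once so that physical pages are allocated
   before the timed region. The volatile pointer keeps the writes. */
static void prefault(double *buf, size_t n)
{
    volatile double *p = buf;
    for (size_t i = 0; i < n; i++)
        p[i] = 0.0;
}

/* Scalar stand-in for the kernel under test. */
static void multiply(const double *a, const double *b, double *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] * b[i];
    /* GCC/Clang-specific barrier: c is treated as read here, so the
       optimizer cannot drop or merge repeated passes. */
    __asm__ volatile("" : : "r"(c) : "memory");
}

static double elapsed_seconds(struct timeval t0, struct timeval t1)
{
    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_usec - t0.tv_usec) * 1.0e-6;
}

/* Note 3: run `warmup` untimed passes, then time `reps` passes and
   return the average number of seconds per timed pass. */
static double time_per_pass(const double *a, const double *b, double *c,
                            size_t n, int warmup, int reps)
{
    struct timeval t0, t1;

    prefault(c, n);                      /* untimed first touch of c */
    for (int r = 0; r < warmup; r++)
        multiply(a, b, c, n);

    gettimeofday(&t0, NULL);
    for (int r = 0; r < reps; r++)
        multiply(a, b, c, n);
    gettimeofday(&t1, NULL);

    return elapsed_seconds(t0, t1) / reps;
}

/* Note 2: bandwidth in GB/s (1 GB = 1e9 bytes). */
static double bandwidth_gbs(double bytes, double seconds)
{
    return bytes / seconds / 1.0e9;
}
```

With three arrays of doubles, one pass moves 3 * SIZE * sizeof(double) bytes, so the measured time per pass can be passed straight to bandwidth_gbs. As a check against the figures in note 2, bandwidth_gbs(2400000.0, 6.75e-4) evaluates to about 3.56 and bandwidth_gbs(1600000.0, 1.93e-4) to about 8.29.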