Nios® V/II Embedded Design Suite (EDS)

performance degradation

Altera_Forum
Honored Contributor II

We are running a heterogeneous multiprocessor system of Nios II cores on our Stratix II FPGA. We have written a benchmark that contends for a shared memory protected by a hardware mutex, performs a multiply-accumulate, and stores the result. We are collecting performance data for this benchmark over various configurations of fast and standard Nios II processors.

 

When we have 2 standard processors along with a fast processor contending for the shared memory, we see a drastic performance degradation for the standard cores only. We then ran three standard processors with the same benchmark and found the execution time per processor to double compared with a two-standard-processor version running the same benchmark. Using performance counters, we saw that the execution time for the load-multiply-accumulate-store of the locked data increased.

 

This seems very strange: once a processor acquires the lock, it is the only processor that can operate on the data, so the load-multiply-accumulate-store should take the same time regardless of the number of processors on the FPGA. Does anyone have an explanation for why the execution time of operations like multiplies and adds would increase when we go from 2 to 3 processors on the FPGA?

 

Or are there any known problems with running performance counters on multiple processors simultaneously?
Altera_Forum
Honored Contributor II

difficult question to answer without a more detailed memory layout... 

 

- are the different CPUs also sharing the same memory for code and data?

- what kind of code is executing on each CPU? 

- are you using GP-relative addressing or not? This changes the frequency of memory accesses.
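
A tiny example of what I mean, assuming the nios2-elf-gcc small-data optimization (-mgpopt, which I believe is on by default):

/* With -mgpopt, a small global like this is placed in the .sdata
   small-data section and reached with a single load or store
   relative to the gp register: */
int hits;

int bump(void)
{
    return ++hits;   /* one gp-relative ldw plus one gp-relative stw */
}

/* With -mno-gpopt the compiler first has to materialize the full
   32-bit address (extra instructions per access), so the number of
   instruction fetches per data access goes up. */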

- what about caches? 

 

bye 

 

PJ
Altera_Forum
Honored Contributor II

First, I want to explain that this is for research; we are trying to model contention, which is why our code might seem simple and trivial.

 

The CPUs use the SDRAM for instruction and data memory. They have small 4 KB caches, but our code is extremely small. We are basically using an architecture like the one in Altera's three-processor tutorial. The strange thing is, we don't see a comparable degradation on the fast processors (performance data below the code). In the code, CPU1 works on macarray indices 1-3, CPU2 on 4-7, and CPU3 on 8-10. We run the code on all processors at the same time so they fight for the lock. We expect the lock contention to increase, as it does; however, everything else (execution time per loop iteration) shouldn't vary much.

 

our code is:

#include "system.h"
#include "io.h"
#include "altera_avalon_mutex.h"
#include "altera_avalon_performance_counter.h"

#define FAST 2666667
#define STAN 1000000
#define ECON  200000

#define LOOP FAST

/* shared MAC buffer and hardware mutex */
volatile int *macarray;
alt_mutex_dev *mutex;

int main(void)
{
    int i, j;
    int temp;

    macarray = (volatile int *)MESSAGE_BUFFER_RAM_BASE;

    /* initialize this CPU's slice of the shared array */
    for (i = 1; i < 4; i++)
        IOWR(&macarray[i], 0, 1);

    mutex = altera_avalon_mutex_open("/dev/message_buffer_mutex");

    PERF_RESET(PERFORMANCE_CPU1_BASE);
    PERF_START_MEASURING(PERFORMANCE_CPU1_BASE);

    PERF_BEGIN(PERFORMANCE_CPU1_BASE, 1);              /* section 1: whole run */

    for (j = 0; j < LOOP; j++) {
        for (i = 1; i < 4; i++) {
            PERF_BEGIN(PERFORMANCE_CPU1_BASE, 2);      /* section 2: one iteration */

            PERF_BEGIN(PERFORMANCE_CPU1_BASE, 3);      /* section 3: lock contention */
            altera_avalon_mutex_lock(mutex, 1);
            PERF_END(PERFORMANCE_CPU1_BASE, 3);

            /* load-multiply-accumulate-store on the locked data */
            temp = IORD(&macarray[i], 0);
            IOWR(&macarray[i], 0, temp*i + temp);

            altera_avalon_mutex_unlock(mutex);

            PERF_END(PERFORMANCE_CPU1_BASE, 2);
        }
    }

    PERF_END(PERFORMANCE_CPU1_BASE, 1);
    PERF_STOP_MEASURING(PERFORMANCE_CPU1_BASE);

    return 0;
}

 

 

 

 

in terms of performance:

                     total        lock contention   per loop iteration
2 fast CPUs          3707864005   2727460037        122.5504807
3 fast CPUs          5219197128   4230256283        123.6175902
2 standard CPUs      4004902599   2745356939        419.8485533
3 standard CPUs      8519047729   6664269607        618.259374
Altera_Forum
Honored Contributor II

OK, now I understand a little better. Which university are you from?

 

a few comments: 

 

- are you sure the code you are going to execute fits in RAM? 

 

- since the code should be small, try to put the three code images in three tightly coupled or on-chip RAMs, or simply use different memories (ext_ram, sdram, and onchip) for the three processors, to remove possible contention on the same memory
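
For example, a sketch of pinning the benchmark code into a dedicated memory; ".onchip_ram" is a placeholder section name and must match a region your BSP linker script actually defines:

/* Hypothetical: place this function's code in on-chip RAM so its
   instruction fetches never touch the shared SDRAM. */
void benchmark_loop(void) __attribute__((section(".onchip_ram")));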

 

- are you measuring just one processor, or all three? Altera mutexes implement simple spin locks, so the wait time on an Altera mutex is not bounded (and that's the reason why we implemented a queuing spin lock in ERIKA Enterprise)
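
For comparison, a minimal sketch of the queuing (ticket) idea; this is not the actual ERIKA Enterprise code, just an illustration that uses the hardware mutex only to guard the ticket counter:

/* Minimal ticket ("queuing") lock sketch. Waiters are served strictly
   in arrival order, so the wait time is bounded. Note: on a system
   with data caches, the lock structure should live in uncached memory
   or be accessed with ldwio/stwio. */
typedef struct {
    volatile unsigned next_ticket;   /* next ticket to hand out       */
    volatile unsigned now_serving;   /* ticket currently being served */
} ticket_lock_t;

void ticket_lock(ticket_lock_t *l, alt_mutex_dev *hw)
{
    unsigned me;

    /* atomic fetch-and-increment, built from the hardware mutex */
    altera_avalon_mutex_lock(hw, 1);
    me = l->next_ticket++;
    altera_avalon_mutex_unlock(hw);

    while (l->now_serving != me)
        ;   /* spin until our ticket comes up */
}

void ticket_unlock(ticket_lock_t *l)
{
    l->now_serving++;   /* pass the lock to the next waiting ticket */
}

With a plain spin lock, by contrast, every waiter races again on every release, so one CPU can starve for an unbounded time.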

 

- it could be that a mix of fast and standard cores accessing the same memory leads to more contention than three standard cores, because the pattern of memory accesses is in general different

 

- I do not understand the performance counter setup... 

 

why not 

 

PERF_BEGIN(PERFORMANCE_CPU1_BASE, 3);
altera_avalon_mutex_lock(mutex, 1);
PERF_END(PERFORMANCE_CPU1_BASE, 3);

/* --- moved: section 2 now starts after the lock is held --- */
PERF_BEGIN(PERFORMANCE_CPU1_BASE, 2);

temp = IORD(&macarray[i], 0);

IOWR(&macarray[i], 0, temp*i + temp);

altera_avalon_mutex_unlock(mutex);

PERF_END(PERFORMANCE_CPU1_BASE, 2);

 

?? 

 

- are you -really- sure the various CPUs are competing for the mutex? I do not see any code that synchronizes the various CPUs at the start of the cycle... That is in general a common problem, because each CPU has its own boot time. If you want to be sure, you probably need something like the startup barrier we implemented in ERIKA Enterprise; a minimal sketch follows.
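
Sketch of such a startup barrier (again, not the ERIKA implementation; BARRIER_COUNT_BASE is a hypothetical shared word that must be initialized to zero before any CPU reaches the barrier):

#define NCPUS 3   /* assumed number of cores in the system */

/* Every CPU checks in under the hardware mutex, then spins until all
   NCPUS cores have arrived; IORD/IOWR bypass the data cache so the
   count is never read stale. Call this before PERF_START_MEASURING
   so the measured loops really overlap. */
void startup_barrier(alt_mutex_dev *mutex)
{
    altera_avalon_mutex_lock(mutex, 1);
    IOWR(BARRIER_COUNT_BASE, 0, IORD(BARRIER_COUNT_BASE, 0) + 1);
    altera_avalon_mutex_unlock(mutex);

    while (IORD(BARRIER_COUNT_BASE, 0) < NCPUS)
        ;   /* wait until every CPU has checked in */
}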

 

bye 

 

PJ