Multicore SMP system task time measurements error

atilla_k_

Hello,

I have an i7-4700EQ processor and want to run work on 4 cores in parallel. I compiled and ran the code below. With only 1 core, the measured time was 7198.200000 us. With 4 cores, I saw about 18290.221667 us on each core. How is that possible? I expected about 7198 us on each core, because the tasks are independent and use separate memory buffers.

build specs: CC_ARCH_SPEC = -march=core2 -nostdlib -fno-builtin -fno-defer-pop -m64 -fno-omit-frame-pointer -mcmodel=kernel -mno-red-zone -mavx2 -fno-implicit-fp

code;

void MultiCoresExample(int iA, int iB, int affin);
void TempMultiCoreCopy(int iA, int iB, int affin);

double dtime1[4], dtime2[4];
typedef struct
{
    float *vInput[4];
    float *vOutput[4];
}tempStruct;
tempStruct tmpStr;

void MultiCoresExample(int iA, int iB, int affin)
{
    TASK_ID  tids[4];  /* some task IDs */
    char taskName[32];
    int cpuIx[] = {0,1,2,3};  /* core ID's*/
    int i, j;
    cpuset_t affinity;
    float *f0, *f1, *f2, *f3, *f4, *f5, *f6, *f7;
    float *fIn[4];
    float *fOut[4]; 
    f0 = memalign(128, iA*iB*4);
    f1 = memalign(128, iA*iB*4);
    f2 = memalign(128, iA*iB*4);
    f3 = memalign(128, iA*iB*4);
    f4 = memalign(128, iA*iB*4);
    f5 = memalign(128, iA*iB*4);
    f6 = memalign(128, iA*iB*4);
    f7 = memalign(128, iA*iB*4);
    tmpStr.vInput[0]  = f0;
    tmpStr.vInput[1]  = f1;
    tmpStr.vInput[2]  = f2;
    tmpStr.vInput[3]  = f3;
    tmpStr.vOutput[0] = f4;
    tmpStr.vOutput[1] = f5;
    tmpStr.vOutput[2] = f6;
    tmpStr.vOutput[3] = f7;
    /******* init ***************/
    for(i=0; i<affin; i++)
    {
        for(j=0; j<iA*iB; j++)
        {
            /* fill each input buffer element by element */
            f0[j] = (i+1)*j/100.;
            f1[j] = (i+1)*j/70.;
            f2[j] = (i+1)*j/40.;
            f3[j] = (i+1)*j/20.;
        }
    }
    /****************************/
    printf("Cores are setting...\n");
    for(i=0; i<affin; i++)
    {
        CPUSET_ZERO (affinity);
        CPUSET_SET(affinity, cpuIx[i]);
        sprintf(taskName, "t%s%d", "task", i);
        tids[i] = taskCreate(taskName, 120, TASK_OPTIONS, 65536, (FUNCPTR)TempMultiCoreCopy, iA, iB, affin, 0,0,0,0, 0, 0, 0);
        printf("Task created:0x%08x\n", tids[i]);
        if (tids[i] == NULL)
        {
            /*return (ERROR);*/
            printf("Task create error:0x%08x\n", tids[i]);
        }
        if(affin != -1)
        {
            /* Clear the affinity CPU set and set index for CPU */
            if (taskCpuAffinitySet(tids[i], affinity) == ERROR)
            {
                /* Either CPUs are not enabled or we are in UP mode */
                printf("Affinity error \n");
                taskDelete(tids[i]);
                /*return (ERROR);*/
            }
            taskDelay(sysClkRateGet()/10);
            taskCpuAffinityGet(tids[i], &affinity);
            printf("Task Affinity:%d\n", affinity);
        }
    }
     
    for(i=0; i<affin; i++)
    {
        taskActivate(tids[i]);
    }
    taskDelay(sysClkRateGet()* 4); /* wait for all cores to finish */
    for(i=0; i<affin; i++)
    {
        printf("\nStartTime[%d]=%f  FinishTime[%d]=%f  ExecutionTimeForCore[%d]=%f us\n", i, dtime1[i], i, dtime2[i], i, (dtime2[i]-dtime1[i]));
    }
    for(i=0; i<affin; i++)
    {
        taskDelete(tids[i]);
    }
}

void TempMultiCoreCopy(int iA, int iB, int affin)
{
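    int kk;
    /* results and buffers are indexed by the CPU this task runs on */
    int iCpuId = vxCpuIdGet();
    dtime1[iCpuId] = getTimeDouble(2);   /* start time */
    for(kk=0; kk<1000; kk++)
    {
        memcpy(tmpStr.vOutput[iCpuId], tmpStr.vInput[iCpuId], iA*iB*4);
    }
    dtime2[iCpuId] = getTimeDouble(2);   /* finish time */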
}

screen;

sp MultiCoresExample,16,2048,1
Task spawned: id = 0xffff80000efd1510, name = t1
value = -140737236888304 = 0xffff80000efd1510
A->Cores are setting...
Task created:0x0efe2020
Task Affinity:1
StartTime[0]=218186962.911667  FinishTime[0]=218194161.111667  ExecutionTimeForCore[0]=7198.200000 us

sp MultiCoresExample,16,2048,2
Task spawned: id = 0xffff80000efd1510, name = t2
value = -140737236888304 = 0xffff80000efd1510
A->Cores are setting...
Task created:0x0efe2020
Task Affinity:1
Task created:0x0f1ea810
Task Affinity:2
StartTime[0]=264755500.995000  FinishTime[0]=264773712.746667  ExecutionTimeForCore[0]=18211.751667 us
StartTime[1]=264755514.550000  FinishTime[1]=264773614.643333  ExecutionTimeForCore[1]=18100.093333 us

sp MultiCoresExample,16,2048,3
Task spawned: id = 0xffff80000efd1510, name = t3
value = -140737236888304 = 0xffff80000efd1510
A->Cores are setting...
Task created:0x0efe2020
Task Affinity:1
Task created:0x0f1ea810
Task Affinity:2
Task created:0x0efe2510
Task Affinity:4
StartTime[0]=288507258.976667  FinishTime[0]=288525447.206667  ExecutionTimeForCore[0]=18188.230000 us
StartTime[1]=288507271.261667  FinishTime[1]=288525387.871667  ExecutionTimeForCore[1]=18116.610000 us
StartTime[2]=288507259.561667  FinishTime[2]=288514408.870000  ExecutionTimeForCore[2]=7149.308333 us


sp MultiCoresExample,16,2048,4
Task spawned: id = 0xffff80000efd1510, name = t4
value = -140737236888304 = 0xffff80000efd1510
A->Cores are setting...
Task created:0x0efe2020
Task Affinity:1
Task created:0x0f1ea810
Task Affinity:2
Task created:0x0f413610
Task Affinity:4
Task created:0x0f413b00
Task Affinity:8
StartTime[0]=307985065.768333  FinishTime[0]=308003355.990000  ExecutionTimeForCore[0]=18290.221667 us
StartTime[1]=307985078.606667  FinishTime[1]=308003284.243333  ExecutionTimeForCore[1]=18205.636667 us
StartTime[2]=307985064.923333  FinishTime[2]=308003229.746667  ExecutionTimeForCore[2]=18164.823333 us
StartTime[3]=307985066.711667  FinishTime[3]=308003220.956667  ExecutionTimeForCore[3]=18154.245000 us
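
The timings above can be turned into per-task copy rates with a little arithmetic. The sketch below assumes iA=16 and iB=2048 (taken from the `sp MultiCoresExample,16,2048,N` command lines), 4-byte floats, and the 1000 memcpy iterations in TempMultiCoreCopy. The helper names are hypothetical and are not part of the posted code.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes read by one memcpy of an iA x iB float array (source side only). */
static size_t bytes_per_copy(int iA, int iB)
{
    assert(iA > 0 && iB > 0);
    return (size_t)iA * (size_t)iB * sizeof(float);
}

/* Copy rate in GB/s (1e9 bytes per second) for n_copies copies
   that together took elapsed_us microseconds. */
static double copy_rate_gb_per_s(size_t bytes, int n_copies, double elapsed_us)
{
    assert(n_copies > 0 && elapsed_us > 0.0);
    /* bytes per microsecond is the same as MB/s; divide by 1000 for GB/s */
    return ((double)bytes * (double)n_copies / elapsed_us) / 1000.0;
}

/* How many times slower a task ran compared with the single-task run. */
static double slowdown(double t_parallel_us, double t_single_us)
{
    assert(t_single_us > 0.0);
    return t_parallel_us / t_single_us;
}
```

Plugging in the 1-task time (7198.2 us) gives roughly 18.2 GB/s read per task. The 4-task time for core 0 (18290.221667 us) gives roughly 7.2 GB/s per task, a slowdown of about 2.54x rather than the 1x expected for fully independent cores.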

 

McCalpinJohn

What are the values of iA and iB?    It is not possible to figure out which parts of the memory hierarchy are being used if the sizes of the arrays are not known.

Your processor supports HyperThreading.  If HyperThreading is enabled, the system might map logical processors 0,1,2,3 to different physical cores, or it might map logical processors 0,2,4,6 to different physical cores.

Your 3-thread result suggests that you do have HyperThreading enabled and that logical processors 0,1 are mapped to physical core 0, 2,3 are mapped to physical core 1, 4,5 are mapped to physical core 2, and 6,7 are mapped to physical core 3.   So in the 3-thread case, threads 0 and 1 are sharing physical core 0 (and therefore running slowly), while thread 2 is running by itself on physical core 1 (and running at full speed).
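
The mapping described above can be checked on paper with a small sketch. It assumes the adjacent-sibling numbering inferred from the 3-thread run (logical processors 0,1 on physical core 0, 2,3 on core 1, and so on). It does not query the hardware, and both function names are hypothetical.

```c
#include <assert.h>

/* Physical core of a logical CPU, ASSUMING HyperThreading siblings are
   numbered adjacently (0,1 -> core 0; 2,3 -> core 1; ...). */
static int physical_core_of(int logical_cpu)
{
    assert(logical_cpu >= 0);
    return logical_cpu / 2;
}

/* Count how many tasks land on each physical core.
   per_core[] receives the counts; the largest count is returned. */
static int max_tasks_per_core(const int cpu_ids[], int n_tasks,
                              int per_core[], int n_cores)
{
    int max = 0;
    for (int c = 0; c < n_cores; c++)
        per_core[c] = 0;
    for (int t = 0; t < n_tasks; t++) {
        int core = physical_core_of(cpu_ids[t]);
        assert(core < n_cores);
        per_core[core]++;
        if (per_core[core] > max)
            max = per_core[core];
    }
    return max;
}
```

With CPU IDs {0,1,2}, this places two tasks on physical core 0 and one on physical core 1, which matches the two slow tasks and one fast task in the 3-thread output. With {0,2,4,6}, each physical core would get exactly one task under the same assumption.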

atilla_k_

Thanks John,

I changed the HyperThreading mode and it works.   iA=64 and iB=2048.

What about leaving HyperThreading enabled? Can it work that way? When I set CPUs 0, 2, 4 and 6, I saw "Affinity error" on screen for CPUs 4 and 6. Can I run this code with HyperThreading enabled?

McCalpinJohn

Based on the output of your first run, it looks like you should try changing

int cpuIx[] = {0,1,2,3};  /* core ID's*/

to

int cpuIx[] = {0,2,4,6};  /* core ID's*/

When HyperThreading is enabled, this should place one thread on each physical core.

If HyperThreading is disabled, then this won't work, since the available cores will be [0,1,2,3], so you will need the code to be able to compensate.

It gets trickier if you need to programmatically determine whether or not HyperThreading is enabled and how the "logical processors" are mapped to the physical cores, and I don't know how to attempt to do this on VxWorks.  
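
The compensation mentioned above could look something like the following sketch. It assumes the caller already knows how many logical CPUs are enabled (how to obtain that count on VxWorks is outside this sketch) and that HyperThreading siblings are numbered adjacently. `choose_cpu_ids` is a hypothetical helper and has not been tested on VxWorks.

```c
#include <assert.h>

/* Fill out[] with one logical CPU ID per task.
   Returns the stride used (2 = skip sibling threads, 1 = use every CPU),
   or -1 if there are more tasks than logical CPUs. */
static int choose_cpu_ids(int n_logical_cpus, int n_tasks, int out[])
{
    assert(n_logical_cpus > 0 && n_tasks > 0 && out != 0);
    if (n_tasks > n_logical_cpus)
        return -1;
    /* Skip every other logical CPU only when there are enough of them,
       e.g. 8 logical CPUs for 4 tasks; otherwise fall back to 0,1,2,... */
    int stride = (n_logical_cpus >= 2 * n_tasks) ? 2 : 1;
    for (int t = 0; t < n_tasks; t++)
        out[t] = t * stride;
    return stride;
}
```

With 8 logical CPUs and 4 tasks this yields {0,2,4,6}; with 4 logical CPUs it falls back to {0,1,2,3}. The result would then replace the fixed cpuIx[] initializer.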

atilla_k_

Thanks John,

I tried cpuIx[] = {0,2,4,6}; but it didn't work correctly with HyperThreading enabled. When I set CPU IDs 4 and 6, my code printed "Affinity error". I changed these IDs and tried all IDs, but it didn't work. As soon as I disabled HyperThreading, it worked correctly with cpuIx[] = {0,1,2,3};.

I don't understand why it didn't work. I also tried mapping logical cores to physical cores.

The other problem is data size:

When iA=16 and iB=2048 (data size = iA*iB*sizeof(float)), the tasks run in parallel on all cores with HyperThreading enabled.

When iA=64 and iB=2048 (data size = iA*iB*sizeof(float)), the tasks do not run in parallel on all cores with HyperThreading enabled.

Do you have any idea?
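
For reference, the two data sizes above work out as follows. This is a minimal sketch of the arithmetic only. The cache sizes are assumptions for this processor (256 KiB of L2 per core and 6 MiB of shared L3) and are not stated anywhere in the thread, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* ASSUMED cache sizes; not taken from the thread. */
#define ASSUMED_L2_BYTES_PER_CORE (256u * 1024u)
#define ASSUMED_L3_BYTES_SHARED   (6u * 1024u * 1024u)

/* Bytes touched by one task: one input buffer plus one output buffer. */
static size_t task_working_set(int iA, int iB)
{
    assert(iA > 0 && iB > 0);
    return 2u * (size_t)iA * (size_t)iB * sizeof(float);
}

/* Bytes touched by all tasks together. */
static size_t total_working_set(int iA, int iB, int n_tasks)
{
    assert(n_tasks > 0);
    return task_working_set(iA, iB) * (size_t)n_tasks;
}
```

For iA=16 each task touches 256 KiB. For iA=64 each task touches 1 MiB, and four such tasks touch 4 MiB in total. Under the assumed sizes, the larger case no longer fits in a per-core L2 and relies on the shared L3, which may be related to the difference observed, although nothing in the thread confirms this.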


 

McCalpinJohn

I don't have any experience with VxWorks, so I can't really speculate on what is going on with the affinity calls.

I just noticed that your compilation options include several flags that are specific to generating code for running inside the kernel, but the rest of the code does not look like it is set up as a kernel module.  This could be the cause of some of the troubles?
