array notation with no results

giannhssdra · ‎07-14-2012

Hello all

Im trying to use array notation in my code but i dont see any kind of improvment.
It takes the same ammount of time to do the same work in parallel and in serial.
And even though i have 4 workers the parallel coderuns with only one worker in one core.

These are the warnings it gives me
cc1plus: warning: label if_stmt_label_00001 defined but not used [-Wunused-label]
cc1plus: warning: label body_label_00001 defined but not used [-Wunused-label]
cc1plus: warning: label exit_label_00001 defined but not used [-Wunused-label]

Here is my command to compile
g++ -std=c99 -O2 -Wall -g -lcilkrts -ldl array_notation.cpp -o array_notation
With no optiomization the parallel runs a little bit faster.But i saw in the cilk documentation that in order for array notation to take place i need -O1 and above.But this way i see no faster results.

Is there anything missing from my code or my command options?

Thanks

Balaji_I_Intel · ‎07-14-2012

Hello,
Cilk Keywords are used to achieve task-level parallelism. Array notations areusedto express data-parallelism. In GCC, if you have -ftree-vectorize, it will try to vectorize the loop. Alternatively, if you have -O3 it will automatically turn on the ftree-vectorize flag.

Thanks,

Balaji V. Iyer.

giannhssdra · ‎07-14-2012

Ok i see

But what is the point of using them if there is no speedup?
I thought that since the elements have nothing to do with each other the runtime system would schedule the workers to work with parts of the array in parallel.But i dont see that happening.
Are there any spawns syncs in the array notation?

Is this a good behaviour the way i get my results?

Also i had some troubles with cilk_spawn and cilk_for and Barry told me that i should ask you.
Here is the first one and here is the second.(threads from the forum)

Thanks

Balaji_I_Intel · ‎07-14-2012

Hello Giannhssdra,
Array notation does not interact with the runtime, only the task parallelism components (Cilk Keywords and Hyperobjects) do.

Thanks,

Balaji V. Iyer.

ARCH_R_Intel · ‎07-16-2012

By the way, highly optimizing compilers are good at rearranging simple nests of loops like in array_notation.cpp so that they do not actually do all the operations. For example, I modified array_notation.cpp to change the variable-length arrays from "static" to automatic allocation, and compiled with "icc -xHost -O2" with icc 12.1. Then I ran:

[plain]$ a.out 1000000 1000000000 start of main... workers created...47 app workers...16 Initialization of arrays... serial program... time : 2.698ms ...
[/plain]If the serial code actually performed all the floating-point operations, that's about 370 Petaflops. Even our best cores don't run that fast yet :-) I poked through the assembly code and it looks like the compiler detected that multiple trips through the "i < times" loop were redundant and replaced them with a single trip.

I was able to stop icc from being this clever by declaring "times" as volatile.

Another thing to consider is that an automatic vectorizer can easily vectorize the serial code in the example, because detecting the lack of aliasing between a, b, and c is straightforward for a compiler.

giannhssdra · ‎07-18-2012

Hello

Can someone explain the way array notations work?

Im thrying to see if there is any kind of parallelism but i cant figure out.Even when i use it with a function( which i read that it should execute in parallel ) i can see that they are executed on the same worker serial.
a[:] = add( b[:] , c[:] );

Is it just another way of writing the equivalent for loop?

I read that i cand use array notation with while , if , and for.But can i use it with the cilk_for?is it safe?

Thanks

Barry_T_Intel · ‎07-18-2012

Exactly.

a[:] = add( b[:] , c[:] );

is equivalent to

for (int i=i; i < sizeof(a)/sizeof(a[0]); i++)
    a = add(b, c);

Most of the time the compiler's auto-vectorizer will recognize that it can vectorize a loop, but sometimes it will miss. Array notations allow you to explicitly specify what you want vectorized. You should also check out #pragma simd, which specifies that if the compiler cannot vectorize a loop, it should emit a warning.

Vectorization does not include parallelization, but they can work together. So you could write:

cilk_for (int i=i; i < sizeof(a/sizeof(a[0])); i++)
    a = add(b, c);

The compiler will automatically realize that the range of the cilk_for is divided by the vector unit width. Unfortunately you cannot easily combine cilk_for with #pragma simd.

- Barry

Balaji_I_Intel · ‎07-18-2012

Hello Giannhssdra,
You can think of your statement as being equivalent to a for loop.

for example: when you have something like this

a[:] = add (b[:], c[:])

The compiler will tranform itinto something like this (I am assuming size of a, b, c as SIZE):

for (i = 0; i < SIZE; i++)
a = add (b, c);

When you represent it in array notation, it will bebroken down intothe appropriatecode as mentioned above and ifthe compiler can vectorizethe loop, itwill try and vectorize it.

Yes, you should be able to use array notations inside cilk_for.

Thanks,

Balaji V. Iyer.

Jim_S_Intel · ‎07-18-2012

The first time you encounter array notations, it is pretty easy to get confused about what array notation does and what it doesn't. The array notations are not parallelizing across multiple threads, so you won't see speedups using only array notations as you change the number of worker threads in Cilk.

The goal of array notations is to make it easier for the compiler to identify where code can be vectorized.
Modern processors have vector hardwarethat enable the machine to exploit data parallelism (e.g., add a vector of 4 ints to another vector of 4 ints). The idea behind a statement like:
a[0:n] = add(b[0:n], c[0:n]);
is to make it easier for the compiler to figure out that it could put a, b, and c into vector registers.

Semantically, if you wrote the statement above, you are also allowed to write the following loop:

cilk_for(int i = 0; i < n; ++i) {
a = b + c;
}

This loop indicates that you are trying to explot task parallelism, i.e., use multiple threads to execute the work of this loop.

When you use array notation in the loop above, currently in Cilk Plus, the loop is NOT translated into a cilk_for, no matter how large n is. I found this point confusing myself, the first time I tried to use array notations.

If you want to exploit both task AND data parallelism for this loop in Cilk, you have to use an keyword for task parallelism. For example, you might write something like:

//EDIT: Note that this code needs n to be divisible by 128.
cilk_for(int i = 0; i < n; i += 128) {
a = b + c;
}

In principle, this transformation is could be done automatically. I suspect there are many reasons why this isn't done, but one I can think of is that array notations work with not only Cilk, but TBB, OpenMP, etc., so its not always clear what the compiler should do for task parallelism. Also, for such a simple loop, you really do need a pretty large value of n to see any speedups.

Now to get to what I think is your original question... why are you not seeing any speedup? There are three issues to consider here.

1. As mentioned earlier, you won't see speedup in this loop using more threads without using an explicit cilk_for.

2. The way to observe whether you are getting benefits of vector hardwareis to compare your code compiled with vectorization turned on, versus your code compiled with vectorization turned off. (I forget the details, but I believe there is a compile-time flag that you can set to disable vectorization). If these timings are the same, then you know that your code is not being vectorized efficiently. There is also a compiler flag to turn on a vectorization report, at least for icc.

3. Finally, you might try comparing your code with array notations vs. just using a normal for loop. But I would expect for a loop as simple as the one you describe, the auto-vectorizer in the compiler is going to be smart enough to figure out that it can vectorize the for loop anyway, even without you using array notation. For more complicated code, using array notations might help the compiler figure out it can vectorize when it otherwise wouldn't. But it is hard to see the difference on a simple example.

In summary, array notations are there to help the compiler do a better jobof exploiting data-parallelism (i.e., using its vector hardware). Thus,the benefits are a bit more subtle than something like a cilk_for loop. In simple cases, the compiler may already be able to vectorize, so you might already be getting the benefits even without using the array notation.

Hope that clarifies things a bit.

Cheers,

Jim

giannhssdra · ‎07-19-2012

Thanks for the answers i think i got it.
Something about vectorization now.
Since this simd unit is hardware depentant i guess its size( number of data that can process at once )is fixed.But how much is it?

Lets say its 4.The for loop with the -ftree-vectorize flag on will translated to this?

for( int i = 0 ; i < size ; i+= 4)
_data_in_simd_unit_and_execute_at_once( a , a[i+1] , a[i+2] , a[i+3 );

So i have x4 speedup in this for loop or whatever the width of the simd unit is.Is this how it works?

Also im trying to see when the loops are actually vectorized with the -ftree-vectorizer-verbose=2 flag
This code when compiled with this flag said that 4 loops were vectorized but not the a[0:size] = b[0:size] line.
Also when i uncomment the cilk_for line and compile again all the loops are un-vectorized.Why is this happening?

I cant see how array notation and cilk_for can work together.
1) cilk_for( int i = 0 ; i < sizeof(out)/sizeof(out[0]) ; i++ )
2) cilk_for( int i = 0 ; i < sizeof(out/sizeof(out[0]) ; i += size )
body : out[0:size] = doSome( in[0:size] , in2[0:size] );

In case 1 there are size iterations and each one of them executes the whole array.
In case 2 there is one itaration but the whole array is executed once but from the same worker.
How can i break the array so that it will executed in parallel with cilk_for?

Thanks

Jim_S_Intel · ‎07-19-2012

In principle, instead of writing

a[0:size] = b[0:size] + c[0:size],

you could write something like:

const int B = 10000;
int last_i = size / B;

cilk_for(int i = 0; i < last_i; i+=B) {
a = b + c;
}
a[last_i:(size-last_i]] = b[last_i:(size-last_i)] + c[last_i:(size-last_i)];

You have to divide your original loop into blocks manually, and execute blocks in parallel using cilk_for, but use array notations within each block.

Note that you have to be careful with the last iteration, which might not be a full block. In my haste, I left off this detail in my last code sample.

For this particular example though, I wouldn't be surprised if you don't see any speedup. There is so little computation being done on each word of memory being accessed, that I wouldn't be surprised if most of the time is spent moving the arrays between the CPU and memory. There is a limited on memory bandwidth that you might be running into, so splitting the work between CPUs may not help you?

I'm not quite sure what the compiler is generating, so I'll have to leave it to someone more knowledgeable to answer that question. :)

Cheers,

Jim

giannhssdra · ‎07-20-2012

Thanks for th answer

Are you sure about your code ?I tried it but it does only one iteration.

i tried this

const int STEP = someVAlue

cilk_for( int i = 0 ; i

a = b + c

But when i try it with cilk screen i gives 210 dublicate error messages.Why dublicate?I thought it was for overlapping ranges .

Thanks

TimP · ‎07-20-2012

I doubt that gcc makes the following distinctions which icpc makes between CEAN and plain C defaults:
-ansi-alias (equivalent to gcc -fstrict-aliasing) is in effect by default for CEAN, not for C. This might not matter in your example; gcc has -fstrict-aliasing as a default in both cases.
#pragma vector always is on by default for CEAN, not for C. In your example, this over-rides the compiler's effort to decide whether an array is big enough to vectorize. Declaring a large enough constant size (known at compile time) or #pragma loop count would be alternative ways for icpc to vectorize without CEAN. Short arrays can't profitably be vectorized unless they are aligned. For a trivial example like yours, you would use __attributes__(aligned(32)) to get enough alignment for AVX (16 for SSE). So the compiler is correct in not vectorizing a short loop with misaligned data, and CEAN forcing vectorization could slow it down.
#pragma ivdep is on by default for CEAN, not for C. This should not matter in your example, but in normal applications you need either restrict qualifiers (the only way for g++) or ivdep assertions.
I don't know whether g++ takes array notation as implying __restrict.
Other than the more aggressive defaults in icpc, I don't believe there is intended to be special magic in CEAN. icpc does often treat nested loops differently with array notation, for example it may use AVX-128 (rather than AVX-256) with array notation so that it need not add code to make a run-time decision whether to take a vector path. We haven't been able to find out whether such differences are intended by design, and I wouldn't expect gcc to make such distinctions.

Your static int declaration of VLA isn't accepted by my copies of g++ or icpc.

For cilk_for in icpc it's important to set CILK_NWORKERS to a reasonable value (not more than the number of cores). The defaults for CILK_NWORKERS are too big to work effectively. I would expect gcc to have similar treatment.
In the simple example of trying a CEAN vectorization inside a cilk_for, you could expect to need STEP > 1000 and size > STEP*STEP to see an advantage over plain vectorization. This approaches the size where you need to consider the influence of nontemporal stores.

Report Inappropriate Content · ‎09-20-2012