Re: SSSE3 optimization: -axT versus -xT switch

hladkyjiri · 10-28-2008

Hello!

I'm trying to achieve the best possible runtimes for the mathtool application from the ARPREC package. I was really surprised to see that the -axT compile option gives me better results than the -aT option:

F90 Flags = -O2 -FR -fp-model source -m64 -axT

gives me runtime (CPU): user 1m55.319s

and

F90 Flags = -O2 -FR -fp-model source -m64 -aT

gives me runtime (CPU): user 2m3.220s

I'm using a Core2 CPU T7400 and a Xeon E5440. Compilation is done with the 64-bit version of ifort (IFORT) 10.1 20080312.

-ax option: Tells the compiler to generate multiple, processor-specific code paths if there is a performance benefit. It also generates a generic IA-32 architecture code path.

-x option: Tells the compiler to generate optimized code specialized for the processor that executes your program.
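The difference between the two switches can be pictured with a small conceptual model. The Python sketch below only illustrates the idea in the two option summaries above: an -ax style build carries a specialized path plus a generic fallback and picks one at run time, while an -x style build carries only the specialized path. The function names and the feature sets are made up for the example, and this is not a description of how ifort actually implements its dispatch.

```python
# Conceptual model only: all names and feature sets here are hypothetical.

def kernel_generic(values):
    """Baseline path that any CPU can run."""
    return sum(v * v for v in values)


def kernel_ssse3(values):
    """Stand-in for a path tuned for SSSE3-capable CPUs (same result)."""
    return sum(v * v for v in values)


def build_ax(cpu_features):
    """-ax style: specialized path plus generic fallback, chosen at run time."""
    if "ssse3" in cpu_features:
        return kernel_ssse3
    return kernel_generic


def build_x(cpu_features):
    """-x style: only the specialized path exists in the binary."""
    if "ssse3" not in cpu_features:
        raise RuntimeError("binary requires SSSE3")
    return kernel_ssse3


core2 = {"sse", "sse2", "sse3", "ssse3"}
older_cpu = {"sse", "sse2"}

print(build_ax(core2).__name__)      # specialized path selected
print(build_ax(older_cpu).__name__)  # generic fallback selected
print(build_x(core2).__name__)       # specialized path, no fallback
```

In this model both builds end up on the same specialized path on a Core2, which is why equal or better performance from -xT would be the natural expectation.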

Since -xT basically limits your code to running on a specific CPU, I would assume that it gives the same or better performance than -axT. I would certainly not expect -axT to perform about 5% better than -xT.
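To make the size of the gap concrete, here is a minimal Python sketch that converts the two `user` times quoted above into seconds and computes the relative difference. The helper name `to_seconds` is made up for this example, and the second timing is treated as the -xT build, as this paragraph does.

```python
def to_seconds(stamp):
    """Convert a `time`-style stamp such as '1m55.319s' to seconds."""
    minutes, seconds = stamp.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


t_ax = to_seconds("1m55.319s")  # build with -axT
t_x = to_seconds("2m3.220s")    # the other build, taken here to be -xT

gap = t_x - t_ax
print(f"-axT: {t_ax:.3f} s, -xT: {t_x:.3f} s, gap: {gap:.3f} s")
print(f"-xT is {100 * gap / t_ax:.1f}% slower than -axT")
print(f"-axT is {100 * gap / t_x:.1f}% faster than -xT")
```

With these two numbers the gap is 7.901 s, which works out to roughly 6.4% or 6.9% depending on which run is used as the baseline.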

Can anybody please comment on this? Am I doing something wrong, are my expectations wrong, or is it a bug in ifort?

Thanks a lot!

Jiri

TimP · 10-28-2008

Did you use -aT, or the correct -xT?

hladkyjiri · 11-03-2008

Hi Tim,

I have used -axT (which gives me better results)

-axT -- Can generate specialized code paths for SSSE3, SSE3, SSE2, and SSE instructions for Intel processors, and it can optimize for the Intel Core2 Duo processor family.

and -xT (which gives me worse results)

-xT -- Can generate SSSE3, SSE3, SSE2, and SSE instructions for Intel processors, and it can optimize for the Intel Core2 Duo processor family. This is the default on Mac OS X systems using Intel 64 architecture.

My understanding is that -xT generates code that runs only on the specified CPU. With -axT I will get code that can take advantage of the specified CPU but also runs on a generic IA-32 architecture.

I don't really understand your question: there is no -aT option for icc 10.1 20080312 (according to man icc). Can you please explain what you meant?

Thanks

Jiri

hladkyjiri · 11-03-2008

Hi Tim,

I'm sorry, I get your question now! I used the -axT option (the generated binary runs faster) and the -xT option (which produces a slower binary). -aT was a typo in my post. I do apologize for the confusion!

Thanks

Jiri

TimP · 11-03-2008

I can't think of any "good" reason why -axT would run faster than -xT. One possibility might be differences in code alignment, possibly affecting the operation of the Loop Stream Detector. You might have to profile with VTune, or at least gprof, to locate it, if the alignment changes don't make it go away.

hladkyjiri · 11-05-2008

Hi Tim,

thanks for the hint. I will try the following compile flags (taken from here):

For the Fortran part of the code: -warn alignments -align all -align rec8byte

For the C part of the code: -Zp8

I will also try gprof and post my results here.

Thanks

Jiri

hladkyjiri · 11-06-2008

Hi Tim,

I have tried the code alignment options, but without any effect. So I used gprof to find out what's going on. By far the biggest difference comes from the for_index function. I do not have this function in my code. It's probably part of the Fortran libraries.

-xT:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
28.94     28.88     28.88                             for_index

-axT:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
25.11     24.45     24.45                             for_index
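The two profile rows above can be compared with a short Python sketch. It assumes the usual gprof reading that "% time" is the function's self time divided by the total sampled time, so the total can be estimated from a single row. The helper name `parse_flat_row` is made up for this example, and the estimates are rough because the percentages are rounded.

```python
def parse_flat_row(row):
    """Parse a gprof flat-profile row: % time, cumulative s, self s, ..., name."""
    fields = row.split()
    return {
        "percent": float(fields[0]),
        "cumulative": float(fields[1]),
        "self": float(fields[2]),
        "name": fields[-1],
    }


row_x = parse_flat_row("28.94 28.88 28.88 for_index")   # -xT build
row_ax = parse_flat_row("25.11 24.45 24.45 for_index")  # -axT build

# Extra self time spent in for_index by the -xT build.
delta_for_index = row_x["self"] - row_ax["self"]

# Estimate the total sampled time from self time and its percentage share.
total_x = row_x["self"] / (row_x["percent"] / 100)
total_ax = row_ax["self"] / (row_ax["percent"] / 100)

# Time spent everywhere except for_index.
rest_x = total_x - row_x["self"]
rest_ax = total_ax - row_ax["self"]

print(f"for_index: -xT spends {delta_for_index:.2f} s more")
print(f"estimated totals: -xT {total_x:.1f} s, -axT {total_ax:.1f} s")
print(f"everything else:  -xT {rest_x:.1f} s, -axT {rest_ax:.1f} s")
```

Under these assumptions for_index alone is about 4.4 s slower in the -xT build, while the rest of the program comes out roughly 2 s faster, which suggests the slowdown is concentrated in that one routine.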

Do you have any clue where for_index comes from and how I can influence the runtime of this function?

Thanks a lot

Jiri

TimP · 11-06-2008

My guess would be that you have the Fortran INDEX intrinsic function in your source code.

I don't know why its behavior would change according to the compile options you have chosen. A possible way of investigating it might be to run the VTune or PTU profiler and see whether the most frequently executed loops are different. You would be able to see this only in the assembler instruction view.
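For readers unfamiliar with the INDEX intrinsic mentioned above, the following Python sketch approximates its standard behavior: it returns the 1-based starting position of a substring, or 0 if the substring does not occur, with an optional backward search. The function name `fortran_index` and the sample command line are hypothetical; the sketch models the documented semantics of the intrinsic, not the implementation of the for_index runtime routine or the actual mathtool source.

```python
def fortran_index(string, substring, back=False):
    """Approximate Fortran INDEX: 1-based start of substring, 0 if absent."""
    pos = string.rfind(substring) if back else string.find(substring)
    return pos + 1  # find/rfind return -1 when absent, which maps to 0


# Hypothetical command line of the kind a parser might scan.
command = "set precision = 200"

print(fortran_index(command, "="))             # position of the first "="
print(fortran_index(command, "e", back=True))  # position of the last "e"
print(fortran_index(command, "?"))             # substring not present
```

A parser that calls such a search many times per input line would spend much of its time in this one routine, which would be consistent with for_index sitting at the top of the flat profile.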

hladkyjiri · 11-07-2008

Hi Tim,

thanks, yes, you are probably right. Basically, the program parses a command file, and the INDEX function is used for this. I'm downloading PTU and will try to debug it further. I will post my findings here. However, I don't want to spend much more time on it. To me it seems like a compiler bug...

Anyhow, thanks a lot for all your valuable input!

Jiri