Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

dcopy and MOVAPS (request for comments)

redo
Beginner
I've just run into some difficulty compiling and running a Fortran code. After
a while I was able to reproduce the problem with the following toy code:

program test

  implicit none

  integer :: maxorb
  parameter(maxorb=1024)

  type :: t4_type
    integer :: firstint
    integer :: occ(maxorb)
  end type t4_type

  type :: t4_type_2
    integer :: firstint, secondint
    integer :: occ(maxorb)
  end type t4_type_2

  integer norb, i, occi(maxorb)

  type (t4_type) :: t4
  type (t4_type_2) :: t42

  real*8 :: pop(maxorb)

  do i=1,maxorb
    pop(i) = i
    t4%occ(i) = 0
  enddo

  norb = 250

  call testa (norb,pop,occi)
  print *, "first"
  call testa (norb,pop,t42%occ)
  print *, "second"
  call testa (norb,pop,t4%occ)
  print *, "third"

end

subroutine testa (norb,pop,occ)

  implicit none

  integer norb
  real*8 pop(1), occ

  call dcopy(norb,pop,1,occ,1)

  return
end

I am copying 250 doubles into three different arrays of 1024 integers, so
there is enough room in each destination and it should be a "legitimate"
operation. If I compile the code using Intel ifort 9.1 20070109 (I am using
mkl_8.1.1):

$ ifort -check all -r8 -g -o test main.F90 -L/opt/intel/mkl/lib/em64t -lmkl -lguide -lpthread

I obtain a segmentation fault running the "test" program on an Intel
Xeon CPU 5160 3.00GHz:

$ ./test
first
second
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
libpthread.so.0 000000307930C430 Unknown Unknown Unknown
libmkl_mc.so 0000002A959B8CD5 Unknown Unknown Unknown

Now the first two calls to testa() work perfectly, but not the third.
Using gdb, it is quite easy to see that the problem occurs inside a
function named Steps1_X8_Y16_Loop32, specifically here:

0x0000002a959b8cd5 : movaps %xmm0,0x0(%rcx)

...

(gdb) print $rcx
$1 = 5684076
(gdb) x/2g $rcx
0x56bb6c : 0x0000000000000000 0x0000000000000000

It seems to be a valid memory address, but it is not aligned on a 16-byte
boundary as required by the MOVAPS instruction. Is that correct? (The same
code works on an Intel Pentium D CPU 3.00GHz.)
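
As a quick check of the alignment question, here is a minimal sketch that takes the two destination addresses printed by gdb in this thread and computes their remainders modulo 16 and modulo 8. The program name and layout are illustrative only and nothing here calls MKL.

```fortran
program check_alignment
  use iso_fortran_env, only: int64
  implicit none

  ! Destination addresses printed by gdb ($rcx) in this thread:
  ! 5684076 = 0x56bb6c and 5963436 = 0x5afeac.
  integer(int64), parameter :: addrs(2) = [5684076_int64, 5963436_int64]
  integer :: k

  do k = 1, size(addrs)
     ! MOVAPS needs (address mod 16) == 0.
     ! A real*8 is normally expected at (address mod 8) == 0.
     print '(a,z0,a,i0,a,i0)', 'address 0x', addrs(k), &
          ': mod 16 = ', mod(addrs(k), 16_int64), &
          ', mod 8 = ', mod(addrs(k), 8_int64)
  end do
end program check_alignment
```

Both addresses end in hex digit c, so the remainder is 12 modulo 16 and 4 modulo 8. The destination is therefore only 4-byte aligned, which is less than the 8-byte alignment normally assumed for double precision data.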

thanks in advance for your help
loriano

6 Replies
TimP
Honored Contributor III
It's difficult to support such an old version of MKL. I thought that MKL dcopy would not be threaded, and it should perform its own peeling to reach an aligned boundary before using movaps to store data. If it is not accounted for by a different number of threads, it's hard to account for a different behavior between CPUs as similar as yours.
My impression of MKL dcopy is that it's not supported for performance tuning, only for compatibility with existing source code, now that current Intel compilers make automatic fast_memcpy substitutions, so this seems to be a seldom visited subject.
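
To illustrate the "peeling" idea mentioned above, here is a hypothetical sketch, not MKL's actual code. It assumes a copy routine that moves single 8-byte elements until the destination reaches a 16-byte boundary and then switches to aligned 16-byte stores. The function name `peel_count` and the first two addresses are made up for illustration. The third address is the one reported by gdb earlier in the thread.

```fortran
program peel_sketch
  use iso_fortran_env, only: int64
  implicit none

  ! Two hypothetical addresses plus the one gdb reported (5684076 = 0x56bb6c).
  print '(a,i0)', 'aligned to 16     -> ', peel_count(4096_int64)
  print '(a,i0)', 'aligned to 8 only -> ', peel_count(4104_int64)
  print '(a,i0)', 'gdb address       -> ', peel_count(5684076_int64)

contains

  ! Number of 8-byte elements to copy one at a time before the
  ! destination reaches a 16-byte boundary, or -1 if stepping by
  ! 8 bytes can never reach such a boundary.
  integer function peel_count(addr)
    integer(int64), intent(in) :: addr
    select case (mod(addr, 16_int64))
    case (0_int64)
       peel_count = 0
    case (8_int64)
       peel_count = 1
    case default
       peel_count = -1
    end select
  end function peel_count

end program peel_sketch
```

Under this assumption, a destination that is only 4-byte aligned can never be brought to a 16-byte boundary by peeling whole doubles. That would be consistent with the MOVAPS fault, although the real MKL logic may differ.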
redo
Beginner

Dear Tim, thanks for your comments. Following your suggestion,
I tried a newer version of the Intel compiler (ifort 11.0 20081105)
together with a newer version of the Intel Math Kernel Library (mkl 10.1.0.015).
I ran the same test:

$ ifort -check all -r8 -g -o test main.F90 -L/home/redo/intel/mkl/10.1.0.015/lib/em64t -lmkl -lguide

and again the program runs without any problem on the Pentium D,
whereas this is what I get when I run it on the Xeon:

$ ./test
first
second
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
libpthread.so.0 000000307930C430 Unknown Unknown Unknown
libmkl_mc.so 0000002A96D19664 Unknown Unknown Unknown

0x0000002a96d19664 : movaps %xmm0,(%rcx)

(gdb) print $rcx
$1 = 5963436
(gdb) x/2g $rcx
0x5afeac : 0x0000000000000000 0x0000000000000000

In any case, this seems to be more an intrinsic "limitation" of the Fortran
language itself than an ifort or MKL problem.
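
One way to look at the difference between the second call (t42%occ, which works) and the third (t4%occ, which fails) is to print the byte offset of occ inside each derived type. The sketch below reuses the two type definitions from the toy code. The helper `addr_of` and the program name are hypothetical. The layout of a derived type without SEQUENCE or BIND(C) is compiler-dependent, so the printed values are not guaranteed.

```fortran
program layout_offsets
  use iso_c_binding, only: c_loc, c_intptr_t
  implicit none

  integer, parameter :: maxorb = 1024

  ! Same two types as in the toy code above.
  type :: t4_type
     integer :: firstint
     integer :: occ(maxorb)
  end type t4_type

  type :: t4_type_2
     integer :: firstint, secondint
     integer :: occ(maxorb)
  end type t4_type_2

  type(t4_type),   target :: t4
  type(t4_type_2), target :: t42

  ! Distance in bytes from the start of each object to its occ array.
  print '(a,i0)', 'byte offset of occ in t4_type   : ', &
       addr_of(t4%occ(1)) - addr_of(t4%firstint)
  print '(a,i0)', 'byte offset of occ in t4_type_2 : ', &
       addr_of(t42%occ(1)) - addr_of(t42%firstint)

contains

  ! Hypothetical helper: address of an integer variable as an integer value.
  integer(c_intptr_t) function addr_of(x)
    integer, target, intent(in) :: x
    addr_of = transfer(c_loc(x), 0_c_intptr_t)
  end function addr_of

end program layout_offsets
```

With default 4-byte integers one would expect offsets of 4 and 8. If so, occ in t4_type starts on a 4-byte boundary that is not a multiple of 8, while the extra integer in t4_type_2 pushes occ onto an 8-byte boundary. This matches the pattern of which calls succeed, but it is an inference from the thread, not a confirmed diagnosis.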

loriano

TimP
Honored Contributor III
Could you consider submitting the test case on premier.intel.com? I'd still be curious whether it depends on number of threads (set OMP_NUM_THREADS to same value on each platform) or on whether the more current libiomp5 is used in place of libguide.
redo
Beginner

OK, I'll try to submit the test case.
Gennady_F_Intel
Moderator
Hi redo,
Please try to link your example with the following command line:
ifort -check all -r8 -g -I/opt/intel/mkl/10.1.0.015/include test.f90 /opt/intel/mkl/10.1.0.015/lib/em64t/libmkl_intel_lp64.a /opt/intel/mkl/10.1.0.015/lib/em64t/libmkl_intel_thread.a /opt/intel/mkl/10.1.0.015/lib/em64t/libmkl_core.a -L/opt/intel/mkl/10.1.0.015/lib/em64t -liomp5 -lpthread -lm -o test.out
--Gennady

redo
Beginner

Hi Gennady,
I just tried the following:

$ ifort -check all -r8 -g -I/home/redo/intel/mkl/10.1.0.015/include test.f90 /home/redo/intel/mkl/10.1.0.015/lib/em64t/libmkl_intel_lp64.a /home/redo/intel/mkl/10.1.0.015/lib/em64t/libmkl_intel_thread.a /home/redo/intel/mkl/10.1.0.015/lib/em64t/libmkl_core.a -L/home/redo/intel/mkl/10.1.0.015/lib/em64t -liomp5 -lpthread -lm -o test.out

I am working on a:

Red Hat Enterprise Linux AS release 4 (Nahant Update 4)

kernel:

Linux 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux

$ ldd ./test.out
libiomp5.so => /home/redo/intel/Compiler/11.0/074/lib/intel64/libiomp5.so (0x0000002a95557000)
libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x0000003079300000)
libimf.so => /home/redo/intel/Compiler/11.0/074/lib/intel64/libimf.so (0x0000002a95718000)
libm.so.6 => /lib64/tls/libm.so.6 (0x0000003078b00000)
libc.so.6 => /lib64/tls/libc.so.6 (0x0000003078800000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x000000307af00000)
libdl.so.2 => /lib64/libdl.so.2 (0x0000003078600000)
/lib64/ld-linux-x86-64.so.2 (0x0000003078400000)
$ ./test.out
first
second
forrtl: severe (174): SIGSEGV, segmentation fault occurred

the problem is still there:

(gdb) r
Starting program: /home/redo/molecule/aug-cc-PVDZ/test/test
[Thread debugging using libthread_db enabled]
[New Thread 182915222752 (LWP 22767)]
first
second

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 182915222752 (LWP 22767)]
0x0000002a96d19664 in Steps1_X8_Y16_Loop32gas_1 () from /home/redo/intel/Compiler/11.0/074/mkl/lib/em64t/libmkl_mc.so
(gdb) disassemble
...
0x0000002a96d19664 : movaps %xmm0,(%rcx)
...
(gdb) print $rcx
$1 = 5963436
(gdb) x/8g $rcx
0x5afeac : 0x0000000000000000 0x0000000000000000
0x5afebc : 0x0000000000000000 0x0000000000000000
0x5afecc : 0x0000000000000000 0x0000000000000000
0x5afedc : 0x0000000000000000 0x0000000000000000

the cpuinfo:

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel Xeon CPU 5160 @ 3.00GHz
stepping : 6
cpu MHz : 3000.111
cache size : 4096 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall lm pni monitor ds_cpl est tm2 cx16 xtpr
bogomips : 6004.34
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

cut...

thanks for your answer
Loriano