Software Archive
Read-only legacy content

Run-time error: undefined symbol: _intel_fast_memmove

Ioannis_E__Venetis

Hello everyone,

I am trying to do something rather exotic. I think I have reached a good point in the process, but now I am stuck.

More specifically, I have a MATLAB program, part of which is very computationally intensive. I have successfully put this part into a MEX file (written in C) and parallelized it using OpenMP. So, the MATLAB part only sets up a bunch of arrays and the time-consuming part is done in the MEX file. Everything is fine up to this point.
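
For illustration, the overall structure is roughly the following (a simplified sketch with placeholder names and a dummy kernel, not the actual code):

#include "mex.h"

/* Stand-in for the real computational kernel */
static double heavy_kernel(double x)
{
    return x * x + 1.0;
}

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    /* Arrays set up on the MATLAB side arrive through prhs[] */
    double *src = mxGetPr(prhs[0]);
    mwSize  n   = mxGetNumberOfElements(prhs[0]);
    long    i;

    plhs[0] = mxCreateDoubleMatrix(n, 1, mxREAL);
    double *dst = mxGetPr(plhs[0]);

    /* The time-consuming part, parallelized with OpenMP */
    #pragma omp parallel for
    for (i = 0; i < (long)n; i++)
        dst[i] = heavy_kernel(src[i]);
}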

Now I am trying to offload the code in the MEX file to the Phi that is available on the system and run the parallelized version there. I have managed to insert the offload pragmas into the code and compile the MEX file. But when I try to run it (from within MATLAB, of course) I get the following error:

/path/to/mex/file/code_Phi.mexa64': /opt/intel/composer_xe_2013_sp1.2.144/compiler/lib/intel64/liboffload.so.5:
undefined symbol: _intel_fast_memmove
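
For reference, the offloaded version of the loop looks roughly like this (again only a sketch, with the same placeholder names as in the sketch above):

/* Functions (and any global data) used inside an offload region must be
   marked for the coprocessor as well, e.g.: */
__attribute__((target(mic))) static double heavy_kernel(double x)
{
    return x * x + 1.0;
}

/* ... and inside mexFunction the hot loop is wrapped in an offload region: */
#pragma offload target(mic:0) in(src : length(n)) out(dst : length(n))
{
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++)
        dst[i] = heavy_kernel(src[i]);
}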

I found the above symbol in libintlc.so.5 and linked the MEX file against it, but I am still getting the same error. I am really stuck here and not certain what to do. Is the problem on the host side or the Phi side? In either case, does anyone have an idea what I could try in order to solve it?
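
For reference, one way to check which library actually exports the symbol is something like:

$ nm -D /opt/intel/composer_xe_2013_sp1.2.144/compiler/lib/intel64/libintlc.so.5 | grep intel_fast_memmove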

I didn't want to fill this message with a lot of compilation and linking details that might not be necessary to solve the problem. If you think they are required, I can provide them.

Best regards,

Ioannis E. Venetis

Kevin_D_Intel
Employee

The problem appears to be on the host side. Does the link step of the MEX file use the icc/icpc compiler driver, or does it invoke ld directly, where you control the list of libraries that are linked?

Ioannis_E__Venetis

Dear Kevin,

Thank you for looking into this. I use icc for the linking stage, not ld.

By the way, the version of icc I use is 14.0.2 (if this helps).

Best regards,

Ioannis E. Venetis

Kevin_D_Intel
Employee

Ok, that's good to hear. I will need to see some details. Let's start with the linking. Can you share with me the linking commands?

Ravi_N_Intel
Employee

Is the path to libintlc.so.5 in LD_LIBRARY_PATH?

Ioannis_E__Venetis

The compilation and linking are done through MATLAB, by calling the 'mex' command, which I assume is some kind of wrapper. The actual commands executed are as follows. Most of the parameters are added by 'mex'; I only added -fopenmp and -lintlc myself.

Compilation:

icc -c  -I/path/to/Matlab_R2013a/bin/matlab/extern/include -I/path/to/Matlab_R2013a/bin/matlab/simulink/include -DMATLAB_MEX_FILE -ansi -D_GNU_SOURCE  -fexceptions -fPIC -fno-omit-frame-pointer -pthread -std=c99 -fopenmp  -DMX_COMPAT_32 -O3 -DNDEBUG  "code_Phi.c"

Linking:

icc -O3 -pthread -shared -Wl,--version-script,/path/to/opt/Matlab_R2013a/bin/matlab/extern/lib/glnxa64/mexFunction.map -Wl,--no-undefined -fopenmp -o  "code_Phi.mexa64"  code_Phi.o  -Wl,-rpath-link,/path/to/Matlab_R2013a/bin/matlab/bin/glnxa64 -L/path/to/Matlab_R2013a/bin/matlab/bin/glnxa64 -lmx -lmex -lmat -lm -lstdc++ -lintlc

In case it is useful, running ldd on the produced mexa64 file gives:

$  ldd code_Phi.mexa64
        linux-vdso.so.1 =>  (0x00007ffffbbff000)
        libmx.so => not found
        libmex.so => not found
        libmat.so => not found
        libm.so.6 => /lib64/libm.so.6 (0x00007f50a63d7000)
        libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00007f50a60d0000)
        libintlc.so.5 => /opt/intel/composer_xe_2013_sp1.2.144/compiler/lib/intel64/libintlc.so.5 (0x00007f50a5e7a000)
        libimf.so => /opt/intel/composer_xe_2013_sp1.2.144/compiler/lib/intel64/libimf.so (0x00007f50a59b7000)
        libsvml.so => /opt/intel/composer_xe_2013_sp1.2.144/compiler/lib/intel64/libsvml.so (0x00007f50a4dbb000)
        libirng.so => /opt/intel/composer_xe_2013_sp1.2.144/compiler/lib/intel64/libirng.so (0x00007f50a4bb4000)
        libiomp5.so => /opt/intel/composer_xe_2013_sp1.2.144/compiler/lib/intel64/libiomp5.so (0x00007f50a489c000)
        liboffload.so.5 => /opt/intel/composer_xe_2013_sp1.2.144/compiler/lib/intel64/liboffload.so.5 (0x00007f50a466a000)
        libcilkrts.so.5 => /opt/intel/composer_xe_2013_sp1.2.144/compiler/lib/intel64/libcilkrts.so.5 (0x00007f50a442b000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f50a4215000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f50a3ff7000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f50a3c63000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f50a3a5f000)
        /lib64/ld-linux-x86-64.so.2 (0x0000003dcc400000)

libmx.so, libmex.so and libmat.so are MATLAB libraries and are found when the mexa64 file is loaded from within MATLAB for execution. I have never had a problem with these.

@Ravi: I execute 'source /opt/intel/bin/iccvars.sh intel64' before running anything else. This should set up LD_LIBRARY_PATH, which is as follows:

/opt/intel/composer_xe_2013_sp1.2.144/compiler/lib/intel64:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/composer_xe_2013_sp1.2.144/mpirt/lib/intel64:/opt/intel/composer_xe_2013_sp1.2.144/ipp/../compiler/lib/intel64:/opt/intel/composer_xe_2013_sp1.2.144/ipp/lib/intel64:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/composer_xe_2013_sp1.2.144/compiler/lib/intel64:/opt/intel/composer_xe_2013_sp1.2.144/mkl/lib/intel64:/opt/intel/composer_xe_2013_sp1.2.144/tbb/lib/intel64/gcc4.4

Ioannis E. Venetis

Ioannis_E__Venetis

Hello again,

It seems that I finally managed to overcome the problem. I modified the linking flags and now the linking command is:

icc -O3 -pthread -shared -Wl,--version-script,/path/to/opt/Matlab_R2013a/bin/matlab/extern/lib/glnxa64/mexFunction.map -Wl,--no-undefined -fopenmp -o  "code_Phi.mexa64"  code_Phi.o  -Wl,-rpath-link,/path/to/Matlab_R2013a/bin/matlab/bin/glnxa64 -L/path/to/Matlab_R2013a/bin/matlab/bin/glnxa64 -lmx -lmex -lmat -lm -lstdc++ -Wl,-rpath,/opt/intel/composer_xe_2013_sp1.2.144/compiler/lib/intel64 -lintlc
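
To confirm that the rpath actually ends up embedded in the resulting file, something like the following can be used:

$ readelf -d code_Phi.mexa64 | grep -i rpath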

Adding the -rpath option to the linker seems to fix the problem. I say 'seems' because now I have a new issue: when the code runs on the Phi with 224 threads it is actually slower than when it runs on the 8 available threads of the CPU (1x i7-3770 @ 3.40GHz). Searching the Internet, I found that this might happen when native x86 code executes on the Phi (at least that is what I understood). If so, how can I check what kind of code is sent to and executed on the Phi? Is there anything else I should check to make certain that things are compiled/linked/executed as required?

Best regards,

Ioannis E. Venetis

TimP
Honored Contributor III

You shouldn't need rpath if you have set up the paths correctly by sourcing compilervars.

It would be rather difficult to run host CPU code on the MIC; it is not something you are likely to accomplish by accident, nor in offload compilation mode.

It would be important to verify, at least by opt-report, whether your important code is both vectorized and parallelized for MIC; otherwise MIC is unlikely to prove beneficial.  I guess that fits into your "what kind of code" category.
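
For example, adding -vec-report2 to the compile line gives a vectorization report, and the offload runtime can report what it actually does at run time; this is only a sketch, so check the documentation for your compiler version:

# set on the host, before starting MATLAB
export OFFLOAD_REPORT=2    # each offload then reports the target, the host/MIC time,
                           # and the amount of data transferred in each direction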

If your code needs only 8 host threads, there are additional likely reasons why you may not get a gain on MIC, at least not by accident.

If you succeed in offloading a large enough case with a high enough ratio of MIC local computation to data transfer, you may still need to optimize the MIC_PREFIX, MIC_KMP_PLACE_THREADS, MIC_OMP_PROC_BIND environment settings.

Ioannis_E__Venetis

You shouldn't need rpath if you have set up the paths correctly by sourcing compilervars.

That was my understanding too. But after reading the documentation about rpath I thought I would try it, as it looked relevant. And it worked.

With respect to the code, the 8-thread case on the CPU requires about 330 seconds to run. When running the code on the Phi, I ssh'd into mic0 and top confirmed that 224 threads were running, but it takes 430 seconds to complete. The data transferred is about 25 MB, which shouldn't take that much time to transfer, especially compared to the total execution time.

The code is relatively simple: a series of nested loops, with the outermost loop having no data dependencies and a large number of iterations (about 250,000). The only synchronization required is at the end of each step, where a calculated value has to be added to an element of an array, and that element may be updated by multiple threads; I use an atomic for that. So, parallelization is straightforward, and I get almost linear speedup on every system I have tried. That's why I am puzzled by the performance on the Phi.
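
In rough outline, the structure is something like the following (a self-contained simplification with dummy data, not the actual code):

#include <stdio.h>

#define N_OUTER 250000
#define N_BINS  1000

/* Stand-in for the real per-iteration computation */
static double compute(long i) { return (double)(i % 7); }

int main(void)
{
    static double result[N_BINS];   /* a given element may be updated by several threads */
    long i;

    /* Outermost loop: no data dependencies between iterations */
    #pragma omp parallel for
    for (i = 0; i < N_OUTER; i++) {
        double val = compute(i);
        #pragma omp atomic
        result[i % N_BINS] += val;
    }

    printf("%f\n", result[0]);
    return 0;
}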

Best regards,

Ioannis E. Venetis

jimdempseyatthecove
Honored Contributor III

What happens to performance when you comment out the atomic update?

Do your vectorization reports indicate whether or not the inner loop is vectorized?

Jim Dempsey

TimP
Honored Contributor III

Intel(r) Xeon Phi(tm) has different threading scaling characteristics from other multi-core platforms, so you still need to investigate, e.g.

MIC_KMP_PLACE_THREADS=55c,1t

MIC_KMP_PLACE_THREADS=32c,1t

MIC_KMP_PLACE_THREADS=55c,2t

MIC_OMP_PROC_BIND=close

and so on; that information would help avoid a lot of speculation here.
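
Applied on the host before MATLAB is started, that would look something like this (a sketch; the exact values are what needs experimenting with):

export MIC_ENV_PREFIX=MIC              # forward the MIC_-prefixed variables below to the coprocessor (prefix stripped)
export MIC_KMP_PLACE_THREADS=55c,1t
export MIC_OMP_PROC_BIND=close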

I'm assuming you have 57 cores, where offload mode would default to using 56 of them for the application.

Roth__Gary
Beginner

What if you don't have an /opt/intel directory?

TimP
Honored Contributor III

This last question appears unrelated to the old thread you attached it to. If you mean that you installed the Intel compilers in a different path from the default, the scripts such as compilervars.sh are edited at install time to reflect the path you have chosen, but you need to know that path in order to source them. A likely case for installing in another directory is use of the Linux "module" application, but then that application takes care of the paths.
