IPP and multiple threads

bijoyjth · ‎11-24-2009

Hello,

I am new to the Intel IPP library and have a question about the -N option of the performance utilities. I am trying to see how the performance of the libraries scales as the number of threads increases. I am testing a specific function with the ps_ippdcem64t tool. Does -N control how many threads are spawned to call the function I specify with the -f option, or does it control threading within the IPP libraries? Thanks in advance.

Bijoy.

bijoyjth · ‎11-24-2009

Apologies for replying to my own thread :-) .. but it looks like ippSetNumThreads() is used to control internal threading within the IPP libs. How can I use the performance tools to measure the performance of the IPP functions with multiple threads? Or do I have to write my own top-level application that spawns multiple threads?

Thanks,
Bijoy.

PaulF_IntelCorp · ‎11-24-2009

Hello Bijoy,

Yes, ippSetNumThreads() can be used to control the maximum number of OpenMP threads used by an IPP function. Note that only about 15% of the IPP functions are threaded; many do not benefit significantly from threading because they are too small and/or the threading overhead outweighs the benefits. A list of the functions that are threaded can be found in the ThreadedFunctionsList.txt file in the IPP doc directory.

If you want a full test of how multiple cores impact your system, you might want to use the trick outlined in this KB article:

http://software.intel.com/en-us/articles/limiting-the-number-of-cores-of-execution-on-a-windows-system/

which describes how to limit Windows to a specific number of hardware threads. This assumes, of course, that you are doing your testing on Windows. :) This might be a better way to compare the performance variation as a function of the number of available hardware threads in the system.

You could try using the perfsys tools to make measurements (...\ipp\6.1.x.xxx\ia32\tools\perfsys), or test some specific functions of interest and use a timer mechanism to measure duration (such as the timers built into the OS, or RDTSC). Be sure to measure multiple iterations so you can identify variations due to other tasks running on the system, system variations, etc.
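A minimal sketch of the repeated-measurement idea, written in Python rather than against IPP itself. The `time_repeated` and `summarize` helpers are hypothetical names, and `zlib.compress` is only a stand-in workload; none of this is part of the perfsys tools.

```python
import statistics
import time
import zlib


def time_repeated(func, repeats=5):
    """Call func() `repeats` times and return the wall-clock duration of each call."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Report best, median and worst run so outliers from other tasks are visible."""
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


if __name__ == "__main__":
    # Stand-in workload (assumption): compress about 4 MB of repetitive data in memory.
    payload = b"IPP performance test data " * 160_000
    stats = summarize(time_repeated(lambda: zlib.compress(payload, 6)))
    print(stats)
```

A large gap between the minimum and the maximum usually points to interference from other activity on the machine, in which case the minimum or the median is the more trustworthy figure.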

Paul

bijoyjth · ‎11-24-2009

Hey Paul,

Thanks for the reply. I am using a Linux box running on two quad cores (i7) and I can set the maxcpus boot parameter to the kernel on bootup to control how many cores the kernel will use.

You are right. I do want to include the number of available cores as a parameter in my measurements. Let's say I enable only one core, and I have a single thread that is using the Intel IPP data compression API to compress a 100MB file .. assume it completes in 1 sec.

Now what if I have two threads, each trying to compress a 100MB file? Does each thread complete in 1 sec, or do they take more time since both threads have to be scheduled on the single core? What if I enable 2 cores .. will each thread complete in 1 sec now that there are two cores available to run them?
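One way to frame the expectation behind this question is an idealized model, sketched here in Python. It assumes identical, purely CPU-bound jobs and perfectly fair time slicing, and it ignores I/O, cache and memory-bandwidth contention and Hyper-Threading effects, so it is a baseline to compare measurements against rather than a prediction. The function name is hypothetical.

```python
def estimated_finish_time(cpu_seconds_per_job, num_threads, num_cores):
    """Idealized finish time for identical CPU-bound jobs under fair time slicing.

    Assumes no I/O waits, no cache or memory-bandwidth contention, and no
    scheduling overhead; real measurements will deviate from this.
    """
    if num_threads < 1 or num_cores < 1:
        raise ValueError("num_threads and num_cores must be at least 1")
    # Threads share cores only once there are more threads than cores.
    oversubscription = max(1.0, num_threads / num_cores)
    return cpu_seconds_per_job * oversubscription


if __name__ == "__main__":
    for threads, cores in [(1, 1), (2, 1), (2, 2)]:
        print(threads, "thread(s) on", cores, "core(s):",
              estimated_finish_time(1.0, threads, cores), "s each")
```

Under these assumptions, two 1-second jobs sharing one core each finish after about 2 seconds, and with two cores each finishes after about 1 second.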

This is the kind of test I want to run, and I want to use the Intel IPP data compression routines for it. Are there any options to the ps_ippdc{arch} (em64t in my case) performance command that I can use to run something like this?

Thanks,
Bijoy.

Vladimir_Dudnik · ‎11-25-2009

Hi Bijoy,

The IPP performance system provided with the install package is designed to measure the performance of each primitive function separately. File compression is implemented as a high-level IPP sample; to measure its performance you will need to download the sample package, build the data compression utilities (there are several: ipp_gzip, ipp_zlib, ipp_bzip2 and so on) and run them on the files you are interested in.

Note that each sample comes with a readme.htm file where you can find details on how to run it and which options are supported. For example, from the ipp_gzip readme.htm file:

Compression is the most 'heavy' operation in terms of CPU resources, so the maximum benefit from multi-threading can be obtained, as it would be expected, during the compression. There are two ways of using multi-threading: multi-file threading and multi-chunk threading.

Multi-file threading is used when more than one file is specified on the command line. For example, if we want to compress two files on a two-CPU computer (or on a single-unit Intel Core 2 Duo processor computer), our natural decision will be to process each file in a separate thread and thus fully benefit from a dual-CPU computer. That is what IPP_GZIP does. For example:

> ipp_gzip file1 file2

will compress file1 on one CPU and file2 on the other CPU. If our system has more than two CPUs, the other CPUs will not be used. If the number of files specified on the IPP_GZIP command line is greater than the number of available CPUs, all of them will be processed in parallel using the existing CPUs. For example, file1 on CPU1, file2 on CPU2, file3 on CPU1, etc.

Multi-chunk file processing is used when we process a single file on a multi-CPU computer. Thus, on a 4-CPU computer the command line

> ipp_gzip a-very-huge-file.dat

will split the "a-very-huge-file.dat" file into 4 pieces (chunks) and will compress each chunk on a separate CPU, combining the processed data into a single output file "a-very-huge-file.dat.gz". Of course, the compression ratio in this case will be a little worse than in single-thread compression, since LZ77 compression methods use statistical data (or pre-history) to compress better, but this overhead (actually 1-2%) is the cost of boosted compression performance (10x-20x faster on 4/8-CPU computers vs. original GZIP compression speed).

The "-m" option can be used to control the multi-thread operations. For example, using "-m 2" on a 4-CPU computer we can limit IPP_GZIP to two threads. Or, vice versa, using "-m 4" option on a single-CPU computer we can produce archives as if they were compressed on a 4-core CPU. Of course, forced multi-threading on a single-CPU computer will not speed-up the compression, but it will produce the archives which can be decompressed on a multi-CPU system and thus benefit from multi-CPU.

The "-j size" option controls the multi-chunk compression. For example, if we are using a multiprocessor system, but the file to be processed is not big enough, we may not speed-up, but, rather, slow-down the compression because of thread creation/synchronization overhead. The default value of minimum file length is 256 KB and is defined by "#define MIN_LENGTH_TO_SLICE ..." value in the "ipp_gzip.h" file.
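The multi-chunk idea from the readme can be sketched in Python with the standard gzip module. This is an illustration only and not how ipp_gzip is implemented: the function names are hypothetical, the default threshold simply mirrors the 256 KB figure quoted above, and the chunks are joined as separate gzip members, which standard gzip readers accept as one stream.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, num_chunks):
    """Split data into at most num_chunks contiguous pieces of near-equal size."""
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    chunk_size = max(1, -(-len(data) // num_chunks))  # ceiling division
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def compress_chunked(data, num_chunks, min_length_to_slice=256 * 1024):
    """Compress chunks in parallel and join them into one multi-member gzip stream."""
    if len(data) < min_length_to_slice:
        # Small input: threading overhead would likely outweigh the gain.
        return gzip.compress(data)
    chunks = split_into_chunks(data, num_chunks)
    with ThreadPoolExecutor(max_workers=num_chunks) as pool:
        members = list(pool.map(gzip.compress, chunks))
    return b"".join(members)


if __name__ == "__main__":
    data = bytes(range(256)) * 8_000  # about 2 MB of sample data
    whole = gzip.compress(data)
    chunked = compress_chunked(data, 4)
    assert gzip.decompress(chunked) == data
    print("single stream:", len(whole), "bytes; 4 chunks:", len(chunked), "bytes")
```

Because each chunk starts with an empty history, the chunked output is normally somewhat larger than the single-stream output, which is the ratio penalty the readme describes.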

Regards,
Vladimir

bijoyjth · ‎11-25-2009

Thanks. I downloaded the samples tarball and built ipp_gzip. Thanks a lot guys!!

Cheers,
Bijoy.

bijoyjth · ‎11-25-2009

Hey Vladimir,

I tried running ipp_gzip with some of the options you mentioned. My box is a dual quad core (i7), so /proc/cpuinfo shows cores [0-7].

I am noticing something strange. Regardless of what value I pass to the -m option, I always see cores [0-3] idle and cores [4-7] being utilized. I am running ipp_gzip on 10,000 files. Here is a snapshot:

12:38:16 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle   intr/s
12:38:19 PM  all    2.50    0.00    8.42    9.23    0.04    0.28    0.00   79.53  6710.33
12:38:19 PM    0    0.00    0.00    0.00    3.30    0.00    0.00    0.00   96.70     0.00
12:38:19 PM    1    0.00    0.00    0.83    0.00    0.00    0.00    0.00   99.17     0.00
12:38:19 PM    2    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00     0.00
12:38:19 PM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00     0.00
12:38:19 PM    4    5.33    0.00   16.00    0.00    0.00    0.00    0.00   78.67     0.00
12:38:19 PM    5    5.32    0.00   15.95    0.00    0.00    0.00    0.00   78.74     1.33
12:38:19 PM    6    5.32    0.00   18.94   73.09    0.33    2.33    0.00    0.00   401.33
12:38:19 PM    7    4.67    0.00   17.00    0.00    0.00    0.00    0.00   78.33     0.00

Even with -m 1, I see cores [4-7] being utilized. I thought only a single thread would run for each of the 10,000 files in order. Also, since the file size (120K) is below the split_size, the file_chunk_threading is prevented. So how come 4 cores are being utilized? Nothing else was running on the machine at the time this was run.

Any ideas?

Thanks,
Bijoy.

PaulF_IntelCorp · ‎11-25-2009

Hello Bijoy,

I'm assuming you're using the multi-threaded version of the library in your test (which is the default if you use the existing scripts to compile and test the samples). In this case, there may be some threaded functions within the IPP library that are causing the activity you see on the other cores.

I did a quick search to see if I could identify any OpenMP threaded functions that might be used in the ipp_gzip source code. I used the following command line (I did this on a Windows system, so there may be minor discrepancies compared to what you would type on your Linux command line):

grep -Eoh ipp[ism][A-Za-z0-9]+_ *.c | sort | uniq >temp.txt

which gave me the following list:

ippsCopy_
ippsCRC32_
ippsDeflateHuff_
ippsDeflateLZ77_
ippsInflate_
ippsSet_
ippsSubC_

My grep was designed to eliminate the mode fields in the names, so we're just looking at the core names of the functions. After that I used my editor to search for each line of this list in the ThreadedFunctionsList.txt file in my IPP doc directory, which lists the IPP functions that have been threaded using the Intel OpenMP library. The last function in the list, ippsSubC, is a threaded function.
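The same search and cross-check could be scripted, roughly as in the Python sketch below. The regular expression is the one from the grep command above; the helper names are hypothetical, and the sample strings are made up for illustration, since neither the ipp_gzip sources nor the actual contents and format of ThreadedFunctionsList.txt are shown in this thread.

```python
import re

# Same pattern as the grep: "ipp", a domain letter, then the name up to the first "_".
IPP_NAME_RE = re.compile(r"ipp[ism][A-Za-z0-9]+_")


def core_names(source_text):
    """Return the sorted, de-duplicated core IPP names found in source_text."""
    return sorted(set(IPP_NAME_RE.findall(source_text)))


def threaded_subset(names, threaded_list_text):
    """Keep the names that occur anywhere in the threaded-functions list text."""
    return [name for name in names if name in threaded_list_text]


if __name__ == "__main__":
    # Made-up source snippet and list excerpt, for illustration only.
    sample_source = """
        ippsCopy_8u(src, dst, len);
        ippsSubC_8u_ISfs(val, buf, len, 0);
        ippsCopy_8u(a, b, n);
    """
    sample_threaded = "ippsSubC_8u_ISfs\n"
    names = core_names(sample_source)
    print(names)
    print(threaded_subset(names, sample_threaded))
```

In practice the two inputs would be read from the sample's .c files and from the list file in the IPP doc directory.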

My guess is that this function is using extra cores on your system. It may be that the function is limiting the number of cores it utilizes for multi-threading, and that is why it doesn't consume all the cores on your system. But I don't know how many cores it will use when it is not limited by the ippSetNumThreads function.

You might want to use the trick that limits the number of cores used by the OS to make your tests, or add a call to the ippSetNumThreads function to limit the number of hardware threads used by this function (and any others I may have missed in my search).
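On Linux, a related approach that does not need a reboot is to restrict the CPUs a test is allowed to run on. The Python sketch below uses `os.sched_setaffinity`, which exists on Linux only; the wrapper name is hypothetical, and this is a different mechanism from both the Windows KB trick and ippSetNumThreads, so treat it as one possible option rather than an equivalent.

```python
import os


def run_with_cpu_limit(func, cpus):
    """Run func() with the calling thread restricted to `cpus`, then restore the mask.

    Threads started inside func inherit the restricted mask. Linux only.
    """
    if not hasattr(os, "sched_setaffinity"):
        raise OSError("CPU affinity control is not available on this platform")
    original = os.sched_getaffinity(0)
    allowed = set(cpus) & original
    if not allowed:
        raise ValueError("none of the requested CPUs are available to this process")
    os.sched_setaffinity(0, allowed)
    try:
        return func()
    finally:
        # Always put the original mask back, even if func raised.
        os.sched_setaffinity(0, original)


if __name__ == "__main__":
    first_cpu = {min(os.sched_getaffinity(0))}
    print(run_with_cpu_limit(lambda: os.sched_getaffinity(0), first_cpu))
```

Passing a larger set such as `{0, 1}` or `{0, 1, 2, 3}` allows the same workload to be timed with different numbers of usable CPUs.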

Note: the search I put together may not have found every ipp function in the source code. :)

Paul

Sergey_K_Intel · ‎11-25-2009

Quoting - Paul Fischer (Intel)

I'm assuming you're using the multi-threaded version of the library in your test (which is the default if you use the existing scripts to compile and test the samples). In this case, there may be some threaded functions within the IPP library that are causing the activity you see on the other cores.

Hi,

I should point out that almost none of the IPP DC functions (except ippsLZO) use OpenMP. This means that ippSetNumThreads has no effect on the data compression functions.

So, bijoyjth, you've chosen the wrong IPP domain to study the "-N" option :).

Moreover, since data compression utilities deal with files (i.e. with input/output), their behaviour, or their dependence on the number of threads, is very non-linear. The CPU time usage (or real time usage) depends heavily on I/O subsystem performance.

Regarding ipp_gzip, I will check if it uses all available cores on Linux.

Regards,
Sergey

bijoyjth · ‎11-30-2009

Thanks Sergey. I did notice that calling ippSetNumThreads(1) had no effect. But even if the IPP functions aren't internally threaded, it will be interesting to see some numbers, given that ipp_gzip does top-level threading when passed multiple files as arguments, and also performs multi-chunk threading, as Vladimir pointed out.

In any case, I still don't understand why 4 cores were utilized and the other 4 cores were 100% idle, even though I passed in 10,000 files as arguments to ipp_gzip. I have to make a correction to an earlier post .. I mentioned my machine was a dual quad core .. it's not .. it's a single quad core (i7) with HT enabled, so I have 8 logical CPUs.

I agree .. the data compression performance and throughput will be affected by the I/O routines .. I guess I will have to write my own program which slurps in a file into memory and calls the IPP data compression API functions directly and places the compressed output into another memory buffer and that way avoid disk I/O.
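The shape of such a program might look like the Python sketch below. It is only an outline of the measurement: `zlib.compress` stands in for the IPP data compression calls, which a real test would make from C, and the function names are hypothetical. Each thread compresses a buffer that is already in memory, so disk I/O stays outside the timed region.

```python
import threading
import time
import zlib


def compress_in_memory(data, level=6):
    """Compress an in-memory buffer and return (compressed_bytes, elapsed_seconds)."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    return compressed, time.perf_counter() - start


def run_concurrent(buffers, level=6):
    """Compress each buffer in its own thread; return per-thread elapsed times."""
    elapsed = [0.0] * len(buffers)

    def worker(index):
        _, elapsed[index] = compress_in_memory(buffers[index], level)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(len(buffers))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return elapsed


if __name__ == "__main__":
    # Stand-in for a file read fully into memory before timing starts (assumption).
    data = bytes(range(256)) * 40_000  # about 10 MB
    for n in (1, 2, 4):
        times = run_concurrent([data] * n)
        print(n, "thread(s):", [round(t, 3) for t in times])
```

Repeating the runs while varying the number of enabled cores would give the per-thread completion times asked about earlier in the thread.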

Regards,
Bijoy.