I can set the number of OpenMP threads, and it works for "omp parallel for".
ippGetNumThreads always returns 1, though.
ippSetNumThreads(8) reports "No operation has been executed".
I'm not sure how to set the thread count successfully, or whether a higher thread count will make ippsSortRadixIndexAscend_8u run with multiple threads.
omp_set_dynamic(0);
omp_set_num_threads(8);
int threads = 0;
ippGetNumThreads(&threads);
wprintf(L"ippGetNumThreads %d\n", threads);
IppStatus errorTh = ippSetNumThreads(8);
printf("-- warning %d, %s\n", errorTh, ippGetStatusString( errorTh ));
ippGetNumThreads(&threads);
wprintf(L"ippGetNumThreads %d\n", threads);
Hello, thanks for your reply, Igor.
Yes, I do have them. I wasn't sure which to select when setting up the project, though. Initially I chose Multi-threaded DLL. I just tried switching to Multi-threaded static library, and that allowed me to change the thread count.
However, ippsSortRadixIndexAscend still only runs on one thread.
ippsSortRadixIndexAscend is not internally threaded. The related ippsSortRadixAscend function is threaded. You can find the list of threaded functions at: documentation\en\ipp\common\ThreadedFunctionsList.txt
I tested ippsSortRadixAscend_32s_I by sorting 10 million elements; there was no performance change with 1 thread vs. 4 or 8 threads.
Any suggestions for a parallel (key, value) sort library? I'm looking for a sort on a single CPU; the final version will use a 10-14 core Xeon, with the goal of sorting 10 million 32-bit (key, value) pairs in about 30 ms. While that may not be possible in a single sort, I have also thought about multi-step sorts. For example, a coarse-grained sort using 16- or 8-bit keys, with the result then sent to a co-processor for an exact sort and further parallel computing.
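A minimal sketch of the coarse-grained first step mentioned above, assuming the coarse key is the top 16 bits of a 32-bit key. The `KeyValue` struct and the `coarse_sort_top16` function are hypothetical names made up for illustration; they are not part of IPP. It is a plain counting sort into 65536 buckets and returns the bucket offsets, so that each bucket could be handed off separately for an exact sort.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical pair type for illustration; not taken from IPP.
struct KeyValue {
    std::uint32_t key;
    std::uint32_t value;
};

// Coarse pass: a stable counting sort on the top 16 bits of each key.
// Afterwards the pairs are grouped into 65536 buckets in ascending bucket
// order, but pairs inside one bucket are still in their input order.
// Returns the bucket start offsets (65537 entries): bucket b occupies
// out[start[b]] .. out[start[b + 1] - 1].
std::vector<std::size_t> coarse_sort_top16(const std::vector<KeyValue>& in,
                                           std::vector<KeyValue>& out) {
    assert(&in != &out);  // input and output must be different vectors
    const std::size_t kBuckets = std::size_t{1} << 16;

    // Histogram: start[b + 1] counts the pairs that fall into bucket b.
    std::vector<std::size_t> start(kBuckets + 1, 0);
    for (const KeyValue& kv : in) ++start[(kv.key >> 16) + 1];

    // Prefix sum turns the counts into start offsets.
    for (std::size_t b = 0; b < kBuckets; ++b) start[b + 1] += start[b];

    // Scatter each pair to the next free slot of its bucket.
    out.resize(in.size());
    std::vector<std::size_t> cursor(start.begin(), start.end() - 1);
    for (const KeyValue& kv : in) out[cursor[kv.key >> 16]++] = kv;
    return start;
}
```

Because the buckets do not overlap in key range, refining each one independently yields a fully sorted array without a final merge. Whether this gets anywhere near the 30 ms target is untested.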
Decided to use OpenMP and Intel's sorts for a 2 stage sort, as a temporary solution.
A parallel merge sort (a 2-stage parallel sort: (1) radix, (2) merge) will be introduced in the IPP version that follows 2017. SortRadix had a threaded implementation in some older IPP versions, but it was then commented out because the implementation was inefficient (only 2 threads were supported).
That's interesting; the 2-stage parallel sort is what I decided on. I'm using ippsSortRadixIndexAscend_32s in combination with OpenMP, then merging the data sets. (I realize this conversation has drifted way off the title's topic, but it was interesting.)
My merge/reduce is parallel but not optimized (it manipulates the key/value pairs too, not just the index), but the gains seem really good for a 4-core CPU with 8 threads. From what I was told about hyperthreading at my last HPC internship, it is like thread switching on cores for higher utilization of the CPU. So total gains greater than the core count seem really good.
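For anyone reading along, here is a rough sketch of that 2-stage pattern: sort fixed-size chunks in parallel, then merge neighbouring chunks pairwise. It is not the code used for the timings in this thread. To keep it self-contained, `std::stable_sort` stands in for ippsSortRadixIndexAscend_32s and `std::inplace_merge` stands in for the merge/reduce step; `KeyValue`, `by_key`, and `two_stage_sort` are hypothetical names. The OpenMP pragmas are simply ignored if the compiler is run without OpenMP support, in which case the code runs serially.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical pair type for illustration; not taken from IPP.
struct KeyValue {
    std::uint32_t key;
    std::uint32_t value;
};

static bool by_key(const KeyValue& a, const KeyValue& b) { return a.key < b.key; }

// Stage 1: sort `chunks` contiguous slices independently, one per iteration.
// Stage 2: merge neighbouring sorted slices pairwise until one run remains.
void two_stage_sort(std::vector<KeyValue>& data, int chunks) {
    assert(chunks > 0);
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(data.size());

    // Slice c covers [bounds[c], bounds[c + 1]).
    std::vector<std::ptrdiff_t> bounds(static_cast<std::size_t>(chunks) + 1);
    for (int c = 0; c <= chunks; ++c) bounds[c] = n * c / chunks;

    // Stage 1: the slices are disjoint, so the sorts can run in parallel.
    #pragma omp parallel for
    for (int c = 0; c < chunks; ++c) {
        // Stand-in for the per-thread IPP radix sort described in the thread.
        std::stable_sort(data.begin() + bounds[c], data.begin() + bounds[c + 1], by_key);
    }

    // Stage 2: each pass doubles the run width; merges within one pass
    // touch disjoint ranges, so they can also run in parallel.
    for (int width = 1; width < chunks; width *= 2) {
        #pragma omp parallel for
        for (int c = 0; c < chunks - width; c += 2 * width) {
            const int mid = c + width;
            const int end = std::min(c + 2 * width, chunks);
            std::inplace_merge(data.begin() + bounds[c], data.begin() + bounds[mid],
                               data.begin() + bounds[end], by_key);
        }
    }
}
```

Calling `two_stage_sort(data, 8)` mirrors the 8-way split described above. Note that the merge stage has less parallelism on every pass (4 merges, then 2, then 1 for 8 chunks), which is one reason a merge/reduce step can take a large share of the total time.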
ippGetNumThreads 1
Set Num THreads warning 0, ippStsNoErr: No errors.
ippGetNumThreads 8
ItemCount: 10000000
ItemCount Per Thread 1250000
BufferSize: 5020576
Thread # 5 - Sorting ippsSortRadixIndexAscend_32s...
Thread # 4 - Sorting ippsSortRadixIndexAscend_32s...
Thread # 6 - Sorting ippsSortRadixIndexAscend_32s...
Thread # 7 - Sorting ippsSortRadixIndexAscend_32s...
Thread # 3 - Sorting ippsSortRadixIndexAscend_32s...
Thread # 1 - Sorting ippsSortRadixIndexAscend_32s...
Thread # 2 - Sorting ippsSortRadixIndexAscend_32s...
Thread # 8 - Sorting ippsSortRadixIndexAscend_32s...
Time for partial sorts with OpenMP(8): 200.29 ms
Reduce (4): 138.03 ms
Full Sort: 338.32 ms
ItemCount: 10000000
BufferSize: 40020576
1. Sorting ippsSortRadixIndexAscend_32s...
Time for single sort of all elements: 1726.63 ms
Performance Gains with a Quad Core CPU: 5.10x
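For reference, the 5.10x figure in the output above is the single-sort time divided by the sum of the two parallel stages. A tiny sketch of that arithmetic, with `speedup` as a hypothetical helper name:

```cpp
#include <cassert>

// Speedup = baseline time / improved time (both in the same unit).
double speedup(double baseline_ms, double parallel_ms) {
    assert(parallel_ms > 0.0);
    return baseline_ms / parallel_ms;
}
```

With the timings from the log, `speedup(1726.63, 200.29 + 138.03)` divides 1726.63 ms by 338.32 ms, which gives roughly 5.10.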