Since I have not used OpenMP before, I found this code:
#pragma omp parallel for private(i, j, k, temp)
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        temp = 0;
        for (k = 0; k < n; k++) {
            temp += arr1[i][k] * arr2[k][j];
        }
        arr[i][j] = temp;
    }
}
They say to let OpenMP do the magic for you. Personally, I do not see any magic.
Edit: I have seen people reporting scalability of time/PE. Interesting. One can argue about how easily people used to sequential computing can also port their code. OK, let me check with this whether I observe the same on my processor.
It actually scales OK: for instance, 80 secs for a 2000x2000 matrix down to 25 secs with 100% CPU utilization on the i7. Nice scaling. There are some points here, though.
1) Does OpenMP do any caching of the elements of the matrices in the CPU? What about memory latency? This is why I do not like black boxes so much, and I do not believe this is presentable for comparison with the GPU, which uses shared-memory fetching and coalescing with the global GDDR memory.
OpenMP does no caching. The processor will. The compiler may (and usually does) perform operations that will improve performance; e.g., the variable temp will be registerized.
You can do things to help the compiler perform additional optimizations (under the right circumstances).
Assume arr1 is some sort of filter that is to be applied to thousands of video frames, i.e. out1 = filter x frame1, out2 = filter x frame2, ...
In this circumstance, if you were to transpose the filter, then the inner-loop subscripts for the transposed filter would be transposed as well. In the above sample code, if arr2 were the transposed matrix, then the inner loop would be
for (k = 0; k < n; k++) {
    temp += arr1[i][k] * arr2[j][k];
}
arr[i][j] = temp;
and then this inner loop can be vectorized by the compiler (IOW, it could perform 4-wide operations for floats or 2-wide operations for doubles).
I do not know, Tim, which is better and which is worse: doing the fetching on the PE on your own, or letting the compiler do it. For instance, on the GPU with the CUDA API you have your processor, and with the proper declarations of the grid and block dimensionality you allocate resources on the streaming multiprocessors that reside inside it. You have full control of the registers; you can do prefetching in a 2-step manner, first registers -> memory in the PE, so as to acquire the peak performance of the processor. You also know how the threads are unrolled into warps, where a single instruction is executed across them (supposing you do not have divergent if/then statements). I.e., you have full control of parallelism, which then gives you the potential to build a parallelization plan... So, on the GPU, hardware dictates your parallelization theory.

Anyway, since I am into parallelism now, I am going to make case studies in all these APIs, since certain portions of an algorithm can be run on the host (CPU) and other portions on the device (GPU). This is why there is a lot of craving for the OpenCL standard, which allows heterogeneous parallel programming, and we see how Intel is integrating CPU and GPU into a single chip; this gives a high communication bandwidth between the two for heterogeneous programming. In general, Tim, I believe that we now live in the right period for parallelism, and this is quite exciting.
Whoa! There's so much more to consider here even before you start splitting this task between threads. Consider the innermost loop:
[cpp]for(k=0; k<n; k++) temp += arr1[i][k] * arr2[k][j];[/cpp]
C arrays are in row-major order, so each successive multiplicand is being pulled from the next row of arr2. Even with floats, the row length for the suggested 2000x2000 array size is 2000 x 4 = 8000 bytes. Since a TLB (Translation Lookaside Buffer) entry only covers 4K bytes (assuming typical small-page use), that means each successive computation in this loop will require a fresh TLB entry (even the arr1 accesses will typically take two TLB entries, sometimes three). Even on an Intel Core i7 processor with a 64-entry 4-way set-associative DTLB, after a maximum of 64 x 4 = 256 rows all the TLB entries will be used and the processor will need to start discarding the old ones (actually fewer than 256, since TLB entries are needed for the other arrays). TLB page walks are not cheap.
How to fix this? One method is blocking: dividing the array up into blocks small enough to be able to reuse TLB entries, to maximize both read and write performance. Blocking is usually done by adding more nested loops. agalex hints at this solution with the introduction of CUDA terminology in a subsequent reply.
Jim's suggested transposition is another way to approach this problem, the transposition changing the arr2 accesses so that both arrays are read along their natural memory order. Sadly, this solution only translates the problem to the writes of arr, though it does reduce the frequency of the problem, since it removes the need for a fresh TLB entry on every execution of the innermost loop.
But adding threading poses an even greater problem with this approach. Since the threading in the original code is across the index i, multiple threads will be accessing adjacent entries of array arr, likely causing lots of thrashing between the threads as they compete over who owns the valid version of the cache line.
Since the tx values are consecutive and ty can be considered fixed, you can see that in both cases you have coalesced access to the array. For more information see Programming Massively Parallel Processors by Kirk & Hwu.
Not only this; you can then have a register -> memory prefetching strategy.
Meaning you have total control.
So in other words, someone who is into OpenMP should know these issues and how memory is pulled from the DDR into some level of the cache in the processor. They say that they also want to build a compiler for CUDA where this can be done "magically". I don't see this happening, though, since people want peak performance on the GPU. Some clever things like loop unrolling in the compiler, maybe. I strongly believe that people should know the fetching/prefetching issues. When you go massive in processors, great care should be given to designing a correct parallel philosophy. This is the message I will pass tomorrow in a speech I will give about massive processors: just basic things that will deter people from thinking that you just spawn threads and wait to get peak performance. I think this is sufficient for the time being.
Also, the problem shows with different sizes of matrices: the timing is not proportional to the increase in dimensionality. This is because of memory latency (seconds!). Even with blocking you still have the granularity problem, which grows with the dimensionality of the matrix, on the GPU as well. It is a long story.
Edit: Just for the record, for a 2048x2048 matrix the speedup of the GPU (GTX-275) over the CPU is 455x over the serial code and 193x over the parallel OpenMP code (without blocking, of course). Floating-point numbers, of course.
Anyway just playing with the GPU code to see some things.
The same (or similar) tile/blocking techniques used in CUDA can be used in your OpenMP loop construction - just not automagically by the compiler.
Also, when you make your timing runs, include the time it takes to
1) inject the code into the GPGPU and JIT compile (assuming JIT-compiled code does not remain resident; or, if precompiled JIT output is available, add the time to upload the object code)
2) copy C/C++ input data into the GPGPU
3) actual multiply time inside the GPGPU
4) copy C/C++ results data out of the GPGPU
Then chart the time for permutations of matrix sizes combined together for 1-n cores versus the GPGPU.
The blocking parameters on the CPU side will vary depending on the processor(s) cache and core/HT configuration (no different than the considerations the CUDA JIT makes depending on GPGPU model).
Well, I would suggest that "blocking is transparent in CUDA" and "...you have total control" are contradictory statements (note the explicit references to tiles and tile sizes in this "transparent" code example; at most you could say that CUDA has features to support blocking), but I agree with your overall premise that programming for performance requires understanding of the underlying architecture. That was my point exactly. You can't even get close to the best performance with one thread without understanding the underlying architecture, as in my example of the costs of TLB management. There's a good lecture on needing to understand both the abstract and the concrete in dealing with multi-threading, given by Yale Patt in the most recent UPCRC distinguished lecturer series.
Yes, Jim, I have done this; just counting the kernel time is not sufficient. Remember that PCI-Express 2 has a write bandwidth CPU->GPU of 4 GB/s, so the time for writing a 2048x2048 matrix of floats to the GPU is insignificant, i.e. 4 ms, so this is not the bottleneck. So you have 2*4 = 8 ms for the writes and 4 ms for a read, thus 12 ms total CPU-GPU DDR3-GDDR3. I have included it in the process, though.
There is one pointer, though, about thousands of threads: do they all work for an application? For instance, in the Dijkstra algorithm there is something like a tree structure in the utilization of threads, which heavily undermines the peak power of the GPU, since there is a need for something like 200 iterations and some threads are practically not doing anything. One pointer on spawning thousands of threads.
Edit: Of course, in many cases you do not have a single Dijkstra but many Dijkstras. With decent thinking one could combine all the Dijkstras in a single kernel, but this comes at a cost of memory utilization. Imagine having 200000 nodes and needing to do 200 Dijkstras; combining all the Dijkstras together means you have a 200x200000 array, which is not wise to allocate in memory. Thus thread utilization, and whether threads actually execute something, is a very crucial issue, I think, which GPGPU has not addressed yet. There is also the solution of allocating a 2D cost matrix, which has the advantage of not allocating contiguous memory; this could actually work. You could think of the problem as layered, and within each thread its index can go from 1...200. So you get a good ratio of utilization, but still not enough. Anyway, good parallel thinking about the utilization of threads is a crucial issue; this is the message that needs to be passed here, which is also the case on every processor.
Edit: What is good about the last idea is that the memory allocation can become very high, but the complexity of transferring the 2D matrix and using it in CUDA gets a little higher.
Anyway, that's the end on CUDA; this is a forum on Intel. :-)