Two machines, two very different performance results for the same program.
The problem I'm facing is probably not directly related to TBB, but you may already have faced something similar.
I get very different performance for the same program on two different machines. To test the performance of my freshly parallelized program I have two machines: a classic 2-core machine and a 48-core machine (HP ProLiant DL585 G7).
On the first machine (2 cores), with TBB configured to use 2 threads, the computation is 2x faster. So far so good. On the second machine, the computation is about 3 or 4 times slower with TBB configured to use either only 2 cores or all 48. I have to configure TBB to use 32 cores to get a 2x speedup!
(I've attached a picture of the processor architecture of the HP Proliant DL585 G7)
Notes : -----------
1) The HP ProLiant DL585 G7 is made of 4 groups of 12 processors, and each group has two local shared memories. I first tried to attribute the slowdown to the architecture: I thought my program was accessing the local shared memories of other groups. To check this I used the "taskset" program to force my program onto specific processors only, and I noticed only a slight speedup. But I'm not sure I used "taskset" the right way, so I think I have to dig into this more.
2) Each core of the 2-core machine has 4 MB of cache memory, while each core of the 48-core machine (4 groups of 12) has only about 400 KB. Would that be a problem? Because I am using a lot of memory, maybe I do not benefit from the cache because of its size?
3) Read/write accesses have been designed so that there are as few concurrency problems as possible in my program. However, I may be suffering from "pointer aliasing". A colleague of mine told me local aliases could resolve some memory problems, but I'm not sure I understand that (?).
Have you experienced something similar, and do you have any recommendations for resolving such a problem?
I don't think the problem is due to the workload, because I ran this program on only 2 cores of the 48-core machine and I get very different results compared to my 2-core desktop. Besides, I tried running the program with bigger data inputs and there is no improvement at all; it is still very slow compared to what I get on my desktop machine.
To read and write the computed data I'm using several big arrays. Each thread can only read/write these arrays in carefully separated memory regions, so that there is no conflict between threads (for reads there can be some "conflicts", but they happen very rarely; BTW I tried the same program without this border effect and got no acceleration at all). For each thread, a first pass of data is computed within a certain interval (the vertices of a certain interval of Z), and then these data are reused for computing other data (the triangles of a certain interval of Z). Maybe I generate too much data for the cache (512 KB/core for the 48-core machine vs. 4 MB for my desktop).
I don't think this is about the workload, because I tried the program on only 2 cores of the 48-core machine and got very different results compared to my 2-core desktop machine. Besides, I tried the program with 8x bigger data inputs and noticed no improvement at all (or almost none), even when using only 2 cores.
I'm using several "big arrays" to read/store data. Each thread accesses the arrays only within a certain interval, so that there are no conflicts (there can be some read conflicts; I tried without this "border effect" but got no improvement). First, a thread writes some data (indices of vertices) within a certain interval of some "global" arrays (global: I mean shared by all the threads); after that, these same data are read for computing triangles. Maybe I generate too much data for the cache (512 KB/core vs. 4 MB for my desktop machine)?
A colleague of mine told me that he heard something about a "local alias" trick, but he is not sure it can fix my problem. What do you think? Is that for getting rid of pointer aliases? But I don't think I'm suffering from that.
>then I suggest running Intel VTune Amplifier XE to find the bottleneck in the code.
>BTW are there many IO operations?
>Does the dataset fit in node-local memory?
There are no IO operations (no std::cout or file creation). For the scene I'm currently testing, the total amount of allocated memory is about 140 MB, so when I'm using only 2 cores of one node it fits in the node's shared memory; I think the data should not be spread across the shared memories of two different nodes.
>Have you also tried to resize those arrays (I didn't see any mention of size relative to that of the cache), just to test the hypothesis of
Yes, I tried with another array size. For the scene I'm currently using I've got 6 "global" arrays (all of size 128*64*48): 2 of them are arrays of Eigen 3D vectors (Eigen::Vector3d), one of them is an array of structs, and the three remaining are arrays of integers. 128*64*48 is also the size of the data input I have to deal with. I'm using scalable_malloc/scalable_free to handle this memory (because I noticed an interesting speedup on my desktop, and a slight performance improvement on the 48-core machine). I tried with an 8x smaller scene (64*32*24) and noticed no improvement; performance with 2 cores on this machine is far from 2x faster.
I heard about a trick consisting of using local aliases to prevent the compiler of calling too many times same part of memory, but I've got to search for further info on it. BTW I doubt I need such a thing.
Maybe the problem is purely because of the cache ... Could it be due to the size of the cache which is 4x smaller for the cores of the 48cores computer ? Maybe it limits the cache effect.
Besides, it is difficult to profile my program on this platform; most of the tools I usually use don't work as expected (valgrind's callgrind tool fails to simulate the L3 cache, VTune doesn't run in console text mode, ...). Nevertheless, I can still profile my program further on my desktop computer in order to identify possible bottlenecks or bad memory management I missed before.
"local aliases to prevent the compiler of calling too many times same part of memory" Maybe you mean local "copies" (separate locations containing the same value), as opposed to "aliases" (separate accesses to the same location)? That could in fact prevent having to go to L3 cache (but could that really be it?), and the optimiser may be able to improve things a lot when it knows that no other part of the program can change those values (evaluation of r.end() should typically be "hoisted" out of inner loops), but the latter would not explain the nonscalability.
"Could it be due to the size of the cache which is 4x smaller for the cores of the 48cores computer ?" That array didn't look excessively large to begin with, and you saw no improvement shrinking it even further.