Hello,
I ran a (very) simple cilk_for loop on a Core i5-2400 CPU under Windows XP 32-bit.
The code is attached. It was compiled and built with the latest Intel compiler using MSDEV 2010.
The cilk_for loop runs only a little faster than the same loop implemented with intrinsic C.
But my CPU has 4 cores.
I expect the cilk code to run 4 times faster.
How can I get all cores to participate in the calculation?
Thanks,
Zvika
By default the Intel Cilk Plus runtime will query the OS for the number of cores and use all of them. You can override this using the Cilk Plus API, or by setting the CILK_NWORKERS environment variable.
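To make the default-versus-override rule above concrete, here is a minimal sketch in portable C++. It does not call the Cilk Plus runtime at all: `expected_workers` is a hypothetical helper, and the precedence it applies (a valid positive `CILK_NWORKERS` value wins, otherwise the core count) is an assumption based on the description above, not the runtime's actual logic.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper, NOT part of the Cilk Plus API.
// env_value: contents of CILK_NWORKERS, or nullptr if it is not set.
// hw_cores:  number of cores reported by the OS (must be at least 1).
// Returns the worker count we would expect under the assumed rule.
unsigned expected_workers(const char* env_value, unsigned hw_cores)
{
    assert(hw_cores >= 1);
    if (env_value != nullptr) {
        char* end = nullptr;
        const long n = std::strtol(env_value, &end, 10);
        // Accept the override only if the whole string is a positive number.
        if (end != env_value && *end == '\0' && n > 0) {
            return static_cast<unsigned>(n);
        }
    }
    return hw_cores;  // no usable override: assume all cores are used
}
```

For example, `expected_workers(std::getenv("CILK_NWORKERS"), 4)` should give 4 on a quad-core machine when the variable is unset. Comparing that number with what Task Manager shows is a quick sanity check.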
On Windows* you can see the cores being used by bringing up Task Manager (right click in the task bar and select "Start Task Manager") and click the "Performance" tab. You'll see a graph for each of your CPUs in the block "CPU Usage History". You should see a spike on each of the cores when your program runs.
The question is whether your application is actually doing enough work to engage all of the cores. Intel Cilk Plus distributes work using a technique called "work stealing". Each of the idle cores will randomly pick another core and try to steal work from it. If it fails to steal work, then it pauses briefly and then tries again. If the work gets done faster than it can be stolen, then the other cores won't have an opportunity to contribute. As Arch pointed out earlier, your application will be memory bandwidth limited.
- Barry
The code as you've written it should use all of the available cores. Here are a few things to consider:
1. Is your machine relatively "quiet" when you run this test? If other processes are making significant use of the CPU, you will not get linear speedup.
2. What kind of numbers are you seeing? My concern is that the loop is doing so little work (even at 10 million iterations) that you are running into precision problems with the timers.
3. Another possibility (related to 2) is that the work per iteration is too small and that parallel overhead is overpowering the benefits. I recommend that you run cilkview on the program and look at the "burdened span" and expected speedup numbers.
4. Also related to the lack of work: you have only one parallel section in the program. The timing you are getting includes the start-up cost for spinning up the Cilk worker threads. On Windows, this cost can be substantial. To exclude the startup from your timing, you can try calling __cilkrts_init() explicitly before starting the timer (you need to #include <cilk/cilk_api.h> to use __cilkrts_init() ).
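Points 2 and 4 can be addressed together with a small timing harness. The following is only a sketch in portable C++ (`std::chrono` rather than the Windows timer, and no Cilk calls): `best_of_seconds` is a hypothetical helper that performs one untimed warm-up run, then reports the fastest of several timed runs.

```cpp
#include <cassert>
#include <chrono>
#include <limits>

// Hypothetical helper: runs `work` once untimed (warm-up), then `reps`
// timed runs, and returns the fastest timed run in seconds.
template <typename Work>
double best_of_seconds(Work work, int reps)
{
    assert(reps > 0);
    using clock = std::chrono::steady_clock;

    work();  // warm-up: absorbs one-time costs such as worker thread start-up

    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < reps; ++r) {
        const auto start = clock::now();
        work();
        const auto stop = clock::now();
        const double s = std::chrono::duration<double>(stop - start).count();
        if (s < best) {
            best = s;  // keep the least-disturbed measurement
        }
    }
    return best;
}
```

Pass the loop under test as a lambda, for example `best_of_seconds([&] { run_loop(); }, 5)`, where `run_loop` stands in for your own code. If the fastest run is still close to the timer resolution, increase the amount of work per run.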
Let me know if any of that helps,
-Pablo
The problem is likely the same as for http://software.intel.com/en-us/forums/topic/391378 -- the memory bandwidth, not the arithmetic units, is the limiting resource.
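A back-of-the-envelope estimate shows why this matters. The sketch below computes a lower bound on run time from memory traffic alone. The function name and all figures in the usage note are assumptions for illustration, not measurements from the attached code.

```cpp
#include <cassert>
#include <cstddef>

// Lower bound on run time (seconds) for a loop limited purely by memory
// traffic: total bytes moved divided by sustained memory bandwidth.
double bandwidth_bound_seconds(std::size_t n_elements,
                               std::size_t bytes_per_element,
                               double bandwidth_bytes_per_sec)
{
    assert(bandwidth_bytes_per_sec > 0.0);
    const double total_bytes = static_cast<double>(n_elements) *
                               static_cast<double>(bytes_per_element);
    return total_bytes / bandwidth_bytes_per_sec;
}
```

Assuming 10 million iterations that each read two floats and write one (12 bytes) and an assumed sustained bandwidth of 10 GB/s, `bandwidth_bound_seconds(10000000, 12, 10e9)` gives 0.012 s. The cores share the same memory system, so adding workers does not lower that bound. Only doing more arithmetic per byte fetched improves the scaling.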
Hi All,
I checked with task manager. There is a peak in all cores.
When each iteration does 3 operations (and not 1 as in my code), the cilk code runs ~40% faster than the regular code.
I also tried calling __cilkrts_init(). The first iteration now runs much faster (as fast as the following ones).
Your help is highly appreciated.
Best regards,
Zvika
It might be worthwhile to learn about cache-oblivious and cache-aware approaches to dealing with the flops/memory issue. https://en.wikipedia.org/wiki/Cache-oblivious_algorithm would be a good start on cache-oblivious algorithms. http://en.wikipedia.org/wiki/Loop_blocking describes a common cache-aware approach.
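As a small illustration of loop blocking, here is a sketch of a tiled matrix transpose in plain C++. It is a generic textbook example, not taken from the code in this thread, and the function name and the block size you pass are assumptions you would tune for your own cache.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Transposes a rows x cols row-major matrix, visiting it in block x block
// tiles so that the reads and writes of one tile stay within cache.
std::vector<float> transpose_blocked(const std::vector<float>& in,
                                     std::size_t rows, std::size_t cols,
                                     std::size_t block)
{
    assert(block > 0);
    assert(in.size() == rows * cols);
    std::vector<float> out(in.size());

    for (std::size_t i0 = 0; i0 < rows; i0 += block) {
        const std::size_t i1 = std::min(i0 + block, rows);
        for (std::size_t j0 = 0; j0 < cols; j0 += block) {
            const std::size_t j1 = std::min(j0 + block, cols);
            // Finish one tile completely before moving to the next.
            for (std::size_t i = i0; i < i1; ++i) {
                for (std::size_t j = j0; j < j1; ++j) {
                    out[j * rows + i] = in[i * cols + j];
                }
            }
        }
    }
    return out;
}
```

The result is identical to a naive transpose. Only the order of memory accesses changes, and that order is what reduces cache misses on large matrices.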
The first time you issue a cilk_for (or any other cilk_... construct that instantiates the thread pool) you will encounter additional overhead. Try repeating the timed region a few times, as follows:
[cpp]
// Repeat the timed region; the first pass absorbs the thread pool start-up cost.
for (iRep = 0; iRep < 5; ++iRep)
{
    QueryPerformanceCounter(&startTick);
    cilk_for (int i = 0; i < N_ELEMENTS; i++)
    {
        R[i] = CalcOneElement(I[i], Q[i]);
    }
    QueryPerformanceCounter(&stopTick);
    totalTime = (double)(stopTick.QuadPart - startTick.QuadPart) / (double)ticksPerSecond.QuadPart;
    printf("cilk time=%f\n", totalTime);
}
[/cpp]
Jim Dempsey