Dear Administrator,
When using a Nehalem processor, it is important to apply "first touch" memory allocation for thread scalability.
Is first touch applied in MKL routines?
Thanks,
Kosuke Fujii
Optimization of first touch is important for multi-socket NUMA platforms; it will not matter on single-socket CPUs such as Core i7. On multi-socket platforms, first touch requires a combination of conditions to work well:
1. NUMA option selected in BIOS
2. RAM channels populated equally on all CPUs
3. OS with appropriate scheduling (current Linux, or Windows 7 or the new server beta version)
4. appropriate affinity setting, e.g. KMP_AFFINITY=compact, HT disabled, all cores used, or OMP_NUM_THREADS set to number of cores used and GOMP_CPU_AFFINITY set to 1 thread per core
5. data "first touched" (initialized, usually in your own program code) by the same CPU that will do the work on it, in a static-scheduled OpenMP parallel region
6. the MKL functions used must also employ static scheduling, using the full number of threads
It would be interesting if the MKL notes revealed which MKL functions ought to benefit from first touch. I don't think that documentation exists, so it is up to you to run performance tests. Certain MKL functions use a number of threads that depends on the data-set size, so you would have to know that number and take it into account in your own program and environment-variable settings.
Note that these platforms typically ship with the BIOS set to a non-NUMA (interleaved) memory organization, with cache lines alternating among memory banks, so that no strategy results in more than 50% local references.
If MPI is used with hybrid affinity settings to make each process local to one socket, it should take care of memory locality, implicitly providing local first touch, when the NUMA BIOS setting is in effect.