Intel® Moderncode for Parallel Architectures

cache topology

Ilya_Z_
Beginner
1,086 Views

hi,

I'm writing a CPUID program and need help getting the number of caches of each type: not the size, but the count. For example, I need to get info like the line below:

L1 data cache = 2 x 64KB.

CPUID gives me the size of each kind of cache, but not how many there are. On MSDN I found that the GetLogicalProcessorInformationEx function might help to get that number, but I'm not sure I understood it correctly. My guess is that the GROUP_AFFINITY member of the CACHE_RELATIONSHIP structure is related to the quantity. Could someone give me some hints, explain what this function actually does, or tell me where else to find such information?

thanks in advance

13 Replies
jimdempseyatthecove
Honored Contributor III

The CPUID instruction reports information about the hardware thread it executes on (its HT position within a core, the core within a CPU, and the CPU within the system). The general technique to produce the cache topology is:

for each of the logical processors available to your process
    set affinity to that logical processor (switching execution to that hardware thread)
    obtain the APIC number (and/or group number) of the CPU
    then obtain the cache relationships within the CPU

This is a non-trivial process, especially if your code must handle a wide variety of CPUs (Intel and AMD).

Some operating systems may provide an API to obtain this information. Whether what they provide is what you want is a different story.

Jim Dempsey

Ilya_Z_
Beginner

Thanks for the reply, Jim.

I know what you are talking about. I already wrote a routine that derives the constants needed to create the masks, then pins itself to each logical processor, reads its APIC ID, and extracts each sub-ID to build the topology.

When I have the threads-per-cache parameter from CPUID leaf 04h EAX[25:14] (Intel) or 8000001Dh EAX[25:14] (AMD), it is probably similar to deriving the constants for the CPU topology, I mean LogToNearestPowerOf2(threads per cache). I'm not sure I'm on the right track, but I think it is a good approach.

The other problem is what to do when the CPU doesn't support this leaf. I tested my code on an AMD Phenom(tm) II X6 1075T. The algorithm assumes that if the leaf is not supported, then each core has every level of cache to itself. And this code has a bug.
Look at the report below:

Processor name:  AMD Phenom(tm) II X6 1075T Processor

Cores per CPU   :  006
Threads per CPU:  006

CACHE L1 CODE                 :  06 x 00064 KB

Ways of associativity            :  00002
Byte line size(B)                   :  00064
Physical line partitions          :  N/A
Number of sets                     :  N/A
Threads per cache                :  N/A
IDs for cores per CPU           :  N/A
Lines per tag                        :  00001
Self initializing                      :  N/A
Fully associativate                :  N/A
Write-back invalidate             :  N/A
Inclusive of lower cache levels:  N/A
Complex cache indexing        :  N/A
Unified on-die                        :  N/A

________________________________________________________________________________

CACHE L1 DATA                  :  06 x 00064 KB

Ways of associativity            :  00002
Byte line size(B)                   :  00064
Physical line partitions          :  N/A
Number of sets                     :  N/A
Threads per cache                :  N/A
IDs for cores per CPU           :  N/A
Lines per tag                        :  00001
Self initializing                      :  N/A
Fully associativate                :  N/A
Write-back invalidate             :  N/A
Inclusive of lower cache levels:  N/A
Complex cache indexing        :  N/A
Unified on-die                        :  N/A

_______________________________________________________________________________

CACHE L2 UNIFIED              :  06 x 00512 KB

Ways of associativity            :  00016
Byte line size(B)                   :  00064
Physical line partitions          :  N/A
Number of sets                     :  N/A
Threads per cache                :  N/A
IDs for cores per CPU           :  N/A
Lines per tag                        :  00001
Self initializing                      :  N/A
Fully associativate                :  N/A
Write-back invalidate             :  N/A
Inclusive of lower cache levels:  N/A
Complex cache indexing        :  N/A
Unified on-die                        :  N/A

________________________________________________________________________________

CACHE L3 UNIFIED              :  06 x 06144 KB

Ways of associativity            :  00048
Byte line size(B)                   :  00064
Physical line partitions          :  N/A
Number of sets                     :  N/A
Threads per cache                :  N/A
IDs for cores per CPU           :  N/A
Lines per tag                        :  00001
Self initializing                      :  N/A
Fully associativate                :  N/A
Write-back invalidate             :  N/A
Inclusive of lower cache levels:  N/A
Complex cache indexing        :  N/A
Unified on-die                        :  N/A

________________________________________________________________________________


CPUID DATA:

Leaf      | EAX       | EBX       | ECX       | EDX       |
00000000h | 00000006h | 68747541h | 444D4163h | 69746E65h |
00000001h | 00100FA0h | 05060800h | 00802009h | 178BFBFFh |
00000002h | 00000000h | 00000000h | 00000000h | 00000000h |
00000004h | 00000000h | 00000000h | 00000000h | 00000000h |
00000004h | 00000000h | 00000000h | 00000000h | 00000000h |
00000004h | 00000000h | 00000000h | 00000000h | 00000000h |
00000004h | 00000000h | 00000000h | 00000000h | 00000000h |

80000000h | 8000001Bh | 68747541h | 444D4163h | 69746E65h |
80000005h | FF30FF10h | FF30FF20h | 40020140h | 40020140h |
80000006h | 20800000h | 42004200h | 02008140h | 0030B140h |
80000019h | F0300000h | 60100000h | 00000000h | 00000000h |
8000001Dh | 00000000h | 00000000h | 00000000h | 00000000h |
8000001Dh | 00000000h | 00000000h | 00000000h | 00000000h |
8000001Dh | 00000000h | 00000000h | 00000000h | 00000000h |
8000001Dh | 00000000h | 00000000h | 00000000h | 00000000h |

I checked it on cpu-world.com, and this CPU has only one L3.

jimdempseyatthecove
Honored Contributor III

Ilya,

RE: Analyse

When you include a URL (internet hyperlink), the administrator wants to make sure you are not using the forum to send SPAM.

RE: CPUID

Attached is QuickThread.cpp. This is one of the files from my threading toolkit QuickThread(R). You can download the entire toolkit from www.quickthreadprogramming.com.

Search for QuickThreadStruct::ProcessNodes()

This routine is called for each thread of the thread pool as it comes up. The routine, and the caller code, is a bit complex, so making sense of what is happening is not straightforward. The mapping of cache to hardware thread could be made easier, but that is not the only thing the code is doing.

If you build the toolkit, you can then step through the code using the debugger. Also, if you build a simple test app that initializes the thread pool (qt::qtInit) and calls "void qt::DumpState()", it will print out the topology.

Jim Dempsey

Ilya_Z_
Beginner

It now remains for me to read and understand the code. It is hard to find someone who can help with discovering CPU and cache topology.

Thanks a lot Jim.

Bernard
Valued Contributor I

 >>>On MSDN i've found that GetLogicalProcessorsInformationEx proc might be >>>

I agree that this function is not the easiest to work with. I also had problems with the output it provides.

jimdempseyatthecove
Honored Contributor III

Also note that GetLogicalProcessorInformationEx is not available on all O/S's (or rather, in all versions of the Win32 DLLs). Good code would query whether the entry point is available, i.e. attempt to fetch the address of the function; if found, make the call, else take alternative action.

Jim Dempsey

Bernard
Valued Contributor I

Thanks Jim for providing this valuable tip.

jimdempseyatthecove
Honored Contributor III

[cpp]


#if !defined(__linux)
// Pointer type for GetLogicalProcessorInformation, resolved at run time below.
typedef BOOL (WINAPI *LPFN_GLPI)(
    PSYSTEM_LOGICAL_PROCESSOR_INFORMATION,
    PDWORD);
#endif
...

#if defined(__linux)
// other code
#else
// Fetch the entry point by name so the program still loads where it is missing.
LPFN_GLPI Glpi = (LPFN_GLPI)GetProcAddress(
    GetModuleHandle(TEXT("kernel32")), "GetLogicalProcessorInformation");
if(Glpi != NULL)
    DoGlpi(Glpi);
else
    DoYourAlternateMethod();
#endif
...

void QuickThreadStruct::DoGlpi(LPFN_GLPI Glpi)
{
    BOOL done;
    BOOL rc;

    SYSTEM_LOGICAL_PROCESSOR_INFORMATION* SLPI;
    DWORD ReturnLength;
    int_Native nSLPI;
    int_Native iSLPI;
    int_Native packageCount;
    uint_Native procCoreCount;
    uint_Native ProcessorMask;
    LOGICAL_PROCESSOR_RELATIONSHIP Relationship;
    int8_t ProcessorCore_Flags;
    DWORD NumaNode_NodeNumber;
    _CACHE_DESCRIPTOR CacheDescriptor;
    done = FALSE;
    SLPI = 0;
    ReturnLength = 0;
    QuickThreadThreadContext *tls = qtThreadContext;

    // Determine the size of the buffer: the first call is expected to fail
    // with ERROR_INSUFFICIENT_BUFFER and to set ReturnLength.
    rc = Glpi(SLPI, &ReturnLength);
    if((rc != FALSE) || (GetLastError() != ERROR_INSUFFICIENT_BUFFER))
        DoStop(__FILE__, __LINE__, "Failure obtaining SYSTEM_LOGICAL_PROCESSOR_INFORMATION");

    // The returned length must be a whole number of records.
    nSLPI = ReturnLength / sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION);
    if((nSLPI*sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION)) != ReturnLength)
        DoStop(__FILE__, __LINE__, "Debug");
    {
        // Allocate with the regular allocator rather than the QuickThread one.
        bool b = tls->set_Use_qt_malloc(false);
        SLPI = new SYSTEM_LOGICAL_PROCESSOR_INFORMATION[nSLPI];
        tls->set_Use_qt_malloc(b);
    }

    // now obtain the array of SLPI
    rc = Glpi(SLPI, &ReturnLength);
    if(rc != TRUE)
        DoStop(__FILE__, __LINE__, "Failure obtaining SYSTEM_LOGICAL_PROCESSOR_INFORMATION");

// Older SDK headers lack this enumerator; its value is 3.
#ifndef RelationProcessorPackage
#define RelationProcessorPackage 3
#endif
    packageCount = 0;
    procCoreCount = 0;
    for(iSLPI=0; iSLPI<nSLPI; ++iSLPI)
    {
        ProcessorMask = (uint_Native)SLPI[iSLPI].ProcessorMask;
        Relationship = SLPI[iSLPI].Relationship;
        switch(Relationship)
        {
        case RelationProcessorCore:
            // One record per core; the mask holds that core's logical processors.
            ProcessorCore_Flags = SLPI[iSLPI].ProcessorCore.Flags;
            procCoreCount = qt_bitCount(ProcessorMask);
            break;
        case RelationNumaNode:
            NumaNode_NodeNumber = SLPI[iSLPI].NumaNode.NodeNumber;
            if(NumaNode_NodeNumber)
            {
                if(tls->NUMA_NodeNumber == 0)
                {
                    tls->pThreadFreeListList = tls->pDefaultThreadFreeListList; // allocate later
                    tls->pNUMAPoolContext = 0;
                    tls->NUMA_NodeNumber = NumaNode_NodeNumber;
                }
            }
            if(NumaNode_NodeNumber > HighestNumaNodeNumber)
            {
                DoStop(__FILE__, __LINE__, "NumaNode_NodeNumber > HighestNumaNodeNumber\n");
            }
            // Same test as above; only reached if DoStop() returns.
            if(NumaNode_NodeNumber > HighestNumaNodeNumber)
                HighestNumaNodeNumber = NumaNode_NodeNumber;
            if(NumaNode_NodeNumber < nBits_uint_Native)
                NumaNodeMasks[NumaNode_NodeNumber].bm |= ProcessorMask;
            break;
        case RelationCache:
            // One record per cache; ProcessorMask lists the logical processors sharing it.
            CacheDescriptor = SLPI[iSLPI].Cache;
            break;
        case RelationProcessorPackage:
            // One record per physical package.
            ++packageCount;
            break;
        default:
#if defined(_DEBUG)
            printf("Undefined relationship %d\n", SLPI[iSLPI].Relationship);
#endif
            break;
        }
    }
    {
        bool b = tls->set_Use_qt_malloc(false);
        delete [] SLPI;
        tls->set_Use_qt_malloc(b);
    }
} // void QuickThreadStruct::DoGlpi(LPFN_GLPI Glpi)
[/cpp]

The above won't compile on your system (it is part of the QuickThread library). However, it will provide an outline of what you are up against.

Jim Dempsey

Bernard
Valued Contributor I

Thanks Jim:)

Your example provides a lot of information on how to use that WinAPI function.

jimdempseyatthecove
Honored Contributor III

The above code was written before Windows 7. Windows 7 and later added a new feature for systems with more than 64 logical processors. You will likely have to extend the code to handle this capability.

The other code uses CPUID/CPUIDX and APIC codes; this too may need revisions.

Jim Dempsey

Bernard
Valued Contributor I

I am working on a C++ library wrapped around WinAPI, and I will test your code on Win7 and Win8.

jimdempseyatthecove
Honored Contributor III

The code works on Windows 7 on systems with one group. Before Windows 7 there was only one group (and no aggregation of processors into groups), which limited a system to 64 logical processors (on the x64 platform). On systems with groups, each group has an upper limit of 64 logical processors. Linux did not have this issue because it supported a variable-length bitmask, sized for 1024 to 65536 processors depending on the system (though some may go beyond this).

See: http://archive.msdn.microsoft.com/64plusLP

for additional information.

Jim Dempsey

Bernard
Valued Contributor I

Similar information about processor groups is also in the Windows Internals book, sixth edition.
