Solved: I would also like to see more

AFog0 · ‎10-29-2019

I need to check from within an application if hyperthreading is enabled. The CPUID instruction on Intel CPUs will return a wrong number of threads if hyperthreading is disabled (AMD does not have this problem). The only method I can find is to compare the number of threads reported by CPUID with the number of threads reported by the operating system. This method will not work if there are multiple CPU chips on the computer. Please provide a better method (other than running tests in all threads). The method must work in all operating systems.

My test code is at https://github.com/vectorclass/add-on/blob/master/physical_processors/physical_processors.cpp

McCalpinJohn · ‎10-30-2019

Although I have often asked the same question, I wonder if the answer depends on what you are going to do with the information? My interpretation of the CPUID leaf 0x0b information is that it is not so much "wrong" as reporting what the hardware is capable of supporting. Either the BIOS or a hypervisor can control what is available to the OS, and it is not clear that it makes sense for the hardware to reflect such restrictions. (I would certainly like to have "real" topology information -- including the physical location of each logical processor and each cache slice on the die -- but Intel seems committed to hiding that information behind at least one layer of undocumented re-mapping....)

For "mainstream" Intel processors, I usually check the number of programmable performance counters using CPUID[0xa].eax[15:8] and assume that if the value is 8, HT is disabled, and if the value is 4, HT is enabled. If the value is 2, it is probably a KNL (and does not depend on HyperThreading). I don't deal with the Atom-based processors often enough to know (or care) what they provide.
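A minimal sketch of the counter-count check described above, assuming GCC or Clang on x86 (`__get_cpuid_count` from `<cpuid.h>`). The names `HtGuess`, `pmu_counters`, `guess_ht_from_leaf_0a` and `guess_ht_on_this_cpu` are hypothetical. The decoding is kept separate from the CPUID call so the heuristic can be exercised with hand-built register values:

```cpp
#include <cassert>
#include <cstdint>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>  // GCC/Clang helper header; MSVC would need __cpuidex from <intrin.h>
#endif

// Hypothetical names, not taken from any post in this thread.
enum class HtGuess { Unknown, Enabled, Disabled };

// CPUID.0AH:EAX[15:8] = number of general-purpose performance counters
// per logical processor.
inline unsigned pmu_counters(std::uint32_t eax) { return (eax >> 8) & 0xFFu; }

// The heuristic from the post: 8 counters -> HT off, 4 counters -> HT on.
// Anything else (for example 2 on KNL) is reported as Unknown.
inline HtGuess guess_ht_from_leaf_0a(std::uint32_t eax) {
    switch (pmu_counters(eax)) {
        case 8:  return HtGuess::Disabled;
        case 4:  return HtGuess::Enabled;
        default: return HtGuess::Unknown;
    }
}

// Runs CPUID on the calling thread. Returns Unknown on non-x86 builds
// or when leaf 0x0A is not supported.
inline HtGuess guess_ht_on_this_cpu() {
#if defined(__x86_64__) || defined(__i386__)
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (__get_cpuid_count(0x0A, 0, &eax, &ebx, &ecx, &edx))
        return guess_ht_from_leaf_0a(eax);
#endif
    return HtGuess::Unknown;
}

// Checks on hand-built EAX values, so they run on any machine.
inline void self_check_leaf_0a() {
    assert(guess_ht_from_leaf_0a(0x00000804u) == HtGuess::Disabled);
    assert(guess_ht_from_leaf_0a(0x00000404u) == HtGuess::Enabled);
}
```

The tri-state result is a deliberate choice: the 8-versus-4 rule is an observation about particular processors, so any other count is better reported as "unknown" than guessed.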

Following up on Jim Dempsey's comment, the only difference I can see in the other parameters on SKX/CLX processors when HT is enabled is one of the fields in CPUID leaf 0x02 -- with HT enabled one of the output fields is 0xb5 "Instruction TLB: 4KiB pages, 8-way set associative, 64 entries", while with HT disabled I get a 0xb6 instead "Instruction TLB: 4KiB pages, 8-way set associative, 128 entries". This seems at least as likely to cause confusion as the number of programmable core performance counters, and has the disadvantage of being less clearly documented, and (as far as I can tell) without guarantee of any specific ordering of values in the EAX, EBX, ECX, EDX registers returned by CPUID leaf 0x02.
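Because the descriptor bytes of leaf 0x02 come in no guaranteed order, any check has to scan all of them. A sketch of such a scan, with hypothetical names (`leaf2_has_descriptor`, `ItlbHint`, `itlb_hint`): it skips the low byte of EAX and any register with bit 31 set, as the leaf's documented format requires. The 0xB5/0xB6 mapping is only the SKX/CLX observation from this post, not a documented guarantee:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical names. regs = {EAX, EBX, ECX, EDX} as returned by CPUID leaf 0x02.
// Returns true if the non-zero descriptor byte `wanted` appears anywhere
// among the valid descriptor bytes.
inline bool leaf2_has_descriptor(const std::array<std::uint32_t, 4>& regs,
                                 std::uint8_t wanted) {
    for (std::size_t r = 0; r < regs.size(); ++r) {
        // Bit 31 set means this register carries no valid descriptors.
        if (regs[r] & 0x80000000u) continue;
        // The low byte of EAX (AL) is not a descriptor, so skip it.
        for (unsigned b = (r == 0 ? 1u : 0u); b < 4u; ++b) {
            if (static_cast<std::uint8_t>(regs[r] >> (8u * b)) == wanted)
                return true;
        }
    }
    return false;
}

enum class ItlbHint { Unknown, HtEnabled, HtDisabled };

// Mapping observed on SKX/CLX in this post: 0xB5 with HT on, 0xB6 with HT off.
inline ItlbHint itlb_hint(const std::array<std::uint32_t, 4>& regs) {
    const bool b5 = leaf2_has_descriptor(regs, 0xB5);
    const bool b6 = leaf2_has_descriptor(regs, 0xB6);
    if (b5 == b6) return ItlbHint::Unknown;  // neither, or (unexpectedly) both
    return b5 ? ItlbHint::HtEnabled : ItlbHint::HtDisabled;
}

// Checks on hand-built register values, so they run on any machine.
inline void self_check_leaf_02() {
    const std::array<std::uint32_t, 4> regs{{0x00000001u, 0x0000B500u, 0u, 0u}};
    assert(itlb_hint(regs) == ItlbHint::HtEnabled);
}
```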

jimdempseyatthecove · ‎10-29-2019

Agner,

You should be able to do this using CPUID to check the L1 cache associativity. IOW, use affinity pinning and count the number of threads sharing an L1. The number of logical processors you need to run the tests on will depend on how the O/S sets up the logical-processor-to-physical-processor mapping.
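One way to picture the counting step, as a sketch only: assume the APIC ID of each logical processor has already been collected (by pinning a thread to it and executing CPUID there, which is O/S specific and omitted here), and that EAX of the CPUID leaf 0x04 sub-leaf describing the L1 data cache is at hand. All function names below are hypothetical. Note that EAX[25:14] gives the maximum number of addressable IDs sharing the cache, not the number of active threads, so it is used only to derive the APIC ID shift:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// CPUID.04H:EAX[4:0]: cache type (0 = no more caches, 1 = data,
// 2 = instruction, 3 = unified).
inline unsigned cache_type(std::uint32_t eax) { return eax & 0x1Fu; }

// CPUID.04H:EAX[7:5]: cache level (1 = L1, 2 = L2, ...).
inline unsigned cache_level(std::uint32_t eax) { return (eax >> 5) & 0x7u; }

// CPUID.04H:EAX[25:14] + 1: maximum number of addressable logical
// processor IDs sharing this cache.
inline unsigned max_ids_sharing(std::uint32_t eax) {
    return ((eax >> 14) & 0xFFFu) + 1u;
}

// Number of low APIC ID bits that distinguish the sharers: ceil(log2(n)).
inline unsigned share_shift(unsigned n) {
    unsigned s = 0;
    while ((1u << s) < n) ++s;
    return s;
}

// apic_ids: one APIC ID per logical processor the process was able to run on.
// Returns the largest number of those logical processors that share one L1.
inline unsigned max_threads_per_l1(const std::vector<std::uint32_t>& apic_ids,
                                   std::uint32_t l1_eax) {
    const unsigned shift = share_shift(max_ids_sharing(l1_eax));
    std::map<std::uint32_t, unsigned> per_cache;  // L1 id -> thread count
    unsigned best = 0;
    for (std::uint32_t id : apic_ids) {
        const unsigned n = ++per_cache[id >> shift];
        if (n > best) best = n;
    }
    return best;
}

// Checks on hand-built values, so they run on any machine.
inline void self_check_l1_sharing() {
    const std::uint32_t l1_eax = 0x00004021u;  // data cache, level 1, 2 IDs
    assert(max_threads_per_l1({0u, 1u, 2u, 3u}, l1_eax) == 2u);  // siblings seen
    assert(max_threads_per_l1({0u, 2u, 4u, 6u}, l1_eax) == 1u);  // one per core
}
```

A result greater than 1 means at least two visible logical processors share an L1. A result of 1 only says that no siblings were visible to this process, which can also happen when its affinity mask is restricted.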

Jim Dempsey

jimdempseyatthecove · ‎10-30-2019

One additional note. On a system with HT enabled, the O/S still controls whether it will actually use the additional HT sibling. While I cannot say that I have ever seen this to be the case, nothing precludes the O/S from providing access to only one of the HT siblings within a core (or to fewer than the full complement of HT siblings per core when the number of HT siblings is greater than 2).

Also, should your test be required to reside within your 3rd party library as used within, say, OpenMP, the application may be started with a subset of threads selected from a single thread per core.

The code that I use to make this determination works well using APIC and APIC2 CPUID structured information on single/multi socket Intel processors (tested up through KNL but not tested on Scalable architectures). My code for AMD processors is way out of date. The last AMD CPUs it covered were the AMD Opteron 200 series processors (~2003 vintage); the CPUID tables have certainly changed for AMD over the intervening 16 years.

My code is in a threading toolkit. It performs the mapping amongst the selected thread configuration within the process given permitted affinities. e.g. if the process is launched as a rank in MPI with more than 1 logical processor and fewer than full set of logical processors on node, then the rank's process affinity is a subset of the full complement of logical processors on the node.

While this is not the fastest test of whether more than one thread per core is used by the O/S, a derivative of this code could make this determination quicker by selectively choosing the logical processor pair combinations that statistically produce the shortest path to determination.

Jim Dempsey

AFog0 · ‎10-31-2019

Thank you for the answers. I tried the method with performance counters. It works on Intel CPUs with performance monitoring version 3 and 4.

But how can I be sure that this will work on future processors? What if a future CPU has 8 performance counters with HT enabled? Would such a future processor have a different value for the performance monitoring version ID (cpuid(A).eax bits 0-7)?

I would certainly prefer that Intel make a CPUID function to tell whether hyperthreading is enabled. It would also be nice to tell which CPU resources are shared between threads. This could help decide whether it is optimal to run two threads in the same core.

McCalpinJohn · ‎10-31-2019

I would also like to see more of this information available via CPUID. The newest processors have even started including some frequency information there! (Not complete, unfortunately, but moving in the right direction -- and certainly better than parsing the "Brand String" provided by the CPUID instruction!)

I would also like to see more information available at zero latency in user space via special registers. In some of the work that I have done on highly optimized communication and synchronization, the performance can be severely degraded if the L1 Data Cache victimizes the lines holding such basic information as "my core number", "my thread number", "the list of thread numbers that I will be communicating with during each phase of the tree barrier", and "the list of NUCA-optimal addresses that I need to use when communicating with each other thread in the job". I don't want to risk taking several trips to memory just to decide which address I need to use for a particular communication step.

A perversity of the Intel TSO memory ordering model is that a HyperThread is not allowed to see the results of stores made by its sibling before those stores are visible to all cores. This requires extra design effort to thwart the natural latency benefit of cache sharing. :-(

Someday architectures may include communication and synchronization as first-class concepts. An implementation of such an architecture will probably have to include topology information, such as data about sharing physical cores and/or caches. Until then, the "multicore revolution" has not happened.....

AFog0 · ‎10-31-2019

Now this thread has turned into a more general discussion.

John, if you need some information to be available at zero latency, maybe you can store it in a vector register. With AVX512 you have 32 registers of 512 bits each. You may pack important data into a few of these, or into the obsolete x87/MMX registers. I agree that this may be difficult in a high-level language, but a __m512i or __m64 vector will rarely be spilled into memory if you inline all functions. We already have too many different register types; we don't need more registers that have to be saved at every task switch.

I agree that there are many problems with hyperthreading. It is difficult for an application program to even detect if hyperthreading is on or off, as this thread verifies. It is even more difficult to calculate whether it is optimal to use hyperthreading or not. You need to know which resources are shared between threads running in the same core, and how. The application may need to do calculations such as: How much multiplication do I need? How many multiplication units does this CPU have? What is the throughput of these units? Are they shared with another thread? Are they a limiting bottleneck? etc. etc. Similar calculations for instruction decoders, memory ports, BTB, and so on. These calculations will be different for every new CPU model, so you need long tables of the capabilities and oddities of each CPU model. The software needs to update these tables every time a new CPU model enters the market. This is totally unrealistic. My research on CPU dispatching shows that software often lags several years behind the hardware. Not even big software companies can afford to tune their software to every new CPU model.

You may leave the decision of whether to use hyperthreading or not to the operating system. In this case, the OS needs information about the resource uses of each thread: Is this thread limited by RAM access, instruction decoding, CPU execution units, BTB, or something else? This is no more realistic.

Another problem with hyperthreading is that a low priority thread can steal resources from a high priority thread running in the same core. Operating systems are not always able to handle this problem properly.

The multicore revolution, as you call it, may actually be happening for applications with coarse-grained parallelism, but I don't see any realistic prospect for a hyperthreading revolution.

jimdempseyatthecove · ‎11-02-2019

Agner,

The register context belongs to the hardware thread. Each thread has its own context (set of registers). Therefore one cannot use a register for inter-thread messaging. HT threads do not share registers. They do share L1 ICache and L1 DCache as well as TLB internal registers. L2 is shared amongst HT threads of a core as well as (potentially) additional cores.

I've given thought to this problem before. One option could be to add a property to the page table page info. It already contains information as to whether the page is: present, readable, writable, executable, cacheable, combinable (and potentially others).

The idea is to make the page lazy-store. By this I mean writes are stored into L1, but the write to RAM is deferred. Lazy-stored entries can be evicted from L1 to L2, then L2 to L3, then upon eviction from L3 - written to RAM.

It may be beneficial to have three different types of lazy-store: Lazy-store-L1, Lazy-store-L2, Lazy-store-L3 such that one can specify which threads of a process have nexus to the data (those sharing L1, those sharing L2, those sharing L3). I suppose one level could be used provided a non-same-core thread could lift an off-core lazy-store L1 to cache shared with its core. Note, there may be considerations (not made) involving cross-socket lazy store scheme. I think that intra-socket lazy store would be a nice feature to use.

>>Another problem with hyperthreading is that a low priority thread can steal resources from a high priority thread running in the same core

As far as I know, there is no hardware priority for resource ownership amongst core HTs. So yes, a low (software) priority thread within a core has equal access to a high (software) priority thread within the same core.

Some additional reading that you might find informative is by John McCalpin "Dr. Bandwidth":

https://sites.utexas.edu/jdm4372/2018/01/01/notes-on-non-temporal-aka-streaming-stores/

While this discusses streaming stores, what I suggest above is akin to anti-streaming stores favoring hardware thread proximity(ies).

Note, this feature would not require any coordination with O/S software.

Jim Dempsey

jimdempseyatthecove · ‎11-02-2019

Revision to post #6

In lieu of having the page table contain the anti-non-temporal storage indication, use some of the reserved instruction codes. The following is not official but from information I have, the pertinent op-code map information is:

The following have mod r/m for register to memory

Inst        pfx     byte1   byte2   byte3
VMOVNTPS    none    0F      2B
VMOVNTPD    66      0F      2B
available   F2      0F      2B
available   F3      0F      2B

MOVNTI      none    0F      C3
available   66      0F      C3
available   F2      0F      C3
available   F3      0F      C3

MOVNTQ      none    0F      E7
VMOVNTDQ    66      0F      E7
available   F2      0F      E7
available   F3      0F      E7

VMOVNTDQA   66      0F      38      2A
available   66      0F      38      4A
available   66      0F      38      6A
available   66      0F      38      7A
available   66      0F      38      8A
available   66      0F      38      CA
available   66      0F      38      DA
available   66      0F      38      EA
available   66      0F      38      FA

The MOVNTI could be modified to permit three variants of anti-non-temporal stores of DWORD to L1, L2 and L3

I do not understand the reasoning as to why there is both a VMOVNTPS and VMOVNTPD except for when a mask is provided. The VMOVNTPx could provide targeting to one of the anti-non-temporal store levels (my choice would be L3) to provide for fast reduction operations amongst all threads in socket. A similar suggestion can be made for MOVNTQ and VMOVNTDQ.

An additional extension of anti-non-temporal store (a subset of hardware threads being capable of messaging via cache without incurring the overhead of a write to RAM) would be to have a means to perform LOCK operations that lock only to the extent of the targeted cache level. There are many situations where the programmer would find it beneficial to have groupings of threads that have faster interaction methods amongst the threads of the small group, independently of all other threads outside the subset.

Jim Dempsey

jimdempseyatthecove · ‎11-02-2019

And continuing....

RE: LOCK

Note that the anti-non-temporal store xMOVNTxx instructions are intended to annotate the line held in the given cache level as being an entry made with anti-non-temporal store. Therefore, the normal LOCK instructions can be augmented to differentiate (and take different actions) amongst cache lines containing the anti-non-temporal store flag. IOW no new instructions required.

*** It is the programmer's benefit and responsibility to code properly. This is no different from assuring thread-safe programming practices today.

Jim Dempsey

CyrIng · ‎01-17-2020

Based on the Intel SDM, plus some reading of other BKDG(s), this function of CoreFreq establishes the topology of the Package, Cores, Threads, L1/L2/L3 Caches and even CCX.

So far, it gives good results with architectures from old Core up to Skylake and derivatives. It is also OK with MP, Xeon(s) and EPYC (if Forum allows me to pronounce this word here)

The well-known CPUID algorithm works fine; you will have to split functions based on APIC (extended or not), manufacturer, and so on.

Although I'm working in a 64-bit Kernel Module, CPU affinity is mandatory when collecting the CPUID leaves per Core. Any POSIX system will offer you the same primitives (besides the MSR calls).

The one thing I still need from the Kernel is the CPU count. Although it is pretty easy to get it from CPUID and other MSRs, the enabled Cores available can only be reliably counted during the IPI startup (which is, however, already processed by the Kernel).

Hope it is independent enough!

You can give a try to CoreFreq to read the Processor Topology.

Is there a system-independent way of checking if hyperthreading is enabled?