lin__chiungliang · ‎03-02-2020

Hi,

I found a VNNI sample code provided by Intel.

It declares a data type, __m512i, which maps to registers in the CPU.

As far as I know, the number of registers in a CPU is limited.

Here is the information for the CPU I ran it on:

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              48
On-line CPU(s) list: 0-47
Thread(s) per core:  2
Core(s) per socket:  24
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
Stepping:            7
CPU MHz:             1337.399
BogoMIPS:            5000.00
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            36608K
NUMA node0 CPU(s):   0-47
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke avx512_vnni

I have some questions:

1. How can I find out how many registers a program can allocate?

2. If a program uses more registers than the CPU has, is there an error message?

3. On a multi-core / multi-thread CPU, are the registers shared or independent?

4. Following on from 3, if they are independent, how can I find out how many registers there are per core / per thread?

Many thanks

chiungliang

Bernard · ‎03-03-2020

I have some questions:

1. How can I find out how many registers a program can allocate?

For the architectural GP registers: 16 (64-bit mode)

For the architectural SIMD registers: 32 (AVX-512 ISA)

There are additional registers, such as the debug registers, the floating-point state, and the model-specific registers.

2. If a program uses more registers than the CPU has, is there an error message?

Architectural registers are mapped to physical register files (PRFs), which are hidden from software. There are usually two of them: one integer PRF and one floating-point PRF. The question is what the size of the floating-point PRF is on the Skylake microarchitecture. The size could be 512 bits, divided into four 128-bit lanes, each with its own power up/down circuitry, with wires connecting the different lanes.

3. On a multi-core / multi-thread CPU, are the registers shared or independent?

Each core has its own set of physical registers, which on a Hyper-Threading machine could be shared fairly between the threads.

4. Following on from 3, if they are independent, how can I find out how many registers there are per core / per thread?

AthiraM_Intel · ‎03-03-2020

Hi,

Since the query is more about architecture fundamentals, we are moving it to the respective forum.

lin__chiungliang · ‎03-03-2020

OK, thanks a lot.

McCalpinJohn · ‎03-03-2020

For all Intel processors, the number of register names is determined by the processor's mode of operation (32-bit or 64-bit) and the instruction set in use. The available registers for each mode are described in section 3.2 of Volume 1 of the Intel 64 and IA-32 Architectures Software Developer's Manual (document 253665), available at https://software.intel.com/en-us/articles/intel-sdm

Attempting to use a register name that does not exist would trigger an error in the assembler.

Most (all?) Intel processors use "register renaming" to minimize stalls due to false dependencies. The number of "physical registers" onto which the named registers are mapped is sometimes mentioned in technical publications, but is not part of the implementation that is visible through mechanisms like CPUID.

There are at least enough "physical registers" for each "logical processor" to have all "named registers" mapped at the same time, so HyperThreading does not reduce the number of register names available.

lin__chiungliang · ‎03-09-2020

Hi,

Thank you both.

Since there are 32 ZMM registers, I thought that trying to allocate more than 32 512-bit registers might cause problems.

So I tried to allocate more than 32 512-bit registers in a subroutine:


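#include <immintrin.h>  // __m512i and the _mm512_* intrinsics
#include <stdint.h>     // int8_t
#include <string.h>     // memset

int main()
{
        const int NUM = 1024;
        const int NUM_MUL64 = NUM*64;

        // 1024 vector variables each: far more than the 32 ZMM register names
        __m512i a[NUM];
        __m512i b[NUM];

        int8_t int_a[NUM_MUL64];
        int8_t int_b[NUM_MUL64];
        int8_t int_c[NUM_MUL64];
        int8_t *p_a = int_a;
        int8_t *p_b = int_b;
        int8_t *p_c = int_c;

        //for(int i=0; i<NUM_MUL64; i++)
        //{
        //      int_a[i] = (int8_t)i;
        //      int_b[i] = (int8_t)i;
        //}
        memset(int_a, 0, sizeof(int8_t)*NUM_MUL64);
        memset(int_b, 0, sizeof(int8_t)*NUM_MUL64);

        for(int i=0; i<NUM; i++)
        {
                // load 64 bytes from each input into element i of the arrays
                a[i] = _mm512_loadu_si512(p_a);
                b[i] = _mm512_loadu_si512(p_b);

                // element-wise 8-bit add, then store the 64 result bytes
                a[i] = _mm512_add_epi8(a[i], b[i]);
                _mm512_storeu_si512((void*)p_c, a[i]);

                p_a += 64;
                p_b += 64;
                p_c += 64;
        }

        return 0;
}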

There is no error message.

Does anyone know how this works?

Many thanks

chiungliang

Thomas_W_Intel · ‎03-10-2020

When you are using intrinsics, the compiler will take care of register assignment. In other words, with your code, you are not "allocating registers" but using variables that might be mapped to registers or memory. You can see this if you disassemble the binary. This is no different from normal C/C++ code, where the compiler can choose to use registers directly for variables or to use memory (and then copy to registers for computation).

If you want to have full control of what registers are used, you will have to write in assembly. However, it is one of the big advantages of using intrinsics in C/C++ that you don't need to deal with register assignment, because the compiler usually is good at it.

How to know how many registers in a CPU?