i5-1335U P-core and E-core integer operations throughput

SadClouds · ‎08-11-2024

Hi, I'm performing the following test on Intel i5-1335U running Linux:

1. Set both P-cores and E-cores to a constant frequency of 600 MHz. No dynamic underclocking or overclocking. For example:

# echo 600000 > /sys/devices/system/cpu/cpu11/cpufreq/scaling_min_freq

# echo 600000 > /sys/devices/system/cpu/cpu11/cpufreq/scaling_max_freq

# cat /sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq
600000

2. Perform integer addition/subtraction operations on a P-core or E-core using a single thread. Thread affinity is used to make sure the thread is running on a specific core.
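As a minimal sketch of steps 1 and 2 above: the helper names `pin_cpu_freq` and `run_on_cpu`, the `SYSFS_CPU_ROOT` override and the `./bench` binary name are all assumptions made for illustration. `taskset` sets the affinity of the whole process from outside, which is a stand-in for setting thread affinity inside the program.

```shell
#!/bin/sh
# SYSFS_CPU_ROOT is a hypothetical override so the helper can be tried on a scratch directory.
SYSFS_CPU_ROOT="${SYSFS_CPU_ROOT:-/sys/devices/system/cpu}"

# pin_cpu_freq KHZ CPU...: write the same value to scaling_min_freq and
# scaling_max_freq for each CPU, then print what scaling_cur_freq reports.
pin_cpu_freq() {
    khz="$1"
    shift
    for cpu in "$@"; do
        dir="$SYSFS_CPU_ROOT/cpu${cpu}/cpufreq"
        echo "$khz" > "$dir/scaling_min_freq" || return 1
        echo "$khz" > "$dir/scaling_max_freq" || return 1
        printf 'cpu%s: %s kHz\n' "$cpu" "$(cat "$dir/scaling_cur_freq")"
    done
}

# run_on_cpu CPU COMMAND...: run a command restricted to one logical CPU
# using taskset from util-linux.
run_on_cpu() {
    cpu="$1"
    shift
    taskset -c "$cpu" "$@"
}

# Example, as root on real hardware (./bench is a placeholder name):
#   pin_cpu_freq 600000 0 11
#   run_on_cpu 0 ./bench     # P-core
#   run_on_cpu 11 ./bench    # E-core
```

Reading `scaling_cur_freq` back after the writes is a cheap way to confirm that the requested value was accepted before starting a run.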

I'm noticing that at the same clock frequency, the overall throughput of integer addition/subtraction on a P-core is around 45% lower than on an E-core. This is quite a big difference for something that is designed as a "performance" core.

Does anyone know why there is such a big discrepancy in performance? Are there significant differences in pipeline architecture between P-cores and E-cores for integer operations, or is this due to E-cores having 50% more Level 1 instruction cache?

Thanks.

SadClouds · ‎08-13-2024

The Linux intel_pstate driver reports the following base clock frequencies: P-cores at 800 MHz and E-cores at 600 MHz. I assume the hardware is configured for "minimum assured power", aka "cTDP-Down", as the frequencies are lower than the spec for base TDP.

I don't believe this makes a difference when comparing the performance of P-cores and E-cores running at the same clock frequency, as described previously.

I'm getting the following performance metrics for 64-bit integer arithmetic operations

As you can see there is something wrong with P-core instruction level parallelism for addition, subtraction and multiplication:

1. Single thread throughput is nearly half of the throughput for E-cores.

2. Two threads throughput on the same P-core (thread 0 affinity only for CPU 0, thread 1 affinity only for CPU 1, i.e. the same P-core with two hardware threads) is still lower than the throughput of a single thread on E-core.

Can anyone at Intel please suggest why P-cores running at the same clock frequency as E-cores exhibit lower throughput for integer operations?

Thanks.

SadClouds · ‎08-13-2024

When I compare single thread 64-bit integer performance between Intel i5-1335U P-core and ARM Cortex-A72, both running at the same 600 MHz clock frequency, I get the following metrics in mega operations per second:

Op  | Intel i5-1335U P-core @ 600 MHz | ARM Cortex-A72 @ 600 MHz
----+---------------------------------+-------------------------
Add | 1237.21                         | 1062.77
Sub | 1244.23                         | 1058.20
Mul | 597.57                          | 190.39
Div | 59.76                           | 149.14

Both tests used the same Debian 12.2 OS and GCC 12.2.0 compiler.

I still don't fully understand why Intel 13th gen P-cores are so underwhelming. However, if my test methodology is correct, then the Raspberry Pi 4 ARM Cortex-A72 CPU from 2016 seems to be nearly on a par (aside from multiplication) with the latest 2023 Intel mobile CPU when both are forced to run at the same clock frequency. The roughly 2.5x lower throughput for division operations looks particularly bad for Intel.
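The ratios behind that comparison can be recomputed from the table with a few lines of awk. This is only a sketch: the `compare_table` name is made up for illustration, and the rows are copied from the 600 MHz table above (P-core first, Cortex-A72 second).

```shell
#!/bin/sh
# compare_table: read "Op P-core A72" rows on stdin and print the
# P-core/A72 throughput ratio for each operation.
compare_table() {
    awk '{ printf "%s %.2f\n", $1, $2 / $3 }'
}

# Rows copied from the 600 MHz table (MegaOps/sec).
compare_table <<'EOF'
Add 1237.21 1062.77
Sub 1244.23 1058.20
Mul 597.57 190.39
Div 59.76 149.14
EOF
```

A ratio above 1 means the P-core is faster for that operation, and a ratio below 1 means the Cortex-A72 is faster.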

Dan0987 · ‎09-20-2024

Hi SadClouds,

Sorry to bother. Is this test's source code available somewhere?

best,

SadClouds · ‎09-20-2024

Not at the moment. It is part of a larger set of closed source benchmarks used internally for evaluating system performance and scalability.

JeanetteC_Intel · ‎08-22-2024

Hello SadClouds,

Thank you for posting in Intel Communities.

Upon reading this information, it is best to coordinate this with our team for further investigation. I will post an update once it's available.

Best regards,

JeanetteC.

Intel® Customer Support Technician

JeanetteC_Intel · ‎09-19-2024

Hello SadClouds,

Thank you for reaching out to us with your concerns. Upon reviewing the details of your situation, we have determined that the processor is being operated outside of its standard operating frequencies.

Our processors are designed to function optimally within a set of defined specifications, and operating them beyond these limits may lead to unpredictable behavior.

For more information on the Base Power and Frequency Specifications Options for the i5-1335U processor visit the Datasheet, Page 98: https://www.intel.com/content/www/us/en/content-details/743844/13th-generation-intel-core-and-intel-core-14th-generation-processors-datasheet-volume-1-of-2.html

P-cores and E-cores are built differently. Their purpose and performance differ, even though they can run the same tasks.

I've also found some materials which can help in understanding the difference between them.

Efficient-core - Architecture Day 2021 | Intel Technology

https://youtu.be/agUwkj1qTCs?si=k9TWJJdXLdOFNfCU

https://en.wikipedia.org/wiki/Gracemont_(microarchitecture)

Meet Performance-Core - Architecture Day 2021 | Intel Technology

https://youtu.be/FNrOfDuP3rg?si=RZulHBUly2aY9l3b

https://en.wikipedia.org/wiki/Golden_Cove#Raptor_Cove

Compare the block diagrams on the wiki pages to see how they differ.

My advice for this test would be to change the frequency to 1.2 or 1.3 GHz. That way both cores will be working within spec or slightly below it (P-core), but without much of a performance impact (going by the datasheet).

Best regards,

JeanetteC.

Intel® Customer Support Technician

SadClouds · ‎09-19-2024

Hello, thank you for the updates. Prior to asking this question, I had already looked at the very same processor datasheet and also performed various tests with different core frequencies. I don't think that the suggested root cause of "the processor is being operated outside of its standard operating frequencies" is quite correct. The relative throughput of integer operations for P-cores vs. E-cores does not seem to change with higher core frequencies. However, as you suggested, I repeated the tests at a fixed 1.3 GHz frequency.

Set CPU 0 (P-core) and CPU 11 (E-core) to 1.3 GHz:

# for i in 0 11
do
echo "1300000" > /sys/devices/system/cpu/cpu${i}/cpufreq/scaling_max_freq
echo "1300000" > /sys/devices/system/cpu/cpu${i}/cpufreq/scaling_min_freq
done

View core operating frequencies:

# lscpu -e
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE    MAXMHZ   MINMHZ       MHZ
  0    0      0    0 0:0:0:0          yes 4600.0000 400.0000 1300.0000
  1    0      0    0 0:0:0:0          yes 4600.0000 400.0000  616.1430
  2    0      0    1 4:4:1:0          yes 4600.0000 400.0000  600.0000
  3    0      0    1 4:4:1:0          yes 4600.0000 400.0000  600.0000
  4    0      0    2 8:8:2:0          yes 3400.0000 400.0000  600.0000
  5    0      0    3 9:9:2:0          yes 3400.0000 400.0000  600.0000
  6    0      0    4 10:10:2:0        yes 3400.0000 400.0000  600.0000
  7    0      0    5 11:11:2:0        yes 3400.0000 400.0000  600.0000
  8    0      0    6 12:12:3:0        yes 3400.0000 400.0000  600.0000
  9    0      0    7 13:13:3:0        yes 3400.0000 400.0000  600.0000
 10    0      0    8 14:14:3:0        yes 3400.0000 400.0000  600.0000
 11    0      0    9 15:15:3:0        yes 3400.0000 400.0000 1300.0000
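The CORE column shows which logical CPUs share a physical core (CPU 0 and CPU 1 both sit on core 0). A small sketch that labels each logical CPU from that column, assuming that on this part only P-cores expose two hardware threads; the `classify_cores` name is made up for illustration, and the heuristic will not hold on CPUs where that assumption is false.

```shell
#!/bin/sh
# classify_cores: read "CPU CORE" rows (with one header line) on stdin and
# label each logical CPU. Heuristic assumption: a core that appears with more
# than one logical CPU is a P-core, a core with a single CPU is an E-core.
classify_cores() {
    awk 'NR > 1 { cpu[NR] = $1; core[NR] = $2; n[$2]++ }
         END {
             for (i = 2; i <= NR; i++)
                 printf "cpu%s core%s %s\n", cpu[i], core[i],
                        (n[core[i]] > 1 ? "P-core" : "E-core")
         }'
}

# Usage on a live system:
#   lscpu -e=CPU,CORE | classify_cores
```

This makes it easy to pick one logical CPU of each type for pinning, and to find the sibling hardware thread of a P-core for the two-thread test.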

Obtain test results for 64-bit integer operations for each core:

MegaOps/sec | Intel i5-1335U P-core @ 1300 MHz | Intel i5-1335U E-core @ 1300 MHz
------------+----------------------------------+---------------------------------
Add         | 2666.42                          | 4876.67
Sub         | 2708.99                          | 4882.55
Mul         | 1295.45                          | 2591.26
Div         | 129.58                           | 215.96

As you can see, at the same core frequency of 1.3 GHz, single-thread throughput for integer operations is significantly lower on the P-core than on the E-core. I think there may be a design issue with the P-core integer pipeline.

SadClouds · ‎09-20-2024

If we look at the above test results for P-core and E-core running at 1.3 GHz, then we can estimate IPC (Instructions Per Cycle) throughput that this particular Intel product delivers:

P-core:

Average MegaOps/sec for addition and subtraction operations = (2666 + 2708)/2 = 2687 MegaOps/sec
Average IPC = (2687 x 10^6 Ops/sec)/(1300 x 10^6 Cycles/sec) ≈ 2.07 Ops/Cycle
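The same estimate can be scripted for both core types. This is a sketch using the Add and Sub figures from the 1.3 GHz table; the `mean` and `ops_per_cycle` helper names are made up for illustration.

```shell
#!/bin/sh
# mean A B: arithmetic mean of two throughput figures (MegaOps/sec).
mean() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.3f\n", (a + b) / 2 }'
}

# ops_per_cycle MEGAOPS MHZ: MegaOps/sec divided by MHz gives operations per cycle.
ops_per_cycle() {
    awk -v m="$1" -v f="$2" 'BEGIN { printf "%.2f\n", m / f }'
}

# Add and Sub throughput at 1300 MHz, copied from the table above.
echo "P-core add/sub ops per cycle: $(ops_per_cycle "$(mean 2666.42 2708.99)" 1300)"
echo "E-core add/sub ops per cycle: $(ops_per_cycle "$(mean 4876.67 4882.55)" 1300)"
```

Because both the throughput and the clock are expressed in millions per second, the 10^6 factors cancel and the division can be done directly on the table values.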

The benchmark algorithm (originally written in C) is as follows:

_Pragma ("GCC unroll 1") /* Disable loop unrolling */
begin loop
  Load MemVal->RegVal;
  ArithOp Reg1,RegVal;
  ArithOp Reg2,RegVal;
  ArithOp Reg3,RegVal;
  ...
  ArithOp Reg16,RegVal;
end loop;

Each iteration performs a single load from memory into a temporary register and then executes the same arithmetic operation 16 times on the value in that register. There is no dependency between those 16 operations, so the hardware could potentially execute them all in parallel.

However, the best that this particular P-core can manage is around 2 arithmetic instructions per cycle. This does not look good, considering the Raptor Cove microarchitecture appears to have 5 separate integer ALUs. I think the problem may be that some of those P-core ALUs share their pipelines with BR (branch?) instructions, whereas the E-core has dedicated BR pipelines. Unfortunately, many "industry standard" benchmarks are nearly useless when it comes to evaluating specific CPU pipelines. It would be nice if the experts at Intel could provide more technical info on why the P-core IPC appears to be quite low compared to the E-core in this case. How do people at Intel evaluate similar use cases? What benchmarks do they use? Can benchmarks that test IPC throughput be downloaded from Intel? Is there open documentation on the test setup and methodology used?

SadClouds · ‎09-21-2024

I'm going to try and wrap this up, as I need to move on to other tasks. One of my tasks was to document a procedure for evaluating hybrid CPU architectures like Arm big.LITTLE, Intel P-cores + E-cores, etc. I used the Intel i5-1335U as it was the most recent design, readily available and with good driver support in Linux.

An anomaly came up during testing, where single thread arithmetic integer operations throughput was on average: P-core 2.0 and E-core 3.7 instructions per cycle. Both types of cores have at least 4 separate integer units, hence IPC of around 3.5-4.0 was expected for both (taking into account the overhead for taking a branch and incrementing loop counter).
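Put another way, those averages can be turned into cycles per loop iteration, given the 16 arithmetic operations per iteration in the pseudocode earlier in the thread. A small sketch of the arithmetic follows; the `cycles_per_iteration` name is made up for illustration, and the figure of 4 operations per cycle is simply the lower bound of "at least 4 integer units", not a measured value.

```shell
#!/bin/sh
# cycles_per_iteration OPS IPC: cycles needed for OPS arithmetic operations
# at a sustained rate of IPC operations per cycle.
cycles_per_iteration() {
    awk -v n="$1" -v ipc="$2" 'BEGIN { printf "%.2f\n", n / ipc }'
}

echo "P-core measured (2.0 ops/cycle): $(cycles_per_iteration 16 2.0) cycles"
echo "E-core measured (3.7 ops/cycle): $(cycles_per_iteration 16 3.7) cycles"
echo "4 ops/cycle (assumed bound):     $(cycles_per_iteration 16 4) cycles"
```

Comparing the measured cycles per iteration with the assumed bound shows how much of each iteration is not accounted for by the 16 arithmetic operations alone.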

The same benchmark was used on various Arm CPUs for which data sheets with the exact instruction IPC figures are publicly available. The benchmark results come very close to the official IPC figures for all - load, store, integer and floating point operations. Unfortunately at the moment, the benchmark used is not available for wider distribution, hence I cannot share it with people.

My only conclusion is that the anomaly could be due to the following:

1. The IPC is deliberately throttled for a single P-core hardware thread in order to reserve the bandwidth/power for a second hardware thread. With 2 hardware threads, the combined IPC improves significantly.

or

2. There is an issue with instruction scheduling for a single P-core hardware thread, so maybe a microcode update could fix it.

Whatever the reasons, the issue described in this thread has nothing to do with the core's operating frequency. I am also pretty confident it is not related to the benchmark used.

JeanetteC_Intel · ‎10-06-2024

Hello SadClouds,

Greetings from Intel Customer Support. I apologize for the delayed response.

Upon reading the information you shared, it is best to coordinate this with our team for further investigation. I will post an update once it's available.

Best regards,

JeanetteC.

Intel® Customer Support Technician

JeanetteC_Intel · ‎10-21-2024

Hello SadClouds,

Good day.

Kindly check your email inbox, as well as your junk or spam folders, for a message regarding the issue that you raised.

I hope to get your email reply as well.

Best regards,

JeanetteC.

Intel® Customer Support Technician