Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Overhead of running a spin loop on a Hyper-Threading CPU

windtracekimo
Beginner
Dear all:
I am currently working on a project to reduce Linux system call overhead on a Hyper-Threading enabled CPU.
*Environment: Intel Pentium 4 (2.8 GHz) with Hyper-Threading, Linux kernel 2.6.22-14 with SMP support (2 logical CPUs are visible to the system).
Our goal is to reduce the user/kernel switch time for each system call (for example, the cost of pushing and restoring registers on the stack).
To achieve this, our idea is to run a kernel thread on logical CPU1 and let the user process run on logical CPU2. Both spin on a flag placed in shared memory: when the user process wants to make a system call, it sets the flag to 1 to notify the kernel thread; after the kernel thread completes the system call handler, it sets the flag back to 0 to notify the user process. (Code for a simple test follows.)

/* PAUSE hint for spin-wait loops; the "memory" clobber forces the flag
   to be re-read from memory on every iteration. */
static inline void pause(void)
{
    asm volatile("pause" : : : "memory");
}

/* Read the time stamp counter. */
static __inline__ __u64 rdtsc(void)
{
    __u32 lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return (__u64)hi << 32 | lo;
}

[CPU1: kernel thread]
while (1) {
    if (flag == 1) {
        flag = 0;
    }
    pause();
}

[CPU2: user process]
__u64 time1, time2;
time1 = rdtsc();
for (i = 0; i < 10000000; i++) {
    flag = 1;
    while (flag == 1)
        pause();
}
time2 = rdtsc();


The shared memory is allocated by the kernel thread with kmalloc(), and SetPageReserved() is called on those pages. Afterwards, the user process uses mmap() to map the pages into its own address space. [ref]
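For reference, the allocation and mapping path looks roughly like the sketch below. This is a simplified illustration, not our exact module: alloc_shared_flag() and shared_flag_mmap() are placeholder names, error handling is mostly omitted, and the mmap handler uses remap_pfn_range(), which is one common way to hand a kernel page to user space.

#include <linux/slab.h>
#include <linux/mm.h>
#include <linux/fs.h>
#include <linux/errno.h>
#include <asm/io.h>

static void *shared_buf;                 /* one kmalloc'd page holding the flag */

static int alloc_shared_flag(void)
{
    shared_buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
    if (!shared_buf)
        return -ENOMEM;
    /* Mark the page reserved so the VM never swaps or frees it while
       user space has it mapped. */
    SetPageReserved(virt_to_page(shared_buf));
    return 0;
}

/* mmap handler (e.g. on a char device): map the kernel page into the
   calling process so both sides can spin on the same flag. */
static int shared_flag_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long pfn = virt_to_phys(shared_buf) >> PAGE_SHIFT;

    return remap_pfn_range(vma, vma->vm_start, pfn,
                           vma->vm_end - vma->vm_start, vma->vm_page_prot);
}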
The latency for each system call is measured as (time2 - time1) / 10000000, which comes out to about 500 CPU ticks (assuming one TSC count corresponds to one CPU tick).

Doubt 1: why does this take so long? The only work for both the user and kernel threads is to watch the flag in shared memory and flip it.
Doubt 2: is this a problem with using shared memory? To check, I ran the following test purely in user space (the kernel thread does nothing here):
time1 = rdtsc();
for (i = 0; i < 10000000; i++) {
    if (flag == 1) flag = 0;
    else flag = 1;
}
time2 = rdtsc();

The result shows that each iteration takes only around 5~10 CPU ticks here.
Can anyone give us some idea about this?
I would appreciate any help on this.
TimP
Honored Contributor III
Many serious investigations have been carried out on how to handle spin-lock loops on HyperThreads, so there are many people far more expert than you or I. In case you never heard, one of the reasons Windows 2000 was pushed out was the way its idle process and threaded wait loops hogged resources under HyperThreading.
The usual strategy for making a reasonable share of the resources available to the other thread is to limit the greedy spin to a short interval and then issue a PAUSE instruction. Several applications have environment variables to control the spin time prior to pausing, presumably in their own spin loops, so that the trade-off between fast response over a limited interval and resource sharing can be tuned.
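In rough outline, such a loop might look like the sketch below. This is illustrative only: wait_for_flag() and SPIN_BUDGET are made-up names, and the right budget depends entirely on the workload.

#define SPIN_BUDGET 1000   /* made-up tuning knob; applications expose it via environment variables */

static inline void wait_for_flag(volatile unsigned int *flag)
{
    int i;

    /* Greedy phase: spin hard for a short interval for minimum latency. */
    for (i = 0; i < SPIN_BUDGET; i++)
        if (*flag == 0)
            return;

    /* Polite phase: keep polling, but issue PAUSE on each iteration so the
       sibling logical processor gets a larger share of the core. */
    while (*flag != 0)
        asm volatile("pause" ::: "memory");
}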
Using doubt in the usual sense of the English language, my first doubt is whether the scheme you advocate is suitable for your goals. My second doubt is about the value of spending effort optimizing a new scheme for a CPU whose characteristics differ from anything produced in recent years.

windtracekimo
Beginner
Hi, morning.
Thanks for your quick response. The idea for this school project should apply not only to Hyper-Threading CPUs but also to multi-core, since the design simply dedicates one CPU to servicing the system calls issued on all the other CPUs, eliminating the user/kernel switch time. Besides knowing whether this approach is sound, I am curious why we cannot get better performance out of this CPU design.
[the following is my reasoning]
I have searched for articles related to Hyper-Threading and still can't figure out the reason. From the results, it does not look like a shared memory or memory access bottleneck, since each access completes within about 10 CPU ticks in user space (using the memory allocated by the kernel thread). The overhead only appears once both threads are running on the two logical CPUs.
According to the specification, the two threads share all the physical resources inside the CPU, including the cache, so ideally each could use about 50% of them (I know exactly 50% is not achievable). But the result is far worse than that: it seems each thread only gets about 2% (comparing the 10 CPU ticks above with the roughly 250 ticks per thread we measured).

Appreciate any reply on this.
Best regards

windtracekimo
Beginner
One more experiment result:
[purpose] to figure out the reason for this performance degradation
- possible reason 1: two threads running simultaneously severely degrade each other's performance
- possible reason 2: the cache shared by the 2 logical CPUs suffers when both threads modify it
[setup] let the kernel thread spin on CPU1 and only read the flag instead of writing it, while the user process reads and writes the flag for 10000000 iterations.
[user code]
time1 = rdtsc();
flag = 2;
for (i = 1; i < 10000000; i++) {
    if (flag == 2) flag = 3;
    else flag = 2;
    native_pause(); /* the pause() wrapper explained in the first post */
}
time2 = rdtsc();
[kernel code]
while (1) {
    if (flag == 1) { } /* reads the flag only; never writes it */
    native_pause();
}
[result]

The result of (time2 - time1)/10000000 is around 50 ticks when each thread is running on its own logical CPU. If the kernel thread is disabled, the user thread runs at about 20-25 ticks per iteration.
So simultaneous threads really do have an impact on each other's performance; however, this alone should not be as bad as my previous result.
I am wondering whether it is a side effect of both threads, on different logical CPUs, modifying the same memory. Is anyone familiar with the cache design of the Pentium 4 HT processor?

Best regards
windtracekimo
Beginner
Again, this is a self-response to my own post.
I recently surveyed some threads in this forum:
http://software.intel.com/en-us/forums//topic/47560
http://software.intel.com/en-us/forums/showthread.php?t=47538
http://software.intel.com/en-us/forums//topic/47549

It turns out there is a big penalty when 2 threads run on 2 logical CPUs while modifying a shared memory region. Maybe there are ways to reduce the penalty, but I was not able to extract them from these good articles.

Now my questions change to the following:
(1) Is there a good mechanism for running 2 separate threads (one in the kernel, one as a user process) such that the user thread can notify the kernel quickly and the kernel can respond quickly back to the user? (Or is the interrupt mechanism faster?) Right now, the only solution I have is to let both of them modify a shared memory region.
(2) Are there any constraints on using shared memory between 2 threads on different logical CPUs? The article on cache blocking seems to suggest keeping the memory or code size for each thread under half of the cache size (I am not sure whether that means L1 or L2).

Any help or thoughts are appreciated.
If this is not the right place to ask, please point me to somewhere I can consult.
Thanks.
TimP
Honored Contributor III

Future HyperThreading variants may not share (contend for) the same resources between logical processors as the one you are testing. Yes, the L1 cache (as well as L2) was among the resources shared between logical processors on past CPUs.

Maybe it's about time the banner at the top of the threading forum stopped advertising a CPU from 5 years ago.

windtracekimo
Beginner
Thanks,
I just found some of the papers that you mentioned in other related threads. (I will read them and see whether I can get some ideas for optimizing the cache; if you have a better idea for the shared memory, please share your advice.)

Actually, my original idea for the shared memory was simply to let the user and kernel notify each other by waiting on and changing the flag. (The kernel pushes the system call result into the shared memory, and the user thread reads it.)

I tried monitor/mwait as follows in the beginning.
[user]
for (i = 0; i < 10000000; i++) {
    flag = 1;
    while (flag) { pause(); }
    read_result_from_shared_memory();
}

[kernel]
while (1) {
    if (flag == 0) {
        __monitor((void *)&flag, 0, 0);
        __mwait(0, 0);
    }
    if (flag == 1) {
        write_result_into_shared_memory();
        flag = 0;
    }
}

But the weird thing is that the kernel thread still keeps its logical CPU 100% utilized, so the load on it is not reduced.
My Linux kernel boots with the following messages:
..monitor/mwait feature present.
..using mwait in idle threads.
It seems the default idle thread in the Linux kernel also uses mwait, and I do not see any high load during normal system operation. (I have checked the code: what it does is monitor the flags word in its own task structure. I guess that when the scheduler sets TIF_NEED_RESCHED or otherwise changes the state, the idle thread is woken up.)
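Roughly, the pattern I saw there looks like the sketch below (my simplified reading of the idle loop, not the exact kernel source):

while (!need_resched()) {
    /* Arm the monitor on this thread's flags word... */
    __monitor((void *)&current_thread_info()->flags, 0, 0);
    smp_mb();
    /* ...then re-check the wakeup condition after arming, and only
       go into mwait if there is still nothing to do. */
    if (!need_resched())
        __mwait(0, 0);
}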

What's the possible reason for my code to fail?


