Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Almost off-topic: 6 cores = more work for us :(

kfsone
New Contributor I
1,190 Views
I don't know if the TBB team have access to Intel chip designers, but one thing has been evident to me since we started down the path of multi-core CPUs: more cores means more work (CPU instructions) for the code to execute in order to take advantage of those cores.
Before we head too much further down the path, one of the chipset manufacturers really needs to draw up a standard for core management in the form of hardware instructions.
For example, consider object->function(value). Such functionality already cries out for a "thisjmp" instruction, which tells the CPU that you are going to be performing operations on an object instance. The CPU can thus prefetch both the instructions AND the object data without any additional instructions from the calling code.
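To make that concrete, here is roughly what the caller has to spell out by hand today; a minimal sketch using GCC's __builtin_prefetch builtin (the Widget type and its members are made up for illustration):
[cpp]// What a "thisjmp" instruction would make implicit: today the caller
// must issue an explicit prefetch for the object's data ahead of the
// call. Sketch only; Widget and update() are hypothetical.
struct Widget {
    int state;
    void update(int value);
};

void call_update(Widget* object, int value) {
    __builtin_prefetch(object);   // hint: pull the object's cache line in
    // ...any independent work here can overlap with the prefetch...
    object->update(value);        // the member call itself
}[/cpp]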
In terms of thread management, TBB and QuickThread ought to be mascots for CPU instruction set enhancement. We have kernel/software mutexes, Linux has futexes. Where are the cutexes?
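For reference, this is what the futex status quo looks like from user code; a minimal sketch assuming Linux and a C++11 compiler, with error handling omitted:
[cpp]// A userspace lock that only enters the kernel (via the futex syscall)
// on contention. Minimal sketch; no spinning, no error handling.
#include <atomic>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static std::atomic<int> lockword(0);   // 0 = free, 1 = taken

void lock() {
    int expected = 0;
    while (!lockword.compare_exchange_weak(expected, 1)) {
        // Contended: sleep in the kernel until the word may have changed.
        syscall(SYS_futex, &lockword, FUTEX_WAIT, 1, (void*)0, (void*)0, 0);
        expected = 0;
    }
}

void unlock() {
    lockword.store(0);
    syscall(SYS_futex, &lockword, FUTEX_WAKE, 1, (void*)0, (void*)0, 0);
}[/cpp]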
Where is the "task" instruction which implements #pragma omp task in hardware, offloading the branch to another core if one is available, queueing it if not, and if the task queue is full, executing it immediately on the current core. Which, incidentally, could be proposed as "object:>function" in a future C++ standard.
Where are the conditional operation instructions so that trivial operations like "if ( i == 3 ) j++" can have the branch eliminated?
Look at the TBB documentation: "Don't do this unless the work load is at least 1000 CPU cycles." That's a pretty heady entry mark. My i7 CPU rarely uses most of its cores for precisely the reason that the cost of parallelizing is too high to warrant it for most activities, and that will remain true as long as parallelization has to be done by hand like that.
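For context, that rule of thumb shows up in code as the grain size passed to blocked_range. A minimal sketch (process() is a placeholder):
[cpp]// The "at least ~1000 cycles per chunk" rule as it appears in TBB
// source: the grain size (1000 here) is the threshold below which
// parallel_for stops splitting the range.
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

void process(float& x);

struct ApplyAll {
    float* a;
    void operator()(const tbb::blocked_range<size_t>& r) const {
        for (size_t i = r.begin(); i != r.end(); ++i)
            process(a[i]);
    }
};

void apply_all(float* a, size_t n) {
    ApplyAll body = { a };
    tbb::parallel_for(tbb::blocked_range<size_t>(0, n, 1000), body);
}[/cpp]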
The programmers really need hardware support to make the most of the increasing number of cores and facilities.
Just my $0.02 :)
0 Kudos
19 Replies
Dmitry_Vyukov
Valued Contributor I
1,190 Views
Yeah, and where is the socket/connect/send/recv family of instructions???
I would also like to see drawClearTypeText and parseXHTML5.0 instructions.
0 Kudos
Dmitry_Vyukov
Valued Contributor I
1,190 Views
Quoting kfsone

Where are the conditional operation instructions so that trivial operations like "if ( i == 3 ) j++" can have the branch eliminated?

CMOVcc

MASKMOVQ

and others
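At the source level, the branch-free form those instructions map onto can also be requested directly; a trivial sketch:
[cpp]// Branch-free form of "if (i == 3) j++;": the comparison yields 0 or 1,
// which compilers typically lower to SETcc/CMOVcc instead of a jump.
int bump(int i, int j) {
    return j + (i == 3);
}[/cpp]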

0 Kudos
Dmitry_Vyukov
Valued Contributor I
1,190 Views
Quoting kfsone

Look at the TBB documentation: "Don't do this unless the work load is at least 1000 CPU cycles." That's a pretty heady entry mark.


Consider switching to Cilk++. AFAICT, it has an order of magnitude lower scheduling overheads.
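A minimal sketch of what that looks like, assuming the Cilk++ cilk_spawn/cilk_sync keywords (the classic fib demo):
[cpp]// cilk_spawn forks a task onto the work-stealing scheduler cheaply;
// cilk_sync joins the spawned children before the function returns.
int fib(int n) {
    if (n < 2) return n;
    int x = cilk_spawn fib(n - 1);   // may execute on another worker
    int y = fib(n - 2);
    cilk_sync;                       // wait for the spawned child
    return x + y;
}[/cpp]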

0 Kudos
Dmitry_Vyukov
Valued Contributor I
1,191 Views
Quoting kfsone
Look at the TBB documentation: "Don't do this unless the work load is at least 1000 CPU cycles." That's a pretty heady entry mark. My i7 CPU rarely uses most of its cores for precisely the reason that the cost of parallelizing is too high to warrant it for most activities, and that will remain true as long as parallelization has to be done by hand like that.

So, most of your computer's activities take less than 300 ns... and... do I get you right that you want them to be faster? Hmm... I would not be able to notice such time periods.

0 Kudos
ARCH_R_Intel
Employee
1,191 Views
Indeed, neither gcc nor icc at "-O2" optimization generates a branch for "if(i==3) j++". Here is the source code that I used to check:
[cpp]int bar( int );

void foo( int i ) {
    int j = bar(0);
    if(i==3) ++j;
    bar(j);
}[/cpp]

Routine bar forces definition and use of j. I used icc 11.1 and gcc 4.3.2. "gcc -O2" generated (Linux assembly code):

[plain]        xorl    %edx, %edx
        cmpl    $3, %ebx
        sete    %dl
        leal    (%eax,%edx), %edx
[/plain]


"icc -march=pentium3 -S -O2" generated (Linux assemblycode):

[plain]        lea       1(%eax), %ecx         
        cmpl      $3, %edx             
        cmove     %ecx, %eax          
[/plain]


If your compiler is generating a branch for 32-bit code, check whether it needs to be passed options specifying that the hardware is Pentium Pro or later; that's why I added "-march=pentium3". 64-bit code already implies later hardware. If it is still generating a branch, consider using a different compiler.
0 Kudos
Dmitry_Vyukov
Valued Contributor I
1,191 Views
Seriously though, IMVHO the problem is with semantics rather than with representation. Communication in a distributed system (SMP/multicore) can't be cheap. Period. No matter how you express it, with a single instruction or several.
It's like with network communication. If you express send() with a single machine instruction, it gains you nothing; it's still slow network communication.

0 Kudos
kfsone
New Contributor I
1,191 Views
Firstly, let me add that apparently Technology Review agrees with me ("Multicore Processors Cause Software Headaches").
"Do I get you right that you want them to be faster?"
No :) If I wanted them faster, this post would be entitled "stop adding cores and increase CPU speeds".
I'm talking about activities within a particular piece of code, that wind up being executed in serial because the programmer/compiler can't justify the overhead of executing them in parallel.
The multi-pipeline Pentium series was a real groundbreaker and a golden era in programmer/compiler/hardware co-operation. Compilers could easily take advantage of it, and programmers only had to be minimally aware of it to let the compiler produce code that took advantage of a variety of configurations.
But what we have right now is not a truly parallel architecture, it primarily serves the purpose of multi-tasking. I say this because of the overheads of work parallelization at a purely instruction level.
I'm reminded of writing software interrupt handlers, where the very first thing you have to consider is the potential race condition: you may be interrupting a previous handler that was itself called at an unfortunate moment in the overall compute environment, i.e. the very instant it was torn up, the operating system had something to do, which in turn was interrupted by your interrupt...
Very few current multi-OS threading solutions dare to take context into account. We make crude assumptions about how to parallelize our work: hmm, there are 8 cores, so I'll run 8 workers at a time.
That may be fine in a trivial work unit; but in a larger, more complex application, are you shooting yourself in the foot?
It's only safe to assume that all of the CPU cores are available to you in a world where parallelism is the exception rather than the rule.
Which is why I firmly believe the next generation of CPUs need to introduce a solid foundation for parallelism through a hardware instruction set.
Imagine a CPU which has a "scheduler" core/pipeline. First of all, operating systems can take advantage of this core to perform scheduling work without stealing precious CPU cycles from running tasks... Best-case scenario: an 8-core CPU running a single active process, with all other threads/tasks/etc. waiting. The OS still needs to periodically check whether any work needs doing; hopefully, that check happens on a different core.
Even then, there is the chance that the active process might happen to see the scheduler running while it is making a parallelization decision and erroneously conclude that there are only 7 cores available to it.
In a more typical scenario, chances are that the OS will steal a few instruction cycles from an active process.
All the major desktop OSes have schedulers. So why not dedicate a core or something to them? Imagine a busy Linux server environment where the active set of working processes is fairly large. Imagine the benefits if the scheduler were running on a core dedicated to thread/core management, and thus able to determine which thread to schedule and prefetch the necessary thread context before interrupting what is currently executing on that core. That's a hell of a lot of CPU cycles saved in the context of execution on that given core...
The same scheduler core would also take the burden of tearing up new threads without stealing actual execution time from primary cores.
Why devote a core and potentially a special instruction set to something so trivial? Precisely because (a) it is trivial, (b) it has a fundamentally and radically different (if not actually contrary) mode of operation and function to the code it is managing, and so needs to participate in cache usage etc. very differently, and (c) to make it architecturally transparent and divorced from the computational resource pool -- it is not computation.
And as to my suggestion of parallelization instructions: We need to re-attain the "multiple-pipelines" golden era of co-operation. Right now we are using computational CPU instructions/cycles to examine the computing environment and then tell the CPU what to do. Unless the next generation of CPUs use the same cache strategies (cache lines, branch prediction, etc, etc) there's no guarantee that we aren't just setting ourselves up for a whole lot of refactoring when they come out. (I'm working on code at the moment that is absolutely magnificent on 2001 hardware, but you'd be really hard pressed to come up with a way to make it worse for a 2009+ CPU).
What we have right now is parallelization by proposal and committee. The programmer writes a series of instructions in such a way as he thinks will lead to a particular execution pattern. The compiler interprets those and creates a series of machine instructions to coerce the CPU into doing what it thinks the programmer wanted. The CPU core executes those instructions while the CPU's management facilities try to guess what the code is going to do next... And what we get is a nightmare of unpredictability. It's more art than science.
A parallelism-oriented instruction set is the only way to close that gap so that, for the most part, the compiler can tell the CPU exactly what it's expecting to do, and let the CPU worry about how it will distribute that across the silicon, and for the rest, the programmer will be able to clarify to the compiler (e.g. through intrinsics) what he is trying to achieve.
0 Kudos
kfsone
New Contributor I
1,191 Views
Bah, rhetorical question :) But yes, you caught me in a poor choice of over-simplification :). Rather, my point was that we're still spending a lot of time finding and resolving the "right" way to match the CPU's efforts to guess what we're trying to do. Note the term "branch prediction": "prediction" is the key word.
BTW, for me the current ICC under Linux produces:
[plain]        cmpl      $3, 8(%esp)                                   #5.18
        lea       1(%eax), %edx                                 #5.18
        cmove     %edx, %eax                                    #5.18
[/plain]
So yes, there is a cmove, but where is the 'clea'?
And a less trivial example reveals how easy it is to induce branches:
[cpp]extern int bar(int);

void foo(int i) {
    int j = bar(0);
    if ( i < 3 || i > 10 ) ++j;
    bar(j);
}
[/cpp]
which produces:
[plain]        movl      4(%esp), %edx                                 #5.2
        cmpl      $3, %edx                                      #5.11
        jl        ..B1.4        # Prob 50%                      #5.11
                                # LOE eax edx ebx ebp esi edi
..B1.3:                         # Preds ..B1.2
        cmpl      $10, %edx                                     #5.20
        jle       ..B1.5        # Prob 50%                      #5.20
                                # LOE eax ebx ebp esi edi
..B1.4:                         # Preds ..B1.3 ..B1.2
        incl      %eax                                          #5.27
                                # LOE eax ebx ebp esi edi[/plain]
Even with -O3. Although, that's probably one for the C++ compiler thread :)
We could probably stand to benefit from more high-level CPU instructions to match this kind of logic construct, such as a "boolean truth" register to assist logic short-circuiting. Consider:
[plain]ltcmpl $3, %edx # true if edx is < 3
gtcmpl $10, %edx # true if edx > 10
jfalse ..B1.3 # jump if either of those were false.
[/plain]
0 Kudos
Dmitry_Vyukov
Valued Contributor I
1,191 Views
> if(i<3||i>10)

You explicitly request branching there. The built-in C/C++ operator || implies short-circuit evaluation (yes, a compiler is not obliged to preserve that at the assembly level, but compilers usually do, to be more predictable).
Try if ((i < 3) | (i > 10)).
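I.e., the same foo as above with only the operator changed; a minimal sketch:
[cpp]// With bitwise OR both comparisons are always evaluated, so the
// compiler is free to combine them without a conditional jump.
extern int bar(int);

void foo(int i) {
    int j = bar(0);
    if ((i < 3) | (i > 10)) ++j;
    bar(j);
}[/cpp]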

0 Kudos
kfsone
New Contributor I
1,191 Views
You explicitly request branching
Only because you and I happen to know that the CPU doesn't have instructions to handle a case like this.
To the average programmer the "if" is one natural statement. And several languages have range comparisons, e.g. SQL's "BETWEEN". Several mathematical languages have constructs like
if ( 3 < i < 11 )
and there, the programmer would have to be aware that this translates into a pair of branch operations.
0 Kudos
kfsone
New Contributor I
1,191 Views
Also, GCC produces the following code, which avoids branches:
[plain]    movl    $0, (%esp)
    call    _Z3bari
    subl    $3, %ebx
    cmpl    $8, %ebx
    sbbl    $-1, %eax
    movl    %eax, 8(%ebp)[/plain]
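In source terms, GCC folded the two comparisons into a single unsigned range check and used sbb to add the carry into the register. A sketch of the equivalent source:
[cpp]// "i < 3 || i > 10" is exactly "(unsigned)(i - 3) > 7": values 3..10
// map to 0..7, and everything else (including negatives) wraps above 7.
int bump(int i, int j) {
    return j + ((unsigned)(i - 3) > 7u);
}[/cpp]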
0 Kudos
kfsone
New Contributor I
1,191 Views
Additionally:
[bash] if ((i < 3) | (i > 10))[/bash]

You've made exactly the kind of assumption about the compute environment that I'm emphasizing we shouldn't have to make, and written far less legible code that doesn't say what you mean; it says what you think the CPU should do to achieve your result. You weren't trying to compute the bitwise OR of the expressions (i < 3) and (i > 10); you wanted to know whether i was less than 3 or greater than 10.
0 Kudos
kfsone
New Contributor I
1,191 Views
Yeah, and where is the socket/connect/send/recv family of instructions???
That's disingenuous; I'm talking about instructions that directly relate to changes in CPU architecture and operation, and to interacting with what is becoming an increasingly complex system.
And that system is the CPU.
In-software parallelization, as we have it right now, is the CPU talking to itself about itself... at the expense of useful compute cycles.
I would also like to see drawClearTypeText and parseXHTML5.0 instructions.
Parallelism should be as close to free as possible, so it can become pervasive. When I say "most activities", I'm talking about subsystem operations like free, close, etc. Things we instinctively think of as inherently serial. Really, what's serial about either of those other than the fact that they probably don't amount to more than 1000 CPU instructions?
And sure, there's no need to optimize a program that writes "hello world" to a file; but I'm not thinking of "helloworld.cpp", I'm thinking of things like mail or web servers. Will they gain massively from having "close" and "free" execute on additional cores? No, but I'm not suggesting that if we get on-chip parallelization the world will be saved by parallel-free. I'm talking about attaining a level of parallelization where even free and close can afford to self-parallelize, and that's only possible by adding instructions for parallelization.
0 Kudos
Dmitry_Vyukov
Valued Contributor I
1,191 Views
So currently scheduling eats up some cycles, and you propose that it eat up a whole core? That looks kind of counter-intuitive to me. Why not give that core to the programmer to use for whatever *he* needs?
There are a lot of special activities in every program. That way we end up with a special core for scheduling, a special core for network processing, a special core for keyboard/mouse processing, etc. IMVHO that's just inflexible and inefficient. Just give me all the cores, and let me decide what I want to do and when; do not force me to donate a core for this and a core for that, since maybe I do not need that at all, or just not right now.

There are specialized cores for graphics processing, and some time ago people started realizing that this is not an efficient structure. If I do not need graphics processing right now, I am forced to leave a substantial part of the system idle, when I would rather be doing number-crunching on those cores. Moreover, not all applications need number-crunching, so it would be better still if those cores could do every kind of processing the main processor can do (the full x86 instruction set).

And I see some problems in particular with the proposed approach.
First, what scheduling algorithm are you going to hardwire into the hardware? You would have to hardwire some The Only True Scheduling Algorithm, and there seems to be no such algorithm. There are work-stealing, work-requesting, work-distribution, proactive work-balancing, and dozens of combinations.
Second, a single dedicated core means an inherently centralized algorithm, and centralized algorithms just do not scale. The fact that it's implemented in hardware does not help you here.
So I can manually implement a distributed scheduling algorithm that perfectly fits my application, and performance-wise it will beat the unsuitable, centralized hardware-based algorithm to death.

Plus, the current situation with scheduling does not seem to be as bad as you paint it. A lot of people out there are indeed able to achieve perfect linear scalability across many cores without any of the special support you are talking about. So what's the problem in the first place?
0 Kudos
Dmitry_Vyukov
Valued Contributor I
1,191 Views
Quoting kfsone
When I say "most activities", I'm talking about subsystem operations like free, close, etc. Things we instinctively think of as inherently serial. Really, what's serial about either of those other than the fact that they probably don't amount to more than 1000 CPU instructions?

For most programs, free/close is just the wrong level for parallelization - parallelization can be done at a higher level.

And then it's just not worthwhile providing very different hardware for the few remaining cases.

Note that even if some algorithm is not amenable to parallelization, that does not mean that there are no other parallelizable algorithms for the same problem.

0 Kudos
kfsone
New Contributor I
1,191 Views
You asked:
So currently scheduling eats up some cycles, and you propose that it eat up a whole core? That looks kind of counter-intuitive to me. Why not give that core to the programmer to use for whatever *he* needs?
I already answered that:
Why devote a core and potentially a special instruction set to something so trivial? Precisely because (a) it is trivial, (b) it has a fundamentally and radically different (if not actually contrary) mode of operation and function to the code it is managing, and so needs to participate in cache usage etc. very differently, and (c) to make it architecturally transparent and divorced from the computational resource pool -- it is not computation.
I'm not talking about a whole compute core, I'm talking about a trivial core/device that specializes in thread/task management and scheduling. It could handle things like preliminary interrupt handling, allowing the OS to process interrupt signals well before it has to disturb work executing on processing cores.
Then you say:
And I see some problems in particular with the proposed approach.
First, what scheduling algorithm you are going to hardwire into hardware? You have to hardwire some The Only True Scheduling Algorithm. And there are seems no such algorithm. There are work-stealing, work-requesting, work-distribution, proactive work-balancing, and dozens of combinations.
So, because a hardware solution might only work for a subset of problems, you suggest that no problem should have a hardware solution? If the hardware doesn't provide you with the right algorithm, then you'll just have to solve it in software, exactly as you do now. You don't lose anything, and you might get some performance benefits from a better API to the parallel computing environment through atomic instructions.
Sounds to me like a net gain for all, and a major gain for the most typical cases.
You also say:
Plus, the current situation with scheduling does not seem to be as bad as you paint it. A lot of people out there are indeed able to achieve perfect linear scalability across many cores without any of the special support you are talking about. So what's the problem in the first place?
When did I say anything about it being unachievable? The current system is fine for parallelizing large payloads of work. Given that you're the one that challenged me on the parallelization of sub-300ns tasks, are you actually paying any attention or are you just being argumentative?
0 Kudos
kfsone
New Contributor I
1,191 Views
As I suspected you would, you said:

For most programs free/close is just incorrect level for parallelization - parallelization can be done on higher level.
Ah, you are in fact just being argumentative, since the sentence you were responding to ends with:
... I'm not suggesting that [...] the world will be saved by parallel-free. I'm talking about attaining a level of parallelization where even free and close can afford to self-parallelize, and that's only possible by adding instructions for parallelization.
Or perhaps you are oblivious to the impact that the change from faster-and-faster CPUs to slightly-faster-but-more-of-them has had on the general programming industry.
0 Kudos
Dmitry_Vyukov
Valued Contributor I
1,191 Views
I do not understand why every trivial and different thing requires special hardware support. Memory allocation, object lifetime management, logging, scheduling - they are all trivial, all different, and none of them is computation.

> So, because a hardware solution might only work for a subset of problems, you suggest that no problem should have a hardware solution?

I suggest that hardware must provide a means to build solutions, rather than particular solutions.

> Given that you're the one that challenged me on the parallelization of sub-300ns tasks

Nope. I just meant that it's senseless.
Just send a whole request to a core for processing, rather than processing one request at a time on N cores and trying to parallelize every bit of the work.

> are you actually paying any attention or are you just being argumentative?

Both.
But I am a software guy, so everything I've said here is pure speculation.
I did see some academic papers several years ago on incorporating a distributed Cilk-style scheduler into hardware, but I do not think that it's the way to go. I mainly deal with server software, and for me it would be wasted transistors. With GPUs we are already on the way back: specialized -> general purpose.
If you want a constructive discussion, I would suggest asking the question over on comp.arch. There are a lot of *hardware* guys there who can provide you with answers from a hardware POV.
0 Kudos
Dmitry_Vyukov
Valued Contributor I
1,191 Views
Quoting kfsone
and you might get some performance benefits from a better API to the parallel computing environment thru atomic instructions.

That looks much more reasonable to me, because that's a tool to build solutions rather than a particular solution.

In particular, I would like to see support for the Advanced Synchronization Facility:

http://developer.amd.com/CPU/ASF/Pages/default.aspx

0 Kudos
Reply