I would also like to see drawClearTypeText and parseXHTML5.0 instructions.
CMOVcc
MASKMOVQ
and others
Consider switching to Cilk++. AFAICT, it has an order of magnitude lower scheduling overhead.
So, most of your computer's activities take less than 300ns... and... do I get you right that you want them to be faster? Humm... I would not be able to notice such time periods.
[cpp]
int bar( int );

void foo( int i ) {
    int j = bar(0);
    if(i==3) ++j;
    bar(j);
}
[/cpp]
Routine bar forces definition and use of j. I used icc 11.1 and gcc 4.3.2. "gcc -O2" generated (Linux assembly code):
[plain]
    xorl %edx, %edx
    cmpl $3, %ebx
    sete %dl
    leal (%eax,%edx), %edx
[/plain]
"icc -march=pentium3 -S -O2" generated (Linux assembly code):
[plain]
    lea 1(%eax), %ecx
    cmpl $3, %edx
    cmove %ecx, %eax
[/plain]
If your compiler is generating a branch for 32-bit code, check whether it needs to be passed options that specify that the hardware is Pentium Pro or later. That's why I added "-march=pentium3". 64-bit code already implies that the hardware is later. If it is still generating a branch, consider using a different compiler.
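For anyone who wants to compare the two shapes at source level, here is a minimal sketch. The names `bump_branchy` and `bump_branchless` are made up for illustration and are not from the code above. Whether a given compiler emits `sete`/`lea`, `cmov`, or a branch for either form depends on the compiler and its flags, so treat this as something to inspect with `-S` rather than a guarantee.

```cpp
// Hypothetical helpers for illustration only.
// Both return j, incremented by one when i == 3.

// Written with an if; an optimizing compiler is free to turn this
// into cmov, sete + add, or a conditional branch.
int bump_branchy(int i, int j) {
    if (i == 3) ++j;
    return j;
}

// Written as arithmetic on the 0/1 result of the comparison,
// which mirrors the sete/leal sequence gcc produced above.
int bump_branchless(int i, int j) {
    return j + static_cast<int>(i == 3);
}
```

Both functions compute the same result for every input; only the generated code may differ.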
It's like with network communication. If you express send() with a single machine instruction, it gains you nothing, it's still slow network communication.
[plain]
        cmpl      $3, 8(%esp)                                   #5.18
        lea       1(%eax), %edx                                 #5.18
        cmove     %edx, %eax                                    #5.18
[/plain]
[cpp]
extern int bar(int) ;

void foo(int i) {
    int j = bar(0) ;
    if ( i < 3 || i > 10 ) ++j ;
    bar(j) ;
}
[/cpp]
which produces:
[plain]
        movl      4(%esp), %edx                                 #5.2
        cmpl      $3, %edx                                      #5.11
        jl        ..B1.4        # Prob 50%                      #5.11
                                # LOE eax edx ebx ebp esi edi
..B1.3:                         # Preds ..B1.2
        cmpl      $10, %edx                                     #5.20
        jle       ..B1.5        # Prob 50%                      #5.20
                                # LOE eax ebx ebp esi edi
..B1.4:                         # Preds ..B1.3 ..B1.2
        incl      %eax                                          #5.27
                                # LOE eax ebx ebp esi edi
[/plain]
[plain]
        ltcmpl    $3, %edx      # true if edx is < 3
        gtcmpl    $10, %edx     # true if edx > 10
        jfalse    ..B1.3        # jump if either of those were false.
[/plain]
You are explicitly requesting branching. The built-in C/C++ operator || implies short-circuit evaluation (yes, a compiler is not obliged to implement it with branches at the assembly level, but compilers usually do so to be more predictable).
Try if ((i < 3) | (i > 10)).
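A minimal sketch of the difference, with hypothetical function names. With `||` the language requires that the right-hand comparison is evaluated only when the left-hand one is false; with `|` both comparisons are always evaluated and their 0/1 results are OR-ed. For side-effect-free comparisons like these the results are identical, so the choice only changes what you are asking the compiler for.

```cpp
// Hypothetical names, for illustration only.

// Short-circuit form: i > 10 is evaluated only if i < 3 is false.
bool outside_logical(int i) {
    return i < 3 || i > 10;
}

// Bitwise form: both comparisons are always evaluated,
// then their 0/1 results are combined without any ordering requirement.
bool outside_bitwise(int i) {
    return (i < 3) | (i > 10);
}
```

Compiling both with `-S` and comparing the output is the only reliable way to see what a particular compiler does with each.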
> You especially request branching
Only because you and I happen to know that the CPU doesn't have instructions to handle a case like this.
[plain]
    movl $0, (%esp)
    call _Z3bari
    subl $3, %ebx
    cmpl $8, %ebx
    sbbl $-1, %eax
    movl %eax, 8(%ebp)
[/plain]
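The sequence above appears to fold both comparisons into a single unsigned range check: `subl $3` computes i - 3, `cmpl $8` sets the carry flag when that value, treated as unsigned, is below 8 (that is, when i is in 3..10), and `sbbl $-1, %eax` computes eax + 1 - carry. A source-level sketch of the same idea, with hypothetical helper names:

```cpp
// Hypothetical helpers, for illustration only.

// True when 3 <= i <= 10. Subtracting 3 in unsigned arithmetic maps
// the range [3, 10] onto [0, 7]; every other value wraps to something >= 8.
bool in_3_to_10(int i) {
    return static_cast<unsigned>(i) - 3u < 8u;
}

// Adds one to j when i is outside [3, 10], with no branch in the source.
int bump_outside(int i, int j) {
    return j + static_cast<int>(!in_3_to_10(i));
}
```

The subtraction is done on the unsigned value so that very negative inputs wrap instead of overflowing a signed int.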
[cpp]if ((i < 3) | (i > 10))[/cpp]
Yeah, and where is socket/connect/send/recv family of instructions???
I would also like to see drawClearTypeText and parseXHTML5.0 instructions.
There are a lot of special activities in every program. That way we end up with a special core for scheduling, a special core for network processing, a special core for keyboard/mouse processing, etc. IMVHO that's just inflexible and inefficient. Just give me all the cores and let me decide what I want to do and when. Do not force me to donate a core for this and a core for that; maybe I do not need that at all, or just do not need it right now.
There are specialized cores for graphics processing. And some time ago people started realizing that this is not an efficient structure. If I do not need graphics processing right now, I am forced to let a substantial part of the system sit idle. It would be better if I could do number-crunching on those cores. Moreover, not all applications need number-crunching, so it would be better still if these cores could do all the kinds of processing the main processor can do (full x86 instruction set).
And I see some problems with the proposed approach in particular.
First, what scheduling algorithm are you going to hardwire into the hardware? You have to hardwire some One True Scheduling Algorithm, and there seems to be no such algorithm. There are work-stealing, work-requesting, work-distribution, proactive work-balancing, and dozens of combinations.
Second, a single dedicated core means an inherently centralized algorithm. And centralized algorithms just do not scale. The fact that it's implemented in hardware does not help you here.
So I can manually implement a distributed scheduling algorithm that perfectly fits my application, and performance-wise it will kill a centralized, ill-suited hardware-based algorithm.
Plus, the current situation with scheduling does not seem to be as bad as you draw it. A lot of people out there are indeed able to achieve perfect linear scalability on a lot of cores w/o any kind of special support you are talking about. So what's the problem in the first place?
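To make "work-stealing" concrete, here is a rough sketch of the policy only. It is a single-threaded simulation with no real concurrency, and `StealingSim` and all of its members are hypothetical names. A real scheduler would need per-worker threads and properly synchronized deques. The point is just the shape: each worker runs its own newest task, and an idle worker takes the oldest task from someone else, with no central dispatcher.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical sketch: simulates a work-stealing policy on one thread.
// Each simulated worker owns a deque of task ids.
struct StealingSim {
    std::vector<std::deque<int>> queues;  // one deque per simulated worker
    std::vector<int> executed;            // number of tasks run by each worker
    std::size_t steals = 0;               // how many tasks were stolen

    explicit StealingSim(std::size_t workers)
        : queues(workers), executed(workers, 0) {}

    void push(std::size_t worker, int task) { queues[worker].push_back(task); }

    // One scheduling step for `worker`: run local work if there is any,
    // otherwise steal from the first non-empty victim.
    // Returns false if no work was found anywhere.
    bool step(std::size_t worker) {
        if (!queues[worker].empty()) {
            queues[worker].pop_back();       // owner takes its newest task
            ++executed[worker];
            return true;
        }
        for (std::size_t k = 1; k < queues.size(); ++k) {
            std::size_t victim = (worker + k) % queues.size();
            if (!queues[victim].empty()) {
                queues[victim].pop_front();  // thief takes the oldest task
                ++executed[worker];
                ++steals;
                return true;
            }
        }
        return false;
    }

    // Round-robin over the workers until a full pass finds no work.
    void run() {
        bool progress = true;
        while (progress) {
            progress = false;
            for (std::size_t w = 0; w < queues.size(); ++w)
                progress = step(w) || progress;
        }
    }
};
```

For example, pushing 100 tasks onto worker 0 of a 4-worker simulation and calling `run()` spreads the work across all four workers purely through stealing.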
For most programs free/close is just the wrong level for parallelization; parallelization can be done at a higher level.
And then it's just not worthwhile to provide very different hardware for the few remaining cases.
Note that even if some algorithm is not amenable to parallelization, that does not mean that there are no other parallelizable algorithms for the same problem.
> So currently scheduling eats up some cycles, and you propose that it eat up a whole core? That looks kind of counter-intuitive to me. Why not give that core to the programmer to use for whatever *he* needs?
Why devote a core and potentially a special instruction set to something so trivial? Precisely because (a) it is trivial, (b) it has a fundamentally and radically different (if not actually contrary) mode of operation and function to the code it is managing and so needs to participate in cache-usage etc very differently, (c) to make it architecturally transparent and divorced from the computational resource pool -- it is not computation.
> For most programs free/close is just the wrong level for parallelization; parallelization can be done at a higher level.
... I'm not suggesting that [...] the world will be saved by parallel-free. I'm talking about attaining a level of parallelization where even free and close can afford to self-parallelize, and that's only possible by adding instructions for parallelization.
> So, because a hardware solution might only work for a subset of problems, you suggest that no problem should have a hardware solution?
I suggest that hardware must provide a means to build solutions, rather than particular solutions.
> Given that you're the one that challenged me on the parallelization of sub-300ns tasks
Nope. I just meant that it's senseless.
Just send a whole request to a core for processing, rather than process 1 request at a time on N cores trying to parallelize every bit of processing.
> are you actually paying any attention or are you just being argumentative?
Both.
But I am a software guy, so everything I've said here is pure speculation.
Several years ago I did see some academic papers on basically incorporating a distributed Cilk-style scheduler into hardware. But I do not think that it's the way to go. I mainly deal with server software, and for me it would be wasted transistors. With GPUs we are already on the way back: specialized -> general purpose.
If you want a constructive discussion, I would suggest asking the question over at comp.arch. There are a lot of *hardware* guys there who can provide you with answers from a hardware POV.
That looks much more reasonable to me, because that's a tool to build solutions, rather than a particular solution.
In particular I would like to see support for Advanced Synchronization Facility: