Intel® ISA Extensions
Use hardware-based isolation and memory encryption to provide more code protection in your solutions.

CPU enhancement wish list

youjaes
Beginner

From time to time, I write inline asm code to speed up a number cruncher. I always try to avoid writing to memory, but I must write temporary arrays to memory as there isn't enough register space. So, my question is this: are there any plans to add an array register to Intel CPUs?

Something on the order of 64K bytes, with an acronym of RX or ERX depending on whether entries are referenced as 16 or 32 bits, would be good. The array register wouldn't need a slew of instructions, just Read Item, Write Item, Zero All, and possibly Read/Write Block of Items from/to memory. Fast temporary array storage on the CPU would shed lots of clock cycles from code that doesn't need to save a list of values between calls.

Has anyone else noticed this bottleneck? Has this been proposed before? Am I wasting my time by asking these questions?

James

6 Replies
jimdempseyatthecove
Honored Contributor III

The problem with this is incorporating it into the operating system for context switching. It would make more sense to increase the L1 cache size, increase the number of TLB entries, and/or create a "sticky" TLB. These could be added without requiring changes in the O/S.

For example, a periodic PREFETCHn of a given type could reassert stickiness of a TLB entry (or LOCK PREFETCHn could (re)assert stickiness). An Interrupt / PUSHAF / LGDT / ...(?) could remove stickiness (something already used in a typical context switch by the O/S).

For more "register" space, consider making better use of the SSE registers, currently 128 bits, soon to be 256 bits, later perhaps 512 bits.

Jim Dempsey
youjaes
Beginner
Thanks for the reply. I already use the SSE registers quite a bit for number crunching functions. There are things that are still inefficient with them. For example, the naive task of counting the number of each byte value in a block of data requires a temporary array (256 entries x 16 or 32 bits) in main memory.
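
For reference, the naive version looks something like this in C (a sketch; the names are illustrative). The 256-entry tally is the temporary array that has to live in memory, and every input byte costs a read-modify-write of one of its entries:

    #include <stddef.h>
    #include <stdint.h>

    /* Naive byte-value count: one memory read-modify-write per input byte. */
    void histogram_naive(const uint8_t *buf, size_t len, uint32_t tally[256])
    {
        for (int v = 0; v < 256; v++)
            tally[v] = 0;
        for (size_t i = 0; i < len; i++)
            tally[buf[i]]++;
    }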

Now, I'm by no means an expert on the intricacies of Intel microcode, but I think there are enough clever people around to overcome current design limitations.

Again, thanks for the reply,

James
capens__nicolas
New Contributor I
Quoting - youjaes

From time to time, I write inline asm code to speed up a number cruncher. I always try to avoid writing to memory, but I must write temporary arrays to memory as there isn't enough register space. So, my question is this: are there any plans to add an array register to Intel CPUs?

Hi James,

The L1 cache is already pretty much what you're asking for. It's incredibly fast, so don't be afraid of reading/writing memory if necessary. There's little chance that a dedicated array register would outperform it.

What's the actual algorithm you're implementing where you see this as being a bottleneck? Chances are that there are far better opportunities to speed it up than waiting for a hardware change...

Cheers,

Nicolas
jimdempseyatthecove
Honored Contributor III

How big is your block of data?

Since writes are relatively expensive and reads from the L1 cache are cheap:

Count occurrences of the same byte x 16 across the width of an xmm register for up to 255 iterations (4080 bytes), then use PSADBW (against 0's) to perform a horizontal add. Add the resulting 16 bits to a larger register, then continue with the next 4080 bytes. At the end of the block, test for residual bytes, then store into a 16/32/64-bit tally. Loop back for the remaining 255 byte values.

This reduces the task to all reads and 256 writes.
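
In intrinsics form, the counting loop might look like this (a sketch, not the definitive implementation; the function name and the assumption that the length is a multiple of 16 are mine):

    #include <emmintrin.h>  /* SSE2: PCMPEQB, PSUBB, PSADBW */
    #include <stddef.h>
    #include <stdint.h>

    /* Count occurrences of 'value' in buf[0..len); len is assumed to be
       a multiple of 16 here, with residual bytes handled separately. */
    uint64_t count_byte_sse2(const uint8_t *buf, size_t len, uint8_t value)
    {
        const __m128i needle = _mm_set1_epi8((char)value);
        uint64_t total = 0;
        size_t i = 0;
        while (i < len) {
            /* At most 255 x 16-byte chunks (4080 bytes) before the
               per-lane byte counters could overflow. */
            size_t chunk = len - i;
            if (chunk > 255 * 16)
                chunk = 255 * 16;
            __m128i acc = _mm_setzero_si128();
            for (size_t j = 0; j < chunk; j += 16) {
                __m128i v = _mm_loadu_si128((const __m128i *)(buf + i + j));
                /* PCMPEQB yields 0xFF (-1) per matching byte lane, so
                   subtracting the mask adds 1 to each lane's counter. */
                acc = _mm_sub_epi8(acc, _mm_cmpeq_epi8(v, needle));
            }
            /* PSADBW against zeros: horizontal add of the 16 byte
               counters into two 16-bit sums (one per 64-bit half). */
            __m128i sums = _mm_sad_epu8(acc, _mm_setzero_si128());
            total += (uint32_t)_mm_cvtsi128_si32(sums);
            total += (uint32_t)_mm_cvtsi128_si32(_mm_srli_si128(sums, 8));
            i += chunk;
        }
        return total;
    }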

Note, you could loop the 256 byte-value iterations over something just less than the L1 cache size of the input buffer. Change the store into the tally to an add to the tally (starting from 0). Now you have 256 writes + (256 RMW) * (buffer size / L1 size) + something for the odd bytes at the end of the buffer.
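
The cache-blocked variant might be sketched like this, reusing count_byte_sse2 from above (L1_BLOCK is an assumed figure, a bit under the L1 data cache size):

    #include <string.h>

    #define L1_BLOCK (24 * 1024)   /* assumption: somewhat under the L1D size */

    void histogram_blocked(const uint8_t *buf, size_t len, uint64_t tally[256])
    {
        memset(tally, 0, 256 * sizeof tally[0]);
        /* Run all 256 byte-value passes over one L1-sized chunk before
           moving on, so passes after the first read from L1, not memory.
           The store into the tally becomes an add to the tally. */
        for (size_t i = 0; i < len; i += L1_BLOCK) {
            size_t chunk = (len - i < L1_BLOCK) ? len - i : L1_BLOCK;
            chunk &= ~(size_t)15;   /* the SSE loop wants multiples of 16 */
            for (int v = 0; v < 256; v++)
                tally[v] += count_byte_sse2(buf + i, chunk, (uint8_t)v);
        }
        /* Residual bytes at the end of the buffer (len not a multiple
           of 16) would get a small scalar loop; omitted here. */
    }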

Also, this should work well on a multi-threaded system, even an HT system.

And when the byte set has known missing values, you can omit those tests (e.g., when the data contains only ASCII printable characters).

Jim Dempsey
youjaes
Beginner

Quoting - jimdempseyatthecove

How big is your block of data?

Jim Dempsey

Let's say we are looking at the Wikipedia data set (~1.8 terabytes), doing lossless compression of the article change histories.

Since your method, which I like, still has to make many passes over the same data block, do you agree that if you had a CPU-local data block that could be written to and read from, you could accelerate the process? It would reduce the memory writes to a final block transfer of 256 dwords and need only a single pass across the data, eliminating 255 of the passes.

Of course my suggestion won't help anyone for possibly years, but it looks like a good idea to pass it on to those "in the know" to evaluate and possibly implement if it passes muster.

Cheers,

James Youlton

jimdempseyatthecove
Honored Contributor III

From your description of the problem, I would suggest a specialized decompression function that runs as a parallel pipeline, using SSE where possible. Note, this may be a nested series of parallel pipelines.

The container format would likely not change (still describable in abstract terms in a single-threaded model).

The first stage would convert the dictionary to a format suitable for subsequent SSE manipulation.
The second stage would convert packed tokens into SSE-compatible tokens (copy and insert SSE bit masks).
The third stage would perform the 1st-level expansion: fetch indirect and store using SSE bit masks.
The fourth stage would perform the 2nd-level expansion.
...

The pipeline stages should be affinity-arranged to take advantage of HT-capable processors: HT siblings (messaging through the shared L1 cache), L2 siblings (messaging through the shared L2 cache), L3 siblings, socket siblings, ... (see the sketch below).

The first stage could likely be done once (held in a separate file or as an attribute of the original file).

This would be relatively easy to construct using QuickThread (at least the cache-level task scheduling).
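
As a rough sketch of the affinity arrangement (plain pthreads rather than QuickThread; treating logical CPUs 0 and 1 as HT siblings is an assumption about the machine's topology):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    static void pin_thread(pthread_t t, int cpu)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(t, sizeof set, &set);
    }

    static void *stage_tokenize(void *arg) { /* unpack tokens for SSE */ return arg; }
    static void *stage_expand(void *arg)   { /* 1st-level expansion   */ return arg; }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, stage_tokenize, NULL);
        pthread_create(&t2, NULL, stage_expand, NULL);
        /* Assumption: logical CPUs 0 and 1 are HT siblings on one core,
           so these two adjacent stages hand buffers to each other
           through a shared L1 cache. */
        pin_thread(t1, 0);
        pin_thread(t2, 1);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }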