
Time to revisit REP;MOVS

jimdempseyatthecove
Honored Contributor III

My current programming language is Fortran (IVF). Looking at the disassembly window, I notice an inordinate amount of overhead (code) generated to test for and take advantage of memory alignment, in particular for memory moves. Example:

real(8), pointer :: pFoo(:)
...
pFoo => pSomewhere%aFoo ! make pointer local

or

real(8), pointer:: arrayA(:), arrayB(:)
...
arrayA = arrayB ! copy contents of array B to A

or

do I=1,size(arrayB)
   arrayA(I) = arrayB(I)
end do

The above loop is unrolled by the compiler.

It would seem to me that all the hoop jumping for memory alignment, as well as optimal loop unrolling, could be performed by the processor executing a

REP; MOVS

The processor could even be coded to handle unaligned moves optimally.

I realize this may require saving and restoring ESI and EDI, as well as potentially the DF (depending on the rules of engagement). But if you look at the code generated by the compiler to optimize moves, you will see it is (code-wise) a better deal to use REP; MOVS.

I do realize that REP; MOVS is currently much slower than code that tests for and fixes up alignment and then performs a faster internal mov loop, perhaps using SSE3 instructions. But the "slowness" is only due to a lack of attention, by the processor engineers, to the REP; MOVS technique of moving data. Internally, even for byte moves, the processor could move multiple bytes (16) per iteration as well as perform alignment via temporary internal storage.
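For comparison, here is a minimal sketch (C with GCC-style inline assembly; the function name and the unconditional dword-granularity path are illustrative assumptions, not actual IVF code generation) of what the array copy could collapse to:

#include <stddef.h>

/* Hypothetical sketch (GCC-style inline asm, IA-32): copy n real(8)
   elements with one REP MOVSD, the way a compiler could lower
   arrayA = arrayB.  "S", "D", "c" pin the pointers and count to
   ESI, EDI, ECX; CLD clears DF so the move runs forward. */
static void copy_r8_rep_movsd(double *dst, const double *src, size_t n)
{
    size_t dwords = n * 2;          /* 8 bytes per element = 2 dwords */
    __asm__ __volatile__(
        "cld\n\t"
        "rep movsd"
        : "+S"(src), "+D"(dst), "+c"(dwords)
        :
        : "memory", "cc");
}

The entire move becomes two instructions plus setup, regardless of length.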

Because Intel produces both processors and compilers, incorporating this into both products would give you a temporary edge over AMD (as it would take them time to revamp their processors).

Jim Dempsey

grg99
Beginner
The MOVS instructions aren't quite as super as they may seem. On most of the newer CPUs, a simple loop moves memory faster than MOVS. Plus, memory hasn't kept up with CPU speeds, so the bottleneck is often the memory, not the CPU.
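For illustration, the kind of simple loop meant here might look like this C sketch (not actual library code; the 4x unroll factor is an arbitrary choice):

#include <stddef.h>

/* Sketch of the "simple loop" style: unrolled 4x, so a compiler maps each
   assignment to an independent load/store pair the pipelines can overlap. */
static void copy_r8_loop(double *dst, const double *src, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {        /* main unrolled body */
        dst[i]     = src[i];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }
    for (; i < n; ++i)                  /* remainder */
        dst[i] = src[i];
}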

Steven_L_Intel1
Employee
As grg99 says, the simple loop, especially unrolled, is faster. Believe me, this is the sort of thing we pay VERY close attention to.
dbruceg
Beginner

Once upon a time, back when it was Digital Fortran, I dumped the assembler code for a move and noticed that it started with a little adjustment and ended with a little wrap-up, both set up so that the bulk of the move was done by busloads.

I threw all of my assembler routines away.

Bruce

jimdempseyatthecove
Honored Contributor III

Apparently you guys did not read the part about:

But, the "slowness" is only due to the lack of attention, by the processor engineers.

Below my response are examples of the hoop jumping to perform the moves (IVF 9.1).

REP;MOVS, on the other hand, could be optimized internally to the processor such that, even for unaligned source and target addresses, the data is pipelined in and out so that, for large enough moves, the reads and writes iterate at 16 bytes (or whatever the width of the memory is).

A lot of good effort was spent by the processor design team to optimize the pipelines (out-of-order reads and writes, branch prediction, register renaming, other tricks of the trade) and to optimize FPU performance. By comparison, it would be a relatively trivial task to optimize REP;MOVS to optimally move the data, even unaligned data, from point A to point B.

Unless an interrupt or fault occurs, REP;MOVS could temporarily buffer the unaligned overlap so that each read and write occurs at memory bus width (except potentially the first and last). If an interrupt or fault occurs during the MOVS, ESI, EDI, ECX and EIP are set appropriately for later resumption.

Note,

All too often the source and target data are not optimally aligned for memory-bus-width moves. The current method of having the generated IA32/IA64 code determine alignment and choose an appropriate path cannot optimally handle misaligned data. A rewrite of REP;MOVS could.

Additionally, the code bloat from having the generated IA32/IA64 code determine alignment and choose an appropriate path severely taxes the instruction cache.

Finally, should a next-generation processor increase the memory width, existing code using REP;MOVS would automatically run faster, while existing code without REP;MOVS would potentially be slower.

As it stands now:

Hardware design team: Nobody uses the stinkn REP;MOVS.

Software design team: The REP;MOVS is too stinkn slow.

Meanwhile, nobody at Intel has examined the potential performance impact of revisiting REP;MOVS. I implore someone at Intel to do a study of the impact on actual applications, and not just on the handful of industry benchmarks.

Example 1:

real(8), pointer:: arrayA(:), arrayB(:)
...

do I=1,size(arrayB)
   arrayA(I) = arrayB(I)
end do

The code generated is:

004013D8 A1 B8 CF 46 00 mov eax,dword ptr ds:[0046CFB8h]

004013DD 89 45 F4 mov dword ptr [ebp-0Ch],eax

004013E0 85 C0 test eax,eax

004013E2 0F 8E C7 01 00 00 jle 004015AF

004013E8 83 F8 09 cmp eax,9

004013EB 0F 82 5A 02 00 00 jb 0040164B

arrayA(I) = arrayB(I)

004013F1 A1 BC CF 46 00 mov eax,dword ptr ds:[0046CFBCh]

004013F6 89 45 E4 mov dword ptr [ebp-1Ch],eax

do I=1,size(arrayA)

004013F9 83 F8 08 cmp eax,8

004013FC 0F 85 1F 02 00 00 jne 00401621

arrayA(I) = arrayB(I)

00401402 A1 E4 CF 46 00 mov eax,dword ptr ds:[0046CFE4h]

00401407 89 45 D4 mov dword ptr [ebp-2Ch],eax

do I=1,size(arrayA)

0040140A 83 F8 08 cmp eax,8

0040140D 0F 85 ED 01 00 00 jne 00401600

00401413 8B 55 F4 mov edx,dword ptr [ebp-0Ch]

arrayA(I) = arrayB(I)

00401416 8B 35 C0 CF 46 00 mov esi,dword ptr ds:[46CFC0h]

0040141C 8B 0D E8 CF 46 00 mov ecx,dword ptr ds:[46CFE8h]

do I=1,size(arrayA)

00401422 89 55 C8 mov dword ptr [ebp-38h],edx

arrayA(I) = arrayB(I)

00401425 8B 15 A0 CF 46 00 mov edx,dword ptr ds:[46CFA0h]

0040142B 89 4D D0 mov dword ptr [ebp-30h],ecx

0040142E 8B FE mov edi,esi

00401430 C1 E7 03 shl edi,3

00401433 F7 DF neg edi

00401435 03 FA add edi,edx

00401437 89 7D C0 mov dword ptr [ebp-40h],edi

do I=1,size(arrayA)

0040143A 8D 47 08 lea eax,[edi+8]

arrayA(I) = arrayB(I)

0040143D 8B 3D C8 CF 46 00 mov edi,dword ptr ds:[46CFC8h]

do I=1,size(arrayA)

00401443 83 E0 0F and eax,0Fh

00401446 89 45 CC mov dword ptr [ebp-34h],eax

00401449 C1 E1 03 shl ecx,3

0040144C F7 D9 neg ecx

0040144E 03 CF add ecx,edi

00401450 89 4D F8 mov dword ptr [ebp-8],ecx

00401453 8D 49 08 lea ecx,[ecx+8]

00401456 89 4D C4 mov dword ptr [ebp-3Ch],ecx

00401459 85 C0 test eax,eax

0040145B 74 2E je 0040148B

0040145D A8 07 test al,7

0040145F 0F 85 94 01 00 00 jne 004015F9

00401465 8B 45 F8 mov eax,dword ptr [ebp-8]

arrayA(I) = arrayB(I)

00401468 F2 0F 10 40 08 movsd xmm0,mmword ptr [eax+8]

do I=1,size(arrayA)

0040146D 8D 48 10 lea ecx,[eax+10h]

arrayA(I) = arrayB(I)

00401470 8B 45 C0 mov eax,dword ptr [ebp-40h]

00401473 F2 0F 11 40 08 movsd mmword ptr [eax+8],xmm0

do I=1,size(arrayA)

00401478 89 4D C4 mov dword ptr [ebp-3Ch],ecx

0040147B 8B 4D F4 mov ecx,dword ptr [ebp-0Ch]

0040147E 8D 49 FF lea ecx,[ecx-1]

00401481 89 4D C8 mov dword ptr [ebp-38h],ecx

00401484 B9 01 00 00 00 mov ecx,1

00401489 EB 02 jmp 0040148D

0040148B 33 C9 xor ecx,ecx

0040148D 8B 45 C8 mov eax,dword ptr [ebp-38h]

00401490 83 E0 07 and eax,7

00401493 F7 D8 neg eax

00401495 03 45 F4 add eax,dword ptr [ebp-0Ch]

00401498 89 45 C8 mov dword ptr [ebp-38h],eax

0040149B 8B 45 C4 mov eax,dword ptr [ebp-3Ch]

0040149E A8 0F test al,0Fh

004014A0 75 4E jne 004014F0

004014A2 8B 45 F8 mov eax,dword ptr [ebp-8]

004014A5 89 7D DC mov dword ptr [ebp-24h],edi

004014A8 8B 7D C8 mov edi,dword ptr [ebp-38h]

004014AB 89 75 F0 mov dword ptr [ebp-10h],esi

004014AE 8B 75 C0 mov esi,dword ptr [ebp-40h]

arrayA(I) = arrayB(I)

004014B1 66 0F 28 44 C8 08 movapd xmm0,xmmword ptr [eax+ecx*8+8]

004014B7 66 0F 29 44 CE 08 movapd xmmword ptr [esi+ecx*8+8],xmm0

004014BD 66 0F 28 4C C8 18 movapd xmm1,xmmword ptr [eax+ecx*8+18h]

004014C3 66 0F 29 4C CE 18 movapd xmmword ptr [esi+ecx*8+18h],xmm1

004014C9 66 0F 28 54 C8 28 movapd xmm2,xmmword ptr [eax+ecx*8+28h]

004014CF 66 0F 29 54 CE 28 movapd xmmword ptr [esi+ecx*8+28h],xmm2

004014D5 66 0F 28 5C C8 38 movapd xmm3,xmmword ptr [eax+ecx*8+38h]

004014DB 66 0F 29 5C CE 38 movapd xmmword ptr [esi+ecx*8+38h],xmm3

do I=1,size(arrayA)

004014E1 83 C1 08 add ecx,8

end do

004014E4 3B CF cmp ecx,edi

004014E6 72 C9 jb 004014B1

004014E8 8B 7D DC mov edi,dword ptr [ebp-24h]

004014EB 8B 75 F0 mov esi,dword ptr [ebp-10h]

004014EE EB 64 jmp 00401554

004014F0 8B 45 F8 mov eax,dword ptr [ebp-8]

004014F3 89 7D DC mov dword ptr [ebp-24h],edi

004014F6 8B 7D C8 mov edi,dword ptr [ebp-38h]

004014F9 89 75 F0 mov dword ptr [ebp-10h],esi

004014FC 8B 75 C0 mov esi,dword ptr [ebp-40h]

arrayA(I) = arrayB(I)

004014FF F2 0F 10 44 C8 08 movsd xmm0,mmword ptr [eax+ecx*8+8]

00401505 66 0F 16 44 C8 10 movhpd xmm0,qword ptr [eax+ecx*8+10h]

0040150B 66 0F 29 44 CE 08 movapd xmmword ptr [esi+ecx*8+8],xmm0

00401511 F2 0F 10 4C C8 18 movsd xmm1,mmword ptr [eax+ecx*8+18h]

00401517 66 0F 16 4C C8 20 movhpd xmm1,qword ptr [eax+ecx*8+20h]

0040151D 66 0F 29 4C CE 18 movapd xmmword ptr [esi+ecx*8+18h],xmm1

00401523 F2 0F 10 54 C8 28 movsd xmm2,mmword ptr [eax+ecx*8+28h]

00401529 66 0F 16 54 C8 30 movhpd xmm2,qword ptr [eax+ecx*8+30h]

0040152F 66 0F 29 54 CE 28 movapd xmmword ptr [esi+ecx*8+28h],xmm2

00401535 F2 0F 10 5C C8 38 movsd xmm3,mmword ptr [eax+ecx*8+38h]

0040153B 66 0F 16 5C C8 40 movhpd xmm3,qword ptr [eax+ecx*8+40h]

00401541 66 0F 29 5C CE 38 movapd xmmword ptr [esi+ecx*8+38h],xmm3

do I=1,size(arrayA)

00401547 83 C1 08 add ecx,8

0040154A 3B CF cmp ecx,edi

0040154C 72 B1 jb 004014FF

0040154E 8B 7D DC mov edi,dword ptr [ebp-24h]

00401551 8B 75 F0 mov esi,dword ptr [ebp-10h]

00401554 8B 45 F4 mov eax,dword ptr [ebp-0Ch]

00401557 3B C8 cmp ecx,eax

00401559 73 54 jae 004015AF

0040155B 0F AF 75 E4 imul esi,dword ptr [ebp-1Ch]

0040155F 8B 45 D4 mov eax,dword ptr [ebp-2Ch]

00401562 89 4D CC mov dword ptr [ebp-34h],ecx

00401565 8B 4D D0 mov ecx,dword ptr [ebp-30h]

00401568 0F AF C8 imul ecx,eax

0040156B 2B D6 sub edx,esi

0040156D 2B F9 sub edi,ecx

0040156F 8B 75 E4 mov esi,dword ptr [ebp-1Ch]

00401572 8B 4D CC mov ecx,dword ptr [ebp-34h]

00401575 03 F8 add edi,eax

00401577 03 D6 add edx,esi

00401579 89 55 D8 mov dword ptr [ebp-28h],edx

0040157C 8B D1 mov edx,ecx

0040157E 0F AF D0 imul edx,eax

00401581 8B C1 mov eax,ecx

00401583 0F AF C6 imul eax,esi

00401586 89 55 C8 mov dword ptr [ebp-38h],edx

00401589 8B 55 D8 mov edx,dword ptr [ebp-28h]

0040158C 89 55 D8 mov dword ptr [ebp-28h],edx

0040158F 8B 55 C8 mov edx,dword ptr [ebp-38h]

arrayA(I) = arrayB(I)

00401592 8B 75 D8 mov esi,dword ptr [ebp-28h]

00401595 F2 0F 10 04 3A movsd xmm0,mmword ptr [edx+edi]

do I=1,size(arrayA)

0040159A 03 55 D4 add edx,dword ptr [ebp-2Ch]

arrayA(I) = arrayB(I)

0040159D F2 0F 11 04 30 movsd mmword ptr [eax+esi],xmm0

do I=1,size(arrayA)

004015A2 03 45 E4 add eax,dword ptr [ebp-1Ch]

004015A5 8B 75 F4 mov esi,dword ptr [ebp-0Ch]

004015A8 83 C1 01 add ecx,1

004015AB 3B CE cmp ecx,esi

004015AD 72 E3 jb 00401592

discontiguous code followed by remainder of do loop

00401600 8B 35 C0 CF 46 00 mov esi,dword ptr ds:[46CFC0h]

00401606 A1 E8 CF 46 00 mov eax,dword ptr ds:[0046CFE8h]

0040160B 8B 3D C8 CF 46 00 mov edi,dword ptr ds:[46CFC8h]

00401611 8B 15 A0 CF 46 00 mov edx,dword ptr ds:[46CFA0h]

00401617 33 C9 xor ecx,ecx

00401619 89 45 D0 mov dword ptr [ebp-30h],eax

0040161C E9 3A FF FF FF jmp 0040155B

00401621 8B 35 C0 CF 46 00 mov esi,dword ptr ds:[46CFC0h]

00401627 A1 E8 CF 46 00 mov eax,dword ptr ds:[0046CFE8h]

0040162C 8B 15 E4 CF 46 00 mov edx,dword ptr ds:[46CFE4h]

00401632 8B 3D C8 CF 46 00 mov edi,dword ptr ds:[46CFC8h]

00401638 33 C9 xor ecx,ecx

0040163A 89 45 D0 mov dword ptr [ebp-30h],eax

0040163D 89 55 D4 mov dword ptr [ebp-2Ch],edx

00401640 8B 15 A0 CF 46 00 mov edx,dword ptr ds:[46CFA0h]

Steven_L_Intel1
Employee
I suggest filing a request with Intel Premier Support. That way it can get properly directed.
dbruceg
Beginner

Sorry, Jim. I missed your point at first. I imagine the problem originated with the first step up from the 8080/Z80 into segmented memory. It seems to me that one should be able to latch the whole bus to an arbitrary byte address, thereby eliminating the alignment problem, but that statement may just underscore my ignorance about hardware issues.

More to the point, I'm quite probably suffering from the "REP;MOVS" problem in a stream I/O buffering system I just cobbled up. The idea of the buffering system is to be able to handle both very short and very long I/O operations. If I write, say, one I4 variable followed by a big R8 array, and the length of the second array exceeds the buffer length, I flush the integer in the buffer and then write the big array directly. I probably take a big hit here if IVF moves the data to a C stream buffer, where it will be misaligned, but nobody seems to know much about how that works.

When I read the stream, I probably take two hits - one moving the misaligned data from the C stream buffer to my own buffer, and another reading from my buffer.

I could eliminate both of these problems by padding in the write buffer and skipping in the read buffer. Question - might I see a significant speed increase if I did so?
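For reference, the padding itself would just be a rounding step; a sketch in C (the 8-byte alignment target for R8 data is an assumption, and the function name is made up):

#include <stddef.h>

/* Round a buffer offset up to the next boundary so the following
   REAL(8) record lands aligned; works for any power-of-two alignment. */
static size_t pad_to_align(size_t offset, size_t align /* power of two */)
{
    return (offset + align - 1) & ~(align - 1);
}

/* e.g. after writing one 4-byte integer at offset 0:
   pad_to_align(4, 8) == 8, so the R8 array starts 8-byte aligned. */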

Bruce

grg99
Beginner
Perhaps you don't understand why speeding up the MOVxx instructions is unnecessary and very difficult.

In the original 8088 and 8086 and 80286, and maybe even the 80386, all instructions were microcoded. That means that for each instruction, a little microprogram was run. Microprogramming makes CPU design much easier, as the chip designer doesn't have to design hard-wired circuitry for each and every operation.

But the downside is that every operation, no matter how simple, takes several micro-clock cycles.

About the time of the 486, the designers realized that simple instructions, like move byte or move word, could be implemented in hardware and be many times faster than running the microcode. But there was only a limited amount of chip space for this, so only some of the simpler instructions got the fast treatment -- the rest remained in microcode. That's when MOVxx fell behind -- it got left in microcode.

Then with the Pentium things got worse -- moving memory not only happened in hardware, the Pentium had TWO pipes, both of which could do MOVs. So now you had TWO FAST memory movers, compared to just ONE microcoded MOVxx.
And oh, the new floating-point load and store instructions are also faster than MOVxx. And starting up a microcoded MOVxx instruction stops the other pipe, so it's a double loss -- MOVxx runs slower than the simpler instructions, and it prevents any other instructions from getting into the other pipe. Double-ungood.

By this time most programmers and run-time libraries had realized this and changed their block-move routines to use the faster instructions. Now there's even less incentive to speed up MOVxx, as very few programs use it, and it has a huge handicap to overcome.

Alongside this, CPU speeds have been climbing faster than memory speeds, so now it usually doesn't matter; in many cases MOVxx is faster than the memory bus! Not to mention that AMD CPUs have THREE integer units, so you can have even more memory moving going on in each clock cycle.

Write yourself a little test program that uses MOVSB, MOVSW, MOVSD, and regular mov instructions (overlapped and unrolled a bit), and also try an FP move. I think you'll find a simple mov instruction loop easily saturates the memory bus.
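A sketch of such a harness in C (memcpy stands in for the tuned library move, the REP MOVSB variant uses GCC-style inline assembly, and clock() granularity is a simplification):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define N    (16 * 1024 * 1024)   /* bytes per copy  */
#define REPS 100                  /* copies per test */

static void copy_memcpy(char *d, const char *s, size_t n) { memcpy(d, s, n); }

static void copy_loop(char *d, const char *s, size_t n)
{
    for (size_t i = 0; i < n; ++i)   /* byte-at-a-time "simple loop" */
        d[i] = s[i];
}

static void copy_movsb(char *d, const char *s, size_t n)
{
    /* one REP MOVSB; ESI/EDI/ECX pinned via "S", "D", "c" (IA-32) */
    __asm__ __volatile__("cld\n\trep movsb"
                         : "+S"(s), "+D"(d), "+c"(n) :: "memory", "cc");
}

static double time_copy(void (*copy)(char *, const char *, size_t),
                        char *dst, const char *src)
{
    clock_t t0 = clock();
    for (int r = 0; r < REPS; ++r)
        copy(dst, src, N);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void)
{
    char *src = malloc(N), *dst = malloc(N);
    if (!src || !dst) return 1;
    memset(src, 1, N);
    printf("memcpy: %.3f s\n", time_copy(copy_memcpy, dst, src));
    printf("loop:   %.3f s\n", time_copy(copy_loop,   dst, src));
    printf("movsb:  %.3f s\n", time_copy(copy_movsb,  dst, src));
    free(src); free(dst);
    return 0;
}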


dbruceg
Beginner

Thanks. That was very informative, and you're quite right; I was blissfully unaware of all that. But I'm not trying to speed up the MOV instructions. My question is about half a dozen levels above the microcode.

One presumably pays a penalty for moving misaligned data. How big is the penalty in Intel Fortran optimized for speed? I can obviously cobble up a bunch of tests to attempt to determine that myself, but since I don't fully understand the matter, I could easily miss something important. I thought you guys might have a handle on it.

Bruce

Steven_L_Intel1
Employee
Have you tried running your program under VTune? It can read the performance counters and see how many times misaligned data was found. As with any sort of performance question, the issue is what portion of the total application time is spent in this move.
dbruceg
Beginner

I haven't yet tried VTune, but my project is not quite ready for fine-tuning. It's an FE analysis package with about 30 "applications" operating on million-equation systems. Clocking it crudely (i.e., with SECNDS), I obtain the expected results: I/O is a big factor, as it always has been in such programs. By knocking down the number of physical operations with buffering, I've picked up a LOT of speed, but I/O is STILL the bottleneck. I KNOW that I'm transferring misaligned data to and from the buffers, and I KNOW that those transfers are clocking a respectable amount of time. What I do NOT know is whether or not data alignment will speed things up enough to justify the recoding.

Will VTune give me any help there? I'm planning on using it eventually anyway.

Based on your experience, can one say that data misalignment increases data move time by 10%? - or 20%? - or 5%? If I can get about 15%, it would be worth going after.

Bruce

Steven_L_Intel1
Employee
I doubt it would be more than 10% on IA-32, but this is very application dependent. Yes, VTune will help a lot in telling you where to spend your effort and where not to.
Seth_A_Intel
Employee

The topic of this thread may have diverged, but for what it is worth, here is some information on REP MOVSD. Over the past 4 years, I have been involved with the analysis and design of REP MOVSD/STOSD in Intel processors. Let me try to respond to some of the discussion relevant to REP MOVSD/STOSD instructions.

As has been pointed out, the very small opcode size (two bytes) has a dramatic impact on code size. Unfortunately, REP MOVSD does not always perform as well as custom code sequences. Many people within Intel have pointed this out, and there have been ongoing efforts to do something about it. If you examine the performance on the very latest CPUs (the new Intel Core 2 Duo processors and Xeon Processor 5100 series) you will see some of this effect-- REP MOVS/STOS instructions are substantially faster than they used to be. It is still possible to write code that is faster still, but the performance gap is not as large, and it is a little harder than it used to be to beat REP MOVSD/STOSD. I will put some detail on this at the end of the post.

Misaligned data is still a big problem. It turns out that dealing with misaligned source and destination buffers is just as hard in the micro-code within the REP MOVSD instruction as it is to deal with in regular code. The suggestions made in this post of testing alignment, internal buffering and custom multiple code paths are very painful in micro-code, and the performance you obtain is often worse than what you would end up with if you just wrote that memory copy subroutine. Furthermore, the REP MOVSD instruction has other constraints-- it must be able to operate correctly when stopped by an interrupt and update all the register values properly, so that the instruction can be restarted from the point it was interrupted. This significantly complicates align/shift micro-code implementations. When we have looked into dealing with alignment in micro-code, we have found the overhead to be prohibitive. The ideas make sense, it seems like it should be easy, but we have found in practice that it gets ugly pretty quickly.

Putting the alignment issues aside, let me elaborate a bit on how REP MOVSD stacks up against a memory copy function you (or the compiler) might write. You can characterize the performance of a copy operation by an overhead and a throughput. Ideally, you would want a REP MOVSD to have a throughput close to the fundamental peak transfer rate to cache, with an overhead close to zero. When you hit the first level cache, your maximum throughput is one load and one store per clock, so it matters how big your loads and stores are. REP MOVSD achieves one load and one store per clock, and it will actually change the load and store sizes for copies that are "long enough", so that your throughput increases. This is called "fast strings" mode-- you probably have seen information about this in Intel documentation. This fast strings mode has its own overhead-- the price you pay for that higher throughput. It is this overhead that determines what is "long enough" to enter the fast strings mode. There are also some other restrictions on entry to fast strings-- some alignments on source and destination, and some restrictions to make sure that fast strings cannot corrupt memory (like if the strings overlap).
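In other words (notation added here for illustration, not Intel's published numbers): if a copy of n bytes costs roughly t(n) = overhead + n/throughput in each mode, then fast strings pays off once

n > (overhead_fast - overhead_normal) / (1/throughput_normal - 1/throughput_fast)

and that break-even length is one way to read the "long enough" threshold.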

In normal strings mode, the Intel Core 2 Duo processors have cut the overhead of REP MOVSD by about a factor of two (in terms of processor clocks) vs. the Pentium 4 processors. The throughput is doubled vs. Pentium 4 processor, again in terms of clocks (well, in terms of bytes per clock). The overhead to enter fast strings is about 1/4 of what it was on the Pentium 4 processor, and the throughput is about double (in terms of clocks).

If you (or the compiler) write custom copy code, you can obtain highest throughput by using the largest possible size move (subject to alignment concerns), and minimizing your loop overhead, both in terms of the code you write and in terms of getting the best possible branch prediction. For very small copies of fixed size, multiple load/stores with no loop at all have the highest performance. When your data is misaligned, the best code I have seen keeps the loads and stores aligned, using shifts and the like to make it all work out right. Like many coding optimizations, there is a tradeoff with code size and performance that the programmer must address. You can still beat REP MOVSD/STOSD with such code, but as previous posters point out, you use a lot of instructions to do it-- at least if you want to cover all possible alignment cases. If you know something about the alignment or length up front, you can simplify your code considerably.
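As a rough illustration, a simpler cousin of that approach in C with SSE2 intrinsics: keep the stores aligned and let unaligned loads absorb the misalignment (the full shift-based scheme described above is more involved; the alignment and length constraints here are assumptions of the sketch):

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

/* Sketch: aligned stores with unaligned loads.  Assumes dst is 16-byte
   aligned and n (the double count) is even; both are illustrative
   constraints, with remainder handling omitted. */
static void copy_r8_sse2(double *dst, const double *src, size_t n)
{
    for (size_t i = 0; i < n; i += 2) {
        __m128d v = _mm_loadu_pd(src + i);  /* load tolerates misalignment */
        _mm_store_pd(dst + i, v);           /* store requires aligned dst  */
    }
}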

jimdempseyatthecove
Honored Contributor III

Seth,

Thanks for taking the time to reply. I do have a few observations:

It is often observed (as in this case) that there is a dichotomy between what is envisioned and what is practiced. I will assume your perspective is that of a processor engineer familiar with the internal microcode. You have presumably mastered assembly language and may have a good grasp of C/C++. As such, your expectations (what you envision) of memory move operations are focused on what is performed by way of a subroutine call (memmove, memcpy). What is practiced is quite different. In Intel Visual Fortran the practice is to place the generalized memory move code in-line. This is done in an effort to optimize shorter memory moves by eliminating the overhead of a subroutine call. The consequence is code bloat, with the side effect of adversely affecting the instruction cache.

While it would be easy for you to prescribe that the compiler writers use subroutines, in practice they cannot, because to do so might place the generated code at a competitive disadvantage or would be contrary to the user's dictate to produce the fastest possible code.

RE: The suggestions made in this post of testing alignment, internal buffering and custom multiple code paths are very painful in micro-code, and the performance you obtain is often worse that what you would end up with if you just wrote that memory copy subroutine.

This is only very painful once, and only painful if your threshold for pain is quite low. In my opinion, the effort to add some of the SSE3 instructions must have been more complex than creating efficient REP MOVSx instructions would be. It should be no worse than a root canal.

As suggested in my earlier post, create a study to examine some real-world applications. My preference is scientific computing using IVF, but realistically you will have to include commercial packages regardless of the language. I venture to guess that the preponderance of the in-line moves are good candidates for MOVSD or MOVSQ; i.e., the source and target strings for MOVSD are almost always dword aligned, and the source and target strings for MOVSQ are generally qword aligned but almost always dword aligned. Therefore, the concentration of the microcode effort should be on optimizing dword-aligned REP MOVSD. This will simplify the logic in the microcode. The compiler writers of IVF could easily integrate REP MOVSD as a code generation option. Later implementations of the microcode could address other alignment circumstances.

In outline:

- If the source or target is not dword aligned, branch to the old MOVSD microcode (a library-level sketch of this dispatch follows the list).

- Initialize the cache such that the next items read/written are marked as Least Recently Used, i.e. first to be retired. The intention is to not permit large memory moves to flush the data cache.

- Use an internal buffer of memory width (16 bytes now, 32 later, 64 whatever) to align writes to memory width (the first write potentially a read/modify/write to accommodate skewed data).

- The microcode optimally performs prefetch.

- During the move, take interrupt early-exits only after a write, as presumably the fetch of the next input line is in progress. As with the current REP MOVSD, the instruction is interruptible.
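A library-level stand-in for that first dispatch step might look like this C sketch (the function name, alignment test, and byte-loop fallback are all illustrative, not proposed microcode):

#include <stddef.h>
#include <stdint.h>

/* Sketch of the proposed dispatch: take the REP MOVSD fast path only when
   source, destination, and length are all dword multiples; otherwise fall
   back to a byte copy standing in for the "old MOVSD microcode" path. */
static void copy_dispatch(void *dst, const void *src, size_t n)
{
    if ((((uintptr_t)dst | (uintptr_t)src | n) & 3) == 0) {
        size_t dwords = n / 4;
        __asm__ __volatile__(
            "cld\n\t"
            "rep movsd"
            : "+S"(src), "+D"(dst), "+c"(dwords)
            :
            : "memory", "cc");
    } else {
        char *d = (char *)dst;
        const char *s = (const char *)src;
        while (n--) *d++ = *s++;    /* stand-in for the old unaligned path */
    }
}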

The advantages of performing the very painful task in microcode as opposed to subroutine or in-line are:

- The generated code is not required to determine the underlying cache or memory width. True, a compiler option could specify the number of bytes of memory width, but the width specified for current processors may not hold for next-generation processors. Yes, the runtime system code could store the optimal width for use by the generated code, but that requires an additional memory read and the use of a register; for small transfers the overhead would be burdensome. Accommodating the cache/memory width would be attainable with a subroutine, but incorporating it in-line is questionable.

- Determining the optimal prefetch distance is problematic in user code but well understood for a given processor design. The runtime system startup code could make this determination, and a generalized memmove subroutine could take advantage of it, but that adds to the subroutine's initialization overhead. In-line code would likely not be able to take advantage of prefetch.

Jim Dempsey
