Re: Advice needed: Seems to be a prefetch problem

alex_borsen · ‎10-18-2005

This simple program below runs 10 times slower when I change PAR form 8 to 128:

main ()
{

#define PAR 128

#define s 1024*1024*8
char *n=new char;int i,j,m=0;
for(j=0;j<1000;j++)
for(i=0;i<10000000;i++)m+=n[((i+j)*PAR+i)&(s-1)];
}

My real problem is even more complicated because I can't predict index change. Help please!

rmauldin · ‎10-19-2005

main ()
{

#define PAR 128

#define s 1024*1024*8
char *n=new char;int i,j,m=0;
for(j=0;j<1000;j++)
for(i=0;i<10000000;i++)m+=n[((i+j)*PAR+i)&(s-1)];
}

For starters letslook for the largest index value in n: i = 9999999 and j = 999. That would mean that (i+j)*128+i = 1290127743. s-1 = 8388607.

Then 1290127743 AND "&" 8388607 = 6670719. So 's' is large enough for n not to have any index problems. However m is just going to add up whatever garbage is in n[] which is not really going to do much but overflow a million times or so. Oh wait, what have we here... m is an integer and n is a char? Well that looks like your problem. You might need to cast those char values and convert them into int.

m+=(int)n[((i+j)*PAR+i)&(s-1)];

alex_borsen · ‎10-19-2005

To my regret, overflow is not a problem. You can get rid of + (just put m=.... in my script). It changes nothing. Program runs very slow.

Alex

jim_dempsey · ‎10-19-2005

If you look at your code as to how it advanced through memory you will find

PAR=8

0, 9, 18, 27, 36, ...

PAR = 128

0, 129, 258, 387, ...

The PAR=128 advances faster through memory. This can affect performance in two ways. Most processors have in their memory management system a thing called TLB (Translation Lookaside Buffer). This a type of cache. Though instead of storing the data (in the data cache)it stores the address ranges of what data is stored in the data cache. This in itself could not account for the 10x difference. The next thing to look at the multiplier of PAR is what is called a scaling operation. The IA32 instruction set is capable of scaling at 1, 2, 4 and 8 without using a multiplication instruction.I believethis is where the performance hit comes.

You can verify this hypothesis by timeing runs with PAR set to 1, 2, 3, 4, 5, 6, 7, 8, 9, 19, 11, 12, 13, 14, 15, 16, 17, ...128

What you should find is the order of speed might be 1, 2, 4, 8, 3, 6, 5, 7, ...

Jim Dempsey

rmauldin · ‎10-19-2005

Sorry for being sarcastic earlier.... what was this program trying to accomplish? Is PAR the speed of the program like Jim was mentioning?

alex_borsen · ‎10-19-2005

Thank you guys for you quick replies.
I double-checked assembly listing and found that compiler automatically uses << instead of * if operand is 2**n.
So it is not the problem. I tried127 instead of 128. It takes less than 1% of test time.
My application deals with real-time in-memory database. The database structure is a kind of multi-tied lists.

rmauldin · ‎10-20-2005

>> The IA32 instruction set is capable of scaling at 1, 2, 4 and 8 without using a multiplication instruction.I believethis is where the performance hit comes. <<

Good job Jim... I wasn't anywhere close to thinking that... I had no Idea that the shift operator would have slowed program speed down there. Wouldn't you think it should actually be faster, since the values are just being shifted?

jim_dempsey · ‎10-20-2005

The CPU contains a shifters for <<0, <<1, <<2, <<3 ( which corisponds to *1, *2, *4, *8). All four shift combinationsoccur simultaneously with all instructions that have a memory reference. The encoding in the instruction set includes which of the 4 shifts is to be selected for the scale operation.

Most memory references generated by high level languages try to regester-ize the base address of the item of interest and if the item of interest is a structure it would also compute an index into another regester. Thenin place of adding these two registers the IA32 instruction set contains a preamble byte to tell the processor to add the base and the index to produce the effective address. This preamble byte is called the SIB. Intel carried this once further. The most commonscaling operations are the 4 listed above. There was room in the SIB byte to indicate the mostcommon scalingoperations. So scaling of 1, 2, 4, 8 are essentialy free. In fact they are better than free. Because not only do you eliminate the time to execute the shift left instruction but also by eliminating the bytes of the instruction your program becomes more compact and thus more of it can fit into the instruction cache.

Jim Dempsey