One 64bit pointer is fooling me around, please help!

Obviously my C knowledge sucks, back to basics!

I am writing a console tool with the Intel compiler v15.0, and I can't find why the executable crashes during execution. My guess is that the problem appears when an address goes beyond the 2GB mark. The snippet that tortures me:

	NumberOfThreadsToPlayWith = 8;
	SourceFileSize=91964279;
	printf("Allocating %s bytes...\n", _ui64toaKAZEcomma((uint64_t)SourceFileSize*NumberOfThreadsToPlayWith+512, llTOaDigits, 10));	
	SourceBlock = (unsigned char*)malloc((uint64_t)SourceFileSize*NumberOfThreadsToPlayWith+512);
if( SourceBlock == NULL )
{ puts( "\nLexx: Needed memory allocation denied!\n" ); return( 1 ); }
	TargetFileSize=273401856;
	printf("Allocating %s bytes...\n", _ui64toaKAZEcomma((uint64_t)TargetFileSize*NumberOfThreadsToPlayWith+512*NumberOfThreadsToPlayWith, llTOaDigits, 10));	
	TargetBlock = (unsigned char*)malloc((uint64_t)TargetFileSize*NumberOfThreadsToPlayWith+512*NumberOfThreadsToPlayWith);
if( TargetBlock == NULL )
{ free(SourceBlock); puts( "\nLexx: Needed memory allocation denied!\n" ); return( 1 ); }
printf("Source&Target buffers are allocated.\n");

	fread(SourceBlock, 1, SourceFileSize, fp);
	fclose(fp);

printf("Simulating we have %d blocks for decompression...\n", NumberOfThreadsToPlayWith);
	for (i = 1; i <= (NumberOfThreadsToPlayWith-1); i++) {
		memcpy(SourceBlock+(uint64_t)i*SourceFileSize, SourceBlock, SourceFileSize);
	}

#ifdef Commence_OpenMP
		printf("Enforcing %d thread(s).\n", NumberOfThreadsToPlayWith);
#else
		printf("Enforcing 1 thread.\n");
#endif

#ifdef Commence_OpenMP
		printf("omp_get_num_procs( ) = %d\n", omp_get_num_procs( ));
		printf("omp_get_max_threads( ) = %d\n", omp_get_max_threads( ));
#endif


			printf("TargetBlock = %s\n", _ui64toaKAZEcomma((uint64_t)(TargetBlock), llTOaDigits, 10));	
			printf("Target pointer#8 = %s\n", _ui64toaKAZEcomma((uint64_t)(TargetBlock+(uint64_t)(8-1)*(TargetFileSize+512)), llTOaDigits, 10));	
			printf("Source pointer#8 = %s\n", _ui64toaKAZEcomma((uint64_t)(SourceBlock+(uint64_t)(8-1)*SourceFileSize), llTOaDigits, 10));	

#if defined(_icl_mumbo_jumbo_)
ticksStart = GetRDTSC();
#endif


#ifdef Commence_OpenMP
#pragma omp parallel shared(TargetBlock, SourceBlock, TargetFileSize, SourceFileSize) private(TargetSize001,TargetSize002,TargetSize003,TargetSize004,TargetSize005,TargetSize006,TargetSize007,TargetSize008,TargetSize009,TargetSize010,TargetSize011,TargetSize012,TargetSize013,TargetSize014,TargetSize015,TargetSize016,TargetSize017,TargetSize018,TargetSize019,TargetSize020,TargetSize021,TargetSize022,TargetSize023,TargetSize024,TargetSize025,TargetSize026,TargetSize027,TargetSize028,TargetSize029,TargetSize030,TargetSize031,TargetSize032)
#endif
{
#ifdef Commence_OpenMP
  #pragma omp sections
#endif
    {

#ifdef Commence_OpenMP
    #pragma omp section
#endif
	{
	// Thread 001:
		TargetSize001 = Decompress001(TargetBlock+(uint64_t)(1-1)*(TargetFileSize+512), SourceBlock+(uint64_t)(1-1)*SourceFileSize, SourceFileSize);
		if (TargetFileSize != TargetSize001) { printf("Lexx: Failure! Decompressed size mismatch!\n"); exit(13); }
printf("1 done\n");
	}

#ifdef Commence_OpenMP
    #pragma omp section
#endif
	{
	// Thread 002:
		TargetSize002 = Decompress002(TargetBlock+(uint64_t)(2-1)*(TargetFileSize+512), SourceBlock+(uint64_t)(2-1)*SourceFileSize, SourceFileSize);
		if (TargetFileSize != TargetSize002) { printf("Lexx: Failure! Decompressed size mismatch!\n"); exit(13); }
printf("2 done\n");
	}

#ifdef Commence_OpenMP
    #pragma omp section
#endif
	{
	// Thread 003:
		TargetSize003 = Decompress003(TargetBlock+(uint64_t)(3-1)*(TargetFileSize+512), SourceBlock+(uint64_t)(3-1)*SourceFileSize, SourceFileSize);
		if (TargetFileSize != TargetSize003) { printf("Lexx: Failure! Decompressed size mismatch!\n"); exit(13); }
printf("3 done\n");
	}

#ifdef Commence_OpenMP
    #pragma omp section
#endif
	{
	// Thread 004:
		TargetSize004 = Decompress004(TargetBlock+(uint64_t)(4-1)*(TargetFileSize+512), SourceBlock+(uint64_t)(4-1)*SourceFileSize, SourceFileSize);
		if (TargetFileSize != TargetSize004) { printf("Lexx: Failure! Decompressed size mismatch!\n"); exit(13); }
printf("4 done\n");
	}

#ifdef Commence_OpenMP
    #pragma omp section
#endif
	{
	// Thread 005:
		TargetSize005 = Decompress005(TargetBlock+(uint64_t)(5-1)*(TargetFileSize+512), SourceBlock+(uint64_t)(5-1)*SourceFileSize, SourceFileSize);
		if (TargetFileSize != TargetSize005) { printf("Lexx: Failure! Decompressed size mismatch!\n"); exit(13); }
printf("5 done\n");
	}

#ifdef Commence_OpenMP
    #pragma omp section
#endif
	{
	// Thread 006:
		TargetSize006 = Decompress006(TargetBlock+(uint64_t)(6-1)*(TargetFileSize+512), SourceBlock+(uint64_t)(6-1)*SourceFileSize, SourceFileSize);
		if (TargetFileSize != TargetSize006) { printf("Lexx: Failure! Decompressed size mismatch!\n"); exit(13); }
printf("6 done\n");
	}

#ifdef Commence_OpenMP
    #pragma omp section
#endif
	{
	// Thread 007:
		TargetSize007 = Decompress007(TargetBlock+(uint64_t)(7-1)*(TargetFileSize+512), SourceBlock+(uint64_t)(7-1)*SourceFileSize, SourceFileSize);
		if (TargetFileSize != TargetSize007) { printf("Lexx: Failure! Decompressed size mismatch!\n"); exit(13); }
printf("7 done\n");
	}

#ifdef Commence_OpenMP
    #pragma omp section
#endif
	{
	// Thread 008:
		TargetSize008 = Decompress008(TargetBlock+(uint64_t)(8-1)*(TargetFileSize+512), SourceBlock+(uint64_t)(8-1)*SourceFileSize, SourceFileSize);
		if (TargetFileSize != TargetSize008) { printf("Lexx: Failure! Decompressed size mismatch!\n"); exit(13); }
printf("8 done\n");
	}

The output in command prompt is this:

http://www.sanmayce.com/Downloads/bug.png


D:\_KAZE\Instructions_per_tick_during_branchless_decompression_32-threaded>timer64 Nakamichi_Oniyanma_Monsterdragonfly_Lexx_IPC_32-threads.exe Autobiography_411-ebooks_Collection.tar.Nakamichi
Nakamichi 'Oniyanma-Monsterdragonfly-Lexx_IPC', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Allocating 735,714,744 bytes...
Allocating 2,187,218,944 bytes...
Source&Target buffers are allocated.
Simulating we have 8 blocks for decompression...
Enforcing 1 thread.
TargetBlock = 2,147,418,176
Target pointer#8 = 4,061,234,752
Source pointer#8 = 648,468,609
1 done
2 done
3 done
4 done
5 done
6 done
7 done

Exit code: -1073741819

Since the executable is single-threaded here (OpenMP was not enabled on the compiler command line), the bug is not OpenMP-related at all. The exit code -1073741819 is 0xC0000005, i.e. an access violation.


 


 


Accepted Solutions
Highlighted
New Contributor I
110 Views

I did some tests in MSVC, Debug and Release.

With FlagMASKnegated = Flag - 1; I get 0x00000000FFFFFFFF, because Flag - 1 is a 32bit result that is then zero-extended to 64 bits.

With FlagMASKnegated = Flag - 1LL; the compiler decrements a 64bit value of 0, resulting in 0xFFFFFFFFFFFFFFFF.

It looks to me that you're expecting Flag to be widened to a 64bit value first and then decremented. But this isn't what you're telling the compiler.

You're saying subtract 1 from a 32bit value, and then put the 32bit result into a 64bit value.  You need to force the value of Flag to be 64bit before the decrement/subtraction.  With 1LL the subtraction is done in 64 bits because the other operand is 64bit (the usual arithmetic conversions in C); with a plain 1 it stays 32bit, and the Intel build in this thread behaves accordingly.

You probably want:

FlagMASKnegated = (uint64_t)Flag - 1LL;

 


0 Kudos
44 Replies
Highlighted
Beginner
88 Views

Section #8 is where the bug occurs. There I decompress from:

Source Address = SourceBlock+(uint64_t)(8-1)*SourceFileSize

to

Target Address = TargetBlock+(uint64_t)(8-1)*(TargetFileSize+512)

SourceFileSize bytes (about 88MB), which expand to TargetFileSize bytes (about 260MB).

So, Target pointer#8 = 4,061,234,752

and I guess that when those 260MB are added to the above value, TargetPointer#8 goes beyond the 4GB mark. But how can this cause a crash when I am using 8-byte pointers?

Add-on:
Stupid of me, forgot to give the actual function that crashes:

uint64_t Decompress008 (unsigned char* ret, unsigned char* src, uint64_t srcSize) {
	unsigned char* retLOCAL = ret;
	unsigned char* srcLOCAL = src;
	unsigned char* srcEndLOCAL = src+srcSize;
	unsigned int DWORDtrio;
	unsigned int Flag;
	uint64_t FlagMASK; //=       0xFFFFFFFFFFFFFFFF;
	uint64_t FlagMASKnegated; //=0x0000000000000000;
	while (srcLOCAL < srcEndLOCAL) {
		DWORDtrio = *(unsigned int*)srcLOCAL;
// Branchless [
		DWORDtrio = DWORDtrio&( 0xFFFFFFFF >> ((3-(DWORDtrio & 0x03))<<3) );
		Flag=!((DWORDtrio & 0x0F)-0x0C);
		// In here Flag=0|1
		FlagMASKnegated= Flag - 1; // -1|0
		FlagMASK= ~FlagMASKnegated;
				#ifdef _N_YMM
		SlowCopy256bit( (const char *)( 1+ ((uint64_t)(srcLOCAL)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)&FlagMASKnegated) ), retLOCAL);
				#endif
				#ifdef _N_GP
		memcpy(retLOCAL, (const char *)( 1+ ((uint64_t)(srcLOCAL)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)&FlagMASKnegated) ), 16*2);
				#endif
		retLOCAL+= ((uint64_t)((DWORDtrio>>4))&FlagMASK) +   ((uint64_t)(((1+((DWORDtrio>>2)&0x03))<<2) << ((1+(DWORDtrio&0x03))>>2))&FlagMASKnegated) ; 
		srcLOCAL+= ((uint64_t)((DWORDtrio>>4)+1)&FlagMASK) + ((uint64_t)(1+(DWORDtrio&0x03))&FlagMASKnegated) ;
// Branchless ]
	}      
	return (uint64_t)(retLOCAL - ret);
}

It works perfectly for TargetPointer#1 through TargetPointer#7, but it breaks for TargetPointer#8?!

 

0 Kudos
Highlighted
Beginner
88 Views

I found the line where the crash happens, but can't tell why!

...

//debug [
	if ((uint64_t)retLOCAL>0x100000000) printf("(const char *)( 1+ ((uint64_t)(srcLOCAL)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)&FlagMASKnegated) ): %p\n",(const char *)( 1+ ((uint64_t)(srcLOCAL)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)&FlagMASKnegated) ));
	if ((uint64_t)retLOCAL>0x100000000) printf("(const char *)( ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)) ): %p\n",(const char *)( ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)) ));
	if ((uint64_t)retLOCAL>0x100000000) printf("(const char *)( ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1))&FlagMASKnegated ): %p\n",(const char *)( ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1))&FlagMASKnegated ));

	if ((uint64_t)retLOCAL>0x100000000) printf("SOURCE: %s\n", _ui64toaKAZEcomma((uint64_t)( 1+ ((uint64_t)(srcLOCAL)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)&FlagMASKnegated) ), llTOaDigits, 10));
	if ((uint64_t)retLOCAL>0x100000000) printf("retLOCAL: %p\n",retLOCAL);
	if ((uint64_t)retLOCAL>0x100000000) printf("retLOCAL: %s\n", _ui64toaKAZEcomma((uint64_t)retLOCAL, llTOaDigits, 10));	

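	// Note: %d is not the right specifier for uint64_t; here it appears to show only the low 32 bits of each mask (the portable form is "%" PRIu64 from <inttypes.h>).
	if ((uint64_t)retLOCAL>0x100000000) printf("FlagMASK, FlagMASKnegated: %d, %d\n", FlagMASK,FlagMASKnegated);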
	//if ((uint64_t)retLOCAL>0x100000000) printf("(DWORDtrio>>4): %d\n", (DWORDtrio>>4));
	if ((uint64_t)retLOCAL>0x100000000) printf("\n");

/*
(const char *)( 1+ ((uint64_t)(srcLOCAL)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)&FlagMASKnegated) ): 00000000F21664A7
(const char *)( ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)) ): 00000000F21664A6
(const char *)( ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1))&FlagMASKnegated ): 00000000F21664A6
SOURCE: 4,061,553,831
retLOCAL: 000000010000017B
retLOCAL: 4,294,967,675
FlagMASK, FlagMASKnegated: 0, -1

(const char *)( 1+ ((uint64_t)(srcLOCAL)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)&FlagMASKnegated) ): 00000000FFFC1763
(const char *)( ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)) ): 00000000FFFC1762
(const char *)( ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1))&FlagMASKnegated ): 00000000FFFC1762
SOURCE: 4,294,711,139
retLOCAL: 0000000100000183
retLOCAL: 4,294,967,683
FlagMASK, FlagMASKnegated: 0, -1

(const char *)( 1+ ((uint64_t)(srcLOCAL)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)&FlagMASKnegated) ): 00000000000000CA
(const char *)( ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)) ): 00000001000000C9
(const char *)( ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1))&FlagMASKnegated ): 00000000000000C9
SOURCE: 202
retLOCAL: 000000010000018B
retLOCAL: 4,294,967,691
FlagMASK, FlagMASKnegated: 0, -1


Exit code: -1073741819
*/
//debug ]

		memcpy(retLOCAL, (const char *)( 1+ ((uint64_t)(srcLOCAL)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)&FlagMASKnegated) ), 16*2); //ORIGINAL
				#endif
		retLOCAL+= ((uint64_t)((DWORDtrio>>4))&FlagMASK) +   ((uint64_t)(((1+((DWORDtrio>>2)&0x03))<<2) << ((1+(DWORDtrio&0x03))>>2))&FlagMASKnegated) ; 
		srcLOCAL+= ((uint64_t)((DWORDtrio>>4)+1)&FlagMASK) + ((uint64_t)(1+(DWORDtrio&0x03))&FlagMASKnegated) ;
// Branchless ]
	}      
//printf("\nloopcounter=%d\n",loopcounter);  // loopcounter=29763921 for 'Autobiography_411-ebooks_Collection.tar.Lexx.Nakamichi'
	return (uint64_t)(retLOCAL - ret);
}

I fully expected the '&' operation to work with both 4-byte and 8-byte expressions. Why, once we get past 0xFFFFFFFF, does the '&' wipe out the upper 4 bytes?!

I am talking about the last group:

(const char *)( ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)) ): 00000001000000C9
(const char *)( ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1))&FlagMASKnegated ): 00000000000000C9

'FlagMASKnegated' should equal 0xFFFFFFFFFFFFFFFF, or -1, i.e. the first line should remain the same after ANDing, but instead the upper half is zeroed?!

Please explain why?

0 Kudos
Highlighted
Beginner
88 Views

Very confusing situation:

(uint64_t)(-1): 18,446,744,073,709,551,615
(uint64_t)FlagMASKnegated: 4,294,967,295
FlagMASK, FlagMASKnegated: 0, -1

Why does the cast give a different result for '-1' and for the variable 'FlagMASKnegated' of type uint64_t, which is supposedly also '-1'?!

0 Kudos
Highlighted
Beginner
88 Views

I found the workaround, it works, however the question remains!

My naive guess is that the compiler introduces this nasty behavior in the line:

		FlagMASKnegated= Flag - 1; // -1|0

The workaround is to make the right side 64 bits wide as well:

//	unsigned int Flag; // This line made my hair white! For some reason it should be 64bit, otherwise it makes 'FlagMASKnegated' 32bit during 'FlagMASKnegated= Flag - 1; // -1|0'
	uint64_t Flag;

Why such counterintuitive behavior?!

 

0 Kudos
Highlighted
88 Views

Hi Georgi - I'm not following everything with regard to the type cast versus the & operation problem.  However, it seems you may just be experiencing undefined behavior.  As to the bit-wise AND, it requires binary numbers of the same length.  If you are not doing that, then you will get undefined behavior, I think.  The specific behavior you see could probably be explained by having a look at the disassembly.  It may have to do with byte order and/or the data model.  If you still think there is a bug of some kind in the compiler, you could send a simplified test case to demonstrate it.  We would also want to know what architecture you are building for, the compile settings/switches and such.

However, to answer your question, it might be helpful to realize that a 32-bit -1 is not the same as a 64-bit -1 when you look at it bitwise.  The sign-bit will have changed places.  And you might get an undefined byte or word order if you start "anding" various sizes together, such that your 00000001000000C9 8-byte number becomes this 000000C9 00000001 just before the AND.

You could check the assembly to be sure.  But, your workaround is actually how it's supposed to be done, I think, based on the description of the bitwise AND requirement for equal length.

If you prefer, for secure communications with the VTune Amplifier XE support staff, you can always submit your problem to Intel(R) Premier Support.

https://premier.intel.com/
 
It's free, secure technical support.

 

0 Kudos
Highlighted
Beginner
88 Views

Hi Bob, thank you for your readiness to help me.
 

> If you still think there is a bug of some kind in the compiler, you could send a simplified test case to demonstrate. 

Sure, I will probably write a simple program, several lines long, that replicates this; just tonight I am on another wave.

>We would want to know what architecture you are building for, the compile settings/switches and such as well.

Sadly, I only have access to a laptop with a Core 2 Q9550s and 4GB DDR2, running Windows 7 64bit. The problem appeared when compiling for x64; the package below, though, contains two AVX instructions.

>But, your workaround is actually how it's supposed to be done, I think, based on the description of the bitwise AND requirement for equal length.

Hm, the problem is more subtle: the workaround doesn't change the '&' operation at all! I just experienced some weird casting in this line:

FlagMASKnegated= Flag - 1; // -1|0

As if the 8-byte FlagMASKnegated had taken on the type of Flag, which was 4 bytes long, i.e. unsigned int. Once I made Flag 8 bytes long, the '&' operation started to work correctly!

>You could check the assembly to be sure. 

Yes, I did check and code was this (before the workaround):

uint64_t Decompress (unsigned char* ret, unsigned char* src, uint64_t srcSize) {
	unsigned char* retLOCAL = ret;
	unsigned char* srcLOCAL = src;
	unsigned char* srcEndLOCAL = src+srcSize;
	unsigned int DWORDtrio;
	unsigned int Flag;
	uint64_t FlagMASK; //=       0xFFFFFFFFFFFFFFFF;
	uint64_t FlagMASKnegated; //=0x0000000000000000;
	while (srcLOCAL < srcEndLOCAL) {
		DWORDtrio = *(unsigned int*)srcLOCAL;
// Branchless [
		DWORDtrio = DWORDtrio&( 0xFFFFFFFF >> ((3-(DWORDtrio & 0x03))<<3) );
		Flag=!((DWORDtrio & 0x0F)-0x0C);
		// In here Flag=0|1
		FlagMASKnegated= Flag - 1; // -1|0
		FlagMASK= ~FlagMASKnegated;
				#ifdef _N_YMM
		SlowCopy256bit( (const char *)( 1+ ((uint64_t)(srcLOCAL)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)&FlagMASKnegated) ), retLOCAL);
				#endif
				#ifdef _N_GP
		memcpy(retLOCAL, (const char *)( 1+ ((uint64_t)(srcLOCAL)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)&FlagMASKnegated) ), 16*2);
				#endif
		retLOCAL+= ((uint64_t)((DWORDtrio>>4))&FlagMASK) +   ((uint64_t)(((1+((DWORDtrio>>2)&0x03))<<2) << ((1+(DWORDtrio&0x03))>>2))&FlagMASKnegated) ; 
		srcLOCAL+= ((uint64_t)((DWORDtrio>>4)+1)&FlagMASK) + ((uint64_t)(1+(DWORDtrio&0x03))&FlagMASKnegated) ;
// Branchless ]
	}      
	return (uint64_t)(retLOCAL - ret);
}


/*
; 'Oniyanma-Monsterdragonfly-Lexx_branchless' decompression loop, c2-2d+6=155 bytes long:
; mark_description "Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.0.108 Build 20140";
; mark_description "-O3 -QxSSE2 -D_N_YMM -D_N_prefetch_4096 -FAcs";

.B8.3::                         
  0002d 44 8b 3a         mov r15d, DWORD PTR [rdx]              
  00030 44 89 f9         mov ecx, r15d                          
  00033 83 f1 03         xor ecx, 3                             
  00036 41 bc ff ff ff 
        ff               mov r12d, -1                           
  0003c c1 e1 03         shl ecx, 3                             
  0003f bd 01 00 00 00   mov ebp, 1                             
  00044 41 d3 ec         shr r12d, cl                           
  00047 45 23 fc         and r15d, r12d                         
  0004a 45 33 e4         xor r12d, r12d                         
  0004d 45 89 fe         mov r14d, r15d                         
  00050 45 89 fb         mov r11d, r15d                         
  00053 41 83 e6 0f      and r14d, 15                           
  00057 4c 89 c9         mov rcx, r9                            
  0005a 41 83 fe 0c      cmp r14d, 12                           
  0005e 44 0f 44 e5      cmove r12d, ebp                        
  00062 48 89 d5         mov rbp, rdx                           
  00065 41 c1 eb 04      shr r11d, 4                            
  00069 41 ff cc         dec r12d                               
  0006c 45 89 da         mov r10d, r11d                         
  0006f 4d 89 e6         mov r14, r12                           
  00072 49 2b ca         sub rcx, r10                           
  00075 49 f7 d6         not r14                                
  00078 48 ff c9         dec rcx                                
  0007b 49 23 ee         and rbp, r14                           
  0007e 49 23 cc         and rcx, r12                           
  00081 41 ff c3         inc r11d                               
  00084 4d 23 d6         and r10, r14                           
  00087 4d 23 de         and r11, r14                           
  0008a c5 fe 6f 44 29 
        01               vmovdqu ymm0, YMMWORD PTR [1+rcx+rbp]  
  00090 44 89 fd         mov ebp, r15d                          
  00093 83 e5 03         and ebp, 3                             
  00096 41 83 e7 0c      and r15d, 12                           
  0009a ff c5            inc ebp                                
  0009c 41 83 c7 04      add r15d, 4                            
  000a0 89 e9            mov ecx, ebp                           
  000a2 c1 e9 02         shr ecx, 2                             
  000a5 41 d3 e7         shl r15d, cl                           
  000a8 49 23 ec         and rbp, r12                           
  000ab 4d 23 fc         and r15, r12                           
  000ae 4c 03 dd         add r11, rbp                           
  000b1 4d 03 d7         add r10, r15                           
  000b4 49 03 d3         add rdx, r11                           
  000b7 c4 c1 7e 7f 01   vmovdqu YMMWORD PTR [r9], ymm0         
  000bc 4d 03 ca         add r9, r10                            
  000bf 49 3b d0         cmp rdx, r8                            
  000c2 0f 82 65 ff ff 
        ff               jb .B8.3 
*/

A few days ago I wanted to offer the only IPC (Instructions-Per-Clock) reporter tool known to me. I wanted to give Intel/AMD fans (and mostly myself) a tool reporting that number in one REALWORLD scenario: LZSS decompression of 260MB of English texts. So I wrote a single-threaded version and its multi-threaded counterpart. The whole package contains:

D:\_KAZE\Instructions_per_tick_during_branchless_decompression_32-threaded>dir
 
07/13/2015  05:14 PM        91,964,279 Autobiography_411-ebooks_Collection.tar.Nakamichi
07/13/2015  05:14 PM               287 Get_IPC.bat
07/13/2015  05:14 PM         1,114,552 libiomp5md.dll
07/13/2015  05:14 PM             1,228 MakeEXEs_Nakamichi_Oniyanma_Monsterdragonfly_Lexx_IPC.bat
07/13/2015  05:14 PM             1,632 MokujIN 224 prompt.lnk
07/13/2015  05:14 PM           129,024 Nakamichi_Oniyanma_Monsterdragonfly_Lexx_IPC_1-thread.exe
07/13/2015  05:14 PM           345,439 Nakamichi_Oniyanma_Monsterdragonfly_Lexx_IPC_32-threads.c
07/13/2015  05:14 PM         2,054,019 Nakamichi_Oniyanma_Monsterdragonfly_Lexx_IPC_32-threads.cod
07/13/2015  05:14 PM           131,584 Nakamichi_Oniyanma_Monsterdragonfly_Lexx_IPC_32-threads.exe
07/13/2015  05:14 PM             6,144 timer64.exe

D:\_KAZE\Instructions_per_tick_during_branchless_decompression_32-threaded>

You may download package Instructions_per_tick_during_branchless_decompression_32-threaded.zip, 87MB in size:

https://mega.co.nz/#!I4hHwC5Y!3udON_nVUrcc8JBfn_7N7gqXGW63WSLmQGh89Jp8gns

Despite my amateurism I am still fascinated by the search for SPEED. I am fond of speedy C etudes to the extent that I want to kindle this appreciation in other coders; hopefully someone will at some point better the work of previous fans.

Bob, if it is not too much to ask, please run Get_IPC.bat on some powerful CPU; I want to see the Results.txt dump file so much. You see, I have been a big fan of the Intel C optimizer since version 12.1, but I have been an AMD fan since my AMD Barton CPU, the chip with which I first explored the speed of my C etudes. So I intend to buy AMD 'Zen' if it happens to outperform the Intel 5960X in my search/decompression benchmarks. I initiated a thread in which those two chips can (soon) be tested fairly and openly in such integer tasks.

http://www.overclock.net/t/1562519/amd-zen-cpu-confirmed-by-roy-taylor/0_20

0 Kudos
Highlighted
Beginner
88 Views

Oh, and the workaround:

uint64_t Decompress (unsigned char* ret, unsigned char* src, uint64_t srcSize) {
	unsigned char* retLOCAL = ret;
	unsigned char* srcLOCAL = src;
	unsigned char* srcEndLOCAL = src+srcSize;
	unsigned int DWORDtrio;
	//unsigned int Flag; // This line made my hair white! For some reason it should be 64bit, otherwise it makes 'FlagMASKnegated' 32bit during 'FlagMASKnegated= Flag - 1; // -1|0'
	uint64_t Flag;
	uint64_t FlagMASK; //=       0xFFFFFFFFFFFFFFFF;
	uint64_t FlagMASKnegated; //=0x0000000000000000;
//int loopcounter=0;
	while (srcLOCAL < srcEndLOCAL) {
//loopcounter++;
		DWORDtrio = *(unsigned int*)srcLOCAL;
//#ifndef _N_GP
//#ifdef _N_prefetch_4096
//		_mm_prefetch((char*)(srcLOCAL + 64*64), _MM_HINT_T0);
//#endif
//#endif
// |1stLSB    |2ndLSB  |3rdLSB   |
// -------------------------------
// |OO|LL|xxxx|xxxxxxxx|xxxxxx|xx|
// -------------------------------
// [1bit          16bit]    24bit]
// OOLL = 0011 means Literal                                                                        
// OO = 00b MatchOffset, 0xFFFFFFFF>>(3-OO), 1 bytes long i.e. Sliding Window is 1*8-LL-OO=(1+OO)*8-4=04 or   16B    
// OO = 01b MatchOffset, 0xFFFFFFFF>>(3-OO), 2 bytes long i.e. Sliding Window is 2*8-LL-OO=(1+OO)*8-4=12 or   4KB    
// OO = 10b MatchOffset, 0xFFFFFFFF>>(3-OO), 3 bytes long i.e. Sliding Window is 3*8-LL-OO=(1+OO)*8-4=20 or   1MB    
// OO = 11b MatchOffset, 0xFFFFFFFF>>(3-OO), 4 bytes long i.e. Sliding Window is 4*8-LL-OO=(1+OO)*8-4=28 or 256MB     
// LL = 00b means 04/08/12    MatchLength, ((1+LL)<<2) << (1+OO)>>2)
// LL = 01b means 04/08/12/16 MatchLength, ((1+LL)<<2) << (1+OO)>>2)
// LL = 10b means 04/08/12/16 MatchLength, ((1+LL)<<2) << (1+OO)>>2)
// LL = 11b means 08/16/24/32 MatchLength, ((1+LL)<<2) << (1+OO)>>2)
// (1<<2<<0):1 =  4:1 priority #08                          #01 12:1 = 12
// (2<<2<<0):1 =  8:1 priority #02                          #02  8:1 =  8
// (3<<2<<0):1 = 12:1 priority #01                          #03 16:2 =  8
// (4<<2<<0):1 = 16:1 (not used in 'Hoshimi')               #04 32:4 =  8
// (1<<2<<0):2 =  4:2 priority #13                          #05 12:2 =  6
// (2<<2<<0):2 =  8:2 priority #09                          #06 24:4 =  6
// (3<<2<<0):2 = 12:2 priority #05                          #07 16:3 =  5.3
// (4<<2<<0):2 = 16:2 priority #03                          #08  4:1 =  4
// (1<<2<<0):3 =  4:3 priority #15                          #09  8:2 =  4
// (2<<2<<0):3 =  8:3 priority #12                          #10 12:3 =  4
// (3<<2<<0):3 = 12:3 priority #10                          #11 16:4 =  4
// (4<<2<<0):3 = 16:3 priority #07                          #12  8:3 =  2.6
// (1<<2<<1):4 =  8:4 priority #14 (not used in 'Hoshimi*') #13  4:2 =  2
// (2<<2<<1):4 = 16:4 priority #11                          #14  8:4 =  2
// (3<<2<<1):4 = 24:4 priority #06                          #15  4:3 =  1.6
// (4<<2<<1):4 = 32:4 priority #04
// In 'Hoshimi' two bit combinations were unexploited, in 'Hoshimikou' one bit combination was unexploited, in 'Lexx' none is left.
/*
// Branchfull [
		DWORDtrio = DWORDtrio&( 0xFFFFFFFF >> ((3-(DWORDtrio & 0x03))<<3) );
		if ( (DWORDtrio & 0x0F) == 0x0C ) {       
				#ifdef _N_GP
		memcpy(retLOCAL, (const char *)( (uint64_t)(srcLOCAL+1) ), 16*2); // Hard lesson: XMM and YMM are not to be used together.
				#endif
				#ifdef _N_YMM
		SlowCopy256bit( (const char *)( (uint64_t)(srcLOCAL+1) ), retLOCAL );
				#endif
		retLOCAL+= (DWORDtrio>>4);
		srcLOCAL+= (DWORDtrio>>4)+1;
		} else {
				#ifdef _N_GP
			memcpy(retLOCAL, (const char *)( (uint64_t)(retLOCAL-(DWORDtrio>>4)) ), 16*2);
				#endif
				#ifdef _N_YMM
			SlowCopy256bit( (const char *)( (uint64_t)(retLOCAL-(DWORDtrio>>4)) ), retLOCAL );
				#endif
		srcLOCAL+= 1+(DWORDtrio&0x03); // 4|3|2|1
		retLOCAL+= ((1+((DWORDtrio>>2)&0x03))<<2) << ((1+(DWORDtrio&0x03))>>2); // 4/8/12/16/24/32
		}
// Branchfull ]
*/
// Branchless [
		DWORDtrio = DWORDtrio&( 0xFFFFFFFF >> ((3-(DWORDtrio & 0x03))<<3) );
		Flag=!((DWORDtrio & 0x0F)-0x0C);
		// In here Flag=0|1
		FlagMASKnegated= Flag - 1; // -1|0
		FlagMASK= ~FlagMASKnegated;
				#ifdef _N_YMM
//		SlowCopy256bit( (const char *)( ((uint64_t)(srcLOCAL+1)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4))&FlagMASKnegated) ), retLOCAL);
// Another (incompatible with Branchfull variant, though) way to avoid 'LEA' is to put the '+1' outside the FlagMASK but then the encoder has to count literals from zero in order to compensate '-((DWORDtrio>>4)-1) = -(DWORDtrio>>4)+1' within FlagMASKnegated:
		SlowCopy256bit( (const char *)( 1+ ((uint64_t)(srcLOCAL)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)&FlagMASKnegated) ), retLOCAL);
				#endif
				#ifdef _N_GP
//		memcpy(retLOCAL, (const char *)( ((uint64_t)(srcLOCAL+1)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4))&FlagMASKnegated) ), 16*2);
// Another (incompatible with Branchfull variant, though) way to avoid 'LEA' is to put the '+1' outside the FlagMASK but then the encoder has to count literals from zero in order to compensate '-((DWORDtrio>>4)-1) = -(DWORDtrio>>4)+1' within FlagMASKnegated:
		memcpy(retLOCAL, (const char *)( 1+ ((uint64_t)(srcLOCAL)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)&FlagMASKnegated) ), 16*2);
				#endif
		retLOCAL+= ((uint64_t)((DWORDtrio>>4))&FlagMASK) +   ((uint64_t)(((1+((DWORDtrio>>2)&0x03))<<2) << ((1+(DWORDtrio&0x03))>>2))&FlagMASKnegated) ; 
		srcLOCAL+= ((uint64_t)((DWORDtrio>>4)+1)&FlagMASK) + ((uint64_t)(1+(DWORDtrio&0x03))&FlagMASKnegated) ;
// Branchless ]
	}      
//printf("\nloopcounter=%d\n",loopcounter);  // loopcounter=29763921 for 'Autobiography_411-ebooks_Collection.tar.Lexx.Nakamichi'
	return (uint64_t)(retLOCAL - ret);
}

 

; mark_description "Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.0.108 Build 20140";
; mark_description "726";
; mark_description "-O3 -QxSSE2 -D_N_YMM -D_N_prefetch_4096 -D_icl_mumbo_jumbo_ -Qopenmp -Qopenmp-link:static -DCommence_OpenMP ";
; mark_description "-D_N_REALTIME -FAcs";

.B30.3::                        
  00030 45 8b 38         mov r15d, DWORD PTR [r8]               
  00033 44 89 f9         mov ecx, r15d                          
  00036 83 f1 03         xor ecx, 3                             
  00039 41 bc ff ff ff 
        ff               mov r12d, -1                           
  0003f c1 e1 03         shl ecx, 3                             

;;; 		Flag=!((DWORDtrio & 0x0F)-0x0C);
;;; 		// In here Flag=0|1
;;; 		FlagMASKnegated= Flag - 1; // -1|0

  00042 bd 01 00 00 00   mov ebp, 1                             
  00047 41 d3 ec         shr r12d, cl                           
  0004a 45 23 fc         and r15d, r12d                         
  0004d 45 33 e4         xor r12d, r12d                         
  00050 45 89 fe         mov r14d, r15d                         

;;; 		FlagMASK= ~FlagMASKnegated;
;;; 				#ifdef _N_YMM
;;; 		SlowCopy256bit( (const char *)( 1+ ((uint64_t)(srcLOCAL)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)&FlagMASKnegated) ), retLOCAL);

  00053 45 89 fb         mov r11d, r15d                         
  00056 41 83 e6 0f      and r14d, 15                           
  0005a 48 89 c1         mov rcx, rax                           
  0005d 41 83 fe 0c      cmp r14d, 12                           
  00061 44 0f 44 e5      cmove r12d, ebp                        
  00065 4c 89 c5         mov rbp, r8                            
  00068 41 c1 eb 04      shr r11d, 4                            
  0006c 49 ff cc         dec r12                                
  0006f 45 89 da         mov r10d, r11d                         
  00072 4d 89 e6         mov r14, r12                           
  00075 49 2b ca         sub rcx, r10                           
  00078 49 f7 d6         not r14                                
  0007b 48 ff c9         dec rcx                                
  0007e 49 23 ee         and rbp, r14                           
  00081 49 23 cc         and rcx, r12                           

;;; 				#endif
;;; 				#ifdef _N_GP
;;; 		memcpy(retLOCAL, (const char *)( 1+ ((uint64_t)(srcLOCAL)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)&FlagMASKnegated) ), 16*2);
;;; 				#endif
;;; 		retLOCAL+= ((uint64_t)((DWORDtrio>>4))&FlagMASK) +   ((uint64_t)(((1+((DWORDtrio>>2)&0x03))<<2) << ((1+(DWORDtrio&0x03))>>2))&FlagMASKnegated) ; 
;;; 		srcLOCAL+= ((uint64_t)((DWORDtrio>>4)+1)&FlagMASK) + ((uint64_t)(1+(DWORDtrio&0x03))&FlagMASKnegated) ;

  00084 41 ff c3         inc r11d                               
  00087 4d 23 d6         and r10, r14                           
  0008a 4d 23 de         and r11, r14                           
  0008d c5 fe 6f 44 29 
        01               vmovdqu ymm0, YMMWORD PTR [1+rcx+rbp]  
  00093 44 89 fd         mov ebp, r15d                          
  00096 83 e5 03         and ebp, 3                             
  00099 41 83 e7 0c      and r15d, 12                           
  0009d ff c5            inc ebp                                
  0009f 41 83 c7 04      add r15d, 4                            
  000a3 89 e9            mov ecx, ebp                           
  000a5 c1 e9 02         shr ecx, 2                             
  000a8 41 d3 e7         shl r15d, cl                           
  000ab 49 23 ec         and rbp, r12                           
  000ae 4d 23 fc         and r15, r12                           
  000b1 4c 03 dd         add r11, rbp                           
  000b4 4d 03 d7         add r10, r15                           
  000b7 4d 03 c3         add r8, r11                            
  000ba c5 fe 7f 00      vmovdqu YMMWORD PTR [rax], ymm0        
  000be 49 03 c2         add rax, r10                           
  000c1 4d 3b c1         cmp r8, r9                             
  000c4 0f 82 66 ff ff 
        ff               jb .B30.3 

Still can't figure it out.


Hi Georgi - Looking at the disassembly, I think we want to follow the r12 register in the non-workaround case and compare it against the workaround case to see what the expected behavior is. You can see that without the workaround, only the lower 32 bits are decremented. Ultimately, the full 64-bit value is moved to r14, which is then bitwise-inverted (not). So, what happens when the lower 32 bits are decremented and then moved into r14? What is the value of r14 at that point in both cases? How does that affect the final not of the full 64-bit register? Is it a case of 0xFFFFFFFFFFFFFFFF versus 0xFFFFFFFF? So, in the non-workaround case, do you end up with 0x00000000FFFFFFFF?  What happens when you not that value?  Versus when you not this value:  0xFFFFFFFFFFFFFFFF? 

  00069 41 ff cc         dec r12d                               (lower 32 bits decremented)
  0006c 45 89 da         mov r10d, r11d                         
  0006f 4d 89 e6         mov r14, r12                           (full 64-bit register moved)
  00072 49 2b ca         sub rcx, r10                           
  00075 49 f7 d6         not r14                                (what happens here and what's the expected behavior?)

===============================================================================
  0006c 49 ff cc         dec r12                                (full 64-bit register decremented)
  0006f 45 89 da         mov r10d, r11d                         
  00072 4d 89 e6         mov r14, r12                           (full 64-bit register moved)
  00075 49 2b ca         sub rcx, r10                           
  00078 49 f7 d6         not r14                                (what happens here and what's the expected behavior?)

IOW, based on what you find, it still looks like an issue of bitwise operations on operands of different widths, which, as you can see, can give results significantly different from what's expected, even though the behavior may be well defined by the language spec. I'd have to check to be sure, but let's see what the registers show first.
Beginner

That's right, this is the culprit, but why?

((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)&FlagMASKnegated)

The '-1' in the above expression is that 'dec r12d', but why not 'dec r12'? The code is the same!

Beginner

Oh, didn't answer:

>So, in the non-workaround case, do you end up with 0x00000000FFFFFFFF?  What happens when you not that value?  Versus when you not this value:  0xFFFFFFFFFFFFFFFF? 

I DO. If the upper 4 bytes are zeroed, the resulting 64-bit pointer causes the crash in memcpy().
The workaround prevents zeroing the upper half of the 8 bytes, and it is just this redefinition:

	    //unsigned int Flag;
	    uint64_t Flag;

Other than that the two 'Decompress' functions are EXACTLY THE SAME.


Because you are bitwise "anding" a 32-bit value with a 64-bit value.

Or, put another way, you have incorrect grouping, if you are specifically asking about this line:

((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)&FlagMASKnegated)

It's evaluated like so:

DWORDtrio>>4

-retLOCAL

-1

& FlagMASKnegated

cast to 64-bits

Try this.  I think that will work, but didn't test it.  However, this kind of thing is error prone IMO.  Just make it all 64-bit, IMO.

(((uint64_t)(retLOCAL-(DWORDtrio>>4)-1))&FlagMASKnegated)
 

 

 

Beginner

>Because you are bitwise "anding" a 32-bit value with a 64-bit value.

I am not; the compiler is fooled by something and is '&'-ing like that.

>Try this.  I think that will work, but didn't test it.  However, this kind of thing is error prone IMO.  Just make it all 64-bit, IMO.

It didn't work. I had already tried several different '(uint64_t)' casts myself, and they all failed.

;;; 		FlagMASK= ~FlagMASKnegated;
;;; 				#ifdef _N_YMM
;;; //		SlowCopy256bit( (const char *)( ((uint64_t)(srcLOCAL+1)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4))&FlagMASKnegated) ), retLOCAL);
;;; // Another (incompatible with Branchfull variant, though) way to avoid 'LEA' is to put the '+1' outside the FlagMASK but then the encoder has to count literals from zero in order to compensate '-((DWORDtrio>>4)-1) = -(DWORDtrio>>4)+1' within FlagMASKnegated:
;;; 		//SlowCopy256bit( (const char *)( 1+ ((uint64_t)(srcLOCAL)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)&FlagMASKnegated) ), retLOCAL);
;;; 		SlowCopy256bit( (const char *)( 1+ ((uint64_t)(srcLOCAL)&FlagMASK) + (((uint64_t)(retLOCAL-(DWORDtrio>>4)-1))&FlagMASKnegated) ), retLOCAL);

  00053 45 89 fb         mov r11d, r15d                         
  00056 41 83 e6 0f      and r14d, 15                           
  0005a 48 89 c1         mov rcx, rax                           
  0005d 41 83 fe 0c      cmp r14d, 12                           
  00061 44 0f 44 e5      cmove r12d, ebp                        
  00065 4c 89 c5         mov rbp, r8                            
  00068 41 c1 eb 04      shr r11d, 4                            
  0006c 41 ff cc         dec r12d                               
  0006f 45 89 da         mov r10d, r11d                         
  00072 4d 89 e6         mov r14, r12                           
  00075 49 2b ca         sub rcx, r10                           
  00078 49 f7 d6         not r14                                
  0007b 48 ff c9         dec rcx                                
  0007e 49 23 ee         and rbp, r14                           
  00081 49 23 cc         and rcx, r12                           

I'm aware that my C style is dirty, yet here I can't say mea culpa.

New Contributor I

Is it not just a case of using "-1LL" to force the constant to be 64bit, instead of just -1, which will always be a 32 bit constant even in 64bit mode ?

Beginner

>Is it not just a case of using "-1LL" to force the constant to be 64bit, instead of just -1, which will always be a 32 bit constant even in 64bit mode ?

Thanks, good guess, but the result is the same:

;;; 		FlagMASK= ~FlagMASKnegated;
;;; 				#ifdef _N_YMM
;;; //		SlowCopy256bit( (const char *)( ((uint64_t)(srcLOCAL+1)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4))&FlagMASKnegated) ), retLOCAL);
;;; // Another (incompatible with Branchfull variant, though) way to avoid 'LEA' is to put the '+1' outside the FlagMASK but then the encoder has to count literals from zero in order to compensate '-((DWORDtrio>>4)-1) = -(DWORDtrio>>4)+1' within FlagMASKnegated:
;;; 		//SlowCopy256bit( (const char *)( 1+ ((uint64_t)(srcLOCAL)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)&FlagMASKnegated) ), retLOCAL);
;;; 		SlowCopy256bit( (const char *)( 1+ ((uint64_t)(srcLOCAL)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1LL)&FlagMASKnegated) ), retLOCAL);

  00053 45 89 fb         mov r11d, r15d                         
  00056 41 83 e6 0f      and r14d, 15                           
  0005a 48 89 c1         mov rcx, rax                           
  0005d 41 83 fe 0c      cmp r14d, 12                           
  00061 44 0f 44 e5      cmove r12d, ebp                        
  00065 4c 89 c5         mov rbp, r8                            
  00068 41 c1 eb 04      shr r11d, 4                            
  0006c 41 ff cc         dec r12d                               
  0006f 45 89 da         mov r10d, r11d                         
  00072 4d 89 e6         mov r14, r12                           
  00075 49 2b ca         sub rcx, r10                           
  00078 49 f7 d6         not r14                                
  0007b 48 ff c9         dec rcx                                
  0007e 49 23 ee         and rbp, r14                           
  00081 49 23 cc         and rcx, r12                           

 


Hi Georgi - Okay, thanks.  So, I think the next step is the scaled-down test case that we talked about earlier.  You can also open a case with premier support, as that's free, secure and private.  And/or feel free to send it to me as a private message, if you want.  Tracking is better with a support case, should this turn out to be a defect in the compiler. 

If it is a narrowing conversion problem, you probably should be getting compile-time warnings anyway.  Let me take this to the compiler team and see what they say.

Thanks and regards,

-Bob

Beginner

Better late than never...

A quick replication of the problem:

#define _N_GP
//#define _N_YMM

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h> // uint64_t needed
#include <string.h>

#ifdef _N_YMM
#include <emmintrin.h> // SSE2 intrinsics
#include <smmintrin.h> // SSE4.1 intrinsics
#include <immintrin.h> // AVX intrinsics
#endif

#ifdef _N_YMM
void SlowCopy256bit (const char *SOURCE, char *TARGET) { _mm256_storeu_si256((__m256i *)(TARGET), _mm256_loadu_si256((const __m256i *)(SOURCE))); }
#endif

#ifndef NULL
#ifdef __cplusplus
#define NULL 0
#else
#define NULL ((void*)0)
#endif
#endif

int main( int argc, char *argv[] ) {

	unsigned char* retLOCAL;
	unsigned char* srcLOCAL;
	unsigned int DWORDtrio;
	unsigned int Flag; // This line made my hair white! For some reason it should be 64bit, otherwise it makes 'FlagMASKnegated' 32bit during 'FlagMASKnegated= Flag - 1; // -1|0'
	//uint64_t Flag;
	uint64_t FlagMASK; //=       0xFFFFFFFFFFFFFFFF;
	uint64_t FlagMASKnegated; //=0x0000000000000000;

	srcLOCAL = (unsigned char*)malloc(512);
	retLOCAL = (unsigned char*)malloc((1LL<<32)+512);

if( srcLOCAL == NULL )
{ printf("Needed memory allocation denied!\n"); return(1); }
if( retLOCAL == NULL )
{ printf("Needed memory allocation denied!\n"); return(1); }

		DWORDtrio = *(unsigned int*)srcLOCAL;

		DWORDtrio = DWORDtrio&( 0xFFFFFFFF >> ((3-(DWORDtrio & 0x03))<<3) );
		Flag=!((DWORDtrio & 0x0F)-0x0C);
		// In here Flag=0|1
		FlagMASKnegated= Flag - 1; // -1|0
		FlagMASK= ~FlagMASKnegated;

				#ifdef _N_YMM
		SlowCopy256bit( (const char *)( 1+ ((uint64_t)(srcLOCAL)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)&FlagMASKnegated) ), retLOCAL);
				#endif
				#ifdef _N_GP
		memcpy(retLOCAL, (const char *)( 1+ ((uint64_t)(srcLOCAL)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)&FlagMASKnegated) ), 16*2);
				#endif

printf("retLOCAL = %p\n", (void *)retLOCAL); // %p expects a void pointer
printf("((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)&FlagMASKnegated) = %p\n", (void *)((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)&FlagMASKnegated)); //00000001 3F 9C 00 3B
exit(0);
}

// Output when non-workaround is in effect (i.e. 'unsigned int Flag;' is active):
/*
D:\_KAZE\Instructions_per_tick_during_branchless_decompression_32-threaded>icl /O3 losingHigh4bytes.c
Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.0.108 Build 20140726
Copyright (C) 1985-2014 Intel Corporation.  All rights reserved.

losingHigh4bytes.c
Microsoft (R) Incremental Linker Version 10.00.30319.01
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:losingHigh4bytes.exe
losingHigh4bytes.obj

D:\_KAZE\Instructions_per_tick_during_branchless_decompression_32-threaded>losingHigh4bytes.exe
!!! CRASH !!!
D:\_KAZE\Instructions_per_tick_during_branchless_decompression_32-threaded>
*/

// Output when workaround is in effect (i.e. 'uint64_t Flag;' is active):
/*
D:\_KAZE\Instructions_per_tick_during_branchless_decompression_32-threaded>icl /O3 losingHigh4bytes.c
Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.0.108 Build 20140726
Copyright (C) 1985-2014 Intel Corporation.  All rights reserved.

losingHigh4bytes.c
Microsoft (R) Incremental Linker Version 10.00.30319.01
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:losingHigh4bytes.exe
losingHigh4bytes.obj

D:\_KAZE\Instructions_per_tick_during_branchless_decompression_32-threaded>losingHigh4bytes.exe
retLOCAL = 000000013F920040
((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)&FlagMASKnegated) = 000000013F92003B

D:\_KAZE\Instructions_per_tick_during_branchless_decompression_32-threaded>
*/

If I have overlooked something, I will try again. By the way, I have another issue to complain about (I guess 'internal heap' related), but one at a time.


Hi Georgi - I spent some time debugging the test case, but I'm afraid I still don't see where you think the compiler is doing something wrong.

Please tell me why you think this is somehow incorrect behavior:

unsigned 32-bit Flag with a value of zero - 1 = 0xffffffff

unsigned 64-bit FlagMASKnegated with a value of zero gets lower 32-bits set to 0xffffffff, giving it a 64-bit representation of 0x00000000ffffffff.

FlagMASKnegated = Flag - 1; // -1|0

FlagMASK= ~FlagMASKnegated;

The value is then bitwise-negated (~), giving FlagMASK a value of 0xffffffff00000000.

Why is that not expected behavior?  Nothing is getting truncated.  No higher bits are getting lost.  They were never there to begin with.  Is that the problem?  That seems to be the problem to me, still.  I don't see anywhere in that code where the compiler is doing something unexpected.  However, obviously, doing a bitwise negation on 0x00000000FFFFFFFF is going to give you a much different result than on 0xFFFFFFFFFFFFFFFF.

Also, I don't know anything about your greater compression algorithm and I'm not entirely clear how you are attempting to allocate the memory in the first place and so, I must assume that the rest of the code is correct.

If I'm still misunderstanding the problem, please decode your second parameter to memcpy so we know exactly what you intend that to be.  There is too much casting and shifting going on for me to know whether or not what you get is actually what you are intending to pass.

Beginner

>Please tell me why you think this is somehow incorrect behavior:

>unsigned 64-bit FlagMASKnegated with a value of zero gets lower 32-bits set to 0xffffffff

This is new to me. I intended (as stated in the comment below) 'Flag' to be either 0 or 1 before setting 'FlagMASKnegated' to either 0-1 or 1-1. Are we on the same page here?

        // In here Flag=0|1
        FlagMASKnegated= Flag - 1; // -1|0

If we are, then 'FlagMASKnegated' should be either 0x0000000000000000 or 0xFFFFFFFFFFFFFFFF. I have nothing else to add.
