Are there any instructions on k1om that can replace the x86_64 lfence instruction?

anxuan_y_
Beginner
1,754 Views

I'm compiling Supersonic, an open-source database from Google, for the Intel Xeon Phi using icc with the -mmic option.

The source code contains some lfence instructions, and it seems that the Phi doesn't support lfence, so I want to replace it with some other instruction that the Phi does support.

Is that feasible? For example:

inline Atomic32 Barrier_AtomicIncrement(volatile Atomic32* ptr,
                                        Atomic32 increment) {
  Atomic32 temp = increment;
  __asm__ __volatile__("lock; xaddl %0,%1"
                       : "+r" (temp), "+m" (*ptr)
                       : : "memory");
  // temp now holds the old value of *ptr
  if (AtomicOps_Internalx86CPUFeatures.has_amd_lock_mb_bug) {
    __asm__ __volatile__("lfence" : : : "memory");
  }
  return temp + increment;
}

Thanks for any help.

1 Solution
James_C_Intel2
Employee
1,754 Views

In the example code you have there, the lfence is redundant anyway (since it immediately follows an atomic operation which is itself a full fence). It seems to be there to work around some bug in an AMD processor?
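
As a rough illustration of that point, here is a minimal sketch of the same function with the lfence compiled out for the coprocessor. It assumes that icc defines __MIC__ when invoked with -mmic. The names Barrier_AtomicIncrement_sketch and has_amd_lock_mb_bug_sketch are hypothetical stand-ins, not Supersonic symbols, and the __sync_add_and_fetch branch is only there so the sketch also builds on non-x86 hosts.

```c
#include <stdint.h>

typedef int32_t Atomic32;

/* Hypothetical stand-in for
 * AtomicOps_Internalx86CPUFeatures.has_amd_lock_mb_bug. */
int has_amd_lock_mb_bug_sketch = 0;

Atomic32 Barrier_AtomicIncrement_sketch(volatile Atomic32 *ptr,
                                        Atomic32 increment) {
#if defined(__x86_64__) || defined(__i386__)
    Atomic32 temp = increment;
    /* The locked xadd is itself a full fence. */
    __asm__ __volatile__("lock; xaddl %0,%1"
                         : "+r" (temp), "+m" (*ptr)
                         : : "memory");
    /* temp now holds the old value of *ptr */
#if !defined(__MIC__)
    /* Assumption: __MIC__ is defined for -mmic builds, so the lfence
     * never reaches the k1om assembler. */
    if (has_amd_lock_mb_bug_sketch) {
        __asm__ __volatile__("lfence" : : : "memory");
    }
#endif
    return temp + increment;
#else
    /* Portable fallback (GCC-style builtin), not part of the original. */
    return __sync_add_and_fetch(ptr, increment);
#endif
}
```

The workaround path is kept for ordinary x86 builds so that behaviour there is unchanged; only the k1om build loses the instruction.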

More generally

  • KNC is an in-order processor, so memory fences are not normally necessary.
  • The one place they are required is after NGO (non-globally-ordered) stores. The Intel compiler will insert the necessary fence if it generates an NGO store. Provided that you do the same when you write assembler code with an NGO store in it, you won't need explicit memory fences elsewhere.
  • You should, though, ensure that you have a compiler fence (to force the compiler to write variables it has cached in registers back to memory).
  • The instruction to use after an NGO store to enforce the full fence is lock; addl $0,0(%rsp), which is a full fence and will normally execute in 5 cycles or so (assuming the top of the stack is still in L1 cache).

So, the two snippets of code you likely need look like this:

// Use everywhere a memory fence of any kind was intended
#define COMPILER_FENCE() __asm__ volatile ("":::"memory")

// Use after an NGO store 
#define MFENCE() __asm__ volatile ("lock; addl $0,0(%%rsp)":::"memory")
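
To make the role of the compiler fence concrete, here is a small sketch of a one-slot mailbox that relies on it. The mailbox_* names are hypothetical and exist only for this illustration. The sketch leans on the in-order behaviour described above for KNC; on other processors a hardware fence may also be needed, so treat it as an illustration rather than a portable recipe.

```c
#include <stddef.h>

#define COMPILER_FENCE() __asm__ volatile ("":::"memory")

/* Hypothetical one-slot mailbox, for illustration only. */
int mailbox_payload;
volatile int mailbox_ready;

/* Writer: store the payload first, then raise the flag. */
void mailbox_publish(int value) {
    mailbox_payload = value;
    COMPILER_FENCE();  /* the compiler may not move the payload store below here */
    mailbox_ready = 1;
}

/* Reader: returns 1 and fills *out if a value was waiting, else 0. */
int mailbox_try_consume(int *out) {
    if (out == NULL || !mailbox_ready) {
        return 0;
    }
    COMPILER_FENCE();  /* the compiler may not hoist the payload load above the flag check */
    *out = mailbox_payload;
    COMPILER_FENCE();  /* finish reading before the slot is released */
    mailbox_ready = 0;
    return 1;
}
```

The empty asm statement emits no instruction at all; its "memory" clobber only stops the compiler from caching or reordering memory accesses across it.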

 

JJK
New Contributor III
1,754 Views

This is Google's protobuf code, which contains inline assembly.

What is happening here is that the protobuf code uses some hand-coded assembly for the x86_64 architecture. ICC defines the __x86_64__ macro when compiling for the MIC, but the instruction set there is different.

Below is a patch to the protobuf code so that it compiles with -mmic.

I've gone ahead and compiled Supersonic 0.9.4 for the Xeon Phi, but this was not trivial. You will need to build your own version of Boost, as the Intel-supplied version in the MPSS k1om RPMs is too old. Also, not all unit tests pass on the Phi.

diff -Naur protobuf-2.6.1/config.sub protobuf-2.6.1mic/config.sub
--- protobuf-2.6.1/config.sub	2014-10-22 22:10:28.000000000 +0200
+++ protobuf-2.6.1mic/config.sub	2015-05-20 17:36:27.842674200 +0200
@@ -265,6 +265,7 @@
 	| hexagon \
 	| i370 | i860 | i960 | ia64 \
 	| ip2k | iq2000 \
+	| k1om \
 	| le32 | le64 \
 	| lm32 \
 	| m32c | m32r | m32rle | m68000 | m68k | m88k \
diff -Naur protobuf-2.6.1/gtest/build-aux/config.sub protobuf-2.6.1mic/gtest/build-aux/config.sub
--- protobuf-2.6.1/gtest/build-aux/config.sub	2014-10-22 22:10:25.000000000 +0200
+++ protobuf-2.6.1mic/gtest/build-aux/config.sub	2015-05-20 17:36:27.842674200 +0200
@@ -265,6 +265,7 @@
 	| hexagon \
 	| i370 | i860 | i960 | ia64 \
 	| ip2k | iq2000 \
+	| k1om \
 	| le32 | le64 \
 	| lm32 \
 	| m32c | m32r | m32rle | m68000 | m68k | m88k \
diff -Naur protobuf-2.6.1/src/google/protobuf/stubs/platform_macros.h protobuf-2.6.1mic/src/google/protobuf/stubs/platform_macros.h
--- protobuf-2.6.1/src/google/protobuf/stubs/platform_macros.h	2014-10-21 02:01:40.000000000 +0200
+++ protobuf-2.6.1mic/src/google/protobuf/stubs/platform_macros.h	2015-05-20 17:40:55.550484335 +0200
@@ -41,7 +41,7 @@
 //   http://www.agner.org/optimize/calling_conventions.pdf
 //   or with gcc, run: "echo | gcc -E -dM -"
 #if defined(_M_X64) || defined(__x86_64__)
-#define GOOGLE_PROTOBUF_ARCH_X64 1
+//#define GOOGLE_PROTOBUF_ARCH_X64 1
 #define GOOGLE_PROTOBUF_ARCH_64_BIT 1
 #elif defined(_M_IX86) || defined(__i386__)
 #define GOOGLE_PROTOBUF_ARCH_IA32 1
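
The patch above disables the x86_64 code path for every target. A possible variation, sketched below, keeps it for ordinary x86_64 builds and skips it only for the coprocessor. This is untested against protobuf itself: it assumes icc defines __MIC__ under -mmic, and the EXAMPLE_* macros and example_uses_x64_asm are hypothetical stand-ins for the GOOGLE_PROTOBUF_ARCH_* macros.

```c
/* EXAMPLE_* names are hypothetical stand-ins for GOOGLE_PROTOBUF_ARCH_*. */
#if defined(_M_X64) || defined(__x86_64__)
#if !defined(__MIC__)
/* Ordinary x86_64: keep the hand-coded assembly path enabled. */
#define EXAMPLE_ARCH_X64 1
#endif
/* Both x86_64 and k1om are 64-bit targets. */
#define EXAMPLE_ARCH_64_BIT 1
#endif

/* Reports which path the preprocessor selected (1 = x86_64 assembly). */
int example_uses_x64_asm(void) {
#if defined(EXAMPLE_ARCH_X64)
    return 1;
#else
    return 0;
#endif
}
```

With a guard like this, the same patched tree could in principle be used for both host and coprocessor builds.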

 

anxuan_y_
Beginner
1,754 Views

Thank you very much, James. I have removed all 'lfence' and 'mfence' instructions from the source file. There are still some other errors, but this issue has been resolved. I do have some questions, though: Do NGO store instructions exist only on the Intel Phi? When should I use them? And do I need to add them when I port a program from x86_64 to the Phi?

anxuan_y_
Beginner
1,754 Views

I am sincerely grateful to you for your help, JJK.

There is also some inline assembly code in Supersonic's source file 'supersonic/utils/atomicops-internals-x86.h'. I removed the lfence and mfence instructions from it, as James Cownie suggested, and it worked.
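
One caveat when deleting such statements outright: the "memory" clobber goes away together with the instruction, so the compiler fence recommended earlier in the thread is lost as well. A minimal sketch of a replacement that keeps it is shown below. It assumes the header has a barrier helper that emits mfence; MemoryBarrier_sketch is a hypothetical name, __MIC__ is assumed to be defined by icc under -mmic, and the __sync_synchronize branch is only a fallback for other hosts.

```c
/* Hypothetical replacement for an mfence-based barrier helper. */
void MemoryBarrier_sketch(void) {
#if defined(__MIC__)
    /* KNC: no fence instruction, but still a compiler fence. */
    __asm__ __volatile__("" : : : "memory");
#elif defined(__x86_64__)
    /* Ordinary x86_64: keep the hardware fence. */
    __asm__ __volatile__("mfence" : : : "memory");
#else
    /* Other hosts: GCC-style full barrier builtin. */
    __sync_synchronize();
#endif
}
```

If the code being ported contains NGO stores, the locked add to the stack described earlier in the thread would be needed in the __MIC__ branch instead of the empty statement.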

James_C_Intel2
Employee
1,754 Views

But I have some questions:

Do NGO store instructions only exist in Intel Phi?

The non-globally-ordered store instructions exist only on KNC. There are similar instructions (non-temporal stores) in SSE and later vector instruction sets on Xeon. All of these are optimizations to improve the use of caches and memory bandwidth. 

When should I use them? 

Probably never. 

And do I need to add them when I port the program from X86_64 to Phi?

No, at least initially just let the compiler do its job and don't worry about this. If there are no NT loads/stores in the original assembly code which you're porting, that's a good sign that you don't need to worry about using NGO stores on KNC.
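
For reference when scanning code for such stores, here is a rough sketch of what a non-temporal store loop tends to look like on the Xeon side when written with SSE2 intrinsics. The function name fill_streaming_sketch is hypothetical, and the plain loop is a fallback for targets without SSE2 or for unaligned buffers. It illustrates the Xeon instructions only, not KNC's NGO stores.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: set count 32-bit elements of dst to value. */
void fill_streaming_sketch(int32_t *dst, size_t count, int32_t value) {
    size_t i = 0;
#if defined(__SSE2__)
    /* _mm_stream_si128 requires a 16-byte aligned destination. */
    if (((uintptr_t)dst % 16) == 0) {
        __m128i v = _mm_set1_epi32(value);
        for (; i + 4 <= count; i += 4) {
            /* Non-temporal store: written around the cache hierarchy. */
            _mm_stream_si128((__m128i *)(dst + i), v);
        }
        /* Order the streaming stores before any later stores. */
        _mm_sfence();
    }
#endif
    /* Tail elements, or the whole buffer on the fallback path. */
    for (; i < count; ++i) {
        dst[i] = value;
    }
}
```

In assembly listings the same pattern usually shows up as movnt* instructions followed by an sfence.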

For more details on non-temporal memory accesses, Ulrich Drepper has a good article on LWN, part of his "What every programmer should know about memory" series. (The whole series is worth reading if you have the time.)
