Intel® ISA Extensions
Use hardware-based isolation and memory encryption to provide more code protection in your solutions.

Links to instruction documentation

Thomas_W_Intel
Employee
6,882 Views
0 Kudos
38 Replies
SergeyKostrov
Valued Contributor II
2,504 Views
>>...If a hardware engineer gives me a single number on this, I am certain that is not a complete picture...

Could you get that number? I'm sorry, but let us decide what to do next. As I've said several times: "...The latest edition of "Intel Optimization Reference Manual" ( 04.2012 ) has lots of details about these two instructions but for some unexplained reason latencies are not specified..."

I also don't see any logic in your statements:

...David Chaiken's recommendation on algorithm...
...I am certain that is not a complete picture...
...A number in CPU core cycle will certainly be useless...
...If your algorithm is able to deal with a range of values...
...If you can replace the longer journey with shorter ones...

Shih, we would like to see just two numbers ( !!! ), that is, the latencies for two Intel instructions and nothing else. Do you understand this?
0 Kudos
Thomas_W_Intel
Employee
2,504 Views
Sergey,

as Shiv has pointed out, the latency depends on several factors. Therefore, we need some information about the system that you are using:

What is the core architecture that you are using?
Which platform do you have?
What is the core and uncore frequency?
What DIMMs are you using (speed and rank) and how are they populated?
Do you need the loaded latency and if so what is the bandwidth that you have?

Kind regards

Thomas
0 Kudos
SergeyKostrov
Valued Contributor II
2,504 Views
Hi everybody,

>>...
>>What is the core architecture that you are using? Which platform do you have? What is the core and uncore frequency? What DIMMs
>>are you using (speed and rank) and how are they populated?

Here are some technical specs for my system:

Dell Precision Mobile M4700
Intel Core i7-3840QM ( Ivy Bridge / 4 cores / 8 logical processors )( http://ark.intel.com/compare/70846 )
16GB RAM
320GB HDD
NVIDIA Quadro K1000M ( 192 CUDA cores / 2GB memory )
Windows 7 Professional 64-bit

Best regards,
Sergey
0 Kudos
Bernard
Valued Contributor I
2,504 Views

>>>Once again, where could I find latencies for MOVNTDQ and VMOVNTDQ instructions?>>>

The latency of MOVNTDQ is given in Agner Fog's instruction tables; it is ~400 cycles on Haswell CPUs.
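
To put a cycle count like this into time units, it has to be divided by the core frequency, which is one reason a single number means little across systems. A minimal sketch of that arithmetic follows; the function name cycles_to_ns and the frequencies in the note after it are hypothetical, not measurements.

```cpp
#include <cassert>

// Convert a latency given in core clock cycles to nanoseconds.
// core_ghz is the core frequency in GHz, i.e. cycles per nanosecond.
// Hypothetical helper for illustration only.
double cycles_to_ns(double cycles, double core_ghz) {
    assert(core_ghz > 0.0);  // a zero or negative frequency makes no sense
    return cycles / core_ghz;
}
```

For example, 400 cycles is 200 ns on a core running at 2.0 GHz but 100 ns at 4.0 GHz, and this still ignores the uncore and DRAM clocks mentioned elsewhere in the thread.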

0 Kudos
paul_l_2
Beginner
2,504 Views

Hi,

I find that the C++ compiler doesn't generate AVX2 assembly when I write AVX2 intrinsics or inline assembly, although it does generate correct AVX assembly, so I am confused.

Some samples follow:

//////----intrinsic

b = _mm256_stream_load_si256(&a);
011E10A2  lea         eax,
011E10A8  db          c4h 
011E10A9  loop        wmain+118h (11E1128h)
011E10AB  sub         al,byte ptr [eax]
011E10AD  vmovdqu     ymmword ptr [ebp-138h],ymm0
011E10B5  vmovdqu     ymm0,ymmword ptr [ebp-138h]
011E10BD  vmovdqu     ymmword ptr ,ymm0

//////----inline assembly

__asm
 {
  vmovntdqa ymm0, a;
011E110A  db          c4h 
011E110B  loop        _wmain+17Ah (11E118Ah)
011E110D  sub         al,byte ptr
  vmovntpd b, ymm0;
011E1113  db          c5h 
011E1114  std             
011E1115  sub         eax,dword ptr
  vxorps  ymm1, ymm1, ymm1;
011E1118  vxorps      ymm1,ymm1,ymm1
  vpmulhw  ymm2, ymm2, ymm2;
011E111C  db          c5h 
011E111D  in          eax,dx
011E111E  in          eax,0D2h

}

 

thank you very much!
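
A note on the listings above: the bytes may already be correct AVX2 encodings that the disassembler simply fails to decode. Reading the first listing back into bytes (`db c4h`, then `loop`, which is opcode E2 with an 8-bit displacement of 7Dh since 011E10ABh + 7Dh = 011E1128h, then `sub al,byte ptr [eax]`, which is 2A 00) gives C4 E2 7D 2A 00, and that is the VEX encoding of vmovntdqa ymm0,[eax]. The sketch below decodes only the VEX prefix fields to check this. The names VexInfo, decode_vex and mnemonic are made up for illustration, the code assumes 32-bit code as in the listing, and the opcode table covers only the four instructions in the post.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Fields of a VEX prefix (C5 = 2-byte form, C4 = 3-byte form).
// Hypothetical helper type, not part of any Intel or Microsoft API.
struct VexInfo {
    bool valid = false;       // false if the bytes are not a VEX instruction
    int map = 0;              // opcode map: 1 = 0F, 2 = 0F38, 3 = 0F3A
    int pp = 0;               // implied prefix: 0 = none, 1 = 66, 2 = F3, 3 = F2
    int vector_bits = 0;      // 128 or 256, taken from VEX.L
    int vvvv = 0;             // extra source register (stored inverted in the prefix)
    std::uint8_t opcode = 0;  // opcode byte that follows the prefix
    std::uint8_t modrm = 0;   // ModRM byte, 0 if not present in the input
};

VexInfo decode_vex(const std::vector<std::uint8_t>& b) {
    VexInfo v;
    std::size_t opcode_at = 0;
    std::uint8_t last = 0;  // the prefix byte that holds vvvv, L and pp
    if (b.size() >= 4 && b[0] == 0xC4) {
        v.map = b[1] & 0x1F;
        last = b[2];
        opcode_at = 3;
    } else if (b.size() >= 3 && b[0] == 0xC5) {
        v.map = 1;  // the 2-byte form always implies the 0F map
        last = b[1];
        opcode_at = 2;
    } else {
        return v;
    }
    // In 32-bit code C4/C5 are VEX only when the top two bits of the next
    // byte are 11; otherwise they are the legacy LES/LDS instructions.
    if ((b[1] & 0xC0) != 0xC0) return v;
    assert(opcode_at < b.size());
    v.vvvv = ~(last >> 3) & 0xF;
    v.vector_bits = (last & 0x04) ? 256 : 128;
    v.pp = last & 0x03;
    v.opcode = b[opcode_at];
    v.modrm = (opcode_at + 1 < b.size()) ? b[opcode_at + 1] : 0;
    v.valid = true;
    return v;
}

// Tiny lookup limited to the instructions that appear in the post.
std::string mnemonic(const VexInfo& v) {
    if (!v.valid) return "?";
    if (v.map == 2 && v.pp == 1 && v.opcode == 0x2A) return "vmovntdqa";
    if (v.map == 1 && v.pp == 1 && v.opcode == 0x2B) return "vmovntpd";
    if (v.map == 1 && v.pp == 0 && v.opcode == 0x57) return "vxorps";
    if (v.map == 1 && v.pp == 1 && v.opcode == 0xE5) return "vpmulhw";
    return "?";
}
```

Given C4 E2 7D 2A 00 this reports a 256-bit instruction in the 0F38 map with opcode 2A (vmovntdqa), and given C5 ED E5 D2 (the bytes behind `db c5h` / `in eax,dx` / `in eax,0D2h`) it reports vpmulhw with ymm2 as the extra source. If that reading is right, the compiler output is fine and a disassembler or debugger that knows the AVX2 encodings should show the expected mnemonics.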

0 Kudos
Sergio_J__C_
Beginner
2,504 Views

Very nice, thanks bro

0 Kudos
adel_s_1
Beginner
2,504 Views

Information is very valuable

0 Kudos
Amir_K_2
Beginner
2,504 Views

Thanks for sharing the links

Best Regards 
Amir

0 Kudos
Islam_A_
Beginner
2,504 Views

Thomas,

Is there a downloadable PDF of the Optimization Reference Manual? I'm not finding it.

Also, is there any published data on the expected performance of the various AVX intrinsics relative to SSE, broken down by cache level? E.g. vmulps is 2X faster in L1, 1.8X faster in L2, etc. Maybe that's a dumb question, but it's hard to tell if code is optimal without some idea of the ideal hardware throughput.

Thanks for the pointers,
 

0 Kudos
McCalpinJohn
Honored Contributor III
2,504 Views

The best way to find the Intel Optimization Reference Manual is to do a search on the document number.   E.g., with Google, the search would be "248966 site:intel.com".    The PDF should be one of the first results.  
Searching for "248966" using the Intel website internal search engine also gets the result quickly.

The most recent update is revision 033, dated June 2016.

To help make these searches easier, I typically rename the PDF files on my system to include both a descriptive name and the full document number (including revision).  Then I don't have to open the document to look up the number when I do my periodic checks for new versions.

0 Kudos
Anton_R_
Beginner
2,504 Views

Hello,

Does Intel perhaps have the instruction set reference in some formal format suitable for reading programmatically, e.g. XML? Can I have it?

Thank you,

Anton

0 Kudos
Thomas_W_Intel
Employee
2,504 Views

Anton,

unfortunately, I'm not aware of such an instruction set reference that is easily parsable by programs.

Kind regards

Thomas

0 Kudos
sirrida
Beginner
2,504 Views

You can obtain a machine readable instruction set reference e.g. at http://www.nasm.us/pub/nasm/snapshots/latest/ (NASM) in the source file insns.dat of e.g. nasm-2.12.01rc1-20160308.zip. It should be quite complete and up to date.

0 Kudos
sirrida
Beginner
2,504 Views

The sources of NASM (http://nasm.us/) contain a machine readable instruction set reference.
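
For anyone wanting to read that file programmatically, here is a minimal sketch of a line parser. It assumes insns.dat uses four whitespace-separated columns (mnemonic, operands, encoding in square brackets, comma-separated flags) with ';' starting a comment line. That layout is an assumption and should be checked against the copy of insns.dat you actually download. The names InsnEntry and parse_insn_line are made up for illustration.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// One parsed row of the table (hypothetical type, not part of NASM).
struct InsnEntry {
    std::string mnemonic;  // e.g. "MOVNTDQ"
    std::string operands;  // e.g. "mem,xmmreg"
    std::string encoding;  // text between '[' and ']', or a bare token
    std::string flags;     // comma-separated CPU/feature flags
};

// Strip leading and trailing whitespace.
static std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    std::string::size_type b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    std::string::size_type e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// Parse one line; returns false for blank lines, comments and rows that
// do not match the assumed four-column layout.
bool parse_insn_line(const std::string& raw, InsnEntry& out) {
    std::string line = trim(raw);
    if (line.empty() || line[0] == ';') return false;

    std::istringstream in(line);
    InsnEntry e;
    if (!(in >> e.mnemonic >> e.operands)) return false;

    std::string rest;
    std::getline(in, rest);
    rest = trim(rest);
    if (rest.empty()) return false;

    if (rest[0] == '[') {
        // The bracketed encoding may contain spaces, so split on ']'.
        std::string::size_type close = rest.find(']');
        if (close == std::string::npos) return false;
        e.encoding = trim(rest.substr(1, close - 1));
        e.flags = trim(rest.substr(close + 1));
    } else {
        // Rows without brackets: take the next token as the encoding field.
        std::istringstream r(rest);
        r >> e.encoding;
        std::getline(r, e.flags);
        e.flags = trim(e.flags);
    }
    out = e;
    return true;
}
```

As an illustration, a line shaped like `MOVNTDQ mem,xmmreg [mr: 66 0f e7 /r] WILLAMETTE,SSE2,SO` would come back with mnemonic MOVNTDQ, operands mem,xmmreg, encoding `mr: 66 0f e7 /r` and flags WILLAMETTE,SSE2,SO. Calling the function once per line read with std::getline is enough to build a table in memory.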

0 Kudos
james_l_3
Beginner
2,504 Views

Hi Sergey

The best advice I could offer is to borrow from an article I read about David Chaiken's recommendation on the algorithm.

To design a suitable algorithm, think about its performance model underneath.

If a hardware engineer gives me a single number on this, I am certain that is not a complete picture, and it would be a disservice to publish a number, given the complexity of the situations that software can be deployed into across a wide variety of platforms.

A number in CPU core cycles will certainly be useless, considering the core operates in a different clock domain. I believe the DRAM subsystem may bring another clock domain into the picture.

If your software gets deployed on a multi-socket platform, what kind of complications will snoop bring?

0 Kudos
Note__Mark
Beginner
2,504 Views

Thanks. Read it, very informative :-)
 

0 Kudos
DANNIE__SANG
Beginner
2,504 Views

Brijender Bharti (Intel) wrote:

Hi,
Please use the following link:
http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia...

It will open the reading pane. In the top right-hand corner there is a down arrow button, which means download (next to print).

Hi, thanks for the tip :)

0 Kudos
Podjachev__Evgeny
2,504 Views

Hi, the site https://software.intel.com/sites/landingpage/IntrinsicsGuide doesn't work. It loads but doesn't show any intrinsics. Can it be fixed? Or is there a PDF version of it?

0 Kudos