
- Intel Community
- Software Development Technologies
- Intel® ISA Extensions
- Haswell New Instructions posted


Mark_B_Intel1

Employee


06-10-2011
07:20 PM


Haswell New Instructions posted

http://software.intel.com/file/m/36945. A blog will be coming shortly.

-Mark Buxton


15 Replies

sirrida

Beginner


06-11-2011
10:17 AM


I'm very happy to see that most of the integer commands have been promoted to YMM. This is essential for the graphics programming we do. AVX2 will surely be a big push for us.

The commands from the BMI groups will surely come in handy when used in a compiler, especially a JIT.

The new PDEP and PEXT will for sure cost some silicon. I'd like to see them acting on XMM and YMM registers too, preferably with an adjustable granularity; it does not matter if there is only one such unit per die.
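For readers unfamiliar with the two instructions under discussion, here is a minimal portable sketch of PDEP/PEXT semantics — plain C++ loops standing in for the BMI2 `_pdep_u64`/`_pext_u64` intrinsics; the function names are mine, not Intel's:

```cpp
#include <cassert>
#include <cstdint>

// Software model of PDEP: deposit the low bits of src, in order,
// into the positions where mask has a 1 bit.
uint64_t pdep_soft(uint64_t src, uint64_t mask) {
    uint64_t result = 0;
    for (uint64_t bit = 1; mask != 0; bit <<= 1) {
        uint64_t lowest = mask & (0 - mask); // isolate lowest set bit of mask
        if (src & bit) result |= lowest;
        mask &= mask - 1;                    // clear that set bit
    }
    return result;
}

// Software model of PEXT: gather the bits of src selected by mask
// and pack them contiguously at the low end of the result.
uint64_t pext_soft(uint64_t src, uint64_t mask) {
    uint64_t result = 0;
    for (uint64_t bit = 1; mask != 0; bit <<= 1) {
        uint64_t lowest = mask & (0 - mask);
        if (src & lowest) result |= bit;
        mask &= mask - 1;
    }
    return result;
}
```

PEXT with a given mask undoes PDEP with the same mask, which is what makes the pair useful for bit-field packing and unpacking.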

Unfortunately I'm not happy with the promoted vpshufb and palignr because they cannot operate across lanes. It will become difficult to e.g. convert an array of RGB pixels between AoS and SoA layouts. I also sorely miss a gather command for bytes and words; see my example (Lab color correction) in this forum.
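To illustrate the complaint, here is a scalar model (my own sketch, not Intel code) of how the 256-bit vpshufb behaves: each index byte can only select a source byte from its own 128-bit lane.

```cpp
#include <array>
#include <cstdint>

// Scalar model of AVX2 vpshufb: the register is treated as two
// independent 128-bit lanes that shuffle separately.
std::array<uint8_t, 32> vpshufb_model(const std::array<uint8_t, 32>& src,
                                      const std::array<uint8_t, 32>& idx) {
    std::array<uint8_t, 32> out{};
    for (int i = 0; i < 32; ++i) {
        int lane = i / 16;                  // 0 = low lane, 1 = high lane
        if (idx[i] & 0x80)
            out[i] = 0;                     // high bit set: zero the byte
        else
            out[i] = src[lane * 16 + (idx[i] & 0x0F)]; // only 4 index bits used
    }
    return out;
}
```

An index of 0 in byte 16 fetches src[16], not src[0] — there is no way to pull a byte across the lane boundary, which is exactly what AoS/SoA conversion needs.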

capens__nicolas

New Contributor I


06-12-2011
08:53 AM


Just to be clear, will Haswell support both FMA and AVX2?

capens__nicolas

New Contributor I


06-12-2011
05:37 PM


Quoting sirrida

Those things just aren't feasible. A 256-bit shuffle unit takes four times more area than a 128-bit one. It simply doesn't scale well to wider vectors. Note that AVX can widen to 512- and 1024-bit in the future, so it was necessary to keep things divided into manageable chunks. I think 128-bit lanes are a great compromise. Also note it's quite possible that 256-bit integer operations might actually be executed as two 128-bit parts, hence cross-lane operations also aren't easily possible. Frankly, I'm quite thrilled though to get such a complete instruction set with AVX2.

I'm more curious about what will happen to the IGP. A mainstream 8-core Haswell with FMA could deliver 1 TFLOP of computing power. Compare that to Ivy Bridge's IGP (also at 22 nm), which may not achieve more than 200 GFLOPS. It doesn't make sense to waste a lot of die area on a more powerful IGP. Instead, they could just use Larrabee's software rendering technology on the CPU cores. The only major issue I can see is high power consumption from the out-of-order execution. That could be solved by executing 1024-bit operations on 256-bit or 128-bit execution units, but support for wider registers hasn't been announced yet. Perhaps Haswell will be a sort of hybrid, with a simple IGP assisted by the CPU...
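The 1 TFLOP figure works out as a back-of-envelope estimate under assumptions of my own (two 256-bit FMA ports per core and a ~4 GHz clock, neither confirmed in this thread):

```cpp
// Back-of-envelope peak single-precision throughput for an FMA-capable core.
// Assumed figures (not from the post): 2 FMA ports per core, 4.0 GHz clock,
// 8 single-precision lanes per 256-bit vector, 2 FLOPs per FMA.
double haswell_peak_sp_flops(int cores) {
    const double fma_ports = 2.0, sp_lanes = 8.0, flops_per_fma = 2.0;
    const double clock_hz = 4.0e9;
    return cores * fma_ports * sp_lanes * flops_per_fma * clock_hz;
}
```

With 8 cores this comes to about 1.02e12 FLOPs, i.e. roughly the 1 TFLOP mentioned above.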

ange4771

Beginner


06-13-2011
03:50 AM


Edit: this post wasn't intended to be a reply to #3, but I can't delete it. The typo is in the official PDF document.

randombit

Beginner


06-14-2011
06:55 AM


Shouldn't the _pdep_u64 and _pext_u64 intrinsics take a 64-bit mask? (Pages 7-19 and 7-21)

gligoroski

Beginner


06-14-2011
09:05 AM


TimP

Black Belt


06-14-2011
12:12 PM


Quoting gligoroski

MarkC_Intel

Moderator


06-14-2011
12:26 PM


bronxzv

New Contributor II


06-14-2011
01:37 PM


Thanks for letting us know.

MarkC_Intel

Moderator


06-14-2011
01:45 PM


The FMA instructions are present & supported in the currently downloadable version of the emulator.

bronxzv

New Contributor II


06-14-2011
02:04 PM


Quoting MarkC_Intel

The FMA instructions are present & supported in the currently downloadable version of the emulator.

Neat! Thanks for your quick feedback. It will allow us to validate the FMA path far ahead of the final hardware; still waiting for a supporting compiler, though.

sirrida

Beginner


06-22-2011
05:44 AM


They act as if there are two XMM registers (slices) in one YMM instead of acting across lanes.

This will make porting difficult.

Here are some examples:

- pack/punpck
- pshufb
- palignr
- Horizontal ops, e.g. phadd
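For instance, the promoted punpcklbw interleaves within each 128-bit slice rather than across the full register. A scalar model (my own sketch) of the behavior described above:

```cpp
#include <array>
#include <cstdint>

// Scalar model of AVX2 vpunpcklbw: interleaves the low 8 bytes of a and b
// within EACH 128-bit lane, not the low 16 bytes of the whole register.
std::array<uint8_t, 32> vpunpcklbw_model(const std::array<uint8_t, 32>& a,
                                         const std::array<uint8_t, 32>& b) {
    std::array<uint8_t, 32> out{};
    for (int lane = 0; lane < 2; ++lane)
        for (int j = 0; j < 8; ++j) {
            out[lane * 16 + 2 * j]     = a[lane * 16 + j];
            out[lane * 16 + 2 * j + 1] = b[lane * 16 + j];
        }
    return out;
}
```

Byte 16 of the result is a[16], whereas a straightforward 256-bit widening of the SSE semantics would have produced a[8] — this is the porting hazard.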

gilgil

Beginner


06-30-2011
01:07 AM


Will they be implemented in the upcoming Ivy Bridge, or should we wait further?

capens__nicolas

New Contributor I


07-01-2011
02:35 AM


Quoting gilgil

Will they be implemented in the upcoming Ivy Bridge, or should we wait further?

Yes, Ivy Bridge will support them.

http://software.intel.com/en-us/blogs/2011/06/13/haswell-new-instruction-descriptions-now-available/

"These build upon the instructions coming in Intel microarchitecture code name Ivy Bridge, including the digital random number generator, half-float (float16) accelerators, and extend the Intel Advanced Vector extensions (Intel AVX) that launched in 2011."

Ryan_Wong

Beginner


11-18-2013
11:11 PM


I would like to see this particular function implemented as a single instruction.

This is a proposed instruction, shown here in its 128-bit (XMM) version. I will explain the effect of this function and also discuss how to derive the corresponding instruction for the 256-bit (YMM) version.

As a primer, I will use the notation defined on this webpage for this discussion: http://programming.sirrida.de/bit_perm.html#hypercube

__m128i IntraByteBitShuffle::operator() (__m128i input) const
{
    __m128i output = _mm_setzero_si128();
    output = _mm_insert_epi16(output, _mm_movemask_epi8(input), 7);
    input = _mm_slli_epi16(input, 1);
    output = _mm_insert_epi16(output, _mm_movemask_epi8(input), 6);
    input = _mm_slli_epi16(input, 1);
    output = _mm_insert_epi16(output, _mm_movemask_epi8(input), 5);
    input = _mm_slli_epi16(input, 1);
    output = _mm_insert_epi16(output, _mm_movemask_epi8(input), 4);
    input = _mm_slli_epi16(input, 1);
    output = _mm_insert_epi16(output, _mm_movemask_epi8(input), 3);
    input = _mm_slli_epi16(input, 1);
    output = _mm_insert_epi16(output, _mm_movemask_epi8(input), 2);
    input = _mm_slli_epi16(input, 1);
    output = _mm_insert_epi16(output, _mm_movemask_epi8(input), 1);
    input = _mm_slli_epi16(input, 1);
    output = _mm_insert_epi16(output, _mm_movemask_epi8(input), 0);
    return output;
}

First of all, we have seen numerous requests for Intel CPUs to support arbitrary bit-level permutation patterns, using various schemes such as butterfly networks and omega networks. These proposals are important in various domains, most notably cryptography and hash functions. The difficulty in implementing arbitrary bit-level permutations is that (1) they require a lot of silicon, (2) they need a multi-cycle instruction, or even a sequence of instructions, and (3) the amount of configuration data needed to encode a permutation pattern may span more than a handful of vector-wide registers.

For certain applications, it might suffice to support a more "regular" kind of permutation pattern: exchanging and gathering regularly spaced bits within a vector at an appropriate element size. The prime example is the PMOVMSKB instruction: it "skims off" the top bit of each byte in a vector and packs them into a 16-bit integer.
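A scalar model of that PMOVMSKB behavior (my own sketch, for readers without the manual at hand):

```cpp
#include <cstdint>

// Scalar model of PMOVMSKB: collect the most significant bit of each
// of the 16 bytes into one 16-bit integer (byte i -> result bit i).
uint16_t movemask_epi8_model(const uint8_t bytes[16]) {
    uint16_t mask = 0;
    for (int i = 0; i < 16; ++i)
        mask |= uint16_t((bytes[i] >> 7) & 1u) << i;
    return mask;
}
```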

If you have not read the Bit Permutation terminology article yet, please do so now. http://programming.sirrida.de/bit_perm.html#hypercube

In the C++ intrinsics code above, I use the PMOVMSKB intrinsic, alternating with PSLLW, eight times in a row. Each 16-bit result is inserted back into an XMM register. The net effect of this function is a bitwise permutation.

To understand this permutation, let us construct a table of two columns: Input-Index, and Output-Index. Each table cell contains the binary representation of a label of every bit in an XMM register. Since there are 128 bits in an XMM register, each label has 7 bits.

The correspondence is established by this binary pattern. Substituting each of { a, b, c, d, e, f, g } with { 0, 1 } will exhaustively define the input-output relationship for all 128 bits.

- Input pattern = MSB { g, f, e, d, c, b, a } LSB
- Output pattern = MSB { c, b, a, g, f, e, d } LSB

For example, substituting a = 0, b = 0, c = 1, d = 1, e = 0, f = 0, g = 1, we have Input = { binary(1001100) == decimal(76) } and Output = { binary(1001001) == decimal(73) }. This means the 76th bit of the input XMM register is copied into the 73rd bit of the output XMM register.
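The mapping can be checked mechanically: rearranging the label MSB { g, f, e, d, c, b, a } into MSB { c, b, a, g, f, e, d } is a rotation of the 7-bit label, so a small helper (mine, for illustration only) reproduces the example above:

```cpp
// Output bit position for a given input bit position (0..127) under the
// proposed shuffle: rotate the 7-bit label right by 3 (equivalently, left by 4).
int output_index(int input_index) {
    return ((input_index >> 3) | (input_index << 4)) & 0x7F;
}
```

Applying it to 76 (binary 1001100) yields 73 (binary 1001001), matching the example; and since a rotation is invertible, iterating over all 128 indices confirms the map is one-to-one.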

The significance of this proposed instruction is that it commutes bit-level data with byte-level data. Thanks to the byte-level permutation instruction PSHUFB, and also to the PUNPCK* family of instructions, once bit-level data is commuted to the byte level a variety of processing can occur with more freedom. Once processing is finished, a re-application (*) of this instruction can commute the byte-level changes back to bit level.

(*) This instruction is not a self-inverse function; it requires two additional PSHUFB operations to permute back to the original bit pattern.

Being a hard-wired one-to-one bitwise permutation, this instruction does not require any configuration parameter. It also does not require any logic gates in between. Therefore, I believe it is possible to implement it with a latency of a single clock cycle.
