Software Archive
Read-only legacy content

64-bit integer multiplication

Cheng_C_
Beginner
2,545 Views

Dear all,

I need to perform some 64-bit integer multiplications.

On a Xeon CPU, for the scalar multiplication operation, I have the mulq assembly instruction. It multiplies two 64-bit integers and stores the 128-bit result in two 64-bit registers, one for the low 64 bits and one for the high 64 bits of the result. I want to perform the calculation on the Phi and vectorize it.

I looked through the Xeon Phi Coprocessor Instruction Set Reference Manual, and I found the vectorizable VPMULHD and VPMULLD instructions (and the equivalent intrinsics). However, they only operate on 32-bit integers, and I have to use two instructions (VPMULHD and VPMULLD) to get the low and high parts of the 64-bit result separately. Obviously, this is less efficient than the mulq instruction for 64-bit integer multiplication, if mulq could be vectorized on the Phi.

My questions are:
1. Is there an equivalent instruction to mulq on the Xeon Phi, so that I can vectorize the operations?
2. If I have to use the VPMULHD and VPMULLD instructions for 64-bit integer multiplication, I first need to multiply pairs of 32-bit integers. Each of those instructions multiplies two 32-bit integers but stores only the high part or the low part of the 64-bit result, so I need two instructions per 32-bit multiplication. Is there a way to perform the full 32-bit multiplication in only one instruction, and vectorize it at the same time?

Thanks very much for the help.

8 Replies
jimdempseyatthecove
Honored Contributor III

It is much worse than that.

AL = low 32 bits of 64-bit A, AH = high 32 bits of 64-bit A, BL = low 32 bits of 64-bit B, BH = high 32 bits of 64-bit B.

CLL = low 32 bits of the low 64 bits of 128-bit result C, CLH = high 32 bits of the low 64 bits of 128-bit result C.

CHL = low 32 bits of the high 64 bits of 128-bit result C, CHH = high 32 bits of the high 64 bits of 128-bit result C.

I will let you sort out the unsigned and signed intrinsics, as well as the suitability of using temporaries versus redoing a multiply.

CLL = low(AL * BL)

CLH = high(AL*BL) + low(AH*BL) + low(AL*BH)

CHL = high(AH*BL) + high(AL*BH) + low(AH * BH)

CHH = high(AH * BH)

Added to the above is potential carry detection and propagation.

I think you are much better off using scalar instructions in GP registers.

Jim Dempsey

 

Cheng_C_
Beginner

jimdempseyatthecove wrote:

It is much worse than that.

AL = low 32 bits of 64-bit A, AH = high 32 bits of 64-bit A, BL = low 32 bits of 64-bit B, BH = high 32 bits of 64-bit B.

CLL = low 32 bits of the low 64 bits of 128-bit result C, CLH = high 32 bits of the low 64 bits of 128-bit result C.

CHL = low 32 bits of the high 64 bits of 128-bit result C, CHH = high 32 bits of the high 64 bits of 128-bit result C.

I will let you sort out the unsigned and signed intrinsics, as well as the suitability of using temporaries versus redoing a multiply.

CLL = low(AL * BL)

CLH = high(AL*BL) + low(AH*BL) + low(AL*BH)

CHL = high(AH*BL) + high(AL*BH) + low(AH * BH)

CHH = high(AH * BH)

Added to the above is potential carry detection and propagation.

I think you are much better off using scalar instructions in GP registers.

Jim Dempsey

 

Dear Jim,

Thanks very much for the detailed explanation and your suggestion. So you mean there is no vectorizable instruction on the Xeon Phi that directly multiplies two 64-bit integers?

Regards,

Cheng

McCalpinJohn
Honored Contributor III

The Xeon Phi Instruction Set Architecture manual (document 327364-001, September 2012) does not list any arithmetic instructions for vectors of 64-bit integers.  There appears to be a complete set of load/store and Boolean operations for this data type, but no add/subtract/multiply/etc.

Cheng_C_
Beginner

John D. McCalpin wrote:

The Xeon Phi Instruction Set Architecture manual (document 327364-001, September 2012) does not list any arithmetic instructions for vectors of 64-bit integers.  There appears to be a complete set of load/store and Boolean operations for this data type, but no add/subtract/multiply/etc.

Dear John,

Got it. Thanks very much for the reply.

Regards,

Cheng

jimdempseyatthecove
Honored Contributor III

Cheng,

As you can see from my abstract notation, you have at least 12 * and + operations. Add to this the carry propagation and you would be up to at least 20 operations, possibly somewhat more.

This would be an interesting case to study. The scalar code, though it can do the multiply in one operation, has the issue of more loads and stores (even if they come from L1).

Phi: two loads, 20-30 operations, two stores (24 to 34 instructions, 4x L1 latencies) to produce 8 128-bit results.

Scalar: 16 loads, 8 operations, 16 stores (40 instructions, 32x L1 latencies) to produce 8 128-bit results.

As ugly as the vector form will be, it still could outperform the scalar code.

This does assume you have full vectors to work with.

Jim Dempsey

Cheng_C_
Beginner

jimdempseyatthecove wrote:

Cheng,

As you can see from my abstract notation, you have at least 12 * and + operations. Add to this the carry propagation and you would be up to at least 20 operations, possibly somewhat more.

This would be an interesting case to study. The scalar code, though it can do the multiply in one operation, has the issue of more loads and stores (even if they come from L1).

Phi: two loads, 20-30 operations, two stores (24 to 34 instructions, 4x L1 latencies) to produce 8 128-bit results.

Scalar: 16 loads, 8 operations, 16 stores (40 instructions, 32x L1 latencies) to produce 8 128-bit results.

As ugly as the vector form will be, it still could outperform the scalar code.

This does assume you have full vectors to work with.

Jim Dempsey

Jim,

Thanks very much. This is really helpful. By the way, do all these instructions take the same number of CPU cycles? Intuitively, I would think mulq needs more cycles than VPMULHD and VPMULLD, as it apparently does more work.

Regards,

Cheng

jimdempseyatthecove
Honored Contributor III

MUL of 64 bits (GP registers), depending on the architecture, has a latency of 10 and a throughput of 1 (this is on the host processor; I do not have the table handy for Xeon Phi, but I imagine it is about the same).

PMULxx on the host has a latency of 3 and a throughput of 1 (again, host numbers; I do not have the Xeon Phi table handy, but I imagine it is about the same).

Assuming you have large arrays of Int64s to multiply, both have a throughput of 1 cycle. So with loop unrolling and sufficient registers, you could conceivably hit 1 cycle per multiply and one cycle per non-dependent add. So the determining factors would be the number of instructions .AND. the number of L1 (and L2, and RAM) latencies. The scalar method has significantly more L1 latencies. The number of L2 and RAM latencies would likely be the same.

I think rather than philosophizing on the matter, it would be more beneficial to write the code and perform the acid test.

Jim Dempsey

 

Cheng_C_
Beginner

jimdempseyatthecove wrote:

MUL of 64 bits (GP registers), depending on the architecture, has a latency of 10 and a throughput of 1 (this is on the host processor; I do not have the table handy for Xeon Phi, but I imagine it is about the same).

PMULxx on the host has a latency of 3 and a throughput of 1 (again, host numbers; I do not have the Xeon Phi table handy, but I imagine it is about the same).

Assuming you have large arrays of Int64s to multiply, both have a throughput of 1 cycle. So with loop unrolling and sufficient registers, you could conceivably hit 1 cycle per multiply and one cycle per non-dependent add. So the determining factors would be the number of instructions .AND. the number of L1 (and L2, and RAM) latencies. The scalar method has significantly more L1 latencies. The number of L2 and RAM latencies would likely be the same.

I think rather than philosophizing on the matter, it would be more beneficial to write the code and perform the acid test.

Jim Dempsey

 

Jim,

OK. Thanks for all the help and have a nice day.

Regards,

Cheng
