Floating Point Matrix Multiplication

Altera_Forum
Honored Contributor II

I want to multiply a 56x56 matrix with a 56x1 matrix in floating point. 

There is an altfp_matrix_mult megafunction, but it does not compute this quickly enough. 

I am looking for ideas to implement this in floating point. Please let me know if you have any ideas. Thanks.
18 Replies
Altera_Forum
Honored Contributor II

Is latency really a factor? What's the application for this? 

 

FPGAs really don't like doing floating point. I would recommend trying to convert it to fixed point, as that hugely reduces the latency and logic requirements. If you really have to do it in floating point, you're stuck with long latency and large resource requirements.
Altera_Forum
Honored Contributor II

The altfp_matrix_mult handbook gives an overview of required FPGA resources versus GFlop/s throughput. You can hardly expect to achieve a better result with a different FP design, so it can basically answer the question of whether the intended design is feasible at all. 

 

If the achievable GFlop/s figure isn't an issue, but altfp_matrix_mult doesn't fit the design structure, then it can make sense to think about a different FP design.
Altera_Forum
Honored Contributor II

Are you resource starved? What happens if you change the calculation to 56 dot products, i.e. each row of the 56x56 matrix with the same 56-element vector?
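
For illustration, here is a minimal NumPy sketch of that row-wise decomposition (the array names are made up for this example):

import numpy as np

A = np.random.rand(56, 56).astype(np.float32)   # the 56x56 matrix
x = np.random.rand(56).astype(np.float32)       # the 56x1 vector

# The full matrix-vector product ...
y_full = A @ x

# ... is the same as 56 independent dot products, one per row,
# which can be scheduled in parallel or in smaller groups on the FPGA.
y_rows = np.array([np.dot(A[i, :], x) for i in range(56)], dtype=np.float32)

assert np.allclose(y_full, y_rows)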

Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

Is latency really a factor? What's the application for this? 

 

FPGAs really don't like doing floating point. I would recommend trying to convert it to fixed point, as that hugely reduces the latency and logic requirements. If you really have to do it in floating point, you're stuck with long latency and large resource requirements. 

--- Quote End ---  

 

I want to first see if this can be done in floating point before looking at fixed point.  

The result needs to be available every 6 us. This includes the time to load the matrices, at least the smaller one. The bigger one does not change frequently. 

 

 

Yes, I want to do it with the least amount of resources. 

 

I am targeting a Stratix III, so if I do 56 multiplications in parallel that will use up 224/288, roughly 80% of the multipliers, just for this.
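
As a quick sanity check of that multiplier budget (assuming each single-precision FP multiply maps to four 18x18 DSP elements, which is what the 224 figure implies):

fp_mults_in_parallel = 56
dsp_per_fp_mult = 4            # assumption: 4 18x18 elements per FP multiply
dsp_available = 288            # multipliers in the targeted Stratix III device

dsp_used = fp_mults_in_parallel * dsp_per_fp_mult   # 224
print(dsp_used, dsp_used / dsp_available)           # 224, ~0.78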
Altera_Forum
Honored Contributor II

Since 56 in parallel uses too many resources, split the large matrix into groups of rows, enough to meet the latency requirement. For example, use 7 matrix multiplies, each processing 8 rows, then recollect the seven 8x1 results.
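
A small NumPy sketch of that blocked scheme, just to show the bookkeeping (names are illustrative):

import numpy as np

A = np.random.rand(56, 56).astype(np.float32)
x = np.random.rand(56).astype(np.float32)

# Split the 56 rows into 7 blocks of 8 rows; each block is an
# independent 8x56 * 56x1 product that could map to its own
# (smaller) matrix-multiply instance.
partial_results = [A[i:i + 8, :] @ x for i in range(0, 56, 8)]

# Recollect the seven 8x1 partial results into the full 56x1 result.
y = np.concatenate(partial_results)

assert np.allclose(y, A @ x)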

Altera_Forum
Honored Contributor II

I generated the megafunction with Q9.0 

Columns AA : 64 (it does not allow 56) 

Rows AA : 56 

Columns BB : 1 

Vector size : 16 

Block size : 2 

 

Synthesis gave me: 

64 multipliers (22%) 

14% logic 

7% memory. 

 

When I simulate it, the time period between successive results is 18 us (it takes nearly 6 us to output the result). 

 

Is it possible to input and output data faster in order to get higher throughput? 

 

Even with a vector size of 32 it takes the same amount of time to compute, and ~3 us to output the result. 

 

Do you know how many Gigaflops I can expect from this design? 

I believe I am getting 0.38 Gigaflops (the benchmarks have designs with 4 - 55 Gigaflops on Stratix III). How is it computed?
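
For what it's worth, one common way to compute the figure (not necessarily how the handbook benchmark counts it) is useful floating-point operations per result divided by the time per result. A rough sketch with the numbers from this post:

mults = 56 * 56          # 3136 multiplications per 56x56 * 56x1 result
adds  = 56 * 55          # 3080 additions
flops = mults + adds     # 6216 useful FLOPs per result

time_per_result = 18e-6  # seconds, from the simulation above

print(flops / time_per_result / 1e9)   # ~0.35 GFlop/s

Counting the zero-padded columns the megafunction actually processes (64 instead of 56) gives a somewhat higher figure, closer to the 0.38 quoted above.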
Altera_Forum
Honored Contributor II

I have uploaded the waveform showing the time between 2 output elements: 20 clocks at a clock period of 5 ns.

Altera_Forum
Honored Contributor II

Looking at the altfp_matrix_mult handbook, I fear it's not designed to perform a fast multiplication with an [m,1] vector "matrix". All examples are [n,m] x [m,n]. Hopefully it at least gives correct results. You most likely need to implement an optimized algorithm yourself.

Altera_Forum
Honored Contributor II

You could also try rearranging the matrix and vector and check if the implementation works on rows faster than columns. Try using BB transpose for AA and AA transpose for BB. Mathematically, the result = AA * BB = (BB' * AA')'. See http://en.wikipedia.org/wiki/matrix_multiplication#common_properties. Depending on how your matrices are stored, transposes can be "free" in FPGAs.
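
A quick NumPy check of that identity (purely illustrative):

import numpy as np

AA = np.random.rand(56, 56).astype(np.float32)
BB = np.random.rand(56, 1).astype(np.float32)

lhs = AA @ BB                 # result = AA * BB
rhs = (BB.T @ AA.T).T         # ...    = (BB' * AA')'

assert np.allclose(lhs, rhs)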

Altera_Forum
Honored Contributor II

Try a fully pipelined architecture that multiplies 8 input element pairs and adds the results to get a partial_sum(n) every clock. 

This requires 8 altfp_mul and 7 altfp_add_sub instances, arranged in a 4-stage pipeline (one multiply stage plus a 3-level adder tree). 

 

Then add each sequence of 7 partial results to obtain an output element. 

You should be able to pipeline this part as well, with 3 more altfp_add_sub instances chained in a 3-stage pipeline. 

- Every 2 clocks, partial_sum1(n) = partial_sum(n) + partial_sum(n-1);  

- Every 4 clocks, partial_sum2(n) = partial_sum1(n) + partial_sum1(n-2); 

- Every 8 clocks, element(n) = partial_sum2(n) + partial_sum2(n-4). 

 

Because this last part actually needs 8 operands, the first part needs a "pause" cycle after each sequence of 7 partial sums, during which its output is zero. 

 

This should be able to calculate one complete result every 8*56 clocks. 

Latency should be 8*56 clocks + 1 altfp_mul + 3*altfp_add_sub + 3*altfp_add_sub. 

 

 

PS: I'm assuming altfp_add_sub and altfp_mult have 1 cycle throughput, like the integer counterparts. If that's not true, then this doesn't work.
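
To make the dataflow concrete, here is a small Python behavioural model of that scheme. It only checks the arithmetic (groups of 8 products, a 3-level adder tree per group, then a second tree over 7 partial sums plus the zero "pause" slot); it ignores pipeline registers and the fact that the second tree would be time-multiplexed onto 3 physical altfp_add_sub instances. Names are made up for this sketch.

import numpy as np

def mat_vec_8wide(A, x):
    n_rows, n_cols = A.shape                 # 56 x 56 here
    y = np.zeros(n_rows, dtype=A.dtype)
    for r in range(n_rows):
        partial_sums = []
        for g in range(0, n_cols, 8):        # 7 groups of 8 columns per row
            prods = A[r, g:g + 8] * x[g:g + 8]      # 8 multipliers in parallel
            s = prods.reshape(4, 2).sum(axis=1)     # adder tree level 1 (4 adds)
            s = s.reshape(2, 2).sum(axis=1)         # level 2 (2 adds)
            partial_sums.append(s[0] + s[1])        # level 3 (1 add)
        partial_sums.append(A.dtype.type(0))        # the "pause" (zero) slot
        t = np.array(partial_sums).reshape(4, 2).sum(axis=1)   # second tree
        t = t.reshape(2, 2).sum(axis=1)
        y[r] = t[0] + t[1]
    return y

A = np.random.rand(56, 56).astype(np.float32)
x = np.random.rand(56).astype(np.float32)
assert np.allclose(mat_vec_8wide(A, x), A @ x, rtol=1e-4)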
Altera_Forum
Honored Contributor II

They have minimum pipeline latencies of 7 and 5 cycles respectively (altfp_add_sub and altfp_mul), more if the highest clock frequency is targeted.

Altera_Forum
Honored Contributor II

But they're fully pipelined, right?

Altera_Forum
Honored Contributor II

Yes, of course.

Altera_Forum
Honored Contributor II

That will take 5 + 7*3 (for the first 8 multiplications and 7 additions) + 5 + 7 + 7 + 7 = 52 cycles for 1 row. 

For 56 rows: 56 x 52 = 2912 cycles = 14560 ns. 

That is slower than I need.
Altera_Forum
Honored Contributor II

The project's feasibility can be determined in terms of the required GFlop/s and the available (or allotted) multiplier resources, before thinking about structures.

Altera_Forum
Honored Contributor II

One row: 56 mults, 55 adds. 

All 56 rows: 3136 mults, 3080 adds. 

All 56 rows every 6 us: 523 Mmult/s and 513 Madd/s. 

With a 5 ns clock, that becomes 2.6 mults/clock and 2.6 adds/clock. 

Quite feasible. 

 

Victor,  

looks like you're mixing throughput with latency. 

An altfp_mul has a latency of 5 cycles but a throughput of one result every cycle, which means it takes 5 cycles to multiply a pair of numbers but you can feed it a new pair of numbers every cycle. Same goes for altfp_add_sub. 

Since matrix multiplication has lots of independent operations, this can be exploited. 

 

The architecture I suggested performs 8 mults/clock and more than 7 adds/clock and manages one complete result every 448 cycles (2.2 us), with a latency of 495 cycles (2.4 us).
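
Putting numbers on that (a back-of-the-envelope sketch, assuming the 5 ns clock and the 5/7-cycle latencies mentioned above):

clk = 5e-9                          # 5 ns clock period

# Required rate: one 56x56 * 56x1 result every 6 us
print(56 * 56 / 6e-6 / 1e6)         # ~523 Mmult/s
print(56 * 55 / 6e-6 / 1e6)         # ~513 Madd/s

# Proposed 8-wide architecture: 8 clocks per row (7 groups + 1 pause slot)
cycles_per_result = 8 * 56          # 448 cycles
print(cycles_per_result * clk)      # ~2.24e-06 s per result

# Latency: one pass through the data plus the pipeline depths
# (5-cycle altfp_mul, 7-cycle altfp_add_sub, two 3-level adder trees)
latency = cycles_per_result + 5 + 3 * 7 + 3 * 7     # 495 cycles
print(latency * clk)                # ~2.48e-06 s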
Altera_Forum
Honored Contributor II

I did not realize that you can send the next data input without waiting for the prior operation to finish. Thanks for pointing it out.

Altera_Forum
Honored Contributor II

Hello there, 

 

I would like to ask you if you got the same error as me. Whenever I try to compile a single matrix_multiplier, this error appears: 

 

library "work" does not contain primary unit "hcc_package" 

 

Do I have to import this package? I've been looking for it but I can't find it. Where is it? 

 

Regards, 

Juan