CAM the hard way or how to compare against thousands of bits?

Altera_Forum · ‎05-21-2012

Guys,

I don't have a lot of experience with FPGAs and so I turn to you my friends for an advice. The circuit I am aiming to create is relative simple, but I am not sure about what is the right way to do it.

The module should get data off of the Avalon-MM FIFO, 32-bit wide. Let's think in software terms and call those 32-bits an unsigned integer. That integer must be compared against a pre-configured list of numbers, approx. 5000 of them (5K * 32 bits = 19.53125 kilobytes). The output should be two things:

1) A signal that is positive if there was a match, or negative otherwise.

2) If the output is positive, the offset of the matching register (i.e. register address) should be stored in [31:0] register value.

(of course there will also be a simple reset logic)

I am thinking to use on-chip memory as 19KByte doesn't seem like a big deal, and create a module where a number of 32-bit registers will be a parameter. Those registers will be exported to a on-chip CPU, where software would write some data into it (HAL + uCOS).

Does this sound as a good idea thus far? Now, I can think about two ways to go from here to compare that number against value in registers:

Way #1, something like:

[code]

for (i = 0; i < MAX_REGS_PARAMETER; ++i)

begin

if (input_data[31:0] == reg[31 + (i * ...)i * ...)])

output_signal <= 1'b1;

end

[code]

...

Way #2, generated combinatorial:

[code]

assign output_signal = (input_data[31:0] == reg1[31:0]) || (input_data[31:0] == reg2[31:00]) ....

[code]

I am not sure about how that code will synthesize and how big of a delay this comparison will introduce. Will that comparison take long and does it make sense to pipeline or no?

Which way do you think is better and why? Or is there any other way to do a thing like that? How would you do it? Your help is very much appreciated.

Thank you very much,

V.

Altera_Forum · ‎05-22-2012

You have basically correct RTL in both cases, but both will require 160K flops rather than 160K bits of SRAM. With SRAM inside an FPGA, you can read one value per cycle, so you can only compare the input to one value at a time. Only some of the largest FPGAs have more than 160K flops, I think that it's just the ones over $1K.

If your constraints allow you to use 5K clocks to read all of the SRAM entries, then you can easily fit this into an FPGA.

But if you need the compare results faster, then you will probably need to use an external CAM, or consider alternative approaches, such as binary trees.

You will find a few references to implementing CAM in FPGAs using SRAM, but those techniques usually only allow a small number of of compare values (it's hard to go much beyond ~100 entries).

\c

Altera_Forum · ‎05-23-2012

Since reading through 5000 memory locations can take a while, if your system needs speed you can break that up into several parallel sequencers. Using 256 entries each (for an M9K memory) you can increase your speed by about 20x and still maintain 32 bits per compare.

If you need significantly fewer than 256 cycles, you could break the compares accross several memories - a technique you should find through a search for CAMs and FPGAs - where you can use 8 memories with 4 bits each for the compare and cycle through 576 entries 36 at a time in 16 clocks in each "octet" of memory elements. This would quadruple the number of bits per compare to 128 but with a quicker hit.

This could be extended to 4 memories with 8 bits each for the compare to look at 36 entries at once. The number of bits per compare swells to 1024 bits per compare which will outgrow most devices for 5k entries.

The piecemeal compares would be implemented in the FPGA fabric with pseudo-code for the fastest results might look something like

for( 1=0; i<MAX_REGS_PARAMETER/36; ++i)

if( input_data[31:24][35:0]

& input_data[23:16][35:0]

& input_data[15:8][35:0]

& input_data[7:0][35:0] != 36'd0

) output_signals[i] <= 1'b1;

output_signal <= |output_signals;