Introduction
FPGAs implement logic in lookup tables (LUTs). A LUT works like a truth table or Karnaugh map, and an FPGA lets you wire LUTs together to create almost any logic function you can imagine. LUTs are specified by the number of inputs they can resolve. For example, a 4-input LUT needs 16 bits of storage, one output value for each of the 2^4 possible input combinations. Logic that depends on more than 4 inputs must be built by cascading several of these LUTs. Cascading has drawbacks, chiefly the added propagation delay of each extra level. Larger LUTs need fewer levels, which can improve performance, but LUTs that are too large waste storage on simple logic functions. To minimize that waste, LUTs can be made fracturable, so that one large LUT can be split into smaller ones. How they are divided can have a significant impact on overall device utilization. This article looks at some of these tradeoffs.
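As a rough illustration of the storage cost and of cascading, the sketch below models a LUT as a plain list of configuration bits. The helper names (`make_lut`, `lut_read`, `xor5`) are made up for this example and do not correspond to any vendor tool or primitive.

```python
def make_lut(func, n_inputs):
    """Return the 2**n_inputs configuration bits that make a LUT compute func."""
    bits = []
    for address in range(2 ** n_inputs):
        # Input i is bit i of the address, so inputs[0] is the least significant bit.
        inputs = [(address >> i) & 1 for i in range(n_inputs)]
        bits.append(func(*inputs) & 1)
    return bits


def lut_read(bits, *inputs):
    """Look up the stored output bit for one combination of inputs."""
    address = sum(value << i for i, value in enumerate(inputs))
    return bits[address]


# A 4-input LUT always stores 16 bits, whatever function it implements.
xor4 = make_lut(lambda a, b, c, d: a ^ b ^ c ^ d, 4)

# The second stage needs only two inputs. With c and d tied to 0,
# only 4 of its 16 stored bits are ever read.
xor2_in_lut4 = make_lut(lambda a, b, c, d: a ^ b, 4)


def xor5(a, b, c, d, e):
    """5-input XOR built from two cascaded 4-input LUTs (two levels of delay)."""
    stage1 = lut_read(xor4, a, b, c, d)
    return lut_read(xor2_in_lut4, stage1, e, 0, 0)
```

The 5-input function costs two LUTs and two levels of lookup here, and the second LUT is mostly unused, which is the kind of waste and delay the rest of the article is concerned with.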
Lookup Table Implementation
To begin, let’s look at the structure of a generic n-input lookup table. A lookup table is a tree of multiplexers that selects one storage location based on the inputs, just like a 1-bit-wide memory. The number of storage locations required is 2^n, again like a memory with n address pins. Below is an example of a 4-input LUT, with the smaller LUTs it contains indicated by dashed bounding boxes. We will use the simplified representation of the 4-input LUT through the rest of this article.
Figure 1 n-Input LUT
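The multiplexer tree described above can be sketched in a few lines. This is a behavioral model only, with hypothetical names (`mux2`, `lut_mux_tree`), and it assumes the first input drives the multiplexer layer closest to the storage bits.

```python
def mux2(select, in0, in1):
    """2:1 multiplexer: pass in1 when select is 1, otherwise in0."""
    return in1 if select else in0


def lut_mux_tree(bits, inputs):
    """Reduce 2**n storage bits to one output through n layers of 2:1 muxes.

    Returns (output_bit, number_of_muxes_used).
    """
    if len(bits) != 2 ** len(inputs):
        raise ValueError("need exactly 2**n storage bits for n inputs")
    layer = list(bits)
    mux_count = 0
    for select in inputs:
        # Each layer halves the number of candidate bits.
        layer = [mux2(select, layer[i], layer[i + 1])
                 for i in range(0, len(layer), 2)]
        mux_count += len(layer)
    return layer[0], mux_count
```

For n inputs the tree uses 2^n - 1 two-input multiplexers (15 for a 4-input LUT), and every intermediate layer is itself the output of a set of smaller LUTs, which is what the dashed boxes in the figure indicate.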
Adding inputs increases the complexity of the logic that can be implemented, but each added input doubles the number of storage bits and roughly doubles the number of multiplexers required. These extra resources are wasted when implementing simpler logic. One way to reduce the waste is to provide taps on the intermediate multiplexers. An example of a 6-input LUT with two 5-input LUT outputs is shown below. This simple approach allows two 5-input LUTs to be implemented, with the restriction that both are driven by the same 5 inputs.
Figure 2 Basic 6-Input LUT
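A minimal behavioral sketch of such a tapped 6-input LUT follows. The function and output names are hypothetical and are not any vendor's primitive; the point is only that both 5-input halves are addressed by the same five signals.

```python
def lut_read(bits, inputs):
    """Look up one stored bit; inputs[0] is the least significant address bit."""
    address = sum(value << i for i, value in enumerate(inputs))
    return bits[address]


def fracturable_lut6(bits64, a, b, c, d, e, f):
    """Return (out6, out5_low, out5_high) for a 64-bit LUT with 5-input taps."""
    shared = (a, b, c, d, e)                    # both halves see the same 5 inputs
    out5_low = lut_read(bits64[:32], shared)    # tap on the lower 32 bits
    out5_high = lut_read(bits64[32:], shared)   # tap on the upper 32 bits
    out6 = out5_high if f else out5_low         # final mux, driven by the 6th input
    return out6, out5_low, out5_high


# Fractured use: 5-input AND in the lower half, 5-input parity in the upper half.
and5_bits = [1 if address == 31 else 0 for address in range(32)]
parity5_bits = [bin(address).count("1") & 1 for address in range(32)]
config = and5_bits + parity5_bits
```

Both functions fit in one LUT only because they depend on the same five signals. Two functions of different signals would need two LUTs in this scheme.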
A better approach
Intel Agilex FPGAs take a different approach that allows for greater device utilization. Their basic logic block, the Adaptive Logic Module (ALM), is a novel 6-input LUT structure with 8 inputs. Some of the inputs to the smaller LUTs are separated for additional flexibility. With the additional inputs, smaller logic functions that depend on different signals can share one ALM. The ALM and some of the LUT combinations it supports are shown in the diagram below.
Figure 3 Agilex FPGA ALM
When datac0 and datad0 are tied to the same signals as datac1 and datad1, respectively, this block implements a traditional 6-input LUT, but some 7- and 8-input logic functions can be implemented by driving them separately. These additional 7- and 8-input configurations, plus all the 3-, 4-, and 5-input LUT configurations, make this implementation much more flexible and allow for greater device utilization.
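To see why the extra inputs matter for packing, consider a deliberately simplified counting model. It assumes that two small functions can share a block whenever their combined distinct signals fit the block's input pins: 5 shared inputs for the basic tapped 6-input LUT, 8 inputs for the ALM, with each function limited to 5 inputs. This ignores the actual per-pin rules of the ALM, which the article does not spell out, so a True result means "plausible by input count", not "guaranteed to fit". The function names are hypothetical.

```python
def fits_shared_input_lut6(inputs_a, inputs_b):
    """Basic tapped 6-LUT: both functions must draw on the same 5 signals."""
    return len(set(inputs_a) | set(inputs_b)) <= 5


def fits_alm_by_input_count(inputs_a, inputs_b):
    """Assumed model: each function uses at most 5 signals, at most 8 in total."""
    a, b = set(inputs_a), set(inputs_b)
    return len(a) <= 5 and len(b) <= 5 and len(a | b) <= 8


# Two unrelated 4-input functions, for example f(p, q, r, s) and g(w, x, y, z).
f_inputs = ["p", "q", "r", "s"]
g_inputs = ["w", "x", "y", "z"]

pair_fits_lut6 = fits_shared_input_lut6(f_inputs, g_inputs)
pair_fits_alm = fits_alm_by_input_count(f_inputs, g_inputs)
```

Under these assumptions the unrelated pair needs two of the basic LUTs but only one ALM, which is the kind of saving that shows up in the utilization figures later in the article.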
How does this affect device utilization?
How significant are these advantages? To answer, we need a metric that indicates how much logic has been implemented. It would be nice to use a common function that everyone is familiar with and has access to, like a RISC-V processor core. A single core is not likely to fill an FPGA, so we would need an array of these processors, and the processor should be small to allow fine-grained comparisons. A script that stitches them all together would also be nice. But this is clearly too much to ask and too obscure for anyone to actually develop, or is it? CoreScore is “an award-giving benchmark for FPGAs and their synthesis/P&R tools. It tests how many SERV cores that can be put into a particular FPGA.” (https://github.com/olofk/corescore#readme) SERV is a tiny “award-winning bit-serial RISC-V core.” (https://github.com/olofk/serv#readme) This provides a vendor-independent metric for the logic capacity of an FPGA.
Now that we have identified a metric for comparison, let’s take a closer look at the logic capacity of each implementation. AMD Virtex* UltraScale+* FPGAs use the basic 6-input LUT with an optional 5-input LUT output, while Intel Agilex 7 devices use the 8-input ALM described above. The results from CoreScore.store show that each core takes more than 210 of the basic divided 6-input LUTs, but fewer than 170 of the 8-input Agilex ALMs.
FPGA | CoreScore (SERV cores) | LUT-6 or ALM count | LUT-6s or ALMs per core |
AGI040 | 8,225 | 1,372,000 | 167 |
VCU128(VU37P) | 6,000 | 1,303,680 | 217 |
AGF027 | 5,525 | 912,800 | 165 |
VCU118(VU9P) | 5,087 | 1,182,240 | 232 |
AGF014 | 2,970 | 487,200 | 164 |
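The last column is simply the block count divided by the CoreScore. A short sketch that recomputes it from the table's own figures (the variable and function names are just for this example):

```python
# (device, CoreScore, LUT-6 or ALM count), copied from the table above.
ROWS = [
    ("AGI040", 8225, 1_372_000),
    ("VCU128(VU37P)", 6000, 1_303_680),
    ("AGF027", 5525, 912_800),
    ("VCU118(VU9P)", 5087, 1_182_240),
    ("AGF014", 2970, 487_200),
]


def blocks_per_core(rows):
    """Divide each device's LUT-6 or ALM count by its CoreScore, rounded."""
    return {device: round(blocks / cores) for device, cores, blocks in rows}


PER_CORE = blocks_per_core(ROWS)

for device, value in PER_CORE.items():
    print(f"{device:15s} {value:4d} blocks per SERV core")
```

The three Agilex devices land between 164 and 167 ALMs per core, while the two UltraScale+ devices need 217 and 232 LUTs per core.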
If we instead compare Logic Element (LE) / System Logic Cell (SLC) usage, we get a different picture. All the devices require about 490 LEs or SLCs per core, within ±5%. This is not accidental: both companies apply a scaling factor to their raw logic counts so that the quoted number is more representative of the capacity.
FPGA | CoreScore (SERV cores) | LE or SLC count | LEs or SLCs per core |
AGI040 | 8,225 | 4,047,400 | 492 |
VCU128(VU37P) | 6,000 | 2,851,800 | 475 |
AGF027 | 5,525 | 2,692,760 | 487 |
VCU118(VU9P) | 5,087 | 2,586,150 | 508 |
AGF014 | 2,970 | 1,437,240 | 484 |
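The two tables can be cross-checked with a few divisions. The sketch below recomputes LEs or SLCs per core, tests the "about 490, within ±5%" statement, and derives the scaling factor implied by each device's two counts. The factors are computed from the tables here, not quoted from either vendor, and the names are hypothetical.

```python
# device: (CoreScore, LUT-6 or ALM count, LE or SLC count), from the two tables.
DEVICES = {
    "AGI040": (8225, 1_372_000, 4_047_400),
    "VCU128(VU37P)": (6000, 1_303_680, 2_851_800),
    "AGF027": (5525, 912_800, 2_692_760),
    "VCU118(VU9P)": (5087, 1_182_240, 2_586_150),
    "AGF014": (2970, 487_200, 1_437_240),
}


def cells_per_core(device):
    """LEs or SLCs consumed per SERV core."""
    cores, _, cells = DEVICES[device]
    return cells / cores


def implied_scale_factor(device):
    """LEs per ALM (Agilex) or SLCs per LUT-6 (UltraScale+) implied by the counts."""
    _, blocks, cells = DEVICES[device]
    return cells / blocks


def within_tolerance(value, nominal=490.0, tolerance=0.05):
    """True when value is within the given fraction of nominal."""
    return abs(value - nominal) <= tolerance * nominal


for name in DEVICES:
    print(f"{name:15s} {cells_per_core(name):6.1f} cells/core, "
          f"scale factor {implied_scale_factor(name):.4f}, "
          f"within 5% of 490: {within_tolerance(cells_per_core(name))}")
```

The implied factor comes out the same for every device within a family, and it is larger for the ALM than for the 6-input LUT, which is how the per-core figures converge even though the raw block counts differ.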
The data shows that Logic Elements and System Logic Cells are useful metrics for representing FPGA logic capacity. It also shows that the 8-input ALM used in Agilex FPGAs implements more logic with fewer instances than the more traditional 6-input structure, even though both contain the same number of lookup table bits. This suggests that more of the bits are stranded in the 6-input version, so more instances are needed to achieve the same functionality. These extra instances consume additional area on the die and more power in the system, which is something to consider when selecting an FPGA for your next design. In addition, the special-case 7- and 8-input functions that only the Agilex ALM supports require two levels of logic in the 6-input LUT configurations, which brings a significant timing penalty on top of the extra LUTs consumed.
Summary
FPGAs are complex devices, and finding the right device for your application can be a daunting task. Metrics like Logic Elements and System Logic Cells are helpful, but it is also important to consider the underlying architecture of the logic fabric you will be using, in addition to the rest of the features and tools that come with the device. The ALM, the fundamental building block of an Agilex FPGA, is designed to do more with less for greater system optimization. Links to some additional resources are included at the end of this blog for your convenience. In any event, the next time you are looking for an FPGA, be sure to check the CoreScore.
Additional Resources
- FPGA Architecture White Paper
- Intel Agilex® FPGAs Deliver a Game-Changing Combination of Flexibility and Agility for the Data-Centric World
- Performance Advantages on OpenCores with Intel Agilex® 7 FPGAs
Notes/disclaimers: © Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. *Other names and brands may be claimed as the property of others.