
A question about floating-point timing

Altera_Forum
Honored Contributor II

I am trying to find the time cost of floating-point calculations.

I followed the tutorial "Using Nios II Floating-Point Custom Instructions" and tested it on the DE2 board, but my results differ from the tutorial's. The tutorial shows 14 clock cycles for each custom instruction.

The tutorial says the results vary with the hardware configuration.

Does anyone know where to find the time cost of the floating-point custom instructions on the DE2?

Thank you.

The result I got is about 47 clock cycles per operation. In the reports below, CI is the custom instruction and SW is the software implementation.

--- Quote Start ---  

--Performance Counter Report--
Total Time: 0.0869525 seconds (8695248 clock-cycles)
+----------------+-------+------------+---------------+-------------+
| Section        | %     | Time (sec) | Time (clocks) | Occurrences |
+----------------+-------+------------+---------------+-------------+
| FP CI ADD      | 0.541 |    0.00047 |         47000 |        1000 |
+----------------+-------+------------+---------------+-------------+
| FP SW ADD      |    54 |    0.04697 |       4696646 |        1000 |
+----------------+-------+------------+---------------+-------------+

--Performance Counter Report--
Total Time: 0.0939769 seconds (9397685 clock-cycles)
+----------------+-------+------------+---------------+-------------+
| Section        | %     | Time (sec) | Time (clocks) | Occurrences |
+----------------+-------+------------+---------------+-------------+
| FP CI SUBTRACT |   0.5 |    0.00047 |         47000 |        1000 |
+----------------+-------+------------+---------------+-------------+
| FP SW SUBTRACT |  48.5 |    0.04560 |       4559580 |        1000 |
+----------------+-------+------------+---------------+-------------+

--Performance Counter Report--
Total Time: 0.252263 seconds (25226350 clock-cycles)
+----------------+-------+------------+---------------+-------------+
| Section        | %     | Time (sec) | Time (clocks) | Occurrences |
+----------------+-------+------------+---------------+-------------+
| FP CI MULTIPLY | 0.194 |    0.00049 |         49000 |        1000 |
+----------------+-------+------------+---------------+-------------+
| FP SW MULTIPLY |  79.9 |    0.20154 |      20153700 |        1000 |
+----------------+-------+------------+---------------+-------------+

--Performance Counter Report--
Total Time: 0.109719 seconds (10971917 clock-cycles)
+----------------+-------+------------+---------------+-------------+
| Section        | %     | Time (sec) | Time (clocks) | Occurrences |
+----------------+-------+------------+---------------+-------------+
| FP CI DIVIDE   | 0.656 |    0.00072 |         72000 |        1000 |
+----------------+-------+------------+---------------+-------------+
| FP SW DIVIDE   |  49.7 |    0.05450 |       5449516 |        1000 |
+----------------+-------+------------+---------------+-------------+

--- Quote End ---  

The hardware configuration is attached at the bottom.
Altera_Forum
Honored Contributor II

I don't know what the numbers should be for a DE2 board, but I'll explain how the custom instructions work and what else plays a role.

 

The floating-point custom instructions perform +, -, *, and optionally /. Whenever the Nios II processor executes one of these custom instructions, it can become a blocking operation: if a custom instruction takes 6 clock cycles to complete, the processor pipeline stalls for 6 clock cycles waiting for its result.

 

I forget how the tutorial software is written, but I assume it performs a series of floating-point operations over an array of data inside a loop (one floating-point operation per loop iteration). If all the data is cached, this should be fairly quick; if not, memory access times will add to the measured cost of each floating-point operation. The compiler's optimization level will also play a significant role.

 

Since I don't know much about the system you are running the FPU tutorial on, my hunch is that there is a lot of latency between the processor and the memory that stores the code. If the main memory is on a different clock domain from the processor, I would expect the inefficiencies you are seeing.