Intel® Quartus® Prime Software
Intel® Quartus® Prime Design Software, Design Entry, Synthesis, Simulation, Verification, Timing Analysis, System Design (Platform Designer, formerly Qsys)
16596 Discussions

CL_INVALID_BINARY for larger design

Altera_Forum
Honored Contributor II
1,798 Views

I'm working on a kernel with scalable unrolling factor on aocl 16.0.2.222 for a Nallatech p385a_sch_ax115 board. Synthesis and hardware generation succeed for a range of unrolling factors and .aocx files get generated. However, for the largest design, the call to clCreateProgramWithBinary returns -42, which is CL_INVALID_BINARY. 

 

For reference:  

The acl_quartus_report.txt of the largest working design (unrolling factor 24)  

 

aluts: 102890 

registers: 221,229 

logic utilization: 116,166 / 427,200 ( 27 % ) 

i/o pins: 335 / 826 ( 41 % ) 

dsp blocks: 318 / 1,518 ( 21 % ) 

memory bits: 25,941,421 / 55,562,240 ( 47 % ) 

ram blocks: 1,693 / 2,713 ( 62 % ) 

actual clock freq: 211.111110583 

kernel fmax: 211.41 

1x clock fmax: 211.41 

2x clock fmax: 10000 

highest non-global fanout: 26952 

 

.aocx filesize is 183569064 bytes 

 

 

The acl_quartus_report.txt of the design with this problem (unrolling factor 32)  

 

aluts: 115713 

registers: 254,212 

logic utilization: 130,102 / 427,200 ( 30 % ) 

i/o pins: 335 / 826 ( 41 % ) 

dsp blocks: 422 / 1,518 ( 28 % ) 

memory bits: 31,872,467 / 55,562,240 ( 57 % ) 

ram blocks: 2,050 / 2,713 ( 76 % ) 

actual clock freq: 212.560385942 

kernel fmax: 212.58 

1x clock fmax: 212.58 

2x clock fmax: 10000 

highest non-global fanout: 26328 

 

.aocx filesize is 184094248 bytes 

 

 

I also generated slightly modified variants which display the same behaviour. 

 

Are there any ideas to investigate this problem further, for example look for specific log messages?
0 Kudos
6 Replies
Altera_Forum
Honored Contributor II
645 Views

I can think of two reason for CL_INVALID_BINARY: 

 

1. You are compiling against an incorrect BSP or board. 

2. The function in your host code that reads the kernel file is somehow corrupting it. 

 

If your problem depends on the size of the binary file, it is quite likely that the cause is the kernel reading function. Are you properly allocating memory using malloc to read the kernel file? Have you made sure that the memory buffer for the kernel is large enough?
0 Kudos
Altera_Forum
Honored Contributor II
645 Views

Thanks for the reply. Both are plausible reasons, but I double-checked that both the --board parameter and the host memory allocation are correct. 

 

I wonder if the number of local memories instantiated may cause a problem. In the somewhat cryptic sys_description.txt, the highest ID of a local memory in the working design seems to be 138, whereas the failing design has the highest ID 178. In both cases, some intermediate IDs seem to have been optimized away. 

 

Is anyone successfully running designs with that many local memories? 

I right now work on a synthetic example with a high number of local memories in order to find out if this might cause the problem.
0 Kudos
Altera_Forum
Honored Contributor II
645 Views

It looks like I was on the right track here. I modified the hello_world example to generate a large and useless number of local memories and prevent them from being optimized away. 

 

// Copyright (C) 2013-2016 Altera Corporation, San Jose, California, USA. All rights reserved. // Permission is hereby granted, free of charge, to any person obtaining a copy of this // software and associated documentation files (the "Software"), to deal in the Software // without restriction, including without limitation the rights to use, copy, modify, merge, // publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to // whom the Software is furnished to do so, subject to the following conditions: // The above copyright notice and this permission notice shall be included in all copies or // substantial portions of the Software. // // THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, // EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES // OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND // NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT // HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, // WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING // FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR // OTHER DEALINGS IN THE SOFTWARE. // // This agreement shall be governed in all respects by the laws of the State of California and // by the laws of the United States of America. // AOC kernel demonstrating device-side printf call # define BUFFERS(id) __local float mylocal##id; # define READBUF(id, arg) float t##id = mylocal##id; # define WRITEBUF(id, arg) mylocal##id = t##id; //#define GENFN(id, fn) fn(id) # define GENFN(id, fn, ...) fn(id,# #__VA_ARGS__) # define GEN2FNS(prefix, ...) GENFN(prefix##0,# #__VA_ARGS__) GENFN(prefix##1,# #__VA_ARGS__) # define GEN4FNS(prefix, ...) GEN2FNS(prefix##0,# #__VA_ARGS__) GEN2FNS(prefix##1,# #__VA_ARGS__) # define GEN8FNS(prefix, ...) GEN4FNS(prefix##0,# #__VA_ARGS__) GEN4FNS(prefix##1,# #__VA_ARGS__) # define GEN16FNS(prefix, ...) GEN8FNS(prefix##0,# #__VA_ARGS__) GEN8FNS(prefix##1,# #__VA_ARGS__) # define GEN32FNS(prefix, ...) GEN16FNS(prefix##0,# #__VA_ARGS__) GEN16FNS(prefix##1,# #__VA_ARGS__) # define GEN64FNS(prefix, ...) GEN32FNS(prefix##0,# #__VA_ARGS__) GEN32FNS(prefix##1,# #__VA_ARGS__) # define GEN128FNS(prefix, ...) GEN64FNS(prefix##0,# #__VA_ARGS__) GEN64FNS(prefix##1,# #__VA_ARGS__) # define GEN256FNS(prefix, ...) GEN128FNS(prefix##0,# #__VA_ARGS__) GEN128FNS(prefix##1,# #__VA_ARGS__) __kernel void hello_world(int thread_id_from_which_to_print_message) { GEN256FNS(0, BUFFERS) // Get index of the work item unsigned thread_id = get_global_id(0); for(int i=0; i<128; i++){ GEN256FNS(0, READBUF, thread_id+i) GEN256FNS(0, WRITEBUF, thread_id+i) } if(thread_id == thread_id_from_which_to_print_message) { printf("Thread# %u: Hello from Altera's OpenCL Compiler!\n", thread_id); } } 

 

This generates a .aocx file with the following acl_quartus_report.txt  

 

ALUTs: 36336 Registers: 77,105 Logic utilization: 50,379 / 427,200 ( 12 % ) I/O pins: 335 / 826 ( 41 % ) DSP blocks: 0 / 1,518 ( 0 % ) Memory bits: 2,728,450 / 55,562,240 ( 5 % ) RAM blocks: 267 / 2,713 ( 10 % ) Actual clock freq: 288.888888167 Kernel fmax: 289.51 1x clock fmax: 289.51 2x clock fmax: 10000 Highest non-global fanout: 15081 

 

and this sys_description.txt 

 

 

--- Quote Start ---  

14 p385a_sch_ax115 0 0 2 1024 0 4294967296 4294967296 8589934592 0 1 hello_world 0 128 0 0 0 1 0 3 0 0 4 2 1 8 0 0 4 1 0 Thread\40#%u:\40Hello\40from\40Altera's\40OpenCL\40Compiler!\15 256 5 16384 6 16384 7 16384 8 16384 9 16384 10 16384 11 16384 12 16384 13 16384 14 16384 15 16384 16 16384 17 16384 18 16384 19 16384 20 16384 21 16384 22 16384 23 16384 24 16384 25 16384 26 16384 27 16384 28 16384 29 16384 30 16384 31 16384 32 16384 33 16384 34 16384 35 16384 36 16384 37 16384 38 16384 39 16384 40 16384 41 16384 42 16384 43 16384 44 16384 45 16384 46 16384 47 16384 48 16384 49 16384 50 16384 51 16384 52 16384 53 16384 54 16384 55 16384 56 16384 57 16384 58 16384 59 16384 60 16384 61 16384 62 16384 63 16384 64 16384 65 16384 66 16384 67 16384 68 16384 69 16384 70 16384 71 16384 72 16384 73 16384 74 16384 75 16384 76 16384 77 16384 78 16384 79 16384 80 16384 81 16384 82 16384 83 16384 84 16384 85 16384 86 16384 87 16384 88 16384 89 16384 90 16384 91 16384 92 16384 93 16384 94 16384 95 16384 96 16384 97 16384 98 16384 99 16384 100 16384 101 16384 102 16384 103 16384 104 16384 105 16384 106 16384 107 16384 108 16384 109 16384 110 16384 111 16384 112 16384 113 16384 114 16384 115 16384 116 16384 117 16384 118 16384 119 16384 120 16384 121 16384 122 16384 123 16384 124 16384 125 16384 126 16384 127 16384 128 16384 129 16384 130 16384 131 16384 132 16384 133 16384 134 16384 135 16384 136 16384 137 16384 138 16384 139 16384 140 16384 141 16384 142 16384 143 16384 144 16384 145 16384 146 16384 147 16384 148 16384 149 16384 150 16384 151 16384 152 16384 153 16384 154 16384 155 16384 156 16384 157 16384 158 16384 159 16384 160 16384 161 16384 162 16384 163 16384 164 16384 165 16384 166 16384 167 16384 168 16384 169 16384 170 16384 171 16384 172 16384 173 16384 174 16384 175 16384 176 16384 177 16384 178 16384 179 16384 180 16384 181 16384 182 16384 183 16384 184 16384 185 16384 186 16384 187 16384 188 16384 189 16384 190 16384 191 16384 192 16384 193 16384 194 16384 195 16384 196 16384 197 16384 198 16384 199 16384 200 16384 201 16384 202 16384 203 16384 204 16384 205 16384 206 16384 207 16384 208 16384 209 16384 210 16384 211 16384 212 16384 213 16384 214 16384 215 16384 216 16384 217 16384 218 16384 219 16384 220 16384 221 16384 222 16384 223 16384 224 16384 225 16384 226 16384 227 16384 228 16384 229 16384 230 16384 231 16384 232 16384 233 16384 234 16384 235 16384 236 16384 237 16384 238 16384 239 16384 240 16384 241 16384 242 16384 243 16384 244 16384 245 16384 246 16384 247 16384 248 16384 249 16384 250 16384 251 16384 252 16384 253 16384 254 16384 255 16384 256 16384 257 16384 258 16384 259 16384 260 16384 0 0 0 2147483647 3 

--- Quote End ---  

 

 

Loading this design with the unaltered hello_world host code fails and yields additional information through the context callback. 

 

 

--- Quote Start ---  

Using AOCX: hello_world.aocx 

Context callback: Malformed program interface definition found in binary:  

Context callback: FAILED to read auto-discovery string at byte 188. Full auto-discovery string value is 14 p385a_sch_ax115 0 0 2 1024 0 4294967296 4294967296 8589934592 0 1 hello_world 0 128 0 0 0 1 0 3 0 0 4 2 1 8 0 0 4 1 0 Thread\40#%u:\40Hello\40from\40Altera's\40OpenCL\40Compiler!\15 256 5 16384 6 16384 7 16384 8 16384 9 16384 10 16384 11 16384 12 16384 13 16384 14 16384 15 16384 16 16384 17 16384 18 16384 19 16384 20 16384 21 16384 22 16384 23 16384 24 16384 25 16384 26 16384 27 16384 28 16384 29 16384 30 16384 31 16384 32 16384 33 16384 34 16384 35 16384 36 16384 37 16384 38 16384 39 16384 40 16384 41 16384 42 16384 43 16384 44 16384 45 16384 46 16384 47 16384 48 16384 49 16384 50 16384 51 16384 52 16384 53 16384 54 16384 55 16384 56 16384 57 16384 58 16384 59 16384 60 16384 61 16384 62 16384 63 16384 64 16384 65 16384 66 16384 67 16384 68 16384 69 16384 70 16384 71 16384 72 16384 73 16384 74 16384 75 16384 76 16384 77 16384 78 16384 79 16384 80 16384 81 16384 82 16384 83 16384 84 16384 85 16384  

Context callback: Invalid binary 

ERROR: CL_INVALID_BINARY  

--- Quote End ---  

 

 

So now the next question: 

Is there any documented or undocumented limit to the number of local memories in a design? 

Or is this simply a bug in the interface parser?
0 Kudos
Altera_Forum
Honored Contributor II
645 Views

CL_INVALID_BINARY should happen only when there is something wrong with the binary file; if there is something wrong with the code, you will get some error during OpenCL compilation or synthesis/placement/routing. Judging by the resource utilization of the modified hello_world example that you have posted, all of those local memory buffers that you instantiated have been optimized out, that resource utilization is hardly different from the base resource utilization of the BSP. 

 

Are you certain about the sanity of your environment? There might be some disk or memory issue that is corrupting the binary file.
0 Kudos
Altera_Forum
Honored Contributor II
645 Views

You are right, in the final design of the modified hello_world, the local memories have been optimized out. However, in the sys_description description they are still present. And the context callback specifically refers to a so-called auto-discovery string with precisely these contents. 

 

With a total of four bitstreams of large designs triggering the CL_INVALID_BINARY and around a dozen of smaller designs loading successfully, I doubt that a disk or memory issue. I could probably add some hashing that ensures that the bitstream in memory is the same as on disk, but I don't see a way to ensure the bitstream written to disk in the first place is not corrupted.
0 Kudos
Altera_Forum
Honored Contributor II
645 Views

 

--- Quote Start ---  

You are right, in the final design of the modified hello_world, the local memories have been optimized out. However, in the sys_description description they are still present. And the context callback specifically refers to a so-called auto-discovery string with precisely these contents. 

 

With a total of four bitstreams of large designs triggering the CL_INVALID_BINARY and around a dozen of smaller designs loading successfully, I doubt that a disk or memory issue. I could probably add some hashing that ensures that the bitstream in memory is the same as on disk, but I don't see a way to ensure the bitstream written to disk in the first place is not corrupted. 

--- Quote End ---  

 

 

 

I know this is an older thread but I'm running into the exact same issue and was wondering if you were able to nail down the issue.
0 Kudos
Reply