Here is a low computational cost way of doing random projections:https://drive.google.com/open?id=0BwsgMLjV0BnhOGNxOTVITHY1U28
You can then go and do <vector,vector> or <vector,scalar> single layer networks using that:
Or if you are willing to use evolution you can do deep networks with it. For example this algorithm works well:
So the latest nVidia GPU can do 120 TFlops of 16 bit floating point per second. That's 120,000,000 65536-point Walsh Hadamard transforms per second divided by say about 3 for all the addressing arithmetic. That's about 40 million 65536-point random projections (RP) per second. I won't even try to count the number of 1024-point RP's you could do a second. I guess you could evolve deep neural networks in real time that way.
Another thing you can do is use single layer RP networks as soft memory that deep neural networks can learn to exploit more easily that random access memory (RAM) which poses impossible needle in a haystack difficulties to evolution or gradient descent training algorithms. I mentioned you could use a truncation function to gate writes to such soft memory here: https://discourse.numenta.org/t/artificial-life-concept/2308
Yes, I followed the AI links on your website and this is where I was directed. Thank you.
There is also a related paper that was put on Arxiv today:
And I said a little more here:
Maybe there are a couple of ideas that could be useful to Intel if random projections (RP) etc. become important.
The Walsh Hadamard transform (and RP) could be pipelined in hardware. And if you went to extremes they could be done in analog since only patterns of addition and subtraction are required. There was also talk of multiplier free neural networks in recent papers. That you can do with single layer RP networks if you use the signof function as a nonlinear activation function: signof(x)=1, x>=0 signof(x)=-1, x<0 then weight updates during training don't require multiply either. And actually that proves to be a good nonlinearity to use anyway. For deep RP nets I think you are restricted to training with evolution strategies (ES) rather than BP (but I could be wrong.) However with super fast evaluation times for RP nets that mightn't be a problem.
For the "no multiply" networks:
In hardware all you would need are low transistor count, low power requirement integer add and subtract operations and a few other simple bit operators. Avoiding much more complex and space consuming multiply logic circuits. It should be quite easy to pipe-line the operations on a FPGA. One other thing is that you may not need full precision integer +- because 2's complement overflow would simply increase the amount of nonlinearity, but probably not too much.
You could imagine a chip that had say 100 million add/subtract logic units all working merrily away at 1 billion operations per second. 100 Peta operations per second. 10 Chips would get you to Exa scale. I presume you could train a network in real time at such speeds.
There has been a lack of discussion about binarization in neural networks. Multiplying those +1/-1 values by weights and summing allows you to store values with a high degree of independence. For a given binary input and target value you get an error. You divide the error by the number of binary values and then you simply correct each of the weights by the reduced error taking account of the binary sign. That gives a full correction to get the correct target output. In higher dimensional space most vectors are orthogonal. For a different binary input the adjustments you made to the weights will not align at all. In fact they will sum to Gaussian noise by the central limit theorem. The value you previously stored for a second binary input will now be contaminated by a slight amount of Gaussian noise which you can correct for. This will now introduce an even smaller amount of Gaussian noise on the value for the first binary input. Iterating back and forth will get rid of the noise entirely for both binary inputs.
This has high use in random projection,reservoir and extreme learning machine computing. And in fact turns a simple locality sensitive hash formed by random projection followed by binarization into a useful single layer neural network.
I suppose then there are 2 ways to train a single layer of a deep neural network. If you just alter the weights then for every input every output will be different. However if you view a single layer as an associative memory then you can alter the response of the layer for a single input only, its behavior for different inputs being only marginally affected. This should avoid issues like catastrophic forgetting. It would also suggest that altering the response of the layer in such a mode should involve distinct pattern search not probing around with Gaussian noise. You would need to morph back-propagation so that it looked more like Hooke and Jeeves, or actually use Hooke and Jeeves or a similar pattern search to optimize the deep network associative memory layer responses.
I was having a problem evolving deep networks where the information loss caused by the dot product weight and sum operations was greater than the computational gains per layer.
By allowing more ways for information to pass through each layer I am seeing markedly improved results.
Just to show how easy it is to code associative memory here is some code in FreeBasic:
type associativememory veclen as ulong density as ulong hash as ulong '32 bit unsigned integer weights as single ptr 'pointer to 32 bit floats binarize as boolean ptr work as single ptr declare sub init(veclen as ulong,density as ulong,hash as ulong) declare sub free() declare sub train(target as single,inVec as single ptr) declare function recall(inVec as single ptr) as single declare sub signflip(vec as single ptr,h as long) declare sub wht(vec as single ptr) end type sub associativememory.init(veclen as ulong,density as ulong, hash as ulong) this.veclen=veclen this.density=density this.hash=hash weights=callocate(veclen*density,sizeof(single)) 'allocate zeroed memory binarize=callocate(veclen*density,sizeof(boolean)) work=callocate(veclen,sizeof(single)) end sub sub associativememory.free() deallocate(weights) deallocate(binarize) deallocate(work) end sub function associativememory.recall(inVec as single ptr) as single dim as ulong i,j dim as single x=0f dim as single ptr wtidx=weights dim as boolean ptr bidx=binarize for i=0 to veclen-1 work=inVec next for i=0 to density-1 signflip(work,hash+i) 'sign flip and wht=random projection wht(work) for j=0 to veclen-1 if work
>0f then 'if greater than 0 x+=wtidx bidx =true else x-=wtidx bidx =false end if next wtidx+=veclen bidx+=veclen next return x end function sub associativememory.train(target as single,inVec as single ptr) dim as ulong i,j dim as single ptr wtidx=weights dim as boolean ptr bidx=binarize dim as single e=target-recall(inVec) 'get the prediction error e/=veclen*density 'scale the error correctly so that it will be fully corrected for i=0 to density-1 for j=0 to veclen-1 if bidx then wtidx +=e else wtidx -=e end if next wtidx+=veclen bidx+=veclen next end sub ' Pseudorandomly flip the sign of the values in vector using h as a seed ' h is a 32 bit signed interger sub associativememory.signflip(vec as single ptr,h as long) dim as ulong i for i=0 to veclen-1 h*=1664525 h+=1013904223 if h<0 then vec=-vec next end sub 'Fast Walsh Hadamard transform. Leaves vector length unchanged sub associativememory.wht(vec as single ptr) dim as ulong i,j,hs=1 dim as single a,b,scale=1.0/sqr(veclen) while hs<veclen i=0 while i<veclen j=i+hs while i<j a=vec b=vec[i+hs] vec=a+b vec[i+hs]=a-b i+=1 wend i+=hs wend hs+=hs wend for i=0 to veclen-1 vec*=scale next end sub screenres 300,300,32 print "Please wait!" dim as associativememory net dim as single ptr vec=callocate(256,sizeof(single)) 'allocate zeroed net.init(256,3,1234567) for i as ulong=0 to 99 for j as ulong=0 to 254 vec =1f net.train(sin(j*.06)*100,vec) vec[j+1]=1f net.train(cos(j*.2)*100,vec) vec =0f vec[j+1]=0f next next cls for i as ulong=0 to 254 vec=1f pset (i,150-sin(i*.06)*100),rgb(0,255,0) pset (i,150-net.recall(vec)),rgb(255,255,0) vec=0f next for i as ulong=0 to 254 vec=1f vec[i+1]=1f pset (i,150-cos(i*.2)*100),rgb(0,255,0) pset (i,150-net.recall(vec)),rgb(255,0,255) vec=0f vec[i+1]=0f next deallocate(vec) net.free() getkey
I've written the code in a way that should be easy to translate into C. The number of elements in the vector must be a power of 2 (16,32,64...) because of the wht function.
Circumstances led me to make a comment on the nVidia site about this type of network and associative memory:
Maybe they won't pay any attention. I did present earlier versions of the same ideas to them 10 years ago and all they did was "clone" the idea for generation the Gaussian distribution fast with out any attribution. Google also tend to "clone" ideas without attribution too.