I'm trying to convert a TensorFlow model to run on a Movidius NCS, and I need to do a reduce mean on a tensor of size [1, 2000, 100, 64]. According to the supported layers page (https://software.intel.com/en-us/articles/OpenVINO-Using-TensorFlow#inpage-nav-10), mean isn't supported, but average pooling across spatial dimensions is. The best way I could think of to calculate the mean of the entire input is to pool over the whole height and width, then do a convolution with a kernel of ones:
avg_pool with kernel: [1,H,W,1] -> [1,1,1,C]
conv2d with kernel: np.ones([1,1,C,1]) -> [1,1,1,1]
divide by C to get the mean of the input
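As a sanity check on the three steps above, here is a minimal NumPy sketch (not TensorFlow or OpenVINO code) showing that a global average pool followed by a 1x1 all-ones convolution and a division by C equals the plain mean. The small shape and the use of `einsum` as a stand-in for `conv2d` are assumptions made for illustration only.

```python
import numpy as np

# Small stand-in shape; the real tensor in this thread is [1, 2000, 100, 64].
N, H, W, C = 1, 8, 6, 4
rng = np.random.default_rng(0)
x = rng.standard_normal((N, H, W, C)).astype(np.float32)

# Step 1: global average pool over height and width -> [1, 1, 1, C]
pooled = x.mean(axis=(1, 2), keepdims=True)

# Step 2: a 1x1 conv with an all-ones [1, 1, C, 1] kernel is a sum over channels
# (einsum is used here as a stand-in for conv2d) -> [1, 1, 1, 1]
kernel = np.ones((1, 1, C, 1), dtype=np.float32)
summed = np.einsum("nhwc,ijco->nhwo", pooled, kernel)

# Step 3: divide by C to turn the channel sum into a mean
result = summed / C

assert result.shape == (1, 1, 1, 1)
assert np.allclose(result, x.mean(), atol=1e-6)
```

This only confirms the arithmetic in full precision; it says nothing about how the same graph behaves in FP16 on the device.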
However, I've noticed that avg_pool has a lot of problems on the Movidius, especially if the input is too big:
- If pooling over the full height and width, averaging over 16384 numbers at once returns 0
- If pooling over less than the full height/width and the kernel covers more than ~700 numbers, it crashes
- If averaging over more than 7000 numbers at once, there is a noticeable loss in accuracy from using FP16 compared to FP32
- If the sum of the numbers in the pool is too high, it returns inf (I think 65504 is the max for FP16)
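The FP16 limits behind the last two points can be reproduced on a host machine. The sketch below uses NumPy's `float16` and a naive sequential accumulator; how the Movidius hardware actually accumulates is not stated in this thread, so the accumulation loop is an assumption that only illustrates the general effect.

```python
import numpy as np

fp16 = np.finfo(np.float16)
fp16_max = float(fp16.max)  # largest finite FP16 value

# Overflow: a value that rounds above the largest finite FP16 value becomes inf.
with np.errstate(over="ignore"):
    overflowed = np.float16(70000.0)

# Precision: FP16 has an 11-bit significand, so not every integer above 2048
# is representable. A naive running sum of ones (assumed accumulator, not the
# device's documented behavior) stalls once it reaches 2048.
acc = np.float16(0.0)
for _ in range(7000):
    acc = np.float16(acc + np.float16(1.0))

print(fp16_max, overflowed, acc)
```

The stalled sum shows why averaging thousands of values can lose accuracy when intermediate results are kept in FP16, even before anything overflows.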
Because of these restrictions, I have to do multiple pools to reduce the height and width:
input: [1, 2000, 100, 64]
avg_pool with kernel/stride: [1,500,1,1] -> [1,4,100,64]
avg_pool with kernel: [1,4,100,1] -> [1,1,1,64]
conv2d with kernel np.ones([1,1,64,1]) -> [1,1,1,1]
divide by 64 to get the mean of the input
This gets the right answer, but it is really slow. It takes 400+ ms to run on the Movidius, which is several orders of magnitude slower than a simple reduce mean should be. My question is: is there a better way to do reduce mean, and if not, is there a more efficient way to do this kind of pooling?
Here is a code snippet for reference:
import numpy as np
import tensorflow as tf

inputs = tf.placeholder('float32', [1,2000,100,64])
p1 = tf.nn.avg_pool(inputs, [1, 500, 1, 1], [1, 500, 1, 1], 'VALID')  # [1,2000,100,64] -> [1,4,100,64]
p2 = tf.nn.avg_pool(p1, [1, 4, 100, 1], [1, 1, 1, 1], 'VALID')  # [1,4,100,64] -> [1,1,1,64]
c1 = tf.nn.conv2d(p2, np.ones([1,1,64,1]), [1,1,1,1], 'VALID')  # [1,1,1,64] -> [1,1,1,1]
output = c1/64
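The staged pooling is exact in full precision because every pool covers equally sized, non-overlapping blocks, so the mean of the block means equals the overall mean. Here is a minimal NumPy sketch of that reasoning on a scaled-down tensor; `avg_pool_valid` is a hypothetical helper written for this example, not a TensorFlow or OpenVINO function.

```python
import numpy as np

def avg_pool_valid(x, kh, kw):
    """Non-overlapping average pool (stride == kernel) over an NHWC array."""
    n, h, w, c = x.shape
    assert h % kh == 0 and w % kw == 0, "kernel must tile the input exactly"
    return x.reshape(n, h // kh, kh, w // kw, kw, c).mean(axis=(2, 4))

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 20, 10, 4))  # scaled-down stand-in for [1, 2000, 100, 64]

p1 = avg_pool_valid(x, 5, 1)    # [1, 20, 10, 4] -> [1, 4, 10, 4]
p2 = avg_pool_valid(p1, 4, 10)  # [1, 4, 10, 4]  -> [1, 1, 1, 4]

# Equivalent of the ones-kernel conv followed by the division by C
staged_mean = p2.sum() / x.shape[-1]

assert np.isclose(staged_mean, x.mean())
```

If the first kernel did not divide the height evenly, 'VALID' pooling would drop the remainder rows and the staged result would no longer match the true mean.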
Hello Sheen. Have you heard of NCSDK? It's an alternative to OpenVINO for Movidius. To install it, kindly follow the instructions in the following link:
Here is the portal to Movidius stuff:
As you can see from the below link, the average pooling layer is definitely supported in NCSDK. Can you kindly try your experiment and see if it runs faster on Movidius using NCSDK? Please report the results in your reply once you're done. Go to http://developer.movidius.com and click on NCS Forum at the bottom. You will have to create an account if you haven't done so already.
Thank you for using OpenVINO!
I actually started using NCSDK (v2.05) but I switched to OpenVINO because most of the rest of my model was not supported on NCSDK. NCSDK also has the same problems with pooling. Even though it says average pooling is supported, if HxW > 700, it crashes.
Sheen, thank you for your patience.
It seems that your issue is a bug. However, a workaround has been suggested for you:
"use FullyConnected layer with weights set to 1/(W*H) and couple of reshapes/permutes to get correct channel ordering."
In effect, you implement reduce mean by folding the 1/(W*H) factor into the weights of a FullyConnected layer. Does this make sense to you? In the meantime, I will file a bug ticket on your behalf.
Thank you for using OpenVINO!
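One way to read the quoted workaround is sketched below in NumPy: permute NHWC to NCHW, flatten each channel into a row, and multiply by a single-output weight vector filled with 1/(W*H). The exact reshapes and the small shape are assumptions made for illustration; the thread does not spell them out.

```python
import numpy as np

N, H, W, C = 1, 8, 6, 4  # small stand-in for [1, 2000, 100, 64]
rng = np.random.default_rng(0)
x = rng.standard_normal((N, H, W, C))

# Permute NHWC -> NCHW, then flatten each channel's H*W values into one row.
rows = x.transpose(0, 3, 1, 2).reshape(C, H * W)  # [C, H*W]

# FullyConnected weights: one output unit, every weight equal to 1/(H*W).
weights = np.full((H * W, 1), 1.0 / (H * W))      # [H*W, 1]

per_channel_mean = rows @ weights                 # [C, 1]
overall_mean = per_channel_mean.mean()            # average the C channel means

assert np.allclose(per_channel_mean[:, 0], x.mean(axis=(0, 1, 2)))
assert np.isclose(overall_mean, x.mean())
```

For the real tensor, 1/(W*H) is 1/200000 = 5e-6, which is below the smallest normal FP16 value (about 6.1e-5), so the weights may need to be split into two scale factors on an FP16 device.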
Thanks for the suggestion Shubha. I tried 3 experiments using tf.matmul (which is the TF layer that corresponds to FullyConnected), but none of them performed better:
#Test 1
inputs = tf.placeholder('float32', [1,2000,100,64])
flat = tf.reshape(inputs, [1,-1])
fully_connected_tensor = tf.constant(np.ones([12800000,1])/12800000,dtype='float32')
output = tf.matmul(flat,fully_connected_tensor)
This returns 0, which is incorrect. I think this is because 1/12800000 (about 7.8e-8) is far below the smallest normal FP16 value (about 6.1e-5), so the weight likely underflows to zero.
#Test 2
inputs = tf.placeholder('float32', [1,2000,100,64])
flat = tf.reshape(inputs, [1,-1])
fully_connected_tensor = tf.constant(np.ones([12800000,1]),dtype='float32')
output = tf.matmul(flat,fully_connected_tensor)/12800000
This returns inf, probably because the sum exceeds the FP16 maximum (65504) before the division is applied.
#Test 3
inputs = tf.placeholder('float32', [1,2000,100,64])
flat = tf.reshape(inputs, [1,-1])
fully_connected_tensor = tf.constant(np.ones([12800000,1])/1280,dtype='float32')
output = tf.matmul(flat,fully_connected_tensor)/10000
By separating the division into two steps, this returns the correct answer. However, it takes 500+ms to run, which is slower than using the multiple pools method in the OP.
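The difference between Test 1 and Test 3 can be checked on the host by rounding the scale factors to FP16. This is a minimal NumPy sketch; NumPy keeps subnormal values, whereas the device may flush them to zero (an assumption that would match the 0 seen in Test 1 but is not confirmed in this thread).

```python
import numpy as np

n = 2000 * 100 * 64  # 12,800,000 elements
smallest_normal = float(np.finfo(np.float16).tiny)  # 2**-14, about 6.1e-5

one_step = np.float16(1.0 / n)     # Test 1 weight, about 7.8e-8 before rounding
stage_a = np.float16(1.0 / 1280)   # Test 3 weight
stage_b = np.float16(1.0 / 10000)  # Test 3 post-matmul scale

# Relative rounding error of each factor after conversion to FP16
rel_err_one_step = abs(float(one_step) - 1.0 / n) / (1.0 / n)
rel_err_stage_a = abs(float(stage_a) - 1.0 / 1280) / (1.0 / 1280)

print(float(one_step) < smallest_normal, rel_err_one_step)
print(float(stage_a) > smallest_normal, rel_err_stage_a)
```

The single factor lands in the subnormal range with a large relative error, while both factors of the two-step split stay in the normal range, which is consistent with Test 3 returning the correct answer.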
Let me know if you have any other suggestions, and I can test them out.
Intel / Movidius team ~
Why can't we just get a bare implementation of reduce mean?
That way we don't have to resort to tricks like convolutions or fully connected layers with ones.