Intel Community > Software Development SDKs and Libraries > Intel® Distribution of OpenVINO™ Toolkit


Kao__Sheen

Beginner


10-25-2018
05:10 PM

91 Views

Best way to do reduce mean on Movidius using OpenVINO

Hey there,

I'm trying to convert a TensorFlow model to run on a Movidius NCS, and I need to do a reduce mean on a tensor of size [1, 2000, 100, 64]. According to the supported layers page (https://software.intel.com/en-us/articles/OpenVINO-Using-TensorFlow#inpage-nav-10), mean isn't supported, but average pooling across spatial dimensions is. The best way I could think of to calculate the mean of the entire input is to pool over the whole height and width, then do a convolution with a kernel of ones:

input: [1,H,W,C]

avg_pool with kernel: [1,H,W,1] -> [1,1,1,C]

conv2d with kernel: np.ones([1,1,C,1]) -> [1,1,1,1]

divide by C to get the mean of the input
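As a sanity check, the pool-then-sum-then-divide plan above can be verified against a direct mean in NumPy (a sketch with small illustrative shapes, not Movidius code):

```python
import numpy as np

# Illustrative shapes; the real input in this thread is [1, 2000, 100, 64].
H, W, C = 8, 5, 4
np.random.seed(0)
x = np.random.rand(1, H, W, C).astype(np.float32)

# Step 1: average-pool over the full spatial extent -> [1, 1, 1, C]
pooled = x.mean(axis=(1, 2), keepdims=True)

# Step 2: a 1x1 conv with a kernel of ones is just a channel sum -> [1, 1, 1, 1]
summed = pooled.sum(axis=3, keepdims=True)

# Step 3: divide by C to get the mean of the whole tensor
mean_via_pool = summed / C

assert np.allclose(mean_via_pool, x.mean())
```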

However, I've noticed that avg_pool has a lot of problems on the Movidius, especially if the input is too big:

- if pooling over the full height and width, averaging over 16384 numbers at once returns 0
- if pooling over less than the full height/width and the kernel has more than ~700 numbers, it crashes
- if averaging over more than 7000 numbers at once, there is a noticeable loss in accuracy from using FP16 compared to FP32
- if the sum of the numbers in the pool is too high, it returns inf (65504 is the maximum finite FP16 value)
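Those FP16 limits are easy to reproduce in NumPy half precision (the 65504 maximum, and the accuracy loss when summing many values):

```python
import numpy as np

# Anything past the FP16 maximum finite value (65504) overflows to inf.
assert float(np.float16(65504.0)) == 65504.0
assert np.isinf(np.float16(70000.0))

# FP16 has only a 10-bit significand, so large running sums absorb small
# addends: 2048 + 1 is still 2048 in half precision (the ulp there is 2).
s = np.float16(2048.0) + np.float16(1.0)
assert float(s) == 2048.0
```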

Because of these restrictions, I have to do multiple pools to reduce the height and width:

input: [1, 2000, 100, 64]

avg_pool with kernel/stride: [1,500,1,1] -> [1,4,100,64]

avg_pool with kernel: [1,4,100,1] -> [1,1,1,64]

conv2d with kernel np.ones([1,1,64,1]) -> [1,1,1,1]

divide by 64 to get the mean of the input

This gets the right answer, but it is really slow: it takes 400+ ms to run on the Movidius, which is several orders of magnitude slower than a simple reduce mean should take. My question is: is there a better way to do reduce mean, and if not, is there a more efficient way to do this kind of pooling?

Here is a code snippet for reference:

```python
inputs = tf.placeholder('float32', [1, 2000, 100, 64])
p1 = tf.nn.avg_pool(inputs, [1, 500, 1, 1], [1, 500, 1, 1], 'VALID')  # [1,2000,100,64] -> [1,4,100,64]
p2 = tf.nn.avg_pool(p1, [1, 4, 100, 1], [1, 1, 1, 1], 'VALID')        # [1,4,100,64] -> [1,1,1,64]
# np.ones defaults to float64; the conv filter must match the float32 input
c1 = tf.nn.conv2d(p2, np.ones([1, 1, 64, 1], dtype=np.float32), [1, 1, 1, 1], 'VALID')  # [1,1,1,64] -> [1,1,1,1]
output = c1 / 64
```


6 Replies

Shubha_R_Intel

Employee


10-30-2018
11:19 AM


Hello Sheen. Have you heard of NCSDK? It's an alternative to OpenVINO for Movidius. To install it, kindly follow the instructions at the following link:

https://movidius.github.io/ncsdk/install.html

Here is the portal to Movidius stuff:

https://movidius.github.io/ncsdk/

As you can see from the link below, the average pooling layer is definitely supported in NCSDK. Can you kindly try your experiment and see if it runs faster on Movidius using NCSDK? Please report the results in your reply once you're done. Go to http://developer.movidius.com and click on NCS Forum at the bottom. You will have to create an account if you haven't done so already.

https://ncsforum.movidius.com/discussion/428/supported-layers

Thank you for using OpenVINO!

Shubha

Kao__Sheen

Beginner


10-30-2018
02:26 PM


Hello Shubha,

I actually started using NCSDK (v2.05), but I switched to OpenVINO because most of the rest of my model was not supported on NCSDK. NCSDK also has the same problems with pooling: even though it says average pooling is supported, it crashes if the kernel's HxW > 700.

Shubha_R_Intel

Employee


10-30-2018
04:41 PM


Sheen, thank you for answering. Please allow me some time to investigate this further.

Shubha_R_Intel

Employee


10-31-2018
04:19 AM


Sheen, thank you for your patience.

It seems that your issue is a bug. However, a workaround has been provided for you:

"use FullyConnected layer with weights set to 1/(W*H) and couple of reshapes/permutes to get correct channel ordering."

So in effect, you materialize reduce mean by folding the 1/(W*H) factor of the equation into the weights of a FullyConnected layer. Does this make sense? In the meantime, I will file a bug ticket on your behalf.

Thank you for using OpenVINO!

Shubha

Kao__Sheen

Beginner


10-31-2018
03:37 PM


Thanks for the suggestion, Shubha. I tried 3 experiments using tf.matmul (which is the TF op that corresponds to FullyConnected), but none of them performed better:

```python
# Test 1
inputs = tf.placeholder('float32', [1, 2000, 100, 64])
flat = tf.reshape(inputs, [1, -1])
fully_connected_tensor = tf.constant(np.ones([12800000, 1]) / 12800000, dtype='float32')
output = tf.matmul(flat, fully_connected_tensor)
```

This returns 0, which is incorrect. I think this is because the value 1/12800000 is below FP16's normal range, so it underflows to (effectively) zero.

```python
# Test 2
inputs = tf.placeholder('float32', [1, 2000, 100, 64])
flat = tf.reshape(inputs, [1, -1])
fully_connected_tensor = tf.constant(np.ones([12800000, 1]), dtype='float32')
output = tf.matmul(flat, fully_connected_tensor) / 12800000
```

This returns inf. This is probably because, before the division, the sum exceeds the maximum FP16 value (65504).

```python
# Test 3
inputs = tf.placeholder('float32', [1, 2000, 100, 64])
flat = tf.reshape(inputs, [1, -1])
fully_connected_tensor = tf.constant(np.ones([12800000, 1]) / 1280, dtype='float32')
output = tf.matmul(flat, fully_connected_tensor) / 10000
```

By separating the division into two steps, this returns the correct answer. However, it takes 500+ms to run, which is slower than using the multiple pools method in the OP.
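For what it's worth, the behavior of the three tests is consistent with plain FP16 arithmetic (a NumPy sketch; what the Movidius does internally is an assumption here):

```python
import numpy as np

# Test 1: 1/12800000 is far below FP16's normal range (~6.1e-5), so it
# rounds to a subnormal with a large relative error (or is flushed to
# zero on hardware), which would explain the output of 0.
w = 1.0 / 12800000.0
rel_err = abs(float(np.float16(w)) - w) / w
assert rel_err > 0.1

# Test 2: before the division, the accumulated sum can exceed 65504;
# e.g. if every element were 1.0, the sum would be 12.8 million -> inf.
assert np.isinf(np.float16(12800000.0))

# Test 3: both partial factors sit comfortably inside FP16's normal
# range, so neither the weights nor the intermediate result collapse.
assert float(np.float16(1.0 / 1280.0)) > 0.0
assert float(np.float16(1.0 / 10000.0)) > 0.0
```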

Let me know if you have any other suggestions, and I can test them out.

Luca1

Beginner


11-01-2018
01:31 PM


Intel / Movidius team ~

why can't we just get a bare implementation of reduce-mean?

https://www.tensorflow.org/api_docs/python/tf/reduce_mean

that way we don't have to resort to tricks like convolutions/FC layers with ones?

thank you

.luca
